Monday, May 25, 2026
mGrowTech

AI agents are quietly generating chaos engineering failures enterprises don’t track yet

by Josh
May 25, 2026
in Technology And Software

There is a category of production incident that engineering teams are not tracking yet — because it doesn't fit any existing postmortem template.

The agent initiated an action. The action was technically correct given the agent's context. The context was incomplete. The infrastructure cascaded. And by the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure, because the frameworks for thinking about these two things have never been connected.

The scale of this exposure is no longer theoretical. Seventy-nine percent of organizations now have some form of AI agent in production, with 96% planning expansion. Gartner predicts 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls.

What neither statistic captures is the failure mode happening between those two numbers: Agents that are running, that are not canceled, and that are quietly generating infrastructure events no one has categorized as risk.

I've spent six years building infrastructure automation systems at enterprise scale, first at Cisco (leading AI-driven lifecycle platforms deployed across 20-plus global enterprise customers), then at Splunk (designing AI-assisted root cause analysis and observability workflows across thousands of enterprise environments).

During that time I also filed a patent on intent-based chaos engineering methodology. And across all of it, I kept watching organizations make the same structural mistake: Treating autonomous agents and chaos engineering as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.

The judgment call that agents skip

To understand why this matters, you need to understand what's actually broken in how enterprises govern chaos today, before you add agents to the picture.

Most mature engineering organizations have invested in chaos engineering programs. Game days, blast radius controls, SLO-gated experiments. When a human engineer initiates a chaos experiment, the sequence has a critical property: A human is making a judgment call about whether the system has capacity to absorb the perturbation right now. They check dashboards. They look at the error budget burn rate. They assess whether dependencies are stable. It's imperfect and often intuitive, but there is at least a person in the loop asking the right question before anything runs.

When you introduce an autonomous remediation agent (one that can restart services, reroute traffic, scale resources, or modify configurations in response to detected anomalies), that question disappears. The agent sees an anomaly. The agent takes an action. The action is a chaos event. No SLO burn rate check. No blast radius calculation. No human judgment about whether right now is the right moment to introduce additional stress into a system that may already be under pressure from three other directions.

Here is the specific failure mode I have watched play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster, a reasonable action given its training data and its narrow view of the incident. What the agent doesn't know: Three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.

What started as a latency spike the agent was designed to fix becomes a cascade the agent was never designed to model. The blast radius of that agent action was not the service restart. It was everything downstream of the restart, in a system state the agent had no complete picture of.

Nobody's chaos engineering program had tested for that specific combination. Nobody's blast radius calculation had included the agent as an actor. Because we don't think of agents as chaos injectors. We should.

According to the AI Incidents Database, reported AI-related incidents rose 21% from 2024 to 2025. That count almost certainly understates the actual exposure, because most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The incident gets logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.

Absorb capacity is a resource; most systems don't treat it that way

The underlying problem is that enterprise systems have no shared language for absorb capacity — the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. Chaos engineering programs manage it implicitly, through human judgment and static thresholds that fire after a limit has already been crossed. Agents don't manage it at all.

Through structured primary research with site reliability engineering (SRE) and platform engineering practitioners across organizations including Intuit and GPTZero, I've been developing a resilience budget model. The core idea is to treat absorb capacity as a continuously recomputed, consumable resource rather than a static threshold you try not to breach.

A resilience budget draws on four live signal classes.

  • SLO burn rate is the primary input, because it directly encodes the distance between current system behavior and the commitment that actually matters. If a system is burning its monthly error budget at five times the expected rate, the resilience budget is near zero regardless of what CPU utilization looks like.

  • P99 latency trend matters more than absolute latency, because a service trending upward over forty minutes tells you something different than a service that has been stable at the same absolute value.

  • Dependency saturation state is the most commonly missed signal; a chaos experiment or an agent action that assumes a shared connection pool is freely available when it's sitting at 87% will produce failure modes that nobody designed for.

  • Application behavioral signals (session completion rates, API call pattern shifts, conversion degradation) surface system stress earlier than infrastructure metrics do, because users feel the degradation before Prometheus reports it.
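
The four signal classes above can be folded into a single score in many ways; the article does not give a formula. What follows is a minimal sketch under stated assumptions: the names `Signals` and `resilience_budget`, the ceilings `max_burn` and `max_slope_ms`, and the choice to let the weakest signal set the budget are all illustrative, not the author's model.

```python
from dataclasses import dataclass
from typing import Sequence


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule.
    """
    allowed = 1.0 - slo_target
    return error_ratio / allowed


def trend_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


@dataclass
class Signals:
    error_ratio: float            # fraction of failed requests in the window
    slo_target: float             # for example 0.999
    p99_ms: Sequence[float]       # recent P99 samples, oldest first
    dependency_saturation: float  # 0.0 to 1.0, worst shared dependency
    session_completion: float     # 0.0 to 1.0, application behavioral signal


def resilience_budget(s: Signals, max_burn: float = 5.0,
                      max_slope_ms: float = 10.0) -> float:
    """Return a score in [0, 1]. Assumption: the weakest signal sets the budget."""
    burn_headroom = 1.0 - min(burn_rate(s.error_ratio, s.slo_target) / max_burn, 1.0)
    rising = max(trend_slope(s.p99_ms), 0.0)  # only an upward trend costs budget
    trend_headroom = 1.0 - min(rising / max_slope_ms, 1.0)
    dependency_headroom = 1.0 - s.dependency_saturation
    return max(0.0, min(burn_headroom, trend_headroom,
                        dependency_headroom, s.session_completion))
```

With `max_burn` set to 5, a service burning its error budget at five times the allowed rate scores near zero even when latency, dependencies, and sessions all look healthy, which matches the first bullet. Taking the minimum rather than a weighted average is a deliberate assumption here: one saturated dependency should not be averaged away by three healthy signals.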

What makes this a budget rather than a threshold is that it is consumable. Every chaos experiment draws from the available capacity. Every agent action draws from it. In multi-team organizations where multiple experiments and multiple agents may be acting simultaneously, the budget is shared.

Without a shared ledger of consumption, two teams running experiments against overlapping dependencies produce a combined blast radius that neither team planned. Add autonomous agents acting completely outside the ledger, and the accounting collapses.
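
To make the shared-ledger idea concrete, here is a small sketch. The class names `Draw` and `BudgetLedger`, the per-dependency capacities, and the cost estimates are hypothetical; a real ledger would also need persistence and locking, which this in-memory version omits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Draw:
    actor: str              # a team's experiment or an agent
    dependencies: Set[str]  # shared dependencies the action touches
    cost: float             # estimated share of budget consumed, 0 to 1


@dataclass
class BudgetLedger:
    capacity: Dict[str, float]  # available budget per shared dependency
    draws: List[Draw] = field(default_factory=list)

    def committed(self, dependency: str) -> float:
        """Total budget already drawn against one dependency."""
        return sum(d.cost for d in self.draws if dependency in d.dependencies)

    def try_reserve(self, draw: Draw) -> bool:
        """Record the draw only if every touched dependency can absorb it."""
        for dep in draw.dependencies:
            if self.committed(dep) + draw.cost > self.capacity.get(dep, 0.0):
                return False
        self.draws.append(draw)
        return True

    def release(self, actor: str) -> None:
        """Return an actor's draws to the pool once its action has settled."""
        self.draws = [d for d in self.draws if d.actor != actor]
```

Because experiments and agent actions reserve against the same dependencies, a second actor that would push an overlapping dependency past its capacity is refused up front instead of discovering the combined blast radius in production.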

Where language models help, and exactly where they fail

Several engineering organizations are now running experiments using large language models (LLMs) to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are directionally useful. Language models surface plausible failure modes that experienced SREs recognize as worth testing, and they generate hypotheses faster than manual processes, particularly when working from rich postmortem history.

The limit is dependency graph staleness, and it is a hard limit. A hypothesis generated from a graph that doesn't reflect last month's service extraction, or a new shared library dependency added two sprints ago, will propose an experiment with incorrect blast radius assumptions. The problem is not that the model makes a mistake; it's that the model doesn't know it's making one. It will be confidently incorrect about a system boundary that no longer exists, and in chaos engineering, confident incorrectness in production means an unplanned outage.
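
One way to keep a stale graph from feeding hypothesis generation is to check its age against known topology changes before using it. This is a sketch only: the function name `graph_is_stale`, the seven-day `max_age` default, and the assumption that topology changes (service extractions, new shared dependencies) are available as timestamps are all illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def graph_is_stale(graph_built_at: datetime,
                   topology_changes: Iterable[datetime],
                   now: Optional[datetime] = None,
                   max_age: timedelta = timedelta(days=7)) -> bool:
    """Return True if the graph is too old or predates a known topology change."""
    now = now or datetime.now(timezone.utc)
    if now - graph_built_at > max_age:
        return True
    # Any change recorded after the snapshot means the graph misses it.
    return any(change > graph_built_at for change in topology_changes)
```

A check like this only catches changes that were recorded somewhere. It does not solve the underlying problem the paragraph describes, because the model still cannot detect a boundary change nobody logged.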

Stanford's Trustworthy AI Research Lab found that model-level guardrails alone are insufficient: Fine-tuning attacks bypassed leading models in the majority of tested cases. The implication for chaos hypothesis generation is direct: A model that cannot reliably hold its own safety boundaries cannot be trusted to accurately model the blast radius of an action it has never seen in a dependency graph it has not verified.

When hypothesis generation draws instead from postmortem corpora, the staleness problem shrinks considerably. Postmortems describe failures that actually occurred in the system at a specific moment in time. The signal is inherently validated by production reality. This is the tractable near-term AI application in this space, and it is genuinely useful for organizations with mature incident documentation practices.

What AI cannot do, and should not be asked to do, is make the execution decision when signals are ambiguous. That judgment requires awareness of things that live entirely outside any monitoring system: Pending deployments that changed the dependency landscape an hour ago, on-call staffing levels on a holiday weekend, a customer commitment that makes any additional risk unacceptable until Monday.

A model without access to that context should not be making that call. This is not a temporary limitation pending a more capable model. It is a structural constraint of what machine observability can represent, and building an agent architecture that ignores it is building one that will eventually make a consequential decision with incomplete information — and no human in the loop to catch it.

What this means for how enterprises govern agents in production

The governance implication is straightforward to describe and harder to implement than it sounds. Every autonomous agent action that touches infrastructure needs to register against the same live signal layer that governs chaos experiments. The same SLO burn rates, latency trends, and dependency saturation states that a human engineer would check before initiating an experiment should gate what an agent is permitted to do and when. If the resilience budget is below a defined floor, the agent waits or escalates. It does not act.
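
The floor rule can be sketched as a small gate that every agent action passes through. The names `Decision` and `gate_agent_action`, and the idea of escalating after a maximum wait, are assumptions for illustration; the article specifies only that the agent waits or escalates when the budget is below the floor.

```python
from enum import Enum


class Decision(Enum):
    ACT = "act"
    WAIT = "wait"
    ESCALATE = "escalate"


def gate_agent_action(budget: float, floor: float, waited_seconds: float,
                      max_wait_seconds: float = 300.0) -> Decision:
    """Decide whether an agent may act, given the current resilience budget."""
    if budget >= floor:
        return Decision.ACT
    # Below the floor the agent never acts; it waits, then hands off.
    if waited_seconds < max_wait_seconds:
        return Decision.WAIT
    return Decision.ESCALATE
```

For example, `gate_agent_action(budget=0.15, floor=0.3, waited_seconds=0)` returns `Decision.WAIT`, and the same call after the wait limit returns `Decision.ESCALATE`.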

Agent actions also need to be modeled as experiments, not just logged as events. When an agent restarts a service, the question isn't only whether the restart completed successfully. It's whether the blast radius of that action was proportionate to the available absorb capacity, and what cascading effects it produced across dependencies. That is chaos engineering data. It belongs in the budget model, feeding the next decision the agent or the team needs to make.
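
Recording an agent action as an experiment rather than a bare event might look like the sketch below. The record type `AgentActionExperiment` and its fields are hypothetical, and the 0.05 saturation threshold is an arbitrary illustrative value.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentActionExperiment:
    agent: str
    action: str           # for example "restart"
    target: str           # service the action was applied to
    budget_before: float  # resilience budget when the action started
    budget_after: float   # resilience budget once the system settled
    completed: bool       # did the action itself succeed
    # Observed change in saturation per dependency after the action.
    dependency_deltas: Dict[str, float] = field(default_factory=dict)

    def budget_consumed(self) -> float:
        """Budget the action drew, never negative."""
        return max(self.budget_before - self.budget_after, 0.0)

    def affected_dependencies(self, threshold: float = 0.05) -> List[str]:
        """Dependencies whose saturation rose by at least the threshold."""
        return sorted(name for name, delta in self.dependency_deltas.items()
                      if delta >= threshold)

    def proportionate(self, floor: float) -> bool:
        """True if the system stayed at or above the floor after the action."""
        return self.budget_after >= floor
```

A restart that completes successfully but leaves the budget below the floor shows up here as `completed=True` with `proportionate(floor)` returning `False`, which is the distinction a plain event log loses.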

And when signals are genuinely ambiguous (when the budget score is unclear, when a recent deployment has changed the topology in ways the agent's context window doesn't capture, when dependency states are in flux), the execution decision needs to go to a human. Not as a permanent limitation on agent autonomy, but as a hard engineering requirement for the current state of the technology.
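
The ambiguity conditions listed above can be written as explicit checks that route the decision to a person. Everything in this sketch is an assumption made for illustration: the `Context` fields, the margin around the floor, and the settle time after a deployment are placeholders, not values from the article.

```python
from dataclasses import dataclass


@dataclass
class Context:
    budget: float                     # current resilience budget, 0 to 1
    floor: float                      # minimum budget required to act
    signal_age_seconds: float         # age of the newest signal sample
    seconds_since_last_deploy: float  # time since topology last changed
    dependencies_in_flux: bool        # any dependency mid-change


def is_ambiguous(ctx: Context, margin: float = 0.1,
                 max_signal_age: float = 60.0,
                 deploy_settle_seconds: float = 3600.0) -> bool:
    """True when the signals do not support a confident automated decision."""
    near_floor = abs(ctx.budget - ctx.floor) < margin
    stale_signals = ctx.signal_age_seconds > max_signal_age
    recent_deploy = ctx.seconds_since_last_deploy < deploy_settle_seconds
    return near_floor or stale_signals or recent_deploy or ctx.dependencies_in_flux


def route(ctx: Context) -> str:
    """Return who decides: 'human', 'agent', or 'hold' (agent waits)."""
    if is_ambiguous(ctx):
        return "human"
    return "agent" if ctx.budget >= ctx.floor else "hold"
```

Checks like these cover only ambiguity that monitoring can see. Context that lives outside any monitoring system, such as staffing or customer commitments, still has to come from the human the decision is routed to.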

A circuit breaker that hands ambiguous cases to a human is not a weakness in the agent architecture. It is the thing that makes the architecture trustworthy enough to actually run in production. Intent-based verification formalizes exactly this: Defining what correct agent behavior looks like before deployment, then continuously probing whether those boundaries hold under live system conditions.

The organizations that operate autonomous agents reliably at scale are not the ones with the most sophisticated models. They are the ones that understood, before something went badly wrong, that every agent action is a chaos event and built their governance layer accordingly.

The practical first step is unglamorous: Audit every autonomous agent currently touching infrastructure, map its action surface against your live SLO burn rate signals, and define explicit floor conditions below which the agent is required to wait or escalate. That audit will surface agents acting entirely outside your resilience accounting.
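
The audit itself can start as a simple inventory check. This sketch assumes you can list each agent's infrastructure actions, the SLO signals it reads, and its floor condition if one exists; the `AgentProfile` type and the wording of the findings are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class AgentProfile:
    name: str
    actions: Set[str]       # infrastructure actions the agent can take
    slo_signals: Set[str]   # SLO burn rate signals the agent consults
    floor: Optional[float]  # budget floor below which it must wait or escalate


def audit(agents: Iterable[AgentProfile]) -> List[str]:
    """List agents that act on infrastructure outside resilience accounting."""
    findings: List[str] = []
    for agent in agents:
        if not agent.actions:
            continue  # read-only agents are out of scope for this audit
        if not agent.slo_signals:
            findings.append(f"{agent.name}: acts without reading any SLO burn rate signal")
        if agent.floor is None:
            findings.append(f"{agent.name}: no floor condition for wait or escalate")
    return findings
```

Running it over a full agent inventory produces the list of agents that need a floor condition or a signal mapping before they act again.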

Most organizations running agents at scale today have several. Find them before production does.

Sayali Patil has spent 6-plus years at Cisco Systems and Splunk building the reliability and automation systems that keep enterprise AI infrastructure running at scale.


