Why Agentic AI Needs Systems Engineering Discipline

April 2026

The demo is always impressive. An LLM-powered agent browses the web, writes code, calls APIs, reasons through a multi-step plan, and produces a result that looks like magic. Then you try to run it in production, on real data, with real stakes, and it falls apart in ways that are genuinely hard to diagnose.

This is not a criticism of the underlying models. Foundation models have gotten remarkably capable. The problem is in how we compose them. When you chain LLM calls across long-horizon tasks—when you build systems of agents that coordinate, delegate, and act on each other's outputs—you enter a regime where traditional software engineering is necessary but insufficient, and where ML research alone does not prepare you for what goes wrong.

I have spent the last several years building multi-agent systems for operational technology domains: environments where software meets physical infrastructure, where decisions have real-world consequences, and where "retry the request" is not an acceptable failure strategy. The single most important lesson from that work is this: agentic AI needs systems engineering discipline, and it needs it now, before the patterns calcify.

Agents Fail Differently Than Models

A single LLM call has well-understood failure modes. The model hallucinates, it refuses, it misinterprets the prompt. You can mitigate these with better prompting, retrieval augmentation, fine-tuning, or output validation. The failure is local and containable.

Agents fail differently. When you chain calls together—when one agent's output becomes another agent's input, when tool results feed back into planning loops, when context accumulates across dozens of steps—the failure modes compound in ways that are not additive but multiplicative.

Context degradation is the quietest killer. Over a long-horizon task, the agent's working context accumulates noise: irrelevant tool outputs, earlier reasoning that is no longer applicable, partial results from abandoned subplans. The model does not forget this context; it drowns in it. Decision quality degrades gradually, and by the time the output is visibly wrong, the root cause is buried thirty steps back.

Cascading errors are the most dangerous. Agent A misclassifies an input. Agent B, trusting that classification, makes a downstream decision. Agent C acts on that decision. By the time a human notices, three layers of confident-sounding reasoning have been built on a faulty foundation. Unlike a software bug that throws an exception, this failure mode produces plausible-looking output. It passes cursory review. It looks like it worked.

Loss of coherence is the most subtle. In long-running agent workflows, the system can lose its sense of what it was originally trying to accomplish. Goal drift happens when intermediate results subtly reframe the problem. The agent is still "working"—it is still calling tools, still producing output—but it has wandered off the path in a way that is obvious to a human observer and invisible to the agent itself.

These are not edge cases. In my experience, they are the dominant failure modes of production agent systems. And they are not problems you solve with a better model. They are architectural problems.

What Systems Engineering Brings

Safety-critical physical systems—avionics, nuclear controls, industrial automation—have been dealing with analogous challenges for decades. Complex systems composed of imperfect components, operating in uncertain environments, where failure has real consequences. The discipline of systems engineering exists precisely because you cannot build these systems from the bottom up and hope they work.

Structured decomposition means breaking a complex agent workflow into well-defined stages with explicit interfaces. Each stage has a clear input contract and output contract. You know what each component is responsible for, and you can reason about its behavior in isolation. This is not microservices architecture applied naively to agents; it is the recognition that composability requires boundaries.
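To make the idea of explicit stage boundaries concrete, here is a minimal sketch in Python. Everything in it is illustrative: the `Classification` and `PlanStep` contracts, the stage names, and the pipeline shape are assumptions for this example, not any particular framework's API.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical typed contracts between stages. Each stage's output
# is a value of a declared type, not a blob of free-form text.
@dataclass(frozen=True)
class Classification:
    label: str
    confidence: float


@dataclass(frozen=True)
class PlanStep:
    action: str


class Classifier(Protocol):
    def classify(self, text: str) -> Classification: ...


class Planner(Protocol):
    def plan(self, classification: Classification) -> list[PlanStep]: ...


def run_pipeline(classifier: Classifier, planner: Planner, text: str) -> list[PlanStep]:
    """Compose stages only through their declared contracts."""
    classification = classifier.classify(text)
    # The boundary is explicit: the planner sees the typed Classification,
    # never the raw input or the classifier's internal reasoning.
    return planner.plan(classification)
```

Because each stage speaks only through a typed contract, you can test, swap, or audit it in isolation, which is exactly the composability-requires-boundaries point.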

Failure mode analysis means systematically asking "what happens when this component produces the wrong output?" for every component in the system. In traditional systems engineering, this is FMEA—Failure Mode and Effects Analysis. For agent systems, it means tracing the blast radius of every possible misclassification, hallucination, or tool failure through the entire workflow. It means knowing, before you deploy, which failures are recoverable and which are catastrophic.
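An FMEA worksheet for an agent pipeline can itself be data. The sketch below is a hypothetical worksheet for an imagined three-stage pipeline; the components, failure descriptions, severities, and mitigations are made up to show the shape of the exercise, not drawn from a real deployment.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    RECOVERABLE = "recoverable"      # retry or fall back in place
    DEGRADED = "degraded"            # output quality drops, task continues
    CATASTROPHIC = "catastrophic"    # bad output propagates downstream

@dataclass(frozen=True)
class FailureMode:
    component: str
    failure: str
    downstream_effect: str
    severity: Severity
    mitigation: str


# Illustrative FMEA rows for a hypothetical classifier -> retriever -> executor pipeline.
FMEA = [
    FailureMode("classifier", "misclassifies input",
                "planner builds an entire plan on a wrong premise",
                Severity.CATASTROPHIC,
                "require confidence threshold, else escalate to a human"),
    FailureMode("retriever", "returns stale documents",
                "plan references outdated world state",
                Severity.DEGRADED,
                "attach freshness timestamps; reject documents past a cutoff"),
    FailureMode("tool_executor", "API call times out",
                "step incomplete, plan stalls",
                Severity.RECOVERABLE,
                "retry with backoff, then fall back to manual step"),
]


def catastrophic_modes(worksheet: list[FailureMode]) -> list[FailureMode]:
    """The failures that must have a mitigation before deployment."""
    return [m for m in worksheet if m.severity is Severity.CATASTROPHIC]
```

The value is not the data structure; it is being forced to fill in the `downstream_effect` column for every row before the system ships.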

Governance by design means building oversight into the architecture, not bolting it on after the fact. Every consequential decision point in an agent workflow should have a defined governance policy: does a human need to approve this? Is there an automated check? What evidence is logged? This is not about slowing agents down. It is about making their autonomy legible and bounded.
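Here is one way the "defined governance policy per decision point" idea can look in code. The decision types, the thresholds, and the policy fields are all assumptions invented for this sketch; the point is that the policy is declared data that gates every consequential action, not scattered if-statements.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GovernancePolicy:
    requires_human_approval: bool
    automated_check: Callable[[dict], bool]
    evidence_fields: list[str]  # what must be logged for this decision type


# Hypothetical policy table; decision types and thresholds are illustrative.
POLICIES = {
    "send_notification": GovernancePolicy(
        requires_human_approval=False,
        automated_check=lambda d: True,
        evidence_fields=["recipient"],
    ),
    "modify_config": GovernancePolicy(
        requires_human_approval=True,
        automated_check=lambda d: d.get("confidence", 0.0) > 0.95,
        evidence_fields=["old_value", "new_value", "confidence"],
    ),
}


def gate_decision(decision_type: str, decision: dict, audit_log: list) -> str:
    """Every decision passes through its policy; evidence is logged first."""
    policy = POLICIES[decision_type]
    audit_log.append({field: decision.get(field) for field in policy.evidence_fields})
    if not policy.automated_check(decision):
        return "rejected"
    if policy.requires_human_approval:
        return "pending_human_approval"
    return "approved"
```

Note that the evidence is logged before the decision is gated, so rejected decisions are just as legible after the fact as approved ones.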

Observability means you can reconstruct, after the fact, exactly what the system did and why. Not just logging—structured traces that capture the full chain of reasoning, tool calls, intermediate results, and decision points. When an agent system produces a bad outcome, you need to be able to perform a root cause analysis with the same rigor you would apply to a post-incident review of a physical system failure.
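A structured trace can be as simple as an append-only list of typed events. This is a minimal sketch of the idea, assuming nothing beyond the standard library; the event kinds and payload shapes are placeholders for whatever your system actually records.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    step: int
    kind: str      # e.g. "reasoning", "tool_call", "decision" (illustrative kinds)
    payload: dict
    timestamp: float = field(default_factory=time.time)


class Trace:
    """Append-only structured trace, serializable for post-incident review."""

    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, kind: str, payload: dict) -> None:
        self.events.append(TraceEvent(step=len(self.events), kind=kind, payload=payload))

    def decisions(self) -> list[TraceEvent]:
        """The decision points, pulled out for review."""
        return [e for e in self.events if e.kind == "decision"]

    def to_json(self) -> str:
        """Full chain of reasoning, tool calls, and decisions as one artifact."""
        return json.dumps([asdict(e) for e in self.events])
```

The difference from plain logging is that every event carries its position in the chain, so a reviewer can replay the sequence rather than grep for fragments.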

Lessons from the Field

Building multi-agent systems for operational technology taught me a few things that I think generalize beyond that domain.

Every agent decision must be auditable. If you cannot explain why the system did what it did, you do not have an agent system—you have a liability. This means structured decision logs, not just chat transcripts. It means capturing the state of the world as the agent perceived it at the time of each decision, not just the decision itself.
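The "state of the world as the agent perceived it" point has a subtle implementation detail: the snapshot must be immune to later mutation of the live state. A small sketch, with invented record fields, of getting that right:

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    decision: str
    rationale: str
    perceived_state: dict  # the world as the agent saw it at decision time


class DecisionLog:
    """Structured decision log, not a chat transcript."""

    def __init__(self) -> None:
        self._records: list[DecisionRecord] = []

    def record(self, decision: str, rationale: str, world_state: dict) -> None:
        # Deep-copy the state so that later mutation of the live world
        # model cannot silently rewrite history.
        self._records.append(
            DecisionRecord(decision, rationale, copy.deepcopy(world_state))
        )

    def records(self) -> list[DecisionRecord]:
        return list(self._records)
```

Without the deep copy, every record would point at the same live dictionary, and the audit trail would always show the world as it is now, not as it was when the decision was made.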

Human override cannot break agent state. If a human intervenes—overrides a decision, corrects a classification, redirects a workflow—the system must incorporate that intervention cleanly. Too many agent architectures treat human input as an interruption rather than a first-class event. If overriding an agent decision leaves the system in an inconsistent state, your architecture is wrong.
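One architectural pattern that makes overrides first-class is event sourcing: both agent decisions and human overrides are appended to the same event stream, and state is always derived by replay. A minimal sketch under that assumption (the event fields are illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    AGENT_DECISION = "agent_decision"
    HUMAN_OVERRIDE = "human_override"


@dataclass(frozen=True)
class Event:
    kind: EventKind
    key: str    # which piece of state this event writes
    value: str


def current_state(events: list[Event]) -> dict:
    """Derive state by replaying the event stream; the latest write to a
    key wins, whether it came from the agent or from a human."""
    state: dict = {}
    for event in events:
        state[event.key] = (event.value, event.kind)
    return state
```

Because state is always a pure function of the event stream, a human override can never leave the system inconsistent: it is just another event, and every component that derives state sees it the same way.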

Design for graceful degradation. When a component fails or a model produces low-confidence output, the system should narrow its scope rather than halt entirely. This means defining fallback behaviors at every level of the architecture. A fully autonomous agent that degrades to a semi-autonomous agent that degrades to a human-in-the-loop tool is more useful than one that either works perfectly or crashes.
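The autonomy ladder in that last sentence can be made explicit in code. The levels and the confidence thresholds below are illustrative assumptions, but the shape generalizes: degradation is a selection among defined levels, never an exception.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    HUMAN_IN_LOOP = 0     # agent proposes, human executes
    SEMI_AUTONOMOUS = 1   # agent executes, human approves consequential steps
    FULLY_AUTONOMOUS = 2  # agent executes within its governance bounds


def select_level(confidence: float, tools_healthy: bool) -> AutonomyLevel:
    """Narrow the system's scope instead of halting it.
    Thresholds are illustrative placeholders."""
    if not tools_healthy or confidence < 0.5:
        return AutonomyLevel.HUMAN_IN_LOOP
    if confidence < 0.85:
        return AutonomyLevel.SEMI_AUTONOMOUS
    return AutonomyLevel.FULLY_AUTONOMOUS
```

The important property is that every input maps to some level: there is no combination of low confidence and failed tools that leaves the system without a defined behavior.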

Separate planning from execution. The agent that decides what to do should not be the same agent that does it. This is the principle of separation of concerns applied to agent architectures, and it makes everything else—governance, observability, human override—dramatically easier to implement.
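A sketch of that separation, with the planner stubbed out where a model call would go (the tool names and step shapes are invented for this example):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    tool: str
    args: dict


def plan(goal: str) -> list[Step]:
    """Planner: decides what to do. Never touches tools.
    (Stub standing in for an LLM planning call.)"""
    return [Step("lookup", {"query": goal}), Step("summarize", {})]


def execute(steps: list[Step],
            tools: dict[str, Callable],
            approve: Callable[[Step], bool]) -> list[tuple]:
    """Executor: runs approved steps. Never replans."""
    results = []
    for step in steps:
        # Governance and human override slot in naturally at this seam.
        if not approve(step):
            results.append(("skipped", step.tool))
            continue
        results.append(("done", tools[step.tool](**step.args)))
    return results
```

Because the plan exists as data before anything runs, it can be logged, reviewed, gated, or partially vetoed, which is why this split makes governance, observability, and override so much easier.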

Test at the system level, not just the component level. An agent that scores well on benchmarks can still produce terrible outcomes when composed with other agents. Integration testing for agent systems is harder than for traditional software—the behavior is non-deterministic, the state space is enormous—but it is not optional.
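One practical way to test non-deterministic systems is to run the whole pipeline across many seeded trials and assert invariants that must hold on every run, rather than asserting exact outputs. A toy sketch of the pattern, where the random stand-in below is a placeholder for a real agent pipeline:

```python
import random


def run_system(seed: int) -> dict:
    """Stand-in for a full agent pipeline; the seeded RNG models
    non-deterministic model behavior. Replace with the real system."""
    rng = random.Random(seed)
    confidence = rng.random()
    if confidence >= 0.6:
        return {"outcome": "acted", "audit_events": 2, "confidence": confidence}
    return {"outcome": "escalated", "audit_events": 1, "confidence": confidence}


def check_invariants(trials: int = 200) -> int:
    """System-level checks: every run ends in a defined outcome, leaves an
    audit trail, and only acts above the confidence threshold."""
    for seed in range(trials):
        out = run_system(seed)
        assert out["outcome"] in ("acted", "escalated")
        assert out["audit_events"] >= 1
        if out["outcome"] == "acted":
            assert out["confidence"] >= 0.6
    return trials
```

You cannot enumerate the state space, but you can enumerate the properties that must never be violated, and check them across as many sampled trajectories as you can afford.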

Who Should Build These Systems

The next generation of agentic AI systems will be built by people who understand both the capabilities of foundation models and the discipline required to compose them into reliable, governable, observable systems. That combination is rare today because the fields have not historically overlapped.

ML researchers bring deep understanding of model capabilities and limitations. Software engineers bring distributed systems expertise and production discipline. But building agent systems that work reliably at scale, in domains where failure matters, is fundamentally a systems engineering challenge. It requires thinking about failure modes, interfaces, governance, and degradation from the beginning—not as an afterthought once the demo works.

The gap between impressive demos and trustworthy production systems is not going to close itself. It is going to be closed by engineers who treat agent architectures with the same rigor we apply to any other safety-critical system. The models are good enough. The engineering discipline is what is missing.