Designing Agent Harnesses for Long-Horizon Tasks
Most agent demos last about thirty seconds. A user types a prompt, the model calls a tool and returns a result, and everyone applauds. But the interesting engineering problems start when you need an agent to stay coherent across hours, days, or weeks of work—when the task outlives a single context window, when multiple agents need to collaborate without dropping information, and when a human operator needs to step in and steer without blowing up the state machine.
This is a fundamentally different design problem. Short-horizon agents are stateless functions. Long-horizon agents are systems. And like all systems, they need architecture.
I have spent the past two years building multi-agent platforms that run long-horizon tasks in production. What follows is what I have learned about the harness—the infrastructure that wraps around a model and makes it useful beyond a single turn.
Memory Architecture Matters
The context window is not persistent memory. It is working memory—a scratchpad that gets wiped between sessions and has a hard ceiling on capacity. For any task that spans multiple sessions, you need a persistent memory layer that exists outside the model's context.
The naive approach is to dump everything into a vector store and retrieve it with semantic search. This works for simple question-answering, but it falls apart for agentic workflows. The problem is retrieval noise: when an agent is midway through a complex task, pulling in tangentially related memories actively degrades performance. The agent starts chasing stale context instead of focusing on the current subtask.
What works better is structured memory with explicit schemas. Rather than treating memory as a bag of embeddings, define what an agent needs to remember: active goals, completed milestones, open decisions, known constraints, and key artifacts. Store these as structured records with metadata—timestamps, relevance scores, and provenance chains that trace back to the original context where the memory was formed.
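As a minimal sketch, a structured record might look like the following. Every field name here is an assumption chosen for illustration, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

# Illustrative record kinds, mirroring the categories above.
MemoryKind = Literal["goal", "milestone", "decision", "constraint", "artifact"]

@dataclass
class MemoryRecord:
    kind: MemoryKind
    summary: str                  # short statement the agent can scan quickly
    detail_ref: str               # pointer to the full record (file path, DB key)
    created_at: datetime
    relevance: float = 0.5        # harness-maintained score, decayed over time
    provenance: list[str] = field(default_factory=list)  # session/turn IDs it traces to
```

The point of the `detail_ref` and `provenance` fields is that a record is cheap to list but can always be traced back to the context where it was formed.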
The critical tradeoff is between completeness and noise. You want the agent to have access to everything it might need, but you cannot afford to surface everything it has ever seen. The solution is a tiered memory architecture: a small set of always-loaded context (current goals, active constraints), a medium-priority ring of recently relevant state, and a deep archive that only surfaces on explicit query. Each tier has different retention policies and different retrieval strategies.
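A sketch of that tiered layout, with tier names, the freshness window, and retention choices all as assumptions:

```python
from datetime import datetime, timedelta, timezone

class TieredMemory:
    """Three tiers with different retrieval strategies (illustrative)."""

    def __init__(self):
        self.core = []      # always loaded: current goals, active constraints
        self.recent = []    # (timestamp, record) pairs, loaded only if fresh
        self.archive = {}   # key -> record, surfaced only on explicit query

    def session_context(self, now, freshness=timedelta(hours=24)):
        """Build the context loaded at session start: everything in core,
        plus recent records newer than the freshness window."""
        fresh = [rec for ts, rec in self.recent if now - ts <= freshness]
        return list(self.core) + fresh

    def lookup(self, key):
        """Deep-archive access happens only when the agent explicitly asks."""
        return self.archive.get(key)
```

The archive never leaks into the default context; the agent has to ask for it by key, which is what keeps retrieval noise out of the working set.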
One pattern I have found particularly effective is the memory index—a lightweight summary document that the agent loads at the start of every session. It contains pointers to detailed records rather than the records themselves, letting the agent decide what to pull into working memory based on the current task. This keeps the context window focused while preserving access to the full history.
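A toy rendering of such an index—the record shape and the output format are both assumptions:

```python
def build_memory_index(records):
    """Render a lightweight index: one line per record with a pointer,
    never the record body. The agent loads this at session start and
    decides which refs to pull into working memory."""
    lines = ["# Memory index (pointers only)"]
    for r in records:
        lines.append(f"- [{r['kind']}] {r['summary']} -> {r['ref']}")
    return "\n".join(lines)
```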
Context Compression Is Not Optional
As tasks grow longer, the raw transcript of everything that has happened becomes too large to fit in any context window. You have two options: lose information silently as older context falls off the edge, or compress it deliberately. Only one of these is engineering.
Summarization checkpoints are the most straightforward strategy. At defined intervals—after completing a subtask, before a handoff, or when the context window is approaching capacity—the agent generates a structured summary of what has happened, what was decided, and what remains open. This summary replaces the raw transcript in the context, preserving the conclusions while discarding the intermediate reasoning.
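A checkpoint can be sketched as a function that swaps the transcript for a structured summary. The summarizer interface and summary fields here are assumptions; in practice the summarizer is a model call:

```python
def checkpoint(transcript, summarize):
    """Replace a raw transcript with a structured summary.
    `summarize` is assumed to map list[str] -> dict with
    'done', 'decided', and 'open' keys."""
    summary = summarize(transcript)
    return [
        "## Checkpoint summary",
        f"Done: {summary['done']}",
        f"Decided: {summary['decided']}",
        f"Open: {summary['open']}",
    ]
```

The returned lines take the transcript's place in context: conclusions survive, intermediate reasoning is discarded.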
But flat summarization has limits. When you compress a two-hour debugging session into a paragraph, you lose the diagnostic reasoning that might be relevant if the bug resurfaces. Hierarchical context addresses this: maintain summaries at multiple levels of granularity. A high-level summary captures the overall task trajectory. Mid-level summaries preserve key decision points and their rationale. The raw transcript is archived but not loaded unless specifically needed.
Priority-based retention adds another dimension. Not all context is equally important. Active errors and unresolved questions should persist at full fidelity. Completed subtasks can be aggressively compressed. Background context—environment details, configuration state, known constraints—should be stored in structured memory rather than occupying space in the narrative context.
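The retention rules above reduce to a small policy function. The item categories and policy names are illustrative, not a fixed taxonomy:

```python
def retention_policy(item):
    """Map a context item to a retention decision (categories are assumptions)."""
    if item["kind"] in ("error", "open_question"):
        return "keep_full"        # unresolved issues persist at full fidelity
    if item["kind"] == "completed_subtask":
        return "compress"         # aggressively summarize finished work
    return "move_to_memory"       # background/config state -> structured store
```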
The implementation detail that matters most is making compression deterministic. If you rely on the model to decide what to compress and when, you get inconsistent behavior. Build the compression triggers into the harness itself: explicit checkpoints, token-count thresholds, and task-boundary detection. The model generates the summaries, but the harness controls the schedule.
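A harness-owned trigger might look like this; the 80% threshold is an assumption, and the point is that every input comes from the harness, not the model:

```python
def should_compress(token_count, limit, task_boundary, checkpoint_due):
    """Deterministic compression trigger. The model generates summaries;
    this function, owned by the harness, decides when."""
    return (
        checkpoint_due                   # an explicit scheduled checkpoint
        or task_boundary                 # a subtask just completed
        or token_count > 0.8 * limit     # context approaching capacity
    )
```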
Communication Between Agents
Single-agent systems hit a ceiling. At some point you need specialists: one agent for code generation, another for testing, another for research, another for orchestration. The hard problem is not building individual agents—it is designing how they talk to each other without losing information at the boundaries.
There are two fundamental patterns: shared state and message passing. Shared state means all agents read from and write to a common data store—a filesystem, a database, a shared document. Message passing means agents communicate through structured messages with defined schemas.
In practice, you need both. Shared state works well for artifacts: code files, configuration, test results, accumulated evidence. These are concrete objects that multiple agents need to access without going through a relay. Message passing works for coordination: task assignments, status updates, handoff signals, and requests for input.
The failure mode I see most often is unstructured handoffs. Agent A finishes a subtask and passes the result to Agent B as a blob of natural language. Agent B has to parse that blob, extract the relevant information, and figure out what it is supposed to do next. Every handoff like this is an opportunity for context loss. The fix is to define explicit handoff protocols with typed schemas: what fields must be present, what format they take, and what the receiving agent should do with them.
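A handoff protocol along these lines can be sketched with a typed schema plus a validator at the boundary. The field names are assumptions chosen to match the failure mode described above:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    task_id: str
    from_agent: str
    to_agent: str
    artifacts: list[str]     # refs into shared state, not inline blobs
    decisions: list[str]     # decisions already made, so the receiver doesn't revisit them
    next_action: str         # what the receiving agent is expected to do

def validate_handoff(h: Handoff) -> None:
    """Reject an incomplete handoff at the boundary instead of letting
    the receiving agent guess what to do with a blob of prose."""
    if not h.next_action:
        raise ValueError(f"handoff {h.task_id}: missing next_action")
    if not h.artifacts and not h.decisions:
        raise ValueError(f"handoff {h.task_id}: carries no content")
```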
Another pattern that pays dividends is the orchestrator model. Rather than having agents communicate peer-to-peer, route all coordination through a dedicated orchestrator agent that maintains the global task state, assigns work, and validates that handoffs are complete. This adds latency but dramatically reduces the surface area for coordination failures.
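In miniature, the orchestrator is a single routing point that validates and records every handoff before delivery. The dict-based message shape and agent interface are assumptions:

```python
class Orchestrator:
    """Routes all coordination through one place that owns global task state.
    Agents are assumed to be callables taking and returning a handoff dict."""

    def __init__(self, agents):
        self.agents = agents   # name -> callable(handoff) -> result
        self.state = {}        # task_id -> latest validated handoff

    def route(self, handoff):
        """Validate, record, then deliver a handoff to its target agent."""
        for required in ("task_id", "to_agent", "next_action"):
            if required not in handoff:
                raise ValueError(f"incomplete handoff: missing {required}")
        self.state[handoff["task_id"]] = handoff
        return self.agents[handoff["to_agent"]](handoff)
```

Because every message passes through `route`, the orchestrator always has a complete picture of task state—the extra hop is the price of that visibility.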
The Role of Human-in-the-Loop
Fully autonomous agents make for compelling demos but questionable production systems. The reality is that long-horizon tasks encounter ambiguity, unexpected states, and judgment calls that benefit from human input. The question is not whether to include a human in the loop, but how to make the handoff seamless.
I have converged on three modes of autonomy that an agent harness should support, and critically, should be able to switch between at any point during execution:
Autonomous mode is for well-understood subtasks with clear success criteria. The agent executes without interruption and reports results when done. This is appropriate for routine operations, established workflows, and tasks where the cost of a wrong decision is low and recoverable.
Copilot mode is for tasks where the agent does the heavy lifting but a human reviews key decisions before they are executed. The agent proposes actions, the human approves or redirects, and execution continues. This is the right default for most production work—it captures the productivity gains of automation while preserving human judgment at critical junctures.
Manual mode is for high-stakes decisions, novel situations, and cases where the agent has low confidence. The human drives and the agent assists—providing information, suggesting options, and executing specific instructions. The key is that the agent remains engaged and retains context even when it is not the primary decision-maker.
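The three modes reduce to an approval gate in the harness. The numeric risk levels and confidence threshold below are illustrative assumptions:

```python
from enum import Enum

class Mode(Enum):
    AUTONOMOUS = "autonomous"
    COPILOT = "copilot"
    MANUAL = "manual"

def needs_approval(mode, action_risk, confidence):
    """Decide whether a proposed action must wait for a human.
    action_risk: 0 = routine, 1 = key decision, 2 = high stakes (assumed scale).
    confidence: the agent's self-reported confidence in [0, 1]."""
    if mode is Mode.MANUAL:
        return True                                    # human drives every action
    if mode is Mode.COPILOT:
        return action_risk >= 1 or confidence < 0.8    # gate key or shaky decisions
    return action_risk >= 2                            # autonomous: escalate only high stakes
```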
The engineering challenge is mode switching. When a human takes the wheel in the middle of an autonomous run, the agent needs to surface its current state, pending actions, and open questions in a format the human can quickly absorb. When the human hands control back, the agent needs to incorporate any decisions the human made without losing track of the broader task. This requires the harness to maintain a clean separation between task state and execution state—the "what needs to happen" should be independent of "who is currently driving."
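That separation can be sketched as two distinct structures: a takeover changes execution state and surfaces a briefing, while task state is untouched. All field names here are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    """What needs to happen -- independent of who is driving."""
    goals: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    pending_actions: list[str] = field(default_factory=list)

@dataclass
class ExecutionState:
    """Who is driving right now, and what is in flight."""
    driver: str = "agent"          # "agent" or "human"
    in_flight: list[str] = field(default_factory=list)

def hand_to_human(task: TaskState, exec_state: ExecutionState) -> str:
    """Switch the driver and surface current state as a briefing the
    human can quickly absorb. Only execution state is mutated."""
    exec_state.driver = "human"
    return (
        f"Pending actions: {task.pending_actions}\n"
        f"Open questions: {task.open_questions}\n"
        f"In flight: {exec_state.in_flight}"
    )
```

Handing control back is the mirror image: the driver flips to `"agent"`, and any decisions the human made land in `TaskState`, where the agent picks them up without losing the broader task.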
Closing Thoughts
The model is important. But the harness—the memory systems, the context management, the communication protocols, the human-agent interface—is what determines whether an agent can do real work over real timescales. These are systems engineering problems, not machine learning problems, and they deserve the same rigor we apply to any distributed system: clear interfaces, explicit state management, graceful degradation, and observable behavior.
The next generation of agent platforms will be defined less by which model they use and more by how well they manage the complexity that emerges when capable models are embedded in long-running, multi-agent, human-collaborative workflows. The harness is not scaffolding to be discarded once the model gets smarter. It is infrastructure, and it is where much of the real engineering lives.