From Five Agents
Down to One Loop.
In the Frontier Model era, Multi-Agent is quietly shifting from 2024's best practice to 2026's over-engineering — a controlled experiment report based on Respan AI's production-grade agent rewrite.
Models are now powerful enough to complete tasks end-to-end within full context. Splitting them up means architecting around a constraint that no longer exists.
All Three Dimensions Improved
Same model, same parameters, only variable is architecture. Tool-Use Efficiency +28.8%, Helpfulness +16.0%, Grounding +9.6%.
Architecture > Prompt
V2 taught itself to call docs_search when uncertain — without anyone telling it to. This isn't prompt engineering; it's the space the loop provides.
The Litmus Test
If your agents are passing context between each other rather than doing truly independent work — you probably just need a single loop with full context.
Respan is an agent observability and evals platform — providing tracing, evaluator pipelines, experiment comparison, and more. Their product has an embedded AI agent that helps users operate platform features: browse logs, build evaluators, run experiments, configure automations.
The rewrite target was this embedded agent. Not an agent users deploy for their own tasks, but Respan's own dogfooding experiment subject. This allowed strict controlled comparison — same model, same parameters, same evaluation slate, with architecture as the only variable.
This "dogfooding your own platform" setup maximizes data comparability, but comes with limitations: sample size = 1, and Respan is the vendor selling the evals platform, making the article both a case study and product marketing. The specific frontier model used was not disclosed — only "same model and settings."
V1 was the classic 2024 multi-agent pattern: a top-level router agent (~10K char prompt) reads user messages and dispatches to four domain specialists — logs / evaluators / automations, etc. Each specialist has its own ~50K char focused prompt and narrow tool set. The router uses each specialist's return to decide next steps.
Three Structural Problems
Context collapses at every handoff.
When sub-agents are wrapped as tools, the router only sees the summary the specialist returns — not the full conversation, the reasoning, or the intermediate decisions. Each handoff is a lossy compression step.
Multi-agent was over-engineering for today's frontier models.
The reflex to split work across specialists made sense when models were weaker. Narrow each agent's surface, keep it on-rails. But for product AI on a current-gen model, the agent is already capable enough to handle the whole task end to end. What it actually needs is full context, not isolation. Structuring the work around the model's limits is now structuring around a limit that's mostly gone.
Five prompts, five blast radii.
Adding a new tool meant editing the router (so it knew which specialist to route to), the specialist (to teach the new tool), and the routing rules. A typo in any one of those five surfaces shipped a regression.
V2 eliminated the router and all specialists. A single agent decides each action step in one continuous reasoning chain until the task is complete — same model, same parameters, architecture as the only variable.
But giving the agent full context comes at the cost of a ballooning resting context. The key to making single loop practical is two-tier tool loading — core tools always loaded, long-tail tools behind search, schemas pulled on demand.
Six Key Design Decisions
log_* / dataset_* / experiment_*) and defers ~70 long-tail tools behind tool_search. The agent fetches the schema only when it needs it.id vs workflow_id)task_create / task_update let the agent plan multi-step work in a visible location.notification_method_list returns nothing matching the user's described Slack channel, ask the user instead of silently creating one.experiment_create requires version PK not family ID" baked into the prompt + diagnose-not-retry pattern.Respan dogfooded their own platform — datasets pulled from production traces, evaluator pipelines self-built, experiments compared V1 vs V2 on the same slate. All three evaluators use 1-100 strictness-anchored scoring, judged by claude-haiku-4-5.
Three Evaluators
Helpfulness & Completeness
Does the final assistant message actually answer the user's question? Did the agent take the action when asked, or stop at "Want me to proceed?"
Tool-Use Efficiency
Count function_call items, classify question complexity (pure-knowledge, simple, medium, complex), penalize redundancy — duplicate creates, scratchpad spam, repeat list calls without filter changes.
Hallucination & Grounding
Extract every concrete claim from the final message (numbers, IDs, named resources), then check whether each one is traceable to an earlier function_call_output.
It's strictness-anchored, not a quality percentage. Most decent agent runs land in the 40-70 range. 80+ only when every must-have from the ground truth is hit.
Internal Probes
— Fixed · Run on every prompt change
A fixed test set run on every prompt change, covering four bug categories:
- Taxonomy — Does the agent keep evaluator vs grader vs pipeline straight?
- Actions — Can the agent execute multi-step create/update flows end to end?
- Cross-domain — Does the same ID-discipline transfer across domains?
- Honest-failure — Does the agent silently invent a trigger type for unsupported features?
OOD Customer Questions
— Held out · Never in regression net
A smaller set of real customer questions paraphrased from the support inbox, deliberately kept out of the regression net so they stay novel. Used to verify whether the internal slate's improvements generalize to questions the slate has never seen.
"What's my current monthly request volume and how is it trending?"
"My error rate spiked this morning — find the failed logs and tell me why."
Internal Slate
The same pattern held on OOD real customer questions, with Helpfulness & Completeness showing the most significant gains — in real scenarios, "stopping at confirmation" means "the customer didn't get what they wanted."
The article's verdict: regression net ensures known failure modes don't regress, OOD set ensures improvements generalize to unseen questions. Both layers are indispensable.
Back when V2's prompt was still a rough draft, the Respan team noticed the agent doing something they hadn't told it to. When it hit an unfamiliar request or an unexpected tool error, it would call docs_search on its own, multiple times, verifying the current schema, double-checking which ID type a field expected, working out why a tool had failed.
Nothing in the prompt said "look up the docs when unsure." The loop just had the room.
V1 couldn't do this. A specialist that hit an unfamiliar error returned a string to the router, which had to decide on that summary without re-entering the loop. V2's loop takes the error, reads the docs, retries with the right shape, and only then turns back to the user — all in one pass, no state lost.
A clean loop with full context isn't just a tidier engineering surface. It changes what the model is capable of, because every recovery path stays open.
Respan AI · on the V2 rewrite
This is not prompt engineering — it's a feature of the architecture, not the prompt.
Respan's argument has boundaries — the article explicitly states "both patterns are valid; the choice has trade-offs around context fidelity, prompt blast radius, latency, and how much of the model's reasoning is visible to the user." The core test: if your agents are passing context between each other rather than doing truly independent work, you probably just need one loop with full context.
When to collapse
- Agents are passing context rather than doing independent work
- Model capability is sufficient for end-to-end task completion
- Current system is a multi-agent designed in 2024
- Debuggability and visible reasoning chain are needed
When to split
- Agents are doing truly independent work
- Agents don't need to pass context or share state
- Single loop's prompt blast radius is unmanageable
- Tasks can achieve real speedup through parallelism
Tool Loading is the key to making Single Loop work — not "give the agent all tools and pray," but core set always present + long tail behind search. This is essential prerequisite engineering.
Regression Net is the minimum viable eval for Agent Rewrites — fixed internal probes (covering known failure modes) + OOD real customer questions (verifying generalization). Both layers are indispensable.