Architecture Quarterly
Vol. 04 · Issue 12 · May 2026 · Field Note № 07
Field Note № 07 · Production Rewrite

From Five Agents
Down to One Loop.

In the Frontier Model era, Multi-Agent is quietly shifting from 2024's best practice to 2026's over-engineering — a controlled experiment report based on Respan AI's production-grade agent rewrite.

Author
Marcus Huang
Published
2026-05-05
Verified
Two-pass fact-check
— The Argument in Brief

Models are now powerful enough to complete tasks end-to-end within full context. Splitting them up means architecting around a constraint that no longer exists.

01

All Three Dimensions Improved

Same model, same parameters; the only variable is architecture. Tool-Use Efficiency +28.8%, Helpfulness +16.0%, Grounding +9.6%.

02

Architecture > Prompt

V2 taught itself to call docs_search when uncertain — without anyone telling it to. This isn't prompt engineering; it's the space the loop provides.

03

The Litmus Test

If your agents are passing context between each other rather than doing truly independent work — you probably just need a single loop with full context.

§I
Background · The Subject of the Rewrite

Respan used their own product's Agent as the test subject.

Respan is an agent observability and evals platform — providing tracing, evaluator pipelines, experiment comparison, and more. Their product has an embedded AI agent that helps users operate platform features: browse logs, build evaluators, run experiments, configure automations.

The rewrite target was this embedded agent. Not an agent users deploy for their own tasks, but Respan's own dogfooding experiment subject. This allowed strict controlled comparison — same model, same parameters, same evaluation slate, with architecture as the only variable.

This "dogfooding your own platform" setup maximizes data comparability, but comes with limitations: sample size = 1, and Respan is the vendor selling the evals platform, making the article both a case study and product marketing. The specific frontier model used was not disclosed — only "same model and settings."

§II
V1 · Router + 4 Specialists

Every handoff is a lossy compression.

V1 was the classic 2024 multi-agent pattern: a top-level router agent (~10K char prompt) reads user messages and dispatches to four domain specialists (logs, evaluators, automations, datasets). Each specialist has its own focused ~50K char prompt and a narrow tool set. The router uses each specialist's return value to decide the next step.

FIG. 01 · V1 router pattern (multi-agent) · 5 prompts · ~210K char total
[Diagram: User → Router (~10K char prompt) → four specialists (logs, evaluators, automations, datasets), each with its own ~50K char prompt and tools. Each specialist returns a summary only. Annotations: "lossy compression at every handoff"; "5 prompts × 5 blast radii: any typo ships a regression".]

Three Structural Problems

PROB · 01

Context collapses at every handoff.

When sub-agents are wrapped as tools, the router only sees the summary the specialist returns — not the full conversation, the reasoning, or the intermediate decisions. Each handoff is a lossy compression step.

User's perspective: a "create an evaluator" request turns into a stream of tool calls followed by "I've created it," with no visible reasoning in between.
PROB · 02

Multi-agent was over-engineering for today's frontier models.

The reflex to split work across specialists made sense when models were weaker. Narrow each agent's surface, keep it on-rails. But for product AI on a current-gen model, the agent is already capable enough to handle the whole task end to end. What it actually needs is full context, not isolation. Structuring the work around the model's limits is now structuring around a limit that's mostly gone.

What the model needs is full context, not isolation — the most counter-intuitive conclusion from a 2024 perspective.
PROB · 03

Five prompts, five blast radii.

Adding a new tool meant editing the router (so it knew which specialist to route to), the specialist (to teach it the new tool), and the routing rules. A typo in any of the five prompts shipped a regression.

Maintenance cost doesn't grow linearly: five prompts mean 5× the surface area, and each edit is its own coin flip.
§III
V2 · Single Loop + Two-Tier Tools

One agent, one prompt, full context.

V2 eliminated the router and all specialists. A single agent decides each action step in one continuous reasoning chain until the task is complete — same model, same parameters, architecture as the only variable.

But giving the agent full context comes at the cost of a ballooning resting context. The key to making single loop practical is two-tier tool loading — core tools always loaded, long-tail tools behind search, schemas pulled on demand.
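
A minimal sketch of two-tier loading, assuming a hypothetical in-memory catalog. Only `tool_search`, `notification_method_list`, and the `log_*` prefix come from the source; the `ToolCatalog` class, the schema dicts, the names `log_list` and `automation_create`, and the substring match are illustrative assumptions, not Respan's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCatalog:
    """Hypothetical two-tier catalog: core schemas resident, long tail deferred."""
    core: dict[str, dict]        # always part of the resting context
    long_tail: dict[str, dict]   # schemas stay out of context until requested
    loaded: dict[str, dict] = field(default_factory=dict)

    def resting_context(self) -> dict[str, dict]:
        """Schemas the agent sees on every turn: core tools plus anything fetched."""
        return {**self.core, **self.loaded}

    def tool_search(self, query: str) -> list[str]:
        """Return long-tail tool names matching the query (naive substring match)."""
        q = query.lower()
        return sorted(name for name in self.long_tail if q in name.lower())

    def load(self, name: str) -> dict:
        """Pull one long-tail schema into context on demand."""
        schema = self.long_tail[name]
        self.loaded[name] = schema
        return schema


# Illustrative catalog; real tool names and schemas will differ.
catalog = ToolCatalog(
    core={"log_list": {"params": ["filter"]}, "tool_search": {"params": ["query"]}},
    long_tail={
        "notification_method_list": {"params": []},
        "automation_create": {"params": ["trigger"]},
    },
)
```

The point of the split is that `resting_context()` grows only when the agent actually reaches for a long-tail tool, so coverage stays complete while the default prompt stays small.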

FIG. 02 · V2 single loop with two-tier tool catalog · 1 prompt · 21 core tools · ~70 long-tail behind search
[Diagram: User ↔ single loop (one agent, one prompt, full context, continuous reasoning chain). Tier 1 (core): always loaded, 21 tools (log_*, dataset_*, experiment_*), with tool_search() leading to Tier 2 (long tail): deferred, ~70 tools, pulled on demand. Annotation: "every recovery path stays open".]

Six Key Design Decisions

D.01 · Two-tier tool loading
V2 loads 21 core tools at startup (the most-used surface: log_* / dataset_* / experiment_*) and defers ~70 long-tail tools behind tool_search. The agent fetches a schema only when it needs it.
Same full coverage, much smaller resting context
D.02 · Tool-namespace grouping
Tools are grouped by domain (evaluators / workflows / datasets / prompts). The agent selects a namespace first, then a tool.
Improved argument-shape accuracy, especially in ID-confusion scenarios (id vs workflow_id)
D.03 · Task tools for self-organization
task_create / task_update let the agent plan multi-step work in a visible location.
No need to compress the plan into reasoning tokens
D.04 · List-no-match means ask, not invent
When notification_method_list returns nothing matching the Slack channel the user described, the agent asks the user instead of silently creating one.
Eliminates hallucination-class bugs
D.05 · Async-aware
Long operations (experiments / dataset imports) report "running async" after launch, with no polling in the same turn.
Eliminates "claimed done but still running" bugs
D.06 · ID asymmetry hard rules
"experiment_create requires version PK not family ID" baked into the prompt + diagnose-not-retry pattern.
Eliminates the most frequent tool call errors
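
The "list-no-match means ask, not invent" rule can be sketched as a guard that sits between the list call and any create call. This is an illustration under assumptions: the record shape (`id`, `type`, `channel`), the function `resolve_slack_channel`, and the `Use` / `Ask` result types are hypothetical. Only the tool name `notification_method_list` appears in the source.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Use:
    """An existing resource matched; safe to proceed with its ID."""
    method_id: str


@dataclass
class Ask:
    """Nothing matched; hand a question back to the user instead of creating."""
    question: str


def resolve_slack_channel(requested: str, listed: list[dict]) -> Union[Use, Ask]:
    """Match the user's described channel against notification_method_list output."""
    wanted = requested.lstrip("#").lower()
    for method in listed:
        channel = method.get("channel", "").lstrip("#").lower()
        if method.get("type") == "slack" and channel == wanted:
            return Use(method_id=method["id"])
    # List-no-match means ask, not invent: never fall through to a create call.
    return Ask(
        question=f"I couldn't find a Slack channel named '{requested}'. "
                 "Which existing notification method should I use?"
    )
```

Returning an explicit `Ask` value makes the "do not create" branch visible in traces, which is what an evaluator would need in order to check it.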
§IV
Evaluation · Two-Layer Regression Net

Without a regression net, every prompt change is a coin flip.

Respan dogfooded their own platform — datasets pulled from production traces, evaluator pipelines self-built, experiments compared V1 vs V2 on the same slate. All three evaluators use 1-100 strictness-anchored scoring, judged by claude-haiku-4-5.

Three Evaluators

Helpfulness & Completeness

Does the final assistant message actually answer the user's question? Did the agent take the action when asked, or stop at "Want me to proceed?"

Caps at 40 if the agent stalled at confirmation when the user clearly asked for an action.
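
The cap can be read as a post-processing rule applied to the judge's raw score. A sketch, assuming two hypothetical flags (`user_requested_action`, `stalled_at_confirmation`) that the judge or a pre-check has already determined; the source states only the cap value of 40 and the 1-100 scale.

```python
def helpfulness_score(raw: int, user_requested_action: bool, stalled_at_confirmation: bool) -> int:
    """Clamp a judge score to 1-100, capping at 40 when the agent stalled on a clear action request."""
    score = max(1, min(100, raw))
    if user_requested_action and stalled_at_confirmation:
        score = min(score, 40)
    return score
```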

Tool-Use Efficiency

Count function_call items, classify question complexity (pure-knowledge, simple, medium, complex), penalize redundancy — duplicate creates, scratchpad spam, repeat list calls without filter changes.

Rewards direct paths.
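
One way to approximate the redundancy part of this evaluator, assuming each function_call item is a dict with `name` and `arguments` keys (an assumed shape, not a documented trace format). The sketch flags only exact repeats, meaning the same tool called again with identical arguments, which covers duplicate creates and repeat list calls without filter changes. Complexity classification and scratchpad spam are left out.

```python
import json
from collections import Counter


def redundant_calls(function_calls: list[dict]) -> list[tuple[str, int]]:
    """Return (tool name, number of extra repeats) for calls repeated with identical arguments."""
    seen = Counter(
        # Serialize with sorted keys so argument order does not hide a duplicate.
        (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        for call in function_calls
    )
    extras = Counter()
    for (name, _args), count in seen.items():
        if count > 1:
            extras[name] += count - 1
    return sorted(extras.items())
```

A list call whose filter changed between calls produces a different serialized key, so it is not counted as redundant.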

Hallucination & Grounding

Extract every concrete claim from the final message (numbers, IDs, named resources), then check whether each one is traceable to an earlier function_call_output.

Heavy penalty for invented UUIDs or numbers that don't appear in any tool result.
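
A rough, deterministic sketch of the traceability rule. Per the source the real evaluator is an LLM judge; this version assumes claims are limited to UUIDs and plain numbers found by regex, and that tool outputs are available as strings. The function names are hypothetical, and named resources are not handled.

```python
import re

UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
)
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def extract_claims(text: str) -> list[str]:
    """Pull UUIDs and standalone numbers out of a piece of text."""
    uuids = [u.lower() for u in UUID_RE.findall(text)]
    # Blank out UUIDs first so their digit runs are not counted as numbers.
    remainder = UUID_RE.sub(" ", text)
    return uuids + NUMBER_RE.findall(remainder)


def ungrounded_claims(final_message: str, tool_outputs: list[str]) -> list[str]:
    """Return claims in the final message that appear in no tool output."""
    evidence: set[str] = set()
    for output in tool_outputs:
        evidence.update(extract_claims(output))
    return [claim for claim in extract_claims(final_message) if claim not in evidence]
```

Comparing extracted tokens rather than raw substrings avoids treating "12" as grounded just because some tool output contains "120".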

The scale is strictness-anchored, not a quality percentage. Most decent agent runs land in the 40-70 range; a score of 80+ is awarded only when every must-have from the ground truth is hit.

01

Internal Probes

— Fixed · Run on every prompt change

A fixed test set run on every prompt change, covering four bug categories:

  • Taxonomy: Does the agent keep evaluator vs grader vs pipeline straight?
  • Actions: Can the agent execute multi-step create/update flows end to end?
  • Cross-domain: Does the same ID discipline transfer across domains?
  • Honest-failure: Does the agent admit a feature is unsupported instead of silently inventing a trigger type?
02

OOD Customer Questions

— Held out · Never in regression net

A smaller set of real customer questions paraphrased from the support inbox, deliberately kept out of the regression net so they stay novel. Used to verify whether the internal slate's improvements generalize to questions the slate has never seen.

"What's my current monthly request volume and how is it trending?"
"My error rate spiked this morning — find the failed logs and tell me why."

§V
The Data · V1 vs V2 head-to-head

All three dimensions improved, same model, same parameters.

Internal Slate

Judge: claude-haiku-4-5 · Scale 1-100

Tool-Use Efficiency · V1 43.3 · V2 55.8 · Δ +12.5 (+28.8 %)
Helpfulness & Completeness · V1 50.1 · V2 58.1 · Δ +8.0 (+16.0 %)
Hallucination & Grounding · V1 55.6 · V2 61.0 · Δ +5.3 (+9.6 %)
Latency / turn · V1 19.8 s · V2 38.8 s · Δ +18.9 s (~2× slower)

V2 is slower per turn because it actually completes the action end to end. V1's "fast" responses were mostly stops at "Want me to proceed?" — that speed is user cost, not user value.
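
The reported deltas can be checked with a few lines of arithmetic. Recomputing from the rounded scores shown lands within 0.15 of every reported figure, but not always exactly on it: the rounded grounding scores give +5.4 where the source reports +5.3. That gap is consistent with the source computing deltas from unrounded scores, which is an assumption here and not something the source states.

```python
# Rounded scores and deltas as reported in the source:
# (V1, V2, reported absolute delta, reported relative delta in %).
reported = {
    "Tool-Use Efficiency": (43.3, 55.8, 12.5, 28.8),
    "Helpfulness & Completeness": (50.1, 58.1, 8.0, 16.0),
    "Hallucination & Grounding": (55.6, 61.0, 5.3, 9.6),
}


def recompute(v1: float, v2: float) -> tuple[float, float]:
    """Absolute and relative (%) improvement of V2 over V1."""
    delta = v2 - v1
    return delta, 100.0 * delta / v1


for metric, (v1, v2, rep_abs, rep_pct) in reported.items():
    delta, pct = recompute(v1, v2)
    print(f"{metric}: {delta:+.1f} ({pct:+.1f} %) vs reported {rep_abs:+.1f} ({rep_pct:+.1f} %)")

# Latency per turn in seconds, V2 over V1; the source rounds this to "~2x slower".
latency_ratio = 38.8 / 19.8
```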

The same pattern held on OOD real customer questions, with Helpfulness & Completeness showing the most significant gains — in real scenarios, "stopping at confirmation" means "the customer didn't get what they wanted."

The article's verdict: regression net ensures known failure modes don't regress, OOD set ensures improvements generalize to unseen questions. Both layers are indispensable.

§VI
Emergent Behavior · The Surprise

Nobody taught it — it learned on its own to look up docs when uncertain.

Back when V2's prompt was still a rough draft, the Respan team noticed the agent doing something they hadn't told it to. When it hit an unfamiliar request or an unexpected tool error, it would call docs_search on its own, multiple times, verifying the current schema, double-checking which ID type a field expected, working out why a tool had failed.

Nothing in the prompt said "look up the docs when unsure." The loop just had the room.

V1 couldn't do this. A specialist that hit an unfamiliar error returned a string to the router, which had to decide on that summary without re-entering the loop. V2's loop takes the error, reads the docs, retries with the right shape, and only then turns back to the user — all in one pass, no state lost.
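
The recovery path described above can be sketched without a model. Everything here is a stand-in: the tool names `experiment_create` and `docs_search` come from the source, but their signatures, the stub bodies, the `resolve_version` callback, and the hard-coded "diagnose then retry" step (which replaces the model's own reasoning) are assumptions made for illustration.

```python
class ToolError(Exception):
    """Raised by a stub tool when its arguments have the wrong shape."""


def experiment_create(version_pk=None, family_id=None):
    """Stub tool: rejects family IDs, mirroring the 'version PK not family ID' rule."""
    if version_pk is None:
        raise ToolError("experiment_create requires version PK, not family ID")
    return {"status": "running async", "version_pk": version_pk}


def docs_search(query):
    """Stub docs lookup returning the hint needed to fix the call."""
    return "experiment_create: pass version_pk; resolve it from the family first"


def run_single_loop(family_id, resolve_version, max_attempts=3):
    """Try the tool, and on error look up docs and retry, all inside one loop."""
    context = []  # every call, error, and doc hit stays visible to later steps
    args = {"family_id": family_id}
    for _ in range(max_attempts):
        try:
            result = experiment_create(**args)
            context.append(("experiment_create", args, result))
            return result, context
        except ToolError as err:
            context.append(("error", args, str(err)))
            hint = docs_search(str(err))
            context.append(("docs_search", str(err), hint))
            # Stand-in for model reasoning: diagnose, then retry with the right ID type.
            args = {"version_pk": resolve_version(family_id)}
    raise RuntimeError("gave up after retries")
```

In a router design the equivalent of `context` is cut at the specialist boundary; only the error string crosses it, so the docs lookup and the corrected retry have nowhere to happen.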

A clean loop with full context isn't just a tidier engineering surface. It changes what the model is capable of, because every recovery path stays open.

Respan AI · on the V2 rewrite

This is not prompt engineering; it is a property of the architecture.

§VII
The Boundary · When to use which

It's not "Multi-Agent is dead"; it's "split only when you truly need context isolation."

Respan's argument has boundaries — the article explicitly states "both patterns are valid; the choice has trade-offs around context fidelity, prompt blast radius, latency, and how much of the model's reasoning is visible to the user." The core test: if your agents are passing context between each other rather than doing truly independent work, you probably just need one loop with full context.

A · Collapse to Single Loop

When to collapse

  • Agents are passing context rather than doing independent work
  • Model capability is sufficient for end-to-end task completion
  • Current system is a multi-agent design from 2024
  • Debuggability and visible reasoning chain are needed
B · Keep Multi-Agent

When to split

  • Agents are doing truly independent work
  • Agents don't need to pass context or share state
  • Single loop's prompt blast radius is unmanageable
  • Tasks can achieve real speedup through parallelism
Note: The original article did not list specific conditions for multi-agent applicability. The table above is distilled from their arguments. Core trade-off dimensions from the source: context fidelity / prompt blast radius / latency / reasoning visibility.

Tool Loading is the key to making Single Loop work — not "give the agent all tools and pray," but core set always present + long tail behind search. This is essential prerequisite engineering.

Regression Net is the minimum viable eval for Agent Rewrites — fixed internal probes (covering known failure modes) + OOD real customer questions (verifying generalization). Both layers are indispensable.