The decisions a framework makes for you,
building from scratch you make yourself
The framework user faces a full set of decisions Claude Code has already made for them; the framework builder faces an empty lot — how context gets compacted, how sub-agents are isolated, how tools load, every one has to be stacked up by hand, layer on layer. This piece gathers the design-side material scattered across a dozen reports into one line.
One word, two stances
The framework user faces a full set of decisions Claude Code has already made for them: how context gets compacted, how sub-agents are isolated, how tools load. Their job is to dial those settings to fit their own codebase. The framework builder faces an empty lot — none of those decisions is made for them, and each one has to be stacked up by hand. This piece is about the latter.
How to design an MCP server, how to turn an internal API into a tool an agent can actually use well, when to run one loop instead of five agents, how to make the artifact a machine-verifiable interface. All of it design decisions you only hit when you build an agent from scratch on the API/SDK.
The consumer-side prompt-caching reminders, how to wire MCP in and have it complement Skills — all that "configuring ready-made Claude Code" material is in the first piece. The two pieces barely overlap: the consumer side is over there, the design side is here.
Put all of this together and you keep running into the same three principles. They are also the skeleton of the whole piece.
More tools don't make an agent stronger — fewer do: every extra tool's description eats another slice of the model's thinking space. Generation and evaluation must be separated: ask one agent to judge what it just produced and it will almost always praise itself with confidence. The deeper a constraint is written into structure, the more reliable it is: a constraint baked into a deterministic script or an interface contract holds far better than one you hope the model remembers every turn.
One constraint that runs all the way through: prompt caching
Before the three main blocks, one underlying constraint has to be on the table, because it decides how every layer above it can be built. The mechanism is prefix matching: when a request goes out, the API compares it byte by byte from the very first token until it hits a cache breakpoint; if a single byte at some position differs from last time, everything from that position onward has to be recomputed. The part that hits cache is cheap and fast; the part that misses is recomputed at full price.
It is an architectural constraint rather than an optimization because it wires straight into the business model: one cache miss means the whole context is recomputed, doubling the cost; double the cost and, under the same rate limit, you can serve half as many requests; halve that and you're cutting into the real quota of tiers like Pro and Max. Internally, Anthropic treats a cache hit rate dropping below threshold as a production incident, on par with an outage.
This constraint shows up again and again in the three blocks below, just in different forms. At the tool layer it becomes never add or remove tools; at the orchestration layer it becomes sub-agents run their own prefix chains and don't pollute the main conversation; and for context compaction, the right move is to disguise the compaction as a natural continuation of the parent conversation — reuse the full prefix and append a single compaction instruction at the end, rather than starting a fresh block that breaks the prefix.
Keep this constraint in mind and a lot of the seemingly awkward choices below turn out to be accommodations to it.
Tool design: few and sharp, load on demand
The first mistake a framework builder tends to make is wrapping all several hundred internal APIs into several hundred tools, on the theory that the more capability you hand over, the stronger the agent. The reality is the exact opposite. Every tool's description costs context, and before the model decides its next move it has to read through every tool's schema. The more tools, the less room is left for actual reasoning. Tool definitions aren't free background knowledge — they compete with the task itself for the same attention.
CWC 2026 has an often-cited case that shows the magnitude of this. One team's agent eval was stuck at a 62% pass rate, and the root cause turned out to be overfeeding. What they did wasn't add — it was cut:
The first principle: don't cover every API, cover the few paths the agent actually walks to finish a task. Making "create a ticket from a thread" one tool beats making "read thread, parse messages, create ticket, link attachment" four tools the agent has to wire together itself. The former maps to an intent the agent would naturally have; the latter pushes the orchestration burden back onto the model.
The core judgment here: an agent is not a well-behaved operator. A human using a tool makes typos; an agent using a tool will confidently invent a plausible-looking parameter that doesn't exist. The two failure modes differ, so the interface design goals differ too. Boil down a year of practice and you get roughly ten rules:
confirm parameter rather than stopping to wait on stdin — because there's no human at the agent's end to press enter.25K tokens).Each rule reads fine alone, but together they stay abstract. Drop them onto one concrete operation — "create a ticket from a thread" — and the gap between a bad tool and a good one is obvious at a glance:
Same operation — the good tool folds orchestration, confirmation, errors, and self-description into the interface: the agent only has to express the single intent "make a ticket from this thread," and the assembling, the dry run, the error reading, the schema lookup are all caught on the tool's side. That's what those ten rules look like once they take shape.
Once the tool interface is designed, the next question is how to hand it to the agent. Two mainstream routes: ship it as a command-line tool (CLI), or as an MCP server. This isn't a matter of taste — there's data. The AXI benchmark ran 425 runs (17 tasks, 5 configurations, repeated 5 times each):
| Dimension | agent-optimized CLI | GitHub's MCP |
|---|---|---|
| success rate | 100% | 87% |
| cost per run | $0.050 | $0.148 (~3×) |
| avg. time | 15.7s | 34.2s (~2×) |
| avg. round-trips | 3 | 6 |
The gap is structural: every MCP tool definition itself costs two to five hundred tokens, so with many tools, just loading the definitions into context eats a big chunk; and MCP interactions take more round-trips. A CLI can also naturally pipe several commands together, which MCP can't.
It suits the execution layer
- few commands, stateless
- there's a shell in the runtime
- tokens are tight
- a CLI already exists
It suits the discovery layer
- tools number fifty or more
- there's no shell in the runtime at all
- a stateful session must be maintained
- an API-only environment with no command line
A disciplined team generates both interfaces from one core of logic.
Make it remote
Remote is the only form reachable by web, mobile, and cloud agents alike, and mainstream clients optimize for remote consumption first. Local stdio is only good for dev self-testing.
Group by intent
Group tools by intent, not by API endpoint. This is the MCP version of the CLI's "merge by workflow" — the same idea, one layer over.
Code-orchestrate a large API surface
Don't turn thousands of endpoints into thousands of tools; expose a thin, code-accepting tool, let the agent write scripts and the server run them in a sandbox. Cloudflare covers about 2,500 endpoints with two tools, with tool definitions costing only ~1,000 tokens.
Return UI, request input
New capabilities let a tool return embedded UI (charts, forms, dashboards), and let a server mid-execution ask the user for a parameter, confirm a dangerous action, or clarify an ambiguity.
Standardize auth
Use a client registration mechanism to speed up first-time auth and cut repeated authorizations, and a managed credential store to auto-inject and refresh tokens, so every server isn't building its own secret storage.
Look at the whole tool-design block as a piece and one through-line keeps recurring, just renamed at each layer. The essence is: don't dump everything on the model at once — let it fetch what it needs when it needs it.
Orchestration: first ask whether to split, then how
The second big decision a framework builder faces is orchestration: does one task go to a single agent that runs it end to end, or get split into a router plus a few specialist agents? The mainstream answer in 2024 was to split, and the reasoning held: models were weak then, so you had to narrow each agent's scope. But that premise no longer holds in 2026 — today's models have long context, use tools, and self-correct, and the benefit of narrowing scope no longer outweighs the cost of splitting.
A production rewrite put numbers on this. Respan rewrote what had been a router plus four specialist agents into a single loop, and under the same model and setup, all three dimensions improved:
Per-turn latency actually went from 19.8s up to 38.8s, a doubling. But this isn't a regression. The old setup was "fast" because it often stalled at "shall I continue?" and stopped moving — it looked responsive but hadn't finished the task; the rewritten version actually carried the actions through end to end, so a single turn naturally takes longer. Treating "stopping to ask" as completion was precisely the old structure's flaw.
One more caveat: this is a single company's single rewrite, sample size of one; Respan itself is an observability vendor, so it has a stake; and the scoring used a model judge, not humans. All of that should be kept in mind — don't take these numbers as a universal law.
But the decision framework it offers does hold up:
Handoff is lossy compression
When a specialist agent finishes, it usually passes only a summary to the next in line, and the full reasoning chain in its head is gone. The next one gets a flattened conclusion, not the process.
There's an ops cost too: five prompts are five separate surfaces for error, and a typo anywhere can quietly ship a regression.
No such loss
A loop with full context remembers, start to finish, exactly how it walked here step by step, and there's only one place to maintain.
It also grows abilities no one taught it: on an unfamiliar request or a tool error, it goes off to read the docs, verify the schema, confirm the ID type, then carries on. That's the re-entry room the architecture gives it; a specialist agent can only return a string, with no room to re-enter.
Press "default to one loop" one layer further and there's evidence from an independent source: several agents thrown together aren't just more expensive, they can be less reliable. In one study Anthropic observed a contrast: an "AI organization" of multiple agents is genuinely more capable than a single agent, yet worse on alignment. Same news-recommendation task:
After splitting into multiple agents, the one writing the spec doc declared "minimize the spread of misinformation," while the one writing the ranking code treated misinformation as a positive weight and pushed it up the order, and not a single agent in between caught the contradiction. The root cause is that the doc-writer and the ranking-code-writer aren't the same agent, and no one cross-validated. What splitting adds isn't just coordination cost — it's this kind of internal contradiction no one backstops, and closing it depends precisely on the independent verification the next block covers.
Some tasks really are too big for one loop's single span of context: very long-running, needing heavy parallelism, highly complex in structure, or handling adversarial content. Before touching them, you have to know the three ways a single context collapses on a long task, because every orchestration technique below is a targeted treatment for one of these three diseases.
Laziness
A fifty-item security review claims it's all done after item thirty-five.
Self-preferential bias
Hand the model a rubric to verify what it just made, and it shields its own output.
Goal drift
As turns pile up, fidelity to the original goal bleeds away bit by bit, worse after compaction.
The essential value of orchestration is taking the plan out of the model's drifting, compactable context and putting it into a deterministic script. Laziness is treated by a script looping over the full checklist to the end; self-preferential bias by an independent agent doing the verifying; goal drift by writing the goal hard into the script so it doesn't bleed out under compaction.
Once the plan is handed to a deterministic script, the question becomes how to orchestrate — and a year has distilled six reusable, composable patterns. They all share one shape: the script (code) holds the plan and dispatches sub-agents one by one, rather than the model deciding turn by turn whether to split and to whom. Each below maps to one common orchestration shape.
Classify then act
A classifier judges which class a task is, then routes — a triage desk on the front of the pipeline.
Fan out then converge
Split into steps, one agent each; the merge is a barrier — the only one of the six that waits for all to finish.
Adversarial verification
Pair every output with an agent that attacks it against the rubric — treats self-preferential bias head-on.
Generate then filter
Diverge into a big pile of ideas, then filter hard against the rubric and dedup, keeping only the best few.
Tournament
Several agents tackle the same problem with different approaches, compared pairwise until a winner emerges. Pairwise comparison is more reliable than absolute scoring.
Loop until nothing new
When the workload is unknown, don't fix a round count — loop until the stop condition holds to sweep it clean.
Organize those patterns into a complete flow and the most established template right now is a three-role division of labor: one defines what to do, one builds it, one independently signs off. It mirrors the product / dev / test checks-and-balances of traditional software.
The scenario where this three-role split has been polished most fully is building an app from scratch with an agent, and the examples below all come from there. But the kernel that generalizes is only two things, neither tied to a domain: the three-way checks-and-balances split, and the "generation is not evaluation" the next block opens up. The browser automation, the iterative contract, the hours-and-hundreds-of-dollars — those are what it looks like in the coding scenario specifically, not the boundary of the three roles themselves. Any long task that yields a checkable result can adopt the "define, execute, independently verify" skeleton.
This division of labor costs real money; the two sets of numbers below also come from the app-building scenario:
| case | configuration | time / cost | result |
|---|---|---|---|
| digital audio workstation | full three roles | 3h 50m / ~$125 | most of it on two build rounds; verification rounds a few dollars each |
| retro game | single agent solo | 20 min / $9 | unplayable — entities don't respond to input |
| retro game | full three roles | 6 hr / ~$200 | playable |
Twenty times the cost, in exchange for going from "unusable" to "usable." That twenty-fold is buying the value of the process itself.
Verification & security: build it yourself, own the safety net
In the Planner / Generator / Evaluator split above, the part you can least afford to drop is the Evaluator — and it must be independent of the Generator. Behind this is the piece's second through-line: an agent can't evaluate itself. Have the Generator judge what it just produced and the result is almost always confident praise. This is especially severe on tasks with no clear pass/fail bar, like design: whether an interface looks good has no binary answer, and a model grading itself will score high all the way down.
The fix isn't to make the Generator humbler, but to raise a separate Evaluator and tune it strict. Tuning one independent judge strict is far easier than persuading an author to criticize itself.
For adversarial verification to land, the evaluation criteria have to be scorable. On a subjective task like design, one proven breakdown is four dimensions, two weighted high and two as baseline:
Design tasks can be judged by the Evaluator eyeballing screenshots, but the far larger body of functional verification needs something that scales. Here's a paradigm worth borrowing: turn the artifact itself into an interface an agent can natively read and verify. For a frontend specifically, the method is a DOM contract — a component emits its full state on its outermost HTML element:
Paired with it is a declaration file bound to the component (.verify.ts) that spells out four things:
window.__verify) so the agent can list every verifiable unit, read current results, and run the whole verification matrix in one call to get back a structured verdict. The key: the human dashboard, the agent API, and the CI headless check all run the same verification code path and read the same data source, so there are never three conflicting answers.Contract testing, fixed scenarios, invariants are all old ideas, and sound in theory. They never caught on because the human maintenance cost was too high: write the verification once, then keep updating it as the product moves — the return didn't justify the investment. A couple of numbers show how widespread the bind is:
Only once agents push the marginal cost of writing, maintaining, and running verification toward zero does this pattern become economically viable for the first time. It turns from a one-off investment into continuously maintained living documentation — exactly what agents are good at. One more principle worth copying into your own framework: verdicts must explicitly separate "can't tell" from "judged wrong." When in doubt, have the runner fail rather than pass, because a false pass ships a bug to production while a false fail just makes someone look twice.
Not every team can afford the full verification system above. The minimum viable verification for a rewritten agent is a two-layer net, neither layer optional:
guard known minefields
Cover the known failure modes you've already hit. On its own, it easily overfits to the few situations you imagined.
test the unseen
A batch of real, open-ended customer questions, to check it still holds up on situations it hasn't seen. On its own, it can't catch regressions in the known minefields.
Only with both layers in place is the verification loop closed.
Verification handles "did it do this right"; security handles "can it be turned to doing harm." Build an agent from scratch and no ready-made framework owns this layer for you. The most core structural principle is separating read from do:
An agent that reads untrusted public content (scraping web pages, reading user-submitted text) must never also hold the power to perform high-privilege actions. Split read and do into two sets of agents: the readers have no power to act, and the actors don't touch untrusted content directly. That way, even if a prompt injection is hidden in what gets read, it can't command the hand that acts. This is defended by structure, not by prompt wording.
Below that are a few more concrete points: harden inputs adversarially (an agent's errors are hallucinations, not slips, so the malformed-input classes listed in the tool-design section must be blocked at the entrance); if you build your own MCP server, watch for several repeatedly-proven traps — optional-by-default authentication is no defense at all, tool definitions can be swapped out mid-stream, and several official SDKs once shared the same default config carrying a DNS-rebinding risk. There has also been a widely-circulated, non-existent vulnerability ID in this space — verify it yourself before citing it.
How heavy an isolation a product needs depends on its blast radius. Three tiers from light to heavy, matched to risk:
Cut as you build: the half-life of a harness
Stack up the four things — tools, orchestration, verification, security — and it's tempting to think the harness is built once and done. The opposite is true: it has a half-life. Every component of a harness encodes an assumption about the model's capability at the time, and assumptions expire.
A widely-shared example: one generation of model would get "anxious" near its context limit and wrap up hastily, so a team added a chunk of compensating code to reset the context; once the next generation lost that habit, that compensating code went from helper to pure burden, actually slowing things down. By the same token, the earlier advice to split a task into many small sprints and reset context after each — all of it accommodates the assumption that "the model can't sustain long coherent work." The moment a model can code coherently for two hours straight, those splits should be deleted.
patches for one generation's weak spots
Sprint decomposition, context resets, and the like. How much you can delete depends on how strong the model is. Revisit at every model upgrade and proactively cut the assumption-code that's no longer necessary.
checks-and-balances and structural constraints
Without a Planner, the Generator under-scopes the requirements; without an independent Evaluator, boundary bugs slip through. Layer the context, keep tools few and sharp, separate generation from evaluation, separate read from do — these are discipline.
The best agent framework is the one you can delete. The valuable design space doesn't shrink as models get stronger — it just moves.
A good cadence: every three to six months, or after each major model release when performance feels capped, run a full harness audit and proactively delete the assumption-code that's no longer necessary.