AGENT HARNESS · 2026 · THE DESIGN-SIDE COMPANIONDWG NO. GP-HARNESS-02 · built from scratch

The decisions a framework makes for you,
building from scratch you make yourself

The framework user faces a full set of decisions Claude Code has already made for them; the framework builder faces an empty lot — how context gets compacted, how sub-agents are isolated, how tools load, every one has to be stacked up by hand, layer on layer. This piece gathers the design-side material scattered across a dozen reports into one line.

PROJECTBuilding an agent harness from scratch SHEETdesign-side companion · tools / orchestration / verification REVIEWevery load-bearing figure fact-checked DATE2026-06 · part two

◇Intro · two stances ※Through-line · caching 1Tool design 2Orchestration 3Verification & security ½Closing · cut as you build

GenAI Playbook · the companion to Configuring the Claude Code Harness · that one covers configuring a ready-made framework, this one covers building one from scratch

Intro · the framework builder vs. the framework user

One word, two stances

The framework user faces a full set of decisions Claude Code has already made for them: how context gets compacted, how sub-agents are isolated, how tools load. Their job is to dial those settings to fit their own codebase. The framework builder faces an empty lot — none of those decisions is made for them, and each one has to be stacked up by hand. This piece is about the latter.

▸ In scope: building that framework layer from scratch

How to design an MCP server, how to turn an internal API into a tool an agent can actually use well, when to run one loop instead of five agents, how to make the artifact a machine-verifiable interface. All of it design decisions you only hit when you build an agent from scratch on the API/SDK.

▸ Not repeated here: using the ready-made framework

The consumer-side prompt-caching reminders, how to wire MCP in and have it complement Skills — all that "configuring ready-made Claude Code" material is in the first piece. The two pieces barely overlap: the consumer side is over there, the design side is here.

Put all of this together and you keep running into the same three principles. They are also the skeleton of the whole piece.

More tools don't make an agent stronger — fewer do: every extra tool's description eats another slice of the model's thinking space. Generation and evaluation must be separated: ask one agent to judge what it just produced and it will almost always praise itself with confidence. The deeper a constraint is written into structure, the more reliable it is: a constraint baked into a deterministic script or an interface contract holds far better than one you hope the model remembers every turn.

The three blocks below open up these three principles from the angles of tools, orchestration, and verification.

0Sheet 00 · Foundationa constraint running through everything · not an optimization

One constraint that runs all the way through: prompt caching

Before the three main blocks, one underlying constraint has to be on the table, because it decides how every layer above it can be built. The mechanism is prefix matching: when a request goes out, the API compares it byte by byte from the very first token until it hits a cache breakpoint; if a single byte at some position differs from last time, everything from that position onward has to be recomputed. The part that hits cache is cheap and fast; the part that misses is recomputed at full price.

It is an architectural constraint rather than an optimization because it wires straight into the business model: one cache miss means the whole context is recomputed, doubling the cost; double the cost and, under the same rate limit, you can serve half as many requests; halve that and you're cutting into the real quota of tiers like Pro and Max. Internally, Anthropic treats a cache hit rate dropping below threshold as a production incident, on par with an outage.

Layered request structure: static up front, dynamic at the back. The earlier something sits, the less it should change — change it and you invalidate the cache for everyone downstream. An agent built from scratch should layer its request structure in exactly this order.

This constraint shows up again and again in the three blocks below, just in different forms. At the tool layer it becomes never add or remove tools; at the orchestration layer it becomes sub-agents run their own prefix chains and don't pollute the main conversation; and for context compaction, the right move is to disguise the compaction as a natural continuation of the parent conversation — reuse the full prefix and append a single compaction instruction at the end, rather than starting a fresh block that breaks the prefix.

Keep this constraint in mind and a lot of the seemingly awkward choices below turn out to be accommodations to it.

1Sheet 01 · Tool designfew and sharp · load on demand

Tool design: few and sharp, load on demand

The first mistake a framework builder tends to make is wrapping all several hundred internal APIs into several hundred tools, on the theory that the more capability you hand over, the stronger the agent. The reality is the exact opposite. Every tool's description costs context, and before the model decides its next move it has to read through every tool's schema. The more tools, the less room is left for actual reasoning. Tool definitions aren't free background knowledge — they compete with the task itself for the same attention.

1.1The counterintuitive starting point: more tools, dumber agent

CWC 2026 has an often-cited case that shows the magnitude of this. One team's agent eval was stuck at a 62% pass rate, and the root cause turned out to be overfeeding. What they did wasn't add — it was cut:

system prompt

400 lines

↓ cut to

15 lines

custom tools

↓ down to

3 primitives

pass rate from

62%

↑ up to

92%

Sub-agents also collapsed from 3 down to 1. What got cut wasn't capability — it was noise.

The first principle: don't cover every API, cover the few paths the agent actually walks to finish a task. Making "create a ticket from a thread" one tool beats making "read thread, parse messages, create ticket, link attachment" four tools the agent has to wire together itself. The former maps to an intent the agent would naturally have; the latter pushes the orchestration burden back onto the model.

1.2What a tool for an agent actually looks like

The core judgment here: an agent is not a well-behaved operator. A human using a tool makes typos; an agent using a tool will confidently invent a plausible-looking parameter that doesn't exist. The two failure modes differ, so the interface design goals differ too. Boil down a year of practice and you get roughly ten rules:

Merge commands by workflow, don't wrap APIs one-to-one. Each command maps to a complete action step; stringing a few APIs together inside is an implementation detail, not something the agent sees.

Structured output is the floor. Every command should be able to emit JSON, return three or four core fields by default with an escape hatch to pull everything, and pre-aggregate fields wherever possible instead of making the agent compute them.

Let tools describe themselves. Offer an entry point to query the schema so the agent can ask once and get parameter types, required flags, enum values, and examples; the first two or three lines of help text must be real, runnable examples.

Harden inputs adversarially. Path traversal, control characters, a whole query stuffed into an ID field, double URL-encoding, oversized inputs — block these at the entrance, because they're hallucinated, not fat-fingered.

Every write supports dry-run, and never an interactive y/n prompt. Express a dangerous action with an explicit confirm parameter rather than stopping to wait on stdin — because there's no human at the agent's end to press enter.

Authentication must be non-interactive. Use environment variables, credential files, service accounts — no browser OAuth popups, no "press enter to continue."

Errors must be legible. Return a structured error object with a cause, a suggestion, and a semantic exit code (transient-retryable, misconfiguration, invalid input, rate limit, timeout each get their own code) — not a bare stack trace for the agent to guess at.

Package what help text can't convey into a skill. The invariants and best practices — "under what condition should this field even be filled" — go into a load-on-demand doc, pulled into context only when needed.

Govern the returned tokens. Pagination, field masking, truncation-with-notice when over limit, pre-aggregation — all to keep one call from blowing out the context (the hard cap Claude Code sets for a single tool return is 25K tokens).

Leave a raw-JSON passthrough. When a CLI sits on top of a REST API, give a path to pass a JSON body directly — zero translation loss, and LLMs are good at generating JSON to begin with.

Each rule reads fine alone, but together they stay abstract. Drop them onto one concrete operation — "create a ticket from a thread" — and the gap between a bad tool and a good one is obvious at a glance:

# ✗ Bad tool: one-to-one API wrappers, the assembly pushed onto the agent read_thread → parse_messages → create_ticket → link_attachment // four tools, agent wires them up $ create_ticket --project OPS ? summary: ▮ // missing field → blocks on stdin, but no one's there to press enter Traceback (most recent call last): ... // error dumps a stack, agent can only guess what broke # ✓ Good tool: merged into one command by workflow, every detail aimed at the agent $ tickets create-from-thread --thread T-8842 --dry-run --format json {"would_create":{"project":"OPS","linked":3}} // ① dry-run first, nothing committed $ tickets create-from-thread --thread T-8842 --confirm --format json {"id":"OPS-1234","url":"…","status":"open"} // ② returns 3-4 core fields + an escape hatch {"error":"rate_limited","retryable":true} // ③ failures return structured errors + semantic exit codes $ tickets schema create-from-thread // ④ param types / required / enums, the agent just asks

Same operation — the good tool folds orchestration, confirmation, errors, and self-description into the interface: the agent only has to express the single intent "make a ticket from this thread," and the assembling, the dry run, the error reading, the schema lookup are all caught on the tool's side. That's what those ten rules look like once they take shape.

1.3CLI or MCP: a set of measured numbers

Once the tool interface is designed, the next question is how to hand it to the agent. Two mainstream routes: ship it as a command-line tool (CLI), or as an MCP server. This isn't a matter of taste — there's data. The AXI benchmark ran 425 runs (17 tasks, 5 configurations, repeated 5 times each):

Dimension	agent-optimized CLI	GitHub's MCP
success rate	100%	87%
cost per run	$0.050	$0.148 (~3×)
avg. time	15.7s	34.2s (~2×)
avg. round-trips	3	6

The gap is structural: every MCP tool definition itself costs two to five hundred tokens, so with many tools, just loading the definitions into context eats a big chunk; and MCP interactions take more round-trips. A CLI can also naturally pipe several commands together, which MCP can't.

choose CLI

It suits the execution layer

few commands, stateless
there's a shell in the runtime
tokens are tight
a CLI already exists

choose MCP

It suits the discovery layer

tools number fifty or more
there's no shell in the runtime at all
a stateful session must be maintained
an API-only environment with no command line

A disciplined team generates both interfaces from one core of logic.

1.4If you do build an MCP server, there are five established patterns

Pattern 01

Make it remote

Remote is the only form reachable by web, mobile, and cloud agents alike, and mainstream clients optimize for remote consumption first. Local stdio is only good for dev self-testing.

Pattern 02

Group by intent

Group tools by intent, not by API endpoint. This is the MCP version of the CLI's "merge by workflow" — the same idea, one layer over.

Pattern 03

Code-orchestrate a large API surface

Don't turn thousands of endpoints into thousands of tools; expose a thin, code-accepting tool, let the agent write scripts and the server run them in a sandbox. Cloudflare covers about 2,500 endpoints with two tools, with tool definitions costing only ~1,000 tokens.

Pattern 04

Return UI, request input

New capabilities let a tool return embedded UI (charts, forms, dashboards), and let a server mid-execution ask the user for a parameter, confirm a dangerous action, or clarify an ambiguity.

Pattern 05

Standardize auth

Use a client registration mechanism to speed up first-time auth and cut repeated authorizations, and a managed credential store to auto-inject and refresh tokens, so every server isn't building its own secret storage.

1.5"Load on demand" is one thing under several names

Look at the whole tool-design block as a piece and one through-line keeps recurring, just renamed at each layer. The essence is: don't dump everything on the model at once — let it fetch what it needs when it needs it.

Four names, one thing. Take the Two-tier branch: the Respan agent loads only 21 core tools at startup, with about 70 long-tail tools behind a search entry point fetched on demand — keeping full coverage while compressing the at-rest context. Internalize this one principle and most of the token waste at the tool layer is avoided.

2Sheet 02 · Orchestrationfirst ask whether to split · then how

Orchestration: first ask whether to split, then how

The second big decision a framework builder faces is orchestration: does one task go to a single agent that runs it end to end, or get split into a router plus a few specialist agents? The mainstream answer in 2024 was to split, and the reasoning held: models were weak then, so you had to narrow each agent's scope. But that premise no longer holds in 2026 — today's models have long context, use tools, and self-correct, and the benefit of narrowing scope no longer outweighs the cost of splitting.

2.1Default to one loop, not to splitting agents

A production rewrite put numbers on this. Respan rewrote what had been a router plus four specialist agents into a single loop, and under the same model and setup, all three dimensions improved:

tool-use efficiency

43.3

↑ after rewrite

55.8

helpful / complete

50.1

↑ after rewrite

58.1

grounded

55.6

↑ after rewrite

61.0

The three rose by roughly 29%, 16%, and 10% respectively.

⚠ One number to explain, so it isn't misread

Per-turn latency actually went from 19.8s up to 38.8s, a doubling. But this isn't a regression. The old setup was "fast" because it often stalled at "shall I continue?" and stopped moving — it looked responsive but hadn't finished the task; the rewritten version actually carried the actions through end to end, so a single turn naturally takes longer. Treating "stopping to ask" as completion was precisely the old structure's flaw.

One more caveat: this is a single company's single rewrite, sample size of one; Respan itself is an observability vendor, so it has a stake; and the scoring used a model judge, not humans. All of that should be kept in mind — don't take these numbers as a universal law.

But the decision framework it offers does hold up:

Splitting pays only when the reason is "this work is independent," not "the model isn't capable enough."

2.2Why handoff is lossy

multi-agent · every handoff

Handoff is lossy compression

When a specialist agent finishes, it usually passes only a summary to the next in line, and the full reasoning chain in its head is gone. The next one gets a flattened conclusion, not the process.

There's an ops cost too: five prompts are five separate surfaces for error, and a typo anywhere can quietly ship a regression.

single loop · full context

No such loss

A loop with full context remembers, start to finish, exactly how it walked here step by step, and there's only one place to maintain.

It also grows abilities no one taught it: on an unfamiliar request or a tool error, it goes off to read the docs, verify the schema, confirm the ID type, then carries on. That's the re-entry room the architecture gives it; a specialist agent can only return a string, with no room to re-enter.

2.3Use multi-agent with care: there's a counter-case too

Press "default to one loop" one layer further and there's evidence from an independent source: several agents thrown together aren't just more expensive, they can be less reliable. In one study Anthropic observed a contrast: an "AI organization" of multiple agents is genuinely more capable than a single agent, yet worse on alignment. Same news-recommendation task:

single agent, whole-task

20%

multi-agent, split up

88%

Numbers are the share of misinformation in each one's own output.

After splitting into multiple agents, the one writing the spec doc declared "minimize the spread of misinformation," while the one writing the ranking code treated misinformation as a positive weight and pushed it up the order, and not a single agent in between caught the contradiction. The root cause is that the doc-writer and the ranking-code-writer aren't the same agent, and no one cross-validated. What splitting adds isn't just coordination cost — it's this kind of internal contradiction no one backstops, and closing it depends precisely on the independent verification the next block covers.

2.4If you must orchestrate a long task: first know the three failure modes

Some tasks really are too big for one loop's single span of context: very long-running, needing heavy parallelism, highly complex in structure, or handling adversarial content. Before touching them, you have to know the three ways a single context collapses on a long task, because every orchestration technique below is a targeted treatment for one of these three diseases.

agentic laziness

Laziness

A fifty-item security review claims it's all done after item thirty-five.

self-preferential bias

Self-preferential bias

Hand the model a rubric to verify what it just made, and it shields its own output.

goal drift

Goal drift

As turns pile up, fidelity to the original goal bleeds away bit by bit, worse after compaction.

The essential value of orchestration is taking the plan out of the model's drifting, compactable context and putting it into a deterministic script. Laziness is treated by a script looping over the full checklist to the end; self-preferential bias by an independent agent doing the verifying; goal drift by writing the goal hard into the script so it doesn't bleed out under compaction.

2.5Six orchestration patterns you can lift directly

Once the plan is handed to a deterministic script, the question becomes how to orchestrate — and a year has distilled six reusable, composable patterns. They all share one shape: the script (code) holds the plan and dispatches sub-agents one by one, rather than the model deciding turn by turn whether to split and to whom. Each below maps to one common orchestration shape.

Classify then act

A classifier judges which class a task is, then routes — a triage desk on the front of the pipeline.

Fan out then converge

Split into steps, one agent each; the merge is a barrier — the only one of the six that waits for all to finish.

Adversarial verification

Pair every output with an agent that attacks it against the rubric — treats self-preferential bias head-on.

Generate then filter

Diverge into a big pile of ideas, then filter hard against the rubric and dedup, keeping only the best few.

Tournament

Several agents tackle the same problem with different approaches, compared pairwise until a winner emerges. Pairwise comparison is more reliable than absolute scoring.

Loop until nothing new

When the workload is unknown, don't fix a round count — loop until the stop condition holds to sweep it clean.

2.6The mature shape of orchestration: Planner / Generator / Evaluator

Organize those patterns into a complete flow and the most established template right now is a three-role division of labor: one defines what to do, one builds it, one independently signs off. It mirrors the product / dev / test checks-and-balances of traditional software.

⊹ First separate the general from the scenario-specific

The scenario where this three-role split has been polished most fully is building an app from scratch with an agent, and the examples below all come from there. But the kernel that generalizes is only two things, neither tied to a domain: the three-way checks-and-balances split, and the "generation is not evaluation" the next block opens up. The browser automation, the iterative contract, the hours-and-hundreds-of-dollars — those are what it looks like in the coding scenario specifically, not the boundary of the three roles themselves. Any long task that yields a checkable result can adopt the "define, execute, independently verify" skeleton.

Planner expands a sentence or two into a full spec, saying only which features to build and not how to implement them (in one run, under five minutes to expand into 16 features across 10 sprints). Generator negotiates a "definition of done" contract with the Evaluator before each sprint, then builds item by item. Evaluator uses browser automation to actually click, fill, and screenshot, checking against the contract line by line (it once caught a misconfigured API route that made requests fail — invisible from reading the code).

This division of labor costs real money; the two sets of numbers below also come from the app-building scenario:

case	configuration	time / cost	result
digital audio workstation	full three roles	3h 50m / ~$125	most of it on two build rounds; verification rounds a few dollars each
retro game	single agent solo	20 min / $9	unplayable — entities don't respond to input
retro game	full three roles	6 hr / ~$200	playable

Twenty times the cost, in exchange for going from "unusable" to "usable." That twenty-fold is buying the value of the process itself.

3Sheet 03 · Verification & securitybuild it yourself · own the safety net

Verification & security: build it yourself, own the safety net

3.1First principle: generation is not evaluation

In the Planner / Generator / Evaluator split above, the part you can least afford to drop is the Evaluator — and it must be independent of the Generator. Behind this is the piece's second through-line: an agent can't evaluate itself. Have the Generator judge what it just produced and the result is almost always confident praise. This is especially severe on tasks with no clear pass/fail bar, like design: whether an interface looks good has no binary answer, and a model grading itself will score high all the way down.

The fix isn't to make the Generator humbler, but to raise a separate Evaluator and tune it strict. Tuning one independent judge strict is far easier than persuading an author to criticize itself.

That's why the architecture is adversarial: the generating side pushes to make, the evaluating side pushes to poke holes — the same logic as code review, independent QA, and audit separation.

3.2Turning "good or not" into something scorable

For adversarial verification to land, the evaluation criteria have to be scorable. On a subjective task like design, one proven breakdown is four dimensions, two weighted high and two as baseline:

Originality and design quality get the high weight because the model can already write a "works-but-boring" interface by default; the bottleneck is taste and originality, not the fundamentals. Paired with it is a deduction list targeting the "obviously AI-generated" patterns: the cookie-cutter purple gradient over white cards, untouched off-the-shelf components, templated featureless layouts.

3.3agent-native verification: make the artifact verify itself

Design tasks can be judged by the Evaluator eyeballing screenshots, but the far larger body of functional verification needs something that scales. Here's a paradigm worth borrowing: turn the artifact itself into an interface an agent can natively read and verify. For a frontend specifically, the method is a DOM contract — a component emits its full state on its outermost HTML element:

<div data-verify="TodoApp" data-verify-total="3" data-verify-completed="1" data-verify-active="2" data-verify-filter="all"> // not just "who I am" but "what state I'm in right now"

Paired with it is a declaration file bound to the component (.verify.ts) that spells out four things:

The attributes' types and constraints — e.g. total must equal completed plus active.

A set of named scenarios, fixed state samples.

Invariants that must always hold.

Probes that deliberately violate a constraint, to verify the verification system itself isn't letting things slide.

Add a global interface (window.__verify) so the agent can list every verifiable unit, read current results, and run the whole verification matrix in one call to get back a structured verdict. The key: the human dashboard, the agent API, and the CI headless check all run the same verification code path and read the same data source, so there are never three conflicting answers.

3.4Why this pattern only works now

Contract testing, fixed scenarios, invariants are all old ideas, and sound in theory. They never caught on because the human maintenance cost was too high: write the verification once, then keep updating it as the product moves — the return didn't justify the investment. A couple of numbers show how widespread the bind is:

YC firms, zero unit tests

34%

coverage under 60%

67%

A third of YC startups write no unit tests at all; two-thirds of teams have test coverage under 60%.

Only once agents push the marginal cost of writing, maintaining, and running verification toward zero does this pattern become economically viable for the first time. It turns from a one-off investment into continuously maintained living documentation — exactly what agents are good at. One more principle worth copying into your own framework: verdicts must explicitly separate "can't tell" from "judged wrong." When in doubt, have the runner fail rather than pass, because a false pass ships a bug to production while a false fail just makes someone look twice.

3.5The minimum viable eval: a two-layer net

Not every team can afford the full verification system above. The minimum viable verification for a rewritten agent is a two-layer net, neither layer optional:

layer 1 · internal probes

guard known minefields

Cover the known failure modes you've already hit. On its own, it easily overfits to the few situations you imagined.

layer 2 · open questions

test the unseen

A batch of real, open-ended customer questions, to check it still holds up on situations it hasn't seen. On its own, it can't catch regressions in the known minefields.

Only with both layers in place is the verification loop closed.

3.6Security: reading and doing must be separated

Verification handles "did it do this right"; security handles "can it be turned to doing harm." Build an agent from scratch and no ready-made framework owns this layer for you. The most core structural principle is separating read from do:

🛡 quarantine · read/do separation

An agent that reads untrusted public content (scraping web pages, reading user-submitted text) must never also hold the power to perform high-privilege actions. Split read and do into two sets of agents: the readers have no power to act, and the actors don't touch untrusted content directly. That way, even if a prompt injection is hidden in what gets read, it can't command the hand that acts. This is defended by structure, not by prompt wording.

Below that are a few more concrete points: harden inputs adversarially (an agent's errors are hallucinations, not slips, so the malformed-input classes listed in the tool-design section must be blocked at the entrance); if you build your own MCP server, watch for several repeatedly-proven traps — optional-by-default authentication is no defense at all, tool definitions can be swapped out mid-stream, and several official SDKs once shared the same default config carrying a DNS-rebinding risk. There has also been a widely-circulated, non-existent vulnerability ID in this space — verify it yourself before citing it.

3.7Isolation: match three tiers to the blast radius

How heavy an isolation a product needs depends on its blast radius. Three tiers from light to heavy, matched to risk:

Two judgments repeatedly stressed: an allowlist grants "permission to call a capability," not a filter that "data may only flow to a destination" — use it as the latter and you'll get burned; what actually backstops you is the environment layer sealing off the blast radius, with model-layer judgment only a supplement; and the isolation layer you write yourself is often the weakest link — design assuming it will be bypassed.

Closing · completion is where subtraction begins

Cut as you build: the half-life of a harness

Stack up the four things — tools, orchestration, verification, security — and it's tempting to think the harness is built once and done. The opposite is true: it has a half-life. Every component of a harness encodes an assumption about the model's capability at the time, and assumptions expire.

A widely-shared example: one generation of model would get "anxious" near its context limit and wrap up hastily, so a team added a chunk of compensating code to reset the context; once the next generation lost that habit, that compensating code went from helper to pure burden, actually slowing things down. By the same token, the earlier advice to split a task into many small sprints and reset context after each — all of it accommodates the assumption that "the model can't sustain long coherent work." The moment a model can code coherently for two hours straight, those splits should be deleted.

deletable · scaffolding

patches for one generation's weak spots

Sprint decomposition, context resets, and the like. How much you can delete depends on how strong the model is. Revisit at every model upgrade and proactively cut the assumption-code that's no longer necessary.

not deletable · discipline

checks-and-balances and structural constraints

Without a Planner, the Generator under-scopes the requirements; without an independent Evaluator, boundary bugs slip through. Layer the context, keep tools few and sharp, separate generation from evaluation, separate read from do — these are discipline.

The best agent framework is the one you can delete. The valuable design space doesn't shrink as models get stronger — it just moves.

A good cadence: every three to six months, or after each major model release when performance feels capped, run a full harness audit and proactively delete the assumption-code that's no longer necessary.