CLAUDE CODE · 2026 · THE COMPLETE PLAYBOOK · DWG NO. GP-HARNESS-01 · construction manual

The model is the fulcrum.
The harness is the lever.

Claude Code ships the agent loop for you; the ring around that loop you build yourself. With the same Claude, how well you build that ring decides whether it edits a few lines or carries a 700K-line legacy system.

Project: Configuring the Claude Code harness · Sheet: General notes · floors 0–4 · Checked: load-bearing numbers fact-checked · Date: 2026-06 · rev v3
[Figure: a lever. The model is the fixed fulcrum; the harness is the lever you configure. The lever arm spans floors 0–4 and gets longer at every floor, raising the output ceiling.]
GenAI Playbook · A consolidation of a dozen-plus Claude Code field notes · Read this one, skip the rest
Why build the harness first

Same model; the environment sets its ceiling

Onboard Claude like a new developer and stalled work starts moving again; yet the model's own capability is the ceiling. Put both together: the harness is the adjustable lever, the model the fixed fulcrum. You can't change the fulcrum — the room to act is the lever.

Two weeks
UW's Skyline: a feature shelved a year shipped in two weeks; a nightly-test module idle three years restarted in under a day
11 days
Bun ported Zig→Rust with the Workflow tool, ~750K lines, 99.8% test pass rate
43→84%
Swap only the model under one harness: compaction quality rises from Sonnet 4.5 to Opus 4.6

A discipline running through the whole piece: a harness decays. When the model improves, yesterday's scaffolding can become dead weight. So for every layer below, weigh when to retire it — covered in the closing.

Scope, stated up front

This piece is only about configuring Claude Code

Claude Code gives you the main loop Anthropic already built; you don't write the agent loop. This article covers one thing: how to configure that ready-made framework. Three categories shouldn't be conflated: ① what the model gives you (can't change it, Layer 0), ② what you configure (this article), ③ what you build from scratch (out of scope here).

▸ ② Configuring Claude Code (the subject)

Model selection, context management, the build order for knowledge and tools, orchestration with subagents and Workflows, pulling verification into its own role. All of it is what you configure, not build.

Plus ① model capability: a given you can't change, only route around. It is the bedrock (Layer 0).

▸ Out of scope: building an agent from scratch

Prompt-caching design rules, implementing an MCP server, an agent-facing CLI, the single-loop vs. multi-agent trade-off, turning your product into a verifiable interface. These only come up when building an agent from scratch with the API/SDK.

Where such material appears it's noted briefly and tagged "out of scope," not expanded.

The full blueprint

Five floors, built bottom-up

  • 0 · Foundation: model capability (given: accept it, you can't change it)
  • 1 · Context: the one resource you truly manage (5 branches)
  • 2 · Knowledge & tools: the 7-layer build order (CLAUDE.md → MCP)
  • 3 · Orchestration: who holds the plan (subagent / Workflow)
  • 4 · Verification: generation ≠ evaluation (independent evaluator)
  • Case · onboarding 700K lines
  • Closing · build and prune

Each floor rests on the one below: accept the foundation, then manage context, then add knowledge and tools, and only then talk orchestration and verification.
0 · Sheet 00 · Foundation · ① Model capability · given, not built

Layer 0: the bedrock you can't route around

This layer isn't built; it's given. But understand it first — it decides how lean the harness can be.

0.1 The context window and context rot

The window is 1M tokens (system prompt + full conversation + all tool output + files read), but a large window isn't license to fill it.

[Figure: window capacity from 0 to 1M tokens. As the window fills, the share of noise (old, irrelevant content) rises and usable attention (signal) falls.]
An intrinsic property of attention, not a bug: longer context, lower signal-to-noise. 1M isn't for cramming everything in — it's for completing one task more reliably.

Practical tip: put a context progress bar in your status line. No more hammering /context and breaking flow — how much window is left is visible at a glance:

claude — acme-monorepo
› refactor the payments service to use the new ledger client
Read services/payments/ledger.ts (+2.1K tokens)
Edit services/payments/charge.ts
~/acme-monorepo · ⎇ main · opus-4.8 · xhigh · context 720K / 1M · 72%
That last line is what the plugin adds to your status bar: window usage, always visible. Past ~70–80% it warms from green to amber to red, nudging you to /compact or spin up a subagent.
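The gauge itself is simple arithmetic. Here is a minimal sketch of how such a line could be rendered, assuming the thresholds are 70% and 80% (read off the "~70–80%" range above). The function name and output format are hypothetical, and how the plugin obtains the token count is not covered here.

```python
def context_bar(used_tokens: int, window_tokens: int = 1_000_000, width: int = 20) -> str:
    """Render a one-line context gauge from used and total tokens."""
    fraction = used_tokens / window_tokens
    pct = round(100 * fraction)
    # Assumed thresholds: below 70% green, 70-79% amber, 80% and up red.
    if pct < 70:
        level = "green"
    elif pct < 80:
        level = "amber"
    else:
        level = "red"
    filled = min(width, round(width * fraction))  # clamp so the bar never overflows
    bar = "#" * filled + "-" * (width - filled)
    return f"context {used_tokens // 1000}K / {window_tokens // 1_000_000}M · {pct}% [{bar}] {level}"


print(context_bar(720_000))
```

With the numbers from the mock terminal (720K of 1M), the sketch lands in the amber band, which is the point at which the text suggests compacting or delegating.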
0.2 Choosing the right model is itself harness design

The strongest coding model today is Opus 4.8, but more important than the version are a few principles that don't expire on upgrade.

A · Effort defaults to high; set it to xhigh for coding. Opus 4.8 defaults to high on every surface (including Claude Code), balancing quality and experience. But Anthropic explicitly recommends xhigh (between high and max) for coding, agentic, and long-running async work. Reserve max for genuinely hard problems; it overthinks and burns tokens, with diminishing returns. (4.7's Claude Code default was xhigh; 4.8 moved it back to high, which is exactly point C below.)
B · The thinking budget is adaptive. To push: "think carefully, step by step; this is harder than it looks." To speed up: "respond quickly, answer directly when uncertain."
C · Every generation changes its defaults. Prompts tuned for an older model need re-checking: response length, reasoning vs. tool calls, and default subagent count all shift. Don't assume "strictly better"; every generation makes trade-offs. The table below shows the three you'll trip over most.
D · Delegate to Claude as an engineer, not a pair-programming partner. Each turn adds overhead, so authorize fully: in turn one, state intent, constraints, acceptance criteria, and file locations; batch the work so it keeps moving.
What changed | Specifics | What to do
Default effort changed | Claude Code defaulted to xhigh on 4.7; 4.8 moved it back to high | Set it back to xhigh for coding
Long-context retrieval regressed (the 4.7 gen) | Multi-needle MRCR: 256k 91.9%→59.2%, 1M 78.3%→32.2%; heavy web research fell too, BrowseComp ≈ −4.4pp | RAG / long-doc / web-research: A/B before upgrading, don't trust the version number
Default language preference shifted | 4.8 leans toward English; default Chinese output is weaker | Non-English teams: ask for the target language explicitly in CLAUDE.md
0.3 The more reliable the model, the thinner the verification harness

This generation's hallucination dropped sharply, which directly changes how much verification costs. The same fact-check process audited 11 reports (independent subagent compares each claim to the source in a clean context, fixes, hands to a fresh subagent until zero issues). The day we switched to Opus 4.8 was a clear watershed:

Metric | Before 4.8 | After 4.8
Avg audit rounds | 3.2 | 1.2
First-pass clean | 0/6 | 4/5
Even the most number-dense report (832 accounts / 13,873 observations / 33%→56% / r=0.28) passed first try, zero fixes. Hallucination rate directly sets the rework count.

But don't misread it as "no verification after 4.8." What thins is the rework, not verification itself. Generation isn't evaluation; an independent agent signing off rather than the executor self-grading is a separation of duties that can't be removed (see Layer 4).

0.4 Caching decides your quota: rules to follow

Prompt caching is an "invisible but billable" layer: byte-exact prefix match — change one byte anywhere and every cache after it dies. A miss recomputes every turn at full price, doubling cost and halving your plan's rate limit — so hit rate directly sets the quota a Pro/Max plan gives you. Just work with it:

  • Don't switch models mid-conversation. Dropping a 100K-token Opus session to Haiku can cost more than staying on Opus, because the whole cache is rebuilt. Stay on the model, or spawn a subagent (own prefix chain, no pollution).
  • /compact while the window has room. Compaction reuses the parent prefix to avoid breaking the cache: work with the system, not against it. The reasoning is in Layer 1.
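To see why one changed byte is so costly, here is a toy model of byte-exact prefix matching. It is only an illustration of the rule stated above, not the real cache: requests are modeled as lists of text blocks, and both function names are hypothetical.

```python
from typing import List


def cached_prefix_blocks(previous: List[str], current: List[str]) -> int:
    """Count the leading blocks that are byte-identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old.encode("utf-8") != new.encode("utf-8"):
            break  # the first mismatch invalidates everything after it
        count += 1
    return count


def recomputed_chars(previous: List[str], current: List[str]) -> int:
    """Characters that fall outside the reusable prefix and must be reprocessed."""
    hit = cached_prefix_blocks(previous, current)
    return sum(len(block) for block in current[hit:])


history = ["system prompt", "tool definitions", "turn 1", "turn 2"]
appended = history + ["turn 3"]                     # normal next turn
edited = ["system prompt!"] + history[1:] + ["turn 3"]  # one byte changed up front

print(recomputed_chars(history, appended))  # only the new turn
print(recomputed_chars(history, edited))    # the whole request
```

Appending a turn reuses the entire earlier prefix, while editing the first block forces everything to be recomputed, which is the behavior the two rules above are designed to avoid.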

Caching also has 5 design rules for agents you build from scratch (no mid-stream tool changes, defer_loading, no model switching, update via messages not the system prompt, disguise compaction as a parent continuation) — out of scope here, not covered.

1 · Sheet 01 · Context · ② Configuring Claude Code

Context: the one resource you actually manage

The loop is given; you manage what goes in and when to clear it. Almost every difference in outcome traces here. After each step, you stand at a branch point — 5 choices, 4 of them about managing context quality:

After a step is done:

  • Continue: keep the conversation, context unchanged
  • /rewind: roll back; keep reads, drop the failed try
  • /clear: fresh session, carry your own notes
  • /compact: summarize, then continue
  • Subagent: fresh window, returns conclusions only

"Continue" is the default; the other four are the levers for managing context.
1.2 / 1.3 Compaction, and the auto-compact trap

Compaction is lossy: near the limit the whole conversation becomes a summary, and the model continues from it in a fresh window. The danger is when automatic compaction fires — exactly at the model's weakest moment:

[Figure: model reliability declines from session start as context fills (context rot). ✓ A proactive /compact happens early, while the model is in good shape and you know the next step. ✗ Auto-compact fires at the far end: the weakest moment, writing the most critical summary.]
Typical failure: after a long debug, auto-compact summarizes the debugging; you say "fix that warning in bar.ts," but its details were dropped. Fix: use the 1M headroom and compact early, while in good shape.
1.4 Rewind beats correction

Scenario: 5 files read, approach A failed, switch to B.

Verbal correction

Failure and fix both linger

"That doesn't work, try B" — the failed approach plus your correction stay in context, adding noise and clouding judgment.

/rewind (double-tap Esc)

Back to the fork, switch cleanly

Roll back to "5 files read, A not tried," then say "don't use A — foo doesn't expose that interface, go to B." Keep the reads, drop the failed try.

Companion "Summarize from here": Claude condenses what it learned into a handoff note — a letter from the future Claude to its past self ("I tried A, it didn't work, because…") — so a later session doesn't repeat it.

1.5 / 1.8 How to choose: two cheat sheets

/compact and /clear both clear context; the difference is who decides what to keep:

Dimension | /compact | /clear
Effort | Low: Claude summarizes | High: you distill
Precision | Claude decides what to keep | You decide what to carry
Context rot | Partly mitigated | Fully eliminated
Best for | Clearing clutter mid-task | Starting a fresh task

Scenario | Use | Why
Same task, context still relevant | Continue | It's all still working for you
Claude went down the wrong path | Rewind | Keep reads, drop failed try
Session clogged with stale debug | /compact + steer | Low effort, you set direction
Starting a fresh task | /clear | Zero rot, full control
Next step: lots of conclusion-only output | Subagent | Noise stays in the child

New task = new session. For related tasks: highly relevant → continue (skip re-reads); otherwise /clear. Subagent mental test — "do I need the tool output itself, or just the conclusion?" Conclusion only → dispatch. More in Layer 3.
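The scenario cheat sheet can be read as a small decision function. The sketch below encodes its five rows. The flag names are hypothetical, and the order in which the checks run when several flags are set is an assumption of this sketch, not something the cheat sheet specifies.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    REWIND = "/rewind"
    COMPACT = "/compact + steer"
    CLEAR = "/clear"
    SUBAGENT = "subagent"


def next_action(*, new_task: bool = False, wrong_path: bool = False,
                stale_debug: bool = False, conclusion_only_next: bool = False) -> Action:
    """Map the cheat-sheet scenarios to an action (check order is an assumption)."""
    if new_task:
        return Action.CLEAR        # fresh task: zero rot, full control
    if wrong_path:
        return Action.REWIND       # keep the reads, drop the failed try
    if conclusion_only_next:
        return Action.SUBAGENT     # noise stays in the child
    if stale_debug:
        return Action.COMPACT      # low effort, you set the direction
    return Action.CONTINUE         # context is still working for you


print(next_action(wrong_path=True).value)
```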

2 · Sheet 02 · Knowledge & tools · ② Configuring Claude Code · core framework

Knowledge & tools: a 7-layer build order

The core framework. An often-missed premise: the harness matters as much as, or more than, the model. First the underlying mechanism — Claude finds code via agentic search, not RAG:

Dimension | RAG tools | Claude Code · agentic search
Index upkeep | Needs an embedding pipeline | None
Freshness | Lags (hours/days/weeks) | Real time on the live codebase
Failure mode | Returns deleted/renamed funcs, no signal | No such problem
Scaling | Can't keep up with thousands of commits | Each instance runs independently

The cost: it needs enough starting context to know "where to begin," and navigation quality tracks setup quality. Conclusion — teams that invest in codebase setup get better results.
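To make the freshness row concrete, here is a minimal index-free search over live files. It only illustrates the idea of searching the working tree at query time; it is not how Claude Code implements agentic search, and the function name and suffix list are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple


def search_live(root: Path, pattern: str,
                suffixes: Tuple[str, ...] = (".py", ".ts", ".cs")) -> List[Tuple[str, int, str]]:
    """Scan files under root at call time; nothing is indexed or cached."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Because every call reads the files as they are now, a function that was renamed a minute ago simply stops matching; there is no stale index to return it. The trade-off is the one the text names: the search needs a sensible starting directory and pattern.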

2.2 Build order: seven layers, bottom-up

  • L1 · CLAUDE.md (auto-read)
  • L2 · Hooks (self-evolve)
  • L3 · Skills (focus here)
  • L4 · Plugins (package & ship)
  • L5 · LSP (add-on)
  • L6 · MCP (external systems)
  • L7 · Subagents (add-on)

Each layer rests on the one below; don't skip. Solid = five extension points; dashed = LSP / Subagents, two add-on capabilities. Also: initialize from a subdirectory, not the repo root (Claude walks up the tree, getting more focused context).
Layer | Most common misuse
L1 CLAUDE.md | Stuffed with reusable expertise (belongs in a skill)
L2 Hooks | Done via prompt instead of auto-running
L3 Skills | All crammed into CLAUDE.md
L4 Plugins | Good config left in one person's hands
L5 LSP | Assumed automatic (needs a plugin + language server)
L6 MCP | Built before the basics
L7 Subagents | Exploring and editing in one session
2.3 Making the codebase navigable: six practices
1 · Keep CLAUDE.md lean and layered: root global, subdir local, additive; root holds only pointers and key gotchas.
2 · Initialize from a subdirectory, not the repo root: Claude walks up, loading every CLAUDE.md en route, so root context isn't lost.
3 · Scope test/lint by subdirectory: otherwise Claude changes one service but runs the whole suite, timing out and wasting context.
4 · .ignore generated/build/third-party, and commit permissions.deny to version control for the whole team.
5 · Build a codebase map when structure is unclear: a lightweight markdown file at root, one line per top-level dir.
6 · Run LSP to search by symbol, not string: filtering happens before Claude reads, not by opening hits one by one.
acme-monorepo/
├── CLAUDE.md                # root: pointers + gotchas only
├── .claude/
│   └── settings.json        # permissions.deny, checked in
├── apps/web/
│   ├── CLAUDE.md            # per-subdir test / lint
│   └── src/
├── services/payments/
│   └── CLAUDE.md            # init here, not at root
├── codebase-map.md          # directory-level index
└── (LSP: clangd)            # search by symbol, not string
The six practices aren't about "making Claude smarter" — they make the codebase easier to explore. Claude's ceiling in a big repo = its ability to find the right context.
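Practice 5, the codebase map, is easy to bootstrap. Here is a minimal sketch that emits one line per top-level directory. The function name, the `notes` argument, and the "TODO" placeholder are hypothetical; the descriptions themselves still have to come from someone who knows the repo.

```python
from pathlib import Path
from typing import Dict, Optional


def build_codebase_map(root: Path, notes: Optional[Dict[str, str]] = None) -> str:
    """Return markdown text with one line per top-level directory under root."""
    notes = notes or {}
    lines = ["# Codebase map", ""]
    for entry in sorted(root.iterdir()):
        # Skip files and hidden directories such as .git or .claude.
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        description = notes.get(entry.name, "TODO: describe")
        lines.append(f"- {entry.name}/: {description}")
    return "\n".join(lines) + "\n"
```

Writing the result to `codebase-map.md` at the root gives Claude a cheap first stop before it starts searching, which is the "where to begin" context the section says agentic search depends on.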
2.4 Skills in depth: a folder, not one .md

Many people think a skill is just a SKILL.md. It's actually a folder — SKILL.md is only the entry point, with scripts, references, templates, even its own hooks alongside it. That's exactly what makes it stronger than a prompt.

my-skill/
├── SKILL.md        # entry: description = trigger
├── scripts/        # reusable scripts to compose, not rebuild
├── references/     # progressive disclosure, read on demand (api.md…)
├── assets/         # templates, sample files
├── hooks/          # intercept as needed, e.g. /careful blocks rm -rf
└── config.json     # user config; if unset, AskUserQuestion
The folder structure is the skill itself: SKILL.md gives the method, scripts/ the hands, references/ load on demand to save context.
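A small scaffold makes the folder-first shape hard to skip. The sketch below creates the entry file and three of the subfolders from the tree above. It assumes SKILL.md opens with `name` and `description` front matter; `hooks/` and `config.json` are left out because their formats aren't given here, and the function name and section headings are hypothetical.

```python
from pathlib import Path


def scaffold_skill(parent: Path, name: str, description: str) -> Path:
    """Create a skeleton skill folder and return its path."""
    skill = parent / name
    for sub in ("scripts", "references", "assets"):
        (skill / sub).mkdir(parents=True, exist_ok=True)
    # Assumed front matter: the description doubles as the trigger.
    front_matter = f"---\nname: {name}\ndescription: {description}\n---\n"
    body = (
        "\n## Gotchas\n\n- (log failure points here and keep updating)\n"
        "\n## References\n\n- Detail lives in references/; read it only when needed.\n"
    )
    (skill / "SKILL.md").write_text(front_matter + body, encoding="utf-8")
    return skill
```

Starting every skill from a skeleton like this keeps the Gotchas section and the references/ split in place from day one, so the skill can grow as edge cases turn up.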

Anthropic groups the common uses into nine kinds, from looking up docs all the way to ops:

Library / API ref
Product verification
Data analysis
Process automation
Scaffolding
Code review
CI / CD
Runbooks
Infra ops

Nine best practices from Anthropic's hundreds of internal skills:

Authoring · first five

Getting a skill right

  • No filler: focus on what pushes Claude off its defaults (e.g. "avoid Inter and purple gradients")
  • Build a Gotchas section: the most valuable content — log failure points, keep updating
  • Progressive disclosure via the file system: split detail into references/api.md
  • Don't over-constrain
  • Description is the trigger, not a summary
Operations · last four

Keeping it durable

  • Store scripts for Claude to compose, not rebuilt boilerplate
  • Hooks on demand: /careful intercepts rm -rf, DROP TABLE
  • Persist in CLAUDE_PLUGIN_DATA (skill dirs are wiped on upgrade)
  • User config in config.json; if unset, ask via AskUserQuestion

Sharing: small teams check into .claude/skills; for scale, a plugin marketplace (with vetting). A PreToolUse hook logging invocations shows what's popular. One lesson: good skills start as a few lines and improve as you hit edge cases.
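The invocation log mentioned above could be as small as this. The sketch assumes the hook receives a JSON payload on stdin containing `tool_name` and `session_id` fields; treat both field names, the function names, and the JSONL log format as assumptions to check against the hooks documentation.

```python
import json
import time
from pathlib import Path
from typing import Dict


def log_invocation(payload: dict, log_path: Path) -> dict:
    """Append one JSON line per tool invocation and return the record."""
    record = {
        "ts": time.time(),
        "tool": payload.get("tool_name", "unknown"),      # assumed field name
        "session": payload.get("session_id", "unknown"),  # assumed field name
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def tally(log_path: Path) -> Dict[str, int]:
    """Count invocations per tool to see what is actually used."""
    counts: Dict[str, int] = {}
    for line in log_path.read_text(encoding="utf-8").splitlines():
        tool = json.loads(line)["tool"]
        counts[tool] = counts.get(tool, 0) + 1
    return counts

# A real hook script would call: log_invocation(json.load(sys.stdin), Path(...))
# and then exit with status 0 so the tool call proceeds.
```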

2.5 MCP: how to use it on the consumer side

MCP brings external data and systems into the session. This piece covers plugging it in, not designing a server (out of scope here). Two cost levers — both client capabilities:

↓85%
Tool Search: with many tools, search on demand instead of preloading — tool-definition tokens drop ~85%, accuracy unchanged
↓37%
Programmatic Tool Calling: calls loop/filter/aggregate in a code sandbox, only the final output enters context — ~37% less overhead
10 + 8
One real analytics plugin bundles 10 skills + 8 MCP servers — one install, generalist → domain expert
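The second lever is easiest to see with a toy example: do the loop, filter, and aggregation in code, and let only the result reach the model. In the sketch below `fetch_orders` is a hypothetical stand-in for a tool call and the data is synthetic; the sizes it prints illustrate the principle and are not a reproduction of the ~37% figure.

```python
import json
from typing import Dict, List


def fetch_orders() -> List[dict]:
    """Hypothetical tool: returns 1,000 synthetic order records."""
    return [{"id": i, "region": "EU" if i % 2 else "US", "total": 10 * i}
            for i in range(1, 1001)]


def revenue_by_region(orders: List[dict]) -> Dict[str, int]:
    """Aggregate in code so the raw rows never need to enter the context."""
    totals: Dict[str, int] = {}
    for order in orders:
        totals[order["region"]] = totals.get(order["region"], 0) + order["total"]
    return totals


raw = fetch_orders()
summary = revenue_by_region(raw)
raw_size = len(json.dumps(raw))          # what preloading every row would cost
summary_size = len(json.dumps(summary))  # what actually enters the context
print(summary, raw_size, summary_size)
```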

MCP and Skills are complementary: MCP gives tools and data, Skills give operating knowledge, and the strongest agents use both. (The five MCP-server design patterns are out of scope here.)

2.6 The throughline: less is more

Back in late 2024, Claude 3.5 Sonnet scored 49% on SWE-bench Verified with just two general tools, Bash and Text Editor — the SOTA then. Claude Code's foundation is still those two; Skills and Memory are combinations of them.

Tool schemas and overlong instructions crowd out reasoning space. Don't rush an elaborate toolchain — first check whether general tools do the job.
3 · Sheet 03 · Orchestration · ② Configuring Claude Code

Orchestration: who should hold the plan

The question becomes: does Claude carry the whole plan turn by turn, or is it held elsewhere? Principle: don't split small tasks; don't make the model hold an entire large plan in context alone.

3.1 Subagents: three traits, built-in types, five scenarios
3 traits / 3 built-in types

What it is

  • Three traits: fresh start (no history), parallel, permission isolation (research read-only / impl editable)
  • General-purpose: full tools, complex multi-step
  • Plan Agent: read-only, research + strategy
  • Explore Agent: read-only, fast search
5 scenarios

When to dispatch

  • Research-heavy (dozens of files)
  • Multiple independent tasks (parallel)
  • A fresh perspective (no inherited bias)
  • Pre-commit verification (independent check)
  • Pipeline workflow (clear stages/handoffs)
10+ files
More files to explore than this → dispatch
3+ subtasks
More independent subtasks than this → dispatch
3≈1
Three parallel subagents take about as long as one
3.2 / 3.3 Five ways to invoke; four patterns, five anti-patterns

The maturity path in one line: conversation first, automation later. Start in natural language, watch what recurs, then harden it:

step 1
Conversational
Natural language; Ctrl+B to background
step 2
Custom subagent
.claude/agents/, description = a trigger
step 3
CLAUDE.md rules
Harden into standing conventions
step 4
Skills workflow
Reusable playbooks
step 5
Hooks automation
Most automated — e.g. a Stop hook blocking finish while tests fail
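Step 5's example, a Stop hook that refuses to finish while tests fail, might look like the sketch below. It assumes the hook reads a JSON payload on stdin, that a `stop_hook_active` field signals an earlier block, and that printing `{"decision": "block", "reason": ...}` keeps Claude working. Verify all three against the hooks reference before relying on them; the function names are hypothetical.

```python
import json
import subprocess
from typing import List


def stop_decision(tests_passed: bool, already_blocking: bool, detail: str = "") -> dict:
    """Return an empty dict to allow stopping, or a block decision with a reason."""
    if tests_passed or already_blocking:
        return {}  # let Claude stop; also avoids blocking in an endless loop
    reason = f"Tests are failing; fix them before finishing. {detail}".strip()
    return {"decision": "block", "reason": reason}


def run_stop_hook(stdin_text: str, test_command: List[str]) -> str:
    """Run the test command and return the JSON a hook script would print."""
    payload = json.loads(stdin_text or "{}")
    result = subprocess.run(test_command, capture_output=True, text=True)
    decision = stop_decision(
        tests_passed=result.returncode == 0,
        already_blocking=bool(payload.get("stop_hook_active")),  # assumed field name
        detail=result.stdout[-500:],  # tail of the test output as context
    )
    return json.dumps(decision)

# A real hook script would do: print(run_stop_hook(sys.stdin.read(), ["pytest", "-q"]))
```

Scope the test command to the subdirectory being changed, for the same reason practice 3 in the navigation list gives: a whole-repo suite on every stop would time out and waste context.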
4 patterns

How it should be done

  • Research then implement (start from a summary, not 20 raw files)
  • Parallel batch edits (one pattern, independent files)
  • Independent code review (a fresh view is more objective)
  • Pipeline (design/implement/test, handoff via files)
5 anti-patterns

Don't do this

  • Sequentially dependent: one session in order is cleaner
  • Same-file edits: parallel edits collide
  • Tiny tasks: delegation has overhead, just do it
  • Too many expert agents: lowers auto-delegation reliability
  • Needing inter-agent coordination: they can't talk — use Agent Teams
3.4 Leveling up: Workflows move the plan into a script

What separates subagents, skills, and Workflows isn't scale; it's one question: who holds the PLAN token. With a subagent or a skill, Claude holds it turn by turn and intermediates land in context; with a Workflow, the script holds it and Claude sees only the final answer.

  • Subagent (plan held by Claude, per turn): the plan is in Claude's head, and every intermediate lands in context
  • Skill (Claude follows it): instructions are advice; the how is still decided per turn
  • Workflow (plan held by the script): loops and branches live in the script, intermediates stay out, and Claude sees the final answer only
Dimension | Subagents | Skills | Workflows
Who decides next | Claude, per turn | Claude, per prompt | The script
Intermediates live | Claude's context | Claude's context | Script variables
Reproducible | Worker definition | Instructions | The orchestration
Scale | A few per turn | Same as subagent | Tens to hundreds/run
On interruption | Restart the turn | Restart the turn | Resumable in-session

A Workflow isn't a replacement for a Skill; they're orthogonal: a Skill changes "what the model knows," a Workflow changes "how things are deterministically orchestrated." An agent fanned out in a workflow can load a skill first.
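The "who holds the plan" distinction can be sketched in a few lines. This is not the Workflow tool's actual API: `run_agent` is a hypothetical callable standing in for dispatching one agent, and the point is only where the plan and the intermediates live.

```python
from typing import Callable, List


def run_workflow(files: List[str], run_agent: Callable[[str], str]) -> str:
    """The script holds the plan; only the final summary is returned."""
    plan = [f"review {name}" for name in files]  # the plan lives in code
    findings: List[str] = []                     # intermediates live in script variables
    for step in plan:                            # the loop, not the model, decides what is next
        findings.append(run_agent(step))
    issues = [finding for finding in findings if finding != "ok"]
    return f"{len(plan)} steps run, {len(issues)} issue(s): " + "; ".join(issues)


def fake_agent(step: str) -> str:
    """Stand-in agent: flags one file, passes the rest."""
    return "b.py: missing null check" if "b.py" in step else "ok"


print(run_workflow(["a.py", "b.py", "c.py"], fake_agent))
```

The caller sees one summary line, however many steps ran; the per-file findings never enter its context, which is the property the comparison table attributes to Workflows.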

3.5 / 3.6 What moving the plan into code unlocks, and hard limits
1 · Intermediates stay out of context → no rot → runs for hours or days. Coordination happens outside the conversation, so the plan stays on track no matter the size.
2 · Quality routines enforced as code, not left to the model to remember: "code forces every conclusion through N independent skeptics."
3 · The whole orchestration can be saved and re-run: pick a run in /workflows, press s to save it as a command.
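Point 2 can be made concrete with a gate that a conclusion must pass before it is accepted. In this sketch each skeptic is a hypothetical callable that returns True when it finds no objection, and requiring unanimity is an assumption; a real workflow would back each callable with an independent agent in a clean context.

```python
from typing import Callable, Sequence

Skeptic = Callable[[str], bool]


def survives_skeptics(conclusion: str, skeptics: Sequence[Skeptic], required: int) -> bool:
    """Accept a conclusion only if `required` independent skeptics all pass it."""
    if len(skeptics) < required:
        raise ValueError("not enough independent skeptics configured")
    # A list, not a generator: every skeptic runs even after one has objected.
    verdicts = [skeptic(conclusion) for skeptic in skeptics[:required]]
    return all(verdicts)


def has_number(text: str) -> bool:
    return any(ch.isdigit() for ch in text)


def is_short(text: str) -> bool:
    return len(text) < 80


print(survives_skeptics("Pass rate is 99.8%", [has_number, is_short], required=2))
```

Because the check is code, it cannot be skipped or softened on turn forty the way an instruction in a prompt can.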
16
Max concurrent agents (fewer on low cores)
1,000
Max agents per run
0
User input you can inject mid-run
1
Resume only within the same session

Convergence is the exit condition, not a fixed round count. Triggers: workflow in a prompt / /effort ultracode / a saved command; a built-in /deep-research. On by default for Max/Team/API, manual via /config for Pro, off for Enterprise. Token use far exceeds a normal session — try a small task first; before a big run, check /model and route lighter stages to a smaller model.
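"Convergence as the exit condition" reads naturally as a loop that ends when a check finds nothing left to fix. Here is a minimal sketch with hypothetical `find_issues` and `fix` callables; the `max_rounds` cap is a safety assumption added for this sketch, not part of the rule above.

```python
from typing import Callable, List, Tuple


def audit_until_clean(report: str,
                      find_issues: Callable[[str], List[str]],
                      fix: Callable[[str, List[str]], str],
                      max_rounds: int = 10) -> Tuple[str, int]:
    """Repeat check-and-fix until a check finds zero issues; return (report, rounds)."""
    for round_number in range(1, max_rounds + 1):
        issues = find_issues(report)   # ideally a fresh, independent reviewer each round
        if not issues:
            return report, round_number  # converged: exit on zero issues
        report = fix(report, issues)
    raise RuntimeError(f"did not converge within {max_rounds} rounds")


def find_wrong_r(report: str) -> List[str]:
    return ["r should be 0.28"] if "r=0.82" in report else []


def fix_r(report: str, issues: List[str]) -> str:
    return report.replace("r=0.82", "r=0.28")


print(audit_until_clean("832 accounts, r=0.82", find_wrong_r, fix_r))
```

A report that is clean on the first check returns after one round, which is what "first-pass clean" means in the audit numbers in section 0.3.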

⚠ 3.7 · The Bun case's preconditions, not to skip

A Workflow ported Bun from Zig to Rust: ~750K lines, 11 days, 99.8% test pass rate, via four chained workflows (① map Rust lifetimes ② translate per file, hundreds of agents in parallel, two reviewers each ③ drive build/tests to green ④ optimize overnight, a PR per change for human review). But the preconditions are extreme; not "any migration in 11 days":

  • strangler-fig incremental replacement, not a rewrite — Zig and Rust linked into one binary, switched class by class
  • every switch passed tests + shadow-diff + a ≤2% performance gate — verification-driven, not "done means trusted"
  • very high existing test coverage (99.8% presupposes a strong suite), single-author, not yet in production (99.8% isn't 100%)

Correct reading: on a high-value module with strong tests and gates, a deterministic script + build/test fix loop compresses quarter-scale work to days. Pilot on such a module; don't copy the timeline.

3.8 Know the three failure modes first

The default harness packs planning and execution into one context; on long-running, massively parallel, or adversarial work it breaks — three ways, which is exactly why the plan should move into a script:

  • Laziness (agentic laziness): a 50-item review claims "done" at item 35.
  • Self-preference (self-preferential bias): it shields work it just produced when judging it.
  • Goal drift: fidelity leaks over turns; compaction drops "don't X."

Workflows fix all three structurally: laziness — the script holds the full checklist, not done until the loop ends; self-preference — verification by another independent agent; goal drift — the goal lives in the script, immune to compaction.

3.9 Six composable patterns

Comparative judgment beats absolute scoring, especially for ranking.
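One way to rank by comparison rather than by score is a round-robin of pairwise judgments. In this sketch `prefer` is a hypothetical judge that returns the better of two candidates (in practice, an agent shown both side by side), and counting wins is one simple aggregation among several possible.

```python
from itertools import combinations
from typing import Callable, Dict, List


def rank_pairwise(items: List[str], prefer: Callable[[str, str], str]) -> List[str]:
    """Rank unique items by how many head-to-head comparisons each one wins."""
    wins: Dict[str, int] = {item: 0 for item in items}
    for a, b in combinations(items, 2):  # every pair is judged exactly once
        wins[prefer(a, b)] += 1
    # sorted() is stable, so ties keep their original order.
    return sorted(items, key=lambda item: wins[item], reverse=True)


# Hidden quality used only by the stand-in judge below.
quality = {"draft-a": 2, "draft-b": 3, "draft-c": 1}


def judge(a: str, b: str) -> str:
    return a if quality[a] >= quality[b] else b


print(rank_pairwise(["draft-a", "draft-b", "draft-c"], judge))
```

The judge never has to put a number on any draft; it only answers "which of these two is better," which is the easier and more consistent question.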

3.10 / 3.11 Use cases: a good share isn't writing code

Counterintuitively, it's often more useful for non-technical work. Of the nine use cases named officially, over half aren't about writing code:

  • Migration / refactoring: one agent per change in its own worktree, another reviews adversarially, then merge
  • Deep research: not just the web; also assembling a status report from Slack
  • Deep fact-checking: one agent extracts every claim; a subagent per claim verifies and cites
  • Ranking: tournament / pairwise pipeline
  • Memory & rule-following
  • Root-cause investigation
  • Triage at scale: brings out the quarantine pattern below
  • Taste-driven exploration
  • Model routing
🛡 quarantine isolation

An agent reading untrusted public content must not take high-privilege actions; those go to a separate set of agents. Reading and acting are kept apart — a structural defense against prompt injection.
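The quarantine rule is a structural check, so it can be enforced before any agent runs. In the sketch below the tool names and the `AgentRole` type are hypothetical; the only thing taken from the text is the invariant that a role which reads untrusted content must hold no privileged tool.

```python
from dataclasses import dataclass
from typing import FrozenSet

READ_TOOLS: FrozenSet[str] = frozenset({"web_fetch", "read_file"})       # hypothetical names
PRIVILEGED_TOOLS: FrozenSet[str] = frozenset({"send_email", "merge_pr"})  # hypothetical names


@dataclass(frozen=True)
class AgentRole:
    name: str
    tools: FrozenSet[str]
    reads_untrusted: bool


def check_quarantine(role: AgentRole) -> None:
    """Raise if a role that reads untrusted content also holds privileged tools."""
    overlap = role.tools & PRIVILEGED_TOOLS
    if role.reads_untrusted and overlap:
        raise PermissionError(
            f"{role.name} reads untrusted content but can call: {sorted(overlap)}"
        )


reader = AgentRole("reader", READ_TOOLS, reads_untrusted=True)
actor = AgentRole("actor", PRIVILEGED_TOOLS, reads_untrusted=False)
check_quarantine(reader)  # passes: read-only tools
check_quarantine(actor)   # passes: privileged, but never sees raw public content
```

The design choice is that an injected instruction in public content can at worst mislead the reader; it cannot reach a tool that acts.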

When not to use it: Workflows can produce exceptional results, but not every task needs one and they burn far more tokens. First ask "does this really need more compute?" Most traditional coding tasks don't need a five-person review panel. Tips: a detailed prompt matters most; small tasks can use a quick workflow; pair with /loop and /goal; write "use 10k tokens" in the prompt to cap spend; press s to save into ~/.claude/workflows or a skill. Treat it as a starting point, not an endpoint.

4 · Sheet 04 · Verification · ② Configuring Claude Code

Verification: the doer shouldn't grade its own work

Writing code is cheap now, so the bottleneck moves to verification. The core: generation isn't evaluation; the two must be separated. Ask the doer to grade itself and it almost always praises — like a developer reviewing their own code. An engineering reality: tuning an independent Evaluator to be strict is far easier than teaching the Generator to self-criticize.

[Figure: Planner (defines what, not how) → Generator (builds by sprint) → Evaluator (tests like a real user). Arrow labels: spec · what; deliver; verdict · contract loop.]
The Evaluator drives Playwright, scores each criterion (one level editor had 27), catching real bugs like "fill tool fails because the function didn't fire on mouseUp." Design Quality / Originality are high-weight; Craft / Functionality are baseline.
▸ ② In scope here

Use an independent review subagent, or the Workflow adversarial-verification pattern, so reviewer ≠ executor and they share no context. Pure Claude Code usage, no extra code.

▸ Only when it's your own product (out of scope)

Turn the artifact into an agent-readable interface (data-verify-* contracts, window.__verify, .verify.ts). The thesis is "an economic inversion": agents drive verification's marginal cost toward zero, so contract testing finally scales on the front end. You'd only do this when reshaping your own product.

Sheet 05 · Capstone · only off-the-shelf capability

Onboarding a 700K-line legacy codebase

All four layers on the opening's Skyline project. Core method: onboard Claude like a new hire — explain enough background to finish a bounded project, and produce better context for the next round. A five-step loop, each round starting higher:

[Figure: five-step loop: scope it → provide context → do the task → distill context → expand scope ↻ next round, higher start]
As MacLean puts it, you wouldn't put a new hire in front of 700K lines on day one. Each task ends by writing what it learned back, so the next round starts higher.
UW · MacCoss Lab · 700K+ lines C# · 200K+ nightly tests · since 2008 · 17 yrs
Context
Context lives in a separate repo pwiz-ai, not the code repo — else it's isolated by branch and Claude becomes a "different person" per branch. Maintained as a real engineering artifact, versioned.
Knowledge
Skills follow "Reference, Don't Embed." The debugging skill is a hard trigger: must load on any bug/failure/anomaly, forcing root-cause analysis — no change-one-line-and-see.
Tools
Two MCP servers: a screenshot-diff server reproduces 2,000+ tutorial screenshots at near-100%, letting Claude "see" UI regressions; a daily-digest server feeds the failures from 200K nightly tests straight to Claude.
Two weeks
The Files View panel — shelved a year after its dev left — shipped, all commits co-authored
<1 day
The Java/LabKey nightly module, idle three years, restarted, incl. CSS he'd have hired a designer for
all 4 layers
A once-skeptical teammate built a panel on their own, too

Four takeaways: scope isolation is the prerequisite (no one-shot on a huge legacy base — slice and expand) · the debugging skill must be a hard constraint (blindly changing one line in 700K can chain-fail) · context must be independent of code branches · MCP wiring frees people from triaging failures. One honest caveat: not an install-and-go silver bullet — someone maintains the context layer and advances incrementally. But it was proven in "the scenario least suited to AI."

Sheet 06 · Closing · build, then prune

Cut as you build: the harness's half-life

A harness isn't build-once-and-done; it has a half-life. Every component encodes an assumption about model capability, and assumptions expire. How much you can cut depends on how strong the model is.

A real case: Sonnet 4.5 ended tasks early near the limit ("context anxiety"), so a context-reset workaround was added; once Opus 4.5 shipped and the behavior vanished, that code became pure dead weight. Likewise a hook intercepting writes to force Perforce p4 edit turned redundant once it was native. The right mindset isn't "how to make Claude stronger" but "what can I stop doing."

Sonnet 4.5
43%
Opus 4.5
68%
Opus 4.6
84%
Compaction quality ceiling: capability, not the compression algorithm, sets the limit. The stronger the model, the more control you hand back, the thinner the harness.
Same "build a full app" task — a model upgrade lets you cut scaffolding
Opus 4.5 Planner Generator Evaluator Sprint split context reset 6h · $200
Opus 4.6 Planner Generator Evaluator Sprint split context reset 3h50 · $124

But some parts can't be cut. The Planner / Evaluator separation of duties stays: without a Planner, the Generator under-scopes; without an Evaluator, edge-case bugs slip through. Cleanly distinguish the cuttable (scaffolding) from the uncuttable (context discipline, separated verification). Audit the harness every 3 to 6 months.

134 days / 126 rel.
Claude Code's own cadence, ~6.6 releases a week, shipping on ~70% of days
$200→$124
Same output, cheaper and simpler after the upgrade
it just moves
The interesting harness space doesn't shrink as models improve, it relocates
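The closing numbers can be re-derived from the figures already quoted in this section. This is a quick arithmetic check using only those values; the variable names are illustrative.

```python
# Release cadence: 126 releases over 134 days.
releases, days = 126, 134
per_week = releases / days * 7

# Same "build a full app" task before and after the model upgrade.
cost_before, cost_after = 200, 124        # dollars
minutes_before = 6 * 60                   # 6h
minutes_after = 3 * 60 + 50               # 3h50
cost_saving = 1 - cost_after / cost_before
time_saving = 1 - minutes_after / minutes_before

print(f"{per_week:.1f} releases/week, {cost_saving:.0%} cheaper, {time_saving:.0%} faster")
```

The release rate matches the ~6.6 per week stated above, and the upgrade cut both cost and wall-clock time by a bit more than a third.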