CLAUDE CODE · 2026 · THE COMPLETE PLAYBOOK · DWG NO. GP-HARNESS-01 · construction manual

The model is the fulcrum.
The harness is the lever.

Claude Code ships the agent loop for you; the ring around that loop you build yourself. With the same Claude, how well you build that ring decides whether it edits a few lines or carries a 700K-line legacy system.

project: Configuring the Claude Code harness · sheet: General notes · floors 0–4 · checked: load-bearing numbers fact-checked · date: 2026-06 · rev v3

0 Foundation · model  1 Context  2 Knowledge & tools  3 Orchestration  4 Verification  ◆ 700K-line case  ½ Closing · half-life

GenAI Playbook · A consolidation of a dozen-plus Claude Code field notes · Read this one, skip the rest

Why build the harness first

Same model, the environment sets its ceiling

Onboard Claude like a new developer and stalled work starts moving again; yet the model's own capability is still the ceiling. Put both together: the harness is the adjustable lever, the model the fixed fulcrum. You can't change the fulcrum; your room to act is the lever.

Two weeks

UW's Skyline: a feature shelved for a year shipped in two weeks; a nightly-test module idle for three years restarted in under a day

11 days

Bun ported Zig→Rust with the Workflow tool, ~750K lines, 99.8% test pass rate

43→84%

Swap only the model under one harness: compaction quality rises from Sonnet 4.5 to Opus 4.6

A discipline running through the whole piece: a harness decays. When the model improves, yesterday's scaffolding can become dead weight. So for every layer below, weigh when to retire it — covered in the closing.

Scope, stated up front

This piece is only about configuring Claude Code

Claude Code gives you the main loop Anthropic already built; you don't write the agent loop. This article covers one thing: how to configure that ready-made framework. Three categories shouldn't be conflated: ① what the model gives you (can't change it, Layer 0), ② what you configure (this article), ③ what you build from scratch (out of scope here).

▸ ② Configuring Claude Code (the subject)

Model selection, context management, the build order for knowledge and tools, orchestration with subagents and Workflows, pulling verification into its own role. All of it is what you configure, not build.

Plus ① model capability: a given you can't change, only route around. It is the bedrock (Layer 0).

▸ Out of scope: building an agent from scratch

Prompt-caching design rules, implementing an MCP server, an agent-facing CLI, the single-loop vs. multi-agent trade-off, turning your product into a verifiable interface. These only come up when building an agent from scratch with the API/SDK.

Where such material appears it's noted briefly and tagged "out of scope," not expanded.

The full blueprint

Five floors, built bottom-up

Each floor rests on the one below: accept the foundation, then manage context, then add knowledge and tools, and only then talk orchestration and verification.

0 · Sheet 00 · Foundation · ① Model capability · given, not built

Layer 0: the bedrock you can't route around

This layer isn't built; it's given. But understand it first — it decides how lean the harness can be.

0.1 The context window and context rot

The window is 1M tokens (system prompt + full conversation + all tool output + files read), but a large window isn't license to fill it.

Context rot is an intrinsic property of attention, not a bug: the longer the context, the lower the signal-to-noise ratio. The 1M window isn't for cramming everything in; it's for completing one task more reliably.

Practical tip: put a context progress bar in your status line. No more hammering /context and breaking flow — how much window is left is visible at a glance:

claude — acme-monorepo

› refactor the payments service to use the new ledger client

● Read services/payments/ledger.ts (+2.1K tokens)

● Edit services/payments/charge.ts

…

~/acme-monorepo  ⎇ main  opus-4.8 · xhigh  context 720K / 1M · 72%

That last line is what the plugin adds to your status bar: window usage, always visible. Past ~70–80% it warms from green to amber to red, nudging you to /compact or spin up a subagent.
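A minimal sketch of the rendering logic behind such a bar, assuming your status-line script already has the token counts in hand. The function name, the bar width, and the exact 70% and 80% cut-offs (read off the "~70–80%" range above) are illustrative choices, not the plugin's actual code.

```python
def render_context_bar(used_tokens: int, window_tokens: int = 1_000_000, width: int = 20) -> str:
    """Render a one-line text bar for context usage (hypothetical helper)."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    ratio = min(max(used_tokens / window_tokens, 0.0), 1.0)
    filled = round(ratio * width)
    # Assumed thresholds: green below 70%, amber from 70%, red from 80%.
    if ratio >= 0.80:
        level = "red"
    elif ratio >= 0.70:
        level = "amber"
    else:
        level = "green"
    bar = "#" * filled + "-" * (width - filled)
    return f"context [{bar}] {used_tokens // 1000}K / {window_tokens // 1000}K · {ratio:.0%} ({level})"


print(render_context_bar(720_000))
```

How the counts reach the script depends on your status-line setup; this sketch covers only the display step.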

0.2 Choosing the right model is itself harness design

The strongest coding model today is Opus 4.8, but more important than the version are a few principles that don't expire on upgrade.

Effort defaults to high; set it to xhigh for coding. Opus 4.8 defaults to high on every surface (including Claude Code), a balance of quality and experience. But Anthropic explicitly recommends xhigh (between high and max) for coding, agentic, and long-running async work. Reserve max for genuinely hard problems; it overthinks and burns tokens, with diminishing returns. (4.7's Claude Code default was xhigh; 4.8 moved it back to high, which is exactly the "every generation changes its defaults" point below.)

The thinking budget is adaptive. To push: "think carefully, step by step — this is harder than it looks." To speed up: "respond quickly, answer directly when uncertain."

Every generation changes its defaults. Prompts tuned for an older model need re-checking: response length, reasoning vs. tool calls, default subagent count all shift. Don't assume "strictly better" — every generation makes trade-offs. The table below shows the three you'll trip over most.

Delegate to Claude as an engineer, not a pair-programming partner. Each turn adds overhead, so authorize fully: in turn one, state intent, constraints, acceptance criteria, file locations; batch the work so it keeps moving.

What changed	Specifics	What to do
Default effort changed	Claude Code defaulted to `xhigh` on 4.7; 4.8 moved it back to `high`	Set it back to `xhigh` for coding
Long-context retrieval regressed (the 4.7 gen)	Multi-needle MRCR: 256k 91.9%→59.2%, 1M 78.3%→32.2%; heavy web research fell too, BrowseComp ≈ −4.4pp	RAG / long-doc / web-research: A/B before upgrading, don't trust the version number
Default language preference shifted	4.8 leans toward English; default Chinese output is weaker	Non-English teams: ask for the target language explicitly in CLAUDE.md

0.3 The more reliable the model, the thinner the verification harness

This generation's hallucination rate dropped sharply, which directly changes how much verification costs. The same fact-check process audited 11 reports: an independent subagent compares each claim to the source in a clean context and fixes what it finds, then hands off to a fresh subagent, repeating until zero issues remain. The day we switched to Opus 4.8 was a clear watershed:

Even the most number-dense report (832 accounts / 13,873 observations / 33%→56% / r=0.28) passed first try, zero fixes. Hallucination rate directly sets the rework count.

But don't misread it as "no verification after 4.8." What thins is the rework, not verification itself. Generation isn't evaluation; an independent agent signing off rather than the executor self-grading is a separation of duties that can't be removed (see Layer 4).

0.4 Caching decides your quota: rules to follow

Prompt caching is an "invisible but billable" layer: byte-exact prefix match — change one byte anywhere and every cache after it dies. A miss recomputes every turn at full price, doubling cost and halving your plan's rate limit — so hit rate directly sets the quota a Pro/Max plan gives you. Just work with it:

Don't switch models mid-conversation. Dropping a 100K-token Opus session to Haiku costs more than staying put, because the whole cache is rebuilt. Stay on the model, or spawn a subagent (its own prefix chain, no pollution).

/compact while the window has room. Compaction reuses the parent prefix to avoid breaking the cache — work with the system, not against it. Reasoning in Layer 1.

Caching also has 5 design rules for agents you build from scratch (no mid-stream tool changes, defer_loading, no model switching, update via messages not the system prompt, disguise compaction as a parent continuation) — out of scope here, not covered.
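To make the "byte-exact prefix match" rule concrete, here is a toy model of it. It treats a request as a list of text blocks and counts how many leading blocks are byte-identical to the previous request. The block granularity and the function name are assumptions for illustration; this is not how the caching service is implemented.

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count leading blocks that are byte-identical between two requests."""
    reusable = 0
    for old, new in zip(previous, current):
        if old.encode("utf-8") != new.encode("utf-8"):
            break  # the first difference ends the reusable prefix
        reusable += 1
    return reusable


turn_1 = ["system prompt", "tool definitions", "user: refactor payments"]
turn_2 = turn_1 + ["assistant: done", "user: now add tests"]   # append only
turn_3 = ["system prompt (edited)"] + turn_2[1:]                # early change

print(cached_prefix_len(turn_1, turn_2))  # 3: the whole earlier prefix is reusable
print(cached_prefix_len(turn_2, turn_3))  # 0: one early change invalidates everything after it
```

Appending keeps the prefix intact; editing anything near the front (or switching models, which starts a different prefix chain) drops the reusable part to zero.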

1 · Sheet 01 · Context · ② Configuring Claude Code

Context: the one resource you actually manage

The loop is given; you manage what goes in and when to clear it. Almost every difference in outcome traces here. After each step, you stand at a branch point — 5 choices, 4 of them about managing context quality:

"Continue" is the default; the other four are the levers for managing context.

1.2 / 1.3 Compaction, and the auto-compact trap

Compaction is lossy: near the limit the whole conversation becomes a summary, and the model continues from it in a fresh window. The danger is when automatic compaction fires — exactly at the model's weakest moment:

Typical failure: after a long debug, auto-compact summarizes the debugging; you say "fix that warning in bar.ts," but its details were dropped. Fix: use the 1M headroom and compact early, while in good shape.

1.4 Rewind beats correction

Scenario: 5 files read, approach A failed, switch to B.

Verbal correction

Failure and fix both linger

"That doesn't work, try B" — the failed approach plus your correction stay in context, adding noise and clouding judgment.

/rewind (double-tap Esc)

Back to the fork, switch cleanly

Roll back to "5 files read, A not tried," then say "don't use A — foo doesn't expose that interface, go to B." Keep the reads, drop the failed try.

Companion "Summarize from here": Claude condenses what it learned into a handoff note — a letter from the future Claude to its past self ("I tried A, it didn't work, because…") — so a later session doesn't repeat it.

1.5 / 1.8 How to choose: two cheat sheets

/compact and /clear both free up context; the difference is who decides what to keep:

Dimension	/compact	/clear
Effort	Low — Claude summarizes	High — you distill
Precision	Claude decides what to keep	You decide what to carry
Context rot	Partly mitigated	Fully eliminated
Best for	Clearing clutter mid-task	Starting a fresh task

Scenario	Use	Why
Same task, context still relevant	Continue	It's all still working for you
Claude went down the wrong path	Rewind	Keep reads, drop failed try
Session clogged with stale debug	/compact + steer	Low effort, you set direction
Starting a fresh task	/clear	Zero rot, full control
Next step: lots of conclusion-only output	Subagent	Noise stays in the child

New task = new session. For related tasks: highly relevant → continue (skip re-reads); otherwise /clear. Subagent mental test — "do I need the tool output itself, or just the conclusion?" Conclusion only → dispatch. More in Layer 3.
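The second cheat sheet can be read as a small decision function. The sketch below is one possible encoding; the parameter names and the order in which the conditions are checked are assumptions, since the table itself does not rank the scenarios.

```python
def next_move(same_task: bool, went_wrong_path: bool,
              stale_clutter: bool, conclusion_only_next: bool) -> str:
    """Map the cheat-sheet scenarios to an action (illustrative priority order)."""
    if not same_task:
        return "/clear"              # fresh task: zero rot, full control
    if went_wrong_path:
        return "rewind"              # keep the reads, drop the failed try
    if conclusion_only_next:
        return "subagent"            # noise stays in the child
    if stale_clutter:
        return "/compact + steer"    # low effort, you set direction
    return "continue"                # context is still working for you


print(next_move(same_task=True, went_wrong_path=True,
                stale_clutter=False, conclusion_only_next=False))
```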

2 · Sheet 02 · Knowledge & tools · ② Configuring Claude Code · core framework

Knowledge & tools: a 7-layer build order

The core framework. An often-missed premise: the harness matters as much as, or more than, the model. First the underlying mechanism — Claude finds code via agentic search, not RAG:

Dimension	RAG tools	Claude Code · agentic search
Index upkeep	Needs an embedding pipeline	None
Freshness	Lags (hours/days/weeks)	Real time on the live codebase
Failure mode	Returns deleted/renamed funcs, no signal	No such problem
Scaling	Can't keep up with thousands of commits	Each instance runs independently

The cost: it needs enough starting context to know "where to begin," and navigation quality tracks setup quality. Conclusion — teams that invest in codebase setup get better results.

2.2 Build order: seven layers, bottom-up

Each layer rests on the one below; don't skip. Five of the seven are extension points; LSP and Subagents are two add-on capabilities. Also: initialize from a subdirectory, not the repo root (Claude walks up the tree, so it gets more focused context).

Layer	Most common misuse
L1 CLAUDE.md	Stuffed with reusable expertise (belongs in a skill)
L2 Hooks	Done via prompt instead of auto-running
L3 Skills	All crammed into CLAUDE.md
L4 Plugins	Good config left in one person's hands
L5 LSP	Assumed automatic (needs a plugin + language server)
L6 MCP	Built before the basics
L7 Subagents	Exploring and editing in one session

2.3 Making the codebase navigable: six practices

Keep CLAUDE.md lean and layered: root global, subdir local, additive; root holds only pointers and key gotchas.

Initialize from a subdirectory, not the repo root: Claude walks up, loading every CLAUDE.md en route, so root context isn't lost.

Scope test/lint by subdirectory: otherwise Claude changes one service but runs the whole suite, timing out and wasting context.

Use .ignore to exclude generated, build, and third-party files, and commit permissions.deny to version control for the whole team.

Build a codebase map when structure is unclear: a lightweight markdown at root, one line per top-level dir.

Run LSP to search by symbol, not string: filtering happens before Claude reads, not by opening hits one by one.

acme-monorepo/
├── CLAUDE.md                 root: pointers + gotchas only
├── .claude/
│   └── settings.json         permissions.deny, checked in
├── apps/web/
│   ├── CLAUDE.md             per-subdir test / lint
│   └── src/
├── services/payments/
│   └── CLAUDE.md             init here, not at root
├── codebase-map.md           directory-level index
└── (LSP: clangd)             search by symbol, not string

The six practices aren't about "making Claude smarter" — they make the codebase easier to explore. Claude's ceiling in a big repo = its ability to find the right context.
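The "walk up from a subdirectory" behaviour described in the second practice can be pictured with a short sketch. It only models the lookup as the text describes it (collect every CLAUDE.md between the starting directory and the repo root); the function name is hypothetical and this is not Claude Code's own implementation.

```python
from pathlib import Path


def claude_md_chain(start_dir: Path, repo_root: Path) -> list[Path]:
    """Return the CLAUDE.md files from repo_root down to start_dir, root first."""
    start_dir, repo_root = start_dir.resolve(), repo_root.resolve()
    if start_dir != repo_root and repo_root not in start_dir.parents:
        raise ValueError("start_dir must be inside repo_root")
    found = []
    current = start_dir
    while True:
        candidate = current / "CLAUDE.md"
        if candidate.is_file():
            found.append(candidate)
        if current == repo_root:
            break
        current = current.parent  # walk one level up the tree
    return list(reversed(found))
```

Started in services/payments, the chain holds the root file plus the payments file and nothing from apps/web, which is the "more focused context" the practice is after.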

2.4 Skills in depth: a folder, not one .md

Many people think a skill is just a SKILL.md. It's actually a folder — SKILL.md is only the entry point, with scripts, references, templates, even its own hooks alongside it. That's exactly what makes it stronger than a prompt.

my-skill/
├── SKILL.md        entry: description = trigger
├── scripts/        reusable scripts to compose, not rebuild
├── references/     progressive disclosure, read on demand (api.md…)
├── assets/         templates, sample files
├── hooks/          intercept as needed, e.g. /careful blocks rm -rf
└── config.json     user config; if unset, AskUserQuestion

The folder structure is the skill itself: SKILL.md gives the method, scripts/ the hands, references/ load on demand to save context.

Anthropic groups the common uses into nine kinds, from looking up docs all the way to ops:

Library / API ref

Product verification

Data analysis

Process automation

Scaffolding

Code review

CI / CD

Runbooks

Infra ops

Nine best practices from Anthropic's hundreds of internal skills:

Authoring · first five

Getting a skill right

No filler: focus on what pushes Claude off its defaults (e.g. "avoid Inter and purple gradients")
Build a Gotchas section: the most valuable content — log failure points, keep updating
Progressive disclosure via the file system: split detail into references/api.md
Don't over-constrain
Description is the trigger, not a summary

Operations · last four

Keeping it durable

Store scripts for Claude to compose, not rebuilt boilerplate
Hooks on demand: /careful intercepts rm -rf, DROP TABLE
Persist in CLAUDE_PLUGIN_DATA (skill dirs are wiped on upgrade)
User config in config.json; if unset, ask via AskUserQuestion

Sharing: small teams check into .claude/skills; for scale, a plugin marketplace (with vetting). A PreToolUse hook logging invocations shows what's popular. One lesson: good skills start as a few lines and improve as you hit edge cases.
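One way to turn such a hook log into a popularity ranking, sketched under assumptions: the hook is taken to append one JSON payload per line, and each payload is taken to carry a "tool_name" field. Check your own log for the field that actually identifies a skill before relying on this.

```python
import json
from collections import Counter


def tally_invocations(jsonl_lines: list[str], key: str = "tool_name") -> Counter:
    """Count logged hook payloads by one field; malformed lines are skipped."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore partial or corrupted log lines
        if isinstance(payload, dict) and key in payload:
            counts[str(payload[key])] += 1
    return counts


sample_log = [
    '{"tool_name": "Bash"}',
    '{"tool_name": "Read"}',
    'not json',
    '{"tool_name": "Bash"}',
]
print(tally_invocations(sample_log).most_common())
```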

2.5 MCP: how to use it on the consumer side

MCP brings external data and systems into the session. This piece covers plugging it in, not designing a server (out of scope here). Two cost levers — both client capabilities:

↓85%

Tool Search: with many tools, search on demand instead of preloading — tool-definition tokens drop ~85%, accuracy unchanged

↓37%

Programmatic Tool Calling: calls loop/filter/aggregate in a code sandbox, only the final output enters context — ~37% less overhead

10 + 8

One real analytics plugin bundles 10 skills + 8 MCP servers — one install, generalist → domain expert

MCP and Skills are complementary: MCP gives tools and data, Skills give operating knowledge, and the strongest agents use both. (The five MCP-server design patterns are out of scope here.)
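The idea behind Programmatic Tool Calling can be sketched in plain code: the looping, filtering, and aggregating happen in a script, and only a small result is handed back. Everything here is illustrative; fetch_orders is a hypothetical stand-in for a tool call that returns many rows, and the data is made up for the example.

```python
from statistics import mean


def fetch_orders(region: str) -> list[dict]:
    """Hypothetical tool: stands in for a call that returns many rows."""
    data = {
        "eu": [{"id": 1, "total": 120.0}, {"id": 2, "total": 15.0}],
        "us": [{"id": 3, "total": 300.0}, {"id": 4, "total": 80.0}, {"id": 5, "total": 5.0}],
    }
    return data.get(region, [])


def summarize_large_orders(regions: list[str], threshold: float) -> dict:
    """Loop, filter and aggregate in code; only this small dict reaches the model."""
    summary = {}
    for region in regions:
        large = [o["total"] for o in fetch_orders(region) if o["total"] >= threshold]
        summary[region] = {"count": len(large), "mean_total": mean(large) if large else 0.0}
    return summary


print(summarize_large_orders(["eu", "us"], threshold=50.0))
```

The raw rows never enter the conversation; the model sees two counts and two means.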

2.6 The throughline: less is more

Back in late 2024, Claude 3.5 Sonnet scored 49% on SWE-bench Verified with just two general tools, Bash and Text Editor — the SOTA then. Claude Code's foundation is still those two; Skills and Memory are combinations of them.

Tool schemas and overlong instructions crowd out reasoning space. Don't rush an elaborate toolchain — first check whether general tools do the job.

3 · Sheet 03 · Orchestration · ② Configuring Claude Code

Orchestration: who should hold the plan

The question becomes: does Claude carry the whole plan turn by turn, or is it held elsewhere? Principle: don't split small tasks; don't make the model hold an entire large plan in context alone.

3.1 Subagents: three traits, built-in types, five scenarios

3 traits / 3 built-in types

What it is

Three traits: fresh start (no history), parallel, permission isolation (research read-only / impl editable)
General-purpose: full tools, complex multi-step
Plan Agent: read-only, research + strategy
Explore Agent: read-only, fast search

5 scenarios

When to dispatch

Research-heavy (dozens of files)
Multiple independent tasks (parallel)
A fresh perspective (no inherited bias)
Pre-commit verification (independent check)
Pipeline workflow (clear stages/handoffs)

10+ files

More files to explore than this → dispatch

3+ subtasks

More independent subtasks than this → dispatch

3≈1

Three parallel subagents take about as long as one
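The three rules of thumb above fit in a few lines. The sketch below encodes the "10+ files / 3+ subtasks" thresholds and uses asyncio to show why three parallel workers finish in roughly the time of one. run_subagent is a hypothetical stand-in that sleeps instead of doing real work.

```python
import asyncio
import time


def should_dispatch(files_to_explore: int, independent_subtasks: int) -> bool:
    """Rule of thumb from the text: 10+ files or 3+ independent subtasks."""
    return files_to_explore >= 10 or independent_subtasks >= 3


async def run_subagent(name: str, seconds: float) -> str:
    """Hypothetical subagent: sleeps to simulate work, returns only a summary."""
    await asyncio.sleep(seconds)
    return f"{name}: summary"


async def dispatch_parallel(tasks: dict[str, float]) -> list[str]:
    """Run all subagents concurrently and collect their summaries in order."""
    return await asyncio.gather(*(run_subagent(n, s) for n, s in tasks.items()))


start = time.perf_counter()
results = asyncio.run(dispatch_parallel({"auth": 0.2, "billing": 0.2, "search": 0.2}))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # wall-clock time is close to one task, not three
```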

3.2 / 3.3 Five ways to invoke; four patterns, five anti-patterns

The maturity path in one line: conversation first, automation later. Start in natural language, watch what recurs, then harden it:

step 1 · Conversational: natural language; Ctrl+B to background
step 2 · Custom subagent: .claude/agents/, description = a trigger
step 3 · CLAUDE.md rules: harden into standing conventions
step 4 · Skills workflow: reusable playbooks
step 5 · Hooks automation: most automated, e.g. a Stop hook blocking finish while tests fail
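A sketch of what a Stop hook that blocks finishing while tests fail might look like. It assumes the hook receives a JSON payload on stdin with a stop_hook_active flag and that exit code 2 blocks the stop, which is my understanding of the hooks contract; verify both against the current hooks documentation. The pytest command is an assumed example.

```python
import json
import subprocess
import sys


def stop_decision(tests_returncode: int, stop_hook_active: bool) -> int:
    """Return the hook's exit code: 2 asks Claude to keep working, 0 lets it stop."""
    if stop_hook_active:
        return 0  # already continuing because of this hook; avoid an endless loop
    return 0 if tests_returncode == 0 else 2


def main(test_command: list[str]) -> int:
    """Read the hook payload, run the tests, and decide whether to block."""
    payload = json.load(sys.stdin)  # assumed: hook input arrives as JSON on stdin
    result = subprocess.run(test_command, capture_output=True, text=True)
    code = stop_decision(result.returncode, bool(payload.get("stop_hook_active")))
    if code == 2:
        print("Tests are failing; fix them before finishing.", file=sys.stderr)
    return code


# In a real hook script you would end with something like:
#   sys.exit(main(["pytest", "-q"]))   # assumed test command
```

Keeping the decision in a pure function makes the blocking rule easy to test without running Claude Code at all.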

4 patterns

How it should be done

Research then implement (start from a summary, not 20 raw files)
Parallel batch edits (one pattern, independent files)
Independent code review (a fresh view is more objective)
Pipeline (design/implement/test, handoff via files)

5 anti-patterns

Don't do this

Sequentially dependent: one session in order is cleaner
Same-file edits: parallel edits collide
Tiny tasks: delegation has overhead, just do it
Too many expert agents: lowers auto-delegation reliability
Needing inter-agent coordination: they can't talk — use Agent Teams

3.4 Leveling up: Workflows move the plan into a script

What separates Subagents, Skills, and Workflows isn't scale; it's one question: who holds the plan. With a Subagent or Skill, Claude holds it turn by turn and intermediates land in context; with a Workflow, the script holds it and Claude sees only the final answer.

Subagent

Claude, per turn

Plan in the head, every intermediate lands in context

Skill

Claude follows it

Instructions are advice, the how is still per-turn

Workflow

the script

Loops/branches in the script, intermediates stay out

Dimension	Subagents	Skills	Workflows
Who decides next	Claude, per turn	Claude, per prompt	The script
Intermediates live	Claude's context	Claude's context	Script variables
Reproducible	Worker definition	Instructions	The orchestration
Scale	A few per turn	Same as subagent	Tens to hundreds/run
On interruption	Restart the turn	Restart the turn	Resumable in-session

A Workflow isn't a replacement for a Skill; they're orthogonal: a Skill changes "what the model knows," a Workflow changes "how things are deterministically orchestrated." An agent fanned out in a workflow can load a skill first.

3.5 / 3.6 What moving the plan into code unlocks, and hard limits

Intermediates stay out of context → no rot → runs for hours or days. Coordination happens outside the conversation, so the plan stays on track no matter the size.

Quality routines enforced as code, not left to the model to remember — "code forces every conclusion through N independent skeptics."

The whole orchestration can be saved and re-run: pick a run in /workflows, press s to save it as a command.

Max concurrent agents (fewer on low cores)

1,000

Max agents per run

User input you can inject mid-run

Resume only within the same session

Convergence is the exit condition, not a fixed round count. Triggers: workflow in a prompt / /effort ultracode / a saved command; a built-in /deep-research. On by default for Max/Team/API, manual via /config for Pro, off for Enterprise. Token use far exceeds a normal session — try a small task first; before a big run, check /model and route lighter stages to a smaller model.

⚠ 3.7 · The Bun case's preconditions, not to skip

A Workflow ported Bun from Zig to Rust: ~750K lines, 11 days, 99.8% test pass rate, via four chained workflows (① map Rust lifetimes ② translate per file, hundreds of agents in parallel, two reviewers each ③ drive build/tests to green ④ optimize overnight, a PR per change for human review). But the preconditions are extreme; not "any migration in 11 days":

strangler-fig incremental replacement, not a rewrite — Zig and Rust linked into one binary, switched class by class
every switch passed tests + shadow-diff + a ≤2% performance gate — verification-driven, not "done means trusted"
very high existing test coverage (99.8% presupposes a strong suite), single-author, not yet in production (99.8% isn't 100%)

Correct reading: on a high-value module with strong tests and gates, a deterministic script + build/test fix loop compresses quarter-scale work to days. Pilot on such a module; don't copy the timeline.
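The per-switch gate described in the preconditions (tests, shadow-diff, and a ≤2% performance gate) can be written down as a small check. This is a sketch of the idea only: the field names are hypothetical, and reading the performance gate as "no more than 2% slower than the old implementation" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SwitchCandidate:
    tests_passed: bool
    shadow_mismatches: int   # outputs that differ between old and new implementation
    baseline_ms: float       # benchmark time of the old implementation
    candidate_ms: float      # benchmark time of the new implementation


def may_switch(c: SwitchCandidate, max_regression: float = 0.02) -> bool:
    """Allow the switch only if tests pass, the shadow-diff is clean, and slowdown is within the gate."""
    if not c.tests_passed or c.shadow_mismatches > 0:
        return False
    slowdown = (c.candidate_ms - c.baseline_ms) / c.baseline_ms
    return slowdown <= max_regression


print(may_switch(SwitchCandidate(True, 0, baseline_ms=100.0, candidate_ms=101.5)))  # True
print(may_switch(SwitchCandidate(True, 0, baseline_ms=100.0, candidate_ms=103.0)))  # False
```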

3.8 Know the three failure modes first

The default harness packs planning and execution into one context; on long-running, massively parallel, or adversarial work it breaks — three ways, which is exactly why the plan should move into a script:

agentic laziness

Laziness

A 50-item review claims "done" at item 35.

self-preferential bias

Self-preference

It shields work it just produced when judging it.

goal drift

Goal drift

Fidelity leaks over turns; compaction drops "don't X."

Workflows fix all three structurally: laziness — the script holds the full checklist, not done until the loop ends; self-preference — verification by another independent agent; goal drift — the goal lives in the script, immune to compaction.
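A minimal sketch of how a script can hold the checklist and route every draft through a separate verifier. The worker and verifier callables are hypothetical stand-ins for agent calls; a real Workflow would replace them with its own agent invocations.

```python
from typing import Callable


def run_checklist(items: list[str],
                  worker: Callable[[str], str],
                  verifier: Callable[[str, str], bool],
                  max_attempts: int = 3) -> dict[str, str]:
    """The script, not the model, owns the checklist and decides when it is done."""
    results: dict[str, str] = {}
    for item in items:                    # every item is visited; none can be skipped
        for _ in range(max_attempts):
            draft = worker(item)          # executor agent (hypothetical callable)
            if verifier(item, draft):     # a separate agent judges the draft
                results[item] = draft
                break
        else:
            raise RuntimeError(f"{item!r} failed verification {max_attempts} times")
    return results


items = [f"file_{i}" for i in range(50)]
done = run_checklist(items, worker=str.upper, verifier=lambda item, draft: draft == item.upper())
print(len(done))  # 50: the loop cannot report "done" at item 35
```

The goal (the item list and the pass condition) lives in ordinary variables, so nothing a summarizer drops from a conversation can change it.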

3.9 Six composable patterns

Classify-and-act

A classifier decides the category, routes to different agents. A triage desk for the pipeline.

Fan-out-and-synthesize

Split into steps, one agent each; synthesis is a barrier — the only one that waits.

Adversarial verification

Each worker gets a dedicated agent to verify adversarially against a rubric. Cures self-preference.

Generate-and-filter

Diverge into many ideas, then filter and dedupe to the best few.

Tournament

N agents, different strategies on one task; pairwise compare to a winner.

Loop until done

When the workload is unknown, no fixed round count — loop to a stop condition.

Comparative judgment beats absolute scoring, especially for ranking.
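The Tournament pattern reduces to repeated pairwise comparison. In the sketch below, prefer is a hypothetical stand-in for a judge agent that is shown two candidates and returns the better one; the single-elimination bracket is one possible arrangement, not the only one.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def tournament(candidates: list[T], prefer: Callable[[T, T], T]) -> T:
    """Single-elimination bracket: compare pairs until one candidate remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        next_round = [prefer(current[i], current[i + 1])
                      for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:
            next_round.append(current[-1])  # the odd one out gets a bye
        current = next_round
    return current[0]


# Toy judge: prefer the longer string. A real judge would be an agent call.
drafts = ["a", "abcd", "abc", "ab", "abcde"]
print(tournament(drafts, prefer=lambda x, y: x if len(x) >= len(y) else y))  # abcde
```

The judge never has to produce an absolute score, only to say which of two candidates is better.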

3.10 / 3.11 Use cases: a good share isn't writing code

Counterintuitively, it's often more useful for non-technical work. Of the nine use cases named officially, over half aren't about writing code:

Migration / refactoring: One agent per change in its own worktree, another reviews adversarially, then merge
Deep research: Not just the web; also assembling a status report from Slack
Deep fact-checking: One agent extracts every claim; a subagent per claim verifies and cites
Ranking: Tournament / pairwise pipeline
Memory & rule-following
Root-cause investigation
Triage at scale: Brings out the quarantine pattern below
Taste-driven exploration
Model routing

🛡 quarantine isolation

An agent reading untrusted public content must not take high-privilege actions; those go to a separate set of agents. Reading and acting are kept apart — a structural defense against prompt injection.
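The quarantine rule can be checked mechanically before a run starts. This sketch flags any agent that both reads untrusted content and holds a high-privilege tool. The AgentSpec type and the tool names are hypothetical examples, not a Claude Code API.

```python
from dataclasses import dataclass, field

# Assumed example names for tools that take high-privilege actions.
HIGH_PRIVILEGE = {"deploy", "merge_pr", "send_email"}


@dataclass
class AgentSpec:
    name: str
    tools: set[str] = field(default_factory=set)
    reads_untrusted: bool = False


def quarantine_violations(agents: list[AgentSpec]) -> list[str]:
    """Return names of agents that read untrusted content and also hold privileged tools."""
    return [a.name for a in agents if a.reads_untrusted and a.tools & HIGH_PRIVILEGE]


plan = [
    AgentSpec("reader", {"web_fetch"}, reads_untrusted=True),
    AgentSpec("actor", {"merge_pr"}),
    AgentSpec("risky", {"web_fetch", "deploy"}, reads_untrusted=True),
]
print(quarantine_violations(plan))  # ['risky']
```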

When not to use it: Workflows can produce exceptional results, but not every task needs one, and they burn far more tokens. First ask "does this really need more compute?" Most traditional coding tasks don't need a five-person review panel. Tips: a detailed prompt matters most; small tasks can use a quick workflow; pair with /loop and /goal; write "use 10k tokens" in the prompt to cap spend; press s to save into ~/.claude/workflows or a skill. Treat it as a starting point, not an endpoint.

4 · Sheet 04 · Verification · ② Configuring Claude Code

Verification: the doer shouldn't grade its own work

Writing code is cheap now, so the bottleneck moves to verification. The core: generation isn't evaluation; the two must be separated. Ask the doer to grade itself and it almost always praises — like a developer reviewing their own code. An engineering reality: tuning an independent Evaluator to be strict is far easier than teaching the Generator to self-criticize.

The Evaluator drives Playwright, scores each criterion (one level editor had 27), catching real bugs like "fill tool fails because the function didn't fire on mouseUp." Design Quality / Originality are high-weight; Craft / Functionality are baseline.

▸ ② In scope here

Use an independent review subagent, or the Workflow adversarial-verification pattern, so reviewer ≠ executor and they share no context. Pure Claude Code usage, no extra code.

▸ Only when it's your own product (out of scope)

Turn the artifact into an agent-readable interface (data-verify-* contracts, window.__verify, .verify.ts). The thesis is "an economic inversion": agents drive verification's marginal cost toward zero, so contract testing finally scales on the front end. You'd only do this when reshaping your own product.

Sheet 05 · Capstone · only off-the-shelf capability

Onboarding a 700K-line legacy codebase

All four layers, applied to the Skyline project from the opening. The core method is to onboard Claude like a new hire: explain enough background to finish a bounded project, and produce better context for the next round. It is a five-step loop, each round starting higher.

As MacLean puts it, you wouldn't put a new hire in front of 700K lines on day one. Each task ends by writing what it learned back, so the next round starts higher.

UW · MacCoss Lab · 700K+ lines C# · 200K+ nightly tests · since 2008 · 17 yrs

Context

Context lives in a separate repo, pwiz-ai, not the code repo; otherwise it's isolated by branch and Claude becomes a "different person" per branch. It is maintained as a real engineering artifact, versioned.

Knowledge

Skills follow "Reference, Don't Embed." The debugging skill is a hard trigger: must load on any bug/failure/anomaly, forcing root-cause analysis — no change-one-line-and-see.

Tools

Two MCP servers: a screenshot-diff server reproduces 2,000+ tutorial screenshots at near-100%, letting Claude "see" UI regressions; a daily-digest server feeds the failures from 200K nightly tests straight to Claude.

Two weeks

The Files View panel — shelved a year after its dev left — shipped, all commits co-authored

<1 day

The Java/LabKey nightly module, idle three years, restarted, incl. CSS he'd have hired a designer for

all 4 layers

A once-skeptical teammate built a panel on their own, too

Four takeaways: scope isolation is the prerequisite (no one-shot on a huge legacy base — slice and expand) · the debugging skill must be a hard constraint (blindly changing one line in 700K can chain-fail) · context must be independent of code branches · MCP wiring frees people from triaging failures. One honest caveat: not an install-and-go silver bullet — someone maintains the context layer and advances incrementally. But it was proven in "the scenario least suited to AI."

Sheet 06 · Closing · build, then prune

Cut as you build: the harness's half-life

A harness isn't build-once-and-done; it has a half-life. Every component encodes an assumption about model capability, and assumptions expire. How much you can cut depends on how strong the model is.

A real case: Sonnet 4.5 ended tasks early near the limit ("context anxiety"), so a context-reset workaround was added; once Opus 4.5 shipped and the behavior vanished, that code became pure dead weight. Likewise a hook intercepting writes to force Perforce p4 edit turned redundant once it was native. The right mindset isn't "how to make Claude stronger" but "what can I stop doing."

Sonnet 4.5

43%

Opus 4.5

68%

Opus 4.6

84%

Compaction quality ceiling: capability, not the compression algorithm, sets the limit. The stronger the model, the more control you hand back, the thinner the harness.

Same "build a full app" task — a model upgrade lets you cut scaffolding

Opus 4.5 · Planner · Generator · Evaluator · Sprint split · context reset · 6h · $200

Opus 4.6 · Planner · Generator · Evaluator · Sprint split · context reset · 3h50 · $124

But some parts can't be cut. The Planner / Evaluator separation of duties stays: without a Planner, the Generator under-scopes; without an Evaluator, edge-case bugs slip through. Cleanly distinguish the cuttable (scaffolding) from the uncuttable (context discipline, separated verification). Audit the harness every 3 to 6 months.

134 days / 126 rel.

Claude Code's own cadence, ~6.6 releases a week, shipping on ~70% of days

$200→$124

Same output, cheaper and simpler after the upgrade

it just moves

The interesting harness space doesn't shrink as models improve, it relocates
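The headline numbers in this section can be checked with a few lines of arithmetic. The inputs are the figures quoted above ($200 and 6h versus $124 and 3h50; 126 releases in 134 days); the helper name is illustrative.

```python
def percent_saved(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


cost_saved = percent_saved(200, 124)               # dollars per run
time_saved = percent_saved(6 * 60, 3 * 60 + 50)    # minutes per run
releases_per_week = 126 / 134 * 7                  # releases per day, scaled to a week

print(f"{cost_saved:.0f}% cheaper, {time_saved:.0f}% faster, {releases_per_week:.1f} releases/week")
# 38% cheaper, 36% faster, 6.6 releases/week
```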

Same model, the environment sets its ceiling