The model is the fulcrum.
The harness is the lever.
Claude Code ships the agent loop for you; the ring around that loop you build yourself. With the same Claude, how well you build that ring decides whether it edits a few lines or carries a 700K-line legacy system.
Same model, the environment sets its ceiling
Onboard Claude like a new developer and stalled work starts moving again; yet the model's own capability is the ceiling. Put both together: the harness is the adjustable lever, the model the fixed fulcrum. You can't change the fulcrum — the room to act is the lever.
A discipline running through the whole piece: a harness decays. When the model improves, yesterday's scaffolding can become dead weight. So for every layer below, weigh when to retire it — covered in the closing.
This piece is only about configuring Claude Code
Claude Code gives you the main loop Anthropic already built; you don't write the agent loop. This article covers one thing: how to configure that ready-made framework. Three categories shouldn't be conflated: ① what the model gives you (can't change it, Layer 0), ② what you configure (this article), ③ what you build from scratch (out of scope here).
② What you configure (this article): model selection, context management, the build order for knowledge and tools, orchestration with subagents and Workflows, and pulling verification into its own role. All of it is configuration, not construction.
① What the model gives you: model capability is a given. You can't change it; you can only work with it. This is the bedrock (Layer 0).
③ What you build from scratch (out of scope): prompt-caching design rules, implementing an MCP server, an agent-facing CLI, the single-loop vs. multi-agent trade-off, turning your product into a verifiable interface. These only come up when building an agent from scratch with the API/SDK.
Where such material appears it's noted briefly and tagged "out of scope," not expanded.
Five floors, built bottom-up
Layer 0: the bedrock you can't route around
This layer isn't built; it's given. But understand it first — it decides how lean the harness can be.
The window is 1M tokens (system prompt + full conversation + all tool output + files read), but a large window isn't license to fill it.
Practical tip: put a context progress bar in your status line. No more hammering /context and breaking flow — how much window is left is visible at a glance:
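A bar of this kind can be sketched in a few lines. This is a minimal illustration, not Claude Code's own implementation: the function name `render_context_bar` is hypothetical, and how the current token count reaches your status-line script is an assumption that depends on your setup.

```python
def render_context_bar(used_tokens: int, window_tokens: int = 1_000_000, width: int = 20) -> str:
    """Render a one-line context gauge such as '[#####---------------] 25%'."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    # Clamp so a reading slightly past the limit still renders a full bar.
    fraction = min(max(used_tokens / window_tokens, 0.0), 1.0)
    filled = round(fraction * width)
    return f"[{'#' * filled}{'-' * (width - filled)}] {fraction:.0%}"


print(render_context_bar(250_000))
```

The default of 1,000,000 matches the window size quoted above; pass a different `window_tokens` if your model's window differs.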
/compact or spin up a subagent; no more /context breaking your flow.
The strongest coding model today is Opus 4.8, but more important than the version are a few principles that don't expire on upgrade.
Default effort is high; set it to xhigh for coding. Opus 4.8 defaults to high on every surface (including Claude Code), as the balance of quality and experience. But Anthropic explicitly recommends xhigh (between high and max) for coding, agentic, and long-running async work. Reserve max for genuinely hard problems; it overthinks and burns tokens, with diminishing returns. (4.7's Claude Code default was xhigh; 4.8 moved it back to high, which is one of the changes listed below.)

| What changed | Specifics | What to do |
|---|---|---|
| Default effort changed | Claude Code defaulted to xhigh on 4.7; 4.8 moved it back to high | Set it back to xhigh for coding |
| Long-context retrieval regressed (the 4.7 gen) | Multi-needle MRCR: 256k 91.9%→59.2%, 1M 78.3%→32.2%; heavy web research fell too, BrowseComp ≈ −4.4pp | RAG / long-doc / web-research: A/B before upgrading, don't trust the version number |
| Default language preference shifted | 4.8 leans toward English; default Chinese output is weaker | Non-English teams: ask for the target language explicitly in CLAUDE.md |
This generation's hallucination dropped sharply, which directly changes how much verification costs. The same fact-check process audited 11 reports (independent subagent compares each claim to the source in a clean context, fixes, hands to a fresh subagent until zero issues). The day we switched to Opus 4.8 was a clear watershed:
But don't misread it as "no verification after 4.8." What thins is the rework, not verification itself. Generation isn't evaluation; an independent agent signing off rather than the executor self-grading is a separation of duties that can't be removed (see Layer 4).
Prompt caching is an "invisible but billable" layer: byte-exact prefix match — change one byte anywhere and every cache after it dies. A miss recomputes every turn at full price, doubling cost and halving your plan's rate limit — so hit rate directly sets the quota a Pro/Max plan gives you. Just work with it:
Caching also has 5 design rules for agents you build from scratch (no mid-stream tool changes, defer_loading, no model switching, update via messages not the system prompt, disguise compaction as a parent continuation) — out of scope here, not covered.
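The byte-exact prefix rule described above can be illustrated with a toy model. This is only a sketch of the matching idea, not how the real cache is implemented; the function name and the sample prompts are made up for illustration.

```python
def cached_prefix_len(previous: bytes, current: bytes) -> int:
    """Length of the shared leading bytes: the only part a prefix cache could reuse."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


system = b"You are a careful reviewer.\n"
history = b"user: read foo.py\nassistant: done\n"
new_turn = b"user: now read bar.py\n"

turn_1 = system + history
turn_2_appended = turn_1 + new_turn                                      # append only
turn_2_edited = b"You are a careful Reviewer.\n" + history + new_turn   # one byte changed

print(cached_prefix_len(turn_1, turn_2_appended) == len(turn_1))  # whole previous turn reusable
print(cached_prefix_len(turn_1, turn_2_edited) < len(system))     # reuse stops inside the system prompt
```

In this model, appending keeps the entire previous turn reusable, while a single changed byte near the top discards everything after it.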
Context: the one resource you actually manage
The loop is given; you manage what goes in and when to clear it. Almost every difference in outcome traces here. After each step, you stand at a branch point — 5 choices, 4 of them about managing context quality:
Compaction is lossy: near the limit the whole conversation becomes a summary, and the model continues from it in a fresh window. The danger is when automatic compaction fires — exactly at the model's weakest moment:
Scenario: 5 files read, approach A failed, switch to B.
Failure and fix both linger
"That doesn't work, try B" — the failed approach plus your correction stay in context, adding noise and clouding judgment.
Back to the fork, switch cleanly
Roll back to "5 files read, A not tried," then say "don't use A — foo doesn't expose that interface, go to B." Keep the reads, drop the failed try.
Companion "Summarize from here": Claude condenses what it learned into a handoff note — a letter from the future Claude to its past self ("I tried A, it didn't work, because…") — so a later session doesn't repeat it.
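The difference between correcting in place and rewinding can be sketched with a toy conversation model. The `Session` class below is hypothetical and only mirrors the scenario above; it is not how Claude Code stores a conversation.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy model of a conversation: an ordered list of context entries."""
    entries: list[str] = field(default_factory=list)

    def add(self, entry: str) -> int:
        self.entries.append(entry)
        return len(self.entries)          # checkpoint = number of entries so far

    def rewind(self, checkpoint: int) -> None:
        del self.entries[checkpoint:]     # drop everything after the checkpoint


s = Session()
for name in ["a.py", "b.py", "c.py", "d.py", "e.py"]:
    checkpoint = s.add(f"read {name}")

s.add("tried approach A")
s.add("approach A failed")

# Correcting in place would append more noise; rewinding removes the detour.
s.rewind(checkpoint)
s.add("don't use A: foo doesn't expose that interface, go to B")
print(len(s.entries))
```

After the rewind the context holds the five reads plus one clean instruction; the failed attempt is gone.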
/compact and /clear both clear context; the difference is who decides what to keep:
| Dimension | /compact | /clear |
|---|---|---|
| Effort | Low — Claude summarizes | High — you distill |
| Precision | Claude decides what to keep | You decide what to carry |
| Context rot | Partly mitigated | Fully eliminated |
| Best for | Clearing clutter mid-task | Starting a fresh task |

The five choices at each branch point:

| Scenario | Use | Why |
|---|---|---|
| Same task, context still relevant | Continue | It's all still working for you |
| Claude went down the wrong path | Rewind | Keep reads, drop failed try |
| Session clogged with stale debug | /compact + steer | Low effort, you set direction |
| Starting a fresh task | /clear | Zero rot, full control |
| Next step: lots of conclusion-only output | Subagent | Noise stays in the child |
New task = new session. For related tasks: highly relevant → continue (skip re-reads); otherwise /clear. Subagent mental test — "do I need the tool output itself, or just the conclusion?" Conclusion only → dispatch. More in Layer 3.
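The scenario table can be read as a small decision function. The sketch below is one possible encoding; the flag names and the order in which the checks run are assumptions, since the article does not rank the scenarios.

```python
from enum import Enum


class Move(Enum):
    CONTINUE = "continue"
    REWIND = "rewind"
    COMPACT = "/compact + steer"
    CLEAR = "/clear"
    SUBAGENT = "subagent"


def next_move(*, new_task: bool, wrong_path: bool, stale_debug: bool,
              conclusion_only_next: bool) -> Move:
    """Map the scenarios in the table to one move; the check order is an assumption."""
    if new_task:
        return Move.CLEAR
    if wrong_path:
        return Move.REWIND
    if stale_debug:
        return Move.COMPACT
    if conclusion_only_next:
        return Move.SUBAGENT
    return Move.CONTINUE


print(next_move(new_task=False, wrong_path=True, stale_debug=False,
                conclusion_only_next=False).value)
```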
Knowledge & tools: a 7-layer build order
The core framework. An often-missed premise: the harness matters as much as, or more than, the model. First the underlying mechanism — Claude finds code via agentic search, not RAG:
| Dimension | RAG tools | Claude Code · agentic search |
|---|---|---|
| Index upkeep | Needs an embedding pipeline | None |
| Freshness | Lags (hours/days/weeks) | Real time on the live codebase |
| Failure mode | Returns deleted/renamed funcs, no signal | No such problem |
| Scaling | Can't keep up with thousands of commits | Each instance runs independently |
The cost: it needs enough starting context to know "where to begin," and navigation quality tracks setup quality. Conclusion — teams that invest in codebase setup get better results.
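The freshness row in the table can be made concrete with a small sketch. A plain regex scan stands in here for whatever search tools the agent actually uses (an assumption), and the file and function names are invented. Because the scan reads the files at call time, a rename is visible on the very next search.

```python
import re
import tempfile
from pathlib import Path


def live_search(root: Path, pattern: str) -> list[tuple[str, int]]:
    """Scan files at call time and return (relative path, line number) hits."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "billing.py").write_text("def charge_card():\n    pass\n", encoding="utf-8")
    before = live_search(root, r"def charge_card")

    # Rename the function; a stale index could still return the old name.
    (root / "billing.py").write_text("def charge_customer():\n    pass\n", encoding="utf-8")
    after = live_search(root, r"def charge_card")

print(before, after)
```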
| Layer | Most common misuse |
|---|---|
| L1 CLAUDE.md | Stuffed with reusable expertise (belongs in a skill) |
| L2 Hooks | Done via prompt instead of auto-running |
| L3 Skills | All crammed into CLAUDE.md |
| L4 Plugins | Good config left in one person's hands |
| L5 LSP | Assumed automatic (needs a plugin + language server) |
| L6 MCP | Built before the basics |
| L7 Subagents | Exploring and editing in one session |
permissions.deny to version control for the whole team.
Many people think a skill is just a SKILL.md. It's actually a folder: SKILL.md is only the entry point, with scripts, references, templates, even its own hooks alongside it. That's exactly what makes it stronger than a prompt.
SKILL.md gives the method, scripts/ the hands, references/ load on demand to save context.
Anthropic groups the common uses into nine kinds, from looking up docs all the way to ops:
Nine best practices from Anthropic's hundreds of internal skills:
Getting a skill right
- No filler: focus on what pushes Claude off its defaults (e.g. "avoid Inter and purple gradients")
- Build a Gotchas section: the most valuable content — log failure points, keep updating
- Progressive disclosure via the file system: split detail into references/api.md
- Don't over-constrain
- Description is the trigger, not a summary
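The folder shape these practices imply can be sketched as a small scaffold. The skill name, the description text, and the exact front-matter fields are illustrative assumptions; check them against the current skill documentation before relying on them.

```python
import tempfile
from pathlib import Path


def scaffold_skill(root: Path, name: str, description: str) -> Path:
    """Create a skill folder: SKILL.md as entry point plus scripts/ and references/."""
    skill = root / name
    (skill / "scripts").mkdir(parents=True)
    (skill / "references").mkdir()
    (skill / "SKILL.md").write_text(
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"   # written as a trigger, not a summary
        "---\n\n"
        "## Gotchas\n\n"
        "See references/api.md for details.\n",
        encoding="utf-8",
    )
    (skill / "references" / "api.md").write_text("# API details\n", encoding="utf-8")
    return skill


with tempfile.TemporaryDirectory() as tmp:
    skill = scaffold_skill(Path(tmp), "release-notes",
                           "Use when drafting release notes from merged PRs")
    created = sorted(p.relative_to(skill).as_posix() for p in skill.rglob("*"))

print(created)
```

Detail lives in references/api.md so that SKILL.md stays short and the rest loads only when needed.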
Keeping it durable
- Store scripts for Claude to compose, not rebuilt boilerplate
- Hooks on demand: /careful intercepts rm -rf, DROP TABLE
- Persist in CLAUDE_PLUGIN_DATA (skill dirs are wiped on upgrade)
- User config in config.json; if unset, ask via AskUserQuestion
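The check behind an on-demand guard like the one in the list above might look like the sketch below. The patterns and the function name are hypothetical, and wiring the check into an actual hook is left out because it depends on your hook configuration.

```python
import re

# Hypothetical deny patterns for an on-demand "careful" mode.
DANGEROUS = [
    # rm with both -r and -f in one flag group, in either order
    re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
]


def should_block(command: str) -> bool:
    """Return True when a shell command matches one of the deny patterns."""
    return any(p.search(command) for p in DANGEROUS)


for cmd in ["rm -rf build/", 'psql -c "drop table users"', "rm notes.txt"]:
    print(cmd, "->", "block" if should_block(cmd) else "allow")
```

A pattern list like this is deliberately narrow; it catches the two cases named in the list and nothing else.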
Sharing: small teams check into .claude/skills; for scale, a plugin marketplace (with vetting). A PreToolUse hook logging invocations shows what's popular. One lesson: good skills start as a few lines and improve as you hit edge cases.
MCP brings external data and systems into the session. This piece covers plugging it in, not designing a server (out of scope here). Two cost levers — both client capabilities:
MCP and Skills are complementary: MCP gives tools and data, Skills give operating knowledge, and the strongest agents use both. (The five MCP-server design patterns are out of scope here.)
Back in late 2024, Claude 3.5 Sonnet scored 49% on SWE-bench Verified with just two general tools, Bash and Text Editor — the SOTA then. Claude Code's foundation is still those two; Skills and Memory are combinations of them.
Orchestration: who should hold the plan
The question becomes: does Claude carry the whole plan turn by turn, or is it held elsewhere? Principle: don't split small tasks; don't make the model hold an entire large plan in context alone.
What it is
- Three traits: fresh start (no history), parallel, permission isolation (research read-only / impl editable)
- General-purpose: full tools, complex multi-step
- Plan Agent: read-only, research + strategy
- Explore Agent: read-only, fast search
When to dispatch
- Research-heavy (dozens of files)
- Multiple independent tasks (parallel)
- A fresh perspective (no inherited bias)
- Pre-commit verification (independent check)
- Pipeline workflow (clear stages/handoffs)
The maturity path in one line: conversation first, automation later. Start in natural language, watch what recurs, then harden it:
.claude/agents/, description = a trigger
How it should be done
- Research then implement (start from a summary, not 20 raw files)
- Parallel batch edits (one pattern, independent files)
- Independent code review (a fresh view is more objective)
- Pipeline (design/implement/test, handoff via files)
Don't do this
- Sequentially dependent: one session in order is cleaner
- Same-file edits: parallel edits collide
- Tiny tasks: delegation has overhead, just do it
- Too many expert agents: lowers auto-delegation reliability
- Needing inter-agent coordination: they can't talk — use Agent Teams
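The shape of a parallel dispatch, where noisy intermediate output stays in the child and only a conclusion returns, can be sketched with asyncio. The worker below is a stand-in, not a real subagent call, and the topic names are invented.

```python
import asyncio


async def research_worker(topic: str) -> str:
    """Stand-in for a read-only subagent: does noisy work, returns a short conclusion."""
    raw_output = [f"{topic}: file {i} contents..." for i in range(20)]  # stays in the child
    await asyncio.sleep(0)                                              # simulated I/O
    return f"{topic}: {len(raw_output)} files read, summary ready"


async def dispatch(topics: list[str]) -> list[str]:
    """Run independent tasks in parallel; the parent only ever sees the conclusions."""
    return await asyncio.gather(*(research_worker(t) for t in topics))


parent_context = asyncio.run(dispatch(["auth", "billing", "search"]))
print(parent_context)
```

The parent ends up with three short strings rather than sixty file dumps, which is the point of the "conclusion only" test.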
What separates the three isn't scale, it's one question: who holds the PLAN token. With Subagent/Skill, Claude holds it turn by turn and intermediates land in context; with a Workflow, the script holds it and Claude sees only the final answer.
| | Subagents | Skills | Workflows |
|---|---|---|---|
| Who decides next | Claude, per turn | Claude, per prompt | The script |
| Intermediates live | Claude's context | Claude's context | Script variables |
| Reproducible | Worker definition | Instructions | The orchestration |
| Scale | A few per turn | Same as subagent | Tens to hundreds/run |
| On interruption | Restart the turn | Restart the turn | Resumable in-session |
A Workflow isn't a replacement for a Skill; they're orthogonal: a Skill changes "what the model knows," a Workflow changes "how things are deterministically orchestrated." An agent fanned out in a workflow can load a skill first.
/workflows, press s to save it as a command.
Convergence is the exit condition, not a fixed round count. Triggers: workflow in a prompt, /effort ultracode, or a saved command; there is also a built-in /deep-research. On by default for Max/Team/API, manual via /config for Pro, off for Enterprise. Token use far exceeds a normal session, so try a small task first; before a big run, check /model and route lighter stages to a smaller model.
A Workflow ported Bun from Zig to Rust: ~750K lines, 11 days, 99.8% test pass rate, via four chained workflows (① map Rust lifetimes ② translate per file, hundreds of agents in parallel, two reviewers each ③ drive build/tests to green ④ optimize overnight, a PR per change for human review). But the preconditions are extreme; not "any migration in 11 days":
- strangler-fig incremental replacement, not a rewrite — Zig and Rust linked into one binary, switched class by class
- every switch passed tests + shadow-diff + a ≤2% performance gate — verification-driven, not "done means trusted"
- very high existing test coverage (99.8% presupposes a strong suite), single-author, not yet in production (99.8% isn't 100%)
Correct reading: on a high-value module with strong tests and gates, a deterministic script + build/test fix loop compresses quarter-scale work to days. Pilot on such a module; don't copy the timeline.
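The three-part gate from the list above (tests, shadow-diff, and a performance ceiling of 2%) can be written as a single predicate. The field names are hypothetical, and treating any shadow-diff mismatch as a failure is an assumption about how strict the gate was.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchResult:
    tests_passed: bool
    shadow_diff_count: int     # mismatches between old and new implementation outputs
    baseline_ms: float
    candidate_ms: float


def may_switch(r: SwitchResult, max_regression: float = 0.02) -> bool:
    """Allow a switch only if tests, shadow-diff, and the performance gate all pass."""
    if r.baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    regression = (r.candidate_ms - r.baseline_ms) / r.baseline_ms
    return r.tests_passed and r.shadow_diff_count == 0 and regression <= max_regression


print(may_switch(SwitchResult(True, 0, 100.0, 101.5)))   # 1.5% slower
print(may_switch(SwitchResult(True, 0, 100.0, 103.0)))   # 3% slower
```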
The default harness packs planning and execution into one context; on long-running, massively parallel, or adversarial work it breaks — three ways, which is exactly why the plan should move into a script:
Laziness
A 50-item review claims "done" at item 35.
Self-preference
It shields work it just produced when judging it.
Goal drift
Fidelity leaks over turns; compaction drops "don't X."
Workflows fix all three structurally: laziness — the script holds the full checklist, not done until the loop ends; self-preference — verification by another independent agent; goal drift — the goal lives in the script, immune to compaction.
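A minimal sketch of the structural fix: the script owns the checklist and the exit condition, and a separate checker signs off on each item. The `work` and `verify` callables are stand-ins for separate agent invocations, and the retry limit is an assumption.

```python
from typing import Callable


def run_checklist(items: list[str],
                  work: Callable[[str], str],
                  verify: Callable[[str, str], bool],
                  max_attempts: int = 3) -> dict[str, str]:
    """The script, not the model, owns the checklist and decides when it is done."""
    results: dict[str, str] = {}
    for item in items:                       # every item is visited: no early "done"
        for _ in range(max_attempts):
            output = work(item)              # executor
            if verify(item, output):         # separate checker, sees only item + output
                results[item] = output
                break
        else:
            results[item] = "UNVERIFIED"     # surfaced, not silently dropped
    return results


# Stand-in callables; in practice each would be a separate agent invocation.
def fake_work(item: str) -> str:
    return f"reviewed {item}"


def fake_verify(item: str, output: str) -> bool:
    return item in output and not item.endswith("13")


report = run_checklist([f"item-{i}" for i in range(1, 51)], fake_work, fake_verify)
print(len(report), report["item-13"])
```

All 50 items get a result, and the one the checker rejects is reported as unverified instead of quietly counted as done.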
Classify-and-act
A classifier decides the category, routes to different agents. A triage desk for the pipeline.
Fan-out-and-synthesize
Split into steps, one agent each; synthesis is a barrier — the only one that waits.
Adversarial verification
Each worker gets a dedicated agent to verify adversarially against a rubric. Cures self-preference.
Generate-and-filter
Diverge into many ideas, then filter and dedupe to the best few.
Tournament
N agents, different strategies on one task; pairwise compare to a winner.
Loop until done
When the workload is unknown, no fixed round count — loop to a stop condition.
Comparative judgment beats absolute scoring, especially for ranking.
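The tournament pattern reduces to repeated pairwise comparison. The sketch below assumes a single-elimination bracket, which is one possible arrangement; the judge here is a trivial stand-in for an agent call.

```python
from typing import Callable, Sequence


def tournament(candidates: Sequence[str], better: Callable[[str, str], str]) -> str:
    """Single-elimination bracket: compare pairs until one candidate remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        winners = [better(current[i], current[i + 1])
                   for i in range(0, len(current) - 1, 2)]
        if len(current) % 2:                 # odd one out gets a bye
            winners.append(current[-1])
        current = winners
    return current[0]


# Stand-in judge: prefers the shorter candidate. A real judge would be an agent call.
def shorter(a: str, b: str) -> str:
    return a if len(a) <= len(b) else b


print(tournament(["patch-long-version", "patch-b", "patch-medium", "p"], shorter))
```

The judge only ever answers "which of these two is better", never "how good is this one", which is the comparative judgment the line above recommends.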
Counterintuitively, it's often more useful for non-technical work. Of the nine use cases named officially, over half aren't about writing code:
An agent reading untrusted public content must not take high-privilege actions; those go to a separate set of agents. Reading and acting are kept apart — a structural defense against prompt injection.
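The separation can be expressed as two roles whose tool allowlists do not overlap. The role and tool names below are hypothetical; this only sketches the structural idea and is not a complete defense.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    name: str
    allowed_tools: frozenset[str]

    def call(self, tool: str) -> str:
        """Refuse any tool outside this role's allowlist."""
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not call {tool}")
        return f"{self.name} ran {tool}"


# Hypothetical tool names; the point is that the two allowlists do not overlap.
reader = AgentRole("reader", frozenset({"fetch_url", "read_file"}))
actor = AgentRole("actor", frozenset({"write_file", "deploy"}))

print(reader.call("fetch_url"))
try:
    reader.call("deploy")
except PermissionError as exc:
    print(exc)
```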
When not to use it: Workflows can produce exceptional results, but not every task needs one, and they burn far more tokens. First ask "does this really need more compute?" Most traditional coding tasks don't need a five-person review panel. Tips: a detailed prompt matters most; small tasks can use a quick workflow; pair with /loop and /goal; write "use 10k tokens" in the prompt to cap spend; press s to save into ~/.claude/workflows or a skill. Treat it as a starting point, not an endpoint.
Verification: the doer shouldn't grade its own work
Writing code is cheap now, so the bottleneck moves to verification. The core: generation isn't evaluation; the two must be separated. Ask the doer to grade itself and it almost always praises — like a developer reviewing their own code. An engineering reality: tuning an independent Evaluator to be strict is far easier than teaching the Generator to self-criticize.
Use an independent review subagent, or the Workflow adversarial-verification pattern, so reviewer ≠ executor and they share no context. Pure Claude Code usage, no extra code.
Turn the artifact into an agent-readable interface (data-verify-* contracts, window.__verify, .verify.ts). The thesis is "an economic inversion": agents drive verification's marginal cost toward zero, so contract testing finally scales on the front end. You'd only do this when reshaping your own product.
Onboarding a 700K-line legacy codebase
All four layers on the opening's Skyline project. Core method: onboard Claude like a new hire — explain enough background to finish a bounded project, and produce better context for the next round. A five-step loop, each round starting higher:
pwiz-ai, not the code repo; otherwise it's isolated by branch and Claude becomes a "different person" per branch. Maintained as a real engineering artifact, versioned.
Four takeaways: scope isolation is the prerequisite (no one-shot on a huge legacy base; slice and expand) · the debugging skill must be a hard constraint (blindly changing one line in 700K can chain-fail) · context must be independent of code branches · MCP wiring frees people from triaging failures. One honest caveat: this is not an install-and-go silver bullet. Someone has to maintain the context layer and advance it incrementally. But it was proven in "the scenario least suited to AI."
Cut as you build: the harness's half-life
A harness isn't build-once-and-done; it has a half-life. Every component encodes an assumption about model capability, and assumptions expire. How much you can cut depends on how strong the model is.
A real case: Sonnet 4.5 ended tasks early near the limit ("context anxiety"), so a context-reset workaround was added; once Opus 4.5 shipped and the behavior vanished, that code became pure dead weight. Likewise a hook intercepting writes to force Perforce p4 edit turned redundant once it was native. The right mindset isn't "how to make Claude stronger" but "what can I stop doing."
But some parts can't be cut. The Planner / Evaluator separation of duties stays: without a Planner, the Generator under-scopes; without an Evaluator, edge-case bugs slip through. Cleanly distinguish the cuttable (scaffolding) from the uncuttable (context discipline, separated verification). Audit the harness every 3 to 6 months.
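One way to make the audit routine is to record, for each harness component, the assumption it encodes and when it was last reviewed. The sketch below is a hypothetical bookkeeping scheme; the 180-day default takes the upper end of the 3-to-6-month range, and the dates are invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class HarnessComponent:
    name: str
    assumption: str        # the model limitation this component compensates for
    last_reviewed: date
    cuttable: bool = True  # False for context discipline and separated verification


def due_for_audit(components: list[HarnessComponent], today: date,
                  max_age: timedelta = timedelta(days=180)) -> list[str]:
    """Names of components whose last review is older than max_age."""
    return [c.name for c in components if today - c.last_reviewed > max_age]


harness = [
    HarnessComponent("context-reset workaround", "model ends tasks early near the limit",
                     date(2025, 1, 10)),
    HarnessComponent("independent evaluator", "generation is not evaluation",
                     date(2025, 9, 1), cuttable=False),
]
print(due_for_audit(harness, today=date(2025, 10, 1)))
```

Components marked as not cuttable still get reviewed; the flag only records that the review should tune them, not retire them.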