GENAI-PLAYBOOK · INTERNAL METRIC FACT-CHECK-LOOP

The day we switched to Opus 4.8,
fact-checking went from
endless rework to one pass.

Over the past two months I ran the same fact-check loop on 11 reports: a fresh subagent re-reads every claim against the source in a clean context, I fix whatever it flags, then another fresh subagent re-checks, until zero issues remain. Every round got written into the project's STATUS file as it happened. These aren't benchmark numbers. They're a ledger that piled up inside real work.
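The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual tooling: `fact_check_loop`, `spawn_checker`, `fix`, and `max_rounds` are hypothetical names, and the checker is modeled as a plain function that returns a list of issues. The one property it is meant to show is that every round gets a fresh checker, and that the round count includes the final clean pass.

```python
from typing import Callable, List, Tuple

# Hypothetical shapes, assumed for illustration only.
Checker = Callable[[str], List[str]]      # report text -> list of flagged issues
Fixer = Callable[[str, List[str]], str]   # (report text, issues) -> revised text


def fact_check_loop(
    report: str,
    spawn_checker: Callable[[], Checker],
    fix: Fixer,
    max_rounds: int = 10,
) -> Tuple[str, int, List[Tuple[int, int]]]:
    """Run check-and-fix rounds until a fresh checker reports zero issues.

    Returns the final text, the number of rounds used (counting the
    clean round), and a per-round log of (round number, issues found).
    """
    log: List[Tuple[int, int]] = []
    for round_no in range(1, max_rounds + 1):
        checker = spawn_checker()          # new checker each round, no carried-over context
        issues = checker(report)
        log.append((round_no, len(issues)))  # record the round as it happens
        if not issues:
            return report, round_no, log
        report = fix(report, issues)
    raise RuntimeError(f"still failing after {max_rounds} rounds")
```

Under this counting, a report that is clean on its first check scores 1 round, and a report that needs one fix followed by a clean re-check scores 2.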

Avg. rounds to pass

3.2→1.2

Before the switch, all six reports took 3-4 rounds; after it, the five reports took 1-2

Passed on first try

0/6→4/5

From zero of six to four of five (80%) clean on round one

The dividing line

05-29

After this date, the reports changed character

The cliff speaks for itself

The x-axis lays the 11 reports out by date; the y-axis is how many rounds each one took to reach "all pass." The first six bounce around 3-4 rounds. Past the 05-29 line, they hug "one round."

[Chart: rounds to pass for each of the 11 reports, by date. Legend: before 4.8, avg 3.2 rounds; after 4.8, avg 1.2 rounds; dividing line at 05-29.]

Report by report

Every round count comes from that report's own STATUS record; none was reconstructed after the fact. How many issues each round caught, and whether each was a factual error or imprecise wording, was written down at the time.

Report	Rounds	First-try pass	What the rounds fixed
Failure modes of AI-native startups 05-15	3	—	Zero issues only by round 3
Claude Code on large codebases 05-21	4	—	Not clean until round 4
Anthropic rebuilds its sales org 05-22	3	—	3 rounds
Agent-friendly CLI design 05-26	3	—	5 fixes
Agent-native verification 05-26	3	—	4 issues: 1 factual + 3 imprecise wording
Zero Trust for AI agents 05-28	3	—	4 corrections
━━━━ SWITCHED TO OPUS 4.8 · 2026-05-29 ━━━━
Securing source code with LLMs 05-29	1	PASS	One round, zero issues
Dynamic workflows 05-30	1	PASS	Fresh agent re-fetched the official pages line by line, 0 errors
How we contain Claude 06-01	1	PASS	Verbatim-checked 15 English quotes + every number/architecture detail, zero issues
Dynamic workflows · patterns 06-03	2	—	R1 fixed 1 truncated quote, R2 passed
LLM ATT&CK Navigator 06-04	1	PASS	Packed with exact numbers, zero corrections
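As a quick cross-check, the headline figures can be recomputed from the Rounds column of the table. The sketch below copies those round counts by hand; the helper name `summarize` is made up for illustration.

```python
# Rounds per report, copied from the table, in date order.
before = [3, 4, 3, 3, 3, 3]  # 05-15 through 05-28
after = [1, 1, 1, 2, 1]      # 05-29 through 06-04


def summarize(rounds):
    """Return (average rounds to 1 decimal, first-try passes, report count)."""
    average = round(sum(rounds) / len(rounds), 1)
    first_try = sum(1 for r in rounds if r == 1)  # a 1-round report passed on the first check
    return average, first_try, len(rounds)


print(summarize(before))  # (3.2, 0, 6)
print(summarize(after))   # (1.2, 4, 5)
```

The "before" average is 19 / 6 ≈ 3.17, which rounds to the 3.2 shown above; the "after" average is 6 / 5 = 1.2, and 4 of 5 first-try passes is the 80% figure.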

The hardest one

Averages alone aren't convincing. So take the report with the scariest density of numbers — the one most likely to derail. It passed in a single round.

LLM ATT&CK Navigator · 06-04 · 1 round · zero corrections

This kind of report almost never used to pass on the first try

It's wall-to-wall precise figures: account sample size, observation counts, the before/after change in attack success rate, a correlation coefficient, two sets of scores side by side, three-axis ARiES values, and more. In the past, the moment numbers got dense, a subagent could always find a few where a row had been transposed or a decimal miscopied during transcription. This time the independent check focused on tracing each number back to the source, and not one needed correcting.

832 accounts · 13,873 observations · 33% → 56% · r = 0.28 · 56.4 vs 46.8 · 3-axis ARiES
