GENAI-PLAYBOOK · INTERNAL METRIC FACT-CHECK-LOOP

The day we switched to Opus 4.8,
fact-checking went from
endless rework to one pass.

Over the past two months I ran the same fact-check loop on 11 reports: a fresh subagent re-reads every claim against the source in a clean context, I fix whatever it flags, then another fresh subagent re-checks, until zero issues remain. Every round got written into the project's STATUS file as it happened. These aren't benchmark numbers. They're a ledger that piled up inside real work.
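The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual tooling: `Issue`, `LoopResult`, `fact_check_loop`, and the `check`/`fix` callables are hypothetical names, the `max_rounds` cap is an added safety assumption, and the toy `check` stands in for spawning a fresh subagent with a clean context.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Issue:
    claim: str
    kind: str  # "factual" or "wording", mirroring the two categories in the ledger


@dataclass
class LoopResult:
    rounds: int = 0
    issues_per_round: List[int] = field(default_factory=list)

    @property
    def first_try_pass(self) -> bool:
        return self.rounds == 1


def fact_check_loop(
    report: str,
    source: str,
    check: Callable[[str, str], List[Issue]],  # hypothetical: one fresh checker per call
    fix: Callable[[str, List[Issue]], str],    # hypothetical: apply corrections
    max_rounds: int = 10,                      # assumption: cap so the loop cannot run forever
) -> LoopResult:
    result = LoopResult()
    for _ in range(max_rounds):
        result.rounds += 1
        issues = check(report, source)
        result.issues_per_round.append(len(issues))  # what a STATUS entry would record
        if not issues:
            return result  # zero issues: the report passes
        report = fix(report, issues)
    raise RuntimeError(f"still failing after {max_rounds} rounds")


# Toy stand-ins: flag any report line that does not appear verbatim in the source.
def toy_check(report: str, source: str) -> List[Issue]:
    return [Issue(line, "factual") for line in report.splitlines()
            if line and line not in source]


def toy_fix(report: str, issues: List[Issue]) -> str:
    flagged = {issue.claim for issue in issues}
    return "\n".join(line for line in report.splitlines() if line not in flagged)


source = "Widgets shipped: 120.\nDefect rate: 2%."
draft = "Widgets shipped: 120.\nDefect rate: 5%."
result = fact_check_loop(draft, source, toy_check, toy_fix)
print(result.rounds, result.issues_per_round, result.first_try_pass)
```

In this toy run the first round flags one line and the second round comes back clean, so the recorded round count is 2. A "first-try pass" in the ledger corresponds to `rounds == 1`.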

Avg. rounds to pass
3.2 → 1.2
Before: all six reports took 3-4 rounds. After: 1-2 rounds.
Passed on first try
0/6 → 4/5
From none at all to 80% clean on round one
The dividing line
05-29
After this date, the reports changed character
01

The cliff speaks for itself

The x-axis lays the 11 reports out by date; the y-axis is how many rounds each one took to reach "all pass." The first six bounce around 3-4 rounds. Past the 05-29 line, they hug "one round."

[Chart: rounds to pass per report, by date. Y-axis: 1 to 4 rounds. X-axis: 05-15, 05-21, 05-22, 05-26 (×2), 05-28, 05-29, 05-30, 06-01, 06-03, 06-04. Before 4.8: avg 3.2 rounds. After 4.8: avg 1.2 rounds. Dividing line: Opus 4.8, 05-29.]
02

Report by report

Every round count comes from that report's own STATUS record, not assembled after the fact. How many issues each round caught, and whether each was a factual error or imprecise wording, was written down at the time.

Date   Report                               Rounds  First-try pass  What the rounds fixed
05-15  Failure modes of AI-native startups  3       -               Zero issues only by round 3
05-21  Claude Code on large codebases       4       -               Not clean until round 4
05-22  Anthropic rebuilds its sales org     3       -               3 rounds
05-26  Agent-friendly CLI design            3       -               5 fixes
05-26  Agent-native verification            3       -               4 issues: 1 factual + 3 imprecise wording
05-28  Zero Trust for AI agents             3       -               4 corrections
━━━━ SWITCHED TO OPUS 4.8 · 2026-05-29 ━━━━
05-29  Securing source code with LLMs       1       PASS            One round, zero issues
05-30  Dynamic workflows                    1       PASS            Fresh agent re-fetched the official pages line by line, 0 errors
06-01  How we contain Claude                1       PASS            Verbatim-checked 15 English quotes + every number/architecture detail, zero issues
06-03  Dynamic workflows · patterns         2       -               R1 fixed 1 truncated quote, R2 passed
06-04  LLM ATT&CK Navigator                 1       PASS            Packed with exact numbers, zero corrections
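The headline figures can be re-derived from the per-report round counts in the ledger. A minimal sketch, with the counts copied from the entries above; `summarize` is a hypothetical helper name:

```python
from statistics import mean

# Round counts per report, in ledger order.
before = [3, 4, 3, 3, 3, 3]  # 05-15 through 05-28
after = [1, 1, 1, 2, 1]      # 05-29 through 06-04


def summarize(rounds):
    """Return (average rounds to 1 decimal, first-try passes, report count)."""
    first_try = sum(1 for r in rounds if r == 1)
    return round(mean(rounds), 1), first_try, len(rounds)


print(summarize(before))  # (3.2, 0, 6)
print(summarize(after))   # (1.2, 4, 5)
```

The pre-switch average is 19/6 ≈ 3.2 rounds, the post-switch average is 6/5 = 1.2 rounds, and 4 of 5 first-try passes is the 80% quoted above.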
03

The hardest one

Averages alone aren't convincing, so take the report with the highest density of numbers, the one most likely to go wrong. It passed in a single round.

LLM ATT&CK Navigator · 06-04 · 1 round · zero corrections

This kind of report almost never used to pass on the first try

It's wall-to-wall precise readings: account sample size, observation counts, the before/after change in attack success rate, a correlation coefficient, two sets of scores side by side, three-axis ARiES values, and more. In the past, the moment numbers got dense, a subagent could always find a few where a row had been transposed or a decimal miscopied during transcription. This time the independent check focused on tracing each number back to the source, and not one needed correcting.

832 accounts · 13,873 observations · 33% → 56% · r = 0.28 · 56.4 vs 46.8 · 3-axis ARiES
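As a rough illustration of what "tracing each number back to the source" can mean mechanically, here is a minimal sketch. It is not the subagent's actual method: `numbers_in` and `untraced_numbers` are hypothetical helpers, and the two sentences are toy text built around the figures listed above, with one digit pair deliberately transposed in the report.

```python
import re

# Integers with optional thousands separators, optional decimals, optional percent sign.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def numbers_in(text: str) -> set:
    """Collect numeric tokens, normalizing away thousands separators."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def untraced_numbers(report: str, source: str) -> list:
    """Numbers that appear in the report but nowhere in the source."""
    return sorted(numbers_in(report) - numbers_in(source))


# Toy text (assumption): not quoted from the real report or its source.
source = ("We observed 832 accounts and 13,873 observations; "
          "success rose from 33% to 56% (r = 0.28).")
report = ("Across 832 accounts and 13,837 observations, "
          "success rose from 33% to 56% (r = 0.28).")

print(untraced_numbers(report, source))  # ['13837']
```

A presence check like this only catches numbers that appear nowhere in the source. It cannot tell whether a correct number is attached to the wrong claim, which is why the check in the ledger is still a read-through against the source in context.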