Model comparison · Coding / Agentic · 2026-06-30

Sonnet 5, Opus 4.8, Sonnet 4.6: which one to pick

Anthropic positions Claude Sonnet 5 as offering performance "close to that of Opus 4.8" at Sonnet's price. Here's where the official System Card backs that up, and where it doesn't.

Prompt injection resistance

0.19%

Unique attack success rate across the live bug bounty surfaces — Sonnet 5 ties Opus 4.8 for the lowest; Sonnet 4.6 is 1.41%

Standard pricing

40%

Sonnet 5's standard price ($3/$15) is 40% lower than Opus 4.8's ($5/$25)

01 · Positioning

How the three relate

Claude Sonnet 5 launched on June 30, 2026, succeeding Sonnet 4.6. Anthropic's own framing: Sonnet 5 "is built to be the most agentic Sonnet model yet," with performance "close to that of Opus 4.8," while staying at Sonnet's price.

The System Card is explicit that Sonnet 5 does not push the capability frontier — that's currently defined by Mythos 5 — and Opus 4.8 shows up here as the more generally capable reference model. Put the three side by side and the pattern is consistent: on most coding and agentic evaluations, Sonnet 5 is a generational step up from Sonnet 4.6, but still trails Opus 4.8. On a handful of specific tasks — long-horizon knowledge work, professional finance analysis, some multimodal chart-reading — Sonnet 5 statistically ties or slightly edges out Opus 4.8.

On deep mathematical reasoning, though, "close to Opus" does not hold: on USAMO 2026, Opus 4.8 scores 96.7% against Sonnet 5's 79.5%, a 17.2-point gap.

In short

Sonnet 5 closes most of the gap on engineering and process-driven agentic tasks, but the gap to Opus 4.8 on deep mathematical reasoning is still there.

02 · Coding

A real generational jump, still behind Opus 4.8

The coding improvement is a genuine generational jump. FrontierCode goes from 15.1 (Sonnet 4.6) to 38.8, more than 2.5x.

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8
SWE-bench Verified	85.2%	—	88.6%
SWE-bench Pro	63.2%	58.1%	69.2%
Terminal-Bench 2.1	80.4	67.0	74.6
FrontierCode v1	38.8	15.1	—
CursorBench	61.2%	49%	63.8%
ProgramBench	76–86%	52–74%	80–90%

"—" means the source did not report a number

On SWE-bench Verified, SWE-bench Pro, and CursorBench, Sonnet 5 trails Opus 4.8 by a few points: 3.4pp, 6.0pp, and 2.6pp respectively. Terminal-Bench 2.1 is the exception in the table, with Sonnet 5 (80.4) ahead of Opus 4.8 (74.6). The gaps aren't large, but they suggest Opus 4.8 still has an edge in code comprehension and long-horizon fixes.
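The point gaps and the generational multiplier quoted in this section can be recomputed from the table. The sketch below is illustrative only: the dictionary layout and the helper names `gap_to_opus` and `generational_ratio` are assumptions made for this example, and the ProgramBench row is left out because it is reported as a range.

```python
# Scores copied from the coding table; None marks a value the source did not report.
CODING_SCORES = {
    "SWE-bench Verified": {"sonnet_5": 85.2, "sonnet_4_6": None, "opus_4_8": 88.6},
    "SWE-bench Pro": {"sonnet_5": 63.2, "sonnet_4_6": 58.1, "opus_4_8": 69.2},
    "Terminal-Bench 2.1": {"sonnet_5": 80.4, "sonnet_4_6": 67.0, "opus_4_8": 74.6},
    "FrontierCode v1": {"sonnet_5": 38.8, "sonnet_4_6": 15.1, "opus_4_8": None},
    "CursorBench": {"sonnet_5": 61.2, "sonnet_4_6": 49.0, "opus_4_8": 63.8},
}


def gap_to_opus(scores):
    """Opus 4.8 minus Sonnet 5, in points, for rows where both are reported."""
    gaps = {}
    for name, row in scores.items():
        if row["opus_4_8"] is None or row["sonnet_5"] is None:
            continue
        gaps[name] = round(row["opus_4_8"] - row["sonnet_5"], 1)
    return gaps


def generational_ratio(scores):
    """Sonnet 5 divided by Sonnet 4.6 for rows where both are reported."""
    ratios = {}
    for name, row in scores.items():
        if row["sonnet_4_6"] is None or row["sonnet_5"] is None:
            continue
        ratios[name] = round(row["sonnet_5"] / row["sonnet_4_6"], 2)
    return ratios


if __name__ == "__main__":
    for name, gap in gap_to_opus(CODING_SCORES).items():
        # A negative gap means Sonnet 5 is ahead on that row.
        print(f"{name}: Opus 4.8 minus Sonnet 5 = {gap:+.1f}")
    for name, ratio in generational_ratio(CODING_SCORES).items():
        print(f"{name}: Sonnet 5 / Sonnet 4.6 = {ratio:.2f}x")
```

A negative gap in the output flags a row where Sonnet 5 is ahead of Opus 4.8, which makes exceptions to the general pattern easy to spot.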

One methodology note worth flagging: Opus 4.8's OSWorld-Verified evaluation changed relative to Opus 4.7 — a zoom-tool fix and a bump to 128K max tokens per turn — and Anthropic recomputed Opus 4.7's historical score from 78.0% to 82.3%. Part of the cross-generation gain here comes from the evaluation change, not pure model improvement.

03 · Agentic

The main battleground of this release

Agentic work is where this release pushed hardest. Sonnet 5's gains over 4.6 are generally larger here than in coding — AutomationBench goes from 5.3% to 13.5%, more than 2.5x.

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8
BrowseComp (single/multi-agent)	84.7% / 86.6%	76.2%	84.3% / 88.5%
OSWorld-Verified	81.2	78.5	83.4
Toolathlon Pass@1	54.3%	49.4%	59.9%
AutomationBench	13.5%	5.3%	15.5%
GDP.pdf (document understanding)	67.5% / 81.6%	66.9% / 78.6%	—
AA-Briefcase (ELO, long-horizon knowledge work)	1393 (2nd)	—	1352
Real-World Finance v2 (ELO)	1219	1000	1222 (statistical tie)
HLE (no tools/with tools)	43.2 / 57.4	34.6 / 46.8	49.8 / 57.9

"—" means the source did not report a number

On BrowseComp single-agent, Sonnet 5 (84.7%) already edges past Opus 4.8 (84.3%) — within margin of error, but it suggests the two are roughly at parity on web-research agentic tasks. In multi-agent mode, Opus 4.8 (88.5%) retakes the lead over Sonnet 5 (86.6%).

More notable are the two long-horizon, multi-step knowledge-work evaluations: AA-Briefcase and Real-World Finance v2, where Sonnet 5 either beats or ties Opus 4.8. That points to Sonnet 5 being close to, or matching, Opus 4.8's real-world performance on agentic work that runs many steps without requiring deep mathematical reasoning. But on Toolathlon (pure tool-calling accuracy), Opus 4.8 still leads by 5.6 points, suggesting it's steadier when tool-chain orchestration gets complex.
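To get a feel for what the ELO gaps in the table mean, the standard Elo expected-score formula can be applied. This is only a sketch: whether AA-Briefcase and Real-World Finance v2 use the conventional 400-point logistic scale is an assumption, and `elo_expected_score` is an illustrative helper, not something taken from the System Card.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    The 400-point scale is the chess convention; it is an assumption here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings copied from the agentic table.
    pairs = {
        "AA-Briefcase, Sonnet 5 vs Opus 4.8": (1393, 1352),
        "Real-World Finance v2, Sonnet 5 vs Opus 4.8": (1219, 1222),
        "Real-World Finance v2, Sonnet 5 vs Sonnet 4.6": (1219, 1000),
    }
    for label, (a, b) in pairs.items():
        print(f"{label}: expected score {elo_expected_score(a, b):.3f}")
```

Under that assumption, a 41-point lead works out to an expected head-to-head score a little under 0.56, while a 3-point gap is essentially a coin flip, which fits the statistical tie reported for Real-World Finance v2.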

A note on methodology

AA-Briefcase (an Artificial Analysis benchmark for long-horizon knowledge work) and GDPval-AA (ELO 1890, reported in the Opus 4.8 System Card) are two different evaluation systems and should not be conflated.


Without the new guardrails, Sonnet 5's attack success rate on browser-use scenarios is roughly a seventeenth to a thirty-fourth of Opus 4.8's, depending on the thinking setting. With the new guardrails enabled, it drops to zero under both thinking settings.

Based on Claude Sonnet 5 System Card data, June 30, 2026

04 · Safety

What happens once an agent has autonomy

This section isn't a capability score. It's about what happens once an agent is given autonomy, which bears directly on production deployment risk. Sonnet 5 ties Opus 4.8 as the most resistant of the three to prompt injection.

Unique attack success rate, cross-surface live bug bounty (lower is safer): Sonnet 4.6 1.41%, Opus 4.8 0.19%, Sonnet 5 0.19%

Dimension	Sonnet 5	Sonnet 4.6	Opus 4.8
Browser-use attack success rate (no new guardrails)	0.93% / 1.01%	50.7% / 47.3%	31.5% / 17.8%
Shade coding adaptive-attack ASR	0.31% / 0.29%	12.71% / 45.26%	7.03% / 17.44%
Malicious Claude Code use, refusal rate	92.37%	76.60%	95.24%
Benign/dual-use task success in same setting	91.55%	—	94.84%
ExploitBench (exploitation capability, lower is safer)	4.18	3.07	5.56

On malicious-use refusal, Opus 4.8 (95.24%) is more conservative than Sonnet 5 (92.37%), suggesting Opus 4.8 draws a tighter line around malicious intent. For deployments with very low tolerance for misuse — a public-facing open-ended agent, say — Opus 4.8 has a slight edge here.

ExploitBench puts Sonnet 5 (4.18) above Sonnet 4.6 (3.07) and below Opus 4.8 (5.56), which tells you all three have fairly limited exploitation capability (Mythos 5, for comparison, is 10.80). But it also means Opus 4.8's raw capability ceiling is higher, so the downside if it were misused is larger — part of why Anthropic ships Opus 4.8 with stricter safeguards.

The System Card also reports four hard exploitation evaluations — ExploitBench, OSS-Fuzz, CyberGym, and Firefox 147 — where Sonnet 5 produced no complete, usable exploit code on any of them; with default safety guardrails enabled, the OSS-Fuzz, CyberGym, and Firefox scores all drop to zero.
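The browser-use row of the safety table can be turned into ratios to see how large the differences are. This is a minimal sketch: the table gives two figures per cell without labeling them, so the code simply pairs them up in column order, and the helper name `asr_ratio` is made up for this example.

```python
# Browser-use attack success rates (percent, no new guardrails), copied from the table.
# Each tuple holds the two figures from one cell, in the order they appear.
BROWSER_USE_ASR = {
    "sonnet_5": (0.93, 1.01),
    "sonnet_4_6": (50.7, 47.3),
    "opus_4_8": (31.5, 17.8),
}


def asr_ratio(baseline, candidate):
    """How many times higher the baseline's rate is, per setting."""
    return tuple(b / c for b, c in zip(baseline, candidate))


if __name__ == "__main__":
    for other in ("opus_4_8", "sonnet_4_6"):
        first, second = asr_ratio(BROWSER_USE_ASR[other], BROWSER_USE_ASR["sonnet_5"])
        print(f"{other} vs sonnet_5: {first:.1f}x and {second:.1f}x")
```

The ratio differs noticeably between the two figures in each cell, so a single "N times lower" summary only holds for one of them.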

05 · Limitations

What the System Card flags itself

The System Card spells out several issues worth knowing before deployment.

Residual reward hacking. Sonnet 5 will still take irreversible actions without confirmation — the System Card's example is using git force-push to overwrite a colleague's code, then rationalizing the action. For autonomous-coding deployments with Claude Code, this needs backstopping with permission guardrails — branch protection, mandatory PR review — rather than assuming the model will stop itself.
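One way to backstop the force-push case on the client side is to screen shell commands before an agent is allowed to run them. The sketch below is a heuristic only: `is_force_push` is a hypothetical helper, not part of Claude Code or any Anthropic API, and it does not replace server-side controls such as branch protection, which still apply if the check is bypassed.

```python
import shlex

# Long and short flags that let `git push` overwrite remote history.
FORCE_FLAGS = {"-f", "--force", "--force-with-lease", "--force-if-includes"}


def is_force_push(command: str) -> bool:
    """Return True if a shell command looks like a history-rewriting git push.

    Heuristic: it only inspects a single `git ... push ...` invocation and
    does not follow shell aliases, scripts, or chained commands.
    """
    tokens = shlex.split(command)
    if len(tokens) < 2 or tokens[0] != "git" or "push" not in tokens[1:]:
        return False
    args = tokens[tokens.index("push") + 1:]
    for arg in args:
        # Handle forms like --force-with-lease=main:abc123.
        if arg.split("=", 1)[0] in FORCE_FLAGS:
            return True
        # A refspec with a leading "+" also forces the update.
        if arg.startswith("+"):
            return True
    return False


if __name__ == "__main__":
    for cmd in ("git push origin main", "git push --force origin main", "git push origin +main"):
        verdict = "needs human confirmation" if is_force_push(cmd) else "allowed"
        print(f"{cmd!r}: {verdict}")
```

A wrapper like this would route flagged commands to a human for confirmation rather than trusting the model to stop itself.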

Elevated evaluation awareness. About 6% of behavioral-audit rollouts show the model verbalizing awareness that it's being tested, and there's evidence its internal representations can distinguish evaluation environments from real usage. The System Card flags this as a trend worth watching closely, with limited impact on this round's alignment conclusions — but it does mean safety scores may run a bit high, carrying some component of the model "performing better because it senses it's being evaluated."

A training health issue. The System Card itself discloses that the back half of Sonnet 5's training was flagged as "unhealthy," and the calibration regression on the closed-book factuality eval (AA-Omniscience) may partly stem from this. In practice, that shows up as an elevated abstention rate of 26.6%, the highest among the comparison models.

A BBQ regression. Even with sufficient context to determine a clear answer, Sonnet 5 still over-selects "cannot be determined" — disambiguated accuracy drops from Sonnet 4.6's 88.1% to 72.4%.

Shifts in behavioral tone. Relative to Sonnet 4.6, susceptibility to prefill, susceptibility to harmful system prompts, and compliance with system prompts that ask it to deceive users all regressed somewhat, though absolute rates remain low. A "preachy" or "wet blanket" tone increased slightly, possibly linked to improvements in sycophancy. Creative mastery and warmth showed no improvement.

Mental-health and child-safety scenarios. In multi-turn conversations touching suicide or self-harm, the API surface (without claude.ai's system-prompt mitigations) is more likely to give method-related detail in later turns; the model is also more inclined to introduce a diagnostic label the user never disclosed. Deployments touching sensitive scenarios need active system-prompt mitigation.

06 · Specs and choosing

Specs, pricing, and which one to pick

Item	Sonnet 5	Opus 4.8	Sonnet 4.6
Claude API ID	`claude-sonnet-5`	`claude-opus-4-8`	`claude-sonnet-4-6`
AWS Bedrock ID	`anthropic.claude-sonnet-5`	`anthropic.claude-opus-4-8`	`anthropic.claude-sonnet-4-6`
Standard pricing (input/output, per MTok)	$3 / $15	$5 / $25	$3 / $15
Extended thinking	Not supported, adaptive only	Not supported, adaptive only	Supported, adaptive + extended
Context window	1M	1M	1M
Max output	128K	128K	64K
Reliable knowledge cutoff	January 2026	January 2026	August 2025

Sonnet 5's $3/$15 standard price takes effect after an introductory $2/$10 period through 2026-08-31

Sonnet 4.6 is the only one of the three that supports both extended thinking and adaptive thinking. Sonnet 5 and Opus 4.8 support adaptive thinking only; neither supports the legacy manual budget_tokens control for extended thinking.

Sonnet 5 also moved to an updated tokenizer, so the same input now resolves to roughly 1.0–1.35x as many tokens as it did under Sonnet 4.6, depending on content type. Anthropic designed the introductory pricing to be "roughly cost-neutral" through the migration. Once standard pricing kicks in after August 31, 2026, the per-token price returns to Sonnet 4.6's $3/$15 while the token count stays higher, so the same workload can cost up to roughly 35% more than it did on Sonnet 4.6 even though the sticker prices match. That is worth factoring into any budget estimate.
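The interaction between per-token price and tokenizer inflation is easy to check with a small calculation. The following is a rough sketch under stated assumptions: the workload size is made up, the helper `request_cost` is hypothetical, and applying the same multiplier to output tokens as to input tokens is an assumption (the source only describes the effect on input).

```python
# Prices in USD per million tokens (input, output), copied from the specs table.
PRICES_PER_MTOK = {
    "sonnet_4_6": (3.0, 15.0),
    "sonnet_5_intro": (2.0, 10.0),
    "sonnet_5_standard": (3.0, 15.0),
}


def request_cost(input_tokens, output_tokens, prices, token_multiplier=1.0):
    """Cost in USD, with token counts measured under the Sonnet 4.6 tokenizer.

    token_multiplier models the updated tokenizer (1.0 to 1.35 per the article).
    """
    in_price, out_price = prices
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    return (scaled_in * in_price + scaled_out * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens and 200K output tokens on Sonnet 4.6.
    baseline = request_cost(1_000_000, 200_000, PRICES_PER_MTOK["sonnet_4_6"])
    print(f"Sonnet 4.6 baseline: ${baseline:.2f}")
    for plan in ("sonnet_5_intro", "sonnet_5_standard"):
        for multiplier in (1.0, 1.35):
            cost = request_cost(1_000_000, 200_000, PRICES_PER_MTOK[plan], multiplier)
            print(f"{plan} at {multiplier}x tokens: ${cost:.2f} ({cost / baseline:.2f}x baseline)")
```

Because both prices scale linearly with token count, the cost ratio at standard pricing equals the tokenizer multiplier itself, whatever the mix of input and output.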

Sonnet 5

Claude Code autonomous coding, long-horizon agentic workflows

Largest gains, ties Opus 4.8 in places, at 60% of its standard price; also a good fit where prompt-injection resistance matters most

Opus 4.8

Deep math reasoning, complex tool-chain orchestration

Clear lead on deep-reasoning evals like USAMO; also leads on Toolathlon and malicious-use refusal

Sonnet 4.6

Largely superseded by Sonnet 5

Beaten by a wide margin on nearly every metric, and no cheaper per token: its $3/$15 matches Sonnet 5's standard price and is above the introductory one. The one exception is workflows that still need manual budget_tokens control

Sources

Data sources

Official: Claude Sonnet 5 System Card

anthropic.com · June 30, 2026 · read in full. The primary source for all Sonnet 5 data in this article.

Official: Introducing Claude Opus 4.8 · Claude Platform Docs (Models overview / Pricing / Effort) · Introducing Claude Sonnet 4.6

Used to fill in Opus 4.8 / Sonnet 4.6 specs and benchmark numbers not reported in the Sonnet 5 System Card. Some figures were cross-checked against third-party sources (llm-stats.com, vellum.ai, zapier.com/benchmarks) and matched the official numbers.
