Anthropic Security · Field guide

Finding was never the bottleneck
Fixing is

After finding and fixing vulnerabilities across their own code and open-source software, Anthropic's security team drew one primary takeaway: discovery is now straightforward to parallelize, and the bottleneck has shifted to verification, triage, and patching.

Vulnerabilities disclosed: 1,596
Patched, to their knowledge: 97 (≈ 6%)

From scanning open source software, as of May 22, 2026. This gap is the whole problem the guide addresses — discovery is cheap to parallelize; the bottleneck moved to everything that comes after. The article's spine is a loop: a one-time investment that powers a repeating defender's loop.

Source: Anthropic, "Using LLMs to secure source code", 2026-05-27 (Eugene Yan, Henna Dattani). This is a structured deep-dive; the original blog wording and quotes are preserved verbatim.

The find-and-fix loop

Two one-time setup steps, four you run on repeat

Teams finding and fixing the most vulnerabilities converged on a variation of existing best practices. The first two steps are a one-time setup (done once per codebase, revisited when the system changes); the next four are the loop you run against the source.

One-time investment: Threat model → Sandbox

Defender's loop (repeating):
03 Discovery · find → optimize recall
04 Verification · confirm → optimize precision
05 Triage · rank → dedup & rate
06 Patching · fix ↩ close the loop

On cadence: the first run surfaces the most findings; later runs are fewer but often more complex. But don't expect the nth run to return zero — models are stochastic, and a large codebase has a long tail of vulnerabilities that trickle in even when the code is unchanged. On the first iteration, run the loop multiple times, deciding when to stop based on net-new findings and your risk tolerance; afterward, scan periodically or whenever the code meaningfully changes.
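A minimal sketch of the stopping rule described above, assuming each finding carries a stable identifier so that repeats across runs can be recognized. The function name `run_until_saturated` and the parameters `min_new`, `patience`, and `max_runs` are hypothetical knobs for expressing risk tolerance, not part of any harness.

```python
from typing import Callable, Iterable, Set


def run_until_saturated(
    scan_once: Callable[[], Iterable[str]],
    min_new: int = 1,
    patience: int = 2,
    max_runs: int = 10,
) -> Set[str]:
    """Repeat scans until `patience` consecutive runs each add fewer than `min_new` findings."""
    seen: Set[str] = set()
    quiet_runs = 0
    for _ in range(max_runs):
        new = set(scan_once()) - seen  # net-new findings from this run
        seen |= new
        quiet_runs = quiet_runs + 1 if len(new) < min_new else 0
        if quiet_runs >= patience:
            break
    return seen
```

Requiring more than one quiet run before stopping reflects the point that a single empty run is weak evidence when the scanner is stochastic.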

Threat Model · Setup

The #1 cause of false positives: the model doesn't understand your trust boundaries

The model might flag code as vulnerable because it assumes a client could send corrupted values or an attacker could control the config — even though those inputs are trusted in your environment. Conversely, it might assume an internet-facing service is internal-only and under-report true vulnerabilities. In both cases, the model is wrong about the threat model, not the code.

Field note

When the threat model was well-defined, the model's findings "were exploitable 90 percent of the time."

— One team: the model performed best on systems with well-documented threat models, design docs, requirements, and constraints

Build the threat model in two steps

① Bootstrap from the code, docs, and vulnerability history. Feed the model what you'd hand a new security engineer on day one: architecture docs, wikis, entry points, git history, past vulnerabilities. Then have it produce a threat model with system context, assets, entry points, and trust boundaries, cluster past bugs, and list the relevant vulnerability classes. Document what you do and don't care about — and why.

Field note · cheat-code

"'What have people exploited in the past' is sometimes a much easier cheat-code towards success than 'find me vulnerabilities in this codebase.'"

— One team distilled hundreds of past CVEs and security-fix commits into "bug-shape" hints, asked the model "was the fix complete, and applied everywhere else?", and found three exploitable issues in an hour

② Have the model interview someone who knows the system. Consider Shostack's four questions: What are we building? What can go wrong? What are we doing about it? Did we do a good job? Run bootstrap first so the interviewee starts from a draft, not from scratch. The interview is optional, but adds context the model can't get from code or docs.

A few practices that make a big difference

Consider your dependencies' security policies. Many publish one (e.g. vLLM's security.md, SQLite's "Defense Against the Dark Arts", ImageMagick's policy). Reuse them directly instead of rebuilding that analysis from scratch.
Name what is trusted. If you trust config files or authenticated clients, document it — these assumptions separate non-exploitable bugs from actual exploits.
Include a THREAT_MODEL.md with the code. Keep it in the repo and update it as code changes; the discovery agent reads it before searching, skipping known non-issues.

You'll use it in two places: in discovery as scope (partition, prioritize, skip what's out of scope); in triage as a filter (calibrate severity to your system).
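The two uses can be sketched as a pair of small functions over a structured view of the threat model. This is an illustration under assumptions: the `ThreatModel` and `Finding` classes and their fields are hypothetical, and a real THREAT_MODEL.md is prose that the agents read rather than a data structure.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ThreatModel:
    trusted_inputs: Set[str] = field(default_factory=set)   # e.g. {"config_file"}
    out_of_scope: List[str] = field(default_factory=list)   # path prefixes to skip


@dataclass
class Finding:
    title: str
    path: str
    input_source: str  # where the attacker-controlled data is claimed to come from


def in_scope(tm: ThreatModel, path: str) -> bool:
    """Discovery use: skip paths the threat model marks as out of scope."""
    return not any(path.startswith(prefix) for prefix in tm.out_of_scope)


def split_by_trust(
    tm: ThreatModel, findings: List[Finding]
) -> Tuple[List[Finding], List[Finding]]:
    """Triage use: separate findings that depend on an input documented as trusted."""
    untrusted = [f for f in findings if f.input_source not in tm.trusted_inputs]
    trusted = [f for f in findings if f.input_source in tm.trusted_inputs]
    return untrusted, trusted
```

Findings in the second list are set aside for a lower severity rating rather than deleted, so a wrong trust assumption can still be caught in review.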

good context of the code, but not good context of us.

— One team had a 40% false positive rate: findings were reproducible and PoCs proved exploitability, yet the code owners dismissed them because they didn't fit the project's threat model. A CISO summed up the model's blind spot

threat-model skill

bootstrap derives a draft from your code, CVEs, and git history; interview refines it via Shostack's four questions. Output: a THREAT_MODEL.md used in Discovery and Triage.

Sandbox · Setup

The sandbox does two jobs: protect your systems, and prove exploitability

Job one · protect your systems. To let models run safely and autonomously, you need a strong isolation layer. Without it, the agent may overshoot the target and do something unexpected.

One team told the model it had no network access—when it actually did—and the model discovered it could fetch from GitHub anyway. Another team observed an agent answer a GitHub issue mid-scan.

— Constraints must be enforced via code and configuration, not by telling the model

Match isolation to your threat model: a container is fine for the discovery agent reading code, but run the target and its PoCs in a microVM (like Firecracker) or a full VM with egress locked down. And never make credentials (~/.aws, ~/.ssh, .env) available to the agent.
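One way to enforce the credentials rule in code rather than by instruction is to check every host path before it is mounted into the sandbox. The sketch below is a name-based heuristic under assumptions: the function `risky_mounts` is hypothetical, it assumes POSIX-style paths, and it does not inspect directory contents, so a project directory that contains a `.env` file would still need a separate check.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

SENSITIVE_DIRS = {".aws", ".ssh"}   # directory names taken from the guide's examples
SENSITIVE_FILES = {".env"}


def risky_mounts(host_paths: Iterable[str], home: str) -> List[str]:
    """Return host paths that would expose credentials if mounted into the sandbox."""
    home_path = PurePosixPath(home)
    risky = []
    for raw in host_paths:
        p = PurePosixPath(raw)
        # The path is, or sits inside, a known credential location.
        names_secret = bool(set(p.parts) & SENSITIVE_DIRS) or p.name in SENSITIVE_FILES
        # Mounting the home directory or any ancestor exposes everything beneath it.
        contains_home = p == home_path or p in home_path.parents
        if names_secret or contains_home:
            risky.append(raw)
    return risky
```

A harness could refuse to start the agent whenever this list is non-empty.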

Give the sandbox network access only while setting it up

① net on · pull deps / build / install tools
② net on · deploy target / run existing tests
③ net on · snapshot the environment
④ net off · remove network access
⑤ net off · scan: allow model API only (local proxy)

Load the snapshot at the start of each run so every scan begins from the same clean slate.

Job two · prove exploitability. During static scanning the model can only hypothesize what might break — it can't test reachability or compensating controls. When teams built a sandbox where the agent could compile, run tests, and detonate a PoC, non-exploitable findings dropped significantly.

Field note · 6 weeks in

"the biggest efficacy lever has been giving the model test beds, live systems, and running the PoCs."

— One offensive-security team: it's only a true positive if the agent can build a PoC and run it on the test bed

Pin as much as you can — image tags, commit SHAs, dependencies, build commands — so every run uses the same code in the same environment.
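A pre-flight check for pinning might look like the sketch below. Several choices here are assumptions rather than anything the guide prescribes: the function name `unpinned` is hypothetical, images are checked for a sha256 digest (stricter than a tag, since tags can be repointed), commits are expected to be full 40-character SHA-1 hashes, and dependencies are assumed to be pip-style requirement lines pinned with `==`.

```python
import re
from typing import List

DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned(image: str, commit: str, requirements: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means everything is pinned."""
    problems = []
    if not DIGEST_RE.search(image):
        problems.append(f"image not pinned by digest: {image}")
    if not FULL_SHA_RE.match(commit):
        problems.append(f"commit is not a full SHA: {commit}")
    for line in requirements:
        spec = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if spec and "==" not in spec:
            problems.append(f"dependency not pinned: {spec}")
    return problems
```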

they now build Docker containers with dependencies pinned to match production, so the finding agent and the verification agent operate on the same artifacts an attacker would.

— One team's "vulnerability" turned out to be the agent downloading an older library than was deployed, caught by an engineer reading the transcript

Make the sandbox faithful enough to production: excluding dependencies (a queue or datastore) under-reports bugs; ignoring production defenses (a WAF or auth gateway) over-reports findings prod already mitigates.

⚡ Pragmatic out

If a representative sandbox is impractical (cloud dependencies, data stores, etc.), start with the discovery step instead. Frontier models are good at finding vulnerabilities from just analyzing source code — several teams, including Anthropic's own, found this effective. The trade-off is in verification: without a running target you can't prove findings with a PoC, so budget more time there. You can invest in the sandbox later, once the volume of findings justifies it.

Reference implementation: the harness README provides a reference sandbox — agents and targets run in gVisor-isolated containers with egress locked to the model API, the target built from a Dockerfile pinned to a specific commit, with setup_sandbox.sh handling setup.

Discovery · Loop · recall

Rich context, shorter prompts, useful tools

Give the discovery agent context it can load as needed: the threat model, architecture docs, results of past scans. When it understands your trust boundaries and how the system is actually deployed, it identifies vulnerabilities specific to your system.

⚡ Counterintuitive · shorter prompts win

Frontier models benefit from increasingly simple prompts during discovery. Counterintuitively, more prescriptive prompts make discovery worse — long checklists reduce the model's creativity and generate fewer novel bugs.

✗ Over-prescriptive

long checklists dictating how to scan
pins down "how to scan"
narrows what the model tries
fewer novel bugs

✓ Goal + context

state the why / what: why you're scanning, what a finding that matters looks like, what system
leave "how to scan" to the model
optionally name a specific vulnerability class (what it does, where it lives)
define structured output + an escape hatch for weak findings
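The goal-plus-context shape can be sketched as a short prompt builder. Everything named here is illustrative: `build_discovery_prompt`, the `FINDING_FIELDS` schema, and the wording of the prompt are assumptions, not the prompt any team actually used.

```python
import json
from typing import Optional

# Hypothetical output schema; the field names are illustrative only.
FINDING_FIELDS = {
    "title": "short summary",
    "file": "path relative to the repo root",
    "line": "integer line number",
    "vulnerability_class": "e.g. injection, auth bypass",
    "evidence": "why this is reachable and attacker-controlled",
    "confidence": "high | medium | low | unproven",
}


def build_discovery_prompt(
    why: str, system: str, what_matters: str, vuln_class: Optional[str] = None
) -> str:
    """State the why and the what; leave the how to the model."""
    parts = [
        f"Why we are scanning: {why}",
        f"System under review: {system}",
        f"A finding that matters: {what_matters}",
    ]
    if vuln_class:
        parts.append(f"Focus on this vulnerability class: {vuln_class}")
    parts.append("Return a JSON list of findings, each with these fields:")
    parts.append(json.dumps(FINDING_FIELDS, indent=2))
    # Escape hatch: weak findings are labeled rather than dropped or overstated.
    parts.append(
        'If evidence is weak, set "confidence" to "unproven". '
        "If nothing qualifies, return an empty list."
    )
    return "\n".join(parts)
```

Note what is absent: there is no checklist of scanning steps, only the goal, the context, and the output contract.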

Give tools — and let the model build its own

Give it grep, glob, and other search/read tools; let it use SAST scanners and fuzzers your team uses; ask what tools a task needs and provide them; and let it build tools as needed — recent frontier models are increasingly good at writing the tools they need.

Field note · ~100% true-positive

the agent didn't need to guess whether a path could be reached and could test each candidate against the running application as it went, improving their true-positive rate to nearly 100 percent.

— One pentesting team gave the discovery agent tools to send requests, check responses, and query traffic logs

Partition + parallelize, don't brute-force

Have the model do a first pass to partition the search space (by attack surface, endpoint, component); feed the partitions to parallel discovery agents so they don't converge on the same shallow bugs; then run a system-level pass with the partition findings as context.
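The fan-out and system-level pass can be sketched with a thread pool, assuming the partitions have already been produced by the first pass. The function `discover` and its callable parameters are hypothetical stand-ins for agent invocations; in a real harness each call would launch a discovery agent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def discover(
    partitions: List[str],
    scan_partition: Callable[[str], List[str]],
    system_pass: Callable[[Dict[str, List[str]]], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Fan out one scan per partition, then run a system-level pass over the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with partitions.
        per_partition = dict(zip(partitions, pool.map(scan_partition, partitions)))
    findings = [f for found in per_partition.values() for f in found]
    # The system-level pass sees every partition's findings as context.
    return findings + system_pass(per_partition)
```

Because each worker is scoped to one partition, adding workers widens coverage instead of sending more agents after the same shallow bugs.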

"We initially tried to just horizontally scale and send more agents, but saw limiting returns."

— Brute-forcing discovery hits diminishing returns fast; another team scaled up parallel agents and got "tons of issues", most duplicates of each other

If you have a sandbox to run the target, ask the agent to build a PoC (a script, a crashing input, a failing test) — it helps the agent pin down the finding and gives the verifier concrete evidence. Findings it can't reproduce can still be reported, flagged as unproven, to keep recall high.

vuln-scan skill

Reads your THREAT_MODEL.md, partitions the target into focus areas, and fans out parallel review agents per area. Output: structured findings the next steps consume directly.

Verification · Loop · precision

Discovery optimizes recall, verification optimizes precision — don't make one agent do both

03 Discovery → Recall: find as many vulnerabilities as possible, even unlikely ones
04 Verification → Precision: exclude findings that are not actually exploitable

When an agent tries to do both in the same step, it can self-censor and exclude exploitable true positives.

We learned this the hard way, where asking discovery agents to also verify findings led to them filtering out true positives that a separate verification step would have confirmed.

— Anthropic

The verifier must be independent

Run the verifier in a fresh container with no shared filesystem or conversation history. If it's exposed to the discovery agent's reasoning, it may simply agree instead of testing the claim. Give it only (1) the PoC or written finding and (2) the codebase — so it can search for mitigations the finder missed (upstream validation, auth gates, type constraints, unreachable code).

If a single pass still lets too many false positives through, run multiple independent verifiers (different angles or models) and take a majority vote; consider a separate judge to arbitrate when discovery and verification disagree. Prompt the verifier to disprove each finding: assume it's a false positive and search for why it's wrong. This matters most when there's no PoC.
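The majority vote can be sketched in a few lines. The verdict labels and the `needs_judge` fallback are assumptions made for illustration; the sketch requires a strict majority and hands ties to a separate judge.

```python
from collections import Counter
from typing import List


def majority_verdict(votes: List[str]) -> str:
    """Combine independent verifier votes; ties and splits go to a separate judge."""
    if not votes:
        raise ValueError("need at least one verifier vote")
    verdict, count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half of all votes must agree.
    return verdict if count * 2 > len(votes) else "needs_judge"
```

Using an odd number of verifiers avoids most ties in a two-label vote.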

An adversarial verifier roughly halved the non-exploitable rate.
Requiring it to also build a confirming PoC brought false positives near zero (≈0).
Together, they significantly cut downstream triage and patching load.

If you can sufficiently reproduce production in a sandbox, have the verifier build and execute a reproducible PoC — if it works, the finding is exploitable. But the inverse isn't true: failure to produce a working PoC is not proof of a false positive.

Field note

"Validation is the biggest holdup and the PoC is the validation."

— One team scanning open-source packages: scan → generate a PoC → deploy a mock app that uses the package and triggers the PoC

Triage · Loop · dedup & rate

The model finds a hundred candidates before lunch — triage is the new bottleneck

Verification confirms a finding is exploitable; triage assesses patching priority. Previously the engineer who found a bug also triaged it; now, with findings exploding, triage is the bottleneck. Too many duplicated or inflated-severity bugs cause alert fatigue — product engineers stop reading them, even the ones that need immediate patching. Open-source maintainers are especially likely to be overwhelmed.

if we send product engineers a pile of findings where a majority are non-exploitable, they will lose trust in the reports and give up.

— A lesson multiple teams shared; others pointed the model at their existing backlog (prior scanners/models, bug-bounty intake) and cleared hundreds of stale items in days

Deduplicate by root cause

First a cheap deterministic pass (same file, same category, line numbers within ten lines of each other), then let a model apply qualitative rules to what remains:

Treat as duplicate

the same root cause worded differently
the same vulnerability at multiple call sites
a missing global protection (e.g. an auth check) reported per endpoint
a cause and its consequence flagged in the same path

Treat as distinct

different vulnerability classes in the same file
different variables reaching different sinks
two independent bugs inside one helper
the same missing check on two endpoints, each needing its own fix

If your harness generates PoCs and patches per finding, another dedup approach: check whether one finding's patch also disarms the PoCs of others.
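The cheap deterministic pass described above (same file, same category, line numbers within ten lines) can be sketched as follows. The `Finding` fields and the function name `cheap_dedup` are hypothetical. One design choice to be aware of: lines are compared to the previous finding in the group, so findings at lines 10, 18, and 26 chain into one group.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    file: str
    category: str
    line: int


def cheap_dedup(findings: List[Finding], window: int = 10) -> List[List[Finding]]:
    """Group findings with the same file and category whose lines are within `window`."""
    groups: List[List[Finding]] = []
    ordered = sorted(findings, key=lambda f: (f.file, f.category, f.line))
    for f in ordered:
        last = groups[-1][-1] if groups else None
        same_bucket = last is not None and (last.file, last.category) == (f.file, f.category)
        if same_bucket and f.line - last.line <= window:
            groups[-1].append(f)   # likely the same root cause
        else:
            groups.append([f])     # start a new candidate group
    return groups
```

Each group then goes to the model, which applies the qualitative duplicate and distinct rules to decide what survives.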

Rate severity on six dimensions

Reachability: Reachable from a real entry point, or only internally?
Attacker control: Does untrusted input reach the sink intact, or is it sanitized upstream?
Preconditions: Needs a non-default setting / feature flag / narrow time window?
Authentication: Triggerable unauthenticated, or needs a logged-in user / admin?
Read vs. write: Can the attacker only read, or also modify?
Blast radius: One user or all? One tenant or the platform?

Critical / High: zero preconditions + unauthenticated remote access
Medium: one or two preconditions, or an authenticated path
Low: three or more preconditions, or local-only

Have the model write out its answer to each question before assigning severity — going through the evidence first keeps it from anchoring on the bug class ("SQL injection, so critical") and inflating to match. Adjust the thresholds to your system.
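A rough sketch of evidence-first rating using the thresholds above. The `Evidence` class and `rate` function are hypothetical, and the order in which overlapping rules are checked (Low before Medium) is an assumption. The sketch records all six answers but uses only the three that the thresholds mention; attacker control, read vs. write, and blast radius would adjust the rating in a fuller version.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    reachable_remotely: bool       # Reachability
    attacker_controls_input: bool  # Attacker control (recorded, not used below)
    preconditions: int             # Preconditions
    requires_auth: bool            # Authentication
    can_write: bool                # Read vs. write (recorded, not used below)
    blast_radius: str              # Blast radius (recorded, not used below)


def rate(e: Evidence) -> str:
    """Map written-out evidence to a severity band; the bug class is never consulted."""
    if e.preconditions >= 3 or not e.reachable_remotely:
        return "Low"
    if e.preconditions >= 1 or e.requires_auth:
        return "Medium"
    return "Critical/High"
```

Because the function takes only evidence, a label such as "SQL injection" cannot inflate the result on its own.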

Models inflate severity when they lack context: they may not know what an attacker controls, or can't see an upstream WAF or auth. The fix is to give the triage agent the threat model — clarifying "we trust authenticated clients" can remove a whole class of criticals.

triage skill

Does both verification and triage: multi-vote verification per finding, deduplication across runs, and re-ranking by derived exploitability. Output: a short, ranked, owned list instead of a raw dump.

Patching · Loop · close the loop

Fix the root cause, find variants, minimize the patch — a human still owns it

Patching closes the loop, and improves the threat model from verified findings while feeding past findings into the next scan. Each cycle hardens the codebase and makes the next scan better informed.

First, write a test that fails (TDD)

Before patching, write a new test that fails with the existing code; implement the fix and confirm it passes without breaking anything else. Without a test, the fix can silently regress, and it's hard to retroactively prove the bug was real.
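As an illustration of test-first patching, the sketch below uses a made-up path traversal bug. Both helpers and the check `traversal_is_blocked` are hypothetical, and `posixpath` is used so the example behaves the same on any platform.

```python
import posixpath


def resolve_upload(base_dir: str, name: str) -> str:
    """Hypothetical vulnerable helper: never checks that the result stays under base_dir."""
    return posixpath.normpath(posixpath.join(base_dir, name))


def resolve_upload_fixed(base_dir: str, name: str) -> str:
    """Candidate fix: reject any resolved path that escapes base_dir."""
    base = posixpath.normpath(base_dir)
    candidate = posixpath.normpath(posixpath.join(base, name))
    if posixpath.commonpath([base, candidate]) != base:
        raise ValueError("path escapes base directory")
    return candidate


def traversal_is_blocked(resolver) -> bool:
    """The regression check: written first, and expected to fail on the existing code."""
    try:
        resolver("/srv/uploads", "../../etc/passwd")
    except ValueError:
        return True
    return False
```

Running the check against `resolve_upload` returns `False`, which is the evidence that the bug was real; against `resolve_upload_fixed` it returns `True`, while a legitimate name such as `a.txt` still resolves.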

Field note

By giving the model feedback to iterate against, patch quality jumped, saving time on human review.

— One pentester: telling the harness to validate patches by re-running the PoC against the patched code turned inconsistent patches consistent

Fix the root cause, then find variants at two levels

A model may narrowly fix one call site; simply prompting it to identify and fix the root cause is effective. Then look for variants: ① same pattern: other call sites or copies of the buggy code; ② same class: a codebase with one SQL injection tends to have more.

Before shipping: adversarial check + minimal patch

Run an adversarial check: a fresh discovery agent probes the patch as an attacker to confirm it's comprehensive. Then simplify the patch — minimal changes are easier to review and less likely to introduce new bugs. Prompt for the smallest change that fixes the root cause: no refactoring, no drive-by cleanups, no reformatting.

"The recommended patches tend to be as restrictive as possible, to the point that they would break connections with other services. It would address the issue, but break the dependencies that allow the service to work in the first place."

— One team's most common patch failure

A validation ladder (cheapest first)

① Build: The patch compiles and the new tests pass.
② Try to reproduce: The original PoC should stop working, which catches ineffective patches.
③ Check for regressions: The original test suite still passes, which catches broken or over-restrictive patches.
④ Re-attack: A fresh discovery agent runs an adversarial check, which catches incomplete patches.
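The ladder amounts to running checks in cost order and stopping at the first failure, which can be sketched as below. The function `run_ladder` and the step names in the usage note are hypothetical; each check would wrap a real build, PoC run, test suite, or agent call.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_ladder(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run checks in the given order and stop at the first failure.

    Returns (all_passed, names of the steps that actually ran).
    """
    ran: List[str] = []
    for name, check in steps:
        ran.append(name)
        if not check():
            return False, ran  # skip the more expensive steps that follow
    return True, ran
```

Listing the steps as `build`, `poc_no_longer_reproduces`, `regression_suite`, `re_attack` means an ineffective patch is rejected before any time is spent on the adversarial re-attack.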

The model can write the patch, but a human still needs to own it. Generated patches fail in predictable ways: fixing the symptom not the root cause, blocking legitimate input, removing access to a dependent service. Validate each patch as much as possible so human review focuses on nuances the model can't see (incoming changes, code style).

patch skill

Consumes the triage output and generates a candidate diff per finding, with an independent reviewer agent checking each one.

Moving forward

Wire the harness to events

It's getting easier for models to find and exploit vulnerabilities in code. As defenders, the job is to find and fix them before adversaries do. Some teams have gone as far as connecting their harnesses to events:

a bug bounty report triggers an automated variant analysis
a security review triggers scanning, with candidate findings attached
a verified vulnerability updates the static analysis tooling to prevent it recurring

done right, it's the start of a larger, more hopeful shift, where we'll be able to find and fix vulnerabilities before attackers exploit them.

— Anthropic

Getting started & resources

Clone defending-code-reference-harness and run /quickstart in Claude Code to walk an interactive workflow — threat modeling to scanning to triage — on a demo target. The repo also includes an autonomous harness and a /customize skill. Then run it on your own code — budget for the pipeline after the scan before you budget for more scanning.

Claude Security: Anthropic's managed product for agentic vulnerability detection and patching.
defending-code-reference-harness: Companion repo with skills for interactive workflows + a demo harness for autonomous runs.
claude-code-security-review action: A GitHub Action with Claude as a security reviewer on every pull request.
Vulnerability Detection Agent: Cookbook for an agent that builds a threat model, scans, and triages findings into a structured report.
