Finding was never the bottleneck
Fixing is
After finding and fixing vulnerabilities across its own code and open-source software, Anthropic's security team drew one primary takeaway: discovery is now straightforward to parallelize, and the bottleneck has shifted to verification, triage, and patching.
Data from scanning open-source software, as of May 22, 2026. The gap between finding and fixing is the whole problem the guide addresses: discovery is cheap to parallelize, and the bottleneck has moved to everything that comes after. The article's spine is a one-time investment that powers a repeating defender's loop.
Two one-time setup steps, four you run on repeat
Teams finding and fixing the most vulnerabilities converged on a variation of existing best practices. The first two steps are a one-time setup (done once per codebase, revisited when the system changes); the next four are the loop you run against the source.
On cadence: the first run surfaces the most findings; later runs are fewer but often more complex. But don't expect the nth run to return zero — models are stochastic, and a large codebase has a long tail of vulnerabilities that trickle in even when the code is unchanged. On the first iteration, run the loop multiple times, deciding when to stop based on net-new findings and your risk tolerance; afterward, scan periodically or whenever the code meaningfully changes.
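One way to make "stop based on net-new findings" concrete is a small stopping rule. This is a minimal sketch, not the guide's method: the `threshold` and `patience` knobs, and the idea of keying findings by a string, are assumptions to tune against your own risk tolerance.

```python
def net_new(seen, run_findings):
    """Count findings not seen in earlier runs, and record them in `seen`."""
    fresh = set(run_findings) - seen
    seen |= fresh
    return len(fresh)


def should_stop(net_new_per_run, threshold=1, patience=2):
    """Stop once the last `patience` runs each found fewer than `threshold`
    net-new findings. Both knobs are illustrative assumptions."""
    if len(net_new_per_run) < patience:
        return False
    return all(n < threshold for n in net_new_per_run[-patience:])


# Hypothetical finding keys from four consecutive runs over unchanged code.
runs = [
    {"sqli:/login", "xss:/search", "ssrf:/fetch"},
    {"sqli:/login", "idor:/orders"},
    {"xss:/search"},
    {"idor:/orders"},
]

seen, history = set(), []
for findings in runs:
    history.append(net_new(seen, findings))
    if should_stop(history):
        break

print(history)  # [3, 1, 0, 0]
```

Because models are stochastic, a single quiet run proves little, which is why the sketch waits for several quiet runs in a row before stopping.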
The #1 cause of false positives: the model doesn't understand your trust boundaries
The model might flag code as vulnerable because it assumes a client could send corrupted values or an attacker could control the config — even though those inputs are trusted in your environment. Conversely, it might assume an internet-facing service is internal-only and under-report true vulnerabilities. In both cases, the model is wrong about the threat model, not the code.
When the threat model was well-defined, the model's findings "were exploitable 90 percent of the time."
Build the threat model in two steps
① Bootstrap from the code, docs, and vulnerability history. Feed the model what you'd hand a new security engineer on day one: architecture docs, wikis, entry points, git history, past vulnerabilities. Then have it produce a threat model with system context, assets, entry points, and trust boundaries, cluster past bugs, and list the relevant vulnerability classes. Document what you do and don't care about — and why.
"'What have people exploited in the past' is sometimes a much easier cheat-code towards success than 'find me vulnerabilities in this codebase.'"
② Have the model interview someone who knows the system. Consider Shostack's four questions: What are we building? What can go wrong? What are we doing about it? Did we do a good job? Run bootstrap first so the interviewee starts from a draft, not from scratch. The interview is optional, but adds context the model can't get from code or docs.
A few practices that make a big difference
- Consider your dependencies' security policies. Many publish one (e.g. vLLM's security.md, SQLite's "Defense Against the Dark Arts", ImageMagick's policy). Reuse them directly instead of rebuilding from scratch.
- Name what is trusted. If you trust config files or authenticated clients, document it — these assumptions separate non-exploitable bugs from actual exploits.
- Include a THREAT_MODEL.md with the code. Keep it in the repo and update it as code changes; the discovery agent reads it before searching, skipping known non-issues.
You'll use it in two places: in discovery as scope (partition, prioritize, skip what's out of scope); in triage as a filter (calibrate severity to your system).
"... good context of the code, but not good context of us."
Bootstrap derives a draft from your code, CVEs, and git history; the interview refines it via Shostack's four questions. Output: a THREAT_MODEL.md used in discovery and triage.
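To give the bootstrap step something to fill in, a skeleton THREAT_MODEL.md could be generated up front. The section names below follow the elements listed earlier (system context, assets, entry points, trust boundaries, past bugs, vulnerability classes, what you do and don't care about), but the exact layout and the `render_threat_model` helper are assumptions, not a format the guide prescribes.

```python
# Section titles and hints are an illustrative layout, not an official template.
SECTIONS = [
    ("System context", "What the system does and how it is deployed."),
    ("Assets", "What an attacker would want."),
    ("Entry points", "Where untrusted input enters."),
    ("Trust boundaries", "What is trusted (e.g. config files, authenticated clients) and why."),
    ("Past vulnerabilities", "Clusters of previously exploited bugs."),
    ("Relevant vulnerability classes", "Classes worth searching for in this system."),
    ("Out of scope", "What you do not care about, and why."),
]


def render_threat_model(system_name, sections=SECTIONS):
    """Return a Markdown skeleton with one heading and one hint per section."""
    lines = [f"# Threat model: {system_name}", ""]
    for title, hint in sections:
        lines += [f"## {title}", "", f"<!-- {hint} -->", ""]
    return "\n".join(lines)


skeleton = render_threat_model("example-service")
print(skeleton)
```

The model (and later the interviewee) replaces each hint with real content; keeping the headings stable makes it easier for the discovery and triage agents to find what they need.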
The sandbox does two jobs: protect your systems, and prove exploitability
Job one · protect your systems. To let models run safely and autonomously, you need a strong isolation layer. Without it, the agent may overshoot the target and do something unexpected.
One team told the model it had no network access—when it actually did—and the model discovered it could fetch from GitHub anyway. Another team observed an agent answer a GitHub issue mid-scan.
Match isolation to your threat model: a container is fine for the discovery agent reading code, but run the target and its PoCs in a microVM (like Firecracker) or a full VM with egress locked down. And never make credentials (~/.aws, ~/.ssh, .env) available to the agent.
Give the sandbox network access only while setting it up, then take a snapshot and cut the network. Load the snapshot at the start of each run so every scan begins from the same clean slate.
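For the discovery agent's container (not the target or its PoCs, which belong in a microVM or VM as noted above), the isolation settings might look roughly like the following. This sketch only builds a `docker run` command line and does not execute it; the image name, checkout path, and `sandbox_command` helper are hypothetical, and the chosen flags are one reasonable baseline rather than a complete hardening recipe.

```python
import shlex


def sandbox_command(image, workdir, allow_network=False):
    """Build (but do not run) a `docker run` argv for a locked-down scan container.

    Only the source checkout is mounted, read-only, so host credentials such as
    ~/.aws, ~/.ssh, or .env files outside the checkout are never visible.
    """
    return [
        "docker", "run", "--rm",
        "--network", "bridge" if allow_network else "none",  # network only during setup
        "--read-only",                                        # immutable root filesystem
        "--cap-drop", "ALL",                                  # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "--mount", f"type=bind,source={workdir},target=/src,readonly",
        image,
    ]


# Hypothetical image reference and checkout path.
cmd = sandbox_command("scanner-image:1.4.2", "/srv/checkout")
print(shlex.join(cmd))
```

Passing `allow_network=True` corresponds to the setup phase; every scan run after the snapshot would use the default, with no network at all.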
Job two · prove exploitability. During static scanning the model can only hypothesize what might break — it can't test reachability or compensating controls. When teams built a sandbox where the agent could compile, run tests, and detonate a PoC, non-exploitable findings dropped significantly.
"the biggest efficacy lever has been giving the model test beds, live systems, and running the PoCs."
Pin as much as you can — image tags, commit SHAs, dependencies, build commands — so every run uses the same code in the same environment.
One team now builds Docker containers with dependencies pinned to match production, so the finding agent and the verification agent operate on the same artifacts an attacker would.
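A small pre-flight check can catch unpinned inputs before a run starts. The rules below are assumptions chosen for illustration: a commit counts as pinned only if it is a full 40-character SHA, and an image counts as pinned if it carries a sha256 digest or an explicit tag other than `latest`.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def commit_is_pinned(ref):
    """True only for a full commit SHA, not a branch name or short hash."""
    return FULL_SHA.fullmatch(ref) is not None


def image_is_pinned(image):
    """True if the image reference has a digest or an explicit non-latest tag."""
    if DIGEST.search(image):
        return True
    name = image.rsplit("/", 1)[-1]  # drop registry/path so a port is not read as a tag
    tag = name.partition(":")[2]
    return tag not in ("", "latest")


for ref in ["python:3.12.4-slim", "python:latest", "localhost:5000/app"]:
    print(ref, image_is_pinned(ref))
```

A tag can still be re-pointed by whoever controls the registry, so a digest is the stricter choice when reproducibility matters most.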
Make the sandbox faithful enough to production: excluding dependencies (a queue or datastore) under-reports bugs; ignoring production defenses (a WAF or auth gateway) over-reports findings prod already mitigates.
If a representative sandbox is impractical (cloud dependencies, data stores, etc.), start with the discovery step instead. Frontier models are good at finding vulnerabilities from just analyzing source code — several teams, including Anthropic's own, found this effective. The trade-off is in verification: without a running target you can't prove findings with a PoC, so budget more time there. You can invest in the sandbox later, once the volume of findings justifies it.
Reference implementation: the harness README provides a reference sandbox — agents and targets run in gVisor-isolated containers with egress locked to the model API, the target built from a Dockerfile pinned to a specific commit, with setup_sandbox.sh handling setup.
Rich context, shorter prompts, useful tools
Give the discovery agent context it can load as needed: the threat model, architecture docs, results of past scans. When it understands your trust boundaries and how the system is actually deployed, it identifies vulnerabilities specific to your system.
Frontier models benefit from increasingly simple prompts during discovery. Counterintuitively, more prescriptive prompts make discovery worse — long checklists reduce the model's creativity and generate fewer novel bugs.
✗ Over-prescriptive
- long checklists dictating how to scan
- pins down "how to scan"
- narrows what the model tries
- fewer novel bugs
✓ Goal + context
- state the why / what: why you're scanning, what a finding that matters looks like, what system
- leave "how to scan" to the model
- optionally name a specific vulnerability class (what it does, where it lives)
- define structured output + an escape hatch for weak findings
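The "goal + context" shape above can be sketched as a short prompt builder. Everything here is illustrative: the field names in `FINDING_SCHEMA`, the wording of the prompt, and the use of a low-confidence label as the escape hatch for weak findings are assumptions, not the guide's actual prompt.

```python
import json

# Hypothetical output schema; adjust the fields to what your pipeline consumes.
FINDING_SCHEMA = {
    "type": "object",
    "required": ["title", "file", "vulnerability_class", "confidence", "evidence"],
    "properties": {
        "title": {"type": "string"},
        "file": {"type": "string"},
        "vulnerability_class": {"type": "string"},
        "confidence": {"enum": ["high", "medium", "low"]},
        "evidence": {"type": "string"},
    },
}


def build_discovery_prompt(system, why, what_matters, vuln_class=None):
    """State the why and the what; leave how to scan to the model."""
    parts = [
        f"System under review: {system}",
        f"Why we are scanning: {why}",
        f"A finding that matters: {what_matters}",
        "Read THREAT_MODEL.md first and respect its trust boundaries.",
    ]
    if vuln_class:  # optional focus on a single vulnerability class
        parts.append(f"Focus on: {vuln_class}")
    parts.append(
        "Report each finding as JSON matching this schema:\n"
        + json.dumps(FINDING_SCHEMA, indent=2)
    )
    # Escape hatch: weak findings are reported as such, not dropped or inflated.
    parts.append('If the evidence is weak, set "confidence" to "low" and say why.')
    return "\n\n".join(parts)


prompt = build_discovery_prompt(
    system="internet-facing payments API",
    why="pre-release security review",
    what_matters="anything an unauthenticated client can exploit",
)
print(prompt)
```

Note what the prompt leaves out: there is no checklist of techniques or files to inspect.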
Give tools — and let the model build its own
Give it grep, glob, and other search/read tools; let it use SAST scanners and fuzzers your team uses; ask what tools a task needs and provide them; and let it build tools as needed — recent frontier models are increasingly good at writing the tools they need.
For one team, the agent didn't need to guess whether a path could be reached: it could test each candidate against the running application as it went, improving the team's true-positive rate to nearly 100 percent.
Partition + parallelize, don't brute-force
Have the model do a first pass to partition the search space (by attack surface, endpoint, component); feed the partitions to parallel discovery agents so they don't converge on the same shallow bugs; then run a system-level pass with the partition findings as context.
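The partition, fan-out, and system-level pass can be sketched with a thread pool. The two agent functions are stand-ins that return placeholder data; in a real harness they would call a model, and the partition names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def review_partition(partition):
    """Stand-in for one discovery agent scoped to a single partition."""
    return [{"partition": partition, "title": f"candidate in {partition}"}]


def system_pass(findings):
    """Stand-in for the system-level pass, which sees all partition findings."""
    return [{"partition": "system",
             "title": f"cross-cutting review of {len(findings)} findings"}]


def run_discovery(partitions, max_workers=4):
    # One agent per partition, so agents do not converge on the same shallow bugs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_partition = list(pool.map(review_partition, partitions))
    findings = [f for batch in per_partition for f in batch]
    # Final pass with the partition findings as context.
    return findings + system_pass(findings)


partitions = ["auth endpoints", "file upload", "admin API"]
results = run_discovery(partitions)
for r in results:
    print(r["partition"], "->", r["title"])
```

Threads are enough here because each agent call would be I/O-bound; the scoping of each agent matters more than the concurrency primitive.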
"We initially tried to just horizontally scale and send more agents, but saw limiting returns."
If you have a sandbox to run the target, ask the agent to build a PoC (a script, a crashing input, a failing test) — it helps the agent pin down the finding and gives the verifier concrete evidence. Findings it can't reproduce can still be reported, flagged as unproven, to keep recall high.
In the reference harness, this stage reads your THREAT_MODEL.md, partitions the target into focus areas, and fans out parallel review agents per area. Output: structured findings the next steps consume directly.
Discovery optimizes recall, verification optimizes precision — don't make one agent do both
When an agent tries to do both in the same step, it can self-censor and exclude exploitable true positives.
We learned this the hard way: asking discovery agents to also verify their findings led them to filter out true positives that a separate verification step would have confirmed.
The verifier must be independent
Run the verifier in a fresh container with no shared filesystem or conversation history. If it's exposed to the discovery agent's reasoning, it may simply agree instead of testing the claim. Give it only (1) the PoC or written finding and (2) the codebase — so it can search for mitigations the finder missed (upstream validation, auth gates, type constraints, unreachable code).
If a single pass still lets too many false positives through, run multiple independent verifiers (different angles or models) and take a majority vote; consider a separate judge to arbitrate when discovery and verification disagree. Prompt the verifier to disprove each finding: assume it's a false positive and search for why it's wrong. This matters most when there's no PoC.
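Two pieces of the verifier setup lend themselves to a short sketch: stripping the finder's reasoning from what the verifier sees, and combining several independent verdicts. The field names, verdict labels, and the choice to escalate anything without a strict majority are assumptions for illustration.

```python
from collections import Counter


def verifier_input(finding, codebase_path):
    """Pass only the claim, any PoC, and the code; drop the finder's reasoning."""
    return {
        "claim": finding["claim"],
        "poc": finding.get("poc"),
        "codebase": codebase_path,
    }


def majority_verdict(votes):
    """Return the verdict held by a strict majority, else escalate for review."""
    if not votes:
        return "needs_review"
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict if count > len(votes) / 2 else "needs_review"


# Hypothetical finding produced by a discovery agent.
finding = {
    "claim": "SQL injection in /search via the q parameter",
    "reasoning": "long chain of thought from the discovery agent",
    "poc": None,
}
payload = verifier_input(finding, "/src")
print(sorted(payload))
print(majority_verdict(["exploitable", "exploitable", "false_positive"]))  # exploitable
print(majority_verdict(["exploitable", "false_positive"]))                 # needs_review
```

Escalating ties rather than picking a side keeps a split vote from silently discarding a finding, which fits the point that a failed verification is not proof of a false positive.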
If you can sufficiently reproduce production in a sandbox, have the verifier build and execute a reproducible PoC — if it works, the finding is exploitable. But the inverse isn't true: failure to produce a working PoC is not proof of a false positive.
"Validation is the biggest holdup and the PoC is the validation."
The model finds a hundred candidates before lunch — triage is the new bottleneck
Verification confirms a finding is exploitable; triage assesses patching priority. Previously the engineer who found a bug also triaged it; now, with findings exploding, triage is the bottleneck. Too many duplicated or inflated-severity bugs cause alert fatigue — product engineers stop reading them, even the ones that need immediate patching. Open-source maintainers are especially likely to be overwhelmed.
"if we send product engineers a pile of findings where a majority are non-exploitable, they will lose trust in the reports and give up."
Deduplicate by root cause
First a cheap deterministic pass (same file, same category, line numbers within ten lines of each other), then let a model apply qualitative rules to what remains:
Treat as duplicate
- the same root cause worded differently
- the same vulnerability at multiple call sites
- a missing global protection (e.g. an auth check) reported per endpoint
- a cause and its consequence flagged in the same path
Treat as distinct
- different vulnerability classes in the same file
- different variables reaching different sinks
- two independent bugs inside one helper
- the same missing check on two endpoints, each needing its own fix
If your harness generates PoCs and patches per finding, another dedup approach: check whether one finding's patch also disarms the PoCs of others.
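The cheap deterministic pass (same file, same category, line numbers within ten lines of each other) can be sketched as a grouping step. One assumption is made explicit here: nearby findings are chained, so a run of findings each within ten lines of the previous one lands in a single cluster. The finding fields are hypothetical.

```python
from collections import defaultdict


def dedup_deterministic(findings, window=10):
    """Cluster findings that share file and category and sit within `window`
    lines of the previous finding in the cluster (chained proximity)."""
    groups = defaultdict(list)
    for f in findings:
        groups[(f["file"], f["category"])].append(f)

    clusters = []
    for group in groups.values():
        group.sort(key=lambda f: f["line"])
        current = [group[0]]
        for f in group[1:]:
            if f["line"] - current[-1]["line"] <= window:
                current.append(f)
            else:
                clusters.append(current)
                current = [f]
        clusters.append(current)
    return clusters


findings = [
    {"file": "a.py", "category": "sqli", "line": 10},
    {"file": "a.py", "category": "sqli", "line": 18},
    {"file": "a.py", "category": "sqli", "line": 40},
    {"file": "a.py", "category": "xss", "line": 12},
    {"file": "b.py", "category": "sqli", "line": 10},
]
clusters = dedup_deterministic(findings)
print([len(c) for c in clusters])  # [2, 1, 1, 1]
```

Each cluster then goes to the model-based pass, which applies the qualitative duplicate and distinct rules to whatever the deterministic pass could not settle.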
Rate severity on six dimensions
Have the model write out its answer to each question before assigning severity — going through the evidence first keeps it from anchoring on the bug class ("SQL injection, so critical") and inflating to match. Adjust the thresholds to your system.
Models inflate severity when they lack context: they may not know what an attacker controls, or can't see an upstream WAF or auth. The fix is to give the triage agent the threat model — clarifying "we trust authenticated clients" can remove a whole class of criticals.
In the reference harness, this stage does both verification and triage: multi-vote verification per finding, deduplication across runs, and re-ranking by derived exploitability. Output: a short, ranked, owned list instead of a raw dump.
Fix the root cause, find variants, minimize the patch — a human still owns it
Patching closes the loop: verified findings improve the threat model, and past findings feed into the next scan. Each cycle hardens the codebase and makes the next scan better informed.
First, write a test that fails (TDD)
Before patching, write a new test that fails with the existing code; implement the fix and confirm it passes without breaking anything else. Without a test, the fix can silently regress, and it's hard to retroactively prove the bug was real.
By giving the model feedback to iterate against, patch quality jumped, saving time on human review.
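As an illustration of the test-first step, here is a regression check for a hypothetical path-traversal bug. The upload helper, both implementations, and the base directory are invented for the example; the point is only that the same check fails against the existing code and passes against the fix.

```python
import posixpath

BASE = "/srv/uploads"  # hypothetical upload directory


def resolve_upload_vulnerable(name):
    # Bug: joins attacker-controlled input without checking where it ends up.
    return posixpath.normpath(posixpath.join(BASE, name))


def resolve_upload_fixed(name):
    path = posixpath.normpath(posixpath.join(BASE, name))
    if posixpath.commonpath([BASE, path]) != BASE:
        raise ValueError("path escapes upload directory")
    return path


def check_traversal_is_rejected(resolve):
    """Regression check, written before the fix: '../' input must be refused."""
    try:
        resolve("../../etc/passwd")
    except ValueError:
        return  # expected once the fix is in place
    raise AssertionError("traversal input was accepted")


for impl in (resolve_upload_vulnerable, resolve_upload_fixed):
    try:
        check_traversal_is_rejected(impl)
        print(impl.__name__, "PASS")
    except AssertionError as exc:
        print(impl.__name__, "FAIL:", exc)
```

Seeing the check fail on the existing code is the evidence that the bug was real; keeping it in the suite is what stops the fix from silently regressing.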
Fix the root cause, then find variants at two levels
A model may narrowly fix one call site; simply prompting it to identify and fix the root cause is effective. Then look for variants at two levels: ① same pattern (other call sites or copies of the buggy code); ② same class (a codebase with one SQL injection tends to have more).
Before shipping: adversarial check + minimal patch
Run an adversarial check: a fresh discovery agent probes the patch as an attacker to confirm it's comprehensive. Then simplify the patch — minimal changes are easier to review and less likely to introduce new bugs. Prompt for the smallest change that fixes the root cause: no refactoring, no drive-by cleanups, no reformatting.
"The recommended patches tend to be as restrictive as possible, to the point that they would break connections with other services. It would address the issue, but break the dependencies that allow the service to work in the first place."
A validation ladder (cheapest first)
The model can write the patch, but a human still needs to own it. Generated patches fail in predictable ways: fixing the symptom not the root cause, blocking legitimate input, removing access to a dependent service. Validate each patch as much as possible so human review focuses on nuances the model can't see (incoming changes, code style).
In the reference harness, this stage consumes the triage output and generates a candidate diff per finding, with an independent reviewer agent checking each one.
Wire the harness to events
It's getting easier for models to find and exploit vulnerabilities in code. As defenders, our job is to find and fix them before adversaries do. Some teams have gone as far as connecting their harnesses to events:
- a bug bounty report triggers an automated variant analysis
- a security review triggers scanning, with candidate findings attached
- a verified vulnerability updates the static analysis tooling to prevent it recurring
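Wiring the three triggers above to harness actions could be as simple as a dispatch table. The event type names, payload fields, and handler bodies below are all hypothetical placeholders; a real integration would receive these events from a bug bounty platform, a review tool, or the triage pipeline.

```python
HANDLERS = {}


def on(event_type):
    """Register a handler for one event type."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register


@on("bug_bounty_report")
def variant_analysis(event):
    return f"variant analysis queued for {event['id']}"


@on("security_review_opened")
def scan_for_review(event):
    return f"scan queued, findings will be attached to {event['id']}"


@on("vulnerability_verified")
def update_static_analysis(event):
    return f"static analysis rule drafted from {event['id']}"


def dispatch(event):
    """Route an event to its handler; unknown event types are ignored."""
    handler = HANDLERS.get(event["type"])
    return handler(event) if handler else None


print(dispatch({"type": "bug_bounty_report", "id": "R-1"}))
```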
Done right, it's the start of a larger, more hopeful shift, where we'll be able to find and fix vulnerabilities before attackers exploit them.
Getting started & resources
Clone defending-code-reference-harness and run /quickstart in Claude Code to walk through an interactive workflow (threat modeling to scanning to triage) on a demo target. The repo also includes an autonomous harness and a /customize skill. Then run it on your own code, and budget for the pipeline after the scan before you budget for more scanning.