Access we'd have rejected 12 months ago is routine today
As agents grow more capable, so does their potential blast radius. The engineering question is just one: how to cap it. Here's the first candid account of what Anthropic learned building containment across three products.
The probability never reaches zero — any probabilistic defense has a non-zero miss rate. So the real lever isn't betting on the model to behave; it's capping the blast radius at the environment layer — supervising what the agent can do, not what it does.
There are broadly two ways to cap it
One supervises the agent's behavior; the other supervises its capability. Anthropic devoted nearly all of its engineering effort to the second — because the first was found to degrade.
Human-in-the-loop: ask for permission at each turn
Theoretically it works. But the more approvals a user sees, the less attention they pay to each, and supervision degrades over time. A feature designed to provide oversight can arguably have the opposite effect — some users simply stop paying attention.
Containment: enforce boundaries with sandboxes, VMs, egress controls
Regardless of what the agent wants to do, hard boundaries bound what it can reach. A plain but pivotal example: if credentials never enter the sandbox, they can't be exfiltrated — whether the cause is a user, a model finding a "creative" path, or an attacker.
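The credentials example can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: the allowlist `SAFE_ENV_VARS` and both function names are hypothetical, and scrubbing environment variables covers only one of the ways a secret can reach a child process. A real sandbox would also need the filesystem and network boundaries described above.

```python
import os
import subprocess

# Hypothetical allowlist: only these variables are copied into the sandboxed
# process. Anything not listed (API keys, cloud credentials) never enters it.
SAFE_ENV_VARS = {"PATH", "LANG", "TERM"}


def build_sandbox_env(host_env):
    """Return a new dict holding only the allowlisted variables from host_env."""
    return {k: v for k, v in host_env.items() if k in SAFE_ENV_VARS}


def run_in_sandbox(argv, host_env=None):
    """Run argv with the scrubbed environment instead of the caller's own."""
    env = build_sandbox_env(os.environ if host_env is None else host_env)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
```

The design choice is allowlisting rather than denylisting: a secret stored under a variable name nobody anticipated is dropped by default instead of leaking by default.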
Three types of risk, three components to defend
Security risks to agents fall into one of three categories:
We've seen Claude models "helpfully" escape a sandbox in order to complete a task, examine git history to find answers to a coding test, and spontaneously identify the benchmark they were being run on in order to decrypt its answer key.
Defenses are applied to three components — and they should overlap and complement each other, with one picking up the slack when another isn't available:
Process sandboxes, VMs, filesystem boundaries, egress controls set a hard boundary on what an agent can reach. This is the focus of the post, and the most dependable layer — it doesn't rely on probability.
System prompts, classifiers, probes, training. On Gray Swan's Agent Red Teaming benchmark, Opus 4.7 holds attack success to ~0.1% on single attempts and ~5–6% after 100 adaptive attempts; Claude Code auto mode catches roughly 83% of overeager behaviors before they execute. But because models are probabilistic, this shapes only what the agent tends to do — never 100%, so it can't stand alone.
MCP servers, plugins, and web search feed in content from sources you don't control. An audited GitHub connector can still load a poisoned README straight into the model's context despite passing malware checks. Granular permissions shrink the radius: a read-only-DB agent can be deployed far more broadly than one that writes to prod.
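As one concrete instance of a granular permission, the read-only database case can be enforced by the database engine itself rather than by the agent's good behavior. The sketch below uses SQLite purely for illustration (the source does not say which database is involved), and `open_read_only` is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite database in read-only mode.

    With mode=ro in the URI, SQLite itself rejects INSERT/UPDATE/DELETE,
    so the restriction does not depend on what the caller tries to do.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

An agent handed this connection can still run queries, but a write attempt raises `sqlite3.OperationalError` instead of touching the data.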
Three products, three isolation chambers
This section focuses on the environment layer. Three products serving different audiences forced three different isolation architectures, from the thinnest server-side container to the thickest sealed VM. Each was arrived at gradually, balancing the capability the agent needs against the intervention the user must provide.
[Figure: Claude Cowork's six main VM isolation mechanisms. Defense in depth: hard outside, minimal inside. Surviving labels: ".claude"; "credentials stay in host keychain, never enter the guest".]
Cowork's two architecture changes: the real reliability-vs-isolation tradeoff
Originally the whole agent loop ran inside the VM (full-VM mode), with Claude as an ordinary Linux user unaware it was sandboxed — the upside being no outer process holding an "escape-hatch key." But it soon emerged that any failure during VM startup made Cowork unusable. Moving the agent loop outside the VM, while keeping code execution inside, lets Claude still respond and help debug on error. Security impact was minimal, since the VM still enforces filesystem and network controls over the code the agent runs. Local MCP servers were likewise moved outside — running them inside was harder to audit, brittle on VM updates, and unsupported for MCPs needing to talk to local processes like databases.
Three file-mount modes + one symlink gotcha
Claude needs some host file access to be useful, but the radius must stay minimal and transparent to the user. Cowork offers three mount modes: read-only / read-write / read-write-no-delete. One gotcha: symlink resolution must happen before path validation, not after — otherwise a symlink inside an authorized folder can point outside and escape. For enterprises, admins control this via mount-path allowlists in MDM settings.
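The symlink gotcha and the three mount modes can be sketched together. This is a minimal illustration under stated assumptions, not Cowork's code: the `MountMode` enum, the `ALLOWED_OPS` mapping, and the function names are all hypothetical. The point it demonstrates is the ordering: resolve symlinks first, then compare against the authorized root.

```python
import os
from enum import Enum


class MountMode(Enum):
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"
    READ_WRITE_NO_DELETE = "read-write-no-delete"


# Hypothetical mapping from mount mode to the operations it permits.
ALLOWED_OPS = {
    MountMode.READ_ONLY: {"read"},
    MountMode.READ_WRITE: {"read", "write", "delete"},
    MountMode.READ_WRITE_NO_DELETE: {"read", "write"},
}


def is_inside(root, candidate):
    """True if candidate lies within root AFTER symlinks are resolved."""
    real_root = os.path.realpath(root)
    real_candidate = os.path.realpath(candidate)
    try:
        return os.path.commonpath([real_root, real_candidate]) == real_root
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


def is_allowed(root, mode, path, op):
    """Check the operation against the mount mode, then the resolved path."""
    return op in ALLOWED_OPS[mode] and is_inside(root, path)
```

Validating the unresolved string would accept `allowed/link.txt` even when the link targets a file elsewhere on the host. A production check would also have to deal with the path changing between validation and use, which this sketch does not attempt.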
Four "risks we missed": the sandbox worked perfectly, and yet the data was exfiltrated
Where the most containment effort went is also where the most surprising failures occurred. All four below are first-hand. The two most expensive lessons were both egress — data left through a permitted path — and in both, the model layer couldn't help, because there was nothing anomalous to catch.
A researcher phished an employee into launching Claude Code with a malicious prompt. The phish looked like ordinary collaboration — a "can you run this for me?" email with a ready-to-paste prompt that read like routine task instructions. But among the setup steps, it gently asked Claude to read ~/.aws/credentials, encode the contents, and POST them to an external endpoint.
Exfiltration path: ~/.aws/credentials → encode → POST out
Model-layer defenses anchor on user intent and are effectively blind on this path. The only defense that holds is the environment: egress controls that block the POST, plus filesystem boundaries that keep ~/.aws out of reach in the first place.
When they shared the working prompt in internal Slack for discussion, someone pointed out that some internal agents read Slack. The payload was now ambient. They added a canary string to the thread so they'd notice if anything picked it up. In a world where agents read everything, the investigation tooling is also an attack surface.
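The two environment-layer checks named in the phishing case, an egress control and a filesystem boundary, might look like the following. This is a sketch under assumptions: `ALLOWED_HOSTS`, `DENIED_HOME_DIRS`, and both function names are hypothetical, and a real deployment would enforce these in a proxy and in the sandbox's mounts rather than in application code.

```python
import os
from urllib.parse import urlsplit

# Hypothetical policy values, chosen for illustration only.
ALLOWED_HOSTS = {"api.anthropic.com"}
DENIED_HOME_DIRS = (".aws", ".ssh")


def egress_allowed(url):
    """Allow a request only if its exact hostname is on the allowlist."""
    host = urlsplit(url).hostname
    return host is not None and host in ALLOWED_HOSTS


def read_allowed(path, home):
    """Deny reads that resolve to a credential directory under home."""
    real_path = os.path.realpath(path)
    for name in DENIED_HOME_DIRS:
        denied = os.path.realpath(os.path.join(home, name))
        if real_path == denied or real_path.startswith(denied + os.sep):
            return False
    return True
```

Neither check looks at intent. The POST to an unlisted host is refused and the read of `~/.aws/credentials` fails no matter how routine the prompt that asked for them appeared.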
The most direct case: a developer clones a repo to review a PR, and it contains a .claude/settings.json defining a hook. Because Claude Code reads project settings at startup — before the "Do you trust this folder?" prompt — the attacker's committed hook executes automatically. The rest were structurally similar: input from a not-yet-trusted directory parsed before the trust boundary existed.
The fix: treat project-open, config-load, and localhost listeners like any inbound request from the internet. Don't trust them just because they "feel local and arrive before the user consents."
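The ordering problem, parsing project input before the trust boundary exists, can be illustrated with a loader that refuses to read project settings until trust has been established. This is a sketch only: `load_project_hooks` is a hypothetical function, and the assumption that hooks sit under a top-level `"hooks"` key is made for illustration rather than taken from the source.

```python
import json
import os


def load_project_hooks(project_dir, is_trusted):
    """Return hooks from .claude/settings.json, but only for a trusted folder.

    When is_trusted is False the file is never opened, so nothing committed
    to the repository can influence behavior before the user has consented.
    """
    if not is_trusted:
        return {}
    settings_path = os.path.join(project_dir, ".claude", "settings.json")
    try:
        with open(settings_path, encoding="utf-8") as f:
            settings = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    if not isinstance(settings, dict):
        return {}
    hooks = settings.get("hooks", {})  # assumed key name
    return hooks if isinstance(hooks, dict) else {}
```

The trust decision is an explicit argument, so a caller cannot reach the parsing code without first having an answer to the trust question.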
Cowork's egress allowlist correctly passed traffic to api.anthropic.com — the product can't function without calling its own API. This time, a malicious file in the mounted workspace carried hidden instructions plus an attacker-controlled API key. Claude followed them, read other workspace files, and called the Files API with the attacker's key. The proxy saw api.anthropic.com and let it through. The files were uploaded to the attacker's account.
Allowing api.anthropic.com meant allowing file uploads to arbitrary accounts. The proxy's reasoning was "api.anthropic.com → it's our own domain", so the request passed.
The proxy sits inside the VM, not on the server, because only the VM knows provenance; from the server's view, a Cowork request is indistinguishable from any other API client.
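One check a VM-side proxy could make, given that it knows which credential belongs to the session, is to match the presented key as well as the hostname. The source does not describe the actual fix, so treat this as an assumption-laden sketch: `proxy_allows` and `session_api_key` are hypothetical names, and the key is assumed to travel in the `x-api-key` request header.

```python
import hmac
from urllib.parse import urlsplit

API_HOST = "api.anthropic.com"


def proxy_allows(url, headers, session_api_key):
    """Allow a request only if host AND credential match this session."""
    if urlsplit(url).hostname != API_HOST:
        return False
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    presented = normalized.get("x-api-key", "")
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(presented.encode(), session_api_key.encode())
```

Under this rule the upload in the incident would have been refused: the destination host was permitted, but the key in the request was the attacker's rather than the session's.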
Evaluating Cowork, enterprise security teams asked why their endpoint detection couldn't see in. The answer: the same isolation keeping Claude contained also kept host-based EDR out. From the EDR's perspective, Cowork is an opaque hypervisor process that can't inspect the guest. Isolation reduces visibility — a problem for teams whose compliance posture depends on endpoint visibility.
The isolation that contains Claude also keeps your monitoring out — isolation and visibility are in tension.
Trusting what the agent reads
Enterprises often ask how to secure MCP connections. But the right question is broader than MCP: any external resource given to an agent is two risks at once — a code-execution risk in the supply-chain sense, and a prompt-injection vector. Traditional dependency auditing (pin versions, verify signatures, review source) addresses the first but misses the second.
Remote versus local matters more than it seems
A locally installed tool is auditable: you can read the code, pin the version, and know it won't change under you. A remote tool — a hosted MCP server, a cloud connector — can change behavior at any point after you've approved it, so your install-time trust may no longer apply. Anything outside the directory should be treated as untrusted: run it against fake data first, in an environment where a malicious tool's blast radius is contained.
Once a poisoned tool return has steered the agent into exfiltrating data, the log just shows a successful, authorized API call. There's no after-the-fact signal to find.
In Claude Code and Cowork, tool calls route through proxies that enforce network and file policy and can inspect return values before they enter the context. The inspecting classifier can be a small, fast model — it needn't be the one doing the reasoning.
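The inspection step can be sketched as a wrapper that runs a classifier over a tool's return value before it is handed to the agent. Everything here is hypothetical: `guarded_tool_call` is an illustrative name, and the keyword matcher `looks_like_injection` is only a stand-in for the small, fast model the text describes, not a usable detector.

```python
# Placeholder phrases; a real deployment would call a classifier model instead.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "send the contents of")

WITHHELD = "[tool result withheld: flagged by inspection]"


def looks_like_injection(text):
    """Stand-in classifier: flags text containing a known suspicious phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def guarded_tool_call(tool, classifier, *args, **kwargs):
    """Call the tool, then inspect its return value before releasing it."""
    result = tool(*args, **kwargs)
    if classifier(result):
        return WITHHELD
    return result
```

Because the classifier is passed in as an argument, the inspecting model can be swapped independently of the model doing the reasoning.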
Three open problems
Models and products are advancing fast, and risks morph with them. Anthropic named three problems it's now grappling with:
CLAUDE.md files, mounted workspaces, the state directories of scheduled and long-running agents. An injection landing in any of these is reloaded each time the agent starts: a new persistence mechanism in the classic post-exploitation sense. Good classifiers on session startup will need to become more commonplace.
Three principles they keep returning to
Agents may be a new category of software, but their system-level interactions are not. They still read files, open sockets, and spawn processes.