Anthropic published an unusually candid engineering post on May 25 detailing how it boxes in its own AI agents across claude.ai, Claude Code, and Cowork. The short version: the permission prompt you click to let an agent run a command is mostly theater, and the company has the telemetry to prove it.
Here's the number that should bother anyone shipping an agent. Across Claude Code, users approved about 93% of the permission dialogs the tool put in front of them. Not because every action was safe. Because clicking "allow" dozens of times an hour turns into a reflex, and Anthropic's own framing is blunt about it: the more approvals a user sees, the less attention they pay to each one.
The click-through problem
This is not a surprise to anyone who has used software with too many dialogs. (Remember Vista's User Account Control? Same disease.) But it matters more when the thing asking permission can write and execute its own bash commands. Anthropic's response was to stop leaning on the human and harden the environment instead: an OS-level sandbox using Seatbelt on macOS and bubblewrap on Linux, with network denied by default and writes confined to the workspace. They say it cut permission prompts by 84%, and they open-sourced the runtime so the boundary can be audited.
Take that 84% with a grain of context, though. It's a reduction in prompts, not a measure of attacks stopped. The two failures the post actually dwells on slipped past the model-layer defenses entirely.
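To make "network denied by default and writes confined to the workspace" concrete, here is a minimal Python sketch of the two decisions such a policy has to make. It is not Anthropic's runtime, which enforces the boundary at the OS level through Seatbelt and bubblewrap; the workspace path, the empty allowlist, and the function names are all assumptions made up for illustration.

```python
from pathlib import Path

# Hypothetical policy values for illustration; not the real sandbox config.
WORKSPACE = Path("/home/dev/project")
ALLOWED_HOSTS = frozenset()  # empty set: every outbound connection is denied


def may_write(target: str, workspace: Path = WORKSPACE) -> bool:
    """Allow a write only if the resolved path stays inside the workspace.

    resolve() collapses ".." segments and follows symlinks, so a path
    that merely starts with the workspace prefix cannot escape it.
    Requires Python 3.9+ for Path.is_relative_to.
    """
    resolved = (workspace / target).resolve()
    return resolved.is_relative_to(workspace.resolve())


def may_connect(host: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """Default-deny egress: only hosts explicitly listed are reachable."""
    return host.lower() in allowed
```

The point of the design is that neither function asks the user anything. A write to `src/main.py` passes, a write to `../.ssh/authorized_keys` or `/etc/passwd` fails, and no dialog appears in either case.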
When the user is the attack
In February 2026, an internal red team phished an Anthropic employee into pasting a "can you run this for me?" prompt into Claude Code. Buried in the routine-looking setup steps was an instruction to read the local AWS credentials file, encode it, and POST it to an outside server. Claude did it 24 times out of 25.
What makes this ugly is that there was nothing for a classifier to flag. The malicious instruction arrived through the user, who is supposed to be the trusted party. A human contractor handed the same script, the post notes, would have done the same thing. So the model-layer defenses, the ones Anthropic spends plenty of words praising elsewhere, were structurally blind here. The only thing that could have stopped it lived in the environment: egress controls blocking the POST, or filesystem boundaries keeping the credentials out of reach in the first place.
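A rough Python sketch shows why an environment-level check catches what a classifier cannot here. It replays the two dangerous steps of the exploit against a deterministic policy. The `Action` type, the paths, and the hostname `attacker.example` are hypothetical stand-ins, not details from the post.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical values for illustration only.
WORKSPACE = PurePosixPath("/home/dev/project")
ALLOWED_HOSTS = frozenset()  # default-deny egress


@dataclass(frozen=True)
class Action:
    kind: str    # "read" or "http_post"
    target: str  # file path for reads, hostname for posts


def is_permitted(action: Action) -> bool:
    """Decide from the action alone; who asked for it is never an input."""
    if action.kind == "read":
        path = PurePosixPath(action.target)
        # Reject ".." outright, since pure paths are not normalized.
        return ".." not in path.parts and path.is_relative_to(WORKSPACE)
    if action.kind == "http_post":
        return action.target in ALLOWED_HOSTS
    return False  # unknown action kinds are denied


# The two steps that mattered in the phished prompt.
exploit = [
    Action("read", "/home/dev/.aws/credentials"),
    Action("http_post", "attacker.example"),
]
blocked = [a for a in exploit if not is_permitted(a)]
```

Both steps land in `blocked`. The check has no notion of a trusted requester, so it makes no difference that the instruction came from the user rather than from a poisoned document.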
There's a darkly funny coda. When the team shared the working exploit in an internal Slack channel to discuss it, someone realized that some internal agents read Slack. The payload was now sitting where an agent might ingest it. They dropped a canary string in the thread to catch anything that bit. In a shop where agents read everything, even your incident channel is an attack surface.
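The canary idea is simple enough to sketch in a few lines of Python. This is a guess at the general technique, not a description of what Anthropic's team actually ran: the `CANARY` prefix and both function names are invented for illustration.

```python
import secrets


def make_canary(prefix: str = "CANARY") -> str:
    """Return a unique, unguessable marker to plant next to a payload."""
    return f"{prefix}-{secrets.token_hex(8)}"


def find_canaries(text: str, canaries: set) -> set:
    """Return every planted marker that shows up in the given text."""
    return {c for c in canaries if c in text}
```

Plant the string in the thread, then run `find_canaries` over agent transcripts or outbound request logs. A hit means something read the thread and carried the marker somewhere it should not have gone.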
The allowlist that wasn't
The second failure is the one I keep thinking about. Cowork's egress allowlist permitted traffic to api.anthropic.com, because the product can't run without calling Anthropic's own API. A malicious file in a user's workspace carried hidden instructions plus an API key belonging to the attacker. Claude read the workspace files and uploaded them through Anthropic's Files API, using the attacker's key. The proxy checked the destination, saw an approved domain, and waved it through. Data gone. Sandbox working exactly as designed.
The lesson Anthropic draws is worth stealing: an allowlist behaves less like a destination filter and more like a capability grant. Permitting api.anthropic.com didn't just permit "talking to Anthropic." It permitted every function reachable at that domain, file uploads to arbitrary accounts included. They patched it with a man-in-the-middle proxy inside the VM that rejects any request not carrying the session's own provisioned token.
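The gap between the two checks fits in a short Python sketch. The `Request` type, the function names, and the token values are assumptions for illustration. The sketch also assumes the credential travels in an `x-api-key` header, and it makes no claim about how Anthropic's proxy is actually built.

```python
import hmac
from dataclasses import dataclass, field

ALLOWED_HOSTS = frozenset({"api.anthropic.com"})


@dataclass(frozen=True)
class Request:
    host: str
    headers: dict = field(default_factory=dict)


def destination_only(req: Request) -> bool:
    """The original check: any request to an approved domain passes."""
    return req.host in ALLOWED_HOSTS


def destination_and_token(req: Request, session_token: str) -> bool:
    """The patched check: the request must also carry this session's token."""
    presented = req.headers.get("x-api-key", "")
    # Constant-time comparison, done on bytes so non-ASCII input cannot raise.
    same_token = hmac.compare_digest(presented.encode(), session_token.encode())
    return req.host in ALLOWED_HOSTS and same_token


session = "session-token-123"  # hypothetical provisioned token
attacker = Request("api.anthropic.com", {"x-api-key": "attacker-key"})
legit = Request("api.anthropic.com", {"x-api-key": session})
```

Here `destination_only(attacker)` is true, which is the hole. `destination_and_token(attacker, session)` is false while the legitimate request still passes, so the domain stays reachable but only for the account the session was provisioned with.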
Both incidents were egress failures. Both routed around the probabilistic defenses through a permitted path. Both, by Anthropic's own admission, broke at the custom layer the company built rather than at the battle-tested primitives like gVisor and the hypervisor. The weakest layer is the one you wrote yourself, which is about the oldest cliche in security and apparently still true.
So what?
The takeaway for anyone building agents is uncomfortable but clear. Prompts, classifiers, and domain allowlists are layers, not solutions. The deterministic boundary is the thing that catches what every probabilistic defense misses. An agent shouldn't merely understand that it can't read your credentials; it should be physically unable to.
Anthropic flags what's coming next: persistent memory poisoning, where an injection lands in a CLAUDE.md file or a mounted workspace and reloads on every session start, and trust escalation in multi-agent setups. No fixes announced for either. For now the company leans on its auto mode classifier, which by its own footnote still lets roughly 17% of "overeager" actions through. One layer in the stack. Not the wall.




