Agents

Anthropic adds 'Dreaming' to Claude Managed Agents alongside outcomes and subagent orchestration

Anthropic added scheduled memory updates, rubric grading, and lead-agent orchestration to its Managed Agents platform.

Oliver Senti, Senior AI Editor
May 8, 2026 · 5 min read

Anthropic added three new features to its Claude Managed Agents platform on May 6 at the Code with Claude developer conference in San Francisco. The marquee item is "dreaming," a scheduled background process that reviews past agent sessions and restructures stored memory between runs. Outcomes and multi-agent orchestration both moved from research preview to public beta the same day.

Dreaming, or: a memory consolidation cron job

Strip the metaphor and what's left is an asynchronous job that reads an existing memory store plus up to 100 past sessions, deduplicates entries, and writes a curated replacement. The original session logs stay intact. The Decoder reports the feature currently runs on Claude Opus 4.7 and Sonnet 4.6, billed at standard API token rates.
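Anthropic has not published the internals, but the described shape (read memory plus past sessions, deduplicate, write a curated replacement while leaving the session logs untouched) can be sketched in a few lines. Everything here, from `MemoryEntry` to `consolidate`, is a hypothetical illustration, not the platform's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryEntry:
    topic: str
    note: str

def consolidate(memory: list[MemoryEntry],
                sessions: list[list[MemoryEntry]],
                max_sessions: int = 100) -> list[MemoryEntry]:
    """Merge lessons from up to max_sessions past sessions into a
    curated replacement memory. Session logs are read, never modified."""
    seen: set[tuple[str, str]] = set()
    curated: list[MemoryEntry] = []
    candidates = memory + [e for s in sessions[:max_sessions] for e in s]
    for entry in candidates:
        key = (entry.topic, entry.note.strip().lower())
        if key not in seen:  # deduplicate on normalized content
            seen.add(key)
            curated.append(entry)
    return curated
```

In the real system the curation step is a model call rather than string matching, but the job's structure (offline, batched, replace-don't-append) is the part worth noticing.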

Useful, sure. It is also not quite what the marketing implies. Anthropic framed dreaming as agents that "learn from their own mistakes," analogous to how human brains consolidate memories during sleep. That analogy is doing a lot of work. What is actually shipping is offline memory curation with a good name.

Alex Albert, who leads research product management at Anthropic, told VentureBeat that agents were learning to "write better notes for their future self." Fine. But the trust question Albert raised in the same interview is worth pulling forward: developers can let dreaming write to memory automatically or require review before any change lands. Most teams running production agents will pick the second option, which softens the autonomy story considerably.
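The auto-versus-review choice amounts to a write policy on the memory store. A minimal sketch, with hypothetical names (`WritePolicy`, `commit_memory` are illustrations, not Anthropic's API):

```python
from enum import Enum

class WritePolicy(Enum):
    AUTO = "auto"              # dreaming commits memory rewrites directly
    REQUIRE_REVIEW = "review"  # rewrites wait for human approval

def commit_memory(current: dict, proposed: dict,
                  policy: WritePolicy, approved: bool = False) -> dict:
    """Replace the memory store only when the policy allows it."""
    if policy is WritePolicy.AUTO or approved:
        return proposed
    return current  # hold the existing memory until someone signs off
```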

The Harvey number

Anthropic cited a roughly 6x lift in task completion at legal-AI company Harvey after dreaming was switched on. The gain came from the agent retaining lessons like filetype workarounds and tool-specific patterns across sessions. That is a real win on a narrow workload. Whether it generalizes is a separate question, and Anthropic does not claim it does.

Outcomes is the one to watch

Outcomes is the feature most likely to change how production agents actually get built. Developers write a rubric describing what success looks like (a structural template, a presentation standard, a brand voice, whatever fits the task). The agent works toward it. Then a separate Claude instance grades the output against the rubric in its own context window. If the work misses the bar, the grader flags what is wrong and the agent takes another pass. No human in the loop.

The architectural choice that matters here is the grader's isolated context. The grader does not see the agent's reasoning trace, only the final output and the rubric. That separation is what stops the agent from talking the grader into accepting a flawed result by walking it through clever justifications. Whether it actually prevents that in practice is another question. The design, at least, is right.
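The draft-grade-retry loop described above is simple enough to sketch. This is a toy stand-in, not the Outcomes API: `grade` here does naive substring matching where the real system makes a model call, but the control flow and the isolation (the grader sees only the output and the rubric, never the agent's reasoning) match the described design:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    passed: bool
    feedback: str

def grade(output: str, rubric: str) -> Verdict:
    """Stand-in for a fresh grader instance in its own context window.
    It receives only the final output and the rubric."""
    missing = [req for req in rubric.splitlines() if req and req not in output]
    if missing:
        return Verdict(False, "missing: " + "; ".join(missing))
    return Verdict(True, "ok")

def outcome_loop(agent: Callable[[Optional[str]], str],
                 rubric: str, max_passes: int = 3) -> str:
    """Agent drafts, grader scores, agent retries on failure. No human."""
    output = agent(None)
    for _ in range(max_passes - 1):
        verdict = grade(output, rubric)
        if verdict.passed:
            break
        output = agent(verdict.feedback)  # retry sees feedback, not the trace
    return output
```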

The numbers Anthropic published are vendor benchmarks, so apply the usual discount. Internal testing showed up to a 10-percentage-point improvement in task success over a standard prompting loop, with the largest gains on harder problems. On file generation specifically, the figures were 8.4% on docx and 10.1% on pptx, per IT Brief. The precision feels suspicious in the way vendor numbers always do, but the mechanism is plausible and the gains are modest enough to be credible. Treat them as directional.

Subagent orchestration, finally with an audit trail

Multi-agent orchestration lets a lead agent break work into pieces and delegate each one to a specialist subagent with its own model, system prompt, and tools. Up to 20 specialists; up to 25 threads running concurrently. The subagents share a filesystem but keep isolated context windows, and they feed results back into the lead's context. Webhooks notify the developer when the run finishes.
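The fan-out/fan-in pattern is old concurrency territory. A minimal sketch, assuming the limits stated above; the function and parameter names are hypothetical, and real subagents would be model sessions rather than Python callables:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SPECIALISTS = 20   # platform cap on distinct subagent definitions
MAX_CONCURRENCY = 25   # platform cap on threads running at once

def orchestrate(plan: dict, specialists: dict, on_done=None) -> dict:
    """Lead agent fans subtasks out to specialist subagents, then folds
    results back into its own context. Each specialist works in an
    isolated context; only return values cross back to the lead."""
    if len(specialists) > MAX_SPECIALISTS:
        raise ValueError("too many specialists")
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        futures = {name: pool.submit(specialists[name], task)
                   for name, task in plan.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    if on_done is not None:
        on_done(results)  # webhook-style notification when the run finishes
    return results
```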

Honestly, Claude Code and Cowork already do most of this informally. What is new is the managed runtime, the persistent event log, and the visibility layer in the Claude Console, where developers can see exactly which subagent did what, in what order, and why. That audit trail is the actual product. The orchestration pattern itself has been around for a while.

Netflix's platform team is using the system to process build logs from hundreds of applications in parallel, identifying recurring issues across batches. Spiral, the writing tool from Every, runs a lead agent that fields user requests and farms drafting out to subagents producing multiple parallel drafts, scored against editorial principles and the user's voice. Both are reasonable applications. Neither requires a model breakthrough to work, which is sort of the point Anthropic is making with the platform.

What this is, and what it isn't

On stage, the pitch was that the three features compose: orchestration parallelizes work, outcomes grades it, dreaming pulls lessons across runs into a continuously improving loop. That is a coherent narrative, and one the company's engineering blog has been building toward since the platform's April public beta.

It also requires buying the framing that agent infrastructure (not model capability) is the binding constraint on usefulness. Maybe it is. The 80x revenue growth Dario Amodei disclosed at the same event suggests enterprise developers are voting that way. But dreaming in particular reads like a feature with a name in search of a capability, while outcomes is the genuinely useful piece getting less stage time.

Dreaming is in research preview, with access by request. Outcomes, multi-agent orchestration, and memory are in public beta. No general availability date has been announced for any of them.

Tags: anthropic, claude, ai-agents, managed-agents, dreaming, multi-agent-orchestration, outcomes, code-with-claude, llm
Oliver Senti, Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
