Anthropic Dreaming, Outcomes, and Multiagent Orchestration: What Shipped at Code with Claude 2026
Code with Claude 2026 (May 6-7, San Francisco, London, Tokyo) shipped no new models, and that was the point. Three Claude Managed Agents features (Dreaming, Outcomes, Multiagent Orchestration) each replace a layer of glue code that builders currently write and maintain by hand. So the question this release forces is not "are these good features?" It is: for a given workload, is the glue they replace worth more than the lock-in and managed-layer pricing they add? That single trade decides everything below, and for most single-shot API workloads the candid answer is no. The customer numbers (Harvey 6x task completion, Wisedocs 50% review-time cut, Netflix parallel log scanning across hundreds of repos) are real signal, but each one is the product of a specific workload shape, and reading them without that shape is how you buy the wrong thing.
What follows: the mechanism under each feature, how you opt in, how to read those customer numbers without getting sold, and the build-vs-hand-it-over decision stated as a rule.
What Anthropic shipped on May 6-7
Five features were announced. Three of them live under Claude Managed Agents:
- Dreaming (research preview): a scheduled background process that reviews completed sessions, extracts patterns, and updates persistent memory stores between runs
- Outcomes (public beta): a separate grader agent that scores output against a developer-written rubric and triggers a retry if the bar is not met
- Multiagent Orchestration (public beta): a lead agent that breaks jobs into pieces and delegates them to specialist subagents running in parallel on a shared filesystem
The other two, Claude Finance (ten financial-services agent templates) and Microsoft 365 Add-ins, are out of scope here.
Boris Cherny, who created Claude Code, said on stage that "there is literally no manually written code anywhere in Anthropic anymore." Agents coordinate over Slack. Hold onto that detail. It explains where Dreaming came from: Anthropic ran agents on its own internal workflows long enough to watch memory continuity break down, then built the fix.
Dreaming: how offline memory consolidation works
The name comes from neuroscience. Anthropic's docs point at hippocampal memory consolidation, the brain replaying the day during sleep and deciding what is worth keeping. The analogy holds on the point that matters: Dreaming runs between agent sessions, never during one.
The sequence is short. An agent finishes a job. Dreaming fires on a schedule, reads the completed session alongside the existing memory stores, and extracts three categories of pattern:
- Recurring mistakes the agent made and corrected mid-job
- Workflows the agent converged on across multiple jobs
- Preferences that emerged across agent teams
Then it rewrites the memory store. Stale notes get condensed. Load-bearing ones get promoted. The next session opens with that updated context. Model weights never move. The whole mechanism runs on persistent plain-text notes that get fed into the agent's next system prompt, nothing more exotic than that.
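To make the mechanism concrete, here is a minimal sketch of a consolidation pass over plain-text notes. It is not Anthropic's implementation. `Note`, `consolidate`, `render_system_prompt`, the hit counter, and both thresholds are hypothetical names and policies chosen for illustration, and trimming low-hit notes stands in for the condensing step described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    text: str
    hits: int = 0  # number of session findings that matched this note


def consolidate(notes: List[Note], session_findings: List[str],
                max_notes: int = 5) -> List[Note]:
    """Merge one completed session's findings into the note store.

    Hypothetical policy: a finding that repeats an existing note bumps its
    hit count, a new finding becomes a new note, and only the most-hit
    notes survive, so stale entries fall away over time.
    """
    # Copy the notes so the caller's store is untouched until it accepts the result.
    by_text = {n.text: Note(n.text, n.hits) for n in notes}
    for finding in session_findings:
        note = by_text.setdefault(finding, Note(finding))
        note.hits += 1
    ranked = sorted(by_text.values(), key=lambda n: n.hits, reverse=True)
    return ranked[:max_notes]


def render_system_prompt(base_prompt: str, notes: List[Note],
                         promote_at: int = 3) -> str:
    """Build the next session's system prompt from the base prompt plus notes."""
    lines = [base_prompt, "", "Notes from earlier sessions:"]
    for note in notes:
        # Notes seen often enough are marked as standing knowledge.
        tag = "[standing]" if note.hits >= promote_at else "[recent]"
        lines.append(f"- {tag} {note.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    store = [Note("Scanned PDFs need OCR before parsing", hits=2)]
    findings = ["Scanned PDFs need OCR before parsing",
                "Client prefers numbered clauses"]
    store = consolidate(store, findings)
    print(render_system_prompt("You are a drafting agent.", store))
```

The point of the sketch is the data flow: text in, text out, and the only thing the next session sees is a longer system prompt.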
You get two configuration modes. One pushes memory updates live with no human in the loop. The other queues every proposed change for a developer to approve before it takes effect. If you want oversight before any context change propagates, default to the second one.
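The difference between the two modes is easiest to see as a gate in front of the store. The sketch below is an assumption about the shape, not the actual configuration surface: `MemoryStore`, `require_approval`, `propose`, `approve`, and `reject` are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStore:
    require_approval: bool = True                      # True = review-queue mode
    notes: List[str] = field(default_factory=list)     # what the next session sees
    pending: List[str] = field(default_factory=list)   # proposals awaiting review

    def propose(self, note: str) -> None:
        """Called by the consolidation pass for every change it wants to make."""
        if self.require_approval:
            self.pending.append(note)   # held until a developer decides
        else:
            self.notes.append(note)     # live mode: takes effect immediately

    def approve(self, note: str) -> None:
        """Developer accepts a queued proposal; it becomes live."""
        self.pending.remove(note)
        self.notes.append(note)

    def reject(self, note: str) -> None:
        """Developer discards a queued proposal; it never reaches an agent."""
        self.pending.remove(note)


if __name__ == "__main__":
    store = MemoryStore(require_approval=True)
    store.propose("Always convert .doc to .docx before editing")
    print(store.notes, store.pending)
    store.approve("Always convert .doc to .docx before editing")
    print(store.notes, store.pending)
```

In review-queue mode nothing reaches `notes` without an explicit `approve`, which is the property you want if a bad extraction could poison every later session.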
Dreaming sits in research preview today and needs a developer access request. Not self-serve yet.
The Hermes precedent
The open-source Hermes framework did between-session memory consolidation for roughly a year before Code with Claude 2026. So the concept is not Anthropic's. The managed default is. With Hermes you built and maintained the scheduler, the pattern extractor, and the memory update logic yourself. Dreaming ships all of that as infrastructure you never see. That gap is the whole decision when you are weighing how much custom glue you want to own.
Harvey: 6x task completion
Harvey, the legal-AI startup, ran Dreaming in internal testing and reported roughly a six-fold jump in task completion. The cause was boring and repeatable. Agents kept failing identical legal-drafting jobs because they forgot file-type quirks between sessions. Turn Dreaming on and those workarounds stuck. The failure pattern got extracted once and promoted to a standing memory entry, so every later job started with that knowledge already loaded.
Anthropic's own internal benchmarks show up to 10 percentage points of improvement on hard tasks versus standard prompting. Read both numbers as directional. Harvey's is a customer testimonial from internal testing, Anthropic's is self-reported, and neither is a controlled third-party benchmark. They point the same way, which is what makes them usable signal rather than noise.
Outcomes: structured success criteria for long-running agents
Outcomes fixes one specific failure: agents anchor on their own reasoning. An agent that argued its way to a mediocre answer grades that answer kindly, because the grading happens in the same context window that produced it. The result is a system that is loudly confident it nailed a job it botched.
The fix is structural separation. A developer writes a plain-language rubric for what success looks like. The task agent finishes. Then a separate grader agent opens a fresh context window, reads the rubric and the output, and scores one against the other with zero memory of how the task agent got there. Output short of the bar? The grader hands back specific feedback and the task agent runs again.
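The loop described above can be sketched in a few lines. This is an illustration of the pattern, not the Outcomes API: `run_with_grader`, `Grade`, and the two toy functions are hypothetical, the model calls are replaced by offline stand-ins, and treating the retry limit as "one first attempt plus N retries" is an assumption.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Grade(NamedTuple):
    passed: bool
    feedback: str


def run_with_grader(
    task_agent: Callable[[Optional[str]], str],
    grader: Callable[[str, str], Grade],
    rubric: str,
    max_retries: int = 3,
) -> Tuple[str, int, bool]:
    """Run the task, grade it in isolation, retry with feedback on failure.

    Returns (final_output, attempts_used, passed).
    """
    feedback: Optional[str] = None
    output = ""
    attempts = 0
    while attempts <= max_retries:  # one first attempt plus max_retries retries
        attempts += 1
        output = task_agent(feedback)
        # The grader receives only the rubric and the output, never the task
        # agent's reasoning. That is the structural separation.
        grade = grader(rubric, output)
        if grade.passed:
            return output, attempts, True
        feedback = grade.feedback
    return output, attempts, False


# Offline stand-ins for real model calls.
def toy_task_agent(feedback: Optional[str]) -> str:
    draft = "Conclusion: the motion should be granted."
    if feedback:  # on a retry, act on what the grader asked for
        draft = "Counter-argument: addressed. " + draft
    return draft


def toy_grader(rubric: str, output: str) -> Grade:
    if "Counter-argument" in output:
        return Grade(True, "")
    return Grade(False, "Address the counter-argument before the conclusion.")


if __name__ == "__main__":
    rubric = "Counter-argument addressed before conclusion."
    print(run_with_grader(toy_task_agent, toy_grader, rubric))
```

Note what crosses the boundary: the task agent gets feedback text, the grader gets rubric plus output, and neither sees the other's context.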
Internal benchmarks as of May 7, 2026: 8.4% quality improvement on .docx generation, 10.1% on .pptx. Anthropic's own framing of those numbers is the part worth quoting: "quality problems often stem from inadequate evaluation rather than model capability." Translation: the rubric is doing more work here than any model upgrade would.
Wisedocs, a document-review startup, pointed Outcomes at its internal review guidelines and cut review time in half, per the Code with Claude 2026 materials. Outcomes is in public beta and live now.
A minimal Outcomes rubric
The rubric is plain text. Here is the structure Anthropic documents for a legal brief grader:
```yaml
outcomes:
  rubric: |
    Score the output on three criteria:
    1. Jurisdiction accuracy: all cited cases are from the correct jurisdiction.
    2. Citation format: all citations follow Bluebook 21st edition.
    3. Argument structure: counter-argument addressed before conclusion.
    Pass threshold: all three criteria met. Fail = retry with grader feedback.
  max_retries: 3
  notify_webhook: "https://your-endpoint.example.com/agent-complete"
```
Look at notify_webhook. Start an Outcomes run and walk away. The webhook fires when the agent either clears the rubric or burns through its retries. Long legal, document, and code-review jobs fit this shape almost perfectly.
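On the receiving end, the webhook is just an HTTP POST you have to catch. Here is a minimal standard-library sketch of a receiver. The payload shape is not specified in this article, so the `run_id` and `status` fields are assumptions, as are the `summarize_completion` and `serve` names and the port.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_completion(body: bytes) -> str:
    """Turn a completion payload into a one-line log message.

    Assumes a JSON object with hypothetical "run_id" and "status" keys and
    falls back to "unknown" when either is missing.
    """
    payload = json.loads(body)
    run_id = payload.get("run_id", "unknown")
    status = payload.get("status", "unknown")
    return f"run {run_id} finished with status {status}"


class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender declared.
        length = int(self.headers.get("Content-Length", 0))
        print(summarize_completion(self.rfile.read(length)))
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block and handle webhook calls until interrupted."""
    HTTPServer(("127.0.0.1", port), CompletionHandler).serve_forever()
```

Call `serve()` to start listening. A production receiver would also authenticate the sender before trusting the payload.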
Multiagent Orchestration: parallel tasks and shared context
Of the three, Orchestration maps most cleanly onto patterns developers were already hand-building. A lead agent takes a complex job, splits it, and hands each piece to a specialist subagent. Every specialist gets its own model, prompt, and tool set. They run in parallel on a shared filesystem, and the lead can check progress mid-workflow because every action is persistent and auditable in the Claude Console.
Public beta. Available now.
The real difference from a hand-rolled loop on the Messages API is state management. Roll your own and you own context passing, subagent tracking, failure handling, and log surfacing. Managed Agents owns that whole layer instead, and every agent action shows up in Claude Console without you wiring it.
Netflix's platform team showed the clearest production case at the conference. An orchestrated agent chews through build logs from hundreds of repositories and surfaces recurring issues across applications. Parallel scanning, filtered reporting, one lead agent coordinating. Sequential, that job runs for hours. Wired by hand across that many sources, it never gets built at all.
Spiral by Every built a writing tool on the same skeleton. A Haiku-based lead agent fields incoming requests while a set of Opus-based subagents draft in parallel, and nothing returns to a user until it clears an Outcomes rubric. That is Orchestration and Outcomes composed together in one product.
Setting up an orchestrated run
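```python
import anthropic

# Call shape as presented in this article. Outcomes and Orchestration are in
# public beta, so parameter names and schema may shift before GA.
client = anthropic.Anthropic()

orchestration = client.managed_agents.orchestrations.create(
    lead_model="claude-opus-4-5",  # the lead plans, delegates, and synthesizes
    task="Analyze build logs across the following repos and identify recurring failures.",
    specialists=[
        # One specialist per repo, each with a read-only tool set.
        {
            "model": "claude-haiku-4-5",
            "role": "log_scanner",
            "tools": ["filesystem_read"],
            "scope": "repos/service-a",
        },
        {
            "model": "claude-haiku-4-5",
            "role": "log_scanner",
            "tools": ["filesystem_read"],
            "scope": "repos/service-b",
        },
    ],
    shared_filesystem=True,  # specialists and lead work on the same files
)

print(orchestration.run_id)
```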
Lead coordinates. Specialists run concurrently. Results flow back to the lead for final synthesis and output.
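For comparison, this is roughly what the same fan-out and fan-in looks like when you wire it yourself. It is a local stand-in, not the managed feature: plain functions replace model calls, an in-memory dict replaces the shared filesystem, and `scan_repo`, `lead`, the `ERROR: ` log format, and the `min_repos` threshold are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set


def scan_repo(log_lines: List[str]) -> Set[str]:
    """Specialist: return the distinct error messages in one repo's build log."""
    marker = "ERROR: "
    return {line.split(marker, 1)[1] for line in log_lines if marker in line}


def lead(logs_by_repo: Dict[str, List[str]], min_repos: int = 2) -> List[str]:
    """Lead: fan out one specialist per repo, then keep errors seen in several repos."""
    with ThreadPoolExecutor() as pool:
        # Specialists run concurrently; map returns results in input order.
        per_repo = list(pool.map(scan_repo, logs_by_repo.values()))
    # Each repo contributes a set, so an error counts once per repo.
    counts = Counter(err for errors in per_repo for err in errors)
    return sorted(err for err, n in counts.items() if n >= min_repos)


if __name__ == "__main__":
    logs = {
        "service-a": ["INFO build started", "ERROR: timeout fetching deps"],
        "service-b": ["ERROR: timeout fetching deps", "ERROR: lint failed"],
    }
    print(lead(logs))
```

Everything this sketch leaves out (retries when a specialist fails, progress checks mid-run, an audit trail) is the state-management layer the managed version takes over.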
Reading the customer numbers
Three companies, three use cases, three different stories behind the numbers.
Harvey (6x task completion): The lift is memory persistence, not a smarter model. The agents could already do the task. They kept failing because they re-learned the same file-type handling every single run. Dreaming converted a per-session patch into a standing memory. The multiple is huge because the baseline failure rate was huge. If your agents do not keep hitting the same class of problem across sessions, expect a smaller gain.
Wisedocs (50% review time cut): This is an Outcomes number, full stop. The speedup is automated grading replacing a human review loop, not the agent being more accurate on attempt one. Define the review criteria once, let a grader enforce them on every run, and you stop paying a human to score each output by hand. That recovered time is the 50%.
Netflix (parallel log processing): No completion rate. No time-savings figure. Netflix published neither, and that is fine, because the value here is architectural. Scanning hundreds of repositories sequentially is impractical; parallel orchestration makes it tractable. The published fact is that the platform team runs this in production. A team operating at that scale betting on it counts as signal even with no percentage attached.
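The Harvey point about baselines is plain arithmetic, and it is worth running before you project that multiple onto your own workload. A completion rate cannot exceed 100%, so the largest multiple you can possibly see is one divided by your current rate. Harvey's actual baseline is not published; the rates below are illustrative only, and `max_multiple` is a hypothetical helper.

```python
def max_multiple(baseline_rate: float) -> float:
    """Largest possible completion multiple for a given baseline completion rate.

    The ceiling is 100% completion, so the bound is 1 / baseline_rate.
    """
    if not 0 < baseline_rate <= 1:
        raise ValueError("baseline_rate must be in (0, 1]")
    return 1.0 / baseline_rate


if __name__ == "__main__":
    for rate in (0.125, 0.25, 0.5):
        print(f"baseline {rate:.1%}: at most {max_multiple(rate):.0f}x")
```

A six-fold gain therefore requires a baseline at or below one in six, roughly 17%. If your agents already complete half their jobs, 2x is the ceiling no matter what memory layer you add.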
When to use Managed Agents vs your own loop on the Messages API
This is the decision most builders actually face. Managed Agents is not the right call for every workload, and pretending otherwise wastes your budget.
| Factor | Hand-rolled Messages API loop | Claude Managed Agents |
|---|---|---|
| Session length | Short (minutes, single context window) | Long (hours, multiple sessions) |
| Memory continuity | You build and maintain | Managed (Memory + Dreaming handle it) |
| Evaluation | Custom or absent | Outcomes: rubric-based, automatic retry |
| Parallelism | You wire subagent coordination | Orchestration: lead/specialist pattern built in |
| Auditability | Whatever you log | Claude Console, all actions persistent |
| Cost model | You control every call | Managed layer has its own pricing |
| Access path | Available now on Messages API | Dreaming: request access; Outcomes + Orchestration: public beta |
| Best fit | One-shot tasks, tight integration needs, custom tool orchestration | Long-running work spanning many sessions, built-in eval loops, parallel decomposition |
The graduation point lands roughly here. Agent running multi-session jobs while you babysit your own memory store? Dreaming is worth the access request. Hand-writing retry logic around evaluation? Outcomes replaces it. Coordinating several agents by hand? Orchestration is the layer you stop building.
Single-session task that fires off the Messages API and returns? Managed Agents buys you complexity with no matching payoff. Keep the simple loop and move on.
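Stated as code, the rule is three independent checks with the simple loop as the default. This is only a restatement of the paragraphs above; `recommend` and its parameter names are hypothetical, and it deliberately ignores pricing because the managed layer's cost is workload-specific.

```python
from typing import List


def recommend(multi_session: bool,
              hand_rolled_eval_retry: bool,
              coordinates_multiple_agents: bool) -> List[str]:
    """Map the three graduation signals to the features worth evaluating."""
    picks: List[str] = []
    if multi_session:                # you babysit your own memory store
        picks.append("Dreaming")
    if hand_rolled_eval_retry:       # you wrote grade-and-retry logic yourself
        picks.append("Outcomes")
    if coordinates_multiple_agents:  # you wire subagent coordination by hand
        picks.append("Multiagent Orchestration")
    # No signal at all means the plain loop is still the right tool.
    return picks or ["Messages API loop"]


if __name__ == "__main__":
    print(recommend(False, False, False))
    print(recommend(True, True, False))
```

The checks are independent on purpose: adopting Outcomes does not commit you to Dreaming or Orchestration.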
How this changes Cursor, Cline, and Aider workflows
The three coding tools sitting closest to this are Cursor, Cline, and Aider. None is a Managed Agents product. Each calls Claude through the API and stacks its own agent loop on top, so the Code with Claude 2026 news is downstream infrastructure, not a UI change inside any of them. The link is indirect. It is also real, in three places.
Memory continuity across sessions. Cursor, Cline, and Aider each ship their own memory or context mechanism, and none currently plugs into Managed Agents' Memory or Dreaming layer. A developer building a custom agent on the Messages API for a multi-day job, say a long codebase migration, can now opt into Dreaming instead of building the memory infrastructure cold. That matters to teams treating these tools as architectural inspiration, not to people just using Cursor day to day.
Evaluation loops. Aider already runs tests after each change and reverts on failure. That is a hand-rolled evaluation loop. Outcomes is the same pattern as managed infrastructure. A team building its own coding agent gets to delete the run-tests, grade, retry code it would otherwise write and maintain.
Parallel agent runs. Cline is single-agent per session. Cursor's Background Agent and Bugbot handle one job per invocation. Orchestration targets jobs that split cleanly into parallel subagent work. Fanning out code review, doc generation, or test writing across many files at once is exactly its lane, so that is the layer to evaluate.
One practical note for anyone using these tools today: the May 6-7 release changes nothing about Cursor, Cline, or Aider's UI or behavior. What moved is the infrastructure available to builders who want to go past the off-the-shelf loop.
What to watch next: pricing, API parity, and the Dreaming opt-in
Three things are still open as of May 13, 2026.
Dreaming pricing. Anthropic's research-preview features have historically graduated to GA with usage-based pricing, and Dreaming's cost model is not published yet. Memory operations and pattern extraction are extra compute layered on top of base model calls. Budget for that layer to be non-trivial on long-running, high-volume workloads.
API parity for Outcomes and Orchestration. The public-beta paths are documented. The full API surface for multi-specialist orchestrations and complex rubrics is not settled. Expect parameter names and configuration schema to shift before GA.
The Dreaming access path. Building on long-running agent work today? Request access now instead of waiting for GA. Anthropic's research-preview programs have moved to general availability within two to four months of release, and early access drops you into the feedback loop that shapes the final configuration model.
Outcomes is the lowest-friction way in. Write a plain-language rubric, point it at a task with measurable quality criteria, run it. The grader-to-agent-retry loop is the piece that bends your architecture, so understand its shape before you design around it.