Raindrop Workshop: open-source local debugger for AI agents

Raindrop, the San Francisco observability startup formerly known as Dawn AI, has released Workshop under an MIT license. It's a local debugger for AI agents that streams tokens, tool calls, and decisions into a browser UI at localhost:5899, then exposes those traces to a coding agent through an MCP server so the agent can write evals, run them, and patch its own code.

The pitch, in the company's words, is the first sane way to debug an agent locally. Whether that's true depends on how much you've suffered with the alternatives.

What it actually does

Workshop runs as a local daemon paired with a Vite-built UI. Drop the Raindrop SDK into your project, set RAINDROP_LOCAL_DEBUGGER, and every span from your agent run mirrors into the local interface as it happens. No polling. No cloud forwarding. Everything sits in a single SQLite .db file on your machine, which is a relief for anyone whose legal team has Opinions about agent traces leaving the laptop.
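To make the single-file storage model concrete, here is a minimal sketch of how spans could be appended to and read back from one SQLite database. The table layout, column names, and helper functions (open_store, record_span, load_trace) are assumptions for illustration; Workshop's actual schema isn't described in the material covered here.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) a span store. The schema is hypothetical, not Workshop's."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS spans (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               trace_id TEXT NOT NULL,
               kind TEXT NOT NULL,
               payload TEXT NOT NULL,
               created_at REAL NOT NULL
           )"""
    )
    return conn


def record_span(conn, trace_id, kind, payload):
    """Append one span (token, tool call, decision) as a JSON payload."""
    conn.execute(
        "INSERT INTO spans (trace_id, kind, payload, created_at) VALUES (?, ?, ?, ?)",
        (trace_id, kind, json.dumps(payload), time.time()),
    )
    conn.commit()


def load_trace(conn, trace_id):
    """Return the spans of one run in insertion order."""
    rows = conn.execute(
        "SELECT kind, payload FROM spans WHERE trace_id = ? ORDER BY id",
        (trace_id,),
    ).fetchall()
    return [(kind, json.loads(payload)) for kind, payload in rows]


# Usage: an in-memory database stands in for the on-disk .db file.
conn = open_store()
record_span(conn, "run-1", "llm_token", {"text": "Hello"})
record_span(conn, "run-1", "tool_call", {"name": "lookup_symptom", "args": {"q": "limping"}})
trace = load_trace(conn, "run-1")
```

Passing a file path instead of ":memory:" would keep every run in one file on disk, which is the property the local-only design depends on.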

The install is one line: curl -fsSL https://raindrop.sh/install | bash. Bun underneath, if you're building from source. macOS, Linux, Windows. The documentation lists SDK coverage that's broader than most observability tools at launch: Vercel AI SDK, OpenAI Agents SDK, Anthropic SDK, Claude Agent SDK, LangChain, LangGraph, CrewAI, Mastra, Pydantic AI, DSPy, Google ADK, Strands, Agno, and Deep Agents.

That's a lot of frameworks for a tool that's been public for less than a week. Whether the integrations are equally polished across all of them is a different question, and one I can't answer without running each one.

The self-healing loop

The interesting part isn't the tracing. Plenty of tools trace agents. The interesting part is that Workshop exposes those traces over MCP, which means a coding agent running in your terminal (Claude Code, Codex, Cursor) can read the trace as a first-class input.

Raindrop's demo uses a veterinary triage assistant. The agent is supposed to ask clarifying questions about symptoms; in the failing trace, it skips them. Claude Code reads the span, writes an eval that asserts the agent should ask follow-ups for a given input, runs the agent, watches it fail, edits the prompt or code, re-runs, and repeats until the assertion passes. The company calls it the self-healing eval loop. The team behind it includes Ben Hylak, who spent four years on Apple's Human Interface team working on visionOS before co-founding Raindrop with Zubin Koticha and Alexis Gauba.
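The shape of that loop is easy to sketch. The version below is a toy under stated assumptions: stub_agent is a hypothetical stand-in for a real model call, the eval is a crude "does the reply contain a question" check, and the "fix" is picking the next prompt from a hand-written list, where a real coding agent would edit the prompt or code itself. None of these names come from Raindrop's SDK.

```python
def asks_follow_up(reply: str) -> bool:
    """Eval: pass only if the reply contains at least one question."""
    return "?" in reply


def stub_agent(system_prompt: str, user_message: str) -> str:
    """Hypothetical stand-in for a model call, so the sketch runs offline."""
    if "clarifying question" in system_prompt.lower():
        return "How long has your dog been limping, and is the leg swollen?"
    return "It is probably a sprain. Rest the leg for a few days."


def heal(candidate_prompts, user_message, max_attempts=5):
    """Run the agent with each prompt in turn until the eval passes.

    Returns (passing_prompt, attempts_used), or (None, attempts_used)
    if no candidate passes within max_attempts.
    """
    tried = 0
    for prompt in candidate_prompts[:max_attempts]:
        tried += 1
        reply = stub_agent(prompt, user_message)
        if asks_follow_up(reply):
            return prompt, tried
    return None, tried


prompts = [
    "You are a veterinary triage assistant.",
    "You are a veterinary triage assistant. "
    "Always ask a clarifying question about symptoms before advising.",
]
fixed_prompt, attempts = heal(prompts, "My dog is limping.")
```

The max_attempts cap matters more than it looks: without one, a loop like this keeps running, and spending, on a failure it cannot actually fix.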

It's a clean demo. I want to see it on something messier than a triage prompt before declaring it a category-definer. Production agent failures rarely look like "the prompt forgot one instruction." They look like a tool returning malformed JSON six steps deep, which the model then hallucinates around for another four steps until the user gives up. Whether Claude Code can untangle that from a trace, write a meaningful eval, and fix the right thing without breaking three others is the actual test.

Replay, and what it's for

The replay feature is the part I'd reach for first. You take a trace from production, run it through your locally-running agent code, and watch the new trace stream back into Workshop side by side with the original. Edit the prompt, swap the model, change a tool, see what diverges.

This is what most teams cobble together with notebooks and screenshots. Having it in a single tool that the coding agent can also see is genuinely useful. It's also where the open-source decision pays off: replay is the kind of feature you want auditable, because if it silently mangles a trace your fix is based on a lie.
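The core of a side-by-side comparison is finding where two runs stop agreeing. A minimal sketch, assuming each trace is a plain list of step dictionaries (a made-up shape, not Workshop's trace format), with first_divergence as a hypothetical helper:

```python
from itertools import zip_longest


def first_divergence(original, replayed):
    """Return (index, original_step, replayed_step) at the first mismatch.

    Returns None when the traces are identical. A trace that ends early
    shows up as a mismatch against None at the first missing step.
    """
    for i, (a, b) in enumerate(zip_longest(original, replayed)):
        if a != b:
            return i, a, b
    return None


# Illustrative data only: the original run hit a malformed tool result.
original = [
    {"type": "llm", "output": "Call lookup_symptom"},
    {"type": "tool", "name": "lookup_symptom", "result": "{malformed"},
    {"type": "llm", "output": "It is probably a sprain."},
]
replayed = [
    {"type": "llm", "output": "Call lookup_symptom"},
    {"type": "tool", "name": "lookup_symptom", "result": '{"severity": "low"}'},
    {"type": "llm", "output": "How long has the limp lasted?"},
]
divergence = first_divergence(original, replayed)
```

Exact equality is the simplest possible comparison. Model output is rarely byte-identical between runs, so a real diff would likely compare structure (which tool was called, with what arguments) and not raw text.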

The business question

Raindrop raised $15 million in seed funding led by Lightspeed Venture Partners in late 2025, with participation from Figma Ventures, Vercel Ventures, Y Combinator, and a small constellation of founders from Replit, Cognition, Framer, and Notion. The company sells a hosted production monitoring platform that they describe as "Sentry for AI agents."

So why give away the local piece? The same reason Sentry gave away the SDK. Workshop is the wedge. You debug locally, you ship to production, and at some point you want the same traces aggregated, alerted on, and grouped into incidents across millions of events. That's the paid product. CEO Zubin Koticha told VentureBeat earlier this year that evals catch the regressions you already know about, and the worst issues are the ones you haven't imagined yet. The hosted product is the bet that you'll pay to find those.

The open-source release doesn't change that bet. It just lowers the friction at the front of the funnel.

What's missing

A few things I noticed that the documentation doesn't address.

The MCP integration is the whole story, and MCP itself is still moving. A coding agent reading traces over MCP, writing evals, and modifying source code is also a coding agent with broad filesystem access running against potentially adversarial trace content. Workshop's documentation doesn't really discuss the threat model. Maybe that's fine for solo developers debugging their own agents locally; it's a different conversation when the trace came from a production user.

Cost isn't mentioned either. The self-healing loop means Claude Code or Codex burns through tokens watching itself fix things. For complex traces, that's not free. The pitch is that the engineer's time is more expensive, which is usually right, but the token bill is worth knowing before you point Claude at a 50-step trajectory.
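A back-of-envelope estimate is easy to run before you start. Every number below is a placeholder assumption chosen to make the arithmetic visible, not a measured figure or a real price; plug in your own trace sizes and your provider's current rates. The sketch also counts only the tokens spent re-reading the trace on each attempt, so it understates the total.

```python
def loop_cost_usd(steps, tokens_per_step, iterations, usd_per_million_tokens):
    """Rough input-token cost of re-reading a whole trace once per fix attempt."""
    tokens_per_read = steps * tokens_per_step
    total_tokens = tokens_per_read * iterations
    return total_tokens * usd_per_million_tokens / 1_000_000


# Placeholder inputs: 50 steps, 2,000 tokens per step, 10 fix attempts,
# and an assumed rate of 3 USD per million input tokens.
estimate = loop_cost_usd(
    steps=50,
    tokens_per_step=2_000,
    iterations=10,
    usd_per_million_tokens=3.0,
)
```

With those assumed inputs the loop reads one million tokens, which comes to 3 dollars at the assumed rate. The point is the shape of the formula: cost scales with trace length times attempts, so long trajectories and stubborn bugs multiply each other.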

And the language coverage claim (TypeScript, Python, Go, Rust) is technically true but uneven in practice. The Python and TypeScript SDKs are clearly the priority. If you're on Go or Rust, you're earlier in the queue.

Worth installing?

Yes, probably. The install is one command, the local-only architecture means you can try it on a real project without sending anything to Raindrop's servers, and if the self-healing loop works on even half the bugs you encounter, it pays for itself in saved screenshots-pasted-into-Claude.

The longer-term question is whether "local debugger for agents" turns out to be a real category or just a nice-to-have on top of a hosted monitoring product. Raindrop is betting on the former. Their funding suggests their investors agree. The rest of us get a free tool while they find out.