AI Memory

Supermemory Claims 99% on LongMemEval With a Brute-Force Agent Swarm

A 20-year-old founder's startup throws 8 parallel LLM agents at a memory benchmark and nearly maxes it out. The catch: it is experimental.

Oliver Senti, Senior AI Editor
March 22, 2026 · 5 min read
[Image: Abstract visualization of parallel AI agents processing streams of conversational data into structured memory]

Supermemory, the AI memory startup founded by 20-year-old Dhravya Shah, published a blog post today claiming ~99% accuracy on LongMemEval_s, the 500-question benchmark from ICLR 2025 that tests long-term conversational memory. The technique behind it, which they're calling ASMR (Agentic Search and Memory Retrieval), ditches vector databases entirely in favor of parallel LLM agents that read, search, and reason over conversation histories.

Before anyone gets too excited: Shah himself flags that this is "not our main production Supermemory engine (yet)." It is an experimental setup. The company's production system sits at roughly 85% on the same benchmark. But the gap between 85 and 99 is interesting enough to dig into, even if the path there raises some questions.

What they actually built

LongMemEval is designed to simulate the mess of real assistant interactions: 115,000+ tokens of conversation history, contradictory facts, events scattered across sessions, and questions that require temporal reasoning. Most RAG systems score somewhere in the 40-60% range. Supermemory's production engine, which already leads public benchmarks according to the company's own research page, gets to about 85%. Full-context GPT-4o without any memory layer manages just 60.2%, according to numbers published in the Hindsight paper.

ASMR replaces the entire vector search pipeline with agents. Three "reader" agents (running on Gemini 2.0 Flash) ingest conversation sessions in parallel, extracting structured knowledge across six categories: personal info, preferences, events, temporal data, updates, and assistant-generated info. No embeddings, no chunking. When a question comes in, three more "search" agents fan out across those extracted findings, each with a different focus: one hunts for direct facts, another looks for contextual implications, and the third reconstructs temporal timelines.
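The ingestion step described above can be sketched in a few lines. This is an illustrative reconstruction, not Supermemory's code: `call_llm` is a stand-in for a real model call (Gemini 2.0 Flash in their setup), and the function names and prompt text are assumptions based on the article's description.

```python
import asyncio

# Six extraction categories, per the ASMR write-up.
CATEGORIES = [
    "personal_info", "preferences", "events",
    "temporal_data", "updates", "assistant_info",
]

async def call_llm(prompt: str) -> dict:
    # Placeholder for a real model call (e.g. Gemini 2.0 Flash).
    await asyncio.sleep(0)
    return {c: [] for c in CATEGORIES}

async def read_session(session_text: str) -> dict:
    prompt = (
        "Extract structured knowledge from this conversation "
        f"under the categories {CATEGORIES}:\n{session_text}"
    )
    return await call_llm(prompt)

async def ingest(sessions: list[str], n_readers: int = 3) -> list[dict]:
    # Fan sessions out across a pool of reader agents.
    # No embeddings, no chunking: readers return structured findings.
    sem = asyncio.Semaphore(n_readers)

    async def bounded(s: str) -> dict:
        async with sem:
            return await read_session(s)

    return await asyncio.gather(*(bounded(s) for s in sessions))

findings = asyncio.run(ingest(["session 1 text", "session 2 text"]))
```

The search agents would then read over `findings` directly rather than querying a vector index, which is the core architectural swap the post describes.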

Then it gets a bit wild.

The ensemble trick

The headline 98.6% number comes from routing retrieved context through eight specialized prompt variants in parallel. A "Precise Counter," a "Time Specialist," a "Context Deep Dive," and five others each independently generate an answer. If any of the eight gets it right, the question counts as correct.

I want to sit with that for a second. Eight independent shots at each question, and you only need one hit. That is a generous scoring methodology. It tells you something about the ceiling of what agentic retrieval can achieve, but it doesn't tell you much about what a user would actually experience. A user gets one answer, not eight.
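The gap between benchmark scoring and user experience is easy to make concrete. This toy example (my construction, not Supermemory's) contrasts "any of k correct," which is how the 98.6% is scored, with the single answer a deployed system would have to return:

```python
# "Any correct" credits a question if at least one of k parallel
# variants matches the gold answer; a user only ever sees one.

def any_correct(answers: list[str], gold: str) -> bool:
    # Benchmark-style oracle scoring: one hit out of k counts.
    return any(a == gold for a in answers)

def single_answer(answers: list[str]) -> str:
    # What a user would get: one answer (here, naively the first).
    return answers[0]

# Hypothetical outputs from eight prompt variants; gold is "Paris".
variants = ["Lyon", "Paris", "Marseille", "Paris",
            "Nice", "Lyon", "Paris", "Paris"]

assert any_correct(variants, "Paris")        # ensemble scores the point
assert single_answer(variants) == "Lyon"     # but the user got this one
```

Under this scoring rule the ensemble is credited even though half the variants, including the one a user might have received, were wrong.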

Shah's team seems aware of this, because they also ran a second configuration: 12 specialist agents feeding into an aggregator LLM that produces a single consensus answer via majority voting. That scored 97.2%, which is more defensible as a real-world metric, though still built on a pretty heavy compute stack (twelve GPT-4o-mini calls plus an aggregation step per question).
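The consensus configuration reduces to something like a majority vote. A minimal sketch, assuming simple exact-match voting; in Supermemory's setup the aggregator is itself an LLM, so the real reduction step is fuzzier than this:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Return the most common answer; ties resolve to the
    # answer seen first (Counter preserves insertion order).
    return Counter(answers).most_common(1)[0][0]

# Hypothetical outputs from twelve specialist agents.
specialist_answers = [
    "March 3", "March 3", "March 5", "March 3",
    "March 3", "March 7", "March 3", "March 5",
    "March 3", "March 3", "March 5", "March 3",
]

assert majority_vote(specialist_answers) == "March 3"
```

Unlike the any-of-eight rule, this produces exactly one answer per question, which is why the 97.2% figure is the more defensible real-world number.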

The competitive picture

The agent memory space has gotten crowded. Mastra's Observational Memory system reported 94.87% on LongMemEval using GPT-5-mini. OMEGA, a local-first MCP server, claims 95.4% with a classification and extraction pipeline running entirely on-device. The Hindsight system from Vectorize hit 91.4% with Gemini-3 Pro. And Zep, which uses a temporal knowledge graph approach, published 71.2% with GPT-4o.

So Supermemory's experimental result is the highest anyone has posted. But the comparison is apples-to-oranges in a few ways. OMEGA runs locally with zero API calls. Mastra's system uses a stable, prompt-cacheable context window. Supermemory's ASMR pipeline fires off six parallel agents for ingestion, three more for search, then eight or twelve more for answering. The compute budget is not comparable.

No vector database?

The "no vector DB required" claim is the part that will get the most attention from engineers building memory systems. And it is real, if you squint. ASMR replaces vector similarity search with agents that actively read and reason over extracted findings. Shah's argument is that semantic similarity matching can't distinguish between an old fact and a newer correction, which is fair. Temporal reasoning is where traditional RAG consistently falls apart.

But "no vector DB" doesn't mean "no expensive infrastructure." You're swapping one cost (embedding storage and retrieval) for another (a dozen frontier-model API calls per query). Whether that trade makes sense depends on what you're optimizing for. Benchmark accuracy? ASMR wins. Latency and cost per query? I'd want to see those numbers before committing.
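The trade-off is easy to put rough numbers on. All prices and call counts below are hypothetical placeholders, not published Supermemory or vendor figures; the point is the structure of the comparison, not the dollar amounts:

```python
# Back-of-envelope per-query cost comparison (all values hypothetical).

def rag_query_cost(embed_calls: int = 1, llm_calls: int = 1,
                   embed_price: float = 0.0001,
                   llm_price: float = 0.002) -> float:
    # Classic RAG: one embedding lookup plus one generation call.
    return embed_calls * embed_price + llm_calls * llm_price

def asmr_query_cost(search_agents: int = 3, answer_agents: int = 12,
                    llm_price: float = 0.002) -> float:
    # ASMR-style query: search agents fan out, answer specialists
    # run in parallel, plus one aggregation call.
    return (search_agents + answer_agents + 1) * llm_price

assert asmr_query_cost() > rag_query_cost()  # agent swarm costs more per query
```

Even with generous assumptions, the agentic path runs an order of magnitude more model calls per query, which is why the latency and cost numbers matter before anyone ships this.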

The founder's trajectory

Shah raised $3 million in seed funding backed by Google AI chief Jeff Dean, Cloudflare CTO Dane Knecht, and Sentry founder David Cramer, among others. He dropped out of Arizona State, moved to San Francisco, and has been shipping at a pace that makes the funding story almost secondary. The GitHub repo shows plugins for Claude Code, Cursor, and other coding agents, plus a benchmarking framework called MemoryBench for comparing memory providers head-to-head.

The company says it will open-source the complete ASMR experimental code in early April. That matters more than the benchmark number, frankly. Benchmark scores are easy to cherry-pick. Reproducible code is harder to argue with.

What to make of this

A 99% score on any benchmark sounds like the problem is solved. Shah even flirts with this framing, ending his blog post with "Agent memory is now (probably) a solved problem?" with a question mark that's doing a lot of heavy lifting.

It isn't solved. LongMemEval tests 500 questions across roughly 40 sessions. OMEGA's creator built a separate benchmark called MemoryStress that throws 1,000 sessions at memory systems, and the numbers crater. Supermemory had a production outage on March 6 caused by API key tracking queries under heavy load, which is the kind of real-world scaling problem that benchmarks don't capture.

Still, going from 60% (full-context GPT-4o) to 99% on a serious memory benchmark, even with experimental methods and generous scoring, tells you something about where agentic architectures are heading. The interesting question isn't whether ASMR's specific approach ships to production. It is whether the core insight, that agents actively reasoning over memory beat passive similarity search, holds up when someone builds the cost-efficient version.

The open-source release is expected in early April. That's when the real scrutiny starts.

Tags: AI memory, LongMemEval, Supermemory, agentic AI, RAG, vector search, LLM agents, benchmark
Oliver Senti
Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


