Researchers Want to Untangle the Mess of AI Agent Memory. Good Luck.

A survey paper from 47 authors attempts to bring order to a fragmented field where "memory" can mean a dozen different things.

Oliver Senti, Senior AI Editor
December 18, 2025 · 5 min read

A team of 47 researchers from institutions including Fudan University and Oxford published a survey this week attempting something the AI agent community badly needs: a coherent way to talk about memory. The paper, "Memory in the Age of AI Agents," dropped on arXiv on December 15 and immediately surfaced a problem anyone building agents has encountered. Everyone uses the word "memory," but nobody means the same thing.

The terminology problem

The frustration driving this work is real. Walk into any AI engineering discussion and you'll hear "memory" used to describe everything from a simple vector database lookup to persistent model fine-tuning to the contents of a context window. The paper's authors explicitly call out the "proliferation of loosely defined memory terminologies" as a barrier to progress.

They're not wrong. Consider how the landscape looks right now: MemGPT, the UC Berkeley research project from late 2023 that drew parallels between LLM context management and operating system virtual memory, spawned a company (now called Letta) and a whole vocabulary around "memory tiers." Mem0, an open-source project that describes itself as a "universal memory layer for AI agents," takes a different approach focused on extracting and storing discrete facts. LangChain's LangMem does something else entirely.

The survey tries to impose structure by separating agent memory from three related concepts: LLM memory (what's encoded in model weights), retrieval-augmented generation (pulling relevant documents at inference time), and context engineering (managing what fits in the context window). These distinctions matter because they imply different technical solutions and different failure modes. Treating a RAG pipeline as "agent memory" leads to different architectural choices than treating parametric knowledge as memory.

Three lenses, not one

Rather than forcing everything into the familiar long-term/short-term dichotomy borrowed from cognitive science, the authors propose examining agent memory through three perspectives simultaneously: forms, functions, and dynamics.

The forms taxonomy identifies three dominant implementations. Token-level memory operates within the context window, essentially treating the prompt as working memory. Parametric memory lives in model weights, modified through fine-tuning or adapter layers. Latent memory sits somewhere between, using learned representations stored externally.

The functional taxonomy is more interesting. The authors distinguish factual memory (what the agent knows about the world), experiential memory (what happened in past interactions), and working memory (what's relevant to the current task). This framing exposes a gap most current systems struggle with: experiential memory that actually influences future behavior rather than just being retrievable.
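The distinction is easier to see in code. Here is a minimal sketch of the three functional roles; the class and field names are illustrative assumptions, not anything defined in the paper:

```python
from dataclasses import dataclass
from enum import Enum


class MemoryKind(Enum):
    FACTUAL = "factual"            # what the agent knows about the world
    EXPERIENTIAL = "experiential"  # what happened in past interactions
    WORKING = "working"            # what's relevant to the current task


@dataclass
class MemoryEntry:
    kind: MemoryKind
    content: str
    session_id: str = ""  # experiential entries are tied to a session


def partition(entries):
    """Group entries by functional role so each can be handled differently
    (e.g., working memory stays in the prompt, facts go to a store)."""
    buckets = {k: [] for k in MemoryKind}
    for entry in entries:
        buckets[entry.kind].append(entry)
    return buckets
```

The hard part, per the survey's framing, isn't storing experiential entries like these; it's getting them to shape future behavior rather than just sit in a retrievable log.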

The dynamics perspective asks how memory forms, evolves, and gets retrieved. This is where things get messy in practice. Most production systems punt on evolution entirely, opting for append-only logs rather than genuine consolidation or forgetting.
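To make that contrast concrete, here is a toy store that consolidates (newer writes overwrite older ones) and forgets (entries expire), as opposed to appending forever. The class and its TTL policy are illustrative assumptions, not a design from the survey:

```python
import time


class ConsolidatingStore:
    """Toy memory store with consolidation and forgetting,
    in contrast to an append-only log."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, last_written_timestamp)

    def write(self, key, value):
        # Overwriting is the consolidation step: newer information
        # replaces older rather than accumulating alongside it.
        self.entries[key] = (value, time.time())

    def forget_stale(self, now=None):
        # Drop anything not refreshed within the TTL window.
        now = time.time() if now is None else now
        self.entries = {
            k: (v, t) for k, (v, t) in self.entries.items()
            if now - t < self.ttl
        }

    def read(self, key):
        item = self.entries.get(key)
        return item[0] if item else None
```

Real consolidation involves summarizing and merging related memories, not just key overwrites, but even this toy version does something most production systems skip.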

What's missing from the map

The survey compiles an inventory of memory benchmarks and open-source frameworks: LoCoMo for testing long-context conversation memory, HotpotQA and NarrativeQA for reading comprehension with memory components, and the General Agentic Memory framework, which claims state-of-the-art results by treating memory construction and retrieval as separate agent tasks.

But the benchmark situation reveals a deeper issue. Most evaluations test whether an agent can recall a specific fact from earlier in a conversation. Fewer test whether memory actually improves task performance over time. Almost none test whether agents can selectively forget outdated information, despite this being essential for any system that runs for more than a few sessions.

The multi-agent memory problem barely exists in the literature. When multiple agents need to share context, current approaches mostly reduce to "give everyone access to the same database." The paper flags this as a research frontier, which is a polite way of saying nobody has solved it.

Context engineering enters the chat

The survey draws a line between agent memory and context engineering, but this boundary is getting blurry. Google's Agent Development Kit documentation from earlier this month describes context as "a compiled view over a richer stateful system," which sounds an awful lot like memory with extra steps.

Andrej Karpathy's framing of context engineering as managing the LLM's "RAM" has caught on. Lance Martin at LangChain broke the problem into four operations: write (save context externally), select (retrieve relevant context), compress (reduce token count), and isolate (sandbox different context types). These operations map reasonably well onto traditional memory system design.
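Those four operations can be sketched in a few lines. The keyword matching and truncation below are deliberately naive stand-ins for the embedding retrieval and summarization that production systems use:

```python
def write(store, key, text):
    """Write: save context externally, outside the context window."""
    store[key] = text


def select(store, query):
    """Select: retrieve entries relevant to a query.
    Keyword overlap here; real systems typically use embeddings."""
    query_words = set(query.lower().split())
    return [v for v in store.values() if query_words & set(v.lower().split())]


def compress(texts, max_words=20):
    """Compress: reduce token count. Naive truncation stands in
    for summarization."""
    words = " ".join(texts).split()
    return " ".join(words[:max_words])


def isolate(store, namespace):
    """Isolate: sandbox context types, here via key-prefix namespaces."""
    return {k: v for k, v in store.items() if k.startswith(namespace + ":")}
```

The mapping to traditional memory system design is direct: write is persistence, select is retrieval, compress is consolidation, and isolate is access control.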

The practical question is whether you treat context management as an infrastructure problem (build the memory system, let the agent use it) or as an agent capability (let the agent manage its own context). MemGPT took the latter approach, giving agents tools to edit their own memory. The Manus team, which has rewritten its agent harness five times in six months according to recent presentations, has been moving toward simpler designs where the model does more context management itself.
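The agent-capability side of that split can be sketched as a set of tools exposed to the model, in the spirit of MemGPT's self-editing memory. The method names and the render step are hypothetical, not MemGPT's actual API:

```python
class AgentMemoryTools:
    """Hypothetical tool surface: the agent, not the infrastructure,
    decides what to store or revise. Names are illustrative only."""

    def __init__(self):
        self._core = {}  # small, always-in-context "core" memory

    def memory_insert(self, key, value):
        """Tool the agent calls to record something new."""
        self._core[key] = value
        return f"stored {key}"

    def memory_replace(self, key, value):
        """Tool the agent calls to revise an existing entry."""
        if key not in self._core:
            return f"no entry for {key}"
        self._core[key] = value
        return f"updated {key}"

    def render(self):
        """Compile core memory into a prompt section each turn."""
        return "\n".join(f"{k}: {v}" for k, v in sorted(self._core.items()))
```

The infrastructure alternative would hide these operations behind automatic extraction and retrieval, which is roughly the bet Mem0 makes; the trade-off is agent autonomy versus predictability.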

Where this goes

The survey identifies several research frontiers: memory automation (letting agents decide what to remember), reinforcement learning integration (learning memory policies from feedback), multimodal memory (remembering images and audio, not just text), and trustworthiness (preventing memory poisoning and ensuring privacy).

That last one matters more than the academic framing suggests. If agents accumulate memories across sessions, those memories become attack surfaces. A carefully crafted earlier conversation could influence behavior in later ones. The paper acknowledges this but doesn't offer solutions beyond "more research needed."

The 47-author collaboration itself signals something about the field. Memory isn't a niche concern anymore. It's becoming the bottleneck as agents move from single-turn assistants to long-running autonomous systems. Whether this particular taxonomy takes hold is less important than the underlying observation: the field needs shared vocabulary before it can make systematic progress.

The paper positions memory as "a first-class primitive in the design of future agentic intelligence." That phrasing, with its implicit argument that memory has been treated as second-class, captures both the opportunity and the current state of play.

Tags: AI agents, LLM memory, context engineering, machine learning research, RAG
Oliver Senti
Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
