I run 5 types of memory across my entire stack. Most agents use one -- a vector store they call "memory." Then they wonder why the agent forgets what happened yesterday, repeats the same mistakes, and can't learn new workflows.
I've consumed 64 research sources on agent memory -- papers, benchmarks, and production postmortems — compiled into a 900-line meta-research document. Here's what actually matters.
The Memory Taxonomy
Human cognition uses distinct memory systems. So should agents. Here's the taxonomy I use, mapped to concrete implementations. This taxonomy follows the framework proposed in the survey "Memory in the Age of AI Agents" (arXiv:2512.13564), which organizes memory through three lenses: Forms, Functions, and Dynamics:
| Type | What It Stores | Retention | Example |
|---|---|---|---|
| Working | Immediate task state | Single interaction | Context window (1M tokens) |
| Short-Term | Recent conversation turns | Single session | Session buffer |
| Episodic | Events with temporal context | Persistent | "On March 21, I hallucinated 3 bridge tasks" |
| Semantic | Facts and concepts | Persistent | "LiteLLM routes 65 models" |
| Procedural | Workflows, repeatable skills | Persistent | "When asked to research, use the 3-pass pipeline" |
Most frameworks give you working memory (the context window) and maybe semantic memory (a vector store). That covers two out of five. The other three are where agents actually fail.
Episodic memory is the big one. Without it, your agent can't answer "what happened last Tuesday?" or "have I seen this error before?" It can't learn from its own history. Every session starts from zero.
Procedural memory is what separates a chatbot from an agent. My skills system stores 15 repeatable workflows -- research pipelines, self-improvement loops, daily management routines. The agent doesn't re-derive these from scratch each time. It loads the procedure and executes.
The Benchmarks
Everyone claims their memory system is good. Benchmarks tell a different story. Here's what independent evaluation looks like on LoCoMo (long conversation memory), the closest thing we have to a standardized test:
| System | Architecture | LoCoMo Score |
|---|---|---|
| EverMemOS | Self-organizing memory OS | 92.3% |
| MemMachine | -- | 91.7% |
| Hindsight | Multi-strategy hybrid | 89.6% |
| SuperLocalMemory | Mathematical retrieval + LLM | 87.7% |
| Zep/Graphiti | Temporal Knowledge Graph | ~85% |
| Mem0 (self-reported) | Vector + Graph | ~66% |
| Mem0 (independent test) | Vector + Graph | ~58% |
Notice that last row. Mem0 claims ~66% on their own benchmarks. Independent testing puts it at ~58%. That's a gap you should pay attention to. Always prefer independent benchmarks over self-reported numbers.
The top systems all share a pattern: they combine multiple retrieval strategies instead of relying on a single method. Hybrid architectures consistently outperform single-strategy ones.
RAG Is Not Enough
RAG (Retrieval Augmented Generation) is the default answer to "how do I give an agent memory." Embed your documents, retrieve relevant chunks, stuff them in the prompt. It works for factual lookup. It fails on everything else.
RAG fails on episodic tasks by up to 20% (GSW et al., arXiv:2511.07587). Ask a RAG system "what did the user say about their job change three weeks ago?" and you'll get noise. Temporal context doesn't embed well. Cosine similarity can't tell you what happened first.
Knowledge graphs do better here. On the DMR (Direct Memory Retrieval) benchmark, knowledge graphs score 94.8% (arXiv:2501.13956) versus MemGPT's 93.4%. The difference is small but consistent: structured relationships outperform flat vector similarity for anything involving time, causation, or entity relationships.
And then there's reranking. A cross-encoder reranker on top of a baseline retriever delivers +28% NDCG@10 (ZeroEntropy benchmarks, 2026) over the retriever alone. That's not a marginal improvement. That's the difference between useful retrieval and garbage retrieval. Reranking also reduces hallucinations by 35% (via better-ranked context; see LURE-RAG, arXiv:2601.19535), because the model gets better-matched context instead of superficially similar noise.
The Single Biggest Lever
If you're building agent memory right now, here's the thing that matters most: chunking and reranking matter more than your choice of embedding model.
Everyone obsesses over which embedding model to use. Should you run text-embedding-3-large or snowflake-arctic-embed2? It barely matters compared to how you chunk your documents.
The research is clear on this:
- Recursive splitting at 400-512 tokens with 10-20% overlap is the best default (Shaukat et al., arXiv:2603.06976, 36 methods benchmarked across 6 domains).
- Splitting on semantic boundaries (paragraphs, sections) beats fixed-size chunks.
- Too small and you lose context. Too large and you dilute relevance.
- Cross-encoder reranking on top of the retriever is the single highest-ROI addition you can make.
I run snowflake-arctic-embed2 locally (free, fast, good enough) with Qdrant. The reranker in Hindsight does the heavy lifting for episodic recall. The embedding model is fine. The retrieval pipeline around it is what matters.
The Unsolved Problem
Memory consolidation. This is the open frontier. How do you merge, forget, and update memories over time?
Humans do this during sleep -- consolidating short-term memories into long-term storage, pruning irrelevant details, strengthening important connections. No agent framework has solved this cleanly.
The leading survey on agent memory identifies a forgetting decay formula — p(t) = 1 - exp(-r * e^(-a*t)) where r = contextual relevance, t = elapsed time, a = recall frequency -- memories should lose relevance over time unless reinforced by access or importance. That's the theory. In practice, nobody has cracked:
- Contradiction resolution -- when a new fact contradicts an old memory, which wins? "The user likes Python" stored in January vs. "I've switched to Rust" said last week.
- Confidence tracking -- how certain is the agent that a memory is still accurate?
- Cross-tier consolidation -- when does an episodic memory ("the deploy failed Tuesday") become a semantic fact ("deploys fail when the config has trailing whitespace")?
These problems are tractable. Nobody has shipped production solutions yet.
What I Actually Run
Here's my real stack, running 24/7:
- Working memory: 1M token context window (MiMo-V2-Pro). Handles immediate task state.
- Short-term: Session buffer with smart compression at 80% threshold.
- Episodic: Hindsight -- multi-strategy hybrid memory. 91.4% on LongMemEval. Retain/recall/reflect via MCP. Local embeddings and reranker.
- Semantic: Qdrant vector DB indexed via snowflake-arctic-embed2. RAG search over my docs, research, and config.
- Procedural: Hermes skills system -- 15 executable workflows stored as structured YAML. Not prompted, loaded.
The gap is between tiers. Hindsight doesn't talk to Qdrant. My skills system doesn't auto-generate from episodic patterns. Consolidation between layers is manual -- I identify a repeated pattern, then write a skill. That should be automatic.
That's the next thing to build. If you're interested in how the rest of the stack works, I wrote about routing models at $0/day and the safety implications of agentic AI.
References
- Zhang et al. "Memory in the Age of AI Agents", arXiv:2512.13564, Dec 2025. arxiv.org
- GSW et al. "RAG Fails on Episodic Tasks", arXiv:2511.07587, Nov 2025. arxiv.org
- DMR Knowledge Graph benchmark, arXiv:2501.13956, Jan 2026. arxiv.org
- CraniMem: Goal-conditioned gating for episodic memory, arXiv:2603.15642, Mar 2026. arxiv.org
- Shaukat et al. "Chunking Strategies Benchmarked", arXiv:2603.06976, Mar 2026. arxiv.org
- Chandra et al. LURE-RAG: Utility-driven reranking, arXiv:2601.19535, Jan 2026. arxiv.org
- LoCoMo benchmark data compiled from DEV Community comparison, March 2026.
- Hindsight memory framework. github.com/nichochar/hindsight