
Mem0 vs MNEMOS vs Letta: The Memory Layer Your AI Agents Need in 2026

The short version

Mem0 with LlamaIndex, MNEMOS v5.0.0, and Letta Server, compared on recall, freshness, and storage cost, with a protocol for benchmarking all three against a 30-day chat history. Pick the right memory layer for your agent stack.

Published May 15, 2026 by Pondero Editorial

The decision that matters here is not which memory layer scores highest. It is which shape of thing you are buying, because Mem0, MNEMOS, and Letta are not three implementations of the same product, and picking the wrong shape is a six-month rewrite, not a config change. Mem0 is an add-on. MNEMOS is a memory operating system. Letta is an agent runtime that happens to own memory. The only independently measured recall number across them, Mem0's 49% on LongMemEval (per Vectorize.io), points at the real differentiator, which is freshness handling, not raw retrieval. This piece argues that case and gives you a benchmark protocol to settle it on your own workload, because your workload is the only benchmark that should decide a production memory layer.

Anthropic's Dreaming (research preview, May 7, 2026) is what forced the question back open: it curates a Claude-side memory store inside the Managed Agents runtime, but the moment you run your own orchestration or self-host for data residency, you supply the memory layer yourself. MNEMOS hit v5.0.0 GA on May 2, Mem0 has a stable LlamaIndex integration, and Letta (formerly MemGPT) is the open-source incumbent with a full server runtime.

Why May 2026 is the first time this comparison is stable

Three things converged inside two weeks, and before they did, comparing these was a moving-target exercise. Dreaming's preview (May 7) made every production team ask whether they needed the same capability outside Anthropic's runtime. MNEMOS closed its v5.0.0 charter on May 2, stabilizing GRAEAE and KRONOS, the two subsystems that were rough in v4. Mem0's LlamaIndex module is now documented against ReAct and FunctionCalling agents, not just a toy SimpleChatEngine. The APIs are now stable enough that a comparison written today is still true next month.

What each one actually is

These three projects are not the same shape of thing. Treating them as interchangeable "memory libraries" is the mistake that lands you in that rewrite.

Mem0 is a memory layer you attach to an existing LLM application. Its job is narrow: store memories, retrieve them at query time, surface the relevant ones into context. The hosted platform adds graph memory and team-level recall, but the open-source core is a standalone add-on. Think of it as a smarter k=5 semantic search over past conversations.
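To make the "k=5 semantic search" framing concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is not Mem0's implementation: the function names `cosine` and `top_k` are hypothetical, and the three-dimensional vectors are hand-made stand-ins for real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], store: Dict[str, Sequence[float]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k stored memories most similar to the query, best first."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made vectors standing in for real embeddings (assumption for illustration).
store = {
    "Prefers answers under 3 sentences": [0.9, 0.1, 0.0],
    "Account ACC-9921 is active": [0.0, 0.2, 0.9],
    "Works in the PST timezone": [0.1, 0.9, 0.1],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
print(results)
```

Nothing in this ranking knows when a memory was written, which is the gap the freshness discussion later in this piece turns on.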

MNEMOS v5.0.0 is a memory operating system. That phrase sounds like marketing but the architecture backs it. It runs as a FastAPI server (or embedded in-process on the edge profile) and manages the full memory lifecycle: write, embed, search, compress, version, reason-over, audit, federate, export, import. The GRAEAE bus routes across multiple LLM providers with consensus scoring. The MOIRAI compression stack (APOLLO + ARTEMIS) produces auditable transformation receipts, so you can see exactly how a memory was compressed and why. KRONOS gives you recall observability out of the box. The footprint is proportionally larger than Mem0's, and it deserves to be.

Letta (formerly MemGPT) is a full agent runtime, not just a memory layer. Its OS-inspired architecture gives agents a context window that works like managed memory: in-context (hot), archival (warm), and external storage (cold). Agents actively decide what to keep in context versus archive. You can run just the memory subsystem via the Letta Python SDK, but the native shape is a server (Letta Server) with an Agent Development Environment. If your team needs multi-agent orchestration with memory as a first-class concern, Letta is the only one of the three that was designed for that from the start.

Mem0 with LlamaIndex

Install

pip install llama-index-core llama-index-llms-openai llama-index-memory-mem0 python-dotenv

Agent setup

import os
from dotenv import load_dotenv
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.memory.mem0 import Mem0Memory

load_dotenv()

# Point at the Mem0 platform (swap for OSS config if self-hosting)
Settings.llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])

memory = Mem0Memory.from_client(
    context={"user_id": "prod-user-001"},
    api_key=os.environ["MEM0_API_KEY"],
    search_msg_limit=5,  # messages considered for retrieval; default 5
)

ReAct agent with memory

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

def get_account_status(account_id: str) -> str:
    """Return current account status for a given account ID."""
    # your internal lookup here
    return f"Account {account_id}: active, last seen 2026-05-14"

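# Uses the legacy ReActAgent.from_tools constructor; with no llm argument
# the agent falls back to Settings.llm configured above.
agent = ReActAgent.from_tools(
    tools=[FunctionTool.from_defaults(fn=get_account_status)],
    memory=memory,
    verbose=True,
)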

response = agent.chat("What is the status of account ACC-9921?")
print(response)

Query latency

The Mem0 LlamaIndex docs do not publish P50/P95 retrieval numbers. From the Vectorize.io survey (2026-05-15 read), Mem0 scores 49% on LongMemEval, the lowest among dedicated memory systems in the comparison set. That tracks with the narrow design: Mem0 is fast at semantic lookup but does not reason over memories, compress them across time, or version them. If your agent runs one user, one session type, and simple preference recall, that 49% is probably fine. If you're running multi-session, multi-user production workloads where older memories need to surface correctly over a 30-day window, that number matters.

search_msg_limit=5 is the default. Raising it to 10-15 can help longer-context workloads, but expect retrieval latency and prompt size to grow with it, so measure before and after the change.
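Since no published P50/P95 figures exist, a small timing harness is the quickest way to see what a change in retrieval settings costs you. The sketch below uses only the standard library; `measure_latency` and `fake_retrieve` are hypothetical names, and `fake_retrieve` is a placeholder you would swap for your real memory lookup.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(retrieve: Callable[[str], object], queries: List[str], repeats: int = 3) -> Dict[str, float]:
    """Time `retrieve` on each query and return p50/p95 latency in milliseconds."""
    samples_ms: List[float] = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            retrieve(query)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "samples": float(len(samples_ms))}


def fake_retrieve(query: str) -> list:
    """Hypothetical stand-in for a memory lookup; replace with your real call."""
    time.sleep(0.001)
    return [query]


stats = measure_latency(fake_retrieve, ["billing", "timezone", "escalation"], repeats=5)
print(stats)
```

Run it once per setting with the same query list, and compare p95 as well as p50, since tail latency is usually what users notice.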

MNEMOS v5.0.0

Install

# Requires Docker + Compose; PostgreSQL + pgvector for the server profile
git clone https://github.com/mnemos-os/mnemos.git
cd mnemos
docker compose -f deploy/server/docker-compose.yml up -d

Write and query via REST

import httpx

BASE = "http://localhost:8765"  # default MNEMOS server port

# Write a memory
write = httpx.post(f"{BASE}/v1/memories", json={
    "agent_id": "support-agent-v2",
    "user_id":  "prod-user-001",
    "content":  "User prefers responses under 3 sentences. Last escalation: 2026-04-30.",
    "tier":     "working",   # working | archival | episodic
})
write.raise_for_status()  # fail loudly on a 4xx/5xx instead of continuing silently

# Query memories
result = httpx.get(f"{BASE}/v1/memories/search", params={
    "agent_id": "support-agent-v2",
    "user_id":  "prod-user-001",
    "query":    "user communication preferences",
    "top_k":    5,
})
result.raise_for_status()
print(result.json())

GRAEAE reasoning over memories

MNEMOS v5.0.0 ships the GRAEAE bus as stable. You can send a memory corpus through multi-provider consensus scoring before surfacing it to your agent:

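# Multi-provider calls can exceed httpx's 5-second default timeout, so set one explicitly.
reasoned = httpx.post(f"{BASE}/v1/graeae/reason", json={
    "agent_id":   "support-agent-v2",
    "user_id":    "prod-user-001",
    "query":      "How should I handle this user's escalation?",
    "providers":  ["openai/gpt-4o", "anthropic/claude-3-7-sonnet-20250219"],
    "consensus":  "majority",
}, timeout=60.0)
reasoned.raise_for_status()
print(reasoned.json())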

Query latency

MNEMOS v5.0.0 does not publish P50/P95 numbers in its current docs, so treat any latency figure as unverified. The server profile runs PostgreSQL + pgvector, whose performance on simple vector queries depends on your index and table size, with GRAEAE consensus adding a round-trip per provider you configure. The edge profile (SQLite + sqlite-vec, in-process) trades storage efficiency for lower latency on single-agent deployments. Benchmark it against your own data rather than relying on a published number that does not exist.
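The "round-trip per provider" cost is easy to reason about with a back-of-envelope model. The sketch below is an assumption-laden estimate, not a description of how GRAEAE schedules its calls: the source does not say whether providers are called one after another or concurrently, so the hypothetical `consensus_latency_ms` function reports both cases. The input numbers are placeholders, not measurements.

```python
from typing import Dict, List


def consensus_latency_ms(vector_search_ms: float, provider_rtts_ms: List[float]) -> Dict[str, float]:
    """Estimate end-to-end latency for retrieval plus multi-provider consensus."""
    # Sequential calls: every provider round-trip adds to the total.
    sequential = vector_search_ms + sum(provider_rtts_ms)
    # Concurrent fan-out: the slowest provider sets the floor.
    parallel = vector_search_ms + (max(provider_rtts_ms) if provider_rtts_ms else 0.0)
    return {"sequential_ms": sequential, "parallel_ms": parallel}


# Placeholder numbers, not measurements of MNEMOS or any provider.
budget = consensus_latency_ms(vector_search_ms=20.0, provider_rtts_ms=[800.0, 1200.0])
print(budget)
```

In either case the provider calls dominate the vector search, so the number of providers you configure is likely the first latency knob to look at.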

KRONOS, the recall observability subsystem that shipped stable in v5.0.0, gives you per-query latency traces you can pipe to your existing observability stack. That matters in production. You are not flying blind on retrieval performance.

Letta (Server + ADE)

Install

pip install letta
letta server  # starts the server on :8283 by default

Create an agent with memory

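from letta import create_client
from letta.schemas.memory import ChatMemory

# Legacy `letta` SDK: passing base_url targets the running server. With no
# arguments, create_client() returns an in-process local client instead.
client = create_client(base_url="http://localhost:8283")

agent_state = client.create_agent(
    name="support-agent-prod",
    memory=ChatMemory(
        human="User: Alex. Timezone: PST. Prefers concise answers.",
        persona="You are a support agent. You have access to archival memory.",
    ),
)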

print(f"Agent ID: {agent_state.id}")

Send a message and inspect memory

response = client.send_message(
    agent_id=agent_state.id,
    role="user",
    message="What did we discuss about my account last month?",
)

for msg in response.messages:
    print(msg.role, ":", msg.text)

# Inspect in-context memory block
state = client.get_agent(agent_id=agent_state.id)
print(state.memory.get_block("human").value)

Archival search (the long-tail recall path)

Letta agents route long-tail recall to archival storage automatically. You can also call it directly:

results = client.get_archival_memory(
    agent_id=agent_state.id,
    query="account billing dispute April 2026",
    limit=5,
)
for r in results:
    print(r.text, r.score)

Query latency

Letta's hot path (in-context memory block) has zero retrieval latency. It's already in the prompt. The archival path runs a vector search, comparable to Mem0's latency profile for simple queries. The overhead to watch is agent-side reasoning: the agent decides what to archive and when, which adds LLM inference cost per turn. That pays off in multi-agent setups; for single-agent chatbots it can be heavier than you need.
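The hot-path and archival-path split can be sketched as a toy two-tier store. This is an illustration of the idea only, not Letta's implementation: the `TieredMemory` class is hypothetical, eviction here is plain first-in-first-out where Letta lets the agent decide, and keyword overlap stands in for a vector search.

```python
from collections import deque
from typing import Deque, List


class TieredMemory:
    """Toy two-tier store: a bounded in-context buffer plus an unbounded archive."""

    def __init__(self, max_in_context: int) -> None:
        self.in_context: Deque[str] = deque()
        self.archive: List[str] = []
        self.max_in_context = max_in_context

    def remember(self, fact: str) -> None:
        """Add a fact to the hot tier, evicting the oldest facts to the archive."""
        self.in_context.append(fact)
        while len(self.in_context) > self.max_in_context:
            self.archive.append(self.in_context.popleft())

    def prompt_block(self) -> str:
        """Hot path: no lookup, the facts are rendered straight into the prompt."""
        return "\n".join(self.in_context)

    def search_archive(self, query: str, limit: int = 5) -> List[str]:
        """Archival path: naive keyword overlap standing in for a vector search."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(text.lower().split())), text) for text in self.archive]
        hits = [(score, text) for score, text in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in hits[:limit]]


memory = TieredMemory(max_in_context=2)
for fact in ["billing dispute opened in April", "prefers concise answers", "timezone is PST"]:
    memory.remember(fact)

print(memory.prompt_block())
print(memory.search_archive("april billing"))
```

Replacing the FIFO rule in `remember` with a model call is where the per-turn inference cost described above comes from.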

How these compare, and how to benchmark them yourself

One published number exists across these systems: an independent Vectorize.io evaluation measured Mem0 at 49% on LongMemEval with standard retrieval, the lowest among dedicated memory systems in that comparison set (note Mem0's own published figure is far higher, which is exactly why methodology matters). MNEMOS and Letta have no published LongMemEval figures as of this writing, so the table below is design-level analysis on every row except that one, and it says so explicitly. Anyone selling you a single "winner" number across all three is inventing it.

| Dimension | Mem0 (platform) | MNEMOS v5.0.0 | Letta Server |
| --- | --- | --- | --- |
| Published recall benchmark | 49% LongMemEval (per Vectorize.io) | None published | None published |
| Memory freshness handling | None built-in; recency is retrieval-rank | KRONOS tracks per-memory age; MOIRAI compresses stale entries | Agents actively archive; recency via agent reasoning |
| Storage shape over time | Raw vector store; no compression | Compressed via MOIRAI; auditable receipts | Archival store; fixed-size in-context block |
| Observability | Basic platform dashboard | KRONOS per-query traces | Letta ADE, per-agent inspect |
| Self-host option | OSS core (Qdrant + OpenAI config) | Full (server or edge profile) | Full (Letta Server) |
| Multi-tenant | Platform-managed | Multi-agent, multi-user via REST | Multi-agent native |

Freshness handling is the row that should decide most evaluations, and it is the one no benchmark captures well. Mem0 does not compress or version memories, so a stale preference from day 1 competes on equal footing with a correction from day 29 unless you build TTL logic yourself. MNEMOS's MOIRAI stack does that automatically and emits an auditable receipt for each compression. Letta pushes the decision into agent reasoning, which is flexible but adds an LLM call per turn. That is the mechanism the LongMemEval number gestures at: 49% is not "Mem0 is slow," it is "Mem0 has no notion of a memory getting old."
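If you take the "build TTL logic yourself" route, one common approach is to rerank retrieved hits by similarity multiplied by an exponential recency decay. The sketch below assumes your store returns a similarity score and a write timestamp for each hit; the `Hit` class, the `rerank_by_freshness` function, and the 7-day half-life are all illustrative choices, not part of any of the three systems.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    text: str
    similarity: float  # retrieval score in [0, 1] from your vector store
    age_days: float    # time since the memory was written


def rerank_by_freshness(hits: List[Hit], half_life_days: float = 7.0) -> List[Hit]:
    """Order hits by similarity multiplied by an exponential recency decay."""
    def score(hit: Hit) -> float:
        # The weight halves every `half_life_days` days of age.
        decay = 0.5 ** (hit.age_days / half_life_days)
        return hit.similarity * decay
    return sorted(hits, key=score, reverse=True)


hits = [
    Hit("User wants long, detailed answers.", similarity=0.82, age_days=29),
    Hit("Correction: user now wants answers under 3 sentences.", similarity=0.80, age_days=1),
]
ranked = rerank_by_freshness(hits)
print(ranked[0].text)
```

On similarity alone the stale day-1 preference would win by a narrow margin; with the decay applied, the day-29 correction ranks first. The half-life is a tuning parameter: too short and durable facts such as a timezone fade, too long and contradictions linger.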

Because the published evidence is this thin, the right move is a fixed protocol on your own data, not a borrowed score. A defensible one: assemble a 30-day corpus from your real session logs (a few hundred to a thousand messages, multiple user personas with overlapping topics), then measure three things on each system:

  1. Recall@k on questions where an older message was later contradicted by a newer one, counting a hit only when the newer message is retrieved. This is the freshness test, and it is where the systems separate.
  2. p50 retrieval latency at your real top_k.
  3. Storage growth over the full corpus.

KRONOS and Letta ADE both expose per-query traces that make this measurable without external tooling; Mem0's platform dashboard is coarser, so instrument the client side there. Run it on your workload before you commit, because a memory layer that wins on someone else's corpus can still lose on the contradiction pattern your domain actually produces.
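The freshness test in this protocol can be scored with a few lines of Python. The sketch below assumes you can wrap each system in a function that takes a question and k and returns ranked memory IDs; `ContradictionCase`, `freshness_recall`, and the canned `fake_retrieve` are hypothetical names used for illustration, and the IDs are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ContradictionCase:
    question: str
    fresh_id: str  # ID of the newer message holding the correct answer
    stale_id: str  # ID of the older message it contradicts


def freshness_recall(retrieve: Callable[[str, int], List[str]], cases: List[ContradictionCase], k: int = 5) -> Dict[str, float]:
    """Report how often the fresh memory is in the top k, and how often it outranks the stale one."""
    if not cases:
        raise ValueError("need at least one contradiction case")
    found = 0
    fresh_first = 0
    for case in cases:
        ids = retrieve(case.question, k)[:k]
        if case.fresh_id in ids:
            found += 1
            if case.stale_id not in ids or ids.index(case.fresh_id) < ids.index(case.stale_id):
                fresh_first += 1
    return {"recall_at_k": found / len(cases), "fresh_ranked_first": fresh_first / len(cases)}


def fake_retrieve(question: str, k: int) -> List[str]:
    """Hypothetical retriever with canned rankings; replace with a real client."""
    canned = {
        "What answer length does the user want?": ["m-001", "m-029"],  # stale ranked first
        "Which plan is the user on?": ["m-030", "m-004"],              # fresh ranked first
    }
    return canned.get(question, [])[:k]


cases = [
    ContradictionCase("What answer length does the user want?", fresh_id="m-029", stale_id="m-001"),
    ContradictionCase("Which plan is the user on?", fresh_id="m-030", stale_id="m-004"),
]
report = freshness_recall(fake_retrieve, cases, k=5)
print(report)
```

Tracking the second metric separately matters: a system can retrieve the fresh memory every time and still hand the agent the stale one first.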

The buyer matrix

Hosted vs. self-host

Mem0's hosted platform is the fastest path to memory-in-production. API key, pip install, done. The cost is that your memory store lives on their infrastructure and the graph-memory feature (which closes most of that LongMemEval gap) is gated at $249/month. If data residency or export control is a requirement, the OSS core is the path, but you're responsible for the Qdrant backend, OpenAI embedding costs, and retrieval tuning.

MNEMOS is self-host-first. The server profile runs on any machine with Docker and PostgreSQL. The edge profile drops to SQLite + sqlite-vec for laptop or Pi-class deployments. There is no hosted MNEMOS platform. You own the infra. That is a feature for regulated industries and teams that need full data control; it's an ops burden for everyone else.

Letta Server is also self-hosted. The Letta ADE (Agent Development Environment) is a local UI for inspecting and editing agent memory live, which is useful during development. The trade is the same as MNEMOS: you run and scale it yourself.

Single-tenant vs. multi-tenant

| Use case | Best fit |
| --- | --- |
| Single user, single agent, low volume | Mem0 OSS or Letta Server (light) |
| Multi-user, single tenant, moderate volume | Mem0 platform or MNEMOS server |
| Multi-agent, multi-user, production scale | MNEMOS server or Letta Server |
| Regulated industry, full data control | MNEMOS server (audit receipts via MOIRAI) |
| Agent reasoning over memory required | Letta or MNEMOS (GRAEAE bus) |

The rough cost model

Mem0 platform: the free tier is limited, and graph memory costs $249/month. OSS: you pay your own Qdrant and embedding costs.

MNEMOS: compute for FastAPI + PostgreSQL. A $20/month DigitalOcean droplet handles low-volume workloads; scale from there based on pgvector index size and GRAEAE provider calls.

Letta Server: Python process + vector backend, similar footprint to MNEMOS without the Compose stack. ADE is local only as of May 2026.

Pairing any of these with Anthropic Dreaming

Dreaming (research preview, May 7, 2026) is a scheduled background process inside Anthropic's Managed Agents runtime. It reviews past sessions, extracts patterns, and curates the Claude-side memory store, either automatically or staged for manual review in regulated environments.

If you're running Claude inside Managed Agents and want to layer in one of these three systems, the pattern is:

  1. Dreaming maintains the Claude-side memory store (extraction + curation).
  2. Your external memory system (Mem0, MNEMOS, or Letta) stores domain-specific, long-tail, or cross-agent memories that Dreaming doesn't own.
  3. At retrieval time, your agent merges both: Dreaming-curated context from the Managed Agents runtime plus a top_k retrieval hit from your external store.

Dreaming handles Claude-native memory management. You still need one of these three systems for anything outside the Managed Agents perimeter: self-hosted Claude wrappers, non-Claude models on the same agent loop, cross-agent memory, and data that must stay on-prem. The two layers are not mutually exclusive and they do not overlap in scope. Dreaming owns the Claude-side store inside the runtime; an external memory system owns everything that store cannot see.
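The merge step at retrieval time can be sketched as a small deduplicating function. How Dreaming exposes its curated store is not specified here, so the sketch assumes both sources arrive as plain lists of strings; `merge_memory_context` and its `max_items` cap are hypothetical, and giving curated entries priority is a design choice rather than a requirement.

```python
from typing import List


def merge_memory_context(curated: List[str], external_hits: List[str], max_items: int = 8) -> List[str]:
    """Merge two memory sources, dropping duplicates and capping the total."""
    merged: List[str] = []
    seen = set()
    for text in curated + external_hits:  # curated entries keep priority
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key and key not in seen:
            seen.add(key)
            merged.append(text)
    return merged[:max_items]


curated = ["User prefers concise answers.", "User is in the PST timezone."]
external = ["user prefers concise answers.", "Billing dispute opened 2026-04-30."]
context = merge_memory_context(curated, external, max_items=3)
print(context)
```

Exact-match deduplication only catches verbatim repeats. If the two stores paraphrase the same fact, you would need a similarity check instead, and a rule for which source wins when they disagree.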

Which one to pick

Pick Mem0 if you need memory in production this week and your recall pattern is simple: one user, one session type, preference lookup. The LlamaIndex integration is stable and the hosted platform owns the ops. The call flips the moment your agents have to surface a month-old memory correctly against a newer contradiction, because the 49% LongMemEval figure is exactly that failure, and the graph-memory feature that closes most of the gap is gated at $249/month.

Pick MNEMOS v5.0.0 if you need a memory system, not a memory library, specifically if compression with an audit receipt or multi-provider reasoning is a requirement rather than a nice-to-have. KRONOS, MOIRAI, and GRAEAE have no equivalent in the other two. The flip condition is ops capacity: you are running a FastAPI server on PostgreSQL with no hosted escape hatch, so if you cannot own that infrastructure, this is the wrong pick regardless of features.

Pick Letta if your architecture is multi-agent first and you want memory and orchestration in one runtime. The in-context/archival split is the cleanest design here for long-running agents and the ADE makes memory inspection practical in development. The cost is an LLM call per turn for the archive decision; profile that against your turn volume first, because for a single-agent chatbot it is overhead you do not need and Mem0 wins on simplicity.

None of these replace Dreaming inside the Managed Agents runtime. They cover the ground Dreaming cannot reach.