Memory vs plain RAG: a reproducible eval

A memory engine should answer what is true now — not echo back every stale thing it was ever told. We built a transparent, fully-seeded scenario where facts evolve over time, then asked current-truth questions two ways: through RememberOS's graph memory (which supersedes and expires facts) and through a plain RAG store (verbatim, no supersession). Here is what we measured.

Top-1 correct · memory: 52% (vs 24% RAG)
Contradiction · memory: 48% (vs 91% RAG; lower is better)
Recall@5 · memory: 88% (vs 85% RAG)

Results (overall, 4 scenarios · 33 queries)

Metric | Memory mode | RAG baseline
Top-1 correct (top result states the current truth) | 51.5% | 24.2%
Contradiction rate (a superseded fact appears in top-5); lower is better | 48.5% | 90.9%
Recall@5 (current truth appears anywhere in top-5) | 87.9% | 84.9%
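
The three metrics in the table can be computed from ranked results roughly as follows. This is a minimal sketch, not the code in evals/run_eval.py: the `Hit` shape, its `is_current` / `is_superseded` labels, and the `score` function are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    is_current: bool     # hypothetical label: the hit states the current truth
    is_superseded: bool  # hypothetical label: the hit states a replaced value


def score(queries: list[list[Hit]], k: int = 5) -> dict[str, float]:
    """Score ranked hit lists (best first), one list per query."""
    n = len(queries)
    # Top-1: the first result states the current truth.
    top1 = sum(1 for hits in queries if hits and hits[0].is_current)
    # Contradiction: any superseded value appears in the top k.
    contradiction = sum(
        1 for hits in queries if any(h.is_superseded for h in hits[:k])
    )
    # Recall@k: the current truth appears anywhere in the top k.
    recall = sum(1 for hits in queries if any(h.is_current for h in hits[:k]))
    return {
        "top1": top1 / n,
        "contradiction_rate": contradiction / n,
        "recall_at_k": recall / n,
    }


# Two toy queries (invented data, not taken from the eval).
toy_run = [
    [Hit("works at Acme", True, False), Hit("lives in Oslo", False, False)],
    [Hit("worked at Initech", False, True), Hit("works at Acme", True, False)],
]
print(score(toy_run))
```

In the toy run the second query finds the current truth within the top 5 but ranks a stale value first, so it counts toward recall and toward contradiction while missing top-1. That is the pattern that separates the two arms in the table.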

The headline isn't recall: both arms find the right fact about equally often (87.9% vs 84.9% Recall@5). The difference is contradiction. The plain RAG store keeps returning stale, superseded values because it has no concept of one fact replacing another. RememberOS's graph supersession marks old values as no longer current, so they stop crowding out the truth, roughly halving the contradiction rate (48.5% vs 90.9%) and roughly doubling top-1 accuracy (51.5% vs 24.2%) on the same data.
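
To make the distinction concrete, here is a toy sketch of supersession next to an append-only store. It is not how RememberOS implements it (the real pipeline uses LLM graph extraction, which this skips entirely); the class names and the (subject, attribute) key are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    current: bool = True


@dataclass
class SupersedingStore:
    """Toy store: a new value for (subject, attribute) retires the old one."""
    facts: list[Fact] = field(default_factory=list)

    def add(self, subject: str, attribute: str, value: str) -> None:
        for f in self.facts:
            if f.subject == subject and f.attribute == attribute and f.current:
                f.current = False  # superseded: kept for history, hidden from lookup
        self.facts.append(Fact(subject, attribute, value))

    def lookup(self, subject: str, attribute: str) -> list[str]:
        return [
            f.value for f in self.facts
            if f.subject == subject and f.attribute == attribute and f.current
        ]


@dataclass
class AppendOnlyStore:
    """Toy stand-in for a plain RAG store: every statement stays retrievable."""
    facts: list[Fact] = field(default_factory=list)

    def add(self, subject: str, attribute: str, value: str) -> None:
        self.facts.append(Fact(subject, attribute, value))

    def lookup(self, subject: str, attribute: str) -> list[str]:
        return [
            f.value for f in self.facts
            if f.subject == subject and f.attribute == attribute
        ]


memory, rag = SupersedingStore(), AppendOnlyStore()
for store in (memory, rag):
    store.add("alice", "employer", "Acme")
    store.add("alice", "employer", "Globex")  # Alice changes jobs

print(memory.lookup("alice", "employer"))  # ['Globex']
print(rag.lookup("alice", "employer"))     # ['Acme', 'Globex']
```

The append-only store returns both employers, so a ranker has to guess which one is current; the superseding store has already answered that question at write time.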

Graph extraction uses an LLM, so individual cells move a few points between runs, and the small 4- and 5-query rows swing by 20 to 25 points on a single answer. These numbers come from the 2026-06-13 re-run, which added the new expiry scenario. The stable result is the contradiction gap: memory roughly halves it versus plain RAG. Re-run it yourself with the command below.

By scenario

Scenario | Top-1 (memory) | Top-1 (RAG) | Contradiction (memory) | Contradiction (RAG)
Career: evolving employer / city / role (43 statements, 15 q) | 46.7% | 13.3% | 40.0% | 100.0%
Preferences & multi-hop: flipping preferences + person→project→property (19 statements, 9 q) | 44.4% | 22.2% | 44.4% | 66.7%
Temporal / recency: dated restatements; latest is truth (13 statements, 4 q) | 50.0% | 50.0% | 50.0% | 100.0%
Expiry: time-bound facts that expire and are replaced (15 statements, 5 q) | 80.0% | 40.0% | 80.0% | 100.0%
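
The overall figures are the per-scenario rates pooled by query count. The sketch below checks that arithmetic. It recovers whole-query counts from the rounded percentages in the table, which is an inference: the page publishes percentages, not counts. The dictionary layout and the `overall` helper are illustrative.

```python
# (queries, top-1 memory, top-1 RAG, contradiction memory, contradiction RAG),
# percentages copied from the per-scenario table.
scenarios = {
    "career": (15, 46.7, 13.3, 40.0, 100.0),
    "preferences": (9, 44.4, 22.2, 44.4, 66.7),
    "temporal": (4, 50.0, 50.0, 50.0, 100.0),
    "expiry": (5, 80.0, 40.0, 80.0, 100.0),
}


def overall(column: int) -> float:
    """Pool one percentage column across scenarios, weighted by query count."""
    total_queries = sum(row[0] for row in scenarios.values())
    # Recover integer query counts from rounded percentages (inferred, not published).
    hits = sum(round(row[0] * row[column] / 100) for row in scenarios.values())
    return round(100 * hits / total_queries, 1)


print("top-1 memory:", overall(1))          # 51.5
print("top-1 RAG:", overall(2))             # 24.2
print("contradiction memory:", overall(3))  # 48.5
print("contradiction RAG:", overall(4))     # 90.9
```

Because the Career scenario holds 15 of the 33 queries, it carries nearly half the weight of every overall number.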

On the expiry scenario, memory leads on top-1 (80% vs 40%) and on contradiction (80% vs 100%), but the absolute contradiction rate is still high: an expired value still reaches the top-5 on four of the five queries. Supersession reduces stale recall here; it does not eliminate it.
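
For readers unfamiliar with time-bound facts, the idea can be sketched as a validity window checked at query time. This is a hedged illustration only: the `TimedFact` fields and the half-open window convention are assumptions, and nothing here reflects how RememberOS stores or filters expiry internally.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TimedFact:
    value: str
    valid_from: date
    valid_until: Optional[date] = None  # None means the fact is open-ended


def current_values(facts: list[TimedFact], today: date) -> list[str]:
    """Return values whose window [valid_from, valid_until) contains today."""
    return [
        f.value
        for f in facts
        if f.valid_from <= today
        and (f.valid_until is None or today < f.valid_until)
    ]


# Invented example data: a time-bound status that expires and is replaced.
facts = [
    TimedFact("on parental leave", date(2026, 1, 1), date(2026, 4, 1)),
    TimedFact("back full-time", date(2026, 4, 1)),
]

print(current_values(facts, date(2026, 2, 1)))   # ['on parental leave']
print(current_values(facts, date(2026, 6, 13)))  # ['back full-time']
```

A filter like this only helps if the expired fact was extracted with its end date in the first place. A fact whose window was never captured cannot be filtered, which is one plausible way an expired value can still surface.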

Methodology

Honesty note. This is our own scenario, not an academic benchmark. It is not LongMemEval or LoCoMo; those require licensed datasets and a larger eval budget, and we'll add them later. What this is: a transparent, reproducible test of the one thing a memory engine must do that a vector store doesn't, namely keep current truth current. The numbers above are whatever the run produced; the source is public so you can check or extend it.

Reproduce it

git clone https://github.com/11data/longmem && cd longmem
python -m evals.run_eval --base-url https://rememberos.ai --admin-key mv_your_key
# scenario: evals/scenario.py   runner: evals/run_eval.py   output: evals/results.json

Source: github.com/11data/longmem/evals. Numbers from a run on 2026-06-13 against the production API.

Intelligent memory: dedup & related

Shared memory drifts: near-duplicates pile up, and recall needs to surface the right neighbours. Four more reproducible evals measure dedup, related-memory lookup, and hybrid recall and ranking against the live deployment.

Dedup precision: 1.00 (never deleted a distinct memory)
Related precision@3: 1.00 (same-topic neighbours)
Related coverage: 1.00 (every memory got a neighbour)
Hybrid lexical recall@5: 1.00 (keyword-only match still surfaces)
Hybrid MRR: 1.00 (vs text-only 0.50; covers every query class)

Eval | What it measures | Result
Dedup | Seed known duplicate groups + distinct memories; run dedup across a threshold sweep (0.95→0.80) and check it merges duplicates without deleting distinct ones | precision 1.00 at every threshold; recall rises as the threshold loosens (0.95 is conservative, 0.80 collapses most paraphrase groups)
Related | Seed 4 clearly-distinct topical clusters (4 memories each); ask each memory for its top-3 neighbours and check they share its topic | precision@3 1.00, coverage 1.00; no cross-topic bleed
Hybrid recall | Seed distractors plus one memory carrying a rare token that's semantically dissimilar to the query; query the bare token with a high vector gate so the vector arm is empty, and check the keyword-only memory is still returned | recall@5 1.00, precision 1.00; invisible to pure vector search (0 hits), surfaced by hybrid fusion
Hybrid ranking | Seed a labeled set with semantic (paraphrase, no keyword overlap), lexical (rare token), and mixed queries; score recall@5 and MRR for vector, text, and hybrid modes | hybrid recall@5 1.00, MRR 1.00; matches the best single arm on every query, while text-only drops to 0.50 (misses the semantic queries) and the gated-lexical case (row above) defeats vector-only
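
MRR in the hybrid rows is the mean of 1/rank of the correct memory across queries. The sketch below computes it for three retrieval modes over two toy queries. The page does not say which fusion method RememberOS uses, so reciprocal rank fusion (RRF) is swapped in here as an assumption; the memory ids and arm outputs are invented to mirror the pattern described in the table, not taken from the eval.

```python
def reciprocal_rank(ranked: list[str], relevant: str) -> float:
    """1/rank of the relevant id (ranks start at 1), or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id == relevant:
            return 1.0 / rank
    return 0.0


def rrf(vector_arm: list[str], text_arm: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: sum 1/(k + rank) over the arms a doc appears in."""
    scores: dict[str, float] = {}
    for arm in (vector_arm, text_arm):
        for rank, doc_id in enumerate(arm, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties break on id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def mrr(runs: list[tuple[list[str], str]]) -> float:
    """Mean reciprocal rank over (ranked ids, relevant id) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Toy queries: (vector arm, text arm, relevant id).
queries = [
    (["m1", "m2"], [], "m1"),  # semantic paraphrase: no keyword overlap
    ([], ["m7"], "m7"),        # rare token: vector arm gated to empty
]

vector_only = mrr([(vec, rel) for vec, _, rel in queries])
text_only = mrr([(txt, rel) for _, txt, rel in queries])
hybrid = mrr([(rrf(vec, txt), rel) for vec, txt, rel in queries])
print(vector_only, text_only, hybrid)  # 0.5 0.5 1.0
```

Each single arm misses one query class outright, while the fused list inherits whichever arm found the answer.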

These are small, fully-seeded evals (a dozen-odd memories), run live against the deployment with real embeddings. They are deterministic and reproducible, and held to the same honesty bar as the memory-vs-RAG eval above: they are not an academic benchmark. Dedup precision stays at 1.00 because the distinct memories are on far-apart topics; recall is a tunable tradeoff set by the threshold.
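
The precision/recall tradeoff across the threshold sweep can be sketched with pairwise cosine similarity. Everything below is illustrative: the 2-d vectors stand in for real embeddings, the intermediate thresholds (0.90, 0.85) are assumed steps between the stated endpoints, and scoring by pairs is a simplification that may differ from evals/run_dedup_eval.py.

```python
import math
from itertools import combinations


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# id -> (duplicate-group label, toy embedding). Invented data.
memories = {
    "a1": ("a", (1.0, 0.0)),
    "a2": ("a", (0.99, 0.14)),  # near-verbatim duplicate of a1
    "a3": ("a", (0.85, 0.53)),  # looser paraphrase of a1
    "b1": ("b", (0.0, 1.0)),    # distinct topic, far from group a
}


def sweep(threshold: float) -> tuple[float, float]:
    """Pairwise precision and recall of 'similarity >= threshold' as a dup test."""
    flagged = true_dups = correct = 0
    for (_, (g1, v1)), (_, (g2, v2)) in combinations(memories.items(), 2):
        is_dup = g1 == g2
        is_flagged = cosine(v1, v2) >= threshold
        true_dups += is_dup
        flagged += is_flagged
        correct += is_dup and is_flagged
    # By convention, flagging nothing counts as perfect precision.
    precision = correct / flagged if flagged else 1.0
    recall = correct / true_dups if true_dups else 1.0
    return precision, recall


for t in (0.95, 0.90, 0.85, 0.80):
    p, r = sweep(t)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

With the distinct memory far from the duplicate group, precision holds at 1.00 throughout and only recall moves, which is the shape the Dedup row reports.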

python -m evals.run_dedup_eval         --key mv_your_key
python -m evals.run_related_eval       --key mv_your_key
python -m evals.run_hybrid_recall_eval  --key mv_your_key
python -m evals.run_hybrid_ranking_eval --key mv_your_key
# source: evals/run_dedup_eval.py · evals/run_related_eval.py · evals/run_hybrid_recall_eval.py · evals/run_hybrid_ranking_eval.py · results in evals/RESULTS.md