Memory vs plain RAG: a reproducible eval

A memory engine should answer what is true now, not echo back every stale thing it was ever told. We built a transparent, fully seeded scenario in which facts evolve over time, then asked current-truth questions two ways: through RememberOS's graph memory (which supersedes and expires facts) and through a plain RAG store (verbatim, no supersession). Here is what we measured.

Top-1 correct · memory: 52% (vs 24% RAG)
Contradiction · memory: 48% (vs 91% RAG; lower is better)
Recall@5 · memory: 88% (vs 85% RAG)

Results (overall, 4 scenarios · 33 queries)

Metric	Memory mode	RAG baseline
Top-1 correct (top result states the current truth)	51.5%	24.2%
Contradiction rate (a superseded fact appears in top-5) — lower is better	48.5%	90.9%
Recall@5 (current truth appears anywhere in top-5)	87.9%	84.9%

The headline isn't recall: both arms find the right fact about equally often (87.9% vs 84.9% recall@5). The difference is contradiction. The plain RAG store keeps returning stale, superseded values because it has no concept of one fact replacing another. RememberOS's graph supersession marks old values as no longer current, so they stop crowding out the truth. On the same data, that roughly halves the contradiction rate (48.5% vs 90.9%) and roughly doubles top-1 accuracy (51.5% vs 24.2%).
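To make the idea of supersession concrete, here is a toy sketch of the difference between an append-only store and one that marks older values as stale. It is not RememberOS's implementation: the class names, the `(subject, attribute)` key, and the `current` flag are all assumptions made for illustration, and a real engine has to extract those keys from free text first.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    current: bool = True


class PlainStore:
    """Append-only store: every statement stays retrievable forever."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def add(self, subject: str, attribute: str, value: str) -> None:
        self.facts.append(Fact(subject, attribute, value))

    def lookup(self, subject: str, attribute: str) -> list[str]:
        return [f.value for f in self.facts
                if f.subject == subject and f.attribute == attribute]


class SupersedingStore(PlainStore):
    """Hypothetical variant: a new value marks older values for the same key as stale."""

    def add(self, subject: str, attribute: str, value: str) -> None:
        for f in self.facts:
            if f.subject == subject and f.attribute == attribute:
                f.current = False  # superseded, but kept for history
        super().add(subject, attribute, value)

    def lookup(self, subject: str, attribute: str) -> list[str]:
        return [f.value for f in self.facts
                if f.subject == subject and f.attribute == attribute and f.current]


statements = [("Alex", "employer", "Google"),
              ("Alex", "employer", "Stripe"),
              ("Alex", "employer", "Anthropic")]

plain, memory = PlainStore(), SupersedingStore()
for s in statements:
    plain.add(*s)
    memory.add(*s)

print(plain.lookup("Alex", "employer"))   # all three values, stale ones included
print(memory.lookup("Alex", "employer"))  # only the current value
```

In this idealised toy the superseding store never returns a stale value. The measured contradiction rate for memory mode is well above zero, so the real pipeline evidently does not always link a new statement to the fact it replaces.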

Graph extraction uses an LLM, so individual cells move a few points between runs, and the small 4–5-query rows swing on a single answer. These numbers come from the 2026-06-13 re-run, which added the new expiry scenario. The stable result is the contradiction gap: memory roughly halves it versus plain RAG. Re-run it yourself with the command below.

By scenario

Scenario	Top-1 (memory)	Top-1 (RAG)	Contradiction (memory)	Contradiction (RAG)
Career — evolving employer / city / role (43 statements, 15 q)	46.7%	13.3%	40.0%	100.0%
Preferences & multi-hop — flipping preferences + person→project→property (19 statements, 9 q)	44.4%	22.2%	44.4%	66.7%
Temporal / recency — dated restatements; latest is truth (13 statements, 4 q)	50.0%	50.0%	50.0%	100.0%
Expiry — time-bound facts that expire and are replaced (15 statements, 5 q)	80.0%	40.0%	80.0%	100.0%

On the expiry scenario, memory leads on top-1 (80% vs 40%) and on contradiction (80% vs 100%), but the absolute contradiction rate is still high: an expired value still reaches the top-5 on 4 of the 5 queries. Supersession reduces stale recall here but does not eliminate it.
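As a sanity check, the per-scenario rows can be rolled up into the overall figures. The sketch below recovers whole-number query counts from the published percentages (an inference from the table, not values read from evals/results.json) and also shows how far a single answer moves a cell in each scenario.

```python
# (queries, top-1 memory %, top-1 RAG %, contradiction memory %, contradiction RAG %)
# copied from the by-scenario table
rows = {
    "career":      (15, 46.7, 13.3, 40.0, 100.0),
    "preferences": (9, 44.4, 22.2, 44.4, 66.7),
    "temporal":    (4, 50.0, 50.0, 50.0, 100.0),
    "expiry":      (5, 80.0, 40.0, 80.0, 100.0),
}

total_q = sum(r[0] for r in rows.values())
labels = ["top1_mem", "top1_rag", "contra_mem", "contra_rag"]

overall = {}
for i, label in enumerate(labels, start=1):
    # Recover the integer number of queries behind each percentage, then pool.
    hits = sum(round(r[0] * r[i] / 100) for r in rows.values())
    overall[label] = round(100 * hits / total_q, 1)

print(total_q, overall)

for name, r in rows.items():
    # With n queries, flipping one answer shifts a cell by 100 / n points.
    print(f"{name}: one answer moves a cell by {100 / r[0]:.1f} points")
```

The pooled values match the overall table (51.5 / 24.2 for top-1 and 48.5 / 90.9 for contradiction). The last loop shows why the 4-query and 5-query rows are noisy: one answer is worth 25 and 20 points respectively.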

Methodology

Four scenarios, 90 statements, 33 queries. Career: each persona's employer, city, and role evolve (e.g. "Alex works at Google" → "…joined Stripe" → "…now works at Anthropic"), with distractors. Preferences & multi-hop: a person's editor/drink preferences flip over time, plus multi-hop facts where the answer lives one link away (person → project → language). Temporal / recency: the same attribute is restated at different dates ("In January… by April… as of this month…") and only the latest is current. Expiry: explicitly time-bound facts ("acting manager until the reorg", "remote from Bali for the month") that later expire and are replaced — the post-expiry value is the truth. All fully seeded and deterministic — no LLM generates the scenarios, so they're reproducible.
Two arms, identical data: Memory mode sends each statement through /remember (graph extraction supersedes/expires prior facts). RAG baseline stores every statement verbatim with add() — a plain vector store.
Queries: present-truth questions like "Where does X work now?", "What editor does X use now?", plus multi-hop ones like "What language is X's main project written in?" — always asking for what is true now.
Judging is mechanical (substring match against the known current value and the known stale values) — no LLM judge, nothing to fudge. Top-1 = the first result states the current truth; contradiction = any top-5 result states a superseded value; recall@5 = the current truth appears in the top-5.
Run live against the production API using disposable tenants that are created and then erased through the GDPR endpoints.
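The judging rule described above is simple enough to sketch in a few lines. This is an illustrative reimplementation, not the code in evals/run_eval.py: the `Query` structure, the function names, and the case-insensitive comparison are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Query:
    results: list[str]   # ranked retrieval results, best first
    current: str         # known current value
    stale: list[str]     # known superseded values


def mentions(text: str, value: str) -> bool:
    # Assumption: substring match ignoring case.
    return value.lower() in text.lower()


def judge(q: Query, k: int = 5) -> dict[str, bool]:
    top_k = q.results[:k]
    return {
        # Top-1: the first result states the current truth.
        "top1": bool(top_k) and mentions(top_k[0], q.current),
        # Contradiction: any top-k result states a superseded value.
        "contradiction": any(mentions(r, s) for r in top_k for s in q.stale),
        # Recall@k: the current truth appears anywhere in the top k.
        "recall_at_k": any(mentions(r, q.current) for r in top_k),
    }


def rates(queries: list[Query]) -> dict[str, float]:
    verdicts = [judge(q) for q in queries]
    return {m: 100 * sum(v[m] for v in verdicts) / len(verdicts)
            for m in ("top1", "contradiction", "recall_at_k")}


queries = [
    Query(["Alex now works at Anthropic", "Alex works at Google"],
          current="Anthropic", stale=["Google", "Stripe"]),
    Query(["Alex works at Google", "Alex now works at Anthropic"],
          current="Anthropic", stale=["Google", "Stripe"]),
]
print(rates(queries))
```

Note that a single query can score both a top-1 hit and a contradiction, as the first example does: the two metrics are independent, which is why recall can be high in both arms while the contradiction rates differ sharply.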

Honesty note. This is our own scenario, not an academic benchmark. It is not RememberOSEval or LoCoMo — those require licensed datasets and a larger eval budget, and we'll add them later. What this is: a transparent, reproducible test of the one thing a memory engine must do that a vector store doesn't — keep current truth current. The numbers above are whatever the run produced; the source is public so you can check or extend it.

Reproduce it

git clone https://github.com/11data/rememberos && cd rememberos
python -m evals.run_eval --base-url https://rememberos.ai --admin-key mv_your_key
# scenario: evals/scenario.py   runner: evals/run_eval.py   output: evals/results.json

Source: github.com/11data/rememberos/evals. Numbers from a run on 2026-06-13 against the production API.

Intelligent memory: dedup & related

Shared memory drifts: near-duplicates pile up, and recall needs to surface the right neighbours. Four more reproducible evals measure those operations against the live deployment.

Dedup precision: 1.00 (never deleted a distinct memory)
Related precision@3: 1.00 (same-topic neighbours)
Related coverage: 1.00 (every memory got a neighbour)
Hybrid lexical recall@5: 1.00 (keyword-only match still surfaces)
Hybrid MRR: 1.00 (vs text-only 0.50; covers every query class)

Eval	What it measures	Result
Dedup	Seed known duplicate groups + distinct memories; run dedup across a threshold sweep (0.95→0.80) and check it merges duplicates without deleting distinct ones	precision 1.00 at every threshold; recall rises as the threshold loosens (0.95 is conservative, 0.80 collapses most paraphrase groups)
Related	Seed 4 clearly-distinct topical clusters (4 memories each); ask each memory for its top-3 neighbours and check they share its topic	precision@3 1.00, coverage 1.00 — no cross-topic bleed
Hybrid recall	Seed distractors plus one memory carrying a rare token that's semantically dissimilar to the query; query the bare token with a high vector gate so the vector arm is empty, and check the keyword-only memory is still returned	recall@5 1.00, precision 1.00 — invisible to pure vector search (0 hits), surfaced by hybrid fusion
Hybrid ranking	Seed a labeled set with semantic (paraphrase, no keyword overlap), lexical (rare token), and mixed queries; score recall@5 and MRR for vector, text, and hybrid modes	hybrid recall@5 1.00, MRR 1.00 — matches the best single arm on every query while text-only drops to 0.50 (misses the semantic queries) and the gated-lexical case (row above) defeats vector-only
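For readers unfamiliar with the two ranking metrics in the table, here is a minimal sketch of recall@k and MRR (mean reciprocal rank). The memory ids and the two-query runs are invented to show how a keyword-only arm that misses semantic queries can end up at an MRR of 0.50; they are not the eval's actual data.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if relevant & set(ranked[:k]) else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def summarize(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    """Average both metrics over (ranked results, relevant ids) pairs."""
    n = len(runs)
    return {
        "recall@5": sum(recall_at_k(r, rel) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Hypothetical runs: one semantic query (target "m1"), one lexical query (target "m2").
text_only = [(["m9", "m8"], {"m1"}),   # paraphrase with no keyword overlap: missed
             (["m2", "m7"], {"m2"})]   # rare token: found at rank 1
hybrid = [(["m1", "m9"], {"m1"}),
          (["m2", "m7"], {"m2"})]

print("text-only", summarize(text_only))
print("hybrid   ", summarize(hybrid))
```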

These are small, fully seeded evals (a dozen-odd memories), run live against the deployment with real embeddings, deterministic and reproducible. They meet the same honesty bar as the memory-vs-RAG eval above and are not an academic benchmark. Dedup precision stays at 1.00 because the distinct memories are on far-apart topics; recall is a tunable tradeoff set by the threshold.
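The precision/recall tradeoff across a threshold sweep can be illustrated with a toy. The sketch below assumes a greedy cosine-similarity dedup over hand-written 3-d vectors; the real eval uses real embeddings and whatever dedup logic the deployment implements, so both the algorithm and the numbers here are illustrative only.

```python
import math

Vec = tuple[float, float, float]


def cosine(a: Vec, b: Vec) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def dedup(memories: dict[str, Vec], threshold: float) -> set[str]:
    """Greedy pass: drop a memory if it is at least `threshold` similar to a kept one."""
    kept: list[str] = []
    dropped: set[str] = set()
    for mem_id, vec in memories.items():
        if any(cosine(vec, memories[k]) >= threshold for k in kept):
            dropped.add(mem_id)
        else:
            kept.append(mem_id)
    return dropped


# Toy data: a2 is a near-copy of a1, a3 a looser paraphrase; b1 and c1 are distinct topics.
memories: dict[str, Vec] = {
    "a1": (1.0, 0.0, 0.0),
    "a2": (0.99, 0.14, 0.0),   # cosine to a1 is about 0.99
    "a3": (0.88, 0.47, 0.0),   # cosine to a1 is about 0.88
    "b1": (0.0, 1.0, 0.0),
    "c1": (0.0, 0.0, 1.0),
}
true_duplicates = {"a2", "a3"}


def score(threshold: float) -> tuple[float, float]:
    dropped = dedup(memories, threshold)
    hits = len(dropped & true_duplicates)
    precision = hits / len(dropped) if dropped else 1.0
    recall = hits / len(true_duplicates)
    return precision, recall


for t in (0.95, 0.90, 0.85, 0.80):
    p, r = score(t)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

In this toy, precision stays at 1.00 across the sweep because the distinct memories are nearly orthogonal to the duplicate group, while recall rises once the threshold drops below the looser paraphrase's similarity.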

python -m evals.run_dedup_eval         --key mv_your_key
python -m evals.run_related_eval       --key mv_your_key
python -m evals.run_hybrid_recall_eval  --key mv_your_key
python -m evals.run_hybrid_ranking_eval --key mv_your_key
# source: evals/run_dedup_eval.py · evals/run_related_eval.py · evals/run_hybrid_recall_eval.py · evals/run_hybrid_ranking_eval.py · results in evals/RESULTS.md