A memory engine should answer what is true now — not echo back every stale thing it was ever told. We built a transparent, fully-seeded scenario where facts evolve over time, then asked current-truth questions two ways: through RememberOS's graph memory (which supersedes and expires facts) and through a plain RAG store (verbatim, no supersession). Here is what we measured.
| Metric | Memory mode | RAG baseline |
|---|---|---|
| Top-1 correct (top result states the current truth) | 51.5% | 24.2% |
| Contradiction rate (a superseded fact appears in top-5) — lower is better | 48.5% | 90.9% |
| Recall@5 (current truth appears anywhere in top-5) | 87.9% | 84.9% |
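
The three metrics in the table can be computed per query and then averaged. The sketch below is one way to do that, not the repository's actual scorer: the `QueryResult` type, the exact-string matching, and the toy facts are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    ranked: list[str]      # retrieved facts, best first (hypothetical shape)
    current: str           # the fact that is true now
    superseded: set[str]   # older values that have since been replaced


def score(results: list[QueryResult], k: int = 5) -> dict[str, float]:
    """Average top-1, contradiction rate and recall@k over all queries."""
    n = len(results)
    # Top-1 correct: the first result is the current truth.
    top1 = sum(1 for r in results if r.ranked[:1] == [r.current])
    # Contradiction: any superseded fact shows up in the top-k.
    contradiction = sum(
        1 for r in results if any(x in r.superseded for x in r.ranked[:k])
    )
    # Recall@k: the current truth appears anywhere in the top-k.
    recall = sum(1 for r in results if r.current in r.ranked[:k])
    return {
        "top1": top1 / n,
        "contradiction_rate": contradiction / n,
        "recall_at_k": recall / n,
    }


# Toy data, invented for this example.
results = [
    QueryResult(["works at Acme", "works at Initech"], "works at Acme", {"works at Initech"}),
    QueryResult(["lives in Paris", "lives in Oslo"], "lives in Oslo", {"lives in Paris"}),
    QueryResult(["role: staff engineer"], "role: staff engineer", {"role: engineer"}),
    QueryResult(["likes tea"], "likes coffee", {"likes tea"}),
]
metrics = score(results)
print(metrics)
```

Note that a query can count toward both recall and contradiction at once: the second toy query retrieves the current city but ranks the stale one above it.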
The headline isn't recall: both arms find the right fact about equally often. The difference is contradiction. The plain RAG store keeps returning stale, superseded values because it has no concept of one fact replacing another. RememberOS's graph supersession marks old values as no longer current, so they stop crowding out the truth, roughly halving contradictions and roughly doubling top-1 accuracy on the same data.
Graph extraction uses an LLM, so individual cells move a few points between runs, and the small 4–5-query rows swing on a single answer. These numbers come from a re-run on 2026-06-13 that added the new expiry scenario. The stable result is the contradiction gap: memory roughly halves it versus plain RAG. Re-run it yourself with the command below.
| Scenario | Top-1 (memory) | Top-1 (RAG) | Contradiction (memory) | Contradiction (RAG) |
|---|---|---|---|---|
| Career — evolving employer / city / role (43 statements, 15 q) | 46.7% | 13.3% | 40.0% | 100.0% |
| Preferences & multi-hop — flipping preferences + person→project→property (19 statements, 9 q) | 44.4% | 22.2% | 44.4% | 66.7% |
| Temporal / recency — dated restatements; latest is truth (13 statements, 4 q) | 50.0% | 50.0% | 50.0% | 100.0% |
| Expiry — time-bound facts that expire and are replaced (15 statements, 5 q) | 80.0% | 40.0% | 80.0% | 100.0% |
On the expiry scenario, memory leads on top-1 (80% vs 40%) and on contradiction (80% vs 100%), but the absolute contradiction rate is still high: an expired value still reaches the top-5 on 4 of the 5 queries. Supersession reduces stale recall here; it does not eliminate it.
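
The per-scenario rows pool into the headline figures once each row is weighted by its question count (15 + 9 + 4 + 5 = 33). The sketch below checks that arithmetic. The per-row hit counts are not published; they are inferred here by rounding each percentage times its question count.

```python
# (questions, memory top-1 %, RAG top-1 %, memory contradiction %, RAG contradiction %)
# Values copied from the per-scenario table.
scenarios = {
    "career": (15, 46.7, 13.3, 40.0, 100.0),
    "preferences": (9, 44.4, 22.2, 44.4, 66.7),
    "temporal": (4, 50.0, 50.0, 50.0, 100.0),
    "expiry": (5, 80.0, 40.0, 80.0, 100.0),
}


def pooled(column: int) -> float:
    """Pool one metric over all questions, as a percentage with one decimal."""
    total_questions = sum(row[0] for row in scenarios.values())
    # Inferred hit count per scenario: percentage * questions, rounded to an integer.
    hits = sum(round(row[0] * row[column] / 100) for row in scenarios.values())
    return round(100 * hits / total_questions, 1)


names = ["mem_top1", "rag_top1", "mem_contradiction", "rag_contradiction"]
headline = {name: pooled(i) for i, name in enumerate(names, start=1)}
print(headline)
# {'mem_top1': 51.5, 'rag_top1': 24.2, 'mem_contradiction': 48.5, 'rag_contradiction': 90.9}
```

Under that inference the pooled counts are 17/33 and 8/33 for top-1 and 16/33 and 30/33 for contradiction, which reproduces the headline table.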
Memory mode ingests statements through /remember (graph extraction supersedes/expires prior facts). The RAG baseline stores every statement verbatim with add(), a plain vector store.
git clone https://github.com/11data/longmem && cd longmem
python -m evals.run_eval --base-url https://rememberos.ai --admin-key mv_your_key
# scenario: evals/scenario.py · runner: evals/run_eval.py · output: evals/results.json
Source: github.com/11data/longmem/evals. Numbers from a run on 2026-06-13 against the production API.
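
To make the difference between the two ingestion paths concrete, here is a toy in-memory sketch. It is not RememberOS's implementation: the class names, the `(subject, attribute)` key, and the exact-key lookup are assumptions for illustration, and a real vector store would rank by similarity rather than match keys.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    current: bool = True  # flipped to False once a newer value replaces it


class AppendOnlyStore:
    """Stands in for the RAG baseline: every statement is kept verbatim."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def add(self, subject: str, attribute: str, value: str) -> None:
        self.facts.append(Fact(subject, attribute, value))

    def lookup(self, subject: str, attribute: str) -> list[str]:
        return [f.value for f in self.facts
                if (f.subject, f.attribute) == (subject, attribute)]


class SupersedingStore(AppendOnlyStore):
    """Stands in for memory mode: a new value retires the previous one."""

    def add(self, subject: str, attribute: str, value: str) -> None:
        for f in self.facts:
            if (f.subject, f.attribute) == (subject, attribute):
                f.current = False  # keep the history, but mark it stale
        super().add(subject, attribute, value)

    def lookup(self, subject: str, attribute: str) -> list[str]:
        return [f.value for f in self.facts
                if (f.subject, f.attribute) == (subject, attribute) and f.current]


statements = [("alice", "employer", "Initech"), ("alice", "employer", "Acme")]
rag, mem = AppendOnlyStore(), SupersedingStore()
for s in statements:
    rag.add(*s)
    mem.add(*s)

print(rag.lookup("alice", "employer"))  # ['Initech', 'Acme']
print(mem.lookup("alice", "employer"))  # ['Acme']
```

The append-only store returns both employers, so a stale value is available to any ranker downstream; the superseding store keeps the old fact as history but hides it from current-truth lookups.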
Shared memory drifts: near-duplicates pile up, and recall needs to surface the right neighbours. Four more reproducible evals measure those operations, plus hybrid keyword-and-vector recall, against the live deployment.
| Eval | What it measures | Result |
|---|---|---|
| Dedup | Seed known duplicate groups + distinct memories; run dedup across a threshold sweep (0.95→0.80) and check it merges duplicates without deleting distinct ones | precision 1.00 at every threshold; recall rises as the threshold loosens (0.95 is conservative, 0.80 collapses most paraphrase groups) |
| Related | Seed 4 clearly-distinct topical clusters (4 memories each); ask each memory for its top-3 neighbours and check they share its topic | precision@3 1.00, coverage 1.00 — no cross-topic bleed |
| Hybrid recall | Seed distractors plus one memory carrying a rare token that's semantically dissimilar to the query; query the bare token with a high vector gate so the vector arm is empty, and check the keyword-only memory is still returned | recall@5 1.00, precision 1.00 — invisible to pure vector search (0 hits), surfaced by hybrid fusion |
| Hybrid ranking | Seed a labeled set with semantic (paraphrase, no keyword overlap), lexical (rare token), and mixed queries; score recall@5 and MRR for vector, text, and hybrid modes | hybrid recall@5 1.00, MRR 1.00 — matches the best single arm on every query while text-only drops to 0.50 (misses the semantic queries) and the gated-lexical case (row above) defeats vector-only |
These are small, fully-seeded evals (a dozen-odd memories), run live against the deployment with real embeddings, deterministic and reproducible. They hold to the same honesty bar as the memory-vs-RAG eval above and are not an academic benchmark. Dedup precision stays 1.00 because the distinct memories are on far-apart topics; recall is a tunable tradeoff set by the threshold.
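
The threshold tradeoff can be illustrated with a toy sweep. This is a sketch under stated assumptions, not the dedup eval itself: the 2-D vectors are invented stand-ins for real embeddings, similarity is plain cosine, and duplicates are scored pairwise with no transitive merging.

```python
import math
from itertools import combinations


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# name -> (toy embedding, duplicate-group label); None marks a distinct memory.
memories = {
    "m1": ((1.0, 0.0), "g1"),
    "m2": ((0.99, 0.10), "g1"),   # close paraphrase of m1
    "m3": ((0.90, 0.45), "g1"),   # looser paraphrase of m1
    "m4": ((0.0, 1.0), None),     # distinct, far-apart topic
    "m5": ((-1.0, 0.0), None),    # distinct, far-apart topic
}

pairs = list(combinations(sorted(memories), 2))
# Ground truth: pairs that share a (non-None) duplicate group.
true_pairs = {
    (a, b) for a, b in pairs
    if memories[a][1] is not None and memories[a][1] == memories[b][1]
}

sweep = {}
for threshold in (0.95, 0.90, 0.85, 0.80):
    predicted = {
        (a, b) for a, b in pairs
        if cosine(memories[a][0], memories[b][0]) >= threshold
    }
    hits = predicted & true_pairs
    precision = len(hits) / len(predicted) if predicted else 1.0
    recall = len(hits) / len(true_pairs)
    sweep[threshold] = (precision, recall)
    print(f"{threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data precision is 1.00 at every threshold because the distinct memories sit far from the duplicate group, while recall climbs from 0.33 at 0.95 to 1.00 at 0.85 and below, the same shape as the result reported in the table.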
python -m evals.run_dedup_eval --key mv_your_key
python -m evals.run_related_eval --key mv_your_key
python -m evals.run_hybrid_recall_eval --key mv_your_key
python -m evals.run_hybrid_ranking_eval --key mv_your_key
# source: evals/run_dedup_eval.py · evals/run_related_eval.py · evals/run_hybrid_recall_eval.py · evals/run_hybrid_ranking_eval.py · results in evals/RESULTS.md
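
The hybrid rows rest on two ideas: fusing a vector ranking with a keyword ranking, and scoring the result with MRR. The source does not say which fusion method the deployment uses, so the sketch below swaps in reciprocal rank fusion (RRF) as an assumption. The memory ids and per-arm rankings are invented, and the toy scores are not meant to reproduce the reported figures.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: sum 1 / (k + rank) over every arm's ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def mrr(runs: list[tuple[list[str], str]]) -> float:
    """Mean reciprocal rank of the single relevant memory for each query."""
    total = 0.0
    for ranked, relevant in runs:
        total += next(
            (1.0 / i for i, d in enumerate(ranked, start=1) if d == relevant), 0.0
        )
    return total / len(runs)


# (vector-arm ranking, text-arm ranking, relevant memory), all hypothetical.
queries = [
    (["m_para", "d1"], [], "m_para"),          # semantic: no keyword overlap
    ([], ["m_token"], "m_token"),              # lexical: vector arm gated to empty
    (["m_mix", "d1"], ["m_mix", "d2"], "m_mix"),  # mixed: both arms agree
]

vector_mrr = mrr([(vec, rel) for vec, _, rel in queries])
text_mrr = mrr([(txt, rel) for _, txt, rel in queries])
hybrid_mrr = mrr([(rrf([vec, txt]), rel) for vec, txt, rel in queries])
print(vector_mrr, text_mrr, hybrid_mrr)
```

Each single arm misses one query type here, while the fused ranking puts the relevant memory first on all three, which is the behaviour the hybrid ranking eval checks for.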