Performance
A memory query does three things: it embeds the query, checks the cache, and runs hybrid search. RememberOS keeps all three fast and, by default, keeps embeddings on-box, so a search never makes a network round trip to an embedding API.
The query path
- Embed: the query is embedded by an on-box ONNX model (MiniLM, 384-dim) by default. There is no external API call, so no network latency is added and nothing leaves the server.
- Cache: identical searches are served from Redis; the cache key includes the request body, so boosted and plain searches never collide.
- Search: hybrid retrieval in Postgres, using pgvector ANN for semantics and a full-text index for keywords, fused with reciprocal-rank fusion, all in one round trip.
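To make the fusion step concrete, here is a minimal sketch of reciprocal-rank fusion over two ranked ID lists. It only illustrates the scoring idea: RememberOS performs the fusion inside Postgres, and the function name `rrf_fuse`, the constant `k=60` (a commonly used default for RRF), and the sample IDs are assumptions, not taken from the product.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, limit=10):
    """Fuse several ranked ID lists with reciprocal-rank fusion.

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:limit]]


# Hypothetical results from the two retrievers.
vector_hits = ["m3", "m1", "m7"]   # semantic (ANN) ranking
keyword_hits = ["m1", "m9", "m3"]  # full-text ranking

fused = rrf_fuse([vector_hits, keyword_hits])
print(fused)
```

IDs that appear in both lists accumulate two contributions, so they tend to rise above IDs that only one retriever found. This is why hybrid search can surface a memory that is both semantically close and a keyword match.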
Measured latency
Per-stage p50/p95 measured live on the production box (a single 4 GB Hetzner
instance — deliberately modest hardware), from GET /v1/memory/admin/metrics:
| Stage | p50 | p95 | What it is |
|---|---|---|---|
| Embed (query) | ~10 ms | ~17 ms | on-box ONNX MiniLM — no network |
| Cache get | ~0.2 ms | ~2 ms | Redis lookup |
| Search | ~2.4 ms | ~32 ms | pgvector + full-text, RRF-fused |
These are real numbers from a live run, not a benchmark rig. Your results will vary with collection size, hardware, and whether you switch embeddings to OpenAI (which adds a network round trip but frees CPU on the box). The live values for your own deployment are always available on the admin analytics page and in the stages field of /admin/metrics.
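As a rough illustration of how per-stage p50/p95 figures like these can be derived, the sketch below keeps latency samples in memory and computes nearest-rank percentiles. The class `StageTimings`, its method names, and the `p50`, `p95`, and `count` keys are hypothetical. Only the `stages` field and the `scope: "process"` marker come from this page, and the real payload may be shaped differently.

```python
from collections import defaultdict


class StageTimings:
    """Hypothetical in-memory latency recorder for one worker process."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, stage, millis):
        """Store one latency sample (in milliseconds) for a stage."""
        self._samples[stage].append(millis)

    @staticmethod
    def _percentile(sorted_values, pct):
        # Nearest-rank method: the smallest sample such that at least
        # pct percent of all samples are less than or equal to it.
        rank = max(1, (pct * len(sorted_values) + 99) // 100)
        return sorted_values[rank - 1]

    def snapshot(self):
        """Return a payload-like dict; key names are assumptions."""
        stages = {}
        for stage, values in self._samples.items():
            ordered = sorted(values)
            stages[stage] = {
                "p50": self._percentile(ordered, 50),
                "p95": self._percentile(ordered, 95),
                "count": len(ordered),
            }
        return {"scope": "process", "stages": stages}


timings = StageTimings()
for ms in range(1, 21):  # twenty fake samples: 1 ms .. 20 ms
    timings.record("search", ms)

snapshot = timings.snapshot()
print(snapshot["stages"]["search"])
```

Because the samples live in one process, two workers would each report their own percentiles. That is the reason the payload labels its scope.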
Notes
- On-box vs OpenAI embeddings: local keeps queries network-free and private; OpenAI (1536-dim) trades a round trip for offloading CPU. Switch per deployment.
- Graph extraction is async: /remember returns immediately and the LLM extraction runs on a worker, so write latency isn't gated on the model.
- Scope: metrics are per worker process and in-memory, and the payload says so explicitly (scope: "process").
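The async write path described in the notes can be sketched with a queue and a background worker. This is an in-process illustration under stated assumptions, not the RememberOS implementation: `remember`, `extract_graph`, the response shape, and the use of `asyncio` are all hypothetical, and a real deployment would likely run the worker separately.

```python
import asyncio


async def extract_graph(text):
    # Stand-in for the slow LLM extraction call (hypothetical).
    await asyncio.sleep(0.05)
    return {"entities": text.split()}


async def worker(queue, results):
    # Drain the queue forever; each job is (memory_id, text).
    while True:
        memory_id, text = await queue.get()
        try:
            results[memory_id] = await extract_graph(text)
        finally:
            queue.task_done()


async def remember(queue, memory_id, text):
    # Enqueue the extraction job and return without waiting for it.
    queue.put_nowait((memory_id, text))
    return {"id": memory_id, "status": "accepted"}


async def main():
    queue = asyncio.Queue()
    results = {}
    task = asyncio.create_task(worker(queue, results))

    response = await remember(queue, "m1", "Ada likes tea")
    # The write has returned, but extraction has not run yet.
    pending_at_return = "m1" not in results

    await queue.join()  # wait for the worker to finish (demo only)
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return response, pending_at_return, results


response, pending_at_return, results = asyncio.run(main())
print(response, pending_at_return, results)
```

The caller gets its response before `extract_graph` has run, which is the property the note describes: write latency depends on the enqueue, not on the model.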