Performance

A memory query does three things: embed the query, check the cache, and run hybrid search. RememberOS keeps all three fast and, importantly, keeps embeddings on-box, so a search never makes a network round-trip to an embedding API.

The query path

Embed — the query is embedded by an on-box ONNX model (MiniLM, 384-dim) by default. No external API call, so no network latency and nothing leaves the server.
Cache — identical searches return from Redis; the cache key includes the request body, so boosted and plain searches never collide.
Search — hybrid retrieval in Postgres: pgvector ANN for semantics + a full-text index for keywords, fused with reciprocal-rank fusion, all in one round trip.
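
The cache and fusion steps above can be illustrated with a minimal Python sketch. It is not RememberOS code: the real fusion runs inside Postgres in a single round trip, and the actual key format is not documented here. The names `cache_key` and `rrf_fuse`, the `search:` prefix, the example path, the `boost` field, and the constant k = 60 (the value commonly used for reciprocal-rank fusion) are all assumptions made for illustration.

```python
import hashlib
import json
from collections import defaultdict


def cache_key(path: str, body: dict) -> str:
    """Derive a cache key from the endpoint path plus the full request body."""
    # sort_keys makes the key independent of JSON field order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"search:{path}:{digest}"  # hypothetical key layout


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result ids, best match first in each list.
vector_hits = ["m3", "m1", "m7"]   # from the ANN (semantic) side
keyword_hits = ["m1", "m9", "m3"]  # from the full-text side
fused = rrf_fuse([vector_hits, keyword_hits])

# Hypothetical request bodies: a plain search and a boosted one.
plain_key = cache_key("/v1/memory/search", {"query": "pricing"})
boosted_key = cache_key("/v1/memory/search", {"query": "pricing", "boost": {"recency": 2}})
```

Because the whole body is hashed, the plain and boosted requests get different keys even though the query text is the same. In the fusion step, an item that ranks well in both lists (`m1` here) outranks one that tops only a single list.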

Measured latency

Per-stage p50/p95 measured live on the production box (a single 4 GB Hetzner instance — deliberately modest hardware), from GET /v1/memory/admin/metrics:

Stage	p50	p95	What it is
Embed (query)	~10 ms	~17 ms	on-box ONNX MiniLM — no network
Cache get	~0.2 ms	~2 ms	Redis lookup
Search	~2.4 ms	~32 ms	pgvector + full-text, RRF-fused

These are real numbers from a live run, not a benchmark rig. Expect them to vary with collection size, hardware, and whether you switch embeddings to OpenAI (which adds a network round trip but takes embedding work off the box). The live values for your own deployment are always on the admin analytics page and in the stages field of /v1/memory/admin/metrics.
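
For readers who want a feel for how per-stage p50/p95 figures like these can be produced, here is a minimal sketch of an in-memory, per-process latency recorder. It is an assumption-laden illustration, not the RememberOS implementation: the class `StageTimings`, its methods, the nearest-rank percentile definition, and the inner shape of the `stages` entries are hypothetical. Only the `stages` field name and `scope: "process"` come from this page.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimings:
    """In-memory, per-process latency samples keyed by stage name."""

    def __init__(self) -> None:
        self._samples_ms: dict[str, list[float]] = defaultdict(list)

    def record(self, stage: str, elapsed_ms: float) -> None:
        self._samples_ms[stage].append(elapsed_ms)

    @contextmanager
    def timed(self, stage: str):
        """Time the enclosed block and record it under the given stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, (time.perf_counter() - start) * 1000.0)

    def percentile(self, stage: str, pct: float) -> float:
        """Nearest-rank percentile over the recorded samples for one stage."""
        ordered = sorted(self._samples_ms.get(stage, []))
        if not ordered:
            raise ValueError(f"no samples for stage {stage!r}")
        rank = max(1, math.ceil(pct * len(ordered) / 100.0))
        return ordered[rank - 1]

    def summary(self) -> dict:
        # Hypothetical payload shape; only "scope" and "stages" are named in the docs.
        return {
            "scope": "process",
            "stages": {
                name: {"p50": self.percentile(name, 50), "p95": self.percentile(name, 95)}
                for name in self._samples_ms
            },
        }


timings = StageTimings()
with timings.timed("embed"):
    sum(range(1000))  # stand-in for the real embedding call
```

The gap between p50 and p95 in the table (for example, Search at ~2.4 ms versus ~32 ms) is the kind of tail behaviour a single average would hide, which is presumably why both percentiles are reported.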

Notes

On-box vs OpenAI embeddings — local keeps queries network-free and private; OpenAI (1536-dim) trades a round trip for offloading CPU. Switch per deployment.
Graph extraction is async — /remember returns immediately and the LLM extraction runs on a worker, so write latency isn't gated on the model.
Scope — metrics are per worker process and held in memory; the payload states this explicitly (scope: "process").
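
The asynchronous write path described in the notes above can be sketched with a queue and a background worker. This is an illustration of the pattern only, under stated assumptions: the function names, the in-memory dictionaries, the response shape, and the toy `extract_entities` stand-in (which replaces the real LLM call) are all hypothetical, and a production deployment would more likely use a durable job queue than a thread in the same process.

```python
import queue
import threading

extraction_jobs: "queue.Queue[tuple[str, str]]" = queue.Queue()
memories: dict[str, str] = {}           # stand-in for the primary store
graph_edges: dict[str, list[str]] = {}  # stand-in for the extracted graph


def extract_entities(text: str) -> list[str]:
    """Stand-in for the LLM extraction call; it just picks capitalized words."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def worker() -> None:
    """Drain the job queue forever, running extraction off the request path."""
    while True:
        memory_id, text = extraction_jobs.get()
        try:
            graph_edges[memory_id] = extract_entities(text)
        finally:
            extraction_jobs.task_done()


def remember(memory_id: str, text: str) -> dict:
    """Store the memory and enqueue extraction; does not wait for the model."""
    memories[memory_id] = text
    extraction_jobs.put((memory_id, text))
    return {"id": memory_id, "status": "stored"}  # hypothetical response shape


threading.Thread(target=worker, daemon=True).start()

response = remember("m1", "Ada met Grace in Paris.")
extraction_jobs.join()  # only for the demo: wait until the worker has caught up
```

The caller gets its response as soon as the memory is stored and the job is queued; the graph entry for `m1` appears later, once the worker has processed it.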
