A Local-First Memory Layer for LLMs
mnemo is a local-first memory layer that gives LLM applications persistent context without any cloud dependency.
It runs as a sidecar service, ingesting raw text, extracting entities and relationships with a configurable LLM, and building a knowledge graph in SQLite.
The graph is the key differentiator: entities are deduplicated across sessions, relationships are weighted, and retrieval uses a 6‑stage pipeline—full‑text search, entity name search, BFS graph expansion, relation filtering, scoring, and assembly of a context prompt string.
Graph‑expanded hits are scored at 0.5× so that direct matches always rank higher.
This yields a structured, self‑contained memory you fully control, ideal when you need to inject relevant past context into a model’s system prompt in custom pipelines.
git clone https://github.com/zaydmulani09/mnemo cd mnemo docker compose up -d # Pull the llama3 model the first time (~4 GB) docker exec mnemo-ollama ollama pull llama3 # Verify everything is healthy curl http://localhost:8080/health
Other Installation Paths
You can also install mnemo as a standalone binary via Cargo.
Set environment variables to point to a local Ollama instance or any OpenAI‑compatible API, then run mnemo-api.
For Python projects, the mnemo-sdk package wraps the API:
from mnemo import MnemoClient client = MnemoClient() # server at http://localhost:8080 # Store a memory client.ingest("I'm building a Rust vector database called vecdb") # Get context for injection into your next LLM prompt print(client.get_context("what am I working on?"))
# Store a memory mnemo ingest "I use Neovim and prefer dark mode" # Retrieve relevant context mnemo search "what editor do I use?" # List all extracted entities mnemo entities # Show entity detail + graph neighbors mnemo entity <uuid> --neighbors # List memory chunks mnemo chunks # Server health mnemo health # Memory statistics mnemo stats # Delete everything (prompts for confirmation) mnemo wipe # Skip confirmation prompt mnemo wipe --yes # Point at a non-default server mnemo --server http://192.168.1.10:8080 stats
Core HTTP API
All endpoints accept and return JSON at http://localhost:8080.
Key endpoints:
- GET /health – server, database, and LLM status.
- POST /ingest – store text; include a
session_idto group related turns. - POST /retrieve – run the full memory pipeline and return a
context_promptstring that provides persistent context for your next LLM call. You controlmax_chunks,max_entities,min_confidence(default 0.5),include_graph(default true), andgraph_depth(default 2). - GET /entities, GET /chunks – paginated listing, plus a POST /search for full‑text lookups.
- DELETE /wipe – irreversible clear‑all; requires the confirmation header
X-Confirm-Wipe: true. - GET /stats – counts and uptime.
A complete reference is in docs/api.md.
Configuration
Settings are picked up from environment variables and an optional TOML file (passed with --config).
Key variables:
MNEMO_DB_PATH (sqlite file), MNEMO_PORT (8080), MNEMO_LLM_BASE_URL, MNEMO_LLM_MODEL (llama3), MNEMO_LLM_API_KEY, and MNEMO_LLM_PROVIDER (ollama, openai, anthropic, or custom).
Environment variables always override TOML values.
The active configuration source is reported in GET /health → config_source.
A sample file is provided in mnemo.example.toml.
Constraints and Best Practices
Limitations:
- Not a replacement for managed agent frameworks that already handle memory.
- Graph‑expanded results are scored at 0.5×, so direct hits always outrank inferred connections.
- Neighbor depth is capped at 5. Entity deletion cascades (relationships are removed). Wipe is irreversible.
Best practices:
- Use
session_idto scope retrieval. - Set
min_confidence(≥ 0.5) to filter low‑quality extractions. - Disable graph expansion (
include_graph: false) when you need only direct matches. - Run in
--releasemode for 3–5× speed gains. - Prefer environment variables in containerised deployments.
- Start with the Docker + Ollama path for a zero‑cost, fully local setup.





