Tailored news hub
home›Memory›

mnemo: Local-First Knowledge Graph for Persistent LLM Memory

A practical guide to mnemo, a Rust-based sidecar service providing structured, persistent memory for LLMs without cloud dependencies.

mnemo: Local-First Knowledge Graph for Persistent LLM Memory
#Context#Framework#LLM#Memory#Python

mnemo is a local-first memory layer for LLMs, offering persistent, structured context via a sidecar service. It extracts entities and relationships into a knowledge graph from raw text, and retrieves ranked context for LLM prompts, supporting fully local setups with Ollama or integration with OpenAI.

A Local-First Memory Layer for LLMs

mnemo is a local-first memory layer that gives LLM applications persistent context without any cloud dependency. It runs as a sidecar service, ingesting raw text, extracting entities and relationships with a configurable LLM, and building a knowledge graph in SQLite.
The graph is the key differentiator: entities are deduplicated across sessions, relationships are weighted, and retrieval uses a 6‑stage pipeline—full‑text search, entity name search, BFS graph expansion, relation filtering, scoring, and assembly of a context prompt string. Graph‑expanded hits are scored at 0.5× so that direct matches always rank higher.
This yields a structured, self‑contained memory you fully control, ideal when you need to inject relevant past context into a model’s system prompt in custom pipelines.

git clone https://github.com/zaydmulani09/mnemo
cd mnemo
docker compose up -d

# Pull the llama3 model the first time (~4 GB)
docker exec mnemo-ollama ollama pull llama3

# Verify everything is healthy
curl http://localhost:8080/health

Other Installation Paths

You can also install mnemo as a standalone binary via Cargo. Set environment variables to point to a local Ollama instance or any OpenAI‑compatible API, then run mnemo-api.
For Python projects, the mnemo-sdk package wraps the API:

from mnemo import MnemoClient

client = MnemoClient()  # server at http://localhost:8080

# Store a memory
client.ingest("I'm building a Rust vector database called vecdb")

# Get context for injection into your next LLM prompt
print(client.get_context("what am I working on?"))
# Store a memory
mnemo ingest "I use Neovim and prefer dark mode"

# Retrieve relevant context
mnemo search "what editor do I use?"

# List all extracted entities
mnemo entities

# Show entity detail + graph neighbors
mnemo entity <uuid> --neighbors

# List memory chunks
mnemo chunks

# Server health
mnemo health

# Memory statistics
mnemo stats

# Delete everything (prompts for confirmation)
mnemo wipe

# Skip confirmation prompt
mnemo wipe --yes

# Point at a non-default server
mnemo --server http://192.168.1.10:8080 stats

Core HTTP API

All endpoints accept and return JSON at http://localhost:8080.
Key endpoints:

  • GET /health – server, database, and LLM status.
  • POST /ingest – store text; include a session_id to group related turns.
  • POST /retrieve – run the full memory pipeline and return a context_prompt string that provides persistent context for your next LLM call. You control max_chunks, max_entities, min_confidence (default 0.5), include_graph (default true), and graph_depth (default 2).
  • GET /entities, GET /chunks – paginated listing, plus a POST /search for full‑text lookups.
  • DELETE /wipe – irreversible clear‑all; requires the confirmation header X-Confirm-Wipe: true.
  • GET /stats – counts and uptime.

A complete reference is in docs/api.md.

Configuration

Settings are picked up from environment variables and an optional TOML file (passed with --config).
Key variables:
MNEMO_DB_PATH (sqlite file), MNEMO_PORT (8080), MNEMO_LLM_BASE_URL, MNEMO_LLM_MODEL (llama3), MNEMO_LLM_API_KEY, and MNEMO_LLM_PROVIDER (ollama, openai, anthropic, or custom).
Environment variables always override TOML values. The active configuration source is reported in GET /health → config_source. A sample file is provided in mnemo.example.toml.

Constraints and Best Practices

Limitations:

  • Not a replacement for managed agent frameworks that already handle memory.
  • Graph‑expanded results are scored at 0.5×, so direct hits always outrank inferred connections.
  • Neighbor depth is capped at 5. Entity deletion cascades (relationships are removed). Wipe is irreversible.

Best practices:

  • Use session_id to scope retrieval.
  • Set min_confidence (≥ 0.5) to filter low‑quality extractions.
  • Disable graph expansion (include_graph: false) when you need only direct matches.
  • Run in --release mode for 3–5× speed gains.
  • Prefer environment variables in containerised deployments.
  • Start with the Docker + Ollama path for a zero‑cost, fully local setup.
Related Articles