Claude-Mem Deep Dive: Persistent Memory Plugin for Claude Code
Claude-Mem gives Claude Code cross-session persistent memory via hooks, AI compression, hybrid search, and Endless Mode. Full architecture breakdown and comparison with native CLAUDE.md.
Tags: Claude Code, Claude-Mem, AI Memory, Plugin Architecture, MCP
2026-02-02

The most frustrating thing about Claude Code is not when it writes bad code. It is when every new session starts with a blank slate. The architecture discussion from yesterday, the bug you spent two hours tracking down, the coding conventions you agreed on — all gone. You end up repeating context over and over, like introducing yourself every morning to someone with amnesia.
Claude-Mem was built to fix this. It is a Claude Code plugin that automatically captures every interaction, compresses it into structured memory using AI, and intelligently injects relevant context into future sessions. In short, it gives Claude Code a long-term memory system.
By the end of this article, you will understand:
- How Claude-Mem’s core architecture is designed
- How its Hook system captures context transparently
- How three-tier progressive search saves 10x on tokens
- How Endless Mode breaks through context window limits
- How it compares to the native CLAUDE.md memory system
Why You Need Claude-Mem
The Amnesia Problem
Claude Code has a 200K token context window (Claude Sonnet 4). That sounds large, but in practice each tool call consumes 1,000 to 10,000 tokens. A moderately complex development task with 50 tool calls can fill the entire window. More critically, once a session ends, all context vanishes.
This means:
- Architecture decisions discussed yesterday need to be explained again today
- A bug you spent hours debugging requires a fresh investigation in a new session
- Team coding conventions must be manually restated every time
Limitations of Existing Solutions
Claude Code natively provides the CLAUDE.md memory mechanism — you place a Markdown file in your project root, and Claude reads it automatically at startup. This works, but has clear limitations:
| Pain Point | Description |
|---|---|
| Manual maintenance | You decide what to remember and what to forget |
| Static content | Does not automatically record discoveries and decisions |
| No search | Finding specific information becomes harder as the file grows |
| Context overhead | Larger files leave fewer tokens for actual work |
Claude-Mem takes a fundamentally different approach: let AI decide what to remember, how to compress it, and when to inject it.
Core Architecture: Five Layers Working Together
Think of Claude-Mem’s architecture as a smart archive. Hooks are the archivists (they collect information), the Worker is the librarian (it classifies and compresses), databases are the filing cabinets (they store everything), the search system is the index desk (it retrieves), and SessionStart injection is your daily briefing.
Architecture Overview
┌─────────────────────────────────────────────┐
│ Claude Code IDE │
│ ┌─────────┐ ┌──────────┐ ┌──────────────┐ │
│ │Session │ │UserPrompt│ │ PostToolUse │ │
│ │Start │ │Submit │ │ │ │
│ │Hook │ │Hook │ │ Hook │ │
│ └────┬────┘ └────┬─────┘ └──────┬───────┘ │
│ │ │ │ │
└───────┼───────────┼──────────────┼───────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────┐
│ Worker Service (Port 37777) │
│ ┌─────────────┐ ┌──────────────────────┐ │
│ │ Context │ │ Session Manager │ │
│ │ Builder │ │ (AI Agent Generator) │ │
│ └─────────────┘ └──────────────────────┘ │
│ ┌─────────────┐ ┌──────────────────────┐ │
│ │ Search │ │ SSE Broadcaster │ │
│ │ Manager │ │ (Real-time Events) │ │
│ └─────────────┘ └──────────────────────┘ │
└────────┬──────────────────┬─────────────────┘
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ SQLite + FTS5 │ │ ChromaDB │
│ (Structured) │ │ (Embeddings) │
└────────────────┘ └────────────────┘
The system consists of six Hook scripts, an HTTP Worker service, dual-database storage, and four MCP tools. Let us break down each layer.
The Hook System: Transparent Capture
Claude-Mem leverages Claude Code’s Hook lifecycle to inject logic at five critical moments:
| Hook | Trigger | Purpose |
|---|---|---|
| Smart Install | Before session start | Checks dependencies (Bun, uv, etc.), installs what is missing |
| SessionStart | Session begins | Retrieves relevant context from the last 10 sessions and injects it |
| UserPromptSubmit | User sends a message | Records user input and session metadata |
| PostToolUse | After each tool call | Captures observation from tool operations (file reads, code changes, etc.) |
| Stop/SessionEnd | Session ends | Uses AI to generate a semantic summary for the next session |
A key design choice: all Hooks are lightweight HTTP clients (each roughly 75 lines of code after the v7.0 refactor). They only send requests to the Worker Service and perform no heavy computation. This ensures Hooks never slow down Claude Code’s response time.
// PostToolUse Hook core logic (simplified)
// After capturing a tool call, immediately send it to the Worker
const response = await fetch(`http://127.0.0.1:37777/api/sessions/observations`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    session_id: currentSessionId,
    tool_name: toolResult.name,
    tool_input: toolResult.input,
    tool_output: toolResult.output
  })
});
// Worker returns 202 Accepted immediately; it does not wait for processing
Worker Service: The Central Orchestrator
The Worker Service is the brain of the entire system, running on Bun and listening on 127.0.0.1:37777 by default. It uses a two-phase startup:
- Phase 1 (fast): HTTP server binds the port immediately and returns control to the Hook (prevents timeout)
- Phase 2 (background): Initializes databases, crash recovery, SearchManager, and MCP connections
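The two-phase pattern can be sketched in a few lines (a simplified illustration, not Claude-Mem's actual code; the function and type names are invented for this example):

```typescript
// Sketch of the Worker's two-phase startup (illustrative only).
// Phase 1 must finish immediately so the calling Hook is never kept
// waiting; Phase 2 does the slow work in the background.

type WorkerState = {
  listening: boolean; // Phase 1: port is bound, Hooks get an answer
  ready: boolean;     // Phase 2: databases, recovery, search initialized
};

function phase1BindPort(state: WorkerState): void {
  // In the real Worker this is the HTTP server binding 127.0.0.1:37777.
  state.listening = true;
}

async function phase2Initialize(state: WorkerState): Promise<void> {
  // Stand-in for opening SQLite, scanning pending_messages, wiring MCP.
  await Promise.resolve();
  state.ready = true;
}

async function startWorker(state: WorkerState): Promise<void> {
  phase1BindPort(state);         // fast: Hooks can connect right away
  await phase2Initialize(state); // slow: runs after the port is bound
}
```

Requests that arrive between the two phases would have to be queued or answered with a "not ready" status; the key property is that the port is bound before any slow initialization begins.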
Core responsibilities include:
- Session management: Tracks active development sessions with dual ID mapping (IDE session ID to memory agent session ID)
- AI agent dispatch: Generates observations and summaries in the background
- Search orchestration: Coordinates hybrid search across SQLite FTS5 and ChromaDB
- Real-time broadcasting: Pushes events to the Web UI via SSE (Server-Sent Events)
- Crash recovery: Scans the `pending_messages` table on startup and retries incomplete tasks
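The recovery pass itself is conceptually simple. Here is a sketch with an in-memory stand-in for the pending_messages table (the column names and status values are assumptions for illustration):

```typescript
// Sketch of the crash-recovery scan over pending_messages (illustrative).
// After a crash, rows stranded in a non-terminal state are re-queued.

interface PendingMessage {
  id: number;
  payload: string;
  status: "pending" | "processing" | "done"; // assumed status values
}

function recoverPending(rows: PendingMessage[]): PendingMessage[] {
  const stranded = rows.filter((r) => r.status !== "done");
  for (const row of stranded) {
    row.status = "pending"; // reset so the normal worker loop retries it
  }
  return stranded;
}
```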
Three-Layer Storage Architecture
Claude-Mem uses triple-redundant storage:
SQLite (structured data):
- Database location: `~/.claude-mem/claude-mem.db`
- Core tables: `sdk_sessions`, `observations`, `session_summaries`, `user_prompts`, `pending_messages`
- Uses FTS5 virtual tables for keyword full-text search
- The `pending_messages` table ensures crash recovery — any unprocessed work is persisted

ChromaDB (vector embeddings):
- Location: `~/.claude-mem/vector-db/`
- Each observation field (title, narrative, facts) is stored as an independent vector
- `ChromaSync` handles asynchronous syncing with smart backfill strategies

File system:
- `~/.claude-mem/settings.json`: configuration file
- `~/.claude-mem/logs/`: runtime logs
- Optional `CLAUDE.md` activity timeline file
AI Compression Engine: 10:1 to 100:1 Ratios
What Is an Observation?
Every time you use Claude Code to read a file, write code, or run a command, the PostToolUse Hook captures the raw tool input and output. But raw data is too large — a single file read can be thousands of tokens.
Claude-Mem’s AI agent compresses this raw data into structured observations of roughly 500 tokens, containing:
{
  "id": "obs_20260203_001",
  "type": "discovery",
  "title": "Found race condition in API auth middleware",
  "narrative": "While investigating intermittent 401 errors on /api/users...",
  "facts": [
    "Auth middleware does not lock during token expiry",
    "Concurrent requests can trigger simultaneous token refreshes",
    "Fix: use mutex to ensure single refresh"
  ],
  "concepts": ["problem-solution", "gotcha"],
  "session_id": "sess_abc123",
  "timestamp": "2026-02-03T00:00:00Z"
}
Think of it as compressing a book into high-quality reading notes — preserving core insights while discarding redundant details.
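In TypeScript terms, the shape above maps to an interface like this (field names follow the JSON example; the union members are illustrative, since the article only names a handful of types and concepts):

```typescript
// Shape of a compressed observation, mirroring the JSON example above.
// The literal union lists only the type values mentioned in this article.

type ObservationType = "discovery" | "bugfix" | "feature" | "refactor";

interface Observation {
  id: string;          // e.g. "obs_20260203_001"
  type: ObservationType;
  title: string;       // one-line headline, shown by index search
  narrative: string;   // short prose account of what happened
  facts: string[];     // atomic, independently useful statements
  concepts: string[];  // e.g. "problem-solution", "gotcha"
  session_id: string;
  timestamp: string;   // ISO 8601
}
```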
Three AI Engine Options
Claude-Mem supports three AI providers with hot-swapping at runtime (shared conversation history, no context loss):
| Engine | Advantage | Best For |
|---|---|---|
| SDKAgent (default) | Uses Claude Agent SDK, highest observation quality | Best results, included in Claude Code subscription |
| GeminiAgent | Free tier of 1,500 requests/day, subject to rate limits | Budget-conscious, lightweight usage |
| OpenRouterAgent | 100+ models available, some free | Flexible model selection, experimentation |
All engines automatically fall back to SDKAgent when encountering errors (429/5xx).
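That fallback can be pictured as a thin wrapper around the engine call (a sketch; the names and retry policy here are invented, and only the fall-back-on-429/5xx behavior comes from the docs):

```typescript
// Sketch of engine fallback: try the configured engine first, and on a
// rate-limit (429) or server error (5xx) retry with the default engine.

type Engine = (prompt: string) => Promise<string>;

class EngineHttpError extends Error {
  constructor(public status: number) {
    super(`engine HTTP ${status}`);
  }
}

function withFallback(primary: Engine, fallback: Engine): Engine {
  return async (prompt) => {
    try {
      return await primary(prompt);
    } catch (err) {
      const retriable =
        err instanceof EngineHttpError &&
        (err.status === 429 || err.status >= 500);
      if (!retriable) throw err; // other failures surface to the caller
      return fallback(prompt);   // e.g. GeminiAgent -> SDKAgent
    }
  };
}
```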
Three-Tier Progressive Search: 10x Token Savings
Traditional RAG (Retrieval-Augmented Generation) typically dumps everything retrieved into the context. In a token-scarce environment, this is extremely wasteful. Claude-Mem implements a progressive disclosure search pattern:
The Three-Tier Workflow
Tier 1: Index Search (~50-100 tokens per entry)
│ Returns observation ID, title, date, type
│ Quick filtering to find entries of interest
▼
Tier 2: Timeline Context
│ Shows timeline around an anchor observation
│ Understand causal relationships and decision chains
▼
Tier 3: Full Details (~500-1000 tokens per entry)
│ Retrieves full text only for selected observations
│ Final confirmation of needed information
▼
Result: ~10x token savings compared to traditional RAG
The elegance of this design is filter first, fetch later. It is like visiting a library: you would not carry every book off the shelf before choosing. You check the catalog, find the chapter, then turn to the specific page.
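The savings can be estimated with back-of-the-envelope numbers. The per-entry costs below are the article's figures; the candidate count and the number of entries actually opened are assumptions:

```typescript
// Rough token-cost comparison: traditional RAG vs progressive disclosure.
// Assumes 50 candidate entries, of which only 3 are actually needed.

const candidates = 50;   // entries matching the query (assumed)
const selected = 3;      // entries whose full text is fetched (assumed)
const indexCost = 75;    // ~50-100 tokens per index entry (midpoint)
const detailCost = 750;  // ~500-1000 tokens per full entry (midpoint)

// Traditional RAG: dump the full text of every candidate into context.
const ragTokens = candidates * detailCost; // 37,500

// Progressive: cheap index pass, then full text for the selected few.
const progressiveTokens = candidates * indexCost + selected * detailCost; // 6,000

const savings = ragTokens / progressiveTokens; // 6.25x with these numbers
// The ratio climbs toward ~10x as the candidate set grows relative to
// the handful of entries you actually open.
```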
Hybrid Search Strategy
SearchManager invokes three search strategies simultaneously and fuses the results:
- ChromaStrategy: Vector similarity search via ChromaDB, excels at semantic matching (“that auth bug I fixed yesterday”)
- SQLiteStrategy: Keyword search via FTS5, excels at exact matching (“401 error”)
- HybridStrategy: Relevance fusion ranking, combining the strengths of both
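One common way to fuse two ranked lists is reciprocal rank fusion (RRF). The article does not say which formula Claude-Mem uses, so treat this as an illustrative sketch of the general technique:

```typescript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank).
// k = 60 is the conventional constant from the RRF literature.

function rrfFuse(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}
```

An item near the top of both the semantic ranking and the keyword ranking beats an item that tops only one of them, which is exactly the behavior you want from a hybrid strategy.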
MCP Tool Interface
Claude-Mem exposes search capabilities through four MCP tools:
| Tool | Function | Token Cost |
|---|---|---|
| `search` | Returns compact index | ~50-100 tokens/entry |
| `timeline` | Timeline context | Medium |
| `get_observations` | Full observation details | ~500-1000 tokens/entry |
| `__IMPORTANT` | Workflow documentation | One-time |
A major optimization in v7.0 was consolidating 9 MCP tools (~2,500 tokens) into 1 Skill (~250 tokens preamble + on-demand instructions), dramatically reducing token overhead during tool registration.
Context Injection: Your Session Briefing
The Injection Flow
When you start a new Claude Code session, the SessionStart Hook triggers context injection:
- Sends a `GET /api/context/inject` request to the Worker
- `ContextBuilder` retrieves up to 50 relevant observations from the last 10 sessions (both configurable)
- Ranks by relevance using hybrid search
- Formats as Markdown and injects into the new session
Fine-Grained Configuration
You can tune injection parameters in the Web UI at http://localhost:37777:
{
  "CLAUDE_MEM_CONTEXT_OBSERVATIONS": 50,
  "CLAUDE_MEM_CONTEXT_SHOW_LAST_SUMMARY": false,
  "CLAUDE_MEM_CONTEXT_SHOW_LAST_MESSAGE": false,
  "CLAUDE_MEM_SKIP_TOOLS": [
    "ListMcpResourcesTool",
    "SlashCommand",
    "Skill"
  ]
}
Filtering by type (bugfix, feature, refactor, etc.) and concept (how-it-works, problem-solution, gotcha, etc.) is also supported, giving you precise control over what information enters the new session.
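Conceptually, the filter is just a predicate applied to each candidate observation before injection. A sketch (the config shape is an assumption; the field names follow the observation JSON shown earlier):

```typescript
// Sketch of type/concept filtering during context injection (illustrative).

interface InjectionFilter {
  types?: string[];    // e.g. ["bugfix", "feature", "refactor"]
  concepts?: string[]; // e.g. ["how-it-works", "problem-solution", "gotcha"]
}

interface ObservationLite {
  type: string;
  concepts: string[];
}

function matchesFilter(obs: ObservationLite, f: InjectionFilter): boolean {
  if (f.types && !f.types.includes(obs.type)) return false;
  if (f.concepts && !obs.concepts.some((c) => f.concepts!.includes(c))) {
    return false;
  }
  return true; // unset filter fields match everything
}
```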
Token Economics
From v3 to v7, context injection token consumption went through a massive improvement:
| Version | Context Injection | Notes |
|---|---|---|
| v3 | ~25,000 tokens | Full dump, brute force |
| v7 | ~1,500 tokens | Compression + progressive loading |
A 94% reduction in tokens, meaning far more of the context window is available for actual coding work.
Endless Mode: Breaking the Context Window Barrier
The Problem: O(N²) Token Consumption
Standard Claude Code sessions have quadratic token growth. Each tool call not only adds new content but also retains all previous tool outputs in the context. After roughly 50 tool calls, the 200K context window is full.
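Some quick arithmetic makes the problem concrete. The per-call output size below is an assumption (the article cites 1,000 to 10,000 tokens per tool call):

```typescript
// How many tool calls fit in a 200K window, raw vs compressed, plus the
// cumulative tokens processed across a session (the O(N^2) in the heading).

const windowBudget = 200_000;   // Claude Sonnet 4 context window
const rawPerCall = 4_000;       // assumed average raw tool output
const compressedPerCall = 500;  // observation size after compression

const rawCapacity = Math.floor(windowBudget / rawPerCall);               // 50
const compressedCapacity = Math.floor(windowBudget / compressedPerCall); // 400

// Each call re-reads the outputs of every previous call, so cumulative
// tokens processed over N calls are perCall * N * (N + 1) / 2.
const rawProcessed = (rawPerCall * rawCapacity * (rawCapacity + 1)) / 2; // 5,100,000
```

With a 4K-token average, 50 raw calls fill the window, matching the article's estimate; compression stretches the same budget to about 400 calls, and heavier raw outputs push the gain toward the ~20x figure in the Results list.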
The Solution: Bionic Memory Architecture
Endless Mode (currently in Beta) implements a two-tier memory system inspired by human cognition:
┌───────────────────────────────┐
│ Working Memory │ ← In context window
│ Compressed observations, │
│ ~500 tokens each │
│ (like short-term memory) │
└───────────────┬───────────────┘
│ Compression
▼
┌───────────────────────────────┐
│ Archive Memory │ ← On disk
│ Full tool outputs, │
│ retrieved on demand │
│ (like long-term memory) │
└───────────────────────────────┘
How it works: The PostToolUse Hook blocks after each tool call (up to 110 seconds), allowing the AI agent to compress the full tool output into a ~500 token observation. The compressed observation then replaces the original output in the context. The full output is archived to disk.
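The replace-and-archive step can be sketched as follows (illustrative only: a truncating stub stands in for the AI compressor, and a Map stands in for on-disk archive storage):

```typescript
// Sketch of Endless Mode's working-memory / archive-memory split.

const archive = new Map<string, string>(); // long-term: full tool outputs

function compress(fullOutput: string): string {
  // Stand-in for the AI agent that writes a ~500-token observation.
  return fullOutput.slice(0, 100);
}

function onToolResult(callId: string, fullOutput: string): string {
  archive.set(callId, fullOutput); // archive the complete output to "disk"
  return compress(fullOutput);     // only this summary stays in context
}

function recallFullOutput(callId: string): string | undefined {
  return archive.get(callId);      // retrieved on demand when needed later
}
```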
Results:
- Token consumption drops from O(N²) to O(N)
- ~95% reduction in tokens within the context window
- Tool call capacity increases roughly 20x
Trade-offs
Endless Mode is not a free lunch:
| Benefit | Cost |
|---|---|
| Massively extends single-session capacity | Adds 60-90 seconds latency per tool call |
| Context never overflows | Compression may lose details |
| Great for extended development tasks | Still experimental, may have bugs |
Best suited for scenarios where you need to work continuously in a single session for a long time — large refactors, complex bug investigations, multi-module development.
Comparison with Native CLAUDE.md
A common question: Claude Code already has a CLAUDE.md memory system. Why would you need Claude-Mem?
The Fundamental Differences
| Dimension | CLAUDE.md (Native) | Claude-Mem |
|---|---|---|
| Memory method | Manually written and maintained | AI auto-captures and compresses |
| Content type | Static rules, preferences, conventions | Dynamic work history and decision records |
| Search | None (full file loaded into context) | Semantic search + keyword search |
| Token efficiency | Wastes more as file grows | Progressive loading, on-demand retrieval |
| Privacy control | `.local.md` excluded from repo | `<private>` tag for fine-grained control |
| Version control | Git-friendly | Independent database storage |
| Team collaboration | Can be committed and shared | Personal memory, does not sync across devices |
They Are Complementary, Not Competing
The best practice is to use both together:
- CLAUDE.md: Store project-level static knowledge — tech stack, coding conventions, architecture principles, common commands
- Claude-Mem: Automatically record dynamic work processes — bug investigation trails, architecture decision rationale, approaches you have tried
Think of it like a company’s employee handbook (CLAUDE.md) versus work journal (Claude-Mem) — one tells you the rules, the other records what you did.
Some developers have proposed a two-tier memory architecture:
- Tier 1 (CLAUDE.md, ~150 lines): Auto-generated concise briefing with the most important project knowledge
- Tier 2 (full database): Complete storage of all facts, decisions, and observations, queried on demand via MCP tools
80% of sessions only need Tier 1, with the remaining 20% fetching from Tier 2 as needed.
Installation and Configuration
Quick Install
Run these commands in Claude Code:
# Add from plugin marketplace
> /plugin marketplace add thedotmack/claude-mem
# Install the plugin
> /plugin install claude-mem
After restarting Claude Code, context from previous sessions will automatically appear in new sessions.
Key Configuration Options
After installation, the configuration file is at ~/.claude-mem/settings.json. Here are the most commonly adjusted parameters:
| Setting | Default | Description |
|---|---|---|
| `CLAUDE_MEM_PROVIDER` | claude | AI engine: claude / gemini / openrouter |
| `CLAUDE_MEM_MODEL` | claude-sonnet-4-5 | Specific model |
| `CLAUDE_MEM_CONTEXT_OBSERVATIONS` | 50 | Number of observations to inject (1-200) |
| `CLAUDE_MEM_WORKER_PORT` | 37777 | Worker service port |
| `CLAUDE_MEM_LOG_LEVEL` | INFO | Log level: DEBUG/INFO/WARN/ERROR/SILENT |
| `CLAUDE_MEM_SKIP_TOOLS` | Multiple system tools | Tools excluded from observation capture |
Web Dashboard
Visit http://localhost:37777 to view in real time:
- Current session’s observation stream
- Historical session list and summaries
- Memory database search
- Context injection parameter tuning (with live preview)
- Stable / Beta version switching
Privacy Protection
If you want certain content excluded from memory, use the privacy tag in your conversation with Claude:
<private>
This content contains sensitive information. Do not record it to memory.
API Key: sk-xxx...
</private>
You can also exclude specific tools from observation capture in the configuration.
FAQ and Troubleshooting
Does Claude-Mem slow down Claude Code?
Not in normal mode. All Hooks are asynchronous — they send HTTP requests to the Worker and return immediately without waiting for processing. However, Endless Mode does add noticeable latency (60-90 seconds per tool call).
Is the data secure?
All data is stored locally (~/.claude-mem/), nothing is uploaded to any cloud service. The Worker only listens on 127.0.0.1, so it is inaccessible from outside. AI compression uses your own Claude subscription (or a configured third-party API key).
How much disk space does it use?
The SQLite database typically stays in the tens of MB range. ChromaDB vector embeddings may be slightly larger, but negligible for modern drives. If it grows too large over time, you can manually clean up historical sessions.
Is it compatible with Git Worktrees?
Yes. Claude-Mem supports unified Git Worktree context, so multiple worktrees can share the same memory database.
What does the AGPL-3.0 license mean?
For personal use, there are no restrictions. But if you modify Claude-Mem’s code and deploy it as a network service (SaaS), you must open-source your modifications. Note that the ragtime/ directory uses the PolyForm Noncommercial License, which is limited to non-commercial use.
Conclusion
Claude-Mem addresses one of the most fundamental pain points with AI coding assistants — memory continuity. Its core value lies in:
- Transparent operation: No manual effort after installation; Hooks capture everything automatically
- Smart compression: AI-driven observation generation with 10:1 to 100:1 compression ratios
- Precise retrieval: Three-tier progressive search that fetches only what is needed
- Token efficiency: From 25K tokens in v3 down to 1.5K tokens in v7
Of course, it has limitations: dependency on a background Worker service, noticeable Endless Mode latency, and unsuitability for team-shared memory. For most individual developers, the CLAUDE.md + Claude-Mem combination is currently the most practical memory solution for Claude Code.
If you use Claude Code daily and are tired of repeating context every session, Claude-Mem is worth trying. After all, an AI assistant that actually remembers things is the one you will want to keep using.