
Claude-Mem Deep Dive: Persistent Memory Plugin for Claude Code

Claude-Mem gives Claude Code cross-session persistent memory via hooks, AI compression, hybrid search, and Endless Mode. Full architecture breakdown and comparison with native CLAUDE.md.

Bruce

Claude Code · Claude-Mem · AI Memory · Plugin Architecture · MCP

AI Guides


2026-02-02


Claude-Mem architecture diagram showing the persistent memory plugin for Claude Code

The most frustrating thing about Claude Code is not when it writes bad code. It is when every new session starts with a blank slate. The architecture discussion from yesterday, the bug you spent two hours tracking down, the coding conventions you agreed on — all gone. You end up repeating context over and over, like introducing yourself every morning to someone with amnesia.

Claude-Mem was built to fix this. It is a Claude Code plugin that automatically captures every interaction, compresses it into structured memory using AI, and intelligently injects relevant context into future sessions. In short, it gives Claude Code a long-term memory system.

By the end of this article, you will understand:

  • How Claude-Mem’s core architecture is designed
  • How its Hook system captures context transparently
  • How three-tier progressive search saves 10x on tokens
  • How Endless Mode breaks through context window limits
  • How it compares to the native CLAUDE.md memory system

Why You Need Claude-Mem

The Amnesia Problem

Claude Code has a 200K token context window (Claude Sonnet 4). That sounds large, but in practice each tool call consumes 1,000 to 10,000 tokens. A moderately complex development task with 50 tool calls can fill the entire window. More critically, once a session ends, all context vanishes.

This means:

  • Architecture decisions discussed yesterday need to be explained again today
  • A bug you spent hours debugging requires a fresh investigation in a new session
  • Team coding conventions must be manually restated every time

Limitations of Existing Solutions

Claude Code natively provides the CLAUDE.md memory mechanism — you place a Markdown file in your project root, and Claude reads it automatically at startup. This works, but has clear limitations:

| Pain Point | Description |
|---|---|
| Manual maintenance | You decide what to remember and what to forget |
| Static content | Does not automatically record discoveries and decisions |
| No search | Finding specific information becomes harder as the file grows |
| Context overhead | Larger files leave fewer tokens for actual work |

Claude-Mem takes a fundamentally different approach: let AI decide what to remember, how to compress it, and when to inject it.

Core Architecture: Five Layers Working Together

Think of Claude-Mem’s architecture as a smart archive. Hooks are the archivists (they collect information), the Worker is the librarian (it classifies and compresses), databases are the filing cabinets (they store everything), the search system is the index desk (it retrieves), and SessionStart injection is your daily briefing.

Architecture Overview

┌─────────────────────────────────────────────┐
│                Claude Code IDE               │
│  ┌─────────┐ ┌──────────┐ ┌──────────────┐  │
│  │Session  │ │UserPrompt│ │ PostToolUse  │  │
│  │Start    │ │Submit    │ │              │  │
│  │Hook     │ │Hook      │ │ Hook         │  │
│  └────┬────┘ └────┬─────┘ └──────┬───────┘  │
│       │           │              │           │
└───────┼───────────┼──────────────┼───────────┘
        │           │              │
        ▼           ▼              ▼
┌─────────────────────────────────────────────┐
│          Worker Service (Port 37777)         │
│  ┌─────────────┐  ┌──────────────────────┐  │
│  │ Context     │  │ Session Manager      │  │
│  │ Builder     │  │ (AI Agent Generator) │  │
│  └─────────────┘  └──────────────────────┘  │
│  ┌─────────────┐  ┌──────────────────────┐  │
│  │ Search      │  │ SSE Broadcaster      │  │
│  │ Manager     │  │ (Real-time Events)   │  │
│  └─────────────┘  └──────────────────────┘  │
└────────┬──────────────────┬─────────────────┘
         │                  │
         ▼                  ▼
┌────────────────┐  ┌────────────────┐
│  SQLite + FTS5 │  │ ChromaDB       │
│  (Structured)  │  │ (Embeddings)   │
└────────────────┘  └────────────────┘

The system consists of six Hook scripts, an HTTP Worker service, dual-database storage, and four MCP tools. Let us break down each layer.

The Hook System: Transparent Capture

Claude-Mem leverages Claude Code’s Hook lifecycle to inject logic at five critical moments:

| Hook | Trigger | Purpose |
|---|---|---|
| Smart Install | Before session start | Checks dependencies (Bun, uv, etc.), installs what is missing |
| SessionStart | Session begins | Retrieves relevant context from the last 10 sessions and injects it |
| UserPromptSubmit | User sends a message | Records user input and session metadata |
| PostToolUse | After each tool call | Captures observations from tool operations (file reads, code changes, etc.) |
| Stop/SessionEnd | Session ends | Uses AI to generate a semantic summary for the next session |

A key design choice: all Hooks are lightweight HTTP clients (each roughly 75 lines of code after the v7.0 refactor). They only send requests to the Worker Service and perform no heavy computation. This ensures Hooks never slow down Claude Code’s response time.

// PostToolUse Hook core logic (simplified)
// After capturing a tool call, immediately send it async to Worker
const response = await fetch(`http://127.0.0.1:37777/api/sessions/observations`, {
  method: 'POST',
  body: JSON.stringify({
    session_id: currentSessionId,
    tool_name: toolResult.name,
    tool_input: toolResult.input,
    tool_output: toolResult.output
  })
});
// Returns 202 Accepted, does not wait for processing

Worker Service: The Central Orchestrator

The Worker Service is the brain of the entire system, running on Bun and listening on 127.0.0.1:37777 by default. It uses a two-phase startup:

  • Phase 1 (fast): HTTP server binds the port immediately and returns control to the Hook (prevents timeout)
  • Phase 2 (background): Initializes databases, crash recovery, SearchManager, and MCP connections
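The two-phase pattern can be sketched as follows. The function and state names here are illustrative, not Claude-Mem's actual internals — the point is that the port binds before any heavy work starts:

```typescript
// Sketch of two-phase startup: bind fast, initialize in the background.
type WorkerState = { listening: boolean; ready: boolean };

function startWorker(): WorkerState {
  const state: WorkerState = { listening: false, ready: false };

  // Phase 1 (fast): bind the port and return control immediately,
  // so the calling Hook never hits its timeout.
  state.listening = true; // stands in for e.g. Bun.serve({ port: 37777, ... })

  // Phase 2 (background): heavy initialization runs after return —
  // open databases, run crash recovery, build search indexes, connect MCP.
  queueMicrotask(() => {
    state.ready = true;
  });

  return state;
}

const worker = startWorker();
console.log(`listening=${worker.listening}, ready=${worker.ready}`);
// → listening=true, ready=false (ready flips once background init runs)
```

Requests that arrive before Phase 2 completes can be queued or answered with a "warming up" status, which is why the Hook side never blocks.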

Core responsibilities include:

  1. Session management: Tracks active development sessions with dual ID mapping (IDE session ID to memory agent session ID)
  2. AI agent dispatch: Generates observations and summaries in the background
  3. Search orchestration: Coordinates hybrid search across SQLite FTS5 and ChromaDB
  4. Real-time broadcasting: Pushes events to the Web UI via SSE (Server-Sent Events)
  5. Crash recovery: Scans the pending_messages table on startup and retries incomplete tasks
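The crash-recovery idea in point 5 fits in a few lines: persist work before processing it, then replay anything left unprocessed at startup. The `pending_messages` table name comes from the article; the row shape and handler below are hypothetical:

```typescript
// Illustrative sketch of startup crash recovery over pending_messages.
interface PendingMessage { id: number; payload: string; processed: boolean }

function recoverPending(
  rows: PendingMessage[],
  handle: (payload: string) => void,
): number {
  let retried = 0;
  for (const row of rows) {
    if (!row.processed) {
      handle(row.payload);  // re-run the interrupted task
      row.processed = true; // then mark it done
      retried++;
    }
  }
  return retried;
}

// A crash left one message unprocessed:
const queue: PendingMessage[] = [
  { id: 1, payload: "observation A", processed: true },
  { id: 2, payload: "observation B", processed: false },
];
const n = recoverPending(queue, (p) => console.log("retrying:", p));
console.log(`${n} task(s) replayed`); // → 1 task(s) replayed
```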

Three-Layer Storage Architecture

Claude-Mem persists data across three complementary storage layers:

SQLite (structured data):

  • Database location: ~/.claude-mem/claude-mem.db
  • Core tables: sdk_sessions, observations, session_summaries, user_prompts, pending_messages
  • Uses FTS5 virtual tables for keyword full-text search
  • The pending_messages table ensures crash recovery — any unprocessed work is persisted

ChromaDB (vector embeddings):

  • Location: ~/.claude-mem/vector-db/
  • Each observation field (title, narrative, facts) is stored as an independent vector
  • ChromaSync handles asynchronous syncing with smart backfill strategies

File system:

  • ~/.claude-mem/settings.json: Configuration file
  • ~/.claude-mem/logs/: Runtime logs
  • Optional CLAUDE.md activity timeline file

AI Compression Engine: 10:1 to 100:1 Ratios

What Is an Observation?

Every time you use Claude Code to read a file, write code, or run a command, the PostToolUse Hook captures the raw tool input and output. But raw data is too large — a single file read can be thousands of tokens.

Claude-Mem’s AI agent compresses this raw data into structured observations of roughly 500 tokens, containing:

{
  "id": "obs_20260203_001",
  "type": "discovery",
  "title": "Found race condition in API auth middleware",
  "narrative": "While investigating intermittent 401 errors on /api/users...",
  "facts": [
    "Auth middleware does not lock during token expiry",
    "Concurrent requests can trigger simultaneous token refreshes",
    "Fix: use mutex to ensure single refresh"
  ],
  "concepts": ["problem-solution", "gotcha"],
  "session_id": "sess_abc123",
  "timestamp": "2026-02-03T00:00:00Z"
}
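Transcribed into TypeScript, the observation above might be typed like this — a reader's sketch derived from the JSON fields, not an API exported by Claude-Mem:

```typescript
// Observation shape, inferred from the JSON example above. The union
// of type values is an assumption based on the types the article
// mentions elsewhere (bugfix, feature, refactor, ...).
type ObservationType = "discovery" | "bugfix" | "feature" | "refactor";

interface Observation {
  id: string;
  type: ObservationType;
  title: string;
  narrative: string;
  facts: string[];
  concepts: string[];
  session_id: string;
  timestamp: string; // ISO 8601
}

const obs: Observation = {
  id: "obs_20260203_001",
  type: "discovery",
  title: "Found race condition in API auth middleware",
  narrative: "While investigating intermittent 401 errors on /api/users...",
  facts: ["Fix: use mutex to ensure single refresh"],
  concepts: ["problem-solution", "gotcha"],
  session_id: "sess_abc123",
  timestamp: "2026-02-03T00:00:00Z",
};
console.log(obs.type); // → discovery
```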

Think of it as compressing a book into high-quality reading notes — preserving core insights while discarding redundant details.

Three AI Engine Options

Claude-Mem supports three AI providers with hot-swapping at runtime (shared conversation history, no context loss):

| Engine | Advantage | Best For |
|---|---|---|
| SDKAgent (default) | Uses Claude Agent SDK, highest observation quality | Best results, included in Claude Code subscription |
| GeminiAgent | Free tier with 1,500 requests/day, rate limiting | Budget-conscious, lightweight usage |
| OpenRouterAgent | 100+ models available, some free | Flexible model selection, experimentation |

All engines automatically fall back to SDKAgent when encountering errors (429/5xx).
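That fallback behavior can be modeled as a small wrapper: try the configured engine, and retry with the default only on rate-limit or server errors. The names and error shape below are assumptions, not Claude-Mem's real code:

```typescript
// Sketch of provider fallback: retry on 429/5xx, rethrow everything else.
type Engine = (prompt: string) => string;

function withFallback(primary: Engine, fallback: Engine): Engine {
  return (prompt) => {
    try {
      return primary(prompt);
    } catch (err) {
      const status = (err as { status?: number }).status ?? 0;
      // Retry-worthy failures: 429 (rate limit) and any 5xx
      if (status === 429 || (status >= 500 && status < 600)) {
        return fallback(prompt);
      }
      throw err; // a 4xx other than 429 is a real error — surface it
    }
  };
}

// Hypothetical engines: a rate-limited Gemini and the SDKAgent default.
const gemini: Engine = () => {
  throw Object.assign(new Error("rate limited"), { status: 429 });
};
const sdkAgent: Engine = (p) => `observation for: ${p}`;

const compress = withFallback(gemini, sdkAgent);
console.log(compress("tool output")); // → observation for: tool output
```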

Three-Tier Progressive Search: 10x Token Savings

Traditional RAG (Retrieval-Augmented Generation) typically dumps everything retrieved into the context. In a token-scarce environment, this is extremely wasteful. Claude-Mem implements a progressive disclosure search pattern:

The Three-Tier Workflow

Tier 1: Index Search (~50-100 tokens per entry)
    │  Returns observation ID, title, date, type
    │  Quick filtering to find entries of interest
Tier 2: Timeline Context
    │  Shows timeline around an anchor observation
    │  Understand causal relationships and decision chains
Tier 3: Full Details (~500-1000 tokens per entry)
    │  Retrieves full text only for selected observations
    │  Final confirmation of needed information
  Result: ~10x token savings compared to traditional RAG

The elegance of this design is filter first, fetch later. It is like visiting a library: you would not carry every book off the shelf before choosing. You check the catalog, find the chapter, then turn to the specific page.
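A minimal sketch of the filter-first, fetch-later pattern, using hypothetical data shapes (the real MCP tools return richer records, and Tier 2's timeline step is omitted here):

```typescript
// Toy model of progressive disclosure: cheap index pass, then full
// fetch only for survivors of the filter.
interface IndexEntry { id: string; title: string; date: string }
interface FullObservation extends IndexEntry { narrative: string }

const store: FullObservation[] = [
  { id: "obs_1", title: "auth race condition fix", date: "2026-02-03",
    narrative: "Long write-up, ~500-1000 tokens in practice..." },
  { id: "obs_2", title: "CSS layout tweak", date: "2026-02-02",
    narrative: "Unrelated detail we never need to load..." },
];

// Tier 1: compact index (~50-100 tokens per entry)
function searchIndex(query: string): IndexEntry[] {
  return store
    .filter(o => o.title.includes(query))
    .map(({ id, title, date }) => ({ id, title, date }));
}

// Tier 3: pay full cost only for the entries that survived filtering
function getObservations(ids: string[]): FullObservation[] {
  return store.filter(o => ids.includes(o.id));
}

const hits = searchIndex("auth");                     // filter first
const details = getObservations(hits.map(h => h.id)); // fetch later
console.log(details.length); // → 1 (one full record loaded, not the whole store)
```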

Hybrid Search Strategy

SearchManager invokes three search strategies simultaneously and fuses the results:

  • ChromaStrategy: Vector similarity search via ChromaDB, excels at semantic matching (“that auth bug I fixed yesterday”)
  • SQLiteStrategy: Keyword search via FTS5, excels at exact matching (“401 error”)
  • HybridStrategy: Relevance fusion ranking, combining the strengths of both
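One simple way to fuse two ranked lists is a weighted score merge. The snippet below is a toy model of the idea, not SearchManager's actual ranking algorithm; the weight is an arbitrary illustrative choice:

```typescript
// Toy relevance fusion: combine per-strategy scores, rank by the blend.
type Scored = Map<string, number>; // observation id → score in [0, 1]

function fuse(vector: Scored, keyword: Scored, vectorWeight = 0.6): string[] {
  const ids = new Set([...vector.keys(), ...keyword.keys()]);
  return [...ids]
    .map(id => ({
      id,
      score: vectorWeight * (vector.get(id) ?? 0) +
             (1 - vectorWeight) * (keyword.get(id) ?? 0),
    }))
    .sort((a, b) => b.score - a.score)
    .map(r => r.id);
}

const chroma = new Map([["obs_1", 0.9], ["obs_3", 0.4]]); // semantic hits
const fts5   = new Map([["obs_1", 0.7], ["obs_2", 0.8]]); // keyword hits
console.log(fuse(chroma, fts5)); // obs_1 ranks first — found by both strategies
```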

MCP Tool Interface

Claude-Mem exposes search capabilities through four MCP tools:

| Tool | Function | Token Cost |
|---|---|---|
| `search` | Returns compact index | ~50-100/entry |
| `timeline` | Timeline context | Medium |
| `get_observations` | Full observation details | ~500-1000/entry |
| `__IMPORTANT` | Workflow documentation | One-time |

A major optimization in v7.0 was consolidating 9 MCP tools (~2,500 tokens) into 1 Skill (~250 tokens preamble + on-demand instructions), dramatically reducing token overhead during tool registration.

Context Injection: Your Session Briefing

The Injection Flow

When you start a new Claude Code session, the SessionStart Hook triggers context injection:

  1. Sends a GET /api/context/inject request to the Worker
  2. ContextBuilder retrieves up to 50 relevant observations from the last 10 sessions (both configurable)
  3. Ranks by relevance using hybrid search
  4. Formats as Markdown and injects into the new session
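Step 4 — rendering observations as a Markdown briefing — might look roughly like this. The exact format Claude-Mem emits is an assumption, though the field names follow the observation example earlier:

```typescript
// Sketch of the final injection step: ranked observations → Markdown.
interface Obs { title: string; facts: string[]; timestamp: string }

function formatBriefing(observations: Obs[]): string {
  const lines: string[] = ["# Context from previous sessions", ""];
  for (const o of observations) {
    lines.push(`## ${o.title} (${o.timestamp.slice(0, 10)})`);
    for (const fact of o.facts) lines.push(`- ${fact}`);
    lines.push("");
  }
  return lines.join("\n");
}

const briefing = formatBriefing([{
  title: "Found race condition in API auth middleware",
  facts: ["Fix: use mutex to ensure single refresh"],
  timestamp: "2026-02-03T00:00:00Z",
}]);
console.log(briefing);
```

The new session then receives this digest as ordinary context, at a fraction of the cost of replaying the raw transcripts.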

Fine-Grained Configuration

You can tune injection parameters in the Web UI at http://localhost:37777:

{
  "CLAUDE_MEM_CONTEXT_OBSERVATIONS": 50,
  "CLAUDE_MEM_CONTEXT_SHOW_LAST_SUMMARY": false,
  "CLAUDE_MEM_CONTEXT_SHOW_LAST_MESSAGE": false,
  "CLAUDE_MEM_SKIP_TOOLS": [
    "ListMcpResourcesTool",
    "SlashCommand",
    "Skill"
  ]
}

Filtering by type (bugfix, feature, refactor, etc.) and concept (how-it-works, problem-solution, gotcha, etc.) is also supported, giving you precise control over what information enters the new session.

Token Economics

From v3 to v7, context injection token consumption went through a massive improvement:

| Version | Context Injection | Notes |
|---|---|---|
| v3 | ~25,000 tokens | Full dump, brute force |
| v7 | ~1,500 tokens | Compression + progressive loading |

A 94% reduction in tokens, meaning far more of the context window is available for actual coding work.

Endless Mode: Breaking the Context Window Barrier

The Problem: O(N²) Token Consumption

Standard Claude Code sessions have quadratic token growth. Each tool call not only adds new content but also retains all previous tool outputs in the context. After roughly 50 tool calls, the 200K context window is full.
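A back-of-envelope model makes the quadratic growth concrete. The per-output average below is an illustrative assumption within the article's 1,000-10,000 token range:

```typescript
// Illustrative token model: standard mode retains every prior tool
// output in context; Endless Mode replaces each with a ~500-token
// observation.
const PER_CALL = 4_000; // assumed avg tokens per raw tool output
const OBS = 500;        // tokens per compressed observation

// Standard: the context at call n holds all n outputs, and total
// processed tokens across the session are the sum 1+2+...+N → O(N²).
const contextAt = (n: number): number => n * PER_CALL;
const cumulative = (n: number): number => PER_CALL * (n * (n + 1)) / 2;

// Endless Mode: compressed observations → linear growth.
const endlessContextAt = (n: number): number => n * OBS;

console.log(contextAt(50));        // 200000 — the window is full at ~50 calls
console.log(cumulative(50));       // 5100000 — quadratic total
console.log(endlessContextAt(50)); // 25000 — plenty of headroom left
```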

The Solution: Bionic Memory Architecture

Endless Mode (currently in Beta) implements a two-tier memory system inspired by human cognition:

┌─────────────────────────────┐
│ Working Memory              │   In context window
│ Compressed observations,    │
│ ~500 tokens each            │
│ (like short-term memory)    │
└──────────────┬──────────────┘
               │ Compression
               ▼
┌─────────────────────────────┐
│ Archive Memory              │   On disk
│ Full tool outputs,          │
│ retrieved on demand         │
│ (like long-term memory)     │
└─────────────────────────────┘

How it works: The PostToolUse Hook blocks after each tool call (up to 110 seconds), allowing the AI agent to compress the full tool output into a ~500 token observation. The compressed observation then replaces the original output in the context. The full output is archived to disk.

Results:

  • Token consumption drops from O(N²) to O(N)
  • ~95% reduction in tokens within the context window
  • Tool call capacity increases roughly 20x

Trade-offs

Endless Mode is not a free lunch:

| Benefit | Cost |
|---|---|
| Massively extends single-session capacity | Adds 60-90 seconds latency per tool call |
| Context never overflows | Compression may lose details |
| Great for extended development tasks | Still experimental, may have bugs |

Best suited for scenarios where you need to work continuously in a single session for a long time — large refactors, complex bug investigations, multi-module development.

Comparison with Native CLAUDE.md

A common question: Claude Code already has a CLAUDE.md memory system. Why would you need Claude-Mem?

The Fundamental Differences

| Dimension | CLAUDE.md (Native) | Claude-Mem |
|---|---|---|
| Memory method | Manually written and maintained | AI auto-captures and compresses |
| Content type | Static rules, preferences, conventions | Dynamic work history and decision records |
| Search | None (full file loaded into context) | Semantic search + keyword search |
| Token efficiency | Wastes more as file grows | Progressive loading, on-demand retrieval |
| Privacy control | `.local.md` excluded from repo | `<private>` tag for fine-grained control |
| Version control | Git-friendly | Independent database storage |
| Team collaboration | Can be committed and shared | Personal memory, does not sync across devices |

They Are Complementary, Not Competing

The best practice is to use both together:

  • CLAUDE.md: Store project-level static knowledge — tech stack, coding conventions, architecture principles, common commands
  • Claude-Mem: Automatically record dynamic work processes — bug investigation trails, architecture decision rationale, approaches you have tried

Think of it like a company’s employee handbook (CLAUDE.md) versus work journal (Claude-Mem) — one tells you the rules, the other records what you did.

Some developers have proposed a two-tier memory architecture:

  • Tier 1 (CLAUDE.md, ~150 lines): Auto-generated concise briefing with the most important project knowledge
  • Tier 2 (full database): Complete storage of all facts, decisions, and observations, queried on demand via MCP tools

80% of sessions only need Tier 1, with the remaining 20% fetching from Tier 2 as needed.

Installation and Configuration

Quick Install

Run these commands in Claude Code:

# Add from plugin marketplace
> /plugin marketplace add thedotmack/claude-mem

# Install the plugin
> /plugin install claude-mem

After restarting Claude Code, context from previous sessions will automatically appear in new sessions.

Key Configuration Options

After installation, the configuration file is at ~/.claude-mem/settings.json. Here are the most commonly adjusted parameters:

| Setting | Default | Description |
|---|---|---|
| `CLAUDE_MEM_PROVIDER` | `claude` | AI engine: claude / gemini / openrouter |
| `CLAUDE_MEM_MODEL` | `claude-sonnet-4-5` | Specific model |
| `CLAUDE_MEM_CONTEXT_OBSERVATIONS` | `50` | Number of observations to inject (1-200) |
| `CLAUDE_MEM_WORKER_PORT` | `37777` | Worker service port |
| `CLAUDE_MEM_LOG_LEVEL` | `INFO` | Log level: DEBUG/INFO/WARN/ERROR/SILENT |
| `CLAUDE_MEM_SKIP_TOOLS` | Multiple system tools | Tools excluded from observation capture |

Web Dashboard

Visit http://localhost:37777 to view in real time:

  • Current session’s observation stream
  • Historical session list and summaries
  • Memory database search
  • Context injection parameter tuning (with live preview)
  • Stable / Beta version switching

Privacy Protection

If you want certain content excluded from memory, use the privacy tag in your conversation with Claude:

<private>
This content contains sensitive information. Do not record it to memory.
API Key: sk-xxx...
</private>
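Conceptually, this works like a filter applied before anything reaches storage. The regex sketch below is a reader's approximation of the behavior, not Claude-Mem's actual implementation:

```typescript
// Model of the <private> tag: strip tagged spans before memory capture.
function stripPrivate(text: string): string {
  return text.replace(/<private>[\s\S]*?<\/private>/g, "[redacted]");
}

const prompt = `Deploy notes here.
<private>
API Key: sk-xxx...
</private>
Remaining context.`;

console.log(stripPrivate(prompt));
// The private span is replaced by "[redacted]"; the key never reaches storage.
```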

You can also exclude specific tools from observation capture in the configuration.

FAQ and Troubleshooting

Does Claude-Mem slow down Claude Code?

Not in normal mode. All Hooks are asynchronous — they send HTTP requests to the Worker and return immediately without waiting for processing. However, Endless Mode does add noticeable latency (60-90 seconds per tool call).

Is the data secure?

All data is stored locally (~/.claude-mem/), nothing is uploaded to any cloud service. The Worker only listens on 127.0.0.1, so it is inaccessible from outside. AI compression uses your own Claude subscription (or a configured third-party API key).

How much disk space does it use?

The SQLite database typically stays in the tens of MB range. ChromaDB vector embeddings may be slightly larger, but negligible for modern drives. If it grows too large over time, you can manually clean up historical sessions.

Is it compatible with Git Worktrees?

Yes. Claude-Mem supports unified Git Worktree context, so multiple worktrees can share the same memory database.

What does the AGPL-3.0 license mean?

For personal use, there are no restrictions. But if you modify Claude-Mem’s code and deploy it as a network service (SaaS), you must open-source your modifications. Note that the ragtime/ directory uses the PolyForm Noncommercial License, which is limited to non-commercial use.

Conclusion

Claude-Mem addresses one of the most fundamental pain points with AI coding assistants — memory continuity. Its core value lies in:

  1. Transparent operation: No manual effort after installation; Hooks capture everything automatically
  2. Smart compression: AI-driven observation generation with 10:1 to 100:1 compression ratios
  3. Precise retrieval: Three-tier progressive search that fetches only what is needed
  4. Token efficiency: From 25K tokens in v3 down to 1.5K tokens in v7

Of course, it has limitations: dependency on a background Worker service, noticeable Endless Mode latency, and unsuitability for team-shared memory. For most individual developers, the CLAUDE.md + Claude-Mem combination is currently the most practical memory solution for Claude Code.

If you use Claude Code daily and are tired of repeating context every session, Claude-Mem is worth trying. After all, an AI assistant that actually remembers things is the one you will want to keep using.
