headroom: a context compression layer that reduces LLM token usage for AI agents via content-aware compressors and a local proxy

What it solves

Headroom reduces the number of tokens sent to and received from LLMs, lowering cost and latency for AI agents. It targets the waste in agentic workloads, such as repetitive tool outputs, verbose logs, RAG chunks, and redundant model preambles, without sacrificing the accuracy of the model's answers.

How it works

Headroom acts as a local compression layer that sits between an AI agent and the LLM provider. It uses a ContentRouter to detect the type of data and apply the compressor suited to that type:

SmartCrusher: For JSON data.
CodeCompressor: AST-aware compression for multiple programming languages.
Kompress-v2-base: A specialized Hugging Face model for prose and other free text.
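
To make the routing idea concrete, here is a minimal sketch of content-type detection in Python. It is not Headroom's ContentRouter: the function name detect_content_type and the heuristics (try JSON first, then try parsing as Python source, otherwise treat the input as prose) are assumptions chosen for illustration, and a real router would need to cover many more languages and formats.

```python
import ast
import json


def detect_content_type(text: str) -> str:
    """Hypothetical router: classify text as 'json', 'code', or 'prose'."""
    # JSON check: only structured payloads (objects or arrays) count.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, (dict, list)):
            return "json"
    except json.JSONDecodeError:
        pass

    # Code check (Python only in this sketch): the text must parse and
    # contain at least one statement that is not a bare expression,
    # so a single word such as "hello" is not mistaken for code.
    try:
        tree = ast.parse(text)
        if any(not isinstance(node, ast.Expr) for node in tree.body):
            return "code"
    except (SyntaxError, ValueError):
        pass

    # Everything else falls through to the prose compressor.
    return "prose"


if __name__ == "__main__":
    samples = [
        '{"status": "ok", "items": [1, 2, 3]}',
        "def add(a, b):\n    return a + b\n",
        "The build failed twice today.",
    ]
    for sample in samples:
        print(detect_content_type(sample))
```

Checking the cheapest and strictest format first keeps the fallthrough order predictable: a payload that is valid JSON never reaches the code or prose branches.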

It features CCR (Reversible Compression), which caches original content locally so the LLM can retrieve the full version via a tool call if needed. It also includes a CacheAligner to ensure prompt prefixes remain stable for provider KV caches. To reduce output costs, it uses verbosity steering and effort routing to trim unnecessary model responses.
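
The reversible part can be pictured as a local store keyed by a content hash. The sketch below is an assumption-laden illustration, not Headroom's implementation: the class ReversibleStore, its max_chars parameter, and the truncation-plus-marker scheme are all hypothetical. It shows only the general pattern of sending a shortened version to the model while keeping the original retrievable by key, which is the role a retrieval tool call would play.

```python
import hashlib
from typing import Optional


class ReversibleStore:
    """Hypothetical local cache: shorten long text, keep the original by key."""

    def __init__(self, max_chars: int = 200) -> None:
        self.max_chars = max_chars
        self._originals: dict[str, str] = {}

    def compress(self, text: str) -> tuple[str, Optional[str]]:
        """Return (text_for_model, key). Key is None if nothing was cut."""
        if len(text) <= self.max_chars:
            return text, None
        # A short content hash serves as the retrieval key.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text
        preview = text[: self.max_chars]
        return f"{preview}\n[truncated, full content under key {key}]", key

    def retrieve(self, key: str) -> str:
        """Return the original text; this is what a tool call would invoke."""
        return self._originals[key]


if __name__ == "__main__":
    store = ReversibleStore(max_chars=40)
    log = "ERROR: connection timed out, retrying\n" * 100
    short, key = store.compress(log)
    print(len(log), len(short))
    if key is not None:
        print(store.retrieve(key) == log)
```

Because the key is derived from the content, compressing the same text twice yields the same key, so repeated tool outputs map to a single cached original.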

Who it’s for

Developers running AI coding agents (like Claude Code, Cursor, Aider, or Cline) daily.
Teams that use several AI agents and want a shared memory store.
Application developers who want to integrate token compression into their Python or TypeScript apps via a library or proxy.

Highlights

Multiple Integration Modes: Available as a Python/TypeScript library, a drop-in proxy, or an MCP server.
Agent Wrapping: One-command wrapping for popular agents (e.g., headroom wrap claude).
Output Reduction: Trims model preambles and reduces "thinking" effort on routine steps to save on output tokens.
Cross-Agent Memory: Shared, auto-deduplicated memory across different LLM providers.
Failure Mining: The headroom learn command analyzes failed sessions to write corrections to agent configuration files.
