Two developers watched their AI coding bill spike overnight with no change in what they were building. When they dug into where the money was going, they found the model wasn’t thinking harder. It was reading files it never needed. That single discovery became Code Context Engine, an open-source local index that cut their retrieval tokens by 94%.
If you use Claude Code, Cursor, Copilot, or Codex daily, this is the same problem sitting in your own bill. Here’s what’s actually happening under the hood, why fixing it looks a lot like RAG for your own codebase, and how to set it up in about five minutes.
Key Takeaways
- A FastAPI benchmark showed Code Context Engine cutting retrieval tokens by 94%, from 83,681 to 4,927 tokens per query, while still finding the right code 90% of the time (Code Context Engine GitHub, 2026).
- Roughly 90% of an AI coding session’s cost comes from input tokens, not output, so trimming what gets sent to the model matters far more than shortening its replies.
- Install takes one command, uvx --from "code-context-engine[local]" cce init, which auto-detects Claude Code, Cursor, Copilot, Codex, and other MCP-compatible editors.
Why Do AI Coding Tools Burn So Many Tokens?
Most AI coding assistants lean the same way by default: they hand large amounts of your code to the model as context, on the theory that more context means better answers. In practice, most of that context is dead weight the model never uses.
Raj Sakthivel, who built Code Context Engine with his collaborator Fos, measured this directly on their own project. A typical query was sending 45,000 tokens of context, but only about 5,000 tokens were actually relevant to the question being asked (We Cut 94% of AI Coding Tokens With a Local Code Index, 2026). The other 40,000 tokens were paid for on every single call and never used. Sound familiar?
“It’s like ordering a pizza and paying for nine extra pizzas you don’t eat, every time,” Raj Sakthivel said, describing what it felt like watching context tokens pile up on every query.
This is the same context-window problem that retrieval-augmented generation was built to solve for chatbots. Except here, the “documents” are your source files, and the retriever needs to understand functions and classes instead of paragraphs. If you’ve ever hit a wall with context limits in Claude, this is the same budget problem playing out on every single query your coding assistant runs, just automated and invisible.
What Actually Costs Money in an AI Coding Session?
The wake-up call came after three fixes that barely moved the bill, and the reasons why are the most useful part of the story.
First, they shortened their prompts. It sounded logical: smaller prompt, fewer tokens. But the model had already received the full 45,000-token context dump before it even read the prompt. The cost was locked in before the instruction arrived.
Second, they tuned model settings like max tokens and temperature. Same result: those settings shape the output, and the money was overwhelmingly in the input.
Third, they compressed the output, telling the model to write shorter answers. This one actually worked: output shrank by 75%. But output was only about 10% of total cost to begin with, so a 75% cut against a small slice barely moved the total bill.
That ratio is the whole story: roughly 90% of an AI coding session’s cost is input, meaning the files, search results, and context sent to the model. Only 10% is output, the code the model writes back. Cutting output by 75% saves about 8% overall. Cutting input by 94% saves about 61% overall. Same-looking percentages, very different bills.
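The arithmetic behind that comparison is short enough to check by hand. Here is a minimal sketch, assuming the 90/10 input-to-output split quoted above; the function name is made up for illustration. It computes the ceiling on what a cut can save: the slice's share of the bill times the size of the cut. The 61% quoted for input sits below the ceiling this prints, which would fit a cut that applies to only part of the input. That reading is an assumption, since the source doesn't break the input down further.

```python
def overall_saving(share_of_bill: float, cut: float) -> float:
    """Fraction of the total bill saved when `cut` is applied to one slice of it."""
    return share_of_bill * cut


# Output is about 10% of the bill; compressing it by 75% touches only that slice.
output_saving = overall_saving(share_of_bill=0.10, cut=0.75)

# Ceiling for input: assumes the 94% cut applies to every input token,
# which the source does not claim.
input_ceiling = overall_saving(share_of_bill=0.90, cut=0.94)

print(f"output cut saves:   {output_saving:.1%} of the bill")
print(f"input cut, ceiling: {input_ceiling:.1%} of the bill")
```

Either way, the input slice is where the savings are: even a partial cut there outweighs the best possible cut to output.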
What Is Code Context Engine?
Code Context Engine (CCE) is an open-source local index that sits between your codebase and your AI coding assistant, connecting through the Model Context Protocol (Code Context Engine GitHub, 2026). Instead of the assistant reading whole files, it searches the index and gets back only the small piece of code it actually needs.
CCE runs entirely on your machine. Indexing, search, and embeddings all stay local, using sqlite-vec for embedding storage rather than shipping code to a hosted vector database like Weaviate. Built-in secret detection and PII scrubbing keep anything sensitive from leaving your environment. It works with Claude Code, VS Code/Copilot, Cursor, Gemini CLI, OpenAI Codex, OpenCode, and Tabnine.
Under the hood, the pipeline runs five steps:
- Chunking. Tree-sitter parses your code into an abstract syntax tree across 11 languages, so chunks are whole functions, classes, and methods, not arbitrary line splits.
- Hybrid search. Every query runs a semantic (vector) search and a keyword (BM25) search at the same time, then merges the results. This is where most of the savings come from.
- Compression. Results can be shrunk further, down to a function’s name and description, collapsing a 50-line function into five lines when the full body isn’t needed.
- Dependency graph. The engine tracks which functions call which, so finding one relevant chunk surfaces everything connected to it.
- Confidence scoring. Every result gets scored; anything below the threshold is dropped rather than sent to the model as noise.
This is also where CCE’s approach to AI agents using tools safely matters: an assistant only acts on context it can trust, and low-confidence results never make it into the prompt at all.
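To make the chunking and dependency-graph steps concrete, here is a minimal sketch that uses Python's built-in ast module as a stand-in for Tree-sitter. This is not CCE's code: the function names and the sample source are invented for illustration, and it handles Python only. It shows the idea of cutting on syntax boundaries (whole functions and classes) and recording which functions call which.

```python
import ast

# Hypothetical sample file, used only for this illustration.
SOURCE = '''
def hash_password(raw):
    return raw[::-1]

def authenticate(user, raw):
    return user.password == hash_password(raw)

class Session:
    def start(self, user, raw):
        return authenticate(user, raw)
'''

DEFINITIONS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def chunk_source(source: str) -> dict[str, str]:
    """Map each function, method, or class name to its exact source text."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, DEFINITIONS)
    }


def call_graph(source: str) -> dict[str, set[str]]:
    """For each function, the plain names it calls (a crude dependency graph)."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            graph[node.name] = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return graph


chunks = chunk_source(SOURCE)
graph = call_graph(SOURCE)
print(sorted(chunks))          # every chunk is a whole definition
print(graph["authenticate"])   # names that authenticate() depends on
```

A retriever that finds authenticate can follow the graph to hash_password and return both chunks, instead of sending the whole file.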
Why Do Two Search Methods Beat One?
Hybrid search is the core mechanism, and it’s a direct answer to a weakness in each method alone. Semantic search is good at finding related ideas but can miss exact names. Searching for an “authenticate user” function might surface a similarly-worded but unrelated function instead. Keyword search is good at exact names but misses related phrasing: a search for “login flow” can miss code labeled “sign in.”
Run separately, each search method misses roughly one in four relevant results. Combined, they miss closer to one in ten, because each one covers the other’s blind spot (We Cut 94% of AI Coding Tokens With a Local Code Index, 2026).
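The source doesn't say how CCE merges the two result lists. One common way to do it is reciprocal rank fusion, sketched below as an assumption rather than as CCE's method. The chunk names and the constant k = 60 (a conventional default for this technique) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name so the order is stable.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for the query "login flow".
semantic_hits = ["login_flow", "sign_in", "reset_password"]  # related wording
keyword_hits = ["authenticate_user", "login_flow"]           # exact-name matches

merged = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(merged)
```

A chunk that both searches agree on rises to the top, while a chunk only one search found still makes the list. That is the blind-spot coverage described above.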
The harder problem turned out to be knowing which results to trust. Sometimes a search returns ten results and none of them are actually relevant, and an AI assistant that confidently uses a bad match is worse than one that returns no answer at all. The team tried having the model judge its own search results, but that added two to three seconds of latency per query. A simple score cutoff was too blunt, since short, valid queries would score low even on a perfect match.
What worked was a weighted formula: 50% semantic similarity score, 30% keyword match score, and 20% code recency, with the pass threshold adjusting to the current result set. It runs in about 0.4 milliseconds with no extra model calls. The lesson the team drew from testing more complex alternatives first: a simple formula beat a complex model most of the time.
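Here is a minimal sketch of that kind of scoring. The 50/30/20 weights come from the description above. Everything else is an assumption made for illustration: the Hit class, the 0-to-1 input scales, and especially the adaptive rule (keep results within a fraction of the best score in the set, with a small floor), which the source does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    semantic: float  # similarity score, assumed normalized to 0..1
    keyword: float   # keyword match score, assumed normalized to 0..1
    recency: float   # 1.0 = just edited, 0.0 = very old (assumed scale)


def confidence(hit: Hit) -> float:
    """Weighted blend: 50% semantic, 30% keyword, 20% recency."""
    return 0.5 * hit.semantic + 0.3 * hit.keyword + 0.2 * hit.recency


def filter_hits(hits: list[Hit], relative_cutoff: float = 0.6, floor: float = 0.2) -> list[Hit]:
    """Keep hits scoring near the best one in this result set (hypothetical rule)."""
    if not hits:
        return []
    scored = [(confidence(hit), hit) for hit in hits]
    best = max(score for score, _ in scored)
    threshold = max(floor, relative_cutoff * best)
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [hit for score, hit in ranked if score >= threshold]


hits = [
    Hit("validate_token", semantic=0.9, keyword=0.8, recency=0.5),
    Hit("refresh_token", semantic=0.6, keyword=0.7, recency=0.9),
    Hit("render_footer", semantic=0.3, keyword=0.0, recency=0.2),
]
kept = filter_hits(hits)
print([hit.chunk_id for hit in kept])
```

Because the cutoff moves with the best score in the set, a short query whose top match scores modestly still gets its results through, while a set made up entirely of weak matches is dropped by the floor.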
Setting Up Code Context Engine with Claude Code
You’ll need:
- Python 3.11 or later
- A C compiler for Tree-sitter grammars (xcode-select --install on macOS; sudo apt install build-essential cmake on Ubuntu/Debian; Visual Studio Build Tools with the C++ workload plus CMake on Windows)
- uv/uvx installed
- Claude Code, Cursor, VS Code/Copilot, or another MCP-compatible editor
- ~5 minutes
Step 1: Install and Initialize
The fastest path is a single command that installs CCE and indexes your project in one step:
uvx --from "code-context-engine[local]" cce init
For a persistent install you’ll reuse across projects:
uv tool install "code-context-engine[local]"
cd /path/to/your/project
cce init
What just happened: cce init indexes your current project, installs the necessary hooks, and auto-detects your editor to register the MCP connection.
Step 2: Configure Your Editor (Auto-Detected)
cce init writes the right config file for whichever editor it finds:
| Editor | Config File | Instructions File |
|---|---|---|
| Claude Code | .mcp.json | CLAUDE.md |
| VS Code/Copilot | .vscode/mcp.json | .github/copilot-instructions.md |
| Cursor | .cursor/mcp.json | .cursorrules |
| Gemini CLI | .gemini/settings.json | GEMINI.md |
| Codex | ~/.codex/config.toml | AGENTS.md |
To target a specific editor instead of auto-detection, pass a flag:
cce init --agent claude
Use --agent all if you switch between multiple editors on the same project. This also enables the shared index and cross-session memory described further down.
Watch out: if cce init doesn’t detect your editor, run it again with the explicit --agent flag rather than editing the MCP config file by hand.
Step 3: Verify the Index
cce status
Expected output shows the index health, file count, and last indexing time. If the index looks stale after pulling new changes, re-run it directly:
cce reindex
Testing It: Search, Savings, and the Dashboard
With the index live, test a query the way your assistant would:
cce search "auth flow"
This returns the ranked chunks CCE would hand to your AI assistant for that query, a quick way to sanity-check recall before trusting it in a real session.
To see what CCE is actually saving you, in tokens and dollars:
cce savings
cce savings --all
For a visual view across projects:
cce dashboard
On the team’s own real-world usage, the savings ledger showed 247 queries, 12.4 million tokens saved, and close to $186 not spent. 84% of that saving came from the hybrid search layer, and the rest from output compression (We Cut 94% of AI Coding Tokens With a Local Code Index, 2026). That figure isn’t an estimate: CCE compares what would have been sent against what was actually sent, on every query, then multiplies by the configured model’s price.
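The ledger logic described here is simple enough to sketch. This is an illustration, not CCE's code: the function name is hypothetical, the per-query token counts are borrowed from the FastAPI benchmark purely as sample inputs, and the price of $15 per million input tokens is an assumption. The source gives only the totals, not the rate that produced them.

```python
# Assumed USD price per million input tokens; not stated in the source.
PRICE_PER_MILLION_INPUT_TOKENS = 15.00


def query_saving(baseline_tokens: int, sent_tokens: int, price_per_million: float) -> tuple[int, float]:
    """Tokens and dollars saved on one query: (would have sent) minus (actually sent)."""
    saved_tokens = max(baseline_tokens - sent_tokens, 0)
    saved_dollars = saved_tokens * price_per_million / 1_000_000
    return saved_tokens, saved_dollars


# One query, using the benchmark's per-question token counts as sample inputs.
tokens, dollars = query_saving(83_681, 4_927, PRICE_PER_MILLION_INPUT_TOKENS)
print(f"one query: {tokens:,} tokens, ${dollars:.2f}")

# The same multiplication applied to the ledger's reported token total.
ledger_dollars = 12_400_000 * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000
print(f"ledger total at the assumed price: ${ledger_dollars:.2f}")
```

At the assumed rate, 12.4 million saved tokens comes to $186, which lines up with the ledger's "close to $186". A different configured model would change the dollar figure but not the token count.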
Common Setup Errors
| Error | Cause | Fix |
|---|---|---|
| Build fails during install | Missing C compiler for Tree-sitter | Install the platform-specific compiler toolchain from the “You’ll need” list, then re-run cce init |
| Editor not auto-detected | Editor config file in a nonstandard location | Re-run with explicit --agent <name> |
| Index missing recent files | Index built before latest commit | Run cce reindex |
| Low recall on search | Large, loosely-organized codebase | See the honest limits below: recall drops on files that do many unrelated things |
| Remote embeddings needed | No local Ollama instance | Set compression.ollama_url or export CCE_OLLAMA_URL |
What Do the Real Numbers (and the Honest Limits) Actually Show?
The headline 94% figure comes from a public, reproducible benchmark against FastAPI: 53 files, 20 real developer questions, run with and without CCE. Without it: 83,681 tokens per question. With it: 4,927 tokens per question, a 94% reduction, and CCE’s own additional output compression cut that further to 523 tokens per question. Recall@10 held at 0.90, meaning the right code was still found nine times out of ten (Code Context Engine GitHub, 2026).
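You can reproduce the headline percentages from the token counts alone. The sketch below also shows one common definition of recall@k, the share of relevant chunks that appear in the top k results. The benchmark's exact scoring method isn't described here, so treat that function and its sample inputs as assumptions.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return 1 - after / before


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    found = set(ranked_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)


# Token counts per question, as reported for the FastAPI benchmark.
print(f"retrieval only:   {reduction(83_681, 4_927):.1%}")
print(f"with compression: {reduction(83_681, 523):.1%}")

# Hypothetical query: two chunks are relevant, one of them is retrieved.
print(recall_at_k(["routing.py:get", "deps.py:solve"], {"deps.py:solve", "auth.py:check"}))
```

Averaging recall_at_k over all 20 questions is one way a single Recall@10 figure like 0.90 could be produced.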
At Sonnet-class pricing, the team’s cost example works out to roughly $0.14 per coding session without CCE versus $0.04 per session with it (We Cut 94% of AI Coding Tokens With a Local Code Index, 2026).
So does that mean your project will see the same 94%? Not necessarily.
The project is upfront about where the numbers don’t hold. The 94% figure is measured against a worst-case baseline: reading full files on every query. Tools like Claude Code are already smarter than that baseline, so real-world savings will typically land lower than 94%. On a large, loosely-organized 396-file Go monorepo, recall dropped to almost zero: CCE works best when files each do one focused thing, and struggles when files mix many unrelated responsibilities. The team also chose a small, fast embedding model over a larger one: reindexing runs in under a second, at the cost of some retrieval quality a bigger model would catch.
Can One Shared Index Serve Multiple AI Coding Tools?
Ever explained the same codebase to three different AI tools in one afternoon? That was the second problem the team ran into: using Claude Code for hard problems, Cursor for quick edits, and Copilot for small completions means re-explaining the same codebase to tools that don’t share anything. CCE’s answer is a single shared index that every configured editor connects to, plus cross-session memory: record a decision in one tool (record_decision), recall it in another (session_recall), and you don’t repeat yourself every session.
This shared-memory approach is close in spirit to the self-running agent patterns covered in loop engineering for Claude Code, where an agent needs persistent state across sessions rather than starting cold each time.
CCE exposes nine MCP tools: context_search, expand_chunk, related_context, session_recall, record_decision, record_code_area, index_status, reindex, and set_output_compression.
Configuration Tuning
CCE’s config lives at ~/.cce/config.yaml or a project-local .context-engine.yaml:
compression:
  level: standard # minimal | standard | full
  output: standard # off | lite | standard | max
  ollama_url: http://localhost:11434
retrieval:
  top_k: 20
  confidence_threshold: 0.5
pricing:
  model: opus # opus | sonnet | haiku | gpt-4o | etc.
Output compression levels scale predictably: off (0% savings, full output), lite (~30%), standard (~65%), and max (~75%, telegraphic style). Given the input-versus-output math above, treat output compression as a secondary lever. It’s worth turning on, but it’s not where the bulk of your savings will come from.
Frequently Asked Questions
Does Code Context Engine send my code to the cloud?
No. CCE runs entirely locally: indexing, embeddings, and search all happen on your machine, with built-in secret detection and PII scrubbing before anything is exposed to an MCP-connected editor.
Which AI coding tools does it work with?
CCE connects to any MCP-compatible editor, including Claude Code, VS Code/Copilot, Cursor, Gemini CLI, OpenAI Codex, OpenCode, and Tabnine. Running cce init --agent all configures multiple editors against the same shared index.
Is this the same thing as RAG?
It’s the same underlying pattern: semantic retrieval instead of full-context dumping, applied specifically to source code. CCE adds AST-aware chunking and a dependency graph on top of hybrid vector-plus-keyword search, which general-purpose RAG setups don’t need.
Will I really see 94% savings on my own project?
Only in the worst-case scenario the benchmark measures: reading whole files on every query. Real savings depend on your codebase’s structure; well-organized, single-responsibility files see savings closer to the FastAPI benchmark, while large files mixing many responsibilities see lower recall and smaller gains.
What does setup actually cost me in time?
The quick path is one command, uvx --from "code-context-engine[local]" cce init, which typically completes in under a minute once the C compiler prerequisite is installed for your OS. If your bill is spiking from API rate limits or retries as much as raw context volume, it’s worth checking both at once.
Next Steps
Run cce savings after a week of normal usage to see your own numbers rather than someone else’s benchmark. The tool tracks every query against what would have been sent otherwise. If your project spans multiple languages or a large monorepo, check cce status for per-directory recall before assuming it will behave like the FastAPI benchmark.
Related reading on this site:
- How RAG actually works, for the retrieval pattern CCE applies to code
- Mastering the skill of using AI agents, for getting more out of the assistant CCE feeds
- Handling 429 errors and rate limits, the other half of the AI coding cost equation
Gowtham writes about practical AI and machine learning tooling, RAG patterns, and developer workflows at aiwithgowtham.in.
Sources:
- Code Context Engine, GitHub repository and README, retrieved 2026-07-05, github.com/elara-labs/code-context-engine
- “We Cut 94% of AI Coding Tokens With a Local Code Index,” YouTube, Raj Sakthivel, retrieved 2026-07-05, youtube.com/watch?v=dRmWYHuIJxM