Memory

I started tracking my actual token usage with a tool called ccusage.

claude-mem

claude-mem is an open-source plugin for Claude Code that gives the AI persistent memory across sessions. Without it, every session starts from zero. You spend the first ten minutes re-explaining the project structure, the decisions you made last week, and the context that took an hour to build. Then the session ends, and you have to repeat it all next time.

claude-mem fixes this by capturing what happens during a session, compressing it with AI, and injecting only the relevant context when you start the next one. The result shows up in one specific place in the token data: the cache-read column.

Claude is pulling from compressed memory rather than re-reading files and history from scratch on every call. That’s not a marginal efficiency gain. It is a structural shift in how context gets delivered.

Graphify

Even with memory working efficiently, Claude was still spending significant time and tokens navigating our codebase, looking for files and tracing relationships. It was re-learning why something was built the way it was, even though it had built it. Every architecture question meant reading a handful of files to get oriented before doing any actual work.

Graphify builds a knowledge graph of your entire codebase. Instead of navigating by file search, Claude reads the graph and already knows the structure: which components connect to which routes, where the logic lives, and what depends on what. It replaces grep with traversal.
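To make "traversal instead of grep" concrete, here is a toy sketch in Python. It is not Graphify's implementation or data format; the node names and the `DEPENDS_ON` dictionary are hypothetical, chosen only to show how a dependency question becomes a graph walk rather than a text search.

```python
from collections import deque

# Hypothetical dependency edges: "A": ["B"] means A depends on B.
# These names are illustrative and do not come from any real project.
DEPENDS_ON = {
    "routes/invoices": ["components/InvoiceTable", "services/billing"],
    "components/InvoiceTable": ["services/billing"],
    "services/billing": ["db/models"],
    "db/models": [],
}


def reachable(graph, start):
    """Return everything `start` depends on, directly or indirectly,
    in breadth-first order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in visited:
                visited.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


if __name__ == "__main__":
    print(reachable(DEPENDS_ON, "routes/invoices"))
```

One lookup answers "what does this route touch?" without opening a single file, which is the kind of orientation work the graph is meant to take off Claude's plate.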

https://graphify.net/graphify-claude-code-integration.html

```shell
uv tool install graphifyy
graphify install
graphify claude install   # from inside your project
```

Open your AI coding assistant and type:

```shell
/graphify .
```

What this has to do with your AI bill

The same principles that made my $100 plan punch like $800 apply to most teams’ API bills. And the stakes are higher at enterprise scale because volumes are higher.

If your team is doing document processing with AI (ingestion, extraction, enrichment), you are sitting on the most cache-friendly workload there is. The system prompt, the extraction instructions, and the output schema all stay identical across every document run. Only the document changes. Those static pieces should be cached once and read at 10% of the standard input token cost on every subsequent call.

The math is simple. Say your system prompt and schema are 2,000 tokens, and you’re processing 500 documents. Without caching: 1 million tokens at full price. With caching: 2,000 tokens at full price, 998,000 at 10%. That’s not a small optimization. That’s a line item.

Beyond caching, there are two other levers worth knowing about.

The Batch API. Anthropic offers 50% off standard pricing for work that doesn’t require a real-time response. Nightly document ingestion, scheduled enrichment, and background extraction do not need to be processed as real-time calls. Same models, same quality, half the price. The only tradeoff is asynchronous results, which, for batch workloads, isn’t a tradeoff at all.

Batching the runs themselves. Cache entries expire after five minutes of inactivity by default. If documents are processed sporadically throughout the day, the cache expires between calls, and you’re paying full price anyway. Grouping runs keeps the cache hot and makes the math above actually work.

One quick audit worth doing in the Anthropic Console: look at the cache-write-to-cache-read ratio. If writes are high relative to reads, the cache is being created but not hit. That single number tells you whether caching is working or just adding write cost.

None of this is Claude-specific. Graphify works natively with OpenAI. If your team uses Cursor or builds against GPT-4, prompt caching is available there too, and the same reasoning holds, though the exact discount varies by provider.

The question worth asking your engineering team

Are we using prompt caching? If the answer is no, or if nobody knows, that’s the starting point. Not a new tool, not a new architecture. Just consistent prompt structure and grouped execution on workloads that are already running.

I built a production SaaS on a $100/month budget by paying attention to how context flows through AI systems. The same attention scales. Whether you’re a solo developer trying to get more out of a flat-rate plan or an enterprise team running document pipelines at volume, the lever is the same: stop paying full price for context you already have. The efficiency is available at both levels. So are the savings.
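For anyone who wants to rerun the 2,000-token, 500-document arithmetic with their own numbers, here is a minimal Python sketch. The function names are hypothetical, and the results are in "full-price token equivalents" so no specific price is assumed. The `write_multiplier` parameter defaults to 1.0 to match the simplified math in this article; cache writes are typically billed at a premium, so treat it as a knob to set from your provider's current pricing page.

```python
def prefix_token_equivalents(prefix_tokens, documents,
                             read_multiplier=0.10, write_multiplier=1.0):
    """Return (uncached, cached) cost of the static prompt prefix over a run,
    expressed in full-price token equivalents.

    Assumes the cache stays warm for the whole run: the first call writes
    the prefix to the cache and every later call reads it at a discount.
    """
    uncached = prefix_tokens * documents
    write = prefix_tokens * write_multiplier
    reads = prefix_tokens * (documents - 1) * read_multiplier
    return uncached, write + reads


def write_to_read_ratio(cache_write_tokens, cache_read_tokens):
    """Cache-write-to-cache-read ratio; a high value suggests the cache
    is being created but rarely hit."""
    if cache_read_tokens == 0:
        return float("inf")
    return cache_write_tokens / cache_read_tokens


if __name__ == "__main__":
    full, cached = prefix_token_equivalents(2_000, 500)
    print(f"uncached: {full:,.0f}  cached: {cached:,.0f}")
    print(f"write/read ratio: {write_to_read_ratio(2_000, 998_000):.4f}")
```

With the article's numbers, the static prefix costs the equivalent of about 101,800 full-price tokens instead of 1,000,000, roughly a tenth. If the cache expires mid-run, each expiry adds another write and removes a discounted read, which is why grouping runs matters.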