Been using Claude Code on Max for a few weeks and kept running into rate limits by early afternoon. Same tasks as colleagues who weren’t hitting limits at all. Figured it was just quota differences, but it turns out the issue was entirely on my end.
Anthropic just published an engineering post explaining how Claude Code’s cost structure actually works, and it changed how I use the tool.
The core mechanic: every request is built as a prefix chain (system prompt → tools → project docs → messages). The API caches that chain. If the prefix matches on the next request, those tokens are billed at a tenth of the normal price. If anything in the prefix changes, the cache invalidates from that point forward, and everything after the change is reprocessed at full price.
The things I was doing that were silently killing my cache:
- Switching between Sonnet and Opus mid-conversation with /model. Cache is model-bound, so every switch wiped everything I'd accumulated.
- Opening a new claude session for every task instead of continuing the previous one.
- Adding MCP tools mid-session when I needed them. Tool definitions are part of the cached prefix.
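To make the invalidation mechanics concrete, here's a toy cost model I sketched — invented units and chunk names, not Anthropic's actual pricing or internals; the only thing taken from the post is the roughly 1/10 rate for cached tokens. It treats the prompt as a chain of (model, chunk) pairs and shows why a /model switch wipes the discount:

```python
# Toy model of the prefix chain -- invented units, not Anthropic's real pricing.
# A full-price "token" costs 10 units; a cache-hit token costs 1 (the 1/10 rate).

def request_cost(prompt, cached_prefix, full=10, cached=1):
    """Bill tokens matching the cached prefix at the discounted rate."""
    hit = 0
    for sent, prev in zip(prompt, cached_prefix):
        if sent != prev:
            break  # first mismatch invalidates the cache from here on
        hit += 1
    return hit * cached + (len(prompt) - hit) * full

# One entry per chunk, tagged with the model, since the cache is model-bound.
turn1 = [("opus", "system"), ("opus", "tools"), ("opus", "docs"), ("opus", "msg1")]

# Next turn, same model: the entire previous prompt is a cache hit.
turn2_same = turn1 + [("opus", "msg2")]

# Next turn after a /model switch: same text, different model -> zero hits.
turn2_switched = [("sonnet", chunk) for _, chunk in turn1] + [("sonnet", "msg2")]

print(request_cost(turn2_same, turn1))      # 4*1 + 1*10 = 14
print(request_cost(turn2_switched, turn1))  # 5*10     = 50
```

Same five chunks either way; the switch makes the request more than 3x as expensive in this toy accounting. Adding an MCP tool mid-session has the same effect, just with the mismatch landing at the tools chunk instead of the model tag.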
The fix that made the biggest difference: claude --resume. It restores your last session and picks up the cache chain where it left off. I’d never used it before.
Also: long conversations actually get cheaper over time because of how Claude Code’s compaction works. I was doing the opposite — short sessions, frequent restarts — which meant I was always paying full price for the first turns.
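A back-of-envelope comparison of the restart habit, with the same caveats — made-up token counts, and this models cache amortization only, not compaction itself. One long session re-bills its ever-growing prefix at the cached rate, while fresh sessions pay full price for the big system-prompt/tools/docs prefix every single time:

```python
# Back-of-envelope: invented token counts; cached tokens cost 1/10 of full price.
# Costs in arbitrary integer units (full-price token = 10, cached token = 1).

PREFIX = 1000   # system prompt + tools + project docs, sent with every request
TURN = 100      # new tokens added per message turn
FULL, CACHED = 10, 1

def long_session_cost(turns):
    """One continuous session: everything already sent is cached for the next turn."""
    total, cached_len = 0, 0
    for t in range(turns):
        prompt_len = PREFIX + (t + 1) * TURN
        total += cached_len * CACHED + (prompt_len - cached_len) * FULL
        cached_len = prompt_len
    return total

def fresh_sessions_cost(turns):
    """Restarting for every task: the big prefix is paid at full price each time."""
    return turns * (PREFIX + TURN) * FULL

print(long_session_cost(10), fresh_sessions_cost(10))  # 33500 110000
```

With these numbers, ten turns in one session cost about a third of ten one-turn sessions — and the gap widens as the prefix grows relative to each turn.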
Full writeup here if you want the details:
https://www.anthropic.com/engineering/claude-code-prompt-caching
Curious if others have noticed the model-switching issue specifically — that one surprised me the most.