Be Digital · Field notes
Cost optimization: working with tokens
Agentic AI doesn't bill per task — it bills per token, every turn. The surprising part isn't the price; it's the shape of the curve. Left alone, an agent re-pays for its whole conversation on every single step. Here's where the money actually goes, and the four levers I reach for to bend the curve down — drawn from a real "raise the test-coverage number" build.
1 · Every turn re-pays for the whole context
A chat feels linear, so it's easy to assume each message costs about the same. It doesn't. On every turn the model is re-sent the entire context window as input: the system prompt and tool definitions, your instruction files, whatever documents you pasted, and the full running history (every prior file read, every command's output, every edit). Output tokens (what the model writes) are a smaller slice, but they're usually priced higher per token. So the dominant cost is input, and input grows with history: each turn's input is a little larger than the last, so the cumulative bill grows roughly with the square of the number of turns.
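To make the shape of that curve concrete, here is a back-of-the-envelope sketch. The numbers are hypothetical (5,000 tokens of always-on context, 2,000 tokens of new history per turn), and the function names are mine, not from any billing API. It also assumes every turn re-sends everything at full input price.

```python
def input_tokens_per_turn(base_tokens: int, added_per_turn: int, turns: int) -> list[int]:
    """Input tokens sent on each turn when the full history is re-sent every time."""
    return [base_tokens + added_per_turn * t for t in range(turns)]


def cumulative_input_tokens(base_tokens: int, added_per_turn: int, turns: int) -> int:
    """Total input tokens billed across all turns."""
    return sum(input_tokens_per_turn(base_tokens, added_per_turn, turns))


if __name__ == "__main__":
    # Hypothetical numbers: 5,000 tokens of always-on context,
    # 2,000 tokens of new history (tool output, edits) appended per turn.
    for turns in (10, 20, 40):
        total = cumulative_input_tokens(5_000, 2_000, turns)
        print(f"{turns:>3} turns -> {total:,} input tokens")
```

Under these assumptions, 10 turns cost 140,000 input tokens, 20 turns cost 480,000, and 40 turns cost 1,760,000. Doubling the number of turns more than triples the total, which is the creep worth bending down.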
2 · The biggest lever: compress what you feed in
If history is the tax, the fastest win is shrinking what you put into context in the first
place. In the coverage build, the agent needed to know which classes were least tested — that
lived in a jacoco.xml report that ran to thousands of lines. Pasting it in each
turn was the single largest cost. The fix was a 100-line script that parses the XML once and
emits a compact ranked table; the agent reads that. Same decision quality, a fraction
of the tokens.
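A minimal sketch of that kind of script, not the one from the build itself. It assumes the standard JaCoCo XML layout, where each `class` element carries its own `counter` children with `type`, `missed`, and `covered` attributes, and it ranks by line coverage. The function names and the table format are illustrative choices.

```python
import sys
import xml.etree.ElementTree as ET


def rank_least_covered(root: ET.Element, top_n: int = 10) -> list[tuple[str, int, int, float]]:
    """Return (class name, covered lines, total lines, coverage %) rows, least covered first."""
    rows = []
    for cls in root.iter("class"):
        # Class-level totals are the <counter> elements directly under <class>;
        # counters nested inside <method> are per-method and are skipped here.
        for counter in cls.findall("counter"):
            if counter.get("type") != "LINE":
                continue
            missed = int(counter.get("missed", "0"))
            covered = int(counter.get("covered", "0"))
            total = missed + covered
            if total:
                rows.append((cls.get("name", "?"), covered, total, 100.0 * covered / total))
    # Lowest coverage first; on ties, larger classes first.
    rows.sort(key=lambda r: (r[3], -r[2]))
    return rows[:top_n]


def format_table(rows: list[tuple[str, int, int, float]]) -> str:
    """Render the ranked rows as a compact pipe-separated table."""
    lines = ["class | covered/total lines | coverage"]
    for name, covered, total, pct in rows:
        lines.append(f"{name} | {covered}/{total} | {pct:.1f}%")
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is whatever your build writes): python rank_coverage.py path/to/jacoco.xml
    print(format_table(rank_least_covered(ET.parse(sys.argv[1]).getroot())))
```

The agent then reads a dozen short lines instead of the full report, and the ranking is deterministic, so it doesn't drift between sessions.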
3 · Four levers I reach for
None of these are exotic. They're the habits that, together, turn a runaway loop into a predictable one.
- Keep the always-on layer lean. Whatever loads every session (instruction files, system rules) is a tax on every turn. Short isn't style; it's how the model keeps following the rules that matter.
- Load detail on demand. Procedures and references belong in things that load only when triggered, not in the always-on context. Pay for depth only when a task needs it.
- Compress the inputs. Summarize big files before they enter context. A small deterministic script beats pasting raw XML, logs, or datasets the model has to re-read every turn.
- Keep tool output off the meter. Run commands quietly (-q), scope them narrowly, and start fresh sessions for new units of work so verbose logs don't pile into history and get re-sent.
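For the last lever, one way to keep verbose logs out of history is to wrap noisy commands so only the exit code and the tail of the output come back. This is a sketch under my own assumptions: `run_quiet` and its `tail_lines` parameter are hypothetical names, and the demo uses a child Python process as a stand-in for a real build command.

```python
import subprocess
import sys


def run_quiet(cmd: list[str], tail_lines: int = 20) -> tuple[int, str]:
    """Run a command and return (exit code, last `tail_lines` lines of combined output)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so failures show up in the tail
        text=True,
    )
    lines = proc.stdout.splitlines()
    kept = lines[-tail_lines:] if tail_lines > 0 else []
    omitted = len(lines) - len(kept)
    if omitted:
        # Tell the reader (or the agent) that output was trimmed, and by how much.
        kept.insert(0, f"[{omitted} earlier lines omitted]")
    return proc.returncode, "\n".join(kept)


if __name__ == "__main__":
    # Stand-in for a noisy build: a child Python process printing 200 lines.
    noisy = [sys.executable, "-c", "for i in range(200): print('line', i)"]
    code, tail = run_quiet(noisy, tail_lines=3)
    print(code)
    print(tail)
```

The tail is usually where a build reports its failure, so the agent still sees what it needs to react to. Combining a wrapper like this with the tool's own quiet flag trims the output twice: once at the source and once before it reaches context.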
4 · Why this matters more for agents than chat
A one-shot question is cheap no matter what. An agent — one that reads files, runs builds, reacts to results, and tries again — multiplies the history effect, because each tool result is appended and re-sent on the next step. That's exactly where an autonomous loop earns its keep or burns your budget. The same loop that writes tests across five classes will, if you let it, also re-send five rounds of Maven logs. Quiet commands, tight scope, and compact inputs are what keep the acceleration from becoming an expense.
Three take-aways I'm keeping
- Cost tracks context, not turns. Optimize what's in the window, and both the per-turn bill and the cumulative creep (roughly quadratic, because every turn re-sends all the earlier ones) come down.
- Summarize before you send. The cheapest token is the one you computed outside the model. A tiny parsing script paid for itself in one session.
- Lean context is also better output. Trimming the window doesn't just cut spend — it sharpens the model's attention on the instructions that count.