
Be Digital · Field notes

Cost optimization: working with tokens

Agentic AI doesn't bill per task — it bills per token, every turn. The surprising part isn't the price; it's the shape of the curve. Left alone, an agent re-pays for its whole conversation on every single step. Here's where the money actually goes, and the four levers I reach for to bend the curve down — drawn from a real "raise the test-coverage number" build.

1 · Every turn re-pays for the whole context

A chat feels linear, so it's easy to assume each message costs about the same. It doesn't. On every turn the model is re-sent the entire context window as input: the system prompt and tool definitions, your instruction files, whatever documents you pasted, and the full running history — every prior file read, every command's output, every edit. Output tokens (what the model writes) are a smaller slice, but they're usually priced higher per token. So the dominant cost is input, and input grows with history.

[Figure: stacked bars of input tokens billed on each turn, Turn 1 through Turn 4. Layers: fixed (system + tools), instructions / pasted docs, and history, which grows each turn.]
The fixed and instruction layers are roughly constant; conversation history is the part that compounds. Input is re-billed in full every turn, so cost climbs faster than the number of turns.
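To make "faster than the number of turns" concrete, here is a rough worked model. The symbols are assumptions introduced for illustration, not measurements from the build: F is the roughly constant layer (system prompt, tools, instructions), h is the average number of tokens each turn appends to history, and N is the number of turns.

```latex
% Input billed on turn t: the fixed layer plus everything appended by earlier turns
I_t = F + (t-1)\,h

% Cumulative input over N turns
\sum_{t=1}^{N} I_t = N F + h\,\frac{N(N-1)}{2}

% Illustrative numbers only: F = 5000, h = 2000
% N = 10:  10 \cdot 5000 + 2000 \cdot 45  = 140\,000 tokens
% N = 20:  20 \cdot 5000 + 2000 \cdot 190 = 480\,000 tokens
```

Under these assumed numbers, doubling the turns from 10 to 20 raises cumulative input by roughly 3.4x rather than 2x. The second term is the one that compounds, and it scales with h, so whatever shrinks the tokens added per turn attacks the quadratic part directly.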

2 · The biggest lever: compress what you feed in

If history is the tax, the fastest win is shrinking what you put into context in the first place. In the coverage build, the agent needed to know which classes were least tested, and that information lived in a jacoco.xml report that ran to thousands of lines. Once pasted, the report was re-sent as input on every turn, which made it the single largest cost. The fix was a 100-line script that parses the XML once and emits a compact ranked table; the agent reads that. Same decision quality, a fraction of the tokens.

[Figure: a large XML report compressed to a small ranked table before being sent to the model. jacoco.xml (every class, every line counter; ~thousands of tokens per turn) → rank_coverage.py (parse once, locally) → ranked table (top 5; ~dozens of tokens).]
Pre-process big artifacts outside the model and feed it only the decision-grade summary. The same idea applies to logs, datasets, and search results.
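The original rank_coverage.py isn't reproduced here. As a minimal sketch of the same idea, the snippet below assumes the usual JaCoCo XML layout, in which each class element carries its own counter children with type, missed, and covered attributes. The function names and the sample report are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def rank_least_covered(xml_text, top_n=5):
    """Return the top_n classes with the lowest line coverage.

    Each row is (class_name, covered_lines, total_lines, percent_covered).
    """
    root = ET.fromstring(xml_text)
    rows = []
    for cls in root.iter("class"):
        # findall("counter") matches direct children only, so per-method
        # counters nested under <method> are not double counted.
        for counter in cls.findall("counter"):
            if counter.get("type") != "LINE":
                continue
            missed = int(counter.get("missed"))
            covered = int(counter.get("covered"))
            total = missed + covered
            if total:
                rows.append((cls.get("name"), covered, total, 100.0 * covered / total))
    # Lowest coverage first; among ties, larger classes first.
    rows.sort(key=lambda r: (r[3], -r[2]))
    return rows[:top_n]


def format_table(rows):
    """Render the ranking as a few short lines suitable for pasting into context."""
    lines = ["class | covered/total | line %"]
    for name, covered, total, pct in rows:
        lines.append(f"{name} | {covered}/{total} | {pct:.1f}")
    return "\n".join(lines)


# Hypothetical two-class report, standing in for a real jacoco.xml.
SAMPLE_REPORT = """<report name="demo">
  <package name="com/example">
    <class name="com/example/A" sourcefilename="A.java">
      <method name="run" desc="()V" line="3">
        <counter type="LINE" missed="1" covered="1"/>
      </method>
      <counter type="LINE" missed="8" covered="2"/>
    </class>
    <class name="com/example/B" sourcefilename="B.java">
      <counter type="LINE" missed="1" covered="9"/>
    </class>
  </package>
</report>"""

print(format_table(rank_least_covered(SAMPLE_REPORT)))
```

For a report on disk, ET.parse(path).getroot() would replace ET.fromstring. The point of the design is that the model only ever sees the few lines format_table returns, not the report itself.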

3 · Four levers I reach for

None of these are exotic. They're the habits that, together, turn a runaway loop into a predictable one.

  • Keep the always-on layer lean

    Whatever loads every session — instruction files, system rules — is a tax on every turn. Short isn't style; it's how the model keeps following the rules that matter.

  • Load detail on demand

    Procedures and references belong in things that load only when triggered, not in the always-on context. Pay for depth only when a task needs it.

  • Compress the inputs

    Summarize big files before they enter context. A small deterministic script beats pasting raw XML, logs, or datasets the model has to re-read every turn.

  • Keep tool output off the meter

    Run commands quietly (for example, mvn -q), scope them narrowly, and start fresh sessions for new units of work so verbose logs don't pile into history and get re-sent.
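The last lever can also be enforced in whatever wrapper hands command output back to the agent. The sketch below is one hypothetical way to do it, not a description of any particular agent framework: a successful command collapses to a one-line status, and a failure returns only the tail of its output. The names clip_output and run_quiet and the 20-line default are assumptions.

```python
import subprocess
import sys


def clip_output(text, max_lines=20):
    """Keep only the last max_lines lines and note how many were dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return "\n".join(lines)
    dropped = len(lines) - max_lines
    # Slice from an explicit index so max_lines=0 yields no lines at all.
    tail = lines[len(lines) - max_lines:]
    return "\n".join([f"[{dropped} earlier lines omitted]"] + tail)


def run_quiet(cmd, max_lines=20):
    """Run cmd; return "ok" on success, or the exit code plus a clipped tail on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return "ok"
    combined = result.stdout + result.stderr
    return f"exit {result.returncode}\n" + clip_output(combined, max_lines)


# Usage: a passing command costs one short line of history instead of a full log.
print(run_quiet([sys.executable, "-c", "print('lots of build chatter')"]))
```

Keeping the tail rather than the head is a judgment call: build tools tend to print the failure summary last, but a different tool may need a different slice.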

4 · Why this matters more for agents than chat

A one-shot question is cheap no matter what. An agent — one that reads files, runs builds, reacts to results, and tries again — multiplies the history effect, because each tool result is appended and re-sent on the next step. That's exactly where an autonomous loop earns its keep or burns your budget. The same loop that writes tests across five classes will, if you let it, also re-send five rounds of Maven logs. Quiet commands, tight scope, and compact inputs are what keep the acceleration from becoming an expense.
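A small simulation shows how much the size of each tool result matters in a five-step loop. All numbers here are illustrative assumptions (a 3,000-token fixed layer, 4,000 tokens for a verbose build log, 200 for a quiet one), not measurements from the coverage build, and the function name is hypothetical.

```python
def cumulative_input_tokens(fixed_tokens, appended_per_step):
    """Total input tokens billed across a loop whose tool results join history.

    Each step is billed the fixed layer plus everything appended by earlier steps.
    """
    total = 0
    history = 0
    for appended in appended_per_step:
        total += fixed_tokens + history
        history += appended
    return total


FIXED = 3_000                 # assumed system prompt + tools + instructions
verbose_logs = [4_000] * 5    # assumed: full build log appended after each step
quiet_logs = [200] * 5        # assumed: quiet, narrowly scoped command output

print(cumulative_input_tokens(FIXED, verbose_logs))  # 55000
print(cumulative_input_tokens(FIXED, quiet_logs))    # 17000
```

Under these assumptions the fixed layer costs the same 15,000 tokens in both runs. The gap comes entirely from re-sent tool output: 40,000 tokens in the verbose loop against 2,000 in the quiet one.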

Three take-aways I'm keeping

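  1. Cost tracks context, not turns. Optimize what's in the window, and both the per-turn bill and the quadratic creep come down. The creep is quadratic because each turn re-sends all earlier history, so cumulative input grows roughly with the square of the turn count.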
  2. Summarize before you send. The cheapest token is the one you computed outside the model. A tiny parsing script paid for itself in one session.
  3. Lean context is also better output. Trimming the window doesn't just cut spend — it sharpens the model's attention on the instructions that count.

Be Digital — notes from the bench. Related field notes: Claude Skills & context engineering · Claude Code implementation field notes.
