Everyone runs out of Claude tokens. We built a system so we don't.

Zovia Studio | 12 min read
claude-code ai-coding agentic-coding developer-workflow tokens context-engineering

At a Glance

  • “I keep running out of Claude tokens” is one of the most common complaints in agentic coding, and the usual fix offered is a bigger plan. That treats a symptom.
  • “Running out” is actually three different failures: hitting your plan’s usage cap, exhausting the model’s context window mid-task, and burning real money on metered API calls. Same disease, three different gauges.
  • The disease is waste, and the largest single source of waste is rework: debugging self-inflicted bugs, re-deriving context you already had, and letting the wrong model flail at a task it was never sized for.
  • We never set out to save tokens. We built guardrails for quality. Efficiency fell out as a byproduct, because every mechanism that stops a bug or preserves context also stops a token from being wasted.
  • This post is the principle-level blueprint. You can apply all of it to Claude, and most of it to any agentic coding tool you use.

There is a complaint you will see in every community of people building software with AI. Some version of “I ran out of tokens again.” It arrives mid-afternoon, mid-task, right when the work was flowing. The advice that follows is almost always the same: upgrade your plan, buy more capacity, wait for the window to reset.

We run a small studio. We ship a lot of software with Claude, every day, across a Flutter app ecosystem and a couple of client projects. And we almost never run out.

For a long time we assumed we were just lucky, or under some threshold. Then we looked at it honestly and realized it was not luck. It was the framework. The same set of guardrails and conventions we built to keep our code clean turned out to be, almost by accident, a token-efficiency system. Not because any of it was designed to save tokens. Because waste and low quality come from the same place.

This post is the explanation. What “running out” actually means, why it happens, and the principles we use to make it rare. We will keep the deepest parts of our own implementation to ourselves, but the principles are the part that transfers, and the principles are most of the value.


“Running out” is three different problems

The phrase hides three distinct failures. They feel the same in the moment (your work stops), but they have different causes and different fixes. If you do not separate them, you will keep reaching for the one fix that does not apply.

1. The usage cap. Subscription plans meter usage in rolling windows. A five-hour window, a weekly ceiling. Run hot enough and you get locked out until it resets, often at the worst possible time. People treat this as a hard limit on how much they can build. It is not. It is a limit on how many tokens you spend, and you spend far more than you need to.

2. Context-window exhaustion. The model has a working memory, and a long session fills it. When it fills, the tool compacts or summarizes to make room, and quality quietly degrades. The model forgets a decision you made an hour ago and re-derives it, differently this time. It re-reads a file it already read. It loses the thread. You experience this as the model “getting dumber” late in a session. What actually happened is you spent its memory on things that did not need to be there.

3. Real money. On metered API access, every token is a line item. The same waste that trips a usage cap shows up here as a bill. This is the most honest gauge, because it puts a number on the thing the other two only imply.

Three gauges, one underlying quantity: how many tokens it takes you to ship a unit of working software. Lower that number and all three problems recede at once. You do not hit the cap as fast. The context window lasts longer. The bill shrinks. Nobody sells you a setting for this. It comes from how you work.


The disease is waste, and most of it is rework

Here is the part most “token tips” miss. The expensive tokens are not the ones you spend writing code. They are the ones you spend recovering from going wrong.

Walk through where an agentic coding session actually leaks:

  • Debugging a bug you just created. The model writes something subtly broken, you run it, it fails, and now you spend a long, token-heavy loop investigating, hypothesizing, and patching. The original feature might have cost a thousand tokens. The bug hunt can easily cost ten times that.
  • Re-deriving what you already knew. New session, same project. The model re-reads the same files, re-learns the same constraints, re-discovers the same gotcha that bit you last week. You pay full price for knowledge you already bought.
  • The wrong model flailing. Point an underpowered model at a task with real design complexity and it does not fail cleanly. It loops. It produces something plausible and wrong, which you then pay again to debug. Point an overpowered model at a one-line change and you pay a premium for a job that needed none.
  • Context spent on scaffolding. Reading entire files when you needed ten lines. Loading tools and references you never use. Every token in the window is a token not available for the actual work.

None of this is the model being wasteful. It is the workflow being wasteful. And the fix is not a bigger plan. It is a set of guardrails that stop the leaks before they start. The reason ours double as a quality system is that a bug you never wrote is both higher quality and cheaper. Those are the same event.
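To put rough numbers on the rework argument, here is a small Python sketch of expected cost per feature. The 1,000-token feature and the 10x bug hunt come from the example above; the two bug rates are invented purely for illustration and are not measurements.

```python
def expected_tokens_per_feature(
    feature_tokens: int, bug_rate: float, debug_multiplier: float
) -> float:
    """Expected tokens to ship one feature, counting the debugging loop.

    bug_rate is the fraction of features that trigger a bug hunt, and
    debug_multiplier is how much a hunt costs relative to the feature itself.
    """
    if not 0.0 <= bug_rate <= 1.0:
        raise ValueError("bug_rate must be between 0 and 1")
    debug_tokens = feature_tokens * debug_multiplier
    return feature_tokens + bug_rate * debug_tokens


# Assumed bug rates: 30% without guardrails, 5% with them.
unguarded = expected_tokens_per_feature(1_000, bug_rate=0.30, debug_multiplier=10)
guarded = expected_tokens_per_feature(1_000, bug_rate=0.05, debug_multiplier=10)
print(f"unguarded: {unguarded:.0f} tokens, guarded: {guarded:.0f} tokens")
```

Under those assumed rates the unguarded workflow spends far more on recovery than on the feature itself, which is the sense in which rework dominates the bill.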


The system, as five principles

We are not going to hand you our rulebook line by line, or our hook scripts, or the internal tooling that ties it together. That part we are keeping. But the shape of the system is five principles, and the shape is what you can build from.

1. Right-size the model before the work starts

The first thing that happens on any new task in our setup is not code. It is a one-line judgment: how big is this, really, and what model and effort level does it deserve? A typo fix and a new authentication architecture are not the same task, and they should not run on the same settings.

This cuts waste in both directions, which is the part people miss. The obvious waste is running your most powerful model at maximum effort on a trivial edit. The hidden, larger waste is the opposite: a weak model turned loose on a hard problem. It does not save you money by being cheaper per token. It costs you money by being wrong, looping, and generating a mess you pay to clean up. Matching the tool to the task is the single highest-leverage habit we have, and it takes one sentence at the top of each task.

You can do this manually today. Before you prompt, ask: is this a five-minute mechanical change, or does it have decisions in it? Set your model and effort accordingly. The discipline is to make that call before you start, not after the cheap model has already dug a hole.
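If you would rather encode that call than remember it, a sizing rule can be as small as the sketch below. The tier and effort labels, the `Sizing` type, and the three inputs are all hypothetical; they stand in for whatever models and settings your tool exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizing:
    model_tier: str  # hypothetical labels, not real model names
    effort: str


def size_task(has_design_decisions: bool, is_mechanical: bool, files_touched: int) -> Sizing:
    """Pick a model tier and effort level before any code is written."""
    if has_design_decisions:
        # Decisions are where a weak model loops and produces plausible, wrong work.
        return Sizing(model_tier="large", effort="high")
    if is_mechanical and files_touched <= 1:
        # A typo fix or rename does not need a premium model.
        return Sizing(model_tier="small", effort="low")
    return Sizing(model_tier="medium", effort="medium")


print(size_task(has_design_decisions=False, is_mechanical=True, files_touched=1))
print(size_task(has_design_decisions=True, is_mechanical=False, files_touched=12))
```

The thresholds matter less than the ordering: the check for design decisions comes first, so a hard problem can never fall through to the cheap tier.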

2. Never learn the same thing twice

Every project accumulates hard-won knowledge. The gotcha that took three hours to track down. The non-obvious reason a function is shaped the way it is. The convention that is not written anywhere in the code. By default, that knowledge lives in your head and the model’s temporary context, and it evaporates when the session ends. Next time, you both pay to relearn it.

So we write it down, once, in a persistent form the model reads at the start of relevant work. The single most valuable file in our setup is a running list of mistakes we have already made and never want to make again. It is, quite literally, a catalog of expensive bugs converted into a few cheap sentences. Reading those sentences costs almost nothing. Rediscovering what they encode costs a session.

The principle: any fact you had to work to obtain should be captured the moment you obtain it, somewhere the model will see it again, so neither of you pays for it twice. A project memory file, a conventions doc, a “things that will bite you here” list. The format matters less than the habit.
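One low-friction way to build the habit is a helper that appends a dated, one-line lesson to a file and skips duplicates. This is a minimal sketch; the function name and the line format are assumptions, and the file can be whatever your agent already loads at the start of a session.

```python
from datetime import date
from pathlib import Path


def record_lesson(path: Path, lesson: str) -> bool:
    """Append a one-line lesson to the file unless it is already recorded.

    Returns True if the lesson was written, False if it was a duplicate.
    """
    lesson = " ".join(lesson.split())  # collapse whitespace so each lesson stays on one line
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    # Each stored line looks like "- 2025-01-31: lesson text"; compare the text only.
    if any(line.split(": ", 1)[-1] == lesson for line in existing):
        return False
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {date.today().isoformat()}: {lesson}\n")
    return True
```

Calling `record_lesson(Path("MISTAKES.md"), "...")` right after a fix takes seconds; `MISTAKES.md` is a placeholder name, not a file any tool requires.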

3. Do not make the model your linter

This is the most concrete lever, and the one we would push hardest if you take only one thing from this post.

A large amount of what an AI coding agent does in a session is verification it does not need to perform with intelligence. Does this file still pass the analyzer? Does this change violate a style rule? Does this code reference a database field that actually exists? These are deterministic checks. A script answers them perfectly, instantly, and for zero tokens.

So we moved that work off the model entirely. Edits trigger automatic checks that run outside the conversation: the analyzer runs, project-specific audits run, and anything bound for deployment passes a validation gate first. The model is told the result only if something is wrong. When everything is fine, which is most of the time, not a single token is spent confirming it.

The payoff is double. You stop paying the model to do a linter’s job, and you catch the broken edit at the moment it happens, instead of three steps later when it has compounded into a confusing failure that costs a real debugging loop. Mechanical verification belongs in mechanical tools. Reserve the model’s tokens, and its context, for the work that actually needs judgment.

Every serious tool supports some version of this. Pre-commit hooks, editor-integrated checks, CI gates, task runners the agent can call. Wire your deterministic checks to fire automatically and keep their noise out of the conversation unless they fail.
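The "silent on success" behavior is the part worth getting right. The sketch below is one way to do it in Python, not the hook scripts described above: `run_checks` and the demo commands are hypothetical, and you would swap in your real analyzer or linter invocations.

```python
import subprocess
import sys


def run_checks(checks: dict[str, list[str]], max_lines: int = 20) -> list[str]:
    """Run each command and return a short report only for the ones that fail."""
    failures = []
    for name, command in checks.items():
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep only the tail of the output so a noisy tool cannot flood the context.
            tail = (result.stdout + result.stderr).strip().splitlines()[-max_lines:]
            failures.append(f"[{name}] exit {result.returncode}\n" + "\n".join(tail))
    return failures


# Stand-in commands: one check that passes and one that fails.
demo_checks = {
    "passes": [sys.executable, "-c", "pass"],
    "fails": [sys.executable, "-c", "import sys; print('field missing'); sys.exit(2)"],
}
for report in run_checks(demo_checks):
    print(report)
```

A wrapper like this can sit behind a pre-commit hook or an editor task. An empty list means nothing reaches the conversation at all.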

4. Prevent rework, because it is the most expensive token you will ever spend

Most of our coding conventions exist to stop a specific class of bug we have shipped before. Each one reads like a quality rule. Each one is also a token-efficiency rule, because the bug it prevents is a debugging session that now never happens.

One example we can share in the abstract: we had a recurring failure where the shape of data coming back from the server drifted out of sync with what the client expected, and it would take down several screens at once with a confusing error. Tracking that down, every time, was a long and expensive loop. So we added a standing check that compares the two shapes and fails loudly and early if they diverge. The cost is a few seconds and a handful of tokens. The thing it replaces is a multi-hour, high-token hunt through unrelated code. That trade is not close.

The principle: when a bug costs you a real debugging session, do not just fix it. Spend a little to make the entire class of bug impossible or instantly visible. The cheapest debugging is the debugging you never do, and prevention is almost always an order of magnitude cheaper than the cure, in tokens as much as in time.
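A standing guard for that kind of drift can be very small. The following is a sketch under simplifying assumptions, not the check described above: each shape is a flat mapping of field name to type name, and extra fields on the server are treated as harmless.

```python
def shape_mismatches(server: dict[str, str], client: dict[str, str]) -> list[str]:
    """List every way the client's expected shape disagrees with the server's."""
    problems = []
    for field in sorted(client.keys() - server.keys()):
        problems.append(f"client expects '{field}' but the server does not send it")
    for field in sorted(client.keys() & server.keys()):
        if server[field] != client[field]:
            problems.append(
                f"'{field}' is {server[field]} on the server but {client[field]} on the client"
            )
    return problems


def assert_shapes_match(server: dict[str, str], client: dict[str, str]) -> None:
    """Fail loudly and early, naming every divergent field at once."""
    problems = shape_mismatches(server, client)
    if problems:
        raise AssertionError("schema drift detected:\n" + "\n".join(problems))


# Hypothetical shapes for illustration.
server_shape = {"id": "int", "name": "str", "created_at": "str"}
client_shape = {"id": "int", "name": "str", "avatar_url": "str"}
print(shape_mismatches(server_shape, client_shape))
```

Run as a test or a pre-deploy step, the failure message names the exact field, which is what turns a long hunt through unrelated code into a one-line fix.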

5. Protect the context window like a budget

The context window is not free space. It is the model’s working memory, and everything you put in it competes with everything else. Fill it with the wrong things and you get the “model got dumber late in the session” effect, which is really “the model can no longer see what matters.”

Two habits keep ours clean. First, we read narrowly. When the question is “where is X defined,” we do not load whole files into the conversation. We send a focused search that returns the relevant excerpts and nothing else. Second, we delegate self-contained sub-tasks to separate workers with their own memory. A broad codebase search, a focused audit, a contained fix. Each runs in its own context and reports back only the conclusion. The main session stays clear, holding the through-line of the actual task instead of the debris of everything it took to get there.

The principle: treat context as the scarce resource it is. Pull in excerpts, not entire files. Push isolated work into isolated contexts and keep only the answers. The longer your working session stays focused, the longer it stays smart, and the fewer times you pay to rebuild context you should never have lost.
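"Excerpts, not entire files" is easy to approximate with an ordinary search that returns matching lines plus a little surrounding context. The helper below is an illustrative sketch with a hypothetical name; most agents and editors ship an equivalent search tool you can use instead.

```python
import re


def find_excerpts(text: str, pattern: str, context: int = 2) -> list[str]:
    """Return only the lines matching pattern, plus `context` lines either side.

    Each returned line is prefixed with its 1-based line number.
    """
    lines = text.splitlines()
    regex = re.compile(pattern)
    keep: set[int] = set()
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(index - context, 0)
            end = min(index + context + 1, len(lines))
            keep.update(range(start, end))
    return [f"{index + 1}: {lines[index]}" for index in sorted(keep)]


# A 100-line stand-in file with one definition buried in the middle.
source = [f"filler {n}" for n in range(1, 101)]
source[49] = "def target():"
for excerpt in find_excerpts("\n".join(source), r"def target", context=1):
    print(excerpt)
```

In this example three lines reach the conversation instead of a hundred, and the line numbers make a follow-up read of just that region straightforward.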


Why it compounds

Any one of these helps. The reason they add up to “we basically do not run out” is that they are written down and applied automatically, not remembered and applied by willpower. The model-sizing call happens at the top of every task. The checks fire on every edit without anyone deciding to run them. The memory is read before the relevant work, not when someone thinks of it. Discipline that depends on remembering to be disciplined is not discipline. It is luck with extra steps.

The leverage is that we built it once and now it runs on everything we touch. A new project inherits the same guardrails on day one. That is the difference between a person who has good habits and a workshop that has good habits built into the bench.


Build this yourself: the decisions that matter

If you want the same property, here is what we would tell you to set up, in order of payoff. Little of it is Claude-specific. Most of it applies to whatever agent you drive.

1. Make a model-sizing call the first thing on every task. One sentence, before any code: how big is this, what does it deserve. Right-sizing in both directions is the highest-leverage habit you have, and it is free.

2. Keep a running “mistakes we already made” file the model reads. Convert every expensive bug into a cheap sentence the moment you solve it. This is the single best token investment we have found, because it is pure prevented rework.

3. Move every deterministic check off the model. Analyzer, linter, schema or type validation, anything a script can answer. Fire them automatically, surface them only on failure. Stop paying intelligence for work that needs none.

4. Turn recurring bugs into standing guards. When something costs you a debugging session, do not just patch it. Make the class of bug visible or impossible. Prevention is almost always an order of magnitude cheaper than the cure.

5. Read narrowly and delegate isolated work. Excerpts, not whole files. Push self-contained sub-tasks into their own contexts and keep only the conclusions. Your main session stays focused and stays sharp.

6. Write it all down and automate it. The whole thing only compounds if it runs without you remembering to run it. A conventions file the model loads, hooks that fire on their own, defaults that apply everywhere. Habits on the bench, not in your head.


The takeaway

When people say they ran out of Claude tokens, they almost always mean one of three things: they hit a usage cap, they filled the context window, or they spent more money than they expected. All three are the same underlying fact wearing different clothes. It took too many tokens to ship the work.

It took too many tokens because of rework. Bugs that did not have to exist, knowledge relearned instead of recorded, the wrong model flailing at the wrong job, context spent on noise. Fix the waste and the cap stops being a wall, the context window stops collapsing, and the bill comes down. Not because you bought more, but because you needed less.

We did not build any of this to save tokens. We built it to ship good software without tripping over our own feet. The efficiency was a gift the quality gave us. If you want to stop running out, stop optimizing for tokens and start optimizing for not being wrong. The tokens take care of themselves.


Built at Zovia Studio. We ship apps that families use every day, on a stack and a workflow we have spent years sharpening.
