The economics of AI tokens for businesses
AI token cost is the new line item on your software bill. Here is how token pricing actually works, why caching changes the math, and how to control spend as you scale.
By Andrew Pagulayan
The first AI bill almost always surprises someone. A team ships a feature that summarizes support tickets, the demo costs a fraction of a cent, the launch goes well, and then finance forwards an invoice with a number nobody modeled. The feature works. The AI token cost just scaled faster than anyone expected, because the thing that was cheap to try turned out to be priced per use, and use went up.
Tokens are the unit of that bill, and most people who approve the budget have never been told what a token is. It is not a word, not a character, not a request. It is a chunk of text, often a few characters or a short word fragment, that a language model reads and writes one piece at a time. Every model provider, from OpenAI to Anthropic to Google, prices their service by the token: so much for the tokens you send in, more for the tokens the model sends back. Understand that one mechanic and the mysterious invoice becomes a spreadsheet you can actually steer.
This post breaks down the real economics: how token pricing works, why input and output are priced differently, how caching can remove most of the input cost of a recurring workload, how to pick a model without overpaying, and how to keep AI token cost flat while usage grows. The goal is not to scare you off AI. It is to make the meter legible, so you spend on the value and stop paying the tax you did not know you were carrying.
What an AI token actually is, and why it is the unit of cost
A token is roughly four characters of English text, or about three quarters of a word, though the exact split depends on the model's tokenizer. The word "automation" might be one token. A rare technical term, a long URL, or a chunk of code might be several. As a rule of thumb, a thousand tokens is around seven hundred and fifty words, so a dense page of text is close to a thousand tokens and a short email is one or two hundred. Numbers, punctuation, and other languages tokenize differently, which is why the same idea can cost more to express in some languages than in English.
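To make the rule of thumb concrete, here is a minimal sketch that turns it into a budgeting estimate. It only applies the four-characters and three-quarters-of-a-word ratios from the paragraph above; it is not a real tokenizer, and the function names are illustrative. For billing-grade numbers, count with the tokenizer your provider publishes.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English rule of thumb only)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_from_words(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count, using ~0.75 words per token."""
    return math.ceil(word_count / words_per_token)


if __name__ == "__main__":
    # A 750-word page lands at about a thousand tokens under this rule.
    print(estimate_tokens_from_words(750))
    print(estimate_tokens("Summarize this support ticket in one sentence."))
```

Treat the result as a planning figure. Code, URLs, numbers, and non-English text will usually run higher than this estimate.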
The reason this matters for your budget is that a model charges for both the tokens it reads and the tokens it generates. When you send a prompt, every word of instruction, every example, every document you paste in, and the entire running conversation history all count as input tokens. When the model replies, every word it writes counts as output tokens. A single question that looks like one short message can carry thousands of input tokens behind it if you are also feeding the model a long system prompt and a transcript of everything said before.
You are not billed for questions and answers. You are billed for tokens, and a short question can hide a very long prompt. The text you never see is the text you pay for most.
This is the single most important mental shift. The visible message is the tip of the prompt. Underneath sit the instructions, the retrieved documents, the tool definitions, and the chat history, all of which the model has to read again on every turn. A chatbot that feels like a quick exchange can be re-reading a small book on every reply. Once you see the prompt as the real unit of cost rather than the message, the levers for controlling spend become obvious.
How token pricing works: input, output, and why they differ
Providers publish prices per million tokens, and they almost always charge more for output than for input. Output is typically a few times the price of input on the same model, because generating text is the computationally expensive part: the model produces one token, feeds it back into itself, produces the next, and repeats. Reading your prompt is comparatively cheap. Writing the answer is where the work happens, and the price reflects it.
That asymmetry should shape how you design any AI feature. The cheapest workloads read a lot and write a little: classification, extraction, routing, scoring, yes or no decisions. You feed the model a big chunk of context and ask it to emit a single label or a short structured result. The expensive workloads do the reverse: long-form drafting, code generation, exhaustive reports, anything where the model produces pages of output. Neither is wrong, but knowing which side of the ledger a feature sits on tells you in advance whether it will be cheap or costly at scale.
Prices also vary enormously across models, often by a factor of ten or more between a small fast model and a large frontier one. The same task that costs a few cents on a flagship model might cost a fraction of a cent on a smaller model that is perfectly capable of the work. The headline price per million tokens is therefore only half the story. The real driver of your bill is the product of three things: how many tokens each call uses, how often the call fires, and which model you route it to. Get those three numbers in front of you and the abstract worry about AI token cost turns into arithmetic.
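That arithmetic is short enough to write down. The sketch below multiplies the three numbers together; the prices in it are hypothetical placeholders chosen only to show a tenfold gap between tiers, not any provider's actual rates, and the names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Prices in dollars per million tokens (hypothetical values, not real rates)."""
    input_per_million: float
    output_per_million: float


def call_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: tokens used times price per token."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def monthly_cost(price: ModelPrice, input_tokens: int, output_tokens: int,
                 calls_per_day: int, days: int = 30) -> float:
    """Tokens per call x how often the call fires x the price of the model."""
    return call_cost(price, input_tokens, output_tokens) * calls_per_day * days


if __name__ == "__main__":
    large = ModelPrice(input_per_million=3.00, output_per_million=15.00)  # assumed
    small = ModelPrice(input_per_million=0.30, output_per_million=1.50)   # assumed
    for name, price in (("large", large), ("small", small)):
        print(name, round(monthly_cost(price, 2200, 300, calls_per_day=50_000), 2))
```

Swap in the current published prices for the models you actually use before relying on the output.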
Caching: the biggest lever most teams never pull
Here is the detail that changes the economics more than any other, and the one most teams discover too late. A large share of the tokens in a typical AI application are identical on every single call. The system prompt that defines the assistant's behavior, the company context you inject, the tool definitions, the few-shot examples, the policy document: this block can be thousands of tokens, and it is the same on the first request and the ten thousandth. You are paying full price to make the model read the same preamble over and over.
Prompt caching fixes exactly this. When a provider supports it, the provider stores the processed form of a repeated prefix and reuses it on later calls, charging a steep discount for the cached portion. Both OpenAI and Anthropic have shipped prompt caching, and the cached input tokens are priced far below standard input tokens, commonly a small fraction of the normal rate. For a workload with a big stable prefix and a small changing suffix, which describes most assistants and agents, caching can remove the majority of the input cost without changing a single word of the output.
The practical design rule that follows is simple, and it is worth treating as a discipline:
- Put the stable stuff first. Structure every prompt so the unchanging content, the system instructions, context, and examples, sits at the front, and the variable content, the user's actual question, sits at the end. Caching works on prefixes, so a stable prefix is a cacheable prefix.
- Do not shuffle the preamble. If you reorder or lightly edit the system prompt on every request, you break the cache and pay full price again. Keep it byte-for-byte stable across calls so the cache actually hits.
- Batch the repeats. Caches expire after a window of inactivity, so workloads that fire frequently benefit most. A prompt reused hundreds of times an hour stays warm; one used twice a day may go cold between uses and lose the discount.
- Measure the hit rate. A cache you assume is working but never verify is just hope. Watch the share of input tokens billed as cached, and treat a low hit rate as a bug to fix, not a number to ignore.
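The first and last rules in that list can be sketched in a few lines. This is an illustration under stated assumptions, not a provider integration: the record fields `input_tokens` and `cached_tokens` are hypothetical names, with `input_tokens` meaning all input tokens on the call, cached or not. Real APIs report cached tokens under their own field names and conventions, so check your provider's usage object.

```python
from typing import Dict, List

# Stable content goes first and is never edited per request (assumed example text).
STABLE_PREFIX = (
    "You are the support assistant for Example Co.\n"
    "Policy: refunds within 30 days.\n"
)


def build_prompt(stable_prefix: str, user_question: str) -> str:
    """Stable prefix first, variable question last, so the prefix stays cacheable."""
    return stable_prefix + "\n" + user_question


def cache_hit_rate(records: List[Dict[str, int]]) -> float:
    """Share of all input tokens that were billed at the cached rate."""
    total = sum(r["input_tokens"] for r in records)
    cached = sum(r["cached_tokens"] for r in records)
    return cached / total if total else 0.0


if __name__ == "__main__":
    usage_log = [
        {"input_tokens": 2200, "cached_tokens": 0},     # first call warms the cache
        {"input_tokens": 2200, "cached_tokens": 2000},  # later calls reuse the prefix
        {"input_tokens": 2200, "cached_tokens": 2000},
    ]
    print(f"hit rate: {cache_hit_rate(usage_log):.0%}")
```

If the hit rate sits far below the share of the prompt that is supposed to be stable, something is changing the prefix between calls.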
The reason this lever is so powerful is that it is free value. You are not degrading quality, shortening answers, or switching to a weaker model. You are simply refusing to pay full price to re-read text the model has already seen. Any platform serious about cost should bake caching in by default rather than leave it as an expert setting. It is part of why Team Brain's AI automation layer caches the shared conversation prefix automatically, so the company context every agent reads is paid for once and reused, not billed on every run.
Model selection: paying for intelligence you actually use
The instinct to reach for the smartest, most expensive model on every task is the second biggest source of waste, after uncached prefixes. Frontier models are remarkable, but most of the work inside a business is not frontier-hard. Tagging an email by topic, extracting a date from an invoice, deciding whether a message is a complaint or a question: these are tasks a small, cheap model handles with near-perfect accuracy. Routing them to a flagship model is like chartering a jet to cross the street.
The discipline that controls this is matching the model to the task, and the cleanest way to do it is to sort your workloads into tiers:
- Cheap and fast for high-volume, low-judgment work. Classification, extraction, routing, simple rewrites, structured tagging. These run constantly, so the per-call price dominates, and a small model keeps that price near zero while doing the job well.
- Mid-tier for everyday reasoning. Drafting a solid reply, summarizing a document, answering a question against retrieved context. A balanced model gives you most of the quality of the top tier at a meaningful discount, which is the right default for the broad middle of real work.
- Frontier only where judgment pays for itself. Tricky multi-step reasoning, nuanced writing that represents the brand, code that has to be correct, decisions where a wrong answer is expensive. Reserve the premium model for the small slice of work where its extra quality changes the outcome, and the price is easy to justify.
A useful pattern is to let a cheap model do the first pass and escalate only when it is unsure or the stakes are high. A small model can triage, draft, and filter; a larger one handles the few cases that genuinely need it. This cascade can cut the bill for a workload by more than half while keeping quality where it counts, because the expensive model only sees the fraction of traffic that actually warrants it. The mistake is picking one model for everything. The fix is treating model choice as a routing decision made per task, not a single procurement decision made once.
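The cascade can be sketched as a small routing function. Everything here is hypothetical: the two models are stand-in callables that return an answer and a confidence score between 0 and 1, and the 0.8 threshold is an arbitrary starting point. How you get a usable confidence signal from a real model is a design problem of its own.

```python
from typing import Callable, Tuple

# A "model" here is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(task: str, cheap_model: Model, strong_model: Model,
            min_confidence: float = 0.8, high_stakes: bool = False) -> Tuple[str, str]:
    """Try the cheap model first; escalate when it is unsure or the stakes are high.

    Returns (answer, tier_used).
    """
    if high_stakes:
        answer, _ = strong_model(task)
        return answer, "strong"
    answer, confidence = cheap_model(task)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = strong_model(task)  # escalation pays for both calls
    return answer, "strong"


if __name__ == "__main__":
    # Stub models for demonstration only.
    def cheap(task: str) -> Tuple[str, float]:
        return ("complaint", 0.95) if "refund" in task else ("unknown", 0.4)

    def strong(task: str) -> Tuple[str, float]:
        return ("question", 0.99)

    print(cascade("I want a refund now", cheap, strong))
    print(cascade("Is this covered under clause 7?", cheap, strong))
```

An escalated task is billed twice, once per model, so the pattern pays off only when the cheap model resolves most of the traffic on its own.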
The math of scale: small numbers that become large bills
The trap in AI economics is that everything looks cheap at the unit level and expensive at the aggregate level. A single call costing a third of a cent feels like nothing. The same call firing fifty thousand times a day comes to roughly five thousand dollars a month, and a careless version of that call, with an uncached prefix and an oversized model, can be five or ten times higher for identical output. The cost is not in any one call. It is in the multiplication.
Walk through a concrete case. Suppose an assistant answers customer questions, and each call carries a two thousand token system prompt plus context, a two hundred token question, and a three hundred token answer. Naively, every call pays for two thousand two hundred input tokens and three hundred output tokens. Now apply the levers. Caching the stable two thousand token prefix drops most of that input to the cached rate, cutting the input bill by a large fraction. Routing the easy eighty percent of questions to a smaller model cuts their per-call price by an order of magnitude. Together, on the same traffic and the same answers, the monthly bill can fall by well over half, sometimes far more, with no visible change to the user.
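Here is the same case worked as arithmetic. The token counts come from the example above; every price is a hypothetical placeholder, including the assumption that cached input costs a tenth of normal input and that the small model is ten times cheaper across the board. The sketch also assumes every call hits the cache and ignores any surcharge a provider may apply when the cache is first written.

```python
PREFIX_TOKENS = 2000    # stable system prompt plus context
QUESTION_TOKENS = 200   # changes on every call
ANSWER_TOKENS = 300     # output

# Dollars per million tokens. Hypothetical prices, not real rates.
LARGE = {"input": 3.00, "cached": 0.30, "output": 15.00}
SMALL = {"input": 0.30, "cached": 0.03, "output": 1.50}


def per_call_cost(price: dict, use_cache: bool) -> float:
    """Cost of one call, billing the prefix at the cached rate when use_cache is True."""
    prefix_rate = price["cached"] if use_cache else price["input"]
    return (
        PREFIX_TOKENS * prefix_rate
        + QUESTION_TOKENS * price["input"]
        + ANSWER_TOKENS * price["output"]
    ) / 1_000_000


naive = per_call_cost(LARGE, use_cache=False)

# Cache everywhere, and route the easy 80% of questions to the small model.
optimized = 0.8 * per_call_cost(SMALL, use_cache=True) + 0.2 * per_call_cost(LARGE, use_cache=True)

savings = 1 - optimized / naive

if __name__ == "__main__":
    print(f"naive per call:     ${naive:.6f}")
    print(f"optimized per call: ${optimized:.6f}")
    print(f"reduction:          {savings:.0%}")
```

With these made-up prices the blended cost per call comes out at roughly 14 percent of the naive cost. Real savings depend on actual prices, the real cache hit rate, and how much traffic the smaller model can truly handle.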
At scale, the difference between a thoughtful AI architecture and a careless one is not ten percent. It is often the difference between a bill you forget about and a bill that ends the project.
This is why metering from day one matters so much. You cannot optimize what you cannot see. The teams that keep AI token cost under control are the ones who track tokens per feature, per model, and per customer from the first day in production, so a runaway prompt or a misrouted model shows up as a line on a dashboard instead of a shock on an invoice three weeks later. Industry analysts at Gartner and McKinsey have repeatedly noted that the gap between AI pilots and durable production value often comes down to operational discipline rather than model quality, and cost observability is a large part of that discipline.
Hidden token costs that never make the estimate
The naive cost model counts the obvious tokens and misses the ones that quietly dominate the bill. Before you size a budget, price these, because each one is a place real spend hides:
- Conversation history. In a chat interface, every turn re-sends the entire prior exchange as input. A twenty-turn conversation pays for the first message twenty times, so the total input cost of a session grows roughly with the square of its length unless you summarize or trim old turns.
- Retrieved context. Retrieval-augmented systems paste documents into the prompt. Retrieve ten long passages when two would do and you have multiplied the input cost of every single call for accuracy you did not need.
- Tool and function definitions. Agents ship the full schema of every tool they can call on every request. A dozen verbose tool definitions can outweigh the user's actual question several times over, and they ride along on every turn.
- Retries and failures. A call that errors, times out, or returns malformed output and gets retried is billed for every attempt. A flaky integration can silently double its own token cost without anyone noticing the duplication.
- Reasoning and verbosity. Models that think step by step, or that are simply chatty, generate far more output tokens than a terse answer needs. Asking for a paragraph when a single label would do pays the expensive output rate for words nobody reads.
None of these are exotic. They are the default behavior of most AI systems, and each one is fixable: summarize history, retrieve less and rank better, prune unused tools, cap and monitor retries, and ask for concise structured output when that is all you need. The savings from tightening these are often larger than the savings from switching models, because they attack the token count itself rather than the price per token.
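The conversation history item is the easiest of these to see in numbers. The sketch below is a simplified model, assuming every turn adds the same number of tokens and every call re-sends all the turns it keeps; the function name and the fixed window are illustrative, and real systems often summarize old turns rather than simply dropping them.

```python
from typing import Optional


def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            keep_last: Optional[int] = None) -> int:
    """Total input tokens billed across a whole chat session.

    On turn n the call re-sends n turns of history, or at most `keep_last`
    turns when a trimming window is applied.
    """
    total = 0
    for n in range(1, turns + 1):
        sent_turns = n if keep_last is None else min(n, keep_last)
        total += sent_turns * tokens_per_turn
    return total


if __name__ == "__main__":
    print(cumulative_input_tokens(20, 100))               # full history
    print(cumulative_input_tokens(40, 100))               # double the turns
    print(cumulative_input_tokens(20, 100, keep_last=4))  # trimmed window
```

Doubling the session length roughly quadruples the untrimmed total, while a fixed window keeps growth linear. The cost of trimming is that the model loses whatever was in the dropped turns.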
Building a cost-control discipline that survives scale
Controlling AI token cost is not a one-time optimization. It is an operating habit, the same way you would treat cloud spend or any other usage-based bill. The mechanics are not complicated, but they have to be deliberate, and they have to be in place before the bill teaches you the hard way. A workable discipline looks like this:
- Meter everything from day one. Record input, cached, and output tokens for every call, tagged by feature and model. Without this, every cost conversation is a guess, and guesses lose to whoever sounds most confident.
- Set budgets and alerts. Put a ceiling on spend per workspace, per feature, or per customer, and alert when a workload crosses it. A runaway loop should trip a wire, not run all weekend on your dime.
- Cache by default, not by exception. Treat a cacheable prefix that is not being cached as a defect. The cheapest token is the one you only pay full price for once.
- Right-size every model choice. Default to the cheapest model that passes your quality bar for each task, and escalate deliberately, not reflexively. Re-check the choice as cheaper models keep getting better.
- Trim the prompt. Shorter system prompts, leaner context, fewer tools in scope, concise output formats. Every token you remove is removed from every future call, so the saving compounds.
- Review the bill like a real line item. Look at token spend monthly the way you look at payroll or rent. Features that cost more than they return get fixed or cut, and the ones that earn their keep get more room.
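The first two habits on that list, metering and budget alerts, fit in a small in-memory sketch. The class and field names are hypothetical and the budget is a plain dollar ceiling per feature; a production version would persist the records and send alerts somewhere a human will see them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Usage:
    """One call's usage record (field names are illustrative)."""
    feature: str
    model: str
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    cost_usd: float


class Meter:
    """Tracks spend per (feature, model) and flags features over budget."""

    def __init__(self, budgets_usd: Dict[str, float]):
        self.budgets_usd = budgets_usd
        self.spend_usd = defaultdict(float)  # keyed by (feature, model)
        self.alerts: List[str] = []

    def feature_spend(self, feature: str) -> float:
        return sum(v for (f, _), v in self.spend_usd.items() if f == feature)

    def record(self, usage: Usage) -> None:
        self.spend_usd[(usage.feature, usage.model)] += usage.cost_usd
        budget = self.budgets_usd.get(usage.feature)
        total = self.feature_spend(usage.feature)
        if budget is not None and total > budget:
            self.alerts.append(
                f"{usage.feature} over budget: ${total:.2f} > ${budget:.2f}"
            )


if __name__ == "__main__":
    meter = Meter(budgets_usd={"ticket-summaries": 1.00})
    meter.record(Usage("ticket-summaries", "small", 2200, 2000, 300, 0.60))
    meter.record(Usage("ticket-summaries", "large", 2200, 2000, 300, 0.60))
    print(meter.alerts)
```

Tagging every record by feature and model is what lets a runaway prompt show up as one line instead of a vague increase in the total.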
The teams that get this right are rarely the ones with the biggest model budget. They are the ones who treated tokens as a measurable, steerable cost from the start, cached aggressively, routed work to the cheapest capable model, and watched the meter. Done well, AI token cost becomes one of the most controllable lines in a software budget, because almost all of it is in your hands: how much context you send, how often you send it, and which model reads it.
That control gets dramatically easier when your data, your documents, and your automation live in one place instead of being stitched across separate tools that each re-send the same context and bill you for the seams. An AI-native workspace can cache the company context once, route each task to the right model, and meter spend per workspace out of the box. If you want to see how that consolidation lowers the cost side of the equation, explore the use cases, compare plans on pricing, or start free on sign up.
Sources
- OpenAI, API pricing and prompt caching documentation
- Anthropic, model pricing and prompt caching
- McKinsey, The state of AI: adoption, value, and scaling beyond pilots
- Gartner, managing the cost and operational risk of generative AI
- Stanford HAI, AI Index Report on model cost and performance trends
- a16z, the economics of running large language models in production
- Harvard Business Review, budgeting for usage-based AI infrastructure