Microsoft and Uber have both hit the token wall. The response is not to stop using AI. It is to start engineering it.
It is May 2026, and Fortune's coverage of the unfolding AI cost crisis tells a story most organizations will recognize before long: agentic workflows consume far more tokens per session than single-turn completions, and the unit economics that looked reasonable at the pilot stage stop working at the adoption stage. Uber's CTO told The Information the company burned through its entire 2026 AI coding tools budget in four months, with per-engineer monthly API costs running between $500 and $2,000 as adoption spread across a 5,000-engineer organization. Microsoft responded by mandating a switch to its own GitHub Copilot CLI by June 30. Unlike Microsoft, most organizations do not have a frontier model of their own to redirect employees toward — which means the cost problem is not a negotiating position. It is a constraint.
As Greg Herlein has laid out, much more intentional budgeting will be required moving forward for organizations that want to stay solvent as token consumption scales with capability.
If AI capability is necessary to future competitiveness, and the cost of access is rising faster than value is being demonstrated, one productive engineering response is to question where inference happens at all.
We have been working on a project analyzing a corpus of roughly 430,000 operational support tickets — spanning years of institutional history — to extract procedural knowledge and generate standardized documentation. The project required two distinct LLM tasks, and we treated them as separate selection problems, because they were.
Classification required screening 155,000 candidate tickets as in-scope or out-of-scope for further processing: binary yes/no, structured JSON output, run across the full corpus. The selection criteria were accuracy on the tickets that mattered, throughput sufficient to complete the run over a long unattended weekend, and cost. We evaluated models running locally on a MacBook M4 Pro with 24GB of RAM against Claude Haiku. The results were instructive: a local Qwen 3.5 4B model — running at zero cost — achieved 89% recall on the operationally significant tickets, compared to 79.5% for Haiku. Haiku was twice as fast but markedly less accurate on exactly the tickets we most needed to catch. A free local model, running unattended over four days, delivered a better answer for this task. Total cost: $0.
JPMorgan's Jamie Dimon has observed publicly that with generative AI, the answer can vary each time you ask the same question — often cited as a reason for caution. It is worth tempering that concern with a clear understanding of what you are actually asking the AI to do. Classification is opinion work: does this ticket belong to this category? When the task is probabilistic judgment rather than exact calculation, what matters is whose aggregate opinion is more accurate, not whether every individual answer is identical. The local model's opinion, measured objectively across 200 test tickets, was simply better.
One technical note worth flagging for practitioners attempting this: the Qwen 3 and 3.5 model families default to a chain-of-thought "thinking" mode that produces narrative output instead of JSON. Suppressing it via a /no_think prefix is unreliable. The correct fix is passing "think": false directly in the Ollama /api/chat request body — reliable, zero performance cost, and essential for structured output tasks with these models.
Extraction pointed in the opposite direction. For each qualifying ticket, we needed the model to read a full conversation thread and produce structured procedure documentation — not categorize the content, but understand what had actually happened, identify what had gone wrong, and synthesize observations into operational lessons. The local models produced taxonomies: labels that could tag a ticket. The frontier model produced guidance: sentences that could appear in a procedure document and actually teach someone what to watch for. The distinction is decisive. We selected Claude Sonnet for extraction.
To contain that cost, we applied a saturation sampling strategy: extract in batches of roughly 500 tickets per transaction type, synthesize each batch into a procedure document, measure how much new content the next batch contributes, and stop when marginal contribution falls below a threshold. Procedure documents saturate well before you exhaust the corpus — the three hundredth ticket of a given type adds very little that the fiftieth did not. Estimated cost to saturation: $100–$150 and a few hours of processing, against a full-corpus cost of approximately $1,600 and weeks of continuous compute for little if any incremental gain. That is a deliberate engineering tradeoff with a clearly positive outcome, and it only exists because the classification step was done first and done well. The local model was not a compromise. It was the right tool for its task, and it created the conditions under which the expensive model could be used sparingly on the work that genuinely required it.
The design principle that emerges: use local models for tasks that require matching — pattern recognition, classification, schema-constrained output where the decision boundary is well-defined and precision can be measured. Use frontier models for tasks that require understanding — when value lies in synthesizing what was actually happening in human communication, and when the output will be read by people and used to make decisions. The boundary is not about context length, token count, or model size. It is about whether the task requires the model to produce insight, or merely recognize a pattern.
Organizations will face this constraint. The ones that treat it as an engineering problem — rather than a budget request or a vendor conversation — are the ones that will build AI-assisted workflows that actually hold up at scale.