Reading the Wind

Token spend is a new line item every company has, and no one can tell you what it's buying.

Jun 12, 2026

AI costs are getting out of hand. I recently spent time with a group of impressive CTOs running tech teams of 30 to a few hundred, and they’re spending anywhere from $50k to $500k per month on tokens. It seems to scale roughly linearly with team size. Meanwhile, most of my portfolio companies (I’m an early stage investor) are still burning through lab credits and haven’t yet felt the pain.

It reminds me of getting $100k in AWS credits as a founder ten years ago. Back then I used to think, “that’s how they get you hooked.” Cloud spend was mandatory. An ad tech company spending hundreds of thousands a month on infra had no alternative; that was the cost of existing. Token spend, by contrast, shows up as discretionary: a net new, purely additive line item that didn’t exist for any company two years ago. But the dependency has become total. Once your team’s workflow rebuilds itself around these tools, there’s no going back. Nobody handwrites code anymore for the same reason nobody writes assembly. Writing code by hand has become archaic; you simply don’t have to do it anymore. We’ve climbed the ladder of abstraction and we can think about higher level systems now.

But then how do you deal with this new cost structure? What happens when you blow through your Claude Opus budget on day 25 of the month? You can’t dial usage down without sacrificing work quality. Drop to Sonnet, maybe. Or an open source model.

How big is that downgrade? In theory, benchmarks tell us. In practice, I suspect SWE-bench suffers from the same problem LeetCode does for interviewing engineers: it’s just not representative of the actual work. I’ve never had to implement an LRU cache or a complex backtracking algorithm at any of my companies. Have you?

What every company actually needs is its own benchmark: evals built on its own codebase, its own tickets, its own definition of done. If you had that, the quality dial would become rational: you could measure exactly what dropping from a frontier model to an open source one costs you, and route accordingly. Maybe open source models are perfectly sufficient for 80% of the work. Maybe they’re not. Today, nobody knows, because evals are still unsolved for most complex knowledge work. The heuristics I hear, like the fraction of PRs that get merged, are rough at best.
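
To make the "route accordingly" idea concrete, here is a minimal sketch of what the routing step could look like once you have eval results. Everything in it is an assumption for illustration: the `EvalResult` record, the `summarize` and `route` helpers, the category and model labels, and the single pass/fail notion of "done." It does nothing about the hard part, which is producing trustworthy pass/fail results in the first place.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One attempt by one model at one in-house eval task (hypothetical schema)."""
    model: str       # illustrative label, e.g. "frontier" or "open"
    category: str    # kind of work, e.g. "bugfix" or "migration"
    passed: bool     # did it meet the team's own definition of done?
    cost_usd: float  # token cost of the attempt


def summarize(results):
    """Return {(category, model): (pass_rate, avg_cost_usd)}."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r.category, r.model)].append(r)
    summary = {}
    for key, rs in buckets.items():
        pass_rate = sum(r.passed for r in rs) / len(rs)
        avg_cost = sum(r.cost_usd for r in rs) / len(rs)
        summary[key] = (pass_rate, avg_cost)
    return summary


def route(summary, min_pass_rate):
    """Pick, per category, the cheapest model that clears the quality bar.

    If no model clears the bar, fall back to the one with the highest
    pass rate for that category.
    """
    by_category = defaultdict(list)
    for (category, model), (pass_rate, avg_cost) in summary.items():
        by_category[category].append((model, pass_rate, avg_cost))
    routing = {}
    for category, rows in by_category.items():
        good_enough = [row for row in rows if row[1] >= min_pass_rate]
        if good_enough:
            routing[category] = min(good_enough, key=lambda row: row[2])[0]
        else:
            routing[category] = max(rows, key=lambda row: row[1])[0]
    return routing
```

Calling `route(summarize(results), min_pass_rate=0.9)` returns a mapping from work category to model label. The threshold is the quality dial: raise it and more work goes to the expensive model, lower it and more goes to the cheap one, and either way you can see what the trade costs.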

So we have an irreversible dependency, a spend line growing linearly with headcount, and no way to measure what we’re buying. The companies that build their own evals first won’t just spend less; they’ll be the only ones who know what they’re paying for. Everyone else is licking a finger and reading the wind.

