the four month problem
Uber reportedly burned through its entire 2026 AI coding budget in four months, and the part that should make every CTO uncomfortable is what they got for it, which by the accounts going around was roughly nothing measurable in terms of projects actually shipped. Not slower, not faster, just a much bigger bill. I keep coming back to that because it maps almost perfectly onto what we see when we get called in to clean up after a team has gone all in on agentic coding tools and is now wondering why their lead times haven't moved despite a Cursor invoice that looks like a second AWS account.
The number that's been circulating, and I have no way to independently audit it but it tracks with my gut, is that companies are spending something like 44 percent of their AI tokens fixing bugs that the AI itself introduced. Read that twice. Nearly half the spend is the tool cleaning up after the tool. If a junior contractor wrote code and then billed you again at the same rate to fix the defects in that same code, you would fire them inside a week, and yet because the token meter ticks invisibly in the background nobody flags it.
James Shore put the cleanest version of the argument out there and it went around for a reason. Doubling your writing speed only helps if you have also halved your maintenance cost, and absolutely nobody has halved their maintenance cost. Writing was never the expensive part of software. Reading, understanding, debugging, and safely changing code that already exists, that's where the money goes, and AI tools optimize the cheap step while quietly inflating the expensive one.
velocity is measuring the wrong thing
Every velocity dashboard I've seen a founder proudly screen-share since the start of the year measures something adjacent to value but not value itself. Lines of code, pull requests merged, story points closed, commits per engineer per week. These were always weak proxies, and they were tolerable when a human had to type every character because the typing was at least correlated with thought. That correlation is now broken. An agent can open fourteen pull requests before lunch and your PR-per-week chart goes vertical while your actual product moves not one inch closer to anything a customer wanted.
What happens next is the genuinely dangerous part, and it's organizational, not technical. The VP of Engineering sees the chart going up and reports velocity upward to the board, the board sees velocity going up and approves more AI seats, the engineers on the ground know the codebase is getting harder to work in but the metric says everything is great so nobody wants to be the person killing the vibe. You have built a closed loop that rewards token throughput and punishes anyone who points out that throughput isn't output.
We had a client, a mid-size SaaS doing document processing, whose merged-PR count had roughly doubled year over year and whose actual feature delivery to customers had, when we counted the things their users could now do that they couldn't before, gone slightly down. The extra PRs were refactors of AI-generated code, reverts of AI-generated regressions, and tests written to cover behavior nobody had decided on. The chart was a fiction the whole company had agreed to believe.
where the debt actually hides
Maintenance debt from AI-written code doesn't show up where you'd look for it. It doesn't announce itself in a sprint retro. It accumulates in the half-second of extra hesitation every engineer feels before touching a file that an agent generated three months ago, the file nobody fully understands because nobody wrote it, the file with the slightly-wrong abstraction that technically works and passes its tests but models the domain in a way that fights every new requirement.
The defects also cluster differently. Human bugs tend to be local and dumb, an off-by-one, a missed null check, the kind of thing a code review catches. AI bugs tend to be plausible, confident, and structural, an entire error-handling strategy that silently swallows exceptions, a caching layer that looks idiomatic and is subtly racy under load, a database migration that's syntactically perfect and semantically catastrophic. These are exactly the defects that pass review because they look like what good code looks like, and they're the most expensive class of bug there is because you find them in production, weeks later, when the context is gone.
We ran a security audit recently on a customer portal built largely with an agentic tool and found three separate places where user input flowed into a query without parameterization, each one wrapped in code clean enough that two human reviewers had waved it through. The AI had written SQL injection vulnerabilities in a confident, well-commented, beautifully formatted way. That's the texture of the debt. It's not messy code you can spot, it's clean code that's wrong, and clean-but-wrong is the hardest thing in the world to catch.
a measurement framework that survives contact with a board
If you're going to govern this spend you need metrics that measure value delivered and cost incurred, not characters produced, and they need to be simple enough to put in front of non-technical board members without a translation layer. Here's what we've started recommending and using internally.
Measure change failure rate and mean time to recovery, the two DORA metrics that actually capture quality, and watch them against your AI spend over time. If your deploy frequency is up but your change failure rate climbed from 8 percent to 19 percent over the same period your AI tooling rolled out, you have your answer, and that's a sentence a board understands. Track the percentage of engineering hours spent on net-new customer-facing capability versus rework, and be honest that reverts and regression fixes count as rework even when the original code was AI-generated and shipped fast.
Put a real number on cost-per-shipped-outcome. Not cost per PR, not cost per line, but take your total AI spend plus the loaded engineering hours and divide by the count of things your customers can now do that they couldn't last quarter. It's a crude number and it will be uncomfortable and that discomfort is the point, because right now most teams are reporting a denominator that's inflated by AI noise and a cost that's hidden in a SaaS invoice nobody reads.
And track read-time-to-change on your worst files, the time it takes an engineer to safely modify the gnarliest parts of the codebase. When that number creeps up across the AI-touched parts of your system, you're watching maintenance debt accrue in real time, before it shows up as missed deadlines.
what we actually tell founders to do
None of this means turn the tools off. We use them every day, on Next.js frontends and Go services and Django backends, and on a tightly scoped task with a senior engineer steering, an agentic tool genuinely earns its keep. The failure mode is treating a hundred to two hundred dollars a month per engineer as a productivity multiplier that justifies itself automatically, when it's actually a power tool that amplifies whatever judgment is already in the room. Hand it to a strong team with good review discipline and it's leverage. Hand it to a thin team under deadline pressure and it's a debt machine that ships fast and bills you twice.
So the governance move isn't a spreadsheet of token limits, it's a discipline. Require that AI-generated code clears the same review bar as human code, which in practice means slowing down review, not speeding it up, because confident-and-wrong is harder to catch than messy-and-wrong. Keep the agent on a short leash for anything touching auth, payments, migrations, or data access, the exact places we keep finding the expensive structural bugs. And report the quality metrics to your board alongside the velocity ones, because the moment change failure rate and cost-per-outcome sit next to PR count, the velocity chart stops being able to lie on its own.
The teams that win the next two years won't be the ones writing the most code. They'll be the ones who figured out, earlier than their competitors, that they were paying for the bugs and the fixes both, and decided to stop pretending the chart going up was the same thing as the business getting better.
