The METR Productivity Illusion

the number that should scare you

METR ran a randomized controlled trial with experienced open source developers working on their own repositories, the kind of codebases they'd been maintaining for years. The result that everyone skipped past in the rush to argue about it was that the developers using AI tools took 19% longer to complete their tasks. Not faster. Slower, with a confidence interval that didn't cross zero. That alone is interesting but not damning, because reasonable people can disagree about task selection, tool maturity, and whether the developers were using Cursor well. The part that should keep you up at night if you're approving tooling budgets is the perception gap. The same developers, after finishing, estimated that the AI had made them roughly 20% faster. So the lived experience was a 20% speedup and the stopwatch said a 19% slowdown, which is a 39-point swing between what happened and what people felt happened. That is not measurement noise. That is a systematic, structural distortion in how engineers perceive their own productivity when an LLM is in the loop. If your entire case for rolling out Claude Code across the team rests on developer surveys and gut feel, which is how most of these decisions actually get made, you are building on a sensor that is miscalibrated by almost forty points. We've seen this firsthand on client teams who swore Cursor had transformed their velocity and then couldn't point to a single cycle time metric that had moved.
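To make the arithmetic concrete, here is a minimal sketch. The 100-minute baseline is a made-up number for illustration, and reading "20% faster" as "20% less time" is an assumption about how the self-reports should be interpreted, not a figure taken from the study.

```python
# Illustrative assumption: a task that takes 100 minutes without AI tools.
BASELINE_MINUTES = 100.0

MEASURED_CHANGE = 0.19    # stopwatch: tasks took 19% longer with AI
PERCEIVED_CHANGE = -0.20  # self-report, assumed here to mean 20% less time


def minutes_with_ai(baseline: float, relative_change: float) -> float:
    """Completion time after applying a relative change to the baseline."""
    return baseline * (1.0 + relative_change)


measured = minutes_with_ai(BASELINE_MINUTES, MEASURED_CHANGE)
perceived = minutes_with_ai(BASELINE_MINUTES, PERCEIVED_CHANGE)

# The swing between what was felt and what was measured, in percentage points.
gap_points = round((MEASURED_CHANGE - PERCEIVED_CHANGE) * 100)

print(f"measured: {measured:.0f} min, perceived: {perceived:.0f} min, gap: {gap_points} points")
```

Under those assumptions, the same task feels like 80 minutes and actually takes about 119, and a survey only ever records the first number.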

which tasks actually get slower

The slowdown isn't uniform, and pretending it is would be as dishonest as pretending it doesn't exist. AI tools genuinely speed up a specific shape of work: the greenfield CRUD endpoint, the boilerplate React component, the regex you half-remember, the test scaffold for a function whose behavior is obvious. We use Claude Code internally for exactly this and it earns its keep. The problem is that this is not where senior engineers spend their hours. The hours go into unfamiliar codebases, where the model confidently suggests an approach that violates an invariant nobody documented. They go into cross-cutting refactors, where renaming a concept means understanding seventeen call sites, and the AI happily edits twelve of them and silently misses the five that used reflection or string-based dispatch. And they go into debugging non-deterministic failures, where the bug only shows up under load and the model keeps proposing plausible-sounding fixes that address the symptom you described rather than the race condition you haven't found yet. In each of these, the AI generates output fast and the output looks right, so you spend your time reviewing and correcting plausible code instead of writing correct code from a mental model you already hold. The deeper your existing understanding of the system, the more the AI's suggestions become a tax, because you have to load the suggestion into your head, compare it against your own model, find where it's subtly wrong, and undo it. That review loop is slower than just writing the thing, and the more senior you are, the worse the ratio gets.
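The missed call sites in a rename are easy to reproduce. Here is a minimal sketch of the string-based dispatch case; `BillingService`, `settle_payment`, and `dispatch` are hypothetical names invented for illustration, and the scenario assumes a method that used to be called `process_payment` was renamed mid-refactor.

```python
class BillingService:
    # Hypothetical refactor: this method used to be named `process_payment`.
    def settle_payment(self, amount_cents: int) -> str:
        return f"settled {amount_cents}"

    def process_refund(self, amount_cents: int) -> str:
        return f"refunded {amount_cents}"


def direct_call(service: BillingService) -> str:
    # A literal call site. Grep, the IDE, and the model can all see it,
    # so it gets updated along with the rename.
    return service.settle_payment(500)


def dispatch(service: BillingService, kind: str, amount_cents: int) -> str:
    # String-based dispatch. The method name is assembled at runtime, so the
    # old identifier `process_payment` never appears literally in this file
    # and a text-level rename has nothing to match against.
    handler = getattr(service, f"process_{kind}")
    return handler(amount_cents)


if __name__ == "__main__":
    service = BillingService()
    print(direct_call(service))              # updated call site still works
    print(dispatch(service, "refund", 100))  # untouched method still works
    try:
        dispatch(service, "payment", 500)    # the call site the rename missed
    except AttributeError as exc:
        print(f"missed call site: {exc}")
```

Nothing fails at import time or in a test that only exercises refunds. The break only appears when a `"payment"` event reaches `dispatch`, which is the kind of deferred failure a quick review of the diff will not catch.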

why the confidence is baked in

The perception gap isn't a quirk of the participants. It falls directly out of how these models work. An LLM is trained to produce the most plausible continuation, and plausible code is, by construction, code that looks like correct code. That means the typical failure mode is not a syntax error you'd catch in two seconds. It's a confidently structured function that compiles, reads cleanly, passes the happy path, and is wrong in a way that only surfaces three weeks later in production. The model rarely says "I'm not sure about this database transaction boundary." It generates the boundary with the same fluent authority it generates a print statement. Your brain, meanwhile, is running a heuristic that's served humans well for a hundred thousand years: fluent, confident, well-structured output correlates with competence. So you read the generated code, it parses smoothly, it matches the shape of code you've seen work before, and your brain files it under done. The act of reading good-looking code feels like progress in a way that staring at a blank file and thinking hard does not, even when the staring-and-thinking produces correct work faster. That's the engine of the illusion. You feel fast because the screen keeps filling up with reasonable text, and the cost of verifying whether that text is actually correct is deferred, externalized, and easy to underestimate in the moment. The dopamine of acceptance arrives immediately; the cost of the subtle bug arrives later and gets attributed to something else.
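A small, deliberately simple illustration of code that reads cleanly, passes the happy path, and is still wrong. This is a hand-written hypothetical, not captured model output, and `tag_order_buggy` and `tag_order_fixed` are made-up names. The bug is Python's shared mutable default argument, chosen because it survives a casual review.

```python
from typing import Optional


def tag_order_buggy(order_id: str, tags: list[str] = []) -> list[str]:
    # Looks fine and works on the first call. The default list is created
    # once, at function definition time, so every later call that omits
    # `tags` appends to the same shared list.
    tags.append(f"order:{order_id}")
    return tags


def tag_order_fixed(order_id: str, tags: Optional[list[str]] = None) -> list[str]:
    # Build a fresh list per call so no state leaks between orders.
    result = list(tags) if tags is not None else []
    result.append(f"order:{order_id}")
    return result
```

A single-call test passes for both versions. Only the second call to `tag_order_buggy` returns tags belonging to a previous order, and by then the function has long been filed under done.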

what to actually measure

If you're a CTO signing off on per-seat AI spend, stop running developer satisfaction surveys as your primary signal. METR just showed you that signal is off by 39 points, and your engineers will tell you they're flying right up until the cycle time data says otherwise. Measure the boring things instead. Track cycle time from first commit to merged PR, broken down by task type if you can tag them, and compare a cohort using the tools against one that isn't, on comparable work, over at least a quarter, because two weeks of honeymoon data tells you nothing. Watch the rework rate, specifically the number of PRs that get reverted or hot-fixed within seven days of merge, because that's where the deferred cost of plausible-but-wrong code shows up. Look at review time per PR, since a real productivity win should not be quietly transferring hours from the author to the reviewer. Pay attention to where the wins concentrate. On our own projects the honest answer is that AI tooling is a clear win for the junior-to-mid work of scaffolding, test generation, and exploring an unfamiliar API surface, and a wash or a loss on the senior work of refactoring a payment pipeline or chasing an intermittent failure in a distributed job queue. That means the right policy probably isn't a blanket rollout or a blanket ban. It's matching the tool to the task and being ruthlessly honest about which bucket a given piece of work falls into. The teams that win with these tools are the ones that measured, not the ones that vibed.
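The three metrics above can be sketched in a few lines. This is a rough outline under stated assumptions, not a finished pipeline: the `PullRequest` record, its field names, and the `"ai"` / `"control"` cohort labels are all hypothetical, and it assumes you have already exported PR timestamps and review hours from whatever system you use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

# Rework window taken from the text: reverted or hot-fixed within seven days.
REWORK_WINDOW = timedelta(days=7)


@dataclass
class PullRequest:
    cohort: str                             # hypothetical labels: "ai" or "control"
    first_commit_at: datetime
    merged_at: datetime
    review_hours: float                     # reviewer time spent on this PR
    reworked_at: Optional[datetime] = None  # first revert or hotfix, if any


def summarize(prs: list[PullRequest]) -> dict[str, dict[str, float]]:
    """Per-cohort median cycle time, rework rate, and median review time."""
    by_cohort: dict[str, list[PullRequest]] = {}
    for pr in prs:
        by_cohort.setdefault(pr.cohort, []).append(pr)

    summary: dict[str, dict[str, float]] = {}
    for cohort, group in by_cohort.items():
        cycle_hours = [
            (pr.merged_at - pr.first_commit_at).total_seconds() / 3600
            for pr in group
        ]
        # Count only rework that landed inside the seven-day window.
        reworked = sum(
            1
            for pr in group
            if pr.reworked_at is not None
            and pr.reworked_at - pr.merged_at <= REWORK_WINDOW
        )
        summary[cohort] = {
            "median_cycle_hours": median(cycle_hours),
            "rework_rate": reworked / len(group),
            "median_review_hours": median(pr.review_hours for pr in group),
        }
    return summary
```

Medians are used here rather than means so that one PR that sat open over a holiday does not dominate the comparison. That is a design choice of this sketch, and the cohorts still need comparable work and a long enough window for the numbers to mean anything.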

the uncomfortable conclusion

None of this means AI coding tools are a scam, and I want to be precise here because the discourse keeps collapsing into either evangelism or dismissal. The tools are real, they're useful, and they're getting better fast enough that a result from a 2025 study deserves a re-run every six months. What the METR work establishes is narrower and more durable than the headline: your perception of your own AI-assisted productivity is unreliable, predictably so, and biased in the optimistic direction. Treat that the way you'd treat any miscalibrated instrument. You don't throw it out, you correct for the bias and you cross-check it against a measurement you trust. For us at steezr that's meant being deliberate about where we reach for Claude Code versus where we tell people to close the tab and think, and it's meant treating any claim of a productivity revolution, our own included, as a hypothesis to be tested against merge metrics rather than a conclusion. The senior engineers who get the most out of these tools aren't the ones who use them the most. They're the ones who developed a sharp instinct for the exact moment the AI stops accelerating them and starts quietly taxing them, and who have the discipline to stop. That instinct is hard to build precisely because the feedback signal lies to you, which is the whole problem in one sentence.
