What if the benchmark everyone uses to compare AI coding models was quietly broken the entire time?
The SWE-Bench Admission
OpenAI just published a striking admission: SWE-bench Verified, the de facto standard for measuring AI coding ability, no longer reflects frontier capability.
An audit found that nearly 60% of the tasks frontier models still fail contain flawed test cases that reject functionally correct answers. Frontier models have also memorized solutions from training data, sometimes guessing helper function names that aren't even in the problem statement.
In other words: the leaderboard score and the underlying capability have drifted apart. Models are getting better at SWE-bench in ways that no longer mean they are getting better at software engineering.
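To make that failure mode concrete, here is a minimal, made-up sketch — the function and test are invented for this post, not taken from SWE-bench. The patch is functionally correct, but the test over-specifies incidental behavior (result ordering that the problem never asked for) and rejects it anyway.

```python
# Hypothetical illustration of a flawed test rejecting a correct fix.
# (Invented for this post; not an actual SWE-bench task.)

def find_duplicates(items):
    """A valid fix: return each value that appears more than once."""
    seen, dupes = set(), set()
    for item in items:
        if item in seen:
            dupes.add(item)
        seen.add(item)
    return list(dupes)  # order is unspecified; nothing in the spec requires one

def test_find_duplicates():
    # Over-specified: asserts an exact ordering the problem never asked for.
    # A functionally correct patch that returns [3, 2] instead of [2, 3] fails here.
    assert find_duplicates([1, 2, 2, 3, 3, 2]) == [2, 3]

    # A spec-faithful check would compare contents, not order:
    # assert sorted(find_duplicates([1, 2, 2, 3, 3, 2])) == [2, 3]
```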
This Isn't Really About AI
The takeaway isn't really about AI. It's about measurement.
Every engineering leader who has ever picked a productivity metric knows this story. The moment a number becomes a target — lines of code, tickets closed, even single-dimensional DORA snapshots — people (and now models) optimize for the number instead of the underlying behavior.
Goodhart's Law doesn't care whether you're benchmarking a model or a team. As soon as a measure becomes a target, it stops being a good measure. The optimization happens whether or not the people doing it are conscious of it.
What Gaming Looks Like in Engineering Teams
You don't have to be malicious to game a metric. You just have to know it's being watched. A few patterns that show up everywhere:
- PR count goes up, PR size goes down. The work is the same; it's just sliced differently to look more productive.
- Cycle time improves because tickets are scoped smaller. Throughput rises in the dashboard; actual delivery doesn't.
- Test coverage hits the target via tests that don't really test anything (see the sketch after this list). The number is green; the safety net isn't real.
- Review turnaround improves because reviews get shallower. Faster, not better.
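To show what the coverage pattern looks like in practice, here is a minimal, made-up sketch; apply_discount and both tests are invented for illustration. The first test executes every line and lifts the coverage number but asserts nothing, so any regression still passes; the second is what an honest version of the same test looks like.

```python
# Hypothetical sketch: a test that lifts line coverage without testing behavior.
# apply_discount and both tests are invented for this post.

def apply_discount(price, percent):
    """Apply a percentage discount; callers rely on this never going negative."""
    return max(price * (1 - percent / 100), 0)

def test_apply_discount_runs():
    # Every line of apply_discount executes, so coverage goes up...
    apply_discount(100, 10)
    apply_discount(100, 150)
    # ...but with no assertions, a broken implementation still passes.
    # Green number, no safety net.

def test_apply_discount_behaviour():
    # What an honest version of the same test looks like.
    assert apply_discount(100, 10) == 90
    assert apply_discount(100, 150) == 0  # over-discount is clamped, not negative
```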
None of these are bad engineers being dishonest. They are normal humans responding to what the organization has signaled it values. The SWE-bench story is the same pattern at a much larger scale.
The Fix Is the Same in Both Worlds
The fix is the same whether you're evaluating models or engineering teams:
- Use multiple signals, not a single headline number. Any single metric, taken in isolation, will be optimized in ways that erode the thing you actually care about. A handful of signals viewed together is much harder to game accidentally.
- Refresh evaluations as the system evolves. A metric that was useful a year ago may now be measuring something the team has already optimized past. Stale metrics drift into noise.
- Pair public benchmarks with private, context-specific ones that reflect your actual work. Industry-standard metrics give comparability. Internal metrics tied to your real delivery patterns give validity. You need both.
- Be honest about what the metric does and doesn't capture. Every measurement has blind spots. The leaders who acknowledge them out loud build teams that take metrics seriously without weaponizing them.
What "Ungameable" Actually Means
A benchmark you can't game is a benchmark that respects how messy real engineering work actually is. The same is true of the metrics we use to understand our own teams.
That doesn't mean abandoning metrics. It means choosing ones that stay informative as people start paying attention to them — and being willing to evolve them when they stop.
SWE-bench will get fixed or replaced. The pattern it just demonstrated will not. Every metric your team relies on is on the same trajectory: the longer it stays a target, the less it tells you about reality. Plan for that, and the measurements you keep will actually mean something.