The most interesting thing about the latest wave of AI coding agents is not the code they write. It is the code they are making invisible.
OpenAI just expanded Codex into a general-purpose desktop agent. Anthropic's Claude is moving in the same direction. These tools now automate spreadsheets, summarize Slack threads, schedule social media, and fix broken Linux servers — all through the same interface developers use to ship features.
The Boundary Between Developer Tools and Knowledge Work Is Dissolving
Here is what engineering leaders should be paying attention to. The boundary between developer tools and knowledge work tools is dissolving. When a CEO with no programming background can orchestrate a multi-week content campaign in two hours using a coding agent, we need to ask what developer productivity even measures anymore.
This is not a hypothetical shift. It is happening in practice, and the implications for how we evaluate engineering teams are significant. The output metrics we have relied on for years were built around an implicit assumption: that developers are the ones doing developer work. When that assumption breaks, the metrics break with it.
The Community Is Split — And Both Sides Are Right
Power users report 10x gains on tasks that previously took weeks. Skeptics point out that these agents read sensitive files without asking, drain laptop batteries, and occasionally delete OS user profiles. Both sides are right.
This is the shape of every major tooling shift in software: uneven, contested, and genuinely capable of both enormous gains and unexpected failures. Treating it as a single narrative (AI is great / AI is dangerous) obscures what leaders actually need to reason about — which tasks benefit, which introduce risk, and what the organizational cost looks like at scale.
The Measurement Gap
What concerns me most is the measurement gap. Teams are adopting these tools faster than their engineering metrics can account for.
Cycle time, deployment frequency, and code review throughput were designed for a world where humans wrote and reviewed every line. When agents generate code, run tests, and open pull requests autonomously, your DORA metrics might look stellar while actual system reliability degrades.
That is the trap. The dashboard shows green. The velocity numbers keep climbing. But underneath, the relationship between the numbers and the health of your system has quietly broken. You are measuring the wrong thing with more precision than ever.
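To make the trap concrete, here is a toy sketch in Python. The data shape is invented for illustration (fields like author_type and caused_incident are assumptions, not any particular platform's schema), but the point survives: the first function is the throughput view most dashboards report, and the second asks the same records where the failures are coming from.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical shape of a merged change. Field names are illustrative,
# not any specific tool's API.
@dataclass
class Change:
    opened: datetime
    deployed: datetime
    author_type: str        # "human" or "agent" (an assumption: you can attribute this)
    caused_incident: bool

def velocity_metrics(changes: list[Change], window_days: int = 30) -> dict:
    """Classic throughput view: how much shipped, how fast."""
    cycle_times = sorted(
        (c.deployed - c.opened).total_seconds() / 3600 for c in changes
    )
    return {
        "deploys_per_week": len(changes) / (window_days / 7),
        "median_cycle_time_hours": cycle_times[len(cycle_times) // 2],
    }

def reliability_by_author(changes: list[Change]) -> dict:
    """Same records, different question: where are the failures coming from?"""
    out = {}
    for kind in ("human", "agent"):
        subset = [c for c in changes if c.author_type == kind]
        if subset:
            out[kind] = {
                "changes": len(subset),
                "change_failure_rate": sum(c.caused_incident for c in subset) / len(subset),
            }
    return out
```

Both views read the same data. Deploys per week can keep climbing while the agent-attributed failure rate climbs with it, and nothing in the first view will tell you.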
Velocity Is Not Value
The organizations that win here will not be the fastest adopters. They will be the ones who adapt their measurement frameworks to distinguish between velocity and value.
This is exactly where things get tricky. If AI agents are contributing to delivery, reviewing code, and moving work forward, then measuring output alone stops being useful. You need to understand how work happens, not just how fast it moves.
That means looking beyond traditional metrics toward questions like these (a rough sketch of how you might instrument them follows the list):
- Contribution patterns. Who (or what) is generating the work? Who is reviewing it? How is responsibility distributed across humans and agents?
- Feedback loops. How quickly do issues surface? When agent-generated code breaks production, how long until the team knows and responds?
- How teams actually operate in practice. Where are humans still adding judgment? Where has that judgment been quietly outsourced to tools? Is that outsourcing working?
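Here is a minimal, hypothetical sketch of what instrumenting those questions could look like. The WorkItem fields, the assumption that a change can be reliably labeled as agent-authored, and the zero-review threshold are all placeholders for whatever your tooling actually exposes; treat it as a shape to adapt, not an implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional

# Illustrative record of one unit of work. Names like "agent_authored" and
# "human_review_minutes" are assumptions for this sketch, not a standard schema.
@dataclass
class WorkItem:
    agent_authored: bool
    human_review_minutes: float          # 0 if merged without human review
    broke_production: bool
    deployed_at: Optional[datetime] = None
    issue_detected_at: Optional[datetime] = None

def contribution_patterns(items: list[WorkItem]) -> dict:
    """Who (or what) generates the work, and how much human judgment touches it."""
    agent_items = [i for i in items if i.agent_authored]
    return {
        "agent_share_of_changes": len(agent_items) / len(items),
        "agent_changes_with_no_human_review": sum(
            1 for i in agent_items if i.human_review_minutes == 0
        ),
    }

def feedback_loop(items: list[WorkItem]) -> dict:
    """How long it takes the team to notice when agent-generated work breaks."""
    detection_hours = [
        (i.issue_detected_at - i.deployed_at).total_seconds() / 3600
        for i in items
        if i.agent_authored and i.broke_production
        and i.deployed_at is not None and i.issue_detected_at is not None
    ]
    return {
        "median_hours_to_detect_agent_breakage": median(detection_hours) if detection_hours else None
    }
```

The arithmetic is trivial; the hard part is the attribution. If your pipeline cannot distinguish agent-generated changes from human ones, none of these questions are answerable, which is itself worth knowing.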
The Shift in What Engineering Leadership Means
When AI agents take on more of the execution, the role of engineering leadership shifts too. Speed becomes cheap. Coherence becomes expensive. Maintaining a clear picture of who is doing what, and whether that work is holding together as a system, is harder than ever — and more valuable than ever.
The leaders who treat this as a measurement problem rather than a tooling problem will build teams that stay coherent as the work gets faster. The ones who keep reporting the same old metrics will watch their dashboards light up green while their systems quietly degrade.
If AI is doing the work, the question is not how much work is getting done. The question is whether the work still adds up to something you can ship, maintain, and trust.