Product · February 5, 2026 · 6 min read

Human-in-the-Loop AI: Why We'll Never Auto-Merge a Fix

IncidentPilot generates pull requests and writes root cause analyses. But it never merges autonomously. This is an intentional design decision — and it matters more than you might think.


Sam Okafor

Head of AI at IncidentPilot. ML researcher turned product engineer.

Every week, someone asks us: "Why doesn't IncidentPilot just merge the PR? You've already done the investigation. You've already written the fix. Why make us approve it?"

It's a fair question. From a purely technical standpoint, auto-merging is straightforward. We have the credentials. We create the PR. We could merge it in the same API call. But we don't. And we won't. Here's why.

AI confidence is not the same as correctness

Language models are remarkably good at appearing confident. They produce fluent, authoritative-sounding text even when they're wrong. This is a well-documented property of current AI systems, and it's particularly dangerous in high-stakes contexts like production deployments.

IncidentPilot's root cause analyses are accurate the majority of the time. But "the majority of the time" is not good enough for autonomous production changes. Even a 5% error rate — which would be considered excellent for many AI tasks — means that 1 in 20 auto-merged PRs would be based on a wrong diagnosis. In a system handling thousands of incidents per year, that's dozens of bad changes pushed to production automatically.
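The scaling argument above can be sketched in a few lines. The incident volume and error rate here are illustrative figures in the spirit of the post, not measured IncidentPilot numbers.

```python
# Back-of-envelope math: even a small per-incident error rate
# compounds at scale. All numbers are illustrative assumptions.

def expected_bad_merges(incidents_per_year: int, error_rate: float) -> float:
    """Expected number of auto-merged PRs built on a wrong diagnosis."""
    return incidents_per_year * error_rate

# A 5% error rate over a hypothetical 1,000 incidents/year:
print(expected_bad_merges(1000, 0.05))  # → 50.0 bad changes in production
```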

The asymmetry of harm

There's a fundamental asymmetry between the cost of requiring human review and the cost of an incorrect auto-merge:

  • Cost of human review: 5–15 minutes of an engineer's time to read the RCA, validate the fix, and click merge.
  • Cost of an incorrect auto-merge: a new production incident, potential data corruption, customer impact, and the overhead of investigating and reverting a change that was supposed to be a fix.

The expected value calculation strongly favors human review, especially given the stakes involved in production systems.
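That expected-value comparison can be made concrete with a toy model. All costs and rates below are hypothetical placeholders, and the model simplifies by assuming human review catches every wrong diagnosis before merge.

```python
# A hedged sketch of the expected-cost asymmetry described above.
# Every figure here is a hypothetical assumption, not product data.

REVIEW_MINUTES = 10          # ~5-15 min of engineer time per PR
INCIDENT_COST_MINUTES = 600  # a new incident: investigation, revert, impact
ERROR_RATE = 0.05            # per-PR chance the diagnosis is wrong

def expected_cost_with_review() -> float:
    # Every PR pays the review cost; errors are assumed caught pre-merge.
    return REVIEW_MINUTES

def expected_cost_auto_merge() -> float:
    # No review cost, but each PR carries an expected incident cost.
    return ERROR_RATE * INCIDENT_COST_MINUTES

print(expected_cost_with_review())  # → 10
print(expected_cost_auto_merge())   # → 30.0
```

Even with these generous assumptions for the auto-merge side, the expected cost per PR favors review, and the gap widens as the cost of a bad production change grows.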

Who owns the system?

There's a deeper reason beyond accuracy rates. When an autonomous system makes a change that causes harm, accountability becomes murky. Who is responsible? The engineer who set up the integration? The vendor who built the AI? The manager who approved adopting the system in the first place?

By keeping humans in the loop at the point of production change, the accountability model remains clear. An engineer reads the RCA, validates the fix, and approves the merge. That engineer is accountable for that decision — and they have the full context to make it well, because IncidentPilot has done all the investigation work for them.

Automation should eliminate toil, not accountability. The engineer who clicks merge has the information, the context, and the judgment to own that decision. That's exactly as it should be.

The speed argument doesn't hold

Some argue that the value of autonomous merging is speed — that seconds matter in a production incident. We'd push back on this. For most incidents, the bottleneck isn't the time to merge a PR. It's the time to understand the problem, identify the fix, and build confidence that the fix won't make things worse.

IncidentPilot compresses that bottleneck dramatically: 40 minutes of investigation becomes reading a 2-page RCA. The five minutes it then takes to review and merge the PR is not the constraint.

For the rare cases where every second genuinely matters — large-scale outages with significant customer impact — the appropriate response is a rollback of the entire release, not an AI-generated patch pushed autonomously.

What we're optimizing for

We're optimizing for trust. For IncidentPilot to become a durable part of how engineering teams work, engineers need to trust it. That trust is built through transparency (showing all the evidence behind an RCA), accuracy (being honest about uncertainty), and appropriate scope (not doing things the system isn't qualified to do autonomously).

Auto-merging would make for a more impressive demo. It would generate more buzz. But it would also erode the foundational trust that makes the product genuinely useful in the long run. That's a trade-off we're not willing to make.