It's 2:47am. Your phone buzzes. Sentry just fired a P1 alert — NullPointerException in the checkout service, 200 events per minute, and it's trending up. You open your laptop, eyes barely focusing, and begin the ritual.
First, you find the Sentry issue. You read the stack trace. You look for context. Which release? Which environment? How long has this been happening? You open GitHub and scroll through recent commits. You check the deploy history. You ping the on-call channel. You ask if anyone made changes to the checkout flow recently.
Forty-five minutes later, you find it: a one-line change in a PR merged six hours ago that introduced a null return in CartService.getUser(). You write a hotfix, open a PR, get it reviewed, merge it, and watch the error rate drop. You close your laptop at 4am.
The hidden cost of manual triage
What I described above is the standard incident response playbook at most engineering teams. And the cost is enormous — not just in lost sleep, but in cognitive overhead, context-switching, and the compound effect on feature delivery.
We surveyed 200 engineering teams across companies ranging from 10-person startups to 500-person scale-ups. The findings were stark:
- Engineers spend an average of 4.3 hours per week on incident investigation — not resolution, just investigation.
- 62% report that on-call rotations cause measurable anxiety, even when no incidents occur.
- After a middle-of-the-night incident, engineers report 40–60% lower productivity the following day.
- Teams with frequent incidents ship 23% fewer features per quarter than comparable teams with few incidents.
Why the current tooling doesn't help
Observability tools like Sentry, Datadog, and Grafana are excellent at detecting and surfacing problems. But they stop there. They tell you that something is broken. They don't tell you why, what changed, who's responsible, or what to do next. That gap is where engineering hours are lost.
Runbooks help for known failure modes, but novel incidents require investigation. Postmortems document what happened after the fact but don't speed up the investigation in the moment. On-call escalation paths get humans involved — but only after context-switching them out of whatever they were doing.
What AI changes
Modern large language models, combined with API access to your existing tooling, can close this gap. Not by replacing engineers — but by doing the tedious, time-consuming investigation work that currently falls on them.
Here's what a typical IncidentPilot investigation looks like for the same 2am checkout incident:
- Sentry webhook fires at 2:47am. IncidentPilot receives it immediately.
- Stack trace is parsed. The failing code path is identified: CartService.getUser() → null return.
- GitHub API is queried for commits touching CartService in the last 24 hours.
- One commit is found: 'refactor: simplify CartService.getUser' — merged 6 hours ago by @mikek.
- Root cause analysis is generated: the refactor introduced an unhandled null return path when userId is missing from the session.
- A fix PR is created with a null-check added to the method.
- Slack message posted to #incidents with the full summary and PR link.
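The triage steps above can be sketched in code. This is a hypothetical illustration, not IncidentPilot's actual implementation: the frame and commit shapes, the `top_app_frame` and `suspect_commits` helpers, and the `com.shop` package prefix are all invented for the example, and real Sentry and GitHub payloads look different.

```python
# Illustrative triage pipeline: find the failing application frame in a
# stack trace, then narrow recent commits to those touching that code path.
# Data shapes here are simplified stand-ins for Sentry/GitHub responses.
from dataclasses import dataclass

@dataclass
class Frame:
    module: str
    function: str

def top_app_frame(frames, app_prefix):
    """Return the first frame (innermost first, by assumption) in app code."""
    for frame in frames:
        if frame.module.startswith(app_prefix):
            return frame
    return None

def suspect_commits(commits, path_fragment, window_hours=24):
    """Keep commits from the last `window_hours` that touched the failing file."""
    return [
        c for c in commits
        if c["age_hours"] <= window_hours
        and any(path_fragment in path for path in c["files"])
    ]

frames = [Frame("com.shop.cart.CartService", "getUser"),
          Frame("java.util.Optional", "orElseThrow")]
commits = [
    {"sha": "a1b2c3", "age_hours": 6, "files": ["src/cart/CartService.java"]},
    {"sha": "d4e5f6", "age_hours": 30, "files": ["src/cart/CartService.java"]},
]

frame = top_app_frame(frames, "com.shop")
print(frame.function)                                      # getUser
print([c["sha"] for c in suspect_commits(commits, "CartService")])  # ['a1b2c3']
```

The real work, of course, is in the parsing and the root-cause reasoning; the point of the sketch is that each pipeline stage is an ordinary, testable function chained off the alert webhook.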
Total time: 38 minutes. No engineer woken up. By the time the on-call engineer checks their phone in the morning, the PR is already waiting for review.
The human-in-the-loop principle
We're deliberate about what IncidentPilot does and doesn't do autonomously. It investigates. It prepares fixes. It notifies. But it never merges. The engineer reviews the root cause analysis, validates the fix, and approves the PR.
This isn't a technical limitation — it's a design choice. Production systems deserve human judgment at the point of change. AI should augment that judgment, not bypass it.
The goal isn't to remove engineers from incident response. It's to remove the parts that don't require an engineer — the tedious, repetitive investigation work — so engineers can focus on the parts that do.
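One way to enforce that boundary is structurally, with an action allowlist in the automation layer. A minimal sketch, assuming a hypothetical `perform` dispatcher (not IncidentPilot's real API): merging simply isn't a capability the bot has, so no prompt or bug can talk it into one.

```python
# Hypothetical human-in-the-loop guard: the bot can investigate, propose,
# and notify, but there is no code path that merges to production.
ALLOWED_ACTIONS = {"investigate", "open_pr", "notify"}

def perform(action: str, payload: dict) -> str:
    """Execute an automation action, refusing anything not on the allowlist."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"{action!r} requires human approval")
    return f"performed {action}"

print(perform("open_pr", {"title": "fix: handle null user in CartService"}))
# perform("merge_pr", {}) would raise PermissionError: merging stays human.
```

The design choice is that the guard lives at the dispatch layer rather than in the model's instructions, so the safety property holds regardless of what the AI decides to attempt.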
What this means for your team
Teams using IncidentPilot report that on-call rotations become dramatically less stressful. Engineers still get paged for things that require human judgment. But the 3am 'go investigate this' pages drop significantly. The cognitive load of incident response falls. Feature delivery recovers.
That's the world we're building toward. And we're just getting started.