At first glance, "find the commit that caused this error" seems straightforward. You look at the stack trace, identify the failing file, run git blame, find the recent change, and you're done.
In practice, it's rarely that simple. Stack traces point to library code. The actual bug was introduced three layers up the call chain. The deploy happened six hours ago but the error only started appearing after a specific user action triggered a code path nobody considered. The commit was merged by one person but reviewed by another who caught a different bug but missed this one.
Scale this to a production system with hundreds of files, dozens of deploys per week, and tens of thousands of error events per day — and manual RCA becomes the bottleneck it already is for most engineering teams.
The data sources involved
A complete root cause analysis for a production incident typically draws from at least five separate data sources:
- Sentry event data: stack trace, affected release, event frequency, first/last seen timestamps, user impact.
- GitHub commit history: commits merged since the last clean deploy, touching files referenced in the stack trace.
- GitHub pull requests: the PR context, description, reviewer comments, and any CI failures that were bypassed.
- Deploy history: exact deployment timestamps mapped to commit SHAs, so we can bracket the incident window.
- CODEOWNERS / blame data: who owns the affected code and who made recent changes.
Manually cross-referencing these sources for a single incident can take 20–45 minutes. IncidentPilot does it in under 5 minutes, programmatically.
How the pipeline works
Step 1: Parse the stack trace
The first step is extracting meaningful signal from the Sentry event. We parse the full stack trace and identify the application frames — filtering out third-party library frames that are unlikely to be the root cause. For each application frame, we extract the file path, function name, and line number.
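A minimal sketch of that filtering step, assuming Sentry-style frame objects with `filename`, `function`, `lineno`, and `in_app` fields (the field names mirror Sentry's event schema, but treat the exact shapes here as illustrative rather than the production parser):

```javascript
// Keep only application frames from a Sentry-style frame list,
// dropping third-party code that is unlikely to be the root cause.
function extractAppFrames(frames) {
  return frames
    .filter((frame) => frame.in_app && !frame.filename.includes("node_modules"))
    .map((frame) => ({
      file: frame.filename,
      fn: frame.function,
      line: frame.lineno,
    }));
}

// Hypothetical event data for illustration:
const frames = [
  { filename: "node_modules/express/lib/router.js", function: "handle", lineno: 280, in_app: false },
  { filename: "src/billing/invoice.js", function: "applyDiscount", lineno: 42, in_app: true },
];
// extractAppFrames(frames) keeps only the src/billing/invoice.js frame
```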
Step 2: Bracket the incident window
Using the Sentry event's first-seen timestamp and the project's deploy history (fetched via the Sentry API), we identify the last clean deploy before the incident started. This gives us a commit range: everything merged between the clean deploy and the incident onset is a candidate for the root cause.
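The bracketing logic can be sketched as follows. This is a simplified heuristic — the deploy shape (`sha`, `deployedAt`) is assumed, not Sentry's actual API response, and a real pipeline would also compare per-release error rates rather than rely on timestamps alone:

```javascript
// Given deploys and the error's first-seen timestamp, treat the most
// recent deploy at or before first-seen as the suspect deploy, and the
// one before it as the last clean deploy.
function bracketIncident(deploys, firstSeen) {
  const sorted = [...deploys].sort((a, b) => a.deployedAt - b.deployedAt);
  const before = sorted.filter((d) => d.deployedAt <= firstSeen);
  const suspect = before[before.length - 1] ?? null;   // deploy live when errors began
  const lastClean = before[before.length - 2] ?? null; // last deploy with no sightings
  return { lastClean, suspect };
}
```

The candidate commit range is then everything merged between `lastClean.sha` and `suspect.sha`.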
Step 3: Query GitHub for relevant commits
We fetch commits in the candidate range from GitHub. For each commit, we look at the diff and check whether any changed files overlap with files in the stack trace. Commits that touch relevant files are ranked higher.
```javascript
// Simplified commit scoring: what fraction of the stack-frame files
// did this commit touch?
function scoreCommit(commit, stackFrameFiles) {
  const touchedFiles = commit.files.map((f) => f.filename);
  const matchedFrames = stackFrameFiles.filter((sf) =>
    touchedFiles.some((f) => f.includes(sf) || sf.includes(f))
  );
  return matchedFrames.length / stackFrameFiles.length;
}
```

Step 4: AI-powered correlation
The scored commits, stack trace, error message, and PR descriptions are sent to an LLM with a structured prompt asking it to identify the most likely root cause. The LLM's job here isn't to reason from first principles — it's to synthesize the collected evidence into a coherent narrative.
This is where the quality of the input data matters enormously. Good commit messages, descriptive PR descriptions, and well-named functions all help the LLM produce more accurate analyses. Terse commit messages like "fix stuff" produce worse results.
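A sketch of the evidence bundle assembled for the model — the field names and prompt wording here are illustrative, not the production prompt; the point is that the LLM receives pre-correlated evidence, not raw logs:

```javascript
// Build a structured prompt from the error and the top-ranked commits.
// Each commit is assumed to carry {score, sha, message, prDescription}.
function buildCorrelationPrompt(error, scoredCommits) {
  const evidence = scoredCommits
    .slice(0, 5) // only the top-ranked candidates
    .map(
      (c, i) =>
        `Candidate ${i + 1} (score ${c.score.toFixed(2)}): ` +
        `${c.sha.slice(0, 7)} — ${c.message}\nPR: ${c.prDescription}`
    )
    .join("\n\n");
  return [
    `Error: ${error.message}`,
    `Top application frames:\n${error.frames.join("\n")}`,
    `Candidate commits:\n${evidence}`,
    "Identify the most likely root cause, explain the failure mechanism, and state your confidence.",
  ].join("\n\n");
}
```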
Step 5: Generate the RCA
The final output is a structured root cause analysis that includes: the most likely responsible commit and author, the specific code change that introduced the issue, an explanation of the failure mechanism, and a suggested fix approach. This becomes both the PR description and the Slack notification body.
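An illustrative shape for that structured output (not the exact schema), showing how one object can feed both the PR description and the Slack notification:

```javascript
// Hypothetical RCA object — every field name here is illustrative.
const exampleRca = {
  suspectCommit: { sha: "abc1234", author: "jdoe" },
  change: "applyDiscount() stopped null-checking the coupon object",
  mechanism: "Checkouts with no coupon now throw a TypeError",
  suggestedFix: "Restore the null check, or default to a zero discount",
  confidence: "high",
};

// Render the same object as a Slack message body.
function renderSlackBody(rca) {
  return (
    `*Likely cause:* ${rca.suspectCommit.sha} by ${rca.suspectCommit.author}\n` +
    `*Why:* ${rca.mechanism}\n*Fix:* ${rca.suggestedFix}`
  );
}
```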
Handling ambiguity
Not every incident has a clear single cause. Sometimes the root cause is a combination of changes. Sometimes it's environmental — a configuration change, a dependency update, or a data migration that exposed a latent bug. IncidentPilot is designed to communicate uncertainty when it exists rather than fabricate false confidence.
When the evidence is ambiguous, the RCA says so. A confident-sounding wrong answer is worse than an honest "we found three candidates — here's the evidence for each."
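One simple way to surface that uncertainty is to report every candidate whose score is close to the leader, rather than forcing a single answer. The 0.15 margin below is an illustrative threshold, not a tuned value from the actual system:

```javascript
// Return all commits within `margin` of the top score; a result with
// more than one entry signals an ambiguous RCA.
function selectCandidates(scoredCommits, margin = 0.15) {
  const sorted = [...scoredCommits].sort((a, b) => b.score - a.score);
  if (sorted.length === 0) return [];
  const top = sorted[0].score;
  return sorted.filter((c) => top - c.score <= margin);
}
```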
Accuracy in practice
Across the incidents processed by IncidentPilot in our private beta, 78% of root cause identifications were rated as "accurate" or "very accurate" by the engineers who reviewed them. 14% were "partially correct" — the right area but not the exact commit. 8% were incorrect.
We're not at 100%, and we don't claim to be. But even a partially correct RCA that narrows the search space from 50 commits to 3 is enormously valuable. It transforms a 45-minute investigation into a 5-minute validation.