There's an obvious irony in building a system that helps you recover from outages: if your system goes down during an outage, you've made the problem worse. This constraint shapes every architectural decision we make at IncidentPilot.
In this post, I'll walk through the key technical challenges we've encountered and the solutions we've built. Some of these are specific to the AI incident response domain. Others are general reliability patterns that any production system should apply.
Challenge 1: AI inference latency and reliability
The most obvious challenge in building an AI-powered system is the dependency on LLM APIs. These calls can take 5–30 seconds, can fail with rate limits or timeouts, and the providers themselves can have outages. Our entire value proposition — fast incident investigation — is at risk if the AI layer is the bottleneck.
Our approach is a layered resilience strategy:
- Multi-provider fallback: we maintain integrations with multiple LLM providers. If the primary provider is degraded, we automatically fall back to an alternative. Response quality may vary, but the investigation completes.
- Timeout budgets: every AI call has a strict timeout. If a call exceeds its budget, we proceed with the investigation using a template-based analysis and flag it as 'AI analysis unavailable' (the timeout-plus-fallback pattern is sketched after this list).
- Async processing: investigations are processed asynchronously via a job queue. Webhook ingestion is decoupled from investigation execution, so a slow LLM call doesn't affect our ability to receive incoming events.
- Graceful degradation: if AI analysis is unavailable, we still collect and correlate all the contextual data (stack traces, commits, deploys) and present it in a structured format. Less insight, but still useful.
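To make the first two bullets concrete, here is a minimal sketch of a timeout-budgeted LLM call with provider fallback. The provider functions, the budget value, and the Analysis shape are illustrative assumptions rather than our actual client code.

// Sketch: race each provider call against a timeout budget, fall back to the
// next provider, and finally degrade to a template-based analysis.
type Analysis = { source: 'llm' | 'template'; summary: string };

async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout budget exceeded')), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function analyzeIncident(
  prompt: string,
  providers: Array<(prompt: string) => Promise<string>>,
  budgetMs = 15_000, // illustrative per-call budget
): Promise<Analysis> {
  for (const callProvider of providers) {
    try {
      const summary = await withTimeout(callProvider(prompt), budgetMs);
      return { source: 'llm', summary };
    } catch {
      // Rate limit, timeout, or provider outage: try the next provider.
    }
  }
  // Every provider failed or timed out: degrade to the template path.
  return { source: 'template', summary: 'AI analysis unavailable' };
}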
Challenge 2: Webhook reliability
Sentry fires webhooks on a best-effort basis. They can fail silently. They can be delivered out of order. They can be delivered more than once. Our webhook ingestion layer needs to handle all of these cases correctly.
We handle this with an idempotent ingestion layer. Every webhook payload is fingerprinted with a dedup key derived from the Sentry issue ID. Duplicate deliveries are detected and discarded. Webhooks are immediately acknowledged (HTTP 200) and enqueued for async processing — never processed inline — so slow processing never causes Sentry to retry.
// Webhook handler — acknowledge immediately, process async
import { NextRequest, NextResponse } from 'next/server';

// buildDedupKey and queue are app-level helpers: the dedup key is derived
// from the Sentry issue ID, and the queue persists jobs for async workers.
export async function POST(req: NextRequest) {
  const payload = await req.json();
  const dedupKey = buildDedupKey(payload);
  await queue.enqueue({
    type: 'sentry_webhook',
    dedupKey,
    payload,
    receivedAt: new Date(),
  });
  // Acknowledge right away so slow processing never causes Sentry to retry.
  return NextResponse.json({ ok: true }, { status: 200 });
}

Challenge 3: GitHub API rate limits
A thorough investigation of a production incident can require dozens of GitHub API calls: fetching commits, reading diffs, looking up pull requests, querying CODEOWNERS. For teams with high incident rates, this can burn through GitHub's rate limits quickly.
We address this with aggressive caching and request batching. Commit metadata is cached with a short TTL. Pull request data is cached until the PR is closed. We use GitHub's GraphQL API where possible to fetch multiple resources in a single request.
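As an illustration of the caching side, a per-process TTL cache around commit lookups might look like the sketch below; the fetchCommit parameter and the five-minute TTL are assumptions for the sketch, and in practice the cache would likely be shared across workers rather than in-memory.

// Sketch: cache commit metadata with a short TTL to avoid repeat GitHub calls.
const commitCache = new Map<string, { value: unknown; expiresAt: number }>();

async function getCommitCached(
  sha: string,
  fetchCommit: (sha: string) => Promise<unknown>, // the real GitHub API call
  ttlMs = 5 * 60 * 1000, // short TTL: commit metadata rarely changes
): Promise<unknown> {
  const hit = commitCache.get(sha);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit
  const value = await fetchCommit(sha);
  commitCache.set(sha, { value, expiresAt: Date.now() + ttlMs });
  return value;
}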
We also implement a token rotation strategy for teams using multiple GitHub Apps, distributing API load across tokens to multiply effective rate limits.
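The rotation itself can be as simple as preferring whichever token has the most remaining quota, refreshed from the rate-limit headers on each response. The shape below is a hypothetical sketch, not our actual scheduler.

// Sketch: pick the GitHub token with the most remaining rate-limit quota.
interface TokenState {
  token: string;
  remaining: number; // tracked from the X-RateLimit-Remaining response header
}

function pickToken(pool: TokenState[]): TokenState {
  if (pool.length === 0) throw new Error('no GitHub tokens configured');
  return pool.reduce((best, t) => (t.remaining > best.remaining ? t : best));
}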
Challenge 4: Handling cascading incidents
Production systems often fail in cascades: one service goes down and triggers alerts across dozens of dependent services. If IncidentPilot attempted to fully investigate each alert independently, it would generate a storm of duplicate investigations and PRs, overwhelming the team.
We detect cascade patterns by looking for temporal clustering of alerts (multiple alerts within a short window), shared infrastructure dependencies (same database, same upstream service), and correlated error signatures. When a cascade is detected, we group the alerts into a single investigation and focus on the root service.
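To show just the temporal-clustering piece, here is a sketch of window-based grouping. The alert shape and the five-minute window are illustrative, and the real grouping also weighs shared dependencies and correlated error signatures.

// Sketch: group alerts that arrive close together into a single investigation.
interface Alert {
  id: string;
  service: string;
  receivedAt: number; // epoch milliseconds
}

const CASCADE_WINDOW_MS = 5 * 60 * 1000; // illustrative clustering window

function groupIntoCascades(alerts: Alert[]): Alert[][] {
  const sorted = [...alerts].sort((a, b) => a.receivedAt - b.receivedAt);
  const groups: Alert[][] = [];
  for (const alert of sorted) {
    const current = groups[groups.length - 1];
    if (
      current &&
      alert.receivedAt - current[current.length - 1].receivedAt <= CASCADE_WINDOW_MS
    ) {
      current.push(alert); // close in time: same cascade
    } else {
      groups.push([alert]); // gap: start a new investigation group
    }
  }
  return groups;
}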
Challenge 5: Our own reliability
IncidentPilot needs to be operational precisely when your systems are under stress. This means we can't rely on the same infrastructure patterns that might be acceptable for lower-stakes applications.
- Multi-region deployment: we run active-active across multiple regions. A regional outage doesn't take us down.
- Database replication: all incident and investigation data is synchronously replicated. We don't lose data on primary failure.
- Queue durability: our job queue persists to disk. Pending investigations survive process restarts.
- Health monitoring: we eat our own dog food — IncidentPilot monitors itself and pages us if it's degraded (a sketch of such a health check follows this list).
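As a sketch of that self-monitoring piece (the route and probe functions are illustrative stand-ins), a health endpoint can report per-dependency status so an external monitor can page on degradation:

// Sketch: a health endpoint reporting per-dependency status.
import { NextResponse } from 'next/server';

// Hypothetical probes; real ones would ping the database, queue, and LLM providers.
async function checkDatabase(): Promise<boolean> { return true; }
async function checkQueue(): Promise<boolean> { return true; }
async function checkLLMProviders(): Promise<boolean> { return true; }

export async function GET() {
  const checks = {
    database: await checkDatabase(),
    queue: await checkQueue(),
    llm: await checkLLMProviders(),
  };
  const degraded = Object.values(checks).some((ok) => !ok);
  // Returning 503 lets an external monitor page without parsing the body.
  return NextResponse.json(
    { status: degraded ? 'degraded' : 'ok', checks },
    { status: degraded ? 503 : 200 },
  );
}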
We target 99.99% uptime for the webhook ingestion layer — the component that receives your Sentry alerts. Even when AI analysis capacity is degraded, we still receive and queue every incident for investigation.
What we've learned
The central lesson from building IncidentPilot is that reliability in AI systems requires the same engineering discipline as reliability in any production system — plus additional consideration for the non-determinism and latency characteristics of AI inference.
Design for degradation from day one. Your system should be useful even when AI analysis is unavailable. Decouple ingestion from processing. Cache aggressively. And monitor everything — including the quality of your AI outputs over time, not just whether the API calls succeed.