Engineering insights
Thoughts on incident response, AI in production, and building reliable systems.
Why Manual Incident Investigation Is Broken (And What We're Doing About It)
The average engineer spends 4–6 hours per week on incident investigation. That's time not spent shipping features, paying off technical debt, or sleeping. Here's why the current approach isn't working and how AI changes the equation.
Root Cause Analysis at Scale: How IncidentPilot Connects Sentry Errors to Git History
Finding the commit that caused a production incident sounds simple. In practice, it involves cross-referencing error timestamps, deploy logs, recent PRs, and git blame across dozens of files. Here's how we automated it.
Human-in-the-Loop AI: Why We'll Never Auto-Merge a Fix
IncidentPilot generates pull requests, writes root cause analyses, and notifies your team. But it never merges autonomously. This is an intentional design decision, and it matters more than you might think.
How One Team Reduced MTTR by 87% Without Hiring More SREs
A 12-person engineering team was spending 30% of their sprint capacity on incident response. After integrating IncidentPilot, that dropped to under 5%. Here's the full story.
Building Reliable AI for Incident Response: The Technical Challenges
When the system you're building is supposed to help during production outages, reliability isn't optional. Here's how we architect IncidentPilot to be available exactly when you need it most.