How One Team Reduced MTTR by 87% Without Hiring More SREs

Meridian Labs is a 45-person B2B SaaS company building financial data infrastructure. Their backend is a Python/FastAPI monolith deployed on Kubernetes, with a React frontend and a data pipeline processing millions of financial records daily. They run a lean engineering team of 12.

In Q3 2025, Meridian's CTO reached out to us with a specific problem: their on-call rotation had become unsustainable. Engineers were burning out. Sprint commitments were slipping. And they couldn't hire their way out of it, because SREs were expensive and hard to find.

The situation before IncidentPilot

Meridian was averaging 18 P1/P2 incidents per month, with a mean time to resolution of 2.8 hours. Their investigation process was entirely manual: an engineer would get paged, spend 20–60 minutes investigating in Sentry and GitHub, draft a fix, get it reviewed, and deploy.

18 incidents/month × 2.8 hrs MTTR = ~50 engineer-hours per month on incident response.
30% of those incidents occurred outside business hours, requiring on-call interruptions.
Post-incident review and documentation added another 10–15 hours per month.
Total: approximately 60–65 engineer-hours per month, or about one and a half engineer-weeks, spent on incidents.
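
The arithmetic above can be reproduced in a few lines of Python. This is a rough sketch under the same simplification the figures use, namely that one engineer is occupied for the full resolution time of each incident. The constant and function names are illustrative only and are not part of any tool.

```python
# Back-of-the-envelope incident load, using the figures quoted above.
INCIDENTS_PER_MONTH = 18
MTTR_HOURS = 2.8
AFTER_HOURS_SHARE = 0.30
REVIEW_HOURS_RANGE = (10, 15)  # post-incident review and documentation


def monthly_incident_load(incidents, mttr_hours, review_hours_range):
    """Return (response_hours, total_low, total_high) per month.

    Assumes one engineer is engaged for the whole resolution time.
    """
    response_hours = incidents * mttr_hours
    low, high = review_hours_range
    return response_hours, response_hours + low, response_hours + high


response, total_low, total_high = monthly_incident_load(
    INCIDENTS_PER_MONTH, MTTR_HOURS, REVIEW_HOURS_RANGE
)
after_hours_incidents = INCIDENTS_PER_MONTH * AFTER_HOURS_SHARE

print(f"Response time: {response:.1f} engineer-hours/month")
print(f"With reviews:  {total_low:.1f} to {total_high:.1f} engineer-hours/month")
print(f"After-hours incidents: {after_hours_incidents:.1f}/month")
```

The total comes out as a range because the review overhead is itself a range of 10 to 15 hours.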

With a 12-person team running two-week sprints of roughly 80 engineer-hours each, incident response was absorbing an estimated 30% of total capacity. Feature delivery was suffering. Morale was dropping. And the problem was getting worse as their product grew.

The integration

Meridian integrated IncidentPilot in a single afternoon. They connected their Sentry organization, authenticated their GitHub account with read/write access to their monorepo, and configured the Slack integration to post to their #incidents channel. The Sentry webhook took about 10 minutes to set up.

They ran in parallel mode for the first two weeks: IncidentPilot investigated incidents alongside their manual process, letting engineers compare the AI's analysis against their own. By the end of those two weeks, accuracy was high enough that they made IncidentPilot the primary investigation step.

Results after 90 days

Mean time to resolution: 2.8 hours → 22 minutes. An 87% reduction.
Engineer time per incident: ~50 minutes → ~8 minutes (review RCA + approve PR).
After-hours pages: 5.4/month → 1.1/month. Most incidents were resolved before morning.
Sprint capacity recovered: from 30% absorbed by incidents to under 5%.
Incidents per month: unchanged at ~18. The gain comes from resolving incidents faster, not from preventing them.
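
The percentage figures follow from the before and after values via the usual relative-change formula. A short Python sketch, with the numbers copied from the list above and a helper name (`percent_reduction`) chosen purely for illustration:

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Before/after values from the 90-day results, in consistent units.
metrics = {
    "MTTR (minutes)": (2.8 * 60, 22),
    "Engineer time per incident (minutes)": (50, 8),
    "After-hours pages per month": (5.4, 1.1),
}

reductions = {
    name: percent_reduction(before, after)
    for name, (before, after) in metrics.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.0f}% reduction")
```

Converting the 2.8-hour MTTR to 168 minutes before comparing it with 22 minutes is what yields the 87% figure. The same formula gives roughly 84% for engineer time per incident and roughly 80% for after-hours pages.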

What the team noticed

Beyond the metrics, Meridian's engineers reported qualitative changes in how they experienced incidents. The anxiety of being on call decreased significantly: knowing that IncidentPilot would investigate an overnight failure and have a PR ready by morning changed the emotional weight of the rotation.

"The thing that surprised me most was how often the RCA was right," said Meridian's lead backend engineer. "I expected to spend a lot of time correcting it. Instead I'm spending most of my time just reading it, thinking 'yes, that's exactly right,' and clicking merge."

The goal was never to eliminate engineers from the loop. It was to eliminate the 40 minutes of repetitive investigation before they could do anything useful. That's exactly what happened.

Edge cases and limitations

Not every incident was handled perfectly. About 12% of IncidentPilot's analyses required significant correction by engineers. These tended to cluster around a few categories: incidents caused by infrastructure changes (not captured in application code commits), data-related failures that weren't obvious from stack traces, and novel failure modes with no clear recent-commit correlation.

For these cases, IncidentPilot still provided value — it narrowed the search space, collected the relevant context, and provided a starting hypothesis. Engineers could correct the analysis rather than starting from scratch.

Where Meridian is now

Six months in, incident response is no longer a sprint planning concern for Meridian. They've used the recovered capacity to ship two major features that had been deprioritized for quarters. Their on-call rotation has expanded to include more engineers, making individual rotations less frequent. And their postmortem process has improved because every incident comes with a ready-made analysis to start from.