PagerDuty Blog

What is incident analysis and why should you do it?

This post was originally published on the Jeli blog. Jeli was acquired by PagerDuty in 2023 and we’re reposting it here to bring their thought leadership to our community.

Service outages are a common part of modern software operations, especially if you’re moving fast or operating at scale! Many companies have realized this and have begun to invest in incident analysis in order to learn from their incidents. Let’s first talk about what exactly incident analysis is, and what benefits your teams can realize from it.

Defining Incident Analysis

Incident analysis is a process for identifying what happened during an outage: discovering which people and which parts of the system were involved, and how the problem was handled. There are many different methods for conducting incident analysis. At its core, however, incident analysis typically consists of:

  1. Gathering data about the event
  2. Analyzing the data
  3. Drawing conclusions from the data
  4. Enhancing future resilience
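
To make the first three steps concrete, here is a minimal sketch in Python. It is an illustration only, not a Jeli or PagerDuty API: the TimelineEvent type, the summarize function, and the sample data are all hypothetical, and a real analysis would draw on far richer sources such as chat transcripts, alerts, and interviews.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One hypothetical piece of incident data (from chat, alerts, interviews)."""
    at: datetime
    actor: str      # person or automation involved
    component: str  # part of the system touched
    note: str       # what happened, in the responder's words


def summarize(events):
    """Steps 2 and 3: order the data, then pull out who and what was involved."""
    if not events:
        raise ValueError("no events gathered yet")
    ordered = sorted(events, key=lambda e: e.at)
    elapsed = ordered[-1].at - ordered[0].at
    return {
        "actors": sorted({e.actor for e in ordered}),
        "components": sorted({e.component for e in ordered}),
        "duration_minutes": elapsed.total_seconds() / 60,
        "narrative": [
            f"{e.at:%H:%M} {e.actor} ({e.component}): {e.note}" for e in ordered
        ],
    }


# Step 1: gather data about the event (made-up sample data).
events = [
    TimelineEvent(datetime(2024, 1, 1, 10, 5), "alice", "checkout-api", "rolled back deploy"),
    TimelineEvent(datetime(2024, 1, 1, 9, 40), "pager", "checkout-api", "latency alert fired"),
    TimelineEvent(datetime(2024, 1, 1, 9, 50), "bob", "payments-db", "saw connection pool exhaustion"),
]

summary = summarize(events)
```

The fourth step, enhancing future resilience, is deliberately left out of the sketch: it depends on people discussing and learning from the narrative, which is not something a summary function can do for them.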

Many view a core function of incident analysis simply as preventing future recurrence: taking corrective actions to fix a bug, improve observability or update the runbooks. We see it as more than that. It’s about enhancing future resilience by better preparing individuals, teams—and yes, their software systems—to handle unanticipated failures.

Taking It a Step Further

We like to borrow the “Yes, and…” technique from improv comedy. We say, “Yes! We want to prevent this from happening in the future…and that means we’ll prepare engineers with a broader skill set than simply preventing the exact same incident at a later point in time!”

Just as you never step in the same river twice, because it is continuously flowing, so too will you never face the same incident twice, because continuous integration/continuous deployment = continuous change. When the focus of the analysis is on learning, not just fixing, incident analysis makes a company better able to respond to future incidents.

How We Best Benefit

Of course, you do want your organization to learn about the nature of unexpected events and to take actions that help keep the same failure mode from becoming a problem again. However, the real benefit of incident analysis is a better understanding of how the system works under different kinds of operating conditions. This understanding can better equip engineers to handle future surprises, some of which may look and feel like past incidents!

It’s important that your incident analysis does both: it should prevent similar incidents from recurring, and it should teach engineers a wider range of skills that help them handle ongoing challenges to reliability. As noted in the IBM Garage methodology for incident analysis, “Repetitive problems frustrate users, burn out engineers, and can lead to a loss of faith in the reliability of your application. More broadly, repeated issues harm the reputation of the team or organization, resulting in business consequences such as lost customers. Incident analysis is a critical skill for any site reliability engineer, and indeed, all technical roles to develop.”

What We’re Learning Next

We believe incident analysis is a crucial method for improving not just the resilience of engineering teams, but the business as a whole. So get excited to dive into our Incident Analysis 101 series! We’ll cover topics like:

  • which incidents to investigate
  • who should lead the investigations
  • what kinds of data you should use in your incident reviews
  • how to write compelling reports that help people learn
  • how to share the findings with others and drive meaningful learning
  • how to decide on what improvements to make after the analysis is finished

For more detailed information on these and other incident analysis topics, you can always check out Jeli’s Howie: The Post Incident Guide.

Happy learning!