What is incident analysis and why should we do it?
February 1, 2022
We’re kicking off Incident Analysis 101: a series where we break down the basics of incident analysis!
Service outages are a common part of modern software operations, especially if you’re moving fast or operating at scale! Many companies have realized this and have begun to invest in incident analysis in order to learn from their incidents. To introduce our Incident Analysis 101 series, let’s first talk about what exactly incident analysis is, and what benefits your teams can realize from it.
Defining Incident Analysis
Incident analysis is a process for identifying what happened during an outage: discovering which people and which parts of the system were involved, and how the problem was handled. There are many methods for conducting incident analysis. At its core, however, incident analysis typically consists of:
Gathering data about the event
Analyzing the data
Drawing conclusions from the data
Enhancing future resilience
Many view the core function of incident analysis as simply preventing recurrence: taking corrective actions to fix a bug, improve observability, or update the runbooks. Here at Jeli, we see it as more than that. It’s about enhancing future resilience by better preparing individuals, teams, and yes, their software systems to handle unanticipated failures.
Taking It a Step Further
We like to borrow the “Yes, and” technique from improv comedy.1 We say, “Yes! We want to prevent this from happening in the future…and that means we’ll prepare engineers with a broader skill set than simply preventing the exact same incident at a later point in time!”
Just as you never step in the same river twice, because it is continuously flowing, so too will you never face the same incident twice, because continuous integration/continuous deployment = continuous change. My fellow Jeli Bean, Vanessa Huerta Granda, says it best: “Each incident is unique. Even if an incident seems similar, or like a recurrence of a past problem, it happened to different people at different times. There is almost always something new to be learned.” Look out for more on this insight from Vanessa in a future Incident Analysis 101 post2. Future you will thank you for reading it.
When the focus of the analysis is on learning, not just fixing, incident analysis makes a company better able to respond to future incidents. Alex Elman, a Site Reliability Engineer at Indeed who leads their Resilience Engineering team, puts it this way:
“A ‘prevent and fix’ cycle is a backward facing process that aims to avoid surprise and prepares only for previously encountered failure modes. This method of following up on incidents leads to narrowly targeted improvements and lessons that don’t apply to future encounters with surprise.”3
How We Best Benefit
Of course, you do want your organization to learn about the nature of unexpected events and to take actions that help keep that failure mode from becoming a problem in the future. However, the real benefit of incident analysis is a better understanding of how the system works under different kinds of operating conditions. This understanding can better equip engineers to handle future surprises, some of which may look and feel like past incidents!
It’s important that your incident analysis does both: prevents similar incidents from recurring and teaches engineers a wider range of skills that help them handle ongoing challenges to reliability. As noted in the IBM Garage methodology for incident analysis, “Repetitive problems frustrate users, burn out engineers, and can lead to a loss of faith in the reliability of your application. More broadly, repeated issues harm the reputation of the team or organization, resulting in business consequences such as lost customers. Incident analysis is a critical skill for any site reliability engineer, and indeed, all technical roles to develop.”4
What We’re Learning Next
We believe incident analysis is a crucial method for improving not just the resilience of engineering teams, but the business as a whole. So get excited to dive into our Incident Analysis 101 series! We’ll cover topics like:
which incidents to investigate;
who should lead the investigations;
what kinds of data you should use in your incident reviews;
how to write compelling reports that help people learn;
how to share the findings with others and drive meaningful learning;
how to decide on what improvements to make after the analysis is finished.
For more detail on these and other topics, you can always check out Jeli’s Howie: The Post Incident Guide. If you enjoy this content or want to suggest a future topic, tweet us @jeli_io.