PagerDuty Blog

How to effectively analyze an incident

Learning from an incident, like any complex subject, is going to require some effort. How does one even get started in analyzing it all in an efficient, effective manner?

Permission to Iterate

The first thing to know is that incident analysis is iterative; we can’t just make sense of all the data at once. We must acknowledge that investigators will often start with no prior knowledge—or perhaps a cursory knowledge—of the event, which deepens as more reading, interviewing, and question-asking occurs. It is, therefore, important to give yourself (and others!) permission to get things wrong, change your mind, or learn entirely new things. This is part of the process of reconstructing an incident and we’ll touch on this idea a few times as we move through the steps below.

Getting Familiar

You’ve collected the data and likely have a very rough timeline for the incident in your head. It’s worth getting this timeline out of your head so it can be shared with others to clarify and expand upon. We suggest doing this by building a narrative timeline that’s guided by the understanding that you’ve built up so far. Start to tell the story of the event by identifying high-level plot points, like when a problem was identified, how the information was acquired and shared, what fixes were employed, and when resolution was achieved.
Many investigators do this by compiling the data in a document: selecting, highlighting, bolding, and cutting and pasting quotes or key pieces of information. If you are working from a transcript, you might add codes or tags to passages to flag their meaning. In doing so, you’re doing more than collecting data. You’re building an engaging story to later tell your organization!
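If you prefer to keep your tagged notes in a structured form instead of (or alongside) a document, the idea can be sketched in a few lines of Python. This is only an illustration: the `TimelineEntry` structure, the tag names, and the sample data are all made up for the example, not part of any particular tool or process.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One plot point in the narrative timeline (hypothetical structure)."""
    when: datetime
    source: str   # e.g. "chat transcript", "interview", "support ticket"
    quote: str    # the key quote or piece of information
    tags: set[str] = field(default_factory=set)


def build_timeline(entries):
    """Return entries in chronological order so the story reads start to finish."""
    return sorted(entries, key=lambda e: e.when)


def entries_tagged(timeline, tag):
    """Pull out every entry carrying a given tag, preserving timeline order."""
    return [e for e in timeline if tag in e.tags]


# Made-up example data, not from a real incident.
raw = [
    TimelineEntry(datetime(2024, 1, 1, 10, 30), "chat transcript",
                  "Rolling back the deploy now.", {"fix-attempted"}),
    TimelineEntry(datetime(2024, 1, 1, 10, 5), "support ticket",
                  "Customers report checkout errors.", {"problem-identified"}),
    TimelineEntry(datetime(2024, 1, 1, 10, 45), "chat transcript",
                  "Error rate back to normal, calling all clear.", {"resolution"}),
]

timeline = build_timeline(raw)
for entry in timeline:
    print(f"{entry.when:%H:%M} [{', '.join(sorted(entry.tags))}] {entry.quote}")
```

Because entries can be added in any order and re-sorted, a structure like this leaves room to change your mind as the timeline fills in.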

Developing the Story

Humans love stories. Not only do we enjoy them, we learn more effectively from them! Our goal is to take all this information we’re collecting and shape it into a narrative. Like any good story, your incident will have a beginning, a middle, and an end. How did your incident begin? With a customer support inquiry, or maybe a deploy done before the incident began? The middle will be full of setbacks, mystery, and triumph. And then, the ever-important ending. Was there chatter after the all clear that discussed fallout and repair? Maybe foreshadowing of more to come?
When getting familiar with the incident you’ve likely already started forming a nascent story. As you read, learn, listen, and interview, you’re going to find new twists, turns, and compelling tangents worth checking out.
These tidbits will often manifest as questions: “Why did this happen?” or “How did they know to do that?” or “What in the world is Consul?” These are all excellent notes for you to capture and follow up on. As you develop the narrative timeline, you’ll start to separate the important and relevant data from all the background data.
The narrative you’re building is going to develop the more you learn. Our questions and notes will accumulate and we’ll begin looking at those notes and deciding how to get answers to those questions. New themes will emerge and some questions will come up again and again.

Interviews

If you’ve chosen to conduct interviews during this analysis, you’ll probably have a list of candidates by now. These are folks for whom you have questions from your tagging or who played key roles. Use these conversations as a way to vet your nascent timeline and to explore the questions you’ve noted from your initial analysis. As you take additional notes, clarify, or add tags, you will also add more questions. Your interviewee may bring up topics you’ve not yet dug into, or you may hear new information that requires a bit more thread pulling.
This fractal question situation may seem intractable as you read this, but each interview also firms up the story. Your judgment will help resolve which ideas are worth pursuing. Again, this is an opportunity for iterating: the information you get from your interviews may cause new topics to appear, old ones to become less relevant, or existing themes to strengthen. You’re using your wonderful brainpower to bring these recurring, novel, or critical topics into a story that is so much richer than just a dry timeline of events. You’ll hone the narrative with each interview, weaving the interviewees’ additions with the other findings.
As you iterate, some questions may fall by the wayside. Maybe they didn’t end up being as important as you thought, or maybe the information wasn’t available. Unless you’ve been given unlimited time to investigate, the goal isn’t to answer every question. The only goal is to learn about the incident, in order to learn from the incident. You may run into questions about your systems that you don’t have time to answer, either because they are too expansive or because you can’t pull together the right information fast enough. And that’s okay! Capturing these novel lines of questions, even without answers, can inform future investigations. You might find later that the question has come up across multiple incidents and the investment may become worthwhile.
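One lightweight way to notice when an unanswered question keeps coming back is to record open questions per incident and count how many incidents each one appears in. The sketch below assumes you write the questions down with consistent wording; the incident IDs, the questions, and the `recurring_questions` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical notes: incident ID -> questions left unanswered after analysis.
open_questions = {
    "incident-1": ["who owns the queue service?", "why did the alert fire late?"],
    "incident-2": ["who owns the queue service?"],
    "incident-3": ["who owns the queue service?", "why did failover take so long?"],
}


def recurring_questions(notes, min_incidents=2):
    """Return (question, count) pairs seen in at least `min_incidents` incidents."""
    counts = Counter()
    for questions in notes.values():
        # Use a set so a question counts only once per incident.
        counts.update(set(questions))
    return [(q, n) for q, n in counts.most_common() if n >= min_incidents]


print(recurring_questions(open_questions))
```

Exact string matching is a simplification. In practice the same question will be phrased differently from one investigation to the next, so grouping them is still a judgment call for the investigator.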

Refinement

As you continue to iterate, you’ll notice that your narrative and understanding of the incident will begin to solidify. You’ll begin hearing similar refrains from your interviewees, or seeing patterns in the data you examine. Perhaps multiple interviewees mention that a system doesn’t really have an owner, the original creators have moved on, and nobody understands it well. You may note this and pose it as a question to other interviewees and look for affirmation or dismissal. The data and interviewees, along with your knowledge of the organization and its systems, will guide you in how to choose and refine these developing themes.
Many key moments may fall outside of the traditional “detect, diagnose, repair” idea. The twists, turns, dead-ends, triumphs, failures, coincidences, and serendipity that make up the complexities of modern systems will pepper your notes. These gems are opportunities to learn. These moments highlight where the idea of how things work differs from how they actually work. Use your judgment as the investigator to decide which items are most relevant to understanding the context of the incident, and which can be most valuable for your audience to discuss. If questions from your notes don’t have clear answers, keep an ear open as you are iterating. Some questions will become clearer. Some things you’ve noted may never find clear answers, and that’s expected. They may point to a gap in the organization, or something that may require further investment. You may decide to pose these questions during the later incident review to create discussion or propose future lines of investigation. I’ve seen new teams formed specifically to answer particularly hairy questions!
Between interviews, go back through your tagging data and notes. Play your narrative through your mind as you look at the data with new ideas and considerations. Do new questions arise? Does certain behavior now make more (or less!) sense? This reflection refines the narrative and defines what next steps (if any) you’ll take.
At this point, you’ve probably got an interesting story developed from your notes, interviews, and findings. You’ve spent some time on stuff that didn’t pan out, but also found a few interesting nuggets you hadn’t known before you began. You might even be a bit excited about the narrative you’ve put together, as you’ve glimpsed some insights or found a situation where how you thought something works doesn’t match up. Maybe you even made some new friends!

Putting It All Together

Telling good stories through your analysis can turn an unfortunate event into an excellent, exciting opportunity for learning. Your initial model of how the incident began and ended might be factually correct, but the narrative you shape as an analyst is enriched and deepened by your collection of stories from other humans. As the creator of the narrative—author of the story—your investment pays off many times over as others hear the story and learn. Timelines and ticket numbers have their place, but stories stick in our brains and capture valuable lessons in ways no incident timeline can.

For more detailed information on these and other topics, you can always check out Howie: The Post-Incident Guide.