Post-incident reviews: a how-to guide

by Jeli, a part of PagerDuty
May 9, 2022

After conducting interviews, putting it all together, and writing up your findings, it’s time to gather everyone together and review the incident. In this post we’ll discuss who’s invited, how to create a learning environment, what the agenda looks like, and what comes next.

Who’s invited?

Ideally, we need the responders involved in the incident to attend so they can share their perspective and recount their experiences. Schedule the meeting to include as many key people as possible, especially anyone you’ve interviewed. Remember that “responders” are not limited to on-call engineers. Make sure to include the other roles involved in the incident, such as customer support, incident management/command, security, and escalation teams/operations centers.

Other participants and stakeholders can also be valuable inclusions depending on the context of the incident. Think: dependent service teams, engineers from other parts of the business, customer success/advocate roles, and impacted users. Diverse perspectives are crucial for making a learning review as comprehensive as possible. Inviting a broader range of roles improves understanding of how other parts of the company work and of larger organizational goals.

In a transparent organization, anyone who’s interested in learning from this incident should be able to attend! However, before opening the invite to everyone, assess the circumstances surrounding a particular review. Is the incident particularly contentious? Could things get potentially divisive or spicy if discussed in front of a broad audience? If the answer to any of these questions is yes, it’s okay to limit the invite list to responders and key stakeholders.

Getting prepped

Writing up your findings

It’s important for the facilitator of a learning review to understand that their role is that of a reporter. You may have gathered information, interviewed participants, and analyzed and written up your findings. There are many different ways to tackle a write up; we suggest starting with a calibration document, to be shared in advance of the review meeting.

A calibration document gives those you’ve interviewed a place to preview your findings, correct any misunderstandings, and comment on the themes you have chosen to highlight. Doing so prevents any surprises during the actual meeting, and helps to build and maintain trust that you’re representing their perspective fairly and accurately. Sharing this document with meeting participants in advance helps to get all attendees on the same page. This makes the meeting more productive by focusing the time on deeper discussions, as opposed to simply recounting a timeline.

A facilitator’s mindset

You might have collected all the information, but this does not make you the sole authority on the topics to be discussed. You are there to facilitate the sharing of knowledge and the expertise of those who participated in the incident. Time to take off your writing cap, put it next to your detective hat, and grab your facilitation visor.

Reach out to those you interviewed, or have identified as subject matter experts, and ask if they’re comfortable being called on to explain pieces of the timeline and describe the event from their perspective. Everyone experienced the event from their specific point of view, and getting it directly from the source helps create a shared understanding of what occurred and gives us an opportunity to learn from each other. Share the calibration document in advance (we recommend at least 24 hours prior to the meeting) so they can get a feel for how you’ve interpreted the events. Allow them to clarify, elaborate, or correct as needed. Giving them advance notice that you’d like them to speak, and helping them understand what to expect, will help to ease any stage fright or defensiveness. These are normal reactions to public speaking, but they can limit learning. Working collaboratively on the meeting content and the desired outcomes will reinforce the trust you’ve built together over the investigation.

Timing

When scheduling the meeting, consider how much time it will take to get through the timeline and themes you’ve prepared. If the incident went on for several hours (or days!), a thirty-minute review meeting won’t give you enough time. Conversely, the longer a meeting runs, even with important content, the harder it is to secure broad attendance and hold people’s attention. A good rule of thumb is to keep it between thirty minutes and two hours. You might indicate in the agenda that people can attend for however long they are able.

Agenda

While all incidents are different, we recommend the following structure:

Opening remarks
An overview of the analysis
Interactive narrative summary
Themes discussion
Call for questions
Steps already taken & next steps
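
If it helps to see how this agenda might fit into a scheduled slot, here is a minimal sketch in Python. It is only an illustration: the function name `plan_agenda` and the relative weights are hypothetical and are not recommendations from this guide. The only figure taken from the text is the thirty-minutes-to-two-hours rule of thumb from the Timing section.

```python
# Agenda items are taken from the list above. The weights are hypothetical
# placeholders; adjust them to suit your incident and your audience.
AGENDA_WEIGHTS = {
    "Opening remarks": 1,
    "An overview of the analysis": 1,
    "Interactive narrative summary": 4,
    "Themes discussion": 3,
    "Call for questions": 2,
    "Steps already taken & next steps": 1,
}


def plan_agenda(total_minutes, weights=AGENDA_WEIGHTS):
    """Split a meeting length across agenda items in proportion to their weights."""
    # Rule of thumb from the Timing section: thirty minutes to two hours.
    if not 30 <= total_minutes <= 120:
        raise ValueError("expected a meeting length between 30 and 120 minutes")

    total_weight = sum(weights.values())
    plan = {item: total_minutes * w // total_weight for item, w in weights.items()}

    # Integer division can leave a few minutes unassigned. Hand them to the
    # highest-weighted item (a design choice made for this sketch).
    leftover = total_minutes - sum(plan.values())
    plan[max(weights, key=weights.get)] += leftover
    return plan


if __name__ == "__main__":
    for item, minutes in plan_agenda(60).items():
        print(f"{minutes:>3} min  {item}")
```

Treat the result as a starting point for the invite, not a script to enforce. Some sections will generate more discussion than others on the day.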

Begin the meeting with expectations and acknowledgements. This creates the conditions for an honest and collaborative conversation to thrive.

Opening remarks: set your practical ground rules

As covered in the Howie Guide, if the meeting will be recorded, make that clear, explain why, and get the approval of the attendees. Establish which topics might be diverted into a parking lot of ideas to be addressed later, or in a separate meeting entirely, such as: rabbit holes into technical details, corrective implementation ideas, or action items. You might ask participants to write down questions they have and things they want addressed, and circle back at the end to see if they were answered.

Opening remarks: set your interactional ground rules

Next you’ll want to establish an interpersonal agreement between all attendees. You’re about to start a journey through a potential minefield of events and topics; navigate them successfully by first acknowledging where they are and how to move past them.

When you see data coupled with detailed analysis, it’s easy to assume choices made during an incident were informed by details that weren’t available until afterwards. Point out these counterfactuals if you see folks falling into that trap. And don’t forget to acknowledge the sneaky, innate predisposition present in every learning review: Hindsight Bias. Remind attendees that the incident responders did what they could with the information they had available to them at the time.

This leads to the other elephant in the learning review: Blame. Often in the pursuit of a “blameless” review, we avoid saying individuals’ names, or even skip over particularly sensitive parts of the event for fear of coming across as blameful. But to truly learn from an incident, we need to understand all the circumstances around an action, including the thought process of each individual as they responded in the moment. We can’t afford to be “blameless” if it means avoiding the parts of an incident that are difficult to talk about. Those parts are often where we find the most to learn.

It is not the act of discussing these things that imparts blame, it’s the environment in which the discussion takes place. To paraphrase John Allspaw: “Having a ‘blameless’ Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account without fear of punishment or retribution.” So the answer is neither to avoid using names, nor to place responsibility for a failure on an individual. It’s to be blame-aware. Acknowledge that having responders discuss their experiences is how we learn from each other. Make it clear to all involved that everyone present is responsible for pointing out blame and moving past it. This enables an open discussion that allows us to share knowledge, improve how we work, and to solve the challenges we face. Together.

To help sum up everything above, here are a few sayings you can deploy to ease folks into the right headspace:

“Don’t should on yourself, and don’t should on others.”
“Should” is a word that leads into several traps: guilt, blame or judgment, and counterfactuals rooted in hindsight. Listen for it, and be prepared to step in and redirect the conversation back to what actually took place.
“If you think this might be a stupid question, ask it.”
This goes beyond “there are no stupid questions.” The questions we feel we should already know the answers to, and the assumptions we already hold, are usually where there is the most to learn. These seemingly simple, clarifying questions can lead to incredible insights and knowledge sharing.
“Be curious and non-judgmental.”
This is it. This is how to create a blame-aware environment: learning and sharing without fear of repercussions. The review should be an agreement between all parties to embark on a discussion, grounded in empathy and understanding, with the intention of learning how things happened, what unfolded, and what may still be unclear.

Once you lay this groundwork, you’re ready to move into the analysis.

Analysis overview

Talk through the data you analyzed (message channels, docs, previous incident reports, Jira tickets, Zoom recordings, etc.) and the number and scope of the interviews you conducted. Keep this short. Summarize where you started, where the analysis took you, and what your overall approach was while investigating this incident. This helps folks understand where you got the information that will be shared.

Interactive narrative summary

Now it’s time for your timeline to shine. Start by providing a short description of the event, cuing the folks you interviewed to describe what unfolded from their perspective. Use the timeline to prompt different responders to share their experiences, and have various subject matter experts provide background knowledge on how the relevant pieces of the system/technology work. Leave space for discussion—the facilitator should not be doing the majority of the talking here.

Keep an eye out for those rabbit holes! Rabbit holes can look like extended time spent discussing specific technical details about how the system worked or should have worked, or implementation details on how to fix or change something. If you’re not sure whether it’s a rabbit hole or an important discussion, ask! Let folks know how much time in the meeting remains, and ask if it’s valuable to continue this conversation now, or if it requires a dedicated time and place to dig into (we’ll talk about this more in our post on action items!).

Themes discussion

Once you’ve gone through the timeline, provide an overview of the themes identified in the calibration document. Take time before the meeting to prioritize the most important themes you want to cover, as you will rarely have all the time you need. Ask for commentary from the responders, subject matter experts, and stakeholders present. It’s okay if not all of the themes in the calibration doc are covered. Some will likely generate more discussion than others. How people respond to and dig into your themes helps to indicate which ones warrant a closer look!

Call for questions

Now is the time to refer folks back to any questions they may have held onto during the discussion, any topics not addressed, or unresolved concerns. If the group doesn’t have any of their own, share some that came up for you during your investigation. If there are a lot of unresolved questions, acknowledge they might not all be answered in this learning review, and may warrant further discussion as an action item.

Steps already taken & next steps

Review the remediation and improvement work that has already been done and discuss the next steps for action items, which are covered in more detail in a dedicated post.

The next steps also include transitioning the calibration document into a final report, if your organization does one. We suggest the “how we got here report” that tells the story of how the event unfolded from the many different perspectives that you uncovered. “Howie” is built to be different from a traditional postmortem process. Instead of simply reviewing what happened, this report focuses on how the event came to be in the first place, making sure different perspectives are well represented. Here’s an example.

Get people excited to read and discuss your write up, provide an ETA for delivery, and direct folks to make it a living document through commenting and sharing. Make sure to invite attendees who may not have already read the calibration document to provide their input. Don’t forget to solicit feedback! Encourage participants to give their thoughts on the process, meeting, and write up. Collecting feedback helps improve your process, clarify any lingering concerns, and make the entire process more engaging.

Once the meeting is over, take a deep breath, go for a walk, lie on the floor if you need to. Facilitating can be hard work! You established ground rules, overcame blame, walked through the incident narrative and themes, and established expectations for next steps. Some learning reviews will be harder to navigate than others, and that’s okay. Remember that the more incidents you investigate and meetings you facilitate, the more your muscle memory and skills will grow!

For more detailed information on these and other topics around incident analysis, you can always check out Jeli’s Howie: The Post Incident Guide.
