Spooky season has arrived. With Halloween and the even scarier holidays that follow right around the corner, it’s a great time to talk about the most haunted of all documents: the Post Mortem. *lightning cracks in the distance, a wolf howls, a creepy fog descends as a post mortem doc suddenly erupts through the topsoil of a freshly dug grave*
What is a post mortem document?
It’s the report written after an incident that typically contains:
1. What happened during the incident:
Executive Summary of the event
Details of the various features and services affected, including customer impact
Who performed specific roles during response
2. How we got through it:
A timeline of the incident
Actions taken to solve (and attempt to solve!) the issue(s)
3. Who was involved:
Who was actively involved in response, or simply observing to relay details or learn from the experience
4. How we got here:
Takeaways or Themes
Factors that contributed to this event occurring
Interesting discoveries that may be relevant to other teams, other incidents, and future process and product development
5. What still needs to happen:
Action Items, or the work still left to restore or renew the impacted systems
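For teams that keep their docs in code or generate them with a script, the outline above can be sketched as a small reusable template. This is only an illustration: the names `Section`, `POST_MORTEM_OUTLINE`, and `render_template` are hypothetical and are not part of Jeli or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One numbered section of the outline and the prompts listed under it."""
    title: str
    prompts: list[str] = field(default_factory=list)


# Hypothetical template mirroring the five-part outline in this post.
POST_MORTEM_OUTLINE = [
    Section("What happened during the incident", [
        "Executive summary of the event",
        "Features and services affected, including customer impact",
        "Who performed specific roles during response",
    ]),
    Section("How we got through it", [
        "A timeline of the incident",
        "Actions taken to solve (and attempt to solve) the issue(s)",
    ]),
    Section("Who was involved", [
        "Who was actively involved in response, or observing",
    ]),
    Section("How we got here", [
        "Takeaways or themes",
        "Factors that contributed to this event occurring",
        "Interesting discoveries relevant to other teams or incidents",
    ]),
    Section("What still needs to happen", [
        "Action items, or the work still left to do",
    ]),
]


def render_template(sections: list[Section]) -> str:
    """Render the outline as a plain-text skeleton ready to be filled in."""
    lines = []
    for number, section in enumerate(sections, start=1):
        lines.append(f"{number}. {section.title}:")
        for prompt in section.prompts:
            lines.append(f"   - {prompt}")
            lines.append("     (fill in)")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_template(POST_MORTEM_OUTLINE))
```

Keeping the prompts as data rather than hard-coded text makes it easy to rename sections or add prompts to fit your organization's own vocabulary.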
If you’re looking for more information on how to get the data you need to fill out a post mortem document, check out our in-depth Howie Guide and our Incident Analysis 101 blog series, which breaks the process down into bite-size pieces. In this post, though, let’s take a step back and talk about why we spend time on post mortem documents in the first place, and how Jeli can help you cut that time down.
Why create a post mortem document?
Incidents are expensive - to your team’s morale, the company’s bottom line, its reputation, and customer trust. After a costly downtime event, it only makes sense to seek a return on that unplanned investment.
Numerous fields outside of tech, such as medicine, manufacturing, aviation, emergency services like EMT and firefighting, and even extreme sports, take learning from incidents very seriously. Most incidents lead to an investigation that involves a full write-up, and sometimes a formally published report. These are deeply detailed documents that compare existing and new understandings of their processes and systems, as well as what they’re doing to adapt those systems in the wake of the incident. In fact, the majority of the processes for response and analysis used in tech are borrowed from the techniques other fields use to learn from incidents.
We can’t know what to fix and change about our systems until we understand what happened within them. The learning that occurs while creating a post mortem doc is the context about the state of your product that will inform future business decisions.
At Jeli, our goal is to help people learn from incidents. In tech, the business goals are ever present: to repair customer relationships and hit specific metrics. Shifting the perspective after an incident to learning helps make those short-term goals accessible and crowdsourced. You have already spent a good amount of time and money on this incident, so collecting and sharing the knowledge responders earned from that event is a wise business move. We emphasize learning because it directly produces the understanding you use to keep your customers happy and achieve those business goals.
Without centering learning as the goal, a post mortem document is often filed and forgotten.
Why is it called a post mortem?
As noted earlier, the majority of our incident response and analysis practices in tech come from other fields that have specialized in learning from all kinds of events, including the medical field. Yes, post mortems are quite literally named after the report created during an autopsy to find the cause of death.
Since we tend not to dissect remains searching for the contributing factors of death (we’re talking to our teammates about the decisions we made and how we got through an incident), at Jeli we stay away from the phrase “post mortem.” These documents go by many names: post incident review, event review, and opportunity report, to name a few. Go with whatever makes the most sense for your organization and/or gains traction among teammates.
Creating a post mortem doc helps you learn.
The process of compiling all of the data into a post mortem document, either to distribute or to add to during a review meeting (or both!), is where the majority of the learning from an incident takes place. With Jeli’s Narrative Builder you can pull in all the Slack transcripts relevant to an incident, adding either individual or multiple messages as evidence for events along the timeline.
You can then add context and notes to each event along the timeline, summarize what took place, or add questions that arise. If, during the incident, you spot Slack messages you know will be relevant to the timeline, you can mark them with any reacji in Slack and then search by that reaction in Jeli to pull up your marked messages and add them to the timeline.
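To make the "mark it, then search by reaction" idea concrete, here is a minimal sketch of the same filtering done by hand. It is not Jeli's implementation or API. It assumes messages are plain dictionaries shaped like Slack's exported message JSON (a `ts` timestamp string, a `text` field, and an optional `reactions` list), and the function names and the `pushpin` reaction are hypothetical choices for illustration.

```python
from datetime import datetime, timezone

# Illustrative sample data, assumed to follow Slack's exported message shape.
SAMPLE_MESSAGES = [
    {"ts": "1698765120.000200", "user": "U2", "text": "Rolling back the deploy",
     "reactions": [{"name": "pushpin", "users": ["U1"], "count": 1}]},
    {"ts": "1698765000.000100", "user": "U1", "text": "Error rate is climbing on checkout",
     "reactions": [{"name": "pushpin", "users": ["U3"], "count": 1}]},
    {"ts": "1698765060.000150", "user": "U3", "text": "brb, coffee"},
]


def marked_messages(messages, reaction_name):
    """Return the messages carrying the given reaction, oldest first."""
    marked = [
        m for m in messages
        if any(r.get("name") == reaction_name for r in m.get("reactions", []))
    ]
    return sorted(marked, key=lambda m: float(m["ts"]))


def to_timeline(messages):
    """Turn messages into timeline entries with an empty note to fill in later."""
    entries = []
    for m in messages:
        when = datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc)
        entries.append({
            "time": when.strftime("%H:%M:%S UTC"),
            "evidence": m["text"],
            "note": "",  # context, summary, or open questions go here
        })
    return entries


if __name__ == "__main__":
    for entry in to_timeline(marked_messages(SAMPLE_MESSAGES, "pushpin")):
        print(entry["time"], "|", entry["evidence"])
```

Agreeing on a single marker reaction before incidents happen keeps this kind of search predictable, whichever tool ends up doing the filtering.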
Since Jeli has streamlined the timeline building, it’s spookily easy to create a really robust timeline quickly. This will leave you with time to spare to include the discussions and decisions from response that might not have ultimately led to a solution, but showcase responders’ thought and troubleshooting processes. Having these steps preserved in the timeline can help folks trying to find potential solutions for similar incidents in the future, as timelines are searchable content in Jeli.
The timeline is also the perfect place to highlight where struggles with your incident process got in the way. The post mortem is a place to talk about not only the technical issues that took place during an incident, but also the process and interpersonal issues that surfaced. Incidents are where we discover not only where our technical systems fail, but also where our coordination and communication might break down. Understanding and addressing these issues is just as vital to creating resilience as addressing bugs in your code.
The outline of a post mortem doc earlier in this post can even work as a template you can add in multiple places in Jeli to make the process more accessible to folks new to incident analysis, and speed it up for those more experienced.
A post mortem doc is meant to be shared.
If the process of creating a post mortem document is how we collect the learnings from an incident, how do we make sure that knowledge gets to everyone it applies to? Reviewing a post mortem doc is easier said than done. In fact, we have an entire blog post dedicated to how to share your findings.
Sending out a teaser in your abstract or sharing out your findings in weekly digests are great ways to get eyes on a post mortem document. Targeting specific audiences works well too:
If a discovery during a post mortem redraws the mental model of how parts of your system interact, that’s a great opportunity for an internal tech talk or brown bag lunch discussion of your system’s architecture.
If it was a real struggle to determine customer impact for an incident, partner with support & customer success to gauge what details customers are looking for, and where the disconnect may be between customer experience and existing logs.
If there’s a concern around training or information retention leading to mistakes during deployment, make the learning sticky by sharing it as a badly made meme or rhyming haiku.
(Never underestimate the power of rhyming or silly memes when it comes to making a piece of information stick.)
We’re looking to build safety in our systems. As Woods and Cook say in Nine Steps to Move Forward from Error (2002):
“...organizations that manage potentially hazardous technical operations remarkably successfully create safety by anticipating and planning for unexpected events and future surprises.”
Regardless of the size of your organization, you are likely aware of what matters to folks after an incident is resolved. Do they want to know:
How did the incident unfold?
What was the impact, and how did we figure that out?
How did the contributing factors come together to create the impact we saw?
How did our mental models of the system change based on this event?
What was surprising about this incident?
Did anything make this incident particularly challenging?
If there were only one or two things you’d want others to know about this incident or its response, what would you share? Take that information and share it with your teams. After all, that is why we create a post mortem document: to learn from an incident, document our findings, and use that knowledge to inform future decisions and business goals.
Collecting, compiling, and analyzing incident data into a post mortem document can feel daunting, but with Jeli the amount of time it takes to do this work is slashed more than Jason Voorhees could ever imagine. Our friends at Quizlet used to measure their post incident work in hours; now they measure it in minutes. Incident analysis doesn’t have to be scary. Jeli is here to help. Try it out for free and let us know what you think!