Jeli wrote the Howie Post-Incident Guide as an in-depth explanation of how to make the most of your incidents.
We built Howie to be customizable across organizations of different sizes, investigators with different skill levels, and incidents of different severity. The guide is broken into eight steps, which can be thought of as a set of tools that you may use depending on a number of factors (time, expertise, and the incident itself).
The full Howie guide is available for free, but if you’re looking for a lighter, shareable version, read on!
The investigator or analyst is the person in charge of moving the process forward for a specific incident. They should have a baseline understanding of the technology and the incident analysis process, and ideally should not have been directly involved in the incident. If possible, we recommend finding a partner and working on the investigation as a pair.
Identify the sources of data to include in your work. This is the background needed to make sense of the event, and it may include written and recorded comms (Slack, Zoom), dev tools, and people-specific data (location, team structure, tenure).
This is where you familiarize yourself with the incident, figure out who was involved, create a narrative, jot down questions and emerging themes, and finally consolidate this information into one place.
Identify Themes: Look for patterns in the data, things that were difficult or surprising, or descriptions of technology.
Jeli allows you to complete the analysis steps in one place!
1. Import Slack channel data into Jeli
Ingest whole channels, partial channels, or threads only, using the Jeli message shortcuts in Slack.
2. Look at the People data and create a list of folks to interview or invite to the review meeting.
3. Create a Narrative timeline and questions
Look through the transcript and find messages that indicate points in the narrative you want to highlight. It might be helpful to think of them as fitting into one of these markers: detection, diagnosis, repair, or key moment. Add these messages to a marker as supporting evidence.
For each marker, summarize what the messages indicate in the Summary field.
Read through the messages you have added as supporting evidence and think through what questions they raise. Remember not to make assumptions; ask instead (e.g., “How did Liz know to do what she did here?”).
Create as many markers as you want! Don’t be afraid to go back and forth between incident stages (e.g., go back to detection after some repair work).
4. Identify themes
Look at the questions in the Narrative and add the themes to the “Takeaways” section in Jeli.
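If you want to rehearse the Narrative step outside the tool, the structure described in step 3 can be sketched in a few lines of Python. This is only an illustration and not Jeli's data model or API: the `Marker` class, the `open_questions` helper, and the example messages are all made up for this sketch. Only the four stage names and the sample question about Liz come from the guide.

```python
from dataclasses import dataclass, field

# Stage names taken from the guide; everything else here is hypothetical.
STAGES = ("detection", "diagnosis", "repair", "key moment")


@dataclass
class Marker:
    """One point in the narrative, with its supporting evidence."""
    stage: str
    summary: str = ""
    evidence: list = field(default_factory=list)   # transcript messages
    questions: list = field(default_factory=list)  # open questions, not assumptions

    def __post_init__(self):
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")


def open_questions(timeline):
    """Collect every question across the timeline, in narrative order."""
    return [q for marker in timeline for q in marker.questions]


# Invented example data, for illustration only.
timeline = [
    Marker("detection", summary="Alert fired and was acknowledged.",
           evidence=["(example message) seeing elevated errors on checkout"]),
    Marker("repair", summary="A rollback was attempted.",
           evidence=["(example message) rolling back now"],
           questions=["How did Liz know to do what she did here?"]),
    # Stages may repeat or go backwards: detection again after repair work.
    Marker("detection", summary="A second symptom was noticed after the rollback.",
           questions=["What made the second symptom visible at this point?"]),
]

for question in open_questions(timeline):
    print(question)
```

Keeping the questions attached to the marker that prompted them makes it easier to bring them to interviews or the review meeting with their supporting evidence.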
Interviewing the participants involved can help elicit more details about what was confusing and why, the event itself, and the response efforts. Interviewing participants also helps to surface important knowledge about your systems, which includes the technical architecture but also insights into your teams, company culture, and communication patterns. This step is time-consuming, so depending on your organization or incident, you may choose to skip this step and ask these questions during the review meeting.
Identify who you’d like to interview. Some examples include key players, systems owners, and impacted teams.
Ask them to tell the story from their point of view, and then follow up with the questions identified in the earlier step. Be curious, not judgmental.
Share your agenda and Jeli report with folks ahead of the review meeting to make sure you’ve captured things correctly. Doing this gives everyone a chance to clarify their mental models, as well as make additions or corrections to your investigation.
The incident review meeting is a facilitated opportunity to discuss, for the first time and as a group, how things happened, what unfolded, what was surprising, themes identified, and what is unclear.
Facilitation Tips: Send out your material in advance. At the beginning of the session, reiterate the purpose of the meeting; the full-length Howie guide has a great script to help you get started. If the meeting becomes too negative, feel free to end it early and reschedule it for a different time.
Sample agenda (it helps to explain what to expect and to time-box items):
1. Intro + quick summary of what happened + context on the technology (6-7 mins)
2. Begin going through the Jeli timeline, call on folks involved to share their POV, and ask the questions you’ve noted (20 mins)
3. Go through 2-3 themes as a group (15 mins)
4. Action items, if needed (10 mins)
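A quick way to sanity-check the time boxes is to add them up against the meeting slot. The sketch below is only an illustration: it uses the upper bound of 7 minutes for the intro, and the 60-minute slot is an assumption rather than something the guide prescribes. The helper names are made up.

```python
# Time boxes from the sample agenda, in minutes (intro uses the upper bound of 6-7).
AGENDA = [
    ("Intro, quick summary, and context on the technology", 7),
    ("Walk through the timeline, hear POVs, ask noted questions", 20),
    ("Go through 2-3 themes as a group", 15),
    ("Action items, if needed", 10),
]

MEETING_SLOT = 60  # assumed meeting length in minutes


def total_minutes(agenda):
    """Sum of all time boxes."""
    return sum(minutes for _, minutes in agenda)


def schedule(agenda):
    """Return (start, end, title) offsets in minutes for each agenda item."""
    rows, start = [], 0
    for title, minutes in agenda:
        rows.append((start, start + minutes, title))
        start += minutes
    return rows


for start, end, title in schedule(AGENDA):
    print(f"{start:02d}-{end:02d}  {title}")

print("Total:", total_minutes(AGENDA), "min")                   # Total: 52 min
print("Buffer:", MEETING_SLOT - total_minutes(AGENDA), "min")   # Buffer: 8 min
```

With these numbers the agenda takes 52 minutes, which leaves a small buffer in an hour-long meeting for late starts or a discussion that runs over.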
One-dimensional reports can be hard to learn from. We take a narrative approach to presenting the contributing factors to the event, background on the technology involved (including important historical context), and details about how the incident coordination was handled. The report describes both the technical failures and the social and organizational processes involved.
We recommend tracking analytics for the document. Seeing who is reading it and how often or how long after the event can provide lots of valuable insight into what is useful for the organization. This report should be written to be read, not filed!
Action items should be discussed after the learning has taken place, as action items are not the purpose of learning from incidents. The key to quality action items is collaboration, ownership, and reflection. Remember: action items are not limited to ticketed work; they can also be further discussions, or even sharing what you learned across your organization.
At Jeli we believe in the power of taking a proactive approach to incident analysis to help you truly understand how you got here.
We believe that incident analysis can be your organization’s secret weapon that will allow you to gain value from your incidents. We understand that good incident analysis is an investment, so we hope that this guide will help you push toward that change!