Incident Analysis 101: Techniques for Sharing Incident Findings
You’re basking in the glow of new knowledge after your learning review meeting. Everyone is feeling smarter, more informed, and so optimistic for the future. That’s wonderful and worth celebrating—but what about the folks that couldn’t attend the learning review? How can we convey our findings to them and make sure everyone is reaping the wonderful benefits of learning from incidents?
At Jeli we help lots of organizations with this type of work, and many of us have led similar efforts at companies pre-Jeli. For our final article in the Incident Analysis 101 series, we have a number of recommended techniques to share. Mix and match these techniques depending on your organization’s style, and they will help keep everyone informed and engaged with the important output from your incident review sessions!
First off, why are findings worth sharing?
The principal reason for sharing your insights is simply transparency. People like transparency. Your teams will appreciate the culture of keeping everyone in the loop.
But the upsides don’t end there. Consider these other reasons to share the goods:
- Not everyone can make the meeting. Maybe they are out that day, in need of some heads-down time, or (eek) dealing with an incident!
- Some outcomes or insights may impact a person, team, or large swath of the organization. Let ’em know!
- Sharing cogent, timely information is a form of marketing for continued investment in the learning work you’ve done, and can help focus future effort.
- Buy-in for next steps may be needed from leadership or other stakeholders.
Through these efforts, you can establish a practice where everyone in your organization reads and learns from the outcomes of your investigation, instead of just filing them away to check that procedural box.
6 Different Ways to Share
How you choose to share—whether it’s written, visual, presented, or broadcasted from satellites in space—will depend entirely on the norms of your organization. This may mean experimenting with different options until you find what gains the most traction. What matters is that you adapt your output to something that is legible and natural for your colleagues, so that they get the most out of it.
Here are six formats that we often recommend to the companies we work with. We’ll discuss each of them in turn, but feel free to use one or more of them in your own work!
- The Abstract
- The Summary
- The Recording
- The Presentation
- The Report
- Weekly Updates
1. The Abstract
An Abstract is the #1 thing we recommend folks do (hence its place on the list). Every incident report should have an abstract. It should run 1 to 2 paragraphs and cover what happened, why we should care about this event, contributing factors, learnings, and themes.
While folks who didn’t attend the review have the best intentions to read all the written materials, they often won’t have the time. The abstract needs to be your incident’s elevator pitch: if people are going to learn from this incident in 1 minute, what information do they need to know? This way, you can be sure your audience is getting the gist, even if they don’t have a ton of time.
An abstract can also help them decide if they want to commit to reading the report or watching the recording. So, when writing an abstract, make it as easily digestible as possible.
That said, the abstract is brief, so it may lose nuance or depth. Readers will benefit from a link to a more comprehensive artifact in case they have more questions.
How to Share Your Abstract
You choose! The abstract can be shared informally over Slack, email, and (if you use it) Jeli! It can also open any of the other sharing formats, as a sort of preface.
Who to Share It With
An abstract can be shared with everyone. It’s particularly great for sharing with leadership and executives, but can also be an excellent entry point for people not directly involved in the incident (such as other engineering teams or non-technical folks) to learn more about exactly what happened.
An Example Abstract:
During the Thanksgiving holiday we received multiple complaints of customers unable to access the “Add to Cart” button. This incident was escalated through Customer Support, and eventually, the engineer on call realized a change made the previous day to a non-critical service (Consul) was causing this problem. It was fixed by reverting the change and re-releasing. The outage lasted 5 hours, with 12 responders from 3 different teams.
This incident highlighted a number of themes:
- Consul is perceived as a non-critical service but it actually impacts a large number of critical needs.
- Consul is supported by a handful of engineers who inherited it.
- Only one person at the company (not in the Consul team) knows how Consul works and what it touches.
- Escalation policies during the holiday made it tough for the on-call engineers to quickly resolve the issue.
- The code freeze period can lead to changes being rushed out the door.
2. The Summary
The summary is a slightly more comprehensive account of the incident. It builds on the information included in the abstract, and adds a more detailed overview of the discussion around each theme. If you are including possible action items, include who suggested them.
The summary gives folks more context on what was discussed in the review meeting. The reader should be able to get an understanding of the key themes discussed, as well as any next steps suggested. This is a fairly standard artifact in many organizations, but yours will be a bit different as it orients the reader to what was learned, rather than just relaying what happened!
This form of sharing the incident findings, while more verbose than the Abstract, still carries the downside of brevity. Readers may benefit from being linked to another artifact if they have more questions.
How to Share the Summary
This can be shared in the form of an email, or shared in a larger document. Include links to any other artifacts.
Who to Share With
People who may be impacted by the learnings and anyone whose buy-in you need. When sharing, make sure to tag these people and specify why you’re sharing this with them.
Note in the example below how the summary is slightly more elaborate than the Abstract.
An Example Summary:
During the Thanksgiving holiday we received multiple complaints of customers unable to access the “Add to Cart” button. This incident was escalated through Customer Support and eventually the engineer on call realized a change made the previous day to a non-critical service (Consul) was causing this problem. It was fixed by reverting the change and re-releasing. The outage lasted 5 hours, with 12 responders from 3 different teams.
This incident highlighted a number of themes:
- Consul is perceived as a non-critical service but it actually impacts a large number of critical needs. SMEs suggest looking into the READMEs to understand dependencies when troubleshooting.
- Consul is supported by a handful of engineers who inherited it. There needs to be a larger conversation around how we can transition products from one team to another and what is considered sufficient documentation.
- Only one person at the company (not in the Consul team) knows how Consul works and what it touches. We believe this is the case for many orphaned systems.
- Escalation policies during the holiday made it tough for the on-call engineers to quickly resolve the issue. New on-call engineers can best leverage DataDog links to find the services impacted, which should help them prepare.
- The code freeze period can lead to changes being rushed out the door.
Potential Action items:
- Review data around changes being released around the code freeze window – Vanessa
- Review service handoff material – Laura
- Review training materials for new engineers on call – Will
3. The Recording
This is exactly what it sounds like: a recording of the actual review call. When sharing, it helps to include a message with timestamps for the key moments of the discussion, to orient your audience.
This will, of course, be the medium that most resembles the experience of attending the actual review. However, it takes your audience more time to get through than the other formats, which could result in lower engagement. And while they get to listen to what happened, they can’t participate.
How to Share the Recording
Sharing is easy when the review meeting happens over Zoom or Google Meet. Simply share the screen recording file over Slack, and make sure to link to it in the incident report for those who may want to know more after reading the report.
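The timestamped message suggested for the recording can be put together by hand, but here is a minimal sketch of one way to generate it. Everything in it is an assumption for illustration: the function names, the recording URL, and the offsets and labels of the key moments are hypothetical, not part of any tool mentioned in this article.

```python
def format_timestamp(seconds: int) -> str:
    """Format an offset into the recording (in seconds) as H:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def recording_message(recording_url: str, key_moments: list[tuple[int, str]]) -> str:
    """Build a short message that links the recording and lists key moments in order."""
    lines = [f"Recording of the learning review: {recording_url}", "Key moments:"]
    # Sort by offset so the list follows the order of the call.
    for offset, label in sorted(key_moments):
        lines.append(f"- {format_timestamp(offset)} {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical URL and moments, loosely based on the example incident above.
    message = recording_message(
        "https://example.com/recordings/add-to-cart-review",
        [
            (1260, "Discussion: holiday escalation policies"),
            (120, "Timeline walkthrough"),
            (780, "Discussion: who knows how Consul works"),
        ],
    )
    print(message)
```

The message can then be pasted into Slack next to the recording, or placed at the top of the incident report where the recording is linked.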
Who to Share It With
This is a great format to share with those team members who were involved with the incident, yet were unable to attend the review meeting. Leadership and other colleagues not involved in the incident are unlikely to watch or listen to this type of review.
4. The Presentation
From time to time, you will want to provide a synthesis of one or more incidents in a presentation format. For example, this can be done in an informal Community of Practice Meeting or a quarterly presentation to leadership.
The presentation can take multiple shapes. It can be a storytime, where you pick one incident and talk through what happened, the themes, and learnings. Or it can be something more along the lines of a prepared talk, where you share multiple incidents and your learnings with a specific goal. Either way, definitely allow time for Q&A!
This is your chance to drive the narrative to a captive audience. A good presentation can rally individual contributors as well as win support from leadership. Well-crafted slides can do wonders to reinforce your themes and build buy-in from stakeholders.
In some organizations this may take more effort—not everyone is a big fan of writing slides and presenting them in front of others. As mentioned above, leveraging these for a small number of interesting incidents, or to a like-minded group of folks, may ease the burden.
How to Share the Presentation
Depending on the audience you can take advantage of any time that is already dedicated for engineers to present and chat about technology. If you have a specific goal you want to accomplish—perhaps one that needs leadership buy-in—make sure to schedule a specific time where all relevant stakeholders can attend.
Who to Share It With
Anyone! Just make sure you’re adjusting your presentation to your audience. Leadership will often want a more focused presentation that touches on the themes and actionable advice for next steps. Practitioners, while interested in those themes too, will also have more room for discussing context and nuance.
5. The Report
The report is different from a standard post-mortem in that it is primarily focused on the bigger story of what happened, and the context around how the events came to be. The goal, always, is to learn from the incident.
Reports are the most complete written artifact coming out of your incident. A report will give folks an in-depth understanding of the themes around the incidents and how we found them.
This format is much more verbose than the Abstract and the Summary, and therefore requires a greater time investment from the reader. It is more focused than the Recording and might suit organizations or individuals for whom synchronous presentations are difficult.
How to Share the Report
Share it in the meeting agenda for your Learning Review and post it on Slack following the review. Encourage folks to comment on it—this report is meant to be read, NOT filed!
Who to Share It With
Ideally, everyone! This will mostly be read by folks involved in the incident, teams using similar technologies, folks whose buy-in you need for possible action items, and (our favorite) new members of the team/organization trying to get a better understanding of how the socio-technical system works.
6. Weekly Update
The weekly update consists of a quick review of all the incidents that were analyzed that week. It can be a list of the incidents with their abstracts and a link to the full report. You can also include additional data points like “teams impacted” for quick access to more thorough learnings.
This is a great option for larger organizations with lots of incidents or with silos. Everyone can take a quick glance at this list, find incidents they are interested in based on keywords (e.g. services impacted, technologies involved), and read further into what they find interesting.
Be careful to not allow this format to develop into a shallow, tabular form. You can combat this shallowness by hoisting up themes or other colorful findings from the underlying artifacts and linking to them so the reader can satisfy any curiosity.
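If your weekly update is assembled from structured data, the shape described above (a list of incidents, each with its abstract, a link to the full report, and data points like teams impacted) might look something like the sketch below. This is only an illustration under stated assumptions: the `IncidentEntry` fields, the helper names, and the report URL are hypothetical and not taken from Jeli or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentEntry:
    """One analyzed incident in the weekly update (all fields are illustrative)."""
    title: str
    abstract: str
    report_url: str
    teams_impacted: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)


def matches(entry: IncidentEntry, keyword: str) -> bool:
    """Case-insensitive exact match against an entry's keywords and teams."""
    needle = keyword.lower()
    return any(needle == tag.lower() for tag in entry.keywords + entry.teams_impacted)


def render_weekly_update(entries: list[IncidentEntry]) -> str:
    """Render a plain-text digest that keeps each abstract, not just a table row."""
    lines = [f"Incidents analyzed this week: {len(entries)}", ""]
    for entry in entries:
        lines.append(entry.title)
        if entry.teams_impacted:
            lines.append("Teams impacted: " + ", ".join(entry.teams_impacted))
        if entry.keywords:
            lines.append("Keywords: " + ", ".join(entry.keywords))
        lines.append(entry.abstract)
        lines.append("Full report: " + entry.report_url)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical entry based on the example abstract earlier in this article.
    entries = [
        IncidentEntry(
            title="Customers unable to access 'Add to Cart' during Thanksgiving",
            abstract="A change made the previous day to Consul caused the problem. "
                     "It was fixed by reverting the change and re-releasing.",
            report_url="https://wiki.example.com/incidents/add-to-cart",
            teams_impacted=["Customer Support", "Consul team"],
            keywords=["Consul", "code freeze"],
        )
    ]
    print(render_weekly_update([e for e in entries if matches(e, "consul")]))
```

Keeping the abstract and the report link in every entry is a deliberate choice here: it gives readers the themes up front and a path to the deeper artifact, which helps the update avoid becoming a shallow table.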
How to Share the Weekly Update
This form of update can live in a Notion or wiki page and may also be shared through periodic email blasts or Slack announcements. It gives you a bit of editorial space to spice things up and entertain the audience if that’s your thing.
Who to Share It With
The same audience as the Abstract: everyone. Just don’t rely on the weekly update when you need someone to take immediate action.
In conclusion, share it your way. Just make sure you share it.
At the end of the day, any effort you invest to promote what you’ve learned from your incident investigations is deeply important. Drawing in those that couldn’t attend the meetings gives them an opportunity to deepen their understanding of the situation and the organization itself. Plus, your sharing effort acts as a form of marketing that helps to promote the themes and next steps from your incidents, while emphasizing your proactive investment in learning and growing. An organization that shares more, learns more. And more learning means more open doors.
For more on these and other topics around Incident Analysis, you can always check out Jeli’s Howie: The Post Incident Guide. If you enjoy this content or want to suggest a future topic, tweet us @jeli_io.