If you’ve read some of the content we have been putting out there, you know that we believe post-incident analysis should go beyond root cause analysis. Here at Jeli, we promote thematic analysis: looking into incident themes in order to get a richer understanding of your work.
But what do we mean by themes? Themes are the takeaways from an incident and the analysis that followed it. By focusing on takeaways, instead of solely on the “one, true root cause” or skipping straight to action items, investigators can extract knowledge from the investment that was the incident. Focusing only on root cause and action items may lead to a false sense of security or accomplishment, leaving much of the learning on the table.
For example, without thematic analysis we may think that we learned all there is to learn from an incident because we decided to fix a bug and tell people to not bring down prod again. But with thematic analysis, we can look into the conditions that made it possible for individuals to take down prod in the first place, or how engineers understand the impact of the changes being made. This type of knowledge and understanding “can better equip engineers in handling future surprises” in addition to minimizing the specific failure mode seen in the incident.
How to spot them
In the Howie Guide we encourage folks to “think of your themes as the topics of interest that surfaced throughout the investigation: what surprised you, what do you think others should know more about, what is shared among other incidents.”
It’s hard to give an exact definition of what a theme looks like. Frankly, finding them often requires a “smell” or “spidey-sense”. This is why I believe that storytelling is so important in incident analysis. I like to hear people involved in the incident tell the story from their point of view, let those involved ask each other questions, and then ask my own. I take note of things that folks maybe do not understand or did not know prior to this narrative. The things that folks found fascinating or worth recapping can be “themes”, as can the things they will talk about at the holiday party, or months later when discussing the work with a new teammate. These are all important takeaways and potential themes in the investigation.
Jeli is built for this storytelling type of incident analysis. Our timeline feature allows you to see exactly what happened and when during the incident, which provides a great starting point for people to share what they were experiencing in the moment. When preparing for an incident review, I will usually go through the Slack transcript of what happened, start surfacing a narrative and jot down questions. I will then take a look at the questions I had and will start to organize them by what they had in common:
Was there a piece of technology we didn’t understand during the incident (or I as an investigator am unsure about)?
Were people unsure of the impact? Did folks in different parts of the org understand impact differently?
Were responders confused about the signals they were being presented with?
Was there inconsistency in how the incident was being communicated?
Did somebody do something really cool when troubleshooting?
In the example above we don’t even mention the technologies impacted, but this doesn’t mean we aren’t learning from them. We can discuss the contributing factors and mitigators during the review meeting and in the incident report; folks will learn more about their systems and apply that knowledge to their work. But we should also discuss the particular themes that made the incident happen the way it did. This leads to a richer understanding of the incident, which in turn will enhance future resilience.
What isn’t a theme?
While there’s no specific definition of incident themes, there are a number of pitfalls you should avoid as you uncover themes in your incident.
Action items. If it can be solved with a pull request, it’s not a takeaway. This does not mean we don’t address bugs during an incident review; we just don’t stop there.
Blaming or calling out. Maybe during your review meeting you learned that one individual or team “triggered” the incident. While it’s easy to say “so and so doesn’t know how to do their job,” it does not benefit anyone to stop the investigation there and can lead to folks losing trust in the post-incident process. Instead, we want to understand how events led to that person or team triggering the incident. Did they inherit some new technology they don’t completely understand? Are there gaps in the onboarding process? Is the on-call rotation not taking into account knowledge silos?
Anything too vague. While an incident’s themes can be applicable to other incidents, technologies, and even other organizations, they should always teach us something! There’s a difference between “incidents happen during code freeze” and “engineering’s urgency to release quickly before the code freeze period may lead to rushing things without the usual/proper checks and balances, leading to an increase in incidents”.
What to do with these themes?
Now that you have a better idea of what we mean by themes and how to spot them, you should spread the wealth.
Just like with any other incident learnings, it’s not enough to have identified them. Your work will pay dividends when you share it with others. I like to include thematic takeaways in all of my incident outputs. When writing an executive summary, I will usually include them alongside incident impact and action items. If you produce an incident report, make sure to have a section for takeaways. Talk about them in weekly updates, sprint retrospectives, and onboarding!
Spot them in other incidents
It seems that cross-incident analysis is on everyone’s minds nowadays. Instead of discussing the mean time to resolution of last quarter’s incidents, start looking at commonalities across incidents. How many incidents are related to a poor understanding of dependencies? Which incidents’ impact did we not understand due to insufficient data? We will explain cross-incident analysis further in a later blog post, but quality individual incident analysis is a requirement for cross-incident learnings.
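If you already note themes for each incident, even a tiny script can help you start spotting commonalities. Below is a minimal sketch in Python of what that tallying could look like. Everything in it is an assumption made for illustration: the incident IDs, the theme labels, and the `theme_counts` and `incidents_by_theme` helpers are hypothetical, and this is not a Jeli feature or API.

```python
from collections import Counter, defaultdict

# Hypothetical data: incident IDs and theme labels are made up for illustration.
incidents = {
    "INC-101": ["poor understanding of dependencies",
                "impact unclear due to insufficient data"],
    "INC-102": ["poor understanding of dependencies"],
    "INC-103": ["inconsistent incident communication",
                "impact unclear due to insufficient data"],
    "INC-104": ["poor understanding of dependencies",
                "inconsistent incident communication"],
}


def theme_counts(incidents):
    """Count how many incidents each theme appears in."""
    counts = Counter()
    for themes in incidents.values():
        # set() so a theme listed twice on one incident is counted once
        counts.update(set(themes))
    return counts


def incidents_by_theme(incidents):
    """Map each theme to the sorted list of incident IDs that share it."""
    grouped = defaultdict(list)
    for incident_id, themes in incidents.items():
        for theme in set(themes):
            grouped[theme].append(incident_id)
    return {theme: sorted(ids) for theme, ids in grouped.items()}


if __name__ == "__main__":
    # Print themes from most to least common across incidents.
    for theme, count in theme_counts(incidents).most_common():
        print(f"{count} incident(s): {theme}")
```

The counts are only a starting point for conversation. The value still comes from going back to the incidents that share a theme and asking why it keeps showing up.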
Thematic analysis is a new skill, different from traditional root cause analysis or running action-item-focused reviews. The best way to get good at this work is by doing it. It may not be easy or perfect at first (and it does not need to be perfect to provide value), but incident analysis is a muscle: the more you use it, the stronger it will be! So for your next review, spend some time (maybe 30 minutes) reviewing the narrative, jot down questions, and try to come up with two or three themes to discuss as a group.