One of the first things we often get asked when working with companies who want to improve their learning from incidents is “who should investigate our outages?”
Good incident analysis, like writing good code, is a skill set to be learned and refined over time. Many companies take the “template approach,” thinking that an easy-to-use template removes the need for any specialized skills or knowledge. While it is true that simply filling out a template doesn’t require much skill, it also doesn’t produce much in the way of useful insights. To get real value from your incidents, it helps to recognize and cultivate the kinds of skills, knowledge and experience that are needed to generate meaningful findings to support post-incident activities, like holding post mortems or writing reports.
At Jeli, we believe that incident investigation and analysis should be a core skill for all software engineers working in continuous deployment environments. Why?
Because participating in incident investigation and analysis offers an opportunity to:
- deepen our understanding of the different software involved in delivering our service;
- expand our knowledge of the various kinds of skills and expertise that exist across an organization; and
- identify where process or organizational structure may be helping (or harming) reliability and resilience.
Investigating incidents exposes an engineer to people and parts of the business they may not typically interact with, which can improve their ability to think strategically and proactively about their own day-to-day work. In other words, it’s targeted and applied professional development for engineers, plus the chance to develop their network and learn about business goals and priorities! Score!
In this post we define an ‘investigation’ as exploratory work following an incident that is designed to answer questions about what happened, why it happened and how an organization should learn from it. An ‘investigator’ is anyone who carries out that work. Investigations and investigators mean different things to different companies and some organizations call these roles incident analysts or facilitators.
In this post we will elaborate on some baseline skills for those who tackle investigations, then explore the pros and cons of solo vs team investigations and when to bring in outside help.
Defining Ideal Investigator Traits
The ideal investigator is someone who has expressed an interest in post-incident activities, has typically attended a number of post mortems and has an aptitude for empathy, keeping an open mind, asking curious questions, and taking a systems view in trying to understand events. These are the people who, when they hear about an action taken by a fellow engineer that on the surface seems to be an egregious error, don’t pile on and assume incompetence. Instead, they might say “Oof, I’ve done something similar. It sounds like it would have been one of those realizations that makes your heart stop. Can you recall what you saw that led you to believe it was the right thing to do?”
Their curiosity is evident and they understand the complexity involved in making a decision under conditions of uncertainty, time pressure, and stress. They empathize. They are comfortable with the messiness of real world operations so they make sure to include details relevant to painting a realistic picture of the conditions faced by the people involved. They are confident and non-partisan when those details may be an organizational “hot potato” that will surface difficult conflicts or highlight unresolved deficiencies (more on this later).
Developing Investigator Skills
Even with an aptitude for troubleshooting, problem solving and empathizing, most software engineers do not possess an innate ability to investigate incidents. Therefore, to set up new investigators for success it’s important to provide support in the form of training, mentorship or shadowing opportunities.
There is a paradox when it comes to building investigation capacity within engineering teams: many companies want to see value from their incident reviews before allocating time or budget for investigation training. We’ve noticed a pattern where companies that have traditionally just filled out templates for their post-incident process begin to realize more benefit when an engineer who was trained in investigation techniques at another company joins and applies these skills on their new team. This engineer then begins training more engineers in analysis techniques, broadening the impact.
This apprenticeship/mentorship model works well in the context of grassroots initiatives. It allows you to continuously develop new investigators, but it’s a lot of effort for the mentors. It’s worth using as one approach to investigator training, but perhaps not the only one. Implementing other ways to train and support investigators reduces the burden on mentors and ensures that any mentors/coaches also have time to support investigations themselves.
In the absence of more formal training or mentorship, we’ve also seen companies support investigator development by encouraging new investigators to attend other teams’ learning reviews and/or shadow more experienced investigators, so they can ask questions and learn from others’ practices.
Identifying Potential Investigators
In the Howie guide we briefly discussed some characteristics that investigators “should” or “shouldn’t” have.[1] These characteristics are in part to set that person up for success, and in part to set the investigation up to provide the highest return on the time invested in it. Incident analysis is a skill that develops over time, but at the beginning, simply managing the logistics involved in conducting an investigation (let alone handling a potentially sensitive interview or meeting facilitation!) can be stressful and, without the right skills, can leave an engineer feeling lost about how to make it useful to others. Let’s break these down further.
When an incident occurs, the following guidelines will help your team decide who can investigate the incident with the most success and least stress!
An investigator should have:
- foundational knowledge
While it is useful for the investigator to take a novice’s view of the system, asking the obvious questions that can surface hidden assumptions, it takes a base level of knowledge to recognize when different engineers may have differing perspectives on how it all works. Investigators should have a sense of the level of depth at which to explain the components involved, to make the best use of time in a review or report. It’s very likely that an investigator will encounter software they are unfamiliar with, such as a new component or one owned by a team they don’t work with frequently. This offers an opportunity to establish for themselves (and others who will review the findings)
- some training in investigation techniques
Investigations, like incidents, involve sifting through a lot of potentially useful but ‘noisy’ data to discover what is meaningful. If you don’t collect enough data or perspectives on the event, an investigation is unlikely to produce any meaningful insights. Conversely, the more data you collect, the more techniques are needed to help manage and extract meaning from it. The more complexity, sensitivity and visibility an incident has, the more important it is that people can trust that data will be handled and stored appropriately, and that they can trust the integrity and intentions of the investigators.
- the ability to create safe spaces for learning
As an investigator, your role is to represent the challenges inherent in the incident without hindsight bias or blame and create conditions for respectful, transparent dialogue to aid learning. Even those who have been doing this work for years continue to practice and learn from each investigation, continually refining approaches and techniques.
- the time and attention to dedicate to the investigation
Few companies have dedicated incident investigation teams, so much of this work is conducted alongside an engineer’s day-to-day work. While tradeoffs are an inherent part of working in fast-moving production environments, it is very cognitively demanding to be continually context switching away from an investigation. Ensuring an investigator can dedicate the time and attention to complete the investigation (regardless of how in depth that may be, whether 1 hour or 100 hours) is important to getting the most out of this work for your teams.
It’s also worth discussing who shouldn’t investigate.
An investigator should not be:
- directly involved in the incident
A key part of the investigator role is to represent multiple, diverse perspectives fairly, which can be challenging when the investigator was directly involved. As we mentioned at the beginning of this post, investigation should be a skill for every engineer on your team. Building a team of investigators means a company can have a flexible rotation schedule to avoid any conflicts of interest and prevent burning out investigators.
On smaller teams where it is not possible to have someone uninvolved in the incident, there should be an explicit expectation that incident findings are considered ‘tentative’ until other participants have an opportunity to comment and clarify. This could be as simple as reminding attendees at the start of a learning review that it is an expected part of the meeting to revise and clarify anything that is not accurate. Similarly, reports should be titled ‘draft’ and given a timeframe for participants to provide comments or clarifications before finalizing content.
- untrained, inexperienced or completely new to the company
As mentioned previously, a base level of knowledge, skills and experience is needed to drive impactful results from your post-incident activities. Set the engineers and the initiative up for success by choosing investigators who can generate the kind of insights that make good use of the organizational attention given.
- in a position of authority
Even in very egalitarian workplaces, power imbalances can exist between investigators and people in positions of authority that can influence the outcomes of the investigation. People in a position of authority are those who direct day-to-day activities, control compensation or promotion opportunities, or have significant influence over workplace satisfaction. Because of this imbalance of power, investigations run by those in a position of authority are less likely to be questioned, and any inaccuracies are less likely to be challenged by responders.
Determining solo vs team investigators
Whether an investigation is completed by an individual or a team depends a lot on the type of incident that has occurred, the internal investigatory resources available, and the culture of the organization. As post-incident activities drive more meaningful insights, they drive more value for the organization and tend to attract more support and interest. Continuous investments in building up a team of investigators can help amplify the value. However, you have to start somewhere and, to paraphrase a statement from the previous blog post in this series,[2] the right investigator for the job is the one who is available!
Extra support for “hot potato” incidents
Despite efforts to build up internal investigation capacity, there are circumstances that better lend themselves to bringing in experts. A pattern we’ve seen emerge in this work is the presence of the “Hot Potato” incident. Merriam-Webster defines a ‘Hot Potato’ as “a controversial question or issue that involves unpleasant or dangerous consequences for anyone dealing with it.”
These events can be politically fraught for individual engineers to investigate without explicit support from senior leadership. Even then, it may be more appropriate to hire third-party investigators[3] to help your organization work through a hot potato.
Establishing ongoing learning
To practice what we preach, it’s important to learn from the learning process itself. Many companies develop a community of practice (CoP) for their investigators/analysts to give them a forum to share their discoveries, compile lessons learned and provide feedback for one another. The more sophisticated versions of these CoPs include formal training as well as informal discussions, drawing on newly published research or bringing in experts from the field to continually introduce new ideas. For more detailed information on these and other incident analysis topics, you can always check out Jeli’s Howie: The Post Incident Guide. If you enjoy this content or want to suggest a future topic, tweet us @jeli_io.
[1] Assign, Howie Guide
[2] Which incidents should you investigate? Vanessa Huerta Granda & Laura Maguire
[3] Aftermath Projects, Adaptive Capacity Labs