In a 2018 talk on blameless post mortems, John Allspaw said, “The people who were there in the middle of an incident are experts in what went wrong and all the messy details that can prevent it from happening again.” [1] Taking the time to interview participants after an incident allows us to unlock each individual’s expertise and share it across the organization. Interviewing generates first-person accounts that highlight the knowledge inherent in your teams (as well as the gaps), the strategies that engineers use to manage unexpected system behavior, and any difficulties that impeded efforts to restore the service. This is crucial information to gather. It helps everyone more readily learn from the experience, even those who weren’t there, and gives the organization an opportunity to look at how existing practices and processes help or hinder during incidents.
All this to say, interviewing is a worthwhile use of the time you’ll spend investigating. It gives you high-value insights to supplement your transcript reviews and can be the most efficient way to get at the heart of what matters about this particular incident. [2] In this post we’ll assume you recognize this and are able to make the time to interview. If you simply don’t have the time to interview, check out the article “What kind of data can you use, and should you use, for an investigation” to make the best use of the time you do have by using other sources of incident data.
Who should you interview?
So, you’re sold on conducting interviews. You might be thinking “There were over 200 people in that incident, and over 20 were super active! How will I ever get all those critical interviews done before my incident retrospective in 5 days?!” The good news is, when you started your initial analysis and began to understand the event, you already did some of the legwork required to identify key players! The even better news is, experts have been doing this work for years so you don’t have to start from scratch.
As highlighted in our Howie guide, here are some things you might consider when deciding who to interview [3]:
- Key players from key moments.
Detection, diagnosis, or resolution of the incident—if someone was heavily involved in one of these things or led the charge, schedule away.
The key players involved in the incident are usually (but not always!) people who can give you both broad and specific perspectives about the event. For example, a broad perspective may come from the incident commander who managed the incident activities. Their overview of the event will help you quickly get oriented to the incident. Another key person could be the responder who identified the issue or came up with the mitigation/repair. They can help you understand the more specific, technical details of the failure. Key players also come from the customer support, field engineering, customer success management, and sales sides of the business. Often these key players were the people who brought timely, specific information from the customer’s perspective. Depending on the rules of your organization, even the customer(s) themselves can be interviewed, as they are deep reservoirs of context and information! The most important goal of interviewing key people is to maximize the information you receive from the time you spend talking to those involved.
- People that you didn’t expect to get involved in the incident.
This might include folx that weren’t on call or weren’t on the team of the impacted system.
Regardless of whether they were pulled into the incident or jumped in of their own accord, taking the time to interview these people can give you insight into expertise (perceived, or otherwise!). Typically when someone is intentionally pulled into an incident, it’s because they have knowledge about a system that makes them incredibly useful in the thick of an incident. By interviewing them and sharing those findings in your facilitation meeting, you can start to spread out that expertise so that, for example, ‘Suzie’ is no longer the only person who knows how Consul works.
- People that may feel blamed.
It’s important that people who feel in some way “responsible” for the incident get to share their perspective in an interview so that their experience can be accurately represented in the investigation.
These interviews are crucial in creating a psychologically safe environment. It’s important that people feel their perspective is not only being heard but fairly represented, and that they won’t be blamed for incidents. These interviews can also surface additional contributing factors around organizational processes that may have been missed in other data sources.
- Owners, perceived owners, or disputed owners of impacted systems.
Former owners can also be interesting interviewees.
These people tend to have a high level of understanding of the system, which can make for a more productive retrospective and a richer post-incident write-up. Seemingly basic knowledge about how the full system works puts the incident in valuable context which, when included in your incident write-up, makes it more likely to be read and shared down the line.
- Subject-matter experts (SMEs) and people that exist on “islands of knowledge”.
Anyone with expertise in the system(s) or components involved in the incident is good to interview. And those who are “the only person who knows how to resolve an outage” represent subject-matter expertise that should be more widely distributed.
As with owners, interviewing subject-matter experts can help better identify the actual level of expertise required to operate the system. Oftentimes, experts have a hard time documenting and explaining their expertise, so interviewing them (even with very basic questions) can draw out this knowledge. We often ask SMEs “how well known is this across your team” and find a surprising number of times the answer is “not at all well known”. These answers are good signals that sharing this knowledge in a report or learning review is a way to break down knowledge silos, and a training opportunity to help others develop that expertise more broadly.
- “Lone wolves”
People who take action independently of the group effort may have a different perspective of how the system works.
Lone wolves are often seen as counterproductive to incident response, but it is worth probing into why they did not coordinate their activity with others. These interviews could surface potential issues with team dynamics, processes, or communication channels that may have contributed to this independent action.
- The ticket-writer and/or the person who wrote the change.
The person who wrote the ticket that prompted changes related to the incident, and/or the person who wrote those changes.
Interviewing these people can give us important context into the steps leading up to the incident. Ask questions such as “Did we have the right context about the system and the organization when we wrote that ticket? What led to it getting pushed out?”
- People that are new to the organization.
Pretty self-explanatory. Ask the newbies.
Interviewing these folx can help us surface hidden assumptions about our systems.
Working with time constraints
This article focused on who you should interview in your quest to facilitate a learning-focused incident retrospective. However, we all run into time constraints that force us to make tradeoffs in interviewing.
If your target interviewees are unavailable, or you run out of time before you need to deliver a report or facilitate a meeting, here are a few suggestions for how to handle it:
- Sequence your “high-value” interviews first so that you can get the greatest amount of context in a short amount of time.
- Cut down the length of your interviews instead of eliminating them entirely.
- Ask the questions in a direct message or in a Google Doc to allow the interviewee to respond asynchronously.
- Prepare a summary based on other information you’ve gathered and ask the interviewee to confirm its accuracy and provide comments. This focuses the time they do have available on verifying and clarifying instead of constructing understandings.
- Make an effort to end your interviews once the amount of relevant new information drops off noticeably.
- Use a calibration document to allow ongoing asynchronous discussions.
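If it helps to make the first few suggestions concrete, here is a minimal Python sketch of one way you might sequence interviews against a fixed time budget. Everything in it is an assumption made for illustration and is not part of Howie: the `Candidate` type, the `plan_interviews` helper, and the idea of using the number of categories a person falls into as a rough proxy for "high-value". Your own judgment about who matters most should always override a count like this.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A potential interviewee (hypothetical structure for illustration)."""
    name: str
    # Which categories from the "who to interview" list apply to this person.
    criteria: set = field(default_factory=set)
    # Planned interview length in minutes.
    minutes: int = 45


def plan_interviews(candidates, budget_minutes, min_minutes=20):
    """Sequence candidates by how many criteria they match, then fit them
    into the time budget.

    Returns (plan, async_followups):
      plan            -- list of (name, minutes) in the order to interview
      async_followups -- names to reach by direct message or shared doc
    """
    # Assumption: more matching criteria means a higher-value interview.
    # sorted() is stable, so ties keep their original order.
    ordered = sorted(candidates, key=lambda c: len(c.criteria), reverse=True)

    plan, async_followups = [], []
    remaining = budget_minutes
    for c in ordered:
        if remaining >= c.minutes:
            # Full-length interview fits.
            plan.append((c.name, c.minutes))
            remaining -= c.minutes
        elif remaining >= min_minutes:
            # Shorten the interview instead of eliminating it.
            plan.append((c.name, remaining))
            remaining = 0
        else:
            # No usable time left: ask the questions asynchronously.
            async_followups.append(c.name)
    return plan, async_followups


if __name__ == "__main__":
    people = [
        Candidate("incident commander", {"key player", "owner", "sme"}),
        Candidate("new hire", {"new to org"}),
        Candidate("responder", {"key player", "lone wolf"}),
    ]
    plan, followups = plan_interviews(people, budget_minutes=70)
    print("Interview:", plan)
    print("Follow up asynchronously:", followups)
```

The design choice here is to shorten the last interview that partially fits rather than drop it, and to route everyone else to an asynchronous follow-up, which mirrors the suggestions above.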
Taking the time to talk to responders through interviews can help us surface knowledge about our systems, both technical and socio-technical, that will allow us to be more resilient as we deal with incidents in the future. For more detailed information on these and other topics, you can always check out Jeli’s Howie: The Post-Incident Guide, our step-by-step, plug-and-play guide to Incident Analysis.
Thanks for attending another session of Incident Analysis 101! If you enjoy this content or want to suggest a future topic, tweet us @jeli_io.
[1] Getting the Messy Details is Critical, John Allspaw
[2] Have you seen this before?, Lorin Hochstein
[3] Howie: The Post-Incident Guide