10 Things I Learned From My First Incident Review

Published on
August 16, 2022
Author
Fischer Jemison
Software Engineer

I started at Jeli as a software engineer on the backend team about a year ago. Coming from a team that didn’t place a lot of emphasis on learning from incidents, I was really excited to do a “real” incident review at Jeli, and earlier this year I finally got my chance. I learned a lot from the experience. I’d like to share some of the things that I found particularly interesting or surprising.

1. Trust Matters

The first person I interviewed was one of the primary responders during the incident. He was also an engineer on the backend team, and we’d worked together before coming to Jeli. Having that existing relationship made it easier to talk about the events of the incident and be candid about actions that might have been perceived as mistakes. Interviewing someone you’ve worked closely with also makes it easier to trust that they will represent the events of the incident fairly to the rest of your org. Building that kind of trust with people you don’t work closely with seems challenging, and I can imagine that being a major hurdle for people starting incident analysis programs at larger companies.

2. Incident Interviews Are A Lot Like Technical Interviews

I was expecting my first interviews to be awkward and stressful because I was totally new to the process of incident interviews. But even though I was definitely a little awkward and a little stressed, it didn’t really feel like a totally new experience. I pretty consistently found myself using skills I’d started building over the last year during a series of technical interviews for some of our recent hires on the backend team:

  • Actively listening to people talk about subjects I may not be an expert in and taking detailed notes
  • Understanding how much someone did or didn’t understand about a particular subject
  • Getting people who aren’t talkative to talk more, and keeping interviews with talkative people on track
  • Building rapport with interviewees during the interview

I never had to start from zero with someone I’d never met (like you do in a job interview) and the power dynamic with the people I interviewed was pretty different, but I can imagine with bigger organizations and higher severity incidents these two types of interviews may actually start to be quite similar. This makes for a kind of funny pitch for doing more incident reviews: you’re not hiring all the time, but you can use incident interviews to practice your interview skills all year.

3. If It’s Your First Incident Review, Bring a Copilot

One thing we’ve started doing informally at Jeli is pairing engineers who are new to incident analysis with people from our solutions engineering or research teams who have a lot of incident expertise and use the product frequently. My copilot was particularly helpful during interviews; she asked questions that hadn’t occurred to me, and gave me a lot of useful feedback in debriefs after individual interviews. Having a copilot also made for good psychological safety: it gave me someone to ask the “dumb questions”, and I knew that if I missed an important question in an interview, she would ask it. It made the entire process way less stressful.

4. Incident Reviews Take Time

I assumed that doing this review would be a little side project, without much impact on my regular work. However, between researching the incident, interviewing and writing up notes, and creating the actual documentation for the review, doing a good job took time and focus, as well as understanding from my manager that this was a priority.

A piece of advice from my incident review copilot at the start of the process was that a major skill in interviewing was knowing how to get value out of a short interview. After this experience I think that’s true for the other parts of the review as well. If you’re in an org with less buy-in for reviewing incidents, or more incidents to review, can you deliver value from the process without interviewing every single participant? Maybe without spending an afternoon hand-drawing diagrams about the control flow of the buggy code that caused the incident? I have a new appreciation for how difficult it is to cut down on time. If nothing else, I know for future incident reviews that it’s important to block out focus time for studying the incident, and that starting your calibration document the morning of the review is probably not a good idea.

5. Timing Matters

I did my first round of interviews the week of March 14th, then took a week of vacation, then did a second round of interviews and facilitated the learning review the week that I got back. There was a very noticeable difference in the second round of interviews, where participants needed longer to “warm up” and refresh their memory of the incident, whereas, in the first round, the incident was fresh enough that interviewees were generally able to dive right in with only a short refresher. 

My takeaway from this is that interviewing participants as soon as possible after the incident makes it very straightforward to get value from the conversation. Conversely, I learned it’s entirely possible to have valuable incident interviews even when it’s been a while since the incident; you just need to accommodate the fact that people will need a little more help to refresh their memories of the incident, and might want to allocate more time for interviews.

6. Consider Talking to Customers

The incident I reviewed involved responders working closely with a customer to troubleshoot the incident, and from the start of the learning review process I thought it might make sense to involve that customer directly. We considered interviewing the customer as well as inviting them to the learning review, and ultimately decided to just interview them. When talking about this idea with the rest of the team, here are some things that came up:

  • How would this affect our reputation? 
    For Jeli, we tend to think customer involvement in the incident review process is generally positive, both to show off our incident review skills and to foster a culture of openness around incidents. 
  • Is this something I was comfortable with? 
    I had done a couple of interviews already at this point, so I felt comfortable interviewing the customer who’d participated, but I’d still never facilitated a learning review before and I was undecided about my level of comfort there.
  • How does this affect our relationship with the specific customer involved? 
    Based on our experience working with the specific customer involved we felt comfortable that any level of involvement would be a positive for the relationship, whether they were just interviewed or also invited to the retro.
  • How would the customer’s presence affect the learning review? 
    We were concerned that participants might not be comfortable speaking candidly about their actions in the incident with a customer present. Even with our strong, blame-aware culture at Jeli, it’s still difficult to get rid of the pressure to put on a positive face for customers. We wanted to make sure we were able to discuss the challenging parts of the incident and not just the parts that went well.

Ultimately, we put a lot of weight on the last point, and decided to only interview the customer involved without inviting them to the review; this was a complex incident with enough technical and coordination issues that we wanted to be sure we’d do them justice in a learning review. And this approach worked really well! The customer felt included in the process, and we had a very productive learning review that included the insights we learned from them.

7. Everyone is Wrong About Everything

One thing that became pretty obvious when I was reviewing what people were saying contemporaneously in the incident was that absolutely no one has a complete and correct understanding of the state of the system at any given moment throughout an incident. Many of the mid-incident hypotheses I saw were wrong, even if they were directionally right, which is something that only becomes obvious when you have time outside the constraints of the incident to examine in detail precisely what caused the symptoms observed during the incident. And after seeing it in one incident, it’s something I’ve noticed in all of the incidents I’ve been involved in since.

That made it a lot harder than expected to come up with an incident narrative that was both accurate and cohesive. As the narrator of the incident, you need to coalesce everyone’s view into a single storyline and figure out how to represent multiple conflicting views, without spending thousands of words simply recounting exactly what people were doing and saying at any given moment.

Another way to think about this is using the “above the line, below the line” framework, something I learned about in an internal talk last year. As the incident reviewer, you get to see people forming their mental models about the system in real time, and then you can synthesize those models with other data about the system in order to build a more accurate model. Once you have a more accurate model, you can share it back with the rest of your team as part of the incident review.

8. Event Data is Essential

People being wrong about stuff isn’t just a problem during the incident. One of the first things to go in people’s memories after an incident is the specific timeline of events in the incident. Interviewees pretty consistently didn’t remember when key events occurred, or even mis-remembered the order of mitigation steps or customer interactions – through no fault of their own! That’s just how the human brain works. 

Fortunately, we recorded most of the key events of the incident in Slack via our incident response bot and tooling that reports deploys to a #deploys Slack channel. We also had an incident commander who was actively recording the hypotheses and decisions being discussed in the incident call back into Slack. I ingested that data into Jeli and was then able to use it to fill in the details of when things actually happened, while relying on interviews for the things they’re actually useful for, like providing context and qualitative information.

Image 1: Jeli timeline view showing investigators forming, disproving and confirming hypotheses throughout the entire duration of the incident
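
To make the idea of leaning on recorded event data concrete, here is a minimal Python sketch of merging two exported message streams into a single ordered timeline. It is only an illustration and not how Jeli ingests data: the `Event` class, the `parse_messages`, `build_timeline`, and `format_event` helpers, and the sample messages are all hypothetical. The one assumption about Slack is that each exported message carries a `ts` field holding Unix epoch seconds as a string.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import heapq


@dataclass(frozen=True)
class Event:
    ts: float     # Unix epoch seconds
    source: str   # which stream the event came from (hypothetical labels)
    text: str


def parse_messages(source, messages):
    """Turn exported messages (dicts with 'ts' and 'text') into time-sorted Events."""
    events = [Event(float(m["ts"]), source, m["text"]) for m in messages]
    return sorted(events, key=lambda e: e.ts)


def build_timeline(*streams):
    """Merge already-sorted event streams into one chronological list."""
    return list(heapq.merge(*streams, key=lambda e: e.ts))


def format_event(event):
    """Render an event as 'HH:MM:SS [source] text' in UTC."""
    when = datetime.fromtimestamp(event.ts, tz=timezone.utc).strftime("%H:%M:%S")
    return f"{when} [{event.source}] {event.text}"


# Hypothetical sample data, made up for illustration only.
incident_messages = [
    {"ts": "1647270000.000100", "text": "Incident opened"},
    {"ts": "1647270600.000200", "text": "Hypothesis: bad config"},
]
deploy_messages = [
    {"ts": "1647270300.000100", "text": "Deployed backend build"},
]

timeline = build_timeline(
    parse_messages("incident", incident_messages),
    parse_messages("deploys", deploy_messages),
)

for event in timeline:
    print(format_event(event))
```

Because the ordering comes from recorded timestamps rather than anyone’s recollection, the deploy lands between the two incident-channel messages even though no interviewee would need to remember that.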

9. Howie Really is That Good

Before I started my incident review, I asked another Jeli backend engineer who’d recently done his first investigation for advice, and he told me that if I had any questions I should just look at the HOWIE guide, because they were probably answered there. Not only was that true, I pretty consistently found that tips from other Jeli incident experts were already in Howie (since they were the ones who wrote the guide in the first place). It really is a great doc, and, as I’ve learned, it’s incredibly useful if you’re new to incident analysis.

10. Incident Investigation Is Fun

As a software engineer, you don’t get to put on your metaphorical detective hat very often, and when you do it’s often for reproducing a live bug where the pressure is on to find a resolution. Incident investigations are different; you get to work at your own pace and go deeper to try to understand the failure that happened, examining a body of evidence that can span logs, traces, Slack messages, build errors, and people’s memories. It’s a totally different style of working than I was used to, and it made for a really enjoyable break from my normal responsibilities.

In Conclusion: Be Excited About Your First Incident Review

A pretty consistent theme about the stuff I learned in this incident review is that I either learned new things about our organization or I was challenged in a really productive way that I didn’t expect. It helped a lot that I work with a bunch of incident analysis pros, and that I had the Jeli app available, but some of the other tools I used are stuff anyone can get access to right now: one browser tab for HOWIE, one for Incident Analysis 101, and a pencil and paper (by my count, I took 8 pages of handwritten notes for the entire incident). It’s fun, and you’ll probably learn more than you think!
