PagerDuty Blog

Near-miss incidents: how to review and learn from them

This post was originally published on the Jeli blog. Jeli was acquired by PagerDuty in 2023 and we’re reposting it here to bring their thought leadership to our community.

We’ve discussed which incidents may be more suitable for an in-depth review than others. Among these are the incidents that could have been: the near-misses. These incidents are particularly helpful because they provide an easy on-ramp to learning. After all, they’re free from the dark cloud that often looms over the incidents that did, in fact, hit. Yet not enough orgs take advantage of learning from their near-misses.

What is a near-miss incident?

Near-misses have similar characteristics to what we usually consider an “incident”: something happens, we need multiple folks to collaborate on addressing it, folks need to drop whatever it was they were doing to work on it immediately, and we need someone to coordinate and communicate what is happening. But unlike a traditional incident, at the end of the near-miss, our end user is not impacted! Because of the hard work of those involved during the incident process, we are able to stop the wave of impact from reaching that far.

Near-misses can vary in form. Some examples include:

  • An error in an accounting system that was caught early enough to be fixed before invoices were sent out.
  • A call center’s phone system was down during off-hours. The in-house team was able to work around this outage in time for the call center to open.

In either case, due to the quick response, the system was able to complete what it needed to do. Ultimately, we don’t need to fully define near-misses since their whole point is to provide us with an opportunity to expand the universe of events we can learn from.

What can we learn from near-misses?

A near-miss can tell us as much about our systems, organizations, and work as an “original recipe” incident.

Near-misses:
  • Help us understand what is important to us as an organization: who is our end user? What do they need from us? How do we know if we are fulfilling this need?
      • In the call center example, our user may be the call center employees or those trying to reach them. Both groups need to be able to use the phone systems during operating hours; otherwise it’s a full-blown incident.
  • Tell us who the key players are for a specific system: who do we need when the system stops working? How do we work with them?
      • Maybe we think an incident only requires the engineers on the accounting team, but we actually also need folks who can bypass controls around releasing.
  • Show us how we find out about incidents: what do we look at? What indicators do we pay attention to?
      • While the accounting system may be up, how do we check that we are getting accurate results? In the call center example, how do we differentiate between “hard down” and “degradation of call quality”?
  • Explain how the system works: what did we expect to happen? What happened that we didn’t expect?
      • By reviewing the near-miss, folks can better understand the architecture behind the telephone systems and the history surrounding it.
  • Highlight everything that had to happen for the incident to be a near-miss, and provide examples that can be used in other parts of the org.
      • Perhaps the developers in charge of the accounting system have a uniquely close relationship with the folks doing the reconciliation, and that kind of relationship should be encouraged across other teams. Or maybe we had some workarounds ready that let us make quick changes during the incident. Those workarounds can also be built into other processes.

How to review near-misses

We can review near-misses the same way we review any incident. You may follow the Howie process with a disclaimer for participants that, while you understand this incident did not have user impact, there is a lot to learn from it! For your first iterations of near-miss reviews you may have to do some convincing. If so, we recommend following a more lightweight process and perhaps bypassing interviews. Once folks in your organization see the benefit of reviewing these near-misses, they’ll be more likely to agree to invest the time in these investigations.

Near-misses are some of our favorite learning opportunities. We have found that participants are more willing to share their stories from their own point of view when they are in a celebratory mood and know that they cannot get in trouble. Reviewing near-misses is a great way to get started learning from incidents as it provides the psychological safety necessary for a learning culture to take hold at an organization.