Incidents inherently create stress. Systems going down and user complaints pouring in are enough to get the adrenaline pumping, and eyes and questions from internal stakeholders wondering when a resolution is coming can overwhelm anyone.
Incidents can also generate a fair amount of accidental stress. When we don’t take necessary steps to improve our systems and instead leave ourselves open to repeat failures, teams may start to feel hopeless. When the manager accountable for the system of interest follows up on an outage to find out who’s responsible, people may retreat to avoid associating themselves with anything that’s less than perfect.
Blameless culture and related practices like postmortems have made significant strides in counteracting the worst of this accidental stress, imploring organizations that are serious about their incident management process to explore failure from a systems thinking perspective (1). Although many organizations begin their incident management journey at this new baseline, the ideal outcome of learning deeply from incidents requires pushing the envelope even further.
The Learning Zone
Tom Senninger’s Learning Zone model (2) lays out a framework for learning that includes the well-known “comfort zone” and two additional zones: the learning zone and the panic zone.
The comfort zone is too safe and stagnant to promote learning, while the panic zone is filled with too many other concerns and emotions for people to learn effectively.
The inherent and accidental stresses of incidents have constant potential to push us into the panic zone. Sometimes this is okay; we rely on our internal alarms to tell us that something is an emergency. If we spend too much time in the panic zone during an incident, though, we might miss nuanced signals or details important for mitigation and resolution. And if we stay in a state of panic after an incident, we’re losing out on the opportunity to learn from it.
On the other hand, when we become tediously careful in our incident management or place more process than necessary around incident review, we risk keeping ourselves in the comfort zone where we may feel that there isn’t any learning to be had. We must, then, find ways to ensure and incentivize a level of risk taking (3) and generative culture (4) that promotes engagement, creative thinking, and innovation of process toward better systems.
Hallmarks of an organization that pushes people too far and too frequently into the panic zone are:
Lack of ownership. No one puts their neck out for fear of being on the chopping block.
Lack of communication. People don’t speak for fear of being misinterpreted, corrected, or ignored.
Long-lasting anxiety. Everything starts to look like an incident, and even small decisions are steeped in paranoia.
On the other end, hallmarks of an organization that stays within the confines of the comfort zone are:
Bystander effect. Because the stakes are low, few are motivated to engage.
Action item factories. The focus is on creating and completing tasks, regardless of how those tasks impact organizational outcomes.
Rubric following. Each review is in an identical format, agnostic of the shape of the incident or system or people in the review.
Incident treadmills. Actionable improvements aren’t identified or aren’t implemented, and the same incidents repeat regularly.
To maximize learning and avoid these pitfalls, we must enact humanist practices. This goes beyond awareness of blame and systems thinking. We must meet people where they are and create space for them to get curious and find their agency. We can interview responders individually before a review to reduce the bias and power dynamics common to group settings, and we can interview teams to better understand their culture and norms as they pertain to systems of interest. This ethnographic (5) approach fills out our social understanding of what is a deeply sociotechnical landscape.
ITHAKA’s Learning Path
As an organization, we’ve responded to incidents at ITHAKA from the beginning, working ad hoc through resolutions and their aftermath. In the last decade, drawing on ideas and practices from our peers, we evolved from that ad hoc approach to a more principled and predictable model. Whereas early on we responded uniquely to each incident, with hallmarks of both comfort and panic, we consciously took steps to push ourselves regularly into the learning zone. Following is a sampling of practices we’ve introduced at ITHAKA to continue building room for learning:
Keep a very light review backbone. We avoid ticking a litany of boxes in each review. Instead we ensure that we hear directly from those involved, record things people learn during the discussion, and identify any highly actionable steps we can take to improve.
Focus on the learning. We have a deep history of blameless culture and incident review, but we’ve started using even more positive language and focusing on the unique or patterned aspects of incidents. This has an added effect of improving attendance because it highlights that anyone can stand to learn and improve even if they weren’t involved.
Stratify communications. We proactively communicate any impacts and changes of broad interest to the organization so that stakeholders don’t need to interrupt the incident management flow. This keeps them informed and keeps mitigation and resolution on task while creating less stress.
At the beginning of 2023, we adopted Jeli.io for our incident coordination and learning reviews. A major shift Jeli helped us make was in lightening the review load on both the involved responders and the investigators. Where we previously put the burden on responders to fill out the incident timeline, Jeli has made it easy to construct a timeline from the actual activity already available in places like Slack and PagerDuty, adding only a comparatively small burden on investigators. This has reduced the amount of time people are pulled out of their usual focus for a given incident, and in our experience is more thorough than our prior practice. The investigator can also augment timelines as needed with additional interviews and research, allowing human judgment to guide the appropriate depth.
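The core idea of building a timeline from existing tool activity can be sketched in miniature: merge timestamped events from each source and sort them chronologically. This is a simplified illustration, not Jeli's implementation; the `TimelineEvent` shape and the sample events are hypothetical stand-ins for what an API like Slack's or PagerDuty's might return.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime
    source: str   # e.g. "slack", "pagerduty" (hypothetical labels)
    summary: str

def build_timeline(*event_streams):
    """Merge events from multiple sources into one chronological timeline."""
    merged = [event for stream in event_streams for event in stream]
    return sorted(merged, key=lambda e: e.timestamp)

# Hypothetical events, standing in for data pulled from each tool's API
slack_events = [
    TimelineEvent(datetime(2023, 3, 1, 14, 5, tzinfo=timezone.utc), "slack",
                  "Responder reports elevated 500s on search"),
]
pagerduty_events = [
    TimelineEvent(datetime(2023, 3, 1, 14, 2, tzinfo=timezone.utc), "pagerduty",
                  "High error rate alert triggered"),
]

timeline = build_timeline(slack_events, pagerduty_events)
for event in timeline:
    print(f"{event.timestamp.isoformat()} [{event.source}] {event.summary}")
```

Because the raw activity already carries timestamps, the investigator's job shifts from reconstructing events to curating and annotating them.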
At ITHAKA, Jeli helps us put the unique value of an opportunity on display and avoid falling into familiar routines with incident review. When we start a review, the Jeli opportunity page puts the shape of the incident up front with the opportunity’s start and end, executive summary, and relevant opportunity tags. In particular, the opportunity tags get responders and other review attendees thinking about their past related experiences, lending additional context to the review.
The landing page for an ITHAKA opportunity
By starting with this high-level fingerprint of the incident, review attendees are already opening themselves up to the idea that this is a unique opportunity to learn something new. In the review, we support and solidify this learning using Jeli’s narrative view. We build up the story of the incident from the information Jeli collects from Slack and other sources, refine that story through discussion with responders, and use that story to guide the learning review.
The high-level narrative view for an ITHAKA opportunity
Finally, Jeli has made communications management easier. We can readily call in additional responders with the PagerDuty integration, and we can easily communicate impacts and statuses with the StatusPage integration. We’re also interested in exploring the new webhook integration to send salient information exactly where it needs to go.
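The appeal of a webhook integration here is that a single structured update can be routed to whichever audience needs it. As a rough sketch only, and not Jeli's actual webhook schema, a payload for such an update might look like this; every field name and value below is a hypothetical example.

```python
import json

def build_incident_update(incident_id, status, summary, audience):
    """Assemble a hypothetical webhook payload routing an incident
    update to a single audience."""
    return {
        "incident_id": incident_id,
        "status": status,      # e.g. "investigating", "mitigated", "resolved"
        "summary": summary,
        "audience": audience,  # e.g. "engineering", "support", "leadership"
    }

payload = build_incident_update(
    "INC-042",
    "mitigated",
    "Search latency back to normal; root cause under review",
    "support",
)
print(json.dumps(payload))
```

Keeping the payload small and audience-tagged is what makes the "stratify communications" practice above automatable: each stakeholder group receives only the updates salient to them.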
I’ve personally been quite proud of our practice around incidents over my years with ITHAKA. Seeing how Jeli helps us continue to evolve and refine this practice with a focus on the learning has reminded me that the work is never done and that focusing on the people is usually the secret to success.
Dane Hillard is a technical architect at ITHAKA, where he works on various aspects of the JSTOR platform. Dane has been part of ITHAKA’s incident management team as both an incident manager and investigator for the last 5 years, during which he has managed over 30 incidents, investigated and reviewed over 15 incidents, and caused enough incidents to have lost count.
1. Meadows, Donella. Thinking in Systems: A Primer. Chelsea Green Publishing, 2008.
2. Senninger, Tom. Abenteuer leiten - in Abenteuern lernen: Methodenset zur Planung und Leitung kooperativer Lerngemeinschaften für Training und Teamentwicklung in Schule, Jugendarbeit und Betrieb. Ökotopia Verlag, 2000.
3. DeVaro, Jed, and Fidan Ana Kurtulus. “An Empirical Analysis of Risk, Incentives and the Delegation of Worker Authority.” Industrial and Labor Relations Review, vol. 63, no. 4, 2010, pp. 641–61. JSTOR, http://www.jstor.org/stable/20789040.