Incidents happen, despite the best plans and practices, and rarely go as imagined. Often a hapless engineer gets assigned to run an after-the-fact investigation. They have trouble getting the data and stories they need from others. They hold a meeting hardly anyone attends. They write up a report. It’s kind of long, which looks good sitting on file where no one bothers to read it.
People, not tools
Response teams are made of people, not machines and apps. We know, and cross-disciplinary research backs us up, that people on distributed teams need an environment that supports them in ways that lead to better performance, more productivity, and much higher job satisfaction. These three go together. We’ve baked that into our company culture and our product, and hope you do, too.
1. Make everyone feel safe sharing truths
Team members need psychological safety: the certainty that they can do their jobs, speak necessary truths, and have potentially challenging conversations without personal repercussions. Amy Edmondson at Harvard Business School has spent years researching what creates a “learning zone,” as she calls it. It’s the combination of high performance expectations and psychological safety. Her book The Fearless Organization explains what psychological safety is and isn’t. For incident analysis, we can boil that down to these points:
Don’t let the boss lead the investigation. Can they stay away entirely? People are less likely to question those in authority, and may be afraid to deliver facts that might make themselves look bad. Management’s job is to insist and ensure that the incident is analyzed thoroughly and reported on properly, but without being there to steer the process. Give them a great opportunity to show they can delegate!
Separate facts and ideas from personalities. Individual participants — including you — need to let go of their egos and focus on facts. Make it clear this isn’t a hunt for blame. Don’t focus on who’s right or wrong, or who said what. What matters is that it gets said. It should be OK to admit, “I made a mistake with updating configurations.” The point of the analysis is to find ways to make it less likely that anyone will be able to make that mistake in the future.
Psychological safety isn’t about being nice or lowering standards. It’s something people experience at a group level — the certainty that everyone is in an environment where they can speak their minds about important work issues and events without fear of repercussions. Sometimes the truth hurts. Sometimes the facts are embarrassing. People can’t always help being emotional, or even rude. Make it clear that what matters is they need to spell out the truth, for which no one will be punished.
2. Run an efficient, effective meeting
Burying a long retrospective report doesn’t lead to improved performance, increased productivity, or a healthier work environment. Here are some tips to consider for running your meeting:
Think like a facilitator, not an expert. You may have gathered information, interviewed participants, and analyzed and written up the findings, but this does not make you the expert. You are there to facilitate the sharing of knowledge and expertise among those who participated in the incident. Everyone experienced the event from their own point of view, and getting it directly from the source helps create a shared understanding of what occurred and gives everyone an opportunity to learn from each other.
Prepare people ahead of time. Create an agenda with set time limits. We recommend 30 minutes to two hours max. Make clear that you’ll stick to the schedule on your agenda, so people strapped for time can attend only the parts they are able to, though they really should attend! Also, share your findings at least 24 hours before the meeting. Check in with individuals whom you really want to contribute, to let them know what you expect and to alleviate any stage fright or defensiveness.
Craft a short abstract of your findings as your incident’s elevator pitch. Assume this abstract is all that most people will read. Use this short, pointed abstract to get those who didn’t attend the meeting or participate in the analysis to decide to learn more about the incident, and/or to get behind your proposed solutions.
3. Learn across incidents over time
We call it cross-incident analysis, or XIA, in our work. Incidents don’t live independently of one another. They’re best viewed as a connected series of events to be analyzed as a whole. We give away our IR Bot for Slack because a Slack bot is a must-have to handle any incident. But to understand the cross-incident patterns in your organization, at some point you’ll need to leave Slack. This is where Jeli really wins.
Look for people patterns. Jeli’s People View shows a global map of who was on call during each incident, who participated in mitigation, and who else was present as an observer. Studying patterns across incidents can reveal where there are holes in coverage, or conversely which people get drawn into incident after incident, a sign of impending burnout or of someone about to quit. You’ll also see the islands of knowledge in your organization, which should be both cultivated and rewarded.
Look for “things you can’t control-F”. Often a series of incidents is caused by something that doesn’t manifest itself as messages in Slack. Do incidents increase around code freeze dates? Is mitigation too dependent on one person who knows a legacy app? Did management decline to pay for a pricey feature-flagging app, but is paying more in lost business or on engineer time to repeatedly mitigate problems? Cross-incident analysis can give you the factual evidence you need to propose changes that will reduce the recurrence or cost of incidents.
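As a rough illustration of the kind of questions above, here is a minimal Python sketch that counts how often each person is pulled into incidents and flags services where only one person has ever responded. Everything in it is an assumption made for the example: the record fields (`id`, `service`, `responders`), the sample data, and the threshold of three incidents are hypothetical, and it does not reflect how Jeli or any other tool stores or analyzes incidents.

```python
from collections import Counter, defaultdict

# Hypothetical incident records. In practice these would come from your own
# incident tracker or an export; the field names here are assumptions.
incidents = [
    {"id": "INC-1", "service": "billing", "responders": ["ana", "raj"]},
    {"id": "INC-2", "service": "legacy-app", "responders": ["ana"]},
    {"id": "INC-3", "service": "legacy-app", "responders": ["ana"]},
    {"id": "INC-4", "service": "search", "responders": ["mei", "raj"]},
]


def responder_load(records):
    """Count how many incidents each person helped mitigate."""
    counts = Counter()
    for record in records:
        # set() guards against the same person being listed twice on one incident.
        counts.update(set(record["responders"]))
    return counts


def single_responder_services(records):
    """Map each service that only one person has ever responded to, to that person."""
    people_by_service = defaultdict(set)
    for record in records:
        people_by_service[record["service"]].update(record["responders"])
    return {
        service: next(iter(people))
        for service, people in people_by_service.items()
        if len(people) == 1
    }


# Arbitrary, illustrative threshold for "drawn into incident after incident".
BUSY_THRESHOLD = 3

load = responder_load(incidents)
frequently_paged = [name for name, n in load.most_common() if n >= BUSY_THRESHOLD]

print("Pulled into many incidents:", frequently_paged)
print("Services relying on one person:", single_responder_services(incidents))
```

Counts like these are only a starting point for conversation. They point at who to talk to and which dependencies to ask about, not at conclusions.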
We want to learn with you
No one person or group has more than a fraction of the knowledge, experience and insights we share as a group, often without knowing it. We’re always eager to connect and learn with others. Don’t keep us waiting — get in touch to stay connected!