In this post we’ll take a look at a few tactics and tools to help streamline incident response when all the responders are working remotely.
Responding to incidents and running analysis in various forms over the last decade, I’ve logged countless hours in incidents with globally distributed teams as both a responder and full-time incident commander. It used to be that “fully remote incident response” was a fun friend I typically only hung out with on nights, weekends, or special occasions. Until March of 2020, when that fun friend unexpectedly moved in and never left. I learned a lot about fully remote incident response over the last few years as we became virtually inseparable.
While there is definitely a difference in how I approach remote-first incident response, the reasoning behind that approach remains the same.
The most frequently discussed “area of improvement” during incident response is speed, the Mean Time To… whatever. Most organizations focus on metrics that look at getting faster at recognizing incidents, faster at identifying impact, faster at assembling responders, and faster at mitigating the issue. I understand the motivation here: less time spent in incidents means less impact to customers and more engineering time spent on feature development.
Several organizations I’ve worked with thought “automating as much of response as possible” was the answer to reducing their “Mean Time To Whatever.”
There are great areas in response where automating things can reduce the steps required of responders. And there are great areas in response where automating things can become excruciatingly annoying in the heat of an incident, demand context switching in the midst of troubleshooting, or even get lost in a flood of alerts and notifications in an incident channel.
Much like the technical breakdowns that lead to incidents, the speed at which an incident is worked on depends on multiple contributing factors. These are unique to every organization, and often to each incident, but they frequently come down to communication, collaboration, and coordination. Our ability to understand each other and the problem at hand leads to solving that problem together, while also making sure our stakeholders outside of immediate response have the information they need to do their jobs.
Let’s take a look at tactics we can use to make communication, collaboration, and coordination easier, and maybe even a little faster, when our response teams aren’t all in the same room.
Communicating Remotely: Get on a call
Suddenly throwing a group of people together who may or may not know each other well, and telling them to solve an urgent problem that’s not yet fully understood, is already a difficult thing to navigate. Removing their ability to pick up on body language, facial expressions, the nuance of intonation, and other non-verbal cues makes it far more difficult.
It’s common to see advice to “keep incident conversations in messaging apps,” maybe even in a specific channel or thread, because it creates an artifact of response and makes analysis easier. But during the chaotic haze of trying to figure out what the hell is going on, incident analysis is a concern for a future version of you. We’ll talk about a few things to make analysis easier later, but the highest priority during that hazy early phase of response is getting on the same page about what is happening.
I have joined dozens of incidents in progress where troubleshooting was getting contentious or stagnating. Getting folks into a call to talk out what they know, what they need, and what they are trying to do has been the fastest way to get everyone on the same page.
The act of gathering together takes incidents from feeling like an isolating predicament you have to figure out on your own to a team troubleshooting effort. Calls, whether they’re on Zoom, Google Meet, Slack Huddle, or even an old-fashioned conference call, are our best approximation of that “war room” experience. Incidents are stressful environments, but they are also a team sport. Being able to talk things out with your teammates as you work arm in arm on the same problem eases some of the isolation and anxiety that can creep in during an incident.
This is why one of the first integrations we built in the Jeli incident response bot was to spin up and link a Zoom meeting. You can set it up to have a Zoom link created when kicking off an incident, or use “/jeli zoom” to add a call during response. We’ll link it in the channel description and in the overview we send to designated channels, so that it’s easy for everyone to find. (We’ve also got a Google Meet integration coming soon to Jeli’s incident response bot!)
Getting on a call does not mean you have to stay in that call for the entirety of the incident. Maybe a smaller group of folks wants to talk through a schema change. Take care to be explicit about when it’s okay for other folks to drop off and continue to follow along in the response channel. This is where expectation setting comes in.
Collaborating Remotely: Manage Expectations
If you’ve piled your responders into a call to get on the same page, your Slack channel might fall silent. This makes it harder for folks looking for answers to catch up on what’s happened so far.
When you hop into a call, set some expectations with the group about how, where, when, and with whom you want to communicate outside the call. Jot down highlights from the call in the Slack channel as they happen. You can designate someone to do this, agree that you’ll each be responsible for updating about your own actions, or crowdsource so that whoever is not actively working on something updates the channel.
This is where “making analysis easier later” and “keeping communication flowing” come together. These don’t have to be formal updates. They can be as simple as a one-line note on what you just tried, what you saw, or what you’re doing next.
As you send messages to the response channel, these updates build context as they string together, making it easier for folks who are unable to hop into a call to know what’s going on.
If one message is particularly helpful, with Jeli’s incident response bot you can react to a message with the 📣 :mega:, 📌 :pushpin:, or 📫 :mailbox: emoji to turn any message in the incident channel into a status update that’s sent to the broadcast channels.
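To make the reaction-to-broadcast idea concrete, here is a minimal sketch of how such a flow could be modeled. It is not Jeli’s implementation: only the three emoji names come from the text above, while `Reaction`, `to_status_update`, `broadcast`, and the `send` callback are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Emoji names are from the post; everything else in this sketch is hypothetical.
BROADCAST_EMOJI = {"mega", "pushpin", "mailbox"}


@dataclass
class Reaction:
    emoji: str         # emoji name without colons, e.g. "mega"
    message_text: str  # text of the message that was reacted to
    author: str        # who wrote the original message


def to_status_update(reaction: Reaction) -> Optional[str]:
    """Return a status update if the emoji is a broadcast trigger, else None."""
    if reaction.emoji not in BROADCAST_EMOJI:
        return None
    return f"Status update from {reaction.author}: {reaction.message_text}"


def broadcast(reaction: Reaction, channels: List[str],
              send: Callable[[str, str], None]) -> int:
    """Send the update to every broadcast channel; return how many were sent."""
    update = to_status_update(reaction)
    if update is None:
        return 0  # ordinary reactions are ignored
    for channel in channels:
        send(channel, update)
    return len(channels)
```

Passing `send` in as a callback keeps the sketch independent of any particular chat client, so the same logic could be exercised with a plain list in a test or wired to a real messaging API.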
Updating folks with technical progress is important, but a large, often overlooked, part of incident response is the side conversations that take place in the background while technical troubleshooting takes center stage:
Where are we on solving the problem?
Do we fully understand the scope of impact yet?
What should/can we tell customers?
When can we expect more information?
Do these types of failures in this system have downstream impacts to billing, legal, or security?
Are we violating any contractual SLAs?
Is this at the level that someone should email the executive team?
Just because these questions aren’t technical and won’t “solve” the incident doesn’t make them any less important a part of incident response. It’s vital to set expectations with stakeholders (customer facing teams, legal, security, other teams with technical dependencies, and potentially executives) about when those questions will get answers. That lets them continue to do their jobs, and it lets the responders focus on the problem at hand.
Jeli’s incident response bot gives you the ability to communicate the stage of the incident. Currently we have 3 stages that broadly cover all of response: Investigating, Identified, and Mitigated (you don’t have to use all of them and they don’t have to happen in any designated order). When you use “/jeli stage” to change stages, you can provide context on how the incident has progressed and any new information learned.
Changing the stage of response also gives you the opportunity to reconfigure which Slack channels the updates are broadcast to. This provides a signal to stakeholders that new answers to their questions will be sent to them with an update on how the journey to resolution is progressing. Plus, if you react with an emoji to an update from Jeli, since the subsequent updates will be threaded under it, you’ll get a threads notification in Slack to help folks who aren’t actively following along with response know when there is new information!
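As a rough illustration of the stage model described above, the sketch below tracks an incident’s stage, records the context given with each change, and optionally swaps the broadcast channels at the same time. The three stage names come from the text; the `Incident` class, its fields, the default starting stage, and the message format are assumptions made for this example, not Jeli’s API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Stage(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MITIGATED = "Mitigated"


@dataclass
class Incident:
    name: str
    stage: Stage = Stage.INVESTIGATING  # assumed default, not from the post
    broadcast_channels: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)

    def change_stage(self, new_stage: Stage, context: str,
                     channels: Optional[List[str]] = None) -> List[Tuple[str, str]]:
        """Move to any stage and return the (channel, update) pairs to send."""
        # No ordering is enforced: stages can be skipped or revisited.
        if channels is not None:
            self.broadcast_channels = list(channels)
        self.stage = new_stage
        update = f"[{self.name}] {new_stage.value}: {context}"
        self.history.append(update)  # doubles as a timeline for later analysis
        return [(channel, update) for channel in self.broadcast_channels]
```

For example, `Incident("checkout-errors").change_stage(Stage.MITIGATED, "Rolled back", channels=["#support"])` jumps straight to Mitigated and returns one update addressed to `#support`. Keeping every update in `history` is one way the same messages could serve both stakeholders during response and analysis afterwards.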
Coordinating Remotely: Maximizing Value After The Response
Incidents have a habit of starting with a bang and ending with a fizzle and a dubious amount of smoke, where you’re not really sure if something is still burning. It can be hard to fully understand when you’re once again free to move about the cabin. Be clear and deliberate about when active incident response has ended, so that responders and stakeholders understand when they’re cleared to disengage. Just like “declaring” an incident sets things in motion, “resolving” an incident is an important boundary.
You may want to enter into a final “hot debrief” call, where you gather responders and stakeholders to let them know: what’s understood about the incident, what was done, what still needs to be understood, and what still needs to be done. It’s also a great opportunity to discuss how response went while it’s still fresh in everyone’s memory. It doesn’t need to be a full review, but if there were things that were hard, unexpected, or awkward, jot them in a thread to be revisited during analysis.
You can also use the opportunity to establish any immediate follow-ups. With Jeli’s incident response bot you can use “/jeli jira” to create tickets right from the channel. Or if your brain is too used up to think, use “/jeli remind” to create follow-ups that Jeli will DM you about at a later time of your choosing.
Finally, the actual way that we find opportunities to make remote response easier is by making response a main character in our incident analysis afterwards. Investigating an incident to understand what happened should not be limited to the technical components involved. The circumstances around response are important contributing factors in how the incident unfolded.
Talking about your team's experiences with your response processes is how you find where you have gaps, too many steps, or unclear expectations. And make sure to talk to your stakeholders too!
Get started with our IR Bot for Slack today. It’s completely free! Already using our Bot for IR? Head over to Jeli for a 2-week free trial of our analysis tool that will help you pull multiple sources of data together so you can find common themes, create timelines, and take advantage of every incident’s opportunity for your organization to learn!