Responding to incidents can be very stressful, especially if you ever find yourself in a seemingly never-ending cycle of outages. At Jeli, we have decades of combined experience creating and managing incident programs across large and small organizations and have built our product based on this expertise.
If you’ve ever seen anyone from our team speak at a conference or have chatted with the Jeli team before, you’ve probably heard us talk about Learning From Incidents. In fact, our Founder and CEO, Nora Jones, also founded the Learning From Incidents community.
So we’re really excited about learning from incidents as a modern approach to incident management. But why, and what exactly does that mean? Learning from Incidents in the world of software is a path to building more resilient infrastructure and teams. It’s a path of continuous improvement that makes your product, teams, and technology stronger for your business.
Incidents are often far from over after the impact has been mitigated. Incidents will always occur, and they are only increasing in our modern world of complex software systems. What matters now is how an organization handles incidents across their lifecycle. At Jeli, we have defined the incident lifecycle as: response, analysis, and cross-incident analysis. The analysis and cross-incident analysis stages offer a massive opportunity for learning and continuous improvement of an organization’s sociotechnical systems.
We partner with our customers to understand where they are in their Learning from Incidents journey so that we can help them effect change in their organizations and help their teams evolve towards modern incident management methods. The goal is to allow organizations to support the needs of their businesses as efficiently as possible within the rapidly changing landscape of technology. The process can feel daunting to start, but there are many wins along the way that help build momentum. Improving how an organization handles incidents lets teams focus their efforts on building resiliency into their systems rather than getting caught in a cycle of surface-level repairs.
So, where do you start?
Make response less painful
When you’re drowning in incidents, you probably feel like you don’t have the bandwidth to learn from them. This is normal, and focusing on your incident response process first makes sense. Incidents are all about communication, collaboration, and coordination*.
- Write up your current response workflow (how folks come together to solve the problem) and look for the gaps where it breaks down. What can be done to make those gaps even just a little bit smaller? If it’s your first time doing this, keep it simple. Start by requiring folks to spin up a channel for each incident, and establish who is in charge when that channel is spun up.
- How do you get people in the room (physical or chat based)? How do you know who to call and how do you call them? Is it the same people every time? Examine your on-call rotations and expectations; what little things can make their lives on-call just a little bit easier? Bonus points: when you ask others in your organization these questions, how many different answers do you get?
- How do you communicate to stakeholders what is happening? How do you tell your customers? What is hard about it? What are they asking for? Write up some loose guidelines to help manage expectations for folks both inside and outside the immediate team responsible for responding to incidents.
Once you have written down the answers to these questions and established common ground across folks in your organization, you can get some quick wins even a week into your new process by restructuring on-call rotations and automating some of the response tasks. This lets you spend less time on the coordination tasks that demand your attention during incidents and focus on what you do best: responding and repairing. Improving little things will help you build momentum, start change management, and get support for making larger changes from leadership, the people holding the pagers, and the teams talking to your customers.
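As a rough illustration of the kind of coordination task that can be automated, here is a minimal Python sketch that picks the person in charge from a weekly on-call rotation and gives the incident a predictable channel name. Everything in it is an assumption made for the example: the rotation names, the start date, the `inc-<date>-<slug>` naming scheme, and the `on_call_for` and `open_incident` helpers are hypothetical and are not part of Jeli or any chat tool's API.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone

# Hypothetical weekly rotation; the names are placeholders.
ROTATION = ["alex", "sam", "jordan"]
ROTATION_START = date(2024, 1, 1)  # assumed start of the first one-week shift


def on_call_for(day: date) -> str:
    """Return who is on call on `day`, assuming one-week shifts in a fixed order."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]


@dataclass
class Incident:
    channel: str        # where responders gather
    commander: str      # who is in charge
    opened_at: datetime


def open_incident(slug: str, now: datetime) -> Incident:
    """Record a new incident with a predictable channel name and a named lead."""
    channel = f"inc-{now:%Y%m%d}-{slug}"
    return Incident(channel=channel, commander=on_call_for(now.date()), opened_at=now)


incident = open_incident(
    "checkout-errors", datetime(2024, 1, 17, 9, 30, tzinfo=timezone.utc)
)
print(incident.channel, incident.commander)
```

The point of the sketch is that "which channel?" and "who is in charge?" are answered the same way every time, so nobody has to negotiate them in the middle of an outage.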
Establishing a strong workflow, a sustainable on-call rotation, and communication guidelines that keep folks up to date are the goals here; they’re vital pieces of the response puzzle. The processes will also look different for every organization – that’s expected. Experiment! It’s okay if something doesn’t work. Learn from it and try something new.
Gone are the days of needing to build automation around response yourself. Jeli’s free Incident Response bot for Slack makes this easy out of the box by bringing your people and tech together in one place for faster response and coordinated communication to stakeholders. All of this is “learning from incidents”.
Start learning what happened
Once you have implemented consistent workflows and automation for incident response with your team, you now have easily accessible artifacts to begin learning from your incidents. Now you can begin to dig into the underlying factors impacting an incident. This is the beginning of unlocking the full potential of learning from incidents to drive resiliency across your teams.
Begin creating reviews after incidents to bring all the pieces together. Incident retrospectives should be collaborative and a safe space, so that you can understand what truly happened during an incident and identify improvements to your systems. Building incident retrospectives doesn’t have to be time consuming. Jeli’s Narrative Builder has helped customers cut the time they spend building timelines from somewhere between an hour and a day to just 15 minutes to an hour. Narrative Builder helps teams quickly and efficiently tell the story of an incident and capture learnings and action items.
Cycle of improvement
Cross-incident analysis is the most advanced stage of learning from incidents, and it takes teams time to reach that part of the journey. Even setbacks are progress: they mean you’re learning what is and isn’t working for your team. Magic will start to happen when you scale the process and are able to look at macro trends and patterns in incidents across your organization.
- The more retrospectives you have, the more data you can then use to make recommendations for larger scale changes, such as rearchitecting your CI/CD pipeline. This is done in collaboration with leadership and multiple engineering and product teams.
- You now have a breadth of data with context to provide to leadership for future-focused decisions.
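To make "macro trends and patterns" a little more concrete, here is a small Python sketch that counts how many retrospectives mention each theme and keeps the ones that recur. The records, the theme labels, and the `recurring_themes` helper are all invented for illustration; real retrospectives carry far more context than a list of tags, and this is not how Jeli performs cross-incident analysis.

```python
from collections import Counter

# Made-up retrospective records, for illustration only.
retrospectives = [
    {"id": "INC-1", "themes": ["deploy pipeline", "unclear ownership"]},
    {"id": "INC-2", "themes": ["deploy pipeline", "alert fatigue"]},
    {"id": "INC-3", "themes": ["deploy pipeline", "unclear ownership"]},
]


def recurring_themes(retros, min_incidents=2):
    """Return (theme, incident_count) pairs seen in at least `min_incidents` retros."""
    # set() ensures a theme is counted at most once per incident.
    counts = Counter(theme for retro in retros for theme in set(retro["themes"]))
    return [(theme, n) for theme, n in counts.most_common() if n >= min_incidents]


print(recurring_themes(retrospectives))
```

A theme that keeps showing up across incidents, such as the deploy pipeline here, is the kind of signal that can support a conversation with leadership about a larger change.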
Remember: continuous improvement also requires regularly reflecting on what is and isn’t working in your current incident management process. Assess that process against your current business goals and organization, and make any needed adjustments.
At Jeli, we want every person, team, and organization to view incidents as opportunities to continuously learn & improve. Our goal is to meet people where they are in their incident response and analysis journey. You can get started with a free trial of Jeli today to respond to, manage, and analyze incidents in order to build more resilient infrastructure and teams.
Have we sparked your interest about Learning from Incidents? Here are some additional resources and talks from friends of Jeli and experts in the field: