Continuous Learning From Incidents Through Insights and Collaboration

Published on
July 5, 2023
Will Gallego
Software Engineer

People are learning at your organization, continuously and repeatedly. No successful complex sociotechnical system can exist without learning, incorporating novel ideas, and discovering new ways in which our systems interact and react. We’re often not aware when it’s happening because it’s so thoroughly incorporated into our day-to-day work.

I’m deeply appreciative that folks come to topics around learning from incidents from different vantage points. Some of us are ops-minded folks whose careers are marked by periods where it felt like all we did was put out other people’s fires. We have SREs who recognize the gaps in our collective knowledge and want to make sure folks are seeing the in-betweens in our complex systems. Infra engineers look to build stable foundations for the rest of their team. We’re even beginning to see frontend engineers, designers, and project managers who understand their impact on the lifecycle of products and how it affects incidents.

This all begins when we recognize that failures are opportunities to learn. The problem is that it’s hard.

Fixes are fantastic, runbooks are useful, and action items naturally arise from our retrospective reviews. But the impact of learning from incidents over the last few years has highlighted that they’re not enough. We’ve remapped our worldview because we as a community have better understood this fact. It’s effortful to change one’s worldview, doubly so to change someone else’s. But we all recognize it’s critical to the fluid nature of our work.

We emphasize adaptability because the alternative is rigid and brittle. You can spend countless hours crafting the perfect system and still be surprised when it fails. It’s why this idea clicked for the aviation industry, for nuclear safety and medical response teams, and in the last decade or so, it’s finally taking shape for us in software engineering: failure will happen, and we can better understand it by studying the work in which it happens.

Learning is the first step towards integrating adaptive capacity into your systems. You have to learn how your teams and your systems respond to stress. It’s why we don’t say “I’ve seen seven years of incidents” but instead “Well, let me tell you about one time when…”.

We go to conferences to learn

We have 1:1s where we learn

We read up on blog posts and books to learn

All of these are stories. When we improve in our orgs, we don’t say “Remember: This Consul config needs to be set to 360”. It’s “Remember the time Consul started to demand more resources during that sales push? We had to bump the config here to 360, because below that number we can’t keep a quorum quickly enough”. Our improvements are intrinsically tied to our experiences and our learning.

If all of this is ever-present and happens without conscious thought, why bother placing so much emphasis on learning? We’re going to adapt anyway, so why not just fix stuff and move on? It’s a lot faster, after all.

Simple answers aren’t enough. Our runbooks are useful, but they’re not enough. The dashboards we put in place and the action items we generate help, but they don’t connect all the dots. The metrics we surface to illustrate a point are two-dimensional when we need to see the bigger picture. They’re data, and we need insights.

We never know what’s next, looming over the horizon. Learning is the only way to be prepared for whatever is to come.