Back when Linux was still a fresh face in the data center, I worked as a system administrator in an organization built on big, purple Sun machines certified to run Oracle. In addition to running those Sun machines, it was my job to replace many of them with vastly cheaper Linux machines.
My boss, Howard, was a Sun veteran with deep experience, an impressive resume, and a cubicle stocked with all manner of reference material. In addition to teaching me Solaris, he graciously tolerated my enthusiasm as the leading edge of a wave that replaced a lot of his infrastructure with commodity hardware and open source operating systems. His adaptability was rad enough, but what I carry forward about Howard is his optimism in the face of issues.
“Mr Watson… we have an opportunity.”
More times than I can count, I was summoned to Howard’s desk with this simple phrase. It was usually the beginning of an incident or problem. Some days it was delivered with a lilt. Other days it came with a touch of concern or even a bit of frustration or sarcasm. Regardless, Howard and I sorted through the problem together and brought service back to the medical system we supported. Sometimes he taught me Solaris and other times I taught him Linux. During every problem, Howard taught me patience and empathy.
The idea that an incident is an opportunity to learn or improve isn’t exactly revolutionary. On our best days as operators we say this, often a bit tongue in cheek. In the moment, with the alerts hammering our phones and everyone demanding updates, we don’t have the time to introspect. We rush to reestablish service, relying on our guts and our training.
Years later, as I continue to focus on reliability and resilience, I realize that much of the opportunity in Howard’s call to action comes as the dust settles. What did we learn? What more can we mine from the context of the incident? If we step back after assessing the damage and look at the responders, the conditions, and the way information flows, then an incident can tell us so much about our organization.
The opportunity, however, doesn’t stop with incidents. Modern safety science tells us we also have a lot to learn from the successful things we do! Did you have a smooth launch last week? Let’s learn from it! Ship a migration according to plan? I bet there’s something to learn from that experience, too. The opportunity to learn isn’t limited to when the SLOs are violated or when customers are impacted. If anything, we stand to learn more from what goes well than from what goes badly.
This is a long-winded way of describing why Jeli uses the word “opportunity” in our product. Chaos experiments, smooth deployments, SLO violations, and even huge incidents all hold deep insights into how we function as organizations and how we can improve! Incidents might feel the sharpest, but as your organization flexes its learning muscles, Jeli can be applied to any and all of your operations. We’ll help you figure out which opportunities are worth investing in so that your time and attention yield the best insights. When issues arise, Jeli is your Howard, telling you with tranquility and optimism, “Mr Watson… we have an opportunity.”