This is Vanessa, one of the Solutions Engineers here at Jeli. This past winter, we had the chance to work closely with Unity in a design partnership. We got to collaborate hand-in-hand with Unity’s Site Reliability Engineering (SRE) team as they evolved their post-incident process into one that takes a Learning From Incidents approach and incorporates Jeli. Working with them was a blast! The folx at Unity were open and honest with us about what they didn’t like about their current process, as well as what they were looking for. We also learned a lot from them and were able to take their feedback and put it back into our product!
By the end of the partnership, Unity’s engineers had adopted a more joyous and empathetic process, which is what Jeli is all about! The same process also allowed everyone involved (read: not just site reliability engineers) to understand their incidents from a holistic point of view. As someone who has done countless incident reviews, I know that getting folx across different parts of an org to care about incidents is tough work, but it pays off in infinite ways!
Meet our partners, Unity.
Unity is the world’s leading platform for creating and operating interactive, real-time 3D content. Its platform provides a comprehensive set of software solutions to create and operate interactive, real-time 2D and 3D content for mobile phones, tablets, PCs, consoles, and augmented and virtual reality devices to reach, engage, and impact their audiences.
proactive > reactive
resilience = overall effectiveness
Unity has been rapidly expanding over the last few years, even during the pandemic. The Site Reliability Engineering (SRE) team needed to make sure their internal processes were keeping up—with both their growing size and their burgeoning culture.
They knew that proactive is greater than reactive and that incident resilience equals overall effectiveness.
That’s where Jeli fit into the equation.
Unity, by the numbers
- Founded in 2004, Unity has grown significantly. The company now boasts over 5,000 workers worldwide
- In the first quarter of 2021, Unity saw an average of more than 5 billion downloads per month of applications built with Unity
- As of the end of Q1, Unity had approximately 2.7 billion monthly active end-users who consumed content created or operated with its solutions
- 94 of the top 100 game development studios by global revenue are Unity customers
- As of the end of Q1, 10 of the top 10 auto manufacturers by global sales were using Unity
A Taxing Process
At Unity, the post-incident story was one we’ve heard before: an incident began in the “open” state, where it was actively investigated and mitigated. It then moved to “resolve,” and then from “resolve” to “close.” However, before an incident could be moved from “resolve” to “close,” a root cause analysis (RCA) document was required to be filled out. Which means…well, it wasn’t always filled out.
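To make that workflow concrete, here is a minimal sketch of the lifecycle described above, written as a tiny state machine. It is only an illustration: the `Incident` class, the `rca_document` field, and the `advance` method are hypothetical names, not Unity's or Jeli's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "open"
    RESOLVE = "resolve"
    CLOSE = "close"


# Forward transitions as described in the text: open -> resolve -> close.
NEXT_STATE = {State.OPEN: State.RESOLVE, State.RESOLVE: State.CLOSE}


@dataclass
class Incident:
    title: str
    state: State = State.OPEN
    # Hypothetical field standing in for the required RCA document.
    rca_document: Optional[str] = None

    def advance(self) -> State:
        """Move to the next state, enforcing the RCA gate before closing."""
        if self.state not in NEXT_STATE:
            raise ValueError("incident is already closed")
        target = NEXT_STATE[self.state]
        if target is State.CLOSE and not self.rca_document:
            raise ValueError("an RCA document is required before closing")
        self.state = target
        return self.state
```

An incident with no `rca_document` can move from "open" to "resolve" but then stays stuck there until the document is filled in, which is the bottleneck this section describes.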
Since post-incident activities are inherently different from regular software development work, Unity found that they had to continuously level up their post-incident investigation skills. At first, incidents would sit in resolved for weeks, sometimes months, before someone got around to filling out that heavyweight paperwork. The SRE team found themselves inadvertently in the business of badgering incident owners to finish their RCAs.
The process problem didn’t end there. Even after an RCA was finally completed, the team found it was hard to capture the full context of an incident in a written document. As a result, an RCA usually went unread, unnoticed, and undervalued. The whole painful process was rendered somewhat useless by what turned out to be too much ceremony. The pain wasn’t worth the gain. So, Unity set about removing the excess ceremony from its process.
Don’t get us wrong: sometimes, in-depth post-mortem reviews did happen. But when engineers sat down to do those reviews, they weren’t quite sure where to start. They sifted through the fast-paced, hectic incident channel in Slack and attempted to create an accurate and cohesive picture of what happened. Discovery was difficult, and collating facts across incidents was just as challenging, because a written document rarely captured the full context of even a single incident.
Unity quickly realized the level of overhead was getting out of hand. Time was spent, effort was given, and unfortunately, they were learning more about how to optimize an incident process than they were learning about the incidents themselves.
Because Unity is a company dedicated to continuous learning and improvement, its leadership was positioned to turn this unfortunate process around:
1. They had the right tools in place, so they had visibility into the work that was being done as well as what engineers were getting out of it.
2. They were not only willing to improve things, they strove for continuous improvement.
In doing these 2 things, Unity was living up to its stated values of going bold with solutions and letting the best ideas win. By boldly and rapidly iterating policy and tools, the team was homing in on the best solution for supporting incident investigations across a globally distributed team of more than five thousand workers.
Unity’s incident process simply wasn’t working for them.
It was heavyweight, unhelpful, bureaucratic, and daunting to everyone involved. Like many engineering teams, they were merely recording incidents instead of doing what everyone really wanted: learning from them. Improving their reliability. And strengthening their resilience.
Unity needed a new framework and process for incident analysis that matched the expertise and forward-thinking nature of their team.
A simpler, more effective process.
A significant weight has been lifted off the shoulders of Unity’s engineers. Gone are the days when incidents would idle, daunting paperwork was put off for weeks or even months, and important details were forgotten. Teams now promptly close incidents and move right to post-mortem reviews because Jeli tells them exactly where to start, and where to go from there.
Unity engineers simply use Jeli to ingest the incident (through Slack, PagerDuty, and Workday) and create a detailed, user-friendly timeline that guides their post-mortem meetings. All of the boilerplate and paperwork that used to go into making an incident timeline is now automated. The timeline frames post-incident conversations in context and highlights exactly what happened, driving productive discussion and generating more learning opportunities.
Unity’s incidents now move from “resolved,” to fully analyzed, and finally to “closed” at a much faster pace. This establishes a smoother, more efficient, and more effective process. Incidents are actually being analyzed, lessons are being learned and as a result, the Unity engineering team is on its way to even stronger resilience and a confident future.
With Jeli, Unity delivered on their learning process
[Chart: Number of incidents resolved for over 10 days without a post-mortem]
Incident analysis that feels right.
(Maybe even fun!)
Implementing Jeli inspired another, less technical benefit for the Unity SRE team: a cultural shift. Development teams are not only owning their post-mortems and taking the time to execute them properly, they’re also getting excited about it. They’re enjoying the act of putting the pieces together and creating a cohesive picture that everyone can learn from. The SRE team no longer has to facilitate every post-mortem or remind people to get them done.
More collaboration uncovers more insights.
When post-mortems became more enjoyable and fulfilling with Jeli, more people got involved. This launched a wave of interdepartmental engagement and important discussions across teams that normally would not have collaborated. With the tools and process powered by Jeli, more dots were connected and more insights uncovered.
Unity hopes to use Jeli to continue finding these new patterns and using them to learn even more about their resilience.
Backed by Jeli, Unity is making moves.
Upon evaluating Jeli, Unity realized there was a great opportunity to overhaul and evolve their whole incident analysis process. It was time to reframe how they looked at the outcome of an incident. The RCA doc evolved into a post-mortem doc. The team began to shift to a culture of curiosity and understanding. The focus shifted away from strict numbers and outcomes and toward continuous learning and adapting. The desire to change was always there; all Unity needed was the right partner, which they found in Jeli.
- Dramatically decreased the mean time to close an incident
- Evolved time-consuming, heavyweight post-incident process into a lightweight, quick process—saving hundreds of hours across engineering
- Guides post-incident discussions with an incident timeline that facilitates productive conversations, uncovers learning opportunities, and creates a culture of constant learning
- Prevents SRE burnout and aids in retention