How We Got Here: Incident Report September 2021—Slow / Failed Loading Investigations

Vanessa Huerta Granda Dec 15th, 2021

Executive Summary

On September 28th, 2021 at 1:10pm PT one of our users at Xero reported trouble loading incident reviews within Jeli. They reported this in a shared channel between our champions at Xero and the user-facing folks at Jeli. This was confirmed by Jeli within minutes and investigated. The triggering event was debugged and isolated to a Frontend code change that was deployed a few hours prior, although there were a number of hypotheses flying around that led us to attempt to confirm other potential triggers.

This code change, shipped about 3-4 hours before, was believed to affect only features behind a feature flag. Because some queries were understood differently than their original author had intended, the change inadvertently impacted our entire production environment. The change was quickly reverted and the loading problems were resolved. The total time elapsed was 38 minutes from Xero’s flagging of the incident to Jeli’s all-clear back to Xero.

View of “narrative” timeline in Jeli showing key moments of the incident.
How We Got Here
Investigator: Cory (Jeli)
Incident Responders: Chris (Xero), Danesh (Xero), Vanessa (Jeli, Solution Eng), Fischer (Jeli, Backend Eng), Adam (Jeli, Frontend Eng) et al.
Interviews: Danesh, Vanessa, Fischer, Abdul, Daniela, Adam, Chris, Nora, Jared
Background

Xero has been a Design Partner for Jeli, and the two teams communicate closely on a daily basis. There is a large time difference between our users at Xero (New Zealand Time Zone), the Software Engineers at Jeli (mostly Pacific Time Zone), and the Solutions Engineers at Jeli (the most user-facing employees, Central Time Zone). This incident started at around 1pm PT, or around 9am NZ, right as these customers were starting their day and ready to investigate incident opportunities.

Prior to this incident we had recently switched our Frontend to Vercel (1 month before this incident). Previously, the Jeli Frontend was running on AWS EC2 and deployed with our in-house deployment tools. 

When you load an investigation in Jeli, the Frontend first runs a query to get some preliminary information and check whether that investigation has been fully ingested. Some of the failure modes we have seen include: the request is slow or times out, a bug in GraphQL leads to a failed request, or a bug in the Frontend leads to a JavaScript crash.

At Jeli, engineering is split up into a Backend and a Frontend team; our on-call schedule only pages Backend engineers since they are in charge of our infrastructure and, up until the move to Vercel, they were best positioned to quickly diagnose issues and rule out anything other than Frontend code.

We also use a number of observability tools, such as Honeycomb for errors and metrics, LaunchDarkly for feature flagging, and are in the beginning stages of handling incidents through our own Incident Bot. 

Narrative

Event & Impact

At the start of his day, one of our champions at Xero was doing some routine preparation for an incident review later that day when he noticed individual investigations were taking a while to load. He immediately posted this issue in our shared channel which was noticed by our Solutions Engineer within minutes. 

There was a slight delay in understanding the impact of this event, along with ambiguity amongst responders about its urgency and about how Solutions Engineers should start an incident. This ambiguity is largely because, well, we haven’t yet had a lot of customer-facing incidents, and we are still understanding and honing what that means.

Trigger

The (Backend) engineer on-call quickly took the incident and started the troubleshooting process, first looking for any slow requests to individual opportunity pages. A couple of other engineers were brought into the incident because the event began as they were finishing their regularly scheduled co-working hour. One of these engineers (from the Frontend team) found a JavaScript error in the developer console and believed it to be related to a recent change.

Error found by Jeli Frontend engineer.

Engineers began stepping back through deployments; for Backend, they looked back through the production tags in GitHub and for Frontend, engineers looked at the Vercel UI. Once they reached a place where investigations behaved the desired way, engineers rolled back our production environment to that point. 
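The walk-back described above can be sketched in TypeScript as a search for the most recent deployment that passes a health check. This is only an illustration: the `Deployment` shape, the `findLastGoodDeployment` name, and the `isHealthy` callback are hypothetical, and in the incident this check was done by hand rather than by a script.

```typescript
// Hypothetical deployment record; `id` could be a Git tag or a build ID.
interface Deployment {
  id: string;
  deployedAt: Date;
}

// Walk deployments from newest to oldest and return the most recent one
// for which the health check passes, or null if none does.
function findLastGoodDeployment(
  deployments: Deployment[],
  isHealthy: (d: Deployment) => boolean,
): Deployment | null {
  // Copy before sorting so the caller's array is left untouched.
  const newestFirst = [...deployments].sort(
    (a, b) => b.deployedAt.getTime() - a.deployedAt.getTime(),
  );
  for (const deployment of newestFirst) {
    if (isHealthy(deployment)) return deployment;
  }
  return null;
}
```

In practice `isHealthy` would stand for "an investigation loads as desired on this build", which is the manual check the responders performed.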

Engineering had been working on some early prototypes for a new feature so they were changing a number of queries in the Jeli API behind feature flags. This specific Frontend change from a few hours earlier ended up altering a query outside of the feature flag that gets loaded every time a user attempts to load an investigation. 

The flag was expected to be on for internal Jeli users only, but we were accidentally using the changed query for all users. This led to a JavaScript exception when the Frontend tried to access a field that didn’t exist in the production GraphQL response object, which in turn caused investigations to fail to load.
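A minimal TypeScript sketch of this failure mode follows. The type and field names (`InvestigationBase`, `insightsOverview`, `highlightCount`) are hypothetical and not taken from Jeli's code base; the sketch only shows how reading a flag-gated field from a response that lacks it throws at runtime, and how treating the field as optional avoids the crash.

```typescript
// Hypothetical shape of the base investigation response. The flag-gated
// field is optional because production responses may omit it.
interface InvestigationBase {
  id: string;
  ingested: boolean;
  insightsOverview?: { highlightCount: number };
}

// Unsafe: assumes the flag-gated field is always present.
function highlightCountUnsafe(inv: InvestigationBase): number {
  // The non-null assertion silences the compiler but not the runtime.
  return inv.insightsOverview!.highlightCount;
}

// Safer: treats a missing field as "feature not enabled for this user".
function highlightCountSafe(inv: InvestigationBase): number | null {
  return inv.insightsOverview?.highlightCount ?? null;
}

// Response as a user outside the flag would receive it (field absent).
const productionResponse: InvestigationBase = { id: "inv-1", ingested: true };
```

Calling `highlightCountUnsafe(productionResponse)` throws a `TypeError`, while `highlightCountSafe(productionResponse)` returns `null` and lets the page render without the new feature.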

Engineering reset the Backend feature flag changes they had made to ensure the Frontend change was truly the trigger and we confirmed with the user that they were able to load individual opportunities without page errors.

Contributors / Enablers

The following are enablers that were “latent” in the system that led to how this incident was able to manifest:

  • Poorly-named queries – The query that we believed was only behind the feature flag was named “use-investigation-insights-overview” while the query that was actually changed was named “use-get-investigation-base”. Frontend had difficulty seeing the difference while writing the changes in part due to the fast-paced nature of the work. We use a naming convention for our queries but we realize now it may be too simple. The “use” part is generated by our GraphQL code while in our code base we try to use the convention of “verbNoun” (such as “getInvestigation” or “getAllInvestigations”). This can get confusing when engineering tries to reuse already existing query fragments, particularly when experimenting and building new features that need parts of some base data along with extra data.
  • Testing expectation – The new feature the Frontend team was working on did not have smoke tests; due to the intentionally rapid nature of the feature development process so that we can learn about our use cases as we ship features, the team usually puts things behind a feature flag and releases them to the internal Jeli web application even if buggy, so that user-facing Jeli beans have the opportunity to play with, familiarize themselves with the feature, and identify bugs.
  • Identifying trigger – Our proxy for ingesting Frontend events into Honeycomb had been broken for over a month. This was due to an expected field not being set by our Frontend code for some requests, resulting in events being dropped rather than forwarded to Honeycomb. Engineering knew of this but fixing it kept getting deprioritized; it was fixed soon after the incident took place. The JavaScript error in the console was also ambiguous: it could have been caused by a Frontend issue (which was true) or by GraphQL not responding in the expected manner (which was not true in this case).
  • Red herrings – Earlier that day we had made an update to some people-related data (roles, titles, etc) specifically for Xero, and we had also recently switched to reading data from MySQL rather than ElasticSearch.

In this case, the “Slack data from MySQL” flag has three variations:

  • Off – don’t connect to MySQL at all
  • Write-Only – write data to MySQL during ingestions, but don’t read it
  • Read-Write – write data to MySQL and also read from it in GraphQL queries

During the incident, the flag was initially in Read-Write mode, and we switched it to Write-Only while troubleshooting. This, we now know, turned out to be a red herring.
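The three variations can be modeled as a small TypeScript union, which makes it explicit what each mode permits. The names below (`MySqlFlag`, `shouldWrite`, `shouldRead`, `readSource`) are assumptions for illustration and are not LaunchDarkly's API or Jeli's code; the fallback to ElasticSearch for reads is likewise assumed from the migration described above.

```typescript
// The three variations described above, modeled as a string union.
type MySqlFlag = "off" | "write-only" | "read-write";

// Whether ingestion should write to MySQL under a given variation.
function shouldWrite(flag: MySqlFlag): boolean {
  return flag === "write-only" || flag === "read-write";
}

// Whether GraphQL queries should read from MySQL under a given variation.
function shouldRead(flag: MySqlFlag): boolean {
  return flag === "read-write";
}

// Hypothetical helper: which store serves reads for a given variation.
// Assumes ElasticSearch remains the read path whenever MySQL reads are off.
function readSource(flag: MySqlFlag): "mysql" | "elasticsearch" {
  return shouldRead(flag) ? "mysql" : "elasticsearch";
}
```

Under this model, the troubleshooting step of moving from Read-Write to Write-Only changes only where reads come from, while writes continue unchanged.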

View of “red herrings” filter set in Jeli showing the times in the incident transcript where we were working or discussing something that in hindsight was a red herring. We can also see this filter set in a timeline view.
Risks

The following risks were revealed through 1:1 interviews with the involved parties. None of these were particularly severe during the incident, but they reveal some surprises that may cause problems in future incidents:

  • Training – We used a number of systems to troubleshoot (e.g. LaunchDarkly, Honeycomb, Jeli Incident Bot, internal process, etc); there were varying degrees of expertise within the different teams involved in the incident depending on when they joined the organization. Not all of the initial responders had access to these systems or the knowledge to use them, although in the end we were able to get a hold of people with the background expertise, who debugged live while sharing their screen in front of their colleagues.
Mitigators

The offending commit was identified when Adam from Frontend and Fischer from Backend did a manual check of code revisions until Frontend isolated a point wherein the problem did not occur. Adam then reverted the production code to this point. A number of mitigators made the quick resolution possible.

  • Ability to quickly revert – We had recently switched our Frontend to Vercel, which made rolling back an almost immediate process because it keeps old production builds available. Prior to this, when Frontend wanted to deploy or roll back, the process would take ~5 minutes.
  • Positive relationship with our customer – As one of our earliest partners, Xero knew to contact us as soon as possible and was more than willing to provide us with screenshots of the issue. The timing of the incident also helped in that a large part of Xero’s working hours usually fall outside of Jeli’s usual working hours; luckily, this incident took place during one of our overlapping windows and engineers were all online and able to work together without worry of paging someone off-hours.
Difficulties During Handling

Incident Response process – We do not have a specific process for starting an incident from the user point of view.

  • Knowing if an incident qualifies for engaging our machinery, and then how to invoke it, takes work; in this case, this led to a slight delay. Capturing the user’s perspective, representing their needs, and communicating back to them may require an assigned person; this is currently not part of our incident process.
Key Takeaways/ Themes

Above all, this incident went smoothly; everyone was in the right place to promptly and effectively respond and communicate. The problem was resolved quickly, responders learned more about their systems and how to collaborate with each other, and the impacted users were notified and satisfied with the remediation and communication updates.

  • The fast-paced nature of the work the Frontend team does means they depend on feature flags for a lot of their work and assumptions end up being made around what is behind the feature flag and what is user-impacting.
  • Decisions to prioritize diagnostic/observability data can have high future-cost.
    • The missing Frontend events in Honeycomb were known for a period of time but not prioritized until it led to an incident.
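As an illustration of the dropped-events problem, the TypeScript sketch below shows one way a forwarding proxy could avoid silent drops: fill a placeholder when an expected field is missing and count how often that happens. The event shape, the `sessionId` field, and the `prepareForForwarding` function are hypothetical stand-ins; the report does not name the missing field or describe how the proxy was actually fixed.

```typescript
// Hypothetical event shape; `sessionId` stands in for the unnamed field
// that the proxy expected and the Frontend did not always set.
interface FrontendEvent {
  name: string;
  sessionId?: string;
}

interface ForwardResult {
  forwarded: Array<FrontendEvent & { sessionId: string }>;
  missingFieldCount: number;
}

// Keep every event, filling a placeholder when the field is absent and
// counting how often that happens, instead of silently dropping the event.
function prepareForForwarding(events: FrontendEvent[]): ForwardResult {
  let missingFieldCount = 0;
  const forwarded = events.map((event) => {
    if (event.sessionId === undefined) {
      missingFieldCount += 1;
      return { ...event, sessionId: "unknown" };
    }
    return { ...event, sessionId: event.sessionId };
  });
  return { forwarded, missingFieldCount };
}
```

Surfacing `missingFieldCount` as its own metric would make a gap like this visible even when nobody is looking for it.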

Communication

  • Some details of the failure (Were we looking into slowness? Failing to load? Opportunity listing? Review? Chrome loading?) were lost in communication between users, Solutions Engineers, and Frontend/Backend Engineers.
  • In-“team” communication was solid. Communication between teams (Solutions to Engineering, Backend Engineering to Frontend Engineering) had difficult or unclear escalation paths.

Hypothesis

  • Hypotheses — based on recent changes — were evaluated and disproven within minutes. This also aided the front-end team in narrowing their focus to their own code.
  • The hypotheses tested (i.e. updated people data load, MySQL read changes) benefited from earlier information sharing, but also required iteration with feature flags, etc.

Many things worked well:

  • Communication about recent changes (MySQL data load) happened through multiple channels including weekly Engineering Review meetings and team Slack channels. This meant responders had familiarity with them and were able to build hypotheses.
  • Incident Bot status updates allowed others to follow along without disrupting the core incident channel.
  • Front-end tooling (Vercel) meant quick validation and resolution once isolated.
  • Timing was convenient: Jeli’s staff was able to focus due to being just back from lunch, between meetings, or otherwise without distraction. Xero benefited from it being early in the work day.
Follow-Up Items

Process

  • Set up a streamlined communication process with our users as we onboard new Solutions Engineers (Solutions team)
  • Make changes to our Incident Response bot so it better aligns with internal and external communication expectations (i.e. timing of updates, roles, publication channels) (Engineering)
  • Reassess our on-call rotation to include Frontend engineers post-Vercel move. (Engineering)

Training

  • Provide follow-up training, and include training within onboarding, for Software Engineers and Solutions Engineers, particularly on feature flagging (Engineering and Solutions)

Observability data

  • Engineering has already addressed the failing proxy for ingesting Frontend events into Honeycomb. (Engineering)