S. Alex Smith
September 10th, 2016
Yesterday morning, Asana was down for approximately 83 minutes, from 7:25am to 8:48am PDT. This is the longest outage that we’ve had in the past two years, and in terms of customer impact, the worst outage that we’ve ever had. For that, we’re extremely sorry. We understand that Asana is a cornerstone of our users’ workflows, and that this outage wasted your time. We’d like to tell you what we understand about the outage and the steps we are taking to avoid a similar incident in the future.
The proximate cause of the outage was new code that we deployed Wednesday night. We generally deploy new code twice a day, but this was a much later deployment than usual, due to some reverts that happened earlier in the day. Among other things, the new code included logging that was intended to help improve Asana’s security. However, the code had a bug, which resulted in the logging being triggered dramatically more often than we had anticipated. At the time the code was deployed, this caused an increase in the CPU utilization of our web servers. We had sufficient headroom that the CPU increase didn’t cause any problems, allowing the issue to go unnoticed on Wednesday night.
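To illustrate the failure mode (this is a hypothetical sketch, not Asana's actual code), a logging call intended to fire rarely can end up in a hot per-request path, multiplying log volume and the CPU spent formatting records:

```python
import logging
from io import StringIO

# Hypothetical sketch: a security-audit log meant to fire once per
# session is accidentally wired into the per-request path, so log volume
# (and CPU) scales with requests instead of sessions. All names are
# illustrative.

buf = StringIO()
audit = logging.getLogger("security_audit")
audit.addHandler(logging.StreamHandler(buf))
audit.setLevel(logging.INFO)

def handle_request(session_id, request_no, buggy):
    # Intended behavior: log only on the first request of a session.
    # Buggy behavior: log on every request.
    if buggy or request_no == 0:
        audit.info("audit session=%s req=%d", session_id, request_no)

def serve_session(session_id, n_requests, buggy):
    for i in range(n_requests):
        handle_request(session_id, i, buggy)

serve_session("s1", 200, buggy=False)
intended = buf.getvalue().count("\n")   # 1 record for the whole session

buf.truncate(0); buf.seek(0)
serve_session("s1", 200, buggy=True)
actual = buf.getvalue().count("\n")     # 200 records: 200x the logging work
```

With enough CPU headroom, a multiplier like this is invisible at night and only bites at peak load the next morning, which matches the timeline below.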
We first became aware of a problem at 7:15am on Thursday morning. The higher CPU utilization, combined with the morning’s traffic—our peak load—caused an increase in request latency, which resulted in web servers holding on to database connections longer than usual. Some of our safeguards detected the increase in connections and caused our non-critical work queue to back off. This caused our search indexer to fall behind, which paged our on-call engineers. At this point, Asana was still up.
Initially the on-call engineers didn’t understand the severity of the problem, since the app was up and the impact of the incident appeared to be limited to the non-critical job queue. Our investigation initially centered on the database, as this is a common cause of issues. After the initial page, we received an additional page about the API being down. To make things even more confusing, our engineers were all using the dogfooding version of Asana, which runs on different AWS EC2 instances than the production version; with only Asana employees on this version, those machines were not overloaded. At 7:35am, our customer support team informed the on-call engineers that users were unable to use the app.
At 7:40am the on-call engineers determined that the databases were actually underloaded.
At 7:47am one of the engineers noticed the increased logging on the web servers, but discarded this line of investigation, since it didn’t match the hypotheses that were being explored.
At 7:57am the issue was escalated again to a broader set of engineers.
At 8:02am an on-call engineer noticed that web server CPU was maxed out, and quickly correlated it with the previous night’s code release.
Between 8:08am and 8:29am we tried to identify a safe version of the code to revert to. Because of the previous day’s reverts, simply picking the previous revision would have reintroduced bad code.
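The reason a naive rollback was unsafe can be sketched as follows (hypothetical; revision ids and statuses are illustrative): the last known good revision is the newest one that neither contained the fault nor was itself reverted.

```python
# Hypothetical sketch of selecting a safe rollback target after a day
# that included reverts. Revision ids and statuses are illustrative.

deploy_history = [            # newest first
    ("r5", "faulty"),         # the release containing the logging bug
    ("r4", "reverted"),       # earlier code that had already been reverted
    ("r3", "reverted"),
    ("r2", "good"),
    ("r1", "good"),
]

def last_known_good(history):
    """Return the newest revision that is safe to revert to."""
    for revision, status in history[1:]:   # skip the currently deployed release
        if status == "good":
            return revision
    return None
```

Here, blindly reverting to the previous revision would have landed on `r4`, already-reverted code, which is why identifying the target took roughly twenty minutes of careful checking rather than a single command.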
At 8:29am, an on-call engineer positively identified the issue, and we worked to track down the last known revision that did not include the faulty change. The revert was executed at 8:37am. Given the nature of the fault, web clients running the bad revision would not prompt a reload on their own, so at 8:42am we blacklisted the bad revision.
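The blacklist step can be sketched as a simple server-side check (hypothetical; the revision id and function names are illustrative, not Asana's actual mechanism): clients still running a blacklisted revision are instructed to reload, even though they would not prompt one on their own.

```python
# Hypothetical sketch of a client-revision blacklist. The server keeps a
# set of revisions known to be bad; any connected client reporting one of
# them is told to force a reload. The revision id is illustrative.

BLACKLISTED_REVISIONS = {"build-1428"}

def needs_forced_reload(client_revision: str) -> bool:
    """Return True if the client must reload to pick up fixed code."""
    return client_revision in BLACKLISTED_REVISIONS
```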
From this point on the application started to gradually recover. The application was fully back to normal at 8:48am.
After the incident was resolved, we ran a 5 Whys analysis. We believe that our biggest failure during this incident was our response. As such, we are implementing additional tools and safeguards to ensure we are better equipped to deal with this type of issue in the future. We identified systems and procedures that we can improve in three main areas: detecting problems before they become user facing, reducing the time required for on-call engineers to triage incidents, and reducing the time to fix problems once they have been identified.
We realize that you rely on Asana to do your work, and yesterday, Asana was not reliable. For our failure to provide a stable service, we’re really sorry—we hold ourselves to a higher standard than that. We hope we can earn your trust again by working to ensure that something like this doesn’t happen in the future.