Developer environment: Achieving reliability by making it fast to reset
July 6th, 2017
We all know that engineers are most productive when they can fully customize their developer environment, but all too often, customization means multiplying the number of ways things break. To provide the best developer environment for all of our teams (IT, developer experience, and engineering alike), our team balances the need for a small set of platforms to support against the desire for a high level of customization.
Today, that means that on their first day every engineer receives a Mac with hardware built to their specification, pre-configured with our default editor (IntelliJ) and a full copy of the codebase, which they are free to reconfigure and customize as much as they would like. While several people have talked about wiping the entire OS and switching to Linux, so far everyone has stuck with OS X.
As well as the standard tools like a text editor and web browser, we also each need our own private copy of Asana to develop against. Running a copy of Asana requires lots of services (we use MySQL, Elasticsearch, ZooKeeper, Redis, and more), as well as the Asana app itself, which is designed to run on Linux. To provide these, each developer has their own “sandbox”: a virtual machine running Linux. We use Ansible scripts to install and configure all the required services, and continually copy across code changes from the Mac using rsync.
#1 Engineer complaint: Non-functional sandboxes
Before we did the work described here, our biggest source of lost engineer productivity was issues with the Linux sandbox. It was painful because it broke for people fairly often, and when it broke it could take over a day to get it working again. At peak breakage times, a team was working nearly full-time unblocking people with sandbox issues.
Sandboxes break for a number of reasons. Because they are a stateful system with several databases, even temporary errors persist and manifest in confusing ways. In our case, subtle problems with our build rules could result in sandboxes which had incorrect build outputs, and those stale or incorrect outputs would stick around.
It was also hard for people to validate any changes they made to sandbox configuration. Because people were running sandboxes built from different versions of our code, changes worked for some engineers but broke everything for others. Effectively, our sandboxes contained a tiny copy of our entire production infrastructure. While our production infrastructure has entire teams managing it, our sandboxes are only managed by the developers themselves.
To add to the pains we felt, fixing issues when they did arise was difficult because every sandbox was subtly different. And though most of our team are experts on building web applications, they aren’t experts on maintaining Linux VMs, so they wouldn’t know what to do when something failed.
A common option was to destroy the entire sandbox and start again, but this was also painful. There was no way to validate configuration changes, so this would often fail, as our configuration scripts would work incrementally but not from scratch. And since building sandboxes also requires downloading and installing lots of third party software, any internet blips would cause the process to fail, often with confusing errors. Even when it ran successfully, the process took hours to complete.
A new approach
Our broken sandboxes were costing us, both in productivity and happiness (Asana hires engineers who really care about making an impact), so something had to be done. We concluded that since our sandboxes are essentially mini versions of a production environment, we should apply best practices from managing our production systems to them.
- Immutable Infrastructure: Much of our pain was caused by maintaining scripts which could build a sandbox from scratch and apply updates to existing sandboxes. This is a common problem for production servers. A great solution is to switch to immutable infrastructure, which means never making a configuration change to a system once it is built. Instead, if system configuration needs to change, you build a new system from scratch with the new scripts and throw the old one away.
- Machines are cattle, not pets: What makes some animals (like cows) cattle, and others (like dogs) pets? One morbid definition is to ask what happens when it gets sick. If it is a pet, you nurse it back to health. If it is cattle, you euthanize it. Using this metaphor, we realized that we had been treating our sandboxes as pets—nursing them back to health when they had issues. This was time consuming and resulted in everyone’s sandbox being slightly different. Instead, if sandbox recreation was easy, then we could just delete a sandbox, and make a new one.
- Remove slow and risky work from the critical path: As we’ve discussed already, much of the work involved in creating a sandbox is slow and liable to fail. When a human being needs to wait for it to succeed, this can be really painful. Luckily, computers are very good at doing the same thing multiple times. We decided that it’s fine for the process of creating a sandbox to be a little slow and unreliable as long as the work happens as part of a non-interactive process—meaning engineers don’t have to wait for it—and it can be given up and started again whenever the process fails or takes too long.
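The last principle, giving up on a slow or failed run and simply starting again, can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the attempt count, and the one-hour timeout are hypothetical and not taken from our actual tooling.

```python
import subprocess


def run_with_retries(command, attempts=3, timeout_seconds=3600, runner=subprocess.run):
    """Run `command` until it succeeds, abandoning any attempt that fails or hangs.

    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails or times out.
    """
    for attempt in range(1, attempts + 1):
        try:
            # check=True raises CalledProcessError on a non-zero exit status;
            # timeout raises TimeoutExpired if the attempt runs too long.
            runner(command, check=True, timeout=timeout_seconds)
            return attempt
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            # Nobody is waiting on this run, so discard it and start over.
            continue
    raise RuntimeError(f"{command!r} did not succeed in {attempts} attempts")
```

Because no engineer is blocked while this runs, a failed attempt costs only machine time.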
What we did to improve sandboxes
Before we began improving our sandboxes, we were already well set up to move in the direction of using pre-configured images. All of our sandboxes were already running on EC2 instances on Amazon Web Services, which are brought up based on Amazon Machine Images (AMIs). To switch to using a pre-configured sandbox, we just needed to start building a custom AMI instead of using a base Ubuntu one.
For building the AMIs, we used Packer, which gave us an easy way to build a custom AMI. We set up a job called packer-sandbox on our testing Jenkins server to do the following:
- Check out the most recent code version
- Run a Packer template that:
  - Brings up a new machine with the base AMI
  - Calls into all of our existing configuration scripts
  - Creates a new AMI based on the result
- Copy the AMI to every AWS region that we have sandboxes in
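As a rough sketch, the job's steps map onto command lines like the ones built below. The template file name, the AMI name, and the region list are hypothetical placeholders, and how the job learns the new AMI ID from Packer's output is deliberately left out. The functions only construct the commands so the sequence is easy to read; a real job would execute them.

```python
def checkout_command():
    # Assumes the job's workspace is already a clone of the repository.
    return ["git", "pull", "--ff-only"]


def packer_build_command(template):
    # `packer build` launches a machine from the base AMI named in the template,
    # runs the template's provisioners (our configuration scripts), and
    # registers the resulting machine as a new AMI.
    return ["packer", "build", template]


def copy_image_commands(ami_id, name, source_region, target_regions):
    # One `aws ec2 copy-image` call per additional region that hosts sandboxes.
    return [
        ["aws", "ec2", "copy-image",
         "--source-region", source_region,
         "--source-image-id", ami_id,
         "--name", name,
         "--region", region]
        for region in target_regions
    ]


def job_commands(template, ami_id, name, source_region, target_regions):
    # The full sequence, in the order the job runs it.
    return (
        [checkout_command(), packer_build_command(template)]
        + copy_image_commands(ami_id, name, source_region, target_regions)
    )
```

For example, `job_commands("sandbox.json", "ami-0123", "sandbox-image", "us-east-1", ["us-west-2"])` yields three commands: the checkout, the Packer build, and one image copy.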
We now have AMIs that developers can pull down and use with minimal configuration. When a developer wants a brand new sandbox, they can run a script to:
- Delete their old sandbox
- Provision a new EC2 instance from the current AMI built by packer-sandbox
- Personalize their sandbox (set some environment variables to their username)
- Rebuild all of our services using the cache on the image
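The first two steps could look roughly like the sketch below. It assumes the image list has the shape returned by `aws ec2 describe-images` (an `ImageId` and an ISO 8601 `CreationDate` per image); the function names and the instance type are hypothetical. Personalizing the sandbox and rebuilding services happen afterwards on the new machine and are not shown.

```python
def newest_ami(images):
    """Return the ID of the most recently created image.

    CreationDate values are ISO 8601 strings in UTC, so comparing them
    as plain strings orders them chronologically.
    """
    if not images:
        raise ValueError("no sandbox images available")
    return max(images, key=lambda image: image["CreationDate"])["ImageId"]


def reset_commands(old_instance_id, ami_id, instance_type):
    # Delete the old sandbox, then provision a fresh one from the prebuilt image.
    return [
        ["aws", "ec2", "terminate-instances", "--instance-ids", old_instance_id],
        ["aws", "ec2", "run-instances",
         "--image-id", ami_id,
         "--instance-type", instance_type,
         "--count", "1"],
    ]
```

Nothing here tries to repair the old machine, which is the point: the script never needs to understand what went wrong with it.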
This change made creating a sandbox faster and more reliable. Now the slow and risky work of running our configuration scripts happens in Jenkins, where we don’t care if it takes an hour. If it fails, a developer who is knowledgeable can look into it when they get a chance, instead of a random developer being totally blocked by it.
How we made it fast
Even after we did this, it still wasn’t as fast as we were hoping. There were three reasons for this:
- Configuring a developer’s local machine (Mac) is still slow
- Syncing over files into the EC2 machine can still be slow
- Building our services can be slow
1. Configuring a developer’s local machine (Mac) is still slow
Our sandbox setup scripts have generally taken the attitude of “nuke everything to be sure we fix the problem.” To do this, they would create a new sandbox and do a ton of uninstalling and reinstalling of packages on the developer’s Mac (using pip, brew, etc.). Even though we massively sped up the time the sandbox takes to provision, the Mac configuration was still problematic.
So we decided to stop reconfiguring developers’ Macs. We found that the configuration scripts would often create more problems than they solved, and not running them saved a ton of time and increased the reliability of the VM creation process. If a developer still wants to, they can run this local configuration script manually.
2. Syncing over files into the EC2 machine can still be slow
Even though the AMI contains all of our files, we still need to sync over the difference between those files and the ones on each developer’s machine. We use rsync to do this, which by default relies on file size and modification time to decide whether source and destination files are the same. Because developer Macs check out source code at a different time to when the sandbox image is created, the modification times do not line up, so rsync has to compare the full hash of each file.
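To make the timestamp problem concrete, here is a small approximation of the quick check that rsync performs by default before deciding whether a file needs any further work. It is a simplification for illustration, not rsync's actual implementation, and the function name is made up.

```python
import os


def quick_check_matches(source_path, dest_path):
    """Approximate rsync's default quick check: same size and same modification time."""
    source, dest = os.stat(source_path), os.stat(dest_path)
    return (
        source.st_size == dest.st_size
        and int(source.st_mtime) == int(dest.st_mtime)
    )
```

Two byte-identical files checked out at different times fail this test, so every such file falls through to the slower content comparison even though nothing has changed.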
We haven’t fixed the underlying issue of rsync being slow, but some profiling revealed that printing the synced files to standard output seriously increased the time it took, so we just turned that off for the initial rsync.
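In terms of the rsync invocation, the change amounts to leaving out the verbose flag on the first sync. The sketch below is an assumption about what such a wrapper might look like; the function name and the exact set of flags are illustrative, not our real script.

```python
def rsync_command(source, destination, list_files=False):
    # -a (archive mode) recurses and preserves modification times,
    # which later incremental syncs depend on.
    command = ["rsync", "-a"]
    if list_files:
        # -v prints each transferred file, which is useful for small
        # incremental syncs but costly when nearly every file is listed.
        command.append("-v")
    return command + [source, destination]
```

The initial sync would call `rsync_command(src, dest)` and later incremental syncs could pass `list_files=True` if the output is wanted.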
3. Building our services can be slow
In addition to syncing over the files that changed, we also have to rebuild all of our services based on these changes. All of our new services use Bazel, which caches build outputs so you only do incremental work. That being said, the incremental work is only as good as the difference between the current code and the code that the AMI was built with.
To ensure that these are as similar as possible, we run the packer-sandbox job very regularly (every 12 hours). That makes sure that this incremental Bazel build is as small as possible.
This also has a nice side effect that when packer-sandbox breaks, we have a very narrow range of commits that could have caused it.
A common question we get when talking about this work is, “Why didn’t you just use Docker?” The answer here is one of pragmatism: our existing scripts got us 80% of the way towards being able to build AMIs. If we wanted to build Docker images instead, we would have had to rewrite them completely.
However, the current solution is still all or nothing. If anything goes wrong, the entire sandbox has to be deleted and recreated. It is likely we will slowly move towards splitting different sandbox services into containers, so only those services need to be rebuilt if they fail, rather than the entire image.
We would also like to have a much faster way to reset the entire sandbox, ideally in under 5 minutes. The barriers to this are the time it takes to synchronize files and the time it takes to build our source code.
As stated above, rsync is very slow when the file modification times don’t agree, which is what happens with git checkouts done at different times. We think we can improve this by using git to work out which files haven’t changed between the creation of the sandbox and the current local development environment, and forcing the timestamps of unchanged files to match before synchronizing.
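A minimal sketch of that idea follows. It is not something we have built yet; the function names are hypothetical, and it assumes the commit the image was built from is known and that the same timestamp is applied to the files inside the image when it is built.

```python
import os
import subprocess


def git_paths(repo, *args):
    """Run a git command that prints NUL-separated paths and return them as a list."""
    output = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout
    return [path for path in output.split("\0") if path]


def unchanged_since(repo, image_commit):
    """Tracked files whose working-tree contents still match `image_commit`."""
    tracked = set(git_paths(repo, "ls-files", "-z"))
    # `git diff --name-only <commit>` compares the working tree to that commit,
    # so it also reports local edits that have not been committed yet.
    changed = set(git_paths(repo, "diff", "--name-only", "-z", image_commit))
    return sorted(tracked - changed)


def pin_mtimes(repo, relative_paths, timestamp):
    """Set the access and modification time of each listed file to `timestamp`."""
    for relative in relative_paths:
        os.utime(os.path.join(repo, relative), (timestamp, timestamp))
```

Calling `pin_mtimes(repo, unchanged_since(repo, image_commit), timestamp)` before the sync would let rsync's size-and-time check skip every file that has not actually changed.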
We are actively working on reducing compilation times. We recently added a remote cache to Bazel that is shared by all of our developers and testing environments. As our continuous integration pipeline is always building every branch of our code, sandboxes are able to use the cached build artifacts, meaning only files the developer has changed locally need to be rebuilt. We’ve also been switching our TypeScript and Scala compilers to run as Bazel persistent worker processes, which has made a big difference in build times. The future is looking fast!