Sharding is bitter medicine
S. Alex Smith
April 4th, 2015
Percona mentioned this years ago, and it’s still true. Sharding is generally a bad idea. At Asana we put this off for years, and reaped a number of benefits from doing so:
- Our application logic was simpler.
- We had more freedom to change our data model.
- We could spend more time on things that weren’t sharding.
It was also easy for us to delay sharding: we made our database schemas more compact, and AWS released bigger RDS instances. If we’d held out long enough, we would probably be on AWS’s new Aurora.
We ultimately were forced to shard because of space considerations. Our database was a 3TB RDS instance, and we were using half of this space and growing strongly. Being on a single instance was also a source of pain. Load was a problem, backups took forever, and the database was a huge single point of failure.
In this post I’ll talk about some of the work we did related to sharding. I’ll cover the work we anticipated and how it turned out, some best practices that paid off for us, and some surprises we hit along the way.
Deciding how to shard
There is no single “right way” to shard data. Rather, the right way to shard data depends on your application. At Asana we decided the right way for us to shard was “by workspace”. Our reads and writes are almost always limited to a single workspace, so sharding by workspace let us continue to use single transactions for most of our writes and keep using mysql queries for most of our reads (instead of relying on app level joins). These features — transactions and rich queries — are why we were on a relational database in the first place, so it was important to us to preserve them. Additionally, sharding by workspace means users typically interact with only a handful of databases, so if a database has trouble, it impacts only a fraction of users.
However, sharding by workspace also has its drawbacks. Some data do not fit into a workspace, so this data needs to be special cased. For this we created a separate singleton “master shard”. Also, sharding by workspace means that our shards are much bigger than they would be if we sharded by object, so if a workspace becomes too large we’ll need to change our sharding behavior to accommodate it. But overall, sharding by workspace seemed significantly better than our other options.
Updating the app
The big pieces to sharding were reading, writing, and the ability to migrate shards between databases. Of these, we were most worried about reading, since we expected to have a lot of cross-shard queries that we would need to rewrite to work client side. In comparison we expected writes to be easy, since we could only come up with a few examples of cross-shard writes. These expectations turned out to be backwards: in actuality, reads were quite straightforward, and writes were much worse than expected.
Writes were quite hard to get into shape. Since cross-shard writes can’t be done transactionally, every cross-shard write required careful examination to make sure that it was idempotent. Frequently, this meant rewriting that part of the application. Many of these rewrites were long and messy enough that they took multiple weeks to get right.
We still aren’t in the state we wish we were. We occasionally use nested transactions, where we perform an operation against shard B while in the middle of a transaction against shard A. This made the code hard to get right, and we’re left with parts of the code base that are unintuitive and will be dangerous to refactor. For example, in a code block like
runInTransactionAgainstShard(user.shard()): user.addNewEmail(email) workspace = getWorkspaceForEmailDomain(email) runInTransactionAgainstShard(workspace.shard()): workspace.addUser(user) assert False
we will have added the user to the workspace, but will not have added her new email address.
For migrating shards off of the primary database, we had a few objectives.
- Don’t lose any data.
- Minimize downtime.
- Migrate data quickly.
These objectives typically don’t align. For example, it’s much easier to copy a shard quickly and safely when it’s offline. Ultimately we settled on copy shards while they were live, with brief unavailability when we transitioned to using the copy. The basic method we use is straightforward: we start one process to copy a shard, and a second process to re-copy any object that changes. Once the copy completes, we update a shard-to-database record to indicate that the shard now lives on the new database. In practice, this migration process allows even our biggest shards to be moved with only seconds of unavailability.
Doing this requires some book-keeping at the app layer. First, we maintain a queue of modified objects so we can re-copy any object that is modified while the migration is underway. We already had such a system in place to allow search indexing, so this took no additional work. Second, we maintain a table of “shard locks” on each database. This table holds rows of
shard_id, shard_is_on_this_database, and whenever we start a transaction against a shard, we take a shared lock on this row and confirm that
shard_is_on_this_database is true. When we switch which database holds the live shard, we update
shard_is_on_this_database to false on the original database. This addresses race conditions where a process could write to the old database after we transition off of it.
Best Practices that Paid Off
Sharding was made substantially less risky (and thus easier) by our large suite of automated tests. Our tests caught lots of bugs and missed only a few; and the bugs that were missed were generally in places where we had poor coverage. Generally the tests that caught the bugs weren’t sharding specific — they were just unit tests exercising specific behaviors of the application.
Asana is a medium-sized Asana customer, and we continuously work off of internal releases before we deploy them to other customers. Similarly, when it was time to migrate shards, the first non-test shard that we migrated was the Asana shard. This worked out well, since even after extensive testing, the first migrations resulted in mildly corrupted data. The problem was that our migration script copied associations, but never deleted them. For example, if a task was removed from a list during the migration, it was possible that the task would reappear in the list once the migration was complete. Since we caught this in dogfooding, we could fix the data by apologizing to coworkers and asking them to redo a small amount of work.
In the process of sharding, we discovered that we had accumulated a fair amount of data-cruft debt over the years and now needed to pay it down. We discovered this because there were multiple paths through which we could determine an object’s shard, and they didn’t all agree. Fixing this kind of data involved a lot of patchwork. Some of the data could be made consistent by deleting offending objects, for example when these objects were temporary in nature or were unreachable by users. Some of the data could be made consistent by changing what shard it was in. Frequently we would fix one kind of object only to realize that this had just moved the inconsistency.
We discovered these data problems through a combination of means. In the app we had assertions around loading objects. For example, if we queried for objects on shard A, these assertions would check that each of the returned objects agreed that it was on shard A. Once the assertions tripped and we knew there were problems, we wrote offline queries that reported every shard-inconsistency in our database. The queries were necessary for completeness, but the assertions were crucial for detecting the problem in the first place and helping to confirm that we’d actually solved it.
Turning sharding on in production was more eventful than beta testing was, and our users found a small number of insufficiently tested code paths that we hadn’t hit ourselves. Typically these bugs emerged when a server process would crash at an inopportune time and leave data in a state we couldn’t automatically recover from. These issues largely centered around users signing up or joining a new workspaces — two things that we don’t personally do very often, and thus remained untested in our beta cluster. They also required that the product crash at a particular time and then retry an action — something we simply didn’t test.
A couple of weeks into migrations, we hit our only known problem with customer data. The scheduling scripts weren’t as robust as the migration scripts themselves, and they managed to schedule two simultaneous migrations for a single domain. Our migrations were scheduled as “move shard X to database Y”, and in particular, they would always be migrating from whatever database the shard happened to be on at the time. This meant that, once the first migration finished, the second migration was left trying to migrate shard X from database Y to database Y. Because of what we’d learned during our dogfooding, our migration script operated as
deleteRowsOnTargetDatabase() readRowsFromSourceDatabase() writeRowsToTargetDatabase()
The silver lining is that the migration deleted data from the new database incredibly quickly, and the shard was promptly unusable. This meant that users on the shard could not enter new data, so when we switched the shard to read off of the original database again, these users lost only their last few minutes of writes. We contacted the users in the workspace to let them know what had happened, fixed our scripts to refuse to ever migrate from a database to itself, and restarted the migrations.
Burn Rate Irony
As we continued migrations, we kept an eye on how quickly disk space was dropping on the primary database. We regularly made estimates of when we would run out of space and marked these on a calendar. If we could get the rate low enough that the run-out-of-space date was a decade in the future, we were done. After a few weeks the rate was down to half of where it started, but then it plateaued.
It seemed like the long tail of data was much fatter than we anticipated, and this meant that our current rate of migrating data wasn’t going to be good enough. We spun up more EC2 instances and started migrating more aggressively. When we checked back on the database a few couple of days later, we were using data even faster than we had been before. After the panic subsided, we realized that the migrations themselves used a fair amount of space for bookkeeping, and all of this space was being used on the primary database. Happily there was no requirement for where this bookkeeping should live. We updated our migrations to keep their metadata on a different database, and were rewarded by our burn rate plummeting.
Sharding is a lot of work, and you should put it off as long as you can. By delaying sharding you get to spend more time working on your app, and you accumulate technical debt that you may never need to pay off. Even if you eventually need to shard, the tradeoff is worthwhile.