Interruption-free service migrations: what you can learn from the cashiers at your local…

Amazon can do it, Google can do it, a lot of others can do it too: interruption-free service migrations.

So why are so many enterprise businesses, banks, and others not able to do it? Why do I still get a notification from my bank that there will be service interruptions between Saturday 4 p.m. and Sunday 10 a.m.? Why is there still that senior manager who asks about every change request: “Will that change be disruptive?” 20 years ago, that might have been a reasonable question, but when was the last time you went to google.com and got a “service down due to maintenance” message?

I do not know why we are still talking about this. However, I do know how to do interruption-free service migrations. It’s very simple:

Just watch and learn from the cashiers at your local supermarket

Let’s say they have 4 counters. Counters #1, #2, and #3 are open, each with a cashier behind it. Now the cashier behind counter #2 is done for the day and needs a replacement.

Here’s what the cashier won’t do: stop everything, take the cash drawer out, and wait for her replacement, who has to put in his own cash drawer, all while the whole queue of customers at that counter is waiting.

In fact, this is what will happen: A fourth cashier will open counter #4 and start serving customers. Next, the cashier at counter #2 will put up a closed sign to avoid getting more customers in the queue. She will still serve the remaining customers in her queue. Then she will close it.

Notice how this is a 4-step process:

1. Open a new service
2. Route new traffic to the new service and stop sending it to the old one
3. Serve all remaining traffic in the old service
4. Close the old service
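To make the four steps concrete, here is a minimal sketch in Python. The `Counter` and `Router` classes are made up for illustration and are not taken from any real load balancer; a production setup would do the same thing with its own load balancer's draining feature.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Counter:
    """One service instance; `queue` holds requests it has already accepted."""
    name: str
    accepting: bool = True
    queue: deque = field(default_factory=deque)


class Router:
    """Sends each new request to an instance that still accepts traffic."""

    def __init__(self):
        self.counters = []

    def open(self, counter):
        self.counters.append(counter)

    def route(self, request):
        # Pick the accepting counter with the shortest queue.
        target = min((c for c in self.counters if c.accepting),
                     key=lambda c: len(c.queue))
        target.queue.append(request)
        return target.name

    def drain_and_close(self, counter):
        counter.accepting = False        # the closed sign stays up
        served = list(counter.queue)     # serve everyone already queued
        counter.queue.clear()
        self.counters.remove(counter)    # close the old service
        return served


router = Router()
old = Counter("counter-2")
router.open(old)
router.route("customer-a")               # queued at counter-2

new = Counter("counter-4")
router.open(new)                         # 1. open a new service
old.accepting = False                    # 2. new traffic only goes to the new service
assigned = router.route("customer-b")
served = router.drain_and_close(old)     # 3. serve the remaining queue, 4. close
```

Until `drain_and_close` is called, setting `old.accepting` back to `True` undoes the switch, which is the rollback the cashiers get for free.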

Also, notice how this leaves all possibilities open: the cashier at counter #2 could suddenly decide to re-open, or not close at all, while the cashier at counter #4 still serves. Or the cashier at counter #4 could decide that it wasn’t a good idea, put up a closed sign, and leave after serving the remaining queue. It’s all seamless.

Production Systems

So why do we still have these nail-biting upgrade processes in production systems? These all-or-nothing-high-risk-no-rollback upgrade processes, potentially with service interruption, when those cashiers provide us with the perfect blueprint of how to do it?

Okay, I do realize that production systems are a bit more complex than what I described above. But as always, the way to deal with complexity is to break it down into small, digestible chunks. This is the hard part. Once that is done, any service upgrade can be done gradually, with low risk and the option to roll back at any point in time, just like the cashiers above.

Let me use a few examples to make my point clear.

Examples

Changing Certificates

Almost any non-trivial system needs to deal with certificates. The problem is always the same: how do you make sure your clients and your servers get updated simultaneously?

The answer is: you don’t. Instead, the side verifying the validity of a certificate should accept both the old and the new certificate during the transition period.

Here are the steps:

1. Add the new certificate to the list of accepted certificates on the client (assuming the client is the verifying side).
2. On the server, replace the old certificate with the new one.
3. Wait for all client requests that still use the old certificate to finish.
4. Remove the old certificate from the client.
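As a rough sketch of the overlap period, assume the client pins certificates by SHA-256 fingerprint. The `PinnedVerifier` class and the placeholder certificate bytes are hypothetical; a real client would configure the trust store of its TLS library instead.

```python
import hashlib


def fingerprint(cert_der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


class PinnedVerifier:
    """Accepts any certificate whose fingerprint is in the trusted set."""

    def __init__(self):
        self.trusted = set()

    def trust(self, cert_der):
        self.trusted.add(fingerprint(cert_der))

    def distrust(self, cert_der):
        self.trusted.discard(fingerprint(cert_der))

    def accepts(self, cert_der):
        return fingerprint(cert_der) in self.trusted


# Placeholder bytes stand in for real DER-encoded certificates.
OLD_CERT = b"old-certificate"
NEW_CERT = b"new-certificate"

verifier = PinnedVerifier()
verifier.trust(OLD_CERT)                 # state before the migration

verifier.trust(NEW_CERT)                 # step 1: accept both certificates
during_transition = (verifier.accepts(OLD_CERT), verifier.accepts(NEW_CERT))

# Steps 2 and 3 happen elsewhere: the server switches certificates and
# in-flight requests that still use the old one finish.

verifier.distrust(OLD_CERT)              # step 4: drop the old certificate
after_transition = (verifier.accepts(OLD_CERT), verifier.accepts(NEW_CERT))
```

Because the trusted set holds both fingerprints during the transition, the server can switch at any moment without a single client failing verification.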

Update DB Schema

Let’s say you have 1000 machines accessing an SQL database to feed a web app. A new version of the web app uses a renamed column in the database. So you need to rename that column in the database and simultaneously upgrade all your 1000 machines to the new web app version. How can you do that?

You can’t. Instead, your new version must be able to understand a schema that contains both the new column name and the old column name. When reading, it must always try the new column first and fall back to the old one when necessary. When writing, it should always write to both the new and the old column.
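Here is a minimal sketch of that read and write logic, using Python's built-in `sqlite3` module. The `users` table and the rename from `name` to `full_name` are assumptions made up for this example.

```python
import sqlite3

# Hypothetical transition schema: `name` is being renamed to `full_name`,
# so both columns exist for the time being.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)"
)
# A row written by the old app version: only the old column is filled.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")


def read_full_name(conn, user_id):
    """Prefer the new column; fall back to the old one while it is still NULL."""
    row = conn.execute(
        "SELECT COALESCE(full_name, name) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def write_full_name(conn, user_id, value):
    """Write both columns so old and new app versions see the same value."""
    conn.execute(
        "UPDATE users SET name = ?, full_name = ? WHERE id = ?",
        (value, value, user_id),
    )


legacy = read_full_name(conn, 1)          # served through the fallback
write_full_name(conn, 1, "Ada Lovelace")
updated = read_full_name(conn, 1)
old_view = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
```

Writing both columns is what keeps rollback cheap: machines still running the old version keep reading correct data from `name`.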

Here’s how you deploy:

1. Add the new column to the database. (Open counter #4)
2. Deploy your new web app version to your machines in a staggered manner, potentially with a canary. (Reroute traffic to counter #4, put up a closed sign)
3. Run a background job that copies all old column values to the new column. (Serve all remaining customers at counter #2)
4. Remove the fallback mechanism in the web app and deploy again, staggered as before. (Close counter #2)
5. Optionally: remove the old column. (Tear down counter #2 :-))
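The background job from the deployment steps could look roughly like this. It is a sketch under assumptions: the `users` table with an old `name` and a new `full_name` column is invented for the example, and the batch size is arbitrary.

```python
import sqlite3

# Hypothetical table in its transition state: old column filled, new one empty.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user-{i}") for i in range(1, 8)],
)


def backfill(conn, batch_size=3):
    """Copy old values into the new column in small batches.

    Only rows whose new column is still NULL are touched, so the job can be
    stopped and restarted at any time. Returns the number of rows copied.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET full_name = name "
            "WHERE id IN (SELECT id FROM users "
            "             WHERE full_name IS NULL AND name IS NOT NULL "
            "             LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


copied = backfill(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
```

Small batches keep each transaction short, so the job does not block the web app for long while it runs.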

A New Storage Service Backend

This example is very similar to the previous one. Let’s say you have used S3 so far, but want to migrate to an S3-compatible storage service from another vendor. Your service may consist of thousands of machines.

Again, to do this, you introduce a code change: writes now go to the new and old storage service. Reads go to the new, but fall back to the old when a resource is not found. Deletes also go to both.
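A minimal sketch of such a code change follows. The `MigratingStore` class is hypothetical, and plain dicts stand in for the two storage services; a real implementation would wrap the actual storage clients and their error handling.

```python
class MigratingStore:
    """Wraps an old and a new key-value store during the transition.

    Both backends are plain dicts here, purely for illustration.
    """

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def put(self, key, data):
        # Writes go to both stores, so switching back to the old one stays possible.
        self.new[key] = data
        self.old[key] = data

    def get(self, key):
        # Reads prefer the new store and fall back to the old one.
        if key in self.new:
            return self.new[key]
        return self.old[key]  # raises KeyError if the key is in neither store

    def delete(self, key):
        # Deletes go to both stores, so a removed object cannot
        # reappear through the read fallback.
        self.new.pop(key, None)
        self.old.pop(key, None)


old_store = {"report.pdf": b"v1"}        # data that predates the migration
new_store = {}
store = MigratingStore(old_store, new_store)

fallback_read = store.get("report.pdf")  # not yet copied, served by the old store
store.put("photo.jpg", b"img")           # lands in both stores
store.delete("report.pdf")               # removed from both stores
```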

The migration plan goes like this:

1. Set up the new data store. (Open counter #4)
2. Deploy your updated system to your machines. (Reroute traffic to counter #4, put up a closed sign)
3. Run a background job that copies all your data from the old storage service to the new one. (Serve all remaining customers at counter #2)
4. Remove the code change and configure only the new data store. (Close counter #2)
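The copy job in this plan can be sketched as follows, again with plain dicts standing in for the two storage services. The function name `copy_missing` is made up, and the sketch ignores races with concurrent deletes, which a real job would have to handle.

```python
def copy_missing(old, new):
    """Copy objects that exist only in the old store.

    Keys already present in the new store are skipped, because the new
    store may hold a more recent write. Returns the number of objects copied.
    """
    copied = 0
    for key in list(old):
        if key not in new:
            new[key] = old[key]
            copied += 1
    return copied


old_store = {"a.txt": b"old-a", "b.txt": b"old-b"}
new_store = {"b.txt": b"new-b"}          # already rewritten since the deployment

first_run = copy_missing(old_store, new_store)
second_run = copy_missing(old_store, new_store)   # nothing left to copy
```

Since the job never overwrites existing keys, it can run repeatedly until it reports zero copied objects.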

Note that you might need to increase the number of machines for your service temporarily, because of the more expensive data storage operations.

Implications

Looking at the examples, specifically the last one, one thing has become clear by now: to do migrations like these, you need to own the code. If you’re operating a system that you cannot change, you won’t be able to make changes that help interruption-free migrations.

That is probably why newer forms of organization, such as Amazon’s two-pizza teams that follow a “You build it, you run it” culture, are better suited for such interruption-free service migrations than the more classic ones with a strict separation between development and operations.

It’s also obvious that these kinds of migration plans require teams to be able to do frequent deployments with small changes. At any point in time during such a migration, you must be able to stop or roll back. That’s difficult when your update contains other unrelated changes and when your deployment process is manual and slow.

It also helps to have a mechanism to deploy to so-called canaries: single machines that get the update first. Only if they prove to work correctly is the update rolled out to more machines.
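The idea fits in a few lines of Python. This is a generic sketch, not the API of any particular deployment tool; `deploy` and `healthy` are hypothetical callables that the caller has to supply.

```python
def staged_rollout(machines, deploy, healthy, canaries=1):
    """Deploy to the canaries first; continue only if all of them are healthy.

    Returns the list of machines that received the update.
    """
    updated = []
    first, rest = machines[:canaries], machines[canaries:]
    for machine in first:
        deploy(machine)
        updated.append(machine)
    if not all(healthy(machine) for machine in first):
        return updated                   # stop; the rest keep the old version
    for machine in rest:
        deploy(machine)
        updated.append(machine)
    return updated


machines = ["web-1", "web-2", "web-3"]

deployed = []
good = staged_rollout(machines, deployed.append, lambda m: True)
bad = staged_rollout(machines, lambda m: None, lambda m: False)
```

In the failing case only `web-1` is touched, so a bad update costs one machine instead of the whole fleet.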

If you wonder what can help you do these kinds of deployments, take a look at BOSH. It probably cannot compete in scale with Amazon’s internal deployment system, Apollo. But it’s one of the most powerful tools you can currently get, especially given that it is free and open source.

Conclusion

Cashiers at your local supermarket can give you the blueprint for doing interruption-free service migrations. Using this blueprint, you can break down your (potentially huge) migration into small changes that provide seamless, interruption-free migrations. Embracing an organizational structure that merges developers and operators helps to implement such small changes that aid the actual migration. Finally, use a powerful deployment system like BOSH to deploy the changes in an automated way.