3 Risk-Mitigation Lessons That We Learned The Hard Way This Year

What do in-flight refueling maneuvers and cloud-to-cloud migration have in common?

Recently, my team made a gargantuan leap when we completed a big-bang cloud-to-cloud migration for one of India’s largest social networking service providers.

In this article, I'll walk you through our journey and the technical details.

I hope it helps you understand what it takes to migrate from one cloud provider to another while keeping all your service promises.

But before we dive in, are you wondering why I am alluding to military procedures?

It's because your customers’ business is at stake during a cloud-to-cloud migration. In fact, the measure of success for such a maneuver is how predictable each step really is, which means no surprises or anomalies.

So, in this article, I’ll share with you three lessons that helped us appreciate the value of ordinary and mundane steps in orchestrating such maneuvers.

The challenge of migrating cloud infrastructure and data

The customer is a regional social networking startup that enables 1.17 billion users to share their thoughts and emotions in their vernacular language.

Zero downtime is a non-negotiable service requirement.

In our context, keeping the customer's promise means migrating over 75 services, 200 jobs, and 220 tables (approx. 80 TB of total data) from AWS to GCP with zero data loss, zero downtime, and zero or minimal changes in the application's code base.

It also means ensuring business as usual for a total peak-time read of 2 million queries per second (QPS) and a total peak-time write of 560K QPS.

The opportunity to rise to the challenge is exciting, but doubts and concerns continue to linger.

How do we move data from Amazon DynamoDB to Google Cloud Spanner without an adverse impact?
How do we figure out the extent of coupling between the application services and DynamoDB?
How do we employ process-driven guidelines, industry best practices, and standard architectural patterns in a startup that has just started to scale?
How do we apprehend a moving target that’s changing its form every 24 hours because a group of motivated and committed geeks would rather code than sleep?

We were making no headway, and whenever we thought we had, the game had already moved on. Which brings me to our first lesson.

Lesson 1 - Know your operating environment to reduce turbulence and increase visibility

In a scenario where slowing down is not an option, stopping to refuel is unimaginable. But we had to change our worldview before we could change our game plan.

So, we moved from being many distributed teams to a centralized one.

While that's not easy with over 30 people from three companies working 14x7 for almost two months, we welcomed the initial discomfort and the eventual shift in perspective.

The extent of service-to-database coupling was no longer alien to us, apprehending a moving target was no longer daunting, and decision-making was a breeze.

To solve the NoSQL-to-SQL puzzle, we decided to split the data from a single NoSQL database (DynamoDB) across a combination of SQL (Cloud Spanner) and NoSQL (Cloud Bigtable) databases.

However, this decision compelled us to rethink how the application would interact with the databases. After much fervent debate, a game changer emerged that let us keep our customer's promise of zero downtime.

But the elephant in the room is yet to be addressed. Which brings me to our second lesson.

Lesson 2 - Run away as fast and as far as you can from a distributed monolith

Our customer’s platform is a distributed monolith where individual components are tightly coupled together. So, to gain visibility into their services and to see all the dependencies, our customer suggested that we use Lightstep as our observability tool.

We could now find and fix the root causes within minutes instead of hours and days.

Stuff still fails and breaks, but we know why and where.

Although we got better at addressing the elephant in the room, we lost precious time and energy because the platform's inherent nature precluded a lift-and-shift (rehosting) migration approach.

Let me illustrate the gravity of the situation: It took us over three months to figure out all the dependencies to move just one service across.

And this is us, a team that renowned cloud providers have trusted for five years to swiftly execute migrations. Anyway, moving on.

Migration Options: Phased versus Big Bang

The first option, which is a piecemeal migration, isn't feasible because it increases the overall cost and degrades the users' experience, both of which defeat our primary goal.

Migrating in batches isn't attractive either.

It hogs our customer's time and money with no assurance of success because of the risks associated with intricately interwoven services. For example, consider the risk of data corruption, which defeats our service requirement of zero data loss.

The third and last option, albeit as life-threatening as our in-flight refueling analogy, seems like the only way out of this conundrum.

Do whatever it takes to prevent that spark from igniting a fire. So, we are choosing the big-bang approach, risking it all on a flip of the switch.

And we draw inspiration from the following extract of the poem, "If__", by Rudyard Kipling, which is a wallpaper in the customer’s office.

"If you can make one heap of all your winnings

And risk it on one turn of pitch-and-toss,

And lose, and start again at your beginnings

And never breathe a word about your loss;

If you can force your heart and nerve and sinew

To serve your turn long after they are gone,

And so hold on when there is nothing in you

Except the Will which says to them: ‘Hold on!’"

We took on this challenge when we were exhausted, and we chose to carry on when everyone was certain that it was time to give up. Which brings me to our final lesson.

Lesson 3 - Honor mundane routines to orchestrate extraordinary maneuvers

By now, we know the problem space and the inherent risks associated with a distributed monolith.

So, it's time to reduce our error budget and overall risks.

We ran async mock cutovers every single day for over a month, ironing out each possible issue, scenario, and curveball.
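To make the idea of a daily rehearsal concrete, here is a minimal sketch of a mock cutover treated as an ordered checklist that halts at the first step that fails and records how long each step took. It is only an illustration under my own assumptions: the article does not describe the team's actual tooling, and the names Step, StepResult, and run_rehearsal are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist item of a rehearsal (hypothetical structure)."""
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float


def run_rehearsal(steps: List[Step]) -> List[StepResult]:
    """Run the steps in order and stop at the first failure.

    An exception raised by a step is treated as a failure of that step,
    so a surprise shows up in the report instead of crashing the run.
    """
    results: List[StepResult] = []
    for step in steps:
        started = time.monotonic()
        try:
            ok = bool(step.action())
        except Exception:
            ok = False
        results.append(StepResult(step.name, ok, time.monotonic() - started))
        if not ok:
            break  # later steps are skipped once one step fails
    return results


if __name__ == "__main__":
    # Placeholder steps; a real checklist would call real checks.
    checklist = [
        Step("freeze deployments", lambda: True),
        Step("verify replication lag", lambda: True),
        Step("switch traffic", lambda: True),
    ]
    for result in run_rehearsal(checklist):
        status = "ok" if result.ok else "FAILED"
        print(f"{result.name}: {status} ({result.seconds:.3f}s)")
```

Comparing the per-step timings from one day's run to the next is one simple way to see whether each step is becoming predictable.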

By the time we started the actual cutover, our mundane affairs had become muscle memory.

The following objectives helped us set some milestones, but all our actions stemmed from one goal: mitigate risks.

Looking back, it amazes me how honoring the mundane affairs breeds an uncanny sense of clarity, commitment, and camaraderie.

We celebrated our gains and losses, we fixed everything that needed fixing, and we ensured that finding memes and funny videos is just as important as finding solutions to technical challenges.

Pic taken by a member of the cloud-to-cloud migration team

Key Takeaway

What does it take to make this gargantuan leap? A bunch of crazies from the suicide squad for sure!

I’ll end this article with a quote by my colleague, Roshan, to help you visualize the scale of this migration and the thrill of this ride.

"We are creating a query translation engine that performs string processing logic and complicated database operations. Latency should increase, right? But, it does not. We delivered a solution with no increase in latency!" ~ Roshan Patil

By the way, Roshan benchmarked DynamoDB in its entirety because he feels that he might never get another opportunity to develop a solution for over 500K QPS.
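For readers curious about what "query translation" can look like, here is a toy sketch that turns a DynamoDB-style Query request into a parameterized SQL statement of the kind Cloud Spanner accepts. It is not the engine Roshan describes. The function names, the table name, and the column names are my own assumptions, and the sketch only handles simple comparison conditions on String, Number, and Boolean values.

```python
import re

# DynamoDB expressions write placeholders as ":name";
# Spanner SQL writes query parameters as "@name".
_PLACEHOLDER = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def unwrap(value):
    """Convert a DynamoDB typed value such as {"S": "abc"} to a plain Python value."""
    type_tag, raw = next(iter(value.items()))
    if type_tag == "S":
        return raw
    if type_tag == "N":
        # DynamoDB sends numbers as strings.
        return float(raw) if "." in raw else int(raw)
    if type_tag == "BOOL":
        return raw
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")


def translate_query(request):
    """Return (sql, params) for a DynamoDB-style Query request dict.

    Hypothetical helper: only plain comparisons joined by AND are supported;
    functions such as begins_with() would need their own translation rule.
    """
    table = request["TableName"]
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    where = _PLACEHOLDER.sub(r"@\1", request["KeyConditionExpression"])
    values = request.get("ExpressionAttributeValues", {})
    params = {name.lstrip(":"): unwrap(v) for name, v in values.items()}
    sql = f"SELECT * FROM {table} WHERE {where}"
    if "Limit" in request:
        sql += f" LIMIT {int(request['Limit'])}"
    return sql, params


if __name__ == "__main__":
    # Illustrative request; "Posts", "user_id", and "created_at" are made up.
    sql, params = translate_query({
        "TableName": "Posts",
        "KeyConditionExpression": "user_id = :uid AND created_at > :ts",
        "ExpressionAttributeValues": {
            ":uid": {"S": "u123"},
            ":ts": {"N": "1700000000"},
        },
        "Limit": 20,
    })
    print(sql)
    print(params)
```

Even this toy version does string processing on every request, which is why holding latency flat at this scale is worth bragging about.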