Netflix and Smash

Chaos has become a symptom of the tech world. Every day, thousands of developers are putting out fires at work and getting caught up in one crisis after another.

Most of those fires were lit by the rise of microservices and distributed cloud architectures. These architectures are more popular than ever, yet failures remain frequent and complex.

Downtime Jitters

According to an IHS Markit survey, the cost of downtime for 400 companies hit a collective $700 billion per year. This is a staggering figure.

In March 2015, a 12-hour Apple store outage cost the company $25M.
In May 2017, one outage stranded tens of thousands of British Airways (BA) passengers and resulted in a $102.19M loss.
In December 2020, a large-scale outage took down YouTube, Gmail, and Google Assistant for around an hour. Outages like these burn serious holes in company pockets.

We all need a magic pill to alleviate this headache. Waiting for your service to crash is a bleak option.

Let’s do it the Netflix way and chill during deployment.

Play Destroy

Welcome to chaos engineering - a place where mistakes are intentional and failures are embraced.

Its history dates back to 2010, when the Netflix Eng Tools team created Chaos Monkey to test the resilience of its IT infrastructure. Today, chaos engineering ‘celebrates failure’ to help engineers build muscle memory and keep complex systems resilient.

Vaccinate Against Downtime

In layman’s terms, chaos engineering is the process of breaking things on purpose.

Just like a vaccination, you inject latency or CPU failure to trigger an immune response within the system.

The main goal is to identify hidden problems before they wreck production.

As a chaos engineer, you test the system's ability to handle real-world problems - server errors, traffic jumps, corrupted messages - in a series of controlled experiments.
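
To make the idea of injecting latency or failure concrete, here is a minimal Python sketch. It is an illustration, not the way Chaos Monkey or any other tool works: the names with_chaos and fetch_profile are hypothetical, and the wrapped function stands in for whatever downstream call you want to stress.

```python
import random
import time


def with_chaos(func, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap func so every call is delayed and may fail with an injected error.

    latency_s    -- seconds of artificial delay added before each call
    failure_rate -- probability (0.0 to 1.0) that a call raises ConnectionError
    rng, sleep   -- injectable so experiments and tests stay repeatable
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow network or an overloaded host
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")  # simulate a failed dependency
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Hypothetical stand-in for a real call to a downstream service.
    return {"id": user_id, "name": "demo"}
```

For example, with_chaos(fetch_profile, latency_s=0.3, failure_rate=0.1) returns a version of the call that is 300 ms slower and fails roughly one time in ten, which is enough to see whether callers time out, retry, or fall over.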

Break Things Strategically

To stress your system out, you need to follow a five-step process:

Define the steady state of the system. Develop a thorough understanding of the system so that you know what it looks like during normal functioning. This state will serve as your measurable baseline.
Build a hypothesis around the steady state. Choose the damaging action you want to enact. Simulate realistic scenarios. Replicate real-life problems that have previously occurred in your system. For example, if traffic spikes caused havoc a few months ago, opt for faults that mimic those effects.
Measure the impact. Keep tabs on your system while the fault is active. Focus on key metrics, but don’t forget to assess the entire system.
Minimize the blast radius. Safeguard the infrastructure by coordinating developer teams and business units. Furthermore, you should start small and build up as you gain confidence in the system.
Invalidate your hypothesis. Finally, you’ll have one of two outcomes. You either confirm the resilience of the system, or you find a weak point to eliminate.
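
The steps above can be sketched as a tiny experiment loop in Python. This is a simplified sketch under stated assumptions: error rate is the only steady-state metric, the hypothesis is that it rises by no more than a chosen tolerance, and ToyService, run_experiment, and the inject/rollback callbacks are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    chaos_error_rate: float
    hypothesis_holds: bool


def error_rate(call, requests):
    """Fraction of calls that raise an exception."""
    failures = 0
    for _ in range(requests):
        try:
            call()
        except Exception:
            failures += 1
    return failures / requests


def run_experiment(call, inject, rollback, requests=100, tolerance=0.05):
    # Define the steady state: measure normal behaviour first.
    baseline = error_rate(call, requests)
    # Enact the damaging action. A small request count keeps the blast radius small.
    inject()
    try:
        # Measure the impact while the fault is active.
        under_chaos = error_rate(call, requests)
    finally:
        # Always undo the fault, even if the measurement itself blows up.
        rollback()
    # Hypothesis: the error rate stays within tolerance of the baseline.
    holds = (under_chaos - baseline) <= tolerance
    return ExperimentResult(baseline, under_chaos, holds)


class ToyService:
    """Hypothetical service that reads from a cache and may fall back to a database."""

    def __init__(self, has_fallback):
        self.cache_up = True
        self.has_fallback = has_fallback

    def handle(self):
        if self.cache_up:
            return "from cache"
        if self.has_fallback:
            return "from database"
        raise RuntimeError("cache unavailable")
```

Running run_experiment against a ToyService with has_fallback=True, with an inject callback that sets cache_up to False and a rollback that sets it back to True, confirms the hypothesis. The same run with has_fallback=False exposes the weak point: every request fails while the cache is down.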

Pro tip: Run chaos experiments in production to replicate the real state of things. If you perform chaos testing during staging or integration, you won’t build a real vision of how the system in production reacts.

Embrace the Art of Chaos

Awesome! We’ve successfully shattered your application using controlled chaos and demystified the concept of chaos engineering. Next, you’ll want to right the wrongs to make your system invincible.

Credit for the above piece goes to Tatsiana Isakova, Hang Ngo, and Ellen Stevens.
