How Shift-Right Testing Can Build Product Resiliency

Introduction

Shift-left testing refers to testing earlier in the feature release cycle to increase the chances of finding issues sooner rather than later. Development teams invest heavily in shift-left testing, and rightly so. Shift-right testing, by contrast, refers to testing later in the feature release cycle, typically in a production or pre-production environment. It is especially useful for large-scale distributed applications, where some issues are hard to simulate earlier in the cycle and only surface under sustained user traffic. Some strategies for shift-right testing are described below.

Shadow Testing

Shadow testing refers to replicating user traffic to both the current version of a system and a future version of the same system. Both versions run side by side. Traffic can be replicated in one of two ways: a proxy intercepts user traffic and multicasts each request, or the current version asynchronously forwards requests to the future version. The primary goal of shadow testing is to make sure that the future version can handle all user scenarios at production scale before it replaces the current version. While shadow testing is in progress, only the current version returns responses to the user. Responses from the current and future versions are compared to verify correctness, and alerts are raised if there is any discrepancy. There are two ways to execute this comparison:

In a separate component: Both the current and future versions asynchronously forward their responses to a central service, which stores them in a database and generates validation reports.
Within the future version: The future version assumes that the current version’s response has already been stored in a database, looks it up for comparison, and saves the result for reporting.
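
The first option, comparison in a separate component, can be sketched as follows. This is a minimal illustration rather than a reference implementation: the ShadowComparator class and its method names are hypothetical, an in-memory dictionary stands in for the database, and responses are assumed to be directly comparable with equality.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowComparator:
    """Pairs responses from both versions by request ID and records mismatches."""

    # Hypothetical in-memory stand-in for the database described above.
    current: dict = field(default_factory=dict)
    mismatches: list = field(default_factory=list)
    compared: int = 0

    def record_current(self, request_id, response):
        # Called (asynchronously, in a real system) by the current version.
        self.current[request_id] = response

    def record_future(self, request_id, response):
        # Called by the future version; returns False if there is nothing
        # to compare against yet.
        expected = self.current.pop(request_id, None)
        if expected is None:
            return False
        self.compared += 1
        if expected != response:
            self.mismatches.append((request_id, expected, response))
        return True

    def report(self):
        # Summary that could feed a validation report or an alert.
        return {"compared": self.compared, "mismatches": len(self.mismatches)}
```

In practice, responses often contain fields that legitimately differ between versions, such as timestamps or generated IDs, so a real comparison would likely need to exclude or normalize those fields first.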

A practical strategy for shadow testing is to slowly ramp up the amount of shadowed traffic, either by percentage or by requests per second (RPS). One benefit is that it limits the outgoing HTTP calls the current version must make when it is the one forwarding traffic to the future version. The other is that load on the future version grows gradually, which makes it easier to see at what point scaling bottlenecks (if any) occur.
Diffy is an example of a tool used by Airbnb, Twitter, and ByteDance to perform shadow testing.
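
A percentage-based ramp-up can be sketched with deterministic hashing, so that the same request is always either shadowed or not at a given percentage. The function name should_shadow and the use of a request ID as the hashing key are assumptions made for illustration; this is not how Diffy or any particular proxy implements sampling.

```python
import hashlib


def should_shadow(request_id: str, percent: float) -> bool:
    """Return True if this request falls inside the shadowed percentage.

    percent is expected to be between 0 and 100.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the request to a stable bucket in the range 0..9999.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100
```

Because the bucket for a request never changes, raising the percentage only adds requests to the shadowed set; anything shadowed at 10 percent is still shadowed at 50 percent, which keeps results comparable across ramp-up steps.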

Dark Launches

A dark launch releases a new feature into production but keeps it invisible to users until it is well tested and deemed acceptable. This strategy is very useful for testing updates to backend functionality: the new code is invoked asynchronously from the user flow, and its response is hidden from the user. The primary goal is to monitor and assess whether the new functionality can handle user scenarios under production load before it is made public.

Dark launches are typically implemented using feature toggles. Once developers are satisfied that the feature is ready for broad consumption, they can turn the feature ON.

Fault Injection and Chaos Engineering

Fault injection deliberately introduces specific faults or failures into the system and validates how it behaves. Chaos engineering, generally conducted using fault injection, orchestrates the kinds of dependency failures that any large-scale application can experience in production. It helps uncover and understand resiliency gaps that may affect the end-user experience. Chaos tests are run as experiments in production that selectively introduce failures. The benefit of this approach is that the blast radius of a failed chaos test can be contained by switching the experiment OFF. Many organizations perform chaos engineering in production as a coordinated, organization-wide exercise, typically on an agreed-upon day, that tests system resiliency, monitoring coverage, and the general preparedness of on-call engineers in case of an outage.
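
As a rough sketch of the idea, the wrapper below injects failures into calls to a dependency and can be switched OFF at any time. The FaultInjector class, the fetch_profile function, and the fallback value are all hypothetical; production chaos tooling typically injects faults at the network or infrastructure layer rather than in application code.

```python
import random


class FaultInjector:
    """Wraps dependency calls and fails a fraction of them while enabled."""

    def __init__(self, failure_rate, rng=None):
        self.enabled = False  # experiments start OFF
        self.failure_rate = failure_rate  # fraction of calls to fail, 0.0 to 1.0
        self._rng = rng or random.Random()

    def call(self, dependency, *args, **kwargs):
        if self.enabled and self._rng.random() < self.failure_rate:
            # Simulate the dependency being unreachable.
            raise ConnectionError("injected fault")
        return dependency(*args, **kwargs)


def fetch_profile(user_id, injector, lookup):
    """Code under test: should degrade gracefully when the dependency fails."""
    try:
        return injector.call(lookup, user_id)
    except ConnectionError:
        # Degraded fallback so the user flow keeps working.
        return {"id": user_id, "name": "unknown"}
```

Setting injector.enabled to False stops all injected failures immediately, which is the switch that keeps the blast radius under control if the experiment starts hurting users.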

Best Practices

Because shift-right testing can directly impact end users, there are certain best practices to keep in mind:

Shift-left testing is also important: Shift-left testing is performed in a safe environment, without the risk of affecting real users. It is important to keep investing in it to build confidence in the quality of the system before it reaches production.
Observability is key: Shift-right testing depends heavily on the quality of observability in the system. Observability is needed both to measure the outcomes of shift-right tests and to raise alarms if any of these tests start causing user pain.
Follow OAA: Shift-right test coverage should be grown incrementally to allow developers to Observe, Analyze, and Act on improvements. This approach also helps identify tests that can be added to the automated regression suite to improve shift-left coverage.
Consider canary testing: Before jumping into shift-right testing in production, teams should consider testing in pre-production, canary, or staging environments, using techniques like blue/green deployments to shadow traffic and reduce risk.
Clear SLOs: Teams must have clearly defined service level objectives (SLOs), both for user flows and at the microservice level, to aid objective analysis of the outcomes of shift-right tests.
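
To make the last point concrete, the sketch below checks the measurements from a shift-right test against an SLO. The Slo fields, the choice of a 95th-percentile latency target, and the nearest-rank percentile method are assumptions made for this example; teams will define their own indicators and thresholds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    max_error_rate: float  # allowed fraction of failed requests, e.g. 0.01
    p95_latency_ms: float  # allowed 95th-percentile latency in milliseconds


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(slo, latencies_ms, errors):
    """Compare one test run (one latency sample per request) with the SLO."""
    error_rate = errors / len(latencies_ms)
    p95 = percentile(latencies_ms, 95)
    return {
        "error_rate_ok": error_rate <= slo.max_error_rate,
        "latency_ok": p95 <= slo.p95_latency_ms,
    }
```

With explicit thresholds like these, the outcome of a shadow test or chaos experiment becomes a pass or fail signal rather than a matter of interpretation.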

Conclusion

Both shift-left and shift-right testing play an important role in building resilient products and services. They uncover weaknesses in the system that may one day affect end users. In addition to relying on shift-left and shift-right testing, developers should also consider defensive design patterns in the design phase. Patterns such as retries, circuit breakers, sidecars, and bulkheads help limit the blast radius of an outage. Investing in resiliency across the development and release cycle is critical to building products that are reliable and offer predictable outcomes to users.