Beyond A/B Testing — Switchbacks and Synthetic Control Group

Written by shauryauppal | Published 2022/08/18

TL;DR: A/B testing is one of the most important skills of a data professional, and all major tech giants use it for experimentation at scale. But the method has limitations. The cost of a “bad” version can be very high, because A/B testing can cause users exposed to it to churn. For early startups with insufficient user traffic, reaching significance takes a lot more time. This newsletter discusses the alternatives in detail: Switchbacking and Counterfactuals (Causal Impact or Synthetic control group techniques).

Prerequisite: Know All About A/B testing

What to do when you can’t A/B Test

A/B Testing is one of the most important skills of a data professional. All major tech giants use this method for experimentation at scale, and it has proven itself many times in popular case studies.

However, A/B testing has limitations as well. Let’s understand the cases when A/B testing should not or cannot be used:

  • [Cannot be used] When we can’t establish independence between the two groups involved in the A/B test — i.e., adding someone to the “A” group impacts the “B” group and vice versa.
  • [Should not be used] When the cost of a “bad” version is very high: A/B testing can cause users exposed to the “bad” version to churn.
  • [Should not be used] For early startups with insufficient user traffic, collecting the required sample size and reaching significance will take a lot more time.
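
To get a feel for the traffic problem, here is a rough sketch of the standard sample-size approximation for a two-sided two-proportion z-test. The baseline conversion rate, the minimum detectable lift, and the daily traffic figures are made-up assumptions for illustration, not numbers from any real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)  # lift is relative, e.g. 0.10 = +10%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)            # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_significance(daily_users, n_per_group):
    """Days of traffic needed when users are split evenly across two variants."""
    return ceil(2 * n_per_group / daily_users)


# Hypothetical inputs: 5% baseline conversion, detect a 10% relative lift.
n = sample_size_per_group(baseline_rate=0.05, min_detectable_lift=0.10)
print("users per group:", n)
print("days at 500 users/day:", days_to_significance(500, n))
print("days at 50,000 users/day:", days_to_significance(50_000, n))
```

Under these assumptions, a startup with a few hundred daily users would need months to finish a test that a large platform could finish in a couple of days.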

[A/B Testing should not be used]: These cases can be solved by adopting the multi-armed bandit experimentation technique (won’t be discussed in this newsletter).

[A/B Testing cannot be used]: This is the case of the spillover effect, where we are not able to establish independence between the two groups involved in A/B Testing. It can be solved with Switchbacking or Counterfactuals (Causal Impact or Synthetic control group techniques), which will be discussed in this Newsletter in detail.

Beyond AB Testing

I) Naive Method — Pre and Post Full Release Analysis

The most naive approach one can think of when A/B Testing is not possible is to do a full release and then compare the metrics before and after the release.

There is no control group 🤔, so this is not science at all: the outside world affects users a lot more than the product changes we release. Pre-Post or Before-After Analysis does not account for external factors like weather, holidays, lockdowns, etc.
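
A tiny sketch makes the problem concrete. The numbers below are invented: a metric that grows by 2 orders per day because of an external trend, and a release on day 14 whose true effect is set to zero.

```python
from statistics import mean

# Hypothetical setup: an external trend adds 2 orders/day, and the release
# itself changes nothing (true effect = 0).
TRUE_EFFECT = 0.0
RELEASE_DAY = 14

daily_orders = [
    1000 + 2 * day + (TRUE_EFFECT if day >= RELEASE_DAY else 0.0)
    for day in range(28)
]

pre = daily_orders[:RELEASE_DAY]    # 14 days before the release
post = daily_orders[RELEASE_DAY:]   # 14 days after the release

# The naive estimate attributes the whole before/after gap to the release.
naive_lift = mean(post) - mean(pre)
print(f"naive pre/post estimate: {naive_lift:+.1f} orders/day")
print(f"true effect:             {TRUE_EFFECT:+.1f} orders/day")
```

The before-after comparison reports a positive lift even though the release did nothing, because the trend is silently credited to the product change.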

II) Switchback Experiments or Time Split Experiments

Researchers at MIT and Harvard published a paper that outlines a theoretical framework for the optimal design and analysis of switchback experiments. Switchback experiments, also known as time split experiments, sequentially reshuffle the control and treatment assignments over time to remove bias inherent to certain data.

These methods are popular in 2-sided marketplaces, such as Doordash, Uber, Ola, Zomato, Swiggy, and Lyft, because they allow for robust experimentation on data with finite resources (drivers, riders/customers, etc.).

Case Study of Doordash: In marketplace experimentation, network effects or spillover effects often make traditional A/B testing ineffectual.

Use Switchbacking: If the treatment impacts a shared pool of resources, the control group will be affected, thereby invalidating our experiment.

Switchback Method — Splits a fixed group of users into treatment and control versions over time (illustration below).

Every 30 minutes, we randomly assign all users in User Group A to either the control or the treatment group. This method can apply to experiments with any number of treatments.

The duration of each time split is fairly arbitrary. The guiding principle is that it should be small enough to yield useful insights from the data, but not so small that computation becomes a problem. Doordash uses 30-minute windows.
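
The assignment step can be sketched in a few lines. This is only an illustration under simplifying assumptions (one user group, fixed 30-minute windows, independent random draws per window); the function names are hypothetical and this is not Doordash's actual implementation.

```python
import random
from datetime import datetime, timedelta


def switchback_schedule(start, end, window_minutes=30,
                        arms=("control", "treatment"), seed=0):
    """Randomly assign each fixed-length time window to one arm."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    window = timedelta(minutes=window_minutes)
    schedule = []
    t = start
    while t < end:
        schedule.append((t, rng.choice(arms)))
        t += window
    return schedule


def arm_for(timestamp, schedule, window_minutes=30):
    """Look up which arm was live for the whole group at a given time."""
    index = (timestamp - schedule[0][0]) // timedelta(minutes=window_minutes)
    if not 0 <= index < len(schedule):
        raise ValueError("timestamp is outside the experiment period")
    return schedule[index][1]


start = datetime(2022, 8, 18)
schedule = switchback_schedule(start, start + timedelta(days=1))
for window_start, arm in schedule[:4]:
    print(window_start.strftime("%H:%M"), arm)
```

Every event (an order, a delivery) is then labeled with `arm_for(event_time, schedule)`, and the metric is compared across windows rather than across users. Passing more names in `arms` extends the same sketch to more than one treatment.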

Limitation of Switchback

Switchback experimentation can only be used to compare algorithms that are not user-facing. We cannot keep switching what users see on the User Interface, as it would be a bad user experience. Switchbacks are perfect for experimenting with algorithms like Driver-Rider Matching or Surge Pricing, etc.

III) Synthetic Control or Causal Impact Inferencing

In 2015, Google released the paper Inferring causal impact using Bayesian structural time-series models (595+ citations). In 2016, The State of Applied Econometrics: Causality and Policy Evaluation (927+ citations) drew wide attention to Synthetic Control, describing it as “arguably the most important innovation in the policy evaluation literature in the last 15 years” (Athey and Imbens 2016).

The synthetic control method is a statistical method used to evaluate the effect of an intervention in comparative case studies. It involves the construction of a weighted combination of groups used as controls, to which the treatment group is compared.

This comparison is used to estimate what would have happened to the treatment group if it had not received the treatment.

In the above image, the dark blue line is the observed metric we use to judge the impact of the experiment, and the dotted black line is the synthetic control: a time series model’s prediction of that metric if we hadn’t rolled out the treatment. The difference between the dotted line and the dark blue line is the treatment effect.
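
Once both lines are available, the treatment effect is plain arithmetic. Here is a minimal sketch; the observed and synthetic values are invented daily order counts for the post-treatment period, and the function name is hypothetical.

```python
def treatment_effect(observed, synthetic):
    """Compare the observed metric with its synthetic (counterfactual) prediction."""
    pointwise = [o - s for o, s in zip(observed, synthetic)]  # gap on each day
    cumulative = sum(pointwise)                               # total gap over the period
    relative = cumulative / sum(synthetic)                    # gap as a share of the counterfactual
    return pointwise, cumulative, relative


# Hypothetical post-treatment values for the treatment city.
observed = [1040, 1065, 1080, 1070]    # what actually happened (dark blue line)
synthetic = [1000, 1010, 1020, 1015]   # predicted without treatment (dotted line)

pointwise, cumulative, relative = treatment_effect(observed, synthetic)
print("daily effect:", pointwise)
print("cumulative effect:", cumulative)
print(f"relative lift: {relative:.1%}")
```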

To build a model that predicts the synthetic control, we consider the factors below.

  1. Control City Data: The control group is used to train a weight vector that predicts the synthetic control values. Note that the control group cannot be influenced by the treatment in any way. Example: If Delhi is Treatment City then Control City can be Mumbai, Bangalore, Kolkata, Chennai, etc.

  2. Treatment City Pattern in Previous Weeks/ Months/ Years: Past data points are taken into consideration to capture seasonal and trend patterns.

  3. Treatment City Data in Control Period: Last 30 days of data (evaluation metric — conversion rate) of the treatment city (let’s say Delhi).

  4. Other Factors: Weather, Holiday, etc.

The loss function minimizes the difference between the synthetic control and the treatment group before the start of the treatment.
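
The weight-fitting step can be sketched as follows. This is a simplified illustration, not the full model described above: it uses only control-city data (factor 1), assumes the classic synthetic-control constraint that weights are non-negative and sum to one, and minimizes the pre-treatment mean squared error with a small pure-Python projected gradient descent. All city values are invented, and the function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    cumsum, theta = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumsum += u
        t = (cumsum - 1.0) / i
        if u - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def synthetic_series(weights, controls):
    """Weighted combination of the control cities, one value per period."""
    periods = len(controls[0])
    return [sum(w * city[t] for w, city in zip(weights, controls))
            for t in range(periods)]


def fit_weights(treated_pre, controls_pre, steps=2000):
    """Minimize the pre-treatment mean squared error by projected gradient descent."""
    k, n = len(controls_pre), len(treated_pre)
    # Conservative step size: 1 / (upper bound on the gradient's Lipschitz constant).
    lr = n / (2.0 * sum(x * x for city in controls_pre for x in city))
    w = [1.0 / k] * k  # start from equal weights
    for _ in range(steps):
        residual = [f - y for f, y in
                    zip(synthetic_series(w, controls_pre), treated_pre)]
        grad = [2.0 / n * sum(r * x for r, x in zip(residual, city))
                for city in controls_pre]
        w = project_to_simplex([wj - lr * g for wj, g in zip(w, grad)])
    return w


# Hypothetical pre-treatment metric for three control cities.
mumbai = [10, 14, 10, 14, 10, 14]
bangalore = [10, 10, 14, 14, 18, 18]
kolkata = [20, 18, 16, 14, 12, 10]
# Treatment city, constructed here as 0.5*Mumbai + 0.3*Bangalore + 0.2*Kolkata.
delhi_pre = [12.0, 13.6, 12.4, 14.0, 12.8, 14.4]

controls = [mumbai, bangalore, kolkata]
weights = fit_weights(delhi_pre, controls)
fitted = synthetic_series(weights, controls)
rmse = (sum((f - y) ** 2 for f, y in zip(fitted, delhi_pre)) / len(delhi_pre)) ** 0.5
print("weights:", [round(w, 3) for w in weights])
print("pre-treatment RMSE:", round(rmse, 6))
```

After fitting, applying the same weights to the control cities’ post-treatment data via `synthetic_series` gives the counterfactual line for the treatment city. In practice one would likely reach for an established optimizer or a dedicated package rather than this hand-rolled loop.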

Limitation of Synthetic Control

  • Exogenous shocks like Lockdowns, War, etc. can still invalidate results.

  • Difficult to detect small effects.

  • Can’t dig into user-level heterogeneous effects, as we are experimenting at the city level.

Conclusion and Key Takeaways

  • We discussed A/B Testing case studies and the importance of A/B Testing in the experimentation world.

  • We went through cases when A/B Testing is not possible, particularly in marketplaces.

  • We discussed the Switchback or Time Split Experimentation Method and its drawbacks.

  • Finally, we discussed Synthetic Control, its importance, how to create a synthetic control, and its drawbacks.

Connect, Follow or Endorse me on LinkedIn if you found this read useful.

If you liked this blog, don’t forget to hit the ❤️. Stay tuned for the next one!

I am nominated for the HackerNoon 2022 Noonies, Vote for me: https://www.noonies.tech/2022/programming/2022-hackernoon-contributor-of-the-year-data





Written by shauryauppal | Data Scientist | Applied Scientist | Research Consultant | Startup Builder
Published by HackerNoon on 2022/08/18