A/B Testing for Digital Products

Written by lolakochieva | Published 2023/03/20

TL;DR: A/B testing allows businesses to test and validate product change ideas. It is used in many different industries to improve user experience and engagement. In an A/B test, two variants of a digital product are compared against each other by showing each variant to a randomly selected group of users and measuring their response.

What Is an A/B Test?

A/B testing is a great method to increase the effectiveness of a web page, an app, or any other digital product. It allows businesses to test and validate product change ideas rather than making changes based on assumptions or intuition alone.

In an A/B test, two variants of a digital product (A and B) are compared against each other by showing each variant to a randomly selected group of users and measuring their response.

A/B testing is used in many different industries to improve user experience and engagement. Here are three examples of industries that frequently use A/B testing:

  1. Software and mobile development: Developers use A/B testing to test different variations of a software or mobile app UI to determine which version performs better.
  2. Digital marketing: By testing different versions of ads, marketers can determine which ads are the most effective at driving user engagement.
  3. User experience (UX) design: UX designers use A/B testing to test different versions of website layouts, button designs, and navigation menus to optimize the user experience.

What Are the Steps to A/B Testing?

Imagine we have a messenger app. It is working well, and our existing clients are quite happy. A manager brings forward the idea that users would like the app more if its main color were blue instead of the current green. Of course, the manager is a smart professional; however, we do not want to introduce such a big change to the app based on his “feeling” alone. To check whether the manager's “feeling” is correct, we should conduct an A/B test.

Here are the 6 steps of the A/B testing process:

  1. Form Your Hypotheses

First, we need to form our hypotheses that will be checked with the A/B test.

In the messenger case described above, we will have the following hypotheses:

H0: Messenger users would not have a better app experience if the app were blue-colored.

H1: Messenger users would have a better app experience if the app were blue-colored.

The goal of the A/B test is to find evidence for H1. If there is enough evidence, we will reject the null hypothesis H0 and conclude that the messenger app's color should change to blue.

  2. Choose the Right Success Metric

Second, we need to choose a numerical metric that will represent the “success” of the change. It could be the number of active users, conversions, clicks, or any other metric reflecting the success of the business.

In the messenger case described above, we can choose as the success metric any metric that represents how happy the users are with the app.

Some Success-Metric Examples:

  • Number of active users per day (or any other period).
  • Number of messages sent per user per day (or any other period).
  • Number of minutes spent in the app per user per day (or any other period).
  • And so on.

All these metrics represent user satisfaction with the app, and we can choose any of them for the A/B test. Let's choose the second one, measured weekly: the number of messages sent per user per week.
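
As an illustration, here is a minimal sketch of how this metric could be computed from a raw message log with pandas. The table and its column names (user_id, sent_at) are hypothetical, and the values are made up; real schemas will differ.

```python
import pandas as pd

# Hypothetical message log: one row per message sent.
messages = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "sent_at": pd.to_datetime([
        "2023-03-06", "2023-03-07", "2023-03-06",
        "2023-03-06", "2023-03-08", "2023-03-09",
    ]),
})

# Messages sent per user per week: group by user and ISO week, then count rows.
weekly = (
    messages
    .groupby(["user_id", messages["sent_at"].dt.isocalendar().week])
    .size()
    .rename("messages_per_week")
    .reset_index()
)
print(weekly)
```

The resulting per-user weekly counts are exactly the values we will later compare between the control and experiment groups.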

  3. Choose the Significance Level

Statistically speaking, the significance level is the probability of rejecting the null hypothesis when it is actually true (a type I error). In simpler words, the significance level is a measure of how much evidence is required to reject the null hypothesis: the lower the significance level, the more evidence is needed.

Choosing the appropriate significance level depends on the problem we want to solve with the A/B test. Commonly, a 5% significance level is used, meaning we are willing to accept a 5% chance of rejecting the null hypothesis when it is true. However, if the suggested change is of high importance and we are afraid of making a mistake, we can use a stricter significance level, for example, 1%.

For the messenger case, let's take a 5% significance level.
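
To build intuition for what the 5% means, here is a small simulation sketch: we repeatedly run an “A/A test” in which both groups come from the same distribution (so H0 is true by construction) and count how often a t-test falsely rejects it. All numbers are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_simulations = 10_000
false_rejections = 0

for _ in range(n_simulations):
    # Both groups are drawn from the SAME distribution, so H0 is true.
    a = rng.normal(loc=20, scale=5, size=200)
    b = rng.normal(loc=20, scale=5, size=200)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        false_rejections += 1

# Prints roughly 0.05: at a 5% significance level, the test
# wrongly rejects a true H0 about 5% of the time.
print(false_rejections / n_simulations)
```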

  4. Design the Control and Experimental Groups

After the success metric and significance level are chosen, we need to create two random groups of users. One group is called the “control sample” and the other the “experiment sample”. The control sample gets the old version of the app, website, or software, whereas the experiment sample gets the new version with the proposed changes. In the messenger case, the control sample keeps the old green interface, and the experiment sample gets the new blue version.
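
One common way to build such random groups (a standard industry technique, not something specific to this example) is deterministic hash-based assignment: hashing a user ID together with an experiment name yields a stable, effectively random split that is reproducible across sessions. A minimal sketch in Python, with hypothetical names:

```python
import hashlib

def assign_group(user_id: str, experiment: str, experiment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'experiment'."""
    # Hash the experiment name together with the user ID so that
    # different experiments produce independent splits.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    return "experiment" if bucket < experiment_share else "control"

print(assign_group("user_42", "blue_ui_test"))  # stable for a given user
```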

The sample sizes depend on multiple factors:

  • Significance level - the lower the significance level, the more evidence we need to reject the null hypothesis, so a larger sample size is required.
  • Metric variance - the higher the metric's variability, the more noise we will have in the results, hence a larger sample size is needed.
  • A/B test duration - the faster we need the results, the more evidence we must collect per unit of time, and therefore a bigger sample is required.

In general, the larger the sample size, the better, as you collect more evidence and increase the reliability of the A/B test results. In practice, however, we might face sample-size restrictions, as larger samples usually require more resources and carry more risk. For example, in the messenger case, a sample of 1% of the users might be fine, whereas a sample of 30% of the users can carry high risk: what if the users do not like the blue color and stop using the app? In that case, the business risk is high.

To calculate the correct group size, we can use statistical power calculators or consult a statistician.
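
For instance, if we plan to compare group means with a two-sample t-test, the statsmodels power utilities can estimate the required group size. The effect size and power below are assumptions chosen purely for illustration:

```python
from statsmodels.stats.power import TTestIndPower

# Required sample size per group for a two-sample t-test.
# effect_size is Cohen's d: the expected difference in mean messages
# per week divided by the standard deviation (0.1 is an assumption).
n_per_group = TTestIndPower().solve_power(
    effect_size=0.1,        # small expected effect
    alpha=0.05,             # the significance level chosen above
    power=0.8,              # conventional 80% chance of detecting a real effect
    alternative="two-sided",
)
print(round(n_per_group))   # ~1571 users per group for these inputs
```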

Also, it is important to note that the number of groups is not necessarily two. In the case described above, only two groups are needed because just one color is being tested. However, if we want to test more colors, additional test groups can be created.

  5. Introduce the Change to the Experiment Group

At this step, we introduce the change to the experiment group. The difficulty of this step depends on whether your app or software has a technical and methodological framework for conducting A/B tests. Sometimes, from a technical point of view, it is hard or even impossible to serve a different version of an app or software to a random subset of users.
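
In the simplest setups this boils down to a feature flag keyed off the group assignment. A minimal sketch, reusing the hash-based split from the earlier sketch and exposing the experiment to 1% of users, as mentioned above (all names are hypothetical):

```python
import hashlib

def assign_group(user_id: str, experiment: str, experiment_share: float) -> str:
    # Same deterministic hash-based split as in the earlier sketch.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000
    return "experiment" if bucket < experiment_share else "control"

def get_app_theme(user_id: str) -> str:
    # Serve the blue variant to 1% of users; everyone else keeps green.
    group = assign_group(user_id, "blue_ui_test", experiment_share=0.01)
    return "blue" if group == "experiment" else "green"

print(get_app_theme("user_42"))
```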

  6. Measure the Results

After some time passes, we calculate the metric distribution for the control and experiment groups. If the metric distribution in the control sample is significantly different from that in the experiment sample, the null hypothesis can be rejected.

The significance of the difference can be analyzed using online calculators or programming languages like Python or R. With a good understanding of statistics, we can even do it manually. Depending on the context, tests such as the z-test, t-test, or Mann-Whitney test may be used.
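
For example, with Python's scipy, both a t-test and a Mann-Whitney test are one-liners. The per-user weekly message counts below are made up purely for illustration:

```python
from scipy import stats

# Per-user weekly message counts collected during the test
# (values are illustrative placeholders).
messages_control = [12, 7, 30, 15, 22, 9, 18, 25, 11, 14]
messages_experiment = [14, 9, 28, 17, 21, 12, 19, 27, 10, 16]

# Welch's t-test: compares means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(messages_control, messages_experiment, equal_var=False)

# Mann-Whitney U: a non-parametric alternative for skewed count data.
u_stat, p_value_mw = stats.mannwhitneyu(messages_control, messages_experiment)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the difference is statistically significant.")
else:
    print("Not enough evidence to reject H0.")
```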

Suppose in the messenger case we created the control and experiment groups and ran the test for one week. After the week, we calculated the chosen metric, the number of messages sent per week, for each user in each sample. As a result, we can visualize the control-group and experiment-group distributions, similar to the graph in the picture below: the grey distribution is the control group, and the yellow one is the experiment sample.

The overlap between the distributions gives a visual hint of how different they are: the smaller the overlap, the more the distributions differ. The formal decision, however, comes from the significance test. Earlier we chose a 5% significance level, so to reject the null hypothesis the test's p-value must be at or below 0.05. In our example the distributions overlap heavily and the test does not reach significance, so we do not have enough evidence to reject the null hypothesis. The color of the app will remain green.

Important Points to Mention About A/B Tests

To conduct an effective AB test, there are a few best practices to keep in mind.

First, it's important to have a clear hypothesis or goal for the test - what are you trying to achieve with the changes you're making? This will help you measure the success of the test and make informed decisions based on the data.

Second, it's important to test one variable at a time. If you change too many things at once, it will be difficult to determine which change had the biggest impact on the results. By testing one variable at a time, you can isolate the effects of each change and measure its impact on the user experience.

Third, it's important to collect enough data to make an informed decision. This means running the test long enough to reach the required sample size and achieve statistically meaningful results. Depending on the traffic to your website or app, this could take anywhere from a few days to a few weeks.
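
A back-of-the-envelope duration estimate simply divides the required sample size by the rate at which eligible users arrive. The numbers here are illustrative:

```python
import math

# Rough duration estimate: how many days it takes to fill both groups,
# given how many eligible users arrive per day (numbers illustrative).
required_per_group = 1571           # e.g., from the power calculation above
new_eligible_users_per_day = 400
groups = 2

days_needed = math.ceil(required_per_group * groups / new_eligible_users_per_day)
print(days_needed)                  # ~8 days for these inputs
```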

Fourth, it is important to have independent groups. That means each user can be in only one of the groups, and users in one group should not influence users in the other group.
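
A cheap sanity check for the first condition is verifying that the two sets of user IDs do not intersect (the IDs below are placeholders):

```python
# Assumed sets of user IDs collected from the assignment logs.
control_ids = {"u1", "u2", "u3"}
experiment_ids = {"u4", "u5"}

# Fail loudly if any user was assigned to both groups.
assert control_ids.isdisjoint(experiment_ids), "Groups overlap: fix the assignment!"
```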

Conclusion

In conclusion, A/B testing is a powerful tool for optimizing digital products and improving the user experience. By testing different versions of a product and measuring their impact on user behavior, businesses can make data-driven decisions to improve their conversion rates and achieve their goals.

I consider A/B testing to be an artificial analogue of evolution. With A/B tests, you make a change for a small portion of your users (like mutations occurring in animal species), and if the change turns out to be successful, you introduce it to all users.

