
Step-by-Step Walkthrough to A/B Testing Fundamentals

Interviews and Day-to-Day Tips You Need to Know!

11 min read · Jun 25, 2024


Two versions of the same feature for A/B Testing — (Image by Author)

Hey, aspiring Data Analysts! Want to learn the “WHY” and “HOW” of SQL, ML, AI, Data Science & Analytics, and much more? It’s all here.

Have you ever wondered how e-commerce giants decide to roll out new features? Since you have found yourself here, you probably already know the answer.

It’s using A/B Testing.

But the real question is: how do they perform these Controlled Online Experiments?

In this article, I’ll help you understand the core of A/B experiments from the perspective of a data scientist, along with a sample problem statement.

By the end of this, you will have a solid understanding of A/B Testing — helpful for both interview preparation and day-to-day work as a DS.

Here’s a brief overview of the topics we’ll be covering today:

  1. What is A/B Testing?
  2. Why is A/B Testing performed?
  3. How to conduct Controlled Online Experiments?

So, without any further delay, let’s get started!

What is A/B Testing?

Also known as Split Testing, Bucket Testing, a Randomized Controlled Experiment, or an Online Controlled Experiment, A/B Testing is a statistically randomized controlled trial for comparing two variants of a feature (in which all elements are held constant except for the one variable, or factor, being tested) between the control group and the experiment, aka treatment, group.

The Control Group sees the current version, while the Treatment Group sees the new version of the feature, typically on a webpage or in a mobile app.

Why perform A/B Testing?

Simply put, these experiments allow you to make data-driven decisions instead of guessing which feature will actually perform better once implemented in production.

These controlled online experiments give you an edge: you can test and see the results first. This way, you can confidently implement the changes that improve user experience, conversion rates, and overall revenue.

Does this make sense? I think it does, but let me know your thoughts in the comments.

How to conduct Controlled Online Experiments?

Now let’s dive deep into the flow of a typical Controlled Online Experiment, using a real-world example.

You can find the code here. Don’t forget to ⭐ the GitHub repository.

Problem Statement:

To determine whether adding a 360-degree product video feature to the product page increases the conversion rate and overall revenue, compared to a product page with only images, for an e-commerce platform.

I’ll go through the 5 major steps that are usually used to perform the test:

5 major steps for designing, running A/B testing experiments and interpreting results — Image by Author

1. Prerequisites

Generally, there are three things to consider before starting the experiment. These usually come out of detailed team meetings with Product Managers, SDEs, Designers, etc.:

  • Define a key metric, i.e., an Overall Evaluation Criteria (OEC)
  • Have a large enough number of Randomization Units
  • Make changes that are easy to implement and sensitive enough to measure

But before that, let’s try to understand:

— Metric Selection

A metric is a quantifiable measure that helps track and assess the performance of different aspects of your business. In A/B testing, metrics are crucial for measuring the effectiveness of the changes you’re testing.

What are the different types of Metrics?

Typically, there are 3 types of metrics, namely:

1. Primary aka Goal Metrics or Success Metrics

Usually the main outcome, or Key Performance Indicator (KPI), that the A/B Test is designed to measure. These are aligned with the company’s vision and mission, i.e., its long-term goals.

For example, in my project, the conversion rate and revenue are directly tied to the business objective.

Note that these are not very actionable metrics in the short term, as stakeholder agreement is required to move them.

2. Driver Metrics, or Secondary Metrics, or Surrogate Metrics

These are indirect or predictive metrics, often used to measure short-term objectives. They align with the goal metrics but are more sensitive and actionable, making them more suitable for online experiments.

Continuing the example, if the goal is to acquire new users, then one of the driver metrics could be the # of new users registered per day. Driver metrics also help explain why changes are occurring in the success metrics.

3. Guardrail Metric

As the name suggests, these metrics guard against harmful side effects of the experiment, i.e., the new feature should not degrade or negatively impact other critical aspects of your business.

For instance, the new feature may increase latency, causing an increase in the bounce rate or the unsubscription rate.

Note that the driver metrics of one team can be the guardrail metrics of another team (or teams).

How to select the best Metrics?

In the book [1], the authors summarize 3 typical attributes of metrics that are suitable for experimentation:

3 Attributes of the Best Metrics — Image by Author

Further, in online experiments, we typically select a few driver metrics as key metrics, as well as some guardrail metrics to monitor impacts on other aspects of the business.

Well, if you’re curious enough, you may wonder, since we have multiple metrics per experiment: “How do we make the launch decision when one metric goes up and another goes down?” It’s a very reasonable question, and I have also been asked it in interviews.

In real life, this scenario happens often; for instance, trade-offs between user acquisition and revenue. Acquiring new users is mostly done through expensive campaigns, like offering discounts or free gifts, but these often degrade revenue. This type of trade-off is not something a single data science team can decide; it has to be discussed and aligned with the stakeholders. In practice, many organizations develop a mental model of which trade-offs they are, or are not, willing to accept when they see particular results.

Two practical suggestions from the book “Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing” [1] on formulating metrics for experiments are:

1. Combine the target metrics into an OEC, i.e., a weighted combination of the most important driver metrics, and use it as the only criterion for the experiment (see the sketch after this list).

2. If coming up with an OEC is not possible, the authors recommend choosing no more than 5 driver metrics as the target metrics. Too many metrics can lead to confusion, cause key metrics to be ignored, and increase the chances of false discoveries affecting the decision-making process.
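To make the OEC idea concrete, here is a minimal Python sketch; the metric names, values, and weights are hypothetical placeholders (in practice they would come from stakeholder alignment):

```python
# Minimal OEC sketch: a weighted combination of normalized driver metrics.
# Metric names, values, and weights below are hypothetical.

def compute_oec(metrics: dict, weights: dict) -> float:
    """Weighted sum of already-normalized driver metrics."""
    return sum(weights[name] * value for name, value in metrics.items())

# Each driver metric is scaled to a comparable [0, 1] range beforehand
normalized_metrics = {"conversion_rate": 0.42, "revenue_per_user": 0.31}
weights = {"conversion_rate": 0.7, "revenue_per_user": 0.3}  # sum to 1

print(f"OEC: {compute_oec(normalized_metrics, weights):.3f}")  # OEC: 0.387
```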


— Randomization Units (or Unit of Diversion)

Simply put, it is the “who” or “what” we randomly assign to either the control or the treatment variant of an A/B Test.

Why does it matter?

Well, selecting the right randomization unit is critical because it impacts the user experience and helps define what metric(s) we can use in the A/B Test.

Does this mean RUs are just users?

It is tempting to think that RUs are simply the users, because in many experiments we refer to the randomization process as assigning users to each group, but that’s not entirely accurate.

Let’s talk about the different options for RUs one by one:

1. User-Level Randomization

Starting with the most commonly used randomization unit, the ‘User_ID’. Its biggest advantage is that it ensures a consistent user experience and is stable over time: once a person registers on the website, a unique User_ID is attached to their account.

As for the cons, note that a User_ID can be used to reveal a person’s identity, so we need to be mindful of confidentiality and security issues when using User_ID for identification purposes.

Another limitation of using User_ID as the RU is that identifying registered users generally requires them to log in to their accounts.

2. Cookie-Level Randomization

Unlike User_ID, Cookies are pseudonymous IDs specific to a browser and device, helpful for maintaining user privacy.

The con is that cookies are effectively temporary User_IDs, as they have expiration dates associated with them.

3. Event-Level Randomization

Another kind of RU is an Event (such as a Page View or a Session). This represents a finer level of granularity than a User_ID; in other words, one user can be connected to many page views or sessions.

Typically, there are more page views and sessions than users, so using an event as the RU provides more units and gives us more power to detect small changes.

However, it may also lead to an inconsistent user experience.

4. Device_ID

These immutable IDs are associated with a specific device.

However, they are only available on mobile devices, which is why they are most commonly used as randomization units for A/B testing changes in mobile apps.

💡Considerations when choosing Randomization Units:

There are three main things to focus on:

1. Consistent User Experience

  • If dealing with major UI/UX changes (like color, buttons, page layout, etc.), opt for a User-level RU such as User_ID
  • For changes not visible to users (like algorithmic changes to improve latency), a User-level RU isn’t required, so you can choose an event such as a page view or a session

2. Coarseness of Randomization Units and Unit of Analysis

The general recommendation is that the RU should be at least as coarse as the unit of analysis, i.e., the unit over which the metric is computed.

For instance:

We have chosen Conversion Rate as the metric and Product Page View as the Unit of Analysis; then the best RU could be the User_ID, since a user is coarser than a page view.
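To see why this matters in analysis, here is a minimal pandas sketch with a hypothetical event log: it aggregates page views up to the user level before computing the conversion rate, so correlated views from the same user are not treated as independent observations:

```python
import pandas as pd

# Hypothetical event log: one row per product page view (unit of analysis)
events = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 2, 3],
    "variant":   ["control", "control", "treatment", "treatment", "treatment", "control"],
    "converted": [0, 1, 0, 0, 1, 0],
})

# Because the RU (user) is coarser than the unit of analysis (page view),
# aggregate to the user level first, then compute the rate per variant
per_user = events.groupby(["variant", "user_id"])["converted"].max().reset_index()
print(per_user.groupby("variant")["converted"].mean())  # user-level conversion rate
```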

3. Randomization Units Estimation

Lastly, there should be enough RUs for the experiment to measure the changes reliably (a large number of RUs is recommended to detect smaller effects).

In my project, I’m considering as Randomization Units the “User_ID”s of only those users who show an intention to make a purchase by initiating the checkout process.
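In practice, assignment by User_ID is often implemented with deterministic hashing, so a given user always lands in the same variant. A minimal sketch follows; the experiment salt "product_video_360" is a hypothetical name, not taken from my project code:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "product_video_360") -> str:
    """Deterministic 50/50 bucketing: the same User_ID always gets the same variant."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 100 equal-sized buckets
    return "treatment" if bucket < 50 else "control"

print(assign_variant("user_42"))  # stable across calls and sessions
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments.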

2. Designing an Experiment

This is the most important component of a test. Here you need to answer the following questions to have a solid design for your experiment:

  1. What population to select?
  2. How to estimate the sample size for the experiment?
  3. How long to run the experiment?

Let’s understand them deeply.

To answer the first question, we first need to know about:

— Power Analysis

It involves making assumptions about several parameters, including:

  1. Power of the test (1-β): It indicates the test’s ability to detect a true effect, i.e., the probability of correctly rejecting the null hypothesis (𝐻0) when it is actually false, controlling the Type-II error. It is common to pick a power of 80% (0.8).
  2. Significance Level (α): Also the probability of a Type-I error, it is the likelihood of rejecting the null hypothesis (𝐻0) when it is actually true and there is no statistically significant impact. Generally, we use a significance level of 5% (0.05), meaning we accept a 5% risk of concluding that a statistically significant difference exists between the control and treatment groups when there is no actual difference.
  3. Minimum Detectable Effect (δ): The last parameter of a power analysis, it is the smallest change in a metric that you consider significant enough to distinguish from random noise. This value is often chosen from a business point of view, as the effect size that the experiment is designed to detect at a given significance level.

— Calculating the Minimum Sample Size

To calculate the minimum sample size for an A/B test, you need to account for the desired power (1-β), the significance level (α), an estimate of the variance (σ²), which can be obtained from a previous A/B test or an A/A test, and the MDE (δ).

Approximate Minimum Sample Size — Image by Author
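As a companion to the formula image above, here is a sketch using statsmodels to solve for the per-variant sample size of a two-proportion test; the baseline conversion rate and MDE below are hypothetical placeholders:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # hypothetical current conversion rate
mde      = 0.02  # absolute lift we want to be able to detect (δ)

# Cohen's h effect size for two proportions, then solve for n per group
effect_size = proportion_effectsize(baseline + mde, baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Minimum sample size per variant: {n_per_group:.0f}")
```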

This table will help you understand how to select the sample size:

Summary for Sample Size Estimation — Table by Author

— Estimating Test Duration

The test duration depends on the required sample size and the average traffic to your site.

Formula for Test Duration — Image by Author
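In code, the duration formula boils down to dividing the total required sample size by the daily eligible traffic; the numbers below are hypothetical:

```python
import math

n_per_group   = 3_840   # hypothetical result of the power calculation above
num_variants  = 2
daily_traffic = 1_200   # hypothetical eligible users reaching checkout per day

days = math.ceil(n_per_group * num_variants / daily_traffic)
print(f"Run the test for at least {days} days")  # then round up to whole weeks
```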

Points to remember:

- Too short a test duration can cause a Novelty Effect, meaning users tend to react quickly and positively to all types of changes, independent of their nature.

- Too long a test duration can cause a Maturation Effect; still, some run-in time is preferable, to allow users to get used to a new feature so you can observe the real treatment effect.

Ensure your test runs long enough to capture a complete user cycle, such as weekly, biweekly or monthly behaviors.

Further, to reduce bias in your results, make sure to avoid seasonality, holidays, etc.

3. Running the Experiment

After designing the experiment, the next step is to run it and collect the experiment data.

4. Results and Decision with Stakeholders

Data scientists spend much of their time checking and interpreting results in order to make decisions.

Note that it is recommended to do a sanity check to make sure the data is reliable before coming to any decision.
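One common sanity check from [1] is testing for a Sample Ratio Mismatch (SRM), i.e., whether the observed group sizes match the designed split. A minimal sketch with hypothetical counts:

```python
from scipy.stats import chisquare

# Observed users per variant (hypothetical) vs. the designed 50/50 split
observed = [50_421, 49_588]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A very small p-value signals a Sample Ratio Mismatch: the assignment
# mechanism is likely broken, and the results should not be trusted yet.
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```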

Further, while making decisions, consider the following pointers with the stakeholder team:

— Trade-offs between different metrics, as mentioned above, e.g., an increase in user engagement vs. a decline in revenue.

— The cost of launching a change, e.g., engineering and maintenance costs.

1. When costs are high:

➡️ We need to outweigh the cost by setting a practical significance boundary (δ, delta).

2. When costs are low:

➡️ It is suggested to launch the positive change as long as the results are statistically significant (p-value < α, alpha).
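For the statistical significance check itself, conversion rates in the two groups can be compared with a two-proportion z-test. A minimal sketch with hypothetical counts (not the actual numbers from my experiment):

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [5_020, 5_430]    # hypothetical: control, treatment
users       = [50_000, 50_000]  # users per variant

stat, p_value = proportions_ztest(count=conversions, nobs=users)
alpha = 0.05
print(f"z = {stat:.2f}, p = {p_value:.4f}")
print("Statistically significant" if p_value < alpha else "Not significant")
```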

5. Post-Launch Monitoring

After implementing the winning variation, we need to continuously monitor its performance to ensure sustained effectiveness and identify any long-term impacts or repercussions.

Conclusion

For my problem statement, the A/B test demonstrated a significant positive effect of the 360-degree product videos on user engagement and revenue.

At a 95% confidence level, the treatment group (which included the 360-degree product videos) exhibited a substantial 64.77% increase in user interaction and a 2.87% rise in overall revenue compared to the control group.

I believe that, after a full launch, this feature will not only improve user interaction but also contribute to a notable increase in overall sales performance.

By following these A/B testing fundamentals, data scientists can make informed, data-driven decisions that drive business success.

And that’s a wrap! If you enjoyed this deep dive, follow me so you won’t miss out on future updates.

Clap 50 times and share your thoughts below if you want to see something specific.👇

👉 SUBSCRIBE to my newsletter, Epochs of Data Insights, and receive bi-weekly in-depth Data, AI & ML tips and insights, right in your inbox!

Until next time, happy learning!

Nikita Prasad

Reference

[1] Ron Kohavi, Diane Tang, Ya Xu, “Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing”, 2020.

