
A/B testing: A simple idea that is easy to screw up

Even though the basics of A/B testing are easy to grasp, tests often fail to generate business benefits. We have listed the most common mistakes to avoid.

The idea of A/B testing is not complicated. You create variations of a part of the service and compare the success of the variants in real time. In an A/B test, there are normally two variants, the original and a modified version. The variants are usually identical apart from a single change, but the method can be used to test major changes as well. Minor changes are quick to design and implement, while larger tests require significant design and coding resources.

Even though A/B testing tools have improved, the concept remains simple. Many tests still fail, most often for one of these reasons:

1. The test is not conducted in the first place

The most important application of A/B testing is the validation of design choices. If decisions affecting large numbers of users are made blindly, the results can be damaging. A correctly interpreted A/B test will tell you which of the variants works better, and the answer will be based on reliable data. The test can also be used to develop new hypotheses or rule out design solutions that do not support business goals.

Quick A/B checklist

When making changes to a service, measure their impact against the service’s key indicators, such as purchase volume, newsletter subscriptions or items added to the shopping basket (a small sketch of this follows the checklist).

Don’t try to fix something that isn’t broken. If the service already works well, make sure that the design update will actually improve the UI.

Analyze the results: why did the test succeed or fail? Think about how the variants could be improved further.
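As a minimal sketch of the first point, assuming a toy in-memory event log rather than any particular analytics tool (the user IDs and outcomes below are invented), tallying a key indicator per variant looks roughly like this:

```python
from collections import defaultdict

# Hypothetical event log: (user_id, variant, converted) rows, where
# "converted" marks the key indicator, e.g. a completed purchase.
events = [
    ("u1", "A", True), ("u2", "A", False), ("u3", "B", True),
    ("u4", "B", True), ("u5", "A", False), ("u6", "B", False),
]

views = defaultdict(int)
conversions = defaultdict(int)
for _, variant, converted in events:
    views[variant] += 1
    conversions[variant] += converted  # bool counts as 0/1

for variant in sorted(views):
    print(f"{variant}: {conversions[variant]}/{views[variant]} "
          f"= {conversions[variant] / views[variant]:.0%} conversion")
```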

2. The hypothesis is pulled out of thin air

A/B testing is hypothesis-based design, so the hypothesis has to be good. In this context, a hypothesis means an educated guess based on analytics, user behavior, customer feedback or other evidence you have gathered. The hypothesis is then refined into the change that will be validated with the A/B test. In my experience, testing without a hypothesis will not produce good results and will often only raise more questions. A/B testing will not tell you which changes to make next; you have to find that signal through research or interviews. Formulating a hypothesis will also help you crystallize the purpose of the UI.

How do you formulate a good hypothesis?

Find the service’s bottlenecks with an analytics tool. Create a funnel from the service path and observe where users drop out (a small funnel sketch follows this list).

Record user sessions with services like Crazy Egg or Hotjar. Study the recordings and look for sections of the online service that need development.

Make contact with the service users, e.g. by interviewing them or reading user feedback. The feedback can give you a good idea of what is wrong with the service.
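Here is a minimal sketch of the funnel idea, assuming you can export per-step event counts from your analytics tool; the step names and counts below are made up:

```python
# Made-up per-step event counts exported from an analytics tool,
# ordered along the purchase path.
funnel = [
    ("product page", 10_000),
    ("add to basket", 2_400),
    ("checkout", 1_100),
    ("purchase", 600),
]

# Step-to-step conversion shows where users drop out; the biggest
# drop is usually the best place to look for a hypothesis.
for (step, count), (next_step, next_count) in zip(funnel, funnel[1:]):
    rate = next_count / count
    print(f"{step} -> {next_step}: {rate:.0%} continue, {1 - rate:.0%} drop out")
```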

3. There’s no statistical certainty

An A/B test seeks to achieve the most reliable result possible. If the number of visitors is too small, random factors will have a big influence on the results. Websites with a low user population or conversion rate will not give reliable results. Even so, make sure that low visitor numbers or a low conversion rate are not themselves caused by poor service implementation.

Also make sure that the test is visible enough. If it does not get enough views, the result will not be statistically significant.

How do I know that I’m testing the right thing?

You should have more than one hundred conversions during the test cycle (see the significance sketch after this checklist).

The required number of users is highly dependent on the service, but a good minimum target is 50,000 views/variation.

Decide the duration of the test cycle in advance.

If the service has few users, you should use your resources on increasing traffic rather than running tests.
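To make the numbers above concrete, here is a rough two-proportion z-test using only the Python standard library; the function and the traffic figures are illustrative, and in practice your testing tool will usually compute this for you:

```python
from statistics import NormalDist

def ab_significance(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a two-proportion z-test.

    A small p-value (commonly < 0.05) suggests the difference in
    conversion rate between the variants is unlikely to be chance.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up numbers: 120 vs. 150 conversions on 50,000 views per variant.
# Even a 25% relative lift is not yet conclusive at these low rates.
print(f"p-value: {ab_significance(120, 50_000, 150, 50_000):.3f}")
```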

4. The test focuses on colors and buttons

The visual layout is an important part of a great service, but testing minor details can be a waste of time. The impact of different color shades, button corners or word choices can be difficult to detect.

Even though testing colors and buttons can sometimes have a significant impact on development, you will be more likely to see results by engaging with users on an emotional level. More information is available in this study.

Test these instead

Product availability: low stock levels can speed up purchase decisions.

Ratings: tell users what others thought of the product. If the product is popular, the user can be encouraged to make a purchase decision.

Recommendations: suggest products that the user could be interested in.

5. A good hypothesis gets rejected too easily

A/B testing takes patience because the variants usually perform equally well. If the first test does not generate a statistically significant difference between the variants, but the hypothesis is good, you should consider running more tests. When the test is over, think about what went wrong and test the hypothesis again with new variations. A successful test can have a major business impact, but testing takes time. If the hypothesis was not good and the test does not generate results, you should know when to quit.

In an online store, for example, pay attention to ongoing campaigns and dates that influence purchase behavior, such as paydays (middle and end of the month) or tax return dates.

Make an effort to generate the best hypothesis you can.

Create new variations, since the first variant rarely achieves results.

Take growth or changes in customer behavior caused by peak seasons or other factors into account.

6. You’re running many tests at once

Running more than one test on the same website at the same time can distort customer behavior. If a user is exposed to several tests simultaneously, it is hard to tell what the combined effect of the variations will be. This muddies the statistics and makes each test less reliable. From a measurement standpoint, it is more sensible to split the testing into separate parts to discover how each change influences the user; one way to keep tests from overlapping is sketched below.
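One common way to enforce that separation, sketched here under the assumption of a simple hash-based bucketing scheme (the experiment names are hypothetical, and this is not any specific tool’s API), is to place each user in exactly one experiment “layer”:

```python
import hashlib

EXPERIMENTS = ["checkout-test", "pricing-test", "search-test"]  # hypothetical

def _bucket(key: str, n: int) -> int:
    """Stable hash of a key into one of n buckets."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

def assign_experiment(user_id: str) -> str:
    """Place each user in exactly one experiment, so tests never
    overlap and cannot contaminate each other's results."""
    return EXPERIMENTS[_bucket(f"layer:{user_id}", len(EXPERIMENTS))]

def assign_variant(user_id: str, experiment: str) -> str:
    """Within the assigned experiment, split users 50/50 into A and B."""
    return "A" if _bucket(f"{experiment}:{user_id}", 2) == 0 else "B"

user = "user-1234"
experiment = assign_experiment(user)
print(experiment, assign_variant(user, experiment))
```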

Tests that combine changes to several elements at once are multivariate tests, which are used in different scenarios than A/B tests and need a large number of users to succeed.

When deciding which features to test, take a moment to consider whether the feature creates added value and really requires testing.

Prioritize tests, for example by cost efficiency. Tests that can be put into production quickly save developers’ and designers’ time. While a quick test is running, you can use the time saved to design the next one.

You should also remember that using more than two variations will affect the statistical probabilities of the test and requires a sufficient number of users, as the sketch below illustrates.
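As a rough illustration, a standard sample-size formula combined with a Bonferroni-style correction shows how extra variations push up the traffic requirement; the baseline and target rates below are invented, and real tools may use different corrections:

```python
from statistics import NormalDist

def sample_size_per_variant(p_base: float, p_target: float,
                            comparisons: int = 1,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough per-variant sample size to detect a lift from p_base to
    p_target. With several variants, the significance level is split
    across the comparisons (Bonferroni), raising the requirement."""
    z_alpha = NormalDist().inv_cdf(1 - (alpha / comparisons) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return int(variance * (z_alpha + z_beta) ** 2 / (p_target - p_base) ** 2) + 1

# One comparison (A/B) vs. three (A/B/C/D): same lift, more traffic needed.
print(sample_size_per_variant(0.020, 0.024))                 # ~21,000 per variant
print(sample_size_per_variant(0.020, 0.024, comparisons=3))  # ~28,000 per variant
```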

7. Conclusions are drawn too early

Analysis is one of the most important phases of A/B testing, but don’t rush it. Wait until the test cycle is finished, because one variant can take an early lead purely by chance. The conversion rates will even out as the variations accumulate views. A glaring difference between the variations can sometimes mean that there is a fault in the technical implementation of the test. (The short simulation after these tips shows how strongly repeated early peeking inflates false positives.)

Testing takes time; be patient.

Decide the target numbers of conversions and users and the length of the test cycle in advance. This will also make the analysis easier when the time comes.
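A small Monte Carlo sketch of why peeking early misleads: both variants below convert at exactly the same rate, yet stopping at the first “significant” interim check declares a winner far more often than the nominal 5%. The rates and checkpoint spacing are arbitrary:

```python
import random
from statistics import NormalDist

def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test (same as in the earlier sketch)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
RATE = 0.03                                # both variants convert identically
CHECKPOINTS = range(1_000, 20_001, 1_000)  # "peek" every 1,000 users
RUNS = 500

false_positives = 0
for _ in range(RUNS):
    a = b = seen = 0
    for n in CHECKPOINTS:
        while seen < n:
            a += random.random() < RATE
            b += random.random() < RATE
            seen += 1
        if p_value(a, n, b, n) < 0.05:   # stop at the first "significant" peek
            false_positives += 1
            break

# Despite no real difference, peeking declares a winner far above 5%.
print(f"False positive rate with peeking: {false_positives / RUNS:.0%}")
```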

8. The benefits go unnoticed

A well-designed, well-implemented and correctly analyzed A/B test is a fantastic tool for validating design and measuring the added value it creates. However, it is essential to perform all phases carefully, or the results will not be reliable and all the hard work will have been wasted. Don’t run tests for the sake of testing: you have to be able to interpret and use the results.

With an A/B test you will succeed or fail fast. When the test is done, you can always fall back on the old version or make a complete pivot. Each test tells you something about what’s working in the UI and what isn’t.

Perform the test carefully from start to finish and validate your design.

Celebrate every test and learn from it.

Fail fast.
