Some comments on AB testing implementation

Many job postings in the field of technology (mainly for Data Scientist jobs, but not only) require knowledge and/or experience in “AB testing”.

What is AB testing? A brief inspection at Wikipedia reveals that this is a method for assessing the impact of a certain change when it is carried out. For example, one may sick to know what will happen following a modification of a web page, e.g. whether adding a picture will increase the number of clicks, etc. A and B are the situations before the change, and the subsequent situation, respectively. According to Wikipedia, Google started to implement AB testing in 2000, and this approach began to spread in the technology world in about 2011. Wikipedia also rightly notes that AB testing is actually an implementation of the experimental design that William Sealy Gosset developed in 1908.

Although this is a significant methodological advancement in the digital technology world, I think that this is a naive approach, especially in light of the many advances that occurred in this field since 1908. (of course in your company you do it better). The main problem of this methodology is that it is usually implemented using one factor at a time, thus ignoring possible interactions between a number of variables. Ronald Fisher had already presented this problem in the 1920s and proposed preliminary solutions (such as two-way ANOVA). Of course, there are more advanced solutions that were proposed by his successors.

Other problems can arise in planning the experiment itself: How is the sample size determined? How to choose an unbiased (i.e. representative) sample? How to analyze the results, if at all? And what is the appropriate statistical analysis method?

More things to consider: is there any awareness of the possible errors and the probabilities in which they occur? And if there is such awareness, what is done in order to control the magnitude of these probabilities?

And finally: is there a distinction between a statistically significant effect and a meaningful effect?

I recently visited at a large and successful technology company, where I was presented with several tables of “data analysis”. I recognized many of the failures I have just mentioned: no sample size rationale, interactions were not considered, faulty analysis was applied, no one cares about the error probabilities, and every effect is considered as meaningful.

In another company, two product managers performed two “independent” tests (that is, AB testing and CD testing) using the same population as a sample. The population consisted of new customers, therefore it was not a representative sample of the company’s customers. Worse, the AD and CD interaction was not considered, as the parallel existence of the two experiments was discovered only after the fact, and their results were already implemented. In order to avoid such a mess in the future, I suggested that someone will coordinate all the experiments. The response was that it is not possible because of the organization’s competitive culture.

You can say, “What do you want, the fact is that they do well?” But the truth is that they have succeeded despite the problems with their methodology, especially when the core of their algorithm is based on probability and statistics.

Oren Tzur put it nicely on Twitter:

“I think the argument is that it’s cheap and immediate and you see results even if there is no “good model”, and that mistakes cannot be fixed or even indicated. The approach is “Why should I invest in it? Sometimes it works.”

Rafael Cohen also wrote to me on Twitter:

“When I come to a certain field, I assume that the expert knows something and that my analysis should help him … I took a designer to the site, I will not do AB test on every pixel … even if I have thousands of users a day, I still want not to waste them on bad configuration … the naïve statistical analysis would require 80,000 observations for each experiment… it is likely that someone who uses less observations because of a gut feeling gets reasonable results and creates enough revenue to his company …

This is mediocrity. Why think and plan, asks Cohen, if you can use a naïve approach and get something that sometimes works? Who cares that you could do better?

A few years ago, I gave a talk on the future of statistics in the industry at the ISA annual meeting. I will repeat the main points here.

Sam Wilks, former president of the American Statistical Association, paraphrased H.G. Wells, a pronounced science fiction writer: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”

As far as the pharmaceutical industry is concerned, the future predicted by Wells and Wilks is already here. Statistics are central to all the research, development, and manufacturing processes of the pharmaceutical industry. No one dares to embark on a clinical trial without close statistical support, and in recent years there has been a growing demand for statistical support in even earlier stages of drug development, as well as in production processes.

I hope that awareness of the added value that statistics bring with it will seep into the technological industry as the use of statistics increases, so is the need for statistical thinking on the part of the participants in the process, and making so with someone who “knows a much more statistics than the average programmer”, as Oren Tzur phrased it.

You are welcome to comment

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.