The lady tasting iced tea

As part of the parents’ involvement in my youngest son’s school, last Friday was the “parents teaching” day, where parents presented various topics that might interest the students. I chose to try to reproduce Fisher’s lady tasting tea experiment, but with a twist.

I started the class with a general discussion on designing experiments, and presented the story of the lady and the tea. Then I asked them how they would test whether the lady can actually tell if the tea or the milk was added first to a cup. After a short discussion, the 11-year-old students reached the design that Fisher used. Of course, I did not expect them to get into the details of the statistical inference.

Once we had a design, I pulled two bottles of iced tea out of my bag. In Israel there are two leading brands of iced tea; let’s call them A and B. A few more minutes were needed to arrive at the design of an experiment for testing whether the kids can distinguish between the tastes of the two brands.

We used the following design:

  1. A flip of a coin determined if we will pour the same brand of iced tea into two cups, or pour one brand in one cup and the other brand into the other cup.
  2. In case the same brand should be poured into the two cups, another coin flip determined if it should be brand A or brand B.

Then one of the students, who was of course blinded to the process of filling the cups, tasted the tea in both cups and announced whether she could distinguish between the tastes of the tea in the two cups, and her answer was recorded.
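
The two coin flips described above can be sketched in a few lines of Python. This is only an illustration: the function name `draw_cups` and the seeded pseudo-random generator that stands in for a real coin are assumptions of this sketch, not part of the classroom demonstration.

```python
import random

def draw_cups(rng):
    """Return 'AB', 'AA' or 'BB' following the two coin flips of the design."""
    # First flip: one cup of each brand, or the same brand in both cups?
    if rng.random() < 0.5:
        return "AB"
    # Second flip (only when both cups get the same brand): brand A or brand B?
    return "AA" if rng.random() < 0.5 else "BB"

rng = random.Random(2024)  # fixed seed so that the sketch is reproducible
trials = [draw_cups(rng) for _ in range(17)]
print(trials)
```

Under this scheme a trial is expected to be AB about half of the time, and AA or BB about a quarter of the time each.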

The final results are [*]:

                    Was the taster right?
Cups combination    Yes    No    Total
AB                    5     5       10
AA or BB              4     3        7
Total                 9     8       17

I think we can conclude that there is no evidence for rejecting the hypothesis that the students cannot distinguish between the tastes of the two brands (you are welcome to do your own statistical analysis).
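
For readers who want to try, here is one possible analysis of the table above, sketched with the Python standard library only. It rests on an assumption that I did not verify: that a taster who cannot tell the brands apart is right with probability 0.5. The helper names `binom_upper_tail` and `fisher_two_sided` are made up for this sketch.

```python
from math import comb

# Counts from the table above.
right_ab, wrong_ab = 5, 5      # the cups held different brands
right_same, wrong_same = 4, 3  # the cups held the same brand

n = right_ab + wrong_ab + right_same + wrong_same  # 17 trials
k = right_ab + right_same                          # 9 correct answers

def binom_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

p_guess = binom_upper_tail(k, n)
p_fisher = fisher_two_sided(right_ab, wrong_ab, right_same, wrong_same)
print(f"correct answers: {k}/{n}")
print(f"one-sided binomial p-value (assumed guessing rate 0.5): {p_guess:.3f}")
print(f"two-sided Fisher exact p-value (AB vs. same brand): {p_fisher:.3f}")
```

Both p-values come out far from any conventional significance threshold (about 0.5 and about 1), which agrees with the conclusion above. The binomial test asks whether the students did better than guessing; the Fisher test asks whether the success rate differed between the AB trials and the same-brand trials.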

On a personal note: from my point of view it was a great success, since my son, who had refused to taste brand B, was persuaded to taste it and admitted that he likes its taste.

[*] I know I should have recorded the outcomes of the re-randomization, so that the table would have three rows and not only two. You will have to forgive me. My only excuse is that it was a fun demonstration for fifth graders.

Some comments on AB testing implementation

Many job postings in the field of technology (mainly for Data Scientist jobs, but not only) require knowledge and/or experience in “AB testing”.

What is AB testing? A brief look at Wikipedia reveals that this is a method for assessing the impact of a certain change when it is carried out. For example, one may seek to know what will happen following a modification of a web page, e.g. whether adding a picture will increase the number of clicks. A and B are the situation before the change and the subsequent situation, respectively. According to Wikipedia, Google started to implement AB testing in 2000, and this approach began to spread in the technology world in about 2011. Wikipedia also rightly notes that AB testing is actually an implementation of the experimental design that William Sealy Gosset developed in 1908.

Although this is a significant methodological advancement in the digital technology world, I think that this is a naive approach, especially in light of the many advances that have occurred in this field since 1908 (of course, in your company you do it better). The main problem with this methodology is that it is usually implemented one factor at a time, thus ignoring possible interactions between variables. Ronald Fisher had already presented this problem in the 1920s and proposed preliminary solutions (such as two-way ANOVA). Of course, there are more advanced solutions that were proposed by his successors.
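
To see why testing one factor at a time can mislead, consider a 2×2 factorial layout with two changes to a page. The sketch below uses click rates that are invented purely for illustration, and the two factors (a picture and a new button) are hypothetical.

```python
# Hypothetical click rates for the four combinations of two page changes.
# Keys: (picture added?, new button?). All numbers are invented for illustration.
rate = {
    (False, False): 0.10,
    (True, False): 0.12,
    (False, True): 0.12,
    (True, True): 0.11,
}

baseline = rate[(False, False)]

# One-factor-at-a-time view: each change is compared with the baseline only.
picture_alone = rate[(True, False)] - baseline
button_alone = rate[(False, True)] - baseline

# Factorial view: the effect of applying both changes together.
both = rate[(True, True)] - baseline

# Interaction: how far the joint effect departs from the sum of the separate effects.
interaction = both - (picture_alone + button_alone)

print(f"picture alone: {picture_alone:+.3f}")
print(f"button alone:  {button_alone:+.3f}")
print(f"both changes:  {both:+.3f}")
print(f"interaction:   {interaction:+.3f}")
```

In this invented example each change looks beneficial on its own, yet applying both is worse than applying either one. Two separate AB tests would not reveal that; only a design that includes the combined cell can.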

Other problems can arise in planning the experiment itself: How is the sample size determined? How to choose an unbiased (i.e. representative) sample? How to analyze the results, if at all? And what is the appropriate statistical analysis method?

More things to consider: is there any awareness of the possible errors and the probabilities in which they occur? And if there is such awareness, what is done in order to control the magnitude of these probabilities?
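
As an illustration of how the sample size and the two error probabilities are tied together, here is a minimal sketch of the textbook normal-approximation formula for comparing two proportions. The function name `sample_size_per_group`, the default values of `alpha` and `power`, and the click rates in the example are assumptions of this sketch; a real experiment may well need a different formula.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided z-test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls the type I error
    z_beta = NormalDist().inv_cdf(power)           # controls the type II error
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)

# Hypothetical click rates: 10% before the change, 11% after it.
print(sample_size_per_group(0.10, 0.11))

# Halving the difference to be detected roughly quadruples the sample size.
print(sample_size_per_group(0.10, 0.105))

# Demanding smaller error probabilities also costs observations.
print(sample_size_per_group(0.10, 0.11, alpha=0.01, power=0.90))
```

The point of the sketch is that the sample size is not a free choice: once the smallest meaningful difference and the tolerated error probabilities are stated, the number of observations follows from them.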

And finally: is there a distinction between a statistically significant effect and a meaningful effect?

I recently visited a large and successful technology company, where I was presented with several tables of “data analysis”. I recognized many of the failures I have just mentioned: there was no sample size rationale, interactions were not considered, faulty analysis was applied, no one cared about the error probabilities, and every effect was considered meaningful.

In another company, two product managers performed two “independent” tests (that is, AB testing and CD testing) using the same population as a sample. The population consisted of new customers, and therefore it was not a representative sample of the company’s customers. Worse, the interaction between the AB and CD tests was not considered, as the parallel existence of the two experiments was discovered only after the fact, when their results had already been implemented. In order to avoid such a mess in the future, I suggested that someone coordinate all the experiments. The response was that this is not possible because of the organization’s competitive culture.

You can say, “What do you want? The fact is that they are doing well.” But the truth is that they have succeeded despite the problems with their methodology, which is especially troubling when the core of their algorithm is based on probability and statistics.

Oren Tzur put it nicely on Twitter:

“I think the argument is that it’s cheap and immediate and you see results even if there is no “good model”, and that mistakes cannot be fixed or even indicated. The approach is “Why should I invest in it? Sometimes it works.”

Rafael Cohen also wrote to me on Twitter:

“When I come to a certain field, I assume that the expert knows something and that my analysis should help him … I took a designer to the site, I will not do AB test on every pixel … even if I have thousands of users a day, I still want not to waste them on bad configuration … the naïve statistical analysis would require 80,000 observations for each experiment… it is likely that someone who uses less observations because of a gut feeling gets reasonable results and creates enough revenue to his company …”

This is mediocrity. Why think and plan, asks Cohen, if you can use a naïve approach and get something that sometimes works? Who cares that you could do better?

A few years ago, I gave a talk on the future of statistics in the industry at the ISA annual meeting. I will repeat the main points here.

Sam Wilks, former president of the American Statistical Association, paraphrased H.G. Wells, the renowned science fiction writer: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”

As far as the pharmaceutical industry is concerned, the future predicted by Wells and Wilks is already here. Statistics are central to all the research, development, and manufacturing processes of the pharmaceutical industry. No one dares to embark on a clinical trial without close statistical support, and in recent years there has been a growing demand for statistical support in even earlier stages of drug development, as well as in production processes.

I hope that awareness of the added value that statistics brings will seep into the technology industry. As the use of statistics increases, so does the need for statistical thinking on the part of the participants in the process, and for doing it together with someone who “knows much more statistics than the average programmer”, as Oren Tzur phrased it.

How to make children eat more vegetables

Let’s start from the end: I do not know how to make children eat more vegetables, or even eat some vegetables. At least with my children, success is minimal. But two researchers from the University of Colorado had an idea: serve the children their vegetables on plates with pictures of vegetables. To test whether the idea works, they conducted an experiment whose results were published in the prestigious journal JAMA Pediatrics. Because the results have been published, you can guess that the result of the experiment was positive. But did they really prove that their idea works?

Design of the experiment and its results

18 kindergarten and school classes (children aged 3–8) were selected in one of the suburbs of Denver. At first, the children were given white plates and offered fruits and vegetables. In each class a bowl of fruits and a bowl of vegetables were placed, and each child took fruit and vegetables for himself or herself and ate as much as he or she pleased. The vegetables and fruits were weighed before they were served to the children, and when the children had finished their meal, the researchers weighed the remaining fruits and vegetables. The difference between the weights (before and after the meal) was divided by the number of children, and thus the average amount of fruit and vegetables each child ate was obtained. Averages were also calculated separately for fruits and for vegetables. The researchers repeated these measurements three times per class.

After a while, the measurements were repeated in the same way, but this time the children were given plates with pictures of vegetables and fruits. The result was an average increase of 13.82 grams in vegetable consumption among children aged 3 to 5. This result is statistically significant. In percentages it sounds much better: this is an increase of almost 47%.

So, what’s the problem? There are several problems.

First problem — extra precision

I will start with what is seemingly not a problem, but a warning sign: over-precision. When super-precise results are published, you have to start worrying. I would like to emphasize: I mean precision, not accuracy. Accuracy refers to the distance between the measured value and the real, unobserved value, and is usually measured by a standard deviation or a confidence interval. The issue here is precision: the results are reported to two decimal places; they are very precise. I’m not saying it’s not important, but in my experience, when someone exaggerates, you have to look more thoroughly at what’s going on. A precision of two digits after the decimal point when it comes to grams seems excessive to me. You can of course think differently, but that’s the warning signal that made me read the article to the end and think about what was described in it.

Second problem — on whom was the experiment conducted?

The second problem is much more fundamental: the choice of the experimental unit, or unit of observation. The experimental units here are the classrooms. The observations were made at the class level. The researchers measured how many vegetables and fruits were eaten by all the children in the class. They did not measure how many vegetables and fruits each child ate. Although they calculated an average per child, I suppose everyone knows that the average alone is a problematic measure: it ignores the variation between the children. Before the experimental intervention, each child ate an average of about 30 grams of vegetables at a meal, but I do not think anyone will disagree with the statement that each child ate a different amount of vegetables. What is the standard deviation? We do not know, and the researchers do not know, but this is essential, because the differences between the children affect the final conclusion. Because the researchers ignored (regardless of the reason) the variation between the children, they practically assumed that the variance was very low, in fact zero. Had the researchers considered this variation, the conclusions of the experiment would have been different: the confidence intervals would have been wider than the ones the researchers calculated.

Another type of variation that was not considered is the variation within children. Let me explain: even if we watched one child and saw that on average he ate 30 grams of vegetables at every meal, at different meals he eats different amounts of vegetables. The same question arises again: what is the standard deviation? This standard deviation also has an impact on the final conclusion of the experiment. Of course, each child has a different standard deviation, and this variability should also be taken into consideration.
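
The two sources of variation described so far can be made concrete with a toy calculation. The per-child, per-meal amounts below are invented (the study did not record such data), and the names `class_1`, `class_2` and `summarize` are hypothetical. The sketch only shows that two classes with the same average can hide very different spreads between and within children.

```python
from statistics import mean, stdev

# Invented grams of vegetables eaten by each child at three meals.
# Both classes average 30 grams per child per meal.
class_1 = {"child_a": [29, 30, 31], "child_b": [30, 31, 29], "child_c": [31, 29, 30]}
class_2 = {"child_a": [0, 5, 10], "child_b": [20, 30, 40], "child_c": [40, 55, 70]}

def summarize(children):
    """Return (class mean, SD between children, average SD within a child)."""
    child_means = [mean(meals) for meals in children.values()]
    class_mean = mean(child_means)
    between_sd = stdev(child_means)  # variation between children
    within_sd = mean([stdev(meals) for meals in children.values()])  # variation within children
    return class_mean, between_sd, within_sd

for name, data in (("class 1", class_1), ("class 2", class_2)):
    class_mean, between_sd, within_sd = summarize(data)
    print(f"{name}: mean = {class_mean:.1f} g, "
          f"between-child SD = {between_sd:.1f} g, within-child SD = {within_sd:.1f} g")
```

A class-level weight before and after the meal yields only the first of the three numbers; the other two cannot be recovered from it.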

A third type of variation that was not considered is the variation between children of different ages: it is reasonable to assume that an 8-year-old will react differently to a painted plate than a 3-year-old. An 8-year-old will probably eat more vegetables than a 3-year-old.

I think that the researchers did not pay attention to any of these issues. The words variation, adjust and covariate do not appear in the article. Because the researchers ignored these sources of variation, the confidence intervals they calculated are too narrow to reflect the real differences between the children and the types of successes.

Finally, although the experimental unit was the class, the results were reported as if the measurements had been made at the child level. In my opinion, this also shows that the researchers were not aware of the variation between and within the children. For them, class and child are one and the same.

Third problem — what about the control?

There is no control group in this experiment. At first sight, there is no problem: according to the design of the experiment, each class constitutes its own control group. After all, the children received the vegetables on white plates as well as on plates with paintings of vegetables and fruits. But I think that’s not enough.

There are lots of types of plates for children, with drawings of Bob the Builder, Disney characters, Adventure Bay, Thomas the Tank Engine, and the list goes on. Could it be that the change was due to the mere presence of paintings, and not because they are paintings of vegetables and fruits? Maybe a child whose meal is served on a plate with pictures of his favorite superhero will eat even more vegetables? The experimental design does not answer this question. A control group is needed. In my opinion, two control groups are needed in this experiment. In one of them the children initially get white plates, and then plates with Thomas the Tank Engine, Disney characters or superheroes, depending on their age and preferences. In the second control group the children initially receive “ordinary” plates (i.e. Thomas, Disney, etc.) and then plates with paintings of vegetables and fruits.

Fourth problem — subgroup analysis

Although the children in the study were aged 3–8, the researchers discuss the results only for children aged 3–5. What happened to the children aged 6–8? Was the analysis for the two (or more) age groups pre-defined? The researchers do not provide this information.

Fifth Problem — What does all this mean?

First, it was found that there was a statistically significant change in the consumption of vegetables, but no significant change was observed in the consumption of fruit. The researchers referred to this in a short sentence: a possible explanation, they said, is the ceiling effect. Formally they are right. The ceiling effect is a statistical phenomenon, and that is what happened here. The really important question, which they did not answer, is: why did this effect occur?

And the most important question: is the significant change also meaningful? What does a difference of 14 grams (sorry, 13.82 grams) mean? The researchers did not address this question. I’ll give you some food for thought. I went to my local supermarket and weighed one cucumber and one tomato (yes, it’s a small sample, I know). The weight of the cucumber was 126 grams, and the weight of the tomato was 124 grams. In other words, each child ate on average an extra half a bite of a tomato or a cucumber. Is this amount of vegetables meaningful in terms of health and/or nutrition? The researchers did not address this question, nor did the editors of the journal.
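
The arithmetic behind this comparison is short enough to check. The sketch below uses only figures quoted in this post (13.82 grams, “almost 47%”, and my two supermarket weights); the variable names are made up for the sketch, and the implied baseline is a back-calculation, not a number taken from the article.

```python
increase_g = 13.82               # reported average increase in vegetables, grams per child
relative_increase = 0.47         # "almost 47%", as reported
cucumber_g, tomato_g = 126, 124  # the two supermarket measurements

# Baseline consumption implied by the two reported figures.
implied_baseline_g = increase_g / relative_increase

# The increase expressed as a share of one whole vegetable.
share_of_cucumber = increase_g / cucumber_g
share_of_tomato = increase_g / tomato_g

print(f"implied baseline: {implied_baseline_g:.1f} g per child per meal")
print(f"share of one cucumber: {share_of_cucumber:.1%}")
print(f"share of one tomato:   {share_of_tomato:.1%}")
```

The implied baseline comes out at roughly 29 grams, in line with the “about 30 grams” mentioned earlier, and the increase amounts to roughly 11% of a single cucumber or tomato.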


It is possible that plates with vegetables and fruit paintings cause children to eat more vegetables and fruits. This is indeed an interesting hypothesis. The study that was described here does not answer this question. The manner in which it was planned and implemented does not allow even a partial answer to this question, apparently due to the lack of basic statistical thinking.