I found this picture on twitter, thanks to Kareem Carr who took a screenshot before the person who uploaded it deleted his tweet. I googled and found it in a website named Data Science Central (No link. Sorry). Can you tell what is wrong in this picture?
Well, almost everything is wrong in this picture.
Let’s start with the flowchart on the top.
The null hypothesis is that the average number of pizza slices is 4 and the alternative hypothesis is much higher than 4. What does “much higher” mean? Normally the hypotheses would be Ho:µ<=4 and H1:µ>4, or maybe Ho:µ=4, H1:µ>4, if you are looking for one sided hypothesis. I understand “much higher” as µ>>4, or µ>4+Δ. Now the hypotheses are not exhaustive: there is a gap of Δ. If you studied a course on statistical inference and are familiar with the Neyman-Pearson Lemma, it should not be a problem. You can construct a likelihood ratio test. But then you do not need this picture. The way the hypotheses are stated, standard tests as z-test or t-test are not valid.
Collect your data: there is not much information on how to collect the data except for mentioning to make sure that the observations are random. I suppose that the author assumes that the readers are familiar with standard sampling methods. Or maybe there is a “Data Collection Explained in One Picture” somewhere. I am too afraid to google it.
The author states that in this example the average is 5.6. What about the standard deviation and the sample size?
Next in the flowchart: test the result. The author requires setting the significance level, which is good. However, a good statistician will set the significance level before collecting the data. He will also set the desired power and collect the sample size.
Then you should run the test, e.g. chi-square, t-test or z-test. Again, this should be decided before collecting the data.
Next you need to decide if the result is significant. The author suggests four levels of significance according the value of the p-value. This is totally wrong. The result is either significant or not significant. Statistical significance is not a quantitative property. The author suggests using the p-value not just for determining significance but also as a measure for the strength of evidence.
Oh, and one more thing: if p<=.01 it is also <=.05 and <=0.10.
Now let’s look at the nice graph with the bell shape curve, which is almost certainly the Normal Distribution density curve. But look at the Y-axis label: “probability of X number of slices eaten”.
No, it is not probability, it is density, under the assumption that the null hypothesis is true (and this assumption is not stated anywhere).
And it is not the density of the number of slices eaten per visit as the X-axis label says, but the density of the sample mean. Again, one should state that this distribution is derived assuming the null hypothesis is true.
And even if it was a probability, the maximal value of the curve would be much lower than 100%. In fact, the probability of a customer eating less than one slice is zero (otherwise, why would he pay for the pizza in the first place?), and as far as the restaurant concerns, if the customer ate 1.5 slices, it counts as two slices – there is no recycling. In any case, the sample mean cannot be lower than 1. This brings me back to the statistical test. The variable under consideration is a count variable, therefore there are better alternatives to test the hypotheses, even though a normal approximation should work if the sample size is sufficiently large.
Next is the p-value calculation. The calculated p-value is 0.49. This means that if the mean under the null is 4, and the sample mean is 5.6, then we can back calculate and find that the standard error is 0.967 and the 95% significance threshold is 5.591. That’s fine.
Finally, the explanation of what is actually the p-value is clumsy. The term “Probability value (Observed)” may lead someone to think that this is the probability of the observed sample mean. However, the author states correctly that the p-value is the area (under the normal curve) right to the observed result.
So what we got here? Poor example, negligent descriptions of the statistical process and concepts (to say the least), somewhat reasonable explanation of the p-value calculation, no explanation of what the p-value actually is and what it means, and finally abusing the p-value and its interpretation.
And personal note: why would I care? Someone is wrong on the internet. So what?
I care because I care about statistics. There are similar pictures, blog posts and YouTube videos all over the internet. And they cause bad statistics and bad science. P-values and other statistical concepts are abused on a daily basis, even in leading journals such as JAMA. And bad statistics lead to loss of money if we are lucky and to loss of lives if we are not so lucky. So I care. And you should care too.
Credit: some of the issues discussed in the post were brought up by Kareem and other Twiterers.