Just a couple of weeks ago I wrote here that we should not abandon statistical significance. No, I did not change my mind. We should not abandon statistical significance, but we must abandon p-values.
Executive summary: p-values create the false impression that statistical significance is a continuous concept. It is not; it is dichotomous.
You are welcome to read on.
I will not get into a lengthy discussion of the difference between Fisher’s significance testing and Neyman and Pearson’s hypothesis testing. In short: Fisher had a null hypothesis, a test statistic and a p-value. The decision whether or not to reject the null hypothesis based on the p-value was flexible. Indeed, in one place Fisher suggested using the 5% threshold for deciding significance, but in other places he used different thresholds.
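To make the Fisher-style workflow concrete, here is a minimal sketch in Python. The sample, the null value, and the choice of a two-sided z-test are all hypothetical, just for illustration:

```python
# A minimal sketch of Fisher-style significance testing, assuming a
# two-sided z-test; the data and the null value are made up.
import numpy as np
from scipy import stats

x = np.array([2.1, 1.4, 3.0, 2.6, 1.9, 2.2])  # hypothetical sample
# Null hypothesis: population mean is 0, with a known sd of 1.
z = x.mean() / (1.0 / np.sqrt(len(x)))         # test statistic
p_value = 2 * stats.norm.sf(abs(z))            # two-sided p-value
# In Fisher's framework the p-value measures evidence against the
# null, and the threshold for "significant" is flexible.
print(f"z = {z:.2f}, p = {p_value:.4f}")
```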
Neyman and Pearson’s approach is totally different. In their framework there are two hypotheses: the null hypothesis and the alternative hypothesis. The Neyman-Pearson lemma provides a tool for constructing a decision rule. Given data, you evaluate the likelihood of these data twice: once assuming that the null hypothesis is true, and once assuming that the alternative hypothesis is true. You reject the null hypothesis in favor of the alternative if the ratio of the two likelihoods exceeds a pre-defined threshold. This threshold is determined by the significance level: an acceptable probability of falsely rejecting the null hypothesis (in a frequentist manner) that meets your scientific standards.
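Here is what that decision rule looks like in code, sketched for two simple hypotheses about a normal mean (H0: mu = 0 versus H1: mu = 1, known unit variance). The sample and the threshold are hypothetical; in practice the threshold is calibrated so that the probability of falsely rejecting the null equals the chosen significance level:

```python
# A sketch of the Neyman-Pearson likelihood ratio rule for two
# simple hypotheses; the data and the threshold are hypothetical.
import numpy as np
from scipy import stats

x = np.array([0.8, 1.3, -0.2, 1.1, 0.6])

lik_null = stats.norm.pdf(x, loc=0, scale=1).prod()  # likelihood under H0
lik_alt = stats.norm.pdf(x, loc=1, scale=1).prod()   # likelihood under H1
ratio = lik_alt / lik_null

# k would be chosen so that P(ratio > k | H0) equals the
# pre-determined significance level; 3.0 is just a placeholder.
k = 3.0
print("reject H0" if ratio > k else "do not reject H0")
```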
When I was a young undergraduate student, we did not have easy access to computers at my university (yes, I am that old). So we calculated the test statistic (Z, t, F, chi-square) and looked in a table similar to this one to see whether our test statistic exceeded the threshold.
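Today that table lookup is one line of code. A sketch of what the printed tables gave us, using scipy’s quantile functions (the significance level and degrees of freedom are arbitrary examples):

```python
# Critical values that the printed tables used to provide.
from scipy import stats

alpha = 0.05
print(stats.norm.ppf(1 - alpha))              # one-sided z threshold
print(stats.t.ppf(1 - alpha, df=20))          # t threshold, 20 df
print(stats.chi2.ppf(1 - alpha, df=4))        # chi-square threshold, 4 df
print(stats.f.ppf(1 - alpha, dfn=3, dfd=30))  # F threshold, (3, 30) df
```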
Fisher was a genius, and so was Karl Pearson. They developed statistical tests such as the F and chi-square tests based on geometrical considerations. But when you try to construct a test for equality of means, as in ANOVA, or a test for independence of two categorical variables by using the NP lemma, you get the same tests. This opened a computational shortcut: the NP decision rule is equivalent to a p-value decision rule. You can decide to reject the null by comparing your test statistic to the value you look up in a table, or by comparing the p-value to your pre-determined significance level. You will get the same result either way, as shown in the sketch below. But still, you either reject the null hypothesis or you don’t. The size of the test statistic or the p-value does not matter. There are no such things as “nearly significant”, “almost significant”, “borderline significant” or “very significant”. If your pre-determined significance level is 0.05, and your p-value is 0.051, that’s really depressing. But if you declare that your result is “almost significant”, it means that you changed your decision rule after the fact.
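To see the equivalence of the two decision rules, consider this sketch for a one-sided z-test (the test statistic is hypothetical). Whichever rule you apply, you get the same dichotomous answer:

```python
# Two equivalent decision rules for a one-sided z-test at a
# pre-determined significance level; the statistic is hypothetical.
from scipy import stats

alpha = 0.05
z = 1.70  # hypothetical test statistic

# Rule 1: compare the statistic to the critical value from the table.
reject_by_statistic = z > stats.norm.ppf(1 - alpha)

# Rule 2: compare the p-value to the significance level.
reject_by_p_value = stats.norm.sf(z) < alpha

# Either way, the same dichotomous decision.
assert reject_by_statistic == reject_by_p_value
print(reject_by_statistic)
```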
Let me illustrate this with a story my teacher and mentor, Professor Zvi Gilula, once told me.
Someone robbed a convenience store in the middle of nowhere. The police caught him and brought him to court. There were three eyewitnesses and a video from the security camera. The jury found him guilty, and he was sentenced to serve six months in jail. In statistical jargon, the null hypothesis states that the defendant is not guilty, and the jury rejected the null hypothesis.
Let’s look at another robber. This man robbed a convenience store in New York City, next to Yankee Stadium. The robbery occurred just as the game ended: the Yankees had won, 47,000 happy fans witnessed the robbery, and they all came to testify in court, because the poor robber was a Red Sox fan. The jury found him guilty.
Is the second robber guiltier than the first robber? Is the first robber "almost not guilty" or "borderline guilty"? Is the second robber "very guilty"? What would be the right punishment for the second robber? Ten years? Life imprisonment? A death sentence? Do you now think that if you had been on the first robber's jury, you would have acquitted him?
Here is what we need to do if we really want to improve science: when publishing results, one should provide all the necessary information, such as data, summary statistics, test statistics and more, and then state: “we reject the null hypothesis at the 5% significance level”, or “we do not reject the null hypothesis at the 5% significance level” (or any other reasonable and justified significance level). No p-values. And if someone wants to know whether the null hypothesis would be rejected at the 5.01% significance level, they can go and calculate their own p-value. P-values should be abandoned.
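And for those who insist: given a reported test statistic, recovering a p-value is trivial. A sketch for a hypothetical reported chi-square statistic:

```python
# Recovering a p-value from a reported test statistic; the reported
# values here are hypothetical.
from scipy import stats

chi2_stat, df = 9.21, 3                 # as reported, hypothetically
p_value = stats.chi2.sf(chi2_stat, df)  # upper-tail probability
print(f"p = {p_value:.4f}")
```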