The call to ban statistical significance is not new. Several researchers, the most notable of whom is Andrew Gelman of Columbia University, have been promoting this idea for the last few years. What is new is the recent article in the prestigious journal Nature that brought the issue up again. Actually, it is a petition signed by about 800 researchers, some of whom are statisticians (and yes, Gelman signed this petition too). Gelman and Blake McShane of Northwestern University, who is one of the lead authors of the current petition, also expressed this idea in another provocative Nature article, titled “Five ways to fix statistics”, published in November 2017.
I think that it is not a good idea. It is not throwing the baby out with the bathwater. It is attempting to take the baby and throw it out of the sixth-floor window.
Let me make it clear: I do not oppose taking into account prior knowledge, plausibility of mechanism, study design and data quality, real-world costs and benefits, and other factors (I am quoting here what Gelman and McShane wrote in the 2017 article). I think that when editors consider a paper that was submitted for publication, they must take all of this into account, provided that the results are statistically significant.
“Abandoning statistical significance” is just a nice way of saying “abandoning the Neyman-Pearson Lemma”. The lemma guarantees that likelihood ratio tests control the rate of false positive results at the accepted significance level, which is conventionally set at 5%. The rate is guaranteed in a frequentist manner. I do not know how many studies are published every year. Maybe tens of thousands, maybe even more. If all the researchers apply the Neyman-Pearson theory with a significance level of 5%, then the proportion of false positive results, among the studies in which there is no real effect, approaches 0.05. No wonder that Gelman, who is a Bayesian statistician, opposes this.
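To make the frequentist guarantee concrete, here is a minimal simulation sketch. It is my own illustration, not something taken from the petition or from the Nature articles. It assumes that every study tests a true null hypothesis (no effect) with a two-sided z-test on normally distributed data with known standard deviation 1; the function name, sample size, number of studies and seed are all arbitrary choices.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(n_studies=20000, n=30, alpha=0.05, seed=1):
    """Share of simulated null studies that reach significance.

    Assumption: each study draws n observations from N(0, 1), so the
    null hypothesis (mean = 0) is true in every single study.
    """
    rng = random.Random(seed)
    # Two-sided critical value of the standard normal distribution
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z statistic with known sigma = 1: mean / (1 / sqrt(n))
        z = fmean(sample) * n ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_studies


rate = false_positive_rate()
print(f"Share of 'significant' results under a true null: {rate:.3f}")
```

Under these assumptions the printed share should land close to 0.05, which is exactly what the significance level promises in the long run.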
Once statistical significance is abandoned, bad things will happen. First, the rate of false positive published results will rise. Second, there is no guarantee that the rate of false negative results (which are not published anyway) will fall. The Neyman-Pearson Lemma provides protection against false negatives as well. Granted, you need an appropriate sample size to control that rate. But if you don’t care about statistical significance anymore, you will not care about controlling the false negative rate either. Researchers will have an incentive to perform many small-sample studies. Such studies are less expensive and have higher variation, so the chance of getting a nice, publishable effect is larger. This phenomenon is known as “The law of small numbers”. What about external validity and reproducibility? In a world of “publish or perish”, nobody seems to care.
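The small-sample incentive can also be sketched with a short simulation. Again, this is only an illustration under assumptions of my own: a modest true effect of 0.2 standard deviations, normally distributed data, and an arbitrary cutoff of 0.5 for what counts as an impressive observed effect. None of these numbers comes from a real study.

```python
import random
from statistics import fmean


def share_of_big_effects(n, true_effect=0.2, cutoff=0.5,
                         n_studies=4000, seed=2):
    """Share of simulated studies whose observed mean exceeds the cutoff.

    Assumption: observations are N(true_effect, 1), so any observed mean
    above the cutoff overstates the true effect.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        observed = fmean([rng.gauss(true_effect, 1.0) for _ in range(n)])
        if observed > cutoff:
            hits += 1
    return hits / n_studies


small = share_of_big_effects(n=10)    # cheap study, high variation
large = share_of_big_effects(n=200)   # expensive study, low variation
print(f"n = 10:  {small:.3f} of studies show an effect above 0.5")
print(f"n = 200: {large:.3f} of studies show an effect above 0.5")
```

With these settings, a noticeable share of the small studies report an effect more than twice the true one, while the large studies almost never do.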
And this brings us to the question of how one would determine whether an effect is publishable. I can think of two possible mechanisms. One possibility is to calculate a p-value and apply a “soft” threshold. P is 0.06? That’s fine. What about 0.1? It depends. 0.7? No way (I hope). A second possibility is the creation of arbitrary rules of thumb that have no theoretical basis. In this paper, for example, the researchers expressed the effects in terms of Hedges’ g, and considered an effect meaningful if g>1.3. Why 1.3? They did not provide any justification.
I worked for many years in the pharmaceutical industry, and now I work in a research institute affiliated with a large healthcare organization. False positive results, as well as false negative results, can lead to serious consequences. Forget about the money involved. It can be a matter of life and death. Therefore, I care about the error rates, and I strongly support hypothesis testing and insisting on statistical significance. And you should care too.