In the first article, I outlined the origin and the theoretical foundation of the p-value in scientific analysis. In this second and last part, we take a closer look at the consequences of the use and misuse of p-values in modern science.
It is well known that non-significant test results are under-reported, either because scientists keep them quiet or because scientific journals prefer significant results and are hesitant to publish negative outcomes. This is known as publication bias.
Using the p-value for significance testing also causes scientists to overestimate the size of the observed effects, because only the effects that happen to look particularly large (partly due to random fluctuations) pass the significance limit.
As an analogy, picture a sailor who defines rough seas as when the waves come crashing over the deck of the ship, the nautical equivalent of passing the statistical significance limit. A sailor who counts only those occasions will overestimate how bad the conditions generally are, because the times when the sea was choppier than usual, but not rough enough to crash over the deck, are left out.
This inflation effect is called the winner’s curse – a term from the auction house, where the highest bidder risks paying too much for an item compared to its actual market value.
The problem becomes worse when the statistical power of the experiment is low, that is, when the experiment has only a small chance of detecting a real effect. When other scientists repeat such experiments, they often find a disappointingly small effect or even none at all.
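The inflation described above can be illustrated with a small simulation. The sketch below is not taken from any particular study: the true effect (0.2 standard deviations), the group size (20), and the use of a simple z-test with a known standard deviation of 1 are all assumptions chosen to produce a low-powered experiment.

```python
import random
from statistics import NormalDist, mean


def simulate_winners_curse(true_effect=0.2, n_per_group=20,
                           n_experiments=5000, alpha=0.05, seed=1):
    """Run many simulated two-group experiments and compare the average
    estimated effect over all of them with the average among only the
    'significant' ones. All default values are illustrative assumptions."""
    rng = random.Random(seed)
    # Two-sided critical value for a z-test (standard deviation assumed known, = 1).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two group means when sd = 1.
    std_error = (2 / n_per_group) ** 0.5

    all_estimates = []
    significant_estimates = []
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        estimate = mean(treated) - mean(control)
        all_estimates.append(estimate)
        # Keep the estimate only if it passes the significance limit.
        if abs(estimate) / std_error > z_crit:
            significant_estimates.append(estimate)

    significant_sizes = [abs(e) for e in significant_estimates]
    return {
        # Fraction of experiments that reached significance (empirical power).
        "power": len(significant_estimates) / n_experiments,
        # Average estimate over every experiment, published or not.
        "mean_all": mean(all_estimates),
        # Average effect size among significant experiments only.
        "mean_significant_abs": mean(significant_sizes) if significant_sizes else None,
    }


if __name__ == "__main__":
    print(simulate_winners_curse())
```

Under these assumptions, the average over all experiments lands close to the true effect, while the average among the significant experiments is several times larger, because an estimate has to exceed the significance limit just to be counted.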
This inability to replicate previously published test results has been called a reproducibility crisis, and has attracted much attention particularly in behavioural psychology and the medical sciences.
Significance test showdown
There are many reasons for this crisis, but a central problem is the way we use the p-value: a threshold of five per cent (or any other fixed percentage) is treated as the be-all and end-all for deciding whether there is a significant effect or not.
Several solutions have been presented to combat the problem. One of the more extreme is to ban researchers from reporting any p-values, as done at the journal Basic and Applied Social Psychology.
Another approach is to lower the conventional significance limit from five per cent to, for example, 0.5 per cent (that is, from 0.05 to 0.005). Strict limits of this kind are already the norm in some sciences, such as astrophysics.
This solution makes p-hacking harder, but it is not without problems. In both bioscience and drug trials, such a low p-value limit would require more test subjects to detect the same effect, which is an ethical and a financial challenge.
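To give a rough sense of the cost, the sketch below uses the standard normal-approximation formula for the number of subjects per group needed to compare two means. The effect size (0.5 standard deviations), the 80 per cent power target, and the stricter limit of 0.005 are illustrative assumptions, not figures from any specific trial.

```python
import math
from statistics import NormalDist


def subjects_per_group(alpha, power=0.8, effect_size=0.5):
    """Approximate subjects needed per group in a two-sided, two-sample
    z-test, where effect_size is the difference in means in units of the
    standard deviation. Defaults are illustrative assumptions."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # significance limit
    z_power = standard_normal.inv_cdf(power)          # desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    conventional = subjects_per_group(alpha=0.05)
    strict = subjects_per_group(alpha=0.005)
    print(conventional, strict, strict / conventional)
```

With these inputs, the stricter limit requires noticeably more subjects per group for the same power, which is the ethical and financial cost in question.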
A fundamental objection to significance tests based on p-values is that they do not tell the scientists what they really want to know. Ronald Fisher asked: ‘If a certain hypothesis is true, what is the probability of making an observation at least as extreme as this one?’
But scientifically speaking, the opposite is much more interesting: what is the probability that a given hypothesis is true, based on a given observation? Estimating this so-called ‘a posteriori’ probability is the goal of what is known as Bayesian statistics.
The problem with the Bayesian approach is that in order to calculate the a posteriori probability of the hypothesis, according to the laws of probability theory, you also have to know the probability of the hypothesis before you make your observation.
This ‘a priori’ probability is only precisely known in certain cases, for example screenings for an illness whose frequency in the population is already known.
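The screening case can be worked through with Bayes' theorem. In the sketch below, the hypothesis is "the person has the illness" and the observation is a positive test. The function name and the input values (a prevalence of 1 per cent, a sensitivity of 90 per cent, and a specificity of 95 per cent) are hypothetical and chosen only for illustration.

```python
def posterior_probability(prior, sensitivity, specificity):
    """Probability of having the illness given a positive test result.

    prior       -- a priori probability (frequency of the illness in the population)
    sensitivity -- probability of a positive test if the person is ill
    specificity -- probability of a negative test if the person is healthy
    """
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    # Bayes' theorem: share of all positive tests that are true positives.
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    # Hypothetical screening test for a rare illness.
    print(posterior_probability(prior=0.01, sensitivity=0.90, specificity=0.95))
```

With these hypothetical numbers the a posteriori probability is only about 15 per cent, even though the test looks accurate, because the a priori probability is so low. Without a known prior, the same calculation cannot be carried out.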
There is still hope for the p-value
By now, you may have the impression that quantitative science is fumbling around in the dark. But it is not that bad.
After all, we are learning about the world and developing new and improved medicine every day. And in recent years, scientists have been paying more and more attention to the problems outlined here, in part because p-value significance tests are used more frequently than ever before.
Besides focusing attention on the importance of the size of observed effects, statisticians and scientific journals have also highlighted the need to increase the power of experiments.
Bayesian methods will probably play a bigger role in time. They can be computationally heavy, but that problem will be reduced as computing power continues to increase.
It has also been suggested that original experiments should be replicated more systematically, and that ‘boring’ replication studies should be given a higher scientific status than they currently have.
Finally, reasonable arguments have been made that the p-value should be decoupled from significance tests and instead assessed independently for what it is: a graded measure of evidence against the null hypothesis.
As long as the p-value is not abused, we do not need to abandon it just yet.