In Discover Magazine this month there is a really frightening article about the failure of psychologists to reproduce the results of 100 high-profile psychology experiments published in 2008. Some 350 scientists attempted to replicate those experiments, with the following outcome:
- Researchers not involved in the initial studies contacted the original authors to get feedback on their protocols; in most cases, the original researchers helped with study designs and strategies. Despite this thoroughness, while 97 of the original studies reported significant results, only 35 of the replications reported the same. And even then, the effect size (a measurement of how strong a finding is) was smaller — on average, less than half the original size.
With only about 1 in 3 of the replication attempts showing significant results, and even then with much smaller effect sizes, it’s fair to say that psychology has, as the article states, a ‘reproducibility crisis’.
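The article’s parenthetical definition of effect size, a measurement of how strong a finding is, is worth making concrete. Here’s a minimal sketch of one common standardized measure, Cohen’s d; the replication project may well have used a different metric, and the data below are made up purely for illustration:

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means scaled by the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up example: a "treatment" group whose true mean is half a standard
# deviation above a "control" group's.
rng = np.random.default_rng(0)
treatment = rng.normal(0.5, 1.0, size=50)
control = rng.normal(0.0, 1.0, size=50)
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")  # sample estimate of the true d of 0.5
```

Halving an effect size, as the replications did on average, can turn a finding of real practical interest into one that barely matters.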
Without making fun of our social science brethren, we’ll start from the assumption that they were well intentioned and knowledgeable about establishing statistical significance. So what could have gone wrong?
Although it’s now about two years old, there’s an interesting article by Hilda Bastian (November 2013) that shares this blog’s title and is worth reading if you missed it the first time around. She reminds us:
- Testing for statistical significance estimates the probability of getting roughly that result if the study hypothesis is assumed to be true. It can’t on its own tell you whether this assumption was right, or whether the results would hold true in different circumstances. It provides a limited picture of probability, taking limited information about the data into account and giving only “yes” or “no” as options.
What’s more, the finding of statistical significance itself can be a “fluke,” and that becomes more likely in bigger data and when you run the test on multiple comparisons in the same data.
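To see how easily such flukes arise, here’s a quick simulation sketch (Python with NumPy and SciPy; the number of comparisons and the sample sizes are arbitrary choices for illustration). It runs a t-test on 100 comparisons drawn from pure noise, so there is no real effect anywhere, and counts how many of them nonetheless come out ‘significant’ at p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_comparisons = 100   # how many comparisons we test in the "same data set"
sample_size = 30      # observations per group in each comparison

false_positives = 0
for _ in range(n_comparisons):
    # Both groups are drawn from the same distribution, so the null hypothesis
    # is true by construction for every single comparison.
    group_a = rng.normal(0.0, 1.0, sample_size)
    group_b = rng.normal(0.0, 1.0, sample_size)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_comparisons} comparisons 'significant' by chance alone")
# Each test has about a 5% chance of a fluke, so expect roughly 5 false positives.
```

Run enough comparisons and a handful of ‘significant’ findings are all but guaranteed, even when there is nothing real to find.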
It would be wonderful if there were a single, simple statistical measure everybody could use with any set of data that would reliably separate true from false.
Yet, statistical significance is commonly treated as though it is that magic wand. Take a null hypothesis or look for any association between factors in a data set and abracadabra! Get a “p value” over or under 0.05 and you can be 95% certain it’s either a fluke or it isn’t. You can eliminate the play of chance! You can separate the signal from the noise!
Except that you can’t. That’s not really what testing for statistical significance does. And therein lies the rub. So this is a bit of a cautionary tale: we should remember not to be seduced by our own statistics, and to keep in mind that statistical significance is not a synonym for ‘true’. Take a look at the rest of Hilda’s article for some additional nuance on this issue.
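To put a rough number on that last point, here’s a simulation sketch in the same spirit (the share of true effects, the effect size, and the sample sizes are all invented for illustration). It mimics a field in which only some tested hypotheses describe a real effect, then asks what fraction of the ‘significant’ results are flukes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_studies = 10_000
prop_real = 0.2       # assume only 20% of tested hypotheses reflect a real effect
true_effect = 0.5     # size of the real effect, in standard-deviation units
n_per_group = 30

true_positives = false_positives = 0
for _ in range(n_studies):
    is_real = rng.random() < prop_real
    shift = true_effect if is_real else 0.0
    group_a = rng.normal(shift, 1.0, n_per_group)
    group_b = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        if is_real:
            true_positives += 1
        else:
            false_positives += 1

significant = true_positives + false_positives
print(f"Significant results: {significant}")
print(f"Fraction of those that are flukes: {false_positives / significant:.0%}")
# Under these made-up assumptions, roughly a quarter to a third of the
# 'significant' results are false positives -- far more than the 5% that
# "p < 0.05" intuitively suggests.
```

Under assumptions like these, a p value below 0.05 still leaves a sizeable share of ‘discoveries’ being flukes, which makes a low replication rate a lot less mysterious.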