What does it mean to “trust science”?

The curious case of the reproducibility crisis

By Valentin Weber, PhD student in cognitive science at ENS-PSL
June 23rd, 2021
4 min reading time
Key takeaways
  • The social sciences, but also biomedical research and other scientific disciplines, are currently experiencing a “reproducibility crisis”.
  • As many as a third of social science findings cannot be replicated and are therefore potentially erroneous – reproducibility being an essential hallmark of scientific work.
  • This crisis is driven, in particular, by the pressure to produce novel and significant results in order to be published in prestigious scientific journals.
  • One solution could be “registered reports”, which guarantee publication of a study solely on the basis of its hypotheses and methods, before its results are known.

Reproducibility, meaning other scientists can obtain a comparable result by recreating the same conditions as an original study, is imperative to the scientific method. In short, the results of an experiment should be the same whoever carries out the procedure. And in most cases, this is true. However, across various disciplines in the social sciences but also in biomedical research, certain scientific studies have failed to replicate when carried out later by other scientists – calling into question the results of the original studies¹,²,³.

Known as the “replication crisis”, the problem concerns more than a few single studies published in low-tier journals. Rather, the crisis is systemic, affecting as many as a third of social science studies – including some published in the most prestigious journals, such as Science and Nature⁴. Affected studies include various well-known phenomena, many of which have found their way into public discourse. Amongst them are widespread concepts such as stereotype threat⁵, implicit bias⁶, and social priming⁷. Those are just three of the most famous findings facing serious criticism, so much so that they might not survive further methodological scrutiny. But how did we get into this crisis, and what can we do about it?

P-hacking, HARKing and publication bias

Specifically, two of the most common “bad research” practices responsible for non-replicable results rest on statistical manipulation: p-hacking and HARKing. In the former, researchers tweak their research design slightly until a non-significant result turns significant – essentially turning a negative result into a positive one. For example, after failing to find any effect in their experiment, the researchers might change the way variables are measured, exclude a few outliers that were not excluded before, or collect a few more participants in stages, checking each time whether the results have become significant. All of these practices increase the chance that the researchers will find an effect even when no effect actually exists.
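To see how that last practice inflates error rates, here is a minimal simulation sketch (not from the article; the group sizes, batch size, and participant cap are illustrative assumptions). Both groups are drawn from the same distribution, so any “significant” result is by construction a false positive – yet testing after every new batch and stopping at the first p < 0.05 pushes the false positive rate well above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def optional_stopping(n_start=20, n_max=100, step=10, alpha=0.05):
    """Run one simulated study with optional stopping: test after every
    batch of participants and stop as soon as p < alpha. Both groups come
    from the same distribution, so a 'significant' result is always a
    false positive."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True                      # false positive
        if len(a) >= n_max:
            return False                     # gave up: correct negative
        a.extend(rng.normal(size=step))      # "collect a few more participants"
        b.extend(rng.normal(size=step))

n_sims = 2000
rate = sum(optional_stopping() for _ in range(n_sims)) / n_sims
print(f"False positive rate with optional stopping: {rate:.2%}")
# Typically well above 10%, despite the nominal 5% threshold.
```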

Similarly, in HARKing (hypothesising after results are known), researchers stumble upon an effect between two variables by chance and then claim that this is what they expected all along. A hypothesis, however, is made before an experiment is carried out – not retroactively. In today’s age of big data, it is not hard to see why this is a bad idea. In a large dataset containing hundreds of variables, some of these variables will be correlated with each other just by chance. Claiming that you only expected an effect for these correlated variables, even though you ran correlations for all variables, gives a distorted view of the actual data.
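A short sketch of this big-data point (again illustrative, not from the article): with 50 pure-noise variables, scanning all pairwise correlations still yields dozens of nominally “significant” relationships at p < .05, any of which could be dressed up afterwards as a “predicted” effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100 participants measured on 50 mutually independent variables: pure noise.
data = rng.normal(size=(100, 50))

# Correlate every pair of variables and count nominally significant results.
n_significant, n_pairs = 0, 0
for i in range(data.shape[1]):
    for j in range(i + 1, data.shape[1]):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        n_pairs += 1
        n_significant += p < 0.05

print(f"{n_significant} of {n_pairs} pairs significant at p < .05")
# About 5% of the 1225 pairs (~60) are 'significant' by chance alone.
```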

In the academic world, publications are the gold standard of success, but scientific research is much harder to publish if no significant results are found. As such, there is a publication bias. Hence, if you want to have a thriving career in science, you had better find an effect! Of course, this does not fully explain why significant results are so important. After all, the results of an experiment do not tell us anything about the quality of the methods used. If a study does not find an effect, it might simply be that the effect does not exist. Yet scientific journals nevertheless often refuse to accept non-significant results for publication, because non-significant results do not prove the absence of an effect to the same extent that significant results prove its existence.

In standard social science research, the highest acceptable false positive rate (the significance level, α) is 5%, while the highest acceptable false negative rate (β) is 20% – corresponding to a statistical power of 80%. In other words, many scientific studies are not adequately powered, meaning they do not have enough participants to bring the false negative rate down to an acceptable level. As a consequence, journals may reject studies with non-significant results on the grounds that the study could have found the effect had the sample size been larger.
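As a rough sketch of what “adequately powered” demands in practice (the effect size here is an illustrative assumption, and the statsmodels call is one common way to compute it), a two-group comparison of a small-to-medium effect already requires a sample far larger than many classic studies used:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed for a two-group comparison with a
# small-to-medium effect (Cohen's d = 0.3), alpha = 0.05 (false positive
# rate) and power = 0.80 (i.e. a 20% false negative rate).
n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"Participants needed per group: {n_per_group:.0f}")  # ~175 per group
```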

Pressure for a “scoop”

All of the aforementioned questionable research practices – p-hacking, HARKing (which is arguably a subtype of p-hacking), publication bias, and underpowered studies – are well-known issues by now, but the problems behind the replication crisis run deeper. One of the reasons why many classic studies were found to be non-replicable only decades after they were conducted is that there is little incentive to do replication studies. Academic careers thrive on pursuing novel ideas, and journals are likely to dismiss replication research for its lack of originality. Hence, there is not enough replication research to either red-flag original studies whose results do not replicate or provide more certainty for those results that are successfully replicated.

A related consequence of the lack of replication research is that it is hard to estimate the magnitude of the replication crisis. Aside from the social sciences and biomedical research, which other disciplines are affected, and to what extent? Until replication research becomes common practice, we can only speculate about the answers to these questions.

While it is hard to think of a suitable way to integrate regular replication studies into the current research system, registered reports could provide a solution to all four of the bad research practices mentioned here. Unlike normal journal articles, registered reports are accepted for publication before data is collected. This solves the problem of publication bias, as the results cannot influence the journal’s decision on whether or not to publish the study. P-hacking and HARKing are also unlikely to occur, since the researchers have to specify in advance which hypotheses will be tested and how, and any deviation from the research plan requires extraordinary justification. Finally, registered reports are generally more adequately powered than normal journal articles, as the methods (including the intended sample size) are reviewed before the study is conducted.

Would a more replicable science lead to more public trust in scientific findings? We don’t know, but it seems likely. If the scientific community accepts that certain research findings are indeed dubious and attempts to remedy these shortcomings, perhaps science sceptics will be less reluctant to accept research results that are actually robust. We certainly still have a long way to go before the crisis fades, but fostering methodological skills, adopting registered reports as a publication model, and incentivising replication research are promising first steps in the right direction.

¹ Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
² Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLoS Biology, 13(6), e1002165. doi:10.1371/journal.pbio.1002165
³ Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.
⁴ Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.
⁵ Flore, P. C., Mulder, J., & Wicherts, J. M. (2019). The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report. Comprehensive Results in Social Psychology, 3, 140–174. https://doi.org/10.1080/23743603.2018.1559647
⁶ Schimmack, U. (2020, December 13). Defund Implicit Bias Research. Replicability Index. https://replicationindex.com/category/implicit-bias/
⁷ Chivers, T. (2019). What’s next for psychology’s embattled field of social priming. Nature, 576(7786), 200–202. doi:10.1038/d41586-019-03755-2

Contributors

Valentin Weber

PhD student in cognitive science at ENS-PSL

Valentin Weber holds a degree in psychology and is currently preparing his PhD in cognitive science at ENS-PSL. His research interests lie at the intersection of philosophy, neuroscience, and psychology, and his current work focuses on iconic memory and other issues in the philosophy of cognitive science. Previously, he studied psychological methods and worked on psychometric models.
