What does it mean to “trust science”?

The curious case of the reproducibility crisis

by Valentin Weber, PhD student in cognitive science at ENS-PSL
June 23rd, 2021
Key takeaways
  • The social sciences, but also biomedical research and other scientific disciplines, are currently experiencing a “reproducibility crisis”.
  • About one third of the results of social science studies cannot be replicated and are therefore potentially erroneous – reproducibility being an essential criterion of scientific work.
  • This crisis is due, in particular, to the need to provide innovative and significant results in order to be published in prestigious scientific journals.
  • One solution could thus be “registered reports”, which guarantee the publication of the study solely on the basis of its initial hypotheses, even before its final results are known.

Reproducibility, meaning other scientists can obtain a comparable result by recreating the same conditions as an original study, is essential to the scientific method. In short, the results of an experiment should be the same whoever carries out the procedure. And in most cases, this is true. However, across various disciplines in the social sciences, but also in biomedical research, certain scientific studies have failed to replicate when carried out later by other scientists – calling the results of the original studies into question1, 2, 3.

Known as the “replication crisis”, the problem concerns more than a few single studies published in low-tier journals. Rather, the crisis is systemic, affecting as many as a third of social science studies – involving even the most prestigious journals such as Science or Nature4. Affected studies include various well-known phenomena, many of which have found their way into public discourse. Amongst them are widely spread concepts such as stereotype threat5, implicit bias6, and social priming7. Those are just three of the most famous findings facing serious criticism, so much so that they might not survive further methodological scrutiny. But how did we get into this crisis, and what can we do about it?

P-hacking, HARKing, and publication bias

Specifically, two of the most common “bad research” practices responsible for non-replicable results involve statistical manipulation: p-hacking and HARKing. In the former, researchers tweak their research design slightly until a non-significant result turns significant – essentially turning a negative result into a positive one. For example, after failing to find any effect in their experiment, the researchers might change the way variables are measured, exclude a few outliers that were not excluded before, or collect a few more participants in stages, checking each time whether the results have become significant. All of these practices increase the chance that the researchers will find an effect even if the effect does not actually exist.
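
To make this concrete, here is a minimal simulation sketch of one common form of p-hacking, “optional stopping” (a hypothetical illustration, not taken from any of the studies cited here): data with no real effect are tested, a few more participants are added, and significance is re-checked each time.

```python
# Illustrative sketch (assumed setup, not from the article): simulating
# "optional stopping". Both groups are drawn from the SAME distribution,
# so there is no real effect, yet peeking at the data inflates the false
# positive rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_trial(start_n=20, step=10, max_n=100, alpha=0.05):
    """Run one 'study' with peeking: add participants until p < alpha or max_n."""
    group_a = list(rng.normal(0, 1, start_n))
    group_b = list(rng.normal(0, 1, start_n))
    while True:
        p = stats.ttest_ind(group_a, group_b).pvalue
        if p < alpha or len(group_a) >= max_n:
            return p < alpha
        group_a.extend(rng.normal(0, 1, step))
        group_b.extend(rng.normal(0, 1, step))

n_studies = 2000
false_positives = sum(optional_stopping_trial() for _ in range(n_studies))
print(f"False positive rate with peeking: {false_positives / n_studies:.1%}")
# Typically prints something well above the intended 5% (often 12-17%).
```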

Similarly, in HARKing (hypothesising after the results are known), researchers stumble upon an effect between two variables and then claim that this is what they expected all along. To be clear: a hypothesis is made before an experiment is carried out – not retroactively. In today’s age of big data, it is not hard to see why this is a bad idea. In a large dataset containing hundreds of variables, some of these variables will be correlated with each other just by chance. Claiming that you only expected an effect for these correlated variables, even though you ran correlations for all variables, gives a distorted view of the actual data.
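
As an illustration (again a hypothetical sketch, not data from any actual study), the following snippet generates a dataset of purely random variables and counts how many pairs nevertheless correlate “significantly” – the raw material for a HARKed hypothesis.

```python
# Illustrative sketch (assumed setup, not from the article): with many
# unrelated variables, some pairs will correlate "significantly" by chance,
# which is exactly what makes picking them after the fact so misleading.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants, n_variables = 100, 50
data = rng.normal(size=(n_participants, n_variables))  # pure noise

significant_pairs = 0
total_pairs = 0
for i in range(n_variables):
    for j in range(i + 1, n_variables):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        total_pairs += 1
        significant_pairs += p < 0.05

print(f"{significant_pairs} of {total_pairs} pairs are 'significant' at p < .05")
# With 50 unrelated variables (1225 pairs), roughly 60 pairs come out
# "significant" by chance alone -- plenty of material for a post-hoc story.
```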

In the academic world, publications are the gold standard of success, but scientific research is much harder to publish if no significant results are found. As such, there is a publication bias. Hence, if you want to have a thriving career in science, you had better find an effect! Of course, this doesn’t fully explain why significant results are so important. After all, the results of an experiment do not tell us anything about the quality of the methods used. If a study does not find an effect, it might simply be that the effect does not exist. Yet scientific journals often refuse to accept non-significant results for publication, because non-significant results do not prove the absence of an effect to the same extent that significant results prove its existence.
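
A small simulation can illustrate what this selectivity does to the literature (a hypothetical sketch, not a real meta-analysis): if only significant studies are published, the published effect sizes systematically overestimate the true effect.

```python
# Illustrative sketch (assumed setup, not from the article): a small but real
# effect is studied many times; only significant results get "published", so
# the published literature overstates the true effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect, n_per_group, n_studies = 0.2, 30, 5000  # small, real effect

published, all_estimates = [], []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    estimate = treatment.mean() - control.mean()
    all_estimates.append(estimate)
    if stats.ttest_ind(treatment, control).pvalue < 0.05:
        published.append(estimate)  # only significant results make it out

print(f"True effect: {true_effect}")
print(f"Mean effect across ALL studies:     {np.mean(all_estimates):.2f}")
print(f"Mean effect in 'published' studies: {np.mean(published):.2f}")
# The "published" average comes out around three times the true effect here.
```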

In standard social science research, the highest acceptable false positive rate is 5%, while the highest acceptable false negative rate is 20%. Keeping the false negative rate that low requires a sufficiently large sample, and many scientific studies are not adequately powered – meaning they do not have enough participants to bring the false negative rate down to an acceptable level. As a consequence, journals may reject studies with non-significant results on the grounds that the study could have found the effect if the sample size had been larger.
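
For readers who want to see where these numbers come from, here is a rough sketch of a standard power calculation (using the textbook normal approximation, not any specific study’s analysis): how many participants per group are needed to keep the false positive rate at 5% and the false negative rate at 20%.

```python
# Illustrative sketch (assumed formula, not from the article): sample size per
# group for a two-sample comparison, using the normal-approximation formula
#   n ≈ 2 * ((z_alpha/2 + z_beta) / d)^2
# with alpha = 0.05 (false positives) and power = 0.80 (20% false negatives).
from scipy.stats import norm

def required_n_per_group(effect_size, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

for d in (0.2, 0.5, 0.8):  # Cohen's small / medium / large effect sizes
    print(f"d = {d}: about {required_n_per_group(d):.0f} participants per group")
# d = 0.2 needs roughly 390 participants per group; d = 0.5 about 63;
# d = 0.8 about 25. Many classic studies used far fewer, i.e. were underpowered.
```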

Pressure for a “scoop”

All of the aforementioned questionable research practices – p-hacking, HARKing (which is arguably a subtype of p-hacking), publication bias, and underpowered studies – are well-known issues by now, but the problems of the replication crisis run deeper. One of the reasons why many classic studies were found to be non-replicable only decades after the original studies were conducted is that there is little incentive to do replication studies. Academic careers thrive on pursuing novel ideas, as journals are likely to dismiss replication research due to its lack of originality. Hence, there is not enough replication research to either red-flag original studies whose results do not replicate or provide more certainty for those results that do.

A related consequence of the lack of replication research is that it’s hard to estimate the magnitude of the replication crisis. Aside from social science and biomedical research, which other disciplines are affected? And to what extent? Until replication research becomes common practice, we can only speculate about the answers to these questions.

While it’s hard to think of a suitable way to integrate regular replication studies into the current research system, registered reports could provide a solution to all four of the bad research practices mentioned here. Unlike normal journal articles, registered reports are accepted for publication before the data are collected. Hence, the problem of publication bias is solved, as the results cannot influence the journal’s decision on whether the study will be published. P-hacking and HARKing are also unlikely to occur, since the researchers have to specify in advance which hypotheses will be tested and how, and any deviation from the research plan requires extraordinary justification. Finally, registered reports are generally more adequately powered than normal journal articles, as the methods (including the intended sample size) are reviewed before the study is conducted.

Would a more replicable science lead to more public trust in scientific findings? We don’t know, but it seems likely. If the scientific community accepts that certain research findings are indeed dubious and attempts to remedy these shortcomings, maybe science sceptics will be less reluctant to accept research results that are actually robust. We certainly still have a long way to go before the crisis fades, but fostering methodological skills, adopting registered reports as a publication model, and incentivising replication research are promising first steps in the right direction.

1 Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
2 Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLoS Biology, 13(6), e1002165. doi:10.1371/journal.pbio.1002165
3 Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
4 Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.
5 Flore, P. C., Mulder, J., & Wicherts, J. M. (2019). The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report. Comprehensive Results in Social Psychology, 3, 140–174. https://doi.org/10.1080/23743603.2018.1559647
6 Schimmack, U. (2020, December 13). Defund Implicit Bias Research. Replicability Index. https://replicationindex.com/category/implicit-bias/
7 Chivers, T. (2019). What’s next for psychology’s embattled field of social priming. Nature, 576(7786), 200–202. doi:10.1038/d41586-019-03755-2

Contributors

Valentin Weber

PhD student in cognitive science at ENS-PSL

Valentin Weber holds a degree in psychology and is currently preparing his PhD in cognitive science at ENS-PSL. His research interests lie at the intersection of philosophy, neuroscience, and psychology, and his current work focuses on iconic memory and other issues in the philosophy of cognitive science. Previously, he studied psychological methods and worked on psychometric models.
