What does it mean to “trust science”?

The curious case of the reproducibility crisis

Valentin Weber, PhD student in cognitive science at ENS-PSL
On June 23rd, 2021
Key takeaways
  • The social sciences, as well as biomedical research and other scientific disciplines, are currently experiencing a “reproducibility crisis”.
  • As many as one third of social science results cannot be replicated and are therefore potentially erroneous – reproducibility being an essential criterion of scientific work.
  • This crisis is due, in particular, to the need to provide innovative and significant results in order to be published in prestigious scientific journals.
  • One solution could be “registered reports”, which guarantee publication of a study on the basis of its hypotheses and methods alone, before its results are known.

Reproducibility, meaning that other scientists can obtain a comparable result by recreating the conditions of an original study, is essential to the scientific method. In short, the results of an experiment should be the same whoever carries out the procedure. In most cases, this holds. However, across various disciplines in the social sciences, but also in biomedical research, certain scientific studies have failed to replicate when carried out later by other scientists, calling the results of the original studies into question [1, 2, 3].

Known as the “replication crisis”, the problem concerns more than a few isolated studies published in low-tier journals. Rather, the crisis is systemic, affecting as many as a third of social science studies, including those in the most prestigious journals such as Science or Nature [4]. Affected studies include various well-known phenomena, many of which have found their way into public discourse. Amongst them are widespread concepts such as stereotype threat [5], implicit bias [6], or social priming [7]. Those are just three of the most famous findings facing serious criticism, so much so that they might not survive further methodological scrutiny. But how did we get into this crisis, and what can we do about it?

P-hacking, HARKing and publication bias

Two of the most common “bad research” practices responsible for non-replicable results involve statistical manipulation: p-hacking and HARKing. In the former, researchers tweak their research design slightly until a non-significant result turns significant, essentially turning a negative result into a positive one. For example, after failing to find any effect in their experiment, the researchers might change the way variables are measured, exclude a few outliers that were not excluded before, or collect a few more participants at a time, checking after each batch whether the results have become significant. All of these practices increase the chance that the researchers will find an effect even if the effect does not actually exist.
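The last of these practices, checking for significance after each new batch of participants, can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the true effect is exactly zero, the data are normally distributed with a known standard deviation of 1 (so a simple z-test applies), and the batch sizes, function names and number of simulated studies are hypothetical choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05  # conventional significance threshold
_STD_NORMAL = NormalDist()


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    n = len(sample)
    z = (sum(sample) / n) * sqrt(n)
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def run_study(rng, start_n=20, step=5, max_n=100, peek=True):
    """Simulate one study in which the true effect is exactly zero.

    With peek=True the researcher tests after every batch and stops at the
    first significant result; with peek=False only the final sample is tested.
    Returns True if the study ends up reporting a (false) positive.
    """
    sample = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        at_final_size = len(sample) >= max_n
        if (peek or at_final_size) and z_test_p_value(sample) < ALPHA:
            return True
        if at_final_size:
            return False
        sample.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(peek, n_studies=2000, seed=1):
    """Share of simulated null studies that report a significant result."""
    rng = random.Random(seed)
    return sum(run_study(rng, peek=peek) for _ in range(n_studies)) / n_studies


if __name__ == "__main__":
    print("single test at the end:", false_positive_rate(peek=False))
    print("test after every batch:", false_positive_rate(peek=True))
```

In this setup, a single test at the planned sample size should produce false positives at roughly the nominal 5% rate, whereas repeated peeking should push the rate well above it, even though no individual test looks improper.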

Similarly, in HARKing (hypothesising after the results are known), researchers stumble upon an effect between two variables and then claim that this is what they expected all along. A hypothesis, however, should be formulated before an experiment is carried out, not retroactively. In today’s age of big data, it’s not hard to see why HARKing is a bad idea. In a large dataset containing hundreds of variables, some of these variables will be correlated with each other just by chance. Claiming that you only expected an effect for these correlated variables, even though you ran correlations for all variables, gives a distorted view of the actual data.
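How many chance correlations a dataset of pure noise can produce is easy to check. The sketch below assumes independent, normally distributed variables and uses the Fisher z-transformation as an approximate significance test for each correlation; the dataset dimensions and function names are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations
from math import atanh, sqrt
from statistics import NormalDist, fmean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation (Fisher z-transformation)."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_chance_findings(n_variables=40, n_participants=100, alpha=0.05, seed=7):
    """Correlate every pair of unrelated noise variables.

    Returns (number of 'significant' pairs, number of pairs tested).
    """
    rng = random.Random(seed)
    data = [
        [rng.gauss(0, 1) for _ in range(n_participants)]
        for _ in range(n_variables)
    ]
    pairs = list(combinations(range(n_variables), 2))
    hits = sum(
        1
        for i, j in pairs
        if correlation_p_value(pearson_r(data[i], data[j]), n_participants) < alpha
    )
    return hits, len(pairs)


if __name__ == "__main__":
    hits, tested = count_chance_findings()
    print(f"{hits} of {tested} correlations are 'significant' in pure noise")
```

With 40 variables there are 780 pairs to test, so at a 5% threshold a few dozen “significant” correlations are expected even though none of the variables are related. Reporting only those pairs, with a hypothesis attached after the fact, hides how many tests were run.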

In the academic world, publications are the gold standard of success, but scientific research is much harder to publish if no significant results are found. This is known as publication bias. Hence, if you want a thriving career in science, you had better find an effect! Of course, this doesn’t fully explain why significant results are so important. After all, the results of an experiment do not tell us anything about the quality of the methods used. If a study does not find an effect, it might simply be that the effect does not exist. Yet scientific journals often refuse to accept non-significant results for publication, because non-significant results do not establish the absence of an effect to the same extent that significant results establish the existence of one.

In standard social science research, the highest acceptable false positive rate is 5%, while the highest acceptable false negative rate is 20%. In other words, even a well-designed study is conventionally allowed to miss a real effect four times as often as it is allowed to report a spurious one. On top of that, many scientific studies are not adequately powered, meaning they do not have enough participants to bring the false negative rate down to an adequate level. As a consequence, journals may reject studies with non-significant results on the grounds that the study could have found the effect if the sample size had been larger.
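What “adequately powered” means in terms of participants can be estimated with a standard textbook formula. The sketch below assumes a two-sided comparison of two equally sized groups and uses the normal approximation, which gives slightly smaller numbers than an exact t-test calculation. The function name and the example effect sizes (standardised mean differences) are illustrative assumptions, not values taken from the studies discussed here.

```python
from math import ceil
from statistics import NormalDist


def participants_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-group comparison.

    effect_size is the standardised mean difference (Cohen's d).
    alpha is the accepted false positive rate; 1 - power is the accepted
    false negative rate. Uses the normal approximation
    n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        print(f"d = {d}: about {participants_per_group(d)} participants per group")
```

Under these assumptions, a medium-sized effect (d = 0.5) needs about 63 participants per group to keep the false negative rate at 20%, while a small effect (d = 0.2) needs about 393 per group. Studies with a few dozen participants in total therefore have little chance of detecting small effects.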

Pressure for a “scoop”

All of the aforementioned questionable research practices – p-hacking, HARKing (which is arguably a subtype of p-hacking), publication bias, and underpowered studies – are well-known issues by now, but the problems of the replication crisis run deeper. One of the reasons why many classic studies were found to be non-replicable only decades after they were conducted is that there is little incentive to do replication studies. Academic careers thrive on pursuing novel ideas, and journals are likely to dismiss replication research for its lack of originality. Hence, there is not enough replication research to either red-flag original studies whose results fail to replicate or provide more certainty for results that are successfully replicated.

A related consequence of the lack of replication research is that it’s hard to estimate the magnitude of the replication crisis. Aside from social science and biomedical research, which other disciplines are affected, and to what extent? Until replication research becomes common practice, we can only speculate about the answers to these questions.

While it’s hard to think of a suitable way to integrate regular replication studies into the current research system, registered reports could provide a solution to all four of the bad research practices mentioned here. Unlike normal journal articles, registered reports are accepted for publication before data is collected. Hence, the problem of publication bias is solved, as the results cannot influence the journal’s decision on whether the study will be published. P-hacking and HARKing are also unlikely to occur, since the researchers have to specify in advance which hypotheses will be tested and how, and any deviation from the research plan needs extraordinary justification. Finally, registered reports are generally more adequately powered than normal journal articles, as the methods (including the intended sample size) are reviewed before the study is conducted.

Would a more replicable science lead to more public trust in scientific findings? We don’t know, but it’s likely. If the scientific community accepts that certain research findings are indeed dubious and attempts to address these shortcomings, science sceptics may become less reluctant to accept research results that are actually robust. We certainly still have a long way to go until the crisis fades, but fostering methodological skills, adopting registered reports as a publication model, and incentivising replication research are promising first steps in the right direction.

[1] Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
[2] Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLoS Biology, 13(6), e1002165. doi:10.1371/journal.pbio.1002165
[3] Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
[4] Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.
[5] Flore, P. C., Mulder, J., & Wicherts, J. M. (2019). The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report. Comprehensive Results in Social Psychology, 3, 140–174. https://doi.org/10.1080/23743603.2018.1559647
[6] Schimmack, U. (2020, December 13). Defund Implicit Bias Research. Replicability Index. https://replicationindex.com/category/implicit-bias/
[7] Chivers, T. (2019). What’s next for psychology’s embattled field of social priming. Nature, 576(7786), 200–202. doi:10.1038/d41586-019-03755-2

Contributors

Valentin Weber

Valentin Weber holds a degree in psychology and is currently preparing his PhD in cognitive science at ENS-PSL. His research interests lie at the intersection of philosophy, neuroscience, and psychology, and his current work focuses on iconic memory and other issues in the philosophy of cognitive science. He previously studied psychological methods and worked on psychometric models.