Systematic misestimation of machine learning performance in neuroimaging studies of depression

Flint, Claas; Redlich, Ronny; Arolt, Volker; Hahn, Tim; Eickhoff, Simon B.; Opel, Nils; Krug, Axel; Leenings, Ramona; Clark, Scott; Dannlowski, Udo; Kircher, Tilo; Winter, Nils R.; Jiang, Xiaoyi; Baune, Bernhard T.; Mehler, David M. A.; Cearns, Micah; Nenadic, Igor; Emden, Daniel

doi:10.1038/s41386-021-01020-7

Marc 21

001			892632
005			20230515091803.0
024	7	_	\|a 10.1038/s41386-021-01020-7 \|2 doi
024	7	_	\|a 0893-133X \|2 ISSN
024	7	_	\|a 1740-634X \|2 ISSN
024	7	_	\|a 2128/28282 \|2 Handle
024	7	_	\|a altmetric:105599429 \|2 altmetric
024	7	_	\|a 33958703 \|2 pmid
024	7	_	\|a WOS:000647877800001 \|2 WOS
037	_	_	\|a FZJ-2021-02221
082	_	_	\|a 610
100	1	_	\|a Flint, Claas \|0 0000-0001-5164-8227 \|b 0
245	_	_	\|a Systematic misestimation of machine learning performance in neuroimaging studies of depression
260	_	_	\|a Basingstoke \|c 2021 \|b Nature Publishing Group
336	7	_	\|a article \|2 DRIVER
336	7	_	\|a Output Types/Journal article \|2 DataCite
336	7	_	\|a Journal Article \|b journal \|m journal \|0 PUB:(DE-HGF)16 \|s 1626785232_8199 \|2 PUB:(DE-HGF)
336	7	_	\|a ARTICLE \|2 BibTeX
336	7	_	\|a JOURNAL_ARTICLE \|2 ORCID
336	7	_	\|a Journal Article \|0 0 \|2 EndNote
520	_	_	\|a We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: While we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from Major Depressive Disorder (MDD) and healthy controls based on neuroimaging data. Drawing upon structural MRI data from a balanced sample of N = 1868 MDD patients and healthy controls from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset which yielded an accuracy of 61%. Next, we mimicked the process by which researchers would draw samples of various sizes (N = 4 to N = 150) from the population and showed a strong risk of misestimation. Specifically, for small sample sizes (N = 20), we observe accuracies of up to 95%. For medium sample sizes (N = 100) accuracies up to 75% were found. Importantly, further investigation showed that sufficiently large test sets effectively protect against performance misestimation whereas larger datasets per se do not. While these results question the validity of a substantial part of the current literature, we outline the relatively low-cost remedy of larger test sets, which is readily available in most cases.
536	_	_	\|a 525 - Decoding Brain Organization and Dysfunction (POF4-525) \|0 G:(DE-HGF)POF4-525 \|c POF4-525 \|f POF IV \|x 0
542	_	_	\|i 2021-05-06 \|2 Crossref \|u https://creativecommons.org/licenses/by/4.0
588	_	_	\|a Dataset connected to CrossRef, Journals: juser.fz-juelich.de
700	1	_	\|a Cearns, Micah \|0 0000-0002-3353-8566 \|b 1
700	1	_	\|a Opel, Nils \|0 P:(DE-HGF)0 \|b 2
700	1	_	\|a Redlich, Ronny \|0 P:(DE-HGF)0 \|b 3
700	1	_	\|a Mehler, David M. A. \|0 P:(DE-HGF)0 \|b 4
700	1	_	\|a Emden, Daniel \|0 P:(DE-HGF)0 \|b 5
700	1	_	\|a Winter, Nils R. \|0 P:(DE-HGF)0 \|b 6
700	1	_	\|a Leenings, Ramona \|0 P:(DE-HGF)0 \|b 7
700	1	_	\|a Eickhoff, Simon B. \|0 P:(DE-Juel1)131678 \|b 8
700	1	_	\|a Kircher, Tilo \|0 P:(DE-HGF)0 \|b 9
700	1	_	\|a Krug, Axel \|0 0000-0002-0564-2497 \|b 10
700	1	_	\|a Nenadic, Igor \|0 P:(DE-HGF)0 \|b 11
700	1	_	\|a Arolt, Volker \|0 P:(DE-HGF)0 \|b 12
700	1	_	\|a Clark, Scott \|0 P:(DE-HGF)0 \|b 13
700	1	_	\|a Baune, Bernhard T. \|0 P:(DE-HGF)0 \|b 14
700	1	_	\|a Jiang, Xiaoyi \|0 P:(DE-HGF)0 \|b 15
700	1	_	\|a Dannlowski, Udo \|0 P:(DE-HGF)0 \|b 16 \|e Corresponding author
700	1	_	\|a Hahn, Tim \|0 P:(DE-HGF)0 \|b 17
773	1	8	\|a 10.1038/s41386-021-01020-7 \|b Springer Science and Business Media LLC \|d 2021-05-06 \|n 8 \|p 1510-1517 \|3 journal-article \|2 Crossref \|t Neuropsychopharmacology \|v 46 \|y 2021 \|x 0893-133X
773	_	_	\|a 10.1038/s41386-021-01020-7 \|0 PERI:(DE-600)2008300-2 \|n 8 \|p 1510-1517 \|t Neuropsychopharmacology \|v 46 \|y 2021 \|x 0893-133X
856	4	_	\|u h
856	4	_	\|u https://juser.fz-juelich.de/record/892632/files/s41386-021-01020-7-1.pdf \|y OpenAccess
909	C	O	\|o oai:juser.fz-juelich.de:892632 \|p openaire \|p open_access \|p VDB \|p driver \|p dnbdelivery
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 8 \|6 P:(DE-Juel1)131678
913	1	_	\|a DE-HGF \|b Key Technologies \|l Natural, Artificial and Cognitive Information Processing \|1 G:(DE-HGF)POF4-520 \|0 G:(DE-HGF)POF4-525 \|3 G:(DE-HGF)POF4 \|2 G:(DE-HGF)POF4-500 \|4 G:(DE-HGF)POF \|v Decoding Brain Organization and Dysfunction \|x 0
913	0	_	\|a DE-HGF \|b Key Technologies \|l Decoding the Human Brain \|1 G:(DE-HGF)POF3-570 \|0 G:(DE-HGF)POF3-574 \|3 G:(DE-HGF)POF3 \|2 G:(DE-HGF)POF3-500 \|4 G:(DE-HGF)POF \|v Theory, modelling and simulation \|x 0
914	1	_	\|y 2021
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0200 \|2 StatID \|b SCOPUS
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)1030 \|2 StatID \|b Current Contents - Life Sciences
915	_	_	\|a Creative Commons Attribution CC BY 4.0 \|0 LIC:(DE-HGF)CCBY4 \|2 HGFVOC
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0600 \|2 StatID \|b Ebsco Academic Search
915	_	_	\|a JCR \|0 StatID:(DE-HGF)0100 \|2 StatID \|b NEUROPSYCHOPHARMACOL : 2015
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0150 \|2 StatID \|b Web of Science Core Collection
915	_	_	\|a WoS \|0 StatID:(DE-HGF)0110 \|2 StatID \|b Science Citation Index
915	_	_	\|a WoS \|0 StatID:(DE-HGF)0111 \|2 StatID \|b Science Citation Index Expanded
915	_	_	\|a OpenAccess \|0 StatID:(DE-HGF)0510 \|2 StatID
915	_	_	\|a Peer Review \|0 StatID:(DE-HGF)0030 \|2 StatID \|b ASC
915	_	_	\|a IF >= 5 \|0 StatID:(DE-HGF)9905 \|2 StatID \|b NEUROPSYCHOPHARMACOL : 2015
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0310 \|2 StatID \|b NCBI Molecular Biology Database
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)1050 \|2 StatID \|b BIOSIS Previews
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0300 \|2 StatID \|b Medline
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0199 \|2 StatID \|b Thomson Reuters Master Journal List
920	_	_	\|l yes
920	1	_	\|0 I:(DE-Juel1)INM-7-20090406 \|k INM-7 \|l Gehirn & Verhalten \|x 0
980	_	_	\|a journal
980	_	_	\|a VDB
980	_	_	\|a UNRESTRICTED
980	_	_	\|a I:(DE-Juel1)INM-7-20090406
980	1	_	\|a FullTexts
999	C	5	\|a 10.1001/jama.2015.18421 \|9 -- missing cx lookup -- \|1 AM Darcy \|p 551 - \|2 Crossref \|u Darcy AM, Louie AK, Roberts LW. Machine learning and the profession of medicine. J Am Med Assoc. 2016;315:551–52. \|t J Am Med Assoc \|v 315 \|y 2016
999	C	5	\|a 10.1002/wps.20297 \|9 -- missing cx lookup -- \|1 HA Eyre \|p 21 - \|2 Crossref \|u Eyre HA, Singh AB, Reynolds C. Tech giants enter mental health. World Psychiatry. 2016;15:21–22. \|t World Psychiatry \|v 15 \|y 2016
999	C	5	\|a 10.1016/j.neuron.2014.10.047 \|9 -- missing cx lookup -- \|1 JDE Gabrieli \|p 11 - \|2 Crossref \|u Gabrieli JDE, Ghosh SS, Whitfield-Gabrieli S. Prediction as a humanitarian and pragmatic contribution from human cognitive neuroscience. Neuron. 2015;85:11–26. \|t Neuron. \|v 85 \|y 2015
999	C	5	\|a 10.1126/science.aaa8415 \|9 -- missing cx lookup -- \|1 MI Jordan \|p 255 - \|2 Crossref \|u Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349:255–60. \|t Science. \|v 349 \|y 2015
999	C	5	\|a 10.1038/mp.2016.201 \|9 -- missing cx lookup -- \|1 T Hahn \|p 37 - \|2 Crossref \|u Hahn T, Nierenberg AA, Whitfield-Gabrieli S. Predictive analytics in mental health: applications, guidelines, challenges and perspectives. Mol Psychiatry. 2017;22:37–43. \|t Mol Psychiatry. \|v 22 \|y 2017
999	C	5	\|1 BA Johnston \|y 2015 \|2 Crossref \|u Johnston BA, Steele JD, Tolomeo S, Christmas D, Matthews K. Structural MRI-based predictions in patients with treatment-refractory depression (TRD). PLoS One. 2015;10:1–16.
999	C	5	\|a 10.1093/brain/aws084 \|9 -- missing cx lookup -- \|1 B Mwangi \|p 1508 - \|2 Crossref \|u Mwangi B, Ebmeier KP, Matthews K, Douglas Steele J. Multi-centre diagnostic classification of individual structural neuroimaging scans from patients with major depressive disorder. Brain. 2012;135:1508–21. \|t Brain. \|v 135 \|y 2012
999	C	5	\|a 10.1002/gps.4262 \|9 -- missing cx lookup -- \|1 MJ Patel \|p 1056 - \|2 Crossref \|u Patel MJ, Andreescu C, Price JC, Edelman KL, Reynolds CF, Aizenstein HJ. Machine learning approaches for integrating clinical and imaging features in late-life depression classification and response prediction. Int J Geriatr Psychiatry. 2015;30:1056–67. \|t Int J Geriatr Psychiatry. \|v 30 \|y 2015
999	C	5	\|a 10.1016/j.biopsych.2017.09.032 \|9 -- missing cx lookup -- \|1 AH Neuhaus \|p e81 - \|2 Crossref \|u Neuhaus AH, Popescu FC. Sample Size, Model Robustness, and Classification Accuracy in Diagnostic Multivariate Neuroimaging Analyses. Biol Psychiatry. 2018;84:e81–e82. \|t Biol Psychiatry. \|v 84 \|y 2018
999	C	5	\|a 10.1016/j.neuroimage.2016.02.079 \|9 -- missing cx lookup -- \|1 MR Arbabshirani \|p 137 - \|2 Crossref \|u Arbabshirani MR, Plis S, Sui J, Calhoun VD. Single subject prediction of brain disorders in neuroimaging: Promises and pitfalls. Neuroimage. 2017;145:137–65. \|t Neuroimage. \|v 145 \|y 2017
999	C	5	\|a 10.1109/34.75512 \|9 -- missing cx lookup -- \|1 S Raudys \|p 252 - \|2 Crossref \|u Raudys S, Jain A. Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Trans Pattern Anal Mach Intell. 1991;13:252–64. \|t IEEE Trans Pattern Anal Mach Intell \|v 13 \|y 1991
999	C	5	\|a 10.1186/1471-2288-14-137 \|1 T van der Ploeg \|9 -- missing cx lookup -- \|2 Crossref \|u van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14:137. \|t BMC Med Res Methodol \|v 14 \|y 2014
999	C	5	\|a 10.1016/j.biopsych.2016.10.028 \|9 -- missing cx lookup -- \|1 J Kambeitz \|p 330 - \|2 Crossref \|u Kambeitz J, Cabral C, Sacchet MD, Gotlib IH, Zahn R, Serpa MH, et al. Detecting Neuroimaging Biomarkers for Depression: A Meta-analysis of Multivariate Pattern Recognition Studies. Biol Psychiatry. 2017;82:330–38. \|t Biol Psychiatry. \|v 82 \|y 2017
999	C	5	\|a 10.1016/j.neuroimage.2016.10.038 \|9 -- missing cx lookup -- \|1 G Varoquaux \|p 166 - \|2 Crossref \|u Varoquaux G, Raamana PR, Engemann DA, Hoyos-Idrobo A, Schwartz Y, Thirion B. Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. Neuroimage. 2017;145:166–79. \|t Neuroimage. \|v 145 \|y 2017
999	C	5	\|a 10.31219/OSF.IO/UZEHJ \|9 -- missing cx lookup -- \|2 Crossref \|u Hahn T, Ebner-Priemer U, Meyer-Lindenberg A Transparent Artificial Intelligence – A Conceptual Framework for Evaluating AI-based Clinical Decision Support Systems. OSF Prepr. 2019. 2019. https://doi.org/10.31219/OSF.IO/UZEHJ.
999	C	5	\|a 10.1016/j.neuroimage.2017.06.061 \|9 -- missing cx lookup -- \|1 G Varoquaux \|p 68 - \|2 Crossref \|u Varoquaux G. Cross-validation failure: small sample sizes lead to large error bars. Neuroimage. 2018;180:68–77. \|t Neuroimage. \|v 180 \|y 2018
999	C	5	\|a 10.1038/npp.2015.86 \|9 -- missing cx lookup -- \|1 U Dannlowski \|p 2510 - \|2 Crossref \|u Dannlowski U, Kugel H, Grotegerd D, Redlich R, Suchy J, Opel N, et al. NCAN cross-disorder risk variant is associated with limbic gray matter deficits in healthy subjects and major depression. Neuropsychopharmacology. 2015;40:2510–16. \|t Neuropsychopharmacology. \|v 40 \|y 2015
999	C	5	\|a 10.1038/mp.2014.39 \|9 -- missing cx lookup -- \|1 U Dannlowski \|p 398 - \|2 Crossref \|u Dannlowski U, Grabe HJ, Wittfeld K, Klaus J, Konrad C, Grotegerd D, et al. Multimodal imaging of a tescalcin (TESC)-regulating polymorphism (rs7294919)-specific effects on hippocampal gray matter structure. Mol Psychiatry. 2015;20:398–404. \|t Mol Psychiatry. \|v 20 \|y 2015
999	C	5	\|a 10.1007/s00406-018-0943-x \|9 -- missing cx lookup -- \|2 Crossref \|u Kircher T, Wöhr M, Nenadic I, Schwarting R, Schratt G, Alferink J, et al. Neurobiology of the major psychoses: a translational perspective on brain structure and function—the FOR2107 consortium. Eur Arch Psychiatry Clin Neurosci. 2018:1–14.
999	C	5	\|2 Crossref \|u Wittchen H-U, Wunderlich U, Gruschwitz S, Zaudig M SKID I. Strukturiertes Klinisches Interview für DSM-IV. Achse I: Psychische Störungen. Interviewheft und Beurteilungsheft. Eine deutschsprachige, erweiterte Bearb. d. amerikanischen Originalversion des SKID I. Göttingen: Hogrefe; 1997.
999	C	5	\|a 10.1016/j.neuroimage.2018.01.079 \|9 -- missing cx lookup -- \|1 C Vogelbacher \|p 450 - \|2 Crossref \|u Vogelbacher C, Möbius TWD, Sommer J, Schuster V, Dannlowski U, Kircher T, et al. The Marburg-Münster Affective Disorders Cohort Study (MACS): A quality assurance protocol for MR neuroimaging data. Neuroimage. 2018;172:450–460. \|t Neuroimage. \|v 172 \|y 2018
999	C	5	\|1 F Pedregosa \|y 2012 \|2 Crossref \|u Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2012;12:2825–30.
999	C	5	\|a 10.1016/j.biopsych.2015.12.023 \|9 -- missing cx lookup -- \|1 AF Marquand \|p 552 - \|2 Crossref \|u Marquand AF, Rezek I, Buitelaar J, Beckmann CF. Understanding heterogeneity in clinical cohorts using normative models: beyond case-control studies. Biol Psychiatry. 2016;80:552–61. \|t Biol Psychiatry. \|v 80 \|y 2016
999	C	5	\|a 10.3389/fpsyt.2016.00050 \|9 -- missing cx lookup -- \|1 HG Schnack \|p 1 - \|2 Crossref \|u Schnack HG, Kahn RS. Detecting neuroimaging biomarkers for psychiatric disorders: sample size matters. Front Psychiatry. 2016;7:1–12. \|t Front Psychiatry \|v 7 \|y 2016
999	C	5	\|a 10.1016/j.jneumeth.2015.01.010 \|9 -- missing cx lookup -- \|1 E Combrisson \|p 126 - \|2 Crossref \|u Combrisson E, Jerbi K. Exceeding chance level by chance: the caveat of theoretical chance levels in brain signal classification and statistical assessment of decoding accuracy. J Neurosci Methods. 2015;250:126–36. \|t J Neurosci Methods. \|v 250 \|y 2015
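
The abstract in field 520 reports that small test sets can show accuracies far above the 61% obtained on the full dataset. The sketch below illustrates that sampling effect only. It is not the authors' procedure: it assumes a classifier with a fixed true accuracy of 0.61 and treats each test prediction as an independent Bernoulli trial. The function names, the number of simulated studies, and the test-set sizes are illustrative choices, not values taken from the record.

```python
import random
import statistics


def simulate_observed_accuracies(true_accuracy, test_size, n_studies, seed=0):
    """Simulate the accuracy each of `n_studies` hypothetical studies would report.

    Assumption: every test case is classified correctly with probability
    `true_accuracy`, independently of all other cases.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_studies):
        # Count correct predictions on one simulated test set.
        correct = sum(rng.random() < true_accuracy for _ in range(test_size))
        accuracies.append(correct / test_size)
    return accuracies


def summarize(accuracies):
    """Return the mean, minimum and maximum of the simulated accuracies."""
    return {
        "mean": statistics.mean(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
    }


if __name__ == "__main__":
    # 0.61 is the full-sample accuracy reported in the abstract;
    # the test-set sizes and study count are illustrative.
    for test_size in (20, 100, 1000):
        summary = summarize(
            simulate_observed_accuracies(0.61, test_size, n_studies=2000)
        )
        print(
            f"test size {test_size:4d}: "
            f"mean={summary['mean']:.3f} "
            f"min={summary['min']:.3f} "
            f"max={summary['max']:.3f}"
        )
```

Under these assumptions the mean observed accuracy stays near the true value at every test-set size, while the range of observed accuracies narrows as the test set grows. This is consistent with the abstract's recommendation of larger test sets.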
