Systematic misestimation of machine learning performance in neuroimaging studies of depression

Flint, Claas; Redlich, Ronny; Arolt, Volker; Hahn, Tim; Eickhoff, Simon B.; Opel, Nils; Krug, Axel; Leenings, Ramona; Clark, Scott; Dannlowski, Udo; Kircher, Tilo; Winter, Nils R.; Jiang, Xiaoyi; Baune, Bernhard T.; Mehler, David M. A.; Cearns, Micah; Nenadic, Igor; Emden, Daniel

doi:10.1038/s41386-021-01020-7

Journal Article

FZJ-2021-02221

Flint, C. ; Cearns, M. ; Opel, N. ; Redlich, R. ; Mehler, D. M. A. ; Emden, D. ; Winter, N. R. ; Leenings, R. ; Eickhoff, S. B. (FZJ) ; Kircher, T. ; Krug, A. ; Nenadic, I. ; Arolt, V. ; Clark, S. ; Baune, B. T. ; Jiang, X. ; Dannlowski, U. (Corresponding author) ; Hahn, T.

2021
Nature Publishing Group, Basingstoke

Neuropsychopharmacology 46(8), 1510-1517 (2021) [10.1038/s41386-021-01020-7]

Please use a persistent id in citations: http://hdl.handle.net/2128/28282 doi:10.1038/s41386-021-01020-7

Abstract: We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: While we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from Major Depressive Disorder (MDD) and healthy controls based on neuroimaging data. Drawing upon structural MRI data from a balanced sample of N = 1868 MDD patients and healthy controls from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset which yielded an accuracy of 61%. Next, we mimicked the process by which researchers would draw samples of various sizes (N = 4 to N = 150) from the population and showed a strong risk of misestimation. Specifically, for small sample sizes (N = 20), we observe accuracies of up to 95%. For medium sample sizes (N = 100) accuracies up to 75% were found. Importantly, further investigation showed that sufficiently large test sets effectively protect against performance misestimation whereas larger datasets per se do not. While these results question the validity of a substantial part of the current literature, we outline the relatively low-cost remedy of larger test sets, which is readily available in most cases.
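The link between test-set size and misestimation described in the abstract can be illustrated with a simplified binomial sketch. This is not the authors' resampling procedure on the PAC data. It assumes that every test prediction is independently correct with a fixed true accuracy of 61% (the full-sample figure from the abstract), and that `n_test` is the number of test samples rather than the total sample size N. The function name `prob_accuracy_at_least` is hypothetical.

```python
from math import ceil, comb


def prob_accuracy_at_least(n_test: int, threshold: float, true_acc: float = 0.61) -> float:
    """Exact binomial probability that the observed test accuracy is >= threshold.

    Assumption: each of the n_test predictions is independently correct
    with probability true_acc (a simplification, not the paper's method).
    """
    # Smallest number of correct predictions that reaches the threshold.
    k_min = max(0, ceil(threshold * n_test - 1e-9))
    return sum(
        comb(n_test, k) * true_acc**k * (1.0 - true_acc) ** (n_test - k)
        for k in range(k_min, n_test + 1)
    )


if __name__ == "__main__":
    # Chance of observing 75% accuracy or better purely through sampling noise.
    for n in (20, 100, 500):
        print(n, prob_accuracy_at_least(n, 0.75))
```

Under these assumptions, the probability of an inflated accuracy estimate shrinks quickly as the test set grows, which is consistent with the remedy the abstract proposes.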

Classification:

ddc:610

Contributing Institute(s):

Gehirn & Verhalten (INM-7)

Research Program(s):

525 - Decoding Brain Organization and Dysfunction (POF4-525)

Appears in the scientific report 2021

Database coverage:
Medline ; BIOSIS Previews ; Current Contents - Life Sciences ; Ebsco Academic Search ; IF >= 5 ; JCR ; NCBI Molecular Biology Database ; SCOPUS ; Science Citation Index ; Science Citation Index Expanded ; Thomson Reuters Master Journal List ; Web of Science Core Collection

The record appears in these collections:
Document types > Articles > Journal Article
Institute Collections > INM > INM-7
Workflow collections > Public records
Publications database
Open Access

Record created 2021-05-19, last modified 2023-05-15
