000891733 001__ 891733
000891733 005__ 20210810182031.0
000891733 0247_ $$2ISSN$$a2397-3374
000891733 0247_ $$2doi$$a10.1038/s41562-021-01085-w
000891733 0247_ $$2Handle$$a2128/28323
000891733 0247_ $$2altmetric$$aaltmetric:103336036
000891733 0247_ $$2pmid$$a33820977
000891733 0247_ $$2WOS$$aWOS:000636920500001
000891733 037__ $$aFZJ-2021-01703
000891733 082__ $$a150
000891733 1001_ $$0P:(DE-Juel1)177727$$aDukart, Juergen$$b0
000891733 245__ $$aTowards increasing the clinical applicability of machine learning biomarkers in psychiatry
000891733 260__ $$aLondon$$bNature Research$$c2021
000891733 3367_ $$2DRIVER$$aarticle
000891733 3367_ $$2DataCite$$aOutput Types/Journal article
000891733 3367_ $$0PUB:(DE-HGF)16$$2PUB:(DE-HGF)$$aJournal Article$$bjournal$$mjournal$$s1627024865_28799
000891733 3367_ $$2BibTeX$$aARTICLE
000891733 3367_ $$2ORCID$$aJOURNAL_ARTICLE
000891733 3367_ $$00$$2EndNote$$aJournal Article
000891733 520__ $$aDue to a lack of objective biomarkers, psychiatric diagnoses still rely strongly on patient reporting and clinician judgement. The ensuing subjectivity negatively affects the definition and reliability of psychiatric diagnoses1,2. Recent research has suggested that a combination of advanced neuroimaging and machine learning may provide a solution to this predicament by establishing such objective biomarkers for psychiatric conditions, improving diagnostic accuracy, prognosis and the development of novel treatments3. These promises have led to widespread interest in machine learning applications for mental health4, including a recent paper that reports a biological marker for one of the most difficult yet momentous questions in psychiatry—the assessment of suicidal behaviour5. Just et al. compared a group of 17 participants with suicidal ideation with 17 healthy controls, reporting high discrimination accuracy using task-based functional magnetic resonance imaging signatures of life- and death-related concepts3. The authors further reported high discrimination between nine ideators who had attempted suicide and eight ideators who had not. While a laudable effort on a difficult topic, this study unfortunately illustrates some common conceptual and technical issues in the field that limit translation into clinical practice and raise unrealistic hopes when the results are communicated to the general public. From a conceptual point of view, machine learning studies aimed at clinical applications need to carefully consider any decisions that might hamper the interpretation or generalizability of their results. Restriction to an arbitrary setting may become detrimental for machine learning applications by providing overly optimistic results that are unlikely to generalize. As an example, Just et al. excluded more than half of the patients and healthy controls initially enrolled in the study from the main analysis because they lacked the desired functional magnetic resonance imaging effects (a rank accuracy of at least 0.6 based on all 30 concepts). This exclusion introduces a non-assessable bias into the interpretation of the results, in particular when considering that only six of the 30 concepts were selected for the final classification procedure. While Just et al. attempt to address this question by applying the trained classifier to the initially excluded 21 suicidal ideators, they explicitly omit the excluded 24 controls from this analysis, preventing any interpretation of the extent to which the classifier decision depends on this initial choice. From a technical point of view, machine learning-based predictions based on neuroimaging data in small samples are intrinsically highly variable, as stable accuracy estimates and high generalizability are only achieved with several hundred participants6,7. The study by Just et al. falls into this category of small-sample studies. To estimate the impact of this uncertainty on the results by Just et al., we adapted a simulation approach using the code and data kindly provided by the authors, randomly permuting (800 times) the labels across the groups using their default settings and computing the resulting accuracies. These results showed that the 95% confidence interval for classification accuracy obtained with this dataset spans about 20%, leaving large uncertainty with respect to any potential findings. Special care is also required with respect to any subjective choices in feature and classifier settings or group selection. While ad hoc selection of a specific setting is subjective, testing different settings and justifying the chosen one post hoc based on the outcome leads to overfitting, thus limiting the generalizability of any classification. Such overfitting may occur when multiple models or parameter choices are tested with respect to their ability to predict the testing data and only those that perform best are reported. To illustrate this issue, we performed an additional analysis with the code and data kindly provided by Just et al. More specifically, in the code and the manuscript we identified the following non-exhaustive list of prespecified settings: (1) removal of occipital cortex data; (2) subdivision of clusters larger than 11 mm; (3) selection of voxels with at least four contributing participants in each group; (4) selection of stable clusters containing at least five voxels; (5) selection of the 1,200 most stable features; and (6) manual copying and replacing of a cluster for one control participant. Importantly, according to the publication or code documentation, all of these parameters were chosen ad hoc, and for none of these settings was a parameter search performed. We systematically evaluated the effect of each of these choices on the accuracy for differentiation between suicidal ideators and controls in the original dataset provided by Just et al. As shown in Fig. 1, each of the six parameters represents an optimum choice for differentiation accuracy in this dataset, with any (even minor) change often resulting in substantially lower accuracy estimates. Similarly, data leakage may also contribute to optimistic results when information outside the training set is used to build a prediction model. More generally, whenever human interventions guide the development of machine learning models for the prediction of clinical conditions, careful evaluation and reporting of any researcher’s degrees of freedom is essential to avoid data leakage and overfitting. Subsequent sharing of data processing and analysis pipelines, as well as of the collected data, is a further key step to increase reproducibility and facilitate replication of potential findings.
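Editor's note: the permutation-based uncertainty estimate described in the abstract can be illustrated with a minimal sketch. Only the sample size (17 vs. 17 participants), the number of permutations (800) and the idea of permuting group labels are taken from the abstract; the synthetic Gaussian features, the GaussianNB classifier and leave-one-out cross-validation from scikit-learn are assumptions standing in for the authors' actual pipeline, not a reproduction of it.

# Minimal sketch (not the authors' pipeline): how variable is cross-validated
# accuracy in a 17-vs-17 sample when group labels are repeatedly permuted?
import numpy as np
from sklearn.naive_bayes import GaussianNB              # stand-in classifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n_per_group, n_features, n_permutations = 17, 30, 800
X = rng.standard_normal((2 * n_per_group, n_features))  # synthetic features
y = np.repeat([0, 1], n_per_group)                      # two groups of 17

accuracies = []
for _ in range(n_permutations):
    y_perm = rng.permutation(y)                         # destroy any true group structure
    scores = cross_val_score(GaussianNB(), X, y_perm, cv=LeaveOneOut())
    accuracies.append(scores.mean())

lo, hi = np.percentile(accuracies, [2.5, 97.5])
print(f"null accuracies span {lo:.2f}-{hi:.2f} (width {hi - lo:.2f}) around chance 0.50")

The width of this null distribution gives a rough sense of how much a reported accuracy can vary by chance alone in a sample of this size, which is the uncertainty the comment refers to.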
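Editor's note: the overfitting and data-leakage concerns raised in the final paragraph can likewise be made concrete. The sketch below is again only an illustration on synthetic pure-noise data with hypothetical scikit-learn components (SelectKBest feature selection, GaussianNB); it does not reproduce the pipeline of Just et al. It contrasts feature selection fitted on the full dataset, where information from the test participants leaks into the model, with selection refitted inside each cross-validation fold.

# Minimal sketch (synthetic noise data): data leakage through feature selection.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((34, 5000))   # 34 participants, 5,000 pure-noise features
y = np.repeat([0, 1], 17)             # true discriminability is chance level

# Leaky: pick the 50 "best" features using all labels, then cross-validate.
X_leaky = SelectKBest(f_classif, k=50).fit_transform(X, y)
acc_leaky = cross_val_score(GaussianNB(), X_leaky, y, cv=LeaveOneOut()).mean()

# Correct: feature selection is refit inside every training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=50), GaussianNB())
acc_nested = cross_val_score(pipe, X, y, cv=LeaveOneOut()).mean()

print(f"leaky selection: {acc_leaky:.2f}, selection inside CV: {acc_nested:.2f}")

On pure noise the leaky variant typically reports accuracies well above the 50% chance level, whereas the properly nested variant stays near chance, which illustrates the mechanism by which post-hoc setting selection and leakage can inflate small-sample results.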
000891733 536__ $$0G:(DE-HGF)POF4-5254$$a5254 - Neuroscientific Data Analytics and AI (POF4-525)$$cPOF4-525$$fPOF IV$$x0
000891733 588__ $$aDataset connected to DataCite
000891733 7001_ $$0P:(DE-Juel1)172811$$aWeis, Susanne$$b1
000891733 7001_ $$0P:(DE-Juel1)161225$$aGenon, Sarah$$b2
000891733 7001_ $$0P:(DE-Juel1)131678$$aEickhoff, Simon B.$$b3$$eCorresponding author
000891733 773__ $$0PERI:(DE-600)2885046-4$$a10.1038/s41562-021-01085-w$$p431–432$$tNature human behaviour$$v5$$x2397-3374$$y2021
000891733 8564_ $$uhttps://juser.fz-juelich.de/record/891733/files/ML_in_psychiatry_comment_on_Just_etal_rev2_final.pdf$$yPublished on 2021-04-05. Available in OpenAccess from 2021-10-05.$$zStatID:(DE-HGF)0510
000891733 8564_ $$uhttps://juser.fz-juelich.de/record/891733/files/s41562-021-01085-w.pdf$$yRestricted
000891733 909CO $$ooai:juser.fz-juelich.de:891733$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
000891733 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)177727$$aForschungszentrum Jülich$$b0$$kFZJ
000891733 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)172811$$aForschungszentrum Jülich$$b1$$kFZJ
000891733 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)161225$$aForschungszentrum Jülich$$b2$$kFZJ
000891733 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)131678$$aForschungszentrum Jülich$$b3$$kFZJ
000891733 9131_ $$0G:(DE-HGF)POF4-525$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5254$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vDecoding Brain Organization and Dysfunction$$x0
000891733 9141_ $$y2021
000891733 915__ $$0StatID:(DE-HGF)0200$$2StatID$$aDBCoverage$$bSCOPUS$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)0300$$2StatID$$aDBCoverage$$bMedline$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)0130$$2StatID$$aDBCoverage$$bSocial Sciences Citation Index$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)0530$$2StatID$$aEmbargoed OpenAccess
000891733 915__ $$0StatID:(DE-HGF)0100$$2StatID$$aJCR$$bNAT HUM BEHAV : 2019$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)1180$$2StatID$$aDBCoverage$$bCurrent Contents - Social and Behavioral Sciences$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)9910$$2StatID$$aIF >= 10$$bNAT HUM BEHAV : 2019$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)1030$$2StatID$$aDBCoverage$$bCurrent Contents - Life Sciences$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)0113$$2StatID$$aWoS$$bScience Citation Index Expanded$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)0150$$2StatID$$aDBCoverage$$bWeb of Science Core Collection$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)0160$$2StatID$$aDBCoverage$$bEssential Science Indicators$$d2021-01-30
000891733 915__ $$0StatID:(DE-HGF)0199$$2StatID$$aDBCoverage$$bClarivate Analytics Master Journal List$$d2021-01-30
000891733 920__ $$lyes
000891733 9201_ $$0I:(DE-Juel1)INM-7-20090406$$kINM-7$$lGehirn & Verhalten$$x0
000891733 980__ $$ajournal
000891733 980__ $$aVDB
000891733 980__ $$aUNRESTRICTED
000891733 980__ $$aI:(DE-Juel1)INM-7-20090406
000891733 9801_ $$aFullTexts