Confound-leakage: Confound Removal in Machine Learning Leads to Leakage

Hamdan, Sami; Schwender, Holger; Eickhoff, Simon; Polier, Georg von; Patil, Kaustubh; Weis, Susanne; Love, Bradley C
doi:10.1093/gigascience/giad071
001010545 001__ 1010545
001010545 005__ 20240429104900.0
001010545 0247_ $$2doi$$a10.1093/gigascience/giad071
001010545 0247_ $$2datacite_doi$$a10.34734/FZJ-2023-03119
001010545 0247_ $$2pmid$$a37776368
001010545 0247_ $$2WOS$$aWOS:001189196000001
001010545 037__ $$aFZJ-2023-03119
001010545 082__ $$a610
001010545 1001_ $$0P:(DE-Juel1)184874$$aHamdan, Sami$$b0$$ufzj
001010545 245__ $$aConfound-leakage: Confound Removal in Machine Learning Leads to Leakage
001010545 260__ $$aOxford$$bOxford University Press$$c2023
001010545 3367_ $$2DRIVER$$aarticle
001010545 3367_ $$2DataCite$$aOutput Types/Journal article
001010545 3367_ $$0PUB:(DE-HGF)16$$2PUB:(DE-HGF)$$aJournal Article$$bjournal$$mjournal$$s1697798276_11492
001010545 3367_ $$2BibTeX$$aARTICLE
001010545 3367_ $$2ORCID$$aJOURNAL_ARTICLE
001010545 3367_ $$00$$2EndNote$$aJournal Article
001010545 500__ $$aThis work was partly supported by the Helmholtz-AI project DeGen (ZT-I-PF-5-078), the Helmholtz Portfolio Theme “Supercomputing and Modeling for the Human Brain,” and Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project-ID 431549029–SFB 1451 project B05.
001010545 520__ $$aBackground: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood. Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework that uses the target as a confound, we show that information leaked via CR can increase null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not due to revealing of information. We then demonstrate the danger of confound-leakage in a real-world clinical application where the accuracy of predicting attention-deficit/hyperactivity disorder is overestimated using speech-derived features when using depression as a confound. Conclusions: Mishandling or even amplifying confounding effects when building ML models due to confound-leakage, as shown, can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and provided guidelines for dealing with it can help create more robust and trustworthy ML models.
001010545 536__ $$0G:(DE-HGF)POF4-5252$$a5252 - Brain Dysfunction and Plasticity (POF4-525)$$cPOF4-525$$fPOF IV$$x0
001010545 536__ $$0G:(GEPRIS)458640473$$aSFB 1451 B05 - Einzelfallvorhersagen der motorischen Fähigkeiten bei Gesunden und Patienten mit motorischen Störungen (B05) (458640473)$$c458640473$$x1
001010545 588__ $$aDataset connected to DataCite
001010545 7001_ $$0P:(DE-HGF)0$$aLove, Bradley C$$b1
001010545 7001_ $$0P:(DE-HGF)0$$aPolier, Georg von$$b2
001010545 7001_ $$0P:(DE-Juel1)172811$$aWeis, Susanne$$b3$$ufzj
001010545 7001_ $$0P:(DE-HGF)0$$aSchwender, Holger$$b4
001010545 7001_ $$0P:(DE-Juel1)131678$$aEickhoff, Simon$$b5$$ufzj
001010545 7001_ $$0P:(DE-Juel1)172843$$aPatil, Kaustubh$$b6$$eCorresponding author$$ufzj
001010545 773__ $$0PERI:(DE-600)2708999-X$$a10.1093/gigascience/giad071$$gVol. 12, p. giad071$$pgiad071$$tGigaScience$$v12$$x2047-217X$$y2023
001010545 8564_ $$uhttps://juser.fz-juelich.de/record/1010545/files/Invoice_SOA23LT000561.pdf
001010545 8564_ $$uhttps://juser.fz-juelich.de/record/1010545/files/giad071.pdf$$yOpenAccess
001010545 8767_ $$8SOA23LT000561$$92023-08-21$$a1200196293$$d2023-09-07$$eAPC$$jZahlung erfolgt
001010545 909CO $$ooai:juser.fz-juelich.de:1010545$$pdnbdelivery$$popenCost$$pVDB$$pdriver$$pOpenAPC$$popen_access$$popenaire
001010545 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)184874$$aForschungszentrum Jülich$$b0$$kFZJ
001010545 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)172811$$aForschungszentrum Jülich$$b3$$kFZJ
001010545 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)131678$$aForschungszentrum Jülich$$b5$$kFZJ
001010545 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)172843$$aForschungszentrum Jülich$$b6$$kFZJ
001010545 9131_ $$0G:(DE-HGF)POF4-525$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5252$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vDecoding Brain Organization and Dysfunction$$x0
001010545 9141_ $$y2023
001010545 915pc $$0PC:(DE-HGF)0000$$2APC$$aAPC keys set
001010545 915pc $$0PC:(DE-HGF)0003$$2APC$$aDOAJ Journal
001010545 915__ $$0StatID:(DE-HGF)0501$$2StatID$$aDBCoverage$$bDOAJ Seal$$d2022-07-04T06:14:53Z
001010545 915__ $$0StatID:(DE-HGF)0561$$2StatID$$aArticle Processing Charges$$d2023-08-24
001010545 915__ $$0StatID:(DE-HGF)0113$$2StatID$$aWoS$$bScience Citation Index Expanded$$d2023-08-24
001010545 915__ $$0StatID:(DE-HGF)0700$$2StatID$$aFees$$d2023-08-24
001010545 915__ $$0StatID:(DE-HGF)0500$$2StatID$$aDBCoverage$$bDOAJ$$d2022-07-04T06:14:53Z
001010545 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001010545 915__ $$0StatID:(DE-HGF)1190$$2StatID$$aDBCoverage$$bBiological Abstracts$$d2023-08-24
001010545 915__ $$0LIC:(DE-HGF)CCBY4$$2HGFVOC$$aCreative Commons Attribution CC BY 4.0
001010545 915__ $$0StatID:(DE-HGF)0160$$2StatID$$aDBCoverage$$bEssential Science Indicators$$d2023-08-24
001010545 915__ $$0StatID:(DE-HGF)0100$$2StatID$$aJCR$$bGIGASCIENCE : 2022$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)0200$$2StatID$$aDBCoverage$$bSCOPUS$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)0300$$2StatID$$aDBCoverage$$bMedline$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)0320$$2StatID$$aDBCoverage$$bPubMed Central$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)0030$$2StatID$$aPeer Review$$bDOAJ : Open peer review$$d2022-07-04T06:14:53Z
001010545 915__ $$0StatID:(DE-HGF)0600$$2StatID$$aDBCoverage$$bEbsco Academic Search$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)0030$$2StatID$$aPeer Review$$bASC$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)0199$$2StatID$$aDBCoverage$$bClarivate Analytics Master Journal List$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)1050$$2StatID$$aDBCoverage$$bBIOSIS Previews$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)0150$$2StatID$$aDBCoverage$$bWeb of Science Core Collection$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)1030$$2StatID$$aDBCoverage$$bCurrent Contents - Life Sciences$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)1060$$2StatID$$aDBCoverage$$bCurrent Contents - Agriculture, Biology and Environmental Sciences$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)1040$$2StatID$$aDBCoverage$$bZoological Record$$d2023-10-26
001010545 915__ $$0StatID:(DE-HGF)9905$$2StatID$$aIF >= 5$$bGIGASCIENCE : 2022$$d2023-10-26
001010545 920__ $$lyes
001010545 9201_ $$0I:(DE-Juel1)INM-7-20090406$$kINM-7$$lGehirn & Verhalten$$x0
001010545 980__ $$ajournal
001010545 980__ $$aVDB
001010545 980__ $$aUNRESTRICTED
001010545 980__ $$aI:(DE-Juel1)INM-7-20090406
001010545 980__ $$aAPC
001010545 9801_ $$aAPC
001010545 9801_ $$aFullTexts