Confound-leakage: Confound Removal in Machine Learning Leads to Leakage

Hamdan, Sami; Schwender, Holger; Eickhoff, Simon; Polier, Georg von; Patil, Kaustubh; Weis, Susanne; Love, Bradley C

doi:10.1093/gigascience/giad071

Marc 21

001			1010545
005			20240429104900.0
024	7	_	\|a 10.1093/gigascience/giad071 \|2 doi
024	7	_	\|a 10.34734/FZJ-2023-03119 \|2 datacite_doi
024	7	_	\|a 37776368 \|2 pmid
024	7	_	\|a WOS:001189196000001 \|2 WOS
037	_	_	\|a FZJ-2023-03119
082	_	_	\|a 610
100	1	_	\|a Hamdan, Sami \|0 P:(DE-Juel1)184874 \|b 0 \|u fzj
245	_	_	\|a Confound-leakage: Confound Removal in Machine Learning Leads to Leakage
260	_	_	\|a Oxford \|c 2023 \|b Oxford University Press
336	7	_	\|a article \|2 DRIVER
336	7	_	\|a Output Types/Journal article \|2 DataCite
336	7	_	\|a Journal Article \|b journal \|m journal \|0 PUB:(DE-HGF)16 \|s 1697798276_11492 \|2 PUB:(DE-HGF)
336	7	_	\|a ARTICLE \|2 BibTeX
336	7	_	\|a JOURNAL_ARTICLE \|2 ORCID
336	7	_	\|a Journal Article \|0 0 \|2 EndNote
500	_	_	\|a This work was partly supported by the Helmholtz-AI project DeGen (ZT-I-PF-5-078), the Helmholtz Portfolio Theme “Supercomputing and Modeling for the Human Brain,” and Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project-ID 431549029–SFB 1451 project B05.
520	_	_	\|a Background: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood. Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework that uses the target as a confound, we show that information leaked via CR can increase null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not due to revealing of information. We then demonstrate the danger of confound-leakage in a real-world clinical application where the accuracy of predicting attention-deficit/hyperactivity disorder is overestimated using speech-derived features when using depression as a confound. Conclusions: Mishandling or even amplifying confounding effects when building ML models due to confound-leakage, as shown, can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and provided guidelines for dealing with it can help create more robust and trustworthy ML models.
536	_	_	\|a 5252 - Brain Dysfunction and Plasticity (POF4-525) \|0 G:(DE-HGF)POF4-5252 \|c POF4-525 \|f POF IV \|x 0
536	_	_	\|a SFB 1451 B05 - Einzelfallvorhersagen der motorischen Fähigkeiten bei Gesunden und Patienten mit motorischen Störungen (B05) (458640473) \|0 G:(GEPRIS)458640473 \|c 458640473 \|x 1
588	_	_	\|a Dataset connected to DataCite
700	1	_	\|a Love, Bradley C \|0 P:(DE-HGF)0 \|b 1
700	1	_	\|a Polier, Georg von \|0 P:(DE-HGF)0 \|b 2
700	1	_	\|a Weis, Susanne \|0 P:(DE-Juel1)172811 \|b 3 \|u fzj
700	1	_	\|a Schwender, Holger \|0 P:(DE-HGF)0 \|b 4
700	1	_	\|a Eickhoff, Simon \|0 P:(DE-Juel1)131678 \|b 5 \|u fzj
700	1	_	\|a Patil, Kaustubh \|0 P:(DE-Juel1)172843 \|b 6 \|e Corresponding author \|u fzj
773	_	_	\|a 10.1093/gigascience/giad071 \|g Vol. 12, p. giad071 \|0 PERI:(DE-600)2708999-X \|p giad071 \|t GigaScience \|v 12 \|y 2023 \|x 2047-217X
856	4	_	\|u https://juser.fz-juelich.de/record/1010545/files/Invoice_SOA23LT000561.pdf
856	4	_	\|y OpenAccess \|u https://juser.fz-juelich.de/record/1010545/files/giad071.pdf
909	C	O	\|o oai:juser.fz-juelich.de:1010545 \|p openaire \|p open_access \|p OpenAPC \|p driver \|p VDB \|p openCost \|p dnbdelivery
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 0 \|6 P:(DE-Juel1)184874
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 3 \|6 P:(DE-Juel1)172811
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 5 \|6 P:(DE-Juel1)131678
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 6 \|6 P:(DE-Juel1)172843
913	1	_	\|a DE-HGF \|b Key Technologies \|l Natural, Artificial and Cognitive Information Processing \|1 G:(DE-HGF)POF4-520 \|0 G:(DE-HGF)POF4-525 \|3 G:(DE-HGF)POF4 \|2 G:(DE-HGF)POF4-500 \|4 G:(DE-HGF)POF \|v Decoding Brain Organization and Dysfunction \|9 G:(DE-HGF)POF4-5252 \|x 0
914	1	_	\|y 2023
915	p	c	\|a APC keys set \|0 PC:(DE-HGF)0000 \|2 APC
915	p	c	\|a DOAJ Journal \|0 PC:(DE-HGF)0003 \|2 APC
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0501 \|2 StatID \|b DOAJ Seal \|d 2022-07-04T06:14:53Z
915	_	_	\|a Article Processing Charges \|0 StatID:(DE-HGF)0561 \|2 StatID \|d 2023-08-24
915	_	_	\|a WoS \|0 StatID:(DE-HGF)0113 \|2 StatID \|b Science Citation Index Expanded \|d 2023-08-24
915	_	_	\|a Fees \|0 StatID:(DE-HGF)0700 \|2 StatID \|d 2023-08-24
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0500 \|2 StatID \|b DOAJ \|d 2022-07-04T06:14:53Z
915	_	_	\|a OpenAccess \|0 StatID:(DE-HGF)0510 \|2 StatID
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)1190 \|2 StatID \|b Biological Abstracts \|d 2023-08-24
915	_	_	\|a Creative Commons Attribution CC BY 4.0 \|0 LIC:(DE-HGF)CCBY4 \|2 HGFVOC
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0160 \|2 StatID \|b Essential Science Indicators \|d 2023-08-24
915	_	_	\|a JCR \|0 StatID:(DE-HGF)0100 \|2 StatID \|b GIGASCIENCE : 2022 \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0200 \|2 StatID \|b SCOPUS \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0300 \|2 StatID \|b Medline \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0320 \|2 StatID \|b PubMed Central \|d 2023-10-26
915	_	_	\|a Peer Review \|0 StatID:(DE-HGF)0030 \|2 StatID \|b DOAJ : Open peer review \|d 2022-07-04T06:14:53Z
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0600 \|2 StatID \|b Ebsco Academic Search \|d 2023-10-26
915	_	_	\|a Peer Review \|0 StatID:(DE-HGF)0030 \|2 StatID \|b ASC \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0199 \|2 StatID \|b Clarivate Analytics Master Journal List \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)1050 \|2 StatID \|b BIOSIS Previews \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0150 \|2 StatID \|b Web of Science Core Collection \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)1030 \|2 StatID \|b Current Contents - Life Sciences \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)1060 \|2 StatID \|b Current Contents - Agriculture, Biology and Environmental Sciences \|d 2023-10-26
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)1040 \|2 StatID \|b Zoological Record \|d 2023-10-26
915	_	_	\|a IF >= 5 \|0 StatID:(DE-HGF)9905 \|2 StatID \|b GIGASCIENCE : 2022 \|d 2023-10-26
920	_	_	\|l yes
920	1	_	\|0 I:(DE-Juel1)INM-7-20090406 \|k INM-7 \|l Gehirn & Verhalten \|x 0
980	_	_	\|a journal
980	_	_	\|a VDB
980	_	_	\|a UNRESTRICTED
980	_	_	\|a I:(DE-Juel1)INM-7-20090406
980	_	_	\|a APC
980	1	_	\|a APC
980	1	_	\|a FullTexts
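
The mechanism summarized in the abstract (field 520) can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes NumPy and scikit-learn, uses synthetic binary features that are independent of the target by construction, treats the target itself as the confound (as the abstract describes), and picks a decision tree as a stand-in for "nonlinear ML". All names (`simulate`, `fit_cr`, `apply_cr`, `holdout_accuracy`) and sample sizes are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_train, n_test, n_features = 500, 500, 3  # illustrative sizes (assumption)


def simulate(n):
    """Binary features and a binary target drawn independently of each other."""
    X = rng.integers(0, 2, size=(n, n_features)).astype(float)
    y = rng.integers(0, 2, size=n)
    return X, y


def fit_cr(X, confound):
    """Featurewise linear confound regression: one linear fit per feature."""
    return LinearRegression().fit(confound.reshape(-1, 1), X)


def apply_cr(model, X, confound):
    """Replace each feature by its residual after regressing out the confound."""
    return X - model.predict(confound.reshape(-1, 1))


def holdout_accuracy(X_a, y_a, X_b, y_b):
    """Fit a decision tree on (X_a, y_a) and score it on (X_b, y_b)."""
    clf = DecisionTreeClassifier(random_state=0).fit(X_a, y_a)
    return clf.score(X_b, y_b)


X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# CR is fitted on the training split only and then applied to both splits,
# with the target used as the confound.
cr = fit_cr(X_train, y_train)
X_train_cr = apply_cr(cr, X_train, y_train)
X_test_cr = apply_cr(cr, X_test, y_test)

baseline_acc = holdout_accuracy(X_train, y_train, X_test, y_test)
leak_acc = holdout_accuracy(X_train_cr, y_train, X_test_cr, y_test)

print(f"accuracy on raw features:      {baseline_acc:.2f}")
print(f"accuracy after confound removal: {leak_acc:.2f}")
```

In this sketch the raw features carry no information about the target, so any accuracy gain after CR cannot come from revealed signal. The gain arises because subtracting a slightly different offset per confound value turns each binary feature into four distinct residual values, which a tree can split on to recover the confound.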
