Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites

Nieto, Nicolas; Raimondo, Federico; Lichtenberg, Artur; Reuter, Martin; Patil, Kaustubh; Jung, Christian; Diers, Kersten; Kelm, Malte; Eickhoff, Simon

Marc 21

001			1038496
005			20250224080206.0
024	7	_	\|2 datacite_doi \|a 10.34734/FZJ-2025-01491
037	_	_	\|a FZJ-2025-01491
041	_	_	\|a English
100	1	_	\|0 P:(DE-Juel1)194707 \|a Nieto, Nicolas \|b 0 \|e Corresponding author
245	_	_	\|a Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites
260	_	_	\|c 2024
336	7	_	\|0 PUB:(DE-HGF)25 \|2 PUB:(DE-HGF) \|a Preprint \|b preprint \|m preprint \|s 1739259028_2007
336	7	_	\|2 ORCID \|a WORKING_PAPER
336	7	_	\|0 28 \|2 EndNote \|a Electronic Article
336	7	_	\|2 DRIVER \|a preprint
336	7	_	\|2 BibTeX \|a ARTICLE
336	7	_	\|2 DataCite \|a Output Types/Working Paper
520	_	_	\|a Machine learning (ML) models benefit from large datasets. Collecting data in biomedical domains is costly and challenging; hence, combining datasets has become common practice. However, datasets obtained under different conditions can exhibit undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance is not equal across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
536	_	_	\|0 G:(DE-HGF)POF4-5254 \|a 5254 - Neuroscientific Data Analytics and AI (POF4-525) \|c POF4-525 \|f POF IV \|x 0
588	_	_	\|a Dataset connected to DataCite
700	1	_	\|0 P:(DE-Juel1)131678 \|a Eickhoff, Simon \|b 1
700	1	_	\|0 P:(DE-HGF)0 \|a Jung, Christian \|b 2
700	1	_	\|0 P:(DE-HGF)0 \|a Reuter, Martin \|b 3
700	1	_	\|0 P:(DE-HGF)0 \|a Diers, Kersten \|b 4
700	1	_	\|0 P:(DE-HGF)0 \|a Kelm, Malte \|b 5
700	1	_	\|0 P:(DE-HGF)0 \|a Lichtenberg, Artur \|b 6
700	1	_	\|0 P:(DE-Juel1)185083 \|a Raimondo, Federico \|b 7
700	1	_	\|0 P:(DE-Juel1)172843 \|a Patil, Kaustubh \|b 8 \|u fzj
773	_	_	\|t arXiv \|y 2024
856	4	_	\|u https://juser.fz-juelich.de/record/1038496/files/2410.19643v3.pdf \|y OpenAccess
909	C	O	\|o oai:juser.fz-juelich.de:1038496 \|p openaire \|p open_access \|p VDB \|p driver \|p dnbdelivery
910	1	_	\|0 I:(DE-588b)5008462-8 \|6 P:(DE-Juel1)194707 \|a Forschungszentrum Jülich \|b 0 \|k FZJ
910	1	_	\|0 I:(DE-HGF)0 \|6 P:(DE-Juel1)194707 \|a HHU Düsseldorf \|b 0
910	1	_	\|0 I:(DE-588b)5008462-8 \|6 P:(DE-Juel1)131678 \|a Forschungszentrum Jülich \|b 1 \|k FZJ
910	1	_	\|0 I:(DE-HGF)0 \|6 P:(DE-Juel1)131678 \|a HHU Düsseldorf \|b 1
910	1	_	\|0 I:(DE-588b)5008462-8 \|6 P:(DE-Juel1)185083 \|a Forschungszentrum Jülich \|b 7 \|k FZJ
910	1	_	\|0 I:(DE-588b)5008462-8 \|6 P:(DE-Juel1)172843 \|a Forschungszentrum Jülich \|b 8 \|k FZJ
913	1	_	\|0 G:(DE-HGF)POF4-525 \|1 G:(DE-HGF)POF4-520 \|2 G:(DE-HGF)POF4-500 \|3 G:(DE-HGF)POF4 \|4 G:(DE-HGF)POF \|9 G:(DE-HGF)POF4-5254 \|a DE-HGF \|b Key Technologies \|l Natural, Artificial and Cognitive Information Processing \|v Decoding Brain Organization and Dysfunction \|x 0
914	1	_	\|y 2024
915	_	_	\|0 StatID:(DE-HGF)0510 \|2 StatID \|a OpenAccess
920	_	_	\|l yes
920	1	_	\|0 I:(DE-Juel1)INM-7-20090406 \|k INM-7 \|l Gehirn & Verhalten \|x 0
980	_	_	\|a preprint
980	_	_	\|a VDB
980	_	_	\|a UNRESTRICTED
980	_	_	\|a I:(DE-Juel1)INM-7-20090406
980	1	_	\|a FullTexts
