001038496 001__ 1038496
001038496 005__ 20250224080206.0
001038496 0247_ $$2datacite_doi$$a10.34734/FZJ-2025-01491
001038496 037__ $$aFZJ-2025-01491
001038496 041__ $$aEnglish
001038496 1001_ $$0P:(DE-Juel1)194707$$aNieto, Nicolas$$b0$$eCorresponding author
001038496 245__ $$aImpact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites
001038496 260__ $$c2024
001038496 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1739259028_2007
001038496 3367_ $$2ORCID$$aWORKING_PAPER
001038496 3367_ $$028$$2EndNote$$aElectronic Article
001038496 3367_ $$2DRIVER$$apreprint
001038496 3367_ $$2BibTeX$$aARTICLE
001038496 3367_ $$2DataCite$$aOutput Types/Working Paper
001038496 520__ $$aMachine learning (ML) models benefit from large datasets. Collecting data in biomedical domains is costly and challenging; hence, combining datasets has become common practice. However, datasets obtained under different conditions can exhibit undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance differs across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
001038496 536__ $$0G:(DE-HGF)POF4-5254$$a5254 - Neuroscientific Data Analytics and AI (POF4-525)$$cPOF4-525$$fPOF IV$$x0
001038496 588__ $$aDataset connected to DataCite
001038496 7001_ $$0P:(DE-Juel1)131678$$aEickhoff, Simon$$b1
001038496 7001_ $$0P:(DE-HGF)0$$aJung, Christian$$b2
001038496 7001_ $$0P:(DE-HGF)0$$aReuter, Martin$$b3
001038496 7001_ $$0P:(DE-HGF)0$$aDiers, Kersten$$b4
001038496 7001_ $$0P:(DE-HGF)0$$aKelm, Malte$$b5
001038496 7001_ $$0P:(DE-HGF)0$$aLichtenberg, Artur$$b6
001038496 7001_ $$0P:(DE-Juel1)185083$$aRaimondo, Federico$$b7
001038496 7001_ $$0P:(DE-Juel1)172843$$aPatil, Kaustubh$$b8$$ufzj
001038496 773__ $$tArxiv$$y2024
001038496 8564_ $$uhttps://juser.fz-juelich.de/record/1038496/files/2410.19643v3.pdf$$yOpenAccess
001038496 909CO $$ooai:juser.fz-juelich.de:1038496$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
001038496 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001038496 9141_ $$y2024
001038496 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)194707$$aForschungszentrum Jülich$$b0$$kFZJ
001038496 9101_ $$0I:(DE-HGF)0$$6P:(DE-Juel1)194707$$aHHU Düsseldorf$$b0
001038496 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)131678$$aForschungszentrum Jülich$$b1$$kFZJ
001038496 9101_ $$0I:(DE-HGF)0$$6P:(DE-Juel1)131678$$aHHU Düsseldorf$$b1
001038496 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)185083$$aForschungszentrum Jülich$$b7$$kFZJ
001038496 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)172843$$aForschungszentrum Jülich$$b8$$kFZJ
001038496 9131_ $$0G:(DE-HGF)POF4-525$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5254$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vDecoding Brain Organization and Dysfunction$$x0
001038496 920__ $$lyes
001038496 9201_ $$0I:(DE-Juel1)INM-7-20090406$$kINM-7$$lGehirn & Verhalten$$x0
001038496 980__ $$apreprint
001038496 980__ $$aVDB
001038496 980__ $$aUNRESTRICTED
001038496 980__ $$aI:(DE-Juel1)INM-7-20090406
001038496 9801_ $$aFullTexts