Preprint FZJ-2025-01491

Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites


2024

arXiv [10.34734/FZJ-2025-01491]

Abstract: Machine learning (ML) models benefit from large datasets. Because collecting data in biomedical domains is costly and challenging, combining datasets has become common practice. However, datasets acquired under different conditions can contain undesired site-specific variability. Data harmonization methods aim to remove this site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance differs across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
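The site-target dependence mentioned in the abstract can be illustrated with a small simulation. The sketch below is not ComBat and not PrettYharmonize; it is a toy example under assumed values (two sites, a 90%/10% class split, an additive site offset of 2.0) with hypothetical function names. It shows only that a feature with no class signal can look predictive when class balance differs across sites, and that a simple per-site mean centering, used here as a stand-in for harmonization, removes that spurious signal.

```python
import numpy as np


def simulate_two_sites(n_per_site=2000, site_shift=2.0, seed=0):
    """Toy data: class balance differs by site; the feature has no class signal."""
    rng = np.random.default_rng(seed)
    site = np.repeat([0, 1], n_per_site)
    # Assumed imbalance: site 0 is 90% positive, site 1 is 10% positive.
    p_positive = np.where(site == 0, 0.9, 0.1)
    y = (rng.random(site.size) < p_positive).astype(int)
    # Feature is pure noise plus an additive site offset (assumed site effect).
    x = rng.normal(size=site.size) + site_shift * site
    return x, y, site


def threshold_accuracy(x, y):
    """Accuracy of the rule 'x above its overall mean', in the better direction."""
    pred = (x > x.mean()).astype(int)
    acc = float((pred == y).mean())
    return max(acc, 1.0 - acc)


def remove_site_mean(x, site):
    """Center the feature within each site (a stand-in, not ComBat)."""
    out = x.astype(float).copy()
    for s in np.unique(site):
        mask = site == s
        out[mask] -= x[mask].mean()
    return out


if __name__ == "__main__":
    x, y, site = simulate_two_sites()
    print("raw feature accuracy:", threshold_accuracy(x, y))
    print("site-centered accuracy:",
          threshold_accuracy(remove_site_mean(x, site), y))
```

In this toy setting the raw feature predicts the class well above chance purely through the site offset, while the site-centered feature falls back to roughly chance level. The leakage problem the abstract describes arises when a harmonization step itself uses the target labels, which this sketch deliberately does not do.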


Contributing Institute(s):
  1. Gehirn & Verhalten (INM-7)
Research Program(s):
  1. 5254 - Neuroscientific Data Analytics and AI (POF4-525)

Appears in the scientific report 2024
Database coverage:
OpenAccess

The record appears in these collections:
Institute Collections > INM > INM-7
Document types > Reports > Preprints
Workflow collections > Public records
Publications database
Open Access

 Record created 2025-01-30, last modified 2025-02-24

