Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets

Sharafutdinov, Konstantin; Bickenbach, Johannes; Schuppert, Andreas; Nikulina, Kateryna; Fritsch, Sebastian Johannes; Polzin, Richard; Bhat, Jayesh S.; Marx, Gernot; E. Samadi, Moein; Mayer, Hannah

doi:10.3389/fdata.2022.603429

Marc 21

001			910683
005			20230123110716.0
024	7	_	\|a 10.3389/fdata.2022.603429 \|2 doi
024	7	_	\|a 2128/32270 \|2 Handle
024	7	_	\|a 36387013 \|2 pmid
024	7	_	\|a WOS:000885597500001 \|2 WOS
037	_	_	\|a FZJ-2022-04055
041	_	_	\|a English
082	_	_	\|a 004
100	1	_	\|a Sharafutdinov, Konstantin \|0 P:(DE-HGF)0 \|b 0 \|e Corresponding author
245	_	_	\|a Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets
260	_	_	\|a Lausanne \|c 2022 \|b Frontiers Media
336	7	_	\|a article \|2 DRIVER
336	7	_	\|a Output Types/Journal article \|2 DataCite
336	7	_	\|a Journal Article \|b journal \|m journal \|0 PUB:(DE-HGF)16 \|s 1672836012_10869 \|2 PUB:(DE-HGF)
336	7	_	\|a ARTICLE \|2 BibTeX
336	7	_	\|a JOURNAL_ARTICLE \|2 ORCID
336	7	_	\|a Journal Article \|0 0 \|2 EndNote
520	_	_	\|a Machine learning (ML) models are developed on a learning dataset covering only a small part of the data of interest. If model predictions are accurate for the learning dataset but fail for unseen data, then the generalization error is considered high. This problem manifests itself within all major sub-fields of ML but is especially relevant in medical applications. Clinical data structures, patient cohorts, and clinical protocols may be highly biased among hospitals such that sampling of representative learning datasets to train ML models remains a challenge. As ML models exhibit poor predictive performance over data ranges sparsely or not covered by the learning dataset, in this study, we propose a novel method to assess their generalization capability among different hospitals based on the convex hull (CH) overlap between multivariate datasets. To reduce dimensionality effects, we used a two-step approach. First, CH analysis was applied to find the mean CH coverage between each pair of datasets, resulting in an upper bound of the prediction range. Second, 4 types of ML models were trained to classify the origin of a dataset (i.e., from which hospital) and to estimate differences in datasets with respect to underlying distributions. To demonstrate the applicability of our method, we used 4 critical-care patient datasets from different hospitals in Germany and the USA. We estimated the similarity of these populations and investigated whether ML models developed on one dataset can be reliably applied to another one. We show that the strongest drop in performance was associated with the poor intersection of convex hulls in the corresponding hospitals' datasets and with a high performance of ML methods for dataset discrimination. Hence, we suggest the application of our pipeline as a first tool to assess the transferability of trained models. We emphasize that datasets from different hospitals represent heterogeneous data sources, and the transfer from one database to another should be performed with utmost care to avoid adverse implications during real-world applications of the developed models. Further research is needed to develop methods for the adaptation of ML models to new hospitals. In addition, more work should be aimed at the creation of gold-standard datasets that are large and diverse with data from varied application sites.
536	_	_	\|a 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511) \|0 G:(DE-HGF)POF4-5112 \|c POF4-511 \|f POF IV \|x 0
536	_	_	\|a SMITH - Medizininformatik-Konsortium - Beitrag Forschungszentrum Jülich (01ZZ1803M) \|0 G:(BMBF)01ZZ1803M \|c 01ZZ1803M \|x 1
588	_	_	\|a Dataset connected to CrossRef, Journals: juser.fz-juelich.de
700	1	_	\|a Bhat, Jayesh S. \|0 P:(DE-HGF)0 \|b 1
700	1	_	\|a Fritsch, Sebastian Johannes \|0 P:(DE-Juel1)185651 \|b 2
700	1	_	\|a Nikulina, Kateryna \|0 P:(DE-HGF)0 \|b 3
700	1	_	\|a E. Samadi, Moein \|0 P:(DE-HGF)0 \|b 4
700	1	_	\|a Polzin, Richard \|0 P:(DE-HGF)0 \|b 5
700	1	_	\|a Mayer, Hannah \|0 P:(DE-HGF)0 \|b 6
700	1	_	\|a Marx, Gernot \|0 P:(DE-HGF)0 \|b 7
700	1	_	\|a Bickenbach, Johannes \|0 P:(DE-HGF)0 \|b 8
700	1	_	\|a Schuppert, Andreas \|0 P:(DE-HGF)0 \|b 9
773	_	_	\|a 10.3389/fdata.2022.603429 \|g Vol. 5, p. 603429 \|0 PERI:(DE-600)2957497-3 \|p 603429 \|t Frontiers in Big Data \|v 5 \|y 2022 \|x 2624-909X
856	4	_	\|u https://juser.fz-juelich.de/record/910683/files/Sharafutdinov_et_al%20Convex%20hull%20analysis.pdf \|y OpenAccess
909	C	O	\|o oai:juser.fz-juelich.de:910683 \|p openaire \|p open_access \|p VDB \|p driver \|p dnbdelivery
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 2 \|6 P:(DE-Juel1)185651
913	1	_	\|a DE-HGF \|b Key Technologies \|l Engineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action \|1 G:(DE-HGF)POF4-510 \|0 G:(DE-HGF)POF4-511 \|3 G:(DE-HGF)POF4 \|2 G:(DE-HGF)POF4-500 \|4 G:(DE-HGF)POF \|v Enabling Computational- & Data-Intensive Science and Engineering \|9 G:(DE-HGF)POF4-5112 \|x 0
914	1	_	\|y 2022
915	_	_	\|a Creative Commons Attribution CC BY 4.0 \|0 LIC:(DE-HGF)CCBY4 \|2 HGFVOC
915	_	_	\|a OpenAccess \|0 StatID:(DE-HGF)0510 \|2 StatID
915	_	_	\|a Article Processing Charges \|0 StatID:(DE-HGF)0561 \|2 StatID \|d 2020-09-05
915	_	_	\|a Fees \|0 StatID:(DE-HGF)0700 \|2 StatID \|d 2020-09-05
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0200 \|2 StatID \|b SCOPUS \|d 2022-11-05
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0300 \|2 StatID \|b Medline \|d 2022-11-05
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0501 \|2 StatID \|b DOAJ Seal \|d 2021-05-13T09:31:41Z
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0500 \|2 StatID \|b DOAJ \|d 2021-05-13T09:31:41Z
915	_	_	\|a Peer Review \|0 StatID:(DE-HGF)0030 \|2 StatID \|b DOAJ : Blind peer review \|d 2021-05-13T09:31:41Z
915	_	_	\|a Creative Commons Attribution CC BY (No Version) \|0 LIC:(DE-HGF)CCBYNV \|2 V:(DE-HGF) \|b DOAJ \|d 2021-05-13T09:31:41Z
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0199 \|2 StatID \|b Clarivate Analytics Master Journal List \|d 2022-11-05
915	_	_	\|a WoS \|0 StatID:(DE-HGF)0112 \|2 StatID \|b Emerging Sources Citation Index \|d 2022-11-05
915	_	_	\|a DBCoverage \|0 StatID:(DE-HGF)0150 \|2 StatID \|b Web of Science Core Collection \|d 2022-11-05
915	_	_	\|a Article Processing Charges \|0 StatID:(DE-HGF)0561 \|2 StatID \|d 2022-11-05
915	_	_	\|a Fees \|0 StatID:(DE-HGF)0700 \|2 StatID \|d 2022-11-05
920	_	_	\|l no
920	1	_	\|0 I:(DE-Juel1)JSC-20090406 \|k JSC \|l Jülich Supercomputing Center \|x 0
980	_	_	\|a journal
980	_	_	\|a VDB
980	_	_	\|a I:(DE-Juel1)JSC-20090406
980	_	_	\|a UNRESTRICTED
980	1	_	\|a FullTexts
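
Note on the abstract (field 520): the first step it describes, convex hull (CH) coverage between two datasets, can be illustrated with a minimal sketch. This is not the authors' implementation; the record does not give their procedure. The sketch assumes a two-dimensional projection of the data (two features per patient), and the names `convex_hull`, `inside_hull`, `coverage`, `site_a`, and `site_b` are hypothetical, as are the toy data points. Coverage is taken here to mean the fraction of one dataset's points that fall inside or on the convex hull of the other.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def inside_hull(hull: Sequence[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False  # a degenerate hull encloses no area
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def coverage(reference: Sequence[Point], other: Sequence[Point]) -> float:
    """Fraction of points in `other` covered by the convex hull of `reference`."""
    if not other:
        return 0.0
    hull = convex_hull(reference)
    return sum(inside_hull(hull, p) for p in other) / len(other)


# Hypothetical toy data: two features measured at two sites.
site_a = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
site_b = [(1, 1), (3, 3), (5, 5), (6, 1)]

print(coverage(site_a, site_b))  # share of site B inside site A's hull
print(coverage(site_b, site_a))  # share of site A inside site B's hull
```

Coverage is not symmetric: a model trained at one site may cover most of another site's data range while the reverse does not hold, which is why both directions are computed. For more than two features, a general-purpose geometry library would be needed instead of this 2D routine.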
