Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets

Sharafutdinov, Konstantin; Bickenbach, Johannes; Schuppert, Andreas; Nikulina, Kateryna; Fritsch, Sebastian Johannes; Polzin, Richard; Bhat, Jayesh S.; Marx, Gernot; E. Samadi, Moein; Mayer, Hannah
doi:10.3389/fdata.2022.603429
000910683 001__ 910683
000910683 005__ 20230123110716.0
000910683 0247_ $$2doi$$a10.3389/fdata.2022.603429
000910683 0247_ $$2Handle$$a2128/32270
000910683 0247_ $$2pmid$$a36387013
000910683 0247_ $$2WOS$$aWOS:000885597500001
000910683 037__ $$aFZJ-2022-04055
000910683 041__ $$aEnglish
000910683 082__ $$a004
000910683 1001_ $$0P:(DE-HGF)0$$aSharafutdinov, Konstantin$$b0$$eCorresponding author
000910683 245__ $$aApplication of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets
000910683 260__ $$aLausanne$$bFrontiers Media$$c2022
000910683 3367_ $$2DRIVER$$aarticle
000910683 3367_ $$2DataCite$$aOutput Types/Journal article
000910683 3367_ $$0PUB:(DE-HGF)16$$2PUB:(DE-HGF)$$aJournal Article$$bjournal$$mjournal$$s1672836012_10869
000910683 3367_ $$2BibTeX$$aARTICLE
000910683 3367_ $$2ORCID$$aJOURNAL_ARTICLE
000910683 3367_ $$00$$2EndNote$$aJournal Article
000910683 520__ $$aMachine learning (ML) models are developed on a learning dataset covering only a small part of the data of interest. If model predictions are accurate for the learning dataset but fail for unseen data then generalization error is considered high. This problem manifests itself within all major sub-fields of ML but is especially relevant in medical applications. Clinical data structures, patient cohorts, and clinical protocols may be highly biased among hospitals such that sampling of representative learning datasets to learn ML models remains a challenge. As ML models exhibit poor predictive performance over data ranges sparsely or not covered by the learning dataset, in this study, we propose a novel method to assess their generalization capability among different hospitals based on the convex hull (CH) overlap between multivariate datasets. To reduce dimensionality effects, we used a two-step approach. First, CH analysis was applied to find mean CH coverage between each of the two datasets, resulting in an upper bound of the prediction range. Second, 4 types of ML models were trained to classify the origin of a dataset (i.e., from which hospital) and to estimate differences in datasets with respect to underlying distributions. To demonstrate the applicability of our method, we used 4 critical-care patient datasets from different hospitals in Germany and USA. We estimated the similarity of these populations and investigated whether ML models developed on one dataset can be reliably applied to another one. We show that the strongest drop in performance was associated with the poor intersection of convex hulls in the corresponding hospitals' datasets and with a high performance of ML methods for dataset discrimination. Hence, we suggest the application of our pipeline as a first tool to assess the transferability of trained models. 
We emphasize that datasets from different hospitals represent heterogeneous data sources, and the transfer from one database to another should be performed with utmost care to avoid implications during real-world applications of the developed models. Further research is needed to develop methods for the adaptation of ML models to new hospitals. In addition, more work should be aimed at the creation of gold-standard datasets that are large and diverse with data from varied application sites.
000910683 536__ $$0G:(DE-HGF)POF4-5112$$a5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)$$cPOF4-511$$fPOF IV$$x0
000910683 536__ $$0G:(BMBF)01ZZ1803M$$aSMITH - Medizininformatik-Konsortium - Beitrag Forschungszentrum Jülich (01ZZ1803M)$$c01ZZ1803M$$x1
000910683 588__ $$aDataset connected to CrossRef, Journals: juser.fz-juelich.de
000910683 7001_ $$0P:(DE-HGF)0$$aBhat, Jayesh S.$$b1
000910683 7001_ $$0P:(DE-Juel1)185651$$aFritsch, Sebastian Johannes$$b2
000910683 7001_ $$0P:(DE-HGF)0$$aNikulina, Kateryna$$b3
000910683 7001_ $$0P:(DE-HGF)0$$aE. Samadi, Moein$$b4
000910683 7001_ $$0P:(DE-HGF)0$$aPolzin, Richard$$b5
000910683 7001_ $$0P:(DE-HGF)0$$aMayer, Hannah$$b6
000910683 7001_ $$0P:(DE-HGF)0$$aMarx, Gernot$$b7
000910683 7001_ $$0P:(DE-HGF)0$$aBickenbach, Johannes$$b8
000910683 7001_ $$0P:(DE-HGF)0$$aSchuppert, Andreas$$b9
000910683 773__ $$0PERI:(DE-600)2957497-3$$a10.3389/fdata.2022.603429$$gVol. 5, p. 603429$$p603429$$tFrontiers in Big Data$$v5$$x2624-909X$$y2022
000910683 8564_ $$uhttps://juser.fz-juelich.de/record/910683/files/Sharafutdinov_et_al%20Convex%20hull%20analysis.pdf$$yOpenAccess
000910683 909CO $$ooai:juser.fz-juelich.de:910683$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
000910683 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)185651$$aForschungszentrum Jülich$$b2$$kFZJ
000910683 9131_ $$0G:(DE-HGF)POF4-511$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5112$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vEnabling Computational- & Data-Intensive Science and Engineering$$x0
000910683 9141_ $$y2022
000910683 915__ $$0LIC:(DE-HGF)CCBY4$$2HGFVOC$$aCreative Commons Attribution CC BY 4.0
000910683 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
000910683 915__ $$0StatID:(DE-HGF)0561$$2StatID$$aArticle Processing Charges$$d2020-09-05
000910683 915__ $$0StatID:(DE-HGF)0700$$2StatID$$aFees$$d2020-09-05
000910683 915__ $$0StatID:(DE-HGF)0200$$2StatID$$aDBCoverage$$bSCOPUS$$d2022-11-05
000910683 915__ $$0StatID:(DE-HGF)0300$$2StatID$$aDBCoverage$$bMedline$$d2022-11-05
000910683 915__ $$0StatID:(DE-HGF)0501$$2StatID$$aDBCoverage$$bDOAJ Seal$$d2021-05-13T09:31:41Z
000910683 915__ $$0StatID:(DE-HGF)0500$$2StatID$$aDBCoverage$$bDOAJ$$d2021-05-13T09:31:41Z
000910683 915__ $$0StatID:(DE-HGF)0030$$2StatID$$aPeer Review$$bDOAJ : Blind peer review$$d2021-05-13T09:31:41Z
000910683 915__ $$0LIC:(DE-HGF)CCBYNV$$2V:(DE-HGF)$$aCreative Commons Attribution CC BY (No Version)$$bDOAJ$$d2021-05-13T09:31:41Z
000910683 915__ $$0StatID:(DE-HGF)0199$$2StatID$$aDBCoverage$$bClarivate Analytics Master Journal List$$d2022-11-05
000910683 915__ $$0StatID:(DE-HGF)0112$$2StatID$$aWoS$$bEmerging Sources Citation Index$$d2022-11-05
000910683 915__ $$0StatID:(DE-HGF)0150$$2StatID$$aDBCoverage$$bWeb of Science Core Collection$$d2022-11-05
000910683 915__ $$0StatID:(DE-HGF)0561$$2StatID$$aArticle Processing Charges$$d2022-11-05
000910683 915__ $$0StatID:(DE-HGF)0700$$2StatID$$aFees$$d2022-11-05
000910683 920__ $$lno
000910683 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
000910683 980__ $$ajournal
000910683 980__ $$aVDB
000910683 980__ $$aI:(DE-Juel1)JSC-20090406
000910683 980__ $$aUNRESTRICTED
000910683 9801_ $$aFullTexts
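
The first step described in the abstract (mean convex hull coverage between two datasets) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that coverage means the fraction of one dataset's points that fall inside the convex hull of the other, and that dimensionality effects are limited by averaging this fraction over all 2D feature pairs; both are assumptions made for illustration only. The function names (`convex_hull_2d`, `coverage_2d`, `mean_pairwise_coverage`) and the toy data are hypothetical.

```python
from itertools import combinations


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_2d(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False  # assumption: a degenerate hull (point or segment) covers nothing
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def coverage_2d(reference, target):
    """Fraction of target points lying inside the convex hull of reference."""
    hull = convex_hull_2d(reference)
    return sum(inside_hull(hull, p) for p in target) / len(target)


def mean_pairwise_coverage(reference, target):
    """Average 2D hull coverage over all feature pairs (illustrative assumption)."""
    n_features = len(reference[0])
    scores = []
    for i, j in combinations(range(n_features), 2):
        ref_2d = [(row[i], row[j]) for row in reference]
        tgt_2d = [(row[i], row[j]) for row in target]
        scores.append(coverage_2d(ref_2d, tgt_2d))
    return sum(scores) / len(scores)


# Toy data: hospital_a spans the corners of a cube, hospital_b has two patients.
hospital_a = [(x, y, z) for x in (0, 2) for y in (0, 2) for z in (0, 2)]
hospital_b = [(1, 1, 1), (3, 1, 1)]

a_covers_b = mean_pairwise_coverage(hospital_a, hospital_b)
b_covers_a = mean_pairwise_coverage(hospital_b, hospital_a)
```

Coverage computed this way is asymmetric: how well dataset A covers B generally differs from how well B covers A, which matches the abstract's concern with transferring a model from one hospital to another. A low value flags data ranges in the target hospital that the learning dataset does not span.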