000841390 001__ 841390
000841390 005__ 20210129232018.0
000841390 020__ $$a978-9935-9383-2-9
000841390 037__ $$aFZJ-2017-08465
000841390 041__ $$aEnglish
000841390 1001_ $$0P:(DE-Juel1)162390$$aGötz, Markus$$b0$$eCorresponding author$$ufzj
000841390 245__ $$aScalable Data Analysis in High Performance Computing$$f2014-04-01 - 2017-12-05
000841390 260__ $$aReykjavik$$bHáskólaprent, Universität Island$$c2017
000841390 300__ $$a156 p.
000841390 3367_ $$2DataCite$$aOutput Types/Dissertation
000841390 3367_ $$0PUB:(DE-HGF)3$$2PUB:(DE-HGF)$$aBook$$mbook
000841390 3367_ $$2ORCID$$aDISSERTATION
000841390 3367_ $$2BibTeX$$aPHDTHESIS
000841390 3367_ $$02$$2EndNote$$aThesis
000841390 3367_ $$0PUB:(DE-HGF)11$$2PUB:(DE-HGF)$$aDissertation / PhD Thesis$$bphd$$mphd$$s1513673730_27837
000841390 3367_ $$2DRIVER$$adoctoralThesis
000841390 502__ $$aDissertation, Universität Island, 2017$$bDissertation$$cUniversität Island$$d2017$$o2017-12-05
000841390 520__ $$aOver the last decades, one could observe a drastic increase in the generation and storage of data in both industry and science. While the field of data analysis is not new, it is now facing the challenge of coping with an increasing size, bandwidth and complexity of data. This renders traditional analysis methods and algorithms ineffective. This problem has been termed the Big Data challenge. In science specifically, the major data producers are large-scale monolithic experiments and domain simulations. To date, most of this data has not been completely analyzed; for lack of efficient means of processing, it has instead been stored in data repositories for later consideration. As a consequence, there is a need for large-scale data analysis frameworks and algorithm libraries that make it possible to study these datasets.
In the context of scientific applications, potentially coupled with legacy simulations, the designated target platforms are heterogeneous high-performance computing systems. This thesis proposes a design and prototypical realization of such a framework based on the experience gained from empirical applications. To this end, selected scientific use cases, with an emphasis on earth sciences, were studied. In particular, these are object segmentation in point cloud data and biological imagery, outlier detection in oceanographic time-series data, and land cover type classification in remote sensing images. To cope with the data volumes, two analysis algorithms have been parallelized for shared- and distributed-memory systems. Specifically, these are HPDBSCAN, a density-based clustering algorithm, and Distributed Max-Trees, a filtering step for images. The presented parallelization strategies have been abstracted into a generalized paradigm, enabling the formulation of scalable algorithms for other similar analysis methods. Moreover, it permits the definition of requirements for the design of a large-scale data analysis framework and algorithm library for heterogeneous, distributed high-performance computing systems. In line with that, the thesis presents a prototypical realization called the Juelich Machine Learning Library (JuML), providing essential low-level components and readily usable analysis algorithm implementations.
000841390 536__ $$0G:(DE-HGF)POF3-512$$a512 - Data-Intensive Science and Federated Computing (POF3-512)$$cPOF3-512$$fPOF III$$x0
000841390 536__ $$0G:(DE-Juel1)PHD-NO-GRANT-20170405$$aPhD no Grant - Doktorand ohne besondere Förderung (PHD-NO-GRANT-20170405)$$cPHD-NO-GRANT-20170405$$x1
000841390 8564_ $$uhttps://hdl.handle.net/20.500.11815/472
000841390 909CO $$ooai:juser.fz-juelich.de:841390$$pVDB
000841390 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)162390$$aForschungszentrum Jülich$$b0$$kFZJ
000841390 9131_ $$0G:(DE-HGF)POF3-512$$1G:(DE-HGF)POF3-510$$2G:(DE-HGF)POF3-500$$3G:(DE-HGF)POF3$$4G:(DE-HGF)POF$$aDE-HGF$$bKey Technologies$$lSupercomputing & Big Data$$vData-Intensive Science and Federated Computing$$x0
000841390 9141_ $$y2017
000841390 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
000841390 980__ $$aphd
000841390 980__ $$aVDB
000841390 980__ $$abook
000841390 980__ $$aI:(DE-Juel1)JSC-20090406
000841390 980__ $$aUNRESTRICTED