A scalable tool architecture for diagnosing wait states in massively parallel applications

Geimer, M.; Wolf, F.; Wylie, B.; Mohr, B.

doi:10.1016/j.parco.2009.02.003

Marc 21

001			6609
005			20250314084056.0
024	7	_	\|2 DOI \|a 10.1016/j.parco.2009.02.003
024	7	_	\|2 WOS \|a WOS:000268438000001
037	_	_	\|a PreJuSER-6609
041	_	_	\|a eng
082	_	_	\|a 004
084	_	_	\|2 WoS \|a Computer Science, Theory & Methods
100	1	_	\|0 P:(DE-Juel1)132112 \|a Geimer, M. \|b 0 \|u FZJ
245	_	_	\|a A scalable tool architecture for diagnosing wait states in massively parallel applications
260	_	_	\|a Amsterdam [u.a.] \|b North-Holland, Elsevier Science \|c 2009
300	_	_	\|a 375 - 388
336	7	_	\|a Journal Article \|0 PUB:(DE-HGF)16 \|2 PUB:(DE-HGF)
336	7	_	\|a Output Types/Journal article \|2 DataCite
336	7	_	\|a Journal Article \|0 0 \|2 EndNote
336	7	_	\|a ARTICLE \|2 BibTeX
336	7	_	\|a JOURNAL_ARTICLE \|2 ORCID
336	7	_	\|a article \|2 DRIVER
440	_	0	\|0 12681 \|a Parallel Computing \|v 35 \|x 0167-8191 \|y 7
500	_	_	\|a This work was supported by the Helmholtz Association under Grants No. VH-NG-118 and No. VH-VI-228. Also, we would like to thank Marek Behr, Mike Nicolai, and Markus Probst from the Chair for Computational Analysis of Technical Systems at RWTH Aachen University for giving us access to their code.
520	_	_	\|a When scaling message-passing applications to thousands of processors, their performance is often affected by wait states that occur when processes fail to reach synchronization points simultaneously. As a first step in reducing the performance impact, we have shown in our earlier work that wait states can be diagnosed by searching event traces for characteristic patterns. However, our initial sequential search method did not scale beyond several hundred processes. Here, we present a scalable approach, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations. We explain how our new approach has been integrated into a comprehensive parallel tool architecture, which we use to demonstrate that wait states may consume a major fraction of the execution time at larger scales. (C) 2009 Elsevier B.V. All rights reserved.
536	_	_	\|0 G:(DE-Juel1)FUEK411 \|2 G:(DE-HGF) \|a Scientific Computing \|c P41 \|x 0
536	_	_	\|0 G:(DE-Juel-1)ATMLPP \|a ATMLPP - ATML Parallel Performance (ATMLPP) \|c ATMLPP \|x 1
588	_	_	\|a Dataset connected to Web of Science
650	_	7	\|2 WoSType \|a J
653	2	0	\|2 Author \|a Performance analysis
653	2	0	\|2 Author \|a Scalability
653	2	0	\|2 Author \|a Event tracing
653	2	0	\|2 Author \|a Pattern search
700	1	_	\|0 P:(DE-Juel1)VDB1927 \|a Wolf, F. \|b 1 \|u FZJ
700	1	_	\|0 P:(DE-Juel1)132302 \|a Wylie, B. \|b 2 \|u FZJ
700	1	_	\|0 P:(DE-Juel1)132199 \|a Mohr, B. \|b 3 \|u FZJ
773	_	_	\|0 PERI:(DE-600)1466340-5 \|a 10.1016/j.parco.2009.02.003 \|g Vol. 35, p. 375 - 388 \|p 375 - 388 \|q 35<375 - 388 \|t Parallel computing \|v 35 \|x 0167-8191 \|y 2009
856	7	_	\|u http://dx.doi.org/10.1016/j.parco.2009.02.003
909	C	O	\|o oai:juser.fz-juelich.de:6609 \|p VDB
913	1	_	\|0 G:(DE-Juel1)FUEK411 \|a DE-HGF \|b Schlüsseltechnologien \|k P41 \|l Supercomputing \|v Scientific Computing \|x 0
914	1	_	\|y 2009
915	_	_	\|0 StatID:(DE-HGF)0010 \|a JCR/ISI refereed
920	1	_	\|0 I:(DE-Juel1)JSC-20090406 \|k JSC \|l Jülich Supercomputing Centre \|g JSC \|x 0
920	1	_	\|0 I:(DE-82)080012_20140620 \|k JARA-HPC \|l Jülich Aachen Research Alliance - High-Performance Computing \|g JARA \|x 1
970	_	_	\|a VDB:(DE-Juel1)114966
980	_	_	\|a VDB
980	_	_	\|a ConvertedRecord
980	_	_	\|a journal
980	_	_	\|a I:(DE-Juel1)JSC-20090406
980	_	_	\|a I:(DE-82)080012_20140620
980	_	_	\|a UNRESTRICTED
981	_	_	\|a I:(DE-Juel1)VDB1346
