000006609 001__ 6609
000006609 005__ 20250314084056.0
000006609 0247_ $$2DOI$$a10.1016/j.parco.2009.02.003
000006609 0247_ $$2WOS$$aWOS:000268438000001
000006609 037__ $$aPreJuSER-6609
000006609 041__ $$aeng
000006609 082__ $$a004
000006609 084__ $$2WoS$$aComputer Science, Theory & Methods
000006609 1001_ $$0P:(DE-Juel1)132112$$aGeimer, M.$$b0$$uFZJ
000006609 245__ $$aA scalable tool architecture for diagnosing wait states in massively parallel applications
000006609 260__ $$aAmsterdam [u.a.]$$bNorth-Holland, Elsevier Science$$c2009
000006609 300__ $$a375 - 388
000006609 3367_ $$0PUB:(DE-HGF)16$$2PUB:(DE-HGF)$$aJournal Article
000006609 3367_ $$2DataCite$$aOutput Types/Journal article
000006609 3367_ $$00$$2EndNote$$aJournal Article
000006609 3367_ $$2BibTeX$$aARTICLE
000006609 3367_ $$2ORCID$$aJOURNAL_ARTICLE
000006609 3367_ $$2DRIVER$$aarticle
000006609 440_0 $$012681$$aParallel Computing$$v35$$x0167-8191$$y7
000006609 500__ $$aThis work was supported by the Helmholtz Association under Grants No. VH-NG-118 and No. VH-VI-228. Also, we would like to thank Marek Behr, Mike Nicolai, and Markus Probst from the Chair for Computational Analysis of Technical Systems at RWTH Aachen University for giving us access to their code.
000006609 520__ $$aWhen scaling message-passing applications to thousands of processors, their performance is often affected by wait states that occur when processes fail to reach synchronization points simultaneously. As a first step in reducing the performance impact, we have shown in our earlier work that wait states can be diagnosed by searching event traces for characteristic patterns. However, our initial sequential search method did not scale beyond several hundred processes. Here, we present a scalable approach, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations. We explain how our new approach has been integrated into a comprehensive parallel tool architecture, which we use to demonstrate that wait states may consume a major fraction of the execution time at larger scales. (C) 2009 Elsevier B.V. All rights reserved.
000006609 536__ $$0G:(DE-Juel1)FUEK411$$2G:(DE-HGF)$$aScientific Computing$$cP41$$x0
000006609 536__ $$0G:(DE-Juel-1)ATMLPP$$aATMLPP - ATML Parallel Performance (ATMLPP)$$cATMLPP$$x1
000006609 588__ $$aDataset connected to Web of Science
000006609 65320 $$2Author$$aPerformance analysis
000006609 65320 $$2Author$$aScalability
000006609 65320 $$2Author$$aEvent tracing
000006609 65320 $$2Author$$aPattern search
000006609 650_7 $$2WoSType$$aJ
000006609 7001_ $$0P:(DE-Juel1)VDB1927$$aWolf, F.$$b1$$uFZJ
000006609 7001_ $$0P:(DE-Juel1)132302$$aWylie, B.$$b2$$uFZJ
000006609 7001_ $$0P:(DE-Juel1)132199$$aMohr, B.$$b3$$uFZJ
000006609 773__ $$0PERI:(DE-600)1466340-5$$a10.1016/j.parco.2009.02.003$$gVol. 35, p. 375 - 388$$p375 - 388$$q35<375 - 388$$tParallel computing$$v35$$x0167-8191$$y2009
000006609 8567_ $$uhttp://dx.doi.org/10.1016/j.parco.2009.02.003
000006609 909CO $$ooai:juser.fz-juelich.de:6609$$pVDB
000006609 915__ $$0StatID:(DE-HGF)0010$$aJCR/ISI refereed
000006609 9141_ $$y2009
000006609 9131_ $$0G:(DE-Juel1)FUEK411$$aDE-HGF$$bSchlüsseltechnologien$$kP41$$lSupercomputing$$vScientific Computing$$x0
000006609 9201_ $$0I:(DE-Juel1)JSC-20090406$$gJSC$$kJSC$$lJülich Supercomputing Centre$$x0
000006609 9201_ $$0I:(DE-82)080012_20140620$$gJARA$$kJARA-HPC$$lJülich Aachen Research Alliance - High-Performance Computing$$x1
000006609 970__ $$aVDB:(DE-Juel1)114966
000006609 980__ $$aVDB
000006609 980__ $$aConvertedRecord
000006609 980__ $$ajournal
000006609 980__ $$aI:(DE-Juel1)JSC-20090406
000006609 980__ $$aI:(DE-82)080012_20140620
000006609 980__ $$aUNRESTRICTED
000006609 981__ $$aI:(DE-Juel1)VDB1346