Journal Article PreJuSER-6609

A scalable tool architecture for diagnosing wait states in massively parallel applications


2009
North-Holland, Elsevier Science Amsterdam [u.a.]

Parallel Computing 35, 375-388 (2009) [10.1016/j.parco.2009.02.003]


Please use a persistent id in citations: doi:10.1016/j.parco.2009.02.003

Abstract: When scaling message-passing applications to thousands of processors, their performance is often affected by wait states that occur when processes fail to reach synchronization points simultaneously. As a first step in reducing the performance impact, we have shown in our earlier work that wait states can be diagnosed by searching event traces for characteristic patterns. However, our initial sequential search method did not scale beyond several hundred processes. Here, we present a scalable approach, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations. We explain how our new approach has been integrated into a comprehensive parallel tool architecture, which we use to demonstrate that wait states may consume a major fraction of the execution time at larger scales. (C) 2009 Elsevier B.V. All rights reserved.

Keyword(s): Performance analysis ; Scalability ; Event tracing ; Pattern search


Note: This work was supported by the Helmholtz Association under Grants No. VH-NG-118 and No. VH-VI-228. Also, we would like to thank Marek Behr, Mike Nicolai, and Markus Probst from the Chair for Computational Analysis of Technical Systems at RWTH Aachen University for giving us access to their code.

Contributing Institute(s):
  1. Jülich Supercomputing Centre (JSC)
  2. Jülich Aachen Research Alliance - High-Performance Computing (JARA-HPC)
Research Program(s):
  1. Scientific Computing (P41)
  2. ATMLPP - ATML Parallel Performance (ATMLPP)

Appears in the scientific report 2009


 Record created 2012-11-13, last modified 2025-03-14


