Journal Article FZJ-2016-04097

http://join2-wiki.gsi.de/foswiki/pub/Main/Artwork/join2_logo100x88.png
Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

 ;  ;  ;  ;

2016
acm Association for Computing Machinery New York, NY

ACM Transactions on Parallel Computing 3(2), 11 () [10.1145/2934661]

This record in other databases:

Please use a persistent id in citations: doi:

Abstract: Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both forward and backward, we can identify the processes and call paths responsible for the most severe imbalances even for runs with hundreds of thousands of processes.


Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 511 - Computational Science and Mathematical Methods (POF3-511) (POF3-511)
  2. ATMLPP - ATML Parallel Performance (ATMLPP) (ATMLPP)

Appears in the scientific report 2016
Click to display QR Code for this record

The record appears in these collections:
Document types > Articles > Journal Article
Workflow collections > Public records
Institute Collections > JSC
Publications database

 Record created 2016-08-01, last modified 2025-03-14


Restricted:
Download fulltext PDF
Rate this document:

Rate this document:
1
2
3
 
(Not yet reviewed)