Contribution to a conference proceedings FZJ-2021-04559

Exploring the impact of node failures on the resource allocation for parallel jobs


2021

14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Lisbon, Portugal, 30 Aug 2021, 12 p.


Abstract: Increasing the size and complexity of modern HPC systems also increases the probability of various types of failures. Failures may disrupt application execution and waste valuable system resources due to failed executions. In this work, we explore the effect of node failures on the completion times of MPI parallel jobs. We introduce a simulation environment that generates synthetic traces of node failures, assuming that the times between failures for each node are independently distributed, following the same distribution but with different parameters. To highlight the importance of failure-awareness for resource allocation, we compare two failure-oblivious resource allocation approaches with one that considers node failure probabilities before assigning a partition to a job: a heuristic that randomly selects the partition for a job, and Slurm's linear resource allocation policy. We present results for a case study that assumes a 4D-torus topology and a Weibull distribution for each node's time between failures, and considers several different traces of node failures, capturing different failure patterns. For the synthetic traces explored, the benefit is more prominent for longer jobs, up to 82% depending on the trace, when compared with Slurm and a failure-oblivious heuristic. For shorter jobs, benefits are noticeable for systems with more frequent failures.
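
Illustration (not from the paper): the trace generation described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' simulation environment. The function names, the per-node (scale, shape) values, the time unit, and the simplification that a failed node is back in service immediately are all hypothetical. Only the Weibull time-between-failures model with per-node parameters comes from the abstract.

```python
import random


def generate_failure_trace(node_params, horizon, seed=0):
    """Return a time-sorted list of (failure_time, node_id) events.

    node_params maps node_id -> (scale, shape) of that node's Weibull
    time-between-failures distribution. All values are hypothetical.
    Assumption: repair time is ignored, so the next inter-failure
    interval starts right after each failure.
    """
    rng = random.Random(seed)
    events = []
    for node_id, (scale, shape) in node_params.items():
        # random.weibullvariate(alpha, beta): alpha = scale, beta = shape
        t = rng.weibullvariate(scale, shape)
        while t < horizon:
            events.append((t, node_id))
            t += rng.weibullvariate(scale, shape)
    events.sort()
    return events


def job_hit_by_failure(trace, partition, start, duration):
    """True if any node of the partition fails while the job is running."""
    nodes = set(partition)
    end = start + duration
    return any(start <= t < end and n in nodes for t, n in trace)


# Hypothetical example: four nodes, times in hours.
params = {0: (500.0, 0.7), 1: (2000.0, 0.7), 2: (800.0, 1.0), 3: (5000.0, 1.5)}
trace = generate_failure_trace(params, horizon=10_000.0, seed=42)
```

A failure-oblivious allocator would choose a partition without consulting such a trace or the node parameters, whereas a failure-aware one could prefer nodes whose parameters imply a lower failure probability over the job's expected runtime; `job_hit_by_failure` then indicates whether a given placement would have been interrupted.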


Note: Coordinated project, therefore no internal authors

Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 899 - without topic (POF4-899)
  2. DEEP-SEA - DEEP – SOFTWARE FOR EXASCALE ARCHITECTURES (955606)
  3. EuroEXA - Co-designed Innovation and System for Resilient Exascale Computing in Europe: From Applications to Silicon (754337)

Appears in the scientific report 2021
Database coverage:
OpenAccess

The record appears in these collections:
Document types > Events > Contributions to a conference proceedings
Workflow collections > Public records
Institute Collections > JSC
Publications database
Open Access

 Record created 2021-11-25, last modified 2022-01-31

