Exploring the impact of node failures on the resource allocation for parallel jobs
Contribution to conference proceedings | FZJ-2021-04559
2021
Please use a persistent id in citations: http://hdl.handle.net/2128/29162
Abstract: Increasing the size and complexity of modern HPC systems also increases the probability of various types of failures. Failures may disrupt application execution and waste valuable system resources due to failed executions. In this work, we explore the effect of node failures on the completion times of MPI parallel jobs. We introduce a simulation environment that generates synthetic traces of node failures, assuming that the times between failures for each node are independently distributed, following the same distribution but with different parameters. To highlight the importance of failure-awareness for resource allocation, we compare two failure-oblivious resource allocation approaches with one that considers node failure probabilities before assigning a partition to a job: a heuristic that randomly selects the partition for a job, and Slurm's linear resource allocation policy. We present results for a case study that assumes a 4D-torus topology and a Weibull distribution for each node's time between failures, and considers several different traces of node failures, capturing different failure patterns. For the synthetic traces explored, the benefit is more prominent for longer jobs, up to 82% depending on the trace, when compared with Slurm and a failure-oblivious heuristic. For shorter jobs, benefits are noticeable for systems with more frequent failures.
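The trace-generation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' simulator: it assumes each node's times between failures are drawn independently from a Weibull distribution with its own (shape, scale) parameters, and merges the per-node failure times into one system-wide trace. All function and variable names here are hypothetical.

```python
import random

def synthetic_failure_trace(num_nodes, horizon, shapes, scales, seed=0):
    """Generate a synthetic node-failure trace (illustrative sketch).

    For each node, inter-failure times are drawn independently from a
    Weibull distribution with that node's own shape and scale parameters,
    mirroring the assumption stated in the abstract. Returns a list of
    (failure_time, node_id) events sorted by time, up to `horizon`.
    """
    rng = random.Random(seed)  # seeded for reproducible traces
    trace = []
    for node in range(num_nodes):
        t = 0.0
        while True:
            # random.weibullvariate(alpha, beta): alpha = scale, beta = shape
            t += rng.weibullvariate(scales[node], shapes[node])
            if t > horizon:
                break
            trace.append((t, node))
    trace.sort()  # merge per-node events into one chronological trace
    return trace

# Example: 8 nodes over a 10,000-hour horizon; parameters vary per node
# (values chosen only for illustration, not taken from the paper).
shapes = [0.7 + 0.05 * n for n in range(8)]     # shape < 1: decreasing hazard
scales = [1000.0 + 100.0 * n for n in range(8)]  # characteristic life, hours
events = synthetic_failure_trace(8, 10_000.0, shapes, scales)
```

Varying the shape and scale per node is one simple way to produce traces that capture different failure patterns (e.g. a few failure-prone nodes among mostly reliable ones), which a failure-aware allocator could then exploit when choosing a partition.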