Exploring the impact of node failures on the resource allocation for parallel jobs
Contribution to conference proceedings | FZJ-2021-04559
2021
Please use a persistent id in citations: http://hdl.handle.net/2128/29162
Abstract: Increasing the size and complexity of modern HPC systems also increases the probability of various types of failures. Failures may disrupt application execution and waste valuable system resources due to failed executions. In this work, we explore the effect of node failures on the completion times of MPI parallel jobs. We introduce a simulation environment that generates synthetic traces of node failures, assuming that the times between failures for each node are independently distributed, following the same distribution but with different parameters. To highlight the importance of failure-awareness for resource allocation, we compare two failure-oblivious resource allocation approaches with one that considers node failure probabilities before assigning a partition to a job: a heuristic that randomly selects the partition for a job, and Slurm's linear resource allocation policy. We present results for a case study that assumes a 4D-torus topology and a Weibull distribution for each node's time between failures, and considers several different traces of node failures, capturing different failure patterns. For the synthetic traces explored, the benefit is more prominent for longer jobs, up to 82% depending on the trace, when compared with Slurm and a failure-oblivious heuristic. For shorter jobs, benefits are noticeable for systems with more frequent failures.
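The trace-generation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' simulator: it assumes each node's times between failures are drawn independently from a Weibull distribution with its own (shape, scale) parameters, and merges the per-node failure times into one system-wide trace. All function and variable names here are hypothetical.

```python
import random

def synthetic_failure_trace(num_nodes, horizon, shapes, scales, seed=0):
    """Generate a synthetic node-failure trace (illustrative sketch).

    For each node, inter-failure times are drawn independently from a
    Weibull distribution with that node's own shape and scale parameters,
    mirroring the assumption stated in the abstract. Returns a list of
    (failure_time, node_id) events sorted by time, up to `horizon`.
    """
    rng = random.Random(seed)  # seeded for reproducible traces
    trace = []
    for node in range(num_nodes):
        t = 0.0
        while True:
            # random.weibullvariate(alpha, beta): alpha = scale, beta = shape
            t += rng.weibullvariate(scales[node], shapes[node])
            if t > horizon:
                break
            trace.append((t, node))
    trace.sort()  # merge per-node events into one chronological trace
    return trace

# Example: 8 nodes over a 10,000-hour horizon; parameters vary per node
# (values chosen only for illustration, not taken from the paper).
shapes = [0.7 + 0.05 * n for n in range(8)]     # shape < 1: decreasing hazard
scales = [1000.0 + 100.0 * n for n in range(8)]  # characteristic life, hours
events = synthetic_failure_trace(8, 10_000.0, shapes, scales)
```

Varying the shape and scale per node is one simple way to produce traces that capture different failure patterns (e.g. a few failure-prone nodes among mostly reliable ones), which a failure-aware allocator could then exploit when choosing a partition.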