TY  - CONF
AU  - Baumann, Thomas
AU  - Götschel, Sebastian
AU  - Lunet, Thibaut
AU  - Ruprecht, Daniel
AU  - Schöbel, Ruth
AU  - Speck, Robert
TI  - Resilience in Spectral Deferred Corrections
M1  - FZJ-2022-05208
PY  - 2022
AB  - Advances in computational speed are nowadays gained by using more processing units rather than faster ones. Faults in the processing units, caused by numerous sources including radiation and aging, have been neglected in the past. However, the increasing size of HPC machines makes them more susceptible to such faults, and it is important to develop a resilience strategy to avoid losing millions of CPU hours. Parallel-in-time methods target the very largest of computers and are hence required to come with algorithm-based fault tolerance. We look here at spectral deferred corrections (SDC), a time-marching scheme that is at the heart of parallel-in-time methods such as PFASST. Due to its iterative nature, there is ample opportunity to plug in computationally inexpensive fault tolerance schemes, many of which are also easy to implement. We experimentally examine the capability of various strategies to recover from single bit flips, both for serial SDC and for a small-scale parallel-in-time version with diagonal preconditioners.
T2  - 11th Parallel-in-Time Workshop
CY  - 11 Jul 2022 - 15 Jul 2022, Marseilles (France)
Y2  - 11 Jul 2022 - 15 Jul 2022
M2  - Marseilles, France
LB  - PUB:(DE-HGF)24
UR  - https://juser.fz-juelich.de/record/911978
ER  -