% IMPORTANT: The following is UTF-8 encoded. This means that in the presence
% of non-ASCII characters, it will not work with BibTeX 0.99 or older.
% Instead, you should use an up-to-date BibTeX implementation like “bibtex8” or
% “biber”.
@ARTICLE{Agullo:910504,
author = {Agullo, Emmanuel and Altenbernd, Mirco and Anzt, Hartwig
and Bautista-Gomez, Leonardo and Benacchio, Tommaso and
Bonaventura, Luca and Bungartz, Hans-Joachim and Chatterjee,
Sanjay and Ciorba, Florina M. and DeBardeleben, Nathan and
Drzisga, Daniel and Eibl, Sebastian and Engelmann, Christian
and Gansterer, Wilfried N. and Giraud, Luc and Göddeke,
Dominik and Heisig, Marco and Jézéquel, Fabienne and Kohl,
Nils and Li, Xiaoye Sherry and Lion, Romain and Mehl, Miriam
and Mycek, Paul and Obersteiner, Michael and Quintana-Ortí,
Enrique S. and Rizzi, Francesco and Rüde, Ulrich and Schulz,
Martin and Fung, Fred and Speck, Robert and Stals, Linda and
Teranishi, Keita and Thibault, Samuel and Thönnes, Dominik
and Wagner, Andreas and Wohlmuth, Barbara},
title = {{R}esiliency in numerical algorithm design for extreme
scale simulations},
journal = {The International Journal of High Performance Computing
Applications},
volume = {36},
number = {2},
issn = {1094-3420},
address = {Thousand Oaks, Calif.},
publisher = {SAGE Publications},
reportid = {FZJ-2022-03887},
pages = {251 - 285},
year = {2022},
abstract = {This work is based on the seminar titled ‘Resiliency in
Numerical Algorithm Design for Extreme Scale Simulations’,
held March 1–6, 2020, at Schloss Dagstuhl and attended by
all the authors. Advanced supercomputing is characterized
by very high computation speeds, achieved at the cost of an
enormous demand for resources and energy. A typical
large-scale computation running for 48 h on a system
consuming 20 MW, as predicted for exascale systems, would
consume a million kWh, corresponding to about 100k Euro in
energy cost for executing $10^{23}$ floating-point
operations. It
is clearly unacceptable to lose the whole computation if any
of the several million parallel processes fails during the
execution. Moreover, if a single operation suffers from a
bit-flip error, should the whole computation be declared
invalid? What about the notion of reproducibility itself:
should this core paradigm of science be revised and refined
for results that are obtained by large-scale simulation?
Naive versions of conventional resilience techniques will
not scale to the exascale regime: with a main memory
footprint of tens of petabytes, synchronously writing
checkpoint data all the way to background storage at
frequent intervals will create intolerable overheads in
runtime and energy consumption. Forecasts show that the mean
time between failures could be lower than the time to
recover from such a checkpoint, so that large calculations
at scale might not make any progress if robust alternatives
are not investigated. More advanced resilience techniques
must be devised. The key may lie in exploiting both advanced
system features and specific application knowledge.
Research will face two essential questions: (1) what are the
reliability requirements for a particular computation and
(2) how do we best design the algorithms and software to
meet these requirements? While the analysis of use cases can
help understand the particular reliability requirements, the
construction of remedies is currently wide open. One avenue
would be to refine and improve on system- or
application-level checkpointing and rollback strategies in
case an error is detected. Developers might use fault
notification interfaces and flexible runtime systems to
respond to node failures in an application-dependent
fashion. Novel numerical algorithms or more stochastic
computational approaches may be required to meet accuracy
requirements in the face of undetectable soft errors. These
ideas constituted an essential topic of the seminar. The
goal of this Dagstuhl Seminar was to bring together a
diverse group of scientists with expertise in exascale
computing to discuss novel ways to make applications
resilient against detected and undetected faults. In
particular, participants explored the role that algorithms
and applications play in the holistic approach needed to
tackle this challenge. This article gathers a broad range of
perspectives on the role of algorithms, applications and
systems in achieving resilience for extreme scale
simulations. The ultimate goal is to spark novel ideas and
encourage the development of concrete solutions for
achieving such resilience holistically.},
cin = {JSC},
ddc = {004},
cid = {I:(DE-Juel1)JSC-20090406},
pnm = {5111 - Domain-Specific Simulation Data Life Cycle Labs
(SDLs) and Research Groups (POF4-511) / DFG project
450829162 - Raum-Zeit-parallele Simulation multimodaler
Energiesysteme (450829162)},
pid = {G:(DE-HGF)POF4-5111 / G:(GEPRIS)450829162},
typ = {PUB:(DE-HGF)16},
UT = {WOS:000730172300001},
doi = {10.1177/10943420211055188},
url = {https://juser.fz-juelich.de/record/910504},
}