Resiliency in Numerical Algorithm Design for Extreme Scale Simulations

Agullo, Emmanuel; Obersteiner, Michael; Eibl, Sebastian; Bautista-Gomez, Leonardo; Fung, Fred; Lion, Romain; Bungartz, Hans-Joachim; Schulz, Martin; Heisig, Marco; Gansterer, Wilfried N.; Wagner, Andreas; Jezequel, Fabienne; Thoennes, Dominik; Mycek, Paul; Rizzi, Francesco; Goeddeke, Dominik; Bonaventura, Luca; Engelmann, Christian; Speck, Robert; Drzisga, Daniel; Chatterjee, Sanjay; Stals, Linda; Anzt, Hartwig; Teranishi, Keita; Ruede, Ulrich; Thibault, Samuel; Kohl, Nils; Ciorba, Florina M.; Li, Xiaoye Sherry; Quintana-Orti, Enrique S.; Benacchio, Tommaso; Mehl, Miriam; Giraud, Luc; Altenbernd, Mirco; Wohlmuth, Barbara; DeBardeleben, Nathan

% IMPORTANT: The following is UTF-8 encoded.  This means that in the presence
% of non-ASCII characters, it will not work with BibTeX 0.99 or older.
% Instead, you should use an up-to-date BibTeX implementation like “bibtex8” or
% “biber”.

@ARTICLE{Agullo:888522,
      author       = {Agullo, Emmanuel and Altenbernd, Mirco and Anzt, Hartwig
                      and Bautista-Gomez, Leonardo and Benacchio, Tommaso and
                      Bonaventura, Luca and Bungartz, Hans-Joachim and Chatterjee,
                      Sanjay and Ciorba, Florina M. and DeBardeleben, Nathan and
                      Drzisga, Daniel and Eibl, Sebastian and Engelmann, Christian
                      and Gansterer, Wilfried N. and Giraud, Luc and Goeddeke,
                      Dominik and Heisig, Marco and Jezequel, Fabienne and Kohl,
                      Nils and Li, Xiaoye Sherry and Lion, Romain and Mehl, Miriam
                      and Mycek, Paul and Obersteiner, Michael and Quintana-Orti,
                      Enrique S. and Rizzi, Francesco and Ruede, Ulrich and
                      Schulz, Martin and Fung, Fred and Speck, Robert and Stals,
                      Linda and Teranishi, Keita and Thibault, Samuel and
                      Thoennes, Dominik and Wagner, Andreas and Wohlmuth, Barbara},
      title        = {{R}esiliency in {N}umerical {A}lgorithm {D}esign for
                      {E}xtreme {S}cale {S}imulations},
      reportid     = {FZJ-2020-04986},
      year         = {2020},
      note         = {45 pages, 3 figures, submitted to The International Journal
                      of High Performance Computing Applications},
      abstract     = {This work is based on the seminar titled ``Resiliency in
                      Numerical Algorithm Design for Extreme Scale Simulations''
                      held March 1-6, 2020, at Schloss Dagstuhl, which was attended
                      by all the authors. Naive versions of conventional
                      resilience techniques will not scale to the exascale regime:
                      with a main memory footprint of tens of Petabytes,
                      synchronously writing checkpoint data all the way to
                      background storage at frequent intervals will create
                      intolerable overheads in runtime and energy consumption.
                      Forecasts show that the mean time between failures could be
                      lower than the time to recover from such a checkpoint, so
                      that large calculations at scale might not make any progress
                      if robust alternatives are not investigated. More advanced
                      resilience techniques must be devised. The key may lie in
                      exploiting both advanced system features and specific
                      application knowledge. Research will face two essential
                      questions: (1) what are the reliability requirements for a
                      particular computation and (2) how do we best design the
                      algorithms and software to meet these requirements? One
                      avenue would be to refine and improve on system- or
                      application-level checkpointing and rollback strategies in
                      the case that an error is detected. Developers might use fault
                      notification interfaces and flexible runtime systems to
                      respond to node failures in an application-dependent
                      fashion. Novel numerical algorithms or more stochastic
                      computational approaches may be required to meet accuracy
                      requirements in the face of undetectable soft errors. The
                      goal of this Dagstuhl Seminar was to bring together a
                      diverse group of scientists with expertise in exascale
                      computing to discuss novel ways to make applications
                      resilient against detected and undetected faults. In
                      particular, participants explored the role that algorithms
                      and applications play in the holistic approach needed to
                      tackle this challenge.},
      cin          = {JSC},
      cid          = {I:(DE-Juel1)JSC-20090406},
      pnm          = {511 - Computational Science and Mathematical Methods
                      (POF3-511) / DFG project 450829162 - Raum-Zeit-parallele
                      Simulation multimodaler Energiesysteme (450829162)},
      pid          = {G:(DE-HGF)POF3-511 / G:(GEPRIS)450829162},
      typ          = {PUB:(DE-HGF)25},
      eprint       = {2010.13342},
      howpublished = {arXiv:2010.13342},
      archivePrefix = {arXiv},
      SLACcitation = {\%\%CITATION = arXiv:2010.13342;\%\%},
      url          = {https://juser.fz-juelich.de/record/888522},
}
