000888522 001__ 888522
000888522 005__ 20230310131358.0
000888522 0247_ $$2arXiv$$aarXiv:2010.13342
000888522 0247_ $$2altmetric$$aaltmetric:93201252
000888522 037__ $$aFZJ-2020-04986
000888522 1001_ $$0P:(DE-HGF)0$$aAgullo, Emmanuel$$b0
000888522 245__ $$aResiliency in Numerical Algorithm Design for Extreme Scale Simulations
000888522 260__ $$c2020
000888522 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1607523859_13191
000888522 3367_ $$2ORCID$$aWORKING_PAPER
000888522 3367_ $$028$$2EndNote$$aElectronic Article
000888522 3367_ $$2DRIVER$$apreprint
000888522 3367_ $$2BibTeX$$aARTICLE
000888522 3367_ $$2DataCite$$aOutput Types/Working Paper
000888522 500__ $$a45 pages, 3 figures, submitted to The International Journal of High Performance Computing Applications
000888522 520__ $$aThis work is based on the seminar titled ``Resiliency in Numerical Algorithm Design for Extreme Scale Simulations'' held March 1-6, 2020 at Schloss Dagstuhl, which was attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.
000888522 536__ $$0G:(DE-HGF)POF3-511$$a511 - Computational Science and Mathematical Methods (POF3-511)$$cPOF3-511$$fPOF III$$x0
000888522 536__ $$0G:(GEPRIS)450829162$$aDFG project 450829162 - Raum-Zeit-parallele Simulation multimodale Energiesystemen (450829162)$$c450829162$$x1
000888522 588__ $$aDataset connected to arXiv
000888522 7001_ $$0P:(DE-HGF)0$$aAltenbernd, Mirco$$b1
000888522 7001_ $$0P:(DE-HGF)0$$aAnzt, Hartwig$$b2
000888522 7001_ $$0P:(DE-HGF)0$$aBautista-Gomez, Leonardo$$b3
000888522 7001_ $$0P:(DE-HGF)0$$aBenacchio, Tommaso$$b4
000888522 7001_ $$0P:(DE-HGF)0$$aBonaventura, Luca$$b5
000888522 7001_ $$0P:(DE-HGF)0$$aBungartz, Hans-Joachim$$b6
000888522 7001_ $$0P:(DE-HGF)0$$aChatterjee, Sanjay$$b7
000888522 7001_ $$0P:(DE-HGF)0$$aCiorba, Florina M.$$b8
000888522 7001_ $$0P:(DE-HGF)0$$aDeBardeleben, Nathan$$b9
000888522 7001_ $$0P:(DE-HGF)0$$aDrzisga, Daniel$$b10
000888522 7001_ $$0P:(DE-HGF)0$$aEibl, Sebastian$$b11
000888522 7001_ $$0P:(DE-HGF)0$$aEngelmann, Christian$$b12
000888522 7001_ $$0P:(DE-HGF)0$$aGansterer, Wilfried N.$$b13
000888522 7001_ $$0P:(DE-HGF)0$$aGiraud, Luc$$b14
000888522 7001_ $$0P:(DE-HGF)0$$aGoeddeke, Dominik$$b15
000888522 7001_ $$0P:(DE-HGF)0$$aHeisig, Marco$$b16
000888522 7001_ $$0P:(DE-HGF)0$$aJezequel, Fabienne$$b17
000888522 7001_ $$0P:(DE-HGF)0$$aKohl, Nils$$b18
000888522 7001_ $$0P:(DE-HGF)0$$aLi, Xiaoye Sherry$$b19
000888522 7001_ $$0P:(DE-HGF)0$$aLion, Romain$$b20
000888522 7001_ $$0P:(DE-HGF)0$$aMehl, Miriam$$b21
000888522 7001_ $$0P:(DE-HGF)0$$aMycek, Paul$$b22
000888522 7001_ $$0P:(DE-HGF)0$$aObersteiner, Michael$$b23
000888522 7001_ $$0P:(DE-HGF)0$$aQuintana-Orti, Enrique S.$$b24
000888522 7001_ $$0P:(DE-HGF)0$$aRizzi, Francesco$$b25
000888522 7001_ $$0P:(DE-HGF)0$$aRuede, Ulrich$$b26
000888522 7001_ $$0P:(DE-HGF)0$$aSchulz, Martin$$b27
000888522 7001_ $$0P:(DE-HGF)0$$aFung, Fred$$b28
000888522 7001_ $$0P:(DE-Juel1)132268$$aSpeck, Robert$$b29$$ufzj
000888522 7001_ $$0P:(DE-HGF)0$$aStals, Linda$$b30$$eCorresponding author
000888522 7001_ $$0P:(DE-HGF)0$$aTeranishi, Keita$$b31
000888522 7001_ $$0P:(DE-HGF)0$$aThibault, Samuel$$b32
000888522 7001_ $$0P:(DE-HGF)0$$aThoennes, Dominik$$b33
000888522 7001_ $$0P:(DE-HGF)0$$aWagner, Andreas$$b34
000888522 7001_ $$0P:(DE-HGF)0$$aWohlmuth, Barbara$$b35
000888522 8564_ $$uhttps://arxiv.org/abs/2010.13342
000888522 909CO $$ooai:juser.fz-juelich.de:888522$$pVDB
000888522 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)132268$$aForschungszentrum Jülich$$b29$$kFZJ
000888522 9131_ $$0G:(DE-HGF)POF3-511$$1G:(DE-HGF)POF3-510$$2G:(DE-HGF)POF3-500$$3G:(DE-HGF)POF3$$4G:(DE-HGF)POF$$aDE-HGF$$bKey Technologies$$lSupercomputing & Big Data$$vComputational Science and Mathematical Methods$$x0
000888522 9141_ $$y2020
000888522 920__ $$lyes
000888522 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
000888522 980__ $$apreprint
000888522 980__ $$aVDB
000888522 980__ $$aI:(DE-Juel1)JSC-20090406
000888522 980__ $$aUNRESTRICTED