000910504 001__ 910504
000910504 005__ 20230310131347.0
000910504 0247_ $$2doi$$a10.1177/10943420211055188
000910504 0247_ $$2ISSN$$a1078-3482
000910504 0247_ $$2ISSN$$a1094-3420
000910504 0247_ $$2ISSN$$a1741-2846
000910504 0247_ $$2Handle$$a2128/32166
000910504 0247_ $$2WOS$$aWOS:000730172300001
000910504 037__ $$aFZJ-2022-03887
000910504 082__ $$a004
000910504 1001_ $$0P:(DE-HGF)0$$aAgullo, Emmanuel$$b0
000910504 245__ $$aResiliency in numerical algorithm design for extreme scale simulations
000910504 260__ $$aThousand Oaks, Calif.$$bSage Science Press$$c2022
000910504 3367_ $$2DRIVER$$aarticle
000910504 3367_ $$2DataCite$$aOutput Types/Journal article
000910504 3367_ $$0PUB:(DE-HGF)16$$2PUB:(DE-HGF)$$aJournal Article$$bjournal$$mjournal$$s1666879654_31184
000910504 3367_ $$2BibTeX$$aARTICLE
000910504 3367_ $$2ORCID$$aJOURNAL_ARTICLE
000910504 3367_ $$00$$2EndNote$$aJournal Article
000910504 520__ $$aThis work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources and expense. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help to understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
000910504 536__ $$0G:(DE-HGF)POF4-5111$$a5111 - Domain-Specific Simulation Data Life Cycle Labs (SDLs) and Research Groups (POF4-511)$$cPOF4-511$$fPOF IV$$x0
000910504 536__ $$0G:(GEPRIS)450829162$$aDFG project 450829162 - Raum-Zeit-parallele Simulation multimodale Energiesystemen (450829162)$$c450829162$$x1
000910504 588__ $$aDataset connected to CrossRef, Journals: juser.fz-juelich.de
000910504 7001_ $$0P:(DE-HGF)0$$aAltenbernd, Mirco$$b1
000910504 7001_ $$0P:(DE-HGF)0$$aAnzt, Hartwig$$b2
000910504 7001_ $$0P:(DE-HGF)0$$aBautista-Gomez, Leonardo$$b3
000910504 7001_ $$0P:(DE-HGF)0$$aBenacchio, Tommaso$$b4
000910504 7001_ $$0P:(DE-HGF)0$$aBonaventura, Luca$$b5
000910504 7001_ $$0P:(DE-HGF)0$$aBungartz, Hans-Joachim$$b6
000910504 7001_ $$0P:(DE-HGF)0$$aChatterjee, Sanjay$$b7
000910504 7001_ $$0P:(DE-HGF)0$$aCiorba, Florina M$$b8
000910504 7001_ $$0P:(DE-HGF)0$$aDeBardeleben, Nathan$$b9
000910504 7001_ $$0P:(DE-HGF)0$$aDrzisga, Daniel$$b10
000910504 7001_ $$0P:(DE-HGF)0$$aEibl, Sebastian$$b11
000910504 7001_ $$0P:(DE-HGF)0$$aEngelmann, Christian$$b12
000910504 7001_ $$0P:(DE-HGF)0$$aGansterer, Wilfried N$$b13
000910504 7001_ $$0P:(DE-HGF)0$$aGiraud, Luc$$b14
000910504 7001_ $$0P:(DE-HGF)0$$aGöddeke, Dominik$$b15
000910504 7001_ $$0P:(DE-HGF)0$$aHeisig, Marco$$b16
000910504 7001_ $$0P:(DE-HGF)0$$aJézéquel, Fabienne$$b17
000910504 7001_ $$0P:(DE-HGF)0$$aKohl, Nils$$b18
000910504 7001_ $$0P:(DE-HGF)0$$aLi, Xiaoye Sherry$$b19
000910504 7001_ $$0P:(DE-HGF)0$$aLion, Romain$$b20
000910504 7001_ $$0P:(DE-HGF)0$$aMehl, Miriam$$b21
000910504 7001_ $$0P:(DE-HGF)0$$aMycek, Paul$$b22
000910504 7001_ $$0P:(DE-HGF)0$$aObersteiner, Michael$$b23
000910504 7001_ $$0P:(DE-HGF)0$$aQuintana-Ortí, Enrique S$$b24
000910504 7001_ $$0P:(DE-HGF)0$$aRizzi, Francesco$$b25
000910504 7001_ $$0P:(DE-HGF)0$$aRüde, Ulrich$$b26
000910504 7001_ $$0P:(DE-HGF)0$$aSchulz, Martin$$b27
000910504 7001_ $$0P:(DE-HGF)0$$aFung, Fred$$b28
000910504 7001_ $$0P:(DE-Juel1)132268$$aSpeck, Robert$$b29$$ufzj
000910504 7001_ $$00000-0003-1557-8393$$aStals, Linda$$b30$$eCorresponding author
000910504 7001_ $$0P:(DE-HGF)0$$aTeranishi, Keita$$b31
000910504 7001_ $$0P:(DE-HGF)0$$aThibault, Samuel$$b32
000910504 7001_ $$0P:(DE-HGF)0$$aThönnes, Dominik$$b33
000910504 7001_ $$0P:(DE-HGF)0$$aWagner, Andreas$$b34
000910504 7001_ $$0P:(DE-HGF)0$$aWohlmuth, Barbara$$b35
000910504 773__ $$0PERI:(DE-600)2017480-9$$a10.1177/10943420211055188$$gVol. 36, no. 2, p. 251 - 285$$n2$$p251 - 285$$tThe international journal of high performance computing applications$$v36$$x1078-3482$$y2022
000910504 8564_ $$uhttps://juser.fz-juelich.de/record/910504/files/2010.13342.pdf$$yOpenAccess
000910504 909CO $$ooai:juser.fz-juelich.de:910504$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
000910504 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)132268$$aForschungszentrum Jülich$$b29$$kFZJ
000910504 9131_ $$0G:(DE-HGF)POF4-511$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5111$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vEnabling Computational- & Data-Intensive Science and Engineering$$x0
000910504 9141_ $$y2022
000910504 915__ $$0StatID:(DE-HGF)0160$$2StatID$$aDBCoverage$$bEssential Science Indicators$$d2021-01-26
000910504 915__ $$0StatID:(DE-HGF)0113$$2StatID$$aWoS$$bScience Citation Index Expanded$$d2021-01-26
000910504 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
000910504 915__ $$0StatID:(DE-HGF)0430$$2StatID$$aNational-Konsortium$$d2022-11-17$$wger
000910504 915__ $$0StatID:(DE-HGF)0200$$2StatID$$aDBCoverage$$bSCOPUS$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)0300$$2StatID$$aDBCoverage$$bMedline$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)0199$$2StatID$$aDBCoverage$$bClarivate Analytics Master Journal List$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)1160$$2StatID$$aDBCoverage$$bCurrent Contents - Engineering, Computing and Technology$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)0150$$2StatID$$aDBCoverage$$bWeb of Science Core Collection$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)0100$$2StatID$$aJCR$$bINT J HIGH PERFORM C : 2021$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)0600$$2StatID$$aDBCoverage$$bEbsco Academic Search$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)0030$$2StatID$$aPeer Review$$bASC$$d2022-11-17
000910504 915__ $$0StatID:(DE-HGF)9900$$2StatID$$aIF < 5$$d2022-11-17
000910504 920__ $$lyes
000910504 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
000910504 980__ $$ajournal
000910504 980__ $$aVDB
000910504 980__ $$aUNRESTRICTED
000910504 980__ $$aI:(DE-Juel1)JSC-20090406
000910504 9801_ $$aFullTexts