Journal Article FZJ-2024-05715

http://join2-wiki.gsi.de/foswiki/pub/Main/Artwork/join2_logo100x88.png
Parallel and scalable AI in HPC systems for CFD applications and beyond

 ;  ;  ;

2024

2, 1444337 pp. () [10.3389/fhpcp.2024.1444337]

This record in other databases:  

Please use a persistent id in citations: doi:  doi:

Abstract: This manuscript presents the library AI4HPC with its architecture and components. The library enables large-scale trainings of AI models on High-Performance Computing systems. It addresses challenges in handling non-uniform datasets through data manipulation routines, model complexity through specialized ML architectures, scalability through extensive code optimizations that augment performance, HyperParameter Optimization (HPO), and performance monitoring. The scalability of the library is demonstrated by strong scaling experiments on up to 3,664 Graphical Processing Units (GPUs) resulting in a scaling efficiency of 96%, using the performance on 1 node as baseline. Furthermore, code optimizations and communication/computation bottlenecks are discussed for training a neural network on an actuated Turbulent Boundary Layer (TBL) simulation dataset (8.3 TB) on the HPC system JURECA at the Jülich Supercomputing Centre. The distributed training approach significantly influences the accuracy, which can be drastically compromised by varying mini-batch sizes. Therefore, AI4HPC implements learning rate scaling and adaptive summation algorithms, which are tested and evaluated in this work. For the TBL use case, results scaled up to 64 workers are shown. A further increase in the number of workers causes an additional overhead due to too small dataset samples per worker. Finally, the library is applied for the reconstruction of TBL flows with a convolutional autoencoder-based architecture and a diffusion model. In case of the autoencoder, a modal decomposition shows that the network provides accurate reconstructions of the underlying field and achieves a mean drag prediction error of ≈5%. With the diffusion model, a reconstruction error of ≈4% is achieved when super-resolution is applied to 5-fold coarsened velocity fields. The AI4HPC library is agnostic to the underlying network and can be adapted across various scientific and technical disciplines.


Note: Missing Journal: Frontiers in High Performance Computing (Front. High Perform. Comput.) = 2813-7337 (import from CrossRef, Journals: juser.fz-juelich.de); Please add the journal to the list of journals

Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 5111 - Domain-Specific Simulation & Data Life Cycle Labs (SDLs) and Research Groups (POF4-511) (POF4-511)
  2. RAISE - Research on AI- and Simulation-Based Engineering at Exascale (951733) (951733)

Appears in the scientific report 2024
Database coverage:
Creative Commons Attribution CC BY 4.0 ; OpenAccess
Click to display QR Code for this record

The record appears in these collections:
Dokumenttypen > Aufsätze > Zeitschriftenaufsätze
Workflowsammlungen > Öffentliche Einträge
Workflowsammlungen > Publikationsgebühren
Institutssammlungen > JSC
Publikationsdatenbank
Open Access

 Datensatz erzeugt am 2024-10-01, letzte Änderung am 2024-11-05


OpenAccess:
Volltext herunterladen PDF
Externer link:
Volltext herunterladenVolltext
Dieses Dokument bewerten:

Rate this document:
1
2
3
 
(Bisher nicht rezensiert)