Journal Article FZJ-2023-02265

Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks


2023
SpringerOpen, Heidelberg [et al.]

Journal of Big Data 10(1), 96 (2023) [10.1186/s40537-023-00765-w]


Please use a persistent id in citations: doi:10.1186/s40537-023-00765-w

Abstract: Continuously increasing data volumes from multiple sources, such as simulation and experimental measurements, demand efficient algorithms for an analysis within a realistic timeframe. Deep learning models have proven to be capable of understanding and analyzing large quantities of data with high accuracy. However, training them on massive datasets remains a challenge and requires distributed learning exploiting High-Performance Computing systems. This study presents a comprehensive analysis and comparison of three well-established distributed deep learning frameworks - Horovod, DeepSpeed, and Distributed Data Parallel by PyTorch - with a focus on their runtime performance and scalability. Additionally, the performance of two data loaders, the native PyTorch data loader and the DALI data loader by NVIDIA, is investigated. To evaluate these frameworks and data loaders, three standard ResNet architectures with 50, 101, and 152 layers are tested using the ImageNet dataset. The impact of different learning rate schedulers on validation accuracy is also assessed. The novel contribution lies in the detailed analysis and comparison of these frameworks and data loaders on the state-of-the-art Jülich Wizard for European Leadership Science (JUWELS) Booster system at the Jülich Supercomputing Centre, using up to 1024 A100 NVIDIA GPUs in parallel. Findings show that the DALI data loader significantly reduces the overall runtime of ResNet50 from more than 12 h on 4 GPUs to less than 200 s on 1024 GPUs. The outcomes of this work highlight the potential impact of distributed deep learning using efficient tools on accelerating scientific discoveries and data-driven applications.
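The abstract notes that the impact of different learning rate schedulers on validation accuracy was assessed. A common choice in large-batch distributed ImageNet training is the linear-scaling rule combined with a gradual warmup; the sketch below illustrates that schedule. This is an illustrative assumption, not necessarily one of the schedulers evaluated in the article, and all function names and default values here are hypothetical.

```python
# Illustrative sketch (assumption, not from the article): the linear-scaling
# rule with gradual warmup, a schedule commonly used when the global batch
# size grows with the number of GPUs in distributed data-parallel training.

def warmup_linear_scaling_lr(epoch, n_gpus, base_lr=0.1, base_batch=256,
                             per_gpu_batch=256, warmup_epochs=5):
    """Learning rate at a given (possibly fractional) epoch.

    Scales the base learning rate proportionally to the global batch size,
    then ramps linearly from base_lr up to the scaled rate over the first
    warmup_epochs epochs to stabilize early large-batch training.
    """
    global_batch = per_gpu_batch * n_gpus
    scaled_lr = base_lr * global_batch / base_batch  # linear scaling rule
    if epoch < warmup_epochs:
        # Linear warmup from base_lr toward the scaled learning rate.
        frac = epoch / warmup_epochs
        return base_lr + frac * (scaled_lr - base_lr)
    return scaled_lr

# Example: with 1024 GPUs at 256 samples each, the global batch is 262144
# and the post-warmup learning rate becomes 0.1 * 1024 = 102.4.
```

On a single GPU the schedule stays flat at `base_lr`, since the global batch equals the reference batch; the warmup only matters once the batch size is scaled up.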

Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 5111 - Domain-Specific Simulation & Data Life Cycle Labs (SDLs) and Research Groups (POF4-511)
  2. 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)
  3. RAISE - Research on AI- and Simulation-Based Engineering at Exascale (951733)

Appears in the scientific report 2023
Database coverage:
Medline ; Creative Commons Attribution CC BY 4.0 ; DOAJ ; OpenAccess ; Article Processing Charges ; Clarivate Analytics Master Journal List ; Current Contents - Engineering, Computing and Technology ; DOAJ Seal ; Essential Science Indicators ; Fees ; IF >= 5 ; JCR ; SCOPUS ; Science Citation Index Expanded ; Web of Science Core Collection

The record appears in these collections:
Document types > Articles > Journal Article
Workflow collections > Public records
Workflow collections > Publication Charges
Institute Collections > JSC
Publications database
Open Access

 Record created 2023-06-09, last modified 2023-10-27

