Journal Article FZJ-2023-03545

Enhancing Distributed Neural Network Training Through Node-Based Communications

2023
IEEE [New York, NY]

IEEE transactions on neural networks and learning systems 35, 1 - 15 [10.1109/TNNLS.2023.3309735]

Please use a persistent id in citations: doi:10.1109/TNNLS.2023.3309735

Abstract: The amount of data needed to effectively train modern deep neural architectures has grown significantly, leading to increased computational requirements. These intensive computations are tackled by a combination of latest-generation computing resources, such as accelerators, and classic processing units. Nevertheless, gradient communication remains the major bottleneck, hindering efficiency despite the runtime improvements obtained through data parallelism strategies. Data parallelism involves all processes in a global exchange of a potentially large amount of data, which may prevent the desired speedup from being achieved and introduce noticeable delays. As a result, communication latency poses a significant challenge that profoundly impacts performance on distributed platforms. This research presents node-based optimization steps that significantly reduce the gradient exchange between model replicas whilst ensuring model convergence. The proposal serves as a versatile communication scheme, suitable for integration into a wide range of general-purpose deep neural network (DNN) algorithms. The optimization takes into account the specific location of each replica within the platform. To demonstrate its effectiveness, different neural network approaches and datasets with disjoint properties are used. In addition, multiple types of applications are considered to demonstrate the robustness and versatility of the proposal. The experimental results show a reduction in global training time whilst slightly improving accuracy.

Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 5111 - Domain-Specific Simulation & Data Life Cycle Labs (SDLs) and Research Groups (POF4-511)

Appears in the scientific report 2023
Database coverage:
Medline ; OpenAccess ; Clarivate Analytics Master Journal List ; Current Contents - Engineering, Computing and Technology ; Ebsco Academic Search ; Essential Science Indicators ; IF >= 10 ; JCR ; SCOPUS ; Science Citation Index Expanded ; Web of Science Core Collection

The record appears in these collections:
Document types > Articles > Journal Article
Workflow collections > Public records
Institute Collections > JSC
Publications database
Open Access

 Record created 2023-09-20, last modified 2024-01-16

