Contribution to a conference proceedings FZJ-2022-03912

Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework

2022
IEEE

2022 IEEE International Conference on Cluster Computing (CLUSTER), Heidelberg, Germany, 6 Sep 2022 - 9 Sep 2022. IEEE, pp. 516-522 [doi:10.1109/CLUSTER51413.2022.00066]

Abstract: Deep Learning (DL) applications are used to solve complex problems efficiently. These applications require complex neural network models composed of millions of parameters and huge amounts of data for proper training. Training at this scale is only possible by parallelizing the necessary computations with so-called distributed deep learning (DDL) frameworks over many GPUs distributed over multiple nodes of an HPC cluster. These frameworks mostly utilize the compute power of the GPUs and use only a small portion of the available compute power of the CPUs in the nodes for I/O and inter-process communication, leaving many CPU cores idle. The more powerful the base CPU in the cluster nodes, the more compute resources are wasted. In this paper, we investigate how much of these unutilized compute resources could be used for executing other applications without lowering the performance of the DDL frameworks. In our experiments, we executed a noise-generation application, which generates a very high memory, network, or I/O load, in parallel with DDL frameworks, and used HPC profiling and tracing techniques to determine whether and how the generated noise affects the performance of the DDL frameworks. Early results indicate that it might be possible to utilize the idle cores for jobs of other users without negatively affecting the performance of the DDL applications.


Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)
  2. ExtraNoise – Leistungsanalyse von HPC-Anwendungen in verrauschten Umgebungen (Performance Analysis of HPC Applications in Noisy Environments) (449683531)
  3. ATMLPP - ATML Parallel Performance (ATMLPP)

Appears in the scientific report 2022
Database coverage:
OpenAccess

The record appears in these collections:
Document types > Events > Contributions to proceedings
Workflow collections > Public records
Institute collections > JSC
Publications database
Open Access

Record created 2022-10-27, last modified 2025-03-14

