Analyzing HPC Monitoring Data With a View Towards Efficient Resource Utilization

Maloney, Samuel; Eicker, Norbert; Guimaraes, Filipe; Frings, Wolfgang; Suarez, Estela

doi:10.1109/SBAC-PAD63648.2024.00023

Journal Article/Contribution to a conference proceedings/Contribution to a book

FZJ-2024-05813

Maloney, S. (corresponding author; FZJ); Suarez, E. (FZJ); Eicker, N. (FZJ); Guimaraes, F. (FZJ); Frings, W. (FZJ)

2024
IEEE

2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD, Hilo, HI, USA, 13 Nov 2024 - 15 Nov 2024 2643-3001 170-181 (2024) [10.1109/SBAC-PAD63648.2024.00023]

Please use a persistent id in citations: doi:10.1109/SBAC-PAD63648.2024.00023 doi:10.34734/FZJ-2024-05813

Abstract: Compute nodes in modern HPC systems are growing in size and their hardware has become ever more diverse. Still, many HPC centers allocate the resources of full nodes exclusively to avoid contention, despite the associated risk of underutilization. This paper describes a thorough resource utilization study of CPU and GPU compute and memory capacity, and interconnect bandwidth on JUWELS, a mature leadership-class modular supercomputer, with the aim of identifying opportunities for improving utilization through advanced scheduling and node sharing. Separate analysis of CPU-only and GPU-accelerated nodes finds that CPU compute usage is already close to optimal for the CPU-only nodes, whereas there is plenty of scope for co-scheduling CPU-based jobs on GPU-accelerated nodes. Memory capacity and node-level interconnect bandwidth are sufficient to provision co-scheduled jobs. We analyze multiple one-month datasets to validate robustness of conclusions over time and compare with previous studies on other systems to establish generalizability of results.

Note: The data used for this study are available at: https://doi.org/10.26165/JUELICH-DATA/BDFBPQ

979-8-3503-5616-8/24/$31.00 © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Contributing Institute(s):

Jülich Supercomputing Center (JSC)

Appears in the scientific report 2024

Database coverage:
OpenAccess

The record appears in these collections:
Document types > Events > Contributions to a conference proceedings
Document types > Books > Contribution to a book
Document types > Articles > Journal Article
Workflow collections > Public records
Institute Collections > JSC
Publications database
Open Access

Record created 2024-10-11, last modified 2025-03-17
