TY  - EJOUR
AU  - Zaourar, Lilia
AU  - Benazouz, Mohamed
AU  - Mouhagir, Ayoub
AU  - Falquez, Carlos
AU  - Portero, Antoni
AU  - Ho, Nam
AU  - Suarez, Estela
AU  - Petrakis, Polydoros
AU  - Marazakis, Manolis
AU  - Sgherzi, Francesco
AU  - Fernandez, Ivan
AU  - Dolbeau, Romain
AU  - Pleiter, Dirk
TI  - Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC
M1  - FZJ-2024-03363
PY  - 2024
AB  - The memory systems of High-Performance Computing (HPC) systems commonly feature non-uniform data paths to memory, i.e., they are non-uniform memory access (NUMA) architectures. Memory is divided into multiple regions, with each processing unit having its own local memory. Therefore, each processing unit accesses its local memory regions faster than non-local regions. Architectures with hybrid memory technologies result in further non-uniformity. This paper presents case studies of the performance potential and data placement implications of non-uniform and heterogeneous memory in HPC systems. Using the gem5 and VPSim simulation platforms, we model NUMA systems with processors based on the ARMv8 Neoverse V1 Reference Design. The gem5 simulator provides a cycle-accurate view, while VPSim offers greater simulation speed, with a high-level view of the simulated system. We highlight the performance impact of design trade-offs regarding NUMA node organization and System Level Cache (SLC) group assignment, as well as Network-on-Chip (NoC) configuration. Our case studies provide essential input to a co-design process involving HPC processor architects and system integrators. A comparison of system configurations for different NoC bandwidths shows reduced NoC latency and substantially improved memory bandwidth when NUMA control is enabled. Furthermore, a configuration with HBM2 memory organized as four NUMA nodes highlights the memory bandwidth performance gap and NoC queuing latency impact when comparing local vs. remote memory accesses. On the other hand, NUMA can result in an unbalanced distribution of memory accesses and reduced SLC hit ratios, as shown with DDR4 memory organized as four NUMA nodes.
LB  - PUB:(DE-HGF)25
DO  - 10.34734/FZJ-2024-03363
UR  - https://juser.fz-juelich.de/record/1026292
ER  -