TY  - CONF
AU  - Schieffer, Gabin
AU  - Shi, Ruimin
AU  - Markidis, Stefano
AU  - Herten, Andreas
AU  - Faj, Jennifer
AU  - Peng, Ivy
TI  - Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric
PB  - IEEE
M1  - FZJ-2025-00766
SP  - 567-576
PY  - 2024
AB  - Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, and memory allocation strategies between GPUs, as well as the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and utilization of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation method serves as a base for validating memory and communication strategies on a system and improving applications on AMD multi-GPU computing systems.
T2  - SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
CY  - 17 Nov 2024 - 22 Nov 2024, Atlanta, GA (USA)
Y2  - 17 Nov 2024 - 22 Nov 2024
M2  - Atlanta, GA, USA
LB  - PUB:(DE-HGF)8
UR  - <Go to ISI:>//WOS:001451792300060
DO  - 10.1109/SCW63240.2024.00079
UR  - https://juser.fz-juelich.de/record/1037595
ER  -