%0 Conference Paper
%A Schieffer, Gabin
%A Shi, Ruimin
%A Markidis, Stefano
%A Herten, Andreas
%A Faj, Jennifer
%A Peng, Ivy
%T Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric
%I IEEE
%M FZJ-2025-00766
%P 567-576
%D 2024
%X Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, and memory allocation strategies between GPUs, as well as the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and utilization of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation method serves as a base for validating memory and communication strategies on a system and improving applications on AMD multi-GPU computing systems.
%B SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
%C 17 Nov 2024 - 22 Nov 2024, Atlanta, GA (USA)
Y2 17 Nov 2024 - 22 Nov 2024
M2 Atlanta, GA, USA
%F PUB:(DE-HGF)8
%9 Contribution to a conference proceedings
%U <Go to ISI:>//WOS:001451792300060
%R 10.1109/SCW63240.2024.00079
%U https://juser.fz-juelich.de/record/1037595