# GPU Programming HPC: Modern Architectures & Trends MA-INF 1106, Bonn 20 January 2023 | Dr. Andreas Herten | Accelerating Devices Lab, Forschungszentrum Jülich # **Outline** Introduction **GPU History** JUPITER **JUWFLS** JUWELS Cluster JUWFLS Booster Platform Comparisons **GPU Architecture** Summary ``` Programming GPUs Libraries Directives CUDA C/C++ Kernels Grid, Blocks Memory Management Unified Memory Performance Analysis Beyond CUDA Cooperative Groups MPI Thrust Standard Parallelism HIP SYCL MORE MODELS!!1 ``` A short but unparalleled story 1999 Graphics computation pipeline implemented in dedicated *graphics hardware*Computations using OpenGL graphics library [2] »GPU« coined by NVIDIA [3] - 1999 Graphics computation pipeline implemented in dedicated *graphics hardware*Computations using OpenGL graphics library [2] »GPU« coined by NVIDIA [3] - 2001 NVIDIA GeForce 3 with *programmable* shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI - 1999 Graphics computation pipeline implemented in dedicated *graphics hardware*Computations using OpenGL graphics library [2] »GPU« coined by NVIDIA [3] - 2001 NVIDIA GeForce 3 with *programmable* shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI - 2007 CUDA - 1999 Graphics computation pipeline implemented in dedicated *graphics hardware*Computations using OpenGL graphics library [2] »GPU« coined by NVIDIA [3] - 2001 NVIDIA GeForce 3 with *programmable* shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI - 2007 CUDA - 2009 OpenCL - 1999 Graphics computation pipeline implemented in dedicated *graphics hardware*Computations using OpenGL graphics library [2] »GPU« coined by NVIDIA [3] - 2001 NVIDIA GeForce 3 with *programmable* shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI - 2007 CUDA - 2009 OpenCL - 2022 Top 500: 32 % with GPUs (#1, #2; 7 of top 10) [4], Green 500: 9 of top 10 with GPUs [5] - 1999 Graphics computation pipeline implemented in dedicated *graphics hardware*Computations using OpenGL graphics library [2] »GPU« coined by NVIDIA [3] - 2001 NVIDIA GeForce 3 with *programmable* shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI - 2007 CUDA - 2009 OpenCL - 2022 Top 500: 32 % with GPUs (#1, #2; 7 of top 10) [4], Green 500: 9 of top 10 with GPUs [5] - 2022 : Leonardo (250 PFLOP/s\*, Italy), NVIDIA GPUs; LUMI (552 PFLOP/s, Finland), AMD GPUs - $\blacksquare$ : Frontier ( $R_{\text{max}} = 1.102 \, \text{EFLOP/s}, \, \text{ORNL}$ ), AMD GPUs <sup>\*:</sup> Effective FLOP/s, not theoretical peak - 1999 Graphics computation pipeline implemented in dedicated *graphics hardware*Computations using OpenGL graphics library [2] »GPU« coined by NVIDIA [3] - 2001 NVIDIA GeForce 3 with *programmable* shaders (instead of fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI - 2007 CUDA - 2009 OpenCL - 2022 Top 500: 32 % with GPUs (#1, #2; 7 of top 10) [4], Green 500: 9 of top 10 with GPUs [5] - 2022 Leonardo (250 PFLOP/s\*, Italy), NVIDIA GPUs; LUMI (552 PFLOP/s, Finland), AMD GPUs Frontier (R<sub>max</sub> = 1.102 EFLOP/s, ORNL), AMD GPUs - Soon : JUPITER ( $\approx 1 \, \text{EFLOP/s}, \, \text{JSC}$ ) - $\blacksquare$ : Aurora ( $\approx$ 2 EFLOP/s, Argonne), Intel GPUs; El Capitan ( $\approx$ 2 EFLOP/s, LLNL), AMD GPUs <sup>\*:</sup> Effective FLOP/s, not theoretical peak # **JUPITER** ### **Recent Developments** - Jülich (JSC) selected as Hosting Entity for first European Exascale supercomputer: JUPITER - System procured together with EuroHPC JU - After long preparation, RFI finally published on Monday! - Now we start the exciting negotiations... # **JUPITER** **Recent Developments** - Jülich (JSC) selected as Ho JUPITER - System procured together - After long preparation, RFI - Now we start the exciting r European High Performance Computing Joint Undertaking # GENERAL INVITATION TO TENDER EUROHPC/2023/CD/0001 ### Descriptive Document Acquisition, delivery, installation and hardware and software maintenance of JUPITER Exascale Supercomputer for the European High Performance Computing Joint Undertaking (EuroHPC) xascale supercomputer Descriptive document EUROHPC/2023/CD/0001 1/116 J The JUPITER SOLWARE components must enable a dynamic MSA architecture without soltware restrictions. Scheduler, resource manager and MPI runtime, among others, must enable advanced dynamic MSA features and deliver the features and benefits documented by the DEEP research projects. For some control plane components (see above), JSC wishes to take advantage of its own xOPS software stack complemented with components that are a result of the collaborative work developed in the ParaStation consortium. Proposals for further (possibly module-specific) solutions that are optimized for the hardware offered will also be accepted if the above requirements can be realized in a superior or equal manner. Considering the core capabilities of JUPITER's hardware and software, the centre stage is taken by accelerated devices: At the heart of JUPITER is the highly scalable **Booster** module, with the option of having two accelerated, tightly connected modules, built on the same accelerator technologies and networks, if that provides additional value. Given the technology developments in recent years, graphical processing unit (GPU) based accelerators are expected to provide at least one exaflop of double precision floating point performance as to the sustained HPL's R<sub>max</sub> within the aforementioned power footprint. Lower floating-point precision will reach even higher performance numbers. Those accelerators are able to handle both classical HPC and novel AI workloads and are well suited for HPDA. A high-bandwidth and low-latency interconnect for the Booster will be required to provide network capabilities for highly parallel workloads and, at the same time, fast access to the storage backends. Based on positive experience with the JUWELS Booster Dragonfly topology that provides 200 Gbit per GPU (or 800 Gbit per node) this is considered also an appropriate # **Status Quo Across Architectures** ### Performance # **Status Quo Across Architectures** ### **Memory Bandwidth** ### JUWELS Cluster – Jülich's Scalable System - 2500 nodes with Intel Xeon CPUs (2 × 24 cores) - 46 + 10 nodes with 4 NVIDIA Tesla V100 cards (16 GB memory) - 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #86) # **JUWELS** Booster – Scaling Higher! - lacksquare 936 nodes with AMD EPYC Rome CPUs (2 imes 24 cores) - Each with 4 NVIDIA A100 Ampere GPUs (each: FP64TC: 19.5 TFLOP/s, 40 GB memory) - ullet InfiniBand DragonFly+ HDR-200 network; 4 imes 200 Gbit/s per node # **Top500 List Nov 2022:** - #1 Europe - #8 World - #4\* Top/Green500 ### **JUWELS** Booster – Scaling Higher! - 936 nodes with AMD EPYC Rome CPUs (2 × 24 cores) - Each with 4 NVIDIA A100 Ampere GPUs (each: FP64TC: 19.5 TFLOP/s, 40 GB memory) - ullet InfiniBand DragonFly+ HDR-200 network; 4 imes 200 Gbit/s per node # Platform # CPU vs. GPU ### A matter of specialties aphics: lee [8] and Shearings Holidays # CPU vs. GPU ### A matter of specialties Transporting one **Transporting many** aphics: Lee [8] and Shearings Holiday # CPU vs. GPU Chip ### GPU optimized to hide latency - Memory - GPU has small (40 GB), but high-speed memory 1555 GB/s - Stage data to GPU memory: via PCIe 4 bus (32 GB/s) Device ### GPU optimized to hide latency - Memory - GPU has small (40 GB), but high-speed memory 1555 GB/s - Stage data to GPU memory: via PCIe 4 bus (32 GB/s) Device ### GPU optimized to hide latency - Memory - GPU has small (40 GB), but high-speed memory 1555 GB/s - Stage data to GPU memory: via PCIe 4 bus (32 GB/s) - Stage automatically (*Unified Memory*), or manually Device ### GPU optimized to hide latency - Memory - GPU has small (40 GB), but high-speed memory 1555 GB/s - Stage data to GPU memory: via PCle 4 bus (32 GB/s) - Stage automatically (*Unified Memory*), or manually - Two engines: Overlap compute and copy Device ### GPU optimized to hide latency - Memory - GPU has small (40 GB), but high-speed memory 1555 GB/s - Stage data to GPU memory: via PCIe 4 bus (32 GB/s) - Stage automatically (*Unified Memory*), or manually - Two engines: Overlap compute and copy V100 32 GB RAM, 900 GB/s A100 40 GB RAM, 1555 GB/s Host Device ### GPU optimized to hide latency - Memory - GPU has small (40 GB), but high-speed memory 1555 GB/s - Stage data to GPU memory: via PCIe 4 bus (32 GB/s) - Stage automatically (*Unified Memory*), or manually - Two engines: Overlap compute and copy SIMT V100 32 GB RAM, 900 GB/s A100 40 GB RAM, 1555 GB/s Slide 10174 Host Device $\mathsf{SIMT} = \mathsf{SIMD} \oplus \mathsf{SMT}$ - CPU: - Single Instruction, Multiple Data (SIMD) ### Scalar $\mathsf{SIMT} = \mathsf{SIMD} \oplus \mathsf{SMT}$ - CPU: - Single Instruction, Multiple Data (SIMD) ### Vector $SIMT = SIMD \oplus SMT$ - CPU: - Single Instruction, Multiple Data (SIMD) - Simultaneous Multithreading (SMT) ### Vector $SIMT = SIMD \oplus SMT$ - CPU: - Single Instruction, Multiple Data (SIMD) - Simultaneous Multithreading (SMT) ### Vector ### SMT $SIMT = SIMD \oplus SMT$ - CPU: - Single Instruction, Multiple Data (SIMD) - Simultaneous Multithreading (SMT) - GPU: Single Instruction, Multiple Threads (SIMT) ### Vector ### SMT $SIMT = SIMD \oplus SMT$ - CPU: - Single Instruction, Multiple Data (SIMD) - Simultaneous Multithreading (SMT) - GPU: Single Instruction, Multiple Threads (SIMT) ### Vector ### SMT $SIMT = SIMD \oplus SMT$ - CPU: - Single Instruction, Multiple Data (SIMD) - Simultaneous Multithreading (SMT) - GPU: Single Instruction, Multiple Threads (SIMT) - CPU core ≈ GPU multiprocessor (SM) - Working unit: set of threads (32, a warp) - Fast switching of threads (large register file) - Branching if — ### Vector ### SMT ### $\mathsf{SIMT} = \mathsf{SIMD} \oplus \mathsf{SMT}$ ### Vector ### SMT ### $\mathsf{SIMT} = \mathsf{SIMD} \oplus \mathsf{SMT}$ ### Vector ### SMT #### Multiprocessor # **SIMT** #### Vector #### **SMT** #### SIMT ## A100 vs H100 Comparison of current vs. next generation ## A100 vs H100 Comparison of current vs. next generation ## A100 vs H100 #### Comparison of current vs. next generation ## CPU vs. GPU #### Let's summarize this! ## Optimized for low latency - + Large main memory - + Fast clock rate - + Large caches - + Branch prediction - + Powerful ALU - Relatively low memory bandwidth - Cache misses costly - Low performance per watt #### Optimized for high throughput - + High bandwidth main memory - + Latency tolerant (parallelism) - + More compute resources - + High performance per watt - Limited memory capacity - Low per-thread performance - Extension card # Programming GPUs ## **Preface: CPU** #### A simple CPU program! ``` SAXPY: \vec{y} = a\vec{x} + \vec{y}, with single precision Part of LAPACK BLAS Level 1 void saxpy(int n, float a, float * x, float * y) { for (int i = 0; i < n; i++) y[i] = a * x[i] + v[i]; int a = 42: int n = 10: float x[n], y[n]; // fill x, v saxpy(n, a, x, y); ``` # **Summary of Acceleration Possibilities** # **Summary of Acceleration Possibilities** # **Libraries** Programming GPUs is easy: Just don't! ## Libraries Programming GPUs is easy: Just don't! Use applications & libraries Slide 17174 Use applications & libraries Wizard: Breazell [13] # Use applications & libraries Numba # Use applications & libraries Numba Wizard: Breazell [13] ### **cuBLAS** #### Parallel algebra - GPU-parallel BLAS (all 152 routines) - Single, double, complex data types - Constant competition with Intel's MKL - Multi-GPU support - → https://developer.nvidia.com/cublas http://docs.nvidia.com/cuda/cublas ## **CUBLAS** #### Code example ``` int a = 42; int n = 10; float x[n], y[n]; // fill x, y cublasHandle t handle: cublasCreate(Shandle): float * d x, * d y; cudaMallocManaged(\delta d_x, n * sizeof(x[0])); cudaMallocManaged(\delta d v. n * sizeof(v[0])): cublasSaxpv(n. a. d x. 1. d v. 1): cublasGetVector(n, sizeof(v[0]), d v, 1, v, 1); cudaFree(d x); cudaFree(d y); cublasDestroy(handle); ``` ## cuBLAS #### Code example ``` int a = 42; int n = 10; float x[n], y[n]; // fill x, y cublasHandle t handle: cublasCreate(&handle); float * d x, * d y; Allocate GPU memory cudaMallocManaged(\delta d_x, n * sizeof(x[0])); cudaMallocManaged(\delta d_y, n * sizeof(y[0])); Call BLAS routine cublasSaxpy(n, a, d_x, 1, d_y, 1); Copy result to host cublasGetVector(n, sizeof(y[0]), d y, 1, y, 1); cudaFree(d_x); cudaFree(d_y); cublasDestroy(handle); ``` # **Programming GPUs** **Directives** # **GPU Programming with Directives** #### Keepin' you portable Annotate serial source code by directives ``` #pragma acc loop for (int i = 0; i < 1; i++) {};</pre> ``` Slide 22174 # **GPU Programming with Directives** #### Keepin' you portable Annotate serial source code by directives ``` #pragma acc loop for (int i = 0; i < 1; i++) {};</pre> ``` - OpenACC: Especially for GPUs; OpenMP: Has GPU support - Compiler interprets directives, creates according instructions # **GPU Programming with Directives** #### Keepin' you portable Annotate serial source code by directives ``` #pragma acc loop for (int i = 0; i < 1; i++) {};</pre> ``` - OpenACC: Especially for GPUs; OpenMP: Has GPU support - Compiler interprets directives, creates according instructions #### Pro - Portability - Other compiler? No problem! To it, it's a serial program - Different target architectures from same code - Easy to program #### Con - Only few compilers - Not all the raw power available - A little harder to debug # OpenACC / OpenMP Code example ``` void saxpv acc(int n, float a, float * x, float * v) { #pragma acc kernels for (int i = 0; i < n; i++) v[i] = a * x[i] + v[i]: float a = 42; int n = 10; float x[n], y[n]; // fill x. v saxpy_acc(n, a, x, y); ``` # OpenACC / OpenMP **Code example** ``` void saxpy_acc(int n, float a, float * x, float * y) { #pragma omp target map(to:x[0:n]) map(tofrom:y[0:n]) loop for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i]; } float a = 42; int n = 10; float x[n], y[n]; // fill x, y saxpy acc(n, a, x, y);</pre> ``` Finally... Finally... OpenCL Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, ...) 2009 - Platform: Programming language (OpenCL C/C++), API, and compiler - Targets CPUs, GPUs, FPGAs, and other many-core machines - Fully open source #### Finally... OpenCL Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, ...) 2009 - Platform: Programming language (OpenCL C/C++), API, and compiler - Targets CPUs, GPUs, FPGAs, and other many-core machines - Fully open source CUDA NVIDIA's GPU platform 2007 - Platform: Drivers, programming language (CUDA C/C++), API, compiler, tools, ... - Only NVIDIA GPUs - Compilation with nvcc (free, but not open) clang has CUDA support, but CUDA needed for last step - Also: CUDA Fortran; and more in NVIDIA HPC SDK #### Finally... OpenCL Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, ...) 2009 - Platform: Programming language (OpenCL C/C++), API, and compiler - Targets CPUs, GPUs, FPGAs, and other many-core machines - Fully open source CUDA NVIDIA's GPU platform 2007 - Platform: Drivers, programming language (CUDA C/C++), API, compiler, tools, ... - Only NVIDIA GPUs - Compilation with nvcc (free, but not open) clang has CUDA support, but CUDA needed for last step - Also: CUDA Fortran; and more in NVIDIA HPC SDK HIP AMD's unified programming model for AMD (via ROCm) and NVIDIA GPUs 2016+ SYCL Intel's unified programming model for CPUs and GPUs (also: DPC++) #### Finally... OpenCL Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, ...) 2009 - Platform: Programming language (OpenCL C/C++), API, and compiler - Targets CPUs, GPUs, FPGAs, and other many-core machines - Fully open source CUDA NVIDIA's GPU platform 2007 - Platform: Drivers, programming language (CUDA C/C++), API, compiler, tools, ... - Only NVIDIA GPUs - Compilation with nvcc (free, but not open) clang has CUDA support, but CUDA needed for last step - Also: CUDA Fortran; and more in NVIDIA HPC SDK HIP AMD's unified programming model for AMD (via ROCm) and NVIDIA GPUs 2016+ SYCL Intel's unified programming model for CPUs and GPUs (also: DPC++) - Choose what flavor you like, what colleagues/collaboration is using - Hardest: Come up with parallelized algorithm #### Finally... OpenCL Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, ...) 2009 - Platform: Programming language (OpenCL C/C++), API, and compiler - Targets CPUs, GPUs, FPGAs, and other many-core machines - Fully open source #### CUDA NVIDIA's GPU platform 2007 - Platform: Drivers, programming language (CUDA C/C++), API, compiler, tools, ... - Only NVIDIA GPUs - Compilation with nvcc (free, but not open) clang has CUDA support, but CUDA needed for last step - Also: CUDA Fortran; and more in NVIDIA HPC SDK HIP AMD's unified programming model for AMD (via ROCm) and NVIDIA GPUs 2016+ SYCL Intel's unified programming model for CPUs and GPUs (also: DPC++) - Choose what flavor you like, what colleagues/collaboration is using - Hardest: Come up with parallelized algorithm # CUDA C/C++ **Programming GPUs** ## **CUDA SAXPY** #### With runtime-managed data transfers ``` global void saxpy cuda(int n, float a, float * x, float * y) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) v[i] = a * x[i] + v[i]: int a = 42; int n = 10; float x[n], y[n]; // fill x, y cudaMallocManaged(&x. n * sizeof(float)); cudaMallocManaged(&y, n * sizeof(float)); saxpy cuda<<<2, 5>>>(n, a, x, y); ``` cudaDeviceSvnchronize(): In software: Threads, Blocks • Methods to exploit parallelism: In software: Threads, Blocks - Methods to exploit parallelism: - Thread 3 In software: Threads, Blocks - Methods to exploit parallelism: - Threads In software: Threads, Blocks Methods to exploit parallelism: $\blacksquare \quad \underline{\mathsf{Threads}} \to \underline{\mathsf{Block}}$ In software: Threads, Blocks Methods to exploit parallelism: - $\bullet \quad \underbrace{\mathsf{Threads}}_{} \to \underbrace{\mathsf{Block}}_{}$ - Block In software: Threads, Blocks • Methods to exploit parallelism: - $\blacksquare \quad \text{Threads} \rightarrow \quad \text{Block}$ - Blocks In software: Threads, Blocks - Methods to exploit parallelism: - Threads → Block - lacks ightarrow Grid In software: Threads, Blocks Methods to exploit parallelism: - $\blacksquare \quad \text{Threads} \rightarrow \quad \text{Block}$ - lacks ightarrow Grid - Threads & blocks in 3D In software: Threads, Blocks - Methods to exploit parallelism: - $\blacksquare \quad \text{Threads} \rightarrow \quad \text{Block}$ - lacks ightarrow Grid - Threads & blocks in 3D - Parallel function: kernel - \_\_global\_\_ kernel(int a, float \* b) { } - Access own ID by global variables threadIdx.x, blockIdx.y,... - Execution entity: threads - Lightweight → fast switchting! - $lue{}$ 1000s threads execute simultaneously ightarrow order non-deterministic! ### **Kernel Functions** - Kernel: Parallel GPU function - Executed by each thread - In parallel - Called from host or device ### **Kernel Functions** - Kernel: Parallel GPU function - Executed by each thread - In parallel - Called from host or device - All threads execute same code; but can take different paths in program flow (some penalty) ### **Kernel Functions** - Kernel: Parallel GPU function - Executed by each thread - In parallel - Called from host or device - All threads execute same code; but can take different paths in program flow (some penalty) - Info about thread: local, global IDs ``` int currentThreadId = threadIdx.x; float x = input[currentThreadId]; output[currentThreadId] = x*x; ``` Recipe for C Function $\rightarrow$ CUDA Kernel Identify Loops ``` void scale(float scale, float * in, float * out, int N) { for (int i = 0; i < N; i++) out[i] = scale * in[i]; }</pre> ``` Recipe for C Function $\rightarrow$ CUDA Kernel **Identify Loops** ``` void scale(float scale, float * in, float * out, int N) { for ( int i = 0; i < N; i++ ) out[i] = scale * in[i]; }</pre> ``` Recipe for C Function $\rightarrow$ CUDA Kernel Identify Loops Extract Index ``` void scale(float scale, float * in, float * out, int N) { int i = 0 for (; i < N; i++ ) out[i] = scale * in[i]; }</pre> ``` Recipe for C Function $\rightarrow$ CUDA Kernel Identify Loops Extract Index Extract Termination Condition ``` void scale(float scale, float * in, float * out, int N) { int i = 0 for (: i++ if (i < N) out[i] = scale * in[i]: ``` Recipe for C Function $\rightarrow$ CUDA Kernel Identify Loops | Extract Index | Extract Termination Condition | Remove for ``` void scale(float scale, float * in, float * out, int N) { int i = 0 ``` ``` if (i < N) out[i] = scale * in[i]: ``` Recipe for C Function $\rightarrow$ CUDA Kernel Identify Loops | Extract Index | Extract Termination Condition | Remove for | Add global ``` global void scale(float scale, float * in, float * out, int N) { int i = 0 ``` ``` if (i < N) out[i] = scale * in[i]: ``` Recipe for C Function $\rightarrow$ CUDA Kernel Identify Loops Extract Index Extract Termination Condition Remove for Add global ``` __global__ void scale(float scale, float * in, float * out, int N) { int i = threadIdx.x: ``` ``` if (i < N) out[i] = scale * in[i]: ``` Recipe for C Function $\rightarrow$ CUDA Kernel Identify Loops Extract Index Extract Termination Condition Remove for Add global Replace i by threadIdx.x ... including block configuration ``` global void scale(float scale, float * in, float * out, int N) { int i = threadIdx.x + blockIdx.x * blockDim.x; ``` ``` if (i < N) out[i] = scale * in[i]: ``` ### **Summary** C function with explicit loop ``` void scale(float scale, float * in, float * out, int N) { for (int i = 0; i < N; i++) out[i] = scale * in[i]; } CUDA kernel with implicit loop __global__ void scale(float scale, float * in, float * out, int N) { int i = threadIdx.x + blockIdx.x * blockDim.x; if (i < N)</pre> ``` out[i] = scale \* in[i]: ``` kernel<<<iint gridDim, int blockDim>>>(...) ``` Parallel threads of kernel launched with triple-chevron syntax Slide 31174 - Total number of threads, divided into - Number of blocks on the grid (gridDim) - Number of threads per block (blockDim) ``` kernel<<<iint gridDim, int blockDim>>>(...) ``` - Parallel threads of kernel launched with triple-chevron syntax - Total number of threads, divided into - Number of blocks on the grid (gridDim)Number of threads per block (blockDim) ``` kernel<<<iint gridDim, int blockDim>>>(...) ``` - Parallel threads of kernel launched with triple-chevron syntax - Total number of threads, divided into - Number of blocks on the grid (gridDim)Number of threads per block (blockDim) - Call returns immediately; kernel launch is asynchronous! ``` kernel<<<iint gridDim, int blockDim>>>(...) ``` - Parallel threads of kernel launched with triple-chevron syntax - Total number of threads, divided into - Number of blocks on the grid (gridDim)Number of threads per block (blockDim) - Call returns immediately; kernel launch is asynchronous! - Example: ``` int nThreads = 32; scale<<<N/nThreads, nThreads>>>(23, in, out, N) ``` ``` kernel<<<iint gridDim, int blockDim>>>(...) ``` - Parallel threads of kernel launched with triple-chevron syntax - Total number of threads, divided into - Number of blocks on the grid (gridDim)Number of threads per block (blockDim) - Call returns immediately; kernel launch is asynchronous! - Example: ``` int nThreads = 32; scale<<<N/nThreads, nThreads>>>(23, in, out, N) ``` Possibility for too many threads; include termination condition into kernel! ### **Full Kernel Launch** For Reference ``` kernel<<<dim3 gD, dim3 bD, size_t shared, cudaStream_t stream>>>(...) ``` Slide 32174 • 2 additional, optional parameters ### **Full Kernel Launch** #### For Reference ``` kernel<<<dim3 gD, dim3 bD, size_t shared, cudaStream_t stream>>>(...) ``` 2 additional, optional parameters ### shared Dynamic shared memory - Small GPU memory space; share data in block (high bandwidth) - Shared memory: allocate statically (compile time) or dynamically (run time) - size\_t shared: bytes of shared memory allocated per block (in addition to static shared memory) ### **Full Kernel Launch** #### For Reference ``` kernel<<<dim3 gD, dim3 bD, size_t shared, cudaStream_t stream>>>(...) ``` • 2 additional, optional parameters ### shared Dynamic shared memory - Small GPU memory space; share data in block (high bandwidth) - Shared memory: allocate statically (compile time) or dynamically (run time) - size\_t shared: bytes of shared memory allocated per block (in addition to static shared memory) ### stream Associated CUDA stream - CUDA streams enable different channels of communication with GPU - Can overlap in some cases (communication, computation) - cudaStream\_t stream: ID of stream to use for this kernel launch ■ Threads & blocks in 3D - Threads & blocks in 3D - Create 3D configurations with struct dim3 dim3 blockOrGridDim(size\_t dimX, size\_t dimY, size\_t dimZ) - Threads & blocks in 3D - Create 3D configurations with struct dim3 ``` dim3 blockOrGridDim(size_t dimX, size_t dimY, size_t dimZ) ``` Example: ``` dim3 blockDim(32, 32); dim3 gridDim = {1000, 100}; ``` - Threads & blocks in 3D - Create 3D configurations with struct dim3 ``` dim3 blockOrGridDim(size_t dimX, size_t dimY, size_t dimZ) ``` Example: ``` dim3 blockDim(32, 32); dim3 gridDim = {1000, 100}; ``` Kernel call with dim3 ``` kernel<<<dim3 gridDim, dim3 blockDim>>>(...) ``` Block and grid sizes are hardware-dependent - Block and grid sizes are hardware-dependent - For JSC GPUs: Tesla V100, A100 Block - $\vec{N}_{\mathsf{Thread}} \leq (1024_{\mathsf{x}}, 1024_{\mathsf{y}}, 64_{\mathsf{z}})$ - $|\vec{N}_{\mathsf{Thread}}| = N_{\mathsf{Thread}} \leq 1024$ - Block and grid sizes are hardware-dependent - For JSC GPUs: Tesla V100, A100 Block $$\vec{N}_{Thread} \leq (1024_x, 1024_y, 64_z)$$ • $$|\vec{N}_{\mathsf{Thread}}| = N_{\mathsf{Thread}} \leq 1024$$ Grid • $$\vec{N}_{Blocks} \le (2147483647_x, 65535_y, 65535_z) = (2^{31}, 2^{16}, 2^{16}) - \vec{1}$$ - Block and grid sizes are hardware-dependent - For JSC GPUs: Tesla V100, A100 Block $$\vec{N}_{Thread} \leq (1024_x, 1024_y, 64_z)$$ • $$|\vec{N}_{\mathsf{Thread}}| = N_{\mathsf{Thread}} \leq 1024$$ Grid • $$\vec{N}_{Blocks} \le (2147483647_x, 65535_y, 65535_z) = (2^{31}, 2^{16}, 2^{16}) - \vec{1}$$ Find out yourself: deviceQuery example from CUDA Samples - Block and grid sizes are hardware-dependent - For JSC GPUs: Tesla V100, A100 ``` Block \vec{N}_{Thread} \le (1024_x, 1024_y, 64_z) • |\vec{N}_{Thread}| = N_{Thread} \le 1024 Grid • \vec{N}_{Blocks} \le (2147483647_x, 65535_y, 65535_z) = (2^{31}, 2^{16}, 2^{16}) - \vec{1} ``` - Find out yourself: deviceQuery example from CUDA Samples - Workflow: Chose 128 or 256 as block dim; calculate grid dim from problem size ``` int Nx = 1000, Ny = 1000; dim3 blockDim(16, 16); int gx = (Nx % blockDim.x == 0) Nx / blockDim.x : Nx / blockDim.x + 1; int gy = (Ny % blockDim.y == 0) Ny / blockDim.y : Ny / blockDim.y + 1; dim3 gridDim(gx, gy); kernel<<<gridDim, blockDim>>>(); ``` ### **Hardware Threads** **Mapping Software Threads to Hardware** Slide 35174 # **Memory Management** #### With Automated Transfers Allocate memory to be used on GPU or CPU ``` cudaMallocManaged(T** ptr, size_t nBytes) ``` Data is copied to GPU or to CPU automatically (managed) # **Memory Management** #### **With Automated Transfers** Allocate memory to be used on GPU or CPU ``` cudaMallocManaged(T** ptr, size_t nBytes) ``` Slide 36174 - Data is copied to GPU or to CPU automatically (managed) - Example: ``` float * a; int N = 2048; cudaMallocManaged(&a, N * sizeof(float)); ``` #### With Automated Transfers Allocate memory to be used on GPU or CPU ``` cudaMallocManaged(T** ptr, size_t nBytes) ``` - Data is copied to GPU or to CPU automatically (managed) - Example: ``` float * a; int N = 2048; cudaMallocManaged(&a, N * sizeof(float)); ``` Free device memory ``` cudaFree(void* ptr) ``` #### With Manual Transfers Allocate memory to be used on GPU ``` cudaMalloc(T** ptr, size_t nBytes) ``` #### With Manual Transfers Allocate memory to be used on GPU ``` cudaMalloc(T** ptr, size_t nBytes) ``` ■ Copy data between host ↔ device ``` cudaMemcpy(void* dst, void* src, size_t nByte, enum cudaMemcpyKind dir) ``` #### With Manual Transfers Allocate memory to be used on GPU ``` cudaMalloc(T** ptr, size_t nBytes) ``` ■ Copy data between host ↔ device ``` cudaMemcpy(void* dst, void* src, size_t nByte, enum cudaMemcpyKind dir) ``` Example: ``` float * a, * a_d; int N = 2048; // fill a cudaMalloc(&a_d, N * sizeof(float)); cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice); kernel<<<1,1>>>(a_d, N); cudaMemcpy(a , a_d, N * sizeof(float), cudaMemcpyDeviceToHost); ``` Member of the Helmholtz Association 20 January 2023 Slide 37174 # **Unified Memory** Overview - Everything started with manual data management - First Unified Memory since CUDA 6.0 - Better Unified Memory better since CUDA 8.0 - Now: Unified Memory great default, explicit memory only a possible optimization # Manual Memory vs. Unified Memory ``` void sortfile(FILE *fp, int N) { void sortfile(FILE *fp, int N) { char *data: char *data: char *data d: data = (char *)malloc(N); cudaMallocManaged(&data, N): cudaMalloc(&data d, N); fread(data, 1, N, fp): fread(data, 1, N, fp): cudaMemcpy(data_d, data, N, cudaMemcpyHostToDevice); kernel<<<....>>>(data. N): kernel<<<....>>>(data. N): cudaDeviceSynchronize(): cudaMemcpv(data. data d. N. cudaMemcpvDeviceToHost): host func(data) host func(data): cudaFree(data d); free(data); cudaFree(data); ``` ``` cudaMallocManaged(&ptr, ...); *ptr = 1; kernel<<<...>>(ptr); ``` ``` cudaMallocManaged(∂ptr, ...); ← Empty! No pages anywhere yet (like malloc()) ``` ``` *ptr = 1; CPU page fault: data allocates on CPU ``` ``` kernel<<<...>>(ptr); ``` #### Under the hood cudaMallocManaged(&ptr, ...); ← Empty! No pages anywhere yet (like malloc()) ``` cudaMallocManaged(&ptr, ...); Empty! No pages anywhere yet (like malloc()) *ptr = 1; CPU page fault: data allocates on CPU kernel<<<...>>(ptr); GPU page fault: data migrates to GPU ``` - Pages populate on first touch - Pages migrate on-demand - GPU memory over-subscription possible - Concurrent access from CPU and GPU to memory (page-level) # **Performance Analysis** Comparing scale\_vector\_um (Unified Memory) and scale\_vector (manual copy) for 20 480 float elements. | Time(%) | Total Time (ns) | Name | |---------|-----------------|----------------------------------------------| | | | | | 100.0 | 463,286 | <pre>scale(float, float*, float*, int)</pre> | | Time(%) | Total Time (ns) | Name | |---------|-----------------|----------------------------------------------| | | | | | 100.0 | 4,792 | <pre>scale(float, float*, float*, int)</pre> | # **Performance Analysis** Comparing scale\_vector\_um (Unified Memory) and scale\_vector (manual copy) for 20 480 float elements. | lime(%) | lotal lime (ns) | Name | |---------|-----------------|----------------------------------------------| | | | | | 100.0 | 463,286 | <pre>scale(float, float*, float*, int)</pre> | 100× *slower?!* What's going wrong here? | - | <br>- | - | - | |---|-------|---|---| | | 1 | | 0 | | | | | | . 0 | - | _ | - | - | - | - | - | - | _ | - | _ | _ | _ | - | _ | - | _ | - | - | _ | _ | - | - | - | _ | _ | - | _ | _ | _ | - | _ | _ | _ | - | - | _ | - | |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | | 4 | , | 7 | 9 | 2 | S | C | a | ι | e | ( | f | ι | 0 | a | t | , | | f | ι | 0 | a | t | * | , | | f | ι | 0 | a | t | * | , | | i | n | t | Comparing scale\_vector\_um (Unified Memory) and scale\_vector (manual copy) for 20.480 float elements. | Time(%) | Total Time (ns) | Name | |---------|-----------------|----------------------------------------------| | | | | | 100.0 | 4,792 | <pre>scale(float, float*, float*, int)</pre> | Comparing scale\_vector\_um (Unified Memory) and scale\_vector (manual copy) for 20.480 float elements. | - | CUDA HW (Tesla V100-PCIE-160 | | | | |---|------------------------------|------------------|---|-------------| | | ▼ 9.6% Kernels | | S | | | | ▶ 100.0% scale | | s | | | | ▼ 90.4% Memory | | | | | | 56.5% HtoD memcpy | Memcpy HtoD (Pag | | | | | 43.5% DtoH memcpy | | | Memcpy DtoH | Σ # **Comparing UM and Explicit Transfers** UM Kernel is launched, data is needed by kernel, data migrates host→device ⇒ Run time of kernel incorporates time for data transfers Explicit Data will be needed by kernel – data migrates host—device before kernel launch ⇒ Run time of **kernel** without any transfers # **Comparing UM and Explicit Transfers** UM Kernel is launched, data is needed by kernel, data migrates host→device ⇒ Run time of kernel incorporates time for data transfers Explicit Data will be needed by kernel – data migrates host→device **before** kernel launch ⇒ Run time of **kernel** without any transfers - UM more convenient - Total run time of whole program does not principally change Except: Fault handling costs $\mathcal{O}$ (10 $\mu$ s), stalls execution - But data transfers sometimes sorted to kernel launch # **Comparing UM and Explicit Transfers** UM Kernel is launched, data is needed by kernel, data migrates host→device ⇒ Run time of kernel incorporates time for data transfers Explicit Data will be needed by kernel – data migrates host→device **before** kernel launch ⇒ Run time of **kernel** without any transfers - UM more convenient - Total run time of whole program does not principally change Except: Fault handling costs $O(10 \, \mu s)$ , stalls execution - But data transfers sometimes sorted to kernel launch - ⇒ Improve UM behavior with performance hints! **New API routines** API calls to augment data location knowledge of runtime cudaMemPrefetchAsync(data, length, device, stream) Prefetches data to device (on stream) asynchronously **New API routines** API calls to augment data location knowledge of runtime - cudaMemPrefetchAsync(data, length, device, stream) Prefetches data to device (on stream) asynchronously - cudaMemAdvise(data, length, advice, device) Advise about usage of given data, advice: **New API routines** API calls to augment data location knowledge of runtime - cudaMemPrefetchAsync(data, length, device, stream) Prefetches data to device (on stream) asynchronously - cudaMemAdvise(data, length, advice, device) Advise about usage of given data, advice: - cudaMemAdviseSetReadMostly: Read-only copy is kept #### **New API routines** API calls to augment data location knowledge of runtime - cudaMemPrefetchAsync(data, length, device, stream) Prefetches data to device (on stream) asynchronously - cudaMemAdvise(data, length, advice, device) Advise about usage of given data, advice: - cudaMemAdviseSetReadMostly: Read-only copy is kept - cudaMemAdviseSetPreferredLocation: Set preferred location to avoid migrations; first access will establish mapping Member of the Helmholtz Association 20 January 2023 Slide 43174 #### **New API routines** API calls to augment data location knowledge of runtime - cudaMemPrefetchAsync(data, length, device, stream) Prefetches data to device (on stream) asynchronously - cudaMemAdvise(data, length, advice, device) Advise about usage of given data, advice: - cudaMemAdviseSetReadMostly: Read-only copy is kept - cudaMemAdviseSetPreferredLocation: Set preferred location to avoid migrations; first access will establish mapping - cudaMemAdviseSetAccessedBy: Data is accessed by this device; will pre-map data to avoid page fault #### **New API routines** API calls to augment data location knowledge of runtime - cudaMemPrefetchAsync(data, length, device, stream) Prefetches data to device (on stream) asynchronously - cudaMemAdvise(data, length, advice, device) Advise about usage of given data, advice: - cudaMemAdviseSetReadMostly: Read-only copy is kept - cudaMemAdviseSetPreferredLocation: Set preferred location to avoid migrations; first access will establish mapping - cudaMemAdviseSetAccessedBy: Data is accessed by this device; will pre-map data to avoid page fault - Use cudaCpuDeviceId for device CPU, or use cudaGetDevice() as usual to retrieve current GPU device id (default: 0) Member of the Helmholtz Association 20 January 2023 Slide 43174 ## **Hints in Code** ``` void sortfile(FILE *fp, int N) { char *data; // ... cudaMallocManaged(&data, N); fread(data. 1. N. fp): cudaMemPrefetchAsync(data, N, device); kernel<<<....>>>(data. N): cudaDeviceSynchronize(): host func(data); cudaFree(data); } ``` ## **Hints in Code** ``` void sortfile(FILE *fp, int N) { char *data; // ... cudaMallocManaged(&data, N); fread(data, 1, N, fp); cudaMemPrefetchAsync(data, N, device); kernel<<<....>>>(data. N): cudaDeviceSynchronize(): host func(data); cudaFree(data); } ``` Prefetch data to avoid expensive GPU page faults ## **Hints in Code** ``` void sortfile(FILE *fp, int N) { char *data: // ... cudaMallocManaged(&data, N); fread(data. 1. N. fp): cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, device); cudaMemPrefetchAsync(data, N, device); kernel<<<....>>>(data. N): cudaDeviceSynchronize(): host func(data); cudaFree(data); } ``` Read-only copy of data is created on GPU during prefetch → CPU and GPU reads will not fault Prefetch data to avoid expensive GPU page faults # **Performance Analysis** **Programming GPUs** ## **GPU Tools** The helpful helpers helping helpless (and others) #### NVIDIA AMD rocProf Profiler for AMD's ROCm stack uProf Analyzer for AMD's CPUs and GPUs # **Nsight Systems** CLI ``` $ nsvs profile --stats=true ./poisson2d 10 # (shortened) CUDA APT Statistics: Total Time (ns) Num Calls Average Minimum Maximum Name 90.9 160,407,572 5.346.919.1 1.780 25.648.117 cuStreamSynchronize CUDA Kernel Statistics: Time(%) Total Time (ns) Instances Average Minimum Maximum Name 10 15,868,661.7 14,525,819 25,652,783 main_106_gpu 100.0 158.686.617 0.0 2.512.0 3,680 main 106 gpu_red 25.120 10 2.304 ``` Member of the Helmholtz Association 20 January 2023 Slide 47174 # **Nsight Systems** GUI # **Nsight Compute** GUI ### , 511.6. 55 Programming GPUs Beyond CUDA # Beyond CUDA: Cooperative Groups **Programming GPUs** # **New Model: Cooperative Groups** Motivation to extend classical model Algorithmic Not all algorithms map easily to available synchronization methods; synchronization should be more flexible Design Make groups of threads explicit entities Hardware Access new hardware features (Independent Thread Scheduling, Thread Block Clusters) → Cooperative Groups (CG) A flexible model for synchronization and communication within groups of threads. # **New Model: Cooperative Groups** - Motivation to extend classical model - Algorithmic Not all algorithms map easily to available synchronization methods; synchronization should be more flexible - Design Make groups of threads explicit entities - Hardware Access new hardware features (Independent Thread Scheduling, Thread Block Clusters) - → Cooperative Groups (CG) A flexible model for synchronization and communication within groups of threads. - All in namespace cooperative\_groups (cooperative\_groups.h header) - Following in text: cooperative\_groups::func() → cg::func() namespace cg = cooperative\_groups; Thread Group # **Common Methods of Cooperative Groups** - Fundamental type: thread\_group - Every CG has following member functions ``` sync() Synchronize the threads of this group (alternative cg::sync(g)) ``` Before: \_\_syncthreads() for whole block thread\_rank() Get unique ID of current thread in this group (local index) Before: threadIdx.x for index in block size() Number of threads in this group Before: blockDim.x for number of threads in block is\_valid() Group is technically ok # Simple Example: Print Rank ``` __device__ void printRank(cg::thread_group g) { printf("Rank %d\n", g.thread_rank()); } __global__ void allPrint() { cg::thread_block b = cg::this_thread_block(); printRank(b); } int main() { allPrint<<<1, 23>>(); } ``` # **Advanced Example: Cooperative Reduce Collective** ``` __shared__ int reduction_s[BLOCKSIZE]; cg::thread_block cta = cg::this_thread_block(); cg::thread_block_tile<32> tile = cg::tiled_partition<32>(cta); const int tid = cta.thread_rank(); int value = A[tid]; reduction_s[tid] = cg::reduce(tile, value, cg::plus<int>()); // reduction_s contains tile-sum at all positions associated to tile cg::sync(cta); // Still to do: sum partial tile sums ``` **Programming GPUs** Beyond CUDA: MPI • Modern compute nodes: multiple GPUs per node Modern compute nodes: multiple GPUs per node - Modern compute nodes: multiple GPUs per node - HPC: multiple nodes - Modern compute nodes: multiple GPUs per node - HPC: multiple nodes - Modern compute nodes: multiple GPUs per node - HPC: multiple nodes - Technology for distribution: MPI - MPI also for multi-GPU computing! - Modern compute nodes: multiple GPUs per node - HPC: multiple nodes - Technology for distribution: MPI - MPI also for multi-GPU computing! - Important: Direct GPU-to-GPU memory transfers, no intermediate transfer to CPU - Modern compute nodes: multiple GPUs per node - HPC: multiple nodes - Technology for distribution: MPI - MPI also for multi-GPU computing! - Important: Direct GPU-to-GPU memory transfers, no intermediate transfer to CPU - Modern MPIs can be GPU-aware and do the right thing # MPI Sketch (Pseudo-C) ``` #include <mpi.h> int main(int argc, char *argv[]) { int rank.size: // Tnit MPT MPI Init(&argc. &argv): // Get current rank ID and total number of ranks */ MPI Comm rank(MPI COMM WORLD, &rank); MPI Comm size(MPI COMM WORLD. &size): // Call routines cudaMalloc(&buffer. n*sizeof(double)); computeKernel<<<dim grid,dim block>>>(buffer); MPI Sendrecv(buffer, n, MPI REAL TYPE, top, 0, buffer+n, n, MPI REAL TYPE, bottom, 0, → MPI COMM WORLD. MPI STATUS IGNORE): // Shutdown MPI Finalize(): return 0: ``` # **Beyond CUDA: Thrust** **Programming GPUs** ### **Thrust** #### Iterators! Iterators everywhere! 🚀 - Template library - A precursor to a GPU-accelerated pSTL? - Based on iterators - Data-parallel primitives (scan(), sort(), reduce(),...) - Fully compatible with plain CUDA C (comes with CUDA Toolkit) - Great with [](){} lambdas! - → http://thrust.github.io/ http://docs.nvidia.com/cuda/thrust/ ### **Thrust** #### Code example ``` int a = 42: int n = 10: thrust::host vector<float> x(n), v(n); // fill x, y thrust::device vector d x = x. d v = v: thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), [=] device (auto x. auto v) {return a*x+v:}); // or: using namespace thrust::placeholders; thrust::transform(d x.begin(), d x.end(), d y.begin(), d y.begin(), a * 1 + \rightarrow 2); x = d x: ``` ### **Standard Parallelism** - By now, GPUs (and other accelerators) ubiquitous; around for long time - Dedicated, custom parallelism concepts move into standards of languages C++ Parallel STL since C++17 (2017) Fortran do concurrent - Both allow for execution on GPU - Programmer identifies, exposes parallel code; compiler generates GPU-capable binary - Compiler: NVHPC best, but also Intel oneDPL and others Member of the Helmholtz Association 20 January 2023 Slide 63174 SUPERCOMPUT # pSTL Standard Parallelism Example # Beyond CUDA: HIP **Programming GPUs** Current fastest supercomputer: Frontier at Oak Ridge (USA) with 38 000 AMD MI250X GPUs – 1.102 EFLOP/s; also most energy-efficient! - Current fastest supercomputer: Frontier at Oak Ridge (USA) with 38 000 AMD MI250X GPUs – 1.102 EFLOP/s; also most energy-efficient! - 2023: Aurora at Argonne with > 60 000 Intel Ponte Vecchio GPUs > 2 EFLOP/s - 2023: El Capitan at Lawrence Livermore with AMD MI300 GPUs -> 2 EFLOP/s - Current fastest supercomputer: Frontier at Oak Ridge (USA) with 38 000 AMD MI250X GPUs – 1.102 EFLOP/s; also most energy-efficient! - 2023: Aurora at Argonne with > 60 000 Intel Ponte Vecchio GPUs - > 2 EFLOP/s - 2023: El Capitan at Lawrence Livermore with AMD MI300 GPUs -> 2 EFLOP/s - 2024: JUPITER at JSC -> 1 EFLOP/s! GPUs, details TBD - Current fastest supercomputer: Frontier at Oak Ridge (USA) with 38 000 AMD MI250X GPUs – 1.102 EFLOP/s; also most energy-efficient! - 2023: Aurora at Argonne with > 60 000 Intel Ponte Vecchio GPUs - > 2 EFLOP/s - 2023: El Capitan at Lawrence Livermore with AMD MI300 GPUs -> 2 EFLOP/s - 2024: JUPITER at JSC -> 1 EFLOP/s! GPUs, details TBD #### **AMD GPUs: HIP** - HIP: AMD's framework to utilize HPC GPUs - Heterogeneous Interface for Portability - Similar to CUDA, very similar sed -i 's/cuda/hip/' - Can be compiled to run on NVIDIA GPUs (with CUDA) or AMD GPUs (ROCm) - Includes C++ runtime API, kernel language; CUDA conversion tools - Open Source - Very similar performance on NVIDIA GPUs like CUDA ``` HIP_PLATFORM=amd hipcc --offload-arch=gfx90a -std=c++14 -o daxpy daxpy.cpp ``` #### **HIP SAXPY** ``` #include <cuda.h> __global__ void saxpy_cuda(int n, float a, float * x, float * y) { int i = blockIdx.x * blockDim.x + threadIdx.x: if (i < n) v[i] = a * x[i] + v[i]: int a = 42: int n = 10: float x[n], y[n]; // fill x, v cudaMallocManaged(&x. n * sizeof(float)); cudaMallocManaged(&y, n * sizeof(float)); saxpy cuda<<<2, 5>>>(n, a, x, y); ``` cudaDeviceSvnchronize(): #### HIP SAXPY ``` #include "hip/hip runtime.h" __global__ void saxpy_hip (int n, float a, float * x, float * y) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) v[i] = a * x[i] + v[i]: int a = 42: int n = 10: float x[n], y[n]; // fill x, v hipMallocManaged(&x. n * sizeof(float)); hipMallocManaged(&v, n * sizeof(float)); saxpy hip <<<2, 5>>>(n, a, x, y); ``` Slide 69174 hipDeviceSvnchronize(): **Programming GPUs** Beyond CUDA: SYCL # SYCL / DPC++ - oneAPI: Intel's framework to utilize HPC GPUs and other parallel processors - Large, open-source-ish ecosystem - oneAPI: Umbrella name for programming models and libraries, open; also a "specification" - DPC++: Data-Parallel C++; language built on C++ to target parallel devices, implements SYCL and prototypes extensions - SYCL: C++17-based model to target parallel devices, by Khronos group, open - Intel oneAPI DPC++/C++ Compiler: New LLVM-based Intel compiler to compile DPC++ - oneMKL, oneDNN, ...: Specific libraries for domains, some open - oneAPI DPC++ Library (oneDPL): DPC++-accompanying library with algorithms etc. - Programming with iterators, lambdas, queues, views - Since OSS: Not only for Intel GPUs but also AMD, NVIDIA backends - Higher Level: Might even give better performance then legacy CUDA - $\rightarrow$ github.com/oneapi-src # **DPC++ Example** ``` int a = 42: int n = 10: std::vector x(N), v(N); // fill x, y sycl::queue q(sycl::gpu_selector{}); sycl::buffer<float, 1> d_x { x.data(), sycl::range<1>(x.size())}, d_y...; g.submit([8] (svcl::handler8 h) { auto x access = d x.get access<svcl::access::mode::read> (h); auto y access = d y.get access<sycl::access::mode::read write> (h); h.parallel_for<class axpy>( sycl::range<1>{length}, [=] (sycl::id<1> it) { auto i = it.get id(0): v access[i] += a * x access[i] + y access[i]; }): ``` **Programming GPUs** **Beyond CUDA: MORE MODELS!!1** # State-of-the-Art GPU Programming Models - GPU programming not only programming with CUDA anymore - Much more, and CUDA only one solution Full vendor support - New GPU vendors in the game now hungry for a piece of the cake - Many models, most offer translation from CUDA Comprehensive support, but not by vendor See appendix for details or doi:10.34732/xdvblg-r1bvif bind your way through it or directly link the libraries Member of the Helmholtz Association 20 January 2023 Slide 73174 ### **Conclusions** - GPUs achieve performance by specialized hardware - Acceleration can be done by different means - Libraries are the easiest - OpenACC can give first entry point - Full power with CUDA - Threads, Blocks to expose parallelism for a kernel - Several API routines exist - Cooperative Groups: new entry point - Beyond CUDA: Thrust, pSTL, HIP, SYCL, Kokkos, ... ### **Conclusions** - GPUs achieve performance by specialized hardware - Acceleration can be done by different means - Libraries are the easiest - OpenACC can give first entry point - Full power with CUDA - Threads, Blocks to expose parallelism for a kernel - Several API routines exist - Cooperative Groups: new entry point - Beyond CUDA: Thrust, pSTL, HIP, SYCL, Kokkos, ... # Appendix Appendix GPU Model/Vendor Compatibility Table References Glossary # Appendix **GPU Model/Vendor Compatibility Table** ### **GPU Programming Models: Table** Full vendor support Intel - Indirect, but comprehensive support, by vendor - Vendor support, but not (yet) entirely comprehensive - Comprehensive support, but not by vendor - Limited, probably indirect support - but at least some - No direct support available, but of course one could ISO-C-bind your way through it or directly link the libraries C++ C++ (sometimes also C) Fortran Fortran Fortran 24 **CUDA** HIP SYCL OpenACC OpenMP Fortran Fortran Fortran C++ C++ C++C++ Fortran C++ NVIDIA 12 \_\_\_11 \_\_\_14 /6 16 AMD -18 /19 20 /21 /6 Intel Standard Kokkos ALPAKA C++Fortran Fortran C++Fortran Python C++**A** 27 **29** /30 26 **3**1 NVIDIA 32 /33 **34 35** /30 AMD 41 38 39 40 /30 # **GPU Programming Models: Footnotes I** - 1: CUDA C/C++ is supported on NVIDIA GPUs through the CUDA Toolkit - 2: CUDA Fortran, a proprietary Fortran extension, is supported on NVIDIA GPUs via the NVIDIA HPC SDK - 3: HIP programs can directly use NVIDIA GPUs via a CUDA backend; HIP is maintained by AMD - 4: No such thing like HIP for Fortran, but AMD offers Fortran interfaces to HIP and ROCm libraries in hipfort - 5: SYCL can be used on NVIDIA GPUs with experimental support either in SYCL directly or in DPC++, or via hipSYCL - 6: No such thing like SYCL for Fortran - 7: OpenACC C/C++ supported on NVIDIA GPUs directly (and best) through NVIDIA HPC SDK; additional, somewhat limited support by GCC C compiler and in LLVM through Clacc - 8: OpenACC Fortran supported on NVIDIA GPUs directly (and best) through NVIDIA HPC SDK; additional, somewhat limited support by GCC Fortran compiler and Flacc - 9: OpenMP in C++ supported on NVIDIA GPUs through NVIDIA HPC SDK (albeit with a few limits), by GCC, and Clang; see OpenMP ECP BoF on status in 2022. - 10: OpenMP in Fortran supported on NVIDIA GPUs through NVIDIA HPC SDK (but not full OpenMP feature set available), by GCC, and Flang - 25: pSTL features supported on NVIDIA GPUs through NVIDIA HPC SDK - 26: Standard Language parallel features supported on NVIDIA GPUs through NVIDIA HPC SDK - 27: Kokkos supports NVIDIA GPUs by calling CUDA as part of the compilation process - 28: Kokkos is a C++ model, but an official compatibility layer (Fortran Language Compatibility Layer, FLCL) is available. Member of the Helmholtz Association 20 January 2023 Slide 5120 # **GPU Programming Models: Footnotes II** - 29: Alpaka supports NVIDIA GPUs by calling CUDA as part of the compilation process; also, an OpenMP backend can be used - 30: Alpaka is a C++ model - 31: There is a vast community of offloading Python code to NVIDIA GPUs, like CuPy, Numba, cuNumeric, and many others; NVIDIA actively supports a lot of them, but has no direct product like CUDA for Python; so, the status is somewhere in between - 11: hipify by AMD can translate CUDA calls to HIP calls which runs natively on AMD GPUs - 12: AMD offers a Source-to-Source translator to convert some CUDA Fortran functionality to OpenMP for AMD GPUs (gpufort); in addition, there are ROCm library bindings for Fortran in hipfort OpenACC/CUDA Fortran Source-to-Source translator - 13: HIP is the preferred native programming model for AMD GPUs - 14: SYCL can use AMD GPUs, for example with hipSYCL or DPC++ for HIP AMD - 15: OpenACC C/C++ can be used on AMD GPUs via GCC or Clacc; also, Intel's OpenACC to OpenMP Source-to-Source translator can be used to generate OpenMP directives from OpenACC directives - 16: OpenACC Fortran can be used on AMD GPUs via GCC; also, AMD's gpufort Source-to-Source translator can move OpenACC Fortran code to OpenMP Fortran code, and also Intel's translator can work - 17: AMD offers a dedicated, Clang-based compiler for using OpenMP on AMD GPUs: AOMP; it supports both C/C++ (Clang) and Fortran (Flang, example) Member of the Helmholtz Association 20 January 2023 Slide 6120 # **GPU Programming Models: Footnotes III** - 32: Intel's DPC++ (oneAPI) can be compiled with an experimental HIP AMD backend, allowing to launch STL algorithms to AMD GPUs; caveats from Intel's STL support apply - 33: Currently, no (known) way to launch Standard-based parallel algorithms on AMD GPUs - 34: Kokkos supports AMD GPUs through HIP - 35: Alpaka supports AMD GPUs through HIP or through an OpenMP backend - 36: AMD does not officially support GPU programming with Python (also not semi-officially like NVIDIA), but third-party support is available, for example through Numba (currently inactive) or a HIP version of CuPy - 18: SYCLomatic translates CUDA code to SYCL code, allowing it to run on Intel GPUs; also, Intel's DPC++ Compatibility Tool can transform CUDA to SYCL - 19: No direct support, only via ISO C bindings, but at least an example can be found on GitHub; it's pretty scarce and not by Intel itself, though - 20: CHIP-SPV supports mapping CUDA and HIP to OpenCL and Intel's Level Zero, making it run on Intel GPUs - 21: No such thing like HIP for Fortran - 22: SYCL is the prime programming model for Intel GPUs; actually, SYCL is only a standard, while Intel's implementation of it is called DPC++ (Data Parallel C++), which extends the SYCL standard in various places; actually actually, Intel namespaces everything oneAPI these days, so the full proper name is Intel oneAPI DPC++ (which incorporates a C++ compiler and also a library) - 23: OpenACC can be used on Intel GPUs by translating the code to OpenMP with Intel's Source-to-Source translator Member of the Helmholtz Association 20 January 2023 Slide 7120 # **GPU Programming Models: Footnotes IV** - 24: Intel has extensive support for OpenMP through their latest compilers - 37: Intel supports pSTL algorithms through their DPC++ Library (oneDPL; GitHub). It's heavily namespaced and not yet on the same level as NVIDIA - 38: With Intel oneAPI 2022.3, Intel supports DO CONCURRENT with GPU offloading - 39: Kokkos supports Intel GPUs through SYCL - 40: Alpaka v0.9.0 introduces experimental SYCL support; also, Alpaka can use OpenMP backends - 41: Not a lot of support available at the moment, but notably DPNP, a SYCL-based drop-in replacement for Numpy, and numba-dpex, an extension of Numba for DPC++. **Appendix** References #### References I - [2] Kenneth E. Hoff III et al. "Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware." In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH '99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp. 277–286. ISBN: 0-201-48560-5. DOI: 10.1145/311535.311567. URL: http://dx.doi.org/10.1145/311535.311567 (pages 3–9). - [3] Chris McClanahan. "History and Evolution of GPU Architecture." In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf (pages 3-9). - [4] Jack Dongarra et al. *TOP500*. Nov. 2016. URL: https://www.top500.org/lists/2016/11/(pages 3-9). #### References II - [5] Jack Dongarra et al. Green500. Nov. 2016. URL: https://www.top500.org/green500/lists/2016/11/(pages 3-9). - [6] Karl Rupp. Pictures: CPU/GPU Performance Comparison, URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardwarecharacteristics-over-time/(pages 13, 14). - Wes Breazell, Picture: Wizard, URL: [13] https://thenounproject.com/wes13/collection/its-a-wizards-world/ (pages 46-50). # References: Images, Graphics I - [1] Héctor J. Rivas. *Color Reels*. Freely available at Unsplash. URL: https://unsplash.com/photos/87hFrPk3V-s. - [7] Forschungszentrum Jülich GmbH (Ralf-Uwe Limbach). JUWELS Booster. - [8] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/ (pages 19, 20). - [9] Shearings Holidays. *Picture: Shearings coach 636*. URL: https://www.flickr.com/photos/shearings/13583388025/(pages 19, 20). - [10] Nvidia Corporation. *Pictures: Volta GPU*. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf. # References: Images, Graphics II - [11] Nvidia Corporation. *Pictures: Ampere GPU*. Ampere Architecture Whitepaper. URL: http://www.nvidia.com/nvidia-ampere-architecture-whitepaper (pages 35–37). - [12] Nvidia Corporation. Pictures: Hopper GPU. Nvidia Developer Technical Blog: NVIDIA Hopper Architecture In-Depth. URL: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/. - [14] OLCF at ORNL. *Picture: Frontier*. Flickr. URL: https://www.flickr.com/photos/olcf/52117623843/. # Appendix Glossary # Glossary I - AMD Manufacturer of CPUs and GPUs. 3, 4, 5, 6, 7, 8, 9, 60, 61, 62, 63, 64, 65, 198, 199 - Ampere GPU architecture from NVIDIA (announced 2019). 16, 17 - API A programmatic interface to software by well-defined functions. Short for application programming interface. 60, 61, 62, 63, 64, 65, 199 - ATI Canada-based GPUs manufacturing company; bought by AMD in 2006. 3, 4, 5, 6, 7, 8, 9 - CUDA Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 2, 3, 4, 5, 6, 7, 8, 9, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 94, 95, 96, 101, 102, 103, 104, 105, 164, 181, 182, 199 # **Glossary II** HIP GPU programming model by AMD to target their own and NVIDIA GPUs with one combined language. Short for Heterogeneous-compute Interface for Portability. 60, 61, 62, 63, 64, 65 JUWELS Jülich's new supercomputer, the successor of JUQUEEN. 15, 16, 17 NVIDIA US technology company creating GPUs. 3, 4, 5, 6, 7, 8, 9, 15, 16, 17, 35, 36, 37, 60, 61, 62, 63, 64, 65, 137, 197, 198, 200 NVLink NVIDIA's communication protocol connecting CPU $\leftrightarrow$ GPU and GPU $\leftrightarrow$ GPU with high bandwidth. 200 OpenACC Directive-based programming, primarily for many-core machines. 55, 56, 57, 58, 59, 181, 182 # **Glossary III** - OpenCL The *Open Computing Language*. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 3, 4, 5, 6, 7, 8, 9, 60, 61, 62, 63, 64, 65 - OpenGL The *Open Graphics Library*, an API for rendering graphics across different hardware architectures. 3, 4, 5, 6, 7, 8, 9 - OpenMP Directive-based programming, primarily for multi-threaded machines. 55, 56, 57, 58, 59 - ROCm AMD software stack and platform to program AMD GPUs. Short for Radeon Open Compute (*Radeon* is the GPU product line of AMD). 60, 61, 62, 63, 64, 65 - SAXPY Single-precision $A \times X + Y$ . A simple code example of scaling a vector and adding an offset. 43, 67, 174, 175 # Glossary IV - Tesla The GPU product line for general purpose computing computing of NVIDIA. 15, 101, 102, 103, 104, 105 - Thrust A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/.164 - V100 A large GPU with the Volta architecture from NVIDIA. It employs NVLink 2 as its interconnect and has fast HBM2 memory. Additionally, it features Tensorcores for Deep Learning and Independent Thread Scheduling, 101, 102, 103, 104, 105 - Volta GPU architecture from NVIDIA (announced 2017), 200 - CG Cooperative Groups, 143, 144, 150 # **Glossary V** - CPU Central Processing Unit. 15, 19, 20, 21, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 43, 60, 61, 62, 63, 64, 65, 107, 108, 109, 115, 116, 117, 118, 119, 127, 128, 129, 130, 131, 132, 135, 197, 198, 199 - GPU Graphics Processing Unit. 2, 3, 4, 5, 6, 7, 8, 9, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 42, 46, 47, 48, 49, 50, 51, 54, 55, 56, 57, 60, 61, 62, 63, 64, 65, 66, 77, 78, 79, 94, 95, 96, 101, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112, 115, 116, 117, 118, 119, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 141, 142, 153, 163, 168, 176, 179, 181, 182, 197, 198, 199, 200 - SIMD Single Instruction, Multiple Data. 28, 29, 30, 31, 32, 33, 34, 35, 36, 37 - SIMT Single Instruction, Multiple Threads. 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37 # **Glossary VI** SM Streaming Multiprocessor. 28, 29, 30, 31, 32, 33, 34, 35, 36, 37 SMT Simultaneous Multithreading. 28, 29, 30, 31, 32, 33, 34, 35, 36, 37