

# GPU ACCELERATORS AT JSC JÜLICH CHALLENGES HACKATHON

18 March 2021 | Andreas Herten | Forschungszentrum Jülich



# Outline

- GPUs at JSC
  - JUWELS
  - JURECA DC
- GPU Architecture
  - Empirical Motivation
  - Comparisons
  - GPU Architecture
- Programming GPUs
  - Libraries
  - Directives
  - CUDA C/C++
  - Performance Analysis
  - Advanced Topics
- Using GPUs on JSC Systems
  - Compiling
  - Resource Allocation





## JUWELS Cluster - Jülich's Scalable System

- 2500 nodes with Intel Xeon CPUs (2 × 24 cores)
- 46 + 10 nodes with 4 NVIDIA Tesla V100 cards (32 GB memory)
- 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #44)





# **JUWELS** Booster – Scaling Higher!

- 936 nodes with AMD EPYC Rome CPUs (2 × 24 cores)
- Each with 4 NVIDIA A100 Ampere GPUs (each: FP64TC: 19.5 TFLOP/s)
- InfiniBand DragonFly+ HDR-200 network; 4 × 200 Gbit/s per node
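
As a rough cross-check of the numbers above, the aggregate GPU peak of the Booster follows from multiplying node count, GPUs per node, and the per-GPU FP64 Tensor Core peak. This is only a theoretical upper bound derived from the listed figures; sustained performance is lower.

$$936 \times 4 \times 19.5\,\mathrm{TFLOP/s} = 73\,008\,\mathrm{TFLOP/s} \approx 73\,\mathrm{PFLOP/s}$$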



Member of the Helmholtz Association





# **Top500 List Nov 2020:**

- #1 Europe
- #7 World
- #3\* Green500






# JURECA DC – Multi-Purpose

- 768 nodes with AMD EPYC Rome CPUs (2 × 64 cores)
- 192 nodes with 4 NVIDIA A100 Ampere GPUs
- InfiniBand DragonFly+ HDR-100 network
- Also: JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing



# **GPU** Architecture

# **Status Quo Across Architectures**

#### Performance

[Figure: CPU/GPU performance comparison; graphic: Rupp [2]]

#### Memory Bandwidth

[Figure: CPU/GPU memory bandwidth comparison; graphic: Rupp [2]]



## A matter of specialties

- Transporting one (picture: motorcycle)
- Transporting many (picture: coach)

Graphics: Lee [3] and Shearings Holidays [4]







## GPU optimized to hide latency

- Memory
  - GPU has small (40 GB), but high-speed memory (1555 GB/s)
  - Stage data to GPU memory: via PCIe 4 bus (32 GB/s)
  - Stage automatically (*Unified Memory*), or manually
- Two engines: Overlap compute and copy

Device memory by GPU generation:

- V100: 32 GB RAM, 900 GB/s
- A100: 40 GB RAM, 1555 GB/s
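
To get a feeling for why staging data matters, here is a back-of-the-envelope sketch that uses only the numbers quoted above (40 GB, 1555 GB/s, 32 GB/s). The function names are made up for this illustration, and the rates are peak values that real transfers rarely reach.

```cpp
#include <cassert>

// Time in seconds to move `gigabytes` of data at `gb_per_s`
// (peak rate, no latency or protocol overhead).
double transfer_seconds(double gigabytes, double gb_per_s) {
    assert(gigabytes >= 0.0 && gb_per_s > 0.0);
    return gigabytes / gb_per_s;
}

// How many times faster one link is than another.
double bandwidth_ratio(double fast_gb_per_s, double slow_gb_per_s) {
    assert(fast_gb_per_s > 0.0 && slow_gb_per_s > 0.0);
    return fast_gb_per_s / slow_gb_per_s;
}
```

With the values from the slide, `transfer_seconds(40.0, 32.0)` gives 1.25 s to fill the A100 memory over PCIe 4, while `transfer_seconds(40.0, 1555.0)` gives roughly 0.026 s to stream the same amount from device memory; `bandwidth_ratio(1555.0, 32.0)` is about 49. That gap is the reason to keep data on the device and to overlap copies with computation where possible.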



#### $SIMT = SIMD \oplus SMT$

- CPU:
  - Single Instruction, Multiple Data (SIMD)
  - Simultaneous Multithreading (SMT)
- GPU: Single Instruction, Multiple Threads (SIMT)
  - CPU core ≈ GPU multiprocessor (SM)
  - Working unit: set of threads (32, a warp)
  - Fast switching of threads (large register file)
  - Branching (`if`) is costly: threads of a warp that take different paths execute them one after another
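
The branching remark in the SIMT list can be made concrete with a small host-side emulation. This is a simplified model written for this text, not how the hardware is implemented: it assumes a warp of 32 lanes that share one instruction stream, so every side of a branch that has at least one active lane costs one pass. The names `run_branch` and `kWarpSize` are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp, as stated above

// Emulates one warp executing `if (pred) y = a; else y = b;`.
// Returns the number of passes: 1 for a uniform branch, 2 for a divergent one.
int run_branch(const std::array<bool, kWarpSize>& pred,
               std::array<int, kWarpSize>& y, int a, int b) {
    bool any_then = false, any_else = false;
    for (bool p : pred) {
        any_then = any_then || p;
        any_else = any_else || !p;
    }

    int passes = 0;
    if (any_then) {  // pass over the "then" side; other lanes are masked out
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (pred[lane]) y[lane] = a;
        ++passes;
    }
    if (any_else) {  // pass over the "else" side for the remaining lanes
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (!pred[lane]) y[lane] = b;
        ++passes;
    }
    assert(passes == 1 || passes == 2);
    return passes;
}
```

If all 32 predicates agree, the warp needs one pass; if even a single lane disagrees, it needs two. Arranging data so that the threads of a warp take the same path avoids that extra cost.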







#### Let's summarize this!



# CPU: Optimized for low latency

- + Large main memory
- + Fast clock rate
- + Large caches
- + Branch prediction
- + Powerful ALU
- − Relatively low memory bandwidth
- − Cache misses costly
- − Low performance per watt

# GPU: Optimized for high throughput

- + High bandwidth main memory
- + Latency tolerant (parallelism)
- + More compute resources
- + High performance per watt
- − Limited memory capacity
- − Low per-thread performance
- − Extension card



# Programming GPUs

# **Preface: CPU**

## A simple CPU program!

SAXPY: $\vec{y} = a\vec{x} + \vec{y}$, with single precision

Part of LAPACK BLAS Level 1

```
void saxpy(int n, float a, float * x, float * y) {
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

saxpy(n, a, x, y);
```

# **Summary of Acceleration Possibilities**






# **Libraries**

## Programming GPUs is easy: Just don't!

Use applications & libraries

[Figure: logos of GPU-enabled applications and libraries, among them Thrust, Numba, and Theano]

Wizard: Breazell [5]


#### **cuBLAS**

#### Parallel algebra



- GPU-parallel BLAS (all 152 routines)
- Single, double, complex data types
- Constant competition with Intel's MKL
- Multi-GPU support
- → https://developer.nvidia.com/cublas http://docs.nvidia.com/cuda/cublas



### **cuBLAS**

#### Code example

```
float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

cublasHandle_t handle;
cublasCreate(&handle);                           // Create cuBLAS handle

float * d_x, * d_y;
cudaMallocManaged(&d_x, n * sizeof(x[0]));       // Allocate GPU memory
cudaMallocManaged(&d_y, n * sizeof(y[0]));

cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);  // Copy data to GPU
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

// Call BLAS routine (handle-based API: takes the handle and a pointer to a)
cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);  // Copy result to host

cudaFree(d_x); cudaFree(d_y);                    // Finalize
cublasDestroy(handle);
```

**Directives** 

**Programming GPUs** 

## **GPU** Programming with Directives

#### Keepin' you portable

Annotate serial source code by directives

```
#pragma acc loop
for (int i = 0; i < 1; i++) {};
```

- OpenACC: Especially for GPUs; OpenMP: Has GPU support
- Compiler interprets directives, generates the corresponding instructions

#### Pro

- Portability
  - Other compiler? No problem! To it, it's a serial program
  - Different target architectures from same code
- Easy to program

#### Con

- Compiler support still growing
- Not all the raw power available
- Harder to debug
- Easy to program wrong


## **OpenACC**

#### Code example

```
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

saxpy_acc(n, a, x, y);
```

## **OpenMP**

#### Code example

```
void saxpy_acc(int n, float a, float * x, float * y) {
  // Offload the loop to the GPU: copy x in, copy y in and back out
  #pragma omp target teams loop map(to:x[0:n]) map(tofrom:y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

saxpy_acc(n, a, x, y);
```

## CUDA C/C++

**Programming GPUs** 

#### Finally...

**OpenCL**: Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, ...), 2009

- Platform: Programming language (OpenCL C/C++), API, and compiler
- Targets CPUs, GPUs, FPGAs, and other many-core machines
- Fully open source

**CUDA**: NVIDIA's GPU platform, 2007

- Platform: Drivers, programming language (CUDA C/C++), API, compiler, tools, ...
- Only NVIDIA GPUs
- Compilation with `nvcc` (free, but not open); `clang` has CUDA support, but CUDA needed for last step
- Also: CUDA Fortran; and more in NVIDIA HPC SDK

**HIP**: AMD's new unified programming model for AMD (via ROCm) and NVIDIA GPUs, 2016+

In general:

- Choose what flavor you like, what colleagues/collaboration is using
- Hardest: Come up with parallelized algorithm



In software: Threads, Blocks

- Methods to exploit parallelism:
  - Threads → Block
  - Blocks → Grid
  - Threads & blocks in 3D
- Parallel function: kernel
  - `__global__ void kernel(int a, float * b) { }`
  - Access own ID by global variables `threadIdx.x`, `blockIdx.y`, ...
- Execution entity: threads
  - Lightweight → fast switching!
  - 1000s of threads execute simultaneously → order non-deterministic!
#### **CUDA SAXPY**

With runtime-managed data transfers

```
// Specify kernel
__global__ void saxpy_cuda(int n, float a, float * x, float * y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // ID variables
  if (i < n)                                      // Guard against too many threads
    y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float * x, * y;
cudaMallocManaged(&x, n * sizeof(float));         // Allocate GPU-capable memory
cudaMallocManaged(&y, n * sizeof(float));
// fill x, y

saxpy_cuda<<<2, 5>>>(n, a, x, y);                 // Call kernel: 2 blocks, each 5 threads

cudaDeviceSynchronize();                          // Wait for kernel to finish
```
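
The launch above hard-codes 2 blocks of 5 threads for `n = 10`. For other problem sizes the block count is usually derived from `n`; a minimal sketch of that arithmetic follows, with hypothetical helper names and the common (but not mandatory) choice of rounding up.

```cpp
#include <cassert>

// Smallest number of blocks such that num_blocks * threads_per_block >= n
// (integer ceiling division).
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Threads launched beyond n; these are the ones the `if (i < n)` guard filters out.
int surplus_threads(int n, int threads_per_block) {
    return blocks_for(n, threads_per_block) * threads_per_block - n;
}
```

`blocks_for(10, 5)` reproduces the 2 blocks used in the example, with no surplus threads. For `n = 1000` and 256 threads per block, 4 blocks are needed and 24 threads fall past the end of the arrays, which is exactly the case the guard in the kernel protects against.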

**Programming GPUs** 

**Performance Analysis** 

#### **GPU Tools**

The helpful helpers helping helpless (and others)

#### NVIDIA

- Nsight Systems: GPU program profiler with timeline
- Nsight Compute: GPU kernel profiler

OpenCL: CodeXL (Open Source, GPUOpen/AMD) – debugging, profiling.



## **Nsight Systems**

CLI

```
$ nsys profile --stats=true ./poisson2d 10 # (shortened)

CUDA API Statistics:

 Time(%)  Total Time (ns)  Num Calls       Average     Minimum     Maximum  Name
    90.9      160,407,572         30   5,346,919.1       1,780  25,648,117  cuStreamSynchronize

CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances       Average     Minimum     Maximum  Name
   100.0      158,686,617         10  15,868,661.7  14,525,819  25,652,783  main_106_gpu
     0.0           25,120         10       2,512.0       2,304       3,680  main_106_gpu_red
```



## **Nsight Systems**

GUI



## **Nsight Compute**

GUI



## **Advanced Topics**

So many more interesting things to show!

- Optimize memory transfers to reduce overhead
- Optimize applications for GPU architecture
- Drop-in BLAS acceleration with NVBLAS (\$LD\_PRELOAD)
- Tensor Cores for Deep Learning
- Libraries, Abstractions: Kokkos, Alpaka, Futhark, HIP, SYCL, C++AMP, C++ pSTL, ...
- Use multiple GPUs
  - On one node
  - Across many nodes → MPI
- ...
- Some of that: Addressed at dedicated training courses

# Using GPUs on JSC Systems

#### Compiling

#### CUDA

- Module: `module load CUDA/11.0`
- Compile: `nvcc file.cu`
  - Default host compiler: `g++`; use `nvcc_pgc++` for PGI compiler
- Example cuBLAS: `g++ file.cpp -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcublas -lcudart`

#### OpenACC

- Module: `module load NVHPC/20.9-GCC-9.3.0`
- Compile: `nvc++ -acc=gpu file.cpp`

#### MPI

CUDA-aware MPIs (with direct Device-Device transfers)

- ParaStationMPI: `module load ParaStationMPI/5.4.7-1 mpi-settings/CUDA`
- OpenMPI: `module load OpenMPI/4.1.0rc1 mpi-settings/CUDA`

# Running

Dedicated GPU partitions

#### JUWELS

```
--partition=gpus          46 nodes (Job limits: ≤ 1 d)
--partition=develgpus     10 nodes (Job limits: ≤ 2 h, ≤ 2 nodes)
```

#### **JUWELS Booster**

```
--partition=booster       926 nodes
--partition=develbooster  10 nodes (Job limits: ≤ 1 d, ≤ 2 nodes)
```

#### JURECA DC

```
--partition=dc-gpu        192 nodes
--partition=dc-gpu-devel  ?? nodes
```

- Needed: Resource configuration with --gres=gpu:4
- → See online documentation



# Running

#### **JUWELS Booster Topology**

- JUWELS Booster: NPS-4 (in total: 8 NUMA Domains)
- Not all have GPU or HCA affinity!
- Network is structured into two levels: In-Cell and Inter-Cell (DragonFly+ network)



→ Documentation: apps.fz-juelich.de/jsc/hps/juwels/



### **Example**

- 16 tasks in total, running on 4 nodes
- Per node: 4 GPUs

```
#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4

srun ./gpu-prog
```
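
With 16 tasks on 4 nodes and 4 GPUs per node, each task typically works with one GPU. One common way to arrive at that pairing is sketched below. It rests on two assumptions that the slides do not state: ranks are distributed block-wise over the nodes, and each task selects the GPU matching its node-local rank. `Placement` and `place_task` are hypothetical names.

```cpp
#include <cassert>

struct Placement {
    int node;  // index of the node the task runs on (0-based)
    int gpu;   // index of the GPU on that node (0-based)
};

// Block-wise placement: consecutive ranks fill one node before the next,
// and the node-local rank selects the GPU.
Placement place_task(int rank, int tasks_per_node, int gpus_per_node) {
    assert(rank >= 0 && tasks_per_node > 0 && gpus_per_node > 0);
    int local_rank = rank % tasks_per_node;
    return Placement{rank / tasks_per_node, local_rank % gpus_per_node};
}
```

For the job above, `place_task(5, 4, 4)` yields node 1, GPU 1, and `place_task(15, 4, 4)` yields node 3, GPU 3. In a real application the local rank would come from the environment (Slurm sets `SLURM_LOCALID`) and the device would be chosen with `cudaSetDevice`; how GPUs are actually exposed to each task depends on the system configuration, so the online documentation remains the reference.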



# Conclusion

- GPUs provide highly-parallel computing power
- We have many devices installed at JSC, ready to be used!
- Training courses by JSC
  - CUDA Course April 2021
  - OpenACC Course October 2021
- Generally: see online documentation and sc@fz-juelich.de
- Further consultation via our lab: NVIDIA Application Lab in Jülich; contact me!





# Appendix

- Glossary
- References



#### Glossary I

- **AMD**: Manufacturer of CPUs and GPUs.
- **Ampere**: GPU architecture from NVIDIA (announced 2020).
- **API**: A programmatic interface to software by well-defined functions. Short for application programming interface.
- **CUDA**: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.
- **HIP**: GPU programming model by AMD to target their own and NVIDIA GPUs with one combined language. Short for Heterogeneous-compute Interface for Portability.



#### Glossary II

- **JSC**: Jülich Supercomputing Centre, the supercomputing institute of Forschungszentrum Jülich, Germany.
- **JURECA**: A multi-purpose supercomputer at JSC.
- **JUWELS**: Jülich's new supercomputer, the successor of JUQUEEN.
- **MPI**: The Message Passing Interface, an API definition for multi-node computing.
- **NVIDIA**: US technology company creating GPUs.
- **OpenACC**: Directive-based programming, primarily for many-core machines.



# **Glossary III**

- **OpenCL**: The *Open Computing Language*. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.
- **OpenMP**: Directive-based programming, primarily for multi-threaded machines.
- **ROCm**: AMD software stack and platform to program AMD GPUs. Short for Radeon Open Compute (*Radeon* is the GPU product line of AMD).
- **SAXPY**: Single-precision $A \times X + Y$. A simple code example of scaling a vector and adding an offset.
- **Tesla**: The GPU product line for general purpose computing of NVIDIA.



### **Glossary IV**

- **CPU**: Central Processing Unit.
- **GPU**: Graphics Processing Unit.
- **SIMD**: Single Instruction, Multiple Data.
- **SIMT**: Single Instruction, Multiple Threads.
- **SM**: Streaming Multiprocessor.
- **SMT**: Simultaneous Multithreading.



# References: Images, Graphics

- [1] Forschungszentrum Jülich GmbH (Ralf-Uwe Limbach). *JUWELS Booster*.
- [2] Karl Rupp. *Pictures: CPU/GPU Performance Comparison*. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
- [3] Mark Lee. *Picture: kawasaki ninja*. URL: https://www.flickr.com/photos/pochacco20/39030210/
- [4] Shearings Holidays. *Picture: Shearings coach 636*. URL: https://www.flickr.com/photos/shearings/13583388025/
- [5] Wes Breazell. *Picture: Wizard*. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/

