

## CMSC5743 Lab 03 CUDA Tutorial Materials

Yang BAI
Department of Computer Science & Engineering
Chinese University of Hong Kong
ybai@cse.cuhk.edu.hk

October 29, 2021

#### Outline



1 Vector Addition

2 Tensor Core WMMA

**3** General Matrix Multiplication

### CUDA Programming Language



- Heterogeneous Computing
- Host and Device
- CUDA C/C++

**Vector Addition** 

#### Coding Style and Organization



- Host Code Initialization
- Device Code Initialization
- Kernel Code
- Check Your Results
- Free Source on Host
- Free Source on Device



```
int* a;
int* b;
int* c;
int* dev a;
int* dev_b;
int* dev_c;
a = (int* )malloc(sizeof(int) * N);
b = (int* )malloc(sizeof(int) * N);
c = (int* )malloc(sizeof(int) * N);
for ( int i = 0; i < N; i++ ) {
    a[i] = i;
    b[i] = N - i - 1;
```

#### Device Code Initialization



```
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));

cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

add<<<numBlocks, numThreads>>>(dev_a, dev_b, dev_c);
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
```



```
#include <stdio.h>
#include <iostream>
#define N 128
#define numThreads 128
#define numBlocks 1
__global__ void add(int* a, int* b, int* c) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < N) {</pre>
        c[tid] = a[tid] + b[tid];
```



```
bool flag = true;
int tot = 0;
printf("Let us check our results...\n");
for( int i = 0; i < N; i++ ) {
    if (a[i] + b[i] != c[i]) {
        flag = false;
        printf("%d + %d != %d\n", a[i], b[i], c[i]);
    tot += 1;
if (flag) {
    printf("success!");
```



```
free(a):
free(b);
free(c);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
```

**Tensor Core WMMA** 

#### Tensor Core Overview



- Access to Tensor Core by cuBLAS/cuDNN
- Access to Tensor Core by CUTLASS
- Access to Tensor Core by TVM

#### Tensor Core History



- Volta Tensor Core (1st)
  - FP16 supported
  - 8 x 8 x 4 (M x N x K)
- Turing Tensor Core (2nd)
  - FP16 supported
  - 8 x 8 x 4, 16 x 8 x 8 (recommended)
  - INT8, INT4, INT1 supported
  - 8 x 8 x 16, 8 x 8 x 32, 8 x 8 x 128
- Ampere Tensor Core (3rd)
  - new bfloat16 supported, FP16
  - 16 x 8 x 8, 16 x 8 x 16
  - new TF32 supported, 16 x 8 x 4, 16 x 8 x 8
  - new Double supported, 8 x 8 x 4



## Format of Floating points IEEE754



#### Data types





#### Tesla T4 GPU Architecture





#### Turing 2080Ti Information



- Architecture: Turing
- SMs: 68
- CUDA Cores/SM: 64
- CUDA Cores/GPU: 4352
- Tensor Cores/SM: 8
- Tensor Cores/GPU: 544

#### Hierarchical Structure



```
for (int cta n = 0: cta n < GemmN: cta n += CtaTileN) {</pre>
 for (int cta m = 0: cta m < GemmM: cta m += CtaTileM) {</pre>
    for (int cta k = 0; cta k < GemmK; cta k += CtaTileK) {</pre>
      for (int warp_n = 0; warp_n < CtaTileN; warp_n += WarpTileN) {</pre>
        for (int warp_m = 0; warp_m < CtaTileM; warp_m += WarpTileM) {</pre>
          for (int warp k = 0; warp k < CtaTileK; warp k += WarpTileK)
            for (int mma k = 0; mma k < WarpTileK; mma k += MmaK) {</pre>
              for (int mma n = 0; mma n < WarpTileN; mma n += MmaN) {</pre>
                 for (int mma_m = 0; mma_m < WarpTileM; mma_m += MmaM) {</pre>
                   mma_instruction(d, a, b, c);
```

#### Hierarchical Structure

- The basic triple loop nest computing matrix multiply may be blocked and tiled to match concurrency in hardware, memory locality, and parallel programming models.
- In CUTLASS, GEMM is mapped to NVIDIA GPUs with the structure illustrated by the following loop nest.

#### This tiled loop nest targets concurrency among

- threadblocks-level
- warps
- CUDA and Tensor Cores

#### and takes advantage of memory locality within

- shared memory
- registers

#### The flow of data in CUTLASS





# General Matrix Multiplication

- Q1 Learn the code in ./Lab03-CUDA/code and it contains three folders (vector\_add, gemm, wmma)
  - Learn the code style and components of vector\_add.cu file
  - Complete all of the code in gemm folder
  - Try to make your gemm kernel more efficient
    - shared memory
    - tiling size
    - block and thread size
- Q2 Learn the wmma.cu from the /Lab03-CUDA/code/wmma to run it successfully by compile.sh script
  - Learn the different data type in CUDA programming language such as Float16, Int8
  - Learn the basic knowledge of Tensor Core and WMMA in CUDA programming language
  - Learn the difference between FLOPs and FLOPS
  - Change the tiling size in wmma.cu to get the different TFLOPS

