

## **CENG 5030 Energy Efficient Computing**

### Implementation 06: TVM

Bei Yu CSE Department, CUHK byu@cse.cuhk.edu.hk

(Latest update: September 2, 2023)

2023 Fall



### 1 SOTA Solutions

### 2 MNN





### These slides contain/adapt materials developed by

Chen, Tianqi, et al. "TVM: An automated end-to-end optimizing compiler for deep learning." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018.























eudo-code for convolution program for the VIA accelerate



| 0x00: LOAD(PARAM(0-71))<br>0x00: LOAD(CATTV(0-74))<br>0x00: LOAD(LOBUET(0-31))<br>0x00: LOAD(LOBUET(0-31))<br>0x000: POS(10)-CEX)<br>0x000: PUSH(LS>-ST)<br>0x06: PUSH(EX>-ST)<br>0x00: PUSH(EX>-ST)<br>0x00: PUSH(EX>-ST)<br>0x00: PUSH(EX>-ST)<br>0x00: PUSH(EX>-ST)<br>0x00: PUSH(EX>-ST) | 0-71],LOBUF[ 0-31],STBUF[ 0- 7]) | // LD@TID0<br>// LD@TID0<br>// LD@TID0<br>// EX0TID0<br>// EX0TID0<br>// EX0TID0<br>// EX0TID0<br>// ST0TID0<br>// ST0TID0<br>// ST0TID0<br>// ST0TID0 |  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| 0+08: LOAD(ACTIV[25-50])<br>0+08: LOAD(LOBUF[31-53])<br>0+08: PUSH(LD>-EX)<br>0+08: PUSH(LD>-EX)<br>0+08: PUSH(LD>-EX)<br>0+08: EXE (ACTIV[25-50],PARAM[<br>0+18: PUSH(EX>-5T)<br>0+11: PUSH(EX>-5T)<br>0+11: PUSH(EX>-5T)<br>0+11: STOR(CT=>E2,2-39])<br>0+11: STOR(CT=>E2,2-39])           | 0-71],LDBUF[32-63],STBUF[32-39]) | // LD@TID1<br>// LD@TID1<br>// LD@TID1<br>// EXQTID1<br>// EXQTID1<br>// EXQTID1<br>// ST@TID1<br>// ST@TID1<br>// ST@TID1                             |  |
| 0x15: POP (EX->1D)<br>0x15: LOAD(PARAM[0-71])<br>0x17: LOAD(ACTIV[0-74])<br>0x18: LOAD(LOBUF[0-71])<br>0x18: LOAD(LOBUF[0-71])<br>0x18: POISH(LD->EX)<br>0x18: POISH(LD->EX)<br>0x18: POISH(EX->ST)<br>0x18: POP (EX->ST)<br>0x18: POP (EX->ST)<br>0x18: POP (EX->ST)                        | 0-71],LOBUF[ 0-31],STBUF[ 0- 7]) | // LD@TID2<br>// LD@TID2<br>// LD@TID2<br>// LD@TID2<br>// LD@TID2<br>// EX@TID2<br>// EX@TID2<br>// EX@TID2<br>// ST@TID2                             |  |
| // Vertial inteld 3<br>0x20: PDP (Ex >>LD)<br>0x21: LDAD(ACTIV[25:50])<br>0x21: LDAD(ACTIV[25:50])<br>0x22: LDAD(LDBDE[12:53])<br>0x24: PDP (LD=>EX)<br>0x25: PDP (LD=>EX)<br>0x25: PDP (Ex=>ST)<br>0x26: PDP (EX=>ST)<br>0x27: PUSH(Ex=>ST)<br>0x28: PDP (EX=>ST)<br>0x28: PDP (EX=>ST)     | 0-71],LDBUF[32-63],STBUF[32-39]) | // LD@TID3<br>// LD@TID3<br>// LD@TID3<br>// LD@TID3<br>// EX@TID3<br>// EX@TID3<br>// EX@TID3<br>// ST@TID3<br>// ST@TID3                             |  |



micro op 0 fields.

micro-coded program

// Convolution access pattern dictated by micro-coded program. // Each register index is derived as a 2-D affine function. // e.g. idx,= a.gvb.x\*c.g\*, where c.g\* is specified by

(a) Blocked convolution program with multiple thread contexts







Explosion of models and frameworks



Explosion of models and frameworks



Explosion of models and frameworks





Explosion of models and frameworks





Explosion of models and frameworks





Explosion of models and frameworks







Explosion of models and frameworks





Explosion of models and frameworks





Explosion of models and frameworks





Explosion of models and frameworks















































### Vanilla Code

```
for y in range(1024):
    for x in range(1024):
        C[y][x] = 0
        for k in range(1024):
            C[y][x] += A[k][y] * B[k][x]
```







#### Loop Tiling for Locality







#### Map to Accelerators

```
inp_buffer AL[8][8], BL[8][8]
acc_buffer CL[8][8]
for yo in range(128):
    vdla.fill_zero(CL)
    for ko in range(128):
        vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8])
        vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8])
        vdla.fused_gemm8x8_add(CL, AL, BL)
        vdla.dma_copy2d(C[yo*8:yo*8+k,xo*8:xo*8+8], CL)
```

Human exploration of optimized code









































New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

cuDNN









9


# Limitations of Existing Approach



New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

cuDNN





# Limitations of Existing Approach



New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

cuDNN













#### Limitations of Existing Approach Frameworks m

New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

#### **Engineering intensive**

**cuDNN** 























































### Computational Graph as IR

#### Represent High level Effective Equivalent Transformations **Deep Learning Computations** to Optimize the Graph data attributes w1 conv2d channels=32, kernel size=(3.3). padding=(1,1), relu conv2d use bias=0 fused-conv2dw2 conv2d bn operation bn-relu relu inputs flatten relu dataflow dependency dense w3 softmax shape=(1.10)

Approach taken by: TensorFlow XLA, Intel NGraph, Nvidia TensorRT





# need to build and optimize operators for each hardware, variant of layout, precision, threading pattern ...









#### **Tensor Index Expression**

Compute C = dot(A, B.T)

import tvm

m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce\_axis((0, h), name='k')
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] \* B[j, k], axis=k))
Shape of C
Computation Rule



# Tensor Expressions are Expressive

#### Affine Transformation

```
out = tvm.compute((n, m), lambda i, j: tvm.sum(data[i, k] * w[j, k], k))
out = tvm.compute((n, m), lambda i, j: out[i, j] + bias[i])
```

#### Convolution

```
out = tvm.compute((c, h, w),
    lambda i, x, y: tvm.sum(data[kc,x+kx,y+ky] * w[i,kx,ky], [kx,ky,kc]))
```

#### ReLU

out = tvm.compute(shape, lambda \*i: tvm.max(0, out(\*i))



#### Emerging Tools Using Tensor Expression Language

Halide: Image processing language

Loopy: python based kernel generator

TACO: sparse tensor code generator

Tensor Comprehension



# Schedule: Tensor Expression to Code





- C = tvm.compute((n,), lambda i: A[i] + B[i])
- s = tvm.create\_schedule(C.op)

```
for (int i = 0; i < n; ++i) {
    C[i] = A[i] + B[i];
}</pre>
```



```
C = tvm.compute((n,), lambda i: A[i] + B[i])
s = tvm.create schedule(C.op)
```

xo, xi = s[C].split(s[C].axis[0], factor=32)

```
for (int xo = 0; xo < ceil(n / 32); ++xo) {
  for (int xi = 0; xi < 32; ++xi) {
    int i = xo * 32 + xi;
    if (i < n) {
        C[i] = A[i] + B[i];
      }
   }
}</pre>
```



```
C = tvm.compute((n,), lambda i: A[i] + B[i])
s = tvm.create_schedule(C.op)
xo, xi = s[C].split(s[C].axis[0], factor=32)
s[C].recorder(xi, xo)
```

```
for (int xi = 0; xi < 32; ++xi) {
  for (int xo = 0; xo < ceil(n / 32); ++xo) {
    int i = xo * 32 + xi;
    if (i < n) {
        C[i] = A[i] + B[i];
      }
    }
}</pre>
```



```
C = tvm.compute((n,), lambda i: A[i] + B[i])
s = tvm.create_schedule(C.op)
xo, xi = s[C].split(s[C].axis[0], factor=32)
s[C].recorder(xi, xo)
s[C].bind(xo, tvm.thread_axis("blockIdx.x")
s[C].bind(xi, tvm.thread_axis("threadIdx.x")
```

```
int i = threadIdx.x * 32 + blockIdx.x;
if (i < n) {
    C[i] = A[i] + B[i];
}</pre>
```



#### Key Challenge: Good Space of Schedule

Should contain any knobs that produces a logically equivalent program that runs well on backend models

Must contain the common manual optimization patterns

Need to actively evolve to incorporate new techniques





Reuse primitives from prior work: Halide, Loopy



#### Challenge to Support Diverse Hardware Backends

CPUs



GPUs





TPU-like specialized Accelerators







| GPUs |
|------|
|------|





scalar



vector

Memory Subsystem





15



| GPUs |  |
|------|--|
|      |  |





vector

| 12    |       |    |  |
|-------|-------|----|--|
| LZ    |       |    |  |
| SM    | SM    |    |  |
| TX/L1 | TX/L1 |    |  |
| RF RF | RF    | RF |  |
| mixed |       |    |  |
|       | icu - |    |  |

Momory Subsystem

Shared memory among compute cores





| GPUs  | Compute Pr              | imitives      | Memory Subsystem                                                                                |
|-------|-------------------------|---------------|-------------------------------------------------------------------------------------------------|
| GF 05 |                         |               |                                                                                                 |
|       | scalar                  | vector<br>Sha | L2<br>SM SM<br>TX/L1 TX/L1<br>RF RF RF RF<br><i>mixed</i><br>ared memory among<br>compute cores |
|       | Use of Shared<br>Memory | Thre<br>Coope | ead<br>ration                                                                                   |



#### TPU-like Specialized Accelerators





#### **Compute Primitives**



tensor



explicitly managed



#### TPU-like Specialized Accelerators











Compute primitives

















#### Hardware designer: declare tensor instruction interface with Tensor Expression

w, x = t.placeholder((8, 8)), t.placeholder((8, 8)) declare behavior k = t, reduce axis((0, 8)) v = t.compute((8, 8), lambda i, i: t.sum(w[i, k] \* x[j, k], axis=k)) lowering rule to generate def gemm intrin lower(inputs. outputs); hardware intrinsics to carry ww ptr = inputs[0].access ptr("r") xx ptr = inputs[1].access ptr("r") out the computation zz ptr = outputs[0].access ptr("w") compute = t.hardware intrin("gemm8x8", ww ptr. xx ptr. zz ptr) reset = t.hardware intrin("fill zero", zz ptr) update = t.hardware intrin("fuse gemm8x8 add", ww ptr. xx ptr. zz ptr) return compute, reset, update

gemm8x8 = t.decl\_tensor\_intrin(y.op, gemm\_intrin\_lower)


### **Tensorization Challenge**



#### Hardware designer: declare tensor instruction interface with Tensor Expression

w, x = t.placeholder((8, 8)), t.placeholder((8, 8)) declare behavior k = t, reduce axis((0, 8)) v = t.compute((8, 8), lambda i, i: t.sum(w[i, k] \* x[j, k], axis=k)) lowering rule to generate def gemm intrin lower(inputs. outputs): hardware intrinsics to carry ww ptr = inputs[0].access ptr("r") xx ptr = inputs[1].access ptr("r") out the computation zz ptr = outputs[0].access ptr("w") compute = t.hardware intrin("gemm8x8", ww ptr, xx ptr, zz ptr) reset = t.hardware intrin("fill zero", zz ptr) update = t.hardware intrin("fuse gemm8x8 add", ww ptr. xx ptr. zz ptr) return compute, reset, update

gemm8x8 = t.decl\_tensor\_intrin(y.op, gemm\_intrin\_lower)

#### Tensorize: transform program to use tensor instructions





#### TPU-like Specialized Accelerators





#### **Compute Primitives**



tensor



explicitly managed



#### **TPU-like Specialized Accelerators**





**Compute Primitives** 







# Software Support for Latency Hiding





# Software Support for Latency Hiding





# Software Support for Latency Hiding







































## Global View of TVM Stack





## High Level Compilation Frontend

module = runtime.create(graph, lib, tvm.gpu(0))
module.set\_input(\*\*params)
module.run(data=data\_array)
output = tvm.nd.empty(out\_shape, ctx=tvm.gpu(0))
module.get\_output(0, output)

import tvm
import nnvm.frontend
import nnvm.compiler

On languages and platforms you choose





#### Program Your Phone with Python from Your Laptop

**RPC** Server on Compiler Stack Embedded Device lib = t.build(s, [A, B], 'llvm -target=armv7l-none-linux-gnueabihf', name='mvfunc') remote = t.rpc.connect(host. port) lib.save('myfunc.o') upload module to remote remote.upload('myfunc.o') get remote function f = remote.load\_module('myfunc.o') ctx = remote.cpu(0)copy data to remote a = t.nd.array(np.random.uniform(size=1024), ctx) get remote array handle b = t.nd.array(np.zeros(1024), ctx)run function on remote remote timer = f.time evaluator('mvfunc', ctx, number=10) time\_cost = remote\_timer(a, b) get profile statistics back np.testing.assert equal(b.asnumpy(), expected) copy data back to host for correctness verification













# Learning-based Program Optimizer Program **Program Optimizer** Code Generator **Runtime Measurements**





High experiment cost, each trial costs ~1second













#### Need reliable cost model per hardware

















Adapt to hardware type by learning Make prediction in 1ms level

























**Relative Speedup** 







#### End to End Inference Performance (Nvidia Titan X)
























# End to End Performance(ARM Cortex-A53)





#### End to End Performance(ARM Cortex-A53)





# End to End Performance(ARM GPU)





# Supporting New Specialized Accelerators



Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer





# Supporting New Specialized Accelerators



Hardware aware Search Space of Optimized Tensor Programs

#### Machine Learning based Program Optimizer















#### VTA MicroArchitecture





webservices







VTA Hardware/Software Interface (ISA)

#### VTA MicroArchitecture





webservices











amazon









- JIT compile accelerator micro code
- Support heterogenous devices, 10x better than CPU on the same board.
- Move hardware complexity to software





- JIT compile accelerator micro code
- Support heterogenous devices, 10x better than CPU on the same board.
- Move hardware complexity to software
   compiler, driver,
   hardware design
   full stack open source



# TVM: Learning-based Learning System



# Check it out!



# **SOTA Solutions**

#### Example: FlexFlow, SysML'2019

1

• "The optimizer uses a MCMC search algorithm to explore the space of possible parallelization strategies and iteratively proposes candidate strategies that are evaluated by a execution simulator."



<sup>1</sup>Zhihao Jia, Matei Zaharia, and Alex Aiken (2019). "Beyond Data and Model Parallelism for Deep Neural Networks". In: *Proceedings of Machine Learning and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31 - April 2, 2019.* Ed. by Ameet Talwalkar, Virginia Smith, and Matei Zaharia. mlsys.org. URL: https://proceedings.mlsys.org/book/265.pdf.

5/30

2



• "We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search using effective model transfer across workloads."



<sup>2</sup>Tianqi Chen et al. (2018). "Learning to Optimize Tensor Programs". In: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada. Ed. by Samy Bengio et al., pp. 3393–3404. URL: http://papers.nips.cc/paper/7599-learning-to-optimize-tensor-programs.

#### Example: Ansor: AutoTVM v2.0, arXiv



• "We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores much more optimization combinations by sampling programs from a hierarchical representation of the search space."

3



<sup>3</sup>Lianmin Zheng et al. (2020). "Ansor : Generating High-Performance Tensor Programs for Deep Learning". In: *CoRR* abs/2006.06762. arXiv: 2006.06762. URL: https://arxiv.org/abs/2006.06762.

#### Example: Halide, SIGGRAPH '2019



 "We generate schedules for Halide programs using tree search over the space of schedules guided by a learned cost model and optional autotuning. The cost model is trained by benchmarking thousands of randomly-generated Halide programs and schedules. The resulting code significantly outperforms prior work and human experts."



4

<sup>4</sup>Andrew Adams et al. (2019). "Learning to optimize halide with tree search and random programs". In: *ACM Trans. Graph.* 38.4, 121:1–121:12. DOI: 10.1145/3306346.3322967. URL: https://doi.org/10.1145/3306346.3322967.



# MNN





<sup>5</sup>Xiaotang Jiang et al. (2020). "MNN: A Universal and Efficient Inference Engine". In:





- Caffe Deep Learning Framework
- TensorFlow Deep Learning Framework
- Pytorch Deep Learning Framework







- PyTorch is a python package that provides two high-level features:
  - Tensor computation (like numpy) with strong GPU acceleration
  - Deep Neural Networks built on a tape-based autograd system
- Model Deployment:
  - For high-performance inference deployment for trained models, export to ONNX format and optimize and deploy with NVIDIA TensorRT or MNN inference accelerator

#### PyTorch Code Sample



```
torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def init (self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 6 * 6, 120) # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
   def forward(self, x):
        x = F.max pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num flat features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
   def num flat features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension
        num features = 1
        for s in size:
            num features *= s
        return num features
```





- TensorFlow is an open source software library for numerical computation using data flow graphs
- Model Deployment
  - For high-performance inference deployment for trained models, using TensorFlow-MNN integration to optimize models within TensorFlow and deploy with MNN inference accelerator



```
import tensorflow as tf
     from tensorflow.keras import Model, layers
     import numpy as np
 6 ▼ class NeuralNet(Model):
 8 🔻
             super(NeuralNet, self).__init__()
             self.fc1 = layers.Dense(n_hidden_1, activation=tf.nn.relu)
             self.fc2 = layers.Dense(n_hidden_2, activation=tf.nn.relu)
             self.out = layers.Dense(num classes)
         def call(self, x, is training=False):
18 🔻
             x = self.fc1(x)
             x = self.fc2(x)
             x = self.out(x)
22 🔻
             if not is_training:
                 x = tf.nn.softmax(x)
             return x
```



# Caffe

- Caffe is a deep learning framework made with expression, speed, and modularity in mind:
  - Expressive architecture encourages application and innovation
  - Extensible code fosters active development.
  - Speed makes Caffe perfect for research experiments and industry deployment
- Model Deployment:
  - For high-performance inference deployment for trained models, using Caffe-MNN integration to optimize models within Caffe and MNN inference accelerator

#### Caffe Code Sample



```
caffe root = '../'
     import svs
     sys.path.insert(0, caffe_root + 'python')
     import os
     os.chdir(caffe root)
     !data/mnist/get mnist.sh
     !examples/mnist/create mnist.sh
     os.chdir('examples')
     from caffe import layers as L, params as P
17 ▼ def lenet(lmdb, batch size):
         n = caffe.NetSpec()
         n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                                  transform param=dict(scale=1./255), ntop=2)
         n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight filler=dict(type='xavier'))
         n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
         n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
         n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
         n.fc1 = L.InnerProduct(n.pool2, num output=500, weight filler=dict(type='xavier'))
         n.relu1 = L.ReLU(n.fc1, in place=True)
         n.score = L.InnerProduct(n.relu1, num_output=10, weight_filler-dict(type='xavier'))
         n.loss = L.SoftmaxWithLoss(n.score, n.label)
         return n.to_proto()
     with open('mnist/lenet auto train.prototxt', 'w') as f:
         f.write(str(lenet('mnist/mnist train lmdb', 64)))
     with open('mnist/lenet auto test.prototxt', 'w') as f:
         f.write(str(lenet('mnist/mnist test lmdb', 100)))
```

#### Overview of the proposed Mobile Neural Network





#### On-device inference









- Training in fp32 and inference in fp16 is expected to get same accuracy as in fp32 most of the time
- Add batch normalization to activation
- If it is integer RGB input (0 255), normalize it to be float (0 1)



- Advantages of FP16:
  - FP16 improves speed (TFLOPS) and performance
  - FP16 reduces memory usage of a neural network
  - FP16 data transfers are faster than FP32
- Disadvantages of FP16:
  - They must be converted to or from 32-bit floats before they are operated on




• As a programmer, there are several ways you can use Neon technology:

- Neon intrinsics
- Neon-enabled libraries
- Auto-vectorization by your compiler
- Hand-coded Neon assembler





- Support for both integer and floating point operations ensures the adaptability of a broad range of applications, from codecs to High Performance Computing to 3D graphics.
- Tight coupling to the Arm processor provides a single instruction stream and a unified view of memory, presenting a single development platform target with a simpler tool flow









## Automated Search







- Generally, MNN outperforms other inference engines under almost all settings by about 20% 40%, regardless of the smartphones, backends, and networks
- For CPU, on average, 4-thread inference with MNN is about 30% faster than others on iOS platforms, and about 34% faster on Android platforms
- For Metal GPU backend on iPhones, MNN is much faster than TF-Lite, a little slower than CoreML but still comparable

## Performance on different smartphones and networks











## • alibaba2020mnn

• Christian Szegedy et al. (2015). "Going deeper with convolutions". In: CVPR