

# **CENG 5030 Energy Efficient Computing**

## Implementation 04: Sparse Conv

Bei Yu CSE Department, CUHK byu@cse.cuhk.edu.hk

(Latest update: November 22, 2023)

2023 Fall



1. Kernel Sparse Convolution
2. Submanifold Sparse Convolution





# **Kernel Sparse Convolution**



- DNNs are often redundant, and their filters can be sparse
- Sparsity can also help to reduce over-fitting



### Sparse Convolution: Naive Implementation 1





- 1: **for all** *w*[i] **do**
- 2: **if** *w*[i] = 0 **then**
- 3: **continue**
- 4: **end if**
- 5: output feature map $Y \leftarrow X \times w[i]$
- 6: **end for**



Example weights: $w = [0, 0, 4, 8]$






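BAD implementation for pipelining! The data-dependent branch on *w*[i] (line 2) is hard to predict, and every misprediction flushes instructions that are already in flight: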

| Instr. No. | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|------------|---------|---------|---------|---------|---------|---------|---------|
| 1          | IF      | ID      | EX      | MEM     | WB      |         |         |
| 2          |         | IF      | ID      | EX      | MEM     | WB      |         |
| 3          |         |         | IF      | ID      | EX      | MEM     | WB      |
| 4          |         |         |         | IF      | ID      | EX      | MEM     |
| 5          |         |         |         |         | IF      | ID      | EX      |
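The naive loop above can be sketched in a few lines of Python. This is only an illustration: it assumes a 1-D input and a "valid" sliding window, it accumulates into `Y` rather than overwriting it, and the function name `naive_sparse_conv1d` and the input `x` are hypothetical. The weights are the values 0, 0, 4, 8 shown on the slide.

```python
def naive_sparse_conv1d(x, w):
    """Valid 1-D correlation that skips zero weights with a branch."""
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    for i, wi in enumerate(w):
        if wi == 0:              # data-dependent branch: hard to predict
            continue
        for k in range(n_out):
            y[k] += x[k + i] * wi    # accumulate the shifted input times w[i]
    return y


x = [1, 2, 3, 4, 5, 6]           # hypothetical input
w = [0, 0, 4, 8]                 # weights shown on the slide
y = naive_sparse_conv1d(x, w)
```

Only two of the four weights trigger any multiplications here, but the branch itself is still evaluated for every weight.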





## Sparse Matrix Representation





- CSR: Good for operation on feature maps
- CSC: Good for operation on filters
- We have more control over the filters, so CSC is the usual choice.
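To make the two formats concrete, here is a minimal sketch that converts a small dense matrix to CSR and CSC. The helper names `to_csr` and `to_csc` and the matrix `A` are hypothetical; the array names follow the `rowptr`/`colidx` convention used later in these slides.

```python
def to_csr(dense):
    """Return (values, colidx, rowptr) for a dense matrix given as rows."""
    values, colidx, rowptr = [], [], [0]
    for row in dense:
        for c, v in enumerate(row):
            if v != 0:
                values.append(v)
                colidx.append(c)
        rowptr.append(len(values))   # end of this row in values/colidx
    return values, colidx, rowptr


def to_csc(dense):
    """CSC is the CSR of the transpose: returns (values, rowidx, colptr)."""
    transposed = [list(col) for col in zip(*dense)]
    return to_csr(transposed)


A = [[5, 0, 0],
     [0, 0, 7],
     [2, 0, 3]]
csr = to_csr(A)
csc = to_csc(A)
```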







- BAD implementation for Spatial Locality!
- Poor memory access patterns

## SOTA 2: Sparse Convolution





Figure 1: Conceptual view of the direct sparse convolution algorithm. Computation of output value at (y, x)th position of *n*th output channel is highlighted.

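```
for each output channel n {
  // non-zero weights of output channel n (row n of the CSR matrix W)
  for j in [W.rowptr[n], W.rowptr[n+1]) {
    // pre-computed input offset f(c,r,s) and the weight value
    off = W.colidx[j]; coeff = W.value[j]
    for (int y = 0; y < H_OUT; ++y) {
      for (int x = 0; x < W_OUT; ++x) {
        out[n][y][x] += coeff*in[off+f(0,y,x)]
      }
    }
  }
}
```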

Figure 2: Sparse convolution pseudo code. Matrix W is stored in *compressed sparse row* (CSR) format, where *rowptr*[n] points to the first non-zero weight of the *n*th output channel. For the *j*th non-zero weight at (n, c, r, s), W.colidx[j] contains the offset to the (c, r, s)th element of tensor in, which is pre-computed by the layout function as f(c, r, s). If in has CHW format, $f(c, r, s) = (cH_{in} + r)W_{in} + s$. The "virtual" dense matrix is formed on-the-fly by shifting in by (0, y, x).
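The pseudo code of Figure 2 can be turned into a small runnable sketch. It assumes a flat CHW input, stride 1 and no padding; the function name `direct_sparse_conv` and the toy tensors are hypothetical and are not taken from the paper.

```python
def f(c, r, s, H_in, W_in):
    """CHW layout function: flat offset of element (c, r, s)."""
    return (c * H_in + r) * W_in + s


def direct_sparse_conv(inp, C, H_in, W_in, weights, R, S):
    """inp: flat CHW list; weights[n][c][r][s]: filters, mostly zero."""
    N = len(weights)
    H_out, W_out = H_in - R + 1, W_in - S + 1
    # CSR over the weights: one row per output channel n.
    rowptr, colidx, value = [0], [], []
    for n in range(N):
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    if weights[n][c][r][s] != 0:
                        colidx.append(f(c, r, s, H_in, W_in))  # pre-computed offset
                        value.append(weights[n][c][r][s])
        rowptr.append(len(value))
    out = [[[0] * W_out for _ in range(H_out)] for _ in range(N)]
    for n in range(N):
        for j in range(rowptr[n], rowptr[n + 1]):
            off, coeff = colidx[j], value[j]
            for y in range(H_out):
                for x in range(W_out):
                    # shifting by f(0, y, x) forms the "virtual" dense matrix
                    out[n][y][x] += coeff * inp[off + f(0, y, x, H_in, W_in)]
    return out


inp = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # one 3x3 channel, flattened
weights = [[[[1, 0], [0, 2]]]]           # one 2x2 filter with two non-zeros
out = direct_sparse_conv(inp, 1, 3, 3, weights, 2, 2)
```

The two inner loops contain no branch and walk the input contiguously, which is the point of the layout trick.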

<sup>1</sup>Jongsoo Park et al. (2017). "Faster CNNs with direct sparse convolutions and guided pruning". In: *Proc. ICLR*.



- Sparsity is a desired property for computation acceleration. (cuSPARSE library, direct sparse convolution, etc.)
- Sometimes not only the filters but also the input feature maps are sparse.



## Discussion: Sparse-Sparse Convolution





- An efficient implementation is required (to improve pipeline efficiency);
- When sparsity(*input*) = 0.9 and sparsity(*weight*) = 0.8, the speedup is more than $10 \times$;
- Some other issues:
  - How to be compatible with pooling layer?
  - Transform between dense & sparse formats
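A back-of-the-envelope ceiling for the sparsity levels quoted above can be worked out as follows. It assumes that the zeros of the input and of the weights are placed independently and that only multiplications with two non-zero operands are executed; neither assumption is stated in the slides.

$$
\text{fraction of effectual MACs} = (1 - 0.9)\,(1 - 0.8) = 0.1 \times 0.2 = 0.02,
\qquad
\text{ideal speedup} = \frac{1}{0.02} = 50\times
$$

Under these assumptions the quoted "more than $10 \times$" sits below the ideal ceiling, which is what one would expect once indexing and format-conversion overheads are paid.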



# **Submanifold Sparse Convolution**



In the real world, we sometimes have to handle voxel data. For example, 3D voxel data is widely used in point cloud analysis. A simple example is shown here, and it can be viewed as $V \in (1, R, R, R)$.



### Voxel Data



Here is a rabbit with shape $V \in (1, 64, 64, 64)$. If we use traditional convolution to extract its features, the GPU will run out of memory very quickly, because the input $V \in (1, 64, 64, 64)$ can be viewed as an image $I \in (1, 4096, 64)$.





To overcome this issue, we use 3D sparse convolution for voxel data analysis. Sparse convolution computes only at the data points where voxel data exists.



In this Lab, we are going to build a sparse convolution from scratch. Here we use the example input:



where P1 and P2 have a pixel value of 1 in 3 channels.



First, we build a hash table to store the input data. Consider the following case:

`conv2D(kernel_size=3, out_channels=2, stride=1, padding=0)`



#### We can build an input table $H_{in}$ like this:



H\_in
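A minimal sketch of such an input table is given below. The original figure is not reproduced here, so the 5x5 input size and the positions P1 = (2, 1) and P2 = (3, 2) are assumptions, chosen to agree with the output coordinates listed later in this section. The names `build_hash_table` and `V_in` are hypothetical.

```python
WIDTH, HEIGHT, CHANNELS = 5, 5, 3           # assumed input size
P1, P2 = (2, 1), (3, 2)                     # assumed (x, y) positions

# dense[y][x] holds the feature vector of pixel (x, y)
dense = [[[0] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]
for (x, y) in (P1, P2):
    dense[y][x] = [1] * CHANNELS            # pixel value 1 in 3 channels


def build_hash_table(dense):
    """Map each active (x, y) to a row index; features are stored by row."""
    table, features = {}, []
    for y, row in enumerate(dense):
        for x, feat in enumerate(row):
            if any(v != 0 for v in feat):
                table[(x, y)] = len(features)
                features.append(list(feat))
    return table, features


H_in, V_in = build_hash_table(dense)
```

Only the two active pixels are stored; the 23 empty pixels never appear in `H_in`.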



























Sliding the kernel over the input, P1 contributes to the following output positions $P_{out}$:

| P1 | P1 | P1 |
|----|----|----|
| P1 | P1 | P1 |
|    |    |    |

| $P_{out}$ |
|-------|
| (0,0) |
| (1,0) |
| (2,0) |
| (0,1) |
| (1,1) |
| (2,1) |





After applying the same process to P2, we get the output hash table $H_{out}$ by merging the $P_{out}$ of P1 and P2: P1 covers (0,0), (1,0), (2,0), (0,1), (1,1), (2,1), and P2 covers (1,0), (2,0), (1,1), (2,1), (1,2), (2,2).








- Next, we build a rulebook that maps *H*<sub>in</sub> to *H*<sub>out</sub>.
- To build the rulebook, we first have to build an offset map like this:







#### Quick Question:

Please write the offset map of *P*2 by yourself.



After obtaining the offset map, we can finally build the rulebook as follows:

- $P_{out}$ of P1: (0,0), (1,0), (2,0), (0,1), (1,1), (2,1)
- $P_{out}$ of P2: (1,0), (2,0), (1,1), (2,1), (1,2), (2,2)
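The output positions above can be reproduced with a short sketch. It assumes a 5x5 input with P1 at (2, 1) and P2 at (3, 2); these values are inferred from the coordinate lists, since the original figure is not available. The function name `output_positions` is hypothetical, and kernel offsets are measured from the top-left kernel cell, which may differ from the convention in the missing offset map.

```python
def output_positions(p_in, kernel_size, out_w, out_h):
    """Output (x, y) positions touched by one active input, row by row."""
    x, y = p_in
    hits = []
    for oy in range(out_h):
        for ox in range(out_w):
            i, j = x - ox, y - oy        # kernel offset linking input and output
            if 0 <= i < kernel_size and 0 <= j < kernel_size:
                hits.append((ox, oy))
    return hits


K, IN_W, IN_H = 3, 5, 5                      # 5x5 input is an assumption
OUT_W, OUT_H = IN_W - K + 1, IN_H - K + 1    # stride 1, padding 0
p1_out = output_positions((2, 1), K, OUT_W, OUT_H)
p2_out = output_positions((3, 2), K, OUT_W, OUT_H)
```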






Recalling $H_{in}$ and $H_{out}$, the rulebook is generated as follows:





#### If the offset already exists, we simply increment its count:
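The rulebook construction, including the per-offset count, can be sketched as below. The input table (P1 at (2, 1), P2 at (3, 2), 3x3 output) is an assumption inferred from the coordinates in this section, and `build_rulebook` is a hypothetical helper; offsets are indexed from the top-left kernel cell.

```python
def build_rulebook(h_in, kernel_size, out_w, out_h):
    """Return (h_out, rulebook); rulebook[offset] lists (v_in, v_out) pairs."""
    h_out = {}       # output (x, y) -> output row index
    rulebook = {}    # kernel offset (i, j) -> [(v_in, v_out), ...]
    for (x, y), v_in in h_in.items():
        for j in range(kernel_size):
            for i in range(kernel_size):
                ox, oy = x - i, y - j
                if not (0 <= ox < out_w and 0 <= oy < out_h):
                    continue                 # kernel would leave the input
                v_out = h_out.setdefault((ox, oy), len(h_out))
                # an existing offset just gets one more pair (count + 1)
                rulebook.setdefault((i, j), []).append((v_in, v_out))
    return h_out, rulebook


h_in = {(2, 1): 0, (3, 2): 1}                # assumed positions of P1 and P2
h_out, rulebook = build_rulebook(h_in, kernel_size=3, out_w=3, out_h=3)
counts = {offset: len(pairs) for offset, pairs in rulebook.items()}
```

In this sketch the count of an offset is simply the length of its pair list, so offsets shared by P1 and P2 end up with a count of 2.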





#### After getting rulebook, we can apply sparse convolution:



For  $P_1$ , the reults is shown above, which is the blue points in 5-th row. Please practice  $P_2$  by yourself



# **Sparse Hardware Architecture**



## EIE: Efficient Inference Engine on Compressed Deep Neural Network

Han et al. ISCA 2016





## **Deep Learning Accelerators**

- First Wave: Compute (NeuFlow)
- Second Wave: Memory (DianNao family)
- Third Wave: Algorithm / Hardware Co-Design (EIE)

Google TPU: "This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs"



### EIE: the First DNN Accelerator for Sparse, Compressed Model





## **EIE: Parallelization on Sparsity**

$$
\underbrace{\begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3} \\
0 & 0 & w_{1,2} & 0 \\
0 & w_{2,1} & 0 & w_{2,3} \\
0 & 0 & 0 & 0 \\
0 & 0 & w_{4,2} & w_{4,3} \\
w_{5,0} & 0 & 0 & 0 \\
0 & 0 & 0 & w_{6,3} \\
0 & w_{7,1} & 0 & 0
\end{pmatrix}}_{W}
\times
\underbrace{\begin{pmatrix} 0 \\ a_1 \\ 0 \\ a_3 \end{pmatrix}}_{\vec{a}}
=
\underbrace{\begin{pmatrix} b_0 \\ b_1 \\ -b_2 \\ b_3 \\ -b_4 \\ b_5 \\ b_6 \\ -b_7 \end{pmatrix}}_{\vec{b}}
\stackrel{\text{ReLU}}{\Rightarrow}
\begin{pmatrix} b_0 \\ b_1 \\ 0 \\ b_3 \\ 0 \\ b_5 \\ b_6 \\ 0 \end{pmatrix}
$$










## Dataflow



Rule of thumb: $0 \times A = 0$; $W \times 0 = 0$
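The rule of thumb can be sketched as a sparse matrix-vector product that skips both zero activations and zero weights. The sparsity pattern below is the 8x4 example matrix from this section, but the numeric values are made up, and the helper names are hypothetical. Assigning row `i` to processing element `i % 4` follows the row interleaving described in the EIE paper and is stated here as an assumption about the figure.

```python
def to_csc(dense):
    """Column-wise storage: (values, rowidx, colptr)."""
    values, rowidx, colptr = [], [], [0]
    for col in zip(*dense):
        for r, v in enumerate(col):
            if v != 0:
                values.append(v)
                rowidx.append(r)
        colptr.append(len(values))
    return values, rowidx, colptr


def eie_spmv(W_csc, a, n_rows, n_pe=4):
    """b = W a using only non-zero a[j] and stored (non-zero) weights."""
    values, rowidx, colptr = W_csc
    b = [0] * n_rows
    work = [0] * n_pe                       # multiply-accumulates per PE
    for j, aj in enumerate(a):
        if aj == 0:                         # W * 0 = 0: skip the whole column
            continue
        for k in range(colptr[j], colptr[j + 1]):   # 0 * A = 0: zeros not stored
            i = rowidx[k]
            b[i] += values[k] * aj
            work[i % n_pe] += 1             # row i is owned by PE i % n_pe
    return b, work


def relu(v):
    return [max(0, t) for t in v]


W = [[3, 1, 0, 2],       # made-up values on the slide's sparsity pattern
     [0, 0, 6, 0],
     [0, -1, 0, -2],
     [0, 0, 0, 0],
     [0, 0, 7, -1],
     [9, 0, 0, 0],
     [0, 0, 0, 4],
     [0, -5, 0, 0]]
a = [0, 2, 0, 3]
b, work = eie_spmv(to_csc(W), a, n_rows=8)
b_relu = relu(b)
```

With these values only 7 of the 32 dense multiplications are executed, and the per-PE work is uneven, which hints at the load-balancing question that a sparse design has to address.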



# **EIE Architecture**

### Weight decode



### Address Accumulate

Rule of thumb: $0 \times A = 0$; $W \times 0 = 0$; $2.09, 1.92 \Rightarrow 2$
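The two stages named above, weight decode and address accumulate, can be sketched in software. The sketch assumes the relative-index encoding described in the EIE paper: each stored entry carries a codebook index and a 4-bit count of preceding zeros, and a padding zero is inserted when a run exceeds 15. The codebook values and the function names are hypothetical.

```python
def encode_column(col, max_run=15):
    """Encode a column of codebook indices as (v, z): entries and zero runs."""
    v, z, run = [], [], 0
    for x in col:
        if x != 0:
            v.append(x)
            z.append(run)
            run = 0
        elif run == max_run:     # run no longer fits in 4 bits: padding zero
            v.append(0)
            z.append(max_run)
            run = 0
        else:
            run += 1
    return v, z


def decode_column(v, z, codebook, length):
    """Rebuild the dense column of real-valued weights."""
    out = [0.0] * length
    pos = -1
    for idx, zeros in zip(v, z):
        pos += zeros + 1             # address accumulate: relative -> absolute
        out[pos] = codebook[idx]     # weight decode: index -> shared weight
    return out


codebook = [0.0, 2.0, -1.5, 0.5]     # hypothetical; index 0 stands for zero
col = [0, 0, 1, 2] + [0] * 18 + [3]  # codebook indices of one column
v, z = encode_column(col)
decoded = decode_column(v, z, codebook, len(col))
```

The codebook entry 2.0 plays the role of the shared value in the "2.09, 1.92 ⇒ 2" example: nearby weights are represented by one quantized value.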



# Post Layout Result of EIE



| Metric           | Value           |
|------------------|-----------------|
| Technology       | 40 nm           |
| # PEs            | 64              |
| On-chip SRAM     | 8 MB            |
| Max Model Size   | 84 Million      |
| Static Sparsity  | 10x             |
| Dynamic Sparsity | 3x              |
| Quantization     | 4-bit           |
| ALU Width        | 16-bit          |
| Area             | 40.8 mm²        |
| MxV Throughput   | 81,967 layers/s |
| Power            | 586 mW          |

1. Post-layout result
2. Throughput measured on AlexNet FC-7



# **Speedup on EIE**





# **Energy Efficiency on EIE**





# **Comparison: Throughput**





# **Comparison: Energy Efficiency**



## Weight Sparsity<sup>2</sup>



### Indexing Module (IM) for sparse data



- The IM indexes the needed neurons of sparse networks with different levels of sparsity.
- A centralized IM is placed in the buffer controller and transfers only the indexed neurons to the processing engines.

<sup>2</sup>Shijin Zhang et al. (2016). "Cambricon-X: An accelerator for sparse neural networks". In: *Proc. MICRO*. IEEE, pp. 1–12.



### Direct indexing and hardware implementation



- Neurons are selected directly from all input neurons, based on the existing connections recorded in a binary string.
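A software analogue of direct indexing might look like the sketch below. The function name `direct_index`, the neuron values, and the bit string are all made up for illustration; the hardware does this selection in parallel rather than with a loop.

```python
def direct_index(neurons, bits):
    """Keep neuron i when bits[i] == '1', i.e. when its synapse exists."""
    return [n for n, b in zip(neurons, bits) if b == "1"]


neurons = [10, 11, 12, 13, 14, 15, 16, 17]   # hypothetical input neurons
bits = "10010011"                            # one bit per input neuron
selected = direct_index(neurons, bits)
```

The cost of this scheme is one bit per input neuron, regardless of how sparse the connections are.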



### Step indexing and hardware implementation



- Neurons are selected based on the distances between input neurons with existing synapses.
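Step indexing can be sketched the same way: the stored distances are accumulated into absolute positions. The function name `step_index` and the data are hypothetical, and the sketch assumes that the first step is measured from index 0.

```python
from itertools import accumulate


def step_index(neurons, steps):
    """steps[k] is the distance from the previously selected neuron."""
    positions = list(accumulate(steps))      # running sum gives absolute indices
    return [neurons[p] for p in positions]


neurons = [10, 11, 12, 13, 14, 15, 16, 17]   # hypothetical input neurons
steps = [0, 3, 3, 1]                         # synapses at indices 0, 3, 6, 7
selected = step_index(neurons, steps)
```

Here the storage grows with the number of existing synapses instead of the number of input neurons.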



#### Lots of Runtime Zeroes

#### Ineffectual zero computations.



<sup>3</sup>Jorge Albericio et al. (2016). "Cnvlutin: Ineffectual-neuron-free deep neural network computing". In: *ACM SIGARCH Computer Architecture News* 44.3, pp. 1–13.



### DaDianNao<sup>4</sup>



<sup>4</sup>Yunji Chen et al. (2014). "DaDianNao: A machine-learning supercomputer". In: *Proc. MICRO*. IEEE, pp. 609–622.



### Processing in DaDianNao






#### Zero removal.











Lanes can no longer operate in lock-step.





### CNVLUTIN: Decoupling Lanes






#### Subunit 0



#### Subunit 15













#### **Decoupled Neuron Lanes:**

- Neuron + coordinate
- Proceed independently

#### **Partitioned SB:**

- 16-wide accesses
- 1 synapse per filter
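The "neuron + coordinate" idea can be sketched as follows: each lane stores only its non-zero neurons together with their offsets, and uses the offset to fetch the matching synapse. This is a simplified illustration, not the Cnvlutin hardware. The group size of 8 and all values are made up, and the names `encode_brick` and `lane_dot` are hypothetical.

```python
def encode_brick(brick):
    """Keep only non-zero neurons, each paired with its offset in the group."""
    return [(value, offset) for offset, value in enumerate(brick) if value != 0]


def lane_dot(encoded, synapses):
    """One lane: multiply each stored neuron by the synapse at its offset."""
    return sum(value * synapses[offset] for value, offset in encoded)


brick = [0, 5, 0, 0, 3, 0, 0, 0]         # hypothetical neuron values
synapses = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical filter weights
encoded = encode_brick(brick)
result = lane_dot(encoded, synapses)
```

Because each lane only iterates over its own encoded list, lanes with fewer non-zero neurons finish earlier, which is why they can no longer run in lock-step.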



- Wenlin Chen et al. (2015). "Compressing neural networks with the hashing trick". In: *Proc. ICML*, pp. 2285–2294
- Huizi Mao et al. (2017). "Exploring the granularity of sparsity in convolutional neural networks". In: *CVPR Workshop*, pp. 13–20
- Zhuang Liu et al. (2017). "Learning efficient convolutional networks through network slimming". In: *Proc. ICCV*, pp. 2736–2744
- Chenglong Zhao et al. (June 2019). "Variational convolutional neural network pruning". In: *Proc. CVPR*
- Junru Wu et al. (2018). "Deep *k*-Means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions". In: *Proc. ICML*