[paper] MEC: Memory-efficient Convolution for Deep Neural Network
2022-07-20 08:47:00 【one or only】
1 Introduction
The paper observes that direct convolution is simple but performs poorly, while im2col-, FFT-, and Winograd-based convolution methods carry large memory overhead, which degrades performance and makes it hard to balance speed against memory consumption. It proposes memory-efficient convolution (MEC).
MEC reduces memory overhead and accelerates the convolution computation. It uses a simple, novel scheme to lower the input matrix in a highly compact way, while still relying on highly optimized libraries such as BLAS for fast matrix multiplication, and it performs many small matrix multiplications in parallel. By shrinking the lowered input matrix, MEC improves the efficiency of the memory subsystem (i.e., better cache locality), so it accelerates convolution without any loss of accuracy.
Experiments on CPU/GPU across mobile and server platforms show that MEC is a very efficient algorithm, applicable to a variety of memory-constrained systems.
2 Preliminaries
Notation
- Tensors and matrices are written in C-style row-major order; some math libraries, such as cuBLAS, use column-major order. The text keeps the row-major notation but treats all matrices as transposed when such libraries are used.
- s_h, s_w denote the vertical and horizontal strides of the convolution kernel.
- lowered matrix L: the intermediate representation obtained by converting the input into a matrix.
- For ease of exposition, assume any zero-padding has already been applied to the input I. The output matrix O then has dimensions o_h × o_w, with o_h = (i_h − k_h)/s_h + 1 and o_w = (i_w − k_w)/s_w + 1.
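As a quick sanity check of these dimensions, here is a minimal Python sketch (the function name is ours, not from the paper):

```python
def conv_output_dims(i_h, i_w, k_h, k_w, s_h=1, s_w=1):
    """Output height/width of convolving a pre-padded i_h x i_w input
    with a k_h x k_w kernel at strides (s_h, s_w)."""
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1
    return o_h, o_w

# Running example from the paper: 7x7 input, 3x3 kernel, stride 1 -> 5x5 output.
print(conv_output_dims(7, 7, 3, 3))  # (5, 5)
```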
Related work
- im2col-based convolution: converts the input matrix into a Toeplitz-style matrix, then uses a highly optimized linear algebra library (BLAS) for fast matrix multiplication.
- FFT-based convolution: convolution can be computed simply in the frequency domain, but every kernel must be padded to the same size as the input, which increases memory overhead; the smaller the kernel is relative to the input, the larger the overhead.
- Winograd-based convolution: based on the Coppersmith-Winograd algorithm, which shows how to reduce the number of multiplications at the cost of more additions and a large number of intermediate products. Prior work reports that Winograd-based convolution is effective for small kernels on GPU.
3 The MEC Algorithm
- For a given memory budget, MEC allows larger models to be trained or used for inference.
- MEC allows a larger mini-batch size during training, reducing per-iteration latency.
- MEC speeds up computation by improving the efficiency of the memory subsystem (e.g., more cache hits).
3.1 Motivation
- Direct convolution is simple and direct, with no memory overhead, but is computationally inefficient.
- im2col-based convolution turns convolution into a matrix multiplication, so highly optimized BLAS libraries can be used and computation is efficient; however, memory consumption grows, because the lowered input matrix is larger than the original input.
The im2col lowered matrix L has dimensions (o_h · o_w) × (k_h · k_w) (25 × 9 in this article's example), while the original input matrix is only 7 × 7.
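A minimal numpy sketch of im2col lowering for this single-channel 7 × 7 example (the function name is ours; a real implementation would hand the lowered matrix to a BLAS gemm):

```python
import numpy as np

def im2col(I, k_h, k_w, s_h=1, s_w=1):
    """Lower input I into the (o_h*o_w) x (k_h*k_w) Toeplitz-style matrix."""
    i_h, i_w = I.shape
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1
    L = np.empty((o_h * o_w, k_h * k_w), dtype=I.dtype)
    for r in range(o_h):
        for c in range(o_w):
            # One row of L per output position: the k_h x k_w patch, flattened.
            L[r * o_w + c] = I[r*s_h:r*s_h + k_h, c*s_w:c*s_w + k_w].ravel()
    return L

I = np.arange(49, dtype=np.float32).reshape(7, 7)
K = np.ones((3, 3), dtype=np.float32)
L = im2col(I, 3, 3)                 # 25 x 9, vs. the original 7 x 7 input
O = (L @ K.ravel()).reshape(5, 5)   # one big gemm, then reshape to the output
print(L.shape)  # (25, 9)
```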
3.2 Basic version of the MEC algorithm
MEC addresses the memory footprint of im2col. The MEC process is shown in Figure 2:
- In Figure 1, the im2col-lowered input matrix is 25 × 9, while MEC's lowered matrix is 5 × 21, about 53% smaller (105 vs. 225 elements), and the final output matches that of direct convolution exactly.
- MEC's lowered matrix is divided into 5 overlapping partitions P, Q, R, S, T, each of which is multiplied with the convolution kernel.
These multiplications are issued through the BLAS gemm interface using three conventions:
- First, the kernel K is interpreted as a (k_h · k_w) × 1 matrix (9 × 1 here).
- Second, each partition P, Q, R, S, T is specified only by a pointer to its initial element together with ld = i_h · k_w, where ld (the leading dimension) is the length of a full row of L (21 here).
- Finally, each row of O is produced by one of 5 separate gemm calls between P, Q, R, S, T and K. Although the number of gemm calls increases, the total number of mult/add operations is the same as for im2col-based convolution, so the computational complexity is unchanged.
MEC's lowering is efficient because fewer elements are moved from I into the smaller L.
Compared with im2col, the total amount of computation stays the same, but the memory overhead shrinks.
- The whole process is shown in Algorithm 1.
- The loop at line 4 copies contiguous k_w-wide slabs of the input matrix into the intermediate matrix L; all of these copies can be performed in parallel.
- The loop at line 10 computes the output matrix; each matrix multiplication is performed by one gemm call, and these gemm calls can be executed in parallel.
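Putting the lowering and the per-row gemm calls together, here is a minimal numpy sketch of the basic single-channel MEC algorithm (variable names are ours; the column-offset slice below plays the role of the pointer-plus-ld partition in the actual BLAS call):

```python
import numpy as np

def mec_conv(I, K, s_h=1, s_w=1):
    """Basic (single-channel) MEC: lower I to an o_w x (i_h*k_w) matrix,
    then compute each output row with one small gemm."""
    i_h, i_w = I.shape
    k_h, k_w = K.shape
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1

    # Lowering (line-4 loop): row j of L is the i_h x k_w vertical slab of I
    # starting at column j*s_w, flattened row-major. All copies are independent.
    L = np.empty((o_w, i_h * k_w), dtype=I.dtype)
    for j in range(o_w):
        L[j] = I[:, j*s_w:j*s_w + k_w].ravel()

    # Output (line-10 loop): o_h separate gemm calls. The partition for output
    # row i is just a column offset into L (pointer + leading dimension in BLAS).
    O = np.empty((o_h, o_w), dtype=I.dtype)
    k_vec = K.ravel()  # kernel viewed as a (k_h*k_w) x 1 matrix
    for i in range(o_h):
        P_i = L[:, i*s_h*k_w : i*s_h*k_w + k_h*k_w]  # o_w x (k_h*k_w) partition
        O[i] = P_i @ k_vec
    return O, L.shape

I = np.arange(49, dtype=np.float32).reshape(7, 7)
K = np.arange(9, dtype=np.float32).reshape(3, 3)
O, lowered_shape = mec_conv(I, K)
print(lowered_shape)  # (5, 21) -- vs. 25 x 9 for im2col
```

Note that the five partitions overlap inside L, which is why no extra copying is needed beyond the initial lowering.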
3.3 Advanced version of the MEC algorithm
To handle multiple channels and mini-batches, Algorithm 1 is extended to Algorithm 2.
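A hedged sketch of what this extension might look like, assuming NCHW inputs and a set of output kernels (all names are ours): each sample is lowered once with all channels side by side, so one gemm per output row covers every input channel. The real Algorithm 2 additionally fuses and reorders these loops for parallelism; this simplified version only shows the batching structure.

```python
import numpy as np

def mec_conv_nchw(I, K, s_h=1, s_w=1):
    """Simplified multi-channel / mini-batch MEC sketch.
    I: (n, c, i_h, i_w) inputs; K: (k_out, c, k_h, k_w) kernels."""
    n, c, i_h, i_w = I.shape
    k_out, _, k_h, k_w = K.shape
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1

    # All kernels viewed as one (c*k_h*k_w) x k_out matrix.
    W = K.reshape(k_out, c * k_h * k_w).T
    O = np.empty((n, k_out, o_h, o_w), dtype=I.dtype)
    for b in range(n):
        # Lowered tensor for one sample: o_w rows, each holding the
        # c x i_h x k_w slab starting at column j*s_w.
        L = np.empty((o_w, c, i_h, k_w), dtype=I.dtype)
        for j in range(o_w):
            L[j] = I[b, :, :, j*s_w:j*s_w + k_w]
        for i in range(o_h):
            # Partition for output row i: all channels of the k_h patch rows.
            P = L[:, :, i*s_h:i*s_h + k_h, :].reshape(o_w, c * k_h * k_w)
            O[b, :, i, :] = (P @ W).T  # one gemm per output row, all kernels
    return O
```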
4 Experimental Results
Experimental setup
- All experiments use 32-bit single precision.
- MEC is implemented in C++ using multithreaded OpenBLAS, OpenMP, and cuBLAS on CPU/GPU.
- im2col-based convolution uses the same libraries on CPU/GPU.
- Other open-source C++ convolution libraries, such as open-source FFT-based and Winograd-based convolution, are compared in terms of memory overhead and performance.
Convolution algorithms compared
- Conv.cpu: im2col-based convolution using OpenBLAS/OpenMP
- Conv.gpu: im2col-based convolution using cuBLAS
- Wino.cpu: Winograd F(2 × 2, 3 × 3) convolution (applicable only to 3 × 3 kernels)
- Wino.gpu: same as above, but on GPU
- FFT.gpu: convolution on GPU using the cuFFT library
- MEC.cpu: MEC-based convolution using OpenBLAS/OpenMP
- MEC.gpu: MEC-based convolution using cuBLAS
The benchmark contains 12 convolution layers.
Experimental platforms
- Mobile: an Android phone with an ARMv7 CPU (Qualcomm MSM8960), mini-batch size = 1
- Server: a Linux server with an Intel CPU (E5-2680) and an Nvidia GPU (P100) for inference and training, mini-batch size = 32
Experimental results
(a) MEC.cpu beats Conv.cpu in both performance and memory usage on the first convolution layer on Server-CPU; as the ratio k/s grows, both the memory overhead and the runtime improve.
(b) MEC.cpu has the lowest memory usage.
(c)(d) MEC.cpu runs fastest.
(e) MEC.gpu requires the least memory; due to memory limitations, Wino.gpu could only be tested on layers cv6 to cv12.
(f) MEC.gpu is fastest.
- Overall, MEC.cpu reduces the memory overhead to roughly one third and improves the running time by about 20%.