[paper] MEC: Memory-efficient Convolution for Deep Neural Network
2022-07-20 08:47:00 【one or only】
1 Introduction
The paper observes that direct convolution is simple but performs poorly, while im2col-, FFT-, and Winograd-based convolution methods carry large memory overhead, which degrades performance and makes it hard to balance speed against memory consumption. It proposes memory-efficient convolution (MEC).
MEC reduces memory overhead and accelerates the convolution computation. It uses a simple, novel scheme to lower the input matrix in a highly compact way, while still relying on highly optimized libraries such as BLAS for fast matrix multiplication, and it performs many small matrix multiplications in parallel. By shrinking the lowered input matrix, MEC improves the efficiency of the memory subsystem (i.e., better cache locality), so it accelerates convolution without any loss of accuracy.
Experiments on CPU/GPU across mobile and server platforms show that MEC is a very efficient algorithm, applicable to a variety of memory-constrained systems.
2 Preliminaries
Notation
- Tensors and matrices are written in C-style row-major order; some math libraries, such as cuBLAS, use column-major order. The text keeps the row-major notation but treats all matrices as transposed when such libraries are used.
- s_h, s_w denote the vertical and horizontal strides of the convolution kernel.
- lowered matrix L: the intermediate representation obtained by converting the input into a matrix.
- For ease of exposition, assume any zero-padding has already been applied to the input I. The output matrix O then has dimensions o_h × o_w, with o_h = (i_h − k_h)/s_h + 1 and o_w = (i_w − k_w)/s_w + 1.
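As a quick sanity check of these dimensions, here is a minimal Python sketch (the function name is ours, not from the paper):

```python
def conv_output_dims(i_h, i_w, k_h, k_w, s_h=1, s_w=1):
    """Output height/width of convolving a pre-padded i_h x i_w input
    with a k_h x k_w kernel at strides (s_h, s_w)."""
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1
    return o_h, o_w

# Running example from the paper: 7x7 input, 3x3 kernel, stride 1 -> 5x5 output.
print(conv_output_dims(7, 7, 3, 3))  # (5, 5)
```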
Related work
- im2col-based convolution: converts the input matrix into a Toeplitz-style matrix, then uses a highly optimized linear algebra library (BLAS) for fast matrix multiplication.
- FFT-based convolution: convolution can be computed simply in the frequency domain, but every kernel must be padded to the same size as the input, which increases memory overhead; the smaller the kernel is relative to the input, the larger the overhead.
- Winograd-based convolution: based on the Coppersmith-Winograd algorithm, which shows how to reduce the number of multiplications at the cost of more additions and a large number of intermediate products. Prior work reports that Winograd-based convolution is effective for small kernels on GPU.
3 The MEC Algorithm
- For a given memory budget, MEC allows larger models to be trained or used for inference.
- MEC allows a larger mini-batch size during training, reducing per-iteration latency.
- MEC speeds up computation by improving the efficiency of the memory subsystem (e.g., more cache hits).
3.1 Motivation
- Direct convolution is simple and direct, with no memory overhead, but is computationally inefficient.
- im2col-based convolution turns convolution into a matrix multiplication, so highly optimized BLAS libraries can be used and computation is efficient; however, memory consumption grows, because the lowered input matrix is larger than the original input.
The im2col lowered matrix L has dimensions (o_h · o_w) × (k_h · k_w) (25 × 9 in this article's example), while the original input matrix is only 7 × 7.
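A minimal numpy sketch of im2col lowering for this single-channel 7 × 7 example (the function name is ours; a real implementation would hand the lowered matrix to a BLAS gemm):

```python
import numpy as np

def im2col(I, k_h, k_w, s_h=1, s_w=1):
    """Lower input I into the (o_h*o_w) x (k_h*k_w) Toeplitz-style matrix."""
    i_h, i_w = I.shape
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1
    L = np.empty((o_h * o_w, k_h * k_w), dtype=I.dtype)
    for r in range(o_h):
        for c in range(o_w):
            # One row of L per output position: the k_h x k_w patch, flattened.
            L[r * o_w + c] = I[r*s_h:r*s_h + k_h, c*s_w:c*s_w + k_w].ravel()
    return L

I = np.arange(49, dtype=np.float32).reshape(7, 7)
K = np.ones((3, 3), dtype=np.float32)
L = im2col(I, 3, 3)                 # 25 x 9, vs. the original 7 x 7 input
O = (L @ K.ravel()).reshape(5, 5)   # one big gemm, then reshape to the output
print(L.shape)  # (25, 9)
```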
3.2 Basic version of the MEC algorithm
MEC addresses the memory footprint of im2col. The MEC process is shown in Figure 2:
- In Figure 1, the im2col-lowered input matrix is 25 × 9, while MEC's lowered matrix is 5 × 21, about 53% smaller (105 vs. 225 elements), and the final output matches that of direct convolution exactly.
- MEC's lowered matrix is divided into 5 overlapping partitions P, Q, R, S, T, each of which is multiplied with the convolution kernel.
These multiplications are issued through the BLAS gemm interface using three conventions:
- First, the kernel K is interpreted as a (k_h · k_w) × 1 matrix (9 × 1 here).
- Second, each partition P, Q, R, S, T is specified only by a pointer to its initial element together with ld = i_h · k_w, where ld (the leading dimension) is the length of a full row of L (21 here).
- Finally, each row of O is produced by one of 5 separate gemm calls between P, Q, R, S, T and K. Although the number of gemm calls increases, the total number of mult/add operations is the same as for im2col-based convolution, so the computational complexity is unchanged.
MEC's lowering is efficient because fewer elements are moved from I into the smaller L.
Compared with im2col, the total amount of computation stays the same, but the memory overhead shrinks.
- The whole process is shown in Algorithm 1.
- The loop at line 4 copies contiguous k_w-wide slabs of the input matrix into the intermediate matrix L; all of these copies can be performed in parallel.
- The loop at line 10 computes the output matrix; each matrix multiplication is performed by one gemm call, and these gemm calls can be executed in parallel.
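Putting the lowering and the per-row gemm calls together, here is a minimal numpy sketch of the basic single-channel MEC algorithm (variable names are ours; the column-offset slice below plays the role of the pointer-plus-ld partition in the actual BLAS call):

```python
import numpy as np

def mec_conv(I, K, s_h=1, s_w=1):
    """Basic (single-channel) MEC: lower I to an o_w x (i_h*k_w) matrix,
    then compute each output row with one small gemm."""
    i_h, i_w = I.shape
    k_h, k_w = K.shape
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1

    # Lowering (line-4 loop): row j of L is the i_h x k_w vertical slab of I
    # starting at column j*s_w, flattened row-major. All copies are independent.
    L = np.empty((o_w, i_h * k_w), dtype=I.dtype)
    for j in range(o_w):
        L[j] = I[:, j*s_w:j*s_w + k_w].ravel()

    # Output (line-10 loop): o_h separate gemm calls. The partition for output
    # row i is just a column offset into L (pointer + leading dimension in BLAS).
    O = np.empty((o_h, o_w), dtype=I.dtype)
    k_vec = K.ravel()  # kernel viewed as a (k_h*k_w) x 1 matrix
    for i in range(o_h):
        P_i = L[:, i*s_h*k_w : i*s_h*k_w + k_h*k_w]  # o_w x (k_h*k_w) partition
        O[i] = P_i @ k_vec
    return O, L.shape

I = np.arange(49, dtype=np.float32).reshape(7, 7)
K = np.arange(9, dtype=np.float32).reshape(3, 3)
O, lowered_shape = mec_conv(I, K)
print(lowered_shape)  # (5, 21) -- vs. 25 x 9 for im2col
```

Note that the five partitions overlap inside L, which is why no extra copying is needed beyond the initial lowering.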
3.3 Advanced version of the MEC algorithm
To handle multiple channels and mini-batches, Algorithm 1 is extended to Algorithm 2.
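A hedged sketch of what this extension might look like, assuming NCHW inputs and a set of output kernels (all names are ours): each sample is lowered once with all channels side by side, so one gemm per output row covers every input channel. The real Algorithm 2 additionally fuses and reorders these loops for parallelism; this simplified version only shows the batching structure.

```python
import numpy as np

def mec_conv_nchw(I, K, s_h=1, s_w=1):
    """Simplified multi-channel / mini-batch MEC sketch.
    I: (n, c, i_h, i_w) inputs; K: (k_out, c, k_h, k_w) kernels."""
    n, c, i_h, i_w = I.shape
    k_out, _, k_h, k_w = K.shape
    o_h = (i_h - k_h) // s_h + 1
    o_w = (i_w - k_w) // s_w + 1

    # All kernels viewed as one (c*k_h*k_w) x k_out matrix.
    W = K.reshape(k_out, c * k_h * k_w).T
    O = np.empty((n, k_out, o_h, o_w), dtype=I.dtype)
    for b in range(n):
        # Lowered tensor for one sample: o_w rows, each holding the
        # c x i_h x k_w slab starting at column j*s_w.
        L = np.empty((o_w, c, i_h, k_w), dtype=I.dtype)
        for j in range(o_w):
            L[j] = I[b, :, :, j*s_w:j*s_w + k_w]
        for i in range(o_h):
            # Partition for output row i: all channels of the k_h patch rows.
            P = L[:, :, i*s_h:i*s_h + k_h, :].reshape(o_w, c * k_h * k_w)
            O[b, :, i, :] = (P @ W).T  # one gemm per output row, all kernels
    return O
```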
4 Experimental Results
Experimental setup
- All experiments use 32-bit single precision.
- MEC is implemented in C++ using multithreaded OpenBLAS, OpenMP, and cuBLAS on CPU/GPU.
- im2col-based convolution uses the same libraries on CPU/GPU.
- Other open-source C++ convolution libraries, such as open-source FFT-based and Winograd-based convolution, are compared in terms of memory overhead and performance.
Convolution algorithms compared
- Conv.cpu: im2col-based convolution using OpenBLAS/OpenMP
- Conv.gpu: im2col-based convolution using cuBLAS
- Wino.cpu: Winograd F(2 × 2, 3 × 3) convolution (applicable only to 3 × 3 kernels)
- Wino.gpu: same as above, but on GPU
- FFT.gpu: convolution on GPU using the cuFFT library
- MEC.cpu: MEC-based convolution using OpenBLAS/OpenMP
- MEC.gpu: MEC-based convolution using cuBLAS
The benchmark contains 12 convolution layers.
Experimental platforms
- Mobile: an Android phone with an ARMv7 CPU (Qualcomm MSM8960), mini-batch size = 1
- Server: a Linux server with an Intel CPU (E5-2680) and an Nvidia GPU (P100) for inference and training, mini-batch size = 32
Experimental results
(a) MEC.cpu beats Conv.cpu in both performance and memory usage on the first convolution layer on Server-CPU; as the ratio k/s grows, both the memory overhead and the runtime improve.
(b) MEC.cpu has the lowest memory usage.
(c)(d) MEC.cpu runs fastest.
(e) MEC.gpu requires the least memory; due to memory limitations, Wino.gpu could only be tested on layers cv6 to cv12.
(f) MEC.gpu is fastest.
- Overall, MEC.cpu reduces the memory overhead to roughly one third and improves the running time by about 20%.