OREPA: Alibaba proposes a fast-training re-parameterization strategy that halves memory and doubles speed | CVPR 2022
2022-07-21 18:43:00 【Xiaofei's Algorithm Engineering Notes】
This paper proposes an online re-parameterization method, OREPA, which folds the complex training-time structure into a single convolution during training, greatly reducing the training overhead. To achieve this, the paper replaces the training-time BN layers with linear scaling layers, which preserves the diversity of optimization directions and the feature representation capacity. Experimental results show that OREPA delivers strong accuracy and efficiency on a variety of tasks.
Source: WeChat official account "Xiaofei's Algorithm Engineering Notes"
Paper: Online Convolutional Re-parameterization
- Paper address: https://arxiv.org/abs/2204.00826
- Code: https://github.com/JUGGHM/OREPA_CVPR2022
Introduction
Besides accuracy, inference speed also matters. To obtain deployment-friendly yet accurate models, many recent works improve performance through structural re-parameterization. A structurally re-parameterized model uses different structures for training and inference: a complex structure is trained for high accuracy and, after training, is compressed by equivalent transformations into a linear layer that infers quickly. The compressed model usually has a plain architecture, such as a VGG-like or ResNet-like structure. From this perspective, re-parameterization improves model performance without introducing extra inference cost. This official account previously published an interpretation of RepVGG, 《RepVGG: VGG, the eternal god! | New in 2021》, which interested readers can check out.
The BN layer is a key component of re-parameterized models: a BN layer is added after every convolution layer, as shown in Figure 1b, and removing the BN layers causes a severe accuracy drop. At inference time the complex structure can be compressed into a single convolution layer. During training, however, the BN layer nonlinearly divides the feature map by its standard deviation, so each branch must be computed separately. This leads to a large amount of intermediate computation (high FLOPs) and buffered feature maps (high memory usage), i.e., a huge training overhead. Worse still, the high training cost hinders the exploration of more complex and potentially more powerful re-parameterized structures.
Why is the BN layer so important to re-parameterization? Through experiments and analysis, the paper finds that the scaling factors in the BN layers diversify the optimization directions of the different branches. Based on this finding, the paper proposes the online re-parameterization method OREPA, shown in Figure 1c, which consists of two steps:
- Block linearization: remove all nonlinear normalization layers and introduce linear scaling layers instead. Like BN, the linear scaling layers diversify the optimization directions of the branches, but they can also be merged during training.
- Block squeezing: collapse the resulting complex linear structure into a single convolution layer.
OREPA removes the computation and memory overhead caused by the intermediate layers, significantly reducing training cost (65%-75% memory savings and a 1.5-2.3x speedup) with little impact on performance, which makes it feasible to explore more complex re-parameterized structures. To verify this, the paper further proposes several re-parameterized components that achieve better performance.
The contributions of this paper include the following three points:
- An online re-parameterization method, OREPA, is proposed, which greatly improves the training efficiency of re-parameterized models and makes it possible to explore stronger re-parameterized structures.
- Based on an analysis of how re-parameterized models work, the BN layers are replaced with linear scaling layers, which maintains the diversity of optimization directions and the feature representation capacity.
- Experiments on various vision tasks show that OREPA outperforms previous re-parameterized models in both accuracy and training efficiency.
Online Re-Parameterization
OREPA simplifies the complex training-time structure into a single convolution while keeping accuracy unchanged. The transformation process of OREPA is shown in Figure 2 and consists of two steps: block linearization and block squeezing.
Preliminaries: Normalization in Re-param
The BN layer is the key structure of the multi-layer, multi-branch blocks used in re-parameterization and is the basis of their performance. Taking DBB and RepVGG as examples, removing the per-branch BN layers (and instead applying a single BN after the branches are merged) causes a significant performance drop, as shown in Table 1.
More surprisingly, the BN layers also bring excessive training cost. At inference, all intermediate operations in the re-parameterized structure are linear and can be merged. During training, however, the BN layer is nonlinear (it divides the feature map by its standard deviation) and cannot be merged. Without merging, the intermediate operations have to be computed separately, which generates huge computation and memory costs. Moreover, the high cost also hinders the exploration of more complex structures.
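To make the inference-time merging concrete, here is a minimal PyTorch sketch of folding a BN layer into the preceding convolution (a standard transformation sketched under the usual assumptions, not code from the paper; the helper name fuse_conv_bn is mine):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold y = BN(conv(x)) into one conv. Valid at inference only,
    when BN uses its fixed running statistics."""
    std = torch.sqrt(bn.running_var + bn.eps)        # per-channel std
    scale = bn.weight / std                          # gamma / std
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride,
                      conv.padding, conv.dilation,
                      conv.groups, bias=True)
    # W' = W * gamma / std (broadcast over input channels and kernel dims)
    fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    # b' = (b - running_mean) * gamma / std + beta
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

# quick check in eval mode
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.eval()
x = torch.randn(2, 8, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))
```

During training, the batch statistics change every step, so this fold cannot be applied and every branch has to keep its own intermediate feature map, which is exactly the overhead OREPA attacks.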
Block Linearization
Although BN prevents merging during training, it cannot simply be removed because of the accuracy drop. To solve this, the paper introduces a channel-wise linear scaling layer as a linear replacement for BN, which scales the feature map with a learnable vector. The linear scaling layer has an effect similar to BN: it guides the branches to optimize in different directions, which is the core of re-parameterization performance.
Based on the linear scaling layer, the re-parameterized structure is modified as shown in Figure 3, in three steps:
- Remove all nonlinear layers, i.e., the normalization layers in the re-parameterized structure.
- To keep the optimization diverse, add a scaling layer, the linear replacement of BN, at the end of each branch.
- To stabilize training, add a single BN layer after the sum of all branches.
After block linearization, only linear layers remain in the re-parameterized structure, which means every component of the structure can be merged during training.
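As an illustration (my own sketch of the idea, not the authors' implementation; the class names and the choice of branches are assumptions), a block-linearized module could look like this:

```python
import torch
import torch.nn as nn

class ScaledConvBranch(nn.Module):
    """A conv branch whose BN is replaced by a learnable channel-wise scale."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.scale = nn.Parameter(torch.ones(out_ch))  # linear replacement of BN

    def forward(self, x):
        return self.conv(x) * self.scale.view(1, -1, 1, 1)

class LinearizedBlock(nn.Module):
    """Parallel linear branches plus one post-BN, following the three steps above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            ScaledConvBranch(in_ch, out_ch, 3),   # 3x3 branch
            ScaledConvBranch(in_ch, out_ch, 1),   # 1x1 branch
        ])
        self.post_bn = nn.BatchNorm2d(out_ch)     # single BN to stabilize training

    def forward(self, x):
        out = sum(branch(x) for branch in self.branches)
        return self.post_bn(out)

y = LinearizedBlock(16, 32)(torch.randn(2, 16, 8, 8))
print(y.shape)  # torch.Size([2, 32, 8, 8])
```

In OREPA the linear branches would additionally be squeezed into one convolution at every training step; this sketch only shows the linearization.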
Block Squeezing
Block squeezing converts the compute- and memory-intensive operations on intermediate feature maps into operations on a single, much cheaper convolution kernel. This reduces the extra training cost of re-parameterization from $O(H\times W)$ to $O(K_H\times K_W)$, where $(K_H, K_W)$ is the shape of the convolution kernel.
In general, no matter how complex the linear re-parameterized structure is, the following two properties always hold:
- Every linear layer in the re-parameterized structure (e.g., depthwise convolution, average pooling, and the proposed linear scaling) can be represented by a convolution layer with corresponding parameters; see the appendix of the original paper for the proof.
- The re-parameterized structure can be expressed as a set of parallel branches, each of which is a chain of convolutions.
With these two properties, both multi-layer (i.e., sequential) and multi-branch (i.e., parallel) structures can be compressed into a single convolution, as shown in Figures 4a and 4b. The original paper proves each transformation with formulas; interested readers can refer to the corresponding sections, but the proofs are not needed to understand the idea of block squeezing.
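As a concrete sketch of the two squeezing rules (my own toy example under simplifying assumptions: stride 1, no bias, channel-wise branch scales as introduced in block linearization), the code below folds a 1x1-then-3x3 sequential branch into one 3x3 kernel and then sums the scaled parallel branches into a single 3x3 kernel:

```python
import torch
import torch.nn.functional as F

def merge_1x1_kxk(w1x1, wkxk):
    """Sequential squeeze: a 1x1 conv (C_mid <- C_in) followed by a kxk conv
    (C_out <- C_mid) equals one kxk conv (C_out <- C_in)."""
    # merged[o, i, h, w] = sum_c wkxk[o, c, h, w] * w1x1[c, i]
    return torch.einsum('ochw,ci->oihw', wkxk, w1x1[:, :, 0, 0])

def pad_1x1_to_kxk(w1x1, k=3):
    """Represent a 1x1 conv as a kxk conv (1x1 kernel embedded at the centre)."""
    p = (k - 1) // 2
    return F.pad(w1x1, [p, p, p, p])

def merge_parallel(kernels, scales):
    """Parallel squeeze: sum the branch kernels, each weighted by its
    channel-wise scale (one scale per output channel)."""
    return sum(s.view(-1, 1, 1, 1) * w for w, s in zip(kernels, scales))

# toy check: three scaled branches (3x3, 1x1->3x3, plain 1x1) vs. one 3x3 conv
x = torch.randn(1, 8, 14, 14)
w3 = torch.randn(16, 8, 3, 3)
wa, wb = torch.randn(4, 8, 1, 1), torch.randn(16, 4, 3, 3)   # 1x1 then 3x3
w1 = torch.randn(16, 8, 1, 1)
s = [torch.rand(16) for _ in range(3)]

ref = s[0].view(1, -1, 1, 1) * F.conv2d(x, w3, padding=1) \
    + s[1].view(1, -1, 1, 1) * F.conv2d(F.conv2d(x, wa), wb, padding=1) \
    + s[2].view(1, -1, 1, 1) * F.conv2d(x, w1)

w_eq = merge_parallel([w3, merge_1x1_kxk(wa, wb), pad_1x1_to_kxk(w1)], s)
print(torch.allclose(ref, F.conv2d(x, w_eq, padding=1), atol=1e-4))
```

The key point is that only the kernels are manipulated; the large intermediate feature maps of each branch are never materialized, which is where the training-time savings come from.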
Gradient Analysis on Multi-branch Topology
The paper analyzes the role of the multi-branch topology and of block linearization from the perspective of gradient back-propagation. The derivation involves some formulas, which interested readers can find in the corresponding section of the paper. There are two main conclusions:
- If the branches share the same block-linearization scaling, the multi-branch structure has the same optimization direction and magnitude as a single branch.
- If each branch uses an independent block-linearization scaling, the optimization direction and magnitude of the multi-branch structure differ from those of a single branch.
These conclusions show the importance of the block-linearization step: after the BN layers are removed, the scaling layers keep the optimization directions diverse and prevent the multiple branches from degenerating into a single branch.
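A simplified single-layer illustration of these conclusions (my own shorthand, treating the branches as plain linear maps; this is not the paper's derivation):

```latex
y = \gamma_1 W_1 x + \gamma_2 W_2 x, \qquad
\frac{\partial L}{\partial W_i} = \gamma_i \, \frac{\partial L}{\partial y}\, x^{\top}.
```

If $\gamma_1=\gamma_2$, both branches receive identical gradients and the block effectively degenerates into a single branch with weight $W_1+W_2$; independently learned scales keep the branch updates distinct.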
Block Design
Because OREPA saves a large amount of training cost, it opens up the possibility of exploring more complex training structures. Based on DBB, the paper designs a new re-parameterized block and builds OREPA-ResNet, adding the following components:
- Frequency prior filter: FcaNet points out that the pooling layer is a special case of frequency-domain filtering; following this work, a 1x1 convolution + frequency-domain filtering branch is added.
- Linear depthwise separable convolution: the depthwise separable convolution is slightly modified by removing the intermediate nonlinear activation so that it can be merged during training (a merge sketch is given after this list).
- Re-parameterization for 1x1 convolution: previous studies focus on re-parameterizing 3x3 convolution layers and ignore 1x1 convolutions, which are nevertheless important in bottleneck structures. The paper therefore adds an extra 1x1 convolution + 1x1 convolution branch to re-parameterize the 1x1 convolutions as well.
- Linear deep stem: networks generally use a 7x7 convolution + 3x3 convolution as the stem, and some networks achieve better accuracy by replacing it with a stack of three 3x3 convolutions. However, the paper argues that such a stacked design spends a lot of computation on the high-resolution feature maps at the very beginning. It therefore compresses the three 3x3 convolutions, together with the proposed linear scaling layers, into a single 7x7 convolution layer, which greatly reduces computation while maintaining accuracy.
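For the linear depthwise separable convolution, the merge is straightforward once the intermediate activation is gone. The sketch below (my own illustration under the assumptions of stride 1 and no bias, not the paper's code) folds a depthwise 3x3 convolution followed by a pointwise 1x1 convolution into one ordinary 3x3 convolution:

```python
import torch
import torch.nn.functional as F

def merge_dw_pw(w_dw, w_pw):
    """Depthwise kxk conv (weight (C_in, 1, k, k)) followed by pointwise 1x1
    conv (weight (C_out, C_in, 1, 1)) equals one dense kxk conv
    (weight (C_out, C_in, k, k)), provided there is no nonlinearity between."""
    # merged[o, i, :, :] = w_pw[o, i] * w_dw[i, 0, :, :]
    return torch.einsum('oi,ihw->oihw', w_pw[:, :, 0, 0], w_dw[:, 0])

# toy check
x = torch.randn(1, 8, 14, 14)
w_dw = torch.randn(8, 1, 3, 3)      # depthwise: one kernel per input channel
w_pw = torch.randn(16, 8, 1, 1)     # pointwise: mixes channels
ref = F.conv2d(F.conv2d(x, w_dw, padding=1, groups=8), w_pw)
out = F.conv2d(x, merge_dw_pw(w_dw, w_pw), padding=1)
print(torch.allclose(ref, out, atol=1e-4))
```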
The block design of OREPA-ResNet is shown in Figure 6; this appears to be a downsampling block, which is eventually merged into a single 3x3 convolution for training and inference.
Experiment
- Comparative experiments on each proposed component.
- Effect of the scaling layers on the similarity between the branches of each layer.
- Comparison of linear scaling strategies; channel-wise scaling works best.
- Training-time comparison between online and offline re-parameterization.
- Comparison with other re-parameterization strategies.
- Comparison on detection and segmentation tasks.
Conclusion
This paper proposes an online re-parameterization method, OREPA, which folds the complex training-time structure into a single convolution during training, greatly reducing the training overhead. To achieve this, the paper replaces the training-time BN layers with linear scaling layers, preserving the diversity of optimization directions and the feature representation capacity. Experimental results show that OREPA achieves strong accuracy and efficiency across various tasks.
If this article helped you, please give it a like or a "Looking"~
For more content, please follow the WeChat official account 【Xiaofei's Algorithm Engineering Notes】