What does the residual network solve, and why is it effective? (Abstract)
2022-07-22 04:07:00 【Gulu Gulu day】
Contents

1. Motivation: the "two dark clouds" over deep neural networks
2. Formal definition and implementation of residual networks
3. Why do residual networks work? Several explanations
a. From the perspective of information propagation
b. From the perspective of ensemble learning
c. From the perspective of shattered gradients
4. Residual structures in NLP
1. Motivation: the "two dark clouds" over deep neural networks
It is generally believed that a trained deep neural network abstracts data features layer by layer, finally extracting the features/representations needed for the task, so that a simple classifier (or other learner) on top can complete it. This is why deep learning is also called representation/feature learning.
Intuitively, with nonlinear activation functions, a deeper network has a larger hypothesis space and is therefore more likely to contain an optimal solution; but it is also harder to train. Besides overfitting, deeper networks are more prone to the gradient vanishing/explosion problem and to network degradation.
Gradient vanishing: during backpropagation, if each layer's activation derivative lies between 0 and 1, the chain rule multiplies these factors together, so in a network with many layers the gradient shrinks toward 0 as it approaches the input layers.
Gradient explosion: conversely, when the per-layer factors in the chain rule exceed 1, multiplying them across many layers makes the gradient extremely large.
Both gradient vanishing and gradient explosion make a model hard to converge, but today they are largely controlled by careful weight initialization and intermediate normalization layers (e.g., batch normalization), which make deep networks much easier to converge. A toy numeric sketch of both effects follows.
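As an illustration (a minimal sketch, not from the original post; the sigmoid, depths, and weight scales are arbitrary choices), the product of per-layer chain-rule factors shrinks toward 0 when each factor is below 1 and blows up when the factors exceed 1:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def chained_grad(depth: int, weight_scale: float) -> float:
    """|product of per-layer chain-rule factors sigma'(z) * w|."""
    grad = 1.0
    for _ in range(depth):
        z = rng.normal()                      # pre-activation at this layer
        w = weight_scale * rng.normal()       # a typical weight
        grad *= sigmoid(z) * (1.0 - sigmoid(z)) * w   # sigma'(z) <= 0.25
    return abs(grad)

for depth in (5, 20, 50):
    print(f"depth={depth:2d}  modest weights: {chained_grad(depth, 1.0):.2e}   "
          f"large weights: {chained_grad(depth, 20.0):.2e}")
```

With modest weights each factor is below 1 (the sigmoid's derivative is at most 0.25), so the backpropagated gradient decays roughly exponentially with depth; with large weights the factors exceed 1 and the gradient explodes.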
Network degradation: even when the network can converge, as depth increases the model's performance first rises to saturation and then drops rapidly. This degradation is not caused by overfitting, because even with the same number of training epochs, the training error of the degraded deeper network is higher than that of a shallower one. Overfitting, by contrast, is when a trained model performs much better on the training set than on the test set, mainly because the training set does not cover the data distribution well; collecting more training data is the most effective remedy. As shown in the figure:
2. Formal definition and implementation of residual networks
Let H(x) be the mapping a layer (or block) needs to fit; the residual approach splits H into two parts, H(x) = x + F(x), where x is the identity mapping and F(x) = H(x) − x is the residual function that the stacked layers actually fit.
A residual unit is implemented as a skip connection: the output of the unit's stacked layers is added directly to the unit's input, and the sum is then passed through the activation. As shown in the figure, and in the sketch below:
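A minimal sketch of such a block, assuming PyTorch (the post names no framework; the channel sizes and conv/BN layout are illustrative, loosely following the original ResNet design):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = relu(x + F(x)): the stacked layers fit the residual F,
    and the skip connection adds the input back before the final activation."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.f(x))   # skip connection, then activation

x = torch.randn(2, 64, 32, 32)
print(ResidualBlock(64)(x).shape)          # torch.Size([2, 64, 32, 32])
```

Note that the addition requires the input and F(x) to have the same shape; when they differ (e.g., when the number of channels changes), the original paper uses a 1x1 convolution on the skip path.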
Residual networks largely solve the degradation problem of deep neural networks and have achieved excellent results on image tasks such as ImageNet and CIFAR-10. With the same number of layers, a residual network also converges faster, and removing individual layers barely affects its performance.
3. Why do residual networks work? Several explanations
a. From the perspective of information propagation
The explanation given by the author, Kaiming He, is that after expanding the residual recursion, the input signal can propagate directly from any lower layer to any higher layer in the forward pass. The network thus contains a natural identity mapping, which to some extent resolves the degradation problem.
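Concretely, for residual units $x_{l+1} = x_l + F(x_l, W_l)$, the recursion unrolls as (these identities are from He et al.'s follow-up paper, *Identity Mappings in Deep Residual Networks*):

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i), \qquad \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)\right).$$

The additive 1 means the gradient of the loss $\mathcal{E}$ reaches every lower layer directly, without being multiplied through intervening weight layers, so it does not vanish even when the weighted term is small.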
b. From the perspective of ensemble learning
Andreas Veit et al. analyze the residual network from an ensemble-learning perspective: unrolling the residual network yields the figure below.
A residual network can then be seen as an ensemble of many paths, where different paths pass through different subsets of the network's layers. Veit et al. deleted some layers of a trained residual network, or swapped the order of some modules, and their experiments show that performance varies smoothly with the number of valid paths that remain. This suggests the unrolled paths have a degree of independence and redundancy, making the residual network behave like an ensemble model. They also showed that during training the gradient is contributed mainly by the relatively short paths. The unrolling is made concrete below.
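To make the unrolling concrete (the expansion below is the standard one; Veit et al. draw the same picture), a stack of three residual blocks $y_i = y_{i-1} + f_i(y_{i-1})$ expands into

$$y_3 = y_2 + f_3(y_2) = \big[y_1 + f_2(y_1)\big] + f_3\big(y_1 + f_2(y_1)\big) = \big[y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0))\big] + f_3\big(\cdots\big),$$

a sum in which each term either passes through or skips each block. An $n$-block residual network thus implicitly contains $2^n$ paths, and deleting a single layer removes only half of them, which is why performance degrades gracefully.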
c. From the perspective of shattered gradients
In a standard feedforward network, the gradient increasingly resembles white noise as depth grows. By visualizing gradients, Balduzzi et al. ("The Shattered Gradients Problem") found that gradients in shallow networks look like brown noise, while in deep plain networks they look like white noise: the correlation between gradients decays exponentially with depth, and the spatial structure of the gradient is gradually destroyed.
Why does gradient shattering matter? Many optimization methods assume that gradients at nearby points are similar, so shattered gradients greatly reduce their effectiveness; and if the gradient behaves like white noise, a single neuron's influence on the network output becomes very unstable. The authors show that residual networks slow the decay of gradient correlation from exponential in depth down to sublinear. A small experiment in this spirit follows.
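A small experiment in the spirit of Balduzzi et al.'s visualization (a sketch, not their code: the depths, widths, and correlation measure are our arbitrary choices, and exact numbers vary by initialization). It sweeps inputs along a line and compares how correlated the input-gradients at neighboring inputs are for a plain stack versus a residual stack:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
WIDTH = 32

def plain_net(depth: int) -> nn.Module:
    """A deep 'plain' feedforward stack of Linear+ReLU layers."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(WIDTH, WIDTH), nn.ReLU()]
    return nn.Sequential(*layers)

class ResBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(WIDTH, WIDTH), nn.ReLU(),
                               nn.Linear(WIDTH, WIDTH))
    def forward(self, x):
        return torch.relu(x + self.f(x))     # skip connection

def neighbor_grad_corr(net: nn.Module) -> float:
    """Mean cosine similarity between input-gradients at adjacent inputs."""
    t = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)       # smooth 1-D sweep
    x = (t * torch.ones(1, WIDTH)).requires_grad_(True)
    net(x).sum().backward()
    g = x.grad / (x.grad.norm(dim=1, keepdim=True) + 1e-8)
    return (g[:-1] * g[1:]).sum(dim=1).mean().item()

print("plain, 24 layers   :", neighbor_grad_corr(plain_net(24)))
print("residual, 12 blocks:", neighbor_grad_corr(
    nn.Sequential(*[ResBlock() for _ in range(12)])))
```

If the paper's analysis holds here, the plain stack should show noticeably lower neighbor correlation (more "shattered" gradients) than the residual stack.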
4. Residual structures in NLP
Residual connections are now used throughout deep learning; in NLP, for example, every Transformer layer wraps its attention and feed-forward sublayers in residual connections. A closely related architecture is the Highway Network, which also uses skip connections but adds a gating mechanism, as sketched below.
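A minimal sketch of a highway layer, again assuming PyTorch (the dimensions and the ReLU transform are illustrative; the gating formula y = T(x)·H(x) + (1−T(x))·x and the negative gate-bias initialization follow Srivastava et al.'s Highway Networks paper):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x: the learned gate T decides how much
    of the transformed signal H(x) versus the raw input x to pass through."""
    def __init__(self, dim: int):
        super().__init__()
        self.h = nn.Linear(dim, dim)            # transform
        self.t = nn.Linear(dim, dim)            # gate
        nn.init.constant_(self.t.bias, -2.0)    # bias the gate toward "carry" at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.h(x))
        t = torch.sigmoid(self.t(x))
        return t * h + (1.0 - t) * x

x = torch.randn(4, 128)
print(HighwayLayer(128)(x).shape)               # torch.Size([4, 128])
```

When the gate saturates at T(x) = 0 the layer reduces to an identity mapping, which is exactly the behavior the residual block hard-wires with its ungated skip connection.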