Difference between BN and LN
2022-07-21 02:19:00 【Little aer】
The difference between BN and LN
The main difference is the direction along which the normalization is computed!
Batch, as the name suggests, means operating over a batch. Suppose we have 10 rows by 3 columns of data, so batch size = 10, and each row has three features, say [height, weight, age]. BN scales each column (feature): for example, it computes the mean and variance of the [height] column and uses them to scale the 10 height values; weight and age are handled the same way. This is a kind of "column scaling".
Layer normalization goes in the opposite direction: it scales each row. That is, it looks at a single sample, computes the mean and variance over all of its features, and then scales them. This is a kind of "row scaling".
A careful reader will have noticed that layer normalization scales across all features together, which seems unreasonable here: computing one mean and variance over [height, weight, age] and scaling with them mixes features of very different units, so the result is strongly distorted by those differing scales. BN does not have this problem, because it scales within a single column, where every value shares the same unit.
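The two directions can be written out in a few lines. Below is a minimal NumPy sketch of the 10×3 [height, weight, age] setup described above; the concrete numbers, the eps value, and the variable names are illustrative assumptions, and the learnable gain/bias of real BN/LN layers is omitted.

```python
# A minimal NumPy sketch of the two normalization directions; the 10x3
# [height, weight, age] matrix below is made-up, purely illustrative data.
import numpy as np

x = np.array([
    [170.0, 60.0, 25.0],
    [165.0, 55.0, 30.0],
    [180.0, 80.0, 22.0],
    [175.0, 70.0, 28.0],
    [160.0, 50.0, 35.0],
    [172.0, 65.0, 27.0],
    [168.0, 58.0, 31.0],
    [185.0, 90.0, 24.0],
    [158.0, 48.0, 40.0],
    [177.0, 75.0, 26.0],
])  # shape (batch=10, features=3)

eps = 1e-5

# BN-style "column scaling": mean/var of each feature, computed across the batch
bn_out = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LN-style "row scaling": mean/var of each sample, computed across its features
ln_out = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn_out.mean(axis=0))  # ~0 for every column (feature)
print(ln_out.mean(axis=1))  # ~0 for every row, but height/weight/age were mixed together
```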
So why do we use LN at all? Because in the NLP field, LN is the better fit.
If we stack a batch of sentences into one batch, BN would normalize each word position across the batch dimension. But natural language is highly variable: almost any word can appear in the first position, sentences have different lengths, and the word order can change without changing our understanding of a sentence. Scaling position by position, as BN does, therefore does not match how language behaves.
LN, by contrast, scales within a single sentence, and it is usually applied over the third dimension, i.e. dims in [batchsize, seq_len, dims]. That dimension is typically the word-embedding dimension or the RNN output dimension, and its features all live on the same scale, so the problem of scaling features with different units does not arise.
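As a small PyTorch sketch of this, the snippet below applies nn.LayerNorm over the last dimension of a [batchsize, seq_len, dims] tensor; the tensor sizes are made-up and the random input merely stands in for word embeddings or RNN outputs.

```python
# A small PyTorch sketch: LayerNorm over the last (embedding) dimension of a
# [batch, seq_len, dims] tensor. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

batch, seq_len, dims = 4, 12, 64
x = torch.randn(batch, seq_len, dims)     # stand-in for word embeddings / RNN outputs

ln = nn.LayerNorm(dims)                   # normalizes over the last dimension only
y = ln(x)

# every length-`dims` vector (one per sample and position) now has ~0 mean and ~unit variance
print(y.mean(dim=-1).abs().max().item())  # close to 0
print(y.shape)                            # torch.Size([4, 12, 64])
```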
Problems solved by Batch Normalization (BN)
BN was born to overcome the training difficulties caused by deeper neural networks: as a network gets deeper, it becomes harder and harder to train, convergence slows down, and the vanishing gradient problem (Vanishing Gradient Problem) often appears.
Solution: in general, the training samples are corrected according to the ratio between the training and target distributions. Batch Normalization carries this idea into the network: it normalizes the inputs of some or all layers, thereby fixing the mean and variance of each layer's input.
Method: Batch Normalization is usually applied before the nonlinear mapping (the activation function). It normalizes x = Wu + b so that each dimension of the result (the output signal) has mean 0 and variance 1. Giving every layer an input with a stable distribution benefits training.
Generally speaking, the relative order of BN, the activation layer, and Dropout is: -> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->. BN tends to work better than Dropout, and Dropout is gradually falling out of use.
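A minimal PyTorch sketch of that ordering is shown below; the channel counts, kernel size, and dropout rate are arbitrary illustrative choices, not values from the text.

```python
# A minimal PyTorch sketch of the CONV/FC -> BatchNorm -> ReLU -> Dropout ordering.
# Channel counts, kernel size and dropout rate are arbitrary illustrative choices.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # CONV
    nn.BatchNorm2d(16),                           # BN, placed before the activation
    nn.ReLU(),                                    # activation
    nn.Dropout(p=0.1),                            # Dropout (if used at all)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # next CONV
)

x = torch.randn(8, 3, 32, 32)   # a toy batch of images
print(block(x).shape)           # torch.Size([8, 32, 32, 32])
```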
Advantages: by normalizing, Batch Normalization keeps activations in the roughly linear region of the activation function, which enlarges the gradients and lets the model take bolder gradient-descent steps. This brings the following benefits:
- a larger search step size, hence faster convergence;
- an easier escape from local minima;
- it perturbs the original data distribution, which alleviates overfitting to some extent.
Therefore, when a neural network converges extremely slowly, or gradients explode (Gradient Explosion) to the point that it cannot be trained, you can try Batch Normalization.
Defects of BN
The defects are as follows:
1. BN normalizes every dimension over the samples in a batch, so the larger the batch size, the more reliable the computed μ and σ; BN therefore depends heavily on the batch size.
2. During training the model is fed data batch by batch, but at prediction time there may be only one sample, or just a few, to run inference on; using BN then clearly introduces a large bias, for example in online-learning scenarios.
3. An RNN is a dynamic network: the effective size keeps changing as sequence lengths vary, so the dimensions of different samples cannot be aligned, which makes RNNs unsuitable for BN.
Advantages of LN:
1. Layer Normalization normalizes within each individual sample, so it has nothing to do with the batch size and is not affected by it (see the sketch after this list).
2. LN is likewise unaffected inside an RNN, since the normalization is internal to each sample; LN therefore has a wider range of application.
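The batch-size point can be seen directly in a short PyTorch sketch; the feature size below is an arbitrary assumption and the layers keep their default learnable parameters.

```python
# A short PyTorch sketch of the batch-size dependence discussed above.
# LayerNorm computes its statistics from each sample alone, so a batch of 1 at
# inference behaves the same as a large batch; BatchNorm in eval mode instead
# uses the running mean/var accumulated during training, which is where the
# inference-time bias comes from when those statistics do not match the data.
import torch
import torch.nn as nn

dims = 16                              # illustrative feature size
ln = nn.LayerNorm(dims)
bn = nn.BatchNorm1d(dims)

single = torch.randn(1, dims)          # a single sample, e.g. online inference

print(ln(single).mean().item())        # ~0: LN needs nothing beyond this one sample

bn.eval()                              # BN must fall back to stored running statistics
print(bn(single).mean().item())        # biased whenever those statistics mismatch the data
```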