Difference between BN and LN
2022-07-21 02:19:00 【Little aer】
The difference between BN and LN
The main difference is the direction along which the normalization is computed!
Batch, as the name suggests, means operating over a batch. Suppose we have 10 rows by 3 columns of data, so batch size = 10, and each row has three features, say [height, weight, age]. BN scales each column (feature): for example, it computes the mean and variance of the [height] column and uses them to scale the 10 height values; weight and age are handled the same way. This is a kind of "column scaling".
Layer normalization goes in the opposite direction: it scales each row. That is, it looks at a single sample, computes the mean and variance over all of its features, and then scales them. This is a kind of "row scaling".
A careful reader will have noticed that layer normalization scales across all features together, which seems unreasonable here: computing one mean and variance over [height, weight, age] and scaling with them mixes features of very different units, so the result is strongly distorted by those differing scales. BN does not have this problem, because it scales within a single column, where every value shares the same unit.
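The two directions can be written out in a few lines. Below is a minimal NumPy sketch of the 10×3 [height, weight, age] setup described above; the concrete numbers, the eps value, and the variable names are illustrative assumptions, and the learnable gain/bias of real BN/LN layers is omitted.

```python
# A minimal NumPy sketch of the two normalization directions; the 10x3
# [height, weight, age] matrix below is made-up, purely illustrative data.
import numpy as np

x = np.array([
    [170.0, 60.0, 25.0],
    [165.0, 55.0, 30.0],
    [180.0, 80.0, 22.0],
    [175.0, 70.0, 28.0],
    [160.0, 50.0, 35.0],
    [172.0, 65.0, 27.0],
    [168.0, 58.0, 31.0],
    [185.0, 90.0, 24.0],
    [158.0, 48.0, 40.0],
    [177.0, 75.0, 26.0],
])  # shape (batch=10, features=3)

eps = 1e-5

# BN-style "column scaling": mean/var of each feature, computed across the batch
bn_out = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LN-style "row scaling": mean/var of each sample, computed across its features
ln_out = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn_out.mean(axis=0))  # ~0 for every column (feature)
print(ln_out.mean(axis=1))  # ~0 for every row, but height/weight/age were mixed together
```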
So why do we use LN at all? Because in the NLP field, LN is the better fit.
If we stack a batch of sentences into one batch, BN would normalize each word position across the batch dimension. But natural language is highly variable: almost any word can appear in the first position, sentences have different lengths, and the word order can change without changing our understanding of a sentence. Scaling position by position, as BN does, therefore does not match how language behaves.
LN, by contrast, scales within a single sentence, and it is usually applied over the third dimension, i.e. dims in [batchsize, seq_len, dims]. That dimension is typically the word-embedding dimension or the RNN output dimension, and its features all live on the same scale, so the problem of scaling features with different units does not arise.
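As a small PyTorch sketch of this, the snippet below applies nn.LayerNorm over the last dimension of a [batchsize, seq_len, dims] tensor; the tensor sizes are made-up and the random input merely stands in for word embeddings or RNN outputs.

```python
# A small PyTorch sketch: LayerNorm over the last (embedding) dimension of a
# [batch, seq_len, dims] tensor. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

batch, seq_len, dims = 4, 12, 64
x = torch.randn(batch, seq_len, dims)     # stand-in for word embeddings / RNN outputs

ln = nn.LayerNorm(dims)                   # normalizes over the last dimension only
y = ln(x)

# every length-`dims` vector (one per sample and position) now has ~0 mean and ~unit variance
print(y.mean(dim=-1).abs().max().item())  # close to 0
print(y.shape)                            # torch.Size([4, 12, 64])
```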
Problems solved by Batch Normalization (BN)
BN was born to overcome the training difficulties caused by deeper neural networks: as a network gets deeper, it becomes harder and harder to train, convergence slows down, and the vanishing gradient problem (Vanishing Gradient Problem) often appears.
Solution: in general, the training samples are corrected according to the ratio between the training and target distributions. Batch Normalization carries this idea into the network: it normalizes the inputs of some or all layers, thereby fixing the mean and variance of each layer's input.
Method: Batch Normalization is usually applied before the nonlinear mapping (the activation function). It normalizes x = Wu + b so that each dimension of the result (the output signal) has mean 0 and variance 1. Giving every layer an input with a stable distribution benefits training.
Generally speaking, the relative order of BN, the activation layer, and Dropout is: -> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->. BN tends to work better than Dropout, and Dropout is gradually falling out of use.
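A minimal PyTorch sketch of that ordering is shown below; the channel counts, kernel size, and dropout rate are arbitrary illustrative choices, not values from the text.

```python
# A minimal PyTorch sketch of the CONV/FC -> BatchNorm -> ReLU -> Dropout ordering.
# Channel counts, kernel size and dropout rate are arbitrary illustrative choices.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # CONV
    nn.BatchNorm2d(16),                           # BN, placed before the activation
    nn.ReLU(),                                    # activation
    nn.Dropout(p=0.1),                            # Dropout (if used at all)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # next CONV
)

x = torch.randn(8, 3, 32, 32)   # a toy batch of images
print(block(x).shape)           # torch.Size([8, 32, 32, 32])
```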
Advantages: by normalizing, Batch Normalization keeps activations in the roughly linear region of the activation function, which enlarges the gradients and lets the model take bolder gradient-descent steps. This brings the following benefits:
- a larger search step size, hence faster convergence;
- an easier escape from local minima;
- it perturbs the original data distribution, which alleviates overfitting to some extent.
Therefore, when a neural network converges extremely slowly, or gradients explode (Gradient Explosion) to the point that it cannot be trained, you can try Batch Normalization.
Defects of BN
The defects are as follows:
1. BN normalizes every dimension over the samples in a batch, so the larger the batch size, the more reliable the computed μ and σ; BN therefore depends heavily on the batch size.
2. During training the model is fed data batch by batch, but at prediction time there may be only one sample, or just a few, to run inference on; using BN then clearly introduces a large bias, for example in online-learning scenarios.
3. An RNN is a dynamic network: the effective size keeps changing as sequence lengths vary, so the dimensions of different samples cannot be aligned, which makes RNNs unsuitable for BN.
Advantages of LN:
1. Layer Normalization normalizes within each individual sample, so it has nothing to do with the batch size and is not affected by it (see the sketch after this list).
2. LN is likewise unaffected inside an RNN, since the normalization is internal to each sample; LN therefore has a wider range of application.
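The batch-size point can be seen directly in a short PyTorch sketch; the feature size below is an arbitrary assumption and the layers keep their default learnable parameters.

```python
# A short PyTorch sketch of the batch-size dependence discussed above.
# LayerNorm computes its statistics from each sample alone, so a batch of 1 at
# inference behaves the same as a large batch; BatchNorm in eval mode instead
# uses the running mean/var accumulated during training, which is where the
# inference-time bias comes from when those statistics do not match the data.
import torch
import torch.nn as nn

dims = 16                              # illustrative feature size
ln = nn.LayerNorm(dims)
bn = nn.BatchNorm1d(dims)

single = torch.randn(1, dims)          # a single sample, e.g. online inference

print(ln(single).mean().item())        # ~0: LN needs nothing beyond this one sample

bn.eval()                              # BN must fall back to stored running statistics
print(bn(single).mean().item())        # biased whenever those statistics mismatch the data
```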