Batch gradient descent, random gradient descent and mini batch gradient descent
2022-07-21 02:19:00 【Little aer】
Gradient descent

Gradient descent is also known as batch gradient descent (Batch Gradient Descent). The approach: for one epoch (each epoch covers all samples), sum the losses of all samples and take the average as that epoch's loss value; then compute the gradient, backpropagate, and update the parameters.
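A minimal NumPy sketch of one epoch of batch gradient descent for a linear model; the function name `batch_gd_epoch`, the data `X`, `y`, the weights `w`, and the learning rate `lr` are illustrative assumptions, not from the original post.

```python
import numpy as np

def batch_gd_epoch(X, y, w, lr=0.01):
    """One epoch of batch gradient descent: the loss is averaged over
    ALL m samples, and the parameters are updated exactly once."""
    m = X.shape[0]
    preds = X @ w                      # predictions for every sample
    errors = preds - y
    loss = np.mean(errors ** 2)        # mean squared error over the whole epoch
    grad = (2.0 / m) * X.T @ errors    # gradient of the averaged loss
    w = w - lr * grad                  # a single parameter update per epoch
    return w, loss
```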
Stochastic gradient descent

Stochastic gradient descent (SGD; note that the meaning of "SGD" has shifted over time, and nowadays it usually no longer refers to per-sample stochastic gradient descent). The approach: within one epoch, compute the loss for each individual sample, then compute the gradient, backpropagate, and update the parameters after every sample.
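A minimal NumPy sketch of one epoch of per-sample stochastic gradient descent under the same assumed setup as above (`X`, `y`, `w`, `lr` are illustrative placeholders).

```python
import numpy as np

def sgd_epoch(X, y, w, lr=0.01):
    """One epoch of per-sample SGD: the loss and gradient are computed
    for each individual sample, and w is updated after every sample."""
    for xi, yi in zip(X, y):
        error = xi @ w - yi            # loss contribution of a single sample
        grad = 2.0 * error * xi        # gradient from that one sample only
        w = w - lr * grad              # update parameters immediately
    return w
```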
Mini-batch gradient descent

Mini-batch gradient descent is also commonly called SGD: the "SGD" you see in code today usually means mini-batch gradient descent rather than per-sample stochastic gradient descent. The approach: split all samples in one epoch into batches of batch_size. With batch_size = 128, for example, the loss is computed over those 128 samples and averaged; then the gradient is computed, backpropagated, and the parameters are updated once per batch.
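A minimal NumPy sketch of one epoch of mini-batch gradient descent under the same assumed setup; the shuffling step and the default `batch_size=128` are illustrative choices.

```python
import numpy as np

def minibatch_gd_epoch(X, y, w, lr=0.01, batch_size=128):
    """One epoch of mini-batch gradient descent: samples are split into
    batches of batch_size; the loss is averaged within each batch and
    the parameters are updated once per batch."""
    m = X.shape[0]
    idx = np.random.permutation(m)     # shuffle the epoch's samples
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        errors = Xb @ w - yb
        grad = (2.0 / len(batch)) * Xb.T @ errors  # gradient of the batch-averaged loss
        w = w - lr * grad              # one update per mini-batch
    return w
```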
Summary

Assume the total number of samples is m and batch_size = 128.
| | Stochastic gradient descent | Mini-batch gradient descent | Batch gradient descent |
|---|---|---|---|
| Samples per loss computation | 1 | batch_size (1 ≤ batch_size ≤ m) | m |
| Advantages | Parameters are updated much faster, because an update happens after the loss of every single sample. | Combines the advantages of stochastic gradient descent and batch gradient descent while avoiding their shortcomings. | The loss computed over all samples each epoch better reflects the classifier's performance on the whole training set, so the gradient direction better approximates the direction toward the global minimum. If the loss function is convex, batch gradient descent can find the global optimum. |
| Shortcomings | 1. The computation is large and cannot be parallelized: batch gradient descent can compute the loss with matrix operations in parallel, but SGD computes the gradient and updates the parameters after every single sample, which prevents effective parallel computation. 2. It easily falls into local optima and can hurt model accuracy: a single sample's loss cannot stand in for the global loss, so the computed gradient direction deviates from the globally optimal direction. With many samples the overall loss still stays low, but the loss curve fluctuates heavily. | | Every update requires the loss over all samples, so when the sample count is very large even parallel computation helps only so much, and each epoch yields just one parameter update (a single gradient-descent step), which is very inefficient. |
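To illustrate the point above that "SGD" in code usually means mini-batch gradient descent, here is a hedged PyTorch-style sketch: the optimizer is named `SGD`, but the batch size chosen for the `DataLoader` is what actually determines which of the three variants in the table you run. The random data and the tiny linear model are placeholders, not the original post's setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model, just to show the shape of the training loop.
X = torch.randn(1024, 10)
y = torch.randn(1024, 1)
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # optimizer is called "SGD"

# batch_size decides the variant:
#   batch_size=1    -> per-sample stochastic gradient descent
#   batch_size=128  -> mini-batch gradient descent
#   batch_size=1024 -> batch gradient descent (the whole epoch at once)
loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True)

for xb, yb in loader:                                 # one epoch
    loss = nn.functional.mse_loss(model(xb), yb)      # batch-averaged loss
    opt.zero_grad()
    loss.backward()                                   # backpropagate
    opt.step()                                        # one update per batch
```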