How the PyTorch training loop works: a source-code analysis of optimizer.zero_grad(), loss.backward() and optimizer.step()
2022-07-22 20:32:00 【Dull as dull】
How the common training-loop functions work
1 Overview
When training a model with PyTorch, we usually loop over epochs and, within each epoch, iterate over the whole training dataset, calling the three functions optimizer.zero_grad(), loss.backward() and optimizer.step() in turn, as shown below:
import torch
import torch.nn as nn

model = MyModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)

for epoch in range(1, epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        output = model(inputs)
        loss = criterion(output, labels)
        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
(The learning-rate scheduler lr_scheduler is not required here, so it is not covered for now; if you want to know more, see the article on dynamically adjusting the learning rate in PyTorch, with automatic decay based on the loss, mainly its section 6.)
In short, these three functions first zero the gradients (optimizer.zero_grad()), then compute the gradient of each parameter by back propagation (loss.backward()), and finally update the parameters with one step of gradient descent (optimizer.step()).
Next we walk through how these three functions are implemented in the source code. Before that, a brief explanation of the variables they commonly use:
- param_groups: when the Optimizer class is instantiated, it builds a param_groups list whose elements are param_group dictionaries; the number of param_group dictionaries equals the length of the param_groups list, which is what the variable num_groups refers to.
- param_group dictionary: for SGD this dictionary has 6 entries, namely the key-value pairs params, lr, momentum, dampening, weight_decay and nesterov.
- param_group['params']: the params key holds an iterable of model parameters; these parameters are passed in when the Optimizer class is instantiated, having been registered in the model's member attribute _parameters, and each one is a torch.nn.parameter.Parameter object.
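To make these variables concrete, here is a minimal sketch that inspects optimizer.param_groups; the tiny linear model is only a placeholder, and newer PyTorch releases may show a few extra keys beyond the six listed above:

import torch
import torch.nn as nn

# Placeholder model just for inspecting the optimizer's internals.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=1e-4)

print(len(optimizer.param_groups))   # num_groups: 1 group here
group = optimizer.param_groups[0]    # a param_group dictionary
print(sorted(group.keys()))          # params, lr, momentum, dampening, weight_decay, nesterov, ...
for p in group['params']:            # each p is a torch.nn.parameter.Parameter
    print(type(p), p.shape)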
2 optimizer.zero_grad()
The source code is as follows:
def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()
optimizer.zero_grad() traverses all parameters of the model; as described in the overview above, each parameter is a torch.nn.parameter.Parameter variable, i.e. the p in the loop. p.grad.detach_() detaches the gradient from the back-propagation graph, and p.grad.zero_() then sets the gradient of each parameter to 0, clearing the gradient record from the previous step.
Because training usually proceeds in mini-batches, the gradients must be cleared before calling backward(); if they are not, PyTorch accumulates the gradients computed in the previous pass with those computed in the current one.
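A minimal sketch of this accumulation behaviour, using a made-up tensor and function:

import torch

x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()

y.backward(retain_graph=True)
print(x.grad)      # tensor([2., 2., 2.])

# Calling backward() again without clearing: the new gradients are added on top.
y.backward()
print(x.grad)      # tensor([4., 4., 4.])

x.grad.zero_()     # what optimizer.zero_grad() does for every parameter
print(x.grad)      # tensor([0., 0., 0.])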
Benefit
When hardware limits prevent us from using a larger batch_size, the code can be arranged so that optimizer.zero_grad() is called only once every several batches; several passes with a smaller batch_size then take the place of one pass with a larger batch_size, which is convenient.
Drawback
If the gradients are cleared every time, each incoming batch of data yields exactly one gradient computation and one network update.
Summary
- Under normal circumstances, optimizer.zero_grad() is called once per batch to clear the parameters' gradients;
- Alternatively, optimizer.zero_grad() can be called once every several batches, which is roughly equivalent to increasing batch_size (see the sketch below).
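A sketch of that second pattern, assuming the same model, criterion, optimizer and train_loader as in the overview; accumulation_steps is a made-up value:

accumulation_steps = 4  # hypothetical: 4 small batches emulate one larger batch

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_loader):
    output = model(inputs)
    # Average the loss over the window so the gradient scale roughly matches
    # that of a single large batch.
    loss = criterion(output, labels) / accumulation_steps
    loss.backward()                       # gradients accumulate in p.grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                  # update with the accumulated gradients
        optimizer.zero_grad()             # clear them for the next window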
3 loss.backward()
Back propagation in PyTorch (that is, tensor.backward()) is implemented through the autograd package, which automatically computes the gradients corresponding to the mathematical operations performed on tensors.

Concretely, torch.Tensor is the core class of the autograd package: if a tensor's requires_grad attribute is set to True, every operation performed on that tensor starts being tracked.

Calling tensor.backward() then computes all the gradients automatically, and each tensor's gradient is accumulated into its .grad attribute. If tensor.backward() is never called, the gradient stays None, which is why loss.backward() must be written before optimizer.step().
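A minimal sketch of this behaviour, with a made-up tensor and loss:

import torch

w = torch.randn(3, requires_grad=True)   # operations on w are now tracked
x = torch.ones(3)

loss = (w * x).sum()
print(w.grad)       # None: backward() has not been called yet

loss.backward()     # autograd computes d(loss)/dw
print(w.grad)       # tensor([1., 1., 1.]), accumulated into w.grad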
4 optimizer.step()
Taking SGD as an example, the source code of torch.optim.SGD.step() is as follows:
def step(self, closure=None):
    """Performs a single optimization step.

    Arguments:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            if p.grad is None:
                continue
            d_p = p.grad.data
            if weight_decay != 0:
                d_p.add_(weight_decay, p.data)
            if momentum != 0:
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                else:
                    buf = param_state['momentum_buffer']
                    buf.mul_(momentum).add_(1 - dampening, d_p)
                if nesterov:
                    d_p = d_p.add(momentum, buf)
                else:
                    d_p = buf
            p.data.add_(-group['lr'], d_p)

    return loss
The step() function performs a single optimization step, updating the parameter values by gradient descent. Since gradient descent relies on the gradients, loss.backward() must be executed before optimizer.step() so that the gradients have already been computed.
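Stripped of momentum, dampening, weight decay and Nesterov, the update above reduces to p = p - lr * grad for every parameter. A minimal sketch of that core step, with a made-up parameter and loss:

import torch

lr = 0.1                                  # hypothetical learning rate
p = torch.randn(3, requires_grad=True)    # stands in for one model parameter

loss = (p ** 2).sum()
loss.backward()                           # p.grad now holds d(loss)/dp = 2 * p

with torch.no_grad():
    # The core of a plain SGD step, equivalent to p.data.add_(-lr, p.grad)
    # in the source quoted above.
    p -= lr * p.grad

Note that the quoted source comes from an older PyTorch release; more recent versions write the in-place calls with a keyword argument, e.g. d_p.add_(p.data, alpha=weight_decay) instead of d_p.add_(weight_decay, p.data), but the update logic is the same.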
Note:
The .step() functions of different optimizers follow basically the same overall flow; the main difference is the optimization method, i.e. the formula each optimizer uses to turn the computed gradients into a parameter update. For how the specific formulas differ, see the reference below.
Reference
optimizer.zero_grad(), loss.backward(), optimizer.step(): how they work | Code farm home