torch.optim optimization algorithms (optim.Adam)

Reprinted from: https://blog.csdn.net/kgzhang/article/details/77479737

torch.optim is a package that implements a variety of optimization algorithms. The most commonly used methods are already supported and a rich calling interface is provided; more refined optimization algorithms will be integrated in the future.
To use torch.optim, you construct an Optimizer object that holds the current state and updates the parameters according to the computed gradients.
To construct an Optimizer, you must give it an iterable containing the parameters to optimize (all of them must be Variables). You can then specify optimizer-specific options such as the learning rate, weight decay, and so on.

optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr = 0.0001)
self.optimizer_D_B = torch.optim.Adam(self.netD_B.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999))

Optimizer also supports specifying per-parameter options. Instead of passing an iterable of Variables, pass an iterable of dicts. Each dict defines a separate parameter group and must contain a params key holding the list of parameters that belong to it; the other keys should match the keyword arguments accepted by the optimizer and are used as the optimization options for that group.

optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

As above, model.base.parameters() will use a learning rate of 1e-2 (the optimizer-wide default), model.classifier.parameters() will use a learning rate of 1e-3, and a momentum of 0.9 is applied to all parameters.
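
Per-group options are merged with the optimizer-wide defaults, and the merged result can be inspected through optimizer.param_groups. Below is a minimal sketch using a toy model with base and classifier submodules (the model and its layer sizes are only illustrative, not part of the original example):

import torch.nn as nn
import torch.optim as optim

# Toy stand-in model with .base and .classifier submodules (illustrative only)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(10, 5)
        self.classifier = nn.Linear(5, 2)

model = Net()

optimizer = optim.SGD([
    {'params': model.base.parameters()},                    # no 'lr' key: falls back to the default 1e-2
    {'params': model.classifier.parameters(), 'lr': 1e-3}   # per-group override
], lr=1e-2, momentum=0.9)

for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])
# 0 0.01 0.9
# 1 0.001 0.9
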
Optimization step:
All optimizers implement a step() method that updates the parameters. It can be called in two ways:

optimizer.step()

This is the simplified form supported by most optimizers. It should be called once the gradients have been computed, e.g. with backward(), as in the following loop:

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
optimizer.step(closure)

Some optimization algorithms, such as conjugate gradient and LBFGS, need to re-evaluate the objective function multiple times, so you must pass in a closure that allows them to recompute the model. The closure must clear the gradients, compute the loss (calling backward()), and return it.

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
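
LBFGS, mentioned above, is one optimizer in torch.optim that requires this closure form. A minimal sketch fitting a least-squares problem with synthetic data (the data and learning rate are only illustrative):

import torch

# Synthetic least-squares problem: recover true_w from (x, y) pairs
x = torch.randn(100, 3)
true_w = torch.tensor([1.0, -2.0, 0.5])
y = x @ true_w

w = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.LBFGS([w], lr=0.1)

def closure():
    optimizer.zero_grad()                 # clear old gradients
    loss = ((x @ w - y) ** 2).mean()      # compute the loss
    loss.backward()                       # compute gradients
    return loss                           # return the loss to the optimizer

for _ in range(20):
    optimizer.step(closure)

print(w)  # should be close to true_w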

Adam algorithm:

Adam algorithm source: Adam: A Method for Stochastic Optimization

Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term. It uses first-order and second-order moment estimates of the gradient to dynamically adjust the learning rate of each parameter. Its advantage is that, after bias correction, the step size of each iteration stays within a known range, which keeps the parameter updates relatively stable. The update formulas are:

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
m_hat_t = m_t / (1 - beta1^t)
v_hat_t = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - lr * m_hat_t / (sqrt(v_hat_t) + eps)

The first two formulas are the first- and second-order moment estimates of the gradient, i.e. estimates of the expectations E[g_t] and E[g_t^2]. Formulas 3 and 4 correct the bias of these estimates, so they can be treated as approximately unbiased estimates of the expectations. Estimating the moments directly from the gradients in this way requires no extra memory and adapts to the gradients dynamically. The factor multiplying m_hat_t in the last formula acts as a dynamic constraint on the learning rate lr, and it has a well-defined range.
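
Written out directly, one Adam update looks like the following sketch (a plain re-implementation of the formulas above for a single tensor, not the library code):

import torch

# One Adam update, following the formulas above (illustrative sketch)
def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

param = torch.tensor([1.0, 2.0])
m = torch.zeros_like(param)
v = torch.zeros_like(param)
grad = torch.tensor([0.1, -0.3])                  # gradient at the current step
param, m, v = adam_update(param, grad, m, v, t=1)
print(param)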

class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

Parameters:

params (iterable): an iterable of parameters to optimize, or dicts defining parameter groups
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
step(closure=None): performs a single optimization step
closure (callable, optional): a closure that re-evaluates the model and returns the loss
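
Put together, a minimal usage sketch with a toy linear model and random data (the model, data, and hyperparameter values are only illustrative):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)                            # toy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)

x = torch.randn(8, 4)                              # random inputs
target = torch.randn(8, 1)                         # random targets
loss_fn = nn.MSELoss()

for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()                               # step(closure=None): one optimization step
    print(loss.item())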

torch.optim.Adam source code:

import math
from .optimizer import Optimizer

class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = grad.new().resize_as_(grad).zero_()
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = grad.new().resize_as_(grad).zero_()

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                if group['weight_decay'] != 0:
                    # L2 penalty: add weight_decay * p.data to the gradient
                    grad = grad.add(group['weight_decay'], p.data)

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

                # Denominator: sqrt of the second-moment estimate plus eps
                denom = exp_avg_sq.sqrt().add_(group['eps'])

                # Bias corrections, folded into the step size
                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

                # Parameter update: p = p - step_size * exp_avg / denom
                p.data.addcdiv_(-step_size, exp_avg, denom)

        return loss

Adam has the following characteristics:
1. It combines the strengths of Adagrad, which handles sparse gradients well, and RMSprop, which handles non-stationary objectives well;
2. Low memory requirements;
3. It computes individual adaptive learning rates for different parameters;
4. It is also suitable for most non-convex optimization problems, as well as for large data sets and high-dimensional parameter spaces.
