Optimizers and the training process (if you can't learn it from this, come hit me)

Learning Summary

(1) Each optimizer is a class and must be instantiated before it can be used, for example:

class Net(nn.Module):
    ...
net = Net()
optim = torch.optim.SGD(net.parameters(), lr=lr)
optim.step()

(2) In each training iteration of a neural network, the optimizer is responsible for two steps:
zeroing the gradients and updating the parameters.

optimizer = torch.optim.SGD(net.parameters(), lr=1e-5)
for epoch in range(EPOCH):
    ...
    optimizer.zero_grad()  # Zero the gradients
    loss = ...             # Compute the loss
    loss.backward()        # Backpropagation
    optimizer.step()       # Update the parameters
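The skeleton above leaves the model, data, and loss unspecified. A minimal runnable sketch of the same loop follows. The toy data (y = 2x + 1), the single nn.Linear layer standing in for the network, and the learning rate are all assumptions made for illustration, not part of the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data, made up for illustration: y = 2x + 1
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 2 * x + 1

net = nn.Linear(1, 1)            # stand-in for a real network
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

losses = []
for epoch in range(100):
    optimizer.zero_grad()        # clear gradients left over from the previous iteration
    loss = loss_fn(net(x), y)    # forward pass and loss computation
    loss.backward()              # backpropagation fills p.grad for every parameter
    optimizer.step()             # update the parameters from p.grad
    losses.append(loss.item())

print("first loss:", losses[0], "last loss:", losses[-1])
```

The order of the four calls inside the loop is the part that carries over to real training code; everything else here is scaffolding.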

1. Optimizer

The goal of deep learning is to adjust the network parameters continuously so that the network, through a series of non-linear transformations, maps the input to the desired output. In essence this is a search for the optimum of a function, except that the optimum is a set of matrices rather than a single number. Finding that optimum quickly is a key topic in deep learning research. Take the classic ResNet-50 as an example: it has about 25 million parameters. There are two ways we could try to determine that many parameters:

(1) The most direct approach is a brute-force search over the parameters. Its chance of succeeding is essentially zero; it is comparable to the Foolish Old Man moving the mountains, only harder.
(2) To solve for the parameters faster, a second approach is used: an approximate solution obtained with backpropagation (BP) plus an optimizer.

Therefore, the optimizer updates the network's parameters using the gradients produced by backpropagation, which reduces the value of the loss function and brings the model's output closer to the true labels.
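To make "updates the parameters using the gradients" concrete, the sketch below minimizes a one-variable function twice: once with a hand-written update w <- w - lr * grad, and once with torch.optim.SGD. The function f(w) = (w - 3)^2 and the variable names are illustrative assumptions; plain SGD without momentum is used so that the two updates should coincide.

```python
import torch

lr = 0.1

# Minimize f(w) = (w - 3)^2, whose minimum is at w = 3.
w_manual = torch.tensor(0.0, requires_grad=True)
w_optim = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=lr)

for _ in range(50):
    # Hand-written gradient descent step
    loss = (w_manual - 3) ** 2
    loss.backward()
    with torch.no_grad():
        w_manual -= lr * w_manual.grad
    w_manual.grad.zero_()

    # The same step delegated to the optimizer
    optimizer.zero_grad()
    loss = (w_optim - 3) ** 2
    loss.backward()
    optimizer.step()

print(w_manual.item(), w_optim.item())
```

Both values end up close to 3, which suggests that plain SGD is doing nothing more than the hand-written rule; the other optimizers differ in how they turn the gradient into an update.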

2. Optimizers in PyTorch

PyTorch provides a library of optimizers, torch.optim, which includes the following eleven:

torch.optim.ASGD
torch.optim.Adadelta
torch.optim.Adagrad
torch.optim.Adam
torch.optim.AdamW
torch.optim.Adamax
torch.optim.LBFGS
torch.optim.RMSprop
torch.optim.Rprop
torch.optim.SGD
torch.optim.SparseAdam
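The set of optimizers has changed between PyTorch releases, so the list above may not match a given installation exactly. One way to check, sketched here under the assumption that every optimizer is exposed as a class directly in the torch.optim namespace, is to enumerate the subclasses of torch.optim.Optimizer:

```python
import inspect
import torch

# Collect every class in torch.optim that derives from the Optimizer base class.
available = sorted(
    name
    for name, obj in vars(torch.optim).items()
    if inspect.isclass(obj)
    and issubclass(obj, torch.optim.Optimizer)
    and obj is not torch.optim.Optimizer
)
print(available)
```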

These optimization algorithms all inherit from Optimizer, so let's first look at this common base class. Its definition (abridged) is as follows:

class Optimizer(object):
    def __init__(self, params, defaults):        
        self.defaults = defaults
        self.state = defaultdict(dict)
        self.param_groups = []

Optimizer has three attributes:

defaults: Stores the optimizer's hyperparameters, for example:

{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}

state: A per-parameter cache of optimizer state (for example, momentum buffers), as shown below:

defaultdict(<class 'dict'>, {tensor([[ 0.3864, -0.0131],
        [-0.1911, -0.4511]], requires_grad=True): {'momentum_buffer': tensor([[0.0052, 0.0052],
        [0.0052, 0.0052]])}})

param_groups: The managed parameter groups. This is a list in which each element is a dictionary with the keys params, lr, momentum, dampening, weight_decay, and nesterov, for example:

[{'params': [tensor([[-0.1022, -1.6890],
					[-1.5116, -1.7846]],
					requires_grad=True)], 
			'lr': 1, 
			'momentum': 0, 
			'dampening': 0, 
			'weight_decay': 0, 
			'nesterov': False}]
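Because param_groups is an ordinary list of dictionaries, hyperparameters can be read and changed in place. The sketch below halves the learning rate by hand; the tensor w and the factor 0.5 are illustrative assumptions. In practice a scheduler from torch.optim.lr_scheduler would usually handle this.

```python
import torch

w = torch.randn(2, 2, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1, momentum=0.9)

print(optimizer.defaults["lr"])           # hyperparameter given at construction
print(optimizer.param_groups[0]["lr"])    # value currently used for group 0

# Halve the learning rate of every group in place.
for group in optimizer.param_groups:
    group["lr"] *= 0.5

print(optimizer.param_groups[0]["lr"])    # the group now uses the smaller value
print(optimizer.defaults["lr"])           # defaults keeps the original value
```

The next call to step() reads lr from the group dictionary, so the change takes effect without rebuilding the optimizer.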

Optimizer also has the following methods:

zero_grad(): Clears the gradients of the managed parameters. PyTorch does not reset a tensor's gradient automatically; each call to backward() adds to the existing .grad. The gradients therefore need to be cleared once per iteration, before the next backward pass.

def zero_grad(self, set_to_none: bool = False):
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:  # Only parameters that already have a gradient
                if set_to_none:
                    p.grad = None
                else:
                    if p.grad.grad_fn is not None:
                        p.grad.detach_()
                    else:
                        p.grad.requires_grad_(False)
                    p.grad.zero_()  # Set the gradient to 0 in place
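The accumulation behaviour that makes zero_grad() necessary can be seen in a small sketch. The tensor values are illustrative assumptions. set_to_none=False is passed explicitly because the default of that argument differs between PyTorch versions.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(w * 3).sum().backward()
first = w.grad.clone()           # gradient of sum(3 * w) is 3 for each element

(w * 3).sum().backward()         # no zero_grad() in between: gradients add up
accumulated = w.grad.clone()

optimizer.zero_grad(set_to_none=False)   # zero the gradient tensor in place
cleared = w.grad.clone()

print(first, accumulated, cleared)
```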

step(): Performs a single parameter update. The base class leaves it unimplemented; each concrete optimizer provides its own version.

def step(self, closure): 
    raise NotImplementedError
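Since the base class only raises NotImplementedError, every concrete optimizer has to override step(). A minimal sketch of such a subclass follows. The class name PlainSGD and its update rule p <- p - lr * grad are hypothetical and written for illustration; this is not how torch.optim.SGD itself is implemented.

```python
import torch
from torch.optim import Optimizer


class PlainSGD(Optimizer):
    """Hypothetical minimal optimizer: p <- p - lr * grad."""

    def __init__(self, params, lr=0.01):
        # The defaults dict is what later shows up in optimizer.defaults.
        super().__init__(params, defaults={"lr": lr})

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # Re-evaluate the model if a closure is supplied.
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group["lr"])
        return loss


w = torch.tensor([1.0, 2.0], requires_grad=True)
w.grad = torch.ones(2)           # pretend backward() produced this gradient
PlainSGD([w], lr=0.5).step()
print(w)
```

The loop over param_groups and then over group["params"] is the same traversal that zero_grad() uses, which is why per-group hyperparameters such as lr are available inside step().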

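add_param_group(): Adds a parameter group, so that different sets of parameters can use different hyperparameters (for example, a different learning rate)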

def add_param_group(self, param_group):
    assert isinstance(param_group, dict), "param group must be a dict"
    # Normalize 'params' to a list and check that every element is a leaf Tensor
    params = param_group['params']
    if isinstance(params, torch.Tensor):
        param_group['params'] = [params]
    elif isinstance(params, set):
        raise TypeError('optimizer parameters need to be organized in ordered collections, but '
                        'the ordering of tensors in sets will change between runs. Please use a list instead.')
    else:
        param_group['params'] = list(params)
    for param in param_group['params']:
        if not isinstance(param, torch.Tensor):
            raise TypeError("optimizer can only optimize Tensors, "
                            "but one of the params is " + torch.typename(param))
        if not param.is_leaf:
            raise ValueError("can't optimize a non-leaf Tensor")

    for name, default in self.defaults.items():
        if default is required and name not in param_group:
            raise ValueError("parameter group didn't specify a value of required optimization parameter " +
                             name)
        else:
            param_group.setdefault(name, default)

    params = param_group['params']
    if len(params) != len(set(params)):
        warnings.warn("optimizer contains a parameter group with duplicate parameters; "
                      "in future, this will cause an error; "
                      "see github.com/pytorch/pytorch/issues/40967 for more information", stacklevel=3)
    # Everything above only validates the input (warnings and errors); next, make sure the new group does not overlap an existing one
    param_set = set()
    for group in self.param_groups:
        param_set.update(set(group['params']))

    if not param_set.isdisjoint(set(param_group['params'])):
        raise ValueError("some parameters appear in more than one parameter group")
    # Finally, register the new parameter group
    self.param_groups.append(param_group)
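Two of the behaviours in this method are easy to observe from the outside: missing hyperparameters are filled in from defaults, and a parameter that already belongs to a group is rejected. The sketch below exercises both; the tensor names and learning rates are illustrative assumptions.

```python
import torch

w1 = torch.randn(2, 2, requires_grad=True)
w2 = torch.randn(3, 3, requires_grad=True)
optimizer = torch.optim.SGD([w1], lr=0.1, momentum=0.9)

# The new group sets its own lr; momentum is not given and comes from defaults.
optimizer.add_param_group({"params": w2, "lr": 0.001})
second = optimizer.param_groups[1]
print(second["lr"], second["momentum"])

# Adding w1 a second time overlaps the first group and raises ValueError.
try:
    optimizer.add_param_group({"params": [w1]})
    duplicate_rejected = False
except ValueError as err:
    duplicate_rejected = True
    print("rejected:", err)
```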

load_state_dict(): Loads an optimizer state dictionary. This makes it possible to interrupt training and later resume from the saved optimizer state

def load_state_dict(self, state_dict):
    r"""Loads the optimizer state.

    Arguments:
        state_dict (dict): optimizer state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    # deepcopy, to be consistent with module API
    state_dict = deepcopy(state_dict)
    # Validate the state_dict
    groups = self.param_groups
    saved_groups = state_dict['param_groups']

    if len(groups) != len(saved_groups):
        raise ValueError("loaded state dict has a different number of "
                         "parameter groups")
    param_lens = (len(g['params']) for g in groups)
    saved_lens = (len(g['params']) for g in saved_groups)
    if any(p_len != s_len for p_len, s_len in zip(param_lens, saved_lens)):
        raise ValueError("loaded state dict contains a parameter group "
                         "that doesn't match the size of optimizer's group")

    # Update the state
    id_map = {old_id: p for old_id, p in
              zip(chain.from_iterable((g['params'] for g in saved_groups)),
                  chain.from_iterable((g['params'] for g in groups)))}

    def cast(param, value):
        r"""Make a deep copy of value, casting all tensors to device of param."""
        ...

    # Copy state assigned to params (and cast tensors to appropriate types).
    # State that is not assigned to params is copied as is (needed for
    # backward compatibility).
    state = defaultdict(dict)
    for k, v in state_dict['state'].items():
        if k in id_map:
            param = id_map[k]
            state[param] = cast(param, v)
        else:
            state[k] = v

    # Update parameter groups, setting their 'params' value
    def update_group(group, new_group):
        ...
    param_groups = [
        update_group(g, ng) for g, ng in zip(groups, saved_groups)]
    self.__setstate__({'state': state, 'param_groups': param_groups})

state_dict(): Returns the optimizer's current state as a dictionary

def state_dict(self):
    r"""Returns the state of the optimizer as a :class:`dict`.

    It contains two entries:

    * state - a dict holding current optimization state. Its content
        differs between optimizer classes.
    * param_groups - a dict containing all parameter groups
    """
    # Save order indices instead of Tensors
    param_mappings = {}
    start_index = 0

    def pack_group(group):
        ...
    param_groups = [pack_group(g) for g in self.param_groups]
    # Remap state to use order indices as keys
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
                    for k, v in self.state.items()}
    return {
        'state': packed_state,
        'param_groups': param_groups,
    }
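A save and restore round trip shows how state_dict() and load_state_dict() work together. The sketch below writes to an in-memory buffer instead of a file so that it does not depend on any path; the tensor w and the hyperparameters are illustrative assumptions.

```python
import io
import torch

w = torch.ones(2, 2, requires_grad=True)
w.grad = torch.ones(2, 2)
optimizer = torch.optim.SGD([w], lr=0.1, momentum=0.9)
optimizer.step()                 # the first step creates the momentum buffer

# Serialize the optimizer state into an in-memory buffer.
buffer = io.BytesIO()
torch.save(optimizer.state_dict(), buffer)
buffer.seek(0)

# A fresh optimizer over the same parameter starts with empty state...
restored = torch.optim.SGD([w], lr=0.1, momentum=0.9)
# ...until the saved state is loaded back.
restored.load_state_dict(torch.load(buffer))

saved_buf = optimizer.state_dict()["state"][0]["momentum_buffer"]
restored_buf = restored.state_dict()["state"][0]["momentum_buffer"]
print(torch.equal(saved_buf, restored_buf))
print(restored.state_dict()["param_groups"][0]["params"])
```

Note that the saved dictionary refers to parameters by index rather than by tensor, which is why the restored optimizer must be built over parameters in the same order before load_state_dict() is called.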

3. Hands-on Example

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 16 22:46:46 2021

@author: 86493
"""
import torch 
import os 

# Initialize the weights from a standard normal distribution
weight = torch.randn((2, 2),
                     requires_grad=True)
# Set the gradient to an all-ones matrix
weight.grad = torch.ones((2, 2))
# Print the current weight data and gradient
print("The data of weight before step:\n{}".format(weight.data))
print('-' * 60)
print("The grad of weight before step:\n{}".format(weight.grad))
print('-' * 60)

# Instantiate the optimizer
optimizer = torch.optim.SGD([weight],
                            lr = 0.1,
                            momentum = 0.9)
# Perform one update step
optimizer.step()
# View the weight data and gradient after one step
print("The data of weight after step:\n{}".format(weight.data))
print('-' * 60)
print("The grad of weight after step:\n{}".format(weight.grad))
print('-' * 60)

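# Zero the gradients (on PyTorch 2.0 and later the default is set_to_none=True,
# so pass set_to_none=False to get zero tensors instead of None)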
optimizer.zero_grad()
print("The grad of weight after optimizer.zero_grad():\n{}".format(weight.grad))
print('-' * 60)

# Print the parameter groups
print("optimizer.parmas_group is \n{}".format(optimizer.param_groups))
print('-' * 60)

# Compare object ids: the optimizer holds the very same tensor object as weight
# (Python passes references to objects, so the optimizer does not copy the parameter)
print("weight in optimizer:{}\nweight in weight:{}\n".
      format(id(optimizer.param_groups[0]['params'][0]),
             id(weight)))
print('-' * 60)

# Add a second parameter group containing weight2
weight2 = torch.randn((3, 3), requires_grad = True)
optimizer.add_param_group({"params": weight2,
                           'lr': 0.0001, 
                           'nesterov': True})
# View the current parameter groups
print("optimizer.param_groups is \n{}".format(optimizer.param_groups))
print('-' * 60)

# View the current optimizer state
opt_state_dict = optimizer.state_dict()
print("state_dict before step:\n", opt_state_dict)
print('-' * 60)

# Perform 50 more update steps
for _ in range(50):
    optimizer.step()
# Print the state after those steps
print("state_dict after step:\n", optimizer.state_dict())
print('-' * 60)

# Save the optimizer state
torch.save(optimizer.state_dict(),
           os.path.join(r"D:\Desktop Files\matrix\code\Torch", "optimizer_state_dict.pkl"))
print("--------------------done----------------------")

# Load the optimizer state
state_dict = torch.load(r"D:\Desktop Files\matrix\code\Torch\optimizer_state_dict.pkl")
optimizer.load_state_dict(state_dict)
print("load state_dict successfully\n{}".format(state_dict))
print('-' * 60)

# Print the final attribute values
print("Output final attribute information:\n")
print("Output Properties optimizer.defaults: \n{}".format(optimizer.defaults))
print('-' * 60)
print("Output Properties optimizer.state\n{}".format(optimizer.state))
print('-' * 60)
print("Output Properties optimizer.param_groups\n{}".format(optimizer.param_groups))

The results are:

The data of weight before step:
tensor([[-0.0947,  1.4217],
        [-1.3000, -1.0501]])
------------------------------------------------------------
The grad of weight before step:
tensor([[1., 1.],
        [1., 1.]])
------------------------------------------------------------
The data of weight after step:
tensor([[-0.1947,  1.3217],
        [-1.4000, -1.1501]])
------------------------------------------------------------
The grad of weight after step:
tensor([[1., 1.],
        [1., 1.]])
------------------------------------------------------------
The grad of weight after optimizer.zero_grad():
tensor([[0., 0.],
        [0., 0.]])
------------------------------------------------------------
optimizer.parmas_group is 
[{'params': [tensor([[-0.1947,  1.3217],
        			[-1.4000, -1.1501]], 
        	requires_grad=True)], 
        	'lr': 0.1, 
        	'momentum': 0.9, 
        	'dampening': 0, 
        	'weight_decay': 0, 
        	'nesterov': False}]
------------------------------------------------------------
weight in optimizer:1881798878848
weight in weight:1881798878848

------------------------------------------------------------
optimizer.param_groups is 
[{'params': [tensor([[-0.1947,  1.3217],
        			[-1.4000, -1.1501]], 
        	requires_grad=True)], 
        	'lr': 0.1, 
        	'momentum': 0.9, 
        	'dampening': 0, 
        	'weight_decay': 0, 
        	'nesterov': False}, 
 {'params': [tensor([[-1.7869,  2.1294, -0.1307],
        			[ 0.6809, -0.0193, -0.5704],
        			[-0.5512, -2.5028,  0.2141]], requires_grad=True)], 'lr': 0.0001, 'nesterov': True, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0}]
------------------------------------------------------------
state_dict before step:
 {'state': {0: {'momentum_buffer': tensor([[1., 1.],
        [1., 1.]])}}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}, {'lr': 0.0001, 'nesterov': True, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'params': [1]}]}
------------------------------------------------------------
state_dict after step:
 {'state': {0: {'momentum_buffer': tensor([[0.0052, 0.0052],
        [0.0052, 0.0052]])}}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}, {'lr': 0.0001, 'nesterov': True, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'params': [1]}]}
------------------------------------------------------------
--------------------done----------------------
load state_dict successfully
{'state': {0: {'momentum_buffer': tensor([[0.0052, 0.0052],
        [0.0052, 0.0052]])}}, 'param_groups': [{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0]}, {'lr': 0.0001, 'nesterov': True, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'params': [1]}]}
------------------------------------------------------------
Output final attribute information:

Output Properties optimizer.defaults: 
{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False}
------------------------------------------------------------
Output Properties optimizer.state
defaultdict(<class 'dict'>, {tensor([[-1.0900,  0.4263],
        [-2.2953, -2.0455]], requires_grad=True): {'momentum_buffer': tensor([[0.0052, 0.0052],
        [0.0052, 0.0052]])}})
------------------------------------------------------------
Output Properties optimizer.param_groups
[{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [tensor([[-1.0900,  0.4263],
        [-2.2953, -2.0455]], requires_grad=True)]}, {'lr': 0.0001, 'nesterov': True, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'params': [tensor([[-1.7869,  2.1294, -0.1307],
        [ 0.6809, -0.0193, -0.5704],
        [-0.5512, -2.5028,  0.2141]], requires_grad=True)]}]
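The momentum_buffer value of 0.0052 in the output can be checked by hand. After the first step the buffer equals the gradient (all ones). The gradient is then zeroed, so each of the 50 later steps only multiplies the buffer by the momentum 0.9 and moves the weight by lr times the buffer. The sketch below replays that arithmetic in plain Python for a single element. It assumes the momentum update buf <- momentum * buf + grad with dampening 0, as configured in the example.

```python
momentum = 0.9
lr = 0.1

buf = 1.0      # momentum buffer after the first step (equal to the gradient of ones)
shift = 0.0    # total amount subtracted from each weight during the 50 later steps

for _ in range(50):
    grad = 0.0                     # the gradient was zeroed before these steps
    buf = momentum * buf + grad    # buffer decays geometrically
    shift += lr * buf              # weight <- weight - lr * buf

print(round(buf, 4))    # 0.0052
print(round(shift, 3))  # 0.895
```

The shift of about 0.895 agrees with the printed weights: the first element goes from -0.1947 after the first step to -1.0900 at the end. The second parameter group never receives a gradient, so weight2 is unchanged throughout.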

Keywords: Pytorch Deep Learning

Added by juhl on Sat, 16 Oct 2021 19:35:25 +0300
