PyTorch: optimizers, loss functions, and the deep neural network framework

Copyright: Jingmin Wei, Pattern Recognition and Intelligent System, School of Artificial and Intelligence, Huazhong University of Science and Technology

Common optimizer

Stochastic gradient descent (SGD) is the basic algorithm used to train machine learning models. Its principles and code are not covered here; please refer to other material.

The disadvantages of stochastic gradient descent are that an appropriate learning rate is hard to choose and that it easily converges to a local minimum.

There are two solutions: introducing momentum to update the parameters dynamically, and introducing a learning rate that changes dynamically.

Momentum updates the parameters dynamically: the size and direction of each update are computed from both the direction of the previous updates and the gradient of the current batch. This makes parameter learning more stable and speeds up convergence.
$$m_t=\mu\times m_{t-1}+g_t\\ \Delta{\theta_t}=\eta\times m_t$$
When the current gradient points in the same direction as the previous update, the update becomes larger; otherwise it becomes smaller. In the middle and later stages of training the parameters oscillate near a local optimum ($g_t=0$). Because $m_t$ still carries the accumulated earlier gradients, the update does not vanish there, so it is possible to jump out of the local optimum.

A commonly used variant is Nesterov momentum.
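
The plain momentum update given by the two formulas can be sketched in a few lines of standard-library Python. This is only an illustration, not the PyTorch implementation: the names `momentum_step` and `run_momentum`, the quadratic objective, and the values of `mu` and `eta` are assumptions made for the example, and the step is subtracted from the parameter so that the objective decreases.

```python
def momentum_step(theta, m, grad, mu=0.9, eta=0.1):
    """One momentum update: m_t = mu * m_{t-1} + g_t, delta = eta * m_t."""
    m = mu * m + grad          # accumulate the current gradient into the momentum
    delta = eta * m            # size of the update, Delta theta_t
    return theta - delta, m    # assumption: the step is taken against the gradient


def run_momentum(steps=300, theta0=5.0):
    """Minimize f(theta) = theta**2 (gradient 2 * theta), starting from theta0."""
    theta, m = theta0, 0.0
    for _ in range(steps):
        grad = 2.0 * theta
        theta, m = momentum_step(theta, m, grad)
    return theta
```

Calling `momentum_step` twice with the same gradient shows the effect described above: the second momentum value is larger than the first, so the update grows while the gradient direction stays the same.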

The second solution is a learning rate that changes dynamically. At the beginning of training the parameters are far from their optimal values, so a larger learning rate should be used; after several rounds of training the learning rate is reduced. Adaptive learning rate algorithms such as Adadelta, RMSProp, and Adam were developed for this purpose.

For example, Adam dynamically estimates a learning rate for each parameter from the first moment and the second moment of the gradient, which makes the parameter updates more stable and the model converge faster.

Common optimizer

The torch.optim module provides a variety of optimization algorithms for deep learning.

import torch
from torch import nn
from torch import optim
import matplotlib.pyplot as plt
# Some of the optimizer classes in torch.optim; each takes the parameters to optimize as its first argument
torch.optim.ASGD     # Averaged stochastic gradient descent
torch.optim.Rprop    # Resilient backpropagation
torch.optim.SGD      # Stochastic gradient descent

Adam optimizer

Take the Adam optimizer as an example. Our model is trained with the powerful and general Adam optimization algorithm, a "panacea" optimizer that is widely used and effective in deep learning. Its main principle is to combine the first moment estimation of the gradient with the second moment estimation (i.e. the uncentered variance of the gradient) to calculate the update step of each iteration.

The Adam algorithm builds on the RMSProp algorithm and also applies an exponentially weighted moving average to the mini-batch stochastic gradient. Adam can therefore be regarded as a combination of the RMSProp algorithm and the momentum method.

The Adam algorithm uses the momentum variable $\boldsymbol{v}_t$ and, as in the RMSProp algorithm, the variable $\boldsymbol{s}_t$, which is the exponentially weighted moving average of the element-wise square of the mini-batch stochastic gradient. At time step $0$, every element of both variables is initialized to $0$.

Given the hyperparameter $0 \leq \beta_1 < 1$ (the authors of the algorithm suggest $0.9$), the momentum variable $\boldsymbol{v}_t$ at time step $t$ is the exponentially weighted moving average of the mini-batch stochastic gradient $\boldsymbol{g}_t$:
$$\boldsymbol{v}_t \leftarrow \beta_1 \boldsymbol{v}_{t-1} + (1 - \beta_1) \boldsymbol{g}_t.$$
As in the RMSProp algorithm, given the hyperparameter $0 \leq \beta_2 < 1$ (the authors of the algorithm suggest $0.999$), the exponentially weighted moving average of the element-wise squared mini-batch stochastic gradient $\boldsymbol{g}_t \odot \boldsymbol{g}_t$ gives $\boldsymbol{s}_t$:
$$\boldsymbol{s}_t \leftarrow \beta_2 \boldsymbol{s}_{t-1} + (1 - \beta_2) \boldsymbol{g}_t \odot \boldsymbol{g}_t.$$
Because the elements of $\boldsymbol{v}_0$ and $\boldsymbol{s}_0$ are initialized to $0$, at time step $t$ we get $\boldsymbol{v}_t = (1-\beta_1) \sum_{i=1}^t \beta_1^{t-i} \boldsymbol{g}_i$. Adding up the weights of the mini-batch stochastic gradients over the past time steps gives $(1-\beta_1) \sum_{i=1}^t \beta_1^{t-i} = 1 - \beta_1^t$. Note that when $t$ is small, this sum of weights is small. For example, when $\beta_1 = 0.9$, $\boldsymbol{v}_1 = 0.1\boldsymbol{g}_1$. To eliminate this effect, for any time step $t$ we can divide $\boldsymbol{v}_t$ by $1 - \beta_1^t$, so that the weights of the mini-batch stochastic gradients over the past time steps sum to $1$. This is called bias correction. In the Adam algorithm, bias correction is applied to both $\boldsymbol{v}_t$ and $\boldsymbol{s}_t$:
$$\hat{\boldsymbol{v}}_t \leftarrow \frac{\boldsymbol{v}_t}{1 - \beta_1^t}$$

$$\hat{\boldsymbol{s}}_t \leftarrow \frac{\boldsymbol{s}_t}{1 - \beta_2^t}.$$

Next, the Adam algorithm uses the bias-corrected variables $\hat{\boldsymbol{v}}_t$ and $\hat{\boldsymbol{s}}_t$ to readjust the learning rate of each element of the model parameters through element-wise operations:
$$\boldsymbol{g}_t' \leftarrow \frac{\eta \hat{\boldsymbol{v}}_t}{\sqrt{\hat{\boldsymbol{s}}_t} + \epsilon},$$
where $\eta$ is the learning rate and $\epsilon$ is a constant added to maintain numerical stability, such as $10^{-8}$. As in the AdaGrad, RMSProp, and AdaDelta algorithms, each element of the independent variable of the objective function has its own learning rate. Finally, use $\boldsymbol{g}_t'$ to update the independent variable:
$$\boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{g}_t'.$$
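
The update rules above can be traced step by step with a small standard-library sketch. It is not how `torch.optim.Adam` is implemented internally; the function name `adam_step` and the representation of the parameters as plain Python lists are assumptions made for illustration. The variable names follow the symbols in the formulas.

```python
import math


def adam_step(x, grad, v, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of scalar parameters; t starts at 1."""
    new_x, new_v, new_s = [], [], []
    for x_i, g_i, v_i, s_i in zip(x, grad, v, s):
        v_i = beta1 * v_i + (1 - beta1) * g_i          # moving average of the gradient
        s_i = beta2 * s_i + (1 - beta2) * g_i * g_i    # moving average of its square
        v_hat = v_i / (1 - beta1 ** t)                 # bias correction
        s_hat = s_i / (1 - beta2 ** t)
        step = lr * v_hat / (math.sqrt(s_hat) + eps)   # g'_t in the formulas
        new_x.append(x_i - step)
        new_v.append(v_i)
        new_s.append(s_i)
    return new_x, new_v, new_s


# First step (t = 1) for two parameters whose gradients differ greatly in scale
x, v, s = adam_step([1.0, -2.0], [0.5, -300.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

At $t=1$ the bias correction turns $0.1\boldsymbol{g}_1$ back into $\boldsymbol{g}_1$, so each parameter moves by roughly `lr` in the direction opposite to its gradient, regardless of how large the gradient is. This is the per-element learning rate mentioned above.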

torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, amsgrad=False)
params,    # Iterable of parameters to optimize, or dicts defining parameter groups; usually model.parameters()
lr=1e-3,     # Learning rate of the algorithm; defaults to 0.001
betas=(0.9, 0.999),    # Coefficients used to compute the running averages of the gradient and of its square
eps=1e-8,    # Term added to the denominator to improve numerical stability
weight_decay=0,    # Weight decay (L2 penalty)
amsgrad=False,    # Whether to use the AMSGrad variant of the algorithm

Build a test network to demonstrate the use of the optimizer

# Build a test network
class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()    # Initialize the attributes inherited from the parent class
        # Define the hidden layer; the rest of this call was cut off in the source, nn.ReLU() is assumed
        self.hidden = nn.Sequential(nn.Linear(13, 10), nn.ReLU())
        # Define the regression (prediction) layer
        self.regression = nn.Linear(10, 1)

    def forward(self, x):
        # Define the forward propagation path
        x = self.hidden(x)
        output = self.regression(x)
        return output

testnet = TestNet()    # Instantiate the network
# Option 1: an optimizer with the same learning rate for all layers
optimizer = optim.Adam(testnet.parameters(), lr=0.001)
# Option 2: an optimizer with a different learning rate for each layer (parameter groups)
optimizer = optim.Adam([{'params': testnet.hidden.parameters(), 'lr': 0.0001},
                        {'params': testnet.regression.parameters(), 'lr': 0.01}])
# Objective function optimization framework (schematic, not a runnable example: dataset is not defined here)

# Define the loss function as cross entropy (other loss functions can also be used)
loss_function = nn.CrossEntropyLoss()

for inputs, target in dataset:
    optimizer.zero_grad()    # Clear the gradients
    output = testnet(inputs)    # Compute the predictions with the network instance
    loss = loss_function(output, target)    # Compute the loss
    loss.backward()    # Backpropagate the loss
    optimizer.step()    # Update the parameters

optim.lr_scheduler provides several ways to adjust the learning rate of the optimizer

last_epoch, # Index of the last epoch, used to set when to start adjusting the learning rate; -1 means the learning rate is set to its initial value
step_size, # Every step_size epochs, the learning rate is multiplied by gamma
milestones, # List of epoch indices at which the learning rate is multiplied by gamma
T_max, # Number of epochs over which the learning rate is annealed from its initial value to eta_min
eta_min, # Minimum learning rate in each cycle
# See the textbook for the specific adjustment algorithms

optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)
optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)
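# Set the learning rate adjustment method
# LambdaLR needs the optimizer and a function of the epoch; the decay below is only an illustrative choice
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)

for epoch in range(100):
    # Train and validate for one epoch here (including optimizer.step())
    scheduler.step()    # Update the learning rate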
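
To make the effect of `step_size`, `milestones`, `gamma`, `T_max`, and `eta_min` concrete, the sketch below evaluates the closed-form learning rate that the StepLR, MultiStepLR, and CosineAnnealingLR schedules are documented to follow. The functions `step_lr`, `multistep_lr`, and `cosine_annealing_lr` are hypothetical helpers written for this illustration; they are not part of `torch.optim.lr_scheduler`, whose schedulers are stateful objects attached to an optimizer.

```python
import math


def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def multistep_lr(base_lr, epoch, milestones, gamma=0.1):
    """Multiply the learning rate by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Cosine decay from base_lr at epoch 0 down to eta_min at epoch t_max."""
    cos_term = (1 + math.cos(math.pi * epoch / t_max)) / 2
    return eta_min + (base_lr - eta_min) * cos_term
```

For example, with `base_lr=0.1`, `step_size=10`, and `gamma=0.1`, the learning rate stays at 0.1 for epochs 0 to 9, drops to about 0.01 for epochs 10 to 19, and so on.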

Loss function

The loss function measures the gap between the predicted values and the actual values in one iteration.

The goal of the optimization problem is to minimize the loss function.

The nn module provides a variety of loss functions for deep learning

nn.L1Loss(), # Mean absolute error loss, for regression problems
nn.MSELoss(), # Mean squared error loss, for regression problems
nn.CrossEntropyLoss(), # Cross entropy loss, for multi-class classification
nn.NLLLoss(), # Negative log likelihood loss, for multi-class classification
nn.NLLLoss2d(), # Negative log likelihood loss for images, used for image segmentation
nn.KLDivLoss(), # KL divergence loss, for regression problems
nn.BCELoss(), # Binary cross entropy loss, for binary classification
nn.MarginRankingLoss(), # Margin ranking loss, for evaluating similarity
nn.MultiLabelMarginLoss(), # Margin loss for multi-label classification
nn.SmoothL1Loss(), # Smooth L1 loss, for regression problems
nn.SoftMarginLoss(), # Soft margin loss, for multi-label binary classification problems

Take cross entropy and mean squared error as examples

Mean square error loss

$$loss(x, y)=\frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2$$

nn.MSELoss(size_average=None, reduce=None, reduction='mean')
size_average=None, # Deprecated in favor of reduction. If True, the loss is the mean over each batch; otherwise it is the sum over each batch
reduce=None, # Deprecated in favor of reduction. Together with size_average, decides whether the mean or the sum over each batch is returned
reduction='mean', # One of 'none', 'mean', 'sum'; determines how the loss is reduced. The default is the mean
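
The three `reduction` modes can be illustrated with a plain-Python version of the mean squared error formula. The function `mse_loss` below is a hypothetical stand-in that works on lists of numbers; it only mirrors the behaviour described for `nn.MSELoss` and is not the library code.

```python
def mse_loss(x, y, reduction="mean"):
    """Squared error between two equal-length lists of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    squared = [(x_i - y_i) ** 2 for x_i, y_i in zip(x, y)]
    if reduction == "none":
        return squared                      # one value per element
    if reduction == "sum":
        return sum(squared)                 # sum over all elements
    if reduction == "mean":
        return sum(squared) / len(squared)  # 1/N * sum, as in the formula
    raise ValueError(f"unknown reduction: {reduction!r}")


predictions = [1.0, 2.0, 4.0]
targets = [1.0, 0.0, 1.0]
per_element = mse_loss(predictions, targets, reduction="none")
total = mse_loss(predictions, targets, reduction="sum")
mean = mse_loss(predictions, targets)
```

Here the per-element squared errors are 0, 4, and 9, so the sum is 13 and the mean is 13/3.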

Cross entropy loss

It combines LogSoftmax and NLLLoss in a single class and is generally used for multi-class classification problems

nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100,
                 reduce=None, reduction='mean')
weight=None, # A 1-dimensional tensor with C elements, giving the weight of each of the C classes. It is very useful when the training samples are unbalanced
ignore_index=-100, # Specifies a target value that is ignored and does not contribute to the input gradient

When weight = None

$$loss(x, class)=-\log\frac{\exp(x[class])}{\sum_j \exp(x[j])} =-x[class]+\log\Bigl(\sum_j \exp(x[j])\Bigr)$$

When weight is specified

$$loss(x, class) = weight[class]\times\Bigl(-x[class]+\log\bigl(\sum_j \exp(x[j])\bigr)\Bigr)$$
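
Both formulas can be checked numerically with a short standard-library sketch for a single sample. The function `cross_entropy` is a hypothetical helper, not the `nn.CrossEntropyLoss` implementation; subtracting the maximum score before exponentiating is a common stability trick added here as an assumption, and it does not change the result.

```python
import math


def cross_entropy(x, target, weight=None):
    """Loss for one sample: x holds the raw scores, target is the class index."""
    shift = max(x)  # subtract the max so exp() cannot overflow
    log_sum_exp = shift + math.log(sum(math.exp(x_j - shift) for x_j in x))
    loss = -x[target] + log_sum_exp
    if weight is not None:
        loss *= weight[target]  # per-class weight, as in the second formula
    return loss


scores = [1.0, 2.0, 3.0]
unweighted = cross_entropy(scores, 2)
weighted = cross_entropy(scores, 2, weight=[1.0, 1.0, 0.5])
```

The unweighted value equals the negative logarithm of the softmax probability of the target class, and the weighted value is that loss multiplied by `weight[2]`.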

Prevent overfitting

Overfitting means that the model classifies or predicts well on the training set, but its results on the test set are not ideal.

In essence, the bias of the model is very small, but its variance is very large.

Common methods to prevent overfitting include:

  1. Increase the amount of data.
  2. Split the data reasonably. Use reasonable proportions for the training set, validation set, and test set.
  3. Regularization. A penalty on the norm of the trainable parameters is added to the loss function to constrain them. Common choices are the $l_1$ and $l_2$ norms.
  4. Dropout. A dropout layer randomly drops some neurons, that is, it makes each neuron stop working with a certain probability p, which reduces overfitting of the network.
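
Items 3 and 4 of the list can be sketched with the standard library alone. The helpers `l2_penalty` and `dropout` are hypothetical names used for illustration. The dropout sketch assumes the "inverted" variant, which rescales the surviving values by 1/(1-p) during training so that nothing needs to change at evaluation time.

```python
import random


def l2_penalty(params, lam):
    """L2 penalty term lam * sum(w**2) that is added to the loss."""
    return lam * sum(w * w for w in params)


def dropout(values, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training:
        return list(values)  # evaluation mode: values pass through unchanged
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)                      # fixed seed so the run is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, 0.5, rng)
penalty = l2_penalty([1.0, -2.0], lam=0.01)
```

With `p=0.5`, roughly half of the activations become 0 and the others become 2.0, so the expected value of each activation is unchanged.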

Network parameter initialization

In order to obtain high-precision training results, some specific parameter initialization methods are used instead of default parameter initialization.

See the textbook for common initialization methods. The following is an example of parameter initialization.

Initialize the weight of a layer

Take a convolution layer as an example

# Define a convolution layer mapping 3 input channels to 16 output channels
conv1 = nn.Conv2d(3, 16, 3)
# Initialize the weights from a standard normal distribution
torch.manual_seed(12)    # Seed for random number generation
# Replace the original data of the tensor conv1.weight with the generated random numbers
nn.init.normal_(conv1.weight, mean=0, std=1)
# Plot a histogram of the weights (the argument was garbled in the source; flattening conv1.weight is assumed)
plt.figure(figsize=(8, 6))
plt.hist(conv1.weight.data.numpy().reshape((-1, 1)), bins=30)
plt.show()

[Figure: histogram of the initialized conv1 weights (output_32_0.png)]

# Initialize the bias with a specified value
# Reinitialize every element of the bias of conv1 to 0.1
nn.init.constant_(conv1.bias, val=0.1)
Parameter containing:
tensor([0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
        0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000],
       requires_grad=True)

Weight initialization method for a network

Initialize the parameters of each layer of a multi-layer network

# Build the test network
class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
        # The ReLU layers are restored from the printed network structure
        self.hidden = nn.Sequential(
            nn.Linear(1600, 100),
            nn.ReLU(),
            nn.Linear(100, 50),
            nn.ReLU())
        self.cla = nn.Linear(50, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = x.view(x.shape[0], -1)    # Flatten to (batch, 1600)
        x = self.hidden(x)
        output = self.cla(x)
        return output

# Output the network structure
from torchsummary import summary
testnet = TestNet()
summary(testnet, input_size=(3, 12, 12))
print(testnet)
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 16, 10, 10]             448
            Linear-2                  [-1, 100]         160,100
              ReLU-3                  [-1, 100]               0
            Linear-4                   [-1, 50]           5,050
              ReLU-5                   [-1, 50]               0
            Linear-6                   [-1, 10]             510
Total params: 166,108
Trainable params: 166,108
Non-trainable params: 0
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.63
Estimated Total Size (MB): 0.65
TestNet(
  (conv1): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))
  (hidden): Sequential(
    (0): Linear(in_features=1600, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=50, bias=True)
    (3): ReLU()
  )
  (cla): Linear(in_features=50, out_features=10, bias=True)
)

Why does the summary above look like this? It may help to review the torch.nn chapter.

First, we set the parameters of conv1: $in\_channels=3, out\_channels=16, kernel\_size=3$.

In terms of the parameter calculation, this means $D_1=3, D_2=16$ and a filter size of $F\times F=3\times3$. The default construction parameters are $stride=1, padding=0$, i.e. $S=1, P=0$.

In the summary method, $input\_size = D_1\times W_1\times H_1$. That is, the input of the convolution has size $W_1\times H_1\times D_1 = 12\times12\times3$.

These are known quantities.

According to the calculation method of the convolution layer, where $D_2$ equals the number of filters ($out\_channels$):
$$W_2=\frac{W_1-F+2P}{S}+1,\ H_2=\frac{H_1-F+2P}{S}+1,\ D_2=out\_channels$$
So $W_2=W_1-2,\ H_2=H_1-2$.

The output is $W_2\times H_2\times D_2=(W_1-2)\times(H_1-2)\times D_2=10\times10\times16$, that is, $1600$ values in total once flattened.
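
The output size calculation can be reproduced with a few lines of plain Python. The function `conv2d_output_shape` is a hypothetical helper that only evaluates the formula above for a square filter; integer division is used on the assumption that fractional sizes are rounded down.

```python
def conv2d_output_shape(w1, h1, out_channels, f, stride=1, padding=0):
    """Return (D2, W2, H2) for a square F x F convolution."""
    w2 = (w1 - f + 2 * padding) // stride + 1
    h2 = (h1 - f + 2 * padding) // stride + 1
    return out_channels, w2, h2


# conv1 from the test network: 12 x 12 input, 16 filters of size 3 x 3
d2, w2, h2 = conv2d_output_shape(12, 12, out_channels=16, f=3)
flattened = d2 * w2 * h2  # number of values passed on to the first linear layer
```

For the test network this gives an output of 16 x 10 x 10, which flattens to the 1600 input features of the first fully connected layer.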

Then we set up the first fully connected layer. Parameters of linear2: $in\_features=?, out\_features=?$

The output of the previous layer is the input of the next layer, so $in\_features=1600$. $out\_features$ is chosen according to what the layer has to output, for example the number of classes in the last layer.

The subsequent fully connected layers are similar.

All of the above was a digression. Next, we define the weight initialization function for each layer of the network above
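
The "Param #" column of the summary can be verified in the same spirit. The helpers `conv2d_params` and `linear_params` are hypothetical names; they count one bias per output channel or output feature, which matches the `bias=True` shown in the printed network structure.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights of one F x F x D1 filter per output channel, plus one bias each."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return (in_features + 1) * out_features


counts = [
    conv2d_params(3, 16, 3),    # conv1
    linear_params(1600, 100),   # hidden[0]
    linear_params(100, 50),     # hidden[2]
    linear_params(50, 10),      # cla
]
total = sum(counts)
```

The four counts are 448, 160,100, 5,050, and 510, and their sum is the 166,108 total parameters reported in the summary. The ReLU layers have no parameters.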

def init_weights(m):
    # If it is a convolution layer
    if type(m) == nn.Conv2d:
        torch.nn.init.normal_(m.weight, mean=0, std=0.5)    # Normal distribution
    # If it is a fully connected layer
    if type(m) == nn.Linear:
        torch.nn.init.uniform_(m.weight, a=-0.1, b=0.1)    # Uniform distribution on [-0.1, 0.1]
        # Set the bias to 0.01 (this call was lost in the source; constant_ is assumed)
        torch.nn.init.constant_(m.bias, val=0.01)

# Use the apply method of the network to initialize the weights
torch.manual_seed(13)    # Seed for random number generation
testnet.apply(init_weights)    # Assumed call, matching the comment above; it returns the network
TestNet(
  (conv1): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))
  (hidden): Sequential(
    (0): Linear(in_features=1600, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=50, bias=True)
    (3): ReLU()
  )
  (cla): Linear(in_features=50, out_features=10, bias=True)
)

Keywords: AI Pytorch Deep Learning dnn

Added by trawets on Thu, 03 Feb 2022 14:02:26 +0200