PyTorch: optimizers, loss functions and the deep neural network framework
Copyright: Jingmin Wei, Pattern Recognition and Intelligent System, School of Artificial Intelligence, Huazhong University of Science and Technology
Common optimizers
Stochastic gradient descent (SGD) is a basic optimization algorithm in machine learning, used to fit model parameters from data; please review its principles and code on your own.
The drawbacks of plain SGD are that an appropriate learning rate is hard to choose and that the algorithm easily converges to a local minimum.
There are two common remedies: introduce momentum to update the parameters dynamically, and introduce a dynamically changing learning rate.
With momentum, the update takes into account both the direction of the previous parameter update and the gradient computed on the current batch, and the magnitude and direction of the update are determined jointly. This stabilizes the learning of the parameters and makes them converge faster.
$$m_t=\mu\times m_{t-1}+g_t,\qquad \Delta\theta_t=\eta\times m_t$$
When the gradient points in the same direction as the previous update, the update step becomes larger; otherwise it becomes smaller. In the middle and later stages of learning, the gradient oscillates near a local optimum (where $g_t \approx 0$), so momentum makes it possible to jump out of the local optimum.
A commonly used variant is Nesterov momentum.
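As a minimal sketch (the toy model, data and hyperparameter values below are illustrative and not from the original text), momentum and Nesterov momentum are enabled in PyTorch by passing the corresponding arguments to torch.optim.SGD:

import torch
from torch import nn, optim

# Toy model and batch, only to demonstrate the optimizer call
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

# SGD with momentum mu = 0.9; nesterov=True switches to Nesterov momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()   # clear accumulated gradients
loss.backward()         # compute g_t
optimizer.step()        # apply the momentum update m_t and the parameter change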
A dynamically changing learning rate is also introduced: at the beginning of training the parameters are far from the optimum, so a larger learning rate should be used; after several rounds of training the learning rate is reduced. Adaptive learning-rate algorithms such as Adadelta, RMSProp and Adam were developed from this idea.
For example, Adam dynamically estimates a per-parameter learning rate from the first and second moments of the gradient, which makes the parameter updates more stable and the model converge faster.
Common optimizers
In the optim module, a variety of optimization algorithms for deep learning are provided.
import torch
from torch import nn
from torch import optim
import matplotlib.pyplot as plt
torch.optim.Adadelta()
torch.optim.Adagrad()
torch.optim.Adam()
torch.optim.ASGD()      # Averaged stochastic gradient descent
torch.optim.LBFGS()
torch.optim.RMSprop()
torch.optim.Rprop()     # Resilient backpropagation
torch.optim.SGD()       # Stochastic gradient descent
Adam optimizer
Taking the Adam optimizer as an example: Adam is a powerful, general-purpose and widely used "all-purpose" optimizer in deep learning. Its main idea is to estimate the first moment of the gradient (First Moment Estimation) and the second moment (Second Moment Estimation, i.e. the uncentered variance of the gradient) and combine them to compute the update step of each iteration.
The Adam algorithm builds on RMSProp: in addition to the exponentially weighted moving average of the squared gradient, it also keeps an exponentially weighted moving average of the mini-batch stochastic gradient itself, so Adam can be viewed as a combination of the RMSProp algorithm and the momentum method.
Adam uses a momentum variable $\boldsymbol{v}_t$ and the RMSProp-style exponentially weighted moving average $\boldsymbol{s}_t$ of the element-wise squared mini-batch stochastic gradient, and at time step $0$ both are initialized to $0$.
Given the hyperparameter $0 \leq \beta_1 < 1$ (the algorithm's authors suggest $0.9$), at time step $t$ the momentum variable $\boldsymbol{v}_t$ is the exponentially weighted moving average of the mini-batch stochastic gradient $\boldsymbol{g}_t$:

$$\boldsymbol{v}_t \leftarrow \beta_1 \boldsymbol{v}_{t-1} + (1 - \beta_1) \boldsymbol{g}_t.$$
As in the RMSProp algorithm, given the hyperparameter $0 \leq \beta_2 < 1$ (the authors suggest $0.999$), the element-wise squared mini-batch stochastic gradient $\boldsymbol{g}_t \odot \boldsymbol{g}_t$ is averaged with an exponentially weighted moving average to obtain $\boldsymbol{s}_t$:

$$\boldsymbol{s}_t \leftarrow \beta_2 \boldsymbol{s}_{t-1} + (1 - \beta_2) \boldsymbol{g}_t \odot \boldsymbol{g}_t.$$
Because the elements of $\boldsymbol{v}_0$ and $\boldsymbol{s}_0$ are initialized to $0$, at time step $t$ we get $\boldsymbol{v}_t = (1-\beta_1) \sum_{i=1}^t \beta_1^{t-i} \boldsymbol{g}_i$. Summing the weights of the past mini-batch stochastic gradients gives $(1-\beta_1) \sum_{i=1}^t \beta_1^{t-i} = 1 - \beta_1^t$. Note that when $t$ is small, this sum of weights is small; for example, when $\beta_1 = 0.9$, $\boldsymbol{v}_1 = 0.1\boldsymbol{g}_1$. To eliminate this effect, for any time step $t$ we can divide $\boldsymbol{v}_t$ by $1 - \beta_1^t$, so that the weights of the past mini-batch stochastic gradients sum to $1$. This is called bias correction. In the Adam algorithm, bias correction is applied to both $\boldsymbol{v}_t$ and $\boldsymbol{s}_t$:

$$\hat{\boldsymbol{v}}_t \leftarrow \frac{\boldsymbol{v}_t}{1 - \beta_1^t},\qquad \hat{\boldsymbol{s}}_t \leftarrow \frac{\boldsymbol{s}_t}{1 - \beta_2^t}.$$
Next, the Adam algorithm uses the bias-corrected variables $\hat{\boldsymbol{v}}_t$ and $\hat{\boldsymbol{s}}_t$ to rescale the learning rate of each element of the model parameters with element-wise operations:

$$\boldsymbol{g}_t' \leftarrow \frac{\eta \hat{\boldsymbol{v}}_t}{\sqrt{\hat{\boldsymbol{s}}_t} + \epsilon},$$
where $\eta$ is the learning rate and $\epsilon$ is a constant added for numerical stability, e.g. $10^{-8}$. As in the AdaGrad, RMSProp and AdaDelta algorithms, every element of the objective function's independent variables gets its own learning rate. Finally, $\boldsymbol{g}_t'$ is used to update the variables:

$$\boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{g}_t'.$$
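To make the equations concrete, here is a minimal sketch of the Adam update transcribed directly from the formulas above (a toy objective and hyperparameter values; this is not PyTorch's internal implementation):

import torch

eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8   # eta, beta_1, beta_2, epsilon (suggested defaults)

x = torch.randn(5)            # parameters x_0 of a toy objective f(x) = sum(x^2)
v = torch.zeros_like(x)       # first-moment estimate, v_0 = 0
s = torch.zeros_like(x)       # second-moment estimate, s_0 = 0

for t in range(1, 101):
    g = 2 * x                                   # gradient g_t of the toy objective
    v = beta1 * v + (1 - beta1) * g             # v_t
    s = beta2 * s + (1 - beta2) * g * g         # s_t
    v_hat = v / (1 - beta1 ** t)                # bias-corrected v_t
    s_hat = s / (1 - beta2 ** t)                # bias-corrected s_t
    x = x - eta * v_hat / (s_hat.sqrt() + eps)  # x_t <- x_{t-1} - g'_t

print(x)   # the parameters move toward 0, the minimizer of f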
torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False)
params,               # iterable of parameters to optimize, or dicts defining parameter groups; usually model.parameters()
lr=1e-3,              # learning rate, default 0.001
betas=(0.9, 0.999),   # coefficients used to compute the running averages of the gradient and its square
eps=1e-8,             # term added to the denominator to improve numerical stability
weight_decay=0,       # weight decay (L2 penalty)
amsgrad=False         # whether to use the AMSGrad variant of the algorithm
Establish a test network to demonstrate the use of the optimizer
# Build a test network
class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()  # initialize attributes inherited from the parent class
        # Define the hidden layer
        self.hidden = nn.Sequential(nn.Linear(13, 10), nn.ReLU(),)
        # Define the regression (prediction) layer
        self.regression = nn.Linear(10, 1)

    def forward(self, x):
        # Define the forward propagation path
        x = self.hidden(x)
        output = self.regression(x)
        return output
testnet = TestNet() # Build object
# Use the optimizer to define a uniform learning rate for all layers
optimizer = optim.Adam(testnet.parameters(), lr=0.001)
# Use the optimizer to define different learning rates for different layers
optimizer = optim.Adam([{'params': testnet.hidden.parameters(), 'lr': 0.0001},
                        {'params': testnet.regression.parameters(), 'lr': 0.01}])
# Skeleton of the objective-function optimization loop (not a runnable example)
# Define the loss function as cross entropy (other loss functions can also be used)
loss_function = nn.CrossEntropyLoss()
for input, target in dataset:
    optimizer.zero_grad()                   # clear gradients
    output = testnet(input)                 # compute predictions
    loss = loss_function(output, target)    # compute the loss
    loss.backward()                         # backpropagate the loss
    optimizer.step()                        # update the parameters
optim.lr_scheduler provides several ways to adjust the learning rate of the optimizer.
last_epoch=-1,  # when to start adjusting the learning rate; -1 means the learning rate starts from the initial value
step_size,      # the learning rate is multiplied by gamma every step_size epochs
milestones,     # a list of epochs at which the learning rate is multiplied by gamma
T_max,          # the learning rate schedule is reset after T_max epochs
eta_min,        # minimum learning rate in each cycle
# See the textbook for the specific adjustment algorithms
optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)
optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)
# Set the learning-rate adjustment method
# (lr_lambda below is an example multiplicative decay; the original leaves the arguments unspecified)
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()   # update the learning rate
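As a concrete, runnable sketch (the toy model, learning rate and schedule values are illustrative, not from the original text), StepLR multiplies the learning rate by gamma every step_size epochs; the current value can be read with get_last_lr():

import torch
from torch import nn, optim

model = nn.Linear(4, 2)                                    # toy model
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... training and validation would go here ...
    scheduler.step()
    if (epoch + 1) % 10 == 0:
        print(epoch + 1, scheduler.get_last_lr())          # [0.05], then [0.025], then [0.0125]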
Loss functions
A loss function measures the gap between the predicted values and the true values in an iteration.
The goal of the optimization problem is to minimize the loss function.
The nn module provides a variety of loss functions for deep learning.
nn.L1Loss(),                # mean absolute error loss, for regression problems
nn.MSELoss(),               # mean squared error loss, for regression problems
nn.CrossEntropyLoss(),      # cross-entropy loss, for multi-class classification
nn.NLLLoss(),               # negative log-likelihood loss, for multi-class classification
nn.NLLLoss2d(),             # image negative log-likelihood loss, for image segmentation
nn.KLDivLoss(),             # KL-divergence loss, for regression problems
nn.BCELoss(),               # binary cross-entropy loss, for binary classification
nn.MarginRankingLoss(),     # ranking loss that evaluates the similarity of two inputs
nn.MultiLabelMarginLoss(),  # multi-label classification loss
nn.SmoothL1Loss(),          # smooth L1 loss, for regression problems
nn.SoftMarginLoss(),        # two-class classification logistic loss
Take mean squared error and cross entropy as examples.
Mean square error loss
$$\mathrm{loss}(x, y)=\frac{1}{N}\sum_i (x_i - y_i)^2$$
nn.MSELoss(size_average=None, reduce=None, reduction='mean')
size_average=None,   # if True, the loss is averaged over each batch; otherwise it is summed
reduce=None,         # together with size_average, determines whether the per-batch loss is averaged or summed
reduction='mean',    # 'none', 'mean' or 'sum'; how the loss is reduced, the default is the mean
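A quick sketch (toy tensors, not from the original text) showing that nn.MSELoss with the default reduction='mean' matches the formula above, and how reduction='sum' differs:

import torch
from torch import nn

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 2.0, 2.0])

print(nn.MSELoss()(pred, target))                 # mean: (0.25 + 0 + 1) / 3 = 0.4167
print(nn.MSELoss(reduction='sum')(pred, target))  # sum:  1.25
print(((pred - target) ** 2).mean())              # manual check of the formula above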
Cross entropy loss
It combines LogSoftmax and NLLLoss in a single class and is generally used for multi-class classification problems.
nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
weight=None,        # a 1-D tensor with n elements giving the weight of each of the n classes; useful when the training samples are unbalanced
ignore_index=-100,  # specifies a target value that is ignored and does not contribute to the input gradient
When weight = None
$$\mathrm{loss}(x, class)=-\log\frac{\exp(x[class])}{\sum_j \exp(x[j])} =-x[class]+\log\Bigl(\sum_j \exp(x[j])\Bigr)$$
When weight is specified
$$\mathrm{loss}(x, class) = weight[class]\times\Bigl(-x[class]+\log\bigl(\sum_j \exp(x[j])\bigr)\Bigr)$$
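As a small check (random logits, not from the original text), nn.CrossEntropyLoss applied to raw scores equals nn.NLLLoss applied to LogSoftmax outputs, and both match the formula $-x[class]+\log\sum_j \exp(x[j])$ averaged over the batch:

import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 3)              # raw scores for 4 samples and 3 classes
target = torch.tensor([0, 2, 1, 1])     # class indices

ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
manual = (-logits[torch.arange(4), target] + torch.logsumexp(logits, dim=1)).mean()

print(ce, nll, manual)                  # all three values agree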
Prevent overfitting
Overfitting means that the classification or prediction fits the training set well, but the results on the test set are not ideal.
In essence, the bias of the model is small, but its variance is large.
Common methods to prevent overfitting include:
- Increase the amount of data.
- Reasonable data splitting. Split the training, validation and test sets in reasonable proportions.
- Regularization. A penalty norm on the trainable parameters is added to the loss function to constrain them; the common choices are the $l_1$ and $l_2$ norms.
- Dropout. A dropout layer is introduced to randomly drop some neurons, i.e. each neuron stops working with a certain probability $p$, which reduces the overfitting of the network (see the sketch after this list).
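A minimal sketch (the layer sizes, dropout probability and weight_decay value are illustrative, not from the original text): the $l_2$ penalty can be added through the optimizer's weight_decay argument, and dropout through an nn.Dropout layer:

import torch
from torch import nn, optim

# A small network with a Dropout layer: each hidden unit is dropped with probability p = 0.5
net = nn.Sequential(
    nn.Linear(13, 10),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(10, 1),
)

# weight_decay adds an l2 penalty on the parameters to the loss
optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)

net.train()   # dropout is active in training mode
net.eval()    # and disabled during evaluation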
Network parameter initialization
To obtain higher-accuracy training results, specific parameter initialization methods are often used instead of the default initialization.
See the textbook for common initialization methods. The following is an example of parameter initialization.
Initialize the weight of a layer
Take a convolution layer as an example
# Define a convolution layer mapping 3 input channels to 16 output channels
conv1 = nn.Conv2d(3, 16, 3)
# Initialize the weights with a standard normal distribution
torch.manual_seed(12)   # seed for the random number generator
# The generated random numbers replace the raw data of the tensor conv1.weight
nn.init.normal_(conv1.weight, mean=0, std=1)
plt.figure(figsize=(8, 6))
plt.hist(conv1.weight.data.numpy().reshape((-1, 1)), bins=30)
plt.show()
(Figure: histogram of the conv1 weights after standard normal initialization.)
# Initialize the bias with a specified value:
# reinitialize every element of conv1's bias parameter to 0.1
nn.init.constant_(conv1.bias, val=0.1)
Parameter containing:
tensor([0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
        0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000],
       requires_grad=True)
Weight initialization method for a network
Initialize the parameters of each layer of a multi-layer network.
# Build a test network
class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
        self.hidden = nn.Sequential(
            nn.Linear(1600, 100),
            nn.ReLU(),
            nn.Linear(100, 50),
            nn.ReLU()
        )
        self.cla = nn.Linear(50, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = x.view(x.shape[0], -1)
        x = self.hidden(x)
        output = self.cla(x)
        return output
# Print the network structure
from torchsummary import summary
testnet = TestNet()
summary(testnet, input_size=(3, 12, 12))
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 16, 10, 10]             448
            Linear-2                  [-1, 100]         160,100
              ReLU-3                  [-1, 100]               0
            Linear-4                   [-1, 50]           5,050
              ReLU-5                   [-1, 50]               0
            Linear-6                   [-1, 10]             510
================================================================
Total params: 166,108
Trainable params: 166,108
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.63
Estimated Total Size (MB): 0.65
----------------------------------------------------------------
print(testnet)
TestNet(
  (conv1): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))
  (hidden): Sequential(
    (0): Linear(in_features=1600, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=50, bias=True)
    (3): ReLU()
  )
  (cla): Linear(in_features=50, out_features=10, bias=True)
)
Why does the summary above look like this? You can review the torch.nn chapter.
First, the parameters of conv1 are: $in\_channels=3$, $out\_channels=16$, $kernel\_size=3$.
In the notation of the parameter calculation, $D_1=3$, $D_2=16$, and the filter size is $F\times F=3\times3$. The default constructor arguments are $stride=1$, $padding=0$, i.e. $S=1$, $P=0$.
In the summary call, $input\_size = D_1\times W_1\times H_1$, i.e. the convolution input has size $W_1\times H_1\times D_1 = 12\times12\times3$.
These are known quantities.
According to the output-size formula for a convolution layer:
$$W_2=\frac{W_1-F+2P}{S}+1,\qquad H_2=\frac{H_1-F+2P}{S}+1,\qquad D_2=\text{number of filters}=16$$
So,

$$W_2=W_1-2,\qquad H_2=H_1-2$$
The output is $W_2\times H_2\times D_2=(W_1-2)\times(H_1-2)\times16=10\times10\times16$, so the flattened output has $1600$ elements in total.
Next come the parameters of the first fully connected layer: $in\_features=?$, $out\_features=?$.
The output size of one layer equals the input size of the next, so $in\_features=1600$, while $out\_features$ is chosen according to the required output.
The remaining fully connected layers are handled in the same way (see the shape check below).
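As a quick check (the dummy input is illustrative), passing a random batch of the same input size through conv1 confirms the shapes derived above:

import torch

x = torch.randn(1, 3, 12, 12)                # dummy input with channels x H x W = 3 x 12 x 12
feat = testnet.conv1(x)
print(feat.shape)                            # torch.Size([1, 16, 10, 10])
print(feat.view(feat.shape[0], -1).shape)    # torch.Size([1, 1600]), matching nn.Linear(1600, 100)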
That was a digression. Now we define a weight initialization function for each layer of the network above.
def init_weights(m):
    # If it is a convolution layer
    if type(m) == nn.Conv2d:
        torch.nn.init.normal_(m.weight, mean=0, std=0.5)   # normal distribution
    # If it is a fully connected layer
    if type(m) == nn.Linear:
        torch.nn.init.uniform_(m.weight, a=-0.1, b=0.1)    # uniform distribution on [-0.1, 0.1]
        m.bias.data.fill_(0.01)                            # set the bias to 0.01
# Use the network's apply method to initialize the weights
torch.manual_seed(13)   # seed for the random number generator
testnet.apply(init_weights)
TestNet(
  (conv1): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))
  (hidden): Sequential(
    (0): Linear(in_features=1600, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=50, bias=True)
    (3): ReLU()
  )
  (cla): Linear(in_features=50, out_features=10, bias=True)
)