requires_grad
- requires_grad indicates whether a tensor's gradient needs to be calculated
- When backward() is called to compute gradients, not every tensor gets a gradient. A tensor's gradient is computed only when both of the following hold: 1. the tensor itself has requires_grad=True (code example 1); 2. all tensors that depend on this tensor also have requires_grad=True, i.e. the gradients of every tensor depending on it can be obtained
- By default, the gradient of a **non-leaf node** is used during back propagation and then freed; it is not retained
- By default, only the gradient of a **leaf node** is retained
- The retained gradient of a **leaf node** is stored in the tensor's grad attribute; in optimizer.step(), the leaf node's data attribute is updated from it, which is how the parameters are updated. A minimal sketch of gradient retention follows below.
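To illustrate, here is a minimal sketch (hypothetical tensors, not from the original examples) showing that a leaf node's gradient is kept by default, while a non-leaf node's gradient survives only if retain_grad() is requested:

```python
import torch

a = torch.tensor([2.0], requires_grad=True)  # leaf node
b = a * 3                                    # non-leaf node
b.retain_grad()                              # explicitly ask autograd to keep b's gradient
c = (b ** 2).sum()
c.backward()

print(a.grad)  # tensor([36.]) -- leaf gradient is retained by default
print(b.grad)  # tensor([12.]) -- kept only because of retain_grad()
```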
Code example
Example 1: only tensors with requires_grad=True get their gradient calculated
```python
import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)  # disable hash randomization so the experiment is reproducible
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        x = self.fc1(x)
        return x

# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

model.fc1.weight.requires_grad = False  # fc1.weight will not get a gradient
print(model.fc1.weight.grad)
print(model.fc1.bias.grad)              # fc1.bias still gets a gradient

output = model(x)
target = torch.tensor([1, 1, 1])
loss = loss_fn(output, target)
loss.backward()

print(model.fc1.weight.grad)
print(model.fc1.bias.grad)
```
result
```
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
None
None
None
tensor([ 0.1875, -0.8615,  0.3708,  0.3033])
```
Example 2: use detach() to stop gradient computation at a non-leaf node (requires_grad=False)
```python
import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)  # disable hash randomization so the experiment is reproducible
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        x = x.detach()           # split the non-leaf node off into a new leaf node;
                                 # the detached tensor already has grad_fn=None
        x.requires_grad = False  # already False after detach(); kept for emphasis
        y = self.fc1(x)
        return y

# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

# Parameters of leaf node bn1.weight before training
print(model.bn1.weight.requires_grad)
print(model.bn1.weight.grad)
# Parameters of leaf node fc1.weight before training
print(model.fc1.weight.requires_grad)
print(model.fc1.weight.grad)

output = model(x)
target = torch.tensor([1, 1, 1])
loss = loss_fn(output, target)
loss.backward()

# Parameters of leaf node bn1.weight after training
print(model.bn1.weight.requires_grad)
print(model.bn1.weight.grad)
# Parameters of leaf node fc1.weight after training
print(model.fc1.weight.requires_grad)
print(model.fc1.weight.grad)
```
result
```
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
True
None
True
None
True
None
True
tensor([[ 0.0053,  0.0341,  0.0272,  0.0231, -0.1196,  0.0164,  0.0442,
          0.1511, -0.1146,  0.2443, -0.0513, -0.0404],
        [ 0.1127,  0.0141,  0.1857, -0.3597,  0.5626,  0.1670, -0.0569,
         -0.6800,  0.5046, -1.0340,  0.2865,  0.2857],
        [-0.0225, -0.0370,  0.0581,  0.0509, -0.2231,  0.1119,  0.0313,
          0.2887, -0.1560,  0.5342, -0.0799, -0.0358],
        [-0.0955, -0.0112, -0.2710,  0.2857, -0.2199, -0.2953, -0.0185,
          0.2402, -0.2340,  0.2555, -0.1553, -0.2095]])
```
Leaf nodes and non-leaf nodes
- Tensors fall into two categories: leaf nodes and non-leaf nodes
- Use is_leaf to determine whether a tensor is a leaf node
Leaf nodes
- Leaf nodes can be understood as tensors that do not depend on other tensors
- In PyTorch, the weight and bias tensors of neural-network layers are leaf nodes; a user-defined tensor such as a = torch.tensor([1.0]) is also a leaf node
```python
>>> import torch
>>> a = torch.tensor([1.0])
>>> a.is_leaf
True
>>> b = a + 1
>>> b.is_leaf
True
```
- Notice that b is also a leaf node! Numerically, b = a + 1 clearly depends on a. But from PyTorch's point of view, everything exists to serve backward differentiation: since a.requires_grad is False, a does not request a gradient, so a is effectively "meaningless" during back propagation and can be considered excluded from the computation graph. Therefore b is still a leaf node.
- As another example, consider a leaf node in a computation graph that can normally receive its back-propagated gradient. Once detach() splits a non-leaf node off into a new leaf node, the differentiation path back to the original leaf node is cut, and that leaf node can no longer receive a gradient, no matter what its requires_grad attribute is. A minimal sketch follows.
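A minimal sketch (hypothetical tensors, not from the original examples) of detach() cutting the path back to the original leaf:

```python
import torch

a = torch.tensor([1.0], requires_grad=True)  # original leaf node
b = a * 2                                    # non-leaf node with grad_fn
c = b.detach()                               # new leaf node, cut off from a
d = (c * 3).sum()

print(c.is_leaf, c.requires_grad)  # True False
# d.backward() would raise "element 0 of tensors does not require grad":
# nothing downstream of the detach() requires a gradient, so no gradient
# can ever reach a through this path.
```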
- Conversely, as shown above, a tensor whose derivative is needed must have requires_grad=True; if a leaf node has requires_grad=False, PyTorch will not automatically compute its derivative.
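A minimal sketch (hypothetical tensors) of this requirement:

```python
import torch

w1 = torch.tensor([1.0], requires_grad=True)
w2 = torch.tensor([1.0], requires_grad=False)
y = (w1 * w2 * 3).sum()
y.backward()

print(w1.grad)  # tensor([3.])
print(w2.grad)  # None -- autograd never differentiates w.r.t. w2
```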
Non-leaf nodes
By default, a non-leaf node also stores grad_fn, the other key ingredient of the PyTorch computation graph. grad_fn records the differentiable operation that produced the tensor (addition, subtraction, multiplication, division, square root, powers, exponentials, logarithms, trigonometric functions, and so on); with these recorded operations, the gradients of the leaf nodes can be computed.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        print(x)  # non-leaf node: printed with its grad_fn
        y = self.fc1(x)
        return y

# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

output = model(x)
```
result
```
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
tensor([[-0.2112, -1.1580,  0.9010, -0.3500, -0.3878,  1.9242, -0.3629,
          0.6713,  0.4996,  2.3366,  0.1928,  0.7291],
        [-0.7056, -0.4324, -1.8940,  1.7456, -0.7856, -1.8655, -0.4469,
          0.7612, -0.8044,  0.4850, -0.7059, -1.0746],
        [ 0.4769,  1.4226,  0.3125, -0.1074, -0.7744, -0.5955,  0.9378,
          0.9242, -1.3836,  0.8161, -0.4706, -0.6202]],
       grad_fn=<ViewBackward>)
```
with torch.no_grad()
- torch.no_grad() is a **context manager** used to disable gradient computation. It is typically used for network inference (eval) to reduce memory consumption.
- The part wrapped by torch.no_grad() is not tracked. Forward propagation can still compute outputs, but the computation history (grad_fn) is not recorded, so parameters cannot be updated through back propagation. Specifically, for **non-leaf nodes**:
  - the requires_grad attribute of non-leaf nodes becomes False
  - the grad_fn attribute of non-leaf nodes becomes None
- As a result, gradients of non-leaf nodes are not computed. The leaf nodes (the model's learnable parameters) keep requires_grad=True, but their gradients are not computed either: their grad attribute stays None, and calling loss.backward() raises an error, because the first non-leaf node (the loss) has requires_grad=False and grad_fn=None. The model's learnable parameters are therefore not updated.
- torch.no_grad() does not affect whether dropout and batchnorm layers behave in train or eval mode; a minimal sketch follows.
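For example, here is a minimal sketch (not from the original examples) showing that a BatchNorm layer left in train mode still updates its running statistics inside torch.no_grad():

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)   # training=True by default
x = torch.randn(4, 3)

with torch.no_grad():    # gradients are disabled, but train/eval behavior is not
    bn(x)

print(bn.running_mean)   # no longer all zeros: the running stats were still updated
```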
Code example
```python
import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)  # disable hash randomization so the experiment is reproducible
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        print("Non leaf node requires_grad: ", x.requires_grad)  # requires_grad of a non-leaf node
        print("Non leaf node grad_fn: ", x.grad_fn)              # grad_fn of a non-leaf node
        y = self.fc1(x)
        return y

# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

print("Before wrapping, fc1.weight.requires_grad: ", model.fc1.weight.requires_grad)

with torch.no_grad():
    print("After wrapping, fc1.weight.requires_grad: ", model.fc1.weight.requires_grad)
    print("Pre-training fc1.weight.grad: ", model.fc1.weight.grad)
    output = model(x)
    target = torch.tensor([1, 1, 1])
    loss = loss_fn(output, target)
    # Not how no_grad is normally used: this is only to verify that no gradient
    # is computed and that backward() raises an error
    loss.backward()
    print("Post-training fc1.weight.grad: ", model.fc1.weight.grad)
```
result
```
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
Before wrapping, fc1.weight.requires_grad:  True
After wrapping, fc1.weight.requires_grad:  True
Pre-training fc1.weight.grad:  None
Non leaf node requires_grad:  False
Non leaf node grad_fn:  None
Traceback (most recent call last):
  File "/home/jyzhang/test/net.py", line 66, in <module>
    loss.backward()
  File "/home/jyzhang/anaconda3/envs/bbn/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/jyzhang/anaconda3/envs/bbn/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
# "element 0 of tensors" refers to the computed loss
```
model.eval()
- Under-the-hood analysis
  - Because a custom network and all layers inside it inherit from nn.Module, and nn.Module has a training attribute that is True by default, model.eval() sets the training attribute of the custom network and of each layer inside it to False
```python
import torch.nn as nn

class net(nn.Module):
    def __init__(self):
        super(net, self).__init__()
        self.bn = nn.BatchNorm1d(3, track_running_stats=True)

    def forward(self, x):
        return self.bn(x)

model = net()
model.eval()
print(model.training)
print(model.bn.training)
```
```
# output
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
False
False
```
- When validating in PyTorch, model.eval() switches the model to test mode
- This mode tells the dropout and batchnorm layers to switch to val behavior:
  - In val mode, the dropout layer lets all activation units pass through, while the batchnorm layer stops computing and updating mean and var, and directly uses the mean and var learned in the training stage ("learned" here means accumulated and updated while the training data propagated forward). A minimal dropout sketch follows this list.
  - For a detailed analysis of the effect on the batchnorm layer, see the batch_normalization layer section below; there are many pitfalls!
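A minimal sketch (not from the original examples) of the dropout half of this behavior:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # about half the units zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity: every unit passes through unchanged
```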
- This mode does not affect the gradient computation behavior of any layer; gradients are computed and stored exactly as in training mode (code example 1). Specifically:
  - the requires_grad attribute of the leaf nodes (the model's learnable parameters) is unchanged (still True)
  - the requires_grad attribute of non-leaf nodes is True
  - the grad_fn attribute of non-leaf nodes is not None
  - consequently, even loss.backward() runs normally and computes gradients (though it is usually not called in this mode)
- Note: after training on the train samples, the model is used to test samples. You must call model.eval() before model(test); otherwise, given the same input data, the output will change from call to call even though nothing was trained. This behavior comes from the BN and Dropout layers in the model (code example 2)
- If you do not care about memory size or compute time, model.eval() alone is enough to obtain correct validation/test results (loss.backward() is not called during validation/test). Adding with torch.no_grad() further speeds things up and saves GPU memory (gradients are neither computed nor stored), so you can compute faster or run a larger batch during testing. A minimal combined sketch follows.
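A minimal sketch of the combination, using a stand-in model and data (hypothetical, not from the original examples):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4), nn.Dropout(0.5), nn.Linear(4, 2))
val_batches = [torch.randn(3, 8) for _ in range(2)]  # stand-in validation data

model.eval()               # dropout/batchnorm switch to eval behavior
with torch.no_grad():      # no graph is recorded: less memory, faster
    for x in val_batches:
        output = model(x)
        print(output.requires_grad)  # False: nothing here is tracked
```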
Code example 1
```python
import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)  # disable hash randomization so the experiment is reproducible
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        print("Non leaf node requires_grad: ", x.requires_grad)
        print("Non leaf node grad_fn: ", x.grad_fn)
        y = self.fc1(x)
        return y

# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

print("Before eval, fc1.weight.requires_grad: ", model.fc1.weight.requires_grad)
model.eval()
print("After eval, fc1.weight.requires_grad: ", model.fc1.weight.requires_grad)

output = model(x)
target = torch.tensor([1, 1, 1])
loss = loss_fn(output, target)
# Not how eval mode is normally used: this is only to verify that eval does not
# change how gradients are computed at each node
loss.backward()
print("After back propagation, model.fc1.weight.grad: ", model.fc1.weight.grad)
```
result
```
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
Before eval, fc1.weight.requires_grad:  True
After eval, fc1.weight.requires_grad:  True
Non leaf node requires_grad:  True
Non leaf node grad_fn:  <ViewBackward object at 0x7f656790a040>
After back propagation, model.fc1.weight.grad:  tensor([[-0.0395, -0.0310, -0.0322, -0.0101, -0.1166, -0.0275, -0.0164,
          0.0703, -0.0346,  0.1522, -0.0075, -0.0020],
        [ 0.1980,  0.1494,  0.2325, -0.0340,  0.5011,  0.2446,  0.1040,
         -0.2942,  0.1612, -0.5969,  0.0538,  0.0527],
        [-0.0783, -0.0664, -0.0703, -0.0085, -0.2122, -0.0585, -0.0395,
          0.1268, -0.0598,  0.2753, -0.0148, -0.0059],
        [-0.0801, -0.0521, -0.1300,  0.0526, -0.1723, -0.1586, -0.0480,
          0.0971, -0.0669,  0.1695, -0.0315, -0.0449]])
```
Code example 2
```python
import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)  # disable hash randomization so the experiment is reproducible
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        y = self.fc1(x)
        return y

# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x1 = torch.randn((1, 3, 8))
x2 = torch.randn((1, 3, 8))
x3 = torch.randn((1, 3, 8))
x4 = torch.randn((1, 3, 8))

# Before switching to eval mode
print(model.bn1.running_mean)
model(x1)
print(model.bn1.running_mean)
model(x2)
print(model.bn1.running_mean)

# After switching to eval mode
model.eval()
print(model.bn1.running_mean)
model(x3)
print(model.bn1.running_mean)
model(x4)
print(model.bn1.running_mean)

# Switch back to train mode
model.train()
print(model.bn1.running_mean)
model(x1)
print(model.bn1.running_mean)
model(x2)
print(model.bn1.running_mean)
```
result
```
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
tensor([0., 0., 0.])
tensor([-0.0287,  0.0524, -0.0517])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0511,  0.0806, -0.0914])
tensor([-0.0450,  0.0568, -0.0798])
```
model.train()
Its effect is the opposite of model.eval(); the analysis is directly analogous to that of model.eval()
batch_normalization layer
BatchNorm in PyTorch
API
The main BatchNorm APIs in PyTorch include:
```python
torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)
```
Parameter description
- num_features
  - When the input shape is (N, C, L), num_features should be C; here N is the batch size, C is the number of channels, and L is the sequence length
  - When the input shape is (N, C), num_features should be C; here N is the batch size and C is the number of channels, each channel representing one feature (the length dimension L is omitted)
- eps
  - Added to the denominator when normalizing the input data, to prevent division by zero
- momentum
  - Used to smooth the updates of the global mean running_mean and global variance running_var
- affine
  - When set to True, the BatchNorm layer has the learnable parameters $\gamma, \beta$ (stored as weight and bias); when False, the layer contains neither variable
- track_running_stats
  - When set to True, the layer tracks the statistics of the batches seen over the whole training process to obtain the mean and variance, instead of relying only on the statistics of the currently input batch: the BatchNorm layer maintains the global mean running_mean and variance running_var. A short inspection sketch follows this list.
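A quick sketch (not from the original) inspecting these attributes on a freshly constructed layer:

```python
import torch.nn as nn

bn = nn.BatchNorm1d(3)      # defaults: affine=True, track_running_stats=True

print(bn.weight)            # learnable gamma, initialized to ones
print(bn.bias)              # learnable beta, initialized to zeros
print(bn.running_mean)      # buffer, initialized to zeros
print(bn.running_var)       # buffer, initialized to ones
print(bn.eps, bn.momentum)  # 1e-05 0.1
```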
Detailed explanation of parameters
Executive summary
Since model.train() and model.eval(), discussed above, control the training attribute, and the track_running_stats parameter controls whether batch statistics are tracked over the whole training process, it is worth noting that different combinations of these two attributes lead to different computation behaviors.
In detail
- First, the normalization formula. Batch normalization normalizes with:

  $$y = \frac{x - \mathrm{E}[x]}{\sqrt{\operatorname{Var}[x] + \epsilon}}$$

  where $\mathrm{E}[x]$ denotes the mean and $\operatorname{Var}[x]$ the variance.
- About the affine parameter again: if affine=True, after the batch data is normalized with the formula above, it is affine-transformed, i.e. multiplied by the module's weight (initialized to [1, 1, 1, ...]) and then added to the module's bias (initialized to [0, 0, 0, ...]). Both are learnable parameters and are updated during back propagation. A sketch verifying this end-to-end computation follows.
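A minimal sketch (not from the original) checking the train-mode formula, including the affine step, against nn.BatchNorm1d:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)              # train mode, affine=True
x = torch.randn(4, 3)               # (N, C) input

y = bn(x)                           # normalizes with the current batch's statistics

mean = x.mean(dim=0)                    # per-channel batch mean
var = x.var(dim=0, unbiased=False)      # biased sample variance (denominator N)
y_manual = (x - mean) / torch.sqrt(var + bn.eps) * bn.weight + bn.bias

print(torch.allclose(y.detach(), y_manual.detach(), atol=1e-6))  # True
```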
- training=True, track_running_stats=True
  - This is the usual parameter setting during training, i.e. the common situation when model.train() is in effect.
  - The layer keeps a mean running_mean (initialized to [0, 0, 0, ...]) and a variance running_var (initialized to [1, 1, 1, ...]), and both track the mean and variance of successive batches of data. The update formula is

    $$x_{\text{new}} = (1 - \text{momentum}) \times x_{\text{cur}} + \text{momentum} \times x_{\text{batch}}$$

    where $x_{\text{cur}}$ denotes running_mean and running_var before the update, and $x_{\text{batch}}$ denotes the mean and the unbiased sample variance (denominator N-1) of the current batch.
  - However, when the current batch itself is normalized (with the formula above), the mean used is the current batch's mean and the variance is the current batch's biased sample variance (denominator N), not running_mean and running_var. The sketch below checks the update formula.
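A sketch (not from the original) verifying the update formula and the unbiased variance used in the running-stat update:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum=0.1)
x = torch.randn(4, 3)

bn(x)  # one train-mode forward pass updates the running statistics

# starting from running_mean = 0 and running_var = 1:
expected_mean = 0.9 * 0 + 0.1 * x.mean(dim=0)
expected_var = 0.9 * 1 + 0.1 * x.var(dim=0, unbiased=True)  # unbiased, denominator N-1

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))  # True
print(torch.allclose(bn.running_var, expected_var, atol=1e-6))    # True
```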
- training=True, track_running_stats=False
  - In this case there is no global mean running_mean or variance running_var: both are None, so they do not track the statistics of incoming batches (see the sketch below).
  - When the current batch is normalized (with the formula above), the mean used is the current batch's mean and the variance is the current batch's biased sample variance (denominator N), not running_mean and running_var.
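A two-line sketch (not from the original) confirming the buffers are absent:

```python
import torch.nn as nn

bn = nn.BatchNorm1d(3, track_running_stats=False)
print(bn.running_mean, bn.running_var)  # None None
```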
- training=False, track_running_stats=True
  - This is the usual parameter setting during validation and testing, i.e. the common situation when model.eval() is in effect.
  - The layer has a mean running_mean (initialized to [0, 0, 0, ...]) and a variance running_var (initialized to [1, 1, 1, ...]), but they no longer track the statistics of incoming batches.
  - When the current batch is normalized (with the formula above), the mean and variance used are running_mean and running_var respectively, as the sketch below checks.
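A sketch (not from the original) confirming that eval-mode normalization uses the stored running statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
bn(torch.randn(4, 3))   # one train-mode pass populates the running statistics

bn.eval()
x = torch.randn(4, 3)
y = bn(x)

# eval mode normalizes with running_mean/running_var, not the batch statistics
y_manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps) \
           * bn.weight + bn.bias
print(torch.allclose(y.detach(), y_manual.detach(), atol=1e-6))  # True
```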
- training=False, track_running_stats=False
  - In this case there is no global mean running_mean or variance running_var: both are None, and no batch statistics are tracked or updated.
  - When the current batch is normalized (with the formula above), the mean used is the current batch's mean and the variance is the current batch's biased sample variance (denominator N).
  - This is the same behavior as training=True, track_running_stats=False.