PyTorch in detail: requires_grad, leaf and non-leaf nodes, torch.no_grad(), model.eval(), model.train(), and the BatchNorm layer

requires_grad

  • requires_grad indicates whether a tensor's gradient needs to be calculated
  • When backward() is called, gradients are not computed for every tensor. A gradient is computed only for tensors that satisfy both conditions: 1. the tensor itself has requires_grad=True (code example 1); 2. every tensor that depends on this tensor also has requires_grad=True, so that gradients can propagate back through them.
  • Among tensors with requires_grad=True:
    • By default, the gradient of a **non-leaf node** is freed after being used during back propagation and is not retained (see the sketch below).
    • By default, only the gradient of a **leaf node** is retained.
    • The retained gradient of a **leaf node** is stored in the tensor's grad attribute; in optimizer.step(), the data attribute of the leaf node is updated from grad, which is how the parameters are updated.
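
A minimal sketch of these rules (tensor names are my own): the non-leaf node keeps its gradient only if retain_grad() is called, while the leaf node always keeps its gradient in grad:

import torch

a = torch.tensor([2.0], requires_grad=True)  # leaf node
b = a * 3                                    # non-leaf node
b.retain_grad()                              # ask autograd to retain b's gradient
c = (b ** 2).sum()
c.backward()

print(a.is_leaf, a.grad)  # True tensor([36.])  (dc/da = 2 * b * 3)
print(b.is_leaf, b.grad)  # False tensor([12.]) (dc/db = 2 * b)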

Code example

Example 1: only tensors with requires_grad=True receive gradients

import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
	random.seed(seed)
	os.environ['PYTHONHASHSEED'] = str(seed) # disable hash randomization so experiments are reproducible
	np.random.seed(seed)
	torch.manual_seed(seed)
	torch.cuda.manual_seed(seed)
	torch.cuda.manual_seed_all(seed) # if you are using multi-GPU.
	torch.backends.cudnn.benchmark = False
	torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    
    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        x = self.fc1(x)

        return x
    
# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

model.fc1.weight.requires_grad = False  # fc1.weight will not receive a gradient
print(model.fc1.weight.grad)
print(model.fc1.bias.grad)  # fc1.bias still receives a gradient

output = model(x)
target = torch.tensor([1, 1, 1])
loss = loss_fn(output, target)

loss.backward()

print(model.fc1.weight.grad)
print(model.fc1.bias.grad)

result

(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
None
None
None
tensor([ 0.1875, -0.8615,  0.3708,  0.3033])

Example 2: use detach() to turn a non-leaf node into a leaf node that does not require a gradient (requires_grad=False)

import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
	random.seed(seed)
	os.environ['PYTHONHASHSEED'] = str(seed) # disable hash randomization so experiments are reproducible
	np.random.seed(seed)
	torch.manual_seed(seed)
	torch.cuda.manual_seed(seed)
	torch.cuda.manual_seed_all(seed) # if you are using multi-GPU.
	torch.backends.cudnn.benchmark = False
	torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    
    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        
        x = x.detach()  # detach x from the graph: x becomes a leaf node with x.requires_grad == False and x.grad_fn == None

        y = self.fc1(x)

        return y


# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

# bn1.weight (leaf node) before backward
print(model.bn1.weight.requires_grad)
print(model.bn1.weight.grad)

# fc1.weight (leaf node) before backward
print(model.fc1.weight.requires_grad)
print(model.fc1.weight.grad)

output = model(x)
target = torch.tensor([1, 1, 1])
loss = loss_fn(output, target)

loss.backward()

# bn1.weight (leaf node) after backward: the path was cut by detach(), so it gets no gradient
print(model.bn1.weight.requires_grad)
print(model.bn1.weight.grad)

# fc1.weight (leaf node) after backward: it still receives a gradient
print(model.fc1.weight.requires_grad)
print(model.fc1.weight.grad)

result

(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
True
None
True
None
True
None
True
tensor([[ 0.0053,  0.0341,  0.0272,  0.0231, -0.1196,  0.0164,  0.0442,  0.1511,
         -0.1146,  0.2443, -0.0513, -0.0404],
        [ 0.1127,  0.0141,  0.1857, -0.3597,  0.5626,  0.1670, -0.0569, -0.6800,
          0.5046, -1.0340,  0.2865,  0.2857],
        [-0.0225, -0.0370,  0.0581,  0.0509, -0.2231,  0.1119,  0.0313,  0.2887,
         -0.1560,  0.5342, -0.0799, -0.0358],
        [-0.0955, -0.0112, -0.2710,  0.2857, -0.2199, -0.2953, -0.0185,  0.2402,
         -0.2340,  0.2555, -0.1553, -0.2095]])

Leaf nodes and non-leaf nodes

  • Tensors fall into two categories: leaf nodes and non-leaf nodes
  • The is_leaf attribute tells whether a tensor is a leaf node

Leaf nodes

  • A leaf node can be understood as a tensor that does not depend on other tensors

  • In PyTorch, the weight and bias tensors of a neural network layer are leaf nodes; a user-created tensor, such as a = torch.tensor([1.0]), is also a leaf node

    import torch
    a = torch.tensor([1.0])
    
    a.is_leaf
    True
    
    b = a + 1
    b.is_leaf
    True
    
    • Note that b is also a leaf node! Numerically, b = a + 1 clearly depends on a. But from PyTorch's point of view everything exists for back propagation: since a has requires_grad=False, a does not need a gradient and is effectively "meaningless" during back propagation, so it can be considered outside the computation graph; therefore b is still a leaf node.

  • As another example, an ordinary non-leaf node in a computation graph can normally pass gradients back to the original leaf nodes during back propagation. However, after detach() splits a non-leaf node off as a new leaf node, the derivation path back to the original leaf nodes is interrupted, and they can no longer obtain gradient values, regardless of their requires_grad attribute.

  • Second, as noted above, a tensor that needs a derivative must have requires_grad=True; for a leaf node with requires_grad=False, PyTorch will not automatically compute its derivative. The sketch below illustrates both points.
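
A minimal sketch (tensor names are my own): with requires_grad=True, the result of an operation is a non-leaf node carrying a grad_fn, and detach() splits off a new leaf that no longer requires a gradient:

import torch

a = torch.tensor([1.0], requires_grad=True)
b = a + 1                          # depends on a tensor that requires grad
print(b.is_leaf, b.grad_fn)        # False <AddBackward0 ...>

c = b.detach()                     # split b off from the graph
print(c.is_leaf, c.requires_grad)  # True False

(b ** 2).sum().backward()          # gradients still flow along the original path
print(a.grad)                      # tensor([4.])  (d(b^2)/da = 2 * b = 4)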

Non-leaf nodes

By default, a non-leaf node stores the other key element of the PyTorch computation graph: grad_fn. It records the differentiable operation that produced the tensor (addition, subtraction, multiplication, division, square root, powers, exponentials, logarithms, trigonometric functions, and so on); with these recorded operations, the gradients of the leaf nodes can be computed.

import torch
import torch.nn as nn
import torch.optim as optim

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    
    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        
        print(x)

        y = self.fc1(x)

        return y


# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

output = model(x)
# result
(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
tensor([[-0.2112, -1.1580,  0.9010, -0.3500, -0.3878,  1.9242, -0.3629,  0.6713,
          0.4996,  2.3366,  0.1928,  0.7291],
        [-0.7056, -0.4324, -1.8940,  1.7456, -0.7856, -1.8655, -0.4469,  0.7612,
         -0.8044,  0.4850, -0.7059, -1.0746],
        [ 0.4769,  1.4226,  0.3125, -0.1074, -0.7744, -0.5955,  0.9378,  0.9242,
         -1.3836,  0.8161, -0.4706, -0.6202]], grad_fn=<ViewBackward>)

with torch.no_grad()

  • torch.no_grad() is a **context manager** used to disable gradient calculation. It is typically used during network inference (eval) to reduce memory consumption

  • Code wrapped in torch.no_grad() is not tracked by autograd. Forward propagation still produces outputs, but the computation history (grad_fn) is not recorded, so the parameters cannot be updated by back propagation. Specifically, for **non-leaf nodes**:

    1. the requires_grad attribute of non-leaf nodes becomes False
    2. the grad_fn attribute of non-leaf nodes becomes None

    In this way, no gradients are computed for non-leaf nodes. Although the requires_grad attribute of the leaf nodes (the learnable parameters of each layer of the model) is unchanged (still True), their gradients are not computed either: their grad attribute stays None, and calling loss.backward() raises an error (because loss, the final non-leaf node, has requires_grad=False and grad_fn=None). Therefore the learnable parameters of the model are not updated. A tensor-level sketch follows this list.

  • torch.no_grad() does not affect the train/eval behavior of the dropout and batchnorm layers; that behavior is controlled by model.train() and model.eval()
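
A minimal tensor-level sketch (names are my own) of the two effects listed above:

import torch

a = torch.tensor([1.0], requires_grad=True)

with torch.no_grad():
    b = a * 2
print(b.requires_grad, b.grad_fn)  # False None: not tracked

c = a * 2                          # same operation outside the context
print(c.requires_grad, c.grad_fn)  # True <MulBackward0 ...>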

Code example

import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
	random.seed(seed)
	os.environ['PYTHONHASHSEED'] = str(seed) # disable hash randomization so experiments are reproducible
	np.random.seed(seed)
	torch.manual_seed(seed)
	torch.cuda.manual_seed(seed)
	torch.cuda.manual_seed_all(seed) # if you are using multi-GPU.
	torch.backends.cudnn.benchmark = False
	torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    
    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        
        print("Non leaf node requires_grad: ", x.requires_grad)  # Wrequires for non leaf nodes_ grad
        print("Non leaf node grad_fn: ", x.grad_fn) # Grad of non leaf node_ fn

        y = self.fc1(x)

        return y


# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

print("Before package fc1.weight of requires_grad: ", model.fc1.weight.requires_grad)  # FC1 before package Weight requirements_ grad

with torch.no_grad():
    print("After package fc1.weight of requires_grad: ", model.fc1.weight.requires_grad)  # After wrapping FC1 Requirements of weight_ grad
    print("Pre training fc1.weight.grad: ", model.fc1.weight.grad)  # FC1 before training weight. grad

    output = model(x)

    target = torch.tensor([1, 1, 1])
    loss = loss_fn(output, target)
    
    # Not a normal usage; this only verifies that no gradient is computed and an error is raised
    loss.backward()
    print("fc1.weight.grad after backward: ", model.fc1.weight.grad)

result

(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
 Before the with block, fc1.weight.requires_grad:  True
 Inside the with block, fc1.weight.requires_grad:  True
 fc1.weight.grad before backward:  None
 Non leaf node requires_grad:  False
 Non leaf node grad_fn:  None
Traceback (most recent call last):
  File "/home/jyzhang/test/net.py", line 66, in <module>
    loss.backward()
  File "/home/jyzhang/anaconda3/envs/bbn/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/jyzhang/anaconda3/envs/bbn/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
# element 0 of tensors refers to the calculated loss

model.eval()

  • Under the hood

    • A custom network and all of the layers inside it inherit from nn.Module, and nn.Module has a training attribute that is True by default. model.eval() therefore sets the training attribute of the custom network and of every layer inside it to False

    • class net(nn.Module):
          def __init__(self):
              super(net, self).__init__()
              self.bn = nn.BatchNorm1d(3, track_running_stats=True)
          
          
          def forward(self, x):
              return self.bn(x)
      
      model = net()
      model.eval()
      print(model.training)
      print(model.bn.training)
      
      # output
      (bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
      False
      False
      
  • When performing validation in PyTorch, model.eval() is used to switch to test mode

  • This mode notifies the dropout and batchnorm layers to switch to their val (inference) behavior

    • In val mode, the dropout layer lets all activations pass through, while the batchnorm layer stops computing and updating mean and var and directly uses the mean and var learned during the training stage (learned here means accumulated through the running-average updates performed during forward propagation in training)
    • For a detailed analysis of the impact on the batchnorm layer, see the batch_normalization layer section below; there are many pitfalls here!
  • This mode does not affect the gradient computation of any layer; gradients are computed and stored exactly as in training mode (code example 1). Specifically:

    • requires_grad of the leaf nodes (the learnable parameters of each layer of the model) is unchanged (still True)
    • requires_grad of non-leaf nodes is True
    • grad_fn of non-leaf nodes is not None
    • therefore gradient computation is unaffected; even loss.backward() runs normally and computes gradients (though it is usually not called in eval mode)
  • Note that after training on the train samples, the resulting model is used to test new samples, and model.eval() must be called before model(test). Otherwise, the output for the same input would change from call to call even without any training, a property caused by the BN and Dropout layers in the model (code example 2)

  • If memory and computation time are not a concern, model.eval() alone is enough to obtain correct validation/test results (loss.backward() is not called during validation/test); adding with torch.no_grad() further speeds things up and saves GPU memory (gradients are neither computed nor stored), allowing faster evaluation or larger test batches. The sketch below shows the idiomatic combination.
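
A minimal sketch of the combined evaluation pattern (model and dataloader are assumed to exist):

model.eval()                  # switch dropout/batchnorm to inference behavior
with torch.no_grad():         # stop building the autograd graph
    for inputs, targets in dataloader:
        outputs = model(inputs)
        # ... compute metrics on outputs ...
model.train()                 # switch back before resuming training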

Code example 1

import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
	random.seed(seed)
	os.environ['PYTHONHASHSEED'] = str(seed) # disable hash randomization so experiments are reproducible
	np.random.seed(seed)
	torch.manual_seed(seed)
	torch.cuda.manual_seed(seed)
	torch.cuda.manual_seed_all(seed) # if you are using multi-GPU.
	torch.backends.cudnn.benchmark = False
	torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    
    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        
        print("Non leaf node requires_grad: ", x.requires_grad)  # Wrequires for non leaf nodes_ grad
        print("Non leaf node grad_fn: ", x.grad_fn) # Grad of non leaf node_ fn

        y = self.fc1(x)

        return y


# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x = torch.randn((3, 3, 8))

print("Switch to eval front fc1.weight of requires_grad: ", model.fc1.weight.requires_grad)  # Switch to FC1 before eval Weight requirements_ grad

model.eval()

print("Switch to eval front fc1.weight of requires_grad: ", model.fc1.weight.requires_grad)  # Switch to FC1 before eval Weight requirements_ grad

output = model(x)

target = torch.tensor([1, 1, 1])
loss = loss_fn(output, target)

# Generally not used this way; this only verifies that eval() does not change the gradient computation of any node
loss.backward()
print("After back propagation model.fc1.weight.grad: ", model.fc1.weight.grad)  # model.fc1.weight.grad after back propagation

result

(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
 Before eval, fc1.weight.requires_grad:  True
 After eval, fc1.weight.requires_grad:  True
 Non leaf node requires_grad:  True
 Non leaf node grad_fn:  <ViewBackward object at 0x7f656790a040>
After back propagation model.fc1.weight.grad:  tensor([[-0.0395, -0.0310, -0.0322, -0.0101, -0.1166, -0.0275, -0.0164,  0.0703,
         -0.0346,  0.1522, -0.0075, -0.0020],
        [ 0.1980,  0.1494,  0.2325, -0.0340,  0.5011,  0.2446,  0.1040, -0.2942,
          0.1612, -0.5969,  0.0538,  0.0527],
        [-0.0783, -0.0664, -0.0703, -0.0085, -0.2122, -0.0585, -0.0395,  0.1268,
         -0.0598,  0.2753, -0.0148, -0.0059],
        [-0.0801, -0.0521, -0.1300,  0.0526, -0.1723, -0.1586, -0.0480,  0.0971,
         -0.0669,  0.1695, -0.0315, -0.0449]])

Code example 2

import torch
import torch.nn as nn
import torch.optim as optim
import random
import os
import numpy as np

def seed_torch(seed=1029):
	random.seed(seed)
	os.environ['PYTHONHASHSEED'] = str(seed) # disable hash randomization so experiments are reproducible
	np.random.seed(seed)
	torch.manual_seed(seed)
	torch.cuda.manual_seed(seed)
	torch.cuda.manual_seed_all(seed) # if you are using multi-GPU.
	torch.backends.cudnn.benchmark = False
	torch.backends.cudnn.deterministic = True

seed_torch()

# Define a network
class net(nn.Module):
    def __init__(self, num_class=10):
        super(net, self).__init__()
        self.pool1 = nn.AvgPool1d(2)
        self.bn1 = nn.BatchNorm1d(3)
        self.fc1 = nn.Linear(12, 4)

    
    def forward(self, x):
        x = self.pool1(x)
        x = self.bn1(x)
        x = x.reshape(x.size(0), -1)
        
        y = self.fc1(x)

        return y


# Define network
model = net()

# Define loss
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=1e-2)

# Define training data
x1 = torch.randn((1, 3, 8))
x2 = torch.randn((1, 3, 8))
x3 = torch.randn((1, 3, 8))
x4 = torch.randn((1, 3, 8))

# Before switching to eval mode
print(model.bn1.running_mean)
model(x1)
print(model.bn1.running_mean)
model(x2)
print(model.bn1.running_mean)

# After switching to eval mode
model.eval()
print(model.bn1.running_mean)
model(x3)
print(model.bn1.running_mean)
model(x4)
print(model.bn1.running_mean)

# Switch to train mode again
model.train()
print(model.bn1.running_mean)
model(x1)
print(model.bn1.running_mean)
model(x2)
print(model.bn1.running_mean)

result

(bbn) jyzhang@admin2-X10DAi:~/test$ python net.py
tensor([0., 0., 0.])
tensor([-0.0287,  0.0524, -0.0517])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0249,  0.0314, -0.0441])
tensor([-0.0511,  0.0806, -0.0914])
tensor([-0.0450,  0.0568, -0.0798])

model.train()

Its function is the opposite of model.eval(): it sets the training attribute of the network and of every layer inside it back to True. The analysis is analogous to model.eval(); the snippet below shows the toggle.
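
A minimal sketch (reusing the net class defined above) of toggling the training attribute:

model = net()
print(model.training, model.bn1.training)  # True True (default)

model.eval()
print(model.training, model.bn1.training)  # False False

model.train()
print(model.training, model.bn1.training)  # True True again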

batch_normalization layer

BatchNorm in PyTorch

API

The main BatchNorm APIs in PyTorch share the following signature (shown for BatchNorm1d):

torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)

Parameter description

  • num_features

    When the input shape is (N, C, L), num_features should be C; here N is the batch size, C is the number of channels, and L is the sequence length

    When the input shape is (N, C), num_features should be C; here N is the batch size and C is the number of channels, where each channel represents a feature and the length dimension L is omitted

  • eps

    A value added to the denominator when normalizing the input data, to prevent division by zero

  • momentum

    The smoothing factor used when updating the global mean running_mean and variance running_var

  • affine

    When set to True, the BatchNorm layer has learnable affine parameters $\gamma, \beta$, whose variable names are weight and bias; otherwise these two variables are absent

  • track_running_stats

    When set to True, the layer tracks the batch statistics over the whole training process to obtain the global mean running_mean and variance running_var, rather than relying only on the statistics of the currently entered batch (see the sketch below)
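
A minimal sketch (my own) of the tensors these parameters create on the module:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True)
print(bn.weight)        # gamma: learnable, initialized to ones
print(bn.bias)          # beta: learnable, initialized to zeros
print(bn.running_mean)  # buffer: initialized to zeros, updated in training mode
print(bn.running_var)   # buffer: initialized to ones, updated in training mode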

Detailed explanation of parameters

Executive summary

Since model.train() and model.eval() control the training attribute, and the track_running_stats parameter controls whether batch statistics are tracked over the whole training process, it is worth noting that different combinations of these two attributes produce different computation behaviors, detailed below.

In detail

  • First, the normalization formula used in batch normalization:
    $$y=\frac{x-\mathrm{E}[x]}{\sqrt{\operatorname{Var}[x]+\epsilon}}$$
    where $\mathrm{E}[x]$ denotes the mean and $\operatorname{Var}[x]$ denotes the variance

  • About the parameter affine: if affine=True, then after the batch data is normalized by the formula above, an affine transformation is applied: the result is multiplied by the module's weight (initial value [1., 1., 1., ...]) and then the module's bias (initial value [0., 0., 0., ...]) is added. These two variables are learnable parameters and are updated during back propagation.

  • training=True, track_running_stats=True

    1. This is the common parameter setting during training, i.e. when model.train() is in effect

    2. In this case the mean running_mean (initial value [0., 0., 0., ...]) and variance running_var (initial value [1., 1., 1., ...]) exist, and they track the statistics of each incoming batch and are updated by
       $$x_{\text{new}} = (1-\text{momentum}) \times x_{\text{cur}} + \text{momentum} \times x_{\text{batch}}$$
       where $x_{\text{cur}}$ denotes running_mean and running_var before the update, and $x_{\text{batch}}$ denotes the mean and the unbiased sample variance (denominator N-1) of the current batch.

    3. However, when normalizing the current batch itself (with the formula above), the mean used is the mean of the current batch and the variance is the biased sample variance of the current batch (denominator N), not running_mean and running_var.

  • training=True, track_running_stats=False

    1. In this case running_mean and running_var do not exist (their values are None), so no batch statistics are tracked or updated
    2. When normalizing the current batch, the mean used is the mean of the current batch and the variance is the biased sample variance of the current batch (denominator N)
  • training=False, track_running_stats=True

    1. This is the common parameter setting during validation and testing, i.e. when model.eval() is in effect
    2. In this case running_mean (initial value [0., 0., 0., ...]) and running_var (initial value [1., 1., 1., ...]) exist, but they are no longer updated with the statistics of incoming batches
    3. When normalizing the current batch, the mean and variance used are running_mean and running_var
  • training=False, track_running_stats=False

    1. In this case running_mean and running_var do not exist (their values are None), so no batch statistics are tracked or updated
    2. When normalizing the current batch, the mean used is the mean of the current batch and the variance is the biased sample variance of the current batch (denominator N)
    3. This behaves the same as training=True, track_running_stats=False; the sketch after this list verifies the training-mode behavior numerically
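
A minimal sketch (my own) verifying that in training mode the normalization uses the biased batch statistics (denominator N), while running_var is updated with the unbiased variance (denominator N-1):

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, affine=False)  # affine=False isolates the pure normalization
x = torch.randn(8, 3)

y = bn(x)  # the module is in training mode by default

# normalization uses the biased batch variance (denominator N)
manual = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + bn.eps)
print(torch.allclose(y, manual, atol=1e-6))  # True

# running stats follow x_new = (1 - momentum) * x_cur + momentum * x_batch,
# starting from running_mean=0 and running_var=1, with the unbiased variance
print(torch.allclose(bn.running_mean, 0.1 * x.mean(0), atol=1e-6))  # True
print(torch.allclose(bn.running_var, 0.9 * torch.ones(3) + 0.1 * x.var(0, unbiased=True), atol=1e-6))  # True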

