# Batch normalization

Paper: https://arxiv.org/abs/1502.03167

Batch normalization is basically a standard component of current models.

To be honest, I still don't understand the underlying reason why batch normalization makes model training more stable. Fully understanding it involves a good deal of convex optimization theory and needs a very solid mathematical foundation.

So far, my understanding is that batch normalization transforms the input features of each layer to a common scale, so that features measured in different units become comparable. That is, the distribution of each feature is transformed to have mean 0 and variance 1.

A linear transformation is then applied to the normalized data.

For frequently asked questions about batch normalization, see: https://zhuanlan.zhihu.com/p/55852062

## Batch normalization for fully connected layers

First, consider how to apply batch normalization to a fully connected layer. Usually the batch normalization layer is placed between the affine transformation and the activation function of the fully connected layer. Let the input of the fully connected layer be $u$, the weight and bias parameters be $W$ and $b$, and the activation function be $\phi$. Denote the batch normalization operator by $\text{BN}$. Then the output of a fully connected layer with batch normalization is

$$\phi(\text{BN}(x)),$$

where the input $x$ of batch normalization is obtained from the affine transformation

$$x = Wu + b.$$

Consider a mini-batch of $m$ samples. The affine transformation produces a new mini-batch $\mathcal{B} = \{x^{(1)}, \ldots, x^{(m)}\}$, which is the input to the batch normalization layer. For any sample $x^{(i)} \in \mathbb{R}^d$, $1 \le i \le m$, in the mini-batch $\mathcal{B}$, the output of the batch normalization layer is also a $d$-dimensional vector

$$y^{(i)} = \text{BN}(x^{(i)}),$$

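which is obtained in the following steps. First, compute the mean and variance of the mini-batch $\mathcal{B}$:

$$\mu_\mathcal{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x^{(i)},$$

$$\sigma_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu_\mathcal{B}\right)^2,$$

where the square is taken element-wise. Next, standardize $x^{(i)}$ using an element-wise square root and an element-wise division:

$$\hat{x}^{(i)} \leftarrow \frac{x^{(i)} - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}},$$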

Here $\epsilon > 0$ is a very small constant that keeps the denominator greater than 0. On top of this standardization, the batch normalization layer introduces two learnable model parameters: the scale parameter $\gamma$ and the shift parameter $\beta$. Both have the same shape as $x^{(i)}$, that is, they are $d$-dimensional vectors. This is the linear transformation mentioned at the beginning of this article: after normalizing the features, apply a linear transformation.

The two parameters are combined with $\hat{x}^{(i)}$ by element-wise multiplication (symbol $\odot$) and addition:

$$y^{(i)} \leftarrow \gamma \odot \hat{x}^{(i)} + \beta.$$

This gives the batch-normalized output $y^{(i)}$ of $x^{(i)}$.
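The steps above can be sketched directly on a small tensor. This is only an illustration: the batch size, feature dimension and random input are made up, and `gamma` and `beta` are simply set to 1 and 0 rather than learned.

```python
import torch

torch.manual_seed(0)
m, d = 8, 4                        # hypothetical batch size and feature dimension
x = torch.randn(m, d) * 3.0 + 5.0  # stands in for the affine output x = Wu + b
eps = 1e-5

mu = x.mean(dim=0)                        # per-feature mean, shape [d]
var = ((x - mu) ** 2).mean(dim=0)         # per-feature (biased) variance, shape [d]
x_hat = (x - mu) / torch.sqrt(var + eps)  # standardized features

gamma = torch.ones(d)   # scale parameter, here fixed to 1
beta = torch.zeros(d)   # shift parameter, here fixed to 0
y = gamma * x_hat + beta

print(y.mean(dim=0))                 # close to 0 for every feature
print(y.var(dim=0, unbiased=False))  # close to 1 for every feature
```

Note that the statistics are taken over the batch dimension (`dim=0`), so every feature gets its own mean and variance.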

It is worth noting that the learnable scale and shift parameters keep open the possibility of undoing the batch normalization of $x^{(i)}$: it is enough to learn $\gamma = \sqrt{\sigma_\mathcal{B}^2 + \epsilon}$ and $\beta = \mu_\mathcal{B}$, which gives $y^{(i)} = x^{(i)}$. We can understand this as follows: if batch normalization is not beneficial, the model can in theory learn not to use it.
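A quick numerical check of this claim, as a minimal sketch with a made-up mini-batch: setting the scale to the batch standard deviation (including epsilon) and the shift to the batch mean gives back the original input.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 4) * 2.0 + 1.0  # hypothetical mini-batch, shape [m, d]
eps = 1e-5

mu = x.mean(dim=0)
var = ((x - mu) ** 2).mean(dim=0)
x_hat = (x - mu) / torch.sqrt(var + eps)

# Choose gamma and beta so that the affine step inverts the standardization
gamma = torch.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(torch.allclose(y, x, atol=1e-5))
```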

## Batch normalization for convolutional layers

For a convolutional layer, batch normalization is applied after the convolution and before the activation function. If the convolution outputs multiple channels, each channel is batch-normalized separately, and each channel has its own scale and shift parameters, both of which are scalars. Suppose the mini-batch contains $m$ samples and that, on a single channel, the width and height of the convolution output are $p$ and $q$. Then all $m \times p \times q$ elements of that channel are normalized together: they share the same mean and variance, namely the mean and variance of the $m \times p \times q$ elements in that channel.

To summarize with a concrete example:

For a fully connected layer whose output has shape [batch,256], normalization computes the mean and variance of each of the 256 columns over the batch.

For a convolutional layer whose output has shape [batch,96,5,5], that is, 96 feature maps of size 5x5 per sample, normalization is done per channel: for each of the 96 channels, the mean and variance are computed over batch x 5 x 5 values.
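The two shapes in this example can be checked with a few lines of torch. The batch size of 16 is an arbitrary choice for illustration.

```python
import torch

batch = 16  # hypothetical batch size

fc_out = torch.randn(batch, 256)
fc_mean = fc_out.mean(dim=0)              # one mean per column
print(fc_mean.shape)                      # torch.Size([256])

conv_out = torch.randn(batch, 96, 5, 5)
conv_mean = conv_out.mean(dim=(0, 2, 3))  # one mean per channel
print(conv_mean.shape)                    # torch.Size([96])

# Each channel mean averages batch * 5 * 5 values
values_per_channel = conv_out[:, 0].numel()
print(values_per_channel)                 # 400
```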

## Batch normalization in prediction

This raises another question: once the model has been trained, the forward pass at prediction time still has to be normalized, so which mean and variance should be used? Clearly not those of a single batch, but those of all samples, because gamma and beta were learned by accumulating updates over the whole training run rather than from any single batch. (Note that "samples" here does not mean the input images of the model but the inputs of each normalization layer, which change during training and differ from one normalization layer to another.) Therefore, while doing batch normalization during training, we also maintain an estimate of the mean and variance over all samples. A common method is the moving average.

You can use the following test code to see how moving_mean ends up close to 3, the average of the values fed in:

```python
momentum = 0.9
moving_mean = 0.0
for epoch in range(10):
    for mean in [1, 2, 3, 4, 5]:
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
    print(moving_mean)
```

As for why the batch means are not simply averaged directly, I asked on the PyTorch forum but have not received a reply yet.
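One way to look at the moving average is to unroll the recursion: the value seen k steps ago ends up with weight `(1 - momentum) * momentum**k`, so recent batches count more than old ones and no history has to be stored. This is only my reading of the formula, not an answer from the forum, but it may matter here because the inputs of a normalization layer keep changing during training. The sketch below checks that the recursive and the unrolled forms agree on the same sequence as the test code above.

```python
momentum = 0.9
means = [1, 2, 3, 4, 5] * 10  # same sequence of batch means as in the test code above

# Recursive update
moving_mean = 0.0
for mean in means:
    moving_mean = momentum * moving_mean + (1.0 - momentum) * mean

# Unrolled form: the value seen k steps ago has weight (1 - momentum) * momentum**k
n = len(means)
weighted_sum = sum(
    (1.0 - momentum) * momentum ** (n - 1 - t) * means[t] for t in range(n)
)

print(moving_mean, weighted_sum)  # the two values agree up to rounding
```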

Now let's summarize the computation of batch normalization and implement it. It has two parts: training and testing.

Training:

- Find the mean value of input x
- Find the variance of input x
- Normalize x
$$\hat{x}^{(i)} \leftarrow \frac{x^{(i)} - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}},$$

- Linear transformation of normalized x
$$y^{(i)} \leftarrow \gamma \odot \hat{x}^{(i)} + \beta.$$

Test:

- Use the moving-average mean and variance to compute the normalized values
- Linear transformation of normalized values
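The difference between the two modes can be observed with torch's built-in `nn.BatchNorm1d`. This is a small sketch on a made-up batch. Note that the built-in layer defines momentum the other way round from the test code earlier in this article: it computes `running = (1 - momentum) * running + momentum * batch_stat`, with `momentum = 0.1` by default.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)             # running_mean starts at 0, running_var at 1
x = torch.randn(8, 4) * 2.0 + 3.0  # hypothetical mini-batch

bn.train()
y_train = bn(x)                    # normalized with the statistics of this batch
print(y_train.mean(dim=0))         # close to 0

# After one training step the running mean is 0.9 * 0 + 0.1 * batch mean
print(torch.allclose(bn.running_mean, 0.1 * x.mean(dim=0), atol=1e-6))

bn.eval()
with torch.no_grad():
    y_eval = bn(x)                 # normalized with running_mean / running_var
print(y_eval.mean(dim=0))          # generally not 0
```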

Then you can write out the definition of BatchNorm

```python
import torch
from torch import nn


def batch_norm(is_training, X, eps, gamma, beta, running_mean, running_var, momentum):
    assert len(X.shape) in (2, 4)
    if is_training:
        if len(X.shape) == 2:
            # X: [batch, n], one mean/variance per feature
            mean = X.mean(dim=0, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True)
        else:
            # X: [batch, c, h, w], one mean/variance per channel
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Moving average, same convention as the test code above.
        # detach() keeps the running statistics out of the autograd graph.
        running_mean = momentum * running_mean + (1.0 - momentum) * mean.detach()
        running_var = momentum * running_var + (1.0 - momentum) * var.detach()
    else:
        X_hat = (X - running_mean) / torch.sqrt(running_var + eps)
    Y = gamma * X_hat + beta
    return Y, running_mean, running_var


class BatchNorm(nn.Module):
    def __init__(self, is_conv, in_channels):
        super(BatchNorm, self).__init__()
        # Parameter shape after a fully connected layer (x: [batch, n])
        # or after a convolutional layer (x: [batch, c, h, w])
        shape = (1, in_channels, 1, 1) if is_conv else (1, in_channels)
        # Learnable parameters, updated from their gradients
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # No gradient needed; updated in forward
        self.running_mean = torch.zeros(shape)
        self.running_var = torch.ones(shape)
        self.eps = 1e-5
        self.momentum = 0.9

    def forward(self, x):
        # If the running statistics are not on the same device as x, move them there
        if self.running_mean.device != x.device:
            self.running_mean = self.running_mean.to(x.device)
            self.running_var = self.running_var.to(x.device)
        # self.training is inherited from nn.Module; it defaults to True
        # and is set to False by calling .eval()
        Y, self.running_mean, self.running_var = batch_norm(
            self.training, x, self.eps, self.gamma, self.beta,
            self.running_mean, self.running_var, self.momentum)
        return Y
```

BatchNorm inherits from nn.Module and contains the learnable parameters gamma and beta, which are updated by back propagation. running_mean and running_var are not learned from gradients; they are computed during forward propagation.

BatchNorm has to distinguish whether it follows a convolutional layer or a fully connected layer. After a convolution, the mean and variance are computed separately for each channel.
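As a sanity check of the per-channel computation, the sketch below compares a hand-written normalization with torch's `nn.BatchNorm2d` in training mode. The input shape is made up, and the comparison relies on the built-in layer starting with scale 1 and shift 0.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(4, 3, 5, 5)  # hypothetical input, [batch, c, h, w]
eps = 1e-5

# Per-channel statistics over the batch, height and width dimensions
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
manual = (x - mean) / torch.sqrt(var + eps)

bn = nn.BatchNorm2d(3, eps=eps)  # weight = 1 and bias = 0 at initialisation
bn.train()
builtin = bn(x)

print(torch.allclose(manual, builtin, atol=1e-5))
```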

## Data loading

```python
batch_size, num_workers = 16, 2
train_iter, test_iter = learntorch_utils.load_data(batch_size, num_workers, None)
```

## Model definition

```python
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, 5),  # in_channels, out_channels, kernel_size
            BatchNorm(is_conv=True, in_channels=6),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),  # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            BatchNorm(is_conv=True, in_channels=16),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(16*4*4, 120),
            BatchNorm(is_conv=False, in_channels=120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            BatchNorm(is_conv=False, in_channels=84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

    def forward(self, img):
        feature = self.conv(img)
        output = self.fc(feature.view(img.shape[0], -1))
        return output


net = LeNet().cuda()
```
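The `16*4*4` in the first linear layer follows from the spatial size after the two convolution and pooling stages. The arithmetic below assumes 28x28 single-channel input images; the input size is not stated in this article, but 28x28 is consistent with the value 4. The helper functions `conv_out` and `pool_out` are written only for this illustration.

```python
def conv_out(size, kernel):
    # Convolution with stride 1 and no padding
    return size - kernel + 1


def pool_out(size, kernel, stride):
    # Max pooling without padding
    return (size - kernel) // stride + 1


size = 28                    # assumed input height/width (not stated in the text)
size = conv_out(size, 5)     # nn.Conv2d(1, 6, 5)  -> 24
size = pool_out(size, 2, 2)  # nn.MaxPool2d(2, 2)  -> 12
size = conv_out(size, 5)     # nn.Conv2d(6, 16, 5) -> 8
size = pool_out(size, 2, 2)  # nn.MaxPool2d(2, 2)  -> 4

flat_features = 16 * size * size
print(size, flat_features)   # 4 256
```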

## Loss function definition

```python
l = nn.CrossEntropyLoss()
```

## Optimizer definition

```python
opt = torch.optim.Adam(net.parameters(), lr=0.01)
```

## Evaluation function definition

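```python
def test():
    net.eval()  # use the running statistics in the batch normalization layers
    acc_sum, n = 0, 0
    with torch.no_grad():  # no autograd graph is needed for evaluation
        for X, y in test_iter:
            X, y = X.cuda(), y.cuda()
            y_hat = net(X)
            acc_sum += (y_hat.argmax(dim=1) == y).float().sum().item()
            n += y.shape[0]  # count samples, so a smaller last batch is handled
    net.train()  # switch back to training mode
    print('acc:%f' % (acc_sum / n))
```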

## Training

```python
import time

num_epochs = 5


def train():
    for epoch in range(num_epochs):
        net.train()  # batch normalization uses batch statistics while training
        train_l_sum, batch = 0, 0
        start = time.time()
        for X, y in train_iter:
            X, y = X.cuda(), y.cuda()  # move the tensors to GPU memory
            y_hat = net(X)             # forward propagation
            loss = l(y_hat, y)         # nn.CrossEntropyLoss applies softmax internally
            opt.zero_grad()            # clear the gradients
            loss.backward()            # back propagation, compute the gradients
            opt.step()                 # update the parameters from the gradients
            train_l_sum += loss.item()
            batch += 1
        end = time.time()
        time_per_epoch = end - start
        # loss is already averaged over each batch, so the value printed here
        # is the mean batch loss divided by batch_size
        print('epoch %d,train_loss %f,time %f' % (epoch + 1, train_l_sum / (batch * batch_size), time_per_epoch))
        test()


train()
```

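When I first added my hand-written BN layer, the GPU ran out of memory, while torch's own nn.BatchNorm2d and nn.BatchNorm1d had no such problem, so the implementation of BatchNorm was not good enough. A likely cause is updating running_mean and running_var from batch statistics that are still attached to the autograd graph: the running statistics then keep a reference to the graph of every forward pass, and when no backward pass follows (as in evaluation) the activations saved in that graph are never released. Updating the running statistics from detached tensors avoids this.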
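The effect can be seen on a toy tensor. In this sketch `w` merely stands in for a model parameter, and the names are made up for the illustration: a running statistic built from a non-detached batch mean carries a `grad_fn`, while one built from a detached mean does not.

```python
import torch

w = torch.ones(3, requires_grad=True)  # stands in for a model parameter
x = torch.randn(5, 3)
batch_mean = (x * w).mean(dim=0)       # part of the autograd graph

running_attached = 0.9 * torch.zeros(3) + 0.1 * batch_mean
running_detached = 0.9 * torch.zeros(3) + 0.1 * batch_mean.detach()

print(running_attached.grad_fn is not None)  # True: still linked to the graph
print(running_detached.grad_fn is None)      # True: a plain tensor
```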

The model using torch's built-in batch normalization layers is as follows:

```python
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, 5),  # in_channels, out_channels, kernel_size
            nn.BatchNorm2d(6),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),  # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.BatchNorm2d(16),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(16*4*4, 120),
            nn.BatchNorm1d(120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.BatchNorm1d(84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

    def forward(self, img):
        feature = self.conv(img)
        output = self.fc(feature.view(img.shape[0], -1))
        return output


net = LeNet().cuda()
```

The training output is as follows:

```
epoch 1,batch_size 4,train_loss 0.194394,time 50.538379
acc:0.789400
epoch 2,batch_size 4,train_loss 0.146268,time 52.352518
acc:0.789500
epoch 3,batch_size 4,train_loss 0.132021,time 52.240710
acc:0.820600
epoch 4,batch_size 4,train_loss 0.126241,time 53.277958
acc:0.824400
epoch 5,batch_size 4,train_loss 0.120607,time 52.067259
acc:0.831800
```