Transformer hardware implementation part 3: supplement to PyTorch basic knowledge

This article supplements the PyTorch knowledge needed before training a Transformer. Thanks to the blogger Mofan for his video course on Python; his home page is Mofan Python ("Don't bother Python").

I recommend watching the blogger's video tutorial directly to fill in the PyTorch background. You can also read my written notes below.


Following on from the previous article, this is the last article in the PyTorch knowledge supplement. These topics are needed later for training the Transformer and for the hardware design. It covers:

  • Building a network model
  • Saving and loading a network
  • Batch data training
  • Optimizers
  • Batch normalization

1, Building a network model

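import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Build the data set: two 2-D Gaussian clusters, one per class
n_data = torch.ones(100, 2)
x0 = torch.normal(2*n_data, 0.6)     # class 0, centred at (2, 2)
y0 = torch.zeros(100)
x1 = torch.normal(-2*n_data, 0.6)    # class 1, centred at (-2, -2)
y1 = torch.ones(100)
x = torch.cat((x0, x1), 0).type(torch.FloatTensor)  # 32-bit float, shape (200, 2)
y = torch.cat((y0, y1), ).type(torch.LongTensor)    # 64-bit int, shape (200,)
# Wrapping tensors in Variable is not needed in PyTorch 0.4 and later

# The first method: subclass torch.nn.Module
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()   # required before registering layers
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.predict = torch.nn.Linear(n_hidden, n_output)
    def forward(self, x):
        x = F.relu(self.hidden(x))    # hidden layer + activation (ReLU assumed)
        x = self.predict(x)           # output layer, raw class scores
        return x
net2 = Net(2, 10, 2)

# The second method: torch.nn.Sequential, same structure as net2
net1 = torch.nn.Sequential(
    torch.nn.Linear(2, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),
)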

The code above shows two ways of building a network model. The first, subclassing torch.nn.Module, was introduced in the previous article, "Transformer hardware implementation part 2: supplement to PyTorch basic knowledge (1)".

The second method uses torch.nn.Sequential(), which chains the layers in the order they are listed, so there is no need to write a forward() method by hand.
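To make the relationship between the two styles concrete, here is a minimal sketch that builds the same 2-10-2 network both ways, copies the parameters across, and checks that the outputs agree. The ReLU activation and the names net_a, net_b and sample are assumptions chosen for illustration, not taken from the original post.

```python
import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super().__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.predict = torch.nn.Linear(n_hidden, n_output)

    def forward(self, x):
        x = F.relu(self.hidden(x))
        return self.predict(x)

torch.manual_seed(0)
net_a = Net(2, 10, 2)
net_b = torch.nn.Sequential(
    torch.nn.Linear(2, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),
)

# Copy net_a's parameters into net_b so the two can be compared directly.
# Sequential names its layers by index: 0 is the first Linear, 2 the second.
net_b[0].load_state_dict(net_a.hidden.state_dict())
net_b[2].load_state_dict(net_a.predict.state_dict())

sample = torch.randn(4, 2)
with torch.no_grad():
    same = torch.allclose(net_a(sample), net_b(sample))

print(net_b)   # layers are listed by index rather than by attribute name
print(same)
```

The Module subclass is the more flexible form when forward() needs branching or several inputs; Sequential is shorter when the data simply flows through the layers in order.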

2, Saving and loading a network

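optimizer = torch.optim.SGD(net1.parameters(), lr=0.02)  # lr=0.02 is an assumed value
loss_func = torch.nn.CrossEntropyLoss()
for i in range(40):
    out = net1(x)
    loss = loss_func(out, y)
    optimizer.zero_grad()   # clear gradients from the previous step
    loss.backward()         # backpropagate
    optimizer.step()        # update the parameters

# Saving
torch.save(net1, 'net.pkl')                       # entire net
torch.save(net1.state_dict(), 'net_params.pkl')   # parameters only

# The first loading scheme: restore the entire net
def restore_net():
    # Recent PyTorch versions need weights_only=False to unpickle a whole model
    net3 = torch.load('net.pkl', weights_only=False)
    out = net3(x)

# The second loading scheme: rebuild the same structure, then load the parameters
def restore_params():
    net4 = torch.nn.Sequential(
        torch.nn.Linear(2, 10),
        torch.nn.ReLU(),
        torch.nn.Linear(10, 2),
    )
    net4.load_state_dict(torch.load('net_params.pkl'))
    out = net4(x)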

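The first save call stores the entire model, structure and parameters together. The second stores only the parameters (the model's state_dict). There are correspondingly two ways to load: restore_net() reads the whole model back directly, while restore_params() must first build a network with the same structure and then load the saved parameters into it. To check that nothing was lost, plot the outputs obtained during training next to those obtained after saving and loading, and compare them.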
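The parameters-only route can be checked without plotting. The sketch below saves a state_dict to a temporary directory, loads it into a freshly built network of the same structure, and compares outputs on the same input. The helper make_net, the 2-10-2 layer sizes and the temporary path are assumptions made for illustration.

```python
import os
import tempfile
import torch

def make_net():
    # Hypothetical helper: the structure must match the saved parameters.
    return torch.nn.Sequential(
        torch.nn.Linear(2, 10),
        torch.nn.ReLU(),
        torch.nn.Linear(10, 2),
    )

torch.manual_seed(0)
net = make_net()
sample = torch.randn(5, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'net_params.pkl')
    torch.save(net.state_dict(), path)           # parameters only

    restored = make_net()                        # starts with different random weights
    restored.load_state_dict(torch.load(path))   # overwrite them with the saved ones

with torch.no_grad():
    outputs_match = torch.equal(net(sample), restored(sample))
print(outputs_match)
```

Saving only the parameters produces a file that does not depend on the class definition being importable at load time, at the cost of having to rebuild the structure in code.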


3, Batch data training

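import torch
import torch.utils.data as Data

BATCH_SIZE = 5   # assumed value: number of samples per batch

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)
torch_dataset = Data.TensorDataset(x, y)
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,            # assumed: reshuffle the data every epoch
)
for epoch in range(3):
    for step, (batch_x, batch_y) in enumerate(loader):
        print('Epoch ', epoch, '| Step: ', step, '| batch x: ', batch_x.numpy(), '|batch y: ', batch_y.numpy())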

When the data set is too large to train on all at once, it is split into batches and the network is trained on one batch at a time. In the code above, Data.TensorDataset() wraps the tensors into a data set, and torch.utils.data.DataLoader() builds an iterator over it; batch_size is the number of samples fetched at each step.
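One detail worth seeing in practice is what happens when the data set size is not a multiple of batch_size. The sketch below uses the same 10 samples with a batch size of 4 (an illustrative value) and records the size of each batch; the names batch_sizes and loader_drop are introduced here for the example.

```python
import math
import torch
import torch.utils.data as Data

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)
dataset = Data.TensorDataset(x, y)

batch_size = 4   # illustrative value that does not divide 10 evenly
loader = Data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# One pass over the loader is one epoch; the last batch holds the remainder.
batch_sizes = [len(batch_x) for batch_x, batch_y in loader]
print(batch_sizes)
print(len(loader), math.ceil(len(dataset) / batch_size))

# drop_last=True discards the incomplete final batch instead.
loader_drop = Data.DataLoader(dataset, batch_size=batch_size, drop_last=True)
print(len(loader_drop))
```

With shuffle=True the samples inside each batch change from epoch to epoch, but the batch sizes stay the same.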

4, Optimizer

opt_SGD         = torch.optim.SGD(net_SGD.parameters(), lr=LR)
opt_Momentum    = torch.optim.SGD(net_Momentum.parameters(), lr=LR, momentum=0.8)
opt_RMSprop     = torch.optim.RMSprop(net_RMSprop.parameters(), lr=LR, alpha=0.9)
opt_Adam        = torch.optim.Adam(net_Adam.parameters(), lr=LR, betas=(0.9, 0.99))

The code above shows how the four optimizers are created. SGD is stochastic gradient descent, and in the comparison figure for this section it performs worst. Passing the momentum argument to torch.optim.SGD gives SGD with momentum, which usually trains faster than plain SGD. RMSprop scales each parameter's step by a running average of its squared gradients, and Adam combines that adaptive scaling with momentum; in the comparison both do better than plain SGD.
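A small, hedged way to try the four optimizers side by side is to fit the same toy problem with each of them from identical initial weights. The sketch below fits a line y = 2x + 1 with a single linear layer; the toy data, LR = 0.05, STEPS = 200 and the helper make_optimizer are assumptions for illustration and are not the experiment from the original post, so the relative ranking it prints should not be read as general.

```python
import torch

torch.manual_seed(0)
x = torch.linspace(-1, 1, 50).unsqueeze(1)   # shape (50, 1)
y = 2 * x + 1                                # noise-free line

LR = 0.05      # assumed learning rate
STEPS = 200    # assumed number of full-batch updates
loss_func = torch.nn.MSELoss()

def make_optimizer(name, params):
    # Same constructor arguments as in the snippet above, apart from LR.
    if name == 'SGD':
        return torch.optim.SGD(params, lr=LR)
    if name == 'Momentum':
        return torch.optim.SGD(params, lr=LR, momentum=0.8)
    if name == 'RMSprop':
        return torch.optim.RMSprop(params, lr=LR, alpha=0.9)
    return torch.optim.Adam(params, lr=LR, betas=(0.9, 0.99))

results = {}
for name in ['SGD', 'Momentum', 'RMSprop', 'Adam']:
    torch.manual_seed(0)                     # same initial weights for every optimizer
    net = torch.nn.Linear(1, 1)
    optimizer = make_optimizer(name, net.parameters())
    first_loss = loss_func(net(x), y).item()
    for _ in range(STEPS):
        loss = loss_func(net(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    results[name] = (first_loss, loss_func(net(x), y).item())

for name, (first_loss, last_loss) in results.items():
    print(f'{name:9s} loss {first_loss:.4f} -> {last_loss:.6f}')
```

Because every run starts from the same seed, differences in the final loss come from the optimizer alone; recording the loss at every step and plotting the four curves gives a comparison in the spirit of the article.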

5, Batch normalization

Batch normalization (BN) standardizes data that is spread over very different ranges. The distribution of the data affects training: if the inputs differ too much, the outputs differ too much as well, and after activation they may land in the saturated, insensitive region of the activation function. "Batch" refers to the batch of data being processed. BN is placed before the activation function so that the values fall in the sensitive region of the activation function.

The procedure is: compute the mean of the batch, compute its variance, then subtract the mean from each value and divide by the square root of the variance (the standard deviation). Finally, the normalized data passes through a learnable scale and shift, which lets the network undo part of the normalization where it would otherwise hurt performance.
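The steps described above can be written out by hand and compared with the built-in layer. The sketch below normalizes a small batch manually and checks the result against torch.nn.BatchNorm1d in training mode. The batch shape, the small constant eps added under the square root for numerical stability, and the names gamma and beta for the learnable scale and shift are choices made for this illustration.

```python
import torch

torch.manual_seed(0)
batch = torch.randn(8, 3) * 5 + 10       # 8 samples, 3 features, deliberately off-centre

# Manual batch normalization, computed per feature over the batch dimension
mean = batch.mean(dim=0)
var = batch.var(dim=0, unbiased=False)   # BN normalizes with the biased variance
eps = 1e-5                               # avoids division by zero
normalized = (batch - mean) / torch.sqrt(var + eps)

gamma = torch.ones(3)                    # learnable scale, at its initial value
beta = torch.zeros(3)                    # learnable shift, at its initial value
manual_out = gamma * normalized + beta

# The same computation with the built-in layer in training mode
bn = torch.nn.BatchNorm1d(3)             # defaults: eps=1e-5, weight=1, bias=0
bn.train()
with torch.no_grad():
    layer_out = bn(batch)

print(torch.allclose(manual_out, layer_out, atol=1e-5))
print(manual_out.mean(dim=0), manual_out.std(dim=0, unbiased=False))
```

In evaluation mode (bn.eval()) the layer uses running estimates of the mean and variance collected during training instead of the statistics of the current batch, so the two results would no longer match.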

Keywords: Pytorch Deep Learning Transformer

Added by unidox on Tue, 12 Oct 2021 21:42:29 +0300