Automatic differentiation with numpy and PyTorch: implementing a two-layer neural network with the torch.nn library

This post moves step by step from manual differentiation, to automatic differentiation, to a full model definition.

1 Implementing a two-layer neural network with numpy

A fully connected ReLU network with one hidden layer, no bias, and L2 loss (h is the hidden layer, ReLU is the activation function):

h = W_1 X + b_1
h_{relu} = \max(0, h)
y_{hat} = W_2 h_{relu} + b_2

In the implementation below, b_1 and b_2 are both 0, i.e. there is no bias. The code stores one sample per row of X, so the products are written as X.dot(W1) and h_relu.dot(W2).

This implementation uses only numpy to compute the forward pass, the loss, and back propagation.

A numpy ndarray is an ordinary n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is just a data structure for numerical computation.
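Because numpy does not track gradients, the backward pass has to be derived by hand. As a sketch, using the row-per-sample convention of the code (h = X W_1, y_hat = h_relu W_2, no bias) and writing L for the summed squared error (the symbol L and the indicator notation are introduced here for illustration), the chain rule gives:

```latex
\begin{aligned}
L &= \sum_{i,j} \left(y_{hat} - y\right)_{ij}^{2} \\
\frac{\partial L}{\partial y_{hat}} &= 2\,(y_{hat} - y) \\
\frac{\partial L}{\partial W_2} &= h_{relu}^{\top}\,\frac{\partial L}{\partial y_{hat}} \\
\frac{\partial L}{\partial h_{relu}} &= \frac{\partial L}{\partial y_{hat}}\,W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{relu}} \odot \mathbb{1}[h > 0] \\
\frac{\partial L}{\partial W_1} &= X^{\top}\,\frac{\partial L}{\partial h}
\end{aligned}
```

Here \odot is element-wise multiplication. Each line corresponds to one of the grad_ variables in the backward pass of the code.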

import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

W1 = np.random.randn(D_in, H)  #1000 to 100 dimensions
W2 = np.random.randn(H, D_out)  #100 to 10 dimensions

learning_rate = 1e-6

for t in range(500):  #Do 500 iterations
    '''Forward propagation( forward pass)'''
    h = X.dot(W1)  # N * H
    h_relu = np.maximum(h, 0)  #Activation function, N * H
    y_hat = h_relu.dot(W2)  # N * D_out
    
    '''Compute the loss'''
    loss = np.square(y_hat - y).sum()  # Sum of squared errors (L2 loss), not divided by N
    print(t, loss)  #Print loss per iteration
    
    '''Backward pass'''
    # Compute the gradients by hand with the chain rule (no autograd here) to get d{loss}/d{W1} and d{loss}/d{W2}
    grad_y_hat = 2.0 * (y_hat - y)  # d{loss}/d{y_hat}, N * D_out
    grad_W2 = h_relu.T.dot(grad_y_hat)  # From the third formula of the forward pass: d{loss}/d{W2}, H * D_out
    grad_h_relu = grad_y_hat.dot(W2.T)  # From the third formula of the forward pass: d{loss}/d{h_relu}, N * H
    grad_h = grad_h_relu.copy()  # Where h > 0, d{h_relu}/d{h} = 1, so the gradient passes through unchanged
    grad_h[h<0] = 0  # Where h < 0, d{h_relu}/d{h} = 0; the result is d{loss}/d{h}, N * H
    grad_W1 = X.T.dot(grad_h)  # From the first formula of the forward pass: d{loss}/d{W1}, D_in * H
    
    '''Parameter update( update weights of W1 and W2)'''
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2

0 36455139.29176882
1 35607818.495988876
2 36510242.60519045
3 32972837.109358862
4 23623067.52618093
5 13537226.736260608
6 6806959.784455631
7 3501526.30896816
8 2054356.1020693523
9 1400230.6793163505
...
490 3.4950278045838633e-06
491 3.3498523609301454e-06
492 3.210762995939165e-06
493 3.0774805749939447e-06
494 2.9500114328045522e-06
495 2.827652258736098e-06
496 2.710379907890261e-06
497 2.5980242077038875e-06
498 2.490359305069476e-06
499 2.387185101594446e-06

The loss keeps shrinking and ends up very small. Now let's look at how close the predicted values are to the target values:

y_hat - y

array([[ 9.16825615e-06, -1.53964987e-05,  6.58365129e-06,
        -3.08909604e-05,  1.05735798e-05,  1.73376919e-05,
         2.63084233e-06, -1.11662576e-05,  1.06904464e-05,
        -1.71528894e-05],
       ...
       [-5.79062537e-06, -1.74789200e-05,  5.27619647e-06,
        -7.82154474e-06,  3.39896752e-06,  1.08366770e-05,
         8.28712496e-06, -8.88009103e-06,  5.78585909e-06,
        -1.14913078e-05]])

The differences are tiny, so the network has fit the training data.
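A hand-derived backward pass is easy to get wrong, so it can help to compare it against finite differences. The following is a minimal sketch of such a gradient check for the same bias-free network; the helper names loss_and_grads and numeric_grad, the small array sizes, and the step eps are assumptions made for this illustration.

```python
import numpy as np

def loss_and_grads(X, y, W1, W2):
    """Forward and manual backward pass of the bias-free two-layer ReLU network."""
    h = X.dot(W1)
    h_relu = np.maximum(h, 0)
    y_hat = h_relu.dot(W2)
    loss = np.square(y_hat - y).sum()
    grad_y_hat = 2.0 * (y_hat - y)
    grad_W2 = h_relu.T.dot(grad_y_hat)
    grad_h = grad_y_hat.dot(W2.T)
    grad_h[h < 0] = 0
    grad_W1 = X.T.dot(grad_h)
    return loss, grad_W1, grad_W2

def numeric_grad(loss_fn, W, eps=1e-6):
    """Central-difference estimate of d{loss}/d{W}, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        loss_plus = loss_fn()
        W[idx] = old - eps
        loss_minus = loss_fn()
        W[idx] = old  # restore the original entry
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)  # small hypothetical sizes keep the check fast
X, y = rng.standard_normal((4, 6)), rng.standard_normal((4, 2))
W1, W2 = rng.standard_normal((6, 5)), rng.standard_normal((5, 2))

_, grad_W1, grad_W2 = loss_and_grads(X, y, W1, W2)
loss_only = lambda: loss_and_grads(X, y, W1, W2)[0]

# Largest deviation, relative to the largest gradient entry
err_W1 = np.abs(numeric_grad(loss_only, W1) - grad_W1).max() / np.abs(grad_W1).max()
err_W2 = np.abs(numeric_grad(loss_only, W2) - grad_W2).max() / np.abs(grad_W2).max()
print(err_W1, err_W2)
```

Both relative errors should be many orders of magnitude below 1 if the manual gradients are right; a check like this is only practical for small arrays because it evaluates the loss twice per weight.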

2 Using PyTorch automatic differentiation to implement a two-layer neural network

2.1 manual differentiation

Compared with the code in section 1, the following changes are made:

Change import numpy as np to import torch
Change np.random.randn to torch.randn
Change dot to mm
h_relu is np.maximum(h, 0) in numpy and h.clamp(min=0) in torch, which clamps the input to a lower bound of 0
The loss uses np.square in numpy and .pow(2) in torch; .item() converts the resulting tensor to a Python number
Transpose is .T in numpy and .t() in torch
Change copy() to clone()
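The substitutions listed above can be checked side by side on a tiny matrix. This is only an illustrative sketch; the array a_np and its values are made up for the example.

```python
import numpy as np
import torch

a_np = np.array([[1.0, -2.0], [3.0, 4.0]])
a_t = torch.from_numpy(a_np.copy())  # copy() so the tensor does not share memory with a_np

# Matrix product: dot  <->  mm
assert np.allclose(a_np.dot(a_np), a_t.mm(a_t).numpy())
# ReLU: np.maximum(h, 0)  <->  h.clamp(min=0)
assert np.allclose(np.maximum(a_np, 0), a_t.clamp(min=0).numpy())
# Transpose: .T  <->  .t()
assert np.allclose(a_np.T, a_t.t().numpy())
# Squared sum: np.square(...).sum()  <->  .pow(2).sum(); .item() yields a Python float
loss_np = np.square(a_np).sum()
loss_t = a_t.pow(2).sum().item()
assert isinstance(loss_t, float) and np.isclose(loss_np, loss_t)
# Copy: copy()  <->  clone()
b_t = a_t.clone()
b_t[0, 0] = 100.0
assert a_t[0, 0].item() == 1.0  # the original tensor is unchanged
```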

import torch

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

W1 = torch.randn(D_in, H)  #1000 to 100 dimensions
W2 = torch.randn(H, D_out)  #100 to 10 dimensions

learning_rate = 1e-6

for t in range(500):  #Do 500 iterations
    '''Forward propagation( forward pass)'''
    h = X.mm(W1)  # N * H
    h_relu = h.clamp(min=0)  #Activation function, N * H
    y_hat = h_relu.mm(W2)  # N * D_out
    
    '''Compute the loss'''
    loss = (y_hat - y).pow(2).sum().item()  # Sum of squared errors (L2 loss), not divided by N; item() gives a Python number
    print(t, loss)  #Print loss per iteration
    
    '''Backward pass'''
    # Compute the gradients by hand with the chain rule (autograd is not used yet) to get d{loss}/d{W1} and d{loss}/d{W2}
    grad_y_hat = 2.0 * (y_hat - y)  # d{loss}/d{y_hat}, N * D_out
    grad_W2 = h_relu.t().mm(grad_y_hat)  # From the third formula of the forward pass: d{loss}/d{W2}, H * D_out
    grad_h_relu = grad_y_hat.mm(W2.t())  # From the third formula of the forward pass: d{loss}/d{h_relu}, N * H
    grad_h = grad_h_relu.clone()  # Where h > 0, d{h_relu}/d{h} = 1, so the gradient passes through unchanged
    grad_h[h<0] = 0  # Where h < 0, d{h_relu}/d{h} = 0; the result is d{loss}/d{h}, N * H
    grad_W1 = X.t().mm(grad_h)  # From the first formula of the forward pass: d{loss}/d{W1}, D_in * H
    
    '''Parameter update( update weights of W1 and W2)'''
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2

0 28398944.0
1 27809498.0
2 32215128.0
3 37019776.0
4 36226528.0
5 27777396.0
6 16156263.0
7 7798599.0
8 3615862.0
9 1881907.25
...
490 5.404536932474002e-05
491 5.3628453315468505e-05
492 5.282810889184475e-05
493 5.204257831792347e-05
494 5.149881326360628e-05
495 5.084666554466821e-05
496 4.9979411414824426e-05
497 4.938142956234515e-05
498 4.8661189794074744e-05
499 4.8014146159403026e-05

2.2 automatic differentiation

Compared with the code in 2.1, the following changes are made:

Pass requires_grad=True to declare that W1 and W2 need gradients. If this argument is not passed it defaults to False (as for X and y), which saves memory
In the forward pass, y_hat is computed in a single line of code for convenience
loss must now remain a tensor, so the .item() is removed from the loss computation and called where the loss is printed instead
Remove all the manual gradient steps and call loss.backward() instead
Replace the manually computed grad_W1 with W1.grad (likewise for W2)
After the parameter update, call W1.grad.zero_() to clear the gradient of W1 (likewise for W2), because gradients accumulate across backward() calls
Wrap the update in with torch.no_grad(): so that autograd does not track the updates to W1 and W2 and spend memory on them
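To see that loss.backward() really replaces the manual chain rule, the two can be compared directly. The sketch below uses small hypothetical sizes and float64 so that the comparison is tight; it also shows why the gradients have to be zeroed after each step.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 8, 20, 10, 3  # small hypothetical sizes
X = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
W1 = torch.randn(D_in, H, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(H, D_out, dtype=torch.float64, requires_grad=True)

# Autograd: the forward pass builds the graph, backward() fills W1.grad and W2.grad
h = X.mm(W1)
h_relu = h.clamp(min=0)
y_hat = h_relu.mm(W2)
loss = (y_hat - y).pow(2).sum()
loss.backward()

# Manual chain rule, computed outside the graph
with torch.no_grad():
    grad_y_hat = 2.0 * (y_hat - y)
    grad_W2 = h_relu.t().mm(grad_y_hat)
    grad_h = grad_y_hat.mm(W2.t())
    grad_h[h < 0] = 0
    grad_W1 = X.t().mm(grad_h)

match_W1 = torch.allclose(W1.grad, grad_W1)
match_W2 = torch.allclose(W2.grad, grad_W2)
print(match_W1, match_W2)

# Gradients accumulate across backward() calls until they are zeroed
first_grad = W2.grad.clone()
loss2 = (X.mm(W1).clamp(min=0).mm(W2) - y).pow(2).sum()
loss2.backward()
accumulated = torch.allclose(W2.grad, 2 * first_grad)
W1.grad.zero_()
W2.grad.zero_()
print(accumulated)
```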

import torch

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

W1 = torch.randn(D_in, H, requires_grad=True)  #1000 to 100 dimensions
W2 = torch.randn(H, D_out, requires_grad=True)  #100 to 10 dimensions

learning_rate = 1e-6

for t in range(500):  #Do 500 iterations
    '''Forward propagation( forward pass)'''
    y_hat = X.mm(W1).clamp(min=0).mm(W2)  # N * D_out
    
    '''Compute the loss'''
    loss = (y_hat - y).pow(2).sum()  # Sum of squared errors (not divided by N); loss is a tensor attached to the computation graph
    print(t, loss.item())  #Print loss per iteration
    
    '''Backward propagation( backward pass)'''
    loss.backward()
    
    '''Parameter update( update weights of W1 and W2)'''
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()

0 28114322.0
1 22391836.0
2 19137772.0
3 16153970.0
4 12953562.0
5 9725695.0
6 6933768.5
7 4784875.0
8 3286503.0
9 2288213.25
...
490 3.3917171094799414e-05
491 3.35296499542892e-05
492 3.318845119792968e-05
493 3.276047937106341e-05
494 3.244510298827663e-05
495 3.209296482964419e-05
496 3.168126931996085e-05
497 3.1402159947901964e-05
498 3.097686203545891e-05
499 3.074205596931279e-05

3 Using the torch.nn library to implement a two-layer neural network

3.1 without normal initialization of the parameters

Compared with the code in 2.2, the following changes are made:

In addition to import torch, add import torch.nn as nn (nn is the neural network module)
There is no need to define W1 and W2. Define model = torch.nn.Sequential(...) directly (a sequence of layers chained together)
y_hat is then computed simply as model(X)
The loss no longer has to be written out by hand. Defining loss_fn = nn.MSELoss(reduction='sum') is enough, and the loss below is computed with this function
The parameters to update can be read from the model directly with a for loop over model.parameters()
Gradient clearing works the same way: model.zero_grad()
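The following sketch, with small made-up sizes, shows what the torch.nn pieces named above compute: nn.Linear stores its weight as (out_features, in_features) and applies x W^T + b, and nn.MSELoss(reduction='sum') is the same summed squared error that was written by hand before.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(4, 6)
y = torch.randn(4, 2)

linear = nn.Linear(6, 3)  # weight has shape (out_features, in_features) = (3, 6)
manual = X.mm(linear.weight.t()) + linear.bias
same_output = torch.allclose(linear(X), manual, atol=1e-6)

model = nn.Sequential(linear, nn.ReLU(), nn.Linear(3, 2))
y_hat = model(X)
loss_fn = nn.MSELoss(reduction='sum')
same_loss = torch.allclose(loss_fn(y_hat, y), (y_hat - y).pow(2).sum())

# model.parameters() yields the weight and bias of both Linear layers
shapes = [tuple(p.shape) for p in model.parameters()]
print(same_output, same_loss, shapes)
```

Note that the weight is the transpose of the W1 used in the earlier sections, which had shape (D_in, H).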

import torch
import torch.nn as nn  # Building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True), # X @ W1.T + b; bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# model = model.cuda()  #This is the case with GPU

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6

for t in range(500):  #Do 500 iterations
    '''Forward propagation( forward pass)'''
    y_hat = model(X)  # model(X) = model.forward(X), N * D_out
    
    '''Compute the loss'''
    loss = loss_fn(y_hat, y)  # Sum of squared errors (not divided by N); loss is a tensor attached to the computation graph
    print(t, loss.item())  #Print loss per iteration
        
    '''Backward propagation( backward pass)'''
    loss.backward()
    
    '''Parameter update( update weights of W1 and W2)'''
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad  #All parameters in the model are updated
            
    model.zero_grad()

0 686.78662109375
1 686.2665405273438
2 685.7469482421875
3 685.2279663085938
4 684.7101440429688
5 684.1931762695312
6 683.6768188476562
7 683.1609497070312
8 682.6456909179688
9 682.130859375
...
490 496.4220275878906
491 496.12548828125
492 495.82940673828125
493 495.533203125
494 495.2373046875
495 494.94171142578125
496 494.6462707519531
497 494.35101318359375
498 494.0567321777344
499 493.7628479003906

model

Sequential(
  (0): Linear(in_features=1000, out_features=100, bias=True)
  (1): ReLU()
  (2): Linear(in_features=100, out_features=10, bias=True)
)

model[0]

Linear(in_features=1000, out_features=100, bias=True)

model[0].weight

Parameter containing:
tensor([[-0.0147, -0.0315,  0.0085,  ...,  0.0039,  0.0254, -0.0308],
        [ 0.0046,  0.0125,  0.0128,  ..., -0.0241, -0.0206, -0.0127],
        [-0.0162,  0.0051,  0.0152,  ..., -0.0280, -0.0133,  0.0079],
        ...,
        [ 0.0239,  0.0237, -0.0025,  ...,  0.0290, -0.0192,  0.0187],
        [-0.0249,  0.0287,  0.0060,  ..., -0.0198,  0.0007,  0.0209],
        [ 0.0238, -0.0157, -0.0156,  ...,  0.0105,  0.0057, -0.0189]],
       requires_grad=True)

3.2 normal initialization of the parameters

The model in 3.1 trains slowly, possibly because of poor parameter initialization, so we use torch.nn.init.normal_ to re-initialize the weights of layers 0 and 2 (the Linear layers) from a standard normal distribution:

import torch
import torch.nn as nn  # Building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True), # X @ W1.T + b; bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()  #This is the case with GPU

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6

for t in range(500):  #Do 500 iterations
    '''Forward propagation( forward pass)'''
    y_hat = model(X)  # model.forward(), N * D_out
    
    '''Compute the loss'''
    loss = loss_fn(y_hat, y)  # Sum of squared errors (not divided by N); loss is a tensor attached to the computation graph
    print(t, loss.item())  #Print loss per iteration
        
    '''Backward propagation( backward pass)'''
    loss.backward()
    
    '''Parameter update( update weights of W1 and W2)'''
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad  #All parameters in the model are updated
            
    model.zero_grad()

0 34311500.0
1 32730668.0
2 33845940.0
3 31335464.0
4 23584192.0
5 14068799.0
6 7252735.5
7 3674312.0
8 2069563.0
9 1346445.75
...
490 7.143352559069172e-05
491 7.078371709212661e-05
492 7.009323599049821e-05
493 6.912354729138315e-05
494 6.783746357541531e-05
495 6.718340591760352e-05
496 6.611335265915841e-05
497 6.529116944875568e-05
498 6.444999598897994e-05
499 6.381605635397136e-05

model[0].weight

Parameter containing:
tensor([[ 0.1849, -0.2587,  1.6247,  ..., -0.8608, -2.2139, -1.3076],
        [-0.5197,  0.0600,  0.2141,  ...,  0.0561, -0.1613, -0.3905],
        [-0.5303, -0.1129, -0.2974,  ..., -0.6166, -3.4082,  0.0969],
        ...,
        [-0.4742,  0.2449, -1.5979,  ..., -0.6195, -0.2970, -1.3764],
        [-0.1131,  0.4973,  0.7679,  ...,  0.1231,  0.6992,  0.4403],
        [-0.1557,  0.8185,  0.7784,  ..., -0.9993,  0.3424, -1.1116]],
       requires_grad=True)
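The weights printed in 3.1 are all of order 0.01, while the ones printed here are of order 1. A short sketch makes the difference in scale explicit. It assumes the default nn.Linear initialization documented for recent PyTorch versions, which draws weights uniformly from [-1/sqrt(in_features), 1/sqrt(in_features)]; the variable names are made up for the example.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
D_in, H = 1000, 100
layer = nn.Linear(D_in, H)

# Default initialization: every weight lies within +/- 1/sqrt(in_features)
default_max = layer.weight.abs().max().item()
bound = 1.0 / math.sqrt(D_in)  # about 0.0316

# normal_ with its default arguments: mean 0, standard deviation 1
nn.init.normal_(layer.weight)
normal_std = layer.weight.std().item()

print(default_max, bound, normal_std)
```

With weights roughly 30 times larger, the outputs and gradients are much larger too, which is consistent with the far higher initial loss here and with the learning rate of 1e-6 making visible progress only in this setting.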

3.3 optim method

The previous approach is still clumsy: the model weights are updated by hand. In 3.3 we use the optim package to update the parameters for us. The optim package provides various optimization methods, including SGD with momentum, RMSProp, Adam, and so on.

Building on 3.2:

With the optim package (Adam here), a learning_rate of 1e-4 is a typical choice
Define an optimizer; Adam is used here
The whole parameter-update block can be replaced with the single statement optimizer.step(), which updates all parameters in one step
Use optimizer.zero_grad() to clear the gradients
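To illustrate that optimizer.step() does the same job as the hand-written update loop, the sketch below compares the two on identical copies of a small model. Plain torch.optim.SGD is swapped in for Adam here, because only plain SGD applies exactly the rule param -= lr * grad; the model sizes and names are assumptions for the example.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(8, 5), torch.randn(8, 2)
model_a = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 2))
model_b = copy.deepcopy(model_a)  # identical starting weights
loss_fn = nn.MSELoss(reduction='sum')
lr = 1e-3

# Manual update, written out as a loop over the parameters
loss_fn(model_a(X), y).backward()
with torch.no_grad():
    for param in model_a.parameters():
        param -= lr * param.grad
model_a.zero_grad()

# The same update through the optim package
optimizer = torch.optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
loss_fn(model_b(X), y).backward()
optimizer.step()

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)
```

Adam keeps extra running statistics per parameter, so its steps differ from this rule, but the calling pattern (zero_grad, backward, step) is the same.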

import torch
import torch.nn as nn  # Building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True), # X @ W1.T + b; bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# torch.nn.init.normal_(model[0].weight)
# torch.nn.init.normal_(model[2].weight)

# model = model.cuda()  #This is the case with GPU

loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):  #Do 500 iterations
    '''Forward propagation( forward pass)'''
    y_hat = model(X)  # model.forward(), N * D_out
    
    '''Compute the loss'''
    loss = loss_fn(y_hat, y)  # Sum of squared errors (not divided by N); loss is a tensor attached to the computation graph
    print(t, loss.item())  #Print loss per iteration
        
    optimizer.zero_grad()  #Clear the gradient before deriving
    '''Backward propagation( backward pass)'''
    loss.backward()
    
    '''Parameter update( update weights of W1 and W2)'''
    optimizer.step()  #Update all parameters in one step

0 677.295166015625
1 660.0888061523438
2 643.3673095703125
3 627.08642578125
4 611.1599731445312
5 595.6091918945312
6 580.5427856445312
7 565.9138793945312
8 551.620849609375
9 537.651123046875
...
490 9.944045586962602e-09
491 9.147494317574001e-09
492 8.492017755656889e-09
493 7.793811818146423e-09
494 7.225093412444039e-09
495 6.644597760896431e-09
496 6.126881668677697e-09
497 5.687876836191208e-09
498 5.240272660245182e-09
499 4.8260742069317075e-09

The normal_ initialization from 3.2 has to be commented out here; otherwise training converges very poorly. (The reason is not obvious. The best initialization seems to depend on the learning_rate and the optimization method, so when those change, the initialization may need to be revisited.)
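One possible explanation, offered here as an assumption rather than something the experiments above verify: Adam divides each gradient by a running estimate of its magnitude, so a single step moves a parameter by roughly lr no matter how large the gradient is. The sketch below shows this for the first step, using made-up gradients that differ by five orders of magnitude.

```python
import torch

lr = 1e-4
w = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.Adam([w], lr=lr)

grad_scales = torch.tensor([1e-2, 1.0, 1e3])  # hypothetical gradients of very different size
loss = (grad_scales * w).sum()  # d{loss}/d{w} equals grad_scales
loss.backward()
optimizer.step()

step_sizes = w.detach().abs()
print(step_sizes)  # each entry is close to lr
```

If steps stay near lr = 1e-4, then 500 iterations can shift a weight by only about 0.05. That is a large change for default weights of order 0.03, but a small one for normal_ weights of order 1, which would leave the huge initial loss of that setting mostly in place.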

3.4 custom model

A new model is usually defined by subclassing torch.nn.Module.

import torch
import torch.nn as nn  # Building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

'''Define a two-layer network'''
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        #Define model structure
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)
        
    def forward(self, x):
        y_hat = self.linear2(self.linear1(x).clamp(min=0))  # Use the argument x, not the global X
        return y_hat

    
model = TwoLayerNet(D_in, H, D_out)

loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):  #Do 500 iterations
    '''Forward propagation( forward pass)'''
    y_hat = model(X)  # model.forward(), N * D_out
    
    '''Compute the loss'''
    loss = loss_fn(y_hat, y)  # Sum of squared errors (not divided by N); loss is a tensor attached to the computation graph
    print(t, loss.item())  #Print loss per iteration
        
    optimizer.zero_grad()  #Clear the gradient before deriving
    '''Backward propagation( backward pass)'''
    loss.backward()
    
    '''Parameter update( update weights of W1 and W2)'''
    optimizer.step()  #Update all parameters in one step

0 713.7529296875
1 695.759033203125
2 678.2886352539062
3 661.2178344726562
4 644.5472412109375
5 628.3016357421875
6 612.5072021484375
7 597.1802978515625
8 582.385009765625
9 568.1029663085938
...
490 3.386985554243438e-07
491 3.155915919705876e-07
492 2.9405845225483063e-07
493 2.7391826051825774e-07
494 2.553651086145692e-07
495 2.379783694550497e-07
496 2.2159480295158573e-07
497 2.0649896725899453e-07
498 1.9220941283037973e-07
499 1.790194232853537e-07

Adam works well here.
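The optimizer can be given model.parameters() because assigning nn.Linear layers as attributes in __init__ registers their weights with the module. The sketch below restates the two-layer class in compact form to show this; in it, forward() reads its argument x, so the model can be applied to a batch of any size.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        # Assigning nn.Module attributes registers their parameters
        self.linear1 = nn.Linear(D_in, H, bias=False)
        self.linear2 = nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        # Linear -> ReLU (via clamp) -> Linear, applied to the argument x
        return self.linear2(self.linear1(x).clamp(min=0))

model = TwoLayerNet(1000, 100, 10)
names = [(name, tuple(p.shape)) for name, p in model.named_parameters()]
print(names)

# A batch of 5 samples instead of the 64 used for training
out = model(torch.randn(5, 1000))
print(out.shape)
```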

4 Summary

The steps are always the same: define the parameters, define the model, define the loss function, then hand the parameters to an optimizer and train.

Keywords: Machine Learning AI neural networks Pytorch Deep Learning

Added by karldenton on Fri, 31 Dec 2021 17:56:23 +0200
