Automatic differentiation with numpy and PyTorch, and implementation of a two-layer neural network with the torch.nn library

This article moves step by step from manual differentiation to automatic differentiation, and then to model classes.

1. Implementation of two-layer neural network with numpy

A fully connected ReLU network with one hidden layer, no bias, and an L2 loss (h is the hidden layer, ReLU is the activation function):

• $h = W_1 X + b_1$
• $h_{relu} = \max(0, h)$
• $\hat{y} = W_2 h_{relu} + b_2$

In the implementation below, $b_1$ and $b_2$ are both 0, i.e. there is no bias.

This implementation uses only numpy for the forward pass, the loss, and back propagation.

A numpy ndarray is an ordinary n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is simply a data structure for numerical computation.
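For reference, the gradients needed in the backward pass can be written out with the chain rule. This is a sketch under a few assumptions: it follows the row-major convention of the code (each row of X is one sample, so the hidden layer is X W_1 rather than W_1 X), the biases are zero, and the loss is the summed squared error. The symbols \odot (element-wise product) and \mathbb{1}[\cdot] (indicator) are notation introduced here, not part of the original text.

```latex
\begin{aligned}
h &= X W_1, \qquad h_{relu} = \max(0, h), \qquad \hat{y} = h_{relu} W_2, \qquad
L = \sum_{n,k} \left(\hat{y}_{nk} - y_{nk}\right)^2 \\
\frac{\partial L}{\partial \hat{y}} &= 2\,(\hat{y} - y) \\
\frac{\partial L}{\partial W_2} &= h_{relu}^{\top}\, \frac{\partial L}{\partial \hat{y}} \\
\frac{\partial L}{\partial h_{relu}} &= \frac{\partial L}{\partial \hat{y}}\, W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{relu}} \odot \mathbb{1}[h > 0] \\
\frac{\partial L}{\partial W_1} &= X^{\top}\, \frac{\partial L}{\partial h}
\end{aligned}
```

Each line corresponds to one grad_* variable in the code, and the shapes can be checked against the comments there (for example, the gradient with respect to W_2 is H * D_out).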

import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

W1 = np.random.randn(D_in, H)  #1000 to 100 dimensions
W2 = np.random.randn(H, D_out)  #100 to 10 dimensions

learning_rate = 1e-6

for t in range(500):  # 500 iterations
    '''Forward pass'''
    h = X.dot(W1)  # N * H
    h_relu = np.maximum(h, 0)  # ReLU activation, N * H
    y_hat = h_relu.dot(W2)  # N * D_out

    '''Compute loss'''
    loss = np.square(y_hat - y).sum()  # sum of squared errors (MSE without the division by N)
    print(t, loss)  # print the loss at every iteration

    '''Backward pass'''
    # Compute the gradients by hand with the chain rule (no autograd here)
    grad_y_hat = 2.0 * (y_hat - y)  # d{loss}/d{y_hat}, N * D_out
    grad_W2 = h_relu.T.dot(grad_y_hat)  # from the third formula of the forward pass, d{loss}/d{W2}, H * D_out
    grad_h_relu = grad_y_hat.dot(W2.T)  # from the third formula of the forward pass, d{loss}/d{h_relu}, N * H
    grad_h = grad_h_relu.copy()  # where h > 0, d{h_relu}/d{h} = 1
    grad_h[h < 0] = 0  # where h < 0, d{h_relu}/d{h} = 0
    grad_W1 = X.T.dot(grad_h)  # from the first formula of the forward pass, d{loss}/d{W1}, D_in * H

    '''Update the weights W1 and W2'''
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
0 36455139.29176882
1 35607818.495988876
2 36510242.60519045
3 32972837.109358862
4 23623067.52618093
5 13537226.736260608
6 6806959.784455631
7 3501526.30896816
8 2054356.1020693523
9 1400230.6793163505
...
490 3.4950278045838633e-06
491 3.3498523609301454e-06
492 3.210762995939165e-06
493 3.0774805749939447e-06
494 2.9500114328045522e-06
495 2.827652258736098e-06
496 2.710379907890261e-06
497 2.5980242077038875e-06
498 2.490359305069476e-06
499 2.387185101594446e-06

The loss keeps shrinking. Now let's look at how close the predicted values are to the true values:

y_hat - y
array([[ 9.16825615e-06, -1.53964987e-05,  6.58365129e-06,
-3.08909604e-05,  1.05735798e-05,  1.73376919e-05,
2.63084233e-06, -1.11662576e-05,  1.06904464e-05,
-1.71528894e-05],
...
[-5.79062537e-06, -1.74789200e-05,  5.27619647e-06,
-7.82154474e-06,  3.39896752e-06,  1.08366770e-05,
8.28712496e-06, -8.88009103e-06,  5.78585909e-06,
-1.14913078e-05]])

The differences are on the order of 1e-5, so the network has fit the training data.
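One way to gain confidence in hand-written gradients like these is to compare them with finite differences. The following is only a sketch: the helper names loss_and_grads and numeric_grad, the tiny layer sizes, and the step size eps are assumptions made for illustration, not part of the original code.

```python
import numpy as np

def loss_and_grads(X, y, W1, W2):
    """Forward and backward pass of the two-layer ReLU network (no bias)."""
    h = X.dot(W1)
    h_relu = np.maximum(h, 0)
    y_hat = h_relu.dot(W2)
    loss = np.square(y_hat - y).sum()
    grad_y_hat = 2.0 * (y_hat - y)
    grad_W2 = h_relu.T.dot(grad_y_hat)
    grad_h = grad_y_hat.dot(W2.T)
    grad_h[h < 0] = 0  # ReLU passes no gradient where h < 0
    grad_W1 = X.T.dot(grad_h)
    return loss, grad_W1, grad_W2

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes (an assumption) so the check is cheap
X = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
W1 = rng.standard_normal((D_in, H))
W2 = rng.standard_normal((H, D_out))

_, grad_W1, grad_W2 = loss_and_grads(X, y, W1, W2)

def numeric_grad(W, eps=1e-6):
    """Central finite differences of the loss for every entry of W.

    W must be W1 or W2 itself: it is perturbed in place and then restored.
    """
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        loss_plus = loss_and_grads(X, y, W1, W2)[0]
        W[idx] = old - eps
        loss_minus = loss_and_grads(X, y, W1, W2)[0]
        W[idx] = old
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

err_W1 = np.abs(numeric_grad(W1) - grad_W1).max()
err_W2 = np.abs(numeric_grad(W2) - grad_W2).max()
print(err_W1, err_W2)  # both should be tiny if the manual gradients are right
```

If the ReLU mask were left out of the backward pass, err_W1 would typically be large while err_W2 stayed small, which makes this kind of check useful for locating the faulty step.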

2. Implementation of two-layer neural network with PyTorch

2.1 Manual differentiation

Compared with the code in section 1, the changes are:

• Change import numpy as np to import torch
• Change np.random.randn to torch.randn
• Change dot to mm
• h_relu is np.maximum(h, 0) in numpy and h.clamp(min=0) in torch, which clamps the input to a lower bound of 0
• The loss uses np.square in numpy and .pow(2) in torch; .item() converts the one-element tensor to a Python number
• Transpose is .T in numpy and .t() in torch
• Change copy() to clone()
import torch

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

W1 = torch.randn(D_in, H)  #1000 to 100 dimensions
W2 = torch.randn(H, D_out)  #100 to 10 dimensions

learning_rate = 1e-6

for t in range(500):  # 500 iterations
    '''Forward pass'''
    h = X.mm(W1)  # N * H
    h_relu = h.clamp(min=0)  # ReLU activation, N * H
    y_hat = h_relu.mm(W2)  # N * D_out

    '''Compute loss'''
    loss = (y_hat - y).pow(2).sum().item()  # sum of squared errors (MSE without the division by N)
    print(t, loss)  # print the loss at every iteration

    '''Backward pass'''
    # Compute the gradients by hand with the chain rule (autograd is not used yet)
    grad_y_hat = 2.0 * (y_hat - y)  # d{loss}/d{y_hat}, N * D_out
    grad_W2 = h_relu.t().mm(grad_y_hat)  # from the third formula of the forward pass, d{loss}/d{W2}, H * D_out
    grad_h_relu = grad_y_hat.mm(W2.t())  # from the third formula of the forward pass, d{loss}/d{h_relu}, N * H
    grad_h = grad_h_relu.clone()  # where h > 0, d{h_relu}/d{h} = 1
    grad_h[h < 0] = 0  # where h < 0, d{h_relu}/d{h} = 0
    grad_W1 = X.t().mm(grad_h)  # from the first formula of the forward pass, d{loss}/d{W1}, D_in * H

    '''Update the weights W1 and W2'''
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
0 28398944.0
1 27809498.0
2 32215128.0
3 37019776.0
4 36226528.0
5 27777396.0
6 16156263.0
7 7798599.0
8 3615862.0
9 1881907.25
...
490 5.404536932474002e-05
491 5.3628453315468505e-05
492 5.282810889184475e-05
493 5.204257831792347e-05
494 5.149881326360628e-05
495 5.084666554466821e-05
496 4.9979411414824426e-05
497 4.938142956234515e-05
498 4.8661189794074744e-05
499 4.8014146159403026e-05
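The numpy-to-torch substitutions listed at the start of this section can be checked side by side. The sketch below uses two small hand-written matrices (chosen here purely for illustration) and compares each numpy call with its torch counterpart.

```python
import numpy as np
import torch

a_np = np.array([[1.0, -2.0], [3.0, 4.0]])
b_np = np.array([[0.5, 1.0], [-1.0, 2.0]])
a_t = torch.from_numpy(a_np)  # shares memory with a_np, dtype float64
b_t = torch.from_numpy(b_np)

# (numpy result, torch result) for each substitution
pairs = {
    "dot / mm": (a_np.dot(b_np), a_t.mm(b_t)),
    "maximum / clamp": (np.maximum(a_np, 0), a_t.clamp(min=0)),
    "square / pow": (np.square(a_np), a_t.pow(2)),
    "T / t()": (a_np.T, a_t.t()),
    "copy / clone": (a_np.copy(), a_t.clone()),
}
results = {name: np.allclose(n, t.numpy()) for name, (n, t) in pairs.items()}
print(results)

# .sum() returns a zero-dimensional tensor; .item() turns it into a Python float
loss = a_t.pow(2).sum()
print(type(loss), type(loss.item()))
```

Note that copy() and clone() both return independent data, so editing grad_h afterwards does not alter grad_h_relu.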

2.2 Automatic differentiation

Compared with the code in 2.1, the changes are:

• Use requires_grad=True to declare that W1 and W2 need gradients. If this argument is not passed it defaults to False (as for X and y), which saves memory
• In the forward pass, y_hat is computed in a single line for convenience
• loss must now stay a tensor: remove the .item() call from the loss computation and move it into the print statement
• Remove all the manual gradient computation and call loss.backward() instead
• Replace the manually computed grad_W1 with W1.grad (likewise for W2)
• After the parameter update, call W1.grad.zero_() to clear the gradient of W1 (likewise for W2)
• Wrap the update in with torch.no_grad(): so that PyTorch does not record the updates of W1 and W2 in the computation graph and waste memory
import torch

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

W1 = torch.randn(D_in, H, requires_grad=True)  #1000 to 100 dimensions
W2 = torch.randn(H, D_out, requires_grad=True)  #100 to 10 dimensions

learning_rate = 1e-6

for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = X.mm(W1).clamp(min=0).mm(W2)  # N * D_out

    '''Compute loss'''
    loss = (y_hat - y).pow(2).sum()  # sum of squared errors; loss is a tensor attached to the computation graph
    print(t, loss.item())  # print the loss at every iteration

    '''Backward pass'''
    loss.backward()

    '''Update the weights W1 and W2'''
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
0 28114322.0
1 22391836.0
2 19137772.0
3 16153970.0
4 12953562.0
5 9725695.0
6 6933768.5
7 4784875.0
8 3286503.0
9 2288213.25
...
490 3.3917171094799414e-05
491 3.35296499542892e-05
492 3.318845119792968e-05
493 3.276047937106341e-05
494 3.244510298827663e-05
495 3.209296482964419e-05
496 3.168126931996085e-05
497 3.1402159947901964e-05
498 3.097686203545891e-05
499 3.074205596931279e-05
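To see that loss.backward() produces the same gradients as the manual chain rule of 2.1, and why the gradients have to be cleared after each update, the two can be compared directly. This is a minimal sketch: the small layer sizes, the float64 dtype, and the fixed seed are assumptions chosen to keep the comparison quick and numerically clean.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 8, 6, 5, 3  # small illustrative sizes (an assumption)
X = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
W1 = torch.randn(D_in, H, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(H, D_out, dtype=torch.float64, requires_grad=True)

# Autograd: the forward pass records the graph, backward() fills W1.grad and W2.grad
h = X.mm(W1)
h_relu = h.clamp(min=0)
y_hat = h_relu.mm(W2)
loss = (y_hat - y).pow(2).sum()
loss.backward()

# Manual chain rule as in 2.1; no_grad() keeps these operations out of the graph
with torch.no_grad():
    grad_y_hat = 2.0 * (y_hat - y)
    grad_W2 = h_relu.t().mm(grad_y_hat)
    grad_h = grad_y_hat.mm(W2.t())
    grad_h[h < 0] = 0
    grad_W1 = X.t().mm(grad_h)

match_W1 = torch.allclose(W1.grad, grad_W1)
match_W2 = torch.allclose(W2.grad, grad_W2)
print(match_W1, match_W2)

# Gradients accumulate: a second backward() without zero_() adds to W1.grad
first = W1.grad.clone()
loss2 = (X.mm(W1).clamp(min=0).mm(W2) - y).pow(2).sum()
loss2.backward()
doubled = torch.allclose(W1.grad, 2 * first)
print(doubled)
```

The last check shows that, without zero_(), each iteration would update the weights with the sum of all gradients computed so far instead of the current one.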

3. Implementation of two-layer neural network with the torch.nn library

3.1 Parameters not re-initialized

Compared with the code in 2.2, the changes are:

• In addition to import torch, add import torch.nn as nn (nn is the neural network library)
• There is no need to define W1 and W2. Define model = torch.nn.Sequential(...) directly (a sequence of layers chained together)
• y_hat is then computed simply as model(X)
• The loss no longer has to be written by hand. Use loss_fn = nn.MSELoss(reduction='sum'), and compute loss by calling this function below
• The parameters to update can be obtained from the model directly with a for loop
import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),  # x W^T + b; bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# model = model.cuda()  #This is the case with GPU

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6

for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model(X) calls model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # sum of squared errors; loss is a tensor attached to the computation graph
    print(t, loss.item())  # print the loss at every iteration

    '''Backward pass'''
    model.zero_grad()  # clear the gradients left over from the previous iteration
    loss.backward()

    '''Update all parameters of the model'''
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 686.78662109375
1 686.2665405273438
2 685.7469482421875
3 685.2279663085938
4 684.7101440429688
5 684.1931762695312
6 683.6768188476562
7 683.1609497070312
8 682.6456909179688
9 682.130859375
...
490 496.4220275878906
491 496.12548828125
492 495.82940673828125
493 495.533203125
494 495.2373046875
495 494.94171142578125
496 494.6462707519531
497 494.35101318359375
498 494.0567321777344
499 493.7628479003906
model
Sequential(
(0): Linear(in_features=1000, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=10, bias=True)
)
model[0]
Linear(in_features=1000, out_features=100, bias=True)
model[0].weight
Parameter containing:
tensor([[-0.0147, -0.0315,  0.0085,  ...,  0.0039,  0.0254, -0.0308],
[ 0.0046,  0.0125,  0.0128,  ..., -0.0241, -0.0206, -0.0127],
[-0.0162,  0.0051,  0.0152,  ..., -0.0280, -0.0133,  0.0079],
...,
[ 0.0239,  0.0237, -0.0025,  ...,  0.0290, -0.0192,  0.0187],
[-0.0249,  0.0287,  0.0060,  ..., -0.0198,  0.0007,  0.0209],
[ 0.0238, -0.0157, -0.0156,  ...,  0.0105,  0.0057, -0.0189]],
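Besides indexing into the Sequential container, the parameters can also be listed by name. The sketch below rebuilds the same architecture with the layer sizes written out literally and collects the parameter shapes; the variable names shapes and n_params are introduced here for illustration.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)

# named_parameters() yields (name, parameter) pairs; ReLU has no parameters
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

# total number of trainable scalars in the model
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```

The weight of a Linear layer is stored as (out_features, in_features), which is why the first layer's weight has 100 rows and 1000 columns.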

3.2 Parameters re-initialized with normal_

The model in 3.1 trains slowly (the loss only falls from about 687 to about 494 in 500 iterations), which may be due to poor parameter initialization. So we use torch.nn.init.normal_ to re-initialize the weights of layers 0 and 2 (the Linear layers) from a normal distribution:

import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),  # x W^T + b; bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

torch.nn.init.normal_(model[0].weight)  # first Linear layer
torch.nn.init.normal_(model[2].weight)  # second Linear layer

# model = model.cuda()  #This is the case with GPU

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6

for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model(X) calls model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # sum of squared errors; loss is a tensor attached to the computation graph
    print(t, loss.item())  # print the loss at every iteration

    '''Backward pass'''
    model.zero_grad()  # clear the gradients left over from the previous iteration
    loss.backward()

    '''Update all parameters of the model'''
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 34311500.0
1 32730668.0
2 33845940.0
3 31335464.0
4 23584192.0
5 14068799.0
6 7252735.5
7 3674312.0
8 2069563.0
9 1346445.75
...
490 7.143352559069172e-05
491 7.078371709212661e-05
492 7.009323599049821e-05
493 6.912354729138315e-05
494 6.783746357541531e-05
495 6.718340591760352e-05
496 6.611335265915841e-05
497 6.529116944875568e-05
498 6.444999598897994e-05
499 6.381605635397136e-05
model[0].weight
Parameter containing:
tensor([[ 0.1849, -0.2587,  1.6247,  ..., -0.8608, -2.2139, -1.3076],
[-0.5197,  0.0600,  0.2141,  ...,  0.0561, -0.1613, -0.3905],
[-0.5303, -0.1129, -0.2974,  ..., -0.6166, -3.4082,  0.0969],
...,
[-0.4742,  0.2449, -1.5979,  ..., -0.6195, -0.2970, -1.3764],
[-0.1131,  0.4973,  0.7679,  ...,  0.1231,  0.6992,  0.4403],
[-0.1557,  0.8185,  0.7784,  ..., -0.9993,  0.3424, -1.1116]],
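The two weight printouts differ by roughly two orders of magnitude: the default weights in 3.1 are of order 0.01, while after normal_ they are of order 1. The sketch below measures this scale difference and its effect on the layer output. It only illustrates the difference in scale; it does not by itself explain the training speed, and the seed and variable names are assumptions.

```python
import torch

torch.manual_seed(0)
D_in, H = 1000, 100
X = torch.randn(64, D_in)

default_layer = torch.nn.Linear(D_in, H)    # PyTorch's default initialization
normal_layer = torch.nn.Linear(D_in, H)
torch.nn.init.normal_(normal_layer.weight)  # overwrite weights with N(0, 1) samples

with torch.no_grad():
    std_default = default_layer.weight.std().item()
    std_normal = normal_layer.weight.std().item()
    out_default = default_layer(X).std().item()
    out_normal = normal_layer(X).std().item()

print(std_default, std_normal)  # spread of the weights
print(out_default, out_normal)  # spread of the layer outputs
```

Larger weights give larger activations and larger gradients, so with the same learning_rate of 1e-6 each update has a much bigger effect after normal_ than with the default initialization.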

3.3 The optim package

The previous approach is still clumsy: the weights of the model are updated by hand. In 3.3 we let the optim package update the parameters for us. It provides a range of optimization methods, including SGD with momentum, RMSProp, Adam and so on.

Continue to improve on the basis of 3.2:

• With the optim package, a typical learning_rate is 1e-4
• Define an optimizer; here we optimize with Adam
• The whole parameter update part can be replaced by a single call to optimizer.step(), which updates all parameters in one step
import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),  # x W^T + b; bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# torch.nn.init.normal_(model[0].weight)
# torch.nn.init.normal_(model[2].weight)

# model = model.cuda()  #This is the case with GPU

loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model(X) calls model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # sum of squared errors; loss is a tensor attached to the computation graph
    print(t, loss.item())  # print the loss at every iteration

    '''Backward pass'''
    optimizer.zero_grad()  # clear the gradients left over from the previous iteration
    loss.backward()

    '''Update the parameters'''
    optimizer.step()  # update all parameters in one step
0 677.295166015625
1 660.0888061523438
2 643.3673095703125
3 627.08642578125
4 611.1599731445312
5 595.6091918945312
6 580.5427856445312
7 565.9138793945312
8 551.620849609375
9 537.651123046875
...
490 9.944045586962602e-09
491 9.147494317574001e-09
492 8.492017755656889e-09
493 7.793811818146423e-09
494 7.225093412444039e-09
495 6.644597760896431e-09
496 6.126881668677697e-09
497 5.687876836191208e-09
498 5.240272660245182e-09
499 4.8260742069317075e-09

Here the normal_ initialization has to be commented out, otherwise training goes very badly. (This is somewhat puzzling: when the learning_rate or the optimization method changes, the initialization may need to be revisited as well.)
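To make concrete what optimizer.step() replaces, the sketch below performs one manual update as in 3.1 and one update through an optimizer, starting from identical weights. Plain torch.optim.SGD is swapped in for Adam here, because SGD without momentum applies exactly the rule param -= lr * grad and is therefore directly comparable; the small single-layer models and the learning rate are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 8)
y = torch.randn(16, 2)
loss_fn = torch.nn.MSELoss(reduction='sum')
lr = 1e-3

model_a = torch.nn.Linear(8, 2)
model_b = torch.nn.Linear(8, 2)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

# (a) manual update, as in 3.1
loss_fn(model_a(X), y).backward()
with torch.no_grad():
    for param in model_a.parameters():
        param -= lr * param.grad

# (b) the same step through an optimizer (plain SGD, not Adam)
optimizer = torch.optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
loss_fn(model_b(X), y).backward()
optimizer.step()

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)
```

Adam uses the same zero_grad() / backward() / step() calls but rescales each parameter's step using running statistics of its gradients, so its updates would not match the manual rule.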

3.4 Custom model

A new model is usually defined by subclassing torch.nn.Module.

import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10 #64 training data (just a batch). The input is 1000 dimensions, the hidden is 100 dimensions and the output is 10 dimensions

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

'''Define a two-layer network'''
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Define the model structure
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        # Use the argument x, not the global X
        y_hat = self.linear2(self.linear1(x).clamp(min=0))
        return y_hat

model = TwoLayerNet(D_in, H, D_out)

loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model(X) calls model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # sum of squared errors; loss is a tensor attached to the computation graph
    print(t, loss.item())  # print the loss at every iteration

    '''Backward pass'''
    optimizer.zero_grad()  # clear the gradients left over from the previous iteration
    loss.backward()

    '''Update the parameters'''
    optimizer.step()  # update all parameters in one step
0 713.7529296875
1 695.759033203125
2 678.2886352539062
3 661.2178344726562
4 644.5472412109375
5 628.3016357421875
6 612.5072021484375
7 597.1802978515625
8 582.385009765625
9 568.1029663085938
...
490 3.386985554243438e-07
491 3.155915919705876e-07
492 2.9405845225483063e-07
493 2.7391826051825774e-07
494 2.553651086145692e-07
495 2.379783694550497e-07
496 2.2159480295158573e-07
497 2.0649896725899453e-07
498 1.9220941283037973e-07
499 1.790194232853537e-07

Adam also works well with the custom model.

4. Summary

The steps are always the same: define the parameters, define the model, define the loss function, then hand the parameters to an optimizer and train.

Added by karldenton on Fri, 31 Dec 2021 17:56:23 +0200