The goal is to deepen understanding step by step: from manual gradient derivation, to automatic differentiation, to the model abstractions.
1. Implementation of a two-layer neural network with numpy
A fully connected ReLU neural network with one hidden layer, no bias, and an L2 loss ($h$ is the hidden layer output, ReLU is the activation function):
- $h = W_1 X + b_1$
- $h_{relu} = \max(0, h)$
- $y_{hat} = W_2 h_{relu} + b_2$
In the implementation below, $b_1$ and $b_2$ are both 0, i.e. there is no bias term.
This implementation uses only numpy to compute the forward pass, the loss, and backpropagation.

A numpy ndarray is an ordinary n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is just a data structure for numerical operations.
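For reference, these are the chain-rule gradients that the backward pass in the code below computes. Since the code stores samples as rows, the forward pass is effectively $h = X W_1$, $h_{relu} = \max(0, h)$, $y_{hat} = h_{relu} W_2$, with loss $L = \sum (y_{hat} - y)^2$:

- $\partial L / \partial y_{hat} = 2\,(y_{hat} - y)$ (this is `grad_y_hat`)
- $\partial L / \partial W_2 = h_{relu}^T \,\partial L / \partial y_{hat}$ (`grad_W2`)
- $\partial L / \partial h_{relu} = \partial L / \partial y_{hat}\, W_2^T$ (`grad_h_relu`)
- $\partial L / \partial h = \partial L / \partial h_{relu} \odot \mathbf{1}[h > 0]$ (`grad_h`)
- $\partial L / \partial W_1 = X^T \,\partial L / \partial h$ (`grad_W1`)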
```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10  # 64 training samples (one batch); input 1000-d, hidden 100-d, output 10-d

'''Randomly create some training data'''
X = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

W1 = np.random.randn(D_in, H)   # 1000 -> 100 dimensions
W2 = np.random.randn(H, D_out)  # 100 -> 10 dimensions

learning_rate = 1e-6
for t in range(500):  # 500 iterations
    '''Forward pass'''
    h = X.dot(W1)              # N * H
    h_relu = np.maximum(h, 0)  # activation function, N * H
    y_hat = h_relu.dot(W2)     # N * D_out

    '''Compute loss'''
    loss = np.square(y_hat - y).sum()  # squared error, ignoring the division by N
    print(t, loss)  # print the loss at every iteration

    '''Backward pass'''
    # Compute the gradients by hand with the chain rule (no autograd here)
    grad_y_hat = 2.0 * (y_hat - y)       # d{loss}/d{y_hat}, N * D_out
    grad_W2 = h_relu.T.dot(grad_y_hat)   # from the third forward formula, d{loss}/d{W2}, H * D_out
    grad_h_relu = grad_y_hat.dot(W2.T)   # from the third forward formula, d{loss}/d{h_relu}, N * H
    grad_h = grad_h_relu.copy()          # d{h_relu}/d{h} = 1 where h > 0
    grad_h[h < 0] = 0                    # d{loss}/d{h}
    grad_W1 = X.T.dot(grad_h)            # from the first forward formula, d{loss}/d{W1}

    '''Update weights W1 and W2'''
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
```
```
0 36455139.29176882
1 35607818.495988876
2 36510242.60519045
3 32972837.109358862
4 23623067.52618093
5 13537226.736260608
6 6806959.784455631
7 3501526.30896816
8 2054356.1020693523
9 1400230.6793163505
...
490 3.4950278045838633e-06
491 3.3498523609301454e-06
492 3.210762995939165e-06
493 3.0774805749939447e-06
494 2.9500114328045522e-06
495 2.827652258736098e-06
496 2.710379907890261e-06
497 2.5980242077038875e-06
498 2.490359305069476e-06
499 2.387185101594446e-06
```
We can see that the loss keeps getting smaller. Now let's check how close the predicted values are to the true values:
```python
y_hat - y
```

```
array([[ 9.16825615e-06, -1.53964987e-05,  6.58365129e-06, -3.08909604e-05,
         1.05735798e-05,  1.73376919e-05,  2.63084233e-06, -1.11662576e-05,
         1.06904464e-05, -1.71528894e-05],
       ...
       [-5.79062537e-06, -1.74789200e-05,  5.27619647e-06, -7.82154474e-06,
         3.39896752e-06,  1.08366770e-05,  8.28712496e-06, -8.88009103e-06,
         5.78585909e-06, -1.14913078e-05]])
```
The differences are very small, so the training was successful.
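As an extra sanity check (not part of the original output), the largest absolute error can be printed in one line; for the run above it is on the order of 1e-5:

```python
# Hypothetical extra check: largest absolute prediction error after training
print(np.abs(y_hat - y).max())
```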
2. Using PyTorch automatic differentiation to implement the two-layer neural network
2.1 Manual gradient computation
Compared with the code in section 1, the changes are as follows (a small snippet after the list illustrates the API correspondences):

- `import numpy as np` becomes `import torch`
- `np.random.randn` becomes `torch.randn`
- `dot` becomes `mm`
- `h_relu` uses `np.maximum(h, 0)` in numpy and `h.clamp(min=0)` in torch, which clips the input to a lower bound of 0
- the loss uses `np.square` in numpy and `.pow(2)` in torch, and `.item()` converts the scalar tensor to a Python number
- transpose is `.T` in numpy and `.t()` in torch
- `copy()` becomes `clone()`
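As a quick side-by-side sketch of these correspondences (toy tensors, purely illustrative):

```python
import numpy as np
import torch

a_np = np.random.randn(2, 3)   # numpy ndarray
a_t = torch.randn(2, 3)        # torch tensor

relu_np = np.maximum(a_np, 0)  # numpy: elementwise max with 0
relu_t = a_t.clamp(min=0)      # torch: clamp to a lower bound of 0

s_np = np.square(a_np).sum()   # numpy scalar
s_t = a_t.pow(2).sum().item()  # torch: .item() extracts the Python number

m_np = a_np.T.dot(a_np)        # numpy: .T and .dot()
m_t = a_t.t().mm(a_t)          # torch: .t() and .mm()

c_np = a_np.copy()             # numpy copy
c_t = a_t.clone()              # torch clone
```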
```python
import torch

N, D_in, H, D_out = 64, 1000, 100, 10  # 64 training samples (one batch); input 1000-d, hidden 100-d, output 10-d

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

W1 = torch.randn(D_in, H)   # 1000 -> 100 dimensions
W2 = torch.randn(H, D_out)  # 100 -> 10 dimensions

learning_rate = 1e-6
for t in range(500):  # 500 iterations
    '''Forward pass'''
    h = X.mm(W1)             # N * H
    h_relu = h.clamp(min=0)  # activation function, N * H
    y_hat = h_relu.mm(W2)    # N * D_out

    '''Compute loss'''
    loss = (y_hat - y).pow(2).sum().item()  # squared error, ignoring the division by N
    print(t, loss)  # print the loss at every iteration

    '''Backward pass'''
    # Compute the gradients by hand with the chain rule (no autograd yet)
    grad_y_hat = 2.0 * (y_hat - y)       # d{loss}/d{y_hat}, N * D_out
    grad_W2 = h_relu.t().mm(grad_y_hat)  # from the third forward formula, d{loss}/d{W2}, H * D_out
    grad_h_relu = grad_y_hat.mm(W2.t())  # from the third forward formula, d{loss}/d{h_relu}, N * H
    grad_h = grad_h_relu.clone()         # d{h_relu}/d{h} = 1 where h > 0
    grad_h[h < 0] = 0                    # d{loss}/d{h}
    grad_W1 = X.t().mm(grad_h)           # from the first forward formula, d{loss}/d{W1}

    '''Update weights W1 and W2'''
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
```
```
0 28398944.0
1 27809498.0
2 32215128.0
3 37019776.0
4 36226528.0
5 27777396.0
6 16156263.0
7 7798599.0
8 3615862.0
9 1881907.25
...
490 5.404536932474002e-05
491 5.3628453315468505e-05
492 5.282810889184475e-05
493 5.204257831792347e-05
494 5.149881326360628e-05
495 5.084666554466821e-05
496 4.9979411414824426e-05
497 4.938142956234515e-05
498 4.8661189794074744e-05
499 4.8014146159403026e-05
```
2.2 Automatic gradient computation (autograd)
Compared with the code in 2.1, the changes are as follows (a minimal autograd illustration follows the list):

- Use `requires_grad=True` to declare that gradients should be computed for W1 and W2. If not passed, it defaults to False (as for X and y), which saves memory.
- In the forward pass, y_hat is computed directly in a single line for convenience.
- loss must now stay a tensor, so the `.item()` added earlier is removed from the loss computation and moved down to where the loss is printed.
- All the manual gradient computations are removed and replaced with `loss.backward()`.
- The manually computed `grad_W1` becomes `W1.grad` (and likewise for W2).
- After the parameter update, `W1.grad.zero_()` clears the gradient of W1 (same for W2).
- `with torch.no_grad():` keeps the update steps out of the computation graph so they do not occupy memory.
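A minimal illustration of the autograd mechanics used above (toy example, not part of the network code):

```python
import torch

w = torch.randn(3, requires_grad=True)  # gradients will be tracked for w
x = torch.randn(3)                      # requires_grad defaults to False

loss = (w * x).sum().pow(2)  # build a tiny computation graph
loss.backward()              # fills w.grad with d{loss}/d{w}
print(w.grad)                # a tensor with the same shape as w

with torch.no_grad():        # keep the update out of the graph
    w -= 0.01 * w.grad       # gradient-descent step, analogous to W1 -= lr * W1.grad
w.grad.zero_()               # clear the gradient before the next backward()
```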
```python
import torch

N, D_in, H, D_out = 64, 1000, 100, 10  # 64 training samples (one batch); input 1000-d, hidden 100-d, output 10-d

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

W1 = torch.randn(D_in, H, requires_grad=True)   # 1000 -> 100 dimensions
W2 = torch.randn(H, D_out, requires_grad=True)  # 100 -> 10 dimensions

learning_rate = 1e-6
for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = X.mm(W1).clamp(min=0).mm(W2)  # N * D_out

    '''Compute loss'''
    loss = (y_hat - y).pow(2).sum()  # squared error, ignoring the division by N; loss is part of the graph
    print(t, loss.item())            # print the loss at every iteration

    '''Backward pass'''
    loss.backward()

    '''Update weights W1 and W2'''
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
```
```
0 28114322.0
1 22391836.0
2 19137772.0
3 16153970.0
4 12953562.0
5 9725695.0
6 6933768.5
7 4784875.0
8 3286503.0
9 2288213.25
...
490 3.3917171094799414e-05
491 3.35296499542892e-05
492 3.318845119792968e-05
493 3.276047937106341e-05
494 3.244510298827663e-05
495 3.209296482964419e-05
496 3.168126931996085e-05
497 3.1402159947901964e-05
498 3.097686203545891e-05
499 3.074205596931279e-05
```
3. Implementing the two-layer neural network with the torch.nn library
3.1 Default weight initialization
Compared with the code in 2.2, the changes are as follows:

- Add `import torch.nn as nn` (nn contains the neural-network building blocks).
- W1 and W2 no longer need to be defined by hand; instead define `model = torch.nn.Sequential(...)`, a chain of layers applied in order.
- Computing y_hat now only requires `model(X)`.
- The loss no longer has to be written out; `loss_fn = nn.MSELoss(reduction='sum')` is enough, and `loss_fn` is used below.
- In the parameter update, the parameters are taken directly from the model with a for loop over `model.parameters()`.
- Clearing the gradients works the same way: `model.zero_grad()`.
```python
import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10  # 64 training samples (one batch); input 1000-d, hidden 100-d, output 10-d

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),  # W1 * X + b, bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
# model = model.cuda()  # when a GPU is available

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model(X) calls model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # squared error, ignoring the division by N; loss is part of the graph
    print(t, loss.item())     # print the loss at every iteration

    '''Backward pass'''
    loss.backward()

    '''Update the weights'''
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad  # update every parameter in the model
    model.zero_grad()
```
```
0 686.78662109375
1 686.2665405273438
2 685.7469482421875
3 685.2279663085938
4 684.7101440429688
5 684.1931762695312
6 683.6768188476562
7 683.1609497070312
8 682.6456909179688
9 682.130859375
...
490 496.4220275878906
491 496.12548828125
492 495.82940673828125
493 495.533203125
494 495.2373046875
495 494.94171142578125
496 494.6462707519531
497 494.35101318359375
498 494.0567321777344
499 493.7628479003906
```
```python
model
```

```
Sequential(
  (0): Linear(in_features=1000, out_features=100, bias=True)
  (1): ReLU()
  (2): Linear(in_features=100, out_features=10, bias=True)
)
```

```python
model[0]
```

```
Linear(in_features=1000, out_features=100, bias=True)
```

```python
model[0].weight
```

```
Parameter containing:
tensor([[-0.0147, -0.0315,  0.0085,  ...,  0.0039,  0.0254, -0.0308],
        [ 0.0046,  0.0125,  0.0128,  ..., -0.0241, -0.0206, -0.0127],
        [-0.0162,  0.0051,  0.0152,  ..., -0.0280, -0.0133,  0.0079],
        ...,
        [ 0.0239,  0.0237, -0.0025,  ...,  0.0290, -0.0192,  0.0187],
        [-0.0249,  0.0287,  0.0060,  ..., -0.0198,  0.0007,  0.0209],
        [ 0.0238, -0.0157, -0.0156,  ...,  0.0105,  0.0057, -0.0189]],
       requires_grad=True)
```
3.2 Normal weight initialization (torch.nn.init.normal_)
The model in 3.1 learns very slowly, which may be due to poor weight initialization, so we use `torch.nn.init.normal_` to re-initialize the weights of layers 0 and 2 (the Linear layers) from a normal distribution:
```python
import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10  # 64 training samples (one batch); input 1000-d, hidden 100-d, output 10-d

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),  # W1 * X + b, bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)
# model = model.cuda()  # when a GPU is available

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # squared error, ignoring the division by N; loss is part of the graph
    print(t, loss.item())     # print the loss at every iteration

    '''Backward pass'''
    loss.backward()

    '''Update the weights'''
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad  # update every parameter in the model
    model.zero_grad()
```
```
0 34311500.0
1 32730668.0
2 33845940.0
3 31335464.0
4 23584192.0
5 14068799.0
6 7252735.5
7 3674312.0
8 2069563.0
9 1346445.75
...
490 7.143352559069172e-05
491 7.078371709212661e-05
492 7.009323599049821e-05
493 6.912354729138315e-05
494 6.783746357541531e-05
495 6.718340591760352e-05
496 6.611335265915841e-05
497 6.529116944875568e-05
498 6.444999598897994e-05
499 6.381605635397136e-05
```
```python
model[0].weight
```

```
Parameter containing:
tensor([[ 0.1849, -0.2587,  1.6247,  ..., -0.8608, -2.2139, -1.3076],
        [-0.5197,  0.0600,  0.2141,  ...,  0.0561, -0.1613, -0.3905],
        [-0.5303, -0.1129, -0.2974,  ..., -0.6166, -3.4082,  0.0969],
        ...,
        [-0.4742,  0.2449, -1.5979,  ..., -0.6195, -0.2970, -1.3764],
        [-0.1131,  0.4973,  0.7679,  ...,  0.1231,  0.6992,  0.4403],
        [-0.1557,  0.8185,  0.7784,  ..., -0.9993,  0.3424, -1.1116]],
       requires_grad=True)
```
3.3 Using the optim package
The gradient-descent step used so far is still crude: the model weights are updated by hand. In 3.3 we let the optim package update the parameters for us. It provides a range of optimization methods, including SGD+momentum, RMSProp, Adam, and so on; the sketch below shows how a few of them are constructed.
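A minimal sketch of constructing different optimizers from torch.optim (toy model and learning rates chosen for illustration):

```python
import torch
import torch.nn as nn

# Toy model just for illustration; the real model is defined in the training code below.
model = nn.Sequential(nn.Linear(1000, 100), nn.ReLU(), nn.Linear(100, 10))

opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)  # SGD + momentum
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-4)        # RMSProp
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-4)              # Adam (used below)
```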
Building on the code in 3.2, the changes are:

- With the optim package a typical learning rate is 1e-4.
- Define an optimizer and optimize with Adam.
- The whole parameter-update block is replaced by a single call to `optimizer.step()`, which updates all parameters in one step.
- `optimizer.zero_grad()` clears the gradients.
```python
import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10  # 64 training samples (one batch); input 1000-d, hidden 100-d, output 10-d

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),  # W1 * X + b, bias defaults to True
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
# torch.nn.init.normal_(model[0].weight)
# torch.nn.init.normal_(model[2].weight)
# model = model.cuda()  # when a GPU is available

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # squared error, ignoring the division by N; loss is part of the graph
    print(t, loss.item())     # print the loss at every iteration

    optimizer.zero_grad()  # clear the gradients before computing new ones

    '''Backward pass'''
    loss.backward()

    '''Update the weights'''
    optimizer.step()  # update all parameters in one step
```
```
0 677.295166015625
1 660.0888061523438
2 643.3673095703125
3 627.08642578125
4 611.1599731445312
5 595.6091918945312
6 580.5427856445312
7 565.9138793945312
8 551.620849609375
9 537.651123046875
...
490 9.944045586962602e-09
491 9.147494317574001e-09
492 8.492017755656889e-09
493 7.793811818146423e-09
494 7.225093412444039e-09
495 6.644597760896431e-09
496 6.126881668677697e-09
497 5.687876836191208e-09
498 5.240272660245182e-09
499 4.8260742069317075e-09
```
Note that the normal initialization from 3.2 has to be commented out here, otherwise the updates behave very badly. (This is a bit odd; if the learning_rate or the optimization method changes again, the initialization may need to be revisited.)
3.4 Custom model (subclassing nn.Module)
A new model is usually defined by inheriting from `torch.nn.Module`.
```python
import torch
import torch.nn as nn  # building blocks for defining neural networks

N, D_in, H, D_out = 64, 1000, 100, 10  # 64 training samples (one batch); input 1000-d, hidden 100-d, output 10-d

'''Randomly create some training data'''
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

'''Define a two-layer network'''
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # define the model structure
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        y_hat = self.linear2(self.linear1(x).clamp(min=0))
        return y_hat

model = TwoLayerNet(D_in, H, D_out)
loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):  # 500 iterations
    '''Forward pass'''
    y_hat = model(X)  # model.forward(X), N * D_out

    '''Compute loss'''
    loss = loss_fn(y_hat, y)  # squared error, ignoring the division by N; loss is part of the graph
    print(t, loss.item())     # print the loss at every iteration

    optimizer.zero_grad()  # clear the gradients before computing new ones

    '''Backward pass'''
    loss.backward()

    '''Update the weights'''
    optimizer.step()  # update all parameters in one step
```
```
0 713.7529296875
1 695.759033203125
2 678.2886352539062
3 661.2178344726562
4 644.5472412109375
5 628.3016357421875
6 612.5072021484375
7 597.1802978515625
8 582.385009765625
9 568.1029663085938
...
490 3.386985554243438e-07
491 3.155915919705876e-07
492 2.9405845225483063e-07
493 2.7391826051825774e-07
494 2.553651086145692e-07
495 2.379783694550497e-07
496 2.2159480295158573e-07
497 2.0649896725899453e-07
498 1.9220941283037973e-07
499 1.790194232853537e-07
```
Adam works well here as well.
4. Summary
In fact, the steps are always the same: define the data and parameters, define the model, define the loss function, then hand the parameters to an optimizer and train.
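Condensed into a minimal skeleton of the pattern used throughout this post (random data, as in the examples above):

```python
import torch
import torch.nn as nn

# 1. data
X, y = torch.randn(64, 1000), torch.randn(64, 10)

# 2. model
model = nn.Sequential(nn.Linear(1000, 100), nn.ReLU(), nn.Linear(100, 10))

# 3. loss function and optimizer
loss_fn = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 4. training loop: forward, loss, backward, step
for t in range(500):
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```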