# Training a first linear regression model

You can run the program yourself if you want (reference here). For ndarray and autograd, please refer to the previous blog posts.

# Preface

Suppose we have a function y = w*x + b. If w and b are known, then for any given x we can compute the corresponding y.
But when w and b are unknown and we are given only a single pair of x and y, the w and b we solve for may satisfy that pair alone and fail on other x and y. What we need instead is a model that trains w and b to fit as many (x, y) pairs as possible; that is, given a certain number of x and y values, infer the w and b that fit them best.
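
To make the point concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration and do not come from the data set used later. Two different (w, b) candidates both fit a single observation exactly, yet they disagree on a second input, so one pair of x and y cannot pin the parameters down.

```python
# Illustrative values only; none of these numbers come from the data set below.
def f(w, b, x):
    """The linear model y = w*x + b."""
    return w * x + b

x1, y1 = 1.0, 5.0        # the single observed pair

cand_a = (2.0, 3.0)      # w=2, b=3 gives 2*1 + 3 = 5
cand_b = (4.0, 1.0)      # w=4, b=1 gives 4*1 + 1 = 5

fits_a = f(*cand_a, x1) == y1
fits_b = f(*cand_b, x1) == y1

# On a new input the two candidates give different predictions.
x2 = 2.0
pred_a = f(*cand_a, x2)  # 2*2 + 3
pred_b = f(*cand_b, x2)  # 4*2 + 1
print(fits_a, fits_b, pred_a, pred_b)
```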

# Generate data

Now define the function y[i] = 2 * x[i][0] - 3.4 * x[i][1] + 4.2 + noise, where noise follows a normal distribution with mean 0 and, as generated in the code below, standard deviation 0.01; it represents meaningless interference in the data set. Here w is [2, -3.4] and b is 4.2. Put another way, the function has two weights and two inputs: y = w1*x1 + w2*x2 + b.

```
from mxnet import ndarray as nd
num_inputs = 2 #Two input dimensions, i.e. two features per x
num_examples = 1000 #Generate 1000 samples
true_w = [2,-3.4]
true_b = 4.2

x = nd.random_normal(shape=(num_examples,num_inputs))
#A matrix with 1000 rows and 2 columns. Each element is sampled from a normal distribution with mean 0 and standard deviation 1

y = true_w[0] * x[:,0] + true_w[1] * x[:,1] + true_b

#x[:,0] is the array made up of the first column of every row
#You can print x, x[:,0], x[:,0].shape and y.shape to see what they look like and what type or shape they have. It may feel a little abstract at first

y += .01*nd.random_normal(shape=y.shape) #Add noise

#Print x and y to see the randomly generated x matrix and the array of 1000 values computed from the elements of x
#When computing y here, I wondered whether arithmetic between operands of different types and shapes was valid. Then I remembered ndarray's broadcast mechanism, which is one of the conveniences we get in Python
```
```
print(x[0:10],y[0:10]) #The first ten rows of matrix x and the ten y values generated from them; the output also shows that x and y have different shapes

[[-0.47243103  1.2975348 ]
[ 1.5410181  -2.5207853 ]
[-0.60842186 -1.7573569 ]
[ 0.6143626   0.0028276 ]
[ 0.00257095 -0.5846045 ]
[ 0.64122546  0.0483991 ]
[-0.20711961 -0.34759858]
[ 0.25469646  0.01989137]
[-0.39016405 -2.276683  ]
[-0.5919514  -2.4271743 ]]
<NDArray 10x2 @cpu(0)>
[-1.1587338 15.844224   8.968834   5.409754   6.185884   5.302211
4.9785733  4.640998  11.1503    11.273367 ]
<NDArray 10 @cpu(0)>
```
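
As a small aside on the noise term: multiplying a standard normal sample by 0.01, as `y += .01*nd.random_normal(shape=y.shape)` does, gives noise whose standard deviation is about 0.01. The sketch below checks this with Python's standard library instead of MXNet; the seed and the sample count of 10000 are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Scale standard normal samples by 0.01, mirroring `.01 * nd.random_normal(...)`.
noise = [0.01 * random.gauss(0.0, 1.0) for _ in range(10000)]

noise_mean = statistics.fmean(noise)
noise_std = statistics.pstdev(noise)
print(noise_mean, noise_std)  # mean close to 0, spread close to 0.01
```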

When training the model, we need to traverse the data set and repeatedly read small batches of samples. Here we define a function that, each time it yields, returns the features and labels of batch_size (the batch size) random samples.

```
import random
batch_size = 10 #Number of samples read each time
def data_iter():
    idx = list(range(num_examples))
    #List of the indices 0-999

    random.shuffle(idx)
    #shuffle() puts the elements of the sequence in random order, in place

    for i in range(0,num_examples,batch_size): #Loops 100 times
        j = nd.array(idx[i:min(i+batch_size,num_examples)])
        #Take 10 elements from idx. Writing idx[i:i+batch_size] directly also works, because a slice past the end of a list is clamped

        yield nd.take(x,j),nd.take(y,j)
        #nd.take(x,j) takes the rows of x whose indices are in j, and nd.take(y,j) takes the matching elements of y; j is an array of indices
        #You can print nd.take(x,j) and nd.take(y,j) to see what they are
        #In Python, a function that uses yield is a generator function: calling it returns an iterator, and each item here is a (data, label) tuple

#This gives 100 random data blocks. Each block holds a matrix with 10 rows and 2 columns (data below) and an array of length 10 (label below)
#x and y were fully generated in the first step; as I see it, this step is for randomness in the data and convenience in reading it
```
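
The batching logic can be checked without MXNet. The following is a minimal sketch in plain Python; `batch_indices` is a hypothetical helper written for this illustration, and 25 examples are used so that the last batch comes out short. It suggests that every index is visited exactly once per pass and that the `min()` guard is not needed for list slices.

```python
import random

def batch_indices(num_examples, batch_size, seed=None):
    """Yield lists of shuffled indices, batch_size at a time (hypothetical helper)."""
    rng = random.Random(seed)
    idx = list(range(num_examples))
    rng.shuffle(idx)
    for i in range(0, num_examples, batch_size):
        # A slice past the end of a list is clamped, so no min() is needed.
        yield idx[i:i + batch_size]

batches = list(batch_indices(num_examples=25, batch_size=10, seed=0))
sizes = [len(b) for b in batches]
seen = sorted(i for b in batches for i in b)
print(sizes)  # the last batch is shorter because 25 is not a multiple of 10
```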

Let's read the first mini-batch of samples and print it.

```
for data,label in data_iter():
    print(data,label) #Print the first random data block
    break

[[ 0.7542019  -0.48587778]
[-1.9441624  -0.91037935]
[ 0.13180183  0.88579226]
[-1.4955239   0.737821  ]
[-0.88221204 -0.18438959]
[-0.7792825  -0.53876454]
[-0.8198182   1.4236803 ]
[ 0.02309756 -0.29708868]
[ 0.05650486 -0.6636138 ]
[ 2.4149287   0.48304093]]
<NDArray 10x2 @cpu(0)>
[ 7.357671   3.4120867  1.4567578 -1.3204895  3.057557   4.484297
-2.299931   5.250752   6.5591908  7.3872066]
<NDArray 10 @cpu(0)>
```

# Initialize model parameters

Purpose: we know x and y and want to find w and b, so we start from initial guesses, a random w and a zero b.

```
w = nd.random_normal(shape=(num_inputs,1)) #w is a matrix with two rows and one column
b = nd.zeros((1,))
params = [w,b] #params is a list
```
```
w,b,params #At this point w and b are far from the true values, so we need training to bring them closer

(
[[1.201833  ]
[0.29849657]]
<NDArray 2x1 @cpu(0)>,

[0.]
<NDArray 1 @cpu(0)>,
[
[[1.201833  ]
[0.29849657]]
<NDArray 2x1 @cpu(0)>,

[0.]
<NDArray 1 @cpu(0)>])
```

During training we need the derivatives of the loss with respect to these parameters in order to update their values, so we first have to create their gradients.

```
for param in params:
    param.attach_grad() #Allocate space for the gradient of each parameter
```
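
To make "gradient" concrete: for the squared error used later in this post, the derivative with respect to each parameter can be written out by hand. The sketch below uses plain Python, a single sample and made-up numbers, and its function names are hypothetical. It compares the hand-derived gradient with a finite-difference estimate. It only illustrates what a gradient is; it is not how MXNet computes it.

```python
def loss(w1, w2, b, x1, x2, y):
    """Squared error of the linear model on one sample."""
    return (w1 * x1 + w2 * x2 + b - y) ** 2

def analytic_grad(w1, w2, b, x1, x2, y):
    """Partial derivatives of the loss w.r.t. w1, w2 and b (chain rule)."""
    r = w1 * x1 + w2 * x2 + b - y  # residual
    return (2 * r * x1, 2 * r * x2, 2 * r)

def numeric_grad(w1, w2, b, x1, x2, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    g1 = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, b, x1, x2, y) - loss(w1, w2 - eps, b, x1, x2, y)) / (2 * eps)
    gb = (loss(w1, w2, b + eps, x1, x2, y) - loss(w1, w2, b - eps, x1, x2, y)) / (2 * eps)
    return (g1, g2, gb)

# Made-up sample and parameters, chosen so the arithmetic is easy to follow.
args = (1.0, 0.5, 0.0, 0.5, -1.0, 3.0)  # w1, w2, b, x1, x2, y
analytic = analytic_grad(*args)
numeric = numeric_grad(*args)
print(analytic, numeric)
```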

# Define model

```
def net(x):  #Returns the predicted value yhat, computed from the current parameters w and b
    return nd.dot(x,w) + b
    #Here x is a matrix with L rows and 2 columns, and w is a matrix with 2 rows and 1 column
    #Their product is a matrix with L rows and 1 column; the element in row i is x[i][0]*w[0]+x[i][1]*w[1]
    #b is then added to every row of that L-by-1 matrix
    #This is the broadcast mechanism again (used when two ndarrays of different shapes are combined element by element)
```

Run it and see.

```
print(data) #The data block generated above
print(net(data)) #The predicted value is obtained through the model

[[ 0.7542019  -0.48587778]
[-1.9441624  -0.91037935]
[ 0.13180183  0.88579226]
[-1.4955239   0.737821  ]
[-0.88221204 -0.18438959]
[-0.7792825  -0.53876454]
[-0.8198182   1.4236803 ]
[ 0.02309756 -0.29708868]
[ 0.05650486 -0.6636138 ]
[ 2.4149287   0.48304093]]
<NDArray 10x2 @cpu(0)>

[[ 0.7613919 ]
[-2.6083038 ]
[ 0.42280975]
[-1.577133  ]
[-1.1153111 ]
[-1.0973868 ]
[-0.56032085]
[-0.06092054]
[-0.13017704]
[ 3.046527  ]]
<NDArray 10x1 @cpu(0)>
```
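
The first two predictions can be reproduced by hand from the values printed in this post: the initial w, the initial b, and the first two rows of data. The sketch below does the arithmetic in plain Python, with w flattened to a list for convenience.

```python
# Values copied from the printed outputs in this post.
w = [1.201833, 0.29849657]  # initial weights (2 rows, 1 column, flattened here)
b = 0.0                     # initial bias
rows = [
    [0.7542019, -0.48587778],
    [-1.9441624, -0.91037935],
]

# One row of nd.dot(x, w) + b is x[i][0]*w[0] + x[i][1]*w[1] + b.
yhat = [row[0] * w[0] + row[1] * w[1] + b for row in rows]
print(yhat)  # compare with the first two entries of net(data)
```

Up to rounding in the printed digits, the results agree with the first two rows of the `net(data)` output.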

# Loss function

```
def square_loss(yhat,y):
    #y is a 1-D array; reshape it to the shape of yhat to avoid automatic broadcasting
    return (yhat - y.reshape(yhat.shape))**2
    #Squared error measures the gap between the predicted value and the true value
```
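
Why the reshape matters is easier to see with shapes. The sketch below uses NumPy as a stand-in for MXNet's ndarray, on the assumption that the two follow the same broadcasting rules in this case; the arrays are dummies. Subtracting a length-10 array from a 10-by-1 matrix silently produces a 10-by-10 result, which is not the per-sample error we want.

```python
import numpy as np  # stand-in for mxnet.ndarray; broadcasting is assumed to behave the same here

yhat = np.zeros((10, 1))  # shape of the model output
y = np.arange(10.0)       # shape (10,), like the labels

wrong = (yhat - y) ** 2                      # (10, 1) with (10,) broadcasts to (10, 10)
right = (yhat - y.reshape(yhat.shape)) ** 2  # element by element, stays (10, 1)

print(wrong.shape, right.shape)
```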

# Solving with stochastic gradient descent

We move the model parameters a certain distance in the direction opposite to the gradient; the size of that step is generally called the learning rate.
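
A one-dimensional sketch in plain Python may help show what the learning rate does. It minimizes the made-up function f(w) = (w - 3)**2, which is not part of this post's model, and `descend` is a hypothetical helper. With a small learning rate the parameter approaches the minimum at 3; with one that is too large, each step overshoots and the parameter moves further away.

```python
def descend(lr, steps=20, w=0.0):
    """Minimize f(w) = (w - 3)**2 by plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative of (w - 3)**2
        w = w - lr * grad   # step against the gradient
    return w

w_small = descend(lr=0.1)  # distance to 3 shrinks by a factor 0.8 per step
w_large = descend(lr=1.1)  # distance to 3 grows by a factor 1.2 per step, flipping sign
print(w_small, w_large)
```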

```
def SGD(params,lr):
    for param in params:
        param[:] = param-lr*param.grad  #What would happen if - were changed to +?

#param[:] = ... is said to be an in-place operation. I have not fully understood it yet; it is a bit like passing arguments to a function
#param[:] cannot be changed to param: that would only rebind the loop variable and leave params unchanged, so nothing would be learned. The test is as follows (the loop variable is named row so that it does not overwrite the model's b):
a = nd.array([[1,2],[3,4]])
c = nd.array([1,1])
for row in a:
    row = row - c
print(a)

[[1. 2.]
[3. 4.]] #a is unchanged
<NDArray 2x2 @cpu(0)>

a = nd.array([[1,2],[3,4]])
c = nd.array([1,1])
for row in a:
    row[:] = row - c
print(a)

[[0. 1.]
[2. 3.]]
<NDArray 2x2 @cpu(0)>
```
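
The same distinction exists for ordinary Python lists, which may make it easier to see. This is only an analogy with nested lists, not a statement about NDArray internals: assigning to the loop name rebinds that name, while slice assignment writes into the object the name refers to.

```python
a = [[1, 2], [3, 4]]

for row in a:
    row = [v - 1 for v in row]     # rebinds the local name only
unchanged = [r[:] for r in a]      # snapshot taken after the first loop

for row in a:
    row[:] = [v - 1 for v in row]  # slice assignment writes into the existing list
print(unchanged, a)
```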

# Training

Iterate over the data several times; for each batch, compute the gradient and update the model parameters.

```
from mxnet import autograd
epochs = 5 #Number of passes over the data
learning_rate = 0.001 #Learning rate; try a larger value and see what happens
for e in range(epochs):
    total_loss = 0 #Accumulated loss for this epoch
    for data,label in data_iter(): #Yields 100 batches, as defined above
        with autograd.record(): #Record the computation so that gradients can be taken
            output = net(data) #Get the output from the model
            loss = square_loss(output,label) #Gap between predicted and real values
        loss.backward() #Compute the gradients of loss with respect to w and b
        SGD(params,learning_rate) #Update the parameters

        total_loss += nd.sum(loss).asscalar()
    print("Epoch %d ,average loss: %f"%(e,total_loss/num_examples))

Epoch 0 ,average loss: 7.926646
Epoch 1 ,average loss: 0.156229
Epoch 2 ,average loss: 0.003208
Epoch 3 ,average loss: 0.000159
Epoch 4 ,average loss: 0.000097 #The loss (error) is reduced. If you iterate several times, you will find that it finally converges to a certain number
```
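
For readers without MXNet installed, the whole loop can be approximated in NumPy, with hand-derived gradients taking the place of autograd. This is a sketch under that assumption and not the code that produced the output above. The seed, the variable names and the use of NumPy are my own choices, so the exact numbers will differ.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for repeatability
num_examples, num_inputs, batch_size = 1000, 2, 10
true_w, true_b = np.array([2.0, -3.4]), 4.2

# Same recipe as in this post: standard normal inputs, small Gaussian noise.
X = rng.normal(size=(num_examples, num_inputs))
y = X @ true_w + true_b + 0.01 * rng.normal(size=num_examples)

w = rng.normal(size=(num_inputs, 1))  # random initial weights
b = np.zeros(1)                       # zero initial bias
lr, epochs = 0.001, 5
epoch_losses = []

for epoch in range(epochs):
    order = rng.permutation(num_examples)  # reshuffle every epoch
    total_loss = 0.0
    for i in range(0, num_examples, batch_size):
        j = order[i:i + batch_size]
        data, label = X[j], y[j].reshape(-1, 1)
        residual = data @ w + b - label    # shape (batch, 1)
        total_loss += float((residual ** 2).sum())
        # Gradients of the summed squared error, derived by hand.
        grad_w = 2 * data.T @ residual
        grad_b = 2 * residual.sum(axis=0)
        w -= lr * grad_w                   # in-place update, like param[:] = ...
        b -= lr * grad_b
    epoch_losses.append(total_loss / num_examples)

print(epoch_losses)
print(w.ravel(), b)
```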

Compare the true parameters with the parameters obtained by training.

```
true_w,w

([2, -3.4],

[[ 2.0008717]
[-3.3992171]]
<NDArray 2x1 @cpu(0)>)

true_b,b

(4.2,

[4.199591] #Although there are errors, they are very close, because there is still noise
<NDArray 1 @cpu(0)>)
```

# Summary

This first deep learning program took me a long time to understand. I was not very familiar with Python syntax, and some of the code made no sense to me at first, such as take() and indexing; the matrix and gradient operations also tied back to the advanced mathematics and linear algebra I had studied before. After going through Li Mu's class twice, I worked out the purpose and logic of each step and finally understood about 80% of it. Overall there are three steps: read the data, define the model, and train. Although the program is not long, I found it hard at the beginning. I hope things will gradually get smoother now that the first hurdle is behind me.

Keywords: Python Deep Learning

Added by OilSheikh on Tue, 08 Mar 2022 03:01:53 +0200