[Fish book notes] Introduction to Deep Learning: Theory and Implementation Based on Python (personal notes)

In order to complete my project, I recently started studying introductory deep learning.

I would like to share my notes from reading the fish book. If there are any mistakes or omissions, please point them out!

If reprinted, please indicate the source!

1, Perceptron

The perceptron receives multiple input signals and outputs one signal.

As shown in the figure, the perceptron receives two input signals. Here \(\theta\) is the threshold; the neuron is activated when the weighted sum of the inputs exceeds the threshold.

The limitation of a single perceptron is that it can only represent regions divided by a straight line, that is, linear spaces. A multi-layer perceptron can represent more complex functions.

2, Neural network

A neural network consists of three parts: the input layer, the hidden layers and the output layer.

1. Activation function

The activation function converts the weighted sum of the input signals into an output signal; it can be thought of as a simple filtering of the computed result.

The activation function shown in the figure is a step function.

1) sigmoid function

sigmoid function is a commonly used neural network activation function.

The formula is:

\[h(x)=\frac{1}{1+e^{-x}} \]

As shown in the figure, its output value is between 0 and 1.
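
A minimal NumPy implementation of this formula (it works element-wise on arrays):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))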

2) ReLU function

The ReLU (Rectified Linear Unit) function is an activation function that has come into wide use more recently.
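
It outputs the input as-is when the input is greater than 0, and outputs 0 otherwise. A minimal implementation:

def relu(x):
    return np.maximum(0, x)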

3) tanh function
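
The tanh (hyperbolic tangent) function \(\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\) is also used as an activation function; its output lies between -1 and 1 and is centered at 0.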

2. Implementation of three-layer neural network

This network consists of an input layer, two hidden layers and an output layer.

def forward(network, x): # x is the input data
  # unpack the weights and biases stored in the network dictionary
  W1, W2, W3 = network['W1'], network['W2'], network['W3']
  b1, b2, b3 = network['b1'], network['b2'], network['b3']
  # first hidden layer: matrix product plus bias, then the activation function
  a1 = np.dot(x, W1) + b1
  z1 = sigmoid(a1)
  # second hidden layer
  a2 = np.dot(z1, W2) + b2
  z2 = sigmoid(a2)
  # output layer: identity_function outputs a3 as-is
  a3 = np.dot(z2, W3) + b3
  y = identity_function(a3)
  return y # y is the final result

3. Output layer activation function

Generally speaking, regression problems use the identity function and classification problems use the softmax function.

Formula of softmax function:

\[y_{k}=\frac{e^{a_{k}}}{\sum_{i=1}^{n}e^{a_{i}}} \]

Assuming there are \(n\) neurons in the output layer, this computes the output \(y_{k}\) of the \(k\)-th neuron.

The sum of the output values of the softmax function is 1. Therefore, we can interpret its output as probability.
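
A minimal implementation; subtracting the maximum value before exponentiation is a common trick to prevent overflow in np.exp and does not change the result:

def softmax(a):
    c = np.max(a)
    exp_a = np.exp(a - c)
    return exp_a / np.sum(exp_a)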

The number of neurons in the output layer is generally equal to the number of set categories.

4. Handwritten numeral recognition

Use MNIST dataset.

Using the pickle package to serialize and deserialize the prepared data speeds up loading.

Normalization: restricting the data to a certain range (for example, scaling pixel values into \([0, 1]\)).

Batch processing

The input data is packed into batches so that multiple images can be processed at once.

batch_size = 100
for i in range(0, len(x), batch_size): # x is the input data
  x_batch = x[i:i+batch_size] # slice out batch_size images at a time
  y_batch = predict(network, x_batch)
  p = np.argmax(y_batch, axis = 1) # index of the highest-probability class for each image

3, Learning of neural network

Learning refers to the process of automatically obtaining the optimal weight parameters from the training data.

1. Data-driven approach

Extract feature quantities (such as SIFT, SURF or HOG) from the image, use them to convert the image data into vectors, and then train a classifier from machine learning, such as SVM or KNN, on the converted vectors.

2. Loss function

The neural network takes the loss function as an index to find the optimal weight parameters.

The purpose of neural network learning is to reduce the value of loss function as much as possible.

We generally use mean square error and cross entropy error functions.

1) Mean square error

Mean Squared Error.

\[E=\frac{1}{2}\sum_{k}(y_{k}-t_{k})^2 \]

\(y_{k}\) is the output of the neural network, \(t_{k}\) is the correct-solution label, and \(k\) is the number of dimensions of the data.

One-hot representation: the element corresponding to the correct-solution label is 1, and all other elements are 0.

For example:

t = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] # It is assumed that the number "2" is the correct result during number recognition
y = [0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0]
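
A minimal NumPy implementation of the formula above; for the y and t of this example it returns about 0.0975:

def mean_squared_error(y, t):
    return 0.5 * np.sum((np.array(y) - np.array(t)) ** 2)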

2) Cross entropy error

Cross Entropy Error

\[E=-\sum_{k}t_{k}\log y_{k} \]

\(y_{k}\) is the output of the neural network and \(t_{k}\) is the correct-solution label (one-hot).
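
A minimal implementation; a tiny constant is added inside the log so that log(0) never occurs. For the y and t of the previous example it returns about 0.51:

def cross_entropy_error(y, t):
    delta = 1e-7  # prevents -inf from log(0)
    return -np.sum(np.array(t) * np.log(np.array(y) + delta))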

3) Mini batch learning

If we compute the average loss function over all training data, taking the cross-entropy error as an example, it is:

\[E=-\frac{1}{N}\sum_{n}\sum_{k}t_{nk}\log y_{nk} \]

We can select a portion of the data as an approximate representative of all the data. This portion is called a mini-batch.

This is like a sample survey.

train_size = x_train.shape[0] # number of samples in the training set
batch_size = 10 # size of the mini-batch
batch_mask = np.random.choice(train_size, batch_size) # randomly pick batch_size indices out of train_size
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]
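
To match the averaged formula above, the cross-entropy error can be extended so that it also accepts a whole mini-batch and returns the average loss; a minimal sketch (one-hot labels and NumPy arrays assumed):

def cross_entropy_error(y, t):
    if y.ndim == 1:  # a single sample is treated as a batch of size 1
        y = y.reshape(1, y.size)
        t = t.reshape(1, t.size)
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size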

3. Numerical differentiation

1) Derivative

The central difference is used to approximate the derivative.

def numerical_diff(f, x): # derivative of the function f at x
  h = 1e-4 # tiny value
  return (f(x+h) - f(x-h)) / (2 * h)

2) Gradient

The vector formed by the partial derivatives of all variables is called gradient.

For example, for the function \(f(x,y)=x^2+y^2\), the gradient at \((x,y)\) is \((\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})\).

Its Python implementation is as follows:

def numerical_gradient(f, x):
  h = 1e-4
  grad = np.zeros_like(x) # array of the same shape as x, used to store the gradient

  for idx in range(x.size):
    tmp_val = x[idx]
    # f(x+h)
    x[idx] = tmp_val + h
    fxh1 = f(x)
    # f(x-h)
    x[idx] = tmp_val - h
    fxh2 = f(x)
    # partial derivative with respect to x[idx]
    grad[idx] = (fxh1 - fxh2) / (2*h)
    x[idx] = tmp_val # restore the original value
  return grad
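
For example, for \(f(x_0,x_1)=x_0^2+x_1^2\) the gradient at \((3.0, 4.0)\) is approximately \((6, 8)\):

numerical_gradient(lambda x: x[0]**2 + x[1]**2, np.array([3.0, 4.0]))
# -> array([6., 8.])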

The gradient points to the direction where the function value at each point decreases the most.

3) Gradient descent method

We usually use the gradient descent method to find the minimum value of the loss function along the gradient direction.

Taking the function above as an example, the variables are repeatedly updated with the following formulas:

\[x=x-\eta\frac{\partial f}{\partial x}\\ y=y-\eta\frac{\partial f}{\partial y}\\ \]

\(\eta\) is the size of the update, called the learning rate. Its initial value is typically 0.01 or 0.001.

The gradient descent method is implemented in Python as follows:

# f is the function, init_x is the initial parameter array,
# lr is the learning rate (default 0.01), step_num is the number of iterations (default 100)
def gradient_descent(f, init_x, lr = 0.01, step_num = 100):
  x = init_x
  for i in range(step_num):
    grad = numerical_gradient(f, x)
    x = x - lr * grad
  return x
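
A usage example: starting from \((-3.0, 4.0)\) and descending on the same function with a learning rate of 0.1, the result ends up very close to the minimum \((0, 0)\):

f = lambda x: x[0]**2 + x[1]**2
gradient_descent(f, init_x=np.array([-3.0, 4.0]), lr=0.1, step_num=100)
# -> array very close to [0., 0.]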

4. Gradient of neural network

The learning of neural network requires the gradient of loss function with respect to weight parameters.

For example, for a \(2 \times 3\) weight matrix \(W\) with loss function \(L\), the gradient \(\frac{\partial L}{\partial W}\) is:

\[W=\begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix},\qquad \frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{13}} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{23}} \end{pmatrix} \]

5. Implementation of learning algorithm

The process of dynamically adjusting the weights and biases to fit the training data is called learning. There are four steps:

  • Mini-batch: randomly select a mini-batch from the training data; the goal is to reduce the value of the loss function on this mini-batch. Because the data is selected randomly, the method is called stochastic gradient descent (SGD).
  • Calculate gradient: calculate the gradient of each weight parameter
  • Update parameter: slightly update the weight along the gradient direction
  • Repeat the above steps

Suppose a neural network has two weight parameters \(W1\) and \(W2\), and two bias parameters \(b1\) and \(b2\):

class TwoLayerNet:
  # calculate and return the network output
  def predict(self, x):
    W1, W2 = self.params['W1'], self.params['W2']
    b1, b2 = self.params['b1'], self.params['b2']
    a1 = np.dot(x, W1) + b1
    z1 = sigmoid(a1)
    a2 = np.dot(z1, W2) + b2
    y = softmax(a2)
    return y
  # calculate the loss value; t is the correct-solution label
  def loss(self, x, t):
    y = self.predict(x)
    return cross_entropy_error(y, t)
  # calculate the gradients numerically
  def count_gradient(self, x, t):
    loss_W = lambda W: self.loss(x, t)
    grads = {}
    # the gradients of b1, W2 and b2 are computed in the same way and are omitted here
    grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
    return grads

Implementation of mini batch:

# hyperparameters
iters_num = 10000 # number of gradient-descent iterations
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1
network = TwoLayerNet(input_size = 784, hidden_size = 50, output_size = 10)
for i in range(iters_num):
  # get a mini-batch
  batch_mask = np.random.choice(train_size, batch_size)
  x_batch = x_train[batch_mask]
  t_batch = t_train[batch_mask]
  # calculate the gradients
  grad = network.count_gradient(x_batch, t_batch)
  # update the parameters
  for key in ('W1', 'b1', 'W2', 'b2'):
    network.params[key] -= learning_rate * grad[key]

An epoch corresponds to the number of updates after which all of the training data has been used once.
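
For example, with 10,000 training data and a mini-batch size of 100, one epoch corresponds to 100 parameter updates.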

4, Error backpropagation (BP)

The error backpropagation method computes the gradients of the weight parameters much faster than numerical differentiation.

It is based on the chain rule.

  • The backpropagation of an addition node passes the upstream value downstream unchanged.
  • The backpropagation of a multiplication node multiplies the upstream value by the swapped (the other) input signal; see the sketch after this list.
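
A minimal sketch of these two nodes as layers with forward/backward methods (in the spirit of the book's AddLayer/MulLayer):

class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # addition: pass the upstream derivative downstream unchanged
        return dout, dout

class MulLayer:
    def __init__(self):
        self.x, self.y = None, None

    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        # multiplication: multiply the upstream derivative by the swapped inputs
        return dout * self.y, dout * self.x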

1. Implementation of activation function layer

1) ReLU

class Relu:
  def __init__(self):
    self.mask = None
  # Forward propagation
  def forward(self, x):
    self.mask = (x <= 0)
    out = x.copy()
    out[self.mask] = 0
    return out
  # Back propagation
  def backward(self,dout):
    dout[self.mask] = 0
    dx = dout
    return dx

2) sigmoid

class Sigmoid:
  def __init__(self):
    self.out = None
  # Forward propagation
  def forward(self, x):
    out = 1 / (1 + np.exp(-x))
    self.out = out
    return out
  # Back propagation
  def backward(self, dout):
    dx = dout * (1.0 - self.out) * self.out
    return dx

2. Implementation of affine / softmax layer

1) Affine

In forward propagation, the neural network computes the weighted sum of the input data, weights and bias, passes it through the activation function, and outputs the result to the next layer.

The matrix product operation corresponds to an affine transformation in geometry, so the layer that performs this processing is implemented as the Affine layer.

class Affine:
  def __init__(self, W, b):
    self.W = W
    self.b = b
    self.x = None
    self.dW = None
    self.db = None
  # Forward propagation
  def forward(self, x):
    self.x = x
    out = np.dot(x, self.W) + self.b
    return out
  # Back propagation
  def backward(self, dout):
    dx = np.dot(dout, self.W.T)
    self.dW = np.dot(self.x.T, dout)
    self.db = np.sum(dout, axis = 0)
    return dx

2) Softmax

The softmax function normalizes the input values and outputs them. Since the cross-entropy error used as the loss function is also included here, the layer is called the Softmax-with-Loss layer.

class SoftmaxWithLoss:
  def __init__(self):
    self.loss = None
    self.y = None
    self.t = None
  # Forward propagation 
  def forward(self, x, t):
    self.t = t
    self.y = softmax(x)
    self.loss = cross_entropy_error(self.y, self.t)
    return self.loss
  # Back propagation
  def backward(self, dout = 1):
    batch_size = self.t.shape[0]
    dx = (self.y - self.t) / batch_size
    return dx

5, Learning related skills

1. Parameter update

The purpose of neural network learning is to find the parameters that make the value of the loss function as small as possible. This process is called optimization.

Common methods include SGD, Momentum, AdaGrad and Adam.

1) SGD

Stochastic gradient descent.

\[W\gets W-\eta\frac{\partial L}{\partial W} \]

class SGD:
  def __init__(self, lr = 0.01):
    self.lr = lr
  
  def update(self, params, grads):
    for key in params.keys():
      params[key] -= self.lr * grads[key]

2) Momentum

The drawback of SGD is that the direction of the gradient does not always point toward the minimum, so the search path can be inefficient. Momentum means "momentum" in physics.

The mathematical formula is as follows:

\[v \gets \alpha v-\eta\frac{\partial L}{\partial W}\\ W \gets W+v \]

This expresses an object receiving a force in the gradient direction; the term \(\alpha v\) plays the role of friction that gradually slows it down.

class Momentum:
  def __init__(self, lr = 0.01, momentum = 0.9):
    self.lr = lr
    self.momentum = momentum
    self.v = None
  
  def update(self, params, grads):
    if self.v is None:
      self.v = {}
      for key, val in params.items():
        self.v[key] = np.zeros_like(val)
    for key in params.keys():
      self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
      params[key] += self.v[key]

3) AdaGrad

The AdaGrad method retains the sum of squares of all previous gradient values and will adjust the learning rate appropriately for each element of the parameter.

Ada stands for Adaptive

\[h \gets h+\frac{\partial L}{\partial W}\odot \frac{\partial L}{\partial W}\\ W \gets W-\eta\frac{1}{\sqrt{h}}\frac{\partial L}{\partial W} \]

class AdaGrad:
  def __init__(self, lr = 0.01):
    self.lr = lr
    self.h = None
  
  def update(self, params, grads):
    if self.h is None:
      self.h = {}
      for key, val in params.items():
        self.h[key] = np.zeros_like(val)
    for key in params.keys():
      self.h[key] += grads[key] * grads[key]
      params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)

4) Adam

Adam is a more recent parameter-update method. It sets three hyperparameters: the learning rate and two momentum coefficients \(\beta_1\) and \(\beta_2\).
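
A minimal sketch of the Adam update rule with the commonly used default values (roughly following the simplified version that accompanies the book):

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.iter = 0
        self.m, self.v = None, None

    def update(self, params, grads):
        if self.m is None:
            self.m = {key: np.zeros_like(val) for key, val in params.items()}
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        self.iter += 1
        # bias-corrected effective learning rate
        lr_t = self.lr * np.sqrt(1.0 - self.beta2 ** self.iter) / (1.0 - self.beta1 ** self.iter)
        for key in params.keys():
            # moving averages of the gradient and of its element-wise square
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key] ** 2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)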

2. Initial value of weight

The distribution of the activation values in each layer needs an appropriate spread; otherwise the gradients may vanish.

1) Xavier initial value

In common deep learning frameworks, the Xavier initial value has become a standard.

With the Xavier initial value, if the number of nodes in the previous layer is \(n\), the initial values are drawn from a distribution with standard deviation \(\frac{1}{\sqrt{n}}\).

node_num = 100
w = np.random.randn(node_num, node_num) / np.sqrt(node_num)

2) He initial value for ReLU

When the activation function uses ReLU, the initial value of He is generally used.

If the number of nodes in the previous layer is \(n\), the initial values are drawn from a Gaussian distribution with standard deviation \(\sqrt{\frac{2}{n}}\).
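
Analogous to the Xavier code above, the He initial value can be written as:

node_num = 100  # number of nodes in the previous layer
w = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)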

3. Batch Normalization

In order to give the distribution of the activation values in each layer an appropriate spread, the Batch Normalization method adjusts it forcibly.

To do this, a Batch Norm layer is inserted between the Affine layer and the activation function layer. Normalization is performed per mini-batch during learning.

\[\mu_{B}\gets\frac{1}{m}\sum_{i=1}^m x_{i}\\ \sigma_{B}^2\gets\frac{1}{m}\sum_{i=1}^m (x_{i}-\mu_{B})^2\\ \hat{x_{i}}\gets\frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^2+\varepsilon}} \]

That is, compute the mean \(\mu_{B}\) and variance \(\sigma_{B}^2\) of the mini-batch \(B=\{x_{1}, x_{2}, \dots, x_{m}\}\) of \(m\) input data, then normalize each input.
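
A minimal sketch of just this normalization step for one mini-batch x of shape (m, features), written as a hypothetical standalone function; the full Batch Norm layer additionally scales and shifts the result with learnable parameters:

def batch_norm_forward(x, eps=1e-7):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return x_hat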

4. Inhibition of over fitting

Overfitting is a very common problem in machine learning. It refers to the state in which the model fits the training data well but cannot fit data not included in the training set.

Therefore, we need methods to suppress overfitting. Weight decay is one of them.

1) Weight decay

For all weights, the weight decay method adds \(\frac{1}{2}\lambda W^2\) to the loss function, that is, half the squared \(L2\) norm of the weights scaled by \(\lambda\).

Therefore, when computing the gradient of the weights, the derivative of the regularization term, \(\lambda W\), is added to the result of the error backpropagation.
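
A minimal sketch around the TwoLayerNet from earlier, using a hypothetical helper (weight_decay_lambda stands for \(\lambda\)):

def loss_with_weight_decay(network, x, t, weight_decay_lambda=0.1):
    # original loss plus (1/2) * lambda * sum(W**2) for every weight matrix
    loss = network.loss(x, t)
    for key in ('W1', 'W2'):
        W = network.params[key]
        loss += 0.5 * weight_decay_lambda * np.sum(W ** 2)
    return loss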

2) Dropout

When the network model is complex, the Dropout method is used to suppress over fitting.

Dropout is a method of deleting neurons during learning. During training, the neurons in the hidden layer are randomly selected and deleted. The deleted neurons no longer transmit signals.

class Dropout:
  def __init__(self, dropout_ratio = 0.5):
    self.dropout_ratio = dropout_ratio
    self.mask = None
  
  def forward(self, x, train_flg = True):
    if train_flg:
      self.mask = np.random.rand(*x.shape) > self.dropout_ratio
      return x * self.mask
    else:
      return x * (1.0 - self.dropout_ratio)
  
  def backward(self, dout):
    return dout * self.mask

5. Validation of hyperparameters

Hyperparameters include the number of neurons, the batch size, the learning rate, and so on.

We cannot use the test data to evaluate hyperparameters; otherwise the hyperparameters would overfit the test data.

When tuning hyperparameters, dedicated data reserved for this purpose, called validation data, must be used.
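
A minimal sketch of separating validation data from the training set before searching hyperparameters (the variable names and the 20% split ratio are arbitrary choices):

validation_rate = 0.20
validation_num = int(x_train.shape[0] * validation_rate)
# shuffle first so the split is not biased by the original data order
idx = np.random.permutation(x_train.shape[0])
x_train, t_train = x_train[idx], t_train[idx]
x_val, t_val = x_train[:validation_num], t_train[:validation_num]
x_train, t_train = x_train[validation_num:], t_train[validation_num:]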

6, Convolutional neural network

The structure of a CNN can be assembled like building blocks; it adds the Convolution layer and the Pooling layer.

In a CNN, layers are connected in the order: Convolution - ReLU - (Pooling).

Pooling is sometimes omitted.

1. Convolution layer

In a fully connected layer, the shape of the data is ignored. A convolution layer can preserve the shape: when the input data is an image, the convolution layer receives the input as three-dimensional data and outputs three-dimensional data to the next layer.

The input and output data of a convolution layer are called feature maps.

1) Convolution operation

The convolution operation corresponds to applying a filter to the input.

The filter's parameters correspond to the weights \(W\).

The filter extracts primitive information such as edges or blobs.

As shown in the figure, the input data size is \((5,5)\), the filter size is \((3,3)\), and the output size is \((3,3)\).

2) Padding

Before the convolution operation, it is sometimes necessary to pad the area around the input data with fixed values (such as 0).

Padding is mainly used to adjust the size of the output.

3) Stride

The position interval to which the filter is applied is called the stride.

Increasing the stride makes the output smaller; increasing the padding makes the output larger.

Suppose the input size is \((H,W)\), the filter size is \((FH,FW)\), the output size is \((OH,OW)\), the padding is \(P\), and the stride is \(S\).

Then the output size is:

\[OH=\frac{H+2P-FH}{S}+1\\ OW=\frac{W+2P-FW}{S}+1 \]
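
For example, with input \((7,7)\), filter \((3,3)\), padding \(1\) and stride \(2\): \(OH=\frac{7+2\cdot 1-3}{2}+1=4\), so the output size is \((4,4)\).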

4) Convolution operation of three-dimensional data

Taking a 3-channel RGB image as an example, feature maps are added in the depth (channel) direction. When there are multiple feature maps in the channel direction, the convolution of the input data and the filter is performed per channel and the results are summed.

The number of channels of input data and the number of channels of filter should be the same.

When there are multiple filters, the output feature map also has multiple channels.

2. Pooling layer

Pooling is an operation that reduces the spatial size of the data in the height and width directions. In short, pooling condenses the data.

Max pooling takes the maximum value within the target region. Generally, the pooling window size and the stride are set to the same value.

3. Implementation of convolution layer and pooling layer

A key function is im2col. It expands the three-dimensional input data into a two-dimensional matrix so that the convolution can be computed as a single matrix product with the filter.

1) Convolution layer

class Convolution:
  def __init__(self, W, b, stride = 1, pad = 0):
    self.W = W
    self.b = b
    self.stride = stride
    self.pad = pad

  def forward(self, x):
    FN, C, FH, FW = self.W.shape # number of filters, channels, filter height, filter width
    N, C, H, W = x.shape # number of inputs, channels, input height, input width
    # compute the height and width of the output data
    out_h = int(1 + (H + 2 * self.pad - FH) / self.stride)
    out_w = int(1 + (W + 2 * self.pad - FW) / self.stride)
    # expand the 3D data of each sample into rows of a matrix using im2col
    col = im2col(x, FH, FW, self.stride, self.pad)
    col_W = self.W.reshape(FN, -1).T
    out = np.dot(col, col_W) + self.b # multiply by the weights and add the bias

    out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2) # reshape back to (N, FN, out_h, out_w)
    return out

2) Pooling layer

class Pooling:
  def __init__(self, pool_h, pool_w, stride = 1, pad = 0):
    self.pool_h, self.pool_w, self.stride, self.pad = pool_h, pool_w, stride, pad

  def forward(self, x):
    N, C, H, W = x.shape
    # compute the height and width of the output data
    out_h = int(1 + (H - self.pool_h) / self.stride)
    out_w = int(1 + (W - self.pool_w) / self.stride)
    # expand the input so that each pooling window becomes one row
    col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
    col = col.reshape(-1, self.pool_h * self.pool_w)
    # take the maximum of each window
    out = np.max(col, axis = 1)
    # reshape back to (N, C, out_h, out_w)
    out = out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)
    return out

4. Implementation of an MNIST digit recognition neural network

A CNN for handwritten digit recognition:

class SimpleConvNet:
    def __init__(self, input_dim = (1, 28, 28), conv_param = {'filter_num':30, 'filter_size':5, 'pad':0, 'stride':1}, hidden_size = 100, output_size = 10, weight_init_std = 0.01):
        """
        :param input_dim: channels, height and width of the input data
        :param conv_param: parameters of the convolution layer: number of filters, filter size, padding and stride
        :param hidden_size: Number of hidden layer neurons
        :param output_size: Number of neurons in output layer
        :param weight_init_std: Initialization weight standard deviation
        """
        filter_num = conv_param['filter_num']
        filter_size = conv_param['filter_size']
        filter_pad = conv_param['pad']
        filter_stride = conv_param['stride']
        input_size = input_dim[1]
        conv_output_size = (input_size - filter_size + 2 * filter_pad) / filter_stride + 1
        pool_output_size = int(filter_num * (conv_output_size / 2) * (conv_output_size / 2))

        self.params = {'W1': weight_init_std * np.random.randn(filter_num, input_dim[0], filter_size, filter_size),
                       'b1': np.zeros(filter_num),
                       'W2': weight_init_std * np.random.randn(pool_output_size, hidden_size),
                       'b2': np.zeros(hidden_size),
                       'W3': weight_init_std * np.random.randn(hidden_size, output_size),
                       'b3': np.zeros(output_size)}
        self.layers = OrderedDict()
        self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'], filter_stride, filter_pad)
        self.layers['ReLU1'] = Relu()
        self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
        self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
        self.layers['ReLU2'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])
        self.last_layer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.last_layer.forward(y, t)

    # Back propagation gradient
    def gradient(self, x, t):
        # forward
        self.loss(x, t)
        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)
        grads = {'W1': self.layers['Conv1'].dW, 'b1': self.layers['Conv1'].db, 'W2': self.layers['Affine1'].dW,
                 'b2': self.layers['Affine1'].db, 'W3': self.layers['Affine2'].dW, 'b3': self.layers['Affine2'].db}
        return grads
