6.1 parameter update
6.1.1 SGD
- Simple but possibly inefficient, e.g. f(x, y) = x²/20 + y² (= 0.05x² + y²)
- Gradient direction: may not point toward the lowest point
- Local minima vs. the global minimum
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
def f(x, y):
    return np.power(x, 2)/20 + np.power(y, 2)
fig1 = plt.figure()
ax = fig1.add_subplot(111, projection='3d')  # Axes3D(fig1) no longer attaches the axes to the figure in recent matplotlib
x = np.arange(-10,10,0.1)
y = np.arange(-10,10,0.1)
x,y = np.meshgrid(x,y)
z = f(x,y)
ax.plot_surface(x,y,z,rstride=1,cstride=1,cmap=plt.cm.coolwarm)
ax.contourf(x,y,z,zdir='z', offset=-2,cmap=plt.cm.coolwarm)
ax.set_xlabel('x',color='r')
ax.set_ylabel('y',color='g')
ax.set_zlabel('z',color='b')
plt.show()
# Note: the contours are sparse along the x-axis, where the function is almost flat,
# and the function changes mainly in the y-axis direction
# Gradient diagram
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-10,10,1)
y= np.arange(-10,10,2)
u, v = np.meshgrid(-x/10, -2*y)  # negative gradient direction of f
fig,ax = plt.subplots()
q = ax.quiver(x,y,u,v)
ax.quiverkey(q,X=0.3,Y=1.1,U=10,label='Quiver Key,length=10',labelpos='E')
plt.show()
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr*grads[key]
# test: run the SGD update by hand on f(x, y) = x**2/20 + y**2 with a large learning rate
params = {}
params['x'] = [-7.2]
params['y'] = [2.0]
x = -7.2
y = 2.0
N = 40
lr = 0.9
for i in range(N):
    x -= x/10 * lr   # df/dx = x/10
    y -= 2*y * lr    # df/dy = 2y
    params['x'].append(x)
    params['y'].append(y)
params
# Two dimensional image
x = np.arange(-10,10,0.1)
y = np.arange(-10,10,0.1)
x,y = np.meshgrid(x,y)
plt.contourf(x,y,f(x,y),cmap=plt.cm.coolwarm)
plt.plot(params['x'],params['y'],color='black',marker='o',linestyle='solid')
plt.show()
# The path zigzags: it oscillates in the y direction while creeping along x, so convergence is inefficient
6.1.2 Momentum
- Momentum concept: analogous to a ball rolling over a surface
- Update rule: v ← αv − lr·grad, then W ← W + v
- The update carries a velocity with a direction
- Mitigates the excessive oscillation of plain SGD
class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key]
            params[key] += self.v[key]  # parameters are updated by the velocity, not the raw gradient
# Each update velocity is influenced by the previous one;
# gradients pointing in opposite directions on successive steps partially cancel out in v[key]
6.1.3 AdaGrad
- Addresses the problem that a learning rate set too large prevents convergence
- A learning-rate decay method, applied per parameter
- h ← h + grad², i.e. the sum of squared gradients over all previous steps
- Effective learning rate: η = lr / sqrt(h)
- After many updates h keeps growing and the effective rate tends to 0; the fix is to forget gradients from the distant past, as in RMSProp (see the sketch after the AdaGrad class below)
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] += grads[key]*grads[key]
            params[key] -= self.lr*grads[key]/(np.sqrt(self.h[key]) + 1e-7)
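# As noted above, AdaGrad's h only accumulates, so updates eventually stall.
# A minimal RMSProp sketch (the decay_rate default 0.99 is an assumption, not from these notes):
import numpy as np

class RMSProp:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate  # how quickly old squared gradients are forgotten
        self.h = None
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            # exponential moving average of squared gradients instead of a running sum
            self.h[key] = self.decay_rate*self.h[key] + (1 - self.decay_rate)*grads[key]*grads[key]
            params[key] -= self.lr*grads[key]/(np.sqrt(self.h[key]) + 1e-7)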
6.1.4 Adam
- Overview of loss-function optimizers:
- Standard (full-batch) gradient descent: steps in the negative gradient direction computed over all samples; can get stuck at local minima and saddle points, and each step is computationally heavy and slow
- Stochastic gradient descent: randomly picks one sample per update as a stand-in for the full gradient; updates fluctuate heavily
- Mini-batch gradient descent: uses e.g. 100 samples per update to dampen the fluctuation
- Momentum: fixes the zigzag updates; opposing updates cancel each other out through the added velocity variable, so the fluctuation is smaller
- Nesterov: accelerated gradient method that "brakes" in advance; it predicts the next parameters W_n = W_{n-1} − α·v_{n-1} and uses the gradient of the loss at that predicted point for the update
- AdaGrad: decays the learning rate per parameter; can become slow, and fluctuates more than Momentum
- AdaDelta: fixes the problem that h in AdaGrad grows so large that updates stop, by using a moving average h ← βh_prev + (1−β)·grad², so the farther back a gradient lies, the more it is forgotten
- Adam: keeps moving averages of both the velocity-like first moment and the second moment (squared gradients) of the gradient, giving v1, v2, v3, ...; the update velocity thus becomes a moving average of all historical gradients (a sketch follows this list)
- Adamax: handles the first moment like Adam, but instead of the second moment it uses the maximum of the historical gradient magnitudes to set the learning rate
- Nadam: Nesterov + Adam; the velocity variable is updated with the predicted gradient, while the moving averages are otherwise unchanged
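# A minimal Adam sketch following the same update(params, grads) interface as the classes above;
# the beta1/beta2 defaults are the conventional moment decay rates (assumed, not from these notes):
import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1  # decay rate for the first moment (velocity-like term)
        self.beta2 = beta2  # decay rate for the second moment (squared gradients)
        self.iter = 0
        self.m = None
        self.v = None
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)
        self.iter += 1
        # bias-corrected step size (equivalent to correcting m and v separately)
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
        for key in params.keys():
            # moving averages of the gradient and of the squared gradient
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)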
6.2 weight initialization
- Addresses ineffective (stalled) learning:
- Initial weight values: the variance of the distribution used to generate the parameters affects the distribution of the activations
- A poor choice drives the activations toward fixed values; as the number of layers grows, the gradients vanish and the parameters can no longer be updated
- ReLU activation: He initialization, standard deviation sqrt(2/n), where n is the number of neurons in the previous layer
- sigmoid and tanh (hyperbolic tangent): Xavier initialization, standard deviation 1/sqrt(n), where n is the number of neurons in the previous layer (a sketch of both rules follows)
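# A minimal sketch of the two initialization rules; the layer sizes below are made up for illustration:
import numpy as np

n_prev, n_next = 100, 50  # assumed sizes of the previous and next layer

# He initialization for ReLU: standard deviation sqrt(2/n)
W_he = np.random.randn(n_prev, n_next) * np.sqrt(2.0 / n_prev)

# Xavier initialization for sigmoid/tanh: standard deviation 1/sqrt(n)
W_xavier = np.random.randn(n_prev, n_next) * np.sqrt(1.0 / n_prev)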
6.3 standardization of activation values
- Batch Norm also addresses ineffective learning
- Simply insert a normalization (standardization) layer before or after each hidden layer's activation
- Makes training robust to the choice of initial weight values (a forward-pass sketch follows)
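# A minimal forward-pass sketch of Batch Norm over a mini-batch; gamma and beta are the learnable
# scale and shift, and the backward pass and running statistics for inference are omitted:
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    # x: activations of shape (batch_size, features)
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize to mean 0, variance 1
    return gamma * x_hat + beta            # learnable scale and shift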
6.4 regularization
- Addresses overfitting
- L1, L2 (and L∞) norm penalties, i.e. weight decay, keep the parameters from growing too large
- Dropout: randomly deletes neurons at a fixed ratio during training; similar in spirit to ensemble learning, since a different sub-model is sampled each time (a sketch follows)
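# A minimal Dropout sketch; dropout_ratio is the fraction of neurons dropped during training,
# and at test time activations are scaled by the keep probability instead of being dropped:
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None
    def forward(self, x, train_flg=True):
        if train_flg:
            # keep each neuron with probability (1 - dropout_ratio)
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            return x * (1.0 - self.dropout_ratio)
    def backward(self, dout):
        # gradients flow only through the neurons that were kept
        return dout * self.mask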
6.5 hyperparameter validation
- Split the data into training, validation, and test sets
- Set a rough range for each hyperparameter
- Train with hyperparameters sampled from that range and evaluate the accuracy on the validation set, keeping the number of epochs small
- Repeat the steps above, narrowing the ranges according to the evaluation results
- Select the best hyperparameters (a random-search sketch follows this list)
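# A minimal sketch of the random-search loop described above; train_and_eval is a hypothetical
# function that trains for a few epochs and returns validation accuracy for the given hyperparameters:
import numpy as np

def random_search(train_and_eval, n_trials=100):
    results = []
    for _ in range(n_trials):
        # sample hyperparameters on a log scale within a rough range
        lr = 10 ** np.random.uniform(-6, -2)
        weight_decay = 10 ** np.random.uniform(-8, -4)
        val_acc = train_and_eval(lr, weight_decay)  # evaluated on the validation set
        results.append((val_acc, lr, weight_decay))
    # inspect the best trials, then narrow the ranges and repeat
    return sorted(results, reverse=True)[:10]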