Self-learning neural network series -- Chapter 6: optimization improvements

6.1 parameter update

  • SGD
  • Momentum
  • AdaGrad
  • Adam

6.1.1 SGD

  • Simple, but can be inefficient, e.g. on f(x, y) = x²/20 + y² (= 0.05x² + y²)
  • The gradient direction may not point toward the minimum
  • Local minima vs. the global minimum
import matplotlib.pyplot as plt
import numpy as np

def f(x, y):
    return np.power(x, 2)/20 + np.power(y, 2)

# 3D surface of f(x, y) = x^2/20 + y^2, with its contours projected onto the bottom plane
fig1 = plt.figure()
ax = fig1.add_subplot(projection='3d')   # Axes3D(fig1) is deprecated in recent matplotlib
x = np.arange(-10, 10, 0.1)
y = np.arange(-10, 10, 0.1)
x, y = np.meshgrid(x, y)
z = f(x, y)
ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap=plt.cm.coolwarm)
ax.contourf(x, y, z, zdir='z', offset=-2, cmap=plt.cm.coolwarm)
ax.set_xlabel('x', color='r')
ax.set_ylabel('y', color='g')
ax.set_zlabel('z', color='b')
plt.show()
# The contours are elongated ellipses: along the x-axis they are sparse, because f barely changes with x,
# while the value changes steeply in the y direction
# Gradient diagram: arrows show the negative gradient -∇f = (-x/10, -2y)
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 10, 1)
y = np.arange(-10, 10, 2)
u, v = np.meshgrid(-x/10, -2*y)   # negative gradient components at each grid point
fig, ax = plt.subplots()
q = ax.quiver(x, y, u, v)
ax.quiverkey(q, X=0.3, Y=1.1, U=10, label='Quiver Key, length=10', labelpos='E')
plt.show()
# Most arrows point toward the x-axis (y = 0) rather than toward the minimum at the origin
class SGD:
    """Stochastic gradient descent: W ← W - lr * ∂L/∂W"""

    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= grads[key] * self.lr

# Test: 40 SGD steps from (-7.2, 2.0) with a deliberately large lr = 0.9
# Analytic gradients of f(x, y) = x²/20 + y²:  ∂f/∂x = x/10,  ∂f/∂y = 2y
params = {}
params['x'] = [-7.2]
params['y'] = [2.0]
x = -7.2
y = 2.0
N = 40
lr = 0.9
for i in range(N):
    x -= x/10 * lr   # x ← x - lr * ∂f/∂x
    y -= 2*y * lr    # y ← y - lr * ∂f/∂y
    params['x'].append(x)
    params['y'].append(y)

params   # the recorded trajectory
# Two-dimensional view: contour plot with the optimization path overlaid
x = np.arange(-10, 10, 0.1)
y = np.arange(-10, 10, 0.1)
x, y = np.meshgrid(x, y)
plt.contourf(x, y, f(x, y), cmap=plt.cm.coolwarm)
plt.plot(params['x'], params['y'], color='black', marker='o', linestyle='solid')
plt.show()
# The path zigzags up and down in y while creeping along x: SGD is inefficient on this anisotropic surface
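For reference, a minimal sketch (reusing the SGD class and f defined above; variable names are illustrative) that reproduces the same trajectory through the optimizer interface instead of the hand-written loop:

# Same experiment, driven through the SGD class
optimizer = SGD(lr=0.9)
point = {'x': -7.2, 'y': 2.0}
history = {'x': [point['x']], 'y': [point['y']]}
for i in range(40):
    grads = {'x': point['x'] / 10, 'y': 2 * point['y']}   # analytic gradients of f
    optimizer.update(point, grads)
    history['x'].append(point['x'])
    history['y'].append(point['y'])
# history now matches the params trajectory recorded above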

6.1.2 Momentum

  • Momentum idea: like a ball rolling across a surface, the update carries a velocity
  • Update rule: v ← αv - lr·grad, then W ← W + v
  • The velocity has a direction and remembers previous updates
  • This damps the excessive oscillation seen with plain SGD
class Momentum:

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key]
            params[key] += self.v[key]  # the parameters move with the velocity
            # The current update is influenced by the previous one,
            # so updates in opposite directions partially cancel and the path is smoothed
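A minimal sketch of the same test with Momentum (reusing f, the meshgrid x, y and plt from above; lr = 0.1 is chosen here only for illustration); the resulting path is noticeably smoother than the SGD zigzag:

optimizer = Momentum(lr=0.1, momentum=0.9)
point = {'x': -7.2, 'y': 2.0}
history = {'x': [point['x']], 'y': [point['y']]}
for i in range(40):
    grads = {'x': point['x'] / 10, 'y': 2 * point['y']}
    optimizer.update(point, grads)
    history['x'].append(point['x'])
    history['y'].append(point['y'])
plt.contourf(x, y, f(x, y), cmap=plt.cm.coolwarm)
plt.plot(history['x'], history['y'], color='black', marker='o', linestyle='solid')
plt.show()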

6.1.3 AdaGrad

  • Addresses the problem that a fixed learning rate that is too large never converges
  • A per-parameter learning-rate decay method
  • h ← h + grad², i.e. the sum of squared gradients over all previous steps
  • Effective learning rate: η = lr / sqrt(h)
  • After many updates h keeps growing and the effective rate tends to 0; RMSProp fixes this by exponentially forgetting gradients from the distant past
class AdaGrad:

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # 1e-7 avoids division by zero on the first update
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)

6.1.4 Adam

  • Approaches to optimizing the loss function:
  • Standard (full-batch) gradient descent: steps in the negative gradient direction computed over the whole dataset; can stall at local minima and saddle points, and each step is expensive and slow
  • Stochastic gradient descent: uses one randomly chosen sample per update as a stand-in for the full gradient; very noisy
  • Mini-batch gradient descent: uses a small batch (e.g. 100 samples) per step to reduce the noise
  • Momentum: adds a velocity variable so that updates in opposite directions cancel; solves the zigzag problem and fluctuates less
  • Nesterov: accelerated gradient; decelerates in advance by predicting the next parameters W_n = W_{n-1} - α·v_{n-1} and computing the gradient of the loss at that predicted point
  • AdaGrad: decays the learning rate per parameter; eventually becomes inefficient, and fluctuates more than Momentum
  • AdaDelta: fixes the problem that h in AdaGrad grows so large that updates stop, by using a moving average h ← β·h_{t-1} + (1-β)·grad²; the further in the past a gradient lies, the more it is forgotten
  • Adam: keeps moving averages of both the first moment (the gradient/velocity) and the second moment (the squared gradient), so the update direction is a moving average of all historical gradients (a sketch follows this list)
  • Adamax: handles the first moment like Adam, but instead of the second moment it uses the maximum of the historical gradient magnitudes to set the learning rate
  • Nadam: Nesterov + Adam; the velocity is updated with the look-ahead (predicted) gradient, while the second-moment moving average is unchanged
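Since this subsection has no implementation, here is a minimal Adam sketch in the same update(params, grads) style as the classes above; the beta1/beta2 names and default values are the conventional ones, not taken from the text:

class Adam:
    # Minimal Adam sketch: exponential moving averages of the gradient (m)
    # and the squared gradient (v), with a bias-corrected step size.

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None   # first moment (moving average of gradients)
        self.v = None   # second moment (moving average of squared gradients)

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        # bias-corrected effective step size
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for key in params.keys():
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key]**2
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)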

6.2 initial values of the weights

  • Addresses learning that fails to make progress
  • The variance of the distribution used to generate the initial weights affects the distribution of the activations
  • A bad setting pushes the activations toward fixed values; as the number of layers grows, the gradients vanish and the weights can no longer be updated
  • ReLU activation: He initial value, standard deviation sqrt(2/n), where n is the number of neurons in the previous layer
  • sigmoid and tanh (hyperbolic tangent): Xavier initial value, standard deviation 1/sqrt(n), where n is the number of neurons in the previous layer (a small sketch of both scalings follows this list)
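A minimal sketch of how the two scalings are applied when creating a layer's weight matrix (the layer sizes here are illustrative, not from the text):

import numpy as np

n_in, n_out = 100, 50        # illustrative layer sizes

# He initial value, for ReLU layers: std = sqrt(2/n)
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

# Xavier initial value, for sigmoid/tanh layers: std = 1/sqrt(n)
W_xavier = np.random.randn(n_in, n_out) * (1.0 / np.sqrt(n_in))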

6.3 standardization of activation values

  • Batch Norm also addresses learning that fails to make progress
  • Simply insert a normalization layer before or after each hidden layer's activation (a sketch of the forward pass follows this list)
  • It makes training insensitive to the initial weight values
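A minimal sketch of the Batch Norm forward pass at training time (gamma, beta and eps are the usual scale, shift and stabilizer parameters; the backward pass and the running statistics used at test time are omitted; assumes numpy imported as np above):

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    # x: (batch_size, features); standardize each feature over the mini-batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta   # learnable scale and shift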

6.4 regularization

  • Addresses overfitting
  • Weight decay with the L1, L2 or L∞ norm penalizes weights that grow too large
  • Dropout: randomly deletes neurons at a fixed ratio during training; similar to ensemble learning, since a different sub-model is sampled on each pass (a sketch follows this list)
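A minimal Dropout sketch in the style of the classes above (dropout_ratio and the train_flg switch are conventional names; at test time the output is scaled by the keep probability instead of being masked):

class Dropout:

    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # randomly silence a fraction of the neurons
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            # test time: scale by the keep probability
            return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        return dout * self.mask   # gradients flow only through the kept neurons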

6.5 validation of hyperparameters

  • Split the data into training, validation and test sets
  • Set a rough range for each hyperparameter
  • Train with hyperparameters sampled from that range and evaluate the accuracy on the validation set, keeping the number of epochs small
  • Repeat the steps above, narrowing the range according to the evaluation results
  • Select the best hyperparameters (a sketch of this random search follows this list)
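A minimal sketch of this random-search loop; train_and_evaluate is a hypothetical helper standing in for a short training run that returns validation accuracy, and the sampling ranges are purely illustrative:

import numpy as np

def random_search(train_and_evaluate, n_trials=100):
    # Sample lr and weight decay on a log scale, evaluate briefly, keep the best trials.
    results = []
    for _ in range(n_trials):
        lr = 10 ** np.random.uniform(-6, -2)            # illustrative range
        weight_decay = 10 ** np.random.uniform(-8, -4)  # illustrative range
        val_acc = train_and_evaluate(lr, weight_decay)  # few epochs, validation accuracy
        results.append((val_acc, lr, weight_decay))
    results.sort(reverse=True)
    return results[:5]   # inspect the top trials, then narrow the ranges and repeat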
