6.1 parameter update
6.1.1 SGD
- Simple but possibly inefficient, e.g. f(x, y) = x²/20 + y² (= 0.05x² + y²)
- Gradient direction: may not point toward the lowest point
- Local minima vs. the global minimum
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
def f(x, y):
    return np.power(x, 2)/20 + np.power(y, 2)
fig1 = plt.figure()
ax = fig1.add_subplot(111, projection='3d')  # Axes3D(fig1) no longer attaches the axes to the figure in recent matplotlib
x = np.arange(-10,10,0.1)
y = np.arange(-10,10,0.1)
x,y = np.meshgrid(x,y)
z = f(x,y)
ax.plot_surface(x,y,z,rstride=1,cstride=1,cmap=plt.cm.coolwarm)
ax.contourf(x,y,z,zdir='z', offset=-2,cmap=plt.cm.coolwarm)
ax.set_xlabel('x',color='r')
ax.set_ylabel('y',color='g')
ax.set_zlabel('z',color='b')
plt.show()
# Note: the contours are sparse along the x-axis, where the function is almost flat,
# and the function changes mainly in the y-axis direction
# Gradient diagram
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-10,10,1)
y= np.arange(-10,10,2)
u, v = np.meshgrid(-x/10, -2*y)  # negative gradient direction of f
fig,ax = plt.subplots()
q = ax.quiver(x,y,u,v)
ax.quiverkey(q,X=0.3,Y=1.1,U=10,label='Quiver Key,length=10',labelpos='E')
plt.show()
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr*grads[key]
# test: run the SGD update by hand on f(x, y) = x**2/20 + y**2 with a large learning rate
params = {}
params['x'] = [-7.2]
params['y'] = [2.0]
x = -7.2
y = 2.0
N = 40
lr = 0.9
for i in range(N):
    x -= x/10 * lr   # df/dx = x/10
    y -= 2*y * lr    # df/dy = 2y
    params['x'].append(x)
    params['y'].append(y)
params
# Two dimensional image
x = np.arange(-10,10,0.1)
y = np.arange(-10,10,0.1)
x,y = np.meshgrid(x,y)
plt.contourf(x,y,f(x,y),cmap=plt.cm.coolwarm)
plt.plot(params['x'],params['y'],color='black',marker='o',linestyle='solid')
plt.show()
# The path zigzags: it oscillates in the y direction while creeping along x, so convergence is inefficient
6.1.2 Momentum
- Momentum concept: analogous to a ball rolling over a surface
- Update rule: v ← αv − lr·grad, then W ← W + v
- The update carries a velocity with a direction
- Mitigates the excessive oscillation of plain SGD
class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key]
            params[key] += self.v[key]  # parameters are updated by the velocity, not the raw gradient
# Each update velocity is influenced by the previous one;
# gradients pointing in opposite directions on successive steps partially cancel out in v[key]
6.1.3 AdaGrad
- Addresses the problem that a learning rate set too large prevents convergence
- A learning-rate decay method, applied per parameter
- h ← h + grad², i.e. the sum of squared gradients over all previous steps
- Effective learning rate: η = lr / sqrt(h)
- After many updates h keeps growing and the effective rate tends to 0; the fix is to forget gradients from the distant past, as in RMSProp (see the sketch after the AdaGrad class below)
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] += grads[key]*grads[key]
            params[key] -= self.lr*grads[key]/(np.sqrt(self.h[key]) + 1e-7)
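# As noted above, AdaGrad's h only accumulates, so updates eventually stall.
# A minimal RMSProp sketch (the decay_rate default 0.99 is an assumption, not from these notes):
import numpy as np

class RMSProp:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate  # how quickly old squared gradients are forgotten
        self.h = None
    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            # exponential moving average of squared gradients instead of a running sum
            self.h[key] = self.decay_rate*self.h[key] + (1 - self.decay_rate)*grads[key]*grads[key]
            params[key] -= self.lr*grads[key]/(np.sqrt(self.h[key]) + 1e-7)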
6.1.4 Adam
- Overview of loss-function optimizers:
- Standard (full-batch) gradient descent: steps in the negative gradient direction computed over all samples; can get stuck at local minima and saddle points, and each step is computationally heavy and slow
- Stochastic gradient descent: randomly picks one sample per update as a stand-in for the full gradient; updates fluctuate heavily
- Mini-batch gradient descent: uses e.g. 100 samples per update to dampen the fluctuation
- Momentum: fixes the zigzag updates; opposing updates cancel each other out through the added velocity variable, so the fluctuation is smaller
- Nesterov: accelerated gradient method that "brakes" in advance; it predicts the next parameters W_n = W_{n-1} − α·v_{n-1} and uses the gradient of the loss at that predicted point for the update
- AdaGrad: decays the learning rate per parameter; can become slow, and fluctuates more than Momentum
- AdaDelta: fixes the problem that h in AdaGrad grows so large that updates stop, by using a moving average h ← βh_prev + (1−β)·grad², so the farther back a gradient lies, the more it is forgotten
- Adam: keeps moving averages of both the velocity-like first moment and the second moment (squared gradients) of the gradient, giving v1, v2, v3, ...; the update velocity thus becomes a moving average of all historical gradients (a sketch follows this list)
- Adamax: handles the first moment like Adam, but instead of the second moment it uses the maximum of the historical gradient magnitudes to set the learning rate
- Nadam: Nesterov + Adam; the velocity variable is updated with the predicted gradient, while the moving averages are otherwise unchanged
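# A minimal Adam sketch following the same update(params, grads) interface as the classes above;
# the beta1/beta2 defaults are the conventional moment decay rates (assumed, not from these notes):
import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1  # decay rate for the first moment (velocity-like term)
        self.beta2 = beta2  # decay rate for the second moment (squared gradients)
        self.iter = 0
        self.m = None
        self.v = None
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)
        self.iter += 1
        # bias-corrected step size (equivalent to correcting m and v separately)
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
        for key in params.keys():
            # moving averages of the gradient and of the squared gradient
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)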
6.2 weight initialization
- Addresses ineffective (stalled) learning:
- Initial weight values: the variance of the distribution used to generate the parameters affects the distribution of the activations
- A poor choice drives the activations toward fixed values; as the number of layers grows, the gradients vanish and the parameters can no longer be updated
- ReLU activation: He initialization, standard deviation sqrt(2/n), where n is the number of neurons in the previous layer
- sigmoid and tanh (hyperbolic tangent): Xavier initialization, standard deviation 1/sqrt(n), where n is the number of neurons in the previous layer (a sketch of both rules follows)
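# A minimal sketch of the two initialization rules; the layer sizes below are made up for illustration:
import numpy as np

n_prev, n_next = 100, 50  # assumed sizes of the previous and next layer

# He initialization for ReLU: standard deviation sqrt(2/n)
W_he = np.random.randn(n_prev, n_next) * np.sqrt(2.0 / n_prev)

# Xavier initialization for sigmoid/tanh: standard deviation 1/sqrt(n)
W_xavier = np.random.randn(n_prev, n_next) * np.sqrt(1.0 / n_prev)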
6.3 standardization of activation values
- Batch Norm also addresses ineffective learning
- Simply insert a normalization (standardization) layer before or after each hidden layer's activation
- Makes training robust to the choice of initial weight values (a forward-pass sketch follows)
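# A minimal forward-pass sketch of Batch Norm over a mini-batch; gamma and beta are the learnable
# scale and shift, and the backward pass and running statistics for inference are omitted:
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    # x: activations of shape (batch_size, features)
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize to mean 0, variance 1
    return gamma * x_hat + beta            # learnable scale and shift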
6.4 regularization
- Addresses overfitting
- L1, L2 (and L∞) norm penalties, i.e. weight decay, keep the parameters from growing too large
- Dropout: randomly deletes neurons at a fixed ratio during training; similar in spirit to ensemble learning, since a different sub-model is sampled each time (a sketch follows)
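# A minimal Dropout sketch; dropout_ratio is the fraction of neurons dropped during training,
# and at test time activations are scaled by the keep probability instead of being dropped:
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None
    def forward(self, x, train_flg=True):
        if train_flg:
            # keep each neuron with probability (1 - dropout_ratio)
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            return x * (1.0 - self.dropout_ratio)
    def backward(self, dout):
        # gradients flow only through the neurons that were kept
        return dout * self.mask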
6.5 hyperparameter validation
- Split the data into training, validation, and test sets
- Set a rough range for each hyperparameter
- Train with hyperparameters sampled from that range and evaluate the accuracy on the validation set, keeping the number of epochs small
- Repeat the steps above, narrowing the ranges according to the evaluation results
- Select the best hyperparameters (a random-search sketch follows this list)
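# A minimal sketch of the random-search loop described above; train_and_eval is a hypothetical
# function that trains for a few epochs and returns validation accuracy for the given hyperparameters:
import numpy as np

def random_search(train_and_eval, n_trials=100):
    results = []
    for _ in range(n_trials):
        # sample hyperparameters on a log scale within a rough range
        lr = 10 ** np.random.uniform(-6, -2)
        weight_decay = 10 ** np.random.uniform(-8, -4)
        val_acc = train_and_eval(lr, weight_decay)  # evaluated on the validation set
        results.append((val_acc, lr, weight_decay))
    # inspect the best trials, then narrow the ranges and repeat
    return sorted(results, reverse=True)[:10]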