[Supplementary note] The book does not distinguish between the "gradient descent method" and the "steepest descent method", but the two are in fact different. The gradient descent method of Algorithm A.1 is actually the steepest descent method.
Gradient descent method and steepest descent method
When determining the step size, there are two ideas. The first is to set the step size according to the instantaneous rate of change: far from the extreme point the step size is large, so the algorithm converges quickly; close to the extreme point the step size is small, to avoid oscillating around the minimum. The second is to determine the step size by a one-dimensional search: once the iteration direction is fixed, move directly to the extreme point (stationary point) along that direction, which avoids repeated oscillation in the same direction.
The first idea, computing the step size from the instantaneous rate of change, is the gradient descent method; the second idea, computing the step size by a one-dimensional search, is the steepest descent method.
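In symbols (a short sketch added here to make the contrast explicit; the step-size notation matches `eta` and `lk` in the code below), the two update rules are

$$x_{k+1} = x_k - \eta\, \nabla f(x_k) \qquad \text{(gradient descent: fixed learning rate, step length proportional to the gradient modulus)}$$

$$\lambda_k = \arg\min_{\lambda \ge 0} f\!\big(x_k - \lambda\, \nabla f(x_k)\big), \qquad x_{k+1} = x_k - \lambda_k\, \nabla f(x_k) \qquad \text{(steepest descent: exact one-dimensional search along the gradient direction)}$$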
Gradient
Take the function of two variables $z = f(x,y)$ as an example. Suppose the function is differentiable at the point $P_0(x_0, y_0)$ and $e_l = (\cos\alpha, \cos\beta)$ is the unit vector in the direction of $l$. Then the directional derivative of the function at the point $P_0$ along the direction $l$ is

$$\left.\frac{\partial f}{\partial l}\right|_{(x_0,y_0)} = f_x(x_0,y_0)\cos\alpha + f_y(x_0,y_0)\cos\beta = \big(f_x(x_0,y_0),\, f_y(x_0,y_0)\big) \cdot e_l = \big|\big(f_x(x_0,y_0),\, f_y(x_0,y_0)\big)\big| \cos\theta$$
where $\theta$ is the angle between $\big(f_x(x_0,y_0),\, f_y(x_0,y_0)\big)$ and $e_l$. The equation above shows the relationship between the directional derivative of the function at this point and the vector $\big(f_x(x_0,y_0),\, f_y(x_0,y_0)\big)$. We call the vector $\big(f_x(x_0,y_0),\, f_y(x_0,y_0)\big)$ the gradient of the function $f(x,y)$ at the point $P_0(x_0,y_0)$, written $\mathrm{grad}\, f(x_0,y_0)$ or $\nabla f(x_0,y_0)$. In particular:
- When the direction $e_l$ is the same as the direction of the gradient $\mathrm{grad}\, f(x_0,y_0)$, the function $f(x,y)$ increases fastest; that is, the directional derivative of the function in this direction attains its maximum value, which equals the modulus of the gradient $\big|\mathrm{grad}\, f(x_0,y_0)\big|$;
- When the direction $e_l$ is opposite to the direction of the gradient $\mathrm{grad}\, f(x_0,y_0)$, the function $f(x,y)$ decreases fastest; that is, the directional derivative of the function in this direction attains its minimum value, which equals the negative of the modulus of the gradient $-\big|\mathrm{grad}\, f(x_0,y_0)\big|$;
- When the direction $e_l$ is orthogonal to the direction of the gradient $\mathrm{grad}\, f(x_0,y_0)$, the rate of change of the function $f(x,y)$ is zero.
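As a quick numerical check of these three cases (an example added here, not part of the original text), we can estimate the directional derivative with a central difference, using the same test function $\frac{(x+3)^2 + (y+4)^2}{2}$ that appears in the code below. Its gradient at $(0,0)$ is $(3,4)$, so the directional derivative should be $5$ along the gradient, $-5$ against it, and $0$ in an orthogonal direction:

```python
import math

def f(x, y):
    return ((x + 3) ** 2 + (y + 4) ** 2) / 2

def directional_derivative(f, x, y, e, h=1e-6):
    """Central-difference estimate of the directional derivative along the unit vector e."""
    return (f(x + h * e[0], y + h * e[1]) - f(x - h * e[0], y - h * e[1])) / (2 * h)

grad = (3.0, 4.0)                        # analytic gradient of f at (0, 0)
norm = math.hypot(*grad)                 # |grad f(0, 0)| = 5
same = (grad[0] / norm, grad[1] / norm)  # same direction as the gradient
opposite = (-same[0], -same[1])          # opposite direction
orthogonal = (-same[1], same[0])         # orthogonal direction

print(directional_derivative(f, 0, 0, same))        # ≈ 5.0  (maximum, equals |grad|)
print(directional_derivative(f, 0, 0, opposite))    # ≈ -5.0 (minimum, equals -|grad|)
print(directional_derivative(f, 0, 0, orthogonal))  # ≈ 0.0  (zero rate of change)
```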
In Tongji University's Advanced Mathematics, the gradient of a function of two variables is defined as follows:
[Definition] Gradient (from p. 106, Volume 2, Advanced Mathematics, 7th Edition, Tongji University)
Suppose the function $f(x,y)$ has continuous first-order partial derivatives in a plane region $D$. Then for each point $P_0(x_0,y_0) \in D$ we can determine a vector

$$f_x(x_0,y_0)\, i + f_y(x_0,y_0)\, j$$
This vector is called the gradient of the function $f(x,y)$ at the point $P_0(x_0,y_0)$, written $\mathrm{grad}\, f(x_0,y_0)$ or $\nabla f(x_0,y_0)$, i.e.

$$\mathrm{grad}\, f(x_0,y_0) = \nabla f(x_0,y_0) = f_x(x_0,y_0)\, i + f_y(x_0,y_0)\, j$$

where $\nabla = \frac{\partial}{\partial x} i + \frac{\partial}{\partial y} j$ is called the (two-dimensional) vector differential operator or nabla operator, so that $\nabla f = \frac{\partial f}{\partial x} i + \frac{\partial f}{\partial y} j$, and $i$, $j$ are the unit vectors along the first and second coordinate axes, respectively.
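As a concrete illustration (a worked example added here, using the same test function as the code below), for $f(x,y) = \frac{(x+3)^2 + (y+4)^2}{2}$ we have

$$\nabla f(x,y) = (x+3)\, i + (y+4)\, j, \qquad \nabla f(0,0) = 3\, i + 4\, j,$$

which matches the numerical gradient $[3.000\ldots,\ 4.000\ldots]$ returned by `partial_derivative` in the test below.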
Gradient vector calculation (implemented in Python with scipy)
[Source address] code.gradient_descent.partial_derivative

```python
# https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/gradient_descent/_partial_derivative.py
from scipy.misc import derivative


def partial_derivative(func, arr, dx=1e-6):
    """Compute the gradient vector (list of partial derivatives) of an n-variable function at a given point

    :param func: [function] n-variable function
    :param arr: [list/tuple] coordinates of the target point
    :param dx: [int/float] increment of x used when computing the derivative
    :return: [list] list of partial derivatives
    """
    n_features = len(arr)
    ans = []
    for i in range(n_features):
        def f(x):
            arr2 = list(arr)
            arr2[i] = x
            return func(arr2)

        ans.append(derivative(f, arr[i], dx=dx))
    return ans
```
[Source address] Testing

```python
>>> from code.gradient_descent import partial_derivative
>>> partial_derivative(lambda x: x[0] ** 2, [3])
[6.000000000838668]
>>> partial_derivative(lambda x: ((x[0] + 3) ** 2 + (x[1] + 4) ** 2) / 2, [0, 0])
[3.000000000419334, 3.9999999996709334]
```
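Note that `scipy.misc.derivative` has been deprecated and removed in recent SciPy releases. If it is unavailable, a minimal central-difference replacement can be used instead (this helper, `partial_derivative_no_scipy`, is a sketch added here and is not part of the repository):

```python
def partial_derivative_no_scipy(func, arr, dx=1e-6):
    """Same interface as partial_derivative, but with a hand-rolled central difference."""
    ans = []
    for i in range(len(arr)):
        left, right = list(arr), list(arr)
        left[i] -= dx
        right[i] += dx
        # central difference: (f(x + dx) - f(x - dx)) / (2 * dx)
        ans.append((func(right) - func(left)) / (2 * dx))
    return ans
```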
One-dimensional search based on the golden section method (implemented in Python)
[Source address] code.gradient_descent.golden_section_for_line_search

```python
# https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/gradient_descent/_golden_section_for_line_search.py
def golden_section_for_line_search(func, a0, b0, epsilon):
    """One-dimensional search for the minimum point (golden section method)

    :param func: [function] univariate function
    :param a0: [int/float] left boundary of the target interval
    :param b0: [int/float] right boundary of the target interval
    :param epsilon: [int/float] accuracy
    """
    a1, b1 = a0 + 0.382 * (b0 - a0), b0 - 0.382 * (b0 - a0)
    fa, fb = func(a1), func(b1)
    while b1 - a1 > epsilon:
        if fa <= fb:
            b0, b1, fb = b1, a1, fa
            a1 = a0 + 0.382 * (b0 - a0)
            fa = func(a1)
        else:
            a0, a1, fa = a1, b1, fb
            b1 = b0 - 0.382 * (b0 - a0)
            fb = func(b1)
    return (a1 + b1) / 2
```
[Source address] Testing

```python
>>> from code.gradient_descent import golden_section_for_line_search
>>> golden_section_for_line_search(lambda x: x ** 2, -10, 5, epsilon=1e-6)
5.263005013597177e-06
```
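A brief note on the constant used above: $0.382 \approx 1 - \frac{\sqrt{5}-1}{2} = 1 - 0.618$, i.e. one minus the golden ratio conjugate. Each iteration therefore shrinks the search interval to about 0.618 of its previous length while reusing one of the two interior function evaluations, and the loop stops once the interval between the interior points is shorter than `epsilon`.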
Gradient descent method (implemented in Python)
[Source address] code.gradient_descent.gradient_descent

```python
# https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/gradient_descent/_gradient_descent.py
from ._partial_derivative import partial_derivative  # code.gradient_descent.partial_derivative


def gradient_descent(func, n_features, eta, epsilon, maximum=1000):
    """Gradient descent method

    :param func: [function] n-variable objective function
    :param n_features: [int] number of variables of the objective function
    :param eta: [int/float] learning rate
    :param epsilon: [int/float] learning accuracy
    :param maximum: [int] maximum number of iterations
    :return: [list] coordinates of the result point
    """
    x0 = [0] * n_features  # initial value of the variables
    y0 = func(x0)  # compute the function value
    for _ in range(maximum):
        nabla = partial_derivative(func, x0)  # compute the gradient
        x1 = [x0[i] - eta * nabla[i] for i in range(n_features)]  # update the variables
        y1 = func(x1)  # compute the function value
        if abs(y1 - y0) < epsilon:  # stop when the change is smaller than the learning accuracy
            return x1
        x0, y0 = x1, y1
```
[Source address] Testing

```python
>>> from code.gradient_descent import gradient_descent
>>> gradient_descent(lambda x: x[0] ** 2, 1, eta=0.1, epsilon=1e-6)
[0.0]
>>> gradient_descent(lambda x: ((x[0] + 3) ** 2 + (x[1] + 4) ** 2) / 2, 2, eta=0.1, epsilon=1e-6)
[-2.9983082373813077, -3.997744316508843]
```
Steepest descent method (implemented in Python)
[Source address] code.gradient_descent.steepest_descent

```python
# https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/gradient_descent/_steepest_descent.py
from ._golden_section_for_line_search import golden_section_for_line_search  # code.gradient_descent.golden_section_for_line_search
from ._partial_derivative import partial_derivative  # code.gradient_descent.partial_derivative


def steepest_descent(func, n_features, epsilon, distance=3, maximum=1000):
    """Steepest descent method

    :param func: [function] n-variable objective function
    :param n_features: [int] number of variables of the objective function
    :param epsilon: [int/float] learning accuracy
    :param distance: [int/float] range of each one-dimensional search (the searched segment is `distance` times the gradient modulus)
    :param maximum: [int] maximum number of iterations
    :return: [list] coordinates of the result point
    """
    x0 = [0] * n_features  # initial value of the variables
    y0 = func(x0)  # compute the function value
    for _ in range(maximum):
        nabla = partial_derivative(func, x0)  # compute the gradient

        # stop iterating when the modulus of the gradient is smaller than the accuracy requirement
        if pow(sum([nabla[i] ** 2 for i in range(n_features)]), 0.5) < epsilon:
            return x0

        def f(x):
            """Univariate function along the gradient direction"""
            x2 = [x0[i] - x * nabla[i] for i in range(n_features)]
            return func(x2)

        lk = golden_section_for_line_search(f, 0, distance, epsilon=1e-6)  # one-dimensional search for the stationary point
        x1 = [x0[i] - lk * nabla[i] for i in range(n_features)]  # update the variables
        y1 = func(x1)  # compute the function value
        if abs(y1 - y0) < epsilon:  # stop when the change is smaller than the learning accuracy
            return x1
        x0, y0 = x1, y1
```
[Source address] Testing

```python
>>> from code.gradient_descent import steepest_descent
>>> steepest_descent(lambda x: x[0] ** 2, 1, epsilon=1e-6)
[0]
>>> steepest_descent(lambda x: ((x[0] + 3) ** 2 + (x[1] + 4) ** 2) / 2, 2, epsilon=1e-6)
[-2.9999999999635865, -3.999999999951452]
```
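Comparing the two sets of tests on the same objective function $\frac{(x_1+3)^2 + (x_2+4)^2}{2}$: with the same stopping accuracy, the steepest descent result is much closer to the true minimum point $(-3, -4)$ than the fixed-step gradient descent result above, because each iteration of the steepest descent method moves all the way to the one-dimensional minimum along the current gradient direction.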