# 1, Introduction

previously, we introduced two support vector machine models in two sections - hard interval support vector machine and soft interval support vector machine. These two models can be collectively referred to as linear support vector machine. Next, we will introduce another support vector machine model—— Nonlinear support vector machine1 (Non-Linear Support Vector Machine).

# Two, model introduction

before introducing nonlinear support vector machines, let's look at the following linearly indivisible data sets:

< center > Figure 2-1 < / center >

in Figure 2-1, circles and crosses are used to represent different label classifications respectively. Obviously, the data set cannot be correctly classified by a straight line, but as shown in Figure 2-2, the data set can be correctly classified by an elliptic curve.

< center > figure 2-2 < / center >

however, it is relatively difficult to solve the problem of nonlinear classification such as elliptic curve in Figure 2-2. Since the linear classification problem is relatively easy to solve, can the data be transformed into a linear classification data through certain nonlinear changes? The answer is yes.

< center > figure 2-3 < / center >

as shown in Fig. 2-3, the data set is subjected to nonlinear transformation (Z = X * X). At this time, it can be seen that the transformed data can be correctly classified through a straight line. The original elliptic curve in Fig. 2-2 is transformed into a straight line in Fig. 2-3, and the original nonlinear classification problem is transformed into a linear classification problem.

### Original model

let's review the original model of soft interval support vector machine:

$$\begin{array}{c} \underset{w, b, \xi}{\min} \frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \xi_{i} \\ \text { s.t. } \quad C>0 \quad \xi_{i} \geq 0 \quad \xi_{i} \geq 1-y_{i}\left(w^{T} x_{i}+b\right) \quad i=1,2, \cdots, N \end{array}$$

< center > formula 2-1 < / center >

suppose there is a φ Function can transform x, for example, mapping x to z in the above example and bringing it into the original model of soft interval support vector machine above:

$$\begin{array}{c} \underset{w, b, \xi}{\min} \frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \xi_{i} \\ \text { s.t. } \quad C>0 \quad \xi_{i} \geq 0 \quad \xi_{i} \geq 1-y_{i}\left(w^{T} \phi(x_{i})+b\right) \quad i=1,2, \cdots, N \end{array}$$

< center > formula 2-2 < / center >

equation 2-2 above is the original model of nonlinear support vector machine.

### Dual model

with the dual model solved in the previous two sections, the dual model of nonlinear support vector machine can be obtained:

$$\begin{array}{c} \underset{\lambda}{\max} \sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} \phi(x_{i})^{T} \phi(x_{j}) \\ \text { s.t. } \quad \sum_{i=1}^{N} \lambda_{i} y_{i}=0 \quad 0 \leq \lambda_{i} \leq C \quad i=1,2, \cdots, N \end{array}$$

< center > formula 2-3 < / center >

### Nuclear technique

observe the change part of the dual model of soft interval support vector machine in equation 2-3, and the feature vector passes through φ After the function transformation, its dimension may be very high, and it is usually difficult to find the inner product. In order to bypass this inner product calculation, we can find a function as shown in equation 2-4, so that the value calculated by the function is equal to the inner product after feature conversion. Such a function is called Kernel Function, and this replacement method is called Nuclear technique2(Kernel Trick)

$$K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$$

< center > formula 2-4 < / center >

use the kernel technique to bring in equation 2-3 to obtain the final dual model of nonlinear support vector machine:

$$\begin{array}{c} \underset{\lambda}{\max} \sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} K(x_i, x_j) \\ \text { s.t. } \quad \sum_{i=1}^{N} \lambda_{i} y_{i}=0 \quad 0 \leq \lambda_{i} \leq C \quad i=1,2, \cdots, N \end{array}$$

< center > formula 2-5 < / center >

decision making under nuclear technology:

(1) Original decision surface function

(2) Weight coefficient w

(3) The final decision surface function is obtained by using kernel technique

\begin{aligned} f(x) &=w^{T} \phi(x)+b & (1)\\ &=\sum_{i=1}^{N} \lambda_{i} y_{i} \phi\left(x_{i}\right) \phi(x)+b & (2)\\ &=\sum_{i=1}^{N} \lambda_{i} y_{i} K\left(x_{i}, x\right)+b & (3) \end{aligned}

< center > formula 2-6 < / center >

kernel technique is not only applicable to support vector machines, but also widely used in other machine learning algorithms.

### kernel function

common kernel functions:

(1) Linear Kernel: the classification result of support vector machine using Linear Kernel is the same as that of soft interval support vector machine

(2) Polynomial Kernel: when γ = 1， ε = When 0, d = 1, the Polynomial Kernel function degenerates into a linear kernel function

(3) Radial basis function/RBF Kernel: the common Gaussian Kernel is a kind of radial basis function

(4) Sigmoid Kernel: tanh is a hyperbolic tangent function

\begin{aligned} K\left(x_{i}, x_{j}\right)&=x_{i}^{T} x_{j} & (1) \\ K\left(x_{i}, x_{j}\right)&=\left(\gamma x_{i}^{T} x_{j}+\varepsilon\right)^{d} & (2)\\ K\left(x_{i}, x_{j}\right)&=e^{-\gamma\left\|x_{i}-x_{j}\right\|^{2}} & (3)\\ K\left(x_{i}, x_{j}\right)&=\tanh \left(\gamma x_{i}^{T} x_{j}+\varepsilon\right) & (4) \end{aligned}

< center > formula 2-7 < / center >

# 3, Algorithm steps

compared with the dual model of soft interval support vector machine, nonlinear support vector machine only uses kernel function to replace the inner product of feature vector, and the others have no change, so the corresponding sequence minimum optimization algorithm (SMO) has little change.

for the algorithm steps, please refer to the contents in the previous two sections and the following code implementation. At the same time, you can refer to Corresponding papers of SMO algorithm 3. Implementation.

# 4, Code implementation

Using Python to implement nonlinear support vector machine (SMO algorithm)

import numpy as np

class SMO:
"""
Support vector machine
Implementation of sequential minimum optimization algorithm( Sequential minimal optimization/SMO)
"""

def __init__(self, X, y, kernel, degree = 3, coef0 = 0.0, gamma = 1.0):
# Training sample characteristic matrix (N * p)
self.X = X
# Training sample label vector (N * 1)
self.y = y
# Lagrange multiplier vector (N * 1)
self.alpha = np.zeros(X.shape[0])
# Error vector, default to negative label vector (N * 1)
self.errors = -y
# Offset
self.b = 0
# Generation value
self.cost = -np.inf
# kernel function
self.kernel = kernel
# Kernel related parameters
self.degree = degree
self.coef0 = coef0
self.gamma = gamma

def fit(self, C = 1, tol = 1e-4):
"""
Algorithm from John C. Platt My thesis
"""
# Update change times
numChanged = 0
# Check all
examineAll = True
while numChanged > 0 or examineAll:
numChanged = 0
if examineAll:
for idx in range(X.shape[0]):
numChanged += self.update(idx, C)
else:
for idx in range(X.shape[0]):
if self.alpha[idx] <= 0:
continue
numChanged += self.update(idx, C)
if examineAll:
examineAll = False
elif numChanged == 0:
examineAll = True
# Calculate generation value
cost = self.calcCost()
if self.cost > 0:
# The algorithm ends when the change of contemporary value is less than the allowable range
rate = (cost - self.cost) / self.cost
if rate <= tol:
break
self.cost = cost

def update(self, idx, C = 1):
"""
Subscript for idx The Lagrange multiplier is updated
"""
X = self.X
y = self.y
alpha = self.alpha
# Check whether the current Lagrange multiplier meets the KKT condition. If it meets the condition, it will directly return 0
if self.checkKKT(idx, C):
return 0
if len(alpha[(alpha != 0)]) > 1:
# Find the subscript of the second Lagrange multiplier to be optimized according to the principle of ｜ E1 - E2 ｜ maximum
jdx = self.selectJdx(idx)
# Update the Lagrange multipliers with subscripts idx and jdx, and return 1 directly when the update is successful
if self.updateAlpha(idx, jdx, C):
return 1
# When the update is not successful, the Lagrange multiplier that is not zero is traversed for update
for jdx in range(X.shape[0]):
if alpha[jdx] == 0:
continue
# Update the Lagrange multipliers with subscripts idx and jdx, and return 1 directly when the update is successful
if self.updateAlpha(idx, jdx, C):
return 1
# When the update is still unsuccessful, the Lagrange multiplier traversing zero is updated
for jdx in range(X.shape[0]):
if alpha[jdx] != 0:
continue
# Update the Lagrange multipliers with subscripts idx and jdx, and return 1 directly when the update is successful
if self.updateAlpha(idx, jdx, C):
return 1
# Return 0 if there is still no update
return 0

def selectJdx(self, idx):
"""
Find the subscript of the second Lagrange multiplier to be optimized
"""
errors = self.errors
if errors[idx] > 0:
# When the error term is greater than zero, select the subscript of the minimum value in the error vector
return np.argmin(errors)
elif errors[idx] < 0:
# When the error term is less than zero, select the subscript of the maximum value in the error vector
return np.argmax(errors)
else:
# When the error term is equal to zero, select the subscript with the largest absolute value of the maximum and minimum value in the error vector
minJdx = np.argmin(errors)
maxJdx = np.argmax(errors)
if max(np.abs(errors[minJdx]), np.abs(errors[maxJdx])) - errors[minJdx]:
return minJdx
else:
return maxJdx

def calcB(self):
"""
Calculate offset
Calculate the offset corresponding to each Lagrange multiplier that is not zero, and then take its average value
"""
X = self.X
y = self.y
alpha = self.alpha
alpha_gt = alpha[alpha > 0]
# Non zero quantity in Lagrange multiplier vector
alpha_gt_len = len(alpha_gt)
# Returns 0 directly when all are zero
if alpha_gt_len == 0:
return 0
# b = y - Wx, please refer to the description in the article for the specific algorithm
X_gt = X[alpha > 0]
y_gt = y[alpha > 0]

s = 0
for idx in range(X_gt.shape[0]):
ss = 0
for jdx in range(X_gt.shape[0]):
ss += alpha_gt[jdx] * y_gt[jdx] * self.kernel(X_gt[jdx], X_gt[idx], self.degree, self.coef0, self.gamma)
s += y_gt[idx] - ss
return s / alpha_gt_len

def calcCost(self):
"""
Calculate generation value
It can be calculated according to the algorithm in the article
"""
X = self.X
y = self.y
alpha = self.alpha
cost = 0
for idx in range(X.shape[0]):
for jdx in range(X.shape[0]):
cost = cost + (y[idx] * y[jdx] * self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma) * alpha[idx] * alpha[jdx])
return np.sum(alpha) - cost / 2

def checkKKT(self, idx, C = 1):
"""
Check subscript is idx Is the Lagrange multiplier satisfied KKT condition
1. alpha >= 0
2. alpha <= C
3. y * f(x) - 1 >= 0
4. alpha * (y * f(x) - 1) = 0
"""
y = self.y
errors = self.errors
alpha = self.alpha
r = errors[idx] * y[idx]
if (alpha[idx] > 0 and alpha[idx] < C and r == 0) or (alpha[idx] == 0 and r > 0) or (alpha[idx] == C and r < 0):
return True
return False

def calcU(self, idx, jdx, C = 1):
"""
Calculate the upper bound of Lagrange multiplier so that the two Lagrange multipliers to be optimized are greater than or equal to 0 at the same time
It can be calculated according to the algorithm in the article
"""
y = self.y
alpha = self.alpha
if y[idx] * y[jdx] == 1:
return max(0, alpha[jdx] + alpha[idx] - C)
else:
return max(0.0, alpha[jdx] - alpha[idx])

def calcV(self, idx, jdx, C = 1):
"""
Calculate the lower bound of Lagrange multiplier so that the two Lagrange multipliers to be optimized are greater than or equal to 0 at the same time
It can be calculated according to the algorithm in the article
"""
y = self.y
alpha = self.alpha
if y[idx] * y[jdx] == 1:
return min(C, alpha[jdx] + alpha[idx])
else:
return min(C, C + alpha[jdx] - alpha[idx])

def updateAlpha(self, idx, jdx, C = 1):
"""
Subscript for idx,jdx The Lagrange multiplier is updated
It can be calculated according to the algorithm in the article
"""
if idx == jdx:
return False
X = self.X
y = self.y
alpha = self.alpha
errors = self.errors
# Error term of idx
Ei = errors[idx]
# Error term of jdx
Ej = errors[jdx]
U = self.calcU(idx, jdx, C)
V = self.calcV(idx, jdx, C)
if U == V:
return False
Kii = self.kernel(X[idx], X[idx], self.degree, self.coef0, self.gamma)
Kjj = self.kernel(X[jdx], X[jdx], self.degree, self.coef0, self.gamma)
Kij = self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma)
# Calculate K
K = Kii + Kjj - 2 * Kij
oldAlphaIdx = alpha[idx]
oldAlphaJdx = alpha[jdx]
oldB = self.b
s = y[idx] * y[jdx]
if K > 0:
# Calculating the new Lagrange multiplier of jdx
newAlphaJdx = oldAlphaJdx + y[jdx] * (Ei - Ej) / K
if newAlphaJdx < U:
# When the new value exceeds the upper bound, change it to the upper bound
newAlphaJdx = U
if newAlphaJdx > V:
# When the new value is lower than the lower bound, change it to the lower bound
newAlphaJdx = V
else:
fi = y[idx] * (Ei + oldB) - oldAlphaIdx * Kii - s * oldAlphaJdx * Kij
fj = y[jdx] * (Ej + oldB) - s * oldAlphaIdx * Kij - oldAlphaJdx * Kjj
Vv = oldAlphaIdx + s * (oldAlphaJdx - V)
Uu = oldAlphaIdx + s * (oldAlphaJdx - U)
Vv = Vv * fi + V * fj + 0.5 * (Vv ** 2) * Kii + 0.5 * (V ** 2) * Kjj + s * V * Vv * Kij
Uu = Uu * fi + U * fj + 0.5 * (Uu ** 2) * Kii + 0.5 * (U ** 2) * Kjj + s * U * Uu * Kij
if Vv < Uu:
newAlphaJdx = V
elif Vv > Uu:
newAlphaJdx = U
else:
newAlphaJdx = oldAlphaJdx
if oldAlphaJdx == newAlphaJdx:
# When the new value is equal to the old value, it is judged as not updated and returned directly
return False
# New Lagrange multipliers for computing idx
newAlphaIdx = oldAlphaIdx + s * (oldAlphaJdx - newAlphaJdx)
# Update Lagrange multiplier vector
alpha[idx] = newAlphaIdx
alpha[jdx] = newAlphaJdx
oldB = self.b
# Recalculate offset
self.b = self.calcB()
# Recalculate error vector
for i in range(X.shape[0]):
newError = errors[i] + y[idx] * (newAlphaIdx - oldAlphaIdx) * self.kernel(X[idx], X[i], self.degree, self.coef0, self.gamma) + y[jdx] * (newAlphaJdx - oldAlphaJdx) * self.kernel(X[jdx], X[i]) - oldB + self.b
return True

def predict(self, X):
fxs = []
for idx in range(len(X)):
fx = 0
for jdx in range(self.X.shape[0]):
fx += self.y[jdx] * self.alpha[jdx] * self.kernel(self.X[jdx], X[idx], self.degree, self.coef0, self.gamma)
fxs.append(fx + self.b)
return np.sign(fxs)

Linear kernel support vector machine

# Linear kernel support vector machine
def linearKernel(x, z, degree = 3, coef0 = 0.0, gamma = 1.0):
return x.dot(z)

linearSmo = SMO(X, y, kernel = linearKernel)
linearSmo.fit()

Polynomial kernel support vector machine

# Polynomial kernel support vector machine
def polynomialKernel(x, z, degree = 3, coef0 = 0.0, gamma = 1.0):
return np.power(gamma * x.dot(z) + coef0, degree)

polynomialSmo = SMO(X, y, kernel = polynomialKernel)
polynomialSmo.fit()

Radial basis function kernel support vector machine

# Radial basis function kernel support vector machine
def rbfKernel(x, z, degree = 3, coef0 = 0.0, gamma = 1.0):
return np.exp(-gamma * np.power(np.linalg.norm(x - z), 2))

rbfSmo = SMO(X, y, kernel = rbfKernel)
rbfSmo.fit()

# 5, Third party library implementation

scikit-learn 4 implementation

from sklearn.svm import SVC

# Linear kernel support vector machine
svc = SVC(kernel = "linear")
# Polynomial kernel support vector machine
svc = SVC(kernel = "poly", gamma = 1)
# Radial basis function kernel support vector machine
svc = SVC(kernel = "rbf", gamma = 1)

# Fitting data
svc.fit(X, y)

# 6, Animation demonstration

the following three figures respectively show the classification results of linearly indivisible data sets for different kernel functions. Red represents the sample points with label value of - 1 and blue represents the sample points with label value of 1. The light red area is the part with the predicted value of - 1, and the light blue area is the part with the predicted value of 1:

< center > figure 6-1 < / center >

< center > figure 6-2 < / center >

< center > figure 6-3 < / center >

it can be seen that different kernel functions have a great impact on the classification results. When the form of feature transformation is not known, it is impossible to select the appropriate kernel function. However, generally, for nonlinear classification problems, radial basis function kernel function can be used to find the appropriate parameters through cross verification.

# 7, Mind map

< center > Figure 7-1 < / center >