Machine Learning Algorithm Series - Nonlinear Support Vector Machine

Background knowledge required for reading this article: linear support vector machines and a little programming knowledge

1. Introduction

Previously, we introduced two support vector machine models in two separate sections: the hard margin support vector machine and the soft margin support vector machine. These two models can be collectively referred to as linear support vector machines. Next, we will introduce another support vector machine model: the nonlinear support vector machine1 (Non-Linear Support Vector Machine).

2. Model Introduction

Before introducing nonlinear support vector machines, let's look at the following linearly inseparable data set:


Figure 2-1

In Figure 2-1, circles and crosses represent the two label classes. Clearly, this data set cannot be correctly separated by a straight line; however, as shown in Figure 2-2, it can be correctly separated by an elliptic curve.


Figure 2-2

However, nonlinear classification problems such as the elliptic boundary in Figure 2-2 are relatively difficult to solve directly. Since linear classification problems are comparatively easy to solve, can the data be turned into a linearly separable problem through some nonlinear transformation? The answer is yes.


Figure 2-3

As shown in Figure 2-3, the data set is subjected to the nonlinear transformation Z = X * X (each feature is squared). The transformed data can now be correctly separated by a straight line: the original elliptic boundary in Figure 2-2 becomes a straight line in Figure 2-3, and the original nonlinear classification problem becomes a linear classification problem.
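
As a minimal sketch of this idea (the toy data below is an assumption for illustration, not the article's original data set), squaring each coordinate maps points separated by an ellipse into points separable by a straight line:

import numpy as np

# Toy data: points inside the ellipse x1^2/2.25 + x2^2/1.0 = 1 are labeled 1, others -1
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 / 2.25 + X[:, 1] ** 2 / 1.0 <= 1, 1, -1)

# Nonlinear transformation Z = X * X: square every feature
Z = X ** 2

# In the transformed space the boundary z1 / 2.25 + z2 / 1.0 = 1 is a straight line,
# so a linear classifier can now separate the two classes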

Primal model

Let's review the primal model of the soft margin support vector machine:

$$ \begin{array}{c} \underset{w, b, \xi}{\min} \frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \xi_{i} \\ \text { s.t. } \quad C>0 \quad \xi_{i} \geq 0 \quad \xi_{i} \geq 1-y_{i}\left(w^{T} x_{i}+b\right) \quad i=1,2, \cdots, N \end{array} $$

Equation 2-1

Suppose there is a function φ that can transform x, for example mapping x to z as in the example above. Substituting the transformed features into the primal model of the soft margin support vector machine gives:

$$ \begin{array}{c} \underset{w, b, \xi}{\min} \frac{1}{2} w^{T} w+C \sum_{i=1}^{N} \xi_{i} \\ \text { s.t. } \quad C>0 \quad \xi_{i} \geq 0 \quad \xi_{i} \geq 1-y_{i}\left(w^{T} \phi(x_{i})+b\right) \quad i=1,2, \cdots, N \end{array} $$

Equation 2-2

Equation 2-2 above is the primal model of the nonlinear support vector machine.

Dual model

Following the same dual derivation as in the previous two sections, the dual model of the nonlinear support vector machine can be obtained:

$$ \begin{array}{c} \underset{\lambda}{\max} \sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} \phi(x_{i})^{T} \phi(x_{j}) \\ \text { s.t. } \quad \sum_{i=1}^{N} \lambda_{i} y_{i}=0 \quad 0 \leq \lambda_{i} \leq C \quad i=1,2, \cdots, N \end{array} $$

Equation 2-3

Kernel trick

Observe the part of Equation 2-3 that changed relative to the dual model of the soft margin support vector machine: after the feature vectors are transformed by the function φ, their dimension may be very high, and the inner product is usually difficult to compute directly. To bypass this inner product calculation, we can look for a function as shown in Equation 2-4 whose value equals the inner product of the transformed features. Such a function is called a kernel function (Kernel Function), and this replacement method is called the kernel trick2 (Kernel Trick).

$$ K(x_i, x_j) = \phi(x_i)^T\phi(x_j) $$

Equation 2-4
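
As a quick sanity check of Equation 2-4 (a minimal sketch; the explicit feature map below is just one common choice used for illustration), the degree-2 homogeneous polynomial kernel K(x, z) = (x^T z)^2 equals the inner product of the explicit map φ(x) = (x1^2, √2·x1·x2, x2^2), so the kernel value can be computed without ever forming φ:

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-dimensional input
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    # Degree-2 homogeneous polynomial kernel
    return x.dot(z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(phi(x).dot(phi(z)))   # approximately 16.0
print(poly2_kernel(x, z))   # 16.0, the same value without computing phi explicitly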

Substituting the kernel function into Equation 2-3 yields the final dual model of the nonlinear support vector machine:

$$ \begin{array}{c} \underset{\lambda}{\max} \sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} K(x_i, x_j) \\ \text { s.t. } \quad \sum_{i=1}^{N} \lambda_{i} y_{i}=0 \quad 0 \leq \lambda_{i} \leq C \quad i=1,2, \cdots, N \end{array} $$

Equation 2-5

The decision function under the kernel trick:

(1) Start from the original decision surface function

(2) Substitute the weight coefficient w = Σᵢ λᵢ yᵢ φ(xᵢ)

(3) Apply the kernel trick to obtain the final decision surface function

$$ \begin{aligned} f(x) &=w^{T} \phi(x)+b & (1)\\ &=\sum_{i=1}^{N} \lambda_{i} y_{i} \phi\left(x_{i}\right) \phi(x)+b & (2)\\ &=\sum_{i=1}^{N} \lambda_{i} y_{i} K\left(x_{i}, x\right)+b & (3) \end{aligned} $$

Equation 2-6
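
A minimal sketch of how the last line of Equation 2-6 is evaluated (assuming the multipliers, labels, training samples, bias and kernel function are already available from training; the names below are illustrative):

def decision_function(x, X_train, y_train, lambdas, b, kernel):
    # f(x) = sum_i lambda_i * y_i * K(x_i, x) + b   (Equation 2-6, line 3)
    return sum(l * yi * kernel(xi, x) for l, yi, xi in zip(lambdas, y_train, X_train)) + b

The sign of the returned value gives the predicted class, which is exactly what the predict method in the code below does.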

The kernel trick is not only applicable to support vector machines; it is also widely used in other machine learning algorithms.

Kernel functions

Common kernel functions:

(1) Linear kernel: a support vector machine with the linear kernel gives the same classification result as the soft margin support vector machine

(2) Polynomial kernel: when γ = 1, ε = 0 and d = 1, the polynomial kernel degenerates into the linear kernel (see the check after Equation 2-7)

(3) Radial basis function (RBF) kernel: the commonly used Gaussian kernel is a type of radial basis function

(4) Sigmoid kernel: tanh is the hyperbolic tangent function

$$ \begin{aligned} K\left(x_{i}, x_{j}\right)&=x_{i}^{T} x_{j} & (1) \\ K\left(x_{i}, x_{j}\right)&=\left(\gamma x_{i}^{T} x_{j}+\varepsilon\right)^{d} & (2)\\ K\left(x_{i}, x_{j}\right)&=e^{-\gamma\left\|x_{i}-x_{j}\right\|^{2}} & (3)\\ K\left(x_{i}, x_{j}\right)&=\tanh \left(\gamma x_{i}^{T} x_{j}+\varepsilon\right) & (4) \end{aligned} $$

Equation 2-7
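
A quick numerical check of point (2) above (a sketch with arbitrary test vectors): with γ = 1, ε = 0 and d = 1 the polynomial kernel returns exactly the inner product computed by the linear kernel.

import numpy as np

def linear_kernel(x, z):
    return x.dot(z)

def polynomial_kernel(x, z, gamma = 1.0, coef0 = 0.0, degree = 3):
    return (gamma * x.dot(z) + coef0) ** degree

x, z = np.array([1.0, -2.0, 0.5]), np.array([3.0, 1.0, 4.0])
# With gamma = 1, coef0 = 0 and degree = 1 the polynomial kernel reduces to the linear kernel
assert np.isclose(polynomial_kernel(x, z, gamma = 1.0, coef0 = 0.0, degree = 1), linear_kernel(x, z))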

3. Algorithm Steps

Compared with the dual model of the soft margin support vector machine, the nonlinear support vector machine only replaces the inner product of the feature vectors with a kernel function; nothing else changes, so the corresponding sequential minimal optimization (SMO) algorithm changes very little.

For the algorithm steps, please refer to the previous two sections and the code implementation below. You can also refer to the corresponding SMO paper3.

4. Code Implementation

Implementation of the nonlinear support vector machine (SMO algorithm) in Python:

import numpy as np

class SMO:
    """
    Support vector machine
    Implementation of the sequential minimal optimization (SMO) algorithm
    """

    def __init__(self, X, y, kernel, degree = 3, coef0 = 0.0, gamma = 1.0):
        # Training sample characteristic matrix (N * p)
        self.X = X
        # Training sample label vector (N * 1)
        self.y = y
        # Lagrange multiplier vector (N * 1)
        self.alpha = np.zeros(X.shape[0])
        # Error vector, default to negative label vector (N * 1)
        self.errors = -y
        # Offset 
        self.b = 0
        # Dual objective value (used for the convergence check)
        self.cost = -np.inf
        # kernel function
        self.kernel = kernel
        # Kernel related parameters
        self.degree = degree
        self.coef0 = coef0
        self.gamma = gamma

    def fit(self, C = 1, tol = 1e-4):
        """
        Algorithm from John C. Platt's paper:
        https://www.microsoft.com/en-us/research/uploads/prod/1998/04/sequential-minimal-optimization.pdf
        """
        # Update change times
        numChanged = 0
        # Check all
        examineAll = True
        while numChanged > 0 or examineAll:
            numChanged = 0
            if examineAll:
                for idx in range(self.X.shape[0]):
                    numChanged += self.update(idx, C)
            else:
                for idx in range(self.X.shape[0]):
                    if self.alpha[idx] <= 0:
                        continue
                    numChanged += self.update(idx, C)
            if examineAll:
                examineAll = False
            elif numChanged == 0:
                examineAll = True
            # Compute the current dual objective value
            cost = self.calcCost()
            if self.cost > 0:
                # Stop when the relative change of the objective value is within the tolerance
                rate = (cost - self.cost) / self.cost
                if rate <= tol:
                    break
            self.cost = cost

    def update(self, idx, C = 1):
        """
        Update the Lagrange multiplier with subscript idx
        """
        X = self.X
        y = self.y
        alpha = self.alpha
        # Check whether the current Lagrange multiplier meets the KKT condition. If it meets the condition, it will directly return 0
        if self.checkKKT(idx, C):
            return 0
        if len(alpha[(alpha != 0)]) > 1:
            # Find the subscript of the second Lagrange multiplier to be optimized according to the principle of | E1 - E2 | maximum
            jdx = self.selectJdx(idx)
            # Update the Lagrange multipliers with subscripts idx and jdx, and return 1 directly when the update is successful
            if self.updateAlpha(idx, jdx, C):
                return 1
        # When the update is not successful, the Lagrange multiplier that is not zero is traversed for update
        for jdx in range(X.shape[0]):
            if alpha[jdx] == 0:
                continue
            # Update the Lagrange multipliers with subscripts idx and jdx, and return 1 directly when the update is successful
            if self.updateAlpha(idx, jdx, C):
                return 1
        # When the update is still unsuccessful, the Lagrange multiplier traversing zero is updated
        for jdx in range(X.shape[0]):
            if alpha[jdx] != 0:
                continue
            # Update the Lagrange multipliers with subscripts idx and jdx, and return 1 directly when the update is successful
            if self.updateAlpha(idx, jdx, C):
                return 1
        # Return 0 if there is still no update
        return 0

    def selectJdx(self, idx):
        """
        Find the subscript of the second Lagrange multiplier to be optimized
        """
        errors = self.errors
        if errors[idx] > 0:
            # When the error term is greater than zero, select the subscript of the minimum value in the error vector
            return np.argmin(errors)
        elif errors[idx] < 0:
            # When the error term is less than zero, select the subscript of the maximum value in the error vector
            return np.argmax(errors)
        else:
            # When the error term is equal to zero, select the subscript with the largest absolute value of the maximum and minimum value in the error vector
            minJdx = np.argmin(errors)
            maxJdx = np.argmax(errors)
            if np.abs(errors[minJdx]) > np.abs(errors[maxJdx]):
                return minJdx
            else:
                return maxJdx
    
    def calcB(self):
        """
        Calculate offset
        Calculate the offset corresponding to each Lagrange multiplier that is not zero, and then take its average value
        """
        X = self.X
        y = self.y
        alpha = self.alpha
        alpha_gt = alpha[alpha > 0]
        # Non zero quantity in Lagrange multiplier vector
        alpha_gt_len = len(alpha_gt)
        # Returns 0 directly when all are zero
        if alpha_gt_len == 0:
            return 0
        # b = y - w^T x; see the derivation in the article
        X_gt = X[alpha > 0]
        y_gt = y[alpha > 0]
        
        s = 0
        for idx in range(X_gt.shape[0]):
            ss = 0
            for jdx in range(X_gt.shape[0]):
                ss += alpha_gt[jdx] * y_gt[jdx] * self.kernel(X_gt[jdx], X_gt[idx], self.degree, self.coef0, self.gamma)
            s += y_gt[idx] - ss
        return s / alpha_gt_len

    def calcCost(self):
        """
        Compute the dual objective value (Equation 2-5)
        It can be calculated according to the algorithm in the article
        """
        X = self.X
        y = self.y
        alpha = self.alpha
        cost = 0
        for idx in range(X.shape[0]):
            for jdx in range(X.shape[0]):
                cost = cost + (y[idx] * y[jdx] * self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma) * alpha[idx] * alpha[jdx])
        return np.sum(alpha) - cost / 2

    def checkKKT(self, idx, C = 1):
        """
        Check whether the Lagrange multiplier with subscript idx satisfies the KKT conditions
        1. alpha >= 0
        2. alpha <= C
        3. y * f(x) - 1 >= 0
        4. alpha * (y * f(x) - 1) = 0
        """
        y = self.y
        errors = self.errors
        alpha = self.alpha
        r = errors[idx] * y[idx]
        if (alpha[idx] > 0 and alpha[idx] < C and r == 0) or (alpha[idx] == 0 and r > 0) or (alpha[idx] == C and r < 0):
            return True
        return False

    def calcU(self, idx, jdx, C = 1):
        """
        Calculate the lower bound of the new Lagrange multiplier so that both multipliers to be optimized stay within [0, C]
        It can be calculated according to the algorithm in the article
        """
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return max(0, alpha[jdx] + alpha[idx] - C)
        else:
            return max(0.0, alpha[jdx] - alpha[idx])

    def calcV(self, idx, jdx, C = 1):
        """
        Calculate the upper bound of the new Lagrange multiplier so that both multipliers to be optimized stay within [0, C]
        It can be calculated according to the algorithm in the article
        """
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return min(C, alpha[jdx] + alpha[idx])
        else:
            return min(C, C + alpha[jdx] - alpha[idx])

    def updateAlpha(self, idx, jdx, C = 1):
        """
        Update the Lagrange multipliers with subscripts idx and jdx
        It can be calculated according to the algorithm in the article
        """
        if idx == jdx:
            return False
        X = self.X
        y = self.y
        alpha = self.alpha
        errors = self.errors
        # Error term of idx
        Ei = errors[idx]
        # Error term of jdx
        Ej = errors[jdx]
        U = self.calcU(idx, jdx, C)
        V = self.calcV(idx, jdx, C)
        if U == V:
            return False
        Kii = self.kernel(X[idx], X[idx], self.degree, self.coef0, self.gamma)
        Kjj = self.kernel(X[jdx], X[jdx], self.degree, self.coef0, self.gamma)
        Kij = self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma)
        # Calculate K
        K = Kii + Kjj - 2 * Kij
        oldAlphaIdx = alpha[idx]
        oldAlphaJdx = alpha[jdx]
        oldB = self.b
        s = y[idx] * y[jdx]
        if K > 0:
            # Calculating the new Lagrange multiplier of jdx
            newAlphaJdx = oldAlphaJdx + y[jdx] * (Ei - Ej) / K
            if newAlphaJdx < U:
                # When the new value falls below the lower bound, clip it to the lower bound
                newAlphaJdx = U
            if newAlphaJdx > V:
                # When the new value exceeds the upper bound, clip it to the upper bound
                newAlphaJdx = V
        else:
            fi = y[idx] * (Ei + oldB) - oldAlphaIdx * Kii - s * oldAlphaJdx * Kij
            fj = y[jdx] * (Ej + oldB) - s * oldAlphaIdx * Kij - oldAlphaJdx * Kjj
            Vv = oldAlphaIdx + s * (oldAlphaJdx - V)
            Uu = oldAlphaIdx + s * (oldAlphaJdx - U)
            Vv = Vv * fi + V * fj + 0.5 * (Vv ** 2) * Kii + 0.5 * (V ** 2) * Kjj + s * V * Vv * Kij
            Uu = Uu * fi + U * fj + 0.5 * (Uu ** 2) * Kii + 0.5 * (U ** 2) * Kjj + s * U * Uu * Kij
            if Vv < Uu:
                newAlphaJdx = V
            elif Vv > Uu:
                newAlphaJdx = U
            else:
                newAlphaJdx = oldAlphaJdx
        if oldAlphaJdx == newAlphaJdx:
            # When the new value is equal to the old value, it is judged as not updated and returned directly
            return False
        # New Lagrange multipliers for computing idx
        newAlphaIdx = oldAlphaIdx + s * (oldAlphaJdx - newAlphaJdx)
        # Update Lagrange multiplier vector
        alpha[idx] = newAlphaIdx
        alpha[jdx] = newAlphaJdx
        oldB = self.b
        # Recalculate offset
        self.b = self.calcB()
        # Recalculate error vector
        newErrors = []
        for i in range(X.shape[0]):
            newError = errors[i] + y[idx] * (newAlphaIdx - oldAlphaIdx) * self.kernel(X[idx], X[i], self.degree, self.coef0, self.gamma) + y[jdx] * (newAlphaJdx - oldAlphaJdx) * self.kernel(X[jdx], X[i], self.degree, self.coef0, self.gamma) - oldB + self.b
            newErrors.append(newError)
        self.errors = newErrors
        return True
    
    def predict(self, X):
        fxs = []
        for idx in range(len(X)):
            fx = 0
            for jdx in range(self.X.shape[0]):
                fx += self.y[jdx] * self.alpha[jdx] * self.kernel(self.X[jdx], X[idx], self.degree, self.coef0, self.gamma)
            fxs.append(fx + self.b)
        return np.sign(fxs)

Linear kernel support vector machine

# Linear kernel support vector machine
def linearKernel(x, z, degree = 3, coef0 = 0.0, gamma = 1.0):
    return x.dot(z)

linearSmo = SMO(X, y, kernel = linearKernel)
linearSmo.fit()

Polynomial kernel support vector machine

# Polynomial kernel support vector machine
def polynomialKernel(x, z, degree = 3, coef0 = 0.0, gamma = 1.0):
    return np.power(gamma * x.dot(z) + coef0, degree)

polynomialSmo = SMO(X, y, kernel = polynomialKernel)
polynomialSmo.fit()

Radial basis function kernel support vector machine

# Radial basis function kernel support vector machine
def rbfKernel(x, z, degree = 3, coef0 = 0.0, gamma = 1.0):
    return np.exp(-gamma * np.power(np.linalg.norm(x - z), 2))

rbfSmo = SMO(X, y, kernel = rbfKernel)
rbfSmo.fit()
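
A possible end-to-end usage sketch of the class above (the toy data set and variable names below are assumptions for illustration; X and y are generated here rather than loaded from the article's original data):

# Toy nonlinear data set: the label depends on the distance from the origin
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(60, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.2, 1, -1)

# Fit an RBF-kernel SVM and check the training accuracy
model = SMO(X, y, kernel = rbfKernel, gamma = 1.0)
model.fit(C = 1)
print("training accuracy:", np.mean(model.predict(X) == y))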

5. Third-Party Library Implementation

scikit-learn4 implementation

from sklearn.svm import SVC

# Linear kernel support vector machine
svc = SVC(kernel = "linear")
# Polynomial kernel support vector machine
svc = SVC(kernel = "poly", gamma = 1)
# Radial basis function kernel support vector machine
svc = SVC(kernel = "rbf", gamma = 1)

# Fitting data
svc.fit(X, y)
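
A short fit-and-score sketch (the make_circles data set below is an assumption, chosen to mimic the linearly inseparable data in Figure 2-1):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples = 200, factor = 0.5, noise = 0.05, random_state = 0)
svc = SVC(kernel = "rbf", gamma = 1)
svc.fit(X, y)
print("training accuracy:", svc.score(X, y))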

scikit-learn uses the LIBSVM5 library; for details about that implementation, please refer to this article6.

6. Animation Demonstration

The following three figures show the classification results of different kernel functions on a linearly inseparable data set. Red represents sample points with label -1 and blue represents sample points with label 1; the light red area is the region predicted as -1, and the light blue area is the region predicted as 1:


Figure 6-1


Figure 6-2


Figure 6-3

It can be seen that different kernel functions have a great impact on the classification results. When the form of the feature transformation is unknown, there is no sure way to pick the right kernel function in advance; in practice, for nonlinear classification problems the radial basis function kernel is usually a good default, with suitable parameters found through cross-validation, as in the sketch below.
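
A hedged sketch of the cross-validation parameter search mentioned above, using scikit-learn's GridSearchCV (the parameter grid values are assumptions and would need tuning for a real data set):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search over the penalty parameter C and the RBF kernel parameter gamma with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel = "rbf"), param_grid, cv = 5)
search.fit(X, y)   # X, y: the training data, e.g. from the previous section
print(search.best_params_, search.best_score_)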

7. Mind Map


Figure 7-1

8. References

  1. https://en.wikipedia.org/wiki...
  2. https://en.wikipedia.org/wiki...
  3. https://www.microsoft.com/en-...
  4. https://scikit-learn.org/stab...
  5. https://www.csie.ntu.edu.tw/~...
  6. https://github.com/Kaslanaria...

For a complete demonstration, please click here

Note: This article strives to be accurate and easy to understand, but the author is also a beginner with limited expertise. If there are errors or omissions in the article, readers are encouraged to point them out by leaving a message.

This article was first published on AI map; you are welcome to follow it.
