Machine learning notes - Implementation of the linearly non-separable support vector machine

Implementation of the linearly non-separable support vector machine

   In the earlier article on support vector machines we derived the mathematical model of the linearly non-separable support vector machine in detail and transformed the primal problem into a dual problem to be solved. The dual problem is restated below:

Final form of the dual problem for the linearly non-separable support vector machine:

$$
\min_{\alpha}\quad \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j (x_i\cdot x_j)-\sum_{i=1}^{N}\alpha_i \\
\text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i=0 \\
0\le\alpha_i\le C,\quad i=1,2,\cdots,N
$$

For comparison, the dual problem of the linearly separable support vector machine:
$$
\min_{\alpha}\quad \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j (x_i\cdot x_j)-\sum_{i=1}^{N}\alpha_i \\
\text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i=0 \\
\alpha_i\ge 0,\quad i=1,2,\cdots,N
$$

   Comparing the mathematical models of the two support vector machines, we find that their objective functions are identical; the only difference is the constraint on $\alpha_i$, which now has the upper bound $C$. This constraint only comes into play for the support vectors and has little effect on the other sample points, so we can again expect that the final learned model of the linearly non-separable support vector machine depends only on a subset of the sample points. We verify this with the experiment below.
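   MATLAB's quadprog solves quadratic programs of the standard form

$$
\min_{x}\ \frac{1}{2}x^{T}Hx+f^{T}x\qquad \text{s.t.}\ Ax\le b,\ A_{eq}x=b_{eq},\ lb\le x\le ub
$$

Identifying the variable $x$ with the vector of multipliers $\alpha$, the dual problem above maps onto these arguments as $H_{ij}=y_i y_j(x_i\cdot x_j)$, $f=-\mathbf{1}$, $A_{eq}=y^{T}$, $b_{eq}=0$, $lb=\mathbf{0}$ and $ub=C\cdot\mathbf{1}$; no inequality constraint ($A$, $b$) is needed. This is simply the dual restated in quadprog's notation.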
   As in the linearly separable implementation, we solve the dual problem with MATLAB, and the only change needed is the upper bound on the variables; the code is as follows:

%% Classify linearly non-separable data with a support vector machine, solving the dual with the quadratic programming function
% Clear variables
clc;
clear;
% Import dataset
data = [
    27    53    -1
    49    37    -1
    56    39    -1
    28    60    -1
    68    75     1
    57    69     1
    64    62     1
    77    68     1
    70    54     1
    56    63     1
    25    41    -1
    66    34     1
    55    79     1
    77    31    -1
    46    66     1
    30    23    -1
    21    45    -1
    68    42    -1
    43    43    -1
    56    59     1
    79    68     1
    60    34    -1
    49    32    -1
    80    79     1
    77    46     1
    26    66     1
    29    29    -1
    77    34     1
    20    71    -1
    49    25    -1
    58    65     1
    33    57    -1
    31    79     1
    20    78     1
    77    37    -1
    73    34    -1
    60    26    -1
    77    66     1
    71    75     1
    35    36    -1
    49    61     1
    26    37    -1
    42    73     1
    36    50    -1
    66    73     1
    71    43     1
    33    62     1
    43    41    -1
    42    29    -1
    58    20    -1
    ];
X = data(:,1:2);
y = data(:,3);
C = 5;
% Compared with the linearly separable case, the quadratic program only gains the upper bound C on alpha, so the solution procedure is otherwise identical.
% Constructing quadratic coefficient matrix H
H = [];
for i =1:length(X)
    for j = 1:length(X)
        H(i,j) = X(i,:)*(X(j,:))'*y(i)*y(j);
    end
end
% Construct first-order term coefficient f
f = zeros(length(X),1)-1;
A = [];b = []; % Inequality constraint
Aeq = y';beq = 0; % Equality constraint
ub = ones(length(X),1).*C;lb = zeros(length(X),1); % Bounds on the variables: 0 <= alpha_i <= C
[x,fval] = quadprog(H,f,A,b,Aeq,beq,lb,ub);
% x contains the optimal alpha values, fval the objective value at the optimum
% Set very small components of x directly to 0 (numerical noise)
x(x < 1e-5) = 0;

% Recover the coefficient vector w from the solution x (the alpha values)
w = [0,0];
[a,~] = find(x~=0); % Indices of the support vectors (nonzero alpha); the first one is used to compute b
temp = 0; % Accumulates w*x_{a(1)}' = sum_i alpha_i*y_i*(x_i . x_{a(1)})
for i = 1:length(X)
    w = w + x(i)*y(i)*X(i,:);
    temp = temp + x(i)*y(i)*X(i,:)*X(a(1),:)';
end
% Calculate the bias b from the first support vector (for the soft margin this point should satisfy 0 < alpha < C, which point 4 does here)
b = y(a(1)) - temp;
disp(['Number of support vectors: ',num2str(length(a))])

% Visualization of data results
k = - w(1)/w(2); % Slope of the separating line in slope-intercept form
b_ = -b/w(2); % Intercept of the separating line
m = 0:2:90; % Sample x values for plotting
n = k*m+b_; % Separating hyperplane w.x + b = 0
plot(m,n,'--black')
n_2 = k*m+b_+1/w(2); % Margin boundary w.x + b = 1
hold on
plot(m,n_2,'--blue')
n_3 = k*m+b_-1/w(2); % Margin boundary w.x + b = -1
plot(m,n_3,'--red');
% Classify the data into two categories to facilitate drawing
X_1 = [];
X_2 = [];
for i =1:length(y)
    if y(i)==-1
        X_1 = [X_1;X(i,:)];
    else
        X_2 = [X_2;X(i,:)];
    end
end
% The HandleVisibility='off' parameter keeps these scatter plots out of the legend
scatter(X_1(:,1),X_1(:,2),'red','HandleVisibility','off')
hold on
scatter(X_2(:,1),X_2(:,2),'blue','HandleVisibility','off')
xlim([10,90])
grid on
title('Classification results of linear non separable support vector machine')
% Highlight the support vectors that lie on the margin boundary; which points are support vectors is determined from the nonzero alpha values
scatter(X([14,4],1),X([14,4],2),'r','filled','HandleVisibility','off')
scatter(X(34,1),X(34,2),'b','filled','HandleVisibility','off')
% Sample points 4, 14 and 34 fall on the margin boundary
% Calculate the slack variables xi
functional_margin = zeros(length(X),1);
slack = zeros(length(X),1);
for i=1:length(X)
    functional_margin(i) = y(i)*(w*X(i,:)'+b);
    if functional_margin(i)>1 % Functional margin greater than 1: the constraint is satisfied without slack
        slack(i) = 0; % No slack needed for this point
    else
        slack(i) = 1 - functional_margin(i); % The point needs slack: xi = 1 - functional margin
    end
end
% Draw misclassification points
scatter(X([12,26,28,47],1),X([12,26,28,47],2),'*','black')

% Points 18, 29, 35, 36 and 46 fall between the margin boundary and the separating hyperplane
% Points 12, 26, 28 and 47 are misclassified
legend('Dividing line','Category 1','Category 2','Misclassification point')
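
As a quick sanity check of the learned hyperplane, the misclassified points can also be counted directly from w and b (a minimal sketch, assuming w, b, X and y from the script above are still in the workspace):

% Predict labels with the learned hyperplane and count misclassifications
pred = sign(X*w' + b);      % w is 1x2, so X*w' gives the decision values
num_wrong = sum(pred ~= y); % points on the wrong side of the hyperplane
disp(['Number of misclassified points: ',num2str(num_wrong)])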

The result of running the script is as follows:

   As can be seen from the figure above, the definition of a support vector is considerably more involved for the linearly non-separable problem than for the linearly separable one, because a sample point can now be in many situations: correctly classified and lying outside the margin boundary, correctly classified and lying exactly on the margin boundary, correctly classified and lying between the hyperplane and the margin boundary, lying exactly on the hyperplane, or misclassified on the wrong side of the hyperplane. Of all these cases, only three correspond to support vectors: points on the margin boundary, points between the margin boundary and the hyperplane, and points on the other side of the hyperplane (i.e. misclassified points). Recall that in the linearly separable SVM the support vectors are exactly the sample points falling on the margin boundary, i.e. those with $\alpha_i>0$; in the linearly non-separable example there are many points with $\alpha_i>0$, and these need closer examination.
   The following discussion is based on the data in the code above: there are 50 sample points, half positive and half negative, and the penalty on the slack variables is set to $C=5$. We examine all sample points with $\alpha>0$ and compute the slack variable of each, and we find:

Support vector   α        ξ        Position of the point
4                4.6812   0        On the margin boundary
14               0.6572   0        On the margin boundary
34               0.3384   0        On the margin boundary
12               5        2.5292   Misclassified
26               5        1.2738   Misclassified
28               5        1.5477   Misclassified
47               5        1.2523   Misclassified
18               5        0.8554   Between hyperplane and margin boundary
29               5        0.9446   Between hyperplane and margin boundary
35               5        0.9046   Between hyperplane and margin boundary
36               5        0.0954   Between hyperplane and margin boundary
46               5        0.7262   Between hyperplane and margin boundary

   Apart from the 12 support vectors above, every other point is classified correctly. The table summarizes the experimental results, and from it we can observe the following pattern.
   The optimization problem constrains each multiplier to $0\le\alpha_i\le C$, where the value of $C$ is fixed at the start of the experiment (here $C=5$), and the slack variable of each point follows from the constraint $y_i(w\cdot x_i+b)\ge 1-\xi_i$. We find that (these cases can also be checked programmatically, as in the sketch after this list):

  • if $\alpha^*_i<C$ and $\xi_i=0$, the support vector $x_i$ falls exactly on the margin boundary;
  • if $\alpha^*_i=C$ and $0<\xi_i<1$, the point is classified correctly and the support vector $x_i$ falls between the margin boundary and the hyperplane;
  • if $\alpha^*_i=C$ and $\xi_i=1$, the support vector falls on the separating hyperplane;
  • if $\alpha^*_i=C$ and $\xi_i>1$, the support vector falls on the wrong side of the separating hyperplane and is a misclassified point.
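
   The rules above can be checked directly on the output of the script. The snippet below is a minimal sketch; it assumes the variables x (the alpha values), slack and C computed earlier are still in the workspace and uses a small tolerance for floating-point comparisons:

% Classify each support vector according to its alpha and slack values
tol = 1e-5;             % tolerance for floating-point comparisons
sv = find(x > tol);     % indices of the support vectors (alpha > 0)
for idx = sv'
    if x(idx) < C - tol
        pos = 'on the margin boundary';
    elseif slack(idx) < 1 - tol
        pos = 'between the margin boundary and the hyperplane';
    elseif abs(slack(idx) - 1) < tol
        pos = 'on the hyperplane';
    else
        pos = 'misclassified (wrong side of the hyperplane)';
    end
    fprintf('point %d: alpha = %.4f, xi = %.4f, %s\n', idx, x(idx), slack(idx), pos);
end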

   In fact, the rules above can be explained with the notions of functional and geometric margin. Earlier we derived the constraint $y_i(w\cdot x_i+b)\ge 1$, where the 1 means that the functional margin from a support vector to the hyperplane is 1. A sample point lying on the hyperplane satisfies $|w\cdot x+b|=0$, so once we introduce the slack variable $\xi$ and write the constraint as $y_i(w\cdot x_i+b)\ge 1-\xi_i$, a value of $\xi_i=1$ means the sample $(x_i,y_i)$ lies exactly on the hyperplane. This explains the distribution of the support vectors from the point of view of the functional margin. As for why $\alpha$ takes the value $C$ or a value smaller than $C$, that is best understood through the SMO algorithm; here it is enough to know that every point with $\alpha_i>0$ influences the solution, so the support vectors are no longer just the two points on the margin boundary, and the soft-margin and hard-margin cases need to be discussed separately.
   As mentioned earlier, the penalty parameter $C$ adjusts the trade-off between misclassification and maximizing the margin: the larger $C$ is, the more heavily the hyperplane is penalized for misclassified points and the harder the algorithm tries to reduce them; the smaller $C$ is, the less weight misclassified points carry. This is illustrated by the figure below.
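A minimal way to reproduce this comparison is to re-solve the quadratic program with a different value of C (a sketch, assuming H, f, Aeq, beq and lb from the script above are still in the workspace):

% Re-solve the dual with a much weaker penalty on the slack variables
C_small = 0.0001;
ub_small = ones(length(X),1).*C_small;    % only the upper bound on alpha changes
[x_small,~] = quadprog(H,f,[],[],Aeq,beq,lb,ub_small);
x_small(x_small < 1e-9) = 0;              % suppress numerical noise
w_small = (x_small.*y)'*X;                % recover the 1x2 weight vector
sv = find(x_small > 1e-9);                % support vectors of the new solution
b_small = y(sv(1)) - X(sv(1),:)*w_small'; % bias from the first support vector (ideally one with 0 < alpha < C)
num_wrong_small = sum(sign(X*w_small' + b_small) ~= y);
disp(['Misclassified points with C = ',num2str(C_small),': ',num2str(num_wrong_small)])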
When $C$ is set to 0.0001, the resulting hyperplane is as follows:

At this point the hyperplane misclassifies 5 sample points, more than the 4 points misclassified when $C=5$.

Keywords: Machine Learning AI
