[scikit-learn] K-nearest neighbors (KNN)

KNN

KNN (K-Nearest Neighbors) is one of the simplest machine learning algorithms and can be used for both classification and regression. It is a supervised learning algorithm. The idea is that if most of the K most similar samples of a given sample (i.e. its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category. In other words, the method determines the category of a sample to be classified based only on the categories of its one or several nearest samples.

KNN for classification

  1. Calculate the distance between the point to be classified and each point of known category
  2. Sort the points by increasing distance
  3. Select the K points closest to the point to be classified
  4. Count the number of occurrences of each category among these K points
  5. Return the category that occurs most often among the K points as the predicted class of the point to be classified
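
A minimal NumPy sketch of these five steps (the helper name knn_classify is illustrative, not from any library):

import numpy as np
from collections import Counter

def knn_classify(x, x_train, y_train, k=5):
    '''Predict the label of a single sample x with plain k-nearest neighbors.'''
    # 1. distance from x to every training point (Euclidean distance)
    dists = np.sqrt(np.sum((x_train - x) ** 2, axis=1))
    # 2.-3. indices of the k training points closest to x
    k_idx = np.argsort(dists)[:k]
    # 4.-5. majority vote among the labels of the k nearest points
    return Counter(y_train[k_idx]).most_common(1)[0][0]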

KNN regression prediction

In regression, the average method or the weighted average method is generally used.

Average method: every neighboring sample has the same weight, and the final prediction is the average of the target values of all neighboring samples.
Weighted average method: the weight is generally inversely proportional to the distance, so closer neighbors contribute more to the prediction.
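
As a short scikit-learn sketch of the two schemes (the one-dimensional toy data is made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

# average method: every neighbor gets the same weight
knn_uniform = KNeighborsRegressor(n_neighbors=3, weights='uniform').fit(X, y)
# weighted average method: weight inversely proportional to the distance
knn_distance = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X, y)

print(knn_uniform.predict([[2.5]]), knn_distance.predict([[2.5]]))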

Problems needing attention in KNN algorithm

Whether it is used for classification or regression, the KNN algorithm has several key elements:

  • The algorithm hyperparameter K
  • The distance metric: the distance between sample points in feature space reflects how similar they are
  • The classification decision rule, usually majority voting (the minority obeys the majority)

Algorithm hyperparameter K

If a smaller K value is selected, it is equivalent to making predictions with training instances in a smaller neighborhood. The approximation error of "learning" is reduced, because only training instances close to the input instance affect the prediction. The disadvantage is that the estimation error of "learning" increases: the prediction becomes very sensitive to the nearest instance points, and if a neighboring instance happens to be noise, the prediction will be wrong. In other words, a smaller K means a more complex overall model that is prone to overfitting.
If a larger K value is selected, it is equivalent to making predictions with training instances in a larger neighborhood. The advantage is that the estimation error of learning is reduced, but the disadvantage is that the approximation error increases: training instances far from the input instance also influence the prediction, which can make it wrong. A larger K means a simpler overall model that is prone to underfitting. In the extreme case K = N, whatever the input instance is, it is simply predicted to belong to the most frequent class in the training set; the model is then too simple and ignores a large amount of useful information in the training data.
In practice, cross-validation is usually used to select the optimal K. As the analysis above suggests, K is generally small: we search over a small range of K values and take the one with the highest accuracy on the validation set as the final hyperparameter K. A sketch of this selection procedure follows.
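
A minimal sketch of selecting K by cross-validation with scikit-learn (the iris data set and the search range 1-20 are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this K
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print('best K:', best_k, 'cv accuracy:', scores[best_k])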

Distance measurement

Three distances are commonly used: the Manhattan distance, the Euclidean distance and the Minkowski distance (the Manhattan and Euclidean distances are the special cases of the Minkowski distance with p = 1 and p = 2, respectively).
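
A small sketch computing the three distances for two vectors (the vectors are arbitrary example data):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x1 - x2))                    # Minkowski distance with p = 1
euclidean = np.sqrt(np.sum((x1 - x2) ** 2))            # Minkowski distance with p = 2
p = 3
minkowski = np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)  # general Minkowski distance

print(manhattan, euclidean, minkowski)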

Advantages and disadvantages of KNN

Advantages:
1) The algorithm is simple and the theory is mature; it can be used for both classification and regression.
2) It can be used for nonlinear classification.
3) There is no explicit training process: once the data set is loaded into memory, predictions can be made directly without training, so the training time complexity is essentially zero.
4) Because KNN mainly relies on a limited number of nearby samples rather than on a discriminant of class domains to determine the category, it is more suitable than other methods for sample sets whose class domains overlap or intersect heavily.
5) The algorithm is better suited to automatic classification of classes with a large sample size; classes with small sample sizes are more prone to misclassification.

Disadvantages:
1) The distance between each test point and every training point must be calculated. When the training set is large, the amount of computation is considerable and the time complexity is high, especially when the number of features is large.
2) It requires a lot of memory; the space complexity is high.
3) With imbalanced samples (some categories have many samples while others have few), the prediction accuracy for rare categories is low.
4) It is a lazy learning method that does essentially no learning, so prediction is slower than with algorithms such as logistic regression.

Hands-on example

'''
Dataset: MNIST
 Number of training samples: 60000
 Number of test samples: 10000 (only 10 are actually used; this can be changed)
'''

import time

import numpy as np
from torch.utils.data import DataLoader
from torchvision import datasets
import torchvision.transforms as transforms




class KNN:
    def __init__(self, x_train, y_train, x_test, y_test, k):
        '''
        Args:
            x_train [Array]: Training set data
            y_train [Array]: Training set label
            x_test [Array]: Test set data
            y_test [Array]: Test set label
            k [int]: k of kNN
        '''
        self.x_train, self.y_train = x_train, y_train
        self.x_test, self.y_test = x_test, y_test
        # Convert the input data into matrix form to facilitate operation
        self.x_train_mat, self.x_test_mat = np.mat(
            self.x_train), np.mat(self.x_test)
        self.y_train_mat, self.y_test_mat = np.mat(
            self.y_train).T, np.mat(self.y_test).T
        self.k = k

    def _calc_dist(self, x1, x2):
        '''Calculate the distance between two sample vectors; the Euclidean distance is used.
        :param x1: vector 1
        :param x2: vector 2
        :return: Euclidean distance between the vectors
        '''
        return np.sqrt(np.sum(np.square(x1 - x2)))

    def _get_k_nearest(self, x):
        '''
        Find the k training samples nearest to the sample x to be predicted.
        Method: compute the distance from x to every training sample, then take the
        indices of the k samples with the smallest distances (the choice of k affects
        the accuracy; see the discussion of the K value above).
        :param x: sample to be predicted
        :return: indices of the k nearest training samples
        '''
        # Initialize distance list; dist_list[i] is the distance between the sample x
        # to be predicted and the i-th sample in the training set
        dist_list = [0] * len(self.x_train_mat)

        # Traverse all samples in the training set and compute their distance to x
        for i in range(len(self.x_train_mat)):
            # Get the vector of the current training sample
            x0 = self.x_train_mat[i]
            # Compute the distance between x and the training sample x0
            dist_list[i] = self._calc_dist(x0, x)

        # Return the indices of the k nearest training samples
        # ----------------Possible optimization-------------------
        # Only the indices of the k smallest elements are needed, so sorting the whole
        # list (which is what argsort does) is wasteful; a partial sort such as
        # np.argpartition or heapq.nsmallest would be enough.
        # It is not optimized here because most of the time in KNN is spent computing
        # distances between high-dimensional vectors, so speeding up the sort would
        # make little difference.
        k_nearest_index = np.argsort(np.array(dist_list))[:self.k]  # ascending order
        return k_nearest_index


    def _predict_y(self, k_nearest_index):
        # label_list[i] counts how many of the k nearest samples have label i.
        # The labels are the digits 0-9, so a list of length 10 is enough.
        label_list = [0] * 10
        for index in k_nearest_index:
            # torchvision's MNIST provides integer class labels (not one-hot vectors)
            label = int(self.y_train[index])
            label_list[label] += 1
        # Majority vote: the label with the most samples among the k neighbors is the prediction
        y_predict = label_list.index(max(label_list))
        return y_predict

    def test(self, n_test=10):
        '''
        Test the accuracy
        :param n_test: number of test samples to evaluate
        :return: accuracy
        '''
        print('start test')

        # Error count
        error_count = 0
        # Traverse the test set and evaluate each test sample.
        # Computing the distances is expensive, so by default only n_test samples
        # (10 here) are evaluated; increase n_test to test more.
        for i in range(n_test):
            print('test %d:%d' % (i, n_test))
            # Read the vector of the current test sample
            x = self.x_test_mat[i]
            # Get the indices of the nearest training samples
            k_nearest_index = self._get_k_nearest(x)
            # Predicted label y
            y = self._predict_y(k_nearest_index)
            # If the predicted label differs from the true label, increase the error count
            if y != int(self.y_test[i]):
                error_count += 1
            print("accuracy=", 1 - (error_count / (i + 1)))

        # Return the accuracy
        return 1 - (error_count / n_test)


if __name__ == "__main__":

    k = 1
    start = time.time()

    train_dataset = datasets.MNIST(root='../datasets', train=True, transform=transforms.ToTensor(), download=True)
    test_dataset = datasets.MNIST(root='../datasets', train=False, transform=transforms.ToTensor(), download=True)

    # Load the whole training set in one batch and 1000 test samples in one batch
    train_loader = DataLoader(dataset=train_dataset, batch_size=60000, shuffle=True)
    test_loader = DataLoader(dataset=test_dataset, batch_size=1000, shuffle=True)
    x_train, y_train = next(iter(train_loader))
    x_test, y_test = next(iter(test_loader))
    # Flatten the 28x28 images into 784-dimensional vectors
    x_train = x_train.squeeze(1).view(60000, -1)
    x_test = x_test.squeeze(1).view(1000, -1)

    model = KNN(x_train, y_train, x_test, y_test, k)
    accur = model.test()
    end = time.time()
    print("total acc:", accur)
    print('time span:', end - start)

Sample output:

start test
test 0:10
accuracy= 1.0
test 1:10
accuracy= 1.0
test 2:10
accuracy= 1.0
test 3:10
accuracy= 1.0
test 4:10
accuracy= 1.0
test 5:10
accuracy= 1.0
test 6:10
accuracy= 1.0
test 7:10
accuracy= 1.0
test 8:10
accuracy= 1.0
test 9:10
accuracy= 1.0
total acc: 1.0
time span: 17.0129714012146

Using sklearn

  1. Split the data set into a training set and a test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
  2. Create a KNeighborsClassifier object
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
  3. Fit the KNeighborsClassifier object to build the model and obtain the classification accuracy
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
  4. Predict the test set with the model
y_predict = knn.predict(X_test)
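
Putting the four steps together on scikit-learn's built-in iris data set (the choice of data set is only an illustration, not from the original article):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. split into a training set and a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. create the classifier
knn = KNeighborsClassifier(n_neighbors=5)

# 3. fit and score
knn.fit(X_train, y_train)
print('accuracy:', knn.score(X_test, y_test))

# 4. predict the test set
y_predict = knn.predict(X_test)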

The constructor of KNN (scikit-learn's KNeighborsClassifier) has the following parameters and default values: (n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs).

[n_neighbors] That is, K. The default value is 5.

[weights] str parameter; sets the voting weight of each neighbor in order to reduce the influence of the K value, i.e. what proportion each sample with voting rights contributes to the vote. 'uniform' means equal-weight voting, 'distance' means voting with weight inversely proportional to the distance, and a callable is a user-defined function that receives an array of distances and returns an array of weights. The default is 'uniform'.

[algorithm] str parameter; the data structure used internally. The options are: 'ball_tree' (ball tree), 'kd_tree' (KD tree), 'brute' (brute-force search) and 'auto' (automatically choose a suitable algorithm according to the type and structure of the data). The default is 'auto'. Brute-force search needs no explanation. Which of the two tree structures is better depends on the situation. A KD tree is built by splitting at the median along each of the K coordinate axes in turn; every node is a hyperrectangle, and it is most efficient when the dimension is below about 20. The ball tree was invented to overcome the failure of the KD tree in high dimensions; it partitions the sample space using a centroid C and a radius r, and every node is a hypersphere. For typical low-dimensional data kd_tree is fast and ball_tree is relatively slow; for high-dimensional data beyond about 20 dimensions kd_tree performs poorly and ball_tree works better. The construction details and the theory behind their pros and cons can be studied further if you are interested.

[leaf_size] int parameter; for the tree algorithms above, this gives the leaf-node size of kd_tree or ball_tree. The leaf size affects the construction and query speed of the tree, as well as the memory needed to store it. The optimal value depends on the situation.

[metric] str parameter or a distance metric object; how the distance is measured, as described earlier. The default is 'minkowski'.

[p] int parameter; the order p of the Minkowski distance described above. The default is 2, i.e. the Euclidean distance; p=1 gives the Manhattan distance, and so on.

[metric_params] Additional keyword arguments for the distance metric function; usually left alone. The default is None.

[n_jobs] int parameter; the number of threads used for the parallel neighbor search. The default is 1, meaning a single thread; -1 means using as many threads as there are CPU cores, and any other number of threads can also be specified. If speed is not a concern here, this can be ignored; look into multithreading if you need it.
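
For illustration, a classifier constructed with explicit values for the parameters described above (the particular values are arbitrary choices, not recommendations):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=7,          # K
    weights='distance',     # vote with weight inversely proportional to distance
    algorithm='kd_tree',    # force a KD tree instead of letting 'auto' decide
    leaf_size=30,           # leaf-node size of the KD tree / ball tree
    p=1,                    # p=1 means the Manhattan distance
    metric='minkowski',
    n_jobs=-1,              # use all CPU cores for the neighbor search
)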
