Machine learning practice -- k-nearest neighbor algorithm (KNN)

2.1 overview of k-nearest neighbor algorithm

In short, k-nearest neighbor algorithm uses the method of measuring the distance between different eigenvalues for classification

  • Advantages: high precision, insensitive to outliers and no data input assumption
  • Disadvantages: high computational complexity and space complexity
  • Applicable data range: numerical type and nominal type
  • Numerical type: numerical type target variables can be taken from infinite numerical sets, such as 0.100, 42.001, etc. (numerical type target variables are mainly used for regression analysis)
  • Nominal type: the results of nominal type target variables only take values in limited target sets, such as true and false (nominal type target variables are mainly used for classification)

K-nearest neighbor (k-NN) is a basic classification and regression method proposed by Cover T and Hart P in 1967. Its working principle is: there is a sample data set, also known as the training sample set, and each data in the sample set has a label, that is, we know the corresponding relationship between each data in the sample set and its classification. After entering the new data without labels, each feature of the new data is compared with the corresponding feature of the data in the sample set, and then the algorithm extracts the classification label of the most similar data (nearest neighbor) of the sample. Generally speaking, we only select the first k most similar data in the sample data set, which is the source of K in the k-nearest neighbor algorithm. Generally, K is an integer not greater than 20. Finally, the most frequent classification among the k most similar data is selected as the classification of new data.

Example: using k-nearest neighbor algorithm to classify love movies and action movies
Figure 2-1 shows the number of fighting and kissing scenes in five action films. If there is a film that has not been seen, KNN can be used to determine whether it is a love film or an action film.

In the picture? It is a graphical display of the number of shots in the unknown film. See table 2-1 for specific figures

By calculating the distance between the unknown movie and other movies in the sample set, its type can be judged.

According to the order of increasing distance, you can find k nearest movies. Assuming k=3, all the nearest movies are love movies, which can be judged as love movies.

General process of k-nearest neighbor algorithm:

  • 1. Calculate the distance between the point in the known category data set and the current point;
  • 2. Sort according to the order of increasing distance;
  • 3. Select the k points with the smallest distance from the current point;
  • 4. Determine the occurrence frequency of the category where the first k points are located;
  • 5. Return the category with the highest frequency of the first k points as the prediction classification of the current point

Distance measurement:

2.1.1 preparation: import data using python

from numpy import *
import operator  # Import operator module

# Functions that create datasets and labels
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


if __name__ == '__main__':
    group, labels = createDataSet()

    print(group)
    print(labels)

result:

[[ 1.   1.1]
 [ 1.   1. ]
 [ 0.   0. ]
 [ 0.   0.1]] 
 ['A', 'A', 'B', 'B']

2.1.2 parsing data from text files
The k-nearest neighbor algorithm is used to divide each group of data into a class, and its pseudo code is as follows:

Perform the following operations for each point in the dataset of unknown category attributes in turn:

Calculates the distance between a point in a known category dataset and the current point
Sort by increasing distance
Select the k points with the smallest distance from the current point
Determine the frequency of occurrence of the category where the first k points are located
Return the category with the highest frequency of the first k points as the prediction classification of the current point

import numpy as np
import operator

"""
Function Description: KNN Algorithm, text classification

Parameters: 
    inX: Data for classification (test set)
    dataSet: Data for training (training set)
    labels: Classification label
    k: kNN Algorithm parameters, select the algorithm with the smallest distance k Point
    
returns: 
    Classification results

modify:
    2018-03-08
    
"""

def classify0(inX,dataSet,labels,k):
    #numpy function shape[0] returns the number of rows of dataSet
    dataSetSize=dataSet.shape[0]
    #Repeat inX once (horizontally) in the direction of column vector and inX times (vertically) in the direction of row vector
    #numpy.tile([0,0],(1,1))#Repeat [0,0] once in the column direction and once in the row > > > array ([[0,0]])
    diffMat=np.tile(inX,(dataSetSize,1))-dataSet
    #Square of two-dimensional feature subtraction
    sqDiffMat=diffMat**2
    #sum() adds all elements, sum(0) columns and sum(1) rows
    sqDistances=sqDiffMat.sum(axis=1)
    #Square off and calculate the distance
    distances=sqDistances**0.5
    
    #Returns the index value of the elements in distances sorted from small to large
    sortedDistIndices=distances.argsort()
    #Define a dictionary that records the number of times of a category
    classCount={}

    for i in range(k):
        #Take out the category of the first k elements
        voteIlabel=labels[sortedDistIndices[i]]
        #Calculate the number of categories
        # dict.get(key,default=None), the get() method of the dictionary, returns the value of the specified key. If the value is not in the dictionary, it returns the default value.
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1

    #key=operator.items(1) sort by dictionary value
    #key=operator.items(0) is sorted according to the key value of the dictionary
    #reverse descending sort dictionary
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    #The category that returns the most times, that is, the category to be classified
    return sortedClassCount[0][0]


Among them, the number of elements of the label vector is the same as the number of rows of the matrix dataSet. In the above program, the Euclidean distance formula is used to calculate the distance between two vector points.

Example of overall program:

result:
```python
import numpy as np
import operator

"""
Function Description: KNN Algorithm, text classification

Parameters: 
    inX: Data for classification (test set)
    dataSet: Data for training (training set)
    labels: Classification label
    k: kNN Algorithm parameters, select the algorithm with the smallest distance k Point
    
returns: 
    Classification results

modify:
    2018-03-08
    
"""
def createDataSet():
    group = np.array([[1, 101], [5, 89], [108, 0], [108, 5]])
    labels = ['affectional film','affectional film', 'action movie', 'action movie']
    return group, labels

def classify0(inX,dataSet,labels,k):
    #numpy function shape[0] returns the number of rows of dataSet
    dataSetSize=dataSet.shape[0]
    #Repeat inX once (horizontally) in the direction of column vector and inX times (vertically) in the direction of row vector
    #numpy.tile([0,0],(1,1))#Repeat [0,0] once in the column direction and once in the row > > > array ([[0,0]])
    diffMat=np.tile(inX,(dataSetSize,1))-dataSet
    #Square of two-dimensional feature subtraction
    sqDiffMat=diffMat**2
    #sum() adds all elements, sum(0) columns and sum(1) rows
    sqDistances=sqDiffMat.sum(axis=1)
    #Square off and calculate the distance
    distances=sqDistances**0.5
    #Returns the index value of the elements in distances sorted from small to large

    sortedDistIndices=distances.argsort()
    #Define a dictionary that records the number of times of a category
    classCount={}
    for i in range(k):
        #Take out the category of the first k elements
        voteIlabel=labels[sortedDistIndices[i]]
        #Calculate the number of categories
        # dict.get(key,default=None), the get() method of the dictionary, returns the value of the specified key. If the value is not in the dictionary, it returns the default value.
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1

    #key=operator.items(1) sort by dictionary value
    #key=operator.items(0) is sorted according to the key value of the dictionary
    #reverse descending sort dictionary
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    #The category that returns the most times, that is, the category to be classified
    return sortedClassCount[0][0]

if __name__ == '__main__':
    #Create dataset
    group, labels = createDataSet()
    #Test set
    test = [101,20]
    #kNN classification
    test_class = classify0(test, group, labels, 3)
    #Print classification results
    print(test_class)

Output:

action movie

Seeing this, someone may ask, "under what circumstances can the classifier make mistakes?" Or "is the answer always right?" The answer is No

The classifier will not get 100% correct results. We can use a variety of methods to detect the accuracy of the classifier. In addition, the performance of classifier will also be affected by many factors, such as classifier setting and data set. The performance of different algorithms on different data sets may be completely different.

In order to test the effect of the classifier, we can use the data with known answers. Of course, the answers cannot be told to the classifier to test whether the results given by the classifier meet the expected results. Through a large number of test data, we can get the error rate of the classifier - the number of times the classifier gives wrong results divided by the total number of test executions.

Error rate is a common evaluation method, which is mainly used to evaluate the execution effect of classifier on a data set. The error rate of the perfect classifier is 0, and the error rate of the worst classifier is 1.0. At the same time, it is not difficult for us to find that the k-nearest neighbor algorithm does not train the data, and directly uses the unknown data to compare with the known data to get the results. Therefore, it can be said that the K-neighbor algorithm does not have an explicit learning process.

Keywords: Machine Learning

Added by jd57 on Tue, 22 Feb 2022 14:56:36 +0200