[Machine Learning in Action, Task 1] Application of the k-nearest neighbor (kNN) algorithm

1 Background

1.1 Overview of the k-nearest neighbor algorithm

(1) Introduction to the k-nearest neighbor algorithm

The k-nearest neighbor algorithm is an effective and easy-to-master machine learning algorithm. In short, it classifies data by measuring the distances between the feature values of different samples.

(2) Working principle of the k-nearest neighbor algorithm

We are given a set of labeled samples, called the training set. For a newly input, unlabeled data point, we compute the distance between the new point and every sample, select the k nearest samples (usually k < 20), and assign the new point the label that appears most frequently among those k neighbors.

(3) A k-nearest neighbor algorithm example

Suppose we have counted the number of fighting and kissing scenes in 6 films. Given a film we have not seen, how do we determine whether it is a romance film or an action film?

Movie title                  Fighting scenes  Kissing scenes  Film type
California Man               3                104             romance film
He's Not Really into Dudes   2                100             romance film
Beautiful Woman              1                81              romance film
Kevin Longblade              101              10              action movie
Robo Slayer 3000             99               5               action movie
Amped II                     98               2               action movie
?                            18               90              unknown

According to the principle of the kNN algorithm, we can compute the distance between the unknown movie and each known movie (the Euclidean distance is used here).

Take California Man as an example

>>>((3-18)**2+(104-90)**2)**(1/2)
20.518284528683193
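
The same calculation can be applied to all six films at once. The short sketch below (using NumPy, with the scene counts from the table above) reproduces, up to rounding, the distance column that follows:

import numpy as np

# (fighting scenes, kissing scenes) per film, from the table above
films = {
    "California Man": (3, 104),
    "He's Not Really into Dudes": (2, 100),
    "Beautiful Woman": (1, 81),
    "Kevin Longblade": (101, 10),
    "Robo Slayer 3000": (99, 5),
    "Amped II": (98, 2),
}
unknown = np.array([18, 90])                         # the unknown film
for title, counts in films.items():
    d = np.linalg.norm(np.array(counts) - unknown)   # Euclidean distance
    print("%s: %.1f" % (title, d))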
Movie title                  Distance from the unknown movie
California Man               20.5
He's Not Really into Dudes   18.7
Beautiful Woman              19.2
Kevin Longblade              115.3
Robo Slayer 3000             117.4
Amped II                     118.9

Therefore, we can find the k nearest films in the sample. Assuming k = 3, the three nearest films are all romance films, so we conclude that the unknown film is a romance film.

1.2 Implementing the k-nearest neighbor algorithm in Python

(1) Calculate the distance between each point in the known-category dataset and the current point

(2) Sort the distances in increasing order

(3) Select the k points closest to the current point

(4) Count the frequency of each category among those k points

(5) Return the most frequent category among the k points as the predicted class of the current point

import numpy as np
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # (1) Euclidean distance from inX to every sample in dataSet
    diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # (2) indices of the samples, sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # (3)(4) count the votes of the k nearest labels
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # (5) return the label with the most votes
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

(6) Case

>>>group = np.array([[1, 1.1],
...                 [1, 1],
...                 [0, 0],
...                 [0, 0.1]])
>>>labels = ['A', 'A', 'B', 'B']
>>>classify0([0,0], group, labels, 3)
'B'

1.3 How to test a classifier

To evaluate a classifier, we usually compute its error rate: the number of misclassified samples divided by the total number of classifications. A perfect classifier has an error rate of 0; the worst possible classifier has an error rate of 1.
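
In code, the error rate is just a division; a minimal sketch with hypothetical labels:

predicted = ['A', 'B', 'B', 'A']   # what the classifier returned
actual    = ['A', 'B', 'A', 'A']   # the true labels
errorRate = sum(p != a for p, a in zip(predicted, actual)) / len(actual)
print(errorRate)   # 0.25: one error out of four classifications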

2 Using the kNN algorithm to improve matches on a dating website

2.1 Case introduction

My friend Helen uses a dating website to find dates. Although the site recommends different candidates, she does not like everyone. She sorts the people she meets into three categories: people she doesn't like, people with ordinary charm, and people with great charm. Even with these categories, Helen still cannot sort the site's recommendations appropriately, so she hopes our classifier can assign matched candidates to the correct category for her.

2.2 Data preparation

There are two channels for downloading the dataset:

the official Python 2 code for Machine Learning in Action

the Python 3 version of the code from 202xxx on GitHub

The data is stored in datingTestSet2.txt. Each sample occupies one row, 1000 rows in total, with the following three features:

frequent flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week
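
Judging from the arrays printed below and the split('\t') call in the reader, each row of the file is tab-separated, with the three feature columns followed by the class label; the first row, for example, is:

40920	8.326976	0.953952	3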

Before the data is fed to the classifier, it must be converted into a format the classifier can work with:

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())         #get the number of lines in the file
    returnMat = np.zeros((numberOfLines,3))        #prepare matrix to return
    classLabelVector = []                       #prepare labels return   
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector
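
Assuming datingTestSet2.txt sits in the working directory, the function is called like this:

>>>datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')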

The feature data (datingDataMat) read with file2matrix looks like this:

array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
        [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
        [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
        ...,
        [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
        [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
[4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])

The labels (datingLabels) look like this:

[3,2,1,1,1,1,3,3,...,3,3,3]

2.3 Analyzing the data: creating scatter plots with Matplotlib

(1) Relationship between the percentage of time spent playing video games and liters of ice cream consumed per week

import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
# column 1: time playing video games (%), column 2: liters of ice cream
ax.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*np.array(datingLabels), 15.0*np.array(datingLabels))
plt.show()

Here the y-axis is the liters of ice cream consumed per week and the x-axis is the percentage of time spent playing video games.

Purple means dislike, green means ordinary charm, and yellow means great charm.

(2) Relationship between frequent flyer miles and the percentage of time spent playing video games

import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
# column 0: frequent flyer miles, column 1: time playing video games (%)
ax.scatter(datingDataMat[:,0], datingDataMat[:,1], 15.0*np.array(datingLabels), 15.0*np.array(datingLabels))
plt.show()

Here the y-axis is the percentage of time spent playing video games and the x-axis is the frequent flyer miles.

Purple means dislike, green means ordinary charm, and yellow means great charm.

(3) Relationship between frequent flyer miles and liters of ice cream consumed per week

import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
# column 0: frequent flyer miles, column 2: liters of ice cream
ax.scatter(datingDataMat[:,0], datingDataMat[:,2], 15.0*np.array(datingLabels), 15.0*np.array(datingLabels))
plt.show()

Here the y-axis is the liters of ice cream consumed per week and the x-axis is the frequent flyer miles.

Purple means dislike, green means ordinary charm, and yellow means great charm.

2.4 Preparing the data: normalizing values

When the distance between samples is computed with the Euclidean distance, the frequent flyer miles are numerically huge and dominate the result, far outweighing the other two features. Yet as one of three features that should carry equal weight, frequent flyer miles should not affect the result so strongly. Consider the four samples below:

Sample  Time playing video games (%)  Frequent flyer miles  Ice cream consumed (liters/week)  Class
1       0.8                           400                   0.5                               1
2       12                            134000                0.9                               3
3       0                             20000                 1.1                               2
4       67                            32000                 0.1                               2

The distance between samples 3 and 4, for example, is almost entirely determined by the miles column:

>>>((0-67)**2 + (20000-32000)**2 + (1.1-0.1)**2)**0.5

Usually, when dealing with features that have different value ranges, we use normalization to map the feature values to [0, 1] or [-1, 1]. Here each feature is normalized as (value - column minimum) / (column maximum - column minimum):

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m,1))
    normDataSet = normDataSet/np.tile(ranges, (m,1))   #element wise divide
    return normDataSet, ranges, minVals
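
Assuming datingDataMat has been loaded as above, normalization is a single call. Note that ranges and minVals are returned along with the matrix: any new input must later be normalized with the same parameters (this is exactly what classifyPerson does in section 2.6):

>>>normMat, ranges, minVals = autoNorm(datingDataMat)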

2.5 Testing the algorithm: verifying the classifier as a complete program

Evaluating accuracy is an important step for any machine learning algorithm. Usually we use only 90% of the samples to train the classifier and hold out the remaining 10% to test it. To keep the test unbiased, that 10% should be selected at random.

(1) Use the file2matrix function to import the data samples

(2) Normalize the data with autoNorm

(3) Use the first 10% of the samples as the test set and the remaining 90% as the training set for classify0

(4) Output the error rate on the test set

def datingClassTest():
    hoRatio = 0.10      #hold out 10% of the data for testing
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')       #load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print ("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print ("the total error rate is: %f" % (errorCount/float(numTestVecs)))
    print (errorCount)

The classifier's error rate on the dating dataset comes out at 2.4%, which is a good result. We can likewise vary hoRatio and k to see how the error rate changes with them.
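
One convenient way to do this is to wrap the test in a function. The sketch below (datingErrorRate is a name introduced here; it assumes the file2matrix, autoNorm, and classify0 defined above and datingTestSet2.txt in the working directory) exposes hoRatio and k as parameters:

def datingErrorRate(hoRatio=0.10, k=3):
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0
    for i in range(numTestVecs):
        result = classify0(normMat[i,:], normMat[numTestVecs:m,:],
                           datingLabels[numTestVecs:m], k)
        if result != datingLabels[i]:
            errorCount += 1
    return errorCount/float(numTestVecs)

# how does the error rate respond to k?
for k in (1, 3, 5, 7):
    print(k, datingErrorRate(k=k))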

2.6 Using the algorithm: building a complete, usable system

With the pieces above we can build a program for Helen: she finds someone's information on the dating website, enters it into the program, and the program predicts how much she will like the person: not at all, in small doses, or in large doses.

import numpy as np
import operator

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())         #get the number of lines in the file
    returnMat = np.zeros((numberOfLines,3))        #prepare matrix to return
    classLabelVector = []                       #prepare labels return   
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m,1))
    normDataSet = normDataSet/np.tile(ranges, (m,1))   #element wise divide
    return normDataSet, ranges, minVals

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()     
    classCount={}          
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def classifyPerson():
    resultList = ["not at all", "in small doses", "in large doses"]
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat,datingLabels = file2matrix('knn/datingTestSet2.txt')       #load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])   # same column order as the data file: miles, games, ice cream
    classifierResult = classify0((inArr-minVals)/ranges, normMat, datingLabels, 3)
    print ("You will probably like this person:", resultList[classifierResult-1])

if __name__ == "__main__":
    classifyPerson()    # sample input: 10, 10000, 0.5

Enter test data:

percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses

3 Building a handwriting recognition system with the kNN algorithm

3.1 Case introduction

Taking the classification of the digits 0-9 as an example, the following case briefly shows how to use the k-nearest neighbor algorithm to recognize handwritten digits.

Generally, handwritten digits arrive as images, and we need to convert each image into structured data that the kNN algorithm can use; in short, we read the image's pixels. Pixel values usually lie between 0 and 255, where 0 is black and 255 is white. Since the handwriting is the dark area, pixels below a chosen threshold (the strokes) are marked as 1 and the rest as 0. The handwritten digit 1 can then be represented by a grid like the following (shown here at reduced size; the actual files are 32x32):

0000000000
0000111000
0000111000
0000110000
0000110000
0000110000
0000110000
0000110000
0001111000
0000000000
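
The binarization itself is a one-line comparison. A hypothetical sketch (the img array stands in for a grayscale image; the threshold 128 is illustrative):

import numpy as np

img = np.random.randint(0, 256, size=(32, 32))   # placeholder 32x32 grayscale image
binary = (img < 128).astype(int)                 # dark (ink) pixels -> 1, light -> 0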

3.2 Preparing the data: converting images into vectors

There are two channels for downloading the dataset:

the official Python 2 code for Machine Learning in Action

the Python 3 version of the code from 202xxx on GitHub

The dataset is stored in digits.zip, where 1 represents the handwritten area and 0 represents the blank area.


The data is read with the img2vector function, which returns a 1x1024 array:

def img2vector(filename):
    returnVect = np.zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect
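
Assuming digits.zip has been unpacked so that a sample such as testDigits/0_13.txt is available, the function can be spot-checked in the interpreter:

>>>testVector = img2vector('testDigits/0_13.txt')
>>>testVector[0, 0:31]   # the first 31 values of the 1x1024 vector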

3.3 test the algorithm and use kNN to recognize handwritten digits

(1) Use listdir to read all files in the trainingDigits directory as training data

(2) Use listdir to read all files in the testDigits directory as test data

(3) Feed the training data and test data into the kNN algorithm

from os import listdir    # listdir is used below to enumerate the digit files

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')           #load the training set
    m = len(trainingFileList)
    trainingMat = np.zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')        #iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print ("the classifier came back with: %d, the real answer is: %d"% (classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print ("\nthe total number of errors is: %d" % errorCount)
    print ("\nthe total error rate is: %f" % (errorCount/float(mTest)))

Running the test gives an error rate of 1.1628%; the error rate changes as the value of k and the training samples are varied. Sample output:

the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3

the total number of errors is: 11

the total error rate is: 0.011628

4 Summary

4.1 Advantages and disadvantages of the k-nearest neighbor algorithm

(1) Advantages: high accuracy, insensitive to outliers, no assumptions about the input data

(2) Disadvantages: high computational and space complexity

Applicable data types: numeric and nominal values

4.2 General workflow of the k-nearest neighbor algorithm

(1) Collect data: any method can be used

(2) Prepare data: the values needed for distance calculations, preferably in a structured data format

(3) Analyze data: any method can be used

(4) Train the algorithm: this step does not apply to the k-nearest neighbor algorithm

(5) Test the algorithm: calculate the error rate

(6) Use the algorithm: input sample data and get structured output, run the k-nearest neighbor algorithm to determine which class the input belongs to, then perform whatever follow-up processing the application needs on that classification

4.3 Points to note when using the k-nearest neighbor algorithm

(1) When the feature scales are not uniform, the data must be normalized; otherwise features with large values swamp those with small ones

(2) The distance between samples is usually the Euclidean distance

(3) The choice of k in kNN has a large effect on the result; generally k should be smaller than the square root of the number of training samples

(4) Cross-validation is usually used to select the optimal k, as sketched below
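
A minimal sketch of such a selection (crossValidateK is a name introduced here; it assumes classify0 from above and simply splits the data into nFolds consecutive folds):

def crossValidateK(dataSet, labels, kCandidates, nFolds=5):
    m = dataSet.shape[0]
    foldSize = m // nFolds
    bestK, bestErr = None, float('inf')
    for k in kCandidates:
        errors = 0
        for f in range(nFolds):
            # hold out one fold, train on the rest
            testIdx = set(range(f*foldSize, (f+1)*foldSize))
            trainIdx = [i for i in range(m) if i not in testIdx]
            trainLabels = [labels[i] for i in trainIdx]
            for i in testIdx:
                pred = classify0(dataSet[i,:], dataSet[trainIdx,:], trainLabels, k)
                if pred != labels[i]:
                    errors += 1
        err = errors/float(nFolds*foldSize)
        if err < bestErr:
            bestK, bestErr = k, err
    return bestK, bestErr

For example, crossValidateK(normMat, datingLabels, (1, 3, 5, 7)) would return the k with the lowest average error on the dating data.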

5 Reference

Machine Learning in Action (Peter Harrington)
