# K-nearest neighbor algorithm

The k-nearest neighbor algorithm (KNN), proposed by Cover and Hart in 1967, is a basic classification and regression method. It works as follows: we start with a training sample set in which every data point carries a label, so the mapping between each sample and its class is known. When unlabeled data arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm retrieves the labels of the most similar samples. In practice only the k most similar samples are considered, which is where the "k" in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the most frequent class among these k neighbors is assigned to the new data point.

## 1. Classifying a movie as an action movie or a romance film

The training sample set has two features: the number of fighting scenes and the number of kissing scenes. We also know the type of each film, that is, its classification label. Similarity is measured with the two-point distance formula from high-school geometry (the Euclidean distance). The steps of the k-nearest neighbor algorithm are:
1. Compute the distance between every point in the labeled dataset and the current point
2. Sort the points by increasing distance
3. Select the k points closest to the current point
4. Count the frequency of each category among these k points
5. Return the most frequent category as the predicted class of the current point
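The five steps can be sketched compactly with NumPy and `collections.Counter` (a condensed, illustrative variant; the tutorial's own `classify0` function implements the same logic step by step):

```python
import numpy as np
from collections import Counter

def knn_predict(x, data, labels, k):
    # Step 1: Euclidean distance from x to every training point
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    # Steps 2-3: indices of the k smallest distances
    nearest = dists.argsort()[:k]
    # Steps 4-5: majority vote among the labels of the k nearest points
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
labels = ['affectional film', 'affectional film', 'action movie', 'action movie']
print(knn_predict(np.array([101, 20]), group, labels, 3))  # action movie
```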

### 1.1 Data preparation

```python
import numpy as np

def createDataSet():
    # Four samples with two features each: [fighting scenes, kissing scenes]
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # Labels for the four samples
    labels = ['affectional film', 'affectional film', 'action movie', 'action movie']
    return group, labels

if __name__ == "__main__":
    group, labels = createDataSet()
    print(group)
    print(labels)
```

### 1.2 Computing distances

Using the two-point distance formula, compute the distance from the test point to every sample, select the k points with the smallest distances, and return the majority class:

```python
# -*- coding: UTF-8 -*-
import numpy as np
import operator

def classify0(inX, dataSet, labels, k):
    """
    kNN classifier.

    Parameters:
        inX - the sample to classify (test sample)
        dataSet - training data
        labels - class labels of the training data
        k - number of nearest neighbors to vote
    Returns:
        the predicted class label
    Modify:
        2017-07-13
    """
    # shape[0] gives the number of rows (samples) in dataSet
    dataSetSize = dataSet.shape[0]
    # Tile inX into dataSetSize identical rows and subtract the training set
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square the per-feature differences
    sqDiffMat = diffMat**2
    # sum(axis=1) sums along each row
    sqDistances = sqDiffMat.sum(axis=1)
    # Square root gives the Euclidean distances
    distances = sqDistances**0.5
    # Indices that sort the distances in ascending order
    sortedDistIndices = distances.argsort()
    # Dictionary counting the votes per category
    classCount = {}
    for i in range(k):
        # Label of the i-th nearest neighbor
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default) returns the value for key, or default if absent
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the (label, count) pairs by count in descending order
    # (items() replaces the Python 2 iteritems())
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # The most frequent category is the predicted class
    return sortedClassCount[0][0]
```

### 1.3 Complete code

```python
# -*- coding: UTF-8 -*-
import numpy as np
import operator

def createDataSet():
    """
    Create the dataset.

    Returns:
        group - feature matrix
        labels - class labels
    Modify:
        2017-07-13
    """
    # Four samples with two features each
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # Labels for the four samples
    labels = ['affectional film', 'affectional film', 'action movie', 'action movie']
    return group, labels

def classify0(inX, dataSet, labels, k):
    """
    kNN classifier.

    Parameters:
        inX - the sample to classify (test sample)
        dataSet - training data
        labels - class labels of the training data
        k - number of nearest neighbors to vote
    Returns:
        the predicted class label
    Modify:
        2017-07-13
    """
    # shape[0] gives the number of rows (samples) in dataSet
    dataSetSize = dataSet.shape[0]
    # Tile inX into dataSetSize identical rows and subtract the training set
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square the per-feature differences
    sqDiffMat = diffMat**2
    # sum(axis=1) sums along each row
    sqDistances = sqDiffMat.sum(axis=1)
    # Square root gives the Euclidean distances
    distances = sqDistances**0.5
    # Indices that sort the distances in ascending order
    sortedDistIndices = distances.argsort()
    # Dictionary counting the votes per category
    classCount = {}
    for i in range(k):
        # Label of the i-th nearest neighbor
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default) returns the value for key, or default if absent
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the (label, count) pairs by count in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # The most frequent category is the predicted class
    return sortedClassCount[0][0]

if __name__ == '__main__':
    # Create the dataset
    group, labels = createDataSet()
    # Test sample
    test = [101, 20]
    # kNN classification
    test_class = classify0(test, group, labels, 3)
    # Print the predicted class
    print(test_class)
```

Running the code prints the predicted class (`action movie`) for the test point.

## 2. Using the k-nearest neighbor algorithm to predict matches on a dating website

General process of the k-nearest neighbor algorithm:
1. Collect data: via a crawler, or from free or paid third-party sources. The data is typically stored in a txt file in a fixed format for easy parsing.
2. Prepare data: parse and preprocess the data with Python.
3. Analyze data: visualize it, for example with Matplotlib.
4. Test the algorithm: compute the error rate.
5. Use the algorithm: if the error rate is acceptable, run the k-nearest neighbor classifier.

### 2.1 Case background

Helen looks for dates on an online dating website. The site recommends candidates, but she does not like all of them. She groups the people she has met into three categories:
- didntLike
- smallDoses
- largeDoses

The data is stored in the file datingTestSet.txt; each sample occupies one line, and there are 1000 lines in total. Each sample has three features:
- Frequent flyer miles earned per year
- Percentage of time spent playing video games
- Liters of ice cream consumed per week

### 2.2 Data preparation: parsing the data

Before feeding the features to the classifier, the data must be converted into a format the classifier accepts: a feature matrix and a corresponding vector of class labels.
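For instance, using two illustrative rows in the file's tab-separated layout (three numeric feature columns followed by a text label), the conversion looks like this:

```python
import numpy as np

# Two illustrative lines in the datingTestSet.txt format
raw = ["40920\t8.326976\t0.953952\tlargeDoses",
       "14488\t7.153469\t1.673904\tsmallDoses"]

label_map = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}
features = np.zeros((len(raw), 3))
labels = []
for i, line in enumerate(raw):
    parts = line.strip().split('\t')
    features[i, :] = parts[0:3]      # NumPy converts the strings to floats
    labels.append(label_map[parts[-1]])

print(features.shape)   # (2, 3)
print(labels)           # [3, 2]
```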

```python
# -*- coding: UTF-8 -*-
import numpy as np

def file2matrix(filename):
    """
    Open and parse the file; encode the labels as 1 (didn't like),
    2 (liked a little), 3 (liked a lot).

    Parameters:
        filename - name of the data file
    Returns:
        returnMat - feature matrix
        classLabelVector - vector of class labels
    Modify:
        2017-03-24
    """
    # Open the file and read all lines
    fr = open(filename)
    arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # NumPy matrix for the parsed data: numberOfLines rows, 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Class label vector to return
    classLabelVector = []
    # Row index
    index = 0
    for line in arrayOLines:
        # strip() removes leading/trailing whitespace, including '\n', '\r', '\t'
        line = line.strip()
        # Split the line on the tab separator
        listFromLine = line.split('\t')
        # The first three columns are the features; store them in returnMat
        returnMat[index, :] = listFromLine[0:3]
        # Encode the label: 1 = didn't like, 2 = liked a little, 3 = liked a lot
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

if __name__ == '__main__':
    # Data file name
    filename = "datingTestSet.txt"
    # Open and parse the file
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)
```

The resulting feature matrix and label vector can now be used in the following steps.

### 2.3 Analyzing the data: visualization

Write a showdatas function to visualize the data. The code is as follows:

```python
# -*- coding: UTF-8 -*-
from matplotlib.font_manager import FontProperties
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import numpy as np

def file2matrix(filename):
    """
    Open and parse the file; encode the labels as 1 (didn't like),
    2 (liked a little), 3 (liked a lot).

    Parameters:
        filename - name of the data file
    Returns:
        returnMat - feature matrix
        classLabelVector - vector of class labels
    Modify:
        2017-03-24
    """
    # Open the file and read all lines
    fr = open(filename)
    arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # NumPy matrix for the parsed data: numberOfLines rows, 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Class label vector to return
    classLabelVector = []
    # Row index
    index = 0
    for line in arrayOLines:
        # strip() removes leading/trailing whitespace
        line = line.strip()
        # Split the line on the tab separator
        listFromLine = line.split('\t')
        # The first three columns are the features
        returnMat[index, :] = listFromLine[0:3]
        # Encode the label
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

def showdatas(datingDataMat, datingLabels):
    """
    Visualize the data.

    Parameters:
        datingDataMat - feature matrix
        datingLabels - class labels
    Modify:
        2017-03-24
    """
    # Font for the labels (SimSun, shipped with Windows)
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
    # Split the figure into a 2x2 grid of subplots; with nrows=2 and ncols=2,
    # axs[0][0] is the first area in the first row. Axes are not shared;
    # the figure size is (13, 8)
    fig, axs = plt.subplots(nrows=2, ncols=2, sharex=False, sharey=False, figsize=(13, 8))

    numberOfLabels = len(datingLabels)
    # One color per sample, keyed by its label
    LabelsColors = []
    for i in datingLabels:
        if i == 1:
            LabelsColors.append('black')
        if i == 2:
            LabelsColors.append('orange')
        if i == 3:
            LabelsColors.append('red')
    # Scatter plot of column 0 (frequent flyer miles) vs column 1 (game time);
    # point size 15, transparency 0.5
    axs[0][0].scatter(x=datingDataMat[:, 0], y=datingDataMat[:, 1], color=LabelsColors, s=15, alpha=.5)
    # Set title, x-axis label, y-axis label
    axs0_title_text = axs[0][0].set_title(u'Frequent flyer miles earned per year vs. percentage of time spent playing video games', fontproperties=font)
    axs0_xlabel_text = axs[0][0].set_xlabel(u'Frequent flyer miles per year', fontproperties=font)
    axs0_ylabel_text = axs[0][0].set_ylabel(u'Percentage of time spent playing video games', fontproperties=font)
    plt.setp(axs0_title_text, size=9, weight='bold', color='red')
    plt.setp(axs0_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=7, weight='bold', color='black')

    # Scatter plot of column 0 (frequent flyer miles) vs column 2 (ice cream)
    axs[0][1].scatter(x=datingDataMat[:, 0], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    # Set title, x-axis label, y-axis label
    axs1_title_text = axs[0][1].set_title(u'Frequent flyer miles earned per year vs. liters of ice cream consumed per week', fontproperties=font)
    axs1_xlabel_text = axs[0][1].set_xlabel(u'Frequent flyer miles per year', fontproperties=font)
    axs1_ylabel_text = axs[0][1].set_ylabel(u'Liters of ice cream consumed per week', fontproperties=font)
    plt.setp(axs1_title_text, size=9, weight='bold', color='red')
    plt.setp(axs1_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs1_ylabel_text, size=7, weight='bold', color='black')

    # Scatter plot of column 1 (game time) vs column 2 (ice cream)
    axs[1][0].scatter(x=datingDataMat[:, 1], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    # Set title, x-axis label, y-axis label
    axs2_title_text = axs[1][0].set_title(u'Percentage of time spent playing video games vs. liters of ice cream consumed per week', fontproperties=font)
    axs2_xlabel_text = axs[1][0].set_xlabel(u'Percentage of time spent playing video games', fontproperties=font)
    axs2_ylabel_text = axs[1][0].set_ylabel(u'Liters of ice cream consumed per week', fontproperties=font)
    plt.setp(axs2_title_text, size=9, weight='bold', color='red')
    plt.setp(axs2_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=7, weight='bold', color='black')
    # Legend entries
    didntLike = mlines.Line2D([], [], color='black', marker='.',
                              markersize=6, label='didntLike')
    smallDoses = mlines.Line2D([], [], color='orange', marker='.',
                               markersize=6, label='smallDoses')
    largeDoses = mlines.Line2D([], [], color='red', marker='.',
                               markersize=6, label='largeDoses')
    # Add the legend to each subplot
    axs[0][0].legend(handles=[didntLike, smallDoses, largeDoses])
    axs[0][1].legend(handles=[didntLike, smallDoses, largeDoses])
    axs[1][0].legend(handles=[didntLike, smallDoses, largeDoses])
    # Display the figure
    plt.show()

if __name__ == '__main__':
    # Data file name
    filename = "datingTestSet.txt"
    # Open and parse the file
    datingDataMat, datingLabels = file2matrix(filename)
    showdatas(datingDataMat, datingLabels)
```

Run the code to see the visualization.

### 2.4 Data preparation: normalization

Because features with large numeric ranges dominate the distance calculation, the feature values must be normalized, mapping the data into the range 0 to 1 (or -1 to 1):

```
newValue = (oldValue - min) / (max - min)
```
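A quick sketch with illustrative values shows the effect; here NumPy broadcasting applies the formula to every column at once, achieving the same result as the `np.tile` calls in the tutorial's `autoNorm` function:

```python
import numpy as np

# Toy feature matrix: the first column is orders of magnitude larger
# and would otherwise dominate the Euclidean distance
data = np.array([[40920.0, 8.3, 0.95],
                 [14488.0, 7.2, 1.67],
                 [26052.0, 1.4, 0.80]])

min_vals = data.min(axis=0)
ranges = data.max(axis=0) - min_vals
norm = (data - min_vals) / ranges   # newValue = (oldValue - min) / (max - min)

print(norm.min(axis=0))   # [0. 0. 0.]
print(norm.max(axis=0))   # [1. 1. 1.]
```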

Add an autoNorm function to the code above to normalize the data automatically. The code is as follows:

```python
# -*- coding: UTF-8 -*-
import numpy as np

def file2matrix(filename):
    """
    Open and parse the file; encode the labels as 1 (didn't like),
    2 (liked a little), 3 (liked a lot).

    Parameters:
        filename - name of the data file
    Returns:
        returnMat - feature matrix
        classLabelVector - vector of class labels
    Modify:
        2017-03-24
    """
    # Open the file and read all lines
    fr = open(filename)
    arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # NumPy matrix for the parsed data: numberOfLines rows, 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Class label vector to return
    classLabelVector = []
    # Row index
    index = 0
    for line in arrayOLines:
        # strip() removes leading/trailing whitespace
        line = line.strip()
        # Split the line on the tab separator
        listFromLine = line.split('\t')
        # The first three columns are the features
        returnMat[index, :] = listFromLine[0:3]
        # Encode the label
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

def autoNorm(dataSet):
    """
    Normalize the data to the range [0, 1].

    Parameters:
        dataSet - feature matrix
    Returns:
        normDataSet - normalized feature matrix
        ranges - per-feature ranges
        minVals - per-feature minimums
    Modify:
        2017-03-24
    """
    # Per-column minimum and maximum
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # Range between maximum and minimum
    ranges = maxVals - minVals
    # np.shape(dataSet) gives the matrix dimensions
    normDataSet = np.zeros(np.shape(dataSet))
    # Number of rows in dataSet
    m = dataSet.shape[0]
    # Subtract the minimum from every value
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # Divide by the range to obtain the normalized data
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    # Return the normalized data, the ranges, and the minimums
    return normDataSet, ranges, minVals

if __name__ == '__main__':
    # Data file name
    filename = "datingTestSet.txt"
    # Open and parse the file
    datingDataMat, datingLabels = file2matrix(filename)
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    print(ranges)
    print(minVals)
```

Running the code prints the normalized matrix, the per-feature ranges, and the per-feature minimums.

### 2.6 Using the algorithm: building a complete, usable system

Create a classifyPerson function in the file with the following code:

```python
# -*- coding: UTF-8 -*-
import numpy as np
import operator

def classify0(inX, dataSet, labels, k):
    """
    kNN classifier.

    Parameters:
        inX - the sample to classify (test sample)
        dataSet - training data
        labels - class labels of the training data
        k - number of nearest neighbors to vote
    Returns:
        the predicted class label
    Modify:
        2017-03-24
    """
    # shape[0] gives the number of rows (samples) in dataSet
    dataSetSize = dataSet.shape[0]
    # Tile inX into dataSetSize identical rows and subtract the training set
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Square the per-feature differences
    sqDiffMat = diffMat**2
    # sum(axis=1) sums along each row
    sqDistances = sqDiffMat.sum(axis=1)
    # Square root gives the Euclidean distances
    distances = sqDistances**0.5
    # Indices that sort the distances in ascending order
    sortedDistIndices = distances.argsort()
    # Dictionary counting the votes per category
    classCount = {}
    for i in range(k):
        # Label of the i-th nearest neighbor
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default) returns the value for key, or default if absent
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the (label, count) pairs by count in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # The most frequent category is the predicted class
    return sortedClassCount[0][0]

def file2matrix(filename):
    """
    Open and parse the file; encode the labels as 1 (didn't like),
    2 (liked a little), 3 (liked a lot).

    Parameters:
        filename - name of the data file
    Returns:
        returnMat - feature matrix
        classLabelVector - vector of class labels
    Modify:
        2017-03-24
    """
    # Open the file and read all lines
    fr = open(filename)
    arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # NumPy matrix for the parsed data: numberOfLines rows, 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Class label vector to return
    classLabelVector = []
    # Row index
    index = 0
    for line in arrayOLines:
        # strip() removes leading/trailing whitespace
        line = line.strip()
        # Split the line on the tab separator
        listFromLine = line.split('\t')
        # The first three columns are the features
        returnMat[index, :] = listFromLine[0:3]
        # Encode the label
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

def autoNorm(dataSet):
    """
    Normalize the data to the range [0, 1].

    Parameters:
        dataSet - feature matrix
    Returns:
        normDataSet - normalized feature matrix
        ranges - per-feature ranges
        minVals - per-feature minimums
    Modify:
        2017-03-24
    """
    # Per-column minimum and maximum
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # Range between maximum and minimum
    ranges = maxVals - minVals
    # np.shape(dataSet) gives the matrix dimensions
    normDataSet = np.zeros(np.shape(dataSet))
    # Number of rows in dataSet
    m = dataSet.shape[0]
    # Subtract the minimum from every value
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # Divide by the range to obtain the normalized data
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

def classifyPerson():
    """
    Read a person's three features from the console and print the
    predicted class.

    Modify:
        2017-03-24
    """
    # Possible outputs
    resultList = ['dislike', 'like a little', 'like a lot']
    # Read the three features from user input
    precentTats = float(input("Percentage of time spent playing video games:"))
    ffMiles = float(input("Frequent flyer miles per year:"))
    iceCream = float(input("Liters of ice cream consumed per week:"))
    # Data file name
    filename = "datingTestSet.txt"
    # Open and parse the file
    datingDataMat, datingLabels = file2matrix(filename)
    # Normalize the training set
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Build the test vector; the feature order must match the columns of the
    # training data: miles, game percentage, ice cream
    inArr = np.array([ffMiles, precentTats, iceCream])
    # Normalize the test vector with the training-set parameters
    norminArr = (inArr - minVals) / ranges
    # Classify (the labels are the integers 1, 2, 3)
    classifierResult = classify0(norminArr, normMat, datingLabels, 3)
    # Print the result
    print("You will probably %s this person" % resultList[classifierResult - 1])

if __name__ == '__main__':
    classifyPerson()
```

Run the program, enter the three feature values at the prompts, and it prints the predicted degree of liking.