# Machine Learning Experiment I: The k-Nearest Neighbor Algorithm

1. Introduction to the algorithm

The k-nearest neighbor algorithm works as follows. We start with a sample data set, also called the training sample set, in which every sample carries a label: we know the class of every sample in advance. When new, unlabeled data arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar (nearest) samples. In general we look only at the k most similar samples, which is where the k in "k-nearest neighbor" comes from; k is usually an integer no greater than 20. Finally, the class that appears most often among these k neighbors is assigned to the new data point.
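The code sections later in this article call a classifier named `classify0` without defining it. A minimal sketch of such a kNN classifier, following the procedure described above and using made-up 2-D sample points, might look like:

```python
import numpy as np
from collections import Counter

def classify0(inX, dataSet, labels, k):
    """Classify inX by majority vote among its k nearest training samples."""
    # Euclidean distance from inX to every row of dataSet
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    kNearest = distances.argsort()[:k]
    # Most common label among those k neighbors
    return Counter(labels[i] for i in kNearest).most_common(1)[0][0]

# Tiny sanity check with made-up points: two classes, 'A' and 'B'
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0(np.array([0.1, 0.1]), group, labels, 3))  # B
```

This sketch uses a `Counter` for the majority vote; the book's version sorts a dictionary of vote counts instead, but the result is the same.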

- Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
- Disadvantages: high computational and space complexity.
- Applicable data types: numeric and nominal values.

The general flow of the k-nearest neighbor algorithm is described in the book.

2. Examples

Take film classification as an example: kNN can be used to decide whether a film is a romance or an action film. The figure in the book shows a data set, i.e. a training sample set. Roughly speaking, films with many kissing scenes are classified as romances, and films with many fighting scenes as action films. Using the data for the six known films, even though we do not know the type of an unknown film, we can judge it by computing its distance to each of them. From the distances between the samples and the unknown film, and assuming k = 3, the three films closest to the unknown film are all romances, so we classify the unknown film as a romance.
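The distance computation in this example can be sketched directly. The feature counts below are stand-ins for the book's film table (each film is represented as (fight scenes, kiss scenes)), so treat the numbers as illustrative:

```python
import numpy as np

# Hypothetical training set: (fight scenes, kiss scenes) per film
films = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ['romance', 'romance', 'romance', 'action', 'action', 'action']

# Unknown film with 18 fight scenes and 90 kiss scenes
unknown = np.array([18, 90])

# Euclidean distance to every training film
dists = np.sqrt(((films - unknown) ** 2).sum(axis=1))

# Majority vote among the k = 3 nearest films
k = 3
votes = [labels[i] for i in dists.argsort()[:k]]
prediction = max(set(votes), key=votes.count)
print(prediction)  # romance
```

With these numbers the three nearest films are all romances, matching the conclusion above.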

3. Code implementation

Take improving the matching on a dating website as an example.

Preparing data: parsing data from a text file

A parser that converts text records into a NumPy matrix:

```python
import numpy as np

def file2matrix(filename):
    # Open the file
    fr = open(filename)
    # Read all lines of the file
    arrayOlines = fr.readlines()
    # Get the number of lines in the file
    numberOfLines = len(arrayOlines)
    # NumPy matrix to return: numberOfLines rows, 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # Class label vector
    classLabelVector = []
    # Row index
    index = 0
    for line in arrayOlines:
        # Strip whitespace such as '\n', '\r', '\t', ' ' from both ends
        line = line.strip()
        # Split each line on the '\t' character; here every line has four columns
        listFromLine = line.split('\t')
        # The first three columns are the features; store them in row `index` of returnMat
        returnMat[index, :] = listFromLine[0:3]
        # Map the text label to a class: 1: dislike; 2: so-so; 3: like
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    # Return the feature matrix and the class label vector
    return returnMat, classLabelVector
```
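A quick way to check the parser is to feed it a tiny file in the same four-column, tab-separated format. The two records below are made up for illustration; the snippet restates a compact equivalent of `file2matrix` so it runs on its own:

```python
import numpy as np
import tempfile, os

def file2matrix(filename):
    # Compact equivalent of the parser above
    with open(filename) as fr:
        arrayOlines = fr.readlines()
    returnMat = np.zeros((len(arrayOlines), 3))
    classLabelVector = []
    labelMap = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}
    for index, line in enumerate(arrayOlines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(labelMap[listFromLine[-1]])
    return returnMat, classLabelVector

# Two hypothetical records: miles, game time %, ice cream, label
sample = "40920\t8.326976\t0.953952\tlargeDoses\n14488\t7.153469\t1.673904\tsmallDoses\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name
mat, labels = file2matrix(path)
os.remove(path)
print(mat.shape, labels)  # (2, 3) [3, 2]
```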

Analyzing data: creating scatter plots with Matplotlib

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
# Size and color each point according to its class label
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * np.array(datingLabels), 15.0 * np.array(datingLabels))
plt.show()
```

Preparing data: normalizing values

Normalizing the feature values:

```python
def autoNorm(dataSet):
    # Column-wise minimum of the data
    minVals = dataSet.min(0)
    # Column-wise maximum of the data
    maxVals = dataSet.max(0)
    # Range between maximum and minimum
    ranges = maxVals - minVals
    # np.shape(dataSet) gives the number of rows and columns of dataSet
    normDataSet = np.zeros(np.shape(dataSet))
    # Number of rows of dataSet
    m = dataSet.shape[0]
    # Subtract the minimum from every value: (x - xmin)
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # Divide by the range: (x - xmin) / (xmax - xmin)
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    # Return the normalized data, the ranges, and the minimums
    return normDataSet, ranges, minVals
```
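As a quick sanity check of the normalization, here is the function applied to a tiny made-up matrix (the snippet restates `autoNorm` so it runs on its own). Each column should be rescaled to the range [0, 1]:

```python
import numpy as np

def autoNorm(dataSet):
    # Min-max scaling: (x - xmin) / (xmax - xmin), per column
    minVals = dataSet.min(0)
    ranges = dataSet.max(0) - minVals
    m = dataSet.shape[0]
    normDataSet = (dataSet - np.tile(minVals, (m, 1))) / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

data = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
norm, ranges, minVals = autoNorm(data)
print(norm)
# Each column scaled to [0, 1]: rows become [0, 0], [0.5, 0.5], [1, 1]
```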

Testing the algorithm: verifying the classifier as a complete program

Classifier test code for the dating website:

```python
def datingClassTest():
    # File name
    filename = "datingTestSet.txt"
    # Store the returned feature matrix and class vector in datingDataMat and datingLabels
    datingDataMat, datingLabels = file2matrix(filename)
    # Hold out 10% of the data for testing; the smaller hoRatio, the lower the error rate
    hoRatio = 0.10
    # Normalize the data; returns the normalized data, the ranges, and the minimums
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Number of rows of normMat
    m = normMat.shape[0]
    # Number of test samples (10%)
    numTestVecs = int(m * hoRatio)
    # Count of classification errors
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs samples form the test set,
        # the remaining m - numTestVecs samples the training set
        # k is chosen as the number of labels + 1 (which gives better results here)
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 4)
        print("Predicted class: %d\tTrue class: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("Error rate: %f%%" % (errorCount / float(numTestVecs) * 100))
```

Using the algorithm: building a complete, usable system

The dating-site prediction function:

```python
def classifyPerson():
    # Possible output results
    resultList = ['not at all', 'in small doses', 'in large doses']
    # The three features, entered by the user
    ffMiles = float(input("Frequent flyer miles earned per year: "))
    percentTats = float(input("Percentage of time spent playing video games: "))
    iceCream = float(input("Liters of ice cream consumed per week: "))
    # File name
    filename = "datingTestSet.txt"
    # Open and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    # Normalize the training set
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Build a NumPy array from the input, in the same column order as the file
    # (miles, game time percentage, ice cream)
    inArr = np.array([ffMiles, percentTats, iceCream])
    # Normalize the input with the training-set parameters
    norminArr = (inArr - minVals) / ranges
    # Classify the input
    classifierResult = classify0(norminArr, normMat, datingLabels, 4)
    # Print the result
    print("You will probably like this person: %s" % resultList[classifierResult - 1])
```

4. Summary

k-NN is an instance-based (lazy) learning method: it approximates the target function locally and defers all computation until a query must be classified. It is one of the simplest machine learning algorithms.

Keywords: Algorithm Machine Learning AI

Added by ts10 on Mon, 27 Sep 2021 20:04:52 +0300