k-nearest neighbor algorithm

KNN overview

The k-nearest neighbor (KNN) algorithm is a basic classification and regression method.

As a supervised classification algorithm, it is one of the simplest machine learning algorithms. As the name implies, its main idea is to determine an instance's category from the categories of its nearest neighbors. The algorithm's premise is a training data set with labeled categories, and the procedure is commonly summarized as follows:

Given a training data set and a new input instance, find the k instances in the training set that are closest to the new instance. The new instance is then assigned to the class that the majority of those k instances belong to.

One-sentence summary: as the proverb goes, "he who stays near vermilion turns red, and he who stays near ink turns black"; an instance takes the label of the company it keeps!

The input of the k-nearest neighbor algorithm is the feature vector of an instance, corresponding to a point in feature space; the output is the category of the instance, which may be any of several classes. The algorithm assumes a training data set in which each instance's category is already determined. For classification, a new instance is predicted by majority vote over the categories of its k nearest training instances. The k-nearest neighbor algorithm therefore has no explicit learning process.

In effect, the k-nearest neighbor algorithm uses the training data set to partition the feature space, and that partition serves as its classification "model". The choice of k, the distance metric, and the classification decision rule are the three basic elements of the algorithm.

 

Scenario examples

Movies can be classified by genre, so how do we distinguish action movies from romance movies?

  1. Action movies: more fights
  2. Romance: more kisses

Based on the number of kisses and fight scenes in a movie, a program built with the k-nearest neighbor algorithm can automatically determine the movie's genre.

Now, using the distances between every movie in the sample set and the unknown movie, we sort the distances in ascending order and take the k closest movies. Assuming k=3, the three closest movies are He's Not Really into Dudes, Beautiful Woman, and California Man. The KNN algorithm determines the genre of the unknown movie from the genres of these three closest movies; since all three are romance movies, we conclude that the unknown movie is a romance movie.
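To make the vote concrete, here is a minimal sketch with made-up fight/kiss counts (the real counts from the sample set are not reproduced here); only the ranking logic matters:

import numpy as np

# Hypothetical (fights, kisses) counts for the labeled sample set
movies = np.array([[2, 100],    # He's Not Really into Dudes - romance
                   [1, 81],     # Beautiful Woman            - romance
                   [3, 104],    # California Man             - romance
                   [99, 5],     # an action movie
                   [98, 2]])    # another action movie
labels = ["romance", "romance", "romance", "action", "action"]

unknown = np.array([18, 90])                            # the movie to classify
dist = np.sqrt(((movies - unknown) ** 2).sum(axis=1))   # Euclidean distances
nearest3 = dist.argsort()[:3]                           # indices of the 3 closest movies
print([labels[i] for i in nearest3])                    # -> ['romance', 'romance', 'romance']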

 

Working principle

Calculation steps:

  1. Suppose there is a labeled sample data set (the training sample set) that contains the correspondence between each data point and its classification.
  2. After new, unlabeled data is entered, each feature of the new data is compared with the corresponding feature of the data in the sample set:
    1. Calculate the distance between the new data and every data point in the sample set.
    2. Sort all the distances from smallest to largest (smaller means more similar).
    3. Take the classification labels of the first k sample points (k is usually no greater than 20).
  3. Take the most frequent classification label among those k points as the classification of the new data (the classify function later in this article implements exactly these steps).

Development process:

  • Collect data: any method
  • Prepare data: distance calculations need numeric values, preferably in a structured data format
  • Analyze data: any method
  • Train the algorithm: this step does not apply to the k-nearest neighbor algorithm
  • Test the algorithm: calculate the error rate
  • Use the algorithm: feed in sample data and get structured output, run the k-nearest neighbor algorithm to determine which class the input belongs to, and then apply whatever follow-up processing the predicted class requires

 

Advantages and disadvantages of KNN algorithm

1. Advantages

  • Easy to understand and implement;
  • High accuracy, insensitive to outliers, no assumptions about the input data, no training required;
  • Especially well suited to multi-class problems.

2. Disadvantages

  • It is a lazy algorithm: classification is computationally expensive, because every training sample must be scanned to compute distances, which costs memory and makes scoring slow;
  • When the classes are imbalanced, e.g. one class has far more samples than the others, the large class can dominate the nearest neighbors of a new sample, which hurts classification;
  • It has poor interpretability and cannot produce explicit rules the way a decision tree can.

3. Precautions

Setting of K value

If k is set too small, classification accuracy drops; if it is set too large, and the test sample belongs to a class with few samples in the training set, noise increases and classification suffers.

In practice, k is usually chosen by cross-validation (starting from k = 1).

Rule of thumb: k is generally kept below the square root of the number of training samples.
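As a sketch of how k might be chosen in practice, here is a minimal single-split stand-in for cross-validation; it assumes the classify function defined later in this article, an already-normalized feature matrix, and a matching label list (choose_k itself is an illustrative helper, not part of the original code):

def choose_k(data, labels, max_k=20, test_ratio=0.2):
    # Hold out the first test_num rows as a validation set
    test_num = int(data.shape[0] * test_ratio)
    best_k, best_err = 1, float("inf")
    for k in range(1, max_k + 1):
        # Count validation errors for this candidate k
        errors = sum(
            classify(data[i], data[test_num:], labels[test_num:], k) != labels[i]
            for i in range(test_num)
        )
        if errors < best_err:
            best_k, best_err = k, errors
    return best_k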

Optimization

Compress the training samples;

When determining the final category, do not use a simple majority vote but a weighted vote: the closer the neighbor, the higher its weight.
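A minimal sketch of such a weighted vote, assuming the neighbor labels and distances have already been computed; the 1 / (d + eps) weighting used here is just one common choice:

def weighted_vote(neighbor_labels, neighbor_distances, eps=1e-8):
    # Each neighbor votes with a weight inversely proportional to its distance
    weights = {}
    for label, d in zip(neighbor_labels, neighbor_distances):
        weights[label] = weights.get(label, 0.0) + 1.0 / (d + eps)
    # Return the label with the largest total weight
    return max(weights, key=weights.get)

# e.g. weighted_vote(['a', 'b', 'b'], [0.1, 5.0, 6.0]) returns 'a'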

 

KNN project case

Optimize the matching effect of dating websites

Project overview

Helen has been using a dating website to find dates. After a while, she found that the people she met fell into three types:

  • People she didn't like
  • People of average charm
  • People of great charm

She hoped to:

  1. Date people of average charm on weekdays
  2. Date people of great charm on weekends
  3. Rule out the people she didn't like right away

She has also collected some data that the dating website does not record, which should help in classifying potential matches.

Development process

  • Collect data: a text file is provided
  • Prepare data: parse the text file with Python
  • Analyze data: draw a 2D scatter plot with Matplotlib
  • Train the algorithm: this step does not apply to the k-nearest neighbor algorithm
  • Test the algorithm: use part of the data Helen provided as test samples. The difference between test samples and the rest is that the test samples are already classified; if the predicted class differs from the actual class, it counts as an error.
  • Use the algorithm: build a simple command-line program so that Helen can enter some feature data and find out whether the other person is a type she likes.

1. Collect data

Helen recorded the features of the people she dated, together with their classification, in a text file. Each record contains three features:

  • Frequent flyer miles earned per year
  • Percentage of time spent playing video games
  • Liters of ice cream consumed per week

The last column of each record is the category label (1 = didn't like, 2 = average charm, 3 = great charm):

40920   8.326976    0.953952    3
14488   7.153469    1.673904    2
26052   1.441871    0.805124    1
75136   13.147394   0.428964    1
38344   1.669788    0.134296    1
...

2. Prepare data: use Python to parse text files

A parser that converts the text records into a NumPy matrix:

from numpy import zeros

def file2matrix(filename):
    """
    Desc:
        Import the training data
    parameters:
        filename: path to the data file
    return:
        the feature matrix `matrix` and the corresponding category list `class_label`
    """
    with open(filename) as f:
        lines = f.readlines()

    row_num = len(lines)
    # zeros((n, 3)) creates an n x 3 matrix filled with zeros
    matrix = zeros((row_num, 3))
    index = 0
    class_label = []
    for line in lines:
        # Strip whitespace at both ends, then split on tabs to get the fields
        features = line.strip().split("\t")
        # Fill in the feature matrix and the corresponding classification label
        matrix[index, :] = features[0:3]
        class_label.append(int(features[-1]))
        index += 1

    return matrix, class_label
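A quick usage sketch (the path below assumes the data file from step 1 is saved as data/datingTestSet2.txt; adjust it to wherever the file actually lives):

datingDataMat, datingLabels = file2matrix("data/datingTestSet2.txt")
print(datingDataMat.shape)    # (number_of_records, 3)
print(datingLabels[:5])       # [3, 2, 1, 1, 1] for the sample lines shown above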

3. Analyze the data

import matplotlib.pyplot as plt
from numpy import array

datingDataMat, datingLabels = file2matrix("data/datingTestSet2.txt")
fig = plt.figure()
ax = fig.add_subplot(111)
# Scale the marker size and color by class label so the three categories stand apart
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()

The resulting figure uses the first and second columns of the matrix, which gives a good display: three distinct classification regions are clearly identifiable, and people with different preferences fall in different regions.

Normalizing the data

Normalization puts features on a unified scale. For more details, see: https://www.zhihu.com/question/19951858

Distance between sample 3 and sample 4: $\sqrt{(0-67)^2 + (20000-32000)^2 + (1.1-0.1)^2}$

The frequent-flyer-miles term clearly dominates this distance, so we normalize the feature values to eliminate the influence of features having very different magnitudes.

Definition of normalization: in my view, normalization means constraining the data to be processed (via some algorithm) within a fixed range. It is done first for convenience of later processing, and second to speed up convergence when the program runs. The methods are as follows:

1) Linear function conversion, the expression is as follows:

y=(x-MinValue)/(MaxValue-MinValue)  

Note: x and y are the values before and after conversion; MaxValue and MinValue are the maximum and minimum values in the sample.
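As a quick worked example with made-up numbers: if MinValue = 0 and MaxValue = 200, then x = 50 maps to y = (50 - 0) / (200 - 0) = 0.25, while x = 200 maps to y = 1.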

2) Logarithmic function conversion, the expression is as follows:

y=log10(x)  

Description: base 10 logarithmic function conversion.

As shown in the picture:

![logarithmic function image](http://data.apachecn.org/img/AiLearning/ml/2.KNN/knn_1.png)
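As a quick worked example: x = 10 gives y = 1, x = 100 gives y = 2, and x = 1000 gives y = 3, so the transform compresses large values much more strongly than small ones. Note that y only falls in [0, 1] when x lies in [1, 10].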

3) Arctangent function conversion, with the following expression:

y=arctan(x)*2/PI 

As shown in the picture:

![arctangent/arccotangent function image](http://data.apachecn.org/img/AiLearning/ml/2.KNN/arctan_arccot.gif)

In statistics, the role of normalization is to unify the statistical distribution of the samples: normalization to [0, 1] yields a probability-like distribution, while normalization to [-1, +1] yields a coordinate-like distribution.

from numpy import tile

def normalize(matrix):
    """
    Desc:
        Normalize the feature values to eliminate the influence of features
        having very different magnitudes
    parameter:
        matrix: the data set
    return:
        the normalized data set norm_matrix, plus ranges and min_val
        (the per-feature ranges and minimums, needed later to normalize new inputs)

    Normalization formula:
        Y = (X - Xmin) / (Xmax - Xmin)
        where Xmin and Xmax are the minimum and maximum values of each feature.
        This maps every numeric feature value into the interval [0, 1].
    """
    # Compute the minimum, maximum, and range of each feature (column-wise, axis 0)
    min_val = matrix.min(0)
    max_val = matrix.max(0)
    ranges = max_val - min_val
    m = matrix.shape[0]
    # Subtract the per-feature minimum from every row
    norm_matrix = matrix - tile(min_val, (m, 1))
    # Divide the differences by the per-feature range (element-wise division)
    norm_matrix = norm_matrix / tile(ranges, (m, 1))

    return norm_matrix, ranges, min_val
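A quick usage sketch, continuing from the file2matrix call above:

normMat, ranges, minVals = normalize(datingDataMat)
print(normMat[:2])    # every feature value now lies in [0, 1]
print(ranges)         # per-feature (max - min), reused later to normalize new inputs
print(minVals)        # per-feature minimum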

4. Train the algorithm: this step does not apply to the k-nearest neighbor algorithm

Because every test sample has to be compared against the full set of training data anyway, there is nothing to train.

Pseudocode for the kNN algorithm:

For each data point in the data set:
    Calculate the distance between the target point (the point to be classified) and that data point
    Sort the distances from smallest to largest
    Select the k shortest distances
    Find the most frequent category among those k neighbors
    Return that category as the prediction for the target point

import operator
from numpy import tile

def classify(input_data, dataset, labels, k):
    """
    kNN classification algorithm
    :param input_data: the target data to classify, i.e. someone's feature vector (a single row)
    :param dataset: the training data set
    :param labels: the classification labels corresponding to dataset
    :param k: the number of nearest neighbors to select
    :return: the predicted classification label
    """
    msize = dataset.shape[0]
    # The distance measurement formula is Euclidean distance
    diff_mat = tile(input_data, (msize, 1)) - dataset
    sq_diff_mat = diff_mat ** 2
    sq_distances = sq_diff_mat.sum(axis=1)
    distances = sq_distances ** 0.5

    # Sort distance: from small to large
    sorted_index = distances.argsort()
    # Take the k shortest distances and count the categories among them
    class_count = {}
    for i in range(k):
        label = labels[sorted_index[i]]
        class_count[label] = class_count.get(label, 0) + 1
    sorted_class = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)

    return sorted_class[0][0]
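A minimal usage sketch with made-up points, just to show the call signature:

from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']
print(classify(array([0.2, 0.1]), group, group_labels, 3))   # -> 'B'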

5. Test the algorithm

Use part of the data Helen provided as test samples. If the predicted classification differs from the actual classification, it counts as an error.

def dating_class_test():
    """
    Desc:
        Test the classifier on the dating-website data
    parameters:
        none
    return:
        the number of errors
    """
    # Proportion of the data used for testing (training set proportion = 1 - ratio)
    ratio = 0.1  # 10% of the data serves as the test set
    # Load the data from the file
    feature_matrix, labels = file2matrix("data/datingTestSet2.txt")
    # Normalize the data
    norm_mat, ranges, min_val = normalize(feature_matrix)

    # m is the number of rows of data, i.e. the first dimension of the matrix
    m = norm_mat.shape[0]
    # Number of samples used for testing
    test_num = int(m * ratio)
    error_count = 0
    for i in range(test_num):
        # Classify each test row against the remaining rows, which serve as training data
        classifier_result = classify(norm_mat[i, :], norm_mat[test_num:m, :], labels[test_num:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifier_result, labels[i]))
        if classifier_result != labels[i]:
            error_count += 1
    print("the total error rate is: %f" % (error_count / test_num))
    print(error_count)

6. Use the algorithm

Build a simple command-line program so that Helen can enter some feature data and find out whether the other person is a type she likes.

Prediction function for the dating website:

from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per week?"))
    datingDataMat, datingLabels = file2matrix('data/datingTestSet2.txt')
    normMat, ranges, minVals = normalize(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # Normalize the new input with the training set's minimums and ranges before classifying
    classifierResult = classify((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])

 

 

 
