k-nearest neighbor method -- Python code implementation and kd tree construction and search

Algorithm Introduction

The k-nearest neighbor (kNN) method is a supervised learning algorithm that requires no explicit training of a model (lazy learning). Algorithm flow: for a test sample, find the k training samples closest to it in the given training set under some distance metric (Minkowski distance, Euclidean distance, Manhattan distance, Chebyshev distance, etc.), and predict from the information carried by those k training samples. For a classification task, the voting rule is used: the class that appears most often among the k training samples becomes the prediction. For a regression task, the prediction is the average of the actual outputs of the k training samples, or a weighted average in which training samples closer to the test sample receive larger weights.
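As a small illustration of the distance metrics listed above, the sketch below (plain NumPy only; the vectors a and b are made-up examples) computes them with np.linalg.norm: the ord argument plays the role of the Minkowski parameter p, so p=2 gives the Euclidean distance, p=1 the Manhattan distance, and p=inf the Chebyshev distance.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

#Minkowski distance with parameter p, selected through the ord argument of np.linalg.norm
euclidean = np.linalg.norm(a - b, ord=2)       #sqrt(3**2 + 2**2 + 0**2) ~ 3.606
manhattan = np.linalg.norm(a - b, ord=1)       #3 + 2 + 0 = 5
chebyshev = np.linalg.norm(a - b, ord=np.inf)  #max(3, 2, 0) = 3
print(euclidean, manhattan, chebyshev)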

Python code implementation

Load the iris data set and split it into a training set and a test set

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from collections import Counter
iris_data = load_iris()
data = pd.DataFrame(iris_data.data,columns=iris_data.feature_names)
data['target'] = iris_data.target
X,y = data.iloc[:,:-1].values,data.iloc[:,-1].values
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,stratify=y)
data.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
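Because stratify=y is passed to train_test_split, each class keeps the same proportion in both subsets. A quick check (a sketch that assumes the variables created above) is:

print(np.bincount(y_train))  #with test_size=0.3 this gives 35 samples per class
print(np.bincount(y_test))   #and 15 samples per class in the test set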

Build model

class knn:
    def __init__(self,X_train,y_train,neighbors,ord):
        self.X = X_train
        self.y = y_train
        self.k = neighbors
        self.ord = ord  #Distance metric type, ord is equal to the parameter p in Minkowski distance
    def predict(self,x):
        knn_list = []  #For each of the k nearest training samples found so far, store (distance to the test sample, class label)
        for i in range(self.k):
            dist = np.linalg.norm(x-self.X[i],ord=self.ord)
            knn_list.append((dist,self.y[i]))

        for i in range(self.k,len(self.X)):
            dist = np.linalg.norm(x-self.X[i],ord=self.ord)  #Distance from the test sample to training sample i
            max_index = knn_list.index(max(knn_list,key=lambda t:t[0]))  #Index of the farthest of the current k candidates
            if dist<knn_list[max_index][0]:  #Replace the farthest candidate if this sample is closer
                knn_list[max_index] = (dist,self.y[i])

        knn_y = [k[-1] for k in knn_list]  #Class labels of the k nearest training samples
        #Count how many of the k neighbors belong to each class
        counter_y = Counter(knn_y)  #Counter behaves like a dict mapping class label to count
        max_y = sorted(counter_y.items(),key=lambda item:item[1])[-1][0]  #The class with the largest count wins the vote
        return max_y

    def score(self,X_test,y_test):
        count = 0
        for x,y in zip(X_test,y_test):
            if y==self.predict(x):
                count += 1
        return count/len(X_test)
knn_model = knn(X_train,y_train,15,2)
test_accuracy = knn_model.score(X_test,y_test)
print('The accuracy of the test set is:',test_accuracy)
The accuracy of the test set is: 0.9777777777777777
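The loop above maintains a running list of the k closest samples. An equivalent, usually faster variant (a sketch that could be added as a method of the knn class above, assuming the same attributes) computes all distances at once and picks the k smallest with np.argpartition:

    def predict_vectorized(self,x):
        #Distances from x to every training sample (Minkowski distance with parameter self.ord)
        dists = np.linalg.norm(self.X - x, ord=self.ord, axis=1)
        #Indices of the k smallest distances (order within the k is arbitrary)
        knn_idx = np.argpartition(dists, self.k)[:self.k]
        #Majority vote over the labels of the k nearest samples
        return Counter(self.y[knn_idx]).most_common(1)[0][0]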

sklearn implementation

from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(X_train,y_train)
score = knn_model.score(X_test,y_test)
print(score)
0.977777777778
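KNeighborsClassifier also exposes the ideas discussed earlier: weights='distance' gives the distance-weighted vote, algorithm='kd_tree' uses the kd tree described in the next section, and p is the Minkowski exponent. A sketch with these options (the parameter values are illustrative):

from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=10, weights='distance', algorithm='kd_tree', p=2)
knn_model.fit(X_train,y_train)
print(knn_model.score(X_test,y_test))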

Construction and search of kd tree

A kd tree is a binary tree that represents a partition of k-dimensional space. It is used to store k-dimensional instance points and retrieve them quickly, and its structure is well suited to nearest-neighbor search and collision detection.
Build process:
First, build the root node. The instance points are partitioned along one dimension; usually the dimension with the largest variance is chosen. Sort the instance points by their value in that dimension and take the point at the median as the root node; points whose value is smaller than the median go to the left and generate the left child, and points whose value is larger go to the right and generate the right child. The left and right children are then partitioned in the same way along the next dimension, and the process repeats until no points are left to divide, producing the leaf nodes. A sketch of a single partition step follows.
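A minimal sketch of one partition step (plain NumPy, using the example data set introduced below): the split dimension is chosen by largest variance and the split point by the median.

import numpy as np

points = np.array([[7,2],[4,7],[5,4],[2,3],[8,1],[9,6]])
split_dim = np.argmax(np.var(points, axis=0))      #dimension with the largest variance (here the x dimension)
order = points[np.argsort(points[:, split_dim])]   #sort the points along that dimension
median_idx = len(order) // 2
root, left, right = order[median_idx], order[:median_idx], order[median_idx+1:]
print(split_dim, root)  #[7 2] becomes the root; left and right are then partitioned recursively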
Search process:
First, find the leaf node that contains the target point x: starting from the root, recursively descend the kd tree, moving to the left child if x is smaller than the current node in its split dimension, and to the right child otherwise. Take the leaf node reached this way as the current nearest point, then move back up through its ancestors, updating the nearest point whenever a node lies closer to x. If the distance from x to a node's splitting hyperplane is smaller than the current nearest distance, the other child of that node must also be searched. The search ends once the root node has been processed.
If the training set is [[7,2], [4,7], [5,4], [2,3], [8,1], [9,6]], find the training sample nearest to the target point [3,4.5].
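With median splits that cycle through the dimensions (x first, then y), this data set should produce the tree sketched below; the code that follows is expected to build exactly this structure.

            [7,2]          split on x
           /     \
       [5,4]     [9,6]     split on y
       /    \     /
   [2,3]  [4,7] [8,1]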

from collections import namedtuple
import numpy as np

#Define kd tree node class
class KdNode(object):
    def __init__(self,dom_elt,split,left,right):   #Including data point, dimension division, left child node and right child node
        self.dom_elt = dom_elt
        self.split = split
        self.left = left
        self.right = right

#Define kd tree
class KdTree(object):
    def __init__(self,data):   #Attributes include: data dimension and root node
        self.k = len(data[0])
        self.root = self.createNode(data,0)

    def createNode(self,data,split):  #Build the tree recursively; split is the dimension used for partitioning
        if not data:  #No data left, return an empty node
            return None
        splitPos = len(data)>>1  #Index of the median after sorting
        dataSort = sorted(data,key=lambda x:x[split])
        medianNode = dataSort[splitPos]
        splitNext = (split+1) % self.k   #Cycle through the dimensions in order; choosing the dimension with the largest variance also works
        return KdNode(medianNode,split,self.createNode(dataSort[:splitPos],splitNext),self.createNode(dataSort[splitPos+1:],splitNext))

    def search_nearest(self,point):  #Traverse the tree recursively to find the nearest neighbor
        self.result = namedtuple('nearestInf','point distance')  #Store nearest neighbor information
        self.nearestPoint = None
        self.nearestDis = 0
        def travel(node,depth):
            if node is not None:
                axis = node.split
                #Descend first towards the leaf region that contains the target point
                if point[axis]<node.dom_elt[axis]:
                    travel(node.left,depth+1)
                else:
                    travel(node.right,depth+1)
                #Euclidean distance from the target point to the current node
                distance = np.sqrt(np.sum((np.array(point) - np.array(node.dom_elt))**2))
                if self.nearestPoint is None:
                    self.nearestPoint = node.dom_elt
                    self.nearestDis = distance
                elif self.nearestDis > distance:
                    self.nearestPoint = node.dom_elt
                    self.nearestDis = distance
                #If the target point is closer to this node's splitting hyperplane than the current nearest distance,
                #the other branch may still contain a closer point and must also be searched
                if abs(point[axis]-node.dom_elt[axis]) < self.nearestDis:
                    if point[axis] < node.dom_elt[axis]:
                        travel(node.right, depth + 1)
                    else:
                        travel(node.left, depth + 1)
        travel(self.root,0)
        return self.result(self.nearestPoint,self.nearestDis)

data = [[7,2],[4,7],[5,4],[2,3],[8,1],[9,6]]  #data set
tree1 = KdTree(data)
result = tree1.search_nearest([3,4.5])
print(result)

nearestInf(point=[2, 3], distance=1.8027756377319946)

The nearest point is (2, 3), at a distance of about 1.803.
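Tracing the search by hand on the tree sketched earlier shows why: descending from [7,2] (3 < 7, go left) and then [5,4] (4.5 >= 4, go right) reaches the leaf [4,7] at distance about 2.69. Backtracking to [5,4] improves the nearest distance to about 2.06, and because the target is only 0.5 away from that node's splitting line y = 4, the other branch is searched and [2,3] is found at distance about 1.80. Back at the root [7,2], the target is 4 away from the splitting line x = 7, which is larger than 1.80, so the right subtree is pruned and [2,3] is returned.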
