Algorithm Introduction
The k-nearest neighbor (kNN) method is a supervised learning algorithm that requires no explicit training phase (it is a lazy learner). The algorithm flow is: for a test sample, find the k training samples closest to it in the training set under some distance metric (Minkowski, Euclidean, Manhattan, Chebyshev, etc.), then make a prediction from those k samples. For a classification task, the voting rule is used: the category that appears most often among the k training samples is taken as the prediction. For a regression task, the average of the actual outputs of the k training samples is taken as the prediction; a weighted average can also be used, in which training samples closer to the test sample receive larger weights.
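As a minimal sketch of these decision rules (not the implementation used below; the `knn_predict` helper and the small epsilon that avoids division by zero are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, p=2, task='classify'):
    """Illustrative kNN decision rules: voting for classification,
    distance-weighted averaging for regression. p is the Minkowski order."""
    dists = np.linalg.norm(X_train - x, ord=p, axis=1)  # distance from x to every training sample
    idx = np.argsort(dists)[:k]                         # indices of the k nearest training samples
    if task == 'classify':
        # voting rule: the most frequent category among the k neighbors
        return Counter(y_train[idx]).most_common(1)[0][0]
    # regression: weighted average, closer training samples get larger weights
    w = 1.0 / (dists[idx] + 1e-12)
    return np.sum(w * y_train[idx]) / np.sum(w)
```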
Python code implementation
Load the iris data set and split it into a training set and a test set.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from collections import Counter

iris_data = load_iris()
data = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
data['target'] = iris_data.target
X, y = data.iloc[:, :-1].values, data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
data.head()
```
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
Build model
```python
class knn:
    def __init__(self, X_train, y_train, neighbors, ord):
        self.X = X_train
        self.y = y_train
        self.k = neighbors
        self.ord = ord  # distance metric type; ord corresponds to the parameter p in the Minkowski distance

    def predict(self, x):
        knn_list = []  # holds (distance to the test sample, category) for the current k nearest training samples
        for i in range(self.k):
            dist = np.linalg.norm(x - self.X[i], ord=self.ord)
            knn_list.append((dist, self.y[i]))
        for i in range(self.k, len(self.X)):
            dist = np.linalg.norm(x - self.X[i], ord=self.ord)  # distance (vector norm) to the i-th training sample
            max_index = knn_list.index(max(knn_list, key=lambda item: item[0]))
            if dist < knn_list[max_index][0]:
                knn_list[max_index] = (dist, self.y[i])
        knn_y = [k[-1] for k in knn_list]  # category labels of the k training samples nearest to the test sample
        # count how many of the k nearest samples fall in each category
        counter_y = Counter(knn_y)  # dict-like Counter of category -> count
        max_y = sorted(counter_y.items(), key=lambda item: item[1])[-1][0]
        return max_y

    def score(self, X_test, y_test):
        count = 0
        for x, y in zip(X_test, y_test):
            if y == self.predict(x):
                count += 1
        return count / len(X_test)

knn_model = knn(X_train, y_train, 15, 2)
test_accuracy = knn_model.score(X_test, y_test)
print('The accuracy of the test set is:', test_accuracy)
```
The accuracy of the test set is: 0.9777777777777777
sklearn implementation
```python
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(X_train, y_train)
score = knn_model.score(X_test, y_test)
print(score)
```
0.977777777778
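For reference, `KNeighborsClassifier` also exposes the choices discussed in the introduction: `p` selects the Minkowski order (p=2 is Euclidean, p=1 Manhattan) and `weights='distance'` gives closer training samples larger weights. The values below are only an example:

```python
from sklearn.neighbors import KNeighborsClassifier

# Same knobs as the hand-written version: k neighbors, Minkowski order p,
# and distance-based weighting so that closer training samples count more.
knn_model = KNeighborsClassifier(n_neighbors=15, p=2, weights='distance')
knn_model.fit(X_train, y_train)
print(knn_model.score(X_test, y_test))
```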
Construction and search of kd tree
A kd tree is a binary tree that represents a partition of k-dimensional space. It is used to store k-dimensional instance points and retrieve them quickly, and its structure is well suited to nearest-neighbor search and collision detection.
Build process:
First, build the root node. The instance points are partitioned along one dimension; usually the dimension with the largest variance is chosen. Sort the instance points by their value in that dimension and take the point at the median as the root node; points whose value is less than the median go to the left subtree and form the left child, and points whose value is greater than the median go to the right subtree and form the right child. The left and right children are then split along the next dimension in the same way. Repeat this process until no further split is possible, which produces the leaf nodes.
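As a small illustration of the largest-variance splitting rule (the implementation below simply cycles through the dimensions in order), using the 2-D points from the example that follows:

```python
import numpy as np

points = np.array([[7, 2], [4, 7], [5, 4], [2, 3], [8, 1], [9, 6]])

split = np.argmax(np.var(points, axis=0))      # dimension with the largest variance
order = points[np.argsort(points[:, split])]   # sort the points by that dimension
median_point = order[len(order) // 2]          # the median point becomes the root of this subtree
print(split, median_point)                     # points before the median go left, points after it go right
```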
Search process:
First, find the leaf node whose region contains the target point x: start from the root and descend the kd tree recursively, moving to the left child if x is smaller than the current node's value in the splitting dimension, and to the right child otherwise. Take the leaf node reached in this way as the current nearest point, then backtrack: visit the parent nodes in turn and update the nearest point whenever a closer one is found. If the distance from x to a parent node's splitting hyperplane (the absolute difference between x and that node in the splitting dimension) is less than the current nearest distance, the other child subtree of that parent must also be searched. The search ends once the root node has been processed.
Suppose the training set is [[7,2], [4,7], [5,4], [2,3], [8,1], [9,6]]; find the training sample nearest to the target point [3,4.5].
```python
from collections import namedtuple
import numpy as np

# kd tree node: holds a data point, the splitting dimension, and the left/right children
class KdNode(object):
    def __init__(self, dom_elt, split, left, right):
        self.dom_elt = dom_elt
        self.split = split
        self.left = left
        self.right = right

class KdTree(object):
    def __init__(self, data):
        # attributes: data dimension k and the root node
        self.k = len(data[0])
        self.root = self.createNode(data, 0)

    def createNode(self, data, split):
        # build the tree recursively; split is the dimension used for this partition
        if not data:  # no data left: return an empty node
            return None
        dataSort = sorted(data, key=lambda x: x[split])
        splitPos = len(data) >> 1  # index of the median point
        medianNode = dataSort[splitPos]
        # cycle through the dimensions in order; the largest-variance dimension could be used instead
        splitNext = (split + 1) % self.k
        return KdNode(medianNode, split,
                      self.createNode(dataSort[:splitPos], splitNext),
                      self.createNode(dataSort[splitPos + 1:], splitNext))

    def search_nearest(self, point):
        # traverse the tree recursively to find the nearest neighbor
        self.result = namedtuple('nearestInf', 'point distance')  # stores the nearest-neighbor information
        self.nearestPoint = None
        self.nearestDis = 0

        def travel(node, depth):
            if node is not None:
                axis = node.split
                # descend towards the leaf region containing the target point
                if point[axis] < node.dom_elt[axis]:
                    travel(node.left, depth + 1)
                else:
                    travel(node.right, depth + 1)
                # on the way back, update the current nearest point
                distance = np.sqrt(np.sum((np.array(point) - np.array(node.dom_elt)) ** 2))
                if self.nearestPoint is None or self.nearestDis > distance:
                    self.nearestPoint = node.dom_elt
                    self.nearestDis = distance
                # if the splitting hyperplane is closer than the current nearest distance,
                # the other subtree may contain a closer point and must be searched too
                if abs(point[axis] - node.dom_elt[axis]) < self.nearestDis:
                    if point[axis] < node.dom_elt[axis]:
                        travel(node.right, depth + 1)
                    else:
                        travel(node.left, depth + 1)

        travel(self.root, 0)
        return self.result(self.nearestPoint, self.nearestDis)

data = [[7, 2], [4, 7], [5, 4], [2, 3], [8, 1], [9, 6]]  # training set
tree1 = KdTree(data)
result = tree1.search_nearest([3, 4.5])
print(result)
```
nearestInf(point=[2, 3], distance=1.8027756377319946)
The nearest point is (2, 3), at a distance of about 1.803.
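As a quick sanity check (reusing `data` and `np` from the snippet above), a brute-force scan over the six training points should give the same answer:

```python
# Brute-force nearest neighbor: compute the distance from the target
# to every training point directly and keep the smallest one.
target = np.array([3, 4.5])
dists = np.linalg.norm(np.array(data) - target, axis=1)
print(data[int(np.argmin(dists))], dists.min())
```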