# Algorithm Introduction

k-nearest neighbor method belongs to supervised learning and does not need training model (lazy learning). Algorithm flow: for the test samples, find k training samples from the given training set according to some distance measurement (Minkowski distance, Euclidean distance, Manhattan distance, Chebyshev distance, etc.), and predict according to the k training sample information. For the classification task, according to the voting method, the category that appears most in the k training samples can be used as the prediction result; For the regression task, the average value of the actual output of k training samples can be taken as the prediction result according to the average method, or the weighted average can be used. For the training samples from the test samples, a larger weight can be taken.

# python code implementation

The iris data set is classified and divided into training set and test set

```from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from collections import Counter
data = pd.DataFrame(iris_data.data,columns=iris_data.feature_names)
data['target'] = iris_data.target
X,y = data.iloc[:,:-1].values,data.iloc[:,-1].values
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,stratify=y)
```
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)target
05.13.51.40.20
14.93.01.40.20
24.73.21.30.20
34.63.11.50.20
45.03.61.40.20

Build model

```class knn:
def __init__(self,X_train,y_train,neighbors,ord):
self.X = X_train
self.y = y_train
self.k = neighbors
self.ord = ord  #Distance metric type, ord is equal to the parameter p in Minkowski distance
def predict(self,x):
knn_list = []  #Store the information of k training set sample points, including the distance and category from the test sample
for i in range(self.k):
dist = np.linalg.norm(x-self.X[i],ord=self.ord)
knn_list.append((dist,self.y[i]))

for i in range(self.k,len(self.X)):
dist = np.linalg.norm(x-self.X[i],ord=self.ord)  #Find norm
max_index = knn_list.index(max(knn_list,key=lambda x:x))
if dist<knn_list[max_index]:
knn_list[max_index] = (dist,self.y[i])

knn_y = [k[-1] for k in knn_list] #Category label of the first k training samples adjacent to the test sample
#Count the categories of k training samples and the number of training samples in each category
counter_y = Counter(knn_y)  #Return dictionary data type
max_y = sorted(counter_y.items(),key=lambda x:x)[-1]
return max_y

def score(self,X_test,y_test):
count = 0
for x,y in zip(X_test,y_test):
if y==self.predict(x):
count += 1
return count/len(X_test)
knn_model = knn(X_train,y_train,15,2)
test_accuracy = knn_model.score(X_test,y_test)
print('The accuracy of the test set is;',test_accuracy)
```
```The accuracy of the test set is; 0.9777777777777777
```

sklearn implementation

```from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(X_train,y_train)
score = knn_model.score(X_test,y_test)
print(score)
```
```0.977777777778
```

# Construction and search of kd tree

kd tree is a binary tree, which represents the division of k-dimensional space. It is used to store and quickly retrieve k-dimensional instance points. Its structure is very suitable for finding nearest neighbors and collision detection.
Build process:
First, build the root node. It is divided according to a dimension of the instance point. Generally, the dimension with the largest variance value is selected for division; Sort the instance points according to the value of the dimension, take the instance point corresponding to the median as the root node, place the instance point with the value less than the median on the left of the tree to generate the left child node, and place the instance point greater than the median on the right of the tree to generate the right child node; Then, the left and right child nodes continue to be divided into dimensions; Repeat the above process until it cannot be divided to generate leaf nodes.
Search process:
First, find the leaf node containing the target point X in the kd tree, start from the root node and recursively access the kd tree. If x is less than the value of the current node's division dimension, move to the left child node, otherwise vice versa; Take the found leaf node as the current closest point, access the parent node of the current closest point upward in turn and update the closest point. If the value of the parent node in the division dimension and the target point x is less than the current closest distance, you also need to access another child node of this parent node; Until the end of access to the root node.
If the training set is [[7,2], [4,7], [5,4], [2,3], [8,1], [9,6], find the nearest training sample of the target point [3,4.5].

```from collections import namedtuple
import numpy as np
from time import process_time

#Define kd tree node class
class KdNode(object):
def __init__(self,dom_elt,split,left,right):   #Including data point, dimension division, left child node and right child node
self.dom_elt = dom_elt
self.split = split
self.left = left
self.right = right

#Define kd tree
class KdTree(object):
def __init__(self,data):   #Attributes include: data dimension and root node
self.k = len(data)
self.root = self.createNode(data,0)

def createNode(self,data,split):  #The tree is constructed by iterative algorithm, and split is the divided dimension
if not data:  #No data, return empty node
return None
splitPos = len(data)>>2
dataSort = sorted(data,key=lambda x:x[split])
medianNode = dataSort[splitPos]
splitNext = (split+1) % self.k   #The dimension with the largest variance can also be obtained by dividing the dimensions in order
return KdNode(medianNode,split,self.createNode(dataSort[:splitPos],splitNext),self.createNode(dataSort[splitPos+1:],splitNext))

def search_nearest(self,point):  #The iterative algorithm is used to traverse the tree and find the nearest neighbor
self.result = namedtuple('nearestInf','point distance')  #Store nearest neighbor information
self.nearestPoint = None
self.nearestDis = 0
def travel(node,depth):
if node != None:
axis = node.split
if point[axis]<node.dom_elt[axis]:
travel(node.left,depth+1)
else:
travel(node.right,depth+1)
distance = np.sqrt(np.sum((np.array(point) - np.array(node.dom_elt))**2))
if self.nearestPoint == None:
self.nearestPoint = node.dom_elt
self.nearestDis = distance
elif self.nearestDis > distance:
self.nearestPoint = node.dom_elt
self.nearestDis = distance
if abs(point[axis]-node.dom_elt[axis]) < self.nearestDis:
if point[axis] < node.dom_elt[axis]:
travel(node.right, depth + 1)
else:
travel(node.left, depth + 1)
travel(self.root,0)
return self.result(self.nearestPoint,self.nearestDis)

data = [[7,2],[4,7],[5,4],[2,3],[8,1],[9,6]]  #data set
tree1 = KdTree(data)
result = tree1.search_nearest([3,4.5])
print(result)

```
```nearestInf(point=[2, 3], distance=1.8027756377319946)
0.0
```

The result is points (2, 3) and the distance is 1.803

Keywords: Python Machine Learning

Added by R4000 on Sun, 20 Feb 2022 06:39:08 +0200