Implementing the k-means algorithm from scratch

k-means algorithm

Partition-based methods

Given a data set D with N data objects and a target number of clusters K (K < N), a partition method allocates the objects of D to K groups, each group representing one cluster. An objective function is then used to evaluate the quality of the partition, so that objects within a cluster are similar to one another but dissimilar to objects in other clusters.

Characteristics

  • Well suited to finding spherical, mutually exclusive clusters in small and medium-sized data sets.
  • Based on a distance measure.
  • The cluster center can be represented by the mean (as in k-means) or by a representative center point (as in k-medoids).

Introduction to K-means

Thanks to its speed and good scalability, the k-means clustering algorithm is the best-known clustering method. It alternates between two steps: move each cluster's center point, also known as its centroid, to the mean position of the members it currently contains, and then reassign every sample to the cluster of its nearest centroid. These two steps repeat until the centroids stop moving.

Related concepts of K-means algorithm

Distance measurement

Manhattan distance

Driving from one intersection to another in Manhattan, you obviously cannot travel in a straight line between the two points. This actual driving distance is the "Manhattan distance", also known as the "city block distance".
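In symbols, for two points x and y in m-dimensional space:

$$d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$$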

Euclidean distance

Euclidean distance is the most commonly used definition of distance; it is the straight-line distance between two points in m-dimensional space.
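In symbols:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$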

Cosine similarity

Cosine similarity measures the difference between two individuals as the cosine of the angle between their vectors in a vector space. Compared with distance measures, cosine similarity emphasizes the difference between two vectors in direction rather than in distance or length.
The closer the cosine value is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are; hence the name "cosine similarity".
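For two vectors x and y, the similarity is the cosine of the angle between them:

$$\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}$$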

Common distance code implementation

import numpy as np

def distance(x, y, p=2):
    '''
    input: x (ndarray): coordinates of the first sample
           y (ndarray): coordinates of the second sample
           p (int): p=1 gives the Manhattan distance, p=2 the Euclidean distance
    output: distance (float): the distance from x to y
    '''
    if p == 1:
        return np.sum(np.abs(x - y))
    elif p == 2:
        return np.sqrt(np.sum((x - y) ** 2))
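As a quick sanity check, two arbitrarily chosen 2-D points give a Manhattan distance of 7 and a Euclidean distance of 5:

x = np.array([1, 2])
y = np.array([4, 6])
print(distance(x, y, p=1))  # |1-4| + |2-6| = 7
print(distance(x, y, p=2))  # sqrt(9 + 16) = 5.0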

Calculate the squared Euclidean distance between one sample and every sample in the data set:

def euclidean_distance(one_sample, X):
    # Tile the single sample to match the shape of X, then return
    # one squared Euclidean distance per row of X
    one_sample = one_sample.reshape(1, -1)
    distances = np.power(np.tile(one_sample, (X.shape[0], 1)) - X, 2).sum(axis=1)
    return distances
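For example, comparing one sample against a small hypothetical data set returns one squared distance per row:

one_sample = np.array([0.0, 0.0])
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
print(euclidean_distance(one_sample, X))  # [ 1.  4. 25.]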

Centroid and objective function

Centroid

Short for "center of mass": the centroid of a system of particles is the average position of its mass distribution.
The centroid may change depending on the proximity measure and the clustering objective.
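With Euclidean distance as the proximity measure, the centroid of a cluster $C_i$ is simply the mean of its members:

$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$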

Proximity measure

Euclidean distance is usually used for points in Euclidean space, and cosine similarity for documents.
Here, we use Euclidean distance.

Objective function

For data whose proximity measure is Euclidean distance, we usually use the Sum of Squared Errors (SSE) as the objective function for measuring clustering quality. In other words, we compute the error of each data point, i.e. its Euclidean distance to the nearest centroid, and then sum the squares of these errors.
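In symbols, with $c_i$ the centroid of cluster $C_i$:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \left\| x - c_i \right\|^2$$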

K-means algorithm flow

Python class initialization

class MyKmeans():
    """K-means clustering algorithm.
    Parameters:
    -----------
    k: int
        Number of clusters.
    max_iterations: int
        Maximum number of iterations.
    varepsilon: float
        Convergence threshold: if the k cluster centers of the previous
        iteration differ from the k cluster centers of this iteration by
        less than varepsilon, the algorithm is considered to have converged.
    """
    def __init__(self, k=5, max_iterations=300, varepsilon=0.0001, random_state=2):
        self.k = k
        self.max_iterations = max_iterations
        self.varepsilon = varepsilon
        np.random.seed(random_state)

Select initial centroid

    # Randomly select self.k samples from the data set as the initial cluster centers
    def init_random_centroids(self, X):
        n_samples = np.shape(X)[0]
        # Sample without replacement so that no two initial centroids coincide
        # (duplicates would produce empty clusters and NaN centroids)
        indices = np.random.choice(n_samples, self.k, replace=False)
        centroids = X[indices]
        return centroids

Assign each object to the most similar cluster

    # Return the index in [0, self.k) of the centroid nearest to the sample
    def _closest_centroid(self, sample, centroids):
        distances = euclidean_distance(sample, centroids)
        closest_i = np.argmin(distances)
        return closest_i

    # Assign all samples: each sample goes to the cluster of its nearest center
    def create_clusters(self, centroids, X):
        n_samples = np.shape(X)[0]
        clusters = [[] for _ in range(self.k)]
        for sample_i, sample in enumerate(X):
            centroid_i = self._closest_centroid(sample, centroids)
            clusters[centroid_i].append(sample_i)
        return clusters

Update the centroid of the cluster

    # Update the centers: each new centroid is the mean of the samples in its cluster
    def update_centroids(self, clusters, X):
        n_features = np.shape(X)[1]
        centroids = np.zeros((self.k, n_features))
        for i, cluster in enumerate(clusters):
            centroid = np.mean(X[cluster], axis=0)
            centroids[i] = centroid
        return centroids

Calculate the label and centroid

    # Label every sample: the index of the cluster it belongs to is its class label
    def get_cluster_labels(self, clusters, X):
        y_pred = np.zeros(np.shape(X)[0])
        for cluster_i, cluster in enumerate(clusters):
            for sample_i in cluster:
                y_pred[sample_i] = cluster_i
        return y_pred

    # Return the centroid of each cluster in the final clustering result
    def get_cluster_centroids(self, clusters, X):
        centroids = np.zeros((self.k,np.shape(X)[1]))
        for i, cluster in enumerate(clusters):
            centroid = np.mean(X[cluster], axis=0)
            centroids[i] = centroid
        return centroids

Clustering

    # Run k-means clustering on the whole data set X and return the cluster labels and centroids
    def predict(self, X):
        # Randomly select self.k samples as the initial cluster centers
        centroids = self.init_random_centroids(X)

        # Iterate until the algorithm converges (the previous cluster centers
        # almost coincide with the current ones) or the maximum number of
        # iterations is reached
        for _ in range(self.max_iterations):
            # Assign every sample to its nearest center
            clusters = self.create_clusters(centroids, X)
            former_centroids = centroids

            # Compute the new cluster centers
            centroids = self.update_centroids(clusters, X)

            # Stop when the centers have barely moved: the algorithm has converged
            diff = centroids - former_centroids
            if np.all(np.abs(diff) < self.varepsilon):
                break

        return self.get_cluster_labels(clusters, X), self.get_cluster_centroids(clusters, X)
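A minimal usage sketch on synthetic data (make_blobs and its parameters are arbitrary choices for illustration; the class and helpers above must already be defined):

from sklearn.datasets import make_blobs

# Generate three well-separated 2-D blobs and cluster them
X, _ = make_blobs(n_samples=200, centers=3, random_state=2)
model = MyKmeans(k=3)
labels, centroids = model.predict(X)
print(labels[:10])
print(centroids)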

K-means in practice

Requirements

The epidemic situation varies from country to country in the data, in some cases greatly. We therefore first cluster the 78 countries in order to judge each country's epidemic-prevention effectiveness.

Data set

Data source: https://github.com/datasets/covid-19

Through data cleaning and integration, the data was brought into tabular form, one row per country (the sample table is not reproduced here).

Data reduction

Cumulative confirmed case density = cumulative confirmed cases / total population; density of existing confirmed cases = existing confirmed cases / total population.
Since these densities already encode the population, the total-population column is redundant and is deleted before running the k-means algorithm.
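In pandas this reduction is a one-liner, assuming the cleaned table has been loaded into a DataFrame df as in the code at the end of this article. There it is done positionally, since the population column happens to be the last one; the name-based variant below assumes a column header such as 'total population' and is only illustrative:

data = df.iloc[:, :-1]                        # keep every column except the last (total population)
# data = df.drop(columns=['total population'])  # equivalent, assuming this column name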

Data transformation

Data normalization is a basic operation in data mining. In practice, different features may be measured in inconsistent units, and their values may differ greatly in magnitude; left unprocessed, this can distort the results of the analysis. The data therefore needs to be scaled by some factor so that it falls into a specific range suitable for joint analysis.

Data normalization can effectively improve the accuracy and effectiveness of algorithms involving distance measurement. For the classification algorithm and clustering algorithm using distance measurement, data normalization is a particularly important step.

Methods

  • Min-max normalization (MinMaxScaler)
  • z-score normalization (zero-mean normalization, StandardScaler)
  • Normalization by decimal scaling
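A minimal sketch of min-max normalization and decimal scaling on an arbitrary sample array (z-score normalization is shown with StandardScaler below):

import numpy as np

data = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Min-max normalization: rescale each feature to [0, 1]
min_max = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# Decimal scaling: divide each feature by 10^j, where j is the smallest
# integer such that max(|value|) / 10^j < 1
j = np.ceil(np.log10(np.abs(data).max(axis=0)))
decimal_scaled = data / 10 ** j

print(min_max)
print(decimal_scaled)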

Introduction to sklearn.preprocessing in Python

The StandardScaler class in sklearn.preprocessing can be used for zero-mean (z-score) normalization:

scaler = StandardScaler()
scaler.fit(data)               # learn each feature's mean and standard deviation
data = scaler.transform(data)  # standardize each feature to zero mean and unit variance
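The two calls can also be combined into one with data = scaler.fit_transform(data).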

Practical k-means algorithm code

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from My_KMeans import MyKmeans

df = pd.read_csv('Test\\77 Country clustering.csv', header=0, index_col='country')
data = df.iloc[:, :-1]  # drop the redundant total-population column (the last one)

scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)

mykm = MyKmeans()
label, centroids = mykm.predict(data)
print(label)
result_centroid = scaler.inverse_transform(centroids)  # map centroids back to the original scale
# pd.DataFrame(result_centroid).to_csv('Test\\centroids.csv')
print(result_centroid)

# The KMeans class from sklearn achieves the same effect:
# km = KMeans(5, random_state=2)
# res = km.fit(data)
# print(res.labels_ + 1)
# result = scaler.inverse_transform(res.cluster_centers_)
# print(result)
