Daily summary of "Python Data Operation": KMeans cluster analysis

Content introduction

This article introduces cluster analysis through a simple Python example that uses KMeans for clustering.

Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar (in some sense) to each other than to objects in other groups (clusters). It is a main task of exploratory data mining and a common technique in statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

General application scenarios

  • Group classification of target users: cluster the target population on variables chosen for operational or business purposes, dividing it into several segments with distinct characteristics; refined, personalized operations and services can then be applied to each segment to improve operational efficiency and business results.
  • Value combination of different products: cluster product categories on specific indicator variables, subdividing the product system into multi-dimensional combinations with different values and purposes, and formulate corresponding product development, operation, and service plans on that basis.
  • Exploration and discovery of anomalies and outliers: mainly a risk-control application, since outliers may carry a component of fraud risk.

Common methods of clustering

Clustering methods can be divided into partition-based, hierarchical, density-based, grid-based, statistical, and model-based algorithms. Typical algorithms include K-means (the classic clustering algorithm), DBSCAN, two-step clustering, BIRCH, and spectral clustering.
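
As an illustration (not from the original article), several of these algorithms can be instantiated in scikit-learn as follows; the hyperparameter values shown are placeholder assumptions:

from sklearn.cluster import KMeans, DBSCAN, Birch, SpectralClustering

models = {
    'KMeans': KMeans(n_clusters=3, random_state=0),                # partition-based
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5),                      # density-based
    'BIRCH': Birch(n_clusters=3),                                  # hierarchy-based
    'Spectral': SpectralClustering(n_clusters=3, random_state=0),  # graph/spectral
}
# Each model exposes fit_predict(X) and returns one cluster label per sample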

Implementation of KMeans clustering

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
import random

# Randomly generate 100 samples, each with 3 features
feature = [[random.random(), random.random(), random.random()] for i in range(100)]
# Randomly assign one of 3 classes to each sample; these serve as synthetic "ground truth" labels for the supervised metrics below
label = [random.randint(0, 2) for i in range(100)]

# Convert the feature list to a NumPy array
x_feature = np.array(feature)

# Train the clustering model
n_clusters = 3  # Set the number of clusters
model_kmeans = KMeans(n_clusters=n_clusters, random_state=0)  # Create the clustering model object
model_kmeans.fit(x_feature)  # Fit the model to the data
y_pre = model_kmeans.predict(x_feature)  # Predict a cluster label for each training sample
print(y_pre)  # Inspect the predicted labels

Evaluation metrics for clustering

inertia_ is an attribute of the fitted KMeans model object: the sum of squared distances from each sample to its closest cluster center. It serves as an unsupervised evaluation metric that requires no ground-truth labels. The smaller the value, the better: a small inertia means the samples within each cluster are tightly concentrated, i.e. the within-cluster distances are small.

# Sum of squared distances of samples to their closest cluster center
inertias = model_kmeans.inertia_  
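
Because inertia keeps shrinking as n_clusters grows, it cannot be compared directly across different cluster counts; a common heuristic is the "elbow" method. The following is a minimal sketch (not from the original code) that reuses x_feature and scans a range of k values:

# Elbow method sketch: compute inertia for a range of cluster counts
inertia_per_k = []
k_range = list(range(1, 10))
for k in k_range:
    km = KMeans(n_clusters=k, random_state=0).fit(x_feature)
    inertia_per_k.append(km.inertia_)
plt.plot(k_range, inertia_per_k, marker='o')  # Look for the "elbow" where the curve flattens
plt.xlabel('n_clusters')
plt.ylabel('inertia')
plt.show()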

adjusted_rand_s: the Adjusted Rand Index (ARI) computes a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in the predicted and true clusterings. Through adjustment of the raw Rand index, a random labelling obtains a value close to 0 regardless of the number of samples and clusters. Its range is [-1, 1]: negative values indicate a worse-than-chance result, and the closer to 1, the better the clustering agrees with the ground truth.

# Adjusted Rand index
adjusted_rand_s = metrics.adjusted_rand_score(label, y_pre)  
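
One useful property of ARI: it is invariant to the naming of the clusters, which matters because KMeans assigns cluster IDs arbitrarily. A tiny illustrative check with toy labels (not from the example above):

# A relabelled but otherwise perfect clustering still scores 1.0
print(metrics.adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0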

mutual_info_s: mutual information (MI) measures how much information one random variable carries about another; here it measures the agreement between two labellings of the same data. The result is a non-negative value.

# Mutual information
mutual_info_s = metrics.mutual_info_score(label, y_pre) 

adjusted_mutual_info_s: adjusted mutual information (AMI) is a chance-adjusted version of the MI score. It accounts for the fact that MI is generally higher for labellings with more clusters, regardless of whether more information is actually shared, and corrects this effect. AMI returns 1 when the two label assignments are identical (up to a permutation of label names); the expected AMI of random, independent labellings is about 0, and it can also be negative.

# Adjusted mutual information
adjusted_mutual_info_s = metrics.adjusted_mutual_info_score(label, y_pre)  
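
The difference between MI and AMI shows up clearly on unrelated labellings: raw MI is typically positive even when the labels share no real information, while AMI is adjusted to stay around 0. A minimal sketch with two independently drawn random labellings:

# MI vs AMI on two independent random labellings of 100 samples
rand_a = [random.randint(0, 2) for _ in range(100)]
rand_b = [random.randint(0, 2) for _ in range(100)]
print(metrics.mutual_info_score(rand_a, rand_b))           # typically > 0
print(metrics.adjusted_mutual_info_score(rand_a, rand_b))  # typically near 0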

homogeneity_s: homogeneity score. A clustering result satisfies homogeneity if every cluster contains only data points belonging to a single class. Its range is [0, 1]; the larger the value, the better the clustering agrees with the ground truth.

# Homogeneity score
homogeneity_s = metrics.homogeneity_score(label, y_pre)  

completeness_s: completeness score. A clustering result satisfies completeness if all data points that are members of a given class are assigned to the same cluster. Its range is [0, 1]; the larger the value, the better the clustering agrees with the ground truth.

# Completeness score
completeness_s = metrics.completeness_score(label, y_pre)  
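
Homogeneity and completeness fail in opposite directions: splitting a true class across several clusters keeps homogeneity perfect but lowers completeness, while merging classes into one cluster does the reverse. A tiny illustrative check with toy labels:

# Prediction splits true class 1 into clusters 1 and 2
truth = [0, 0, 1, 1]
split = [0, 0, 1, 2]
print(metrics.homogeneity_score(truth, split))   # 1.0: every cluster is pure
print(metrics.completeness_score(truth, split))  # < 1.0: class 1 is split up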

v_measure_s: the V-measure is the harmonic mean of homogeneity and completeness: v = 2 * (homogeneity * completeness) / (homogeneity + completeness). Its range is [0, 1]; the larger the value, the better the clustering agrees with the ground truth. The relation can be verified directly, as shown below.

# V-measure score
v_measure_s = metrics.v_measure_score(label, y_pre)  
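
The harmonic-mean relation can be checked against the scores computed above:

# V-measure equals the harmonic mean of homogeneity and completeness
check = 2 * homogeneity_s * completeness_s / (homogeneity_s + completeness_s)
print(check, v_measure_s)  # The two values should match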

silhouette_s: the silhouette coefficient. silhouette_score computes the mean silhouette coefficient over all samples, based on each sample's mean intra-cluster distance and its mean distance to the samples of the nearest other cluster. It is an unsupervised evaluation metric. The best value is 1 and the worst is -1; values near 0 indicate overlapping clusters, and negative values usually mean samples have been assigned to the wrong cluster.

# Mean silhouette coefficient
silhouette_s = metrics.silhouette_score(x_feature, y_pre, metric='euclidean')  
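
Beyond the mean score, scikit-learn's silhouette_samples returns one value per sample, which can flag points that may be assigned to the wrong cluster; a minimal sketch:

# Per-sample silhouette values; negative entries suggest misassigned points
sample_silhouettes = metrics.silhouette_samples(x_feature, y_pre, metric='euclidean')
print(np.where(sample_silhouettes < 0)[0])  # Indices of suspect samples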

calinski_harabasz_s: the Calinski-Harabasz score, defined as the ratio of between-cluster dispersion to within-cluster dispersion. It is an unsupervised evaluation metric: the higher the score, the denser and better separated the clusters.

# Calinski-Harabasz score
calinski_harabasz_s = metrics.calinski_harabasz_score(x_feature, y_pre)  
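
Since a higher Calinski-Harabasz score indicates denser, better-separated clusters, it can also help choose n_clusters; a sketch (not from the original code) scanning a few candidate values of k:

# Compare Calinski-Harabasz scores for several cluster counts
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, random_state=0).fit_predict(x_feature)
    print(k, metrics.calinski_harabasz_score(x_feature, labels_k))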

Visualization of the clustering result

# Visualize the clustering result
centers = model_kmeans.cluster_centers_  # Cluster centers, one row per cluster
colors = ['#4EACC5', '#FF9C34', '#4E9A06']  # One color per cluster
plt.figure()  # Create the figure
for i in range(n_clusters):  # Loop over the clusters
    index_sets = np.where(y_pre == i)  # Indices of the samples in cluster i
    cluster = x_feature[index_sets]  # Subset of samples belonging to cluster i
    plt.scatter(cluster[:, 0], cluster[:, 1], c=colors[i], marker='.')  # Plot the cluster's samples (only the first two of the three features)
    plt.plot(centers[i][0], centers[i][1], 'o', markerfacecolor=colors[i], markeredgecolor='k',
             markersize=6)  # Mark the cluster center
plt.show()  # Display the figure

Data prediction

# Model application
new_X = [1, 3.6, 9.9]  # A new sample with 3 feature values
cluster_label = model_kmeans.predict(np.array(new_X).reshape(1, -1))  # predict expects a 2D array
print('The cluster prediction result is: %d' % cluster_label[0])
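
predict accepts any number of rows at once, and the fitted model's transform method returns each sample's distance to every cluster center, which can serve as a rough confidence check. A sketch with made-up feature values:

# Batch prediction and distances to each cluster center (feature values are made up)
new_batch = np.array([[1.0, 3.6, 9.9],
                      [0.2, 0.5, 0.8]])
print(model_kmeans.predict(new_batch))    # One cluster label per row
print(model_kmeans.transform(new_batch))  # Distance of each row to each center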
