1, Foreword
Clustering refers to grouping similar data together. The grouping is done without any labels; the goal is simply to bring similar samples into the same cluster. Clustering is therefore an unsupervised learning method.
2, General process of clustering
- Data preparation: feature standardization and dimensionality reduction
- Feature selection: select the most effective features from the original feature set and store them in a vector
- Feature extraction: transform the selected features to form new, more salient features
- Clustering: measure similarity with a distance function and group the data into clusters
- Evaluation of clustering results: analyze the clustering results with metrics such as the sum of squared errors (SSE); a minimal SSE sketch follows this list
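As a concrete example of the evaluation step, here is a minimal sketch of the SSE metric. The helper name `sse` and the (points, assignments, centers) representation are illustrative choices, not part of the original post:

```python
from math import dist  # Python 3.8+


def sse(points: list, assignments: list, centers: list) -> float:
    """Sum of squared distances from each point to its assigned center."""
    return sum(dist(p, centers[c]) ** 2 for p, c in zip(points, assignments))
```

A smaller SSE means the points sit closer to their cluster centers, so it is commonly used to compare clusterings or to pick k (the "elbow" method).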
3, Data clustering methods
Data clustering methods can be divided into partition-based methods (e.g., k-means), density-based methods (e.g., DBSCAN), and hierarchical methods (e.g., agglomerative clustering).
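For orientation, the sketch below runs one representative algorithm from each family. It assumes scikit-learn, which the original post does not use; the toy data and parameter values are also illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [5.0, 8.0], [8.0, 8.0], [9.0, 11.0]])

# Partition-based: k-means needs the number of clusters preset
km_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Density-based: DBSCAN infers the number of clusters from density
db_labels = DBSCAN(eps=2.5, min_samples=2).fit_predict(X)

# Hierarchical: agglomerative clustering merges points bottom-up
ac_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(km_labels, db_labels, ac_labels)
```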
4, Criteria for measuring clustering algorithms
Different clustering algorithms have different strengths, weaknesses, and application conditions. Generally speaking, an algorithm can be assessed by the attributes of the data it accepts (such as input order and dimensionality), the preset conditions its model requires, and its processing capacity. The details are as follows:
1. Processing capacity of the algorithm: the ability to handle large data sets (i.e., algorithm complexity); the ability to deal with data noise; and the ability to handle clusters of arbitrary shape, including nested data with gaps;
2. Whether the algorithm needs preset conditions: whether the number of clusters must be known in advance, and whether the user must supply domain knowledge;
3. Data input attributes of the algorithm: whether the result depends on the order in which data are input, that is, whether the algorithm is order-independent; and whether the algorithm can handle data with many attributes, that is, whether it is sensitive to data dimensionality and whether it places requirements on data types.
5, Algorithm implementation
The following Python implementation seeds the centers with k-means++ and then runs the standard k-means iteration:

```python
# k-means++
import math
import random
from collections import defaultdict
from math import sqrt
from random import uniform


class KMeansClusterAlgorithm(object):
    '''
    this class is k-means cluster algorithm
    Author: xxx
    Date: 2022-02-10
    '''

    def __init__(self, dataset: list, k: int) -> None:
        '''
        Args:
            dataset: list, like [[x1, y1], [x2, y2]]
            k: int, the number of clusters to produce
        '''
        self.dataset = dataset
        self.k = k

    def point_avg(self, points: list) -> list:
        '''
        Accepts a list of points, each with the same number of dimensions
        (points may have more than 2 dimensions), and returns a new point
        that is the center of all the points.

        Args:
            points: list of points, like [[x, y], [x1, y1], [x2, y2]]
        Return:
            new_center: list
        '''
        dimensions = len(points[0])
        new_center = []
        for dimension in range(dimensions):
            dim_sum = 0
            for p in points:
                dim_sum += p[dimension]
            # average of each dimension
            new_center.append(dim_sum / float(len(points)))
        return new_center

    def update_centers(self, assignments: list) -> list:
        '''
        Accepts a list of assignments whose indexes correspond to
        self.dataset, computes the center of each assigned group, and
        returns one center per unique assignment.

        Args:
            assignments: list of cluster indexes, one per data point
        Return:
            centers: list, e.g. [[1, 2]]
        '''
        new_means = defaultdict(list)
        centers = []
        for assignment, point in zip(assignments, self.dataset):
            new_means[assignment].append(point)
        for points in new_means.values():
            centers.append(self.point_avg(points))
        return centers

    def distance(self, a: list, b: list) -> float:
        '''
        Calculate the Euclidean distance between two points.

        Args:
            a: list, point a, e.g. [1, 3]
            b: list, point b, e.g. [1, 3]
        Return:
            float, the distance between the two points
        '''
        dimensions = len(a)
        _sum = 0
        for dimension in range(dimensions):
            difference_sq = (a[dimension] - b[dimension]) ** 2
            _sum += difference_sq
        return sqrt(_sum)

    def _assign_points(self, centers: list) -> list:
        '''
        Assign each point in self.dataset to the index of its nearest
        center. The returned list has one element per data point; if
        there are Y centers, it contains at most Y unique values.

        Args:
            centers: list, e.g. [[3, 4]]
        Return:
            assignments: list
        '''
        assignments = []
        for point in self.dataset:
            shortest = float('inf')
            shortest_index = 0
            for i in range(len(centers)):
                val = self.distance(point, centers[i])
                if val < shortest:
                    shortest = val
                    shortest_index = i
            assignments.append(shortest_index)
        return assignments

    def _generate_k(self, centers: list = None) -> list:
        '''
        Random initialization (the plain k-means alternative to k-means++
        seeding): find the minimum and maximum of each coordinate in
        self.dataset and generate self.k random points within those ranges.

        Return:
            centers: list, e.g. [[1, 2]]
        '''
        if centers is None:  # avoid a mutable default argument
            centers = []
        dimensions = len(self.dataset[0])
        min_max = {}
        for point in self.dataset:
            for i in range(dimensions):
                val = point[i]
                min_key = f'min_{i}'
                max_key = f'max_{i}'
                if min_key not in min_max or val < min_max[min_key]:
                    min_max[min_key] = val
                if max_key not in min_max or val > min_max[max_key]:
                    min_max[max_key] = val
        for _ in range(self.k):
            rand_point = []
            for i in range(dimensions):
                min_val = min_max[f'min_{i}']
                max_val = min_max[f'max_{i}']
                rand_point.append(uniform(min_val, max_val))
            centers.append(rand_point)
        return centers

    def _euler_distance(self, point1: list, point2: list) -> float:
        '''
        Calculate the Euclidean distance between two points; supports
        any number of dimensions.
        '''
        distance = 0.0
        for a, b in zip(point1, point2):
            distance += math.pow(a - b, 2)
        return math.sqrt(distance)

    def get_closest_dist(self, point: list, centroids: list) -> float:
        '''
        Return the distance from a point to the nearest cluster center.

        Args:
            point: list
            centroids: list, the current cluster centers
        Return:
            min_dist: float
        '''
        min_dist = math.inf  # initially set to infinity
        for centroid in centroids:
            dist = self._euler_distance(centroid, point)
            if dist < min_dist:
                min_dist = dist
        return min_dist

    def _kpp_centers(self) -> list:
        '''
        k-means++ seeding: choose the first center at random, then choose
        each subsequent center by roulette-wheel selection, weighted by
        the squared distance to the nearest existing center.

        Return:
            cluster_centers: list of self.k cluster centers
        '''
        cluster_centers = [random.choice(self.dataset)]
        d = [0.0] * len(self.dataset)
        for _ in range(1, self.k):
            total = 0.0
            for i, point in enumerate(self.dataset):
                # squared distance to the nearest existing center,
                # as in the k-means++ seeding rule
                d[i] = self.get_closest_dist(point, cluster_centers) ** 2
                total += d[i]
            total *= random.random()
            # roulette-wheel selection of the next center
            for i, di in enumerate(d):
                total -= di
                if total > 0:
                    continue
                cluster_centers.append(self.dataset[i])
                break
        return cluster_centers

    def k_means_plusplus(self) -> tuple:
        '''
        Entry point of the k-means++ clustering algorithm.

        Return:
            (assignments, self.dataset): tuple of (result list, original data)
        '''
        # seed with k-means++; self._generate_k() would give random seeding
        k_points = self._kpp_centers()
        assignments = self._assign_points(k_points)
        old_assignments = None
        # iterate until the assignments stop changing
        while assignments != old_assignments:
            new_centers = self.update_centers(assignments)
            old_assignments = assignments
            assignments = self._assign_points(new_centers)
        return (assignments, self.dataset)
```
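A minimal usage sketch; the toy dataset below is illustrative, not from the original post:

```python
from random import seed

seed(42)  # make the random seeding reproducible

data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
        [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]]

algo = KMeansClusterAlgorithm(dataset=data, k=2)
labels, points = algo.k_means_plusplus()
print(labels)  # e.g. [0, 0, 0, 1, 1, 1] (cluster indexes may be swapped)
```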