Proceedings of the
The Nineteenth International Conference on Computational Intelligence and Security (CIS 2023)
December 1 – 4, 2023, Haikou, China

A Two-stage Clustering Algorithm Based on Partition and Merge
Zhihe Wang^a, Xinxin Shi^b, Hui Du^c and Huan Wang^d

School of Computer Science and Engineering, Northwest Normal University, China.

^awangzh@nwnu.edu.cn

^b1759358526@qq.com

^cduhuiywy@nwnu.edu.cn

^d1164136724@qq.com

ABSTRACT

Traditional clustering methods tend to consider only the relationships between data objects and ignore the distribution characteristics of the entire dataset. These result in incomplete data collection, inaccurate clustering results and poor performance on non-convex datasets. To address these problems, a two-stage clustering algorithm based on partitioning and merging is proposed. First, to improve the representation power of the dataset, the algorithm analyzes the dimensions of the dataset using principal dimension analysis and computes the maximum variance for each dimension. Based on the relative position of the dataset and the maximum variance vector, we can judge whether they have the same direction, thus dividing the dataset into multiple subclusters. Second, probability density estimation is used to measure the similarity between the subclusters, and the area size of the coincidence region of the Gaussian mixture distribution is compared to the threshold. If the area is greater than the threshold, the two subclusters are merged, otherwise not. Through experiments on synthetic and real datasets, it is proved that the algorithm can reduce the computational cost of large-scale datasets while maintaining the clustering accuracy. Keywords: Partition, Merge, Principal dimension analysis, KNN, Threshold, Gaussian model.

Download PDF