Improved K-means fast clustering algorithm based on Spark
1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China; 2. Zhenjiang Branch, Jiangsu Union Technical Institute, Zhenjiang, Jiangsu 212016, China
Abstract: As the volume of data handled by clustering algorithms keeps growing and the demand for timely results keeps rising, a fast K-means clustering algorithm, Spark-KM, was proposed based on the distributed computing framework Spark. K-means tends to converge to a local optimum when the initial cluster centers are poorly chosen, and its iteration time increases on large-scale data. To address both problems, the K-means algorithm was improved by combining pre-sampling with the maximum-minimum distance method for selecting initial centers. The original data were partitioned into matrix blocks and stored on different nodes of the Spark computing framework. Following the improved K-means algorithm, the Spark platform was combined with distributed matrix computation to cluster large data sets quickly. The results show that the algorithm effectively reduces the amount of data moved between nodes and has good scalability. Comparative tests in stand-alone and cluster environments show that the algorithm is suited to large-scale data, that its performance is proportional to the data size, and that it performs much better in the cluster environment than in the stand-alone environment.
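The abstract does not spell out the initialization procedure, so the following is only a minimal single-machine sketch of one common reading of "pre-sampling combined with maximum-minimum distance": draw a random sample, then repeatedly add the sampled point whose distance to its nearest already-chosen center is largest. The function name `max_min_centers` and the parameters `sample_size` and `seed` are hypothetical, and the code does not use Spark or reproduce the Spark-KM implementation.

```python
import math
import random


def max_min_centers(points, k, sample_size, seed=0):
    """Choose k initial centers from a random pre-sample (illustrative only).

    points: list of equal-length numeric tuples.
    sample_size and seed are assumed parameters, not taken from the paper.
    """
    rng = random.Random(seed)
    # Pre-sampling: work on a random subset instead of the full data set.
    sample = rng.sample(points, min(sample_size, len(points)))
    if k > len(sample):
        raise ValueError("k must not exceed the number of sampled points")

    # Start from the first sampled point (an arbitrary choice).
    centers = [sample[0]]
    # nearest[i] is the distance from sample[i] to its closest chosen center.
    nearest = [math.dist(p, centers[0]) for p in sample]

    while len(centers) < k:
        # Maximum-minimum rule: take the point farthest from all chosen centers.
        idx = max(range(len(sample)), key=nearest.__getitem__)
        new_center = sample[idx]
        centers.append(new_center)
        # Update each point's distance to its nearest center.
        nearest = [min(d, math.dist(p, new_center))
                   for d, p in zip(nearest, sample)]
    return centers
```

The returned centers would then seed the ordinary K-means iterations. Because each selection step only needs one pass over the sample, the cost of initialization depends on the sample size rather than on the full data size, which is presumably the motivation for pre-sampling.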