Research and Application of Improved K-means Algorithm in Text Clustering

Shen-yi QIAN, Hui-hui LIU, Dai-yi LI

Abstract


K-means is a commonly used text clustering algorithm whose main advantages are simplicity and speed. However, because the initial cluster centers are selected at random, K-means easily falls into a local optimum, and both the clustering results and the number of iterations are unstable. To address this problem, this paper selects the initial cluster centers with a hierarchical agglomerative clustering algorithm to ensure high-quality center points, uses cosine similarity to measure the distance between texts, and reconstructs both the formula for calculating cluster centers and the objective function for clustering quality. Experiments using the Sogou Chinese text corpus as the data set show that the improved K-means algorithm achieves relatively high accuracy and stability.
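
The general idea summarized in the abstract (seeding K-means from an agglomerative pass and comparing texts by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the reconstructed center formula or objective function, so the sketch assumes a normalized mean vector as the cluster center and centroid-linkage merging for the agglomerative step. All function names (`normalize`, `cosine`, `centroid`, `agglomerative_seeds`, `kmeans_cosine`) are hypothetical, and documents are assumed to be already converted to numeric term vectors.

```python
import math


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Assumed center formula: the normalized mean of the member vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def agglomerative_seeds(docs, k):
    """Merge the two most similar clusters until k remain; return their centers."""
    clusters = [[d] for d in docs]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return [centroid(c) for c in clusters]


def kmeans_cosine(docs, k, max_iter=100):
    """K-means with cosine similarity, seeded deterministically (no random init)."""
    docs = [normalize(d) for d in docs]
    centers = agglomerative_seeds(docs, k)
    labels = None
    for _ in range(max_iter):
        # Assign each document to the center it is most similar to.
        new_labels = [
            max(range(k), key=lambda c: cosine(d, centers[c])) for d in docs
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each center from its current members.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers


# Toy term vectors (made up for illustration): three pairs of similar documents.
docs = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0], [0.1, 0.0, 0.9],
]
labels, centers = kmeans_cosine(docs, k=3)
```

Because the seeds come from a deterministic merge rather than a random draw, repeated runs on the same input give the same clustering, which is the stability property the abstract refers to. The pairwise merge loop in this sketch recomputes centroids on every pass and is only practical for small inputs.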

Keywords


K-means clustering algorithm, Hierarchical clustering algorithm, Text distance, Objective function, F measure


DOI
10.12783/dtcse/pcmm2018/23653
