Chinese Data Feature Extraction Optimization in Data Detection
Abstract
While microblogging is developing rapidly in China, microblog messages are flooded with a large amount of repetitive information. Among existing similarity computation algorithms, simhash offers comparatively good precision and efficiency. In this paper, based on the real-world microblog setting, we propose a deep optimization of traditional simhash through a segmentation optimization algorithm (Combined-Analyzer) and a weight optimization algorithm (FFBOT-FID). To a certain extent, Combined-Analyzer addresses the large number of internet words that appear in short microblog texts, and FFBOT-FID addresses the difficulty of computing term weights caused by short text and timeliness. The optimized method is applied to microblog de-duplication, and the experimental results show that it achieves higher precision and recall than the traditional segmentation algorithm.
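For orientation, the baseline that the abstract refers to as traditional simhash can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not describe how Combined-Analyzer or FFBOT-FID work, so the sketch assumes the text is already segmented into tokens and uses plain term frequency as a stand-in weight. The 64-bit fingerprint width, the MD5-based token hash, and the helper names `term_frequency_weights`, `simhash`, and `hamming_distance` are all assumptions made for the example.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def term_frequency_weights(tokens):
    """Stand-in weighting: raw term frequency (NOT the paper's FFBOT-FID)."""
    return dict(Counter(tokens))


def simhash(weighted_tokens):
    """Return a 64-bit simhash fingerprint for a {token: weight} mapping."""
    totals = [0.0] * FINGERPRINT_BITS
    for token, weight in weighted_tokens.items():
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each bit of the token hash votes +weight (bit set) or -weight (bit clear).
        for i in range(FINGERPRINT_BITS):
            if (token_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # A fingerprint bit is 1 where the weighted vote is positive.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical pre-segmented microblog posts (segmentation is out of scope here).
post_a = ["今天", "天气", "真好", "转发", "微博"]
post_b = ["今天", "天气", "真好", "转发"]
fp_a = simhash(term_frequency_weights(post_a))
fp_b = simhash(term_frequency_weights(post_b))
print(hamming_distance(fp_a, fp_b))
```

Two posts are typically treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. Under this framing, the segmentation step determines which tokens enter `weighted_tokens` and the weighting step determines how strongly each token votes, which are the two places the abstract says the proposed optimizations act.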
Keywords
Data cleaning, Data detection, Simhash, Short text processing, Microblog
DOI
10.12783/dtetr/oect2017/16126