A Comparison of Internal Feature Measures in Statistical-based New Words Extraction
Abstract
New word extraction is an essential prerequisite for Chinese natural language processing and text mining. Statistical methods are the most widely used approach to new word extraction, and they rely mainly on two kinds of statistical features: internal features and contextual features. This paper compares eight internal feature measures for Chinese new word extraction, evaluating each one individually. Seven of them are widely used internal feature measures; the eighth, normalized multiword expression distance (NMED), is a non-compositionality measure for English multiword expressions and is introduced to this task for the first time. The experimental results indicate that NMED outperforms the other seven measures.
Keywords
New words extraction, Text mining, Internal feature measures, NMED
DOI
10.12783/dtcse/smce2017/12444