Collection of Tibetan Network

Chang-zhi WANG, Guixian XU, Hui WANG

Abstract


With the development of Tibetan information technology, Tibetan web crawler technology has become extremely important. We construct a Web crawler that crawls different Tibetan websites, define page pretreatment rules tailored to each site, and save the collected Tibetan Web text as Tibetan documents. Experiments show that the self-made software and its pretreatment module can quickly and effectively build a large-scale Tibetan corpus, laying a foundation for Tibetan information processing technology.
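
The per-site pretreatment idea described above might be sketched as follows. This is only an illustration under stated assumptions, not the authors' software: the host name `example-tibetan-news.test`, the functions `default_rule`, `strip_nav_rule`, and `pretreat`, and the choice to keep only characters from the Tibetan Unicode block (U+0F00 to U+0FFF) are all hypothetical. Fetching pages is left out so the sketch runs without network access.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Runs of characters from the Tibetan Unicode block (U+0F00 to U+0FFF),
# optionally joined by whitespace. Filtering on this block is an assumption.
TIBETAN_RUN = re.compile(r"[\u0F00-\u0FFF]+(?:\s+[\u0F00-\u0FFF]+)*")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def default_rule(html):
    """Generic rule: strip markup and return the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def strip_nav_rule(html):
    """Hypothetical site-specific rule: drop a navigation block first."""
    html = re.sub(r'<div class="nav">.*?</div>', "", html, flags=re.S)
    return default_rule(html)


# Hypothetical mapping from host name to that site's pretreatment rule.
SITE_RULES = {"example-tibetan-news.test": strip_nav_rule}


def pretreat(url, html):
    """Apply the rule registered for the page's host, then keep Tibetan text."""
    host = urlparse(url).hostname or ""
    rule = SITE_RULES.get(host, default_rule)
    text = rule(html)
    return "\n".join(m.group(0) for m in TIBETAN_RUN.finditer(text))
```

A crawler built this way would pass each downloaded page through `pretreat` and write the returned string to a document file; supporting a new site then only requires registering one more rule in `SITE_RULES`.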

Keywords


Web crawler, Pretreatment, Tibetan corpus

Publication Date


2016-11-17


DOI
10.12783/dtcse/cmsam2016/3628
