Collection of Tibetan Network
Abstract
With the development of Tibetan information technology, web crawler technology for Tibetan has become extremely important. We construct a web crawler that crawls different Tibetan websites, define page pretreatment rules tailored to each site, and save the collected Tibetan web text as Tibetan documents. Experiments show that the self-developed software and its pretreatment module can quickly and effectively build a large-scale Tibetan corpus, laying a foundation for Tibetan information processing technology.
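The per-site pretreatment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' software: the host name, the `SITE_RULES` table, the `min_ratio` threshold, and the function names are all hypothetical, and the filter simply assumes that Tibetan text can be recognized by the Unicode Tibetan block (U+0F00 to U+0FFF). Fetching pages is left out so the example stays self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site pretreatment rules (assumed for illustration, not from
# the paper): tags whose content is dropped, and the minimum share of Tibetan
# characters a text chunk needs in order to be kept.
SITE_RULES = {
    "example-tibetan-news.test": {
        "skip_tags": {"script", "style", "nav"},
        "min_ratio": 0.5,
    },
}
DEFAULT_RULE = {"skip_tags": {"script", "style"}, "min_ratio": 0.5}


def tibetan_ratio(text):
    """Fraction of non-space characters in the Tibetan block U+0F00..U+0FFF."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0f00" <= c <= "\u0fff" for c in chars) / len(chars)


class TextExtractor(HTMLParser):
    """Collects text chunks, ignoring anything nested inside skipped tags."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = skip_tags
        self.depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def pretreat(url, html):
    """Apply the rule for the page's host and return the Tibetan text chunks."""
    rule = SITE_RULES.get(urlparse(url).hostname, DEFAULT_RULE)
    parser = TextExtractor(rule["skip_tags"])
    parser.feed(html)
    parser.close()
    return [c for c in parser.chunks if tibetan_ratio(c) >= rule["min_ratio"]]
```

In a full crawler, `pretreat` would be called on each downloaded page and the returned chunks written to one document per page; the rule table is where site-specific differences would be recorded.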
Keywords
Web crawler, Pretreatment, Tibetan corpus
Publication Date
DOI
10.12783/dtcse/cmsam2016/3628