The Research of a Spider Based on Crawling Algorithm
Abstract
This paper presents an in-depth study of the web spider in three areas: its workflow, key technologies, and software algorithms. It analyzes the workflow and key technologies of the URL-oriented spider in detail, and proposes managing the URL list with several queues, sorting the URLs by document correlativity so that HTML files can be downloaded at high speed. The aim of this paper is to design a well-adjusted and fully functional software model of the spider. Sun JDK, Borland JBuilder, SQL Server, IIS, and the Bot package are used as the software development environment.
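The queue-based URL management described above can be sketched as a priority frontier: pending URLs are ordered by a document-correlativity score so that the most relevant pages are fetched first, and URLs scoring below a threshold are discarded. This is an illustrative sketch only; the class and field names (`Frontier`, `UrlEntry`, `THRESHOLD`) and the fixed threshold value are assumptions, not taken from the paper.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of a relevance-ordered URL frontier.
public class Frontier {
    static final double THRESHOLD = 0.5; // assumed cutoff; URLs below it are dropped

    static class UrlEntry {
        final String url;
        final double relevance; // document correlativity of the page linking here
        UrlEntry(String url, double relevance) {
            this.url = url;
            this.relevance = relevance;
        }
    }

    // Highest-relevance URL is polled first.
    private final PriorityQueue<UrlEntry> queue = new PriorityQueue<>(
        Comparator.comparingDouble((UrlEntry e) -> e.relevance).reversed());

    /** Enqueue a URL if its relevance score meets the threshold. */
    public void offer(String url, double relevance) {
        if (relevance >= THRESHOLD) {
            queue.add(new UrlEntry(url, relevance));
        }
    }

    /** Return the next URL to download, or null if the frontier is empty. */
    public String next() {
        UrlEntry e = queue.poll();
        return e == null ? null : e.url;
    }

    public static void main(String[] args) {
        Frontier f = new Frontier();
        f.offer("http://example.com/a", 0.9);
        f.offer("http://example.com/b", 0.2); // below threshold, skipped
        f.offer("http://example.com/c", 0.7);
        System.out.println(f.next()); // prints http://example.com/a
        System.out.println(f.next()); // prints http://example.com/c
    }
}
```

A real spider would run several such queues (for example, one per site or per depth level, as the abstract suggests) and feed them from a pool of downloader threads.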
Keywords
Spider, URL Seed, Scope First, Document Correlativity, Threshold
DOI
10.12783/dtcse/aice-ncs2016/5717