File name: Nutch-Web
Description -- all downloadable content comes from the Internet; please study and use it at your own discretion
Based on a comparative analysis of representative open-source web crawling software (Nutch, Heritrix, WCT, and Web-Harvest), the paper proposes a Nutch-based system for targeted harvesting of websites. It focuses on four key issues of this system: selecting the initial seed websites, managing the harvest process, removing noise from web page content, and discovering new seed websites.
(Automatically generated by the system; you can review the download contents before downloading)
Download file list
Nutch-Web.caj