File name: InformationExtractionAlgorithms
Description -- all downloadable content comes from the Internet; please study and use it at your own discretion
A paper on web page information extraction. [Abstract] This paper proposes and implements a main-text extraction algorithm based on the text density of a web page. The algorithm relies mainly on the proportion of Chinese characters in each line of a Chinese web page's source code to distinguish body-text lines from non-body lines. It is supplemented by related algorithms that identify body-text blocks in the pseudo source code, so that genuine body text can be separated from noise and the main text of Chinese web pages can be extracted. Experimental results show that the method is practical and offers high accuracy and generality.
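The idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the 0.3 threshold, the `max_gap` tolerance, the rule of keeping the block with the most Chinese characters, and the function names `chinese_ratio` and `extract_body` are all assumptions made for illustration, and the block-grouping step only stands in for the paper's block identification algorithms, whose details are not given here.

```python
import re

# CJK Unified Ideographs (basic block); punctuation is deliberately not counted.
CJK = re.compile(r"[\u4e00-\u9fff]")
TAG = re.compile(r"<[^>]+>")


def chinese_ratio(line: str) -> float:
    """Fraction of non-whitespace characters in a source line that are Chinese."""
    compact = "".join(line.split())
    if not compact:
        return 0.0
    return len(CJK.findall(compact)) / len(compact)


def extract_body(html: str, threshold: float = 0.3, max_gap: int = 1) -> str:
    """Return the densest block of body-text lines (hypothetical heuristic).

    A line counts as body text when its Chinese-character ratio reaches
    `threshold` (an assumed value). Body lines separated by at most
    `max_gap` non-body lines are grouped into one block.
    """
    blocks, current, gap = [], [], 0
    for line in html.splitlines():
        if chinese_ratio(line) >= threshold:
            current.append(line)
            gap = 0
        elif current:
            gap += 1
            if gap > max_gap:  # too many non-body lines: close the block
                blocks.append(current)
                current, gap = [], 0
    if current:
        blocks.append(current)
    if not blocks:
        return ""
    # Assumption: the block with the most Chinese characters is the main text.
    best = max(blocks, key=lambda b: sum(len(CJK.findall(l)) for l in b))
    return "\n".join(TAG.sub("", l).strip() for l in best)


if __name__ == "__main__":
    page = "\n".join([
        '<div class="nav"><a href="/">首页</a></div>',
        "<p>这是正文的第一段,介绍了基于文字密度的网页正文提取方法。</p>",
        "<p>这是正文的第二段,说明了如何区分正文行和非正文行。</p>",
        '<div class="footer"><a href="/about">关于</a></div>',
    ])
    print(extract_body(page))
```

Navigation and footer lines carry far more markup than Chinese text, so their ratio stays well below the threshold, while paragraph lines score high. A real page would need the threshold tuned and scripts and styles handled separately.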
(Generated automatically by the system; you can review the contents before downloading)
Download file list
Information Extraction Algorithms and Its Application Based on Word Density in a Webpage.pdf