Research Output Details

Title: Enhancing duplicate collection detection through replica boundary discovery
Authors: Zhang, Zhigang; Jia, Weijia; Li, Xiaoming
Publication date: 2006
Conference name: 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2006
Proceedings title: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Proceedings editors: Ng, WK; Kitsuregawa, M; Li, J; Chang, K
ISBN: 3540332065; 9783540332060
ISSN: 0302-9743; 1611-3349
Volume: 3918
Pages: 361-370
Conference date: April 9-12, 2006
Conference location: Singapore
Publisher: Springer Verlag
Abstract

Web documents are widely replicated on the Internet. These replicated documents cause potential problems for Web-based information systems, so replica detection on the Web is an indispensable task. The challenge is to find these duplicated collections in a very large data set with limited hardware resources and in acceptable time. In this paper, we first introduce the notion of a replica boundary to roughly reflect the situation of the replicas; we then propose an effective and efficient approach to discover the boundary of the replicas. The advantages of the proposed approach include: first, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional replicated document detection approaches; second, it can identify the boundary of the replicated collections accurately, demonstrating to what extent two collections are replicated. We evaluated the accuracy of the approach on two Web page sets containing 24 million and 30 million Web pages, respectively. © Springer-Verlag Berlin Heidelberg 2006.

DOI: 10.1007/11731139_42
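Indexed in: SCIE; CPCI-S
Language: English
WOS research area: Computer Science
WOS category: Computer Science, Artificial Intelligence; Computer Science, Information Systems
WOS accession number: WOS:000237249600042
Citation statistics
Times cited [WOS]: 0
Document type: Conference paper
Item identifier: https://repository.uic.edu.cn/handle/39GCC9TT/4602
Collection: Individual knowledge output produced outside this institution
Author affiliations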
1.Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
2.Institute of Network Computing and Information Systems, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Recommended citation
GB/T 7714
Zhang, Zhigang, Jia, Weijia, Li, Xiaoming. Enhancing duplicate collection detection through replica boundary discovery[C]//Ng, WK; Kitsuregawa, M; Li, J; Chang, K: Springer Verlag, 2006: 361-370.