Title | Enhancing duplicate collection detection through replica boundary discovery |
Authors | Zhang, Zhigang; Jia, Weijia; Li, Xiaoming |
Publication date | 2006 |
Conference name | 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2006 |
Proceedings title | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
Proceedings editors | Ng, WK; Kitsuregawa, M; Li, J; Chang, K |
ISBN | 3540332065; 9783540332060 |
ISSN | 0302-9743; 1611-3349 |
Volume | 3918 |
Pages | 361-370 |
Conference date | April 9-12, 2006 |
Conference location | Singapore |
Publisher | Springer Verlag |
Abstract | Web documents are widely replicated on the Internet. These replicated documents bring potential problems to Web based information systems. So replica detection on the Web is an indispensable task. The challenge is to find these duplicated collections from a very large data set with limited hardware resources in acceptable time. In this paper, we first introduce the notion of replica boundary to roughly reflect the situation of the replicas; then we propose an effective and efficient approach to discover the boundary of the replicas. The advantages of the proposed approach include: first, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional replicated document detection approaches; second, it can identify the boundary of the replicated collections accurately, demonstrating to what extent two collections are replicated. On two web page sets containing 24 million and 30 million Web pages respectively, we evaluated the accuracy of the approach. © Springer-Verlag Berlin Heidelberg 2006. |
DOI | 10.1007/11731139_42 |
Indexed in | SCIE; CPCI-S |
Language | English |
WOS research area | Computer Science |
WOS categories | Computer Science, Artificial Intelligence; Computer Science, Information Systems |
WOS accession number | WOS:000237249600042 |
Document type | Conference paper |
Item identifier | https://repository.uic.edu.cn/handle/39GCC9TT/4602 |
Collection | Individual research output produced outside this institution |
Author affiliations | 1. Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China; 2. Institute of Network Computing and Information Systems, School of Electronics Engineering and Computer Science, Peking University, Beijing, China |
Recommended citation (GB/T 7714) | Zhang, Zhigang,Jia, Weijia,Li, Xiaoming. Enhancing duplicate collection detection through replica boundary discovery[C]//Ng, WK; Kitsuregawa, M; Li, J; Chang, K: Springer Verlag, 2006: 361-370. |
Files in this item | No files are associated with this item. |