Details of Research Outputs

Title: Enhancing duplicate collection detection through replica boundary discovery
Creator: Zhang, Zhigang; Jia, Weijia; Li, Xiaoming
Date Issued: 2006
Conference Name: 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2006
Source Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Editor: Ng, WK; Kitsuregawa, M; Li, J; Chang, K
ISBN: 3540332065; 9783540332060
ISSN: 0302-9743; 1611-3349
Volume: 3918
Pages: 361-370
Conference Date: April 9-12, 2006
Conference Place: Singapore
Publisher: Springer Verlag
Abstract

Web documents are widely replicated on the Internet, and these replicas can cause problems for Web-based information systems, so replica detection on the Web is an indispensable task. The challenge is to find duplicated collections in a very large data set with limited hardware resources and in acceptable time. In this paper, we first introduce the notion of a replica boundary, which roughly characterizes the extent of replication; we then propose an effective and efficient approach to discovering the boundary of the replicas. The proposed approach has two advantages: first, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional replicated-document detection approaches; second, it identifies the boundary of replicated collections accurately, showing to what extent two collections are replicated. We evaluated the accuracy of the approach on two sets of Web pages containing 24 million and 30 million pages, respectively. © Springer-Verlag Berlin Heidelberg 2006.

DOI: 10.1007/11731139_42
Indexed By: SCIE; CPCI-S
Language: English
WOS Research Area: Computer Science
WOS Subject: Computer Science, Artificial Intelligence; Computer Science, Information Systems
WOS ID: WOS:000237249600042
Citation statistics
Cited Times [WOS]: 0
Document Type: Conference paper
Identifier: http://repository.uic.edu.cn/handle/39GCC9TT/4602
Collection: Research outside affiliated institution
Affiliation
1. Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
2. Institute of Network Computing and Information Systems, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Recommended Citation
GB/T 7714
Zhang, Zhigang, Jia, Weijia, Li, Xiaoming. Enhancing duplicate collection detection through replica boundary discovery[C]//Ng, WK; Kitsuregawa, M; Li, J; Chang, K: Springer Verlag, 2006: 361-370.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.