Call for Papers
About the Journal
Editorial Board
Publication Ethics
Instructions for Authors
Announcements
Current Issue
Back Issues
Search for Articles
Categories
Back Issues
 

JCSE, vol. 13, no. 2, pp.78-88, June, 2019

DOI: http://dx.doi.org/10.5626/JCSE.2019.13.2.78

Near-Duplication Document Detection Using Weight One Permutation Hashing

Xinpan Yuan, Songlin Wang, and Xiaojun Deng
School of Computer Science, Hunan University of Technology, Zhuzhou, China

Abstract: As a standard algorithm for efficiently calculating set similarity, Minwise hashing is widely used to detect text similarity. The major drawback associated with Minwise hashing is expensive preprocessing. One permutation hashing (OPH) is proposed in order to reduce the number of random permutations. OPH divides the space 兩?evenly into k bins, and selects the smallest nonzero value in each bin to re-index the selected elements. We propose a weight one permutation hashing (WOPH) by dividing the entire space 兩?into k1 and k2 bins and sampling k1 and k2 in proportion to form a weighted kw. WOPH has a wider range of precision by expanding the proportion of w1 and w2 to different accuracy levels of the user. The variance of WOPH can be rapidly decreased first and then slowly decreased, although the final variance is the same as OPH with the same k. We combined the dynamic double filter with WOPH to reduce the calculation time by eliminating unnecessary comparison in advance. For example, for a large number of real data with low similarity accompanied by high threshold queries, the filter reduces the comparison of WOPH by 85%.

Keyword: One Permutation Hashing; WOPH; Bin; Non-uniform partition; Weighting; Dynamic double filter

Full Paper:   197 Downloads, 1178 View

 
 
ⓒ Copyright 2010 KIISE – All Rights Reserved.    
Korean Institute of Information Scientists and Engineers (KIISE)   #401 Meorijae Bldg., 984-1 Bangbae 3-dong, Seo-cho-gu, Seoul 137-849, Korea
Phone: +82-2-588-9240    Fax: +82-2-521-1352    Homepage: http://jcse.kiise.org    Email: office@kiise.org