JCSE

JCSE, vol. 13, no. 2, pp.78-88, 2019

DOI: http://dx.doi.org/10.5626/JCSE.2019.13.2.78

Near-Duplication Document Detection Using Weight One Permutation Hashing

Xinpan Yuan, Songlin Wang, and Xiaojun Deng
School of Computer Science, Hunan University of Technology, Zhuzhou, China

Abstract: As a standard algorithm for efficiently calculating set similarity, Minwise hashing is widely used to detect text similarity. The major drawback associated with Minwise hashing is expensive preprocessing. One permutation hashing (OPH) is proposed in order to reduce the number of random permutations. OPH divides the space 兩?evenly into k bins, and selects the smallest nonzero value in each bin to re-index the selected elements. We propose a weight one permutation hashing (WOPH) by dividing the entire space 兩?into k1 and k2 bins and sampling k1 and k2 in proportion to form a weighted kw. WOPH has a wider range of precision by expanding the proportion of w1 and w2 to different accuracy levels of the user. The variance of WOPH can be rapidly decreased first and then slowly decreased, although the final variance is the same as OPH with the same k. We combined the dynamic double filter with WOPH to reduce the calculation time by eliminating unnecessary comparison in advance. For example, for a large number of real data with low similarity accompanied by high threshold queries, the filter reduces the comparison of WOPH by 85%.

Keyword: One Permutation Hashing; WOPH; Bin; Non-uniform partition; Weighting; Dynamic double filter

Full Paper: 198 Downloads, 1561 View