JCSE, vol. 13, no. 2, pp.78-88, 2019
DOI: http://dx.doi.org/10.5626/JCSE.2019.13.2.78
Near-Duplication Document Detection Using Weight One Permutation Hashing
Xinpan Yuan, Songlin Wang, and Xiaojun Deng
School of Computer Science, Hunan University of Technology, Zhuzhou, China
Abstract: As a standard algorithm for efficiently calculating set similarity, Minwise hashing is widely used to detect text similarity.
The major drawback associated with Minwise hashing is expensive preprocessing. One permutation hashing (OPH) is
proposed in order to reduce the number of random permutations. OPH divides the space 兩?evenly into k bins, and selects
the smallest nonzero value in each bin to re-index the selected elements. We propose a weight one permutation hashing
(WOPH) by dividing the entire space 兩?into k1 and k2 bins and sampling k1 and k2 in proportion to form a weighted kw.
WOPH has a wider range of precision by expanding the proportion of w1 and w2 to different accuracy levels of the user.
The variance of WOPH can be rapidly decreased first and then slowly decreased, although the final variance is the same
as OPH with the same k. We combined the dynamic double filter with WOPH to reduce the calculation time by eliminating
unnecessary comparison in advance. For example, for a large number of real data with low similarity accompanied
by high threshold queries, the filter reduces the comparison of WOPH by 85%.
Keyword:
One Permutation Hashing; WOPH; Bin; Non-uniform partition; Weighting; Dynamic double filter
Full Paper: 197 Downloads, 1434 View
|