Weighted SimHash: A Random Projection 
Approach for Detecting Near-Duplicate Documents 
in Large Collections 
Md Mishfaq Ahmed 
Graduate Student 
Department of CS 
University of Memphis 
1
Introduction 
• Near-duplicate documents (NDD): identical 
in terms of core content but differing in small 
portions of the document 
– Harder to detect than exact duplicates 
– Exact duplicates: 
• Standard methods exist 
– Near duplicates: 
• Several approaches exist, but no widely accepted 
method for identifying them 
2
Near Duplicate: main sources 
• News articles 
• Web documents (web pages) differing only in 
advertisements and/or timestamps 
– As many as 40% of the pages on the web are 
duplicates of other pages 
3
Near Duplicate: main sources 
• NDD techniques useful in sequences that are 
not documents (such as DNA sequences) 
• Replication for Reliability 
– In file systems: main content of an important 
document is replicated and stored at different 
places 
4
Earlier Approaches for NDD 
• A naive solution: compare a document with 
all documents in the collection, word by word 
– Prohibitively expensive on large datasets 
• Another approach: convert documents into 
canonical forms until they are exact 
duplicates 
• More viable approach: approximation and 
probabilistic methods 
– Trade-off 
• precision and recall ↔ manageable speed 
5
Earlier Approaches for NDD 
• Two main categories: 
– Shingling-based methods 
– Projection-based methods 
6
Shingling based methods 
• A document d = a sequence of 
tokens 
• Encode d as a set of unique k-grams 
– k-gram = a contiguous sequence of k 
tokens 
• Measure overlap or similarity 
between k-grams 
• Summing the overlap or similarity across 
the entire sets gives the similarity 
between two documents (a minimal sketch follows) 
7
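As a concrete illustration of the shingling idea above, here is a minimal sketch (not from the slides) that encodes each document as a set of word k-grams and measures overlap with Jaccard similarity; the function names and the choice of Jaccard as the overlap measure are assumptions.

```python
# Illustrative sketch (not from the slides) of shingling-based comparison.
# Assumptions: word-level k-grams and Jaccard overlap as the similarity measure.

def shingles(text: str, k: int = 3) -> set:
    """Encode a document as its set of unique k-grams (k contiguous tokens)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 0))}

def jaccard(a: set, b: set) -> float:
    """Overlap between two k-gram sets; 1.0 means identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(d1), shingles(d2)))  # high overlap -> likely near duplicates
```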
Projection based methods: SimHash 
Example: 
– d1: word1+word2+word3 
– d2: word1+word4 
9
SimHash: Example 
• Document d1: word1 + word2 + word3 
10 
SimHash: Example 
• Document d2: word1 + word4 
11 
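The projection step itself is described only in the speaker notes, so the following is a minimal sketch of SimHash fingerprint generation under those notes: each unique term is mapped to a fixed m-dimensional random vector with entries in [-1, 1], the vectors of a document's terms are summed, and each output bit is 1 where the corresponding sum is positive. Seeding the per-term vector from a hash of the term is an implementation convenience assumed here, not something the slides prescribe.

```python
import hashlib
import random

M = 64  # fingerprint length in bits

def term_vector(term: str, m: int = M) -> list:
    """Fixed random vector in [-1, 1]^m for a term; seeding from a hash of the
    term keeps the assignment identical across the whole corpus (an assumed
    implementation convenience)."""
    rng = random.Random(hashlib.md5(term.encode()).hexdigest())
    return [rng.uniform(-1.0, 1.0) for _ in range(m)]

def simhash(text: str, m: int = M) -> list:
    """Sum the term vectors of a document's unique terms, then take the sign per bit."""
    sums = [0.0] * m
    for term in set(text.lower().split()):
        vec = term_vector(term, m)
        for i in range(m):
            sums[i] += vec[i]
    return [1 if s > 0 else 0 for s in sums]

fp1 = simhash("word1 word2 word3")  # document d1
fp2 = simhash("word1 word4")        # document d2
print(sum(b1 != b2 for b1, b2 in zip(fp1, fp2)))  # Hamming distance between fingerprints
```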
Projection based methods: 
Probabilistic Simhash 
• Key observations: 
– The projection is already probabilistic 
– Bits in a fingerprint are mutually independent 
– Intermediate values are ignored while generating 
fingerprints 
• Yet they are useful for gauging the volatility of a bit 
12
Projection based methods: 
Probabilistic Simhash 
• Key observations: 
13 
For another document d that is not a near duplicate 
of d1, the fingerprint of d is most likely to differ 
from that of d1 at the bit positions whose intermediate 
values are closest to zero
Projection based methods: 
Probabilistic Simhash 
• Implementation 
– A unique data structure per document ranks 
bits (or sets of bits) according to volatility 
• Stores bit positions 
– When comparing two fingerprints: 
• Compare the bits with higher volatility first 
• Ensures quicker identification of non-duplicates 
• Reduces the number of bit comparisons for non-duplicates 
14
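A minimal sketch of this volatility-ordered comparison, assuming the per-bit intermediate sums from fingerprint generation are still available; the per-document data structure is simplified here to a sorted list of bit positions, which is only one way to realize the ranking described above.

```python
# Sketch only: 'sums' are the per-bit intermediate values produced while building
# a document's fingerprint (the values plain SimHash discards).

def volatility_order(sums):
    """Bit positions ranked most-volatile first, i.e. by |intermediate sum| ascending."""
    return sorted(range(len(sums)), key=lambda i: abs(sums[i]))

def is_near_duplicate(fp_query, fp_doc, order, k):
    """Compare the most volatile bits first and bail out once mismatches exceed k."""
    mismatches = 0
    for pos in order:
        if fp_query[pos] != fp_doc[pos]:
            mismatches += 1
            if mismatches > k:
                return False  # early exit: non-duplicates rejected after few comparisons
    return True
```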
Projection based methods: 
Probabilistic Simhash 
• Drawback: 
– The overhead of an extra data structure per document, 
in addition to the fingerprint 
15
Our Approach: Weighted SimHash 
• Main Idea: 
– Terms with a higher inverse document frequency 
(IDF) are better at finding NDD 
• Consider two documents D1, D2 and two terms, t1 
with high IDF and t2 with low IDF: 
– Case I: both D1 and D2 have t1 
– Case II: both D1 and D2 have t2 
– Case III: neither of them has t1 
– Case IV: neither of them has t2 
• D1, D2 are more likely to be NDD in Case I than in Case II 
• D1, D2 are more likely to be NDD in Case IV than in Case III 
16
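For reference, a small sketch of the IDF weighting this idea relies on, using the common log(N / document frequency) form; the slides do not specify the exact IDF variant, so this is an assumption.

```python
import math
from collections import Counter

def idf_table(documents):
    """IDF for every term in a collection, using log(N / document frequency)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return {term: math.log(n / count) for term, count in df.items()}

docs = ["word1 word2 word3", "word1 word4", "word1 word2"]
print(idf_table(docs))  # rarer terms (word3, word4) receive the highest IDF
```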
Weighted SimHash: Key Steps 
18
Weighted SimHash 
• Generation of the fingerprint: 
– Terms with higher IDFs contribute more to the 
sums that form the more significant bits (towards the left 
end of the fingerprint) 
– Terms with lower IDFs contribute more to the 
sums that form the less significant bits (towards the right 
end of the fingerprint) 
– This increases the chance of mismatches in the leading 
bits for non-duplicates 
– How to achieve this? 
• A multiplication factor (see the sketch below) 
19
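A minimal sketch of how such a bit-position-dependent multiplication factor could be folded into fingerprint generation: each term's contribution to bit position bp is scaled by mf(IDF_t, bp) before summing, so high-IDF terms dominate the leading bits. It reuses term_vector from the plain SimHash sketch above and takes the factor function mf as a parameter; these names are assumptions, not the authors' code.

```python
def weighted_simhash(text, idf, mf, m=64):
    """Weighted SimHash sketch: scale each term's contribution to bit position bp
    by mf(idf_t, bp) before summing, so high-IDF terms dominate the leading bits.
    Reuses term_vector() from the plain SimHash sketch above."""
    sums = [0.0] * m
    for term in set(text.lower().split()):
        vec = term_vector(term, m)     # same fixed per-term random projection
        idf_t = idf.get(term, 0.0)
        for bp in range(m):            # bp = 0 is treated as the most significant bit
            sums[bp] += mf(idf_t, bp) * vec[bp]
    return [1 if s > 0 else 0 for s in sums]
```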
Weighted SimHash 
• Multiplication factor (MF) for term t: mf_t = f(IDF_t, bp), 
a function of the term's IDF and the bit position bp 
20 
[Figure: multiplication factor (y axis, roughly 0 to 2.5) plotted against bit position from MSB to LSB (x axis), with separate curves for a high-IDF, a mid-IDF, and a low-IDF term]
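The exact form of the factor is given in the speaker notes (note #22); the sketch below transcribes that formula directly. The default bound values are placeholders only (the experiments report settings such as MF(0.5, 1.5)).

```python
def multiplication_factor(idf_t, bp, n=64,
                          mult_min=0.5, mult_max=1.5,
                          idf_min=0.0, idf_max=10.0):
    """mf_t = mult_min + (mult_max - mult_min) * (IDF modifier - bit position modifier),
    with IDF modifier = (IDF_t - IDF_min) / (IDF_max - IDF_min) and
    bit position modifier = bp / (n - 1).  Bound values here are placeholders;
    the experiments report settings such as MF(0.5, 1.5)."""
    idf_modifier = (idf_t - idf_min) / (idf_max - idf_min)
    bp_modifier = bp / (n - 1)
    return mult_min + (mult_max - mult_min) * (idf_modifier - bp_modifier)

# Usage with the earlier sketch (m and n both default to 64 bits):
# fingerprint = weighted_simhash("word1 word4", idf_table(docs), mf=multiplication_factor)
```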
Weighted SimHash 
• Example(generation of fingerprint): 
– Document D2: word1 + word4 
23
Weighted SimHash 
• Example(generation of fingerprint): 
– Document D2: word1 + word4 
24
Weighted SimHash 
• Finding near duplicates 
– Compare the fingerprint of the query document with that of 
each document in the collection (see the sketch below) 
• Start the scan from the most significant (leftmost) bit 
• Count the number of mismatches 
• If the number of mismatches exceeds k (the allowed Hamming 
distance threshold): no near duplicate; stop the scan and go 
to the next document 
• If the number of mismatches is within the allowed Hamming 
distance threshold after scanning the entire fingerprint: 
a near duplicate has been found 
25
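A minimal sketch of this comparison loop with early termination; the function names are illustrative.

```python
def within_hamming(fp_query, fp_doc, k):
    """Scan from the most significant (leftmost) bit; stop once mismatches exceed k."""
    mismatches = 0
    for b_q, b_d in zip(fp_query, fp_doc):
        if b_q != b_d:
            mismatches += 1
            if mismatches > k:
                return False          # not a near duplicate; move to the next document
    return True                       # within the threshold: near duplicate found

def find_near_duplicates(query_fp, collection_fps, k):
    """Indices of all collection fingerprints within Hamming distance k of the query."""
    return [i for i, fp in enumerate(collection_fps) if within_hamming(query_fp, fp, k)]
```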
Experiment 
• Reuters data set: almost 10k documents 
– 10 documents randomly selected 
– Each of the 10 documents was very slightly 
modified (at most two word changes per copy) to 
produce 20 documents per selection 
– 200 documents in total, which we consider near 
duplicates of their respective sources 
– The 10 source documents were then used as queries 
26
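The slides do not spell out how the variants were produced, so the following is only a rough sketch of one way to generate them: copy each selected document 20 times and apply at most two word-level edits per copy (the speaker notes mention insertion, deletion, and transposition as the edit types).

```python
import random

def make_variants(text, n_variants=20, max_changes=2, seed=0):
    """Create synthetic near duplicates with at most `max_changes` word-level edits
    each (deletions and transpositions here; the notes also mention insertion)."""
    rng = random.Random(seed)
    tokens = text.split()
    variants = []
    for _ in range(n_variants):
        copy = list(tokens)
        for _ in range(rng.randint(1, max_changes)):
            i = rng.randrange(len(copy))
            if len(copy) > 1 and rng.random() < 0.5:
                del copy[i]                               # delete one word
            else:
                j = rng.randrange(len(copy))
                copy[i], copy[j] = copy[j], copy[i]       # transpose two words
        variants.append(" ".join(copy))
    return variants
```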
Experiment: Procedure 
27
Results 
[Figure: recall percentage (0-100) plotted against k, the Hamming distance threshold (0-10), for Recall(SimHash) and Recall(WSH)]
Figure: Comparison of percentage recall for all the 20 query 
documents for the SimHash and Weighted SimHash methods, with k 
(the Hamming distance threshold) shown on the X axis. 
28
Results 
[Figure: precision percentage (0-100) plotted against k, the Hamming distance threshold (0-16), for Precision(SimHash) and Precision(WSH)]
Figure: Precision comparison between random projection (SimHash) and Weighted 
SimHash. The X axis shows different values of k. The figure shows there is no real 
difference between the two methods in terms of precision. 
29
Results 
Figure: Average execution time per query for each of the methods. For cosine 
similarity the threshold is 0.95. 
Comparison of average execution time (milliseconds): 
– Cosine Similarity (0.95): 4656 
– SimHash: 3820 
– WSH with MF(0.1, 1.0): 3560 
– WSH with MF(0.3, 1.0): 3490 
– WSH with MF(0.5, 1.0): 3587 
– WSH with MF(0.5, 1.5): 3664 
– WSH with MF(1.0, 1.3): 3680 
– WSH with MF(1.0, 1.6): 3440 
– WSH with MF(1.0, 2.0): 3560 
30
Limitations of WSH 
• Dependence on IDF 
– Web search: IDF unknown 
– Heuristics can be used: 
• IDF from first 1000 documents 
31
Limitations of WSH 
• Difficulty setting the lower and 
upper bounds on the multiplication 
factor 
– May vary from collection to 
collection 
32
Conclusion 
• Batch processing of a document collection: 
– Runtime: WSH better than SimHash 
– Precision and Recall: WSH and SimHash are 
comparable 
33
Conclusion 
• Further work on SimHash: 
– How much can the fingerprint generation be 
altered while preserving the similarity guarantee? 
35
Thank you 
36

Editor's Notes

  • #2 Hi all, I am Md Mishfaq Ahmed and I am going to present my project on Weighted SimHash: an efficient approach for detecting near duplicate documents in a large document collection
  • #3 Documents that are near duplicates of each other are harder to detect than those that are identical duplicates. Identical duplicate documents can be detected with standard checksum methods, whereas near-duplicate detection has no such standard technique and is still an active area of research. Two near-duplicate documents are identical in terms of core content but differ in small portions of the documents.
  • #6 Converting to canonical form: the degree of canonization determines whether two documents are close enough. For large collections of web documents, Broder's shingling algorithm (based on word sequences) and Charikar's random projection based approach (SimHash) were considered the state-of-the-art algorithms [Henzinger]. Manku and Anish Das Sarma presented a technique that tackles near-duplicate detection in the setting of web crawling and builds on Charikar's technique.
  • #7 Two main categories
  • #8 The number of identical entries in the super-shingle vectors of two pages is their B-similarity. Two pages are considered near duplicates (B-similar) if their B-similarity is at least 2 [Henzinger]. As can be seen, in the standard shingling methods the document signature vectors are constructed so that all k-grams in the documents are treated equally, irrespective of the frequency of occurrence of the terms involved. [Hoad and Zobel] experimented with various strategies for selecting k-grams when constructing shingles, such as selection based on their TF-IDF scores. The min-wise independent permutation algorithm is another shingling-based construct, based on a locality sensitive hashing scheme. [Fusion]
  • #9 According to the implementation described by Charikar, the algorithm generates a binary vector with m bits to represent each document. In the first step, each unique term in the target document is projected into an m-dimensional real-valued random vector, where each element is randomly chosen from [-1, 1]. A random vector is generated for each unique term in the collection, and this assignment is fixed across the whole corpus, so the representation of a single term is the same everywhere. The representations of every term present in a single document are then added together. The final m-dimensional binary vector representing the document is derived by setting each element in the vector to 1 if the corresponding real value is positive and to 0 otherwise. The underlying assumption is that the cosine similarity of two pages is proportional to the number of bits in which the two corresponding projections agree. The C-similarity of two pages is defined as the number of bits their projections agree on. Finally, two documents are near duplicates if their C-similarity is above a certain fixed threshold.
  • #10 Let us assume we have three documents in the collection. A document is converted to a binary vector: each term in the document is mapped to an m-dimensional real-valued random vector; this mapping is the same for a term across all documents in the collection; each real value is randomly chosen in the range [-1, 1]. The vector representations of each term in a document are added together to get an m-dimensional real-valued sum vector. Form the binary vector: for each position, if sum > 0 place 1, else place 0.
  • #13 In probabilistic SimHash, two major observations are made: SimHash is already probabilistic in nature, and the bits in a fingerprint are mutually independent. Also, in the process of building the fingerprint of a document, the intermediate values generated from the sums of real values from the representative vectors of each term are discarded. In [9], these intermediate values are exploited to determine the volatility of each bit, forming a volatility-ordered set (heap) for each document that dictates the order in which the bits of a fingerprint are scanned when finding the near duplicates of a document.
  • #17 (rarer terms in a collection)
  • #22 The multiplication factor for a term t is a function of its IDF and the bit position: mf_t = f(IDF_t, bp). In the conventional methods, mf_t is always 1. In the proposed method, mf_t is defined as follows:
    mf_t = mult_min + (mult_max - mult_min) × (IDF modifier - bit position modifier)   [1]
    where the IDF modifier is (IDF_t - IDF_min) / (IDF_max - IDF_min) and the bit position modifier is bp / (n - 1).
    Here the constants are: mult_min = minimum value of the multiplication factor; mult_max = maximum value of the multiplication factor; IDF_min = minimum IDF considered among tokens; IDF_max = maximum IDF considered among tokens; n = number of bits in the fingerprint. The variables are: bp = the bit position for which the multiplication factor is being calculated; IDF_t = the IDF of the token t (the token for which the multiplication factor is being calculated).
  • #26 Because of the way the fingerprint is generated, mismatches between non-duplicates are more likely to occur in the leading (more significant) bits.
  • #27 Data set: there is no standard dataset for testing and comparing NDD algorithms; researchers have used different datasets. Near duplicates can be synthetic or real.
  • #28 Procedure:
    1. Remove all stop words from all data files using a standard stop-word list, stopword.txt.
    2. Stem all data files using a standard stemming process.
    3. For each algorithm (standard SimHash and Weighted SimHash), and for each of the 10 text files for which synthetic duplicates were made:
      a. Calculate the number of documents that are returned as near duplicates.
      b. Calculate how many of the documents from step 3a were synthetically created (via insertion, deletion, and transposition in the query document).
      c. From these values, compute the performance measures precision and recall.
      d. Measure the elapsed time to find the speed-up from the use of Weighted SimHash.
  • #29 We want to see the running time improvement in weighted simhash and whether it affects precision and recall
  • #36 Original fingerprint guaranteed that the similarity between two documents is proportional to the similarity of their fingerprints