Weighted SimHash: A Random Projection 
Approach for Detecting Near-Duplicate Documents 
in Large Collections 
Md Mishfaq Ahmed 
Graduate Student 
Department of CS 
University of Memphis 
1
Introduction 
• Near-duplicate documents (NDD): identical 
in terms of core content but differing in small 
portions of the document 
– Harder to detect than exact duplicates 
– Exact duplicates: 
• Standard methods exist 
– Near duplicates: 
• Several approaches exist, but no widely accepted 
method for identifying them 
2
Near Duplicate: main sources 
• News articles 
• Web documents (web pages) differing only in 
advertisements and/or timestamps 
– As many as 40% of the pages on the web are 
duplicates of other pages 
3
Near Duplicate: main sources 
• NDD techniques useful in sequences that are 
not documents (such as DNA sequences) 
• Replication for Reliability 
– In file systems: main content of an important 
document is replicated and stored at different 
places 
4
Earlier Approaches for NDD 
• A naive solution: compare a document with 
all documents in the collection, word by word 
– Prohibitively expensive on large datasets 
• Another approach: convert documents into 
canonical forms until they are exact 
duplicates 
• More viable approach: approximation and 
probabilistic methods 
– Trade-off 
• precision and recall ↔ manageable speed 
5
Earlier Approaches for NDD 
• Two main categories: 
– Shingling-based methods 
– Projection-based methods 
6
Shingling based methods 
• A document d = a sequence of 
tokens 
• Encode d as a set of unique k-grams 
– k-gram = a contiguous sequence of k 
tokens 
• Measure overlap or similarity 
between k-grams 
• Summing the overlap or similarity across 
the entire sets gives the similarity 
between two documents (a minimal sketch follows) 
7
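As a concrete illustration of the shingling idea above, here is a minimal sketch (not from the slides) that encodes each document as a set of word k-grams and measures overlap with Jaccard similarity; the function names and the choice of Jaccard as the overlap measure are assumptions.

```python
# Illustrative sketch (not from the slides) of shingling-based comparison.
# Assumptions: word-level k-grams and Jaccard overlap as the similarity measure.

def shingles(text: str, k: int = 3) -> set:
    """Encode a document as its set of unique k-grams (k contiguous tokens)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 0))}

def jaccard(a: set, b: set) -> float:
    """Overlap between two k-gram sets; 1.0 means identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(d1), shingles(d2)))  # high overlap -> likely near duplicates
```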
Projection based methods: SimHash 
Example: 
– d1: word1+word2+word3 
– d2: word1+word4 
9
SimHash: Example 
• Document d1: word1 + word2 + word3 
10 
SimHash: Example 
• Document d2: word1 + word4 
11 
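The projection step itself is described only in the speaker notes, so the following is a minimal sketch of SimHash fingerprint generation under those notes: each unique term is mapped to a fixed m-dimensional random vector with entries in [-1, 1], the vectors of a document's terms are summed, and each output bit is 1 where the corresponding sum is positive. Seeding the per-term vector from a hash of the term is an implementation convenience assumed here, not something the slides prescribe.

```python
import hashlib
import random

M = 64  # fingerprint length in bits

def term_vector(term: str, m: int = M) -> list:
    """Fixed random vector in [-1, 1]^m for a term; seeding from a hash of the
    term keeps the assignment identical across the whole corpus (an assumed
    implementation convenience)."""
    rng = random.Random(hashlib.md5(term.encode()).hexdigest())
    return [rng.uniform(-1.0, 1.0) for _ in range(m)]

def simhash(text: str, m: int = M) -> list:
    """Sum the term vectors of a document's unique terms, then take the sign per bit."""
    sums = [0.0] * m
    for term in set(text.lower().split()):
        vec = term_vector(term, m)
        for i in range(m):
            sums[i] += vec[i]
    return [1 if s > 0 else 0 for s in sums]

fp1 = simhash("word1 word2 word3")  # document d1
fp2 = simhash("word1 word4")        # document d2
print(sum(b1 != b2 for b1, b2 in zip(fp1, fp2)))  # Hamming distance between fingerprints
```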
Projection based methods: 
Probabilistic Simhash 
• Key observations: 
– The projection is already probabilistic 
– Bits in a fingerprint are mutually independent 
– Intermediate values are ignored while generating 
fingerprints 
• Yet they are useful for gauging the volatility of a bit 
12
Projection based methods: 
Probabilistic Simhash 
• Key observations: 
13 
For another document d that is not a near duplicate 
of d1, the fingerprint of d is most likely to differ 
from that of d1 at the bit positions whose intermediate 
values are closest to zero
Projection based methods: 
Probabilistic Simhash 
• Implementation 
– A unique data structure per document ranks 
bits (or sets of bits) according to volatility 
• Stores bit positions 
– When comparing two fingerprints: 
• Compare the bits with higher volatility first 
• Ensures quicker identification of non-duplicates 
• Reduces the number of bit comparisons for non-duplicates 
14
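A minimal sketch of this volatility-ordered comparison, assuming the per-bit intermediate sums from fingerprint generation are still available; the per-document data structure is simplified here to a sorted list of bit positions, which is only one way to realize the ranking described above.

```python
# Sketch only: 'sums' are the per-bit intermediate values produced while building
# a document's fingerprint (the values plain SimHash discards).

def volatility_order(sums):
    """Bit positions ranked most-volatile first, i.e. by |intermediate sum| ascending."""
    return sorted(range(len(sums)), key=lambda i: abs(sums[i]))

def is_near_duplicate(fp_query, fp_doc, order, k):
    """Compare the most volatile bits first and bail out once mismatches exceed k."""
    mismatches = 0
    for pos in order:
        if fp_query[pos] != fp_doc[pos]:
            mismatches += 1
            if mismatches > k:
                return False  # early exit: non-duplicates rejected after few comparisons
    return True
```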
Projection based methods: 
Probabilistic Simhash 
• Drawback: 
– The overhead of an extra data structure per document, 
in addition to the fingerprint 
15
Our Approach: Weighted SimHash 
• Main Idea: 
– Terms with a higher inverse document frequency 
(IDF) are better at finding NDD 
• Consider two documents D1, D2 and two terms, t1 
with high IDF and t2 with low IDF: 
– Case I: both D1 and D2 have t1 
– Case II: both D1 and D2 have t2 
– Case III: neither of them has t1 
– Case IV: neither of them has t2 
• D1, D2 are more likely to be NDD in Case I than in Case II 
• D1, D2 are more likely to be NDD in Case IV than in Case III 
16
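For reference, a small sketch of the IDF weighting this idea relies on, using the common log(N / document frequency) form; the slides do not specify the exact IDF variant, so this is an assumption.

```python
import math
from collections import Counter

def idf_table(documents):
    """IDF for every term in a collection, using log(N / document frequency)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return {term: math.log(n / count) for term, count in df.items()}

docs = ["word1 word2 word3", "word1 word4", "word1 word2"]
print(idf_table(docs))  # rarer terms (word3, word4) receive the highest IDF
```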
Weighted SimHash: Key Steps 
18
Weighted SimHash 
• Generation of the fingerprint: 
– Terms with higher IDFs contribute more to the 
sums that form the more significant bits (towards the left 
end of the fingerprint) 
– Terms with lower IDFs contribute more to the 
sums that form the less significant bits (towards the right 
end of the fingerprint) 
– This increases the chance of mismatches in the leading 
bits for non-duplicates 
– How to achieve this? 
• A multiplication factor (see the sketch below) 
19
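A minimal sketch of how such a bit-position-dependent multiplication factor could be folded into fingerprint generation: each term's contribution to bit position bp is scaled by mf(IDF_t, bp) before summing, so high-IDF terms dominate the leading bits. It reuses term_vector from the plain SimHash sketch above and takes the factor function mf as a parameter; these names are assumptions, not the authors' code.

```python
def weighted_simhash(text, idf, mf, m=64):
    """Weighted SimHash sketch: scale each term's contribution to bit position bp
    by mf(idf_t, bp) before summing, so high-IDF terms dominate the leading bits.
    Reuses term_vector() from the plain SimHash sketch above."""
    sums = [0.0] * m
    for term in set(text.lower().split()):
        vec = term_vector(term, m)     # same fixed per-term random projection
        idf_t = idf.get(term, 0.0)
        for bp in range(m):            # bp = 0 is treated as the most significant bit
            sums[bp] += mf(idf_t, bp) * vec[bp]
    return [1 if s > 0 else 0 for s in sums]
```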
Weighted SimHash 
• Multiplication factor (MF) for term t: mf_t = f(IDF_t, bp), 
a function of the term's IDF and the bit position bp 
20 
[Figure: multiplication factor (y axis, roughly 0 to 2.5) plotted against bit position from MSB to LSB (x axis), with separate curves for a high-IDF, a mid-IDF, and a low-IDF term]
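The exact form of the factor is given in the speaker notes (note #22); the sketch below transcribes that formula directly. The default bound values are placeholders only (the experiments report settings such as MF(0.5, 1.5)).

```python
def multiplication_factor(idf_t, bp, n=64,
                          mult_min=0.5, mult_max=1.5,
                          idf_min=0.0, idf_max=10.0):
    """mf_t = mult_min + (mult_max - mult_min) * (IDF modifier - bit position modifier),
    with IDF modifier = (IDF_t - IDF_min) / (IDF_max - IDF_min) and
    bit position modifier = bp / (n - 1).  Bound values here are placeholders;
    the experiments report settings such as MF(0.5, 1.5)."""
    idf_modifier = (idf_t - idf_min) / (idf_max - idf_min)
    bp_modifier = bp / (n - 1)
    return mult_min + (mult_max - mult_min) * (idf_modifier - bp_modifier)

# Usage with the earlier sketch (m and n both default to 64 bits):
# fingerprint = weighted_simhash("word1 word4", idf_table(docs), mf=multiplication_factor)
```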
Weighted SimHash 
• Example(generation of fingerprint): 
– Document D2: word1 + word4 
23
Weighted SimHash 
• Example(generation of fingerprint): 
– Document D2: word1 + word4 
24
Weighted SimHash 
• Finding near duplicates 
– Compare the fingerprint of the query document with that of 
each document in the collection (see the sketch below) 
• Start the scan from the most significant (leftmost) bit 
• Count the number of mismatches 
• If the number of mismatches exceeds k (the allowed Hamming 
distance threshold): no near duplicate; stop the scan and go 
to the next document 
• If the number of mismatches is within the allowed Hamming 
distance threshold after scanning the entire fingerprint: 
a near duplicate has been found 
25
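A minimal sketch of this comparison loop with early termination; the function names are illustrative.

```python
def within_hamming(fp_query, fp_doc, k):
    """Scan from the most significant (leftmost) bit; stop once mismatches exceed k."""
    mismatches = 0
    for b_q, b_d in zip(fp_query, fp_doc):
        if b_q != b_d:
            mismatches += 1
            if mismatches > k:
                return False          # not a near duplicate; move to the next document
    return True                       # within the threshold: near duplicate found

def find_near_duplicates(query_fp, collection_fps, k):
    """Indices of all collection fingerprints within Hamming distance k of the query."""
    return [i for i, fp in enumerate(collection_fps) if within_hamming(query_fp, fp, k)]
```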
Experiment 
• Reuters data set: almost 10k documents 
– 10 documents randomly selected 
– Each of the 10 documents was very slightly 
modified (at most two word changes per copy) to 
produce 20 documents per selection 
– 200 documents in total, which we consider near 
duplicates of their respective sources 
– The 10 source documents were then used as queries 
26
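The slides do not spell out how the variants were produced, so the following is only a rough sketch of one way to generate them: copy each selected document 20 times and apply at most two word-level edits per copy (the speaker notes mention insertion, deletion, and transposition as the edit types).

```python
import random

def make_variants(text, n_variants=20, max_changes=2, seed=0):
    """Create synthetic near duplicates with at most `max_changes` word-level edits
    each (deletions and transpositions here; the notes also mention insertion)."""
    rng = random.Random(seed)
    tokens = text.split()
    variants = []
    for _ in range(n_variants):
        copy = list(tokens)
        for _ in range(rng.randint(1, max_changes)):
            i = rng.randrange(len(copy))
            if len(copy) > 1 and rng.random() < 0.5:
                del copy[i]                               # delete one word
            else:
                j = rng.randrange(len(copy))
                copy[i], copy[j] = copy[j], copy[i]       # transpose two words
        variants.append(" ".join(copy))
    return variants
```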
Experiment: Procedure 
27
Results 
[Figure: recall percentage (0-100) plotted against k, the Hamming distance threshold (0-10), for Recall(SimHash) and Recall(WSH)]
Figure: Comparison of percentage recall for all the 20 query 
documents for the SimHash and Weighted SimHash methods, with k 
(the Hamming distance threshold) shown on the X axis. 
28
Results 
[Figure: precision percentage (0-100) plotted against k, the Hamming distance threshold (0-16), for Precision(SimHash) and Precision(WSH)]
Figure: Precision comparison between random projection (SimHash) and Weighted 
SimHash. The X axis shows different values of k. The figure shows there is no real 
difference between the two methods in terms of precision. 
29
Results 
Figure: Average execution time per query for each of the methods. For cosine 
similarity the threshold is 0.95. 
Comparison of average execution time (milliseconds): 
– Cosine Similarity (0.95): 4656 
– SimHash: 3820 
– WSH with MF(0.1, 1.0): 3560 
– WSH with MF(0.3, 1.0): 3490 
– WSH with MF(0.5, 1.0): 3587 
– WSH with MF(0.5, 1.5): 3664 
– WSH with MF(1.0, 1.3): 3680 
– WSH with MF(1.0, 1.6): 3440 
– WSH with MF(1.0, 2.0): 3560 
30
Limitations of WSH 
• Dependence on IDF 
– Web search: IDF unknown 
– Heuristics can be used: 
• IDF from first 1000 documents 
31
Limitations of WSH 
• Difficulty setting the lower and 
upper bounds on the multiplication 
factor 
– May vary from collection to 
collection 
32
Conclusion 
• Batch processing of a document collection: 
– Runtime: WSH better than SimHash 
– Precision and Recall: WSH and SimHash are 
comparable 
33
Conclusion 
• Further work on SimHash: 
– How much can the fingerprint generation be 
altered while preserving the similarity guarantee? 
35
Thank you 
36

Editor's Notes

  • #2 Hi all, I am Md Mishfaq Ahmed and I am going to present my project on Weighted SimHash: an efficient approach for detecting near duplicate documents in a large document collection
  • #3 Documents that are near duplicates of each other are harder to detect than those that are identical duplicates. Identical duplicate documents can be detected with standard checksum methods, whereas near-duplicate detection has no such standard technique and is still an active area of research. Two near-duplicate documents are identical in terms of core content but differ in small portions of the documents.
  • #6 Converting to canonical form: the degree of canonization determines whether two documents are close enough. For large collections of web documents, Broder's shingling algorithm (based on word sequences) and Charikar's random projection based approach (SimHash) were considered the state-of-the-art algorithms [Henzinger]. Manku and Anish Das Sarma presented a technique that tackles near-duplicate detection in the setting of web crawling and builds on Charikar's technique.
  • #7 Two main categories
  • #8 The number of identical entries in the super-shingle vectors of two pages is their B-similarity. Two pages are considered near duplicates (B-similar) if their B-similarity is at least 2 [Henzinger]. As can be seen, in the standard shingling methods the document signature vectors are constructed so that all k-grams in the documents are treated equally, irrespective of the frequency of occurrence of the terms involved. [Hoad and Zobel] experimented with various strategies for selecting k-grams when constructing shingles, such as selection based on their TF-IDF scores. The min-wise independent permutation algorithm is another shingling-based construct, based on a locality sensitive hashing scheme. [Fusion]
  • #9 According to the implementation described by Charikar, the algorithm generates a binary vector with m bits to represent each document. In the first step, each unique term in the target document is projected into an m-dimensional real-valued random vector, where each element is randomly chosen from [-1, 1]. A random vector is generated for each unique term in the collection, and this assignment is fixed across the whole corpus, so the representation of a single term is the same everywhere. The representations of every term present in a single document are then added together. The final m-dimensional binary vector representing the document is derived by setting each element in the vector to 1 if the corresponding real value is positive and to 0 otherwise. The underlying assumption is that the cosine similarity of two pages is proportional to the number of bits in which the two corresponding projections agree. The C-similarity of two pages is defined as the number of bits their projections agree on. Finally, two documents are near duplicates if their C-similarity is above a certain fixed threshold.
  • #10 Let us assume we have three documents in the collection. A document is converted to a binary vector: each term in the document is mapped to an m-dimensional real-valued random vector; this mapping is the same for a term across all documents in the collection; each real value is randomly chosen in the range [-1, 1]. The vector representations of each term in a document are added together to get an m-dimensional real-valued sum vector. Form the binary vector: for each position, if sum > 0 place 1, else place 0.
  • #13 In probabilistic SimHash, two major observations are made: SimHash is already probabilistic in nature, and the bits in a fingerprint are mutually independent. Also, in the process of building the fingerprint of a document, the intermediate values generated from the sums of real values from the representative vectors of each term are discarded. In [9], these intermediate values are exploited to determine the volatility of each bit, forming a volatility-ordered set (heap) for each document that dictates the order in which the bits of a fingerprint are scanned when finding the near duplicates of a document.
  • #17 (rarer terms in a collection)
  • #22 The multiplication factor for a term t is a function of its IDF and the bit position: mf_t = f(IDF_t, bp). In the conventional methods, mf_t is always 1. In the proposed method, mf_t is defined as follows:
    mf_t = mult_min + (mult_max - mult_min) × (IDF modifier - bit position modifier)   [1]
    where the IDF modifier is (IDF_t - IDF_min) / (IDF_max - IDF_min) and the bit position modifier is bp / (n - 1).
    Here the constants are: mult_min = minimum value of the multiplication factor; mult_max = maximum value of the multiplication factor; IDF_min = minimum IDF considered among tokens; IDF_max = maximum IDF considered among tokens; n = number of bits in the fingerprint. The variables are: bp = the bit position for which the multiplication factor is being calculated; IDF_t = the IDF of the token t (the token for which the multiplication factor is being calculated).
  • #26 Because of the way the fingerprint is generated, mismatches between non-duplicates are more likely to occur in the leading (more significant) bits.
  • #27 Data set: there is no standard dataset for testing and comparing NDD algorithms; researchers have used different datasets. Near duplicates can be synthetic or real.
  • #28 Procedure:
    1. Remove all stop words from all data files using a standard stop-word list, stopword.txt.
    2. Stem all data files using a standard stemming process.
    3. For each algorithm (standard SimHash and Weighted SimHash), and for each of the 10 text files for which synthetic duplicates were made:
      a. Calculate the number of documents that are returned as near duplicates.
      b. Calculate how many of the documents from step 3a were synthetically created (via insertion, deletion, and transposition in the query document).
      c. From these values, compute the performance measures precision and recall.
      d. Measure the elapsed time to find the speed-up from the use of Weighted SimHash.
  • #29 We want to see the running time improvement in weighted simhash and whether it affects precision and recall
  • #36 Original fingerprint guaranteed that the similarity between two documents is proportional to the similarity of their fingerprints