Finding Similar Files in Large Document Repositories
KDD'05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM.
George Forman, Hewlett-Packard Labs
Kave Eshghi, Hewlett-Packard Labs
Stephane Chiocchetti, Hewlett-Packard France
Agenda
- Introduction
- Method
- Results
- Related work
- Conclusions
Introduction
- Millions of technical support documents.
  - Covering many different products, solutions, and phases of support.
- The content of a new document may duplicate existing content.
  - Authors prefer to copy content rather than link to it by reference,
  - to avoid the possibility of dead links.
  - By mistake or limited authorization, the copied version is not updated.
- Solution
  - Use chunking technology to break each document into paragraph-like pieces.
  - Detect collisions among the hash signatures of these chunks.
  - Efficiently determine which files are related in a large repository.
Method
- Step 1:
  - Use a 'content-based chunking algorithm' to break each file into a sequence of chunks.
- Step 2:
  - Compute the hash of each chunk.
- Step 3:
  - Find the files that share chunk hashes,
  - reporting only those pairs whose intersection is above some threshold.
  (A minimal end-to-end sketch of these three steps follows below.)
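To make the three-step flow concrete, here is a minimal Python sketch. The paper's actual implementation was in C++ and Perl; the function name, the `chunk_file` helper, and the 50% threshold below are illustrative assumptions, not the paper's settings.

```python
import hashlib
from itertools import combinations

def similar_file_pairs(files, chunk_file, threshold=0.5):
    """files: {file_name: bytes}; chunk_file: content-based chunker (see TTTD sketch later)."""
    # Steps 1 and 2: chunk each file and record (hash, byte length) per chunk.
    file_chunks = {
        name: [(hashlib.md5(c).hexdigest(), len(c)) for c in chunk_file(data)]
        for name, data in files.items()
    }
    # Step 3: report file pairs whose shared chunk bytes exceed the threshold.
    # (A naive all-pairs loop; the bipartite-graph steps later avoid this.)
    pairs = []
    for a, b in combinations(files, 2):
        hashes_b = {h for h, _ in file_chunks[b]}
        common = sum(length for h, length in file_chunks[a] if h in hashes_b)
        if common >= threshold * min(len(files[a]), len(files[b])):
            pairs.append((a, b))
    return pairs
```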
Hashing background
- Use the 'compare by hash' method to compare chunks occurring in different files.
  - Hashes are short, fixed-size byte sequences; it is computationally infeasible to find two different chunks with the same hash.
  - Use the MD5 algorithm, which generates 128-bit hashes.
- Two advantages of comparing hashes rather than the chunks themselves:
  - Comparison time is shorter.
  - Being short and fixed size, hashes lend themselves to efficient data structures for lookup and comparison.
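A one-function illustration of compare-by-hash using Python's standard hashlib; MD5 is named in the slide, everything else here is an illustrative choice.

```python
import hashlib

def chunk_hash(chunk: bytes) -> bytes:
    # 128-bit MD5 digest of the chunk contents.
    return hashlib.md5(chunk).digest()

# Digests are short and fixed-size (16 bytes), so comparison is cheap regardless
# of chunk size, and accidental collisions between different chunks are
# computationally infeasible to find.
assert chunk_hash(b"the same text") == chunk_hash(b"the same text")
assert chunk_hash(b"the same text") != chunk_hash(b"different text")
```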
Chunking
- Break a file into a sequence of chunks.
- Chunk boundaries are determined by the local contents of the file.
- Basic sliding window algorithm:
  - A pair of pre-determined integers D and r, with r < D.
  - A fixed-width sliding window of width W.
  - F_k is the fingerprint of the window ending at position k.
  - Position k is a chunk boundary if F_k mod D = r.
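A minimal sketch of the basic sliding-window chunker. A simple polynomial rolling hash stands in for the fingerprint function, and the W, D, r values are illustrative assumptions; with a uniform fingerprint the expected chunk size is roughly D bytes.

```python
def sliding_window_chunks(data: bytes, W=48, D=1024, r=0):
    BASE, MOD = 257, (1 << 61) - 1
    top = pow(BASE, W - 1, MOD)              # weight of the byte leaving the window
    chunks, start, fp = [], 0, 0
    for k, byte in enumerate(data):
        if k >= W:
            fp = (fp - data[k - W] * top) % MOD   # drop the byte sliding out
        fp = (fp * BASE + byte) % MOD             # add the byte sliding in
        if k + 1 >= W and fp % D == r:            # position k is a chunk boundary
            chunks.append(data[start:k + 1])
            start = k + 1
    if start < len(data):
        chunks.append(data[start:])               # trailing chunk without a boundary
    return chunks
```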
Chunking and file similarity
- Requirement on the content-based chunking algorithm:
  - When two sequences R and R' share a contiguous sub-sequence larger than the average chunk size,
  - there should be a good probability that at least one shared chunk falls within the shared sub-sequence.
- Use the TTTD algorithm to satisfy this while keeping chunk sizes bounded.
  - The Two Thresholds, Two Divisors algorithm.
  - Four parameters:
    - D, the main divisor
    - D', the backup divisor
    - T_min, the minimum chunk size threshold
    - T_max, the maximum chunk size threshold
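A rough sketch of the TTTD idea: ignore boundaries before T_min, remember a backup boundary from the backup divisor D', and force a cut by T_max if the main divisor never fires. CRC32 of the last W bytes stands in for the rolling fingerprint, and the parameter values are illustrative, not the paper's; a real implementation uses a rolling hash and rescans from each cut point.

```python
import zlib

def tttd_chunks(data: bytes, D=540, D2=270, Tmin=460, Tmax=2800, r=0, W=48):
    chunks, start, backup = [], 0, -1
    for k in range(len(data)):
        if k - start + 1 < Tmin:                 # enforce the minimum chunk size
            continue
        f = zlib.crc32(data[max(0, k - W + 1):k + 1])
        if f % D2 == r:
            backup = k                           # remember a backup boundary
        if f % D == r or k - start + 1 >= Tmax:  # main divisor fired, or forced cut
            cut = k if f % D == r else (backup if backup >= start else k)
            chunks.append(data[start:cut + 1])
            start, backup = cut + 1, -1
    if start < len(data):
        chunks.append(data[start:])              # trailing chunk
    return chunks
```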
File similarity algorithm
- Step 1
  - Break each file's content into chunks.
  - For each chunk, record its byte length and its hash code.
  - The bit-length of the hash code must be sufficiently long
  - to avoid accidental hash collisions among truly different chunks.
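A small sketch of the per-file metadata recorded in Step 1, reusing a chunker like the ones above; the record layout and the use of MD5 digests here are illustrative choices.

```python
import hashlib

def file_chunk_records(path: str, chunker) -> list[tuple[bytes, int]]:
    # One (chunk hash, chunk byte length) record per chunk of the file.
    with open(path, "rb") as f:
        data = f.read()
    return [(hashlib.md5(c).digest(), len(c)) for c in chunker(data)]
```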
File similarity algorithm (cont.)
- Step 2
  - Optional step for scalability.
  - Prune and partition the above metadata into independent sub-problems,
  - each small enough to fit in memory.
- Step 3
  - Construct a bipartite graph
  - with an edge between a file vertex and a chunk vertex
  - iff the chunk occurs in the file.
  - File nodes are annotated with their file length;
  - chunk nodes are annotated with their chunk length.
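A sketch of the Step 3 bipartite file-chunk graph as two adjacency maps plus length annotations; the dict-based representation is an assumption made for illustration, not the paper's data structure.

```python
from collections import defaultdict

def build_bipartite_graph(records):
    """records: {file_name: [(chunk_hash, chunk_len), ...]}"""
    file_to_chunks = defaultdict(set)   # file vertex -> incident chunk vertices
    chunk_to_files = defaultdict(set)   # chunk vertex -> incident file vertices
    chunk_len, file_len = {}, {}
    for name, chunks in records.items():
        file_len[name] = sum(length for _, length in chunks)  # file length annotation
        for h, length in chunks:
            file_to_chunks[name].add(h)
            chunk_to_files[h].add(name)
            chunk_len[h] = length                              # chunk length annotation
    return file_to_chunks, chunk_to_files, file_len, chunk_len
```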
File similarity algorithm (cont.)
- Step 4
  - Construct a separate file-file similarity graph.
  - For each file A:
    - (a) Look up the chunks AC that occur in file A.
    - (b) For each chunk in AC, look up the files it appears in, accumulating the set of other files BS that share any chunks with file A. (As an optimization due to symmetry, exclude files that have previously been considered as file A in step 4.)
    - (c) For each file B in set BS, determine its chunks in common with file A, and add an A-B edge to the file similarity graph if the total chunk bytes in common exceed some threshold, or some percentage of the file length.
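A sketch of Step 4 built on the bipartite graph above; the absolute-bytes threshold and the 50% length fraction are illustrative assumptions.

```python
def build_similarity_graph(file_to_chunks, chunk_to_files, file_len, chunk_len,
                           min_bytes=2000, min_fraction=0.5):
    edges, done = [], set()
    for a in file_to_chunks:                      # each file takes a turn as file A
        bs = set()
        for h in file_to_chunks[a]:               # (a) + (b): gather candidate files
            bs |= chunk_to_files[h]
        bs -= done | {a}                          # symmetry: skip files already used as A
        for b in bs:                              # (c): score each candidate file B
            shared = file_to_chunks[a] & file_to_chunks[b]
            common = sum(chunk_len[h] for h in shared)
            if common >= min_bytes or common >= min_fraction * min(file_len[a], file_len[b]):
                edges.append((a, b, common))
        done.add(a)
    return edges
```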
File similarity algorithm (cont.)
- Step 5
  - Output the file-file similarity pairs as desired.
  - Use a union-find algorithm to determine clusters of interconnected files.
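A sketch of Step 5: clustering files connected by similarity edges with a simple union-find (path compression only). This is a generic implementation, not the paper's Perl module.

```python
def cluster_files(edges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path compression
            x = parent[x]
        return x

    for a, b, _ in edges:
        parent[find(a)] = find(b)             # union the two components

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())
```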
Handling identical files
- Multiple files may have identical content.
- Use the same metadata, with a small enhancement.
- While loading the file-chunk data:
  - Compute a hash over all of a file's chunk hashes.
  - Maintain a hash table that references file nodes by their unique content hashes.
  - If a file with the same content hash has already been loaded,
  - note the duplicate file name and avoid duplicating the chunk data in memory.
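A sketch of this enhancement: hash the concatenated chunk hashes to get a whole-content hash, and fold exact duplicates into a single node. The FileNode class is an assumed illustration.

```python
import hashlib

class FileNode:
    def __init__(self, name, chunks):
        self.names, self.chunks = [name], chunks        # all names sharing this content

def load_file_nodes(records):
    """records: {file_name: [(chunk_hash, chunk_len), ...]}"""
    by_content = {}                                     # content hash -> FileNode
    for name, chunks in records.items():
        content = hashlib.md5(b"".join(h for h, _ in chunks)).digest()
        if content in by_content:
            by_content[content].names.append(name)      # duplicate: reuse the node
        else:
            by_content[content] = FileNode(name, chunks)
    return list(by_content.values())
```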
Handling identical files (cont.)
Complexity analysis
- Chunking the files is linear in the total size N of the content.
- The similarity analysis is O(C log C), where C is the number of chunks in the repository (including duplicates).
- Since C is linear in N, the overall cost is O(N log N).
Results
- Implemented the chunking algorithm in C++ (~1200 lines of code).
- Used Perl to implement
  - the similarity analysis algorithm (~500 LOC),
  - the bipartite partitioning algorithm (~250 LOC),
  - a shared union-find module (~300 LOC).
- Performance on a given repository ranges widely depending on the average chunk size (a controllable parameter).
  - 52,125 technical support documents in 347 folders,
  - comprising 327 MB of HTML content.
  - On a 3 GHz Intel processor with 1 GB RAM:
    - chunk size set to 5000 bytes -> took 25 minutes and generated 88,510 chunks;
    - chunk size set to 100 bytes -> took 39 minutes and generated 3.8 million chunks.
Related work
- Brin et al., "Copy detection mechanisms for digital documents"
  - Maintains a large indexed database of existing documents.
  - Detects whether a new document contains material that already exists in the database.
  - It is a 1-vs-N document method; this paper is all-to-all.
  - Its chunk boundaries are based on the hash of 'text units':
    - paragraphs
    - sentences
    - These text units do not handle technical documentation well.
  - This paper uses the TTTD chunking algorithm instead.
Conclusions
- The method identifies pieces of content that may have been duplicated.
  - It relies on chunking technology rather than paragraph boundary detection.
- The bottleneck is the human attention needed to review the many results.
- Future work
  - Reducing false alarms and missed detections.
  - Making the human review process as productive as possible.
