Pairwise document similarity in large collections with map reduce



  1. Pairwise document similarity in large collections with MapReduce
     Tamer Elsayed, Jimmy Lin, and Douglas Oard
     Presented by: Niveda Krishnamoorthy
  2. Outline:
     • Pairwise similarity
     • MapReduce framework
     • Proposed algorithm: inverted index construction, pairwise document similarity calculation
     • Results
  3. Motivating applications:
     • PubMed – “More like this”
     • Similar blog posts
     • Google – “Similar pages”
  4. MapReduce:
     • A framework that supports distributed computing on clusters of computers
     • Introduced by Google in 2004
     • Map step
     • Reduce step
     • Combine step (optional)
     • Applications
  5. Word count example – consider two files:
     • File 1: Hello World Bye World
     • File 2: Hello Hadoop Goodbye Hadoop
     Expected output: (Hello, 2), (World, 2), (Bye, 1), (Hadoop, 2), (Goodbye, 1)
  6. Map step:
     • Map 1 (File 1: Hello World Bye World) emits: <Hello,1> <World,1> <Bye,1> <World,1>
     • Map 2 (File 2: Hello Hadoop Goodbye Hadoop) emits: <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>
  7. Shuffle & sort, then reduce:
     • Reduce 1: <Hello,(1,1)> → (Hello, 2)
     • Reduce 2: <World,(1,1)> → (World, 2)
     • Reduce 3: <Bye,(1)> → (Bye, 1)
     • Reduce 4: <Hadoop,(1,1)> → (Hadoop, 2)
     • Reduce 5: <Goodbye,(1)> → (Goodbye, 1)
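The word-count pipeline on these slides can be sketched in plain Python, with ordinary functions standing in for Hadoop's map, shuffle-and-sort, and reduce phases (the function names here are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def map_word_count(document):
    """Map step: emit a <word, 1> pair for every word in the document."""
    for word in document.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle & sort: group all emitted values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_word_count(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return (word, sum(counts))

# The two files from the slides.
docs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for doc in docs for pair in map_word_count(doc)]
counts = dict(reduce_word_count(w, c) for w, c in shuffle(mapped).items())
print(counts)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```

In the real framework each map and reduce call runs on a different machine and the shuffle happens over the network; this single-process sketch only mirrors the data flow.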
  8. MapReduce algorithm – scalable and efficient:
     • Inverted index computation
     • Pairwise similarity calculation
  9. Inverted index – map step:
     • Document 1 (A A B C) – Map 1 emits: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
     • Document 2 (B D D) – Map 2 emits: <B,(d2,1)> <D,(d2,2)>
     • Document 3 (A B B E) – Map 3 emits: <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
  10. Inverted index – shuffle & sort, then reduce:
      • Reduce 1: <A,[(d1,2),(d3,1)]>
      • Reduce 2: <B,[(d1,1),(d2,1),(d3,2)]>
      • Reduce 3: <C,[(d1,1)]>
      • Reduce 4: <D,[(d2,2)]>
      • Reduce 5: <E,[(d3,1)]>
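The inverted-index job above can likewise be sketched in single-process Python: the map step counts term frequencies within one document and emits <term, (doc_id, tf)> postings, and the grouping stands in for the shuffle-and-reduce that assembles postings lists (names and corpus are illustrative):

```python
from collections import defaultdict

def map_postings(doc_id, text):
    """Map step: count term frequencies in one document and
    emit <term, (doc_id, tf)> postings."""
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    for term, count in tf.items():
        yield (term, (doc_id, count))

def build_inverted_index(corpus):
    """Shuffle & reduce: group postings by term into postings lists."""
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, posting in map_postings(doc_id, text):
            index[term].append(posting)
    return dict(index)

# The three toy documents from the slides.
corpus = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
index = build_inverted_index(corpus)
# index["B"] == [("d1", 1), ("d2", 1), ("d3", 2)]
```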
  11. • Group by document ID, not by pairs
      • Golomb compression for postings (individual postings → list of postings)
  12. Pairwise similarity – map step:
      • Map 1, from <A,[(d1,2),(d3,1)]>, emits: <(d1,d3),2>
      • Map 2, from <B,[(d1,1),(d2,1),(d3,2)]>, emits: <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
      • <C,[(d1,1)]>, <D,[(d2,2)]>, and <E,[(d3,1)]> each hold a single posting, so no pairs are emitted
  13. Pairwise similarity – shuffle & sort, then reduce:
      • Reduce 1: <(d1,d2),[1]> → <(d1,d2),1>
      • Reduce 2: <(d2,d3),[2]> → <(d2,d3),2>
      • Reduce 3: <(d1,d3),[2,2]> → <(d1,d3),4>
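The pairwise-similarity job on these slides can be sketched the same way: for each term's postings list, the map step emits the product of the two documents' term weights for every document pair sharing that term (raw term frequencies here, standing in for the BM25 weights used in the actual experiments), and the reduce step sums the partial products per pair:

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(postings):
    """Map step: for each pair of documents sharing this term, emit the
    product of their term weights as a partial similarity contribution."""
    for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
        yield ((doc_a, doc_b), w_a * w_b)

def pairwise_similarity(index):
    """Shuffle & reduce: sum the partial products for each document pair."""
    sums = defaultdict(int)
    for postings in index.values():
        for pair, partial in map_pairs(postings):
            sums[pair] += partial
    return dict(sums)

# The inverted index built on the previous slides.
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
sims = pairwise_similarity(index)
# sims == {("d1", "d3"): 4, ("d1", "d2"): 1, ("d2", "d3"): 2}
```

Note how d1 and d3 accumulate contributions from two shared terms (A and B), which is exactly the <(d1,d3),[2,2]> → <(d1,d3),4> reduction shown above.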
  14. Experimental setup:
      • Hadoop 0.16.0
      • 20 machines (4 GB memory, 100 GB disk each)
      • Similarity function: BM25
      • Dataset: AQUAINT-2 (newswire text) – 2.5 GB, 906k documents
  15. Preprocessing:
      • Tokenization
      • Stop-word removal
      • Stemming
      • Df-cut: the fraction of terms with the highest document frequency is eliminated – 99% cut (9,093)
      • Linear space and time complexity: 3.7 billion pairs vs. 81 trillion pairs
  16. Limitations:
      • Complexity: O(n²)
      • A df-cut of 99 percent eliminates meaning-bearing terms along with some irrelevant terms (e.g., Cornell, arthritis; sleek, frail)
      • The df-cut can be relaxed to 99.9 percent
  17. Critique:
      • The exact algorithms used for inverted index construction and pairwise document similarity are not specified
      • Df-cut: does a df-cut of 99 percent significantly affect the quality of the results? The results have not been evaluated.
