Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010

Hadoop Summit 2010 - Research Track
Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li, University of California, Irvine

  1. 1. Efficient Parallel Set-Similarity Joins Using Hadoop <ul><li>Chen Li </li></ul>Joint work with Michael Carey and Rares Vernica
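  2. 2. Motivation: Data Cleaning <table><tr><th>Star</th><th>Title</th><th>Year</th><th>Genre</th></tr><tr><td>Keanu Reeves</td><td>The Matrix</td><td>1999</td><td>Sci-Fi</td></tr><tr><td>Tom Hanks</td><td>Toy Story 3</td><td>2010</td><td>Animation</td></tr><tr><td>Schwarzenegger</td><td>The Terminator</td><td>1984</td><td>Sci-Fi</td></tr><tr><td>Samuel Jackson</td><td>The man</td><td>2006</td><td>Crime</td></tr></table> Find movies starring Tom Hanks
  3. 3. Movies starring S..warz…ne…ger? <table><tr><th>Star</th><th>Title</th><th>Year</th><th>Genre</th></tr><tr><td>Keanu Reeves</td><td>The Matrix</td><td>1999</td><td>Sci-Fi</td></tr><tr><td>Tom Hanks</td><td>Toy Story 3</td><td>2010</td><td>Animation</td></tr><tr><td>Schwarzenegger</td><td>The Terminator</td><td>1984</td><td>Sci-Fi</td></tr><tr><td>Samuel Jackson</td><td>The man</td><td>2006</td><td>Crime</td></tr></table>
  4. 4. Similarity Search Find movies with a star “similar to” Schwarrzenger. <table><tr><th>Star</th><th>Title</th><th>Year</th><th>Genre</th></tr><tr><td>Keanu Reeves</td><td>The Matrix</td><td>1999</td><td>Sci-Fi</td></tr><tr><td>Samuel Jackson</td><td>Iron man</td><td>2008</td><td>Sci-Fi</td></tr><tr><td>Schwarzenegger</td><td>The Terminator</td><td>1984</td><td>Sci-Fi</td></tr><tr><td>Samuel Jackson</td><td>The man</td><td>2006</td><td>Crime</td></tr></table>
  5. 5. Record linkage <ul><li>Table R, Star: Keanu Reeves, Samuel Jackson, Schwarzenegger, … </li></ul><ul><li>Table S, Star: Keanu Reeves, Samuel L. Jackson, Schwarzenegger, … </li></ul>
  6. 6. Two-step solution <ul><li>Step 1: Similarity Join (Table R, Star …; Table S, Star …) </li></ul><ul><li>Step 2: Verification </li></ul>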
  7. 7. Focus of this talk <ul><li>Similarity join for large data sets </li></ul><ul><li>Techniques applicable to other domains, e.g.: </li></ul><ul><ul><li>Finding similar documents </li></ul></ul><ul><ul><li>Finding customers with similar patterns </li></ul></ul>
  8. 8. Talk Outline <ul><li>Formulation: set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul><ul><ul><li>More results: see SIGMOD2010 paper </li></ul></ul>
  9. 9. Set-Similarity Join Find pairs of records whose similarity on their join attributes exceeds a threshold t (similarity > t)
  10. 10. Why this formulation? <ul><li>Word tokens: “Samuel L. Jackson” → {Samuel, L., Jackson}; “Samuel Jackson” → {Samuel, Jackson} </li></ul><ul><li>Gram tokens: S c h w a r z e n e g g e r </li></ul>
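The two tokenizations on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `word_tokens` and `gram_tokens` and the default gram length `q=3` are assumptions made for the example.

```python
def word_tokens(text):
    """Split a string into a set of whitespace-delimited word tokens."""
    return set(text.split())


def gram_tokens(text, q=3):
    """Return the set of overlapping q-grams (substrings of length q).

    The gram length q is an assumption; the slides do not state one.
    """
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


if __name__ == "__main__":
    print(word_tokens("Samuel L. Jackson"))
    print(sorted(gram_tokens("Schwarzenegger")))
```

Word tokens suit multi-word values such as names with optional middle initials, while gram tokens tolerate misspellings inside a single word.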
  11. 11. Set-similarity functions <ul><li>Jaccard </li></ul><ul><li>Dice </li></ul><ul><li>Cosine </li></ul><ul><li>Hamming </li></ul><ul><li>… </li></ul><ul><li>All solvable in this framework </li></ul>
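For reference, the listed functions can be written over plain Python sets using their textbook definitions. This is a sketch only; the handling of empty sets is an assumption made here, and Hamming is shown as a distance (size of the symmetric difference) rather than a similarity.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (assumption)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|); returns 0.0 if either set is empty (assumption)."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def hamming(a, b):
    """Hamming distance between sets: size of the symmetric difference."""
    return len(a ^ b)


if __name__ == "__main__":
    r = {"Samuel", "L.", "Jackson"}
    s = {"Samuel", "Jackson"}
    print(jaccard(r, s), dice(r, s), cosine(r, s), hamming(r, s))
```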
  12. 12. Talk Outline <ul><li>Formulation of set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul>
  13. 13. Why Hadoop? <ul><li>Large amounts of data </li></ul><ul><li>Data or processing does not fit in one machine </li></ul><ul><li>Assumptions: </li></ul><ul><ul><li>Self join: R = S </li></ul></ul><ul><ul><li>Two similar sets share at least 1 token </li></ul></ul>
  14. 14. A naïve solution <ul><li>Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23) </li></ul><ul><li>Reduce: (a,23), (a,29), (a,50), … → Verify each pair </li></ul><ul><li>Too much data to transfer </li></ul><ul><li>Too many pairs to verify </li></ul>
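The naïve scheme can be simulated in plain Python, without Hadoop, to see why it produces too many pairs. This is a sketch under assumptions: `naive_map`, `naive_reduce`, and `run_naive` are hypothetical names, and the in-memory dictionary stands in for the shuffle between map and reduce.

```python
from collections import defaultdict
from itertools import combinations


def naive_map(rid, tokens):
    """Emit (token, rid) for every token of the record."""
    for tok in tokens:
        yield tok, rid


def naive_reduce(token, rids):
    """Emit every pair of record ids that share this token."""
    for a, b in combinations(sorted(set(rids)), 2):
        yield a, b


def run_naive(records):
    """Simulate map, shuffle (group by token), and reduce in memory."""
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for tok, r in naive_map(rid, tokens):
            groups[tok].append(r)
    candidates = []
    for tok, rids in groups.items():
        candidates.extend(naive_reduce(tok, rids))
    return candidates


if __name__ == "__main__":
    records = {23: ("a", "b", "c"), 29: ("a", "b", "d"), 50: ("a", "e", "f")}
    print(run_naive(records))
```

In this toy input the frequent token `a` alone pairs up all three records, and the pair (23, 29) is produced twice, once per shared token. Both effects grow quickly with frequent tokens in real data.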
  15. 15. Solving frequency skew: prefix filtering <ul><li>Prefixes of similar sets should share tokens </li></ul><ul><li>Sort tokens by frequency (ascending) </li></ul><ul><li>Prefix of a set: least frequent tokens </li></ul>[Figure: sets r1 and r2 with tokens sorted by frequency; prefix marked] Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5
  16. 16. Prefix filtering: example <ul><li>Each set has 5 tokens </li></ul><ul><li>“Similar”: they share at least 4 tokens </li></ul><ul><li>Prefix length: 2 (= 5 − 4 + 1) </li></ul><ul><li>Record 1 </li></ul><ul><li>Record 2 </li></ul>
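The example's numbers follow from a pigeonhole argument: if two 5-token sets share at least 4 tokens, each has at most 1 token the other lacks, so the first 5 − 4 + 1 = 2 tokens of each set in the global order must overlap. A small sketch, assuming an overlap threshold as in the slide; `prefix_length`, `prefix`, and the toy token order are hypothetical.

```python
def prefix_length(set_size, min_overlap):
    """Number of leading tokens to keep for an overlap threshold."""
    return set_size - min_overlap + 1


def prefix(tokens, rank, min_overlap):
    """Sort tokens by global rank (rarest first) and keep the prefix."""
    ordered = sorted(tokens, key=lambda t: rank[t])
    return ordered[:prefix_length(len(ordered), min_overlap)]


if __name__ == "__main__":
    # Toy global order: 'a' is the least frequent token, 'g' the most frequent.
    rank = {t: i for i, t in enumerate("abcdefg")}
    r1 = {"a", "b", "c", "d", "e"}
    r2 = {"b", "c", "d", "e", "f"}   # shares 4 tokens with r1
    r3 = {"c", "d", "e", "f", "g"}   # shares only 3 tokens with r1
    p1, p2, p3 = (prefix(r, rank, 4) for r in (r1, r2, r3))
    print(p1, p2, p3)
    print(bool(set(p1) & set(p2)))   # prefixes overlap: candidate pair
    print(bool(set(p1) & set(p3)))   # prefixes disjoint: pair is pruned
```

Only pairs whose prefixes intersect need to be verified, so records are routed by their rare tokens instead of by all tokens.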
  17. 17. Hadoop Solution: Overview <ul><li>Stage 1: Order tokens by frequency </li></ul><ul><li>Stage 2: Finding “similar” id pairs </li></ul><ul><li>Stage 3: id pairs → record pairs </li></ul>
  18. 18. Stage 1: Sort tokens by frequency <ul><li>Compute token frequencies (MapReduce phase 1) </li></ul><ul><li>Sort them (MapReduce phase 2) </li></ul>
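The two steps of Stage 1 can be sketched in memory as a count followed by a sort. This is an illustration only: `count_tokens` and `order_tokens` are hypothetical names, a `Counter` stands in for the first MapReduce job, and ties are broken alphabetically as an assumption so the order is deterministic.

```python
from collections import Counter


def count_tokens(records):
    """Step 1: count in how many records each token appears."""
    counts = Counter()
    for tokens in records.values():
        counts.update(set(tokens))
    return counts


def order_tokens(counts):
    """Step 2: sort tokens by ascending frequency and assign ranks."""
    ordered = sorted(counts, key=lambda t: (counts[t], t))
    return {tok: pos for pos, tok in enumerate(ordered)}


if __name__ == "__main__":
    records = {1: ["a", "b", "c"], 2: ["b", "c"], 3: ["c"]}
    rank = order_tokens(count_tokens(records))
    print(rank)
```

The resulting rank table is what later stages use to sort each record's tokens before taking its prefix.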
  19. 19. Stage 2: Find “similar” id pairs <ul><li>Partition using prefixes </li></ul><ul><li>Verify similarity </li></ul>
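Stage 2 can be simulated as a map step that routes each record to one group per prefix token and a reduce step that verifies pairs within each group. The sketch below is not the authors' implementation. It assumes Jaccard similarity, uses the standard Jaccard prefix length |s| − ⌈t·|s|⌉ + 1 (not stated on the slides), and accepts pairs with similarity ≥ t, whereas the slides write > t; `stage2` is a hypothetical name.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def stage2(records, rank, threshold):
    """Return {(id1, id2): similarity} for pairs meeting the threshold."""
    # Map: emit (prefix token, (rid, tokens)) for every token in the prefix.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        ordered = sorted(set(tokens), key=lambda t: rank[t])
        plen = len(ordered) - math.ceil(threshold * len(ordered)) + 1
        for tok in ordered[:plen]:
            groups[tok].append((rid, ordered))

    # Reduce: verify every pair of records that share a prefix token.
    results = {}
    for group in groups.values():
        group.sort(key=lambda item: item[0])
        for (a, ta), (b, tb) in combinations(group, 2):
            if (a, b) in results:
                continue  # already verified via another shared prefix token
            sim = jaccard(ta, tb)
            if sim >= threshold:  # assumption: >= rather than the slides' >
                results[(a, b)] = sim
    return results


if __name__ == "__main__":
    rank = {t: i for i, t in enumerate("abcdefxyz")}
    records = {
        1: ["a", "b", "c", "d", "e"],
        2: ["b", "c", "d", "e", "f"],
        3: ["c", "d", "x", "y", "z"],
    }
    print(stage2(records, rank, 0.5))
```

In a real MapReduce job the duplicate check would not be a shared dictionary, since reducers run independently; duplicates would have to be removed in a later step or avoided by construction.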
  20. 20. Stage 3: id pairs → record pairs (phase 1) <ul><li>Bring records for each id in each pair </li></ul>
  21. 21. Stage 3: id pairs → record pairs (phase 2) <ul><li>Join two half-filled records </li></ul>
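The two phases of Stage 3 can be sketched as two group-by steps: first group by record id to attach each record to the pairs that mention it, then group by pair to combine the two halves. This is a minimal in-memory illustration; `stage3` and the data layout are assumptions, not the authors' code.

```python
from collections import defaultdict


def stage3(records, id_pairs):
    """Turn (id1, id2) pairs into (record1, record2) pairs."""
    # Phase 1: group pairs by record id and attach the record,
    # producing "half-filled" pairs keyed by the id pair.
    by_rid = defaultdict(list)
    for pair in id_pairs:
        for rid in pair:
            by_rid[rid].append(pair)
    half_filled = []
    for rid, pairs in by_rid.items():
        for pair in pairs:
            half_filled.append((pair, (rid, records[rid])))

    # Phase 2: group the half-filled pairs by id pair and join the halves.
    by_pair = defaultdict(dict)
    for pair, (rid, rec) in half_filled:
        by_pair[pair][rid] = rec
    return [(halves[a], halves[b]) for (a, b), halves in by_pair.items()]


if __name__ == "__main__":
    records = {1: "Samuel Jackson", 2: "Samuel L. Jackson", 3: "Keanu Reeves"}
    print(stage3(records, [(1, 2)]))
```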
  22. 22. Talk Outline <ul><li>Formulation of set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul>
  23. 23. Experimental Setting <ul><li>Hardware </li></ul><ul><ul><li>10-node IBM x3650 cluster </li></ul></ul><ul><ul><li>Intel Xeon processor E5520 2.26GHz with four cores </li></ul></ul><ul><ul><li>Four 300GB hard disks </li></ul></ul><ul><ul><li>12GB RAM </li></ul></ul><ul><li>Software </li></ul><ul><ul><li>Ubuntu 9.06, 64-bit, server edition OS </li></ul></ul><ul><ul><li>Java 1.6, 64-bit, server </li></ul></ul><ul><ul><li>Hadoop 0.20.1 </li></ul></ul><ul><li>Datasets: publications (DBLP and CITESEERX) </li></ul>
  24. 24. Running time <ul><li>Stage 2 </li></ul><ul><li>Stage 1 </li></ul><ul><li>Stage 3 </li></ul>
  25. 25. Speedup
  26. 26. Speedup Breakdown <ul><li>Stage 2 has good speedup </li></ul>
  27. 27. Scaleup <ul><li>Good scaleup </li></ul>
  28. 28. Additional results <ul><li>Other methods for the 3 stages </li></ul><ul><li>Case: R <> S </li></ul><ul><li>Dealing with limited memory </li></ul>
  29. 29. Summary <ul><li>Set-similarity joins in Hadoop: </li></ul><ul><li>Three-stage approach using Hadoop </li></ul><ul><li>Experimental study </li></ul>
  30. 30. Thank you <ul><li>Chen Li @ UC Irvine </li></ul><ul><li>Source code available at: </li></ul><ul><li>http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/ </li></ul>Acknowledgements: NSF, Google, IBM.
