Efficient Parallel Set-Similarity Joins Using Hadoop <ul><li>Chen Li </li></ul>Joint work with  Michael Carey and Rares Ve...
Motivation: Data Cleaning Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Animation S...
Movies starring S..warz…ne…ger? Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Anima...
Similarity Search Find movies with a star  “ similar to ”  Schwarrzenger . Star Title Year Genre Keanu Reeves The Matrix 1...
Record linkage Table R Table S Star Keanu Reeves Samuel Jackson Schwarzenegger … Star Keanu Reeves Samuel  L.  Jackson Sch...
Two-step solution Table R Table S Step 2: Verification Star … Star … Step 1: Similarity Join
<ul><li>Similarity join for large data sets </li></ul><ul><li>Techniques applicable to other domains, e.g.: </li></ul><ul>...
<ul><li>Formulation: set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul>...
Set-Similarity Join Finding pairs of records with a  similarity  on their join attributes > t
Why this formulation? <ul><li>Word tokens: </li></ul><ul><li>Gram tokens: </li></ul>“ Samuel  L.  Jackson”    {Samuel,  L...
Set-similarity functions <ul><li>Jaccard </li></ul><ul><li>Dice </li></ul><ul><li>Cosine </li></ul><ul><li>Hamming </li></...
<ul><li>Formulation of set-similarity join </li></ul><ul><li>   Hadoop-based solutions </li></ul><ul><li>Experiments </li...
<ul><li>Large amounts of data </li></ul><ul><li>Data or processing does not fit in one machine </li></ul><ul><li>Assumptio...
<ul><li>Map:  <23, (a,b,c)>    (a, 23), (b, 23), (c, 23) </li></ul>A naïve solution <ul><li>Too much data to transfer   ...
Solving frequency skew: prefix filtering <ul><li>Prefixes of similar sets should share tokens </li></ul><ul><li>Sort token...
Prefix filtering: example <ul><li>Each set has 5 tokens </li></ul><ul><li>“ Similar”: they share at least 4 tokens </li></...
<ul><li>Stage 1: Order tokens by frequency </li></ul><ul><li>Stage 2: Finding “similar” id pairs </li></ul><ul><li>Stage 3...
Stage 1: Sort tokens by frequency <ul><li>Compute token frequencies </li></ul><ul><li>Sort them </li></ul>MapReduce phase ...
Stage 2: Find “similar” id pairs  <ul><li>Partition using prefixes </li></ul><ul><li>Verify similarity </li></ul>
Stage 3: id pairs    record pairs (phase 1) <ul><li>Bring records for each id in each pair </li></ul>
Stage 3: id pairs    record pairs (phase 2)  <ul><li>Join two half filled records </li></ul>
<ul><li>Formulation of set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li>   Experiments </li...
<ul><li>Hardware </li></ul><ul><ul><li>10-node  IBM x3650 cluster </li></ul></ul><ul><ul><li>Intel Xeon processor E5520 2....
Running time <ul><li>Stage 2 </li></ul><ul><li>Stage 1 </li></ul><ul><li>Stage 3 </li></ul>
Speedup
Speedup Breakdown <ul><li>Stage 2 has good speedup </li></ul>
Scaleup <ul><li>Good scaleup </li></ul>
<ul><li>Other methods for the 3 stages </li></ul><ul><li>Case: R <> S </li></ul><ul><li>Dealing with limited memory </li><...
<ul><li>Set-similarity joins in Hadoop:  </li></ul><ul><li>Three-stage approach using Hadoop </li></ul><ul><li>Experimenta...
Thank you <ul><li>Chen Li @ UC Irvine </li></ul><ul><li>Source code available at:  </li></ul><ul><li>http://asterix.ics.uc...
Upcoming SlideShare
Loading in...5
×

Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010

1,895

Published on

Hadoop Summit 2010 - Research Track
Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li, University of California, Irvine

2 Comments
7 Likes
Statistics
Notes
No Downloads
Views
Total Views
1,895
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
2
Likes
7
Embeds 0
No embeds

No notes for slide

Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010

  1. 1. Efficient Parallel Set-Similarity Joins Using Hadoop <ul><li>Chen Li </li></ul>Joint work with Michael Carey and Rares Vernica
  2. 2. Motivation: Data Cleaning Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Animation Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime Find movies starring Tom Hanks
  3. 3. Movies starring S..warz…ne…ger? Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Animation Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime
  4. 4. Similarity Search Find movies with a star “ similar to ” Schwarrzenger . Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime
  5. 5. Record linkage Table R Table S Star Keanu Reeves Samuel Jackson Schwarzenegger … Star Keanu Reeves Samuel L. Jackson Schwarzenegger …
  6. 6. Two-step solution Table R Table S Step 2: Verification Star … Star … Step 1: Similarity Join
  7. 7. <ul><li>Similarity join for large data sets </li></ul><ul><li>Techniques applicable to other domains, e.g.: </li></ul><ul><ul><li>Finding similar documents </li></ul></ul><ul><ul><li>Finding customers with similar patterns </li></ul></ul>Focus of this talk
  8. 8. <ul><li>Formulation: set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul><ul><ul><li>More results: see SIGMOD2010 paper </li></ul></ul>Talk Outline
  9. 9. Set-Similarity Join Finding pairs of records with a similarity on their join attributes > t
  10. 10. Why this formulation? <ul><li>Word tokens: </li></ul><ul><li>Gram tokens: </li></ul>“ Samuel L. Jackson”  {Samuel, L., Jackson} “ Samuel Jackson”  {Samuel, Jackson} S c h w a r z e n e g g e r
  11. 11. Set-similarity functions <ul><li>Jaccard </li></ul><ul><li>Dice </li></ul><ul><li>Cosine </li></ul><ul><li>Hamming </li></ul><ul><li>… </li></ul><ul><li>All solvable in this framework </li></ul>
  12. 12. <ul><li>Formulation of set-similarity join </li></ul><ul><li> Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul>Talk Outline
  13. 13. <ul><li>Large amounts of data </li></ul><ul><li>Data or processing does not fit in one machine </li></ul><ul><li>Assumptions: </li></ul><ul><ul><li>Self join: R = S </li></ul></ul><ul><ul><li>Two similar sets share at least 1 token </li></ul></ul>Why Hadoop?
  14. 14. <ul><li>Map: <23, (a,b,c)>  (a, 23), (b, 23), (c, 23) </li></ul>A naïve solution <ul><li>Too much data to transfer  </li></ul><ul><li>Too many pairs to verify  . </li></ul><ul><li>Reduce:(a,23),(a,29),(a,50), …  Verify each pair </li></ul>
  15. 15. Solving frequency skew: prefix filtering <ul><li>Prefixes of similar sets should share tokens </li></ul><ul><li>Sort tokens by frequency (ascending) </li></ul><ul><li>Prefix of a set: least frequent tokens </li></ul>prefix r1 r2 Sorted by frequency Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5
  16. 16. Prefix filtering: example <ul><li>Each set has 5 tokens </li></ul><ul><li>“ Similar”: they share at least 4 tokens </li></ul><ul><li>Prefix length: 2 </li></ul><ul><li>Record 1 </li></ul><ul><li>Record 2 </li></ul>
  17. 17. <ul><li>Stage 1: Order tokens by frequency </li></ul><ul><li>Stage 2: Finding “similar” id pairs </li></ul><ul><li>Stage 3: id pairs  record paris </li></ul>Hadoop Solution: Overview
  18. 18. Stage 1: Sort tokens by frequency <ul><li>Compute token frequencies </li></ul><ul><li>Sort them </li></ul>MapReduce phase 1 MapReduce phase 2
  19. 19. Stage 2: Find “similar” id pairs <ul><li>Partition using prefixes </li></ul><ul><li>Verify similarity </li></ul>
  20. 20. Stage 3: id pairs  record pairs (phase 1) <ul><li>Bring records for each id in each pair </li></ul>
  21. 21. Stage 3: id pairs  record pairs (phase 2) <ul><li>Join two half filled records </li></ul>
  22. 22. <ul><li>Formulation of set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li> Experiments </li></ul>Talk Outline
  23. 23. <ul><li>Hardware </li></ul><ul><ul><li>10-node IBM x3650 cluster </li></ul></ul><ul><ul><li>Intel Xeon processor E5520 2.26GHz with four cores </li></ul></ul><ul><ul><li>Four 300GB hard disks </li></ul></ul><ul><ul><li>12GB RAM </li></ul></ul><ul><li>Software </li></ul><ul><ul><li>Ubuntu 9.06, 64-bit, server edition OS </li></ul></ul><ul><ul><li>Java 1.6, 64-bit, server </li></ul></ul><ul><ul><li>Hadoop 0.20.1 </li></ul></ul><ul><li>Datasets: publications (DBLP and CITESEERX) </li></ul>Experimental Setting
  24. 24. Running time <ul><li>Stage 2 </li></ul><ul><li>Stage 1 </li></ul><ul><li>Stage 3 </li></ul>
  25. 25. Speedup
  26. 26. Speedup Breakdown <ul><li>Stage 2 has good speedup </li></ul>
  27. 27. Scaleup <ul><li>Good scaleup </li></ul>
  28. 28. <ul><li>Other methods for the 3 stages </li></ul><ul><li>Case: R <> S </li></ul><ul><li>Dealing with limited memory </li></ul>Additional results
  29. 29. <ul><li>Set-similarity joins in Hadoop: </li></ul><ul><li>Three-stage approach using Hadoop </li></ul><ul><li>Experimental study </li></ul>Summary
  30. 30. Thank you <ul><li>Chen Li @ UC Irvine </li></ul><ul><li>Source code available at: </li></ul><ul><li>http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/ </li></ul>Acknowledgements: NSF, Google, IBM.

×