Your SlideShare is downloading. ×
0
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010

1,867

Published on

Hadoop Summit 2010 - Research Track …

Hadoop Summit 2010 - Research Track
Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li, University of California, Irvine

2 Comments
7 Likes
Statistics
Notes
No Downloads
Views
Total Views
1,867
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
2
Likes
7
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Efficient Parallel Set-Similarity Joins Using Hadoop <ul><li>Chen Li </li></ul>Joint work with Michael Carey and Rares Vernica
  • 2. Motivation: Data Cleaning Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Animation Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime Find movies starring Tom Hanks
  • 3. Movies starring S..warz…ne…ger? Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Animation Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime
  • 4. Similarity Search Find movies with a star “ similar to ” Schwarrzenger . Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime
  • 5. Record linkage Table R Table S Star Keanu Reeves Samuel Jackson Schwarzenegger … Star Keanu Reeves Samuel L. Jackson Schwarzenegger …
  • 6. Two-step solution Table R Table S Step 2: Verification Star … Star … Step 1: Similarity Join
  • 7. <ul><li>Similarity join for large data sets </li></ul><ul><li>Techniques applicable to other domains, e.g.: </li></ul><ul><ul><li>Finding similar documents </li></ul></ul><ul><ul><li>Finding customers with similar patterns </li></ul></ul>Focus of this talk
  • 8. <ul><li>Formulation: set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul><ul><ul><li>More results: see SIGMOD2010 paper </li></ul></ul>Talk Outline
  • 9. Set-Similarity Join Finding pairs of records with a similarity on their join attributes > t
  • 10. Why this formulation? <ul><li>Word tokens: </li></ul><ul><li>Gram tokens: </li></ul>“ Samuel L. Jackson”  {Samuel, L., Jackson} “ Samuel Jackson”  {Samuel, Jackson} S c h w a r z e n e g g e r
  • 11. Set-similarity functions <ul><li>Jaccard </li></ul><ul><li>Dice </li></ul><ul><li>Cosine </li></ul><ul><li>Hamming </li></ul><ul><li>… </li></ul><ul><li>All solvable in this framework </li></ul>
  • 12. <ul><li>Formulation of set-similarity join </li></ul><ul><li> Hadoop-based solutions </li></ul><ul><li>Experiments </li></ul>Talk Outline
  • 13. <ul><li>Large amounts of data </li></ul><ul><li>Data or processing does not fit in one machine </li></ul><ul><li>Assumptions: </li></ul><ul><ul><li>Self join: R = S </li></ul></ul><ul><ul><li>Two similar sets share at least 1 token </li></ul></ul>Why Hadoop?
  • 14. <ul><li>Map: <23, (a,b,c)>  (a, 23), (b, 23), (c, 23) </li></ul>A naïve solution <ul><li>Too much data to transfer  </li></ul><ul><li>Too many pairs to verify  . </li></ul><ul><li>Reduce:(a,23),(a,29),(a,50), …  Verify each pair </li></ul>
  • 15. Solving frequency skew: prefix filtering <ul><li>Prefixes of similar sets should share tokens </li></ul><ul><li>Sort tokens by frequency (ascending) </li></ul><ul><li>Prefix of a set: least frequent tokens </li></ul>prefix r1 r2 Sorted by frequency Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5
  • 16. Prefix filtering: example <ul><li>Each set has 5 tokens </li></ul><ul><li>“ Similar”: they share at least 4 tokens </li></ul><ul><li>Prefix length: 2 </li></ul><ul><li>Record 1 </li></ul><ul><li>Record 2 </li></ul>
  • 17. <ul><li>Stage 1: Order tokens by frequency </li></ul><ul><li>Stage 2: Finding “similar” id pairs </li></ul><ul><li>Stage 3: id pairs  record paris </li></ul>Hadoop Solution: Overview
  • 18. Stage 1: Sort tokens by frequency <ul><li>Compute token frequencies </li></ul><ul><li>Sort them </li></ul>MapReduce phase 1 MapReduce phase 2
  • 19. Stage 2: Find “similar” id pairs <ul><li>Partition using prefixes </li></ul><ul><li>Verify similarity </li></ul>
  • 20. Stage 3: id pairs  record pairs (phase 1) <ul><li>Bring records for each id in each pair </li></ul>
  • 21. Stage 3: id pairs  record pairs (phase 2) <ul><li>Join two half filled records </li></ul>
  • 22. <ul><li>Formulation of set-similarity join </li></ul><ul><li>Hadoop-based solutions </li></ul><ul><li> Experiments </li></ul>Talk Outline
  • 23. <ul><li>Hardware </li></ul><ul><ul><li>10-node IBM x3650 cluster </li></ul></ul><ul><ul><li>Intel Xeon processor E5520 2.26GHz with four cores </li></ul></ul><ul><ul><li>Four 300GB hard disks </li></ul></ul><ul><ul><li>12GB RAM </li></ul></ul><ul><li>Software </li></ul><ul><ul><li>Ubuntu 9.06, 64-bit, server edition OS </li></ul></ul><ul><ul><li>Java 1.6, 64-bit, server </li></ul></ul><ul><ul><li>Hadoop 0.20.1 </li></ul></ul><ul><li>Datasets: publications (DBLP and CITESEERX) </li></ul>Experimental Setting
  • 24. Running time <ul><li>Stage 2 </li></ul><ul><li>Stage 1 </li></ul><ul><li>Stage 3 </li></ul>
  • 25. Speedup
  • 26. Speedup Breakdown <ul><li>Stage 2 has good speedup </li></ul>
  • 27. Scaleup <ul><li>Good scaleup </li></ul>
  • 28. <ul><li>Other methods for the 3 stages </li></ul><ul><li>Case: R <> S </li></ul><ul><li>Dealing with limited memory </li></ul>Additional results
  • 29. <ul><li>Set-similarity joins in Hadoop: </li></ul><ul><li>Three-stage approach using Hadoop </li></ul><ul><li>Experimental study </li></ul>Summary
  • 30. Thank you <ul><li>Chen Li @ UC Irvine </li></ul><ul><li>Source code available at: </li></ul><ul><li>http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/ </li></ul>Acknowledgements: NSF, Google, IBM.

×