• Save
Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010

on

  • 2,672 views

Hadoop Summit 2010 - Research Track

Hadoop Summit 2010 - Research Track
Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li, University of California, Irvine

Statistics

Views

Total Views
2,672
Views on SlideShare
2,667
Embed Views
5

Actions

Likes
7
Downloads
0
Comments
2

1 Embed 5

http://192.168.6.179 5

Accessibility

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • can i download this presentation please
    Are you sure you want to
    Your message goes here
    Processing…
  • Chen Li's SIGMOD 2010 paper.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Efficient Parallel Set-Similarity Joins Using Hadoop__HadoopSummit2010 Presentation Transcript

  • 1. Efficient Parallel Set-Similarity Joins Using Hadoop
    • Chen Li
    Joint work with Michael Carey and Rares Vernica
  • 2. Motivation: Data Cleaning Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Animation Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime Find movies starring Tom Hanks
  • 3. Movies starring S..warz…ne…ger? Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Tom Hanks Toy Story 3 2010 Animation Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime
  • 4. Similarity Search Find movies with a star “ similar to ” Schwarrzenger . Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarzenegger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime
  • 5. Record linkage Table R Table S Star Keanu Reeves Samuel Jackson Schwarzenegger … Star Keanu Reeves Samuel L. Jackson Schwarzenegger …
  • 6. Two-step solution Table R Table S Step 2: Verification Star … Star … Step 1: Similarity Join
  • 7.
    • Similarity join for large data sets
    • Techniques applicable to other domains, e.g.:
      • Finding similar documents
      • Finding customers with similar patterns
    Focus of this talk
  • 8.
    • Formulation: set-similarity join
    • Hadoop-based solutions
    • Experiments
      • More results: see SIGMOD2010 paper
    Talk Outline
  • 9. Set-Similarity Join Finding pairs of records with a similarity on their join attributes > t
  • 10. Why this formulation?
    • Word tokens:
    • Gram tokens:
    “ Samuel L. Jackson”  {Samuel, L., Jackson} “ Samuel Jackson”  {Samuel, Jackson} S c h w a r z e n e g g e r
  • 11. Set-similarity functions
    • Jaccard
    • Dice
    • Cosine
    • Hamming
    • All solvable in this framework
  • 12.
    • Formulation of set-similarity join
    •  Hadoop-based solutions
    • Experiments
    Talk Outline
  • 13.
    • Large amounts of data
    • Data or processing does not fit in one machine
    • Assumptions:
      • Self join: R = S
      • Two similar sets share at least 1 token
    Why Hadoop?
  • 14.
    • Map: <23, (a,b,c)>  (a, 23), (b, 23), (c, 23)
    A naïve solution
    • Too much data to transfer 
    • Too many pairs to verify  .
    • Reduce:(a,23),(a,29),(a,50), …  Verify each pair
  • 15. Solving frequency skew: prefix filtering
    • Prefixes of similar sets should share tokens
    • Sort tokens by frequency (ascending)
    • Prefix of a set: least frequent tokens
    prefix r1 r2 Sorted by frequency Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5
  • 16. Prefix filtering: example
    • Each set has 5 tokens
    • “ Similar”: they share at least 4 tokens
    • Prefix length: 2
    • Record 1
    • Record 2
  • 17.
    • Stage 1: Order tokens by frequency
    • Stage 2: Finding “similar” id pairs
    • Stage 3: id pairs  record paris
    Hadoop Solution: Overview
  • 18. Stage 1: Sort tokens by frequency
    • Compute token frequencies
    • Sort them
    MapReduce phase 1 MapReduce phase 2
  • 19. Stage 2: Find “similar” id pairs
    • Partition using prefixes
    • Verify similarity
  • 20. Stage 3: id pairs  record pairs (phase 1)
    • Bring records for each id in each pair
  • 21. Stage 3: id pairs  record pairs (phase 2)
    • Join two half filled records
  • 22.
    • Formulation of set-similarity join
    • Hadoop-based solutions
    •  Experiments
    Talk Outline
  • 23.
    • Hardware
      • 10-node IBM x3650 cluster
      • Intel Xeon processor E5520 2.26GHz with four cores
      • Four 300GB hard disks
      • 12GB RAM
    • Software
      • Ubuntu 9.06, 64-bit, server edition OS
      • Java 1.6, 64-bit, server
      • Hadoop 0.20.1
    • Datasets: publications (DBLP and CITESEERX)
    Experimental Setting
  • 24. Running time
    • Stage 2
    • Stage 1
    • Stage 3
  • 25. Speedup
  • 26. Speedup Breakdown
    • Stage 2 has good speedup
  • 27. Scaleup
    • Good scaleup
  • 28.
    • Other methods for the 3 stages
    • Case: R <> S
    • Dealing with limited memory
    Additional results
  • 29.
    • Set-similarity joins in Hadoop:
    • Three-stage approach using Hadoop
    • Experimental study
    Summary
  • 30. Thank you
    • Chen Li @ UC Irvine
    • Source code available at:
    • http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
    Acknowledgements: NSF, Google, IBM.