• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Efficient Parallel Set-Similarity Joins Using MapReduce - SIGMOD 2010 Slides
 

Efficient Parallel Set-Similarity Joins Using MapReduce - SIGMOD 2010 Slides

on

  • 976 views

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as ...

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.

Statistics

Views

Total Views
976
Views on SlideShare
976
Embed Views
0

Actions

Likes
3
Downloads
20
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Efficient Parallel Set-Similarity Joins Using MapReduce - SIGMOD 2010 Slides Efficient Parallel Set-Similarity Joins Using MapReduce - SIGMOD 2010 Slides Presentation Transcript

    • Efficient Parallel Set-Similarity Joins Using MapReduce Rares Vernica Michael J. Carey Chen Li Department of Computer Science University of California, Irvine Special Interest Group on Management of Data, 2010Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 1 / 22
    • Outline1 Motivation2 Problem Statement3 Algorithms4 Experimental Evaluation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 2 / 22
    • Example: Data Cleaning/Master-Data-ManagementCustomer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 Smith John ... . . . . . .Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
    • Example: Data Cleaning/Master-Data-ManagementCustomer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 Smith John ... . . . . . .Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
    • Example: Data Cleaning/Master-Data-ManagementCustomer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 Smith John ... . . . . . .Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
    • Example: Data Cleaning/Master-Data-ManagementCustomer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 John W Smith ... . . . . . .Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
    • Example: Data Cleaning/Master-Data-ManagementString → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith}Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3Set-Similarity JoinSet-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
    • Example: Data Cleaning/Master-Data-ManagementString → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith}Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3Set-Similarity JoinSet-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
    • Example: Data Cleaning/Master-Data-ManagementString → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith}Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3Set-Similarity JoinSet-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
    • Example: Data Cleaning/Master-Data-ManagementString → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith}Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3Set-Similarity JoinSet-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
    • Example: Data Cleaning/Master-Data-ManagementString → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith}Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3Set-Similarity JoinSet-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
    • Example: Data Cleaning/Master-Data-ManagementString → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith}Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3Set-Similarity JoinSet-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
    • Example: Data Cleaning/Master-Data-ManagementString → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith}Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3Set-Similarity JoinSet-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
    • Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
    • Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
    • Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
    • Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
    • Problem Statement: Set-Similarity JoinInput Two files of records e.g., R(RID, a, b) and S(RID, c, d) A join column on each file e.g., R.a and S.c A similarity function, sim e.g., Jaccard A similarity threshold, τOutputAll pairs of records from R and S where sim(R.a, S.c) ≥ τ Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 6 / 22
    • Parallelizing Set-Similarity JoinsLarge amounts of data E.g., GeneBank: 100M, Google N-gram: 1T Data or processing does not fit in one machine Use a cluster of machines and a parallel algorithm MapReduce: shared-nothing data-processing platformChallenges Partition problem for parallelism Solve using Map, Sort, and Reduce Compute end-to-end set-similarity joins Deal with out-of-memory situations Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 7 / 22
    • Parallelizing Set-Similarity JoinsLarge amounts of data E.g., GeneBank: 100M, Google N-gram: 1T Data or processing does not fit in one machine Use a cluster of machines and a parallel algorithm MapReduce: shared-nothing data-processing platformChallenges Partition problem for parallelism Solve using Map, Sort, and Reduce Compute end-to-end set-similarity joins Deal with out-of-memory situations Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 7 / 22
    • MapReduce Reviewmap (k1,v1) → list(k2,v2);reduce (k2,list(v2)) → list(k3,v3).combine (k2,list(v2)) → list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 8 / 22
    • MapReduce Reviewmap (k1,v1) → list(k2,v2);reduce (k2,list(v2)) → list(k3,v3).combine (k2,list(v2)) → list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 8 / 22
    • MapReduce Reviewmap (k1,v1) → list(k2,v2);reduce (k2,list(v2)) → list(k3,v3).combine (k2,list(v2)) → list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 8 / 22
    • Parallel Set-Similarity Joins in MapReduceMain idea Hash-partition data across the network based on keys Join values cannot be directly used as keys Generate signatures from join values and use them as keys e.g., “John W Smith” → {John, Smith, W}Three Stages 1 Compute set-element statistics Records → Statistics 2 Generate similar record-ID (“RID”) pairs {Records, Statistics} → RID-pairs 3 Generate pairs of joined records {Records, RID-pairs} → Record-pairs Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 9 / 22
    • Parallel Set-Similarity Joins in MapReduceMain idea Hash-partition data across the network based on keys Join values cannot be directly used as keys Generate signatures from join values and use them as keys e.g., “John W Smith” → {John, Smith, W}Three Stages 1 Compute set-element statistics Records → Statistics 2 Generate similar record-ID (“RID”) pairs {Records, Statistics} → RID-pairs 3 Generate pairs of joined records {Records, RID-pairs} → Record-pairs Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 9 / 22
    • Set-Similarity FilteringE.g., Prefix Filtering Pigeonhole principle Global order for set elements: Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
    • Set-Similarity FilteringE.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
    • Set-Similarity FilteringE.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
    • Set-Similarity FilteringE.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
    • Set-Similarity FilteringE.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
    • Processing Stages and AlternativesStage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO)Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK)Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 11 / 22
    • Processing Stages and AlternativesStage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO)Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK)Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 11 / 22
    • Processing Stages and AlternativesStage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO)Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK)Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 11 / 22
    • Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
    • Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
    • Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
    • Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
    • Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
    • Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
    • Additional Contributions Solve R-S-join case Optimize memory requirements for R-S join Handle insufficient memory Introduce additional filters Use external memory Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 13 / 22
    • Experimental SettingHardware 10-node IBM x3650 cluster Intel Xeon processor E5520 2.26GHz with four cores Four 300GB hard disks 12GB RAMSoftware Ubuntu 9.06, 64-bit, server edition OS Java 1.6, 64-bit, server Hadoop 0.20.1 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 14 / 22
    • Experimental SettingDatasets DBLP Average length: 259 bytes Number of records: 1.2M Total size: 300MB CITESEERX Average length: 1374 bytes Number of records: 1.3M Total size: 1.8GB Increased each up to ×25, preserving join properties DBLP: 31M records, 8.2GB CITESEERX: 32M records, 45GB Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 15 / 22
    • Running Time Self-join DBLP×n n ∈ [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 16 / 22
    • Running Time Self-join DBLP×n n ∈ [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 16 / 22
    • Running Time Self-join DBLP×n n ∈ [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 16 / 22
    • Speedup 5 BTO-BK-BRJ BTO-PK-BRJ Relative running time 4 BTO-PK-OPRJ Ideal Self-join DBLP×10Speedup Different cluster sizes 3 2 1 2 3 4 5 6 7 8 9 10 # Nodes Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 17 / 22
    • Speedup Breakdown Stage 1 Stage 2 Stage 3 5 5 5 BTO BK BRJ 4 OPTO 4 PK 4 OPRJ Ideal Ideal IdealSpeedup Speedup Speedup 3 3 3 2 2 2 1 1 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 # Nodes # Nodes # Nodes Relative running time Self-join DBLP×10 Different cluster sizes Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 18 / 22
    • Scaleup 1000 Running time 800 Self-joining DBLP×nTime (seconds) n ∈ [5, 25] 600 Proportional cluster 400 BTO-BK-BRJ BTO-PK-BRJ 200 BTO-PK-OPRJ Ideal 0 2 3 4 5 6 7 8 9 10 # Nodes and Dataset Size Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 19 / 22
    • Scaleup Breakdown Stage 1 Stage 2 Stage 3 160 600 200 500Time (seconds) Time (seconds) Time (seconds) 120 150 400 80 300 100 BTO 200 BK BRJ 40 OPTO PK 50 OPRJ Ideal 100 Ideal Ideal 0 0 0 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 # Nodes and Dataset Size # Nodes and Dataset Size # Nodes and Dataset Size Running time Self-joining DBLP×n, n ∈ [5, 25] Proportional cluster Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 20 / 22
    • Summary Set-similarity joins in MapReduce Three-stage approach Balance workload and minimize replication End-to-end algorithms Self-join R-S join Memory issues Optimize memory requirements Handle insufficient memory Experiments 40 cores, 40 disks cluster Speedup and scaleup Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 21 / 22
    • More Info A longer version of the paper, the source-code and the datasets: http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/ This work is part of the ASTERIX http://asterix.ics.uci.edu/ and Flamingo http://flamingo.ics.uci.edu/ projects at UC Irvine. Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
    • Stage 1: Basic Token Ordering (BTO) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
    • Stage 1: One Phase Token Ordering (OPTO) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
    • Stage 2: Basic Kernel (BK) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
    • Stage 2: Indexed Kernel (PK) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
    • Stage 3: Basic Record Join (BRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
    • Stage 3: One Phase Record Join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22