Efficient Parallel Set-Similarity Joins Using MapReduce - Poster



In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.

Published in: Technology, Education

Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li
Department of Computer Science, University of California, Irvine
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

Problem Statement
  • Example: Data Cleaning/Master-Data-Management
  • Customer data from two departments:
      Sales:     ID S10, Name "John W Smith", ...
      Returns:   ID R20, Name "Smith John", ...
  • Master customer data across two departments:
      Customers: ID C30, Name "John W Smith", ...

Parallelizing Set-Similarity Joins
  • Large amounts of data. E.g., GeneBank: 100M, Google N-gram: 1T
  • Data or processing does not fit in one machine
  • Use a cluster of machines and a parallel algorithm
  • MapReduce: shared-nothing data-processing platform

Challenges
  • Partition problem for parallelism
  • Solve the problem using Map, Sort, and Reduce
  • Compute end-to-end set-similarity joins
  • Deal with out-of-memory situations

MapReduce Review
  • map (k1, v1) → list(k2, v2)
  • reduce (k2, list(v2)) → list(k3, v3)
  • combine (k2, list(v2)) → list(k2, v2)

Prefix Filtering for Data Partitioning
  • Pigeonhole principle
  • Global order for set elements
  • E.g., sim is overlap size, τ = 4
  • Prefix length is 2

Processing Stages and Alternatives
  • Stage 1: Token Ordering
      Compute the token frequencies and sort
      Two MapReduce phases: sort in MapReduce (BTO)
      One MapReduce phase: sort in memory (OPTO)
  • Stage 2: Kernel (RID-Pair Generation)
      Use prefix-filter to divide, conquer using:
      Nested loops (BK)
      Single-machine set-similarity join algorithm (PK)
  • Stage 3: Record Join
      Generate pairs of similar records
      Two MapReduce phases: reduce-side join (BRJ)
      One MapReduce phase: map-side join (OPRJ)

[Figure: Stage 2: RID-Pair Generation]

Experimental Setting
  • Hardware
      10-node IBM x3650 cluster
      Intel Xeon processor E5520 2.26GHz with four cores
      Four 300GB hard disks
      12GB RAM
  • Datasets
      DBLP: average length: 259 bytes; 1.2M records; 300MB
      CITESEERX: average length: 1374 bytes; 1.3M records; 1.8GB
      Increased each up to ×25, preserving join properties

Experimental Results
[Figure: Speedup vs. # Nodes (2 to 10); series BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
[Figure: Time (seconds) vs. # Nodes and Dataset Size (2 to 10); series BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
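The token-ordering and RID-pair-generation stages named on the poster can be illustrated with a small single-process sketch. This is not the authors' Hadoop code: `run_mapreduce` is a hypothetical in-memory stand-in for map, shuffle/sort, and reduce; the sample records, the whitespace tokenizer, and all function names are assumptions made for illustration. It sorts token frequencies in memory (in the spirit of the one-phase option) and verifies candidates with nested loops. With overlap as the similarity function and threshold τ, a record with n distinct tokens has a prefix of length n − τ + 1, so 5-token records with τ = 4 have a prefix of length 2, matching the poster's example.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for map -> shuffle/sort -> reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # process keys in sorted order, like the shuffle
        output.extend(reducer(k2, groups[k2]))
    return output


# Assumed sample input: (RID, record text) pairs, five tokens each.
RECORDS = [
    ("S10", "john w smith sales dept"),
    ("R20", "smith john w returns dept"),
    ("C30", "mary ann jones sales office"),
]
TAU = 4  # minimum overlap size


# Stage 1 (sketch): count token frequencies, then sort in memory.
def count_map(rid, text):
    for token in set(text.split()):
        yield token, 1


def count_reduce(token, ones):
    yield token, sum(ones)


def token_order(records):
    freqs = run_mapreduce(records, count_map, count_reduce)
    ranked = sorted(freqs, key=lambda tf: (tf[1], tf[0]))  # rarest first
    return {token: rank for rank, (token, _) in enumerate(ranked)}


# Stage 2 (sketch): route each record by its prefix tokens, verify per group.
def make_kernel(order, tau):
    def kernel_map(rid, text):
        tokens = sorted(set(text.split()), key=order.__getitem__)
        prefix_len = max(len(tokens) - tau + 1, 0)
        for token in tokens[:prefix_len]:
            yield token, (rid, frozenset(tokens))

    def kernel_reduce(token, group):
        # Nested-loop verification of every pair that shares this prefix token.
        for (rid1, set1), (rid2, set2) in combinations(group, 2):
            overlap = len(set1 & set2)
            if overlap >= tau:
                yield tuple(sorted((rid1, rid2))), overlap

    return kernel_map, kernel_reduce


def self_join(records, tau):
    order = token_order(records)
    kernel_map, kernel_reduce = make_kernel(order, tau)
    pairs = run_mapreduce(records, kernel_map, kernel_reduce)
    # A pair sharing several prefix tokens is emitted more than once.
    return sorted(set(pairs))


if __name__ == "__main__":
    print(self_join(RECORDS, TAU))  # → [(('R20', 'S10'), 4)]
```

Ordering tokens from rarest to most frequent keeps the prefixes on rare tokens, which tends to make the reduce groups small. The final deduplication here stands in for work that a full pipeline would fold into the record-join stage.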