Efficient Parallel Set-Similarity Joins Using MapReduce - Poster

Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li
Department of Computer Science, University of California, Irvine
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

Problem Statement

Example: Data Cleaning/Master-Data-Management

Customer data from two departments

Sales                        Returns
ID   Name          ...       ID   Name        ...
S10  John W Smith  ...       R20  Smith John  ...
...                          ...

Master customer data across two departments

Customers
ID   Name          ...
C30  John W Smith  ...
...

Parallelizing Set-Similarity Joins

Large amounts of data
  E.g., GeneBank: 100M, Google N-gram: 1T
Data or processing does not fit in one machine
Use a cluster of machines and a parallel algorithm
MapReduce: shared-nothing data-processing platform

Challenges
  Partition problem for parallelism
  Solve the problem using Map, Sort, and Reduce
  Compute end-to-end set-similarity joins
  Deal with out-of-memory situations

MapReduce Review

map (k1,v1) → list(k2,v2);
reduce (k2,list(v2)) → list(k3,v3).
combine (k2,list(v2)) → list(k2,v2).

Prefix Filtering for Data Partitioning

Pigeonhole principle
Global order for set elements:
[Figure: John W Smith]
E.g., sim is overlap size, τ = 4
  Prefix length is 2

Processing Stages and Alternatives

Stage 1: Token Ordering
  Compute the token frequencies and sort
  Two MapReduce phases: sort in MapReduce (BTO)
  One MapReduce phase: sort in memory (OPTO)
Stage 2: Kernel (RID-Pair Generation)
  Use prefix-filter to divide, conquer using:
    Nested loops (BK)
    Single-machine set-similarity join algorithm (PK)
Stage 3: Record Join
  Generate pairs of similar records
  Two MapReduce phases: reduce-side join (BRJ)
  One MapReduce phase: map-side join (OPRJ)

[Figure: Stage 2: RID-Pair Generation]

Experimental Setting

Hardware
  10-node IBM x3650 cluster
  Intel Xeon processor E5520 2.26GHz with four cores
  Four 300GB hard disks
  12GB RAM
Datasets
  DBLP: average length: 259 bytes; 1.2M records; 300MB
  CITESEERX: average length: 1374 bytes; 1.3M records; 1.8GB
  Increased each up to ×25, preserving join properties

Experimental Results

[Figure: Speedup vs. # Nodes (2 to 10); series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
[Figure: Time (seconds) vs. # Nodes and Dataset Size (2 to 10); series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
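
Sketch of the three processing stages

The three processing stages listed on the poster (token ordering, RID-pair generation with prefix filtering, record join) can be illustrated with a minimal single-process sketch. It is not the authors' implementation. It assumes word tokens, overlap size as the similarity function, a self-join over one record collection, and nested-loop verification inside the reducer. All function names, the tokenizer, and the record X1 are hypothetical. A plain dictionary stands in for the MapReduce shuffle, so the sketch shows the data flow only.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Hypothetical tokenizer: lowercase words, duplicates dropped."""
    return set(text.lower().split())


def token_order(records):
    """Stage 1: count token frequencies and rank tokens from rare to frequent."""
    counts = Counter(tok for text in records.values() for tok in tokenize(text))
    ordered = sorted(counts, key=lambda tok: (counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered)}


def kernel_map(rid, text, rank, tau):
    """Stage 2 map: route the record to one group per prefix token."""
    tokens = sorted(tokenize(text), key=rank.__getitem__)
    # For overlap threshold tau, two sets can only match if their prefixes
    # of length |x| - tau + 1 (under the global order) share a token.
    prefix_len = max(len(tokens) - tau + 1, 0)
    for tok in tokens[:prefix_len]:
        yield tok, (rid, tokens)


def kernel_reduce(group, tau):
    """Stage 2 reduce: nested-loop verification within one token group."""
    for (rid_a, toks_a), (rid_b, toks_b) in combinations(sorted(group), 2):
        overlap = len(set(toks_a) & set(toks_b))
        if overlap >= tau:
            yield (rid_a, rid_b), overlap


def similarity_join(records, tau):
    """Run the three stages in memory and return pairs of similar records."""
    rank = token_order(records)
    groups = defaultdict(list)  # stands in for the shuffle/sort step
    for rid, text in records.items():
        for key, value in kernel_map(rid, text, rank, tau):
            groups[key].append(value)
    rid_pairs = {}
    for group in groups.values():
        for pair, overlap in kernel_reduce(group, tau):
            rid_pairs[pair] = overlap  # a pair may surface under several tokens
    # Stage 3: replace the RIDs of each pair by the full records.
    return sorted((records[a], records[b], sim) for (a, b), sim in rid_pairs.items())


# S10 and R20 come from the poster's example; X1 is a made-up record.
records = {"S10": "John W Smith", "R20": "Smith John", "X1": "Mary Jones"}
print(similarity_join(records, tau=2))
```

With this toy input and τ = 2, only the S10/R20 pair shares a prefix token and passes verification. Ranking rare tokens first keeps the groups small, which is the reason for computing a global token order before partitioning.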
