Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li
Department of Computer Science, University of California, Irvine
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

Problem Statement

  Example: Data Cleaning/Master-Data-Management

  Customer data from two departments

              Sales                        Returns
    ID   Name           ...       ID   Name           ...
    S10  John W Smith   ...       R20  Smith John     ...
    ...                           ...  John W Smith

  Master customer data across two departments

            Customers
    ID   Name           ...
    C30  John W Smith   ...
    ...

MapReduce Review

  map     (k1,v1)       → list(k2,v2);
  reduce  (k2,list(v2)) → list(k3,v3).
  combine (k2,list(v2)) → list(k2,v2).

Prefix Filtering for Data Partitioning

  Pigeonhole principle
  Global order for set elements
  E.g., sim is overlap size, τ = 4
  Prefix length is 2

[Figure: Stage 2: RID-Pair Generation]

Experimental Setting

  Hardware
    10-node IBM x3650 cluster
    Intel Xeon processor E5520 2.26GHz with four cores
    Four 300GB hard disks
    12GB RAM
  Datasets
    DBLP: average length: 259 bytes; 1.2M records; 300MB
    CITESEERX: average length: 1374 bytes; 1.3M records; 1.8GB
    Increased each up to ×25, preserving join properties
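
The three signatures in the MapReduce Review panel can be made concrete with a small single-process simulation. This is only a sketch under stated assumptions: `run_mapreduce`, `count_map`, `count_combine`, and `count_reduce` are hypothetical names, the token-counting job is an illustrative choice, and nothing here is the API of a real MapReduce framework.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, combine_fn=None, num_splits=2):
    """Simulate map -> (optional) combine -> shuffle -> reduce in one process."""
    inputs = list(inputs)
    # Pretend each split is processed by a different mapper.
    splits = [inputs[i::num_splits] for i in range(num_splits)]
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for k1, v1 in split:
            for k2, v2 in map_fn(k1, v1):          # map (k1,v1) -> list(k2,v2)
                local[k2].append(v2)
        for k2, values in local.items():
            if combine_fn is not None:             # combine (k2,list(v2)) -> list(k2,v2)
                pairs = combine_fn(k2, values)
            else:
                pairs = [(k2, v) for v in values]
            for ck, cv in pairs:
                shuffled[ck].append(cv)
    output = []
    for k2 in sorted(shuffled):                    # keys reach the reducers sorted
        output.extend(reduce_fn(k2, shuffled[k2])) # reduce (k2,list(v2)) -> list(k3,v3)
    return output

def count_map(rid, text):
    return [(token, 1) for token in text.split()]

def count_combine(token, counts):
    return [(token, sum(counts))]

def count_reduce(token, counts):
    return [(token, sum(counts))]

records = [(1, "a b b"), (2, "b c"), (3, "a c c")]
result = run_mapreduce(records, count_map, count_reduce, count_combine)
print(result)
```

The combiner only pre-aggregates values on the map side, so the job returns the same result with or without it.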


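The prefix-filtering example (overlap similarity, τ = 4, prefix length 2) can be illustrated with a short sketch. It assumes the standard prefix-filter bound for overlap similarity, which the poster text does not spell out: if two sets sorted by the same global order share at least τ elements, their prefixes of length |x| − τ + 1 share at least one element. The tokens, the global order, and the helper names `prefix`, `overlap`, and `is_candidate` are hypothetical.

```python
# Hypothetical global order on tokens; position in the list is the rank.
GLOBAL_ORDER = ["G", "F", "E", "D", "C", "B", "A"]
RANK = {tok: i for i, tok in enumerate(GLOBAL_ORDER)}

def prefix(tokens, tau):
    """First |x| - tau + 1 tokens of the set, sorted by the global order."""
    ordered = sorted(set(tokens), key=RANK.__getitem__)
    return ordered[: max(len(ordered) - tau + 1, 0)]

def overlap(x, y):
    """Overlap similarity: number of shared tokens."""
    return len(set(x) & set(y))

def is_candidate(x, y, tau):
    """Pigeonhole test: a pair can reach overlap >= tau only if prefixes intersect."""
    return bool(set(prefix(x, tau)) & set(prefix(y, tau)))

TAU = 4
x = ["A", "B", "C", "D", "E"]   # prefix: E, D
y = ["B", "C", "D", "E", "F"]   # prefix: F, E
z = ["A", "B", "C", "G", "F"]   # prefix: G, F

for name, (p, q) in {"x,y": (x, y), "x,z": (x, z), "y,z": (y, z)}.items():
    print(name, "candidate:", is_candidate(p, q, TAU), "overlap:", overlap(p, q))
```

With five-element sets and τ = 4 the prefix length is 5 − 4 + 1 = 2, matching the poster's example. The pair (y, z) passes the filter but has overlap 3, so candidates still need to be verified; the filter never discards a pair that meets the threshold.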
Parallelizing Set-Similarity Joins

  Large amounts of data
    E.g., GeneBank: 100M, Google N-gram: 1T
    Data or processing does not fit in one machine
    Use a cluster of machines and a parallel algorithm
    MapReduce: shared-nothing data-processing platform
  Challenges
    Partition problem for parallelism
    Solve the problem using Map, Sort, and Reduce
    Compute end-to-end set-similarity joins
    Deal with out-of-memory situations

Processing Stages and Alternatives

  Stage 1: Token Ordering
    Compute the token frequencies and sort
      Two MapReduce phases: sort in MapReduce (BTO)
      One MapReduce phase: sort in memory (OPTO)
  Stage 2: Kernel (RID-Pair Generation)
    Use prefix-filter to divide, conquer using:
      Nested loops (BK)
      Single-machine set-similarity join algorithm (PK)
  Stage 3: Record Join
    Generate pairs of similar records
      Two MapReduce phases: reduce-side join (BRJ)
      One MapReduce phase: map-side join (OPRJ)

Experimental Results

  [Figure: Speedup vs. # Nodes (2 to 10); series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
  [Figure: Time (seconds) vs. # Nodes and Dataset Size (2 to 10); series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
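
Stage 1 (token ordering) and Stage 2 (kernel) can be sketched together in a single process. This is a simplified illustration, not the authors' implementation: it assumes overlap similarity with threshold `tau`, orders tokens by increasing frequency in memory, routes each record to one group per prefix token, and verifies pairs with nested loops inside each group, in the spirit of the BK alternative. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def token_order(records):
    """Stage 1: rank tokens by increasing frequency (ties broken alphabetically)."""
    freq = Counter(tok for _, tokens in records for tok in set(tokens))
    ordered = sorted(freq, key=lambda t: (freq[t], t))
    return {tok: rank for rank, tok in enumerate(ordered)}

def kernel_map(record, rank, tau):
    """Stage 2 map: emit (prefix token, (rid, token set)) for every prefix token."""
    rid, tokens = record
    ordered = sorted(set(tokens), key=rank.__getitem__)
    for tok in ordered[: max(len(ordered) - tau + 1, 0)]:
        yield tok, (rid, frozenset(tokens))

def kernel_reduce(group, tau):
    """Stage 2 reduce: nested loops over the records that share one prefix token."""
    by_rid = sorted(group, key=lambda r: r[0])
    for (rid1, s1), (rid2, s2) in combinations(by_rid, 2):
        if len(s1 & s2) >= tau:
            yield (rid1, rid2)

def rid_pairs(records, tau):
    rank = token_order(records)
    groups = defaultdict(list)            # stands in for the shuffle
    for record in records:
        for tok, value in kernel_map(record, rank, tau):
            groups[tok].append(value)
    pairs = set()                         # a pair may surface in several groups
    for group in groups.values():
        pairs.update(kernel_reduce(group, tau))
    return sorted(pairs)

records = [
    (1, ["A", "B", "C", "D", "E"]),
    (2, ["B", "C", "D", "E", "F"]),
    (3, ["A", "B", "C", "G", "F"]),
]
print(rid_pairs(records, tau=4))
```

Each group depends only on the records routed to its prefix token, so groups can be processed by different reducers independently. The output contains record IDs only; the full records are attached later, in Stage 3.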

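Stage 3 turns RID pairs back into pairs of full records. The sketch below shows one plausible reading of a two-phase reduce-side join, written in memory: the first phase groups by RID and attaches each record to the pairs that mention it, and the second phase groups by RID pair to bring the two halves together. The function names and the third record are hypothetical, and the poster does not confirm that BRJ works exactly this way.

```python
from collections import defaultdict

def phase1(records, rid_pairs):
    """Group by RID: emit one half-filled pair per (pair, record) combination."""
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, record in records:
        by_rid[rid]["record"] = record
    for pair in rid_pairs:
        for rid in pair:
            by_rid[rid]["pairs"].append(pair)
    for rid, entry in by_rid.items():
        for pair in entry["pairs"]:
            yield pair, (rid, entry["record"])

def phase2(half_pairs):
    """Group by RID pair: combine the two halves into a pair of records."""
    by_pair = defaultdict(dict)
    for pair, (rid, record) in half_pairs:
        by_pair[pair][rid] = record
    return [(by_pair[p][p[0]], by_pair[p][p[1]]) for p in sorted(by_pair)]

def record_join(records, rid_pairs):
    return phase2(phase1(records, rid_pairs))

records = [
    ("S10", "John W Smith"),
    ("R20", "Smith John"),
    ("S11", "Jane Doe"),      # hypothetical record that matches nothing
]
similar = [("S10", "R20")]    # assumed output of the kernel stage
print(record_join(records, similar))
```

Records that appear in no RID pair produce no output in the first phase, so only the matching records reach the second phase.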