Efficient Parallel Set-Similarity
    Joins Using MapReduce




                 Tilani Gunawardena
Content
• Introduction
• Preliminaries
•   Self-Join case
•   R-S Join case
•   Handling insufficient memory
•   Experimental evaluation
•   Conclusions
Introduction

• Vast amount of data:
  – Google N-gram database : ~1 trillion records
  – GeneBank : 100 million records, size=416GB
  – Facebook : 400 million active users


• Detecting similar pairs of records becomes a
  challenging problem
Examples
•   Detecting near-duplicate web pages in web crawling
•   Document clustering
•   Plagiarism detection
•   Master data management
    – “John W. Smith” , “Smith, John” , “John William Smith”
• Making recommendations to users based on
  their similarity to other users in query refinement
• Mining in social networking sites
    – Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
Preliminaries
• Problem Statement: Given two collections of
  objects/items/records, a similarity metric
  sim(o1,o2) and a threshold λ , find the pairs of
  objects/items/records satisfying sim(o1,o2)≥ λ
Set-similarity functions
• Jaccard or Tanimoto coefficient
   – Jaccard(x, y) = |x ∩ y| / |x ∪ y|


• “I will call back” =[I, will, call, back]
• “I will call you soon”=[I, will, call, you, soon]

• Jaccard similarity=3/6=0.5
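
The Jaccard example above can be reproduced with a few lines of Python. This is a minimal sketch: the whitespace tokenizer and the function names `tokenize` and `jaccard` are assumptions made for illustration, not part of the original slides.

```python
def tokenize(text):
    """Split a string into a set of word tokens (whitespace tokenizer, an assumption)."""
    return set(text.split())


def jaccard(x, y):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| of two token sets."""
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)


a = tokenize("I will call back")
b = tokenize("I will call you soon")
# 3 shared tokens (I, will, call) out of 6 distinct tokens
print(jaccard(a, b))
```

Representing each record as a set makes the intersection and union direct set operations.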
Set-similarity with MapReduce
• Why Hadoop ?
   – Large amounts of data, shared-nothing architecture




• map (k1,v1) -> list(k2,v2);
• reduce (k2,list(v2)) -> list(k3,v3)
• Problem :
   – Too much data to transfer
   – Too many pairs to verify (two similar sets share at least
     1 token)
Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on
  effective filters

• string s = “I will call back”
• global token ordering {back, call, will, I}
• prefix of length 2 of s = [back, call]

• prefix filtering principle states that similar strings
  need to share at least one common token in their
  prefixes.
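
Extracting a prefix under a global token ordering can be sketched as follows. The helper name `prefix` and its signature are hypothetical; the ordering and the string are the ones from the slide.

```python
def prefix(tokens, global_order, length):
    """Return the first `length` tokens of a record after sorting by the global order."""
    rank = {tok: i for i, tok in enumerate(global_order)}
    return sorted(tokens, key=rank.__getitem__)[:length]


global_order = ["back", "call", "will", "I"]
s = "I will call back".split()
print(prefix(s, global_order, 2))
```

Two records are compared further only if their prefixes share at least one token, which is what lets the filter prune most pairs cheaply.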
Prefix filtering: example


   [Figure: Record 1 and Record 2]


• Each set has 5 tokens
• “Similar”: they share at least 4 tokens
• Prefix length: 2
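
The prefix length in this example follows from the standard prefix-filtering rule for an overlap threshold: a set of n tokens that must share at least α tokens with its partner needs a prefix of length n − α + 1 (here 5 − 4 + 1 = 2). A minimal sketch, assuming alphabetical order stands in for the global token ordering and using made-up records:

```python
def prefix_length(set_size, min_overlap):
    """Prefix length needed so that qualifying pairs share a prefix token."""
    return set_size - min_overlap + 1


def may_be_similar(x, y, min_overlap):
    """Prefix filter: True if the two prefixes share at least one token.

    Alphabetical order is used as the global ordering (an assumption).
    """
    px = sorted(x)[:prefix_length(len(x), min_overlap)]
    py = sorted(y)[:prefix_length(len(y), min_overlap)]
    return bool(set(px) & set(py))


record1 = {"A", "B", "C", "D", "E"}
record2 = {"B", "C", "D", "E", "F"}   # shares 4 tokens with record1
record3 = {"C", "D", "E", "F", "G"}   # shares only 3 tokens with record1

print(prefix_length(5, 4))
print(may_be_similar(record1, record2, 4))
print(may_be_similar(record1, record3, 4))
```

The filter never discards a pair that meets the overlap threshold; pairs that pass it still have to be verified.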
Parallel Set-Similarity Joins
•   Stage I: Token Ordering
     – Compute data statistics for good signatures
•   Stage II: RID-Pair Generation
•   Stage III: Record Join
     – Generate actual pairs of joined records
Input Data
• RID = Row ID
• a : join column
• “A B C” is a string:
   • Address: “14th Saarbruecker Strasse”
   • Name: “John W. Smith”
Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One Phase Token Ordering (OPTO)
Token Ordering

• Creates a global ordering of the tokens in the
  join column, based on their frequency
   RID   a            b   c
   1     A B D A A    …   …
   2     B B D A E    …   …

   Global ordering (based on frequency):
   Token       E   D   B   A
   Frequency   1   2   3   4
Basic Token Ordering (BTO)

• 2 MapReduce cycles:
  – 1st : compute token frequencies
  – 2nd: sort the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
map:
  • tokenize the join value of each record
  • emit each token with no. of occurrences 1
reduce:
  • for each token, compute total count (frequency)
Basic Token Ordering – 2nd MapReduce cycle




map:
  • interchange key with value
reduce (use only 1 reducer):
  • emits the value
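
The two BTO cycles can be simulated in memory to see how the ordering falls out. This is a sketch, not Hadoop code: `run_cycle` is a hypothetical stand-in for the framework (map, group by key, deliver keys in sorted order, reduce), and the two records are assumed to tokenize to "A B D A A" and "B B D A E".

```python
from collections import defaultdict


def run_cycle(inputs, mapper, reducer):
    """Simulate one MapReduce cycle: map, group by key, reduce in sorted key order."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Cycle 1: compute token frequencies.
def map_count(record):
    rid, join_value = record
    for token in join_value.split():
        yield token, 1


def reduce_count(token, ones):
    yield token, sum(ones)


# Cycle 2: swap key and value so that the sort on keys orders by frequency.
def map_swap(pair):
    token, freq = pair
    yield freq, token


def reduce_emit(freq, tokens):
    for token in sorted(tokens):  # alphabetical tie-break is an assumption
        yield token


records = [(1, "A B D A A"), (2, "B B D A E")]
frequencies = run_cycle(records, map_count, reduce_count)
global_order = run_cycle(frequencies, map_swap, reduce_emit)
print(global_order)
```

Using a single reducer in the second cycle is what yields one totally ordered list rather than one sorted list per reducer.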
One Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
  – Uses only one MapReduce Cycle (less I/O)
  – In-memory token sorting, instead of using a
    reducer
OPTO – Details
map:
  • tokenize the join value of each record
  • emit each token with no. of occurrences 1
reduce:
  • for each token, compute total count (frequency)
  • use the tear_down method to order the tokens in memory
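
A rough sketch of the OPTO idea: the reducer keeps the counts it computes and sorts them once, in memory, when it is torn down. The class `OptoReducer`, its method names, and the driver loop are hypothetical; the sketch also assumes a single reducer instance sees every token.

```python
from collections import defaultdict


def map_tokens(record):
    """map: tokenize the join value and emit (token, 1) per occurrence."""
    rid, join_value = record
    for token in join_value.split():
        yield token, 1


class OptoReducer:
    """Hypothetical reducer: counts in reduce(), sorts in tear_down()."""

    def __init__(self):
        self.counts = {}

    def reduce(self, token, ones):
        # Keep the total in memory instead of writing it out for a second cycle.
        self.counts[token] = sum(ones)

    def tear_down(self):
        # Sort by (frequency, token); the alphabetical tie-break is an assumption.
        return sorted(self.counts, key=lambda t: (self.counts[t], t))


records = [(1, "A B D A A"), (2, "B B D A E")]

# Stand-in for the shuffle: group the mapper output by token.
grouped = defaultdict(list)
for record in records:
    for token, one in map_tokens(record):
        grouped[token].append(one)

reducer = OptoReducer()
for token, ones in grouped.items():
    reducer.reduce(token, ones)
global_order = reducer.tear_down()
print(global_order)
```

The trade-off is that the full token list must fit in the reducer's memory, in exchange for skipping the second cycle's I/O.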
Stage II: RID-Pair Generation

• Basic Kernel (BK)
• Indexed Kernel (PK)
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records
  satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the previous
  stage
RID-Pair Generation: Map Phase

• scan input records and for each record:
   – project it on RID & join attribute
   – tokenize it
   – extract prefix according to global ordering of tokens obtained in the Token
     Ordering stage
   – route the projected record to the appropriate reducers, keyed by its prefix tokens
Grouping/Routing Strategies

• Goal: distribute candidates to the right
  reducers to minimize reducers’ workload
• Like hashing (projected) records to the
  corresponding candidate-buckets
• Each reducer handles one or more candidate-buckets
• 2 routing strategies:
   – Using Individual Tokens
   – Using Grouped Tokens
Routing: using individual tokens

• Treat each token as a key
• For each record, generates a (key, value) pair for each
  of its prefix tokens
• Example, given the global ordering:

    Token       A    B    E    D    G    C    F
    Frequency   10   10   22   23   23   40   48

  “A B C”
    => prefix of length 2: A, B
    => generate/emit 2 (key, value) pairs:
         • (A, (1, A B C))
         • (B, (1, A B C))
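
The map side of routing by individual tokens can be sketched as below. The function name `map_individual` and its parameters are hypothetical; the record, the global ordering, and the prefix length are taken from the example.

```python
def map_individual(record, global_order, prefix_len):
    """Emit one (token, (RID, join value)) pair per prefix token of the record."""
    rid, join_value = record
    rank = {tok: i for i, tok in enumerate(global_order)}
    tokens = sorted(set(join_value.split()), key=rank.__getitem__)
    for token in tokens[:prefix_len]:
        yield token, (rid, join_value)


global_order = ["A", "B", "E", "D", "G", "C", "F"]
pairs = list(map_individual((1, "A B C"), global_order, 2))
for pair in pairs:
    print(pair)
```

Because the key is the token itself, two records meet in a reducer only when they share a prefix token.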
Grouping/Routing: using individual tokens

• Advantage:
  – high quality of grouping of candidates (pairs of
    records that have no chance of being similar are
    never routed to the same reducer)
• Disadvantage:
  – high replication of data (same records might be
    checked for similarity in multiple reducers, i.e.
    redundant work)
Routing: Using Grouped Tokens
• Multiple tokens mapped to one synthetic key
  (different tokens can be mapped to the same key)
• For each record, generates a (key, value) pair for each
  of the groups of its prefix tokens
• Example, given the global ordering:

    Token       A    B    E    D    G    C    F
    Frequency   10   10   22   23   23   40   48

  “A B C” => prefix of length 2: A, B
    Suppose A, B belong to group X and
            C belongs to group Y
    => both prefix tokens map to group X, so generate/emit 1 (key, value) pair:
         • (X, (1, A B C))
    => (Y, (1, A B C)) would be emitted only if C were part of the prefix
Grouping/Routing: Using Grouped Tokens

• The groups of tokens (X, Y) are formed by assigning
  tokens to groups in a round-robin manner

    Token       A    B    E    D    G    C    F
    Frequency   10   10   22   23   23   40   48

    Group1: A D F
    Group2: B G
    Group3: E C
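
Round-robin assignment over the global ordering is a one-liner. A minimal sketch, with the helper name `round_robin_groups` chosen for illustration and groups numbered from 1 as on the slide:

```python
def round_robin_groups(global_order, num_groups):
    """Map each token to a group number by cycling through the groups in order."""
    return {tok: i % num_groups + 1 for i, tok in enumerate(global_order)}


global_order = ["A", "B", "E", "D", "G", "C", "F"]
token_to_group = round_robin_groups(global_order, 3)

groups = {}
for tok in global_order:
    groups.setdefault(token_to_group[tok], []).append(tok)
print(groups)
```

Since the tokens are visited in frequency order, each group receives a mix of rare and frequent tokens.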
Grouping/Routing: Using Grouped Tokens
• Advantage:
  – less replication of record projections

• Disadvantage:
  – Quality of grouping is not so high (records having no
    chance of being similar are sent to the same reducer
    which checks their similarity)

  – “ABCD” (A, B belong to Group X; C belongs to Group Y)
     • output: (X,_) & (Y,_)
  – “EFG” (E belongs to Group Y)
     • output: (Y,_)
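
The trade-off in this example can be traced with a short sketch. The RIDs 1 and 2, the prefixes (A, B, C for the first record and E for the second), and the function name `map_grouped` are assumptions made for illustration; the token-to-group mapping is the one from the example.

```python
from collections import defaultdict


def map_grouped(record, prefix_tokens, token_to_group):
    """Emit one (group, record) pair per distinct group among the prefix tokens."""
    seen = []
    for token in prefix_tokens:
        group = token_to_group[token]
        if group not in seen:  # several prefix tokens in one group give one pair
            seen.append(group)
    for group in seen:
        yield group, record


token_to_group = {"A": "X", "B": "X", "C": "Y", "E": "Y"}
out1 = list(map_grouped((1, "A B C D"), ["A", "B", "C"], token_to_group))
out2 = list(map_grouped((2, "E F G"), ["E"], token_to_group))

buckets = defaultdict(list)
for group, record in out1 + out2:
    buckets[group].append(record)
print(dict(buckets))
```

Record 1 is replicated twice instead of three times, but bucket Y now holds two records that share no token, so the reducer for Y checks a pair that cannot be similar.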
RID-Pair Generation: Reduce Phase

  • This is the core of the entire method
  • Each reducer processes one/more buckets
  • In each bucket, the reducer looks for pairs of join attribute values
    satisfying the join predicate
  • If the similarity of the 2 candidates >= threshold, output their IDs
    and also their similarity
RID-Pair Generation: Reduce Phase

• Computing similarity of the candidates in a
  bucket comes in 2 flavors:

     • Basic Kernel : uses 2 nested loops to verify each pair of
       candidates in the bucket



     • Indexed Kernel : uses a PPJoin+ index
RID-Pair Generation: Basic Kernel

• Straightforward method for finding candidates satisfying
  the join predicate
• Quadratic complexity: O(#candidates²)
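
A minimal sketch of the Basic Kernel for one bucket, assuming Jaccard as the similarity function and whitespace tokenization; the bucket contents and the threshold of 0.5 are made up for illustration.

```python
def jaccard(x, y):
    """Jaccard coefficient of two non-empty token sets."""
    return len(x & y) / len(x | y)


def basic_kernel(bucket, threshold):
    """Verify every pair of candidates in a bucket with two nested loops."""
    results = []
    for i in range(len(bucket)):
        rid1, value1 = bucket[i]
        for j in range(i + 1, len(bucket)):
            rid2, value2 = bucket[j]
            sim = jaccard(set(value1.split()), set(value2.split()))
            if sim >= threshold:
                results.append((rid1, rid2, sim))
    return results


bucket = [(1, "A B C"), (2, "A B D"), (3, "A E F")]
print(basic_kernel(bucket, 0.5))
```

With n candidates in a bucket the loops perform n(n-1)/2 verifications, which is where the quadratic cost comes from.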
RID-Pair Generation: PPJoin+ Indexed Kernel
•   Uses a special index data structure
•   Not so straightforward to implement
•   map(): same as in the BK algorithm
•   Much more efficient
Stage III: Record Join
• Until now we have only pairs of RIDs, but we need actual
  records
• Use the RID pairs generated in the previous stage to join
  the actual records
• Main idea:
   – bring in the rest of each record (everything except the RID,
     which we already have)
• 2 approaches:
   – Basic Record Join (BRJ)
   – One-Phase Record Join (OPRJ)
Record Join: Basic Record Join

• Uses 2 MapReduce cycles
   – 1st cycle: fills in the record information for each half of each pair
   – 2nd cycle: brings together the previously filled in records
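
The two BRJ cycles can be imitated in memory as follows. This is a sketch under assumptions: the grouping keys (RID in the first cycle, RID pair in the second), the function name `basic_record_join`, and the sample data are illustrative and not taken from the slides.

```python
from collections import defaultdict


def basic_record_join(records, rid_pairs):
    """Join full records onto RID pairs in two grouping passes."""
    # Cycle 1: group records and RID pairs by RID, then attach the record
    # to every pair that mentions that RID (one "half" of the pair each).
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, record in records:
        by_rid[rid]["record"] = record
    for rid1, rid2, sim in rid_pairs:
        by_rid[rid1]["pairs"].append((rid1, rid2, sim))
        by_rid[rid2]["pairs"].append((rid1, rid2, sim))
    halves = []
    for rid, group in by_rid.items():
        for pair in group["pairs"]:
            halves.append((pair, (rid, group["record"])))

    # Cycle 2: group the halves by RID pair and put the two records together.
    by_pair = defaultdict(list)
    for pair, half in halves:
        by_pair[pair].append(half)
    joined = []
    for (rid1, rid2, sim), found in sorted(by_pair.items()):
        lookup = dict(found)
        joined.append((lookup[rid1], lookup[rid2], sim))
    return joined


records = [(1, "A B C"), (2, "A B D"), (3, "A E F")]
rid_pairs = [(1, 2, 0.5)]
print(basic_record_join(records, rid_pairs))
```

Record 3 appears in no RID pair, so it produces no half-filled pair and drops out after the first pass.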
Record Join: One Phase Record Join

• Uses only one MapReduce cycle
R-S Join

• Challenge: We now have 2 different record sources => 2
  different input streams

• MapReduce can work on only 1 input stream

• 2nd and 3rd stage affected

• Solution: extend the (key, value) pairs so that they include a
  relation tag for each record
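
A sketch of how a relation tag lets one stream carry both inputs. The tags "R" and "S", the function names, the alphabetical global ordering, and the sample records are assumptions for illustration; similarity verification is left out so that only candidate generation is shown.

```python
from collections import defaultdict


def map_tagged(relation_tag, record, global_order, prefix_len):
    """Emit (prefix token, (tag, RID, join value)) so reducers know the source."""
    rid, join_value = record
    rank = {tok: i for i, tok in enumerate(global_order)}
    tokens = sorted(set(join_value.split()), key=rank.__getitem__)
    for token in tokens[:prefix_len]:
        yield token, (relation_tag, rid, join_value)


def reduce_rs(values):
    """Pair only R-side records with S-side records inside one bucket."""
    r_side = [v for v in values if v[0] == "R"]
    s_side = [v for v in values if v[0] == "S"]
    for _, r_rid, _ in r_side:
        for _, s_rid, _ in s_side:
            yield r_rid, s_rid


global_order = ["A", "B", "C", "D", "E", "F", "G"]
r_records = [(1, "A B C"), (2, "A B E")]
s_records = [(10, "A B D"), (11, "E F G")]

buckets = defaultdict(list)
for tag, recs in (("R", r_records), ("S", s_records)):
    for rec in recs:
        for key, value in map_tagged(tag, rec, global_order, 2):
            buckets[key].append(value)

candidates = set()
for values in buckets.values():
    candidates.update(reduce_rs(values))
print(sorted(candidates))
```

Records 1 and 2 land in the same buckets but are never paired with each other, because both carry the R tag.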
Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing
Evaluation

• Cluster: 10-node IBM x3650, running Hadoop
• Data sets:
       • DBLP: 1.2M publications
       • CITESEERX: 1.3M publications
       • Consider only the header of each paper (i.e., author, title, date of
          publication, etc.)
       • Data size synthetically increased (by various factors)
• Measure:
       • Absolute running time
       • Speedup
       • Scaleup
Self-Join running time

• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: the
  RID-pair generation
Self-Join Speedup

• Fixed data size, vary the
  cluster size
• Best time: BTO-PK-OPRJ
Self-Join Scaleup

• Increase data size and
  cluster size together by the
  same factor
• Best time: BTO-PK-OPRJ
Self-Join Summary
• Stage I: BTO was the best choice.
• Stage II: PK was the best choice.
• Stage III: the best choice depends on the amount
  of data and the size of the cluster
  – OPRJ was somewhat faster, but the cost of loading the
    similar-RID pairs in memory was constant as the
    cluster size increased, and the cost increased as the
    data size increased. For these reasons, we recommend
    BRJ as a good alternative
• Best scaleup was achieved by BTO-PK-BRJ
R-S Join Performance
Speed Up
• Stage I: R-S join performance was identical to
  the first stage in the self-join case
• Stage II: a similar (almost perfect) speedup was
  observed as in the self-join case.
• Stage III: the OPRJ approach was initially the
  fastest (for the 2- and 4-node cases), but it
  eventually became slower than the BRJ
  approach.
Conclusions

• For both self-join and R-S join cases, we recommend BTO-
  PK-BRJ as a robust and scalable method.

• Useful in many data cleaning scenarios

• SSJoin and MapReduce: one solution for huge datasets

• Very efficient when based on prefix-filtering and PPJoin+

• Scales up nicely
Thank You!

Efficient Parallel Set-Similarity Joins Using MapReduce

  • 1. Efficient Parallel Set-Similarity Joins Using MapReduce Tilani Gunawardena
  • 2. Content • Introduction • Preliminaries • Self-Join case • R-S Join case • Handling insufficient memory • Experimental evaluation • Conclusions
  • 3. Introduction • Vast amount of data: – Google N-gram database : ~1 trillion records – GeneBank : 100 million records, size=416GB – Facebook : 400 million active users • Detecting similar pairs of records becomes a challanging proble
  • 4. Examples • Detecting near duplicate web-pages in web crawlin • Document clustering • Plagiarism detection • Master data management – “John W. Smith” , “Smith, John” , “John William Smith” • Making recommendations to users based on their similarity to other users in query refinement • Mining in social networking sites – User [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] has similar interest • Identifying coalitions of click fraudsters in online advertising
  • 5. Preliminaries • Problem Statement: Given two collections of objects/items/records, a similarity metric sim(o1,o2) and a threshold λ , find the pairs of objects/items/records satisfying sim(o1,o2)≥ λ
  • 6. Set -similarity functions • Jaccard or Tanimoto coefficient – Jaccard(x, y) =|x ∩y| / |x U y| • “I will call back” =[I, will, call, back] • “I will call you soon”=[I, will, call, you, soon] • Jaccard similarity=3/6=0.5
  • 7. Set-similarity with MapReduce • Why Hadoop ? – Large amount data,shared nothign architecture • map (k1,v1) -> list(k2,v2); • reduce (k2,list(v2)) -> list(k3,v3) • Problem : – Too much data to transfer – Too many pairs to verify(Two similar sets share at least 1 token)
  • 8. Set-Similarity Filtering • Efficient set-similarity join algorithms rely on effective filters • string s =“I will call back” • global token ordering {back,call, will, I} • prefix of length 2 of s= [back, call] • prefix filtering principle states that similar strings need to share at least one common token in their prefixes.
  • 9. Prefix filtering: example Record 1 Record 2 • Each set has 5 tokens • “Similar”: they share at least 4 tokens • Prefix length: 2 9
  • 10. Parallel Set-Similarity Joins • Stage I: Token Ordering – Compute data statistics for good signatures • Stage II -RID-Pair Generation • Stage III: Record Join – Generate actual pairs of joined records
  • 11. Input Data • RID = Row ID • a : join column • “A B C” is a string: • Address: “14th Saarbruecker Strasse” • Name: “John W. Smith”
  • 12. Stage I: Token Ordering • Basic Token Ordering(BTO) • One Phase Token Ordering (OPTO)
  • 13. Token Ordering • Creates a global ordering of the tokens in the join column, based on their frequency RID a b c 1 A B D AA … … 2 BBDAE … … Global Ordering: E D B A (based on frequency) 1 2 3 4
  • 14. Basic Token Ordering(BTO) • 2 MapReduce cycles: – 1st : compute token frequencies – 2nd: sort the tokens by their frequencies
  • 15. Basic Token Ordering – 1st MapReduce cycle , , map: reduce: • tokenize the join • for each token, compute total value of each record count (frequency) • emit each token with no. of occurrences 1
  • 16. Basic Token Ordering – 2nd MapReduce cycle map: reduce(use only 1 reducer): • interchange key • emits the value with value
  • 17. One Phase Tokens Ordering (OPTO) • alternative to Basic Token Ordering (BTO): – Uses only one MapReduce Cycle (less I/O) – In-memory token sorting, instead of using a reducer
  • 18. OPTO – Details , , Use tear_down method to order the tokens in memory map: reduce: • tokenize the join • for each token, compute value of each record total count (frequency) • emit each token with no. of occurrences 1
  • 19. Stage II: RID-Pair Generation  Basic Kernel(BK)  Indexed Kernel(PK)
  • 20. RID-Pair Generation • scans the original input data(records) • outputs the pairs of RIDs corresponding to records satisfying the join predicate(sim) • consists of only one MapReduce cycle Global ordering of tokens obtained in the previous stage
  • 21. RID-Pair Generation: Map Phase • scan input records and for each record: – project it on RID & join attribute – tokenize it – extract prefix according to global ordering of tokens obtained in the Token Ordering stage – route tokens to appropriate reducer
  • 22. Grouping/Routing Strategies • Goal: distribute candidates to the right reducers to minimize reducers’ workload • Like hashing (projected) records to the corresponding candidate-buckets • Each reducer handles one or more candidate-buckets • 2 routing strategies: – Using Individual Tokens – Using Grouped Tokens
  • 23. Routing: using individual tokens • Treat each token as a key • For each record, generate a (key, value) pair for each of its prefix tokens • Example: – Given the global ordering (token: frequency): A: 10, B: 10, E: 22, D: 23, G: 23, C: 40, F: 48 – “A B C” => prefix of length 2: A, B => generate/emit 2 (key, value) pairs: • (A, (1, A B C)) • (B, (1, A B C))
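The individual-token routing example above can be reproduced with a short sketch. The names `GLOBAL_ORDER`, `RANK` and `route_individual` are hypothetical, and the prefix length is passed in directly rather than derived from a threshold.

```python
# Global ordering from the slide (increasing frequency).
GLOBAL_ORDER = ["A", "B", "E", "D", "G", "C", "F"]
RANK = {token: position for position, token in enumerate(GLOBAL_ORDER)}

def route_individual(rid, join_value, prefix_len):
    """Emit one (key, value) pair per prefix token; the token itself is the key."""
    tokens = sorted(join_value.split(), key=RANK.__getitem__)
    return [(token, (rid, join_value)) for token in tokens[:prefix_len]]

pairs = route_individual(1, "A B C", prefix_len=2)
```

`pairs` contains (A, (1, "A B C")) and (B, (1, "A B C")), so the record is replicated once per prefix token.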
  • 24. Grouping/Routing: using individual tokens • Advantage: – high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer) • Disadvantage: – high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)
  • 25. Routing: Using Grouped Tokens • Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key) • For each record, generate a (key, value) pair for each group of its prefix tokens • Example: – Given the global ordering (token: frequency): A: 10, B: 10, E: 22, D: 23, G: 23, C: 40, F: 48 – “A B C” => prefix of length 2: A, B – Suppose A, B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs: • (X, (1, A B C)) • (Y, (1, A B C))
  • 26. Grouping/Routing: Using Grouped Tokens • The groups of tokens (X, Y) are formed by assigning tokens to groups in a Round-Robin manner • Global ordering (token: frequency): A: 10, B: 10, E: 22, D: 23, G: 23, C: 40, F: 48 • Group1: A, D, F • Group2: B, G • Group3: E, C
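The round-robin assignment can be sketched as below, walking the global ordering and advancing the group index with i = (i + 1) mod n. The function names and the zero-based group ids are assumptions of this sketch; `route_grouped` shows one way to emit a single pair per distinct group among a record's prefix tokens.

```python
def round_robin_groups(ordered_tokens, n_groups):
    """Assign tokens to groups in round-robin order of the global ordering."""
    group_of = {}
    i = 0
    for token in ordered_tokens:
        group_of[token] = i
        i = (i + 1) % n_groups
    return group_of

def route_grouped(rid, join_value, prefix_tokens, group_of):
    """Emit one (group, value) pair per distinct group among the prefix tokens."""
    keys = sorted({group_of[token] for token in prefix_tokens})
    return [(key, (rid, join_value)) for key in keys]

group_of = round_robin_groups(["A", "B", "E", "D", "G", "C", "F"], 3)
# Both prefix tokens fall into group 0, so only one copy of the record is emitted.
emitted = route_grouped(1, "A D C", ["A", "D"], group_of)
```

Groups 0, 1 and 2 here correspond to Group1 (A, D, F), Group2 (B, G) and Group3 (E, C) on the slide.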
  • 27. Grouping/Routing: Using Grouped Tokens • Advantage: – less replication of record projections • Disadvantage: – Quality of grouping is not so high (records having no chance of being similar are sent to the same reducer, which checks their similarity) – “ABCD” (A, B belong to Group X; C belongs to Group Y) • output: (X,_) & (Y,_) – “EFG” (E belongs to Group Y) • output: (Y,_)
  • 28. RID-Pair Generation: Reduce Phase • This is the core of the entire method • Each reducer processes one or more buckets of candidates • In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate • If the similarity of 2 candidates >= threshold => output their IDs and also their similarity
  • 29. RID-Pair Generation: Reduce Phase • Computing similarity of the candidates in a bucket comes in 2 flavors: • Basic Kernel : uses 2 nested loops to verify each pair of candidates in the bucket • Indexed Kernel : uses a PPJoin+ index
  • 30. RID-Pair Generation: Basic Kernel • Straightforward method for finding candidates satisfying the join predicate • Quadratic complexity: O(#candidates^2)
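A minimal sketch of the Basic Kernel, assuming Jaccard as the similarity function and a bucket given as a list of `(RID, tokens)` pairs; the names `jaccard` and `basic_kernel` are illustrative, and non-empty token lists are assumed.

```python
def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token collections."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def basic_kernel(bucket, threshold):
    """Verify every pair of candidates in one bucket with two nested loops."""
    results = []
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            (rid1, tokens1), (rid2, tokens2) = bucket[i], bucket[j]
            sim = jaccard(tokens1, tokens2)
            if sim >= threshold:
                results.append((rid1, rid2, sim))  # RID pair plus its similarity
    return results

bucket = [
    (1, ["I", "will", "call", "back"]),
    (2, ["I", "will", "call", "you", "soon"]),
    (3, ["I", "am", "here"]),
]
matches = basic_kernel(bucket, threshold=0.5)
```

Only records 1 and 2 pass the 0.5 threshold (3 shared tokens out of 6 distinct ones); every other pair is still compared, which is the quadratic cost noted above.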
  • 31. RID-Pair Generation: PPJoin+ Indexed Kernel • Uses a special index data structure • Not so straightforward to implement • map(): same as in the BK algorithm • Much more efficient
  • 32. Stage III: Record Join • Until now we have only pairs of RIDs, but we need actual records • Use the RID pairs generated in the previous stage to join the actual records • Main idea: – bring in the rest of each record (everything except the RID, which we already have) • 2 approaches: – Basic Record Join (BRJ) – One-Phase Record Join (OPRJ)
  • 33. Record Join: Basic Record Join • Uses 2 MapReduce cycles – 1st cycle: fills in the record information for each half of each pair – 2nd cycle: brings together the previously filled in records
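The two BRJ cycles might look roughly like this when simulated in one process. The function names, the tuple layouts, and the similarity value in the sample RID pair are hypothetical; the sketch assumes a self-join in which the two RIDs of a pair differ.

```python
from collections import defaultdict

def cycle1(records, rid_pairs):
    """1st cycle: group by RID and fill in the record for each half of each pair."""
    groups = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, payload in records:
        groups[rid]["record"] = (rid, payload)
    for pair in rid_pairs:
        rid1, rid2, sim = pair
        groups[rid1]["pairs"].append(pair)
        groups[rid2]["pairs"].append(pair)
    # Emit (pair, record) for every pair this RID takes part in.
    return [(pair, g["record"]) for g in groups.values() for pair in g["pairs"]]

def cycle2(half_filled):
    """2nd cycle: group by RID pair and bring the two filled-in halves together."""
    groups = defaultdict(list)
    for pair, record in half_filled:
        groups[pair].append(record)
    joined = []
    for (rid1, rid2, sim), halves in groups.items():
        by_rid = dict(halves)
        joined.append(((rid1, by_rid[rid1]), (rid2, by_rid[rid2]), sim))
    return joined

records = [(1, "John W. Smith"), (2, "Smith, John"), (3, "Jane Doe")]
rid_pairs = [(1, 2, 0.5)]  # illustrative output of Stage II
joined = cycle2(cycle1(records, rid_pairs))
```

Record 3 takes part in no pair, so it is dropped in the first cycle and never reaches the second.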
  • 34. Record Join: One Phase Record Join • Uses only one MapReduce cycle
  • 35. R-S Join • Challenge: We now have 2 different record sources => 2 different input streams • MapReduce can work on only 1 input stream • 2nd and 3rd stages are affected • Solution: extend the (key, value) pairs so that they include a relation tag for each record
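One way to picture the relation tag is sketched below: records from both sources are merged into a single stream, each carrying an "R" or "S" tag, and a reducer only pairs candidates with different tags. The function names and the tuple layout `(tag, RID, value)` are assumptions for illustration, not the original implementation.

```python
def tag_records(r_records, s_records):
    """Merge two sources into one stream, tagging each record with its relation."""
    for rid, value in r_records:
        yield ("R", rid, value)
    for rid, value in s_records:
        yield ("S", rid, value)

def cross_relation_pairs(bucket):
    """Inside a reducer, pair only R-side candidates with S-side candidates."""
    r_side = [rec for rec in bucket if rec[0] == "R"]
    s_side = [rec for rec in bucket if rec[0] == "S"]
    return [(r[1], s[1]) for r in r_side for s in s_side]

stream = list(tag_records([(1, "A B C"), (2, "A D")], [(1, "A B")]))
candidate_pairs = cross_relation_pairs(stream)
```

Because RIDs from R and S may collide, the tag is also what keeps record 1 of R distinct from record 1 of S.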
  • 36. Handling Insufficient Memory • Map-Based Block Processing. • Reduce-Based Block Processing
  • 37. Evaluation • Cluster: 10-node IBM x3650, running Hadoop • Data sets: – DBLP: 1.2M publications – CITESEERX: 1.3M publications – Consider only the header of each paper (i.e., author, title, date of publication, etc.) – Data size synthetically increased (by various factors) • Measure: – Absolute running time – Speedup – Scaleup
  • 38. Self-Join running time • Best algorithm: BTO-PK-OPRJ • Most expensive stage: the RID-pair generation
  • 39. Self-Join Speedup • Fixed data size, vary the cluster size • Best time: BTO-PK-OPRJ
  • 41. Self-Join Scaleup • Increase data size and cluster size together by the same factor • Best time: BTO-PK-OPRJ
  • 43. Self-Join Summary • Stage I: BTO was the best choice. • Stage II: PK was the best choice. • Stage III: the best choice depends on the amount of data and the size of the cluster – OPRJ was somewhat faster, but the cost of loading the similar-RID pairs in memory was constant as the cluster size increased, and the cost increased as the data size increased. For these reasons, we recommend BRJ as a good alternative • Best scaleup was achieved by BTO-PK-BRJ
  • 45. Speedup • Stage I: R-S join performance was identical to the first stage in the self-join case • Stage II: we noticed a similar (almost perfect) speedup as in the self-join case. • Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach.
  • 46. Conclusions • For both self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method. • Useful in many data cleaning scenarios • SSJoin and MapReduce: one solution for huge datasets • Very efficient when based on prefix-filtering and PPJoin+ • Scales up nicely

Editor's Notes

  1. Before publishing a journal, editors have to make sure there is no plagiarized paper among the hundreds of papers to be included in the journal. Different hosts may hold redundant copies of the same page. Detecting such similar pairs is challenging today, as there is an increasing trend of applications being expected to deal with vast amounts of data that usually do not fit in the main memory of one machine.
  2. 2 MapReduce phases
  3. map: tokenize the join value of each record; emit each token with no. of occurrences 1. reduce: for each token, compute total count (frequency)
  4. Instead of using MapReduce to sort the tokens, we can explicitly sort the tokens in memory
  5. For each token, the function computes its total count and stores the information locally
  6. i = (i + 1) mod n
  7. Bring in the records for each ID in each pair. Join the two half-filled records.
  8. Stage 3 was the most expensive, because this stage had to scan two datasets instead of one.