Slide notes:
  • Before publishing a journal, editors have to make sure there is no plagiarized paper among the hundreds of papers to be included. Similarly, different hosts may hold redundant copies of the same page. Detecting such similar pairs is challenging today, as applications are increasingly expected to deal with vast amounts of data that do not fit in the main memory of one machine.
  • 2 MapReduce phases.
  • map: tokenize the join value of each record and emit each token with no. of occurrences 1; reduce: for each token, compute the total count (frequency).
  • Instead of using MapReduce to sort the tokens, we can explicitly sort the tokens in memory.
  • For each token, the function computes its total count and stores the information locally.
  • Round-robin group assignment: i = (i + 1) mod n
  • Bring the record for each RID in each pair; join the two half-filled records.
  • Stage III was the most expensive; this stage had to scan two datasets instead of one.
1. Efficient Parallel Set-Similarity Joins Using MapReduce
   Tilani Gunawardena
2. Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions
3. Introduction
• Vast amounts of data:
  – Google N-gram database: ~1 trillion records
  – GenBank: 100 million records, size = 416 GB
  – Facebook: 400 million active users
• Detecting similar pairs of records becomes a challenging problem
4. Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
  – “John W. Smith”, “Smith, John”, “John William Smith”
• Making recommendations to users based on their similarity to other users in query refinement
• Mining in social networking sites
  – Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
5. Preliminaries
• Problem statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ λ
6. Set-similarity functions
• Jaccard or Tanimoto coefficient
  – Jaccard(x, y) = |x ∩ y| / |x ∪ y|
• “I will call back” = [I, will, call, back]
• “I will call you soon” = [I, will, call, you, soon]
• Jaccard similarity = 3/6 = 0.5
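The slide's example can be reproduced with a few lines (a minimal sketch; whitespace tokenization is assumed):

```python
def jaccard(x, y):
    """Jaccard (Tanimoto) coefficient of two token collections."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

a = "I will call back".split()        # [I, will, call, back]
b = "I will call you soon".split()    # [I, will, call, you, soon]
print(jaccard(a, b))                  # 3 shared tokens / 6 distinct tokens = 0.5
```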
7. Set-similarity with MapReduce
• Why Hadoop?
  – Large amounts of data, shared-nothing architecture
• map (k1, v1) -> list(k2, v2)
• reduce (k2, list(v2)) -> list(k3, v3)
• Problem:
  – Too much data to transfer
  – Too many pairs to verify (two similar sets share at least 1 token)
8. Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on effective filters
• string s = “I will call back”
• global token ordering: {back, call, will, I}
• prefix of length 2 of s = [back, call]
• The prefix filtering principle states that similar strings need to share at least one common token in their prefixes
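A sketch of prefix extraction. The prefix-length formula for a Jaccard threshold t, |s| - ceil(t * |s|) + 1, comes from the prefix-filtering literature rather than from this slide, and the function name is hypothetical:

```python
import math

def prefix_tokens(tokens, global_order, threshold):
    """Prefix under a global token order, sized for a Jaccard threshold."""
    rank = {tok: i for i, tok in enumerate(global_order)}
    s = sorted(set(tokens), key=rank.__getitem__)    # rarest tokens first
    k = len(s) - math.ceil(threshold * len(s)) + 1   # prefix length
    return s[:k]

order = ["back", "call", "will", "I"]                # global ordering from the slide
print(prefix_tokens("I will call back".split(), order, 0.8))   # ['back']
```

A higher threshold yields a shorter prefix, so fewer candidate pairs survive the filter.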
9. Prefix filtering: example
• Two records (Record 1, Record 2), each with 5 tokens
• “Similar”: they share at least 4 tokens
• Prefix length: 2
10. Parallel Set-Similarity Joins
• Stage I: Token Ordering
  – Compute data statistics for good signatures
• Stage II: RID-Pair Generation
• Stage III: Record Join
  – Generate actual pairs of joined records
11. Input Data
• RID = Row ID
• a: join column
• “A B C” is a string:
  – Address: “14th Saarbruecker Strasse”
  – Name: “John W. Smith”
12. Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One-Phase Token Ordering (OPTO)
13. Token Ordering
• Creates a global ordering of the tokens in the join column, based on their frequency

  RID | a         | b | c
  ----|-----------|---|---
  1   | A B D A A | … | …
  2   | B B D A E | … | …

  Global ordering (based on frequency): E D B A (positions 1 2 3 4)
14. Basic Token Ordering (BTO)
• 2 MapReduce cycles:
  – 1st: compute token frequencies
  – 2nd: sort the tokens by their frequencies
15. Basic Token Ordering – 1st MapReduce cycle
• map:
  – tokenize the join value of each record
  – emit each token with no. of occurrences 1
• reduce:
  – for each token, compute the total count (frequency)
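The first cycle is essentially a word count over the join column. A simulated sketch (the record layout and function names are hypothetical; the token values follow slide 13):

```python
def bto_map(record):
    """map: tokenize the join value of each record, emit (token, 1)."""
    for token in record["a"].split():
        yield (token, 1)

def bto_reduce(token, counts):
    """reduce: for each token, compute the total count (frequency)."""
    return (token, sum(counts))

records = [{"rid": 1, "a": "A B D A A"}, {"rid": 2, "a": "B B D A E"}]

# simulate the shuffle: group the map output by key (token)
grouped = {}
for rec in records:
    for tok, one in bto_map(rec):
        grouped.setdefault(tok, []).append(one)

freq = dict(bto_reduce(tok, counts) for tok, counts in grouped.items())
print(freq)   # {'A': 4, 'B': 3, 'D': 2, 'E': 1}
```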
16. Basic Token Ordering – 2nd MapReduce cycle
• map:
  – interchange key with value
• reduce (use only 1 reducer):
  – emit the values (tokens), now sorted by frequency
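The second cycle exploits MapReduce's sort-by-key shuffle: swapping (token, freq) to (freq, token) makes the framework sort tokens by frequency, and a single reducer emits them in order. A sketch, simulating the shuffle with sorted():

```python
# output of the 1st cycle: (token, frequency) pairs, values as on slide 13
token_freqs = [("A", 4), ("B", 3), ("D", 2), ("E", 1)]

# map: interchange key with value, so the frequency becomes the sort key
swapped = [(freq, tok) for tok, freq in token_freqs]

# the framework sorts map output by key before the single reducer sees it;
# the reducer then emits the tokens in ascending frequency order
global_order = [tok for freq, tok in sorted(swapped)]
print(global_order)   # ['E', 'D', 'B', 'A'], the global ordering from slide 13
```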
17. One-Phase Token Ordering (OPTO)
• Alternative to Basic Token Ordering (BTO):
  – Uses only one MapReduce cycle (less I/O)
  – In-memory token sorting, instead of using a reducer
18. OPTO – Details
• map:
  – tokenize the join value of each record
  – emit each token with no. of occurrences 1
• reduce:
  – for each token, compute the total count (frequency)
  – use the tear_down method to order the tokens in memory
19. Stage II: RID-Pair Generation
• Basic Kernel (BK)
• Indexed Kernel (PK)
20. RID-Pair Generation
• Scans the original input data (records)
• Outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
• Consists of only one MapReduce cycle
• Uses the global ordering of tokens obtained in the previous stage
21. RID-Pair Generation: Map Phase
• Scan input records and for each record:
  – project it on RID & join attribute
  – tokenize it
  – extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage
  – route tokens to the appropriate reducer
22. Grouping/Routing Strategies
• Goal: distribute candidates to the right reducers to minimize the reducers’ workload
• Like hashing (projected) records to the corresponding candidate buckets
• Each reducer handles one or more candidate buckets
• 2 routing strategies:
  – Using Individual Tokens
  – Using Grouped Tokens
23. Routing: using individual tokens
• Treat each token as a key
• For each record, generate a (key, value) pair for each of its prefix tokens
• Example, given the global ordering:

  Token     | A  | B  | E  | D  | G  | C  | F
  Frequency | 10 | 10 | 22 | 23 | 23 | 40 | 48

  “A B C” => prefix of length 2: A, B => generate/emit 2 (key, value) pairs:
  – (A, (1, A B C))
  – (B, (1, A B C))
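The individual-token map step above can be sketched as follows (function name and record shape are hypothetical; the prefix length of 2 follows the slide's example):

```python
def route_individual(rid, join_value, rank, prefix_len):
    """Emit one (token, projected record) pair per prefix token."""
    tokens = join_value.split()
    prefix = sorted(set(tokens), key=rank.__getitem__)[:prefix_len]
    return [(tok, (rid, join_value)) for tok in prefix]

# global ordering from the slide: A and B are the least frequent tokens
rank = {t: i for i, t in enumerate(["A", "B", "E", "D", "G", "C", "F"])}
print(route_individual(1, "A B C", rank, 2))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```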
24. Grouping/Routing: using individual tokens
• Advantage:
  – high-quality grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
• Disadvantage:
  – high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)
25. Routing: Using Grouped Tokens
• Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
• For each record, generate a (key, value) pair for each of the groups of its prefix tokens
• Example, given the global ordering:

  Token     | A  | B  | E  | D  | G  | C  | F
  Frequency | 10 | 10 | 22 | 23 | 23 | 40 | 48

  “A B C” => prefix of length 2: A, B
  Suppose A, B belong to group X and C belongs to group Y
  => generate/emit 2 (key, value) pairs:
  – (X, (1, A B C))
  – (Y, (1, A B C))
26. Grouping/Routing: Using Grouped Tokens
• The groups of tokens (X, Y) are formed by assigning tokens to groups in a Round-Robin manner

  Token     | A  | B  | E  | D  | G  | C  | F
  Frequency | 10 | 10 | 22 | 23 | 23 | 40 | 48

  Group 1: A, D, F
  Group 2: B, G
  Group 3: E, C
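The round-robin assignment (the speaker notes give it as i = (i + 1) mod n) can be sketched in a few lines; 0-based group numbers are my choice:

```python
def round_robin_groups(ordered_tokens, n_groups):
    """Assign the i-th token (in global order) to group i mod n."""
    return {tok: i % n_groups for i, tok in enumerate(ordered_tokens)}

ordered = ["A", "B", "E", "D", "G", "C", "F"]   # ascending frequency, as on the slide
groups = round_robin_groups(ordered, 3)
for g in range(3):
    print(g, [t for t in ordered if groups[t] == g])
# 0 ['A', 'D', 'F']
# 1 ['B', 'G']
# 2 ['E', 'C']
```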
27. Grouping/Routing: Using Grouped Tokens
• Advantage:
  – less replication of record projections
• Disadvantage:
  – quality of grouping is not as high (records having no chance of being similar are sent to the same reducer, which checks their similarity)
  – “A B C D” (A, B belong to group X; C belongs to group Y)
    • output: (X, _) & (Y, _)
  – “E F G” (E belongs to group Y)
    • output: (Y, _)
28. RID-Pair Generation: Reduce Phase
• This is the core of the entire method
• Each reducer processes one or more buckets of candidates
• In each bucket, the reducer looks for pairs of join-attribute values satisfying the join predicate
• If the similarity of the 2 candidates >= threshold => output their RIDs and also their similarity
29. RID-Pair Generation: Reduce Phase
• Computing the similarity of the candidates in a bucket comes in 2 flavors:
  – Basic Kernel: uses 2 nested loops to verify each pair of candidates in the bucket
  – Indexed Kernel: uses a PPJoin+ index
30. RID-Pair Generation: Basic Kernel
• Straightforward method for finding candidates satisfying the join predicate
• Quadratic complexity: O(#candidates²)
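A minimal sketch of the Basic Kernel's nested-loop verification over one bucket, assuming Jaccard as the similarity function and a hypothetical bucket of (RID, tokens) pairs:

```python
def basic_kernel(bucket, threshold):
    """Two nested loops over the bucket: O(n^2) candidate verifications."""
    out = []
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            rid1, toks1 = bucket[i]
            rid2, toks2 = bucket[j]
            s1, s2 = set(toks1), set(toks2)
            sim = len(s1 & s2) / len(s1 | s2)     # Jaccard similarity
            if sim >= threshold:                  # verify the join predicate
                out.append((rid1, rid2, sim))
    return out

bucket = [(1, "A B C".split()), (2, "A B D".split()), (3, "E F".split())]
print(basic_kernel(bucket, 0.5))   # [(1, 2, 0.5)]
```

The PPJoin+ Indexed Kernel replaces these nested loops with an inverted index over prefixes, avoiding most of the quadratic work.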
31. RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map() is the same as in the BK algorithm
• Much more efficient
32. Stage III: Record Join
• Until now we have only pairs of RIDs, but we need actual records
• Use the RID pairs generated in the previous stage to join the actual records
• Main idea:
  – bring in the rest of each record (everything except the RID, which we already have)
• 2 approaches:
  – Basic Record Join (BRJ)
  – One-Phase Record Join (OPRJ)
33. Record Join: Basic Record Join
• Uses 2 MapReduce cycles
  – 1st cycle: fills in the record information for each half of each pair
  – 2nd cycle: brings together the previously filled-in records
34. Record Join: One-Phase Record Join
• Uses only one MapReduce cycle
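One way to read the single-cycle approach (consistent with slide 41's remark about "loading the similar-RID pairs in memory"): every map task loads the Stage II RID pairs, emits each record once per pair it participates in, and the reducer joins the two halves. A simulated sketch with hypothetical names:

```python
# RID pairs from Stage II, loaded in memory by every map task
rid_pairs = [(1, 2)]
pairs_by_rid = {}
for pair in rid_pairs:
    for rid in pair:
        pairs_by_rid.setdefault(rid, []).append(pair)

def oprj_map(record):
    """map: emit the full record once per RID pair it belongs to."""
    for pair in pairs_by_rid.get(record["rid"], []):
        yield (pair, record)

# reduce: the two records of a pair meet at the same key and are joined
records = [{"rid": 1, "a": "A B C"}, {"rid": 2, "a": "A B D"}, {"rid": 3, "a": "E F"}]
joined = {}
for rec in records:
    for key, value in oprj_map(rec):
        joined.setdefault(key, []).append(value)
print(joined)   # {(1, 2): [record 1, record 2]}; record 3 is not in any pair
```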
35. R-S Join
• Challenge: we now have 2 different record sources => 2 different input streams
• MapReduce can work on only 1 input stream
• The 2nd and 3rd stages are affected
• Solution: extend the (key, value) pairs so that they include a relation tag for each record
36. Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing
37. Evaluation
• Cluster: 10-node IBM x3650, running Hadoop
• Data sets:
  – DBLP: 1.2M publications
  – CITESEERX: 1.3M publications
  – Consider only the header of each paper (i.e. author, title, date of publication, etc.)
  – Data size synthetically increased (by various factors)
• Measures:
  – Absolute running time
  – Speedup
  – Scaleup
38. Self-Join Running Time
• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: the RID-pair generation
39. Self-Join Speedup
• Fixed data size, vary the cluster size
• Best time: BTO-PK-OPRJ
40. Self-Join Scaleup
• Increase data size and cluster size together by the same factor
• Best time: BTO-PK-OPRJ
41. Self-Join Summary
• Stage I: BTO was the best choice
• Stage II: PK was the best choice
• Stage III: the best choice depends on the amount of data and the size of the cluster
  – OPRJ was somewhat faster, but the cost of loading the similar-RID pairs in memory stayed constant as the cluster size increased, and it grew as the data size increased. For these reasons, we recommend BRJ as a good alternative
• Best scaleup was achieved by BTO-PK-BRJ
42. R-S Join Performance
43. Speedup
• Stage I: R-S join performance was identical to the first stage in the self-join case
• Stage II: a similar (almost perfect) speedup as in the self-join case
• Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach
44. Conclusions
• For both self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method
• Useful in many data-cleaning scenarios
• SSJoin and MapReduce: one solution for huge datasets
• Very efficient when based on prefix filtering and PPJoin+
• Scales up nicely
45. Thank You!
