Efficient Parallel Set-Similarity Joins Using MapReduce

  • Before publishing a journal, editors have to make sure that there is no plagiarized paper among the hundreds of papers to be included. In web crawling, different hosts may hold redundant copies of the same page. Detecting such similar pairs is challenging today, as applications are increasingly expected to deal with vast amounts of data that usually do not fit in the main memory of one machine.
  • 2 MapReduce phases
  • map: tokenize the join value of each record and emit each token with an occurrence count of 1; reduce: for each token, compute the total count (frequency)
  • Instead of using MapReduce to sort the tokens, we can explicitly sort the tokens in memory
  • For each token, the function computes its total count and stores the information locally
  • i = (i + 1) mod n
  • Bring in the records for each ID in each pair, then join the two half-filled records
  • Stage 3 was the most expensive, because this stage had to scan two datasets instead of one

Presentation Transcript

  • Efficient Parallel Set-Similarity Joins Using MapReduce
    • Tilani Gunawardena
  • Content
    • Introduction
    • Preliminaries
    • Self-Join case
    • R-S Join case
    • Handling insufficient memory
    • Experimental evaluation
    • Conclusions
  • Introduction
    • Vast amounts of data:
      – Google N-gram database: ~1 trillion records
      – GeneBank: 100 million records, size = 416 GB
      – Facebook: 400 million active users
    • Detecting similar pairs of records becomes a challenging problem
  • Examples
    • Detecting near-duplicate web pages in web crawling
    • Document clustering
    • Plagiarism detection
    • Master data management
      – “John W. Smith”, “Smith, John”, “John William Smith”
    • Making recommendations to users based on their similarity to other users in query refinement
    • Mining in social networking sites
      – Users [1,0,0,1,1,0,1,0,0,1] and [1,0,0,0,1,0,1,0,1,1] have similar interests
    • Identifying coalitions of click fraudsters in online advertising
  • Preliminaries
    • Problem statement: given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ λ
  • Set-similarity functions
    • Jaccard or Tanimoto coefficient
      – Jaccard(x, y) = |x ∩ y| / |x ∪ y|
    • “I will call back” = [I, will, call, back]
    • “I will call you soon” = [I, will, call, you, soon]
    • Jaccard similarity = 3/6 = 0.5
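The Jaccard computation on this slide can be checked with a minimal sketch. Whitespace tokenization and the helper names `tokenize` and `jaccard` are assumptions made for illustration; they are not taken from the slides.

```python
def tokenize(text):
    """Split a string into a set of word tokens (whitespace splitting is an assumption)."""
    return set(text.split())


def jaccard(x, y):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| of two token sets."""
    union = x | y
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(union)


a = tokenize("I will call back")
b = tokenize("I will call you soon")
similarity = jaccard(a, b)
print(similarity)  # 3 shared tokens out of 6 distinct tokens -> 0.5
```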
  • Set-similarity with MapReduce
    • Why Hadoop?
      – Large amounts of data, shared-nothing architecture
    • map (k1, v1) -> list(k2, v2)
    • reduce (k2, list(v2)) -> list(k3, v3)
    • Problem:
      – Too much data to transfer
      – Too many pairs to verify (two similar sets share at least 1 token)
  • Set-Similarity Filtering
    • Efficient set-similarity join algorithms rely on effective filters
    • string s = “I will call back”
    • global token ordering: {back, call, will, I}
    • prefix of length 2 of s = [back, call]
    • The prefix filtering principle states that similar strings need to share at least one common token in their prefixes.
  • Prefix filtering: example (Record 1 and Record 2)
    • Each set has 5 tokens
    • “Similar”: they share at least 4 tokens
    • Prefix length: 2
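A small sketch of the prefix computation used in the two slides above. It assumes the standard prefix-length rule for an overlap threshold (set size minus required overlap, plus one), which reproduces the slide's value of 2 for sets of 5 tokens that must share at least 4. The function names are hypothetical.

```python
def prefix(tokens, global_order, length):
    """First `length` tokens of a record after sorting by the global token ordering."""
    rank = {token: i for i, token in enumerate(global_order)}
    ordered = sorted(set(tokens), key=lambda token: rank[token])
    return ordered[:length]


def overlap_prefix_length(set_size, min_overlap):
    """Prefix length needed so that sets sharing >= min_overlap tokens share a prefix token."""
    return set_size - min_overlap + 1


def share_prefix_token(prefix_a, prefix_b):
    """Prefix filter: only pairs passing this test need to be verified."""
    return bool(set(prefix_a) & set(prefix_b))


global_order = ["back", "call", "will", "I"]
s_prefix = prefix("I will call back".split(), global_order, 2)
print(s_prefix)                     # ['back', 'call']
print(overlap_prefix_length(5, 4))  # 2
```

Pairs whose prefixes share no token are pruned without computing their full similarity, which is what keeps the number of verified candidates small.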
  • Parallel Set-Similarity Joins
    • Stage I: Token Ordering
      – Compute data statistics for good signatures
    • Stage II: RID-Pair Generation
    • Stage III: Record Join
      – Generate actual pairs of joined records
  • Input Data
    • RID = Row ID
    • a: join column
    • “A B C” is a string, for example:
      – Address: “14th Saarbruecker Strasse”
      – Name: “John W. Smith”
  • Stage I: Token Ordering
    • Basic Token Ordering (BTO)
    • One Phase Token Ordering (OPTO)
  • Token Ordering
    • Creates a global ordering of the tokens in the join column, based on their frequency
    • Example input (columns b and c not shown: …):
      – RID 1: a = “A B D A A”
      – RID 2: a = “B B D A E”
    • Global ordering (by increasing frequency): E (1), D (2), B (3), A (4)
  • Basic Token Ordering (BTO)
    • 2 MapReduce cycles:
      – 1st: compute token frequencies
      – 2nd: sort the tokens by their frequencies
  • Basic Token Ordering – 1st MapReduce cycle
    • map:
      – tokenize the join value of each record
      – emit each token with no. of occurrences 1
    • reduce:
      – for each token, compute total count (frequency)
  • Basic Token Ordering – 2nd MapReduce cycle
    • map:
      – interchange key with value
    • reduce (use only 1 reducer):
      – emits the value
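The two BTO cycles can be sketched in memory as follows. `run_mapreduce` is a hypothetical stand-in for a real MapReduce cycle (it is not a Hadoop API), and the alphabetical tie-break between equally frequent tokens is an assumption. The input reuses the two example records from the Token Ordering slide.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce cycle (hypothetical helper)."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # the shuffle hands keys to the reducer in sorted order
        output.extend(reducer(key, groups[key]))
    return output


# Cycle 1: compute token frequencies.
def count_map(record):
    rid, join_value = record
    for token in join_value.split():
        yield token, 1


def count_reduce(token, counts):
    yield token, sum(counts)


# Cycle 2: swap key and value so that sorting by key sorts by frequency.
def swap_map(pair):
    token, frequency = pair
    yield frequency, token


def emit_reduce(frequency, tokens):
    for token in sorted(tokens):  # tie-break is an assumption
        yield token


records = [(1, "A B D A A"), (2, "B B D A E")]
frequencies = run_mapreduce(records, count_map, count_reduce)
global_order = run_mapreduce(frequencies, swap_map, emit_reduce)
print(global_order)  # ['E', 'D', 'B', 'A']
```

Using a single reducer in the second cycle is what yields one totally ordered token list.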
  • One Phase Token Ordering (OPTO)
    • Alternative to Basic Token Ordering (BTO):
      – Uses only one MapReduce cycle (less I/O)
      – In-memory token sorting, instead of using a reducer
  • OPTO – Details
    • map:
      – tokenize the join value of each record
      – emit each token with no. of occurrences 1
    • reduce:
      – for each token, compute total count (frequency)
    • Use the tear_down method to order the tokens in memory
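A hedged sketch of the OPTO reducer: totals are kept locally and sorted once in a tear-down step. The class `OptoReducer` and its method names are illustrative, and the grouped map output is simulated by hand rather than produced by a framework.

```python
from collections import Counter


class OptoReducer:
    """Hypothetical reducer that counts tokens and sorts them in memory at tear-down."""

    def __init__(self):
        self.totals = Counter()

    def reduce(self, token, counts):
        # Store the total locally instead of emitting it.
        self.totals[token] = sum(counts)

    def tear_down(self):
        # Runs once after the last reduce call: order tokens by increasing frequency.
        # The alphabetical tie-break is an assumption.
        return sorted(self.totals, key=lambda token: (self.totals[token], token))


# Simulated map output, grouped by token, for the records "A B D A A" and "B B D A E".
grouped = {"A": [1, 1, 1, 1], "B": [1, 1, 1], "D": [1, 1], "E": [1]}

reducer = OptoReducer()
for token, counts in grouped.items():
    reducer.reduce(token, counts)
opto_order = reducer.tear_down()
print(opto_order)  # ['E', 'D', 'B', 'A']
```

This only works when the full token list fits in the memory of one reducer, which is the trade-off against BTO's second cycle.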
  • Stage II: RID-Pair Generation
    • Basic Kernel (BK)
    • Indexed Kernel (PK)
  • RID-Pair Generation
    • scans the original input data (records)
    • outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
    • consists of only one MapReduce cycle
    • uses the global ordering of tokens obtained in the previous stage
  • RID-Pair Generation: Map Phase
    • scan input records and for each record:
      – project it on RID & join attribute
      – tokenize it
      – extract prefix according to global ordering of tokens obtained in the Token Ordering stage
      – route tokens to appropriate reducer
  • Grouping/Routing Strategies
    • Goal: distribute candidates to the right reducers to minimize reducers’ workload
    • Like hashing (projected) records to the corresponding candidate buckets
    • Each reducer handles one or more candidate buckets
    • 2 routing strategies:
      – Using Individual Tokens
      – Using Grouped Tokens
  • Routing: using individual tokens
    • Treat each token as a key
    • For each record, generate a (key, value) pair for each of its prefix tokens
    • Example:
      – Given the global ordering (token: frequency): A: 10, B: 10, E: 22, D: 23, G: 23, C: 40, F: 48
      – “A B C” => prefix of length 2: A, B => generate/emit 2 (key, value) pairs:
        (A, (1, A B C))
        (B, (1, A B C))
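The individual-token routing in this example can be sketched as a map function. The name `route_individual` and its argument layout are assumptions made for illustration.

```python
def route_individual(rid, join_value, global_order, prefix_length):
    """Emit one (token, (rid, join_value)) pair for each prefix token of the record."""
    rank = {token: i for i, token in enumerate(global_order)}
    ordered = sorted(set(join_value.split()), key=lambda token: rank[token])
    return [(token, (rid, join_value)) for token in ordered[:prefix_length]]


global_order = ["A", "B", "E", "D", "G", "C", "F"]
pairs = route_individual(1, "A B C", global_order, 2)
print(pairs)  # [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```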
  • Grouping/Routing: using individual tokens
    • Advantage:
      – high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
    • Disadvantage:
      – high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)
  • Routing: Using Grouped Tokens
    • Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
    • For each record, generate a (key, value) pair for each group of its prefix tokens
    • Example:
      – Given the global ordering (token: frequency): A: 10, B: 10, E: 22, D: 23, G: 23, C: 40, F: 48
      – “A B C” => prefix of length 2: A, B
      – Suppose A belongs to group X and B belongs to group Y => generate/emit 2 (key, value) pairs:
        (X, (1, A B C))
        (Y, (1, A B C))
  • Grouping/Routing: Using Grouped Tokens
    • The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner
    • Global ordering (token: frequency): A: 10, B: 10, E: 22, D: 23, G: 23, C: 40, F: 48
    • Resulting groups: Group1 = {A, D, F}, Group2 = {B, G}, Group3 = {E, C}
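A minimal sketch of the round-robin assignment and of routing by group, using the `i = (i + 1) mod n` step from the speaker notes. Groups are numbered from 0 here (Group1 on the slide is group 0); the function names are hypothetical.

```python
def round_robin_groups(global_order, n_groups):
    """Assign tokens, taken in global order, to groups 0..n_groups-1 in turn."""
    group_of = {}
    i = 0
    for token in global_order:
        group_of[token] = i
        i = (i + 1) % n_groups
    return group_of


def route_grouped(rid, join_value, global_order, group_of, prefix_length):
    """Emit one (group, (rid, join_value)) pair per distinct group among the prefix tokens."""
    rank = {token: i for i, token in enumerate(global_order)}
    ordered = sorted(set(join_value.split()), key=lambda token: rank[token])
    keys = sorted({group_of[token] for token in ordered[:prefix_length]})
    return [(key, (rid, join_value)) for key in keys]


global_order = ["A", "B", "E", "D", "G", "C", "F"]
group_of = round_robin_groups(global_order, 3)
groups = [[t for t in global_order if group_of[t] == g] for g in range(3)]
print(groups)  # [['A', 'D', 'F'], ['B', 'G'], ['E', 'C']]

grouped_pairs = route_grouped(1, "A B C", global_order, group_of, 2)
print(grouped_pairs)  # [(0, (1, 'A B C')), (1, (1, 'A B C'))]
```

Because tokens are taken in frequency order, round-robin assignment spreads rare and frequent tokens across the groups instead of concentrating the frequent ones in one bucket.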
  • Grouping/Routing: Using Grouped Tokens
    • Advantage:
      – less replication of record projections
    • Disadvantage:
      – Quality of grouping is not as high (records having no chance of being similar are sent to the same reducer, which checks their similarity)
      – “ABCD” (A, B belong to Group X; C belongs to Group Y): output (X,_) & (Y,_)
      – “EFG” (E belongs to Group Y): output (Y,_)
  • RID-Pair Generation: Reduce Phase
    • This is the core of the entire method
    • Each reducer processes one or more buckets of candidates
    • In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate
    • If the similarity of the 2 candidates >= threshold => output their IDs and also their similarity
  • RID-Pair Generation: Reduce Phase
    • Computing the similarity of the candidates in a bucket comes in 2 flavors:
      – Basic Kernel: uses 2 nested loops to verify each pair of candidates in the bucket
      – Indexed Kernel: uses a PPJoin+ index
  • RID-Pair Generation: Basic Kernel
    • Straightforward method for finding candidates satisfying the join predicate
    • Quadratic complexity: O(#candidates²)
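The Basic Kernel amounts to two nested loops over one bucket. The sketch below assumes Jaccard as the similarity function and a bucket represented as a list of (RID, token set) pairs; both choices, and the sample bucket, are illustrative.

```python
def jaccard(x, y):
    """Jaccard coefficient of two token sets."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def basic_kernel(bucket, threshold):
    """Verify every pair of candidates in a bucket: O(n^2) similarity computations."""
    results = []
    for i in range(len(bucket)):
        rid_i, tokens_i = bucket[i]
        for j in range(i + 1, len(bucket)):
            rid_j, tokens_j = bucket[j]
            sim = jaccard(tokens_i, tokens_j)
            if sim >= threshold:
                results.append((rid_i, rid_j, sim))
    return results


bucket = [
    (1, set("I will call back".split())),
    (2, set("I will call you soon".split())),
    (3, set("I will call back soon".split())),
]
similar = basic_kernel(bucket, 0.8)
print(similar)  # [(1, 3, 0.8)]
```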
  • RID-Pair Generation: PPJoin+ Indexed Kernel
    • Uses a special index data structure
    • Not so straightforward to implement
    • map(): same as in the BK algorithm
    • Much more efficient
  • Stage III: Record Join
    • Until now we have only pairs of RIDs, but we need actual records
    • Use the RID pairs generated in the previous stage to join the actual records
    • Main idea:
      – bring in the rest of each record (everything except the RID, which we already have)
    • 2 approaches:
      – Basic Record Join (BRJ)
      – One-Phase Record Join (OPRJ)
  • Record Join: Basic Record Join
    • Uses 2 MapReduce cycles
      – 1st cycle: fills in the record information for each half of each pair
      – 2nd cycle: brings together the previously filled-in records
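The two BRJ cycles can be sketched in memory as two group-by steps. This is an illustration under assumptions: grouping is done with dictionaries rather than a real shuffle, and the function name `basic_record_join` is hypothetical. The sample names come from the Examples slide; 0.5 is their Jaccard similarity under whitespace tokenization.

```python
from collections import defaultdict


def basic_record_join(records, rid_pairs):
    """records: (rid, record); rid_pairs: (rid1, rid2, sim). Returns the joined record pairs."""
    # Cycle 1: group records and RID pairs by RID, producing half-filled pairs.
    record_of = {}
    pairs_of = defaultdict(list)
    for rid, record in records:
        record_of[rid] = record
    for rid1, rid2, sim in rid_pairs:
        pairs_of[rid1].append((rid1, rid2, sim))
        pairs_of[rid2].append((rid1, rid2, sim))
    half_filled = [
        (pair, (rid, record_of[rid])) for rid, pairs in pairs_of.items() for pair in pairs
    ]

    # Cycle 2: group the half-filled pairs by RID pair and bring the two halves together.
    halves_of = defaultdict(list)
    for pair, half in half_filled:
        halves_of[pair].append(half)
    joined = []
    for (rid1, rid2, sim), halves in sorted(halves_of.items()):
        left, right = sorted(halves)
        joined.append((left, right, sim))
    return joined


records = [(1, "John W. Smith"), (2, "Smith, John"), (3, "John William Smith")]
rid_pairs = [(1, 3, 0.5)]
brj_result = basic_record_join(records, rid_pairs)
print(brj_result)  # [((1, 'John W. Smith'), (3, 'John William Smith'), 0.5)]
```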
  • Record Join: One Phase Record Join
    • Uses only one MapReduce cycle
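One way a single-cycle record join could look, sketched under the assumption that every map task first loads all similar-RID pairs into memory (the summary slide mentions the cost of loading these pairs). The function name `one_phase_record_join` and the data layout are hypothetical.

```python
from collections import defaultdict


def one_phase_record_join(records, rid_pairs):
    """records: (rid, record); rid_pairs: (rid1, rid2, sim). Returns the joined record pairs."""
    # Set-up: index all RID pairs by RID (assumed to fit in the memory of each map task).
    pairs_of = defaultdict(list)
    for rid1, rid2, sim in rid_pairs:
        pairs_of[rid1].append((rid1, rid2, sim))
        pairs_of[rid2].append((rid1, rid2, sim))

    # Map: emit each record once per pair it takes part in, keyed by the pair.
    halves_of = defaultdict(list)
    for rid, record in records:
        for pair in pairs_of.get(rid, []):
            halves_of[pair].append((rid, record))

    # Reduce: each key receives the two halves of one pair and joins them.
    joined = []
    for (rid1, rid2, sim), halves in sorted(halves_of.items()):
        left, right = sorted(halves)
        joined.append((left, right, sim))
    return joined


records = [(1, "John W. Smith"), (2, "Smith, John"), (3, "John William Smith")]
oprj_result = one_phase_record_join(records, [(1, 3, 0.5)])
print(oprj_result)  # [((1, 'John W. Smith'), (3, 'John William Smith'), 0.5)]
```

The saved cycle comes at the price of holding the full RID-pair list in memory on every mapper, regardless of cluster size.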
  • R-S Join
    • Challenge: we now have 2 different record sources => 2 different input streams
    • MapReduce can work on only 1 input stream
    • 2nd and 3rd stages affected
    • Solution: extend the (key, value) pairs so that they include a relation tag for each record
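A sketch of how a relation tag might be carried in the value so that a reducer only pairs records from different relations. The tags "R" and "S", the function names, and the sample records are assumptions for illustration; prefixes are passed in precomputed to keep the example short.

```python
from collections import defaultdict


def tagged_map(relation_tag, rid, join_value, prefix_tokens):
    """Emit (token, (tag, rid, join_value)) for each prefix token of a record."""
    for token in prefix_tokens:
        yield token, (relation_tag, rid, join_value)


def cross_relation_candidates(bucket):
    """Pair only R-side with S-side records; R-R and S-S pairs are skipped."""
    r_side = [value for value in bucket if value[0] == "R"]
    s_side = [value for value in bucket if value[0] == "S"]
    return {(r[1], s[1]) for r in r_side for s in s_side}


inputs = [
    ("R", 1, "A B C", ["A", "B"]),
    ("S", 7, "A B D", ["A", "B"]),
    ("S", 8, "E D G", ["E", "D"]),
]

buckets = defaultdict(list)
for tag, rid, value, prefix_tokens in inputs:
    for key, tagged in tagged_map(tag, rid, value, prefix_tokens):
        buckets[key].append(tagged)

candidates = set()
for bucket in buckets.values():
    candidates |= cross_relation_candidates(bucket)
print(candidates)  # {(1, 7)}
```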
  • Handling Insufficient Memory
    • Map-Based Block Processing
    • Reduce-Based Block Processing
  • Evaluation
    • Cluster: 10-node IBM x3650, running Hadoop
    • Data sets:
      – DBLP: 1.2M publications
      – CITESEERX: 1.3M publications
      – Consider only the header of each paper (i.e., author, title, date of publication, etc.)
      – Data size synthetically increased (by various factors)
    • Measures:
      – Absolute running time
      – Speedup
      – Scaleup
  • Self-Join running time
    • Best algorithm: BTO-PK-OPRJ
    • Most expensive stage: the RID-pair generation
  • Self-Join Speedup
    • Fixed data size, vary the cluster size
    • Best time: BTO-PK-OPRJ
  • Self-Join Scaleup
    • Increase data size and cluster size together by the same factor
    • Best time: BTO-PK-OPRJ
  • Self-Join Summary
    • Stage I: BTO was the best choice.
    • Stage II: PK was the best choice.
    • Stage III: the best choice depends on the amount of data and the size of the cluster
      – OPRJ was somewhat faster, but the cost of loading the similar-RID pairs in memory was constant as the cluster size increased, and the cost increased as the data size increased. For these reasons, we recommend BRJ as a good alternative
    • Best scaleup was achieved by BTO-PK-BRJ
  • R-S Join Performance
  • Speedup
    • Stage I: R-S join performance was identical to the first stage in the self-join case
    • Stage II: a similar (almost perfect) speedup was observed as for the self-join case
    • Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach
  • Conclusions
    • For both the self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method.
    • Useful in many data cleaning scenarios
    • SSJoin and MapReduce: one solution for huge datasets
    • Very efficient when based on prefix filtering and PPJoin+
    • Scales up nicely
  • Thank You!