3. Introduction
• Vast amount of data:
– Google N-gram database : ~1 trillion records
– GenBank: 100 million records, size = 416 GB
– Facebook : 400 million active users
• Detecting similar pairs of records becomes a
challenging problem
4. Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
– “John W. Smith” , “Smith, John” , “John William Smith”
• Making recommendations to users based on
their similarity to other users; query refinement
• Mining in social networking sites
– Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
5. Preliminaries
• Problem Statement: Given two collections of
objects/items/records, a similarity metric
sim(o1,o2) and a threshold λ , find the pairs of
objects/items/records satisfying sim(o1,o2)≥ λ
6. Set-Similarity Functions
• Jaccard or Tanimoto coefficient
– Jaccard(x, y) = |x ∩ y| / |x ∪ y|
• “I will call back” = [I, will, call, back]
• “I will call you soon” = [I, will, call, you, soon]
• Jaccard similarity = 3/6 = 0.5
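The Jaccard coefficient above can be sketched directly on token sets in Python:

```python
def jaccard(x, y):
    """Jaccard coefficient: |x ∩ y| / |x ∪ y| over token sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

s1 = "I will call back".split()
s2 = "I will call you soon".split()
print(jaccard(s1, s2))  # 3 shared tokens / 6 distinct tokens -> 0.5
```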
7. Set-similarity with MapReduce
• Why Hadoop ?
– Large amounts of data; shared-nothing architecture
• map (k1,v1) -> list(k2,v2);
• reduce (k2,list(v2)) -> list(k3,v3)
• Problem:
– Too much data to transfer
– Too many pairs to verify (two similar sets share at least
1 token)
8. Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on
effective filters
• string s = “I will call back”
• global token ordering: {back, call, will, I}
• prefix of length 2 of s = [back, call]
• The prefix-filtering principle states that similar strings
must share at least one common token in their
prefixes.
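A minimal sketch of prefix extraction under the slide's global ordering; the `RANK` table and `prefix` helper are illustrative names, not part of the original method's code:

```python
GLOBAL_ORDER = ["back", "call", "will", "I"]          # rarest token first
RANK = {tok: i for i, tok in enumerate(GLOBAL_ORDER)}

def prefix(tokens, length):
    """Order a record's tokens by the global rarest-first ordering
    and keep only the first `length` of them."""
    return sorted(tokens, key=RANK.__getitem__)[:length]

def share_prefix_token(a, b, length):
    """Prefix-filtering test: only pairs whose prefixes share a
    token need to be verified."""
    return bool(set(prefix(a, length)) & set(prefix(b, length)))

s = "I will call back".split()
print(prefix(s, 2))  # ['back', 'call']
```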
9. Prefix filtering: example
(Figure: Record 1 and Record 2)
• Each set has 5 tokens
• “Similar”: they share at least 4 tokens
• Prefix length: 2
10. Parallel Set-Similarity Joins
• Stage I: Token Ordering
– Compute data statistics for good signatures
• Stage II: RID-Pair Generation
• Stage III: Record Join
– Generate actual pairs of joined records
11. Input Data
• RID = Row ID
• a : join column
• “A B C” is a string, e.g.:
• Address: “14th Saarbruecker Strasse”
• Name: “John W. Smith”
13. Token Ordering
• Creates a global ordering of the tokens in the
join column, based on their frequency
RID  a          b  c
1    A B D A A  …  …
2    B B D A E  …  …
Global ordering (based on frequency): E D B A
(ranks: 1 2 3 4)
14. Basic Token Ordering(BTO)
• 2 MapReduce cycles:
– 1st : compute token frequencies
– 2nd: sort the tokens by their frequencies
15. Basic Token Ordering – 1st MapReduce cycle
map:
• tokenize the join value of each record
• emit each token with no. of occurrences 1
reduce:
• for each token, compute total count (frequency)
16. Basic Token Ordering – 2nd MapReduce cycle
map:
• interchange key with value
reduce (use only 1 reducer):
• emit the value
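The two BTO cycles can be simulated in plain Python; the `records` dict is a hypothetical input, and a real job would run the map and reduce functions on Hadoop:

```python
from collections import Counter

# Hypothetical join-column values keyed by RID (from the
# Token Ordering example).
records = {1: "A B D A A", 2: "B B D A E"}

# 1st cycle: the map tokenizes each record and emits (token, 1);
# the reduce sums the counts to get each token's frequency.
freq = Counter(tok for rec in records.values() for tok in rec.split())

# 2nd cycle: the map swaps (token, count) into (count, token); a
# single reducer emits the tokens in ascending frequency order.
global_order = [tok for _, tok in sorted((c, t) for t, c in freq.items())]
print(global_order)  # ['E', 'D', 'B', 'A']
```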
17. One-Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
– Uses only one MapReduce Cycle (less I/O)
– In-memory token sorting, instead of using a
reducer
18. OPTO – Details
map:
• tokenize the join value of each record
• emit each token with no. of occurrences 1
reduce:
• for each token, compute total count (frequency)
• use the tear_down method to order the tokens in memory
20. RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records
satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the
previous stage
21. RID-Pair Generation: Map Phase
• scan input records and for each record:
– project it on RID & join attribute
– tokenize it
– extract prefix according to global ordering of tokens obtained in the Token
Ordering stage
– route tokens to appropriate reducer
22. Grouping/Routing Strategies
• Goal: distribute candidates to the right
reducers to minimize reducers’ workload
• Like hashing (projected) records to the
corresponding candidate buckets
• Each reducer handles one/more candidate-
buckets
• 2 routing strategies:
– Using Individual Tokens
– Using Grouped Tokens
23. Routing: using individual tokens
• Treat each token as a key
• For each record, generate a (key, value) pair for each
of its prefix tokens:
Example:
• Given the global ordering:
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
“A B C”
=> prefix of length 2: A,B
=> generate/emit 2 (key,value) pairs:
• (A, (1,A B C))
• (B, (1,A B C))
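A sketch of the map-side routing on individual tokens; `route_by_token` is an illustrative helper name, and the ranks come from the frequency table above:

```python
# Token ranks derived from the global (ascending-frequency) ordering.
RANK = {"A": 0, "B": 1, "E": 2, "D": 3, "G": 4, "C": 5, "F": 6}

def route_by_token(rid, record, prefix_len):
    """Emit one (token, (RID, record)) pair per prefix token, so a
    reducer receives every record whose prefix contains its token."""
    tokens = record.split()
    pref = sorted(set(tokens), key=RANK.__getitem__)[:prefix_len]
    return [(tok, (rid, record)) for tok in pref]

print(route_by_token(1, "A B C", 2))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```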
24. Grouping/Routing: using individual tokens
• Advantage:
– high quality of grouping of candidates (pairs of
records that have no chance of being similar are
never routed to the same reducer)
• Disadvantage:
– high replication of data (same records might be
checked for similarity in multiple reducers, i.e.
redundant work)
25. Routing: Using Grouped Tokens
• Multiple tokens mapped to one synthetic key
(different tokens can be mapped to the same key)
• For each record, generate a (key, value) pair for each
group of its prefix tokens:
Example:
• Given the global ordering:
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
“A B C” => prefix of length 2: A,B
Suppose A,B belong to group X and
C belongs to group Y
=> generate/emit 2 (key,value) pairs:
• (X, (1,A B C))
• (Y, (1,A B C))
26. Grouping/Routing: Using Grouped Tokens
• The groups of tokens (X, Y) are formed by assigning
tokens to groups in a round-robin manner
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
Group 1: A D F   Group 2: B G   Group 3: E C
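The round-robin assignment (group index advancing as i = (i + 1) mod n) can be sketched as follows; `assign_groups` is an illustrative name:

```python
def assign_groups(tokens_by_freq, n):
    """Round-robin assignment: token i goes to group i mod n,
    i.e. the group index advances as i = (i + 1) mod n."""
    groups = [[] for _ in range(n)]
    for i, tok in enumerate(tokens_by_freq):
        groups[i % n].append(tok)
    return groups

# Tokens in ascending frequency order, split into 3 groups.
print(assign_groups(["A", "B", "E", "D", "G", "C", "F"], 3))
# [['A', 'D', 'F'], ['B', 'G'], ['E', 'C']]
```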
27. Grouping/Routing: Using Grouped Tokens
• Advantage:
– less replication of record projections
• Disadvantage:
– quality of grouping is not as high (records having no
chance of being similar may be sent to the same reducer,
which then checks their similarity)
– “A B C D” (A, B belong to Group X; C belongs to Group Y)
• output: (X, _) & (Y, _)
– “E F G” (E belongs to Group Y)
• output: (Y, _)
28. RID-Pair Generation: Reduce Phase
• This is the core of the entire method
• Each reducer processes one/more buckets
• In each bucket, the reducer looks for pairs of join attribute values
satisfying the join predicate
• If the similarity of the 2 candidates ≥ threshold,
output their RIDs along with their similarity
(Figure: a bucket of candidates)
29. RID-Pair Generation: Reduce Phase
• Computing similarity of the candidates in a
bucket comes in 2 flavors:
• Basic Kernel : uses 2 nested loops to verify each pair of
candidates in the bucket
• Indexed Kernel : uses a PPJoin+ index
30. RID-Pair Generation: Basic Kernel
• Straightforward method for finding candidates satisfying
the join predicate
• Quadratic complexity: O(#candidates²)
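A minimal sketch of the nested-loop verification, assuming candidates arrive in a bucket as (RID, token set) pairs; `basic_kernel` is an illustrative name:

```python
def jaccard(x, y):
    """Jaccard coefficient over two token sets."""
    return len(x & y) / len(x | y)

def basic_kernel(bucket, threshold):
    """Verify every candidate pair in a bucket with two nested
    loops: O(n^2) similarity computations for n candidates."""
    results = []
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            (rid1, toks1), (rid2, toks2) = bucket[i], bucket[j]
            sim = jaccard(toks1, toks2)
            if sim >= threshold:
                results.append((rid1, rid2, sim))
    return results

bucket = [(1, {"A", "B", "C"}), (2, {"A", "B", "D"}), (3, {"E", "F"})]
print(basic_kernel(bucket, 0.4))  # [(1, 2, 0.5)]
```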
31. RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map() – same as in the Basic Kernel algorithm
• Much more efficient
32. Stage III: Record Join
• Until now we have only pairs of RIDs, but we need actual
records
• Use the RID pairs generated in the previous stage to join
the actual records
• Main idea:
– bring in the rest of each record (everything except the
RID, which we already have)
• 2 approaches:
– Basic Record Join (BRJ)
– One-Phase Record Join (OPRJ)
33. Record Join: Basic Record Join
• Uses 2 MapReduce cycles
– 1st cycle: fills in the record information for each half of each pair
– 2nd cycle: brings together the previously filled in records
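The two BRJ cycles can be simulated in plain Python; the `rid_pairs` and `records` inputs are hypothetical, and a real job would run each cycle on Hadoop:

```python
# Hypothetical Stage II output and the original records.
rid_pairs = [(1, 2)]
records = {1: "A B D A A", 2: "B B D A E", 3: "C F G"}

# 1st cycle: route each pair half and each record on its RID; the
# reducer for a RID fills that half in with the full record.
half_filled = [((a, b), records[rid])
               for a, b in rid_pairs
               for rid in (a, b)]

# 2nd cycle: group the two filled halves of each pair id to emit
# the fully joined record pair.
joined = {}
for pair, rec in half_filled:
    joined.setdefault(pair, []).append(rec)
print(joined)  # {(1, 2): ['A B D A A', 'B B D A E']}
```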
34. Record Join: One Phase Record Join
• Uses only one MapReduce cycle
35. R-S Join
• Challenge: We now have 2 different record sources => 2
different input streams
• MapReduce can work on only 1 input stream
• 2nd and 3rd stage affected
• Solution: extend the (key, value) pairs to include a
relation tag for each record
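The tagging idea can be sketched as follows; `tag_record` is an illustrative name:

```python
def tag_record(relation, rid, record):
    """Extend the map output value with a relation tag ('R' or 'S')
    so one MapReduce job can tell the two input streams apart."""
    return (rid, (relation, record))

print(tag_record("R", 7, "A B C"))  # (7, ('R', 'A B C'))
```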
37. Evaluation
• Cluster: 10-node IBM x3650, running Hadoop
• Data sets:
• DBLP: 1.2M publications
• CITESEERX: 1.3M publications
• Consider only the header of each paper (i.e., author, title,
date of publication, etc.)
• Data size synthetically increased (by various factors)
• Measure:
• Absolute running time
• Speedup
• Scaleup
38. Self-Join running time
• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: the
RID-pair generation
43. Self-Join Summary
• Stage I: BTO was the best choice.
• Stage II: PK was the best choice.
• Stage III: the best choice depends on the amount
of data and the size of the cluster
– OPRJ was somewhat faster, but the cost of loading the
similar-RID pairs in memory stayed constant as the
cluster size increased and grew as the data size
increased. For these reasons, we recommend BRJ as a
good alternative
• Best scaleup was achieved by BTO-PK-BRJ
45. Speedup
• Stage I: R-S join performance was identical to
the first stage in the self-join case
• Stage II: we noticed a similar (almost perfect)
speedup as in the self-join case
• Stage III: the OPRJ approach was initially the
fastest (for the 2- and 4-node cases), but it
eventually became slower than the BRJ approach
46. Conclusions
• For both self-join and R-S join cases, we recommend BTO-
PK-BRJ as a robust and scalable method.
• Useful in many data cleaning scenarios
• SSJoin and MapReduce: one solution for huge datasets
• Very efficient when based on prefix-filtering and PPJoin+
• Scales up nicely