Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Optimizing Set-Similarity Join and Search with Different Prefix Schemes

17 views

Published on

As part of the 2018 HPCC Systems Summit Community Day event:

Up first, Zhe Yu, NC State University briefly discusses his poster, How to Be Rich: A Study of Monsters and Mice of American Industry

Following, Fabian Fier, presents his breakout session in the Documentation & Training Track.
Finding duplicate textual content is crucial for many applications, especially plagiarism detection. When dealing with millions of documents finding duplicate content becomes very time-consuming. Thus it needs scalable and efficient data structures and algorithms that solve this task in seconds rather than hours. In my talk, I present an optimization of a common filter-and-verification set-similarity join and search approach. Filter-and-verification means that we only consider such pairs of objects which share a common word or token in a prefix. Such pairs are potentially similar and are verified in a subsequent step. The candidate set is usually orders of magnitudes smaller than the cross product over an input set. We optimizied this approach by regarding overlaps larger than 1, which reduces the candidate set further and makes the verification faster. On the other hand this requires larger prefixes, which use more memory. Our experiments using HPCC Systems show that we can usually optimize the runtime by choosing an overlap different from the standard overlap 1.

Fabian Fier is a PhD student at the database research group of Johann-Christoph Freytag. He holds a diploma in computer science from Humboldt-Universität. His research interest is similarity search on web-scale data. He uses techniques from textual similarity joins on Big Data and adapts them to similiarity search.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Optimizing Set-Similarity Join and Search with Different Prefix Schemes

  1. 1. Innovation and Reinvention Driving Transformation OCTOBER 9, 2018 2018 HPCC Systems® Community Day Fabian Fier Set Similarity Join and Search with different Prefix Schemes
  2. 2. Background Motivation & Agenda | 1 | 2 | 3 | 4 | 5 3
  3. 3. • Many applications need to identify similar pairs of documents: • Plagiarism detection • Community mining in social networks • Near-duplicate web page detection • Document clustering • ... • Problem: Large amounts of data • Solution: Parallel set similarity join algorithms (SSJ) Motivation Motivation & Agenda | 1 | 2 | 3 | 4 | 5 4 Agenda 1. Problem Statement 2. Filter-Verification Framework 3. Parallel SSJ Algorithm a. Idea b. Implementation 4. Extensions a. Sliding-Window SSJ b. Set Similarity Search 5. Experiments and Results
  4. 4. • Data representation • Every record (= document) is a set of tokens each representing a word • Input • A set of records R • A similarity function sim • A similarity threshold t • Output • All pairs of records (x, y) where sim(x, y) ≥ t (x ∈ R, y ∈ R) Problem Statement: Set Similarity Join 1. Problem Statement | 2 | 3 | 4 | 5 5
  5. 5. Example: Jaccard Similarity Function • R: • r₁ = a b c d e • r₂ = b c d e f • r3 = b c e f • Jaccard similarity function sim • sim(x, y) = • Similarity threshold t = 0.8 1. Problem Statement | 2 | 3 | 4 | 5 6 || || yx yx   a db e c f r₁ r2 a d b e c f r₁ r3 d b e c f r2 r3 sim(r₁, r2) = 4/6 < 0.8 not similar sim(r₁, r3) = 3/6 < 0.8 not similar sim(r2, r3) = 4/5 = 0.8 similar
  6. 6. • Naive Solution • Calculating similarity value for each pair → Too expensive • Better • Filter-verification framework • Filter step: Use filter to prune all pairs that cannot be similar • Verification step: Compute the similarity value for the remaining pairs How to calculate these pairs efficiently? 1 | 2. Filter-Verification Framework | 3 | 4 | 5 7
  7. 7. • Prefix filtering • (x,y) can only be similar if their prefixes share at least one token • Length filtering • Only documents with a similar length can be similar Common Filter Techniques 1 | 2. Filter-Verification Framework | 3 | 4 | 5 8 e f h a b h k a b c d he f h prefixes
  8. 8. → Our goal: Implement this approach as a parallel algorithm • Experiments with different prefix schemes • Extend it to a set similarity search and a sliding window SSJ algorithm General Approach 1 | 2. Filter-Verification Framework | 3 | 4 | 5 9 r1 b c d e f g h i j k r2 a d f g h i j k r3 c d e g h i j k r4 c e f i j k 1. Build index on prefixes of R records R, t = 0.75 a r2 b r1 c r1, r3, r4 d r2, r3 index 2. Probe (filter) for each record example: r1 list for each prefix token b r1 c r1, r3, r4 d r2, r3 candidates (r1, r2) (r1, r3) (r1, r4) 3. Verification (r1, r3, 0.8) prefix scheme +1 → prefix size +1 → needed overlap +1 a r2 b r1 c r1, r3, r4 d r2, r3, r1 e r3, r4 f r2 b r1 c r1, r3, r4 d r2, r3, r1 e r3, r4 similar pairs
  9. 9. 1. Preprocessing: order token sets by global token frequency 2. Build the inverted index on the prefixes Idea of the Parallel Algorithm (1) 1 | 2 | 3. Parallel SSJ Algorithm – Idea | 4 | 5 10 r1 b c d e f g h i j k r2 a d f g h i j k r3 c d e g h i j k r4 c e f i j k records R a r2 b r1 c r1 r3, r4 d r1, r2, r3 e r3, r4 f r2 t = 0.75, 2 prefix scheme index
  10. 10. Idea of the Parallel Algorithm (2) 3. Calculate the pre-candidates 1 | 2 | 3. Parallel SSJ Algorithm – Idea | 4 | 5 11 t = 0.75, 2 prefix scheme r1 b c d e f g h i j k r2 a d f g h i j k r3 c d e g h i j k r4 c e f i j k records R a r2 b r1 c r1, r3, r4 d r1, r2, r3 e r1, r3, r4 f r2, r4 g r2, r3 prefixes a r2 b r1 c r1 r3, r4 d r1, r2, r3 e r3, r4 f r2 index ⋈ 𝑓𝑖𝑙𝑡𝑒𝑟 𝑟𝑖𝑑 𝑙 < 𝑟𝑖𝑑 𝑟 (r1, r3) (r1, r4) (r3, r4) (r1, r2) (r1, r3) (r2, r3) (r1, r3) (r1, r4) (r3, r4) length length pre-candidates
  11. 11. 4. Filter the pre-candidates by their overlap 5. Calculate the similarities for all remaining pairs Idea of the Parallel Algorithm (3) 1 | 2 | 3. Parallel SSJ Algorithm – Idea | 4 | 5 12 t = 0.75, 2 prefix scheme (r1, r3) (r3, r4) (r1, r2) (r1, r3) (r2, r3) (r1, r3) (r3, r4) pre-candidates (r1, r2) 1 < 2 (r1, r3) 3 (r2, r3) 1 < 2 (r3, r4) 2 candidates (r1, r3, 0.8) similar pairs
  12. 12. 1. Preprocessing • Extract the tokens for each record with NORMALIZE • Count and sort the tokens with TABLE and SORT • PROJECT each token to a new internal token id • Reorder the token set of each record based on the token order with PROJECT Implementation (1) 1 | 2 | 3. Parallel SSJ Algorithm – Implementation | 4 | 5 13 r1 b k c d e f i g h j NORMALIZE(inputDS, LEFT.tokenCnt, extractToken(LEFT, COUNTER)) b k … TABLE b 1 k 4 … SORT a 1 … k 4
  13. 13. 2. Build the inverted index with the index prefixes • Extract the tokens, their position and length for each prefix with NORMALIZE 3. Calculate the pre-candidates • Extract and calculate the tokens, their position, the length lower bound and length for each prefix with NORMALIZE To accelerate the join of these tuples, we do not store the token sets in the tuples Implementation (2) 1 | 2 | 3. Parallel SSJ Algorithm – Implementation | 4 | 5 14 r1 b c d e f g h i j k b 1 10 r1 c 2 10 r1 d 3 10 r1 NORMALIZE(inputDSreordered, indexPrefixLength(LEFT.tokenCnt ), getIndexTupel(LEFT, COUNTER)) r1 b c d e f g h i j k b 1 8 10 r1 c 2 8 10 r1 d 3 8 10 r1 e 4 8 10 r1 NORMALIZE(inputDSreordered, prefixLength(LEFT.tokenCnt), getPrefixTupel(LEFT, COUNTER))
  14. 14. 3. (cont) Calculate the pre-candidates • DISTRIBUTE the index and prefix tuples depending on the tokens • JOIN the index tuples and prefix tuples locally Implementation (3) 1 | 2 | 3. Parallel SSJ Algorithm – Implementation | 4 | 5 15 e 3 8 r3 e 2 6 r4 e 4 8 10 r1 e 3 6 8 r3 e 2 5 6 r4 (distributed) prefixes index r1 … r3 … r3 … r4 …JOIN(prefixes, index, LEFT.token = RIGHT.token AND LEFT.lengthLowerBound <= RIGHT.length AND LEFT.rid < RIGHT.rid, TRANSFORM(…), local);
  15. 15. 4. Filter the pre-candidates • Count the overlap of each pair with DISTRIBUTE, SORT and ROLLUP • SKIP all pairs which do not pass the position filter • Check the overlap of each pair with PROJECT and SKIP if necessary Implementation (4) 1 | 2 | 3. Parallel SSJ Algorithm – Implementation | 4 | 5 16 ROLLUP(tupelDS, LEFT.rid1 = RIGHT.rid1 and LEFT.rid2 = RIGHT.rid2, countOverlap(LEFT,RIGHT), local) r1 r3 3 … (distributed and sorted depending on rids) r1 … r3 … r1 … r3 … r1 … r3 … r1 … r2 … ROLLUP(…) r1 r2 1 … r1 r3 3 … PROJECT PROJECT
  16. 16. Implementation (5) 5. Calculate the similarities for all remaining pairs (x,y) • JOIN the token set for each left record • JOIN the token set for each right record • calculate the similarity in the TRANSFORM function • SKIP all pairs which are not similar 1 | 2 | 3. Parallel SSJ Algorithm – Implementation | 4 | 5 17 (r1, r3, 0.8) JOIN(candidates, records, LEFT.rid1 = RIGHT.rid, TRANSFORM(…), HASH) similar pairs r1 r3 3 … b c d … recordsr1 r3 3 … JOIN(candidates2, records, LEFT.rid2 = RIGHT.rid, verifyPair(LEFT,RIGHT), HASH) records
  17. 17. • Divide each record into windows of size w • Search all window pairs (xi,yj) with sim(xi,yj) ≥ t (x,y in R, |xi|=|yj|=w) Extensions: Sliding Window Set Similarity Join (1) 1 | 2 | 3 | 4. Extensions | 5 18 x x1 x2 x3 x4 x5 w
  18. 18. • Approach: 1. Preprocessing • Calculate the token order as before • Replace each token with its new internal token id in each record with PROJECT • New: Divide each record into windows and sort each window‘s token set with NORMALIZE and SORT 2. Join • Build the inverted index with the index prefixes as before • New: JOIN the index prefix tuples with themselves (without length filter) • Filter the pre-candidates and verify the remaining pairs as before Extensions: Sliding Window Set Similarity Join (2) 1 | 2 | 3 | 4. Extensions | 5 19
  19. 19. Extensions: Set Similarity Search (1) 1 | 2 | 3 | 4. Extensions | 5 20 r1 b c d e f g h i j k r2 a d f g h i j k r3 c d e g h i j k r4 c e f i j k records R query s s a b c d e f g h i j (r1, s, 0.82) similar pairs ?
  20. 20. • Approach: 1. Preparation with THOR • Calculate the token order as before • New: BUILD two payload INDEXes for them 2. New: Searching with ROXIE • Preprocess the query record • Use an INDEX JOIN with the token order and SORT to reorder the tokens • Calculate the pre-candidates and candidates as before and verify them • Use an INDEX JOIN with the index to find the pre-candidates Extensions: Set Similarity Search (2) 1 | 2 | 3 | 4. Extensions | 5 21 token id new token idtoken order key payload token id, length rid, pos, token setindex key payload
  21. 21. • Datasets • flickrlondon, dblp, enron, netflix, csx • patent data 2005 and 2010 • Cluster configuration • 6 Thor nodes • 3 Thor slaves per node Experiments and Results 1 | 2 | 3 | 4 | 5. Experiments and Results 22 dataset size record count average record length flickrlondon 253 MB 1680490 9.8 dblp 685 MB 1268017 36.2 enron 1.0 GB 517431 133.6 netflix 1.1 GB 480189 209.3 csx 3.5 GB 1385532 148.9 2005 9.5 GB 157829 7248.7 (1814.7*) 2010 16.5 GB 244597 8175.3 (1999,9*) * without stopwords
  22. 22. Result 1: The prefix scheme has a big influence on SSJ runtime • Short prefixes and many rare tokens → Small prefix schemes • Long prefixes and few rare tokens → Large prefix schemes 1 | 2 | 3 | 4 | 5. Experiments and Results 23 20 40 60 80 100 120 1 2 3 4 runtime(s) scheme dblp 0 1000 2000 3000 4000 5000 6000 1 2 3 4 runtime(s) scheme netflix t = 0.75 t = 0.8 t = 0.85 t = 0.9 t = 0.95
  23. 23. • Sliding Window SSJ Result 2: The distributed SSJ algorithm serves well for sliding window join 1 | 2 | 3 | 4 | 5. Experiments and Results 24 Disk full 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 1 2 3 4 runtime(s) scheme Patent 2005 (without stopwords), w = 100 t = 0.75 t = 0.8 t = 0.85 t = 0.9 t = 0.95 → Same prefix scheme behavior as before 0 500 1000 1500 2000 2500 3000 7580859095 runtime(s) threshold (%) Patent 2005 (without stopwords), 3 prefix scheme w = 50 w = 100 → Smaller window size needs more time
  24. 24. • Set Similarity Search Result 3: The distributed SSJ algorithm serves well for sliding window search (2) 1 | 2 | 3 | 4 | 5. Experiments and Results 25 0 50 100 150 200 250 300 350 400 450 dblp enron csx flicklondon netflix averageruntime(s) 1 prefix scheme 2 prefix scheme 3 prefix scheme → 1 prefix scheme is best suited
  25. 25. • Questions? Thank you! 26
  26. 26. • Prefix filtering • (x,y) can only be similar if their prefixes share at least one token • Length filtering • Only documents with a similar length can be similar • Position filtering • (x,y) can only be similar if the postfix is long enough Common Filter Techniques 1 | 2. Filter-Verification Framework | 3 | 4 | 5 27 w w i j x y |x| - i |y| - j overlapw + 1 + min(|x|-i, |y|-j) ≥ 𝑡 1+𝑡 (|x|+|y|) ? e f h a b h k a b c d he f h prefixes
  27. 27. 1. Preprocessing • Read the input record set R • Count token appearances and calculate the token ordering T • Reorder the token set of each record based on the token ordering T Idea of the Distributed Algorithm (1) 1 | 2 | 3. Distributed SSJ Algorithm – Idea | 4 | 5 28 r1 b k c d e f i g h j r2 a d f k g h i j r3 c k d e j g h i r4 c e j f i k a 1 b 1 c 3 d 3 … k 4records R (unordered) token order T r1 b c d e f g h i j k r2 a d f g h i j k r3 c d e g h i j k r4 c e f i j k records R (ordered)
  28. 28. Result 1: The non parallel SSJ approach transfers well to a distributed algorithm 20 25 30 35 40 45 50 55 60 65 70 7580859095 runtime(s) threshold (%) dblp (2 prefix scheme) 1 | 2 | 3 | 4 | 5. Experiments and Results 29 20 220 420 620 820 1020 1220 1420 1620 1820 2020 7580859095 runtime(s) threshold (%) csx (2 prefix scheme)

×