Efficient Top-k Algorithms for Fuzzy Search in String                     Collections                             Rares Ver...
Outline1   Motivation2   Efficient Top-k Algorithms      Problem Formulation      Algorithms Overview      Top-k Single-Pas...
Need for Approximate String Queries               ID       FirstName        LastName                   # Movies           ...
Need for Approximate String Queries               ID       FirstName        LastName                   # Movies           ...
Need for Approximate String Queries               ID       FirstName        LastName                   # Movies           ...
Need for Ranking         ID       FirstName    LastName                      # Movies     ED         10       Al          ...
Need for Ranking         ID       FirstName    LastName                      # Movies     ED         10       Al          ...
Need for Ranking         ID       FirstName    LastName                      # Movies     ED         10       Al          ...
Need for Ranking         ID       FirstName    LastName                      # Movies     ED         10       Al          ...
Need for Ranking         ID       FirstName    LastName                      # Movies     ED         10       Al          ...
Top-k Similar Strings    Given:            Weighted string collection                    e.g., actors’ LastName and # Movi...
Top-k Similar Strings    Given:            Weighted string collection                    e.g., actors’ LastName and # Movi...
Algorithms Overview                                                             Range                                     ...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-grams: Overlapping substrings of fixed length    Find similar strings: e.g., “Vernica” and “Veronica”    q-gram: substrin...
q-gram Inverted List Index    q=2       ID      String         Weight        1      ab              0.80        2      ccd...
q-gram Inverted List Index    q=2       ID      String         Weight        1      ab              0.80                  ...
q-gram Inverted List Index    q=2       ID      String         Weight        1      ab              0.80                  ...
q-gram Inverted List Index    q=2       ID      String         Weight        1      ab              0.80                  ...
q-gram Inverted List Index    q=2       ID      String         Weight        1      ab              0.80                  ...
Top-k Single-pass Search Algorithm                                                             ID    String    Weight     ...
Top-k Single-pass Search Algorithm                                                             ID    String    Weight     ...
Top-k Single-pass Search Algorithm                                                             ID    String    Weight     ...
Top-k Single-pass Search Algorithm                                                             ID    String    Weight     ...
Top-k Single-pass Search Algorithm                                                              ID    String    Weight    ...
Top-k Single-pass Search Algorithm                                                         ab     bc       cd             ...
Top-k Single-pass Search Algorithm                                                       ab       bc       cd             ...
Top-k Single-pass Search Algorithm                                                       ab      bc        cd             ...
Top-k Single-pass Search Algorithm                                                       ab      bc       cd              ...
Top-k Single-pass Search Algorithm                                                       ab      bc       cd              ...
Top-k Single-pass Search Algorithm                                                                                    1   ...
Top-k Single-pass Search Algorithm                                                                                      1 ...
Top-k Single-pass Search Algorithm                                                                                      1 ...
Top-k Single-pass Search Algorithm                                                                                      2 ...
Top-k Single-pass Search Algorithm                                                                                      2 ...
Top-k Single-pass Search Algorithm                                                                                      2 ...
Top-k Single-pass Search Algorithm                                                                                      2 ...
Top-k Single-pass Search Algorithm                                                                                      2 ...
Top-k Single-pass Search Algorithm                                                                                      2 ...
Top-k Single-pass Search Algorithm                                                                                      2 ...
Algorithms Overview                                                             Range                                     ...
Experimental Setting       Datasets:            IMDB Actor Names1                    actor names and the number of movies ...
Benefits of Skipping Elements             50                          SPS             40           SPS*                    ...
Potential of the Two-Phase Algorithm             40             30                                                       R...
Optimum Initial Threshold for the Two-Phase Algorithm             100                                                     ...
Summary   Approximate ranking queries in string collections   Useful when mismatch between query and data   Propose three ...
The Flamingo Project                                  This work is part of                           The Flamingo Project ...
A Quick Note on Related WorkFagin et. al [1]                               Our Setting     similarity on multiple         ...
Upcoming SlideShare
Loading in...5
×

Efficient Top-k Algorithms for Fuzzy Search in String Collections

1,748

Published on

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.

Published in: Education, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,748
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
26
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "Efficient Top-k Algorithms for Fuzzy Search in String Collections"

  1. 1. Efficient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer Science University of California, Irvine First International Workshop on Keyword Search on Structured Data, 2009 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17
  2. 2. Outline1 Motivation2 Efficient Top-k Algorithms Problem Formulation Algorithms Overview Top-k Single-Pass Search Algorithm3 Experimental Evaluation Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 2 / 17
  3. 3. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  4. 4. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = ’Shwartzenetrugger’ ORDER BY ’# Movies’ DESC LIMIT 1; Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  5. 5. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = ’Shwartzenetrugger’ ORDER BY ’# Movies’ DESC LIMIT 1; 0 Results Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  6. 6. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  7. 7. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  8. 8. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  9. 9. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Which one result should the system return? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  10. 10. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Which one result should the system return? Which value is more important, # Movies or similarity? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  11. 11. Top-k Similar Strings Given: Weighted string collection e.g., actors’ LastName and # Movies Query string e.g., “Shwartzenetrugger” Similarity function e.g, Edit Distance Scoring function (score of a string in terms of similarity and weight) e.g., linear combination of similarity and popularity Integer k Return: k best strings in terms of overall score to the query string. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
  12. 12. Top-k Similar Strings Given: Weighted string collection e.g., actors’ LastName and # Movies Query string e.g., “Shwartzenetrugger” Similarity function e.g, Edit Distance Scoring function (score of a string in terms of similarity and weight) e.g., linear combination of similarity and popularity Integer k Return: k best strings in terms of overall score to the query string. Advantages over Range Search: specify k instead of a similarity threshold guaranteed k results; a range search might have 0 results Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
  13. 13. Algorithms Overview Range Search Range Search Iterative Range Single-Pass Search Search Two-Phase Search Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 6 / 17
  14. 14. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  15. 15. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  16. 16. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  17. 17. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  18. 18. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  19. 19. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  20. 20. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  21. 21. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic,ca} Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  22. 22. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic,ca}Veronica → {Ve,er,ro,on,ni,ic,ca} Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  23. 23. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic,ca}Veronica → {Ve,er,ro,on,ni,ic,ca} Similar strings share a certain number of grams Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  24. 24. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60 4 abcd 0.50 5 bcc 0.40 Figure: Dataset Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  25. 25. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 3 ccd cd 0.70 0.60 ⇒ 1 4 2 5 2 3 4 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  26. 26. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 ccd 0.70 1 2 2 4 3 cd 0.60 4 5 3 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  27. 27. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 ccd 0.70 1 2 2 4 3 cd 0.60 4 5 3 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  28. 28. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 3 ccd cd 0.70 0.60 ⇐ 1 4 2 5 2 3 4 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Identified strings are verified by computing the real similarity. Verification is usually an expensive process. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  29. 29. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60 4 abcd 0.50 5 bcc 0.40 Figure: Dataset ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  30. 30. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  31. 31. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending order ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  32. 32. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending order Scan the lists corresponding to the ab cc cd bc grams in the query. 1 2 2 4 e.g., “bcd” 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  33. 33. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60Naïve approach: Round-Robin 4 abcd 0.50 Scan all the lists in the same time 5 bcc 0.40 Maintain a list of “open” IDs (might Figure: Dataset still appear on some of the lists) Store the best k “closed” IDs in a top-k buffer ab cc cd bc Stop when the top-k buffer cannot 1 2 2 4 improve 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  34. 34. Top-k Single-pass Search Algorithm ab bc cd 1 2 2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram inverted-lists for query “abcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  35. 35. Top-k Single-pass Search Algorithm ab bc cd →1 2 2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  36. 36. Top-k Single-pass Search Algorithm ab bc cd →1 →2 2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  37. 37. Top-k Single-pass Search Algorithm ab bc cd →1 →2 →2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  38. 38. Top-k Single-pass Search Algorithm ab bc cd →1 →2 →2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  39. 39. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: →1 →2 →2 Top-kHeap-based 20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the . . 2 list of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  40. 40. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: 1 →2 →2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  41. 41. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: 1 →2 →2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  42. 42. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 →2 →2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  43. 43. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 3 20 20 1 No need to maintain the list . . 4 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  44. 44. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 3 20 20 1 No need to maintain the list . . 4 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  45. 45. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 4 20 20 1 No need to maintain the list . . 20 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  46. 46. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 4 20 20 1 No need to maintain the list . . 20 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  47. 47. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  48. 48. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 →20 →20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  49. 49. Algorithms Overview Range Search Range Search Iterative Range Single-Pass Search Search Two-Phase Search Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 11 / 17
  50. 50. Experimental Setting Datasets: IMDB Actor Names1 actor names and the number of movies they played in 1.2 million actors, average name length 15 weight is the number of movies (log normalized) WEB Corpus Word Grams2 sequences of English words and their frequency on the Web 2.4 million sequences, average sequence length 20 weight is the frequency (log normalized) Jaccard similarity and normalized edit similarity, q = 3 Index and data are stored in main memory at all times Implemented in C++ (GNU compiler) on Ubuntu Linux OS Intel 2.40GHz PC, 2GB RAM 1 http://www.imdb.com/interfaces 2 http://www.ldc.upenn.edu/Catalog LDC2006T13 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 12 / 17
  51. 51. Benefits of Skipping Elements 50 SPS 40 SPS* Average running time for Time (ms) 30 top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass 20 Search (SPS) algorithm and SPS without 10 skipping (SPS*). 0 0.2 0.4 0.6 0.8 1.0 1.2 Dataset Size (millions) Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 13 / 17
  52. 52. Potential of the Two-Phase Algorithm 40 30 Running time for 3 Time (ms) top-10 queries. WEB Corpus dataset with 20 normalized edit similarity. Two-Phase algorithm with different 10 initial thresholds. 0 Q1 Q2 Q3 Queries Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 14 / 17
  53. 53. Optimum Initial Threshold for the Two-Phase Algorithm 100 Average running time for top-10 queries. Web 80 Corpus dataset with normalized edit Time (ms) 60 similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) 40 algorithm, 2PH with the SPS optimum initial threshold 20 2PH (2PH Opt). 2PH Opt The Iterative Range Search 0 algorithm was to expensive to be plotted. The average running time 0.4 0.8 1.2 1.6 2.0 2.4 was around 5s. Dataset Size (millions) Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 15 / 17
  54. 54. Summary Approximate ranking queries in string collections Useful when mismatch between query and data Propose three approaches to solve the problem: 1 Use existing approximate range search algorithms as a “black box” Proves to be the most expensive 2 Use particularities of the top-k problem Proves to be very efficient 3 Combine (1) and (2) sequentially Proves to be slightly more efficient Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 16 / 17
  55. 55. The Flamingo Project This work is part of The Flamingo Project at UC Irvine http://flamingo.ics.uci.edu Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17
  56. 56. A Quick Note on Related WorkFagin et. al [1] Our Setting similarity on multiple similarity on one string numerical attributes attribute traverse list of IDs traverse list of IDs one list per attribute one list per q-gram lists sorted on similarity to lists sorted on global weight that attribute lists have different orders of lists have the same order of IDs IDs all IDs appear on all the lists a subset of IDs appear on each list[1] R.Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware.In PODS, 2001 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×