Your SlideShare is downloading. ×
  • Like
Efficient Top-k Algorithms for Fuzzy Search in String Collections
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Now you can save presentations on your phone or tablet

Available for both IPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Efficient Top-k Algorithms for Fuzzy Search in String Collections

  • 1,594 views
Published

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function …

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.

Published in Education , Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,594
On SlideShare
0
From Embeds
0
Number of Embeds
1

Actions

Shares
Downloads
22
Comments
0
Likes
1

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Efficient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer Science University of California, Irvine First International Workshop on Keyword Search on Structured Data, 2009 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17
  • 2. Outline1 Motivation2 Efficient Top-k Algorithms Problem Formulation Algorithms Overview Top-k Single-Pass Search Algorithm3 Experimental Evaluation Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 2 / 17
  • 3. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  • 4. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = ’Shwartzenetrugger’ ORDER BY ’# Movies’ DESC LIMIT 1; Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  • 5. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = ’Shwartzenetrugger’ ORDER BY ’# Movies’ DESC LIMIT 1; 0 Results Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  • 6. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 7. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 8. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 9. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Which one result should the system return? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 10. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . .Figure: Actor names, popularities, and edit distances to a query string“Shwartzenetrugger”. Which one result should the system return? Which value is more important, # Movies or similarity? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 11. Top-k Similar Strings Given: Weighted string collection e.g., actors’ LastName and # Movies Query string e.g., “Shwartzenetrugger” Similarity function e.g, Edit Distance Scoring function (score of a string in terms of similarity and weight) e.g., linear combination of similarity and popularity Integer k Return: k best strings in terms of overall score to the query string. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
  • 12. Top-k Similar Strings Given: Weighted string collection e.g., actors’ LastName and # Movies Query string e.g., “Shwartzenetrugger” Similarity function e.g, Edit Distance Scoring function (score of a string in terms of similarity and weight) e.g., linear combination of similarity and popularity Integer k Return: k best strings in terms of overall score to the query string. Advantages over Range Search: specify k instead of a similarity threshold guaranteed k results; a range search might have 0 results Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
  • 13. Algorithms Overview Range Search Range Search Iterative Range Single-Pass Search Search Two-Phase Search Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 6 / 17
  • 14. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 15. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 16. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 17. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 18. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 19. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 20. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 21. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic,ca} Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 22. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic,ca}Veronica → {Ve,er,ro,on,ni,ic,ca} Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 23. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2Vernica → {Ve,er,rn,ni,ic,ca}Veronica → {Ve,er,ro,on,ni,ic,ca} Similar strings share a certain number of grams Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 24. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60 4 abcd 0.50 5 bcc 0.40 Figure: Dataset Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 25. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 3 ccd cd 0.70 0.60 ⇒ 1 4 2 5 2 3 4 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 26. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 ccd 0.70 1 2 2 4 3 cd 0.60 4 5 3 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 27. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 ccd 0.70 1 2 2 4 3 cd 0.60 4 5 3 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 28. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 3 ccd cd 0.70 0.60 ⇐ 1 4 2 5 2 3 4 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Identified strings are verified by computing the real similarity. Verification is usually an expensive process. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 29. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60 4 abcd 0.50 5 bcc 0.40 Figure: Dataset ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 30. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 31. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending order ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 32. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending order Scan the lists corresponding to the ab cc cd bc grams in the query. 1 2 2 4 e.g., “bcd” 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 33. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60Naïve approach: Round-Robin 4 abcd 0.50 Scan all the lists in the same time 5 bcc 0.40 Maintain a list of “open” IDs (might Figure: Dataset still appear on some of the lists) Store the best k “closed” IDs in a top-k buffer ab cc cd bc Stop when the top-k buffer cannot 1 2 2 4 improve 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 34. Top-k Single-pass Search Algorithm ab bc cd 1 2 2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram inverted-lists for query “abcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 35. Top-k Single-pass Search Algorithm ab bc cd →1 2 2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 36. Top-k Single-pass Search Algorithm ab bc cd →1 →2 2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 37. Top-k Single-pass Search Algorithm ab bc cd →1 →2 →2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 38. Top-k Single-pass Search Algorithm ab bc cd →1 →2 →2Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 39. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: →1 →2 →2 Top-kHeap-based 20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the . . 2 list of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 40. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: 1 →2 →2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 41. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: 1 →2 →2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 42. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 →2 →2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 43. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 3 20 20 1 No need to maintain the list . . 4 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 44. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 3 20 20 1 No need to maintain the list . . 4 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 45. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 4 20 20 1 No need to maintain the list . . 20 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 46. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 4 20 20 1 No need to maintain the list . . 20 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 47. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 48. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-kHeap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 →20 →20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 49. Algorithms Overview Range Search Range Search Iterative Range Single-Pass Search Search Two-Phase Search Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 11 / 17
  • 50. Experimental Setting Datasets: IMDB Actor Names1 actor names and the number of movies they played in 1.2 million actors, average name length 15 weight is the number of movies (log normalized) WEB Corpus Word Grams2 sequences of English words and their frequency on the Web 2.4 million sequences, average sequence length 20 weight is the frequency (log normalized) Jaccard similarity and normalized edit similarity, q = 3 Index and data are stored in main memory at all times Implemented in C++ (GNU compiler) on Ubuntu Linux OS Intel 2.40GHz PC, 2GB RAM 1 http://www.imdb.com/interfaces 2 http://www.ldc.upenn.edu/Catalog LDC2006T13 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 12 / 17
  • 51. Benefits of Skipping Elements 50 SPS 40 SPS* Average running time for Time (ms) 30 top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass 20 Search (SPS) algorithm and SPS without 10 skipping (SPS*). 0 0.2 0.4 0.6 0.8 1.0 1.2 Dataset Size (millions) Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 13 / 17
  • 52. Potential of the Two-Phase Algorithm 40 30 Running time for 3 Time (ms) top-10 queries. WEB Corpus dataset with 20 normalized edit similarity. Two-Phase algorithm with different 10 initial thresholds. 0 Q1 Q2 Q3 Queries Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 14 / 17
  • 53. Optimum Initial Threshold for the Two-Phase Algorithm 100 Average running time for top-10 queries. Web 80 Corpus dataset with normalized edit Time (ms) 60 similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) 40 algorithm, 2PH with the SPS optimum initial threshold 20 2PH (2PH Opt). 2PH Opt The Iterative Range Search 0 algorithm was to expensive to be plotted. The average running time 0.4 0.8 1.2 1.6 2.0 2.4 was around 5s. Dataset Size (millions) Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 15 / 17
  • 54. Summary Approximate ranking queries in string collections Useful when mismatch between query and data Propose three approaches to solve the problem: 1 Use existing approximate range search algorithms as a “black box” Proves to be the most expensive 2 Use particularities of the top-k problem Proves to be very efficient 3 Combine (1) and (2) sequentially Proves to be slightly more efficient Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 16 / 17
  • 55. The Flamingo Project This work is part of The Flamingo Project at UC Irvine http://flamingo.ics.uci.edu Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17
  • 56. A Quick Note on Related WorkFagin et. al [1] Our Setting similarity on multiple similarity on one string numerical attributes attribute traverse list of IDs traverse list of IDs one list per attribute one list per q-gram lists sorted on similarity to lists sorted on global weight that attribute lists have different orders of lists have the same order of IDs IDs all IDs appear on all the lists a subset of IDs appear on each list[1] R.Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware.In PODS, 2001 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17