Full Text Search on Google App Engine with BigTable Search at Austin PUG February 10, 2010 Percy Wegmann presenting
BigTable Search Presentation to Austin PUG

Presentation discussing full-text search on Google's App Engine, presented to the Austin Python Users' Group on February 10, 2010.

  1. 1. Full Text Search on Google App Engine with BigTable Search at Austin PUG February 10, 2010 Percy Wegmann presenting
  2. 2. The Problem
      You want to be able to do full-text search (you know, like on Google.com):
      - Against data stored in a Python Google App Engine application
      - Without using an external server/service
  4. 4. Full-text Search Basic Features
      Let's say that you want to search a repository of 2 documents containing the following text:
      1. “swan lake performed live at the Met”
      2. “swans are crowding ducks out of the local lake”
      A basic search engine should respond to queries as follows:
      - “swan lake”: returns both documents (inexact matching)
      - “swan dive”: returns both documents (boolean OR matching)
      - “swan lake duck”: returns document 2 first (ranking)
      - “crowds”: returns document 2 (stemming)
      - “of the”: returns neither (stopword removal)
      And it should do all of this quickly.
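The five query behaviours listed above can be reproduced with a very small sketch. Everything in it is an assumption made for illustration: the stopword list is hand-picked for these two documents, `stem` is a naive suffix stripper standing in for a real stemmer such as Porter2, and ranking simply counts how many distinct query terms a document matches. It is not how any of the libraries discussed in this talk is implemented.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list (assumption: just enough for this example).
STOPWORDS = {"a", "are", "at", "of", "out", "the"}

DOCS = {
    1: "swan lake performed live at the Met",
    2: "swans are crowding ducks out of the local lake",
}


def stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer, not Porter2."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, split into words, drop stopwords, stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def search(query):
    """Boolean OR search, ranked by the number of distinct matching terms."""
    wanted = set(terms(query))
    scores = Counter()
    for doc_id, text in DOCS.items():
        matched = wanted & set(terms(text))
        if matched:
            scores[doc_id] = len(matched)
    # Best score first; ties fall back to document id order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    for q in ("swan lake", "swan dive", "swan lake duck", "crowds", "of the"):
        print(q, "->", search(q))
```

Note that this version scans every document on every query, which is exactly the cost an inverted index avoids.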
  6. 6. More Advanced Features
      - Starts-with matching (for type-ahead completion)
      - Indexing of non-text fields (numeric, datetime, references, etc.)
      - Term weighting (e.g. rank matches on title higher than on body)
      - Faceted search (like Amazon or Cnet.com)
      - Background indexing (to speed up inserts)
      - Thesaurus (“mallard” would match “duck”)
      - Phrase matching (exact phrases rank higher than disjointed combinations of words)
  13. 13. The Contenders Stemming & Stopword Removal Boolean OR Ranking Datastore Query SearchableModel stopword removal only Bill Katz' Searchable x gae-search x BigTable Search x x x
  14. 14. What The Others Are Missing
      - Boolean OR/Ranking: without them, multi-term queries are almost pointless
      - Faceted Search: users are accustomed to this from sites like Amazon
      - Scalability: none of them uses an inverted index!
  15. 15. Introducing BigTable Search Switch to demo
  16. 16. How it Works: Inverted Index
      The index is organized by search term. This is how the big boys (Lucene, Sphinx, etc.) do it. Example from Wikipedia:
      Documents
      1. “it is what it is”
      2. “what is it”
      3. “it is a banana”
      Index (stores pointers to documents)
      “a”: {3}
      “banana”: {3}
      “is”: {1, 2, 3}
      “it”: {1, 2, 3}
      “what”: {1, 2}
      - To search for “it”, we only have to grab a single row from the index, yielding {1, 2, 3}
      - To search for “what or banana”, we grab two rows and take the union, yielding {1, 2, 3}
      - To search for “what and banana”, we grab two rows and take the intersection, yielding {}
      - To rank a search for “banana or it”, we take the union and count occurrences, yielding 3, 1, 2 in rank order
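The four lookups above can be sketched in a few lines of plain Python. This is an illustrative in-memory version only: the function names are hypothetical, terms are split on whitespace with no stemming or stopword removal, and "count occurrences" is interpreted here as counting how many of the query terms each document matches (an assumption, since the slide does not say exactly what is counted).

```python
from collections import Counter, defaultdict

DOCS = {1: "it is what it is", 2: "what is it", 3: "it is a banana"}


def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


INDEX = build_index(DOCS)


def search_or(*terms):
    """Union of the index rows for the given terms."""
    result = set()
    for term in terms:
        result |= INDEX.get(term, set())
    return result


def search_and(*terms):
    """Intersection of the index rows for the given terms."""
    rows = [INDEX.get(term, set()) for term in terms]
    return set.intersection(*rows) if rows else set()


def search_ranked(*terms):
    """OR search ordered by how many query terms each document matches."""
    hits = Counter()
    for term in terms:
        for doc_id in INDEX.get(term, set()):
            hits[doc_id] += 1
    # Most matching terms first; ties fall back to document id order.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


if __name__ == "__main__":
    print(search_or("it"))
    print(search_or("what", "banana"))
    print(search_and("what", "banana"))
    print(search_ranked("banana", "it"))
```

Each query touches only the rows for its own terms, so the cost depends on the number of query terms and the size of their rows rather than on the total number of documents.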
  22. 22. The Pain of Updating
      Remember our documents:
      1. “it is what it is”
      2. “what is it”
      3. “it is a banana”
      To add the first document, we have to update 3 index entries (“it”, “is” and “what”); the third document touches 4. The bigger the documents get, the more entries each write has to update. Worse, multiple documents are represented in a single index entry, so concurrency becomes a problem too: try locking on the index entry for “the”, and your entire system becomes effectively single-threaded!
  25. 25. The Solution to Updating: Asynchronous Updates
      [Diagram labels: Data Store, doc, calc Δ, queue, merge Δ; the “merge Δ” and “queue” stages repeat several times]
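As a rough sketch of the asynchronous idea named on this slide: the write path only calculates a delta (which terms a document gained or lost) and puts it on a queue, and a separate worker merges the queued deltas into the index. This is not BigTable Search's actual code. The standard-library `queue.Queue` and a single background thread are stand-ins chosen so the example runs anywhere, and `calc_delta` and `merge_worker` are hypothetical names.

```python
import queue
import threading
from collections import defaultdict

index = defaultdict(set)   # term -> ids of documents containing it
deltas = queue.Queue()     # pending index changes, merged in the background


def calc_delta(doc_id, old_text, new_text):
    """Enqueue only the changes between the old and new version of a document."""
    old_terms, new_terms = set(old_text.split()), set(new_text.split())
    for term in new_terms - old_terms:
        deltas.put(("add", term, doc_id))
    for term in old_terms - new_terms:
        deltas.put(("remove", term, doc_id))


def merge_worker():
    """Apply queued deltas one at a time; this is the only writer to the index."""
    while True:
        op, term, doc_id = deltas.get()
        if op == "add":
            index[term].add(doc_id)
        else:
            index[term].discard(doc_id)
        deltas.task_done()


threading.Thread(target=merge_worker, daemon=True).start()

# Saving a document returns as soon as its delta is queued.
calc_delta(1, "", "it is what it is")
calc_delta(2, "", "what is it")
calc_delta(2, "what is it", "what is a banana")  # an edit: gains two terms, loses "it"

deltas.join()  # wait for the merge, only so the result can be inspected here
print({term: sorted(ids) for term, ids in index.items()})
```

Because a single consumer applies the deltas, writers never contend for a lock on a hot entry such as “the”; the trade-off is that searches may briefly see a stale index.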
  26. 26. Code (at a Glance) Data Model Queues Code
  27. 27. The Better Answer?
      BigTable Search suffers from some significant limitations:
      - Fast search engines use custom file storage formats for performance; BigTable Search does not have this option and is consequently not fast
      - No phrase matching
      - No synonym or semantic matching
      Google is working on a full-text search solution (feature 217 on the Issues List: In Progress, no ETA, session scheduled for Google I/O in May).
  28. 28. Resources
      - pyporter2 (used by BigTable Search and others for stemming): http://github.com/mdirolf/pyporter2
      - SearchableModel: http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/search/__init__.py
      - Bill Katz' Simple Full-text Search for App Engine: http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine
      - gae-search: http://gae-full-text-search.appspot.com/
      - BigTable Search: http://code.google.com/p/bigtablesearch/
      - Google's Upcoming Full-text Search (feature 217): http://code.google.com/p/googleappengine/issues/detail?id=217
