Designing a generic Python Search Engine API - BarCampLondon 8
Published in: Technology, Design



  • 1. Designing a generic Python search engine API Richard Boulton @rboulton
  • 2. Lots of search engines ● Lucene, Xapian, Sphinx, Solr, ElasticSearch, Whoosh, Riak Search, Terrier, Lemur/Indri ● Also MySQL, PostgreSQL Full Text components ● Also client-side engines using Redis, Mongo, etc.
  • 3. Generic API? ● Don't know what search features you need in advance ● So, don't want to be stuck with an early choice. ● Also, don't want to learn new API for trying out new engine.
  • 4. I Need Input
  • 5. Philosophy ● Expose common features in standard way ● Emulate missing features ● Don't get in the way ● Build useful features on top
  • 6. Backend variation ● Fields / Schemas: not in Xapian, Lucene ● Schema modification: Sphinx, Whoosh can't
  • 7. Updates ● Sphinx: no updates (in progress) ● Does update happen synchronously? ● Do updates return docids, or do docids need to be supplied by client? ● Can docids be set by client? ● When do updates become live?
  • 8. Scaling ● Multiple database searches? ● Searching remote databases? ● Replication?
  • 9. Analysers ● Vary wildly between backends ● Stemming, splitting, n-grams ● Language detection ● Fuzzy searches ● Soundex (well, Metaphone)
  • 10. Queries ● Many different features available ● Most engines support arbitrary booleans ● Some have XOR! ● Some only permit sets of filters ● Weighting schemes ● Need to expose native backend query parsers
  • 11. Facets ● Information about result set ● Can be emulated (slow) ● Some backends approximate ● Some backends give stats, histograms
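The "can be emulated (slow)" point on the facets slide might look like the following minimal sketch. It assumes results are already fetched as plain dictionaries of fields; the function name `emulate_facets` and the sample data are hypothetical, not part of any backend's API. It is slow precisely because it walks every matching document on the client side.

```python
from collections import Counter


def emulate_facets(results, field):
    """Count the values of `field` across an already-fetched result set.

    Hypothetical helper: a backend with native facets would do this
    inside the engine instead of iterating over every document.
    """
    counts = Counter()
    for doc in results:
        values = doc.get(field, [])
        # A field may hold a single string or a list of strings.
        if isinstance(values, str):
            values = [values]
        counts.update(values)
    # Most frequent values first.
    return counts.most_common()


# Illustrative data only.
results = [
    {"title": "a", "tag": ["python", "search"]},
    {"title": "b", "tag": "python"},
    {"title": "c"},
]
facets = emulate_facets(results, "tag")
```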
  • 12. Other features ● Spelling correction ● Numeric and Date range searches ● Geospatial searches (box, geohash, distance)
  • 13. Proposed design ● SearchClient class for each backend. ● A definition of standard behaviours that all backends should provide. ● A definition of optional behaviours when more than one backend provides them.
  • 14. Proposed design ● Test suite to ensure that all backends support common features. ● Programmatic way of checking which features a backend supports? (Or just raise exception)
  • 15. Proposed design ● Convenience SearchClient factory function: c = multisearch.SearchClient('xapian', path=dbpath)
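One way the factory function and the feature check from the proposed design slides could fit together is sketched below. Only the `SearchClient(name, **kwargs)` call shape comes from the slides; the registry, the `supports()` / `require()` methods, the `UnsupportedFeature` exception and the `"memory"` backend are all assumptions made for illustration. The sketch offers both options the slides raise: ask programmatically, or just raise an exception.

```python
class UnsupportedFeature(Exception):
    """Raised when a backend lacks an optional behaviour (hypothetical)."""


class BaseSearchClient:
    # Optional behaviours this backend provides; names are illustrative.
    features = frozenset()

    def supports(self, feature):
        """Programmatic check for an optional feature."""
        return feature in self.features

    def require(self, feature):
        """Alternative style: raise if the feature is missing."""
        if not self.supports(feature):
            raise UnsupportedFeature(feature)


_BACKENDS = {}


def register_backend(name):
    """Class decorator that records a backend under a short name."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("memory")
class InMemoryClient(BaseSearchClient):
    """Toy backend used only to make the sketch runnable."""
    features = frozenset({"update", "client_docids"})

    def __init__(self, **options):
        self.options = options
        self.docs = {}


def SearchClient(backend, **options):
    """Convenience factory: look up the backend class and instantiate it."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return cls(**options)


client = SearchClient("memory", path="/tmp/example-db")
```

With this shape, swapping engines means changing only the first argument to the factory, which is the motivation given on the "Generic API?" slide.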
  • 16. Documents ● Must support dictionary of fields ● Unicode values ● List(unicode) values ● May support arbitrary other field types, or different data structures, if backend wants to.
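The minimum document contract on the slide above (a dictionary of fields whose values are unicode strings or lists of unicode strings) could be checked with something like this sketch. The function name `normalise_document` is hypothetical, and rejecting other value types is a simplification: the slide explicitly allows backends to accept more.

```python
def normalise_document(doc):
    """Return a copy of `doc` where every field maps to a list of strings.

    Hypothetical helper covering only the baseline every backend must
    accept; a real backend might allow additional field types.
    """
    normalised = {}
    for field, value in doc.items():
        if isinstance(value, str):
            # Single unicode value: wrap it so callers see one shape.
            normalised[field] = [value]
        elif isinstance(value, (list, tuple)) and all(
            isinstance(item, str) for item in value
        ):
            normalised[field] = list(value)
        else:
            raise TypeError(
                f"field {field!r}: expected str or list of str, "
                f"got {type(value).__name__}"
            )
    return normalised


doc = normalise_document({"title": "Xapian", "tag": ["search", "c++"]})
```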
  • 17. Schemas ● Fields have types ● Automatic type “guessing” (client or server side) ● Some standard minimal set of analysers ● Text in a language ● Untokenised values ● Don't define exact output; just intent of standard analysers.
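Client-side type "guessing" might be as simple as the sketch below. The type names `"text"`, `"numeric"` and `"date"` are assumptions chosen for illustration (the slides do not define a type vocabulary), and a real implementation would also need to pick an analyser, for example text in a language versus an untokenised value.

```python
import datetime


def guess_type(value):
    """Guess a schema type name from a Python value (illustrative only)."""
    if isinstance(value, (list, tuple)):
        # A list field takes the type of its items, which must agree.
        types = {guess_type(item) for item in value}
        if len(types) != 1:
            raise ValueError("cannot guess one type for an empty or mixed list")
        return types.pop()
    if isinstance(value, str):
        return "text"
    # datetime.datetime is a subclass of datetime.date, so both match here.
    if isinstance(value, datetime.date):
        return "date"
    # Note: bool is a subclass of int, so booleans also land here.
    if isinstance(value, (int, float)):
        return "numeric"
    raise ValueError(f"no guess for {type(value).__name__}")


def guess_schema(doc):
    """Map each field of a document to a guessed type name."""
    return {field: guess_type(value) for field, value in doc.items()}


schema = guess_schema({
    "title": "Hello",
    "price": 3.5,
    "published": datetime.date(2010, 10, 23),
    "tags": ["a", "b"],
})
```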
  • 18. Search representation ● Abstract query representation ● Tree of Python objects. ● Overloaded operators for boolean. ● Chainable methods. ● (have actually written this) ● SearchClient.my_query_type()
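A tree of Python objects with overloaded boolean operators and chainable methods could be sketched as follows. This is not the author's implementation: the class names `Query`, `Term`, `BooleanQuery` and `Boosted` and the `boost()` method are hypothetical, and the slide suggests queries are created through methods on the client, whereas this sketch builds them standalone to stay short.

```python
class Query:
    """Base node of an abstract, backend-independent query tree."""

    # Overloaded operators build new tree nodes instead of evaluating.
    def __and__(self, other):
        return BooleanQuery("AND", self, other)

    def __or__(self, other):
        return BooleanQuery("OR", self, other)

    def __xor__(self, other):
        return BooleanQuery("XOR", self, other)

    def boost(self, factor):
        """Chainable method: wrap this node with a weight multiplier."""
        return Boosted(self, factor)


class Term(Query):
    def __init__(self, field, value):
        self.field = field
        self.value = value

    def __repr__(self):
        return f"{self.field}:{self.value}"


class BooleanQuery(Query):
    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def __repr__(self):
        return f"({self.left!r} {self.op} {self.right!r})"


class Boosted(Query):
    def __init__(self, query, factor):
        self.query = query
        self.factor = factor

    def __repr__(self):
        return f"{self.query!r}^{self.factor}"


# Method calls bind tighter than `&`, so only the last term is boosted.
q = (Term("title", "python") | Term("body", "python")) & Term("lang", "en").boost(2)
```

Each backend would then walk this tree and translate it into its native query objects, raising or emulating where an operator such as XOR is not available.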
  • 19. Code ● Such as it is, on ● Suggestions for a better name appreciated ● Query representation is pretty good, rest is pretty rough.