Designing a generic Python Search Engine API - BarCampLondon 8
Published in: Technology, Design

  • 1. Designing a generic Python search engine API Richard Boulton @rboulton
  • 2. Lots of search engines ● Lucene, Xapian, Sphinx, Solr, ElasticSearch, Whoosh, Riak Search, Terrier, Lemur/Indri ● Also MySQL, PostgreSQL Full Text components ● Also client-side engines using Redis, Mongo, etc.
  • 3. Generic API? ● Don't know what search features you need in advance ● So, don't want to be stuck with an early choice. ● Also, don't want to learn new API for trying out new engine.
  • 4. I Need Input
  • 5. Philosophy ● Expose common features in standard way ● Emulate missing features ● Don't get in the way ● Build useful features on top
  • 6. Backend variation ● Fields / Schemas: not in Xapian, Lucene ● Schema modification: Sphinx, Whoosh can't
  • 7. Updates ● Sphinx: no updates (in progress) ● Do updates happen synchronously? ● Do updates return docids, or do docids need to be supplied by the client? ● Can docids be set by the client? ● When do updates become live?
  • 8. Scaling ● Multiple database searches? ● Searching remote databases? ● Replication?
  • 9. Analysers ● Vary wildly between backends ● Stemming, splitting, n-grams ● Language detection ● Fuzzy searches ● Soundex (well, Metaphone)
  • 10. Queries ● Many different features available ● Most engines support arbitrary booleans ● Some have XOR! ● Some only permit sets of filters ● Weighting schemes ● Need to expose native backend query parsers
  • 11. Facets ● Information about result set ● Can be emulated (slow) ● Some backends approximate ● Some backends give stats, histograms
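The "can be emulated (slow)" point on the facets slide can be illustrated with a minimal sketch: when a backend has no native facet support, a client can walk the matching documents and count field values itself. The function name `emulate_facets` and the sample documents are hypothetical, not part of any library mentioned in the talk.

```python
from collections import Counter

def emulate_facets(results, field):
    """Count how often each value of `field` occurs in the matching documents.

    This is the slow path: it touches every result, which is why a
    backend with native facet support would normally be preferred.
    """
    counts = Counter()
    for doc in results:
        values = doc.get(field, [])
        # A field may hold a single string or a list of strings.
        if isinstance(values, str):
            values = [values]
        counts.update(values)
    return counts

# Hypothetical result set, using the dictionary-of-fields document shape.
docs = [
    {"title": "Xapian intro", "tag": ["search", "c++"]},
    {"title": "Whoosh intro", "tag": ["search", "python"]},
    {"title": "Redis notes", "tag": "database"},
]
facets = emulate_facets(docs, "tag")
```

The cost grows with the size of the result set, which is presumably why some backends approximate instead.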
  • 12. Other features ● Spelling correction ● Numeric and Date range searches ● Geospatial searches (box, geohash, distance)
  • 13. Proposed design ● SearchClient class for each backend. ● A definition of standard behaviours that all backends should provide. ● A definition of optional behaviours when more than one backend provides them.
  • 14. Proposed design ● Test suite to ensure that all backends support common features. ● Programmatic way of checking which features a backend supports? (Or just raise exception)
  • 15. Proposed design ● Convenience SearchClient factory function: c = multisearch.SearchClient('xapian', path=dbpath)
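One way the three "Proposed design" slides could fit together is sketched below: a registry maps backend names to client classes, a factory function looks the class up, and each client advertises its optional features so callers can either check first or catch an exception. Everything here except the `SearchClient(name, **kwargs)` call shape is an assumption made for illustration (`BaseSearchClient`, `register_backend`, `supports`, `require`, `FeatureNotSupported`, and the toy `memory` backend); it is not the multisearch implementation.

```python
class FeatureNotSupported(Exception):
    """Raised when a backend is asked for an optional feature it lacks."""

class BaseSearchClient:
    # Optional features this backend provides; subclasses override this.
    features = frozenset()

    def supports(self, feature):
        """Programmatic check for an optional feature."""
        return feature in self.features

    def require(self, feature):
        """Alternative style: raise if the feature is missing."""
        if not self.supports(feature):
            raise FeatureNotSupported(
                f"{type(self).__name__} lacks {feature!r}")

_BACKENDS = {}

def register_backend(name):
    """Class decorator that records a client class under a backend name."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator

@register_backend("memory")
class InMemoryClient(BaseSearchClient):
    """Toy backend standing in for a real engine such as Xapian."""
    features = frozenset({"facets"})

    def __init__(self, path=None):
        self.path = path
        self.docs = {}

def SearchClient(backend, **kwargs):
    """Factory: build the client registered for `backend`."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return cls(**kwargs)

client = SearchClient("memory", path="/tmp/db")
```

Offering both `supports()` and `require()` leaves the slide's open question (check up front, or just raise) to the caller.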
  • 16. Documents ● Must support dictionary of fields ● Unicode values ● List(unicode) values ● May support arbitrary other field types, or different data structures, if backend wants to.
  • 17. Schemas ● Fields have types ● Automatic type “guessing” (client or server side) ● Some standard minimal set of analysers ● Text in a language ● Untokenised values ● Don't define exact output; just intent of standard analysers.
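As a rough illustration of client-side type "guessing", the sketch below inspects the Python values in a document and picks a field type for each. The type names (`text`, `number`, `date`, `untokenised`) and the functions `guess_field_type` and `guess_schema` are assumptions chosen for this example; the slides do not specify the actual rules.

```python
from datetime import date

def guess_field_type(value):
    """Guess a schema type from a Python value (illustrative rules only)."""
    if isinstance(value, (list, tuple)):
        # A list field takes the type of its items if they all agree.
        kinds = {guess_field_type(v) for v in value}
        return kinds.pop() if len(kinds) == 1 else "text"
    if isinstance(value, bool):
        # Checked before int, because bool is a subclass of int.
        return "untokenised"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, date):
        # datetime is a subclass of date, so it is covered too.
        return "date"
    if isinstance(value, str):
        return "text"
    raise TypeError(f"cannot guess a field type for {type(value).__name__}")

def guess_schema(doc):
    """Map each field name in a document to a guessed type."""
    return {field: guess_field_type(value) for field, value in doc.items()}

schema = guess_schema({
    "title": "Hello world",
    "tags": ["a", "b"],
    "year": 2010,
    "published": date(2010, 10, 23),
})
```

A guess like this only records the intent of the analyser; what the backend does with a `text` field would still vary.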
  • 18. Search representation ● Abstract query representation ● Tree of python objects. ● Overloaded operators for boolean. ● Chainable methods. ● (have actually written this) ● SearchClient.my_query_type()
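The idea of an abstract query built as a tree of Python objects, with overloaded boolean operators and chainable methods, might look something like the sketch below. It is not the author's implementation (which the slide says already exists); the class names `Query`, `Term`, `Bool`, and `Not` and the `filter` method are hypothetical.

```python
class Query:
    """Base node of the query tree; operators build new nodes."""

    def __and__(self, other):
        return Bool("AND", self, other)

    def __or__(self, other):
        return Bool("OR", self, other)

    def __invert__(self):
        return Not(self)

    def filter(self, other):
        """Chainable method: restrict this query by another one."""
        return Bool("AND", self, other)

class Term(Query):
    """Leaf node: a single value in a single field."""
    def __init__(self, field, value):
        self.field = field
        self.value = value

    def __repr__(self):
        return f"{self.field}:{self.value}"

class Bool(Query):
    """Inner node combining child queries with AND or OR."""
    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def __repr__(self):
        return "(" + f" {self.op} ".join(map(repr, self.children)) + ")"

class Not(Query):
    """Inner node negating a child query."""
    def __init__(self, child):
        self.child = child

    def __repr__(self):
        return f"NOT {self.child!r}"

q = (Term("title", "python") | Term("title", "search")) & ~Term("tag", "draft")
chained = q.filter(Term("lang", "en"))
```

Because the tree is backend-neutral, each client would be free to translate it into its engine's native query form.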
  • 19. Code ● Such as it is, on ● Suggestions for a better name appreciated ● Query representation is pretty good, rest is pretty rough.