Designing a generic Python Search Engine API - BarCampLondon 8
Lots of search engines
● Lucene, Xapian, Sphinx, Solr, ElasticSearch,
Whoosh, Riak Search, Terrier, Lemur/Indri
● Also MySQL, PostgreSQL Full Text
● Also client-side engines using Redis, Mongo,
● Don't know what search features you need in
● So, don't want to be stuck with an early choice.
● Also, don't want to learn new API for trying out
● Expose common features in standard way
● Emulate missing features
● Don't get in the way
● Build useful features on top
● Fields / Schemas: not in Xapian, Lucene
● Schema modification: Sphinx, Whoosh can't
● Sphinx: no updates (in progress)
● Do updates happen synchronously?
● Do updates return docids, or do docids need to
be supplied by client?
● Can docids be set by client?
● When do updates become live?
● Many different features available
● Most engines support arbitrary booleans
● Some have XOR!
● Some only permit sets of filters
● Weighting schemes
● Need to expose native backend query parsers
● Information about result set
● Can be emulated (slow)
● Some backends approximate
● Some backends give stats, histograms
● Spelling correction
● Numeric and Date range searches
● Geospatial searches (box, geohash, distance)
● SearchClient class for each backend.
● A definition of standard behaviours that all
backends should provide.
● A definition of optional behaviours when more
than one backend provides them.
● Test suite to ensure that all backends support
● Programmatic way of checking which features a
backend supports? (Or just raise exception)
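One way the two options above (a programmatic feature check, or just raising an exception) could fit together is sketched below. This is only an illustration: the names `UnsupportedFeature`, `features`, `supports`, `require` and `InMemoryClient`, and the feature strings, are hypothetical and not taken from any existing library.

```python
class UnsupportedFeature(Exception):
    """Raised when a backend is asked for a feature it does not provide."""


class SearchClient:
    """Hypothetical base class; each backend would provide a subclass."""

    # Optional features this backend claims to support (illustrative names).
    features = frozenset()

    def supports(self, feature):
        """Programmatic check: does this backend offer the named feature?"""
        return feature in self.features

    def require(self, feature):
        """Raise UnsupportedFeature unless the backend offers the feature."""
        if not self.supports(feature):
            raise UnsupportedFeature(
                "%s does not support %r" % (type(self).__name__, feature))

    def spell_correct(self, text):
        # Optional behaviour: guarded so unsupported backends fail loudly.
        self.require("spelling")
        return self._spell_correct(text)

    def _spell_correct(self, text):
        raise NotImplementedError


class InMemoryClient(SearchClient):
    """Toy backend that claims only the 'facets' feature."""
    features = frozenset(["facets"])
```

Callers that want to degrade gracefully can call `supports()` first; callers that don't care can just call the method and catch the exception.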
● Must support dictionary of fields
● Unicode values
● List(unicode) values
● May support arbitrary other field types, or
different data structures, if backend wants to.
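The minimum document shape described above could be checked with something like the following sketch. It assumes Python 3, where `str` plays the role of `unicode`; the function name `validate_document` is hypothetical.

```python
def validate_document(doc):
    """Check that doc maps field names to text or lists of text.

    This covers only the minimal shape; a backend that accepts richer
    field types would relax this check.
    """
    if not isinstance(doc, dict):
        raise TypeError("document must be a dictionary of fields")
    for name, value in doc.items():
        if isinstance(value, str):
            continue
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        raise TypeError("field %r: expected text or a list of text" % (name,))
    return doc
```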
● Fields have types
● Automatic type “guessing” (client or server side)
● Some standard minimal set of analysers
● Text in a language
● Untokenised values
● Don't define exact output; just intent of standard
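Client-side type guessing might look roughly like the sketch below. The type names (`"numeric"`, `"date"`, `"text"`, `"exact"`) and the whitespace heuristic are assumptions made for illustration, not part of any defined standard.

```python
import datetime


def guess_field_type(value):
    """Return an illustrative field type name for a single value."""
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, (datetime.date, datetime.datetime)):
        return "date"
    if isinstance(value, str):
        # Assumption: text containing whitespace is prose to be analysed;
        # anything else is treated as an untokenised value.
        return "text" if any(ch.isspace() for ch in value) else "exact"
    raise TypeError("cannot guess a field type for %r" % (value,))
```

An explicit schema would still override the guess; guessing only saves typing while trying a backend out.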
● Abstract query representation
● Tree of Python objects.
● Overloaded operators for boolean.
● Chainable methods.
● (have actually written this)
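A query tree with overloaded boolean operators and chainable methods could be sketched as follows. This is not the implementation mentioned above; the class names `Query`, `Term` and `Op`, the methods `and_not` and `filter`, and the helper `expand_xor` are all hypothetical. The helper shows one way a missing feature such as XOR might be emulated on a backend that lacks it.

```python
class Query:
    """Base node of the query tree."""

    def __and__(self, other):
        return Op("AND", self, other)

    def __or__(self, other):
        return Op("OR", self, other)

    def __xor__(self, other):
        return Op("XOR", self, other)

    def and_not(self, other):
        """Chainable: match self, excluding anything matching other."""
        return Op("AND_NOT", self, other)

    def filter(self, other):
        """Chainable: restrict self to documents matching other."""
        return Op("FILTER", self, other)


class Term(Query):
    """Leaf node: a single value in a single field."""

    def __init__(self, field, value):
        self.field, self.value = field, value

    def __repr__(self):
        return "%s:%s" % (self.field, self.value)


class Op(Query):
    """Interior node: a binary operator over two subqueries."""

    def __init__(self, name, left, right):
        self.name, self.left, self.right = name, left, right

    def __repr__(self):
        return "(%r %s %r)" % (self.left, self.name, self.right)


def expand_xor(q):
    """Rewrite a XOR b as (a OR b) AND_NOT (a AND b), recursively."""
    if not isinstance(q, Op):
        return q
    left, right = expand_xor(q.left), expand_xor(q.right)
    if q.name == "XOR":
        return (left | right).and_not(left & right)
    return Op(q.name, left, right)
```

For example, `(Term("title", "python") | Term("tag", "search")).filter(Term("lang", "en"))` builds a three-node tree without touching any backend; each `SearchClient` would then translate the tree into its native query form.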
● Such as it is, on
● Suggestions for a better name appreciated
● Query representation is pretty good, rest is