Designing a generic Python Search Engine API - BarCampLondon 8


  1. Designing a generic Python search engine API
     Richard Boulton
     @rboulton
     richard@cnav.co.uk
  2. Lots of search engines
     ● Lucene, Xapian, Sphinx, Solr, ElasticSearch, Whoosh, Riak Search, Terrier, Lemur/Indri
     ● Also MySQL, PostgreSQL Full Text components
     ● Also client-side engines using Redis, Mongo, etc.
  3. Generic API?
     ● Don't know what search features you need in advance
     ● So, don't want to be stuck with an early choice.
     ● Also, don't want to learn new API for trying out new engine.
  4. I Need Input
  5. Philosophy
     ● Expose common features in standard way
     ● Emulate missing features
     ● Don't get in the way
     ● Build useful features on top
  6. Backend variation
     ● Fields / Schemas: not in Xapian, Lucene
     ● Schema modification: Sphinx, Whoosh can't
  7. Updates
     ● Sphinx: no updates (in progress)
     ● Does update happen synchronously?
     ● Do updates return docids, or do docids need to be supplied by client?
     ● Can docids be set by client?
     ● When do updates become live?
  8. Scaling
     ● Multiple database searches?
     ● Searching remote databases?
     ● Replication?
  9. Analysers
     ● Vary wildly between backends
     ● Stemming, splitting, n-grams
     ● Language detection
     ● Fuzzy searches
     ● Soundex (well, Metaphone)
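
As one concrete example of the analysers listed above, character n-grams are simple enough to emulate on the client when a backend lacks them. The following is a minimal sketch; the function name `char_ngrams` and the lowercasing step are assumptions for illustration, not part of any backend or of multisearch.

```python
def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (hypothetical helper)."""
    text = text.lower()
    # A string shorter than n yields no n-grams at all.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = char_ngrams("Search")
```

Indexing these n-grams as terms is one common way to get approximate matching from a backend that only supports exact terms.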
  10. Queries
      ● Many different features available
      ● Most engines support arbitrary booleans
      ● Some have XOR!
      ● Some only permit sets of filters
      ● Weighting schemes
      ● Need to expose native backend query parsers
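
Where a backend supports arbitrary booleans but not XOR, the operator can be emulated as (a OR b) AND NOT (a AND b). The sketch below shows the idea with plain Python sets of docids standing in for a backend; `emulate_xor` and the operator arguments are hypothetical names, not an existing API.

```python
def emulate_xor(a, b, and_, or_, and_not):
    """Build XOR from operators most backends offer: (a OR b) AND NOT (a AND b)."""
    return and_not(or_(a, b), and_(a, b))


# Stand-in "backend": each query is simply the set of docids it matches.
matches_a = {1, 2, 3}
matches_b = {3, 4}
result = emulate_xor(
    matches_a,
    matches_b,
    and_=lambda x, y: x & y,
    or_=lambda x, y: x | y,
    and_not=lambda x, y: x - y,
)
```

Passing the operators in as arguments keeps the emulation independent of how a particular backend represents its queries.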
  11. Facets
      ● Information about result set
      ● Can be emulated (slow)
      ● Some backends approximate
      ● Some backends give stats, histograms
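
A rough sketch of why emulated facets are slow: without backend support, the client has to fetch every matching document and count the field values itself. The function name `emulate_facets` and the document shape (a dictionary of fields) are assumptions for illustration.

```python
from collections import Counter


def emulate_facets(results, field):
    """Count the values of one field across a full result set (hypothetical helper)."""
    counts = Counter()
    for doc in results:
        values = doc.get(field, [])
        if isinstance(values, str):
            # Treat a single value like a one-element list.
            values = [values]
        counts.update(values)
    return counts


results = [
    {"title": "a", "tag": ["python", "search"]},
    {"title": "b", "tag": "python"},
    {"title": "c"},
]
facets = emulate_facets(results, "tag")
```

The cost grows with the size of the result set, which is why backends that compute or approximate facets natively are preferable when available.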
  12. Other features
      ● Spelling correction
      ● Numeric and Date range searches
      ● Geospatial searches (box, geohash, distance)
  13. Proposed design
      ● SearchClient class for each backend.
      ● A definition of standard behaviours that all backends should provide.
      ● A definition of optional behaviours when more than one backend provides them.
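
One way the split between standard and optional behaviours might look is an abstract base class, with required methods declared abstract and optional ones raising by default. This is a sketch under assumptions: `BaseSearchClient`, `InMemoryClient` and every method name here are hypothetical and are not taken from the multisearch code.

```python
from abc import ABC, abstractmethod


class BaseSearchClient(ABC):
    """Hypothetical base class; not the actual multisearch interface."""

    # Standard behaviours: every backend must implement these.
    @abstractmethod
    def update(self, doc, docid=None):
        """Add or replace a document and return its docid."""

    @abstractmethod
    def search(self, text):
        """Return the docids of documents matching the text."""

    # Optional behaviour: backends override this only if they support it.
    def spell_correct(self, text):
        raise NotImplementedError("spelling correction not supported")


class InMemoryClient(BaseSearchClient):
    """Toy backend used only to show the shape of a subclass."""

    def __init__(self):
        self.docs = {}

    def update(self, doc, docid=None):
        if docid is None:
            docid = str(len(self.docs) + 1)
        self.docs[docid] = doc
        return docid

    def search(self, text):
        # Naive substring match over all field values.
        return [
            docid for docid, doc in self.docs.items()
            if any(text in str(value) for value in doc.values())
        ]


client = InMemoryClient()
docid = client.update({"title": "generic search API"})
hits = client.search("search")
```

Because the standard methods are abstract, a backend class that forgets one of them cannot be instantiated, which gives an early failure rather than a surprise at query time.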
  14. Proposed design
      ● Test suite to ensure that all backends support common features.
      ● Programmatic way of checking which features a backend supports? (Or just raise exception)
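
The two options on this slide, checking for a feature up front or just raising an exception, can coexist. The sketch below assumes a per-backend set of feature names; `FeatureNotSupported`, `ToyClient`, `supports` and the feature strings are all hypothetical.

```python
class FeatureNotSupported(Exception):
    """Raised when a backend is asked for a feature it lacks (hypothetical)."""


class ToyClient:
    # Each backend would declare the optional features it provides.
    features = frozenset({"facets", "range_search"})

    def supports(self, feature):
        """Programmatic check, for callers that want to look before they leap."""
        return feature in self.features

    def spell_correct(self, text):
        # Callers that do not check first get an exception instead.
        if not self.supports("spelling"):
            raise FeatureNotSupported("spelling")
        return text


client = ToyClient()
can_facet = client.supports("facets")
can_spell = client.supports("spelling")
```

A shared test suite could then iterate over the declared feature names and run the matching tests against each backend.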
  15. Proposed design
      ● Convenience SearchClient factory function:
        c = multisearch.SearchClient('xapian', path=dbpath)
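
A factory function of this kind could be backed by a simple registry that maps backend names to classes. The following is a sketch, not the multisearch implementation: the registry, the `register_backend` decorator and the `"memory"` backend are assumptions made so the example runs without any search engine installed.

```python
_BACKENDS = {}  # hypothetical registry: backend name -> client class


def register_backend(name):
    """Class decorator that records a client class under a backend name."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("memory")
class MemoryClient:
    """Stand-in backend; a real one would wrap Xapian, Whoosh, etc."""

    def __init__(self, path=None):
        self.path = path


def SearchClient(backend, **kwargs):
    """Look up the backend by name and pass the remaining arguments to it."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError("unknown backend: %r" % (backend,)) from None
    return cls(**kwargs)


c = SearchClient("memory", path="/tmp/db")
```

Keeping the backend name a plain string means the choice of engine can live in configuration rather than in code.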
  16. Documents
      ● Must support dictionary of fields
      ● Unicode values
      ● List(unicode) values
      ● May support arbitrary other field types, or different data structures, if backend wants to.
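
The minimum document format described here could be checked and normalised on the client before a document reaches the backend. This sketch assumes Python 3, where `str` is unicode, and accepts only the two required value shapes; `normalise_document` is a hypothetical helper, and a real backend may accept more types than this.

```python
def normalise_document(doc):
    """Turn every field value into a list of strings, rejecting anything else."""
    normalised = {}
    for field, value in doc.items():
        if isinstance(value, str):
            normalised[field] = [value]
        elif isinstance(value, (list, tuple)) and all(
            isinstance(item, str) for item in value
        ):
            normalised[field] = list(value)
        else:
            raise TypeError("unsupported value for field %r" % (field,))
    return normalised


doc = normalise_document({"title": "Generic API", "tag": ["python", "search"]})
```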
  17. Schemas
      ● Fields have types
      ● Automatic type “guessing” (client or server side)
      ● Some standard minimal set of analysers
      ● Text in a language
      ● Untokenised values
      ● Don't define exact output; just intent of standard analysers.
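
Client-side type guessing might look roughly like the sketch below, which infers a field type from the first document that uses the field. The type names (`"text"`, `"exact"`, `"numeric"`, `"date"`) and the whitespace heuristic for telling free text from untokenised values are assumptions for illustration only.

```python
from datetime import date


def guess_field_type(value):
    """Guess a field type from a sample value (hypothetical heuristic)."""
    if isinstance(value, (list, tuple)):
        # Guess from the first element; default to text for empty lists.
        return guess_field_type(value[0]) if value else "text"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, date):  # also covers datetime, a subclass of date
        return "date"
    if isinstance(value, str):
        # Strings containing spaces look like free text; others stay untokenised.
        return "text" if " " in value.strip() else "exact"
    raise TypeError("cannot guess a type for %r" % (value,))


def guess_schema(doc):
    """Guess a type for every field of one document."""
    return {field: guess_field_type(value) for field, value in doc.items()}


schema = guess_schema({
    "title": "A generic search API",
    "tag": ["python"],
    "year": 2010,
    "published": date(2010, 1, 1),
})
```

A guess like this only records intent; which analyser each type maps to would still be up to the backend.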
  18. Search representation
      ● Abstract query representation
      ● Tree of Python objects.
      ● Overloaded operators for boolean.
      ● Chainable methods.
      ● (have actually written this)
      ● SearchClient.my_query_type()
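
To make the idea of a query tree with overloaded operators and chainable methods concrete, here is a minimal sketch. It is not the representation in the multisearch repository; the classes `Query`, `Term` and `Bool`, the `filter` method and the string rendering are all hypothetical.

```python
class Query:
    """Base node of an abstract query tree (hypothetical design)."""

    def __and__(self, other):
        return Bool("AND", self, other)

    def __or__(self, other):
        return Bool("OR", self, other)

    def filter(self, other):
        # Chainable: returns a new Query, so calls can be strung together.
        return Bool("FILTER", self, other)


class Term(Query):
    """Leaf node: a single value searched for in a single field."""

    def __init__(self, field, value):
        self.field = field
        self.value = value

    def __repr__(self):
        return "%s:%s" % (self.field, self.value)


class Bool(Query):
    """Inner node: two subqueries joined by an operator."""

    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def __repr__(self):
        return "(%r %s %r)" % (self.left, self.op, self.right)


q = (Term("title", "python") | Term("title", "xapian")) & Term("tag", "search")
filtered = q.filter(Term("lang", "en"))
```

Each backend would then walk this tree and translate it into its own native query objects or query string.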
  19. Code
      ● Such as it is, on http://github.com/rboulton/multisearch
      ● Suggestions for a better name appreciated
      ● Query representation is pretty good, rest is pretty rough.