

Designing a generic Python Search Engine API - BarCampLondon 8







Usage Rights

CC Attribution-NonCommercial-ShareAlike License


Presentation Transcript

  • Designing a generic Python search engine API. Richard Boulton, @rboulton, [email_address]
  • Lots of search engines
    • Lucene, Xapian, Sphinx, Solr, ElasticSearch, Whoosh, Riak Search, Terrier, Lemur/Indri
    • Also MySQL, PostgreSQL Full Text components
    • Also client-side engines using Redis, Mongo, etc.
  • Generic API?
    • Don't know what search features you need in advance
    • So, don't want to be stuck with an early choice.
    • Also, don't want to learn a new API just to try out a new engine.
  • I Need Input
  • Philosophy
    • Expose common features in standard way
    • Emulate missing features
    • Don't get in the way
    • Build useful features on top
  • Backend variation
    • Fields / Schemas: not in Xapian, Lucene
    • Schema modification: not supported by Sphinx or Whoosh
  • Updates
    • Sphinx: no document updates (support in progress)
    • Does update happen synchronously?
    • Do updates return docids, or do docids need to be supplied by client?
    • Can docids be set by client?
    • When do updates become live?
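The questions above can be sketched in code. This is a hypothetical illustration (not the actual multisearch implementation) of one way a generic API might fix the answers: add() accepts an optional client-supplied docid and always returns one, and updates only become live at commit().

```python
# Hypothetical sketch: a backend where add() returns a docid
# (generating one when the client does not supply it), and updates
# become live only at commit() time.
import uuid

class InMemoryBackend:
    def __init__(self):
        self._pending = {}   # updates not yet live
        self._live = {}      # documents visible to searches

    def add(self, doc, docid=None):
        # Client-supplied docids are allowed; otherwise generate one.
        if docid is None:
            docid = uuid.uuid4().hex
        self._pending[docid] = doc
        return docid

    def commit(self):
        # Pending updates become searchable only here.
        self._live.update(self._pending)
        self._pending.clear()

    def get(self, docid):
        return self._live.get(docid)

backend = InMemoryBackend()
docid = backend.add({"text": "hello world"})
assert backend.get(docid) is None    # not yet live
backend.commit()
assert backend.get(docid) == {"text": "hello world"}
```

A real adapter for a synchronous backend could simply make commit() a no-op, so the same client code works either way.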
  • Scaling
    • Multiple database searches?
    • Searching remote databases?
    • Replication?
  • Analysers
    • Vary wildly between backends
    • Stemming, splitting, n-grams
    • Language detection
    • Fuzzy searches
    • Soundex (well, Metaphone)
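One way to tame that variation, sketched here as an illustration (not from the talk), is to treat analysers as composable token-stream functions, so a standard name like "ngram" can be mapped onto each backend's native equivalent:

```python
# Illustrative sketch: analysers as plain composable functions over
# token streams.
def tokenize(text):
    # Minimal splitter; real backends vary wildly here.
    return text.lower().split()

def ngrams(tokens, n=3):
    # Character n-grams, commonly used for fuzzy matching.
    for tok in tokens:
        if len(tok) <= n:
            yield tok
        else:
            for i in range(len(tok) - n + 1):
                yield tok[i:i + n]

assert list(ngrams(tokenize("Search Engines"), 3)) == [
    "sea", "ear", "arc", "rch",
    "eng", "ngi", "gin", "ine", "nes",
]
```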
  • Queries
    • Many different features available
    • Most engines support arbitrary booleans
    • Some have XOR!
    • Some only permit sets of filters
    • Weighting schemes
    • Need to expose native backend query parsers
  • Facets
    • Information about result set
    • Can be emulated (slow)
    • Some backends approximate
    • Some backends give stats, histograms
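The slow, fully portable emulation path mentioned above amounts to counting field values client-side over the matching documents. A minimal sketch:

```python
# Sketch of client-side facet emulation: count field values across
# the result set. Works with any backend, but is slow for large
# result sets compared to native facet support.
from collections import Counter

def facet_counts(results, field):
    counts = Counter()
    for doc in results:
        value = doc.get(field)
        if value is not None:
            counts[value] += 1
    return counts

results = [
    {"title": "a", "category": "search"},
    {"title": "b", "category": "search"},
    {"title": "c", "category": "db"},
]
assert facet_counts(results, "category") == {"search": 2, "db": 1}
```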
  • Other features
    • Spelling correction
    • Numeric and Date range searches
    • Geospatial searches (box, geohash, distance)
  • Proposed design
    • SearchClient class for each backend.
    • A definition of standard behaviours that all backends should provide.
    • A definition of optional behaviours when more than one backend provides them.
  • Proposed design
    • Test suite to ensure that all backends support common features.
    • Programmatic way of checking which features a backend supports? (Or just raise exception)
  • Proposed design
    • Convenience SearchClient factory function:
    • c = multisearch.SearchClient('xapian', path=dbpath)
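A factory like that might dispatch on the backend name via a registry. The sketch below is a guess at the shape, not the actual multisearch code:

```python
# Hypothetical sketch of a SearchClient factory dispatching on a
# backend name; backend classes register themselves by name.
_BACKENDS = {}

def register_backend(name):
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator

def SearchClient(backend, **kwargs):
    # Look up the backend class and hand it the backend-specific
    # keyword arguments (e.g. path= for an on-disk engine).
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError("unknown backend: %r" % backend)
    return cls(**kwargs)

@register_backend('dummy')
class DummyClient:
    def __init__(self, path=None):
        self.path = path

c = SearchClient('dummy', path='/tmp/db')
assert isinstance(c, DummyClient) and c.path == '/tmp/db'
```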
  • Documents
    • Must support dictionary of fields
      • Unicode values
      • List(unicode) values
    • May support arbitrary other field types, or different data structures, if backend wants to.
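The minimal contract above can be expressed as a validator, sketched here for illustration: every backend must accept at least string and list-of-string field values, though individual backends may accept more.

```python
# Sketch of the minimal document contract: field values must be
# strings or lists of strings (the unicode type in Python 2, str in
# Python 3); backends may accept richer types on top of this.
def check_document(doc):
    for name, value in doc.items():
        if isinstance(value, str):
            continue
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        raise TypeError("field %r has unsupported type %r" % (name, type(value)))
    return True

assert check_document({"title": "BarCamp", "tags": ["python", "search"]})
```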
  • Schemas
    • Fields have types
    • Automatic type “guessing” (client or server side)
    • Some standard minimal set of analysers
      • Text in a language
      • Untokenised values
    • Don't define exact output; just intent of standard analysers.
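Client-side type guessing might look something like this sketch (the field type names here are invented for illustration): infer a type from the first value seen for each field, then freeze it in the schema.

```python
# Hypothetical sketch of client-side type "guessing": infer a field
# type from the first value seen. Type names are illustrative only.
def guess_type(value):
    # bool must be checked before int: bool subclasses int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "text"
    raise TypeError("cannot guess a field type for %r" % (value,))

schema = {}
doc = {"title": "BarCampLondon 8", "attendees": 150}
for field, value in doc.items():
    # First value seen wins; later documents must conform.
    schema.setdefault(field, guess_type(value))

assert schema == {"title": "text", "attendees": "numeric"}
```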
  • Search representation
    • Abstract query representation
    • Tree of python objects.
    • Overloaded operators for boolean.
    • Chainable methods.
    • (have actually written this)
    • SearchClient.my_query_type()
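In the spirit of that design, a query tree with overloaded boolean operators and chainable methods might be sketched as follows (the actual multisearch representation may differ in detail):

```python
# Sketch of an abstract query representation: a tree of Python
# objects, with overloaded operators for boolean combination and
# chainable methods that wrap the tree in new nodes.
class Query:
    def __and__(self, other):
        return CombinedQuery("AND", self, other)

    def __or__(self, other):
        return CombinedQuery("OR", self, other)

    def __xor__(self, other):
        # XOR, for the backends that have it; others could emulate
        # or reject it at translation time.
        return CombinedQuery("XOR", self, other)

    def boost(self, factor):
        # Chainable: returns a new node wrapping this one.
        return BoostQuery(self, factor)

class TermQuery(Query):
    def __init__(self, field, term):
        self.field, self.term = field, term

class CombinedQuery(Query):
    def __init__(self, op, *subqueries):
        self.op, self.subqueries = op, subqueries

class BoostQuery(Query):
    def __init__(self, subquery, factor):
        self.subquery, self.factor = subquery, factor

q = (TermQuery("title", "python") | TermQuery("title", "search")).boost(2.0)
assert isinstance(q, BoostQuery) and q.subquery.op == "OR"
```

Each SearchClient can then walk this tree and translate it into its backend's native query objects, raising an exception for nodes the backend cannot express.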
  • Code
    • Such as it is, on
    • http://github.com/rboulton/multisearch
    • Suggestions for a better name appreciated
    • Query representation is pretty good, rest is pretty rough.