Designing a generic Python Search Engine API - BarCampLondon 8






Usage Rights

CC Attribution-NonCommercial-ShareAlike License

    Presentation Transcript

    • Designing a generic Python search engine API (Richard Boulton, @rboulton)
    • Lots of search engines
      • Lucene, Xapian, Sphinx, Solr, ElasticSearch, Whoosh, Riak Search, Terrier, Lemur/Indri
      • Also MySQL, PostgreSQL Full Text components
      • Also client-side engines using Redis, Mongo, etc.
    • Generic API?
      • Don't know what search features you need in advance
      • So, don't want to be stuck with an early choice.
      • Also, don't want to learn a new API just to try out a new engine.
    • I Need Input
    • Philosophy
      • Expose common features in standard way
      • Emulate missing features
      • Don't get in the way
      • Build useful features on top
    • Backend variation
      • Fields / Schemas: not in Xapian, Lucene
      • Schema modification: Sphinx, Whoosh can't
    • Updates
      • Sphinx: no updates (in progress)
      • Does update happen synchronously?
      • Do updates return docids, or do docids need to be supplied by client?
      • Can docids be set by client?
      • When do updates become live?
    • Scaling
      • Multiple database searches?
      • Searching remote databases?
      • Replication?
    • Analysers
      • Vary wildly between backends
      • Stemming, splitting, n-grams
      • Language detection
      • Fuzzy searches
      • Soundex (well, Metaphone)
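To make two of the analysis steps above concrete, here is a minimal sketch of splitting and character n-grams. The function names `split_words` and `char_ngrams` are hypothetical and not taken from any backend. Real analysers differ in how they handle punctuation, case and Unicode, which is exactly the variation the slide describes.

```python
import re


def split_words(text):
    """Lower-case the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(word, n=3):
    """Return every overlapping character n-gram of `word`.

    Words shorter than `n` produce an empty list.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]


words = split_words("Fuzzy Search-engines!")
grams = char_ngrams("search", 3)
```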
    • Queries
      • Many different features available
      • Most engines support arbitrary booleans
      • Some have XOR!
      • Some only permit sets of filters
      • Weighting schemes
      • Need to expose native backend query parsers
    • Facets
      • Information about result set
      • Can be emulated (slow)
      • Some backends approximate
      • Some backends give stats, histograms
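A rough sketch of the "can be emulated (slow)" case: when a backend has no native facet support, a client could fetch every matching document and count field values itself. The name `emulate_facets` and the sample documents are hypothetical. The cost grows with the size of the result set, which is why this is slow compared with a native implementation.

```python
from collections import Counter


def emulate_facets(results, field):
    """Count the values of `field` across every matching document."""
    counts = Counter()
    for doc in results:
        values = doc.get(field, [])
        if isinstance(values, str):
            # Treat a single value as a one-element list.
            values = [values]
        counts.update(values)
    return counts


# Hypothetical result set, already retrieved from some backend.
docs = [
    {"title": "Xapian intro", "tag": ["search", "c++"]},
    {"title": "Whoosh intro", "tag": ["search", "python"]},
    {"title": "Redis tips", "tag": "python"},
]
facets = emulate_facets(docs, "tag")
```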
    • Other features
      • Spelling correction
      • Numeric and Date range searches
      • Geospatial searches (box, geohash, distance)
    • Proposed design
      • SearchClient class for each backend.
      • A definition of standard behaviours that all backends should provide.
      • A definition of optional behaviours when more than one backend provides them.
    • Proposed design
      • Test suite to ensure that all backends support common features.
      • Programmatic way of checking which features a backend supports? (Or just raise exception)
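Both options raised above (asking up front, or just raising an exception) can coexist. The sketch below is one possible shape, not the design the talk settled on. `BaseClient`, `ToyClient`, `FeatureNotSupported`, `supports`, `require` and the feature names are all hypothetical.

```python
class FeatureNotSupported(Exception):
    """Raised when a backend is asked for a feature it lacks."""


class BaseClient:
    # Hypothetical: each backend lists the optional features it implements.
    features = frozenset()

    def supports(self, feature):
        """Programmatic check, for callers that want to ask first."""
        return feature in self.features

    def require(self, feature):
        """Raise if the feature is missing, for callers that just try."""
        if not self.supports(feature):
            raise FeatureNotSupported(
                f"{type(self).__name__} does not support {feature!r}"
            )


class ToyClient(BaseClient):
    features = frozenset({"facets", "range_search"})

    def spell_correct(self, text):
        self.require("spelling")  # always raises for this toy backend
        return text


client = ToyClient()
if client.supports("facets"):
    print("facets available")
try:
    client.spell_correct("serch")
except FeatureNotSupported as exc:
    print(exc)
```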
    • Proposed design
      • Convenience SearchClient factory function:
      • c = multisearch.SearchClient('xapian', path=dbpath)
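One way such a factory function could work is a registry mapping backend names to client classes. This is an assumption for illustration, not a description of how multisearch implements it. `register_backend`, `InMemoryClient` and the `"memory"` backend name are hypothetical.

```python
_BACKENDS = {}


def register_backend(name):
    """Class decorator that records a client class under `name`."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("memory")
class InMemoryClient:
    """Stand-in backend that only remembers its options."""

    def __init__(self, **options):
        self.options = options


def SearchClient(backend, **options):
    """Look up the named backend and build a client with the options."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return cls(**options)


c = SearchClient("memory", path="/tmp/db")
```

Callers only ever name the backend as a string, so switching engines means changing one argument.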
    • Documents
      • Must support dictionary of fields
        • Unicode values
        • List(unicode) values
      • May support arbitrary other field types, or different data structures, if backend wants to.
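The minimum document shape described above could be checked, and normalised to lists, along these lines. This is a sketch under assumptions: `normalise_document` is a hypothetical helper, and Python 3 `str` stands in for the `unicode` type the slide refers to.

```python
def normalise_document(doc):
    """Check the minimal document shape and return field -> list of text."""
    if not isinstance(doc, dict):
        raise TypeError("document must be a dict of fields")
    normalised = {}
    for name, value in doc.items():
        if not isinstance(name, str):
            raise TypeError(f"field name must be text, got {name!r}")
        # A single text value is treated as a one-element list.
        values = [value] if isinstance(value, str) else value
        if not isinstance(values, list) or not all(
            isinstance(v, str) for v in values
        ):
            raise TypeError(f"field {name!r} must be text or a list of text")
        normalised[name] = list(values)
    return normalised


doc = normalise_document({"title": "Hello", "tags": ["a", "b"]})
```

A backend that accepts richer field types would relax this check instead of rejecting them.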
    • Schemas
      • Fields have types
      • Automatic type “guessing” (client or server side)
      • Some standard minimal set of analysers
        • Text in a language
        • Untokenised values
      • Don't define exact output; just intent of standard analysers.
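Client-side type guessing might look something like the following. The type names (`"text"`, `"exact"`, `"number"`, `"boolean"`, `"date"`) and the whitespace heuristic for choosing between tokenised text and untokenised values are assumptions made for illustration. The talk does not specify them.

```python
from datetime import date


def guess_field_type(value):
    """Guess a schema type name from a sample field value."""
    if isinstance(value, (list, tuple)):
        # Guess from the first element; an empty list defaults to text.
        return guess_field_type(value[0]) if value else "text"
    if isinstance(value, bool):
        # bool is a subclass of int, so it must be checked first.
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, date):
        # Also covers datetime, which subclasses date.
        return "date"
    if isinstance(value, str):
        # Arbitrary heuristic: whitespace suggests text in a language,
        # anything else is kept as an untokenised value.
        return "text" if any(ch.isspace() for ch in value) else "exact"
    raise TypeError(f"cannot guess a type for {value!r}")


schema = {
    name: guess_field_type(value)
    for name, value in {
        "title": "Designing a search API",
        "sku": "SKU-123",
        "price": 9.5,
        "published": date(2010, 11, 13),
    }.items()
}
```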
    • Search representation
      • Abstract query representation
      • Tree of Python objects.
      • Overloaded operators for boolean.
      • Chainable methods.
      • (have actually written this)
      • SearchClient.my_query_type()
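The ideas above (a tree of objects, overloaded boolean operators, chainable methods) can be illustrated with a small standalone sketch. It is not the multisearch query representation. `Query`, `Term`, `And`, `Or`, `Not`, `filter`, `exclude` and `render` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


class Query:
    """Base node; the operators build new tree nodes, nothing is mutated."""

    def __and__(self, other):
        return And((self, other))

    def __or__(self, other):
        return Or((self, other))

    def __invert__(self):
        return Not(self)

    # Chainable equivalents of the operators.
    def filter(self, other):
        return self & other

    def exclude(self, other):
        return self & ~other


@dataclass(frozen=True)
class Term(Query):
    field: str
    value: str


@dataclass(frozen=True)
class And(Query):
    parts: tuple


@dataclass(frozen=True)
class Or(Query):
    parts: tuple


@dataclass(frozen=True)
class Not(Query):
    part: Query


def render(q):
    """Turn the tree into a readable string; a backend would translate it."""
    if isinstance(q, Term):
        return f"{q.field}:{q.value}"
    if isinstance(q, Not):
        return f"NOT {render(q.part)}"
    op = " AND " if isinstance(q, And) else " OR "
    return "(" + op.join(render(p) for p in q.parts) + ")"


q = (Term("title", "python") | Term("title", "search")) & ~Term("status", "draft")
chained = Term("title", "python").filter(Term("lang", "en")).exclude(
    Term("status", "draft")
)
```

Because the tree is backend-neutral, each client can walk it and emit its own native query, or reject nodes it cannot express.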
    • Code
      • Such as it is, on
      • http://github.com/rboulton/multisearch
      • Suggestions for a better name appreciated
      • Query representation is pretty good, rest is pretty rough.