Designing a generic Python Search Engine API - BarCampLondon 8

Transcript

  • 1. Designing a generic Python search engine API Richard Boulton @rboulton [email_address]
  • 2. Lots of search engines
    • Lucene, Xapian, Sphinx, Solr, ElasticSearch, Whoosh, Riak Search, Terrier, Lemur/Indri
    • 3. Also MySQL, PostgreSQL Full Text components
    • 4. Also client-side engines using Redis, Mongo, etc.
  • 5. Generic API?
    • Don't know what search features you need in advance
    • 6. So, don't want to be stuck with an early choice.
    • 7. Also, don't want to learn new API for trying out new engine.
  • 8. I Need Input
  • 9. Philosophy
    • Expose common features in standard way
    • 10. Emulate missing features
    • 11. Don't get in the way
    • 12. Build useful features on top
  • 13. Backend variation
    • Fields / Schemas: not in Xapian, Lucene
    • 14. Schema modification: Sphinx, Whoosh can't
  • 15. Updates
    • Sphinx: no updates (in progress)
    • 16. Does update happen synchronously?
    • 17. Do updates return docids, or do docids need to be supplied by client?
    • 18. Can docids be set by client?
    • 19. When do updates become live?
  • 20. Scaling
    • Multiple database searches?
    • 21. Searching remote databases?
    • 22. Replication?
  • 23. Analysers
    • Vary wildly between backends
    • 24. Stemming, splitting, n-grams
    • 25. Language detection
    • 26. Fuzzy searches
    • 27. Soundex (well, Metaphone)
  • 28. Queries
    • Many different features available
    • 29. Most engines support arbitrary booleans
    • 30. Some have XOR!
    • 31. Some only permit sets of filters
    • 32. Weighting schemes
    • 33. Need to expose native backend query parsers
  • 34. Facets
    • Information about result set
    • 35. Can be emulated (slow)
    • 36. Some backends approximate
    • 37. Some backends give stats, histograms
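A minimal sketch of what "can be emulated (slow)" might look like for a backend with no native facet support: the client walks every matching document and counts field values itself. The function name `emulate_facets` and the document shape (a dict mapping field names to a string or a list of strings) are assumptions for illustration, not part of any backend's API.

```python
from collections import Counter


def emulate_facets(results, field, limit=None):
    """Count the values of `field` across an iterable of matching documents.

    Hypothetical helper: each document is assumed to be a dict whose values
    are either a single string or a list of strings.
    """
    counts = Counter()
    for doc in results:
        value = doc.get(field, [])
        # Treat a bare string as a single-valued field.
        values = [value] if isinstance(value, str) else value
        for v in values:
            counts[v] += 1
    # most_common(None) returns every value, most frequent first.
    return counts.most_common(limit)


matches = [
    {"title": "a", "tag": ["python", "search"]},
    {"title": "b", "tag": ["python"]},
    {"title": "c", "tag": "python"},
]
facets = emulate_facets(matches, "tag")
```

The cost grows with the size of the result set, which is why a backend with native (or approximate) facet counts would normally be preferred when one is available.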
  • 38. Other features
    • Spelling correction
    • 39. Numeric and Date range searches
    • 40. Geospatial searches (box, geohash, distance)
  • 41. Proposed design
    • SearchClient class for each backend.
    • 42. A definition of standard behaviours that all backends should provide.
    • 43. A definition of optional behaviours when more than one backend provides them.
  • 44. Proposed design
    • Test suite to ensure that all backends support common features.
    • 45. Programmatic way of checking which features a backend supports? (Or just raise exception)
  • 46. Proposed design
    • Convenience SearchClient factory function:
    • 47. c = multisearch.SearchClient('xapian', path=dbpath)
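One way the factory function and the "programmatic check or just raise an exception" question could fit together is sketched below. This is not the multisearch code: the class names, the `features` attribute, `supports()`, `require()` and the `UnsupportedFeature` exception are all hypothetical, and the feature sets listed are placeholders rather than statements about what any real engine supports.

```python
class UnsupportedFeature(Exception):
    """Raised when a backend lacks an optional feature (hypothetical)."""


class BaseClient:
    # Optional features this backend claims to provide (illustrative only).
    features = frozenset()

    def __init__(self, **options):
        self.options = options

    def supports(self, feature):
        """Programmatic check: does this backend provide `feature`?"""
        return feature in self.features

    def require(self, feature):
        """Alternative style: raise if the feature is missing."""
        if not self.supports(feature):
            raise UnsupportedFeature(
                "%s does not support %r" % (type(self).__name__, feature))


class XapianClient(BaseClient):
    # Placeholder feature set, not a claim about Xapian itself.
    features = frozenset(["facets", "spelling"])


class WhooshClient(BaseClient):
    # Placeholder feature set, not a claim about Whoosh itself.
    features = frozenset(["spelling"])


_BACKENDS = {"xapian": XapianClient, "whoosh": WhooshClient}


def SearchClient(backend, **options):
    """Convenience factory: look up the client class by backend name."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError("unknown backend: %r" % (backend,))
    return cls(**options)


c = SearchClient("xapian", path="/tmp/example-db")
```

Offering both `supports()` and `require()` lets calling code either branch on capabilities up front or simply attempt an operation and handle the exception.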
  • 48. Documents
    • Must support dictionary of fields
      • Unicode values
      • 49. List(unicode) values
    • May support arbitrary other field types, or different data structures, if backend wants to.
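The minimum document contract above (a dictionary of fields whose values are Unicode strings or lists of Unicode strings) could be enforced with a small normalising step like the following sketch. The function name `normalise_document` is an assumption, and the code uses Python 3, where `str` is the Unicode string type.

```python
def normalise_document(doc):
    """Return a copy of `doc` with every field value as a list of str.

    Hypothetical helper: accepts only the minimal document shape, i.e. a
    dict whose values are str or a list/tuple of str.
    """
    if not isinstance(doc, dict):
        raise TypeError("document must be a dict of fields")
    normalised = {}
    for field, value in doc.items():
        if isinstance(value, str):
            values = [value]
        elif (isinstance(value, (list, tuple))
              and all(isinstance(v, str) for v in value)):
            values = list(value)
        else:
            raise TypeError(
                "field %r: expected str or list of str" % (field,))
        normalised[field] = values
    return normalised


doc = normalise_document({"title": "Search APIs", "tag": ["python", "xapian"]})
```

A backend that accepts richer field types could skip or relax this step; the sketch only covers the part every backend is expected to handle.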
  • 50. Schemas
    • Fields have types
    • 51. Automatic type “guessing” (client or server side)
    • 52. Some standard minimal set of analysers
      • Text in a language
      • 53. Untokenised values
    • Don't define exact output; just intent of standard analysers.
  • 54. Search representation
    • Abstract query representation
    • 55. Tree of Python objects.
    • 56. Overloaded operators for boolean.
    • 57. Chainable methods.
    • 58. (have actually written this)
    • 59. SearchClient.my_query_type()
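The ideas on this slide (a tree of Python objects, overloaded operators for booleans, chainable methods) can be illustrated with a small sketch. It is not the query representation from the multisearch repository: the class names `Query`, `Term` and `Bool`, and the `filter()` method, are hypothetical and exist only to show the technique.

```python
class Query:
    """Base node. Operators build new tree nodes; nothing is executed."""

    def __and__(self, other):
        return Bool("AND", self, other)

    def __or__(self, other):
        return Bool("OR", self, other)

    def __xor__(self, other):
        return Bool("XOR", self, other)

    def filter(self, other):
        # Chainable: returns a new node, so calls can be strung together.
        return Bool("FILTER", self, other)


class Term(Query):
    """Leaf node: a single value looked up in a single field."""

    def __init__(self, field, value):
        self.field = field
        self.value = value

    def __repr__(self):
        return "%s:%s" % (self.field, self.value)


class Bool(Query):
    """Interior node: a boolean operator joining two subqueries."""

    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def __repr__(self):
        return "(%r %s %r)" % (self.left, self.op, self.right)


q = (Term("title", "python") | Term("title", "search")) & Term("lang", "en")
q2 = q.filter(Term("year", "2010"))
```

Because the tree is plain data, each backend could walk it and translate the nodes into its own native query objects, or raise an error for an operator it cannot express.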
  • 60. Code
    • Such as it is, on
    • 61. http://github.com/rboulton/multisearch
    • 62. Suggestions for a better name appreciated
    • 63. Query representation is pretty good, rest is pretty rough.