Optimising Xapian
Talk given to UKUUG on 9th August 2009 about the Xapian search engine, and some of the experiences I've had trying to optimise its design and implementation.

Optimising Xapian: Presentation Transcript

  • Optimising Xapian
      Richard Boulton
      UKUUG, Birmingham, 9th August 2009
  • Xapian is a search engine
  • Xapian is (more precisely) an information retrieval toolkit
  • Index my stuff
  • Find things
  • … quickly
  • And more:
    • Arbitrary Boolean restrictions
    • Correct spelling mistakes
    • Suggest search completions
    • Find similar things
    • Browse facets of things
    • Find nearby things
    • Get diverse results
    • Find images which look similar
    • Different sort orders
    • Arbitrary weight influences
  • Mature
      10 year old vintage
  • Not going to talk about...
    • Scaling across multiple machines
      • Big topic – go to a cloud talk.
    • Optimising ranking of results
      • Huge topic – go to an IR conference!
    • Optimising specific installations
      • Filesystems, hardware specification, SSDs etc
  • Am going to talk about...
    • Two types of optimisation
      • Algorithms
      • Implementation
  • Making the most of hardware
    • Single machine
    • Limited memory
    • Database on a slow disk
  • Requirements
    • Given a set of documents
      • terms and frequencies
    • And a set of queries
      • terms, frequencies and operators
    • Find the best matches
  • Analysing the problem
    • Do as much work at indexing time as possible
    • Precalculated searches?
    • Can't precalculate everything...
    • Calculate all single-term queries
  • Stored data
    • Posting lists:
      Felt: 1 6 8
      Pens: 3 6 7 9
  • Single term search
    • Read a posting list
    • Remember the best
      Felt: 1 6 8
      Pens: 3 6 7 9
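
A minimal sketch of "read a posting list, remember the best", assuming each posting carries a precomputed weight. The weights below are invented for illustration; only the document ids come from the slide.

```python
import heapq

def best_n(postings, n):
    """Stream (doc_id, weight) postings and keep only the n heaviest."""
    heap = []  # min-heap of (weight, doc_id); heap[0] is the weakest kept item
    for doc_id, weight in postings:
        if len(heap) < n:
            heapq.heappush(heap, (weight, doc_id))
        elif weight > heap[0][0]:
            # New posting beats the weakest kept item, so swap it in.
            heapq.heapreplace(heap, (weight, doc_id))
    return sorted(heap, reverse=True)

# Posting list for "Felt" with hypothetical weights.
felt = [(1, 0.4), (6, 1.2), (8, 0.9)]
print(best_n(felt, 2))  # [(1.2, 6), (0.9, 8)]
```

Memory use is bounded by n rather than by the length of the posting list.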
  • AND search
    • Naive approach:
      • Read first list.
      • Hold it in memory: 1 6 8
      • Read next list
      • Merge it in: 1 3 6 7 8 9
      • Select the best
  • AND search
    • Problem – limited by amount of memory
    • Problem – no way to avoid reading all of the list
  • Better AND search
    • Read lists in parallel
    • Start with the shortest
    • Jump forward in second list to keep up with first
    • Keep only the best N items
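
The parallel-read idea can be sketched as follows. This is an illustration only: in-memory sorted lists stand in for on-disk posting lists, and `bisect_left` stands in for whatever "skip to document id" operation the real lists provide. Weights are left out, so the sketch returns every match instead of the best N.

```python
from bisect import bisect_left

def and_search(lists):
    """Intersect sorted doc-id lists, driving from the shortest one."""
    lists = sorted(lists, key=len)
    driver, others = lists[0], lists[1:]
    cursors = [0] * len(others)
    matches = []
    for doc_id in driver:
        for i, other in enumerate(others):
            # Jump forward to the first entry >= doc_id, never backwards.
            cursors[i] = bisect_left(other, doc_id, cursors[i])
            if cursors[i] == len(other) or other[cursors[i]] != doc_id:
                break  # this list lacks doc_id, so it cannot match
        else:
            matches.append(doc_id)
    return matches

felt = [1, 6, 8]
pens = [3, 6, 7, 9]
print(and_search([felt, pens]))  # [6]
```

Driving from the shortest list bounds the number of candidates, and the jumps mean large parts of the longer lists never have to be read.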
  • OR search
    • Read lists in parallel
    • But, unlike AND, we can't skip items
    • So … make it into an AND
    • How?
  • OR search
    • ASSUMPTION: we only want the top few results
    • Track only those
    • Keep track of the lowest weight of those
    • Also, calculate upper bound on weight of each term
    • When each term's upper bound on its own is < the lowest weight, a document needs both terms to get into the top results, so the OR becomes an AND
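
The switching test can be sketched as a small function. The function name and the example numbers are hypothetical; the condition follows the slide: once neither term alone can beat the weakest result currently kept, only documents containing both terms are worth looking at.

```python
def evaluation_mode(max_weight_a, max_weight_b, lowest_kept_weight):
    """Decide how to evaluate "A OR B" given the current top-N threshold."""
    if max_weight_a < lowest_kept_weight and max_weight_b < lowest_kept_weight:
        # Neither term alone can enter the top N: a match needs both terms.
        return "AND"
    return "OR"

# Early in the search the threshold is low, so a full OR is required.
print(evaluation_mode(0.8, 0.6, 0.0))  # OR
# Later the threshold has risen above both upper bounds.
print(evaluation_mode(0.8, 0.6, 1.0))  # AND
```

The threshold only rises as better results are found, so once the operator has become an AND it stays one.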
  • Taking it further
    • Can apply this idea across whole query tree
    • Can introduce other operators – AND_MAYBE
    • Phrase queries
      • AND, followed by checking positions
      • Or, store pairs of adjacent terms, and then check positions
      • Or, store certain pairs...
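
The first phrase strategy (AND, followed by checking positions) might look like this. The index layout, a dict from term to `{doc_id: [positions]}`, is an assumption made for the sketch and not a description of how the positions are really stored.

```python
def adjacent(positions_a, positions_b):
    """True if some occurrence of the second term directly follows the first."""
    following = set(positions_b)
    return any(p + 1 in following for p in positions_a)

def phrase_search(index, first, second):
    """Find documents containing `first` immediately followed by `second`."""
    a, b = index[first], index[second]
    candidates = sorted(a.keys() & b.keys())  # the AND step
    return [d for d in candidates if adjacent(a[d], b[d])]

# Hypothetical positional index.
index = {
    "felt": {1: [0], 6: [4], 8: [2]},
    "pens": {3: [1], 6: [5], 8: [7], 9: [3]},
}
print(phrase_search(index, "felt", "pens"))  # [6]
```

Document 8 contains both terms but not next to each other, so it passes the AND step and fails the position check.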
  • Does it work?
  • YES
  • Implementation
    • … not a small job
    • Datastructures
    • Compression techniques
    • Micro-optimisations
  • Datastructures
    • Assumption – too much data to fit it all in memory
    • Disks are slow
    • But faster when reading in chunks
    • B+-trees – traditional but good
      • Block structured, massively branching tree – very shallow
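
A rough calculation shows why a massively branching tree is shallow. The branching factor of 256 is an arbitrary illustrative figure, not the value any particular backend uses.

```python
def tree_levels(num_keys, branching_factor):
    """Number of levels needed to address num_keys at a given fan-out."""
    levels, capacity = 1, branching_factor
    while capacity < num_keys:
        capacity *= branching_factor
        levels += 1
    return levels

print(tree_levels(10**9, 2))    # 30 levels for a binary tree
print(tree_levels(10**9, 256))  # 4 levels with 256-way branching
```

If each level costs one block read, a lookup among a billion keys needs around 4 reads instead of around 30.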
  • Posting list chunks
    • Store posting lists in chunks
    • Work out what statistics to store, where
    • Get tighter bounds on possible weights, so we can skip better
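
One way to picture per-chunk statistics: if each chunk records the largest weight it contains, a whole chunk can be skipped when that bound cannot beat the current threshold. The `Chunk` type and its fields are hypothetical, and a fixed threshold is used here although it would rise during a real search.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    max_weight: float  # statistic stored alongside the chunk
    postings: list     # (doc_id, weight) pairs

def scan(chunks, threshold):
    """Return postings that beat threshold, and how many chunks were skipped."""
    hits, skipped = [], 0
    for chunk in chunks:
        if chunk.max_weight <= threshold:
            skipped += 1  # nothing in this chunk can beat the threshold
            continue
        hits.extend(p for p in chunk.postings if p[1] > threshold)
    return hits, skipped

chunks = [
    Chunk(0.5, [(1, 0.5), (6, 0.25)]),
    Chunk(2.0, [(12, 2.0), (40, 0.75)]),
]
print(scan(chunks, 1.0))  # ([(12, 2.0)], 1)
```

A per-chunk bound is tighter than a single bound for the whole term, so more of the list can be skipped.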
  • Document length
    • Needed for weight calculation
    • Store it in each posting list – duplicated, but no side lookup
    • Or store it only once?
    • Currently, we store it in all posting lists
    • New backend stores it only once -> 40% smaller!
    • But, currently 10 times slower :(
  • Measurements
  • New problems
    • We often have enough memory these days!
    • 500M = A huge collection 10 years ago, now only medium
    • 10M = A large collection 10 years ago, now small – will often fit fully in memory
    • => IO less of a bottleneck – optimise CPU
  • New problems
    • Faceted search
      • Display information about all the items in the result set
      • => Have to calculate all the result set!
      • Or – approximate
      • Or – precalculate the facet values somehow
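
The trade-off between exact and approximate facet counts can be sketched with a counter. The facet values are invented, and counting over only the first few matches is just one possible approximation.

```python
from collections import Counter

def facet_counts(doc_ids, facet_of):
    """Count how many of the given documents fall under each facet value."""
    return Counter(facet_of[d] for d in doc_ids)

# Hypothetical facet value per document.
facet_of = {1: "pens", 3: "paper", 6: "pens", 7: "ink", 8: "pens", 9: "paper"}
all_matches = [1, 3, 6, 7, 8, 9]

exact = facet_counts(all_matches, facet_of)       # needs the full result set
approx = facet_counts(all_matches[:3], facet_of)  # looks at a prefix only
print(exact["pens"], approx["pens"])  # 3 2
```

The exact counts need every matching document, which defeats early termination; the approximation is cheaper but misses "ink" entirely in this example.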
  • New problems
    • Bias results with external weights
    • Page rank / product rank
    • Fixed weights – so store documents in decreasing weight order – lets us finish early
    • But – harder to update dynamically
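
A sketch of finishing early when documents are stored in decreasing order of a fixed external weight. It assumes the final score is the fixed weight plus a query-dependent weight with a known upper bound; that additive model and all names and numbers are assumptions made for illustration.

```python
import heapq

def search_static_order(docs, max_query_weight, n):
    """docs: (doc_id, static_weight, query_weight), static_weight descending."""
    heap, read = [], 0  # min-heap of (total_weight, doc_id)
    for doc_id, static_w, query_w in docs:
        if len(heap) == n and static_w + max_query_weight <= heap[0][0]:
            break  # no later document can beat the weakest kept result
        read += 1
        total = static_w + query_w
        if len(heap) < n:
            heapq.heappush(heap, (total, doc_id))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, doc_id))
    return sorted(heap, reverse=True), read

docs = [(10, 5.0, 0.25), (11, 4.0, 1.0), (12, 3.0, 0.5), (13, 1.0, 1.0)]
results, read = search_static_order(docs, max_query_weight=1.0, n=2)
print(results, read)  # [(5.25, 10), (5.0, 11)] 2
```

Only two of the four documents are read. The cost is the one named on the slide: changing a document's fixed weight means moving it within the stored order.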
  • Geolocation
    • Bias results by distance from a location
    • Generate hierarchies of terms
      • HTM (Hierarchical Triangular Mesh) is the easiest way to implement
    • Use to restrict candidates
    • Combine candidates with dynamically calculated weight
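
To illustrate a hierarchy of location terms, the sketch below uses a plain latitude/longitude quadtree rather than HTM, purely because it is shorter to write down. The "G" term prefix and the function name are hypothetical.

```python
def cell_terms(lat, lon, levels):
    """Return one term per level, each naming a smaller cell around the point."""
    south, north, west, east = -90.0, 90.0, -180.0, 180.0
    path, terms = "", []
    for _ in range(levels):
        mid_lat, mid_lon = (south + north) / 2, (west + east) / 2
        quadrant = 0
        if lat >= mid_lat:
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:
            quadrant += 1
            west = mid_lon
        else:
            east = mid_lon
        path += str(quadrant)
        terms.append("G" + path)  # coarse cells are prefixes of finer ones
    return terms

print(cell_terms(52.48, -1.89, 3))  # ['G2', 'G23', 'G231']
```

Indexing every level as a term lets a query restrict candidates to a cell with an ordinary Boolean term; the exact distance is then calculated only for those candidates. Nearby points on opposite sides of a cell boundary get different terms, so a real query would also need to include neighbouring cells.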
  • Image similarity
    • Terms representing features
    • Queries with hundreds of terms
    • Current optimisations help
    • … but distribution of frequencies and weights is less amenable to early termination.
  • Variety
    • Strict relevance order leads to duplication
      • Similar items get similar scores
    • Usually want to present a selection of results
    • Order based on combination of novelty and relevance
    • Score depends on earlier documents
    • => our early termination doesn't work
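
A greedy re-ranking sketch, in the style of maximal marginal relevance, shows why the score depends on earlier documents. The scoring formula, the `novelty_weight` parameter and the toy similarity function are assumptions made for illustration and are not taken from the talk.

```python
def diversify(candidates, similarity, n, novelty_weight=0.5):
    """candidates: (doc_id, relevance) pairs; similarity(a, b) is in 0..1."""
    remaining = dict(candidates)
    chosen = []
    while remaining and len(chosen) < n:
        def score(doc):
            # Penalise similarity to anything already selected.
            penalty = max((similarity(doc, c) for c in chosen), default=0.0)
            return (1 - novelty_weight) * remaining[doc] - novelty_weight * penalty
        best = max(remaining, key=score)
        chosen.append(best)
        del remaining[best]
    return chosen

def same_group(a, b):
    """Toy similarity: documents sharing a first letter count as duplicates."""
    return 1.0 if a[0] == b[0] else 0.0

candidates = [("a1", 1.0), ("a2", 0.95), ("b1", 0.75)]
print(diversify(candidates, same_group, 2))  # ['a1', 'b1']
```

Strict relevance order would return a1 and a2. Because every pick changes the scores of the remaining candidates, there is no fixed per-document upper bound to terminate on.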
  • http://searchevent.org/
    • “A day of informal presentations, open discussion and hacking on open source search technologies.”
    • Tuesday 29th September 2009
    • Friends Meeting House, Cambridge, UK
    • Learn more
    • Free!
  • Questions
    • Xapian: http://xapian.org/
    • Me: [email_address]
    • Photo credits:
      • http://www.flickr.com/photos/striatic/729822/
      • http://www.flickr.com/photos/stephmcg/1592886057/
      • http://www.flickr.com/photos/dullhunk/3389581452/
      • http://www.flickr.com/photos/katielips/3367600309/