Optimising Xapian Richard Boulton UKUUG, Birmingham 9 th  August 2009
Xapian is a search engine
Xapian is  a search engine an information retrieval toolkit
Index my stuff
Find things
…  quickly …  quickly
Arbitrary Boolean restrictions Correct spelling mistaks Suggest search completions Find similar things Browse facets of th...
Mature <ul>10 year old vintage </ul>
Not going to talk about... <ul><li>Scaling across multiple machines </li><ul><li>Big topic – go to a cloud talk. </li></ul...
Am going to talk about... Two types of optimisation <ul>Algorithms Implementation </ul>10
Making the most of hardware <ul><li>Single machine
Limited memory
Database on a slow disk </li></ul>
Requirements <ul><li>Given a set of documents </li><ul><li>terms and frequencies </li></ul><li>And a set of queries </li><...
Analysing the problem <ul><li>Do as much work at indexing time as possible
Precalculated searches?
Can't precalculate everything...
Calculate all single-term queries </li></ul>
Stored data <ul><li>Posting lists: </li></ul>Felt 1 6 8 Pens 3 6 7 9
Single term search <ul><li>Read a posting list
Remember the best </li></ul>Felt 1 6 8 Pens 3 6 7 9
AND search <ul><li>Naive approach: </li><ul><li>Read first list.
Hold it in memory:
Read next list
Merge it in:
Select the best </li></ul></ul>1 6 8 1 3 6 7 8 9
AND search <ul><li>Problem – limited by amount of memory
Problem – no way to avoid reading all of the list </li></ul>
Better AND search <ul><li>Read lists in parallel
Start with the shortest
Jump forward in second list to keep up with first
Keep only the best N items </li></ul>
Better AND search <ul><li>Read lists in parallel
Start with the shortest
Jump  forward in second list to keep up with first
Keep only the best N items </li></ul>
OR search <ul><li>Read lists in parallel
But, unlike AND, we can't skip items
So … make it into an AND
How? </li></ul>
OR search <ul><li>ASSUMPTION: we only want the top few results
Track only those
Keep track of the lowest weight of those
Upcoming SlideShare
Loading in...5
×

Optimising Xapian

1,286

Published on

Talk given to UKUUG on 9th August 2009 about the Xapian search engine, and some of the experiences I've had trying to optimise its design and implementation.

Published in: Technology, Business, Spiritual
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,286
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
16
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Optimising Xapian

  1. 1. Optimising Xapian Richard Boulton UKUUG, Birmingham 9 th August 2009
  2. 2. Xapian is a search engine
  3. 3. Xapian is a search engine an information retrieval toolkit
  4. 4. Index my stuff
  5. 5. Find things
  6. 6. … quickly … quickly
  7. 7. Arbitrary Boolean restrictions Correct spelling mistaks Suggest search completions Find similar things Browse facets of things Find nearby things Get diverse results Find images which look similar Different sort orders Arbitrary weight influences
  8. 8. Mature <ul>10 year old vintage </ul>
  9. 9. Not going to talk about... <ul><li>Scaling across multiple machines </li><ul><li>Big topic – go to a cloud talk. </li></ul><li>Optimising ranking of results </li><ul><li>Huge topic – go to an IR conference! </li></ul><li>Optimising specific installations </li><ul><li>Filesystems, hardware specification, SSDs etc </li></ul></ul>
  10. 10. Am going to talk about... Two types of optimisation <ul>Algorithms Implementation </ul>10
  11. 11. Making the most of hardware <ul><li>Single machine
  12. 12. Limited memory
  13. 13. Database on a slow disk </li></ul>
  14. 14. Requirements <ul><li>Given a set of documents </li><ul><li>terms and frequencies </li></ul><li>And a set of queries </li><ul><li>terms, frequencies and operators </li></ul><li>Find the best matches </li></ul>
  15. 15. Analysing the problem <ul><li>Do as much work at indexing time as possible
  16. 16. Precalculated searches?
  17. 17. Can't precalculate everything...
  18. 18. Calculate all single-term queries </li></ul>
  19. 19. Stored data <ul><li>Posting lists: </li></ul>Felt 1 6 8 Pens 3 6 7 9
  20. 20. Single term search <ul><li>Read a posting list
  21. 21. Remember the best </li></ul>Felt 1 6 8 Pens 3 6 7 9
  22. 22. AND search <ul><li>Naive approach: </li><ul><li>Read first list.
  23. 23. Hold it in memory:
  24. 24. Read next list
  25. 25. Merge it in:
  26. 26. Select the best </li></ul></ul>1 6 8 1 3 6 7 8 9
  27. 27. AND search <ul><li>Problem – limited by amount of memory
  28. 28. Problem – no way to avoid reading all of the list </li></ul>
  29. 29. Better AND search <ul><li>Read lists in parallel
  30. 30. Start with the shortest
  31. 31. Jump forward in second list to keep up with first
  32. 32. Keep only the best N items </li></ul>
  33. 33. Better AND search <ul><li>Read lists in parallel
  34. 34. Start with the shortest
  35. 35. Jump forward in second list to keep up with first
  36. 36. Keep only the best N items </li></ul>
  37. 37. OR search <ul><li>Read lists in parallel
  38. 38. But, unlike AND, we can't skip items
  39. 39. So … make it into an AND
  40. 40. How? </li></ul>
  41. 41. OR search <ul><li>ASSUMPTION: we only want the top few results
  42. 42. Track only those
  43. 43. Keep track of the lowest weight of those
  44. 44. Also, calculate upper bound on weight of each term
  45. 45. When both upper bounds < lowest weight, we need both, so become an AND </li></ul>
  46. 46. Taking it further <ul><li>Can apply this idea across whole query tree
  47. 47. Can introduce other operators – AND_MAYBE
  48. 48. Phrase queries </li><ul><li>AND, followed by checking positions
  49. 49. Or, store pairs of adjacent terms, and then check positions
  50. 50. Or, store certain pairs... </li></ul></ul>
  51. 51. Does it work?
  52. 52. YES 18
  53. 55. Implementation <ul><li>… not a small job
  54. 56. Datastructures
  55. 57. Compression techniques
  56. 58. Micro-optimisations </li></ul>20
  57. 59. Datastructures <ul><li>Assumption – too much data to fit it all in memory
  58. 60. Disks are slow
  59. 61. But faster when reading in chunks
  60. 62. B+-trees – traditional but good </li><ul><li>Block structured, massively branching tree – very shallow </li></ul></ul>
  61. 63. Posting list chunks <ul><li>Store posting lists in chunks
  62. 64. Work out what statistics to store, where
  63. 65. Get tighter bounds on possible weights, so we can skip better </li></ul>
  64. 66. Document length <ul><li>Needed for weight calculation
  65. 67. Store it in each posting list – duplicated, but no side lookup
  66. 68. Or store it only once?
  67. 69. Currently, we store it in all posting lists
  68. 70. New backend stores it only once -> 40% smaller!
  69. 71. But, currently 10 times slower :( </li></ul>
  70. 72. Measurements 25
  71. 76. New problems <ul><li>We often have enough memory these days!
  72. 77. 500M = A huge collection 10 years ago, now only medium
  73. 78. 10M = A large collection 10 years ago, now small – will often fit fully in memory
  74. 79. => IO less of a bottleneck – optimise CPU </li></ul>
  75. 80. New problems <ul><li>Faceted search </li><ul><li>Display information about all the items in the result set
  76. 81. => Have to calculate all the result set!
  77. 82. Or – approximate
  78. 83. Or – precalculate the facet values somehow </li></ul></ul>
  79. 84. New problems <ul><li>Bias results with external weights
  80. 85. Page rank / product rank
  81. 86. Fixed weights – so store documents in decreasing weight order – lets us finish early
  82. 87. But – harder to update dynamically </li></ul>
  83. 89. Geolocation <ul><li>Bias results by distance from a location
  84. 90. Generate hierarchies of terms </li><ul><li>HTM easiest way to implement </li></ul><li>Use to restrict candidates
  85. 91. Combine candidates with dynamically calculated weight </li></ul>
  86. 92. Image similarity <ul><li>Terms representing features
  87. 93. Queries with hundreds of terms
  88. 94. Current optimisations help
  89. 95. … but distribution of frequencies and weights is less amenable to early termination. </li></ul>
  90. 96. Variety <ul><li>Strict relevance order leads to duplication </li><ul><li>Similar items get similar scores </li></ul><li>Usually want to present a selection of results
  91. 97. Order based on combination of novelty and relevance
  92. 98. Score depends on earlier documents
  93. 99. => our early termination doesn't work </li></ul>
  94. 100. http://searchevent.org/ “ A day of informal presentations, open discussion and hacking on open source search technologies.” Tuesday 29 th September 2009 Friends meeting house, Cambridge, UK Learn more Free!
  95. 101. Questions Xapian: http://xapian.org/ Me: [email_address] Photo credits: http://www.flickr.com/photos/striatic/729822/ http://www.flickr.com/photos/stephmcg/1592886057/ http://www.flickr.com/photos/dullhunk/3389581452/ http://www.flickr.com/photos/katielips/3367600309/
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×