Phpconf2008 Sphinx En

Published in: Technology, Design

  1. 1. The riddles of the Sphinx: Full-text engine anatomy atlas
  2. 2. Who are you? <ul><li>Sphinx – FOSS full-text search engine </li></ul>
  3. 3. Who are you? <ul><li>Sphinx – FOSS full-text search engine </li></ul><ul><li>Good at playing ball </li></ul>
  4. 4. Who are you? <ul><li>Sphinx – FOSS full-text search engine </li></ul><ul><li>Good at playing ball </li></ul><ul><li>Good at not playing ball </li></ul>
  5. 5. Who are you? <ul><li>Sphinx – FOSS full-text search engine </li></ul><ul><li>Good at playing ball </li></ul><ul><li>Good at not playing ball </li></ul><ul><li>Good at passing the ball to a team-mate </li></ul>
  6. 6. Who are you? <ul><li>Sphinx – FOSS full-text search engine </li></ul><ul><li>Good at playing ball </li></ul><ul><li>Good at not playing ball </li></ul><ul><li>Good at passing the ball to a team-mate </li></ul><ul><li>Good at many other “inferior” games </li></ul><ul><ul><li>“faceted” search, geosearch, snippet extraction, multi-queries, IO throttling, and 10-20 other interesting directives </li></ul></ul>
  7. 7. What are you here for? <ul><li>What will not be covered? </li></ul><ul><ul><li>No entry-level “what’s that Sphinx and what’s in it for me” overview </li></ul></ul><ul><ul><li>No long quotes from the documentation </li></ul></ul><ul><ul><li>No C++ architecture details </li></ul></ul>
  8. 8. What are you here for? <ul><li>What will not be covered? </li></ul><ul><ul><li>No entry-level “what’s that Sphinx and what’s in it for me” overview </li></ul></ul><ul><ul><li>No long quotes from the documentation </li></ul></ul><ul><ul><li>No C++ architecture details </li></ul></ul><ul><li>What will be? </li></ul><ul><ul><li>How does it generally work inside </li></ul></ul><ul><ul><li>How things can be optimized </li></ul></ul><ul><ul><li>How things can be parallelized </li></ul></ul>
  9. 9. Chapter 1. Engine insides
  10. 10. Total workflow <ul><li>Indexing first </li></ul><ul><li>Searching second </li></ul>
  11. 11. Total workflow <ul><li>Indexing first </li></ul><ul><li>Searching second </li></ul><ul><li>There are data sources (what to fetch, where from) </li></ul><ul><li>There are indexes </li></ul><ul><ul><li>What data sources to index </li></ul></ul><ul><ul><li>How to process the incoming text </li></ul></ul><ul><ul><li>Where to put the results </li></ul></ul>
  12. 12. How indexing works <ul><li>In two acts, with an intermission </li></ul><ul><li>Phase 1 – collecting documents </li></ul><ul><ul><li>Fetch the documents (loop over the sources) </li></ul></ul><ul><ul><li>Split the documents into words </li></ul></ul><ul><ul><li>Process the words (morphology, prefix/infix indexing) </li></ul></ul><ul><ul><li>Replace the words with their wordids (CRC32/64) </li></ul></ul><ul><ul><li>Emit a number of temp files </li></ul></ul>
  13. 13. How indexing works <ul><li>Phase 2 – sorting hits </li></ul><ul><ul><li>A hit (occurrence) is a (docid, wordid, wordpos) record </li></ul></ul><ul><ul><li>Input is a number of partially sorted (by wordid) hit lists </li></ul></ul><ul><ul><li>The incoming lists are merge-sorted </li></ul></ul><ul><ul><li>Output is essentially a single fully sorted hit list </li></ul></ul><ul><li>Intermezzo </li></ul><ul><ul><li>Collect and sort MVA values </li></ul></ul><ul><ul><li>Sort ordinals </li></ul></ul><ul><ul><li>Sort extern attributes </li></ul></ul>
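The two indexing phases can be illustrated with a minimal sketch. It is not Sphinx's actual code: whitespace tokenization, in-memory lists standing in for temp files, and the tiny `chunk_size` are all assumptions made for illustration. Hits are stored as (wordid, docid, wordpos) so that plain tuple ordering sorts by wordid first.

```python
import heapq
import zlib

def collect_hits(docs, chunk_size=4):
    """Phase 1: tokenize documents and emit partially sorted hit chunks."""
    chunks, current = [], []
    for docid, text in docs:
        for wordpos, word in enumerate(text.lower().split(), start=1):
            wordid = zlib.crc32(word.encode("utf-8"))  # word -> wordid
            current.append((wordid, docid, wordpos))
            if len(current) >= chunk_size:
                chunks.append(sorted(current))  # stands in for one temp file
                current = []
    if current:
        chunks.append(sorted(current))
    return chunks

def merge_hits(chunks):
    """Phase 2: merge-sort the sorted chunks into one fully sorted hit list."""
    return list(heapq.merge(*chunks))

docs = [(1, "the quick brown fox"), (2, "the lazy dog"), (3, "quick dog")]
chunks = collect_hits(docs)
hits = merge_hits(chunks)
```

Each chunk is sorted on its own, so the merge only ever compares the heads of the chunks, which is what makes the second phase a sequential pass over the temp data.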
  14. 14. Dumb & dumber <ul><li>The index format is… simple </li></ul><ul><li>Several sorted lists </li></ul><ul><ul><li>Dictionary (the complete list of wordids) </li></ul></ul><ul><ul><li>Attributes (only if docinfo=extern) </li></ul></ul><ul><ul><li>Document lists (for each keyword) </li></ul></ul><ul><ul><li>Hit lists (for each keyword) </li></ul></ul><ul><li>Everything is laid out linearly, good for IO </li></ul>
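As a rough illustration of how those lists relate to each other, the sketch below derives a dictionary, per-keyword document lists, and per-keyword hit lists from one sorted hit list. The in-memory dict layout and the (wordid, docid, wordpos) tuple order are assumptions; the real on-disk format is not described on the slide.

```python
from itertools import groupby

def build_index(sorted_hits):
    """Split a hit list sorted by wordid into dictionary, doclists and hitlists."""
    dictionary, doclists, hitlists = [], {}, {}
    for wordid, group in groupby(sorted_hits, key=lambda hit: hit[0]):
        group = list(group)
        dictionary.append(wordid)  # the complete list of wordids, in order
        doclists[wordid] = sorted({docid for _, docid, _ in group})
        hitlists[wordid] = [(docid, pos) for _, docid, pos in group]
    return dictionary, doclists, hitlists

hits = sorted([(10, 1, 1), (20, 1, 2), (10, 2, 1), (30, 2, 2), (20, 3, 1)])
dictionary, doclists, hitlists = build_index(hits)
```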
  15. 15. How searching works <ul><li>For each local index </li></ul><ul><ul><li>Build a list of candidates (documents that satisfy the full-text query) </li></ul></ul><ul><ul><li>Filter (the analogy is WHERE) </li></ul></ul><ul><ul><li>Rank (compute the documents’ relevance values) </li></ul></ul><ul><ul><li>Sort (the analogy is ORDER BY) </li></ul></ul><ul><ul><li>Group (the analogy is GROUP BY) </li></ul></ul><ul><li>Merge the results from all the local indexes </li></ul>
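The per-index pipeline can be sketched as a handful of list operations. This is a toy model under stated assumptions: the query is a plain AND of keywords, the ranker is the trivial weight=1 one, and `doclists`, `attrs`, and the filter callables are hypothetical in-memory stand-ins.

```python
from collections import defaultdict

def search(doclists, attrs, keywords, filters, max_matches=1000):
    # 1. Candidates: documents that contain every keyword.
    lists = [set(doclists.get(word, ())) for word in keywords]
    candidates = set.intersection(*lists) if lists else set()
    # 2. Filter (the WHERE analogy).
    kept = [d for d in candidates if all(f(attrs[d]) for f in filters)]
    # 3. Rank: trivial ranker, every match weighs 1.
    matches = [(docid, 1) for docid in kept]
    # 4. Sort (the ORDER BY analogy): weight desc, then id asc.
    matches.sort(key=lambda m: (-m[1], m[0]))
    return matches[:max_matches]

def group_by(matches, attrs, name):
    # 5. Group (the GROUP BY analogy): count matches per attribute value.
    groups = defaultdict(int)
    for docid, _ in matches:
        groups[attrs[docid][name]] += 1
    return dict(groups)

doclists = {"laptop": [1, 2, 3, 4], "cheap": [2, 3, 4]}
attrs = {
    1: {"vendor": 7, "price": 900},
    2: {"vendor": 7, "price": 300},
    3: {"vendor": 8, "price": 250},
    4: {"vendor": 8, "price": 1200},
}
matches = search(doclists, attrs, ["laptop", "cheap"], [lambda a: a["price"] < 1000])
groups = group_by(matches, attrs, "vendor")
```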
  16. 16. 1. Searching cost <ul><li>Building the candidates list </li></ul><ul><ul><li>1 keyword = 1+ IO (document list) </li></ul></ul><ul><ul><li>Boolean operations on document lists </li></ul></ul><ul><ul><li>Cost is proportional (~) to the list lengths </li></ul></ul><ul><ul><li>That is, to the sum of all the keyword frequencies </li></ul></ul><ul><ul><li>In case of phrase/proximity/etc. search, there are also operations on hit lists – approx. 2x IO/CPU </li></ul></ul>
  17. 17. 1. Searching cost <ul><li>Building the candidates list </li></ul><ul><ul><li>1 keyword = 1+ IO (document list) </li></ul></ul><ul><ul><li>Boolean operations on document lists </li></ul></ul><ul><ul><li>Cost is proportional (~) to the list lengths </li></ul></ul><ul><ul><li>That is, to the sum of all the keyword frequencies </li></ul></ul><ul><ul><li>In case of phrase/proximity/etc. search, there are also operations on hit lists – approx. 2x IO/CPU </li></ul></ul><ul><li>Bottom line – “The Who” are really bad </li></ul>
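Why the cost follows the list lengths can be seen in a simple two-pointer intersection of sorted document lists. This is a generic textbook merge, offered as an assumption about the general shape of the work rather than as Sphinx's implementation; the step counter is only there to make the cost visible.

```python
def intersect(a, b):
    """Intersect two sorted docid lists; return (common docids, loop steps)."""
    i = j = steps = 0
    common = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common, steps

rare = [5, 500, 900]
frequent = list(range(1, 1001))
common, steps = intersect(rare, frequent)
```

Even with a three-document keyword on one side, the loop still has to walk most of the long list. A query made only of very frequent words, such as "The Who", pays for two long lists plus the hit lists needed for the phrase check.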
  18. 18. 2. Filtering cost <ul><li>docinfo=inline </li></ul><ul><ul><li>Attributes are inlined in the document lists </li></ul></ul><ul><ul><li>ALL the values are duplicated MANY times! </li></ul></ul><ul><ul><li>Immediately accessible after disk read </li></ul></ul><ul><li>docinfo=extern </li></ul><ul><ul><li>Attributes are stored in a separate list (file) </li></ul></ul><ul><ul><li>Fully cached in RAM </li></ul></ul><ul><ul><li>Hashed by docid + binary search </li></ul></ul><ul><li>Simple loop over all filters </li></ul><ul><li>Cost ~ number of candidates and filters </li></ul>
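A minimal sketch of the docinfo=extern idea: attributes sit in one docid-sorted list in RAM and are found by binary search, then a plain loop applies every filter to every candidate. The class and function names are hypothetical, range filters are assumed, and the hashing step mentioned on the slide is left out.

```python
from bisect import bisect_left

class ExternAttrs:
    """Attribute rows kept sorted by docid and looked up by binary search."""

    def __init__(self, rows):
        rows = sorted(rows, key=lambda row: row[0])
        self.docids = [docid for docid, _ in rows]
        self.values = [attrs for _, attrs in rows]

    def get(self, docid):
        i = bisect_left(self.docids, docid)
        if i < len(self.docids) and self.docids[i] == docid:
            return self.values[i]
        return None

def apply_filters(candidates, store, filters):
    """Keep candidates whose attributes fall inside every (name, lo, hi) range."""
    kept = []
    for docid in candidates:
        row = store.get(docid)
        if row is not None and all(lo <= row[name] <= hi for name, lo, hi in filters):
            kept.append(docid)
    return kept

store = ExternAttrs([(3, {"price": 250}), (1, {"price": 900}), (2, {"price": 300})])
cheap = apply_filters([1, 2, 3], store, [("price", 0, 500)])
```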
  19. 19. 3. Ranking cost <ul><li>Direct – depends on the ranker </li></ul><ul><ul><li>Accounting for keyword positions: </li></ul></ul><ul><ul><ul><li>Helps the relevancy </li></ul></ul></ul><ul><ul><ul><li>But costs extra resources – double impact! </li></ul></ul></ul><ul><li>Cost ~ number of results </li></ul><ul><li>Most expensive – phrase proximity + BM25 </li></ul><ul><li>Cheapest – none (weight=1) </li></ul><ul><li>Indirect – induced in the sorting </li></ul>
  20. 20. 4. Sorting cost <ul><li>Cost ~ number of results </li></ul><ul><li>Also depends on the sorting criteria (documents will be supplied in @id asc order) </li></ul><ul><li>Also depends on max_matches </li></ul><ul><li>The higher max_matches is, the harder it is on the server </li></ul><ul><li>1-10K is acceptable, 100K is way too much </li></ul><ul><li>10-20 is not enough (makes little sense) </li></ul>
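One common way to keep a bounded result set is a fixed-size heap, sketched below. It is an assumption used to show why both the number of results and max_matches matter, not a description of Sphinx's sorter: every match is compared against the heap, and the heap itself grows with max_matches.

```python
import heapq

def top_matches(matches, max_matches):
    """Return the best (docid, weight) pairs: weight desc, then docid asc."""
    heap = []  # min-heap, the worst match kept so far sits at heap[0]
    for docid, weight in matches:
        item = (weight, -docid)  # a larger tuple means a better match
        if len(heap) < max_matches:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the current worst match
    return [(-neg_id, weight) for weight, neg_id in sorted(heap, reverse=True)]

matches = [(1, 5), (2, 9), (3, 5), (4, 7)]
best_two = top_matches(matches, 2)
best_three = top_matches(matches, 3)
```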
  21. 21. 5. Grouping cost <ul><li>Grouping is internally a kind of sorting </li></ul><ul><li>Cost affected by the number of results, too </li></ul><ul><li>Cost affected by max_matches, too </li></ul><ul><li>Additionally, the max_matches setting affects @count and @distinct precision </li></ul>
  22. 22. Chapter 2. Optimizing things
  23. 23. How to optimize queries <ul><li>Partitioning the data </li></ul><ul><li>Choosing ranking vs. sorting mode </li></ul><ul><li>Filters vs. keywords </li></ul><ul><li>Filters vs. manual MTF </li></ul><ul><li>Multi queries </li></ul>
  24. 24. How to optimize queries <ul><li>Partitioning the data </li></ul><ul><li>Choosing ranking vs. sorting mode </li></ul><ul><li>Filters vs. keywords </li></ul><ul><li>Filters vs. manual MTF </li></ul><ul><li>Multi queries </li></ul><ul><li>Last line of defense – Three Big Buttons </li></ul>
  25. 25. 1. Partitioning the data <ul><li>Swiss army knife, for different tasks </li></ul><ul><li>Bound by indexing time? </li></ul><ul><ul><li>Partition, re-index the recent changes only </li></ul></ul><ul><li>Bound by filtering? </li></ul><ul><ul><li>Partition, search the needed indexes only </li></ul></ul><ul><li>Bound by CPU/HDD? </li></ul><ul><ul><li>Partition, move out to different cores/HDDs/boxes </li></ul></ul>
  26. 26. 1a. Partitioning vs. indexing <ul><li>Vital to keep the balance right </li></ul><ul><li>Under-partition – and indexing will be slow </li></ul><ul><li>Over-partition – and searching will be slow </li></ul><ul><li>1-10 indexes – work reasonably well </li></ul><ul><li>Some users are fine with 50+ (30+24...) </li></ul><ul><li>Some users are fine with 2000+ (!!!) </li></ul>
  27. 27. 1b. Partitioning vs. filtering <ul><li>Totally, 100% dependent on production query statistics </li></ul><ul><ul><li>Analyze your very own production logs </li></ul></ul><ul><ul><li>Add comments if needed (3rd argument to Query()) </li></ul></ul><ul><li>Justified only if the amount of processed data is going to decrease significantly </li></ul><ul><ul><li>Move out last week’s documents – yes </li></ul></ul><ul><ul><li>Move out English-only documents – no (!) </li></ul></ul>
  28. 28. 1c. Partitioning vs. CPU/HDD <ul><li>Use a distributed index, explicitly map the chunks to physical devices </li></ul><ul><li>Point searchd “at itself” – </li></ul>
      index dist1
      {
          type  = distributed
          local = chunk01
          agent = localhost:3312:chunk02
          agent = localhost:3312:chunk03
          agent = localhost:3312:chunk04
      }
  29. 29. 1c. How to find CPU/HDD bottlenecks <ul><li>Three standard tools </li></ul><ul><ul><li>vmstat – what’s the CPU busy with? how busy is it? </li></ul></ul><ul><ul><li>oprofile – specifically who eats the CPU? </li></ul></ul><ul><ul><li>iostat – how busy is the HDD? </li></ul></ul><ul><li>Also use logs, also use searchd --iostats option </li></ul><ul><li>Normally everything is clear (us/sy/bi/bo…), but! </li></ul><ul><li>Caveat – HDD might be iops bound </li></ul><ul><li>Caveat – CPU load from Sphinx might be induced and “hidden” in sy </li></ul>
  30. 30. 2. Ranking <ul><li>Can now be very different (so-called rankers in extended2 mode) </li></ul><ul><li>Default ranker – phrase+BM25, accounts for keyword positions – not for free </li></ul><ul><li>Sometimes it’s ok to use a simpler ranker </li></ul><ul><li>Sometimes @weight is ignored entirely (searching for ipod, sorting by price) </li></ul><ul><li>Sometimes you can save on the ranker </li></ul>
  31. 31. 3. Filters vs. keywords <ul><li>Well-known trick </li></ul><ul><ul><li>When indexing, add a special, fake keyword to the document (_authorid123) </li></ul></ul><ul><ul><li>When searching, add it to the query </li></ul></ul><ul><li>Obvious questions </li></ul><ul><ul><li>What’s faster, what’s better? </li></ul></ul><ul><li>Simple answer </li></ul><ul><ul><li>Count the change before moving away from the cashier </li></ul></ul>
  32. 32. 3. Filters vs. keywords <ul><li>Cost of searching ~ keyword frequencies </li></ul><ul><li>Cost of filtering ~ number of candidates </li></ul><ul><li>Searching – CPU+IO, filtering – CPU only </li></ul><ul><li>Fake keyword frequency = filter value selectivity </li></ul><ul><li>Frequent value + few candidates -> bad! </li></ul><ul><li>Rare value + many candidates -> good! </li></ul>
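The rule of thumb above can be turned into a back-of-the-envelope comparison. The function below is a toy cost model only: the `io_penalty` factor is an arbitrary assumption standing in for "searching costs CPU+IO, filtering costs CPU only", and real numbers have to come from measuring production queries.

```python
def cheaper_option(candidates, value_frequency, io_penalty=2.0):
    """Compare filtering the candidates against adding a fake keyword.

    candidates      -- documents matched by the real keywords
    value_frequency -- documents carrying the attribute value (fake keyword frequency)
    io_penalty      -- assumed extra cost of reading a document list from disk
    """
    filter_cost = candidates                     # CPU only, one check per candidate
    keyword_cost = value_frequency * io_penalty  # CPU + IO over the keyword's list
    return "filter" if filter_cost <= keyword_cost else "keyword"

frequent_value_few_candidates = cheaper_option(1_000, 1_000_000)
rare_value_many_candidates = cheaper_option(1_000_000, 1_000)
```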
  33. 33. 4. Filters vs. manual MTF <ul><li>Filters are looped over sequentially </li></ul><ul><li>In the order specified by the app! </li></ul><ul><li>Narrowest filter – better at the start </li></ul><ul><li>Widest filter – better at the end </li></ul><ul><li>Does not matter if you use fake keywords </li></ul><ul><li>Exercise for the reader – why? </li></ul>
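The effect of filter order is easy to demonstrate by counting filter evaluations. The sketch assumes that a document is dropped as soon as one filter rejects it; the documents and the two filters are made-up test data.

```python
def count_checks(docs, filters):
    """Apply filters in order, stopping at the first failure per document."""
    checks = passed = 0
    for doc in docs:
        for check in filters:
            checks += 1
            if not check(doc):
                break
        else:
            passed += 1  # no filter rejected the document
    return passed, checks

docs = [{"author": i % 100, "lang": "en" if i % 10 else "de"} for i in range(1000)]

def narrow(doc):
    return doc["author"] == 7   # passes 1% of the documents

def wide(doc):
    return doc["lang"] == "en"  # passes 90% of the documents

narrow_first = count_checks(docs, [narrow, wide])
wide_first = count_checks(docs, [wide, narrow])
```

Both orders keep the same 10 documents, but putting the narrow filter first needs 1010 evaluations instead of 1900.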
  34. 34. 5. Multi-queries <ul><li>Any queries can be sent together in a batch </li></ul><ul><li>Always saves on network roundtrip </li></ul><ul><li>Sometimes allows the optimizer to trigger </li></ul><ul><li>Especially important and frequent case – different sorting/grouping modes </li></ul><ul><li>2x+ optimization for “faceted” searches </li></ul>
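  35. 35. 5. Multi-queries
      require ( "sphinxapi.php" ); // PHP client shipped with Sphinx
      $client = new SphinxClient ();
      $q = "laptop"; // coming from website user
      // query 1: best matches by relevance
      $client->SetSortMode ( SPH_SORT_EXTENDED, "@weight desc" );
      $client->AddQuery ( $q, "products" );
      // query 2: same search, grouped by vendor
      $client->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR );
      $client->AddQuery ( $q, "products" );
      // query 3: same search, 10 cheapest products
      $client->ResetGroupBy ();
      $client->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );
      $client->SetLimits ( 0, 10 );
      $client->AddQuery ( $q, "products" ); // assumed third query, so the price sort and limit take effect
      $result = $client->RunQueries ();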
  36. 36. 6. Three Big Buttons <ul><li>If nothing else helps… </li></ul><ul><li>Cutoff (see SetLimits()) </li></ul><ul><ul><li>Forcibly stops searching after first N matches </li></ul></ul><ul><ul><li>Per-index, not overall </li></ul></ul><ul><li>MaxQueryTime (see SetMaxQueryTime()) </li></ul><ul><ul><li>Forcibly stops searching after M milliseconds </li></ul></ul><ul><ul><li>Per-index, not overall </li></ul></ul>
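What the two limits amount to can be shown as early exits from the matching loop. This is a sketch of the idea for a single index only: the `scan` function is hypothetical, and treating 0 as "no limit" is an assumption of this sketch.

```python
import time

def scan(candidates, cutoff=0, max_query_time_ms=0):
    """Collect matches, stopping early on a match count or a time budget."""
    start = time.monotonic()
    matches = []
    for docid in candidates:
        if cutoff and len(matches) >= cutoff:
            break  # enough matches found in this index
        elapsed_ms = (time.monotonic() - start) * 1000
        if max_query_time_ms and elapsed_ms >= max_query_time_ms:
            break  # time budget for this index is used up
        matches.append(docid)
    return matches

first_ten = scan(range(100), cutoff=10)
everything = scan(range(5))
```

Because both limits apply per index, a search over several indexes can still return more than N matches in total, or take longer than M milliseconds overall.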
  37. 37. 6. Three Big Buttons <ul><li>If nothing else helps… </li></ul><ul><li>Consulting </li></ul><ul><ul><li>We can notice the unnoticed </li></ul></ul><ul><ul><li>We can implement the unimplemented </li></ul></ul>
  38. 38. Chapter 3. Parallelization sample
  39. 39. Combat mission <ul><li>Got ~160M cross-links </li></ul><ul><li>Needed misc reports (by domains -> groupby) </li></ul>
      *************************** 1. row ***************************
      domain_id: 440682
      link_id: 15
      url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750
      url_to: http://xbox360achievements.org/content/view/101/114/
      anchor: NULL
      from_site_id: 9835
      from_forum_id: 1818
      from_author_id: 282
      from_message_id: 2586
      message_published: 2006-09-30 00:00:00
      ...
  40. 40. Tackling – one <ul><li>Partitioned the data </li></ul><ul><li>8 boxes, 4x CPU, ~5M links per CPU </li></ul><ul><li>Used Sphinx </li></ul><ul><li>In theory, we could have used MySQL </li></ul><ul><li>In practice, way too complicated </li></ul><ul><ul><li>Would have resulted in 15-20M+ rows/CPU </li></ul></ul><ul><ul><li>Would have resulted in “manual” aggregation code </li></ul></ul>
  41. 41. Tackling – two <ul><li>Extracted “interesting parts” of the URL when indexing, using a UDF </li></ul><ul><li>Replaced the SELECT with full-text query </li></ul>
      *************************** 1. row ***************************
      url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750
      urlize(url_from,0): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php insidegamer$nl$forum$viewtopic.php$t=40750
      urlize(url_from,1): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php
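The UDF itself is not shown in the deck. The sketch below is a guess at its behavior, reverse-engineered from the single sample row above: host labels joined with `$`, the same host without a leading `www`, each path prefix, and the query string only in mode 0. Anything beyond that one sample is an assumption.

```python
from urllib.parse import urlsplit

def urlize(url, mode=0):
    """Turn a URL into space-separated keywords, one per URL prefix."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    tokens = ["$".join(labels)]          # full host, e.g. www$insidegamer$nl
    if labels[0] == "www":
        labels = labels[1:]
        tokens.append("$".join(labels))  # host without the www label
    prefix = "$".join(labels)
    for segment in filter(None, parts.path.split("/")):
        prefix += "$" + segment          # one keyword per path prefix
        tokens.append(prefix)
    if mode == 0 and parts.query:
        tokens.append(prefix + "$" + parts.query)
    return " ".join(tokens)

url = "http://www.insidegamer.nl/forum/viewtopic.php?t=40750"
with_query = urlize(url, 0)
without_query = urlize(url, 1)
```

Indexing these keywords lets a report such as "all links from this forum" run as an ordinary keyword search instead of a pattern match over URLs.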
  42. 42. Tackling – three <ul><li>64 indexes </li></ul><ul><ul><li>4 searchd instances per box, by CPU/HDD count </li></ul></ul><ul><ul><li>2 indexes (main+delta) per CPU </li></ul></ul><ul><li>All searched in parallel </li></ul><ul><ul><li>Web box queries the main instance at each box </li></ul></ul><ul><ul><li>Main instance queries itself and other 3 copies </li></ul></ul><ul><ul><li>Using 4 instances, because of startup/update </li></ul></ul><ul><ul><li>Using plain HDDs, because of IO stepping </li></ul></ul>
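The fan-out and merge pattern can be sketched with a thread pool. Everything here is a stand-in: `search_chunk` plays the role of one searchd instance, the chunks are tiny in-memory lists of (domain_id, from_site_id) pairs, and the per-site count is a hypothetical report.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def search_chunk(chunk, domain_id):
    """Stand-in for one searchd instance: count links per from_site_id."""
    return Counter(site for domain, site in chunk if domain == domain_id)

def search_all(chunks, domain_id):
    """Query every chunk in parallel and merge the partial group counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        for partial in pool.map(lambda chunk: search_chunk(chunk, domain_id), chunks):
            total.update(partial)
    return total

chunks = [
    [(1, 10), (1, 11)],
    [(1, 10), (2, 10)],
    [(1, 12)],
    [(2, 11)],
]
report = search_all(chunks, domain_id=1)
```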
  43. 43. Results <ul><li>The precision is acceptable </li></ul><ul><li>“Rare” domains – precise results </li></ul><ul><li>“Frequent” domains – precision within 0.5% </li></ul><ul><li>Average query time – 0.125 sec </li></ul><ul><li>90% queries – under 0.227 sec </li></ul><ul><li>95% queries – under 0.352 sec </li></ul><ul><li>99% queries – under 2.888 sec </li></ul>
  44. 44. The end
