Sphinx: Leveraging Scalable Search in Drupal



    Mike Cantelon
    http://straight.com
    http://mikecantelon.com




Transcript of "Sphinx: Leveraging Scalable Search in Drupal"

  1. Sphinx: Leveraging Scalable Search in Drupal
     Mike Cantelon (http://straight.com, http://mikecantelon.com)
     Drupal Camp Vancouver, May 16-17, 2008
  2. This Talk
     - Aims to orient you to the Sphinx realm
     - Code provided on mikecantelon.com to give you basic Drupal Sphinx-driven search
  3. What Is Sphinx?
     - Open source search engine
     - Developed by Andrew Aksyonoff
     - No Java required (written in C++)
     - Mostly cross-platform (Windows version not yet production-ready)
     - Fast
     - Scalable
     - Flexible
     - Straightforward
  4. Who Uses It?
     - NowPublic
     - MySQL
     - The Pirate Bay
     - Joomla
     - MiniNova
     - Etc.: http://www.sphinxsearch.com/powered.html
  5. How Fast is It?
     [Bar chart, source: http://peter-zaitsev.livejournal.com/12809.html. Axis label: Seconds; legend: Cached. Bars: MySQL Boolean Fulltext, Mnogosearch, MySQL Fulltext, MySQL LIKE, Sphinx. An inset chart compares Mnogosearch and Sphinx on a smaller scale.]
  6. How Does Sphinx Scale?
     - Distributed, parallel searching/indexing across multiple agents
     - Agents can be machines (“horizontal scaling”) or processors/cores (“vertical”)
     - Searches are sent to remote agents first
     - Distribution is transparent to the application: any duplicate results are removed
  7. One of Many Potential Sphinx Architectures
     - Sphinx layer: a combined index on top of two indexes, “Index of 1 & 2” and “Index of 3 & 4”
     - MySQL layer: MySQL Shard 1, MySQL Shard 2, MySQL Shard 3, MySQL Shard 4
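A layout like this is typically expressed as a distributed index in sphinx.conf. The fragment below is only a sketch and is not taken from the talk: the host names, port, and index names are hypothetical, and it assumes two searchd instances that each serve one of the lower-level indexes.

```
# Hypothetical hosts and index names; adjust to your own setup.
index combined
{
    type  = distributed
    # Each agent is a remote searchd instance plus the index it serves.
    agent = sphinx1.example.com:3312:index_1_2
    agent = sphinx2.example.com:3312:index_3_4
    # Timeouts are in milliseconds.
    agent_connect_timeout = 1000
    agent_query_timeout   = 3000
}
```

The application then queries `combined` like any other index; searchd fans the query out to the agents and merges the results.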
  8. Sphinx Components and Installation
     - Installation is (usually) straightforward (requires make and a C++ compiler)
       - ./configure
       - make
       - make install
     - Sphinx consists of three components:
       - Search daemon (searchd command)
       - Indexer (indexer command)
       - Search client (search command)
  9. Sphinx Index Fundamentals
     - An index can have one or more data sources
     - Sources include MySQL, the MySQL Sphinx storage engine, Postgres, and xmlpipe2
     - All sources of an index must share the same schema
     - Each document must have a globally unique ID
     - Indexes can be rotated seamlessly
     - Indexes end up taking much more disk space than source text: plan accordingly!
  10. Sphinx Index Options
      - UTF-8 encoding
      - Prefix and infix indexing for wildcard searches
      - HTML stripping
      - N-gram support for Chinese, Japanese, etc.
      - Multiple stemming schemes
      - Stop words
      - Word forms
  11. Sphinx Stemming
      - Good stemming support
      - “Stemming” is the breaking down of words into their base form
      - In addition to stemming, Sphinx supports indexing via Soundex and Metaphone
      - Multiple stemmers can be applied
      - Optional support for the Snowball project's international stemming algorithms (http://snowball.tartarus.org)
  12. Configuration
  13. Sphinx Configuration Overview
      - /usr/local/etc/sphinx.conf is where Sphinx configuration normally lives
      - Defines sources, indexes, and agents
      - Configuration file is in a custom format
  14. Sphinx Source Configuration
      - Source specifies:
        - Type
        - Connection info
        - Queries
        - Attributes
      - Contains three kinds of queries:
        - Main query (what you want to index; first column must be the document's global ID)
        - Detail query (to fetch document details when using the search command)
        - Optional “range” query
  15. Source Range Query
      - The range query is for pulling batches of rows, instead of all rows at once
      - MyISAM has table-wide locking, so big queries can bog things down
      - An example range query:

          sql_query_range = SELECT MIN(nid),MAX(nid) FROM node
          sql_range_step = 512

      - sql_range_step defines the size of the document ID interval fetched on each pass (so at most that many rows per pass)
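For the range query to have any effect, the main query has to reference the `$start` and `$end` macros that Sphinx substitutes on each pass. The source section below is a sketch under assumptions rather than the talk's actual configuration: the source name `stories_src`, the credentials, and the Drupal 5/6-style `node` / `node_revisions` columns are all illustrative.

```
# Hypothetical source; credentials and schema are placeholders.
source stories_src
{
    type     = mysql
    sql_host = localhost
    sql_user = drupal
    sql_pass = secret
    sql_db   = drupal

    # Find the ID bounds once, then walk them 512 IDs at a time.
    sql_query_range = SELECT MIN(nid), MAX(nid) FROM node
    sql_range_step  = 512

    # Main query: first column is the document ID.
    sql_query = \
        SELECT n.nid, n.title, r.body, n.created \
        FROM node n INNER JOIN node_revisions r ON r.vid = n.vid \
        WHERE n.status = 1 AND n.nid >= $start AND n.nid <= $end

    # Attribute used for sorting by creation date.
    sql_attr_timestamp = created

    # Detail query used by the command-line search tool.
    sql_query_info = SELECT nid, title FROM node WHERE nid = $id
}
```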
  16. Sphinx Index And searchd Configuration
      - The index will specify a source, path, and stemming/indexing options
      - Make sure the path you specify for your index exists (for example: /var/opt/local/sphinx)
      - The searchd section will require a directory to exist for logs, process ID file, etc.
      - This directory will usually be /var/log/searchd
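Put together, the index and searchd sections might look like the sketch below. The index name `stories` matches the one queried later in the talk, but the source name, file names, and option choices are assumptions for illustration; the port matches the one used in the PHP client example.

```
# Hypothetical index definition; the directory in "path" must already exist.
index stories
{
    source       = stories_src
    path         = /var/opt/local/sphinx/stories
    morphology   = stem_en
    html_strip   = 1
    charset_type = utf-8
}

# The log directory must already exist as well.
searchd
{
    port      = 3312
    log       = /var/log/searchd/searchd.log
    query_log = /var/log/searchd/query.log
    pid_file  = /var/log/searchd/searchd.pid
}
```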
  17. Starting The Sphinx Search Daemon
      - searchd can be started by simply typing “sudo searchd” into the command line
      - Once searchd is started, build an index by entering “sudo indexer <name of index>”
      - Once your index is created, enter “search -p <some search term>” to test
      - “Seamless” indexing can be done by adding the “--rotate” option when using indexer
  18. Implementation
  19. Example of a Sphinx Implementation
      - The “my_search” module, which will be available for download on mikecantelon.com, shows a simple use of Sphinx in Drupal
  20. Sphinx Implementation Overview
      - Sphinx data is accessible through multiple APIs: PHP, Python, Perl, Ruby, and Java
      - The APIs provide optional search snippet extraction, keyword highlighting, and result paging
      - Results return only document IDs
      - Hence, after a search returns results, you'll still likely need to fetch the details of the documents shown on the result page
  21. Sphinx Search Modes
      - SPH_MATCH_ALL: match all keywords
      - SPH_MATCH_ANY: match any keywords
      - SPH_MATCH_BOOLEAN: no relevance, implicit boolean AND between keywords if not otherwise specified
      - SPH_MATCH_PHRASE: treats query as a phrase and requires a perfect match
      - SPH_MATCH_EXTENDED: allows for specification of phrases and/or keywords and allows boolean operators
  22. Example of PHP Client Connect
      - This example uses SPH_MATCH_EXTENDED:

          // Note: searchd needs to be running
          $cl = new SphinxClient();
          $cl->SetServer('localhost', 3312);
          $cl->SetLimits($start_item, $items_per_page);
          $cl->SetMatchMode(SPH_MATCH_EXTENDED);

      - Note the use of SetLimits to allow us to implement paging/limiting
  23. Sphinx Sort Modes
      - Sphinx has a number of limited, specific sort modes:
        - SPH_SORT_RELEVANCE
        - SPH_SORT_ATTR_DESC
        - SPH_SORT_ATTR_ASC
        - SPH_SORT_TIME_SEGMENTS
      - SPH_SORT_EXTENDED provides SQL-like sorting, listing “columns” and specifying ascending/descending order
      - SPH_SORT_EXPR allows sorting using a mathematical expression involving “columns”
  24. Example of PHP Client Sort and Query
      - This example sorts by relevance (most to least), then by creation date (reverse chronological order):

          $cl->SetSortMode(SPH_SORT_EXTENDED,
            "@relevance DESC, created DESC");
          // Do Sphinx query
          $search_result = $cl->Query(
            $query,
            'stories'
          );
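Since results carry only document IDs, the next step is usually to pull those IDs out of the result so the matching nodes can be loaded. The helper below is a minimal sketch: its name is hypothetical, and it assumes the PHP API's default result layout, where `Query()` returns FALSE on failure and otherwise an array whose `matches` entry is keyed by document ID. The sample array is invented for illustration.

```php
<?php
/**
 * Pull document IDs (Drupal node IDs here) out of a Sphinx result.
 * Hypothetical helper: the name is illustrative, not part of any API.
 */
function my_search_result_nids($search_result) {
  // A failed query yields FALSE; a search with no hits has no 'matches'.
  if (!is_array($search_result) || empty($search_result['matches'])) {
    return array();
  }
  // Assumption: 'matches' is keyed by document ID (the API default).
  return array_keys($search_result['matches']);
}

// Invented sample in the assumed result shape.
$sample = array(
  'total_found' => 2,
  'matches' => array(
    42 => array('weight' => 3, 'attrs' => array('created' => 1210000000)),
    17 => array('weight' => 1, 'attrs' => array('created' => 1200000000)),
  ),
);
$nids = my_search_result_nids($sample);
```

The IDs come back in result order, so loading nodes in that order preserves the sort that Sphinx applied.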
  25. Getting Fancier
  26. Using the Sphinx PHP API Highlighter
      - Once you've received search results from Sphinx, you can use the API's highlighter to generate excerpts and highlight matches within the excerpts
      - To do this you must first retrieve the full text of each of the documents shown on the current results page
      - This will need to be done by database queries or some kind of cache lookup
  27. PHP API Highlighter Example
      - In this example $content is an array containing document IDs for keys and document text for values:

          $highlighting_options = array(
            'before_match'    => "<span style='background-color: yellow'>",
            'after_match'     => "</span>",
            'chunk_separator' => " ... ",
            'limit'           => 256,
            'around'          => 3,
          );
          $excerpts = $cl->BuildExcerpts($content,
            'stories', $query, $highlighting_options);
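One practical detail: the excerpts need to be matched back to their documents. The sketch below assumes that BuildExcerpts returns a plain, numerically indexed list in the same order as the documents passed in (and FALSE on failure), so the document IDs used as keys in $content have to be re-attached. The helper name is hypothetical.

```php
<?php
/**
 * Re-key excerpts by document ID.
 * Hypothetical helper; assumes $excerpts is in the same order as $content.
 */
function my_search_key_excerpts($content, $excerpts) {
  // Bail out on failure or on a count mismatch rather than mispairing.
  if (!is_array($excerpts) || !$content
      || count($excerpts) != count($content)) {
    return array();
  }
  return array_combine(array_keys($content), array_values($excerpts));
}

// Invented sample data: two documents and their excerpts.
$sample_content  = array(12 => 'full text of node 12', 7 => 'full text of node 7');
$sample_excerpts = array('excerpt for 12', 'excerpt for 7');
$keyed = my_search_key_excerpts($sample_content, $sample_excerpts);
```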
  28. Humanizing SPH_MATCH_EXTENDED
      - SPH_MATCH_EXTENDED can be “humanized” by parsing a user's query and breaking it into phrases and keywords separated by a chosen operator
      - For example: given the search query "uwe boll" postal and the chosen search type “match any”, we'd use the boolean operator OR between the extracted phrase and keyword
      - This would end up, internally, as "uwe boll" | postal (note the OR pipe character)
  29. Implementing Humanization in PHP
      - One way to humanize via clumsy regular expression brutality:
        1) Get keywords by using preg_split to split into an array, using phrases (designated by double quotes) as splitting points
        2) Get phrases by using preg_match_all, matching anything in double quotes
        3) Merge arrays
        4) Implode using the search operator with spaces on either side (i.e. " & " or " | ")
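The four steps above can be sketched roughly as follows. This is not the code from the my_search module; the function name and the regular expressions are assumptions chosen for illustration. Note that phrases end up ahead of keywords regardless of where the user typed them, and that nothing here escapes extended-syntax operators the user may have entered.

```php
<?php
/**
 * Turn a raw user query into an SPH_MATCH_EXTENDED query string.
 * Hypothetical helper; name and signature are illustrative only.
 */
function my_search_humanize_query($query, $operator = '|') {
  // Step 2: phrases are anything wrapped in double quotes (quotes kept).
  preg_match_all('/"[^"]+"/', $query, $matches);
  $phrases = $matches[0];

  // Step 1: keywords are what is left after splitting on the phrases.
  $keywords = array();
  foreach (preg_split('/"[^"]*"/', $query) as $chunk) {
    // Drop any stray, unbalanced quote before splitting on whitespace.
    $chunk = str_replace('"', '', $chunk);
    $words = preg_split('/\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY);
    $keywords = array_merge($keywords, $words);
  }

  // Steps 3 and 4: merge, then join with the operator padded by spaces.
  return implode(' ' . $operator . ' ', array_merge($phrases, $keywords));
}

$any = my_search_humanize_query('"uwe boll" postal', '|');
$all = my_search_humanize_query('postal "uwe boll" movie', '&');
```

With “match any” selected, the first call produces the query string from the previous slide.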
  30. What About Paging?
      - As Sphinx queries don't go through Drupal's database abstraction functions, built-in paging won't work automagically the way it does when using Drupal's pager_query function
      - Writing paging logic that looks/acts identical to Drupal's is tedious
      - You can, however, leverage Drupal's paging functionality so theme_pager will do your bidding...
  31. How to Leverage Drupal's Paging
      - Manually set globals to leverage paging:

          $element = 0;
          // Guard against a request with no "page" parameter.
          $page = isset($_GET['page']) ? $_GET['page'] : '';
          $GLOBALS['pager_page_array'] = explode(',', $page);
          $GLOBALS['pager_total_items'][$element] = $found_count;
          $GLOBALS['pager_total'][$element] =
            ceil($found_count / $items_per_page);
          $GLOBALS['pager_page_array'][$element] = max(0, min(
            (int) $GLOBALS['pager_page_array'][$element],
            ((int) $GLOBALS['pager_total'][$element]) - 1));
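The same logic can be wrapped in two small functions: one to work out the offset to pass to SetLimits before the query runs, and one to set the pager globals once the total number of hits is known. This is a sketch with hypothetical function names; it takes the raw "page" parameter as an argument instead of reading $_GET so it can run outside Drupal.

```php
<?php
/**
 * Offset of the first item on the requested page (for SetLimits).
 * Hypothetical helper, not part of Drupal or the Sphinx API.
 */
function my_search_pager_offset($page_param, $items_per_page, $element = 0) {
  $pages = explode(',', (string) $page_param);
  $page = isset($pages[$element]) ? max(0, (int) $pages[$element]) : 0;
  return $page * $items_per_page;
}

/**
 * Set the globals that theme_pager reads; returns the clamped page number.
 * Hypothetical helper mirroring the snippet from the slide.
 */
function my_search_pager_setup($page_param, $found_count, $items_per_page, $element = 0) {
  $pages = explode(',', (string) $page_param);
  $requested = isset($pages[$element]) ? (int) $pages[$element] : 0;
  $total_pages = (int) ceil($found_count / $items_per_page);

  $GLOBALS['pager_page_array'] = $pages;
  $GLOBALS['pager_total_items'][$element] = $found_count;
  $GLOBALS['pager_total'][$element] = $total_pages;
  // Clamp the requested page into the range 0 .. last page.
  $GLOBALS['pager_page_array'][$element] =
    max(0, min($requested, $total_pages - 1));
  return $GLOBALS['pager_page_array'][$element];
}

// Example: 45 hits, 10 per page, user asks for page index 7.
$offset = my_search_pager_offset('2', 10);
$current = my_search_pager_setup('7', 45, 10);
```

Splitting the work this way reflects the order of operations: the offset is needed before the Sphinx query, while the totals only exist after it.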
  32. Multi-Value Attributes (MVAs)
      - Allow you to add Drupal taxonomy data to a Sphinx index
      - Implementation requires two things:
        - Addition of an sql_attr_multi specification in the source section of your Sphinx configuration file (sql_attr_multi tells Sphinx how to query for taxonomy)
        - Addition of a SetFilter call in the PHP API before issuing the query
      - SetFilter example filtering by two taxonomy terms:

          $cl->SetFilter('tag', array(114288, 4567));
  33. MVA sql_attr_multi Example

          sql_attr_multi = uint tag from ranged-query; \
            SELECT n.nid, td.tid FROM term_data td INNER JOIN term_node tn \
              USING(tid) INNER JOIN node n ON tn.nid=n.nid \
              WHERE td.vid=9 AND n.nid >= $start AND n.nid <= $end; \
            SELECT MIN(n.nid), MAX(n.nid) FROM term_data td INNER JOIN term_node tn \
              USING(tid) INNER JOIN node n ON tn.nid=n.nid \
              WHERE td.vid=9
  34. We're done!
  35. Online Resources
      - Official Sphinx site: http://www.sphinxsearch.com/
      - These slides and example module: http://mikecantelon.com/talks/sphinx