Using Sphinx for Search in PHP


This is an intro to Sphinx and PHP. It takes you through the very basics of how Sphinx works, how to set up an index, and how to use the mysql client to search your index. It culminates in a quick little PHP script that builds a small search interface around your index. I will be posting the example code to my GitHub account soon.

This presentation was given to the LV PHP meetup on August 5th.

Published in: Technology

Using Sphinx for Search in PHP

  1. Using Sphinx for Search
     Mike Lively, Slickdeals, LLC
  2. What is Sphinx?
     • A full-text search engine
     • Quickly get high-quality (relevant) results
     • Designed to integrate well with SQL RDBMSs
     • Can work with any data source
     • Can be queried using either an API or SQL
  3. How do I know anything about Sphinx?
     • Manager of Software Architecture for
     • Alexa top 150 site (in the US)
     • Have been working on improving our Sphinx search engine for the last 2 months or so
     • Over 7 million searches a month directly through the interface; many more happen indirectly
  4. When should I use Sphinx?
     • Site / product / document searches
     • Auto-suggest / auto-correct functionality
     • Finding relevant and related items
  5. Simple Architecture
     • Often, search is offloaded straight to the database
     • Search goes to the backend, which performs queries on the database
     • Obviously very easy to implement
  6. Simple Architecture
     • Simple “starts with” searches on indexed fields can sometimes work: `city` LIKE 'Las%'
     • With MyISAM, anything else will lock your database for writes
     • MySQL is not a great or flexible full-text engine
     • It can sometimes be adequate
  7. Sphinx Architecture
     • searchd is responsible for receiving requests from clients and executing the searches against the Sphinx index.
     • indexer is responsible for getting data into the Sphinx index.
     • This separation allows indexing and searching to be scaled separately.
  8. Sphinx Architecture
     • searchd has a binary protocol, with clients available in multiple languages.
     • searchd is also compatible with the MySQL wire protocol (as of MySQL 4.1), so MySQL clients can connect to it.
     • searchd is a daemon that runs on your search servers.
  9. Sphinx Architecture
     • indexer is a shell program that you can execute to build any number of indexes.
     • It can handle index rotation for live indexing.
  10. Not So Quick Side Note
      MySQL IS SLOWWWWWWWWWWWWW (at text matches)
  11. Still Not Quick Side Note
      Indexes won’t help you…
  12. Quicker Side Note
      Full Text Search isn’t so bad IF…
  13. Sphinx Concepts
      • Sphinx indexes “documents”
      • Each document has a unique, unsigned, non-zero integer ID (in either a 32-bit or 64-bit space)
      • Each document has one or more fields
      • Each document has zero or more attributes
  14. Indexes / Sources
      • Sphinx indexes are created from one or more sources.
      • A source can be a database or an XML or TSV stream.
      • You can use multiple sources
      • This is useful for maintaining updated indexes
      • Also used to implement a Sphinx cluster
  15. Sphinx Fields
      • Fields make up the full-text index.
      • When searching, you can search against any number of fields.
      • You can assign different relevancy weights to different fields.
      • The original value of a field is never stored by Sphinx.
      • You should always have at least one.
  16. Sphinx Attributes
      • Data that helps further describe the item being indexed
      • Can be returned as part of the search results
      • Useful for filtering and sorting results
      • These are not part of the full-text index.
  17. MySQL Full Text Search
      • You can get away with MyISAM tables or, as of version 5.6, InnoDB.
      • You don’t care about morphology (think plurals)
      • You don’t need anything but the most basic of search operators
  18. Creating An Index
      • We are going to add an index that sources a MySQL database.
      • The data being sourced is a list of the titles of Wikipedia posts.
  19. Creating An Index
  20. Indexer Configuration
      • We are going to be peeking into a Sphinx configuration file now.
      • You can rebuild the config file by concatenating each section into a single file.
      • On my VM this file is located at /usr/local/etc/sphinx.conf
  21. Source Definition
  22. Source Definition
      Defines the connection information
  23. Connection information
      • Ideally, you should create a separate account for Sphinx
      • You can also connect via a Unix socket
      • I didn’t specify it here, but you can also add a port.
  24. Source Definition
      The query that pulls data to populate the index
  25. Source Index
      • The index query MUST return the id field as the first column
      • Remember, the id needs to be a unique, unsigned integer of 64 bits or fewer
      • The query must be on a single line, unless you escape newlines with backslashes.
      • Notice that we converted the timestamp into a Unix timestamp. That is important.
  26. Source Definition
      How data is stored in the index
  27. Source Fields
      • The first column in the query is always the ID.
      • You specify which columns are attributes.
      • Remember, attributes are stored in the index as values that can be used for filtering and sorting.
      • Any column besides the id that is not specified as an attribute is assumed to be a text field (title)
  28. Index Definition
  29. Index Definition
      • An index includes one or more sources.
      • Each source gets its own “source” line
      • Multiple sources must all define the same fields and attributes.
      • The ids need to be unique across sources
  30. Index Definition
      • path is not actually a path; it’s a filename with no extension.
      • docinfo dictates whether attributes are stored in the index or outside of the index.
      • dict is not really important now. It used to be either crc or keywords; now crc is deprecated.
      • min_word_len is the minimum length of words to index
  31. Rest of the Index Configuration
  32. It’s time to build the index
      indexer <index name>
  33. Searching the Index
      • searchd is the daemon that searches the index
      • Binary protocol
      • OR MySQL compatible too!
  34. searchd config
      Included in the same config file as the rest
  35. Spinning up searchd
  36. “I know MySQL” – Sphinx
  37. MySQL Compatible
  38. MySQL Compatible
      • Tables == Indexes
      • SHOW TABLES… shows indexes.
      • SELECT * FROM <index> works too.
  39. Selecting from an index
  40. Querying Indexes
      • Default limit of 20 rows
      • Notice the text fields are not returned…
      • They would be if we made them attributes (sql_field_string)
  41. Querying Indexes
      • The magic function in SphinxQL is match()
      • match() performs a full-text search against the entire index… usually
      • The ‘@field’ operator can isolate which field is searched on.
  42. Querying Indexes
      • You can query against attributes
      • You can sort results
      • You can use the weight() function to determine relevancy.
  43. Querying Indexes
      • The 25387283 title was more relevant because it matched on the term “testing”
  44. Getting PHP into the mix
      • All we need? PDO.
      • We will build a basic search page
      • It accepts a query and displays up to 100 matching results, ordered by relevancy, with the matching keywords highlighted.
  45. Pulling data from Sphinx
  46. Fetching the data from MySQL
  47. Adding the fancy yellow highlighting
  48. The rest is pretty basic…
  49. Cool things we would talk about if I had like… 3 more hours
      • Auto-suggest, auto-correct
      • More on lemmatization and stemming
      • Distributed Sphinx clustering
      • Delta indexes
      • Real-time indexes
      • The plethora of operators you can use
      • Ranged queries
      • …
  50. Additional Information
      • The Sphinx documentation is actually pretty great
      • Slides are already on SlideShare
      • Will link them to the meetup shortly
  51. Questions?
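
Slides 44 to 47 describe the PHP search page, but the code itself did not survive in this transcript. What follows is only a rough sketch of that flow under stated assumptions, not the code from the talk: it assumes searchd exposes its MySQL-compatible listener on 127.0.0.1:9306, that the index is called wiki_titles, and that the titles live in a MySQL table named titles with id and title columns. All of those names, and every function name below, are hypothetical. The highlighting is done in plain PHP here, which may differ from what the slides showed.

```php
<?php
// Sketch only. The index name, table name, port and all helper names
// are assumptions made for illustration; none come from the slides.

/** Build a SphinxQL query that returns ids ordered by relevancy. */
function buildSearchSql(string $index, int $limit = 100): string
{
    // Identifiers cannot be bound as parameters, so validate the name.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $index)) {
        throw new InvalidArgumentException("Invalid index name: $index");
    }
    $limit = max(1, $limit); // inlined as an int; overrides the default of 20 rows
    return "SELECT id, WEIGHT() AS relevancy FROM $index "
         . "WHERE MATCH(:query) ORDER BY relevancy DESC LIMIT $limit";
}

/** Connect to searchd with PDO's MySQL driver (assumed host and port). */
function connectSphinx(string $dsn = 'mysql:host=127.0.0.1;port=9306'): PDO
{
    return new PDO($dsn, null, null, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        // Let PDO substitute parameters client-side instead of relying
        // on server-side prepared statements.
        PDO::ATTR_EMULATE_PREPARES => true,
    ]);
}

/** Ask Sphinx for matching document ids, most relevant first. */
function searchIds(PDO $sphinx, string $query, string $index = 'wiki_titles'): array
{
    $stmt = $sphinx->prepare(buildSearchSql($index));
    $stmt->execute([':query' => $query]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN, 0);
}

/** Fetch the stored titles from MySQL, keeping Sphinx's ordering. */
function fetchTitles(PDO $mysql, array $ids): array
{
    if (!$ids) {
        return [];
    }
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $mysql->prepare("SELECT id, title FROM titles WHERE id IN ($placeholders)");
    $stmt->execute(array_values($ids));
    $titles = $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // id => title

    $ordered = [];
    foreach ($ids as $id) {
        if (isset($titles[$id])) {
            $ordered[$id] = $titles[$id];
        }
    }
    return $ordered;
}

/** HTML-escape a title and wrap whole-word query matches in <mark>. */
function highlightKeywords(string $text, string $query): string
{
    $words = preg_split('/\W+/u', $query, -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }
    $pattern = '/\b(' . implode('|', array_map('preg_quote', $words)) . ')\b/iu';
    // With one capture group, odd offsets hold the matched keywords.
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($parts as $i => $part) {
        $safe = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $html .= ($i % 2 === 1) ? "<mark>$safe</mark>" : $safe;
    }
    return $html;
}
```

With both servers running, a page could call searchIds(connectSphinx(), $_GET['q'] ?? ''), pass the ids to fetchTitles() along with a regular MySQL connection, and echo each title through highlightKeywords(). The second lookup is needed because, as slide 40 notes, Sphinx does not return text fields unless they are also declared as attributes. Browsers render <mark> with a yellow background by default.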