Using Thinking Sphinx with rails

Thinking Sphinx presented at Ruby Fun Day (http://www.rubyonrails.in/events/3)

Published in: Education, Technology

  1. - Free, open-source SQL full-text search engine
     - An acronym for SQL Phrase Index
     - Developed by Andrew Aksyonoff
  2. - database search
       - using SQL directly: LIKE "%text%"
       - impractical for large text fields.
       - no relevance ranking.
     - full-text search
       - searches all words in every document against the query.
       - moves processing load out of the DB.
       - relevance ranking.
       - other advanced features.
  3. - 2-step process
       - indexing
         - scan the text and build a list of search terms.
       - searching
         - search the index to get references to the data.
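The two steps above can be sketched with a toy inverted index in plain Ruby. This is only an illustration of the idea, not how Sphinx stores or queries its indexes; the method names `build_index` and `search` and the sample documents are made up for this example.

```ruby
# Step 1: indexing. Scan each document's text and map every term
# to the ids of the documents that contain it.
def build_index(documents)
  index = Hash.new { |hash, term| hash[term] = [] }
  documents.each do |id, text|
    text.downcase.scan(/\w+/).uniq.each { |term| index[term] << id }
  end
  index
end

# Step 2: searching. Look the query terms up in the index and return
# references (document ids) to the documents that contain every term.
def search(index, query)
  terms = query.downcase.scan(/\w+/)
  return [] if terms.empty?
  terms.map { |term| index.fetch(term, []) }.reduce(:&)
end

documents = {
  1 => "Sphinx is a full-text search engine",
  2 => "Rails plugins for Sphinx",
  3 => "Full-text search moves load out of the database"
}

index = build_index(documents)
p search(index, "sphinx")        # => [1, 2]
p search(index, "sphinx rails")  # => [2]
```

The search step never touches the original text: it only reads the index and hands back ids, which the application then uses to load the real records.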
  4. - High indexing speed.
       - up to 10 MB/sec on modern CPUs.
     - High search speed.
       - average query is under 0.1 sec on 2-4 GB text collections.
     - High scalability.
       - up to 100 GB of text, up to 100M documents on a single CPU.
     - Supports distributed searching.
       - can be extended to multiple servers.
  5. - Supports phrase proximity ranking.
       - providing good relevance.
     - Supports stopwords.
       - exclude common words like: a, an, the, with, in
     - Supports different search modes
       - "match all", "match phrase" and "match any"
     - Supports relevance modification on the fly.
     - Key Sphinx features are its speed and phrase proximity ranking.
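To make the stopword and search-mode bullets concrete, here is a small plain-Ruby sketch of the three modes. It is a toy matcher under simplifying assumptions, not Sphinx's implementation: the helper names are hypothetical, the stopword list is the one from the slide, and stopwords are simply dropped, so their positions are ignored when matching a phrase.

```ruby
# Split text into lowercase words and drop stopwords.
def tokenize(text, stopwords = %w[a an the with in])
  text.downcase.scan(/\w+/) - stopwords
end

# mode is :all (every query word present), :any (at least one present),
# or :phrase (query words present, adjacent and in the same order).
def matches?(document, query, mode)
  doc_terms = tokenize(document)
  query_terms = tokenize(query)
  return false if query_terms.empty?

  case mode
  when :all    then (query_terms - doc_terms).empty?
  when :any    then !(query_terms & doc_terms).empty?
  when :phrase then doc_terms.each_cons(query_terms.size).include?(query_terms)
  else raise ArgumentError, "unknown mode: #{mode}"
  end
end

doc = "Sphinx is a fast full-text search engine"
p matches?(doc, "engine search", :all)     # => true
p matches?(doc, "engine search", :phrase)  # => false
p matches?(doc, "engine rails", :any)      # => true
```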
  6. - boardreader.com
       - Indexing over 2 billion documents, the BoardReader forum search engine is the biggest Sphinx installation at present.
     - mininova.org
       - Mininova, a popular BitTorrent search engine, serves 3-5 million searches daily.
     - thepiratebay.org
       - The Pirate Bay and (forthcoming) SuprNova moved to Sphinx recently.
     - netlog.com
       - NetLog, a large social network site with over 35 million registered users, uses Sphinx for pretty much every kind of search imaginable: people, photo, blog, event, music, and video searches. 12 million daily queries against 100+ GB indexes are handled by just 2 quad-core search boxes.
  7. - Sphinx can be downloaded from http://www.sphinxsearch.com/
     - Its distribution contains the following programs:
       - indexer
         - utility to create full-text indices
       - searchd
         - daemon to search through full-text indices
       - search
         - test utility to query full-text indices from the command line
       - sphinxapi
         - set of API libraries for Ruby, Python, Perl, Java.
  8. - Configuration
       - settings for indexer and searchd
     - Indexes, Fields, Attributes.
     - Each index has a document id, some fields, and some attributes.
       - The id has to be unique; generally it’s the primary key.
       - The fields contain the text that is to be searched.
       - The attributes contain the data used for sorting, filtering and grouping.
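The split between fields and attributes can be sketched in plain Ruby as follows. The document layout, the `search_documents` helper and its `with:`/`order:` options are assumptions made for illustration (the option names loosely echo the plugin options shown later in the deck); Sphinx itself does this inside its index, not on Ruby hashes.

```ruby
# Each document has a unique id, text fields (searched), and
# attributes (never searched as text; used to filter and sort).
def search_documents(documents, term, with: {}, order: nil)
  needle = term.downcase
  hits = documents.select do |doc|
    in_fields = doc[:fields].values.any? do |text|
      text.downcase.scan(/\w+/).include?(needle)
    end
    in_fields && with.all? { |name, value| doc[:attributes][name] == value }
  end
  hits = hits.sort_by { |doc| doc[:attributes][order] } if order
  hits.map { |doc| doc[:id] }
end

documents = [
  { id: 1, fields: { name: "Alice Smith", bio: "Ruby developer" },
    attributes: { age: 30, created_at: 3 } },
  { id: 2, fields: { name: "Bob Ruby", bio: "Designer" },
    attributes: { age: 25, created_at: 1 } },
  { id: 3, fields: { name: "Carol Jones", bio: "Ruby and Rails" },
    attributes: { age: 30, created_at: 2 } }
]

p search_documents(documents, "ruby")                      # => [1, 2, 3]
p search_documents(documents, "ruby", with: { age: 30 })   # => [1, 3]
p search_documents(documents, "ruby", order: :created_at)  # => [2, 3, 1]
```

Searching for "30" finds nothing here, because the age lives in an attribute rather than in a field.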
  9. - thinking_sphinx
       - by Pat Allan, who also developed Riddle, the underlying API for Sphinx.
       - git://github.com/freelancing-god/thinking-sphinx.git
     - ultrasphinx
       - by Evan Weaver
       - svn://rubyforge.org/var/svn/fauna/ultrasphinx/trunk
  10. - Can be installed simply by
        - ruby script/plugin install <path_to_plugin>
      - No need to write the Sphinx configuration file; the plugins take care of this.
  11. - field aliasing
        - indexes full_name, :as => :name
      - field merging
        - indexes [first_name, last_name], :as => :name
      - field weighting
        - set_property :field_weights => {:last_name => 2, :first_name => 1}
        - User.search "aaa", :field_weights => {:first_name => 1, :last_name => 2}
      - index computed value
        - indexes "age > 15", :as => :minor
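The effect of field weights on ordering can be sketched in plain Ruby. This is a deliberately simplified scoring rule (weight times the number of matching query words per field); Sphinx's real ranking also takes phrase proximity and other factors into account, and the helper names here are hypothetical.

```ruby
# Score a record: for each weighted field, count the query words that
# occur in it and multiply by the field's weight.
def weighted_score(record, query, field_weights)
  terms = query.downcase.scan(/\w+/)
  field_weights.sum do |field, weight|
    words = record.fetch(field, "").downcase.scan(/\w+/)
    weight * terms.count { |term| words.include?(term) }
  end
end

# Return matching records, best score first.
def rank(records, query, field_weights)
  records
    .map { |record| [record, weighted_score(record, query, field_weights)] }
    .select { |_, score| score > 0 }
    .sort_by { |_, score| -score }
    .map { |record, _| record }
end

users = [
  { id: 1, first_name: "Smith", last_name: "Jones" },
  { id: 2, first_name: "John",  last_name: "Smith" }
]

weights = { first_name: 1, last_name: 2 }
p rank(users, "smith", weights).map { |user| user[:id] }  # => [2, 1]
```

With last_name weighted higher, the user whose last name matches comes first; swapping the two weights reverses the order.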
  12. - sorting (using attributes and fields)
        - :sortable => true
        - has created_at
        - User.search("user", :order => :first_name, :sort_mode => :desc)
        - User.search("user", :order => "created_at DESC")
      - filtering (using attributes and fields)
        - User.search :conditions => {:name => "aaa"}
        - User.search :with => {:age => 10}
        - User.search :without => {:age => 10}
      - add custom SQL conditions to index
        - where "first_name = 'aaa'"
  13. - drop-in compatibility with will_paginate
        - User.search "aaa", :page => (params[:page] || 1)
      - geodistance
        - has :latit
        - has :longit
        - set_property :latitude_attr => :latit, :longitude_attr => :longit
        - Address.search "pizza hut", :geo => [1.234, 4.567], :order => "@geodist asc"
      - delta index support
        - set_property :delta => true
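To show what ordering by "@geodist asc" amounts to, here is a plain-Ruby sketch that sorts records by great-circle distance using the haversine formula. Two assumptions are worth flagging: Sphinx may compute the distance with a different formula, and, as far as I know, it expects latitude and longitude in radians and reports the distance in meters, which is the convention used below. The helper names and sample coordinates are made up.

```ruby
# Great-circle distance in meters between two points given in radians
# (haversine formula, mean Earth radius as the default).
def geodist(lat1, lng1, lat2, lng2, radius = 6_371_000.0)
  dlat = lat2 - lat1
  dlng = lng2 - lng1
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin(dlng / 2)**2
  2 * radius * Math.asin(Math.sqrt([a, 1.0].min))
end

# Convert degrees (as usually stored) to radians.
def to_radians(degrees)
  degrees * Math::PI / 180
end

# Sort addresses by distance from origin ([lat, lng] in radians),
# nearest first, using the attribute names from the slide.
def nearest(addresses, origin)
  addresses.sort_by do |address|
    geodist(origin[0], origin[1], address[:latit], address[:longit])
  end
end

addresses = [
  { name: "far",  latit: 0.5,  longit: 0.5 },
  { name: "near", latit: 0.01, longit: 0.01 },
  { name: "mid",  latit: 0.1,  longit: 0.1 }
]
p nearest(addresses, [0.0, 0.0]).map { |address| address[:name] }
# => ["near", "mid", "far"]
```

If coordinates are stored in degrees, they would need converting (for example with `to_radians`) before being indexed or passed as the search origin.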
  14. - searching across multiple models
        - indexes posts.name
        - indexes posts.comments.name
      - comprehensive rake tasks
        - rake ts:conf
        - rake ts:in
        - rake ts:start, rake ts:restart, rake ts:stop
      - multiple deployment environments
        - rake ts:config RAILS_ENV=production
  15. - one-to-one
        - user has_one blog
        - indexes blog.name
      - one-to-many
        - blog has_many posts
        - indexes posts.name
      - many-to-many (through)
        - posts has_many comments through records
        - comments has_many posts through records
        - indexes comments.name
  16. - deeply nested
        - blog has_many posts
        - posts has_many comments
        - indexes posts.comments.name
      - STI
        - User.search("user", :with => {:class_crc => Teacher.to_crc32})
      - polymorphic
        - user has_one phone
        - company has_one phone
        - indexes phone.name
        - where "callable_type = 'User'"
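The STI filter above compares an integer attribute against a checksum of the class name. A minimal plain-Ruby sketch of that idea follows, assuming that `to_crc32` is equivalent to a CRC32 of the class name (computed here with Ruby's standard `Zlib.crc32`); the `class_crc` and `filter_by_class` helpers and the sample rows are hypothetical.

```ruby
require "zlib"

class User; end
class Teacher < User; end
class Student < User; end

# Assumption: the class is identified by the CRC32 of its name, so it
# can be stored and filtered as a numeric attribute.
def class_crc(klass)
  Zlib.crc32(klass.name)
end

# Keep only the rows whose stored checksum matches the given class.
def filter_by_class(rows, klass)
  rows.select { |row| row[:class_crc] == class_crc(klass) }
end

rows = [
  { id: 1, class_crc: class_crc(Teacher) },
  { id: 2, class_crc: class_crc(Student) },
  { id: 3, class_crc: class_crc(Teacher) }
]

p filter_by_class(rows, Teacher).map { |row| row[:id] }  # => [1, 3]
```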
  17. - You can run the index task while Sphinx is running, and it’ll reload the indexes automatically.
      - As of version 0.9.9, your configuration will automatically be reloaded.
      - Keep in mind that if any keywords for Ruby methods - such as id or name - clash with your column names, you need to use the symbol version.
      - Sphinx connects to the DB directly, so don’t expect that any of the model methods can be indexed.
  18. - You can extract the commands for indexing and for starting the search daemon into scripts for fast access.
        - indexer --config config/development.sphinx.conf --all
        - searchd --config config/development.sphinx.conf
      - You can ignore this warning:
        - distributed index 'model_name' can not be directly indexed; skipping.
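One possible shape for such a script, sketched in plain Ruby. It only assembles and runs the two commands from the slide and assumes the configuration file is named config/<environment>.sphinx.conf as shown there; the `sphinx_commands` and `run_sphinx` helpers are hypothetical names, not part of Sphinx or the plugins.

```ruby
# Build the indexer and searchd command lines for an environment.
def sphinx_commands(environment)
  config = "config/#{environment}.sphinx.conf"
  {
    index: ["indexer", "--config", config, "--all"],
    start: ["searchd", "--config", config]
  }
end

# Run one of the commands. The runner defaults to Kernel#system, which
# returns true when the command exits successfully; passing the command
# as separate arguments avoids going through a shell.
def run_sphinx(action, environment, runner: method(:system))
  command = sphinx_commands(environment).fetch(action)
  runner.call(*command)
end

# Print the commands instead of running them:
sphinx_commands("development").each_value { |command| puts command.join(" ") }
```

Calling `run_sphinx(:index, "development")` from the Rails root would then rebuild all indexes, and `run_sphinx(:start, "development")` would start the daemon.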
  19. - ultrasphinx has almost all thinking_sphinx features, plus some additional ones:
        - excerpt highlighting
        - spellcheck fields*
        - faceting on text, date, and numeric fields*
        - *will be demonstrated in the next presentation
  20. - sphinx
        - http://www.sphinxsearch.com/
      - ultrasphinx
        - http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/files/README.html
      - thinking_sphinx
        - http://ts.freelancing-gods.com/
        - http://groups.google.com/group/thinking-sphinx/
