SPHINX AND THINKING SPHINX
10 Minute Intro
HAYES DAVIS
Founder, Appozite
cheaptweet.com | @cheaptweet
@hayesdavis
SPHINX
• Open source full-text search engine
• Designed around SQL
• Standalone daemon (searchd)

Image: http://guardians.net/hawass/images/sphinx3.jpg
THINKING SPHINX
• Rails plugin
• Integrates Active Record with Sphinx
• Makes talking to Sphinx basically painless
BASIC IDEA
• Configure your indexes
• Index
• Query
• Repeat
CONFIGURING INDEXES
• Add indexes on your AR classes using define_index
• Fields (indexes) contain text you can search
• Attributes (has) allow you to sort and constrain your searches
• Careful! Column names aren’t symbols

class Article < ActiveRecord::Base
  define_index do
    # fields
    indexes subject, :sortable => true
    indexes content
    indexes author.name, :as => :author, :sortable => true

    # attributes
    has author_id, created_at, updated_at
  end
end
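
The "column names aren’t symbols" point makes more sense once you see how a block DSL like define_index can be built: bare words such as subject are ordinary method calls that the builder intercepts. The sketch below is a toy in plain Ruby that shows the idea. ToyIndexBuilder and ToyColumn are hypothetical names made up for this example, not Thinking Sphinx’s actual classes.

```ruby
# Toy illustration only: hypothetical classes, not Thinking Sphinx internals.

# Records a chain of bare method calls such as `author.name`.
class ToyColumn
  attr_reader :path

  def initialize(*path)
    @path = path
  end

  # Any further call (e.g. `.name`) extends the column path.
  def method_missing(name, *_args)
    ToyColumn.new(*@path, name)
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

class ToyIndexBuilder
  attr_reader :fields, :attributes

  def initialize
    @fields = []
    @attributes = []
  end

  # Evaluate the block with the builder as `self`, so bare words reach method_missing.
  def self.build(&block)
    builder = new
    builder.instance_eval(&block)
    builder
  end

  # Register a searchable text field.
  def indexes(column, options = {})
    @fields << { :path => column.path, :options => options }
  end

  # Register one or more attributes.
  def has(*columns)
    columns.each { |column| @attributes << column.path }
  end

  # An unknown bare word such as `subject` becomes a column reference.
  def method_missing(name, *_args)
    ToyColumn.new(name)
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

index = ToyIndexBuilder.build do
  indexes subject, :sortable => true
  indexes author.name, :as => :author
  has author_id, created_at
end
```

In this toy, passing :subject instead of subject would fail, because a Symbol has no path method for the builder to read.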
Run the indexer
rake thinking_sphinx:index
source twitterer_core_0
{
  type = mysql
  sql_host = 127.0.0.1
  sql_user = cheaptweet
  sql_pass = cheaptweet
  sql_db = cheaptweet_development2
  sql_query_pre = UPDATE `twitterer` SET `delta` = 0
  sql_query_pre = SET NAMES utf8
  sql_query = SELECT `twitterer`.`id` * 1 + 0 AS `id`, \
    CAST(`twitterer`.`screen_name` AS CHAR) AS `screen_name`, \
    CAST(`twitterer`.`name` AS CHAR) AS `name`, \
    CAST(`twitterer`.`description` AS CHAR) AS `description`, \
    CAST(`twitterer`.`url` AS CHAR) AS `url`, \
    CAST(`twitterer`.`location` AS CHAR) AS `location`, \
    `twitterer`.`id` AS `sphinx_internal_id`, \
    283224142 AS `class_crc`, \
    '283224142' AS `subclass_crcs`, \
    0 AS `sphinx_deleted` \
    FROM twitterer \
    WHERE `twitterer`.`id` >= $start AND `twitterer`.`id` <= $end AND `twitterer`.`delta` = 0 \
    GROUP BY `twitterer`.`id` ORDER BY NULL
  sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) FROM `twitterer` WHERE `twitterer`.`delta` = 0
  sql_attr_uint = sphinx_internal_id
  sql_attr_uint = class_crc
  sql_attr_uint = sphinx_deleted
  sql_attr_multi = uint subclass_crcs from field
  sql_query_info = SELECT * FROM `twitterer` WHERE `id` = (($id - 0) / 1)
}

index twitterer_core
{
  source = twitterer_core_0
  path = /Users/hayesdavis/Appozite/workspace/CheapTweet/data/sphinx/development/twitterer_core
  morphology = stem_en
  charset_type = utf-8
}




MORE ABOUT INDEXING
Thinking Sphinx generates a config file for Sphinx in which the sources (the SQL that feeds each index) and the indexes themselves are defined. It’s a little complicated.
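
One detail of the generated config is worth unpacking: the id * 1 + 0 in sql_query and the ($id - 0) / 1 in sql_query_info are inverse mappings between a table row id and a Sphinx document id. A plausible reading, which is an assumption and not stated on the slide, is that the multiplier is the number of indexed models and the addend is a per-model offset, so that ids from different tables never collide. A minimal sketch with hypothetical function names:

```ruby
# Sketch of the id mapping seen in the generated config.
# Assumption: multiplier = number of indexed models, offset = this model's position.

# Row id -> document id, as in `id * 1 + 0`.
def to_document_id(row_id, multiplier, offset)
  row_id * multiplier + offset
end

# Document id -> row id, as in `($id - 0) / 1`.
def to_row_id(document_id, multiplier, offset)
  (document_id - offset) / multiplier
end

# One indexed model, as in the config above: the mapping is the identity.
single = to_document_id(42, 1, 0)

# Hypothetical app with two indexed models: ids interleave instead of colliding.
article_doc = to_document_id(42, 2, 0)
comment_doc = to_document_id(42, 2, 1)
```

With a single indexed model the arithmetic changes nothing, which matches the * 1 + 0 shown for twitterer.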
Start Sphinx
rake thinking_sphinx:start
# Searches all fields for "pants"
Article.search "pants"

# Conditions are allowed on fields but must be a hash
Article.search "pants", :conditions => {
  :subject => "How To Wear"
}

# Query attributes using :with
Article.search "pants", :with => {
  :author_id => 1, :created_at => 1.week.ago..Time.now
}

SEARCHING
Use the search method on AR classes
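
The split between :conditions (text fields) and :with (attributes) can be pictured with a small in-memory model. This is a toy sketch in plain Ruby, not how Sphinx matches or ranks text. toy_search, ARTICLES and TEXT_FIELDS are hypothetical names invented for the example.

```ruby
# Toy model: :conditions matches words in text fields, :with filters on attribute values.
ARTICLES = [
  { :id => 1, :subject => "How To Wear Pants", :content => "Pants go on legs",  :author_id => 1 },
  { :id => 2, :subject => "Pants Review",      :content => "Comfortable pants", :author_id => 2 },
  { :id => 3, :subject => "How To Wear Hats",  :content => "Hats go on heads",  :author_id => 1 }
].freeze

TEXT_FIELDS = [:subject, :content].freeze

# Split text into lowercase words.
def words(text)
  text.downcase.scan(/\w+/)
end

def toy_search(query, options = {})
  conditions = options.fetch(:conditions, {})
  with       = options.fetch(:with, {})

  ARTICLES.select do |article|
    # The bare query may match in any text field.
    in_any_field = words(query).all? do |word|
      TEXT_FIELDS.any? { |field| words(article[field]).include?(word) }
    end
    # Each condition must match within its named field.
    in_named_fields = conditions.all? do |field, text|
      (words(text) - words(article[field])).empty?
    end
    # Attribute filters accept a single value or a range (=== tests membership for ranges).
    attributes_ok = with.all? { |attribute, expected| expected === article[attribute] }

    in_any_field && in_named_fields && attributes_ok
  end
end

by_subject = toy_search("pants", :conditions => { :subject => "How To Wear" })
by_author  = toy_search("pants", :with => { :author_id => 1 })
by_range   = toy_search("pants", :with => { :author_id => 1..2 })
```

The point of the toy is only the division of labour: text goes through word matching, while attributes are compared as values or ranges.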
BUT WAIT
HOW DO I KEEP INDEXES (ESPECIALLY BIG ONES) UP TO DATE?
DELTA INDEXES TO THE RESCUE
• Mini index of only rows that have been updated
• Must merge into “core” index periodically or it’ll get slow
• Simplest approach: add delta boolean column to model
• Add set_property :delta => true to define_index block
• Delta index is rebuilt on model saves, can cause performance hit
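
The delta flow above can be sketched as a toy model in plain Ruby. It mirrors two details visible in the generated config shown earlier (the core query selects rows with delta = 0, and a full index run first resets delta to 0). Everything else, including the ToyDeltaIndex class and its methods, is hypothetical and much simpler than what Sphinx does.

```ruby
# Toy model of core + delta indexing; not Thinking Sphinx code.
class ToyDeltaIndex
  attr_reader :core, :delta, :delta_rebuilds

  def initialize(rows)
    @rows = rows          # each row: { :id => ..., :text => ... }
    @core = {}
    @delta = {}
    @delta_rebuilds = 0
    full_index
  end

  # Full run: reset every delta flag, rebuild core from all rows, empty the delta index.
  def full_index
    @rows.each { |row| row[:delta] = false }
    @core = @rows.to_h { |row| [row[:id], row[:text]] }
    @delta = {}
  end

  # A model save flags the row and rebuilds the whole delta index.
  def save(id, text)
    row = @rows.find { |r| r[:id] == id }
    if row
      row[:text] = text
    else
      row = { :id => id, :text => text }
      @rows << row
    end
    row[:delta] = true
    @delta = @rows.select { |r| r[:delta] }.to_h { |r| [r[:id], r[:text]] }
    @delta_rebuilds += 1
  end

  # Delta entries override stale core entries for the same id.
  def search(word)
    @core.merge(@delta).select { |_id, text| text.downcase.include?(word.downcase) }.keys.sort
  end
end

index = ToyDeltaIndex.new([{ :id => 1, :text => "red pants" }, { :id => 2, :text => "blue hat" }])
index.save(2, "blue pants")
index.save(3, "green pants")
hits_before_merge = index.search("pants")
delta_size_before_merge = index.delta.size

index.full_index
hits_after_merge = index.search("pants")
```

Each save rebuilds the delta from every flagged row, so the delta keeps growing until the next full run. That is the toy version of the performance hit the slide warns about.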
DEPLOYMENT & PRODUCTION
• Must schedule full re-indexing periodically
• Have god or monit keep an eye on the searchd daemon
• Consider adding some Capistrano (cap) tasks to help out with reindexing and restarting
TIPS, TRICKS, GOTCHAS
• Simplest delta indexing can lead to performance issues
• Indexer assumes you have sequential ids on your DB rows and iterates through them in chunks - very bad if you have big gaps
• Run full indexing as often as you can without hurting performance - it’s usually pretty fast
• You can hand-edit config files if you need to tune - but be careful not to regenerate them afterwards, or your edits are overwritten
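
The sequential-id gotcha can be made concrete with a little arithmetic. The generated sql_query is run once per id range ($start to $end) between the minimum and maximum ids, so the number of queries depends on the id span, not on the row count. The sketch below assumes a step of 1024 ids per range. Treat that value and the helper name ranged_query_stats as assumptions made for illustration.

```ruby
# Counts the ranged queries an id-chunking indexer would issue, and how many return no rows.
# Assumption: 1024 ids per range.
def ranged_query_stats(ids, step = 1024)
  return { :queries => 0, :empty => 0 } if ids.empty?

  min, max = ids.minmax
  queries = 0
  empty = 0
  start = min
  while start <= max
    finish = start + step - 1
    queries += 1
    empty += 1 if ids.none? { |id| id >= start && id <= finish }
    start = finish + 1
  end
  { :queries => queries, :empty => empty }
end

# 10,000 rows with consecutive ids.
dense = ranged_query_stats((1..10_000).to_a)

# 1,000 rows, but with a gap of about a million ids in the middle.
sparse = ranged_query_stats((1..500).to_a + (1_000_001..1_000_500).to_a)
```

Comparing dense[:queries] with sparse[:queries] shows the problem: the sparse table has a tenth of the rows but needs far more ranged queries, and nearly all of them find nothing.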
RESOURCES


Sphinx http://www.sphinxsearch.com/

Thinking Sphinx http://freelancing-god.github.com/ts/en/

Railscast http://railscasts.com/episodes/120-thinking-sphinx
