Digging into solr
Rails Usergroup Hamburg, 13 April 2011
Overview
●   What is solr
●   Solr integration into Rails
●   Challenges for the search
●   Experiences
What is solr
●   Matthew 7:7b / Luke 11:9b
●   (Sermon on the Mount)
●   seek and you will find;
What is solr
[Architecture diagram]
●   HTTP Request Servlet, Update Servlet
●   Admin, different request handlers, XML update handler
●   Solr Core: schema, config, caching, concurrency
●   Lucene
●   Replication
What is solr
●   Unstructured rows
●   Denormalization of data
●   Dynamic fields
●   Schema → Tokenizer, Filters, etc.
●   Tons of XML
What is solr

[Diagram: indexing and query pipelines]
●   Indexing: Strings, Tokenizer, Token Filter → Index
●   Query: Query, Tokenizer, Filter → Index → Results
What is solr
●   GET requests
hl.fragsize=0
&spellcheck=true
&spellcheck.extendedResults=true
&qf=everything_phonetic_wa^1+display_name_phonetic_wa^2+comment_en_wa^4+review_en_wa^8+everything_en_wa^16+everything_wa^32+display_name_en_wa^64+display_name_wa^128
&spellcheck.collate=true
&wt=ruby
&hl=true
&rows=100
&fl=pk_i,score
&start=0
&q=chipotle+bbq
&spellcheck.dictionary=spell_en
&bf=linear(en_rating_points_i,100,0)
&spellcheck.count=1
&qt=dismax
&fq=closed_b:false+AND+domain_id_s:uki*+AND+(type_s:Place)
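A request like the one above can be assembled from a parameter dictionary rather than by hand. The following is a minimal sketch in Python. The host, the path and the function name are assumptions for illustration, since the slides do not show the Solr URL. The parameter names and values are the ones on the slide.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical host and path; the slides do not show the actual Solr URL.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Boosted query fields, copied from the qf parameter on the slide.
QUERY_FIELDS = [
    ("everything_phonetic_wa", 1),
    ("display_name_phonetic_wa", 2),
    ("comment_en_wa", 4),
    ("review_en_wa", 8),
    ("everything_en_wa", 16),
    ("everything_wa", 32),
    ("display_name_en_wa", 64),
    ("display_name_wa", 128),
]


def build_search_url(user_query, domain_prefix="uki", rows=100, start=0):
    """Return a dismax search URL shaped like the request on the slide."""
    params = {
        "qt": "dismax",
        "q": user_query,
        # field^boost pairs, separated by spaces (encoded as '+')
        "qf": " ".join(f"{field}^{boost}" for field, boost in QUERY_FIELDS),
        "bf": "linear(en_rating_points_i,100,0)",
        "fq": f"closed_b:false AND domain_id_s:{domain_prefix}* AND (type_s:Place)",
        "fl": "pk_i,score",
        "rows": rows,
        "start": start,
        "wt": "ruby",
        "hl": "true",
        "hl.fragsize": 0,
        "spellcheck": "true",
        "spellcheck.extendedResults": "true",
        "spellcheck.collate": "true",
        "spellcheck.count": 1,
        "spellcheck.dictionary": "spell_en",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url("chipotle bbq")
```

Here `urlencode` takes care of escaping, so characters such as `^` and `*` are percent-encoded rather than written literally as on the slide.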
What is solr
●   Response type
    ●   XML
    ●   Ruby
    ●   JSON
    ●   XML + XSLT
    ●   etc.
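As an illustration of consuming one of these response types, here is a small Python sketch for `wt=json`. The sample body follows the standard layout of a Solr JSON response (`responseHeader`, then `response` with `numFound`, `start` and `docs`). The document values are invented for the example, and the function name is hypothetical.

```python
import json

# Invented sample body; only the overall layout reflects a Solr JSON response.
SAMPLE_JSON_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 12},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"pk_i": 17, "score": 3.2},
      {"pk_i": 42, "score": 1.5}
    ]
  }
}
"""


def extract_primary_keys(raw_json):
    """Return the pk_i values of all returned documents, in ranking order."""
    body = json.loads(raw_json)
    return [doc["pk_i"] for doc in body["response"]["docs"]]


primary_keys = extract_primary_keys(SAMPLE_JSON_RESPONSE)
```

With `fl=pk_i,score` the response carries only primary keys and scores, so an application would presumably load the full records from its database afterwards.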
Solr integration into Rails
●   Sunspot
●   acts_as_solr
●   Qype → acts_as_solr
●   Optimized Queries for solr
    ●   Monkey patching
    ●   Defined queries without dynamic fields
    ●   Names of search fields differ from AR names
Solr integration into Rails
●   Data consistency
    ●   Synchronous
        –   AR stores in MySQL and Solr
        –   Longer response times
        –   Not really synchronous when replication is involved
    ●   Asynchronous
        –   AR stores in MySQL
        –   Data import via MySQL requests by the Solr master
        –   Out of sync for some minutes
        –   Deletion by flag first, physical deletion later
        –   JavaScript preprocessing of data possible
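The asynchronous variant can be sketched as a periodic import that picks up rows changed since the last run and treats flagged rows as deletions. This is a rough Python illustration only: SQLite stands in for MySQL, and the table layout, column names and function names are all assumptions, not the setup from the talk.

```python
import sqlite3


def fetch_changes(conn, last_import_at):
    """Split rows changed since the last import into upserts and deletions."""
    rows = conn.execute(
        "SELECT id, name, deleted FROM places WHERE updated_at > ? ORDER BY id",
        (last_import_at,),
    ).fetchall()
    # Rows without the deleted flag are (re)indexed.
    to_index = [
        {"pk_i": row_id, "display_name": name}
        for row_id, name, deleted in rows
        if not deleted
    ]
    # Flagged rows are removed from the index; the row itself can be
    # deleted physically at some later point.
    to_delete = [row_id for row_id, _name, deleted in rows if deleted]
    return to_index, to_delete


def demo():
    """Run the import query against a tiny in-memory example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE places "
        "(id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER, deleted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?, ?)",
        [
            (1, "Europa Passage", 100, 0),   # unchanged since last import
            (2, "Galeria Kaufhof", 205, 0),  # changed, must be reindexed
            (3, "Closed Bar", 210, 1),       # flagged as deleted
        ],
    )
    return fetch_changes(conn, last_import_at=200)
```

Between two import runs the index lags behind the database, which is the "out of sync for some minutes" trade-off listed above.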
Challenges - Spellchecking
●   Pool of words for spellchecking
    ●   Words from real data
    ●   ?
●   Beeeeeeer
●   9 languages
    (Image: CC BY-ND 2.0 - JM3)
●   New → spellchecker for different kinds of data
●   Suggestion → Locator → Facet → best match?
●   Similar word → fuzzy search vs. spellchecking
Challenges - Spellchecking

[Image slide, captions: 'Chipotle BBQ', 'Chinese Baby', '!', 'shingles']
(Images: CC BY-ND 2.0 - raybdbomb, Meindert Arnold Jacob, joshDubya, michael clarke stuff)
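The slide names shingles without further detail. As a reminder of the idea, shingles are sequences of adjacent words kept together as a single token, so that a multi-word term can be looked up as a unit. A minimal Python sketch follows; the function name and parameters are made up for illustration and this is not Solr's own shingle filter.

```python
def word_shingles(text, min_size=2, max_size=2):
    """Return all runs of min_size..max_size adjacent words, lowercased."""
    words = text.lower().split()
    shingles = []
    for size in range(min_size, max_size + 1):
        for start in range(len(words) - size + 1):
            shingles.append(" ".join(words[start:start + size]))
    return shingles


pairs = word_shingles("Galeria Kaufhof in Hamburg")
```

With the default size of two, every pair of neighbouring words becomes one token.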
Challenges – Stemming
●   Stemming vs. Lemmatizing
●   9 Languages
●   Hafen – Hafer (Harbor – Oat)
●   Performance
●   Stemming → Solr SnowballPorterFilterFactory
●   Polish → Lemmatizing → OpenOffice
Challenges – Synonyms
●   9 Languages
●   OpenOffice rules !
●   Not all languages available → NL is missing
Challenges – NGrams
●   Huge index
●   Tee matches Steeb
●   EdgeNGrams
●   Bar → Sofabar → Barmbek
    ●   The unmatched part of the string should itself be a word → performance
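The difference between plain n-grams and edge n-grams in the examples above can be shown with a few lines of Python. This is an illustrative sketch of the two techniques, not Solr's filter implementation, and the function names are made up.

```python
def ngrams(word, min_len, max_len):
    """All substrings of length min_len..max_len, from any position."""
    word = word.lower()
    return {
        word[start:start + size]
        for size in range(min_len, max_len + 1)
        for start in range(len(word) - size + 1)
    }


def edge_ngrams(word, min_len, max_len):
    """Only the prefixes of length min_len..max_len."""
    word = word.lower()
    return {word[:size] for size in range(min_len, min(max_len, len(word)) + 1)}


# Plain n-grams: 'tee' is found inside 'Steeb', the unwanted match.
tee_in_steeb = "tee" in ngrams("Steeb", 2, 5)
# Edge n-grams: only prefixes are indexed, so 'tee' no longer matches.
tee_at_edge = "tee" in edge_ngrams("Steeb", 2, 5)
```

Edge n-grams also keep the index smaller, but they find 'Bar' only in 'Barmbek' and no longer in 'Sofabar'.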
Challenges – Phrases
●   Boost matching of phrases → whole entry
    ●   'Europa Passage'
●   Boost matching of phrases → left sided
    ●   'Galeria Kaufhof in Hamburg'
    ●   'Boutique in Galeria Kaufhof'
    ●   JavaScript preprocessing
●   Boost matching of phrase somewhere in entry
●   How to handle matches of only some words of a given phrase?
Challenges – Whitespace in index
●   Index: 'Ping Pong'
●   Search word: 'Pingpong'
●   JavaScript preprocessing
(Images: CC BY-ND 2.0 - zimpenfish, Ewan-M)
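One way such a preprocessing step could work is to index a joined variant of multi-word names next to the original, so that 'Pingpong' finds 'Ping Pong'. The slides mention JavaScript for this step. The sketch below uses Python only to show the idea, and the function name is hypothetical.

```python
def index_variants(name):
    """Return the name plus, for multi-word names, a variant without spaces."""
    words = name.split()
    variants = [name]
    if len(words) > 1:
        variants.append("".join(words))
    return variants


variants = index_variants("Ping Pong")
# A case-insensitive lookup for the search word then succeeds.
found = "pingpong" in [variant.lower() for variant in variants]
```

This covers only the case where the index contains the whitespace; a search for 'Ping Pong' against an indexed 'Pingpong' would need a different approach.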
Experiences – server setup
●   Live: Loadbalancer → (Solr queries) → Slave, Slave, Slave ← (Replication) ← Master ← (Import) ← DB Slave
●   Staging: Slave, Master, DB Slave
●   Dev: iMac (Solr & MySQL)
Experiences – size of indices
●   Staging system → Sunday evening
●   Places in simple format: 712 MB
●   Previews in simple format: 5.519 GB
●   Places, previews, comments (extended): 3.5 GB
●   Big spellchecker: 16 GB
●   New combined index: 15 GB
    ●   Index: 14 GB
    ●   Spellchecker: 1 GB
Experiences – server setup
●   Live servers
●   2 x 8 cores, 2 x 16 cores
●   32 GB RAM
●   Max. CPU usage: up to 500%
●   Solr loves RAM → 32 GB full with cache
Experiences – Solr loves RAM
●   Dev → 1 Gig
●   Staging → 4.5 Gig (no load)
●   Import → 11 Gig and more
●   Production → 14 Gig
Experiences – Solr loves RAM (prod. slave)
Experiences – accesses
●   More than ~60 requests per second is not recommended
●   Up to 40 requests per second is OK
Experiences – CPU load
●   Last Import → up to 250 %
●   Production (slave):
Experiences – Response times
●   Spellchecking 'pizzt' big index (staging):
●   1502 / 48 / 47 / 48 / 31 ms
●   Spellchecking 'pizzt' small index (staging):
●   603 / 12 / 8 / 9 / 9 ms
Experiences – Response times
●   Facet for spellchecking:
●   facet=true&facet.mincount=0&facet.limit=1&wt=ruby&rows=0&fl=pk_i,score&
    facet.query=comment_de_wa:"pizza"+OR+review_de_wa:"pizza"+OR+everything_de_wa:"pizza"+OR+everything_wa:"pizza"+OR+display_name_de_wa:"pizza"+OR+display_name_wa:"pizza"+OR+display_name_ngram:"pizza"&
    facet.query=comment_de_wa:"pizze"+OR+review_de_wa:"pizze"+OR+everything_de_wa:"pizze"+OR+everything_wa:"pizze"+OR+display_name_de_wa:"pizze"+OR+display_name_wa:"pizze"+OR+display_name_ngram:"pizze"&
    facet.query=comment_de_wa:"pizz"+OR+review_de_wa:"pizz"+OR+everything_de_wa:"pizz"+OR+everything_wa:"pizz"+OR+display_name_de_wa:"pizz"+OR+display_name_wa:"pizz"+OR+display_name_ngram:"pizz"&
    facet.query=comment_de_wa:"pizzi"+OR+review_de_wa:"pizzi"+OR+everything_de_wa:"pizzi"+OR+everything_wa:"pizzi"+OR+display_name_de_wa:"pizzi"+OR+display_name_wa:"pizzi"+OR+display_name_ngram:"pizzi"&
    facet.query=comment_de_wa:"pizzs"+OR+review_de_wa:"pizzs"+OR+everything_de_wa:"pizzs"+OR+everything_wa:"pizzs"+OR+display_name_de_wa:"pizzs"+OR+display_name_wa:"pizzs"+OR+display_name_ngram:"pizzs"&
    facet.query=comment_de_wa:"pizzo"+OR+review_de_wa:"pizzo"+OR+everything_de_wa:"pizzo"+OR+everything_wa:"pizzo"+OR+display_name_de_wa:"pizzo"+OR+display_name_wa:"pizzo"+OR+display_name_ngram:"pizzo"&
    facet.query=comment_de_wa:"pizzy"+OR+review_de_wa:"pizzy"+OR+everything_de_wa:"pizzy"+OR+everything_wa:"pizzy"+OR+display_name_de_wa:"pizzy"+OR+display_name_wa:"pizzy"+OR+display_name_ngram:"pizzy"&
    facet.query=comment_de_wa:"pizzn"+OR+review_de_wa:"pizzn"+OR+everything_de_wa:"pizzn"+OR+everything_wa:"pizzn"+OR+display_name_de_wa:"pizzn"+OR+display_name_wa:"pizzn"+OR+display_name_ngram:"pizzn"&
    facet.query=comment_de_wa:"pezzt"+OR+review_de_wa:"pezzt"+OR+everything_de_wa:"pezzt"+OR+everything_wa:"pezzt"+OR+display_name_de_wa:"pezzt"+OR+display_name_wa:"pezzt"+OR+display_name_ngram:"pezzt"&
    facet.query=comment_de_wa:"pizzä"+OR+review_de_wa:"pizzä"+OR+everything_de_wa:"pizzä"+OR+everything_wa:"pizzä"+OR+display_name_de_wa:"pizzä"+OR+display_name_wa:"pizzä"+OR+display_name_ngram:"pizzä"&
    q=*:*&qt=standard&fq=closed_b:false+AND+domain_id_s:de600-hamburg*+AND+(type_s:Place)

●   10 facets: 231 / 5 / 4 / 22 / 3 (→ XML) ms
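The request above checks each spelling candidate with its own `facet.query`, and the candidate with the most hits can then be taken as the best match. A hedged Python sketch of that idea follows. The function names are invented, `wt=json` is used instead of the `wt=ruby` on the slide, and the counts in the usage example are made up. In a real response the counts come from `facet_counts.facet_queries`, keyed by the facet query string.

```python
from urllib.parse import urlencode

# Fields taken from the facet queries on the slide (German index).
FIELDS_DE = [
    "comment_de_wa",
    "review_de_wa",
    "everything_de_wa",
    "everything_wa",
    "display_name_de_wa",
    "display_name_wa",
    "display_name_ngram",
]


def facet_query_for(word):
    """Build one facet query that looks for the word in every field."""
    return " OR ".join(f'{field}:"{word}"' for field in FIELDS_DE)


def build_facet_params(candidates, filter_query):
    """Return the encoded query string with one facet.query per candidate."""
    params = [
        ("q", "*:*"),
        ("qt", "standard"),
        ("rows", 0),
        ("wt", "json"),
        ("facet", "true"),
        ("fq", filter_query),
    ]
    params += [("facet.query", facet_query_for(word)) for word in candidates]
    return urlencode(params)


def best_match(candidates, facet_counts):
    """Pick the candidate whose facet query has the highest count."""
    return max(candidates, key=lambda word: facet_counts.get(facet_query_for(word), 0))


# Made-up counts, standing in for facet_counts.facet_queries of a response.
example_counts = {facet_query_for("pizza"): 120, facet_query_for("pizze"): 3}
winner = best_match(["pizze", "pizza", "pizzs"], example_counts)
```

Since `rows=0`, the request returns no documents at all, only the counts.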
Experiences – Response times

●   Warming up → Staging vs. Production
●   Staging: slow
●   Production: fast
Experiences – Response times

●   Staging / index schema on prod
●   Standard query 'pizza': 106 / 0 / 0 (9122)
●   Fuzzy (pizza~0.3): 4440 / 663 / 0 (40149)
●   Fuzzy (pizza~0.5): 822 / 0 / 0 (12129)
●   Fuzzy (pizza~0.8): 34 / 1 / 0 (9122)
●   Wildcard (rest*): 39 / 0 / 0 (41031)
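Why a lower fuzzy threshold produces more matches and more work can be illustrated with edit distance. The sketch below ASSUMES a similarity of 1 - distance / (length of the shorter word), with a match when the similarity is strictly above the threshold. This is how older Lucene fuzzy queries are commonly described, but the slides do not state it, so treat the formula as an assumption. The candidate words are invented.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """ASSUMED similarity: 1 - distance / length of the shorter word."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return 1.0 - levenshtein(a, b) / shorter


def fuzzy_matches(term, candidates, min_similarity):
    """Return the candidates whose similarity is strictly above the threshold."""
    return [word for word in candidates if similarity(term, word) > min_similarity]


words = ["pizza", "pizze", "piazza", "plaza", "pasta"]
loose = fuzzy_matches("pizza", words, 0.3)
medium = fuzzy_matches("pizza", words, 0.5)
```

Under this assumption a five-letter term at 0.8 would only match itself, which would be consistent with the fuzzy query at 0.8 and the standard query showing the same number in parentheses (9122).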
Experiences - Monitoring
●   Munin
●   New Relic
