Digging into Solr: Rails Usergroup Hamburg, 13 April 2011
  1. Digging into Solr: Rails Usergroup Hamburg, 13 April 2011
  2. Overview
     ● What is Solr
     ● Solr integration into Rails
     ● Challenges for the search
     ● Experiences
  3. What is Solr
     ● Matthew 7:7b / Luke 11:9b
     ● (Sermon on the Mount)
     ● Seek and you will find
  4. What is Solr
  5. What is Solr
     [Architecture diagram. Labels: HTTP Request Servlet, Update Servlet, Admin, XML, Different Request Handler, Update, schema, caching, config, Solr Core, concurrency, Lucene, Replication]
  6. What is Solr
     ● Unstructured rows
     ● Denormalization of data
     ● Dynamic fields
     ● Schema → Tokenizer, Filters, etc.
     ● Tons of XML
  7. What is Solr
     [Pipeline diagram. Labels: Indexing, Query, Filter, Tokenizer, Query Tokenizer, Token Filter, Strings, Index, Results]
  8. What is Solr
     ● GET requests
       hl.fragsize=0&spellcheck=true&spellcheck.extendedResults=true&qf=everything_phonetic_wa^1+display_name_phonetic_wa^2+comment_en_wa^4+review_en_wa^8+everything_en_wa^16+everything_wa^32+display_name_en_wa^64+display_name_wa^128&spellcheck.collate=true&wt=ruby&hl=true&rows=100&fl=pk_i,score&start=0&q=chipotle+bbq&spellcheck.dictionary=spell_en&bf=linear(en_rating_points_i,100,0)&spellcheck.count=1&qt=dismax&fq=closed_b:false+AND+domain_id_s:uki*+AND+(type_s:Place)
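A request like the one on slide 8 is easier to read when it is assembled from named parameters. The following is a minimal sketch, not the code used in the talk: the function name `build_dismax_params` and its `domain_prefix` argument are made up for illustration, while the parameter names and values are copied from the slide.

```python
from urllib.parse import urlencode, parse_qs

# Boosted query fields as listed on the slide (field^boost).
QUERY_FIELDS = [
    ("everything_phonetic_wa", 1),
    ("display_name_phonetic_wa", 2),
    ("comment_en_wa", 4),
    ("review_en_wa", 8),
    ("everything_en_wa", 16),
    ("everything_wa", 32),
    ("display_name_en_wa", 64),
    ("display_name_wa", 128),
]


def build_dismax_params(user_query, domain_prefix="uki"):
    """Return the request parameters for a dismax search as a dict.

    Hypothetical helper: only the parameter names and values come from the slide.
    """
    return {
        "qt": "dismax",
        "q": user_query,
        "qf": " ".join(f"{name}^{boost}" for name, boost in QUERY_FIELDS),
        "bf": "linear(en_rating_points_i,100,0)",
        "fq": f"closed_b:false AND domain_id_s:{domain_prefix}* AND (type_s:Place)",
        "fl": "pk_i,score",
        "start": 0,
        "rows": 100,
        "wt": "ruby",
        "hl": "true",
        "hl.fragsize": 0,
        "spellcheck": "true",
        "spellcheck.dictionary": "spell_en",
        "spellcheck.count": 1,
        "spellcheck.collate": "true",
        "spellcheck.extendedResults": "true",
    }


# urlencode percent-encodes reserved characters and turns spaces into "+".
query_string = urlencode(build_dismax_params("chipotle bbq"))
```

The doubling boosts (1, 2, 4, ... 128) mean that a hit in `display_name_wa` outweighs hits in all lower-ranked fields combined, at least as far as the static boost factors are concerned.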
  9. What is Solr
     ● Response type
        ● XML
        ● Ruby
        ● JSON
        ● XML + XSLT
        ● etc.
  10. Solr integration into Rails
     ● Sunspot
     ● acts_as_solr
     ● Qype → acts_as_solr
     ● Optimized queries for Solr
        ● Monkey patching
        ● Defined queries without dynamic fields
        ● Names of search fields differ from AR names
  11. Solr integration into Rails
     ● Data consistency
        ● Synchronous
           – AR stores in MySQL and Solr
           – Longer response times
           – Not really synchronous when replication is involved
        ● Asynchronous
           – AR stores in MySQL
           – Data import via MySQL requests by the Solr master
           – Out of sync for some minutes
           – Deletion by flag, physical deletion later
           – JavaScript preprocessing of data possible
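The "deletion by flag, physical deletion later" idea from the asynchronous variant can be sketched with an in-memory stand-in for the index. Everything here is an assumption for illustration: the class `ToyIndex`, its methods and the flag field `deleted_b` are invented names, and a real setup would filter and purge inside Solr rather than in Python.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """In-memory stand-in for a search index (hypothetical, not a Solr client)."""

    docs: dict = field(default_factory=dict)

    def upsert(self, pk, doc):
        # A fresh or re-imported document is always visible.
        self.docs[pk] = {**doc, "deleted_b": False}

    def flag_deleted(self, pk):
        # Cheap logical delete: the document stays in the index but is hidden.
        if pk in self.docs:
            self.docs[pk]["deleted_b"] = True

    def search(self, predicate):
        # Every query filters out flagged documents.
        return sorted(
            pk for pk, doc in self.docs.items()
            if not doc["deleted_b"] and predicate(doc)
        )

    def purge(self):
        # Physical deletion, run later (for example during the next import).
        flagged = [pk for pk, doc in self.docs.items() if doc["deleted_b"]]
        for pk in flagged:
            del self.docs[pk]
        return len(flagged)


index = ToyIndex()
index.upsert(1, {"type_s": "Place", "name": "Europa Passage"})
index.upsert(2, {"type_s": "Place", "name": "Galeria Kaufhof"})
index.flag_deleted(2)
visible = index.search(lambda doc: doc["type_s"] == "Place")
purged = index.purge()
```

The flag keeps deletes cheap on the request path, and the window in which a flagged document still occupies space corresponds to the "out of sync for some minutes" trade-off on the slide.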
  12. Challenges – Spellchecking
     ● Pool of words for spellchecking
        ● Words from real data?
     ● Beeeeeeer
     ● 9 languages
     ● New → Spellchecker for different kinds of data
     ● Suggestion → Locator → Facet → best match?
     ● Similar word → fuzzy search vs. spellchecking
     [Image credit: CC BY-ND 2.0 - JM3]
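To make "pool of words from real data" concrete, here is a small sketch that builds a word pool from a few texts and looks up near matches. It uses Python's standard `difflib`, which is a stand-in and not how the Solr spellchecker works internally; the sample texts and the function names `build_word_pool` and `suggest` are invented for illustration.

```python
import difflib
import re
from collections import Counter


def build_word_pool(texts):
    """Count lower-cased words across all texts (the 'pool of words')."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def suggest(word, pool, n=3, cutoff=0.6):
    """Return up to n pool words that are similar to the given word."""
    return difflib.get_close_matches(word.lower(), list(pool), n=n, cutoff=cutoff)


# Made-up sample data standing in for reviews and place names.
pool = build_word_pool(["Best pizza in Hamburg", "Cold beer and pizza"])
pizza_suggestions = suggest("pizzt", pool)
beer_suggestions = suggest("Beeeeeeer", pool)
```

A pool built from real data also contains real typos, which is presumably why the slide puts a question mark behind it.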
  13. Challenges – Spellchecking
     [Image slide. Captions: Chipotle BBQ, Chinese Baby, Dubya !, stuff shingles]
     [Image credits: CC BY-ND 2.0 - raybdbomb, CC BY-ND 2.0 - Meindert Arnold Jacob, CC BY-ND 2.0 - josh, CC BY-ND 2.0 - michael clarke]
  14. Challenges – Stemming
     ● Stemming vs. lemmatizing
     ● 9 languages
     ● Hafen – Hafer (harbor – oat)
     ● Performance
     ● Stemming → Solr SnowballPorterFilterFactory
     ● Polish → Lemmatizing → OpenOffice
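The Hafen/Hafer problem can be shown with a toy example. The suffix stripper below is deliberately naive and is not the Snowball algorithm; the suffix list and the tiny lemma dictionary are assumptions made up for this sketch.

```python
def toy_stem(word, suffixes=("en", "er", "e", "n", "s"), min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain.

    A deliberately naive stand-in for a rule-based stemmer.
    """
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


# Hypothetical lemma dictionary: maps inflected forms to their base form.
TOY_LEMMAS = {"hafen": "hafen", "häfen": "hafen", "hafer": "hafer"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the lower-cased word."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


stems = (toy_stem("Hafen"), toy_stem("Hafer"))
lemmas = (toy_lemmatize("Hafen"), toy_lemmatize("Hafer"))
```

Rule-based stripping collapses both words to the same stem, so a search for oats also finds harbors. The dictionary lookup keeps them apart, at the cost of needing a dictionary per language.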
  15. Challenges – Synonyms
     ● 9 languages
     ● OpenOffice rules!
     ● Not all languages available → NL is missing
  16. Challenges – NGrams
     ● Huge index
     ● Tee matches Steeb
     ● EdgeNGrams
     ● Bar → Sofabar → Barmbek
        ● Non-matched part of the string should be a word → performance
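Why "Tee" matches "Steeb" with n-grams, and how edge n-grams narrow this down, can be seen in a few lines. This is a plain-Python sketch of the idea, not the Solr filter implementation; the function names and the default gram sizes are assumptions.

```python
def ngrams(word, n=3):
    """All substrings of length n, taken at every position in the word."""
    lowered = word.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}


def edge_ngrams(word, min_len=1, max_len=None):
    """Only the prefixes of the word, from min_len up to max_len characters."""
    lowered = word.lower()
    longest = len(lowered) if max_len is None else min(max_len, len(lowered))
    return {lowered[:k] for k in range(min_len, longest + 1)}


# Plain n-grams: "tee" occurs inside "Steeb", so a tea search finds the hotel.
tee_in_steeb = "tee" in ngrams("Steeb")

# Edge n-grams: "bar" is a prefix of "Barmbek" but not of "Sofabar".
bar_in_barmbek = "bar" in edge_ngrams("Barmbek")
bar_in_sofabar_edge = "bar" in edge_ngrams("Sofabar")
bar_in_sofabar_plain = "bar" in ngrams("Sofabar")
```

Plain n-grams index every position of every word, which is one reason for the huge index on the slide. Edge n-grams index only prefixes, so the index is smaller but "Sofabar" no longer matches a search for "Bar".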
  17. Challenges – Phrases
     ● Boost matching of phrases → whole entry
        ● Europa Passage
     ● Boost matching of phrases → left sided
        ● Galeria Kaufhof in Hamburg
        ● Boutique in Galeria Kaufhof
        ● JavaScript preprocessing
     ● Boost matching of phrase somewhere in entry
     ● How to handle matches of some words in a given phrase?
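The three phrase cases on slide 17 (whole entry, left sided, somewhere in the entry) can be told apart with a simple token comparison. The sketch below only illustrates the ranking idea; the function name and the boost values 8, 4, 2 and 1 are arbitrary assumptions, not values from the talk.

```python
def phrase_boost(query, entry, whole=8.0, left=4.0, inside=2.0, none=1.0):
    """Return a boost factor depending on where the query phrase sits in the entry."""
    q = query.lower().split()
    e = entry.lower().split()
    if not q:
        return none
    if e == q:
        return whole  # the phrase is the whole entry
    if e[:len(q)] == q:
        return left  # the entry starts with the phrase
    for start in range(1, len(e) - len(q) + 1):
        if e[start:start + len(q)] == q:
            return inside  # the phrase occurs later in the entry
    return none  # no contiguous phrase match


boosts = {
    entry: phrase_boost("Galeria Kaufhof", entry)
    for entry in (
        "Galeria Kaufhof",
        "Galeria Kaufhof in Hamburg",
        "Boutique in Galeria Kaufhof",
        "Kaufhof Galeria",
    )
}
```

The last entry shows the open question from the slide: all words match, but not as a phrase, and this sketch simply gives it no boost.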
  18. Challenges – Whitespace in index
     ● Index: Ping Pong
     ● Search word: Pingpong
     ● JavaScript preprocessing
     [Image credits: CC BY-ND 2.0 - zimpenfish, CC BY-ND 2.0 - Ewan-M]
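One way to make "Pingpong" find "Ping Pong" is to index an extra key with all whitespace removed and to normalize the search word the same way. The talk mentions JavaScript preprocessing; the Python below is only a sketch of the same idea, and the function name `whitespace_free_key` is invented.

```python
def whitespace_free_key(text):
    """Lower-case the text and drop all whitespace."""
    return "".join(text.lower().split())


indexed_key = whitespace_free_key("Ping Pong")
search_key = whitespace_free_key("Pingpong")
matches = indexed_key == search_key
```

The key would live in an additional field next to the normally tokenized one, so ordinary word matching keeps working.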
  19. Experiences – server setup
     [Setup diagram. Labels: Live, Staging, Dev, Loadbalancer, Slave, iMac, Solr queries, Master, Slave, Slave, Slave, Replication, Solr & MySQL, DB Slave, Master, Import, DB Slave]
  20. Experiences – size of indices
     ● Staging system → Sunday evening
     ● Places in simple format: 712 MB
     ● Previews simple format: 5.519 GByte
     ● Places, Previews, Comments extended: 3.5 GByte
     ● Big spellchecker: 16 GByte
     ● New combined index: 15 GByte
        ● Index: 14 GByte
        ● Spellchecker: 1 GByte
  21. Experiences – server setup
     ● Live servers
     ● 2 x 8 cores, 2 x 16 cores
     ● 32 GByte RAM
     ● Max. CPU usage: up to 500%
     ● Solr loves RAM → 32 GByte filled with cache
  22. Experiences – Solr loves RAM
     ● Dev → 1 GByte
     ● Staging → 4.5 GByte (no load)
     ● Import → 11 GByte and more
     ● Production → 14 GByte
  23. Experiences – Solr loves RAM
     [Chart: prod. slave]
  24. Experiences – accesses
     ● More than ~60 requests per second is not recommended
     ● Up to 40 requests per second is OK
  25. Experiences – accesses
  26. Experiences – CPU load
     ● Last import → up to 250%
     ● Production (slave):
  27. Experiences – Response times
  28. Experiences – Response times
     ● Spellchecking "pizzt", big index (staging):
     ● 1502 / 48 / 47 / 48 / 31 ms
     ● Spellchecking "pizzt", small index (staging):
     ● 603 / 12 / 8 / 9 / 9 ms
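The numbers on slide 28 show the same pattern twice: one slow first request, then much faster repeats. A short calculation, using only the timings from the slide, separates the first (cold) request from the average of the following (warm) ones. The helper name `cold_and_warm` is made up.

```python
from statistics import mean

# Response times in milliseconds, copied from the slide.
TIMINGS_MS = {
    "big index": [1502, 48, 47, 48, 31],
    "small index": [603, 12, 8, 9, 9],
}


def cold_and_warm(samples):
    """Return the first sample and the mean of all later samples."""
    return samples[0], mean(samples[1:])


summary = {name: cold_and_warm(samples) for name, samples in TIMINGS_MS.items()}
```

For the big index this gives 1502 ms cold against 43.5 ms warm, for the small index 603 ms against 9.5 ms. The first request is therefore a poor indicator of steady-state performance.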
  29. Experiences – Response times
     ● Facet for spellchecking:
       facet=true&facet.mincount=0&facet.limit=1&wt=ruby&rows=0&fl=pk_i,score&
       facet.query=comment_de_wa:"pizza"+OR+review_de_wa:"pizza"+OR+everything_de_wa:"pizza"+OR+everything_wa:"pizza"+OR+display_name_de_wa:"pizza"+OR+display_name_wa:"pizza"+OR+display_name_ngram:"pizza"&
       facet.query=comment_de_wa:"pizze"+OR+review_de_wa:"pizze"+OR+everything_de_wa:"pizze"+OR+everything_wa:"pizze"+OR+display_name_de_wa:"pizze"+OR+display_name_wa:"pizze"+OR+display_name_ngram:"pizze"&
       facet.query=comment_de_wa:"pizz"+OR+review_de_wa:"pizz"+OR+everything_de_wa:"pizz"+OR+everything_wa:"pizz"+OR+display_name_de_wa:"pizz"+OR+display_name_wa:"pizz"+OR+display_name_ngram:"pizz"&
       facet.query=comment_de_wa:"pizzi"+OR+review_de_wa:"pizzi"+OR+everything_de_wa:"pizzi"+OR+everything_wa:"pizzi"+OR+display_name_de_wa:"pizzi"+OR+display_name_wa:"pizzi"+OR+display_name_ngram:"pizzi"&
       facet.query=comment_de_wa:"pizzs"+OR+review_de_wa:"pizzs"+OR+everything_de_wa:"pizzs"+OR+everything_wa:"pizzs"+OR+display_name_de_wa:"pizzs"+OR+display_name_wa:"pizzs"+OR+display_name_ngram:"pizzs"&
       facet.query=comment_de_wa:"pizzo"+OR+review_de_wa:"pizzo"+OR+everything_de_wa:"pizzo"+OR+everything_wa:"pizzo"+OR+display_name_de_wa:"pizzo"+OR+display_name_wa:"pizzo"+OR+display_name_ngram:"pizzo"&
       facet.query=comment_de_wa:"pizzy"+OR+review_de_wa:"pizzy"+OR+everything_de_wa:"pizzy"+OR+everything_wa:"pizzy"+OR+display_name_de_wa:"pizzy"+OR+display_name_wa:"pizzy"+OR+display_name_ngram:"pizzy"&
       facet.query=comment_de_wa:"pizzn"+OR+review_de_wa:"pizzn"+OR+everything_de_wa:"pizzn"+OR+everything_wa:"pizzn"+OR+display_name_de_wa:"pizzn"+OR+display_name_wa:"pizzn"+OR+display_name_ngram:"pizzn"&
       facet.query=comment_de_wa:"pezzt"+OR+review_de_wa:"pezzt"+OR+everything_de_wa:"pezzt"+OR+everything_wa:"pezzt"+OR+display_name_de_wa:"pezzt"+OR+display_name_wa:"pezzt"+OR+display_name_ngram:"pezzt"&
       facet.query=comment_de_wa:"pizzä"+OR+review_de_wa:"pizzä"+OR+everything_de_wa:"pizzä"+OR+everything_wa:"pizzä"+OR+display_name_de_wa:"pizzä"+OR+display_name_wa:"pizzä"+OR+display_name_ngram:"pizzä"&
       q=*:*&qt=standard&fq=closed_b:false+AND+domain_id_s:de600-hamburg*+AND+(type_s:Place)
     ● 10 facets: 231 / 5 / 4 / 22 / 3 (→ XML) ms
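The facet request on slide 29 repeats one pattern per spelling candidate, so it lends itself to being generated. The sketch below builds the parameter list from a list of candidates and then picks the candidate with the most hits. The function names are invented, and the hit counts in the usage lines are made-up placeholder numbers, not measurements from the talk.

```python
from urllib.parse import urlencode, parse_qsl

# Fields searched for every candidate, as on the slide.
SPELLCHECK_FIELDS = [
    "comment_de_wa",
    "review_de_wa",
    "everything_de_wa",
    "everything_wa",
    "display_name_de_wa",
    "display_name_wa",
    "display_name_ngram",
]


def facet_query_for(candidate, fields=SPELLCHECK_FIELDS):
    """One facet.query value: the candidate as a phrase, OR-ed over all fields."""
    return " OR ".join(f'{name}:"{candidate}"' for name in fields)


def build_facet_params(candidates, domain_prefix="de600-hamburg"):
    """Request parameters that count hits per candidate without fetching rows."""
    params = [
        ("q", "*:*"),
        ("qt", "standard"),
        ("rows", 0),
        ("fl", "pk_i,score"),
        ("wt", "ruby"),
        ("facet", "true"),
        ("facet.mincount", 0),
        ("facet.limit", 1),
        ("fq", f"closed_b:false AND domain_id_s:{domain_prefix}* AND (type_s:Place)"),
    ]
    params += [("facet.query", facet_query_for(c)) for c in candidates]
    return params


def best_candidate(hit_counts):
    """Pick the candidate with the highest hit count, or None if there are none."""
    return max(hit_counts, key=hit_counts.get) if hit_counts else None


candidates = ["pizza", "pizze", "pizz", "pizzä"]
facet_query_string = urlencode(build_facet_params(candidates))

# Placeholder counts for illustration only.
winner = best_candidate({"pizza": 120, "pizze": 3, "pizz": 0, "pizzä": 0})
```

A list of pairs is used instead of a dict because `facet.query` appears once per candidate, and `urlencode` keeps repeated keys in that form.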
  30. Experiences – Response times
     ● Warming up → Staging vs. Production
     ● Staging: slow
     ● Production: fast
  31. Experiences – Response times
     ● Staging / index schema on prod
     ● Standard query (pizza): 106 / 0 / 0 (9122)
     ● Fuzzy (pizza~0.3): 4440 / 663 / 0 (40149)
     ● Fuzzy (pizza~0.5): 822 / 0 / 0 (12129)
     ● Fuzzy (pizza~0.8): 34 / 1 / 0 (9122)
     ● Wildcard (rest*): 39 / 0 / 0 (41031)
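The fuzzy numbers on slide 31 become more plausible when the similarity value is translated into allowed edits. The sketch below assumes the classic Lucene definition of fuzzy similarity, 1 minus the edit distance divided by the length of the shorter term, with a strict greater-than comparison and no fixed prefix. That definition is an assumption about the Solr version used in the talk, and the function names are invented.

```python
from fractions import Fraction


def levenshtein(a, b):
    """Edit distance with insert, delete and substitute, each costing 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_similarity(term, candidate):
    """Assumed classic definition: 1 - distance / length of the shorter term."""
    shorter = min(len(term), len(candidate))
    if shorter == 0:
        return Fraction(0)
    return 1 - Fraction(levenshtein(term, candidate), shorter)


def fuzzy_matches(term, candidate, min_similarity):
    """min_similarity is passed as a string such as "0.8" to avoid float rounding."""
    return fuzzy_similarity(term, candidate) > Fraction(min_similarity)


one_edit_at_08 = fuzzy_matches("pizza", "pizze", "0.8")
one_edit_at_05 = fuzzy_matches("pizza", "pizze", "0.5")
three_edits_at_03 = fuzzy_matches("pizza", "pasta", "0.3")
```

Under this assumption `pizza~0.8` allows no edits at all for a five-letter term, which would fit the slide: it returns the same 9122 hits as the standard query. Lower thresholds allow two or three edits, so far more index terms have to be compared, which would fit the larger hit counts and slower first requests.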
  32. Experiences – Monitoring
     ● Munin
     ● New Relic
