Solr rug

Digging into Solr

Published in: Technology


  1. Digging into Solr
     Rails Usergroup Hamburg, 13 April 2011
  2. Overview
     • What is Solr
     • Solr integration into Rails
     • Challenges for the search
     • Experiences
  3. What is Solr
     • Matthew 7:7b / Luke 11:9b
     • (Sermon on the Mount)
     • "seek and you will find"
  4. What is Solr
  5. What is Solr
     (Architecture diagram; labels: HTTP Request Servlet, Update Servlet, Admin, XML, different Request Handlers, Update, schema, caching, config, Solr Core, concurrency, Lucene, Replication)
  6. What is Solr
     • Unstructured rows
     • Denormalization of data
     • Dynamic fields
     • Schema → Tokenizer, Filters, etc.
     • Tons of XML
  7. What is Solr
     (Diagram; labels: Indexing, Query, Filter, Tokenizer, Query Tokenizer, Token Filter, Strings, Index, Results)
  8. What is Solr
     • GET requests
       hl.fragsize=0&spellcheck=true&spellcheck.extendedResults=true&
       qf=everything_phonetic_wa^1+display_name_phonetic_wa^2+comment_en_wa^4+review_en_wa^8+everything_en_wa^16+everything_wa^32+display_name_en_wa^64+display_name_wa^128&
       spellcheck.collate=true&wt=ruby&hl=true&rows=100&fl=pk_i,score&start=0&
       q=chipotle+bbq&spellcheck.dictionary=spell_en&
       bf=linear(en_rating_points_i,100,0)&spellcheck.count=1&qt=dismax&
       fq=closed_b:false+AND+domain_id_s:uki*+AND+(type_s:Place)
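A query string like the one on this slide is easier to maintain when it is built from a hash instead of being concatenated by hand. The following is a minimal sketch, not the code used in the talk: the host and path in `SOLR_SELECT_URL` and the helper name `dismax_url` are assumptions for illustration, while the parameter names and values are copied from the slide.

```ruby
require "uri"

# Hypothetical host and path; the slide shows only the query string.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Field boosts copied from the slide's qf parameter.
QF_BOOSTS = {
  "everything_phonetic_wa"   => 1,
  "display_name_phonetic_wa" => 2,
  "comment_en_wa"            => 4,
  "review_en_wa"             => 8,
  "everything_en_wa"         => 16,
  "everything_wa"            => 32,
  "display_name_en_wa"       => 64,
  "display_name_wa"          => 128
}.freeze

# Builds the request URL for a dismax search (helper name is hypothetical).
def dismax_url(query, rows: 100, start: 0)
  params = {
    "q"     => query,
    "qt"    => "dismax",
    "qf"    => QF_BOOSTS.map { |field, boost| "#{field}^#{boost}" }.join(" "),
    "bf"    => "linear(en_rating_points_i,100,0)",
    "fq"    => "closed_b:false AND domain_id_s:uki* AND (type_s:Place)",
    "fl"    => "pk_i,score",
    "rows"  => rows,
    "start" => start,
    "wt"    => "ruby",
    "spellcheck"            => true,
    "spellcheck.dictionary" => "spell_en",
    "spellcheck.collate"    => true,
    "spellcheck.count"      => 1
  }
  # encode_www_form escapes spaces as "+" and reserved characters as %XX.
  "#{SOLR_SELECT_URL}?#{URI.encode_www_form(params)}"
end

url = dismax_url("chipotle bbq")
```

Letting `URI.encode_www_form` do the escaping avoids the hand-typed `+` and `^` characters that are easy to get wrong in a long request.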
  9. What is Solr
     • Response type
       • XML
       • Ruby
       • JSON
       • XML + XSLT
       • etc.
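To illustrate what consuming one of these response types can look like, here is a small sketch that reads the primary keys out of a JSON response. The sample body is made up for illustration and only mirrors the general shape Solr uses for `wt=json`; the helper name `primary_keys` is hypothetical, and `pk_i` is the field requested in the slide's `fl` parameter.

```ruby
require "json"

# Hypothetical sample body in the shape Solr returns for wt=json.
SAMPLE_BODY = <<~JSON
  {"responseHeader": {"status": 0, "QTime": 5},
   "response": {"numFound": 2, "start": 0,
                "docs": [{"pk_i": 42, "score": 1.7},
                         {"pk_i": 7,  "score": 0.9}]}}
JSON

# Returns the pk_i value of every document, in ranking order.
def primary_keys(body)
  JSON.parse(body).fetch("response").fetch("docs").map { |doc| doc.fetch("pk_i") }
end

ids = primary_keys(SAMPLE_BODY)
```

The resulting ids could then be used to load the matching records from the database, which fits a setup where Solr returns only `pk_i` and `score`.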
  10. Solr integration into Rails
      • Sunspot
      • acts_as_solr
      • Qype → acts_as_solr
      • Optimized queries for Solr
        • Monkey patching
        • Defined queries without dynamic fields
        • Names of search fields differ from AR names
  11. Solr integration into Rails
      • Data consistency
        • Synchronous
          – AR stores in MySQL and Solr
          – Longer response times
          – Not truly synchronous when replication is involved
        • Asynchronous
          – AR stores in MySQL
          – Data import via MySQL requests by the Solr master
          – Out of sync for some minutes
          – Deletion by flag first, physical deletion later
          – JavaScript preprocessing of data possible
  12. Challenges - Spellchecking
      • Pool of words for spellchecking
      • Words from real data?
      • Beeeeeeer
      • 9 languages
      • New → spellchecker for different kinds of data
      • Suggestion → Locator → Facet → best match?
      • Similar word → fuzzy search vs. spellchecking
      (Image credit: CC BY-ND 2.0 - JM3)
  13. Challenges - Spellchecking
      • Chipotle BBQ
      • Chinese Baby
      • Dubya !
      • shingles
      (Image credits: CC BY-ND 2.0 raybdbomb; CC BY-ND 2.0 - Meindert Arnold Jacob; CC BY-ND 2.0 - josh; CC BY-ND 2.0 - michael clarke stuff)
  14. Challenges – Stemming
      • Stemming vs. lemmatizing
      • 9 languages
      • Hafen – Hafer (harbor – oats)
      • Performance
      • Stemming → Solr SnowballPorterFilterFactory
      • Polish → lemmatizing → OpenOffice
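The Hafen/Hafer example can be made concrete with a deliberately crude suffix stripper. This is a toy for illustration only, not the Snowball algorithm and not the code from the talk; `TOY_SUFFIXES` and `toy_stem` are hypothetical names. It shows how stripping endings maps two unrelated words to the same stem, which is the problem a dictionary-based lemmatizer is meant to avoid.

```ruby
# Toy suffix stripper, NOT the Snowball algorithm: it removes one of a few
# common German endings so the collision is easy to see.
TOY_SUFFIXES = %w[en er e].freeze

def toy_stem(word)
  lower = word.downcase
  # Only strip when at least three characters remain as the stem.
  suffix = TOY_SUFFIXES.find { |s| lower.end_with?(s) && lower.length - s.length >= 3 }
  suffix ? lower[0...-suffix.length] : lower
end

toy_stem("Hafen") # => "haf"
toy_stem("Hafer") # => "haf", so a search for "Hafen" would also hit "Hafer"
```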
  15. Challenges – Synonyms
      • 9 languages
      • OpenOffice rules!
      • Not all languages available → NL is missing
  16. Challenges – NGrams
      • Huge index
      • Tee matches Steeb
      • EdgeNGrams
      • Bar → Sofabar → Barmbek
        • The unmatched part of the string should itself be a word → performance
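The difference between plain n-grams and edge n-grams on this slide can be sketched in a few lines. This is an illustration of the idea rather than Solr's filter implementation; the function names and the gram sizes (3 to 5) are assumptions.

```ruby
# All substrings of length min..max, as a plain n-gram filter would index them.
def ngrams(word, min:, max:)
  chars = word.downcase
  (min..max).flat_map do |n|
    (0..chars.length - n).map { |i| chars[i, n] }
  end.uniq
end

# Only the prefixes of length min..max (front edge n-grams).
def edge_ngrams(word, min:, max:)
  chars = word.downcase
  upper = [max, chars.length].min
  (min..upper).map { |n| chars[0, n] }
end

ngrams("Steeb", min: 3, max: 5).include?("tee")        # => true, the unwanted match
edge_ngrams("Steeb", min: 3, max: 5).include?("tee")   # => false
edge_ngrams("Barmbek", min: 3, max: 5).include?("bar") # => true
edge_ngrams("Sofabar", min: 3, max: 5).include?("bar") # => false
```

Front edge n-grams remove the "Tee matches Steeb" problem and shrink the index, but they also lose the "Bar" in "Sofabar", which is the trade-off the slide points at.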
  17. Challenges – Phrases
      • Boost matching of phrases → whole entry
        • Europa Passage
      • Boost matching of phrases → left-sided
        • Galeria Kaufhof in Hamburg
        • Boutique in Galeria Kaufhof
        • JavaScript preprocessing
      • Boost matching of a phrase somewhere in the entry
      • How to handle matches of only some words of a given phrase?
  18. Challenges – Whitespace in index
      • Index: Ping Pong
      • Search word: Pingpong
      • JavaScript preprocessing
      (Image credits: CC BY-ND 2.0 - zimpenfish; CC BY-ND 2.0 - Ewan-M)
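One plausible way to make "Ping Pong" and "Pingpong" meet is to store an extra, whitespace-free form of the name and apply the same normalization to the search word. The slide only says that JavaScript preprocessing was used; the Ruby sketch below is an assumption about what such a step could do, and `squash` is a hypothetical helper name.

```ruby
# Lowercases the text and removes all whitespace so that differently
# spaced spellings compare equal.
def squash(text)
  text.downcase.gsub(/\s+/, "")
end

indexed = squash("Ping Pong")  # value stored in an additional index field
wanted  = squash("Pingpong")   # same normalization applied at query time
indexed == wanted              # => true
```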
  19. Experiences – server setup
      (Setup diagram; labels: Live, Staging, Dev, Loadbalancer, Slave, iMac, Solr queries, Master, Slave, Slave, Slave, Replication, Solr & MySQL, DB Slave, Master, Import, DB Slave)
  20. Experiences – size of indices
      • Staging system → Sunday evening
      • Places in simple format: 712 MB
      • Previews in simple format: 5.519 GB
      • Places, Previews, Comments extended: 3.5 GB
      • Big spellchecker: 16 GB
      • New combined index: 15 GB
        • Index: 14 GB
        • Spellchecker: 1 GB
  21. Experiences – server setup
      • Live servers
      • 2 x 8 cores, 2 x 16 cores
      • 32 GB RAM
      • Max. CPU usage: up to 500%
      • Solr loves RAM → 32 GB full with cache
  22. Experiences – Solr loves RAM
      • Dev → 1 GB
      • Staging → 4.5 GB (no load)
      • Import → 11 GB and more
      • Production → 14 GB
  23. Experiences – Solr loves RAM
      (Chart: production slave)
  24. Experiences – accesses
      • More than ~60 requests per second is not recommended
      • A maximum of 40 requests per second is OK
  25. Experiences – accesses
  26. Experiences – CPU load
      • Last import → up to 250%
      • Production (slave):
  27. Experiences – Response times
  28. Experiences – Response times
      • Spellchecking "pizzt", big index (staging):
        • 1502 / 48 / 47 / 48 / 31 ms
      • Spellchecking "pizzt", small index (staging):
        • 603 / 12 / 8 / 9 / 9 ms
  29. Experiences – Response times
      • Facet for spellchecking:
        facet=true&facet.mincount=0&facet.limit=1&wt=ruby&rows=0&fl=pk_i,score&
        facet.query=comment_de_wa:"pizza"+OR+review_de_wa:"pizza"+OR+everything_de_wa:"pizza"+OR+everything_wa:"pizza"+OR+display_name_de_wa:"pizza"+OR+display_name_wa:"pizza"+OR+display_name_ngram:"pizza"&
        facet.query=comment_de_wa:"pizze"+OR+review_de_wa:"pizze"+OR+everything_de_wa:"pizze"+OR+everything_wa:"pizze"+OR+display_name_de_wa:"pizze"+OR+display_name_wa:"pizze"+OR+display_name_ngram:"pizze"&
        facet.query=comment_de_wa:"pizz"+OR+review_de_wa:"pizz"+OR+everything_de_wa:"pizz"+OR+everything_wa:"pizz"+OR+display_name_de_wa:"pizz"+OR+display_name_wa:"pizz"+OR+display_name_ngram:"pizz"&
        facet.query=comment_de_wa:"pizzi"+OR+review_de_wa:"pizzi"+OR+everything_de_wa:"pizzi"+OR+everything_wa:"pizzi"+OR+display_name_de_wa:"pizzi"+OR+display_name_wa:"pizzi"+OR+display_name_ngram:"pizzi"&
        facet.query=comment_de_wa:"pizzs"+OR+review_de_wa:"pizzs"+OR+everything_de_wa:"pizzs"+OR+everything_wa:"pizzs"+OR+display_name_de_wa:"pizzs"+OR+display_name_wa:"pizzs"+OR+display_name_ngram:"pizzs"&
        facet.query=comment_de_wa:"pizzo"+OR+review_de_wa:"pizzo"+OR+everything_de_wa:"pizzo"+OR+everything_wa:"pizzo"+OR+display_name_de_wa:"pizzo"+OR+display_name_wa:"pizzo"+OR+display_name_ngram:"pizzo"&
        facet.query=comment_de_wa:"pizzy"+OR+review_de_wa:"pizzy"+OR+everything_de_wa:"pizzy"+OR+everything_wa:"pizzy"+OR+display_name_de_wa:"pizzy"+OR+display_name_wa:"pizzy"+OR+display_name_ngram:"pizzy"&
        facet.query=comment_de_wa:"pizzn"+OR+review_de_wa:"pizzn"+OR+everything_de_wa:"pizzn"+OR+everything_wa:"pizzn"+OR+display_name_de_wa:"pizzn"+OR+display_name_wa:"pizzn"+OR+display_name_ngram:"pizzn"&
        facet.query=comment_de_wa:"pezzt"+OR+review_de_wa:"pezzt"+OR+everything_de_wa:"pezzt"+OR+everything_wa:"pezzt"+OR+display_name_de_wa:"pezzt"+OR+display_name_wa:"pezzt"+OR+display_name_ngram:"pezzt"&
        facet.query=comment_de_wa:"pizzä"+OR+review_de_wa:"pizzä"+OR+everything_de_wa:"pizzä"+OR+everything_wa:"pizzä"+OR+display_name_de_wa:"pizzä"+OR+display_name_wa:"pizzä"+OR+display_name_ngram:"pizzä"&
        q=*:*&qt=standard&fq=closed_b:false+AND+domain_id_s:de600-hamburg*+AND+(type_s:Place)
      • 10 facets: 231 / 5 / 4 / 22 / 3 (→ XML) ms
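The request above repeats the same seven fields for every spelling candidate, so it lends itself to being generated. The following is a rough sketch under stated assumptions: the field names and fixed parameters are taken from the slide, while the helper names, the shape of the `counts` hash, and the "highest count wins" rule in `best_candidate` are hypothetical and not confirmed by the talk. Candidates are assumed to be plain words without quotes.

```ruby
require "uri"

# Field names as they appear in the slide's facet.query parameters.
FACET_FIELDS = %w[
  comment_de_wa review_de_wa everything_de_wa everything_wa
  display_name_de_wa display_name_wa display_name_ngram
].freeze

# One facet.query value: the candidate OR-ed across all fields.
def facet_query(candidate)
  FACET_FIELDS.map { |field| %(#{field}:"#{candidate}") }.join(" OR ")
end

# Query string with one facet.query per spelling candidate.
def spellcheck_facet_params(candidates, filter)
  pairs = [
    ["q", "*:*"], ["qt", "standard"], ["rows", 0], ["wt", "ruby"],
    ["facet", true], ["facet.mincount", 0], ["facet.limit", 1],
    ["fq", filter]
  ]
  candidates.each { |c| pairs << ["facet.query", facet_query(c)] }
  URI.encode_www_form(pairs)
end

# Hypothetical selection rule: the candidate with the most matching documents.
def best_candidate(counts)
  counts.max_by { |_candidate, count| count }&.first
end

query_string = spellcheck_facet_params(
  %w[pizza pizze pizz],
  "closed_b:false AND domain_id_s:de600-hamburg* AND (type_s:Place)"
)
```

Because the suggestions are counted against the same filter query as the real search, a suggestion that exists in the dictionary but has no places in the current domain can be ruled out.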
  30. Experiences – Response times
      • Warming up → Staging vs. Production
      • Staging: slow
      • Production: fast
  31. Experiences – Response times
      • Staging / index schema on prod
      • Standard query (pizza): 106 / 0 / 0 (9122)
      • Fuzzy (pizza~0.3): 4440 / 663 / 0 (40149)
      • Fuzzy (pizza~0.5): 822 / 0 / 0 (12129)
      • Fuzzy (pizza~0.8): 34 / 1 / 0 (9122)
      • Wildcard (rest*): 39 / 0 / 0 (41031)
  32. Experiences - Monitoring
      • Munin
      • New Relic
