Solr 3.1 and beyond

  1. 1. Solr 3.1 and Beyond. Yonik Seeley (yonik@lucidimagination.com), Lucid Imagination. October 8, 2010.
  2. 2. Agenda
       Goal: Introduce new features you can try and use now in Solr development versions 3.1 or 4.0
       • Relevancy (Extended Dismax Parser)
       • Spatial/Geo Search
       • Search Result Grouping / Field Collapsing
       • Faceting (Pivot, Range, Per-segment)
       • Scalability (Solr Cloud)
       • Odds & Ends
       • Q&A
       10/12/10
  3. 3. Solr 3.1? What happened to 1.5?
       • Lucene/Solr merged (March 2010)
           • Single set of committers
           • Single dev mailing list (dev@lucene.apache.org)
           • Single shared subversion trunk
           • Keep separate downloads, user mailing lists
           • Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
       • Development
           • trunk is now always the next major release (currently 4.0)
           • branch_3x will be the base for all 3.x releases
           • Branch together, release together, share version numbers
  4. 4. RELEVANCE
  5. 5. Extended Dismax Parser
       • Superset of dismax: &defType=edismax&q=foo&qf=body
       • Fixes edge cases where dismax could still throw exceptions: OR  AND  NOT  -  "
       • Full Lucene syntax support
           • Tries Lucene syntax first
           • Smart escaping is done if there are syntax errors
       • Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
       • Fielded queries (e.g. myfield:foo) even in degraded mode
       • uf parameter controls which field names may be directly specified in "q"
  6. 6. Extended Dismax Parser (continued)
       • boost parameter for multiplicative boost-by-function
       • Pure negative query clauses
           Example: solr OR (-solr)
       • Enhanced term proximity boosting
           • pf2=myfield results in term bigrams in sloppy phrase queries
             myfield:"aa bb cc"  ->  myfield:"aa bb" myfield:"bb cc"
       • Enhanced stopword handling
           • Stopwords omitted in main query, but added in optional proximity boosting part
             Example: q=solr is awesome & qf=myfield & pf2=myfield
               ->  +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
           • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
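The pf2 expansion described on this slide can be illustrated with a small sketch. It only mimics the bigram step on whitespace-separated terms; the function name `pf2_phrases` is hypothetical, and real Solr runs the field's analyzer and adds phrase slop, which this does not attempt.

```python
def pf2_phrases(field, query):
    """Return one quoted two-term phrase per adjacent pair of query terms.

    Assumption: terms are split on whitespace only (no analyzer, no slop).
    """
    terms = query.split()
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


if __name__ == "__main__":
    # Mirrors the slide: myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
    print(" ".join(pf2_phrases("myfield", "aa bb cc")))
```

Note that stopwords stay in the bigrams here, which matches the slide's point that they are dropped from the main query but kept in the optional proximity part.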
  7. 7. SPATIAL SEARCH
  8. 8. Spatial Search
       Step 1: Index some locations!
         <field name="name">The Alpine Shop</field>
         <field name="store">44.013617,-73.168264</field>
       Step 2: Decide where you are
         &pt=44.0153371,-73.16734 &d=1 &sfield=store
       Step 3: Profit!
         Spatial Filter: &fq={!geofilt}
         Bounding Box: &fq={!bbox}
         Distance Function: &sort=geodist() asc
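To make the three steps concrete, here is a rough sketch of the distance test a filter like geofilt implies, using the two coordinates from the slide. It assumes that d is a radius in kilometers and uses a plain haversine formula with a mean Earth radius; Solr's own distance calculation may differ in detail, and the helper names are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius


def parse_point(text):
    """Parse a "lat,lon" string as used for the pt parameter and the store field."""
    lat, lon = text.split(",")
    return float(lat), float(lon)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = parse_point("44.013617,-73.168264")  # indexed "store" field
pt = parse_point("44.0153371,-73.16734")     # &pt=...
d = 1.0                                      # &d=1, assumed to be kilometers

distance = haversine_km(pt[0], pt[1], store[0], store[1])
matches_filter = distance <= d

if __name__ == "__main__":
    print("distance: %.3f km, within d: %s" % (distance, matches_filter))
```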
  9. 9. RESULT GROUPING / FIELD COLLAPSING
  10. 10. Field Collapsing Definition
        • Field collapsing
            • Limit the number of results per category
            • "Category" normally defined by unique values in a field
        • Uses
            • Web search – collapse by web site
            • Email threads – collapse by thread id
            • Ecommerce/retail – show the top 5 items for each store category (music, movies, etc.)
  11. 11. Field Collapsing by Site
  12. 12. Field Collapse on Product Type Result Grouping by Category
  13. 13. Group by Field
        http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
        "grouped":{
          "manu_exact":{
            "matches":3,
            "groups":[{
                "groupValue":"Belkin",
                "doclist":{"numFound":2,"start":0,"docs":[
                    {
                      "id":"IW-02",
                      "name":"iPod & iPod Mini USB 2.0 Cable"}]
                }},
              {
                "groupValue":"Apple Computer Inc.",
                "doclist":{"numFound":1,"start":0,"docs":[
                    {
                      "id":"MA147LL/A",
  14. 14. Group by Query
        http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
        "grouped":{
          "price:[0 TO 99.99]":{
            "matches":3,
            "doclist":{"numFound":2,"start":0,"docs":[
                {
                  "id":"IW-02",
                  "name":"iPod & iPod Mini USB 2.0 Cable"},
                {
                  "id":"F8V7067-APL-KIT",
                  "name":"Belkin Mobile Power Cord for iPod"}]
            }},
          "price:[100 TO *]":{
            "matches":3,
            "doclist":{"numFound":1,"start":0,"docs":[
                {
  15. 15. Grouping Params
        parameter                        meaning                                                             default
        group.field=<field>              Like facet.field – group by unique field values
        group.query=<query>              Like facet.query – top docs that also match
        group.function=<function query>  Group by unique values produced by the function query
        group.limit=<n>                  How many docs per group                                             1
        group.sort=<sort spec>           How to sort documents within a group                                Same as "sort" param
        rows=<n>                         How many groups to return                                           10
        sort=<sort spec>                 How to sort the groups relative to each other (based on top doc)
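A small in-memory sketch may help show how group.field, group.limit and rows interact. This is not Solr's implementation: `group_by_field` is a hypothetical helper, the input list is assumed to be already sorted by the main sort, and the manufacturer assigned to the second Belkin document is an assumption made for illustration.

```python
from collections import OrderedDict


def group_by_field(docs, field, group_limit=1, rows=10):
    """Group pre-sorted docs by a field value, keeping group_limit docs per group.

    Groups appear in order of their top document, which stands in for
    "sort the groups relative to each other (based on top doc)".
    """
    groups = OrderedDict()
    for doc in docs:
        groups.setdefault(doc.get(field), []).append(doc)

    result = []
    for value, members in list(groups.items())[:rows]:
        result.append({
            "groupValue": value,
            "doclist": {
                "numFound": len(members),       # all docs in the group
                "start": 0,
                "docs": members[:group_limit],  # only the top docs are returned
            },
        })
    return {"matches": len(docs), "groups": result}


# Documents loosely based on the slide examples (manu_exact values partly assumed).
DOCS = [
    {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "name": "Belkin Mobile Power Cord for iPod", "manu_exact": "Belkin"},
    {"id": "MA147LL/A", "name": "Apple 60 GB iPod with Video Playback Black", "manu_exact": "Apple Computer Inc."},
]

grouped = group_by_field(DOCS, "manu_exact")

if __name__ == "__main__":
    for group in grouped["groups"]:
        print(group["groupValue"], group["doclist"]["numFound"])
```

With the default group.limit of 1, the Belkin group reports numFound 2 but returns a single document, which is the shape shown on the "Group by Field" slide.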
  16. 16. FACETING
  17. 17. Pivot Faceting
        • Other names that could have made sense:
            • Grid Faceting, Cross-Product Faceting, Matrix Faceting
        • Syntax: facet.pivot=field1,field2,field3,…
        Example: facet.pivot=cat,inStock
                             #docs   #docs w/ inStock:true   #docs w/ inStock:false
        cat:electronics      14      10                      4
        cat:memory           3       3                       0
        cat:connector        2       0                       2
        cat:graphics card    2       0                       2
        cat:hard drive       2       2                       0
  18. 18. Pivot Faceting (continued)
        http://...&facet=true&facet.pivot=cat,popularity
        "facet_counts":{
          "facet_pivot":{
            "cat,popularity":[{
                "field":"cat",
                "value":"electronics",
                "count":14,
                "pivot":[{
                    "field":"popularity",
                    "value":"6",
                    "count":5},
                  {
                    "field":"popularity",
                    "value":"7",
                    "count":4},
                  {
                    "field":"popularity",
                    "value":"1",
                    "count":2}]},
              {
                "field":"cat",
                "value":"memory",
                "count":3,
                "pivot":[]},
              […]
        Reading the counts: 14 docs w/ cat==electronics; 5 docs w/ cat==electronics && popularity==6
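The nested counts in a pivot response can be reproduced on a toy document list. The sketch below is a simplification under stated assumptions: fields are treated as single-valued, the sample documents are invented, and the `pivot` function is illustrative rather than Solr's algorithm.

```python
from collections import Counter


def pivot(docs, fields):
    """Count docs per value of fields[0], then recurse into the remaining fields."""
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    counts = Counter(doc[field] for doc in docs if field in doc)

    entries = []
    for value, count in counts.most_common():  # highest count first
        entry = {"field": field, "value": value, "count": count}
        if rest:
            subset = [doc for doc in docs if doc.get(field) == value]
            entry["pivot"] = pivot(subset, rest)
        entries.append(entry)
    return entries


# Invented sample data, single-valued for simplicity.
SAMPLE = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "memory", "inStock": True},
]

pivot_counts = pivot(SAMPLE, ["cat", "inStock"])

if __name__ == "__main__":
    for top in pivot_counts:
        print(top["value"], top["count"], [(p["value"], p["count"]) for p in top["pivot"]])
```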
  19. 19. Range Faceting
        • Like date faceting, but more generic
        http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50
        "facet_counts":{
          "facet_ranges":{
            "price":{
              "counts":{
                "0.0":5,
                "50.0":2,
                "100.0":0,
                "150.0":2,
                "200.0":0,
                "250.0":1,
                "300.0":2,
                "350.0":2,
                "400.0":0,
                "450.0":1},
              "gap":50.0,
              "start":0.0,
              "end":500.0}}}}
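The bucketing behind facet.range.start, facet.range.end and facet.range.gap can be sketched in a few lines. This version assumes each bucket includes its lower bound and excludes its upper bound, and that values outside [start, end) are ignored; the `range_facet` name and the sample prices are illustrative.

```python
import math


def range_facet(values, start, end, gap):
    """Count values into fixed-width buckets keyed by the bucket's lower bound."""
    n_buckets = int(math.ceil((end - start) / gap))
    counts = {str(float(start + i * gap)): 0 for i in range(n_buckets)}
    for value in values:
        if start <= value < end:  # assumption: lower bound inclusive, upper exclusive
            index = int((value - start) // gap)
            counts[str(float(start + index * gap))] += 1
    return counts


# Prices taken from the CSV example later in the deck, plus one out-of-range value.
price_counts = range_facet([11.5, 19.95, 399.0, 649.99], start=0, end=500, gap=50)

if __name__ == "__main__":
    for lower_bound, count in price_counts.items():
        print(lower_bound, count)
```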
  20. 20. Existing single-valued faceting algorithm [diagram]
        Request: q=Juggernaut &facet=true &facet.field=hero
        Lucene FieldCache Entry (StringIndex) for the "hero" field
          • order: for each doc, an index into the lookup array (shown: 5 3 5 1 4 5 2 1)
          • lookup: the string values (shown: (null), batman, flash, spiderman, superman, wolverine)
        Flow: documents matching the base query "Juggernaut" → lookup → increment accumulator (shown: 0 2 7 0 1 0 0 0 2) → priority queue (shown: Batman, 3; flash, 5)
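The order/lookup/accumulator flow from this diagram can be written out as a short sketch. The lookup strings and order values are read off the slide; the set of matching documents and the function name are assumptions for illustration, and Solr's actual code works on Lucene structures rather than Python lists.

```python
import heapq


def facet_single_valued(order, lookup, base_docs, limit):
    """Count terms for the matching docs using one accumulator slot per term.

    order[doc_id] is an index into lookup; slot 0 is the "no value" entry.
    """
    accumulator = [0] * len(lookup)
    for doc_id in base_docs:
        accumulator[order[doc_id]] += 1

    # Skip slot 0 (null) and empty slots, then keep the top `limit` terms.
    candidates = [
        (count, term)
        for term, count in zip(lookup[1:], accumulator[1:])
        if count > 0
    ]
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]


LOOKUP = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
ORDER = [5, 3, 5, 1, 4, 5, 2, 1]   # one entry per document
BASE_DOCS = [0, 2, 3, 5, 7]        # assumed doc ids matching the base query

top_terms = facet_single_valued(ORDER, LOOKUP, BASE_DOCS, limit=2)

if __name__ == "__main__":
    print(top_terms)
```

The work per request is proportional to the number of matching documents plus the number of unique terms, which is why the single global FieldCache entry is fast on a static index.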
  21. 21. Per-segment single-valued algorithm [diagram]
        • Base DocSet → per-segment lookup and increment, one thread per segment (thread1, thread2, thread3, thread4)
        • Segment1 to Segment4 FieldCache entries, each with its own accumulator (accumulator1 to accumulator4; values shown: 0 2 7 0 3 5 0 1 2 0 2 1 0 1 3 0 4 0 1 0)
        • FieldCache + accumulator merger (priority queue); priority queue shown: Batman, 3; flash, 5
  22. 22. Per-segment faceting
        • Enable with facet.method=fcs
        • Controllable multi-threading: facet.field={!threads=4}myfield
        • Disadvantages
            • Larger memory use (FieldCaches + accumulators)
            • Slower (extra FieldCache merge step needed)
        • Advantages
            • Rebuilds FieldCache entries only for new segments (NRT friendly)
            • Multi-threaded
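As a rough sketch of the per-segment idea: each segment counts against its own order/lookup arrays in a worker thread, and the partial counts are then merged by term string, since ordinals differ between segments. The data, the dictionary layout of a segment, and the function names are assumptions for illustration; the real fcs method merges sorted per-segment terms rather than building dictionaries.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Count terms for one segment using that segment's own accumulator."""
    order, lookup = segment["order"], segment["lookup"]
    accumulator = [0] * len(lookup)
    for local_doc_id in base_docs:
        accumulator[order[local_doc_id]] += 1
    return {term: n for term, n in zip(lookup, accumulator) if term is not None and n}


def facet_per_segment(segments, base_docs_per_segment, limit, threads=4):
    """Count each segment in parallel, then merge the partial counts by term."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, base_docs_per_segment))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds counts for terms seen in several segments
    return merged.most_common(limit)


SEGMENTS = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2]},
]
BASE_DOCS = [[0, 1, 2, 3], [0, 2]]  # matching local doc ids per segment (assumed)

merged_top = facet_per_segment(SEGMENTS, BASE_DOCS, limit=1)

if __name__ == "__main__":
    print(merged_top)
```

The merge step is the extra cost the slide lists as a disadvantage; the benefit is that only a new segment's arrays need rebuilding after a commit.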
  23. 23. Per-segment faceting performance comparison
        Test index: 10M documents, 18 segments, single valued field
        A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
           Time for request*        facet.method=fc   facet.method=fcs
           static index             3 ms              244 ms
           quickly changing index   1388 ms           267 ms
        B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
           Time for request*        facet.method=fc   facet.method=fcs
           static index             26 ms             34 ms
           quickly changing index   741 ms            94 ms
        *complete request time, measured externally
  24. 24. Faceting Performance Improvements
        • For facet.method=enum, speed up initial population of the filterCache (i.e. the first time you facet): from 30% to 32x improvement
        • Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
        • Optimized deep facet paging – up to 10x faster with really large facet.offsets
        • Less memory consumed by field cache entries
  25. 25. SCALABILITY
  26. 26. SolrCloud
        • First steps toward simplifying cluster management
        • Integrates ZooKeeper
            • Central configuration (schema.xml, solrconfig.xml, etc.)
            • Tracks live nodes + shards of collections
        • Removes need for external load balancers
            shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
        • Can specify logical shard ids
            shards=NY_shard,NJ_shard
        • Clients don't need to know shards at all:
            http://localhost:8983/solr/collection1/select?distrib=true
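The shards syntax on this slide can be read as a comma-separated list of shards, each with `|`-separated equivalent servers to choose from. The sketch below parses that shape and picks one server per shard at random; it is only a client-side illustration of the idea, with hypothetical helper names, and does not reflect how Solr itself selects replicas.

```python
import random


def parse_shards(param):
    """Split a shards value into a list of shards, each a list of equivalent servers."""
    return [[replica.strip() for replica in shard.split("|")] for shard in param.split(",")]


def pick_replicas(shards, rng=random):
    """Choose one server per shard (assumption: uniform random choice)."""
    return [rng.choice(replicas) for replicas in shards]


SHARDS = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
)
chosen = pick_replicas(SHARDS)

if __name__ == "__main__":
    print(SHARDS)
    print(chosen)
```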
  27. 27. SolrCloud: The Future
        • Eliminate all single points of failure
        • Remove Master/Searcher distinction
            • Enables near real-time search in a highly scalable environment
        • High availability for writes
            • Eventual consistency model (like Amazon Dynamo, Cassandra)
        • Elastic
            • Simply add/subtract servers, cluster will rebalance automatically
            • By default, Solr will handle document partitioning
  28. 28. ODDS & ENDS
  29. 29. Auto-Suggest
        • Many people currently use terms component
            • Can be slow for a large corpus
        • New auto-suggest builds off SpellCheck component
            • Compact memory-based trie for really fast completions
            • Based on a field in the main index, or on a dictionary file
        http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
        "spellcheck":{
          "suggestions":[
            "ult",{
              "numFound":1,
              "startOffset":0,
              "endOffset":3,
              "suggestion":["ultrasharp"]},
            "collation","ultrasharp"]}}
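To show why a trie suits prefix completion, here is a minimal dictionary-based sketch. It is not the compact structure Solr uses; the class names and the word list (apart from "ultrasharp") are invented for illustration, and completions come back in lexicographic order rather than by weight.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.is_word = False


class SuggestTrie:
    def __init__(self, words=()):
        self.root = Node()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(prefix, node)]
        while stack and len(results) < limit:
            text, current = stack.pop()
            if current.is_word:
                results.append(text)
            # Push in reverse order so the smallest character is visited first.
            for ch in sorted(current.children, reverse=True):
                stack.append((text + ch, current.children[ch]))
        return results


trie = SuggestTrie(["ultrasharp", "ultra", "usb", "ipod"])

if __name__ == "__main__":
    print(trie.complete("ult"))
```

Lookup cost depends on the prefix length and the number of completions returned, not on the size of the corpus, which is the advantage over scanning terms.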
  30. 30. Index with JSON
        $ URL=http://localhost:8983/solr/update/json
        $ curl $URL -H 'Content-type:application/json' -d '
        {
          "add": {
            "doc": {
              "id" : "978-0641723445",
              "cat" : ["book","hardcover"],
              "title" : "The Lightning Thief",
              "author" : "Rick Riordan",
              "series_t" : "Percy Jackson and the Olympians",
              "sequence_i" : 1,
              "genre_s" : "fantasy",
              "inStock" : true,
              "price" : 12.50,
              "pages_i" : 384
            }
          }
        }'
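The same request can be prepared from Python with only the standard library. The sketch below builds the request object but does not send it, so it runs without a Solr server; `build_add_request` is a hypothetical helper, and the URL is the one shown on the slide.

```python
import json
from urllib.request import Request


def build_add_request(url, doc):
    """Wrap a document in the {"add": {"doc": ...}} envelope and prepare a POST."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


BOOK = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

request = build_add_request("http://localhost:8983/solr/update/json", BOOK)

if __name__ == "__main__":
    # To actually send it, pass `request` to urllib.request.urlopen
    # against a running Solr instance.
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```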
  31. 31. Query Results in CSV
        http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
        name,price,cat,popularity
        iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
        Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
        Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
        • Can handle multi-valued fields (see "cat" field in example)
        • Completely compatible with the CSV update handler (can round-trip)
        • Results are streamed – good for dumping entire parts of the index
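A client can read this output with a standard CSV parser. The sketch below parses the rows from the slide and splits the multi-valued "cat" column; it assumes the individual values contain no embedded commas or escapes, and the `parse_results` helper is illustrative.

```python
import csv
import io

CSV_TEXT = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""


def parse_results(text, multi_valued=("cat",)):
    """Parse CSV results into dicts, splitting multi-valued columns on commas."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")  # assumption: no escaped commas
        rows.append(row)
    return rows


results = parse_results(CSV_TEXT)

if __name__ == "__main__":
    for row in results:
        print(row["name"], row["price"], row["cat"])
```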
  32. 32. http://localhost:8983/solr/browse
  33. 33. Q&A For more information about Solr visit www.lucidimagination.com
