
Seeley yonik solr performance key innovations


Published in: Technology

  1. Solr Performance & Key Innovations. Yonik Seeley, Lucid, May 26 2011
  2. Solr 3.1 Highlights
     • Numeric range facets (similar to date faceting).
     • New spatial search, including spatial filtering, boosting and sorting capabilities.
     • Example Velocity-driven search UI at http://localhost:8983/solr/browse
     • A new, faster term-vector-based highlighter.
     • Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full Lucene syntax support.
     • Distributed search support for the SpellCheck and Terms components.
  3. Solr 3.1 Highlights (continued)
     • Suggester, a fast trie-based autocomplete component.
     • Sort results by any function query.
     • JSON document indexing.
     • CSV response format.
     • Apache UIMA integration for metadata extraction.
     • Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.
  4. What's not in 3.1?
     • Result Grouping (AKA Field Collapsing)
     • Pivot Faceting
     • SolrCloud
     • Pseudo-fields
     • Pseudo-join
     • Relevancy function queries
     • Per-segment faceting
     • *Tons* of new Lucene performance/efficiency goodness
  5. Recent Lucene Performance
     • TieredMergePolicy: the new default
       • Much better for incremental indexing / NRT
       • Ignores segment order when selecting best merge
       • Takes deletes into account
       • Does not over-merge (no cascading merges)
     • Finite State Transducer (FST) based terms index
  6. DocumentsWriterPerThread (DWPT)
     • Flushing a new segment is now concurrent w/ indexing
     • Use multiple indexing threads/connections
     • When max mem is hit, the biggest DWPT is concurrently flushed
     Diagram: indexing threads feed the IndexWriter, which holds several in-memory DWPTs; each DWPT flushes its own segment to disk:
         _1_0.tiv   _2_0.tiv   _3_0.tiv
         _1_0.prx   _2_0.prx   _3_0.prx
         _1_0.frq   _2_0.frq   _3_0.frq
         …          …          …
  7. Solr Cloud
         http://.../solr/collection1?distrib=true
     Diagram: the request fans out as load-balanced sub-requests to shard1 (replica1, replica2, replica3) and shard2 (replica1, replica2, replica3). A ZooKeeper quorum of five ZK nodes holds the cluster state:
         /livenodes
             server1:8983/solr
             server2:8983/solr
         /collections
             /collection1   configName=myconf
                 /shards
                     /shard1
                         server1:8983/solr
                         server2:8983/solr
                     /shard2
                         server3:8983/solr
                         server4:8983/solr
         /configs
             /myconf
                 solrconfig.xml
                 schema.xml
  8. Solr Cloud: Getting Started
         java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
     • -Dbootstrap_confdir and -Dcollection.configName: upload ./solr/conf to ZK and call it "myconf"
     • -DzkRun: run an internal ZK server
     • http://localhost:8983/solr/collection1/admin/zookeeper.jsp
  9. Distributed Requests
     • Explicitly specify node addresses to load-balance across
         shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
       • Equivalent nodes in a list are separated by "|"
       • Different phases of the same distributed request use the same node
     • Specify logical shard ids to search across
         shards=NY_shard,NJ_shard
     • Query across all shards in the collection
         http://localhost:8983/solr/collection1/select?distrib=true
     • SolrJ Java client that load-balances across all nodes in the cluster
         public CloudSolrServer(String zkHost)
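The `shards` syntax above (comma between shards, "|" between equivalent nodes) can be assembled mechanically. A minimal sketch in Python, assuming a hypothetical helper name `build_shards_param` and reusing the node addresses from the slide; it only builds the query string and does not contact a server:

```python
from urllib.parse import urlencode, parse_qs


def build_shards_param(shards):
    """Join equivalent nodes of one shard with '|' and separate shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


# Two shards, each with two equivalent nodes (addresses taken from the slide).
shards = [
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
]

shards_param = build_shards_param(shards)
# urlencode percent-escapes '|', ',', ':' and '/' so the value survives in a URL.
query_string = urlencode({"q": "*:*", "shards": shards_param})

print(shards_param)
print(query_string)
```

Decoding the query string again yields the original `shards` value, which is a quick way to check that the escaping did not alter the shard list.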
  10. Extended Dismax Parser
      • Superset of dismax
      • Designed to directly handle user queries w/o exceptions
          &defType=edismax&q=foo&qf=body
        • Fixes edge cases where dismax could still throw exceptions
            OR   AND   NOT   -
      • Full Lucene syntax support
        • Tries Lucene syntax first
        • Smart escaping is done if there are syntax errors
      • Optionally supports treating "and" / "or" as AND/OR in Lucene syntax
      • Fielded queries (e.g. myfield:foo) even in degraded mode
        • uf parameter controls what field names may be directly specified in q
  11. Extended Dismax Parser (continued)
      • boost parameter for multiplicative boost-by-function
      • Pure negative query clauses
          Example: solr OR (-solr)
      • Enhanced term proximity boosting
        • pf2=myfield results in term bigrams in sloppy phrase queries
            myfield:"aa bb cc"  ->  myfield:"aa bb" myfield:"bb cc"
      • Enhanced stopword handling
        • Stopwords omitted in main query, but added in optional proximity boosting part
            Example: q=solr is awesome & qf=myfield & pf2=myfield
              ->  +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
        • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
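The pf2 rewrite and the stopword handling described above can be imitated with plain string manipulation. This is only an illustrative sketch of the shape of the resulting query, not the parser's actual implementation; the function names and the one-word stopword set are assumptions made for the example:

```python
STOPWORDS = {"is"}  # assumption: a tiny stopword list, just enough for the slide's example


def bigram_phrases(field, query):
    """Turn 'aa bb cc' into phrase clauses over adjacent term pairs."""
    terms = query.split()
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


def sketch_edismax(field, query):
    """Main clause drops stopwords; the optional proximity part keeps them."""
    main_terms = [t for t in query.split() if t not in STOPWORDS]
    main_clause = "+%s:(%s)" % (field, " ".join(main_terms))
    boost_clause = "(%s)" % " ".join(bigram_phrases(field, query))
    return main_clause + " " + boost_clause


print(bigram_phrases("myfield", "aa bb cc"))
print(sketch_edismax("myfield", "solr is awesome"))
```

The second call reproduces the structure of the slide's example: the stopword is absent from the required clause but still participates in the bigram phrases used for boosting.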
  12. Faceting Performance Improvements
      • For facet.method=enum, faster initial population of the filterCache (i.e. first-time faceting): from 30% to 32x improvement
      • Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
      • Optimized deep facet paging: up to 10x faster with really large facet.offset values
      • Less memory consumed by field cache entries
      • Per-segment faceting with facet.method=fcs
        • Only faster when re-opening the index frequently (many times a second)
        • Only works for single-valued fields
  13. Pivot Faceting
      • Other names that could have made sense:
        • Grid Faceting, Cross-Product Faceting, Matrix Faceting
      • Syntax: facet.pivot=field1,field2,field3,…
      Example: facet.pivot=cat,inStock

                               #docs   #docs w/ inStock:true   #docs w/ inStock:false
          cat:electronics      14      10                      4
          cat:memory           3       3                       0
          cat:connector        2       0                       2
          cat:graphics card    2       0                       2
          cat:hard drive       2       2                       0
  14. Pivot Faceting (continued)
          http://...&facet=true&facet.pivot=cat,popularity

          "facet_counts":{
            "facet_pivot":{
              "cat,popularity":[{
                  "field":"cat",
                  "value":"electronics",
                  "count":14,
                  "pivot":[{
                      "field":"popularity",
                      "value":"6",
                      "count":5},
                    {
                      "field":"popularity",
                      "value":"7",
                      "count":4},
                    {
                      "field":"popularity",
                      "value":"1",
                      "count":2}]},
                {
                  "field":"cat",
                  "value":"memory",
                  "count":3,
                  "pivot":[]},
                […]
      • "count":14 on the first entry: 14 docs w/ cat==electronics
      • "count":5 on its first pivot entry: 5 docs w/ cat==electronics && popularity==6
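Because each pivot entry nests its children under "pivot", a client usually walks the response recursively. A minimal sketch in Python, assuming the response has the shape shown on the slide; the sample below is the slide's excerpt with the elided entries left out, and `flatten_pivot` is a hypothetical helper name:

```python
import json

# The slide's excerpt, closed off so that it parses; elided entries are omitted.
sample = json.loads("""
{"facet_counts": {"facet_pivot": {"cat,popularity": [
  {"field": "cat", "value": "electronics", "count": 14, "pivot": [
    {"field": "popularity", "value": "6", "count": 5},
    {"field": "popularity", "value": "7", "count": 4},
    {"field": "popularity", "value": "1", "count": 2}]},
  {"field": "cat", "value": "memory", "count": 3, "pivot": []}
]}}}
""")


def flatten_pivot(nodes, path=()):
    """Yield (path, count) for every node, where path is a tuple of (field, value) pairs."""
    for node in nodes:
        current = path + ((node["field"], node["value"]),)
        yield current, node["count"]
        # Leaf entries may carry an empty "pivot" list or none at all.
        yield from flatten_pivot(node.get("pivot", []), current)


rows = list(flatten_pivot(sample["facet_counts"]["facet_pivot"]["cat,popularity"]))
for path, count in rows:
    print(" && ".join("%s==%s" % pair for pair in path), count)
```

Each printed row corresponds to one cell of the cross-product, in the same "cat==electronics && popularity==6" form used in the slide's annotations.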
  15. Range Faceting
      • Like date faceting, but more generic
          http://...&facet=true
            &facet.range=price
            &facet.range.start=0
            &facet.range.end=500
            &

          "facet_counts":{
            "facet_ranges":{
              "price":{
                "counts":{
                  "0.0":5,
                  "50.0":2,
                  "100.0":0,
                  "150.0":2,
                  "200.0":0,
                  "250.0":1,
                  "300.0":2,
                  "350.0":2,
                  "400.0":0,
                  "450.0":1},
                "gap":50.0,
                "start":0.0,
                "end":500.0}}}}
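The counts in the response above are per-bucket tallies keyed by each bucket's lower bound. A rough client-side imitation in Python follows; it assumes buckets include their lower bound and exclude their upper bound, and it takes the bucket width from the "gap" value in the response (in Solr this is set with the facet.range.gap parameter, which the slide's URL excerpt does not show). The function name and the three sample prices are illustrative only:

```python
def range_facet(values, start, end, gap):
    """Count values per [lower, lower + gap) bucket, keyed by the lower bound as a string."""
    start, end, gap = float(start), float(end), float(gap)
    counts = {}
    index = 0
    # Create every bucket up front so empty buckets show up with a count of 0.
    while start + index * gap < end:
        counts[str(start + index * gap)] = 0
        index += 1
    for value in values:
        if start <= value < end:
            bucket_index = int((value - start) // gap)
            counts[str(start + bucket_index * gap)] += 1
    return counts


# Illustrative prices only; they are not the data behind the slide's counts.
prices = [11.5, 19.95, 399.0]
counts = range_facet(prices, start=0, end=500, gap=50)
print(counts)
```

Empty buckets are kept in the result, matching the zero entries such as "100.0":0 in the slide's response.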
  16. Spatial Search
      Step 1: Index some locations!
          <field name="name">The Alpine Shop</field>
          <field name="store">44.013617,-73.168264</field>
      Step 2: Decide where you are
          &pt=44.0153371,-73.16734&d=1&sfield=store
      Step 3: Profit!
          Spatial Filter:          &fq={!geofilt}
          Bounding Box:            &fq={!bbox}
          Distance Function:       &sort=geodist() asc
          Returning the distance:  &fl=geodist()   (Pseudo-fields!)
      Note: You can now sort by any arbitrary function query!
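To get a feel for what the distance filter decides, the great-circle distance between the query point and the indexed store can be computed by hand. The sketch below uses the haversine formula in Python and assumes that `d` is expressed in kilometers and that a mean Earth radius of 6371 km is close enough for illustration; it is not a reproduction of Solr's internal distance code:

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


pt = (44.0153371, -73.16734)     # &pt= from the slide
store = (44.013617, -73.168264)  # the "store" field of The Alpine Shop
d = 1.0                          # &d=1 from the slide, assumed to be kilometers

distance = haversine_km(pt[0], pt[1], store[0], store[1])
within_radius = distance <= d
print(round(distance, 3), within_radius)
```

Under these assumptions the store lies well inside the 1 km radius, so a `{!geofilt}` filter with the slide's parameters would be expected to keep it.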
  17. Pseudo-Fields
      Returns other info along with document stored fields
      • Function queries
          fl=name,location,geodist(),add(myfield,10)
      • Fieldname globs
          fl=id,attr_*
      • Multiple "fl" (field list) values
          &fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
      • Aliasing
          fl=id,location:loc,_dist_:geodist()
      • Future: inlined highlighting, "explain", sort-values, group-value
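Repeating the `fl` parameter, as in the multiple-values example above, is easy to get wrong when a client stores parameters in a dictionary, since a dictionary keeps only one value per key. A small Python sketch that keeps the parameters as a list of pairs instead; the `q=*:*` query is an assumption added so that the request is complete:

```python
from urllib.parse import urlencode, parse_qs

# A list of pairs preserves repeated keys; a dict would keep only the last "fl".
params = [
    ("q", "*:*"),
    ("fl", "id,attr_*"),
    ("fl", "geodist()"),
    ("fl", "termfreq(text,'solr')"),
]

query_string = urlencode(params)
decoded = parse_qs(query_string)

print(query_string)
print(decoded["fl"])
```

Decoding the string shows all three field-list values arriving in order, which is what the server needs to see for the pseudo-fields to be combined.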
  18. Result Grouping / Field Collapsing
      • Goal
        • Limit the number of results per category
        • Category is normally defined by unique values in a field
      • Uses
        • Web search: collapse by web site
        • Email threads: collapse by thread id
        • Ecommerce/retail
          • Show the top 5 items for each store category (music, movies, etc.)
  19. Field Collapsing by Site
  20. Result Grouping by Category: Field Collapse on Product Type
  21. Group by Field
          http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

          "grouped":{
            "manu_exact":{
              "matches":3,
              "groups":[{
                  "groupValue":"Belkin",
                  "doclist":{"numFound":2,"start":0,"docs":[
                      {
                        "id":"IW-02",
                        "name":"iPod & iPod Mini USB 2.0 Cable"}]
                  }},
                {
                  "groupValue":"Apple Computer Inc.",
                  "doclist":{"numFound":1,"start":0,"docs":[
                      {
  22. Group by Query
          http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

          "grouped":{
            "price:[0 TO 99.99]":{
              "matches":3,
              "doclist":{"numFound":2,"start":0,"docs":[
                  {
                    "id":"IW-02",
                    "name":"iPod & iPod Mini USB 2.0 Cable"},
                  {
                    "id":"F8V7067-APL-KIT",
                    "name":"Belkin Mobile Power Cord for iPod"}]
              }},
            "price:[100 TO *]":{
              "matches":3,
              "doclist":{"numFound":1,"start":0,"docs":[
  23. Grouping Params

          parameter                        meaning                                                    default
          group.field=<field>              Like facet.field: group by unique field values
          group.query=<query>              Like facet.query: top docs that also match
          group.function=<function query>  Group by unique values produced by the function query
          group.limit=<n>                  How many docs per group                                    1
          group.sort=<sort spec>           How to sort documents within a group                       Same as sort
          rows=<n>                         How many groups to return                                  10
          sort=<sort spec>                 How to sort the groups relative to each other
                                           (based on top doc)
          group.format=<format>            grouped/simple: if simple, a single flat list is used      grouped
                                           and rows units are "docs"
          group.main=true/false            If true, the first field grouping command is used as       false
                                           main result set
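How group.field, group.limit and rows interact is easier to see on a toy data set. The following Python sketch groups an already-ranked list of documents in memory and produces a structure shaped like the "grouped" response on the Group by Field slide. It illustrates the semantics only; the function name and the id of the third document are assumptions:

```python
def group_by_field(docs, field, group_limit=1, rows=10):
    """Group ranked docs by a field value; groups are ordered by their top doc."""
    groups = {}
    for doc in docs:  # docs are assumed to be sorted by relevance already
        groups.setdefault(doc[field], []).append(doc)
    result = []
    # Dicts keep insertion order, so the first `rows` groups are the best-ranked ones.
    for value, members in list(groups.items())[:rows]:
        result.append({
            "groupValue": value,
            "doclist": {"numFound": len(members), "start": 0, "docs": members[:group_limit]},
        })
    return {field: {"matches": len(docs), "groups": result}}


ranked_docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "hypothetical-apple-doc", "manu_exact": "Apple Computer Inc."},  # id is made up
]

grouped = group_by_field(ranked_docs, "manu_exact", group_limit=1, rows=10)
for group in grouped["manu_exact"]["groups"]:
    print(group["groupValue"], group["doclist"]["numFound"], len(group["doclist"]["docs"]))
```

With group_limit=1 the Belkin group reports numFound 2 but returns a single document, which mirrors the slide's response where "matches" counts documents and each doclist is truncated to the per-group limit.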
  24. Pseudo-Join
      Blog documents:
          id: blog1   name: Solr 'n Stuff   owner: Yonik Seeley   started: 2007-10-26
          id: blog2   name: lifehacker      owner: Gawker Media   started: 2005-1-31
      Post documents:
          id: post1   blog_id: blog1   author: Yonik Seeley     title: Solr relevancy function queries   body: Lucene's default ranking […]
          id: post2   blog_id: blog1   author: Yonik Seeley     title: Solr result grouping              body: Result Grouping, also called […]
          id: post3   blog_id: blog2   author: Whitson Gordon   title: How to Install Netflix on Almost Any Android Device
      Restrict to blogs mentioning netflix:
          fq={!join from=blog_id to=id}body:netflix
      • Finds all documents matching "netflix"
      • Maps to different docs by following blog_id to id
  25. Pseudo-Join Examples
      • Only show posts from blogs started after 2010
          q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
      • If any post in a blog mentions "obama", then search all posts in that blog for "bomb" (self-join)
          q=bomb&fq={!join from=blog_id to=blog_id}obama
      • If any blog post mentions "obama", then search all websites with the same blog owner for "bomb"
          q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama
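The from/to direction of a join is the part that tends to confuse, so here is an in-memory imitation in Python. It collects the `from` field values of the documents matching the inner query, then keeps the documents whose `to` field contains one of those values. The documents come from the Pseudo-Join slide, except that post3's body and the cut-off year 2006 are assumptions chosen so that the sample returns something; `pseudo_join` is a hypothetical name and says nothing about how Solr implements the join:

```python
def as_list(value):
    """Normalize a missing, single or multi-valued field to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def pseudo_join(docs, from_field, to_field, matches):
    """Follow from_field values of matching docs to docs whose to_field holds them."""
    keys = set()
    for doc in docs:
        if matches(doc):
            keys.update(as_list(doc.get(from_field)))
    return [doc for doc in docs if keys.intersection(as_list(doc.get(to_field)))]


docs = [
    {"id": "blog1", "name": "Solr 'n Stuff", "started": "2007-10-26"},
    {"id": "blog2", "name": "lifehacker", "started": "2005-01-31"},
    {"id": "post1", "blog_id": "blog1", "body": "Lucene's default ranking"},
    {"id": "post2", "blog_id": "blog1", "body": "Result Grouping, also called"},
    {"id": "post3", "blog_id": "blog2", "body": "install netflix on android"},  # body is assumed
]

# fq={!join from=blog_id to=id}body:netflix  ->  blogs having a post that mentions netflix
netflix_blogs = pseudo_join(docs, "blog_id", "id", lambda d: "netflix" in d.get("body", ""))

# fq={!join from=id to=blog_id}started:[2006 TO *]  ->  posts of blogs started in 2006 or later
recent_posts = pseudo_join(docs, "id", "blog_id", lambda d: d.get("started", "") >= "2006")

print([d["id"] for d in netflix_blogs])
print([d["id"] for d in recent_posts])
```

Swapping `from` and `to` flips which side of the relationship is returned: the first call returns blogs, the second returns posts.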
  26. Cross-Core Join
      Single Solr server with two cores:
      collection1 (documents):
          id: doc1   security: managers              title: doc for managers only   body: …
          id: doc1   security: managers, employees   title: doc for everyone        body: …
      sec1 (users):
          id: mary   security_groups: managers, employees
          id: john   security_groups: employees
      Query:
          http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john
  27. Pseudo-Join vs Grouping
      • Cost
        • Pseudo-Join: O(n_terms_in_join_fields)
        • Result Grouping / Field Collapsing: O(n_docs_in_result)
      • Field types
        • Pseudo-Join: single or multi-valued fields
        • Result Grouping: single-valued fields only
      • Ordering
        • Pseudo-Join: filters only (no info currently passed from the "from" docs to the "to" docs)
        • Result Grouping: can order docs within a group, and groups by top doc within that group, using normal sort criteria
      • Chaining
        • Pseudo-Join: chainable (one join can be the input to another)
        • Result Grouping: not currently chainable, can only group one field deep
      • Effect on faceting
        • Pseudo-Join: affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs)
        • Result Grouping: does not currently affect the set of documents matching the query, so faceting is unaffected
  28. Auto-Suggest
      • Many people previously used the Terms component
        • Can be slow for a large corpus
      • New auto-suggest builds off the SpellCheck component
        • TST implementation: compact memory-based trie
        • FST implementation: slower to build, but smaller & faster lookup
        • Based on a field in the main index, or on a dictionary file
          http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

          "spellcheck":{
            "suggestions":[
              "ult",{
                "numFound":1,
                "startOffset":0,
                "endOffset":3,
                "suggestion":["ultrasharp"]},
              "collation","ultrasharp"]}}
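Note that "suggestions" in the response above is a flat list that alternates names and values rather than a JSON object, so a client has to pair the entries up. A minimal Python sketch, assuming the response always has the shape shown on the slide; the helper names are hypothetical:

```python
import json

# The slide's response, wrapped in braces so that it parses as a JSON document.
raw = """
{"spellcheck": {"suggestions": [
  "ult", {"numFound": 1, "startOffset": 0, "endOffset": 3, "suggestion": ["ultrasharp"]},
  "collation", "ultrasharp"]}}
"""


def pair_up(flat):
    """Turn [name1, value1, name2, value2, ...] into [(name1, value1), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


def parse_suggestions(response):
    """Split the flat list into per-token suggestions and the collation, if any."""
    per_token, collation = {}, None
    for name, value in pair_up(response["spellcheck"]["suggestions"]):
        if isinstance(value, dict):
            per_token[name] = value["suggestion"]
        elif name == "collation":
            collation = value
    return per_token, collation


per_token, collation = parse_suggestions(json.loads(raw))
print(per_token, collation)
```

Separating the dictionary-valued entries from the "collation" entry avoids treating the collation string as if it were another input token.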
  29. Index with JSON
          $ URL=http://localhost:8983/solr/update/json
          $ curl $URL -H Content-type:application/json -d '
          [
            {
              "id" : "978-0641723445",
              "cat" : ["book","hardcover"],
              "title" : "The Lightning Thief",
              "author" : "Rick Riordan",
              "series_t" : "Percy Jackson and the Olympians",
              "sequence_i" : 1,
              "genre_s" : "fantasy",
              "inStock" : true,
              "price" : 12.50,
              "pages_i" : 384
            }
          ]'
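The same update can be prepared from a program instead of curl. The sketch below uses only Python's standard library and builds the request without sending it, so it runs without a Solr server; sending it would additionally require calling `urllib.request.urlopen(request)` against a running instance, which is assumed rather than shown:

```python
import json
from urllib.request import Request

# The document from the slide, as a Python dict.
doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

# The update handler on the slide receives a JSON array of documents.
payload = json.dumps([doc]).encode("utf-8")

request = Request(
    "http://localhost:8983/solr/update/json",
    data=payload,
    headers={"Content-type": "application/json"},
    method="POST",
)

print(request.get_method(), request.full_url, len(payload))
```

Wrapping the document in a list keeps the payload in the same array form as the curl example, so several documents can be sent in one request by extending the list.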
  30. Query Results in CSV
          http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

          name,price,cat,popularity
          iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
          Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
          Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
      • Can handle multi-valued fields (see cat field in example)
      • Completely compatible with the CSV update handler (can round-trip)
      • Results are streamed: good for dumping entire parts of the index
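On the client side the CSV output above can be read with a standard CSV parser; the only extra step is splitting the quoted multi-valued `cat` column. A short Python sketch, assuming that individual values contain no commas of their own (values with embedded separators would need additional unescaping that is not handled here):

```python
import csv
import io

# The response body from the slide.
csv_text = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    row["cat"] = row["cat"].split(",")  # multi-valued field: one quoted cell, comma-separated
    row["price"] = float(row["price"])
    row["popularity"] = int(row["popularity"])
    rows.append(row)

for row in rows:
    print(row["name"], row["price"], row["cat"])
```

Because the parser works line by line, the same loop can consume a streamed response without holding the whole export in memory, provided it is fed a file-like object instead of a string.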
  31. http://localhost:8983/solr/browse
  32. Q&A