Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

What's new in solr june 2014

3,697 views

Published on

Published in: Technology
  • Be the first to comment

What's new in solr june 2014

  1. 1. 1 What’s New in Solr Solr 4.7 & 4.8 June 12, 2014 Search | Discover | Analyze
  2. 2. Speaker • Software Engineer at LucidWorks • Lucene/Solr committer and PMC member • Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool • Twitter: @steven_a_rowe Steve Rowe 2
  3. 3. Agenda • A short history of Solr 4 • Solr 4.7 and 4.8: new features • Solr 4.9 and beyond 3
  4. 4. A short history of Solr 4 • Solr 4.0 released October 2012 4
  5. 5. A short history of Solr 4 • SolrCloud – Distributed indexing and searching, NRT and NoSQL features, e.g. realtime-get, optimistic concurrency and durable updates – Sharding, replication, ZooKeeper ensemble – High availability with no single points of failure • Real-time Get: Access latest document version, no commit or new searcher open required • Atomic updates: incremental field add/update/increment via stored fields • NRT: “soft” commits 5
  6. 6. A short history of Solr 4 • Solr Reference Guide now released with each feature release: – Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide – Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF – Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs 6
  7. 7. A short history of Solr 4 • Flexible indexing – Solr core = Lucene index • Lucene index = 1 or more segments – Codec: per-segment suite of formats • Flexible scoring – You can specify similarity implementation per fieldType in your schema.xml if you use SchemaSimilarityFactory – Built-in Similarities (other than the default TF-IDF): • Okapi BM25 • Divergence from Randomness • Information-Based • Language Models (with two smoothing implementations) • SweetSpot 7
  8. 8. A short history of Solr 4 • DocValues: typed column stride fields – Document-to-value mapping built at index time – Reduced memory usage compared to field cache – Good for faceting and sorting – Missing values now supported as of Solr 4.5 • Pseudo-fields – Field aliasing, e.g. &fl=result:indexed – Function queries, aliasable too, e.g. &fl=price:sum(a,b) – Document transformers • Standard: [explain], [value], [shard], [docid] • Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod • Pivot faceting: automatic drill-down (no distr.’d support) 8
  9. 9. A short history of Solr 4 • Schema API • GET /collection/schema/fields/fieldname • PUT /collection/schema/fields/name • JSON body: { "type":"text_general", "stored":true, "indexed":true } • Schemaless mode • a.k.a. data-driven schema or field guessing • Class guessed based on field values, then class(es) mapped to a fieldType; first gets added to the schema • Supported value classes: Boolean, Integer, Long, Float, Double, and Date 9
  10. 10. A short history of Solr 4 • Document routing – CompositeId router, e.g. id=tenant!docid • Used by default when numShards specified when creating a collection. • Restrict queries to shard(s): &_route_=tenant! – Implicit router • Online shard splitting – Allows collections to scale, rather than having to decide on how much to overshard up front. – Split in two; with custom hash ranges; or using split.key param to split to a dedicated shard 10
  11. 11. A short history of Solr 4 • Nested documents, a.k.a. Block Join – Nested doc to be added: <add> <doc> <field name="id">1</field> <field name="title">Solr adds block join support</field> <field name="content_type">parentDocument</field> <doc> <field name="id">2</field> <field name="comments">SolrCloud supports it too!</field> </doc> </doc> </add> – Queries: • Child query parser, e.g. q={!child of="content_type:parentDocument"}title:Solr • Parent query parser, e.g. q={!parent which="content_type:parentDocument"}comments:SolrCloud 11
  12. 12. A short history of Solr 4 • solr.xml legacy & discovery modes – Legacy mode (cores listed in solr.xml) is deprecated; support will be removed in Solr 5. – Discovery mode (new as of Solr 4.3): • No cores are listed in solr.xml • Cores are discovered by a recursive walk of the solr home directory, marked by core.properties files • Nested core directories are not allowed 12
  13. 13. A short history of Solr 4 • New web admin UI with SolrCloud support 13
  14. 14. Solr 4.7 and 4.8: new features • As of Solr 4.8, Java 7 is the minimum supported JVM version. Recommended: Oracle 1.7.0_60 • <fields> and <types> tags are no longer necessary in schema.xml • Collections API improvements – Working toward “ZooKeeper = Truth” mode • legacyCloud=false cluster property – New actions: • CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE, ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS, MIGRATE, CLUSTERPROP – Core properties can be specified with CREATE and SPLITSHARD actions 14
  15. 15. Solr 4.7 and 4.8: new features • Asynchronous execution of long-running actions – SolrCloud Collections API: • CREATE, SPLITSHARD, MIGRATE – CoreAdminHandler: • CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES, SPLIT – Tracking request ID supplied via async param – Track status via the new REQUESTSTATUS action, using the tracking request ID • Possible states: running, complete, failed, notfound – Clear stored statuses with special request ID -1 15
  16. 16. Solr 4.7 and 4.8: new features • Cursors: Efficient Deep Paging – Request must include a sort, which must include the uniqueKey, which must be defined – First page: ?q=…&sort=id+asc&rows=N&cursorMark=* • Response contains "nextCursorMark":"<base64encoded>" – Following pages: ?q=…&sort=id+asc&rows=N&cursorMark=<from response> – Repeat; when nextCursorMark=cursorMark from the request, there are no more results – No server-side state 16
  17. 17. Solr 4.7 and 4.8: new features 17
  18. 18. Solr 4.7 and 4.8: new features • Document expiration and Time To Live (TTL) – Auto-delete expired documents • DocExpirationUpdateProcessorFactory can periodically wake up and delete expired documents – Compute expiration date from TTL • Update request _ttl_ param, or • Document _ttl_ field • Both names are configurable, defaulting to _ttl_. • _ttl_ values are interpreted as Date Math Expressions relative to NOW, e.g. “+1YEAR”. 18
  19. 19. Solr 4.7 and 4.8: new features • Dynamic synonyms and stopwords – “Managed” resources: configuration and content for synonyms and stopwords, persistence managed by Solr – Specified as ManagedSynonymFilterFactory and ManagedStopFilterFactory on analyzers in schema.xml – CRUD operations are enabled via a REST endpoint per managed resource. – The “managed” attribute names the REST endpoint, e.g. <filter class="solr.ManagedStopFilterFactory" managed="french" /> – E.g. to delete stopword “le” from the “french” managed stoplist: curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le" 19
  20. 20. Solr 4.7 and 4.8: new features • SSL support in SolrCloud – URL scheme stored in ZooKeeper – SSL certificates are specifiable via system properties, to enable authentication • Nested documents may be specified in JSON format • Tri-level compositeId routing – E.g. “tenant!group!docid”, 8/8/16 hash bits per component • Build Solr indexes with Hadoop’s MapReduce – +Mark Miller’s blog: http://bit.ly/1oh0fWq • Github solr-map-reduce-example: http://bit.ly/1pnDAao • Named config sets in non-SolrCloud mode – Default base directory is SOLR_HOME/configsets/ 20
  21. 21. Solr 4.7 and 4.8: new features • Suggester v2 – Added BlendedInfixSuggester – Added FreeTextSuggester – Queries can use multiple suggesters • New query parsing features – SimpleQParserPlugin: parser for human entered queries with selectable operators. – ComplexPhraseQParserPlugin: wildcards, ORs, etc. inside Phrase Queries • E.g. {!complexphrase inOrder=true}name:"Jo* Smith" 21
  22. 22. Solr 4.7 and 4.8: new features • CollapsingQParserPlugin – Performant alternative grouping/field collapsing implementation, for high distinct group cardinality. • ExpandComponent – Expands collapsed groups – Can also expand nested documents 22
  23. 23. Solr 4.9 and beyond • ZooKeeper = Truth / legacyCloud=false • MODIFYCOLLECTION collections API – Modify maxShardsPerNode, replicationFactor for the entire collection • Incremental Field Updates on numeric DocValues – Binary DocValues IFUs also coming • Multi-valued DocValues sort fields • Legacy numeric/date field types deprecated, removed in Solr 5 in favor of Trie field types 23
  24. 24. Solr 4.9 and beyond • In Solr 5, the .war will no longer be shipped • Index integrity: checksums • Integrity check on merge off by default • solrconfig.xml option <indexConfig><checkIntegrityAtMerge> • New update query param min_rf will allow clients to set the minimum successful replicas for the request • Return Block Join child documents when parents match, via a new DocTransformer [child parentFilter=“field:value”] 24
  25. 25. Solr 4.9 and beyond • AnalyticsQuery: support pluggable, pipeline-able analytics, orderable via the “cost” parameter, like PostFilters. • ReRankingQParserPlugin • Re-rank the top n results 25
  26. 26. Platform LucidWorks Open Source 26 • Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk • Logstash for Solr: https://github.com/LucidWorks/solrlogmanager • Banana (Kibana for Solr): https://github.com/LucidWorks/banana • Data Quality Toolkit: https://github.com/LucidWorks/data- quality • Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
  27. 27. Links Solr website: http://lucene.apache.org/solr Solr Reference Guide: • Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide • Most recent released PDF: http://s.apache.org/Solr-Ref-Guide- PDF • Previous release PDFs: http://s.apache.org/Older-Solr-Ref- Guide-PDFs Lucene/Solr Revolution: http://www.LuceneRevolution.org Q & A 27

×