What's New in Solr, June 2014
Published in: Technology

0 Comments
6 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,326
On Slideshare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
38
Comments
0
Likes
6
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Asynchronous collection API calls in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-AsynchronousCalls

    REQUESTSTATUS action in the Solr Reference Guide: http://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-RequestStatus
  • See Pagination of Results in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
  • Chris Hostetter’s scripts to produce the graph: https://github.com/LucidWorks/blog-deep-paging-perf
  • Date Math Expressions in Solr Javadocs: https://lucene.apache.org/solr/4_8_1/solr-core/org/apache/solr/util/DateMathParser.html

    See Chris Hostetter’s blog post “New in Solr 4.8: Document Expiration”: http://searchhub.org/2014/05/07/document-expiration/
  • See the “Managed Resources” page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Managed+Resources

    See also Tim Potter’s blog “Using Solr’s REST APIs to manage stop words and synonyms”: http://searchhub.org/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/
  • For info on Tri-level compositeId routing, see Anshum Gupta’s blog “Multi level composite-id routing in SolrCloud”: http://searchhub.org/2014/01/06/10590/

    See the Config Sets page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Config+Sets
  • Suggester v2 JIRA issue: https://issues.apache.org/jira/browse/SOLR-5378

    Simple Query Parser in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-SimpleQueryParser

    Complex Phrase Query Parser in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
  • See the Collapse & Expand page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Collapse+%26+Expand

    See also Joel Bernstein’s blog post “The CollapsingQParserPlugin: Solr’s New High Performance Field Collapsing PostFilter”: http://heliosearch.org/the-collapsingqparserplugin-solrs-new-high-performance-field-collapsing-postfilter/

    See also Joel Bernstein’s blog post “Solr’s New Expand Component”: http://heliosearch.org/solrs-new-expand-component/

    See also Joel Bernstein’s blog post “Using the ExpandComponent to expand a Solr Block Join”: http://heliosearch.org/expand-block-join/
  • See Joel Bernstein’s blog post “Solr’s New AnalyticsQuery API”: http://heliosearch.org/solrs-new-analyticsquery-api/

    See Joel Bernstein’s blog post “New in Solr 4.9: Query Re-Ranking”: http://heliosearch.org/solrs-new-re-ranking-feature/

  • Transcript

    • 1. What’s New in Solr: Solr 4.7 & 4.8 (June 12, 2014). Search | Discover | Analyze
    • 2. Speaker: Steve Rowe
      • Software Engineer at LucidWorks
      • Lucene/Solr committer and PMC member
      • Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool
      • Twitter: @steven_a_rowe
    • 3. Agenda
      • A short history of Solr 4
      • Solr 4.7 and 4.8: new features
      • Solr 4.9 and beyond
    • 4. A short history of Solr 4
      • Solr 4.0 released October 2012
    • 5. A short history of Solr 4
      • SolrCloud
        – Distributed indexing and searching, NRT and NoSQL features, e.g. realtime-get, optimistic concurrency and durable updates
        – Sharding, replication, ZooKeeper ensemble
        – High availability with no single points of failure
      • Real-time Get: access the latest document version; no commit or new searcher open required
      • Atomic updates: incremental field add/update/increment via stored fields
      • NRT: “soft” commits
    • 6. A short history of Solr 4
      • Solr Reference Guide now released with each feature release:
        – Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide
        – Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
        – Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs
    • 7. A short history of Solr 4
      • Flexible indexing
        – Solr core = Lucene index
          • Lucene index = 1 or more segments
        – Codec: per-segment suite of formats
      • Flexible scoring
        – You can specify a similarity implementation per fieldType in your schema.xml if you use SchemaSimilarityFactory
        – Built-in Similarities (other than the default TF-IDF):
          • Okapi BM25
          • Divergence from Randomness
          • Information-Based
          • Language Models (with two smoothing implementations)
          • SweetSpot
    • 8. A short history of Solr 4
      • DocValues: typed column-stride fields
        – Document-to-value mapping built at index time
        – Reduced memory usage compared to field cache
        – Good for faceting and sorting
        – Missing values supported as of Solr 4.5
      • Pseudo-fields
        – Field aliasing, e.g. &fl=result:indexed
        – Function queries, aliasable too, e.g. &fl=price:sum(a,b)
        – Document transformers
          • Standard: [explain], [value], [shard], [docid]
      • Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod
      • Pivot faceting: automatic drill-down (no distributed support)
    • 9. A short history of Solr 4
      • Schema API
        • GET /collection/schema/fields/fieldname
        • PUT /collection/schema/fields/name
          • JSON body: { "type":"text_general", "stored":true, "indexed":true }
      • Schemaless mode
        • a.k.a. data-driven schema or field guessing
        • Class guessed based on field values, then class(es) mapped to a fieldType; first gets added to the schema
        • Supported value classes: Boolean, Integer, Long, Float, Double, and Date
    • 10. A short history of Solr 4
      • Document routing
        – CompositeId router, e.g. id=tenant!docid
          • Used by default when numShards is specified when creating a collection
          • Restrict queries to shard(s): &_route_=tenant!
        – Implicit router
      • Online shard splitting
        – Allows collections to scale, rather than having to decide on how much to overshard up front
        – Split in two; with custom hash ranges; or using the split.key param to split to a dedicated shard
    • 11. A short history of Solr 4
      • Nested documents, a.k.a. Block Join
        – Nested doc to be added:
            <add>
              <doc>
                <field name="id">1</field>
                <field name="title">Solr adds block join support</field>
                <field name="content_type">parentDocument</field>
                <doc>
                  <field name="id">2</field>
                  <field name="comments">SolrCloud supports it too!</field>
                </doc>
              </doc>
            </add>
        – Queries:
          • Child query parser, e.g. q={!child of="content_type:parentDocument"}title:Solr
          • Parent query parser, e.g. q={!parent which="content_type:parentDocument"}comments:SolrCloud
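Sketch for slide 11: the same parent/child pair expressed as a JSON update payload, which Solr 4.7/4.8 accept according to slide 20. The `_childDocuments_` key is taken from the Solr documentation for nested JSON documents, not from the slides, so treat it as an assumption to check against your Solr version.

```python
import json

# Parent document with one nested child, mirroring the XML on slide 11.
# "_childDocuments_" is the key Solr's JSON loader uses for nested docs
# (assumption: verify against the Reference Guide for your release).
parent = {
    "id": "1",
    "title": "Solr adds block join support",
    "content_type": "parentDocument",
    "_childDocuments_": [
        {"id": "2", "comments": "SolrCloud supports it too!"},
    ],
}

# Body of a JSON update request: a list of top-level documents.
payload = json.dumps([parent], indent=2)
print(payload)
```

The payload would be posted to an update handler with a JSON content type; the handler URL is deployment-specific and not shown here.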
    • 12. A short history of Solr 4
      • solr.xml legacy & discovery modes
        – Legacy mode (cores listed in solr.xml) is deprecated; support will be removed in Solr 5.
        – Discovery mode (new as of Solr 4.3):
          • No cores are listed in solr.xml
          • Cores are discovered by a recursive walk of the solr home directory, marked by core.properties files
          • Nested core directories are not allowed
    • 13. A short history of Solr 4
      • New web admin UI with SolrCloud support
    • 14. Solr 4.7 and 4.8: new features
      • As of Solr 4.8, Java 7 is the minimum supported JVM version. Recommended: Oracle 1.7.0_60
      • <fields> and <types> tags are no longer necessary in schema.xml
      • Collections API improvements
        – Working toward “ZooKeeper = Truth” mode
          • legacyCloud=false cluster property
        – New actions:
          • CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE, ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS, MIGRATE, CLUSTERPROP
        – Core properties can be specified with CREATE and SPLITSHARD actions
    • 15. Solr 4.7 and 4.8: new features
      • Asynchronous execution of long-running actions
        – SolrCloud Collections API:
          • CREATE, SPLITSHARD, MIGRATE
        – CoreAdminHandler:
          • CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES, SPLIT
        – Tracking request ID supplied via async param
        – Track status via the new REQUESTSTATUS action, using the tracking request ID
          • Possible states: running, complete, failed, notfound
        – Clear stored statuses with special request ID -1
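Sketch for slide 15: building the URLs for an asynchronous Collections API call and its status check. This only constructs URLs and sends nothing. The base URL, collection name, shard name, and request ID are hypothetical; the `requestid` parameter name comes from the Reference Guide's REQUESTSTATUS section rather than from the slide.

```python
from urllib.parse import urlencode

# Hypothetical Collections API endpoint; adjust host, port and path as needed.
BASE = "http://localhost:8983/solr/admin/collections"

def async_action_url(action, request_id, **params):
    """URL that submits `action` for asynchronous execution under `request_id`."""
    query = {"action": action, "async": request_id}
    query.update(params)
    return BASE + "?" + urlencode(query)

def request_status_url(request_id):
    """URL that asks for the state of a previously submitted request."""
    return BASE + "?" + urlencode({"action": "REQUESTSTATUS", "requestid": request_id})

# Submit a shard split under tracking ID 1000, then poll for its state.
split_url = async_action_url("SPLITSHARD", "1000", collection="demo", shard="shard1")
status_url = request_status_url("1000")
# The special ID -1 clears the stored statuses.
clear_url = request_status_url("-1")

print(split_url)
print(status_url)
print(clear_url)
```

A client would poll the status URL until the reported state is no longer running.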
    • 16. Solr 4.7 and 4.8: new features
      • Cursors: Efficient Deep Paging
        – Request must include a sort, which must include the uniqueKey, which must be defined
        – First page: ?q=…&sort=id+asc&rows=N&cursorMark=*
          • Response contains "nextCursorMark":"<base64encoded>"
        – Following pages: ?q=…&sort=id+asc&rows=N&cursorMark=<from response>
        – Repeat; when nextCursorMark=cursorMark from the request, there are no more results
        – No server-side state
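Sketch for slide 16: the client-side cursor loop, run against an in-memory stand-in instead of a real Solr server. The `fetch_page` callback, the flattened response shape, and the fake cursor marks (the last ID seen) are all assumptions for illustration; real marks are opaque strings and real responses nest the documents differently.

```python
def iterate_with_cursor(fetch_page, rows):
    """Yield every document, following nextCursorMark until it stops changing."""
    cursor_mark = "*"  # the first page always starts from "*"
    while True:
        response = fetch_page(cursor_mark, rows)
        yield from response["docs"]
        next_mark = response["nextCursorMark"]
        if next_mark == cursor_mark:
            break  # same mark returned: no more results
        cursor_mark = next_mark

def make_fake_solr(ids):
    """Hypothetical backend sorted by id; stands in for a sort=id+asc query."""
    ordered = sorted(ids)

    def fetch_page(cursor_mark, rows):
        if cursor_mark == "*":
            remaining = ordered
        else:
            remaining = [i for i in ordered if i > cursor_mark]
        page = remaining[:rows]
        # Echo the incoming mark when the page is empty, as Solr does at the end.
        next_mark = page[-1] if page else cursor_mark
        return {"docs": page, "nextCursorMark": next_mark}

    return fetch_page

docs = list(iterate_with_cursor(make_fake_solr(["d", "a", "c", "b", "e"]), rows=2))
print(docs)
```

Because the only state is the mark carried by the client, the loop can be paused and resumed without anything being held on the server.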
    • 17. Solr 4.7 and 4.8: new features
    • 18. Solr 4.7 and 4.8: new features
      • Document expiration and Time To Live (TTL)
        – Auto-delete expired documents
          • DocExpirationUpdateProcessorFactory can periodically wake up and delete expired documents
        – Compute expiration date from TTL
          • Update request _ttl_ param, or
          • Document _ttl_ field
          • Both names are configurable, defaulting to _ttl_.
          • _ttl_ values are interpreted as Date Math Expressions relative to NOW, e.g. “+1YEAR”.
    • 19. Solr 4.7 and 4.8: new features
      • Dynamic synonyms and stopwords
        – “Managed” resources: configuration and content for synonyms and stopwords, persistence managed by Solr
        – Specified as ManagedSynonymFilterFactory and ManagedStopFilterFactory on analyzers in schema.xml
        – CRUD operations are enabled via a REST endpoint per managed resource.
        – The “managed” attribute names the REST endpoint, e.g. <filter class="solr.ManagedStopFilterFactory" managed="french" />
        – E.g. to delete stopword “le” from the “french” managed stoplist: curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le"
    • 20. Solr 4.7 and 4.8: new features
      • SSL support in SolrCloud
        – URL scheme stored in ZooKeeper
        – SSL certificates are specifiable via system properties, to enable authentication
      • Nested documents may be specified in JSON format
      • Tri-level compositeId routing
        – E.g. “tenant!group!docid”, 8/8/16 hash bits per component
      • Build Solr indexes with Hadoop’s MapReduce
        – Mark Miller’s blog: http://bit.ly/1oh0fWq
        – GitHub solr-map-reduce-example: http://bit.ly/1pnDAao
      • Named config sets in non-SolrCloud mode
        – Default base directory is SOLR_HOME/configsets/
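Sketch for slide 20: one way the 8/8/16 bit split of a tri-level compositeId could be composed into a single 32-bit routing hash. This is an illustration of the bit layout only, under two stated assumptions: CRC32 stands in for the MurmurHash3 that Solr uses, and each component keeps the bits at its own position in the word. The IDs are made up.

```python
import zlib

def h32(text):
    # Stand-in 32-bit hash (CRC32). Solr itself uses MurmurHash3, so the
    # actual values differ; only the bit layout is being illustrated.
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

def composite_hash(doc_id):
    """Combine tenant/group/docid hashes using 8, 8 and 16 bits respectively."""
    tenant, group, doc = doc_id.split("!")
    return (
        (h32(tenant) & 0xFF000000)    # top 8 bits come from the tenant
        | (h32(group) & 0x00FF0000)   # next 8 bits come from the group
        | (h32(doc) & 0x0000FFFF)     # low 16 bits come from the document id
    )

a = composite_hash("acme!sales!doc1")
b = composite_hash("acme!sales!doc2")
c = composite_hash("acme!hr!doc1")

# Same tenant and group share the top 16 bits, so they fall in one hash range.
print(hex(a), hex(b), hex(c))
```

Documents sharing a tenant agree on the top 8 bits, and documents sharing both tenant and group agree on the top 16, which is what keeps them co-located in a contiguous hash range.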
    • 21. Solr 4.7 and 4.8: new features
      • Suggester v2
        – Added BlendedInfixSuggester
        – Added FreeTextSuggester
        – Queries can use multiple suggesters
      • New query parsing features
        – SimpleQParserPlugin: parser for human-entered queries with selectable operators
        – ComplexPhraseQParserPlugin: wildcards, ORs, etc. inside Phrase Queries
          • E.g. {!complexphrase inOrder=true}name:"Jo* Smith"
    • 22. Solr 4.7 and 4.8: new features
      • CollapsingQParserPlugin
        – Performant alternative grouping/field collapsing implementation, for high distinct group cardinality
      • ExpandComponent
        – Expands collapsed groups
        – Can also expand nested documents
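Sketch for slide 22: request parameters that pair the collapse post-filter with the ExpandComponent. The parameter names (`fq={!collapse field=…}`, `expand`, `expand.rows`) follow the Collapse & Expand page of the Reference Guide; the query text and the field name `manu_id_s` are hypothetical.

```python
from urllib.parse import urlencode

def collapse_expand_params(query, field, expand_rows=None):
    """Parameters that collapse results on `field` and expand each group."""
    params = [
        ("q", query),
        ("fq", "{!collapse field=%s}" % field),  # keep one document per group
        ("expand", "true"),                      # return the collapsed members too
    ]
    if expand_rows is not None:
        params.append(("expand.rows", str(expand_rows)))  # members per group
    return params

params = collapse_expand_params("ipod", "manu_id_s", expand_rows=3)
query_string = urlencode(params)
print(query_string)
```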
    • 23. Solr 4.9 and beyond
      • ZooKeeper = Truth / legacyCloud=false
      • MODIFYCOLLECTION collections API
        – Modify maxShardsPerNode, replicationFactor for the entire collection
      • Incremental Field Updates on numeric DocValues
        – Binary DocValues IFUs also coming
      • Multi-valued DocValues sort fields
      • Legacy numeric/date field types deprecated, removed in Solr 5 in favor of Trie field types
    • 24. Solr 4.9 and beyond
      • In Solr 5, the .war will no longer be shipped
      • Index integrity: checksums
        • Integrity check on merge is off by default
        • solrconfig.xml option <indexConfig><checkIntegrityAtMerge>
      • New update query param min_rf will allow clients to set the minimum successful replicas for the request
      • Return Block Join child documents when parents match, via a new DocTransformer [child parentFilter="field:value"]
    • 25. Solr 4.9 and beyond
      • AnalyticsQuery: support pluggable, pipeline-able analytics, orderable via the “cost” parameter, like PostFilters
      • ReRankingQParserPlugin
        • Re-rank the top n results
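Sketch for slide 25: parameters for re-ranking the top n results with the ReRankingQParserPlugin. The `rq` syntax and the `reRankQuery`, `reRankDocs`, and `reRankWeight` local params follow the Solr documentation for query re-ranking, not the slide; the query strings and the `rqq` parameter name used for dereferencing are illustrative choices.

```python
from urllib.parse import urlencode

def rerank_params(main_query, rerank_query, docs=1000, weight=3):
    """Re-rank the top `docs` hits of `main_query` using `rerank_query`."""
    # $rqq dereferences the separate "rqq" request parameter below.
    rq = "{!rerank reRankQuery=$rqq reRankDocs=%d reRankWeight=%s}" % (docs, weight)
    return {"q": main_query, "rq": rq, "rqq": rerank_query}

params = rerank_params("greetings", "(hi hello hey hiya)", docs=200, weight=3)
query_string = urlencode(params)
print(query_string)
```

Only the top `docs` results of the main query are rescored, so the more expensive re-rank query runs on a bounded set.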
    • 26. Platform LucidWorks Open Source
      • Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
      • Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
      • Banana (Kibana for Solr): https://github.com/LucidWorks/banana
      • Data Quality Toolkit: https://github.com/LucidWorks/data-quality
      • Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
    • 27. Links
      • Solr website: http://lucene.apache.org/solr
      • Solr Reference Guide:
        • Live (targeting next Solr release): http://s.apache.org/SolrReferenceGuide
        • Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
        • Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs
      • Lucene/Solr Revolution: http://www.LuceneRevolution.org
      • Q & A