Lucene/Solr 8: The next major release

Steve Rowe
Senior Software Developer, Lucidworks
@steven_a_rowe
#Activate18 #ActivateSearch

Agenda
• Recent release cadence
• 7.X
• 8.0
• 8.X
YOU ARE HERE

6.X average: 10 weeks
7.X average: 11 weeks

7.X
1. Metrics
2. Autoscaling
3. CDCR
4. Time Routed Aliases
5. Replica types
6. Streaming expressions
7. JSON facet API
8. Configset / schema
9. Text Analysis / ML
10. Collections API
11. Queries
12. Large index segment merging
13. Replication / recovery / rolling upgrades
14. Block-join / nested docs
15. Miscellaneous

7.X: Metrics
• Continuation of 6.X work to support Autoscaling efforts
• 7.0: - Aggregated metrics collected in overseer 
- solrconfig.xml <jmx> ➞ solr.xml <metrics><reporter>
• 7.1: Prometheus metrics exporter contrib
• 7.4: /admin/metrics/history API: basic long-term key metric
time series aggregation
• Fixed-width windows at several resolutions
• Not yet in Admin UI: SOLR-12426

7.X: Autoscaling
• 7.0: - Preferences and policy DSL: flexible replica placement 
[ { minimize: cores }, { maximize: freedisk } ] 
{ replica: "<2", shard: "#EACH", node: "#ANY" } 
- Diagnostics API: return sorted nodes, policy violations
• 7.1: - autoAddReplicas ported to autoscaling framework 
- Add/remove/suspend/resume triggers and listeners 
- Triggers for added and lost nodes 
- ComputePlanAction / ExecutePlanAction 
- /autoscaling/history API: cluster events and actions
• 7.2: - Search rate trigger 
- /autoscaling/suggestions API 
- UTILIZENODE collections API command
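
The preferences and policy shown on this slide can be sketched as request payloads. This is a minimal illustration, not a definitive client: the endpoint path and the command names set-cluster-preferences and set-cluster-policy are assumptions based on the Solr 7 autoscaling write API and are not given on the slide, and nothing is sent over the network.

```python
import json

# Assumption: the autoscaling write API path; not stated on the slide.
AUTOSCALING_PATH = "/solr/admin/autoscaling"


def build_autoscaling_commands():
    """Return (preferences, policy) command bodies as JSON strings."""
    # Preferences from the slide: fewest cores first, then most free disk.
    preferences = {
        "set-cluster-preferences": [
            {"minimize": "cores"},
            {"maximize": "freedisk"},
        ]
    }
    # Policy rule from the slide: fewer than 2 replicas of each shard on any node.
    policy = {
        "set-cluster-policy": [
            {"replica": "<2", "shard": "#EACH", "node": "#ANY"},
        ]
    }
    return json.dumps(preferences), json.dumps(policy)


if __name__ == "__main__":
    # Print the requests instead of sending them.
    for body in build_autoscaling_commands():
        print("POST", AUTOSCALING_PATH, body)
```

Each body would be POSTed separately with a JSON content type; the Diagnostics API mentioned above can then be used to check for violations.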

7.X: Autoscaling
• 7.3: - Simulation framework 
- Arbitrary metric threshold trigger 
- Scheduled trigger 
- Admin UI to display and execute suggestions

7.X: Autoscaling
• 7.4: - Periodic house-keeping task: cleans up inactive shards 
- Index size trigger: document count or size in bytes
• 7.5: - Policy replica attribute: #ALL, #EQUAL, percentage, 
range, and floating point values 
- Policy cores attribute: #EQUAL, percentage,  
range, and floating point values 
- Percentage in freedisk policy attribute 
- Simulation framework: test scaling up to 1 billion docs

7.X: Cross Data Center Replication
• 7.2: Support bi-directional syncing of CDCR clusters
• This is not active-active, but rather passive-active or
active-passive: only one active cluster at a time.

7.X: Time Routed Aliases
• 7.3: - Specialization of Solr’s collection alias feature 
- Support time series data, e.g. logs / sensor data 
- Maintain performance under continuous indexing 
- CREATEALIAS: start, interval, retention policy 
- Automatically create new collections 
- Automatically delete old collections (optional) 
- Route updates based on timestamp 
- Search against all aliased collections*
• 7.5: Preemptively create the next collection when updates 
are near the latest collection’s end date (optional) 
* Pending optimization: minimize queried collections (SOLR-9562)
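
The "route updates based on timestamp" idea can be sketched as a small function. This is an illustration only, not Solr's implementation: the function name, the fixed-length interval, and the alias_YYYY-MM-DD naming used here are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def route_to_collection(alias, start, interval, timestamp):
    """Pick the time-sliced collection whose window contains `timestamp`.

    Hypothetical helper: assumes fixed-length intervals beginning at `start`.
    """
    if timestamp < start:
        raise ValueError("timestamp precedes the alias start")
    # Number of whole intervals between the alias start and the document time.
    slices = (timestamp - start) // interval
    window_start = start + slices * interval
    return f"{alias}_{window_start:%Y-%m-%d}"


if __name__ == "__main__":
    start = datetime(2018, 10, 1, tzinfo=timezone.utc)
    doc_time = datetime(2018, 10, 3, 15, 30, tzinfo=timezone.utc)
    print(route_to_collection("logs", start, timedelta(days=1), doc_time))
```

With a one-day interval, a document stamped on 3 October is routed to the collection whose window starts that day. Retention then amounts to deleting collections whose window is older than a cutoff.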

7.X: Replica types
• 7.0: Replica types and their capabilities:

| Type | Indexes locally | Supports soft commit & RTG | Pulls segments from leader | Writes to TLog | Can become shard leader | Queryable |
|---|---|---|---|---|---|---|
| NRT | ✅ | ✅ | | ✅ | ✅ | ✅ |
| TLOG leader | ✅ | ✅ | | ✅ | ✅ | ✅ |
| TLOG | | | ✅ | ✅ | ✅ | ✅ |
| PULL | | | ✅ | | | ✅ |

• 7.4: Query param to prioritize replicas by type, e.g.
shards.preference=replica.type:PULL,replica.type:TLOG
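
A query that uses the shards.preference parameter could be assembled as follows. This is a sketch: the base URL, collection name, and query string are placeholders, and the function only builds the URL without contacting a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base_url, collection, query, preferred_types):
    """Build a /select URL that prefers replicas in the given type order."""
    # e.g. ["PULL", "TLOG"] -> "replica.type:PULL,replica.type:TLOG"
    prefs = ",".join(f"replica.type:{t}" for t in preferred_types)
    params = urlencode({"q": query, "shards.preference": prefs})
    return f"{base_url}/{collection}/select?{params}"


if __name__ == "__main__":
    url = select_url("http://localhost:8983/solr", "books", "title:solr", ["PULL", "TLOG"])
    print(url)
    # Decode the query string again to show the parameter as Solr would see it.
    print(parse_qs(urlsplit(url).query)["shards.preference"][0])
```

Listing PULL before TLOG asks for PULL replicas to be tried first, which keeps query load away from replicas that may become shard leaders.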

7.X: Streaming expressions
• Parallel computation function suite
• Some use cases: MapReduce, aggregations, parallel SQL,
pub/sub messaging, graph traversal, machine learning, statistical
programming
• Each 7.X release has added many new functions
• 7.5: Ref guide: Math Expressions User Guide

7.X: JSON Facet API
• 7.0: Terms facets: added optional refinement support
• 7.4: Semantic Knowledge Graph support via new
relatedness() aggregate function
• Finds ad-hoc relationships by scoring documents
relative to foreground and background document sets
• 7.5: Heatmap facet support
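
A relatedness() request might look like the sketch below. The field name, the foreground and background queries, and the facet label related_terms are hypothetical, and the request body is only constructed, not sent.

```python
import json


def relatedness_request(fore_query, back_query, field, limit=5):
    """Build a JSON request that ranks a field's terms by relatedness()."""
    return {
        "query": fore_query,
        # $fore and $back in the aggregate refer to these request parameters.
        "params": {"fore": fore_query, "back": back_query},
        "facet": {
            "related_terms": {  # hypothetical facet label
                "type": "terms",
                "field": field,
                "limit": limit,
                "sort": {"r1": "desc"},
                "facet": {"r1": "relatedness($fore,$back)"},
            }
        },
    }


if __name__ == "__main__":
    body = relatedness_request("age:[35 TO *]", "*:*", "hobby_s")
    print(json.dumps(body, indent=2))
```

Sorting the terms facet by the relatedness statistic surfaces the terms most associated with the foreground set compared with the background set.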

7.X: Configsets / schema
• 7.0: - _default configset 
- Data-driven schema: auto-guessed text fields indexed 2 ways:
• tokenized for search
• strings for sorting/faceting: "*_str" string field, max 256 chars
- Turn off data-driven schema functionality:
curl http://host:8983/solr/mycollection/config \
  -d '{"set-user-property": {"update.autoCreateFields": "false"}}'
• 7.5: Disable configset upload: -Dconfigset.upload.enabled=false

7.X: Text analysis / machine learning
• 7.1: Bengali normalizer and stemmer
• 7.2: Enable off-ZooKeeper storage of large (>1MB) LTR models
• 7.3: OpenNLP integration: tokenization, POS tagging, phrase 
chunking, lemmatization, NER, language detection
• 7.4: - ProtectedTermFilterFactory: don’t filter protected terms 
- TaggerRequestHandler (a.k.a. SolrTextTagger): NER
• 7.5: - "nori" Korean morphological text analysis: "*_txt_ko" 
- PhrasesIdentificationComponent: identify and score 
candidate query phrases based on index statistics 
- UIMA integration removed

7.X: Collections API
• 7.3: Add collection level properties similar to cluster properties
• 7.4: Cluster-wide defaults for numShards, nrtReplicas, 
tlogReplicas, pullReplicas
• 7.5: - Support co-locating replicas of two or more collections 
together in a node via the withCollection parameter 
to the CREATE and MODIFYCOLLECTION commands 
- SPLITSHARD: New split method using hard links: splitMethod=link
• 3-5 times faster than the original splitMethod=rewrite
• Slows down replication
• Increases disk usage on replica nodes
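
A SPLITSHARD call that selects the split method could be built like this. It is a sketch under assumptions: the base URL, collection, and shard names are placeholders, and the function only builds the request URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The two split methods named on the slide.
SPLIT_METHODS = ("rewrite", "link")


def splitshard_url(base_url, collection, shard, split_method="link"):
    """Build a Collections API SPLITSHARD URL with an explicit splitMethod."""
    if split_method not in SPLIT_METHODS:
        raise ValueError(f"unknown splitMethod: {split_method}")
    params = urlencode({
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
        "splitMethod": split_method,
    })
    return f"{base_url}/admin/collections?{params}"


if __name__ == "__main__":
    url = splitshard_url("http://localhost:8983/solr", "books", "shard1")
    print(url)
    print(parse_qs(urlsplit(url).query)["splitMethod"][0])
```

Choosing link trades a faster split for the replication and disk-usage costs listed above, so rewrite may still be preferable on clusters that are short on disk.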

7.X: Queries
• 7.1: JSON query DSL
curl http://localhost:8983/solr/books/query -d '
{
  query: {
    bool: {
      must: [
        "title:solr",
        {lucene: {df: content, query: "lucene solr"}}
      ],
      must_not: [
        {frange: {u: 3.0, query: ranking}}
      ]
    }
  }
}'

7.X: Queries
• 7.2: New synonymQueryStyle field type option: enable 
generation of appropriate queries for hierarchical 
relations between overlapping terms
• as_same_term (default): SynonymQuery(bird,robin)
• pick_best: Dismax(bird,robin)
• as_distinct_terms: (bird OR robin)
• 7.4: JSON query DSL: Enable query/filter tagging, 
e.g. { "#colorfilt" : "color:blue" }  
equivalent to local-param {!tag=colorfilt}color:blue

7.X: Large index segment merging
• Problem: Overly large segments (e.g. as a result of
force-merge/optimize) stop being eligible for merging,
and can start accumulating >50% deleted
documents, wasting space and skewing index stats.
• 7.5: - TieredMergePolicy now respects maxSegmentSizeMB 
by default when executing force-merge/optimize and 
expunge-deletes 
- TieredMergePolicy’s reclaimDeletesWeight has been 
replaced with a new deletesPctAllowed setting to 
control how aggressively deletes should be reclaimed

7.X: Replication/recovery/rolling upgrades
• 7.3: The old Leader-Initiated-Recovery (LIR) implementation 
is deprecated and replaced
• To perform a rolling upgrade to Solr 8, you must be on
Solr 7.3 or higher
• 7.4: - IndexFetcher now skips fetching identical files 
- Buffering updates are written to a separate TLog 
- Parallel replay of buffering TLogs

7.X: Block-join / nested documents
• 7.3: Added filters and excludeTags local-params for 
{!parent} and {!child} query parsers, usable for 
multi-select faceting
• 7.5: WIP: Allow Solr to more faithfully represent deeply 
nested document relationships, rather than requiring 
reconstruction based on the flattened list of child docs 
returned by Solr

7.X: Miscellaneous
• 7.3: add-distinct atomic updates
• 7.4: - Ignore large document URP 
- TLog: maxSize auto hard-commit setting 
(in addition to maxDocs & maxTime)
• 7.5: Custom cluster properties allowed with ext. prefix
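
The intended effect of an add-distinct atomic update can be sketched locally. This is an illustration of the semantics only, not Solr's code: the document id, field name, and helper functions are hypothetical.

```python
import json


def add_distinct_update(doc_id, field, values):
    """Build an atomic-update document using the add-distinct modifier."""
    return {"id": doc_id, field: {"add-distinct": list(values)}}


def apply_add_distinct(existing, new_values):
    """Local model of the effect: append only values not already present."""
    result = list(existing)
    for value in new_values:
        if value not in result:
            result.append(value)
    return result


if __name__ == "__main__":
    print(json.dumps([add_distinct_update("doc1", "tags", ["b", "c"])]))
    print(apply_add_distinct(["a", "b"], ["b", "c"]))
```

Compared with the plain add modifier, re-sending the same update does not grow the multi-valued field with duplicates.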

8.0
• Autoscaling
• Index upgrades
• HTTP/2
• Miscellaneous

8.0: Autoscaling
• Suggestions API: rebalance options even if no violations
• Suggestions API: add-replica for lost replicas
• maxOps limit for index size trigger
• Autoscaling policy framework will be the default replica
placement strategy

8.0: Index upgrades
• 7.0: Lucene indexes record the major Lucene version that 
created the index, and the minimum Lucene version 
that contributed to segments.
• 8.0: Version N-2 or older indexes will now fail to open, 
even if they have been merged into an N-1 index.
• IndexUpgrader will not upgrade 6.X or earlier indexes
• Re-indexing will be required to upgrade
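
The N-2 rule above reduces to a check on the major version that created the index. The function below is a hypothetical model of that rule, not Lucene's actual code, and it assumes the created-version major is the only input that matters.

```python
def can_open(index_created_major, lucene_major):
    """Return True if an index created by `index_created_major` can be opened.

    Models the rule on the slide: only indexes created by the current or the
    previous major version open, regardless of later merges.
    """
    return lucene_major - 1 <= index_created_major <= lucene_major


if __name__ == "__main__":
    for created in (6, 7, 8):
        print(f"created by {created}.x, opened by 8.x:", can_open(created, 8))
```

Because the created-version is recorded when the index is first written, merging 6.X segments under 7.X does not change the outcome: such an index still needs re-indexing before it can be used with 8.0.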

8.0: HTTP/2
• May 2018: Mark Miller announced his Star Burst effort: 
many cleanups and performance enhancements
• July 2018: Cao Manh Dat took up the HTTP/2 aspects: SOLR-12639
• Indexing test: 33M docs, 1 shard, 2 replicas (SOLR-12642)
• Garbage: Leader: 26% less; replica: 76% less
• Indexing throughput: 54% higher
• CPU time: Leader: 39% higher; replica: 76% lower
• Ready to merge back to master, pending release of 
Jetty 9.4.13, containing SPNEGO HTTP/2 implementation

8.0: Miscellaneous
• Lucene: scores must be non-negative
• Function(Score)Query-s convert negative scores to zero
• TODO: remove deprecations
• Trie fields? Removal effectively blocked by:
• SOLR-12074: Add numeric equivalent to StrField
• SOLR-11127: Mechanism to migrate the .system collection
(a.k.a. blob store) schema from Trie (pre-7.0) to Points (7.0+)

8.X
• Lucene/Solr minimum JDK
• Luke: Lucene Toolbox
• New Lucene features

8.X: Lucene/Solr minimum JDK
• Oracle will end free JDK 8 support in January 2019
• Both JDK 9 & 10 are already EOL, no more Oracle support
• JDK 11 will very likely be next minimum supported JDK, no
schedule yet
• Under JDK 9+, Solr’s Hadoop-related functionality has
problems, including with Kerberos
• Uwe Schindler’s Jenkins server tests Lucene/Solr on Oracle
9+10+11+12 JDKs
• All have higher Solr test failure rates than on JDK 8

8.X: Luke: UI framework & licensing
• Andrzej Bialecki: Initial implementation: Thinlet, GPL
• Mark Harwood: GWT
• Mark Miller: Apache Pivot
• Dmitry Kan and Tomoko Uchida took ownership on GitHub
• Tomoko Uchida: JavaFX (bundled w/JDK 8)
• LUCENE-2562: Make Luke a Lucene/Solr Module
• JavaFX/OpenJFX unbundled from Java 11 JDK, GPL+CPE
• Tomoko Uchida: Swing (7.5 release available)

8.X: New Lucene features
• Index impacts, Block-Max WAND, similarity cleanups
• Some queries (especially term queries and disjunctions)
are much faster when number of hits is not required
• FeatureField: incorporate static relevance signals, e.g.
PageRank
• Soft deletes
• Merge policy retains deleted docs according to policy
• Enables document history, e.g. for time-travel indexes
• RAMDirectory replaced by ByteBuffersDirectory

Thank you!
Steve Rowe
Senior Software Engineer, Lucidworks
@steven_a_rowe
#Activate18 #ActivateSearch
