Lucene/Solr 8: The next major release
Steve Rowe
Senior Software Developer, Lucidworks
@steven_a_rowe
#Activate18 #ActivateSearch
Agenda
• Recent release cadence
• 7.X
• 8.0
• 8.X
YOU ARE HERE
7.X average: 11 weeks
6.X average: 10 weeks
7.X
1. Metrics
2. Autoscaling
3. CDCR
4. Time Routed Aliases
5. Replica types
6. Streaming expressions
7. JSON facet API
8. Configset / schema
9. Text Analysis / ML
10. Collections API
11. Queries
12. Large index segment merging
13. Replication / recovery / rolling upgrades
14. Block-join / nested docs
15. Miscellaneous
7.X: Metrics
• Continuation of 6.X work to support Autoscaling efforts
• 7.0: - Aggregated metrics collected in overseer
- solrconfig.xml <jmx> ➞ solr.xml <metrics><reporter>
• 7.1: Prometheus metrics exporter contrib
• 7.4: /admin/metrics/history API: basic long-term key metric time series aggregation
• Fixed-width windows at several resolutions
• Not yet in Admin UI: SOLR-12426
7.X: Autoscaling
• 7.0: - Preferences and policy DSL: flexible replica placement
[ { minimize: cores }, { maximize: freedisk } ]
{ replica: "<2", shard: "#EACH", node: "#ANY" }
- Diagnostics API: return sorted nodes, policy violations
• 7.1: - autoAddReplicas ported to autoscaling framework
- Add/remove/suspend/resume triggers and listeners
- Triggers for added and lost nodes
- ComputePlanAction / ExecutePlanAction
- /autoscaling/history API: cluster events and actions
• 7.2: - Search rate trigger
- /autoscaling/suggestions API
- UTILIZENODE collections API command
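A minimal sketch of how the preferences and policy shown above could be wrapped into autoscaling write commands. It only builds and prints the JSON bodies. The endpoint path and the `set-cluster-preferences` / `set-cluster-policy` command names are assumptions to verify against the reference guide for your Solr version; the function names are illustrative.

```python
import json

# Assumption: v2 API path for autoscaling writes; verify for your Solr version.
AUTOSCALING_URL = "http://localhost:8983/api/cluster/autoscaling"


def cluster_preferences_command():
    """Preferences from the slide: prefer fewer cores, then more free disk."""
    return {
        "set-cluster-preferences": [
            {"minimize": "cores"},
            {"maximize": "freedisk"},
        ]
    }


def cluster_policy_command():
    """Policy rule from the slide: fewer than 2 replicas of each shard on any node."""
    return {
        "set-cluster-policy": [
            {"replica": "<2", "shard": "#EACH", "node": "#ANY"},
        ]
    }


if __name__ == "__main__":
    # Only prints the request bodies; nothing is sent to a server.
    for command in (cluster_preferences_command(), cluster_policy_command()):
        print("POST", AUTOSCALING_URL)
        print(json.dumps(command, indent=2))
```

Each printed body could then be posted with curl or any HTTP client.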
7.X: Autoscaling
• 7.3: - Simulation framework
- Arbitrary metric threshold trigger
- Scheduled trigger
- Admin UI to display and execute suggestions
7.X: Autoscaling
• 7.4: - Periodic house-keeping task: cleans up inactive shards
- Index size trigger: document count or size in bytes
• 7.5: - Policy replica attribute: #ALL, #EQUAL, percentage, range, and floating point values
- Policy cores attribute: #EQUAL, percentage, range, and floating point values
- Percentage in freedisk policy attribute
- Simulation framework: test scaling up to 1 billion docs
7.X: Cross Data Center Replication
• 7.2: Support bi-directional syncing of CDCR clusters
This is not active-active, but rather passive-active or active-passive: only one active cluster at a time.
7.X: Time Routed Aliases
• 7.3: - Specialization of Solr’s collection alias feature
- Support time series data, e.g. logs / sensor data
- Maintain performance under continuous indexing
- CREATEALIAS: start, interval, retention policy
- Automatically create new collections
- Automatically delete old collections (optional)
- Route updates based on timestamp
- Search against all aliased collections*
• 7.5: Preemptively create the next collection when updates are near the latest collection’s end date (optional)
* Pending optimization: minimize queried collections (SOLR-9562)
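A sketch of what a CREATEALIAS request for a time routed alias might look like, assuming the `router.*` and `create-collection.*` parameter names from the Solr 7 reference guide. The alias name, field name, and values are hypothetical, and the code only builds the URL without sending it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def create_time_routed_alias_url(base_url, alias, config_name):
    """Build a CREATEALIAS request URL for a time routed alias (not sent anywhere)."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",
        "router.field": "timestamp_dt",         # hypothetical date field
        "router.start": "NOW/DAY",              # start of the first collection
        "router.interval": "+1DAY",             # one new collection per day
        "router.autoDeleteAge": "/DAY-90DAYS",  # optional retention policy
        "create-collection.collection.configName": config_name,
        "create-collection.numShards": "2",
    }
    return base_url + "/admin/collections?" + urlencode(params)


if __name__ == "__main__":
    url = create_time_routed_alias_url("http://localhost:8983/solr", "logs", "_default")
    print(url)
    # Show the decoded parameters for readability.
    for key, values in sorted(parse_qs(urlsplit(url).query).items()):
        print(key, "=", values[0])
```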
7.X: Replica types
• 7.0:
Type        | Indexes locally | Supports soft commit & RTG | Pulls segments from leader | Writes to TLog | Can become shard leader | Queryable
NRT         | ✅ | ✅ | -  | ✅ | ✅ | ✅
TLOG leader | ✅ | ✅ | -  | ✅ | ✅ | ✅
TLOG        | -  | -  | ✅ | ✅ | ✅ | ✅
PULL        | -  | -  | ✅ | -  | -  | ✅
• 7.4: Query param to prioritize replicas by type, e.g.
shards.preference=replica.type:PULL,replica.type:TLOG
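A sketch of how the replica types might be used together: create a collection with TLOG and PULL replicas, then query it with a `shards.preference` that favors PULL replicas. The collection name and replica counts are hypothetical. The code only builds the request URLs.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def create_collection_url(base_url, name, num_shards, tlog_replicas, pull_replicas):
    """Collections API CREATE with explicit per-type replica counts."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "tlogReplicas": tlog_replicas,
        "pullReplicas": pull_replicas,
    }
    return base_url + "/admin/collections?" + urlencode(params)


def select_url(base_url, collection, query, preferred_types):
    """Query URL that asks Solr to prefer replicas of the given types, in order."""
    preference = ",".join("replica.type:" + t for t in preferred_types)
    params = {"q": query, "shards.preference": preference}
    return "{}/{}/select?{}".format(base_url, collection, urlencode(params))


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    print(create_collection_url(base, "products", 2, 2, 2))
    print(select_url(base, "products", "*:*", ["PULL", "TLOG"]))
```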
7.X: Streaming expressions
• Parallel computation function suite
• Some use cases: MapReduce, aggregations, parallel SQL, pub/sub messaging, graph traversal, machine learning, statistical programming
• Each 7.X release has added many new functions
• 7.5: Ref guide: Math Expressions User Guide
7.X: JSON Facet API
• 7.0: Terms facets: added optional refinement support
• 7.4: Semantic Knowledge Graph support via new relatedness() aggregate function
• Finds ad-hoc relationships by scoring documents relative to foreground and background document sets
• 7.5: Heatmap facet support
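A sketch of a relatedness() request, assuming the `relatedness($fore,$back)` form in which the foreground and background sets are passed as the request parameters `fore` and `back`. The field name and queries are hypothetical. The code only assembles the parameters.

```python
import json


def relatedness_params(foreground_query, background_query, field, limit=10):
    """Request params for a terms facet scored by relatedness to the foreground set."""
    facet = {
        "related": {
            "type": "terms",
            "field": field,
            "limit": limit,
            "sort": "r desc",  # most related terms first
            "facet": {"r": "relatedness($fore,$back)"},
        }
    }
    return {
        "q": "*:*",
        "rows": "0",
        "fore": foreground_query,  # foreground document set
        "back": background_query,  # background document set
        "json.facet": json.dumps(facet),
    }


if __name__ == "__main__":
    params = relatedness_params("age:[35 TO *]", "*:*", "hobby_s")
    for key, value in params.items():
        print(key, "=", value)
```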
7.X: Configsets / schema
• 7.0: - _default configset
- Data-driven schema: auto-guessed text fields indexed 2 ways:
• tokenized for search
• strings for sorting/faceting: "*_str" string field, max 256 chars
- Turn off data-driven schema functionality:
curl http://host:8983/solr/mycollection/config \
  -d '{"set-user-property": {"update.autoCreateFields": "false"}}'
• 7.5: Disable configset upload: -Dconfigset.upload.enabled=false
7.X: Text analysis / machine learning
• 7.1: Bengali normalizer and stemmer
• 7.2: Enable off-ZooKeeper storage of large (>1MB) LTR models
• 7.3: OpenNLP integration: tokenization, POS tagging, phrase chunking, lemmatization, NER, language detection
• 7.4: - ProtectedTermFilterFactory: don’t filter protected terms
- TaggerRequestHandler (a.k.a. SolrTextTagger): NER
• 7.5: - "nori" Korean morphological text analysis: "*_txt_ko"
- PhrasesIdentificationComponent: identify and score candidate query phrases based on index statistics
- UIMA integration removed
7.X: Collections API
• 7.3: Add collection level properties similar to cluster properties
• 7.4: Cluster-wide defaults for numShards, nrtReplicas, tlogReplicas, pullReplicas
• 7.5: - Support co-locating replicas of two or more collections together in a node via the withCollection parameter to the CREATE and MODIFYCOLLECTION commands
- SPLITSHARD: New split method using hard links: splitMethod=link
• 3-5 times faster than the original splitMethod=rewrite
• Slows down replication
• Increases disk usage on replica nodes
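A sketch of the two 7.5 Collections API additions as request URLs: SPLITSHARD with an explicit splitMethod, and CREATE with withCollection. Collection and shard names are hypothetical, and nothing is sent to a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SPLIT_METHODS = ("rewrite", "link")  # the two methods named on the slide


def split_shard_url(base_url, collection, shard, split_method="rewrite"):
    """SPLITSHARD request; 'link' uses hard links, 'rewrite' is the original method."""
    if split_method not in SPLIT_METHODS:
        raise ValueError("unknown splitMethod: " + split_method)
    params = {
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
        "splitMethod": split_method,
    }
    return base_url + "/admin/collections?" + urlencode(params)


def create_colocated_url(base_url, name, with_collection):
    """CREATE request asking for replicas to be co-located with another collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": 1,
        "withCollection": with_collection,
    }
    return base_url + "/admin/collections?" + urlencode(params)


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    print(split_shard_url(base, "products", "shard1", "link"))
    print(create_colocated_url(base, "orders", "customers"))
```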
7.X: Queries
• 7.1: JSON query DSL
curl http://localhost:8983/solr/books/query -d '
{
  query: {
    bool: {
      must: [
        "title:solr",
        {lucene: {df: content, query: "lucene solr"}}
      ],
      must_not: [
        {frange: {u: 3.0, query: ranking}}
      ]
    }
  }
}'
7.X: Queries
• 7.2: New synonymQueryStyle field type option: enable generation of appropriate queries for hierarchical relations between overlapping terms
• as_same_term (default): SynonymQuery(bird,robin)
• pick_best: Dismax(bird,robin)
• as_distinct_terms: (bird OR robin)
• 7.4: JSON query DSL: Enable query/filter tagging,
e.g. { "#colorfilt" : "color:blue" }
equivalent to local-param {!tag=colorfilt}color:blue
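A sketch of where filter tagging tends to be useful: a tagged filter in a JSON request, plus a terms facet that excludes that tag so its counts ignore the filter (multi-select faceting). The facet name and the use of `domain.excludeTags` are assumptions layered on the slide's example.

```python
import json


def tagged_filter_request(tag, filter_query, facet_field):
    """JSON request body with a tagged filter and a facet that excludes the tag."""
    return {
        "query": "*:*",
        "filter": [{"#" + tag: filter_query}],
        "facet": {
            facet_field + "_counts": {
                "type": "terms",
                "field": facet_field,
                "domain": {"excludeTags": tag},
            }
        },
    }


def local_param_equivalent(tag, filter_query):
    """The local-param form shown on the slide."""
    return "{!tag=" + tag + "}" + filter_query


if __name__ == "__main__":
    print(json.dumps(tagged_filter_request("colorfilt", "color:blue", "color"), indent=2))
    print(local_param_equivalent("colorfilt", "color:blue"))
```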

7.X: Large index segment merging
• Problem: Overly large segments (e.g. as a result of force-merge/optimize) stop being eligible for merging, and can start accumulating >50% deleted documents, wasting space and skewing index stats.
• 7.5: - TieredMergePolicy now respects maxSegmentSizeMB by default when executing force-merge/optimize and expunge-deletes
- TieredMergePolicy’s reclaimDeletesWeight has been replaced with a new deletesPctAllowed setting to control how aggressively deletes should be reclaimed
7.X: Replication/recovery/rolling upgrades
• 7.3: The old Leader-Initiated-Recovery (LIR) implementation is deprecated and replaced
• To perform a rolling upgrade to Solr 8, you must be on Solr 7.3 or higher
• 7.4: - IndexFetcher now skips fetching identical files
- Buffering updates are written to a separate TLog
- Parallel replay of buffering TLogs
7.X: Block-join / nested documents
• 7.3: Added filters and excludeTags local-params for {!parent} and {!child} query parsers, usable for multi-select faceting
• 7.5: WIP: Allow Solr to more faithfully represent deeply nested document relationships, rather than requiring reconstruction based on the flattened list of child docs returned by Solr
7.X: Miscellaneous
• 7.3: add-distinct atomic updates
• 7.4: - Ignore large document URP
- TLog: maxSize auto hard-commit setting (in addition to maxDocs & maxTime)
• 7.5: Custom cluster properties allowed with ext. prefix
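A sketch of an add-distinct atomic update body, together with a toy model of the intended semantics (append only values the field does not already contain). The document id, field name, and helper functions are hypothetical. The model illustrates the idea and is not Solr's implementation.

```python
import json


def add_distinct_update(doc_id, field, values):
    """Atomic update body using the add-distinct modifier."""
    return [{"id": doc_id, field: {"add-distinct": list(values)}}]


def apply_add_distinct(existing, new_values):
    """Toy model: keep existing values, append new ones that are not present yet."""
    result = list(existing)
    for value in new_values:
        if value not in result:
            result.append(value)
    return result


if __name__ == "__main__":
    print(json.dumps(add_distinct_update("book1", "tags_ss", ["search", "lucene"])))
    print(apply_add_distinct(["solr", "search"], ["search", "lucene"]))
```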
8.0
• Autoscaling
• Index upgrades
• HTTP/2
• Miscellaneous
8.0: Autoscaling
• Suggestions API: rebalance options even if no violations
• Suggestions API: add-replica for lost replicas
• maxOps limit for index size trigger
• Autoscaling policy framework will be the default replica placement strategy
8.0: Index upgrades
• 7.0: Lucene indexes record the major Lucene version that created the index, and the minimum Lucene version that contributed to segments.
• 8.0: Version N-2 or older indexes will now fail to open, even if they have been merged into an N-1 index.
• IndexUpgrader will not upgrade 6.X or earlier indexes
• Re-indexing will be required to upgrade
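The rule above can be sketched as a check on the recorded "created" major version. This is a simplified model of the behavior described on the slide, not Lucene's actual code, and the function names are illustrative.

```python
def can_open_index(created_major, running_major):
    """True if an index created by `created_major` can be opened by `running_major`.

    Simplified model: only the current (N) and previous (N-1) major versions
    are accepted, based on the version that originally created the index.
    """
    return running_major - 1 <= created_major <= running_major


def upgrade_advice(created_major, running_major):
    """Short human-readable outcome for a given index and running version."""
    if can_open_index(created_major, running_major):
        return "opens"
    return "re-index required"


if __name__ == "__main__":
    # An index created on 6.X stays "created by 6" even after merging under 7.X.
    for created in (6, 7, 8):
        print("created by", created, "on Lucene 8:", upgrade_advice(created, 8))
```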
8.0: HTTP/2
• May 2018: Mark Miller announced his Star Burst effort: many cleanups and performance enhancements
• July 2018: Cao Manh Dat took up the HTTP/2 aspects: SOLR-12639
• Indexing test: 33M docs, 1 shard, 2 replicas (SOLR-12642)
• Garbage: Leader: 26% less; replica: 76% less
• Indexing throughput: 54% higher
• CPU time: Leader: 39% higher; replica: 76% lower
• Ready to merge back to master, pending release of Jetty 9.4.13, containing SPNEGO HTTP/2 implementation
8.0: Miscellaneous
• Lucene: scores must be non-negative
• Function(Score)Query-s convert negative scores to zero
• TODO: remove deprecations
• Trie fields? Removal effectively blocked by:
• SOLR-12074: Add numeric equivalent to StrField
• SOLR-11127: Mechanism to migrate the .system collection (a.k.a. blob store) schema from Trie (pre-7.0) to Points (7.0+)
8.X
• Lucene/Solr minimum JDK
• Luke: Lucene Toolbox
• New Lucene features
8.X: Lucene/Solr minimum JDK
• Oracle will end free JDK 8 support in January 2019
• Both JDK 9 & 10 are already EOL, no more Oracle support
• JDK 11 will very likely be next minimum supported JDK, no schedule yet
• Under JDK 9+, Solr’s Hadoop-related functionality has problems, including with Kerberos
• Uwe Schindler’s Jenkins server tests Lucene/Solr on Oracle 9+10+11+12 JDKs
• All have higher Solr test failure rates than on JDK 8
8.X: Luke: UI framework & licensing
• Andrzej Bialecki: Initial implementation: Thinlet, GPL
• Mark Harwood: GWT
• Mark Miller: Apache Pivot
• Dmitry Kan and Tomoko Uchida took ownership on GitHub
• Tomoko Uchida: JavaFX (bundled w/JDK 8)
• LUCENE-2562: Make Luke a Lucene/Solr Module
• JavaFX/OpenJFX unbundled from Java 11 JDK, GPL + Classpath Exception
• Tomoko Uchida: Swing (7.5 release available)
8.X: New Lucene features
• Index impacts, Block-Max WAND, similarity cleanups
• Some queries (especially term queries and disjunctions) are much faster when number of hits is not required
• FeatureField: incorporate static relevance signals, e.g. PageRank
• Soft deletes
• Merge policy retains deleted docs according to policy
• Enables document history, e.g. for time-travel indexes
• RAMDirectory replaced by ByteBuffersDirectory
Questions?
Thank you!
Steve Rowe
Senior Software Engineer, Lucidworks
@steven_a_rowe
#Activate18 #ActivateSearch
