Solr Search Engine: Optimize Is (Not) Bad for You

Sematext Group, Inc.
Sematext Group, Inc. Sematext Group, Inc.
Optimize Is (Not) Bad For You
Deep Dive Into The Segment Merge Abyss
Rafał Kuć
Sematext Group, Inc.
Agenda
• Segments – where, what & how
• Writing segments
• Modifying segments
• Segment merging – what, where, how, why
• Force merging
• Force merging & SolrCloud
• Performance considerations
• Specialized merge policies
https://github.com/sematext/lr/tree/master/2017/optimize
3
01
Sematext & I
cloud
metrics
logs
&
4
01
Solr Collection Architecture
Zookeeper
5
01
Solr Collection Architecture
Zookeeper
SOLR
SOLR
SOLR
SOLR
6
01
Solr Collection Architecture
Zookeeper
SOLR
shard shard
SOLR
shard shard
SOLR
shard shard
SOLR
shard shard
7
01
Solr Shard Architecture
TLOG
8
01
Solr Shard Architecture
TLOG
Segment Segment Segment
Segment
9
01
Lucene Segment
Segment Info
Field Names
Stored Field Values
Point Values
Term Dictionary
Term Frequency
Term Proximity
Normalization
Per Document Vals
Live Documents
1
01
Inside the Segment – Term Dictionary
TERM DOCID
lucene <1>, <2>
revolution <1>, <2>
washington <1>
boston <2>
_1.tim
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
_1.tip
1
01
Inside the Segment – Doc Values
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID FIELD VALUE
1 Title Lucene Revolution Washington
1 City Washington D.C.
2 Title Lucene Revolution Boston
2 City Boston
_1.dvd
_1.dvm
1
01
Inside the Segment – Stored Fields
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID VALUE
1 Title: Lucene Revolution Washington
City: Washington D.C
2 Title: Lucene Revolution Boston
City: Boston
_1.fdx
_1.fdt
1
01
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
1
01
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
1
01
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
_2.cfs
_2.cfe
1
01
Indexing
1
01
Indexing
1
01
Indexing
1
01
Indexing
level/tier
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Deletes
2
01
Deletes – After Merge
2
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
retrieve document
{
"id" : 3,
"tags" : [ "lucene" ],
"awesome" : true
}
3
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
apply changes
3
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
delete old document
3
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
3
01
Atomic Updates – In Place
Works on top of numeric, doc values based fields
Fields need to be not indexed and not stored
Doesn’t require delete/index
Support only inc and set modifers
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
3
01
Atomic Updates – In Place
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
retrieve document
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
3
01
Atomic Updates – In Place
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true,
"views" : 100
}
apply changes
3
01
Atomic Updates – In Place
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true,
"views" : 100
}
update doc values
3
01
Search – Importance of Segments
Immutable – write once read many
3
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
3
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
4
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
Fewer segments – smaller shard size
4
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
Fewer segments – smaller shard size
Rapid segment changes – worse I/O cache usage
4
01
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
4
01
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
4
01
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
Segment Warmer
<mergedSegmentWarmer
class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />
4
01
Taking Control – Default Indexing Throughput
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
4
01
Taking Control – Default Indexing Throughput
throughput < 5k/sec @ ~14GB
4
01
Taking Control – Max Merged Segment Size
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Lower higher indexing throughput – smaller segments
Higher better search latency (depends) – more merges
4
01
Taking Control – Lowering Max Merged Size
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">512</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
4
01
Taking Control – Lowering Max Segment Size
throughput < 5k/sec @ ~15.5GB
11% throughput increase
5
01
Taking Control – Merge At Once
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Lower better search latency (depends)
Higher higher indexing throughput
5
01
Taking Control – Lowering Merge At Once
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">2</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
5
01
Taking Control – Lowering Merge At Once
throughput < 5k/sec @ ~13GB
8% throughput decrease
5
01
Taking Control – Merge At Once Explicit
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Controls number of segments merged at once during force merge
5
01
Taking Control – Segments Per Tier
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Lower value means more merging, but less segments
Along with maxMergeAtOnce can smoothen I/O spikes
For better indexing throughput set maxMergeAtOnce <
segmentsPerTier
5
01
Taking Control – Combined Together
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">30</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">30</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">512</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
5
01
Taking Control – Combined Together
throughput < 5k/sec @ ~15GB
but look at read difference
5
01
Taking Control – Default vs Combined Read/Write
default settings
5
01
Taking Control – Default vs Combined Read/Write
default settings combined changes settings
5
01
Taking Control – Reclaim Deletes Weight
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Controls importance of merging segments with deleted documents
Increase to put priority on merging segments with deleted documents
6
01
Taking Control – No CFS Ratio
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Controls compound file system segments ratio
To completely disable CFS set to 0.0
6
01
Taking Control – Merge Scheduler
Controls maximum number of concurrent merges
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Taking Control – Merge Scheduler
Controls number of threads dedicated to merging
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Taking Control – Merge Scheduler
Controls number of threads dedicated to merging
For spinning drives set maxThreadCount to 1
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Taking Control – Merge Scheduler
Controls number of threads dedicated to merging
For spinning drives set maxThreadCount to 1
For SSD set maxThreadCount to min(4, #CPUs / 2)
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
Can be very bad or very good – depending on the use case
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
Can be very bad or very good – depending on the use case
$ curl
'http://solr:8983/solr/lr/update?optimize=true&numSegments=1&waitFlush=false'
7
01
Force Merge – The Good
Improves search speed (fewer segments)
7
01
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
7
01
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
Shrinks the index by pruning duplicated data
7
01
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
Shrinks the index by pruning duplicated data
Reduces number of used files
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
May cause performance issues
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
May cause performance issues
Will cause temporary increase of disk usage (up to 3x)
7
01
Force Merge – SolrCloud Performance Example
8
01
Force Merge – SolrCloud Performance Example
8
01
Force Merge – Legacy
Index on the master server
Solr Master
Solr Slave
Solr Slave
Solr Slave
index
Documents
8
01
Force Merge – Legacy
Index on the master server
Force merge on the master server
Solr Master
Solr Slave
Solr Slave
Solr Slave
force merge
8
01
Force Merge – Legacy
Index on the master server
Force merge on the master server
Replicate after optimize is done
Solr Master
Solr Slave
Solr Slave
Solr Slave
pull after optimize
8
01
Force Merge – SolrCloud (Solr 7 – pull replicas)
Create collection
Force merge
Solr will do the rest
Solr Solr
Solr Solr
Primary 1
Primary 2 Pull Replica 2
Pull Replica 1
8
01
Force Merge – SolrCloud (NRT, pre 7.0)
Ask yourself if you really need force merge
Solr Solr
Solr Solr
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Solr Solr
Solr Solr
Primary 1
Primary 2
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Solr Solr
Solr Solr
Primary 1
Primary 2
DocumentsDocuments
Documents
Documents
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Force merge
Solr Solr
Solr Solr
Primary 1
Primary 2optimize
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Force merge
Create replicas
Solr Solr
Solr Solr
Primary 1
Primary 2 Replica 2
Replica 1
9
01
Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">timestamp desc</str>
<str name="wrapper.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>
9
01
Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">timestamp desc</str>
<str name="wrapper.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>
Pre-sorts data during merge for:
- faster range queries
- faster data retrieval
- possibility of early query termination
- convenient for time based data
9
01
http://sematext.com/jobs
You love like we do?
You want to work with ?
Want to work with open source?
You want to do fun stuff?
9
01
Get in touch
Rafał
rafal.kuc@sematext.com
@kucrafal
http://sematext.com
@sematext http://sematext.com/jobs
Come talk to us
at the booth
Thank You
1 of 94

Recommended

Scaling Solr with SolrCloud by
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
2.3K views57 slides
Introduction to Apache Solr by
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
4.3K views57 slides
Apache Solr Workshop by
Apache Solr WorkshopApache Solr Workshop
Apache Solr WorkshopSaumitra Srivastav
7.2K views71 slides
Introduction to Apache solr by
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solrKnoldus Inc.
1.7K views23 slides
Apache Solr by
Apache SolrApache Solr
Apache SolrSemih Hakkıoğlu
385 views34 slides
Discovering the 2 in Alfresco Search Services 2.0 by
Discovering the 2 in Alfresco Search Services 2.0Discovering the 2 in Alfresco Search Services 2.0
Discovering the 2 in Alfresco Search Services 2.0Angel Borroy López
631 views45 slides

More Related Content

What's hot

What is in a Lucene index? by
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
48K views38 slides
Solr Introduction by
Solr IntroductionSolr Introduction
Solr IntroductionIsmaeel Enjreny
845 views10 slides
Neural Search Comes to Apache Solr by
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
635 views49 slides
Faceted Search with Lucene by
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
12.5K views21 slides
Scylla core dump debugging tools by
Scylla core dump debugging toolsScylla core dump debugging tools
Scylla core dump debugging toolsTomasz Grabiec
482 views54 slides
Alfresco search services: Now and Then by
Alfresco search services: Now and ThenAlfresco search services: Now and Then
Alfresco search services: Now and ThenAngel Borroy López
3K views28 slides

What's hot(20)

Neural Search Comes to Apache Solr by Sease
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
Sease635 views
Scylla core dump debugging tools by Tomasz Grabiec
Scylla core dump debugging toolsScylla core dump debugging tools
Scylla core dump debugging tools
Tomasz Grabiec482 views
Hacking ansible by bcoca
Hacking ansibleHacking ansible
Hacking ansible
bcoca12.5K views
How to Extend Apache Spark with Customized Optimizations by Databricks
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
Databricks5.1K views
Field collapsing/Result groupingについて by Jun Ohtani
Field collapsing/Result groupingについてField collapsing/Result groupingについて
Field collapsing/Result groupingについて
Jun Ohtani3.2K views
Oracle Database SQL Tuning Concept by Chien Chung Shen
Oracle Database SQL Tuning ConceptOracle Database SQL Tuning Concept
Oracle Database SQL Tuning Concept
Chien Chung Shen2.1K views
How to migrate from Alfresco Search Services to Alfresco SearchEnterprise by Angel Borroy López
How to migrate from Alfresco Search Services to Alfresco SearchEnterpriseHow to migrate from Alfresco Search Services to Alfresco SearchEnterprise
How to migrate from Alfresco Search Services to Alfresco SearchEnterprise
Spark Summit East 2017: Apache spark and object stores by Steve Loughran
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object stores
Steve Loughran1.5K views
Alfresco Share - Recycle Bin Ideas by AlfrescoUE
Alfresco Share - Recycle Bin IdeasAlfresco Share - Recycle Bin Ideas
Alfresco Share - Recycle Bin Ideas
AlfrescoUE1.5K views
Workshop: Learning Elasticsearch by Anurag Patel
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
Anurag Patel15.6K views

Viewers also liked

Effective Hive Queries by
Effective Hive QueriesEffective Hive Queries
Effective Hive QueriesQubole
1.5K views14 slides
Cross Datacenter Replication in Apache Solr 6 by
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Shalin Shekhar Mangar
2.3K views24 slides
SolrCloud on Hadoop by
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on HadoopAlex Moundalexis
4.9K views47 slides
Solr on Docker - the Good, the Bad and the Ugly by
Solr on Docker - the Good, the Bad and the UglySolr on Docker - the Good, the Bad and the Ugly
Solr on Docker - the Good, the Bad and the UglySematext Group, Inc.
15.5K views21 slides
Best practices for highly available and large scale SolrCloud by
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudAnshum Gupta
4.1K views24 slides
How to Run Solr on Docker and Why by
How to Run Solr on Docker and WhyHow to Run Solr on Docker and Why
How to Run Solr on Docker and WhySematext Group, Inc.
14.1K views32 slides

Viewers also liked(6)

Effective Hive Queries by Qubole
Effective Hive QueriesEffective Hive Queries
Effective Hive Queries
Qubole1.5K views
Best practices for highly available and large scale SolrCloud by Anshum Gupta
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
Anshum Gupta4.1K views

Similar to Solr Search Engine: Optimize Is (Not) Bad for You

Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc. by
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.Lucidworks
684 views94 slides
04 data accesstechnologies by
04 data accesstechnologies04 data accesstechnologies
04 data accesstechnologiesBat Programmer
248 views23 slides
Beyond full-text searches with Lucene and Solr by
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
4.8K views32 slides
IT talk SPb "Full text search for lazy guys" by
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" DataArt
740 views65 slides
Interactive Questions and Answers - London Information Retrieval Meetup by
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
359 views27 slides
20150210 solr introdution by
20150210 solr introdution20150210 solr introdution
20150210 solr introdutionXuan-Chao Huang
239 views56 slides

Similar to Solr Search Engine: Optimize Is (Not) Bad for You(20)

Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc. by Lucidworks
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
Lucidworks684 views
Beyond full-text searches with Lucene and Solr by Bertrand Delacretaz
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
Bertrand Delacretaz4.8K views
IT talk SPb "Full text search for lazy guys" by DataArt
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
DataArt740 views
Interactive Questions and Answers - London Information Retrieval Meetup by Sease
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
Sease359 views
Dev8d Apache Solr Tutorial by Sourcesense
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
Sourcesense2.6K views
Presto at Tivo, Boston Hadoop Meetup by Justin Borgman
Presto at Tivo, Boston Hadoop MeetupPresto at Tivo, Boston Hadoop Meetup
Presto at Tivo, Boston Hadoop Meetup
Justin Borgman945 views
Letting In the Light: Using Solr as an External Search Component by Jay Luker
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search Component
Jay Luker2.8K views
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas... by Amazon Web Services
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
Amazon Web Services1.6K views
Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can... by Amazon Web Services
Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can...Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can...
Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can...
Rails and the Apache SOLR Search Engine by David Keener
Rails and the Apache SOLR Search EngineRails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search Engine
David Keener3.4K views
Web analytics at scale with Druid at naver.com by Jungsu Heo
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
Jungsu Heo5.9K views
Rapid Prototyping with Solr by Erik Hatcher
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher1.1K views
Rapid Prototyping with Solr by Erik Hatcher
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher1.2K views
Developing on SQL Azure by Ike Ellis
Developing on SQL AzureDeveloping on SQL Azure
Developing on SQL Azure
Ike Ellis939 views
Lessons Learned While Scaling Elasticsearch at Vinted by Dainius Jocas
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at Vinted
Dainius Jocas189 views

More from Sematext Group, Inc.

Tweaking the Base Score: Lucene/Solr Similarities Explained by
Tweaking the Base Score: Lucene/Solr Similarities ExplainedTweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedSematext Group, Inc.
8.6K views53 slides
OOPs, OOMs, oh my! Containerizing JVM apps by
OOPs, OOMs, oh my! Containerizing JVM appsOOPs, OOMs, oh my! Containerizing JVM apps
OOPs, OOMs, oh my! Containerizing JVM appsSematext Group, Inc.
1.5K views25 slides
Is observability good for your brain? by
Is observability good for your brain?Is observability good for your brain?
Is observability good for your brain?Sematext Group, Inc.
751 views28 slides
Introducing log analysis to your organization by
Introducing log analysis to your organization Introducing log analysis to your organization
Introducing log analysis to your organization Sematext Group, Inc.
3.5K views106 slides
Monitoring and Log Management for by
Monitoring and Log Management forMonitoring and Log Management for
Monitoring and Log Management forSematext Group, Inc.
14.5K views42 slides
Introduction to solr by
Introduction to solrIntroduction to solr
Introduction to solrSematext Group, Inc.
860 views21 slides

More from Sematext Group, Inc. (20)

Tweaking the Base Score: Lucene/Solr Similarities Explained by Sematext Group, Inc.
Tweaking the Base Score: Lucene/Solr Similarities ExplainedTweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities Explained
Building Resilient Log Aggregation Pipeline with Elasticsearch & Kafka by Sematext Group, Inc.
Building Resilient Log Aggregation Pipeline with Elasticsearch & KafkaBuilding Resilient Log Aggregation Pipeline with Elasticsearch & Kafka
Building Resilient Log Aggregation Pipeline with Elasticsearch & Kafka
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker by Sematext Group, Inc.
Running High Performance & Fault-tolerant Elasticsearch Clusters on DockerRunning High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Sematext Group, Inc. 10.5K views
Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker by Sematext Group, Inc.
Running High Performance and Fault Tolerant Elasticsearch Clusters on DockerRunning High Performance and Fault Tolerant Elasticsearch Clusters on Docker
Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker
Sematext Group, Inc. 38.1K views
Large Scale Log Analytics with Solr (from Lucene Revolution 2015) by Sematext Group, Inc.
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Sematext Group, Inc. 11.8K views
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ... by Sematext Group, Inc.
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
Sematext Group, Inc. 14.1K views
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale by Sematext Group, Inc.
Metrics, Logs, Transaction Traces, Anomaly Detection at ScaleMetrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale

Recently uploaded

GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...James Anderson
85 views32 slides
Attacking IoT Devices from a Web Perspective - Linux Day by
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day Simone Onofri
16 views68 slides
Network Source of Truth and Infrastructure as Code revisited by
Network Source of Truth and Infrastructure as Code revisitedNetwork Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisitedNetwork Automation Forum
26 views45 slides
Melek BEN MAHMOUD.pdf by
Melek BEN MAHMOUD.pdfMelek BEN MAHMOUD.pdf
Melek BEN MAHMOUD.pdfMelekBenMahmoud
14 views1 slide
SAP Automation Using Bar Code and FIORI.pdf by
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdfVirendra Rai, PMP
23 views38 slides
Unit 1_Lecture 2_Physical Design of IoT.pdf by
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdfStephenTec
12 views36 slides

Recently uploaded(20)

GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson85 views
Attacking IoT Devices from a Web Perspective - Linux Day by Simone Onofri
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day
Simone Onofri16 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec12 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma39 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman33 views
handbook for web 3 adoption.pdf by Liveplex
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdf
Liveplex22 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software263 views
Transcript: The Details of Description Techniques tips and tangents on altern... by BookNet Canada
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...
BookNet Canada136 views
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely21 views
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi127 views

Solr Search Engine: Optimize Is (Not) Bad for You

  • 1. Optimize Is (Not) Bad For You Deep Dive Into The Segment Merge Abyss Rafał Kuć Sematext Group, Inc.
  • 2. Agenda • Segments – where, what & how • Writing segments • Modifying segments • Segment merging – what, where, how, why • Force merging • Force merging & SolrCloud • Performance considerations • Specialized merge policies https://github.com/sematext/lr/tree/master/2017/optimize
  • 6. 6 01 Solr Collection Architecture Zookeeper SOLR shard shard SOLR shard shard SOLR shard shard SOLR shard shard
  • 9. 9 01 Lucene Segment Segment Info Field Names Stored Field Values Point Values Term Dictionary Term Frequency Term Proximity Normalization Per Document Vals Live Documents
  • 10. 1 01 Inside the Segment – Term Dictionary TERM DOCID lucene <1>, <2> revolution <1>, <2> washington <1> boston <2> _1.tim Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston _1.tip
  • 11. 1 01 Inside the Segment – Doc Values Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston DOCID FIELD VALUE 1 Title Lucene Revolution Washington 1 City Washington D.C. 2 Title Lucene Revolution Boston 2 City Boston _1.dvd _1.dvm
  • 12. 1 01 Inside the Segment – Stored Fields Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston DOCID VALUE 1 Title: Lucene Revolution Washington City: Washington D.C 2 Title: Lucene Revolution Boston City: Boston _1.fdx _1.fdt
  • 13. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm
  • 14. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm
  • 15. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm _2.cfs _2.cfe
  • 29. 2 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' retrieve document { "id" : 3, "tags" : [ "lucene" ], "awesome" : true }
  • 30. 3 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true } apply changes
  • 31. 3 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true } delete old document
  • 32. 3 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
  • 33. 3 01 Atomic Updates – In Place Works on top of numeric, doc values based fields Fields need to be not indexed and not stored Doesn’t require delete/index Support only inc and set modifers $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]'
  • 34. 3 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' retrieve document { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
  • 35. 3 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 } apply changes
  • 36. 3 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 } update doc values
  • 37. 3 01 Search – Importance of Segments Immutable – write once read many
  • 38. 3 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed
  • 39. 3 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed Fewer segments – faster searches
  • 40. 4 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed Fewer segments – faster searches Fewer segments – smaller shard size
  • 41. 4 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed Fewer segments – faster searches Fewer segments – smaller shard size Rapid segment changes – worse I/O cache usage
  • 42. 4 01 Taking Control Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 43. 4 01 Taking Control Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
  • 44. 4 01 Taking Control Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" /> Segment Warmer <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />
  • 45. 4 01 Taking Control – Default Indexing Throughput Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 46. 4 01 Taking Control – Default Indexing Throughput throughput < 5k/sec @ ~14GB
  • 47. 4 01 Taking Control – Max Merged Segment Size Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Lower higher indexing throughput – smaller segments Higher better search latency (depends) – more merges
  • 48. 4 01 Taking Control – Lowering Max Merged Size Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">512</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 49. 4 01 Taking Control – Lowering Max Segment Size throughput < 5k/sec @ ~15.5GB 11% throughput increase
  • 50. 5 01 Taking Control – Merge At Once Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Lower better search latency (depends) Higher higher indexing throughput
  • 51. 5 01 Taking Control – Lowering Merge At Once Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">2</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 52. 5 01 Taking Control – Lowering Merge At Once throughput < 5k/sec @ ~13GB 8% throughput decrease
  • 53. 5 01 Taking Control – Merge At Once Explicit Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Controls number of segments merged at once during force merge
  • 54. 5 01 Taking Control – Segments Per Tier Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Lower value means more merging, but less segments Along with maxMergeAtOnce can smoothen I/O spikes For better indexing throughput set maxMergeAtOnce < segmentsPerTier
  • 55. 5 01 Taking Control – Combined Together Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">30</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">30</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">512</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 56. 5 01 Taking Control – Combined Together throughput < 5k/sec @ ~15GB but look at read difference
  • 57. 5 01 Taking Control – Default vs Combined Read/Write default settings
  • 58. 5 01 Taking Control – Default vs Combined Read/Write default settings combined changes settings
  • 59. 5 01 Taking Control – Reclaim Deletes Weight Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Controls importance of merging segments with deleted documents Increase to put priority on merging segments with deleted documents
  • 60. 6 01 Taking Control – No CFS Ratio Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Controls compound file system segments ratio To completely disable CFS set to 0.0
  • 61. 6 01 Taking Control – Merge Scheduler Controls maximum number of concurrent merges Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 62. 6 01 Taking Control – Merge Scheduler Controls number of threads dedicated to merging Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 63. 6 01 Taking Control – Merge Scheduler Controls number of threads dedicated to merging For spinning drives set maxThreadCount to 1 Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 64. 6 01 Taking Control – Merge Scheduler Controls number of threads dedicated to merging For spinning drives set maxThreadCount to 1 For SSD set maxThreadCount to min(4, #CPUs / 2) Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 65. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive
  • 66. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified
  • 67. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified Done on all shards at the same time (by default)
  • 68. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified Done on all shards at the same time (by default) Can be very bad or very good – depending on the use case
  • 69. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified Done on all shards at the same time (by default) Can be very bad or very good – depending on the use case $ curl 'http://solr:8983/solr/lr/update?optimize=true&numSegments=1&waitFlush=false'
  • 70. 7 01 Force Merge – The Good Improves search speed (fewer segments)
  • 71. 7 01 Force Merge – The Good Improves search speed (fewer segments) Removes deleted documents
  • 72. 7 01 Force Merge – The Good Improves search speed (fewer segments) Removes deleted documents Shrinks the index by pruning duplicated data
  • 73. 7 01 Force Merge – The Good Improves search speed (fewer segments) Removes deleted documents Shrinks the index by pruning duplicated data Reduces number of used files
  • 74. 7 01 Force Merge – The Bad Invalidates operating system I/O cache
  • 75. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments
  • 76. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments Not efficient on changing data
  • 77. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments Not efficient on changing data May cause performance issues
  • 78. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments Not efficient on changing data May cause performance issues Will cause temporary increase of disk usage (up to 3x)
  • 79. 7 01 Force Merge – SolrCloud Performance Example
  • 80. 8 01 Force Merge – SolrCloud Performance Example
  • 81. 8 01 Force Merge – Legacy Index on the master server Solr Master Solr Slave Solr Slave Solr Slave index Documents
  • 82. 8 01 Force Merge – Legacy Index on the master server Force merge on the master server Solr Master Solr Slave Solr Slave Solr Slave force merge
  • 83. 8 01 Force Merge – Legacy Index on the master server Force merge on the master server Replicate after optimize is done Solr Master Solr Slave Solr Slave Solr Slave pull after optimize
  • 84. 8 01 Force Merge – SolrCloud (Solr 7 – pull replicas) Create collection Force merge Solr will do the rest Solr Solr Solr Solr Primary 1 Primary 2 Pull Replica 2 Pull Replica 1
  • 85. 8 01 Force Merge – SolrCloud (NRT, pre 7.0) Ask yourself if you really need force merge Solr Solr Solr Solr
  • 86. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Solr Solr Solr Solr Primary 1 Primary 2
  • 87. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Index Solr Solr Solr Solr Primary 1 Primary 2 DocumentsDocuments Documents Documents
  • 88. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Index Force merge Solr Solr Solr Solr Primary 1 Primary 2optimize
  • 89. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Index Force merge Create replicas Solr Solr Solr Solr Primary 1 Primary 2 Replica 2 Replica 1
  • 90. 9 01 Specialized Merge Policy Example – Sorting Sorting Merge Policy Factory Example <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory"> <str name="sort">timestamp desc</str> <str name="wrapper.prefix">inner</str> <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str> <int name="inner.maxMergeAtOnce">10</int> <int name="inner.segmentsPerTier">10</int> <double name="inner.noCFSRatio">0.1</double> </mergePolicyFactory>
  • 91. 9 01 Specialized Merge Policy Example – Sorting Sorting Merge Policy Factory Example <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory"> <str name="sort">timestamp desc</str> <str name="wrapper.prefix">inner</str> <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str> <int name="inner.maxMergeAtOnce">10</int> <int name="inner.segmentsPerTier">10</int> <double name="inner.noCFSRatio">0.1</double> </mergePolicyFactory> Pre-sorts data during merge for: - faster range queries - faster data retrieval - possibility of early query termination - convenient for time based data
  • 92. 9 01 http://sematext.com/jobs You love like we do? You want to work with ? Want to work with open source? You want to do fun stuff?