Scaling Solr with SolrCloud

Rafał Kuć – Sematext Group, Inc.
@kucrafal @sematext sematext.com

I am…
Sematext consultant & engineer
Solr.pl co-founder
Father and husband 

Solr History
Solr 4.1 and counting
Solr 4.0 released
Lucene / Solr merge
Solr 1.4 released
Solr 1.3 released
Incubator graduation
Solr donated to ASF
Y. Seeley creates Solr

Master – Slave Deployment
[Diagram: Application, one Solr Master, four Solr Slaves]

Master as SPOF
[Diagram: Application, one Solr Master, four Solr Slaves]

Replication Time
[Diagram: Indexing App, Solr Master, three Solr Slaves, Querying App]

Too Much for a Single Shard
[Diagram: Application, one Solr Master, two Solr Slaves]

Too Much for a Single Shard
[Diagram: Application, three Solr Masters, six Solr Slaves]

Querying in Multi Master Deployment

[Diagram: Application; Solr Slaves for Shard 1, Shard 2 and Shard 3; labels: shard list "shard1, shard2, shard3", Doc, Response]

Basic Glossary
Cluster

Node
Collection

Shard
Leader & Replica
Overseer
https://cwiki.apache.org/confluence/display/solr/SolrCloud+Glossary

Apache ZooKeeper
Quorum is required
Sample configuration
clientPort=2181
dataDir=/usr/share/zookeeper/data
tickTime=2000
initLimit=10
syncLimit=5
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888
[Diagram: three-node ZooKeeper ensemble]
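The "quorum is required" point comes down to simple arithmetic. A minimal sketch, assuming the usual strict-majority rule for an ensemble; the function names are illustrative and not part of any ZooKeeper API:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble must have at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


# Print ensemble size, quorum size and tolerated failures side by side.
for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

For the three-server sample configuration this gives a quorum of 2, so one server can fail. Under the same rule a fourth server adds no extra fault tolerance, which is why odd ensemble sizes are the common choice.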

Solr Instances
Solr Server: -DzkHost=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181
Solr Server: -DzkHost=192.168.1.2:2181,192.168.1.1:2181,192.168.1.3:2181
Solr Server: -DzkHost=192.168.1.3:2181,192.168.1.1:2181,192.168.1.2:2181
Solr Server: -DzkHost=192.168.1.3:2181,192.168.1.1:2181,192.168.1.2:2181

[Diagram: four Solr Servers and a three-node ZooKeeper ensemble]
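When many Solr instances have to be started, the -DzkHost value is easy to generate rather than type. A small sketch; zk_host_arg is a hypothetical helper name, and the addresses are the ones from the sample ZooKeeper configuration:

```python
def zk_host_arg(hosts, port=2181):
    """Build the -DzkHost system property from a list of ZooKeeper hosts."""
    return "-DzkHost=" + ",".join(f"{host}:{port}" for host in hosts)


ensemble = ["192.168.1.1", "192.168.1.2", "192.168.1.3"]
print(zk_host_arg(ensemble))
```

Listing every ensemble member, as the slides do, means a Solr instance does not depend on one particular ZooKeeper server being reachable.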

Collection Creation

$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.2:2181 -confdir /usr/share/config/revolution/conf -confname revolution
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=1'

[Diagram: four Solr Servers and a three-node ZooKeeper ensemble]

Single Collection Deployment
[Diagram: Application, four Solr Servers, Shard1 and Shard2]

Collection with Replica

$ curl
name=revolution&numShards=2&replicationFactor=2'

[Diagram: four Solr Servers and a three-node ZooKeeper ensemble]

Collection with Replicas
[Diagram: Application; four Solr Servers hosting Shard1, Shard2, Shard1 Replica and Shard2 Replica]

Querying
[Diagram: Application sends QUERY to a Solr Server; Shard1 and Shard2 return id, score]

Querying
[Diagram: Shard1 and Shard2 return doc; a Solr Server returns Results to the Application]

Shard and Replica Number
How your data looks

Expected data growth
Target performance

Target node number

Max number of nodes = number of shards * (number of replicas + 1)
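The node-count formula is easy to check with a couple of values. A minimal sketch, assuming one shard copy per node; max_nodes is a hypothetical name, and "replicas" counts the copies in addition to the leader, as the formula does:

```python
def max_nodes(num_shards: int, replicas_per_shard: int) -> int:
    """Upper bound on useful nodes when every shard copy gets its own node."""
    if num_shards < 1 or replicas_per_shard < 0:
        raise ValueError("need at least one shard and a non-negative replica count")
    return num_shards * (replicas_per_shard + 1)


print(max_nodes(2, 0))  # two shards, leaders only
print(max_nodes(2, 1))  # two shards, one extra replica each
```

Two shards with one extra replica each give four nodes. As far as I know that corresponds to replicationFactor=2 in the CREATE call, since that parameter counts all copies of a shard including the leader.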

What should I go for?
More data? Add shards.
More queries? Add replicas.

Custom Routing

Default
(numShards present, pre 4.5)

Implicit
(numShards not present, pre 4.5)

Custom Routing Example
[Diagram: Shard1 and Shard2 on two Solr Servers]
id=userB!3
id=userA!1
id=userA!2
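The idea behind the prefixed ids is that everything before the "!" decides the shard, so documents sharing a prefix end up together. The sketch below only illustrates that idea: it uses zlib.crc32 and a modulo as a stand-in, which is NOT the hash Solr uses (Solr applies its own hashing over shard hash ranges), and route_key and pick_shard are hypothetical names.

```python
import zlib


def route_key(doc_id: str) -> str:
    """Return the part before '!' if present, otherwise the whole id."""
    prefix, separator, _ = doc_id.partition("!")
    return prefix if separator else doc_id


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Stand-in routing: hash only the route key (crc32 is NOT Solr's hash)."""
    return zlib.crc32(route_key(doc_id).encode("utf-8")) % num_shards


for doc_id in ("userB!3", "userA!1", "userA!2"):
    print(doc_id, "-> shard index", pick_shard(doc_id, 2))
```

Whatever the hash, the property that matters is that userA!1 and userA!2 always resolve to the same shard.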

Querying Solr – Default Routing
[Diagram: Application, Solr Collection with Shard 1 through Shard 8]

Querying Solr – Custom Routing
[Diagram: Application, Solr Collection with Shard 1 through Shard 8]
q=revolution&_route_=userA!
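A routed query is an ordinary select request with the extra _route_ parameter. A small sketch of building such a URL; routed_select_url is a hypothetical helper, and the host and collection names are taken from the earlier slides as assumptions:

```python
from urllib.parse import urlencode


def routed_select_url(base_url: str, collection: str, query: str, route_key: str) -> str:
    """Build a select URL that carries _route_=<route_key>! as on the slide."""
    params = urlencode({"q": query, "_route_": route_key + "!"})
    return f"{base_url}/{collection}/select?{params}"


print(routed_select_url("http://solr1:8983/solr", "revolution", "revolution", "userA"))
```

urlencode percent-encodes the "!" as %21, which is equivalent on the wire to the literal form shown on the slide.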

Collection Manipulation Commands
Create
Delete

Reload
Split

Create Alias
Delete Alias
Shard Creation/Deletion

http://wiki.apache.org/solr/SolrCloud

Collection Creation
name

numShards
replicationFactor

maxShardsPerNode
createNodeSet
collection.configName
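The parameters listed above all go into the query string of a single CREATE call. A hedged sketch of assembling it; create_collection_url is a hypothetical helper, the optional parameters are only added when given, and the node names passed in create_node_set would have to match the cluster's own:

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor,
                          max_shards_per_node=None, create_node_set=None,
                          config_name=None):
    """Build a Collections API CREATE URL from the parameters on the slide."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    if max_shards_per_node is not None:
        params["maxShardsPerNode"] = max_shards_per_node
    if create_node_set:
        # Comma-separated list of nodes the collection may be placed on.
        params["createNodeSet"] = ",".join(create_node_set)
    if config_name:
        params["collection.configName"] = config_name
    return f"{base_url}/admin/collections?{urlencode(params)}"


print(create_collection_url("http://solr1:8983/solr", "revolution", 2, 1))
```

With only the required arguments this reproduces the CREATE call used earlier for the revolution collection.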

Collection Split Example

$ curl
name=collection1&numShards=2&replicationFactor=1'

Collection Split Example

$ curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'

Collection Aliasing
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATEALIAS&name=weekly&collections=20131107,20131108,20131109,20131110,20131111,20131112,20131113'

$ curl 'http://solr1:8983/solr/weekly/select?q=revolution'

$ curl 'http://solr1:8983/solr/admin/collections?action=DELETEALIAS&name=weekly'
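With one collection per day, the list of collections behind the weekly alias can be generated instead of typed. A minimal sketch, assuming the yyyymmdd naming seen in the alias example; daily_collections and create_alias_url are hypothetical helper names:

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_collections(first_day: date, days: int):
    """Names of consecutive daily collections in yyyymmdd form."""
    return [(first_day + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


def create_alias_url(base_url: str, alias: str, collections):
    """Build a CREATEALIAS URL; commas are kept literal for readability."""
    params = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": ",".join(collections)},
        safe=",",
    )
    return f"{base_url}/admin/collections?{params}"


week = daily_collections(date(2013, 11, 7), 7)
print(create_alias_url("http://solr1:8983/solr", "weekly", week))
```

Re-running this each day with a new start date and issuing the resulting call is one way to keep the alias pointing at a sliding window of collections; whether that fits depends on how the daily collections are created and retired.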

Caches
Refreshed with IndexSearcher

Configurable
Different purposes

Different implementations

Solr Cache

Filter Cache
<filterCache class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="128" />
q=lucene+revolution+city:Dublin
q=lucene+revolution&fq=city:Dublin

q=*:*&fq={!cache=false}city:Dublin
q=*:*&fq={!frange l=0 u=10 cache=false cost=200}sum(price,pro)
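The point of moving city:Dublin from q into fq is that the filter's result can be cached on its own and reused by other queries. The toy model below only illustrates that reuse; it is not Solr's FastLRUCache, the three-document "index" is made up, and all names are hypothetical.

```python
from collections import OrderedDict


class ToyFilterCache:
    """Tiny LRU map from a filter string to the set of matching doc ids."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, fq, compute):
        if fq in self.entries:
            self.entries.move_to_end(fq)  # mark as most recently used
            self.hits += 1
            return self.entries[fq]
        self.misses += 1
        doc_set = compute(fq)
        self.entries[fq] = doc_set
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict least recently used
        return doc_set


# Made-up index: document id -> city.
CITIES = {1: "Dublin", 2: "Dublin", 3: "Berlin"}


def matching_docs(fq):
    """Evaluate a 'city:<value>' filter against the made-up index."""
    field, _, value = fq.partition(":")
    return frozenset(d for d, city in CITIES.items() if field == "city" and city == value)


cache = ToyFilterCache(size=512)
cache.get("city:Dublin", matching_docs)  # e.g. q=lucene+revolution&fq=city:Dublin
cache.get("city:Dublin", matching_docs)  # a different q with the same fq
print(cache.hits, cache.misses)
```

The second lookup is served from the cache even though the main query differs. With the filter folded into q there is no separate filter entry to reuse.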

Document Cache
<documentCache class="solr.LRUCache"
size="512"
initialSize="512" />

Query Result Cache
<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="128"/>
q=lucene+revolution+city:Dublin&sort=date+desc&start=0&rows=10
q=lucene+revolution&fq=city:Dublin&sort=date+desc&start=0&rows=10

<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
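The effect of queryResultWindowSize is easiest to see with numbers. A sketch under the assumption that the cached range is the requested end (start + rows) rounded up to a multiple of the window size, which is how the setting is usually described; cached_window_end is a hypothetical name:

```python
def cached_window_end(start: int, rows: int, window_size: int) -> int:
    """Round the requested end position up to a multiple of the window size."""
    if start < 0 or rows < 0 or window_size < 1:
        raise ValueError("start and rows must be >= 0, window_size >= 1")
    requested_end = start + rows
    rounded = -(-requested_end // window_size) * window_size  # ceiling division
    return max(rounded, window_size)


for start in (0, 10, 20):
    print(start, cached_window_end(start, rows=10, window_size=20))
```

Under that assumption, with a window of 20 the first page (start=0, rows=10) already caches enough entries to serve the second page (start=10), while the third page needs a larger window.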

Warming
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>

<useColdSearcher>false</useColdSearcher>

The Right Directory
StandardDirectory
SimpleFSDirectory
NIOFSDirectory
MMapDirectory

[Diagram: segment files _0.fdt, _0.fdx, _0.fnm, _0.nvd and _1.fdt, _1.fdx, _1.fnm, _1.nvd]

NRTCachingDirectory

RAMDirectory

<directoryFactory name="DirectoryFactory"
class="solr.NRTCachingDirectoryFactory" />

Column oriented fields - DocValues
NRT compatible
Better compression than field cache
Can store data outside of JVM heap
Can improve things for dynamic indices
<field name="categories" type="string" indexed="false"
stored="false" multiValued="true" docValues="true"/>
<field name="categories" type="string" indexed="false"
stored="false" multiValued="true" docValues="true"
docValuesFormat="Disk"/>

Segment Merge
[Diagram: Level 0 segments a, b, f; Level 1 segments c, c, d, e, g]
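The level picture can be played through with a very small simulation. This is a deliberately simplified, level-based model in the spirit of the diagram and of a merge factor, not an implementation of TieredMergePolicy or of any Lucene class; flush_segment and simulate are hypothetical names.

```python
def flush_segment(levels, merge_factor):
    """Add one new level-0 segment, then cascade merges upward.

    levels[i] holds the number of segments currently at level i.
    """
    levels[0] += 1
    level = 0
    while levels[level] >= merge_factor:
        # merge_factor segments at this level become one segment a level up.
        levels[level] -= merge_factor
        if level + 1 == len(levels):
            levels.append(0)
        levels[level + 1] += 1
        level += 1
    return levels


def simulate(flushes, merge_factor=10):
    """Segment counts per level after the given number of flushes."""
    levels = [0]
    for _ in range(flushes):
        flush_segment(levels, merge_factor)
    return levels


for flushes in (9, 10, 25, 100):
    print(flushes, simulate(flushes))
```

In this model a larger merge factor means fewer merges during indexing but more segments to search, which is the trade-off the merge settings on the following slides control.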

Segment Merge Under Control
Merge policy
Merge scheduler
Merge factor

Merge policy configuration

Configuring Segment Merge
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>

<mergeFactor>10</mergeFactor>

Indexing Throughput Tuning
Maximum indexing threads
RAM buffer size
Maximum buffered documents
Bulks, bulks and bulks
CloudSolrServer
Autocommit
Cutting off unnecessary stuff
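The "bulks" advice amounts to sending many documents per update request instead of one at a time. A minimal sketch of the batching step only, assuming the documents are sent as JSON arrays to the update handler; no HTTP call is made here, and batches, update_payloads and the sample documents are hypothetical:

```python
import json
from itertools import islice


def batches(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def update_payloads(docs, batch_size):
    """Serialize each batch as one JSON array, ready to POST in one request."""
    for chunk in batches(docs, batch_size):
        yield json.dumps(chunk)


docs = [{"id": str(i), "title": f"doc {i}"} for i in range(5)]
payloads = list(update_payloads(docs, batch_size=2))
print(len(payloads))
```

The right batch size depends on document size and hardware, so it is best found by measuring.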

TransactionLog
Updates durability
Replay for recovering peers
Performant Realtime Get

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<requestHandler name="/get" class="solr.RealTimeGetHandler">
</requestHandler>

Autocommit or Not?
Automatic data flush
Automatic index view refresh

<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

Autocommit & openSearcher=true
<autoCommit>
<openSearcher>true</openSearcher>
</autoCommit>

AutoSoftCommit & openSearcher=false
<autoCommit>
<openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
</autoSoftCommit>

Postings Formats to the Rescue
Lucene >= 4.0: flexible indexing
Postings == docs, positions, payloads
Different postings formats available

Bloom
Pulsing
Simple text
Direct
Memory

<codecFactory class="solr.SchemaCodecFactory" />
<field name="id" type="string_pulsing" indexed="true" stored="true" />
<fieldType name="string_pulsing" class="solr.StrField"
postingsFormat="Pulsing41" />

Monitoring
Cluster state
Nodes utilization

Memory usage
Cache utilization

Query response time
Warmup times
Garbage collector work

Other Monitoring Tools
Ganglia
http://ganglia.sourceforge.net/

New Relic
http://www.newrelic.com/

Opsview
http://www.opsview.com

We Are Hiring!
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig working with and in open source?
We're hiring worldwide!
http://sematext.com/about/jobs.html

Thank You!
Rafał Kuć
@kucrafal
rafal.kuc@sematext.com

Sematext
@sematext
http://sematext.com
http://blog.sematext.com
SPM discount code: LR2013SPM20 @ Sematext booth ;)
