Scaling Solr with SolrCloud

Rafał Kuć – Sematext Group, Inc.
@kucrafal @sematext sematext.com
Tá mé… (Irish for "I am…")
Sematext consultant & engineer
Solr.pl co-founder
Father and husband 
Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, and use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you'll learn how to make your indexing process blazing fast and keep your queries efficient even with large amounts of data in your collections. You'll also see how to optimize your queries to leverage caches as much as your deployment allows, and how to observe your cluster with the Solr administration panel, JMX, and third-party tools. Finally, learn how to make changes to already deployed collections: split their shards and alter their schema by using the Solr API.

Transcript of "Scaling Solr with SolrCloud"

  1. Scaling Solr with SolrCloud
     Rafał Kuć – Sematext Group, Inc.
     @kucrafal @sematext sematext.com
  2. Tá mé… (Irish for "I am…")
     Sematext consultant & engineer
     Solr.pl co-founder
     Father and husband
  3. Solr History
     Solr 4.1 and counting
     Solr 4.0 released
     Lucene / Solr merge
     Solr 1.4 released
     Solr 1.3 released
     Incubator graduation
     Solr donated to ASF
     Y. Seeley creates Solr
  4. The Past
  5. Master – Slave Deployment
     [Diagram: Application, one Solr Master, four Solr Slaves]
  6. Master as SPOF
     [Diagram: Application, one Solr Master, four Solr Slaves]
  7. Replication Time
     [Diagram: Indexing App, Solr Master, three Solr Slaves, Querying App]
  8. Too Much for a Single Shard
     [Diagram: Application, one Solr Master, two Solr Slaves]
  9. Too Much for a Single Shard
     [Diagram: Application, three Solr Masters, six Solr Slaves]
  10. Querying in Multi Master Deployment
     [Diagram: Application and three Solr Slaves holding Shard 1, Shard 2 and Shard 3; requests carry the shard list "shard1, shard2, shard3"; labels: Doc, Response]
  11. SolrCloud Comes Into Play
  12. Basic Glossary
     Cluster
     Node
     Collection
     Shard
     Leader & Replica
     Overseer
     https://cwiki.apache.org/confluence/display/solr/SolrCloud+Glossary
  13. Apache ZooKeeper
     Quorum is required
     Sample configuration:
         clientPort=2181
         dataDir=/usr/share/zookeeper/data
         tickTime=2000
         initLimit=10
         syncLimit=5
         server.1=192.168.1.1:2888:3888
         server.2=192.168.1.2:2888:3888
         server.3=192.168.1.3:2888:3888
     [Diagram: three ZooKeeper instances]
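The "quorum is required" point on slide 13 can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the usual majority rule for a ZooKeeper ensemble; the function names are illustrative and not part of any ZooKeeper or Solr API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # The sample configuration above defines three servers.
    for n in (1, 3, 4, 5):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this rule a fourth server does not let the ensemble survive more failures than three servers do, which is one reason odd ensemble sizes are the common choice.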
  14. Solr Instances
     [Diagram: four Solr Servers and three ZooKeeper instances; each Solr Server is started with a zkHost list naming all three ZooKeeper instances]
         -DzkHost=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181
         -DzkHost=192.168.1.2:2181,192.168.1.1:2181,192.168.1.3:2181
         -DzkHost=192.168.1.3:2181,192.168.1.1:2181,192.168.1.2:2181
         -DzkHost=192.168.1.3:2181,192.168.1.1:2181,192.168.1.2:2181
  15. Collection Creation
     Upload the configuration to ZooKeeper, then create the collection:
         $ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.2:2181 -confdir /usr/share/config/revolution/conf -confname revolution
         $ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=1'
     [Diagram: four Solr Servers and three ZooKeeper instances]
  16. Single Collection Deployment
     [Diagram: Application and four Solr Servers; the collection consists of Shard1 and Shard2]
  17. Collection with Replica
         $ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=2'
     [Diagram: four Solr Servers and three ZooKeeper instances]
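The two CREATE calls on slides 15 and 17 differ only in `replicationFactor`. As a minimal sketch, the request URL can be assembled programmatically; the helper name `create_collection_url` is hypothetical, and only parameters that appear on the slides are used. Nothing is sent over the network here.

```python
from urllib.parse import urlencode


def create_collection_url(base_url: str, name: str,
                          num_shards: int, replication_factor: int) -> str:
    """Build a Collections API CREATE URL like the ones shown on the slides."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    # urlencode keeps the insertion order of the dict and escapes the values.
    return f"{base_url}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    print(create_collection_url("http://solr1:8983/solr", "revolution", 2, 1))
    print(create_collection_url("http://solr1:8983/solr", "revolution", 2, 2))
```

The resulting string can be passed to `curl` or any HTTP client; the host `solr1` is the one used on the slides.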
  18. Collection with Replicas
     [Diagram: Application and four Solr Servers hosting Shard1, Shard2, a Shard1 Replica and a Shard2 Replica]
  19. Querying
     [Diagram: the Application sends QUERY to a Solr Server; Shard1 and Shard2 each return id,score]
  20. Querying
     [Diagram: Shard1 and Shard2 each return doc; the Solr Server returns Results to the Application]
  21. Shard and Replica Number
     How your data looks
     Expected data growth
     Target performance
     Target node number
     Max number of nodes = number of shards * (number of replicas + 1)
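The sizing formula on slide 21 is easy to check by hand. Here is a minimal sketch, assuming "number of replicas" means the copies in addition to each shard leader, as the formula implies; `max_nodes` is an illustrative name.

```python
def max_nodes(num_shards: int, replicas_per_shard: int) -> int:
    """Max number of nodes = number of shards * (number of replicas + 1)."""
    if num_shards < 1 or replicas_per_shard < 0:
        raise ValueError("need at least one shard and a non-negative replica count")
    return num_shards * (replicas_per_shard + 1)


if __name__ == "__main__":
    # Two shards, each with one extra replica: the four-server layout of slide 18.
    print(max_nodes(2, 1))
    # Two shards without extra replicas.
    print(max_nodes(2, 0))
```

Note that the `replicationFactor` parameter used in the CREATE calls on the slides counts every copy including the leader, so `replicationFactor=2` corresponds to `replicas_per_shard=1` in this sketch.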
  22. What should I go for?
     More data? More shards.
     More queries? More replicas.
     [Diagram: three Shards; six Replicas]
  23. Custom Routing
     Default (numShards present, pre 4.5)
     Implicit (numShards not present, pre 4.5)
  24. Custom Routing Example
     [Diagram: two Solr Servers hosting Shard1 and Shard2; documents id=userB!3, id=userA!1, id=userA!2]
  25. Querying Solr – Default Routing
     [Diagram: Application querying a Solr Collection made of Shard 1 through Shard 8]
  26. Querying Solr – Custom Routing
     [Diagram: Application querying a Solr Collection made of Shard 1 through Shard 8]
         q=revolution&_route_=userA!
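The idea behind the `userA!1` style identifiers on slides 24 and 26 is that the part before `!` decides where a document lives, so all documents of one user end up together and a routed query only needs to visit that shard. The sketch below illustrates this co-location only. It is an assumption-laden simplification: CRC32 and a modulo stand in for Solr's real hash function and hash ranges, and the function names are hypothetical.

```python
import zlib


def route_prefix(doc_id: str) -> str:
    """Return the routing key: the text before '!' or the whole id if there is none."""
    prefix, separator, _ = doc_id.partition("!")
    return prefix if separator else doc_id


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index in [0, num_shards).

    ASSUMPTION: CRC32 modulo the shard count is a stand-in for illustration,
    not the hashing Solr actually performs.
    """
    key = route_prefix(doc_id).encode("utf-8")
    return zlib.crc32(key) % num_shards


if __name__ == "__main__":
    for doc_id in ("userA!1", "userA!2", "userB!3"):
        print(doc_id, "->", pick_shard(doc_id, 2))
```

Whatever the hash function, documents sharing a prefix always map to the same shard in this scheme, which is the property that `_route_=userA!` relies on.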
  27. Collection Manipulation Commands
     Create
     Delete
     Reload
     Split
     Create Alias
     Delete Alias
     Shard Creation/Deletion
     http://wiki.apache.org/solr/SolrCloud
  28. Collection Creation
     name
     numShards
     replicationFactor
     maxShardsPerNode
     createNodeSet
     collection.configName
  29. Collection Split Example
         $ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1'
  30. Collection Split Example
         $ curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'
  31. Collection Aliasing
         $ curl 'http://solr1:8983/solr/admin/collections?action=CREATEALIAS&name=weekly&collections=20131107,20131108,20131109,20131110,20131111,20131112,20131113'
         $ curl 'http://solr1:8983/solr/weekly/select?q=revolution'
         $ curl 'http://solr1:8983/solr/admin/collections?action=DELETEALIAS&name=weekly'
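The `weekly` alias on slide 31 groups seven daily collections named by date. A minimal sketch of generating that list and the CREATEALIAS URL follows; the helper names are hypothetical and the daily naming scheme is inferred from the collection names on the slide. No request is sent.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_collections(last_day: date, days: int = 7) -> list[str]:
    """Names of the `days` daily collections ending at `last_day`, oldest first."""
    first_day = last_day - timedelta(days=days - 1)
    return [(first_day + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


def create_alias_url(base_url: str, alias: str, collections: list[str]) -> str:
    """Build a CREATEALIAS URL; commas are left unescaped to match the slide."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{base_url}/admin/collections?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    names = daily_collections(date(2013, 11, 13))
    print(create_alias_url("http://solr1:8983/solr", "weekly", names))
```

Re-running this each day with a new `last_day` would produce the URL for a rolling one-week alias, while clients keep querying `/solr/weekly/select` unchanged.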
  32. Caches
     Refreshed with IndexSearcher
     Configurable
     Different purposes
     Different implementations
     [Diagram: Solr Cache]
  33. Filter Cache
         <filterCache class="solr.FastLRUCache"
                      size="512"
                      initialSize="512"
                      autowarmCount="128" />
     Example queries:
         q=lucene+revolution+city:Dublin
         q=lucene+revolution&fq=city:Dublin
         q=*:*&fq={!cache=false}city:Dublin
         q=*:*&fq={!frange l=0 u=10 cache=false cost=200}sum(price,pro)
  34. Document Cache
         <documentCache class="solr.LRUCache"
                        size="512"
                        initialSize="512" />
  35. Query Result Cache
         <queryResultCache class="solr.LRUCache"
                           size="512"
                           initialSize="512"
                           autowarmCount="128"/>
     Example queries:
         q=lucene+revolution+city:Dublin&sort=date+desc&start=0&rows=10
         q=lucene+revolution&fq=city:Dublin&sort=date+desc&start=0&rows=10
     Related settings:
         <queryResultWindowSize>20</queryResultWindowSize>
         <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
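The `queryResultWindowSize` setting on slide 35 means that a request for a page of results makes Solr collect and cache a somewhat larger window of document ids, so that the next pages can be served from the cache. The sketch below shows the rounding as it is commonly described in the Solr documentation; it is an illustration under that assumption, not a copy of Solr's code, and `collected_window` is a hypothetical name.

```python
def collected_window(start: int, rows: int, window_size: int) -> int:
    """Number of top documents collected for a request, rounded up to a
    multiple of the window size (assumed behaviour, see lead-in)."""
    if start < 0 or rows < 0 or window_size < 1:
        raise ValueError("start and rows must be >= 0 and window_size >= 1")
    requested = start + rows
    # Round `requested` up to the next multiple of `window_size`.
    return ((requested + window_size - 1) // window_size) * window_size


if __name__ == "__main__":
    # With the window of 20 from the slide, pages 1 and 2 share one cached window.
    for start in (0, 10, 20):
        print(f"start={start} rows=10 ->", collected_window(start, 10, 20))
```

With `start=0&rows=10` and a window of 20, the sketch yields 20, so a follow-up request with `start=10&rows=10` would fall inside the same window.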
  36. Warming
         <listener event="newSearcher" class="solr.QuerySenderListener">
           <arr name="queries">
             <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
             <lst><str name="q">keywords:* OR tags:*</str></lst>
             <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
           </arr>
         </listener>
         <listener event="firstSearcher" class="solr.QuerySenderListener">
           <arr name="queries">
             <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
             <lst><str name="q">keywords:* OR tags:*</str></lst>
             <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
           </arr>
         </listener>
         <useColdSearcher>false</useColdSearcher>
  37. The Right Directory
     StandardDirectory
     SimpleFSDirectory
     NIOFSDirectory
     MMapDirectory
     NRTCachingDirectory
     RAMDirectory
     [Diagram: index files _0.fdt, _0.fdx, _0.fnm, _0.nvd, _1.fdt, _1.fdx, _1.fnm, _1.nvd]
         <directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory" />
  38. Column oriented fields - DocValues
     NRT compatible
     Better compression than field cache
     Can store data outside of JVM heap
     Can improve things for dynamic indices
         <field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true"/>
         <field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true" docValuesFormat="Disk"/>
  39. Segment Merge
     [Diagram: Level 0 with segments a, b, f; Level 1 with segments c, c, d, e, g]
  40. Segment Merge Under Control
     Merge policy
     Merge scheduler
     Merge factor
     Merge policy configuration
  41. Configuring Segment Merge
         <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
           <int name="maxMergeAtOnce">10</int>
           <int name="segmentsPerTier">10</int>
         </mergePolicy>
         <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
         <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
         <mergeFactor>10</mergeFactor>
  42. Indexing Throughput Tuning
     Maximum indexing threads
     RAM buffer size
     Maximum buffered documents
     Bulk, bulks and bulks
     CloudSolrServer
     Autocommit
     Cutting off unnecessary stuff
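The "Bulk, bulks and bulks" advice on slide 42 amounts to sending documents in batches instead of one request per document. A minimal sketch of the batching step follows; `batches` is a hypothetical helper, and the actual sending of each batch to Solr is deliberately left out because the slides do not show it.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(docs: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    documents = ({"id": f"userA!{i}"} for i in range(1, 6))
    for batch in batches(documents, 2):
        # Each batch would go to Solr in a single update request.
        print(len(batch), [doc["id"] for doc in batch])
```

Because the helper works on any iterable, documents can be streamed from a file or database without loading everything into memory first; a suitable batch size depends on document size and has to be found by measurement.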
  43. TransactionLog
     Updates durability
     Recovering peer replay
     Performant Realtime Get
         <updateLog>
           <str name="dir">${solr.ulog.dir:}</str>
         </updateLog>
         <requestHandler name="/get" class="solr.RealTimeGetHandler">
         </requestHandler>
  44. Autocommit or Not?
     Automatic data flush
     Automatic index view refresh
         <autoCommit>
           <maxTime>15000</maxTime>
           <maxDocs>1000</maxDocs>
           <openSearcher>false</openSearcher>
         </autoCommit>
         <autoSoftCommit>
           <maxTime>1000</maxTime>
         </autoSoftCommit>
  45. Autocommit & openSearcher=true
         <autoCommit>
           <maxDocs>10</maxDocs>
           <openSearcher>true</openSearcher>
         </autoCommit>
  46. AutoSoftCommit & openSearcher=false
         <autoCommit>
           <maxDocs>1000</maxDocs>
           <openSearcher>false</openSearcher>
         </autoCommit>
         <autoSoftCommit>
           <maxDocs>10</maxDocs>
         </autoSoftCommit>
  47. Postings Formats to the Rescue
     Lucene 4.0 >= Flexible Indexing
     Postings == docs, positions, payloads
     Different postings formats available:
         Bloom
         Pulsing
         Simple text
         Direct
         Memory
         <codecFactory class="solr.SchemaCodecFactory" />
         <field name="id" type="string_pulsing" indexed="true" stored="true" />
         <fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing41" />
  48. Monitoring
     Cluster state
     Nodes utilization
     Memory usage
     Cache utilization
     Query response time
     Warmup times
     Garbage collector work
  49. JMX and Solr
  50. JMX and Solr
  51. Administration Panel
  52. Administration Panel
  53. Monitoring with SPM
  54. Monitoring with SPM
  55. Other Monitoring Tools
     Ganglia: http://ganglia.sourceforge.net/
     New Relic: http://www.newrelic.com/
     Opsview: http://www.opsview.com
  56. We Are Hiring!
     Dig Search?
     Dig Analytics?
     Dig Big Data?
     Dig Performance?
     Dig working with and in open source?
     We're hiring worldwide!
     http://sematext.com/about/jobs.html
  57. Thank You!
     Rafał Kuć
     @kucrafal
     rafal.kuc@sematext.com
     Sematext
     @sematext
     http://sematext.com
     http://blog.sematext.com
     SPM discount code: LR2013SPM20 @ Sematext booth ;)