Search | Discover | Analyze
Benchmarking Solr Performance
June 18, 2014
Timothy Potter
My SolrCloud Experience
• At LucidWorks, mostly focused on hardening SolrCloud; Lucene/Solr committer
• Operated a 36-node cluster in AWS for Dachis Group (1.5 years ago; 18 shards, ~900M docs)
• Built a Fabric/boto framework for deploying and managing a cluster in EC2
• Co-author of Solr In Action
Agenda
• Indexing performance tests
• Solr Scale Toolkit
• Next steps
Cluster sizing
How many servers do I need to index X docs?
... shards ... ?
... replicas ... ?
I need N queries per second over M docs; how many servers do I need?
It depends?!?
Methodology
• Transparent, repeatable results
– Ideally hoping for something owned by the community
• Synthetic docs ~ 1K each on disk, mix of field types
– Data set created using code borrowed from PigMix
– English text fields generated using a Zipfian distribution (sketched after this list)
• Java 1.7u55, Amazon Linux, r3.2xlarge nodes
– enhanced networking enabled, placement group, same AZ
• Stock Solr (cloud) 4.8.1
– Using Shawn Heisey’s GC tuning parameters
• Use Elastic MapReduce to generate load
– As many nodes as I need to drive Solr!
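
The deck doesn't include the generator code, but here is a minimal sketch of Zipfian text generation assuming a fixed vocabulary; the vocabulary and the 1/rank weights below are illustrative placeholders, not the actual PigMix values:

import random

# Hypothetical vocabulary; the real data set borrows PigMix's word list.
VOCAB = ["word%d" % i for i in range(10000)]
# Zipfian weights: a term's frequency is proportional to 1/rank.
WEIGHTS = [1.0 / (rank + 1) for rank in range(len(VOCAB))]

def zipf_text(num_words):
    # Weighted sampling (random.choices requires Python 3.6+).
    return " ".join(random.choices(VOCAB, weights=WEIGHTS, k=num_words))

print(zipf_text(20))  # one synthetic English-like field value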
Indexing Results
Cluster Size | # of Shards | # of Replicas | Reducers | Time (secs) | Docs/sec
10 | 10 | 1 | 48 | 1762 | 73,780
10 | 10 | 2 | 34 | 3727 | 34,881
10 | 20 | 1 | 48 | 1282 | 101,404
10 | 20 | 2 | 34 | 3207 | 40,536
10 | 30 | 1 | 72 | 1070 | 121,495
10 | 30 | 2 | 60 | 3159 | 41,152
15 | 15 | 1 | 60 | 1106 | 117,541
15 | 15 | 2 | 42 | 2465 | 52,738
15 | 30 | 1 | 60 | 827 | 157,195
15 | 30 | 2 | 42 | 2129 | 61,062
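
The Docs/sec column is just corpus size divided by wall time; multiplying the columns back out shows every run indexed the same corpus of roughly 130M synthetic docs (a figure derived from the table, not stated on the slide):

# Sanity check: time * rate should give the same corpus size for every row.
rows = [(1762, 73780), (3727, 34881), (1282, 101404), (827, 157195)]
for secs, rate in rows:
    print(secs * rate)  # each product comes out to ~130,000,000 docs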
Direct Updates
(Diagram) Indexing Client 1 uses CloudSolrServer (SolrJ), which watches /clusterstate.json in ZooKeeper, computes the shard assignment for each batch of docs on the client, and sends the docs directly to the leader of each shard (Shard 1, Shard 2, Shard 3).
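
CloudSolrServer itself is Java (SolrJ), but the client-side routing idea is easy to sketch in Python with kazoo from the toolkit's own dependency list. The hash below is a stand-in (real SolrCloud routing uses MurmurHash3 over the document id) and the collection name is a placeholder:

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()
shards = []

# Watch clusterstate.json so the local routing table stays current.
@zk.DataWatch("/clusterstate.json")
def on_state_change(data, stat):
    global shards
    state = json.loads(data)
    shards = sorted(state["collection1"]["shards"].keys())

def shard_for(doc_id):
    # Stand-in hash; SolrCloud actually uses MurmurHash3 on the id.
    return shards[hash(doc_id) % len(shards)]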
Replication
(Diagram) CloudSolrServer (SolrJ) watches /clusterstate.json in ZooKeeper and sends docs to each shard leader (Shards 1-3); each leader forwards the docs to its replica and blocks for a response from the replica(s) before acknowledging the client.
Don’t swamp your servers!
Lessons Learned
• Know what throughput your client side is capable of generating
– If in MapReduce, index from the reducers with speculative execution disabled (see the sketch after this list)
• Don’t change the Solr config without a good reason for doing so
• Overshard (but not too much)
• Near-linear scalability as I added nodes!
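
For example, when the load is driven from Elastic MapReduce (as in these tests), speculative execution can be disabled per job so duplicate reducer attempts don't send the same docs to Solr twice. A sketch of the relevant job configuration (the property names are the Hadoop 2 ones; Hadoop 1 used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

# However the job is launched, these end up as -D flags / jobconf entries.
job_conf = {
    "mapreduce.map.speculative": "false",
    "mapreduce.reduce.speculative": "false",
}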
Query Performance Tests
• All nodes in SolrCloud perform indexing and execute queries
• Use the TermsComponent to build queries based on the terms in each field (see the sketch after this list)
• Harder to accurately simulate user queries over synthetic data
– Need a mix of faceting, paging, sorting, grouping, boolean clauses, range queries, boosting, filters (some cached, some not), etc.
• Does the randomness in your test queries model (expected) user behavior?
• Start with one server (1 shard) to determine baseline query performance
– Look for inefficiencies in your schema and other config settings
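
A sketch of the TermsComponent idea in Python: pull the top terms for a field from the /terms handler, then sample them into random queries (plain requests here; the URL and field name are placeholders):

import random
import requests

SOLR = "http://localhost:8983/solr/collection1"  # placeholder URL

def top_terms(field, limit=1000):
    # The TermsComponent returns a flat list: [term1, freq1, term2, freq2, ...]
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    resp = requests.get(SOLR + "/terms", params=params).json()
    flat = resp["terms"][field]
    return flat[0::2]  # keep the terms, drop the frequencies

terms = top_terms("text")
q = " OR ".join(random.sample(terms, 3))  # e.g. "foo OR bar OR baz"
print(requests.get(SOLR + "/select",
                   params={"q": q, "wt": "json"}).json()["response"]["numFound"])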
Solr Scale Toolkit
• Fabric/Python-based toolset for deploying and managing SolrCloud clusters
• SolrJ-based client application useful for building tools that need access to cluster state information in ZooKeeper
• Code to support benchmarks for Solr
Python-based Tools
boto – Python API for AWS (EC2, S3, etc.)
Fabric – Python-based tool for automating system admin tasks over SSH
pysolr – Python library for Solr (sending commits, queries, ...)
kazoo – Python client library for ZooKeeper
Supporting Cast:
JMeter – run tests, generate reports
collectd – system monitoring
Logstash4Solr – log aggregation
JConsole/VisualVM – monitor JVM during indexing / queries
Solr Scale Toolkit: Demo
• Launch a meta node
– Log aggregation / basic monitoring using SiLK
• Launch a ZooKeeper ensemble
– 3 nodes to establish quorum
– Set up a cron job to clean up snapshots
• Launch a SolrCloud cluster
• Create a new collection and index some docs
– Attach JConsole while indexing
• Run a healthcheck on the collection
• Check out the Banana dashboard
• Backup / restore
– Requires the patch for SOLR-5956
– Use fab patch_jars to update the JARs and do a rolling restart
Provisioning machines
fab new_ec2_instances:test1,n=3,instance_type=m3.xlarge
• Custom-built AMI?
• Block device mapping
– dedicated disk per Solr node
• Launch, then poll status until the instances are live
– verify SSH connectivity
• Tag each instance with a cluster ID and username
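
Under the hood this is plain boto; a minimal sketch of what the fab task presumably does (the AMI id, key pair, and placement group names are placeholders):

import time
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
resv = conn.run_instances(
    "ami-12345678",             # placeholder AMI id
    min_count=3, max_count=3,
    key_name="solr-scale-tk",   # placeholder key pair
    instance_type="m3.xlarge",
    placement_group="solr-pg",  # keeps nodes together for fast networking
)
for inst in resv.instances:
    inst.add_tag("cluster", "test1")
    inst.add_tag("username", "timpotter")

# Poll until every instance reports running (the SSH check would follow).
while any(inst.update() != "running" for inst in resv.instances):
    time.sleep(10)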
ZooKeeper
fab new_zk_ensemble:zk1,n=3
• Two options:
– provision 1 to N nodes when you launch the Solr cluster
– use an existing named ensemble
• The Fabric command simply creates the myid files and the zoo.cfg file for the ensemble
– plus some cron scripts for managing snapshots
• Basic health check of ZooKeeper status (scripted below):
– echo srvr | nc localhost 2181
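
The same four-letter-word check, scripted in Python (functionally equivalent to the echo | nc line above):

import socket

def zk_srvr(host="localhost", port=2181):
    # Send ZooKeeper's 'srvr' command and return the status report.
    s = socket.create_connection((host, port), timeout=5)
    try:
        s.sendall(b"srvr")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode()
    finally:
        s.close()

print(zk_srvr())  # reports mode (leader/follower), latency stats, etc.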
SolrCloud
fab new_solrcloud:test1,zk=zk1,nodesPerHost=2
• Upload a BASH script that starts/stops Solr
• Set system props: jetty.port, host, zkHost, JVM opts
• One or more Solr nodes per machine
• JVM memory opts depend on the instance type and the # of Solr nodes per instance
• Optionally configure log4j.properties to append messages to RabbitMQ for Logstash4Solr integration
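
A hedged sketch of the kind of Fabric task behind this; the real toolkit drives the solr-ctl.sh script described on the next slide, and the path, heap size, and port here are illustrative only:

from fabric.api import env, run

env.user = "ec2-user"

def start_solr(zk_host, port=8984):
    # Start a Solr 4.x node under Jetty with the cloud system props set.
    # (Backgrounding via nohup under Fabric typically needs pty=False.)
    run("cd /home/ec2-user/solr && nohup java -Xmx4g "
        "-Djetty.port=%d -Dhost=`hostname` -DzkHost=%s "
        "-jar start.jar > solr.log 2>&1 &" % (port, zk_host), pty=False)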
solr-ctl.sh
• BASH script that:
– starts/stops Solr nodes on each EC2 instance
– sets JVM memory options and system properties (jetty.port), enables remote JMX, etc.
– backs up log files before restarting nodes
– ensures the JVM is killed correctly before restarting
• Environment variables in solr-ctl-env.sh
Miscellaneous Utility Tasks
• Deploy a configuration directory to ZooKeeper
• Create a new collection
• Attach a local JConsole/VisualVM to a remote JVM
• Rolling restart (with Overseer awareness)
• Build Solr locally and patch the remote nodes
– Use a relay server to scp the JARs into the Amazon network once, then scp them to the other nodes from within the network
• Put/get files
• Grep over all log files across the cluster (see the sketch after this list)
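
The cluster-wide grep, for instance, is nearly a one-liner with Fabric (the host list and log path are placeholders):

from fabric.api import env, run, parallel

env.hosts = ["ec2-host-1", "ec2-host-2", "ec2-host-3"]  # placeholder nodes

@parallel
def grep_logs(pattern):
    # -H prefixes matches with the file name; Fabric prefixes the host.
    run("grep -H '%s' /home/ec2-user/solr*/logs/*.log || true" % pattern)

Invoked as, e.g., fab grep_logs:OutOfMemoryError.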
Other useful stuff ...
• fab mine: see the clusters I’m running (or those of other users too)
• fab kill_mine: terminate all instances I’m running
– Use termination protection in production
• fab ssh_to: quick way to SSH to one of the nodes in a cluster
• fab stop/recover/kill: basic commands for controlling specific Solr nodes in the cluster
• fab jmeter: execute a JMeter test plan against your cluster
– An example test plan and Java sampler are included with the source
SolrCloud Tools (SolrJ client app)
./tools.sh -tool healthcheck
• Java-based command-line application that uses SolrJ’s CloudSolrServer to perform advanced cluster management operations:
– healthcheck: collects metadata and health information for all replicas of a collection from ZooKeeper
– backup: creates a snapshot of each shard in a collection for backing up to remote storage (S3)
• Framework for building complex tools that benefit from having access to cluster state information in ZooKeeper
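
The tool itself is Java/SolrJ, but the healthcheck idea is easy to see with kazoo: a replica is suspect if its state isn't "active" or its node is missing from /live_nodes (the collection name below is a placeholder):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

live = set(zk.get_children("/live_nodes"))
data, stat = zk.get("/clusterstate.json")
state = json.loads(data)
for shard, info in state["collection1"]["shards"].items():
    for name, replica in info["replicas"].items():
        ok = replica["state"] == "active" and replica["node_name"] in live
        print(shard, name, "OK" if ok else "SUSPECT")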
SiLK Integration
• SiLK: Solr integrated with Logstash and Kibana
– Index time-series data, such as log data (collectd, Solr logs, ...)
– Build cool dashboards with Banana (fork of Kibana)
• Easily aggregate all WARN and more severe log messages from all Solr servers into logstash4solr
• Send collectd metrics to logstash4solr
SiLK Integration (dashboard screenshot)
What’s Next?
• Migrate to Apache libcloud instead of using boto directly
• Benchmark mixed workloads (queries and indexing)
• SiLK is improving rapidly!
• Chaos-monkey tests
– integrate Jepsen?
• Open source, so please kick the tires!
Wrap-up
• Solr Scale Toolkit: https://github.com/LucidWorks/solr-scale-tk
• LucidWorks: http://www.lucidworks.com
• SiLK: http://www.lucidworks.com/lucidworks-silk/
• Solr In Action: http://www.manning.com/grainger/
• Connect: @thelabdude / tim.potter@lucidworks.com
Questions?


Editor's Notes

  • #5: Yes, it does, but we need to start somewhere.
  • #8: More shards == better indexing throughput (if your servers can handle it).
  • #9: Replication is only as fast as your slowest replica, and replicas have to re-analyze documents. Future: would like to send pre-analyzed docs between leader and replica, especially if text analysis is complex.
  • #10: Not much capacity is left for running queries on this node.
  • #12: In the first couple of passes, I either had really bad performance (too random) or really fast performance (not random enough).
  • #14: Make it very easy to launch and manage SolrCloud clusters in Amazon, from 1 node to 100’s. The user has basic control over instance type, # of instances, # of nodes per instance, and the ZooKeeper ensemble, and doesn’t have to care about how to start each Solr node, how to connect it with ZooKeeper, host names / IPs, etc.
  • #16: A custom AMI that we just need to “start” stuff on is preferred to configuring a barebones instance each time.