Benchmarking Solr Performance at Scale

Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.

About Me
• Lucene/Solr committer. Work for Lucidworks; focus on hardening SolrCloud, devops, and big data architecture / deployments
• Operated a smallish cluster in AWS for Dachis Group (1.5 years ago: 18 shards, ~900M docs)
• Solr Scale Toolkit: Fabric/boto framework for deploying and managing clusters in EC2
• Co-author of Solr in Action with Trey Grainger
Agenda
1. Quick review of the SolrCloud architecture
2. Indexing & query performance tests
3. Solr Scale Toolkit (quick overview)
4. Q & A
Solr in the wild …
https://twitter.com/bretthoerner/status/476830302430437376
SolrCloud distilled
A subset of optional features in Solr that enable and simplify horizontal scaling of a search index using sharding and replication.
Goals: performance, scalability, high availability, simplicity, elasticity, and community-driven development!
Collection == distributed index
A collection is a distributed index defined by:
• named configuration stored in ZooKeeper
• number of shards: documents are distributed across N partitions of the index
• document routing strategy: how documents get assigned to shards
• replication factor: how many copies of each document in the collection
Collections API:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=logstash4solr&replicationFactor=2&numShards=2&collection.configName=logs"
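The same CREATE call can be issued from SolrJ. A minimal sketch, assuming a recent SolrJ release (where CollectionAdminRequest.createCollection and CloudSolrClient.Builder are available); the ZooKeeper address is a placeholder:

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
  public static void main(String[] args) throws Exception {
    // Connect via ZooKeeper so the client reads cluster state directly
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        List.of("localhost:2181"), Optional.empty()).build()) {
      // Same parameters as the curl example: "logs" config, 2 shards, 2 replicas
      CollectionAdminRequest.createCollection("logstash4solr", "logs", 2, 2)
          .process(client);
    }
  }
}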
SolrCloud High-level Architecture
ZooKeeper
• Is a very good thing ... clusters are a zoo!
• Centralized configuration management
• Cluster state management
• Leader election (shard leader and overseer)
• Overseer distributed work queue
• Live Nodes
  • Ephemeral znodes used to signal a server is gone
• Needs at least 3 nodes for quorum in production
ZooKeeper: State Management
• Keep track of live nodes in the /live_nodes znode
  • ephemeral nodes
  • ZooKeeper client timeout
• Collection metadata and replica state in /clusterstate.json
• Every Solr node has watchers for /live_nodes and /clusterstate.json
• Leader election
  • ZooKeeper sequence numbers on ephemeral znodes
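You can inspect this state with the plain ZooKeeper Java client. A minimal sketch (the connection string and timeout are placeholder values, and the watcher just logs changes):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodes {
  public static void main(String[] args) throws Exception {
    // 15000 ms mirrors the ZooKeeper client timeout discussed later
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
    // Each child is an ephemeral znode registered by a live Solr node;
    // the watcher fires when a node joins or its session expires
    List<String> live = zk.getChildren("/live_nodes",
        event -> System.out.println("live_nodes changed: " + event.getType()));
    live.forEach(node -> System.out.println("live: " + node));
    zk.close();
  }
}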
Scalability Highlights
• No split-brain problems (b/c of ZooKeeper)
• All nodes in the cluster perform indexing and execute queries; no master node
• Distributed indexing: no SPoF, high throughput via direct updates to leaders (see the sketch below), automated failover to a new leader
• Distributed queries: add replicas to scale out QPS; parallelize complex query computations; fault tolerance
• Indexing / queries continue so long as there is 1 healthy replica per shard
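On the client side, direct updates to leaders come for free with SolrJ's CloudSolrClient, which hashes each document's id against the shard ranges in cluster state. A minimal sketch (collection name, field names, and ZooKeeper address are placeholders):

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectUpdates {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        List.of("localhost:2181"), Optional.empty()).build()) {
      client.setDefaultCollection("logstash4solr");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("text_en", "hello solrcloud");
      // The client routes the doc straight to its shard leader,
      // skipping the extra hop a plain load balancer would add
      client.add(doc);
      client.commit();
    }
  }
}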
Cluster sizing
How many servers do I need to index X docs? ... shards ... ? ... replicas ... ?
I need N queries per second over M docs; how many servers do I need?
It depends!
Testing Methodology
• Transparent, repeatable results
• Ideally hoping for something owned by the community
• Synthetic docs, ~1K each on disk, mix of field types
  • Data set created using code borrowed from PigMix
  • English text fields generated using a Zipfian distribution (see the sketch below)
• Java 1.7u67, Amazon Linux, r3.2xlarge nodes
  • enhanced networking enabled, placement group, same AZ
• Stock Solr (cloud) 4.10
  • Using custom GC tuning parameters and auto-commit settings
• Use Elastic MapReduce to generate indexing load
  • As many nodes as I need to drive Solr!
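To illustrate the Zipfian idea (this is not the actual PigMix code; vocabulary size and skew below are made-up parameters), a sampler that picks term ranks with probability proportional to 1/rank^s might look like:

import java.util.Random;

public class ZipfianSampler {
  private final double[] cdf;              // cumulative distribution over term ranks
  private final Random rng = new Random(42);

  ZipfianSampler(int vocabSize, double skew) {
    cdf = new double[vocabSize];
    double norm = 0;
    for (int rank = 1; rank <= vocabSize; rank++) {
      norm += 1.0 / Math.pow(rank, skew);  // Zipf: P(rank) proportional to 1/rank^skew
    }
    double running = 0;
    for (int rank = 1; rank <= vocabSize; rank++) {
      running += (1.0 / Math.pow(rank, skew)) / norm;
      cdf[rank - 1] = running;
    }
  }

  int nextRank() {
    double u = rng.nextDouble();
    for (int i = 0; i < cdf.length; i++) {
      if (u <= cdf[i]) return i + 1;       // rank 1 is the most frequent term
    }
    return cdf.length;
  }

  public static void main(String[] args) {
    ZipfianSampler terms = new ZipfianSampler(10_000, 1.0);
    for (int i = 0; i < 5; i++) {
      System.out.println("sampled term rank: " + terms.nextRank());
    }
  }
}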
Indexing Performance
Each run indexes the same data set; multiplying docs/sec by the elapsed time puts it at roughly 130M docs per run.

Cluster Size | # of Shards | # of Replicas | Reducers | Time (secs) | Docs / sec
10 | 10 | 1 | 48 | 1762 | 73,780
10 | 10 | 2 | 34 | 3727 | 34,881
10 | 20 | 1 | 48 | 1282 | 101,404
10 | 20 | 2 | 34 | 3207 | 40,536
10 | 30 | 1 | 72 | 1070 | 121,495
10 | 30 | 2 | 60 | 3159 | 41,152
15 | 15 | 1 | 60 | 1106 | 117,541
15 | 15 | 2 | 42 | 2465 | 52,738
15 | 30 | 1 | 60 | 827 | 157,195
15 | 30 | 2 | 42 | 2129 | 61,062
Visualize Server Performance

Direct Updates to Leaders

Replication
Indexing Performance Lessons
• Solr has no built-in throttling support: it will accept work until it falls over, so build throttling into your indexing application logic (see the sketch below)
• Oversharding helps parallelize indexing work and gives you an easy way to add more hardware to your cluster
• GC tuning is critical (more below)
• Auto hard-commit to keep transaction logs manageable
• Auto soft-commit to see docs as they are indexed
• Replication is expensive! (more work needed here)
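A minimal sketch of one way to add that throttling on the client side, using a bounded permit pool shared by producer threads (the permit count is an assumption, not a value from the talk):

import java.util.Collection;
import java.util.concurrent.Semaphore;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ThrottledIndexer {
  // Cap concurrent in-flight batches so a saturated cluster
  // pushes back on producers instead of falling over
  private final Semaphore inFlight = new Semaphore(4);  // assumed limit
  private final CloudSolrClient client;

  ThrottledIndexer(CloudSolrClient client) {
    this.client = client;
  }

  void index(Collection<SolrInputDocument> batch) throws Exception {
    inFlight.acquire();        // blocks when too many batches are outstanding
    try {
      client.add(batch);       // batched adds amortize per-request overhead
    } finally {
      inFlight.release();
    }
  }
}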
GC Tuning
• Stop-the-world GC pauses can lead to ZooKeeper session expiration (which is bad)
• More JVMs with smaller heap sizes are better! (12-16GB max per JVM; less if you can)
• MMapDirectory relies on sufficient memory being available to the OS cache (off-heap)
• GC activity during Solr indexing is stable and generally doesn't cause any stop-the-world collections ... queries are a different story
• Enable verbose GC logging (even in prod) so you can troubleshoot issues:
  -verbose:gc -Xloggc:gc.log -XX:+PrintHeapAtGC -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
  -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
GC Flags I use with Solr
-Xss256k
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:MaxTenuringThreshold=8 -XX:NewRatio=3
-XX:CMSInitiatingOccupancyFraction=40 -XX:ConcGCThreads=4
-XX:ParallelGCThreads=4 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
-XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=12m
-XX:CMSFullGCsBeforeCompaction=1 -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSTriggerPermRatio=80 -XX:CMSMaxAbortablePrecleanTime=6000
-XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled
-XX:+UseLargePages -XX:+AggressiveOpts
Sizing GC Spaces
http://kumarsoablog.blogspot.com/2013/02/jvm-parameter-survivorratio_7.html
Query Performance
• Still a work in progress!
• Measuring sustained QPS & execution time at the 99th percentile (Coda Hale Metrics is good for this)
• Stable: ~5,000 QPS / 99th percentile at 300ms while indexing ~10,000 docs / sec
• Using the TermsComponent to build queries based on the terms in each field (see the sketch below)
• Harder to accurately simulate user queries over synthetic data
  • Need a mix of faceting, paging, sorting, grouping, boolean clauses, range queries, boosting, filters (some cached, some not), etc.
  • Does the randomness in your test queries model (expected) user behavior?
• Start with one server (1 shard) to determine baseline query performance
  • Look for inefficiencies in your schema and other config settings
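A minimal SolrJ sketch of pulling top terms from a field to seed generated queries (the field and collection names are placeholders, and it assumes the stock /terms handler is enabled):

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TermsSampler {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        List.of("localhost:2181"), Optional.empty()).build()) {
      SolrQuery q = new SolrQuery();
      q.setRequestHandler("/terms");   // TermsComponent endpoint
      q.setTerms(true);
      q.addTermsField("text_en");      // placeholder field name
      q.setTermsLimit(100);            // top 100 terms by doc frequency
      QueryResponse rsp = client.query("logstash4solr", q);
      for (TermsResponse.Term t : rsp.getTermsResponse().getTerms("text_en")) {
        System.out.println(t.getTerm() + " (" + t.getFrequency() + ")");
      }
    }
  }
}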
Query Performance, cont.
• Higher risk of full GC pauses (facets, filters, sorting)
• Use optimized data structures (DocValues) for facet / sort fields, Trie-based numeric fields for range queries, and facet.method=enum for low-cardinality fields
• Check sizing of caches, especially the filterCache, in solrconfig.xml
• Add more replicas; load-balance; Solr can set HTTP headers to work with caching proxies like Squid
• -Dhttp.maxConnections=## (default = 5; increase to accommodate more threads sending queries)
• Avoid increasing the ZooKeeper client timeout; ~15000 ms (15 seconds) is about right
• Don't just keep throwing more memory at Java! -Xmx128G
Call me maybe - Jepsen
https://github.com/aphyr/jepsen
• Solr tests being developed by Lucene/Solr committer Shalin Mangar (@shalinmangar)
• Prototype in place:
  • No ack'd writes were lost!
  • No un-ack'd writes succeeded
See: https://github.com/LucidWorks/jepsen/tree/solr-jepsen
Solr Scale Toolkit
• Open source: https://github.com/LucidWorks/solr-scale-tk
• Fabric (Python) toolset for deploying and managing SolrCloud clusters in the cloud
• Code to support benchmark tests (Pig script for data generation / indexing, JMeter samplers)
• EC2 for now, more cloud providers coming soon via Apache libcloud
• Contributors welcome!
• More info: http://searchhub.org/2014/06/03/introducing-the-solr-scale-toolkit/
Provisioning cluster nodes
fab new_ec2_instances:test1,n=3,instance_type=m3.xlarge
• Custom-built AMI (one for PV instances and one for HVM instances), based on Amazon Linux
• Block device mapping
  • dedicated disk per Solr node
• Launch, then poll status until the instances are live
  • verify SSH connectivity
• Tag each instance with a cluster ID and username
Deploy ZooKeeper ensemble
fab new_zk_ensemble:zk1,n=3
• Two options:
  • provision 1 to N nodes when you launch the Solr cluster
  • use an existing named ensemble
• The Fabric command simply creates the myid files and the zoo.cfg file for the ensemble
  • plus some cron scripts for managing snapshots
• Basic health check of ZooKeeper status: echo srvr | nc localhost 2181
Deploy SolrCloud cluster
fab new_solrcloud:test1,zk=zk1,nodesPerHost=2
• Uses bin/solr in Solr 4.10 to control Solr nodes
• Sets system props: jetty.port, host, zkHost, JVM opts
• One or more Solr nodes per machine
  • JVM memory opts depend on the instance type and # of Solr nodes per instance
• Optionally configure log4j.properties to append messages to RabbitMQ for SiLK integration
Automate day-to-day cluster management tasks
• Deploy a configuration directory to ZooKeeper
• Create a new collection
• Attach a local JConsole/VisualVM to a remote JVM
• Rolling restart (with Overseer awareness)
• Build Solr locally and patch remote
  • Use a relay server to scp the JARs to the Amazon network once, then scp them to other nodes from within the network
• Put/get files
• Grep over all log files (across the cluster)
Wrap-up and Q & A
• LucidWorks: http://www.lucidworks.com -- We're hiring!
• Solr Scale Toolkit: https://github.com/LucidWorks/solr-scale-tk
• SiLK: http://www.lucidworks.com/lucidworks-silk/
• Solr in Action: http://www.manning.com/grainger/
• Connect: @thelabdude / tim.potter@lucidworks.com
