Scaling SolrCloud to a Large Number of Collections
Shalin Shekhar Mangar, Lucidworks Inc.
Presented at Lucene/Solr Revolution 2014

1. Scaling SolrCloud to a Large Number of Collections
Shalin Shekhar Mangar, Lucidworks Inc.
2. Apache Solr has tremendous momentum
• Solr is both established and growing: 8M+ total downloads, 250,000+ monthly downloads
• Largest community of developers; 2,500+ open Solr jobs
• The most widely used search solution on the planet; Solr has tens of thousands of applications in production
• Lucene/Solr Revolution is the world's largest open source user conference dedicated to Lucene/Solr
• You use Solr every day
3. Solr Scalability is unmatched
4. The traditional search use case
• One large index distributed across multiple nodes
• A large number of users searching the same data
• Searches happen across the entire cluster
• Example: an eCommerce product catalogue
5. “The limits of the possible can only be defined by going beyond them into the impossible.” —Arthur C. Clarke
6. Analyze, measure and optimize
• Analyze and find missing features
• Set up a performance testing environment on AWS
• Devise tests for stability and performance
• Find and fix bugs and bottlenecks
7. Problem #1: Cluster state and updates
• The SolrCloud cluster state has information about all collections, their shards and replicas
• All nodes and (Java) clients watch the cluster state
• Every state change is broadcast to all nodes
• The state node is limited to (slightly less than) 1 MB by default
• In a 100-node cluster, a single node restart triggers a few hundred watcher fires and pulls from ZooKeeper (each replica passes through three states: down, recovering and active)
8. Solution: Split cluster state and scale
• Each collection gets its own state node in ZooKeeper
• Nodes selectively watch only the states of collections they host
• Clients cache state and use smart cache updates instead of watching nodes
• http://issues.apache.org/jira/browse/SOLR-5473
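The client-side caching idea above can be sketched as follows. This is a hypothetical simulation of the approach, not the SolrJ implementation; `fetch_state` stands in for a ZooKeeper read of a per-collection state node.

```python
class CollectionStateCache:
    """Sketch of the SOLR-5473 client-side idea: cache per-collection
    state and refresh lazily, instead of keeping a ZooKeeper watch on
    one global cluster-state node that fires on every change."""

    def __init__(self, fetch_state):
        self._fetch = fetch_state  # stand-in for reading /collections/<name>/state.json
        self._cache = {}           # collection name -> state dict

    def get(self, collection):
        # First access fetches; later accesses are served from the cache.
        if collection not in self._cache:
            self._cache[collection] = self._fetch(collection)
        return self._cache[collection]

    def on_stale_state_error(self, collection):
        # A request that fails because the cached state was stale forces
        # a refresh; collections this client never touches cost nothing.
        self._cache[collection] = self._fetch(collection)
        return self._cache[collection]
```

The design point: watch traffic scales with what each client actually uses, rather than with the total number of collections in the cluster.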
9. Problem #2: Overseer performance
• Thousands of collections create a lot of state updates
• The Overseer falls behind, and replicas can't recover or can't elect a leader
• Under high indexing/search load, GC pauses can cause the Overseer queue to back up
10. Solution: Improve the Overseer
• Optimized polling for new items in the Overseer queue (SOLR-5436)
• Dedicated Overseer nodes (SOLR-5476)
• New Overseer status API (SOLR-5749)
• Asynchronous execution of collection commands (SOLR-5477, SOLR-5681)
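One way a queue consumer like the Overseer can be made to keep up is to drain items in batches and coalesce updates for the same replica, rather than paying one round trip per item. This sketch illustrates that general idea; it is my illustration of batched polling, not the actual SOLR-5436 patch.

```python
from collections import deque

def drain_batch(queue, max_items=100):
    """Drain up to max_items state updates in one pass and keep only the
    latest update per (collection, replica) key, so a burst of queued
    transitions (down -> recovering -> active) collapses to one write."""
    batch = []
    while queue and len(batch) < max_items:
        batch.append(queue.popleft())
    latest = {}
    for update in batch:
        latest[(update["collection"], update["replica"])] = update
    return list(latest.values())
```

With thousands of collections publishing state, coalescing like this turns a long backlog of redundant transitions into a handful of effective writes.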
11. Problem #3: Moving data around
• Not all users are born equal: a tenant may have a few very large users
• We wanted to be able to scale an individual user's data, maybe even as its own collection
• SolrCloud can split shards with no downtime, but it only splits in half
• No way to 'extract' a user's data to another collection or shard
12. Solution: Improved data management
• Shards can be split on arbitrary hash ranges (SOLR-5300)
• Shards can be split by a given key (SOLR-5338, SOLR-5353)
• A new 'migrate' API moves a user's data to another (new) collection without downtime (SOLR-5308)
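The hash-range mechanics behind splitting can be sketched as follows. Solr's compositeId router hashes document ids with MurmurHash3; the hash function below is a stand-in (any stable 32-bit hash illustrates the routing idea), and the shard layout is hypothetical.

```python
import hashlib

def doc_hash(doc_id):
    # Stand-in for Solr's MurmurHash3-based id hash: a stable value
    # in the signed 32-bit range [-2**31, 2**31).
    raw = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:4], "big")
    return raw - 2**31

def route(doc_id, shards):
    """shards: list of (name, lo, hi) hash ranges covering [-2**31, 2**31)."""
    h = doc_hash(doc_id)
    for name, lo, hi in shards:
        if lo <= h <= hi:
            return name
    raise ValueError("no shard covers hash %d" % h)

def split_range(lo, hi, at):
    """Split one shard's hash range at an arbitrary point (the SOLR-5300
    idea): the children cover [lo, at] and [at + 1, hi], so together they
    own exactly the documents the parent owned."""
    assert lo <= at < hi
    return (lo, at), (at + 1, hi)
```

Splitting by a given key is the same operation with `at` chosen so that one child's range contains exactly that key's hash slice, which is what makes it possible to carve one large user out of a shared shard.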
13. Problem #4: Exporting data
• Lucene/Solr is designed for finding the top-N search results
• Trying to export a full result set brings down the system, because memory requirements grow the deeper you page
14. Solution: Distributed deep paging
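The core trick of deep paging can be simulated in a few lines. With offset paging (`start=N`), every node must sort and skip N documents per request, so memory grows with depth; with a cursor, each request returns documents strictly after the last `(sort value, id)` pair seen, so memory stays bounded at the page size. Solr later exposed this idea as cursorMark; the sketch below simulates the logic only, not its wire format.

```python
def cursor_page(docs, cursor, rows):
    """One page of cursor-style deep paging over docs pre-sorted by the
    total order (score descending, id ascending). Returns the page and
    the cursor to pass into the next call."""
    if cursor is not None:
        # Keep only documents strictly after the cursor position.
        docs = [d for d in docs if (-d["score"], d["id"]) > cursor]
    page = docs[:rows]
    next_cursor = (-page[-1]["score"], page[-1]["id"]) if page else cursor
    return page, next_cursor
```

Because the sort includes the unique id as a tiebreaker, the cursor defines an unambiguous position, so a full export simply loops until an empty page comes back.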
15. Testing scale at scale
• Performance goals: 6 billion documents, 4,000 queries/sec, 400 updates/sec, 2-second NRT, sustained
• Collection mix: 5% large (50 shards), 15% medium (10 shards), 85% small (1 shard), all with replication factor 3
• Target hardware: 24 CPUs, 126 GB RAM, 7 SSDs (460 GB) + 1 HDD (200 GB)
• 80% of traffic served by 20% of the tenants
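It is worth working out what that collection mix implies in cores. Taking the stated percentages of 1,000 collections at face value (as listed they sum slightly over 100%, so treat the counts below as approximate):

```python
replication_factor = 3
nodes = 120

# (number of collections, shards each), from the slide's stated mix
mix = [
    (50, 50),   # ~5% large collections, 50 shards each
    (150, 10),  # ~15% medium, 10 shards each
    (850, 1),   # ~85% small, 1 shard each
]

total_shards = sum(n_coll * n_shards for n_coll, n_shards in mix)
total_cores = total_shards * replication_factor
cores_per_node = total_cores / nodes
print(total_shards, total_cores, cores_per_node)  # 4850 14550 121.25
```

On the order of 14,500 cores across 120 nodes, i.e. well over a hundred cores per node, which is why per-collection state and Overseer throughput dominated the problem list above.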
16. Test Infrastructure
17. Logging
18. How to manage large clusters?
• Tim Potter wrote the Solr Scale Toolkit
• A Fabric-based tool to set up and manage SolrCloud clusters in AWS, complete with collectd and SiLK
• Backup/restore from S3; parallel clone commands
• Open source! https://github.com/LucidWorks/solr-scale-tk
19. Gathering metrics and analyzing logs
• Lucidworks SiLK (Solr + Logstash + Kibana)
• collectd daemons on each host
• RabbitMQ queues messages before delivering them to Logstash
• Initially started with Kafka but discarded it, thinking it was overkill
• Not happy with RabbitMQ (crashes/unstable); might try Kafka again soon
• http://www.lucidworks.com/lucidworks-silk
20. Generating data and load
• Custom randomized data generator (reproducible using a seed)
• JMeter for generating load
• Embedded CloudSolrServer (the Solr Java client) using the JMeter Java Action Sampler
• JMeter's distributed mode was itself a bottleneck!
• Not open source (yet), but we're working on it!
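The "reproducible using a seed" point is the key property for debugging load tests: the same seed must always produce the same documents, so a failing run can be replayed exactly. A minimal sketch of that pattern, with made-up field names for illustration:

```python
import random

def generate_docs(seed, count):
    """Seeded, reproducible document generator: a private random.Random
    instance keyed by the seed makes output deterministic regardless of
    any other randomness in the process."""
    rng = random.Random(seed)
    docs = []
    for i in range(count):
        docs.append({
            "id": "doc-%d" % i,                       # stable ids
            "tenant": "tenant-%d" % rng.randint(0, 99),  # skewed-tenant modeling
            "body_len": rng.randint(50, 5000),           # randomized payload size
        })
    return docs
```

Using a dedicated `random.Random(seed)` rather than the module-level functions matters: it isolates the generator from anything else that touches the global random state, which is what keeps runs reproducible.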
21. Numbers
• 30 hosts, 120 nodes, 1,000 collections, 6B+ docs, 15,000 queries/second, 2,000 writes/second, 2-second NRT, sustained over 24 hours
• More than 3x the numbers we needed
• Unfortunately, we had to stop testing at that point :(
• Our biggest cluster cost us just $120/hour :)
22. Not over yet
• We continue to test performance at scale
• Published an indexing performance benchmark; working on others
• 15 nodes, 30 shards, 1 replica: 157,195 docs/sec
• 15 nodes, 30 shards, 2 replicas: 61,062 docs/sec
• http://searchhub.org/introducing-the-solr-scale-toolkit/
• Setting up an internal performance testing environment: Jenkins CI, Jepsen tests, single-node benchmarks, cloud tests
• Stay tuned!
23. Pushing the limits
24. Not over yet
• SolrCloud continues to be improved
• SOLR-6220: Replica placement strategy
• SOLR-6273: Cross data center replication
• SOLR-5656: Auto-add replicas on HDFS
• SOLR-5986: Don't allow runaway queries to harm the cluster
• SOLR-5750: Backup/restore API for SolrCloud
• Many, many more
25. Thank you!
shalin@apache.org
@shalinmangar