How to make a simple, cheap, high-availability, self-healing Solr cluster
Presented by Stephane Gamard, Chief Technology Officer, Searchbox
In this presentation we aim to show how to build a high-availability SolrCloud cluster on Solr 4.1 using only Solr and a few bash scripts. The goal is to present a self-healing infrastructure built on cheap instances with ephemeral storage. We start with a comprehensive overview of the relationship between collections, Solr cores, shards, and cluster nodes. We continue with an introduction to Solr 4.x clustering using ZooKeeper, with particular emphasis on cluster state monitoring and Solr collection configuration. The core of our presentation will be demonstrated on a live cluster.
We will show how to use cron and bash to monitor the state of the cluster and of its nodes. We will then show how to extend this monitoring to automatically spawn new nodes, attach them to the cluster, and assign them shards (choosing between covering missing shards or adding replicas for HA). We will show that with a high replication factor it is possible to keep shards on ephemeral storage without risk of data loss, greatly reducing the cost and management burden of the architecture. Future work, which might be pursued as an open source effort, includes monitoring the activity of individual nodes so as to scale the cluster according to traffic and usage.
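The cron-and-bash monitoring described above can be sketched as a small script. A minimal sketch, assuming each node answers Solr's standard /solr/admin/ping handler; the node-list file name and the hosts are hypothetical:

```shell
#!/usr/bin/env bash
# Health-check sketch for a cron job. Reads "host:port" lines on stdin
# and prints every node whose Solr ping handler does not answer, so a
# wrapper script can react (alert, respawn, reassign shards).
down_nodes() {
  while read -r node; do
    # -s silent, -f fail on HTTP errors, short timeout so cron runs stay fast
    if ! curl -sf --max-time 5 "http://${node}/solr/admin/ping" >/dev/null 2>&1; then
      echo "${node}"
    fi
  done
}

# Example crontab entry (nodes.txt and the mail target are assumptions):
#   */5 * * * * down_nodes < /etc/solr/nodes.txt | mail -s "solr down" ops@example.com
```

The function only reports; keeping detection separate from remediation makes it easy to test each half on its own.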
Lucene Revolution 2013

What is SolrCloud? Automatic Routing
[Diagram: load balancers in front of Solr nodes, with node monitoring coordinated through a ZooKeeper ensemble]
• Smart clients connect to ZooKeeper
• Any node can forward a request to a node that can process it
What is SolrCloud? Collection API
• Abstraction level
• An index is a collection
• A collection is a set of shards
• A shard is a set of cores
• CRUD API for collections
"Collections represent a set of cores with identical configuration. The set of cores of a collection covers the entire index."
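The CRUD API in the last bullet is exercised with plain HTTP in Solr 4.x. A minimal sketch that builds a Collections API CREATE request; the host, collection name, shard count, and replication factor are assumptions chosen for illustration:

```shell
# Collections API CREATE request (Solr 4.x); host and names are assumptions.
SOLR="http://localhost:8983/solr"
url="${SOLR}/admin/collections?action=CREATE&name=articles&numShards=2&replicationFactor=3"
echo "${url}"
# Against a live cluster:
#   curl -s "${url}"
# DELETE is symmetric:
#   curl -s "${SOLR}/admin/collections?action=DELETE&name=articles"
```

Note that numShards and replicationFactor here are exactly the two scaling knobs the next slide names.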
What is SolrCloud?
• Collection: abstraction level of interaction & configuration
• Shard: scaling factor for collection size (numShards)
• Core: scaling factor for QPS (replicationFactor)
• Node: scaling factor for cluster size (liveNodes)
=> SolrCloud is highly geared toward horizontal scaling
SolrCloud - Core Sizing
Heuristically inferred from "experience":
• Size on the shard, not the collection
• Do NOT starve resources on nodes
• Settle on a JVM/disk sizing
• Keep a large amount of spare disk (optimize)

RAM  | Disk
3 GB | 60 GB
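The "spare disk (optimize)" bullet reflects the fact that a Lucene optimize (forced merge) can transiently need roughly twice the index size on disk while old and new segments coexist. A minimal pre-flight check sketch; the 2x rule of thumb and the function name are assumptions, not part of the talk:

```shell
# Prints "ok" when free disk can absorb an optimize of the given index,
# using the rule of thumb that a forced merge may transiently need
# ~2x the index size. Usage: optimize_headroom INDEX_GB FREE_GB
optimize_headroom() {
  local index_gb=$1 free_gb=$2
  if [ "${free_gb}" -ge $(( 2 * index_gb )) ]; then
    echo ok
  else
    echo insufficient
  fi
}
```

With the 60 GB shard from the sizing table above, this check asks for at least 120 GB free before optimizing.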
SolrCloud - Provisioning
Stand-by nodes
• Automatically assigned as replicas
• Provide a metric of HA
Node addition (self-healing)
• Scheduled check on cluster congestion
• Automatically spawn new nodes as needed
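The scheduled self-healing check above boils down to comparing live nodes against a target size and spawning the difference. A minimal sketch of the decision step; the target count and the spawn command are assumptions (a Solr 4.x node joins the cluster simply by being started with -DzkHost pointing at the ZooKeeper ensemble):

```shell
# How many nodes must be spawned to reach the desired cluster size?
need_nodes() {
  local live=$1 desired=$2
  if [ "${live}" -ge "${desired}" ]; then
    echo 0
  else
    echo $(( desired - live ))
  fi
}

# Each spawned node attaches itself to the cluster via ZooKeeper, e.g.:
#   java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar
# (the ensemble addresses are placeholders)
```

Keeping the arithmetic in its own function lets the cron wrapper decide separately how to obtain instances (cloud API, stand-by pool, etc.).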
SolrCloud - Conclusion
Using SolrCloud is like juggling:
• Gets better with practice
• There is always some magic left
• Can become very overwhelming
• When it fails, you lose your balls
Test -> Test -> Test -> some more tests -> Test
Next Steps
What would make our current SolrCloud cluster even more awesome:
• Balance/distribute cores based on machine load
• Standby cores (replicas not serving requests, auto-shutting down)
CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets you in the door

TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30

CONTACT
Stephane Gamard
stephane.email@example.com