Introduction To SolrCloud 
Varun Thacker
Apache Solr has a huge install base and tremendous momentum
Solr is both established & growing
• The most widely used search solution on the planet
• 250,000+ monthly downloads
• 8M+ total downloads
• You use Solr every day.
• Solr has tens of thousands of applications in production.
• 2500+ open Solr jobs.
Activity Summary 
30 Day summary 
Aug 18 - Sep 17 2014 
• 128 Commits 
• 18 Contributors 
12 Month Summary 
Sep 17, 2013 - Sep 17, 2014 
• 1351 Commits 
• 29 Contributors 
via https://www.openhub.net/p/solr
Solr scalability is unmatched. 
• 10TB+ Index Size 
• 10 Billion+ Documents 
• 100 Million+ Daily Requests
What is Solr? 
• A system built to search text 
• A specialized type of database management system
• A platform to build search applications on 
• Customizable, open source software
Where does Solr fit?
What is SolrCloud? 
A subset of optional features in Solr that enable and simplify horizontal scaling of a search index using sharding and replication.
Goals 
scalability, performance, high-availability, simplicity, and elasticity
Terminology 
• ZooKeeper: Distributed coordination service that provides centralised configuration, cluster state management, and leader election
• Node: JVM process bound to a specific port on a machine
• Collection: Search index distributed across multiple nodes with the same configuration
• Shard: Logical slice of a collection; each shard has a name, hash range, leader and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy
• Replica: A copy of a shard in a collection
• Overseer: A special node that executes cluster administration commands and writes updated state to ZooKeeper. Automatic failover and leader election.
Collection == Distributed Index 
• A collection is a distributed index defined by: 
• named configuration stored in ZooKeeper 
• number of shards: documents are distributed across N partitions of the index 
• document routing strategy: how documents get assigned to shards 
• replication factor: how many copies of each document in the collection 
• Collections API:
• curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=punemeetup&replicationFactor=2&numShards=2&collection.configName=myconf"
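The configuration referenced by collection.configName has to be in ZooKeeper before CREATE is called. A minimal sketch of that step, assuming the zkcli.sh helper shipped with Solr 4.x; the script path, config directory, and ZooKeeper address are placeholders for your install:

# Upload a local config directory to ZooKeeper under the name "myconf"
example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd upconfig -confdir example/solr/collection1/conf -confname myconf

# Sanity check after CREATE: list the collections the cluster knows about
curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"

The LIST action is available in newer 4.x releases; on older clusters you can inspect /clusterstate.json instead (see the ZooKeeper slides later).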
DEMO
Document Routing 
• Each shard covers a hash-range 
• Default: Hash ID into 32-bit integer, map to range 
• leads to (roughly) balanced shards
• Custom-hashing
• Tri-level: app!user!doc (sketch below)
• Implicit: no hash-range set for shards
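To make the tri-level form concrete, here is a hedged sketch against the punemeetup collection created earlier, using the default compositeId router: IDs that share a prefix hash to the same shard, and the same prefix can be passed at query time via the _route_ parameter. The IDs below are made up.

# Index two docs whose "app1!user1!" prefix keys the hash
curl 'http://localhost:8983/solr/punemeetup/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"app1!user1!doc1"},{"id":"app1!user1!doc2"}]'

# Query only the shard(s) that hold that prefix
curl 'http://localhost:8983/solr/punemeetup/select?q=*:*&_route_=app1!user1!'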
Replication 
• Why replicate?
• High-availability
• Load balancing (sketch after this list)
• How does it work in SolrCloud? 
• Near-real-time, not master-slave 
• Leader forwards to replicas in parallel, waits for response 
• Error handling during indexing is tricky
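One concrete way to add a copy for either goal is the ADDREPLICA action of the Collections API, available in newer 4.x releases; a hedged sketch with placeholder collection, shard, and node names:

# Add another replica of shard1 on a specific node (node name format: host:port_solr)
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=punemeetup&shard=shard1&node=192.168.1.12:8983_solr"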
Distributed Indexing 
• Get cluster state from ZK 
• Route document directly to leader (hash on doc ID) 
• Persist document on durable storage (tlog) 
• Forward to healthy replicas 
• Acknowledge write success to the client (example below)
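From the client's point of view this whole pipeline is a single, ordinary update request; a minimal sketch (any node of the cluster will do, and the field names assume your schema defines them):

# Send a JSON document to any node; SolrCloud hashes the ID, routes it to the
# shard leader, which logs it to its tlog and forwards it to the replicas
# before acknowledging the write
curl 'http://localhost:7574/solr/punemeetup/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc42","title":"hello solrcloud"}]'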
Shard Leader 
• Additional responsibilities during indexing only! Not a master node
• Leader is a replica (handles queries)
• Accepts update requests for the shard
• Increments the _version_ on the new or updated doc (query sketch below)
• Sends updates (in parallel) to all replicas
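The _version_ the leader assigns is stored with the document and can be inspected with an ordinary field list; a small sketch against the punemeetup collection:

# Return the internal version alongside the id for a few documents
curl 'http://localhost:8983/solr/punemeetup/select?q=*:*&fl=id,_version_&rows=3&wt=json'

This is the same value that optimistic locking (covered later) compares against.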
Distributed Queries 
• Query client can be ZK-aware or just query via a load balancer
• Client can send the query to any node in the cluster
• That node acts as the controller: it distributes the query to one replica of each shard to identify the documents matching the query
• The controller merges and sorts the per-shard results, then issues a second query to fetch all requested fields for the page of results
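From the outside both phases look like one request. A hedged sketch: shards.info=true asks Solr to report which shard replicas served the query, and distrib=false hits a single core for debugging (the core name follows the default collection_shardN_replicaN convention and will differ in your cluster):

# Ordinary paged query; the receiving node fans it out to one replica per shard
curl 'http://localhost:8983/solr/punemeetup/select?q=*:*&start=0&rows=10&shards.info=true&wt=json'

# Debugging: query a single core only, skipping the distributed phase
curl 'http://localhost:8983/solr/punemeetup_shard1_replica1/select?q=*:*&distrib=false'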
Scalability / Stability Highlights 
• All nodes in cluster perform indexing and execute 
queries; no master node 
• Distributed indexing: No SPoF, high throughput via 
direct updates to leaders, automated failover to new 
leader 
• Distributed queries: Add replicas to scale-out qps; 
parallelize complex query computations; fault-tolerance 
• Indexing / queries continue so long as there is 1 healthy 
replica per shard
ZooKeeper
• Is a very good thing ... clusters are a zoo! 
• Centralized configuration management 
• Cluster state management 
• Leader election (shard leader and overseer) 
• Overseer distributed work queue 
• Live Nodes 
• Ephemeral znodes used to signal a server is gone 
• Needs 3 nodes for quorum in production
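For context, a SolrCloud node only needs the ensemble address at startup; a hedged sketch using the Solr 4.x jetty launcher, with a placeholder three-node ensemble and a /solr chroot:

# Start a Solr node against a 3-node ZooKeeper ensemble
cd example
java -DzkHost=zk1:2181,zk2:2181,zk3:2181/solr -jar start.jar

Newer releases also ship a bin/solr start -c -z ... wrapper that does the same thing.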
ZooKeeper: State Management
• Keep track of live nodes under the /live_nodes znode
• ephemeral znodes
• ZooKeeper client timeout
• Collection metadata and replica state in /clusterstate.json
• Every core has watchers for /live_nodes and /clusterstate.json
• Leader election 
• ZooKeeper sequence number on ephemeral znodes
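You can read this state yourself, either straight from ZooKeeper with its zkCli.sh client or from Solr via the Collections API CLUSTERSTATUS action (available in newer 4.x releases); a small sketch:

# Read the raw cluster state from ZooKeeper
zkCli.sh -server localhost:2181 get /clusterstate.json

# Or ask Solr for the same view over HTTP
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json"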
Other Features/Highlights 
• Near-Real-Time Search: Documents are visible within a second or so after being indexed
• Partial Document Update: Just update the fields you need to change on existing documents (example after this list)
• Optimistic Locking: Ensure updates are applied to the correct version of a document
• Transaction log: Better recoverability; peer-sync between nodes after hiccups
• HTTPS
• Use HDFS for storing indexes
• Use MapReduce for building the index (SOLR-1301)
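Partial document updates and optimistic locking both go through the normal update endpoint; a hedged sketch (the price field is a placeholder, and the _version_ value must come from a prior read of the document, e.g. the fl=id,_version_ query shown earlier):

# Atomic update: change only the "price" field of an existing document
curl 'http://localhost:8983/solr/punemeetup/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc42","price":{"set":9.99}}]'

# Optimistic locking: rejected with HTTP 409 if the stored _version_
# no longer matches the one supplied here
curl 'http://localhost:8983/solr/punemeetup/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc42","price":{"set":10.99},"_version_":1480201441823553536}]'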
Solr on YARN
Solr on YARN 
• Run the SolrClient application:
• Allocate a container to run the SolrMaster
• SolrMaster requests containers to run SolrCloud nodes
• Solr containers are allocated across the cluster
• SolrCloud nodes connect to ZooKeeper
More Information on Solr on YARN 
• https://lucidworks.com/blog/solr-yarn/ 
• https://github.com/LucidWorks/yarn-proto 
• https://issues.apache.org/jira/browse/SOLR-6743
Our users are also pushing the limits 
https://twitter.com/bretthoerner/status/476830302430437376
Up, up and away! 
https://twitter.com/bretthoerner/status/476838275106091008
Connect @ 
https://twitter.com/varunthacker 
http://in.linkedin.com/in/varunthacker 
varun.thacker@lucidworks.com