Best practices for highly available and large scale SolrCloud
Apache Lucene/Solr committer, PMC member
Search Guy @ IBM Watson
• Anshum Gupta, Apache Lucene/Solr committer
and PMC member, IBM Watson Search team.
• Interested in search and related stuff.
• Apache Lucene since 2006 and Solr since 2010.
• Organizations I am or have been a part of:
Apache Solr is the most widely-used search
solution on the planet.
Solr has tens of thousands of
applications in production,
applications you use every day.
Solr is both established and growing, with
open Solr jobs and one of the largest
communities of developers.
SolrCloud - Physical Architecture
[Diagram: Node 1, Node 2]
• Not just a config repo, but a lot more!
• No ZK = stale cluster state, no accepted updates, and more
• Watches & GC pauses!
Solr <> ZK interaction
• NEVER use embedded zk in production
• ZK ensemble - (2n + 1) nodes to tolerate n failures
• Use a ZK chroot, especially if sharing the ensemble
• Use an OOM hook - shipped with Solr
ZooKeeper best practices
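A minimal sketch of the last two points, assuming a standard Solr install and placeholder hosts zk1/zk2/zk3:

```shell
# Create a chroot for Solr in a (possibly shared) 3-node ensemble
# (2n + 1 with n = 1, so one ZK node can fail):
bin/solr zk mkroot /solr -z zk1:2181,zk2:2181,zk3:2181

# Start Solr in cloud mode against the external ensemble + chroot;
# never use the embedded ZooKeeper in production:
bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181/solr
```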
• Be frugal with watches - every watch on the ZK
server has a ~300 byte memory footprint
• ZK is not built for 1000’s of watchers on a single
znode. Break it down! e.g. per-collection cluster state
Also remember - for custom code
• Shard your data - It generally helps
• Sharding is (almost) = splitting your index into different, smaller indexes
• Use different nodes for replicas - replicas on the
same node add no availability
• Use a composite key or a custom router
• Distributed IDF - prefer sharding over separate collections
Sharding and Routing
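A sketch of sharding with the (default) compositeId router; collection and key names here are placeholders:

```shell
# Create a sharded, replicated collection:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=4&replicationFactor=2&router.name=compositeId"

# Composite routing key: the "tenantA!" prefix co-locates all of
# tenantA's documents on one shard:
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/products/update?commit=true" \
  -d '[{"id": "tenantA!doc1", "name_s": "first doc"}]'

# The same prefix in _route_ sends the query to just that shard:
curl "http://localhost:8983/solr/products/select?q=*:*&_route_=tenantA!"
```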
• Reuse the http and solr client
• Atomic updates - the whole document is re-indexed
under the hood; convenient but expensive
• Omit norms, term freq, and positions if you don’t
need them
Indexing best practices
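What an atomic update looks like on the wire (collection and field names are placeholders). Solr fetches the stored document, applies the change, and re-indexes the whole thing, which is why all fields must be stored (or docValues):

```shell
# "set" replaces a field value, "inc" increments one:
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycoll/update?commit=true" \
  -d '[{"id": "doc1", "price_i": {"set": 99}, "views_i": {"inc": 1}}]'
```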
• Replication Bandwidth limiting
• Think about what you want indexed vs stored
Other things to look at
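A Schema API sketch of indexed vs stored, plus the omit flags from the previous slide (collection and field names are placeholders):

```shell
# Index a field without storing it, and skip norms, term freq, and
# positions when you need neither length normalization nor phrase
# queries on it:
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycoll/schema" \
  -d '{"add-field": {"name": "tag_s", "type": "string",
        "indexed": true, "stored": false,
        "omitNorms": true, "omitTermFreqAndPositions": true}}'
```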
• Soft commits = visibility
• delay as much as you can
• Hard commits = durability
• initiate background merges if needed
• Only in times of desperation: updateLog config - syncLevel=fsync
Commits and transaction log
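One way to wire this up via the Config API (collection name and intervals are placeholders): frequent hard commits for durability without opening a searcher, and soft commits as rarely as visibility requirements allow.

```shell
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycoll/config" \
  -d '{"set-property": {
        "updateHandler.autoCommit.maxTime": 15000,
        "updateHandler.autoCommit.openSearcher": false,
        "updateHandler.autoSoftCommit.maxTime": 60000}}'
```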
[Table: suggested commit settings per scenario - soft commit "as long as you can tolerate"; hard commit interval 15 sec to 10 min; openSearcher TRUE/FALSE depending on visibility needs]
• DocValues - Don’t forget there are 3 of those:
• Large heaps - Bad idea generally, unless you know
what you're doing
• OS Cache - It’s important
• Only retrieve what you want!
• Fields (avoid fl=* when you don’t need every field)
• Rows (rows=0, when all you want is hit count)
• Partial results
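The three points above as concrete queries (collection name is a placeholder):

```shell
# rows=0 when all you want is the hit count:
curl "http://localhost:8983/solr/mycoll/select?q=*:*&rows=0"

# fl to fetch only the fields you need, instead of fl=*:
curl "http://localhost:8983/solr/mycoll/select?q=solr&fl=id,score&rows=10"

# timeAllowed (ms) returns partial results when the query runs long:
curl "http://localhost:8983/solr/mycoll/select?q=solr&timeAllowed=500"
```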
• ReRankQueryParser - Only recent releases
• Warm up caches
• UI! UI! UI! - It’s got almost everything you need!
• Efficiently use caches - Hit/eviction stats
• Non-cached filters - specify a cost
• Post-filters can be your friend
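A sketch of both points in one request (collection and field names are placeholders):

```shell
# cache=false skips the filter cache; with {!frange}, a cost >= 100
# runs it as a post-filter, i.e. only over documents that already
# matched the main query and cheaper filters:
curl "http://localhost:8983/solr/mycoll/select?q=laptop&fq={!frange l=100 cache=false cost=200}price_i"
```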
• Don’t run a regular query if all you need is to
export the data!
• /export handler - not distributed, sans ranking
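What using /export looks like (names are placeholders). It streams the full sorted result set, requires sort and fl over docValues fields, and skips ranking:

```shell
curl "http://localhost:8983/solr/mycoll/export?q=*:*&sort=id+asc&fl=id,price_i"
```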
• Have more than 1 replica
• HDFS - High availability, but at a cost!
• Great work in progress
• Way more redundancy than needed today (Solr replicas × HDFS replication); on its way to being fixed
• Use sharding
• Hostname - More reliable than IP addresses at times.
• Jepsen tests came back fine!
More things to note…
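Two of the points above as commands (hostnames and collection name are placeholders):

```shell
# Pin a stable hostname per node in solr.in.sh, since IP addresses
# can change across restarts:
SOLR_HOST=solr1.example.com

# Add a replica of a shard on a different node for availability:
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=solr2.example.com:8983_solr"
```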
• Overestimating heap size? “index size + delta” is a rough heuristic at best
• Watch out for increasing major GCs - Red flag!
• Turn off swapping
• Consider explicit GC if it comes to that
• The OS needs memory, as much as the JVM…
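A sketch of the memory and GC points, assuming a standard solr.in.sh setup (the heap size is a placeholder):

```shell
# Discourage or disable swapping; a swapped JVM heap means huge pauses:
sudo sysctl vm.swappiness=1
sudo swapoff -a

# Keep the heap modest in solr.in.sh and leave the rest of RAM to the
# OS page cache, which serves the index files:
SOLR_HEAP="8g"

# Turn on GC logging so rising major-collection counts are visible:
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```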
• Rolling restarts to upgrade
• Watch out for back-compat issues
• Don’t kill the leader unless need be. Ditto with the Overseer
• Outsource it all to solr-scale-toolkit
Upgrading and restarts
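A rolling-restart sketch: one node at a time, waiting for the cluster to recover before moving on. Hostnames, paths, and the collection name are placeholders:

```shell
for host in solr1 solr2 solr3; do
  # Restart this node in cloud mode against the shared ensemble:
  ssh "$host" "/opt/solr/bin/solr restart -c -z zk1:2181/solr"

  # Block until every replica of the collection reports healthy again,
  # so we never take down two copies of a shard at once:
  until bin/solr healthcheck -c mycoll -z zk1:2181/solr | grep -q '"status":"healthy"'; do
    sleep 5
  done
done
```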
• Protect your cluster
• Kerberos, BasicAuth
• Role-based authorization
• Protect your ZooKeeper
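A sketch of enabling BasicAuth plus rule-based authorization by uploading security.json to ZooKeeper. The credentials below are the well-known reference-guide example (user solr, password SolrRocks); replace them before use:

```shell
cat > security.json <<'EOF'
{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [{"name": "security-edit", "role": "admin"}],
    "user-role": {"solr": "admin"}
  }
}
EOF

# Push it into the chroot the cluster uses:
bin/solr zk cp file:security.json zk:/security.json -z zk1:2181/solr
```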