Best practices for highly available and large scale SolrCloud

Best practices for highly available SolrCloud
Anshum Gupta
Apache Lucene/Solr committer, PMC member
Search Guy @ IBM Watson

About me
• Anshum Gupta, Apache Lucene/Solr committer and PMC member, IBM Watson Search team.
• Interested in search and related stuff.
• Working with Apache Lucene since 2006 and Solr since 2010.
• Organizations I am or have been a part of:

Apache Solr is the most widely used search solution on the planet.
Solr has tens of thousands of applications in production. You use it every day.
8,000,000+ total downloads
Solr is both established and growing.
250,000+ monthly downloads
2,500+ open Solr jobs and the largest community of developers.

SolrCloud Logical Architecture
[Diagram: Shard 1 and Shard 2, each with a leader and followers, alongside ZooKeeper. Legend: ZooKeeper instance, Solr instance.]

SolrCloud - Physical Architecture
[Diagram: many clients, a load balancer, Node 1 and Node 2, and ZooKeeper, labelled "Lots of interaction".]
Coins by Creative Stall from the Noun Project

Solr <> ZK interaction
• ZooKeeper is not just a config repo; it does a lot more!
• No ZK = stale clusterstate (among other things) and no writes
• Watches & GC!

ZooKeeper best practices
• NEVER use embedded ZK in production
• ZK ensemble - an odd number of nodes, (2n + 1), which tolerates n failures
• ZK chroot, especially if sharing the ensemble
• Use an OOM hook - shipped with Solr
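The (2n + 1) sizing rule comes from majority quorums: an ensemble stays available only while more than half of its servers are up. A minimal sketch of that arithmetic follows; the function names are illustrative and are not part of any ZooKeeper or Solr API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for size in (3, 4, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

Under this rule a 4-node ensemble tolerates no more failures than a 3-node one, which is why odd sizes are the usual choice.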

Also remember - for custom code
• Be frugal with watches - every watch on the ZK server has a 300-byte memory footprint
• ZK is not built for 1000s of watchers on a single node. Break it down! e.g. clusterstate

Sharding and Routing
• Shard your data - it generally helps
• Sharding is almost the same as splitting into different collections
• Use different nodes for replicas - replica placement strategy
• Use a composite key or a custom router
• Distributed IDF - here sharding beats different collections
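With Solr's compositeId router, a composite key is a document id of the form shard_key!doc_id, and a query can be restricted to the matching shard with the _route_ parameter. The sketch below only builds those strings; the helper names and the "acme" tenant are hypothetical.

```python
def composite_id(shard_key: str, doc_id: str) -> str:
    """Build an id that the compositeId router co-locates by shard_key."""
    if "!" in shard_key:
        raise ValueError("the shard key must not contain the '!' separator")
    return f"{shard_key}!{doc_id}"


def route_param(shard_key: str) -> str:
    """Value for the _route_ query parameter targeting shard_key's shard."""
    return f"{shard_key}!"


# Hypothetical tenant name, for illustration only.
doc = {"id": composite_id("acme", "order-42"), "tenant_s": "acme"}
query_params = {"q": "*:*", "_route_": route_param("acme")}
print(doc["id"], query_params["_route_"])
```

Documents that share a shard key land on the same shard, so a routed query avoids fanning out to every shard.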

Indexing best practices
• Batch your updates
• Reuse the HTTP client and the Solr client
• Use CloudSolrClient
• Atomic updates - each one wraps a full re-index of the document, so they are expensive
• Omit norms, term frequencies, and positions if you don't need them
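Batching can be sketched without any client library: group documents into fixed-size chunks and send each chunk as one JSON array to the collection's /update endpoint. The base URL, the collection name "demo", and the helper names below are assumptions for illustration; the request is built but never sent.

```python
import json
import urllib.request
from typing import Dict, Iterable, Iterator, List


def in_batches(docs: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most batch_size documents, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def build_update_request(base_url: str, collection: str,
                         batch: List[Dict]) -> urllib.request.Request:
    """Build (but do not send) one POST carrying a whole batch as JSON."""
    return urllib.request.Request(
        f"{base_url}/{collection}/update",
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local Solr URL and collection name.
docs = [{"id": str(i)} for i in range(5)]
requests = [build_update_request("http://localhost:8983/solr", "demo", b)
            for b in in_batches(docs, batch_size=2)]
print(len(requests))
```

Sending the requests through one long-lived connection or client object, rather than creating a new one per batch, follows the "reuse the client" bullet above.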

Other things to look at
• Replication bandwidth limiting
• Think about what you want indexed vs. stored

Commits and transaction log
• Soft commits = visibility
  • Delay them as much as you can
• Hard commits = durability
• Durability
• autoCommit
• openSearcher
• Initiate background merges if needed
• Only in times of desperation: updateLog config - syncLevel=fsync

Indexing recommendations

               Bulk indexing    Heavy indexing        Heavy querying   Crazy!
soft commit    Long! Best: -1   As long as possible   -1               As long as possible
hard commit    15 sec           15 sec                10 min           15 sec
openSearcher   FALSE            FALSE                 TRUE             TRUE/FALSE
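One way to apply such settings without editing solrconfig.xml by hand is a set-property command sent to the collection's /config endpoint, assuming your Solr version exposes the Config API. The sketch below only builds the JSON payload; the profile names are hypothetical labels for the two columns that give concrete values, with times converted to milliseconds.

```python
import json

# Hypothetical profile names; values taken from the recommendations (ms).
PROFILES = {
    "bulk_indexing": {
        "soft_commit_ms": -1,        # -1 disables automatic soft commits
        "hard_commit_ms": 15_000,    # 15 sec
        "open_searcher": False,
    },
    "heavy_querying": {
        "soft_commit_ms": -1,
        "hard_commit_ms": 600_000,   # 10 min
        "open_searcher": True,
    },
}


def commit_settings_payload(profile: str) -> str:
    """Return a Config API set-property command for the given profile."""
    settings = PROFILES[profile]
    command = {
        "set-property": {
            "updateHandler.autoSoftCommit.maxTime": settings["soft_commit_ms"],
            "updateHandler.autoCommit.maxTime": settings["hard_commit_ms"],
            "updateHandler.autoCommit.openSearcher": settings["open_searcher"],
        }
    }
    return json.dumps(command, indent=2)


print(commit_settings_payload("bulk_indexing"))
```

The columns that say "as long as possible" are left out on purpose, since they do not give a number to encode.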

Memory usage
• DocValues - don't forget there are 3 of those:
  • default
  • memory
  • direct
• Large heaps - generally a bad idea, unless you know what you're doing
• OS cache - it's important

Tuning Queries
• Only retrieve what you want!
• Fields (fl) - list the fields you need rather than fl=*
• Rows (rows=0, when all you want is the hit count)
• timeAllowed
• Partial results
• ReRankQueryParser - only in recent releases
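The fl, rows, and timeAllowed parameters are ordinary query-string parameters, so their effect is easy to show by building the query strings. This is a minimal sketch; the function names, field names, and the 500 ms budget are illustrative assumptions.

```python
from typing import List
from urllib.parse import urlencode


def hit_count_params(query: str, time_allowed_ms: int) -> str:
    """Query string that asks only for the hit count (rows=0)."""
    return urlencode({"q": query, "rows": 0, "timeAllowed": time_allowed_ms})


def select_params(query: str, fields: List[str], rows: int,
                  time_allowed_ms: int) -> str:
    """Query string that fetches only the listed fields."""
    return urlencode({
        "q": query,
        "fl": ",".join(fields),        # explicit field list instead of fl=*
        "rows": rows,
        "timeAllowed": time_allowed_ms,
    })


print(hit_count_params("title:solr", 500))
print(select_params("title:solr", ["id", "title"], 10, 500))
```

When timeAllowed is exceeded, Solr may return partial results, so callers should be prepared for an incomplete result set.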

Caches
• Warm up caches
• UI! UI! UI! - it's got almost everything you need!
• Use caches efficiently - watch hit/eviction stats
• Non-cached filters - specify a cost
• Postfilters can be your friend
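For non-cached filters, Solr reads the cache and cost local params on the fq parameter: lower-cost filters run first, and a filter whose query type supports post-filtering runs as a postfilter when cache=false and cost is 100 or more. The helper below just formats such an fq value; its name and the example field are hypothetical.

```python
def uncached_fq(filter_query: str, cost: int) -> str:
    """Prefix a plain filter query with cache=false and an explicit cost."""
    if cost < 0:
        raise ValueError("cost must be non-negative")
    return f"{{!cache=false cost={cost}}}{filter_query}"


# Hypothetical field name; a cheap, uncached filter that runs early.
print(uncached_fq("inStock:true", 50))
```

A filter that already carries local params, such as {!frange ...}, would need cache and cost added inside its existing braces rather than as a second prefix.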

Deep paging
• Don't run a regular query if all you need is to export the data!
• cursorMark
• /export handler - not distributed, sans ranking
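The cursorMark protocol is a simple loop: start with cursorMark=*, pass back the nextCursorMark from each response, and stop when it no longer changes. The sketch below takes the HTTP call as a parameter (fetch_page, a hypothetical callable returning the parsed JSON response) so the loop itself stays self-contained; the sort must include the uniqueKey field, assumed here to be id.

```python
from typing import Callable, Dict, Iterator


def iterate_with_cursor(fetch_page: Callable[[Dict], Dict],
                        base_params: Dict) -> Iterator[Dict]:
    """Yield every matching document by following nextCursorMark."""
    cursor = "*"  # initial cursor value defined by Solr
    while True:
        params = dict(base_params, cursorMark=cursor)
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            return  # an unchanged cursor means there are no more results
        cursor = next_cursor


# Assumed parameters; "id" stands in for the collection's uniqueKey field.
base_params = {"q": "*:*", "sort": "id asc", "rows": 100}
```

Unlike a growing start offset, each request only needs the sort values of the last document seen, which keeps the cost of later pages flat.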

More things to note…
• Have more than 1 replica
• HDFS - high availability, but at a cost!
  • Great work
  • Way more redundancy, on its way to being fixed
• Use sharding
• Hostnames - more reliable than IP addresses at times
• Jepsen tests came back fine!

JVM tuning
• Overestimating heap size? ~ index size + delta for the new generation
• Watch out for increasing major GCs - red flag!
• Turn off swapping
• Consider explicit GC if it comes to that
• The OS needs memory, as much as the JVM…

Upgrading and restarts
• Use rolling restarts to upgrade
• Watch out for back-compat issues
• Don't kill the leader unless need be. Ditto with the Overseer
• Outsource it all to solr-scale-toolkit

Security
• Protect your cluster
  • Kerberos, BasicAuth
  • Role based
• Protect your ZooKeeper

Connect @
http://www.twitter.com/anshumgupta
http://www.linkedin.com/in/anshumgupta/
anshum@apache.org
