SolrCloud Failover and Testing
Mark Miller (Cloudera)
Mark Miller
Lucene Committer, Solr Committer.
Works for Cloudera.
A lot of work on SolrCloud.
At Cloudera…
We are building an Enterprise Data Hub.
Search is a part of that.
Solr is our search engine.
Solr On HDFS
Performance is good.
It can be even better.
A shared filesystem has advantages.
SolrCloud Reminder
Limitation
We replicate via both Solr and HDFS.
Replicating with just one has huge tradeoffs.
We are working on better tradeoffs.
autoAddReplicas
A new per-collection option.
When a replica goes down, it is replaced on a node that is still up.
With a shared filesystem as well, all replicas can go down and you can still fail over automatically.
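As a rough illustration of turning the option on, the sketch below builds a Collections API CREATE request. It assumes the option is passed as a CREATE parameter named autoAddReplicas, as in Solr releases that ship this feature; check the reference guide for your version. The host, collection name, and shard counts are made up for the example, and build_create_collection_url is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def build_create_collection_url(base_url, name, num_shards,
                                replication_factor, auto_add_replicas=True):
    """Build a Collections API CREATE URL (hypothetical helper)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Assumption: the per-collection option is set at creation time.
        "autoAddReplicas": str(auto_add_replicas).lower(),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Example values only; nothing here contacts a server.
url = build_create_collection_url("http://localhost:8983/solr", "demo", 2, 2)
print(url)
```

The function only builds the URL, so it can be inspected before it is sent to a running cluster.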
How Does it Work?
SolrCloud elects a single node to be the Overseer. The role is fault tolerant: if that node fails, another is elected.
The Overseer monitors the cluster state in ZooKeeper.
When necessary, it creates a new SolrCore on a machine that is up to replace ‘downed’ replicas.
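The replacement step above can be sketched as a small planning function. This is only an illustration of the idea, not the Overseer's actual algorithm: the replica dictionaries, the node names, and the placement rule (least-loaded live node that does not already host the same shard) are all assumptions made for the example. In a real cluster this state lives in ZooKeeper.

```python
def plan_replacements(replicas, live_nodes):
    """Return {replica_name: target_node} for replicas on nodes that are down."""
    live = set(live_nodes)
    load = {node: 0 for node in live}   # replicas hosted per live node
    shard_nodes = {}                    # shard -> live nodes already hosting it
    for r in replicas:
        if r["node"] in live:
            load[r["node"]] += 1
            shard_nodes.setdefault(r["shard"], set()).add(r["node"])

    plan = {}
    for r in replicas:
        if r["node"] in live:
            continue  # this replica's node is still up
        used = shard_nodes.setdefault(r["shard"], set())
        candidates = [n for n in live if n not in used]
        if not candidates:
            continue  # no suitable node; leave the replica down
        # Assumed rule: least-loaded node, node name as tie-breaker.
        target = min(candidates, key=lambda n: (load[n], n))
        plan[r["name"]] = target
        load[target] += 1
        used.add(target)
    return plan


# Hypothetical cluster: node2 has gone down.
replicas = [
    {"name": "s1_r1", "shard": "shard1", "node": "node1"},
    {"name": "s1_r2", "shard": "shard1", "node": "node2"},
    {"name": "s2_r1", "shard": "shard2", "node": "node2"},
    {"name": "s2_r2", "shard": "shard2", "node": "node3"},
]
plan = plan_replacements(replicas, ["node1", "node3"])
print(plan)
```

With index data on a shared filesystem, the new core could point at the existing index rather than copy it from a live replica, which is why failover still works when every replica of a shard is down.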
Let’s Do A Demo!
SolrCloud Testing
Let’s talk about tests.
SolrCloud Tests
We did a straw man implementation of SolrCloud first.
We did the same for tests.
We favored integration tests over unit tests.
We did not make enough tests.
Distributed Tests
Are hard.
For a variety of reasons.
The Lucene / Solr testing framework hurts in order to help.
The Lucene / Solr Test Framework
Randomized Testing.
Rule Enforcement.
The Jenkins Cluster.
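The core idea behind randomized testing is that each run draws its inputs from a seeded random source, and a failure reports the seed so the exact run can be replayed. The sketch below shows that idea in Python as a general illustration; it is not the Lucene / Solr framework itself, which is written in Java, and run_randomized and check_sort_is_idempotent are hypothetical names.

```python
import random


def run_randomized(test_fn, seed=None, iterations=50):
    """Run test_fn(rng) repeatedly; on failure, report the seed to replay."""
    if seed is None:
        # No seed given: pick a fresh one so each run explores new inputs.
        seed = random.SystemRandom().getrandbits(32)
    rng = random.Random(seed)
    for i in range(iterations):
        try:
            test_fn(rng)
        except AssertionError as e:
            raise AssertionError(
                f"iteration {i} failed; reproduce with seed={seed}: {e}"
            ) from e
    return seed


def check_sort_is_idempotent(rng):
    # Random list of random length, drawn from the seeded generator.
    data = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]
    once = sorted(data)
    assert sorted(once) == once


used_seed = run_randomized(check_sort_is_idempotent, seed=12345)
```

Passing the reported seed back in replays the same sequence of inputs, which is what makes a randomized failure debuggable.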
Mocks
We avoided writing them early on because there was too much churn.
They can be dangerous for future contributors and for refactoring.
Some of the early mocking that did get in is a little painful.
We need them for good unit tests.
Testing Culture
Lucene has an A+ testing culture. In many cases, it’s easier for Lucene.
Solr has a C testing culture.
Solr needs to get better.
Prescription?
More focus on backfilling tests when adding features or changing code.
More focus on fixing frequently failing tests.
More focus on unit tests.
The End
@heismark
Thank You.
