2. Intro
Mano Kovacs
• Cloudera Search engineer
• Working on “Where did my Solr go?” mysteries.
Mike Drob (co-author)
• Committer on Apache Solr, HBase, etc.
• Distributed Systems Junkie
3. Agenda
• Consistency basics (leaders/followers)
• Leader election
• When to recover
• General recovery (PeerSync, replication)
• Recovery in detail
• Leader-Initiated Recovery
• Auto Add Replica
4. 01
Basics
• Collections are split into shards
- Documents are routed by ID or another shard key
• Shards can have replicas
• One leader per shard
• Reads are distributed across shards
• Writes
- Go to the leader first (for consistency)
- Then replicate to the other replicas
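A minimal SolrJ sketch of these basics; the ZK address, collection name, configset and field names are made up for illustration, and the Builder constructor varies between SolrJ versions:

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.common.SolrInputDocument;

public class BasicsExample {
  public static void main(String[] args) throws Exception {
    // Connect through ZooKeeper so SolrJ can route requests to the shard leaders.
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("localhost:2181"), Optional.empty()).build()) {

      // A collection with 2 shards and 2 replicas per shard (1 leader + 1 follower each).
      CollectionAdminRequest.createCollection("books", "_default", 2, 2)
          .process(client);

      // compositeId routing: docs sharing the "store1!" prefix land on the same shard.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "store1!978-0135957059");
      doc.addField("title_s", "The Pragmatic Programmer");
      client.add("books", doc);   // sent to the shard leader first, then replicated
      client.commit("books");
    }
  }
}
```

The “store1!” prefix is the shard-key routing mentioned above: the leader for that shard accepts the write, then its replicas pull the update.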
5. 01
Leader Election
• ZooKeeper leader election recipe
- Sequential, ephemeral nodes
- The sequence order dictates the queue of leader candidates
- The first (lowest sequence number) becomes the leader candidate
• Replicas* watch the previous candidate
• If the leader fails, the next in line becomes the candidate
• The candidate runs the leader preparation process
*(Solr 7.0+: applies to realtime replicas)
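A stripped-down sketch of the ZooKeeper recipe itself (not Solr's actual ZkController code); it assumes an /election parent znode already exists and a ZooKeeper reachable at localhost:2181:

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

    // Each candidate registers a sequential, ephemeral node under the election path.
    String me = zk.create("/election/n_", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    String myName = me.substring("/election/".length());

    // The sequence numbers dictate the order of leader candidates.
    List<String> candidates = zk.getChildren("/election", false);
    Collections.sort(candidates);

    int myIndex = candidates.indexOf(myName);
    if (myIndex == 0) {
      System.out.println("lowest sequence number: run the leader preparation process");
    } else {
      // Watch only the previous candidate; when it disappears, re-check the order.
      String predecessor = "/election/" + candidates.get(myIndex - 1);
      zk.exists(predecessor, event ->
          System.out.println("previous candidate gone, re-run the election check"));
    }
  }
}
```

Because each candidate watches only its predecessor, a failure wakes up exactly one node instead of the whole herd.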
6. 01
Leader Election - Preparation Process
• On restart: waits for all replicas to participate (default: 3 minutes)
- Replicas are asked to replay any missing updates
• Verifies the last published state was ACTIVE (unless this is a fresh startup)
- If all replicas were DOWN, the shard hangs (SOLR-7065)
• Verifies no error was reported (LIR, covered later)
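A toy illustration of the "wait for replicas" step; both helpers are hypothetical stand-ins for Solr's cluster-state checks, and the 3-minute deadline mirrors the default leaderVoteWait setting:

```java
import java.util.concurrent.TimeUnit;

public class LeaderVoteWaitSketch {
  // Hypothetical stand-ins: real Solr reads these counts from the ZK cluster state.
  static int expectedReplicas() { return 3; }
  static int liveReplicas()     { return 2; }

  public static void main(String[] args) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(3);

    // The would-be leader holds the election open, giving restarted replicas a
    // chance to participate (and to replay any updates they are missing).
    while (liveReplicas() < expectedReplicas() && System.nanoTime() < deadline) {
      Thread.sleep(1000);
    }
    System.out.println("proceeding with leader preparation; replicas seen: " + liveReplicas());
  }
}
```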
7. 01
What causes Recovery?
Routine Events
• Add or Move Replica: the new replica does not have the data yet
• Restart (upgrade/tuning): might have missed updates
Not Routine Events
• Server crash
- Leader
- Replica
• Network failure (lost ZK connection)
• Replica partitioned: can access ZK, but not the leader
8. 01
Recovery (from 30,000 ft.)
• Replay unfinished updates from the tlog
• Check whether we are in sync
• If not: “how far behind am I?”
- If N (default 100) docs or fewer:
retrieve the delta (PeerSync)
- Else:
replication: pulling the full index
• Go ACTIVE
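The same decision, sketched with hypothetical helper methods standing in for Solr's tlog replay, PeerSync and replication code:

```java
public class RecoveryDecisionSketch {
  static final int PEER_SYNC_LIMIT = 100;  // the "N" above; configurable in real Solr

  // Hypothetical helpers, only here to show the shape of the decision.
  static void replayLocalTlog()     { System.out.println("replaying unfinished tlog updates"); }
  static long updatesBehindLeader() { return 42; }
  static void peerSyncFromLeader()  { System.out.println("PeerSync: fetching the missing delta"); }
  static void replicateFullIndex()  { System.out.println("replication: pulling the full index"); }
  static void publishActive()       { System.out.println("publishing state=ACTIVE"); }

  public static void main(String[] args) {
    replayLocalTlog();
    long behind = updatesBehindLeader();
    if (behind == 0) {
      // already in sync, nothing to fetch
    } else if (behind <= PEER_SYNC_LIMIT) {
      peerSyncFromLeader();
    } else {
      replicateFullIndex();
    }
    publishActive();
  }
}
```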
9. Recovery (from 1,000 ft.)
• Buffer new updates
- So we don’t fall behind over and over again
• Wait for the leader to notice us
- Otherwise we won’t receive updates
• Replay the buffered updates
- Hopefully the replay catches up with the incoming updates
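A toy model of the buffer-then-replay idea; the queue and the update strings are stand-ins, not Solr's UpdateLog API:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BufferingSketch {
  public static void main(String[] args) {
    // Stand-in for the update buffer: new updates land here while the replica
    // is copying the index, and are replayed afterwards.
    BlockingQueue<String> buffered = new LinkedBlockingQueue<>();

    buffered.add("add doc 101");   // arrives while replication is still running
    buffered.add("add doc 102");

    // ... PeerSync or full replication finishes here ...

    // Replay everything that arrived in the meantime; if the replay keeps up
    // with incoming traffic, the replica can go ACTIVE.
    String update;
    while ((update = buffered.poll()) != null) {
      System.out.println("replaying buffered update: " + update);
    }
  }
}
```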
10. Problem with PeerSync
If a document is missing from before the PeerSync window, we won’t know!
[Diagram: Leader vs. Replica. The last 100 docs all match, but some older docs are missing on the replica.]
11. Recovery (from 100 ft.)
• Updates are versioned
- Timestamp + counter
• The index has a fingerprint (a checksum computed over the documents’ versions)
• If any other updates are missing, the fingerprint check will fail
- A consistency safety net when the other checks miss something
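A toy fingerprint comparison; the hashing is made up (Solr's real index fingerprint is more involved), but it shows why a doc missing outside the last-100 window still gets caught:

```java
import java.util.Arrays;
import java.util.List;

public class FingerprintSketch {
  // Toy fingerprint: fold every doc's version into one number, so that any
  // missing version changes the result.
  static long fingerprint(List<Long> versions) {
    long hash = 0;
    for (long v : versions) {
      hash += v * 0x9E3779B97F4A7C15L;  // arbitrary mixing constant
    }
    return hash;
  }

  public static void main(String[] args) {
    List<Long> leader  = Arrays.asList(1001L, 1002L, 1003L, 1004L);
    List<Long> replica = Arrays.asList(1001L, 1003L, 1004L);  // 1002 is missing, and it
                                                              // falls outside the PeerSync window
    if (fingerprint(leader) != fingerprint(replica)) {
      System.out.println("fingerprint mismatch: fall back to full replication");
    }
  }
}
```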
12. 01
Leader-Initiated Recovery
• The leader is partitioned from a replica, but both can still reach ZK
• The leader sends recovery requests to the replica (with retries)
• If the replica went down, it will go through the normal recovery process anyway
• If the replica is partitioned but up, it will still serve stale reads :(
• (this can happen during the update-forwarding phase of recovery)
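A sketch of the leader's side of LIR; the replica URL and the requestRecovery helper are hypothetical, only the retry-with-backoff shape is the point:

```java
public class LirSketch {
  // Hypothetical stand-in for the "please recover" request the leader sends
  // when it cannot forward an update to a replica.
  static boolean requestRecovery(String replicaUrl) {
    System.out.println("asking " + replicaUrl + " to recover");
    return false;  // pretend the replica is unreachable
  }

  public static void main(String[] args) throws InterruptedException {
    String replicaUrl = "http://replica-host:8983/solr/books_shard1_replica2";

    // The leader records the problem and keeps retrying; a partitioned-but-alive
    // replica never sees the request and keeps serving stale reads.
    for (int attempt = 1; attempt <= 5; attempt++) {
      if (requestRecovery(replicaUrl)) {
        return;
      }
      Thread.sleep(2000L * attempt);  // back off between retries
    }
    System.out.println("giving up; the replica will recover when it reconnects to ZK");
  }
}
```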
13. 01
LIR problems - SOLR-9555
• Race condition between LIR and standard recovery
• The leader overwrites the RECOVERING state
• The follower waits for the leader to notice it, until it times out
• Mike Drob’s patch is almost done
- It also solves the partitioned-replica problem, using ZK watches
14. 01
AutoAddReplica
• Uses a shared file system (e.g. HDFS)
- Provides durability
- Instances share index folders
• Moves cores to live nodes on failure
• Reuses the same index folder
• Benefits
- Durability with replication factor 1
- Handles permanent node loss
• Lots of fixes from Mark Miller lately
• Being rewritten in SOLR-10397