Salvatore Sanfilippo – How Redis Cluster works, and why
In this talk the algorithmic details of Redis Cluster will be exposed in order to show the design tensions in the clustered version of a high-performance database supporting complex data types, the selected tradeoffs, and their effect on the availability and consistency of the resulting solution. Other solutions in the design space that were not chosen will be illustrated for completeness.
3. Go Cluster
• Redis Cluster must cover the same use cases as Redis.
• Tradeoffs are inherently needed in distributed systems.
• CAP? Merge values? Strong consistency and consensus? How to replicate values?
4. CP systems
CAP: the price of consistency is added latency.
[Diagram: a client sends a write to S1, which propagates it to S2, S3 and S4.]
5. CP systems
Reply to client after majority ACKs
[Diagram: the client's write reaches S1, which replies only after a majority of S2, S3 and S4 have acknowledged it.]
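A minimal sketch of the write path on this slide, under the assumption of a hypothetical synchronous replication helper (send_to_peer is not a Redis API): the node forwards the write to its peers and replies to the client only once a majority of the replica set, itself included, has acknowledged it.

    # Illustrative only: reply to the client after majority ACKs.
    # `peers` and `send_to_peer` are hypothetical, not Redis Cluster APIs.
    def replicated_write(key, value, peers, send_to_peer):
        majority = (len(peers) + 1) // 2 + 1    # majority of the whole replica set
        acks = 1                                # the local node counts as one ACK
        for peer in peers:
            if send_to_peer(peer, key, value):  # blocking round trip: the latency cost
                acks += 1
        if acks >= majority:
            return "OK"                         # only now is it safe to reply to the client
        return "ERR not enough replicas acknowledged the write"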
6. And… there is the disk
[Diagram: S1, S2 and S3, each with its own disk.]
CP algorithms may require fsync-before-ack.
Durability and consistency are not always orthogonal.
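A small sketch of fsync-before-ack, with assumed file layout and helper names: the operation is appended to a log and flushed to disk before the node acknowledges it, which is exactly where the extra durability cost shows up.

    import os

    # Illustrative only: append the operation to a log and fsync it
    # before acknowledging, as some CP algorithms require.
    def log_then_ack(log_path, operation: bytes) -> str:
        with open(log_path, "ab") as log:
            log.write(operation + b"\n")
            log.flush()
            os.fsync(log.fileno())   # pay the disk latency before the ACK
        return "OK"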
7. AP systems
Eventual consistency with merges?
(note: merge is not strictly part of EC)
[Diagram: two clients write to different sides of a partition, producing divergent values A = {1,2,3,8,12,13,14} on S1 and A = {2,3,8,11,12,1} on S2, which must eventually be merged.]
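One way to resolve the divergence in the diagram above, sketched as an assumption rather than as something Redis Cluster does by default: since set insertion is idempotent and commutative, the two replicas of A can be merged by taking their union.

    # Hypothetical merge for the diverged set A shown above: the union is
    # idempotent and commutative, so the replicas can be merged safely.
    replica_s1 = {1, 2, 3, 8, 12, 13, 14}
    replica_s2 = {2, 3, 8, 11, 12, 1}

    merged = replica_s1 | replica_s2
    print(sorted(merged))   # [1, 2, 3, 8, 11, 12, 13, 14]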
8. Many kinds of consistencies
• “C” of CAP is strong consistency.
• It is not the only available tradeoff of course.
• Consistency is the set of liveness and safety
properties a given system provides.
• Saying “eventual consistency” alone is like saying nothing at all:
what liveness/safety properties does the system provide, if not “C”?
18. Global slots config
• A master FAIL state triggers a failover.
• Cluster needs a coherent view of configuration.
• Who is serving this slot currently?
• Slots config must eventually converge.
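A toy answer to “who is serving this slot currently?”, using a plain dictionary rather than Redis Cluster’s real internal structures: every one of the 16384 slots maps to the node currently claiming it, together with the config epoch of that claim.

    HASH_SLOTS = 16384

    # Toy slots config: slot -> (serving node, config epoch of the claim).
    # Real nodes keep an equivalent table and gossip it; this is only a sketch.
    slots_config = {s: ("node-A", 1) for s in range(0, HASH_SLOTS // 2)}
    slots_config.update({s: ("node-B", 1) for s in range(HASH_SLOTS // 2, HASH_SLOTS)})

    def who_serves(slot):
        node, epoch = slots_config[slot]
        return f"slot {slot} is served by {node} (config epoch {epoch})"

    print(who_serves(42))      # slot 42 is served by node-A (config epoch 1)
    print(who_serves(10000))   # slot 10000 is served by node-B (config epoch 1)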
19. Raft and failover
• Config propagation is solved using ideas from the
Raft algorithm (just a subset).
• Raft is a consensus algorithm built on top of
different “layers”.
• Raft paper is already a classic (highly
recommended).
• Full Raft not needed for Redis Cluster slots config.
20. Failover and config
[Diagram: a master fails; one of its slaves increments the cluster Epoch (Epoch = Epoch+1, a logical clock) and asks the remaining masters “Vote for me!” to be promoted.]
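A rough sketch of the election pictured above, with hypothetical helpers (request_vote stands in for the real cluster bus messages): a slave of the failed master bumps the logical clock, asks the other masters for their vote, and proceeds with the failover only if a majority agrees.

    # Illustrative election sketch, not the actual Redis Cluster code.
    # `request_vote` is a hypothetical callable asking one master for its vote.
    def try_failover(cluster_epoch, masters, request_vote):
        new_epoch = cluster_epoch + 1        # Epoch = Epoch + 1 (logical clock)
        votes = 0
        for master in masters:               # "Vote for me!"
            if request_vote(master, new_epoch):
                votes += 1
        if votes >= len(masters) // 2 + 1:   # a majority of masters agreed
            return new_epoch                 # promote the slave, claim the slots
        return None                          # lost this election, retry later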
21. Too easy?
• Why don’t we need full Raft?
• Because our config is idempotent: when the partition heals we can throw away stale slots config in favor of newer versions.
• The same algorithm is used in Sentinel v2 and works well.
22. Config propagation
• After a successful failover, new slot config is
broadcasted.
• If there are partitions, when they heal, config will get updated (broadcasted from time to time, plus stale config detection and UPDATE messages).
• Config with greater Epoch always wins.
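The “greater Epoch always wins” rule can be sketched as a simple update function (assumed names, not the real message handler): a node accepts a claim for a slot only if the claim carries a higher config epoch than the one it currently knows about.

    # Toy config update rule: a claim (node, epoch) for a slot is accepted
    # only if its epoch is greater than the one currently stored.
    def apply_claim(slots_config, slot, node, epoch):
        _, current_epoch = slots_config.get(slot, (None, 0))
        if epoch > current_epoch:    # config with greater epoch always wins
            slots_config[slot] = (node, epoch)
            return True              # config updated
        return False                 # stale claim: what triggers an UPDATE message

    config = {}
    apply_claim(config, 42, "node-A", 1)
    apply_claim(config, 42, "node-B", 2)   # accepted: 2 > 1
    apply_claim(config, 42, "node-A", 1)   # rejected as stale
    print(config[42])                      # ('node-B', 2)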
23. Redis Cluster consistency?
• Eventually consistent: last failover wins.
• In the “vanilla” design the amount of lost writes is unbounded.
• There are mechanisms to avoid unbounded data loss.
27. More data safety?
• OP logging until an async ACK is received.
• Replayed to the master when the node turns into a slave.
• “Safe” connections, on demand.
• Example SADD (idempotent + commutative).
• SET-LWW foo bar <wall-clock>.
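The SET-LWW idea above can be sketched as a last-write-wins register keyed on the client-supplied wall-clock value (names are illustrative; this is not an existing Redis command):

    import time

    # Toy last-write-wins register: each value carries the wall-clock timestamp
    # supplied by the client; on a conflicting write, the newer timestamp wins.
    def set_lww(store, key, value, wall_clock=None):
        ts = wall_clock if wall_clock is not None else time.time()
        current = store.get(key)
        if current is None or ts > current[1]:
            store[key] = (value, ts)

    store = {}
    set_lww(store, "foo", "bar", wall_clock=100.0)
    set_lww(store, "foo", "baz", wall_clock=90.0)   # older write, discarded
    print(store["foo"])                             # ('bar', 100.0)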
28. Multi key ops
• Hey hashtags!
• {user:1000}.following {user:1000}.followers.
• Unavailable for small windows, but no data
exchange between nodes.
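The hashtag trick works because only the substring between the first “{” and the following “}” is hashed, so {user:1000}.following and {user:1000}.followers land in the same slot. The sketch below follows the documented mapping (CRC16, XMODEM variant, modulo 16384) and is written from the spec rather than taken from the Redis source.

    HASH_SLOTS = 16384

    def crc16_xmodem(data):
        # CRC16 (XMODEM/CCITT, polynomial 0x1021), the variant used by Redis Cluster.
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
                crc &= 0xFFFF
        return crc

    def key_slot(key):
        # Hash only the text between the first '{' and the next '}', if non-empty.
        start = key.find("{")
        if start != -1:
            end = key.find("}", start + 1)
            if end > start + 1:
                key = key[start + 1:end]
        return crc16_xmodem(key.encode()) % HASH_SLOTS

    # Both keys hash on "user:1000", so they map to the same slot.
    print(key_slot("{user:1000}.following") == key_slot("{user:1000}.followers"))  # True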
29. Multi key ops
(availability)
• Single key ops: always available during resharding.
• Multi key ops, available if:
• No manual resharding of this hash slot in progress.
• Resharding in progress, but the source or destination node has all the keys.
• Otherwise we get a -TRYAGAIN error.
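A possible client-side handling of -TRYAGAIN, with a hypothetical execute callable standing in for whatever client library is in use: retry the multi-key operation with a small backoff until the resharding window has passed.

    import time

    # Illustrative retry loop for multi-key operations hitting -TRYAGAIN during
    # a manual resharding. `execute` is a hypothetical callable that returns the
    # reply or raises an exception whose text carries the Redis error.
    def run_with_tryagain_retry(execute, retries=10, delay=0.1):
        for _ in range(retries):
            try:
                return execute()
            except Exception as err:
                if "TRYAGAIN" not in str(err):
                    raise              # some other error, do not swallow it
                time.sleep(delay)      # keys split between source and destination, retry
        raise RuntimeError("operation still unavailable after retries")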