Salvatore Sanfilippo – How Redis Cluster works, and why
In this talk the algorithmic details of Redis Cluster will be exposed in order to show the design tensions in the clustered version of a high-performance database supporting complex data types, the selected tradeoffs, and their effect on the availability and consistency of the resulting solution. Other solutions in the design space that were not chosen will be illustrated for completeness.
3. Go Cluster
• Redis Cluster must cover the same use cases as Redis.
• Tradeoffs are inherently needed in distributed systems.
• CAP? Merge values? Strong consistency and consensus? How to replicate values?
4. CP systems
CAP: the price of consistency is added latency.
[Diagram: a client sends a write to S1, which propagates it to S2, S3 and S4.]
5. CP systems
Reply to client after majority ACKs
[Diagram: the client's write reaches S1, which replies only after a majority of S2, S3 and S4 have acknowledged it.]
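A minimal sketch of the write path on this slide, under the assumption of a hypothetical synchronous replication helper (send_to_peer is not a Redis API): the node forwards the write to its peers and replies to the client only once a majority of the replica set, itself included, has acknowledged it.

    # Illustrative only: reply to the client after majority ACKs.
    # `peers` and `send_to_peer` are hypothetical, not Redis Cluster APIs.
    def replicated_write(key, value, peers, send_to_peer):
        majority = (len(peers) + 1) // 2 + 1    # majority of the whole replica set
        acks = 1                                # the local node counts as one ACK
        for peer in peers:
            if send_to_peer(peer, key, value):  # blocking round trip: the latency cost
                acks += 1
        if acks >= majority:
            return "OK"                         # only now is it safe to reply to the client
        return "ERR not enough replicas acknowledged the write"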
6. And… there is the disk
[Diagram: S1, S2 and S3, each with its own disk.]
CP algorithms may require fsync-before-ack.
Durability and consistency are not always orthogonal.
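A small sketch of fsync-before-ack, with assumed file layout and helper names: the operation is appended to a log and flushed to disk before the node acknowledges it, which is exactly where the extra durability cost shows up.

    import os

    # Illustrative only: append the operation to a log and fsync it
    # before acknowledging, as some CP algorithms require.
    def log_then_ack(log_path, operation: bytes) -> str:
        with open(log_path, "ab") as log:
            log.write(operation + b"\n")
            log.flush()
            os.fsync(log.fileno())   # pay the disk latency before the ACK
        return "OK"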
7. AP systems
Eventual consistency with merges?
(note: merge is not strictly part of EC)
[Diagram: two clients write to different sides of a partition, producing divergent values A = {1,2,3,8,12,13,14} on S1 and A = {2,3,8,11,12,1} on S2, which must eventually be merged.]
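One way to resolve the divergence in the diagram above, sketched as an assumption rather than as something Redis Cluster does by default: since set insertion is idempotent and commutative, the two replicas of A can be merged by taking their union.

    # Hypothetical merge for the diverged set A shown above: the union is
    # idempotent and commutative, so the replicas can be merged safely.
    replica_s1 = {1, 2, 3, 8, 12, 13, 14}
    replica_s2 = {2, 3, 8, 11, 12, 1}

    merged = replica_s1 | replica_s2
    print(sorted(merged))   # [1, 2, 3, 8, 11, 12, 13, 14]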
8. Many kinds of consistencies
• “C” of CAP is strong consistency.
• It is not the only available tradeoff of course.
• Consistency is the set of liveness and safety
properties a given system provides.
• Saying “eventual consistency” alone is like saying nothing at all:
what liveness/safety properties does the system provide, if not “C”?
18. Global slots config
• A master FAIL state triggers a failover.
• Cluster needs a coherent view of configuration.
• Who is serving this slot currently?
• Slots config must eventually converge.
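A toy answer to “who is serving this slot currently?”, using a plain dictionary rather than Redis Cluster’s real internal structures: every one of the 16384 slots maps to the node currently claiming it, together with the config epoch of that claim.

    HASH_SLOTS = 16384

    # Toy slots config: slot -> (serving node, config epoch of the claim).
    # Real nodes keep an equivalent table and gossip it; this is only a sketch.
    slots_config = {s: ("node-A", 1) for s in range(0, HASH_SLOTS // 2)}
    slots_config.update({s: ("node-B", 1) for s in range(HASH_SLOTS // 2, HASH_SLOTS)})

    def who_serves(slot):
        node, epoch = slots_config[slot]
        return f"slot {slot} is served by {node} (config epoch {epoch})"

    print(who_serves(42))      # slot 42 is served by node-A (config epoch 1)
    print(who_serves(10000))   # slot 10000 is served by node-B (config epoch 1)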
19. Raft and failover
• Config propagation is solved using ideas from the
Raft algorithm (just a subset).
• Raft is a consensus algorithm built on top of
different “layers”.
• Raft paper is already a classic (highly
recommended).
• Full Raft not needed for Redis Cluster slots config.
20. Failover and config
[Diagram: a master fails; one of its slaves increments the cluster Epoch (Epoch = Epoch+1, a logical clock) and asks the remaining masters “Vote for me!” to be promoted.]
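A rough sketch of the election pictured above, with hypothetical helpers (request_vote stands in for the real cluster bus messages): a slave of the failed master bumps the logical clock, asks the other masters for their vote, and proceeds with the failover only if a majority agrees.

    # Illustrative election sketch, not the actual Redis Cluster code.
    # `request_vote` is a hypothetical callable asking one master for its vote.
    def try_failover(cluster_epoch, masters, request_vote):
        new_epoch = cluster_epoch + 1        # Epoch = Epoch + 1 (logical clock)
        votes = 0
        for master in masters:               # "Vote for me!"
            if request_vote(master, new_epoch):
                votes += 1
        if votes >= len(masters) // 2 + 1:   # a majority of masters agreed
            return new_epoch                 # promote the slave, claim the slots
        return None                          # lost this election, retry later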
21. Too easy?
• Why don’t we need full Raft?
• Because our config is idempotent: when the partition heals we can throw away stale slots config in favor of newer versions.
• The same algorithm is used in Sentinel v2 and works well.
22. Config propagation
• After a successful failover, new slot config is
broadcasted.
• If there are partitions, when they heal, config will get updated (broadcasted from time to time, plus stale config detection and UPDATE messages).
• Config with greater Epoch always wins.
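The “greater Epoch always wins” rule can be sketched as a simple update function (assumed names, not the real message handler): a node accepts a claim for a slot only if the claim carries a higher config epoch than the one it currently knows about.

    # Toy config update rule: a claim (node, epoch) for a slot is accepted
    # only if its epoch is greater than the one currently stored.
    def apply_claim(slots_config, slot, node, epoch):
        _, current_epoch = slots_config.get(slot, (None, 0))
        if epoch > current_epoch:    # config with greater epoch always wins
            slots_config[slot] = (node, epoch)
            return True              # config updated
        return False                 # stale claim: what triggers an UPDATE message

    config = {}
    apply_claim(config, 42, "node-A", 1)
    apply_claim(config, 42, "node-B", 2)   # accepted: 2 > 1
    apply_claim(config, 42, "node-A", 1)   # rejected as stale
    print(config[42])                      # ('node-B', 2)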
23. Redis Cluster consistency?
• Eventually consistent: last failover wins.
• In the “vanilla” design the amount of lost writes is unbounded.
• There are mechanisms to avoid unbounded data loss.
27. More data safety?
• OP logging until an async ACK is received.
• Replayed to the master when the node turns into a slave.
• “Safe” connections, on demand.
• Example SADD (idempotent + commutative).
• SET-LWW foo bar <wall-clock>.
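The SET-LWW idea above can be sketched as a last-write-wins register keyed on the client-supplied wall-clock value (names are illustrative; this is not an existing Redis command):

    import time

    # Toy last-write-wins register: each value carries the wall-clock timestamp
    # supplied by the client; on a conflicting write, the newer timestamp wins.
    def set_lww(store, key, value, wall_clock=None):
        ts = wall_clock if wall_clock is not None else time.time()
        current = store.get(key)
        if current is None or ts > current[1]:
            store[key] = (value, ts)

    store = {}
    set_lww(store, "foo", "bar", wall_clock=100.0)
    set_lww(store, "foo", "baz", wall_clock=90.0)   # older write, discarded
    print(store["foo"])                             # ('bar', 100.0)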
28. Multi key ops
• Hey hashtags!
• {user:1000}.following {user:1000}.followers.
• Unavailable for small windows, but no data
exchange between nodes.
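The hashtag trick works because only the substring between the first “{” and the following “}” is hashed, so {user:1000}.following and {user:1000}.followers land in the same slot. The sketch below follows the documented mapping (CRC16, XMODEM variant, modulo 16384) and is written from the spec rather than taken from the Redis source.

    HASH_SLOTS = 16384

    def crc16_xmodem(data):
        # CRC16 (XMODEM/CCITT, polynomial 0x1021), the variant used by Redis Cluster.
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
                crc &= 0xFFFF
        return crc

    def key_slot(key):
        # Hash only the text between the first '{' and the next '}', if non-empty.
        start = key.find("{")
        if start != -1:
            end = key.find("}", start + 1)
            if end > start + 1:
                key = key[start + 1:end]
        return crc16_xmodem(key.encode()) % HASH_SLOTS

    # Both keys hash on "user:1000", so they map to the same slot.
    print(key_slot("{user:1000}.following") == key_slot("{user:1000}.followers"))  # True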
29. Multi key ops
(availability)
• Single key ops: always available during resharding.
• Multi key ops, available if:
• No manual resharding of this hash slot in progress.
• Resharding in progress, but the source or destination node has all the keys.
• Otherwise we get a -TRYAGAIN error.
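A possible client-side handling of -TRYAGAIN, with a hypothetical execute callable standing in for whatever client library is in use: retry the multi-key operation with a small backoff until the resharding window has passed.

    import time

    # Illustrative retry loop for multi-key operations hitting -TRYAGAIN during
    # a manual resharding. `execute` is a hypothetical callable that returns the
    # reply or raises an exception whose text carries the Redis error.
    def run_with_tryagain_retry(execute, retries=10, delay=0.1):
        for _ in range(retries):
            try:
                return execute()
            except Exception as err:
                if "TRYAGAIN" not in str(err):
                    raise              # some other error, do not swallow it
                time.sleep(delay)      # keys split between source and destination, retry
        raise RuntimeError("operation still unavailable after retries")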