This slide deck walks through the challenges a developer can face while designing a distributed system. It also explains a consensus algorithm called RAFT. A few slides also describe consistent distributed key-value store concepts and projects like etcd that build on these concepts.
1. Using consensus algorithms and
distributed stores in designing
distributed systems
Atin Mukherjee
GlusterFS Hacker
@mukherjee_atin
2. Topics
● What is consensus in a distributed system?
● What is the CAP theorem in a distributed system?
● Different distributed system design approaches
● Challenges in design of a distributed system
● What is the RAFT algorithm and how does it work?
● Distributed store
● Combining RAFT & a distributed store – in the form of
technologies like consul/etcd/zookeeper, etc.
● Q & A
3. What is consensus in a distributed
system
● Consensus – an agreement, but for what and between whom?
● For what → whether the op/transaction is to be committed or not
● Between whom → the answer is pretty simple: the nodes forming the distributed system
● Quorum – (n/2) + 1 nodes must agree (see the sketch below)
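As a quick illustration (not part of the original deck), a minimal Go sketch of the quorum count for a cluster of n nodes; the function name is just for illustration:

package main

import "fmt"

// quorum returns the minimum number of nodes that must agree
// before an operation can be committed: floor(n/2) + 1.
func quorum(n int) int {
    return n/2 + 1
}

func main() {
    for _, n := range []int{3, 5, 7} {
        fmt.Printf("cluster of %d nodes -> quorum %d\n", n, quorum(n))
    }
}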
4. CAP theorem
● A distributed system can provide at most two of the following three guarantees
– Consistency (all nodes see the same data at the
same time)
– Availability (a guarantee that every request
receives a response about whether it succeeded or
failed)
– Partition tolerance (the system continues to
operate despite arbitrary message loss or failure of
part of the system)
5. Design approaches of a distributed
system
● No metadata – every node shares its data with all the others
● Metadata server – one node holds the data and the others fetch it from there
So which one is better???
Probably neither of them? Ask yourself for a minute....
6. Challenges in design of a distributed
system
● No metadata
– N * N exchange of network messages
– Not scalable when N runs into hundreds or thousands
– Initialization time can be very high
– Can end up in a situation like “whom to believe, whom not to” – popularly known as split brain
– How to undo a transaction locally?
7. Challenges in design of a distributed
system contd...
● MDS (Metadata server)
– SPOF (single point of failure)
Ahh!! so is this the only drawback??
– How about having replicas? But then what should the replica count be?
– Additional network hop, lower performance
8. RAFT – A consensus algorithm
● Key features
– Leader/followers based model
– Leader election
– Normal operation
– Safety and consistency after leader changes
– Neutralizing old leaders
– Client interactions
– Configuration changes
10. RAFT : Terms
● Time is divided into terms; each term has two parts
– Election
– Normal operation
● At most 1 leader per term
● Failed election – a term can end with no leader elected
● Split vote – no candidate obtains a majority
● Each server maintains the current term value (see the state sketch below)
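A minimal Go sketch, purely as an assumption about what such state could look like (field names are illustrative, not taken from any real implementation), of the per-server state described above:

package main

// LogEntry is one replicated command together with the term
// in which the leader received it.
type LogEntry struct {
    Term    int
    Command string
}

// ServerState holds the state every RAFT server maintains.
// CurrentTerm and VotedFor must survive restarts (persisted),
// while the role changes between follower, candidate and leader.
type ServerState struct {
    CurrentTerm int        // latest term this server has seen
    VotedFor    int        // candidate id voted for in the current term, -1 if none
    Log         []LogEntry // replicated log entries
    Role        string     // "follower", "candidate" or "leader"
}

func main() {
    s := ServerState{CurrentTerm: 0, VotedFor: -1, Role: "follower"}
    _ = s
}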
12. RAFT : Different RPCs
● RequestVote RPCs – a candidate sends these to the other nodes to get itself elected as leader
● AppendEntries RPCs – normal operation workload (log replication)
● AppendEntries RPCs with no entries – heartbeat messages the leader sends to all followers to assert its presence (both are illustrated below)
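Illustrative Go structs for the two RPC types, sketched under the assumption of a toy implementation (field names follow common RAFT descriptions, not any particular library):

package main

// RequestVoteArgs is sent by a candidate asking for a vote.
type RequestVoteArgs struct {
    Term         int // candidate's current term
    CandidateID  int
    LastLogIndex int // index of the candidate's last log entry
    LastLogTerm  int // term of the candidate's last log entry
}

// AppendEntriesArgs carries normal-operation log replication.
// With an empty Entries slice it acts as a heartbeat.
type AppendEntriesArgs struct {
    Term         int // leader's current term
    LeaderID     int
    PrevLogIndex int // index of the entry immediately preceding the new ones
    PrevLogTerm  int // term of that entry
    Entries      []string
    LeaderCommit int // leader's commit index
}

func main() {
    hb := AppendEntriesArgs{Term: 3, LeaderID: 1} // no entries => heartbeat
    _ = hb
}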
13. RAFT : Leader Election
● current_term++
● Follower → Candidate
● Vote for self
● Send RequestVote RPCs to all other servers, retry until either:
– Votes are received from a majority of servers
– An RPC is received from a valid leader
– The election timeout elapses – increment the term, start a new election
● Election properties (a simplified sketch follows)
– Safety – allow at most one winner per term
– Liveness – some candidate must eventually win
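A simplified, hypothetical Go sketch of the election steps listed above; requestVote stands in for the real RequestVote RPC, and retry/timeout handling is omitted:

package main

import "fmt"

// startElection increments the term, votes for itself and asks every
// peer for a vote; it wins the term if it collects a cluster majority.
func startElection(currentTerm int, peers []int,
    requestVote func(peer, term int) bool) (newTerm int, won bool) {

    newTerm = currentTerm + 1 // current_term++
    votes := 1                // follower -> candidate, vote for self

    for _, p := range peers {
        if requestVote(p, newTerm) {
            votes++
        }
    }
    // Majority of the full cluster (self + peers) wins the term.
    won = votes >= (len(peers)+1)/2+1
    return newTerm, won
}

func main() {
    grantAll := func(peer, term int) bool { return true }
    term, won := startElection(4, []int{1, 2, 3, 4}, grantAll)
    fmt.Println("term:", term, "won:", won)
}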
14. RAFT : Picking the best leader
● Candidates include log info in RequestVote RPCs: the index & term of their last log entry
● Voting server V denies the vote if its own log is more complete:
(votingServerLastTerm > candidateLastTerm) ||
((votingServerLastTerm == candidateLastTerm) &&
(votingServerLastIndex > candidateLastIndex))
● But is this enough for crash consistency? (comparison sketched below)
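The same vote-denial condition expressed as a small Go function (a sketch, not from the slides):

package main

import "fmt"

// voterDeniesVote returns true when the voting server's log is more
// complete than the candidate's, i.e. the vote must be denied.
func voterDeniesVote(voterLastTerm, voterLastIndex, candLastTerm, candLastIndex int) bool {
    if voterLastTerm != candLastTerm {
        return voterLastTerm > candLastTerm
    }
    return voterLastIndex > candLastIndex
}

func main() {
    fmt.Println(voterDeniesVote(5, 10, 4, 12)) // true: voter has a newer last term
    fmt.Println(voterDeniesVote(5, 10, 5, 10)) // false: logs equally complete
}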
15. RAFT : New commitment rules
● For a leader to decide an entry is committed:
– It must be stored on a majority of servers, and
– At least one new entry from the leader's term must also be stored on a majority of servers (see the sketch below)
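A simplified Go check of this rule, under the assumption that the entry being tested is from the leader's current term; entries from earlier terms become committed indirectly once such an entry commits:

package main

import "fmt"

// entryCommitted sketches the rule above for an entry from the leader's
// own term: it is committed once a majority of servers store it and its
// term matches the leader's current term.
func entryCommitted(replicas, clusterSize, entryTerm, leaderTerm int) bool {
    majority := clusterSize/2 + 1
    return replicas >= majority && entryTerm == leaderTerm
}

func main() {
    // Stored on 3 of 5 servers but from an older term: not directly committed.
    fmt.Println(entryCommitted(3, 5, 2, 3)) // false
    // Stored on 3 of 5 servers in the leader's own term: committed.
    fmt.Println(entryCommitted(3, 5, 3, 3)) // true
}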
16. RAFT : Log inconsistency
● The leader repairs a follower's log by
– Deleting extraneous entries
– Filling in missing entries from the leader's log (sketched below)
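A toy Go sketch of that repair, assuming log entries can be compared directly as strings (a real implementation compares term and index via the AppendEntries consistency check):

package main

import "fmt"

// repairLog overwrites a follower's conflicting suffix: everything from
// the first mismatch onward is deleted and the leader's entries are copied in.
func repairLog(followerLog, leaderLog []string) []string {
    i := 0
    for i < len(followerLog) && i < len(leaderLog) && followerLog[i] == leaderLog[i] {
        i++
    }
    // Delete extraneous entries, then fill in missing ones from the leader.
    return append(followerLog[:i], leaderLog[i:]...)
}

func main() {
    follower := []string{"a", "b", "x", "y"}
    leader := []string{"a", "b", "c"}
    fmt.Println(repairLog(follower, leader)) // [a b c]
}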
17. RAFT : Neutralizing old leaders
● The sender sends its term with every RPC
● If the sender's term is older than the receiver's, the RPC is rejected; if it is newer, the receiver steps down to follower, updates its term and then processes the RPC (see the sketch below)
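The same rule as a small, hypothetical Go function:

package main

import "fmt"

// handleIncomingTerm rejects an RPC carrying an older term; a newer term
// forces the receiver to step down to follower and adopt that term
// before processing the RPC.
func handleIncomingTerm(senderTerm, receiverTerm int, role string) (newTerm int, newRole string, accept bool) {
    if senderTerm < receiverTerm {
        return receiverTerm, role, false // reject the stale RPC
    }
    if senderTerm > receiverTerm {
        return senderTerm, "follower", true // step down and update the term
    }
    return receiverTerm, role, true
}

func main() {
    term, role, ok := handleIncomingTerm(7, 5, "leader")
    fmt.Println(term, role, ok) // 7 follower true: the old leader is neutralized
}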
18. RAFT : Client protocol
● Clients send commands to the leader
– If the leader is unknown, send to any server
– If the contacted server is not the leader, it redirects the client to the leader
● The client gets the response only after the command completes the full cycle at the leader
● Request timeout
– The client re-issues the command to another server
– A unique id for each command at the client avoids duplicate execution (see the sketch below)
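A minimal Go sketch of the unique-id idea; the state machine and command format are made up for illustration:

package main

import "fmt"

// stateMachine is a toy replicated state machine that remembers which
// client command ids it has already executed.
type stateMachine struct {
    applied map[string]bool   // command ids already executed
    data    map[string]string // the actual key/value state
}

// apply executes a client command unless its unique id was seen before,
// which is how a re-issued (timed-out) command avoids duplicate execution.
func (sm *stateMachine) apply(cmdID, key, value string) {
    if sm.applied[cmdID] {
        return // duplicate of an already-executed command: ignore
    }
    sm.data[key] = value
    sm.applied[cmdID] = true
}

func main() {
    sm := &stateMachine{applied: map[string]bool{}, data: map[string]string{}}
    sm.apply("cmd-42", "volume", "vol1")
    sm.apply("cmd-42", "volume", "vol1") // client retried after a timeout: no-op
    fmt.Println(sm.data, len(sm.applied)) // map[volume:vol1] 1
}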
19. Joint consensus phase
● A 2-phase approach
● Needs a majority of both the old and the new configurations for election and commitment (see the sketch below)
● A configuration change is just a log entry, applied immediately on receipt (committed or not)
● Once joint consensus is committed, begin replicating the log entry for the final configuration
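A small Go sketch of the joint-majority rule, with made-up server ids:

package main

import "fmt"

// jointMajority returns true only when the votes cover a majority of the
// old configuration AND a majority of the new configuration, which is the
// requirement during the joint consensus phase.
func jointMajority(votes map[int]bool, oldCfg, newCfg []int) bool {
    count := func(cfg []int) int {
        n := 0
        for _, id := range cfg {
            if votes[id] {
                n++
            }
        }
        return n
    }
    return count(oldCfg) >= len(oldCfg)/2+1 && count(newCfg) >= len(newCfg)/2+1
}

func main() {
    votes := map[int]bool{1: true, 2: true, 3: false, 4: true, 5: true}
    fmt.Println(jointMajority(votes, []int{1, 2, 3}, []int{1, 2, 3, 4, 5})) // true
}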
20. Distributed store
● A common store which can be shared by
different nodes
● In the form of key-value pairs for ease of use
● Several such distributed key-value store implementations are available.
21. etcd
● The name stands for “/etc distributed”
● Open source distributed consistent key value store
● Highly available and reliable
● Sequentially consistent
● Watchable
● Exposed via HTTP
● Runtime reconfigurable (scaling feature)
● Durable (snapshot backup/restore)
● Time-to-live keys (keys expire after a timeout) – see the example below
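A minimal usage sketch in Go, assuming a local etcd member listening on 127.0.0.1:2379 and the v2 keys HTTP API: it sets a key with a 10 second TTL and reads it back.

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "net/url"
    "strings"
)

func main() {
    base := "http://127.0.0.1:2379/v2/keys/message"

    // PUT /v2/keys/message with value=hello and ttl=10 (form encoded)
    form := url.Values{"value": {"hello"}, "ttl": {"10"}}
    req, _ := http.NewRequest(http.MethodPut, base, strings.NewReader(form.Encode()))
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    if _, err := http.DefaultClient.Do(req); err != nil {
        panic(err)
    }

    // GET /v2/keys/message returns the node as JSON
    resp, err := http.Get(base)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(body))
}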
22. etcd contd...
● Bootstrapping using RAFT
● Proxy mode for a node
● Cluster configuration – etcdctl member add/remove/list
● Similar projects like consul and zookeeper are also available.
23. Why etcd
● Vibrant community
● 500+ applications like kubernetes and cloud foundry use it
● 150+ developers
● Stable releases