This slide deck walks through the challenges a developer can face while designing a distributed system. It also explains a consensus algorithm called RAFT. A few slides also describe consistent distributed key-value store concepts and projects like etcd that build on these concepts.
1. Using consensus algorithms and
distributed stores in designing
distributed systems
Atin Mukherjee
GlusterFS Hacker
@mukherjee_atin
2. Topics
● What is consensus in a distributed system?
● What is the CAP theorem in a distributed system?
● Different distributed system design approaches
● Challenges in design of a distributed system
● What is the RAFT algorithm and how does it work?
● Distributed store
● Combining RAFT & a distributed store – in the form of
technologies like consul/etcd/zookeeper, etc.
● Q & A
3. What is consensus in a distributed
system
● Consensus – an agreement, but for what and between whom?
● For what → whether the op/transaction is to be committed or not
● Between whom → the answer is pretty simple: the nodes forming the distributed system
● Quorum – (n/2) + 1 nodes must agree (see the sketch below)
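As a quick illustration (not part of the original deck), a minimal Go sketch of the quorum count for a cluster of n nodes; the function name is just for illustration:

package main

import "fmt"

// quorum returns the minimum number of nodes that must agree
// before an operation can be committed: floor(n/2) + 1.
func quorum(n int) int {
    return n/2 + 1
}

func main() {
    for _, n := range []int{3, 5, 7} {
        fmt.Printf("cluster of %d nodes -> quorum %d\n", n, quorum(n))
    }
}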
4. CAP theorem
● A distributed system can provide at most two of the following three guarantees
– Consistency (all nodes see the same data at the
same time)
– Availability (a guarantee that every request
receives a response about whether it succeeded or
failed)
– Partition tolerance (the system continues to
operate despite arbitrary message loss or failure of
part of the system)
5. Design approaches of a distributed
system
● No metadata – every node shares its data with all the others
● Metadata server – one node holds the data and the others fetch it from there
So which one is better???
Probably neither of them? Ask yourself for a minute....
6. Challenges in design of a distributed
system
● No metadata
– N * N exchange of network messages
– Not scalable when N runs into hundreds or thousands
– Initialization time can be very high
– Can end up in a situation like “whom to believe, whom not to” – popularly known as split brain
– How to undo a transaction locally?
7. Challenges in design of a distributed
system contd...
● MDS (Metadata server)
– SPOF (single point of failure)
Ahh!! so is this the only drawback??
– How about having replicas? But then what should the replica count be?
– Additional network hop, lower performance
8. RAFT – A consensus algorithm
● Key features
– Leader/followers based model
– Leader election
– Normal operation
– Safety and consistency after leader changes
– Neutralizing old leaders
– Client interactions
– Configuration changes
10. RAFT : Terms
● Time is divided into terms; each term has two parts
– Election
– Normal operation
● At most 1 leader per term
● Failed election – a term can end with no leader elected
● Split vote – no candidate obtains a majority
● Each server maintains the current term value (see the state sketch below)
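A minimal Go sketch, purely as an assumption about what such state could look like (field names are illustrative, not taken from any real implementation), of the per-server state described above:

package main

// LogEntry is one replicated command together with the term
// in which the leader received it.
type LogEntry struct {
    Term    int
    Command string
}

// ServerState holds the state every RAFT server maintains.
// CurrentTerm and VotedFor must survive restarts (persisted),
// while the role changes between follower, candidate and leader.
type ServerState struct {
    CurrentTerm int        // latest term this server has seen
    VotedFor    int        // candidate id voted for in the current term, -1 if none
    Log         []LogEntry // replicated log entries
    Role        string     // "follower", "candidate" or "leader"
}

func main() {
    s := ServerState{CurrentTerm: 0, VotedFor: -1, Role: "follower"}
    _ = s
}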
12. RAFT : Different RPCs
● RequestVote RPCs – a candidate sends these to the other nodes to get itself elected as leader
● AppendEntries RPCs – normal operation workload (log replication)
● AppendEntries RPCs with no entries – heartbeat messages the leader sends to all followers to assert its presence (both are illustrated below)
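Illustrative Go structs for the two RPC types, sketched under the assumption of a toy implementation (field names follow common RAFT descriptions, not any particular library):

package main

// RequestVoteArgs is sent by a candidate asking for a vote.
type RequestVoteArgs struct {
    Term         int // candidate's current term
    CandidateID  int
    LastLogIndex int // index of the candidate's last log entry
    LastLogTerm  int // term of the candidate's last log entry
}

// AppendEntriesArgs carries normal-operation log replication.
// With an empty Entries slice it acts as a heartbeat.
type AppendEntriesArgs struct {
    Term         int // leader's current term
    LeaderID     int
    PrevLogIndex int // index of the entry immediately preceding the new ones
    PrevLogTerm  int // term of that entry
    Entries      []string
    LeaderCommit int // leader's commit index
}

func main() {
    hb := AppendEntriesArgs{Term: 3, LeaderID: 1} // no entries => heartbeat
    _ = hb
}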
13. RAFT : Leader Election
● current_term++
● Follower → Candidate
● Vote for self
● Send RequestVote RPCs to all other servers, retry until either:
– Votes are received from a majority of servers
– An RPC is received from a valid leader
– The election timeout elapses – increment the term, start a new election
● Election properties (a simplified sketch follows)
– Safety – allow at most one winner per term
– Liveness – some candidate must eventually win
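A simplified, hypothetical Go sketch of the election steps listed above; requestVote stands in for the real RequestVote RPC, and retry/timeout handling is omitted:

package main

import "fmt"

// startElection increments the term, votes for itself and asks every
// peer for a vote; it wins the term if it collects a cluster majority.
func startElection(currentTerm int, peers []int,
    requestVote func(peer, term int) bool) (newTerm int, won bool) {

    newTerm = currentTerm + 1 // current_term++
    votes := 1                // follower -> candidate, vote for self

    for _, p := range peers {
        if requestVote(p, newTerm) {
            votes++
        }
    }
    // Majority of the full cluster (self + peers) wins the term.
    won = votes >= (len(peers)+1)/2+1
    return newTerm, won
}

func main() {
    grantAll := func(peer, term int) bool { return true }
    term, won := startElection(4, []int{1, 2, 3, 4}, grantAll)
    fmt.Println("term:", term, "won:", won)
}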
14. RAFT : Picking the best leader
● Candidates include log info in RequestVote RPCs: the index & term of their last log entry
● Voting server V denies the vote if its own log is more complete:
(votingServerLastTerm > candidateLastTerm) ||
((votingServerLastTerm == candidateLastTerm) &&
(votingServerLastIndex > candidateLastIndex))
● But is this enough for crash consistency? (comparison sketched below)
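The same vote-denial condition expressed as a small Go function (a sketch, not from the slides):

package main

import "fmt"

// voterDeniesVote returns true when the voting server's log is more
// complete than the candidate's, i.e. the vote must be denied.
func voterDeniesVote(voterLastTerm, voterLastIndex, candLastTerm, candLastIndex int) bool {
    if voterLastTerm != candLastTerm {
        return voterLastTerm > candLastTerm
    }
    return voterLastIndex > candLastIndex
}

func main() {
    fmt.Println(voterDeniesVote(5, 10, 4, 12)) // true: voter has a newer last term
    fmt.Println(voterDeniesVote(5, 10, 5, 10)) // false: logs equally complete
}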
15. RAFT : New commitment rules
● For a leader to decide an entry is committed:
– It must be stored on a majority of servers, and
– At least one new entry from the leader's term must also be stored on a majority of servers (see the sketch below)
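A simplified Go check of this rule, under the assumption that the entry being tested is from the leader's current term; entries from earlier terms become committed indirectly once such an entry commits:

package main

import "fmt"

// entryCommitted sketches the rule above for an entry from the leader's
// own term: it is committed once a majority of servers store it and its
// term matches the leader's current term.
func entryCommitted(replicas, clusterSize, entryTerm, leaderTerm int) bool {
    majority := clusterSize/2 + 1
    return replicas >= majority && entryTerm == leaderTerm
}

func main() {
    // Stored on 3 of 5 servers but from an older term: not directly committed.
    fmt.Println(entryCommitted(3, 5, 2, 3)) // false
    // Stored on 3 of 5 servers in the leader's own term: committed.
    fmt.Println(entryCommitted(3, 5, 3, 3)) // true
}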
16. RAFT : Log inconsistency
● The leader repairs a follower's log by
– Deleting extraneous entries
– Filling in missing entries from the leader's log (sketched below)
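A toy Go sketch of that repair, assuming log entries can be compared directly as strings (a real implementation compares term and index via the AppendEntries consistency check):

package main

import "fmt"

// repairLog overwrites a follower's conflicting suffix: everything from
// the first mismatch onward is deleted and the leader's entries are copied in.
func repairLog(followerLog, leaderLog []string) []string {
    i := 0
    for i < len(followerLog) && i < len(leaderLog) && followerLog[i] == leaderLog[i] {
        i++
    }
    // Delete extraneous entries, then fill in missing ones from the leader.
    return append(followerLog[:i], leaderLog[i:]...)
}

func main() {
    follower := []string{"a", "b", "x", "y"}
    leader := []string{"a", "b", "c"}
    fmt.Println(repairLog(follower, leader)) // [a b c]
}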
17. RAFT : Neutralizing old leaders
● The sender sends its term with every RPC
● If the sender's term is older than the receiver's, the RPC is rejected; if it is newer, the receiver steps down to follower, updates its term and then processes the RPC (see the sketch below)
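The same rule as a small, hypothetical Go function:

package main

import "fmt"

// handleIncomingTerm rejects an RPC carrying an older term; a newer term
// forces the receiver to step down to follower and adopt that term
// before processing the RPC.
func handleIncomingTerm(senderTerm, receiverTerm int, role string) (newTerm int, newRole string, accept bool) {
    if senderTerm < receiverTerm {
        return receiverTerm, role, false // reject the stale RPC
    }
    if senderTerm > receiverTerm {
        return senderTerm, "follower", true // step down and update the term
    }
    return receiverTerm, role, true
}

func main() {
    term, role, ok := handleIncomingTerm(7, 5, "leader")
    fmt.Println(term, role, ok) // 7 follower true: the old leader is neutralized
}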
18. RAFT : Client protocol
● Clients send commands to the leader
– If the leader is unknown, send to any server
– If the contacted server is not the leader, it redirects the client to the leader
● The client gets the response only after the command completes the full cycle at the leader
● Request timeout
– The client re-issues the command to another server
– A unique id for each command at the client avoids duplicate execution (see the sketch below)
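A minimal Go sketch of the unique-id idea; the state machine and command format are made up for illustration:

package main

import "fmt"

// stateMachine is a toy replicated state machine that remembers which
// client command ids it has already executed.
type stateMachine struct {
    applied map[string]bool   // command ids already executed
    data    map[string]string // the actual key/value state
}

// apply executes a client command unless its unique id was seen before,
// which is how a re-issued (timed-out) command avoids duplicate execution.
func (sm *stateMachine) apply(cmdID, key, value string) {
    if sm.applied[cmdID] {
        return // duplicate of an already-executed command: ignore
    }
    sm.data[key] = value
    sm.applied[cmdID] = true
}

func main() {
    sm := &stateMachine{applied: map[string]bool{}, data: map[string]string{}}
    sm.apply("cmd-42", "volume", "vol1")
    sm.apply("cmd-42", "volume", "vol1") // client retried after a timeout: no-op
    fmt.Println(sm.data, len(sm.applied)) // map[volume:vol1] 1
}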
19. Joint consensus phase
● A 2-phase approach
● Needs a majority of both the old and the new configurations for election and commitment (see the sketch below)
● A configuration change is just a log entry, applied immediately on receipt (committed or not)
● Once joint consensus is committed, begin replicating the log entry for the final configuration
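A small Go sketch of the joint-majority rule, with made-up server ids:

package main

import "fmt"

// jointMajority returns true only when the votes cover a majority of the
// old configuration AND a majority of the new configuration, which is the
// requirement during the joint consensus phase.
func jointMajority(votes map[int]bool, oldCfg, newCfg []int) bool {
    count := func(cfg []int) int {
        n := 0
        for _, id := range cfg {
            if votes[id] {
                n++
            }
        }
        return n
    }
    return count(oldCfg) >= len(oldCfg)/2+1 && count(newCfg) >= len(newCfg)/2+1
}

func main() {
    votes := map[int]bool{1: true, 2: true, 3: false, 4: true, 5: true}
    fmt.Println(jointMajority(votes, []int{1, 2, 3}, []int{1, 2, 3, 4, 5})) // true
}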
20. Distributed store
● A common store which can be shared by
different nodes
● In the form of key-value pairs for ease of use
● Several such distributed key-value store implementations are available.
21. etcd
● The name stands for “/etc distributed”
● Open source distributed consistent key value store
● Highly available and reliable
● Sequentially consistent
● Watchable
● Exposed via HTTP
● Runtime reconfigurable (scaling feature)
● Durable (snapshot backup/restore)
● Time-to-live keys (keys expire after a timeout) – see the example below
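A minimal usage sketch in Go, assuming a local etcd member listening on 127.0.0.1:2379 and the v2 keys HTTP API: it sets a key with a 10 second TTL and reads it back.

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "net/url"
    "strings"
)

func main() {
    base := "http://127.0.0.1:2379/v2/keys/message"

    // PUT /v2/keys/message with value=hello and ttl=10 (form encoded)
    form := url.Values{"value": {"hello"}, "ttl": {"10"}}
    req, _ := http.NewRequest(http.MethodPut, base, strings.NewReader(form.Encode()))
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    if _, err := http.DefaultClient.Do(req); err != nil {
        panic(err)
    }

    // GET /v2/keys/message returns the node as JSON
    resp, err := http.Get(base)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(body))
}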
22. etcd contd...
● Bootstrapping using RAFT
● Proxy mode for a node
● Cluster configuration – etcdctl member add/remove/list
● Similar projects like consul and zookeeper are also available.
23. Why etcd
● Vibrant community
● 500+ applications like kubernetes and cloud foundry use it
● 150+ developers
● Stable releases