This document discusses distributed systems and consensus algorithms. It covers the CAP theorem, which states that a distributed system can only provide two of three guarantees: consistency, availability, and partition tolerance. The document then discusses challenges in designing distributed systems without a metadata server or with a single metadata server. It introduces the RAFT consensus algorithm, which uses leaders and log replication to achieve consensus. RAFT elects a single leader per term and uses heartbeat messages to ensure the leader remains active. The document concludes by discussing etcd, an open-source distributed key-value store based on RAFT that provides a consistent data store for distributed applications and services.
1. How to scale a distributed (file) system
Atin Mukherjee
Gluster Hacker
SSE @ Red Hat
@mukherjee_atin
IRC: atinmu
2. Agenda
● Consensus in distributed systems
● CAP theorem in distributed systems
● Different distributed system design approaches
● Design challenges
● RAFT algorithm
● Consistent distributed store
● etcd
● Q & A
3. Consensus in distributed systems
● Consensus – an agreement, but for what and between whom?
● For what → whether an operation/transaction can be committed or not
● Between whom → the answer is simple: the nodes forming the distributed system
● Quorum – (n/2) + 1, i.e. a strict majority of the n nodes
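As a quick illustration (not from the slides), here is a minimal Go sketch of the quorum computation; integer division makes (n/2) + 1 a strict majority for both odd and even cluster sizes:

    package main

    import "fmt"

    // quorum returns the minimum number of nodes that form a strict
    // majority in a cluster of n nodes.
    func quorum(n int) int {
        return n/2 + 1
    }

    func main() {
        for _, n := range []int{3, 4, 5} {
            fmt.Printf("cluster of %d nodes -> quorum = %d\n", n, quorum(n))
        }
        // cluster of 3 nodes -> quorum = 2
        // cluster of 4 nodes -> quorum = 3
        // cluster of 5 nodes -> quorum = 3
    }

Note that a 4-node cluster needs 3 votes and so tolerates only one failure, the same as a 3-node cluster; this is why odd cluster sizes are usually preferred.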
4. CAP theorem
● A distributed system can provide at most two of the following three guarantees:
– Consistency (all nodes see the same data at the same time)
– Availability (a guarantee that every request receives a response about whether it succeeded or failed)
– Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
5. Distributed system design approaches
● No metadata server – all nodes exchange their data directly with each other
● Metadata server – one node holds the metadata and the others fetch it from that node
So which one is better?
Probably neither of them? Think it over for a minute...
6. Challenges in designing a distributed system
● No metadata server
– N × N exchange of network messages, since every node talks to every other node
– Not scalable when N grows into the hundreds or thousands
– Initialization time can be very high
– Can end up in a situation of “whom to believe, whom not to” – popularly known as split brain
– No easy way to undo a transaction locally
7. Challenges in designing a distributed system - 2
● MDS (metadata server)
– SPOF (single point of failure)
Ah, but is this the only drawback?
– Replicas can help, but then how many replicas, and how do we keep them consistent?
– Additional network hop, hence lower performance
8. RAFT – A consensus algorithm
● Key functions
– Asymmetric – leader based
– Leader election
– Normal operation
– Safety and consistency after leader changes
– Neutralizing old leaders
– Client interactions
– Configuration changes
9. RAFT : Terms
● Time is divided into terms, each with two parts:
– Election
– Normal operation
● At most 1 leader per term
● A failed election (split vote) ends a term with no leader
● Each server maintains the current term value
● Terms are used to identify obsolete information
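To make “identify obsolete information” concrete, here is a small hypothetical Go sketch of the term-comparison rule from the Raft paper (the types and names are illustrative, not etcd's code):

    type Role int

    const (
        Follower Role = iota
        Candidate
        Leader
    )

    type Server struct {
        currentTerm uint64
        role        Role
    }

    // observeTerm applies the Raft rule when any RPC arrives carrying
    // rpcTerm. It reports whether the message should be processed.
    func (s *Server) observeTerm(rpcTerm uint64) bool {
        switch {
        case rpcTerm > s.currentTerm:
            s.currentTerm = rpcTerm // newer term: adopt it...
            s.role = Follower       // ...and step down if leader/candidate
            return true
        case rpcTerm < s.currentTerm:
            return false // sender is stale: reject the message
        default:
            return true
        }
    }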
12. RAFT : Different RPCs
● RequestVote RPCs – a candidate sends these to the other nodes to get itself elected as leader
● AppendEntries RPCs – carry the normal operation workload (replicating log entries)
● AppendEntries RPCs with no entries serve as heartbeat messages – the leader sends them to all followers to assert its presence
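The message shapes below are a sketch following the field names in the Raft paper, not etcd's actual wire format; note how an AppendEntries message with an empty Entries slice doubles as a heartbeat:

    // LogEntry is one replicated command plus the term it was created in.
    type LogEntry struct {
        Term    uint64
        Command []byte // opaque command for the state machine
    }

    // RequestVoteArgs is sent by a candidate to gather votes.
    type RequestVoteArgs struct {
        Term         uint64 // candidate's term
        CandidateID  string
        LastLogIndex uint64 // index of candidate's last log entry
        LastLogTerm  uint64 // term of candidate's last log entry
    }

    // AppendEntriesArgs is sent by the leader; with Entries empty it
    // is a pure heartbeat.
    type AppendEntriesArgs struct {
        Term         uint64 // leader's term
        LeaderID     string
        PrevLogIndex uint64 // entry immediately preceding the new ones
        PrevLogTerm  uint64
        Entries      []LogEntry
        LeaderCommit uint64 // leader's commit index
    }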
13. RAFT : Leader Election
● current_term++
● Follower → Candidate
● Vote for self
● Send RequestVote RPCs to all other servers; retry until either:
– Votes are received from a majority of servers → become leader
– An RPC arrives from a valid leader → step back to follower
– The election timeout elapses → increment the term and start over
● Election properties
– Safety – allow at most one winner per term
– Liveness – some candidate must eventually win
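Continuing the Server sketch from the terms slide, a hypothetical rendering of this election round in Go (sendRequestVote is a stand-in for the real RPC; retries and timers are elided):

    // sendRequestVote is a placeholder; a real implementation would
    // issue the RequestVote RPC to the peer and return whether the
    // vote was granted.
    func (s *Server) sendRequestVote(peer string, term uint64) bool {
        return false // stub
    }

    func (s *Server) startElection(peers []string) {
        s.currentTerm++    // current_term++
        s.role = Candidate // Follower -> Candidate
        votes := 1         // vote for self

        for _, p := range peers {
            if s.sendRequestVote(p, s.currentTerm) {
                votes++
            }
        }

        n := len(peers) + 1 // cluster size including self
        if votes >= n/2+1 {
            s.role = Leader // majority reached: at most one winner per term
        }
        // Otherwise wait: on election timeout, increment the term and
        // retry; step down if an RPC from a valid leader arrives.
    }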
14. Consistent distributed store
● A common consistent store that can be shared by different nodes
● Exposed in the form of key-value pairs for ease of use
● Several such distributed key-value store implementations are available
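A minimal sketch of the kind of interface such a store exposes (the method names here are illustrative, not any particular product's API):

    // ConsistentStore is a hypothetical interface for a shared,
    // consistent key-value store.
    type ConsistentStore interface {
        Put(key, value string) error
        Get(key string) (string, error)
        Delete(key string) error
        Watch(key string) (<-chan string, error) // change notifications
    }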
15. etcd
● The name stands for “distributed /etc”
● Open source distributed consistent key value store
● Based on RAFT
● Highly available and reliable
● Sequentially consistent
● Watchable
● Exposed via HTTP
● Runtime reconfigurable (scaling feature)
● Durable (snapshot backup/restore)
● Time-to-live (TTL) keys that expire after a timeout
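Since etcd is exposed over HTTP, here is a short Go sketch of setting and reading a key through the v2 keys API, assuming a local etcd listening on 127.0.0.1:2379 (the key name and TTL are arbitrary examples):

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
        "strings"
    )

    func main() {
        key := "http://127.0.0.1:2379/v2/keys/message"

        // Set the key with a value and a 10-second TTL.
        form := url.Values{"value": {"hello"}, "ttl": {"10"}}
        req, err := http.NewRequest(http.MethodPut, key,
            strings.NewReader(form.Encode()))
        if err != nil {
            panic(err)
        }
        req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        resp.Body.Close()

        // Read it back; the response is a JSON node carrying the key,
        // value, ttl and expiration.
        resp, err = http.Get(key)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body))
    }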
16. Why etcd
● Vibrant community
● 500+ applications, such as Kubernetes and Cloud Foundry, use it
● 150+ developers
● Stable releases
17. Conclusion
● Use an etcd subcluster to store configuration data
● No burden on the application to maintain consistency
● And that's all!!