How to scale a distributed (file) system
Atin Mukherjee
Gluster Hacker
SSE @ Red Hat
@mukherjee_atin
IRC : atinmu
Agenda
● Consensus in distributed system
● CAP theorem in distributed system
● Different distributed system design approaches
● Design challenges
● RAFT algorithm
● Consistent distributed store
● etcd
● Q & A
Consensus in distributed system
● Consensus – an agreement, but for what and
between whom?
● For what → whether the op/transaction can be
committed or not
● Between whom → Answer is pretty simple: the
nodes forming the distributed system
● Quorum – (n/2) + 1
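A quick worked check of that formula – a minimal Go sketch, my own illustration and not part of the original deck:

package main

import "fmt"

// quorum returns the minimum number of agreeing nodes, i.e. a strict
// majority: (n/2) + 1 with integer division.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		// 3 -> 2, 4 -> 3, 5 -> 3, 7 -> 4
		fmt.Printf("a cluster of %d nodes needs %d votes\n", n, quorum(n))
	}
}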
CAP theorem
● Any two of the following three guarantees:
– Consistency (all nodes see the same data at the
same time)
– Availability (a guarantee that every request
receives a response about whether it succeeded or
failed)
– Partition tolerance (the system continues to operate
despite arbitrary message loss or failure of part of
the system)
Distributed system design approaches
● No metadata server – all nodes share their data
directly with each other
● Metadata server – one node holds the metadata and
the other nodes fetch it from there
So which one is better???
Probably neither of them? Ask yourself for a
minute....
Challenges in design of a distributed
system
● No metadata server
– N * N exchange of network messages (see the
sketch after this list)
– Not scalable when N grows into the hundreds or
thousands
– Initialization time can be very high
– Can end up in a situation of “whom to believe,
whom not to” – popularly known as split brain
– Hard to undo a transaction locally
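To see why the all-to-all exchange stops scaling, here is a minimal Go sketch (my own illustration, not from the deck) counting the messages in one exchange round of a full mesh of N nodes:

package main

import "fmt"

func main() {
	// In a full mesh every node talks to every other node, so one
	// round of exchange costs roughly N * (N - 1) messages.
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("%4d nodes -> %7d messages per round\n", n, n*(n-1))
	}
	// The count grows quadratically: 90, 9900, 999000.
}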
Challenges in design of a distributed
system - 2
● MDS (metadata server)
– SPOF (single point of failure)
Ahh!! So is this the only drawback?
– What about keeping replicas – and then which
replica count?
– Additional network hop, lower performance
RAFT – A consensus algorithm
● Key functions
– Asymmetric – leader based
– Leader election
– Normal operation
– Safety and consistency after leader changes
– Neutralizing old leaders
– Client interactions
– Configuration changes
RAFT : Terms
● Time is divided into terms; each term has two parts
– Election
– Normal operation
● At most 1 leader per term
● Some terms have no leader – a failed election due to a split vote
● Each server maintains its current term value
● Terms help identify obsolete information
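A minimal sketch of the per-server state Raft keeps, with field names following the Raft paper (illustrative, not the deck's code). The current term is how a server spots obsolete information: any message carrying a smaller term is stale.

package raft

// Persistent state every Raft server maintains (per the Raft paper).
type LogEntry struct {
	Term    int    // term in which the entry was created
	Command []byte // opaque command for the replicated state machine
}

type ServerState struct {
	CurrentTerm int        // latest term this server has seen
	VotedFor    int        // candidate voted for in CurrentTerm, -1 if none
	Log         []LogEntry // the replicated log
}

// isStale reports whether an incoming message carries an old term and
// therefore contains obsolete information.
func (s *ServerState) isStale(msgTerm int) bool {
	return msgTerm < s.CurrentTerm
}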
RAFT : Server states
● Server state transitions
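The transition diagram on this slide is an image in the original deck; as a stand-in, here is a small Go sketch of the standard Raft role transitions (my own summary):

package raft

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Standard Raft role transitions (what the missing diagram shows):
//   Follower  -> Candidate : election timeout, no heartbeat from a leader
//   Candidate -> Leader    : votes received from a majority of servers
//   Candidate -> Follower  : a valid leader or a higher term is discovered
//   Leader    -> Follower  : a server with a higher term is discovered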
RAFT : Replicated state machine
● A picture says a thousand words...
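The picture on this slide shows the replicated state machine idea: every server applies the same committed log entries, in the same order, to an identical state machine, so all copies end up in the same state. A minimal self-contained sketch with an assumed key-value state machine (my illustration, not the deck's code):

package main

import (
	"fmt"
	"strings"
)

// Entry is one committed log entry; Command encodes "key=value" here
// purely for illustration.
type Entry struct{ Command string }

// KVStateMachine is a trivial key-value state machine: applying the
// same committed entries in the same order yields the same state on
// every server.
type KVStateMachine struct{ data map[string]string }

func (kv *KVStateMachine) Apply(e Entry) {
	if parts := strings.SplitN(e.Command, "=", 2); len(parts) == 2 {
		kv.data[parts[0]] = parts[1]
	}
}

func main() {
	kv := &KVStateMachine{data: map[string]string{}}
	for _, e := range []Entry{{"x=1"}, {"x=2"}, {"y=3"}} {
		kv.Apply(e)
	}
	fmt.Println(kv.data) // map[x:2 y:3] on every replica
}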
RAFT : Different RPCs
● RequestVote RPCs – a candidate sends these to the
other nodes to get itself elected as leader
● AppendEntries RPCs – normal operation
workload
● AppendEntries RPCs with no entries – heartbeat
messages the leader sends to all followers to
assert its presence
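A minimal sketch of the two RPC message types, with field names as in the Raft paper and reusing the LogEntry type from the earlier sketch (illustrative, not the deck's code). A heartbeat is simply an AppendEntries call with an empty Entries slice.

package raft

// RequestVote RPC: a candidate asks the other servers for their vote.
type RequestVoteArgs struct {
	Term         int // candidate's current term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // responder's current term, so the candidate can update itself
	VoteGranted bool // true means the candidate received this vote
}

// AppendEntries RPC: the leader replicates log entries; with an empty
// Entries slice it doubles as the heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that entry
	Entries      []LogEntry // empty for a heartbeat
	LeaderCommit int        // leader's commit index
}

type AppendEntriesReply struct {
	Term    int  // responder's current term
	Success bool // true if the follower's log matched PrevLogIndex/PrevLogTerm
}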
RAFT : Leader Election
● current_term++
● Follower->Candidate
● Self vote
● Send RequestVote RPCs to all other servers; retry until either:
– Receive votes from a majority of servers
– Receive an RPC from a valid leader
– Election timeout elapses – increment term, start a new election
● Election properties
– Safety – allow at most one winner per term
– Liveness – some candidate must eventually win
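Putting the bullets above together, a condensed, illustrative sketch of the candidate's side of an election, reusing the types from the earlier sketches (simplified: no locking, synchronous RPCs, log-freshness checks and timeouts omitted):

package raft

// runElection sketches what a follower does when its election timer
// fires; sendVote abstracts the RequestVote RPC to one peer.
func (s *ServerState) runElection(myID int, sendVote []func(RequestVoteArgs) RequestVoteReply) bool {
	s.CurrentTerm++   // current_term++
	s.VotedFor = myID // Follower -> Candidate, self vote
	votes := 1        // the self vote

	args := RequestVoteArgs{Term: s.CurrentTerm, CandidateID: myID}
	for _, call := range sendVote {
		reply := call(args)
		if reply.Term > s.CurrentTerm {
			// A newer term exists: this candidacy is obsolete, step down.
			s.CurrentTerm = reply.Term
			return false
		}
		if reply.VoteGranted {
			votes++
		}
	}

	total := len(sendVote) + 1 // peers plus this server
	// Safety: at most one winner per term, since every server grants at
	// most one vote per term and winning needs a majority.
	return votes >= total/2+1
}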
Consistent distributed store
● A common consistent store which can be
shared by different nodes
● In the form of key-value pairs for ease of use
● Several such distributed key-value store
implementations are already available
etcd
● The name stands for a distributed /etc
● Open source distributed consistent key value store
● Based on RAFT
● Highly available and reliable
● Sequentially consistent
● Watchable
● Exposed via HTTP
● Runtime reconfigurable (scaling feature)
● Durable (snapshot backup/restore)
● Time-to-live keys (keys expire after a timeout)
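A minimal Go sketch of talking to etcd with the v3 client library; the import path, the endpoint 127.0.0.1:2379, and the key names are assumptions for illustration and vary by etcd release. It exercises the key-value, watchable, and time-to-live bullets above.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // import path depends on the etcd release
)

func main() {
	// Connect to a local etcd member (endpoint assumed for illustration).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Store a piece of configuration data (hypothetical key).
	if _, err := cli.Put(ctx, "/config/replica-count", "3"); err != nil {
		panic(err)
	}

	// Read it back; any member returns a consistent view.
	resp, err := cli.Get(ctx, "/config/replica-count")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	// Watchable: get notified whenever the key changes.
	wch := cli.Watch(context.Background(), "/config/replica-count")
	_ = wch // range over wch in a real program

	// Time to live: attach a key to a 10-second lease so it expires.
	if lease, err := cli.Grant(ctx, 10); err == nil {
		cli.Put(ctx, "/locks/node1", "alive", clientv3.WithLease(lease.ID))
	}
}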
Why etcd
● Vibrant community
● 500+ applications, such as Kubernetes and Cloud
Foundry, use it
● 150+ developers
● Stable releases
Conclusion
● Use an etcd sub-cluster to store configuration data
● No burden on the application to maintain
consistency
● And that's all!!
References
● https://raftconsensus.github.io/
● https://www.youtube.com/watch?v=YbZ3zDzDnrw
● https://github.com/coreos/etcd#etcd
Q & A
THANK YOU
