Non-Stop Hadoop
Applying Paxos to make critical Hadoop services Continuously Available
Jagane Sundar - CTO, WANdisco
Brett Rudenstein – Senior Product Manager, WANdisco
WANdisco Background
 WANdisco: Wide Area Network Distributed Computing
 Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today's data challenges of secure storage, scalability and availability
 Leader in tools for software engineers – Subversion
 Apache Software Foundation sponsor
 Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
 US patent for active-active replication technology granted, November 2012
 Global locations
- San Ramon (CA)
- Chengdu (China)
- Tokyo (Japan)
- Boston (MA)
- Sheffield (UK)
- Belfast (UK)
Customers
Recap of Server Software Architecture
Elementary Server Software:
Single thread processing client requests in a loop
[Diagram: a single-threaded server process loop – get client request (e.g. hbase put), make the change to state (db), send the return value to the client]
Multi-threaded Server Software:
Multiple threads processing client requests in a loop
[Diagram: one server process with worker threads 1–3; each thread gets a client request (e.g. hbase put), acquires the lock, makes the change to state (db), releases the lock, and sends the return value to the client]
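A minimal sketch of the loop in the diagram above (my illustration, not from the deck), assuming a simple in-memory key-value map as the "state" and leaving the network protocol out of scope; the point is only that every mutating operation is serialized by a lock:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the multi-threaded server loop: each worker thread takes a client
// request, acquires the lock, mutates the shared state (the "db"), releases the
// lock, and returns a result to the client.
public class MultiThreadedServer {
    private final Map<String, String> state = new HashMap<>();   // the server state ("db")
    private final ReentrantLock lock = new ReentrantLock();      // serializes state changes
    private final ExecutorService workers = Executors.newFixedThreadPool(3);

    // One "OP" from a client, e.g. an hbase-style put(key, value).
    public void handlePut(String key, String value, java.util.function.Consumer<String> reply) {
        workers.submit(() -> {
            lock.lock();                 // acquire lock
            try {
                state.put(key, value);   // make change to state (db)
            } finally {
                lock.unlock();           // release lock
            }
            reply.accept("OK " + key);   // send return value to client
        });
    }

    public static void main(String[] args) {
        MultiThreadedServer server = new MultiThreadedServer();
        server.handlePut("row1", "v1", System.out::println);
        server.workers.shutdown();
    }
}
```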
Continuously Available Servers
Multiple Servers replicated and serving the same content
[Diagram: three replicated servers – server1, server2, server3 – each running its own Server Process]
Problem
 How do we ensure that the three servers contain exactly the same data?
 In other words, how do we achieve strongly consistent replication?
Two parts to the solution:
 (Multiple Replicas of) A Deterministic State Machine
 The exact same sequence of operations, applied to each replica of the DSM
A Deterministic State Machine
 A state machine where a specific operation always results in a deterministic state
 Non-deterministic factors cannot play a role in the end state of any operation in a DSM. Examples of non-deterministic factors:
- Time
- Random values
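A small illustrative sketch (mine, not WANdisco's) of the distinction: the first operation is deterministic and safe to replicate, the second depends on the local wall clock and would drive replicas apart.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates why a replicated state machine must be deterministic:
// applying the same operations to two replicas must yield the same state.
public class DeterminismDemo {
    static class Replica {
        final List<String> entries = new ArrayList<>();

        // Deterministic: end state depends only on current state + operation arguments.
        void appendEntry(String name) {
            entries.add(name);
        }

        // NON-deterministic: end state depends on local wall-clock time,
        // so two replicas applying the "same" operation diverge.
        void appendEntryWithTimestamp(String name) {
            entries.add(name + "@" + System.currentTimeMillis());
        }
    }

    public static void main(String[] args) {
        Replica a = new Replica(), b = new Replica();
        a.appendEntry("/foo");              b.appendEntry("/foo");
        a.appendEntryWithTimestamp("/bar"); b.appendEntryWithTimestamp("/bar");
        System.out.println("replicas equal? " + a.entries.equals(b.entries)); // almost certainly false
    }
}
```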
Creating three replicated servers
Apply all modify operations in the exact same sequence in each replicated server = multiple servers with exactly the same replicated data
[Diagram: server1, server2 and server3, each a Server Process (DSM), each applying the identical ordered stream of operations]
Problem:
 How do we achieve consensus between these servers as to the sequence of operations to perform?
 Paxos is the answer
- An algorithm for reaching consensus in a network of unreliable processors
Three replicated servers
[Diagram: server1, server2 and server3 each pair a Server Process with a Distributed Coordination Engine; clients submit operations to any server, the engines reach agreement via Paxos, and every server applies the same ordered stream of operations]
Paxos Primer
Paxos
 Paxos is an Algorithm for building Replicated Servers with strong consistency
1. The Synod algorithm for achieving consensus among a network of unreliable processes
2. The application of consensus to the task of replicating a Deterministic State Machine
 Paxos does not
- Specify a network protocol
- Invent a new language
- Restrict use to a specific language
Replicated State Machine
 Installed on each node that participates in the distributed system
 All nodes function as peers to deliver and assure the same transaction order on every system
- Achieve consistent replication
 Consensus
- Roles: Proposers, Acceptors, Learners
- Phases: election of a node to be the proposer; broadcast of the proposal to peers; acceptance of the proposal by a majority
Paxos Roles
 Proposer
- The client or a proxy for the client
- Proposes a change to the Deterministic State Machine
 Acceptor
- Acceptors are the 'memory' of Paxos
- Quorum is established amongst Acceptors
 Learner
- The DSM (replicated, of course)
- Each Learner applies the exact same sequence of operations as proposed by the Proposers and accepted by a majority quorum of Acceptors
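The three roles above can be pictured as interfaces; this is only a hedged sketch of their shape, not WANdisco's DConE API, and the byte[] "operation" payload is an assumption for illustration.

```java
// Sketch of the three Paxos roles as Java interfaces (illustrative only,
// not the DConE API). An "operation" here is just an opaque byte[] payload.

// Proposer: the client, or a proxy for the client, proposing a change to the DSM.
interface Proposer {
    void propose(byte[] operation);
}

// Acceptor: the durable "memory" of Paxos; a quorum of acceptors makes a value chosen.
interface Acceptor {
    // Phase 1: promise not to accept proposals numbered below n (false if already promised higher).
    boolean prepare(long proposalNumber);
    // Phase 2: accept the value if the promise still holds.
    boolean accept(long proposalNumber, byte[] operation);
}

// Learner: the replicated Deterministic State Machine; applies chosen operations in agreed order.
interface Learner {
    void learn(long sequenceNumber, byte[] operation);
}
```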
Paxos - Ordering
 Proposers issue a new sequence number with a higher value than the last sequence number known
 A majority of Acceptors agree that this number has not been seen before
 Consensus must be reached on the current proposal
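A minimal sketch of the acceptor-side rule those bullets describe: a proposal number is only promised or accepted if it is at least as high as anything the acceptor has already seen. This is illustrative and deliberately incomplete (a full Paxos prepare would also return any previously accepted value).

```java
// Sketch of an acceptor enforcing proposal-number ordering: it promises a round
// only if the number is higher than any it has already seen, and accepts a value
// only if no higher promise has been made in the meantime.
public class SimpleAcceptor {
    private long highestPromised = -1;
    private long highestAccepted = -1;
    private byte[] acceptedValue;

    // Phase 1 (prepare): promise to ignore lower-numbered proposals.
    public synchronized boolean prepare(long n) {
        if (n > highestPromised) {
            highestPromised = n;
            return true;          // promise granted
        }
        return false;             // a higher-numbered proposal has already been seen
    }

    // Phase 2 (accept): accept the value if the promise still stands.
    public synchronized boolean accept(long n, byte[] value) {
        if (n >= highestPromised) {
            highestPromised = n;
            highestAccepted = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }
}
```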
WANdisco DConE
Beyond Paxos
DConE Innovations
Beyond Paxos
 Quorum Configurations
- Majority, Singleton, Unanimous
- Distinguished Node – Tie Breaker
- Quorum Rotations – follow the sun
- Emergency Reconfigure
 Concurrent agreement handling
- Paxos only allows agreement on one proposal at a time, which is slow in a high transaction volume environment
- DConE allows simultaneous proposals from multiple proposers
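A hedged sketch of how the quorum options listed above might be expressed; this is my illustration of the named styles, not DConE's actual configuration model.

```java
import java.util.Set;

// Sketch of the quorum styles named on the slide: majority, singleton, unanimous,
// and majority with a distinguished tie-breaker node for even splits.
// Illustrative only -- not WANdisco's configuration model.
public class QuorumCheck {
    enum Style { MAJORITY, SINGLETON, UNANIMOUS, MAJORITY_WITH_TIE_BREAKER }

    static boolean hasQuorum(Style style, Set<String> members, Set<String> acks,
                             String singletonNode, String tieBreakerNode) {
        switch (style) {
            case SINGLETON:
                return acks.contains(singletonNode);           // one designated node decides
            case UNANIMOUS:
                return acks.containsAll(members);              // every member must agree
            case MAJORITY:
                return acks.size() * 2 > members.size();       // strict majority
            case MAJORITY_WITH_TIE_BREAKER:
                // exactly half is enough only if the distinguished node is among the acks
                return acks.size() * 2 > members.size()
                        || (acks.size() * 2 == members.size() && acks.contains(tieBreakerNode));
            default:
                return false;
        }
    }
}
```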
DConE Innovations
Beyond Paxos
 Dynamic group evolution
- Add and remove nodes
- Add and remove sites
- No interruption of current operations
 Distributed garbage collection
- Safely discard state on disk and in memory when it is no longer required to assist in recovery
- Messages are sent to peers at pre-defined intervals to determine the highest common agreement
- Agreements and agreed proposals that are no longer needed for recovery are deleted
DConE Innovations
Beyond Paxos
 Backoff and collision avoidance
- Avoids repeated pre-emption of proposers by their peers
- Prevents thrashing, which can severely degrade performance
- When a round is pre-empted, a backoff delay is computed before the next round is initiated
Self Healing
Automatic Back up and Recovery
 All nodes are mirrors/replicas of each other
- Any node can be used as a helper to bring a recovering node back
 Read access without quorum
- The cluster is still accessible for reads
- No writes are allowed, preventing split brain
 Automatic catch-up
- Servers that have been offline learn of transactions that were agreed while they were unavailable
- The missing transactions are played back; once caught up, the servers become fully participating members of the distributed system again
 Servers can be updated without downtime
- Allows for rolling upgrades
‘Co-ordinate intent, not the outcome’
- Yeturu Aahlad
Active-Active, not Active-Standby
Co-ordinating intent
[Diagram, co-ordinating intent: server1 and server2 concurrently propose createFile /a and mkdir /a; Paxos agrees a single order, both servers apply createFile /a, and the conflicting mkdir /a operation fails identically on both]
[Diagram, co-ordinating outcome (WAL, HDFS Edits Log, etc.): each server applies its own operation first and then ships the outcome; server1's state is wrong and its mkdir /a operation needs to be undone]
HDFS
 Recap
HDFS Architecture
 HDFS metadata is decoupled from data
- Namespace is a hierarchy of files and directories represented by INodes
- INodes record attributes: permissions, quotas, timestamps, replication
 NameNode keeps its entire state in RAM
- Memory state: the namespace tree and the mapping of blocks to DataNodes
- Persistent state: the most recent checkpoint of the namespace plus the journal log
 File data is divided into blocks (default 128 MB)
- Each block is independently replicated on multiple DataNodes (default 3)
- Block replicas are stored on DataNodes as local files on local drives
A reliable distributed file system for storing very large data sets
HDFS Cluster
 Single active NameNode
 Thousands of DataNodes
 Tens of thousands of HDFS clients
Active-Standby Architecture
Standard HDFS operations
 Active NameNode workflow
1. Receive a request from a client
2. Apply the update to its memory state
3. Record the update as a journal transaction in persistent storage
4. Return the result to the client
 HDFS Client (read or write to a file)
- Send the request to the NameNode, receive replica locations
- Read or write data from or to DataNodes
 DataNode
- Data transfer to/from clients and between DataNodes
- Report replica state changes to NameNode(s): new, deleted, corrupt
- Report its state to NameNode(s): heartbeats, block reports
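The four-step NameNode workflow above, sketched as Java pseudocode; the method names (applyToNamespace, journal) are placeholders for illustration, not actual HDFS NameNode APIs.

```java
// Sketch of the standard active-NameNode write path described above.
// Method names (applyToNamespace, journal) are placeholders, not the
// org.apache.hadoop.hdfs.server.namenode APIs.
public class ActiveNameNodeSketch {
    public String handleClientRequest(String op) {        // 1. receive request from a client
        String result = applyToNamespace(op);              // 2. apply the update to the in-memory namespace
        journal(op);                                        // 3. record the update as a journal transaction
        return result;                                      // 4. return result to the client
    }

    private String applyToNamespace(String op) { return "ok: " + op; }
    private void journal(String op) { /* append to the edits log on persistent storage */ }
}
```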
Consensus Node
 Coordinated Replication of HDFS Namespace
Replicated Namespace
 Replicated NameNode is called a ConsensusNode or CNode
 ConsensusNodes play an equal, active role on the cluster
- Provide write and read access to the namespace
 The namespace replicas are consistent with each other
- Each CNode maintains a copy of the same namespace
- Namespace updates applied to one CNode are propagated to the others
 The Coordination Engine establishes the global order of namespace updates
- All CNodes apply the same deterministic updates in the same deterministic order
- Starting from the same initial state and applying the same updates = consistency
The Coordination Engine provides consistency of multiple namespace replicas
Coordinated HDFS Cluster
 Independent CNodes – the same namespace
 Load balancing client requests
 Proposal, Agreement
 Coordinated updates
Multiple active Consensus Nodes share the namespace via the Coordination Engine
Coordinated HDFS operations
 ConsensusNode workflow
1. Receive request from a client
2. Submit a proposal for the update to the Coordination Engine and wait for the agreement
3. Apply the agreed update to its memory state
4. Record the update as a journal transaction in persistent storage (optional)
5. Return the result to the client
 HDFS Client and DataNode operations remain the same
Updates to the namespace (e.g. when a file or a directory is created) are coordinated
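For contrast with the standard workflow, a sketch of the coordinated path: the update is applied only once the Coordination Engine hands back the agreement. The CoordinationEngine interface below is assumed for illustration; it is not the DConE API.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the coordinated ConsensusNode write path: propose first, apply only
// after the Coordination Engine returns the agreement.
public class ConsensusNodeSketch {
    interface CoordinationEngine {
        // Submits a proposal and completes when it comes back as an agreement
        // carrying its position in the global sequence (GSN).
        CompletableFuture<Long> submitProposal(String op);
    }

    private final CoordinationEngine engine;
    ConsensusNodeSketch(CoordinationEngine engine) { this.engine = engine; }

    public String handleClientRequest(String op) throws Exception {
        long gsn = engine.submitProposal(op).get();  // 2. submit proposal, wait for agreement
        String result = applyToNamespace(gsn, op);   // 3. apply the agreed update to memory state
        journal(gsn, op);                            // 4. optionally record it as a journal transaction
        return result;                               // 5. return result to the client
    }

    private String applyToNamespace(long gsn, String op) { return "ok@" + gsn + ": " + op; }
    private void journal(long gsn, String op) { /* optional persistent record */ }
}
```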
Strict Consistency Model
 The Coordination Engine transforms namespace modification proposals into a global sequence of agreements
- Applied to the namespace replicas in the order of their Global Sequence Number (GSN)
 ConsensusNodes may have different states at a given moment of "clock" time
- The rate of consuming agreements may vary
 CNodes have the same namespace state when they reach the same GSN
 One-copy equivalence – each replica is presented to the client as if it had only one copy, as known in replicated databases
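A small sketch of that ordering rule (illustrative only): agreements may arrive at different rates on different CNodes, but each replica applies them strictly in contiguous GSN order, so replicas that have reached the same GSN have the same state.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Sketch of strict GSN ordering: agreements can arrive out of order, but they are
// buffered and applied to the namespace replica only in contiguous GSN order.
public class GsnApplier {
    private long nextGsn = 0;                                    // next agreement we may apply
    private final Map<Long, String> pending = new TreeMap<>();   // out-of-order agreements, keyed by GSN
    private final Consumer<String> applyToNamespace;

    public GsnApplier(Consumer<String> applyToNamespace) {
        this.applyToNamespace = applyToNamespace;
    }

    public synchronized void onAgreement(long gsn, String op) {
        pending.put(gsn, op);
        // Drain every contiguous agreement starting at nextGsn.
        while (pending.containsKey(nextGsn)) {
            applyToNamespace.accept(pending.remove(nextGsn));
            nextGsn++;
        }
    }
}
```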
Consensus Node Proxy
 CNodeProxyProvider – a pluggable substitute of FailoverProxyProvider
- Defined via Configuration
 ...
Alternatives to a Paxos-based Replicated State Machine
Using a TCP connection to send data to three replicated servers (Load Balancer)
[Diagram: clients send operations through a Load Balancer, which forwards them over TCP to server1, server2 and server3; the replicas end up receiving different numbers of operations]
Problems with using a Load Balancer
 Load balancer becomes the single point of failure
- Need to make the LB itself highly available and distributed
 Since Paxos is not employed to reach consensus between the three replicas, strong consistency cannot be guaranteed
- The replicas will quickly diverge
HBase WAL or HDFS Edits Log replication
 The State Machine (HRegion contents, HDFS NameNode metadata, etc.) is modified first
 The modification log (HBase WAL or HDFS Edits Log) is then sent to highly available shared storage, QJM, etc.
 Standby server(s) read the edits log and serve as warm standbys, ready to take over should the active server fail
HBase WAL or HDFS Edits Log replication
[Diagram: a single active server (server1) applies operations and writes the WAL/Edits Log to shared storage; a standby server (server2) tails the log from the shared storage]
HBase WAL or HDFS Edits Log tailing
 Only one active server is possible
 Failover takes time
 Failover is error prone, with intricate fencing etc.
 The cost of reaching consensus must be paid anyway before an HDFS Edits Log entry is deemed safely stored, so why not pay that cost before modifying the state and thereby have multiple active servers?
HBase Continuous Availability
HBase Single Points of Failure
 HBase Region Server
 HBase Master
HBase Region Server Replication
NonStopRegionServer: Subclassing the HRegionServer
[Diagram: two NonStopRegionServers, each wrapping an HRegionServer plus the DConE library behind the client service (e.g. multi)]
1. Client calls HRegionServer multi
2. NonStopRegionServer intercepts the call
3. NonStopRegionServer makes a Paxos proposal using the DConE library
4. The proposal comes back as an agreement on all NonStopRegionServers
5. Each NonStopRegionServer calls super.multi; state changes are recorded on all nodes
6. NonStopRegionServer 1 alone sends the response back to the client
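A heavily simplified sketch of the interception pattern the slide describes. The real HRegionServer.multi signature uses protobuf request/response types and the DConE library is not public, so both are stood in for by placeholder types here; this is an illustration of the pattern, not WANdisco's implementation.

```java
// Simplified sketch of the NonStopRegionServer pattern described on the slide.
// HRegionServerStub stands in for org.apache.hadoop.hbase.regionserver.HRegionServer
// and DConE stands in for WANdisco's coordination library; both are placeholders.
class HRegionServerStub {
    Object multi(Object multiRequest) {           // applies the batched mutations locally
        return "applied:" + multiRequest;
    }
}

interface DConE {
    // Submits a proposal and blocks until it comes back as an agreement
    // delivered in the same order on every NonStopRegionServer.
    Object agree(Object proposal) throws InterruptedException;
}

class NonStopRegionServer extends HRegionServerStub {
    private final DConE coordination;
    private final boolean respondsToClient;       // only one replica answers the client

    NonStopRegionServer(DConE coordination, boolean respondsToClient) {
        this.coordination = coordination;
        this.respondsToClient = respondsToClient;
    }

    @Override
    Object multi(Object multiRequest) {
        try {
            // Steps 2-4: intercept the call, propose it via DConE, wait for the agreement.
            Object agreed = coordination.agree(multiRequest);
            // Step 5: every replica applies the agreed mutation through the parent class.
            Object result = super.multi(agreed);
            // Step 6: only the replica that owns the client connection sends the response.
            return respondsToClient ? result : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```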
HBase RegionServer replication using WANdisco DConE
 Shared nothing architecture
 HFiles, WALs etc. are not shared
 Replica count is tunable
 Snapshots of HFiles do not need to be created
 Messy details of WAL tailing are not necessary
HBase RegionServer replication using WANdisco DConE
 Not an eventual consistency model
 Does not serve up stale data
DEMO
Thank you
Jagane Sundar
jagane.sundar@wandisco.com
@jagane
Slide notes
  • Sequenced set of operations
    Proposers issue a new proposal number higher than the last sequence number they are aware of
    A majority agrees that a higher number has not been seen and, if so, allows the transaction to complete
    Consensus must be reached on the current proposal

  • Seven key innovations over Paxos
  • Distributed garbage collection
    Any system that deals with distributed state should be able to safely discard state information on disk and in memory for efficient resource utilization. The point at which it is safe to do so is the point at which the state information is no longer required to assist in the recovery of a node at any site. Each DConE instance sends messages to its peers at other nodes at pre-defined intervals to determine the highest contiguously populated agreement common to all of them. It then deletes all agreements from the agreement log, and all agreed proposals from the proposal log that are no longer needed for recovery.
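A sketch of the pruning rule described in that note (illustrative only, not DConE code): each peer reports its highest contiguously populated agreement, and everything up to the minimum of those values can be discarded safely.

```java
import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the distributed garbage collection rule in the note above: agreements
// at or below the highest contiguous agreement common to ALL peers are safe to discard.
public class AgreementLogGc {
    private final NavigableMap<Long, String> agreementLog = new TreeMap<>();

    public void record(long agreementNumber, String op) {
        agreementLog.put(agreementNumber, op);
    }

    // peersHighestContiguous: each peer's highest contiguously populated agreement number.
    public void garbageCollect(Collection<Long> peersHighestContiguous) {
        long safePoint = peersHighestContiguous.stream().min(Long::compare).orElse(-1L);
        // Everything at or below the common safe point is no longer needed for recovery.
        agreementLog.headMap(safePoint, true).clear();
    }
}
```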

    Distinguished and fair round numbers for proposals

    DConE’s use of distinguished and fair round numbers in the process of achieving consensus avoids the contention that would otherwise arise when multiple proposals are submitted simultaneously by different nodes using the same round number. If this option is used, the round number will consist of three components: (1) a monotonically increasing component which is simply the increment of the last monotonic component; (2) a distinguished component which is a component specific to each proposer and (3) a random component. If two proposers clash on the first component, then the random component is evaluated, and the proposer whose number has the larger random number component wins. If there is still no winner, then the distinguished component is compared, and the winner is the one with the largest distinguished component. Without this approach the competing nodes could end up simply incrementing the last attempted round number and resubmitting their proposals. This could lead to thrashing that would negatively impact the performance of the distributed system. This approach also ensures fairness in the sense that it prevents any node from always winning.
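A sketch of the comparison rule in that paragraph: the monotonic component first, then the random component on a clash, then the distinguished component as the final tie-breaker. Illustrative only, not DConE code.

```java
import java.util.Comparator;

// Sketch of the "distinguished and fair" round number described above:
// (1) a monotonically increasing component, (2) a per-proposer distinguished
// component, (3) a random component. On a clash of the monotonic component,
// the random component is compared first, then the distinguished component.
public final class RoundNumber {
    final long monotonic;
    final long distinguished;
    final long random;

    RoundNumber(long monotonic, long distinguished, long random) {
        this.monotonic = monotonic;
        this.distinguished = distinguished;
        this.random = random;
    }

    static final Comparator<RoundNumber> ORDER =
            Comparator.comparingLong((RoundNumber r) -> r.monotonic)
                      .thenComparingLong(r -> r.random)
                      .thenComparingLong(r -> r.distinguished);
}
```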
    Weak Reservations
    DConE provides an optional weak reservation mechanism to eliminate pre-emption of proposers under high transaction volume scenarios. For example, if there are three proposers - one, two and three - the proposer’s number determines which range of agreement numbers that proposer will drive. This avoids any possibility of collisions among the multiple proposals from each proposer that are proceeding in parallel across the distributed system.



  • Dynamic group evolution
    DConE supports the concept of dynamic group evolution, allowing a distributed system to scale to support new sites and users. New nodes can be added to a distributed system, or existing nodes can be removed without interrupting the operation of the remaining nodes.

    Backoff and collision avoidance
    DConE provides a backoff mechanism for avoiding repeated pre-emption of proposers by their peers. Conventional replicated state machines allow the preempted proposer to immediately initiate a new round with an agreement number higher than that of the pre-emptor. This approach can lead an agreement protocol to thrash for an extended period of time and severely degrade performance.
    With DConE, when a round is pre-empted, the DConE instance which initiated the proposal computes the duration of the backoff delay. The proposer then waits for this duration before initiating the next round. DConE uses an approach similar to Carrier Sense Multiple Access/Collision Detection (CSMA/CD) protocols for non-switched Ethernet.
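A sketch of a CSMA/CD-style backoff of the kind that note describes (my illustration, with assumed delay constants): after the k-th pre-emption the proposer waits a random delay drawn from an exponentially growing window before starting a new round.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a CSMA/CD-style backoff for a pre-empted proposer: the delay window
// grows with each consecutive pre-emption, and the actual wait is a random value
// inside that window, so competing proposers are unlikely to collide again.
public class ProposerBackoff {
    private static final long BASE_DELAY_MS = 10;
    private static final long MAX_DELAY_MS = 5_000;

    // attempt = number of consecutive pre-emptions seen for this proposal.
    public static long nextDelayMillis(int attempt) {
        long window = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(window + 1);
    }
}
```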

    Multiple proposers
    Both say "I want tx 179", so they are competing
    Collision avoidance
    A Paxos round sends out a read message to the acceptors


  • Disadvantages:
    1. Resources used to support the Standby
    2. A single NameNode is a bottleneck
    3. Failover is complex and still an outage
    We can do better than that with consistent replication
  • Double determinism is important
  • NameNodes start from the same state and apply the same deterministic updates in the same deterministic order, so their states are consistent.
    Independent NameNodes don’t know about each other