NonStop Hadoop - Applying the Paxos Family of Protocols to make Critical Hadoop Services Continuously Available
Slide notes
  • Sequenced set of operations
    Proposers (the nodes that propose) issue a new number of a higher value based on the last sequence they are aware of
    A majority agrees that a higher number has not been seen and, if so, allows the transaction to complete
    Consensus must be reached on the current proposal
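    A minimal illustration of this ordering rule in Java (the class names are hypothetical, not the DConE or Hadoop APIs): a proposer always issues a round number higher than the highest it has seen, and an acceptor promises a round only if it has not already seen a higher one; a majority of such promises is what lets the transaction complete.

    // Minimal sketch of the Paxos ordering rule from the note above.
    // All names are illustrative; this is not the DConE API.
    import java.util.concurrent.atomic.AtomicLong;

    class Proposer {
        private final AtomicLong highestSeen = new AtomicLong(0);

        /** Issue a new round number strictly higher than anything seen so far. */
        long nextRound() {
            return highestSeen.incrementAndGet();
        }

        /** Remember higher rounds observed from peers so future proposals outbid them. */
        void observe(long round) {
            highestSeen.accumulateAndGet(round, Math::max);
        }
    }

    class Acceptor {
        private long highestPromised = 0;

        /**
         * Phase-1 "prepare": promise a round only if no higher round has been seen.
         * A majority of acceptors promising is what allows the transaction to complete.
         */
        synchronized boolean promise(long round) {
            if (round > highestPromised) {
                highestPromised = round;
                return true;   // this acceptor has not seen a higher number
            }
            return false;      // already promised a higher round; reject
        }
    }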

  • Seven key innovations over Paxos
  • Distributed garbage collection
    Any system that deals with distributed state should be able to safely discard state information on disk and in memory for efficient resource utilization. The point at which it is safe to do so is the point at which the state information is no longer required to assist in the recovery of a node at any site. Each DConE instance sends messages to its peers at other nodes at pre-defined intervals to determine the highest contiguously populated agreement common to all of them. It then deletes all agreements from the agreement log, and all agreed proposals from the proposal log that are no longer needed for recovery.
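    A rough sketch of this pruning rule (illustrative names only, not the DConE implementation): each node learns the highest contiguously applied agreement at every peer, takes the minimum across all of them, and discards log entries at or below that point, since they can no longer be needed to recover any node.

    // Sketch of the garbage-collection rule described above (illustrative only).
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    class AgreementLog {
        // agreement number -> agreed proposal (persisted elsewhere in a real system)
        private final NavigableMap<Long, String> agreements = new TreeMap<>();

        void record(long agreementNumber, String proposal) {
            agreements.put(agreementNumber, proposal);
        }

        /**
         * peersHighestContiguous maps each peer (including ourselves) to the highest
         * agreement number it has applied with no gaps. Everything at or below the
         * minimum of those values is no longer needed to recover any node.
         */
        void garbageCollect(Map<String, Long> peersHighestContiguous) {
            long safePoint = peersHighestContiguous.values().stream()
                    .mapToLong(Long::longValue)
                    .min()
                    .orElse(0L);
            // Discard agreements up to the common safe point.
            agreements.headMap(safePoint, true).clear();
        }
    }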

    Distinguished and fair round numbers for proposals
    DConE’s use of distinguished and fair round numbers in the process of achieving consensus avoids the contention that would otherwise arise when multiple proposals are submitted simultaneously by different nodes using the same round number. If this option is used, the round number consists of three components: (1) a monotonically increasing component, which is simply the increment of the last monotonic component; (2) a distinguished component, which is specific to each proposer; and (3) a random component. If two proposers clash on the first component, the random component is evaluated, and the proposer whose number has the larger random component wins. If there is still no winner, the distinguished component is compared, and the winner is the one with the larger distinguished component. Without this approach the competing nodes could end up simply incrementing the last attempted round number and resubmitting their proposals. This could lead to thrashing that would negatively impact the performance of the distributed system. This approach also ensures fairness in the sense that it prevents any node from always winning.

    Weak reservations
    DConE provides an optional weak reservation mechanism to eliminate pre-emption of proposers under high transaction volume scenarios. For example, if there are three proposers - one, two and three - the proposer’s number determines which range of agreement numbers that proposer will drive. This avoids any possibility of collisions among the multiple proposals from each proposer that are proceeding in parallel across the distributed system.
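    A compact sketch of both mechanisms (the field names and the modulo partitioning below are assumptions for illustration, not the published DConE design): the comparator resolves clashes first on the monotonic component, then on the random component, then on the distinguished component; the helper shows one plausible way a weak reservation could map agreement numbers onto proposers so that their parallel proposals never collide.

    // Illustrative only: field names and the modulo partitioning are assumptions,
    // not the published DConE design.
    import java.util.Comparator;

    final class RoundNumber {
        final long monotonic;      // increment of the last monotonic component
        final long random;         // random component, compared on a clash
        final long distinguished;  // per-proposer component, the final tie-breaker

        RoundNumber(long monotonic, long random, long distinguished) {
            this.monotonic = monotonic;
            this.random = random;
            this.distinguished = distinguished;
        }

        /** Larger round number wins, in the order described in the note above. */
        static final Comparator<RoundNumber> ORDER =
                Comparator.comparingLong((RoundNumber r) -> r.monotonic)
                          .thenComparingLong(r -> r.random)
                          .thenComparingLong(r -> r.distinguished);
    }

    final class WeakReservation {
        /**
         * One plausible reservation scheme: proposer i exclusively drives agreement
         * numbers where (agreementNumber % proposerCount) == i, so concurrent
         * proposals from different proposers can never target the same slot.
         */
        static int ownerOf(long agreementNumber, int proposerCount) {
            return (int) (agreementNumber % proposerCount);
        }
    }

    With three proposers, for example, ownerOf(179, 3) == 2, so only proposer 2 would drive agreement 179 under this scheme, and the "both want tx 179" collision described later in these notes cannot arise.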



  • Dynamic group evolution
    DConE supports the concept of dynamic group evolution, allowing a distributed system to scale to support new sites and users. New nodes can be added to a distributed system, or existing nodes can be removed without interrupting the operation of the remaining nodes.

    Backoff and collision avoidance
    DConE provides a backoff mechanism for avoiding repeated pre-emption of proposers by their peers. Conventional replicated state machines allow the pre-empted proposer to immediately initiate a new round with an agreement number higher than that of the pre-emptor. This approach can lead an agreement protocol to thrash for an extended period of time and severely degrade performance.
    With DConE, when a round is pre-empted, the DConE instance that initiated the proposal computes the duration of a backoff delay. The proposer then waits for this duration before initiating the next round. DConE uses an approach similar to the Carrier Sense Multiple Access/Collision Detection (CSMA/CD) protocol for non-switched Ethernet, as sketched below.
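    The notes do not give the exact formula, so the following is only a sketch of a CSMA/CD-style randomized backoff: the delay window grows with the number of consecutive pre-emptions, and the wait is randomized so that two competing proposers are unlikely to retry at the same instant.

    // Sketch of a CSMA/CD-style backoff; the actual DConE formula is not specified here.
    import java.util.concurrent.ThreadLocalRandom;

    final class Backoff {
        private static final long SLOT_MILLIS = 10;   // assumed base delay unit
        private static final int MAX_EXPONENT = 10;   // cap the growth of the window

        /**
         * After the k-th consecutive pre-emption, wait a random number of slots
         * in [0, 2^k), so repeated collisions spread the proposers apart in time.
         */
        static long delayMillis(int consecutivePreemptions) {
            int exp = Math.min(consecutivePreemptions, MAX_EXPONENT);
            long window = 1L << exp;
            return ThreadLocalRandom.current().nextLong(window) * SLOT_MILLIS;
        }
    }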

    Multiple proposers
    Both say “I want tx 179”, so they are competing.

    Collision avoidance


    Paxos round… sends out a read message to the acceptors


  • Disadvantages:
    1. Resources are used to support the Standby
    2. A single NN is a bottleneck
    3. Failover is complex, and still an outage
    We can do better than that with consistent replication
  • Double determinism is important
  • If NameNodes start from the same state and apply the same deterministic updates in the same deterministic order, their states are consistent.
    Independent NameNodes don’t know about each other
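    A minimal sketch of why this works (illustrative names, not the ConsensusNode code): if every replica starts from the same state and applies the same agreed, deterministic operations in the same global order, the replicas can never diverge; two replicas that have applied the same sequence number hold identical state.

    // Illustrative deterministic state machine replica; not the actual ConsensusNode code.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;
    import java.util.function.UnaryOperator;

    class ReplicatedNamespace {
        private List<String> state = new ArrayList<>();   // same initial state on every replica
        private long lastApplied = 0;                      // global sequence number (GSN) applied so far
        private final TreeMap<Long, UnaryOperator<List<String>>> agreed = new TreeMap<>();

        /** Called when the coordination engine delivers an agreed operation with its GSN. */
        synchronized void deliver(long gsn, UnaryOperator<List<String>> deterministicOp) {
            agreed.put(gsn, deterministicOp);
            // Apply strictly in GSN order, with no gaps: every replica does exactly this.
            while (agreed.containsKey(lastApplied + 1)) {
                state = agreed.remove(lastApplied + 1).apply(state);
                lastApplied++;
            }
        }

        synchronized long appliedGsn() { return lastApplied; }
    }

    Replicas may lag each other in wall-clock time, but once two of them reach the same GSN their namespace replicas are identical, which is the one-copy-equivalence described on slide 34.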
  • Transcript

    • 1. Non-Stop Hadoop Applying Paxos to make critical Hadoop services Continuously Available Jagane Sundar - CTO, WANdisco Brett Rudenstein – Senior Product Manager, WANdisco
    • 2. WANdisco Background  WANdisco: Wide Area Network Distributed Computing  Enterprise ready, high availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability  Leader in tools for software engineers – Subversion  Apache Software Foundation sponsor  Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)  US patented active-active replication technology granted, November 2012  Global locations - San Ramon (CA) - Chengdu (China) - Tokyo (Japan) - Boston (MA) - Sheffield (UK) - Belfast (UK)
    • 3. Customers
    • 4. Recap of Server Software Architecture
    • 5. Elementary Server Software: Single thread processing client requests in a loop Server Process make change to state (db) OP OP OP OP get client request e.g. hbase put send return value to client
    • 6. Multi-threaded Server Software: Multiple threads processing client requests in a loop Server Process make change to state (db) get client request e.g. hbase put send return value to client OP OP OP OP OP OP OP OPOP OP OP OP thread 1 thread 3 thread 2 thread 1 thread 2 thread 3 acquire lock release lock
    • 7. Continuously Available Servers Multiple Servers replicated and serving the same content server1 Server Process server2 Server Process server3 Server Process
    • 8. Problem  How do we ensure that the three servers contain exactly the same data?  In other words, how do you achieve strongly consistent replication?
    • 9. Two parts to the solution:  (Multiple Replicas of) A Deterministic State Machine  The exact same sequence of operations, to be applied to each replica of the DSM
    • 10. A Deterministic State Machine  A state machine where a specific operation will always result in a deterministic state  Non deterministic factors cannot play a role in the end state of any operation in a DSM. Examples of non-deterministic factors - Time - Random
    • 11. Creating three replicated servers Apply all modify operations in the same exact sequence in each replicated server = Multiple Servers with exactly the same replicated data server1 Server Process (DSM) server2 Server Process (DSM) server3 Server Process (DSM) O P O P O P O P O P O P O P O P O P O P O P O P
    • 12. Problem:  How to achieve consensus between these servers as to the sequence of operations to perform?  Paxos is the answer - Algorithm for reaching consensus in a network of unreliable processors
    • 13. Three replicated servers server3 Server Process OP OP OP OP Distributed Coordination Engine server2 Server Process Distributed Coordination Engine OP OP OP OP server 1 Server Process OP OP OP OP Distributed Coordination Engine Paxos Client Client Client Client Client Paxos OP OP OP OP
    • 14. Paxos Primer
    • 15. Paxos  Paxos is an Algorithm for building Replicated Servers with strong consistency 1. Synod algorithm for achieving consensus among a network of unreliable processes 2. The application of consensus to the task of replicating a Deterministic State Machine  Paxos does not - Specify a network protocol - Invent a new language - Restrict use in a specific language
    • 16. Replicated State Machine  Installed on each node that participates in the distributed system  All nodes function as peers to deliver and ensure that the same transaction order occurs on every system - Achieve Consistent Replication  Consensus - Roles • Proposers, Acceptors, Learners - Phases • Election of a node to be the proposer • Broadcast of the proposal to peers • Acceptance of the proposal by a majority
    • 17. Paxos Roles  Proposer - The client or a proxy for the client - Proposes a change to the Deterministic State Machine  Acceptor - Acceptors are the ‘memory’ of Paxos - Quorum is established amongst acceptors  Learner - The DSM (Replicated, of course) - Each Learner applies the exact same sequence of operations as proposed by the Proposers, and accepted by a majority quorum of Acceptors
    • 18. Paxos - Ordering  Proposers issue a new sequence number of a higher value from the last sequence known  A majority agrees this number has not been seen  Consensus must be reached on the current proposal
    • 19. WANdisco DConE Beyond Paxos
    • 20. DConE Innovations Beyond Paxos  Quorum Configurations - Majority, Singleton, Unanimous - Distinguished Node – Tie Breaker - Quorum Rotations – follow the sun - Emergency Reconfigure  Concurrent agreement handling - Paxos only allows agreements on one proposal at a time • Slow performance in a high transaction volume environment - DConE allows simultaneous proposals from multiple proposers
    • 21. DConE Innovations Beyond Paxos  Dynamic group evolution - Add and remove nodes - Add and remove sites - No interruption of current operations  Distributed garbage collection - Safely discard state on disk and in memory when it is no longer required to assist in recovery - Messages are sent to peers at pre-defined intervals to determine the highest common agreement - Agreements and agreed proposals that are no longer needed for recovery are deleted
    • 22. DConE Innovations Beyond Paxos  Backoff and collision avoidance - Avoids repeated pre-emption of proposers by their peers - Prevents thrashing which can severely degrade performance. - When a round is pre-empted, a backoff delay is computed
    • 23. Self Healing, Automatic Backup and Recovery  All nodes are mirrors/replicas of each other - Any node can be used as a helper to bring it back  Read access without Quorum - Cluster is still accessible for reads - No writes, which prevents split brain  Automatic catch up - Servers that have been offline learn of transactions that were agreed on while they were unavailable - The missing transactions are played back and, once caught up, the servers become fully participating members of the distributed system again  Servers can be updated without downtime - Allows for rolling upgrades
    • 24. ‘Co-ordinate intent, not the outcome’ - Yeturu Aahlad Active-Active, not Active-Standby
    • 25. Co-ordinating intent Proposal to mkdir /a P a x o s server2 server 1 Proposal to createFile /a createFile /a createFile /a mkdir /a mkdir /a Op fails Op fails mkdir /a Co-ordinate outcome (WAL, HDFS Edits Log, etc.) server2 server 1 createFile /a server1 state is wrong mkdir /a operation needs to be undone Co-ordinating outcome
    • 26. HDFS  Recap 26
    • 27. HDFS Architecture  HDFS metadata is decoupled from data - Namespace is a hierarchy of files and directories represented by INodes - INodes record attributes: permissions, quotas, timestamps, replication  NameNode keeps its entire state in RAM - Memory state: the namespace tree and the mapping of blocks to DataNodes - Persistent state: recent checkpoint of the namespace and journal log  File data is divided into blocks (default 128MB) - Each block is independently replicated on multiple DataNodes (default 3) - Block replicas stored on DataNodes as local files on local drives Reliable distributed file system for storing very large data sets 27
    • 28. HDFS Cluster  Single active NameNode  Thousands of DataNodes  Tens of thousands of HDFS clients Active-Standby Architecture 28
    • 29. Standard HDFS operations  Active NameNode workflow 1. Receive request from a client, 2. Apply the update to its memory state, 3. Record the update as a journal transaction in persistent storage, 4. Return result to the client  HDFS Client (read or write to a file) - Send request to the NameNode, receive replica locations - Read or write data from or to DataNodes  DataNode - Data transfer to / from clients and between DataNodes - Report replica state change to NameNode(s): new, deleted, corrupt - Report its state to NameNode(s): heartbeats, block reports 29
    • 30. Consensus Node  Coordinated Replication of HDFS Namespace 30
    • 31. Replicated Namespace  Replicated NameNode is called a ConsensusNode or CNode  ConsensusNodes play equal active role on the cluster - Provide write and read access to the namespace  The namespace replicas are consistent with each other - Each CNode maintains a copy of the same namespace - Namespace updates applied to one CNode propagated to the others  Coordination Engine establishes the global order of namespace updates - All CNodes apply the same deterministic updates in the same deterministic order - Starting from the same initial state and applying the same updates = consistency Coordination Engine provides consistency of multiple namespace replicas 31
    • 32. Coordinated HDFS Cluster  Independent CNodes – the same namespace  Load balancing client requests  Proposal, Agreement  Coordinated updates Multiple active Consensus Nodes share namespace via Coordination Engine 32
    • 33. Coordinated HDFS operations  ConsensusNode workflow 1. Receive request from a client 2. Submit the update proposal to the Coordination Engine and wait for agreement 3. Apply the agreed update to its memory state 4. Record the update as a journal transaction in persistent storage (optional) 5. Return result to the client  HDFS Client and DataNode operations remain the same Updates to the namespace when a file or a directory is created are coordinated (a sketch of this workflow appears after the transcript) 33
    • 34. Strict Consistency Model  Coordination Engine transforms namespace modification proposals into the global sequence of agreements - Applied to namespace replicas in the order of their Global Sequence Number  ConsensusNodes may have different states at a given moment of “clock” time - As the rate of consuming agreements may vary  CNodes have the same namespace state when they reach the same GSN  One-copy-equivalence - each replica presented to the client as if it has only one copy One-Copy-Equivalence as known in replicated databases 34
    • 35. Consensus Node Proxy  CNodeProxyProvider – a pluggable substitute for FailoverProxyProvider - Defined via Configuration  Main features - Randomly chooses CNode when client is instantiated - Sticky until a timeout occurs - Fails over to another CNode - Smart enough to avoid SafeMode  Further improvements - Take into account network proximity Reads do not modify the namespace and can be directed to any ConsensusNode 35
    • 36. Alternatives to a Paxos based Replicated State Machine
    • 37. Using a TCP Connection to send data to three replicated servers (Load Balancer) server3 Server Process OP OP server2 Server Process OP OP OP OP server 1 Server Process OP OP OP OP Client OP OP OP OP Load Balancer
    • 38. Problems with using a Load Balancer  Load balancer becomes the single point of failure - Need to make the LB highly available and distributed  Since Paxos is not employed to reach consensus between the three replicas, strong consistency cannot be guaranteed - Replicas will quickly diverge
    • 39. HBase WAL or HDFS Edits Log replication  State Machine (HRegion contents, HDFS NameNode metadata, etc.) is modified first  Modification Log (HBase WAL or HDFS Edits Log) is sent to a Highly Available shared storage, QJM, etc.  Standby Server(s) read edits log and serve as warm standby servers, ready to take over should the active server fail
    • 40. HBase WAL or HDFS Edits Log replication server 1 Server Process OP OP OP OP server2 Server Process Shared Storage Standby Server WAL/Edits Log Single Active Server
    • 41.  Only one active server is possible  Failover takes time  Failover is error prone, with intricate fencing etc.  Cost of reaching consensus needs to be paid for HDFS Edits log entry to be deemed safely stored, so why not pay the cost before modifying the state and thereby have multiple active servers? HBase WAL or HDFS Edits Log tailing
    • 42. HBase Continuous Availability
    • 43. HBase Single Points of Failure  HBase Region Server  HBase Master
    • 44. HBase Region Server Replication
    • 45. NonStopRegionServer: Client Service e.g. multi Client Service DConE HRegionServer NonStopRegionServer 1 Client Service e.g. multi Client Service DConE HRegionServer NonStopRegionServer 2 HBase Client 1. Client calls HRegionServer multi 2. NonStopRegionServer intercepts 3. NonStopRegionServer makes a Paxos proposal using the DConE library 4. Proposal comes back as an agreement on all NonStopRegionServers 5. NonStopRegionServer calls super.multi on all nodes. State changes are recorded 6. NonStopRegionServer 1 alone sends the response back to the client (a sketch of this interception pattern appears after the transcript) Subclassing the HRegionServer
    • 46. HBase RegionServer replication using WANdisco DConE  Shared nothing architecture  HFiles, WALs etc. are not shared  Replica count is tuned  Snapshots of HFiles do not need to be created  Messy details of WAL tailing are not necessary
    • 47. HBase RegionServer replication using WANdisco DConE  Not an eventual consistency model  Does not serve up stale data
    • 48. DEMO
    • 49. Thank you Jagane Sundar jagane.sundar@wandisco.com @jagane
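    For slide 33, a minimal sketch of the coordinated write path (the interfaces below are hypothetical stand-ins, not the Hadoop or WANdisco APIs): the server turns the client request into a proposal, blocks until the Coordination Engine returns it as an agreement with a global sequence number, applies it to its memory state, optionally journals it, and only then answers the client.

    // Hypothetical interfaces; a sketch of the coordinated write path on slide 33.
    import java.util.concurrent.CompletableFuture;

    interface CoordinationEngine {
        /** Submit a proposal; the future completes when it comes back as an agreement with a GSN. */
        CompletableFuture<Long> propose(byte[] operation);
    }

    interface NamespaceStore {
        void applyToMemory(long gsn, byte[] operation);
        void journal(long gsn, byte[] operation);      // step 4 is optional in the slide
    }

    class ConsensusNodeWritePath {
        private final CoordinationEngine engine;
        private final NamespaceStore store;

        ConsensusNodeWritePath(CoordinationEngine engine, NamespaceStore store) {
            this.engine = engine;
            this.store = store;
        }

        /** Steps 1-5 from slide 33: propose, wait for agreement, apply, journal, reply. */
        long handleClientUpdate(byte[] operation) throws Exception {
            long gsn = engine.propose(operation).get();  // wait for the agreement
            store.applyToMemory(gsn, operation);
            store.journal(gsn, operation);
            return gsn;                                  // result returned to the client
        }
    }

    Compared with the standard NameNode workflow on slide 29, the only new steps are the proposal and the wait for agreement before the update is applied.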
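    For slide 45, a sketch of the same pattern applied to the region server (all types here are simplified stand-ins, not the real HBase HRegionServer or DConE classes): the subclass intercepts the batched mutation call, submits it as a proposal, and only invokes the parent implementation once the proposal comes back as an agreement, so every replica records the same state change; a single designated replica returns the response to the client.

    // Simplified stand-ins for slide 45; not the real HBase or DConE types.
    import java.util.concurrent.CompletableFuture;

    class BatchRequest { /* batched mutations from the client */ }
    class BatchResult  { /* per-mutation outcomes */ }

    class RegionServer {
        /** Applies the batch locally; stands in for HRegionServer's multi handling. */
        BatchResult multi(BatchRequest request) {
            // ... mutate the region state ...
            return new BatchResult();
        }
    }

    interface Coordinator {
        /** Proposal comes back as an agreement on every replica, in the same order. */
        CompletableFuture<BatchRequest> proposeAndAwaitAgreement(BatchRequest request);
    }

    class NonStopRegionServer extends RegionServer {
        private final Coordinator coordinator;
        private final boolean respondsToClient;   // only one replica answers the client

        NonStopRegionServer(Coordinator coordinator, boolean respondsToClient) {
            this.coordinator = coordinator;
            this.respondsToClient = respondsToClient;
        }

        @Override
        BatchResult multi(BatchRequest request) {
            try {
                // Intercept the call, coordinate it, then apply it via the parent class.
                BatchRequest agreed = coordinator.proposeAndAwaitAgreement(request).get();
                BatchResult result = super.multi(agreed);
                return respondsToClient ? result : null;   // step 6: one node replies
            } catch (Exception e) {
                throw new RuntimeException("coordination failed", e);
            }
        }
    }

    Because the coordination happens before the state change, there is no WAL tailing or failover involved; every NonStopRegionServer is active.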
