
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automatic Service Failover by Mike Dalton of Zettaset

Fail-Proofing Hadoop Clusters with Automatic Service Failover (Financial Industry)
- Michael Dalton, Ph.D., CTO, Co-founder, Zettaset Inc.


1. Big Data Cloud Meetup: Big Data & Cloud Computing - Help, Educate & Demystify. September 8th 2011

2. Fail-Proofing Hadoop Clusters with Automated Service Failover
   Michael Dalton, CTO, Zettaset
   Sept 8th 2011 Meetup
3. Problem
   - Hadoop environments have many single points of failure (SPOFs)
     - NameNode, JobTracker, Oozie
     - Kerberos
4. Ideal Solution
   - Automated failover
     - No data loss
     - Handle all failover aspects (IP failover, etc.)
   - Failover for all services
     - No JobTracker = no MapReduce
     - No Kerberos = no new Kerberos authentication
5. Existing Solutions
   - AvatarNode (NameNode, patch from Facebook)
     - Replicates writes to a backup service
   - BackupNameNode (NameNode, not committed)
     - 'Hot' copy of the NameNode, replicated
   - All failover is manual
6. Why is Failover Hard?
   [Diagram with nodes M1, M2, C1, C2]
7. Data Loss
   - Split-brain issues lose data
     - Multiple masters = data corruption
     - Clients become confused about which master is live
   - A known problem for traditional HA environments
     - Linux-HA, etc.
     - Heartbeat failure != death
8. Theoretical Limits
   - Can we solve this reliably?
   - Fischer-Lynch-Paterson (FLP) Theorem
     - Consensus is impossible in a fully asynchronous distributed system when even a single process can fail
     - No free lunch
9. Revisiting Our Assumptions
   - Drop the fully asynchronous requirement
   - What about leases?
     - Masters obtain and renew a lease
     - Shut down if the lease expires (not asynchronous)
   - Assumes only bounded relative clock skew
     - Everyone should agree on how fast time elapses
10. Master Failover
   - Requires a highly available lock/lease system
   - The master obtains a lease to be master
     - Replicates writes to a backup master
   - If the master loses its lease, hold a new election
     - The old master will shut down when its lease expires (see the lease-loop sketch below)
   - If clock skew is bounded, no split-brain!
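To make the lease mechanism concrete, here is a minimal sketch of the renew-or-step-down loop a master could run. The LeaseService interface, timing constants, and stepDown hook are hypothetical illustrations, not part of the deck or of any specific Hadoop API.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical lease-service client; the deck does not prescribe a concrete API.
    interface LeaseService {
        /** Acquire or renew the master lease; returns the new expiry time (epoch millis). */
        long renewLease(String ownerId, long durationMillis) throws Exception;
    }

    public class MasterLeaseLoop {
        private static final long LEASE_MILLIS = 10_000; // assumed lease duration
        private static final long RENEW_MILLIS = 3_000;  // renew well before expiry
        private volatile long leaseExpiry = 0;

        public void run(LeaseService leases, String ownerId, Runnable stepDown) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    // Only the current leader can successfully renew.
                    leaseExpiry = leases.renewLease(ownerId, LEASE_MILLIS);
                } catch (Exception e) {
                    // Renewal failed; fall through and check the last known expiry.
                }
                // Local-clock check: once the lease has expired, stop acting as master
                // before a new election can complete. This is what rules out split-brain,
                // assuming bounded relative clock skew as on the slide above.
                if (System.currentTimeMillis() >= leaseExpiry) {
                    stepDown.run();       // e.g. stop serving writes, exit the process
                    scheduler.shutdown();
                }
            }, 0, RENEW_MILLIS, TimeUnit.MILLISECONDS);
        }
    }

A real implementation would track expiry against the local clock at renewal time rather than trusting a remote timestamp directly; the bounded-skew assumption is what makes either approach safe.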
11. Failover: Locks/Consensus
   - Apache ZooKeeper - a Hadoop subproject
   - Highly available distributed filesystem for distributed consensus problems
   - Build election, membership, etc. using its special-purpose FS semantics
   - 'Ephemeral' files disappear when the session lease expires
   - 'Sequential' files get an auto-incremented suffix (see the election sketch below)
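As an illustration of how these two file types combine into an election, here is a minimal Java sketch using the ZooKeeper client API. The connect string, the /election parent znode (assumed to already exist as a persistent node), and the candidate- prefix are placeholders.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkElectionSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder quorum address; the 10-second session timeout is arbitrary.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000, event -> {});

            // Ephemeral: the node vanishes if our session lease expires.
            // Sequential: ZooKeeper appends a monotonically increasing suffix.
            String myNode = zk.create("/election/candidate-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // Classic recipe: the candidate with the lowest sequence number is the master.
            List<String> candidates = zk.getChildren("/election", false);
            Collections.sort(candidates);
            boolean isLeader = myNode.endsWith(candidates.get(0));
            System.out.println(isLeader ? "Elected master" : "Standing by as backup");
        }
    }

A production election would also set a watch on the predecessor candidate so the backup is notified the moment that node disappears.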
12. ZooKeeper Internals
   - ZooKeeper consists of a quorum of nodes (typically 3-9)
   - Majority vote elects a leader (via leases)
   - The leader proposes all FS modifications
   - A majority must approve a modification for it to be committed
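For reference, such a quorum is configured as a small ensemble; a minimal zoo.cfg sketch for a 3-node quorum (hostnames and paths are placeholders):

    # zoo.cfg - minimal 3-node ensemble (hostnames and dataDir are placeholders)
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # quorum members: host:peerPort:leaderElectionPort
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888

Each server also needs a myid file in dataDir matching its server.N entry; a three-node ensemble tolerates one failure, a five-node ensemble two.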
13. Example: HBase
   - Apache HBase has fully automated multi-master failover
   - Prospective masters register in ZooKeeper
   - ZooKeeper ephemeral/sequential files are used for election
   - Clients look up the current master address in ZooKeeper
   - Failover is fully automated
   - All files are stored on HDFS, so no replication issues
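The client side of this pattern is just a read from a well-known znode. The sketch below is illustrative only: the znode path and the host:port payload are assumptions, not HBase's actual znode layout.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class MasterLookupSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000, event -> {});

            // The elected master writes its host:port into a well-known znode;
            // clients read it from ZooKeeper instead of hard-coding an address.
            byte[] data = zk.getData("/cluster/master-address", false, new Stat());
            String masterAddress = new String(data, StandardCharsets.UTF_8);
            System.out.println("Current master: " + masterAddress);
        }
    }

Passing true for the watch flag would additionally notify the client when the address changes after a failover.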
14. Failover: Replication
   - The HBase approach avoids replication issues by storing its data in HDFS
   - Kerberos, the NameNode, Oozie, etc. can't use HDFS
     - Legacy compatibility (and, for the NameNode, circular dependencies)
   - How can we add synchronous write replication?
     - Can't break compatibility or change applications
15-16. Failover: Networking
   - HBase avoids network failover by storing the master address in ZooKeeper
   - Legacy services use IPs or hostnames, not ZooKeeper, to connect to the master
   - There are out-of-trunk patches to make ZooKeeper act as a DNS server
     - But Java doesn't respect DNS TTLs anyway, making the maximum failover time hard to bound (see the TTL sketch below)
     - And DNS introduces its own issues anyway...
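The Java DNS caching issue mentioned above can at least be bounded with JVM-wide security properties; a minimal sketch (the TTL values are arbitrary examples):

    import java.security.Security;

    public class DnsTtlSketch {
        public static void main(String[] args) {
            // The JVM caches successful name lookups (indefinitely when a security
            // manager is installed) and does not honor DNS record TTLs, so a failed-over
            // hostname can keep resolving to the dead master. Bounding the cache makes
            // the worst-case failover time predictable.
            Security.setProperty("networkaddress.cache.ttl", "30");          // cache hits (seconds)
            Security.setProperty("networkaddress.cache.negative.ttl", "10"); // cache misses (seconds)
            // Must run before the first InetAddress lookup; the same properties can be
            // set in the JRE's java.security file instead.
        }
    }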
17. IP Failover
   - Instead, you can fail over IP addresses
   - Virtual IPs - if supported by your router
   - Otherwise, dynamically update routes as part of your failover
     - The new leader updates routing tables
   - For local area networks, ensure ARP tables are updated
     - Gratuitous ARP, or store ARP information in ZooKeeper (see the sketch after this slide)
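A minimal sketch of the take-over step, assuming Linux with iproute2 and iputils on the new leader; the interface name, VIP, and netmask are placeholders, and production setups usually lean on proven tooling (e.g. Linux-HA resource agents) rather than hand-rolled scripts:

    import java.io.IOException;

    public class IpFailoverSketch {
        // The new leader claims the service VIP and announces it on the LAN.
        public static void takeOverVip(String vip, String iface) throws IOException, InterruptedException {
            // Bind the virtual IP to our interface (the /24 netmask is a placeholder).
            run("ip", "addr", "add", vip + "/24", "dev", iface);
            // Send gratuitous (unsolicited) ARP so peers update their ARP caches.
            run("arping", "-U", "-c", "3", "-I", iface, vip);
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("command failed: " + String.join(" ", cmd));
            }
        }
    }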
18. Putting it all together
   - Consensus/election
     - Use ZooKeeper with a 3-9 node quorum
   - State replication
     - Small data in ZooKeeper, large data in HDFS
     - If neither is possible, DRBD (see the config sketch below)
   - Network failover
     - Store the master address in ZooKeeper
     - Or perform IP failover
       - Dynamically update routing tables and the ARP cache
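For the "neither ZooKeeper nor HDFS" case, the DRBD option above means block-level synchronous replication between the active and standby master hosts. A minimal resource sketch (hostnames, devices, and addresses are placeholders; exact syntax varies between DRBD versions):

    resource master_state {
      protocol C;                 # protocol C = fully synchronous replication
      on node1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;      # local backing device (placeholder)
        address   10.0.0.1:7789;
        meta-disk internal;
      }
      on node2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7789;
        meta-disk internal;
      }
    }

With protocol C a write is acknowledged only once it has reached both nodes' disks, which is what "no data loss" requires; the failover machinery then mounts the replicated device on whichever node wins the election.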
19. Conclusion
   - Fully automated failover is possible
     - Design for synchronous replication
     - Prevent split-brain
     - Manage legacy compatibility
   - Coming to Hadoop
     - Zettaset provides fully HA Hadoop
