BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automatic Service Failover by Mike Dalton of Zettaset

Fail-Proofing Hadoop Clusters with Automatic Service Failover (Financial Industry)
Michael Dalton, Ph.D., CTO and Co-founder, Zettaset Inc.

Big Data Cloud Meetup: Big Data & Cloud Computing - Help, Educate & Demystify. September 8th, 2011

Fail-Proofing Hadoop Clusters with Automated Service Failover
Michael Dalton, CTO, Zettaset

Problem
- Hadoop environments have many single points of failure (SPOFs)
  - NameNode, JobTracker, Oozie
  - Kerberos

Ideal Solution
- Automated failover
  - No data loss
  - Handle all aspects of failover (IP failover, etc.)
- Fail over all services
  - No JobTracker = no MapReduce
  - No Kerberos = no new Kerberos authentication

Existing Solutions
- AvatarNode (NameNode patch from Facebook)
  - Replicates writes to a backup service
- BackupNameNode (NameNode, not committed to trunk)
  - 'Hot', replicated copy of the NameNode
- All failover is manual

Why is Failover Hard?
[Diagram: two masters (M1, M2) and two clients (C1, C2)]

Data Loss
- Split-brain scenarios lose data
  - Multiple active masters = data corruption
  - Clients are confused about which master is up
- A long-standing problem for traditional HA environments
  - Linux-HA, etc.
  - Heartbeat failure != death

Theoretical Limits
- Can we solve this reliably?
- Fischer-Lynch-Paterson (FLP) impossibility result
  - Consensus is impossible in a fully asynchronous distributed system when even a single process can fail
  - No free lunch

Revisiting Our Assumptions
- Drop the fully asynchronous requirement
- What about leases?
  - Masters obtain and periodically renew a lease
  - A master shuts itself down if its lease expires (not asynchronous)
- Assumes only bounded relative clock skew
  - Everyone should agree on how fast time elapses

Master Failover
- Requires a highly available lock/lease system
- The master obtains a lease to act as master
  - It replicates its writes to a backup master
- If the master loses its lease, hold a new election
  - The old master shuts down when its lease expires
- With bounded clock skew, no split-brain! (See the sketch below.)

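To make the lease-and-self-fencing idea concrete, here is a minimal sketch in Java. The LeaseService interface, the lease length, and the renewal cadence are hypothetical stand-ins for whatever HA lock service is used; the point is only that the master stops serving the moment it can no longer prove its lease is still valid.

```java
// Minimal lease-based self-fencing sketch. LeaseService is a hypothetical
// stand-in for an HA lock/lease service, not a real Hadoop or ZooKeeper API.
public final class LeasedMaster implements Runnable {

    /** Hypothetical lease API: both calls return the granted lease length in ms. */
    public interface LeaseService {
        long acquire(String name, long requestedMs);
        long renew(String name, long requestedMs);
    }

    private static final long LEASE_MS = 10_000;  // lease length requested from the service
    private static final long RENEW_MS = 3_000;   // renew well before expiry

    private final LeaseService leases;
    private volatile long leaseExpiresAt;

    public LeasedMaster(LeaseService leases) {
        this.leases = leases;
    }

    @Override
    public void run() {
        leaseExpiresAt = System.currentTimeMillis() + leases.acquire("master", LEASE_MS);
        while (true) {
            if (System.currentTimeMillis() >= leaseExpiresAt) {
                // Lease expired: a new master may already have been elected,
                // so self-fence before a split-brain can corrupt data.
                shutdownMasterServices();
                return;
            }
            try {
                leaseExpiresAt = System.currentTimeMillis() + leases.renew("master", LEASE_MS);
            } catch (RuntimeException renewFailed) {
                // Lock service unreachable: keep serving only until the current
                // lease runs out, then fence ourselves.
            }
            try {
                Thread.sleep(RENEW_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void shutdownMasterServices() {
        // Stop accepting client requests, close listeners, flush state, etc.
    }
}
```

Bounded clock skew is what makes this safe: the master's view of "lease expired" and the lock service's view may drift apart only by a known, bounded amount.
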
Failover: Locks/Consensus
- Apache ZooKeeper - a Hadoop subproject
- A highly available, filesystem-like store built for distributed consensus problems
- Elections, group membership, etc. are built from its special-purpose filesystem semantics (see the sketch below)
- 'Ephemeral' znodes disappear when the creating session's lease expires
- 'Sequential' znodes get an auto-incremented suffix

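As an illustration, here is a bare-bones leader election using the standard ZooKeeper Java client and the ephemeral/sequential znodes described above. The connect string and the /demo/election path are made-up placeholders, the parent znode is assumed to already exist, and retry and watch handling are omitted.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper quorum (placeholder hosts); the lambda is the session watcher.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000,
                event -> { /* react to session/znode events here */ });

        // Each candidate creates an EPHEMERAL_SEQUENTIAL znode: it vanishes when the
        // session lease expires, and ZooKeeper appends a monotonically increasing suffix.
        String me = zk.create("/demo/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Convention: the candidate holding the lowest sequence number is the master.
        List<String> candidates = zk.getChildren("/demo/election", false);
        Collections.sort(candidates);
        boolean iAmMaster = me.endsWith(candidates.get(0));
        System.out.println(iAmMaster ? "Elected master" : "Standing by as backup");
    }
}
```

In a real deployment each standby also sets a watch on the znode just ahead of its own, so it is notified the moment its predecessor's session lapses and can re-run the check.
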
ZooKeeper Internals
- A ZooKeeper ensemble is a quorum of nodes (typically 3-9)
- A majority vote elects a leader (via leases)
- The leader proposes all filesystem modifications
- A majority must approve a modification before it is committed

Example: HBase
- Apache HBase has fully automated multi-master failover
- Prospective masters register themselves in ZooKeeper
- ZooKeeper ephemeral/sequential znodes are used for the election
- Clients look up the current master's address in ZooKeeper (a lookup sketch follows)
- Failover is fully automated
- All files are stored on HDFS, so there are no replication issues

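The client side of this pattern is just a read from ZooKeeper instead of a hard-coded hostname. The sketch below uses an invented /demo/master znode holding a host:port string; HBase's actual znode layout and payload encoding differ, so treat the path and format as assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class MasterLookupSketch {
    // Reads the active master's address from ZooKeeper rather than relying on
    // a fixed hostname or IP. Path and payload format are illustrative only.
    static String currentMasterAddress(ZooKeeper zk) throws Exception {
        byte[] data = zk.getData("/demo/master", false, null);
        return new String(data, StandardCharsets.UTF_8);  // e.g. "master1:60000"
    }
}
```
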
Failover: Replication
- The HBase approach sidesteps replication because its data already lives on HDFS
- Kerberos, the NameNode, Oozie, etc. can't store their state on HDFS
  - Legacy compatibility (and, for the NameNode, a circular dependency)
- How can we add synchronous write replication?
  - Can't break compatibility or change the applications (one option is sketched below)

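The property being asked for is that a write is acknowledged only once it is durable on both the primary and the backup. The sketch below shows that property at the application level with a hypothetical Backup interface; for services that cannot be modified, the same effect is achieved underneath them with block-level replication such as DRBD.

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch of synchronous write replication: append() returns only after the
// record is persisted locally and on the backup. Backup is a hypothetical
// interface standing in for whatever replication transport is used.
public final class SyncReplicatedLog {

    public interface Backup {
        void replicate(byte[] record) throws IOException;  // durable on the backup before returning
    }

    private final OutputStream local;
    private final Backup backup;

    public SyncReplicatedLog(OutputStream local, Backup backup) {
        this.local = local;
        this.backup = backup;
    }

    public void append(byte[] record) throws IOException {
        local.write(record);
        local.flush();             // durable locally (fsync omitted for brevity)
        backup.replicate(record);  // only now is the write acknowledged to the caller
    }
}
```
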
Failover: Networking
- HBase avoids network-level failover by storing the master's address in ZooKeeper
- Legacy services use IPs or hostnames, not ZooKeeper, to connect to the master
- There are out-of-trunk patches that let ZooKeeper act as a DNS server
  - But Java doesn't respect DNS TTLs by default, which complicates bounding the maximum failover time (see the note below)
  - DNS introduces its own issues anyway...

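The JVM caveat on this slide is the standard InetAddress cache: a long-running Java client can keep using a stale address well past any DNS TTL. The usual knob is the networkaddress.cache.ttl security property, shown below; the 10-second value is just an example.

```java
public class DnsTtlExample {
    public static void main(String[] args) {
        // Cap the JVM's positive DNS lookup cache at 10 seconds for this process.
        // Set this early, before any InetAddress lookups happen. (With a security
        // manager installed, the historical default is to cache successful lookups forever.)
        java.security.Security.setProperty("networkaddress.cache.ttl", "10");

        // The same setting can be made globally in the JRE's java.security file:
        //   networkaddress.cache.ttl=10
    }
}
```
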
IP Failover
- Instead, you can fail over IP addresses
- Virtual IPs - if supported by your router
- Otherwise, dynamically update routes as part of the failover
  - The new leader updates the routing tables
- On local area networks, make sure ARP tables are updated
  - Gratuitous ARP, or store ARP information in ZooKeeper (a Linux takeover sketch follows)

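As a rough illustration of the takeover step on Linux, the newly elected master might claim the virtual IP and announce it with gratuitous ARP roughly as below. The address, prefix, and interface name are placeholders, and arping's flags differ between the iputils and Thomas Habets versions, so the exact command lines are assumptions.

```java
import java.io.IOException;

public class VipTakeoverSketch {
    // Runs an external command, inheriting stdout/stderr, and fails loudly on a non-zero exit.
    static void run(String... cmd) throws IOException, InterruptedException {
        int rc = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        if (rc != 0) throw new IOException("command failed: " + String.join(" ", cmd));
    }

    public static void main(String[] args) throws Exception {
        run("ip", "addr", "add", "10.0.0.50/24", "dev", "eth0");    // claim the virtual IP (placeholder values)
        run("arping", "-U", "-c", "3", "-I", "eth0", "10.0.0.50");  // gratuitous ARP so LAN peers refresh their ARP caches
    }
}
```
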
Putting it all together
- Consensus/election
  - Use ZooKeeper with a 3-9 node quorum
- State replication
  - Small data in ZooKeeper, large data in HDFS
  - If neither is possible, DRBD
- Network failover
  - Store the master's address in ZooKeeper
  - Or perform IP failover
    - Dynamically update routing tables and the ARP cache

Conclusion
- Fully automated failover is possible
  - Design for synchronous replication
  - Prevent split-brain
  - Manage legacy compatibility
- Coming to Hadoop
  - Zettaset provides fully HA Hadoop
