Hadoop World 2011: HDFS Name Node High Availability - Aaron Myers, Cloudera & Sanjay Radia, Hortonworks



HDFS HA has been a highly sought-after feature for years. Through collaboration between Cloudera, Facebook, Yahoo!, and others, a high availability system for the HDFS Name Node is actively being worked on, and will likely be complete by Hadoop World. This talk will discuss the architecture and setup of this system.

  • Speaker notes: Data – can I read what I wrote? Is the service available? When I asked one of the original authors of GFS if there were any decisions they would revisit – random writers. Simplicity is key. Raw disk – file systems take time to stabilize – by relying on the OS's file system instead we can take advantage of ext4, xfs or zfs.

    1. NameNode HA
       • Suresh Srinivas - Hortonworks
       • Aaron T. Myers - Cloudera
    2. Overview
       • Part 1 – Suresh Srinivas (Hortonworks)
         • HDFS Availability and Data Integrity – what is the record?
         • NN HA Design
       • Part 2 – Aaron T. Myers (Cloudera)
         • NN HA Design continued
         • Client-NN connection failover
         • Operations and admin of HA
         • Future work
    3. Current HDFS Availability & Data Integrity
       • Simple design, storage fault tolerance
         • Storage: rely on the OS's file system rather than use raw disk
         • Storage fault tolerance: multiple replicas, active monitoring
         • Single NameNode master
         • Persistent state: multiple copies + checkpoints
         • Restart on failure
       • How well did it work?
         • Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009 – seven 9s of reliability; fixed in 0.20 and 0.21
         • 18-month study: 22 failures on 25 clusters – 0.58 failures per year per cluster
         • Only 8 would have benefited from HA failover (0.23 failures per cluster-year)
       • NN is very robust and can take a lot of abuse
         • NN is resilient against overload caused by misbehaving apps
    4. HA NameNode
       • Active work has started on HA NameNode (failover)
         • Detailed design and sub-tasks in HDFS-1623
       • HA: related work
         • Backup NN (0.21)
         • Avatar NN (Facebook)
         • HA NN prototype using Linux HA (Yahoo!)
         • HA NN prototype with Backup NN and block report replicator (eBay)
       • HA is the highest priority
    5. Approach and Terminology
       • Initial goal is Active-Standby
         • With Federation, each namespace volume has a NameNode
         • Single active NN for any namespace volume
       • Terminology
         • Active NN – actively serves the read/write operations from the clients
         • Standby NN – waits; becomes active when the Active dies or is unhealthy
           • Could serve read operations
         • The Standby's state may be cold, warm or hot
           • Cold: Standby has zero state (e.g. started after the Active is declared dead)
           • Warm: Standby has partial state: has loaded fsImage & edit logs but has not received any block reports
           • Hot: Standby has almost all of the Active's state and can start immediately
    6. High Level Use Cases
       • Planned downtime – main reason for downtime
         • Upgrades
         • Config changes
       • Unplanned downtime – occurs infrequently
         • Hardware failure
         • Server unresponsive
         • Software failures
       • Supported failures
         • Single hardware failure
         • Some software failures
       • Not supported
         • Double hardware failure
         • The same software failure affecting both active and standby
    7. Use Cases: Deployment Models
       • Single NN configuration; no failover
       • Active and Standby with manual failover
         • Standby could be cold/warm/hot
         • Addresses downtime during upgrades – the main cause of unavailability
       • Active and Standby with automatic failover
         • Hot standby
         • Addresses downtime during upgrades and other failures
       • See HDFS-1623 for detailed use cases
    8. Design
       • Failover control outside the NN
       • Parallel block reports to Active and Standby (hot failover)
       • Shared or non-shared NN state
       • Fencing of shared resources/data
         • Datanodes
         • Shared NN state (if any)
       • Client failover
         • IP failover
         • Smart clients (e.g. configuration, or ZooKeeper for coordination)
    9. Failover Control Outside NN
       • HA daemon outside the NameNode
         • Daemon manages resources
       • All resources modeled uniformly
         • Resources – OS, HW, network, etc.
         • NameNode is just another resource
       • Heartbeat with other nodes
       • Quorum-based leader election
         • ZooKeeper for coordination and quorum
       • Fencing during split brain
         • Prevents data corruption
       [Diagram: HA daemons exchanging heartbeats, with leader election via a quorum service; each daemon drives resource actions (start, stop, failover, monitor, …) and fencing/STONITH of shared resources]
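The control flow above can be sketched in miniature. This is not the actual FailoverController code; the ZooKeeper lock is replaced by a simple in-process stand-in, and the class and method names are illustrative only.

```python
import threading

class QuorumLock:
    """Stand-in for a ZooKeeper-style exclusive lock (ephemeral znode)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def try_acquire(self, owner):
        # Non-blocking: only one controller can hold the active role at a time.
        if self._lock.acquire(blocking=False):
            self.holder = owner
            return True
        return False

    def release(self, owner):
        if self.holder == owner:
            self.holder = None
            self._lock.release()

class FailoverController:
    """Monitors the local NameNode (as one resource) and contends for the active role."""
    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.state = "standby"

    def tick(self, nn_healthy):
        if not nn_healthy:
            # Unhealthy NN: give up the active role so the peer can take over.
            if self.state == "active":
                self.lock.release(self.name)
                self.state = "standby"
        elif self.state == "standby" and self.lock.try_acquire(self.name):
            # In a real system, fencing of the old active happens before activating.
            self.state = "active"

lock = QuorumLock()
fc1, fc2 = FailoverController("nn1", lock), FailoverController("nn2", lock)
fc1.tick(True);  fc2.tick(True)   # nn1 wins the initial election
fc1.tick(False); fc2.tick(True)   # nn1's NN fails; nn2 takes over
```

In the real design the quorum service also detects a dead daemon (lost heartbeat) and releases its lock, which is what the ephemeral-znode semantics of ZooKeeper provide.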
    10. NN HA with Shared Storage and ZooKeeper
        [Diagram: a ZooKeeper ensemble exchanges heartbeats with a FailoverController beside each NameNode (one Active, one Standby); each FailoverController monitors the health of its NN, OS and HW and issues commands to it; the NNs share NN state with a single writer (fencing); DataNodes send block reports to both Active and Standby; DN fencing: update commands are accepted from only one NN]
    11. HA Design Details
    12. Client Failover Design
        • Smart clients
          • Users use one logical URI; the client selects the correct NN to connect to
        • Implementing two options out of the box
          • Client knows of multiple NNs
          • Use a coordination service (ZooKeeper)
        • Common to both options
          • Which operations are idempotent, and therefore safe to retry on a failover
          • Failover/retry strategies
        • Some differences
          • Expected time for client failover
          • Ease of administration
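The "client knows of multiple NNs" option can be sketched as a retry proxy. This is a hypothetical illustration, not the Hadoop client API: the class names, the `getFileInfo` operation, and the fake NameNodes are all invented for the example.

```python
class StandbyError(Exception):
    """Raised by a NameNode that is currently in standby state."""

class FailoverProxy:
    """Presents one logical endpoint backed by several candidate NameNodes."""
    def __init__(self, namenodes):
        self.namenodes = namenodes
        self.current = 0

    def invoke(self, op_name, idempotent, *args):
        attempts = 0
        while attempts <= len(self.namenodes):
            nn = self.namenodes[self.current]
            try:
                return getattr(nn, op_name)(*args)
            except StandbyError:
                # Only idempotent ops (e.g. a metadata read) are safe to
                # retry blindly: a non-idempotent op may already have taken
                # effect on the old active before it lost that role.
                if not idempotent:
                    raise
                self.current = (self.current + 1) % len(self.namenodes)
                attempts += 1
        raise RuntimeError("no active NameNode found")

class FakeNN:
    """Stand-in NameNode that rejects calls unless it is active."""
    def __init__(self, active):
        self.active = active
    def getFileInfo(self, path):
        if not self.active:
            raise StandbyError()
        return {"path": path}

proxy = FailoverProxy([FakeNN(False), FakeNN(True)])
info = proxy.invoke("getFileInfo", True, "/user/foo")  # fails over to the second NN
```

The ZooKeeper-based option differs mainly in how the candidate list and the current active are discovered, not in this retry logic.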
    13. Ops/Admin: Shared Storage
        • To share NN state, you need shared storage
          • Needs to be HA itself, to avoid just shifting the SPOF
          • BookKeeper, etc. will likely take care of this in the future
          • Many filers come with IP fencing options
          • Recommended mount options: tcp,soft,intr,timeo=60,retrans=10
        • Not all edits directories are created equal
          • All edits dirs used to be just a pool of redundant dirs
          • Can now configure some edits directories to be required
          • Can now configure the number of tolerated failures
          • You want at least 2 for durability, 1 remote for HA
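Applying the mount options recommended above, an `/etc/fstab` entry for an NFS-mounted shared edits directory might look like the following (the filer hostname, export path, and mount point are hypothetical placeholders):

```
filer.example.com:/export/hdfs-edits  /mnt/nn-shared-edits  nfs  tcp,soft,intr,timeo=60,retrans=10  0 0
```

The `soft,timeo=60,retrans=10` combination bounds how long the NN blocks on a dead filer instead of hanging indefinitely, at the cost of the mount eventually returning an error the NN must handle.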
    14. Ops/Admin: NN Fencing
        • Client failover does not solve this problem
        • Out of the box
          • RPC to the active NN to tell it to go to standby (graceful failover)
          • SSH to the active NN and `kill -9` the NN
        • Pluggable options
          • Many filers have protocols for IP-based fencing
          • Many PDUs have protocols for IP-based plug-pulling (STONITH)
          • Nuke the node from orbit. It's the only way to be sure.
        • Configure extra options if available to you
          • They will be tried in order during a failover event
          • Escalate the aggressiveness of the method
        • Fencing is critical for the correctness of NN metadata
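The "tried in order, escalating aggressiveness" policy can be sketched as a simple loop. This is an illustrative sketch, not Hadoop's fencer classes; the method names and the recording of attempts exist only for the example.

```python
def fence_old_active(methods):
    """methods: list of (name, callable) tried gentlest-first.
    Each callable returns True once the old active is confirmed fenced."""
    for name, method in methods:
        try:
            if method():
                return name  # fencing confirmed; safe to activate the standby
        except Exception:
            pass  # a failed method simply escalates to the next one
    # Refuse the failover rather than risk split-brain metadata corruption.
    raise RuntimeError("all fencing methods failed; aborting failover")

attempts = []
methods = [
    # Graceful RPC fails here (e.g. the old active is unreachable)...
    ("graceful-rpc", lambda: attempts.append("rpc") or False),
    # ...so we escalate to SSH + kill -9, which succeeds.
    ("ssh-kill9",    lambda: attempts.append("ssh") or True),
    # STONITH via the PDU is never reached in this run.
    ("stonith",      lambda: attempts.append("pdu") or True),
]
used = fence_old_active(methods)
```

The key property is the final `raise`: if no method can positively confirm the old active is dead, the failover must not proceed, because an un-fenced writer on shared storage corrupts the NN metadata.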
    15. Ops/Admin: Monitoring
        • New NN metrics
          • Size of pending DN message queues
          • Seconds since the standby NN last read from the shared edit log
          • DN block report lag
          • All are measurements of standby NN lag – monitor/alert on all of these
        • Monitor the shared storage solution
          • Volumes fill up, disks go bad, etc.
          • Should configure a paranoid edit log retention policy (default is 2)
        • Canary-based monitoring of HDFS is a good idea
          • Pinging both NNs is not sufficient
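A canary check exercises a real write/read path through the logical URI rather than just pinging both NNs (both can answer pings while neither is serving as active). A minimal sketch, with the HDFS client replaced by a stub so the example is self-contained; the function and path names are assumptions:

```python
import time

def hdfs_canary(client, path="/tmp/.hdfs-canary"):
    """Return True iff an end-to-end write/read/delete cycle succeeds."""
    payload = str(time.time())
    try:
        client.write(path, payload)        # goes through the active NN and DNs
        ok = client.read(path) == payload  # read back what we just wrote
        client.delete(path)
        return ok
    except Exception:
        return False  # any failure along the path means the canary is down

class StubClient:
    """In-memory stand-in for an HDFS client, for demonstration only."""
    def __init__(self):
        self.files = {}
    def write(self, p, d):
        self.files[p] = d
    def read(self, p):
        return self.files[p]
    def delete(self, p):
        self.files.pop(p, None)

healthy = hdfs_canary(StubClient())
```

Run against the cluster's logical URI, this one check covers the active NN, client failover, and the DN write path at once, which is exactly what per-NN pings miss.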
    16. Ops/Admin: Hardware
        • Active/Standby NNs should be on separate racks
        • The shared storage system should be on a separate rack
        • Active/Standby NNs should have close to the same hardware
          • Same amount of RAM – they need to store the same things
          • Same number of processors – they need to serve the same number of clients
        • All the usual NN recommendations still apply
          • ECC memory, 48GB
          • Several separate disks for NN metadata directories
          • Redundant disks for OS drives, probably RAID 5 or mirroring
          • Redundant power
    17. Future Work
        • Other options to share NN metadata
          • BookKeeper
          • Multiple, potentially non-HA filers
          • An entirely different metadata system
        • More advanced client failover/load shedding
          • Serve stale reads from the standby NN
          • Speculative RPC
          • Non-RPC clients (IP failover, DNS failover, proxy, etc.)
        • Even higher HA
          • Multiple standby NNs
    18. Q&A
        • Detailed design: HDFS-1623
        • Community effort: HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073