Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2


Learn the design and implementation of features for a highly available namenode, and get an overview of how to deploy these new features.


  1. 1. High Availability for the HDFS NameNode Phase 2 • Aaron T. Myers and Todd Lipcon | Cloudera HDFS Team • October 2012
  2. 2. Introductions / who we are • Software engineers on Cloudera’s HDFS engineering team • Committers/PMC Members for Apache Hadoop at ASF • Main developers on HDFS HA • Responsible for ~80% of the code for all phases of HA development • Have helped numerous customers set up and troubleshoot HA HDFS clusters this year ©2012 Cloudera, Inc. All Rights Reserved.
  3. 3. Outline • HDFS HA Phase 1 • How did it work? What could it do? • What problems remained? • HDFS HA Phase 2: Automatic failover • HDFS HA Phase 2: Quorum Journal
  4. 4. HDFS HA Phase 1 Review HDFS-1623: completed March 2012
  5. 5. HDFS HA Development Phase 1 • Completed March 2012 (HDFS-1623) • Introduced the StandbyNode, a hot backup for the HDFS NameNode. • Relied on shared storage to synchronize namespace state • (e.g. a NAS filer appliance) • Allowed operators to manually trigger failover to the Standby • Sufficient for many HA use cases: avoided planned downtime for hardware and software upgrades, planned machine/OS maintenance, configuration changes, etc.
  6. 6. HDFS HA Architecture Phase 1 • Parallel block reports sent to Active and Standby NameNodes • NameNode state shared by locating edit log on NAS over NFS • Fencing of shared resources/data • Critical that only a single NameNode is Active at any point in time • Client failover done via client configuration • Each client configured with the address of both NNs: try both to find active
  7. 7. HDFS HA Architecture Phase 1
  8. 8. Fencing and NFS • Must avoid split-brain syndrome • Both nodes think they are active and try to write to the same file. Your metadata becomes corrupt and requires manual intervention to restart • Configure a fencing script • Script must ensure that prior active has stopped writing • STONITH: shoot-the-other-node-in-the-head • Storage fencing: e.g. using NetApp ONTAP API to restrict filer access • Fencing script must succeed to have a successful failover
  9. 9. Shortcomings of Phase 1 • Insufficient to protect against unplanned downtime • Manual failover only: requires an operator to step in quickly after a crash • Various studies indicated that unplanned downtime was the minority of downtime, but it was still important to address • Requirement of a NAS device made deployment complex, expensive, and error-prone (we always knew this was just the first phase!)
  10. 10. HDFS HA Development Phase 2 • Multiple new features for high availability • Automatic failover, based on Apache ZooKeeper • Remove dependency on NAS (network-attached storage) • Address new HA use cases • Avoid unplanned downtime due to software or hardware faults • Deploy in filer-less environments • Completely stand-alone HA with no external hardware or software dependencies • no Linux-HA, filers, etc
  11. 11. Automatic Failover Overview HDFS-3042: completed May 2012
  12. 12. Automatic Failover Goals • Automatically detect failure of the Active NameNode • Hardware, software, network, etc. • Do not require operator intervention to initiate failover • Once failure is detected, process completes automatically • Support manually initiated failover as first-class • Operators can still trigger failover without having to stop Active • Do not introduce a new SPOF • All parts of auto-failover deployment must themselves be HA
  13. 13. Automatic Failover Architecture • Automatic failover requires ZooKeeper • Not required for manual failover • ZK makes it easy to: • Detect failure of Active NameNode • Determine which NameNode should become the Active NN
  14. 14. Automatic Failover Architecture • Introduce new daemon in HDFS: ZooKeeper Failover Controller • In an auto failover deployment, run two ZKFCs • One per NameNode, on that NameNode machine • ZooKeeper Failover Controller (ZKFC) is responsible for: • Monitoring health of associated NameNode • Participating in leader election of NameNodes • Fencing the other NameNode if it wins election
  15. 15. Automatic Failover Architecture
  16. 16. ZooKeeper Failover Controller Details • When a ZKFC is started, it: • Begins checking the health of its associated NN via RPC • As long as the associated NN is healthy, attempts to create an ephemeral znode in ZK • One of the two ZKFCs will succeed in creating the znode and transition its associated NN to the Active state • The other ZKFC transitions its associated NN to the Standby state and begins monitoring the ephemeral znode
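The election described on the slide above can be sketched as a race to create one ephemeral znode. This is a minimal illustration only: `FakeZooKeeper`, `Zkfc`, the lock path, and the state names are hypothetical stand-ins invented for this example, not the real ZooKeeper client API or the actual ZKFC code, and fencing is deliberately left out.

```python
class FakeZooKeeper:
    """In-memory stand-in for ZooKeeper (hypothetical, not the real client API)."""

    def __init__(self):
        self.znodes = {}  # path -> id of the session that owns the ephemeral znode

    def create_ephemeral(self, path, session):
        # Creation fails if the znode already exists, as with a real ephemeral lock.
        if path in self.znodes:
            return False
        self.znodes[path] = session
        return True

    def expire_session(self, session):
        # When a session dies, every ephemeral znode it owns is deleted.
        for path in [p for p, s in self.znodes.items() if s == session]:
            del self.znodes[path]


class Zkfc:
    """Toy failover controller: one per NameNode."""

    LOCK_PATH = "/hdfs-ha/active-lock"  # hypothetical path chosen for this sketch

    def __init__(self, name, zk):
        self.name = name
        self.zk = zk
        self.nn_healthy = True     # a real ZKFC would learn this via health-check RPCs
        self.nn_state = "UNKNOWN"

    def run_election(self):
        if not self.nn_healthy:
            return  # an unhealthy NN does not take part in the election
        if self.zk.create_ephemeral(self.LOCK_PATH, self.name):
            self.nn_state = "ACTIVE"   # won the race for the znode
        else:
            self.nn_state = "STANDBY"  # lost: stay standby and watch the znode
```

With two controllers sharing one `FakeZooKeeper`, the first call to `run_election()` makes its NameNode active and the second leaves its NameNode in standby; after `expire_session()` for the winner, the other controller wins on its next attempt. A real deployment must fence the previous active before that transition.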
  17. 17. What happens when… • … a NameNode process crashes? • Associated ZKFC notices the health failure of the NN and withdraws from the active/standby election by removing its znode • … a whole NameNode machine crashes? • ZKFC process crashes with it and the ephemeral znode is deleted from ZK
  18. 18. What happens when… • … the two NameNodes are partitioned from each other? • Nothing happens: only one will still have the znode • … ZooKeeper crashes (or is down for an upgrade)? • Nothing happens: active stays active
  19. 19. Fencing Still Required with ZKFC • Tempting to think ZooKeeper means no need for fencing • Consider the following scenario: • Two NameNodes: A and B, each with associated ZKFC • ZKFC A process crashes, ephemeral znode removed • NameNode A process is still running • ZKFC B notices znode removed • ZKFC B wants to transition NN B to Active, but without fencing NN A, both NNs would be active simultaneously
  20. 20. Auto-failover recap • New daemon ZooKeeperFailoverController monitors the NameNodes • Automatically triggers failovers • No need for operator intervention • Fencing and dependency on NFS storage are still a pain
  21. 21. Removing the NAS dependency HDFS-3077: completed October 2012
  22. 22. Shared Storage in HDFS HA • The Standby NameNode synchronizes the namespace by following the Active NameNode’s transaction log • Each operation (e.g. mkdir(/foo)) is written to the log by the Active • The StandbyNode periodically reads all new edits and applies them to its own metadata structures • Reliable shared storage is required for correct operation
  23. 23. Shared Storage in “Phase 1” • Operator configures a traditional shared storage device (e.g. SAN or NAS) • Mount the shared storage via NFS on both Active and Standby NNs • Active NN writes to a directory on NFS, while Standby reads it
  24. 24. Shortcomings of NFS-based approach • Custom hardware • Lots of our customers don’t have SAN/NAS available in their datacenter • Costs money, time and expertise • Extra “stuff” to monitor outside HDFS • We just moved the SPOF, didn’t eliminate it! • Complicated • Storage fencing, NFS mount options, multipath networking, etc. • Organizationally complicated: dependencies on storage ops team • NFS issues • Buggy client implementations, little control over timeout behavior, etc.
  25. 25. Primary Requirements for Improved Storage • No special hardware (PDUs, NAS) • No custom fencing configuration • Too complicated == too easy to misconfigure • No SPOFs • punting to filers isn’t a good option • need something inherently distributed
  26. 26. Secondary Requirements • Configurable failure toleration • Configure N nodes to tolerate (N-1)/2 failures • Making N bigger (within reasonable bounds) shouldn’t hurt performance. Implies: • Writes done in parallel, not pipelined • Writes should not wait on slowest replica • Locate replicas on existing hardware investment (e.g. share with JobTracker, NN, SBN)
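The (N-1)/2 rule on the slide above can be checked with a few lines of arithmetic. The function names are made up for this sketch, and rounding down for even N is an assumption (the slides only discuss odd node counts).

```python
def tolerated_failures(n_nodes: int) -> int:
    """Number of nodes that may fail while a majority is still reachable."""
    return (n_nodes - 1) // 2


def quorum_size(n_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return n_nodes // 2 + 1
```

For example, `tolerated_failures(3)` is 1 and `tolerated_failures(5)` is 2, matching the 3-node and 5-node deployments mentioned later in the deck. Under the rounding assumption, a fourth node adds no extra tolerance (`tolerated_failures(4)` is still 1), which is one reason to prefer an odd count.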
  27. 27. Operational Requirements • Should be operable by existing Hadoop admins. Implies: • Same metrics system (“hadoop metrics”) • Same configuration system (xml) • Same logging infrastructure (log4j) • Same security system (Kerberos-based) • Allow existing ops to easily deploy and manage the new feature • Allow existing Hadoop tools to monitor the feature • (eg Cloudera Manager, Ganglia, etc)
  28. 28. Our solution: QuorumJournalManager • QuorumJournalManager (client) • Plugs into JournalManager abstraction in NN (instead of existing FileJournalManager) • Provides edit log storage abstraction • JournalNode (server) • Standalone daemon running on an odd number of nodes • Provides actual storage of edit logs on local disks • Could run inside other daemons in the future
  29. 29. Architecture
  30. 30. Commit protocol • NameNode accumulates edits locally as they are logged • On logSync(), sends accumulated batch to all JNs via Hadoop RPC • Waits for success ACK from a majority of nodes • Majority commit means that a single lagging or crashed replica does not impact NN latency • Latency @ NN = median(Latency @ JNs)
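A small sketch of why majority commit gives "Latency @ NN = median(Latency @ JNs)": if the batch goes to all JournalNodes in parallel, the NameNode is unblocked when the ACK that completes a majority arrives. The function name and the use of `math.inf` to model a crashed JN are assumptions made for this illustration.

```python
import math


def commit_latency(jn_ack_latencies_ms):
    """Time until a majority of JNs have ACKed a batch sent to all of them in parallel.

    Each entry is one JN's ACK latency; math.inf models a JN that never answers.
    """
    quorum = len(jn_ack_latencies_ms) // 2 + 1
    # The commit completes when the quorum-th fastest ACK arrives.
    return sorted(jn_ack_latencies_ms)[quorum - 1]
```

With three JNs answering in 2 ms, 3 ms, and 500 ms, the result is 3 ms, so the slow replica is off the critical path. For an odd number of JNs this value is exactly the median, and replacing the slowest entry with `math.inf` leaves it unchanged.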
  31. 31. JN Fencing • How do we prevent split-brain? • Each instance of QJM is assigned a unique epoch number • provides a strong ordering between client NNs • Each IPC contains the client’s epoch • JN remembers on disk the highest epoch it has seen • Any request from an earlier epoch is rejected. Any from a newer one is recorded on disk • Distributed Systems folks may recognize this technique from Paxos and other literature
  32. 32. Fencing with epochs • Fencing is now implicit • The act of becoming active causes any earlier active NN to be fenced out • Since a quorum of nodes has accepted the new active, any other IPC by an earlier epoch number can’t get quorum • Eliminates confusing and error-prone custom fencing configuration
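The epoch check from the two slides above can be sketched as follows. The class and function names are hypothetical, the "disk" is just an attribute, and the RPC layer is replaced by direct method calls; this is an illustration of the rule, not the JournalNode implementation.

```python
class StaleEpochError(Exception):
    """Raised when a writer presents an epoch older than one already seen."""


class JournalNode:
    def __init__(self):
        self.highest_epoch = 0  # a real JN would persist this on disk
        self.edits = []

    def journal(self, epoch, txns):
        # Requests from an earlier epoch are rejected outright.
        if epoch < self.highest_epoch:
            raise StaleEpochError(f"epoch {epoch} < {self.highest_epoch}")
        # A newer (or equal) epoch is remembered, then the edits are stored.
        self.highest_epoch = epoch
        self.edits.extend(txns)


def quorum_write(jns, epoch, txns):
    """Return True if a strict majority of JNs accepted the batch."""
    acks = 0
    for jn in jns:
        try:
            jn.journal(epoch, txns)
            acks += 1
        except StaleEpochError:
            pass  # this JN has already promised itself to a newer writer
    return acks > len(jns) // 2
```

Once a majority of the JNs has seen epoch 2, a writer still using epoch 1 can collect at most a minority of ACKs, so `quorum_write` returns False for it. That is the sense in which the old active is fenced without any external fencing script.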
  33. 33. Segment recovery • In normal operation, a minority of JNs may be out of sync • After a crash, all JNs may have different numbers of txns (last batch may or may not have arrived at each) • eg JN1 was down, JN2 crashed right before NN wrote txnid 150: • JN1: has no edits • JN2: has edits 101-149 • JN3: has edits 101-150 • Before becoming active, we need to come to consensus on this last batch: was it committed or not?
 • Use the well-known Paxos algorithm to solve consensus
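To make the slide's example concrete, the sketch below only counts which JournalNodes hold a given transaction. It does not implement recovery or Paxos; the data layout and function names are assumptions made for this illustration.

```python
# Edit ranges from the slide's example: (first_txid, last_txid), or None for no edits.
EXAMPLE_LOGS = {"JN1": None, "JN2": (101, 149), "JN3": (101, 150)}


def holders(txid, jn_logs):
    """Names of the JNs whose log contains the given txid."""
    return [
        name
        for name, txid_range in jn_logs.items()
        if txid_range is not None and txid_range[0] <= txid <= txid_range[1]
    ]


def majority_holds(txid, jn_logs):
    """True if a strict majority of JNs has the txid on disk."""
    return len(holders(txid, jn_logs)) > len(jn_logs) // 2
```

In the example, txid 149 is on two of three JNs, while txid 150 is on JN3 only. A new active that happens to reach JN1 and JN2 never sees 150, whereas one that reaches JN3 does, so the JNs must agree on a single answer for the last batch before writing resumes. That agreement is the consensus problem the slide refers to.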
  34. 34. Other implementation features • Hadoop Metrics • lag, percentile latencies, etc from perspective of JN, NN • metrics for queued txns, % of time each JN fell behind, etc, to help suss out a slow JN before it causes problems • Security • full Kerberos and SSL support: edits can be optionally encrypted in-flight, and all access is mutually authenticated
  35. 35. Testing • Randomized fault test • Runs all communications in a single thread with deterministic order and fault injections based on a seed • Caught a number of really subtle bugs along the way • Run as an MR job: 5000 fault tests in parallel • Multiple CPU-years of stress testing: found 2 bugs in Jetty! • Cluster testing: 100-node, MR, HBase, Hive, etc • Commit latency in practice: within same range as local disks (better than one of two local disks, worse than the other one)
  36. 36. Deployment and Configuration • Most customers running 3 JNs (tolerate 1 failure) • 1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager • Optionally run 2 more (eg on bastion/gateway nodes) to tolerate 2 failures • Configuration: • dfs.namenode.shared.edits.dir: qjournal://,, com:8485/my-journal • dfs.journalnode.edits.dir: /data/1/hadoop/journalnode/ • dfs.ha.fencing.methods: shell(/bin/true) (fencing not required!)
  37. 37. Status • Merged into Hadoop development trunk in early October • Available in CDH4.1 • Deployed at several customer/community sites with good success so far • Planned rollout to 20+ production HBase clusters within the month
  38. 38. Conclusion
  39. 39. HA Phase 2 Improvements • Run an active NameNode and a hot Standby NameNode • Automatically triggers seamless failover using Apache ZooKeeper • Stores shared metadata on QuorumJournalManager: a fully distributed, redundant, low latency journaling system. • All improvements available now in HDFS trunk and CDH4.1
  41. 41. Backup Slides
  42. 42. Why not BookKeeper? • Pipelined commit instead of quorum commit • Unpredictable latency • Research project • Not “Hadoopy” • Their own IPC system, no security, different configuration, no metrics • External • Feels like “two systems” to ops/deployment instead of just one • Nevertheless: it’s pluggable and BK is an additional option.
  43. 43. Epoch number assignment • On startup: • NN -> JN: getEpochInfo() • JN: respond with current promised epoch • NN: set epoch = max(promisedEpoch) + 1 • NN -> JN: newEpoch(epoch) • JN: if it is still higher than promisedEpoch, remember it and ACK, otherwise NACK • If NN receives ACK from a quorum of nodes, then it has uniquely claimed that epoch
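The startup handshake on the slide above can be sketched like this. The method names mirror the slide's `getEpochInfo()` and `newEpoch()` in Python style, but the classes are hypothetical, RPCs are plain method calls, and every JN is assumed to respond, which a real NameNode cannot rely on.

```python
class JournalNode:
    def __init__(self, promised_epoch=0):
        self.promised_epoch = promised_epoch  # a real JN would keep this on disk

    def get_epoch_info(self):
        # Step 1 reply: report the current promised epoch.
        return self.promised_epoch

    def new_epoch(self, epoch):
        # ACK only if the proposal is still higher than anything promised so far.
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False  # NACK: another writer got here first


def claim_epoch(jns):
    """Return the newly claimed epoch, or None if no quorum ACKed the proposal."""
    proposed = max(jn.get_epoch_info() for jn in jns) + 1
    acks = sum(1 for jn in jns if jn.new_epoch(proposed))
    return proposed if acks > len(jns) // 2 else None
```

Starting from promised epochs 3, 5, and 4, `claim_epoch` proposes 6 and every JN ACKs it; a second caller then claims 7. A late `new_epoch(6)` after that is NACKed. Two writers cannot both claim the same epoch, because each JN ACKs a given number at most once and any two majorities share at least one JN.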