Hadoop World 2011: HDFS Name Node High Availability - Aaron Myers, Cloudera & Sanjay Radia, Hortonworks
 

HDFS HA has been a highly sought-after feature for years. Through collaboration between Cloudera, Facebook, Yahoo!, and others, a high availability system for the HDFS Name Node is actively being worked on, and will likely be complete by Hadoop World. This talk will discuss the architecture and setup of this system.


  • Speaker notes: Data – can I read what I wrote, and is the service available? When I asked one of the original authors of GFS whether there were any decisions they would revisit – random writers. Simplicity is key. Raw disk – file systems take time to stabilize; we can take advantage of ext4, xfs, or zfs.

Hadoop World 2011: HDFS Name Node High Availability - Aaron Myers, Cloudera & Sanjay Radia, Hortonworks Presentation Transcript

  • 1. NameNode HA
    Suresh Srinivas - Hortonworks
    Aaron T. Myers - Cloudera
  • 2. Overview
    Part 1 – Suresh Srinivas (Hortonworks)
    HDFS Availability and Data Integrity – what is the record?
    NN HA Design
    Part 2 – Aaron T. Myers (Cloudera)
    NN HA Design continued
    Client-NN Connection failover
    Operations and Admin of HA
    Future Work
    2
  • 3. Current HDFS Availability & Data Integrity
    Simple design, storage fault tolerance
    Storage: Rely on the OS's file system rather than using raw disk
    Storage Fault Tolerance: multiple replicas, active monitoring
    Single NameNode Master
    Persistent state: multiple copies + checkpoints
    Restart on failure
    How well did it work?
    Lost 19 out of 329 Million blocks on 10 clusters with 20K nodes in 2009
    7-9’s of reliability
    Fixed in 0.20 and 0.21.
    18-month study: 22 failures on 25 clusters - 0.58 failures per year per cluster
    Only 8 would have benefitted from HA failover!! (0.23 failures per cluster year)
    NN is very robust and can take a lot of abuse
    NN is resilient against overload caused by misbehaving apps
    3
  • 4. HA NameNode
    Active work has started on HA NameNode (Failover)
    HA NameNode
    Detailed design and sub tasks in HDFS-1623
    HA: Related work
    Backup NN (0.21)
    Avatar NN (Facebook)
    HA NN prototype using Linux HA (Yahoo!)
    HA NN prototype with Backup NN and block report replicator (eBay)
    HA is the highest priority
    4
  • 5. Approach and Terminology
    Initial goal is Active-Standby
    With Federation each namespace volume has a NameNode
    Single active NN for any namespace volume
    Terminology
    Active NN – actively serves the read/write operations from the clients
    Standby NN - waits, becomes active when Active dies or is unhealthy
    Could serve read operations
    Standby’s State may be cold, warm or hot
    Cold: Standby has zero state (e.g. started after the Active is declared dead).
    Warm: Standby has partial state:
    has loaded fsImage & editLogs but has not received any block reports
    Hot: Standby has almost all of the Active's state and can take over immediately
    5
  • 6. High Level Use Cases
    Supported failures
    Single hardware failure
    Double hardware failure not supported
    Some software failures
    Same software failure affects both active and standby
    Planned downtime
    Upgrades
    Config changes
    Main reason for downtime
    Unplanned downtime
    Hardware failure
    Server unresponsive
    Software failures
    Occurs infrequently
    6
  • 7. Use Cases
    Deployment models
    Single NN configuration; no failover
    Active and Standby with manual failover
    Standby could be cold/warm/hot
    Addresses downtime during upgrades – main cause of unavailability
    Active and Standby with automatic failover
    Hot standby
    Addresses downtime during upgrades and other failures
    See HDFS-1623 for detailed use cases
    7
  • 8. Design
    Failover control outside NN
    Parallel Block reports to Active and Standby (Hot failover)
    Shared or non-shared NN state
    Fencing of shared resources/data
    Datanodes
    Shared NN state (if any)
    Client failover
    IP Failover
    Smart clients (e.g. configuration, or ZooKeeper for coordination)
    8
  • 9. Failover Control Outside NN
    HA Daemon outside NameNode
    Daemon manages resources
    All resources modeled uniformly
    Resources – OS, HW, Network etc.
    NameNode is just another resource
    Heartbeat with other nodes
    Quorum based leader election
    ZooKeeper for coordination and Quorum
    Fencing during split brain
    Prevents data corruption
    [Slide diagram: an HA daemon manages resources (the NameNode, OS, HW, network) through actions such as start, stop, failover, and monitor; it heartbeats with its peers and performs leader election via a quorum service, and applies fencing/STONITH to shared resources.]
  • 10. NN HA with Shared Storage and ZooKeeper
    [Slide diagram: Active NN and Standby NN, each paired with a FailoverController that monitors the health of the NN, OS, and HW and heartbeats to a ZooKeeper (ZK) ensemble; NN state is shared with a single writer (fencing); DataNodes send block reports to both the Active and the Standby; DN fencing ensures DataNodes act on update commands from only one NN.]
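
    The automatic-failover arrangement in this diagram corresponds, in the feature as eventually released, to configuration roughly like the sketch below. This is an illustrative sketch, not part of the talk: the properties dfs.ha.automatic-failover.enabled (hdfs-site.xml) and ha.zookeeper.quorum (core-site.xml) are assumed from the shipped ZKFC-based implementation, and the ZooKeeper host names are placeholders.

      <!-- hdfs-site.xml (sketch): let the FailoverController drive failover automatically -->
      <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
      </property>

      <!-- core-site.xml (sketch): ZooKeeper ensemble used for coordination and leader
           election; the host names below are placeholders -->
      <property>
        <name>ha.zookeeper.quorum</name>
        <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
      </property>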
  • 11. HA Design Details
    11
  • 12. Client Failover Design
    Smart clients
    Users use one logical URI, client selects correct NN to connect to
    Implementing two options out of the box
    Client Knows of multiple NNs
    Use a coordination service (ZooKeeper)
    Common things between these
    Which operations are idempotent, therefore safe to retry on a failover
    Failover/retry strategies
    Some differences
    Expected time for client failover
    Ease of administration
    12
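
    As a sketch of the "client knows of multiple NNs" option above: clients use one logical URI (e.g. hdfs://mycluster) and the configuration enumerates the NameNodes behind it. The property names below are assumed from the client-side configuration HDFS HA eventually shipped with; the nameservice ID mycluster, the NameNode IDs nn1/nn2, and the host names are placeholders.

      <!-- hdfs-site.xml (sketch): one logical nameservice backed by two NameNodes -->
      <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>nn1.example.com:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>nn2.example.com:8020</value>
      </property>
      <!-- the client resolves hdfs://mycluster to whichever NN is active and retries
           idempotent operations against the other NN on failover -->
      <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>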
  • 13. Ops/Admin: Shared Storage
    To share NN state, need shared storage
    Needs to be HA itself to avoid just shifting SPOF
    BookKeeper, etc. will likely take care of this in the future
    Many come with IP fencing options
    Recommended mount options:
    tcp,soft,intr,timeo=60,retrans=10
    Not all edits directories are created equal
    Used to be all edits dirs were just a pool of redundant dirs
    Can now configure some edits directories to be required
    Can now configure number of tolerated failures
    You want at least 2 for durability, 1 remote for HA
    13
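
    A hedged sketch of the shared-edits and required-directory settings mentioned above, assuming the dfs.namenode.shared.edits.dir and dfs.namenode.edits.dir.required property names from the released feature; the filer mount point is a placeholder, and the underlying NFS export would be mounted with the options recommended on this slide.

      <!-- hdfs-site.xml (sketch): edits directory on an HA filer, mounted over NFS
           (placeholder path) -->
      <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>file:///mnt/filer/hadoop/ha-edits</value>
      </property>
      <!-- mark the shared directory as required so the NN refuses to run without it -->
      <property>
        <name>dfs.namenode.edits.dir.required</name>
        <value>file:///mnt/filer/hadoop/ha-edits</value>
      </property>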
  • 14. Ops/Admin: NN fencing
    Client failover does not solve this problem
    Out of the box
    RPC to active NN to tell it to go to standby (graceful failover)
    SSH to active NN and `kill -9` the NN
    Pluggable options
    Many filers have protocols for IP-based fencing options
    Many PDUs have protocols for IP-based plug-pulling (STONITH)
    Nuke the node from orbit. It’s the only way to be sure.
    Configure extra options if available to you
    Will be tried in order during a failover event
    Escalate the aggressiveness of the method
    Fencing is critical for correctness of NN metadata
    14
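
    A minimal sketch of the fencing configuration described above, assuming the dfs.ha.fencing.methods and dfs.ha.fencing.ssh.private-key-files properties from the released feature; the fencing script path and SSH key path are placeholders. Methods are newline-separated and tried in order, escalating in aggressiveness.

      <!-- hdfs-site.xml (sketch): fencing methods tried in order during a failover.
           First SSH in and kill the old active NN's process, then escalate to a
           site-specific script (placeholder path) for filer or PDU fencing. -->
      <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence
shell(/path/to/custom-fence.sh)</value>
      </property>
      <!-- private key used by the sshfence method (placeholder path) -->
      <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hdfs/.ssh/id_rsa</value>
      </property>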
  • 15. Ops/Admin: Monitoring
    New NN metrics
    Size of pending DN message queues
    Seconds since the standby NN last read from shared edit log
    DN block report lag
    All measurements of standby NN lag – monitor/alert on all of these
    Monitor shared storage solution
    Volumes fill up, disks go bad, etc.
    Should configure paranoid edit log retention policy (default is 2)
    Canary-based monitoring of HDFS a good idea
    Pinging both NNs not sufficient
    15
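
    The "paranoid edit log retention policy (default is 2)" above maps, in the released feature, to checkpoint and edits retention settings along these lines. The property names dfs.namenode.num.checkpoints.retained and dfs.namenode.num.extra.edits.retained are assumed from hdfs-default.xml, and the values shown are illustrative, not a recommendation from the talk.

      <!-- hdfs-site.xml (sketch): retain more checkpoints and trailing edits than the
           defaults so a lagging standby (or an operator) has history to fall back on -->
      <property>
        <name>dfs.namenode.num.checkpoints.retained</name>
        <value>10</value>  <!-- default is 2 -->
      </property>
      <property>
        <name>dfs.namenode.num.extra.edits.retained</name>
        <value>10000000</value>  <!-- illustrative; raise above the default -->
      </property>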
  • 16. Ops/Admin: Hardware
    Active/Standby NNs should be on separate racks
    Shared storage system should be on separate rack
    Active/Standby NNs should have close to the same hardware
    Same amount of RAM – need to store the same things
    Same # of processors - need to serve same number of clients
    All the same recommendations still apply for NN
    ECC memory, 48GB
    Several separate disks for NN metadata directories
    Redundant disks for OS drives, probably RAID 5 or mirroring
    Redundant power
    16
  • 17. Future Work
    Other options to share NN metadata
    BookKeeper
    Multiple, potentially non-HA filers
    Entirely different metadata system
    More advanced client failover/load shedding
    Serve stale reads from the standby NN
    Speculative RPC
    Non-RPC clients (IP failover, DNS failover, proxy, etc.)
    Even Higher HA
    Multiple standby NNs
    17
  • 18. QA
    Detailed design (HDFS-1623)
    Community effort
    HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073
    18