High Availability Hadoop Clusters
Impact
• Planned downtime
− Upgrades
− Configuration changes
• Unplanned downtime
− Hardware failure
− Server unresponsive
− Software failures
− Occurs infrequently
Different Kinds Of HA Configurations
• HDFS HA using the Quorum Journal Manager (QJM)
• HDFS HA using NFS for shared storage
• Resource Manager HA
HDFS HA - Necessary Hardware Resources
• NameNode machines
− Active NN
− Standby NN
Both of these should ideally run on equivalent hardware.
• JournalNodes
− Lightweight daemons that can run on machines that host other Hadoop daemons.
− Deploy at least 3 JournalNode daemons, because shared edit log changes must be written
to a majority of the JournalNodes.
− Run an odd number of JournalNode daemons (3, 5, 7, etc.).
− With N JournalNodes, the system tolerates at most (N-1)/2 failures.
• ZooKeeper service
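The majority rule for JournalNodes can be checked with a little arithmetic. The sketch below is only an illustration of the (N-1)/2 formula from the slide; the function names `quorum_size` and `tolerated_failures` are hypothetical and are not part of any Hadoop API.

```python
def quorum_size(n_journal_nodes: int) -> int:
    """Smallest number of JournalNodes that forms a majority."""
    if n_journal_nodes < 1:
        raise ValueError("need at least one JournalNode")
    return n_journal_nodes // 2 + 1


def tolerated_failures(n_journal_nodes: int) -> int:
    """Number of JournalNodes that can fail while a majority remains."""
    if n_journal_nodes < 1:
        raise ValueError("need at least one JournalNode")
    return (n_journal_nodes - 1) // 2


# Compare a few cluster sizes, including an even one.
for n in (3, 4, 5, 7):
    print(f"N={n}: quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

With 3, 5 and 7 JournalNodes the tolerated failures are 1, 2 and 3. A fourth node still tolerates only 1 failure, which is why odd counts are recommended.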
HDFS HA Architecture Using The Quorum Journal Manager
RM HA - Necessary Hardware Resources
• Resource Manager machines
− Active RM
− Standby RM
Both of these should ideally run on equivalent hardware.
• ZooKeeper service
Resource Manager HA Architecture
RM Failover
• Two failover mechanisms
− Manual transition - An administrator transitions the current active RM to standby and
then transitions the standby RM to active.
− Automatic failover - An embedded ZooKeeper-based ActiveStandbyElector decides which
RM is in the active state.
• Each client must be configured with the full list of RMs. Clients try the RMs in
round-robin order until they reach the active one.
• The promoted RM continues from where the previous RM left off. The new RM
spawns a new attempt for each managed application. Applications can write
checkpoints to avoid losing work. All state is stored in the ZooKeeper state store, which
allows only a single RM to have write access at a time.
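The round-robin client behaviour described above can be sketched as a small simulation. This is not the YARN client implementation: `ResourceManager`, `StandbyError`, `submit` and `submit_with_failover` are hypothetical names, and it assumes a standby RM simply rejects requests.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ResourceManager:
    """Hypothetical stand-in for one RM endpoint known to the client."""
    rm_id: str
    state: str  # "active" or "standby"


class StandbyError(Exception):
    """Raised when a request reaches an RM that is not active."""


def submit(rm: ResourceManager, request: str) -> str:
    # Assumption for this sketch: only the active RM serves requests.
    if rm.state != "active":
        raise StandbyError(f"{rm.rm_id} is in standby")
    return f"{request} accepted by {rm.rm_id}"


def submit_with_failover(rms: Sequence[ResourceManager], request: str, start: int = 0) -> str:
    """Try each configured RM in round-robin order until one accepts."""
    n = len(rms)
    for offset in range(n):
        rm = rms[(start + offset) % n]
        try:
            return submit(rm, request)
        except StandbyError:
            continue  # move on to the next RM in the list
    raise RuntimeError("no active ResourceManager reachable")


rms = [ResourceManager("rm1", "active"), ResourceManager("rm2", "standby")]
before = submit_with_failover(rms, "job-1")

# Simulate a failover: rm1 steps down and rm2 is promoted.
rms[0].state, rms[1].state = "standby", "active"
after = submit_with_failover(rms, "job-2")

print(before)
print(after)
```

The client needs the full RM list because it cannot know in advance which RM is active; after the simulated failover the same call reaches `rm2` without any change to the client's configuration.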

Failsafe Hadoop Infrastructure and the way they work
