AvatarNode (AN)
- Active-Standby pair, coordinated via ZooKeeper
- Failover in a few seconds
- Wrapper over NameNode
- Active AvatarNode: writes transaction log to NFS filer
- Standby AvatarNode: reads/consumes transactions from NFS filer, processes all messages from DataNodes, keeps latest metadata in memory
[Diagram: Client retrieves block location from Primary or Standby. Active AvatarNode (NameNode) writes transactions to the NFS filer; Standby AvatarNode (NameNode) reads them. DataNodes send block location messages to both AvatarNodes.]
Four steps to failover
1. Wipe the ZooKeeper entry. Clients will know that failover is in progress. (0 seconds)
2. Stop the primary NameNode. The last bits of data are flushed to the transaction log and the node dies. (Seconds)
3. Switch the Standby to Primary. It consumes the rest of the transaction log and leaves SafeMode, ready to serve traffic. (Seconds)
4. Update the entry in ZooKeeper. All clients waiting for failover pick up the new connection. (0 seconds)
After: start the first node in Standby mode. (Takes a while, but the cluster is up and running.)
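The four steps above can be sketched as a small in-memory simulation. This is only an illustration of the ordering, not AvatarNode code: the `SharedLog` and `AvatarNode` classes, the `failover` function, and the plain dict standing in for ZooKeeper are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class SharedLog:
    """Hypothetical stand-in for the transaction log on shared NFS storage."""
    entries: list = field(default_factory=list)


@dataclass
class AvatarNode:
    """Hypothetical stand-in for one node of the active-standby pair."""
    name: str
    log: SharedLog
    role: str = "standby"
    applied: int = 0                              # log entries applied so far
    pending: list = field(default_factory=list)   # edits not yet flushed
    running: bool = True

    def write(self, txn):
        # Only a running primary accepts namespace edits.
        assert self.role == "primary" and self.running
        self.pending.append(txn)

    def stop(self):
        # Flush the last edits to the shared log, then shut down.
        self.log.entries.extend(self.pending)
        self.pending.clear()
        self.running = False

    def promote(self):
        # Consume the rest of the shared log, then start serving as primary.
        self.applied = len(self.log.entries)
        self.role = "primary"


def failover(zk, primary, standby):
    zk.pop("primary", None)        # 1. wipe the ZooKeeper entry
    primary.stop()                 # 2. stop the primary; last edits are flushed
    standby.promote()              # 3. standby catches up and becomes primary
    zk["primary"] = standby.name   # 4. publish the new primary to clients
    return standby
```

The ordering matters in this sketch: the entry is wiped before the old primary stops, so clients see "failover in progress" rather than a stale address, and the new entry is written only after the standby has consumed the whole log.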
AvatarNode @Facebook
[Diagram from Facebook]
Contrib@hadoop 0.20 (HDFS-976)
Conclusions
- Complete hot standby
  - NFS for storage of fsimage and editlogs (no data loss)
  - Standby node consumes transactions from editlogs on NFS continuously (namespace hot standby)
  - DataNodes send messages to both primary and standby nodes (block reports hot standby)
- Fast switchover: less than a minute
- Makes sense!
BackupNode (BN)
- NN synchronously streams transaction log to BackupNode
- BackupNode applies log to in-memory and disk image
- BN always commits to disk before reporting success to NN
- If BN restarts, it has to catch up with NN
- Available in HDFS 0.20.1 release
[Diagram: Client retrieves block location from NN. NN (NameNode) synchronously streams transaction logs to BN (BackupNode). DataNodes send block location messages to NN.]
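A minimal sketch of the synchronous stream described on this slide, assuming a single NN and a single BN in one process. The class and method names (`NameNode.apply`, `BackupNode.receive`) are hypothetical and do not correspond to the HDFS API; lists and dicts stand in for the on-disk log and the in-memory image.

```python
class BackupNode:
    """Hypothetical stand-in for the BN side of the stream."""

    def __init__(self):
        self.disk_log = []     # stand-in for edits persisted on the BN's disk
        self.namespace = {}    # stand-in for the BN's in-memory image

    def receive(self, txn):
        path, value = txn
        self.disk_log.append(txn)      # commit to disk first
        self.namespace[path] = value   # then apply to the in-memory image
        return True                    # only now acknowledge to the NN


class NameNode:
    """Hypothetical stand-in for the NN side of the stream."""

    def __init__(self, backup):
        self.backup = backup
        self.edit_log = []
        self.namespace = {}

    def apply(self, path, value):
        txn = (path, value)
        self.edit_log.append(txn)
        # Synchronous: the edit does not succeed until the BN has acknowledged.
        if not self.backup.receive(txn):
            raise RuntimeError("BackupNode did not acknowledge the transaction")
        self.namespace[path] = value
        return True
```

Because the NN waits for the acknowledgement, the BN's namespace never lags a completed edit in this sketch. That is the sense in which the namespace is a hot standby, while block locations are not covered at all.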
Limitations of BackupNode (BN)
- Maximum of one BackupNode per NN
  - Supports only two-machine failure
- NN doesn't forward block reports to BackupNode
- Time to restart from a 12 GB image, 70M files + 100M blocks
  - 3-5 minutes to read the image from disk
  - 20 minutes to process block reports
- BN will still take 25+ minutes to fail over!
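The restart figures above can be added up directly. The short calculation below simply sums the two numbers from the slide; the variable names are illustrative, and treating the two phases as strictly sequential is an assumption.

```python
# Figures taken from the slide, in minutes.
image_load_min = (3, 5)    # read the 12 GB image from disk (low, high)
block_report_min = 20      # process block reports from DataNodes

# Assumption: the two phases run one after the other.
low = image_load_min[0] + block_report_min
high = image_load_min[1] + block_report_min

print(f"Estimated time before serving traffic: {low}-{high} minutes")
```

The sum of 23 to 25 minutes is in line with the slide's "25+ minutes". Block report processing dominates, which is why not forwarding block reports to the BackupNode is the costly limitation.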
Conclusions
- Incomplete hot standby / semi-hot standby
  - Namespace: hot standby
  - Block reports: cold standby
- Still-slow switchover
Other HA solutions
- DRBD + Linux HA: http://www.cloudera.com/blog/2009/07/hadoop-ha-configuration/
- Metadata backup: http://wiki.apache.org/hadoop/NameNodeFailover