IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS
Paper presentation at the 31st IEEE Symposium on Reliable Distributed Systems (SRDS 2012).
1. FROM BACKUP TO HOT STANDBY: HIGH AVAILABILITY FOR HDFS
   André Oriani, Islene Calciolari Garcia
   Institute of Computing – University of Campinas, Brazil
2. AGENDA
   • Motivation;
   • Architecture of HDFS 0.21;
   • Implementation of the Hot Standby Node;
   • Experiments and Results;
   • High Availability features in HDFS 2.0.0-alpha;
   • Conclusions and Future Work.
3. MOTIVATION: CLUSTER-BASED PARALLEL FILE SYSTEMS
   Master-slaves architecture:
   • Metadata Server – serves clients, manages the namespace.
   • Storage Servers – store the data.
   Centralized system: the Metadata Server is a single point of failure (SPOF).
   Importance of a hot standby for the metadata server of HDFS: a cold start of a 2000-node HDFS cluster with 21 PB and 150 million files takes ~45 min [8].
4. HADOOP DISTRIBUTED FILE SYSTEM (HDFS) 0.21
   [Architecture diagram: a Client interacting with the NameNode, several DataNodes, and a Backup Node.]
5. DATANODES (storage nodes)
   • Files are split into equal-sized blocks.
   • Blocks are replicated across DataNodes.
   • DataNodes send status messages to the NameNode:
     • Heartbeats;
     • Block-Reports;
     • Block-Received.
   (A sketch of this reporting loop follows.)
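For concreteness, here is a hypothetical sketch of such a reporting loop. The class names, the RPC interface, and the intervals are assumptions for illustration, not the actual HDFS DataNode code.

```java
// Hypothetical sketch of a DataNode's reporting loop (illustrative names only).
import java.util.List;

interface MetadataServer {                     // stand-in for the NameNode RPC interface
    void heartbeat(String datanodeId);
    void blockReport(String datanodeId, List<Long> allBlockIds);
    void blockReceived(String datanodeId, List<Long> newBlockIds);
}

interface BlockStore {                          // local block storage abstraction
    List<Long> drainNewlyReceived();
    List<Long> allBlockIds();
}

class ReportingLoop implements Runnable {
    private final MetadataServer nameNode;
    private final BlockStore store;
    private final String id;
    private long lastBlockReport = 0;

    ReportingLoop(MetadataServer nameNode, BlockStore store, String id) {
        this.nameNode = nameNode; this.store = store; this.id = id;
    }

    public void run() {
        while (true) {
            nameNode.heartbeat(id);                            // liveness signal
            List<Long> newBlocks = store.drainNewlyReceived();
            if (!newBlocks.isEmpty()) {
                nameNode.blockReceived(id, newBlocks);         // incremental report
            }
            if (System.currentTimeMillis() - lastBlockReport > 60 * 60 * 1000) {
                nameNode.blockReport(id, store.allBlockIds()); // full periodic report
                lastBlockReport = System.currentTimeMillis();
            }
            try { Thread.sleep(3000); } catch (InterruptedException e) { return; }
        }
    }
}
```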
6. NAMENODE (metadata server)
   • Manages the file system tree.
   • Handles metadata requests.
   • Controls access and leases.
   • Manages blocks:
     • allocation;
     • location;
     • replication levels.
   [Diagram: the Client sends requests and receives metadata from the NameNode; DataNodes send heartbeats, block-reports and block-received messages and receive commands.]
7. NAMENODE'S STATE
   • All state (file system tree, block management, leases) is kept in RAM for better performance.
   • Changes to the namespace are recorded to a journal (the edit LOG).
   • Lease and block information is volatile: it is not journaled.
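As a rough illustration of this design, a namespace mutation is applied to the in-memory structures and appended to the edit log, while block locations and leases are never journaled. All names below are hypothetical; this is not the actual NameNode implementation.

```java
// Illustrative sketch of write-ahead journaling of namespace changes.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class Namespace {
    private final Map<String, FileMeta> tree = new HashMap<>(); // in-RAM namespace (path -> metadata)
    private final BufferedWriter editLog;                        // durable journal

    Namespace(String editLogPath) throws IOException {
        this.editLog = new BufferedWriter(new FileWriter(editLogPath, true));
    }

    synchronized void createFile(String path, String owner) throws IOException {
        tree.put(path, new FileMeta(owner));                    // 1. mutate in-memory state
        editLog.write("CREATE " + path + " " + owner + "\n");   // 2. journal the change
        editLog.flush();                                        // 3. force it to stable storage
        // Block locations and leases are NOT journaled: after a restart they are
        // rebuilt from DataNode reports and from client activity.
    }

    static class FileMeta {
        final String owner;
        FileMeta(String owner) { this.owner = owner; }
    }
}
```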
8. BACKUP NODE (checkpoint helper)
   • The NameNode streams journal entries (namespace changes) to the Backup Node.
   • Efficient checkpoint strategy: the Backup Node applies the changes to its own state.
   • A checkpoint of the Backup Node's state == a checkpoint of the NameNode's state.
9. A HOT STANDBY FOR HDFS 0.21
   • The Backup Node is an opportunity:
     • it already replicates the namespace state;
     • it is a subclass of the NameNode, so it can process client requests.
   • Evolving the Backup Node into a Hot Standby Node:
     1. Handle the missing state components: replica locations and leases.
     2. Detect the NameNode's failure.
     3. Switch the Hot Standby Node to the active state (failover).
     4. Disseminate the identity of the current metadata server.
10. MISSING STATE: REPLICA LOCATIONS
   Reuses the AvatarNode's strategy:
   • DataNodes send their status messages to the Hot Standby Node too.
   • No rigid synchronization: DataNode failures are already relatively common (stable clusters see 2-3 failures per day in 1,000 nodes [3]).
   • The Hot Standby Node is kept in safe mode; it becomes the authority on block locations once it is active.
   (A sketch of the dual reporting follows.)
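A sketch of the dual reporting mentioned above. The interface and class names are hypothetical; the point is only that delivery to the Hot Standby is best effort, mirroring how the system already tolerates DataNode failures.

```java
// Sketch of a DataNode reporting to both the active NameNode and the Hot
// Standby Node (hypothetical names; not the actual HDFS code).
import java.util.List;

class DualReporter {
    /** Stand-in for the metadata servers' RPC interface. */
    interface Reports {
        void blockReceived(String datanodeId, List<Long> blockIds);
    }

    private final Reports active;   // primary NameNode: authoritative
    private final Reports standby;  // Hot Standby Node: best effort
    private final String datanodeId;

    DualReporter(Reports active, Reports standby, String datanodeId) {
        this.active = active;
        this.standby = standby;
        this.datanodeId = datanodeId;
    }

    void reportReceivedBlocks(List<Long> blockIds) {
        active.blockReceived(datanodeId, blockIds);      // must reach the active node
        try {
            standby.blockReceived(datanodeId, blockIds); // no rigid synchronization:
        } catch (RuntimeException unreachableStandby) {  // a miss here is tolerated;
            // the standby catches up from later reports, just as the system
            // already copes with ordinary DataNode failures.
        }
    }
}
```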
11. MISSING STATE: LEASES
   Leases are not replicated:
   • Blocks are only recorded to the log when a file is closed.
   • So any write in progress is lost if the NameNode fails.
   • Restarting the write creates a new lease.
   [Sequence diagram: the Client calls open(file) and the NameNode adds a lease (addLease(file, client)); repeated getAdditionalBlock() calls allocate blocks without logging them; only on close(file) is complete(file, blocks) written to the LOG.]
12. FAILURE DETECTION
   Uses ZooKeeper:
   • a highly available distributed coordination service;
   • keeps a tree of znodes replicated among the ensemble;
   • all operations are atomic;
   • leaf znodes may be ephemeral: they are removed when the session of the client that created them ends;
   • clients can watch znodes and be notified of changes.
   [Diagram: the NameNode and the Hot Standby Node register under a "Namenodes" group znode in the ZooKeeper ensemble; the group also stores the active's IP, which clients read.]
   (A sketch using the ZooKeeper client API follows.)
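A minimal sketch of this failure-detection scheme with the ZooKeeper Java client API. The znode paths, the session timeout, and the takeover callback are assumptions for illustration, not the presentation's actual implementation.

```java
// Failure detection with an ephemeral znode and a watch (illustrative paths).
import java.io.IOException;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class FailureDetectionSketch {
    // The active NameNode registers itself; assumes the persistent /namenodes
    // group znode already exists.
    static void registerActive(ZooKeeper zk, String ipAndPort)
            throws KeeperException, InterruptedException {
        // Liveness marker: tied to this session, deleted automatically on crash.
        zk.create("/namenodes/active", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // The group znode stores the IP of the current active metadata server.
        zk.setData("/namenodes", ipAndPort.getBytes(), -1);
    }

    // The Hot Standby watches the marker; its deletion is the failover trigger.
    static void watchActive(ZooKeeper zk, Runnable startFailover)
            throws KeeperException, InterruptedException {
        Watcher watcher = (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                startFailover.run();
            }
        };
        // Watches are one-shot; a real implementation re-registers after firing.
        zk.exists("/namenodes/active", watcher);
    }

    public static void main(String[] args)
            throws IOException, KeeperException, InterruptedException {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
        watchActive(zk, () -> System.out.println("Active NameNode gone, taking over"));
    }
}
```

Because the marker znode is ephemeral, ZooKeeper deletes it when the active node's session expires, so the watch fires without any explicit "I am dead" message from the failed NameNode.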
13. FAILOVER
   The NameNode fails.
   [Diagram: the NameNode's entry disappears from the Namenodes group znode in the ZooKeeper ensemble, leaving only the Hot Standby's; the Hot Standby Node is notified.]
14. FAILOVER (CONT.)
   Switching the Hot Standby Node to active (sketched below):
   1. Stop checkpointing;
   2. Close all open files;
   3. Restart lease management;
   4. Leave safe mode;
   5. Update the group znode with its own network address.
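The switch can be pictured as a single sequence of calls. The sketch below is a hypothetical rendering of the five steps above; the StandbyNameNode methods and the znode path are illustrative, not the actual Hot Standby Node code.

```java
// Hypothetical failover sequence for the Hot Standby Node.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class FailoverSketch {
    interface StandbyNameNode {            // stand-in for the Hot Standby Node
        void stopCheckpointing();
        void closeAllOpenFiles();
        void restartLeaseManagement();
        void leaveSafeMode();
    }

    private final ZooKeeper zk;
    private final StandbyNameNode standby;
    private final String myAddress;        // e.g. "10.0.0.2:8020" (illustrative)

    FailoverSketch(ZooKeeper zk, StandbyNameNode standby, String myAddress) {
        this.zk = zk; this.standby = standby; this.myAddress = myAddress;
    }

    void becomeActive() throws KeeperException, InterruptedException {
        standby.stopCheckpointing();        // 1. no longer a checkpoint helper
        standby.closeAllOpenFiles();        // 2. in-progress writes are discarded
        standby.restartLeaseManagement();   // 3. start granting/expiring leases itself
        standby.leaveSafeMode();            // 4. accept mutations to the namespace
        // 5. publish the new active address in the group znode so clients find it
        zk.setData("/namenodes", myAddress.getBytes(), -1);
    }
}
```

Steps 2 and 3 reflect the lease limitation from slide 11: writes in progress cannot be continued, so their files are simply closed and lease management restarts fresh on the new active node.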
15. EXPERIMENTS
   • Test environment: Amazon EC2
     • 1 ZooKeeper server;
     • 1 NameNode;
     • 1 Backup Node or Hot Standby Node;
     • 20 DataNodes;
     • 20 clients running the test programs;
     • all small instances.
   • Performance tests: comparison against HDFS 0.21.
   • Failover tests: the NameNode is shut down once the block count exceeds 2K.
   • Two test scenarios:
     • Metadata: each client creates 200 files of one block each;
     • I/O: each client creates a single file of 200 blocks.
   • 5 samples per scenario.
   • Source code, test scripts and raw data available at https://sites.google.com/site/hadoopfs/experiments
16. RESULTS OVERVIEW
   Lines of code: 1,373 lines, i.e. 0.18% of the original HDFS 0.21 code.
   Performance:
   • NameNode: the Failover Manager overhead is an increase of 16% in CPU time and 12% in heap memory compared to HDFS 0.21.
   • DataNodes:
     • No considerable change in network traffic; the extra messages are less than 0.43% of the total flow out of a DataNode.
     • Substantial overhead only in the I/O scenario: 17% in CPU and 6% in heap memory compared to HDFS 0.21 in the same scenario.
17. RESULTS OVERVIEW (CONT.)
   The Big Block-Received Message problem:
   • The Hot Standby only learns (through the log) which blocks belong to a file when the file is closed.
   • The Hot Standby returns non-recognized blocks, so DataNodes can retry them in the next block-received message.
   • In the I/O scenario files have 200 blocks, so many blocks stay pending until the files are closed, leading to larger block-received messages and responses, and therefore more processing and memory.
   • Hot Standby Node: CPU time 3.3 times and heap memory 1.9 times higher in the I/O scenario compared to the metadata scenario.
   [Diagram: a DataNode sends Block-Received (new + old blocks) to the Hot Standby Node; the response lists the non-recognized blocks.]
   (A sketch of the retry behaviour follows.)
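A rough sketch of the retry behaviour, under our illustrative assumption that the DataNode keeps a pending set of blocks the standby has not yet recognized; the names are hypothetical, not the actual HDFS code.

```java
// Why long-open files inflate block-received messages: pending blocks are resent.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BlockReceivedRetry {
    interface HotStandby {
        // Returns the subset of reported blocks the standby does not recognize yet
        // (blocks whose files are still open and therefore absent from the journal).
        List<Long> blockReceived(String datanodeId, List<Long> blockIds);
    }

    private final Set<Long> pending = new HashSet<>(); // blocks not yet acknowledged

    // Called each reporting cycle: new blocks plus everything still pending are
    // resent, so a 200-block file that stays open keeps growing the message.
    void report(HotStandby standby, String datanodeId, List<Long> newBlocks) {
        pending.addAll(newBlocks);
        List<Long> message = new ArrayList<>(pending);
        List<Long> notRecognized = standby.blockReceived(datanodeId, message);
        pending.retainAll(notRecognized); // keep only the blocks to retry next time
    }
}
```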
18. PERFORMANCE RESULTS: THROUGHPUT AT CLIENTS
   [Bar charts of client throughput (MB/s), HDFS 0.21 vs. Hot Standby]
   • Metadata scenario: Write 6.24 vs. 5.45 MB/s; Read 25.8 vs. 25.32 MB/s.
   • I/O scenario: Write 4.72 vs. 4.86 MB/s; Read 13.89 vs. 12.39 MB/s.
19. FAILOVER RESULTS
   [Timeline charts for the Metadata and I/O scenarios: NameNode failure → Hot Standby is notified of the failure → first request processed.]
   • ZooKeeper session timeout: 2 min.
   • Elapsed times: (1.62 ± 0.23) min in the Metadata scenario and (2.31 ± 0.46) min in the I/O scenario; the charts also show 0.24% and 22%, respectively.
20. HDFS 2.0.0-ALPHA (released in May 2012)
   • DataNodes send their status messages to both nodes.
   • The transactional log is written to highly available shared storage.
   • The standby keeps reading the log from that storage.
   • Blocks are logged as they are allocated.
   • Manual failover with I/O fencing.
   [Diagram: Client, DataNode, NameNode, Standby NameNode; the NameNode writes the log to the shared storage and the Standby NameNode reads it.]
21. CONCLUSIONS AND FUTURE WORK
   • We built a high-availability solution for HDFS that delivers good throughput with low overhead on the existing components.
   • The solution reacts to failures in reasonable time and works well in elastic computing environments.
   • The impact on the code base was small and no components external to the Hadoop project were required.
22. CONCLUSIONS AND FUTURE WORK (CONT.)
   • Our results show that performance can be improved by handling new blocks better: if the Hot Standby learns which blocks compose a file before it is closed, writes in progress can be continued after a failover. We also plan to support reconfiguration.
   • High availability for HDFS is still an open problem:
     • support for multiple failures;
     • integration with BookKeeper;
     • using HDFS itself to store the logs.
23. ACKNOWLEDGMENTS
   • Rodrigo Schmidt
     • Alumnus of the University of Campinas (Unicamp);
     • Facebook engineer.
   • Motorola Mobility
24. REFERENCES
   1. K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 3-7 May 2010, pp. 1-10.
   2. "Facebook has the world's largest Hadoop cluster!" http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html, last access on September 13th, 2012.
   3. K. Shvachko, "HDFS Scalability: The Limits to Growth," ;login: The Usenix Magazine, vol. 35, no. 2, pp. 6-16, April 2010.
   4. "Apache Hadoop," http://hadoop.apache.org/, last access on September 13th, 2012.
   5. "HBase," http://hbase.apache.org/, last access on September 13th, 2012.
   6. B. Bockelman, "Using Hadoop as a grid storage element," Journal of Physics: Conference Series, vol. 180, no. 1, p. 012047, 2009. [Online]. Available: http://stacks.iop.org/1742-6596/180/i=1/a=012047
   7. D. Borthakur, "HDFS High Availability," http://hadoopblog.blogspot.com/2009/11/hdfs-high-availability.html, last access on September 13th, 2012.
   8. D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, "Apache Hadoop Goes Realtime at Facebook," in Proceedings of the 2011 International Conference on Management of Data, ser. SIGMOD '11. New York, NY, USA: ACM, 2011, pp. 1071-1080. [Online]. Available: http://doi.acm.org/10.1145/1989323.1989438
25. REFERENCES (CONT.)
   9. "Apache ZooKeeper," http://hadoop.apache.org/zookeeper/, last access on September 13th, 2012.
   10. "HDFS Architecture," http://hadoop.apache.org/common/docs/current/hdfs_design.html, last access on September 13th, 2012.
   11. "Streaming Edits to a Standby Name-Node," http://issues.apache.org/jira/browse/HADOOP-4539, last access on September 13th, 2012.
   12. "Hot Standby for NameNode," http://issues.apache.org/jira/browse/HDFS-976, last access on September 13th, 2012.
   13. D. Borthakur, "Hadoop AvatarNode High Availability," http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html, last access on September 13th, 2012.
   14. "Amazon EC2," http://aws.amazon.com/ec2/, last access on September 13th, 2012.
   15. "High Availability Framework for HDFS NN," https://issues.apache.org/jira/browse/HDFS-1623, last access on September 13th, 2012.
   16. "Automatic failover support for NN HA," http://issues.apache.org/jira/browse/HDFS-3042, last access on September 13th, 2012.
26. Q&A
27. BACKUP SLIDES
28. PERFORMANCE RESULTS: DATANODE'S ETHERNET FLOW
   [Bar charts of DataNode Ethernet data flow (GB), received and sent, HDFS 0.21 vs. Hot Standby, for the Metadata and I/O scenarios; y-axis from 35 to 40.5 GB.]
29. PERFORMANCE RESULTS: CPU TIME
   [Bar charts of CPU time (s) for the DataNode, NameNode and Backup/Hot Standby processes, HDFS 0.21 vs. Hot Standby, for the Metadata and I/O scenarios; y-axis from 0 to 500 s.]
30. PERFORMANCE RESULTS: HEAP USAGE
   [Bar charts of Java heap usage (MB) for the DataNode, NameNode and Backup/Hot Standby processes, HDFS 0.21 vs. Hot Standby, for the Metadata and I/O scenarios; y-axis from 0 to 25 MB.]
31. PERFORMANCE RESULTS: NAMENODE RPC DATA FLOW
   [Bar charts of RPC data flow (MB) received and sent by the NameNode, HDFS 0.21 vs. Hot Standby; y-axis from 0 to 800 MB in the Metadata scenario and from 0 to 18 MB in the I/O scenario.]
32. FINDING THE ACTIVE SERVER
   • The group znode holds the IP address of the active metadata server.
   • Clients query ZooKeeper and register a watcher to be notified of changes (see the sketch below).
   [Diagram: the Client queries the Namenodes group znode for the active's IP and receives watcher notifications when it changes.]
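A minimal sketch of such a client-side lookup with the ZooKeeper Java API; the znode path, the caching policy and the error handling are assumptions for illustration.

```java
// Client-side lookup of the active metadata server via the group znode.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class ActiveServerLocator implements Watcher {
    private final ZooKeeper zk;
    private volatile String activeAddress;   // cached address of the active NameNode

    ActiveServerLocator(ZooKeeper zk) throws KeeperException, InterruptedException {
        this.zk = zk;
        refresh();
    }

    // Read the group znode's data and leave a watch so a failover updates the cache.
    private void refresh() throws KeeperException, InterruptedException {
        byte[] data = zk.getData("/namenodes", this, null);
        activeAddress = new String(data);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                refresh();   // the Hot Standby updated the znode during failover;
                             // refresh() also re-registers the one-shot watch
            } catch (KeeperException | InterruptedException e) {
                // A real client would retry or surface the error to the caller.
            }
        }
    }

    String getActiveAddress() { return activeAddress; }
}
```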