IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS
Paper presentation at the 31st IEEE International Symposium on Reliable Distributed Systems (SRDS 2012).

  • Good morning, everyone. I am André Oriani from the University of Campinas, Brazil. I'm going to present our work: From Backup to Hot Standby: High Availability for HDFS.
  • So here's today's agenda: I'm going to talk briefly about cluster-based parallel file systems, then the architecture of HDFS 0.21 and our implementation of a hot standby, give an overview of our experiments and results, and finish with the high availability features introduced in HDFS 2.0.0-alpha.
  • Cluster-based parallel file systems generally adopt a master-slaves architecture. The master, the metadata server, manages the namespace, while the slaves, the storage nodes, store the data. Although this architecture is simple to implement and maintain, as in any centralized system the metadata server is a single point of failure. One widely used specimen of such file systems is the Hadoop Distributed File System. And why is a hot standby important for HDFS? Well, a cold start of a big cluster such as Facebook's can take about 45 minutes, which is not viable for 24/7 applications.
  • This slide shows the architecture and the data flows of HDFS 0.21. I'm going to describe each node.
  • Like in any parallel file system, files are split into equal-sized blocks that are stored on the DataNodes, the storage nodes. In particular, a block is replicated to three DataNodes for reliability. The DataNodes constantly communicate their status to the metadata server, the NameNode, through messages: heartbeats, which are used not only to tell that the DataNode is still alive but also to report its load and free space; block-reports, which list the healthy blocks the DataNode can offer; and block-received messages, which a DataNode sends when it receives a set of new blocks. From those messages, the NameNode builds its global view of the cluster, knowing where each block replica is. The sketch below summarizes the three message types.
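A minimal, illustrative sketch of the three status messages just described; the type and field names are hypothetical stand-ins, not the actual HDFS 0.21 classes:

    import java.util.List;

    // Hypothetical marker for the three DataNode status messages.
    interface DataNodeStatusMessage {}

    // Heartbeat: liveness plus the load and free-space figures the
    // NameNode uses when choosing targets for new blocks.
    record Heartbeat(String dataNodeId, long capacityBytes, long freeBytes,
                     int activeTransfers) implements DataNodeStatusMessage {}

    // Block-report: the full list of healthy block replicas this node holds.
    record BlockReport(String dataNodeId, List<Long> blockIds)
            implements DataNodeStatusMessage {}

    // Block-received: incremental notice of newly stored blocks.
    record BlockReceived(String dataNodeId, List<Long> newBlockIds)
            implements DataNodeStatusMessage {}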
  • As I said, the NameNode is the metadata server, and thus it is responsible for serving the clients, handling metadata requests, and controlling access to files. HDFS offers POSIX-like permissions and leases: a client can only write to a file if it holds a lease for it. The NameNode also manages block allocation, location, and replication status. To accomplish that work, the NameNode can send commands to the DataNodes in the responses to the heartbeat messages.
  • The NameNode keeps all its state in main memory for performance reasons. In order to have some resilience, the NameNode employs journaling: every change to the file system tree is recorded to a log file, as the sketch below illustrates. Information about leases and blocks is not recorded to the log because it is ephemeral data, so it is lost if the NameNode fails.
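A rough sketch of the journaling idea, assuming a hypothetical EditLog class: a change is appended to the log before it is applied to the in-memory tree.

    import java.io.FileWriter;
    import java.io.IOException;

    // Minimal write-ahead journal: append a record of each namespace
    // change before mutating the in-memory file system tree.
    class EditLog {
        private final FileWriter log;

        EditLog(String path) throws IOException {
            this.log = new FileWriter(path, true); // open in append mode
        }

        synchronized void logEdit(String operation) throws IOException {
            log.write(operation + "\n");
            log.flush(); // push to the OS; a real system would also sync to disk
        }
    }

A recovering NameNode would replay these records in order to rebuild the tree; lease and block information is absent from the log, which is exactly why it is lost on failure.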
  • If no action were taken, the NameNode would end up with a big transactional log, which would seriously impact its startup time. So it counts on the Backup Node, its checkpointing helper, to compact the log into a serialized version of the file system tree. The Backup Node employs an efficient checkpoint strategy: the NameNode streams all changes to it, and the Backup Node applies them to its own state. So, to generate a checkpoint for the NameNode, it only needs to checkpoint itself.
  • We found the Backup Node to be a great opportunity for implementing a hot standby for the NameNode. It already does some state replication, and because it is a subclass of the NameNode, it can potentially handle client requests. In fact, turning it into a hot standby was a long-term goal for the Backup Node when it was created. To evolve the Backup Node into a hot standby, we had to handle the missing components of the NameNode's state, create an automatic failover mechanism, and provide a means to disseminate information about the current active metadata server.
  • To replicate the information about blocks, we reused a technique developed in another high-availability solution, the AvatarNodes from Facebook. The technique consists of modifying the DataNodes to also send their status messages to the Hot Standby, as in the sketch below. The Hot Standby Node is kept in safe mode so it does not issue commands to the DataNodes that would conflict with the NameNode's. There is no rigid synchronization between the duplicated messages, because DataNodes fail at considerable rates and both nodes are built to handle that. And once it becomes active, the Hot Standby becomes the file system authority, so only its view will matter.
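Conceptually, the duplication looks like the following sketch; the names are placeholders, and the real AvatarNode code and our implementation differ in detail:

    import java.util.List;

    // Illustrative channel to a metadata server; not an actual HDFS API.
    interface NameNodeChannel {
        void blockReceived(List<Long> blockIds);
    }

    // The modified DataNode reports each status message to both the
    // active NameNode and the Hot Standby Node.
    class DualReportingDataNode {
        private final NameNodeChannel active;
        private final NameNodeChannel hotStandby;

        DualReportingDataNode(NameNodeChannel active, NameNodeChannel hotStandby) {
            this.active = active;
            this.hotStandby = hotStandby;
        }

        void sendBlockReceived(List<Long> newBlocks) {
            active.blockReceived(newBlocks);
            // No rigid synchronization: if the standby copy fails, the
            // blocks are simply carried over to the next report.
            try {
                hotStandby.blockReceived(newBlocks);
            } catch (RuntimeException ignored) {
            }
        }
    }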
  • Regarding leases, we decided not to replicate them. The reason is that the blocks composing a file are only recorded to the transactional log when the file is closed. So if the NameNode fails while a file is being written, all the blocks of the file are lost, and the client will need to restart the write, requiring a new lease. Thus the previous lease is never going to be used, and therefore it does not need to be replicated. This behavior is somewhat tolerable by applications: MapReduce will retry any failed job, and HBase only commits a transaction when the file is flushed and synced.
  • In order to detect the NameNode's failure we use ZooKeeper. ZooKeeper is a subproject of Hadoop that provides a highly available distributed coordination service. It keeps a tree of znodes replicated among the servers of the ensemble. One interesting feature of ZooKeeper is that znodes can be ephemeral: if the session of the client that created one expires, the znode is removed. So you can implement liveness detection using this principle, and you can also register to be notified of such events. Both the NameNode and the Hot Standby create an ephemeral znode for themselves under a znode that represents the group, and the NameNode writes its network address to the group znode, as in the sketch below.
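A minimal sketch of the registration step using the standard ZooKeeper Java client; the znode paths, address, and timeout are illustrative, not the values from the paper:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class ActiveRegistration {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 120_000, event -> { });

            // Persistent group znode holding the active server's address
            // (assumed not to exist yet; otherwise create() throws
            // NodeExistsException).
            zk.create("/namenodes", "namenode-host:8020".getBytes(),
                      Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Ephemeral member znode: ZooKeeper removes it automatically
            // when this client's session expires, which is what watchers
            // use to detect the NameNode's death.
            zk.create("/namenodes/active", new byte[0],
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }
    }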
  • When the NameNode fails, its session with ZooKeeper eventually expires, so its znode gets removed. The Hot Standby is notified of that and starts the failover procedure.
  • The Hot Standby will stop checkpointing, close all open files, restart the lease management, and leave safe mode so it can control the DataNodes. Finally, it writes its network address to the group znode. The sketch below summarizes these steps.
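A hedged sketch of the transition; every method here is a hypothetical stub standing in for the corresponding HDFS subsystem:

    // Sketch of the five failover steps described above.
    class HotStandbyFailover {
        void becomeActive() {
            stopCheckpointing();        // 1. no more checkpoints to produce
            closeAllOpenFiles();        // 2. abandon writes in progress
            restartLeaseManagement();   // 3. start granting leases again
            leaveSafeMode();            // 4. begin commanding the DataNodes
            publishAddress("standby-host:8020"); // 5. update the group znode
        }

        private void stopCheckpointing() { /* stub */ }
        private void closeAllOpenFiles() { /* stub */ }
        private void restartLeaseManagement() { /* stub */ }
        private void leaveSafeMode() { /* stub */ }
        private void publishAddress(String address) { /* stub */ }
    }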
  • We ran experiments in order to determine the overhead imposed by our solution over HDFS 0.21 and the total failover time. We ran both tests in two scenarios: one oriented more towards metadata operations and another towards I/O operations. For each scenario we ran the tests five times. The tests were executed on Amazon EC2 using 43 small instances. Source code, test scripts, and raw data are available at the address shown on the slide.
  • As time is short, I'll give an overview of the results. The complete implementation took less than fourteen hundred lines, so it is easy for others to understand and maintain. Regarding performance overhead, the NameNode should not be impacted, since it was not changed by our solution, but its process also hosts the Failover Manager, so we saw an increase of 16% in CPU time and 12% in heap memory compared to HDFS 0.21. For the DataNodes there was no considerable change in network traffic; the extra messages sent to the Hot Standby got diluted in the I/O flow created by clients reading and writing files. We only observed substantial overhead in the I/O scenario: an increase of 17% in CPU time and 6% in heap compared to HDFS 0.21 DataNodes in the same scenario. That is caused by a growth in the block-received messages. Remember, the Hot Standby only becomes aware of the blocks that compose a file when that file is closed. If the Hot Standby receives a block-received message for a block it does not know about, it returns that block, so the DataNode can retry it in the next block-received message. The trouble is that in the I/O scenario the files are 200 blocks long, and the Hot Standby will only recognize any block of a file once all 200 blocks are written. So the block-received messages become long, taking a lot of processing and memory from the DataNodes and the Hot Standby. And because there is just one Hot Standby, it is the most affected node.
  • Despite those problems, the data throughput is still good. We consider data throughput the most important metric, because it measures how much work can be done on behalf of clients. On average, the data throughput was never more than 2 MB/s below the throughput achieved by HDFS 0.21, in both scenarios, for both reads and writes.
  • Failover time... We used a session timeout of 2 minutes because it was a safe value to avoid false positives in the virtualized environment of Amazon EC2. We measure the failover from the time the NameNode fails until the Hot Standby processes its first request. In both cases the failover took less than 3 minutes. The time from the start of the Hot Standby's transition until the first request, which is the part our implementation can influence, took only 0.24% of the total failover in the metadata scenario. However, in the I/O scenario that jumps to 22% because of the problem with block-received messages I just mentioned: the Hot Standby is so busy processing blocks that the transition takes longer. Once the transition finishes, things get worse, because the Hot Standby not only has to process the block-received messages but also has to instruct the DataNodes to remove the blocks of all in-progress writes, so the first request is delayed for a long time, although clients could react almost instantaneously.
  • HDFS 2.0.0-alpha, released in May of this year, introduced some high availability features. It also uses the technique of modifying the DataNodes to send their messages to the hot standby as well. But instead of streaming the changes, the active metadata server keeps the log in shared storage, and the standby keeps reading the log from the storage in order to update itself, as sketched below. So the high availability issue is transferred from the file system to the shared storage, which is an external component and needs to be highly available. It logs blocks as they are allocated, avoiding the problem we had. Currently it only supports manual failover, so it is targeted at maintenance and upgrades, but automatic failover is very likely to be in the next release. It employs I/O fencing mechanisms to prevent both NameNodes from writing to the shared storage.
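The standby side of that design can be pictured with the sketch below, assuming the shared edit log is a plain file on highly available storage; this is a conceptual illustration, not the actual Hadoop code:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Conceptual standby tailer: replay edits that the active node has
    // appended to a log file on highly available shared storage.
    class StandbyLogTailer implements Runnable {
        private final File sharedEditLog;
        private long offset = 0; // how far into the log we have applied

        StandbyLogTailer(File sharedEditLog) {
            this.sharedEditLog = sharedEditLog;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try (RandomAccessFile log = new RandomAccessFile(sharedEditLog, "r")) {
                    log.seek(offset);
                    String edit;
                    while ((edit = log.readLine()) != null) {
                        apply(edit); // replay one namespace change
                        offset = log.getFilePointer();
                    }
                    Thread.sleep(1_000); // poll for new edits
                } catch (IOException | InterruptedException e) {
                    return;
                }
            }
        }

        private void apply(String edit) { /* stub: update the in-memory tree */ }
    }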
  • How can the client determine which is the current metadata server? Remember, the NameNode writes its address to the group znode when it starts, and the Hot Standby writes its address when it finishes the failover. So the group znode always keeps the address of the active metadata server up to date. Clients just need to query ZooKeeper for that znode and register to be notified of changes, as in the sketch below.
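A minimal sketch of the client-side lookup with the standard ZooKeeper Java client; the znode path and addresses are illustrative:

    import org.apache.zookeeper.ZooKeeper;

    public class ActiveServerLookup {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 30_000, event -> { });

            // Read the active server's address and leave a watch so we
            // are notified when a failover updates the group znode.
            byte[] addr = zk.getData("/namenodes",
                    event -> System.out.println("active server changed"),
                    null);
            System.out.println("active metadata server: " + new String(addr));
        }
    }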

Presentation Transcript

  • FROM BACKUP TO HOT STANDBY: HIGH AVAILABILITY FOR HDFS. André Oriani, Islene Calciolari Garcia. Institute of Computing – University of Campinas, Brazil.
  • AGENDA: Motivation; Architecture of HDFS 0.21; Implementation of the Hot Standby Node; Experiments and Results; High Availability features on HDFS 2.0.0-alpha; Conclusions and Future Work.
  • MOTIVATION: CLUSTER-BASED PARALLEL FILE SYSTEMS. Master-slaves architecture: the Metadata Server serves clients and manages the namespace; the Storage Servers store the data. Centralized system, so the Metadata Server is a SPOF. Importance of a hot standby for the metadata server of HDFS: a cold start of a 2000-node HDFS cluster with 21 PB and 150 million files takes ~45 min [8].
  • HADOOP DISTRIBUTED FILE SYSTEM (HDFS) 0.21. [Architecture diagram: a Client, the NameNode, the Backup Node, and three DataNodes]
  • DATANODES. Storage Nodes: files are split into equal-sized blocks; blocks are replicated to DataNodes; DataNodes send status messages to the NameNode: heartbeats, block-reports, block-received.
  • NAMENODE. Metadata Server: manages the file system tree; handles metadata requests; controls access and leases; manages blocks: allocation, location, replication levels. [Diagram: clients send metadata requests; DataNodes send heartbeats, block-reports, and block-received messages and receive commands]
  • NAMENODE'S STATE. Journaling: all state (file system tree, block management, leases) is kept in RAM for better performance; changes to the namespace are recorded to a log; lease and block information is volatile.
  • BACKUP NODE. Checkpoint Helper: the NameNode streams journal entries to the Backup Node; efficient checkpoint strategy: apply the changes to its own state; checkpointing the Backup Node's state == checkpointing the NameNode's state.
  • A HOT STANDBY FOR HDFS 0.21. The Backup Node as an opportunity: it already replicates the namespace state, and as a NameNode subclass it can process client requests. Evolving the Backup Node into the Hot Standby Node: 1. handle the missing state components (replica locations, leases); 2. detect the NameNode's failure; 3. switch the Hot Standby Node to the active state (failover); 4. disseminate the current metadata server information.
  • MISSING STATE: REPLICA LOCATIONS. Reuses the AvatarNodes' strategy: the DataNodes' status messages (heartbeats, block-reports, block-received) are sent to the Hot Standby too. No rigid sync: DataNode failures are relatively common (stable clusters see 2-3 failures per day in 1,000 nodes [3]). The Hot Standby Node is kept in safe mode; it becomes the authority once it is active.
  • MISSING STATE: LEASES. Not replicated: blocks are only recorded to the log when the file is closed, so any write in progress is lost if the NameNode fails; restarting the write will create a new lease. [Sequence diagram: client calls open(file), getAdditionalBlock(), close(file); NameNode log entries addLease(file, client) and complete(file, blocks)]
  • FAILURE DETECTION. Uses ZooKeeper: a highly available distributed coordination service; keeps a tree of znodes replicated among the ensemble; all operations are atomic; leaf znodes may be ephemeral; clients can watch znodes. [Diagram: the NameNode and the Hot Standby Node register under the Namenodes group znode, which stores the active's IP; clients query the ZooKeeper ensemble]
  • FAILOVER. The NameNode fails. [Diagram: the NameNode's ephemeral znode disappears from the Namenodes group; the Hot Standby Node is notified by the ZooKeeper ensemble]
  • FAILOVER (CONT.). Switch the Hot Standby Node to active: 1. stop checkpointing; 2. close all open files; 3. restart lease management; 4. leave safe mode; 5. update the group znode to its network address.
  • EXPERIMENTS. Test environment: Amazon EC2 with 1 ZooKeeper server, 1 NameNode, 1 Backup Node or Hot Standby Node, 20 DataNodes, and 20 clients running the test programs; all small instances. Performance tests: comparison against HDFS 0.21. Failover tests: the NameNode is shut down when the block count is more than 2K. Two test scenarios: Metadata, where each client creates 200 files of one block each; I/O, where each client creates a single file of 200 blocks; 5 samples per scenario. Source code, test scripts, and raw data available at https://sites.google.com/site/hadoopfs/experiments
  • RESULTS OVERVIEW. Lines of code: 1373 lines, 0.18% of the original HDFS 0.21 code. Performance: NameNode: Failover Manager overhead, an increase of 16% in CPU time and 12% in heap memory compared to HDFS 0.21. DataNodes: no considerable change in network traffic, extra messages are less than 0.43% of the total flow out of a DataNode; substantial overhead only in the I/O scenario: 17% in CPU and 6% in heap memory compared to HDFS 0.21 in the same scenario.
  • RESULTS OVERVIEW (CONT.). The big block-received message problem: the Hot Standby only learns (through the log) the blocks of a file when the file is closed; it returns non-recognized blocks so DataNodes can retry them in the next block-received message. In the I/O scenario files have 200 blocks, so many pending blocks are retried until the files are closed, yielding larger block-received messages and responses and higher processing and memory costs. Hot Standby Node: CPU time 3.3 times and heap memory 1.9 times higher in the I/O scenario compared to the metadata scenario. [Diagram: DataNode sends block-received with new + old blocks; the Hot Standby Node responds with the non-recognized blocks]
  • PERFORMANCE RESULTS: THROUGHPUT AT CLIENTS (MB/s). Metadata scenario: HDFS 0.21 write 6.24, read 25.80; Hot Standby write 5.45, read 25.32. I/O scenario: HDFS 0.21 write 4.72, read 13.89; Hot Standby write 4.86, read 12.39.
  • FAILOVER RESULTS. ZooKeeper session timeout: 2 min. Timeline in both scenarios: NameNode failure, Hot Standby is notified of the failure, first request processed. Total failover: (1.62±0.23) min in the metadata scenario and (2.31±0.46) min in the I/O scenario; the transition-to-first-request phase accounts for 0.24% and 22% of it, respectively.
  • HDFS 2.0.0-ALPHA. Released in May 2012: DataNodes send messages to both nodes; the transactional log is written to highly available shared storage; the standby keeps reading the log from storage; blocks are logged as they are allocated; manual failover with I/O fencing. [Diagram: the NameNode writes the log to the shared storage and the Standby NameNode reads it; the client talks to the NameNode; DataNodes report to both]
  • CONCLUSIONS AND FUTURE WORK. We built a high availability solution for HDFS that is capable of delivering good throughput with low overhead to existing components. The solution has a reasonable reaction time to failures and works well in elastic computing environments. The impact on the code base was small and no components external to the Hadoop project were required.
  • CONCLUSIONS AND FUTURE WORK (CONT.). Our results showed that we can improve performance by handling new blocks better: if the Hot Standby becomes aware of which blocks compose a file before the file is closed, we will be able to continue writes. We also plan to support reconfiguration. High availability on HDFS is still an open problem: multiple failure support; integration with BookKeeper; using HDFS itself to store the logs.
  • ACKNOWLEDGMENTS. Rodrigo Schmidt, alumnus of the University of Campinas (Unicamp) and Facebook engineer; Motorola Mobility.
  • REFERENCES
    1. K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, May 2010, pp. 1-10.
    2. "Facebook has the world's largest Hadoop cluster!" http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html, last access on September 13th, 2012.
    3. K. Shvachko, "HDFS Scalability: The Limits to Growth," ;login: The USENIX Magazine, vol. 35, no. 2, pp. 6-16, April 2010.
    4. "Apache Hadoop," http://hadoop.apache.org/, last access on September 13th, 2012.
    5. "HBase," http://hbase.apache.org/, last access on September 13th, 2012.
    6. B. Bockelman, "Using Hadoop as a grid storage element," Journal of Physics: Conference Series, vol. 180, no. 1, p. 012047, 2009. [Online]. Available: http://stacks.iop.org/1742-6596/180/i=1/a=012047
    7. D. Borthakur, "HDFS High Availability," http://hadoopblog.blogspot.com/2009/11/hdfs-high-availability.html, last access on September 13th, 2012.
    8. D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, "Apache Hadoop Goes Realtime at Facebook," in Proceedings of the 2011 International Conference on Management of Data, ser. SIGMOD '11. New York, NY, USA: ACM, 2011, pp. 1071-1080. [Online]. Available: http://doi.acm.org/10.1145/1989323.1989438
  • REFERENCES (CONT.)
    9. "Apache ZooKeeper," http://hadoop.apache.org/zookeeper/, last access on September 13th, 2012.
    10. "HDFS Architecture," http://hadoop.apache.org/common/docs/current/hdfs_design.html, last access on September 13th, 2012.
    11. "Streaming Edits to a Standby Name-Node," http://issues.apache.org/jira/browse/HADOOP-4539, last access on September 13th, 2012.
    12. "Hot Standby for NameNode," http://issues.apache.org/jira/browse/HDFS-976, last access on September 13th, 2012.
    13. D. Borthakur, "Hadoop AvatarNode High Availability," http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html, last access on September 13th, 2012.
    14. "Amazon EC2," http://aws.amazon.com/ec2/, last access on September 13th, 2012.
    15. "High Availability Framework for HDFS NN," https://issues.apache.org/jira/browse/HDFS-1623, last access on September 13th, 2012.
    16. "Automatic failover support for NN HA," http://issues.apache.org/jira/browse/HDFS-3042, last access on September 13th, 2012.
  • Q&A
  • BACKUP SLIDES
  • PERFORMANCE RESULTS: DATANODE'S ETHERNET FLOW. [Bar charts: data received and sent (GB) by the DataNodes in the metadata and I/O scenarios, HDFS 0.21 vs Hot Standby]
  • PERFORMANCE RESULTS: CPU TIME. [Bar charts: CPU time (s) for DataNode, NameNode, and Backup/Hot Standby in the metadata and I/O scenarios, HDFS 0.21 vs Hot Standby]
  • PERFORMANCE RESULTS: HEAP USAGE. [Bar charts: Java heap (MB) for DataNode, NameNode, and Backup/Hot Standby in the metadata and I/O scenarios, HDFS 0.21 vs Hot Standby]
  • PERFORMANCE RESULTS: NAMENODE RPC DATA FLOW. [Bar charts: data received and sent (MB) in the metadata and I/O scenarios, HDFS 0.21 vs Hot Standby]
  • FINDING THE ACTIVE SERVER. The group znode holds the IP address of the active metadata server; clients query ZooKeeper and register to be notified of changes. [Diagram: the client queries the Namenodes znode and receives watcher notifications]