IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS

André Oriani
Islene Calciolari Garcia
Institute of Computing – University of Campinas-Brazil
FROM BACKUP TO HOT STANDBY:
HIGH AVAILABILITY FOR HDFS

AGENDA
• Motivation;
• Architecture of HDFS 0.21;
• Implementation of Hot Standby Node;
• Experiments and Results;
• High Availability features on HDFS 2.0.0-alpha;
• Conclusions and Future Work.
SRDS’12 - From Backup to Hot Standby: High Availability for HDFS 2

MOTIVATION
CLUSTER-BASED PARALLEL FILE SYSTEMS
Master-Slaves Architecture:
• Metadata Server – serves clients, manages the namespace.
• Storage Servers – store the data.
Centralized System  Metadata Server is a SPOF
Importance of a Hot Standby for the metadata server of HDFS
Cold start of a 2000-node HDFS cluster with 21PB and 150 million files
takes ~ 45min [8].

HADOOP DISTRIBUTED FILE SYSTEM
(HDFS) 0.21
NameNodeClient
DataNode DataNodeDataNode
Backup
Node

DATANODES
Storage Nodes
• Files are split in equal-sized blocks.
• Blocks are replicated to DataNodes.
• Send status messages to NameNode:
• Heartbeats;
• Block-Reports;
• Block-Received.
NameNode
Statuses

NAMENODE
NameNodeClient
DataNode DataNode
Requests
Metadata
Hearbeats
Block-Reports
Block-Received
Commands
Metadata Server
• Manages the file system tree.
• Handles metadata requests.
• Controls access and leases.
• Manages Blocks:
• Allocation;
• Location;
• Replication levels.

NAMENODE’S STATE
File System
Tree
Block
Management
Leases
LOG
Journaling
• All state is kept in RAM for
better performance.
• Changes to namespace
are recorded to log.
• Lease and Block
information is volatile.

BACKUP NODE
NameNode
Backup
Node
Journal Entries
Checkpoint
Checkpoint Helper
• NameNode streams changes to Backup Node.
• Efficient checkpoint strategy: apply changes to its own state.
• Checkpoint Backup’s state == Checkpoint NameNode’s state.

A HOT STANDBY FOR HDFS 0.21
• Backup Node: Opportunity
• Already replicates namespace state.
• NameNode’s subclass  Can process client requests.
• Evolving the Backup Node into Hot Standby Node:
1. Handle the missing state components:
1. Replica locations
2. Leases
2. Detect NameNode’s Failure.
3. Switch the Hot Standby Node to active state (failover).
4. Disseminate current metadata server information.

Reuses AvatarNodes’
strategy:
• Sends DataNodes’ status
Messages to Hot Standby
too.
• No rigid sync
• DataNode failures are
relatively common.
Stable clusters: 2-3
failures per day in
1,000 nodes [3]
• Hot Standby Node is
kept on safe mode. It
becomes the authority
once it is active
MISSING STATE
REPLICA LOCATIONS
NameNode
DataNode
Hot Standby
Node
Heartbeats
Block-Reports
Block-Received

Not replicated:
• Blocks are only
recorded to log
when file is closed.
• So any write in
progress is lost if
NameNode fails.
• Restarting the write
will create a new
lease.
MISSING STATE
LEASES
NameNodeClient LOG
open(file)
open(file)
getAdditionalBlock()
close(file)
complete(file,blocks)
addLease(file,client)

Uses ZooKeeper
• Highly available
distributed coordination
service.
• Keeps a tree of znodes
replicated among the
ensemble.
• All operations are
atomic.
• Leaf node may be
ephemeral.
• Clients can watch
znodes.
FAILURE DETECTION
NameNode
Hot Standby
Node
Hot StandbyNameNode
ZooKeeper
Ensemble
Namenodes
active’s IP
Client

NameNode fails
FAILOVER
NameNode
Hot Standby
Node
Namenodes
Hot Standby
ZooKeeper
Ensemble

Switch Hot Standby Node to
active
1. Stop checkpointing;
2. Close all open files;
3. Restart lease management;
4. Leave safe mode;
5. Update group znode to its
network address.
FAILOVER(CONT.)
Hot Standby
Node

EXPERIMENTS
• Environment Test : Amazon EC2
• 1 Zookeeper server;
• 1 NameNode;
• 1 Backup Node or Hot Standby Node;
• 20 DataNodes;
• 20 Clients running the test programs;
• All small instances.
• Performance tests - Comparison against HDFS 0.21.
• Failover tests - NameNode is shutdown when block count is more than 2K.
• Two test scenarios
• Metadata : Each client creates 200 files of one block each;
• I/O : Each client creates a single file of 200 blocks;
• 5 sample per scenario.
• Source code, tests scripts and raw data:
• Available at https://sites.google.com/site/hadoopfs/experiments

RESULTS OVERVIEW
Lines of Code:
1373 lines: 0.18% of original code for HDFS 0.21.
Performance:
• NameNode
• Failover Manager overhead: increase of 16% in CPU time and
12% in heap memory compared to HDFS 0.21.
• DataNodes
• No considerable change on network traffic. Extra messages are
less than 0.43% of total flow out of DataNode.
• Substantial overhead only on I/O scenario: 17% in CPU and 6%
in heap memory compared to HDFS 0.21 in the same scenario.

RESULTS OVERVIEW(CONT.)
• Big Block-Received Message Problem:
• Hot Standby only knows (thru log) the blocks of a file when it is closed.
• Hot Standby returns non-recognized blocks, so DataNodes can retry them on
next block-received message.
• In I/O scenario files have 200 blocks Many pending blocks will be retried
until files are closed  larger block-received messages and responses 
Processing Memory
• Hot Standby Node
• CPU time 3.3 times and heap memory 1.9 times higher in I/O scenario
compared to metadata scenario.
DataNode
Hot Standby
Node
Block-Received: new + old blocks
Response: non-recognized blocks

PERFORMANCE RESULTS
THROUGHPUT AT CLIENTS
METADATA
6.24
25.8
5.45
25.32
0
5
10
15
20
25
30
Write Read
Throughtput(MB/s)
HDFS 0.21 Hot Standby
I/O
4.72
13.89
4.86
12.39
0
5
10
15
20
25
30
Write ReadThroughtput(MB/s)

FAILOVER RESULTS
METADATA I/O
NameNode Failure
Hot Standby is notified of
the failure.
First request processed
NameNode Failure
Hot Standby is notified of
the failure.
First request processed
Zookeeper session timeout : 2 min
(1.62±0.23)
min.
(2.31±0.46)
min.
0.24%
22%

HDFS 2.0.0-ALPHA
Name
Node
Shared
Storage
Standby
Name
Node
Client
Data
Node
writes log reads log
Released on May 2012
• DataNodes send
messages to both nodes.
• Transactional log is
written to High Available
Shared Storage.
• Standby keeps reading
log from storage.
• Blocks are logged as
they are allocated.
• Manual Failover with IO
fencing.

CONCLUSIONS AND
FUTURE WORK
• We built a high availability solution for HDFS that is capable of
delivering good throughput with low overhead to existing
components.
• The solution has a reasonable reaction time to failures and
works well in Elastic Computing environments.
• Impact on the code base was small and no components
external to the Hadoop Project were required.

CONCLUSIONS AND
FUTURE WORK (CONT.)
• Our results showed that we can improve the performance if we
handle better the new blocks. If the Hot Standby becomes
aware of which blocks compose a file before it is closed, we will
be able to continue writes. We also plan to support
reconfiguration.
• High Availability on HDFS is still a open problem:
• Multiple failure support;
• Integration with BooKeeper;
• Using HDFS itself to store the logs.

ACKNOWLEDGMENTS
• Rodrigo Schmidt
• Alumnus of University of Campinas (Unicamp);
• Facebook Engineer.
• Motorola Mobility

REFERENCES
1. K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Mass
Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 3-7 2010, pp. 1 –10.
2. “Facebook has the world’s largest Hadoop cluster!” http://hadoopblog. blogspot.com/2010/05/facebook-
has- worlds- largest- hadoop.html, last access on September 13th, 2012.
3. K. Shvachko, “HDFS Scability: The Limits to Growth,” ;login: The Usenix Magazine, vol. 35, no. 2, pp.
6–16, April 2010.
4. “Apache Hadoop,” http://hadoop.apache.org/, last access on September 13th, 2012.
5. “HBase,” http://hbase.apache.org/, last access on September 13th, 2012.
6. B. Bockelman, “Using Hadoop as a grid storage element,” Journal of Physics: Conference Series, vol.
180, no. 1, p. 012047, 2009. [Online].Available: http://stacks.iop.org/1742-6596/180/i=1/a=012047
7. D. Borthakur, “HDFS High Availability,” http://hadoopblog.blogspot. com/2009/11/hdfs-high-
availability.html, last access on September 13th, 2012.
8. D.Borthakur,J.Gray,J.S.Sarma,K.Muthukkaruppan,N.Spiegelberg, H. Kuang, K. Ranganathan, D.
Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, “Apache Hadoop Goes Realtime at Facebook,” in
Proceedings of the 2011 international conference on Management of data, ser. SIGMOD ’11. New York,
NY, USA: ACM, 2011, pp. 1071– 1080. [Online]. Available: http://doi.acm.org/10.1145/1989323.1989438

REFERENCES
9. “Apache Zookeeper,” http://hadoop.apache.org/zookeeper/, last access on September 13th,
2012.
10. “HDFS architecture,” http://hadoop.apache.org/common/docs/current/hdfs design.html, last
access on September 13th, 2012.
11. “Streaming Edits to a Standby Name-Node,” http://issues.apache.org/jira/browse/HADOOP-
4539, last access on September 13th, 2012.
12. “Hot Standby for NameNode,” http://issues.apache.org/jira/browse/HDFS-976, last access on
September 13th, 2012.
13. D. Borthakur, “Hadoop AvatarNode High Availability,” http://
14. hadoopblog.blogspot.com/2010/02/hadoop- namenode- high- availability.html, last access on
September 13th, 2012.
15. “Amazon EC2,” http://aws.amazon.com/ec2/, last access on September 13th, 2012.
16. “HighAvailabilityFrameworkforHDFSNN,”https://issues.apache.org/ jira/browse/HDFS-1623, last
17. “Automatic failover support for NN HA,” http://issues.apache.org/jira/browse/HDFS-3042, last

Q&A

BACKUP SLIDES

PERFORMANCE RESULTS
DATANODE’S ETHERNET FLOW
METADATA I/O
35.00
35.50
36.00
36.50
37.00
37.50
38.00
38.50
39.00
39.50
40.00
40.50
Received SentDataFlow(GB)
35.00
35.50
36.00
36.50
37.00
37.50
38.00
38.50
39.00
39.50
40.00
40.50
Received Sent
DataFlow(GB)

PERFORMANCE RESULTS
CPU TIME
METADATA I/O
0.00
50.00
100.00
150.00
200.00
250.00
300.00
350.00
400.00
450.00
500.00
DataNode Name Node Backup/Hot
Standby
CPUme(s)
0.00
50.00
100.00
150.00
200.00
250.00
300.00
350.00
400.00
450.00
500.00
StandbyCPUme(s)

PERFORMANCE RESULTS
HEAP USAGE
METADATA
0.00
2.50
5.00
7.50
10.00
12.50
15.00
17.50
20.00
22.50
25.00
Standby
JavaHeap(MB)
I/O
0.00
2.50
5.00
7.50
10.00
12.50
15.00
17.50
20.00
22.50
25.00
StandbyJavaHeap(MB)

PERFORMANCE RESULTS
NAMENODE RPC DATAFLOW
METADATA
0.00
100.00
200.00
300.00
400.00
500.00
600.00
700.00
800.00
Received Sent
DataFlow(MB)
I/O
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
Received SentDataFlow(MB)

• Group znode holds the
IP address of the active
metadata server.
• Client query Zookeeper
and register to be
notified of changes.
FINDING THE ACTIVE SERVER
Namenodes
active’s IP
Client
Watcher
Notification
Query

IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS

Similar to IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS (20)

Recently uploaded

Recently uploaded (20)

IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS

Editor's Notes