This document provides an overview of Cloudera's Administrator Training course for Apache Hadoop. The training covers planning and deploying Hadoop clusters; installing and configuring components such as HDFS, Hive, and Impala; administering clusters with Cloudera Manager; configuring advanced cluster options and HDFS high availability; and Hadoop security. The hands-on course includes exercises for deploying Hadoop clusters, importing data, and troubleshooting issues.
Why Cloudera Training?
1. Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors: Over 15,000 students trained since 2009
3. Leader in Certification: Over 5,000 accredited Cloudera professionals
4. State of the Art Curriculum: Classes updated regularly as Hadoop evolves
5. Widest Geographic Coverage: Most classes offered; 50 cities worldwide plus online
6. Most Relevant Platform & Community: CDH deployed more than all other distributions combined
7. Depth of Training Material: Hands-on exercises and VMs support live instruction
8. Ongoing Learning: Video tutorials and e-learning complement training
Learning Path: System Administrators
Administrator Training: Configure, install, and monitor clusters for optimal performance; implement security measures and multi-user functionality
HBase Training: Implement massively distributed, columnar storage at scale; enable random, real-time read/write access to all data
Data Analyst Training: Vertically integrate basic analytics into data management; transform and manipulate data to drive high-value utilization
Enterprise Training: Use Cloudera Manager to speed deployment and scale the cluster; learn which tools and techniques improve cluster performance
• Submit questions in the Q&A panel
• Watch on-demand video of this webinar and many more at http://cloudera.com
• Follow Ian on Twitter @iwrigley
• Follow Cloudera University @ClouderaU
• Learn more at Strata + Hadoop World: http://tinyurl.com/hadoopworld
• Thank you for attending!
Register now for Cloudera training at http://university.cloudera.com
Use discount code Admin_10 to save 10% on new enrollments in Administrator Training classes delivered by Cloudera until December 1, 2013*
Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until December 1, 2013*
* Excludes classes sold or delivered by Cloudera partners
Editor's Notes
It's perhaps more accurate to say that HDFS federation doesn't do much to change the NameNode's status as a single point of failure. If you have several volumes, the one that just failed may not be the one you need for a given job; on the other hand, with several NameNodes, the chance that any one of them fails increases. We recommend using high-quality hardware for the master nodes, so NameNodes seldom fail. When they do, recovery is a straightforward process and there is little chance of data loss (assuming administrators have configured things properly beforehand). There are many possible reasons for HDFS downtime (http://www.cloudera.com/blog/2011/02/hadoop-availability/), but these two are the most pertinent to our discussion. The best source of information on HA is Cloudera's HDFS High Availability Guide (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/CDH4-High-Availability-Guide.html). There's a good overview of Quorum Journal Manager-based HDFS HA at http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1, more useful background at http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum, and the HDFS HA design document at https://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf.
The metadata referenced here includes the fsimage and edit log. Clients only ever contact the active NameNode, using a "virtual" NameNode address that always resolves to the currently active NameNode (as described a few slides later). When HA is enabled, all DataNodes in the cluster are configured with the network addresses of both NameNodes. DataNodes send all block reports, block location updates, and heartbeats to both NameNodes, but only the currently active NameNode sends commands to the DataNodes (to delete blocks, for example). This section describes a "hot standby," which is ready to take over immediately when the active NameNode fails. A "cold standby" (a machine that does not have access to the current state, and may even be powered off) is possible, but failure recovery takes far longer, so this is not the preferred approach. In theory, it's possible to run more than two NameNodes; however, this has not been tested, and in practice no one runs more than two NameNodes in production. We still recommend using "carrier grade" hardware for the Active and Standby NameNodes. If you're transitioning an existing cluster to HA, you can reuse the Secondary NameNode hardware for your standby NameNode (there is no Secondary NameNode in HA, as illustrated here and further described on the next slide).
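The "virtual" NameNode address mentioned above is implemented as a logical nameservice that a client-side failover proxy provider resolves to whichever NameNode is currently active. As a minimal hdfs-site.xml sketch (the nameservice name "mycluster" and the host names are placeholders, not values from the course material):

    <!-- Logical nameservice; clients address HDFS as hdfs://mycluster -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <!-- The two NameNodes participating in HA -->
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>master2.example.com:8020</value>
    </property>
    <!-- Client-side class that resolves the nameservice to the active NameNode -->
    <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>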
The Active NameNode sends its edits to the JournalNodes via RPC; once it has an ACK from a majority of the JNs, it considers the write committed. The Standby NameNode reads from the JNs to keep its state in sync with the Active NameNode. Paxos is the algorithm the QJM and JNs use to ensure that no edits are lost even if a JN fails while being written to; it is a well-known, well-tested distributed systems algorithm. In CDH4, there are some issues when you attempt to add new QJM nodes to an existing quorum. The workaround is to use rsync to copy the journal storage directory from an existing JournalNode and then restart.
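To illustrate how the Active NameNode is pointed at the quorum, the shared edits directory is expressed as a qjournal:// URI listing the JournalNodes. A sketch with placeholder host names and local path (8485 is the default JournalNode RPC port):

    <!-- Quorum of JournalNodes that receive the Active NameNode's edits -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
    </property>
    <!-- Local directory where each JournalNode stores its copy of the edits -->
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/data/1/dfs/jn</value>
    </property>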
This slide introduces the concepts behind failover; the configuration process for failover is covered in detail in upcoming slides.
There is a deployment diagram of HDFS HA with automatic failover on the next slide. You can either teach from the current slide and then move to the diagram to reinforce what you covered, or spend as little time as possible on the bullets here and move directly to the diagram, whichever best suits your teaching style. A little detail about the bullet points: A ZooKeeper Failover Controller daemon runs on each NameNode machine. It monitors the NameNode and, if the NameNode fails, automatically fails HDFS over to the Standby. When the NameNode that was originally the Active NameNode comes back up, it comes up as the Standby NameNode. If desired, you can force a failback manually (using hdfs haadmin -failover). The ZooKeeper Failover Controller uses a replicated ZooKeeper ensemble to hold state; note that the ZooKeeper Failover Controller is not a ZooKeeper server, it merely uses ZooKeeper to maintain state. In an HDFS HA automatic failover deployment, you will need to install, configure, and start two NameNodes, JournalNodes, ZooKeeper servers, and ZooKeeper Failover Controllers.
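As a sketch of what enabling automatic failover involves (the "mycluster" nameservice and the ZooKeeper host names are placeholders), dfs.ha.automatic-failover.enabled belongs in hdfs-site.xml and ha.zookeeper.quorum in core-site.xml:

    <!-- hdfs-site.xml: let the ZooKeeper Failover Controllers manage failover -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
    <!-- core-site.xml: the ZooKeeper ensemble in which the failover controllers keep state -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>

Before starting the failover controllers for the first time, initialize their znode with hdfs zkfc -formatZK. The manual failback mentioned above takes the two NameNode IDs from dfs.ha.namenodes.mycluster, for example: hdfs haadmin -failover nn1 nn2.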
Even if you are doing most of your teaching from the previous slide, point out the differences between the HDFS HA deployment diagram a few slides ago and this one. (You might ask your students to identify which components did not appear on the first HDFS HA diagram.) Without automatic failover, there are no ZooKeeper Failover Controllers and there is no ZooKeeper ensemble. Point out that this is the logical architecture, and remind students about the physical architecture noted in the bullets on the previous slide: the ZooKeeper Failover Controllers must run on the same hosts as the Active and Standby NameNodes; the ZooKeeper servers can run on any nodes in the cluster; the JournalNodes can run on any nodes as well. Cloudera Solutions Architecture's best practice as of this writing is to co-locate all of these servers on the master nodes. For example, you might distribute the ZooKeeper servers and the JournalNodes across the hosts running the NameNodes and the JobTracker. These servers are critical to the success of HDFS HA, and while the system has built-in redundancy to tolerate node failure, it's best to place these important servers on ultra-reliable hardware. It's also recommended to provide separate disks for each ZooKeeper server, and separate disks for each JournalNode.
ln is the natural logarithm. To determine the recommended baseline value for dfs.namenode.handler.count, you can go to a site like http://www.rapidtables.com/calc/math/Ln_Calc.htm to get the natural logarithm of a number, or you can run a Python one-liner such as the following: python -c 'import math; print int(math.log(200) * 20)'. The preceding command will work from the terminal windows in students' lab environments, and it will work on Macs.
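The same calculation written out as a short Python function, in case students want something more readable than the one-liner (the 200-node cluster size is just an example):

    import math

    def recommended_handler_count(num_worker_nodes):
        # Baseline for dfs.namenode.handler.count: 20 * ln(cluster size).
        return int(math.log(num_worker_nodes) * 20)

    # For a 200-node cluster, ln(200) is about 5.3, so the baseline is 105.
    print(recommended_handler_count(200))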
Note that if trash is enabled in the server configuration, the value configured on the server is used and the client configuration is ignored. If trash is disabled in the server configuration, the client-side configuration is checked.
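For reference, trash is controlled by fs.trash.interval in core-site.xml: a nonzero value (in minutes) enables it, and 0 disables it. A minimal sketch (the one-day value is just an example):

    <!-- core-site.xml: keep deleted files in .Trash for 1440 minutes (one day) -->
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>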
The notes for dfs.namenode.handler.count have information about how to obtain the natural logarithm for the number of nodes.