With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making Hadoop Admin skills important for better career, salary and job opportunities.
Learn how to set up a Hadoop cluster with HDFS High Availability here: www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
www.edureka.co/hadoop-admin — Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for questions
Objectives
At the end of this module, you will be able to understand:
Hadoop cluster introduction
Hadoop cluster running modes
Hadoop configuration files
Hadoop Admin responsibilities
Hadoop High Availability
Demo on High Availability
Cluster Growth Based On Storage Capacity
Planning cluster growth around storage capacity is often a good approach:
Data grows by approximately 5 TB per week
HDFS is set up to replicate each block three times
Thus, 15 TB of extra storage space is required per week
Assuming machines with 5 x 3 TB hard drives, this equates to a new machine required each week
Assume overheads to be 30%
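The arithmetic above can be sketched as a quick back-of-envelope script. All figures are the slide's own assumptions; note that once the 30% overhead is included, the estimate comes out slightly above one new node per week:

```python
# Back-of-envelope cluster growth estimate, using the slide's assumptions:
# 5 TB/week data growth, 3x HDFS replication, 30% overhead,
# and machines with 5 x 3 TB hard drives.

WEEKLY_DATA_TB = 5          # raw data growth per week
REPLICATION = 3             # HDFS replication factor
OVERHEAD = 0.30             # overhead for intermediate data, OS, logs, etc.
DRIVES_PER_NODE = 5
DRIVE_SIZE_TB = 3

raw_needed = WEEKLY_DATA_TB * REPLICATION                           # 15 TB/week of HDFS space
usable_per_node = DRIVES_PER_NODE * DRIVE_SIZE_TB * (1 - OVERHEAD)  # usable TB per node

nodes_per_week = raw_needed / usable_per_node
print(f"HDFS space needed per week: {raw_needed} TB")
print(f"Usable space per node:      {usable_per_node:.1f} TB")
print(f"New nodes per week:         {nodes_per_week:.2f}")
```

Ignoring overhead, one node (15 TB raw) per week suffices; with 30% overhead the same arithmetic suggests budgeting closer to one and a half nodes per week.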
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components: save the money, buy more nodes!
General ‘base’ configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD* configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet
General configuration: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications
“A cluster with more nodes performs better than one with fewer, slightly faster nodes”
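The "multiples of a base unit" rule above can be sketched as a small helper. The helper function and unit counts are illustrative only, not part of any Hadoop API:

```python
# Sketch: scale the 'base unit' (1 hard drive + 2 cores + 6-8 GB RAM)
# up to a slave-node specification. Illustrative helper, not a Hadoop API.

def slave_node_spec(units):
    """Return the hardware spec for a node built from N base units."""
    return {
        "drives": units * 1,            # 1 hard drive per unit
        "cores": units * 2,             # 2 cores per unit
        "ram_gb": (units * 6, units * 8),  # 6-8 GB RAM per unit
    }

# Four units reproduce the slide's 'base' configuration:
# 4 drives, 8 cores (2 x quad-core), 24-32 GB RAM.
print(slave_node_spec(4))
```

Scaling the unit count up or down gives a consistent way to size heavier or lighter slave nodes for a given workload.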
Slave Nodes: More Details (RAM)
Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
Slave nodes should not be using virtual memory
Rule of thumb: total number of tasks = 1.5 x number of processor cores
Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system
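The rule of thumb can be checked numerically. The per-task figure comes from the slide (1-2 GB); the daemon and OS footprints below are assumed values for illustration:

```python
# Rule-of-thumb check: task slots = 1.5 x cores, and all tasks plus the
# DataNode/TaskTracker daemons plus the OS must fit in physical RAM.

CORES = 8                      # e.g. 2 x quad-core CPUs, as recommended above
RAM_PER_TASK_GB = 2            # upper end of the slide's 1-2 GB range
DAEMON_GB = 1 + 1              # DataNode + TaskTracker (assumed ~1 GB each)
OS_GB = 2                      # assumed operating-system footprint

task_slots = int(1.5 * CORES)  # 12 concurrent Map/Reduce tasks
ram_needed = task_slots * RAM_PER_TASK_GB + DAEMON_GB + OS_GB
print(f"Task slots: {task_slots}, RAM needed: {ram_needed} GB")  # 12 slots, 28 GB
```

With these assumed figures, an 8-core slave needs about 28 GB, which lands inside the 24-32 GB range recommended earlier.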
Master Node Hardware Recommendations
A master node requires carrier-class hardware (not commodity hardware):
Dual power supplies
Dual Ethernet cards (bonded to provide failover)
RAID-ed hard drives
At least 32 GB of RAM
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode: no daemons, everything runs in a single JVM; suitable for running MapReduce programs during development; has no DFS
Pseudo-Distributed Mode: Hadoop daemons run on the local machine
Fully-Distributed Mode: Hadoop daemons run on a cluster of machines
Configuration Files
hadoop-env.sh, yarn-env.sh — Settings for the Hadoop daemons' process environment.
core-site.xml — Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml — Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml — Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml — Configuration settings for MapReduce applications.
slaves — A list of machines (one per line) that each run a DataNode and NodeManager.
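As an illustration, a minimal core-site.xml might look like the following sketch; the hostname, port and temporary directory are placeholder values:

```xml
<?xml version="1.0"?>
<!-- core-site.xml: Hadoop Core settings common to HDFS and YARN -->
<configuration>
  <property>
    <!-- URI of the NameNode; hostname and port are placeholders -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <!-- base for Hadoop's temporary directories; path is a placeholder -->
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/tmp</value>
  </property>
</configuration>
```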
Hadoop Cluster: A Typical Use Case
Active NameNode — RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS, Power: redundant power supply
Standby NameNode (optional) — RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS, Power: redundant power supply
Secondary NameNode — RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS, Power: redundant power supply
DataNodes (one entry per node) — RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS
NameNode – Single Point of Failure
Secondary NameNode:
"Not a hot standby" for the NameNode
Connects to the NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
NameNode Recovery vs. Failover
NameNode recovery in Hadoop 1.0: the Secondary NameNode keeps a copy of the NameNode's metadata (FSImage and edit logs); after a failure, the NameNode must be manually recovered using the Secondary NameNode.
High Availability in Hadoop 2.0: an Active NameNode is paired with a Standby NameNode, giving automatic failover to the Standby NameNode.
Hadoop 2.x HA
Hadoop 2.x introduces a new feature called High Availability.
The HDFS HA feature addresses the Hadoop 1.x problem by providing the option of running two NameNodes in the same cluster, in an active/passive configuration. These are referred to as the Active NameNode and the Standby NameNode.
The Standby NameNode is a hot backup for the cluster, allowing a fast failover to a new NameNode in case a machine crashes, or a graceful administrator-initiated failover for planned maintenance.
We can set up HA in two ways:
Quorum-based Storage
Shared storage using NFS
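As a sketch of the Quorum-based Storage option, the relevant hdfs-site.xml properties might look like this; the nameservice ID, NameNode IDs and hostnames are placeholders chosen for illustration:

```xml
<!-- hdfs-site.xml: sketch of a Quorum-based Storage HA setup.
     'mycluster', 'nn1'/'nn2' and all hostnames are placeholders. -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <!-- the two NameNodes: one Active, one Standby -->
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>master1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>master2.example.com:8020</value>
  </property>
  <property>
    <!-- the JournalNode quorum that stores the shared edit log -->
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <property>
    <!-- let the ZKFCs fail over automatically via ZooKeeper -->
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

The shared-storage-using-NFS alternative replaces the qjournal:// URI with an NFS mount path for dfs.namenode.shared.edits.dir.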
HA Architecture
Active NameNode and Standby NameNode, with the slave nodes sending block reports and heartbeats to both.
Journal Nodes (shared edits): the Active NameNode writes edits, and the Standby NameNode reads them.
Failover Controllers (one for the Active NameNode, one for the Standby) monitor NameNode status and health, and manage HA state.
ZooKeeper service: used by the failover controllers for coordination.
Daemons in HA Architecture
Active NameNode host: NameNode, ZKFC (ZooKeeper Failover Controller), ZooKeeper, Journal Node
Standby NameNode host: NameNode, ZKFC, ZooKeeper, Journal Node
Third host: ZooKeeper, Journal Node
DataNode hosts: DataNode, sending block reports and heartbeats to both NameNodes
Hadoop Admin Responsibilities
Responsible for the implementation and administration of Hadoop infrastructure.
Testing HDFS, Hive, Pig and MapReduce access for applications.
Cluster maintenance tasks such as backup, recovery, upgrades and patching.
Performance tuning and capacity planning for clusters.
Monitoring the Hadoop cluster and deploying security.