EDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoop
Agenda
➢ What is Big Data?
➢ Hadoop Introduction: Solution to Big Data Problem
➢ Hadoop Ecosystem
➢ Hadoop Core Components: HDFS & YARN
➢ Hadoop Core Configuration Files
➢ Multi-Node Hadoop Installation
➢ Configuring Hadoop using Configuration Files
➢ Commissioning/Decommissioning of DataNodes
➢ Hadoop Web UI Components
➢ Hadoop Admin Job Responsibilities
What is Big Data?
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications”
Volume: processing increasingly huge data sets
Variety: processing different types of data
Velocity: data is being generated at an alarming rate
Value: finding the correct meaning out of the data
Veracity: uncertainty and inconsistencies in the data
Hadoop: Solution to Big Data
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion. A Hadoop cluster has two core layers, each organised into master and slave daemons:
➢ HDFS (Storage): allows us to dump any kind of data across the cluster
➢ MapReduce (Processing): allows parallel processing of the data stored in HDFS
Hadoop Ecosystem
Hadoop Core Components
HDFS (NameNode, DataNodes, Secondary NameNode) and YARN (ResourceManager, NodeManagers)
HDFS Core Components
NameNode
▪ Master daemon
▪ Maintains and Manages DataNodes
▪ Records metadata e.g. location of blocks stored, the
size of the files, permissions, hierarchy, etc.
▪ Receives heartbeat and block report from all the
DataNodes
DataNode
▪ Slave daemons
▪ Stores actual data
▪ Serves read and write requests from the clients
Secondary NameNode
▪ Checkpointing Node
▪ Responsible for combining the EditLogs with FsImage
from the NameNode
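Two standard HDFS admin commands expose much of what these daemons track (illustrative commands, not part of the original demo):

  hdfs dfsadmin -report                     # per-DataNode capacity, usage and last heartbeat as seen by the NameNode
  hdfs fsck / -files -blocks -locations     # block locations and replication for every file under /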
HDFS Architecture
YARN Core Components
YARN Architecture
The ResourceManager is the master; each NodeManager hosts containers and an App Master. The client submits a job to the ResourceManager, the NodeManagers report Node Status to it, the App Masters send Resource Requests, and MapReduce Status flows from the containers back to the App Master.
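The same picture can be inspected from the YARN command line once the cluster is running (a small sketch, not from the demo):

  yarn node -list           # NodeManagers currently registered with the ResourceManager
  yarn application -list    # running applications, their state and tracking URL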
Hadoop Cluster
The cluster is made up of master and slave machines: the NameNode and the Secondary NameNode run on the master machines, while the slave nodes are spread across Rack 1, Rack 2 and Rack 3. Each rack has its own switch, and the rack switches connect through a core switch.
Hadoop Cluster Modes
Standalone (or Local) Mode
➢ No daemons, everything runs in a single JVM
➢ Suitable for running MapReduce programs during development
➢ Has no DFS (Distributed File System); uses the local file system instead
Pseudo Distributed Mode
➢ All Hadoop daemons run on the local machine, each as a separate process (see the sketch after this list)
Multi-Node Cluster Mode
➢ Hadoop daemons run on a cluster of machines
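For example, in pseudo-distributed mode core-site.xml typically points the default filesystem at the local host (a minimal sketch; hdfs://localhost:9000 is a common choice, not a requirement):

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>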
Hadoop Cluster Hardware Specification
Active/Passive NameNode
▪ RAM: 64 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 8 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS
▪ Power: Redundant power supply

Secondary NameNode
▪ RAM: 32 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 4 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS
▪ Power: Redundant power supply

DataNode
▪ RAM: 16 GB
▪ Hard disk: 6 x 2 TB
▪ Processor: Xeon with 2 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS
Hadoop Configuration Files
1. /etc/hadoop/core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
2. /etc/hadoop/hdfs-site.xml - Configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes
3. /etc/hadoop/mapred-site.xml - Configuration settings for MapReduce
4. /etc/hadoop/yarn-site.xml - Configuration settings for the YARN daemons: the ResourceManager and the NodeManagers
5. hadoop-env.sh - Specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop)
6. slaves - A list of machines, one per line, each running a DataNode and a NodeManager
Multi-Node Hadoop Installation
(Demo)
Configuring Hadoop
01 - hadoop-env.sh
02 - core-site.xml
03 - hdfs-site.xml
04 - mapred-site.xml
05 - yarn-site.xml
hadoop-env.sh
export JAVA_HOME=/home/hadoop/jdk              # Java implementation to use
export HADOOP_HEAPSIZE=2000                    # Max. amount of heap to use
export HADOOP_PID_DIR=/home/hadoop/data/pid    # Directory where the daemons' process id files are stored
core-site.xml
➢ Filesystem name and default port number
➢ Specifying the temporary directory
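A minimal core-site.xml sketch covering these two settings; the hostname, port and path are placeholders rather than values from the demo:

  <configuration>
    <property>
      <name>fs.defaultFS</name>                <!-- filesystem name and default port -->
      <value>hdfs://namenode-host:9000</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>              <!-- base for Hadoop's temporary directories -->
      <value>/home/hadoop/data/tmp</value>
    </property>
  </configuration>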
hdfs-site.xml
Replication Factor
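The corresponding hdfs-site.xml property is dfs.replication; a sketch with the usual default of 3:

  <property>
    <name>dfs.replication</name>   <!-- number of replicas kept for each HDFS block -->
    <value>3</value>
  </property>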
hdfs-site.xml
Example: with the replication factor set to 3, moving test.txt into HDFS and then listing the files in the directory shows a file replication of 3.
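A sketch of that session (the target directory is illustrative):

  hdfs dfs -put test.txt /user/edureka/    # move test.txt into HDFS
  hdfs dfs -ls /user/edureka/              # the second column of the listing shows the replication factor (3)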
hdfs-site.xml
If true, enable permission checking in HDFS
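This most likely corresponds to dfs.permissions.enabled (dfs.permissions in older releases); a sketch:

  <property>
    <name>dfs.permissions.enabled</name>   <!-- enable or disable permission checking in HDFS -->
    <value>true</value>
  </property>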
hdfs-site.xml
Local directory where the FsImage will be stored
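The property being described is presumably dfs.namenode.name.dir; the path below is a placeholder:

  <property>
    <name>dfs.namenode.name.dir</name>   <!-- local directory for the NameNode's FsImage and edit logs -->
    <value>/home/hadoop/data/namenode</value>
  </property>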
hdfs-site.xml
Local directory where all the DFS data will be stored
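This maps to dfs.datanode.data.dir; again the path is only illustrative:

  <property>
    <name>dfs.datanode.data.dir</name>   <!-- local directory where a DataNode stores HDFS blocks -->
    <value>/home/hadoop/data/datanode</value>
  </property>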
hdfs-site.xml
Block size configuration, in bytes. When a file (e.g. file.xml) is moved into HDFS, it is split into 128 MB blocks that are distributed across the HDFS cluster.
Note: the default block size is 128 MB
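A sketch of the property itself (134217728 bytes = 128 MB):

  <property>
    <name>dfs.blocksize</name>   <!-- HDFS block size, in bytes -->
    <value>134217728</value>
  </property>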
hdfs-site.xml
Secondary NameNode HTTP address
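Presumably dfs.namenode.secondary.http-address; the value below is the usual default:

  <property>
    <name>dfs.namenode.secondary.http-address</name>   <!-- Secondary NameNode web UI address -->
    <value>0.0.0.0:50090</value>
  </property>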
hdfs-site.xml
Directory where the Secondary NameNode stores the temporary images to merge
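This corresponds to dfs.namenode.checkpoint.dir; the path is a placeholder:

  <property>
    <name>dfs.namenode.checkpoint.dir</name>   <!-- where the Secondary NameNode stores temporary images to merge -->
    <value>/home/hadoop/data/secondary</value>
  </property>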
mapred-site.xml
MapReduce job workflow: each map task writes its output for an input split through an in-memory buffer (mapreduce.task.io.sort.mb, default 100 MB); the buffer contents are partitioned, sorted and spilled to disk, and the spill files are merged (mapreduce.task.io.sort.factor, default 10 streams at once). The reducers then fetch their partitions from all the maps, merge them and run the reduce phase.
mapred-site.xml
➢ Number of streams (spill files) to merge at once while sorting files
➢ Total amount of buffer memory to use while sorting files, in megabytes
➢ Runtime framework for executing MapReduce jobs
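These three descriptions map to the following mapred-site.xml properties; the first two values are the defaults, and the framework is set to yarn for a YARN cluster:

  <property>
    <name>mapreduce.task.io.sort.factor</name>   <!-- streams (spill files) merged at once while sorting -->
    <value>10</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>       <!-- sort buffer memory, in megabytes -->
    <value>100</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>        <!-- runtime framework: local, classic or yarn -->
    <value>yarn</value>
  </property>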
mapred-site.xml
➢ Threshold, as a fraction of the sort buffer, at which the buffered data will be spilled to the local disk by a background thread (the diagram shows a spill being triggered once the buffer is 80% full)
➢ Local directory where MapReduce stores intermediate data files
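These most likely correspond to the following properties; 0.80 is the default spill threshold and the path is illustrative:

  <property>
    <name>mapreduce.map.sort.spill.percent</name>   <!-- buffer fill fraction that triggers a spill to disk -->
    <value>0.80</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>        <!-- local directory for intermediate MapReduce data -->
    <value>/home/hadoop/data/mapred/local</value>
  </property>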
mapred-site.xml
➢ MapReduce JobHistory Server IPC host:port
➢ MapReduce JobHistory Server Web UI host:port
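A sketch using the usual defaults (10020 for IPC, 19888 for the web UI, matching the JobHistory Web UI port shown later in the deck):

  <property>
    <name>mapreduce.jobhistory.address</name>          <!-- JobHistory Server IPC host:port -->
    <value>0.0.0.0:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>   <!-- JobHistory Server web UI host:port -->
    <value>0.0.0.0:19888</value>
  </property>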
yarn-site.xml
➢ The address of the applications manager interface in the RM
➢ The address of the scheduler interface
➢ The address required by NodeManagers to connect to the ResourceManager
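These are the standard ResourceManager address properties; the hostname is a placeholder and the ports are the usual defaults:

  <property>
    <name>yarn.resourcemanager.address</name>                    <!-- applications manager interface in the RM -->
    <value>rm-host:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>          <!-- scheduler interface -->
    <value>rm-host:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>   <!-- used by NodeManagers to connect to the RM -->
    <value>rm-host:8031</value>
  </property>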
yarn-site.xml
➢ Where to aggregate logs to (the remote log directory)
➢ Where to store container logs
➢ URL where aggregated logs can be accessed after the application completes
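These most likely map to the log-related properties below (paths and host are illustrative; aggregation itself is switched on with yarn.log-aggregation-enable):

  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>   <!-- where to aggregate logs to (HDFS directory) -->
    <value>/app-logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>             <!-- local directories where container logs are stored -->
    <value>/home/hadoop/data/yarn/logs</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>                   <!-- URL for aggregated logs after the application completes -->
    <value>http://historyserver-host:19888/jobhistory/logs</value>
  </property>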
yarn-site.xml
➢ Path to a file with nodes to exclude
➢ Path to a file with nodes to include
➢ Minimum allocation for every container request at the RM, in MB
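A sketch of these three settings (the file paths are placeholders; 1024 MB is the usual default minimum):

  <property>
    <name>yarn.resourcemanager.nodes.exclude-path</name>   <!-- file listing nodes to exclude -->
    <value>/home/hadoop/excludes</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nodes.include-path</name>   <!-- file listing nodes to include -->
    <value>/home/hadoop/includes</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>      <!-- minimum container allocation at the RM, in MB -->
    <value>1024</value>
  </property>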
yarn-site.xml
➢ Maximum allocation for every container request at the RM, in MB
➢ Amount of physical memory, in MB, that can be allocated for containers
➢ Maximum allocation for every container request at the RM, in terms of virtual CPU cores
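A sketch of these settings; the values are illustrative and should be sized to the actual NodeManager hardware:

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>      <!-- maximum container allocation at the RM, in MB -->
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>       <!-- physical memory per NodeManager usable for containers -->
    <value>8192</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>  <!-- maximum container allocation, in virtual CPU cores -->
    <value>4</value>
  </property>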
Commissioning DataNode &
NodeManager
Commissioning DataNode
Commissioning a DataNode: starting from a cluster of DataNode1, DataNode2 and DataNode3 under the NameNode, DataNode4 is added to the cluster, and the balancer is then run so that existing blocks are redistributed across all four DataNodes.
Commissioning DataNode
1. Update hdfs-site.xml: dfs.hosts (path to the 'includes' file)
2. Update yarn-site.xml: yarn.resourcemanager.nodes.include-path
3. Add the DataNode's IP to the 'includes' file
4. hdfs dfsadmin -refreshNodes
5. yarn rmadmin -refreshNodes
6. hadoop dfsadmin -report
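A sketch of how these steps look from the NameNode machine; the file location and the new node's IP are placeholders:

  echo "192.168.1.104" >> /home/hadoop/includes   # step 3: add the new DataNode's IP
  hdfs dfsadmin -refreshNodes                     # NameNode re-reads the includes file
  yarn rmadmin -refreshNodes                      # ResourceManager re-reads its include path
  hadoop dfsadmin -report                         # verify that the new DataNode has joined
  hdfs balancer                                   # redistribute existing blocks onto the new node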
Demo
Decommissioning DataNode &
NodeManager
Decommissioning DataNode
Decommissioning a DataNode: DataNode4 is removed from the cluster, leaving DataNode1, DataNode2 and DataNode3; no balancer run is required, because HDFS re-replicates the decommissioned node's blocks onto the remaining DataNodes.
Decommissioning DataNode
1. Update hdfs-site.xml: dfs.hosts.exclude (path to the 'excludes' file)
2. Update yarn-site.xml: yarn.resourcemanager.nodes.exclude-path
3. Add the DataNode's IP to the 'excludes' file
4. hdfs dfsadmin -refreshNodes
5. yarn rmadmin -refreshNodes
6. hadoop dfsadmin -report
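The same pattern as commissioning, but against the 'excludes' file (placeholders again):

  echo "192.168.1.104" >> /home/hadoop/excludes   # step 3: mark the DataNode for decommissioning
  hdfs dfsadmin -refreshNodes                     # the node enters the 'Decommission in progress' state
  yarn rmadmin -refreshNodes
  hadoop dfsadmin -report                         # wait until the node is reported as 'Decommissioned'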
Demo
Hadoop Web UI Components
➢ NameNode Web UI - port 50070
➢ ResourceManager Web UI - port 8088
➢ Secondary NameNode Web UI - port 50090
➢ JobHistory Web UI - port 19888
Hadoop Architecture - Recap
Master daemons: the NameNode, the Secondary NameNode, the ResourceManager and the JobHistory Server.
Slave machines: each runs a DataNode and a NodeManager, and the NodeManagers host the containers and App Masters in which applications execute.
Conclusion: Hadoop Job Responsibilities
Hadoop Admin Job Responsibilities
➢ Responsible for the implementation and ongoing support of the enterprise Hadoop environment
➢ Involves design, capacity planning, cluster setup, performance fine-tuning, monitoring, infrastructure planning and scaling
➢ Needs to implement and work with Hadoop ecosystem components such as YARN, MapReduce, HDFS, HBase, ZooKeeper, Pig and Hive
➢ Manages, monitors and analyzes the Hadoop file system and log files, and is also responsible for security measures
Thank You …
Questions/Queries/Feedback
