EDUREKA HADOOP CERTIFICATION TRAINING
www.edureka.co/big-data-and-hadoop
Agenda
➢ What is Big Data?
➢ Hadoop Introduction: Solution to Big Data Problem
➢ Hadoop Ecosystem
➢ Hadoop Core Components: HDFS & YARN
➢ Hadoop Core Configuration Files
➢ Multi-Node Hadoop Installation
➢ Configuring Hadoop using Configuration Files
➢ Commissioning/Decommissioning of DataNodes
➢ Hadoop Web UI Components
➢ Hadoop Admin Job Responsibilities
What is Big Data?
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications”
Volume: processing increasingly huge data sets
Variety: processing different types of data
Velocity: data is being generated at an alarming rate
Value: finding the correct meaning out of the data
Veracity: uncertainty and inconsistencies in the data
Hadoop: Solution to Big Data
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion. A Hadoop cluster has two core layers, each organised into master and slave daemons:
➢ HDFS (Storage): allows us to dump any kind of data across the cluster
➢ MapReduce (Processing): allows parallel processing of the data stored in HDFS
Hadoop Ecosystem
Hadoop Core Components
HDFS (NameNode, DataNodes, Secondary NameNode) and YARN (ResourceManager, NodeManagers)
HDFS Core Components
NameNode
▪ Master daemon
▪ Maintains and Manages DataNodes
▪ Records metadata e.g. location of blocks stored, the
size of the files, permissions, hierarchy, etc.
▪ Receives heartbeat and block report from all the
DataNodes
DataNode
▪ Slave daemons
▪ Stores actual data
▪ Serves read and write requests from the clients
Secondary NameNode
▪ Checkpointing Node
▪ Responsible for combining the EditLogs with FsImage
from the NameNode
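Two standard HDFS admin commands expose much of what these daemons track (illustrative commands, not part of the original demo):

  hdfs dfsadmin -report                     # per-DataNode capacity, usage and last heartbeat as seen by the NameNode
  hdfs fsck / -files -blocks -locations     # block locations and replication for every file under /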
HDFS Architecture
YARN Core Components
YARN Architecture
The ResourceManager is the master; each NodeManager hosts containers and an App Master. The client submits a job to the ResourceManager, the NodeManagers report Node Status to it, the App Masters send Resource Requests, and MapReduce Status flows from the containers back to the App Master.
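The same picture can be inspected from the YARN command line once the cluster is running (a small sketch, not from the demo):

  yarn node -list           # NodeManagers currently registered with the ResourceManager
  yarn application -list    # running applications, their state and tracking URL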
Hadoop Cluster
The cluster is made up of master and slave machines: the NameNode and the Secondary NameNode run on the master machines, while the slave nodes are spread across Rack 1, Rack 2 and Rack 3. Each rack has its own switch, and the rack switches connect through a core switch.
Hadoop Cluster Modes
Standalone (or Local) Mode
➢ No daemons, everything runs in a single JVM
➢ Suitable for running MapReduce programs during development
➢ Has no DFS (Distributed File System); uses the local file system instead
Pseudo Distributed Mode
➢ All Hadoop daemons run on the local machine, each as a separate process (see the sketch after this list)
Multi-Node Cluster Mode
➢ Hadoop daemons run on a cluster of machines
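For example, in pseudo-distributed mode core-site.xml typically points the default filesystem at the local host (a minimal sketch; hdfs://localhost:9000 is a common choice, not a requirement):

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>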
Hadoop Cluster Hardware Specification
Active/Passive NameNode
▪ RAM: 64 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 8 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS
▪ Power: Redundant power supply

Secondary NameNode
▪ RAM: 32 GB
▪ Hard disk: 1 TB
▪ Processor: Xeon with 4 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS
▪ Power: Redundant power supply

DataNode
▪ RAM: 16 GB
▪ Hard disk: 6 x 2 TB
▪ Processor: Xeon with 2 cores
▪ Ethernet: 3 x 10 Gb/s
▪ OS: 64-bit CentOS
Hadoop Configuration Files
1. /etc/hadoop/core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
2. /etc/hadoop/hdfs-site.xml - Configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes
3. /etc/hadoop/mapred-site.xml - Configuration settings for MapReduce
4. /etc/hadoop/yarn-site.xml - Configuration settings for the YARN daemons: the ResourceManager and the NodeManagers
5. hadoop-env.sh - Specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop)
6. slaves - A list of machines, one per line, each running a DataNode and a NodeManager
Multi-Node Hadoop Installation
(Demo)
Configuring Hadoop
01 - hadoop-env.sh
02 - core-site.xml
03 - hdfs-site.xml
04 - mapred-site.xml
05 - yarn-site.xml
hadoop-env.sh
export JAVA_HOME=/home/hadoop/jdk              # Java implementation to use
export HADOOP_HEAPSIZE=2000                    # Max. amount of heap to use
export HADOOP_PID_DIR=/home/hadoop/data/pid    # Directory where the daemons' process id files are stored
core-site.xml
➢ Filesystem name and default port number
➢ Specifying the temporary directory
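A minimal core-site.xml sketch covering these two settings; the hostname, port and path are placeholders rather than values from the demo:

  <configuration>
    <property>
      <name>fs.defaultFS</name>                <!-- filesystem name and default port -->
      <value>hdfs://namenode-host:9000</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>              <!-- base for Hadoop's temporary directories -->
      <value>/home/hadoop/data/tmp</value>
    </property>
  </configuration>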
hdfs-site.xml
Replication Factor
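The corresponding hdfs-site.xml property is dfs.replication; a sketch with the usual default of 3:

  <property>
    <name>dfs.replication</name>   <!-- number of replicas kept for each HDFS block -->
    <value>3</value>
  </property>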
hdfs-site.xml
Example: with the replication factor set to 3, moving test.txt into HDFS and then listing the files in the directory shows a file replication of 3.
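A sketch of that session (the target directory is illustrative):

  hdfs dfs -put test.txt /user/edureka/    # move test.txt into HDFS
  hdfs dfs -ls /user/edureka/              # the second column of the listing shows the replication factor (3)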
hdfs-site.xml
If true, enable permission checking in HDFS
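This most likely corresponds to dfs.permissions.enabled (dfs.permissions in older releases); a sketch:

  <property>
    <name>dfs.permissions.enabled</name>   <!-- enable or disable permission checking in HDFS -->
    <value>true</value>
  </property>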
hdfs-site.xml
Local directory where the FsImage will be stored
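The property being described is presumably dfs.namenode.name.dir; the path below is a placeholder:

  <property>
    <name>dfs.namenode.name.dir</name>   <!-- local directory for the NameNode's FsImage and edit logs -->
    <value>/home/hadoop/data/namenode</value>
  </property>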
hdfs-site.xml
Local directory where all the DFS data will be stored
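This maps to dfs.datanode.data.dir; again the path is only illustrative:

  <property>
    <name>dfs.datanode.data.dir</name>   <!-- local directory where a DataNode stores HDFS blocks -->
    <value>/home/hadoop/data/datanode</value>
  </property>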
hdfs-site.xml
Block size configuration, in bytes. When a file (e.g. file.xml) is moved into HDFS, it is split into 128 MB blocks that are distributed across the HDFS cluster.
Note: the default block size is 128 MB
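A sketch of the property itself (134217728 bytes = 128 MB):

  <property>
    <name>dfs.blocksize</name>   <!-- HDFS block size, in bytes -->
    <value>134217728</value>
  </property>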
hdfs-site.xml
Secondary NameNode HTTP address
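Presumably dfs.namenode.secondary.http-address; the value below is the usual default:

  <property>
    <name>dfs.namenode.secondary.http-address</name>   <!-- Secondary NameNode web UI address -->
    <value>0.0.0.0:50090</value>
  </property>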
hdfs-site.xml
Directory where the Secondary NameNode stores the temporary images to merge
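This corresponds to dfs.namenode.checkpoint.dir; the path is a placeholder:

  <property>
    <name>dfs.namenode.checkpoint.dir</name>   <!-- where the Secondary NameNode stores temporary images to merge -->
    <value>/home/hadoop/data/secondary</value>
  </property>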
mapred-site.xml
MapReduce job workflow: each map task writes its output for an input split through an in-memory buffer (mapreduce.task.io.sort.mb, default 100 MB); the buffer contents are partitioned, sorted and spilled to disk, and the spill files are merged (mapreduce.task.io.sort.factor, default 10 streams at once). The reducers then fetch their partitions from all the maps, merge them and run the reduce phase.
mapred-site.xml
➢ Number of streams (spill files) to merge at once while sorting files
➢ Total amount of buffer memory to use while sorting files, in megabytes
➢ Runtime framework for executing MapReduce jobs
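These three descriptions map to the following mapred-site.xml properties; the first two values are the defaults, and the framework is set to yarn for a YARN cluster:

  <property>
    <name>mapreduce.task.io.sort.factor</name>   <!-- streams (spill files) merged at once while sorting -->
    <value>10</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>       <!-- sort buffer memory, in megabytes -->
    <value>100</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>        <!-- runtime framework: local, classic or yarn -->
    <value>yarn</value>
  </property>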
mapred-site.xml
➢ Threshold, as a fraction of the sort buffer, at which the buffered data will be spilled to the local disk by a background thread (the diagram shows a spill being triggered once the buffer is 80% full)
➢ Local directory where MapReduce stores intermediate data files
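These most likely correspond to the following properties; 0.80 is the default spill threshold and the path is illustrative:

  <property>
    <name>mapreduce.map.sort.spill.percent</name>   <!-- buffer fill fraction that triggers a spill to disk -->
    <value>0.80</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>        <!-- local directory for intermediate MapReduce data -->
    <value>/home/hadoop/data/mapred/local</value>
  </property>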
mapred-site.xml
➢ MapReduce JobHistory Server IPC host:port
➢ MapReduce JobHistory Server Web UI host:port
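A sketch using the usual defaults (10020 for IPC, 19888 for the web UI, matching the JobHistory Web UI port shown later in the deck):

  <property>
    <name>mapreduce.jobhistory.address</name>          <!-- JobHistory Server IPC host:port -->
    <value>0.0.0.0:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>   <!-- JobHistory Server web UI host:port -->
    <value>0.0.0.0:19888</value>
  </property>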
yarn-site.xml
➢ The address of the applications manager interface in the RM
➢ The address of the scheduler interface
➢ The address required by NodeManagers to connect to the ResourceManager
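These are the standard ResourceManager address properties; the hostname is a placeholder and the ports are the usual defaults:

  <property>
    <name>yarn.resourcemanager.address</name>                    <!-- applications manager interface in the RM -->
    <value>rm-host:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>          <!-- scheduler interface -->
    <value>rm-host:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>   <!-- used by NodeManagers to connect to the RM -->
    <value>rm-host:8031</value>
  </property>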
yarn-site.xml
➢ Where to aggregate logs to (the remote log directory)
➢ Where to store container logs
➢ URL where aggregated logs can be accessed after the application completes
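These most likely map to the log-related properties below (paths and host are illustrative; aggregation itself is switched on with yarn.log-aggregation-enable):

  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>   <!-- where to aggregate logs to (HDFS directory) -->
    <value>/app-logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>             <!-- local directories where container logs are stored -->
    <value>/home/hadoop/data/yarn/logs</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>                   <!-- URL for aggregated logs after the application completes -->
    <value>http://historyserver-host:19888/jobhistory/logs</value>
  </property>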
yarn-site.xml
➢ Path to a file with nodes to exclude
➢ Path to a file with nodes to include
➢ Minimum allocation for every container request at the RM, in MB
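A sketch of these three settings (the file paths are placeholders; 1024 MB is the usual default minimum):

  <property>
    <name>yarn.resourcemanager.nodes.exclude-path</name>   <!-- file listing nodes to exclude -->
    <value>/home/hadoop/excludes</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nodes.include-path</name>   <!-- file listing nodes to include -->
    <value>/home/hadoop/includes</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>      <!-- minimum container allocation at the RM, in MB -->
    <value>1024</value>
  </property>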
yarn-site.xml
➢ Maximum allocation for every container request at the RM, in MB
➢ Amount of physical memory, in MB, that can be allocated for containers
➢ Maximum allocation for every container request at the RM, in terms of virtual CPU cores
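A sketch of these settings; the values are illustrative and should be sized to the actual NodeManager hardware:

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>      <!-- maximum container allocation at the RM, in MB -->
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>       <!-- physical memory per NodeManager usable for containers -->
    <value>8192</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>  <!-- maximum container allocation, in virtual CPU cores -->
    <value>4</value>
  </property>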
Commissioning DataNode &
NodeManager
Commissioning DataNode
Commissioning a DataNode: starting from a cluster of DataNode1, DataNode2 and DataNode3 under the NameNode, DataNode4 is added to the cluster, and the balancer is then run so that existing blocks are redistributed across all four DataNodes.
Commissioning DataNode
1. Update hdfs-site.xml: dfs.hosts (path to the 'includes' file)
2. Update yarn-site.xml: yarn.resourcemanager.nodes.include-path
3. Add the DataNode's IP to the 'includes' file
4. hdfs dfsadmin -refreshNodes
5. yarn rmadmin -refreshNodes
6. hadoop dfsadmin -report
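A sketch of how these steps look from the NameNode machine; the file location and the new node's IP are placeholders:

  echo "192.168.1.104" >> /home/hadoop/includes   # step 3: add the new DataNode's IP
  hdfs dfsadmin -refreshNodes                     # NameNode re-reads the includes file
  yarn rmadmin -refreshNodes                      # ResourceManager re-reads its include path
  hadoop dfsadmin -report                         # verify that the new DataNode has joined
  hdfs balancer                                   # redistribute existing blocks onto the new node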
Demo
Decommissioning DataNode &
NodeManager
Decommissioning DataNode
Decommissioning a DataNode: DataNode4 is removed from the cluster, leaving DataNode1, DataNode2 and DataNode3; no balancer run is required, because HDFS re-replicates the decommissioned node's blocks onto the remaining DataNodes.
Decommissioning DataNode
1. Update hdfs-site.xml: dfs.hosts.exclude (path to the 'excludes' file)
2. Update yarn-site.xml: yarn.resourcemanager.nodes.exclude-path
3. Add the DataNode's IP to the 'excludes' file
4. hdfs dfsadmin -refreshNodes
5. yarn rmadmin -refreshNodes
6. hadoop dfsadmin -report
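The same pattern as commissioning, but against the 'excludes' file (placeholders again):

  echo "192.168.1.104" >> /home/hadoop/excludes   # step 3: mark the DataNode for decommissioning
  hdfs dfsadmin -refreshNodes                     # the node enters the 'Decommission in progress' state
  yarn rmadmin -refreshNodes
  hadoop dfsadmin -report                         # wait until the node is reported as 'Decommissioned'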
Demo
Hadoop Web UI Components
➢ NameNode Web UI - port 50070
➢ ResourceManager Web UI - port 8088
➢ Secondary NameNode Web UI - port 50090
➢ JobHistory Web UI - port 19888
Hadoop Architecture - Recap
Master daemons: the NameNode, the Secondary NameNode, the ResourceManager and the JobHistory Server.
Slave machines: each runs a DataNode and a NodeManager, and the NodeManagers host the containers and App Masters in which applications execute.
Conclusion: Hadoop Job Responsibilities
Hadoop Admin Job Responsibilities
➢ Responsible for the implementation and ongoing support of the enterprise Hadoop environment
➢ Involves design, capacity planning, cluster setup, performance fine-tuning, monitoring, infrastructure planning and scaling
➢ Needs to implement and work with Hadoop ecosystem components such as YARN, MapReduce, HDFS, HBase, ZooKeeper, Pig and Hive
➢ Manages, monitors and analyzes the Hadoop file system and log files, and is also responsible for security measures
Thank You …
Questions/Queries/Feedback
