Hadoop Training

What's in it for you?

Need for Hadoop
What is Hadoop?
Hadoop Ecosystem
Hadoop Features
What is HDFS?
What is MapReduce?
What is YARN?
Bank case study
Need for Hadoop

In today's world, data is growing rapidly from heterogeneous sources such as social media, aviation, logistics, e-commerce, etc.

All this digital data is expected to reach 163 zettabytes by 2025 (1 ZB = one billion TB).

Companies face problems in storing and processing these vast volumes of data. The solution is big data technologies such as Hadoop.
What is Hadoop?

Hadoop is an open-source framework to store and process huge volumes of data. It stores large volumes of data across multiple data nodes (Data → DN1, DN2, DN3, DN4) and processes that data in parallel on those data nodes.
Components of Hadoop

1. HDFS – Distributed data storage
2. MapReduce – Parallel data processing
3. YARN – Cluster resource management
Hadoop Ecosystem

[Ecosystem diagram] The Hadoop ecosystem layers tools for data collection and ingestion, workflow, scripting (Pig), SQL queries (Hive), interactive analysis, machine learning, streaming, and read/write access to data on top of the Hadoop Distributed File System, cluster resource management, data processing, and management and monitoring.
Hadoop Features

Flexible: Hadoop is flexible in storing any type of data, be it structured, semi-structured, or unstructured.

Scalable: As the volume of data grows, new node machines can easily be added to scale the Hadoop cluster.

Fault tolerant: Data stored on HDFS is automatically replicated to different data nodes, which provides high fault tolerance when a data node crashes.

Distributed storage: Hadoop supports distributed data storage and hence allows faster processing of data.

Robust ecosystem: Hadoop has a robust ecosystem that suits the analytical needs of small and large organizations, including Spark, Pig, Hive, Mahout, etc.

Cost effective: Hadoop stores and processes data on a cluster of commodity hardware, resulting in a substantial reduction in cost per terabyte of storage.
Hadoop use case

Before the 2008 economic recession, every bank maintained a legacy data warehouse. Home mortgage details, credit card transactions, and other financial details of every customer were restricted to local database systems. As a result, banks could not store and process data efficiently and failed to build a comprehensive risk portfolio for their customers.

After the 2008 economic recession, most financial institutions and national monetary associations started maintaining a single Hadoop cluster containing petabytes of financial data. Along with transaction data, such a cluster could also store call records, email, chat, and web logs. This data is analyzed to perform sentiment analysis, text processing, and pattern matching.

JP Morgan, a banking and financial giant with services in more than 100 nations, treats data as its oil: it has over 150 petabytes of data, 30,000 databases, and 3.5 billion log records. Storing vast volumes of unstructured data allows the company to collect web logs, transaction data, social media data, etc., and it uses the Hadoop framework for risk management and for detecting fraudulent transactions.
What is HDFS?

The Hadoop Distributed File System (HDFS) is the storage layer of Hadoop; it stores data across multiple data nodes (Datanode1, Datanode2, Datanode3, Datanode4).

In HDFS, data is divided into blocks, and the blocks are stored on multiple nodes. Each block holds 128 MB of data by default and is stored on multiple data nodes. A 300 MB file, for example, is split into blocks of 128 MB, 128 MB, and 44 MB, each placed on a different data node.
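To make the block arithmetic concrete, here is a minimal Java sketch (not part of Hadoop itself; the 128 MB default and the 300 MB file come from the example above) that computes how a file of a given size splits into HDFS blocks:

    // Minimal sketch: how a file splits into HDFS-sized blocks.
    // BLOCK_SIZE matches the HDFS default of 128 MB used in the example.
    public class BlockSplit {
        static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB in bytes

        public static void main(String[] args) {
            long fileSize = 300L * 1024 * 1024; // the 300 MB file from the example
            long fullBlocks = fileSize / BLOCK_SIZE;
            long lastBlock = fileSize % BLOCK_SIZE;
            for (long i = 0; i < fullBlocks; i++) {
                System.out.println("Block " + (i + 1) + ": " + BLOCK_SIZE + " bytes (128 MB)");
            }
            if (lastBlock > 0) {
                System.out.println("Block " + (fullBlocks + 1) + ": " + lastBlock + " bytes (44 MB)");
            }
        }
    }

Running it prints two 128 MB blocks and one 44 MB block, matching the split shown above.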
HDFS Architecture

Namenode and Secondary Namenode are the master daemons. The Namenode is the master server; the Secondary Namenode is the backup server.

The Namenode holds metadata about the various Datanodes: block locations, the size of each block, and so on (Metadata (Name, replicas, ...): /home/foo/data, 3, ...). It keeps this metadata in RAM and on disk (as the edit log and fsimage), and it executes file system namespace operations such as opening, closing, and renaming files and directories. The Secondary Namenode server is responsible for maintaining a copy of the metadata (edit log and fsimage) on disk.

Datanodes are the slave daemons that store and maintain the data blocks (e.g., Datanode 1 holds B1 and B2, Datanode 2 holds B1 and B3, Datanode 3 holds B2 and B3, and so on through Datanode N). Datanodes serve read and write requests from clients and perform block creation, deletion, and replication on instruction from the Namenode, which responds to the client once the operation was successful.
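These metadata operations are visible through Hadoop's public Java client API. The sketch below, assuming a reachable cluster and an existing path /home/foo/data (the path used in the slides), asks the Namenode for a file's block locations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: query the Namenode's metadata for a file's blocks and their datanodes.
    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/home/foo/data"));
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                // Each entry shows which datanodes hold a replica of this block
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }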
HDFS Write

1. The client asks the Master (Namenode) where it can write and store its data. The 300 MB file is split into multiple blocks of 128 MB each (128 MB + 128 MB + 44 MB).
2. The Master finds the available data nodes and replies, for example: "Write the 1st block of data to A3, B2, B4."
3. The client writes the first 128 MB block to data nodes A3, B2, and B4; every data block is replicated three times on different data nodes spread across Rack 1 (A1–A4), Rack 2 (B1–B4), and Rack 3 (C1–C4).
4. Similarly, the other two blocks (128 MB and 44 MB) are written and replicated on different data nodes.
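The write path above maps onto Hadoop's FileSystem API. A minimal sketch, with an illustrative path and a small payload standing in for the 300 MB file; block size and replication are normally set cluster-wide but can be passed per file as shown:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write a file to HDFS; the Namenode picks the datanodes,
    // and each 128 MB block is replicated 3 times behind the scenes.
    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/data.bin");  // illustrative path
            short replication = 3;                        // default replication factor
            long blockSize = 128L * 1024 * 1024;          // default block size
            try (FSDataOutputStream out = fs.create(file, true,
                    conf.getInt("io.file.buffer.size", 4096), replication, blockSize)) {
                out.write(new byte[1024]); // payload; a real job would stream 300 MB here
            }
        }
    }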
HDFS Read

1. The client tells the Master (Namenode) that it wants to read its file.
2. The Master finds the data nodes that hold the file's blocks and replies, for example: "Read data from A2, A3, B1."
3. The client reads the blocks from those data nodes and reassembles the file (128 MB + 128 MB + 44 MB).
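The read path in the same API: the client asks the Namenode where the blocks live and then streams them from the data nodes. The path below is the illustrative one from the write sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Sketch: read a file back from HDFS; block locations are resolved
    // via the Namenode, and data is streamed from the datanodes.
    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(new Path("/user/demo/data.bin"))) {
                IOUtils.copyBytes(in, System.out, 4096, false); // copy file to stdout
            }
        }
    }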
Importing data to HDFS

Relational databases: Sqoop is used to import data from relational databases (RDBMS, data warehouses) into HDFS.

Streaming data: Flume is used to import streaming data from sensors and web servers into HDFS.
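Sqoop is driven from the command line; the sketch below launches a typical import that lands a table's rows in an HDFS directory. The connection string, credentials, table name, and target directory are all placeholders. It is wrapped in a Java ProcessBuilder call so the deck's examples stay in one language, but the quoted arguments are exactly what you would type in a shell:

    import java.util.Arrays;

    // Sketch: invoke a Sqoop import (host, db, user, table, target dir are placeholders).
    public class SqoopImport {
        public static void main(String[] args) throws Exception {
            Process p = new ProcessBuilder(Arrays.asList(
                    "sqoop", "import",
                    "--connect", "jdbc:mysql://dbhost/bankdb",   // JDBC URL of the RDBMS
                    "--username", "dbuser",
                    "--password-file", "/user/demo/db.password", // avoids a plaintext flag
                    "--table", "transactions",                   // source table
                    "--target-dir", "/user/demo/transactions",   // HDFS destination
                    "--num-mappers", "4"))                       // parallel import tasks
                    .inheritIO()
                    .start();
            System.exit(p.waitFor());
        }
    }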
What is MapReduce?

MapReduce is a programming model for processing large datasets in parallel on different nodes. Under a master node, data is processed simultaneously on different slave nodes (slave node 1 through slave node 4).
MapReduce Workflow

Input → Map tasks (Map()) → Shuffle and sort → Reduce tasks (Reduce()) → Output
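In Hadoop's Java API, this pipeline is wired together by a driver class. A sketch is shown below; the ShapeCountMapper and ShapeCountReducer classes it names are sketched after the worked example that follows:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch: driver that wires Input -> Map -> Shuffle/Sort -> Reduce -> Output.
    public class ShapeCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "shape count");
            job.setJarByClass(ShapeCountDriver.class);
            job.setMapperClass(ShapeCountMapper.class);     // map tasks
            job.setCombinerClass(ShapeCountReducer.class);  // local merge step
            job.setReducerClass(ShapeCountReducer.class);   // reduce tasks
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // shuffle/sort is implicit
        }
    }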
MapReduce Example

Input (a list of shape and color words to be counted):

    Square Red      Triangle Blue   Circle Green
    Square Green    Triangle White  Cube Blue
    Cube Yellow     Circle Red      Cube Blue
    Hexagon Green   Square Blue     Cube Yellow

Split step: the input is divided into splits, each handled by its own map function.

Map step: each map task emits every word it sees as a key with the value 1:

    Square = 1, Red = 1, Triangle = 1, Blue = 1, Circle = 1, Green = 1,
    Square = 1, Green = 1, Triangle = 1, White = 1, Cube = 1, Blue = 1,
    Cube = 1, Yellow = 1, Circle = 1, Red = 1, Cube = 1, Blue = 1,
    Hexagon = 1, Green = 1, Square = 1, Blue = 1, Cube = 1, Yellow = 1

Merge step: within each split, values for the same key are merged into lists (e.g., Square = {1,1}, Cube = {1,1,1}); merging across splits gives:

    Square = {1,1,1}, Red = {1,1}, Triangle = {1,1}, Blue = {1,1,1,1},
    Circle = {1,1}, Green = {1,1,1}, White = {1}, Cube = {1,1,1,1},
    Yellow = {1,1}, Hexagon = {1}

Shuffle and sort step: the merged lists are sorted by key:

    Blue = {1,1,1,1}, Circle = {1,1}, Cube = {1,1,1,1}, Green = {1,1,1},
    Hexagon = {1}, Red = {1,1}, Square = {1,1,1}, Triangle = {1,1},
    White = {1}, Yellow = {1,1}

Reduce step: each list is summed to produce the final counts:

    Blue = 4, Circle = 2, Cube = 4, Green = 3, Hexagon = 1,
    Red = 2, Square = 3, Triangle = 2, White = 1, Yellow = 2
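The map and reduce steps above translate directly into the classic word-count pattern. A sketch of the two classes used by the driver shown earlier: each word (e.g., "Square" or "Blue") becomes a key with value 1, and the reducer sums each key's list:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: emit ("Square", 1), ("Red", 1), ... for every token in a line.
    class ShapeCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE); // e.g., ("Blue", 1)
            }
        }
    }

    // Reduce step: sum each key's list, e.g., Blue = {1,1,1,1} -> Blue = 4.
    class ShapeCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

Because summation is associative, the same reducer class can safely double as the combiner in the driver.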
YARN – Yet Another Resource Negotiator

YARN was introduced in Hadoop 2.0 to solve the issues in Hadoop 1.0 (MR1), such as scalability, availability of nodes, and resource utilization. YARN is the cluster resource management layer of Hadoop: it schedules jobs and assigns resources (memory, CPU) to running applications such as MapReduce.

Job submission in YARN:

1. The client submits a job request (an application) to the ResourceManager.
2. The ResourceManager allocates a container on a NodeManager, and that container executes the ApplicationMaster for the job.
3. Each NodeManager sends its node status to the ResourceManager.
4. The ApplicationMaster sends resource requests to the ResourceManager, contacts the related NodeManagers to launch task containers, and tracks the MapReduce status as the job runs.
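YARN exposes this information through a Java client API. A hedged sketch, assuming a reachable ResourceManager, that lists running nodes and applications:

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Sketch: ask the ResourceManager for node status and running applications.
    public class ClusterStatus {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport n : nodes) {
                System.out.println(n.getNodeId() + " capability=" + n.getCapability());
            }
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getName() + " state=" + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }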
Bank case study

You own a virtual bank that generates a lot of customer transaction data and uses an RDBMS to store it. But the bank's data is rapidly increasing, and the RDBMS has become inefficient at handling such large volumes. You need a solution to move the bank's data from the traditional RDBMS to more flexible and scalable storage.

What if you used the Hadoop Distributed File System (HDFS) to store the data? HDFS can easily store large volumes of data, so you can use Sqoop commands (such as the import sketched earlier) to move all the bank's data from the RDBMS onto HDFS and then analyze the customer data there.
Key Takeaways