Hadoop Training

What's in it for you?

Need for Hadoop
What is Hadoop?
Hadoop Ecosystem
Hadoop Features
What is HDFS?
What is MapReduce?
What is YARN?
Bank case study
Need for Hadoop

In today's world, data is growing rapidly from heterogeneous sources such as social media, aviation, logistics, e-commerce, etc.

All this digital data is expected to reach 163 zettabytes by 2025 (1 ZB = one billion TB).

Companies face problems in storing and processing these vast volumes of data. The solution is big data technologies such as Hadoop.
What is Hadoop?

Hadoop is an open-source framework to store and process huge volumes of data. It stores large volumes of data across multiple data nodes (Data → DN1, DN2, DN3, DN4) and processes that data in parallel on those data nodes.
Components of Hadoop

1. HDFS – Distributed data storage
2. MapReduce – Parallel data processing
3. YARN – Cluster resource management
Hadoop Ecosystem

[Ecosystem diagram] The Hadoop ecosystem layers tools for data collection and ingestion, workflow, scripting (Pig), SQL queries (Hive), interactive analysis, machine learning, streaming, and read/write access to data on top of the Hadoop Distributed File System, cluster resource management, data processing, and management and monitoring.
Hadoop Features

Flexible: Hadoop is flexible in storing any type of data, be it structured, semi-structured, or unstructured.

Scalable: As the volume of data grows, new node machines can easily be added to scale the Hadoop cluster.

Fault tolerant: Data stored on HDFS is automatically replicated to different data nodes, which provides high fault tolerance when a data node crashes.

Distributed storage: Hadoop supports distributed data storage and hence allows faster processing of data.

Robust ecosystem: Hadoop has a robust ecosystem that suits the analytical needs of small and large organizations, including Spark, Pig, Hive, Mahout, etc.

Cost effective: Hadoop stores and processes data on a cluster of commodity hardware, resulting in a substantial reduction in cost per terabyte of storage.
Hadoop use case

Before the 2008 economic recession, every bank maintained a legacy data warehouse. Home mortgage details, credit card transactions, and other financial details of every customer were restricted to local database systems. As a result, banks could not store and process data efficiently and failed to build a comprehensive risk portfolio for their customers.

After the 2008 economic recession, most financial institutions and national monetary associations started maintaining a single Hadoop cluster containing petabytes of financial data. Along with transaction data, such a cluster could also store call records, email, chat, and web logs. This data is analyzed to perform sentiment analysis, text processing, and pattern matching.

JP Morgan, a banking and financial giant with services in more than 100 nations, treats data as its oil: it has over 150 petabytes of data, 30,000 databases, and 3.5 billion log records. Storing vast volumes of unstructured data allows the company to collect web logs, transaction data, social media data, etc., and it uses the Hadoop framework for risk management and for detecting fraudulent transactions.
What is HDFS?

The Hadoop Distributed File System (HDFS) is the storage layer of Hadoop; it stores data across multiple data nodes (Datanode1, Datanode2, Datanode3, Datanode4).

In HDFS, data is divided into blocks, and the blocks are stored on multiple nodes. Each block holds 128 MB of data by default and is stored on multiple data nodes. A 300 MB file, for example, is split into blocks of 128 MB, 128 MB, and 44 MB, each placed on a different data node.
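To make the block arithmetic concrete, here is a minimal Java sketch (not part of Hadoop itself; the 128 MB default and the 300 MB file come from the example above) that computes how a file of a given size splits into HDFS blocks:

    // Minimal sketch: how a file splits into HDFS-sized blocks.
    // BLOCK_SIZE matches the HDFS default of 128 MB used in the example.
    public class BlockSplit {
        static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB in bytes

        public static void main(String[] args) {
            long fileSize = 300L * 1024 * 1024; // the 300 MB file from the example
            long fullBlocks = fileSize / BLOCK_SIZE;
            long lastBlock = fileSize % BLOCK_SIZE;
            for (long i = 0; i < fullBlocks; i++) {
                System.out.println("Block " + (i + 1) + ": " + BLOCK_SIZE + " bytes (128 MB)");
            }
            if (lastBlock > 0) {
                System.out.println("Block " + (fullBlocks + 1) + ": " + lastBlock + " bytes (44 MB)");
            }
        }
    }

Running it prints two 128 MB blocks and one 44 MB block, matching the split shown above.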
HDFS Architecture

Namenode and Secondary Namenode are the master daemons. The Namenode is the master server; the Secondary Namenode is the backup server.

The Namenode holds metadata about the various Datanodes: block locations, the size of each block, and so on (Metadata (Name, replicas, ...): /home/foo/data, 3, ...). It keeps this metadata in RAM and on disk (as the edit log and fsimage), and it executes file system namespace operations such as opening, closing, and renaming files and directories. The Secondary Namenode server is responsible for maintaining a copy of the metadata (edit log and fsimage) on disk.

Datanodes are the slave daemons that store and maintain the data blocks (e.g., Datanode 1 holds B1 and B2, Datanode 2 holds B1 and B3, Datanode 3 holds B2 and B3, and so on through Datanode N). Datanodes serve read and write requests from clients and perform block creation, deletion, and replication on instruction from the Namenode, which responds to the client once the operation was successful.
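These metadata operations are visible through Hadoop's public Java client API. The sketch below, assuming a reachable cluster and an existing path /home/foo/data (the path used in the slides), asks the Namenode for a file's block locations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: query the Namenode's metadata for a file's blocks and their datanodes.
    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/home/foo/data"));
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                // Each entry shows which datanodes hold a replica of this block
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }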
HDFS Write

1. The client asks the Master (Namenode) where it can write and store its data. The 300 MB file is split into multiple blocks of 128 MB each (128 MB + 128 MB + 44 MB).
2. The Master finds the available data nodes and replies, for example: "Write the 1st block of data to A3, B2, B4."
3. The client writes the first 128 MB block to data nodes A3, B2, and B4; every data block is replicated three times on different data nodes spread across Rack 1 (A1–A4), Rack 2 (B1–B4), and Rack 3 (C1–C4).
4. Similarly, the other two blocks (128 MB and 44 MB) are written and replicated on different data nodes.
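The write path above maps onto Hadoop's FileSystem API. A minimal sketch, with an illustrative path and a small payload standing in for the 300 MB file; block size and replication are normally set cluster-wide but can be passed per file as shown:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write a file to HDFS; the Namenode picks the datanodes,
    // and each 128 MB block is replicated 3 times behind the scenes.
    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/data.bin");  // illustrative path
            short replication = 3;                        // default replication factor
            long blockSize = 128L * 1024 * 1024;          // default block size
            try (FSDataOutputStream out = fs.create(file, true,
                    conf.getInt("io.file.buffer.size", 4096), replication, blockSize)) {
                out.write(new byte[1024]); // payload; a real job would stream 300 MB here
            }
        }
    }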
HDFS Read

1. The client tells the Master (Namenode) that it wants to read its file.
2. The Master finds the data nodes that hold the file's blocks and replies, for example: "Read data from A2, A3, B1."
3. The client reads the blocks from those data nodes and reassembles the file (128 MB + 128 MB + 44 MB).
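The read path in the same API: the client asks the Namenode where the blocks live and then streams them from the data nodes. The path below is the illustrative one from the write sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Sketch: read a file back from HDFS; block locations are resolved
    // via the Namenode, and data is streamed from the datanodes.
    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(new Path("/user/demo/data.bin"))) {
                IOUtils.copyBytes(in, System.out, 4096, false); // copy file to stdout
            }
        }
    }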
Importing data to HDFS

Relational databases: Sqoop is used to import data from relational databases (RDBMS, data warehouses) into HDFS.

Streaming data: Flume is used to import streaming data from sensors and web servers into HDFS.
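Sqoop is driven from the command line; the sketch below launches a typical import that lands a table's rows in an HDFS directory. The connection string, credentials, table name, and target directory are all placeholders. It is wrapped in a Java ProcessBuilder call so the deck's examples stay in one language, but the quoted arguments are exactly what you would type in a shell:

    import java.util.Arrays;

    // Sketch: invoke a Sqoop import (host, db, user, table, target dir are placeholders).
    public class SqoopImport {
        public static void main(String[] args) throws Exception {
            Process p = new ProcessBuilder(Arrays.asList(
                    "sqoop", "import",
                    "--connect", "jdbc:mysql://dbhost/bankdb",   // JDBC URL of the RDBMS
                    "--username", "dbuser",
                    "--password-file", "/user/demo/db.password", // avoids a plaintext flag
                    "--table", "transactions",                   // source table
                    "--target-dir", "/user/demo/transactions",   // HDFS destination
                    "--num-mappers", "4"))                       // parallel import tasks
                    .inheritIO()
                    .start();
            System.exit(p.waitFor());
        }
    }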
What is MapReduce?

MapReduce is a programming model for processing large datasets in parallel on different nodes. Under a master node, data is processed simultaneously on different slave nodes (slave node 1 through slave node 4).
MapReduce Workflow

Input → Map tasks (Map()) → Shuffle and sort → Reduce tasks (Reduce()) → Output
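In Hadoop's Java API, this pipeline is wired together by a driver class. A sketch is shown below; the ShapeCountMapper and ShapeCountReducer classes it names are sketched after the worked example that follows:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch: driver that wires Input -> Map -> Shuffle/Sort -> Reduce -> Output.
    public class ShapeCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "shape count");
            job.setJarByClass(ShapeCountDriver.class);
            job.setMapperClass(ShapeCountMapper.class);     // map tasks
            job.setCombinerClass(ShapeCountReducer.class);  // local merge step
            job.setReducerClass(ShapeCountReducer.class);   // reduce tasks
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // shuffle/sort is implicit
        }
    }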
MapReduce Example

Input (a list of shape and color words to be counted):

    Square Red      Triangle Blue   Circle Green
    Square Green    Triangle White  Cube Blue
    Cube Yellow     Circle Red      Cube Blue
    Hexagon Green   Square Blue     Cube Yellow

Split step: the input is divided into splits, each handled by its own map function.

Map step: each map task emits every word it sees as a key with the value 1:

    Square = 1, Red = 1, Triangle = 1, Blue = 1, Circle = 1, Green = 1,
    Square = 1, Green = 1, Triangle = 1, White = 1, Cube = 1, Blue = 1,
    Cube = 1, Yellow = 1, Circle = 1, Red = 1, Cube = 1, Blue = 1,
    Hexagon = 1, Green = 1, Square = 1, Blue = 1, Cube = 1, Yellow = 1

Merge step: within each split, values for the same key are merged into lists (e.g., Square = {1,1}, Cube = {1,1,1}); merging across splits gives:

    Square = {1,1,1}, Red = {1,1}, Triangle = {1,1}, Blue = {1,1,1,1},
    Circle = {1,1}, Green = {1,1,1}, White = {1}, Cube = {1,1,1,1},
    Yellow = {1,1}, Hexagon = {1}

Shuffle and sort step: the merged lists are sorted by key:

    Blue = {1,1,1,1}, Circle = {1,1}, Cube = {1,1,1,1}, Green = {1,1,1},
    Hexagon = {1}, Red = {1,1}, Square = {1,1,1}, Triangle = {1,1},
    White = {1}, Yellow = {1,1}

Reduce step: each list is summed to produce the final counts:

    Blue = 4, Circle = 2, Cube = 4, Green = 3, Hexagon = 1,
    Red = 2, Square = 3, Triangle = 2, White = 1, Yellow = 2
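The map and reduce steps above translate directly into the classic word-count pattern. A sketch of the two classes used by the driver shown earlier: each word (e.g., "Square" or "Blue") becomes a key with value 1, and the reducer sums each key's list:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: emit ("Square", 1), ("Red", 1), ... for every token in a line.
    class ShapeCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE); // e.g., ("Blue", 1)
            }
        }
    }

    // Reduce step: sum each key's list, e.g., Blue = {1,1,1,1} -> Blue = 4.
    class ShapeCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

Because summation is associative, the same reducer class can safely double as the combiner in the driver.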
YARN – Yet Another Resource Negotiator

YARN was introduced in Hadoop 2.0 to solve the issues in Hadoop 1.0 (MR1), such as scalability, availability of nodes, and resource utilization. YARN is the cluster resource management layer of Hadoop: it schedules jobs and assigns resources (memory, CPU) to running applications such as MapReduce.

Job submission in YARN:

1. The client submits a job request (an application) to the ResourceManager.
2. The ResourceManager allocates a container on a NodeManager, and that container executes the ApplicationMaster for the job.
3. Each NodeManager sends its node status to the ResourceManager.
4. The ApplicationMaster sends resource requests to the ResourceManager, contacts the related NodeManagers to launch task containers, and tracks the MapReduce status as the job runs.
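YARN exposes this information through a Java client API. A hedged sketch, assuming a reachable ResourceManager, that lists running nodes and applications:

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Sketch: ask the ResourceManager for node status and running applications.
    public class ClusterStatus {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport n : nodes) {
                System.out.println(n.getNodeId() + " capability=" + n.getCapability());
            }
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getName() + " state=" + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }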
Bank case study

You own a virtual bank that generates a lot of customer transaction data and uses an RDBMS to store it. But the bank's data is rapidly increasing, and the RDBMS has become inefficient at handling such large volumes. You need a solution to move the bank's data from the traditional RDBMS to more flexible and scalable storage.

What if you used the Hadoop Distributed File System (HDFS) to store the data? HDFS can easily store large volumes of data, so you can use Sqoop commands (such as the import sketched earlier) to move all the bank's data from the RDBMS onto HDFS and then analyze the customer data there.
Key Takeaways