What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop Tutorial | Simplilearn


Jack harvests grapes and then sells them in
the nearby town

After harvesting, he then stores the
produce in a storage room

Soon there is high demand for other fruits, so
he starts harvesting apples and oranges as well

He then realizes that it is time-consuming and
difficult to harvest all the fruits by himself

So, he hires 2 more people to work with him. With
this, harvesting is done simultaneously

Now the storage room becomes a bottleneck, because
everyone has to store and fetch all the fruits in a
single storage area

Jack now decides to distribute the storage area
and give each one of them a separate storage
space

Hello, I want a fruit
basket of 3 grapes, 2
apples and 3 oranges

To complete the order on time, all of them work
in parallel, each with their own storage space

This solution helps them complete the fruit basket
order on time without any hassles

All of them are happy and they are prepared
for an increase in demand in the future

So, how does this story
relate to Big Data?

The rise of Big Data
Earlier, with limited and mostly structured data, only one processor and one
storage unit were needed

Soon, data generation increased, leading to a high volume of data in
different formats: structured, semi-structured and unstructured

A single processor was not enough to process such a high volume of different kinds
of data, as it was very time-consuming

Hence, multiple processors were used to process the high volume of data in
parallel, and this saved time

The single storage unit then became the bottleneck, because all the processors
had to access it, which generated network overhead

The solution was to use distributed storage, with separate storage for each
processor. This made it easy to store and access data

This method worked and avoided the network overhead of a single storage unit

This is known as parallel processing with distributed storage
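The idea of parallel processing with distributed storage can be sketched in a few lines of Python. This is only an illustration, not how Hadoop itself is implemented: the worker names and numbers are made up, and threads stand in for separate machines. Each worker reads only its own partition, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each worker owns one partition (its own "storage space").
partitions = {
    "worker-1": [3, 1, 4, 1, 5],
    "worker-2": [9, 2, 6],
    "worker-3": [5, 3, 5, 8],
}


def process_local(worker_name):
    """Process only the partition stored with this worker."""
    local_data = partitions[worker_name]
    return sum(local_data)


def run_parallel():
    """Run one task per worker at the same time, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_sums = list(pool.map(process_local, partitions))
    return sum(partial_sums)


total = run_parallel()
print(total)  # 52
```

No worker ever touches another worker's partition, which is the point of the story: the shared storage room is gone, so nobody waits on it.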

What’s in it for you?
1. Big Data and its challenges
2. Hadoop as a solution
3. What is Hadoop?
4. Components of Hadoop
5. Use case of Hadoop

What is Big Data?
Massive amounts of data that cannot be stored, processed and analyzed using
traditional methods

Big Data is characterized by Volume, Velocity, Variety, Value and Veracity

Big Data challenges and solutions
Challenge: Single central storage
Solution: Distributed storage

Challenge: Serial processing (Input A → Process → Output)
Solution: Parallel processing (Inputs A and B → Process → Output)

Challenge: Lack of ability to process unstructured data
Solution: Ability to process every type of data

Hadoop as a solution
Distributed storage
Parallel processing
Ability to process every type of data

What is Hadoop?
Hadoop is a framework that manages big data storage in a distributed way and processes it in parallel
Big Data: Storing, Processing, Analyzing

Components of Hadoop
HDFS: Storage unit of Hadoop
MapReduce: Processing unit of Hadoop

What is HDFS?
Hadoop Distributed File System (HDFS) is specially designed for storing huge datasets on commodity
hardware

HDFS has two core components: NameNode and DataNode
There is only one NameNode
There can be multiple DataNodes

Master/slave nodes typically form the HDFS cluster
The NameNode is the master and the DataNodes are the slaves

The NameNode maintains and manages the DataNodes. It also stores the metadata

DataNodes store the actual data and handle reading, writing and processing. They
perform replication as well

HeartBeat is the signal that each DataNode periodically sends to the NameNode.
This signal shows the status of the DataNode
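The HeartBeat mechanism can be illustrated with a toy tracker. This is a minimal sketch, not the real HDFS implementation: the class name, the method names and the 30-second timeout are all hypothetical. The master records when it last heard from each DataNode and treats a node as dead once it has been silent for longer than the timeout.

```python
class NameNodeMonitor:
    """Toy tracker of DataNode heartbeats (hypothetical, not the HDFS API)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its last heartbeat

    def receive_heartbeat(self, datanode_id, now):
        """Record that a DataNode reported in at time `now` (in seconds)."""
        self.last_seen[datanode_id] = now

    def live_nodes(self, now):
        """DataNodes heard from within the timeout window."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )

    def dead_nodes(self, now):
        """DataNodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = NameNodeMonitor(timeout_seconds=30)  # timeout value is an assumption
monitor.receive_heartbeat("datanode-1", now=0)
monitor.receive_heartbeat("datanode-2", now=0)
monitor.receive_heartbeat("datanode-1", now=25)  # datanode-2 stays silent

print(monitor.live_nodes(now=40))  # ['datanode-1']
print(monitor.dead_nodes(now=40))  # ['datanode-2']
```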

In HDFS, data is stored in a distributed manner
Example: a 30 TB file is loaded
Data is divided into blocks of 128 MB each
Blocks are then replicated among the DataNodes
Features of HDFS
Provides distributed storage
Implemented on commodity hardware
Provides data security
Highly fault tolerant

What is MapReduce?
Hadoop MapReduce is a programming technique where huge data is processed in a parallel and
distributed fashion

MapReduce is used for parallel processing of the Big
Data, which is stored in HDFS

In the MapReduce approach, processing is done at the slave nodes and the final result is sent to the
master node
Traditional approach: data is processed at the master node
MapReduce approach: data is processed at the slave nodes

Input
Bus Car Train
Ship Ship Train
Bus Ship Car

Input split: the input dataset is first split into chunks of data
Split 1: Bus Car Train
Split 2: Ship Ship Train
Split 3: Bus Ship Car

Map phase: these chunks of data are then processed by map tasks in parallel
Split 1: Bus, 1; Car, 1; Train, 1
Split 2: Ship, 1; Ship, 1; Train, 1
Split 3: Bus, 1; Ship, 1; Car, 1

Shuffle and sort: pairs with the same key are grouped together
Bus, 1; Bus, 1
Car, 1; Car, 1
Ship, 1; Ship, 1; Ship, 1
Train, 1; Train, 1

Reduce phase: at the reduce task, the aggregation takes place and the final output is obtained
Bus, 2
Car, 2
Ship, 3
Train, 2
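The word count walkthrough above can be reproduced with a small single-machine simulation. This is a sketch of the map, shuffle-and-sort and reduce steps in plain Python, not Hadoop's MapReduce API. The function names are made up for illustration, and the map calls run one after another here, whereas Hadoop would run them as parallel tasks.

```python
from collections import defaultdict

# The three input splits from the slide.
splits = ["Bus Car Train", "Ship Ship Train", "Bus Ship Car"]


def map_phase(split):
    """Emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split.split()]


def shuffle_and_sort(mapped_pairs):
    """Group the emitted counts by word and order the groups by key."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


mapped = [pair for split in splits for pair in map_phase(split)]
grouped = shuffle_and_sort(mapped)
result = reduce_phase(grouped)

print(grouped["Ship"])  # [1, 1, 1]
print(result)           # {'Bus': 2, 'Car': 2, 'Ship': 3, 'Train': 2}
```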

Components of Hadoop version 2.0
HDFS: Storage unit of Hadoop
MapReduce: Processing unit of Hadoop
YARN: Resource management unit of Hadoop

What is YARN?
YARN – Yet Another Resource Negotiator
Acts like an OS to Hadoop 2
Does job scheduling
Responsible for managing cluster resources

Client submits the job request to the Resource Manager
Resource Manager is responsible for resource allocation and management
Node Manager manages the nodes and monitors resource usage
Container is a collection of physical resources such as RAM and CPU
App Master requests containers from the Resource Manager and works with the
Node Managers to run them
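The container idea can be made concrete with a toy allocator. This is a minimal sketch under loud assumptions: the class names mirror the slide's terms but are not YARN's API, the node sizes are invented, and the first-fit rule is a stand-in for the real schedulers that YARN ships with. It only shows a request for RAM and CPU being matched against what each Node Manager still has free.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One worker node and the resources it still has free (hypothetical model)."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)


@dataclass
class ResourceManager:
    """Toy allocator that hands out containers using a first-fit rule."""
    nodes: list

    def allocate(self, app_id, memory_mb, vcores):
        """Return the name of the node hosting the new container, or None."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                node.containers.append((app_id, memory_mb, vcores))
                return node.name
        return None  # no node has enough free RAM and CPU


rm = ResourceManager(nodes=[
    NodeManager("node-1", free_memory_mb=4096, free_vcores=2),
    NodeManager("node-2", free_memory_mb=8192, free_vcores=4),
])

first = rm.allocate("app-1", memory_mb=3072, vcores=2)   # fits on node-1
second = rm.allocate("app-1", memory_mb=2048, vcores=1)  # node-1 is full, so node-2
third = rm.allocate("app-1", memory_mb=16384, vcores=1)  # larger than any node

print(first, second, third)  # node-1 node-2 None
```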

Hadoop use case – Combating fraudulent activities
Detecting fraudulent transactions is one of the many problems any bank faces

Challenge: Zions’ main challenge was to combat the fraudulent activities that were taking place

Approaches used by Zions’ security team to combat fraudulent activities

Security information management (SIM) tools
Problem: they were based on an RDBMS and were unable to store the huge
amount of data that needed to be analyzed

Parallel processing system
Problem: analyzing unstructured data was not possible

How Hadoop solved the problems
Storing: Zions could now store massive amounts of data using Hadoop
Processing: processing of unstructured data (like server logs, customer
data, customer transactions) was now possible
Analyzing: in-depth analysis of different data formats became easy and time
efficient
Detecting: the team could now detect everything from malware and spear
phishing attempts to account takeovers
