This document provides an overview of Big Data and Hadoop. It discusses what Big Data is, why existing data analytics approaches have limitations, and how Hadoop addresses these issues. Hadoop uses a master-slave architecture with the NameNode as master and DataNodes as slaves. It stores data in HDFS as blocks across DataNodes and allows distributed processing via MapReduce. The document covers Hadoop 1.0 and 2.0 components as well as challenges of Hadoop 1.x like single point of failure and lack of high availability of the NameNode.
There's a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly YARN, and we held our first meetup to discuss this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
5 Scenarios: When To Use & When Not to Use Hadoop | Edureka!
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has opened up better career, salary and job opportunities for many professionals.
Join Cloudera’s founder and Chief Scientist, Jeff Hammerbacher, as he describes ten common problems that are being solved with Apache Hadoop.
A replay of the webinar can be viewed here:
https://www1.gotomeeting.com/register/719074008
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka!
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop Components" will provide you with detailed knowledge of the top Hadoop components and help you understand their different categories. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
What is HDFS | Hadoop Distributed File System | Edureka!
( Hadoop Training: https://www.edureka.co/hadoop )
This What is HDFS PPT will help you understand the Hadoop Distributed File System and its features, along with a practical demonstration. In this PPT, we will cover:
1. What is DFS and Why Do We Need It?
2. What is HDFS?
3. HDFS Architecture
4. HDFS Replication Factor
5. HDFS Commands Demonstration on a Production Hadoop Cluster
Check our complete Hadoop playlist here: https://goo.gl/hzUO0m
Slides from May 2018 St. Louis Big Data Innovations, Data Engineering, and Analytics User Group meeting. The presentation focused on Data Modeling in Hive.
This Hadoop presentation will help you understand the different tools present in the Hadoop ecosystem. It takes you through an overview of the important tools of the Hadoop ecosystem, which include Hadoop HDFS, Hadoop Pig, Hadoop YARN, Hadoop Hive, Apache Spark, Mahout, Apache Kafka, Storm, Sqoop, Apache Ranger and Oozie, and discusses the architecture of these tools. It covers the different tasks of Hadoop such as data storage, data processing, cluster resource management, data ingestion, machine learning, streaming and more. Now, let us get started and understand each of these tools in detail.
The topics below are explained in this Hadoop ecosystem presentation:
1. What is Hadoop ecosystem?
1. Pig (Scripting)
2. Hive (SQL queries)
3. Apache Spark (Real-time data analysis)
4. Mahout (Machine learning)
5. Apache Ambari (Management and monitoring)
6. Kafka & Storm
7. Apache Ranger & Apache Knox (Security)
8. Oozie (Workflow system)
9. Hadoop MapReduce (Data processing)
10. Hadoop Yarn (Cluster resource management)
11. Hadoop HDFS (Data storage)
12. Sqoop & Flume (Data collection and ingestion)
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Learn Spark SQL: creating, transforming, and querying DataFrames
14. Understand the common use-cases of Spark and the various interactive algorithms
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.
This slide deck is used as an introduction to the internals of Hadoop MapReduce, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
A presentation on Big Data, from the workshop "The Era of Big Data: Why and How?" at the 22nd Computer Society of Iran Computer Conference (csicc2017.ir).
Vahid Amiri
vahidamiry.ir
datastack.ir
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Top Hadoop Big Data Interview Questions and Answers for Freshers | JanBask Training
The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. As the volume, variety and velocity of data collected and analyzed in enterprises has increased severalfold, organizations have started struggling with the architectural limitations of traditional RDBMS. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of "Big Data". In this paper we trace the origin of one such system, Hadoop, built to handle Big Data.
Asserting that Big Data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly Big Data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones and tablets), social Web sites (e.g., Facebook, Twitter, LinkedIn) and other sources like GPS, Google Maps, and heat/pressure sensors.
Big Data raises challenges about how to process such a vast pool of raw data and how to derive value from it. To address these demands, an ecosystem of tools named Hadoop was conceived.
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
1. Naveen P.N
Trainer
NPN Training. "Training is the essence of success and we are committed to it."
www.npntraining.com
Module 01 - Understanding Big Data and Hadoop
Includes (Hadoop 1.x & 2.x Architecture)
2. Topics for the Module
What is Big Data
OLTP VS OLAP
Limitation of existing Data Analytics
Moving Data into Code
Moving Code into Data
Hadoop 1.0 / 2.0 Core Components
Hadoop 2.0 Core Components
Hadoop Master Slave Architecture
After completing the module, you will be able to understand:
File Blocks
Rack Awareness
Anatomy of File Read and Write
Hadoop 1.x Challenges
Scala REPL
Scala Variable Types
3. What is Big Data
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using traditional data-processing applications.
www.npntraining.com/masters-program/big-data-architect-training.php
4. Where Is This "Big Data" Coming From?
12+ TB of tweet data every day
25+ TB of log data every day
? TB of data every day
2+ billion people on the Web by end of 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009; 200M by 2014
5. About RDBMS
Why do I need an RDBMS?
For quick response times.
It enables relationships between data elements to be defined and managed.
It enables one database to be utilized for all applications.
If data is already stored in an RDBMS, then what is the problem? Why did the Big Data problem arise?
6. OLTP vs OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
7. Big Data spans three dimensions (3Vs)
"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
8. Limitation of Existing Data Analytics Architecture
Raw data lands in a storage-only grid (SAN); an ETL compute grid loads aggregated data into an RDBMS.
Problems:
Can't explore the original high-fidelity raw data.
Premature data death: 90% of the data is archived, and a meagre 10% is available for BI.
9. Solution: A Combined Storage + Compute Layer
Hadoop provides both the storage and the compute grid together, feeding BI reports and interactive apps, with the RDBMS still holding aggregated data.
Scalable throughput for ETL & aggregation
Data exploration & advanced analytics
No data archiving: keep data alive forever
The entire data set is available for processing
10. Processing Data in the Enterprise: Traditional Approach
Processing 1 TB of data: 1 machine, 4 I/O channels, each channel at 100 MB/s, takes about 45 minutes.
Limitation: the job is bounded by a single machine's I/O bandwidth.
11. Processing Data in a DFS: Hadoop Approach
Processing 1 TB of data: 10 machines, 4 I/O channels each, each channel at 100 MB/s, takes about 4.3 minutes.
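The arithmetic behind these two figures can be sanity-checked with a short sketch (illustrative only: it assumes 1 TB = 1,000,000 MB, perfectly parallel channels, and no coordination overhead, which is why it lands slightly under the deck's 45 and 4.3 minute figures):

```python
def read_time_minutes(data_mb, machines, channels_per_machine=4, channel_mbps=100):
    """Time to scan data spread evenly across machines, all channels reading in parallel."""
    throughput = machines * channels_per_machine * channel_mbps  # aggregate MB/s
    return data_mb / throughput / 60

one_tb = 1_000_000  # MB
print(round(read_time_minutes(one_tb, machines=1), 1))   # ~41.7 minutes on 1 machine
print(round(read_time_minutes(one_tb, machines=10), 1))  # ~4.2 minutes on 10 machines
```

Adding machines multiplies aggregate I/O bandwidth, which is exactly the point of distributing both storage and compute.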
12. What is Apache Hadoop
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
To solve the Big Data problem, a new framework evolved: Hadoop. Hadoop provides:
Commodity hardware
Big clusters
MapReduce
Failover
Data distribution
Moving code to data
Heterogeneous hardware
Scalability
Hadoop is based on work done by Google in the early 2000s: the Google File System (GFS) paper was published in 2003 and the MapReduce paper in 2004.
It is an architecture that can scale with the huge volume, variety and speed requirements of Big Data by distributing the work across dozens, hundreds, or even thousands of commodity servers that process the data in parallel.
13. Moving data into code (contd.)
A user with a terabyte of data wants to analyze it.
In the traditional data processing architecture, nodes are broken up into separate processing and storage nodes connected by a high-capacity link.
Many data-intensive applications are CPU-demanding, causing bottlenecks in the network.
There is latency in transferring the data.
14. Moving code to data
A user wants to analyze the data; the client writes MapReduce jobs, which are shipped to the nodes.
Hadoop takes a radically new approach to the problem of distributed computing:
Distribute the data to multiple nodes.
Distribute the program for computation to those same nodes.
Individual nodes then work on the data residing on them.
No data transfer over the network is required for the initial processing.
Additional nodes can be added for scalability.
15. Distribution Vendors
Cloudera Distribution for Hadoop (CDH)
MapR Distribution
Hortonworks Data Platform
Apache Bigtop Distribution
16. Hadoop 1.0 Core Components
Hadoop has two main components:
1. HDFS – Hadoop Distributed File System (storage): responsible for storing the data in chunks, by splitting files into blocks of 64 MB each.
2. MapReduce (processing): processes the data in a massively parallel manner.
Daemons:
HDFS (storage): NameNode (master), DataNode (slave), Secondary NameNode
MapReduce (processing): JobTracker (master), TaskTracker (slave)
17. Hadoop 2.0 Core Components
Hadoop 2.0 has two main components:
1. HDFS – Hadoop Distributed File System (storage): responsible for storing the data in chunks, by splitting files into blocks of 128 MB each.
2. YARN/MRv2 (processing): processes the data in a massively parallel manner.
Daemons:
HDFS (storage): NameNode (master), DataNode (slave), Secondary NameNode
YARN/MRv2 (processing): ResourceManager (master), NodeManager (slave)
18. HDFS – Hadoop Distributed File System
HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and a number of DataNodes (slave nodes).
In a conventional file system, each file is divided into small blocks (on the order of 512-byte sectors); HDFS, by contrast, uses much larger blocks.
19. File Blocks
By default, the block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Why is the block size so large?
The main reason for having large HDFS blocks is the cost of seek time: with a large block, the time spent transferring data dominates the time spent seeking to it.
The large block size also accounts for proper usage of storage space while respecting the limit on the NameNode's memory, which must track every block.
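The seek-time argument can be made concrete with a small sketch (the 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not figures from the deck):

```python
import math

def seek_overhead(block_mb, seek_ms=10.0, transfer_mbps=100.0):
    """Fraction of a block's read time spent seeking rather than transferring."""
    seek_s = seek_ms / 1000.0
    transfer_s = block_mb / transfer_mbps
    return seek_s / (seek_s + transfer_s)

def blocks_to_track(file_gb, block_mb):
    """Number of block objects the NameNode must hold in RAM for one file."""
    return math.ceil(file_gb * 1024 / block_mb)

print(f"{seek_overhead(4):.1%}")    # seek overhead with small 4 MB blocks
print(f"{seek_overhead(128):.2%}")  # seek overhead with HDFS-sized 128 MB blocks
print(blocks_to_track(1, 4), "vs", blocks_to_track(1, 128))  # block objects per GB
```

Larger blocks both cut the seek overhead by two orders of magnitude and shrink the per-file metadata the NameNode keeps in memory.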
20. Hadoop 1.0 Master-Slave Architecture – Simple cluster setup with Hadoop daemons
Master: NameNode (single box), JobTracker (single box), Secondary NameNode (single box); optionally the master daemons are split across two boxes.
Slaves (in separate boxes, many): each slave node (Slave1, Slave2, Slave3, ...) runs a TaskTracker and a DataNode.
website : www.npntraining.com
21. Hadoop 2.0 Master-Slave Architecture – Simple cluster setup with Hadoop daemons
Master: Active NameNode (single box), Standby NameNode (single box), ResourceManager (single box), Secondary NameNode (single box); optionally the master daemons are split across two boxes.
Slaves (in separate boxes, many): each slave node (Slave1, Slave2, Slave3, ...) runs a NodeManager and a DataNode.
23. Hadoop Cluster: A Typical Use Case
24. File Blocks in HDFS
A client wants to save 400 MB of data into the cluster/HDFS. It communicates with the master node (NameNode), which decides which nodes to write the data to. The first copy is always stored on nodes in close proximity to the client.
In HDFS the data is broken into blocks of the configured block size; with 128 MB blocks, the 400 MB file becomes blocks of 128 MB, 128 MB, 128 MB and 16 MB.
Hadoop creates 3 replicas of each block by default (this is configurable), and thereby achieves fault tolerance.
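The 400 MB example above can be sketched as a simple block-splitting function (a toy model, not HDFS client code):

```python
def split_into_blocks(file_mb, block_mb=128):
    """Split a file into HDFS-style blocks: full blocks plus one remainder block."""
    blocks = [block_mb] * (file_mb // block_mb)
    if file_mb % block_mb:
        blocks.append(file_mb % block_mb)
    return blocks

print(split_into_blocks(400))           # [128, 128, 128, 16]
print(len(split_into_blocks(400)) * 3)  # 12 block replicas with replication factor 3
```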
27. NameNode
The NameNode does not store the files themselves, only the files' metadata.
The NameNode keeps track of all file-system-related information (metadata), such as:
Block locations
Information about file permissions and ownership
Last access time for each file
User permissions, i.e. which users have access to a file
The NameNode oversees the health of the DataNodes and coordinates access to the data stored in them.
The entire metadata is held in main memory.
28. NameNode Metadata
The entire metadata is held in main memory; there is no demand paging of file-system metadata.
The NameNode maintains two files:
1. fsimage
2. edit log
The fsimage is a file that represents a point-in-time snapshot of the file system's metadata. While the fsimage format is very efficient to read, it is unsuitable for making small incremental updates, like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode records each modifying operation in the edit log for durability.
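The fsimage/edit-log design is the classic snapshot-plus-journal pattern. A toy sketch of the idea (invented names for illustration, not NameNode code):

```python
class TinyNamespace:
    """Toy snapshot-plus-journal store mimicking fsimage + edit log."""
    def __init__(self):
        self.fsimage = {}   # point-in-time snapshot of the metadata
        self.edit_log = []  # incremental modifications since the snapshot

    def record(self, op, path, value=None):
        # Cheap append; in real HDFS this is what makes each change durable
        self.edit_log.append((op, path, value))

    def checkpoint(self):
        # Replay the edit log onto the snapshot (the Secondary NameNode's job)
        for op, path, value in self.edit_log:
            if op == "create":
                self.fsimage[path] = value
            elif op == "delete":
                self.fsimage.pop(path, None)
        self.edit_log = []

ns = TinyNamespace()
ns.record("create", "/logs/a.txt", {"blocks": 2})
ns.record("delete", "/logs/a.txt")
ns.record("create", "/logs/b.txt", {"blocks": 1})
ns.checkpoint()
print(ns.fsimage)  # {'/logs/b.txt': {'blocks': 1}}
```

Appending to the journal is O(1) per operation, while rewriting the full snapshot on every change would not be; checkpointing amortizes that cost.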
29. Secondary NameNode (CheckPoint Node)
The Secondary NameNode is not a hot standby for the NameNode.
It connects to the NameNode every hour and pulls its metadata (the fsimage and edit logs).
It performs housekeeping and keeps a backup of the NameNode metadata.
The saved metadata can be used to rebuild a failed NameNode.
30. Hadoop Components (contd.)
The master node runs the NameNode (and the ResourceManager); each slave node runs a DataNode (and a NodeManager). A Secondary NameNode assists the master.
NameNode:
Maintains and manages the blocks present on the slave nodes.
Periodically receives a heartbeat and a block report from each of the DataNodes in the cluster. A heartbeat, sent once every 3 seconds, implies that the DataNode is functioning properly.
The HDFS architecture is built in such a way that user data is never stored on the NameNode; it stores only metadata.
It records the metadata of all the files stored in the cluster, e.g. their location, size, permissions, hierarchy, etc.
DataNodes:
Perform the low-level read and write requests from the file system's clients.
Are responsible for creating blocks, deleting blocks and replicating them based on decisions taken by the NameNode.
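The heartbeat bookkeeping can be sketched as follows (a toy model; the 630-second timeout mirrors HDFS's default dead-node interval of 2 × recheck-interval + 10 × heartbeat-interval, which is an assumption about default configuration, not something stated in the deck):

```python
def dead_datanodes(last_heartbeat, now, timeout_s=630):
    """Return DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat: mapping of DataNode name -> timestamp of its last heartbeat.
    """
    return sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout_s)

beats = {"dn1": 1000.0, "dn2": 400.0, "dn3": 995.0}
print(dead_datanodes(beats, now=1040.0))  # ['dn2'] - no heartbeat for 640 s
```

Once a node is declared dead, the NameNode schedules re-replication of its blocks on the surviving DataNodes.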
31. Anatomy of a File Write – High Level
A user wants to write data to Hadoop, e.g.: hdfs dfs -put 2016-apache-logs.txt /
The client "cuts" the input file into chunks of the block size.
The client then contacts the NameNode to request the write operation, sending the number of blocks and the replication factor.
The NameNode responds with a pipeline of DataNodes for each block's replicas.
The client reaches out to the first DataNode in each pipeline and performs the write.
No actual data transfer takes place through the NameNode.
The client writes the blocks in parallel: all the blocks are written at a time, not one by one.
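The NameNode's side of these steps can be sketched as a toy block-placement planner (hypothetical names; real HDFS also applies rack awareness and client proximity when choosing replicas):

```python
import itertools

def plan_write(file_mb, datanodes, block_mb=128, replication=3):
    """NameNode-side sketch: assign a pipeline of DataNodes to each block."""
    n_blocks = -(-file_mb // block_mb)  # ceiling division
    cycle = itertools.cycle(datanodes)
    plan = {}
    for i in range(n_blocks):
        # Pick `replication` nodes for this block's write pipeline
        plan[f"blk_{i:03d}"] = [next(cycle) for _ in range(replication)]
    return plan

nodes = [f"DN{i}" for i in range(1, 10)]
for blk, pipeline in plan_write(300, nodes).items():
    print(blk, "->", pipeline)
# blk_000 -> ['DN1', 'DN2', 'DN3'], then blk_001, blk_002 on the remaining nodes.
# The client streams each block to the first node in its pipeline, which
# forwards the data to the second, and so on down the chain.
```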
32. Anatomy of a File Write – Full Example
The Hadoop client writes 2016-apache-logs.txt with a replication factor of 3. With a 128 MB block size, the file is split into three blocks of 128 MB, 128 MB and 44 MB.
The client issues a -put request to the NameNode, which responds with a write pipeline for each block:
blk_000 to DN1, DN5, DN6
blk_001 to DN4, DN8, DN9
blk_002 to DN7, DN3, DN3
The cluster consists of nine DataNodes (DataNode1 to DataNode9) grouped into three racks of three.
website : www.npntraining.com
33. Anatomy of a File Read – Full Example
The Hadoop client reads 2016-apache-logs.txt (three blocks of 128 MB, 128 MB and 44 MB, replication factor 3).
The client issues a -get request to the NameNode, which responds with a location for each block:
blk_000 from DN1
blk_001 from DN4
blk_002 from DN7
The cluster consists of nine DataNodes (DataNode1 to DataNode9) grouped into three racks of three.
34. In HDFS, blocks of a file are written in parallel; however, replication of the blocks is done sequentially.
a) True
b) False
Hadoop is a framework that allows for the distributed processing of:
a) Small data sets
b) Large data sets
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a) It can read up to the last block that was successfully written.
b) It can read up to the last bit successfully written.
c) It will throw an exception.
d) It cannot see that file until copying has finished.
36. What could be the limitations of Hadoop 1 / Gen 1?
Can a Hadoop 1.x cluster have multiple HDFS namespaces?
Which of the following are significant disadvantages of Hadoop 1.0?
a) Single point of failure on the NameNode
b) Too much burden on the JobTracker
Can you use anything other than MapReduce for processing in Hadoop 1.x?
37. Hadoop 1.x - Challenges
NameNode – No Horizontal Scalability
A single NameNode and a single namespace, limited by NameNode RAM.
NameNode – No High Availability (HA)
The NameNode is a Single Point of Failure; manual recovery using the Secondary NameNode is needed in case of failure.
JobTracker – Overburdened
Spends a significant portion of time and effort managing the life cycle of applications.
MRv1 – Only Map & Reduce tasks
Humongous data stored in HDFS remains unutilized and cannot be used for other workloads
such as graph processing.
38. A single NameNode runs and manages a single namespace, maintaining metadata in RAM.
100 slaves / 1000 slaves --> managed by a single NameNode
Max tested --> ~4000 servers --> single NameNode --> single namespace
Let's assume the /VOICE directory has too many files and folders; we can configure a separate
NameNode for this directory:
/VOICE/... NameNode01
/SMS/... NameNode02
/Data/... NameNode03
So based on the directory structure we can configure NameNodes. In Hadoop 2, 10,000 servers
can be configured, because each NameNode separately manages its own directory structure;
that is why we call it Federation.
Limitation 1 – No Horizontal Scalability
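A federated setup along these lines might be sketched in hdfs-site.xml as follows; the nameservice names and hostnames here are placeholders for illustration, not values from the slides.

```xml
<!-- hdfs-site.xml: two independent NameNodes, each owning its own namespace -->
<property>
  <name>dfs.nameservices</name>
  <value>ns-voice,ns-sms</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-voice</name>
  <value>namenode01.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-sms</name>
  <value>namenode02.example.com:8020</value>
</property>
```

Since the NameNodes are independent, DataNodes register with all of them, but no NameNode needs to coordinate with another.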
40. How does HDFS Federation help HDFS scale horizontally?
It reduces the load on any single NameNode by using
multiple, independent NameNodes to manage individual
parts of the file system namespace.
You have configured two NameNodes to manage
/voice and /sms respectively. What will happen if you try to
put a file in the /lte directory?
The put will fail. No namespace will manage the file,
and you will get an IOException with a "no such file or
directory" error.
41. If you lose the NameNode you lose the cluster details. Manual intervention is needed to
start a new NameNode and copy the backup from the Secondary NameNode.
Problem
10:00 am --> backup to SNN
10:45 am --> NameNode breaks down --> you can recover data only up to 10:00 am from the SNN (problem in Gen 1)
Solution
==========
High Availability: Active and Standby NameNodes manage the same data at any given point of time.
--> In case the Active NameNode fails, the Standby NameNode becomes Active and serves requests.
Limitation 2 – No High Availability
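A minimal sketch of an HA configuration in hdfs-site.xml, assuming shared edits over NFS; the nameservice name, hostnames and the mount path are placeholders for illustration.

```xml
<!-- hdfs-site.xml: one logical nameservice backed by an Active and a Standby NameNode -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/hdfs-ha</value>
</property>
```

Because both NameNodes see the same shared edit log, the Standby can take over with an up-to-date view of the namespace.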
42. Hadoop 2.x Architecture - HA
https://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
43. HDFS HA was developed to overcome which of the following
disadvantages in Hadoop 1.0?
a) Single Point of Failure of the NameNode
b) Only one version can be run in classic MapReduce
c) Too much burden on the JobTracker
45. YARN – Yet Another Resource Negotiator
YARN is the core component of Hadoop 2 and was added to improve performance in Hadoop.
Hadoop 1.x: MapReduce handles both Cluster Resource Management & Data Processing, on top of HDFS (File Storage).
Hadoop 2.x: YARN handles Cluster Resource Management, with MapReduce (Data Processing) and others (Data Processing) running on top of it, over HDFS (File Storage).
It is the next-generation computing platform, which offers various advantages when compared to
classic MapReduce.
It is a layer that separates the resource management layer and the processing components layer.
MapReduce 2 moves resource management (the infrastructure to monitor nodes, allocate
resources and schedule jobs) into YARN.
48. YARN Components
YARN consists of 3 components:
1. ResourceManager
i. Scheduler
ii. ApplicationsManager
2. NodeManager
3. ApplicationMaster
49. YARN Architecture
[Slide diagram: a ResourceManager coordinating NodeManagers running on DN1–DN3; the Client submits a job, and an ApplicationMaster and Containers are launched on the nodes.]
1. The Client submits the job to the ResourceManager.
2. The RM contacts any of the NodeManagers.
3. That NodeManager creates a daemon named ApplicationMaster on the same node; there is one per job.
4. The AM communicates with the ResourceManager to find where the data is.
50. A single JobTracker managed thousands of jobs.
Problem
The JobTracker was overburdened.
Solution
==========
YARN, with multiple daemons: ResourceManager, NodeManager, ApplicationMaster (one per
application).
Container --> variable resources allocated per task (on the slave machine) --> CPU, memory, disk, network
1. ResourceManager --> entire cluster level
2. NodeManager --> per node/slave/machine/server
3. ApplicationMaster --> life cycle of a job (one ApplicationMaster per job)
Limitation 3 – JobTracker Overburdened
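The division of labour above can be mimicked with a toy model; the class and method names are purely illustrative and not the real YARN API.

```python
class NodeManager:
    """Per-node daemon: launches containers on its machine."""
    def __init__(self, node):
        self.node = node
    def launch_container(self, task):
        return f"{task} ran in a container on {self.node}"

class ApplicationMaster:
    """One per job: negotiates resources and drives the tasks."""
    def __init__(self, rm):
        self.rm = rm
    def run(self, tasks):
        results = []
        for task in tasks:
            nm = self.rm.allocate()          # ask the global RM for a node
            results.append(nm.launch_container(task))
        return results

class ResourceManager:
    """Cluster-level daemon: tracks NodeManagers and hands out nodes."""
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_nm = 0
    def allocate(self):
        nm = self.node_managers[self.next_nm % len(self.node_managers)]
        self.next_nm += 1
        return nm
    def submit(self, tasks):
        # A per-job ApplicationMaster is created to manage the job's life cycle
        am = ApplicationMaster(self)
        return am.run(tasks)

rm = ResourceManager([NodeManager("DN1"), NodeManager("DN2"), NodeManager("DN3")])
for line in rm.submit(["map-0", "map-1", "reduce-0"]):
    print(line)
```

The point of the split is visible even in the toy: the ResourceManager only hands out resources, while each job's life cycle lives in its own ApplicationMaster.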
51. YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0
Hadoop 1.x: MapReduce (Cluster Resource Management & Data Processing) over HDFS (File Storage)
Hadoop 2.x: YARN (Cluster Resource Management), with MapReduce (Data Processing) and others (Data Processing) on top, over HDFS (File Storage)
Introduction to the new YARN layer in Hadoop 2.0
FB, G+, LinkedIn and Twitter generate huge volumes of data every day.
Facebook recently unveiled some statistics on the amount of data its system processes and stores. According to Facebook, its data system processes 2.5 billion pieces of content each day, amounting to 500+ terabytes of data daily. Facebook generates 2.7 billion Like actions per day, and 300 million new photos are uploaded daily.
Presently the data is stored in RDBMS, so why the problem of Big Data?
What is the limitation of RDBMS / why do I need RDBMS?
We go online and get a response immediately; that is the concept of a DBMS or OLTP application.
IBM’s Definition – Big Data Characteristics
Velocity : CDR (Call Detail Records), used to understand customer churn, i.e. a customer leaving the service provider.
The rate at which data is generated.
Variety : Images, MRI scans
A shared-nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.
The advantage of a shared-nothing architecture is that it can scale easily, simply by adding another node.
Latency in transferring data
Processing coupled with data : in Hadoop we send the jobs towards the data.
There are programs which manage the Hadoop components; these programs are known as daemons.
Daemons take care of the components in Hadoop.
HDFS is a block-structured file system designed to store very large files, where each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several commodity hardware nodes.
See : http://wiki.apache.org/hadoop/PoweredBy
Fault Tolerance : Hadoop will not fail even if one or more slaves fail.
Rack : a group of servers placed in a single place.
Hadoop writes one replica in one rack and another replica in a different rack; the administrator can even change this.
Hadoop also provides rack-level fault tolerance.
This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage
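The fsimage-plus-edit-log recovery described here can be sketched as a toy replay; `recover_namespace` and the sample paths are made up for illustration and bear no relation to the real NameNode implementation.

```python
def recover_namespace(fsimage, edit_log):
    """Load the fsimage checkpoint, then replay the edit log to reach
    the latest namespace state (a toy model of NameNode recovery)."""
    namespace = dict(fsimage)            # start from the checkpointed state
    for op, path in edit_log:            # replay every logged transaction
        if op == "create":
            namespace[path] = []
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/voice/a.txt": [], "/sms/b.txt": []}
edits = [("create", "/voice/c.txt"), ("delete", "/sms/b.txt")]
print(sorted(recover_namespace(fsimage, edits)))  # ['/voice/a.txt', '/voice/c.txt']
```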
The network between the client and the cluster will be slower compared to the network within the cluster.
In HDFS, blocks of a file are written in parallel; however, replication of the blocks is done sequentially.
Answer : True. A file is divided into blocks; these blocks are written in parallel, but the block replication happens in sequence.
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
Answer : (a) Can read up to the last block that was successfully written.
There are lots of self-standing software packages built on top of the Hadoop framework, each addressing a lot of problems.
Software built on top of the Hadoop framework is called the Hadoop Eco-System.
Flume is used to stream data from non-HDFS sources to HDFS,
e.g. Twitter.
Each NameNode need not coordinate with the others, which is why it is called Federated.
HDFS HA was developed to overcome the following disadvantage in Hadoop 1.0.
Answer : (a)
Let’s say a client submits a program; the program communicates with the JobTracker. In Hadoop terminology the program is considered a Job, and in the job we will have mentioned which data to process. The JobTracker communicates with the NameNode to get the DataNodes which have the data.
In a nutshell, the responsibilities of the JobTracker:
The JT accepts the job
Figures out where the data is
Invokes all the TTs and assigns them the job
Monitors all the tasks (TT crashes); it monitors the life cycle
The JT becomes overburdened because in production thousands of jobs are running; after a certain point the JT becomes slow.
In Hadoop 1.x, MapReduce is the only programming model to process the data stored in HDFS.
In MapReduce, work is divided into 2 phases:
Map phase
Reduce phase
Each Map task takes 1 GB of resources for processing.
In Hadoop 2.x, the processing is taken care of by YARN; the minimum memory allocation for a Map task is 1 GB.
http://sivansasidharan.me/blog/Hadoop_YARN/
Whenever a job is submitted, it communicates with the ResourceManager.
The ResourceManager will then contact any NodeManager (not necessarily the NodeManager which has the data) and say there is a job.
The NodeManager launches a daemon called ApplicationMaster on the same node.
There is one ApplicationMaster per job.
It is the responsibility of the ApplicationMaster to run the job.
The ApplicationMaster can contact the NodeManagers as well as the ResourceManager; by contacting the ResourceManager, the ApplicationMaster comes to know where the data is, and it contacts that node and launches something called a Container.
Containers are nothing but simple Java processes (JVMs), and the actual program gets executed inside the container.
The advantage of such an architecture is that if DN2 requires more resources for processing, the ApplicationMaster can contact the ResourceManager to allocate more resources; so the RM is a global entity which manages resources.
On the other hand, the entire life cycle of the application (creating, monitoring, etc.) is managed by the ApplicationMaster.
The NodeManager keeps track of the resources present on its DataNode and updates the ResourceManager.
In this architecture the resources are not managed by the DataNode; if any machine has more resources, the RM can communicate with its NodeManager, and the NM will create a Container, the data will be copied, and it will execute.