3. Introduction to Big Data and Hadoop
Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
Systems and enterprises generate huge amounts of data, from terabytes to
even petabytes or zettabytes of information.
It is very difficult to manage such huge volumes of data…
4. Big Data and its challenges
The challenges of processing Big Data are the 3 V’s: volume, velocity, and variety.
VOLUME
Modern systems have much more data:
- Terabytes+ a day
- Petabytes+ in total
We need a new approach.
VELOCITY
To process such a huge volume of data within a specified time period, we need a new approach.
VARIETY
We have to process different sorts of data, such as structured, semi-structured, and unstructured data. We need a new approach.
5. What is Hadoop?
Apache Hadoop is a framework that allows the
distributed processing of large data sets across
clusters of commodity computers using a
simple programming model.
It is an open-source data management
technology with scale-out storage and
distributed processing.
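The slide does not name the "simple programming model"; assuming it means MapReduce (which the later YARN slides refer to), the idea can be sketched in plain Python. This is only a single-process illustration, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and no cluster or Hadoop API is involved.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in one process (no cluster involved)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data and Hadoop", "Hadoop and HDFS"]))
```

On a real cluster the map and reduce calls would run on different machines, each working on its own slice of the data; the sketch only shows the shape of the model.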
7. Background: Hadoop + HDFS
[Diagram: HDFS (distributed file system) with a NameNode and two DataNodes; each DataNode sits on top of its local file system]
Every node contributes part of its local file system to HDFS.
Tasks can only depend on the local file system, because the JVM class path does not understand the HDFS protocol.
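The split between the NameNode (metadata) and the DataNodes (block storage on each node's local file system) can be illustrated with a toy model. This is a sketch under stated assumptions, not the HDFS API: the classes and the helpers `put_file` and `get_file` are hypothetical, the 4-byte block size and replication factor of 2 are chosen only to keep the example small, and round-robin placement stands in for the real placement policy.

```python
class DataNode:
    """Toy DataNode: a dict stands in for the node's local file system."""

    def __init__(self, name):
        self.name = name
        self.local_blocks = {}

    def write_block(self, block_id, data):
        self.local_blocks[block_id] = data

    def read_block(self, block_id):
        return self.local_blocks[block_id]


class NameNode:
    """Toy NameNode: keeps metadata only (which blocks form a file, and where)."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication  # hypothetical replica count per block
        self.block_map = {}             # path -> [(block_id, [DataNode, ...]), ...]

    def allocate(self, path, num_blocks):
        """Pick target DataNodes for each block (round-robin) and record them."""
        placement = []
        for index in range(num_blocks):
            targets = [self.datanodes[(index + r) % len(self.datanodes)]
                       for r in range(self.replication)]
            placement.append((f"{path}#blk{index}", targets))
        self.block_map[path] = placement
        return placement

    def locate(self, path):
        return self.block_map[path]


def put_file(namenode, path, data, block_size=4):
    """Client side: ask the NameNode where blocks go, then write to the DataNodes."""
    num_blocks = (len(data) + block_size - 1) // block_size
    for index, (block_id, targets) in enumerate(namenode.allocate(path, num_blocks)):
        chunk = data[index * block_size:(index + 1) * block_size]
        for node in targets:
            node.write_block(block_id, chunk)


def get_file(namenode, path):
    """Client side: read each block from its first replica and join the pieces."""
    return b"".join(nodes[0].read_block(block_id)
                    for block_id, nodes in namenode.locate(path))
```

For example, writing `b"hello hadoop"` across three toy DataNodes produces three blocks with two replicas each, and `get_file` reassembles the original bytes from the NameNode's block map.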
9. YARN
YARN stands for Yet Another Resource Negotiator or, as a recursive acronym, YARN Application Resource Negotiator.
- Remedies the scalability shortcomings of “classic” MapReduce, which has scalability issues at around 4,000 nodes and higher.
- Is more of a general-purpose framework, of which classic MapReduce is one application.
10. YARN Flow
YARN = YET ANOTHER RESOURCE NEGOTIATOR
Resource Manager
- Cluster-level resource manager
- Long-lived, runs on high-quality hardware
Node Manager
- One per DataNode
- Monitors resources on the DataNode
Application Master
- One per application
- Short-lived
- Manages tasks and scheduling
11. YARN – How It Works
Protocols:
1.) Client – RM: Submit the App Master
2.) RM – NM: Start the App Master
3.) AM – RM: Request + release containers
4.) AM – NM: Start tasks in containers
[Diagram: a YARN client, the YARN Resource Manager, and three Node Managers; one Node Manager hosts the AM and the Node Managers run the tasks; arrows numbered 1.) to 4.) mark the protocol steps]
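The four protocol steps can be traced with a small simulation. This is a rough sketch, not the YARN API: all class and method names are hypothetical, round-robin placement stands in for a real scheduler, and the sketch lets the Application Master start tasks on the Node Managers itself, which is how Apache YARN launches granted containers.

```python
class NodeManager:
    """Toy Node Manager: records whatever it is asked to start."""

    def __init__(self, name):
        self.name = name
        self.running = []

    def start(self, what):
        self.running.append(what)


class ResourceManager:
    """Toy Resource Manager: accepts applications and hands out containers."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_node = 0
        self.log = []

    def _pick_node(self):
        # Round-robin placement; a stand-in for a real scheduler.
        node = self.node_managers[self.next_node % len(self.node_managers)]
        self.next_node += 1
        return node

    def submit(self, app_name, num_tasks):
        # Step 1 (Client - RM): the client submits the application.
        self.log.append(f"1 client->RM submit {app_name}")
        # Step 2 (RM - NM): the RM has a Node Manager start the App Master.
        node = self._pick_node()
        node.start(f"{app_name}/AM")
        self.log.append(f"2 RM->NM start AM on {node.name}")
        return ApplicationMaster(app_name, num_tasks, self)

    def request_containers(self, count):
        # Step 3 (AM - RM): the AM requests containers.
        self.log.append(f"3 AM->RM request {count} containers")
        return [self._pick_node() for _ in range(count)]

    def release_containers(self, count):
        # Step 3 (AM - RM): the AM releases containers it no longer needs.
        self.log.append(f"3 AM->RM release {count} containers")


class ApplicationMaster:
    """Toy App Master: obtains containers and starts one task in each."""

    def __init__(self, app_name, num_tasks, rm):
        self.app_name = app_name
        self.num_tasks = num_tasks
        self.rm = rm

    def run(self):
        nodes = self.rm.request_containers(self.num_tasks)
        for i, node in enumerate(nodes):
            # Step 4: a task is started in each granted container.
            node.start(f"{self.app_name}/task{i}")
            self.rm.log.append(f"4 start task{i} on {node.name}")
        self.rm.release_containers(len(nodes))
```

Calling `ResourceManager.submit(...)` followed by `run()` on the returned Application Master fills `rm.log` with the numbered steps in the order they happen.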
12. YARN Architectural Overview
Scalability – clusters of 6,000–10,000 machines
- Each machine with 16 cores, 48 GB/96 GB RAM, and 24 TB/36 TB hard disks
- 100,000+ concurrent tasks
- 10,000 concurrent jobs
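To get a feel for these figures, the per-machine numbers can be multiplied out into cluster-wide totals. The sketch below is plain arithmetic on the values from the slide; pairing the smaller RAM and disk sizes with the 6,000-machine cluster and the larger ones with the 10,000-machine cluster is an assumption made only to obtain a low and a high bound, and the helper name `cluster_totals` is hypothetical.

```python
def cluster_totals(machines, cores_per_machine, ram_gb, disk_tb):
    """Aggregate per-machine figures into cluster-wide totals."""
    return {
        "cores": machines * cores_per_machine,
        "ram_tb": machines * ram_gb / 1000,    # decimal units: 1 TB = 1000 GB
        "disk_pb": machines * disk_tb / 1000,  # decimal units: 1 PB = 1000 TB
    }


# Low bound: 6,000 machines with 48 GB RAM and 24 TB of disk each (assumed pairing).
small = cluster_totals(6000, 16, 48, 24)
# High bound: 10,000 machines with 96 GB RAM and 36 TB of disk each (assumed pairing).
large = cluster_totals(10000, 16, 96, 36)

if __name__ == "__main__":
    print(small)
    print(large)
```

Under these assumptions the cluster ranges from 96,000 to 160,000 cores, 288 to 960 TB of RAM, and 144 to 360 PB of raw disk, so 100,000 concurrent tasks is roughly one task per core.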
13. YARN Architectural Overview (Contd.)
YARN splits up the two major functions of the JobTracker:
- Global Resource Manager - cluster resource management.
- Application Master - job scheduling and monitoring (one per application). The Application Master negotiates resource containers from the Scheduler, tracks their status, and monitors progress. The Application Master itself runs as a normal container.
The TaskTracker is replaced by:
- NodeManager (NM) - a new per-node slave responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting to the Resource Manager.
YARN maintains compatibility with existing MapReduce applications and users.
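The division of labour described on this slide (NodeManagers report resource usage, the Scheduler grants containers, the Application Master tracks status and progress) can be sketched as a toy model. None of this is the YARN API: the class and method names are hypothetical, memory is the only resource considered, and the first-fit allocation is a simplification of what a real scheduler does.

```python
class Scheduler:
    """Toy scheduler: grants containers on nodes with enough reported free memory."""

    def __init__(self):
        self.free = {}  # node name -> free memory in MB, from the latest report

    def heartbeat(self, node, free_memory_mb):
        """A NodeManager reports its currently free memory."""
        self.free[node] = free_memory_mb

    def allocate(self, memory_mb, count):
        """Grant up to `count` containers; fewer if the cluster is full."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] >= memory_mb and len(granted) < count:
                self.free[node] -= memory_mb
                granted.append((node, memory_mb))
        return granted


class AppMaster:
    """Toy Application Master: negotiates containers and tracks their status."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.status = {}  # container id -> "RUNNING" or "DONE"

    def negotiate(self, memory_mb, count):
        """Request containers and return the ids of all containers known so far."""
        for node, _ in self.scheduler.allocate(memory_mb, count):
            container_id = f"container_{len(self.status)}@{node}"
            self.status[container_id] = "RUNNING"
        return list(self.status)

    def complete(self, container_id):
        self.status[container_id] = "DONE"

    def progress(self):
        """Fraction of known containers that have finished."""
        if not self.status:
            return 0.0
        done = sum(1 for state in self.status.values() if state == "DONE")
        return done / len(self.status)
```

If two nodes report 2,048 MB and 1,024 MB free and the Application Master asks for four 1,024 MB containers, the toy scheduler can grant only three, which mirrors the point that containers are negotiated rather than guaranteed.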