The document discusses MapReduce and the Hadoop framework. It provides an overview of how MapReduce works, examples of problems it can solve, and how Hadoop implements MapReduce at scale across large clusters in a fault-tolerant manner using the HDFS distributed file system and YARN resource management.
2. MapReduce is a programming model that Google has used successfully to process its "big-data" sets (~20 petabytes per day)
Users specify the computation in terms of a map and a reduce function.
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
also handles machine failures, efficient communication, and performance issues.
3. Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green, …}
Problem: count the occurrences of the different words in the collection.
Let's design a solution for this problem:
We will start from scratch
We will add and relax constraints
We will do incremental design, improving the solution for performance and scalability
(A word-count sketch in Hadoop's Java API follows below.)
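To make the map/reduce division concrete, here is a minimal word-count sketch using the standard Hadoop `org.apache.hadoop.mapreduce` API. Class names follow the stock Hadoop WordCount example; treat this as an illustration, not part of the original slides:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every word in an input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        ctx.write(word, ONE);          // e.g., ("web", 1)
      }
    }
  }
}

// Reduce: sum the counts for each word. Runs only after all maps finish.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    ctx.write(word, new IntWritable(sum));   // e.g., ("web", 2)
  }
}
```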
4. Very large-scale data: peta-, exabytes
Write-once, read-many data: allows for parallelism without mutexes
Map and Reduce are the main operations: simple code
There are other supporting operations such as combine and partition (out of the scope of this talk)
All map tasks must complete before the reduce operation starts (see the driver sketch below)
Map and reduce operations are typically performed by the same physical processor
Number of map tasks and reduce tasks is configurable
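A minimal driver sketch for the word-count job above, showing where these knobs live. `setNumReduceTasks` and the combiner hook are standard `org.apache.hadoop.mapreduce.Job` API; the input and output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // optional "combine" step
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The reduce-task count is set directly; the map-task count follows
    // from the number of input splits (roughly one per HDFS block).
    job.setNumReduceTasks(4);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Reducers consume map output only after the map phase completes
    // for their partitions (the shuffle barrier).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```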
5. [Figure: spectrum of parallelism, from small data sizes to large — pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level)]
6. At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and Yahoo! reimplemented GFS's design, working from Google's published papers, and called the result the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
It is open source and distributed by Apache.
7. Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data (illustrated in the sketch below)
Can be built out of commodity hardware
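To make "streaming access" concrete: a client reads an HDFS file through the same java.io stream abstractions as a local file. A minimal sketch using the public org.apache.hadoop.fs API; the cluster URI and file path are made-up placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS would normally come from core-site.xml; this URI is a placeholder.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    try (FileSystem fs = FileSystem.get(conf);
         FSDataInputStream in = fs.open(new Path("/data/words.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // data streams block by block from datanodes
      }
    }
  }
}
```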
8. Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model: any data will fit
9. The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
10. Automatic parallelization & distribution
Hidden from the end-user
Fault tolerance and automatic recovery
Nodes/tasks will fail and will recover automatically
Clean and simple programming abstraction
Users only provide two functions “map” and
“reduce”
11. Google: invented the MapReduce computing paradigm
Yahoo: developed Hadoop, the open-source implementation of MapReduce
IBM, Microsoft, Oracle
Facebook, Amazon, AOL, Netflix
Many others, plus universities and research labs
12. Master node (single node)
Many slave nodes
Both layers run across the cluster:
• Distributed file system (HDFS)
• Execution engine (MapReduce)
13. Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks (64 MB each)
- Each block is replicated N times (default N = 3)
[Figure: file F divided into five 64 MB blocks, numbered 1–5; see the block-location sketch below]
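A quick way to observe blocks and replication from client code is the public FileSystem API, which reports each block's offset, length, and replica hosts. A minimal sketch; the file path is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/fileF"));
    // One BlockLocation per block; getHosts() lists the datanodes holding
    // that block's replicas (three hosts under the default replication).
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d len=%d replicas=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
  }
}
```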
14. Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
Replication: each data block is replicated many times (default is 3)
Failure: failure is the norm rather than the exception
Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
The namenode constantly monitors the datanodes via heartbeats
15. Deciding what will be the key and what will be the value is the developer's responsibility (see the sketch below).
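For instance, the same input lines support different jobs purely through the choice of key/value: word count emits (word, 1), while a hypothetical inverted index emits (word, documentId). A sketch of the latter mapper; the class name and the assumed "docId word word …" line format are illustrative, not from the slides:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed input line format: "<docId> <word> <word> ..."
public class InvertedIndexMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split("\\s+");
    if (tokens.length < 2) return;
    Text docId = new Text(tokens[0]);
    for (int i = 1; i < tokens.length; i++) {
      // Key = word, value = document id; the reducer would then collect
      // the list of documents containing each word.
      ctx.write(new Text(tokens[i]), docId);
    }
  }
}
```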
16. Distributed databases vs. Hadoop
Computing model: distributed databases use the notion of transactions; a transaction is the unit of work, with ACID properties and concurrency control. Hadoop uses the notion of jobs; a job is the unit of work, with no concurrency control.
Data model: distributed databases handle structured data with a known schema, in read/write mode. In Hadoop, any data will fit in any format, (un)(semi)structured, in read-only mode.
Cost model: distributed databases run on expensive servers; Hadoop runs on cheap commodity machines.
Fault tolerance: in distributed databases failures are rare and handled by recovery mechanisms; in Hadoop failures are common over thousands of machines, handled by simple yet efficient fault tolerance.
Key characteristics: distributed databases emphasize efficiency, optimizations, and fine-tuning; Hadoop emphasizes scalability, flexibility, and fault tolerance.
19. Hadoop is being used for all kinds of tasks beyond its original design. Two pain points motivated YARN:
Tight coupling of a specific programming model with the resource-management infrastructure
Centralized handling of jobs' control flow
20. Scalability
Multi-tenancy
Serviceability
Locality Awareness
High Cluster Utilization
◦ HoD (Hadoop on Demand) does not resize the cluster between stages
◦ Users allocate more nodes than needed
Competing for resources results in longer latency to
start a job
21. Scalability
Multi-tenancy
Serviceability
Locality Awareness
High Cluster Utilization
Reliability/Availability
Secure and auditable operation
Support for Programming Model Diversity
Flexible Resource Model
◦ Hadoop: the number of map/reduce slots is fixed
◦ Easy, but lowers utilization
22. Separating resource-management functions from the programming model
MapReduce becomes just one of the applications (alongside Dryad, etc.)
Binary compatible / source compatible with existing MapReduce code
23. [Figure: YARN system diagram]
24. The Application Master (AM): the head of a job
Runs as a container
Requests resources from the RM (ResourceManager)
◦ number of containers / resources per container / locality …
Dynamically changes its resource consumption
Can run any user code (Dryad, MapReduce, Tez, REEF, etc.)
Requests are "late-binding"
25. Optimizes for locality among map tasks with identical resource requirements
◦ Selects a task with input data close to the container (see the request sketch below)
The AM determines the semantics of the success or failure of the container
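In client code, locality preferences are expressed on the container request itself. A minimal sketch using the public AMRMClient API shipped with YARN; the node and rack names are placeholders:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityRequest {
  static ContainerRequest mapTaskRequest() {
    Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
    // Prefer the hosts holding the input split's replicas, then their racks.
    String[] nodes = {"datanode07.example.com"};   // placeholder host
    String[] racks = {"/rack3"};                   // placeholder rack
    Priority priority = Priority.newInstance(10);
    // relaxLocality=true lets the RM fall back to other nodes if needed.
    return new ContainerRequest(capability, nodes, racks, priority, true);
  }
}
```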
26. 1. Submit the application by passing a Container Launch Context (CLC) for the Application Master to the RM.
2. When the RM starts the AM, the AM registers with the RM and periodically advertises its liveness and requirements over the heartbeat protocol.
3. Once the RM allocates a container, the AM can construct a CLC to launch the container on the corresponding NodeManager (NM). It may also monitor the status of the running container and stop it when the resource should be reclaimed. Monitoring the progress of work done inside the container is strictly the AM's responsibility.
4. Once the AM is done with its work, it should unregister from the RM and exit cleanly.
5. Optionally, framework authors may add control flow between their own clients to report job status and expose a control plane.
(A code sketch of steps 2–4 follows below.)
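A minimal sketch of steps 2–4 from the AM's point of view, using the synchronous AMRMClient that ships with YARN. Host, port, and message values are placeholders; a real AM would also launch containers via NMClient and handle failures and retries:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class MiniAppMaster {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // Step 2: register with the RM; subsequent allocate() calls double
    // as the liveness heartbeat.
    rmClient.registerApplicationMaster("appmaster-host", 0, ""); // placeholders

    // Step 3: request one container and wait for the RM to grant it.
    Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    boolean granted = false;
    while (!granted) {
      AllocateResponse resp = rmClient.allocate(0.5f); // heartbeat + progress
      for (Container container : resp.getAllocatedContainers()) {
        granted = true;
        // A real AM would now build a CLC for `container` and hand it to an
        // NMClient to launch the task; it would also watch the container's
        // status and stop it to reclaim the resource.
      }
      Thread.sleep(1000);
    }

    // Step 4: unregister and exit cleanly.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
    rmClient.stop();
  }
}
```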
27. RM failure
Recover using persistent storage
Kill all containers, including the AMs'
Relaunch the AMs
NM failure
The RM detects it, marks the containers as killed, and reports to the AMs
AM failure
The RM kills the container and restarts it
Container failure
The framework is responsible for recovery
28. In a 2,500-node cluster, throughput improved from roughly 77,000 jobs/day to 150,000 jobs/day after moving to YARN.
29. Pig, Hive, Oozie
◦ Decompose a DAG job into multiple MapReduce jobs
Apache Tez
◦ DAG execution framework
Spark
Dryad
Giraph
◦ Vertex-centric graph computation framework; fits naturally within the YARN model
Storm
◦ Distributed real-time processing engine (parallel stream processing)
REEF
◦ Simplifies implementing an ApplicationMaster
Hoya
◦ HBase clusters on YARN
Editor's Notes
CC BY 3.0 http://creativecommons.org/licenses/by/3.0/deed.en_US
Figures are copied from the original paper and therefore owned by ACM
Add a figure to show the system diagram
High-level frameworks compose a workflow as a DAG of MapReduce jobs. The size of nodes in each stage may differ.
The use of an AM satisfies scalability, programming-model flexibility, and improved upgrading/testing.
Late-binding: the received lease may not be the same as the request. AM must accommodate the difference.
When the AM receives a container, it matches it against the set of pending map tasks, selecting a task with input data close to the container.
If the AM decides to run a map task mi in the container, then the other hosts storing replicas of mi's input data become less desirable; the AM will update its request to diminish the weight on those other k-1 hosts.
Even a simple AM can be fairly complex. Frameworks to ease development of YARN applications exist. We explore some of these in section 4.2. Client libraries - YarnClient, NMClient, AMRMClient - ship with YARN and expose higher-level APIs to avoid coding against low-level protocols.
Work is in progress to add sufficient protocol support for AMs to survive RM restart
Essentially, after moving to YARN, CPU utilization almost doubled.
One of the most important architectural differences that partially explains these improvements is the removal of the static split between map and reduce slots.