Apache Hadoop is an open-source software framework that supports data-intensive distributed applications.
Supports running applications on large clusters of commodity hardware.
Tasks are divided up using the Map-Reduce framework.
Provides a distributed file system that stores data on the compute nodes.
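The Map-Reduce idea mentioned above can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real job would spread the same three steps across the machines of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop runs jobs", "YARN runs jobs too"]))
```

Each map call only sees its own line and each reduce call only sees one key, which is what lets a framework run them in parallel on different nodes.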
Drawbacks of Hadoop 1.0
The cluster is tightly coupled with Hadoop.
What is Hadoop 2.0
● Re-architected Hadoop: a complete overhaul based on the 0.23 branch.
● Introduced YARN and MR2.
● Enhanced resource scheduler.
● More efficient cluster utilization by running applications other than MR jobs.
Components of YARN
● ResourceManager
● NodeManager
● Container
● ApplicationMaster
● History Server
The ResourceManager is the ultimate authority in a Hadoop cluster. It arbitrates
resources among all the applications in the system. All resource negotiation
goes through the ResourceManager.
Components of Resource Manager
The Scheduler is responsible for allocating resources to the various running
applications.
The ApplicationsManager is responsible for accepting job submissions, negotiating
the first container for executing the application-specific ApplicationMaster, and
providing the service for restarting the ApplicationMaster container on failure.
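The Scheduler's role described above can be pictured with a toy first-fit allocator. This is a sketch under simplifying assumptions, not YARN's scheduling logic: Node, ToyScheduler and allocate are hypothetical names, memory is the only resource considered, and queues and capacities are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int  # memory still available on this machine


@dataclass
class ToyScheduler:
    nodes: list
    allocations: list = field(default_factory=list)

    def allocate(self, app_id, memory_mb):
        """Place one container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                self.allocations.append((app_id, node.name, memory_mb))
                return node.name
        # No node can host the container right now.
        return None


if __name__ == "__main__":
    scheduler = ToyScheduler([Node("node1", 1024), Node("node2", 2048)])
    print(scheduler.allocate("app1", 1024))
    print(scheduler.allocate("app2", 1536))
    print(scheduler.allocations)
```

The point of the sketch is only the division of labour: the Scheduler decides where resources are granted, while starting and restarting application code is left to other components.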
The NodeManager is the per-machine agent responsible for monitoring the
resources of the machine it is running on and reporting them to the
ResourceManager.
Containers are allocated on a NodeManager to perform the assigned tasks.
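A rough way to picture the NodeManager and its containers is a small class that tracks what is running on one machine and builds a status report. The sketch below is illustrative only: ToyNodeManager, its methods and the report fields are invented for this example and do not mirror the real NodeManager protocol.

```python
class ToyNodeManager:
    """Tracks containers on one machine and reports resource usage."""

    def __init__(self, name, total_mb):
        self.name = name
        self.total_mb = total_mb
        self.containers = {}  # container id -> memory in MB

    def free_mb(self):
        return self.total_mb - sum(self.containers.values())

    def start_container(self, container_id, memory_mb):
        """Start a container if the machine has enough free memory."""
        if memory_mb > self.free_mb():
            return False
        self.containers[container_id] = memory_mb
        return True

    def stop_container(self, container_id):
        """Release the memory held by a finished container."""
        self.containers.pop(container_id, None)

    def heartbeat(self):
        """Build the status report this node would send upstream."""
        return {
            "node": self.name,
            "used_mb": self.total_mb - self.free_mb(),
            "free_mb": self.free_mb(),
            "containers": sorted(self.containers),
        }


if __name__ == "__main__":
    nm = ToyNodeManager("node1", 4096)
    nm.start_container("c1", 1024)
    print(nm.heartbeat())
```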
The ApplicationMaster is a framework-specific library for negotiating resources
from the ResourceManager and working with the NodeManager(s) to execute tasks in
containers and monitor them.
The ApplicationMaster is responsible for negotiating resource containers
from the Scheduler for its tasks.
Provides a communication port through which users can communicate with the
ApplicationMaster.
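The ApplicationMaster's negotiate, execute and monitor cycle can be sketched as a plain loop. Everything here is an assumption made for illustration: run_application, request_container and launch_task are hypothetical callables standing in for the Scheduler and NodeManager interactions, and the retry limit is an arbitrary choice, not something YARN prescribes.

```python
def run_application(tasks, request_container, launch_task, max_attempts=2):
    """Toy ApplicationMaster loop: one container request per task attempt.

    request_container(task) returns a container id or None (no capacity).
    launch_task(container, task) returns True when the task succeeds.
    """
    status = {}
    for task in tasks:
        status[task] = "FAILED"
        for _attempt in range(max_attempts):
            # Negotiate a container for this task.
            container = request_container(task)
            if container is None:
                continue  # nothing granted this round; try again
            # Run the task in the container and monitor the outcome.
            if launch_task(container, task):
                status[task] = "SUCCEEDED"
                break
    return status


if __name__ == "__main__":
    ids = iter(["c1", "c2", "c3"])
    result = run_application(
        ["t1", "t2"],
        request_container=lambda task: next(ids, None),
        launch_task=lambda container, task: True,
    )
    print(result)
```

Passing the two callables in as parameters keeps the loop independent of any particular scheduler or node implementation, which mirrors the idea that the ApplicationMaster is a library specific to the application framework.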
The History Server lets users get the status of finished applications.
Apache YARN provides a framework on which various applications
can be built.
Hadoop backers expect that the advent of YARN could open the
floodgates for new applications built to run on Hadoop.
Various projects, such as Apache Tez, have been created to do more
advanced data processing than MapReduce specializes in.
YARN promotes effective utilization of resources while providing a
distributed environment for application execution.
Current use cases on YARN
Samza: LinkedIn release
Apache Samza is a distributed stream
processing framework. It uses Apache Kafka
for messaging, and Apache Hadoop YARN to
provide fault tolerance, processor isolation,
security, and resource management.
Streaming in Hadoop: Yahoo! release
Storm-YARN enables Storm applications to utilize the computational
resources in a Hadoop cluster along with accessing Hadoop storage
resources such as HBase and HDFS.