Apache Hadoop is an open-source software framework used for distributed storage and processing of large datasets across clusters of computers. Its main building blocks include the cluster of nodes itself, YARN for resource management, HDFS for storage, and MapReduce for processing data. YARN splits resource management from job tracking: a Resource Manager assigns containers across Node Managers to run Application Masters and tasks. HDFS stores data across the cluster in a fault-tolerant way, with a name node holding metadata and data nodes holding the blocks.
2. Apache Hadoop
Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. There are five main building blocks inside this runtime environment.
4. Cluster
The Cluster is the set of host machines. Nodes may be partitioned into racks. This is the hardware part of the infrastructure.
The YARN Infrastructure (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources needed for application execution. Two important elements are:
The Resource Manager is the master. It knows where the slaves are located and how many resources they have. It runs several services, the most important of which is the Resource Scheduler, which decides how to assign the resources.
5. The Node Manager is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager. Periodically, it sends a heartbeat to the Resource Manager.
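The registration-and-heartbeat protocol can be sketched as a toy simulation. This is not Hadoop's actual RPC interface; the class and method names here are hypothetical, chosen only to mirror the roles described above.

```python
import time

class ResourceManager:
    """Toy master that tracks which Node Managers are alive (hypothetical names)."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout          # seconds of silence before a node is presumed dead
        self.last_heartbeat = {}        # node id -> time of last heartbeat

    def heartbeat(self, node_id):
        # A Node Manager announces itself (first call) or renews its lease (later calls).
        self.last_heartbeat[node_id] = time.monotonic()

    def live_nodes(self):
        # Nodes whose last heartbeat is recent enough are considered alive.
        now = time.monotonic()
        return [n for n, t in self.last_heartbeat.items() if now - t < self.timeout]

rm = ResourceManager(timeout=3.0)
rm.heartbeat("node-1")   # node-1 registers with the master
rm.heartbeat("node-2")
print(rm.live_nodes())   # both nodes are currently considered live
```

A node that stops heartbeating simply ages out of `live_nodes()`; real YARN additionally re-schedules the work that was running on the lost node.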
Each Node Manager offers some resources to the cluster. Its resource capacity is the amount of memory and the number of vcores (virtual cores). At run-time, the Resource Scheduler decides how to use this capacity: a Container is a fraction of the NM capacity, and it is used by the client for running a program.
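The idea that a container is a carved-out fraction of one Node Manager's capacity can be illustrated with a toy scheduler. The class names and the first-fit policy below are illustrative assumptions, not the YARN API or its real scheduling policies (Capacity/Fair Scheduler).

```python
class NodeManager:
    """Toy slave node advertising its resource capacity (illustrative only)."""
    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory = memory_mb
        self.free_vcores = vcores

class ResourceScheduler:
    """Toy first-fit scheduler: a container is a fraction of one NM's capacity."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        # Find the first node with enough spare capacity and carve a container out of it.
        for node in self.nodes:
            if node.free_memory >= memory_mb and node.free_vcores >= vcores:
                node.free_memory -= memory_mb
                node.free_vcores -= vcores
                return {"node": node.name, "memory_mb": memory_mb, "vcores": vcores}
        return None  # no single node can satisfy the request

nodes = [NodeManager("node-1", memory_mb=8192, vcores=4)]
scheduler = ResourceScheduler(nodes)
c = scheduler.allocate(memory_mb=2048, vcores=1)
print(c)  # a container granted on node-1; 6144 MB and 3 vcores remain free there
```

Note that a container must fit entirely within one Node Manager: a request larger than any single node's free capacity is refused even if the cluster as a whole has room.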
6. How YARN works
The fundamental idea of YARN is to split up the two major
responsibilities of the Job Tracker/Task Tracker into separate entities:
A Global Resource Manager
A Per-application Application Master
A Per-node slave Node Manager and
A Per-application container running on a Node Manager
The Resource Manager and the Node Manager form the new, and
generic, system for managing applications in a distributed manner.
The Resource Manager is the ultimate authority that arbitrates
resources among all the applications in the system.
7. Hadoop distributed file system
The Hadoop distributed file system (HDFS) is a distributed,
scalable, and portable file-system written in Java for the
Hadoop framework.
A Hadoop instance typically has a single name node, and a cluster of data nodes forms the HDFS cluster. This arrangement is typical rather than mandatory, because not every node requires a data node to be present.
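The division of labour between the name node (metadata only) and the data nodes (actual blocks) can be sketched as a toy model. The names below are hypothetical; real HDFS achieves fault tolerance by replicating each block (three copies by default), which the sketch imitates.

```python
class DataNode:
    """Toy data node: stores the actual block bytes."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

class NameNode:
    """Toy name node: keeps only metadata -- which data nodes hold each block."""
    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}  # block id -> list of data node names

    def write_block(self, block_id, data):
        # Place `replication` copies on distinct data nodes for fault tolerance.
        targets = self.data_nodes[: self.replication]
        for dn in targets:
            dn.blocks[block_id] = data  # the bytes live on the data nodes, not here
        self.block_map[block_id] = [dn.name for dn in targets]

    def read_block(self, block_id):
        # Any surviving replica can serve the read.
        for dn in self.data_nodes:
            if dn.name in self.block_map.get(block_id, []) and block_id in dn.blocks:
                return dn.blocks[block_id]
        raise IOError("all replicas lost")

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(dns, replication=3)
nn.write_block("blk_1", b"hello")
dns[0].blocks.clear()          # simulate losing one data node
print(nn.read_block("blk_1"))  # still readable from a surviving replica
```

Losing a single data node does not lose the block, because the name node's metadata points readers at the surviving replicas.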
9. MapReduce
If a Task Tracker fails or times out, that part of the job is
rescheduled.
The Task Tracker on each node spawns off a separate Java
Virtual Machine process to prevent the Task Tracker itself from
failing if the running job crashes the JVM.
A heartbeat is sent from the Task Tracker to the Job Tracker
every few minutes to check its status.
The Job Tracker and Task Tracker status and information are exposed by Jetty and can be viewed from a web browser.
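The rescheduling behaviour described above can be imitated with a toy in-memory word count, where a failed map task is simply retried. This is a conceptual sketch in Python, not Hadoop's Java MapReduce API; the function names are hypothetical.

```python
from collections import Counter

def map_task(chunk):
    # Map phase: emit (word, 1) pairs for one input split.
    return [(word, 1) for word in chunk.split()]

def run_job(chunks, map_fn, max_attempts=3):
    """Run each map task, retrying on failure -- the toy analogue of the
    Job Tracker rescheduling work from a failed or timed-out Task Tracker."""
    results = []
    for chunk in chunks:
        for _ in range(max_attempts):
            try:
                results.extend(map_fn(chunk))
                break
            except RuntimeError:
                continue  # reschedule: run the same task again
        else:
            raise RuntimeError("task failed after all retries")
    # Reduce phase: sum the counts per word.
    counts = Counter()
    for word, n in results:
        counts[word] += n
    return dict(counts)

print(run_job(["to be or", "not to be"], map_task))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because a map task is a pure function of its input split, re-running it after a failure yields the same pairs, which is what makes this retry strategy safe.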