Apache Hadoop is an open-source software framework used for distributed storage and processing of large datasets across clusters of computers. Its main building blocks include the cluster of nodes itself, YARN for resource management, HDFS for storage, and MapReduce for processing data. YARN splits resource management from job tracking: a Resource Manager assigns containers across Node Managers to run Application Masters and tasks. HDFS stores data across the cluster in a fault-tolerant way, with a name node holding metadata and data nodes holding the blocks.
2. Apache Hadoop
Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. There are five main building blocks inside this runtime environment.
4. Cluster
The Cluster is the set of host machines. Nodes may be partitioned into racks. This is the hardware part of the infrastructure.
The YARN Infrastructure (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources needed for application execution. Two important elements are:
The Resource Manager is the master. It knows where the slaves are located and how many resources they have. It runs several services, the most important of which is the Resource Scheduler, which decides how to assign the resources.
5. The Node Manager is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager. Periodically, it sends a heartbeat to the Resource Manager.
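The registration-and-heartbeat protocol can be sketched as a toy simulation. This is not Hadoop's actual RPC interface; the class and method names here are hypothetical, chosen only to mirror the roles described above.

```python
import time

class ResourceManager:
    """Toy master that tracks which Node Managers are alive (hypothetical names)."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout          # seconds of silence before a node is presumed dead
        self.last_heartbeat = {}        # node id -> time of last heartbeat

    def heartbeat(self, node_id):
        # A Node Manager announces itself (first call) or renews its lease (later calls).
        self.last_heartbeat[node_id] = time.monotonic()

    def live_nodes(self):
        # Nodes whose last heartbeat is recent enough are considered alive.
        now = time.monotonic()
        return [n for n, t in self.last_heartbeat.items() if now - t < self.timeout]

rm = ResourceManager(timeout=3.0)
rm.heartbeat("node-1")   # node-1 registers with the master
rm.heartbeat("node-2")
print(rm.live_nodes())   # both nodes are currently considered live
```

A node that stops heartbeating simply ages out of `live_nodes()`; real YARN additionally re-schedules the work that was running on the lost node.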
Each Node Manager offers some resources to the cluster. Its resource capacity is the amount of memory and the number of vcores (virtual cores). At run-time, the Resource Scheduler decides how to use this capacity: a Container is a fraction of the NM capacity, and it is used by the client for running a program.
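The idea that a container is a carved-out fraction of one Node Manager's capacity can be illustrated with a toy scheduler. The class names and the first-fit policy below are illustrative assumptions, not the YARN API or its real scheduling policies (Capacity/Fair Scheduler).

```python
class NodeManager:
    """Toy slave node advertising its resource capacity (illustrative only)."""
    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory = memory_mb
        self.free_vcores = vcores

class ResourceScheduler:
    """Toy first-fit scheduler: a container is a fraction of one NM's capacity."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        # Find the first node with enough spare capacity and carve a container out of it.
        for node in self.nodes:
            if node.free_memory >= memory_mb and node.free_vcores >= vcores:
                node.free_memory -= memory_mb
                node.free_vcores -= vcores
                return {"node": node.name, "memory_mb": memory_mb, "vcores": vcores}
        return None  # no single node can satisfy the request

nodes = [NodeManager("node-1", memory_mb=8192, vcores=4)]
scheduler = ResourceScheduler(nodes)
c = scheduler.allocate(memory_mb=2048, vcores=1)
print(c)  # a container granted on node-1; 6144 MB and 3 vcores remain free there
```

Note that a container must fit entirely within one Node Manager: a request larger than any single node's free capacity is refused even if the cluster as a whole has room.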
6. How YARN works
The fundamental idea of YARN is to split up the two major
responsibilities of the Job Tracker/Task Tracker into separate entities:
A Global Resource Manager
A Per-application Application Master
A Per-node slave Node Manager and
A Per-application container running on a Node Manager
The Resource Manager and the Node Manager form the new, and
generic, system for managing applications in a distributed manner.
The Resource Manager is the ultimate authority that arbitrates
resources among all the applications in the system.
7. Hadoop distributed file system
The Hadoop distributed file system (HDFS) is a distributed,
scalable, and portable file-system written in Java for the
Hadoop framework.
A Hadoop instance typically has a single name node, and a cluster of data nodes forms the HDFS cluster. This arrangement is typical rather than mandatory, because not every node requires a data node to be present.
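The division of labour between the name node (metadata only) and the data nodes (actual blocks) can be sketched as a toy model. The names below are hypothetical; real HDFS achieves fault tolerance by replicating each block (three copies by default), which the sketch imitates.

```python
class DataNode:
    """Toy data node: stores the actual block bytes."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

class NameNode:
    """Toy name node: keeps only metadata -- which data nodes hold each block."""
    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}  # block id -> list of data node names

    def write_block(self, block_id, data):
        # Place `replication` copies on distinct data nodes for fault tolerance.
        targets = self.data_nodes[: self.replication]
        for dn in targets:
            dn.blocks[block_id] = data  # the bytes live on the data nodes, not here
        self.block_map[block_id] = [dn.name for dn in targets]

    def read_block(self, block_id):
        # Any surviving replica can serve the read.
        for dn in self.data_nodes:
            if dn.name in self.block_map.get(block_id, []) and block_id in dn.blocks:
                return dn.blocks[block_id]
        raise IOError("all replicas lost")

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(dns, replication=3)
nn.write_block("blk_1", b"hello")
dns[0].blocks.clear()          # simulate losing one data node
print(nn.read_block("blk_1"))  # still readable from a surviving replica
```

Losing a single data node does not lose the block, because the name node's metadata points readers at the surviving replicas.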
9. MapReduce
If a Task Tracker fails or times out, that part of the job is
rescheduled.
The Task Tracker on each node spawns off a separate Java
Virtual Machine process to prevent the Task Tracker itself from
failing if the running job crashes the JVM.
A heartbeat is sent from the Task Tracker to the Job Tracker
every few minutes to check its status.
The Job Tracker and Task Tracker status and information are exposed by Jetty and can be viewed from a web browser.
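The rescheduling behaviour described above can be imitated with a toy in-memory word count, where a failed map task is simply retried. This is a conceptual sketch in Python, not Hadoop's Java MapReduce API; the function names are hypothetical.

```python
from collections import Counter

def map_task(chunk):
    # Map phase: emit (word, 1) pairs for one input split.
    return [(word, 1) for word in chunk.split()]

def run_job(chunks, map_fn, max_attempts=3):
    """Run each map task, retrying on failure -- the toy analogue of the
    Job Tracker rescheduling work from a failed or timed-out Task Tracker."""
    results = []
    for chunk in chunks:
        for _ in range(max_attempts):
            try:
                results.extend(map_fn(chunk))
                break
            except RuntimeError:
                continue  # reschedule: run the same task again
        else:
            raise RuntimeError("task failed after all retries")
    # Reduce phase: sum the counts per word.
    counts = Counter()
    for word, n in results:
        counts[word] += n
    return dict(counts)

print(run_job(["to be or", "not to be"], map_task))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because a map task is a pure function of its input split, re-running it after a failure yields the same pairs, which is what makes this retry strategy safe.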