The document provides an overview of distributed systems and the Hadoop framework. It defines distributed systems as collections of interconnected computers that work together to achieve a common goal. Hadoop is introduced as an open-source distributed processing framework for massive datasets. Key components of Hadoop include HDFS for storage, YARN for resource management, MapReduce for processing, and common utilities. The document also explains how Hadoop works and its features such as scalability, fault tolerance, and flexible data processing.
2. Table of Contents
What is a distributed system?
What is Hadoop?
How Hadoop works
Important components of Hadoop
Hadoop Common
Hadoop HDFS
Hadoop YARN
Hadoop MapReduce
Key features of Hadoop
3. What is a distributed system?
A distributed system is a collection of interconnected computers, or nodes, that work
together to achieve a common goal.
In a distributed system, these nodes are physically separated and communicate with each
other through a network, such as the internet or a local area network (LAN).
Distributed computing is a way to make computers work together like a team. It's like
breaking down a big job into smaller pieces, and then giving each piece to a different
computer to work on.
Distributed computing is used in all sorts of applications, from scientific research to business
intelligence to video games.
It's a powerful tool that can be used to solve problems that would be too big or too hard for a
single computer to handle.
4. Some common types of distributed systems
Common types of distributed systems include:
Client-server systems
Peer-to-peer (P2P) systems
Cluster and grid computing
Cloud computing
Distributed databases
Distributed file systems
5. What is Hadoop?
Hadoop follows a distributed architecture; in other words, Hadoop is itself a distributed
system.
Hadoop is an open-source framework that allows us to store and process large datasets in a
parallel and distributed manner.
This distributed environment is built up of a cluster of machines that work closely together to
give an impression of a single working machine.
It is designed to handle massive amounts of data across a distributed cluster of commodity
hardware.
Hadoop was originally developed by Doug Cutting and Mike Cafarella in 2005 and is now
maintained by the Apache Software Foundation.
6. How Hadoop Works
Hadoop works by distributing and processing large datasets across a cluster of computers,
providing a framework for scalable and fault-tolerant data storage and analysis. Here's an
overview of how Hadoop works:
Data Storage with HDFS (Hadoop Distributed File System):
Data is stored in Hadoop using HDFS, which divides large files into smaller blocks (typically 128
MB or 256 MB in size).
These blocks are replicated across multiple nodes in the Hadoop cluster for fault tolerance. By
default, each block is replicated three times.
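The block-splitting and replication described above can be sketched in a few lines. This is an illustrative model only, not the real HDFS API: the `split_into_blocks` and `place_replicas` functions and the round-robin placement policy are assumptions for demonstration (actual HDFS placement is rack-aware).

```python
# Illustrative sketch of HDFS-style block splitting and replication.
# Not the real HDFS API; function names and the round-robin placement
# policy are invented for this example.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the HDFS default block size
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_id, length) pairs covering the whole file."""
    blocks, offset, block_id = [], 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, length))
        offset += length
        block_id += 1
    return blocks

def place_replicas(blocks, datanodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for block_id, _ in blocks:
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB,
# and each block ends up on three different DataNodes.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on three distinct nodes, the loss of any single machine leaves at least two healthy copies of each block.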
Data Ingestion:
Data is ingested into Hadoop by copying it into HDFS. This can be done using Hadoop
commands, APIs, or other tools.
Data Processing with MapReduce:
MapReduce is a programming model for parallel data processing. It consists of two main phases:
Map and Reduce.
In the Map phase, data is broken down into key-value pairs, and a set of user-defined Map functions
is applied to each pair.
7. How Hadoop Works (Continued)
Job Scheduling and Execution:
Hadoop's resource manager (usually YARN) manages the allocation of cluster resources and
schedules job execution.
The Map and Reduce tasks are distributed across the cluster nodes, where the data is located, to
minimize data transfer over the network.
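The data-locality idea above can be sketched as a toy scheduler. This is a hypothetical simplification, not YARN's actual code: `schedule_tasks` and its greedy data-local policy are invented for illustration.

```python
# Toy data-local scheduler (illustrative only, not YARN's real logic):
# given the NameNode's block placement, prefer to run each map task on
# a node that already holds a replica of the block, so no data moves
# over the network.
def schedule_tasks(block_locations, free_slots):
    """block_locations: {block_id: [nodes holding a replica]}
    free_slots: {node: available task slots}
    Returns {block_id: node}, preferring data-local nodes."""
    assignment = {}
    for block_id, replicas in sorted(block_locations.items()):
        local = [n for n in replicas if free_slots.get(n, 0) > 0]
        if local:
            node = local[0]                         # data-local assignment
        else:
            node = max(free_slots, key=free_slots.get)  # fall back to any node
        assignment[block_id] = node
        free_slots[node] -= 1
    return assignment

locations = {0: ["dn1", "dn2"], 1: ["dn2", "dn3"], 2: ["dn1", "dn3"]}
slots = {"dn1": 1, "dn2": 1, "dn3": 1}
assignment = schedule_tasks(locations, slots)
print(assignment)   # every task lands on a node holding its block
```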
Fault Tolerance:
Hadoop provides fault tolerance through data replication and task recovery.
If a node or task fails, Hadoop automatically reschedules tasks to run on healthy nodes and utilizes
the replicated data blocks.
Monitoring and Management:
Hadoop provides tools like the Hadoop Distributed File System (HDFS) web interface and resource
manager web UI for monitoring and managing the cluster.
8. Important components of Hadoop
Hadoop is an open-source framework used for distributed storage and processing of
large datasets. It consists of several key components; the four most important are listed
below:
Hadoop Common.
Hadoop HDFS.
Hadoop YARN.
Hadoop MapReduce.
9. Hadoop Common
Hadoop Common refers to the collection of common utilities and libraries that support other
Hadoop modules.
It is an essential part or module of the Apache Hadoop Framework, along with the Hadoop
Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce.
Like all other modules, Hadoop Common assumes that hardware failures are common and
that these should be automatically handled in software by the Hadoop Framework. Hadoop
Common is also known as Hadoop Core.
Here are some key aspects of Hadoop Common:
Core Libraries
HDFS Clients
Configuration Management
Logging and Monitoring
Security
CLI Tools
Error Handling
Utilities
10. Hadoop HDFS
Hadoop Distributed File System (HDFS): HDFS is the primary storage system in Hadoop. It
divides large files into smaller blocks and distributes them across multiple data nodes in a cluster,
providing fault tolerance and high availability.
11. Hadoop HDFS (Continued)
Name Node (Master Node)
Manages all the slave nodes and assigns work to them.
It executes file system namespace operations such as opening, closing, and renaming
files and directories.
It should be deployed on reliable, high-specification hardware, not on commodity
hardware.
The master node has a record of everything: it knows the location and details of each
and every DataNode and the blocks they contain, i.e., nothing is done without the
permission of the master node.
12. Hadoop HDFS (Continued)
Data Node (Slave Node)
Actual worker nodes, which do the real work such as reading, writing, and processing data.
They also perform block creation, deletion, and replication upon instruction from the master.
They can be deployed on commodity hardware.
The HDFS cluster contains multiple DataNodes, and each DataNode stores multiple data
blocks.
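The NameNode/DataNode split above can be captured in a toy model. This is illustrative only (the class and method names are invented, not HDFS code): the NameNode keeps just metadata, while clients read the actual bytes directly from DataNodes.

```python
# Toy model of the NameNode's role (illustrative names, not HDFS code).
# The NameNode stores only metadata: the namespace (file -> blocks) and
# the block map (block -> DataNodes). The data itself lives on DataNodes.
class NameNode:
    def __init__(self):
        self.namespace = {}   # file path -> list of block ids
        self.block_map = {}   # block id -> list of DataNode names

    def create_file(self, path, block_ids, datanodes_per_block):
        """Record a new file's blocks and where their replicas live."""
        self.namespace[path] = block_ids
        for bid, nodes in zip(block_ids, datanodes_per_block):
            self.block_map[bid] = nodes

    def locate(self, path):
        """A client asks the NameNode where each block lives, then reads
        the bytes directly from those DataNodes."""
        return [(bid, self.block_map[bid]) for bid in self.namespace[path]]

nn = NameNode()
nn.create_file("/logs/app.log", [0, 1],
               [["dn1", "dn2", "dn3"], ["dn2", "dn3", "dn4"]])
print(nn.locate("/logs/app.log"))
```

Note the design point this illustrates: because the NameNode holds all metadata, it is a single point of coordination, which is why it must run on reliable hardware.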
13. Hadoop YARN
YARN (Yet Another Resource Negotiator): Hadoop YARN, or Yet Another Resource
Negotiator, is a key component of the Hadoop ecosystem that manages and allocates resources in
a Hadoop cluster. YARN is responsible for resource management and job scheduling, making it
an integral part of distributed data processing in Hadoop.
14. Hadoop YARN (Continued)
ResourceManager
The ResourceManager is the central component of YARN.
It manages and allocates cluster resources, such as CPU and memory, to different applications.
It tracks available resources and queues, making sure that resources are allocated efficiently.
NodeManager
Each worker node in the cluster runs a NodeManager, which is responsible for monitoring resource
usage on that node and reporting it back to the ResourceManager.
NodeManagers manage the execution of application containers.
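The ResourceManager/NodeManager interaction can be sketched as follows. This is a toy model under stated assumptions (class names mirror YARN's daemons, but the allocation logic and memory-only resource model are simplifications; real YARN negotiates containers over RPC and also tracks CPU, queues, and priorities).

```python
# Toy model of YARN resource allocation (illustrative simplification).
# NodeManagers report free capacity; the ResourceManager grants
# "containers" against that capacity.
class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_memory_mb = memory_mb   # reported back to the RM

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate_container(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.free_memory_mb -= memory_mb
                return (node.name, memory_mb)
        return None   # no capacity: in real YARN the request would queue

rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
c1 = rm.allocate_container(3072)   # fits on nm1
c2 = rm.allocate_container(2048)   # nm1 is too full now, goes to nm2
c3 = rm.allocate_container(2048)   # cluster has no 2048 MB slot left
print(c1, c2, c3)
```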
15. Hadoop MapReduce
MapReduce: MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, in which individual
elements are broken down into tuples (key/value pairs). The Reduce task then takes the
output from a map as input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map job.
A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage.
16. Hadoop MapReduce (Continued)
Map stage
The map or mapper’s job is to process the input data.
Generally, the input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the data and
creates several small chunks of data.
Reduce stage
This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which will be stored in the HDFS.
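The map, shuffle, and reduce stages above can be mimicked with a minimal in-memory word count. This is an illustrative sketch only; real Hadoop jobs are written against the Java MapReduce API and run distributed across a cluster, and the function names here are invented.

```python
# Minimal in-memory word count mirroring map -> shuffle -> reduce.
# Illustrative only; not the Hadoop MapReduce API.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) key/value pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all values by key before reducing."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single output value."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle_phase(mapped))
print(result)   # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster the mapped pairs are produced on many nodes in parallel, and the shuffle moves each key's values over the network to the reducer responsible for that key.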
17. Hadoop MapReduce (Continued)
Classic MapReduce relied on two essential daemons, the Job Tracker and the Task Tracker:
Job Tracker: In Hadoop's classic MapReduce framework, the Job Tracker was a central service
responsible for scheduling and managing MapReduce jobs, monitoring task progress, and
handling job recovery.
Task Tracker: In the same framework, Task Trackers were worker nodes responsible for
executing individual map and reduce tasks within a MapReduce job, with a focus on data
localization and failure handling.
18. Key features of Hadoop
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
Data locality: Hadoop provides a data locality feature, in which data is processed on the
same node where it is stored; this reduces network traffic and improves performance.
19. Key features of Hadoop (Continued)
High Availability: Hadoop provides a High Availability feature, which helps ensure that
the data is always available and is not lost.
Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide variety of data
processing tasks.
Data Integrity: Hadoop provides a built-in checksum feature, which helps ensure that the
stored data is consistent and correct.
Data Replication: Hadoop provides a data replication feature, which replicates data
across the cluster for fault tolerance.
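The checksum-based integrity check mentioned above can be sketched as follows. This is an assumed simplification: real HDFS computes CRC32C checksums over small fixed-size chunks within each block, whereas this toy example uses one CRC32 per block.

```python
# Sketch of checksum-based block integrity (simplified: one CRC32 per
# block, whereas HDFS checksums small chunks within each block).
import zlib

def checksum(block: bytes) -> int:
    """Compute a checksum when the block is first written."""
    return zlib.crc32(block)

def verify(block: bytes, expected: int) -> bool:
    """On read, recompute and compare; a mismatch means this replica is
    corrupt, and a healthy replica would be read instead."""
    return zlib.crc32(block) == expected

original = b"block-0 payload"
stored = checksum(original)
print(verify(original, stored))           # intact replica passes
print(verify(b"blocj-0 payload", stored)) # a single bit-flip is detected
```

This is how the data integrity and data replication features work together: checksums detect a corrupt replica, and replication guarantees a healthy copy exists to read and to re-replicate from.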
20. Key features of Hadoop (Continued)
Data Compression: Hadoop provides built-in data compression, which helps reduce
storage space and improve performance.
YARN: A resource management platform that allows multiple data processing engines, such
as real-time streaming, batch processing, and interactive SQL, to run and process data stored
in HDFS.