Distributed Systems (Hadoop)
Name: Alamin
Stu Id: 23-92971-2
Table of Contents
 What is a distributed system?
 What is Hadoop?
 How Hadoop works?
 Important components of Hadoop
 Hadoop Common
 Hadoop HDFS
 Hadoop YARN
 Hadoop MapReduce
 Key features of Hadoop
What is a distributed system?
 A distributed system is a collection of interconnected computers, or nodes, that work
together to achieve a common goal.
 In a distributed system, these nodes are physically separated and communicate with each
other through a network, such as the internet or a local area network (LAN).
 Distributed computing is a way to make computers work together like a team. It's like
breaking down a big job into smaller pieces, and then giving each piece to a different
computer to work on.
 Distributed computing is used in all sorts of applications, from scientific research to business
intelligence to video games.
 It's a powerful tool that can be used to solve problems that would be too big or too hard for a
single computer to handle.
Some common types of Distributed systems
Common types of distributed systems include:
 Client-server system
 Peer-to-peer (P2P) system
 Cluster and Grid Computing
 Cloud Computing
 Distributed Database
 Distributed file systems
What is Hadoop?
 Hadoop follows a distributed architecture; put another way, Hadoop is itself a distributed
system.
 Hadoop is an open-source framework that allows us to store and process large datasets in a
parallel and distributed manner.
 This distributed environment is built up of a cluster of machines that work closely together to
give an impression of a single working machine.
 It is designed to handle massive amounts of data across a distributed cluster of commodity
hardware.
 Hadoop was originally developed by Doug Cutting and Mike Cafarella in 2005 and is now
maintained by the Apache Software Foundation.
How Hadoop Works?
Hadoop works by distributing and processing large datasets across a cluster of computers,
providing a framework for scalable and fault-tolerant data storage and analysis. Here's an
overview of how Hadoop works:
 Data Storage with HDFS (Hadoop Distributed File System):
 Data is stored in Hadoop using HDFS, which divides large files into smaller blocks (typically 128
MB or 256 MB in size).
 These blocks are replicated across multiple nodes in the Hadoop cluster for fault tolerance. By
default, each block is replicated three times.
 Data Ingestion:
 Data is ingested into Hadoop by copying it into HDFS. This can be done using Hadoop shell
commands, APIs, or other tools (see the sketch after this list).
 Data Processing with MapReduce:
 MapReduce is a programming model for parallel data processing. It consists of two main phases:
Map and Reduce.
 In the Map phase, the input is split into records, and a user-defined Map function turns each
record into intermediate key-value pairs.
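As a minimal sketch of the storage and ingestion steps above, the snippet below copies a local file into HDFS through the Java FileSystem API and then prints the block size and replication factor the cluster applied. The NameNode address and file paths are assumptions for illustration; the same copy can be done from the shell with hdfs dfs -put.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks and
        // replicates each block (three copies by default) behind the scenes.
        fs.copyFromLocalFile(new Path("/tmp/input.log"), new Path("/data/input.log"));

        // Report the block size and replication factor HDFS applied to the file.
        FileStatus status = fs.getFileStatus(new Path("/data/input.log"));
        System.out.println("block size: " + status.getBlockSize()
                + ", replication: " + status.getReplication());

        fs.close();
    }
}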
How Hadoop Works? (Continued)
 Job Scheduling and Execution:
 Hadoop's resource manager (usually YARN) manages the allocation of cluster resources and
schedules job execution.
 The Map and Reduce tasks are distributed across the cluster nodes where the data is located, to
minimize data transfer over the network (a driver sketch follows this list).
 Fault Tolerance:
 Hadoop provides fault tolerance through data replication and task recovery.
 If a node or task fails, Hadoop automatically reschedules tasks to run on healthy nodes and utilizes
the replicated data blocks.
 Monitoring and Management:
 Hadoop provides tools like the Hadoop Distributed File System (HDFS) web interface and resource
manager web UI for monitoring and managing the cluster.
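As a sketch of job submission under stated assumptions (the input and output paths are illustrative), the driver below hands a job to the resource manager via the standard Job API. Because no mapper or reducer class is set, Hadoop falls back to the identity Mapper and Reducer, so the job simply passes records through while still exercising scheduling, data-local task placement, and failure retries.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity-passthrough");
        job.setJarByClass(IdentityJob.class);

        // The default TextInputFormat produces (byte offset, line) records,
        // so the identity map/reduce output types must match them.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));    // assumed path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // assumed path

        // waitForCompletion submits the job to the resource manager and blocks
        // until every map and reduce task finishes (failed tasks are retried).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}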
Important components of Hadoop
Hadoop is an open-source framework used for distributed storage and processing of
large datasets. It consists of several key components; the four most important are:
 Hadoop Common
 Hadoop HDFS
 Hadoop YARN
 Hadoop MapReduce
Hadoop Common
 Hadoop Common refers to the collection of common utilities and libraries that support other
Hadoop modules.
 It is an essential part or module of the Apache Hadoop Framework, along with the Hadoop
Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce.
 Like all other modules, Hadoop Common assumes that hardware failures are common and
that these should be automatically handled in software by the Hadoop Framework. Hadoop
Common is also known as Hadoop Core.
 Here are some key aspects of Hadoop Common:
 Core Libraries
 HDFS Clients
 Configuration Management (see the sketch after this list)
 Logging and Monitoring
 Security
 CLI Tools
 Error Handling
 Utilities
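As a small sketch of the Configuration Management aspect: Hadoop Common's Configuration class layers site overrides (core-site.xml) on top of shipped defaults and exposes typed getters. The property keys below are standard Hadoop keys.

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // fs.defaultFS names the default filesystem, typically an HDFS NameNode URI.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        // Typed accessor with a fallback value used when the key is unset.
        int replication = conf.getInt("dfs.replication", 3);
        System.out.println("dfs.replication = " + replication);
    }
}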
Hadoop HDFS
Hadoop Distributed File System (HDFS): HDFS is the primary storage system in Hadoop. It
divides large files into smaller blocks and distributes them across multiple data nodes in a cluster,
providing fault tolerance and high availability.
Hadoop HDFS (Continued)
 Name Node (Master Node)
 Manages all the slave nodes and assigns work to them.
 It manages the file system namespace, executing operations such as opening, closing, and
renaming files and directories.
 It should be deployed on reliable, high-specification hardware rather than commodity
hardware.
 The master node keeps a record of everything: it knows the location and metadata of every
DataNode and the blocks each one holds, and nothing is done in the cluster without its
permission.
Hadoop HDFS (Continued)
 Data Node (Slave Node)
 The actual worker nodes, which do the real work: reading, writing, and processing data.
 They also perform creation, deletion, and replication upon instruction from the master.
 They can be deployed on commodity hardware.
 The HDFS cluster contains multiple DataNodes, and each DataNode stores multiple data
blocks (see the block-location sketch below).
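The NameNode/DataNode split can be observed from a client: the metadata calls below ask the NameNode which DataNodes hold each block of a file. A minimal sketch, assuming a configured cluster and an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.log")); // assumed path

        // One BlockLocation per block; each lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}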
Hadoop YARN
YARN (Yet Another Resource Negotiator): YARN is the component of the Hadoop ecosystem
that manages and allocates resources in a Hadoop cluster. It is responsible for resource
management and job scheduling, making it an integral part of distributed data processing in
Hadoop.
Hadoop YARN (Continued)
 ResourceManager
 The ResourceManager is the central component of YARN.
 It manages and allocates cluster resources, such as CPU and memory, to different applications.
 It tracks available resources and queues, making sure that resources are allocated efficiently.
 NodeManager
 Each worker node in the cluster runs a NodeManager, which is responsible for monitoring resource
usage on that node and reporting it back to the ResourceManager.
 NodeManagers manage the execution of application containers (a client-side sketch follows).
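A minimal client-side sketch of the ResourceManager's bookkeeping, assuming a yarn-site.xml on the classpath that points at a running cluster: the YarnClient API asks the ResourceManager for a report of every application it is tracking.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // The ResourceManager tracks the state and resource usage of every application.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  state=" + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}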
Hadoop MapReduce
MapReduce: MapReduce is a processing technique and programming model for distributed
computing, based on Java. The MapReduce algorithm contains two important tasks: Map and
Reduce. Map takes a set of data and converts it into another set of data in which individual
elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a
Map as its input and combines those data tuples into a smaller set of tuples. As the name
MapReduce implies, the reduce task is always performed after the map task. A MapReduce
program executes in three stages: the map stage, the shuffle stage, and the reduce stage.
Hadoop MapReduce (Continued)
Map stage
 The map or mapper’s job is to process the input data.
 Generally, the input data is a file or directory stored in the Hadoop Distributed File System
(HDFS).
 The input file is passed to the mapper function line by line; the mapper processes the data and
creates several small chunks of data.
Reduce stage
 This stage is the combination of the Shuffle stage and the Reduce stage.
 The Reducer’s job is to process the data that comes from the mapper.
 After processing, it produces a new set of output, which is stored in HDFS (see the word-count
sketch below).
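The canonical word-count example illustrates both stages: the mapper turns each input line into (word, 1) pairs, the shuffle stage groups the pairs by word, and the reducer sums each group. This sketch shows only the Mapper and Reducer classes; they would be wired into a job with setMapperClass and setReducerClass in a driver like the one shown earlier.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each record is one line of input; emit a (word, 1) pair per token.
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // The shuffle stage has already grouped every count for this word.
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}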
Hadoop MapReduce (Continued)
Classic MapReduce (MRv1) relied on two essential daemons, the JobTracker and the TaskTracker:
Job Tracker: In Hadoop's classic MapReduce framework, the Job Tracker was a central service
responsible for scheduling and managing MapReduce jobs, monitoring task progress, and
handling job recovery.
Task Tracker: In the same framework, Task Trackers were worker nodes responsible for
executing individual map and reduce tasks within a MapReduce job, with a focus on data
localization and failure handling.
Key features of Hadoop
 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
 Data Locality: Hadoop moves computation to the node where the data is stored rather than
moving data to the computation, which reduces network traffic and improves performance.
Key features of Hadoop (Continued)
 High Availability: Hadoop's high-availability features help ensure that data is always
accessible and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide variety of data
processing tasks.
 Data Integrity: Hadoop computes built-in checksums, which help ensure that stored data is
consistent and correct.
 Data Replication: Hadoop replicates data across the cluster for fault tolerance (a sketch of
adjusting the replication factor follows this list).
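As a small sketch of the replication feature from the client side (the path is illustrative), HDFS lets a caller raise or lower a file's replication factor after the fact:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep five replicas of this file instead of the default three.
        boolean accepted = fs.setReplication(new Path("/data/input.log"), (short) 5);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}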
Key features of Hadoop (Continued)
 Data Compression: Hadoop provides built-in data compression, which reduces storage space
and improves performance (see the sketch after this list).
 YARN: A resource-management platform that allows multiple data processing engines, such
as real-time streaming, batch processing, and interactive SQL, to run on and process data
stored in HDFS.
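A minimal sketch of enabling the compression feature for a job's output, using the gzip codec that ships with Hadoop; the job object is assumed to be configured as in the driver sketch earlier.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EnableCompression {
    static void compressOutput(Job job) {
        // Write reducer output as gzip-compressed files to save HDFS space.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Optionally compress intermediate map output to cut shuffle traffic.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
    }
}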
Thank You
