Advanced Topics in Computer Science
Name Registration Number
Ben Wycliff Mugalu 2022/HD05/1315U
Sudi Murindanyi 2022/HD05/5583X
Paul Mutambuze 2022/HD05/1452U
Group: AT 12
Presentation outline
● Big data
● Hadoop
● Features of Hadoop
● Hadoop Distributed File System
● Distributed Vs Parallel File systems
● MapReduce
● YARN
● Shortcomings of Hadoop
● The future of Hadoop
Big Data
Big data is a term used to describe data that is too large and complex to
store and process in a traditional database.
Big Data Analytics
● The primary tasks for organizations dealing with big data include
designing appropriate processing and handling techniques.
● Organizations need faster insights for big data.
● Hadoop is one of the most widely used tools for big data analytics at
internet and social media companies, e.g. Google, Facebook, Yahoo,
Amazon and Instagram, largely because it is an open-source framework.
● Other tools include IBM BigInsights, SAP HANA, Oracle Big Data
Appliance, Pivotal Big Data Suite, Lumify, Apache Storm and
RapidMiner.
Why Hadoop?
● Challenge with data growth in volume and variety at a high velocity.
● Problems extracting valuable insights from big data for the benefit
of revenue and profit.
● Need for distributed storage machines where big data could be
stored and processed, as traditional data warehousing was not up to
the task.
Please Google this
What companies use big data?
What companies use Hadoop?
What is Hadoop?
● It is an open-source Java-based framework
whose development has been led by the
Apache Software Foundation.
● Used for distributed storage and
processing of large datasets.
● It has a cluster architecture implemented
using commodity hardware.
Components of the Hadoop Framework
● Hadoop Common: collection of utilities and
libraries that support the other Hadoop
modules.
● HDFS: the primary data storage system
used by Hadoop applications.
● YARN: Responsible for job scheduling and
monitoring, and cluster resource
management.
● Hadoop MapReduce: Software framework
for easily writing applications which process
vast amounts of data in-parallel on large
clusters of commodity hardware.
Hadoop Ecosystem
● Made up of several
modules supported by a
large ecosystem of
technologies.
● Includes Apache projects
and various commercial
options to accommodate
the four major activities.
Hadoop Distributed File System (HDFS)
● Hadoop storage is handled by HDFS.
● HDFS replicates each block of data across several nodes, which are
grouped into racks in a cluster.
● HDFS uses a master-slave architecture:
NameNode: Master node that monitors the DataNodes and holds all file
system metadata.
DataNodes: Slave nodes that hold the actual data in the form of blocks.
They regularly report their status to the NameNode through heartbeat
signals.
Secondary NameNode: Keeps a periodically updated checkpoint of the
NameNode's metadata on disk; it is not a standby NameNode.
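The block and replica bookkeeping described above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the 128 MB block size, the replication factor of 3, the DataNode names and the round-robin placement are all assumptions made for illustration, and real HDFS placement is rack-aware.

```python
BLOCK_SIZE_MB = 128   # assumed block size, for illustration only
REPLICATION = 3       # assumed replication factor, for illustration only


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy round-robin placement: each block gets `replication` distinct nodes.

    The returned dict (block id -> list of DataNodes) mimics the kind of
    metadata the NameNode keeps; it is not the real placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
        for block_id in range(num_blocks)
    }


# A hypothetical 300 MB file stored on four hypothetical DataNodes.
blocks = split_into_blocks(300)
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(block_map)
```

Losing any single DataNode in this sketch still leaves two replicas of every block, which is the point of replication.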
Distributed File Systems Vs Parallel File Systems
Distributed file systems (e.g. Hadoop Distributed File System) compared
with parallel file systems (e.g. Lustre, PVFS and GPFS):
● Access: A distributed file system uses a standard network file access
protocol to access storage. A parallel file system requires the
installation of client-based software drivers to access the shared
storage.
● File placement: A distributed file system typically stores a file on a
single storage node (HDFS itself splits large files into blocks held on
different DataNodes). A parallel file system breaks up the file and
stripes the data blocks across multiple storage nodes.
● Deployment: Distributed file system deployments can store data on the
application servers or on centralized servers. Parallel file system
deployments separate the compute and storage servers for performance
reasons.
● Target workloads: Distributed file systems tend to target loosely
coupled, data-heavy applications or active archives. Parallel file
systems focus on high-performance workloads that can benefit from
coordinated I/O access and significant bandwidth.
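To make the idea of striping concrete, here is a minimal Python sketch that deals fixed-size stripes of a byte string out to storage nodes in round-robin order. The function name, the stripe size and the node count are hypothetical; real parallel file systems such as Lustre have their own layout policies.

```python
def stripe(data: bytes, num_nodes: int, stripe_size: int):
    """Distribute fixed-size stripes of `data` round-robin over `num_nodes`."""
    nodes = [[] for _ in range(num_nodes)]
    for offset in range(0, len(data), stripe_size):
        stripe_index = offset // stripe_size
        nodes[stripe_index % num_nodes].append(data[offset:offset + stripe_size])
    return nodes


# Five 2-byte stripes spread over three hypothetical storage nodes.
layout = stripe(b"ABCDEFGHIJ", num_nodes=3, stripe_size=2)
print(layout)
```

Because consecutive stripes sit on different nodes, a client can read or write them at the same time, which is where the bandwidth benefit comes from.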
MapReduce in HDFS Architecture
● Processes huge amounts of data in a parallel and distributed manner.
● In Hadoop 1.x, execution is controlled by one JobTracker and
multiple TaskTrackers.
● The JobTracker coordinates activity by scheduling tasks.
● Each TaskTracker executes tasks and reports to the JobTracker.
MapReduce Intuition
Word count example
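The word count intuition can be sketched in plain Python, without a cluster. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop's Java API; the function names and the sample sentences are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
print(counts)
```

In a real job the map calls run on different nodes against different input splits, and the shuffle moves data over the network so that all values for one key reach the same reducer.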
Yet Another Resource Negotiator (YARN)
● Introduced in Hadoop 2.0 to
address MapReduce limitations.
● Handles the allocation of
computational resources.
● YARN decentralizes the execution
and monitoring of processing jobs
by splitting the responsibilities
between a global ResourceManager,
per-node NodeManagers and a
per-application ApplicationMaster.
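The separation of responsibilities can be illustrated with a toy allocator in Python. ResourceManager and NodeManager are real YARN component names, but the classes below are hypothetical stand-ins: the first-fit policy, the memory figures and the application ids are assumptions, and YARN's actual schedulers are considerably more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for a worker node that hosts containers."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy stand-in for the cluster-wide resource arbiter."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        """Grant a container on the first node with enough free memory.

        Returns the node name, or None when no node can host the request.
        """
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.free_memory_mb -= memory_mb
                node.containers.append((app_id, memory_mb))
                return node.name
        return None


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
placements = [
    rm.allocate("app-1", 3072),
    rm.allocate("app-1", 2048),
    rm.allocate("app-2", 2048),  # no node has 2048 MB left
]
print(placements)
```

The allocator only hands out resources; running and monitoring the tasks inside each container is left to the application, which is the division of labour the slide describes.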
Hadoop 1.0 Vs Hadoop 2.0
Hadoop with Small Data
● Anything that can fit in an Excel file
might be a workable definition of small
data, e.g. text articles scraped from
Wikipedia.
● A small file is one that is smaller than
the HDFS block size, typically 64 or
128 megabytes.
● By being creative, you can use Hadoop
to process small data.
● There are five options for processing
small data with Hadoop: concatenating
text files, Hadoop archives, Parquet,
Hadoop Ozone and Integrate.io.
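The first option, concatenating text files, can be sketched as a local preprocessing step in Python before the merged file is uploaded to HDFS. The function name, directory layout and file names are hypothetical; the sketch uses a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def concatenate_text_files(source_dir, target_file):
    """Append every *.txt file in source_dir to target_file; return the file count."""
    count = 0
    with open(target_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            # Make sure each source file ends on its own line.
            out.write(text if text.endswith("\n") else text + "\n")
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "articles"
    src.mkdir()
    for i in range(3):
        (src / f"article_{i}.txt").write_text(f"text of article {i}", encoding="utf-8")
    merged = Path(tmp) / "merged.txt"
    n_files = concatenate_text_files(src, merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()

print(n_files, merged_lines)
```

One merged file means one set of metadata entries on the NameNode instead of one per small file, which is why this option helps.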
Compatibility with cloud storage modes
● There are three main cloud storage modes: object storage, file
storage and block storage.
● HDFS splits files into blocks, so it resembles block storage,
although it exposes a file system interface.
● Hadoop can integrate with object storage systems such as
Amazon S3 (through the S3A connector), Azure Blob Storage and
OpenStack Swift.
● Hadoop has also been used with cloud file storage systems.
Shortcomings of Hadoop
● Complexity: Many different components and technologies, which can
make it difficult to learn, use, troubleshoot and maintain.
● Performance: Slower than some other technologies with certain types
of data processing tasks. Not always the best solution for real-time
processing or interactive data analysis.
● Cost: Setting up and maintaining a Hadoop cluster can be expensive,
particularly if you need to purchase hardware and software licenses.
● Security: Hadoop was designed primarily for data storage and
processing, so data security received less attention. Features such as
Kerberos authentication are not enabled by default.
Future of Hadoop
● Hadoop has become an important part of the data ecosystem and
is used by various organizations.
● There have been key developments in recent years (Apache Spark,
Apache Flink and integration with other technologies such as Cloud
Computing and Machine Learning).
● Hadoop's capabilities continue to evolve to meet the changing needs
of users.
● Apache Spark is compact and is reported to run up to 100x faster in
memory and 10x faster on disk than Hadoop MapReduce. Its ecosystem
contains well-built features that are continually being improved.