Hadoop-2022.pptx

Advanced Topics in Computer Science
Name Registration Number
Ben Wycliff Mugalu 2022/HD05/1315U
Sudi Murindanyi 2022/HD05/5583X
Paul Mutambuze 2022/HD05/1452U
Group: AT 12

Presentation outline
● Big data
● Hadoop
● Features of Hadoop
● Hadoop Distributed File System
● Distributed Vs Parallel File systems
● MapReduce
● YARN
● Shortcomings of Hadoop
● The future of Hadoop

Big Data
Big data is a term used to describe data that is too large and complex to
store and process in a traditional database.

Big Data Analytics
● The primary tasks for organizations dealing with big data include
designing appropriate processing and handling techniques.
● Organizations need faster insights from big data.
● Hadoop is one of the most widely used tools for big data analytics at
large web and social media companies, e.g. Google, Facebook, Yahoo,
Amazon, Instagram, etc., largely because it is an open-source framework.
● Other tools include IBM BigInsights, SAP HANA, Oracle Big Data
Appliance, Pivotal Big Data Suite, Lumify, Apache Storm,
RapidMiner, etc.

Why Hadoop?
● Data is growing in volume and variety at a high velocity.
● Extracting valuable insights from big data, for the benefit of revenue
and profit, is difficult.
● Distributed storage machines were needed to store and process big data,
because traditional data warehousing was not up to the task.
Please Google this
What companies use big data?
What companies use Hadoop?

What is Hadoop?
● It is an open-source, Java-based framework whose development is led by
the Apache Software Foundation.
● It is used for distributed storage and processing of large datasets.
● It has a cluster architecture implemented using commodity hardware.

Components of the Hadoop Framework
● Hadoop Common: a collection of utilities and libraries that support the
other Hadoop modules.
● HDFS: the primary data storage system used by Hadoop applications.
● YARN: responsible for job scheduling and monitoring, and for cluster
resource management.
● Hadoop MapReduce: a software framework for easily writing applications
that process vast amounts of data in parallel on large clusters of
commodity hardware.

Hadoop Ecosystem
● Made up of several
modules supported by a
large ecosystem of
technologies.
● Includes Apache projects
and various commercial
options to accommodate
the four major activities.

Hadoop Distributed File System (HDFS)
● Hadoop storage is handled by HDFS.
● HDFS replicates data, storing multiple copies on nodes that are grouped
into racks in a cluster.
● HDFS uses a master-slave architecture:
  ○ NameNode: the master node, which monitors the DataNodes and holds all
  metadata.
  ○ DataNodes: the slave nodes, which hold the actual data in the form of
  blocks. They frequently send status updates to the NameNode through
  heartbeat signals.
  ○ Secondary NameNode: keeps a checkpoint copy of the NameNode's metadata
  on disk.
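
The block and replica idea above can be sketched in a few lines of Python. This is only an illustrative simulation, not HDFS code: the 128 MB block size, the replication factor of 3, the DataNode names, and the round-robin placement are all assumptions made for the example (real HDFS placement is rack-aware).

```python
BLOCK_SIZE_MB = 128   # assumed block size for this sketch
REPLICATION = 3       # assumed replication factor for this sketch


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplified stand-in for the real, rack-aware placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


# NameNode-style metadata: which (hypothetical) DataNodes hold each block.
blocks = split_into_blocks(300)
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Here a 300 MB file becomes three blocks, and `block_map` plays the role of the metadata the NameNode keeps, while the block contents themselves would live on the DataNodes.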

Distributed File Systems Vs Parallel File Systems
| Distributed File Systems (Hadoop Distributed File System) | Parallel File Systems (Lustre, PVFS and GPFS) |
|---|---|
| Use a standard network file access protocol to access storage | Require the installation of client-based software drivers to access the shared storage |
| Store a file on a single storage node | Break up the file and stripe the data blocks across multiple storage nodes |
| Deployments can store data on the application servers or on centralized servers | Deployments separate the compute and storage servers for performance reasons |
| Tend to target loosely coupled, data-heavy applications or active archives | Focus on high-performance workloads that can benefit from coordinated I/O access and significant bandwidth |

MapReduce in HDFS Architecture
● Used to process huge amounts of data in a parallel and distributed
manner.
● In Hadoop 1, execution is controlled by one JobTracker and multiple
TaskTrackers.
● The JobTracker coordinates activity by scheduling tasks.
● Each TaskTracker executes tasks and reports to the JobTracker.

MapReduce Intuition
Word count example
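
As a rough illustration of the word count example, the three MapReduce phases can be imitated in plain Python on a single machine. This is a sketch of the idea only: it does not use the Hadoop API, the function names are made up for the example, and the three input lines are sample data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


result = word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
```

In a real cluster, each input split would be mapped on a different node and the shuffle would move data across the network; here everything runs in one process so the data flow is easy to follow.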

Yet Another Resource Negotiator (YARN)
● Introduced in Hadoop 2.0 to address the limitations of the original
MapReduce engine.
● Takes care of provisioning computational resources.
● YARN decentralizes the execution and monitoring of processing jobs by
separating resource management from job scheduling and monitoring.

Hadoop with Small Data
● Anything that can fit in an Excel file might be a workable definition of
small data, e.g. text articles scraped from Wikipedia.
● In HDFS, a small file is one that is smaller than the block size,
typically 64 or 128 megabytes.
● By being creative, you can use Hadoop to process small data.
● There are five options for processing small data with Hadoop:
concatenating text files, Hadoop archives, Parquet, Hadoop Ozone, and
Integrate.io.
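
The first option, concatenating text files, can be sketched as below. This is a minimal local example under stated assumptions: the helper name `concatenate_text_files` is hypothetical, the files are UTF-8 text with one record per line, and the merged file would still have to be uploaded to HDFS as a separate step.

```python
from pathlib import Path


def concatenate_text_files(input_paths, output_path):
    """Merge many small text files into one larger file.

    Returns the number of input files that were merged.
    """
    merged = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            text = Path(path).read_text(encoding="utf-8")
            out.write(text)
            # Keep records from different files on separate lines.
            if text and not text.endswith("\n"):
                out.write("\n")
            merged += 1
    return merged
```

Merging before upload means the cluster stores one file closer to the block size instead of many files that are each far smaller than a block.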

Compatibility with cloud storage modes
● There are three main cloud storage modes: object storage, file
storage, and block storage.
● HDFS stores data in blocks, which makes Hadoop a block-based storage
system.
● Hadoop can integrate with object storage systems such as Amazon S3
(through the S3A connector), Azure Blob Storage, and OpenStack Swift.
● Hadoop has also been used with cloud file storage systems.

Shortcomings of Hadoop
● Complexity: Many different components and technologies, which can
make it difficult to learn, use, troubleshoot and maintain.
● Performance: Slower than some other technologies with certain types
of data processing tasks. Not always the best solution for real-time
processing or interactive data analysis.
● Cost: Setting up and maintaining a Hadoop cluster can be expensive,
particularly if you need to purchase hardware and software licenses.
● Security: Hadoop focuses on data storage, and data security has been
neglected to some extent.

Future of Hadoop
● Hadoop has become an important part of the data ecosystem and
is used by various organizations.
● There have been key developments in recent years (Apache Spark,
Apache Flink and integration with other technologies such as Cloud
Computing and Machine Learning).
● Its capabilities continue to evolve to meet the changing needs of users.
● Apache Spark is compact and is reported to run up to 100x faster in
memory and 10x faster on disk than Hadoop MapReduce. Its ecosystem
contains well-built features that are continually being improved.
