Hadoop Foundation for Analytics
1. Bon Secours College for Women
Accredited with A++ Grade by NAAC in Cycle-II
Recognized under Sections 2(f) and 12(B) of the UGC Act
Vilar Bypass, Thanjavur.
Dr. M. FLORENCE DAYANA
Assistant Professor
Department of Computer Applications
Year: 2023-2024    Class: II M.Sc. CS
Semester: III
Course: Big Data Analytics (PP22CSCC31)
Unit: IV
Hadoop Foundation for Analytics
2. History of Hadoop
Features
Key Advantages of Hadoop
Why Hadoop
Versions of Hadoop
Essentials of the Hadoop Ecosystem
RDBMS versus Hadoop
Key Aspects of Hadoop
Components of Hadoop
Hadoop Foundation for Analytics
3. • Hadoop was created by Doug Cutting and Mike
Cafarella in 2005.
• It was originally developed to support distribution
for the Nutch search engine project.
• In 2006, Hadoop was spun out of Nutch at Yahoo!,
and today it is maintained and distributed by the
Apache Software Foundation (ASF).
History of Hadoop
4. Handles massive quantities of structured, semi-structured,
and unstructured data using commodity hardware
Has a shared-nothing architecture
Replicates data across multiple computers (replicas)
Optimized for high throughput rather than low latency
Batch oriented, so response time is not immediate
Complements OLTP and OLAP
Not a replacement for an RDBMS
Not good when work cannot be parallelized
Not good for processing large numbers of small files
Features
5. Key Advantages of Hadoop
1. Stores data in its native form (HDFS)
No structure is imposed while keying in or storing data;
HDFS is schema-less
Structure is imposed on the data only when it needs to
be processed (schema-on-read)
2. Scalable
Can store and distribute very large data sets across
hundreds of inexpensive servers that operate in
parallel
3. Cost Effective
Has a much lower cost per terabyte of storage and
processing
6. Key Advantages of Hadoop
4. Resilient to Failure
Fault tolerant: practices replication of data. When
data is sent to a node, it is also replicated to others.
5. Flexibility
Works with all types of data. Helps derive
meaningful insight from email, social media, and
clickstream data.
Can be put to several purposes such as log analysis,
data mining, recommendation systems, marketing
campaign analysis, etc.
6. Fast
Extremely fast, because it moves code to the data
rather than data to the code.
8. Versions of Hadoop
• Hadoop 1.0
  - Data storage framework
  - Data processing framework
• Hadoop 2.0
9. Hadoop 1.0
Data storage framework:
• HDFS is schema-less. Stores data files in just about any format.
• Stores files close to their original form.
Data processing framework:
• Uses two functions, MAP and REDUCE, to process data.
• "Mappers" take in a set of key-value pairs and generate
intermediate data.
• "Reducers" act on this intermediate data to produce the output.
The two functions work in isolation, enabling processing that is
highly distributed, highly parallel, fault tolerant, and scalable
(a minimal word-count sketch follows this slide).
Versions of Hadoop
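To make the Map/Reduce division concrete, here is a minimal sketch of the classic word-count job against Hadoop's Java MapReduce API. The class names and the input/output paths are illustrative, not part of the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: takes (byte offset, line) pairs and emits (word, 1)
  // intermediate pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the mappers and reducers share no state, the framework can run many copies of each in parallel across the cluster and simply re-run any task that fails.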
10. Limitations
• Required MapReduce programming expertise, along with
proficiency in programming languages such as Java.
• Supported only batch processing, which suits tasks such as
log analysis and large-scale data mining projects.
• Was tightly coupled to MapReduce for computation. Other
tools either had to rewrite their functionality in MapReduce
so that it could be executed on Hadoop, or extract the data
from HDFS and process it outside of Hadoop. Neither option
was viable, and moving data in and out of the Hadoop
cluster led to processing inefficiencies.
Hadoop 1.0
11. • HDFS continues to be the data storage framework.
• Yet Another Resource Negotiator (YARN) has been
added.
• Any application capable of dividing itself into
parallel tasks is supported by YARN.
• YARN coordinates the allocation of the subtasks of
the submitted applications, thereby enhancing the
flexibility, scalability, and efficiency of the
applications.
Hadoop 2.0
12. • It works by having a per-application ApplicationMaster
in place of the JobTracker, with running applications
using resources governed by a new NodeManager.
• MapReduce programming expertise is no longer
required.
• It supports batch processing as well as real-time
processing.
• Data processing functions such as data standardisation
and master data management can now be performed in
HDFS (see the configuration sketch after this slide).
Hadoop 2.0
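As a small client-side illustration of Hadoop 2.0, the sketch below shows the standard configuration key that directs a MapReduce job to YARN rather than the Hadoop 1.x JobTracker. The class and job names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmissionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // On Hadoop 2.x, "yarn" routes the job to the ResourceManager;
    // "local" runs it in a single JVM for testing.
    conf.set("mapreduce.framework.name", "yarn");
    Job job = Job.getInstance(conf, "yarn-demo"); // hypothetical job name
    job.setJarByClass(YarnSubmissionSketch.class);
    System.out.println("Submitting via: " + conf.get("mapreduce.framework.name"));
  }
}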
13. Supports projects that enhance the functionality of the
Hadoop core components.
The Eco projects:
• HIVE
• PIG
• SQOOP
• HBASE
• FLUME
• OOZIE
• MAHOUT
Essentials of the Hadoop Ecosystem
14. Essentials of the Hadoop Ecosystem
The Eco projects are:
• HIVE: Enables analysis of large data sets using a language
similar to standard ANSI SQL, and provides access to data
stored on a Hadoop cluster (a JDBC sketch follows the next
slide).
• PIG: An easy-to-understand data flow language that helps
with the analysis of large data sets. Even without proficiency
in MapReduce, the data in the Hadoop cluster can be
analysed, because PIG scripts are automatically converted
into MapReduce jobs by the PIG interpreter.
• SQOOP: Used to transfer bulk data between Hadoop and
structured data stores such as an RDBMS.
15. • HBASE: Hadoop's database. It compares well with an
RDBMS and supports structured data storage for large
tables.
• FLUME: A distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible
architecture.
• OOZIE: A workflow scheduler system to manage Apache
Hadoop jobs.
• MAHOUT: A scalable machine learning and data mining
library.
Essentials of the Hadoop Ecosystem
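Since HIVE exposes a SQL-like interface, a client can query it much like any database. Below is a minimal sketch using the HiveServer2 JDBC driver; the endpoint, credentials, and the "words" table are assumptions for illustration, not from the slides, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // HiveQL resembles ANSI SQL; the "words" table is hypothetical.
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS freq FROM words GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
      }
    }
  }
}

Behind the scenes, Hive compiles such a query into jobs that run on the cluster, which is why a SQL-literate analyst can work with HDFS data without writing MapReduce code.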
16. RDBMS versus HADOOP

Parameter | RDBMS | HADOOP
System | Relational Database Management System | Node-based flat structure
Data | Suitable for structured data | Suitable for structured and unstructured data; supports a variety of data formats in real time, such as XML, JSON, and text-based flat file formats
Processing | OLTP | Analytical, big data processing
Choice | When the data needs consistent relationships | Big data processing, which does not require any consistent relationships between data
Processor | Needs expensive hardware or high-end processors to store huge volumes of data | In a HADOOP cluster, a node requires only a processor, a network card, and a few hard drives
Cost | Around $10,000 to $14,000 per terabyte of storage | Around $4,000 per terabyte of storage
17. 1. Open Source Software
• It is free to download, use, and contribute to.
2. Framework
• Everything required to develop and execute an application
is provided: programs, tools, etc.
3. Distributed
• Divides and stores data across multiple computers.
• Computation/processing is done in parallel across multiple
connected nodes.
4. Massive Storage
• Stores colossal amounts of data across nodes of low-cost
commodity hardware.
5. Faster Processing
• Large amounts of data are processed in parallel, yielding
quick responses.
Key Aspects of Hadoop
19. Core Components
HDFS (storage component):
• Distributes data across several nodes (see the read/write
sketch below)
• Natively redundant
MapReduce (computational framework):
• Splits a task across several nodes
• Processes data in parallel
Hadoop Ecosystem:
• HIVE
• PIG
• SQOOP
• HBASE
• FLUME
• OOZIE
• MAHOUT
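To ground the storage side, here is a minimal sketch of writing and then reading a file through HDFS's Java FileSystem API. The file path and contents are illustrative, and the cluster settings are assumed to come from the usual core-site.xml/hdfs-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and related settings from the site config files.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt"); // illustrative path
    try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite
      out.write("Hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // HDFS replicates each block across DataNodes (default replication
    // factor 3), which is the "natively redundant" property above.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}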