Hadoop Foundation for Analytics
1. Bon Secours College for Women
Accredited with A++ Grade by NAAC in Cycle-II
Recognized under Sections 2(f) and 12(B) of the UGC Act
Vilar Bypass, Thanjavur.
Dr. M. FLORENCE DAYANA
Assistant Professor
Department of Computer Applications
Year: 2023-2024    Class: II M.Sc. CS
Semester: III
Course: Big Data Analytics (PP22CSCC31)
Unit: IV
Hadoop Foundation for Analytics
2. History of Hadoop
Features
Key Advantages of Hadoop
Why Hadoop
Versions of Hadoop
Essentials of the Hadoop Ecosystem
RDBMS versus Hadoop
Key Aspects of Hadoop
Components of Hadoop
Hadoop Foundation for Analytics
3. • Hadoop was created by Doug Cutting and Mike
Cafarella in 2005.
• It was originally developed to support distribution
for the Nutch search engine project.
• In 2006, Hadoop was spun out of Nutch at Yahoo!,
and today it is maintained and distributed by the
Apache Software Foundation (ASF).
History of Hadoop
4. Handles massive quantities of structured, semi-structured,
and unstructured data using commodity hardware
Has a shared-nothing architecture
Replicates data across multiple computers (replicas)
Optimized for high throughput rather than low latency
Batch oriented, so response time is not immediate
Complements OLTP and OLAP
Not a replacement for an RDBMS
Not good when work cannot be parallelized
Not good for processing large numbers of small files
Features
5. Key Advantages of Hadoop
1. Stores data in its native form (HDFS)
No structure is imposed while keying in or storing data;
HDFS is schema-less
Structure is imposed on the data only when it needs to
be processed (schema-on-read)
2. Scalable
Can store and distribute very large data sets across
hundreds of inexpensive servers that operate in
parallel
3. Cost Effective
Has a much lower cost per terabyte of storage and
processing
6. Key Advantages of Hadoop
4. Resilient to Failure
Fault tolerant: practices replication of data. When
data is sent to a node, it is also replicated to others.
5. Flexibility
Works with all types of data. Helps derive
meaningful insight from email, social media, and
clickstream data.
Can be put to several purposes such as log analysis,
data mining, recommendation systems, marketing
campaign analysis, etc.
6. Fast
Extremely fast, because it moves code to the data
rather than data to the code.
8. Versions of Hadoop
• Hadoop 1.0
  - Data storage framework
  - Data processing framework
• Hadoop 2.0
9. Hadoop 1.0
Data storage framework:
• HDFS is schema-less. Stores data files in just about any format.
• Stores files close to their original form.
Data processing framework:
• Uses two functions, MAP and REDUCE, to process data.
• "Mappers" take in a set of key-value pairs and generate
intermediate data.
• "Reducers" act on this intermediate data to produce the output.
The two functions work in isolation, enabling processing that is
highly distributed, highly parallel, fault tolerant, and scalable
(a minimal word-count sketch follows this slide).
Versions of Hadoop
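To make the Map/Reduce division concrete, here is a minimal sketch of the classic word-count job against Hadoop's Java MapReduce API. The class names and the input/output paths are illustrative, not part of the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: takes (byte offset, line) pairs and emits (word, 1)
  // intermediate pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the mappers and reducers share no state, the framework can run many copies of each in parallel across the cluster and simply re-run any task that fails.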
10. Limitations
• Required MapReduce programming expertise, along with
proficiency in programming languages such as Java.
• Supported only batch processing, which suits tasks such as
log analysis and large-scale data mining projects.
• Was tightly coupled to MapReduce for computation. Other
tools either had to rewrite their functionality in MapReduce
so that it could be executed on Hadoop, or extract the data
from HDFS and process it outside of Hadoop. Neither option
was viable, and moving data in and out of the Hadoop
cluster led to processing inefficiencies.
Hadoop 1.0
11. • HDFS continues to be the data storage framework.
• Yet Another Resource Negotiator (YARN) has been
added.
• Any application capable of dividing itself into
parallel tasks is supported by YARN.
• YARN coordinates the allocation of the subtasks of
the submitted applications, thereby enhancing the
flexibility, scalability, and efficiency of the
applications.
Hadoop 2.0
12. • It works by having a per-application ApplicationMaster
in place of the JobTracker, with running applications
using resources governed by a new NodeManager.
• MapReduce programming expertise is no longer
required.
• It supports batch processing as well as real-time
processing.
• Data processing functions such as data standardisation
and master data management can now be performed in
HDFS (see the configuration sketch after this slide).
Hadoop 2.0
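As a small client-side illustration of Hadoop 2.0, the sketch below shows the standard configuration key that directs a MapReduce job to YARN rather than the Hadoop 1.x JobTracker. The class and job names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmissionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // On Hadoop 2.x, "yarn" routes the job to the ResourceManager;
    // "local" runs it in a single JVM for testing.
    conf.set("mapreduce.framework.name", "yarn");
    Job job = Job.getInstance(conf, "yarn-demo"); // hypothetical job name
    job.setJarByClass(YarnSubmissionSketch.class);
    System.out.println("Submitting via: " + conf.get("mapreduce.framework.name"));
  }
}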
13. Supports projects that enhance the functionality of the
Hadoop core components.
The Eco projects:
• HIVE
• PIG
• SQOOP
• HBASE
• FLUME
• OOZIE
• MAHOUT
Essentials of the Hadoop Ecosystem
14. Essentials of the Hadoop Ecosystem
The Eco projects are:
• HIVE: Enables analysis of large data sets using a language
similar to standard ANSI SQL, and provides access to data
stored on a Hadoop cluster (a JDBC sketch follows the next
slide).
• PIG: An easy-to-understand data flow language that helps
with the analysis of large data sets. Even without proficiency
in MapReduce, the data in the Hadoop cluster can be
analysed, because PIG scripts are automatically converted
into MapReduce jobs by the PIG interpreter.
• SQOOP: Used to transfer bulk data between Hadoop and
structured data stores such as an RDBMS.
15. • HBASE: Hadoop's database. It compares well with an
RDBMS and supports structured data storage for large
tables.
• FLUME: A distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible
architecture.
• OOZIE: A workflow scheduler system to manage Apache
Hadoop jobs.
• MAHOUT: A scalable machine learning and data mining
library.
Essentials of the Hadoop Ecosystem
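Since HIVE exposes a SQL-like interface, a client can query it much like any database. Below is a minimal sketch using the HiveServer2 JDBC driver; the endpoint, credentials, and the "words" table are assumptions for illustration, not from the slides, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // HiveQL resembles ANSI SQL; the "words" table is hypothetical.
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS freq FROM words GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
      }
    }
  }
}

Behind the scenes, Hive compiles such a query into jobs that run on the cluster, which is why a SQL-literate analyst can work with HDFS data without writing MapReduce code.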
16. RDBMS versus HADOOP

Parameter | RDBMS | HADOOP
System | Relational Database Management System | Node-based flat structure
Data | Suitable for structured data | Suitable for structured and unstructured data; supports a variety of data formats in real time, such as XML, JSON, and text-based flat file formats
Processing | OLTP | Analytical, big data processing
Choice | When the data needs consistent relationships | Big data processing, which does not require any consistent relationships between data
Processor | Needs expensive hardware or high-end processors to store huge volumes of data | In a HADOOP cluster, a node requires only a processor, a network card, and a few hard drives
Cost | Around $10,000 to $14,000 per terabyte of storage | Around $4,000 per terabyte of storage
17. 1. Open Source Software
• It is free to download, use, and contribute to.
2. Framework
• Everything required to develop and execute an application
is provided: programs, tools, etc.
3. Distributed
• Divides and stores data across multiple computers.
• Computation/processing is done in parallel across multiple
connected nodes.
4. Massive Storage
• Stores colossal amounts of data across nodes of low-cost
commodity hardware.
5. Faster Processing
• Large amounts of data are processed in parallel, yielding
quick responses.
Key Aspects of Hadoop
19. Core Components
HDFS (storage component):
• Distributes data across several nodes (see the read/write
sketch below)
• Natively redundant
MapReduce (computational framework):
• Splits a task across several nodes
• Processes data in parallel
Hadoop Ecosystem:
• HIVE
• PIG
• SQOOP
• HBASE
• FLUME
• OOZIE
• MAHOUT
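To ground the storage side, here is a minimal sketch of writing and then reading a file through HDFS's Java FileSystem API. The file path and contents are illustrative, and the cluster settings are assumed to come from the usual core-site.xml/hdfs-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and related settings from the site config files.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt"); // illustrative path
    try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite
      out.write("Hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // HDFS replicates each block across DataNodes (default replication
    // factor 3), which is the "natively redundant" property above.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}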