12 Components of Hadoop Ecosystem
The popularity of Hadoop has grown in the last few years, because
it meets the needs of many organizations for flexible data analysis
capabilities with an unmatched price-performance curve.
The Hadoop ecosystem is continuously growing to meet the needs
of Big Data. Let’s understand the role of each component of the
Hadoop ecosystem.
The Hadoop ecosystem comprises the following 12 components:
Hadoop HDFS
HBase
SQOOP
Flume
Apache Spark
Hadoop MapReduce
Pig
Impala
Hive
Cloudera Search
Oozie
Hue
Now, let's check how these components work as part of an ecosystem.
HADOOP ECOSYSTEM - COMPONENTS (overview diagram):
- Data Ingestion: Sqoop, Flume
- Data Analysis: Pig, Impala, Hive
- Data Exploration: Cloudera Search, Hue
- Hadoop Core: HDFS (Distributed File System), YARN (Cluster Resource Management)
- NoSQL: HBase
- Workflow System: Oozie
What does each of these
components actually do?
1. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
- A storage layer for Hadoop
- Provides file permissions and authentication
- Offers streaming access to file system data
- Hadoop provides a command-line interface to interact with HDFS
- Suitable for distributed storage and processing
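The command-line interface mentioned above can be sketched with a few common commands (illustrative only; this assumes a running Hadoop installation, and the paths and file names are placeholders):

```shell
hdfs dfs -mkdir -p /user/demo/input        # create a directory in HDFS
hdfs dfs -put local.csv /user/demo/input   # copy a local file into HDFS
hdfs dfs -ls /user/demo/input              # list files in the directory
hdfs dfs -cat /user/demo/input/local.csv   # stream the file's contents
```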
2. HBASE
- A NoSQL (non-relational) database
- Stores its data in HDFS
- Mainly used when you need random, real-time read/write access to your Big Data
- Supports high data volumes and high throughput
- A table can have thousands of columns
3. SQOOP
- A tool designed to transfer data between Hadoop and relational database servers
- Used to import data from relational databases such as Oracle and MySQL into HDFS, and to export data from HDFS back to relational databases
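An import/export pair can be sketched as follows (hypothetical: the JDBC host, database, tables, username, and HDFS paths are placeholders, and a reachable database plus a Hadoop cluster are assumed):

```shell
# Import a relational table into HDFS
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/demo/orders

# Export HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders_summary \
  --export-dir /user/demo/orders_summary
```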
4. FLUME
- A distributed service for ingesting streaming data
- Ideally suited for event data from multiple systems
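A Flume agent is wired together through a properties file. The minimal sketch below (agent name `a1` and the netcat source are illustrative choices) routes one source through a memory channel to a logging sink:

```properties
# One source -> one memory channel -> one logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-delimited events on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: write events to the agent's log
a1.sinks.k1.type = logger

# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```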
5. SPARK
- An open-source cluster computing framework
- Supports machine learning, business intelligence, streaming, and batch processing
- Can deliver up to 100 times faster performance than MapReduce for some (notably in-memory) workloads
- Its stack includes Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX
6. HADOOP MAPREDUCE
- The original Hadoop processing engine, primarily Java-based
- Based on the map and reduce programming model
- An extensive and mature fault-tolerant framework
- Still commonly used
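The map and reduce programming model can be sketched in plain Python, no Hadoop required (the word-count task and the function names here are illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big ideas", "big data pipelines"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(pairs)
print(word_counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

In a real MapReduce job the framework runs many mappers in parallel over HDFS blocks, then shuffles the pairs so each reducer sees all counts for its keys.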
7. PIG
- An open-source dataflow system
- An alternative to writing MapReduce code
- Converts Pig scripts into MapReduce code
- Best for ad-hoc operations such as join and filter
8. IMPALA
- A high-performance SQL engine that runs on a Hadoop cluster
- Ideal for interactive analysis
- Very low latency, measured in milliseconds
- Supports a dialect of SQL (Impala SQL)
9. HIVE
- Similar to Impala
- Executes queries using MapReduce
- Best for data processing and ETL
10. CLOUDERA SEARCH
- One of Cloudera's near-real-time access products
- A fully integrated data processing platform
- Enables non-technical users to search and explore data stored in, or ingested into, Hadoop and HBase
- Users do not need SQL or programming skills to use it
11. OOZIE
- A workflow/coordination system used to manage Hadoop jobs
- Consists of two main parts: the Oozie Workflow Engine and the Oozie Coordinator Engine
- (Diagram: a workflow chains actions in sequence, e.g. Start → Action1 → Action2 → Action3 → End)
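An Oozie workflow is defined in an XML file. The minimal sketch below shows the Start → Action → End shape from the diagram (the workflow name, action name, and `${jobTracker}`/`${nameNode}` parameters are placeholders):

```xml
<!-- Hypothetical minimal workflow.xml: Start -> one MapReduce action -> End -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="mr-action"/>
  <action name="mr-action">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```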
12. HUE
- An open-source web interface for analyzing data with Hadoop
- Provides SQL editors for Hive, Impala, MySQL, Oracle, PostgreSQL, Spark SQL, and Solr SQL
DO YOU WANT TO BE HADOOP CERTIFIED?
THANK YOU!
If you like this presentation, please share it.
