This document provides information about J.Ayeesha Parveen, her class details, and the staff in charge. It then summarizes Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets. Key aspects of Hadoop include its distributed file system (HDFS), the MapReduce processing model, and components such as the NameNode, DataNodes, JobTracker, and TaskTracker. Common uses of Hadoop include analytics on audio, video, and log files.
This presentation discusses the following topics:
Hadoop Distributed File System (HDFS)
How does HDFS work?
HDFS Architecture
Features of HDFS
Benefits of using HDFS
Examples: Target Marketing
HDFS data replication
A comparison of RDBMS and Hadoop on parameters such as data variety, data storage, querying, cost, schema, speed, data objects, hardware profile, and use cases, along with their benefits and limitations.
The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
Introduction to Big Data & Hadoop Architecture - Module 1 (Rohit Agrawal)
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions to the Big Data problem, how Hadoop solves it, the common Hadoop ecosystem components, the Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of a file write and read.
A quick comparison of Hadoop and Apache Spark, with a detailed introduction. Hadoop and Apache Spark are both big-data frameworks, but they serve different purposes and do different things.
The data management industry has matured over the last three decades, primarily on the basis of relational database management system (RDBMS) technology. As the volume, variety, and velocity of data collected and analyzed in enterprises has increased severalfold, organisations have begun to struggle with the architectural limitations of traditional RDBMS designs. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of “Big Data”. In this paper we trace the origin of one such system, Hadoop, built to handle Big Data.
Big Data refers to collections of large and complex data sets that cannot be processed using regular database management tools or processing applications. Handling Big Data raises challenges in capture, curation, storage, search, sharing, analysis, and visualization. The Apache Hadoop software library, in turn, is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Big Data certification is one of the most recognized credentials today.
This Hadoop tutorial is a comprehensive guide to Big Data and Hadoop that covers what Hadoop is, why Apache Hadoop is needed, why it is so popular, and how it works.
This is a presentation on Hadoop basics. Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
2. Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is best known for MapReduce, its distributed file system (HDFS), and large-scale data processing.
4. Introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and accessible to non-specialists, but the projects covered here were chosen because they provide core functionality and speed in Hadoop; together they are known as the Hadoop ecosystem.
9. NameNode:
Master of the system
Maintains and manages the blocks that are present on the DataNodes
DataNodes:
Slaves deployed on each machine that provide the actual storage
Responsible for serving read and write requests from clients
JobTracker:
Takes care of all job scheduling and assigns tasks to TaskTrackers
TaskTracker:
A node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker
10. The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
A large data file is distributed into blocks
Blocks are managed by different nodes in the cluster
Each block is replicated on multiple nodes
The NameNode stores metadata about files and blocks (see the client sketch below)
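To make this concrete, here is a minimal sketch of an HDFS client in Java. It assumes a hadoop-client dependency on the classpath and a hypothetical NameNode address (hdfs://namenode:9000); the file path, buffer size, and block size are illustrative only. The client writes a file with a replication factor of 3, then asks the NameNode which DataNodes hold each block's replicas:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/sample.txt");

            // create(path, overwrite, bufferSize, replication, blockSize):
            // each 128 MB block of this file will be replicated on 3 DataNodes.
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, 128 * 1024 * 1024L)) {
                out.writeUTF("hello hdfs");
            }

            // The NameNode holds only metadata: which blocks make up the file
            // and which DataNodes store each replica.
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation loc :
                     fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("block at offset " + loc.getOffset()
                        + " stored on " + String.join(", ", loc.getHosts()));
            }
        }
    }
}

Note that the client never asks the DataNodes where data lives: the block-to-node mapping comes entirely from the NameNode, which is exactly the master/slave split described on slide 9.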
11. The Mapper:
Each block is processed in isolation by a map task called a mapper
The map task runs on the node where the block is stored
The Reducer:
Consolidates the results from the different mappers
Produces the final output
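The canonical illustration of this mapper/reducer split is word count. The sketch below follows the standard Hadoop MapReduce WordCount example (org.apache.hadoop.mapreduce API): each map task tokenizes the text in its block and emits (word, 1) pairs, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: runs once per input split (typically one HDFS block),
    // ideally on the node where that block is stored.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reducer: consolidates the counts emitted by all mappers for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this is typically launched with hadoop jar wordcount.jar WordCount <input dir> <output dir>, where both paths live in HDFS.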
12. HBase - Hadoop database for random read/write access (sketched below)
Hive - SQL-like queries and tables on large datasets
Pig - Data-flow language and compiler
Oozie - Workflow scheduler for interdependent Hadoop jobs
Sqoop - Integration of databases and data warehouses with Hadoop
Flume - Configurable streaming data collection
ZooKeeper - Coordination service for distributed applications
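As a taste of one of these components, here is a minimal sketch of random read/write access through the HBase Java client API. It assumes an hbase-client dependency on the classpath and an already-created table named users with a column family cf; the table, row key, and qualifier names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Random write: one cell addressed by row key, column family, and qualifier.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            String name = Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}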
13. Feature | RDBMS | Hadoop
Data Variety | Mainly structured data | Structured, semi-structured, and unstructured data
Data Storage | Average-size data (GBs) | Large data sets (TBs and PBs)
Querying | SQL | HQL (Hive Query Language)
Schema | Required on write (static schema) | Required on read (dynamic schema)
Speed | Reads are fast | Both reads and writes are fast
Cost | Licensed | Free
Use Case | OLTP (online transaction processing) | Analytics (audio, video, logs, etc.), data discovery
Data Objects | Works on relational tables | Works on key/value pairs
Throughput | Low | High
Scalability | Vertical | Horizontal
Hardware Profile | High-end servers | Commodity/utility hardware
Integrity | High (ACID) | Low