HADOOP OVERVIEW &
ARCHITECTURE
BY
CHANDINI SANS
CONTENTS
1. Why Hadoop?
2. Importance of Hadoop
3. What’s in Hadoop?
4. Apache Hadoop ecosystem
5. Hadoop architecture
6. Hadoop MapReduce
7. HDFS
8. Advantages of Hadoop
COST PER GIGABYTE
STORAGE TRENDS
ISSUES WITH LARGE DATA
• Map Parallelism: Chunking input data
• Reduce Parallelism: Grouping related
data
• Dealing with failures & load imbalance
• Doug Cutting and Mike Cafarella developed an
open-source project called Hadoop in 2005,
and Doug named it after his son's toy elephant.
• Hadoop has become one of the most talked-about
technologies.
• Why? One of the top reasons is its ability to handle
huge amounts of data – any kind of data – quickly.
With volumes and varieties of data growing each
day, especially from social media and automated
sensors, that’s a key consideration for most
organizations. 
• Hadoop is an open-source software framework
for storing and processing big data in a
distributed fashion on large clusters of
commodity hardware.
• Essentially, it accomplishes two tasks:
- massive data storage
- faster processing
• Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
WHO USES HADOOP?
WHY IS HADOOP IMPORTANT?
• Low cost : The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Computing power : Its distributed computing model
can quickly process very large volumes of data.
• Scalability : You can easily grow your system simply by
adding more nodes.
• Storage flexibility : You can store as much data as you
want and decide how to use it later.
• Inherent data protection and self-healing
capabilities : Data and application processing are protected
against hardware failure.
WHAT’S IN HADOOP?
• HDFS – the Java-based distributed file system that can
store all kinds of data without prior organization.
• MapReduce – a software programming model for
processing large sets of data in parallel.
• YARN – a resource management framework for
scheduling and handling resource requests from distributed
applications.
COMPONENTS THAT HAVE ACHIEVED TOP-
LEVEL APACHE PROJECT STATUS
• Pig – a platform for manipulating data stored in HDFS. It
consists of a compiler for MapReduce programs and a
high-level language called Pig Latin.
• Hive – a data warehousing and SQL-like query language
that presents data in the form of tables. Hive programming
is similar to database programming. (It was initially
developed by Facebook.)
• HBase – a non-relational, distributed database that runs
on top of Hadoop. HBase tables can serve as input and
output for MapReduce jobs.
• ZooKeeper – an application that coordinates distributed
processes.
• Ambari – a web interface for managing, configuring
and testing Hadoop services and components.
• Flume – software that collects, aggregates and moves
large amounts of streaming data into HDFS.
• Sqoop – a connection and transfer mechanism that
moves data between Hadoop and relational databases.
• Oozie – a Hadoop job scheduler.
HADOOP ARCHITECTURE
• The Hadoop framework includes the following four modules:
• Hadoop Common : These are Java libraries and
utilities required by other Hadoop modules. These
libraries provide filesystem and OS-level abstractions
and contain the necessary Java files and scripts
required to start Hadoop.
• Hadoop YARN : This is a framework for job
scheduling and cluster resource management.
• Hadoop Distributed File System (HDFS) : A
distributed file system that provides high-throughput
access to application data.
• Hadoop MapReduce : This is a YARN-based system
for parallel processing of large data sets.
COMPONENTS OF HADOOP
FRAMEWORK:
HADOOP MAP REDUCE
WHAT IS MAP REDUCE?
• Hadoop runs applications using the MapReduce
algorithm, in which data is processed in parallel
on different CPU nodes.
• A MapReduce program executes in three stages:
the map stage, the shuffle stage, and the reduce stage.
STAGES OF MAP REDUCE
• Map stage : The map stage's job is to process the input data,
which is stored as files or directories in the Hadoop
Distributed File System (HDFS) and is passed to the mapper
function line by line. The mapper processes the data and
creates several small chunks of data.
• Reduce stage : This stage is the combination of
the shuffle stage and the reduce stage. The reducer's
job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which is
stored back in HDFS (illustrated in the sketch below).
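As an illustration of these stages, here is a minimal sketch of the classic word-count job using Hadoop's Java MapReduce API. The class names are illustrative; input and output HDFS paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line arrives as (byte offset, line of text); emit (word, 1) per word.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: the shuffle groups all counts for one word; sum them and write (word, total).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job; results land in the HDFS output directory.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```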
MAP REDUCE ARCHITECTURE
THINK MAP REDUCE
• Record = (Key, Value)
• Key : Comparable, Serializable
• Value : Serializable
• Input, Map, Shuffle, Reduce, Output
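In Hadoop's Java API, these requirements translate to the Writable interfaces: keys implement WritableComparable (serializable and orderable for the shuffle sort), while values only need Writable. A minimal sketch of a custom key type; the class name is illustrative:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: a year that can be serialized and sorted during the shuffle.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public void set(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);               // serialization for network/disk
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();              // deserialization on the receiving node
    }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);  // ordering used by the shuffle/sort
    }
}
```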
MAP
• Input: (Key1, Value1)
• Output: List(Key2, Value2)
• Projections, Filtering, Transformation
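A minimal sketch of such a map, assuming tab-separated input lines with a hypothetical (user, URL, status) layout; it keeps only successful requests (filtering) and emits two of the three fields (projection):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// (Key1, Value1) = (byte offset, line of text); List(Key2, Value2) = (user, URL) pairs.
public class AccessLogMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length >= 3 && "200".equals(fields[2])) {            // filter: keep status 200
            context.write(new Text(fields[0]), new Text(fields[1]));   // project: user, URL
        }
    }
}
```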
HDFS
• Data is organized into files and
directories
• Files are divided into uniform-sized
blocks (default 128 MB) and distributed
across cluster nodes
• Blocks are replicated to handle hardware
failure
• Replication for performance and fault
tolerance (Rack-Aware placement)
• HDFS keeps checksums of data for
corruption detection and recovery
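The block layout and replication factor of a file can be inspected from Java through the HDFS FileSystem API. A minimal sketch, assuming the cluster configuration is on the classpath and the file path is passed as an argument:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster described by core-site.xml / hdfs-site.xml on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);   // an existing file in HDFS

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());

        // One entry per block, listing the DataNodes that hold its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + " -> " + String.join(",", b.getHosts()));
        }

        // The replication factor can also be changed per file.
        fs.setReplication(file, (short) 3);
    }
}
```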
FEATURES OF HDFS
• It is suitable for distributed storage and
processing.
• Hadoop provides a command interface to
interact with HDFS.
• The built-in servers of the NameNode and
DataNodes help users easily check the status of
the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and
authentication.
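Permissions can likewise be read and changed programmatically; a short sketch using the same FileSystem API, with an illustrative path and mode:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// HDFS enforces POSIX-style owner/group/other permissions on files and directories.
public class PermissionsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]);
        System.out.println("current: " + fs.getFileStatus(p).getPermission());
        fs.setPermission(p, new FsPermission("640"));   // rw-r-----
    }
}
```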
HDFS ARCHITECTURE
• NameNode : The NameNode is software that can run on commodity
hardware. The machine running the NameNode acts as the
master server and does the following tasks:
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as renaming,
closing, and opening files and directories.
• DataNode : DataNodes manage the data storage of the system and
do the following tasks:
- Perform read-write operations on the file system, as per
client request.
- Perform operations such as block creation, deletion, and
replication.
• Block : User data in HDFS is stored in files. Each file is divided
into one or more segments, which are stored on individual
DataNodes; these segments are called blocks. For example, with
the default 128 MB block size, a 512 MB file is split into four blocks.
MASTER-SLAVE
ARCHITECTURE
GOALS OF HDFS
• Fault detection and recovery :
Since HDFS includes a large amount of commodity
hardware, failure of components is frequent. Therefore,
HDFS should have mechanisms for quick, automatic
fault detection and recovery.
• Huge datasets :
HDFS should scale to hundreds of nodes per cluster to
manage applications with huge datasets.
• Hardware at data :
A requested task can be done efficiently when the
computation takes place near the data. Especially where huge
datasets are involved, this reduces network traffic and
increases throughput.
ADVANTAGES OF HADOOP
• The Hadoop framework allows the user to quickly write and
test distributed systems.
• The Hadoop library itself detects and handles failures at the
application layer.
• Servers can be added or removed from the cluster
dynamically and Hadoop continues to operate without
interruption.
• Apart from being open source, it is compatible with all
platforms since it is Java-based.
Thank
You…!!!
