Module 2 – Part 2
Understanding Hadoop Ecosystem:
Hadoop Ecosystem, Hadoop Distributed File System,
MapReduce, Hadoop YARN, Introducing HBase (database), Combining
HBase and HDFS, Hive (query language), Pig (scripting) and Pig Latin,
Sqoop, ZooKeeper (coordination), Flume (data ingestion), Oozie (workflow scheduling)
Understanding Hadoop Ecosystem
Hadoop
• Hadoop is not Big Data itself; however, Hadoop plays an
integral part in almost all Big Data processes.
• Hadoop is a ‘software library’ that allows its users to process large
datasets across distributed clusters of computers, thereby enabling
them to gather, store and analyse huge sets of data.
• Hadoop provides various tools and technologies, collectively termed
as Hadoop ecosystem, to enable the development and deployment of
Big Data solutions.
Hadoop Ecosystem
• A framework of various types of complex and evolving tools and components.
• Defined as a “comprehensive collection of tools and technologies that can be
effectively implemented and deployed to provide Big Data solutions in a
cost-effective manner.”
• The base of Hadoop is HDFS and MapReduce (similar to a building resting on a
strong foundation).
• MapReduce and the Hadoop Distributed File System (HDFS) are the two core
components of the Hadoop ecosystem; however, on their own they are not
sufficient to deal with Big Data challenges.
• Along with MapReduce and HDFS, the Hadoop ecosystem provides a
collection of various elements to support the complete development and
deployment of Big Data solutions.
Hadoop Distributed File System (HDFS) Architecture
HDFS Architecture
• HDFS has a master-slave architecture.
• It comprises a Namenode, which works as the master,
and Datanodes, which work as slaves.
• The Namenode is the central component of the
HDFS system. If the Namenode goes down,
the whole Hadoop cluster becomes
inaccessible and is considered dead.
• A Datanode stores the actual data and
works as instructed by the Namenode.
• A Hadoop file system can have
multiple Datanodes but only one
active Namenode.
HDFS Architecture- Namenode
• The Namenode maintains and manages the Datanodes and
assigns tasks to them.
• The Namenode does not contain the actual data of files.
• The Namenode stores metadata about the actual data, such as the
filename, path, number of data blocks, block IDs, block locations,
number of replicas and other slave-related information.
• The Namenode manages all client requests (read, write) for the
actual data files.
• The Namenode executes file system namespace operations such as
opening and closing files and renaming files and directories.
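This metadata can be inspected from a client program. The following is a minimal sketch, assuming a running cluster and a hypothetical file /user/demo/sample.txt, that asks the Namenode for a file's status and block locations through the org.apache.hadoop.fs API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamenodeMetadataDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // client handle backed by the Namenode
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path, for illustration only

        // File-level metadata kept by the Namenode
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Length:      " + status.getLen());
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size:  " + status.getBlockSize());

        // Block-level metadata: which Datanodes hold each block
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " on hosts " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}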
HDFS Architecture- Datanode
• Datanodes are responsible for storing the actual data.
• Upon instruction from the Namenode, they perform operations
such as creation, replication and deletion of data blocks.
• When one of the Datanodes goes down, it does not affect the
Hadoop cluster, because the data is replicated.
• All Datanodes in the Hadoop cluster are synchronized in such a
way that they can communicate with each other for various
operations.
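The Datanodes known to the Namenode, their capacity and their liveness can be inspected from the command line. As an illustrative sketch (the path is only an example, and admin privileges are required for the first command):

hdfs dfsadmin -report                             # lists live and dead Datanodes with capacity and usage
hdfs fsck /user/demo -files -blocks -locations    # shows which Datanodes hold each block of the files under this path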
What happens if one of the Datanodes fails in HDFS?
• The Namenode periodically receives a heartbeat and a block report
from each Datanode in the cluster.
• Every Datanode sends a heartbeat message to the Namenode every
3 seconds.
• The heartbeat is simply information telling the Namenode whether a
particular Datanode is working properly or not.
• A block report of a particular Datanode contains information about
all the blocks that reside on that Datanode.
• When the Namenode does not receive any heartbeat message from a
particular Datanode for 10 minutes (by default), that Datanode is
considered dead or failed by the Namenode.
• Since its blocks are now under-replicated, the system starts
re-replicating them from one Datanode to another, using the block
information from the block report of the failed Datanode.
• The data for replication transfers directly from one Datanode to
another; it never passes through the Namenode.
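The 3-second heartbeat and the roughly 10-minute dead-node timeout are configurable. As a hedged sketch (property names as found in recent Hadoop 2.x/3.x releases; verify against your version's hdfs-default.xml), the relevant hdfs-site.xml entries and the timeout formula look like this:

<!-- hdfs-site.xml: values shown are the usual defaults -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>            <!-- seconds between Datanode heartbeats -->
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>       <!-- milliseconds; Namenode recheck period (5 minutes) -->
</property>

<!-- Dead-node timeout = 2 * recheck-interval + 10 * heartbeat interval
     = 2 * 5 min + 10 * 3 s = 10.5 minutes (the "10 minutes by default" above) -->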
Hadoop Specific File System Types
HDFS Commands
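The command listing on this slide was not extracted; as an illustrative sketch, the most common HDFS shell commands look like this (all paths are only examples):

hdfs dfs -ls /                            # list the root of the file system
hdfs dfs -mkdir /user/demo                # create a directory
hdfs dfs -put local.txt /user/demo/       # copy a local file into HDFS
hdfs dfs -get /user/demo/local.txt .      # copy a file from HDFS to the local disk
hdfs dfs -cat /user/demo/local.txt        # print a file's contents
hdfs dfs -setrep 3 /user/demo/local.txt   # change a file's replication factor
hdfs dfs -rm -r /user/demo                # delete a directory recursively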
The org.apache.hadoop.io package
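The org.apache.hadoop.io package provides Writable wrapper types (Text, IntWritable, LongWritable, and so on) that Hadoop uses for compact binary serialization of MapReduce keys and values. A minimal sketch of serializing and deserializing two such values, assuming only the standard Hadoop client library on the classpath:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) throws Exception {
        // Writable wrappers as used for MapReduce keys and values
        Text word = new Text("hadoop");
        IntWritable count = new IntWritable(42);

        // Serialize both values into a compact binary form
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        word.write(out);
        count.write(out);

        // Deserialize them back into fresh objects
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text wordCopy = new Text();
        IntWritable countCopy = new IntWritable();
        wordCopy.readFields(in);
        countCopy.readFields(in);

        System.out.println(wordCopy + " -> " + countCopy.get());  // prints: hadoop -> 42
    }
}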
