2. TABLE OF CONTENTS
• What is big data?
• Introduction to Hadoop
• Concept of Hadoop
• Introduction to HDFS (Hadoop Distributed File System)
• Introduction to MapReduce
• Architecture of HDFS
• MapReduce workflow
• Terminology of MapReduce
3. WHAT IS BIG DATA?
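• Big data refers to datasets whose size is beyond the ability of
typical database tools to capture, store, manage, and analyze.
• Big data also refers to the ever-increasing volume, velocity, and
variety of data (veracity is often added as a fourth "V").
• Sources of big data: social media sites, search engines, IoT devices, cloud services, etc.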
4. INTRODUCTION TO HADOOP
• Hadoop is used for storing and processing huge data sets on a
cluster of commodity hardware.
• It was created by Doug Cutting (together with Mike Cafarella).
• Hadoop is an open-source framework maintained by the Apache Software Foundation.
5. CONCEPT OF HADOOP
Hadoop has two core components:
• HDFS(Hadoop distributed file system)
• MapReduce
6. HADOOP DISTRIBUTED FILE SYSTEM
• It is a specially designed file system used for storing large
data sets on a cluster of commodity hardware with a streaming
access pattern (WORA: write once, read any number of times
without changing the content).
• The default block size of HDFS is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onward).
• It creates 3 replicas of each data block by default.
• HDFS follows a master-slave architecture.
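The effect of block size and replication can be illustrated with a little arithmetic. The following is a minimal sketch, assuming the 64 MB block size and replication factor of 3 quoted above; the function name `hdfs_footprint` is hypothetical and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted above (Hadoop 1.x)
REPLICATION = 3      # default replica count quoted above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, total block replicas, raw storage in MB)."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across the cluster.
    replicas = blocks * replication
    # A partial last block only occupies its actual size on disk.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(200)
print(blocks, replicas, raw_mb)  # 4 12 600
```

A 200 MB file is therefore split into 4 blocks (three full blocks and one partial block), stored as 12 block replicas that take up about 600 MB of raw cluster storage.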
7. MAPREDUCE
• MapReduce is a framework for processing the data stored in HDFS by
writing programs (map and reduce functions).
• MapReduce divides a job into small tasks and assigns them to
different data nodes. Later, the results are collected in one place
and integrated to form the result data set.
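The divide-and-collect idea can be sketched with the classic word-count example. This is a single-process simulation in plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    pairs = []
    for chunk in chunks:  # each chunk stands in for one split of the input
        pairs.extend(map_phase(chunk))
    grouped = shuffle(pairs)
    # Collect the per-key results into one result data set.
    return dict(reduce_phase(key, values) for key, values in grouped.items())

chunks = ["big data needs hadoop", "hadoop stores big data"]
print(word_count(chunks))
```

Here the two strings play the role of two input splits: each is mapped independently, and the reduce step merges the partial counts per word.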
9. There are 5 services in Hadoop
Master nodes:
• NameNode
• Job Tracker
• Secondary NameNode
Slave nodes:
• DataNode
• Task Tracker
10. NameNode
• Stores all metadata: file names, the location of each block on the DataNodes,
file attributes, etc.
• Keeps the metadata in RAM for fast lookup.
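As a rough illustration of metadata kept in memory, the sketch below models the lookup with two dictionaries. The paths, block IDs, and DataNode names are made up for the example, and the real NameNode data structures are considerably more elaborate.

```python
# Hypothetical in-memory metadata: file name -> blocks, block -> DataNodes.
file_to_blocks = {"/logs/app.log": ["blk_1", "blk_2"]}
block_to_datanodes = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}

def locate(path):
    """Return [(block_id, [datanodes]), ...] for a file using only in-memory dicts."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

for block_id, nodes in locate("/logs/app.log"):
    print(block_id, nodes)
```

Because both lookups are dictionary accesses in RAM, a client can learn where every block of a file lives without any disk read on the master.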
DataNode
• Stores file blocks and maintains replicas of those blocks across different DataNodes.
• Periodically sends a block report to the NameNode that contains
block-related information, such as the blocks it currently holds.
• Periodically sends a heartbeat to the NameNode to show that it is alive and
working properly.
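The heartbeat mechanism can be sketched as a simple liveness table on the master side. This is a minimal illustration, not Hadoop's implementation: the class name `HeartbeatMonitor` is hypothetical, the 30-second timeout is an arbitrary value chosen for the example, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each node (illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

monitor = HeartbeatMonitor(timeout_s=30)  # assumed timeout, not Hadoop's default
monitor.heartbeat("datanode-a", now=0)
monitor.heartbeat("datanode-b", now=0)
monitor.heartbeat("datanode-a", now=25)   # only datanode-a keeps reporting
print(monitor.dead_nodes(now=40))  # ['datanode-b']
```

A node that stops sending heartbeats is eventually treated as failed, at which point the master can arrange for the blocks it held to be re-replicated from the surviving copies.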
Secondary NameNode
• Behaves like a helper node for the NameNode.
• It is also called the checkpoint node, because it periodically merges the
NameNode's edit log into the file system image.
Job Tracker
• Assigns tasks to the Task Trackers.
• Manages all Task Trackers.
Task Tracker
• Accepts and performs all tasks assigned by the Job Tracker.
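The relationship between the Job Tracker and the Task Trackers can be sketched as a simple assignment loop. The round-robin policy below is an assumption made for illustration only; a real Job Tracker also takes data locality and free task slots into account. The function name `assign_tasks` and the task and tracker names are hypothetical.

```python
from itertools import cycle

def assign_tasks(tasks, task_trackers):
    """Hand out tasks to task trackers in round-robin order (illustrative policy)."""
    assignment = {tracker: [] for tracker in task_trackers}
    # cycle() repeats the tracker list; zip() stops when the tasks run out.
    for task, tracker in zip(tasks, cycle(task_trackers)):
        assignment[tracker].append(task)
    return assignment

plan = assign_tasks(
    ["map-0", "map-1", "map-2", "reduce-0"],
    ["tracker-1", "tracker-2"],
)
print(plan)
# {'tracker-1': ['map-0', 'map-2'], 'tracker-2': ['map-1', 'reduce-0']}
```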