Hadoop Development Series
By Sandeep Patil
11/1/2017 1Footer Text
Introduction to Big Data and Hadoop
What is Big Data?
• A very large amount of data.
• It is a popular term for the exponential growth of data.
• Big data is difficult to collect, store, maintain, analyze, and visualize.
Big Data characteristics
• Volume: the large amount of data.
• Velocity: the rate at which data is generated.
• Variety: the different types of data.
- Structured data, e.g. MySQL tables
- Semi-structured data, e.g. XML, JSON
- Unstructured data, e.g. text, audio, video
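The three kinds of variety can be illustrated with a small sketch. The sample records below (the name, city, and tags) are invented for illustration and do not come from the slides; the point is only how much schema each form carries.

```python
import csv
import io
import json

# Structured: fixed columns, as a relational table (e.g. MySQL) would export them.
structured = io.StringIO("id,name,city\n1,Asha,Pune\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing JSON; fields can be nested or optional.
semi_structured = json.loads('{"id": 1, "name": "Asha", "tags": ["new", "mobile"]}')

# Unstructured: free text with no schema; any structure has to be extracted.
unstructured = "Asha from Pune called about her new mobile plan."
word_count = len(unstructured.split())

print(rows[0]["city"])             # column lookup by name
print(semi_structured["tags"][1])  # nested field access
print(word_count)                  # crude feature extracted from raw text
```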
Big Data sources
• Social Media
• Banks
• Instruments
• Websites
• Stock Market
Use cases of Big Data
• Recommendation engines
• Analyzing Call Detail Records (CDR)
• Fraud detection
• Market basket analysis
• Sentiment analysis
Hadoop Introduction
• An open-source framework for distributed processing of large datasets on a cluster of commodity hardware.
• Hadoop is a data management tool that uses scale-out storage: capacity grows by adding servers to the cluster.
Defining Hadoop Cluster
• The size of the data is the most important factor when defining a Hadoop cluster.
• 5 servers with 10 TB of storage each: total storage capacity 50 TB
• 7 servers with 10 TB of storage each: total storage capacity 70 TB
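The capacity arithmetic above can be sketched in a few lines. The function names are hypothetical. The second function goes one step beyond the slides: it assumes HDFS stores each block three times (the default replication factor), so usable space is roughly a third of raw space.

```python
def raw_capacity_tb(servers: int, tb_per_server: float) -> float:
    """Total raw storage of a cluster whose servers all have the same capacity."""
    return servers * tb_per_server


def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Rough usable space when every block is stored `replication` times.

    Assumption (not from the slides): replication factor 3, the HDFS default.
    Ignores space reserved for the OS, logs, and temporary data.
    """
    return raw_tb / replication


print(raw_capacity_tb(5, 10))   # the 5-server cluster
print(raw_capacity_tb(7, 10))   # scaled out to 7 servers
print(round(usable_capacity_tb(raw_capacity_tb(7, 10)), 1))
```

Scaling out from 5 to 7 servers adds 20 TB of raw capacity without replacing any existing machine, which is the point of scale-out storage.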
Hadoop Components
• Hadoop 1 components
- HDFS (Hadoop Distributed File System)
- MapReduce
• Hadoop 2 components
- HDFS (Hadoop Distributed File System)
- YARN/MRv2
• HDFS: storage (reads and writes)
• MR/YARN: processing
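To make the "processing" half concrete, here is a minimal in-memory sketch of the MapReduce model using word count. It is plain Python, not the Hadoop API: the function names are hypothetical, and a real job would read its input from HDFS and run the map and reduce tasks on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data hadoop", "hadoop stores big data"])
print(counts)
```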
Hadoop Daemons
• Hadoop 1 daemons
- HDFS: NameNode, DataNode, Secondary NameNode
- MapReduce: JobTracker, TaskTracker
Hadoop Daemons
• Hadoop 2 daemons
- HDFS: NameNode, DataNode, Secondary NameNode
- YARN: ResourceManager, NodeManager
Hadoop Master-Slave Architecture
• HDFS: NameNode (master), DataNode (slave)
• MR/YARN: ResourceManager (master), NodeManager (slave)
Hadoop Cluster
• Assume that we have a Hadoop cluster with 4 nodes.
• Master: NameNode, ResourceManager
• Slave: DataNode, NodeManager
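The 4-node layout can be written down as a small data structure. This sketch assumes one master and three slaves, which the slide does not state explicitly, and the node names (`node1` to `node4`) are hypothetical.

```python
MASTER_DAEMONS = ("NameNode", "ResourceManager")
SLAVE_DAEMONS = ("DataNode", "NodeManager")


def cluster_layout(total_nodes, masters=1):
    """Map each node name to its role and the daemons it runs.

    Assumption: the first `masters` nodes are masters and the rest are slaves.
    """
    layout = {}
    for i in range(1, total_nodes + 1):
        is_master = i <= masters
        layout[f"node{i}"] = {
            "role": "master" if is_master else "slave",
            "daemons": MASTER_DAEMONS if is_master else SLAVE_DAEMONS,
        }
    return layout


layout = cluster_layout(4)
for name, info in layout.items():
    print(name, info["role"], ", ".join(info["daemons"]))
```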
Secondary NameNode
• The Secondary NameNode is not a hot backup for the NameNode.
• It only takes an hourly backup (checkpoint) of the NameNode metadata.
• It can be used to restart a crashed Hadoop cluster.
• The Secondary NameNode is an important daemon in Hadoop 1; in Hadoop 2 it is less important.
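Why a periodic checkpoint is not a hot backup can be shown with a toy model. The sketch below treats the NameNode metadata as a namespace snapshot (the fsimage) plus a log of later changes (the edit log), and a checkpoint as replaying the log onto the snapshot. The file paths, operation names, and the `apply_edits` helper are all hypothetical, and real HDFS metadata is far richer than a dictionary.

```python
def apply_edits(fsimage, edits):
    """Replay logged operations on a copy of the namespace snapshot."""
    image = dict(fsimage)
    for op, path in edits:
        if op == "create":
            image[path] = "file"
        elif op == "delete":
            image.pop(path, None)
    return image


# Snapshot and edit log at the time of the hourly checkpoint.
fsimage = {"/data/a.txt": "file"}
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
checkpoint = apply_edits(fsimage, edits)

# A change made after the checkpoint is not part of it.
late_edit = ("create", "/data/c.txt")

print(sorted(checkpoint))
print(late_edit[1] in checkpoint)
```

Restarting from the checkpoint alone restores the namespace as of the last checkpoint, so changes made after it are missing unless the edit log also survives.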
Modes of Operation
• Standalone
• Pseudo-distributed
• Fully distributed
Next Video
• Comparison between Hadoop 1 and Hadoop 2
sdp117@gmail.com
