Introduction to Big Data and Hadoop

Hadoop Development Series
By Sandeep Patil
11/1/2017

What is Big Data?
• A large amount of data.
• It is a popular term for the exponential growth of data.
• Big data is difficult to collect, store, maintain, analyze, and visualize.

Big Data characteristics
• Volume:
Large amount of data.
• Velocity:
The rate at which data is generated.
• Variety:
Different types of data
- Structured data, e.g. MySQL tables
- Semi-structured data, e.g. XML, JSON
- Unstructured data, e.g. text, audio, video
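
The three kinds of data listed under Variety can be illustrated with a short Python sketch. The records, column names, and text are made up for illustration; only the standard library is used.

```python
import json
import xml.etree.ElementTree as ET

# Structured: a fixed schema, like one row of a MySQL table
# (the column names here are hypothetical).
columns = ("id", "name", "city")
row = (1, "Asha", "Pune")
structured = dict(zip(columns, row))

# Semi-structured: self-describing records whose fields may vary.
json_record = json.loads('{"id": 2, "name": "Ravi", "tags": ["new", "mobile"]}')
xml_record = ET.fromstring('<user id="3"><name>Meera</name></user>')

# Unstructured: no schema; any structure has to be derived by analysis.
text = "Great phone, but the battery drains fast."
word_count = len(text.split())

print(structured["city"])
print(json_record["tags"])
print(xml_record.get("id"), xml_record.find("name").text)
print(word_count)
```

Structured data can be addressed by column name, semi-structured data has to be parsed before its fields are known, and unstructured data yields only what the analysis computes (here, a simple word count).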

Big Data sources
• Social Media
• Banks
• Instruments
• Websites
• Stock Market

Use cases of Big Data
• Recommendation engines
• Analyzing Call Detail Records (CDR)
• Fraud Detection
• Market Basket Analysis
• Sentiment Analysis

Hadoop Introduction
• An open-source framework that allows distributed processing of large datasets across a cluster of commodity hardware.
• Hadoop is a data management tool that uses scale-out storage: capacity grows by adding servers to the cluster.

Defining Hadoop Cluster
• The size of the data is the most important factor when defining a Hadoop cluster.
5 servers with 10 TB of storage capacity each
Total storage capacity: 5 × 10 TB = 50 TB

Defining Hadoop Cluster
7 servers with 10 TB of storage capacity each
Total storage capacity: 7 × 10 TB = 70 TB
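
The sizing arithmetic on these two slides can be sketched in Python. The function names are hypothetical. The replication adjustment is not part of the slides: it assumes HDFS keeps 3 copies of every block (its default replication factor) and ignores any other overhead, so treat it as a rough estimate.

```python
def raw_capacity_tb(servers, tb_per_server):
    """Raw storage of the cluster in TB: servers x capacity per server."""
    return servers * tb_per_server


def usable_capacity_tb(servers, tb_per_server, replication=3):
    """Rough usable storage in TB, assuming every block is stored
    `replication` times (assumption: HDFS default of 3)."""
    return raw_capacity_tb(servers, tb_per_server) / replication


print(raw_capacity_tb(5, 10))  # 5 servers x 10 TB
print(raw_capacity_tb(7, 10))  # 7 servers x 10 TB, after scaling out
print(round(usable_capacity_tb(5, 10), 2))
```

Adding two servers raises the raw capacity from 50 TB to 70 TB without changing the existing machines, which is the scale-out idea.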

Hadoop Components
• Hadoop 1 components
- HDFS (Hadoop Distributed File System)
- MapReduce
• Hadoop 2 components
- HDFS (Hadoop Distributed File System)
- YARN/MRv2
• Roles
- HDFS: storage (reads and writes)
- MR/YARN: processing
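
MapReduce, the processing component named on this slide, splits a job into a map step and a reduce step. The following is a minimal single-process sketch of that idea in plain Python. It does not use any Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
print(counts)
```

On a real cluster the map and reduce steps run in parallel on different nodes over data stored in HDFS. Here everything runs in one process to keep the flow visible.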

Hadoop Daemons
• Hadoop 1 daemons
- HDFS: NameNode, DataNode, Secondary NameNode
- MapReduce: JobTracker, TaskTracker

Hadoop Daemons
• Hadoop 2 daemons
- HDFS: NameNode, DataNode, Secondary NameNode
- YARN: ResourceManager, NodeManager

Hadoop Master-Slave Architecture
Layer     Master            Slave
HDFS      NameNode          DataNode
MR/YARN   ResourceManager   NodeManager

Hadoop Cluster
• Assume that we have a Hadoop cluster with 4 nodes
- Master: NameNode, ResourceManager
- Slave: DataNode, NodeManager
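
The daemon placement for such a cluster can be written down as a small Python sketch. The slide does not say how the 4 nodes are divided, so the split into 1 master and 3 slaves is an assumption, and the host names (node1 to node4) and the build_layout helper are hypothetical.

```python
# Daemons per role, as listed on the slide.
DAEMONS = {
    "master": ["NameNode", "ResourceManager"],
    "slave": ["DataNode", "NodeManager"],
}


def build_layout(masters, slaves):
    """Map each host name to the list of daemons it runs."""
    layout = {host: list(DAEMONS["master"]) for host in masters}
    layout.update({host: list(DAEMONS["slave"]) for host in slaves})
    return layout


# Assumed split: 1 master and 3 slaves (not stated on the slide).
layout = build_layout(["node1"], ["node2", "node3", "node4"])
for host, daemons in layout.items():
    print(f"{host}: {', '.join(daemons)}")
```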

Secondary NameNode
• The Secondary NameNode is not a hot backup for the NameNode.
• It only takes an hourly backup (checkpoint) of the NameNode metadata.
• That backup can be used to restart a crashed Hadoop cluster.
• The Secondary NameNode is an important daemon in Hadoop 1; in Hadoop 2 it is less important.

Modes of Operation
• Standalone
• Pseudo-Distributed
• Fully Distributed

Next Video
• Comparison between Hadoop 1 and Hadoop 2

Like and Subscribe
sdp117@gmail.com
