Run Your First Hadoop 2.x Program

© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com

Session Objectives
This session will help you to:
ᗍ Understand
• Introduction to Big Data
• Introduction to Hadoop 2.x
• HDFS Fundamentals
• MapReduce & YARN
• Hive Introduction

What is Big Data
ᗍ Big Data refers to data sets so large, complex or unstructured that they become difficult to manage and process with traditional RDBMS tools
ᗍ Roughly 2.5 quintillion bytes of data are created every day; 90% of the world's collected data has been generated in the last 2 years alone
ᗍ Data sizes are now measured in terabytes, petabytes, exabytes and zettabytes
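
To put these units in perspective, the short Python sketch below converts the daily figure quoted above into larger units. It assumes decimal (SI) prefixes, where each step up is a factor of 1,000; the names `UNITS` and `to_unit` are made up for this illustration.

```python
# Decimal (SI) byte units; each step up is a factor of 1,000.
UNITS = {
    "TB": 10**12,  # terabyte
    "PB": 10**15,  # petabyte
    "EB": 10**18,  # exabyte
    "ZB": 10**21,  # zettabyte
}

def to_unit(num_bytes, unit):
    """Express a byte count in the given unit."""
    return num_bytes / UNITS[unit]

daily_bytes = 2_500_000_000_000_000_000  # 2.5 quintillion bytes, the figure quoted above

print(to_unit(daily_bytes, "PB"))  # 2500.0
print(to_unit(daily_bytes, "EB"))  # 2.5
```

In other words, 2.5 quintillion bytes per day is about 2.5 exabytes per day under these assumptions.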

Traditional Systems vs. New Systems
Traditional Systems | New Systems
Not scalable enough to meet new business demands | Scalable to meet new business demands
Cannot process massive data at high speed | Can process massive data at high speed
Can only be scaled up, not scaled out | Can be scaled up and scaled out
Cost of system, processing and data management is not economical | Cost of system, processing and data management is economical

Introduction to Hadoop
ᗍ Hadoop is a framework for storing, processing and analysing Big Data
ᗍ Allows distributed storage and distributed processing of large data sets across clusters of commodity computers
using a simple programming model
ᗍ It is an Apache Open Source framework

Key Features of Hadoop
ᗍ Based on a master-slave architecture
ᗍ Designed for massive scale
ᗍ Highly available system
ᗍ Low software and hardware costs
ᗍ Distributed storage and processing achieve high performance
ᗍ No license costs; supported by a very large developer community

Hadoop Ecosystem
ᗍ Applications and engines
• Pig (data flow)
• MapReduce (batch)
• Hive (SQL)
• Others (Cascading)
• Real-time stream and graph processing (Storm, Giraph)
• Services (HBase)
ᗍ Tez (execution engine)
ᗍ YARN (cluster resource management)
ᗍ HDFS (redundant, reliable storage)

Hadoop Core Components
ᗍ Distributed data storage framework (HDFS)
ᗍ Distributed data processing framework (MapReduce on YARN)

Hadoop Architecture
ᗍ HDFS - Storage
• NameNode
• DataNode
• Secondary NameNode
ᗍ MapReduce on YARN - Processing
• ResourceManager
• NodeManager

HDFS Architecture
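ᗍ NameNode: holds the metadata (name, replicas, ...), for example /home/foo/data, 3,…
ᗍ Client: sends metadata ops to the NameNode, and reads from and writes to the DataNodes directly
ᗍ DataNodes: store the blocks and are spread across racks (Rack 1, Rack 2); they receive block ops from the NameNode
ᗍ Replication: blocks are replicated from one DataNode to another, across racks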
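
The relationship between a file, its blocks and their replicas can be sketched in a few lines of Python. This is a toy model, not the Hadoop API: the function `plan_blocks`, the DataNode names and the 300 MB file size are invented for illustration, the 128 MB block size is assumed (it is the Hadoop 2.x default), and the round-robin placement is a deliberate simplification of the rack-aware placement that real HDFS performs.

```python
import math
from itertools import cycle

BLOCK_SIZE_MB = 128   # assumed block size (the Hadoop 2.x default)
REPLICATION = 3       # replica count, as in the metadata example

def plan_blocks(file_size_mb, datanodes, replication=REPLICATION):
    """Return a toy NameNode-style map: block index -> DataNodes holding a replica."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    ring = cycle(datanodes)
    # Simplified round-robin placement; real HDFS placement is rack-aware.
    return {b: [next(ring) for _ in range(replication)] for b in range(num_blocks)}

# Hypothetical DataNode names spread over two racks.
datanodes = ["rack1-dn1", "rack1-dn2", "rack2-dn1", "rack2-dn2"]

# Toy metadata for a hypothetical 300 MB file: path -> (replication, block locations)
metadata = {"/home/foo/data": (REPLICATION, plan_blocks(300, datanodes))}

for block, holders in metadata["/home/foo/data"][1].items():
    print(block, holders)
```

With these assumed values the file occupies three blocks, and each block is listed on three different DataNodes.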

HDFS File Read Operation
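ᗍ Components: the HDFS client (Distributed File System and FS Data Input Stream) runs in the client JVM on the client node; the NameNode runs on the admin node; each DataNode runs on a slave node
1. Open: the HDFS client opens the file through the Distributed File System
2. Get block locations: the Distributed File System asks the NameNode where the blocks are
3. Read: the client reads through the FS Data Input Stream
4. Read: the stream reads the first block from a DataNode
5. Read: the stream reads the next block from another DataNode
6. Close: the client closes the stream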
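
The read sequence can be mimicked with a small in-memory simulation. The sketch below is a toy model under stated assumptions, not the Hadoop client API: the classes `NameNode` and `DataNode`, the function `read_file`, the block ids and the path `/demo/names.txt` are all hypothetical.

```python
class NameNode:
    """Toy NameNode: knows which DataNodes hold each block of a file."""
    def __init__(self, block_map):
        self.block_map = block_map  # path -> list of (block_id, [datanode names])

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    """Toy DataNode: stores block contents keyed by block id."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(path, namenode, datanodes):
    # Steps 1-2: open the file and fetch the block locations from the NameNode.
    locations = namenode.get_block_locations(path)
    data = b""
    # Steps 3-5: read each block in order, directly from a DataNode that holds it.
    for block_id, holders in locations:
        data += datanodes[holders[0]].read_block(block_id)
    # Step 6: close (nothing to release in this toy model).
    return data

datanodes = {
    "dn1": DataNode({"blk_1": b"Jack Bill Joe\n"}),
    "dn2": DataNode({"blk_1": b"Jack Bill Joe\n", "blk_2": b"Don Don Joe\n"}),
    "dn3": DataNode({"blk_2": b"Don Don Joe\n"}),
}
namenode = NameNode(
    {"/demo/names.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3", "dn2"])]}
)

print(read_file("/demo/names.txt", namenode, datanodes))
```

The point of the sketch is that the NameNode only hands out locations; the file contents themselves come from the DataNodes.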

HDFS File Write Operation
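ᗍ Components: the HDFS client and Distributed File System on the client side; the NameNode on the admin node; a pipeline of DataNodes on slave nodes, which store the blocks
1. Create: the HDFS client asks the Distributed File System to create the file
2. Create: the Distributed File System asks the NameNode to create the file entry
3. Write: the client writes the data
4. Write packet: each packet is forwarded along the pipeline of DataNodes
5. Ack packet: each DataNode in the pipeline acknowledges the packet back along the pipeline
6. Close: the client closes the file
7. Complete: the NameNode is told that the write is complete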
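
The packet pipeline in steps 4 and 5 can be illustrated with a toy simulation. This is a sketch under stated assumptions rather than the real HDFS protocol: each DataNode is modelled as a plain Python list, the function names `write_packet` and `write_file` are invented, and the 4-byte packet size is an arbitrary toy value.

```python
def write_packet(packet, pipeline):
    """Send one packet down the pipeline; return True once every node has acked."""
    if not pipeline:
        return True  # past the last DataNode: nothing left to forward to
    first, rest = pipeline[0], pipeline[1:]
    first.append(packet)               # step 4: this DataNode stores the packet
    return write_packet(packet, rest)  # forward downstream; the ack returns upstream (step 5)

def write_file(data, pipeline, packet_size=4):
    """Split the data into packets and write each one through the pipeline."""
    packets = [data[i:i + packet_size] for i in range(0, len(data), packet_size)]
    # The write counts as complete only when every packet has been acked.
    return all(write_packet(p, pipeline) for p in packets)

# Three hypothetical DataNodes, each modelled as a list of stored packets.
dn1, dn2, dn3 = [], [], []
ok = write_file(b"Jack Bill Joe", [dn1, dn2, dn3])

print(ok)                 # True
print(b"".join(dn1))      # b'Jack Bill Joe'
print(dn1 == dn2 == dn3)  # True
```

Every DataNode in the pipeline ends up with the same packets, which is how the replicas are created during the write.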

Traditional Solution
ᗍ Very big data is divided into several pieces (Split Data)
ᗍ grep runs on each split separately and produces that split's matches
ᗍ cat concatenates the matches from every split into the final output (All matches)
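
The split, grep and cat pattern can be sketched in plain Python. The helper names `split_data`, `grep` and `cat` mirror the labels in the diagram but are otherwise hypothetical, and the three sample lines are a stand-in for a very large input.

```python
import re

def split_data(lines, num_splits):
    """Divide the input lines into roughly equal contiguous chunks."""
    size = max(1, -(-len(lines) // num_splits))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def grep(pattern, chunk):
    """Return the lines of one chunk that match the pattern."""
    return [line for line in chunk if re.search(pattern, line)]

def cat(match_lists):
    """Concatenate the per-chunk matches into one list."""
    return [line for matches in match_lists for line in matches]

lines = ["Jack Bill Joe", "Don Don Joe", "Jack Don Bill"]  # stand-in for very big data
chunks = split_data(lines, 3)
all_matches = cat(grep("Jack", chunk) for chunk in chunks)

print(all_matches)  # ['Jack Bill Joe', 'Jack Don Bill']
```

In this approach the splitting, the scheduling of each grep and the final concatenation all have to be managed by hand.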

MapReduce Solution
ᗍ Very big input is divided into several pieces (Split Data)
ᗍ The MapReduce framework runs MAP on each split
ᗍ REDUCE combines the map outputs into the final output (All matches)

Understanding MapReduce Paradigm
ᗍ Stages: Input → Splitting → Mapping → Shuffling → Reducing → Final Result
ᗍ Key/value types: (K1, V1) → List(K2, V2) → (K2, List(V2)) → List(K3, V3)

MapReduce Word Count Process Flow
ᗍ Input
• Jack Bill Joe
• Don Don Joe
• Jack Don Bill
ᗍ Splitting: one split per input line
ᗍ Mapping
• Split 1: (Jack, 1), (Bill, 1), (Joe, 1)
• Split 2: (Don, 1), (Don, 1), (Joe, 1)
• Split 3: (Jack, 1), (Don, 1), (Bill, 1)
ᗍ Shuffling
• Bill, (1,1)
• Don, (1,1,1)
• Jack, (1,1)
• Joe, (1,1)
ᗍ Reducing and Final Result
• Bill, 2
• Don, 3
• Jack, 2
• Joe, 2
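
The word count flow can be reproduced with a short, self-contained Python simulation. This is a sketch of the paradigm only, not code that runs on a Hadoop cluster: the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical, and everything happens in a single process.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

splits = ["Jack Bill Joe", "Don Don Joe", "Jack Don Bill"]

mapped = [pair for line in splits for pair in map_phase(line)]
shuffled = shuffle_phase(mapped)
result = [reduce_phase(k, v) for k, v in shuffled.items()]

print(shuffled)  # {'Bill': [1, 1], 'Don': [1, 1, 1], 'Jack': [1, 1], 'Joe': [1, 1]}
print(result)    # [('Bill', 2), ('Don', 3), ('Jack', 2), ('Joe', 2)]
```

In a real job the framework performs the shuffle itself; only the map and reduce logic would be written by the programmer.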

What is Hive?
ᗍ Hive is a data warehouse query tool built on top of HDFS and YARN
ᗍ It provides HiveQL, which is very similar to SQL
ᗍ It is used for querying and analyzing large structured data sets
ᗍ It is extensible through User Defined Functions (UDFs)

RDBMS Vs. Hive
RDBMS | Hive
Schema on write: the table schema is enforced at data load time, so data that does not conform to the schema is rejected | Schema on read: the schema is not verified when the data is loaded
Not very scalable; scaling up is costly | Very easily scalable at low cost
Data can be read and written many times | Follows the Hadoop model of write once, read many times
Record-level updates, insertions, deletes, transactions and indexes are possible | Record-level updates are not possible
Both OLTP (On-line Transaction Processing) and OLAP (On-line Analytical Processing) are supported | OLAP (On-line Analytical Processing) is supported; OLTP (On-line Transaction Processing) is not yet supported
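
The difference between schema on write and schema on read can be illustrated with a toy Python model. This is a sketch under stated assumptions, not how either system is implemented: the two-column `SCHEMA`, the sample rows and the function names are hypothetical, and representing a non-conforming field as `None` at query time is meant to mirror Hive returning NULL for such a field.

```python
SCHEMA = [("name", str), ("age", int)]  # hypothetical two-column table

def cast(value, col_type):
    """Convert one raw field; return None when it does not fit the column type."""
    try:
        return col_type(value)
    except ValueError:
        return None

def load_schema_on_write(rows):
    """RDBMS style: check every row against the schema at load time."""
    table = []
    for row in rows:
        typed = [cast(v, t) for v, (_, t) in zip(row, SCHEMA)]
        if None in typed:
            raise ValueError(f"row rejected at load time: {row}")
        table.append(typed)
    return table

def query_schema_on_read(rows):
    """Hive style: rows were stored unchecked; the schema is applied per query."""
    return [[cast(v, t) for v, (_, t) in zip(row, SCHEMA)] for row in rows]

raw_rows = [["Jack", "31"], ["Don", "unknown"]]

try:
    load_schema_on_write(raw_rows)
except ValueError as err:
    print(err)  # the load fails on the second row

print(query_schema_on_read(raw_rows))  # [['Jack', 31], ['Don', None]]
```

The same bad row stops the load in the first model, while in the second it is stored and only shows up as a missing value when queried.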

Time for Doubts

Why SkillSpeed?
ᗍ Course curriculum from industry experts
ᗍ Instructor-led live virtual sessions
ᗍ Lifetime access to course content via LMS
ᗍ 100% placement assistance
ᗍ 24x7 support
