© 2015 BlueCamphor Technologies (P) Ltd.
Run Your First Hadoop Program
© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com 2
Know your Instructor
Session Objectives
This Session will help you to:
ᗍ Understand
• Introduction to BIG Data
• Introduction to Hadoop 2.x
• HDFS Fundamentals
• MapReduce & YARN
• Hive Introduction
What is BIG Data
ᗍ Big Data refers to data sets so large, complex or unstructured that they become difficult to manage and process with traditional RDBMS tools
ᗍ Every day we create roughly 2.5 quintillion bytes of data; 90% of the world's collected data has been generated in the last 2 years alone
ᗍ Data sizes are now measured in terabytes, petabytes, exabytes and zettabytes
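To put the daily figure above into the units just listed, here is a small arithmetic sketch. It assumes decimal (SI) units, where each step is a factor of 1000; the `humanize` helper is a hypothetical name used only for this illustration.

```python
# Decimal (SI) storage units, smallest to largest (assumption: factor of 1000 per step).
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

daily_bytes = 2.5 * 10**18            # 2.5 quintillion bytes, the figure quoted above
print(humanize(daily_bytes))          # 2.5 EB
print(humanize(daily_bytes * 365))    # 912.5 EB, just under 1 ZB per year
```

At that rate, one day of data is about 2.5 exabytes, and a year approaches one zettabyte.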
Structured vs Unstructured Data - II
Traditional Systems vs. New Systems
Traditional Systems | New Systems
Not scalable to meet new business demands | Scalable to meet new business demands
Cannot process massive data at high speed | Can process massive data at high speed
Can only be scaled up, not scaled out | Can be scaled up and scaled out
Cost of system, processing and data management is not economical | Cost of system, processing and data management is economical
Introduction to Hadoop
ᗍ Hadoop is a framework for storing, processing and analysing Big Data
ᗍ It allows distributed storage and distributed processing of large data sets across clusters of commodity computers, using a simple programming model
ᗍ It is an Apache open-source framework
Key Features of Hadoop
Based on a master-slave architecture
Designed for massive scale
Highly available system
Low software and hardware costs
Distributed storage and processing achieve high performance
No license costs; supported by a very large developer community
Hadoop Ecosystem
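Layers of the Hadoop stack shown in the diagram, top to bottom:
• Applications: Pig (data flow), MR (batch), Hive (SQL), others (Cascading), real-time stream and graph processing (Storm, Giraph), services (HBase)
• Tez (execution engine)
• YARN (cluster resource management)
• HDFS (redundant, reliable storage)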
Hadoop Core Components
HDFS: distributed data storage framework
MapReduce: distributed data processing framework
Hadoop Architecture
ᗍ HDFS - Storage
• NameNode
• DataNode
• Secondary NameNode
ᗍ MapReduce on YARN - Processing
• ResourceManager
• NodeManager
HDFS Architecture
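[Diagram: HDFS architecture]
• NameNode: holds the metadata (name, replicas, ...), for example /home/foo/data, 3, …
• Clients: send metadata ops to the NameNode, and read from and write to the DataNodes
• DataNodes: store the blocks, spread across Rack 1 and Rack 2; the NameNode sends them block ops
• Replication: blocks are copied between DataNodes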
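The NameNode's role as a metadata table can be sketched in a few lines. This is a toy model, not the real HDFS data structure: the class layout, the block IDs and the DataNode names are assumptions made for illustration, while the path `/home/foo/data` and its replication factor of 3 come from the diagram.

```python
from dataclasses import dataclass, field

@dataclass
class NameNode:
    """Toy metadata table: file path -> replication factor and block locations."""
    replication: dict = field(default_factory=dict)  # path -> required replica count
    block_map: dict = field(default_factory=dict)    # path -> {block_id: [DataNode names]}

    def add_file(self, path, replicas, blocks):
        self.replication[path] = replicas
        self.block_map[path] = blocks

    def under_replicated(self, path):
        """Return the block IDs that have fewer replicas than the file requires."""
        want = self.replication[path]
        return [b for b, nodes in self.block_map[path].items() if len(nodes) < want]

nn = NameNode()
nn.add_file("/home/foo/data", 3, {
    "blk_1": ["rack1-dn1", "rack2-dn1", "rack2-dn2"],  # hypothetical DataNode names
    "blk_2": ["rack1-dn2", "rack2-dn1"],               # one replica is missing
})
print(nn.under_replicated("/home/foo/data"))  # ['blk_2']
```

The point of the sketch is that the NameNode stores only names, replica counts and locations; the block contents themselves live on the DataNodes.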
HDFS File Read Operation
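[Diagram: HDFS file read. The client node runs the client JVM (HDFS client, DistributedFileSystem, FSDataInputStream); the NameNode runs on the admin node; three DataNodes run on slave nodes]
1. Open: the HDFS client opens the file through DistributedFileSystem
2. Get block locations: DistributedFileSystem asks the NameNode where the blocks are
3. Read: the HDFS client reads from FSDataInputStream
4. Read: FSDataInputStream reads a block from a DataNode
5. Read: FSDataInputStream reads the next block from another DataNode
6. Close: the HDFS client closes the stream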
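The six numbered read steps can be traced with a small in-memory simulation. This is a sketch under stated assumptions, not the Hadoop client API: the `NameNode` and `DataNode` classes, `read_file`, the block IDs and the node names are all hypothetical stand-ins.

```python
class NameNode:
    def __init__(self, block_locations):
        # path -> ordered list of (block_id, [names of DataNodes holding a replica])
        self.block_locations = block_locations

    def get_block_locations(self, path):
        return self.block_locations[path]

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(path, namenode, datanodes):
    """Open the file, look up its blocks, read them in order, then close."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):  # step 2
        holder = holders[0]  # simplification: always take the first listed replica
        data += datanodes[holder].read_block(block_id)             # steps 4 and 5
    return data                                                    # step 6: close

datanodes = {
    "dn1": DataNode({"blk_1": b"hello "}),
    "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"}),
}
namenode = NameNode({"/demo.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]})
print(read_file("/demo.txt", namenode, datanodes))  # b'hello world'
```

Note that the NameNode is consulted only for locations; the bytes travel directly from the DataNodes to the client.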
HDFS File Write Operation
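[Diagram: HDFS file write. The HDFS client uses DistributedFileSystem; the NameNode runs on the admin node; a pipeline of three DataNodes on slave nodes stores the blocks]
1. Create: the HDFS client calls create on DistributedFileSystem
2. Create: DistributedFileSystem asks the NameNode to create the file
3. Write: the HDFS client writes data to the output stream
4. Write packet: each packet is forwarded along the pipeline of DataNodes
5. Ack packet: acknowledgements travel back along the pipeline
6. Close: the HDFS client closes the stream
7. Complete: the NameNode is notified that the file is complete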
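Steps 4 and 5 of the write path (write packet, ack packet) can be illustrated with a minimal pipeline simulation. The function names, the packet size and the DataNode names below are assumptions for this sketch; real HDFS streams packets asynchronously and handles failures, which this version ignores.

```python
def write_packet(packet, pipeline, stores):
    """Forward one packet down the DataNode pipeline, then return the ack order."""
    for node in pipeline:              # step 4: packet passes from node to node
        stores[node].append(packet)
    return list(reversed(pipeline))    # step 5: acks flow back toward the client

def write_file(data, packet_size, pipeline):
    """Split data into packets and push each one through the whole pipeline."""
    stores = {node: [] for node in pipeline}
    for start in range(0, len(data), packet_size):
        acks = write_packet(data[start:start + packet_size], pipeline, stores)
        assert acks[-1] == pipeline[0]  # the last ack comes from the first DataNode
    return stores

stores = write_file(b"abcdefghij", 4, ["dn1", "dn2", "dn3"])
for node, packets in stores.items():
    print(node, b"".join(packets))     # every DataNode ends up with the full data
```

Because every packet passes through all three nodes, each DataNode holds a complete replica once the client closes the stream.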
Traditional Solution
[Diagram: Very Big Data → Split Data (four splits) → grep on each split → matches from each split → cat → All matches]
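The split, grep and cat flow of the traditional solution can be sketched as follows. The function names and the sample log lines are hypothetical; in practice each `grep` would run as a separate process, possibly on a separate machine, with the splitting and collecting done by hand.

```python
def split_data(lines, parts):
    """Cut the input into at most `parts` chunks of roughly equal size."""
    size = max(1, -(-len(lines) // parts))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def grep(pattern, chunk):
    """Return the lines of one chunk that contain the pattern."""
    return [line for line in chunk if pattern in line]

def traditional_search(pattern, lines, parts=4):
    matches_per_chunk = [grep(pattern, chunk) for chunk in split_data(lines, parts)]
    # The "cat" step: concatenate the per-chunk matches into one result.
    return [match for chunk in matches_per_chunk for match in chunk]

log = ["a error", "b ok", "c error", "d ok", "e error"]
print(traditional_search("error", log, parts=2))  # ['a error', 'c error', 'e error']
```

Everything outside `grep` here is plumbing the programmer has to write and operate, which is the burden the MapReduce framework takes over.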
MapReduce Solution
[Diagram: Very Big Input → Split Data (four splits) → MAP → REDUCE → All matches; the MAP and REDUCE stages run inside the MapReduce framework]
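To show what the framework takes over, here is a single-process sketch of a MapReduce-style runner applied to the same search problem. `run_mapreduce`, `grep_mapper` and `collect_reducer` are hypothetical names; a real Hadoop job would implement Mapper and Reducer classes and run across a cluster.

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer):
    """Apply mapper to every record, group the values by key, reduce each group."""
    groups = defaultdict(list)
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def grep_mapper(line, pattern="error"):
    """Emit (pattern, line) for every line that contains the pattern."""
    if pattern in line:
        yield pattern, line

def collect_reducer(key, values):
    """Keep all matching lines for the key."""
    return values

splits = [["ok", "error: disk"], ["error: net", "fine"]]
print(run_mapreduce(splits, grep_mapper, collect_reducer))
# {'error': ['error: disk', 'error: net']}
```

The user supplies only the two small functions; iterating over splits and grouping by key belong to the framework.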
Understanding MapReduce Paradigm
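MapReduce Word Count Process Flow
Stages: Input → Splitting → Mapping → Shuffling → Reducing → Final Result
Key/value types: K1,V1 (map input) → List(K2,V2) (map output) → K2,List(V2) (reduce input) → List(K3,V3) (reduce output)
Input: Jack Bill Joe / Don Don Joe / Jack Don Bill
Splitting: "Jack Bill Joe" | "Don Don Joe" | "Jack Don Bill"
Mapping: (Jack, 1) (Bill, 1) (Joe, 1) | (Don, 1) (Don, 1) (Joe, 1) | (Jack, 1) (Don, 1) (Bill, 1)
Shuffling: Bill, (1,1) | Don, (1,1,1) | Jack, (1,1) | Joe, (1,1)
Reducing: Bill, 2 | Don, 3 | Jack, 2 | Joe, 2
Final Result: Bill, 2; Don, 3; Jack, 2; Joe, 2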
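The word count flow can be reproduced in plain Python, one variable per stage. This is a single-process sketch of the paradigm, not Hadoop code; the variable names are chosen to mirror the stage names and are otherwise arbitrary.

```python
from collections import defaultdict

# Splitting: one input line per split.
splits = ["Jack Bill Joe", "Don Don Joe", "Jack Don Bill"]

# Mapping: emit (word, 1) for every word, i.e. List(K2, V2).
mapped = [(word, 1) for line in splits for word in line.split()]

# Shuffling: group the 1s by word, i.e. K2, List(V2).
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)

# Reducing: sum each group, i.e. List(K3, V3).
reduced = {word: sum(ones) for word, ones in sorted(shuffled.items())}
print(reduced)  # {'Bill': 2, 'Don': 3, 'Jack': 2, 'Joe': 2}
```

In a real cluster the mapping runs in parallel on each split and the shuffle moves data between machines, but the key/value shapes at each stage are the same.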
What is Hive?
Hive is a data warehouse query tool built on top of HDFS and YARN
It provides HiveQL, which is very similar to SQL
It is used for querying and analyzing large structured data sets
It is extensible through User Defined Functions (UDFs)
RDBMS Vs. Hive
RDBMS | Hive
Schema on write: the table schema is enforced at load time; data that does not conform to the schema is rejected | Schema on read: the schema is not verified when the data is loaded
Not very scalable; scaling up is costly | Scales easily at low cost
Data can be read and written many times | Follows the Hadoop model of write once, read many times
Record-level updates, insertions, deletes, transactions and indexes are possible | Record-level updates are not possible
Supports both OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) | Supports OLAP (Online Analytical Processing) but not yet OLTP (Online Transaction Processing)
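The difference between schema on write and schema on read can be illustrated with a small Python sketch. It is a conceptual model only: the two-column schema, the sample rows and the function names are assumptions, and it relies on the fact that Hive returns NULL for a field that cannot be read as the declared type, which is modelled here as `None`.

```python
SCHEMA = [("name", str), ("age", int)]  # hypothetical two-column table

def parse(row):
    """Cast each comma-separated field to its declared type."""
    return [cast(value) for (_, cast), value in zip(SCHEMA, row.split(","))]

def load_schema_on_write(rows):
    """RDBMS style: validate at load time; a bad row raises ValueError."""
    return [parse(row) for row in rows]

def read_schema_on_read(raw_rows):
    """Hive style: rows were stored as-is; bad fields become None when read."""
    result = []
    for row in raw_rows:
        fields = []
        for (_, cast), value in zip(SCHEMA, row.split(",")):
            try:
                fields.append(cast(value))
            except ValueError:
                fields.append(None)
        result.append(fields)
    return result

rows = ["Jack,31", "Don,unknown"]
print(read_schema_on_read(rows))  # [['Jack', 31], ['Don', None]]
```

Calling `load_schema_on_write(rows)` on the same input raises `ValueError` at the second row, so the load fails instead of storing the malformed record.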
Time for Questions
One more thing… A VERY SPECIAL SURPRISE FOR YOU
Why SkillSpeed?
• Course curriculum from industry experts
• Instructor-led live virtual sessions
• Lifetime access to course content via LMS
• 100% placement assistance
• 24x7 support
Corporate Partners