Module-3
Hadoop MapReduce Framework
www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:
 Analyze different use cases where MapReduce is used
 Differentiate between the traditional way and the MapReduce way
 Learn about the Hadoop 2.x MapReduce architecture and components
 Understand the execution flow of a YARN MapReduce application
 Implement basic MapReduce concepts
 Run a MapReduce program
 Understand the Input Splits concept in MapReduce
 Understand the MapReduce job submission flow
 Implement a Combiner and a Partitioner in MapReduce
Let’s Revise
 Hadoop Cluster Configuration
 Data Loading Techniques
 Hadoop Cluster Modes
Configuration files:
» Core/HDFS: core-site.xml, hdfs-site.xml
» YARN: yarn-site.xml
» Map/Reduce: mapred-site.xml
Let’s Revise
Data Loading into HDFS:
» Using Flume
» Using Sqoop
» Using Hadoop copy commands
Data Analysis:
» Using Pig
» Using HIVE
Sqoop / Flume – Questions
Annie’s Question
Secondary NameNode is a hot backup for NameNode:
» TRUE
» FALSE
Annie’s Answer
Ans. FALSE. The Secondary NameNode (SNN) is the most misunderstood component of the HDFS architecture. The SNN is not a hot backup for the NameNode but a checkpointing mechanism in a Hadoop cluster.
Where is MapReduce Used?
HealthCare
 Problem Statement:
» De-identify personal health information.
Weather Forecasting
 Problem Statement:
» Finding the maximum temperature recorded in a year.
The Traditional Way
(Diagram: Very Big Data is divided into splits; grep runs on each split to produce matches; cat merges them into all matches.)
MapReduce Way
(Diagram: Very Big Data is divided into splits; a MAP phase processes each split in parallel, and a REDUCE phase merges the matches into all matches.)
MapReduce Framework
Why MapReduce?
 Two biggest advantages:
» Taking processing to the data
» Processing data in parallel
(Diagram: a Map Task is scheduled on the node holding its HDFS block; the locality hierarchy runs Node → Rack → Data Center.)
Solving the Problem with MapReduce
(Diagram:)
» Sqoop: take a DB dump in CSV format and copy it into HDFS
» Store the CSV file in HDFS
» Map: read the CSV file from HDFS and apply the Map logic
» Reduce: apply the Reduce logic and write the matches back to HDFS
Hadoop 2.x MapReduce Architecture
(Diagram: the Client submits a job to the Resource Manager; Node Managers on Datanode1 and Datanode2 report Node Status and issue Resource Requests; containers on the DataNodes run the Application Master, a Map Task, and a Reduce Task; MapReduce status flows to the Job History Server; the Namenode serves HDFS metadata.)
Hadoop 2.x MapReduce Components
 Client
» Submits a MapReduce job
 Resource Manager
» Cluster-level resource manager
» Long life, runs on high-quality hardware
 Node Manager
» One per Data Node
» Monitors resources on the Data Node
 ApplicationMaster
» One per application
» Short life
» Coordinates and manages MapReduce jobs
» Negotiates with the Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)
 Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
 Container
» Created by the Node Manager when requested
» Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
MapReduce Application Execution
Executing MapReduce Application on YARN
YARN MR Application Execution Flow
 MapReduce Job Execution
» Job Submission
» Job Initialization
» Task Assignment
» Memory Assignment
» Status Updates
» Failure Recovery
YARN MR Application Execution Flow
(Sequence diagram, built up across four slides: the Client JVM holds the Job object, the Resource Manager runs on the Management Node, and the MRAppMaster and tasks run in containers on DataNodes.)
1. Run Job (Client JVM)
2. Get New Application (from the Resource Manager)
3. Copy Job Resources (to HDFS)
4. Submit Application (to the Resource Manager)
5. Start the MR AppMaster container (on a Node Manager)
6. Create container
7. Get Input Splits (from HDFS)
8. Request Resources (from the Resource Manager)
9. Start container
10. Create container for the Map/Reduce Task (Task JVM)
11. Acquire Job Resources (YarnChild, from HDFS)
12. Execute the task
Afterwards the client polls the MRAppMaster for status, and the running tasks update their status to it.
Hadoop 2.x : YARN Workflow
(Diagram: the Resource Manager, made up of the Scheduler and the Applications Manager (AsM), allocates containers across many Node Managers; App Master1 runs with Containers 1.1 and 1.2, and App Master2 runs with Containers 2.1, 2.2, and 2.3, each container on its own Node Manager.)
Annie’s Question
Which of the following disadvantages in the Hadoop 1.0 MapReduce framework was YARN developed to overcome?
» Single Point of Failure of the NameNode
» Only one version can be run in classic MapReduce
» Too much burden on the JobTracker
Annie’s Answer
Ans. Too much burden on the JobTracker.
Annie’s Question
In YARN, the functionality of the JobTracker has been replaced by which of the following YARN features?
» Job Scheduling
» Task Monitoring
» Resource Management
» Node Management
Annie’s Answer
Ans. Task Monitoring and Resource Management. The fundamental idea of YARN is to split the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global Resource Manager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
Annie’s Question
In YARN, which of the following daemons takes care of the
container and the resource utilization by the applications?
» Node Manager
» JobTracker
» TaskTracker
» ApplicationMaster
Annie’s Answer
ApplicationMaster
Annie’s Question
Can we run MRv1 Jobs in a YARN enabled Hadoop Cluster?
» Yes
» No
Annie’s Answer
Yes. MapReduce on YARN ensures full binary compatibility; existing MRv1 applications can run on YARN directly without recompilation.
MapReduce Paradigm
The Overall MapReduce Word Count Process
Input (K1, V1):
  Deer Bear River
  Car Car River
  Deer Car Bear
Splitting: each line becomes one split.
Mapping, List(K2, V2):
  Split 1 → (Deer, 1) (Bear, 1) (River, 1)
  Split 2 → (Car, 1) (Car, 1) (River, 1)
  Split 3 → (Deer, 1) (Car, 1) (Bear, 1)
Shuffling, (K2, List(V2)):
  Bear, (1,1)   Car, (1,1,1)   Deer, (1,1)   River, (1,1)
Reducing, List(K3, V3):
  Bear, 2   Car, 3   Deer, 2   River, 2
Final Result:
  Bear, 2   Car, 3   Deer, 2   River, 2
Anatomy of a MapReduce Program
Map:    (K1, V1) → List(K2, V2)
Reduce: (K2, List(V2)) → List(K3, V3)
(K = Key, V = Value)
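The two signatures above can be sketched in plain Java, using only collections rather than the Hadoop API; the class and method names below are illustrative, not part of any framework:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: (K1 = line, V1 ignored) -> List(K2 = word, V2 = 1)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // Shuffle: group all values by key -> (K2, List(V2))
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: (K2, List(V2)) -> V3 = sum of the values
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Full pipeline: map every line, shuffle, then reduce each key group.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) mapped.addAll(map(line));
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet())
            result.put(e.getKey(), reduce(e.getValue()));
        return result;
    }
}
```

Running it on the three word-count lines from the previous slide reproduces the Bear/Car/Deer/River totals shown there.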
Demo of WordCount Program
Annie’s Question
Input to the mapper is in the form of?
» A flat file
» A (key, value) pair
» Only a string
» All of the above
Annie’s Answer
A Mapper accepts (key, value) pair as input.
Input Splits
INPUT DATA
» Physical division → HDFS Blocks
» Logical division → Input Splits
Relation Between Input Splits and HDFS Blocks
(Diagram: a file of lines 1 to 11 with four block boundaries cutting across it, and splits that follow line boundaries.)
 Logical records do not fit neatly into HDFS blocks.
 Logical records are lines that can cross block boundaries.
 The first split contains line 5 even though line 5 spans two blocks.
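The size of the logical division follows the rule used by Hadoop's FileInputFormat: split size = max(minSplitSize, min(maxSplitSize, blockSize)). A plain-Java sketch of that rule (the helper class below is illustrative, not the Hadoop API):

```java
public class SplitSizeSketch {
    // Hadoop's FileInputFormat picks the split size from the block size,
    // clamped between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file: one per splitSize bytes,
    // with the last split possibly shorter than the rest.
    static long numSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }
}
```

With the defaults (min = 1 byte, no maximum), the split size equals the block size, so a 300 MB file on 128 MB blocks yields 3 input splits.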
MapReduce Job Submission Flow
(Built up across six slides: input data on Node 1 and Node 2, then Map tasks, then Reduce tasks.)
 Input data is distributed to nodes
 Each map task works on a “split” of data
 Each mapper outputs intermediate data
 The reducer copies the intermediate data it is responsible for once it identifies, via the Application Master, the map tasks that produced it
 The shuffle phase sorts and merges the data for a particular key
 Reducer output is stored
Annie’s Question
Does the MapReduce programming model provide a way for reducers to communicate with each other?
» Yes, reducers running on the same machine can communicate with each other through shared memory
» No, each reducer runs independently and in isolation
Annie’s Answer
Ans. No, reducers run independently and in isolation.
Individual tasks do not know the input source. Reducer tasks
rely on Hadoop framework to deliver the appropriate input for
processing.
Annie’s Question
Who specifies the Input Split information?
» Randomly decided by the NameNode
» Randomly decided by the JobTracker
» Line by line, decided by the Input Splitter
» We have to specify it explicitly
Annie’s Answer
Ans. The client has to submit the input split information by specifying the start and end points in the InputFormat configuration.
Overview of MapReduce
Combiners can be viewed as ‘mini-reducers’ in the Map phase.
Partitioners determine which reducer is responsible for a particular key.
(Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers and reducers.)
Combiner – Local Reduce
COMBINERS
» Mini-reducers that perform a “local reduce”
» Run before the mapper results are distributed
» Pass the reduced workload on to the Reducers
Example with two blocks:
Block 1: B C D E D B → Mapper → (B,1) (C,1) (D,1) (E,1) (D,1) (B,1) → Combiner → (B,2) (C,1) (D,2) (E,1)
Block 2: D A A C B D → Mapper → (D,1) (A,1) (A,1) (C,1) (B,1) (D,1) → Combiner → (D,2) (A,2) (C,1) (B,1)
Shuffle → (A,[2]) (B,[2,1]) (C,[1,1]) (D,[2,2]) (E,[1])
Reducer → (A,2) (B,3) (C,2) (D,4) (E,1)
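The local aggregation shown above can be sketched in plain Java, using only collections rather than the Hadoop Combiner API; the class and method names are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Combiner: collapse one block's (word, 1) pairs into (word, partialCount)
    // before anything is sent across the network to the reducers.
    static Map<String, Integer> combine(List<String> blockWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : blockWords)
            partial.merge(w, 1, Integer::sum);
        return partial;
    }

    // Reducer: merge the partial counts produced by every block's combiner.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((k, v) -> total.merge(k, v, Integer::sum));
        return total;
    }
}
```

Feeding the two blocks from the diagram (B C D E D B and D A A C B D) through combine and then reduce reproduces the final (A,2) (B,3) (C,2) (D,4) (E,1) counts while shuffling far fewer pairs.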
Annie’s Question
The Combiner works at which level?
» Mapper Level
» Partitioner Level
» Reducer Level
» All of the above
Annie’s Answer
Ans. Mapper level, as the Combiner works on the output data from the Mapper.
Annie’s Question
The Combiner can be considered as:
» Semi Partitioner
» Semi Reducer
» Semi Shuffler
» Major Reducer
Annie’s Answer
Ans. Semi Reducer. The Combiner works on the Mapper output and lessens the burden on the Reducer.
Partitioner – Redirecting Output from Mapper
(Diagram: each Map task's output passes through a Partitioner, which redirects every key to the Reducer responsible for it; three Maps feed three Reducers.)
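The redirecting rule used by Hadoop's default HashPartitioner is (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A plain-Java sketch of that rule (the class below is illustrative, not the Hadoop Partitioner API):

```java
public class PartitionSketch {
    // Mirrors Hadoop's default HashPartitioner: mask off the sign bit so the
    // result is non-negative, then take the remainder modulo the reducer count.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key's hash, every occurrence of the same key, from every mapper, lands on the same reducer, which is what makes the per-key reduce correct.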
Demo: Combiner and Partitioner
Demo: Combiner and Partitioner MR Code
Annie’s Question
Can we use the same logic for the combiner and the reducer?
» No, they are separate entities.
» Yes, but only if the reducer and combiner logic are commutative and associative and both use the same data types.
Annie’s Answer
Ans. Yes, you can use the same logic if the Reducer and Combiner logic are commutative and associative and both use the same data types.
Annie’s Question
Can we change the format of the output key class and output value class?
» TRUE
» FALSE
Annie’s Answer
Ans. TRUE
HealthCare Dataset
Revisit De-identification Architecture
(Diagram:)
» Sqoop: take a DB dump in CSV format and ingest it into HDFS
» Map Tasks 1, 2, …: read the CSV file from HDFS and de-identify columns based on configurations
» Reduce Tasks 1, 2, …: store the de-identified CSV file into HDFS
// Encrypts a column value with AES so it can be de-identified.
// Requires javax.crypto and Apache Commons Codec (Base64).
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;

public static String encrypt(String strToEncrypt, byte[] key)
{
    try
    {
        // AES in ECB mode with PKCS5 padding, as used in the demo code
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
        cipher.init(Cipher.ENCRYPT_MODE, secretKey);
        // Base64-encode the ciphertext so it stays printable in the CSV
        String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
        return encryptedString.trim();
    }
    catch (Exception e)
    {
        logger.error("Error while encrypting", e);
    }
    return null;
}
DeIdentify MapReduce Code
Demo of DeIdentify Program
Weather Data
ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/
Demo of WeatherData Program
Assignment
Write MapReduce code for WordCount on your own and run it on Edureka VM
Download all the MapReduce codes from LMS and import them in your Eclipse IDE and execute them
Try Maximum Temperature problem in MapReduce
Try Hot and Cold day problem in MapReduce
Pre-work
Watch the video “Running MapReduce Program” under Module-3 of your LMS
Attempt the Word Count, Patents, & Alphabets assignments using the items present in the LMS under the Module 3 tab
Review the Interview Questions for MapReduce
http://www.edureka.in/blog/hadoop-interview-questions-mapreduce/
Review the Next Generation MapReduce (MRv2 or YARN)
http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
http://www.edureka.in/blog/hadoop-2-0-setting-up-a-single-node-cluster-in-15-minutes/
Set up the CDH4 Hadoop development environment using the documents present in the LMS
http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/
Agenda for Next Class
 Map and Reduce Side Join
 Counters
 DistributedCache
 Custom Input Format
 Sequence Input Format
 MRUnit
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!
Please spare a few minutes to take the survey after the webinar.