Module-3
Hadoop MapReduce Framework
www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:
 Analyze different use cases where MapReduce is used
 Differentiate between the traditional way and the MapReduce way
 Learn about the Hadoop 2.x MapReduce architecture and components
 Understand the execution flow of a YARN MapReduce application
 Implement basic MapReduce concepts
 Run a MapReduce program
 Understand the Input Splits concept in MapReduce
 Understand the MapReduce job submission flow
 Implement a Combiner and a Partitioner in MapReduce
Let’s Revise
 Hadoop Cluster Configuration
 Data Loading Techniques
 Hadoop Cluster Modes
Configuration files:
» Core/HDFS: core-site.xml, hdfs-site.xml
» YARN: yarn-site.xml
» Map/Reduce: mapred-site.xml
Let’s Revise
Data Loading into HDFS:
» Using Flume
» Using Sqoop
» Using Hadoop copy commands
Data Analysis:
» Using Pig
» Using HIVE
Sqoop / Flume – Questions
Annie’s Question
Secondary NameNode is a hot backup for NameNode:
» TRUE
» FALSE
Annie’s Answer
Ans. FALSE. The Secondary NameNode (SNN) is the most misunderstood component of the HDFS architecture. The SNN is not a hot backup for the NameNode but a checkpointing mechanism in a Hadoop cluster.
Where is MapReduce Used?
HealthCare
 Problem Statement:
» De-identify personal health information.
Weather Forecasting
 Problem Statement:
» Finding the maximum temperature recorded in a year.
The Traditional Way
(Diagram: Very Big Data is divided into splits; grep runs on each split to produce matches; cat merges them into all matches.)
MapReduce Way
(Diagram: Very Big Data is divided into splits; a MAP phase processes each split in parallel, and a REDUCE phase merges the matches into all matches.)
MapReduce Framework
Why MapReduce?
 Two biggest advantages:
» Taking processing to the data
» Processing data in parallel
(Diagram: a Map Task is scheduled on the node holding its HDFS block; the locality hierarchy runs Node → Rack → Data Center.)
Solving the Problem with MapReduce
(Diagram:)
» Sqoop: take a DB dump in CSV format and copy it into HDFS
» Store the CSV file in HDFS
» Map: read the CSV file from HDFS and apply the Map logic
» Reduce: apply the Reduce logic and write the matches back to HDFS
Hadoop 2.x MapReduce Architecture
(Diagram: the Client submits a job to the Resource Manager; Node Managers on Datanode1 and Datanode2 report Node Status and issue Resource Requests; containers on the DataNodes run the Application Master, a Map Task, and a Reduce Task; MapReduce status flows to the Job History Server; the Namenode serves HDFS metadata.)
Hadoop 2.x MapReduce Components
 Client
» Submits a MapReduce job
 Resource Manager
» Cluster-level resource manager
» Long life, runs on high-quality hardware
 Node Manager
» One per Data Node
» Monitors resources on the Data Node
 ApplicationMaster
» One per application
» Short life
» Coordinates and manages MapReduce jobs
» Negotiates with the Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)
 Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
 Container
» Created by the Node Manager when requested
» Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
MapReduce Application Execution
Executing MapReduce Application on YARN
YARN MR Application Execution Flow
 MapReduce Job Execution
» Job Submission
» Job Initialization
» Task Assignment
» Memory Assignment
» Status Updates
» Failure Recovery
YARN MR Application Execution Flow
(Sequence diagram, built up across four slides: the Client JVM holds the Job object, the Resource Manager runs on the Management Node, and the MRAppMaster and tasks run in containers on DataNodes.)
1. Run Job (Client JVM)
2. Get New Application (from the Resource Manager)
3. Copy Job Resources (to HDFS)
4. Submit Application (to the Resource Manager)
5. Start the MR AppMaster container (on a Node Manager)
6. Create container
7. Get Input Splits (from HDFS)
8. Request Resources (from the Resource Manager)
9. Start container
10. Create container for the Map/Reduce Task (Task JVM)
11. Acquire Job Resources (YarnChild, from HDFS)
12. Execute the task
Afterwards the client polls the MRAppMaster for status, and the running tasks update their status to it.
Hadoop 2.x : YARN Workflow
(Diagram: the Resource Manager, made up of the Scheduler and the Applications Manager (AsM), allocates containers across many Node Managers; App Master1 runs with Containers 1.1 and 1.2, and App Master2 runs with Containers 2.1, 2.2, and 2.3, each container on its own Node Manager.)
Annie’s Question
Which of the following disadvantages in the Hadoop 1.0 MapReduce framework was YARN developed to overcome?
» Single Point of Failure of the NameNode
» Only one version can be run in classic MapReduce
» Too much burden on the JobTracker
Annie’s Answer
Ans. Too much burden on the JobTracker.
Annie’s Question
In YARN, the functionality of the JobTracker has been replaced by which of the following YARN features?
» Job Scheduling
» Task Monitoring
» Resource Management
» Node Management
Annie’s Answer
Ans. Task Monitoring and Resource Management. The fundamental idea of YARN is to split the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global Resource Manager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
Annie’s Question
In YARN, which of the following daemons takes care of the
container and the resource utilization by the applications?
» Node Manager
» JobTracker
» TaskTracker
» ApplicationMaster
Annie’s Answer
ApplicationMaster
Annie’s Question
Can we run MRv1 Jobs in a YARN enabled Hadoop Cluster?
» Yes
» No
Annie’s Answer
Yes. MapReduce on YARN ensures full binary compatibility; existing MRv1 applications can run on YARN directly without recompilation.
MapReduce Paradigm
The Overall MapReduce Word Count Process
Input (K1, V1):
  Deer Bear River
  Car Car River
  Deer Car Bear
Splitting: each line becomes one split.
Mapping, List(K2, V2):
  Split 1 → (Deer, 1) (Bear, 1) (River, 1)
  Split 2 → (Car, 1) (Car, 1) (River, 1)
  Split 3 → (Deer, 1) (Car, 1) (Bear, 1)
Shuffling, (K2, List(V2)):
  Bear, (1,1)   Car, (1,1,1)   Deer, (1,1)   River, (1,1)
Reducing, List(K3, V3):
  Bear, 2   Car, 3   Deer, 2   River, 2
Final Result:
  Bear, 2   Car, 3   Deer, 2   River, 2
Anatomy of a MapReduce Program
Map:    (K1, V1) → List(K2, V2)
Reduce: (K2, List(V2)) → List(K3, V3)
(K = Key, V = Value)
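The two signatures above can be sketched in plain Java, using only collections rather than the Hadoop API; the class and method names below are illustrative, not part of any framework:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: (K1 = line, V1 ignored) -> List(K2 = word, V2 = 1)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // Shuffle: group all values by key -> (K2, List(V2))
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: (K2, List(V2)) -> V3 = sum of the values
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Full pipeline: map every line, shuffle, then reduce each key group.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) mapped.addAll(map(line));
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet())
            result.put(e.getKey(), reduce(e.getValue()));
        return result;
    }
}
```

Running it on the three word-count lines from the previous slide reproduces the Bear/Car/Deer/River totals shown there.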
Demo of WordCount Program
Annie’s Question
Input to the mapper is in the form of?
» A flat file
» A (key, value) pair
» Only a string
» All of the above
Annie’s Answer
A Mapper accepts (key, value) pair as input.
Input Splits
INPUT DATA
» Physical division → HDFS Blocks
» Logical division → Input Splits
Relation Between Input Splits and HDFS Blocks
(Diagram: a file of lines 1 to 11 with four block boundaries cutting across it, and splits that follow line boundaries.)
 Logical records do not fit neatly into HDFS blocks.
 Logical records are lines that can cross block boundaries.
 The first split contains line 5 even though line 5 spans two blocks.
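The size of the logical division follows the rule used by Hadoop's FileInputFormat: split size = max(minSplitSize, min(maxSplitSize, blockSize)). A plain-Java sketch of that rule (the helper class below is illustrative, not the Hadoop API):

```java
public class SplitSizeSketch {
    // Hadoop's FileInputFormat picks the split size from the block size,
    // clamped between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file: one per splitSize bytes,
    // with the last split possibly shorter than the rest.
    static long numSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }
}
```

With the defaults (min = 1 byte, no maximum), the split size equals the block size, so a 300 MB file on 128 MB blocks yields 3 input splits.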
MapReduce Job Submission Flow
(Built up across six slides: input data on Node 1 and Node 2, then Map tasks, then Reduce tasks.)
 Input data is distributed to nodes
 Each map task works on a “split” of data
 Each mapper outputs intermediate data
 The reducer copies the intermediate data it is responsible for once it identifies, via the Application Master, the map tasks that produced it
 The shuffle phase sorts and merges the data for a particular key
 Reducer output is stored
Annie’s Question
Does the MapReduce programming model provide a way for reducers to communicate with each other?
» Yes, reducers running on the same machine can communicate with each other through shared memory
» No, each reducer runs independently and in isolation
Annie’s Answer
Ans. No, reducers run independently and in isolation.
Individual tasks do not know the input source. Reducer tasks
rely on Hadoop framework to deliver the appropriate input for
processing.
Annie’s Question
Who specifies the Input Split information?
» Randomly decided by the NameNode
» Randomly decided by the JobTracker
» Line by line, decided by the Input Splitter
» We have to specify it explicitly
Annie’s Answer
Ans. The client has to submit the input split information by specifying the start and end points in the InputFormat configuration.
Overview of MapReduce
Combiners can be viewed as ‘mini-reducers’ in the Map phase.
Partitioners determine which reducer is responsible for a particular key.
(Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers and reducers.)
Combiner – Local Reduce
COMBINERS
» Mini-reducers that perform a “local reduce”
» Run before the mapper results are distributed
» Pass the reduced workload on to the Reducers
Example with two blocks:
Block 1: B C D E D B → Mapper → (B,1) (C,1) (D,1) (E,1) (D,1) (B,1) → Combiner → (B,2) (C,1) (D,2) (E,1)
Block 2: D A A C B D → Mapper → (D,1) (A,1) (A,1) (C,1) (B,1) (D,1) → Combiner → (D,2) (A,2) (C,1) (B,1)
Shuffle → (A,[2]) (B,[2,1]) (C,[1,1]) (D,[2,2]) (E,[1])
Reducer → (A,2) (B,3) (C,2) (D,4) (E,1)
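The local aggregation shown above can be sketched in plain Java, using only collections rather than the Hadoop Combiner API; the class and method names are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Combiner: collapse one block's (word, 1) pairs into (word, partialCount)
    // before anything is sent across the network to the reducers.
    static Map<String, Integer> combine(List<String> blockWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : blockWords)
            partial.merge(w, 1, Integer::sum);
        return partial;
    }

    // Reducer: merge the partial counts produced by every block's combiner.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((k, v) -> total.merge(k, v, Integer::sum));
        return total;
    }
}
```

Feeding the two blocks from the diagram (B C D E D B and D A A C B D) through combine and then reduce reproduces the final (A,2) (B,3) (C,2) (D,4) (E,1) counts while shuffling far fewer pairs.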
Annie’s Question
The Combiner works at which level?
» Mapper Level
» Partitioner Level
» Reducer Level
» All of the above
Annie’s Answer
Ans. Mapper level, as the Combiner works on the output data from the Mapper.
Annie’s Question
The Combiner can be considered as:
» Semi Partitioner
» Semi Reducer
» Semi Shuffler
» Major Reducer
Annie’s Answer
Ans. Semi Reducer. The Combiner works on the Mapper output and lessens the burden on the Reducer.
Partitioner – Redirecting Output from Mapper
(Diagram: each Map task's output passes through a Partitioner, which redirects every key to the Reducer responsible for it; three Maps feed three Reducers.)
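The redirecting rule used by Hadoop's default HashPartitioner is (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A plain-Java sketch of that rule (the class below is illustrative, not the Hadoop Partitioner API):

```java
public class PartitionSketch {
    // Mirrors Hadoop's default HashPartitioner: mask off the sign bit so the
    // result is non-negative, then take the remainder modulo the reducer count.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key's hash, every occurrence of the same key, from every mapper, lands on the same reducer, which is what makes the per-key reduce correct.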
Demo: Combiner and Partitioner
Demo: Combiner and Partitioner MR Code
Annie’s Question
Can we use the same logic for the combiner and the reducer?
» No, they are separate entities.
» Yes, but only if the reducer and combiner logic are commutative and associative and both use the same data types.
Annie’s Answer
Ans. Yes, you can use the same logic if the Reducer and Combiner logic are commutative and associative and both use the same data types.
Annie’s Question
Can we change the format of the output key class and output value class?
» TRUE
» FALSE
Annie’s Answer
Ans. TRUE
HealthCare Dataset
Revisit De-identification Architecture
(Diagram:)
» Sqoop: take a DB dump in CSV format and ingest it into HDFS
» Map Tasks 1, 2, …: read the CSV file from HDFS and de-identify columns based on configurations
» Reduce Tasks 1, 2, …: store the de-identified CSV file into HDFS
// Encrypts a column value with AES so it can be de-identified.
// Requires javax.crypto and Apache Commons Codec (Base64).
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;

public static String encrypt(String strToEncrypt, byte[] key)
{
    try
    {
        // AES in ECB mode with PKCS5 padding, as used in the demo code
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
        cipher.init(Cipher.ENCRYPT_MODE, secretKey);
        // Base64-encode the ciphertext so it stays printable in the CSV
        String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
        return encryptedString.trim();
    }
    catch (Exception e)
    {
        logger.error("Error while encrypting", e);
    }
    return null;
}
DeIdentify MapReduce Code
Demo of DeIdentify Program
Weather Data
ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/
Demo of WeatherData Program
Assignment
Write MapReduce code for WordCount on your own and run it on Edureka VM
Download all the MapReduce codes from LMS and import them in your Eclipse IDE and execute them
Try Maximum Temperature problem in MapReduce
Try Hot and Cold day problem in MapReduce
Pre-work
Watch the video “Running MapReduce Program” under Module-3 of your LMS
Attempt the Word Count, Patents, & Alphabets assignments using the items present in the LMS under the Module 3 tab
Review the Interview Questions for MapReduce
http://www.edureka.in/blog/hadoop-interview-questions-mapreduce/
Review the Next Generation MapReduce (MRv2 or YARN)
http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
http://www.edureka.in/blog/hadoop-2-0-setting-up-a-single-node-cluster-in-15-minutes/
Set up the CDH4 Hadoop development environment using the documents present in the LMS
http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/
Agenda for Next Class
 Map and Reduce Side Join
 Counters
 DistributedCache
 Custom Input Format
 Sequence Input Format
 MRUnit
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!
Please spare a few minutes to take the survey after the webinar.