2. Objectives
www.edureka.co/big-data-and-hadoopSlide 2
At the end of this module, you will be able to:
Analyze different use cases where MapReduce is used
Differentiate between the traditional way and the MapReduce way
Learn about the Hadoop 2.x MapReduce architecture and components
Understand the execution flow of a YARN MapReduce application
Implement basic MapReduce concepts
Run a MapReduce program
Understand the Input Split concept in MapReduce
Understand the MapReduce job submission flow
Implement a Combiner and Partitioner in MapReduce
4. Let’s Revise
Data Loading: using Sqoop, using Flume, using Hadoop copy commands
Data Analysis: using Pig, using HIVE
HDFS, Sqoop, Flume – Questions
6. Annie’s Answer
Ans. FALSE. The Secondary NameNode (SNN) is the most
misunderstood component of the HDFS architecture. The SNN
is not a hot backup for the NameNode; it enables a
checkpointing mechanism for the NameNode's metadata in a
Hadoop cluster.
7. Where is MapReduce Used?
Weather Forecasting
Problem Statement:
» Finding the maximum temperature recorded in a year.
HealthCare
Problem Statement:
» De-identify personal health information.
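The maximum-temperature use case above can be sketched in plain Python (an illustration of the map/group/reduce shape only, not the Hadoop API; the `year,temperature` input format and function names are assumptions):

```python
from collections import defaultdict

def map_reading(line):
    """Map step: parse one 'year,temperature' record into (year, temp)."""
    year, temp = line.split(",")
    return year, int(temp)

def max_temperature(lines):
    by_year = defaultdict(list)          # shuffle step: group temps by year
    for line in lines:
        year, temp = map_reading(line)
        by_year[year].append(temp)
    # reduce step: collapse each year's list of temperatures with max()
    return {year: max(temps) for year, temps in by_year.items()}
```

The same grouping-then-aggregating shape carries over to the real Hadoop job, where the shuffle is done by the framework between the map and reduce phases.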
8. The Traditional Way
Diagram: Very Big Data is divided into Split Data chunks; a grep runs over each split to produce matches, and cat concatenates the per-split matches into All Matches.
10. Why MapReduce?
Two biggest advantages:
» Taking processing to the data
» Processing data in parallel
Diagram: Map Tasks (a, b, c) run on the Nodes holding their HDFS Blocks, within Racks in the Data Center: the processing moves to the data.
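The second advantage, processing data in parallel, can be illustrated with a small sketch that automates the split/grep/cat pipeline from the traditional-way slide (the function names `grep` and `parallel_grep` are illustrative, not Hadoop API names):

```python
from concurrent.futures import ThreadPoolExecutor

def grep(split, pattern):
    """Map-side work: scan one split for matching lines."""
    return [line for line in split if pattern in line]

def parallel_grep(lines, pattern, workers=4):
    # Split the data, run grep on each split in parallel, then
    # concatenate ("cat") the per-split matches: the same shape as
    # the traditional split/grep/cat pipeline, but automated.
    size = max(1, len(lines) // workers)
    splits = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda s: grep(s, pattern), splits)
    return [match for part in results for match in part]
```

MapReduce goes one step further than this sketch: instead of shipping the splits to the workers, it schedules each worker on the node that already holds the split.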
11. Solving the Problem with MapReduce
Diagram:
» Take a DB dump in CSV format (using Sqoop) and copy it onto HDFS
» Store the CSV file in HDFS (as binary blocks)
» Read the CSV file from HDFS
» Apply the Map logic, then the Reduce logic, to produce the matches
12. Hadoop 2.x MapReduce Architecture
Diagram: a Client submits a job to the Resource Manager. On Datanode1, a Node Manager hosts Containers running the Application Master and a Map Task; on Datanode2, a Node Manager hosts a Container running a Reduce Task. Arrows show Job Submission, MapReduce Status, Node Status, and Resource Request flows; a Job History Server and the Namenode complete the picture.
13. Hadoop 2.x MapReduce Components
Client
» Submits a MapReduce job
Resource Manager
» Cluster-level resource manager
» Long life, runs on high-quality hardware
Node Manager
» One per Data Node
» Monitors resources on the Data Node
ApplicationMaster
» One per application
» Short life
» Coordinates and manages MapReduce jobs
» Negotiates with the Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)
Container
» Created by the NM when requested
» Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
21. Annie’s Question
YARN was developed to overcome which of the following
disadvantages of the Hadoop 1.0 MapReduce framework?
» Single Point Of Failure Of NameNode
» Only one version can be run in classic MapReduce
» Too much burden on Job Tracker
23. Annie’s Question
In YARN, the functionality of the JobTracker has been replaced by
which of the following YARN features?
» Job Scheduling
» Task Monitoring
» Resource Management
» Node Management
24. Annie’s Answer
Task Monitoring and Resource Management. The fundamental
idea of YARN is to split the two major responsibilities of the
JobTracker, i.e. resource management and job
scheduling/monitoring, into separate daemons: a global
Resource Manager (RM) for resources and a per-application
ApplicationMaster (AM) for task monitoring.
25. Annie’s Question
In YARN, which of the following daemons takes care of the
container and the resource utilization by the applications?
» Node Manager
» JobTracker
» TaskTracker
» ApplicationMaster
27. Annie’s Question
Can we run MRv1 Jobs in a YARN enabled Hadoop Cluster?
» Yes
» No
28. Annie’s Answer
Yes. MapReduce on YARN ensures full binary compatibility:
existing MRv1 applications can run on YARN directly without
recompilation.
29. MapReduce Paradigm
The Overall MapReduce Word Count Process
Input (K1,V1):
Deer Bear River
Car Car River
Deer Car Bear
Splitting: one split per line
Mapping (List(K2,V2)):
Deer,1  Bear,1  River,1
Car,1  Car,1  River,1
Deer,1  Car,1  Bear,1
Shuffling (K2,List(V2)):
Bear,(1,1)
Car,(1,1,1)
Deer,(1,1)
River,(1,1)
Reducing (List(K3,V3)):
Bear,2
Car,3
Deer,2
River,2
Final Result: Bear,2  Car,3  Deer,2  River,2
30. Anatomy of a MapReduce Program
Map:    (K1, V1) → List(K2, V2)
Reduce: (K2, List(V2)) → List(K3, V3)
(K = Key, V = Value)
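The two signatures above can be traced end to end in a plain-Python sketch of the word-count job (an illustration only, not the Hadoop Java API; the function and variable names are assumptions):

```python
from collections import defaultdict

def mapper(k1, v1):
    """(K1, V1) -> List(K2, V2): emit (word, 1) for each word in the line."""
    return [(word, 1) for word in v1.split()]

def reducer(k2, values):
    """(K2, List(V2)) -> List(K3, V3): sum the counts for one word."""
    return [(k2, sum(values))]

def run_job(lines):
    # Map phase: one mapper call per input record (offset, line).
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(mapper(offset, line))
    # Shuffle phase: group all intermediate values by key.
    grouped = defaultdict(list)
    for k2, v2 in intermediate:
        grouped[k2].append(v2)
    # Reduce phase: one reducer call per distinct key, in sorted key order.
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output
```

Running `run_job` on the three input lines from the previous slide reproduces the Bear,2 / Car,3 / Deer,2 / River,2 result.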
31. Demo of WordCount Program
32. Annie’s Question
The input to the mapper is in the form of?
» A flat file
» (key, value) pair
» Only string
» All of the above
33. Annie’s Answer
A Mapper accepts a (key, value) pair as input.
35. Relation Between Input Splits and HDFS Blocks
Diagram: a File of Lines 1 2 3 4 5 6 7 8 9 10 11 laid out across HDFS Block Boundaries, with three Splits overlaid.
» Logical records do not fit neatly into the HDFS blocks.
» Logical records are lines, and some lines cross the boundary of the blocks.
» The first split contains line 5, even though line 5 spans two blocks.
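The rule behind the diagram can be sketched in plain Python (an assumed simplification of how a line-oriented record reader behaves: every split except the first skips its partial first line, and every split reads past its end boundary to finish the line it is in the middle of):

```python
def read_split(data, start, end):
    """Return the complete lines belonging to the byte range [start, end)."""
    pos = start
    if start != 0:
        # Skip the partial first line; the previous split reads it in full.
        pos = data.find("\n", start) + 1
    lines = []
    while pos < end and pos < len(data):
        nl = data.find("\n", pos)
        if nl == -1:
            nl = len(data)
        lines.append(data[pos:nl])  # may read past `end` to finish the line
        pos = nl + 1
    return lines
```

With this rule, every line is read exactly once no matter where the block boundaries fall, which is why the first split in the diagram "owns" line 5 even though line 5 spans two blocks.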
36. MapReduce Job Submission Flow
Diagram (built up over slides 36–41): INPUT DATA is spread over Node 1 and Node 2, each running a Map task and then a Reduce task.
» Input data is distributed to the nodes
» Each map task works on a “split” of data
» Each mapper outputs intermediate data
» The reducer copies the intermediate data it is responsible for, after locating the respective map tasks via the Application Master
» The shuffle phase sorts and merges the data for each key
» The reducer output is stored
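The sort-and-merge step of the shuffle can be sketched in plain Python (an assumed simplification: each mapper hands the reducer a run of (key, value) pairs already sorted by key, and the reducer merge-sorts the runs and groups values per key):

```python
import heapq

def sort_and_merge(map_outputs):
    """Merge sorted per-mapper outputs into (key, [values]) groups."""
    merged = heapq.merge(*map_outputs)  # streams stay sorted by key
    groups = []
    for key, value in merged:
        if groups and groups[-1][0] == key:
            groups[-1][1].append(value)  # same key: extend current group
        else:
            groups.append((key, [value]))  # new key: start a new group
    return groups
```

Because the inputs arrive pre-sorted, the merge is a streaming operation: the reducer never has to hold all intermediate data in memory at once.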
42. Annie’s Question
Does the MapReduce programming model provide a way for
reducers to communicate with each other?
» Yes, reducers running on the same machine can
communicate with each other through shared memory
» No, each reducer runs independently and in isolation.
44. Annie’s Question
Who specifies the Input Split information?
» Randomly, decided by the NameNode
» Randomly, decided by the JobTracker
» Line by line, decided by the Input Splitter
» We have to specify it explicitly
46. Overview of MapReduce
Complete view of MapReduce, illustrating Combiners and Partitioners in addition to Mappers and Reducers:
» Combiners can be viewed as ‘mini-reducers’ in the Map phase.
» Partitioners determine which reducer is responsible for a particular key.
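The partitioning rule can be sketched in plain Python (an assumed analogue of hash-based partitioning, where a key always lands on the same reducer, chosen by hashing the key modulo the number of reducers; the function names are illustrative):

```python
def partition(key, num_reducers):
    """Return the index of the reducer responsible for `key`."""
    # Mask to a non-negative value before taking the modulus.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def partition_map_output(pairs, num_reducers):
    """Route each (key, value) pair to its reducer's bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

The property that matters is determinism: every occurrence of the same key, from every mapper, is routed to the same reducer, so each reducer sees the complete list of values for the keys it owns.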
47. Combiner – Local Reduce
COMBINERS are mini-reducers that perform a “local reduce” on each mapper’s results before we distribute them, shrinking the workload passed further to the Reducers.
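A word-count combiner can be sketched in plain Python (an illustration, not the Hadoop API): it sums the counts emitted by one mapper before anything crosses the network, which is safe here because addition is commutative and associative.

```python
from collections import defaultdict

def combine(map_output):
    """Local reduce: collapse one mapper's (word, count) pairs in place."""
    local = defaultdict(int)
    for word, count in map_output:
        local[word] += count
    return sorted(local.items())
```

For a mapper that emitted (Deer,1) (Car,1) (Deer,1), the combiner ships only (Car,1) (Deer,2): fewer pairs over the network, with the reducer still producing the same final counts.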
54. Demo: Combiner and Partitioner
Demo: Combiner and Partitioner MR Code
55. Annie’s Question
Can we use the same logic for the combiner and the reducer?
» No, they are separate entities.
» Yes, but only if the reducer logic is commutative and
associative and the combiner and reducer use the same data
types.
65. Assignment
Write the MapReduce code for WordCount on your own and run it on the Edureka VM
Download all the MapReduce codes from the LMS, import them into your Eclipse IDE, and execute them
Try the Maximum Temperature problem in MapReduce
Try the Hot and Cold Day problem in MapReduce
66. Pre-work
Watch the video “Running MapReduce Program” under Module 3 of your LMS
Attempt the Word Count, Patents, & Alphabets assignments using the items present in the LMS under the Module 3 tab
Review the Interview Questions for MapReduce
http://www.edureka.in/blog/hadoop-interview-questions-mapreduce/
Review the Next Generation MapReduce (MRv2 or YARN)
http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
http://www.edureka.in/blog/hadoop-2-0-setting-up-a-single-node-cluster-in-15-minutes/
Set up the CDH4 Hadoop development environment using the documents present in the LMS
http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/
67. Agenda for Next Class
Map and Reduce Side Join
Counters
DistributedCache
Custom Input Format
Sequence Input Format
MRUnit
68. Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.