MapReduce Execution Architecture
Rupak Roy
Terminology Explanations:
 InputFormat: defines how the input data is divided into input splits, which in turn determines the number of map tasks (one per split).
 Record Reader: reads the data from an input split one line at a time
and converts it into key-value pairs for the Mapper function.
By default the Map function reads data in text input format.
Another role of the record reader: when HDFS splits
the data into blocks of 64 MB (default), it does not consider the
type of data while creating the blocks, so a block may
terminate a logical record, for
example in the middle of a line or a row of a text file.
In such a case the record reader detects the break in the
logical record, fetches the remaining part from the next
block, and makes it part of the input split.
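The (key, value) convention above can be seen in a small plain-Java sketch (the real Hadoop TextInputFormat emits a LongWritable byte offset as the key and a Text line as the value; the class and method names below are illustrative, not Hadoop API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordReaderSketch {
    // Mimics what a text-input record reader hands to the Mapper:
    // key = byte offset where the line starts, value = the line itself.
    static Map<Long, String> toRecords(String raw) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : raw.split("\n", -1)) {
            if (!line.isEmpty()) records.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(toRecords("hello world\nbob likes hdfs\n"));
        // {0=hello world, 12=bob likes hdfs}
    }
}
```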
 Driver class: binds the Map and Reduce functions together,
configures the job, and initiates the process.
 A Combiner, also known as a semi-reducer, locally
aggregates the map key-value outputs,
which improves performance by reducing the
amount of data sent over the network.
 Example: instead of sending 3 key-value pairs like
<bob,1>
<bob,1>
<bob,1>
it will simply send the aggregated key-value pair
<bob,3>
 The Combiner is an optional class, since it has limitations:
it only works for associative and commutative operations (such as
max or sum), so it cannot be applied directly to functions like
mean or median.
 Example 1:
Max of (12,6,4,9) is 12
With combiner:
Map job1 = max(12,6) = 12
Map job2 = max(4,9) = 9
Reducer = max(12,9)=12
 Example 2:
mean of (12,6,4,9) is 7.75
With combiner (uneven splits):
Map job1 = mean(12,6,4) = 7.33
Map job2 = mean(9) = 9
Reducer = mean(7.33,9) = 8.17, which is wrong:
a mean of per-split means ignores how many values each split held.
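The two examples can be checked in a small plain-Java sketch (not the Hadoop API; the method names are illustrative). Max survives the combiner because it is associative; a naive mean of per-split means does not:

```java
import java.util.Arrays;

public class CombinerPitfall {
    // max is associative: combining per-split maxima gives the global max.
    static int maxWithCombiner(int[] split1, int[] split2) {
        int m1 = Arrays.stream(split1).max().getAsInt(); // combiner on split 1
        int m2 = Arrays.stream(split2).max().getAsInt(); // combiner on split 2
        return Math.max(m1, m2);                         // reducer
    }

    // mean is NOT associative: averaging per-split means ignores split sizes.
    static double naiveMeanWithCombiner(int[] split1, int[] split2) {
        double m1 = Arrays.stream(split1).average().getAsDouble();
        double m2 = Arrays.stream(split2).average().getAsDouble();
        return (m1 + m2) / 2.0; // wrong unless both splits are equal-sized
    }

    public static void main(String[] args) {
        int[] s1 = {12, 6, 4}, s2 = {9};
        System.out.println(maxWithCombiner(s1, s2));       // 12, correct
        System.out.println(naiveMeanWithCombiner(s1, s2)); // ~8.17, true mean is 7.75
    }
}
```

The standard fix for mean is to have the combiner emit (sum, count) pairs and let the reducer divide once at the end.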
 Partitioner: partitions the map
key-value outputs. Simply put, the
partitioner divides the data among the available
reducers (by default, by hashing the key).
 Output Format: defines the location where the
processed data is to be stored.
 Record Writer: the last phase, where every
key-value pair output from the Reducer is
written to the location defined by the Output Format.
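Hadoop's default partitioning rule (HashPartitioner) is just the key's hash, sign-masked, modulo the number of reducers. A plain-Java sketch of that rule:

```java
public class HashPartitionerSketch {
    // Mirrors Hadoop's default HashPartitioner: mask the sign bit so the
    // result is non-negative, then mod by the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String k : new String[]{"bob", "alice", "hdfs"}) {
            System.out.println(k + " -> reducer " + getPartition(k, reducers));
        }
    }
}
```

Because the same key always hashes to the same partition, all values for one key reach the same reducer.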
Example: MapReduce Programming (Java)
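The classic first program is WordCount. A minimal plain-Java sketch of its data flow (map, then shuffle/sort, then reduce) is shown below; the real Hadoop version uses Mapper and Reducer classes from org.apache.hadoop.mapreduce, but the logic per phase is the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlow {
    // Map phase: for every word in every input line, emit <word, 1>.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle/sort + reduce phase: group pairs by key, sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // keys sorted, like shuffle output
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("bob likes hdfs", "bob likes mapreduce");
        System.out.println(reduce(map(input)));
        // {bob=2, hdfs=1, likes=2, mapreduce=1}
    }
}
```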
How to run a MapReduce Jar File
 Package the MapReduce program as a .jar file,
make sure the input data is stored in HDFS,
then run the .jar file:
hadoop jar test.jar Demo /user/data/input /user/data/output
i.e. hadoop jar file.jar DriverClassName(Demo) /sourceDirectory /destinationDirectory
Output files of a MapReduce job
_SUCCESS: on the successful completion of a job,
the MapReduce runtime creates an empty _SUCCESS file.
Applications that need to check whether the job
completed successfully look for this file; one
example is job scheduling systems like Oozie.
_logs: contains the log details of the job events.
part-m-00000: the 'm' stands for map-only jobs, i.e.
only a mapper was used to complete the job.
part-r-00000: the 'r' stands for reduce jobs, i.e. a
reducer was also used to complete the job.
Next
 We will learn a high-level language called Pig
for analyzing massive amounts of data.