MapReduce Execution Architecture
Rupak Roy
Terminology Explanations:
 InputFormat: defines how the input data is divided into input splits, which in turn determines the number of map tasks (one per split).
 Record Reader: reads the data from an input split one line at a time
and converts it into key-value pairs for the Mapper function.
By default the Map function reads data in text input format.
Another role of the record reader: when HDFS splits
the data into blocks of 64 MB (default), it does not consider the
type of data while creating the blocks, so a block may
terminate a logical record, for
example in the middle of a line or a row of a text file.
In such a case the record reader detects the break in the
logical record, fetches the remaining part from the next
block, and makes it part of the input split.
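The (key, value) convention above can be seen in a small plain-Java sketch (the real Hadoop TextInputFormat emits a LongWritable byte offset as the key and a Text line as the value; the class and method names below are illustrative, not Hadoop API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordReaderSketch {
    // Mimics what a text-input record reader hands to the Mapper:
    // key = byte offset where the line starts, value = the line itself.
    static Map<Long, String> toRecords(String raw) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : raw.split("\n", -1)) {
            if (!line.isEmpty()) records.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(toRecords("hello world\nbob likes hdfs\n"));
        // {0=hello world, 12=bob likes hdfs}
    }
}
```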
 Driver class: binds the Map and Reduce functions together,
configures the job, and initiates the process.
 A Combiner, also known as a semi-reducer, locally
aggregates the map key-value outputs,
which improves performance by reducing the
amount of data sent over the network.
 Example: instead of sending 3 key-value pairs like
<bob,1>
<bob,1>
<bob,1>
it will simply send the aggregated key-value pair
<bob,3>
 The Combiner is an optional class, since it has limitations:
it only works for associative and commutative operations (such as
max or sum), so it cannot be applied directly to functions like
mean or median.
 Example 1:
Max of (12,6,4,9) is 12
With combiner:
Map job1 = max(12,6) = 12
Map job2 = max(4,9) = 9
Reducer = max(12,9)=12
 Example 2:
mean of (12,6,4,9) is 7.75
With combiner (uneven splits):
Map job1 = mean(12,6,4) = 7.33
Map job2 = mean(9) = 9
Reducer = mean(7.33,9) = 8.17, which is wrong:
a mean of per-split means ignores how many values each split held.
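The two examples can be checked in a small plain-Java sketch (not the Hadoop API; the method names are illustrative). Max survives the combiner because it is associative; a naive mean of per-split means does not:

```java
import java.util.Arrays;

public class CombinerPitfall {
    // max is associative: combining per-split maxima gives the global max.
    static int maxWithCombiner(int[] split1, int[] split2) {
        int m1 = Arrays.stream(split1).max().getAsInt(); // combiner on split 1
        int m2 = Arrays.stream(split2).max().getAsInt(); // combiner on split 2
        return Math.max(m1, m2);                         // reducer
    }

    // mean is NOT associative: averaging per-split means ignores split sizes.
    static double naiveMeanWithCombiner(int[] split1, int[] split2) {
        double m1 = Arrays.stream(split1).average().getAsDouble();
        double m2 = Arrays.stream(split2).average().getAsDouble();
        return (m1 + m2) / 2.0; // wrong unless both splits are equal-sized
    }

    public static void main(String[] args) {
        int[] s1 = {12, 6, 4}, s2 = {9};
        System.out.println(maxWithCombiner(s1, s2));       // 12, correct
        System.out.println(naiveMeanWithCombiner(s1, s2)); // ~8.17, true mean is 7.75
    }
}
```

The standard fix for mean is to have the combiner emit (sum, count) pairs and let the reducer divide once at the end.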
 Partitioner: partitions the map
key-value outputs. Simply put, the
partitioner divides the data among the available
reducers (by default, by hashing the key).
 Output Format: defines the location where the
processed data is to be stored.
 Record Writer: the last phase, where every
key-value pair output from the Reducer is
written to the location defined by the Output Format.
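Hadoop's default partitioning rule (HashPartitioner) is just the key's hash, sign-masked, modulo the number of reducers. A plain-Java sketch of that rule:

```java
public class HashPartitionerSketch {
    // Mirrors Hadoop's default HashPartitioner: mask the sign bit so the
    // result is non-negative, then mod by the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String k : new String[]{"bob", "alice", "hdfs"}) {
            System.out.println(k + " -> reducer " + getPartition(k, reducers));
        }
    }
}
```

Because the same key always hashes to the same partition, all values for one key reach the same reducer.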
Example: MapReduce Programming (Java)
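The classic first program is WordCount. A minimal plain-Java sketch of its data flow (map, then shuffle/sort, then reduce) is shown below; the real Hadoop version uses Mapper and Reducer classes from org.apache.hadoop.mapreduce, but the logic per phase is the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlow {
    // Map phase: for every word in every input line, emit <word, 1>.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle/sort + reduce phase: group pairs by key, sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // keys sorted, like shuffle output
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("bob likes hdfs", "bob likes mapreduce");
        System.out.println(reduce(map(input)));
        // {bob=2, hdfs=1, likes=2, mapreduce=1}
    }
}
```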
How to run a MapReduce Jar File
 Package the MapReduce program as a .jar file,
make sure the input data is stored in HDFS,
then run the .jar file:
hadoop jar test.jar Demo /user/data/input /user/data/output
i.e. hadoop jar file.jar DriverClassName(Demo) /sourceDirectory /destinationDirectory
Output files of a MapReduce job
_SUCCESS: on the successful completion of a job,
the MapReduce runtime creates an empty _SUCCESS file.
Applications that need to check whether the job
completed successfully look for this file; one
example is job scheduling systems like Oozie.
_logs: contains the log details of the job events.
part-m-00000: the 'm' stands for map-only jobs, i.e.
only a mapper was used to complete the job.
part-r-00000: the 'r' stands for reduce jobs, i.e. a
reducer was also used to complete the job.
Next
 We will learn a high-level language called Pig
for analyzing massive amounts of data.