SlideShare a Scribd company logo
1 of 13
Topic Name :
MapReduce
Presented By:
Mrs. U. S. Patil
Assistant Professor,
SIT COE Yadrav Ichalkranji.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server
to store and process data.
Traditional model is certainly not suitable to process huge volumes
of scalable data and cannot be accommodated by standard database
servers.
The centralized system creates too much of a bottleneck while
processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called
MapReduce.
MapReduce divides a task into small parts and assigns them to
many computers. Later, the results are collected at one place and
integrated to form the result dataset.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely
Map and Reduce:
Map:
The Map task takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-
value pairs).
Reduce:
The Reduce task takes the output from the Map as an input and
combines those data tuples (key-value pairs) into a smaller set of
tuples.
The reduce task is always performed after the map job.
Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-
value pairs and processes each one of them to generate zero or more
key-value pairs.
Intermediate Keys − They key-value pairs generated by the
mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups
similar data from the map phase into identifiable sets. It takes the
intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one
mapper. It is not a part of the main MapReduce algorithm; it is
optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and
Sort step. It downloads the grouped key-value pairs onto the local
machine, where the Reducer is running. The individual key-value
pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated
easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as
input and runs a Reducer function on each one of them. Here, the
data can be aggregated, filtered, and combined in a number of
ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final
step.
Output Phase − In the output phase, we have an output formatter
that translates the final key-value pairs from the Reducer function
and writes them onto a file using a record writer.
MapReduce-Example:
Task Execution :
Task Execution
MapReduce processes the data in various phases with the help of
different components.
Input Files
The data for MapReduce job is stored in Input Files. Input files
reside in HDFS. The input file format is random. Line-based log
files and binary format can also be used.
InputFormat
After that InputFormat defines how to divide and read these input
files. It selects the files or other objects for input. InputFormat
creates InputSplit.
InputSplits
It represents the data that will be processed by an individual
Mapper. For each split, one map task is created. Therefore the
number of map tasks is equal to the number of InputSplits.
Framework divide split into records, which mapper process.
Task Execution
MapReduce processes the data in various phases with the help of different
components.
RecordReader
It communicates with the inputSplit. And then transforms the data into key-value
pairs suitable for reading by the Mapper. RecordReader by default uses
TextInputFormat to transform data into a key-value pair. It interrelates to the
InputSplit until the completion of file reading. It allocates a byte offset to each line
present in the file. Then, these key-value pairs are further sent to the mapper for
added processing.
Mapper
It processes input records produced by the RecordReader and generates intermediate
key-value pairs. The intermediate output is entirely different from the input pair. The
output of the mapper is a full group of key-value pairs. Hadoop framework does not
store the output of the mapper on HDFS. Mapper doesn’t store, as data is temporary
and writing on HDFS will create unnecessary multiple copies. Then Mapper passes
the output to the combiner for extra processing.
Combiner
Combiner is Mini-reducer that performs local aggregation on the mapper’s output. It
minimizes the data transfer between mapper and reducer. So, when the combiner
functionality completes, the framework passes the output to the partitioner for further
processing.
Task Execution
Partitioner
Partitioner comes into existence if we are working with more than one
reducer. It grabs the output of the combiner and performs partitioning.
Partitioning of output occurs based on the key in MapReduce. By hash
function, the key (or a subset of the key) derives the partition.
Based on key-value in MapReduce, partitioning of each combiner output
occur. And then the record having a similar key-value goes into the same
partition. After that, each partition is sent to a reducer.
Partitioning in MapReduce execution permits even distribution of the
map output over the reducer.
Shuffling and Sorting
After partitioning, the output is shuffled to the reduced node. The
shuffling is the physical transfer of the data which is done over the
network. As all the mappers complete and shuffle the output on the
reducer nodes. Then the framework joins this intermediate output and
sort. This is then offered as input to reduce the phase.
Task Execution
Reducer
The reducer then takes a set of intermediate key-value pairs
produced by the mappers as the input. After that runs a reducer
function on each of them to generate the output. The output of the
reducer is the decisive output. Then the framework stores the output
on HDFS.
RecordWriter
It writes these output key-value pairs from the Reducer phase to the
output files.
OutputFormat
OutputFormat defines the way how RecordReader writes these
output key-value pairs in output files. So, its instances offered by
the Hadoop write files in HDFS. Thus OutputFormat instances
write the decisive output of reducer on HDFS.
Thank You

More Related Content

Similar to MapReduce.pptx

Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
S_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptxS_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptxRajiArun7
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce AnandMHadoop
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)anh tuan
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmNilaNila16
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & HadoopAhmed Gamil
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large ClustersIRJET Journal
 
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...jencyjayastina
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 

Similar to MapReduce.pptx (20)

Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
S_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptxS_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptx
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Unit 2
Unit 2Unit 2
Unit 2
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 

Recently uploaded

Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 

Recently uploaded (20)

Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 

MapReduce.pptx

  • 1. Topic Name : MapReduce Presented By: Mrs. U. S. Patil Assistant Professor, SIT COE Yadrav Ichalkranji.
  • 2. Why MapReduce? Traditional Enterprise Systems normally have a centralized server to store and process data. Traditional model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated by standard database servers. The centralized system creates too much of a bottleneck while processing multiple files simultaneously. Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.
  • 4. The MapReduce algorithm contains two important tasks, namely Map and Reduce: Map: The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key- value pairs). Reduce: The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples. The reduce task is always performed after the map job.
  • 5. Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs. Map − Map is a user-defined function, which takes a series of key- value pairs and processes each one of them to generate zero or more key-value pairs. Intermediate Keys − They key-value pairs generated by the mapper are known as intermediate keys. Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user- defined code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
  • 6. Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task. Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step. Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
  • 9. Task Execution MapReduce processes the data in various phases with the help of different components. Input Files The data for MapReduce job is stored in Input Files. Input files reside in HDFS. The input file format is random. Line-based log files and binary format can also be used. InputFormat After that InputFormat defines how to divide and read these input files. It selects the files or other objects for input. InputFormat creates InputSplit. InputSplits It represents the data that will be processed by an individual Mapper. For each split, one map task is created. Therefore the number of map tasks is equal to the number of InputSplits. Framework divide split into records, which mapper process.
  • 10. Task Execution MapReduce processes the data in various phases with the help of different components. RecordReader It communicates with the inputSplit. And then transforms the data into key-value pairs suitable for reading by the Mapper. RecordReader by default uses TextInputFormat to transform data into a key-value pair. It interrelates to the InputSplit until the completion of file reading. It allocates a byte offset to each line present in the file. Then, these key-value pairs are further sent to the mapper for added processing. Mapper It processes input records produced by the RecordReader and generates intermediate key-value pairs. The intermediate output is entirely different from the input pair. The output of the mapper is a full group of key-value pairs. Hadoop framework does not store the output of the mapper on HDFS. Mapper doesn’t store, as data is temporary and writing on HDFS will create unnecessary multiple copies. Then Mapper passes the output to the combiner for extra processing. Combiner Combiner is Mini-reducer that performs local aggregation on the mapper’s output. It minimizes the data transfer between mapper and reducer. So, when the combiner functionality completes, the framework passes the output to the partitioner for further processing.
  • 11. Task Execution Partitioner Partitioner comes into existence if we are working with more than one reducer. It grabs the output of the combiner and performs partitioning. Partitioning of output occurs based on the key in MapReduce. By hash function, the key (or a subset of the key) derives the partition. Based on key-value in MapReduce, partitioning of each combiner output occur. And then the record having a similar key-value goes into the same partition. After that, each partition is sent to a reducer. Partitioning in MapReduce execution permits even distribution of the map output over the reducer. Shuffling and Sorting After partitioning, the output is shuffled to the reduced node. The shuffling is the physical transfer of the data which is done over the network. As all the mappers complete and shuffle the output on the reducer nodes. Then the framework joins this intermediate output and sort. This is then offered as input to reduce the phase.
  • 12. Task Execution Reducer The reducer then takes a set of intermediate key-value pairs produced by the mappers as the input. After that runs a reducer function on each of them to generate the output. The output of the reducer is the decisive output. Then the framework stores the output on HDFS. RecordWriter It writes these output key-value pairs from the Reducer phase to the output files. OutputFormat OutputFormat defines the way how RecordReader writes these output key-value pairs in output files. So, its instances offered by the Hadoop write files in HDFS. Thus OutputFormat instances write the decisive output of reducer on HDFS.