SlideShare a Scribd company logo
1 of 11
In this chapter, we will cover:
• Understanding the basics of MapReduce
• Introducing Hadoop MapReduce
• Understanding the Hadoop MapReduce fundamentals
• Writing a Hadoop MapReduce example
• Understanding several possible MapReduce definitions to
solve business problems
• Learning different ways to write Hadoop MapReduce in R
Writing Hadoop
MapReduce Programs By Shaik Jani
• HDFS handles the Distributed Filesystem layer
• MapReduce is a programming model for data processing. Execution sequence:
• MapReduce
– Framework for parallel computing
– Programmers get simple API
– Don’t have to worry about handling
• parallelization
• data distribution
• load balancing
• fault tolerance
• Allows one to process huge amounts of data (terabytes and petabytes) on
thousands of processors
Companies using MapReduce include: Amazon, eBay, Google, LinkedIn, Trovit,
Twitter.
Map Reduce
Introducing Hadoop
MapReduce
Listing Hadoop MapReduce entities
The following are the components of Hadoop that
are responsible for performing analytics over Big
Data:
• Client: This initializes the job
• JobTracker: This monitors the job
• TaskTracker: This executes the job
• HDFS: This stores the input and output data
Understanding the Hadoop MapReduce
Scenario
The four main stages of Hadoop MapReduce data
processing are as follows:
• The loading of data into HDFS
• The execution of the Map phase
• Shuffling and sorting
• The execution of the Reduce phase
1.Loading data into HDFS
• The input dataset needs to be uploaded to the Hadoop directory so it can be used by MapReduce nodes.
• Then, Hadoop Distributed File System (HDFS) will divide the input dataset into: data splits and store them to DataNodes in a
cluster.
• All the data splits will be processed by TaskTracker for the Map and Reduce tasks in a parallel manner.
 There are some alternative ways to get the dataset in HDFS with
Hadoop components:
1.Sqoop : This is an open source tool designed for efficiently transferring bulk
data between Apache Hadoop and structured, relational databases.
2.Flume :This is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data to HDFS.
2.Executing the Map phase
2.Executing the Map phase
 Executing the client application starts the Hadoop MapReduce processes.
• The Map phase then copies the job resources (unjarred class files) and stores it to HDFS, and
requests JobTracker to execute the job.
• The JobTracker initializes the job, retrieves the input, splits the information, and creates a Map task for each job.
• The JobTracker will call TaskTracker to run the Map task over the assigned input
data subset.
• The Map task reads this input split data as input (key, value) pairs
provided to the Mapper method, which then produces intermediate (key, value)
pairs.
• There will be at least one output for each input (key, value) pair.
After the completion of this Map operation, the TaskTracker will keep
the result in its buffer storage and local disk space (if the output data
size is more than the threshold).
For example, suppose we have a Map function that converts the input
text into lowercase.
This will convert the list of input strings into a list of lowercase strings.
Shuffling and sorting
4.Reducing phase execution
• As soon as the Mapper output is available, TaskTracker in the Reducer node will retrieve the available partitioned Map's
output data, and they will be grouped together and merged into one large file, which will then be assigned to a process with
a Reducer method.
• Finally, this will be sorted out before data is provided to the Reducer method.
• The Reducer method receives a list of input values from an input (key, list (value)) and aggregates them based on custom
logic, and produces the output (key, value) pairs.
• The output of the Reducer method of the Reduce phase will directly be written into HDFS as per the format specified by the
MapReduce job configuration class.
Reducing input values to an aggregate value as output:
Understanding the limitations of MapReduce
• The MapReduce framework is notoriously difficult to leverage for transformational logic that is not as simple, for example,
real-time streaming, graph processing, and message passing.
• Data querying is inefficient over distributed, unindexed data than in a database created with indexed data.
• We can't parallelize the Reduce task to the Map task to reduce the overall processing time because Reduce tasks do not start
until the output of the Map tasks is available to it.
• Long-running Reduce tasks can't be completed because of their poor resource
utilization
Understanding the Hadoop MapReduce
fundamentals
To understand Hadoop MapReduce fundamentals properly, we will:
1.Understanding MapReduce objects
 MapReduce operations in Hadoop are carried out mainly by three objects: Mapper, Reducer, and Driver.
• Mapper: This is designed for the Map phase of MapReduce, which starts MapReduce operations by carrying input files and
splitting them into several pieces. For each piece, it will emit a key-value data pair as the output value.
• Reducer: This is designed for the Reduce phase of a MapReduce job; it accepts key-based grouped data from the Mapper
output, reduces it by aggregation logic, and emits the (key, value) pair for the group of values.
• Driver: This is the main file that drives the MapReduce process.
It starts the execution of MapReduce tasks, responsible for building the configuration of a job and submitting it to the
Hadoop cluster.
2.Deciding the number of Maps in MapReduce
• The number of maps is driven by the total size of the inputs
• Hadoop has found the right level of parallelism for maps is between 10-100 maps/node
•If you expect 10TB of input data and have a block size of 128MB, you will have 82,000 maps
• Number of tasks controlled by number of splits returned and can be user overridden
3.Deciding the number of Reducers in MapReduce
• A numbers of Reducers are created based on the Mapper's input.
• Additionally, we can set the number of Reducers at runtime along with the MapReduce command at the command prompt -D
mapred.reduce.tasks, with the number you want. Programmatically, it can be set via conf. setNumReduceTasks(int).
Understanding MapReduce dataflow
Hadoop data processing includes several tasks that help achieve the final output
from an input dataset. These tasks are as follows:
1. Preloading data in HDFS.
2. Running MapReduce by calling Driver.
3. Reading of input data by the Mappers, which results in the splitting of
the data execution of the Mapper custom logic and the generation of
intermediate key-value pairs
4. Executing Combiner and the shuffle phase to optimize the overall Hadoop
MapReduce process.
5. Sorting and providing of intermediate key-value pairs to the Reduce phase.
The Reduce phase is then executed. Reducers take these partitioned key-value
pairs and aggregate them based on Reducer logic.
6. The final output data is stored at HDFS.

More Related Content

What's hot

An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Lu Wei
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explainedDmytro Sandu
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
 

What's hot (20)

Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop
HadoopHadoop
Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 

Similar to writing Hadoop Map Reduce programs

Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfWasyihunSema2
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigDataThanusha154
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionDong Ngoc
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptxSakthiVinoth78
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming ModelAdarshaDhakal
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 

Similar to writing Hadoop Map Reduce programs (20)

Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Unit 2
Unit 2Unit 2
Unit 2
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
E031201032036
E031201032036E031201032036
E031201032036
 
Hadoop
HadoopHadoop
Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 

Recently uploaded

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 

Recently uploaded (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 

writing Hadoop Map Reduce programs

  • 1. In this chapter, we will cover: • Understanding the basics of MapReduce • Introducing Hadoop MapReduce • Understanding the Hadoop MapReduce fundamentals • Writing a Hadoop MapReduce example • Understanding several possible MapReduce definitions to solve business problems • Learning different ways to write Hadoop MapReduce in R Writing Hadoop MapReduce Programs By Shaik Jani
  • 2. • HDFS handles the Distributed Filesystem layer • MapReduce is a programming model for data processing. Execution sequence: • MapReduce – Framework for parallel computing – Programmers get simple API – Don’t have to worry about handling • parallelization • data distribution • load balancing • fault tolerance • Allows one to process huge amounts of data (terabytes and petabytes) on thousands of processors Companies using MapReduce include: Amazon, eBay, Google, LinkedIn, Trovit, Twitter. Map Reduce
  • 3. Introducing Hadoop MapReduce Listing Hadoop MapReduce entities The following are the components of Hadoop that are responsible for performing analytics over Big Data: • Client: This initializes the job • JobTracker: This monitors the job • TaskTracker: This executes the job • HDFS: This stores the input and output data Understanding the Hadoop MapReduce Scenario The four main stages of Hadoop MapReduce data processing are as follows: • The loading of data into HDFS • The execution of the Map phase • Shuffling and sorting • The execution of the Reduce phase
  • 4. 1.Loading data into HDFS • The input dataset needs to be uploaded to the Hadoop directory so it can be used by MapReduce nodes. • Then, Hadoop Distributed File System (HDFS) will divide the input dataset into: data splits and store them to DataNodes in a cluster. • All the data splits will be processed by TaskTracker for the Map and Reduce tasks in a parallel manner.  There are some alternative ways to get the dataset in HDFS with Hadoop components: 1.Sqoop : This is an open source tool designed for efficiently transferring bulk data between Apache Hadoop and structured, relational databases. 2.Flume :This is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data to HDFS. 2.Executing the Map phase
  • 5. 2.Executing the Map phase  Executing the client application starts the Hadoop MapReduce processes. • The Map phase then copies the job resources (unjarred class files) and stores it to HDFS, and requests JobTracker to execute the job. • The JobTracker initializes the job, retrieves the input, splits the information, and creates a Map task for each job. • The JobTracker will call TaskTracker to run the Map task over the assigned input data subset. • The Map task reads this input split data as input (key, value) pairs provided to the Mapper method, which then produces intermediate (key, value) pairs. • There will be at least one output for each input (key, value) pair. After the completion of this Map operation, the TaskTracker will keep the result in its buffer storage and local disk space (if the output data size is more than the threshold). For example, suppose we have a Map function that converts the input text into lowercase. This will convert the list of input strings into a list of lowercase strings.
  • 7. 4.Reducing phase execution • As soon as the Mapper output is available, TaskTracker in the Reducer node will retrieve the available partitioned Map's output data, and they will be grouped together and merged into one large file, which will then be assigned to a process with a Reducer method. • Finally, this will be sorted out before data is provided to the Reducer method. • The Reducer method receives a list of input values from an input (key, list (value)) and aggregates them based on custom logic, and produces the output (key, value) pairs. • The output of the Reducer method of the Reduce phase will directly be written into HDFS as per the format specified by the MapReduce job configuration class. Reducing input values to an aggregate value as output:
  • 8. Understanding the limitations of MapReduce • The MapReduce framework is notoriously difficult to leverage for transformational logic that is not as simple, for example, real-time streaming, graph processing, and message passing. • Data querying is inefficient over distributed, unindexed data than in a database created with indexed data. • We can't parallelize the Reduce task to the Map task to reduce the overall processing time because Reduce tasks do not start until the output of the Map tasks is available to it. • Long-running Reduce tasks can't be completed because of their poor resource utilization
  • 9. Understanding the Hadoop MapReduce fundamentals To understand Hadoop MapReduce fundamentals properly, we will: 1.Understanding MapReduce objects  MapReduce operations in Hadoop are carried out mainly by three objects: Mapper, Reducer, and Driver. • Mapper: This is designed for the Map phase of MapReduce, which starts MapReduce operations by carrying input files and splitting them into several pieces. For each piece, it will emit a key-value data pair as the output value. • Reducer: This is designed for the Reduce phase of a MapReduce job; it accepts key-based grouped data from the Mapper output, reduces it by aggregation logic, and emits the (key, value) pair for the group of values. • Driver: This is the main file that drives the MapReduce process. It starts the execution of MapReduce tasks, responsible for building the configuration of a job and submitting it to the Hadoop cluster. 2.Deciding the number of Maps in MapReduce • The number of maps is driven by the total size of the inputs • Hadoop has found the right level of parallelism for maps is between 10-100 maps/node •If you expect 10TB of input data and have a block size of 128MB, you will have 82,000 maps • Number of tasks controlled by number of splits returned and can be user overridden
  • 10. 3.Deciding the number of Reducers in MapReduce • A numbers of Reducers are created based on the Mapper's input. • Additionally, we can set the number of Reducers at runtime along with the MapReduce command at the command prompt -D mapred.reduce.tasks, with the number you want. Programmatically, it can be set via conf. setNumReduceTasks(int). Understanding MapReduce dataflow
  • 11. Hadoop data processing includes several tasks that help achieve the final output from an input dataset. These tasks are as follows: 1. Preloading data in HDFS. 2. Running MapReduce by calling Driver. 3. Reading of input data by the Mappers, which results in the splitting of the data execution of the Mapper custom logic and the generation of intermediate key-value pairs 4. Executing Combiner and the shuffle phase to optimize the overall Hadoop MapReduce process. 5. Sorting and providing of intermediate key-value pairs to the Reduce phase. The Reduce phase is then executed. Reducers take these partitioned key-value pairs and aggregate them based on Reducer logic. 6. The final output data is stored at HDFS.