Big Data and Hadoop Training
MapReduce
Agenda
• Meet MapReduce
• Word Count Algorithm – Traditional approach
• Traditional approach on a Distributed System
• Traditional approach – Drawbacks
• MapReduce Approach
• Input & Output Forms of a MR program
• Map, Shuffle & Sort, Reduce Phase
• WordCount Code walkthrough
• Workflow & Transformation of Data
• Input Split & HDFS Block
• Relation between Split & Block
• Data locality Optimization
• Speculative Execution
• MR Flow with Single Reduce Task
• MR flow with multiple Reducers
• Input Format & Hierarchy
• Output Format & Hierarchy
Meet MapReduce
• MapReduce is a programming model for distributed processing
• Advantage: easy scaling of data processing over multiple computing nodes
• The basic entities in this model are mappers & reducers
• Decomposing a data processing application into mappers and reducers is the developer's task
• Once an application is written in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change
WordCount – Traditional Approach
• Input: do as I say not as I do
• Output:
  Word   Count
  as     2
  do     2
  I      2
  not    1
  say    1
WordCount – Traditional Approach
define wordCount as Multiset;
for each document in documentSet {
T = tokenize(document);
for each token in T {
wordCount[token]++;
}
}
display(wordCount);
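For reference, a minimal single-machine Java sketch of the pseudocode above (the command-line file handling and whitespace tokenization are illustrative assumptions, not part of the slide):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
  public static void main(String[] args) throws IOException {
    Map<String, Integer> wordCount = new HashMap<>();        // the "Multiset"
    for (String fileName : args) {                           // documentSet = files passed on the command line
      for (String line : Files.readAllLines(Paths.get(fileName))) {
        for (String token : line.split("\\s+")) {            // tokenize(document)
          if (!token.isEmpty()) {
            wordCount.merge(token, 1, Integer::sum);         // wordCount[token]++
          }
        }
      }
    }
    wordCount.forEach((w, c) -> System.out.println(w + "\t" + c));  // display(wordCount)
  }
}
```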
Traditional Approach – Distributed Processing
define wordCount as Multiset;
for each document in documentSubset {
< same code as in prev. slide >
}
sendToSecondPhase(wordCount);
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
multisetAdd (totalWordCount, wordCount);
}
Traditional Approach – Drawbacks
• Central storage: the bandwidth of a single server becomes a bottleneck
• Multiple storage: splits have to be managed
• The program runs in memory: when processing large document sets, the number of unique words can exceed a machine's RAM
• Can phase 2 be handled by one machine?
• If multiple machines are used for phase 2, how should the data be partitioned?
MapReduce Approach
• Has two execution phases: mapping & reducing
• These phases are defined by data processing functions called mapper & reducer
• Mapping phase: MR takes the input data and feeds each data element to the mapper
• Reducing phase: the reducer processes all the outputs from the mapper and arrives at a final result
Input & Output Forms of a MR Program
• In order for mapping, reducing, partitioning, and shuffling (and a few other steps) to work together seamlessly, we need to agree on a common structure for the data being processed
• The InputFormat class is responsible for creating input splits and dividing them into records

            Input              Output
map()       <k1, v1>           list(<k2, v2>)
reduce()    <k2, list(v2)>     list(<k3, v3>)
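A sketch of how these forms map onto the Mapper type parameters in the Java API, using word count as the example (the class name and tokenization are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper<k1, v1, k2, v2> for word count:
//   k1 = byte offset of the line (LongWritable), v1 = the line itself (Text)
//   k2 = word (Text),                            v2 = partial count (IntWritable)
public class WordCountTypedMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // each call contributes to list(<k2, v2>)
      }
    }
  }
}
// The matching reducer extends Reducer<Text, IntWritable, Text, IntWritable>,
// i.e. <k2, list(v2)> in and list(<k3, v3>) out.
```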
Map Phase
Reduce Phase
Shuffle & Sort Phase
MR - Workflow & Transformation of Data
• From input files to the mapper
• From the mapper to the intermediate results
• From the intermediate results to the reducer
• From the reducer to the output files
Word Count: Source Code
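The original slide embeds the listing as an image; below is a representative WordCount in the new MapReduce API (org.apache.hadoop.mapreduce), in the shape of the standard Hadoop example, for the walkthrough:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(): <byte offset, line> -> list(<word, 1>)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce(): <word, list(counts)> -> <word, total>
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional combiner (see the Combiner slide)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```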
Input Split & HDFS Block
• HDFS Block: the physical division of a data chunk
• Input Split: the logical division of a data chunk
Relation Between Input Split & HDFS Block
[Diagram: a file of ten lines laid out across four HDFS block boundaries, covered by three input splits]
• Logical records do not fit neatly into HDFS blocks.
• Logical records (here, lines) can cross block boundaries.
• The first split contains line 5 even though that line spans two blocks.
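Split size is tunable per job; a sketch, assuming the Hadoop 2.x FileInputFormat helpers, of how split size relates to block size:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// The split size is computed roughly as:
//   splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
// so with the defaults one input split corresponds to one HDFS block.
public class SplitSizeExample {
  public static void configure(Job job) {
    // Force splits between 128 MB and 256 MB for this job (values are illustrative).
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}
```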
Data Locality Optimization
• An MR job is split into various map & reduce tasks
• Map tasks run on the input splits
• Ideally, the task JVM is launched on the node where the split/block of data resides
• In some scenarios, however, that node may not be free to accept another task
• In that case, the task is launched on a Task Tracker at a different location
• Scenario a) Same-node execution
• Scenario b) Off-node execution
• Scenario c) Off-rack execution
Speculative Execution
• An MR job is split into various map & reduce tasks, and they execute in parallel.
• Overall job execution time is dragged out by the slowest task.
• Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup. This is termed speculative execution of tasks (a per-job configuration sketch follows).
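Speculative execution can be switched off per job when backup tasks are undesirable (for example, tasks with external side effects); a sketch using the Hadoop 2.x property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
  public static Job newJob() throws Exception {
    Configuration conf = new Configuration();
    // Speculative execution is enabled by default; disable it for both
    // map and reduce tasks of this job (Hadoop 2.x property names).
    conf.setBoolean("mapreduce.map.speculative", false);
    conf.setBoolean("mapreduce.reduce.speculative", false);
    return Job.getInstance(conf, "no-speculation job");
  }
}
```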
MapReduce Dataflow With A Single Reduce Task
MapReduce Dataflow With Multiple Reduce Tasks
MapReduce Dataflow With No Reduce Tasks
Combiner
• A combiner is a mini-reducer
• It is executed on the mapper output, at the mapper side
• The combiner's output is fed to the reducer
• Because the mapper output is already condensed by the combiner, the data that has to be shuffled across the cluster is minimized
• Because the combiner function is an optimization, Hadoop does not guarantee how many times it will call it for a particular map output record, if at all
• So calling the combiner function zero, one, or many times should produce the same output from the reducer (see the driver snippet below)
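Wiring a combiner into a job is a single driver call; a sketch that reuses the word-count reducer from the earlier listing as the combiner (valid only because summation obeys the contract on the next slide):

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
  // Reuses the reduce logic as the combiner; assumes the WordCount class from
  // the earlier listing is in the same package / on the classpath.
  public static void addCombiner(Job job) {
    job.setCombinerClass(WordCount.IntSumReducer.class);
  }
}
```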
Combiner’s Contract
• Only functions that are commutative & associative can be used as combiners.
• For example:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
whereas
mean(0, 20, 10, 25, 15) = 14, but
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
Partitioner
• We know that a unique key will always go to a unique reducer.
• The Partitioner is responsible for sending <key, value> pairs to a reducer based on the key content.
• The default partitioner is the HashPartitioner: it takes the mapper output, computes a hash value for each key, and takes that value modulo the number of reducers; the result determines which reducer a particular key goes to (sketched below).
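The default behaviour boils down to a hash-modulo computation; a custom partitioner follows the same shape (the class below is an illustrative sketch, equivalent to the default HashPartitioner for Text keys):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash the key and take it modulo the number of reducers,
// masking the sign bit so the result is non-negative.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
// Registered in the driver with: job.setPartitionerClass(WordPartitioner.class);
```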
Partitioner
[Diagram: each Mapper's output passes through a Partitioner, which routes every key to one of the Reducers]
InputFormat Hierarchy
InputFormat
[Diagram: the InputFormat divides the input into input splits; a RecordReader reads each split and feeds its records to a Mapper]
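Concrete members of the hierarchy include TextInputFormat (the default), KeyValueTextInputFormat, SequenceFileInputFormat, and NLineInputFormat; the driver picks one per job, for example (the boolean switch here is illustrative):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatConfig {
  public static void configure(Job job, boolean tabSeparatedInput) {
    if (tabSeparatedInput) {
      // Each line is split on the first tab into <key, value>.
      job.setInputFormatClass(KeyValueTextInputFormat.class);
    } else {
      // Default: <byte offset, whole line>.
      job.setInputFormatClass(TextInputFormat.class);
    }
  }
}
```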
OutputFormat
[Diagram: each Reducer writes its output through a RecordWriter to its own output file, as defined by the OutputFormat]
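On the output side the driver chooses the OutputFormat, the number of reducers (and hence output files), and the output key/value types; a sketch (the path, reducer count, and types are illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatConfig {
  public static void configure(Job job, Path outDir) {
    // One output file per reducer (part-r-00000, part-r-00001, ...).
    job.setNumReduceTasks(4);
    // Write binary <Text, IntWritable> records instead of the default TextOutputFormat.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, outDir);
  }
}
```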
OutputFormat Hierarchy
Counters
• Counters are a useful channel for gathering statistics about a job, for quality control or for application-level statistics.
• Often used for debugging purposes.
• e.g. counting the number of good and bad records in the input (see the mapper sketch below)
• Two types: Built-in & Custom counters
• Examples of Built-in Counters:
• Map input records
• Map output records
• Filesystem bytes read
• Launched map tasks
• Failed map tasks
• Killed reduce tasks
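A sketch of a custom counter that tallies good and bad records in the mapper (the enum, expected field count, and record layout are illustrative assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Passes well-formed records through and tallies record quality in a custom counter group.
public class RecordQualityMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  enum RecordQuality { GOOD, BAD }

  private static final int EXPECTED_FIELDS = 5;  // illustrative schema assumption

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().split(",").length == EXPECTED_FIELDS) {
      context.getCounter(RecordQuality.GOOD).increment(1);
      context.write(value, NullWritable.get());
    } else {
      context.getCounter(RecordQuality.BAD).increment(1);   // drop malformed line
    }
  }
}
```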
Joins
• Map-side join (replication join): works in situations where one of the datasets is small enough to cache in memory
• Reduce-side join (repartition join): for situations where you're joining two or more large datasets together
• Semi-join (a map-side join): another map-side join where one dataset is initially too large to fit into memory, but after some filtering can be reduced to a size that fits in memory
Distributed Cache
• Side data can be defined as extra read-only data needed by a job to process the main dataset
• To make side data available to all map or reduce tasks, we distribute those datasets using Hadoop's Distributed Cache mechanism (see the map-join sketch below).
Map Join (Using Distributed Cache)
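A sketch of the replicated (map-side) join pattern, assuming a small comma-separated lookup file of user ids and names shipped via the distributed cache and joined against large transaction records in the mapper; the file path, symlink name, and field layout are illustrative:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoin {

  public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> userNames = new HashMap<>();
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void setup(Context context) throws IOException {
      // "users" is the local symlink created by the driver below; each line is "userId,userName".
      try (BufferedReader in = new BufferedReader(new FileReader("users"))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split(",", 2);
          userNames.put(parts[0], parts[1]);
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Transaction record assumed as "userId,rest..."; join by replacing the id with the name.
      String[] fields = value.toString().split(",", 2);
      if (fields.length < 2) {
        return;   // skip malformed lines
      }
      outKey.set(userNames.getOrDefault(fields[0], "UNKNOWN"));
      outValue.set(fields[1]);
      context.write(outKey, outValue);
    }
  }

  public static void addSideData(Job job) throws Exception {
    // Ships the small dataset to every task node; "#users" creates the local symlink.
    job.addCacheFile(new URI("/data/users.txt#users"));
  }
}
```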
Some Useful Links:
• http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Thank You