MAP REDUCE
By Ishan Sharma
Animation in presentation can be viewed by downloading it…
WHAT IS MapReduce?
A programming model and an associated
implementation for processing and generating large
data sets with a parallel*, distributed* algorithm on a
cluster*.
A parallel algorithm is an algorithm that can be executed a piece at a time
on many different processing devices, with the pieces combined at
the end to get the correct result.
A distributed algorithm is an algorithm designed to run on computer hardware
constructed from interconnected processors.
A computer cluster consists of connected computers that work together so
that, in many respects, they can be viewed as a single system. Computer
clusters have each node set to perform the same task, controlled and
What is Map()?
A MapReduce program is composed of
a Map() procedure that takes one pair of data with a type
in one data domain, and returns a list of pairs in a
different domain.
It is applied in parallel to every pair in the input dataset.
This produces a list of pairs for each call.
What is Reduce()?
A MapReduce program is also composed of a
Reduce() procedure
that is applied in parallel to each group of pairs sharing the same key,
gathered from all the lists returned by Map(). Each call produces a collection of values
in the same domain. The returns of all calls are collected
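
The Map() and Reduce() descriptions above can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop's implementation: the driver name `map_reduce`, the station/temperature records, and both user functions are hypothetical and chosen only to show pairs moving from one domain (station, text line) to another (year, temperature).

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map_fn on every input pair, group by key, then run reduce_fn per key."""
    # Map phase: every (key, value) pair is handled independently,
    # so a real cluster could process these calls in parallel.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Grouping step: collect all values that share the same intermediate key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct key; the results are collected.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def map_fn(station, line):
    # Input domain: (station name, "year temperature") -> output domain: (year, temperature)
    year, temp = line.split()
    return [(int(year), int(temp))]


def reduce_fn(year, temps):
    # Produces a collection of values for the key: here, the single maximum.
    return [max(temps)]


readings = [
    ("station-a", "1949 11"),
    ("station-a", "1950 22"),
    ("station-b", "1949 17"),
    ("station-b", "1950 -3"),
]
result = map_reduce(readings, map_fn, reduce_fn)
print(result)
```

Because each Map() call sees only one pair and each Reduce() call sees only one key, neither phase needs shared state, which is what makes the model easy to distribute.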
[Figure: Working of JobTracker & TaskTracker in MapReduce engine of Hadoop.
Master side: NameNode and JobTracker. The NameNode holds META DATA (DN1: A,B,C; DN2: D,B,C; ...; DN8:).
Worker side: eight DataNodes, DN1 to DN8, each running a TaskTracker.
Job flow: JOBCONF (User Interface); two map tasks each produce output (o/p) that goes to the Reducer.]
JobTracker and TaskTracker
• The primary function of the Job tracker is resource
management (managing the task trackers), tracking
resource availability and task life cycle management
(tracking its progress, fault tolerance etc.)
• The task tracker has a simple function of following the
orders of the job tracker and updating the job tracker
with its progress status periodically.
The task tracker is pre-configured with a number of slots
indicating the number of tasks it can accept.
Fault Tolerance
▫ The task tracker spawns different JVM
processes to ensure that process failures do
not bring down the task tracker.
▫ The task tracker keeps sending heartbeat
messages to the job tracker to say that it is alive
and to keep it updated with the number of empty
slots available for running more tasks.
▫ From version 0.21 of Hadoop, the job tracker does
some checkpointing of its work in the filesystem.
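
The slot and heartbeat mechanics described above can be modelled with a toy simulation. This is a sketch under stated assumptions, not Hadoop code: the classes reuse the names JobTracker and TaskTracker only for readability, and the `timeout` value, the dictionary-shaped heartbeat, and the first-fit assignment policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    name: str
    slots: int                                   # pre-configured task capacity
    running: list = field(default_factory=list)  # tasks currently occupying slots

    def heartbeat(self):
        """Report liveness together with the number of empty slots."""
        return {"tracker": self.name, "free_slots": self.slots - len(self.running)}


class JobTracker:
    def __init__(self, timeout):
        self.timeout = timeout   # hypothetical: time units without a heartbeat before a tracker counts as dead
        self.last_seen = {}      # tracker name -> time of last heartbeat
        self.free_slots = {}     # tracker name -> empty slots last reported

    def receive(self, beat, now):
        self.last_seen[beat["tracker"]] = now
        self.free_slots[beat["tracker"]] = beat["free_slots"]

    def live_trackers(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout)

    def assign(self, task, trackers, now):
        """Place a task on the first live tracker with an empty slot, else return None."""
        for name in self.live_trackers(now):
            if self.free_slots[name] > 0:
                trackers[name].running.append(task)
                self.free_slots[name] -= 1
                return name
        return None


trackers = {"DN1": TaskTracker("DN1", slots=2), "DN2": TaskTracker("DN2", slots=1)}
jt = JobTracker(timeout=10)
for tracker in trackers.values():
    jt.receive(tracker.heartbeat(), now=0)

# Four tasks compete for three slots; the last one has to wait.
placed = [jt.assign(f"map-{i}", trackers, now=1) for i in range(4)]

# DN2 goes silent while DN1 keeps reporting.
jt.receive(trackers["DN1"].heartbeat(), now=12)
alive = jt.live_trackers(now=15)
print(placed, alive)
```

In this model a tracker that misses its heartbeat window simply drops out of `live_trackers`, so no new work is sent to it; rescheduling the tasks it was running is left out of the sketch.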
Basic allowable input file formats
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat

Java datatype      Box class datatype
int                IntWritable
float              FloatWritable
long               LongWritable
char               Text
String             Text

Box classes implement the WritableComparable interface by default.
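
What "writable" and "comparable" mean for a box class can be illustrated with a small Python analog. This is not the Hadoop API: the class `IntBox` and its method names are hypothetical stand-ins, and the 4-byte big-endian encoding is assumed here to mirror how a Java int is written to a data stream.

```python
import struct
from functools import total_ordering


@total_ordering
class IntBox:
    """Hypothetical analog of a writable, comparable box around an int."""

    def __init__(self, value=0):
        self.value = value

    def write(self):
        # "Writable": the value can be turned into bytes to travel between nodes.
        return struct.pack(">i", self.value)

    def read_fields(self, data):
        # The receiving side refills an existing box from those bytes.
        (self.value,) = struct.unpack(">i", data)

    # "Comparable": boxes used as keys can be ordered during the sort phase.
    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value


wire = IntBox(7).write()
restored = IntBox()
restored.read_fields(wire)
ordered = [box.value for box in sorted([IntBox(3), IntBox(-1), IntBox(7)])]
print(wire, restored.value, ordered)
```

The two abilities serve different stages of a job: serialization moves keys and values across the network, and ordering lets the framework sort keys before they reach a reducer.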
WORDCOUNT JOB (animation in slide)

Input file 200MB, divided into four inputSplits. Each inputSplit has its own RecordReader, which passes (ByteOffset,EntireLine) pairs to its own Mapper.

inputSplit 1 (64MB): What is your name / Where do you live
inputSplit 2 (64MB): I am Ishan / I live in Delhi
inputSplit 3 (64MB): Name of your college / I study in MAIT
inputSplit 4 (8MB): What are your hobbies

RecordReader output for inputSplit 1: (0, What is your name), (19, Where do you live)

INTERMEDIATE DATA (Mapper output)
Mapper 1: What,1 Is,1 Your,1 Name,1 Where,1 Do,1 You,1 Live,1
Mapper 2: I,1 Am,1 Ishan,1 I,1 Live,1 In,1 Delhi,1
Mapper 3: Name,1 Of,1 Your,1 College,1 I,1 Study,1 In,1 MAIT,1
Mapper 4: What,1 Are,1 Your,1 Hobbies,1
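
The split sizes and the (ByteOffset,EntireLine) records in this slide can be reproduced with a short sketch. It is a simplification under stated assumptions: `split_sizes` and `read_records` are hypothetical helper names, sizes are plain numbers in MB, and the reader works on an in-memory byte string rather than a distributed file.

```python
def split_sizes(file_size, split_size):
    """Cut a file into full-size splits plus one smaller remainder split."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(split_size, remaining))
        remaining -= sizes[-1]
    return sizes


def read_records(data):
    """Yield (byte offset of the line start, line text) for each line in data."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the terminator bytes count toward the next offset


sizes = split_sizes(200, 64)
unix = list(read_records(b"What is your name\nWhere do you live\n"))
dos = list(read_records(b"What is your name\r\nWhere do you live\r\n"))
print(sizes)
print(unix)
print(dos)
```

With a one-byte line terminator the second line starts at offset 18; the offset of 19 shown in the slide would be consistent with a two-byte terminator, as in the `dos` case.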
INTERMEDIATE DATA
The (word,1) pairs emitted by the four Mappers.

SHUFFLING
What,2   Is,1   Your,3   Name,2   . . .

SORTING
Am,1   Are,1   . . .   Your,3

Reducer -> RecordWriter -> OutputFile (PART-0000)
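
The whole word count walk-through can be replayed in one process. This is a minimal sketch, not the Hadoop job itself: the four lists stand in for the four inputSplits, words are lower-cased so that "Name" and "name" count as the same key (an assumption; the slide capitalizes every word instead), and the last line is taken to be "What are your hobbies", as the Mapper output on the slide suggests.

```python
from collections import defaultdict

# One list of lines per inputSplit, as in the slide.
splits = [
    ["What is your name", "Where do you live"],
    ["I am Ishan", "I live in Delhi"],
    ["Name of your college", "I study in MAIT"],
    ["What are your hobbies"],
]


def mapper(line):
    """Emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


# Map phase: the intermediate data is every pair from every split.
intermediate = [pair for split in splits for line in split for pair in mapper(line)]

# Shuffle phase: bring all values for the same word together.
grouped = defaultdict(list)
for word, one in intermediate:
    grouped[word].append(one)


def reducer(word, ones):
    """Sum the 1s collected for a word."""
    return word, sum(ones)


# Sort phase followed by reduce: keys reach the reducer in sorted order.
output = [reducer(word, grouped[word]) for word in sorted(grouped)]
for word, count in output:
    print(word, count)
```

The first and last output lines are `am 1` and `your 3`, matching the sorted list on the slide.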
Fields where MapReduce can be
implemented
Distributed pattern-based searching
Distributed sorting
Web link-graph reversal
Web access log stats
Document clustering
Statistical machine translation.
Limitations of MapReduce
• Not every problem is easy to express as a MapReduce
program.
• It is a poor fit when intermediate processes need to talk to each
other, because map and reduce tasks run in isolation.
• It is a poor fit when processing requires a lot of data to
be shuffled over the network.
• The fundamentals of Hadoop were not designed to
facilitate highly interactive analytics.
• The answer you get from a Hadoop cluster may or may
not be 100% accurate, depending on the nature of the
job.
END OF PRESENTATION
THANKS FOR WATCHING…
