MapReduce definition
MapReduce: division into two phases, Map and Reduce
Working of JobTracker, TaskTracker, NameNode, and DataNode in the MapReduce engine of Hadoop
Fault tolerance in Hadoop
Box-class datatypes
Allowable file formats
WordCount job in Hadoop MapReduce, explained with animation
Fields where MapReduce can be implemented
Limitations of MapReduce
MapReduce in Hadoop
1. MAP REDUCE
By Ishan Sharma
Animations in the presentation can be viewed by downloading it…
2. WHAT IS MapReduce?
A programming model and an associated implementation for processing and generating large data sets with a parallel*, distributed* algorithm on a cluster*.
*A parallel algorithm is an algorithm that can be executed a piece at a time on many different processing devices, with the partial results combined at the end to give the correct result.
*A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors.
*A computer cluster consists of connected computers that work together so that, in many respects, they can be viewed as a single system. Computer clusters have each node set to perform the same task, controlled and scheduled by software.
3. What is Map()?
A MapReduce program is composed of a Map() procedure that takes one pair of data with a type in one data domain and returns a list of pairs in a different domain.
It is applied in parallel to every pair in the input dataset, producing a list of pairs for each call.
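A minimal sketch of the Map() idea in plain Python (not the Hadoop API): one input pair (k1, v1) yields a list of pairs (k2, v2) in a different domain. The function and pair names here are illustrative.

```python
def word_count_map(key, line):
    """Input domain: (filename, line of text); output domain: (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

# One call on one input pair produces a list of intermediate pairs.
pairs = word_count_map("doc.txt", "the quick brown fox the")
```

In a real cluster the framework would invoke this function in parallel over every input pair.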
4. What is Reduce()?
A MapReduce program is also composed of a Reduce() procedure that is applied in parallel to all pairs with the same key from all lists, which in turn produces a collection of values in the same domain. The returns of all calls are collected as the desired result list.
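Putting the two procedures together, the word-count pipeline can be sketched in plain Python. The shuffle/group step that a real framework performs between the phases is simulated here with a dictionary; all names are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def word_count_map(key, line):
    # Map: (filename, line) -> list of (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    # Reduce: all values sharing one key are combined into a single value.
    return (word, sum(counts))

def run_mapreduce(inputs, mapper, reducer):
    # Shuffle: group intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in mapper(k, v):
            groups[k2].append(v2)
    # Reduce is applied (conceptually in parallel) once per distinct key.
    return dict(reducer(k, vs) for k, vs in groups.items())

result = run_mapreduce([("doc.txt", "to be or not to be")],
                       word_count_map, word_count_reduce)
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```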
5. JobTracker and TaskTracker
• The primary function of the job tracker is resource management (managing the task trackers), tracking resource availability, and task life-cycle management (tracking task progress, fault tolerance, etc.).
• The task tracker has the simple function of following the orders of the job tracker and periodically updating the job tracker with its progress status.
The task tracker is pre-configured with a number of slots indicating the number of tasks it can accept.
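The slot mechanism can be modelled with a toy class; the names and interface below are illustrative, not Hadoop's actual classes.

```python
class TaskTracker:
    """Toy model of task-tracker slot accounting (illustrative names only)."""
    def __init__(self, slots):
        self.free_slots = slots   # pre-configured capacity
        self.running = []

    def accept(self, task):
        if self.free_slots == 0:
            return False          # full: the job tracker must pick another tracker
        self.free_slots -= 1
        self.running.append(task)
        return True

tt = TaskTracker(slots=2)
tt.accept("map-0")
tt.accept("map-1")
# A third task is refused because no empty slots remain.
```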
6. Fault Tolerance
▫ The task tracker spawns separate JVM
processes to ensure that process failures do
not bring down the task tracker itself.
▫ The task tracker keeps sending heartbeat
messages to the job tracker to say that it is alive
and to keep it updated with the number of empty
slots available for running more tasks.
▫ From version 0.21 of Hadoop, the job tracker does
some checkpointing of its work in the filesystem.
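The heartbeat-based failure detection described above can be sketched as a simple timeout check; the function name and the timeout value are assumptions for illustration, not Hadoop's actual configuration.

```python
def find_dead_trackers(last_heartbeat, now, timeout=600):
    """Trackers whose last heartbeat is older than the timeout are presumed
    failed; the job tracker would reschedule their tasks elsewhere.
    (Names and the default timeout are illustrative.)"""
    return [t for t, ts in last_heartbeat.items() if now - ts > timeout]

heartbeats = {"tracker-a": 1000, "tracker-b": 200}
# At time 1100, tracker-b has been silent for 900s and is presumed dead.
dead = find_dead_trackers(heartbeats, now=1100)
```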
7. Basic allowable input file formats
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
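As a sketch of what one of these formats does: KeyValueTextInputFormat splits each line of text at the first separator (a tab by default) into a key and a value. A plain-Python imitation (not the Hadoop class itself):

```python
def key_value_text_split(line, sep="\t"):
    """Mimics KeyValueTextInputFormat: split at the FIRST separator
    (tab by default); a line with no separator becomes (line, "")."""
    key, _, value = line.partition(sep)
    return (key, value)

# "user42<TAB>clicked" -> key "user42", value "clicked"
record = key_value_text_split("user42\tclicked")
```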
Primitive datatypes and their Box-class (Writable) equivalents:
int    → IntWritable
float  → FloatWritable
Long   → LongWritable
char   → Text
String → Text
Box classes implement the WritableComparable interface by default.
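Keys need to be comparable because the shuffle phase sorts intermediate pairs by key before grouping them for Reduce — which is why Hadoop key types implement WritableComparable. A plain-Python sketch of that sort-then-group step:

```python
from itertools import groupby

pairs = [("fox", 1), ("the", 1), ("fox", 1)]
pairs.sort(key=lambda kv: kv[0])            # keys must define an ordering
grouped = {k: [v for _, v in g]
           for k, g in groupby(pairs, key=lambda kv: kv[0])}
# grouped == {"fox": [1, 1], "the": [1]}
```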
10. Fields where MapReduce can be
implemented
Distributed pattern-based searching
Distributed sorting
Web link-graph reversal
Web access log stats
Document clustering
Statistical machine translation.
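One of the listed applications, web link-graph reversal, fits MapReduce naturally: Map emits a (target, source) pair for every link, and Reduce collects all pages linking to each target. A plain-Python sketch (the shuffle step is simulated with a dictionary; names are illustrative):

```python
from collections import defaultdict

def reverse_map(source, targets):
    # Emit (target, source) for every outgoing link on page `source`.
    return [(t, source) for t in targets]

def reverse_reduce(target, sources):
    # Collect all pages that link to `target`.
    return (target, sorted(sources))

links = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
groups = defaultdict(list)
for src, tgts in links:
    for k, v in reverse_map(src, tgts):
        groups[k].append(v)
reversed_graph = dict(reverse_reduce(k, vs) for k, vs in groups.items())
# reversed_graph == {"b.html": ["a.html"], "c.html": ["a.html", "b.html"]}
```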
11. Limitations of MapReduce
• It is not always easy to implement each and every computation as a MapReduce program.
• When your intermediate processes need to talk to each other.
• When your processing requires a lot of data to be shuffled over the network.
• The fundamentals of Hadoop were not designed to
facilitate highly interactive analytics.
• The answer you get from a Hadoop cluster may or may
not be 100% accurate, depending on the nature of the
job.