MapReduce definition
MapReduce: division into two phases, Map and Reduce
Working of JobTracker, TaskTracker, NameNode, and DataNode in the MapReduce engine of Hadoop
Fault tolerance in Hadoop
Box-class datatypes
Allowable file formats
WordCount job in Hadoop MapReduce, explained with animation
Fields where MapReduce can be implemented
Limitations of MapReduce
MapReduce in Hadoop
1. MAP REDUCE
By Ishan Sharma
Animations in the presentation can be viewed by downloading it…
2. WHAT IS MapReduce?
A programming model and an associated implementation for processing and generating large data sets with a parallel*, distributed* algorithm on a cluster*.
*A parallel algorithm is an algorithm that can be executed a piece at a time on many different processing devices, with the partial results combined at the end to give the correct result.
*A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors.
*A computer cluster consists of connected computers that work together so that, in many respects, they can be viewed as a single system. Computer clusters have each node set to perform the same task, controlled and scheduled by software.
3. What is Map()?
A MapReduce program is composed of a Map() procedure that takes one pair of data with a type in one data domain and returns a list of pairs in a different domain.
It is applied in parallel to every pair in the input dataset, producing a list of pairs for each call.
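A minimal sketch of the Map() idea in plain Python (not the Hadoop API): one input pair (k1, v1) yields a list of pairs (k2, v2) in a different domain. The function and pair names here are illustrative.

```python
def word_count_map(key, line):
    """Input domain: (filename, line of text); output domain: (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

# One call on one input pair produces a list of intermediate pairs.
pairs = word_count_map("doc.txt", "the quick brown fox the")
```

In a real cluster the framework would invoke this function in parallel over every input pair.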
4. What is Reduce()?
A MapReduce program is also composed of a Reduce() procedure that is applied in parallel to all pairs with the same key from all lists, which in turn produces a collection of values in the same domain. The returns of all calls are collected as the desired result list.
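Putting the two procedures together, the word-count pipeline can be sketched in plain Python. The shuffle/group step that a real framework performs between the phases is simulated here with a dictionary; all names are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def word_count_map(key, line):
    # Map: (filename, line) -> list of (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    # Reduce: all values sharing one key are combined into a single value.
    return (word, sum(counts))

def run_mapreduce(inputs, mapper, reducer):
    # Shuffle: group intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in mapper(k, v):
            groups[k2].append(v2)
    # Reduce is applied (conceptually in parallel) once per distinct key.
    return dict(reducer(k, vs) for k, vs in groups.items())

result = run_mapreduce([("doc.txt", "to be or not to be")],
                       word_count_map, word_count_reduce)
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```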
5. JobTracker and TaskTracker
• The primary function of the job tracker is resource management (managing the task trackers), tracking resource availability, and task life-cycle management (tracking task progress, fault tolerance, etc.).
• The task tracker has the simple function of following the orders of the job tracker and periodically updating the job tracker with its progress status.
The task tracker is pre-configured with a number of slots indicating the number of tasks it can accept.
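The slot mechanism can be modelled with a toy class; the names and interface below are illustrative, not Hadoop's actual classes.

```python
class TaskTracker:
    """Toy model of task-tracker slot accounting (illustrative names only)."""
    def __init__(self, slots):
        self.free_slots = slots   # pre-configured capacity
        self.running = []

    def accept(self, task):
        if self.free_slots == 0:
            return False          # full: the job tracker must pick another tracker
        self.free_slots -= 1
        self.running.append(task)
        return True

tt = TaskTracker(slots=2)
tt.accept("map-0")
tt.accept("map-1")
# A third task is refused because no empty slots remain.
```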
6. Fault Tolerance
▫ The task tracker spawns separate JVM
processes to ensure that process failures do
not bring down the task tracker itself.
▫ The task tracker keeps sending heartbeat
messages to the job tracker to say that it is alive
and to keep it updated with the number of empty
slots available for running more tasks.
▫ From version 0.21 of Hadoop, the job tracker does
some checkpointing of its work in the filesystem.
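The heartbeat-based failure detection described above can be sketched as a simple timeout check; the function name and the timeout value are assumptions for illustration, not Hadoop's actual configuration.

```python
def find_dead_trackers(last_heartbeat, now, timeout=600):
    """Trackers whose last heartbeat is older than the timeout are presumed
    failed; the job tracker would reschedule their tasks elsewhere.
    (Names and the default timeout are illustrative.)"""
    return [t for t, ts in last_heartbeat.items() if now - ts > timeout]

heartbeats = {"tracker-a": 1000, "tracker-b": 200}
# At time 1100, tracker-b has been silent for 900s and is presumed dead.
dead = find_dead_trackers(heartbeats, now=1100)
```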
7. Basic allowable input file formats
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
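As a sketch of what one of these formats does: KeyValueTextInputFormat splits each line of text at the first separator (a tab by default) into a key and a value. A plain-Python imitation (not the Hadoop class itself):

```python
def key_value_text_split(line, sep="\t"):
    """Mimics KeyValueTextInputFormat: split at the FIRST separator
    (tab by default); a line with no separator becomes (line, "")."""
    key, _, value = line.partition(sep)
    return (key, value)

# "user42<TAB>clicked" -> key "user42", value "clicked"
record = key_value_text_split("user42\tclicked")
```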
Primitive datatypes and their Box-class (Writable) equivalents:
int    → IntWritable
float  → FloatWritable
Long   → LongWritable
char   → Text
String → Text
Box classes implement the WritableComparable interface by default.
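Keys need to be comparable because the shuffle phase sorts intermediate pairs by key before grouping them for Reduce — which is why Hadoop key types implement WritableComparable. A plain-Python sketch of that sort-then-group step:

```python
from itertools import groupby

pairs = [("fox", 1), ("the", 1), ("fox", 1)]
pairs.sort(key=lambda kv: kv[0])            # keys must define an ordering
grouped = {k: [v for _, v in g]
           for k, g in groupby(pairs, key=lambda kv: kv[0])}
# grouped == {"fox": [1, 1], "the": [1]}
```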
10. Fields where MapReduce can be
implemented
Distributed pattern-based searching
Distributed sorting
Web link-graph reversal
Web access log stats
Document clustering
Statistical machine translation.
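One of the listed applications, web link-graph reversal, fits MapReduce naturally: Map emits a (target, source) pair for every link, and Reduce collects all pages linking to each target. A plain-Python sketch (the shuffle step is simulated with a dictionary; names are illustrative):

```python
from collections import defaultdict

def reverse_map(source, targets):
    # Emit (target, source) for every outgoing link on page `source`.
    return [(t, source) for t in targets]

def reverse_reduce(target, sources):
    # Collect all pages that link to `target`.
    return (target, sorted(sources))

links = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
groups = defaultdict(list)
for src, tgts in links:
    for k, v in reverse_map(src, tgts):
        groups[k].append(v)
reversed_graph = dict(reverse_reduce(k, vs) for k, vs in groups.items())
# reversed_graph == {"b.html": ["a.html"], "c.html": ["a.html", "b.html"]}
```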
11. Limitations of MapReduce
• It is not always easy to implement each and every computation as a MapReduce program.
• When your intermediate processes need to talk to each other.
• When your processing requires a lot of data to be shuffled over the network.
• The fundamentals of Hadoop were not designed to
facilitate highly interactive analytics.
• The answer you get from a Hadoop cluster may or may
not be 100% accurate, depending on the nature of the
job.