Map Reduce An Introduction
How Map Reduce evolved
    Presentation Transcript

    • Nagarjuna K, nagarjuna@outlook.com
    • Agenda:
      - Understanding MapReduce
      - Map Reduce: an introduction
      - Word count: default
      - Word count: custom
    • MapReduce is a programming model to process large datasets.
      - Supported languages for MR: Java, Ruby, Python, C++
      - MapReduce programs are inherently parallel: more data → more machines to analyze, with no need to change anything in the code.
    • Start with the WORDCOUNT example: "Do as I say, not as I do"

      Word | Count
      as   | 2
      do   | 2
      i    | 2
      not  | 1
      say  | 1
    • Single-machine pseudocode:

      define wordCount as Map<String, long>;
      for each document in documentSet {
          T = tokenize(document);
          for each token in T {
              wordCount[token]++;
          }
      }
      display(wordCount);

      This works only as long as the number of documents to process is not very large.
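    The pseudocode above translates directly into plain Java; a minimal sketch (class and method names are my own, tokenization here simply lowercases and splits on non-letters):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Count word occurrences in a single document, as in the slide's pseudocode.
    public static Map<String, Long> count(String document) {
        Map<String, Long> wordCount = new HashMap<>();
        // tokenize(document): lowercase and split on anything that is not a letter
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;   // skip empty leading token
            wordCount.merge(token, 1L, Long::sum);  // wordCount[token]++
        }
        return wordCount;
    }

    public static void main(String[] args) {
        System.out.println(count("Do as I say, not as I do"));
    }
}
```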
    • Spam filter:
      - Millions of emails
      - Word count for analysis
      - Working from a single computer is time consuming
      - Rewrite the program to count from multiple machines
    • How do we attain parallel computing?
      1. All the machines compute a fraction of the documents
      2. Combine the results from all the machines
    • STAGE 1:

      define wordCount as Map<String, long>;
      for each document in documentSubset {
          T = tokenize(document);
          for each token in T {
              wordCount[token]++;
          }
      }
    • STAGE 2:

      define totalWordCount as Multiset;
      for each wordCount received from firstPhase {
          multisetAdd(totalWordCount, wordCount);
      }
      display(totalWordCount);
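    The two stages can be simulated in plain Java within one process: stage 1 counts words inside each document subset, and stage 2 merges the partial counts received from stage 1 (a single-machine sketch; class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStageWordCount {
    // Stage 1: each "machine" counts words in its own subset of documents.
    public static Map<String, Long> stage1(List<String> documentSubset) {
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSubset) {
            for (String token : document.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) wordCount.merge(token, 1L, Long::sum);
            }
        }
        return wordCount;
    }

    // Stage 2: one machine merges the partial counts received from stage 1.
    public static Map<String, Long> stage2(List<Map<String, Long>> partialCounts) {
        Map<String, Long> totalWordCount = new HashMap<>();
        for (Map<String, Long> partial : partialCounts) {
            partial.forEach((word, n) -> totalWordCount.merge(word, n, Long::sum));
        }
        return totalWordCount;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(stage1(List.of("do as I say")));   // subset on machine 1
        partials.add(stage1(List.of("not as I do")));   // subset on machine 2
        System.out.println(stage2(partials));
    }
}
```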
    • [Diagram: a master node distributes the documents across machines Comp-1 through Comp-4]
    • Problems with STAGE 1:
      - Document segregation has to be well defined
      - Bottleneck in network transfer: the processing is data-intensive, not computation-intensive, so it is better to store the files on the processing machines
      - BIGGEST FLAW: storing the words and counts in memory; a disk-based hash-table implementation is needed
    • Problems with STAGE 2:
      - Phase 2 has only one machine, a bottleneck, even though phase 1 is highly distributed
      - Make phase 2 distributed as well; this needs changes in phase 1
      - Partition the phase-1 output (say, based on the first character of the word), so we have 26 machines in phase 2
      - The single disk-based hash table becomes 26 disk-based hash tables: wordcount-a, wordcount-b, wordcount-c, ...
    • [Diagram: the master feeds documents to machines Comp-1 through Comp-4, whose partial counts for words starting with A, B, C, D, E, ... are routed to machines Comp-10 through Comp-40]
    • After phase 1, from Comp-1:
      - WordCount-A → Comp-10
      - WordCount-B → Comp-20
      - ...
      Each machine in phase 1 shuffles its output to the different machines in phase 2.
    • This is getting complicated:
      - Store files where they are being processed
      - Write a disk-based hash table, obviating RAM limitations
      - Partition the phase-1 output
      - Shuffle the phase-1 output and send it to the appropriate reducer
    • This is already a lot of work for word count, and we haven't even touched fault tolerance: what if Comp-1 or Comp-10 fails? Hence the need for a framework to take care of all these things, so that we concentrate only on the business logic.
    • [Diagram: documents in HDFS flow from the master to MAPPER machines Comp-1 through Comp-4; their interim output is partitioned and shuffled to REDUCER machines Comp-10 through Comp-40]
    • ¡  Mapper  ¡  Reducer    Mapper  filters  and  transforms  the  input    Reducer  collects  that  and  aggregate  on  that.    Extensive  research  is  done  two  arrive  at  two  phase  strategy     nagarjuna@outlook.com  
    • Mapper, Reducer, Partitioner and Shuffling work together → a common structure for data processing:

                Input            Output
      Mapper    <K1, V1>         list<K2, V2>
      Reducer   <K2, list(V2)>   list<K3, V3>
    • For word count:
      - Mapper input: <key, words_per_line>; output: <word, 1>
      - Reducer input: <word, list(1)>; output: <word, count(list(1))>
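    The word-count contract above can be simulated in plain Java without Hadoop on the classpath: map emits <word, 1> pairs, shuffle groups them by word into <word, list(1)>, and reduce counts the list (a sketch of the contract, not Hadoop's actual API; names are my own):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceContract {
    // Mapper: <lineNo, line> -> list of <word, 1>
    public static List<SimpleEntry<String, Long>> map(long lineNo, String line) {
        List<SimpleEntry<String, Long>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) out.add(new SimpleEntry<>(token, 1L));
        }
        return out;
    }

    // Shuffle: group mapper output by key -> <word, list(1)>
    public static Map<String, List<Long>> shuffle(List<SimpleEntry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new HashMap<>();
        for (SimpleEntry<String, Long> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: <word, list(1)> -> count(list(1))
    public static long reduce(String word, List<Long> ones) {
        return ones.stream().mapToLong(Long::longValue).sum();
    }
}
```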
    • As said, don't store the data in memory:
      - Keys and values regularly have to be written to disk, so they must be serialized.
      - Hadoop provides its own serialization mechanism.
      - Any class used as a key or value has to implement the Writable interface.
    • Java types and their Hadoop serialized counterparts:

      Java type   Hadoop serialized type
      String      Text
      Integer     IntWritable
      Long        LongWritable
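    Hadoop's Writable contract is built on java.io.DataOutput / DataInput (a Writable implements write(DataOutput) and readFields(DataInput)). A plain-Java sketch of the idea, round-tripping a <word, count> pair through a byte stream without Hadoop on the classpath (the exact byte layout here differs from Hadoop's Text/LongWritable encoding):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SerializationSketch {
    // Serialize a <word, count> pair the way a Writable would: via DataOutput.
    public static byte[] write(String word, long count) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(word);   // roughly what Text does for a String
            out.writeLong(count); // roughly what LongWritable does for a long
        }
        return bytes.toByteArray();
    }

    // Deserialize, mirroring Writable.readFields(DataInput).
    public static Object[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new Object[] { in.readUTF(), in.readLong() };
        }
    }
}
```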
    • Let's try to execute the following commands:

      hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
      hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>

      What does this code do?
    • Switch to Eclipse.