Map Reduce In 5 Minutes

  1. MapReduce
  2. MapReduce reduced
  3. PetaMengurangi
  4. * PetaMengurangi    *google translate
  5. the problem
  6. lots of data
  7. e.g. the entire interwebs
  8. single computer: not going to work
  9. lots of computers
  10. we have that
  11. cluster programming
  12. cluster programming = suck
  13. MapReduce
  14. makes the pain go away
  15. 2 main stages
  16. map
  17. map: process data on hosts
  18. reduce
  19. reduce: summarise the results
  20. example: count words on lines
  21. >>> reduce(operator.add, map(countWords, lines))
  22. >>> reduce(operator.add, map(countWords, lines))
  23. >>> reduce(operator.add, map(countWords, lines))
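
The one-liner on slides 21-23 leaves countWords and lines undefined. A minimal, self-contained sketch of the same idea (Python 3, where reduce now lives in functools; the countWords helper and the sample lines are assumptions, not part of the deck):

    import operator
    from functools import reduce   # Python 3: reduce is no longer a builtin

    def countWords(line):
        # hypothetical helper: number of whitespace-separated words on one line
        return len(line.split())

    lines = ["the quick brown fox", "jumps over", "the lazy dog"]

    # map each line to its word count, then fold the counts together with +
    total = reduce(operator.add, map(countWords, lines))
    print(total)   # 9
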
  24. except in this case
  25. lots of machines
  26. typical cluster
  27. O(10³) machines, each with 2-8 GB RAM and local IDE disks
  28. GFS distributes the data
  29. process data on hosts; summarise results
  30. split data into chunks; process data on hosts; summarise results
  31. split data into chunks; allocate machines; process data on hosts; summarise results
  32. split data into chunks; allocate machines; start processes; process data on hosts;
      summarise results
  33. split data into chunks; allocate machines; start processes; send data to mappers;
      process data on hosts; summarise results
  34. split data into chunks; allocate machines; start processes; send data to mappers;
      process data on hosts; monitor hosts; summarise results
  35. split data into chunks; allocate machines; start processes; send data to mappers;
      process data on hosts; monitor hosts; send results to reducers; summarise results
  36. split data into chunks; allocate machines; start processes; send data to mappers;
      process data on hosts; monitor hosts; redo failed and stragglers;
      send results to reducers; summarise results
  37. split data into chunks; allocate machines; start processes; send data to mappers;
      process data on hosts; monitor hosts; redo failed and stragglers;
      send results to reducers; summarise results; output final results
  38. MapReduce does the yukky stuff
  39. MapReduce:   split data into chunks; allocate machines; start processes;
                   send data to mappers; monitor hosts; redo failed and stragglers;
                   send results to reducers; output final results
      programmer:  process data on hosts; summarise results
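
To make the split on slides 29-39 concrete, here is a toy single-process sketch of the same pipeline: the framework part chops the input into chunks, runs the map function over each chunk, groups the intermediate key/value pairs, and hands each group to the reduce function. The function names and the little word-count job are illustrative only; this is not the real MapReduce API.

    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn, chunk_size=2):
        # "split data into chunks" - the framework's job
        chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

        # "send data to mappers" / "process data on hosts" - here just a loop
        intermediate = defaultdict(list)
        for chunk in chunks:
            for record in chunk:
                for key, value in map_fn(record):
                    intermediate[key].append(value)   # shuffle: group values by key

        # "send results to reducers" / "summarise results"
        return {key: reduce_fn(key, values) for key, values in intermediate.items()}

    # the programmer's side of slide 39: just these two functions
    def word_count_map(line):
        for word in line.split():
            yield word.lower(), 1

    def word_count_reduce(word, counts):
        return sum(counts)

    print(run_mapreduce(["the cat sat", "the dog sat"], word_count_map, word_count_reduce))
    # -> {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
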
  40. handles failures
  41. handles stragglers
  42. a vanity search
  43. % of refs to Anthony
  44. % of refs to Anthony Baxter
  45. count(‘Anthony Baxter’) / count(‘Anthony’)
  46. C++ library
  47. ... with Python bindings, yay!
  48. class AnthonyMapper(mrpython.Mapper):
          def Map(self, map_input):
              meCount = otherCount = 0
              docId = map_input.key()       # ignored - doc id
              src = map_input.value()       # document source
              text = ExtractText(src).split()
              seenAnthony = False
              for word in text:
                  if not seenAnthony:
                      if word.lower() == 'anthony':
                          seenAnthony = True
                  else:
                      if word.lower() == 'baxter':
                          meCount += 1
                      else:
                          otherCount += 1
                      seenAnthony = False
              yield 'me', meCount
              yield 'other', otherCount
  49. class AnthonyReducer(mrpython.Reducer):
          def Reducer(self, reduce_input):
              ''' Passed a key (either 'me' or 'other') and a list
                  of counts. Adds the counts and returns them.
              '''
              count = 0
              for val in reduce_input.values():
                  count += int(val)
              yield count
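
The two classes above are written against the mrpython bindings shown in the deck, so they will not run on their own. The sketch below re-implements the same ‘Anthony’ vs ‘Anthony Baxter’ counting with plain functions and wires map, grouping and reduce together by hand; the sample documents and helper names are made up for illustration.

    from collections import defaultdict

    def anthony_map(doc_text):
        # same logic as AnthonyMapper.Map, minus the mrpython plumbing
        me_count = other_count = 0
        seen_anthony = False
        for word in doc_text.split():
            if not seen_anthony:
                if word.lower() == 'anthony':
                    seen_anthony = True
            else:
                if word.lower() == 'baxter':
                    me_count += 1
                else:
                    other_count += 1
                seen_anthony = False
        yield 'me', me_count
        yield 'other', other_count

    def anthony_reduce(key, counts):
        # same job as AnthonyReducer: add up the per-document counts
        return sum(counts)

    docs = ["Anthony Baxter wrote this talk",
            "Anthony Hopkins and Anthony Baxter walk into a bar"]

    grouped = defaultdict(list)                 # stands in for the shuffle stage
    for doc in docs:
        for key, value in anthony_map(doc):
            grouped[key].append(value)

    totals = {key: anthony_reduce(key, values) for key, values in grouped.items()}
    print(totals)                               # {'me': 2, 'other': 1}
    print(totals['me'] / sum(totals.values()))  # fraction of Anthonys that are Baxter
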
  50. the result:
  51. the result: about 1 in 4000
  52. other uses for MapReduce
  53. web link graphs
  54. access logs
  55. text analysis
  56. google news clustering
  57. local search
  58. road traffic
  59. take speed samples
  60. group by road segment
  61. take the average
  62. once per minute
  63. output to a map layer
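
Slides 59-63 describe the road-traffic job only in prose. A hedged sketch of what its map and reduce steps might look like, using an invented (road_segment, speed) record format, since the real inputs and map-layer output are not shown in the deck:

    from collections import defaultdict

    def traffic_map(sample):
        # sample: (road_segment_id, speed_kmh) - an assumed input format
        segment, speed = sample
        yield segment, speed                 # "group by road segment"

    def traffic_reduce(segment, speeds):
        return sum(speeds) / len(speeds)     # "take the average"

    samples = [("seg-42", 58.0), ("seg-42", 61.0), ("seg-7", 23.5)]

    grouped = defaultdict(list)
    for sample in samples:
        for segment, speed in traffic_map(sample):
            grouped[segment].append(speed)

    # run once per minute and push the averages to a map layer
    averages = {seg: traffic_reduce(seg, speeds) for seg, speeds in grouped.items()}
    print(averages)                          # {'seg-42': 59.5, 'seg-7': 23.5}
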
  64. limitation: availability of data
  65. MapReduce is pretty cool
  66. for more information
  67. “mapreduce paper”, “gfs paper”, “google papers”
  68. if you’d like to play
  69. hadoop.apache.org
  70. open source
  71. java implementation
  72. HDFS
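
If you do want to play, Hadoop's Streaming interface lets the mapper and reducer be ordinary scripts that read lines from stdin and write tab-separated key/value pairs to stdout, so the word-count example carries over almost directly. A rough sketch (the file names here are arbitrary, and the exact job-submission command depends on your install; the streaming jar takes the scripts via its -mapper and -reducer options):

    # mapper.py - emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word.lower() + "\t1")

    # reducer.py - Hadoop sorts by key, so all counts for one word arrive together
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))
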
