Map Reduce In 5 Minutes

Transcript

  • 1. MapReduce
  • 2. MapReduce reduced
  • 3. PetaMengurangi
  • 4. PetaMengurangi* (*Google Translate)
  • 5. the problem
  • 6. lots of data
  • 7. e.g. the entire interwebs
  • 8. single computer not going to work
  • 9. lots of computers
  • 10. we have that
  • 11. cluster programming
  • 12. cluster programming = suck
  • 13. MapReduce
  • 14. makes the pain go away
  • 15. 2 main stages
  • 16. map
  • 17. map: process data on hosts
  • 18. reduce
  • 19. reduce: summarise the results
  • 20. example: count words on lines
  • 21. >>> reduce(operator.add, map(countWords, lines))
  • 22. >>> reduce(operator.add, map(countWords, lines))
  • 23. >>> reduce(operator.add, map(countWords, lines))
  • 24. except in this case
  • 25. lots of machines
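The one-liner on slides 21 to 23 can be tried on a single machine. This is a minimal sketch: the slides do not define `countWords` or `lines`, so both definitions below are assumptions made for illustration.

```python
import operator
from functools import reduce  # builtin in Python 2, in functools in Python 3


def countWords(line):
    """Map step (assumed definition): count whitespace-separated words in one line."""
    return len(line.split())


# Hypothetical input; any iterable of strings would do.
lines = [
    "lots of data",
    "single computer not going to work",
    "lots of computers",
]

# map yields one count per line; reduce adds the counts together.
total = reduce(operator.add, map(countWords, lines))
print(total)
```

Here `map` does the per-line work and `reduce` folds the per-line counts into one number, which is the same division of labour the two MapReduce stages use across many machines.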
  • 26. typical cluster
  • 27. O(10^3) machines, each with 2-8 GB RAM and local IDE disks
  • 28. GFS distributes the data
  • 29. process data on hosts; summarise results
  • 30. split data into chunks; process data on hosts; summarise results
  • 31. split data into chunks; allocate machines; process data on hosts; summarise results
  • 32. split data into chunks; allocate machines; start processes; process data on hosts; summarise results
  • 33. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; summarise results
  • 34. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; summarise results
  • 35. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; send results to reducers; summarise results
  • 36. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; redo failed and stragglers; send results to reducers; summarise results
  • 37. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; redo failed and stragglers; send results to reducers; summarise results; output final results
  • 38. MapReduce does the yukky stuff
  • 39. MapReduce: split data into chunks; allocate machines; start processes; send data to mappers; monitor hosts; redo failed and stragglers; send results to reducers; output final results. programmer: process data on hosts; summarise results
  • 40. handles failures
  • 41. handles stragglers
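The split between framework and programmer can be sketched as a toy, single-process driver. This is only an illustration under stated assumptions, not how Google's library works: `map_reduce`, `run_with_retry`, `word_mapper` and `sum_reducer` are hypothetical names, everything runs sequentially on one machine, and stragglers are not modelled at all.

```python
from collections import defaultdict


def run_with_retry(task, arg, max_attempts=3):
    """Run task(arg) and redo it if it raises (a stand-in for 'redo failed')."""
    for attempt in range(1, max_attempts + 1):
        try:
            return list(task(arg))
        except Exception:
            if attempt == max_attempts:
                raise


def map_reduce(data, mapper, reducer, chunk_size=2):
    """The 'yukky stuff': chunking, dispatch, retries, grouping, final output."""
    # split data into chunks
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # send each chunk to a mapper, then group the (key, value) pairs by key
    grouped = defaultdict(list)
    for chunk in chunks:
        for key, value in run_with_retry(mapper, chunk):
            grouped[key].append(value)

    # send grouped values to the reducer and collect the final results
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(chunk):
    """Programmer-supplied map: emit (word, 1) for every word in the chunk."""
    for line in chunk:
        for word in line.split():
            yield word, 1


def sum_reducer(key, values):
    """Programmer-supplied reduce: add up the counts for one key."""
    return sum(values)


counts = map_reduce(["a b a", "b c", "a"], word_mapper, sum_reducer)
print(counts)
```

Only `word_mapper` and `sum_reducer` change from job to job; the driver stays the same, which is the point of slide 38.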
  • 42. a vanity search
  • 43. % of refs to Anthony
  • 44. % of refs to Anthony Baxter
  • 45. count(‘Anthony Baxter’) / count(‘Anthony’)
  • 46. C++ library
  • 47. ... with Python bindings, yay!
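  • 48.
    import mrpython  # the Python bindings mentioned on the previous slide
    class AnthonyMapper(mrpython.Mapper):
        def Map(self, map_input):
            meCount = otherCount = 0
            docId = map_input.key()    # ignored - doc id
            src = map_input.value()    # document source
            # ExtractText is a helper that is not shown on the slide
            text = ExtractText(src).split()
            seenAnthony = False
            for word in text:
                if not seenAnthony:
                    # look for the next 'anthony'
                    if word.lower() == 'anthony':
                        seenAnthony = True
                else:
                    # the previous word was 'anthony': classify this one
                    if word.lower() == 'baxter':
                        meCount += 1
                    else:
                        otherCount += 1
                    seenAnthony = False
            # one pair of per-document counts for the reducer to add up
            yield 'me', meCount
            yield 'other', otherCount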
  • 49.
    class AnthonyReducer(mrpython.Reducer):
        def Reducer(self, reduce_input):
            '''Passed a key (either 'me' or 'other') and a list
               of counts. Adds the counts and returns them.
            '''
            count = 0
            for val in reduce_input.values():
                count += int(val)
            yield count
  • 50. the result:
  • 51. the result: about 1 in 4000
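The vanity search can be tried without the `mrpython` bindings. The sketch below reuses the counting logic from the mapper slide as plain functions; `anthony_map`, `anthony_reduce` and the three sample documents are hypothetical stand-ins, and the grouping loop plays the role the framework would normally play.

```python
def anthony_map(text):
    """Per-document counts: 'Anthony' followed by 'Baxter' vs. by anything else."""
    me_count = other_count = 0
    seen_anthony = False
    for word in text.split():
        if not seen_anthony:
            if word.lower() == "anthony":
                seen_anthony = True
        else:
            if word.lower() == "baxter":
                me_count += 1
            else:
                other_count += 1
            seen_anthony = False
    yield "me", me_count
    yield "other", other_count


def anthony_reduce(values):
    """Add up the per-document counts for one key."""
    return sum(int(v) for v in values)


# Hypothetical documents standing in for the crawled pages.
docs = [
    "Anthony Baxter gave a talk",
    "Anthony Hopkins and Anthony Bourdain",
    "nothing to see here",
]

# Group mapper output by key, as the framework would before reducing.
grouped = {"me": [], "other": []}
for doc in docs:
    for key, count in anthony_map(doc):
        grouped[key].append(count)

me = anthony_reduce(grouped["me"])
other = anthony_reduce(grouped["other"])
fraction = me / (me + other)  # count('Anthony Baxter') / count('Anthony')
print(me, other, fraction)
```

Like the slide's mapper, this only looks at the word immediately after each "Anthony", so punctuation attached to a word (for example "Baxter,") would be counted as "other".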
  • 52. other uses for MapReduce
  • 53. web link graphs
  • 54. access logs
  • 55. text analysis
  • 56. Google News clustering
  • 57. local search
  • 58. road traffic
  • 59. take speed samples
  • 60. group by road segment
  • 61. take the average
  • 62. once per minute
  • 63. output to a map layer
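The road traffic steps map onto the two stages fairly directly. A rough sketch, assuming each sample is a `(segment, timestamp, speed)` tuple and reading "once per minute" as one average per segment per minute; the sample data, field layout and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: (road_segment_id, unix_timestamp_seconds, speed_kmh)
samples = [
    ("A1", 0, 60.0),
    ("A1", 30, 40.0),
    ("B7", 10, 90.0),
    ("A1", 65, 20.0),
]


def traffic_map(sample):
    """Map: key each speed sample by (road segment, minute bucket)."""
    segment, timestamp, speed = sample
    return (segment, timestamp // 60), speed


def traffic_reduce(speeds):
    """Reduce: average the speeds collected for one key."""
    return sum(speeds) / len(speeds)


# group by road segment (and minute), as the framework would between stages
grouped = defaultdict(list)
for sample in samples:
    key, speed = traffic_map(sample)
    grouped[key].append(speed)

# take the average; these values are what would be drawn on the map layer
averages = {key: traffic_reduce(speeds) for key, speeds in grouped.items()}
print(averages)
```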
  • 64. limitation: availability of data
  • 65. MapReduce is pretty cool
  • 66. for more information
  • 67. “mapreduce paper”, “gfs paper”, “google papers”
  • 68. if you’d like to play
  • 69. hadoop.apache.org
  • 70. open source
  • 71. Java implementation
  • 72. HDFS