Map Reduce In 5 Minutes
Published in: Technology, Education


Transcript

  • 1. MapReduce
  • 2. MapReduce reduced
  • 3. PetaMengurangi
  • 4. PetaMengurangi* (*Google Translate)
  • 5. the problem
  • 6. lots of data
  • 7. e.g. the entire interwebs
  • 8. single computer not going to work
  • 9. lots of computers
  • 10. we have that
  • 11. cluster programming
  • 12. cluster programming = suck
  • 13. MapReduce
  • 14. makes the pain go away
  • 15. 2 main stages
  • 16. map
  • 17. map: process data on hosts
  • 18. reduce
  • 19. reduce: summarise the results
  • 20. example: count words on lines
  • 21. >>> reduce(operator.add, map(countWords, lines))
  • 22. >>> reduce(operator.add, map(countWords, lines))
  • 23. >>> reduce(operator.add, map(countWords, lines))
  • 24. except in this case
  • 25. lots of machines
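The one-liner on slides 21-23 can be tried on a single machine. The sketch below is a minimal Python 3 version; the body of countWords and the sample lines are assumptions, since the slides only show the call itself.

```python
from functools import reduce  # Python 3 moved reduce out of the builtins
import operator


def countWords(line):
    """Map step (assumed body): count whitespace-separated words on one line."""
    return len(line.split())


# Hypothetical input; the slides do not show the actual data.
lines = [
    "the quick brown fox",
    "jumps over",
    "the lazy dog",
]

# map() applies countWords to every line; reduce() adds the per-line counts.
total = reduce(operator.add, map(countWords, lines), 0)
print(total)
```

The map step needs nothing but its own line, which is why it can be spread over lots of machines; only the final addition has to bring results together.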
  • 26. typical cluster
  • 27. O(10^3) machines, each with 2-8 GB RAM and local IDE disks
  • 28. GFS distributes the data
  • 29. process data on hosts; summarise results
  • 30. split data into chunks; process data on hosts; summarise results
  • 31. split data into chunks; allocate machines; process data on hosts; summarise results
  • 32. split data into chunks; allocate machines; start processes; process data on hosts; summarise results
  • 33. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; summarise results
  • 34. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; summarise results
  • 35. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; send results to reducers; summarise results
  • 36. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; redo failed and stragglers; send results to reducers; summarise results
  • 37. split data into chunks; allocate machines; start processes; send data to mappers; process data on hosts; monitor hosts; redo failed and stragglers; send results to reducers; summarise results; output final results
  • 38. MapReduce does the yukky stuff
  • 39. MapReduce: split data into chunks; allocate machines; start processes; send data to mappers; monitor hosts; redo failed and stragglers; send results to reducers; output final results. programmer: process data on hosts; summarise results
  • 40. handles failures
  • 41. handles stragglers
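The division of labour on slide 39 can be sketched in a single process. Everything below is an illustrative assumption rather than how MapReduce itself is built: the function names are hypothetical, chunks stand in for data sent to mappers, and a simple retry loop stands in for redoing failed tasks. There are no real machines, and stragglers are not modelled.

```python
from collections import defaultdict


def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks, one per map task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_with_retries(task, *args, max_attempts=3):
    """Re-run a task that raises, standing in for 'redo failed'."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*args)
        except Exception:
            if attempt == max_attempts:
                raise


def run_map_task(mapper, chunk):
    """Apply the mapper to every record in one chunk and collect its pairs."""
    pairs = []
    for record in chunk:
        pairs.extend(mapper(record))
    return pairs


def map_reduce(records, mapper, reducer, chunk_size=2):
    """The framework's share: chunking, running tasks, grouping by key."""
    grouped = defaultdict(list)
    for chunk in split_into_chunks(records, chunk_size):
        for key, value in run_with_retries(run_map_task, mapper, chunk):
            grouped[key].append(value)
    return {
        key: run_with_retries(reducer, key, values)
        for key, values in grouped.items()
    }


# The programmer's share: one mapper and one reducer.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    return sum(values)


lines = ["the cat", "the dog", "a cat"]
counts = map_reduce(lines, word_mapper, sum_reducer)
print(counts)
```

Only word_mapper and sum_reducer change from job to job; the rest is the part the slides call the yukky stuff.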
  • 42. a vanity search
  • 43. % of refs to Anthony
  • 44. % of refs to Anthony Baxter
  • 45. count('Anthony Baxter') / count('Anthony')
  • 46. C++ library
  • 47. ... with Python bindings, yay!
  • 48.
        class AnthonyMapper(mrpython.Mapper):
            def Map(self, map_input):
                meCount = otherCount = 0
                docId = map_input.key()  # ignored - doc id
                src = map_input.value()  # document source
                text = ExtractText(src).split()
                seenAnthony = False
                for word in text:
                    if not seenAnthony:
                        if word.lower() == 'anthony':
                            seenAnthony = True
                    else:
                        if word.lower() == 'baxter':
                            meCount += 1
                        else:
                            otherCount += 1
                        seenAnthony = False
                yield 'me', meCount
                yield 'other', otherCount
  • 49.
        class AnthonyReducer(mrpython.Reducer):
            def Reducer(self, reduce_input):
                '''Passed a key (either 'me' or 'other') and a list
                of counts. Adds the counts and returns them.
                '''
                count = 0
                for val in reduce_input.values():
                    count += int(val)
                yield count
  • 50. the result:
  • 51. the result: about 1 in 4000
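The vanity search can be tried without the mrpython bindings. The sketch below is a stand-in under stated assumptions: anthony_map and anthony_reduce are hypothetical plain functions that mirror the logic of slides 48-49, the documents are made up, and the final ratio me / (me + other) is one reading of the fraction on slide 45.

```python
def anthony_map(text):
    """Count 'anthony' followed by 'baxter' ('me') or by any other word ('other')."""
    me_count = other_count = 0
    seen_anthony = False
    for word in text.split():
        word = word.lower()
        if not seen_anthony:
            seen_anthony = word == "anthony"
        else:
            if word == "baxter":
                me_count += 1
            else:
                other_count += 1
            seen_anthony = False
    yield "me", me_count
    yield "other", other_count


def anthony_reduce(values):
    """Add up the per-document counts for one key."""
    return sum(int(v) for v in values)


# Hypothetical documents standing in for the crawled web pages.
docs = [
    "Anthony Baxter gave a talk",
    "Anthony Hopkins and Anthony Baxter",
    "Mark Anthony spoke first",
]

# Group the mapper output by key, as the framework would before reducing.
totals = {"me": [], "other": []}
for doc in docs:
    for key, count in anthony_map(doc):
        totals[key].append(count)

me = anthony_reduce(totals["me"])
other = anthony_reduce(totals["other"])
fraction = me / (me + other)
print(me, other, fraction)
```

As in the slide code, an 'anthony' that is the last word of a document is counted under neither key.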
  • 52. other uses for MapReduce
  • 53. web link graphs
  • 54. access logs
  • 55. text analysis
  • 56. Google News clustering
  • 57. local search
  • 58. road traffic
  • 59. take speed samples
  • 60. group by road segment
  • 61. take the average
  • 62. once per minute
  • 63. output to a map layer
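The road traffic steps on slides 59-61 fit the same map, group, reduce shape. The sketch below is a minimal single-machine version; the segment ids, speeds and function names are made up for illustration, and the once-per-minute scheduling and the map layer output are left out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical speed samples for one minute: (road segment id, speed).
samples = [
    ("segment-a", 52.0),
    ("segment-b", 30.0),
    ("segment-a", 48.0),
    ("segment-b", 34.0),
    ("segment-c", 61.0),
]


def map_sample(sample):
    """Map step: key each speed sample by its road segment."""
    segment_id, speed = sample
    return segment_id, speed


def reduce_segment(speeds):
    """Reduce step: average the speeds seen on one segment."""
    return mean(speeds)


# Group by road segment, the shuffle between map and reduce.
grouped = defaultdict(list)
for segment_id, speed in map(map_sample, samples):
    grouped[segment_id].append(speed)

average_speed = {
    segment_id: reduce_segment(speeds) for segment_id, speeds in grouped.items()
}
print(average_speed)
```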
  • 64. limitation: availability of data
  • 65. MapReduce is pretty cool
  • 66. for more information
  • 67. "mapreduce paper", "gfs paper", "google papers"
  • 68. if you'd like to play
  • 69. hadoop.apache.org
  • 70. open source
  • 71. Java implementation
  • 72. HDFS