split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
monitor hosts
redo failed and stragglers
send results to reducers
summarise results
output final results
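The steps listed above can be imitated on a single machine to see where the programmer's own code fits. This is only a toy sketch: `split_into_chunks`, `run_mapreduce`, `count_words` and `add_counts` are hypothetical names made up for illustration, not part of any MapReduce library, and machine allocation, monitoring and retries are left out entirely.

```python
from collections import defaultdict


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def run_mapreduce(chunks, map_fn, reduce_fn):
    """Toy single-process stand-in for the framework (hypothetical helper)."""
    grouped = defaultdict(list)
    # "process data on hosts": here just a loop over the chunks
    for chunk in chunks:
        for key, value in map_fn(chunk):
            # "send results to reducers": group the values by key
            grouped[key].append(value)
    # "summarise results" and "output final results"
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def count_words(chunk):
    """Example map function: emit (word, 1) for each word in the chunk."""
    for line in chunk:
        for word in line.split():
            yield word.lower(), 1


def add_counts(key, values):
    """Example reduce function: sum the counts collected for one key."""
    return sum(values)


lines = ["Anthony Baxter", "anthony hopkins", "baxter"]
result = run_mapreduce(split_into_chunks(lines, 2), count_words, add_counts)
```

Only `count_words` and `add_counts` are specific to the problem; everything else is plumbing that stays the same from job to job.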
MapReduce does the yukky stuff
split data into chunks (MapReduce)
allocate machines (MapReduce)
start processes (MapReduce)
send data to mappers (MapReduce)
process data on hosts (programmer)
monitor hosts (MapReduce)
redo failed and stragglers (MapReduce)
send results to reducers (MapReduce)
summarise results (programmer)
output final results (MapReduce)
handles failures
handles stragglers
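As a rough illustration of what "handles failures" saves the programmer from writing, here is a minimal retry loop. It is a sketch under loud assumptions: `run_with_retries` and `flaky_sum` are hypothetical names, the "failure" is simulated with an exception, and stragglers (slow but not failed tasks) are not modelled at all.

```python
def run_with_retries(task, chunks, max_attempts=3):
    """Run task on every chunk, re-queueing any chunk whose attempt raises."""
    results = {}
    attempts = {}
    pending = list(range(len(chunks)))
    while pending:
        index = pending.pop(0)
        attempts[index] = attempts.get(index, 0) + 1
        try:
            results[index] = task(chunks[index])
        except Exception:
            if attempts[index] >= max_attempts:
                raise  # give up after too many failed attempts
            pending.append(index)  # "redo failed": queue the chunk again
    return [results[i] for i in range(len(chunks))], attempts


failed_once = set()


def flaky_sum(chunk):
    """Hypothetical task that fails the first time it sees chunk (1, 2)."""
    if chunk == (1, 2) and chunk not in failed_once:
        failed_once.add(chunk)
        raise RuntimeError("simulated host failure")
    return sum(chunk)


totals, attempts = run_with_retries(flaky_sum, [(1, 2), (3, 4)])
```

The caller gets the results in chunk order and never sees the failed first attempt, which is the point of leaving this to the framework.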
a vanity search
% of refs to Anthony Baxter = count('Anthony Baxter') / count('Anthony')
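On one small piece of text the same ratio can be computed directly, which makes the arithmetic concrete before any MapReduce is involved. The function name `vanity_percentage` and the sample sentence are made up for illustration, and matching is assumed to be case-insensitive on whole words.

```python
import re


def vanity_percentage(text):
    """Return the percentage of 'Anthony' mentions followed by 'Baxter'."""
    anthony = len(re.findall(r"\banthony\b", text, flags=re.IGNORECASE))
    anthony_baxter = len(
        re.findall(r"\banthony\s+baxter\b", text, flags=re.IGNORECASE))
    if anthony == 0:
        return 0.0  # avoid dividing by zero when nobody is mentioned
    return 100.0 * anthony_baxter / anthony


sample = ("Anthony Baxter gave a talk. Anthony Hopkins did not. "
          "anthony baxter again.")
percentage = vanity_percentage(sample)
```

Here two of the three mentions of Anthony are followed by Baxter, so the result is two thirds, expressed as a percentage.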
C++ library
... with Python bindings,
yay!
import mrpython


class AnthonyMapper(mrpython.Mapper):
    def Map(self, map_input):
        meCount = otherCount = 0
        docId = map_input.key()    # ignored - doc id
        src = map_input.value()    # document source
        # ExtractText is not defined on the slide
        text = ExtractText(src).split()
        seenAnthony = False
        for word in text:
            if not seenAnthony:
                if word.lower() == 'anthony':
                    seenAnthony = True
            else:
                # the previous word was 'anthony': check what follows it
                if word.lower() == 'baxter':
                    meCount += 1
                else:
                    otherCount += 1
                seenAnthony = False
        yield 'me', meCount
        yield 'other', otherCount
class AnthonyReducer(mrpython.Reducer):
    # method name assumed to mirror Mapper.Map
    def Reduce(self, reduce_input):
        ''' Passed a key (either 'me' or 'other') and a list
            of counts. Adds the counts and returns them.
        '''
        count = 0
        for val in reduce_input.values():
            count += int(val)
        yield count
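Since mrpython is not something that can be run outside Google, here is a local stand-in that feeds the same word-by-word logic through a map, group, reduce sequence. `map_document`, `reduce_counts` and the sample documents are hypothetical; they only mimic the mapper and reducer above and say nothing about the real mrpython API.

```python
def map_document(words):
    """Yield ('me', n) and ('other', m) for one document's list of words."""
    me_count = other_count = 0
    seen_anthony = False
    for word in words:
        if not seen_anthony:
            if word.lower() == 'anthony':
                seen_anthony = True
        else:
            # the previous word was 'anthony': check what follows it
            if word.lower() == 'baxter':
                me_count += 1
            else:
                other_count += 1
            seen_anthony = False
    yield 'me', me_count
    yield 'other', other_count


def reduce_counts(values):
    """Add up the per-document counts collected for one key."""
    return sum(int(value) for value in values)


documents = ["Anthony Baxter spoke", "Anthony Hopkins and Anthony Baxter"]

# group the mapper output by key, as the framework would before reducing
grouped = {'me': [], 'other': []}
for doc in documents:
    for key, value in map_document(doc.split()):
        grouped[key].append(value)

totals = {key: reduce_counts(values) for key, values in grouped.items()}
percent_me = 100.0 * totals['me'] / (totals['me'] + totals['other'])
```

The two reducer totals are all that is needed for the vanity percentage: 'me' divided by 'me' plus 'other'.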