VanPyz, June 2, 2009




Introduction to MapReduce using Disco
Erlang and Python

by @JimRoepcke
Computing at Google Scale
                                Image Source: http://ischool.tv/news/files/2006/12/computer-grid02s.jpg



Massive databases and data streams need to be processed quickly and reliably
Thousands of commodity PCs available in Google’s cluster for computations
Faults are statistically “guaranteed” to occur
Google’s Motivation

Google has thousands of programs to process user-generated data
Even simple computations were being obscured by the complex code required to run efficiently and reliably on their clusters.
Engineers shouldn’t have to be experts in distributed systems to write scalable data-processing software.
Why not just use threads?

Threads only add concurrency, and only on one node
They do not scale beyond one node to a cluster or a cloud
Coordinating work between nodes requires distribution middleware
MapReduce is distribution middleware
MapReduce scales linearly with the number of cores and nodes
Hadoop


Apache Software Foundation project

Written in Java

Includes the Hadoop Distributed File System




Disco

Created by Ville Tuulos of the Nokia Research Center

Written in Erlang and Python

Does not include a distributed file system

  Provide your own data distribution mechanism
How MapReduce works



The big scary diagram...
Source: http://labs.google.com/papers/mapreduce-osdi04.pdf




[Figure 1: Execution overview. The user program forks a master and workers. The master assigns map tasks and reduce tasks. Map workers read the input splits and write intermediate files to their local disks; reduce workers remote-read those files and write the final output files. Flow: Input files → Map phase → Intermediate files (on local disks) → Reduce phase → Output files.]
It’s truly very simple...
Master splits input


 The (typically huge) input is split into chunks

   One or more for each “map worker”




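
As a rough sketch of the idea (not Disco’s or Google’s actual splitter; split_input and the records-per-split value are made up for illustration), in Python 2:

# Hypothetical illustration: chop a list of (key, value) records into
# fixed-size splits; one or more splits go to each map worker.
def split_input(records, records_per_split=2):
    splits = []
    for i in range(0, len(records), records_per_split):
        splits.append(records[i:i + records_per_split])
    return splits

records = [("doc1", "the quick brown fox"),
           ("doc2", "jumps over the lazy dog"),
           ("doc3", "the end")]
print split_input(records)   # two splits: the first two records, then the last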
Splits fed to map workers

The master tells each map worker which split(s) it will process

  A split is a file containing some number of input records

  Each record has a key and its associated value
Map each input


The map worker executes your problem-specific map algorithm

  Called once for each record in its input
Map emits (Key,Value) pairs

Your map algorithm emits zero or more intermediate key-value pairs for each record processed

  Let’s call these “(K,V) pairs” from now on

  Keys and values are both strings
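
To make that concrete, here is a small Python 2 sketch (run_map and word_map are names I made up, not Disco’s API; this version returns the pairs instead of calling an EmitIntermediate helper):

# Hypothetical map runner: call the user's map once per (key, value) record
# and gather every intermediate (K,V) pair it emits.
def run_map(map_fn, split):
    pairs = []
    for key, value in split:
        pairs.extend(map_fn(key, value))   # zero or more pairs per record
    return pairs

def word_map(key, value):
    # key: document name (ignored); value: the document text
    return [(word, "1") for word in value.split()]

split = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print run_map(word_map, split)
# [('the', '1'), ('quick', '1'), ('brown', '1'), ('fox', '1'), ('the', '1'), ...]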
(K,V) Pairs hashed to buckets
 Each map worker has its own set of buckets

 Each (K,V) pair is placed into one of these buckets

 Which bucket is determined by a hash function


 Advanced: if you know the distribution of your intermediate keys is skewed, provide a custom hash function that distributes (K,V) pairs evenly
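
A minimal sketch of the bucketing step, assuming Python’s built-in hash as the hash function and three reduce workers (both are arbitrary choices here):

# Each (K,V) pair lands in bucket hash(K) mod R, where R is the number of
# reduce workers; a skew-aware job would swap in its own hash function.
def bucket_for(key, num_reducers=3):
    return hash(key) % num_reducers

pairs = [("the", "1"), ("quick", "1"), ("the", "1"), ("fox", "1")]
buckets = {}
for k, v in pairs:
    buckets.setdefault(bucket_for(k), []).append((k, v))
print buckets   # identical keys always land in the same bucket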
Buckets sent to Reducers
Once all map workers are finished, corresponding buckets of (K,V) pairs are sent to reduce workers

Example: Each map worker placed (K,V) pairs into its own buckets A, B, and C.

Send bucket A from each map to reduce worker 1;
Send bucket B from each map to reduce worker 2;
Send bucket C from each map to reduce worker 3.
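
Continuing the sketch with made-up data, the exchange amounts to: for reduce worker r, gather bucket r from every map worker.

# buckets_by_mapper[m][r] = the (K,V) pairs map worker m placed in bucket r
buckets_by_mapper = [
    {0: [("the", "1"), ("the", "1")], 1: [("fox", "1")]},   # map worker 0
    {0: [("the", "1")],               1: [("dog", "1")]},   # map worker 1
]

def shuffle(buckets_by_mapper, r):
    # Everything every map worker placed in bucket r goes to reduce worker r
    collected = []
    for mapper_buckets in buckets_by_mapper:
        collected.extend(mapper_buckets.get(r, []))
    return collected

print shuffle(buckets_by_mapper, 0)   # every 'the' pair ends up at reducer 0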
Reduce inputs sorted
The reduce worker first concatenates the buckets it received into one file

Then the file of (K,V) pairs is sorted by K

  Now the (K,V) pairs are grouped by key

This sorted list of (K,V) pairs is the input to the reduce worker
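
In Python 2 that step is just concatenation plus a sort on the key, which leaves identical keys adjacent (a sketch with toy data):

# Concatenate the buckets this reduce worker received, then sort by K
received = [[("the", "1"), ("fox", "1")], [("the", "1"), ("dog", "1")]]
merged = []
for bucket in received:
    merged.extend(bucket)
merged.sort(key=lambda kv: kv[0])   # identical keys are now adjacent
print merged
# [('dog', '1'), ('fox', '1'), ('the', '1'), ('the', '1')]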
Reduce the list of (K,V) pairs

 The reduce worker executes your problem-specific reduce algorithm

   Called once for each key in its input

   Writes whatever it wants to its output file
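
Since the input is sorted by key, itertools.groupby yields one group per key, so the per-key dispatch can be sketched like this (run_reduce and word_reduce are illustrative names, not Disco’s API):

from itertools import groupby

def run_reduce(reduce_fn, sorted_pairs):
    # sorted_pairs is sorted by key, so groupby yields one run per key
    for key, group in groupby(sorted_pairs, lambda kv: kv[0]):
        reduce_fn(key, [v for _, v in group])

def word_reduce(key, values):
    print key, sum(int(v) for v in values)   # Python 2 print, as in the slides

run_reduce(word_reduce, [("dog", "1"), ("the", "1"), ("the", "1")])
# dog 1
# the 2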
Output

The output of the MapReduce job is the set of output files generated by the reduce workers

What you do with this output is up to you

You might use this output as the input to another MapReduce job
Modified from source: http://labs.google.com/papers/mapreduce-osdi04.pdf




Example: Counting words
def map(key, value):
    # key: document name (ignored)
    # value: words in document (list)
    for word in value:
        EmitIntermediate(word, "1")

def reduce(key, values):
    # key: a word
    # values: a list of counts
    result = 0
    for v in values:
        result += int(v)
    print key, result
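
To tie the phases together, here is a single-process Python 2 simulation of the whole job. It is only an illustration of the flow on one machine, not how Disco or Hadoop actually run things: EmitIntermediate is a stand-in defined here, and map_fn/reduce_fn are the slide’s functions renamed to avoid shadowing Python builtins.

from itertools import groupby

emitted = []                                # collected intermediate (K,V) pairs

def EmitIntermediate(key, value):           # stand-in for the framework's emitter
    emitted.append((key, value))

def map_fn(key, value):
    for word in value:                      # value: list of words, as on the slide
        EmitIntermediate(word, "1")

def reduce_fn(key, values):
    print key, sum(int(v) for v in values)

documents = [("doc1", "the quick brown fox".split()),
             ("doc2", "the lazy dog and the fox".split())]

for key, value in documents:                # "map phase": once per record
    map_fn(key, value)

emitted.sort(key=lambda kv: kv[0])          # "shuffle + sort": group by key

for key, group in groupby(emitted, lambda kv: kv[0]):
    reduce_fn(key, [v for _, v in group])   # "reduce phase": once per key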
Stand up! Let’s do it!


 Organize yourselves into approximately equal numbers of map and reduce workers

 I’ll be the master
Disco demonstration
Wanted to demonstrate a cool puzzle solver.

No go, but I can show the code. It’s really simple!

Instead you get count_words again, but scaled way up!

python count_words.py disco://localhost
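
I can’t reproduce the exact count_words.py from the demo here. As a rough, unverified sketch of the shape an early (circa 2009) Disco word-count job took: treat the import path, Disco, new_job, result_iterator, and the fun_map/fun_reduce signatures below as assumptions to check against the Disco documentation for your version.

import sys
from disco.core import Disco, result_iterator   # assumed circa-2009 Disco API

def fun_map(e, params):
    # e: one line of input text
    return [(word, 1) for word in e.split()]

def fun_reduce(iter, out, params):
    counts = {}
    for word, count in iter:
        counts[word] = counts.get(word, 0) + int(count)
    for word, count in counts.iteritems():
        out.add(word, count)

# sys.argv[1] would be the master, e.g. disco://localhost;
# the remaining arguments would be the input URLs.
results = Disco(sys.argv[1]).new_job(name="count_words",
                                     input=sys.argv[2:],
                                     map=fun_map,
                                     reduce=fun_reduce).wait()

for word, count in result_iterator(results):
    print word, count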
