An Introduction to Data Intensive Computing

Chapter 3: Processing Big Data

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
  
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
  
	
  
Section 3.1
Processing Big Data Using Utility and Data Clouds

[Image: a Google production rack of servers from about 1999.]
  
• How do you do analytics over commodity disks and processors?
• How do you improve the efficiency of programmers?
  
Serial & SMP Algorithms
[Diagram: a serial algorithm runs a single task against its local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks against shared local disk and memory.]
  
Pleasantly (= Embarrassingly) Parallel
• Need to partition data, start tasks, and collect results (see the sketch below).
• Often the tasks are organized into a DAG.
[Diagram: three nodes, each with a local disk running three tasks, coordinated via MPI.]
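As a tiny, concrete sketch of the partition/start/collect pattern above, assuming nothing beyond the Python standard library (the word-count task is a stand-in for real work):

from multiprocessing import Pool

def task(partition):
    # Stand-in for real work: count the words in one partition of the data.
    return sum(len(line.split()) for line in partition)

if __name__ == '__main__':
    data = ["it was the best of times", "it was the worst of times"] * 1000
    partitions = [data[i::4] for i in range(4)]  # partition the data
    pool = Pool(4)                               # start tasks
    counts = pool.map(task, partitions)          # collect results
    pool.close()
    print(sum(counts))                           # combine: 12000 words in total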
  
How Do You Program A Data Center?
  
The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
  
Google's Large Data Cloud
Google's early data stack, circa 2000:
[Diagram, top to bottom:]
  Applications
  Data Services: Google's BigTable
  Compute Services: Google's MapReduce
  Storage Services: Google File System (GFS)
  
Hadoop's Large Data Cloud (Open Source)
Hadoop's stack:
[Diagram, top to bottom:]
  Applications
  Data Services: NoSQL, e.g. HBase
  Compute Services: Hadoop's MapReduce
  Storage Services: Hadoop Distributed File System (HDFS)
  
A very nice recent book by Barroso and Holzle.
  
The Amazon Data Stack

"Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados." (SOSP '07)
  
Amazon Style Data Cloud
[Diagram: a Load Balancer in front of two pools of EC2 instances; the Simple Queue Service connects the tiers; S3 Storage Services and SimpleDB (SDB) sit behind them.]
  
Open Source Versions
• Eucalyptus
  – Ability to launch VMs
  – S3-like storage
• OpenStack
  – Ability to launch VMs
  – S3-like storage (Swift)
• Cassandra
  – Key-value store like S3
  – Columns like BigTable
• Many other open source Amazon-style services are available.
  
Some Programming Models for Data Centers
• Operations over a data center of disks
  – MapReduce ("string-based" scans of data)
  – User-Defined Functions (UDFs) over the data center
  – Launch VMs that all have access to highly scalable and available disk-based data
  – SQL and NoSQL over the data center
• Operations over a data center of memory
  – Grep over distributed memory
  – UDFs over distributed memory
  – Launch VMs that all have access to highly scalable and available memory-based data
  – SQL and NoSQL over distributed memory
  
Section 3.2
Processing Data By Scaling Out Virtual Machines
  
Processing Big Data Pattern 1:
Launch Independent Virtual Machines and Task Them via a Messaging Service
  
Task With Messaging Service & Use S3 (Variant 1)
[Diagram: a Control VM launches and tasks Worker VMs; messaging services (AWS SQS, an AMQP service, etc.) carry task messages to the worker task VMs, which read and write S3.]
  
Task With Messaging Service & Use NoSQL DB (Variant 2)
[Diagram: the same structure, with AWS SimpleDB as the store instead of S3.]
  
Task With Messaging Service & Use Clustered FS (Variant 3)
[Diagram: the same structure, with GlusterFS as the store.]
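All three variants share one control pattern: a control VM enqueues task messages, and worker VMs consume them. The sketch below is a minimal local analogy in Python, assuming threads and an in-process queue as stand-ins for the worker VMs and the messaging service; in the real variants the queue becomes SQS or AMQP calls and the results store becomes S3, SimpleDB, or GlusterFS.

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

work = queue.Queue()        # stand-in for the messaging service
results = queue.Queue()     # stand-in for S3 / SimpleDB / GlusterFS

def worker():
    # Each "worker VM" loops: pull a task message, do the work, store the result.
    while True:
        item = work.get()
        if item is None:    # poison pill: no more work
            break
        results.put((item, item * item))  # toy task

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):         # the "control VM" enqueues the task messages
    work.put(n)
for _ in threads:
    work.put(None)
for t in threads:
    t.join()
out = []
while not results.empty():
    out.append(results.get())
print(sorted(out))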
  
Section 3.3
MapReduce

Google 2004 Technical Report
  
Core Concepts
• Data are (key, value) pairs, and that's it.
• Partition the data over commodity nodes filling racks in a data center.
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
  – Word count
  – Inverted index
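In the notation of the 2004 paper, the user supplies exactly two functions, and the framework supplies everything else (partitioning, scheduling, restarts, the shuffle):

    map(k1, v1)          -> list(k2, v2)
    reduce(k2, list(v2)) -> list(v2)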
  
Processing Big Data Pattern 2:
MapReduce
  
[Diagram: three Task Tracker nodes each run Map Tasks over HDFS data and write intermediate output to local disk; a Shuffle & Sort phase moves that output to Reduce Tasks, which write their results back to HDFS.]
  
Example: Word Count & Inverted Index
• How do you count the words in a million books?
  – (best, 7)
• Inverted index:
  – (best; page 1, page 82, …)
  – (worst; page 1, page 12, …)
[Image: cover of serial Vol. V, 1859, London.]
  
• Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.
• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
  
Basic Pattern: Strings
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.
  
What about data records? The same pattern applies.

Strings (web pages):
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.

Data records:
1. Extract binned field values from data records in parallel.
2. Hash and sort the binned field values.
3. Count (or construct an inverted index) in parallel.
  
Map-Reduce Example
• Input is files with one document per record.
• User specifies the map function:
  – key = document URL
  – value = document contents

Input of map:  ("doc cdickens two cities", "it was the best of times")
Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1)
  
Example (cont'd)
• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

Input of reduce:
  key = "it",    values = 1, 1
  key = "was",   values = 1, 1
  key = "best",  values = 1
  key = "worst", values = 1
Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
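A minimal pure-Python simulation of the three phases for this example (no Hadoop required; the function and variable names are ours):

from itertools import groupby
from operator import itemgetter

docs = {"doc cdickens two cities": "it was the best of times"}

# Map phase: emit (word, 1) for every word of every document.
pairs = []
for url, contents in docs.items():
    pairs.extend((word, 1) for word in contents.split())

# Shuffle/sort phase: gather together all pairs with the same key.
pairs.sort(key=itemgetter(0))

# Reduce phase: combine all the values associated with the same key.
counts = [(word, sum(v for _, v in group))
          for word, group in groupby(pairs, key=itemgetter(0))]
print(counts)  # [('best', 1), ('it', 1), ('of', 1), ('the', 1), ('times', 1), ('was', 1)]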
  
Why Is Word Count Important?
• It is one of the most important examples for the type of text processing often done with MapReduce.
• There is an important mapping:

      document   <----->   data record
      words      <----->   (field, value)

                Inversion
  
                 Pleasantly Parallel        MapReduce
Data structure   Arbitrary                  (key, value) pairs
Functions        Arbitrary                  Map & Reduce
Middleware       MPI (message passing)      Hadoop
Ease of use      Difficult                  Medium
Scope            Wide                       Narrow
Challenge        Getting something working  Moving to MapReduce
  	
  
Common MapReduce Design Patterns
• Word count
• Inversion (inverted index)
• Computing simple statistics
• Computing windowed statistics
• Sparse matrix (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM
  
Section 3.4
User Defined Functions over DFS

sector.sf.net
  
Processing Big Data Pattern 3:
User Defined Functions over Distributed File Systems
  
Sector/Sphere
• Sector/Sphere is a platform for data intensive computing.
  	
  
Idea 1: Apply User Defined Functions (UDF) to Files in a Distributed File System
[Diagram: one UDF plays the role of map/shuffle, another the role of reduce.]
This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
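As a rough local analogy (this is not Sector's actual API), picture a user-defined function applied in parallel to every file segment in a directory; the segments/*.txt directory and the line-counting UDF here are hypothetical placeholders:

import glob
from multiprocessing import Pool

def udf(path):
    # A user-defined function applied to one file segment: count its lines.
    with open(path) as f:
        return (path, sum(1 for _ in f))

if __name__ == '__main__':
    pool = Pool(4)
    print(pool.map(udf, sorted(glob.glob('segments/*.txt'))))  # hypothetical directory
    pool.close()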
  
Idea 2: Add Security From the Start
• The security server maintains information about users and slaves.
• User access control: password and client IP address.
• File-level access control.
• Messages are encrypted over SSL. A certificate is used for authentication.
• Sector is a good basis for HIPAA-compliant applications.
[Diagram: a Security Server provides AAA to the Master and Client over SSL; the Client exchanges data directly with the Slaves.]
Idea 3: Extend the Stack to Include Network Transport Services
[Diagram: the Google/Hadoop stack has Compute Services, Data Services, and Storage Services; Sector's stack adds a Routing & Transport Services layer beneath its Storage Services.]
  
Section 3.5
Computing With Streams:
Warming Up With Means and Variances
  
Warm Up: Partitioned Means
• Means and variances cannot be computed naively when the data is in distributed partitions.

Step 1. Compute the local tuple (Σ xi, Σ xi², ni) in parallel for each partition.

Step 2. Compute the global mean and variance from these tuples.
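A minimal sketch of the two steps in Python, using the population variance and made-up partitions:

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Step 1: compute (sum, sum of squares, count) per partition; this part parallelizes.
tuples = [(sum(p), sum(x * x for x in p), len(p)) for p in partitions]

# Step 2: combine the tuples into the global mean and variance.
S = sum(s for s, q, n in tuples)
Q = sum(q for s, q, n in tuples)
N = sum(n for s, q, n in tuples)
mean = S / N
variance = Q / N - mean * mean  # E[x^2] - (E[x])^2
print(mean, variance)           # 3.5 2.9166...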
  	
  
	
  
Trivial Observation 1
If si = Σ xi is the i'th local sum, then the global mean = Σ si / Σ ni.

• If only the local means for each partition are passed (without the corresponding counts), then there is not enough information to compute the global mean.
• The same trick works for variance, but you need to pass the triples (Σ xi, Σ xi², ni).
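In symbols (a sketch, stated for the population variance): if partition i contributes the triple (s_i, q_i, n_i), where s_i is its sum and q_i its sum of squares, then

\[
\mu \;=\; \frac{\sum_i s_i}{\sum_i n_i},
\qquad
\sigma^2 \;=\; \frac{\sum_i q_i}{\sum_i n_i} \;-\; \mu^2 .
\]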
  
	
  
Trivial Observation 2
• To reduce the data passed over the network, combine appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are 4 steps (Map, Shuffle, Sort and Reduce), and Reduce pulls data from the local disks of the nodes that performed the Map.
• A Combine step in MapReduce combines local data before it is pulled for the Reduce step.
• There are built-in combiners for counts, means, etc.
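A minimal sketch of the idea, assuming a Python streaming mapper that combines "in the mapper": it emits one partial count per distinct word per input split, instead of one pair per word occurrence. (Hadoop's Combiner mechanism achieves the same effect between the Map and Reduce steps.)

import sys
from collections import defaultdict

def main(separator='\t'):
    partial = defaultdict(int)
    for line in sys.stdin:
        for word in line.split():
            partial[word] += 1  # combine locally, as early as possible
    # Far fewer (word, count) pairs now cross the network to the reducers.
    for word, count in partial.items():
        print('%s%s%d' % (word, separator, count))

if __name__ == '__main__':
    main()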
  
Section 3.6
Hadoop Streams
  
Processing Big Data Pattern 4:
Streams over Distributed File Systems
  
Hadoop Streams
• In addition to the Java API, Hadoop offers
  – a streaming interface for any language that supports reading and writing to standard in and out
  – Pipes for C++
• Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to
  – C++ libraries like Boost and the GNU Scientific Library (GSL)
  – R modules
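For reference, a streaming job is launched along these lines (the location of the streaming jar varies by Hadoop version and distribution, and the input/output paths here are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input myInput -output myOutput \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py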
  
Pros and Cons
• Java
  + Best documented
  + Largest community
  – More LOC per MR job
• Python
  + Efficient memory handling
  + Programmers can be very efficient
  – Limited logging / debugging
• R
  + Vast collection of statistical algorithms
  – Poor error handling and memory handling
  – Less familiar to developers
  
Word Count Python Mapper

import sys

def read_input(file):
    # One record per line; split it into words.
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # Emit (word, 1) pairs, tab-separated, one per line.
            print('%s%s%d' % (word, separator, 1))

if __name__ == '__main__':
    main()
Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby works because Hadoop sorts the map output by key.
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print('%s%s%d' % (word, separator, total_count))

if __name__ == '__main__':
    main()
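Because the mapper and reducer only read standard in and write standard out, they can be smoke-tested without a cluster by letting sort play the role of the shuffle (input.txt is any local text file):

cat input.txt | python mapper.py | sort | python reducer.py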
MalStone Benchmark

                          MalStone A    MalStone B
Hadoop MapReduce          455m 13s      840m 50s
Hadoop Streams (Python)    87m 29s      142m 32s
C++ implemented UDFs       33m 40s       43m 44s

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 500 million 100-byte records per node on 20 nodes.
  
Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# Helper assumed (it is not defined on the original slide): split a line on whitespace.
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
	
  
Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
	
  
Word Count R Reducer (cont'd)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
	
  
Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
	
  
Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
	
  
Code Comparison – Word Count Mapper

Python

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print('%s%s%d' % (word, separator, 1))

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

Java

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Code Comparison – Word Count Reducer

Python

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print('%s%s%d' % (word, separator, total_count))

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Java

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Questions?

For the most current version of these notes, see rgrossman.com
  
