Dataflows: The Abstraction that Powers Big Data

Raúl Castro Fernández
Computer Science PhD Student, Imperial College London
rc3011@doc.ic.ac.uk | @raulcfernandez

"Big Data needs Democratization"

Democratization of Data

Developers and DBAs are no longer the only ones generating, processing and analyzing data. Decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers, sales, marketing…

+ Everyone has data
+ Many have interesting questions
- Not everyone knows how to analyze it

Bob and the Local Expert

- Barrier of human communication
- Barrier of professional relations

"The limits of my language mean the limits of my world."
Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)

First step to democratize Big Data: offer a familiar programming interface

Outline
• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation

Mutable State in a Recommender System

    Matrix userItem = new Matrix();   // User-Item matrix (UI)
    Matrix coOcc = new Matrix();      // Co-Occurrence matrix (CO)

    // Update with new ratings
    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    // Multiply for recommendation
    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

Example state:

  User-Item matrix (UI)           Co-Occurrence matrix (CO)
           Item-A  Item-B                  Item-A  Item-B
  User-A      4       5           Item-A      1       1
  User-B      0       5           Item-B      1       2

getRec multiplies the user's row of UI (e.g. [0 5] for User-B) by CO to produce the recommendation vector.

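As a usage illustration only (the Recommender wrapper class and the sample ratings are assumptions, not part of the deck), the two methods could be driven like this:

    // Hypothetical driver for the recommender methods shown above.
    public class RecommenderDemo {
        public static void main(String[] args) {
            Recommender rec = new Recommender();    // assumed class holding userItem and coOcc
            rec.addRating(0, 0, 4);                 // User-A rates Item-A with 4
            rec.addRating(0, 1, 5);                 // User-A rates Item-B with 5
            rec.addRating(1, 1, 5);                 // User-B rates Item-B with 5
            Vector recs = rec.getRec(1);            // recommendation vector for User-B
            System.out.println(recs);
        }
    }
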
Challenges When Executing with Big Data

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

Big Data problem: the matrices become large.
> Mutable state leads to concise algorithms but complicates parallelism and fault tolerance
> Cannot lose state after failure
> Need to manage state to support data-parallelism

Using Current Distributed Dataflow Frameworks

Input data -> dataflow -> Output data

> No mutable state simplifies fault tolerance
> MapReduce: Map and Reduce tasks
> Storm: no support for state
> Spark: immutable RDDs

Imperative Big Data Processing

> Programming distributed dataflow graphs requires learning new programming models

Our goal: run Java programs with mutable state, but with the performance and fault tolerance of distributed dataflow systems.

Stateful Dataflow Graphs: From Imperative Programs to Distributed Dataflows

Program.java -> SDGs: Stateful Dataflow Graphs

> Mutable distributed state in dataflow graphs
> @Annotations help with translation from Java to SDGs
> Checkpoint-based fault tolerance recovers mutable state after failure

Outline: SDG (Stateful Dataflow Graphs) and handling distributed state in SDGs

SDG: Data, State and Computation

> SDGs separate data and state to allow data and pipeline parallelism
  – Task Elements (TEs) process data
  – State Elements (SEs) represent state
  – Dataflows represent data
> Task Elements have local access to State Elements

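The deck describes the TE/SE split but does not show programming interfaces for it. The following is a minimal illustrative sketch, with all names assumed rather than taken from SEEP, of how the separation between computation (TEs) and state (SEs) could look in Java:

    // Illustrative only (all names are assumptions): computation (TEs) kept apart from state (SEs).
    interface StateElement { }                        // marker for runtime-managed mutable state

    interface MatrixSE extends StateElement {         // a matrix-shaped State Element
        void setElement(int row, int col, int value);
        int[] getRow(int row);
    }

    interface TaskElement<IN, OUT> {                  // computation applied to each dataflow item
        OUT process(IN input);
    }

    class Rating {                                    // a dataflow item
        int user, item, value;
    }

    class AddRatingTE implements TaskElement<Rating, Void> {
        private final MatrixSE userItem;              // TEs get local access to SEs
        AddRatingTE(MatrixSE userItem) { this.userItem = userItem; }
        public Void process(Rating r) {
            userItem.setElement(r.user, r.item, r.value);
            return null;
        }
    }
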
Distributed Mutable State

State Elements support two abstractions for distributed mutable state:
– Partitioned SEs: task elements always access state by key
– Partial SEs: task elements can access the complete state

Distributed Mutable State: Partitioned SEs

> Partitioned SEs are split into disjoint partitions: state is partitioned according to a partitioning key, and TEs access it by key
> The dataflow is routed according to a hash function, e.g. hash(msg.id) over key space [0-N], split into partitions [0-k] and [(k+1)-N]

Example: the User-Item matrix (UI) partitioned by key:

           Item-A  Item-B
  User-A      4       5
  User-B      0       5

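A minimal sketch of the key-based routing described above; modulo hashing over a fixed partition count is one common scheme and is an assumption here, not necessarily how SEEP splits the [0-N] key space:

    // Illustrative key-based routing for a Partitioned SE (not the actual SEEP code).
    class PartitionedRouter {
        private final int numPartitions;
        PartitionedRouter(int numPartitions) { this.numPartitions = numPartitions; }

        // Each key deterministically maps to one partition, so a TE instance only
        // ever touches the disjoint slice of state that it owns.
        int partitionFor(int key) {
            return Math.floorMod(Integer.hashCode(key), numPartitions);
        }
    }

    // Usage: route a rating for user 42 to the partition owning that user's row.
    //   int target = new PartitionedRouter(4).partitionFor(42);
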
Distributed Mutable State: Partial SEs

> Partial SEs give each node a local state instance
> Partial SE access by TEs can be local or global
  – Local access: data is sent to one instance
  – Global access: data is sent to all instances

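To make the local/global distinction concrete, here is an illustrative dispatcher; the Node interface and the random choice for local access are assumptions made for the sketch:

    // Illustrative dispatch for Partial SE access (not the actual SEEP code).
    interface Node { void send(Object data); }

    class PartialDispatcher {
        private final java.util.List<Node> nodes;      // each node holds a local partial instance
        private final java.util.Random random = new java.util.Random();
        PartialDispatcher(java.util.List<Node> nodes) { this.nodes = nodes; }

        // Local access: the data only needs one partial instance, so pick any node.
        void sendLocal(Object data) {
            nodes.get(random.nextInt(nodes.size())).send(data);
        }

        // Global access: the result depends on every partial instance, so broadcast.
        void sendGlobal(Object data) {
            for (Node node : nodes) node.send(data);
        }
    }
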
Merging Distributed Mutable State

> Reading all partial SE instances results in a set of partial values: the multiple partial values are collected and combined by merge logic
> Requires application-specific merge logic

Outline: Translating Java programs to SDGs (@Annotations)

From Imperative Code to Execution

Program.java -> Annotated program -> SEEP

> SEEP: data-parallel processing platform [SIGMOD'13]
• Translation occurs in two stages:
  – Static code analysis: from Java to SDG
  – Bytecode rewriting: from SDG to SEEP

Translation Process

Annotated Program.java
  -> Extract TEs, SEs and accesses (SOOT framework)
  -> Live variable analysis (SOOT framework)
  -> TE and SE access code assembly (Javassist)
  -> SEEP runnable

> Extract state and state access patterns through static code analysis
> Generation of runnable code using TE and SE connections

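The deck names Javassist for the bytecode-rewriting stage but shows no code. The snippet below is only a generic illustration of the kind of rewriting Javassist supports (intercepting accesses to a state field); the class and field names come from the earlier recommender example, and the logging replacement is an assumption, not what the SEEP translator actually generates:

    // Generic Javassist sketch: intercept reads of the 'userItem' field inside addRating.
    // This is NOT the SEEP translator, just the style of rewriting it builds on.
    import javassist.CannotCompileException;
    import javassist.ClassPool;
    import javassist.CtClass;
    import javassist.CtMethod;
    import javassist.expr.ExprEditor;
    import javassist.expr.FieldAccess;

    public class RewriteSketch {
        public static byte[] rewrite() throws Exception {
            ClassPool pool = ClassPool.getDefault();
            CtClass cc = pool.get("Program");                 // the user's annotated class
            CtMethod m = cc.getDeclaredMethod("addRating");
            m.instrument(new ExprEditor() {
                @Override
                public void edit(FieldAccess f) throws CannotCompileException {
                    if (f.isReader() && f.getFieldName().equals("userItem")) {
                        // Keep the original read, but make the state access visible.
                        f.replace("{ System.out.println(\"SE access: userItem\"); $_ = $proceed(); }");
                    }
                }
            });
            return cc.toBytecode();                           // bytecode handed to the SEEP runtime
        }
    }
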
Partitioned State Annotation

    @Partitioned Matrix userItem = new SeepMatrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

> The @Partitioned field annotation indicates partitioned state, with the dataflow routed by hash(msg.id)

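The deck uses the annotations without showing their declarations. A plausible minimal set of declarations is sketched below purely for illustration; the retention policy is an assumption, and @Global is omitted because it annotates a use site, which in standard Java would need a TYPE_USE target:

    // Illustrative declarations only; the real SEEP annotation definitions may differ.
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    @Retention(RetentionPolicy.CLASS)   // kept in the class file for static analysis and bytecode rewriting
    @Target(ElementType.FIELD)
    @interface Partitioned { }

    @Retention(RetentionPolicy.CLASS)
    @Target(ElementType.FIELD)
    @interface Partial { }

    @Retention(RetentionPolicy.CLASS)
    @Target(ElementType.PARAMETER)      // used as in merge(@Collection Vector[] v)
    @interface Collection { }
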
Partial State and Global Annotations

    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(@Global coOcc, userItem);
    }

> The @Partial field annotation indicates partial state
> @Global annotates a variable to indicate access to all partial instances

Partial and Collection Annotations

    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        @Partial Vector puRec = @Global coOcc.multiply(userRow);
        Vector userRec = merge(puRec);
        return userRec;
    }

    Vector merge(@Collection Vector[] v) { /*…*/ }

> The @Collection annotation indicates merge logic

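The body of merge is elided in the slide. One plausible implementation is an element-wise sum of the partial recommendation vectors; the sketch below assumes that policy and uses plain arrays instead of the deck's Vector type:

    // Illustrative merge of partial results from all Partial SE instances.
    // Element-wise addition is an assumed policy; max or average would be other application-specific choices.
    static double[] merge(double[][] partials) {
        double[] merged = new double[partials[0].length];
        for (double[] partial : partials) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += partial[i];
            }
        }
        return merged;
    }
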
Outline: Checkpoint-based fault tolerance for SDGs (failures)

Challenges of Making SDGs Fault Tolerant

Physical deployment of an SDG: task elements access local in-memory state on the physical nodes (RAM).
> Node failures may lead to state loss

Checkpointing state:
• No updates allowed while state is being checkpointed
• Checkpointing state should not impact the data processing path

State backup:
• Backups are large and cannot be stored in memory
• Large writes to disk through the network have a high cost

Checkpoint Mechanism for Fault Tolerance

Asynchronous, lock-free checkpointing:
1. Freeze mutable state for checkpointing
2. Dirty state supports updates concurrently
3. Reconcile the dirty state

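A minimal sketch of the freeze/dirty-state/reconcile idea for a key-value shaped SE; the data structures, the single-writer assumption and the reconcile-by-overwrite policy are all illustrative assumptions, not the SEEP implementation:

    // Illustrative asynchronous checkpointing with a dirty buffer (single-writer sketch;
    // a real lock-free implementation needs more care with concurrency).
    import java.util.HashMap;
    import java.util.Map;

    class CheckpointableState {
        private final Map<Integer, Integer> state = new HashMap<>();  // frozen during a checkpoint
        private final Map<Integer, Integer> dirty = new HashMap<>();  // absorbs updates meanwhile
        private volatile boolean checkpointing = false;

        void put(int key, int value) {
            if (checkpointing) dirty.put(key, value);   // 2. updates go to the dirty state
            else state.put(key, value);
        }

        Integer get(int key) {
            Integer v = dirty.get(key);                 // dirty state holds the freshest values
            return (v != null) ? v : state.get(key);
        }

        Map<Integer, Integer> startCheckpoint() {
            checkpointing = true;                       // 1. freeze the mutable state
            return state;                               // snapshot can now be written out asynchronously
        }

        void finishCheckpoint() {
            state.putAll(dirty);                        // 3. reconcile the dirty state
            dirty.clear();
            checkpointing = false;
        }
    }
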
Distributed M to N Checkpoint Backup

M to N distributed backup and parallel recovery

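The deck gives no detail beyond the M-to-N phrase. The sketch below shows one plausible way a node's checkpoint could be split into chunks so that several workers restore it in parallel; chunking by key hash is an assumption:

    // Illustrative M-to-N split of checkpoint state for parallel recovery (an assumption).
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CheckpointChunker {
        // Split one node's checkpoint into n disjoint chunks by key hash, so n recovery
        // workers can each load and restore their piece in parallel.
        static List<Map<Integer, Integer>> split(Map<Integer, Integer> checkpoint, int n) {
            List<Map<Integer, Integer>> chunks = new ArrayList<>();
            for (int i = 0; i < n; i++) chunks.add(new HashMap<>());
            checkpoint.forEach((key, value) ->
                chunks.get(Math.floorMod(Integer.hashCode(key), n)).put(key, value));
            return chunks;
        }
    }
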
Evaluation of SDG Performance

How does mutable state impact performance?
How efficient are translated SDGs?
What is the throughput/latency trade-off?

Experimental set-up:
– Amazon EC2 (c1 and m1 xlarge instances)
– Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)
– Sun Java 7, Ubuntu 12.04, Linux kernel 3.10

Processing with Large Mutable State

> addRating and getRec functions from the recommender algorithm, while changing the read/write ratio
[Plot: throughput (1000 requests/s) and latency (ms) vs. workload (state read/write ratio, 1:5 to 5:1)]

Combines batch and online processing to serve fresh results over large mutable state.

Efficiency of Translated SDG

> Batch-oriented, iterative logistic regression
[Plot: throughput (GB/s) vs. number of nodes (25-100) for SDG and Spark]

Translated SDG achieves performance similar to a non-mutable dataflow.

Latency/Throughput Tradeoff

> Streaming word count query, reporting counts over windows
[Plot: throughput (1000 requests/s) vs. window size (ms, 10-10000) for SDG, Naiad-LowLatency, Naiad-HighThroughput and Streaming Spark]

SDGs achieve high throughput while maintaining low latency.

Summary

Running Java programs with the performance of current distributed dataflow frameworks.

SDG: Stateful Dataflow Graphs
– Abstractions for distributed mutable state
– Annotations to disambiguate types of distributed state and state access
– Checkpoint-based fault tolerance mechanism

https://github.com/lsds/Seep/
https://github.com/raulcf/SEEPng/

Thank you! Any questions?
@raulcfernandez | rc3011@doc.ic.ac.uk

BACKUP SLIDES

Scalability on State Size and Throughput

> Increase state size in a mutable KV store
[Plot: throughput (million requests/s) and latency (ms) vs. aggregated memory (GB, 50-200)]

Support large state without compromising throughput or latency while staying fault tolerant.

Iteration in SDGs

> Local iteration supported by one node
> Iteration across TEs requires a cycle in the dataflow

Types of Annotations

• State annotations: Partitioned, Partial
• State access annotations: Global, Partial, Collection
• Data annotations: Batch, Stream

Overhead of SDG Fault Tolerance

[Plots: latency (ms) vs. state size (GB, No FT and 1-5), and latency (ms) vs. checkpoint frequency (s, 2-10 and No FT)]

The fault tolerance mechanism's impact on performance and latency is small: state size and checkpointing frequency do not affect performance.

Fault Tolerance Overhead

[Plot: throughput (10,000 requests/s) and latency (ms) vs. aggregated memory (MB, 10-2000) for SDG, Naiad-NoDisk and Naiad-Disk]

Recovery Times

[Plot: recovery time (s) vs. state size (GB, 1-4) for 1-to-1, 2-to-1, 1-to-2 and 2-to-2 recovery]

Stragglers

[Plot: throughput (1000 requests/s) and number of nodes over time (s, 0-60)]

Fault Tolerance: Sync. vs. Async.

[Plot: throughput (1000 requests/s) and latency (s) vs. state size (GB, 1-4) for synchronous and asynchronous checkpointing]

Comparison to State-of-the-Art

  System      Large State   Mutable State   Low Latency   Iteration
  MapReduce   n/a           n/a             No            No
  Spark       n/a           n/a             No            Yes
  Storm       n/a           n/a             Yes           No
  Naiad       No            Yes             Yes           Yes
  SDG         Yes           Yes             Yes           Yes

SDGs are the first stateful, fault-tolerant model, enabling execution of imperative code with explicit state.

Characteristics of SDGs

> Runtime data parallelism (elasticity): adaptation to varying workloads and a mechanism against stragglers
> Support for cyclic graphs: efficiently represents iterative algorithms
> Low latency: pipelining tasks decreases latency

Local Expert and Bob

Bob: Hi, I have a query to run on "Big Data".
Expert: Ok, cool, tell me about it.
Bob: I want to know sales per employee on Saturdays.
Expert: … well … ok, come back in 3 days.
Bob: Well, this is actually pretty urgent…
Expert: … 2 days, I'm pretty busy.

2 days later:

Bob: Hi! Do you have the results?
Expert: Yes, here you have your sales last Saturday.
Bob: My sales? I meant all employee sales, and not only last Saturday.
Expert: Oops, sorry for that, give me 2 days…

17th-18th Nov 2014, Madrid (Spain)
