THE ABSTRACTION THAT POWERS THE BIG DATA 
RAÚL CASTRO FERNÁNDEZ 
COMPUTER SCIENCE PHD STUDENT IMPERIAL COLLEGE
Dataflows: The Abstraction that Powers Big Data
Raul Castro Fernandez
Imperial College London
rc3011@doc.ic.ac.uk
@raulcfernandez
“Big Data needs Democratization”
Democratization of Data
Developers and DBAs are no longer the only ones generating, processing and analyzing data.
Decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers, sales, marketing…
+ Everyone has data
+ Many have interesting questions
- Not everyone knows how to analyze it
Bob and the Local Expert
- Barrier of human communication
- Barrier of professional relations
“The limits of my language mean the limits of my world.”
Ludwig Wittgenstein, “Tractatus Logico-Philosophicus” (1922)
First step to democratize Big Data: to offer a familiar programming interface
Outline
• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation
Mutable State in a Recommender System

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

User-Item matrix (UI): update with new ratings
            Item-A  Item-B
    User-A    4       5
    User-B    0       5

Co-Occurrence matrix (CO): multiply for recommendation (User-B)
            Item-A  Item-B
    Item-A    1       1
    Item-B    1       2
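The recommender on this slide can be sketched as a small self-contained Java class. The slide does not define `Matrix`, `Vector` or `updateCoOccurrence`, so this sketch swaps in plain `int` arrays and assumes that a co-occurrence entry counts the users who rated both items, which reproduces the CO values shown on the slide. The class name `RecommenderSketch` and the accessor `coOccurrence` are hypothetical.

```java
import java.util.Arrays;

/** Single-node sketch of the slide's recommender; not the SEEP implementation. */
class RecommenderSketch {
    private final int[][] userItem; // users x items, 0 means "not rated"
    private final int[][] coOcc;    // items x items co-occurrence counts

    RecommenderSketch(int users, int items) {
        userItem = new int[users][items];
        coOcc = new int[items][items];
    }

    void addRating(int user, int item, int rating) {
        userItem[user][item] = rating;
        updateCoOccurrence();
    }

    // Assumption: coOcc[i][j] = number of users who rated both item i and item j.
    private void updateCoOccurrence() {
        for (int[] row : coOcc) Arrays.fill(row, 0);
        for (int[] ratings : userItem)
            for (int i = 0; i < ratings.length; i++)
                for (int j = 0; j < ratings.length; j++)
                    if (ratings[i] > 0 && ratings[j] > 0) coOcc[i][j]++;
    }

    // Recommendation vector = coOcc multiplied by the user's rating row.
    int[] getRec(int user) {
        int[] userRow = userItem[user];
        int[] userRec = new int[userRow.length];
        for (int i = 0; i < coOcc.length; i++)
            for (int j = 0; j < userRow.length; j++)
                userRec[i] += coOcc[i][j] * userRow[j];
        return userRec;
    }

    int[][] coOccurrence() { return coOcc; }

    public static void main(String[] args) {
        RecommenderSketch r = new RecommenderSketch(2, 2);
        r.addRating(0, 0, 4); // User-A rates Item-A
        r.addRating(0, 1, 5); // User-A rates Item-B
        r.addRating(1, 1, 5); // User-B rates Item-B
        System.out.println(Arrays.deepToString(r.coOccurrence()));
        System.out.println(Arrays.toString(r.getRec(1)));
    }
}
```

Both matrices are ordinary mutable fields, which is what keeps the algorithm short and what makes it hard to parallelize once they no longer fit on one machine.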
Challenges When Executing with Big Data
Big Data problem: matrices become large
> Mutable state leads to concise algorithms but complicates parallelism and fault tolerance

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

> Cannot lose state after failure
> Need to manage state to support data-parallelism
Using Current Distributed Dataflow Frameworks
Input data -> Output data
> No mutable state simplifies fault tolerance
> MapReduce: Map and Reduce tasks
> Storm: No support for state
> Spark: Immutable RDDs
Imperative Big Data Processing
> Programming distributed dataflow graphs requires learning new programming models
Our Goal: Run Java programs with mutable state but with performance and fault tolerance of distributed dataflow systems
Stateful Dataflow Graphs: From Imperative Programs to Distributed Dataflows
Program.java -> SDGs: Stateful Dataflow Graphs
> Mutable distributed state in dataflow graphs
> @Annotations help with translation from Java to SDGs
> Checkpoint-based fault tolerance recovers mutable state after failure
SDG: Data, State and Computation
> SDGs separate data and state to allow data and pipeline parallelism
Task Elements (TEs) process data
State Elements (SEs) represent state
Dataflows represent data
> Task Elements have local access to State Elements
Distributed Mutable State
State Elements support two abstractions for distributed mutable state
– Partitioned SEs: task elements always access state by key
– Partial SEs: task elements can access complete state
Distributed Mutable State: Partitioned SEs
> Partitioned SEs split into disjoint partitions
Access by key: dataflow routed according to hash function, hash(msg.id)
State partitioned according to partitioning key
Key space: [0-N], split into [0-k] and [(k+1)-N]

User-Item matrix (UI)
            Item-A  Item-B
    User-A    4       5
    User-B    0       5
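A minimal sketch of key-based access to a partitioned SE, assuming the user id is the partitioning key and that the hash space is cut into contiguous ranges, as in the [0-k] / [(k+1)-N] split on the slide. The class `PartitionedStateSketch` and its methods are hypothetical names, and the partitions are simulated in one process rather than placed on separate nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Rows of the user-item matrix spread over disjoint partitions by key. */
class PartitionedStateSketch {
    private final int keySpace; // hash values fall in [0, keySpace)
    private final List<Map<Integer, int[]>> partitions = new ArrayList<>();

    PartitionedStateSketch(int numPartitions, int keySpace) {
        this.keySpace = keySpace;
        for (int i = 0; i < numPartitions; i++) partitions.add(new HashMap<>());
    }

    // Route by hash of the key; each partition owns one contiguous hash range.
    int partitionFor(int userId) {
        int h = Math.floorMod(Integer.hashCode(userId), keySpace);
        return h * partitions.size() / keySpace;
    }

    // Every access goes through the key, so it touches exactly one partition.
    void setRow(int userId, int[] ratings) {
        partitions.get(partitionFor(userId)).put(userId, ratings);
    }

    int[] getRow(int userId) {
        return partitions.get(partitionFor(userId)).get(userId);
    }

    int partitionSize(int p) { return partitions.get(p).size(); }
}
```

With two partitions and a key space of 100, user 7 lands in the first range and user 73 in the second, so the two rows can be updated in parallel without coordination.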
Distributed Mutable State: Partial SEs
> Partial SE gives nodes local state instances
> Partial SE access by TEs can be local or global
Local access: data sent to one
Global access: data sent to all
Merging Distributed Mutable State
> Reading all partial SE instances results in set of partial values
> Requires application-specific merge logic
Multiple partial values -> Collect partial values -> Merge logic
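The local/global access pattern and the merge step can be sketched as follows. This assumes the partial SE is the co-occurrence matrix and that the application-specific merge is an element-wise sum, which is valid here because a matrix-vector product is linear; other applications would need different merge logic. All names (`PartialStateSketch`, `incrementLocal`, `multiplyGlobal`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** One partial co-occurrence matrix per simulated node, merged on global reads. */
class PartialStateSketch {
    private final List<int[][]> instances = new ArrayList<>();

    PartialStateSketch(int nodes, int items) {
        for (int n = 0; n < nodes; n++) instances.add(new int[items][items]);
    }

    // Local access: the update is sent to exactly one instance.
    void incrementLocal(int node, int i, int j) {
        instances.get(node)[i][j]++;
    }

    // Global access: the request is sent to all instances, each returns a partial value.
    List<int[]> multiplyGlobal(int[] userRow) {
        List<int[]> partials = new ArrayList<>();
        for (int[][] m : instances) {
            int[] out = new int[userRow.length];
            for (int i = 0; i < m.length; i++)
                for (int j = 0; j < userRow.length; j++)
                    out[i] += m[i][j] * userRow[j];
            partials.add(out);
        }
        return partials;
    }

    // Merge logic (assumed): element-wise sum of the collected partial values.
    static int[] merge(List<int[]> partials) {
        int[] merged = new int[partials.get(0).length];
        for (int[] p : partials)
            for (int i = 0; i < p.length; i++) merged[i] += p[i];
        return merged;
    }
}
```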
From Imperative Code to Execution
Program.java -> Annotated program -> SEEP
> SEEP: data-parallel processing platform
• Translation occurs in two stages:
– Static code analysis: From Java to SDG
– Bytecode rewriting: From SDG to SEEP [SIGMOD’13]
Translation Process
Annotated Program.java -> Extract TEs, SEs and accesses -> Live variable analysis -> TE and SE access code assembly -> SEEP runnable
Tools: SOOT Framework, Javassist
> Extract state and state access patterns through static code analysis
> Generation of runnable code using TE and SE connections
Partitioned State Annotation

    @Partitioned Matrix userItem = new SeepMatrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

> @Partitioned field annotation indicates partitioned state
hash(msg.id)
Partial State and Global Annotations

    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(@Global coOcc, userItem);
    }

> @Partial field annotation indicates partial state
> @Global annotates variable to indicate access to all partial instances
Partial and Collection Annotations

    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        @Partial Vector puRec = @Global coOcc.multiply(userRow);
        Vector userRec = merge(puRec);
        return userRec;
    }

    Vector merge(@Collection Vector[] v) {
        /*…*/
    }

> @Collection annotation indicates merge logic
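For readers who want to try the annotations in plain Java, the sketch below declares four marker annotations with the names used on the slides. The declarations (retention, targets) are assumptions made for illustration, not the SEEP definitions. Standard Java does not allow an annotation on an arbitrary expression, so the slides' `@Global coOcc` is notation handled by the translator; the sketch annotates only fields and parameters.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical declarations of the slide's annotations, plus a reflection check. */
class AnnotationSketch {
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
    @interface Partitioned {}

    @Retention(RetentionPolicy.RUNTIME) @Target({ElementType.FIELD, ElementType.LOCAL_VARIABLE})
    @interface Partial {}

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.LOCAL_VARIABLE)
    @interface Global {}

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
    @interface Collection {}

    static class Program {
        @Partitioned int[][] userItem = new int[2][2]; // accessed by key
        @Partial int[][] coOcc = new int[2][2];        // one instance per node

        // Merge logic: receives the collected partial values (summed here by assumption).
        int[] merge(@Collection int[][] partials) {
            int[] merged = new int[partials[0].length];
            for (int[] p : partials)
                for (int i = 0; i < p.length; i++) merged[i] += p[i];
            return merged;
        }
    }

    // A translator could discover state kinds like this; shown with reflection for brevity.
    static boolean hasAnnotation(String fieldName, Class<? extends Annotation> type) {
        try {
            return Program.class.getDeclaredField(fieldName).isAnnotationPresent(type);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

The talk describes static code analysis and bytecode rewriting rather than runtime reflection, so the lookup above only illustrates that the annotations carry enough information to tell the two kinds of state apart.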
Challenges of Making SDGs Fault Tolerant
Physical deployment of SDG: physical nodes, each holding state in RAM
> Task elements access local in-memory state
> Node failures may lead to state loss
Checkpointing State
• No updates allowed while state is being checkpointed
• Checkpointing state should not impact data processing path
State Backup
• Backups large and cannot be stored in memory
• Large writes to disk through network have high cost
Checkpoint Mechanism for Fault Tolerance
Asynchronous, lock-free checkpointing
1. Freeze mutable state for checkpointing
2. Dirty state supports updates concurrently
3. Reconcile dirty state
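The three steps can be illustrated with a small key-value state element. This is a single-threaded sketch under stated assumptions: it shows only how a frozen main state and a dirty overlay interact, and it is not the asynchronous, lock-free implementation the slide refers to. The class and method names (`DirtyStateSketch`, `beginCheckpoint`, `endCheckpoint`) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Freeze / dirty state / reconcile, shown on a simple map-backed state element. */
class DirtyStateSketch {
    private final Map<String, Integer> state = new HashMap<>();
    private final Map<String, Integer> dirty = new HashMap<>();
    private boolean frozen = false;

    // Step 2: while frozen, writes go to the dirty state instead of the main state.
    void put(String key, int value) {
        if (frozen) dirty.put(key, value);
        else state.put(key, value);
    }

    // Reads prefer the dirty state so processing sees its own latest updates.
    Integer get(String key) {
        if (frozen && dirty.containsKey(key)) return dirty.get(key);
        return state.get(key);
    }

    // Step 1: freeze the main state and hand a read-only view to the checkpointer.
    Map<String, Integer> beginCheckpoint() {
        frozen = true;
        return Collections.unmodifiableMap(state);
    }

    // Step 3: fold the dirty updates back into the main state and unfreeze.
    void endCheckpoint() {
        state.putAll(dirty);
        dirty.clear();
        frozen = false;
    }
}
```

Between `beginCheckpoint` and `endCheckpoint` the view stays consistent while `put` and `get` keep working, which is the property the slide asks for: checkpointing does not block the data processing path.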
Distributed M to N Checkpoint Backup
M to N distributed backup and parallel recovery
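As a rough illustration of the idea, the sketch below splits one node's checkpoint into N chunks, one per backup node, and restores the state by reading the chunks back in parallel. The round-robin split and the use of a parallel stream are assumptions chosen for brevity; the slides do not specify how chunks are formed or transferred, and the names `BackupSketch`, `split` and `restore` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Splits a checkpoint across N backup nodes and restores it from all of them. */
class BackupSketch {
    // Backup: distribute checkpoint entries round-robin over n chunks.
    static List<Map<String, Integer>> split(Map<String, Integer> checkpoint, int n) {
        List<Map<String, Integer>> chunks = new ArrayList<>();
        for (int c = 0; c < n; c++) chunks.add(new TreeMap<>());
        int next = 0;
        for (Map.Entry<String, Integer> e : checkpoint.entrySet()) {
            chunks.get(next % n).put(e.getKey(), e.getValue());
            next++;
        }
        return chunks;
    }

    // Recovery: chunks are loaded concurrently into a thread-safe map.
    static Map<String, Integer> restore(List<Map<String, Integer>> chunks) {
        Map<String, Integer> state = new ConcurrentHashMap<>();
        chunks.parallelStream().forEach(state::putAll);
        return state;
    }
}
```

Because no single backup node holds the whole checkpoint, each one writes and later serves only a fraction of the state, which addresses the cost of large writes to disk through the network.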
Evaluation of SDG Performance
How does mutable state impact performance?
How efficient are translated SDGs?
What is the throughput/latency trade-off?
Experimental set-up:
– Amazon EC2 (c1 and m1 xlarge instances)
– Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)
– Sun Java 7, Ubuntu 12.04, Linux kernel 3.10
Processing with Large Mutable State
> addRating and getRec functions from recommender algorithm, while changing read/write ratio
[Figure: throughput (1000 requests/s) and latency (ms) against workload (state read/write ratio: 1:5, 1:2, 1:1, 2:1, 5:1)]
Combines batch and online processing to serve fresh results over large mutable state
Efficiency of Translated SDG
> Batch-oriented, iterative logistic regression
[Figure: throughput (GB/s) against number of nodes (25, 50, 75, 100) for SDG and Spark]
Translated SDG achieves performance similar to non-mutable dataflow
Latency/Throughput Tradeoff
> Streaming word count query, reporting counts over windows
[Figure: throughput (1000 requests/s) against window size (ms; 10, 100, 1000, 10000) for SDG, Naiad-LowLatency, Naiad-HighThroughput and Streaming Spark]
SDGs achieve high throughput while maintaining low latency
Summary
Running Java programs with the performance of current distributed dataflow frameworks
SDG: Stateful Dataflow Graphs
– Abstractions for distributed mutable state
– Annotations to disambiguate types of distributed state and state access
– Checkpoint-based fault tolerance mechanism
https://github.com/lsds/Seep/
https://github.com/raulcf/SEEPng/
Thank you! Any questions?
@raulcfernandez
rc3011@doc.ic.ac.uk
BACKUP SLIDES
Scalability on State Size and Throughput
> Increase state size in a mutated KV store
[Figure: throughput (million requests/s) and latency (ms) against aggregated memory (GB; 50, 100, 150, 200)]
Support large state without compromising throughput or latency while staying fault tolerant
Iteration in SDG
> Local iteration supported by one node
> Iteration across TEs requires cycle in the dataflow
Types of Annotations
• Partition
• Partial
• Global
• Partial
• Collection
• Data annotations
– Batch
– Stream
Overhead of SDG Fault Tolerance
Fault tolerance mechanism impact on performance and latency is small.
[Figure: latency (ms) against state size (GB; No FT, 1 to 5)]
[Figure: latency (ms) against checkpoint frequency (s; 2 to 10, No FT)]
State size and checkpointing frequency do not affect the performance
Fault Tolerance Overhead
[Figure: throughput (10,000 requests/s) and latency (ms) against aggregated memory (MB; 10, 100, 1000, 2000) for SDG, Naiad-NoDisk and Naiad-Disk]
Recovery Times
[Figure: recovery time (s) against state size (GB; 1, 2, 4) for 1-to-1, 2-to-1, 1-to-2 and 2-to-2 recovery]
Stragglers
[Figure: throughput (1000 requests/s) and number of nodes against time (s; 0 to 60)]
Fault Tolerance: Sync. vs. Async.
[Figure: throughput (1000 requests/s) and latency (s) against state size (GB; 1 to 4), series T'put (Async), T'put (Sync) and Latency (Sync)]
Comparison to State-of-the-Art

    System      Large State   Mutable State   Low Latency   Iteration
    MapReduce   n/a           n/a             No            No
    Spark       n/a           n/a             No            Yes
    Storm       n/a           n/a             Yes           No
    Naiad       No            Yes             Yes           Yes
    SDG         Yes           Yes             Yes           Yes

SDGs are the first stateful, fault-tolerant model, enabling execution of imperative code with explicit state
Characteristics of SDGs
> Runtime Data Parallelism (elasticity): adaptation to varying workloads and mechanism against stragglers
> Support for Cyclic Graphs: efficiently represent iterative algorithms
> Low Latency: pipelining tasks decreases latency
Bob and the Local Expert
- Hi, I have a query to run on “Big Data”
- Ok, cool, tell me about it
- I want to know sales per employee on Saturdays
- … well … ok, come in 3 days
- Well, this is actually pretty urgent…
- … 2 days, I’m pretty busy
2 days after
- Hi! You have the results?
- Yes, here you have your sales last Saturday
- My sales? I meant all employee sales, and not only last Saturday
- Oops, sorry for that, give me 2 days…
17TH ~ 18th NOV 2014 
MADRID (SPAIN)

network monitoring system pptnetwork monitoring system ppt
network monitoring system pptashutosh rai
 

Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

  • 1. THE ABSTRACTION THAT POWERS THE BIG DATA. RAÚL CASTRO FERNÁNDEZ, COMPUTER SCIENCE PHD STUDENT, IMPERIAL COLLEGE
  • 2. Dataflows: The Abstraction that Powers Big Data. Raul Castro Fernandez, Imperial College London, rc3011@doc.ic.ac.uk, @raulcfernandez
  • 3. “Big Data needs Democratization”
  • 4. Democratization of Data: Developers and DBAs are no longer the only ones generating, processing and analyzing data.
  • 5. Democratization of Data: Developers and DBAs are no longer the only ones generating, processing and analyzing data. Decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers, sales, marketing…
  • 6. + Everyone has data
  • 7. + Everyone has data + Many have interesting questions
  • 8. + Everyone has data + Many have interesting questions - Not everyone knows how to analyze it
  • 9. + Everyone has data + Many have interesting questions - Not everyone knows how to analyze it
  • 10. Bob / Local Expert
  • 11. Bob / Local Expert
  • 12. Bob / Local Expert - Barrier of human communication - Barrier of professional relations
  • 13. Bob / Local Expert - Barrier of human communication - Barrier of professional relations. “The limits of my language mean the limits of my world.” Ludwig Wittgenstein, “Tractatus Logico-Philosophicus”, 1922
  • 14. First step to democratize Big Data: to offer a familiar programming interface
  • 15. Outline • Motivation • SDG: Stateful Dataflow Graphs • Handling distributed state in SDGs • Translating Java programs to SDGs • Checkpoint-based fault tolerance for SDGs • Experimental evaluation
  • 16. Mutable State in a Recommender System: User-Item matrix (UI), Co-Occurrence matrix (CO). Matrix userItem = new Matrix(); Matrix coOcc = new Matrix(); UI (columns Item-A, Item-B): User-A = (4, 5), User-B = (0, 5). CO (columns Item-A, Item-B): Item-A = (1, 1), Item-B = (1, 2).
  • 17. Mutable State in a Recommender System: User-Item matrix (UI), Co-Occurrence matrix (CO). Matrix userItem = new Matrix(); Matrix coOcc = new Matrix(); void addRating(int user, int item, int rating) { userItem.setElement(user, item, rating); updateCoOccurrence(coOcc, userItem); } UI (columns Item-A, Item-B): User-A = (4, 5), User-B = (0, 5). CO (columns Item-A, Item-B): Item-A = (1, 1), Item-B = (1, 2). Update with new ratings.
  • 18. Mutable State in a Recommender System: User-Item matrix (UI), Co-Occurrence matrix (CO). Matrix userItem = new Matrix(); Matrix coOcc = new Matrix(); void addRating(int user, int item, int rating) { userItem.setElement(user, item, rating); updateCoOccurrence(coOcc, userItem); } Vector getRec(int user) { Vector userRow = userItem.getRow(user); Vector userRec = coOcc.multiply(userRow); return userRec; } UI (columns Item-A, Item-B): User-A = (4, 5), User-B = (0, 5). CO (columns Item-A, Item-B): Item-A = (1, 1), Item-B = (1, 2). Update with new ratings. Multiply CO by the User-B row for recommendation.
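The recommender on these slides can be sketched as a small single-node Java class. This is only an illustration: the slides do not show `Matrix` or `updateCoOccurrence`, so the dense `int[][]` arrays, the class name `Recommender`, and the rule "CO[i][j] counts users who rated both items with a rating above zero" are assumptions. With that rule, the example ratings from the slides reproduce the CO values shown there.

```java
import java.util.Arrays;

/** Single-node sketch of the recommender (hypothetical class, dense matrices). */
final class Recommender {
    private final int[][] userItem; // UI: users x items ratings
    private final int[][] coOcc;    // CO: items x items co-occurrence counts

    Recommender(int users, int items) {
        userItem = new int[users][items];
        coOcc = new int[items][items];
    }

    /** Stores a rating, then rebuilds CO from scratch (simple, not incremental). */
    void addRating(int user, int item, int rating) {
        userItem[user][item] = rating;
        updateCoOccurrence();
    }

    /** Assumed rule: CO[i][j] = number of users with rating > 0 for both i and j. */
    private void updateCoOccurrence() {
        for (int[] row : coOcc) Arrays.fill(row, 0);
        for (int[] ratings : userItem)
            for (int i = 0; i < ratings.length; i++)
                for (int j = 0; j < ratings.length; j++)
                    if (ratings[i] > 0 && ratings[j] > 0) coOcc[i][j]++;
    }

    /** Recommendation vector = CO multiplied by the user's row of UI. */
    int[] getRec(int user) {
        int[] rec = new int[coOcc.length];
        for (int i = 0; i < coOcc.length; i++)
            for (int j = 0; j < coOcc.length; j++)
                rec[i] += coOcc[i][j] * userItem[user][j];
        return rec;
    }

    /** Exposes CO for inspection. */
    int[][] coOccurrence() { return coOcc; }
}
```

Both methods read and write the same two fields, which is exactly the mutable state that the following slides try to distribute.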
  • 19. Challenges When Executing with Big Data. Big Data problem: matrices become large. Matrix userItem = new Matrix(); Matrix coOcc = new Matrix(); > Mutable state leads to concise algorithms but complicates parallelism and fault tolerance > Cannot lose state after failure > Need to manage state to support data-parallelism
  • 20. Using Current Distributed Dataflow Frameworks: Input data, Output data > No mutable state simplifies fault tolerance > MapReduce: Map and Reduce tasks > Storm: No support for state > Spark: Immutable RDDs
  • 21. Imperative Big Data Processing > Programming distributed dataflow graphs requires learning new programming models
  • 22. Imperative Big Data Processing > Programming distributed dataflow graphs requires learning new programming models. Our Goal: Run Java programs with mutable state but with performance and fault tolerance of distributed dataflow systems
  • 23. Stateful Dataflow Graphs: From Imperative Programs to Distributed Dataflows. Program.java. SDGs: Stateful Dataflow Graphs > Mutable distributed state in dataflow graphs > @Annotations help with translation from Java to SDGs > Checkpoint-based fault tolerance recovers mutable state after failure
  • 24. Outline: • Motivation • SDG: Stateful Dataflow Graphs • Handling distributed state in SDGs • Translating Java programs to SDGs • Checkpoint-based fault tolerance for SDGs • Experimental evaluation
  • 25. SDG: Data, State and Computation > SDGs separate data and state to allow data and pipeline parallelism. Task Elements (TEs) process data; State Elements (SEs) represent state; Dataflows represent data > Task Elements have local access to State Elements
  • 26. Distributed Mutable State: State Elements support two abstractions for distributed mutable state – Partitioned SEs: task elements always access state by key – Partial SEs: task elements can access complete state
  • 27. Distributed Mutable State: Partitioned SEs > Partitioned SEs split into disjoint partitions. Access by key: state partitioned according to partitioning key; dataflow routed according to hash function hash(msg.id). User-Item matrix (UI) (columns Item-A, Item-B): User-A = (4, 5), User-B = (0, 5). Key space [0-N] split into [0-k] and [(k+1)-N]
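Access by key can be illustrated with a minimal Java sketch in which each user row lives in exactly one partition and a routing function picks that partition. The class `PartitionedUserItem` and its methods are hypothetical, and the sketch routes with a hash modulo the partition count, whereas the slide shows the key space split into contiguous ranges; either scheme gives disjoint partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a partitioned state element: UI rows split by user key (hypothetical). */
final class PartitionedUserItem {
    private final List<Map<Integer, int[]>> partitions = new ArrayList<>();
    private final int items;

    PartitionedUserItem(int numPartitions, int items) {
        this.items = items;
        for (int p = 0; p < numPartitions; p++) partitions.add(new HashMap<>());
    }

    /** Routing function: the same key always maps to the same partition. */
    int partitionOf(int userId) {
        return Math.floorMod(Integer.hashCode(userId), partitions.size());
    }

    /** Writes touch only the partition that owns the user's row. */
    void setElement(int user, int item, int rating) {
        partitions.get(partitionOf(user))
                  .computeIfAbsent(user, u -> new int[items])[item] = rating;
    }

    /** Reads are also routed by key; unknown users yield an all-zero row. */
    int[] getRow(int user) {
        int[] row = partitions.get(partitionOf(user)).get(user);
        return row == null ? new int[items] : row.clone();
    }

    /** Number of user rows stored in one partition. */
    int rowsIn(int partition) { return partitions.get(partition).size(); }
}
```

Because every operation names a key, the dataflow can send each message to the single node holding the matching partition and no cross-partition coordination is needed.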
  • 28. Distributed Mutable State: Partial SEs > Partial SE gives nodes local state instances > Partial SE access by TEs can be local or global. Local access: data sent to one. Global access: data sent to all
  • 29. Merging Distributed Mutable State > Reading all partial SE instances results in set of partial values > Requires application-specific merge logic. Merge logic
  • 30. Merging Distributed Mutable State > Reading all partial SE instances results in set of partial values > Requires application-specific merge logic. Multiple partial values, Merge logic
  • 31. Merging Distributed Mutable State > Reading all partial SE instances results in set of partial values > Requires application-specific merge logic. Multiple partial values, Collect partial values, Merge logic
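A rough Java sketch of partial state with global access and merge logic follows. The names `PartialCoOcc`, `increment`, `multiplyAll` and `merge` are illustrative only. The merge used here is an element-wise sum, which is one possible application-specific choice: it is valid for this example because the full co-occurrence matrix is the sum of the partial instances and matrix-vector multiplication is linear.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a partial state element: each instance holds its own CO matrix. */
final class PartialCoOcc {
    private final List<int[][]> instances = new ArrayList<>();
    private final int items;

    PartialCoOcc(int numInstances, int items) {
        this.items = items;
        for (int n = 0; n < numInstances; n++) instances.add(new int[items][items]);
    }

    /** Local access: an update is sent to exactly one instance. */
    void increment(int instance, int i, int j) {
        instances.get(instance)[i][j]++;
    }

    /** Global access: every instance computes its own partial result. */
    List<int[]> multiplyAll(int[] userRow) {
        List<int[]> partials = new ArrayList<>();
        for (int[][] co : instances) {
            int[] rec = new int[items];
            for (int i = 0; i < items; i++)
                for (int j = 0; j < items; j++)
                    rec[i] += co[i][j] * userRow[j];
            partials.add(rec);
        }
        return partials;
    }

    /** Merge logic (assumed here): element-wise sum of the partial vectors. */
    static int[] merge(List<int[]> partials) {
        int[] merged = new int[partials.get(0).length];
        for (int[] p : partials)
            for (int i = 0; i < merged.length; i++) merged[i] += p[i];
        return merged;
    }
}
```

Other applications would need a different merge, for example a maximum or a set union, which is why the slides call the merge logic application-specific.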
  • 32. Outline > @Annotations • Motivation • SDG: Stateful Dataflow Graphs • Handling distributed state in SDGs • Translating Java programs to SDGs • Checkpoint-based fault tolerance for SDGs • Experimental evaluation
  • 33. From Imperative Code to Execution: Program.java → Annotated program → SEEP > SEEP: data-parallel processing platform • Translation occurs in two stages: – Static code analysis: From Java to SDG – Bytecode rewriting: From SDG to SEEP [SIGMOD’13]
  • 34. Translation Process: Program.java → Extract TEs, SEs and accesses → Live variable analysis → TE and SE access code assembly → SEEP runnable. Tools: SOOT Framework, Javassist > Extract state and state access patterns through static code analysis > Generation of runnable code using TE and SE connections
  • 35. Translation Process: Annotated Program.java → Extract TEs, SEs and accesses → Live variable analysis → TE and SE access code assembly → SEEP runnable. Tools: SOOT Framework, Javassist > Extract state and state access patterns through static code analysis > Generation of runnable code using TE and SE connections
  • 36. Partitioned State Annotation: @Partitioned Matrix userItem = new SeepMatrix(); Matrix coOcc = new Matrix(); void addRating(int user, int item, int rating) { userItem.setElement(user, item, rating); updateCoOccurrence(coOcc, userItem); } Vector getRec(int user) { Vector userRow = userItem.getRow(user); Vector userRec = coOcc.multiply(userRow); return userRec; } > @Partitioned field annotation indicates partitioned state, routed by hash(msg.id)
  • 37. Partial State and Global Annotations: @Partitioned Matrix userItem = new SeepMatrix(); @Partial Matrix coOcc = new SeepMatrix(); void addRating(int user, int item, int rating) { userItem.setElement(user, item, rating); updateCoOccurrence(@Global coOcc, userItem); } > @Partial field annotation indicates partial state > @Global annotates variable to indicate access to all partial instances
  • 38. Partial and Collection Annotations: @Partitioned Matrix userItem = new SeepMatrix(); @Partial Matrix coOcc = new SeepMatrix(); Vector getRec(int user) { Vector userRow = userItem.getRow(user); @Partial Vector puRec = @Global coOcc.multiply(userRow); Vector userRec = merge(puRec); return userRec; } Vector merge(@Collection Vector[] v){ /*…*/ } > @Collection annotation indicates merge logic
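To make the role of the field annotations more concrete, here is a small Java sketch that declares two marker annotations and classifies the annotated fields of a class. It is a stand-in under stated assumptions: the annotation declarations, `AnnotatedRecommender` and `StateScanner` are hypothetical, and the scan uses runtime reflection for brevity, whereas the slides describe static code analysis with SOOT and bytecode rewriting with Javassist. The expression-level `@Global` usage on the slides is not standard Java syntax and is left out.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical marker: state accessed by key and split into disjoint partitions. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Partitioned {}

/** Hypothetical marker: state replicated as independent partial instances. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Partial {}

/** Example program whose fields carry the state annotations. */
final class AnnotatedRecommender {
    @Partitioned int[][] userItem = new int[2][2];
    @Partial int[][] coOcc = new int[2][2];
    int requestCount; // plain field, not treated as distributed state
}

/** Reflection-based stand-in for the "extract SEs" step of the translation. */
final class StateScanner {
    /** Maps each annotated field name to its kind of distributed state. */
    static Map<String, String> classify(Class<?> program) {
        Map<String, String> kinds = new TreeMap<>();
        for (Field f : program.getDeclaredFields()) {
            if (f.isAnnotationPresent(Partitioned.class)) {
                kinds.put(f.getName(), "partitioned");
            } else if (f.isAnnotationPresent(Partial.class)) {
                kinds.put(f.getName(), "partial");
            }
        }
        return kinds;
    }
}
```

Calling `StateScanner.classify(AnnotatedRecommender.class)` returns one entry per annotated field, which is the kind of information a translator would need before deciding how to distribute each state element.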
  • 39. Outline > Failures • Motivation • SDG: Stateful Dataflow Graphs • Handling distributed state in SDGs • Translating Java programs to SDGs • Checkpoint-based fault tolerance for SDGs • Experimental evaluation
  • 40. Challenges of Making SDGs Fault Tolerant. Physical deployment of SDG > Task elements access local in-memory state > Node failures may lead to state loss
  • 41. Challenges of Making SDGs Fault Tolerant. Physical deployment of SDG on physical nodes (RAM) > Task elements access local in-memory state > Node failures may lead to state loss
  • 42. Challenges of Making SDGs Fault Tolerant. Physical deployment of SDG on physical nodes (RAM) > Task elements access local in-memory state > Node failures may lead to state loss. Checkpointing State • No updates allowed while state is being checkpointed • Checkpointing state should not impact data processing path
  • 43. Challenges of Making SDGs Fault Tolerant. Physical deployment of SDG on physical nodes (RAM) > Task elements access local in-memory state > Node failures may lead to state loss. Checkpointing State • No updates allowed while state is being checkpointed • Checkpointing state should not impact data processing path. State Backup • Backups large and cannot be stored in memory • Large writes to disk through network have high cost
  • 44. Checkpoint Mechanism for Fault Tolerance: asynchronous, lock-free checkpointing. 1. Freeze mutable state for checkpointing 2. Dirty state supports updates concurrently 3. Reconcile dirty state
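The three steps on this slide can be sketched with a small key-value store in Java. This is a single-threaded illustration of the freeze, dirty state and reconcile phases only; the class `DirtyStateStore` is hypothetical, it does not implement the asynchronous, lock-free behaviour the slide describes, and it ignores deletions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Sketch of dirty-state checkpointing for a key-value state element. */
final class DirtyStateStore {
    private final Map<String, Integer> state = new HashMap<>();
    private final Map<String, Integer> dirty = new HashMap<>();
    private boolean frozen = false;

    /** While frozen, writes are redirected to the dirty map (step 2). */
    void put(String key, int value) {
        (frozen ? dirty : state).put(key, value);
    }

    /** Dirty entries are newer than the frozen state, so they are checked first. */
    Integer get(String key) {
        if (frozen && dirty.containsKey(key)) return dirty.get(key);
        return state.get(key);
    }

    /** Step 1: freeze the state and hand a read-only view to the backup writer. */
    Map<String, Integer> freeze() {
        frozen = true;
        return Collections.unmodifiableMap(state);
    }

    /** Step 3: fold the dirty entries into the main state and unfreeze. */
    void reconcile() {
        state.putAll(dirty);
        dirty.clear();
        frozen = false;
    }
}
```

While the store is frozen, the view returned by `freeze()` stays stable even though `put` keeps accepting updates, so processing does not have to stop for the checkpoint.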
  • 45. Distributed M to N Checkpoint Backup: M to N distributed backup and parallel recovery
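One possible reading of distributed backup with parallel recovery is sketched below in Java: a checkpoint is split into chunks, one per backup node, and recovery reads all chunks in parallel. The slide gives no algorithm, so the class `ChunkedBackup`, the split by key hash, and the use of a parallel stream as a stand-in for several nodes restoring at once are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of splitting a checkpoint across backup nodes (hypothetical). */
final class ChunkedBackup {
    /** Splits one checkpoint into n chunks, one per backup node, by key hash. */
    static List<Map<String, Integer>> split(Map<String, Integer> checkpoint, int n) {
        List<Map<String, Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < n; i++) chunks.add(new TreeMap<>());
        for (Map.Entry<String, Integer> e : checkpoint.entrySet()) {
            int target = Math.floorMod(e.getKey().hashCode(), n);
            chunks.get(target).put(e.getKey(), e.getValue());
        }
        return chunks;
    }

    /** Restores the state by loading all chunks in parallel. */
    static Map<String, Integer> recover(List<Map<String, Integer>> chunks) {
        Map<String, Integer> restored = new ConcurrentHashMap<>();
        chunks.parallelStream().forEach(restored::putAll);
        return restored;
    }
}
```

Spreading chunks over several nodes keeps any single backup small and lets recovery bandwidth scale with the number of nodes holding chunks.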
  • 54. Evaluation of SDG Performance: How does mutable state impact performance? How efficient are translated SDGs? What is the throughput/latency trade-off? Experimental set-up: – Amazon EC2 (c1 and m1 xlarge instances) – Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM) – Sun Java 7, Ubuntu 12.04, Linux kernel 3.10
  • 55. Processing with Large Mutable State > addRating and getRec functions from recommender algorithm, while changing read/write ratio. [Plot: throughput (1000 requests/s) and latency (ms) vs. workload (state read/write ratio: 1:5, 1:2, 1:1, 2:1, 5:1)] Combines batch and online processing to serve fresh results over large mutable state
  • 56. Efficiency of Translated SDG > Batch-oriented, iterative logistic regression. [Plot: throughput (GB/s) vs. number of nodes (25, 50, 75, 100); series: SDG, Spark] Translated SDG achieves performance similar to non-mutable dataflow
  • 57. Latency/Throughput Tradeoff > Streaming word count query, reporting counts over windows. [Plot: throughput (1000 requests/s) vs. window size (ms: 10, 100, 1000, 10000); series: SDG, Naiad-LowLatency] SDGs achieve high throughput while maintaining low latency
  • 58. Latency/Throughput Tradeoff > Streaming word count query, reporting counts over windows. [Plot: throughput (1000 requests/s) vs. window size (ms: 10, 100, 1000, 10000); series: Naiad-HighThroughput, SDG, Naiad-LowLatency, Streaming Spark] SDGs achieve high throughput while maintaining low latency
  • 59. Latency/Throughput Tradeoff > Streaming word count query, reporting counts over windows. [Plot: throughput (1000 requests/s) vs. window size (ms: 10, 100, 1000, 10000); series: Naiad-HighThroughput, SDG, Naiad-LowLatency, Streaming Spark] SDGs achieve high throughput while maintaining low latency
  • 60. Summary: Running Java programs with the performance of current distributed dataflow frameworks. SDG: Stateful Dataflow Graphs – Abstractions for distributed mutable state – Annotations to disambiguate types of distributed state and state access – Checkpoint-based fault tolerance mechanism
  • 61. Summary: Running Java programs with the performance of current distributed dataflow frameworks. SDG: Stateful Dataflow Graphs – Abstractions for distributed mutable state – Annotations to disambiguate types of distributed state and state access – Checkpoint-based fault tolerance mechanism. https://github.com/lsds/Seep/ https://github.com/raulcf/SEEPng/ Thank you! Any Questions? @raulcfernandez rc3011@doc.ic.ac.uk
  • 63. Scalability on State Size and Throughput > Increase state size in a mutated KV store. [Plot: throughput (million requests/s) and latency (ms) vs. aggregated memory (GB: 50, 100, 150, 200)] Support large state without compromising throughput or latency while staying fault tolerant
  • 64. Iteration in SDG > Local iteration supported by one node > Iteration across TEs requires cycle in the dataflow
  • 65. Types of Annotations • Partition • Partial • Global • Partial • Collection • Data annotations – Batch – Stream
  • 66. Overhead of SDG Fault Tolerance. [Plots: latency (ms) vs. state size (GB: No FT, 1, 2, 3, 4, 5); latency (ms) vs. checkpoint frequency (s: 2, 4, 6, 8, 10, No FT)] Fault tolerance mechanism impact on performance and latency is small. State size and checkpointing frequency do not affect the performance
  • 67. Fault Tolerance Overhead. [Plot: throughput (10,000 requests/s) and latency (ms) vs. aggregated memory (MB: 10, 100, 1000, 2000); series: SDG, Naiad-NoDisk, Naiad-Disk, SDG (latency), Naiad-NoDisk (latency)]
  • 68. Recovery Times. [Plot: recovery time (s) vs. state size (GB: 1, 2, 4); series: 1-to-1 recovery, 2-to-1 recovery, 1-to-2 recovery, 2-to-2 recovery]
  • 69. Stragglers. [Plot: throughput (1000 requests/s) and number of nodes vs. time (s); series: Throughput, Nodes]
  • 70. Fault Tolerance: Sync. vs. Async. [Plot: throughput (1000 requests/s) and latency (s) vs. state size (GB: 1, 2, 3, 4); series: T'put (Async), T'put (Sync), Latency (Sync)]
  • 71. Comparison to State-of-the-Art
      System | Large State | Mutable State | Low Latency | Iteration
      MapReduce | n/a | n/a | No | No
      Spark | n/a | n/a | No | Yes
      Storm | n/a | n/a | Yes | No
      Naiad | No | Yes | Yes | Yes
      SDG | Yes | Yes | Yes | Yes
    SDGs are the first stateful fault-tolerant model, enabling execution of imperative code with explicit state
  • 72. Characteristics of SDGs > Runtime Data Parallelism (elasticity): adaptation to varying workloads and mechanism against stragglers > Support for Cyclic Graphs: efficiently represent iterative algorithms > Low Latency: pipelining tasks decreases latency
  • 73. Bob and the Local Expert. Bob: Hi, I have a query to run on “Big Data”. Expert: Ok, cool, tell me about it. Bob: I want to know sales per employee on Saturdays. Expert: … well … ok, come in 3 days. Bob: Well, this is actually pretty urgent… Expert: … 2 days, I’m pretty busy. 2 days after. Bob: Hi! You have the results? Expert: Yes, here you have your sales last Saturday. Bob: My sales? I meant all employee sales, and not only last Saturday. Expert: Oops, sorry for that, give me 2 days…
  • 74. 17TH ~ 18th NOV 2014 MADRID (SPAIN)