Graph Stream Processing
spinning fast, large-scale, complex analytics
Paris Carbone
PhD Candidate @ KTH
Committer @ Apache Flink
We want to analyse….
large-scale, complex data, fast
But why do we need
large-scale, complex and fast data analysis?
> to answer big complex questions faster
>Hej Siri_
Get me the best route to work right now
…with the fewest human drivers
Lookup a pizza recipe all of my friends like but
did not eat yesterday… or the day before yesterday
oh! And no kebab pizza!
Siri, is it possible to re-unite all data
scientists in the world? no matter if they
use Spark or Flink or just ipython
3000 AD
FIRST WORLD PROBLEM
30000 AD
FIRST EARTH WORLD PROBLEM
Still, fast analytics might save us some day…
• We can access patient movements and Facebook, Twitter,
and pretty much all social media interactions
• Can we stop a pandemic?
• Or can we quickly predict where the virus will spread?
Now how do we analyse…
large-scale, complex data, fast?
with distributed, streaming, graph data:
everything is many, everything is a graph, everything is a stream
it all started…
as a first-world-problem question
but then things escalated quickly…
…and machinery got cheaper and we
suddenly realised that we have big data
Thus, Distributed Graph Processing was born
Map Reduce (on a DFS: distributed file system)
1. Store partitioned data
2. Send local computation to the data (map)
3. Shuffle the intermediate results on disk
4. Merge the results (reduce)
5. Store the result back

Distributed Graph Processing
1. Store updates to the DFS
2. Load a graph snapshot (in memory)
3. Compute a round (~superstep)
4. Store updates
5. …repeat
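The five MapReduce steps above can be sketched in a few lines. This is a toy in-memory simulation (no DFS, no disk shuffle); `map_reduce`, `map_fn`, and `reduce_fn` are illustrative names, not any real framework's API.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy MapReduce: map each record to (key, value) pairs,
    shuffle by key, then reduce each key's group of values."""
    # steps 1-2: local computation (map) over the partitioned data
    mapped = [kv for record in records for kv in map_fn(record)]
    # step 3: shuffle, i.e. group the intermediate pairs by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # step 4: merge each group (reduce); step 5 would store the result
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# classic word count as a usage example
lines = ["big data", "big graphs", "big streams"]
counts = map_reduce(
    lines,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts["big"])  # 3
```

In a real system each of these steps is distributed: the map runs where the data lives, and the shuffle moves data between machines via disk.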
Distributed Graph processing example
• We want to compute the Connected Components
of a distributed graph.
• Basic computation element (map): vertex
• Updates: messages to other vertices
[Figure: an 8-vertex graph, where vertices 1-5 form one
component and vertices 6-8 another]

ROUND 0: every vertex sends its own ID to its neighbours.
ROUND 1: each vertex adopts the minimum ID it has received and
forwards it; labels start converging towards 1 and 6.
ROUND 2: the minimum labels keep spreading along the edges.
ROUND 3: the last vertices of each component adopt labels 1 and 6.
ROUND 4: No messages, DONE!
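The rounds above can be simulated in plain Python. This is a hedged sketch of the Pregel-style vertex-centric model, not any system's actual API; `connected_components` and the example edge list (matching the two components in the figure) are mine.

```python
def connected_components(vertices, edges):
    """Superstep-style connected components: each round, every vertex
    adopts the smallest label it has received and forwards it to its
    neighbours; the computation stops when no messages are sent."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    label = {v: v for v in vertices}  # ROUND 0: every vertex is its own label
    messages = {v: {label[u] for u in neighbours[v]} for v in vertices}
    while any(messages.values()):  # no messages means DONE
        next_messages = {v: set() for v in vertices}
        for v, received in messages.items():
            smallest = min(received | {label[v]}) if received else label[v]
            if smallest < label[v]:  # label improved: tell the neighbours
                label[v] = smallest
                for u in neighbours[v]:
                    next_messages[u].add(smallest)
        messages = next_messages
    return label

# two components: {1..5} and {6,7,8}, as in the figure
edges = [(1, 3), (1, 4), (2, 4), (2, 5), (3, 4), (4, 5),
         (6, 7), (6, 8), (7, 8)]
labels = connected_components(range(1, 9), edges)
# vertices 1-5 end with label 1, vertices 6-8 with label 6
```

Note how the number of rounds is bounded by the graph's diameter: a label can only travel one hop per superstep.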
Distributed Graph processing systems
• Examples of Load-Compute-Store systems:
Pregel, GraphX (Spark), GraphLab, PowerGraph
• Same execution strategy, same problems:
• It’s slow
• Too much re-computation ($€) for nothing
• Real-world updates, anyone?
…and streaming came
to make real-world event records fast and simple
(not to mess everything up)
The Dataflow™
• local state stays here
• local computation too
Streaming is so advanced that…
• sub-second latency and high throughput finally coexist
• it does fault tolerance without batch writes*
• late data** is handled gracefully
…but what about complex problems?
* https://arxiv.org/abs/1506.08603  ** http://dl.acm.org/citation.cfm?id=2824076
can we make it happen?
• Problem: we can’t keep an infinite graph
in memory and do complex stuff with it
>it was never about the graph, silly: it was about
answering complex questions, remember?
universe → summary → answers
Examples of Summaries
• Spanners: distance estimation
• Sparsifiers: cut estimation
• Sketches: homomorphic properties
[Figure: a graph stream feeds an algorithm that maintains a
graph summary, updated round by round (R1, R2, …)]
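A classic summary for connected components on an edge stream is a disjoint-set (union-find) structure: per-vertex state, no stored edges. The sketch below is illustrative (the class and method names are mine, not Gelly-Stream's); it processes the same edge stream used in the example that follows.

```python
class DisjointSetSummary:
    """Union-find as a graph-stream summary: answers component
    queries for an edge stream without storing the edges."""
    def __init__(self):
        self.parent = {}

    def find(self, v):
        """Return the root (component representative) of v."""
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            # path halving keeps the trees shallow
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def add_edge(self, a, b):
        """Process one streamed edge: merge the two components,
        keeping the smaller root as the component label."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            lo, hi = sorted((ra, rb))
            self.parent[hi] = lo

summary = DisjointSetSummary()
stream = [(5, 4), (7, 6), (8, 6), (4, 2), (3, 1),
          (5, 2), (4, 3), (8, 7), (4, 1)]
for edge in stream:
    summary.add_edge(*edge)
print(summary.find(5), summary.find(8))  # 1 6
```

The summary needs O(V) state regardless of how many edges stream past, which is exactly the point: the stream is unbounded, the summary is not.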
Distributed Graph streaming example
Connected Components on a stream of edges (additions)

Incoming edges: (5,4) (7,6) (8,6) (4,2) (3,1) (5,2) (4,3) (8,7) (4,1) …
Instead of the full graph we keep a summary: the components seen
so far, each labelled by its minimum vertex ID.
• (5,4), (4,2), (5,2) build component {2,4,5}
• (3,1) builds component {1,3}
• (7,6), (8,6), (8,7) build component {6,7,8}
• (4,3) merges {2,4,5} with {1,3} into {1,2,3,4,5}
• (4,1) falls inside a known component: the summary is unchanged
Result: two components, labelled 1 and 6.
But is this efficient?
Sure, we can distribute the edges and summaries
any systems in mind?
Gelly Stream
Graph stream processing with Apache Flink

Gelly Stream Overview
(runs on Flink’s Distributed Dataflow deployment)
Gelly (on DataSet):
➤ Static Graphs
➤ Multi-Pass Algorithms
➤ Full Computations
Gelly-Stream (on DataStream):
➤ Dynamic Graphs
➤ Single-Pass Algorithms
➤ Approximate Computations
Gelly Stream Status
➤ Properties and Metrics
➤ Transformations
➤ Aggregations
➤ Discretization
➤ Neighborhood Aggregations
➤ Graph Streaming Algorithms
  ➤ Connected Components
  ➤ Bipartiteness Check
  ➤ Window Triangle Count
  ➤ Triangle Count Estimation
  ➤ Continuous Degree Aggregate
wait, so now we can detect
connected components right away?
Solved! But how about our other issues now?
>Hej Siri_
Siri, is it possible to re-unite all data
scientists in the world? no matter if they
use Spark or Flink or just ipython
>
Gelly-Stream to the rescue

graphStream.filterVertices(DataScientists())
           .slice(Time.of(10, MINUTE), EdgeDirection.IN)
           .applyOnNeighbors(FindPairs())
wendy checked_in glaze
steve checked_in glaze
tom checked_in joe’s_grill
sandra checked_in glaze
rafa checked_in joe’s_grill
wendy
steve
sandra
glaze
tom
rafa
joe’s
grill
{wendy, steve}
{steve, sandra}
{wendy, sandra}
{tom, rafa}
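What the Gelly-Stream query above computes can be sketched in Python: slice the check-in stream into a window, then, per place, emit every pair of users who checked in there. This `find_pairs` function is a simulation of the `FindPairs` neighborhood function, not the Flink API.

```python
from itertools import combinations

def find_pairs(checkins):
    """For each place in the window, emit every pair of users who
    checked in there (a per-neighborhood pair computation)."""
    by_place = {}
    for user, place in checkins:
        by_place.setdefault(place, []).append(user)
    return [set(pair)
            for users in by_place.values()
            for pair in combinations(users, 2)]

# one window of the check-in stream from the slide
window = [("wendy", "glaze"), ("steve", "glaze"),
          ("tom", "joe's_grill"), ("sandra", "glaze"),
          ("rafa", "joe's_grill")]
pairs = find_pairs(window)
for pair in sorted(map(sorted, pairs)):
    print(pair)  # the four pairs from the slide, alphabetically
```

In the real pipeline, `slice` bounds the neighborhood state to the window, so the pair computation stays cheap no matter how long the stream runs.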
>Hej Siri_
Siri, is it possible to re-unite all data
scientists in the world?
> yes
The next step
Large-scale, Complex, Fast, Deep Analytics
• Iterative model* on streams for deeper analytics
• More summaries
• Better out-of-core state integration
• Ad-hoc graph queries
* http://dl.acm.org/citation.cfm?id=2983551
Try out Gelly-Stream*
because all questions matter
@SenorCarbone
*https://github.com/vasia/gelly-streaming

Graph Stream Processing : spinning fast, large scale, complex analytics