Graph Stream Processing
spinning fast, large-scale, complex analytics
Paris Carbone
PhD Candidate @ KTH
Committer @ Apache Flink
We want to analyse….
large-scale, complex data, fast
But why do we need
large-scale, complex and fast data analysis?
> to answer big complex questions faster
>Hej Siri_
Get me the best route to work right now
…with the fewest human drivers
Lookup a pizza recipe all of my friends like but
did not eat yesterday… or the day before yesterday
oh! And no kebab pizza!
Siri, is it possible to re-unite all data
scientists in the world? no matter if they
use Spark or Flink or just ipython
3000 AD
FIRST WORLD PROBLEM
30000 AD
FIRST EARTH WORLD PROBLEM
Still, fast analytics might save us some day…
• We can access patient movements and Facebook, Twitter,
and pretty much all social media interactions
• Can we stop a pandemic?
• Or can we quickly predict where the virus will spread?
Now how do we analyse…
large-scale, complex data, fast?
with distributed, streaming, graph data:
everything is many, everything is a graph, everything is a stream
it all started…
as a first-world-problem question
but then things escalated quickly…
…and machinery got cheaper and we
suddenly realised that we have big data
Thus, Distributed Graph Processing was born
Map Reduce (on a DFS: distributed file system)
1. Store partitioned data
2. Send local computation to the data (map)
3. Shuffle the intermediate results on disk
4. Merge the results (reduce)
5. Store the result back

Distributed Graph Processing
1. Store updates to the DFS
2. Load a graph snapshot (in memory)
3. Compute a round (~superstep)
4. Store updates
5. …repeat
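The five MapReduce steps above can be sketched in a few lines. This is a toy in-memory simulation (no DFS, no disk shuffle); `map_reduce`, `map_fn`, and `reduce_fn` are illustrative names, not any real framework's API.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy MapReduce: map each record to (key, value) pairs,
    shuffle by key, then reduce each key's group of values."""
    # steps 1-2: local computation (map) over the partitioned data
    mapped = [kv for record in records for kv in map_fn(record)]
    # step 3: shuffle, i.e. group the intermediate pairs by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # step 4: merge each group (reduce); step 5 would store the result
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# classic word count as a usage example
lines = ["big data", "big graphs", "big streams"]
counts = map_reduce(
    lines,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts["big"])  # 3
```

In a real system each of these steps is distributed: the map runs where the data lives, and the shuffle moves data between machines via disk.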
Distributed Graph processing example
• We want to compute the Connected Components
of a distributed graph.
• Basic computation element (map): vertex
• Updates: messages to other vertices
[Figure: an 8-vertex graph, where vertices 1-5 form one
component and vertices 6-8 another]

ROUND 0: every vertex sends its own ID to its neighbours.
ROUND 1: each vertex adopts the minimum ID it has received and
forwards it; labels start converging towards 1 and 6.
ROUND 2: the minimum labels keep spreading along the edges.
ROUND 3: the last vertices of each component adopt labels 1 and 6.
ROUND 4: No messages, DONE!
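The rounds above can be simulated in plain Python. This is a hedged sketch of the Pregel-style vertex-centric model, not any system's actual API; `connected_components` and the example edge list (matching the two components in the figure) are mine.

```python
def connected_components(vertices, edges):
    """Superstep-style connected components: each round, every vertex
    adopts the smallest label it has received and forwards it to its
    neighbours; the computation stops when no messages are sent."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    label = {v: v for v in vertices}  # ROUND 0: every vertex is its own label
    messages = {v: {label[u] for u in neighbours[v]} for v in vertices}
    while any(messages.values()):  # no messages means DONE
        next_messages = {v: set() for v in vertices}
        for v, received in messages.items():
            smallest = min(received | {label[v]}) if received else label[v]
            if smallest < label[v]:  # label improved: tell the neighbours
                label[v] = smallest
                for u in neighbours[v]:
                    next_messages[u].add(smallest)
        messages = next_messages
    return label

# two components: {1..5} and {6,7,8}, as in the figure
edges = [(1, 3), (1, 4), (2, 4), (2, 5), (3, 4), (4, 5),
         (6, 7), (6, 8), (7, 8)]
labels = connected_components(range(1, 9), edges)
# vertices 1-5 end with label 1, vertices 6-8 with label 6
```

Note how the number of rounds is bounded by the graph's diameter: a label can only travel one hop per superstep.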
Distributed Graph processing systems
• Examples of Load-Compute-Store systems:
Pregel, GraphX (Spark), GraphLab, PowerGraph
• Same execution strategy, same problems:
• It’s slow
• Too much re-computation ($€) for nothing
• Real-world updates, anyone?
…and streaming came
to make real-world event records fast and simple
(not to mess everything up)
The Dataflow™
• local state stays here
• local computation too
Streaming is so advanced that…
• sub-second latency and high throughput finally coexist
• it does fault tolerance without batch writes*
• late data** is handled gracefully
…but what about complex problems?
* https://arxiv.org/abs/1506.08603  ** http://dl.acm.org/citation.cfm?id=2824076
can we make it happen?
• Problem: we can’t keep an infinite graph
in memory and do complex stuff with it
>it was never about the graph, silly: it was about
answering complex questions, remember?
universe → summary → answers
Examples of Summaries
• Spanners: distance estimation
• Sparsifiers: cut estimation
• Sketches: homomorphic properties
[Figure: a graph stream feeds an algorithm that maintains a
graph summary, updated round by round (R1, R2, …)]
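A classic summary for connected components on an edge stream is a disjoint-set (union-find) structure: per-vertex state, no stored edges. The sketch below is illustrative (the class and method names are mine, not Gelly-Stream's); it processes the same edge stream used in the example that follows.

```python
class DisjointSetSummary:
    """Union-find as a graph-stream summary: answers component
    queries for an edge stream without storing the edges."""
    def __init__(self):
        self.parent = {}

    def find(self, v):
        """Return the root (component representative) of v."""
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            # path halving keeps the trees shallow
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def add_edge(self, a, b):
        """Process one streamed edge: merge the two components,
        keeping the smaller root as the component label."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            lo, hi = sorted((ra, rb))
            self.parent[hi] = lo

summary = DisjointSetSummary()
stream = [(5, 4), (7, 6), (8, 6), (4, 2), (3, 1),
          (5, 2), (4, 3), (8, 7), (4, 1)]
for edge in stream:
    summary.add_edge(*edge)
print(summary.find(5), summary.find(8))  # 1 6
```

The summary needs O(V) state regardless of how many edges stream past, which is exactly the point: the stream is unbounded, the summary is not.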
Distributed Graph streaming example
Connected Components on a stream of edges (additions)

Incoming edges: (5,4) (7,6) (8,6) (4,2) (3,1) (5,2) (4,3) (8,7) (4,1) …
Instead of the full graph we keep a summary: the components seen
so far, each labelled by its minimum vertex ID.
• (5,4), (4,2), (5,2) build component {2,4,5}
• (3,1) builds component {1,3}
• (7,6), (8,6), (8,7) build component {6,7,8}
• (4,3) merges {2,4,5} with {1,3} into {1,2,3,4,5}
• (4,1) falls inside a known component: the summary is unchanged
Result: two components, labelled 1 and 6.
But is this efficient?
Sure, we can distribute the edges and summaries
any systems in mind?
Gelly Stream
Graph stream processing with Apache Flink

Gelly Stream Overview
(runs on Flink’s Distributed Dataflow deployment)
Gelly (on DataSet):
➤ Static Graphs
➤ Multi-Pass Algorithms
➤ Full Computations
Gelly-Stream (on DataStream):
➤ Dynamic Graphs
➤ Single-Pass Algorithms
➤ Approximate Computations
Gelly Stream Status
➤ Properties and Metrics
➤ Transformations
➤ Aggregations
➤ Discretization
➤ Neighborhood Aggregations
➤ Graph Streaming Algorithms
  ➤ Connected Components
  ➤ Bipartiteness Check
  ➤ Window Triangle Count
  ➤ Triangle Count Estimation
  ➤ Continuous Degree Aggregate
wait, so now we can detect
connected components right away?
Solved! But how about our other issues now?
>Hej Siri_
Siri, is it possible to re-unite all data
scientists in the world? no matter if they
use Spark or Flink or just ipython
>
Gelly-Stream to the rescue

graphStream.filterVertices(DataScientists())
           .slice(Time.of(10, MINUTE), EdgeDirection.IN)
           .applyOnNeighbors(FindPairs())
wendy checked_in glaze
steve checked_in glaze
tom checked_in joe’s_grill
sandra checked_in glaze
rafa checked_in joe’s_grill
wendy
steve
sandra
glaze
tom
rafa
joe’s
grill
{wendy, steve}
{steve, sandra}
{wendy, sandra}
{tom, rafa}
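What the Gelly-Stream query above computes can be sketched in Python: slice the check-in stream into a window, then, per place, emit every pair of users who checked in there. This `find_pairs` function is a simulation of the `FindPairs` neighborhood function, not the Flink API.

```python
from itertools import combinations

def find_pairs(checkins):
    """For each place in the window, emit every pair of users who
    checked in there (a per-neighborhood pair computation)."""
    by_place = {}
    for user, place in checkins:
        by_place.setdefault(place, []).append(user)
    return [set(pair)
            for users in by_place.values()
            for pair in combinations(users, 2)]

# one window of the check-in stream from the slide
window = [("wendy", "glaze"), ("steve", "glaze"),
          ("tom", "joe's_grill"), ("sandra", "glaze"),
          ("rafa", "joe's_grill")]
pairs = find_pairs(window)
for pair in sorted(map(sorted, pairs)):
    print(pair)  # the four pairs from the slide, alphabetically
```

In the real pipeline, `slice` bounds the neighborhood state to the window, so the pair computation stays cheap no matter how long the stream runs.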
>Hej Siri_
Siri, is it possible to re-unite all data
scientists in the world?
> yes
The next step
Large-scale, Complex, Fast, Deep Analytics
• Iterative model* on streams for deeper analytics
• More summaries
• Better out-of-core state integration
• Ad-hoc graph queries
* http://dl.acm.org/citation.cfm?id=2983551
Try out Gelly-Stream*
because all questions matter
@SenorCarbone
*https://github.com/vasia/gelly-streaming

Graph Stream Processing : spinning fast, large scale, complex analytics