GRAPHS AS STREAMS
RETHINKING GRAPH PROCESSING IN THE STREAMING ERA
Vasia Kalavri
vasia@apache.org
@vkalavri
MODERN STREAMING TECHNOLOGY
➤ sub-second latencies
➤ high throughput
➤ dynamic topologies
➤ powerful semantics
➤ ecosystem integration
MORE THAN COUNTING WORDS
➤ Complex Event Processing
➤ Online Machine Learning
➤ Streaming SQL
WHAT ABOUT GRAPH PROCESSING?
HOW WE’VE DONE GRAPH PROCESSING SO FAR
1. Load: read the graph from disk and partition it in memory
2. Compute: read and mutate the graph state
3. Store: write the final graph state back to disk
“If what you need is to analyze a static graph over and over again, then this model is great!”
WHAT’S WRONG WITH THIS MODEL?
➤ It is slow
  ➤ wait until the computation is over before you see any result
  ➤ pre-processing and partitioning
➤ It is expensive
  ➤ lots of memory and CPU required in order to scale
➤ It requires re-computation for graph changes
  ➤ no efficient way to deal with updates
GRAPH STREAMING CHALLENGES
➤ Maintain the dynamic graph structure
➤ Provide up-to-date results with low latency
➤ Compute on fresh state only
ACADEMIA TO THE RESCUE
➤ Graph streaming in the 90s-00s
  ➤ input fits in secondary storage
  ➤ limited memory
  ➤ few passes over the input data
  ➤ compact graph representations and summaries
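GRAPH SUMMARIES
➤ spanners
  ➤ connectivity, distance
➤ sparsifiers
  ➤ cut estimation
➤ neighborhood sketches
[Figure: an algorithm run on the graph produces result R1; an approximate algorithm (~algorithm) run on the graph summary produces result R2]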
BATCH CONNECTED COMPONENTS
[Figure, animated across slides: an example graph with vertices 1 to 8, with vertex labels shown at iterations i=0, i=1 and i=2 of the batch computation]
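The load, compute, store model applied to connected components can be sketched as follows. This is a minimal single-machine illustration, assuming minimum-label propagation as the batch algorithm; the class name, method name and edge format are hypothetical and not taken from Gelly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchConnectedComponents {

    /** Returns vertex ID -> component ID (the smallest vertex ID in the component). */
    static Map<Integer, Integer> run(int[][] edges) {
        // 1. Load: the whole graph must be in memory before anything is computed.
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        for (int[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }

        // Every vertex starts out as its own component.
        Map<Integer, Integer> label = new HashMap<>();
        for (Integer v : neighbors.keySet()) {
            label.put(v, v);
        }

        // 2. Compute: repeat full passes over the graph until no label changes.
        boolean changed = true;
        while (changed) {
            changed = false;
            Map<Integer, Integer> next = new HashMap<>(label);
            for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
                int v = entry.getKey();
                int min = label.get(v);
                for (int u : entry.getValue()) {
                    min = Math.min(min, label.get(u));
                }
                if (min < label.get(v)) {
                    next.put(v, min);
                    changed = true;
                }
            }
            label = next;
        }

        // 3. Store: the final state is only available here, after the last iteration.
        return label;
    }
}
```

No result is visible until the loop terminates, and a single new edge means running the whole method again.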
STREAM CONNECTED COMPONENTS
Graph Summary: Disjoint Set (Union-Find)
➤ Only store component IDs and vertex IDs
[Figure, animated across slides: the edge stream 54 76 86 42 43 87 41 31 52 is consumed one edge at a time while the disjoint set is updated (each entry reads as a vertex pair, so 31 is the edge 3-1). Intermediate state: Cid = 1 {1, 3}, Cid = 2 {2, 5, 4}, Cid = 6 {6, 7, 8}. Final state: Cid = 1 {1, 3, 2, 5, 4}, Cid = 6 {6, 7, 8}.]
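A disjoint-set summary of this kind can be sketched in a few lines. This is an illustration under assumptions, not the DisjointSet class from Gelly-Stream: the method names are hypothetical, and using the smallest vertex ID as the component ID is a choice made here so that results are deterministic.

```java
import java.util.HashMap;
import java.util.Map;

class DisjointSet {
    // Only vertex IDs and parent pointers are stored; no edges are kept.
    // A vertex whose parent is itself is the representative (component ID).
    private final Map<Integer, Integer> parent = new HashMap<>();

    /** Returns the component ID of v, registering v as a singleton if it is new. */
    int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path directly at the root.
        while (parent.get(v) != root) {
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** Consumes one edge (u, v) by merging the components of its endpoints. */
    void union(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru == rv) {
            return; // both endpoints are already in the same component
        }
        // Assumption of this sketch: the smaller ID becomes the component ID.
        if (ru < rv) {
            parent.put(rv, ru);
        } else {
            parent.put(ru, rv);
        }
    }

    /** Number of vertices seen so far. */
    int size() {
        return parent.size();
    }
}
```

Feeding it the edges (3,1), (5,2), (5,4), (7,6), (8,6), (4,2), (4,3), (8,7), (4,1) in that order, each with a single call to union, leaves two components with IDs 1 and 6. Each edge is read once and then discarded.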
DISTRIBUTED STREAM CONNECTED COMPONENTS
THE BAD NEWS
➤ A slightly different motivation
  ➤ finite graph stored on disk vs. unbounded graph arriving in real time
  ➤ some algorithms assume we know |V|, |E|
  ➤ most algorithms designed for single-node execution
THE GOOD NEWS
➤ A quite different reality
  ➤ memory is getting bigger
  ➤ … and cheaper
  ➤ we know how to design distributed algorithms
GELLY-STREAM
SINGLE-PASS STREAM GRAPH PROCESSING WITH APACHE FLINK
GELLY ON STREAMS
[Figure: the Flink stack. Gelly sits on the DataSet API and Gelly-Stream on the DataStream API; both run on the Distributed Dataflow layer, above Deployment.]
Gelly
➤ Static Graphs
➤ Multi-Pass Algorithms
➤ Full Computations
Gelly-Stream
➤ Dynamic Graphs
➤ Single-Pass Algorithms
➤ Approximate Computations
DISTRIBUTED STREAM CONNECTED COMPONENTS
STREAM CONNECTED COMPONENTS WITH FLINK
DataStream<DisjointSet> cc =
    edgeStream
        // Partition the edge stream
        .keyBy(0)
        // Define the merging frequency
        .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
        // Merge locally
        .fold(new DisjointSet(), new UpdateCC())
        // Merge globally
        .flatMap(new Merger())
        .setParallelism(1);
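The four steps (partition, merging frequency, merge locally, merge globally) can be simulated without Flink. The sketch below is plain Java and uses no Flink APIs. It assumes one summary per partition, which simplifies what a keyed window would do, and every class and method name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedConnectedComponents {

    /** Minimal union-find summary; the smallest vertex ID represents each component. */
    static class Summary {
        final Map<Integer, Integer> parent = new HashMap<>();

        int find(int v) {
            parent.putIfAbsent(v, v);
            while (parent.get(v) != v) {
                v = parent.get(v);
            }
            return v;
        }

        void union(int u, int v) {
            int ru = find(u);
            int rv = find(v);
            if (ru != rv) {
                parent.put(Math.max(ru, rv), Math.min(ru, rv));
            }
        }

        /** Folds every (vertex, parent) pair of another summary into this one. */
        void merge(Summary other) {
            for (Map.Entry<Integer, Integer> e : other.parent.entrySet()) {
                union(e.getKey(), e.getValue());
            }
        }
    }

    /**
     * Processes the edges of one window. How often this is called plays the
     * role of the merging frequency (the 100 ms time window on the slide).
     */
    static void processWindow(List<int[]> window, int parallelism, Summary global) {
        // Partition the edge stream: one local summary per partition.
        List<Summary> local = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            local.add(new Summary());
        }
        // Merge locally: each partition folds only the edges routed to it.
        for (int[] e : window) {
            int partition = Math.floorMod(Integer.hashCode(e[0]), parallelism);
            local.get(partition).union(e[0], e[1]);
        }
        // Merge globally: a single task (parallelism 1) combines the partial summaries.
        for (Summary s : local) {
            global.merge(s);
        }
    }
}
```

The global summary is kept between calls, so each window only ships small partial summaries to the single merging task rather than the edges themselves.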
GELLY-STREAM STATUS
➤ Properties and Metrics
➤ Transformations
➤ Aggregations
➤ Discretization
➤ Neighborhood Aggregations
➤ Graph Streaming Algorithms
  ➤ Connected Components
  ➤ Bipartiteness Check
  ➤ Window Triangle Count
  ➤ Triangle Count Estimation
  ➤ Continuous Degree Aggregate
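As an illustration of the simplest item on this list, here is one plausible reading of a continuous degree aggregate: keep a running degree per vertex and emit the updated value each time an edge arrives. This is a plain-Java sketch with hypothetical names, not the Gelly-Stream API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContinuousDegrees {
    // Running state: vertex ID -> number of edge endpoints seen so far.
    private final Map<Integer, Integer> degree = new HashMap<>();

    /** Consumes one edge and returns the updated {vertex, degree} pairs for its endpoints. */
    List<int[]> addEdge(int src, int dst) {
        List<int[]> updates = new ArrayList<>();
        updates.add(new int[] {src, degree.merge(src, 1, Integer::sum)});
        updates.add(new int[] {dst, degree.merge(dst, 1, Integer::sum)});
        return updates;
    }
}
```

Each call produces a result immediately, so downstream consumers see fresh degrees without waiting for the stream to end.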
FEELING GELLY?
➤ Gelly-Stream Repository
github.com/vasia/gelly-streaming
➤ A list of graph streaming papers
citeulike.org/user/vasiakalavri/tag/graph-streaming
➤ A related talk at FOSDEM’16
slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink