Streaming is the latest hot topic in the big data world. We want to process data immediately and continuously. Modern stream processors have matured significantly and offer exceptional features, including sub-second latencies, high throughput, fault-tolerance, and seamless integration with various data sources and sinks.
Many sources of streaming data consist of related or connected events: user interactions in a social network, web page clicks, movie ratings, product purchases. These connected events can be naturally represented as edges in an evolving graph.
In this talk I will explain how we can leverage a powerful stream processor, such as Apache Flink, and academic research of the past two decades, to build graph streaming applications. I will describe how we can model graphs as streams and how we can compute graph properties without storing and managing the graph state. I will introduce useful graph summary data structures and show how they allow us to build graph algorithms in the streaming model, such as connected components, bipartiteness detection, and distance estimation.
6. HOW WE’VE DONE GRAPH PROCESSING SO FAR
1. Load: read the graph from disk
and partition it in memory
6
7. HOW WE’VE DONE GRAPH PROCESSING SO FAR
1. Load: read the graph from disk
and partition it in memory
2. Compute: read and mutate the
graph state
7
8. HOW WE’VE DONE GRAPH PROCESSING SO FAR
1. Load: read the graph from disk
and partition it in memory
2. Compute: read and mutate the
graph state
3. Store: write the final graph
state back to disk
8
9. “If what you need
is to analyze a static graph
over and over again
then this model is great!
9
10. WHAT’S WRONG WITH THIS MODEL?
➤ It is slow
➤ wait until the computation is over before you see any result
➤ pre-processing and partitioning
➤ It is expensive
➤ lots of memory and CPU required in order to scale
➤ It requires re-computation for graph changes
➤ no efficient way to deal with updates
10
11. ➤ Maintain the dynamic graph
structure
➤ Provide up-to-date results with
low latency
➤ Compute on fresh state only
11
GRAPH STREAMING CHALLENGES
13. ACADEMIA TO THE RESCUE
➤ Graph streaming in the 90s-00s
➤ input fits in secondary storage
➤ limited memory
➤ few passes over the input data
➤ compact graph representations and summaries
13
30. THE BAD NEWS
➤A slightly different motivation
➤ finite graph stored in disk vs. unbounded graph arriving in real-time
➤ some algorithms assume we know |V|, |E|
➤ most algorithms designed for single-node execution
30
31. THE GOOD NEWS
➤A quite different reality
➤ memory is getting bigger
➤ … and cheaper
➤ we know how to design distributed algorithms
31
35. STREAM CONNECTED COMPONENTS WITH FLINK
DataStream<DisjointSet> cc =
edgeStream
.keyBy(0)
.timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
.fold(new DisjointSet(), new UpdateCC())
.flatMap(new Merger())
.setParallelism(1);
35
36. STREAM CONNECTED COMPONENTS WITH FLINK
DataStream<DisjointSet> cc =
edgeStream
.keyBy(0)
.timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
.fold(new DisjointSet(), new UpdateCC())
.flatMap(new Merger())
.setParallelism(1);
36
Partition the edge stream
37. STREAM CONNECTED COMPONENTS WITH FLINK
DataStream<DisjointSet> cc =
edgeStream
.keyBy(0)
.timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
.fold(new DisjointSet(), new UpdateCC())
.flatMap(new Merger())
.setParallelism(1);
37
Define the merging frequency
38. STREAM CONNECTED COMPONENTS WITH FLINK
DataStream<DisjointSet> cc =
edgeStream
.keyBy(0)
.timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
.fold(new DisjointSet(), new UpdateCC())
.flatMap(new Merger())
.setParallelism(1);
38
merge locally
39. STREAM CONNECTED COMPONENTS WITH FLINK
DataStream<DisjointSet> cc =
edgeStream
.keyBy(0)
.timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
.fold(new DisjointSet(), new UpdateCC())
.flatMap(new Merger())
.setParallelism(1);
39
merge globally
41. FEELING GELLY?
➤ Gelly-Stream Repository
github.com/vasia/gelly-streaming
➤ A list of graph streaming papers
citeulike.org/user/vasiakalavri/tag/graph-streaming
➤ A related talk at FOSDEM’16
slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink
41