7. General-purpose or specialized?
- fast application
development and
deployment
- easier maintenance
- non-intuitive APIs
- time-consuming
- use, configure and
integrate different
systems
- hard to maintain
- rich APIs and features
general-purpose specialized
what about performance?
7
8. Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically
Caching Loop-invariant DataPushing work
“out of the loop”
Maintain state as index
8
9. Flink Iteration Operators
Iterate IterateDelta
9
Input
Iterative
Update Function
Result
Replace
Workset
Iterative
Update Function
Result
Solution Set
State
12. Beyond Iterations
12
Performance & Scalability
● own memory management
● own serialization framework
● operate on binary data when possible
Automatic Optimizations
● choose best execution strategy
● cache invariant data
14. ● Java & Scala Graph APIs on top of Flink
● Ships already with Flink 0.9 (Beta)
● Can be seamlessly mixed with the DataSet
Flink API
→ easily implement applications that use
both record-based and graph-based analysis
Meet Gelly
14
15. Gelly in the Flink stack
15
Scala API
(batch and streaming)
Java API
(batch and streaming)
ML library Gelly
Runtime and Execution Environments
Data storage
Table API Python API
Transformations
and Utilities
Iterative Graph
Processing
Graph Library
16. Hello, Gelly!
// create a graph from a Dataset of Vertices and a DataSet of Edges
Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env);
16
// create a graph from a Collection of Edges and initialize the Vertex values
Graph<String, Long, Double> graph = Graph.fromCollection(edges,
new MapFunction<String, Long>() {
public Long map(String vertexId) { return 1l; }
}, env);
vertex ID
type
vertex
value type
edge value
type
edges is a Java
collection
assigned vertex
value
18. Example: mapVertices
18
// increment each vertex value by one
Graph<Long, Long, Long> updatedGraph = graph.mapVertices(
new MapFunction<Vertex<Long, Long>, Long>() {
public Long map(Vertex<Long, Long> value) {
return value.getValue() + 1;
}
});
4
2
8
5
5
3
1
7
4
5
19. Example: subGraph
19
Graph<Long, Long, Long> subgraph = graph.subgraph(
new FilterFunction<Vertex<Long, Long>>() {
public boolean filter(Vertex<Long, Long> vertex) {
// keep only vertices with positive values
return (vertex.getValue() > 0);
}
},
new FilterFunction<Edge<Long, Long>>() {
public boolean filter(Edge<Long, Long> edge) {
// keep only edges with negative values
return (edge.getValue() < 0);
}
})
20. - Apply a reduce function to the 1st-hop
neighborhood of each vertex in parallel
Neighborhood Methods
3
4
7
4
4
graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT);
3
9
7
4
5
20
21. Iterative Graph Processing
Gelly offers iterative graph processing
abstractions on top of Flink’s Delta iterations
● vertex-centric (similar to Pregel, Giraph)
● gather-sum-apply (similar to PowerGraph)
21
22. MessagingFunction: what messages
to send to other vertices
VertexUpdateFunction: how to
update a vertex value, based on the
received messages
The workset contains only active
vertices (received a msg in the
previous superstep)
Vertex-centric Iterations
22
24. Gather: how to transform each of
the edges and neighbors of each
vertex
Sum: how to aggregate the partial
values of Gather to a single value
Apply: how to update the vertex
value, based on the new aggregate
and the current value
The workset contains the active
vertices (user-defined activation)
Gather-Sum-Apply Iterations
24
26. Iteration Configuration
● Parallelism
● Aggregators
● Broadcast Variables
● Messaging Direction (IN, OUT, ALL)
● Access vertex degrees and number of
vertices
26
27. Vertex-Centric or GSA?
27
vertex-centric GSA
when message creation is independent &
expensive (parallelized in Gather)
when the graph is skewed (~ 1.5x faster)
if update is associative-commutative
when all messages are needed for
the update
when a vertex needs to send
messages to non-neighbors
28. ● PageRank*
● Single Source Shortest Paths*
● Label Propagation
● Weakly Connected Components*
● Community Detection
Upcoming: Triangle Count, HITS, Affinity Propagation,
Graph Summarization
graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor));
*: both vertex-centric and GSA implementations
Library of Algorithms
28
29. What’s next, Gelly?
● Integration with the Flink Streaming API
● Specialized Operators for Skewed Graphs
● Partitioning Methods
● More graph processing models: neighborhood-centric,
partition-centric
Check our Roadmap!
https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly
29
30. Feeling Gelly? Try it!
● Grab the Flink master: https://github.com/apache/flink
● Read the Gelly programming guide: https://ci.apache.
org/projects/flink/flink-docs-master/libs/gelly_guide.html
● Follow the Gelly School tutorials (Beta): http://gellyschool.com
● Read our blog post: https://flink.apache.
org/news/2015/08/24/introducing-flink-gelly.html
● Come to Flink Forward (October 12-13, Berlin): http://flink-
forward.org/
30