Vasia Kalavri
Apache Flink PMC, PhD student @KTH
@vkalavri - vasia@apache.org
August 27, 2015
Gelly:
Large-Scale Graph Processing
with Apache Flink
Overview
- Why Graph Processing with Flink?
- Gelly: Flink’s Graph Processing API
- Iterative Graph Processing in Gelly
- What’s next, Gelly?
2
Why Graph Processing
with Flink?
4
A data analysis pipeline
load clean
create
graph
analyze
graph
clean transformload result
clean transformload
5
A more user-friendly pipeline
load
load result
load
6
General-purpose or specialized?
- fast application
development and
deployment
- easier maintenance
- non-intuitive APIs
- time-consuming
- use, configure and
integrate different
systems
- hard to maintain
- rich APIs and features
general-purpose specialized
what about performance?
7
Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically
Caching Loop-invariant DataPushing work
“out of the loop”
Maintain state as index
8
Flink Iteration Operators
Iterate IterateDelta
9
Input
Iterative
Update Function
Result
Replace
Workset
Iterative
Update Function
Result
Solution Set
State
Many iterative algorithms
exhibit sparse
computational
dependencies and
asymmetric convergence
Delta Iterations
10
Example: Connected Components
11
Only vertices that change
their value are in the
workset for the next
iteration.
Beyond Iterations
12
Performance & Scalability
● own memory management
● own serialization framework
● operate on binary data when possible
Automatic Optimizations
● choose best execution strategy
● cache invariant data
Gelly
the Flink Graph API
● Java & Scala Graph APIs on top of Flink
● Ships already with Flink 0.9 (Beta)
● Can be seamlessly mixed with the DataSet
Flink API
→ easily implement applications that use
both record-based and graph-based analysis
Meet Gelly
14
Gelly in the Flink stack
15
Scala API
(batch and streaming)
Java API
(batch and streaming)
ML library Gelly
Runtime and Execution Environments
Data storage
Table API Python API
Transformations
and Utilities
Iterative Graph
Processing
Graph Library
Hello, Gelly!
// create a graph from a Dataset of Vertices and a DataSet of Edges
Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env);
16
// create a graph from a Collection of Edges and initialize the Vertex values
Graph<String, Long, Double> graph = Graph.fromCollection(edges,
new MapFunction<String, Long>() {
public Long map(String vertexId) { return 1l; }
}, env);
vertex ID
type
vertex
value type
edge value
type
edges is a Java
collection
assigned vertex
value
● Graph Properties
○ getVertexIds
○ getEdgeIds
○ numberOfVertices
○ numberOfEdges
○ getDegrees
○ isWeaklyConnected
○ ...
Available Methods
● Transformations
○ map, filter, join
○ subgraph, union,
difference
○ reverse, undirected
○ getTriplets
● Mutations
○ add vertex/edge
○ remove vertex/edge
17
Example: mapVertices
18
// increment each vertex value by one
Graph<Long, Long, Long> updatedGraph = graph.mapVertices(
new MapFunction<Vertex<Long, Long>, Long>() {
public Long map(Vertex<Long, Long> value) {
return value.getValue() + 1;
}
});
4
2
8
5
5
3
1
7
4
5
Example: subGraph
19
Graph<Long, Long, Long> subgraph = graph.subgraph(
new FilterFunction<Vertex<Long, Long>>() {
public boolean filter(Vertex<Long, Long> vertex) {
// keep only vertices with positive values
return (vertex.getValue() > 0);
}
},
new FilterFunction<Edge<Long, Long>>() {
public boolean filter(Edge<Long, Long> edge) {
// keep only edges with negative values
return (edge.getValue() < 0);
}
})
- Apply a reduce function to the 1st-hop
neighborhood of each vertex in parallel
Neighborhood Methods
3
4
7
4
4
graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT);
3
9
7
4
5
20
Iterative Graph Processing
Gelly offers iterative graph processing
abstractions on top of Flink’s Delta iterations
● vertex-centric (similar to Pregel, Giraph)
● gather-sum-apply (similar to PowerGraph)
21
MessagingFunction: what messages
to send to other vertices
VertexUpdateFunction: how to
update a vertex value, based on the
received messages
The workset contains only active
vertices (received a msg in the
previous superstep)
Vertex-centric Iterations
22
Vertex-centric SSSP
shortestPaths = graph.runVertexCentricIteration(
new DistanceUpdater(), new DistanceMessenger()).getVertices();
DistanceUpdater: VertexUpdateFunction DistanceMessenger: MessagingFunction
23
sendMessages(K key, Double newDist) {
for (Edge edge : getOutgoingEdges()) {
sendMessageTo(edge.getTarget(),
newDist + edge.getValue());
}
updateVertex(Vertex<K, Double> vertex,
MessageIterator msgs) {
Double minDist = Double.MAX_VALUE;
for (double msg : msgs) {
if (msg < minDist)
minDist = msg;
}
if (vertex.getValue() > minDist)
setNewVertexValue(minDist);
}
Gather: how to transform each of
the edges and neighbors of each
vertex
Sum: how to aggregate the partial
values of Gather to a single value
Apply: how to update the vertex
value, based on the new aggregate
and the current value
The workset contains the active
vertices (user-defined activation)
Gather-Sum-Apply Iterations
24
Gather-Sum-Apply SSSP
shortestPaths = graph.runGatherSumApplyIteration(new CalculateDistances(),
new ChooseMinDistance(), new UpdateDistance()).getVertices();
CalculateDistances: Gather
25
gather(Neighbor<Double, Double> neighbor) {
return neighbor.getNeighborValue() + neighbor.getEdgeValue();
}
ChooseMinDistance: Sum
ChooseMinDistance: Apply
sum(Double newValue, Double currentValue) {
return Math.min(newValue, currentValue);
}
apply(Double newDistance, Double oldDistance) {
if (newDistance < oldDistance) { setResult(newDistance); }
}
Iteration Configuration
● Parallelism
● Aggregators
● Broadcast Variables
● Messaging Direction (IN, OUT, ALL)
● Access vertex degrees and number of
vertices
26
Vertex-Centric or GSA?
27
vertex-centric GSA
when message creation is independent &
expensive (parallelized in Gather)
when the graph is skewed (~ 1.5x faster)
if update is associative-commutative
when all messages are needed for
the update
when a vertex needs to send
messages to non-neighbors
● PageRank*
● Single Source Shortest Paths*
● Label Propagation
● Weakly Connected Components*
● Community Detection
Upcoming: Triangle Count, HITS, Affinity Propagation,
Graph Summarization
graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor));
*: both vertex-centric and GSA implementations
Library of Algorithms
28
What’s next, Gelly?
● Integration with the Flink Streaming API
● Specialized Operators for Skewed Graphs
● Partitioning Methods
● More graph processing models: neighborhood-centric,
partition-centric
Check our Roadmap!
https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly
29
Feeling Gelly? Try it!
● Grab the Flink master: https://github.com/apache/flink
● Read the Gelly programming guide: https://ci.apache.
org/projects/flink/flink-docs-master/libs/gelly_guide.html
● Follow the Gelly School tutorials (Beta): http://gellyschool.com
● Read our blog post: https://flink.apache.
org/news/2015/08/24/introducing-flink-gelly.html
● Come to Flink Forward (October 12-13, Berlin): http://flink-
forward.org/
30

Gelly in Apache Flink Bay Area Meetup

  • 1.
    Vasia Kalavri Apache FlinkPMC, PhD student @KTH @vkalavri - vasia@apache.org August 27, 2015 Gelly: Large-Scale Graph Processing with Apache Flink
  • 2.
    Overview - Why GraphProcessing with Flink? - Gelly: Flink’s Graph Processing API - Iterative Graph Processing in Gelly - What’s next, Gelly? 2
  • 3.
  • 4.
  • 5.
    A data analysispipeline load clean create graph analyze graph clean transformload result clean transformload 5
  • 6.
    A more user-friendlypipeline load load result load 6
  • 7.
    General-purpose or specialized? -fast application development and deployment - easier maintenance - non-intuitive APIs - time-consuming - use, configure and integrate different systems - hard to maintain - rich APIs and features general-purpose specialized what about performance? 7
  • 8.
    Native Iterations ● theruntime is aware of the iterative execution ● no scheduling overhead between iterations ● caching and state maintenance are handled automatically Caching Loop-invariant DataPushing work “out of the loop” Maintain state as index 8
  • 9.
    Flink Iteration Operators IterateIterateDelta 9 Input Iterative Update Function Result Replace Workset Iterative Update Function Result Solution Set State
  • 10.
    Many iterative algorithms exhibitsparse computational dependencies and asymmetric convergence Delta Iterations 10
  • 11.
    Example: Connected Components 11 Onlyvertices that change their value are in the workset for the next iteration.
  • 12.
    Beyond Iterations 12 Performance &Scalability ● own memory management ● own serialization framework ● operate on binary data when possible Automatic Optimizations ● choose best execution strategy ● cache invariant data
  • 13.
  • 14.
    ● Java &Scala Graph APIs on top of Flink ● Ships already with Flink 0.9 (Beta) ● Can be seamlessly mixed with the DataSet Flink API → easily implement applications that use both record-based and graph-based analysis Meet Gelly 14
  • 15.
    Gelly in theFlink stack 15 Scala API (batch and streaming) Java API (batch and streaming) ML library Gelly Runtime and Execution Environments Data storage Table API Python API Transformations and Utilities Iterative Graph Processing Graph Library
  • 16.
    Hello, Gelly! // createa graph from a Dataset of Vertices and a DataSet of Edges Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env); 16 // create a graph from a Collection of Edges and initialize the Vertex values Graph<String, Long, Double> graph = Graph.fromCollection(edges, new MapFunction<String, Long>() { public Long map(String vertexId) { return 1l; } }, env); vertex ID type vertex value type edge value type edges is a Java collection assigned vertex value
  • 17.
    ● Graph Properties ○getVertexIds ○ getEdgeIds ○ numberOfVertices ○ numberOfEdges ○ getDegrees ○ isWeaklyConnected ○ ... Available Methods ● Transformations ○ map, filter, join ○ subgraph, union, difference ○ reverse, undirected ○ getTriplets ● Mutations ○ add vertex/edge ○ remove vertex/edge 17
  • 18.
    Example: mapVertices 18 // incrementeach vertex value by one Graph<Long, Long, Long> updatedGraph = graph.mapVertices( new MapFunction<Vertex<Long, Long>, Long>() { public Long map(Vertex<Long, Long> value) { return value.getValue() + 1; } }); 4 2 8 5 5 3 1 7 4 5
  • 19.
    Example: subGraph 19 Graph<Long, Long,Long> subgraph = graph.subgraph( new FilterFunction<Vertex<Long, Long>>() { public boolean filter(Vertex<Long, Long> vertex) { // keep only vertices with positive values return (vertex.getValue() > 0); } }, new FilterFunction<Edge<Long, Long>>() { public boolean filter(Edge<Long, Long> edge) { // keep only edges with negative values return (edge.getValue() < 0); } })
  • 20.
    - Apply areduce function to the 1st-hop neighborhood of each vertex in parallel Neighborhood Methods 3 4 7 4 4 graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT); 3 9 7 4 5 20
  • 21.
    Iterative Graph Processing Gellyoffers iterative graph processing abstractions on top of Flink’s Delta iterations ● vertex-centric (similar to Pregel, Giraph) ● gather-sum-apply (similar to PowerGraph) 21
  • 22.
    MessagingFunction: what messages tosend to other vertices VertexUpdateFunction: how to update a vertex value, based on the received messages The workset contains only active vertices (received a msg in the previous superstep) Vertex-centric Iterations 22
  • 23.
    Vertex-centric SSSP shortestPaths =graph.runVertexCentricIteration( new DistanceUpdater(), new DistanceMessenger()).getVertices(); DistanceUpdater: VertexUpdateFunction DistanceMessenger: MessagingFunction 23 sendMessages(K key, Double newDist) { for (Edge edge : getOutgoingEdges()) { sendMessageTo(edge.getTarget(), newDist + edge.getValue()); } updateVertex(Vertex<K, Double> vertex, MessageIterator msgs) { Double minDist = Double.MAX_VALUE; for (double msg : msgs) { if (msg < minDist) minDist = msg; } if (vertex.getValue() > minDist) setNewVertexValue(minDist); }
  • 24.
    Gather: how totransform each of the edges and neighbors of each vertex Sum: how to aggregate the partial values of Gather to a single value Apply: how to update the vertex value, based on the new aggregate and the current value The workset contains the active vertices (user-defined activation) Gather-Sum-Apply Iterations 24
  • 25.
    Gather-Sum-Apply SSSP shortestPaths =graph.runGatherSumApplyIteration(new CalculateDistances(), new ChooseMinDistance(), new UpdateDistance()).getVertices(); CalculateDistances: Gather 25 gather(Neighbor<Double, Double> neighbor) { return neighbor.getNeighborValue() + neighbor.getEdgeValue(); } ChooseMinDistance: Sum ChooseMinDistance: Apply sum(Double newValue, Double currentValue) { return Math.min(newValue, currentValue); } apply(Double newDistance, Double oldDistance) { if (newDistance < oldDistance) { setResult(newDistance); } }
  • 26.
    Iteration Configuration ● Parallelism ●Aggregators ● Broadcast Variables ● Messaging Direction (IN, OUT, ALL) ● Access vertex degrees and number of vertices 26
  • 27.
    Vertex-Centric or GSA? 27 vertex-centricGSA when message creation is independent & expensive (parallelized in Gather) when the graph is skewed (~ 1.5x faster) if update is associative-commutative when all messages are needed for the update when a vertex needs to send messages to non-neighbors
  • 28.
    ● PageRank* ● SingleSource Shortest Paths* ● Label Propagation ● Weakly Connected Components* ● Community Detection Upcoming: Triangle Count, HITS, Affinity Propagation, Graph Summarization graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor)); *: both vertex-centric and GSA implementations Library of Algorithms 28
  • 29.
    What’s next, Gelly? ●Integration with the Flink Streaming API ● Specialized Operators for Skewed Graphs ● Partitioning Methods ● More graph processing models: neighborhood-centric, partition-centric Check our Roadmap! https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly 29
  • 30.
    Feeling Gelly? Tryit! ● Grab the Flink master: https://github.com/apache/flink ● Read the Gelly programming guide: https://ci.apache. org/projects/flink/flink-docs-master/libs/gelly_guide.html ● Follow the Gelly School tutorials (Beta): http://gellyschool.com ● Read our blog post: https://flink.apache. org/news/2015/08/24/introducing-flink-gelly.html ● Come to Flink Forward (October 12-13, Berlin): http://flink- forward.org/ 30