Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- Baymeetup-FlinkResearch by Foo Sounds 5421 views
- Gelly-Stream: Single-Pass Graph Str... by Vasia Kalavri 2471 views
- Stateful Distributed Stream Processing by Gyula Fóra 8677 views
- Bay Area Apache Flink Meetup Commun... by Henry Saputra 6705 views
- Batch and Stream Graph Processing w... by Vasia Kalavri 7617 views
- Designing and Testing Accumulo Iter... by Josh Elser 5359 views

13,845 views

Published on

Gelly presentation at the Apache Flink Bay Area Meetup, August 27 2015, @MapR

Published in:
Data & Analytics

No Downloads

Total views

13,845

On SlideShare

0

From Embeds

0

Number of Embeds

362

Shares

0

Downloads

82

Comments

0

Likes

8

No embeds

No notes for slide

- 1. Vasia Kalavri Apache Flink PMC, PhD student @KTH @vkalavri - vasia@apache.org August 27, 2015 Gelly: Large-Scale Graph Processing with Apache Flink
- 2. Overview - Why Graph Processing with Flink? - Gelly: Flink’s Graph Processing API - Iterative Graph Processing in Gelly - What’s next, Gelly? 2
- 3. Why Graph Processing with Flink?
- 4. 4
- 5. A data analysis pipeline load clean create graph analyze graph clean transformload result clean transformload 5
- 6. A more user-friendly pipeline load load result load 6
- 7. General-purpose or specialized? - fast application development and deployment - easier maintenance - non-intuitive APIs - time-consuming - use, configure and integrate different systems - hard to maintain - rich APIs and features general-purpose specialized what about performance? 7
- 8. Native Iterations ● the runtime is aware of the iterative execution ● no scheduling overhead between iterations ● caching and state maintenance are handled automatically Caching Loop-invariant DataPushing work “out of the loop” Maintain state as index 8
- 9. Flink Iteration Operators Iterate IterateDelta 9 Input Iterative Update Function Result Replace Workset Iterative Update Function Result Solution Set State
- 10. Many iterative algorithms exhibit sparse computational dependencies and asymmetric convergence Delta Iterations 10
- 11. Example: Connected Components 11 Only vertices that change their value are in the workset for the next iteration.
- 12. Beyond Iterations 12 Performance & Scalability ● own memory management ● own serialization framework ● operate on binary data when possible Automatic Optimizations ● choose best execution strategy ● cache invariant data
- 13. Gelly the Flink Graph API
- 14. ● Java & Scala Graph APIs on top of Flink ● Ships already with Flink 0.9 (Beta) ● Can be seamlessly mixed with the DataSet Flink API → easily implement applications that use both record-based and graph-based analysis Meet Gelly 14
- 15. Gelly in the Flink stack 15 Scala API (batch and streaming) Java API (batch and streaming) ML library Gelly Runtime and Execution Environments Data storage Table API Python API Transformations and Utilities Iterative Graph Processing Graph Library
- 16. Hello, Gelly! // create a graph from a Dataset of Vertices and a DataSet of Edges Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env); 16 // create a graph from a Collection of Edges and initialize the Vertex values Graph<String, Long, Double> graph = Graph.fromCollection(edges, new MapFunction<String, Long>() { public Long map(String vertexId) { return 1l; } }, env); vertex ID type vertex value type edge value type edges is a Java collection assigned vertex value
- 17. ● Graph Properties ○ getVertexIds ○ getEdgeIds ○ numberOfVertices ○ numberOfEdges ○ getDegrees ○ isWeaklyConnected ○ ... Available Methods ● Transformations ○ map, filter, join ○ subgraph, union, difference ○ reverse, undirected ○ getTriplets ● Mutations ○ add vertex/edge ○ remove vertex/edge 17
- 18. Example: mapVertices 18 // increment each vertex value by one Graph<Long, Long, Long> updatedGraph = graph.mapVertices( new MapFunction<Vertex<Long, Long>, Long>() { public Long map(Vertex<Long, Long> value) { return value.getValue() + 1; } }); 4 2 8 5 5 3 1 7 4 5
- 19. Example: subGraph 19 Graph<Long, Long, Long> subgraph = graph.subgraph( new FilterFunction<Vertex<Long, Long>>() { public boolean filter(Vertex<Long, Long> vertex) { // keep only vertices with positive values return (vertex.getValue() > 0); } }, new FilterFunction<Edge<Long, Long>>() { public boolean filter(Edge<Long, Long> edge) { // keep only edges with negative values return (edge.getValue() < 0); } })
- 20. - Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel Neighborhood Methods 3 4 7 4 4 graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT); 3 9 7 4 5 20
- 21. Iterative Graph Processing Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations ● vertex-centric (similar to Pregel, Giraph) ● gather-sum-apply (similar to PowerGraph) 21
- 22. MessagingFunction: what messages to send to other vertices VertexUpdateFunction: how to update a vertex value, based on the received messages The workset contains only active vertices (received a msg in the previous superstep) Vertex-centric Iterations 22
- 23. Vertex-centric SSSP shortestPaths = graph.runVertexCentricIteration( new DistanceUpdater(), new DistanceMessenger()).getVertices(); DistanceUpdater: VertexUpdateFunction DistanceMessenger: MessagingFunction 23 sendMessages(K key, Double newDist) { for (Edge edge : getOutgoingEdges()) { sendMessageTo(edge.getTarget(), newDist + edge.getValue()); } updateVertex(Vertex<K, Double> vertex, MessageIterator msgs) { Double minDist = Double.MAX_VALUE; for (double msg : msgs) { if (msg < minDist) minDist = msg; } if (vertex.getValue() > minDist) setNewVertexValue(minDist); }
- 24. Gather: how to transform each of the edges and neighbors of each vertex Sum: how to aggregate the partial values of Gather to a single value Apply: how to update the vertex value, based on the new aggregate and the current value The workset contains the active vertices (user-defined activation) Gather-Sum-Apply Iterations 24
- 25. Gather-Sum-Apply SSSP shortestPaths = graph.runGatherSumApplyIteration(new CalculateDistances(), new ChooseMinDistance(), new UpdateDistance()).getVertices(); CalculateDistances: Gather 25 gather(Neighbor<Double, Double> neighbor) { return neighbor.getNeighborValue() + neighbor.getEdgeValue(); } ChooseMinDistance: Sum ChooseMinDistance: Apply sum(Double newValue, Double currentValue) { return Math.min(newValue, currentValue); } apply(Double newDistance, Double oldDistance) { if (newDistance < oldDistance) { setResult(newDistance); } }
- 26. Iteration Configuration ● Parallelism ● Aggregators ● Broadcast Variables ● Messaging Direction (IN, OUT, ALL) ● Access vertex degrees and number of vertices 26
- 27. Vertex-Centric or GSA? 27 vertex-centric GSA when message creation is independent & expensive (parallelized in Gather) when the graph is skewed (~ 1.5x faster) if update is associative-commutative when all messages are needed for the update when a vertex needs to send messages to non-neighbors
- 28. ● PageRank* ● Single Source Shortest Paths* ● Label Propagation ● Weakly Connected Components* ● Community Detection Upcoming: Triangle Count, HITS, Affinity Propagation, Graph Summarization graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor)); *: both vertex-centric and GSA implementations Library of Algorithms 28
- 29. What’s next, Gelly? ● Integration with the Flink Streaming API ● Specialized Operators for Skewed Graphs ● Partitioning Methods ● More graph processing models: neighborhood-centric, partition-centric Check our Roadmap! https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly 29
- 30. Feeling Gelly? Try it! ● Grab the Flink master: https://github.com/apache/flink ● Read the Gelly programming guide: https://ci.apache. org/projects/flink/flink-docs-master/libs/gelly_guide.html ● Follow the Gelly School tutorials (Beta): http://gellyschool.com ● Read our blog post: https://flink.apache. org/news/2015/08/24/introducing-flink-gelly.html ● Come to Flink Forward (October 12-13, Berlin): http://flink- forward.org/ 30

No public clipboards found for this slide

Be the first to comment