Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Gelly in Apache Flink Bay Area Meetup

13,845 views

Published on

Gelly presentation at the Apache Flink Bay Area Meetup, August 27 2015, @MapR

Published in: Data & Analytics

Gelly in Apache Flink Bay Area Meetup

  1. 1. Vasia Kalavri Apache Flink PMC, PhD student @KTH @vkalavri - vasia@apache.org August 27, 2015 Gelly: Large-Scale Graph Processing with Apache Flink
  2. 2. Overview - Why Graph Processing with Flink? - Gelly: Flink’s Graph Processing API - Iterative Graph Processing in Gelly - What’s next, Gelly? 2
  3. 3. Why Graph Processing with Flink?
  4. 4. 4
  5. 5. A data analysis pipeline load clean create graph analyze graph clean transformload result clean transformload 5
  6. 6. A more user-friendly pipeline load load result load 6
  7. 7. General-purpose or specialized? - fast application development and deployment - easier maintenance - non-intuitive APIs - time-consuming - use, configure and integrate different systems - hard to maintain - rich APIs and features general-purpose specialized what about performance? 7
  8. 8. Native Iterations ● the runtime is aware of the iterative execution ● no scheduling overhead between iterations ● caching and state maintenance are handled automatically Caching Loop-invariant DataPushing work “out of the loop” Maintain state as index 8
  9. 9. Flink Iteration Operators Iterate IterateDelta 9 Input Iterative Update Function Result Replace Workset Iterative Update Function Result Solution Set State
  10. 10. Many iterative algorithms exhibit sparse computational dependencies and asymmetric convergence Delta Iterations 10
  11. 11. Example: Connected Components 11 Only vertices that change their value are in the workset for the next iteration.
  12. 12. Beyond Iterations 12 Performance & Scalability ● own memory management ● own serialization framework ● operate on binary data when possible Automatic Optimizations ● choose best execution strategy ● cache invariant data
  13. 13. Gelly the Flink Graph API
  14. 14. ● Java & Scala Graph APIs on top of Flink ● Ships already with Flink 0.9 (Beta) ● Can be seamlessly mixed with the DataSet Flink API → easily implement applications that use both record-based and graph-based analysis Meet Gelly 14
  15. 15. Gelly in the Flink stack 15 Scala API (batch and streaming) Java API (batch and streaming) ML library Gelly Runtime and Execution Environments Data storage Table API Python API Transformations and Utilities Iterative Graph Processing Graph Library
  16. 16. Hello, Gelly! // create a graph from a Dataset of Vertices and a DataSet of Edges Graph<String, Long, Double> graph = Graph.fromDataSet(vertices, edges, env); 16 // create a graph from a Collection of Edges and initialize the Vertex values Graph<String, Long, Double> graph = Graph.fromCollection(edges, new MapFunction<String, Long>() { public Long map(String vertexId) { return 1l; } }, env); vertex ID type vertex value type edge value type edges is a Java collection assigned vertex value
  17. 17. ● Graph Properties ○ getVertexIds ○ getEdgeIds ○ numberOfVertices ○ numberOfEdges ○ getDegrees ○ isWeaklyConnected ○ ... Available Methods ● Transformations ○ map, filter, join ○ subgraph, union, difference ○ reverse, undirected ○ getTriplets ● Mutations ○ add vertex/edge ○ remove vertex/edge 17
  18. 18. Example: mapVertices 18 // increment each vertex value by one Graph<Long, Long, Long> updatedGraph = graph.mapVertices( new MapFunction<Vertex<Long, Long>, Long>() { public Long map(Vertex<Long, Long> value) { return value.getValue() + 1; } }); 4 2 8 5 5 3 1 7 4 5
  19. 19. Example: subGraph 19 Graph<Long, Long, Long> subgraph = graph.subgraph( new FilterFunction<Vertex<Long, Long>>() { public boolean filter(Vertex<Long, Long> vertex) { // keep only vertices with positive values return (vertex.getValue() > 0); } }, new FilterFunction<Edge<Long, Long>>() { public boolean filter(Edge<Long, Long> edge) { // keep only edges with negative values return (edge.getValue() < 0); } })
  20. 20. - Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel Neighborhood Methods 3 4 7 4 4 graph.reduceOnNeighbors(new MinValue(), EdgeDirection.OUT); 3 9 7 4 5 20
  21. 21. Iterative Graph Processing Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations ● vertex-centric (similar to Pregel, Giraph) ● gather-sum-apply (similar to PowerGraph) 21
  22. 22. MessagingFunction: what messages to send to other vertices VertexUpdateFunction: how to update a vertex value, based on the received messages The workset contains only active vertices (received a msg in the previous superstep) Vertex-centric Iterations 22
  23. 23. Vertex-centric SSSP shortestPaths = graph.runVertexCentricIteration( new DistanceUpdater(), new DistanceMessenger()).getVertices(); DistanceUpdater: VertexUpdateFunction DistanceMessenger: MessagingFunction 23 sendMessages(K key, Double newDist) { for (Edge edge : getOutgoingEdges()) { sendMessageTo(edge.getTarget(), newDist + edge.getValue()); } updateVertex(Vertex<K, Double> vertex, MessageIterator msgs) { Double minDist = Double.MAX_VALUE; for (double msg : msgs) { if (msg < minDist) minDist = msg; } if (vertex.getValue() > minDist) setNewVertexValue(minDist); }
  24. 24. Gather: how to transform each of the edges and neighbors of each vertex Sum: how to aggregate the partial values of Gather to a single value Apply: how to update the vertex value, based on the new aggregate and the current value The workset contains the active vertices (user-defined activation) Gather-Sum-Apply Iterations 24
  25. 25. Gather-Sum-Apply SSSP shortestPaths = graph.runGatherSumApplyIteration(new CalculateDistances(), new ChooseMinDistance(), new UpdateDistance()).getVertices(); CalculateDistances: Gather 25 gather(Neighbor<Double, Double> neighbor) { return neighbor.getNeighborValue() + neighbor.getEdgeValue(); } ChooseMinDistance: Sum ChooseMinDistance: Apply sum(Double newValue, Double currentValue) { return Math.min(newValue, currentValue); } apply(Double newDistance, Double oldDistance) { if (newDistance < oldDistance) { setResult(newDistance); } }
  26. 26. Iteration Configuration ● Parallelism ● Aggregators ● Broadcast Variables ● Messaging Direction (IN, OUT, ALL) ● Access vertex degrees and number of vertices 26
  27. 27. Vertex-Centric or GSA? 27 vertex-centric GSA when message creation is independent & expensive (parallelized in Gather) when the graph is skewed (~ 1.5x faster) if update is associative-commutative when all messages are needed for the update when a vertex needs to send messages to non-neighbors
  28. 28. ● PageRank* ● Single Source Shortest Paths* ● Label Propagation ● Weakly Connected Components* ● Community Detection Upcoming: Triangle Count, HITS, Affinity Propagation, Graph Summarization graphWithRanks = inputGraph.run(new PageRank(maxIterations, dampingFactor)); *: both vertex-centric and GSA implementations Library of Algorithms 28
  29. 29. What’s next, Gelly? ● Integration with the Flink Streaming API ● Specialized Operators for Skewed Graphs ● Partitioning Methods ● More graph processing models: neighborhood-centric, partition-centric Check our Roadmap! https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly 29
  30. 30. Feeling Gelly? Try it! ● Grab the Flink master: https://github.com/apache/flink ● Read the Gelly programming guide: https://ci.apache. org/projects/flink/flink-docs-master/libs/gelly_guide.html ● Follow the Gelly School tutorials (Beta): http://gellyschool.com ● Read our blog post: https://flink.apache. org/news/2015/08/24/introducing-flink-gelly.html ● Come to Flink Forward (October 12-13, Berlin): http://flink- forward.org/ 30

×