Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Batch and Stream Graph Processing with Apache Flink

7,038 views

Published on

Batch and Stream Graph Processing with Apache Flink, presented at the Apache Flink and Neo4j meetup, March 2016, Berlin

Published in: Data & Analytics

Batch and Stream Graph Processing with Apache Flink

  1. 1. Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri
  2. 2. Apache Flink • An open-source, distributed data analysis framework • True streaming at its core • Streaming & Batch API 2 Historic data Kafka, RabbitMQ, ... HDFS, JDBC, ... Event logs ETL, Graphs,
 Machine Learning
 Relational, … Low latency,
 windowing, aggregations, ...
  3. 3. Integration (picture not complete) POSIX Java/Scala
 Collections POSIX
  4. 4. Why Stream Processing? • Most problems have streaming nature • Stream processing gives lower latency • Data volumes more easily tamed 4 Event stream
  5. 5. Batch and Streaming Pipelined and
 blocking operators Streaming Dataflow Runtime Batch Parameters DataSet DataStream Relational
 Optimizer Window
 Optimization Pipelined and
 windowed operators Schedule lazily Schedule eagerly Recompute whole
 operators Periodic checkpoints Streaming data movement Stateful operations DAG recovery Fully buffered streams DAG resource management Streaming Parameters
  6. 6. Flink APIs 6 case class Word (word: String, frequency: Int) val lines: DataStream[String] = env.readFromKafka(...) lines.flatMap {line => line.split(" ").map(word => Word(word,1))} .keyBy("word”).timeWindow(Time.of(5,SECONDS)).sum("frequency") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ").map(word => Word(word,1))} .groupBy("word").sum("frequency”) .print() DataSet API (batch): DataStream API (streaming):
  7. 7. Working with Windows 7 Why windows? We are often interested in fresh data!15 38 65 88 110 120 #sec 40 80 SUM #2 0 SUM #1 20 60 100 120 15 38 65 88 1) Tumbling windows myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS)); #sec 40 80 SUM #3 SUM #2 0 SUM #1 20 60 100 15 38 38 65 65 88 myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS)); 2) Sliding windows window buckets/panes
  8. 8. Working with Windows 7 Why windows? We are often interested in fresh data! Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events! 15 38 65 88 110 120 #sec 40 80 SUM #2 0 SUM #1 20 60 100 120 15 38 65 88 1) Tumbling windows myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS)); #sec 40 80 SUM #3 SUM #2 0 SUM #1 20 60 100 15 38 38 65 65 88 myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS)); 2) Sliding windows window buckets/panes
  9. 9. Flink Stack Gelly Table ML SAMOA DataSet (Java/Scala) DataStream (Java/Scala) HadoopM/R Local Remote Yarn Embedded Dataflow Dataflow(WiP) Table Cascading Streaming dataflow runtime CEP 8
  10. 10. Gelly the Flink Graph API
  11. 11. Meet Gelly • Java & Scala Graph APIs on top of Flink • graph transformations and utilities • iterative graph processing • library of graph algorithms • Can be seamlessly mixed with the DataSet Flink API to easily implement applications that use both record-based and graph-based analysis 10
  12. 12. Hello, Gelly! ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env); Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env); DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run( new ConnectedComponents(maxIterations)); val env = ExecutionEnvironment.getExecutionEnvironment val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env) val graph = Graph.fromDataSet(edges, env) val components = graph.run(new ConnectedComponents(maxIterations)) Java Scala 11
  13. 13. Graph Methods Graph Properties getVertexIds getEdgeIds numberOfVertices numberOfEdges getDegrees ... 12 Transformations map, filter, join subgraph, union, difference reverse, undirected getTriplets Mutations add vertex/edge remove vertex/edge
  14. 14. Neighborhood Methods graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT) 13
  15. 15. Iterative Graph Processing • Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations • Based on the BSP, vertex-centric model • scatter-gather • gather-sum-apply • vertex-centric (pregel)* • partition-centric* 14
  16. 16. Scatter-Gather Iterations • MessagingFunction: generate message for other vertices • VertexUpdateFunction: update vertex value based on received messages 15 Scatter Gather
  17. 17. Gather-Sum-Apply Iterations • Gather: compute one value per edge • Sum: combine the partial values of Gather to a single value • Apply: update the vertex value, based on the Sum and the current value 16 Gather ApplySum
  18. 18. Library of Algorithms • PageRank* • Single Source Shortest Paths* • Label Propagation • Weakly Connected Components* • Community Detection • Triangle Count & Enumeration • Graph Summarization • val ranks = inputGraph.run(new PageRank(0.85, 20)) • *: both scatter-gather and GSA implementations 17
  19. 19. Gelly-Stream single-pass stream graph processing with Flink
  20. 20. Real Graphs are dynamic Graphs are created from events happening in real-time 19
  21. 21. 20
  22. 22. 20
  23. 23. 20
  24. 24. 20
  25. 25. 20
  26. 26. 20
  27. 27. 20
  28. 28. 20
  29. 29. 20
  30. 30. Gelly on Streams 21 DataStreamDataSet Distributed Dataflow Deployment DataStream
  31. 31. Gelly on Streams 21 DataStreamDataSet Distributed Dataflow Deployment Gelly • Static Graphs • Multi-Pass Algorithms • Full Computations DataStream
  32. 32. Gelly on Streams 21 DataStreamDataSet Distributed Dataflow Deployment Gelly Gelly-Stream • Static Graphs • Multi-Pass Algorithms • Full Computations DataStream
  33. 33. Gelly on Streams 21 DataStreamDataSet Distributed Dataflow Deployment Gelly Gelly-Stream • Static Graphs • Multi-Pass Algorithms • Full Computations • Dynamic Graphs • Single-Pass Algorithms • Approximate Computations DataStream
  34. 34. Batch vs. Stream Graph Processing 22 Batch Stream Input Graph static dynamic Analysis on a snapshot continuous Response after job completion immediately
  35. 35. Graph Streaming Challenges • Maintain the graph structure • How to apply state updates efficiently? • Result updates • Re-run the analysis for each event? • Design an incremental algorithm? • Run separate instances on multiple snapshots? • Computation on most recent events only 23
  36. 36. Single-Pass Graph Streaming • Each event is an edge addition • Maintain only a graph summary • Recent events are grouped in graph windows 24
  37. 37. Graph Summaries • spanners for distance estimation • sparsifiers for cut estimation • sketches for homomorphic properties graph summary algorithm algorithm~R1 R2 25
  38. 38. Examples
  39. 39. Batch Connected Components • State: the graph and a component ID per vertex (initially equal to vertex ID) • Iterative Computation: For each vertex: • choose the min of neighbors’ component IDs and own component ID as new ID • if component ID changed since last iteration, notify neighbors 27
  40. 40. 1 43 2 5 6 7 8 i=0 Batch Connected Components 28
  41. 41. 1 11 2 2 6 6 6 i=1 Batch Connected Components 29
  42. 42. 1 11 1 5 6 6 6 i=2 Batch Connected Components 30 1
  43. 43. 1 11 1 1 6 6 6 i=3 Batch Connected Components 31
  44. 44. Stream Connected Components • State: a disjoint set data structure for the components • Computation: For each edge • if seen for the 1st time, create a component with ID the min of the vertex IDs • if in different components, merge them and update the component ID to the min of the component IDs • if only one of the endpoints belongs to a component, add the other one to the same component 32
  45. 45. 31 52 54 76 86 ComponentID Vertices 1 43 2 5 6 7 8 33
  46. 46. 31 52 54 76 86 42 ComponentID Vertices 1 1, 3 1 43 2 5 6 7 8 34
  47. 47. 31 52 54 76 86 42 ComponentID Vertices 43 2 2, 5 1 1, 3 1 43 2 5 6 7 8 35
  48. 48. 31 52 54 76 86 42 43 87 ComponentID Vertices 2 2, 4, 5 1 1, 3 1 43 2 5 6 7 8 36
  49. 49. 31 52 54 76 86 42 43 87 41 ComponentID Vertices 2 2, 4, 5 1 1, 3 6 6, 7 1 43 2 5 6 7 8 37
  50. 50. 52 54 76 86 42 43 87 41 ComponentID Vertices 2 2, 4, 5 1 1, 3 6 6, 7, 8 1 43 2 5 6 7 8 38
  51. 51. 54 76 86 42 43 87 41 ComponentID Vertices 2 2, 4, 5 1 1, 3 6 6, 7, 8 1 43 2 5 6 7 8 39
  52. 52. 76 86 42 43 87 41 ComponentID Vertices 2 2, 4, 5 1 1, 3 6 6, 7, 8 1 43 2 5 6 7 8 40
  53. 53. 76 86 42 43 87 41 ComponentID Vertices 6 6, 7, 8 1 1, 2, 3, 4, 5 1 43 2 5 6 7 8 41
  54. 54. 86 42 43 87 41 ComponentID Vertices 6 6, 7, 8 1 1, 2, 3, 4, 5 1 43 2 5 6 7 8 42
  55. 55. 42 43 87 41 ComponentID Vertices 6 6, 7, 8 1 1, 2, 3, 4, 5 1 43 2 5 6 7 8 43
  56. 56. Distributed Stream Connected Components 44
  57. 57. API Requirements • Continuous aggregations on edge streams • Global graph aggregations • Support for windowing 45
  58. 58. Introducing Gelly-Stream 46 Gelly-Stream enriches the DataStream API with two new additional ADTs: • GraphStream: • A representation of a data stream of edges. • Edges can have state (e.g. weights). • Supports property streams, transformations and aggregations. • GraphWindow: • A “time-slice” of a graph stream. • It enables neighborhood aggregations
  59. 59. GraphStream Operations 47 .getEdges() .getVertices() .numberOfVertices() .numberOfEdges() .getDegrees() .inDegrees() .outDegrees() GraphStream -> DataStream .mapEdges(); .distinct(); .filterVertices(); .filterEdges(); .reverse(); .undirected(); .union(); GraphStream -> GraphStream Property Streams Transformations
  60. 60. Graph Stream Aggregations 48 result aggregate property streamgraph stream (window) fold combine fold reduce local summaries global summary edges agg global aggregates can be persistent or transient graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))
  61. 61. Graph Stream Aggregations 49 result aggregate property stream graph stream (window) fold combine transform fold reduce map local summaries global summary edges agg graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))
  62. 62. Connected Components 50 graph stream #components
  63. 63. Connected Components 50 graph stream 1 43 2 5 6 7 8 #components
  64. 64. Connected Components 50 graph stream 31 52 1 43 2 5 6 7 8 #components
  65. 65. Connected Components 51 graph stream {1,3} {2,5} 1 43 2 5 6 7 8 #components
  66. 66. Connected Components 52 graph stream {1,3} {2,5} 54 1 43 2 5 6 7 8 #components
  67. 67. Connected Components 53 graph stream {1,3} {2,5} {4,5} 76 86 1 43 2 5 6 7 8 #components
  68. 68. Connected Components 54 graph stream {1,3} {2,5} {4,5} {6,7} {6,8} 1 43 2 5 6 7 8 #components
  69. 69. Connected Components 54 graph stream {1,3} {2,5} {4,5} {6,7} {6,8} 1 43 2 5 6 7 8 #components window triggers
  70. 70. Connected Components 55 graph stream {2,5} {6,8} {1,3} {4,5} {6,7} 1 43 2 5 6 7 8 #components
  71. 71. Connected Components 55 graph stream {2,5} {6,8} {1,3} {4,5} {6,7} 3 1 43 2 5 6 7 8 #components
  72. 72. Connected Components 56 graph stream {1,3} {2,4,5} {6,7,8} 1 43 2 5 6 7 8 #components
  73. 73. Connected Components 56 graph stream {1,3} {2,4,5} {6,7,8} 3 1 43 2 5 6 7 8 #components
  74. 74. Connected Components 57 graph stream {1,3} {2,4,5} {6,7,8} 42 43 1 43 2 5 6 7 8 #components
  75. 75. Connected Components 58 graph stream {1,3} {2,4,5} {6,7,8}{2,4} {3,4} 41 87 1 43 2 5 6 7 8 #components
  76. 76. Connected Components 59 graph stream {1,3} {2,4,5} {6,7,8}{1,2,4} {3,4} {7,8} 1 43 2 5 6 7 8 #components
  77. 77. Connected Components 59 graph stream {1,3} {2,4,5} {6,7,8}{1,2,4} {3,4} {7,8} 1 43 2 5 6 7 8 #components window triggers
  78. 78. Connected Components 60 graph stream {1,2,4,5} {6,7,8} {3,4} {7,8} 1 43 2 5 6 7 8 #components
  79. 79. Connected Components 60 graph stream {1,2,4,5} {6,7,8} 2 {3,4} {7,8} 1 43 2 5 6 7 8 #components
  80. 80. Connected Components 61 graph stream {1,2,3,4,5} {6,7,8} 1 43 2 5 6 7 8 #components
  81. 81. Connected Components 61 graph stream {1,2,3,4,5} {6,7,8} 2 1 43 2 5 6 7 8 #components
  82. 82. Slicing Graph Streams 62 graphStream.slice(Time.of(1, MINUTE)); 11:40 11:41 11:42 11:43
  83. 83. Aggregating Slices 63 graphStream.slice(Time.of(1, MINUTE), direction) • Slicing collocates edges by vertex information • Neighborhood aggregations on sliced graphs source target
  84. 84. Aggregating Slices 63 graphStream.slice(Time.of(1, MINUTE), direction) • Slicing collocates edges by vertex information • Neighborhood aggregations on sliced graphs source target
  85. 85. Aggregating Slices 63 graphStream.slice(Time.of(1, MINUTE), direction) .reduceOnEdges(); .foldNeighbors(); .applyOnNeighbors(); • Slicing collocates edges by vertex information • Neighborhood aggregations on sliced graphs source target Aggregations
  86. 86. Finding Matches Nearby 64
  87. 87. Finding Matches Nearby 64 graphStream.filterVertices(GraphGeeks()) .slice(Time.of(15, MINUTE), EdgeDirection.IN) .applyOnNeighbors(FindPairs()) GraphStream :: graph geek check-ins wendy checked_in soap_bar steve checked_in soap_bar tom checked_in joe’s_grill sandra checked_in soap_bar rafa checked_in joe’s_grill
  88. 88. Finding Matches Nearby 64 graphStream.filterVertices(GraphGeeks()) .slice(Time.of(15, MINUTE), EdgeDirection.IN) .applyOnNeighbors(FindPairs()) slice GraphStream :: graph geek check-ins wendy checked_in soap_bar steve checked_in soap_bar tom checked_in joe’s_grill sandra checked_in soap_bar rafa checked_in joe’s_grill wendy steve sandra soap bar tom rafa joe’s grill GraphWindow :: user-place
  89. 89. Finding Matches Nearby 64 graphStream.filterVertices(GraphGeeks()) .slice(Time.of(15, MINUTE), EdgeDirection.IN) .applyOnNeighbors(FindPairs()) slice GraphStream :: graph geek check-ins wendy checked_in soap_bar steve checked_in soap_bar tom checked_in joe’s_grill sandra checked_in soap_bar rafa checked_in joe’s_grill wendy steve sandra soap bar tom rafa joe’s grill FindPairs {wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa} GraphWindow :: user-place
  90. 90. What’s next? • Integration with Neo4j (Input / Output) • OpenCypher on Flink/Gelly • Pregel and Partition-Centric Iterations • Integration with Graphalytics
  91. 91. Feeling Gelly? • Gelly Guide https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html • Gelly-Stream Repository https://github.com/vasia/gelly-streaming • Gelly-Stream talk @FOSDEM16 https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/ • An interesting read http://people.cs.umass.edu/~mcgregor/papers/13-graphsurvey.pdf • A cool thesis http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-170425

×