
Single-Pass Graph Stream Analytics with Apache Flink

A presentation motivating graph stream processing as a paradigm for large-scale complex analytics and introducing gelly-streaming, our new framework based on Apache Flink.

Published in: Data & Analytics

  1. @GraphDevroom Single-pass Graph Stream Analytics with Apache Flink: Rethinking graph processing for dynamic data. Vasiliki Kalavri <vasia@apache.org>, Paris Carbone <senorcarbone@apache.org>
  2. Real graphs are dynamic. Graphs are created by events happening in real time:
     • liking a post
     • buying a book
     • listening to a song
     • rating a movie
     • packet switching in computer networks
     • bitcoin transactions
     Each event adds an edge to the graph.
  4. In a batch world, we create and analyze a snapshot of the real graph:
     • all events / interactions / relationships that happened between t0 and tn
     • the Facebook social network on January 30, 2016
     • user web logs gathered between March 1st 12:00 and 16:00
     • retweets and replies for 24h after the announcement of the death of David Bowie
  5. Batch Graph Processing
  6. In a streaming world:
     • We receive and consume the events as they are happening, in real time
     • We analyze the evolving graph and receive results continuously
  7. Streaming Graph Processing
  18. Sounds expensive? Challenges:
     • maintain the graph structure
       • how to apply state updates efficiently?
     • update the result
       • re-run the analysis for each event?
       • design an incremental algorithm?
       • run separate instances on multiple snapshots?
     • compute only on most recent events
  19. The Apache Flink Stack
     Layers: APIs (DataSet, DataStream), Execution (Distributed Dataflow), Deployment.
     DataSet:
     • Bounded Data Sources
     • Structured Iterations
     • Blocking Operations
     DataStream:
     • Unbounded Data Sources
     • Asynchronous Iterations
     • Incremental Operations
  20. Unifying Data Processing
     Job Manager:
     • scheduling tasks
     • monitoring/recovery
     Client:
     • task pipelining
     • blocking
     • execution plan building
     • optimisation
     Stack: DataSet and DataStream APIs on a Distributed Dataflow runtime, with Deployment below. Example sources: HDFS, Kafka.
     DataSet<String> text = env.readTextFile("hdfs://…");
     text.map(…).reduceGroup(…)…
     DataStream<String> events = env.addSource(new KafkaConsumer(…));
     events.map(…).filter(…).window(…).fold(…)…
  21. Graph Processing on Apache Flink
     Gelly, on the DataSet API:
     • Static Graphs
     • Multi-Pass Algorithms
     • Full Computations
  22. Data Streams as ADTs
     • Direct access to the execution graph / topology
       • Suitable for engineers
     • Abstract Data Type: DataStream
       • Transformations hide operator details
       • Suitable for data analysts and engineers
       • similar to: PCollection, DStream
  23. Nature of a DataStream Job
     • Tasks are long running in a pipelined execution.
     • State is kept within tasks.
     • Transformations are applied per-record or per-window.
     Execution Graph: from unbounded data sources to unbounded data sinks
     Execution Properties:
     • operator parallelism
     • stream partitioning
  24. Working with DataStreams
     Creation:
     DataStream<String> myStream =
     - for supported data sources:
       env.addSource(new FlinkKafkaConsumer<String>(…));
       env.addSource(new RMQSource<String>(…));
       env.addSource(new TwitterSource(propsFile));
       env.socketTextStream(…);
     - for testing:
       env.fromCollection(…);
       env.fromElements(…);
     - for adding any custom source:
       env.addSource(new MyCustomSource(…));
     Properties:
       myStream.setParallelism(3);
     Partitioning:
       myStream.broadcast();
       .rebalance();
       .forward();
       .keyBy(key); // partition stream and operator state by key
     Transformations:
       myStream.map(…);
       myStream.flatMap(…);
       myStream.filter(…);
       myStream.union(myOtherStream);
     - for aggregations on partitioned-by-key streams:
       myKeyStream.reduce(…);
       myKeyStream.fold(…);
       myKeyStream.sum(…);
  25. Example
     env.setParallelism(2); //default parallelism
     DataStream<Tuple2<String, Integer>> counts = env
         .socketTextStream("localhost", 9999)
         .flatMap(new Splitter()) //transformation
         .keyBy(0) //partitioning
         .sum(1) //rolling aggregation
         .setParallelism(4);
     counts.print();
     Dataflow: flatMap → sum → print
     Input: "cool, gelly is cool"
     After flatMap: <"gelly", 1> <"is", 1> <"cool", 1> <"cool", 1>
     After sum: <"is", 1> <"gelly", 1> <"cool", 2> <"cool", 1>
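The rolling aggregation in this example can be imitated without a cluster. The following is a minimal plain-Java sketch, not Flink code: the class `RollingWordCount` and its `process` method are hypothetical names, and the word-splitting rule stands in for the `Splitter` that the slide does not show. It keeps one running count per key and emits an update for every incoming word, which is what a rolling `sum` on a keyed stream does conceptually.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of flatMap -> keyBy -> rolling sum (hypothetical helper, not Flink API).
class RollingWordCount {
    // Running count per word: the keyed state of the rolling sum.
    private final Map<String, Integer> counts = new HashMap<>();

    // Splits a line into words and returns one "<word, runningCount>" update per word.
    List<String> process(String line) {
        List<String> updates = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // a leading delimiter produces an empty token
            }
            int count = counts.merge(word, 1, Integer::sum);
            updates.add("<" + word + ", " + count + ">");
        }
        return updates;
    }

    public static void main(String[] args) {
        RollingWordCount wordCount = new RollingWordCount();
        System.out.println(wordCount.process("cool, gelly is cool"));
    }
}
```

Each word produces an update immediately, so "cool" is reported twice, first with count 1 and then with count 2, rather than once at the end.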
  26. Working with Windows
     Why windows? We are often interested in fresh data!
     Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
     1) Sliding windows
     myKeyStream.timeWindow(
         Time.of(60, TimeUnit.SECONDS),
         Time.of(20, TimeUnit.SECONDS));
     2) Tumbling windows
     myKeyStream.timeWindow(
         Time.of(60, TimeUnit.SECONDS));
     [Figure: timelines in seconds (0 to 120) with events such as 15, 38, 65, 88 and 110 grouped into window buckets/panes and aggregated as SUM #1, SUM #2, SUM #3.]
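To make the difference between the two window types concrete, here is a small plain-Java sketch of window assignment. It is an illustration under stated assumptions, not Flink's implementation: `WindowAssigner.windowStarts` is a hypothetical helper, timestamps are non-negative seconds, and windows are aligned to time 0. A tumbling window is treated as the special case where the slide equals the size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the windows an event belongs to.
class WindowAssigner {
    // Returns the start times of every window [start, start + size) that contains
    // the timestamp, newest window first. Early events may fall into windows with
    // negative start times, which this sketch keeps.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Sliding: 60 s windows every 20 s. Tumbling: 60 s windows every 60 s.
        System.out.println("sliding:  " + windowStarts(65, 60, 20));
        System.out.println("tumbling: " + windowStarts(65, 60, 60));
    }
}
```

With a 60-second size and a 20-second slide, each event lands in three overlapping windows; with tumbling windows it lands in exactly one.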
  27. Example
     env.setParallelism(2); //default parallelism
     DataStream<Tuple2<String, Integer>> counts = env
         .socketTextStream("localhost", 9999)
         .flatMap(new Splitter()) //transformation
         .keyBy(0) //partitioning
         .timeWindow(Time.of(5, TimeUnit.MINUTES))
         .sum(1) //window aggregation
         .setParallelism(4);
     counts.print();
     Dataflow: flatMap → window sum → print
     10:48 - "cool, gelly is cool" → <"gelly", 1> … <"cool", 2>
     11:01 - "dataflow is cool too" → <"dataflow", 1> … <"cool", 1>
  28. Single-Pass Graph Streaming with Windows
     • Each event represents an edge addition
     • Each edge is processed once and thrown away, i.e. the graph structure is not explicitly maintained
     • The state maintained corresponds to a graph summary, a continuously improving property, an aggregation
     • Recent events can be grouped in a graph window and processed independently
  29. What's the benefit?
     • Get results faster
       • No need to wait for the job to finish
       • Sometimes, early approximations are better than late exact answers
     • Get results continuously
       • Process unbounded number of events
     • Use less memory
       • single-pass algorithms don't store the graph structure
       • run computations on a graph summary
  30. What can you do in this model?
     • transformations, e.g. mapping, filtering vertex / edge values, reverse edge direction
     • continuous aggregations, e.g. degree distribution
  37. Streaming Degrees Distribution
     [Figure: example graph on vertices 1 to 8 next to a histogram of #vertices per degree, updated as each edge arrives.]
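A continuous degree distribution can be maintained with two small maps and no stored edges. The sketch below is plain Java rather than gelly-streaming code; `StreamingDegreeDistribution` and its methods are hypothetical names, and it assumes an undirected edge stream in which each edge arrives once (duplicates would be counted again).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Single-pass degree distribution: the edges themselves are never stored.
class StreamingDegreeDistribution {
    private final Map<Integer, Integer> degree = new HashMap<>();       // vertex -> degree
    private final Map<Integer, Integer> distribution = new TreeMap<>(); // degree -> #vertices

    // Processes one edge addition by bumping the degree of both endpoints.
    void addEdge(int src, int dst) {
        bump(src);
        bump(dst);
    }

    private void bump(int vertex) {
        int oldDegree = degree.getOrDefault(vertex, 0);
        int newDegree = oldDegree + 1;
        degree.put(vertex, newDegree);
        if (oldDegree > 0) {
            // Move the vertex out of its old bucket, dropping buckets that become empty.
            distribution.computeIfPresent(oldDegree, (d, n) -> n == 1 ? null : n - 1);
        }
        distribution.merge(newDegree, 1, Integer::sum);
    }

    // Current histogram, ordered by degree.
    Map<Integer, Integer> distribution() {
        return distribution;
    }

    public static void main(String[] args) {
        StreamingDegreeDistribution dd = new StreamingDegreeDistribution();
        dd.addEdge(1, 2);
        dd.addEdge(2, 3);
        System.out.println(dd.distribution());
    }
}
```

The state grows with the number of vertices, not the number of edges, which is the memory argument made for single-pass processing.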
  47. What can you do in this model?
     • spanners for distance estimation
     • sparsifiers for cut estimation
     • sketches for homomorphic properties
     [Figure: running the algorithm on the graph gives R1; running it on the graph summary gives R2, with R1 ~ R2.]
  48. What can you do in this model?
     • neighborhood aggregations on windows, e.g. triangle counting, clustering coefficient (no iterations… yet!)
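As an illustration of a neighborhood aggregation on a window, the following plain-Java sketch counts triangles among the edges of a single window. It is not gelly-streaming code: `WindowTriangleCount.countTriangles` is a hypothetical helper, edges are assumed undirected, and the window's edges are assumed to have been collected already.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Triangle counting over the edges of one graph window (hypothetical helper).
class WindowTriangleCount {
    static int countTriangles(int[][] windowEdges) {
        // Build per-vertex neighborhoods; sets absorb duplicate edges.
        Map<Integer, Set<Integer>> neighbors = new HashMap<>();
        for (int[] edge : windowEdges) {
            if (edge[0] == edge[1]) {
                continue; // self-loops cannot be part of a triangle
            }
            neighbors.computeIfAbsent(edge[0], k -> new HashSet<>()).add(edge[1]);
            neighbors.computeIfAbsent(edge[1], k -> new HashSet<>()).add(edge[0]);
        }
        // Count each triangle once by requiring u < v < w.
        int triangles = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : neighbors.entrySet()) {
            int u = entry.getKey();
            for (int v : entry.getValue()) {
                if (v <= u) {
                    continue;
                }
                for (int w : neighbors.get(v)) {
                    if (w > v && entry.getValue().contains(w)) {
                        triangles++;
                    }
                }
            }
        }
        return triangles;
    }

    public static void main(String[] args) {
        int[][] window = {{1, 2}, {2, 3}, {1, 3}, {3, 4}};
        System.out.println(countTriangles(window));
    }
}
```

Because the computation only needs the edges inside one window, no iteration over the full graph history is required.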
  49. Examples
  50. Batch Connected Components
     • State: the graph and a component ID per vertex (initially equal to vertex ID)
     • Iterative Computation: For each vertex:
       • choose the min of neighbors' component IDs and own component ID as new ID
       • if component ID changed since last iteration, notify neighbors
  51. Batch Connected Components
     [Figure: iterations i=0 to i=3 on an example graph with vertices 1 to 8. Vertices exchange component IDs with their neighbors; by i=3, vertices 1 to 5 carry component ID 1 and vertices 6 to 8 carry component ID 6.]
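The batch algorithm can be sketched in plain Java as synchronous min-label propagation. This is a simplified illustration, not Gelly code: `BatchConnectedComponents.run` is a hypothetical helper, vertices are assumed to be numbered 1..n, and for brevity every edge is rescanned in each iteration instead of only notifying neighbors of changed vertices. The edge list in `main` is an assumption read off the example graph in these slides.

```java
import java.util.Arrays;

// Multi-pass connected components over a stored edge list (hypothetical helper).
class BatchConnectedComponents {
    // Returns component[v] for v in 1..n; index 0 is unused.
    static int[] run(int n, int[][] edges) {
        int[] component = new int[n + 1];
        for (int v = 1; v <= n; v++) {
            component[v] = v; // initially, every vertex is its own component
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            // Labels for the next iteration are computed from the current ones.
            int[] next = Arrays.copyOf(component, component.length);
            for (int[] edge : edges) {
                int min = Math.min(component[edge[0]], component[edge[1]]);
                if (min < next[edge[0]]) {
                    next[edge[0]] = min;
                    changed = true;
                }
                if (min < next[edge[1]]) {
                    next[edge[1]] = min;
                    changed = true;
                }
            }
            component = next;
        }
        return component;
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        System.out.println(Arrays.toString(run(8, edges)));
    }
}
```

Note that the whole edge list must stay available across iterations, which is the cost the streaming variant avoids.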
  55. Streaming Connected Components
     • State: a disjoint set data structure for the components
     • Computation: For each edge
       • if both endpoints are seen for the 1st time, create a component with ID the min of the vertex IDs
       • if the endpoints are in different components, merge them and update the component ID to the min of the component IDs
       • if only one of the endpoints belongs to a component, add the other one to the same component
  56. Streaming Connected Components, worked example on vertices 1 to 8. Edge stream: (3,1), (5,2), (5,4), (7,6), (8,6), (4,2), (4,3), (8,7), (4,1). State after each edge, as ComponentID: Vertices:
     • (3,1): 1: {1, 3}
     • (5,2): 1: {1, 3}; 2: {2, 5}
     • (5,4): 1: {1, 3}; 2: {2, 4, 5}
     • (7,6): 1: {1, 3}; 2: {2, 4, 5}; 6: {6, 7}
     • (8,6): 1: {1, 3}; 2: {2, 4, 5}; 6: {6, 7, 8}
     • (4,2): unchanged
     • (4,3): 1: {1, 2, 3, 4, 5}; 6: {6, 7, 8}
     • (8,7): unchanged
     • (4,1): unchanged
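The streaming algorithm maps naturally onto a union-find structure. The following plain-Java sketch is one possible reading of the slides, not the gelly-streaming implementation: `StreamingConnectedComponents` is a hypothetical class, and the three cases of the per-edge rule collapse into a single union step because an unseen vertex is treated as a singleton component. The smaller root always wins a merge, so the component ID is the minimum vertex ID.

```java
import java.util.HashMap;
import java.util.Map;

// Single-pass connected components with a disjoint-set (union-find) summary.
class StreamingConnectedComponents {
    private final Map<Integer, Integer> parent = new HashMap<>();

    // Returns the component ID of v, registering v as a singleton if it is new.
    int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path directly at the root.
        int current = v;
        while (current != root) {
            int next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Processes one edge; the edge is not stored afterwards.
    void addEdge(int u, int v) {
        int rootU = find(u);
        int rootV = find(v);
        if (rootU == rootV) {
            return; // both endpoints are already in the same component
        }
        // Keep the smaller ID as representative of the merged component.
        parent.put(Math.max(rootU, rootV), Math.min(rootU, rootV));
    }

    public static void main(String[] args) {
        StreamingConnectedComponents cc = new StreamingConnectedComponents();
        int[][] stream = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        for (int[] edge : stream) {
            cc.addEdge(edge[0], edge[1]);
        }
        for (int v = 1; v <= 8; v++) {
            System.out.println(v + " -> component " + cc.find(v));
        }
    }
}
```

The state is one map entry per vertex, regardless of how many edges have been consumed.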
  67. Distributed Streaming Connected Components
  68. Streaming Bipartite Detection
     Similar to connected components, but
     • each vertex is also assigned a sign, (+) or (-)
     • edge endpoints must have different signs
     • when merging components, if flipping all signs doesn't work => the graph is not bipartite
  69. Streaming Bipartite Detection, worked example on vertices 1 to 7:
     • Two components, Cid=1 and Cid=5, each with consistent (+)/(-) signs.
     • Edge (3,5) arrives and connects them; the signs of one component are flipped and the result is a single component, Cid=1.
     • Edge (3,7) arrives within Cid=1. Can't flip signs and stay consistent => not bipartite!
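One way to realize the sign bookkeeping is a union-find that stores, for each vertex, whether its sign differs from its parent's. The plain-Java sketch below is an assumption about how this could be done, not the gelly-streaming implementation; `StreamingBipartiteness` is a hypothetical class. Flipping all signs of a component becomes a single parity bit on its root, so a merge stays cheap.

```java
import java.util.HashMap;
import java.util.Map;

// Single-pass bipartiteness check using union-find with sign parities.
class StreamingBipartiteness {
    private final Map<Integer, Integer> parent = new HashMap<>();
    // parity.get(v) is 1 when v's sign differs from the sign of parent.get(v), else 0.
    private final Map<Integer, Integer> parity = new HashMap<>();
    private boolean bipartite = true;

    // Returns {root, p} where p is 1 if v's sign differs from the root's sign.
    private int[] find(int v) {
        parent.putIfAbsent(v, v);
        parity.putIfAbsent(v, 0);
        int p = parent.get(v);
        if (p == v) {
            return new int[] {v, 0};
        }
        int[] viaParent = find(p);
        int relativeToRoot = parity.get(v) ^ viaParent[1];
        parent.put(v, viaParent[0]);   // path compression
        parity.put(v, relativeToRoot); // parity is now stored relative to the root
        return new int[] {viaParent[0], relativeToRoot};
    }

    // Processes one edge and returns whether the graph seen so far is bipartite.
    boolean addEdge(int u, int v) {
        int[] a = find(u);
        int[] b = find(v);
        if (a[0] == b[0]) {
            if (a[1] == b[1]) {
                bipartite = false; // same component, same sign: an odd cycle exists
            }
        } else {
            // Merge under the smaller root; the parity bit flips the attached
            // component when needed so that u and v end up with different signs.
            int flip = a[1] ^ b[1] ^ 1;
            int keep = Math.min(a[0], b[0]);
            int attach = Math.max(a[0], b[0]);
            parent.put(attach, keep);
            parity.put(attach, flip);
        }
        return bipartite;
    }

    public static void main(String[] args) {
        StreamingBipartiteness check = new StreamingBipartiteness();
        int[][] stream = {{1, 2}, {3, 4}, {2, 4}, {1, 3}, {1, 4}};
        for (int[] edge : stream) {
            System.out.println(edge[0] + "-" + edge[1] + " bipartite so far: "
                    + check.addEdge(edge[0], edge[1]));
        }
    }
}
```

Once an odd cycle has been observed the answer stays negative, since later edge additions cannot remove it.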
  74. The GraphStream
     Gelly, on the DataSet API:
     • Static Graphs
     • Multi-Pass Algorithms
     • Full Computations
     Gelly-Stream, on the DataStream API:
     • Dynamic Graphs
     • Single-Pass Algorithms
     • Incremental Computations
  75. Introducing Gelly-Stream
     Gelly-Stream enriches the DataStream API with two new ADTs:
     • GraphStream:
       • A representation of a data stream of edges.
       • Edges can have state (e.g. weights).
       • Supports property streams, transformations and aggregations.
     • GraphWindow:
       • A "time-slice" of a graph stream.
       • It enables neighborhood aggregations (and iterations in the future).
  76. Graph Property Streams (GraphStream -> DataStream)
     .getEdges()
     .getVertices()
     .numberOfVertices()
     .numberOfEdges()
     .getDegrees()
     .inDegrees()
     .outDegrees()
     [Figure: a graph stream of edges among vertices A, B, C, D.]
  77. Transform Graph Streams (GraphStream -> GraphStream)
     .mapEdges();
     .distinct();
     .filterVertices();
     .filterEdges();
     .reverse();
     .undirected();
     .union();
     [Figure: a graph stream of edges among vertices A, B, C, D.]
  78. Graph Stream Aggregations
     graphStream.aggregate(new MyGraphAggregation(window, update, fold, combine, merge))
     [Figure: edges of the graph stream (window) are folded into partitioned aggregates, which are combined into global aggregates; the result is emitted as an aggregate property stream.]
     Global aggregates can be persistent or transient.
  79. Graph Stream Aggregations
     graphStream.aggregate(new MyGraphAggregation(window, fold, combine, merge))
     [Figure: the same pipeline with its stages labeled: fold over edges (partitioned aggregates), combine via reduce (global aggregates), merge via map, producing the result as an aggregate property stream.]
  80. Connected Components
     graphStream.aggregate(new ConnectedComponents(window, update, fold, combine, merge))
     [Figure: animated walkthrough of the graph stream → combine (reduce) → merge (map) pipeline on the example graph with vertices 1 to 8. Edges (3,1), (5,2), (5,4), (7,6), (8,6) are folded into partial components {1,3}, {2,5}, {4,5}, {6,7}, {6,8}, which combine into {1,3}, {2,4,5}, {6,7,8}. The remaining edges (4,2), (4,3), (8,7), (4,1) produce further partial components that are merged in the same way, ending with {1,2,3,4,5} and {6,7,8}.]
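The fold and combine stages of this aggregation can be imitated in plain Java with mergeable summaries. The sketch below is not the gelly-streaming API: `ComponentSummary`, `fold` and `combine` are hypothetical names chosen to echo the slide's terminology, and the two-way split of the edge stream in `main` is an arbitrary assumption standing in for stream partitioning. Each partition folds its own edges into a local disjoint-set summary, and the summaries are then combined into a global one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A mergeable connected-components summary (hypothetical, for illustration only).
class ComponentSummary {
    private final Map<Integer, Integer> parent = new HashMap<>();

    // Returns the component ID of v, registering v as a singleton if it is new.
    int find(int v) {
        parent.putIfAbsent(v, v);
        int p = parent.get(v);
        if (p == v) {
            return v;
        }
        int root = find(p);
        parent.put(v, root); // path compression
        return root;
    }

    // fold: apply one edge to this partial summary.
    ComponentSummary fold(int u, int v) {
        int rootU = find(u);
        int rootV = find(v);
        if (rootU != rootV) {
            parent.put(Math.max(rootU, rootV), Math.min(rootU, rootV));
        }
        return this;
    }

    // combine: merge another partial summary into this one by replaying
    // each of its vertices as an edge to that vertex's component ID.
    ComponentSummary combine(ComponentSummary other) {
        List<Integer> vertices = new ArrayList<>(other.parent.keySet());
        for (int v : vertices) {
            fold(v, other.find(v));
        }
        return this;
    }

    public static void main(String[] args) {
        // Two partitions, each seeing only part of the edge stream.
        ComponentSummary left = new ComponentSummary()
                .fold(3, 1).fold(5, 4).fold(8, 6).fold(4, 3);
        ComponentSummary right = new ComponentSummary()
                .fold(5, 2).fold(7, 6);
        ComponentSummary global = left.combine(right);
        for (int v = 1; v <= 8; v++) {
            System.out.println(v + " -> component " + global.find(v));
        }
    }
}
```

Combining only touches the vertices of the smaller summaries, so partitions never need to exchange raw edges.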
  92. Slicing Graph Streams
     graphStream.slice(Time.of(1, MINUTE));
     [Figure: the stream cut into one-minute slices at 11:40, 11:41, 11:42, 11:43.]
  93. Aggregating Slices
     graphStream.slice(Time.of(1, MINUTE), direction)
         .reduceOnEdges();
         .foldNeighbors();
         .applyOnNeighbors();
     • Slicing collocates edges by vertex information
     • Neighbourhood aggregations are now enabled on sliced graphs
     [Figure: edges grouped by source or target vertex, feeding Aggregations.]
  94. Finding matches nearby
     graphStream.slice(Time.of(1, MINUTE)).applyOnNeighbors(new FindPairs())
     Dataflow: slice → applyOnNeighbors
  95. Summary
     • Many graph analysis problems can be covered in single-pass
     • Processing dynamic graphs requires an incremental graph processing model
     • We introduce Gelly-Stream, a simple yet powerful library for graph streams
