
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

Presentation at the Graph Devroom, FOSDEM'16.

Published in: Data & Analytics

  1. @GraphDevroom Single-pass Graph Stream Analytics with Apache Flink. Vasia Kalavri <vasia@apache.org>, Paris Carbone <senorcarbone@apache.org>
  2. Outline • Why Graph Streaming? • Single-Pass Algorithm Examples • Apache Flink Streaming API • The GellyStream API
  3. Real Graphs are dynamic: graphs are created from events happening in real time
  4. (no text on this slide)
  5. Batch Graph Processing: we create and analyze a snapshot of the real graph • the Facebook social network on January 30, 2016 • user web logs gathered between March 1st 12:00 and 16:00 • retweets and replies for 24h after the announcement of the death of David Bowie
  6. Streaming Graph Processing: we consume events in real time • Get results faster • No need to wait for the job to finish • Sometimes, early approximations are better than late exact answers • Get results continuously • Process an unbounded number of events
  7. Challenges • Maintain the graph structure • How to apply state updates efficiently? • Result updates • Re-run the analysis for each event? • Design an incremental algorithm? • Run separate instances on multiple snapshots? • Computation on most recent events only
  8. Single-Pass Graph Streaming • Each event is an edge addition • Maintains only a graph summary • Recent events are grouped in graph windows
  9-18. Streaming Degrees Distribution (figure sequence: an example graph on vertices 1 to 8 next to a histogram with axes "degree" and "#vertices")
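The degree distribution is a small example of a single-pass computation: each edge addition is consumed once and only a summary is kept. The following is a minimal plain-Java sketch, assuming undirected edges and long vertex IDs. The class `StreamingDegreeDistribution` and its methods are hypothetical names for illustration and are not part of Gelly-Stream or Flink.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not a Gelly-Stream class): keeps per-vertex degrees
// and a degree histogram, never the edges themselves.
class StreamingDegreeDistribution {
    private final Map<Long, Integer> degree = new HashMap<>();
    private final TreeMap<Integer, Integer> histogram = new TreeMap<>();

    // Consume one edge addition; both endpoints gain one degree (undirected assumption).
    void addEdge(long src, long trg) {
        bump(src);
        bump(trg);
    }

    private void bump(long vertex) {
        int old = degree.getOrDefault(vertex, 0);
        if (old > 0) {
            // The vertex leaves its old histogram bucket; empty buckets are removed.
            histogram.computeIfPresent(old, (d, n) -> n == 1 ? null : n - 1);
        }
        degree.put(vertex, old + 1);
        histogram.merge(old + 1, 1, Integer::sum);
    }

    // Returns a copy of the mapping: degree -> number of vertices with that degree.
    Map<Integer, Integer> distribution() {
        return new TreeMap<>(histogram);
    }
}
```

Each call to `addEdge` touches two map entries and two histogram buckets, so the state grows with the number of vertices rather than the number of edges.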
  19. Graph Summaries • spanners for distance estimation • sparsifiers for cut estimation • sketches for homomorphic properties (figure: the algorithm run on the full graph yields R1, the algorithm run on the graph summary yields R2, and R1 ~ R2)
  20. Window Aggregations: neighborhood aggregations on windows
  21. Examples
  22. Batch Connected Components • State: the graph and a component ID per vertex (initially equal to vertex ID) • Iterative Computation: For each vertex: • choose the min of neighbors’ component IDs and own component ID as new ID • if component ID changed since last iteration, notify neighbors
  23-26. Batch Connected Components (figure sequence for iterations i=0 to i=3: component IDs propagate over the example graph on vertices 1 to 8 until every vertex is labelled 1 or 6)
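The batch iteration described above can be sketched in plain Java as repeated min-label propagation. This is a simplified illustration, not Gelly's implementation: every vertex is re-evaluated in each iteration instead of only the notified ones, and the class name `BatchConnectedComponents` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of iterative min-label propagation on a static graph.
class BatchConnectedComponents {
    // edges: array of {src, trg} pairs; returns vertex -> component ID.
    static Map<Integer, Integer> run(int[][] edges) {
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        Map<Integer, Integer> component = new HashMap<>();
        for (int[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
            component.put(e[0], e[0]); // component ID is initially the vertex ID
            component.put(e[1], e[1]);
        }
        boolean changed = true;
        while (changed) { // one pass over the whole graph per iteration
            changed = false;
            Map<Integer, Integer> next = new HashMap<>(component);
            for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
                int own = component.get(entry.getKey());
                int best = own;
                for (int n : entry.getValue()) {
                    best = Math.min(best, component.get(n));
                }
                if (best < own) {
                    next.put(entry.getKey(), best);
                    changed = true;
                }
            }
            component = next;
        }
        return component;
    }
}
```

The loop needs the full edge set in memory and makes several passes over it, which is the cost the streaming variant avoids.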
  27. Stream Connected Components • State: a disjoint set data structure for the components • Computation: For each edge • if both endpoints are seen for the 1st time, create a component whose ID is the min of the two vertex IDs • if the endpoints are in different components, merge them and update the component ID to the min of the component IDs • if only one of the endpoints belongs to a component, add the other one to the same component
  28-38. Stream Connected Components (figure sequence: the edges 3-1, 5-2, 5-4, 7-6, 8-6, 4-2, 4-3, 8-7, 4-1 arrive one at a time and a "ComponentID / Vertices" table is updated after each edge, ending with component 1 = {1, 2, 3, 4, 5} and component 6 = {6, 7, 8})
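The per-edge rules for streaming connected components map directly onto a union-find (disjoint set) structure. Below is a minimal single-machine sketch in plain Java. The class `StreamingConnectedComponents` is a hypothetical name, and the choice to always keep the smaller ID as the root (so the component ID is the minimum vertex ID) is an assumption that mirrors the rule on the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the only state is a disjoint set over the vertices seen so far.
class StreamingConnectedComponents {
    private final Map<Integer, Integer> parent = new HashMap<>();

    private int find(int v) {
        parent.putIfAbsent(v, v); // a vertex seen for the first time is its own component
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path directly at the root.
        while (parent.get(v) != root) {
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    // Consume one edge addition.
    void addEdge(int src, int trg) {
        int a = find(src);
        int b = find(trg);
        if (a == b) {
            return; // both endpoints are already in the same component
        }
        // The smaller root wins, so the component ID stays the minimum vertex ID.
        if (a < b) {
            parent.put(b, a);
        } else {
            parent.put(a, b);
        }
    }

    // Component ID of a vertex (a vertex never seen before forms its own component).
    int componentOf(int v) {
        return find(v);
    }
}
```

Feeding the edge sequence from the figure (3-1, 5-2, 5-4, 7-6, 8-6, 4-2, 4-3, 8-7, 4-1) leaves vertices 1 to 5 in component 1 and vertices 6 to 8 in component 6.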
  39. Distributed Stream Connected Components
  40. Stream Bipartite Detection: similar to connected components, but • Each vertex is also assigned a sign, (+) or (-) • Edge endpoints must have different signs • When merging components, if flipping all signs doesn’t work => the graph is not bipartite
  41-45. Stream Bipartite Detection (figure sequence: two components, Cid=1 and Cid=5, whose vertices carry (+)/(-) signs; edge 3-5 arrives and the components are merged into Cid=1 after flipping the signs of one of them; edge 3-7 then arrives. Can’t flip signs and stay consistent => not bipartite!)
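One way to sketch the sign bookkeeping is a union-find in which every vertex stores its sign relative to its parent. This is an illustrative plain-Java variant under that assumption, not the Gelly-Stream implementation: instead of physically flipping all signs of a component on a merge, the flip is recorded once on the link between the two roots. The class `StreamingBipartiteness` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: disjoint set with a relative sign (parity) per vertex.
class StreamingBipartiteness {
    private final Map<Integer, Integer> parent = new HashMap<>();
    // parity 1 means "opposite sign to my parent", parity 0 means "same sign".
    private final Map<Integer, Integer> parity = new HashMap<>();
    private boolean bipartite = true;

    // Returns {root, sign of v relative to the root}.
    private int[] find(int v) {
        parent.putIfAbsent(v, v);
        parity.putIfAbsent(v, 0);
        int sign = 0;
        while (parent.get(v) != v) {
            sign ^= parity.get(v);
            v = parent.get(v);
        }
        return new int[] {v, sign};
    }

    // Consume one edge addition; returns whether the graph seen so far is bipartite.
    boolean addEdge(int src, int trg) {
        int[] a = find(src);
        int[] b = find(trg);
        if (a[0] == b[0]) {
            // Same component: the endpoints must already have different signs.
            if (a[1] == b[1]) {
                bipartite = false;
            }
        } else {
            // Merge: link the larger root under the smaller one and pick the
            // link parity so that src and trg end up with different signs.
            int lo = Math.min(a[0], b[0]);
            int hi = Math.max(a[0], b[0]);
            parent.put(hi, lo);
            parity.put(hi, a[1] ^ b[1] ^ 1);
        }
        return bipartite;
    }

    boolean isBipartite() {
        return bipartite;
    }
}
```

A merge of two different components always succeeds; only an edge inside one component, between two vertices of equal sign, reveals an odd cycle.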
  46. API Requirements • Continuous aggregations on edge streams • Global graph aggregations • Support for windowing
  47. The Apache Flink Stack: APIs (DataSet, DataStream), Execution (Distributed Dataflow), Deployment. DataSet: • Bounded Data Sources • Blocking Operations • Structured Iterations. DataStream: • Unbounded Data Sources • Continuous Operations • Asynchronous Iterations
  48. Unifying Data Processing. Job Manager • scheduling tasks • monitoring/recovery. Client • task pipelining • blocking • execution plan building • optimisation. Sources shown: HDFS, Kafka. DataSet<String> text = env.readTextFile("hdfs://…"); text.map(…).groupReduce(…)… DataStream<String> events = env.addSource(new KafkaConsumer(…)); events.map(…).filter(…).window(…).fold(…)…
  49. Data Streams as ADTs • Tasks are long running in a pipelined execution. • State is kept within tasks. • Transformations are applied per record or per window. DataStream • Transformations: map, flatmap, filter, union… • Aggregations: reduce, fold, sum • Partitioning: forward, broadcast, shuffle, keyBy • Sources/Sinks: custom or Kafka, Twitter, Collections…
  50. Working with Windows. Why windows? We are often interested in fresh data! Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events! 1) Sliding windows: myKeyStream.timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS)); 2) Tumbling windows: myKeyStream.timeWindow(Time.of(60, TimeUnit.SECONDS)); (figure: events at 15, 38, 65, 88 and 110 sec on a timeline from 0 to 120 sec, grouped into window buckets/panes and aggregated as SUM #1, SUM #2, SUM #3)
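To make the difference between the two window types concrete, the sketch below computes which windows an event timestamp falls into. It is plain Java with no Flink dependency and rests on stated assumptions: timestamps are non-negative seconds, windows are half-open intervals [start, start + size), and window starts are aligned to time 0. `WindowAssigner` and its methods are hypothetical names, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of window assignment for tumbling and sliding time windows.
class WindowAssigner {
    // Start of the single tumbling window of the given size that contains the timestamp.
    static long tumbling(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // Starts of every sliding window [start, start + size) that contains the timestamp,
    // latest window first. A new window starts every `slide` time units.
    static List<Long> sliding(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

With a size of 60 and a slide of 20, an event at second 65 belongs to the three windows starting at 60, 40 and 20, while a 60-second tumbling window places it only in the window starting at 60. Overlap is why a sliding window adds each event to several sums.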
  51. Example: myTextStream .flatMap(new Splitter()) //transformation .keyBy(0) //partitioning .window(Time.of(5, TimeUnit.MINUTES)) .sum(1) //rolling aggregation .setParallelism(4); counts.print(); (figure: pipeline flatMap -> window sum -> print; the input 10:48 - “cool, gelly is cool” yields <“gelly”,1>… <“cool”,2>, and the input 11:01 - “dataflow is cool too” yields <“dataflow”,1>… <“cool”,1>)
  52. Gelly on Streams. Gelly (on DataSet): • Static Graphs • Multi-Pass Algorithms • Full Computations. Gelly-Stream (on DataStream): • Dynamic Graphs • Single-Pass Algorithms • Approximate Computations
  53. Introducing Gelly-Stream: Gelly-Stream enriches the DataStream API with two new ADTs: • GraphStream: • A representation of a data stream of edges. • Edges can have state (e.g. weights). • Supports property streams, transformations and aggregations. • GraphWindow: • A “time-slice” of a graph stream. • It enables neighbourhood aggregations.
  54. GraphStream Operations. Property Streams (GraphStream -> DataStream): .getEdges() .getVertices() .numberOfVertices() .numberOfEdges() .getDegrees() .inDegrees() .outDegrees(). Transformations (GraphStream -> GraphStream): .mapEdges(); .distinct(); .filterVertices(); .filterEdges(); .reverse(); .undirected(); .union();
  55. Graph Stream Aggregations: graphStream.aggregate(new MyGraphAggregation(window, fold, combine, transform)). (figure: the edges of the graph stream, optionally windowed, are folded into local summaries; the local summaries are combined with a reduce into a global summary; the aggregate is emitted as a property stream.) Global aggregates can be persistent or transient.
  56. Graph Stream Aggregations (continued): graphStream.aggregate(new MyGraphAggregation(window, fold, combine, transform)). (figure: same pipeline with the final step added; fold runs as a fold over the edges, combine as a reduce over the local summaries, and transform as a map from the global summary to the result.)
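The fold / combine / transform pattern can be sketched without Flink as three functions applied to partitions of an edge list. This is an illustration of the pattern only: `SummaryAggregation` and `VertexCountExample` are hypothetical classes, the partitions stand in for parallel tasks, windowing is left out, and none of the signatures are taken from the Gelly-Stream API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of a summary aggregation: S is the summary type, R the result type.
class SummaryAggregation<S, R> {
    private final Supplier<S> initial;              // creates an empty summary
    private final BiFunction<S, long[], S> fold;    // summary + edge {src, trg} -> summary
    private final BinaryOperator<S> combine;        // merges two summaries
    private final Function<S, R> transform;         // global summary -> result

    SummaryAggregation(Supplier<S> initial, BiFunction<S, long[], S> fold,
                       BinaryOperator<S> combine, Function<S, R> transform) {
        this.initial = initial;
        this.fold = fold;
        this.combine = combine;
        this.transform = transform;
    }

    // partitions: one list of edges per simulated parallel task.
    R run(List<List<long[]>> partitions) {
        S global = initial.get();
        for (List<long[]> partition : partitions) {
            S local = initial.get();
            for (long[] edge : partition) {
                local = fold.apply(local, edge);     // build the local summary
            }
            global = combine.apply(global, local);   // merge it into the global summary
        }
        return transform.apply(global);
    }
}

// Example use: count the distinct vertices seen on the edge stream.
class VertexCountExample {
    static int distinctVertices(List<List<long[]>> partitions) {
        SummaryAggregation<Set<Long>, Integer> agg = new SummaryAggregation<>(
            HashSet::new,
            (summary, edge) -> { summary.add(edge[0]); summary.add(edge[1]); return summary; },
            (a, b) -> { a.addAll(b); return a; },
            Set::size);
        return agg.run(partitions);
    }
}
```

The pattern works when combine is associative and commutative, so that the order in which local summaries arrive does not change the global summary.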
  57-68. Connected Components: graphStream.aggregate(new ConnectedComponents(window, fold, combine, transform)). (figure sequence: the edges 3-1, 5-2, 5-4, 7-6, 8-6 are folded into local component sets {1,3}, {2,5}, {4,5}, {6,7}, {6,8}; when the window triggers, these are combined into the global summary {1,3}, {2,4,5}, {6,7,8} and #components = 3 is emitted. The edges 4-2, 4-3, 4-1, 8-7 of the next window are folded and combined in the same way, giving {1,2,3,4,5}, {6,7,8} and #components = 2.)
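The windowed connected-components aggregation differs from the purely sequential version in its combine step, where two partial component summaries have to be merged. A minimal plain-Java sketch of that step follows, assuming each summary is a disjoint set stored as a parent map. `ComponentSummary` is a hypothetical class, and its method names merely echo the fold, combine and transform roles from the slide.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a mergeable connected-components summary.
class ComponentSummary {
    private final Map<Integer, Integer> parent = new HashMap<>();

    // Root (component ID) of a vertex; unseen vertices become their own component.
    int find(int v) {
        parent.putIfAbsent(v, v);
        while (parent.get(v) != v) {
            v = parent.get(v);
        }
        return v;
    }

    // fold: add one edge to this summary; the smaller root becomes the component ID.
    ComponentSummary fold(int src, int trg) {
        int a = find(src);
        int b = find(trg);
        if (a != b) {
            parent.put(Math.max(a, b), Math.min(a, b));
        }
        return this;
    }

    // combine: replay every (vertex, parent) link of the other summary as an edge.
    ComponentSummary combine(ComponentSummary other) {
        for (Map.Entry<Integer, Integer> link : other.parent.entrySet()) {
            fold(link.getKey(), link.getValue());
        }
        return this;
    }

    // transform: the number of components in this summary.
    int numberOfComponents() {
        Set<Integer> roots = new HashSet<>();
        for (int v : new HashSet<>(parent.keySet())) {
            roots.add(find(v));
        }
        return roots.size();
    }
}
```

Because a summary holds one link per vertex, combining costs time proportional to the number of vertices in the smaller summary's map, not to the number of edges that produced it.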
  69. Aggregating Slices: graphStream.slice(Time.of(1, MINUTE), direction), followed by one of the aggregations .reduceOnEdges(); .foldNeighbors(); .applyOnNeighbors(); • Slicing collocates edges by vertex information (source or target) • Neighbourhood aggregations are now enabled on sliced graphs
  70. Finding Matches Nearby: graphStream.filterVertices(GraphGeeks()) .slice(Time.of(15, MINUTE), EdgeDirection.IN) .applyOnNeighbors(FindPairs()). GraphStream :: graph geek check-ins: wendy checked_in soap_bar; steve checked_in soap_bar; tom checked_in joe’s_grill; sandra checked_in soap_bar; rafa checked_in joe’s_grill. After slice, GraphWindow :: user-place: soap bar <- wendy, steve, sandra; joe’s grill <- tom, rafa. FindPairs: {wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}
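The effect of slicing by incoming edges and then applying a function on each neighbourhood can be imitated in plain Java by grouping check-ins by place and emitting every pair of visitors. This is a sketch under the assumption that all check-ins passed in belong to one 15-minute slice and already passed the vertex filter. `FindPairsSketch` is a hypothetical name and does not reproduce the `FindPairs` function from the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: neighbourhood aggregation on the target vertex (the place).
class FindPairsSketch {
    // checkIns: {user, place} in arrival order, all assumed to fall into one slice.
    static List<String> pairs(String[][] checkIns) {
        // "Slice": collocate the edges by their target vertex.
        Map<String, List<String>> visitorsByPlace = new LinkedHashMap<>();
        for (String[] checkIn : checkIns) {
            visitorsByPlace.computeIfAbsent(checkIn[1], k -> new ArrayList<>()).add(checkIn[0]);
        }
        // "Apply on neighbors": emit each unordered pair of visitors of the same place.
        List<String> result = new ArrayList<>();
        for (List<String> visitors : visitorsByPlace.values()) {
            for (int i = 0; i < visitors.size(); i++) {
                for (int j = i + 1; j < visitors.size(); j++) {
                    result.add("{" + visitors.get(i) + ", " + visitors.get(j) + "}");
                }
            }
        }
        return result;
    }
}
```

For the five check-ins on the slide this yields the three pairs at the soap bar and the single pair at joe's grill.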
  71. Feeling Gelly? • Gelly-Stream: https://github.com/vasia/gelly-streaming • Apache Flink: https://github.com/apache/flink • An interesting read: http://users.dcc.uchile.cl/~pbarcelo/mcg.pdf • A cool thesis: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-170425 • Twitter: @vkalavri, @senorcarbone