Graph Stream Processing : spinning fast, large scale, complex analytics

Full Video: https://www.youtube.com/watch?v=cOShsisEsC0
An overview of the relation and combination of three data processing paradigms that are becoming more relevant today. It introduces the essentials of graph, distributed and stream computing and beyond. Furthermore, it questions the fundamental problems that we want to solve with data analysis, and the potential of eventually saving humankind in the next millennium by improving the state of the art of computation technologies, all while being too busy answering first-world-problem questions. Crazy but possible.

Published in: Data & Analytics

  1. 1. Graph Stream Processing: spinning fast, large-scale, complex analytics. Paris Carbone, PhD Candidate @ KTH, Committer @ Apache Flink
  2. 2. We want to analyse….
  3. 3. We want to analyse…. data
  4. 4. We want to analyse…. complex data
  5. 5. We want to analyse…. large-scale, complex data
  6. 6. We want to analyse…. fast, large-scale, complex data
  7. 7. But why do we need large-scale, complex and fast data analysis? >
  8. 8. But why do we need large-scale, complex and fast data analysis? > to answer big complex questions faster
  9. 9. > to answer big complex questions faster
  10. 10. > to answer big complex questions faster > Hej Siri_
  11. 11. > to answer big complex questions faster > Hej Siri_ Get me the best route to work right now
  12. 12. > to answer big complex questions faster > Hej Siri_ Get me the best route to work right now …with the fewest human drivers
  13. 13. > to answer big complex questions faster > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Lookup a pizza recipe all of my friends like but did not eat yesterday…
  14. 14. > to answer big complex questions faster > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Lookup a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday
  15. 15. > to answer big complex questions faster > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Lookup a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. oh! And no kebab pizza!
  16. 16. > to answer big complex questions faster > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Lookup a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. oh! And no kebab pizza! Siri, is it possible to re-unite all data scientists in the world?
  17. 17. > to answer big complex questions faster > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Lookup a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. oh! And no kebab pizza! Siri, is it possible to re-unite all data scientists in the world? no matter if they use Spark or Flink or just ipython
  19. 19. > to answer big complex questions faster: best route to work right now …with the fewest human drivers. Lookup a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. oh! And no kebab pizza! re-unite all data scientists in the world? no matter if they use Spark or Flink or just ipython. 3000 AD
  21. 21. > to answer big complex questions faster: best route to work right now …with the fewest human drivers. Lookup a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. oh! And no kebab pizza! re-unite all data scientists in the world? no matter if they use Spark or Flink or just ipython. FIRST WORLD PROBLEM. 3000 AD
  22. 22. > to answer big complex questions faster: best route to work right now …with the fewest human drivers. oh! And no kebab pizza! re-unite all data scientists in the world? use Spark or Flink or just ipython. 30000 AD
  23. 23. > to answer big complex questions faster: best route to work right now …with the fewest human drivers. oh! And no kebab pizza! re-unite all data scientists in the world? use Spark or Flink or just ipython. FIRST EARTH WORLD PROBLEM. 30000 AD
  24. 24. Still, fast analytics might save us some day… • We can access patient movements and fb, twitter and pretty much all social media interactions • Can we stop a pandemic? • Or can we predict fast where the virus can spread?
  25. 25. Now how do we analyse… fast, complex, large-scale data?
  26. 26. Now how do we analyse… graph, distributed, streaming data
  27. 27. Now how do we analyse… graph (everything is a graph), distributed, streaming data
  28. 28. Now how do we analyse… graph (everything is a graph), distributed (everything is many), streaming data
  29. 29. Now how do we analyse… graph (everything is a graph), distributed (everything is many), streaming (everything is a stream) data
  30. 30. it all started… as a first world problem question
  31. 31. but then things escalated quickly… …and machinery got cheaper and we suddenly realised that we have big data
  32. 32. Thus, Distributed Graph processing was born
  33. 33. Thus, Distributed Graph processing was born. MapReduce: 1. Store partitioned data 2. Send local computation (map) 3. Now shuffle it on disks 4. Merge the results (reduce) 5. Store the result back. DFS: distributed file system
  34. 34. Thus, Distributed Graph processing was born. MapReduce: 1. Store partitioned data 2. Send local computation (map) 3. Now shuffle it on disks 4. Merge the results (reduce) 5. Store the result back. Distributed Graph Processing: 1. Store updates to DFS 2. Load graph snapshot (mem) 3. Compute round ~ superstep 4. Store updates 5. …repeat. DFS: distributed file system
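The five MapReduce steps on the slide can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not how any real framework is implemented: the names `map_phase`, `shuffle` and `reduce_phase`, the two hard-coded partitions, and the choice of a vertex-degree job are all hypothetical, and plain in-memory lists stand in for the distributed file system.

```python
from collections import defaultdict

# Step 1: partitioned input (assumed: two partitions of an edge list).
partitions = [
    [(1, 3), (3, 4)],
    [(4, 2), (2, 5), (4, 5)],
]

def map_phase(partition):
    """Step 2: local computation, emit (vertex, 1) for each edge endpoint."""
    for u, v in partition:
        yield u, 1
        yield v, 1

def shuffle(mapped):
    """Step 3: group all emitted values by key (done on disk in a real system)."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Step 4: merge the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

mapped = [pair for part in partitions for pair in map_phase(part)]
# Step 5: in a real deployment this result would be written back to the DFS.
degrees = reduce_phase(shuffle(mapped))
```

Every iteration of a graph algorithm built this way repeats the full store, shuffle and reload cycle, which is the cost the following slides criticise.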
  40. 40. Distributed Graph processing example • We want to compute the Connected Components of a distributed graph. • Basic computation element (map): vertex • Updates: messages to other vertices
  41. 41. Distributed Graph processing example • We want to compute the Connected Components of a distributed graph. • Basic computation element (map): vertex • Updates: messages to other vertices [Figure: vertices 1, 2, 3]
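  42. 42. Distributed Graph processing example. ROUND 0. [Figure: vertex labels 1, 4, 3, 2, 5 and 6, 7, 8]
  43. 43. Distributed Graph processing example. ROUND 0. [Figure: vertex labels 1, 4, 3, 2, 5 and 6, 7, 8, with messages carrying vertex ids in flight]
  44. 44. Distributed Graph processing example. ROUND 1. [Figure: vertex labels 1, 2, 1, 2, 2 and 6, 6, 6]
  45. 45. Distributed Graph processing example. ROUND 1. [Figure: vertex labels 1, 2, 1, 2, 2 and 6, 6, 6, with messages carrying labels 1, 2 and 6 in flight]
  46. 46. Distributed Graph processing example. ROUND 2. [Figure: vertex labels 1, 1, 1, 2, 2 and 6, 6, 6]
  47. 47. Distributed Graph processing example. ROUND 2. [Figure: vertex labels 1, 1, 1, 2, 2 and 6, 6, 6, with messages 1, 1, 1 in flight]
  48. 48. Distributed Graph processing example. ROUND 3. [Figure: vertex labels 1, 1, 1, 1, 1 and 6, 6, 6]
  49. 49. Distributed Graph processing example. ROUND 3. [Figure: vertex labels 1, 1, 1, 1, 1 and 6, 6, 6, with messages 1, 1, 1, 1 in flight]
  50. 50. Distributed Graph processing example. ROUND 4. [Figure: vertex labels 1, 1, 1, 1, 1 and 6, 6, 6] No messages, DONE!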
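The rounds on slides 42 to 50 look like minimum-label propagation: each vertex starts with its own id, adopts the smallest label it receives, and messages its neighbours only when its label changed. Below is a minimal single-process sketch of that vertex-centric loop. The edge list is an assumption, reconstructed so that the per-round labels match the ones on the slides; `connected_components` is a hypothetical name and does not come from Pregel or any other system named in the deck.

```python
from collections import defaultdict

def connected_components(edges):
    """Synchronous min-label propagation; returns (labels, rounds)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Round 0: every vertex is labelled with its own id and sends it out.
    labels = {v: v for v in neighbors}
    inbox = defaultdict(list)
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    rounds = 0
    while inbox:  # stop once a round produced no messages
        rounds += 1
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)
            if best < labels[v]:
                # Label improved: adopt it and notify all neighbours.
                labels[v] = best
                for n in neighbors[v]:
                    outbox[n].append(best)
        inbox = outbox
    return labels, rounds

# Assumed graph: one component over vertices 1..5 and a triangle 6, 7, 8.
edges = [(1, 3), (3, 4), (4, 2), (2, 5), (4, 5), (6, 7), (6, 8), (7, 8)]
labels, rounds = connected_components(edges)
```

On this assumed graph the loop finishes in round 4, when no vertex changes its label and no messages are sent, with vertices 1 to 5 labelled 1 and vertices 6 to 8 labelled 6. Each vertex reads only its inbox, never another vertex's current label, which is what makes the rounds synchronous supersteps.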
  51. 51. Distributed Graph processing systems • Examples of Load-Compute-Store systems: Pregel, GraphX (Spark), GraphLab, PowerGraph • Same execution strategy, same problems • It’s slow • Too much re-computation ($€) for nothing • Real-world updates, anyone?
  52. 52. …and streaming came to mess everything make fast and simple
  53. 53. …and streaming came to mess everything make fast and simple real world
  54. 54. …and streaming came to mess everything make fast and simple real world event records • local state stays here • local computation too The Dataflow™
  55. 55. Streaming is so advanced that… • subsecond latency and high throughput finally coexist • it does fault tolerance without batch writes* • late data** is handled gracefully * https://arxiv.org/abs/1506.08603 ** http://dl.acm.org/citation.cfm?id=2824076
  56. 56. Streaming is so advanced that… • subsecond latency and high throughput finally coexist • it does fault tolerance without batch writes* • late data** is handled gracefully …but what about complex problems? * https://arxiv.org/abs/1506.08603 ** http://dl.acm.org/citation.cfm?id=2824076
  57. 57. can we make it happen?
  58. 58. can we make it happen? • Problem: Can’t keep an infinite graph in-memory and do complex stuff
  59. 59. can we make it happen? • Problem: Can’t keep an infinite graph in-memory and do complex stuff [Figure: universe ??]
  60. 60. can we make it happen? • Problem: Can’t keep an infinite graph in-memory and do complex stuff [Figure: universe ??] > it was never about the graph silly, it was about answering complex questions, remember?
  61. 61. can we make it happen? • Problem: Can’t keep an infinite graph in-memory and do complex stuff > it was never about the graph silly, it was about answering complex questions, remember? [Figure: universe ;) universe summary, answers]
  62. 62. Examples of Summaries • Spanners: distance estimation • Sparsifiers: cut estimation • Sketches: homomorphic properties [Figure: graph, algorithm, R1; summary, algorithm, R2; R1 ~ R2]
  63. 63. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 54 76 86 42 31 52]
  64. 64. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 31; 54 76 86 42 43 31 52; 1]
  65. 65. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 52; 54 76 86 42 43 87 52; 31 1 2]
  66. 66. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 52 4; 54 76 86 42 43 87 41; 31 1 2]
  67. 67. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 52 4; 76 86 42 43 87 41; 31 1 76 2 6]
  68. 68. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 52 4 8; 86 42 43 87 41; 31 1 76 2 6]
  69. 69. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 8 52 4 76; 42 43 87 41; 31 1 2 6]
  70. 70. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 8 52 4 76; 42 43 87 41; 31 1 2 6]
  71. 71. Distributed Graph streaming example: Connected Components on a stream of edges (additions). [Figure labels: 8 52 4 76; 43 87 41; 31 1 6]
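One way to read the streaming example: each edge is consumed once, folded into a small summary, and then discarded, so the graph itself is never stored. A disjoint-set (union-find) structure is a common choice for such a connected-components summary. The sketch below assumes that choice and an arbitrary arrival order for the edges that appear on the slides; the `DisjointSet` class is hypothetical and is not taken from Gelly-Stream.

```python
class DisjointSet:
    """Connected-components summary: one parent pointer per seen vertex."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Unseen vertices start as their own component.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Link the larger root under the smaller, so the
            # component id is always its smallest vertex.
            self.parent[max(ra, rb)] = min(ra, rb)

    def components(self):
        groups = {}
        for v in list(self.parent):
            groups.setdefault(self.find(v), set()).add(v)
        return groups

# Edge additions from the slides; the arrival order is an assumption.
stream = [(3, 1), (5, 2), (5, 4), (7, 6), (8, 6), (4, 2), (4, 3), (8, 7), (4, 1)]

summary = DisjointSet()
for u, v in stream:
    summary.union(u, v)  # single pass: the edge is dropped afterwards

result = summary.components()
```

The summary holds one entry per vertex rather than per edge, and the current components can be read from it at any point in the stream instead of only after a batch job finishes.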
  72. 72. But Is this Efficient? Sure, we can distribute the edges and summaries
  73. 73. But Is this Efficient? Sure, we can distribute the edges and summaries any systems in mind?
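Distributing the summaries only works if two partial summaries can be combined into one. For a connected-components summary this is possible because every (vertex, parent) entry of one summary can be replayed as an edge into the other. The sketch below illustrates that idea under assumptions: the helper names, the dictionary representation and the split of the edges into two partitions are all hypothetical, and this is not a description of how Gelly-Stream merges state.

```python
def find(parent, x):
    """Return the root of x, registering x if it was never seen."""
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x

def add_edge(parent, u, v):
    """Fold one edge into the summary; the smaller root wins."""
    ru, rv = find(parent, u), find(parent, v)
    if ru != rv:
        parent[max(ru, rv)] = min(ru, rv)

def summarize(edges):
    """Build a partial summary from one partition of the edge stream."""
    parent = {}
    for u, v in edges:
        add_edge(parent, u, v)
    return parent

def merge(left, right):
    """Combine two summaries by replaying right's parent links as edges."""
    merged = dict(left)
    for vertex, its_parent in right.items():
        add_edge(merged, vertex, its_parent)
    return merged

# Assumed split of the edge stream across two workers.
worker_a = summarize([(3, 1), (5, 4), (8, 6), (4, 3)])
worker_b = summarize([(5, 2), (7, 6), (4, 2), (8, 7), (4, 1)])
merged = merge(worker_a, worker_b)
roots = {v: find(merged, v) for v in range(1, 9)}
```

Neither worker sees the whole component over vertices 1 to 5 on its own, but the merged summary does. Only the summaries travel between workers, not the edges.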
  74. 74. Gelly Stream Graph stream processing with Apache Flink
  75. 75. Gelly Stream Overview. [Stack diagram: Gelly on DataSet, Gelly-Stream on DataStream, both on top of Distributed Dataflow and Deployment.] Gelly: ➤ Static Graphs ➤ Multi-Pass Algorithms ➤ Full Computations. Gelly-Stream: ➤ Dynamic Graphs ➤ Single-Pass Algorithms ➤ Approximate Computations
  76. 76. Gelly Stream Status ➤ Properties and Metrics ➤ Transformations ➤ Aggregations ➤ Discretization ➤ Neighborhood Aggregations ➤ Graph Streaming Algorithms ➤ Connected Components ➤ Bipartiteness Check ➤ Window Triangle Count ➤ Triangle Count Estimation ➤ Continuous Degree Aggregate
  77. 77. wait, so now we can detect connected components right away?
  79. 79. wait, so now we can detect connected components right away? Solved! But how about our other issues now?
  80. 80. > Hej Siri_ Siri, is it possible to re-unite all data scientists in the world? no matter if they use Spark or Flink or just ipython >
  82. 82. Gelly-Stream to the rescue: graphStream.filterVertices(DataScientists()) .slice(Time.of(10, MINUTE), EdgeDirection.IN) .applyOnNeighbors(FindPairs()) Input stream: wendy checked_in glaze; steve checked_in glaze; tom checked_in joe’s_grill; sandra checked_in glaze; rafa checked_in joe’s_grill. [Figure: wendy, steve, sandra linked to glaze; tom, rafa linked to joe’s grill.] Output: {wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}
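The pipeline on slide 82 filters the stream, slices it into 10-minute windows, and applies a function to each place's incoming neighbours to emit pairs of people. A plain-Python sketch of that last step is shown below. It does not use or reproduce the Gelly-Stream API: `find_pairs` is a hypothetical helper, and the sketch assumes that all five check-ins from the slide fall into the same window and that every person has already passed the data-scientist filter.

```python
from collections import defaultdict
from itertools import combinations

def find_pairs(check_ins):
    """Pair up people who checked in at the same place within one window."""
    visitors = defaultdict(list)
    for person, place in check_ins:
        # Neighbourhood of a place = the people with an edge into it.
        if person not in visitors[place]:
            visitors[place].append(person)

    pairs = set()
    for people in visitors.values():
        for a, b in combinations(people, 2):
            pairs.add(frozenset((a, b)))  # unordered pair
    return pairs

# The check-in events from the slide, assumed to share one 10-minute window.
window = [
    ("wendy", "glaze"),
    ("steve", "glaze"),
    ("tom", "joe's_grill"),
    ("sandra", "glaze"),
    ("rafa", "joe's_grill"),
]
pairs = find_pairs(window)
```

For this window the result is the four pairs listed on the slide: {wendy, steve}, {steve, sandra}, {wendy, sandra} and {tom, rafa}.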
  83. 83. > Hej Siri_ Siri, is it possible to re-unite all data scientists in the world? no matter if they use Spark or Flink or just ipython >
  84. 84. > Hej Siri_ Siri, is it possible to re-unite all data scientists in the world? no matter if they use Spark or Flink or just ipython > yes
  85. 85. The next step: Large-scale, Complex, Fast, Deep Analytics • Iterative model* on streams for deeper analytics • More Summaries • Better Out-of-Core State Integration • Ad Hoc Graph Queries * http://dl.acm.org/citation.cfm?id=2983551
  86. 86. Try out Gelly-Stream* because all questions matter @SenorCarbone *https://github.com/vasia/gelly-streaming