An overview of how three data processing paradigms relate and combine, a combination that is becoming more relevant today. It introduces the essentials of graph, distributed, and stream computing, and looks beyond them. It also questions the fundamental problems that we want to solve with data analysis, and the potential of eventually saving humankind in the next millennium by improving the state of the art in computation technologies while being too busy answering first world problem questions. Crazy but possible.

Published in:
Data & Analytics

License: CC Attribution License

- 1. Graph Stream Processing: spinning fast, large-scale, complex analytics. Paris Carbone, PhD Candidate @ KTH, Committer @ Apache Flink
- 2. We want to analyse…
- 3. We want to analyse… data
- 4. We want to analyse… complex data
- 5. We want to analyse… complex, large-scale data
- 6. We want to analyse… fast, complex, large-scale data
- 7. But why do we need large-scale, complex and fast data analysis?
- 8. But why do we need large-scale, complex and fast data analysis? To answer big complex questions faster
- 9. To answer big complex questions faster
- 10. To answer big complex questions faster: > Hej Siri_
- 11. To answer big complex questions faster: > Hej Siri_ Get me the best route to work right now
- 12. To answer big complex questions faster: > Hej Siri_ Get me the best route to work right now …with the fewest human drivers
- 13. To answer big complex questions faster: > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Look up a pizza recipe all of my friends like but did not eat yesterday…
- 14. To answer big complex questions faster: > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday
- 15. To answer big complex questions faster: > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. Oh! And no kebab pizza!
- 16. To answer big complex questions faster: > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. Oh! And no kebab pizza! Siri, is it possible to re-unite all data scientists in the world?
- 17. To answer big complex questions faster: > Hej Siri_ Get me the best route to work right now …with the fewest human drivers. Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. Oh! And no kebab pizza! Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just ipython
- 19. To answer big complex questions faster (3000 AD): best route to work right now …with the fewest human drivers. Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. Oh! And no kebab pizza! Re-unite all data scientists in the world? No matter if they use Spark or Flink or just ipython
- 21. To answer big complex questions faster (3000 AD): FIRST WORLD PROBLEM. Best route to work right now …with the fewest human drivers. Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. Oh! And no kebab pizza! Re-unite all data scientists in the world? No matter if they use Spark or Flink or just ipython
- 22. To answer big complex questions faster (30000 AD): best route to work right now …with the fewest human drivers. Oh! And no kebab pizza! Re-unite all data scientists in the world? Use Spark or Flink or just ipython
- 23. To answer big complex questions faster (30000 AD): FIRST EARTH WORLD PROBLEM. Best route to work right now …with the fewest human drivers. Oh! And no kebab pizza! Re-unite all data scientists in the world? Use Spark or Flink or just ipython
- 24. Still, fast analytics might save us some day… • We can access patient movements and fb, twitter and pretty much all social media interactions • Can we stop a pandemic? • Or can we predict fast where the virus can spread?
- 25. Now how do we analyse… fast, complex, large-scale data?
- 26. Now how do we analyse… data: graph, distributed, streaming
- 27. Now how do we analyse… data: graph (everything is a graph), distributed, streaming
- 28. Now how do we analyse… data: graph (everything is a graph), distributed (everything is many), streaming
- 29. Now how do we analyse… data: graph (everything is a graph), distributed (everything is many), streaming (everything is a stream)
- 30. It all started… as a first world problem question
- 31. But then things escalated quickly… …and machinery got cheaper and we suddenly realised that we have big data
- 32. Thus, distributed graph processing was born
- 33. Thus, distributed graph processing was born. MapReduce: 1. Store partitioned data 2. Send local computation (map) 3. Now shuffle it on disks 4. Merge the results (reduce) 5. Store the result back. DFS: distributed file system
- 34. Thus, distributed graph processing was born. MapReduce: 1. Store partitioned data 2. Send local computation (map) 3. Now shuffle it on disks 4. Merge the results (reduce) 5. Store the result back. Distributed Graph Processing: 1. Store updates to DFS 2. Load graph snapshot (mem) 3. Compute round~superstep 4. Store updates 5. …repeat. DFS: distributed file system
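
The MapReduce steps listed on the slide can be illustrated with a minimal in-memory sketch. It is not taken from the talk: the task (counting vertex degrees), the function names, and the sample partitions are assumptions made for illustration, and a real system would shuffle through disks and a distributed file system rather than a Python dictionary.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: runs locally on one partition, emits (vertex, 1) per edge endpoint."""
    for u, v in partition:
        yield u, 1
        yield v, 1


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key (done on disk in a real system)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: merge the grouped values of each key into one result."""
    return {vertex: sum(ones) for vertex, ones in groups.items()}


def vertex_degrees(partitions):
    """Run map on every partition, then shuffle and reduce the combined output."""
    mapped = (pair for partition in partitions for pair in map_phase(partition))
    return reduce_phase(shuffle(mapped))


# Hypothetical edge list, stored as two partitions.
partitions = [[(1, 3), (3, 4)], [(4, 2), (2, 5)]]
degrees = vertex_degrees(partitions)
```

Each phase only sees its own input, which is what allows the map step to run where the data is stored.
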
- 40. Distributed graph processing example: • We want to compute the Connected Components of a distributed graph. • Basic computation element (map): vertex • Updates: messages to other vertices
- 41. Distributed graph processing example: • We want to compute the Connected Components of a distributed graph. • Basic computation element (map): vertex • Updates: messages to other vertices (figure: vertices 1, 2, 3)
- 42. Distributed graph processing example, ROUND 0 (figure: vertex labels 1, 4, 3, 2, 5, 6, 7, 8)
- 43. Distributed graph processing example, ROUND 0 (figure: each vertex sends its label to its neighbours)
- 44. Distributed graph processing example, ROUND 1 (figure: vertex labels 1, 2, 1, 2, 2 and 6, 6, 6)
- 45. Distributed graph processing example, ROUND 1 (figure: updated vertices send their labels to their neighbours)
- 46. Distributed graph processing example, ROUND 2 (figure: vertex labels 1, 1, 1, 2, 2 and 6, 6, 6)
- 47. Distributed graph processing example, ROUND 2 (figure: updated vertices send label 1 to their neighbours)
- 48. Distributed graph processing example, ROUND 3 (figure: vertex labels 1, 1, 1, 1, 1 and 6, 6, 6)
- 49. Distributed graph processing example, ROUND 3 (figure: updated vertices send label 1 to their neighbours)
- 50. Distributed graph processing example, ROUND 4 (figure: vertex labels 1, 1, 1, 1, 1 and 6, 6, 6). No messages, DONE!
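
The rounds above follow a vertex-centric pattern that can be sketched in a few lines of plain Python. This is an illustrative sketch only, not code from the talk or from any of the systems it names: the function name and the edge list are assumptions (the edges are borrowed from the streaming slides later in the deck), so the number of rounds need not match the figures.

```python
from collections import defaultdict


def connected_components_supersteps(edges):
    """Label every vertex with the smallest vertex ID in its component.

    Each round, vertices whose label changed send it to their neighbours;
    a vertex adopts the smallest label it receives if that beats its own.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    labels = {vertex: vertex for vertex in neighbours}  # round 0: own ID
    active = set(labels)  # vertices that still have something to send
    rounds = 0
    while active:
        inbox = defaultdict(list)
        for vertex in active:
            for neighbour in neighbours[vertex]:
                inbox[neighbour].append(labels[vertex])
        active = set()
        for vertex, messages in inbox.items():
            smallest = min(messages)
            if smallest < labels[vertex]:
                labels[vertex] = smallest
                active.add(vertex)  # label changed, so notify neighbours
        rounds += 1
    return labels, rounds


# Assumed sample graph with two components: {1, 2, 3, 4, 5} and {6, 7, 8}.
edges = [(3, 1), (5, 2), (5, 4), (7, 6), (8, 6), (4, 2), (4, 3), (8, 7)]
labels, rounds = connected_components_supersteps(edges)
```

Note that every round touches the whole set of active vertices, and a batch system would also load and store the graph around these rounds, which is the cost the next slides complain about.
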
- 51. Distributed graph processing systems: • Examples of Load-Compute-Store systems: Pregel, GraphX (Spark), GraphLab, PowerGraph • Same execution strategy, same problems • It's slow • Too much re-computation ($€) for nothing • Real-world updates, anyone?
- 52. …and streaming came to (mess) make everything fast and simple
- 53. …and streaming came to (mess) make everything fast and simple: real world
- 54. …and streaming came to (mess) make everything fast and simple: real world. The Dataflow™: event records • local state stays here • local computation too
- 55. Streaming is so advanced that… • subsecond latency and high throughput finally coexist • it does fault tolerance without batch writes* • late data** is handled gracefully. * https://arxiv.org/abs/1506.08603 ** http://dl.acm.org/citation.cfm?id=2824076
- 56. Streaming is so advanced that… • subsecond latency and high throughput finally coexist • it does fault tolerance without batch writes* • late data** is handled gracefully …but what about complex problems? * https://arxiv.org/abs/1506.08603 ** http://dl.acm.org/citation.cfm?id=2824076
- 57. can we make it happen?
- 58. can we make it happen? • Problem: Can't keep an infinite graph in memory and do complex stuff
- 59. can we make it happen? • Problem: Can't keep an infinite graph in memory and do complex stuff (figure: universe ??)
- 60. can we make it happen? • Problem: Can't keep an infinite graph in memory and do complex stuff (figure: universe ??) > it was never about the graph silly, it was about answering complex questions, remember?
- 61. can we make it happen? • Problem: Can't keep an infinite graph in memory and do complex stuff > it was never about the graph silly, it was about answering complex questions, remember? (figure: universe, universe summary, answers ;) )
- 62. Examples of Summaries • Spanners: distance estimation • Sparsifiers: cut estimation • Sketches: homomorphic properties (figure: graph → algorithm → R1; summary → algorithm → R2; R1 ~ R2)
- 63. Distributed graph streaming example: Connected Components on a stream of edges (additions). Edge stream: 5-4, 7-6, 8-6, 4-2, 3-1, 5-2
- 64 to 71. Distributed graph streaming example: Connected Components on a stream of edges (additions). (figure, animation steps: edges such as 4-3, 8-7 and 4-1 keep arriving, each edge is consumed once and folded into the component summary, which ends with two components labelled 1 and 6)
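
One common way to keep such a component summary is a disjoint-set (union-find) structure that sees every edge exactly once. The sketch below is an assumption about how the idea can be realised in plain Python; it is not the Gelly-Stream implementation, and the class name, method names, and edge order are hypothetical.

```python
class ComponentSummary:
    """Disjoint-set summary of a stream of edge additions."""

    def __init__(self):
        self.parent = {}

    def find(self, vertex):
        """Return the representative (smallest vertex ID) of the vertex's component."""
        self.parent.setdefault(vertex, vertex)
        root = vertex
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[vertex] != root:  # path compression
            next_vertex = self.parent[vertex]
            self.parent[vertex] = root
            vertex = next_vertex
        return root

    def add_edge(self, u, v):
        """Fold one edge into the summary; the edge itself is not stored."""
        root_u, root_v = self.find(u), self.find(v)
        if root_u != root_v:
            self.parent[max(root_u, root_v)] = min(root_u, root_v)

    def merge(self, other):
        """Combine a summary built on another partition of the stream."""
        for vertex, parent in other.parent.items():
            self.add_edge(vertex, parent)

    def components(self):
        groups = {}
        for vertex in list(self.parent):
            groups.setdefault(self.find(vertex), set()).add(vertex)
        return groups


# Assumed arrival order of the edges shown on the slides.
stream = [(3, 1), (5, 2), (5, 4), (7, 6), (8, 6), (4, 2), (4, 3), (8, 7)]
summary = ComponentSummary()
for u, v in stream:
    summary.add_edge(u, v)

# Two partial summaries, each fed half of the stream, merged afterwards.
left, right = ComponentSummary(), ComponentSummary()
for u, v in stream[:4]:
    left.add_edge(u, v)
for u, v in stream[4:]:
    right.add_edge(u, v)
left.merge(right)
```

The summary grows with the number of vertices rather than the number of edges, and `merge` is what lets partial summaries be built on separate partitions of the stream.
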
- 72. But is this efficient? Sure, we can distribute the edges and summaries
- 73. But is this efficient? Sure, we can distribute the edges and summaries. Any systems in mind?
- 74. Gelly Stream Graph stream processing with Apache Flink
- 75. Gelly Stream Overview. Gelly (on DataSet): ➤ Static Graphs ➤ Multi-Pass Algorithms ➤ Full Computations. Gelly-Stream (on DataStream): ➤ Dynamic Graphs ➤ Single-Pass Algorithms ➤ Approximate Computations. Both sit on top of the Distributed Dataflow deployment
- 76. Gelly Stream Status ➤ Properties and Metrics ➤ Transformations ➤ Aggregations ➤ Discretization ➤ Neighborhood Aggregations ➤ Graph Streaming Algorithms ➤ Connected Components ➤ Bipartiteness Check ➤ Window Triangle Count ➤ Triangle Count Estimation ➤ Continuous Degree Aggregate
- 77. wait, so now we can detect connected components right away?
- 79. wait, so now we can detect connected components right away? Solved! But how about our other issues now?
- 80. > Hej Siri_ Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just ipython
- 82. Gelly-Stream to the rescue: graphStream.filterVertices(DataScientists()).slice(Time.of(10, MINUTE), EdgeDirection.IN).applyOnNeighbors(FindPairs()) Input stream: wendy checked_in glaze; steve checked_in glaze; tom checked_in joe's_grill; sandra checked_in glaze; rafa checked_in joe's_grill. Graph: wendy, steve, sandra → glaze; tom, rafa → joe's_grill. Output: {wendy, steve}, {steve, sandra}, {wendy, sandra}, {tom, rafa}
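
What the pipeline on this slide computes can be mimicked in plain Python, which may help when reading the output pairs. This is a sketch under stated assumptions, not the Gelly-Stream API: `find_pairs` and `is_data_scientist` are hypothetical names, and all five check-ins are assumed to fall into the same 10-minute slice.

```python
from collections import defaultdict
from itertools import combinations


def find_pairs(check_ins, is_data_scientist=lambda person: True):
    """Pair up people who checked in at the same place within one slice.

    check_ins is an iterable of (person, place) edges. Grouping by place
    plays the role of aggregating over the incoming edges of each place.
    """
    visitors = defaultdict(list)
    for person, place in check_ins:
        if is_data_scientist(person) and person not in visitors[place]:
            visitors[place].append(person)

    pairs = []
    for people in visitors.values():
        pairs.extend(frozenset(pair) for pair in combinations(people, 2))
    return pairs


# The check-in events shown on the slide, assumed to share one time slice.
events = [
    ("wendy", "glaze"),
    ("steve", "glaze"),
    ("tom", "joe's_grill"),
    ("sandra", "glaze"),
    ("rafa", "joe's_grill"),
]
pairs = find_pairs(events)
```

In a streaming setting the grouping would be redone for each slice as it closes, so the pairs are produced continuously and no full graph is ever stored.
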
- 83. > Hej Siri_ Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just ipython
- 84. > Hej Siri_ Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just ipython > yes
- 85. The next step: • Iterative model* on streams for deeper analytics • More Summaries • Better Out-Of-Core State Integration • AdHoc Graph Queries. Large-scale, Complex, Fast, Deep Analytics. * http://dl.acm.org/citation.cfm?id=2983551
- 86. Try out Gelly-Stream* because all questions matter @SenorCarbone *https://github.com/vasia/gelly-streaming
