Full Video: https://www.youtube.com/watch?v=cOShsisEsC0
An overview of how three data processing paradigms that are becoming increasingly relevant today relate and combine. It introduces the essentials of graph, distributed, and stream computing, and looks beyond them. It also questions the fundamental problems we want to solve with data analysis, and the potential of eventually saving humankind in the next millennium by advancing the state of the art in computation technologies, even while we are busy answering first-world-problem questions. Crazy but possible.
11. Get me the best route to work right now
>Hej Siri_
to answer big complex questions faster>
12. Get me the best route to work right now
>Hej Siri_
…with the fewest human drivers
to answer big complex questions faster>
13. Get me the best route to work right now
>Hej Siri_
Look up a pizza recipe all of my friends like but
did not eat yesterday…
…with the fewest human drivers
to answer big complex questions faster>
14. Get me the best route to work right now
>Hej Siri_
Look up a pizza recipe all of my friends like but
did not eat yesterday… or the day before yesterday
…with the fewest human drivers
to answer big complex questions faster>
15. Get me the best route to work right now
>Hej Siri_
Look up a pizza recipe all of my friends like but
did not eat yesterday… or the day before yesterday
oh! And no kebab pizza!
…with the fewest human drivers
to answer big complex questions faster>
16. Get me the best route to work right now
>Hej Siri_
Look up a pizza recipe all of my friends like but
did not eat yesterday… or the day before yesterday
oh! And no kebab pizza!
Siri, is it possible to re-unite all data
scientists in the world?
…with the fewest human drivers
to answer big complex questions faster>
17. Get me the best route to work right now
>Hej Siri_
Look up a pizza recipe all of my friends like but
did not eat yesterday… or the day before yesterday
oh! And no kebab pizza!
Siri, is it possible to re-unite all data
scientists in the world?
no matter if they use Spark or Flink or just ipython
…with the fewest human drivers
to answer big complex questions faster>
23. to answer big complex questions faster>
FIRST EARTH WORLD PROBLEM
use Spark or Flink or just ipython
best route to work right now
re-unite all data scientists in the world?
oh! And no kebab pizza!
…with the fewest human drivers
30000 AD
24. Still, fast analytics might save us some day…
• We can access patient movements and fb, twitter
and pretty much all social media interactions
• Can we stop a pandemic?
• Or can we predict fast where the virus can spread?
25. Now how do we analyse…
fast, complex, large-scale data?
26. Now how do we analyse…
data
graph, distributed, streaming
27. Now how do we analyse…
data
graph, distributed, streaming
everything is a graph
28. Now how do we analyse…
data
graph, distributed, streaming
everything is many, everything is a graph
29. Now how do we analyse…
data
graph, distributed, streaming
everything is many, everything is a graph, everything is a stream
33. Thus, Distributed Graph processing was born
MapReduce
1. Store Partitioned Data
2. Send Local computation (map)
3. now shuffle it on disks
4. merge the results (reduce)
5. Store the result back
DFS: distributed file system
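The five MapReduce steps above can be sketched in a single process. This is only an illustration under assumptions: the partitions are plain Python lists standing in for DFS blocks, the shuffle happens in memory rather than on disks, and the job (counting vertex degrees from an edge list) and all function names are hypothetical choices, not taken from the talk.

```python
from collections import defaultdict


def map_phase(partition):
    """Local computation: emit (vertex, 1) for both endpoints of every edge."""
    for src, dst in partition:
        yield src, 1
        yield dst, 1


def shuffle(mapped_partitions):
    """Group emitted values by key (a real cluster does this via disk and network)."""
    groups = defaultdict(list)
    for mapped in mapped_partitions:
        for key, value in mapped:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Merge the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def map_reduce_degrees(partitions):
    mapped = [list(map_phase(p)) for p in partitions]  # map runs per partition
    return reduce_phase(shuffle(mapped))               # then shuffle and reduce


# Two "stored partitions" of an edge list (hypothetical data).
partitions = [[(1, 2), (2, 3)], [(3, 1), (4, 5)]]
degrees = map_reduce_degrees(partitions)
print(degrees)
```

Storing the result back (step 5) would be one more write to the file system; it is left out to keep the sketch short.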
34. Thus, Distributed Graph processing was born
Distributed Graph Processing
1. Store Updates to DFS
2. Load graph snapshot (mem)
3. Compute round~superstep
4. Store updates
5. …repeat
MapReduce
1. Store Partitioned Data
2. Send Local computation (map)
3. now shuffle it on disks
4. merge the results (reduce)
5. Store the result back
DFS: distributed file system
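A rough sketch of the store, load, compute, store, repeat cycle listed above. Everything here is an assumption made for illustration: a dict stands in for the DFS, the whole computation is collapsed into one `compute` call per round instead of several supersteps, and `count_vertices` is a hypothetical analysis.

```python
def load_compute_store(dfs, update_batches, compute):
    """Run one full computation over the whole graph for every batch of updates."""
    results = []
    for batch in update_batches:
        dfs["edges"].extend(batch)      # 1. store updates to the "DFS"
        snapshot = list(dfs["edges"])   # 2. load the full graph snapshot into memory
        result = compute(snapshot)      # 3. compute over the entire snapshot
        dfs["result"] = result          # 4. store the result
        results.append(result)          # 5. ...repeat for the next batch
    return results


def count_vertices(edges):
    """Hypothetical analysis: number of distinct vertices in the snapshot."""
    return len({vertex for edge in edges for vertex in edge})


dfs = {"edges": [], "result": None}
history = load_compute_store(dfs, [[(1, 2)], [(2, 3)], [(4, 5)]], count_vertices)
print(history)
```

Note that every round reloads and reprocesses all edges seen so far, even though each batch adds only one.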
40. Distributed Graph processing example
• We want to compute the Connected Components
of a distributed graph.
• Basic computation element (map): vertex
• Updates: messages to other vertices
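The connected-components example above can be sketched as vertex-centric label propagation: each vertex keeps the smallest vertex id it has heard of and messages its neighbours whenever that label improves. This is a single-process sketch under assumptions; the function name, the inbox/outbox dictionaries, and the choice of minimum-label propagation are illustrative, not necessarily what the talk's slides animate.

```python
def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    labels = {v: v for v in neighbors}  # each vertex starts as its own component

    # Initial round: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in neighbors}
    for v, adjacent in neighbors.items():
        for n in adjacent:
            inbox[n].append(labels[v])

    # Supersteps: process messages, update labels, send new messages.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)        # vertex-local update
                for n in neighbors[v]:
                    outbox[n].append(labels[v])  # notify neighbours of the change
        inbox = outbox                           # barrier between supersteps
    return labels


labels = connected_components([(1, 2), (2, 3), (4, 5)])
print(labels)
```

The loop stops once a superstep produces no messages, which is the usual halting condition in this style of computation.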
51. Distributed Graph processing systems
• Examples of Load-Compute-Store systems:
Pregel, GraphX (Spark), GraphLab, PowerGraph
• Same execution strategy - Same problems
• It’s slow
• Too much re-computation ($€) for nothing.
• Real World Updates anyone?
55. …and streaming came
to (mess) make everything
fast and simple
The Dataflow™
real world event records
• local state stays here
• local computation too
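A minimal sketch of the dataflow idea above, assuming Python generators as stand-ins for streaming operators: records flow through one at a time, and each operator keeps its own local state next to its local computation. The operator names and the event tuples are hypothetical.

```python
from collections import defaultdict


def count_per_key(events):
    """Stateful operator: the per-key counts live inside the operator."""
    state = defaultdict(int)          # local state stays here
    for key, _payload in events:
        state[key] += 1               # local computation too
        yield key, state[key]         # emit an updated result per record


def only_milestones(counts, every=2):
    """Stateless downstream operator: forward every n-th count per key."""
    for key, count in counts:
        if count % every == 0:
            yield key, count


# A finite list stands in for an unbounded stream of real-world event records.
events = iter([
    ("alice", "click"), ("bob", "click"), ("alice", "view"),
    ("alice", "click"), ("bob", "view"),
])
output = list(only_milestones(count_per_key(events)))
print(output)
```

Nothing is written to or reloaded from a file system between records, which is the contrast with the load-compute-store cycle.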
57. Streaming is so advanced that…
• subsecond latency and high throughput
finally coexist
• it does fault tolerance without batch writes*
• late data** is handled gracefully
* https://arxiv.org/abs/1506.08603
** http://dl.acm.org/citation.cfm?id=2824076
58. Streaming is so advanced that…
• subsecond latency and high throughput
finally coexist
• it does fault tolerance without batch writes*
• late data** is handled gracefully
…but what about complex problems?
* https://arxiv.org/abs/1506.08603
** http://dl.acm.org/citation.cfm?id=2824076
61. can we make it happen?
• Problem: Can’t keep an infinite graph in-memory
and do complex stuff
62. can we make it happen?
• Problem: Can’t keep an infinite graph in-memory
and do complex stuff
[Diagram: universe, ??]
63. can we make it happen?
• Problem: Can’t keep an infinite graph in-memory
and do complex stuff
[Diagram: universe, ??]
>it was never about the graph silly, it was about
answering complex questions, remember?
64. can we make it happen?
• Problem: Can’t keep an infinite graph in-memory
and do complex stuff
[Diagram: universe, universe summary, answers ;)]
>it was never about the graph silly, it was about
answering complex questions, remember?
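One way to read the "summary" idea above is to keep a small structure that can answer the question, instead of keeping the graph itself. The sketch below assumes a union-find (disjoint set) summary for connectivity over an edge stream; the class and method names are hypothetical, and the talk does not confirm this is the summary it uses.

```python
class ComponentSummary:
    """Connectivity summary of an edge stream; edges are never stored."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        """Return the representative of v's component, compressing the path."""
        self.parent.setdefault(v, v)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:
            next_v = self.parent[v]
            self.parent[v] = root
            v = next_v
        return root

    def add_edge(self, u, v):
        """Fold one streamed edge into the summary, then forget the edge."""
        root_u, root_v = self.find(u), self.find(v)
        if root_u != root_v:
            # Keep the smaller id as the representative.
            self.parent[max(root_u, root_v)] = min(root_u, root_v)

    def connected(self, u, v):
        return self.find(u) == self.find(v)


summary = ComponentSummary()
for edge in [(1, 2), (3, 4), (2, 3), (5, 6)]:  # stands in for an unbounded stream
    summary.add_edge(*edge)

print(summary.connected(1, 4), summary.connected(1, 5))
```

The state grows with the number of vertices seen, not with the number of edges, so it is a summary rather than a copy of the graph; it still grows if new vertices keep arriving.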
89. The next step
• Iterative model* on streams for deeper analytics
• More Summaries
• Better Out-Of-Core State Integration
• Ad Hoc Graph Queries
Large-scale, Complex, Fast, Deep Analytics
* http://dl.acm.org/citation.cfm?id=2983551