Demystifying Distributed Graph Processing

dotScale 2016 presentation

Writing distributed graph applications is inherently hard. In this talk, Vasia gives an overview of high-level programming models and platforms for distributed graph processing. She exposes and discusses common misconceptions, shares lessons learnt, and suggests best practices.

  1. DEMYSTIFYING DISTRIBUTED GRAPH PROCESSING — wait, no: DEMYSTIFYING DISTRIBUTED GRAPH PROCESSING. Vasia Kalavri, vasia@apache.org, @vkalavri
  2. WHY DISTRIBUTED GRAPH PROCESSING?
  3. MISCONCEPTION #1: “MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE” (Big Data Ninja)
  4. A SOCIAL NETWORK
  5. YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT
  6. INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL ▸ Naive Who(m) to Follow: compute a friends-of-friends list per user, exclude existing friends, rank by common connections
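     As a rough illustration of why intermediate data dominates here, the following self-contained Scala sketch (a hypothetical example, not code from the talk) runs the naive recommendation over an in-memory adjacency map; the friends-of-friends candidate list it materializes grows roughly with the square of the average degree per user, which is what blows up on a real social graph.

        // Hypothetical sketch of the naive Who(m)-to-Follow steps above.
        object NaiveWhoToFollow {
          type UserId = Long

          def recommend(friends: Map[UserId, Set[UserId]], user: UserId): Seq[(UserId, Int)] = {
            val direct = friends.getOrElse(user, Set.empty[UserId])
            // friends-of-friends list: one entry per (friend, friend-of-friend) pair -> large intermediate data
            val candidates = direct.toSeq.flatMap(f => friends.getOrElse(f, Set.empty[UserId]))
            candidates
              .filter(c => c != user && !direct.contains(c))      // exclude existing friends
              .groupBy(identity)
              .map { case (c, occ) => (c, occ.size) }              // rank by common connections
              .toSeq
              .sortBy { case (_, common) => -common }
          }

          def main(args: Array[String]): Unit = {
            val friends = Map(
              1L -> Set(2L, 3L), 2L -> Set(1L, 4L), 3L -> Set(1L, 4L, 5L),
              4L -> Set(2L, 3L), 5L -> Set(3L))
            println(recommend(friends, 1L))                        // List((4,2), (5,1))
          }
        }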
  7. MISCONCEPTION #2: “DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE” (Data Science Rockstar)
  8. GRAPHS DON’T APPEAR OUT OF THIN AIR (Expectation…)
  9. GRAPHS DON’T APPEAR OUT OF THIN AIR (Reality!)
  10. HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
  11. GRAPH APPLICATIONS ARE DIVERSE ▸ Iterative value propagation: PageRank, Connected Components, Label Propagation ▸ Traversals and path exploration: shortest paths, centrality measures ▸ Ego-network analysis: personalized recommendations ▸ Pattern mining: finding frequent subgraphs
  12. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [timeline: 2004 MapReduce, Pegasus 2009, Pregel 2010, Signal-Collect, PowerGraph 2012]
  13. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [timeline as before, adding the label: Iterative value propagation]
  14. PREGEL: THINK LIKE A VERTEX [figure: an example graph with vertices 1 to 5 and its per-vertex adjacency lists, e.g. 1 -> 3, 4; 2 -> 1, 4; 5 -> 3]
  15. PREGEL: SUPERSTEPS (Vi+1, outbox) <- compute(Vi, inbox) [figure: per-vertex state and message exchange in superstep i and superstep i+1]
  16. PREGEL EXAMPLE: PAGERANK
      void compute(messages):
        sum = 0.0
        for (m <- messages) do                    // sum up received messages
          sum = sum + m
        end for
        setValue(0.15/numVertices() + 0.85*sum)   // update vertex rank
        for (edge <- getOutEdges()) do            // distribute rank to neighbors
          sendMessageTo(edge.target(), getValue()/numEdges)
        end for
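     To make the vertex-centric contract concrete, here is a minimal single-machine Scala sketch (my own illustration, not Pregel or Gelly code) of the same PageRank compute() plus the synchronous superstep loop that drives it; the Vertex type and the superstep helper are assumptions for the example.

        object VertexCentricPageRank {
          case class Vertex(id: Long, value: Double, outNeighbors: Seq[Long])

          // (Vi+1, outbox) <- compute(Vi, inbox)
          def compute(v: Vertex, inbox: Seq[Double], numVertices: Int): (Vertex, Seq[(Long, Double)]) = {
            val sum = inbox.sum                              // sum up received messages
            val newRank = 0.15 / numVertices + 0.85 * sum    // update vertex rank
            val outbox =                                     // distribute rank to out-neighbors
              if (v.outNeighbors.isEmpty) Seq.empty[(Long, Double)]
              else v.outNeighbors.map(n => (n, newRank / v.outNeighbors.size))
            (v.copy(value = newRank), outbox)
          }

          // one synchronous superstep: run compute() on every vertex, then regroup messages by target
          def superstep(vs: Map[Long, Vertex],
                        inboxes: Map[Long, Seq[Double]]): (Map[Long, Vertex], Map[Long, Seq[Double]]) = {
            val results = vs.values.toSeq.map(v => compute(v, inboxes.getOrElse(v.id, Seq.empty), vs.size))
            val newVertices = results.map { case (v, _) => v.id -> v }.toMap
            val newInboxes = results.flatMap(_._2).groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2) }
            (newVertices, newInboxes)
          }
        }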
  17. SIGNAL-COLLECT: outbox <- signal(Vi), then Vi+1 <- collect(inbox) [figure: the Signal and Collect phases over the per-vertex state in superstep i and superstep i+1]
  18. SIGNAL-COLLECT EXAMPLE: PAGERANK
      void signal():                              // distribute rank to neighbors
        for (edge <- getOutEdges()) do
          sendMessageTo(edge.target(), getValue()/numEdges)
        end for
      void collect(messages):
        sum = 0.0
        for (m <- messages) do                    // sum up received messages
          sum = sum + m
        end for
        setValue(0.15/numVertices() + 0.85*sum)   // update vertex rank
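     For comparison, the same logic split along the Signal/Collect boundary, again as a plain Scala sketch rather than the actual Signal/Collect API: signal() only reads the current value to produce messages, and collect() only folds the inbox into the next value.

        object SignalCollectPageRank {
          case class Vertex(id: Long, value: Double, outNeighbors: Seq[Long])

          // signal: distribute the current rank to out-neighbors
          def signal(v: Vertex): Seq[(Long, Double)] =
            if (v.outNeighbors.isEmpty) Seq.empty
            else v.outNeighbors.map(n => (n, v.value / v.outNeighbors.size))

          // collect: sum up received messages and update the rank
          def collect(v: Vertex, inbox: Seq[Double], numVertices: Int): Vertex =
            v.copy(value = 0.15 / numVertices + 0.85 * inbox.sum)
        }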
  19. GATHER-SUM-APPLY (POWERGRAPH) [figure: the Gather, Sum, and Apply phases across superstep i and superstep i+1]
  20. GSA EXAMPLE: PAGERANK
      double gather(source, edge, target):        // compute partial rank
        return target.value() / target.numEdges()
      double sum(rank1, rank2):                   // combine partial ranks
        return rank1 + rank2
      double apply(sum, currentRank):             // update rank
        return 0.15 + 0.85*sum
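     And the same computation in the three GSA phases, as a plain Scala sketch (not the PowerGraph or Gelly GSA API): gather produces one partial rank per in-edge, sum combines partials pairwise, and apply writes the new vertex value.

        object GsaPageRank {
          case class VertexInfo(value: Double, numOutEdges: Int)

          def gather(neighbor: VertexInfo): Double =                   // compute partial rank per edge
            neighbor.value / neighbor.numOutEdges

          def sum(rank1: Double, rank2: Double): Double =              // combine partial ranks
            rank1 + rank2

          def apply(summedRank: Double, currentRank: Double): Double = // update rank
            0.15 + 0.85 * summedRank
        }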
  21. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [timeline as before, adding: Giraph++ 2013]
  22. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [timeline as before, adding the label: Graph Traversals]
  23. THINK LIKE A (SUB)GRAPH [figure: the example graph split into two partitions] - compute() on the entire partition - Information flows freely inside each partition - Network communication between partitions, not vertices
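     A plain Scala sketch of the partition-centric idea (hypothetical, not Giraph++ code), using min-label connected components as the value being propagated: compute() sees a whole partition, iterates to a local fixed point without any network traffic, and only emits messages for edges whose target lives in another partition.

        object PartitionCentric {
          case class Partition(labels: Map[Long, Long],          // vertex -> current component label
                               internalEdges: Seq[(Long, Long)], // both endpoints in this partition
                               boundaryEdges: Seq[(Long, Long)]) // target owned by another partition

          def compute(p: Partition): (Partition, Seq[(Long, Long)]) = {
            var labels = p.labels
            var changed = true
            while (changed) {                                    // free propagation inside the partition
              changed = false
              for ((src, dst) <- p.internalEdges) {
                val m = math.min(labels(src), labels(dst))
                if (m < labels(src) || m < labels(dst)) {
                  labels = labels.updated(src, m).updated(dst, m)
                  changed = true
                }
              }
            }
            // only boundary edges cause network communication: one message per crossing edge
            val outbox = p.boundaryEdges.map { case (src, remoteDst) => (remoteDst, labels(src)) }
            (p.copy(labels = labels), outbox)
          }
        }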
  24. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [timeline as before, adding: NScale 2014]
  25. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [timeline as before, adding: Ego-network analysis, Arabesque 2015, Tinkerpop]
  26. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY [timeline as before, adding the label: Pattern Matching]
  27. CAN WE HAVE IT ALL? ▸ Data pipeline integration: built on top of an efficient distributed processing engine ▸ Graph ETL: high-level API with abstractions and methods to transform graphs ▸ Familiar programming model: support popular programming abstractions
  28. HELLO, GELLY! THE APACHE FLINK GRAPH API ▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API ▸ Transformations, library of common algorithms:
      val graph = Graph.fromDataSet(edges, env)
      val ranks = graph.run(new PageRank(0.85, 20))
      ▸ Iteration abstractions: Pregel, Signal-Collect, Gather-Sum-Apply, Partition-Centric*
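     Expanding the slide’s two-line snippet into something closer to a runnable program: the two Gelly calls come from the slide, while the imports, the toy edge list, the vertex-value initializer, and the Long/Double type parameters are my assumptions based on the Flink 1.x Gelly Scala API, not part of the talk.

        import org.apache.flink.api.common.functions.MapFunction
        import org.apache.flink.api.scala._
        import org.apache.flink.graph.Edge
        import org.apache.flink.graph.library.PageRank
        import org.apache.flink.graph.scala.Graph

        object GellyPageRankExample {
          def main(args: Array[String]): Unit = {
            val env = ExecutionEnvironment.getExecutionEnvironment

            // toy edge list: (source id, target id, edge weight)
            val edges = env.fromElements(
              new Edge(1L, 2L, 1.0), new Edge(2L, 3L, 1.0), new Edge(3L, 1L, 1.0))

            // build the graph, giving every vertex an initial rank of 1.0
            val graph = Graph.fromDataSet(
              edges,
              new MapFunction[Long, Double] { override def map(id: Long): Double = 1.0 },
              env)

            // library PageRank: damping factor 0.85, 20 iterations
            val ranks = graph.run(new PageRank[Long](0.85, 20))
            ranks.print()
          }
        }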
  29. WHY FLINK? ‣ efficient streaming runtime ‣ native iteration operators ‣ well-integrated [figure: POSIX, Java/Scala Collections]
  30. FEELING GELLY? ▸ Paper References: http://www.citeulike.org/user/vasiakalavri/tag/dotscale ▸ Apache Flink: http://flink.apache.org/ ▸ Gelly documentation: http://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html ▸ Gelly-Stream: https://github.com/vasia/gelly-streaming
