Distributed K-Betweenness
Complex Network Analysis
Daniel Marcous and Yotam Sandbank
dmarcous@gmail.com
yotamsandbank@gmail.com
Centrality
❖ Core concept in complex network analysis
❖ Different measures:
❖ Closeness
❖ Degree
❖ Betweenness
Betweenness
Betweenness computation
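As a quick recap, the standard betweenness centrality these slides build on is

BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths from s to t and \sigma_{st}(v) is the number of those paths that pass through v. Brandes' algorithm computes all scores with one BFS (or Dijkstra search) per source vertex followed by a backward accumulation of dependencies, in O(|V||E|) time on unweighted graphs.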
Distributed Betweenness
❖ Independent computation for each node
❖ Why not run on different machines?
❖ Betweenness computation not implemented in GraphX
Distributed Betweenness
❖ Algorithm:
❖ Divide nodes between machines
❖ On each machine, compute the Betweenness contribution of each of its assigned nodes to every other node in the graph
❖ Aggregate results from all machines
❖ Problems:
❖ Can’t get information about a specific node in GraphX
❖ Need to copy the graph to every machine (does not scale to big graphs)
Distributed Betweenness
❖ Solutions:
❖ Can’t get information about a specific node in GraphX
❖ GraphX Pregel API
❖ Run one iteration, with every node passing its identity to all of its neighbors (see the sketch below)
❖ Need to copy the graph to every machine (does not scale to big graphs)
❖ We didn’t find a good solution for this problem
❖ How can we avoid copying the whole graph to every machine?
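A minimal Scala sketch of the identity-passing solution above (illustrative code under our own naming, not the project's actual implementation): a single Pregel superstep over a plain Graph[Int, Int] in which each endpoint of every edge sends its vertex id to the other endpoint, leaving every vertex with the set of its neighbours' ids.

import org.apache.spark.graphx._

// Illustrative sketch (not the authors' code): one Pregel superstep in which
// every vertex sends its id to its neighbours, so afterwards each vertex
// holds the set of neighbour ids.
def collectNeighbourIds(graph: Graph[Int, Int]): Graph[Set[VertexId], Int] = {
  // Start every vertex with an empty set of known neighbour ids
  val init = graph.mapVertices((_, _) => Set.empty[VertexId])
  init.pregel(Set.empty[VertexId], 1, EdgeDirection.Either)(
    // vertex program: merge incoming ids into the local set
    (_, known, msg) => known ++ msg,
    // send program: each endpoint tells the other endpoint its id
    triplet => Iterator((triplet.dstId, Set(triplet.srcId)),
                        (triplet.srcId, Set(triplet.dstId))),
    // message combiner: union of id sets
    (a, b) => a ++ b)
}

Iterating this kind of superstep k times (forwarding the accumulated sets) is, roughly, how a vertex can learn its distance-k neighbourhood without shipping the whole graph anywhere.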
Distributed K-Betweenness
Quotes by Adriana Iamnitchi, University of South Florida
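As a hedged recap (our paraphrase of the k-betweenness notion from Iamnitchi's work that the slides quote, not the slides' own wording): k-Betweenness only credits shortest paths of length at most k,

kBC_k(v) = \sum_{s \neq v \neq t,\; d(s,t) \le k} \frac{\sigma_{st}(v)}{\sigma_{st}},

so each vertex's score depends only on its distance-k neighbourhood (its k-graphlet), which is exactly what makes the computation distributable.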
Implementation
❖ Technology:
❖ Spark 1.5.2
❖ Scala 2.10
❖ GraphX 1.5.2 (+ Pregel API)
❖ Steps:
❖ Create K-graphlets
❖ Pregel
❖ Parallel BC calculation – contribution of vertex X to the BC of other vertices
❖ Local for each vertex's graphlet
❖ Brandes
❖ Also parallelized for each vertex in k-graphlet
❖ BC aggregation – final kBC score for each vertex
❖ Reduce (see the sketch below)
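A minimal sketch of the final Reduce step, under our assumption (not stated on the slides) that each per-graphlet Brandes run emits (vertexId, partial contribution) pairs; the final kBC score of a vertex is then just the sum of its partials.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.VertexId

// Toy sketch of the "BC aggregation" step. The partial pairs below are made up;
// in the real pipeline they would come out of the per-graphlet Brandes stage.
object KbcAggregationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kBC-aggregation").setMaster("local[*]"))

    val partials = sc.parallelize(Seq[(VertexId, Double)](
      (1L, 0.5), (2L, 1.0), (1L, 0.25), (3L, 0.0), (2L, 0.5)))

    // Reduce: sum the partial contributions per vertex to get the final kBC score
    val kBC = partials.reduceByKey(_ + _)
    kBC.collect().foreach { case (v, score) => println(s"vertex $v -> kBC $score") }

    sc.stop()
  }
}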
Code
Usage
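A hedged usage sketch from Scala: the KBetweenness.run entry point and its package path are our reading of the repository README, not a listing from the slides, so check the README for the exact names before copying.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
// Assumed entry point - check the README for the exact package path:
// import com.centrality.kBC.KBetweenness

object KbcUsageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kBC-usage"))

    // Build a small GraphX graph from an edge list of Long vertex ids
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 1)

    val k = 3 // bound on shortest-path length, i.e. the graphlet radius
    // val kBCGraph = KBetweenness.run(graph, k) // vertices of the result carry their kBC score

    sc.stop()
  }
}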
Tuning
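A hedged tuning sketch: these are standard Spark 1.5.x settings one would expect to matter for this job; the values are placeholders, not the authors' recommendations.

import org.apache.spark.SparkConf

// Placeholder values - tune per cluster and per graph; none of these numbers
// come from the slides.
val conf = new SparkConf()
  .setAppName("spark-betweenness")
  .set("spark.executor.memory", "40g")      // k-graphlets can be large, give executors room
  .set("spark.executor.cores", "4")         // fewer cores per executor means more memory per task
  .set("spark.default.parallelism", "64")   // more partitions than cores helps with skewed graphlets
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // cheaper (de)serialization of graphlet messages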
Do it yourself
❖ The project can be found on GitHub:
❖ https://github.com/dmarcous/spark-betweenness
❖ Accessible as a Spark Package!
❖ http://spark-packages.org/package/dmarcous/spark-betweenness
❖ Usable from Spark code (Scala / Java), spark-shell, spark-submit, and the pySpark API
Experiment design
❖ Amazon EMR cluster
❖ 1 master
❖ 4 worker nodes
❖ r3.2xlarge
❖ 8 vCPUs
❖ 61 GB RAM
❖ 160 GB SSD
❖ 6 Datasets
❖ Different sizes (|E| / |V|)
❖ Different diameters
❖ Implementations
❖ spark-betweenness
❖ networkX
Results
Name       | Type           | Description            | Vertices | Edges   | Diameter | K | networkX (single) | spark-betweenness
HW2        | Random         | Small random generated | 3015     | 5156    | 9        | 3 | 31                | 240
Facebook   | Social         | Social circles         | 4039     | 88234   | 8        | 3 | 210               | 601
           |                |                        |          |         |          | 4 | 349               | -1
Brightkite | Social         | Friendship network     | 58228    | 428156  | 16       | 3 | -1                | 2160
Amazon     | Social         | Customer co-purchases  | 334863   | 925872  | 44       | 3 | -1                | 489
           |                |                        |          |         |          | 4 | -1                | 5707
           |                |                        |          |         |          | 5 | -1                | -1
roadNet-CA | Infrastructure | Road net of California | 1965206  | 2766607 | 849      | 3 | -1                | 139
           |                |                        |          |         |          | 4 | -1                | 356
           |                |                        |          |         |          | 5 | -1                | 638
roadNet-TX | Infrastructure | Road net of Texas      | 1379917  | 3843324 | 1054     | 3 | -1                | 85
           |                |                        |          |         |          | 4 | -1                | 305
           |                |                        |          |         |          | 5 | -1                | 600
The last two columns give runtimes for the single-machine (networkX) and Spark implementations; -1 means the run either crashed or did not finish within an hour.
Results
❖ Performs well on graphs with a large diameter
❖ Large k-graphlets (typical when the diameter is small) are practically impossible to store in memory and send between machines
❖ Not good for graphs with a small diameter (very slow, sometimes crashes)
❖ Very hard to tune (number of cores, memory per process, and so on)
Conclusions
❖ Distributed Betweenness – good idea in theory, hard to implement
❖ A multi-threaded implementation on a single strong machine might do the job
❖ Our implementation – great for large-diameter graphs (road networks, power grids, and more)
