Distributed K-Betweenness
Complex Network Analysis
Daniel Marcous and Yotam Sandbank
dmarcous@gmail.com
yotamsandbank@gmail.com
Centrality
❖ Core concept in complex network analysis
❖ Different measures:
❖ Closeness
❖ Degree
❖ Betweenness
Betweenness
Betweenness computation
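As a quick recap, the standard betweenness centrality these slides build on is

BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths from s to t and \sigma_{st}(v) is the number of those paths that pass through v. Brandes' algorithm computes all scores with one BFS (or Dijkstra search) per source vertex followed by a backward accumulation of dependencies, in O(|V||E|) time on unweighted graphs.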
Distributed Betweenness
❖ Independent computation for each node
❖ Why not run on different machines?
❖ Betweenness computation not implemented in GraphX
Distributed Betweenness
❖ Algorithm:
❖ Divide nodes between machines
❖ On each machine, compute the Betweenness contribution of each of its assigned nodes to every other node in the graph
❖ Aggregate results from all machines
❖ Problems:
❖ Can’t get information about a specific node in GraphX
❖ Need to copy the graph to every machine (does not scale to big graphs)
Distributed Betweenness
❖ Solutions:
❖ Can’t get information about a specific node in GraphX
❖ GraphX Pregel API
❖ Run one iteration, with every node passing its identity to all of its neighbors (see the sketch below)
❖ Need to copy the graph to every machine (does not scale to big graphs)
❖ We didn’t find a good solution for this problem
❖ How can we avoid copying the whole graph to every machine?
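A minimal Scala sketch of the identity-passing solution above (illustrative code under our own naming, not the project's actual implementation): a single Pregel superstep over a plain Graph[Int, Int] in which each endpoint of every edge sends its vertex id to the other endpoint, leaving every vertex with the set of its neighbours' ids.

import org.apache.spark.graphx._

// Illustrative sketch (not the authors' code): one Pregel superstep in which
// every vertex sends its id to its neighbours, so afterwards each vertex
// holds the set of neighbour ids.
def collectNeighbourIds(graph: Graph[Int, Int]): Graph[Set[VertexId], Int] = {
  // Start every vertex with an empty set of known neighbour ids
  val init = graph.mapVertices((_, _) => Set.empty[VertexId])
  init.pregel(Set.empty[VertexId], 1, EdgeDirection.Either)(
    // vertex program: merge incoming ids into the local set
    (_, known, msg) => known ++ msg,
    // send program: each endpoint tells the other endpoint its id
    triplet => Iterator((triplet.dstId, Set(triplet.srcId)),
                        (triplet.srcId, Set(triplet.dstId))),
    // message combiner: union of id sets
    (a, b) => a ++ b)
}

Iterating this kind of superstep k times (forwarding the accumulated sets) is, roughly, how a vertex can learn its distance-k neighbourhood without shipping the whole graph anywhere.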
Distributed K-Betweenness
Quotes by Adriana Iamnitchi, University of South Florida
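As a hedged recap (our paraphrase of the k-betweenness notion from Iamnitchi's work that the slides quote, not the slides' own wording): k-Betweenness only credits shortest paths of length at most k,

kBC_k(v) = \sum_{s \neq v \neq t,\; d(s,t) \le k} \frac{\sigma_{st}(v)}{\sigma_{st}},

so each vertex's score depends only on its distance-k neighbourhood (its k-graphlet), which is exactly what makes the computation distributable.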
Implementation
❖ Technology:
❖ Spark 1.5.2
❖ Scala 2.10
❖ GraphX 1.5.2 (+ Pregel API)
❖ Steps:
❖ Create K-graphlets
❖ Pregel
❖ Parallel BC calculation – contribution of vertex X to the BC of other vertices
❖ Local for each vertex's graphlet
❖ Brandes
❖ Also parallelized for each vertex in k-graphlet
❖ BC aggregation – final kBC score for each vertex
❖ Reduce (see the sketch below)
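A minimal sketch of the final Reduce step, under our assumption (not stated on the slides) that each per-graphlet Brandes run emits (vertexId, partial contribution) pairs; the final kBC score of a vertex is then just the sum of its partials.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.VertexId

// Toy sketch of the "BC aggregation" step. The partial pairs below are made up;
// in the real pipeline they would come out of the per-graphlet Brandes stage.
object KbcAggregationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kBC-aggregation").setMaster("local[*]"))

    val partials = sc.parallelize(Seq[(VertexId, Double)](
      (1L, 0.5), (2L, 1.0), (1L, 0.25), (3L, 0.0), (2L, 0.5)))

    // Reduce: sum the partial contributions per vertex to get the final kBC score
    val kBC = partials.reduceByKey(_ + _)
    kBC.collect().foreach { case (v, score) => println(s"vertex $v -> kBC $score") }

    sc.stop()
  }
}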
Code
Usage
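A hedged usage sketch from Scala: the KBetweenness.run entry point and its package path are our reading of the repository README, not a listing from the slides, so check the README for the exact names before copying.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
// Assumed entry point - check the README for the exact package path:
// import com.centrality.kBC.KBetweenness

object KbcUsageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kBC-usage"))

    // Build a small GraphX graph from an edge list of Long vertex ids
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 1)

    val k = 3 // bound on shortest-path length, i.e. the graphlet radius
    // val kBCGraph = KBetweenness.run(graph, k) // vertices of the result carry their kBC score

    sc.stop()
  }
}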
Tuning
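A hedged tuning sketch: these are standard Spark 1.5.x settings one would expect to matter for this job; the values are placeholders, not the authors' recommendations.

import org.apache.spark.SparkConf

// Placeholder values - tune per cluster and per graph; none of these numbers
// come from the slides.
val conf = new SparkConf()
  .setAppName("spark-betweenness")
  .set("spark.executor.memory", "40g")      // k-graphlets can be large, give executors room
  .set("spark.executor.cores", "4")         // fewer cores per executor means more memory per task
  .set("spark.default.parallelism", "64")   // more partitions than cores helps with skewed graphlets
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // cheaper (de)serialization of graphlet messages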
Do it yourself
❖ The project can be found on GitHub:
❖ https://github.com/dmarcous/spark-betweenness
❖ Accessible as a Spark Package!
❖ http://spark-packages.org/package/dmarcous/spark-betweenness
❖ Usable from Spark code (Scala / Java), spark-shell, spark-submit, and the pySpark API
Experiment design
❖ Amazon EMR cluster
❖ 1 master
❖ 4 worker nodes
❖ r3.2xlarge
❖ 8 vCPUs
❖ 61 GB RAM
❖ 160 GB SSD
❖ 6 Datasets
❖ Different sizes (|E| / |V|)
❖ Different diameters
❖ Implementations
❖ spark-betweenness
❖ networkX
Results
Name       | Type           | Description            | Vertices | Edges   | Diameter | K | networkX (single) | spark-betweenness
HW2        | Random         | Small random generated | 3015     | 5156    | 9        | 3 | 31                | 240
Facebook   | Social         | Social circles         | 4039     | 88234   | 8        | 3 | 210               | 601
           |                |                        |          |         |          | 4 | 349               | -1
Brightkite | Social         | Friendship network     | 58228    | 428156  | 16       | 3 | -1                | 2160
Amazon     | Social         | Customer co-purchases  | 334863   | 925872  | 44       | 3 | -1                | 489
           |                |                        |          |         |          | 4 | -1                | 5707
           |                |                        |          |         |          | 5 | -1                | -1
roadNet-CA | Infrastructure | Road net of California | 1965206  | 2766607 | 849      | 3 | -1                | 139
           |                |                        |          |         |          | 4 | -1                | 356
           |                |                        |          |         |          | 5 | -1                | 638
roadNet-TX | Infrastructure | Road net of Texas      | 1379917  | 3843324 | 1054     | 3 | -1                | 85
           |                |                        |          |         |          | 4 | -1                | 305
           |                |                        |          |         |          | 5 | -1                | 600
The last two columns give runtimes for the single-machine (networkX) and Spark implementations; -1 means the run either crashed or did not finish within an hour.
Results
❖ Performs well on graphs with a large diameter
❖ Large k-graphlets (typical when the diameter is small) are practically impossible to store in memory and send between machines
❖ Not good for graphs with a small diameter (very slow, sometimes crashes)
❖ Very hard to tune (number of cores, memory per process, and so on)
Conclusions
❖ Distributed Betweenness – good idea in theory, hard to implement
❖ A multi-threaded implementation on a single strong machine might do the job
❖ Our implementation – great for large-diameter graphs (road networks, power grids, and more)
