Scalable Graph Convolutional Network based Link Prediction on a Distributed Graph Database Server
Anuradha Karunarathna, Dinika Senarath, Shalika Madhushanki, Chinthaka Weerakkody, Miyuru Dayarathna, Sanath Jayasena, and Toyotaro Suzumura
21/10/2020
IEEE International Conference on Cloud Computing 2020
University of Moratuwa, Sri Lanka
WSO2, Inc. USA
IBM T.J. Watson Research Center, USA
MIT-IBM Watson AI Lab, USA
Barcelona Supercomputing Center, Spain
Introduction
▰ Graphs are rich data structures
▰ Graph data enables a wide variety of applications
▰ Link prediction in graph databases has become a
prominent research area
2
Online social networks, protein-interaction networks, computational biology, cyber security, transportation systems
[Figure: Human Protein Interaction Network (P. M. Kim et al., 2007)]
Graph Convolutional Neural Networks
● Until recently, little attention had been paid to generalizing neural network models to graph-structured data [1]
● The Graph Convolutional Network (GCN) is an improvement over Convolutional Neural Networks, aimed at encoding graphs
3
[1] https://tkipf.github.io/graph-convolutional-networks/
Why Graph Link Prediction?
Link prediction predicts whether a link will form between two nodes, based on node attribute information and the observed existing link information.
▰ Recommendation Systems
▰ Interaction discovery - (bioinformatics)
▰ Route planning - (aircraft route planning)
▰ Helps uncover hidden criminal and terrorist networks
4
Presentation Outline
▰ Introduction
▰ Research Problem (Link Prediction Performance)
▰ Proposed Solution (Scheduling algorithm)
▰ Related Work
▰ Methodology
▰ Evaluation
▰ Conclusion
5
Research Problem
How can link prediction tasks be scheduled efficiently on large attributed graphs?
6
Graphs are used in many applications, but graph datasets have become too large: link prediction is expensive in terms of storage and computation time. Hence,
★ Distribute graphs
★ Perform link prediction on distributed graphs
Proposed Solution and Contributions
▰ Distribute graphs across multi-machine clusters and
conduct deep learning and link prediction on distributed
graph partitions
▰ Develop a scheduling algorithm that conducts the GCN training of graph partitions on the worker nodes
7
Objectives
▰ Develop a link prediction application on top of a distributed graph database server - JasmineGraph [1]
▰ Our approach has
▻ High accuracy by considering graph structure + node
features
▻ Computational efficiency
▻ Effective Communication Management
8
[1] M. Dayarathna (2018), miyurud/jasminegraph, GitHub. [Online]. Available: https://github.com/miyurud/jasminegraph.
Related Work
9
1. Link prediction using heuristics [16], e.g. Common Neighbour, Jaccard coefficient, Katz index (two of these are sketched after the references below)
Relatedness: link prediction mechanism on graphs
Limitations:
● No single heuristic can be applied to any generic graph
● Ignores explicit features of the graph; considers only the graph structure
● Captures only a small set of structural patterns

2. SEAL [26]
Relatedness: link prediction based on local subgraphs using a graph neural network
Limitations:
● Uses matrix factorization for node embeddings (an embedding vector is trained and optimized for each node)
● Huge number of parameters, because the number of node embedding parameters grows linearly with graph size
[16] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[26] M. Zhang and Y. Chen. Link prediction based on graph neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 5171–5181, USA, 2018. Curran Associates Inc.
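For concreteness, here is a minimal Python sketch of two of the heuristics named above, Common Neighbours and the Jaccard coefficient. It is illustrative only (not from the paper), assuming the graph is given as an adjacency dict.

```python
# Minimal sketch of two classical heuristics from [16]; illustrative only.
# The graph is assumed to be an adjacency dict: node -> set of neighbours.
def common_neighbours(adj, u, v):
    """Number of neighbours shared by u and v."""
    return len(adj[u] & adj[v])

def jaccard_coefficient(adj, u, v):
    """Shared neighbours normalized by the size of the neighbourhood union."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# Example: score the (0, 3) candidate pair on a toy graph.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(common_neighbours(adj, 0, 3))    # 1 (node 2)
print(jaccard_coefficient(adj, 0, 3))  # 0.5
```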
Related Work (Contd.)
10
3. GraphSAGE [10] (Graph SAmple and aggreGatE)
Relatedness: inductive node embedding generation based on GCN
Limitation: training on local sub-graphs is possible, but requires the entire graph to be loaded into memory

4. PyTorch BigGraph [14]
Relatedness: distributed graph training mechanism
Limitations:
● High number of buckets (if nodes are partitioned into p partitions, there are p^2 buckets)
● Random node partitioning
● Requires a shared file system

5. Euler [1]
Relatedness: distributed graph training mechanism
[1] Alibaba. Euler. https://github.com/alibaba/euler, 2019.
[10] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[14] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich. PyTorch-BigGraph: A large-scale graph embedding system. CoRR, abs/1903.12287, 2019.
Methodology - JasmineGraph
▰ JasmineGraph Distributed
Database Server [1]
▻ Partitions and stores graph data using one of several partitioning approaches, e.g. METIS or hash-based (a minimal sketch follows)
11
[1] https://github.com/miyurud/jasminegraph
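As an illustration of the hash-based option, here is a minimal sketch of vertex-hash edge partitioning with a local store for intra-partition edges and a central store for cross-partition edges; JasmineGraph's actual partitioner (e.g. its METIS path) is more elaborate, and this function is a hypothetical simplification.

```python
# Minimal sketch of hash-based edge partitioning, assuming a simple
# vertex-hash scheme. Intra-partition edges go to a "local store";
# cross-partition edges go to a "central store".
from collections import defaultdict

def hash_partition(edges, num_partitions):
    local = defaultdict(list)    # partition id -> intra-partition edges
    central = defaultdict(list)  # partition id -> cross-partition edges
    for src, dst in edges:
        p_src = hash(src) % num_partitions
        p_dst = hash(dst) % num_partitions
        if p_src == p_dst:
            local[p_src].append((src, dst))
        else:
            central[p_src].append((src, dst))
    return local, central

# Example: partition a tiny edge list into 2 partitions.
local, central = hash_partition([(0, 1), (1, 2), (2, 3)], num_partitions=2)
```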
Methodology (Contd.)
▰ Node Embedding Generation
12
▻ Training process happens
in database server
partitions
▻ Using GraphSAGE
▻ Implemented in TensorFlow
▻ Embeddings are written to
the model store
Methodology (Contd.)
▰ We use localized graph
convolution modules to
train each graph
partition
▰ The training happens in
an unsupervised
manner
13
Methodology (Embedding Generation for
a Graph Partition)
▰ The training process is initiated by concatenating the local store partition with its corresponding central store partition
▰ The concatenated graph's structure and node features are input to the neural network
▰ The hidden layers of the GNN are structured to transform the node features and aggregate them over the graph to generate node embeddings (see the sketch below)
14
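To make the localized convolution concrete, here is a minimal NumPy sketch of one GraphSAGE-style mean-aggregator layer. The mean aggregation, ReLU, and L2 normalization follow the GraphSAGE paper [10], but this is a simplified sketch, not the authors' TensorFlow code.

```python
import numpy as np

def mean_aggregate_layer(features, neighbors, W_self, W_neigh):
    """One GraphSAGE-style mean-aggregator layer.

    features:  (N, d_in) node feature matrix
    neighbors: dict mapping node id -> list of neighbour ids
    W_self, W_neigh: (d_in, d_out) learned weight matrices
    """
    n, d_in = features.shape
    out = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        neigh = neighbors.get(v, [])
        h_neigh = features[neigh].mean(axis=0) if neigh else np.zeros(d_in)
        out[v] = np.maximum(0.0, features[v] @ W_self + h_neigh @ W_neigh)  # ReLU
    norms = np.linalg.norm(out, axis=1, keepdims=True) + 1e-12
    return out / norms  # L2-normalized embeddings, as in GraphSAGE

# Example: a 4-node path graph with 8-dim features embedded into 4 dims.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
emb = mean_aggregate_layer(feats, adj, rng.standard_normal((8, 4)),
                           rng.standard_normal((8, 4)))
```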
Methodology (Contd.)
▰ Link Prediction using generated node embeddings
▻ Inter worker communication to collect node
embeddings
▻ No linear comparison with all other nodes (which would cost O(N) time per query)
▻ Apply Locality Sensitive Hashing to rank predictions
15
Link Prediction Algorithm
▰ Accepts a starting node, denoted the query node (q), and outputs a list of predicted nodes
▰ Uses the random projection method of LSH (sketched below)
16
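A minimal sketch of the random projection method of LSH over node embeddings, assuming the sign pattern of projections onto random hyperplanes forms the bucket id; parameter names such as num_planes are illustrative, not from the paper.

```python
import numpy as np

def lsh_buckets(embeddings, num_planes=16, seed=0):
    """Random projection LSH: the sign pattern of projections onto random
    hyperplanes gives each node an integer bucket id."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], num_planes))
    bits = embeddings @ planes > 0                       # (N, num_planes) bools
    return (bits * (1 << np.arange(num_planes))).sum(axis=1)

def candidate_links(q, buckets):
    """Nodes sharing the query node q's bucket are the candidate predictions."""
    cands = np.flatnonzero(buckets == buckets[q])
    return cands[cands != q]                             # exclude q itself

# Example: 1000 nodes with 64-dimensional embeddings, 8 hyperplanes.
emb = np.random.default_rng(1).standard_normal((1000, 64))
print(candidate_links(0, lsh_buckets(emb, num_planes=8)))
```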
Scheduling Algorithm
▰ Decides which partitions can be trained in parallel within the available memory
▰ Two main objectives:
▻ Utilize the available memory optimally
▻ Finish training all partitions in the minimum number of iterations
▰ Bin Packing Problem: given n items with weights and bins with capacity c, assign each item to a bin such that the total number of bins is minimized
▻ Bins - Training iterations
▻ Capacity - Available memory
▻ Items - Graph partitions
▻ Weights - Memory requirement of each partition (see the sketch below)
17
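The slides frame scheduling as bin packing; a standard greedy approximation for it is first-fit decreasing (FFD), sketched below with training iterations as bins and per-partition memory estimates as weights. This is a generic FFD sketch under those assumptions; the paper's exact algorithm may differ.

```python
def schedule_partitions(memory_requirements, capacity):
    """First-fit decreasing bin packing: pack partitions (items) into
    training iterations (bins) bounded by available memory (capacity).

    memory_requirements: dict partition_id -> estimated memory (e.g. GB)
    Returns a list of iterations, each a list of partition ids trained together.
    """
    iterations = []  # each entry: [remaining_capacity, [partition ids]]
    # Sort heaviest-first so large partitions claim iterations early.
    for pid, mem in sorted(memory_requirements.items(),
                           key=lambda kv: kv[1], reverse=True):
        if mem > capacity:
            raise ValueError(f"partition {pid} exceeds capacity")
        for it in iterations:
            if it[0] >= mem:           # first iteration with enough room
                it[0] -= mem
                it[1].append(pid)
                break
        else:                          # no fit: open a new iteration
            iterations.append([capacity - mem, [pid]])
    return [pids for _, pids in iterations]

# Example: 4 partitions on a worker with 8 GB of free memory.
print(schedule_partitions({0: 5, 1: 4, 2: 3, 3: 2}, 8))  # -> [[0, 2], [1, 3]]
```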
Data Sets
18
Dataset      | Vertices  | Edges      | Features | Edgelist File Size (MB) | Feature File Size (MB)
Twitter      | 81,306    | 1,768,149  | 1007     | 16                      | 157
Amazon Small | 548,551   | 1,244,636  | 250      | 19.4                    | 266
Reddit       | 232,965   | 11,606,919 | 602      | 145                     | 270
DBLP-V11     | 4,107,340 | 36,624,464 | 948      | 508                     | 9523
Data Sets (Contd.)
19
Experiments and the environments
20
Experiments:
▰ Vertical scalability
▻ Node embedding accuracy experiments
▻ Graph training time experiments
▰ Horizontal scalability
▻ Node embedding accuracy experiments
▻ Graph training time experiments
Environments:
▰ Server: CPU 80, RAM 64GB, OS Ubuntu 16.04.6 LTS, Disk 1.8TB
▰ Master: CPU 4, RAM 16GB, OS Ubuntu 16.04.6 LTS, Disk 100GB
▰ Worker: CPU 8, RAM 30GB, OS Ubuntu 16.04.6 LTS, Disk 10GB
Graph Training Experiments - Vertical
Scalability
21
Graph Training Experiments - Vertical
Scalability
22
Graph Training Experiments - Horizontal
Scalability
23
Training Times for Different Graph
Partitions
24
Our approach could run the training process on the partitioned DBLP-V11
Accuracy Comparison Experiments
▰ Mean Reciprocal Rank (MRR)
▰ Hit@1 score
▰ Hit@10 score
(computation of these metrics is sketched below)
25
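A minimal sketch of how MRR and Hit@k can be computed from the rank each query assigns to its true linked node; this follows the standard definitions of the metrics, not the paper's evaluation code.

```python
def mrr_and_hits(ranks, k=10):
    """Compute MRR and Hit@k from per-query ranks.

    ranks: list of 1-based ranks at which the true linked node appeared
           in each query's prediction list.
    """
    mrr = sum(1.0 / r for r in ranks) / len(ranks)       # mean of 1/rank
    hit_at_k = sum(r <= k for r in ranks) / len(ranks)   # fraction ranked in top k
    return mrr, hit_at_k

# Example: true links ranked 1st, 3rd, and 12th by three queries.
print(mrr_and_hits([1, 3, 12], k=10))  # MRR ~= 0.472, Hit@10 ~= 0.667
```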
MRR Value
26
Hit@1 Score Comparison
27
Hit@10 Score Comparison
28
Training Accuracy of Different Graph
Partitions
29
Conclusion
▰ Current graph link prediction approaches cannot scale well to large datasets
▰ A solution is to perform link prediction in a distributed manner
▰ Densely connected components play a critical role in determining the performance of the overall training process
30
Conclusion (Contd.)
▰ JasmineGraph was able to train a GCN on the largest dataset, DBLP-V11 (> 9.3GB), in 11 hours and 40 minutes using 16 workers on a single server.
▰ The original GraphSAGE implementation processed Reddit in 238 minutes, while JasmineGraph took only 100 minutes on the same hardware with 16 workers: a 2.4x performance improvement.
▰ Future work - graph stream processing, privacy preserving machine
learning
31
Thank you!
32
