Scalable Graph Convolutional Network based Link Prediction on a Distributed Graph Database Server
Anuradha Karunarathna, Dinika Senarath, Shalika Madhushanki, Chinthaka Weerakkody, Miyuru Dayarathna, Sanath Jayasena, and Toyotaro Suzumura
21/10/2020
IEEE International Conference on Cloud Computing 2020
University of Moratuwa, Sri Lanka
WSO2, Inc. USA
IBM T.J. Watson Research Center, USA
MIT-IBM Watson AI Lab, USA
Barcelona Supercomputing Center, Spain
Introduction
▰ Graphs are rich data structures
▰ Graph data enables a wide variety of applications
▰ Link prediction in graph databases has become a
prominent research area
2
Online social networks, protein-interaction networks, computational biology, cyber security, transportation systems
[Figure: Human Protein Interaction Network (P. M. Kim et al., 2007)]
Graph Convolutional Neural Networks
● Until recently, little attention had been paid to generalizing neural network models to graph-structured data [1]
● The Graph Convolutional Network (GCN) is an improvement over Convolutional Neural Networks, aimed at encoding graphs
3
[1] https://tkipf.github.io/graph-convolutional-networks/
Why Graph Link Prediction?
Link prediction predicts whether a link will form between two nodes, based on node attribute information and the observed existing link information.
▰ Recommendation Systems
▰ Interaction discovery - (bioinformatics)
▰ Route planning - (aircraft route planning)
▰ Helps uncover hidden criminal and terrorist networks
4
Presentation Outline
▰ Introduction
▰ Research Problem (Link Prediction Performance)
▰ Proposed Solution (Scheduling algorithm)
▰ Related Work
▰ Methodology
▰ Evaluation
▰ Conclusion
5
Research Problem
How can link prediction tasks be scheduled efficiently on large attributed graphs?
6
Graphs are used in many applications, but graph datasets have become too large: link prediction is expensive in terms of storage and computation time. Hence,
★ Distribute graphs
★ Perform link prediction on distributed graphs
Proposed Solution and Contributions
▰ Distribute graphs across multi-machine clusters and
conduct deep learning and link prediction on distributed
graph partitions
▰ Develop a scheduling algorithm that conducts the GCN training of graph partitions on the worker nodes
7
Objectives
▰ Develop a link prediction application on top of a distributed graph database server - JasmineGraph [1]
▰ Our approach has
▻ High accuracy by considering graph structure + node
features
▻ Computational efficiency
▻ Effective Communication Management
8
[1] M. Dayarathna (2018), miyurud/jasminegraph, GitHub. [Online]. Available: https://github.com/miyurud/jasminegraph.
Related Work
9
1. Link prediction using heuristics [16], e.g. Common Neighbour, Jaccard coefficient, Katz index (two of these are sketched after the references below)
Relatedness: link prediction mechanism on graphs
Limitations:
● No single heuristic can be applied to any generic graph
● Ignores explicit features of the graph; considers only the graph structure
● Captures only a small set of structural patterns

2. SEAL [26]
Relatedness: link prediction based on local subgraphs using a graph neural network
Limitations:
● Uses matrix factorization for node embeddings (an embedding vector is trained and optimized for each node)
● Huge number of parameters, because the number of node embedding parameters grows linearly with graph size
[16] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[26] M. Zhang and Y. Chen. Link prediction based on graph neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 5171–5181, USA, 2018. Curran Associates Inc.
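For concreteness, here is a minimal Python sketch of two of the heuristics named above, Common Neighbours and the Jaccard coefficient. It is illustrative only (not from the paper), assuming the graph is given as an adjacency dict.

```python
# Minimal sketch of two classical heuristics from [16]; illustrative only.
# The graph is assumed to be an adjacency dict: node -> set of neighbours.
def common_neighbours(adj, u, v):
    """Number of neighbours shared by u and v."""
    return len(adj[u] & adj[v])

def jaccard_coefficient(adj, u, v):
    """Shared neighbours normalized by the size of the neighbourhood union."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# Example: score the (0, 3) candidate pair on a toy graph.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(common_neighbours(adj, 0, 3))    # 1 (node 2)
print(jaccard_coefficient(adj, 0, 3))  # 0.5
```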
Related Work (Contd.)
10
3. GraphSAGE [10] (Graph SAmple and aggreGatE)
Relatedness: inductive node embedding generation based on GCN
Limitation: training on local sub-graphs is possible, but requires the entire graph to be loaded into memory

4. PyTorch BigGraph [14]
Relatedness: distributed graph training mechanism
Limitations:
● High number of buckets (if nodes are partitioned into p partitions, there are p^2 buckets)
● Random node partitioning
● Requires a shared file system

5. Euler [1]
Relatedness: distributed graph training mechanism
[1] Alibaba. Euler. https://github.com/alibaba/euler, 2019.
[10] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[14] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich. PyTorch-BigGraph: A large-scale graph embedding system. CoRR, abs/1903.12287, 2019.
Methodology - JasmineGraph
▰ JasmineGraph Distributed
Database Server [1]
▻ Partitions and stores graph data using one of several partitioning approaches, e.g. METIS or hash-based (a minimal sketch follows)
11
[1] https://github.com/miyurud/jasminegraph
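As an illustration of the hash-based option, here is a minimal sketch of vertex-hash edge partitioning with a local store for intra-partition edges and a central store for cross-partition edges; JasmineGraph's actual partitioner (e.g. its METIS path) is more elaborate, and this function is a hypothetical simplification.

```python
# Minimal sketch of hash-based edge partitioning, assuming a simple
# vertex-hash scheme. Intra-partition edges go to a "local store";
# cross-partition edges go to a "central store".
from collections import defaultdict

def hash_partition(edges, num_partitions):
    local = defaultdict(list)    # partition id -> intra-partition edges
    central = defaultdict(list)  # partition id -> cross-partition edges
    for src, dst in edges:
        p_src = hash(src) % num_partitions
        p_dst = hash(dst) % num_partitions
        if p_src == p_dst:
            local[p_src].append((src, dst))
        else:
            central[p_src].append((src, dst))
    return local, central

# Example: partition a tiny edge list into 2 partitions.
local, central = hash_partition([(0, 1), (1, 2), (2, 3)], num_partitions=2)
```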
Methodology (Contd.)
▰ Node Embedding Generation
12
▻ Training process happens
in database server
partitions
▻ Using GraphSAGE
▻ Implemented in TensorFlow
▻ Embeddings are written to
the model store
Methodology (Contd.)
▰ We use localized graph
convolution modules to
train each graph
partition
▰ The training happens in
an unsupervised
manner
13
Methodology (Embedding Generation for
a Graph Partition)
▰ The training process is initiated by concatenating the local store partition with its corresponding central store partition
▰ The concatenated graph's structure and node features are input to the neural network
▰ The hidden layers of the GNN are structured to transform the node features and aggregate them over the graph to generate node embeddings (see the sketch below)
14
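To make the localized convolution concrete, here is a minimal NumPy sketch of one GraphSAGE-style mean-aggregator layer. The mean aggregation, ReLU, and L2 normalization follow the GraphSAGE paper [10], but this is a simplified sketch, not the authors' TensorFlow code.

```python
import numpy as np

def mean_aggregate_layer(features, neighbors, W_self, W_neigh):
    """One GraphSAGE-style mean-aggregator layer.

    features:  (N, d_in) node feature matrix
    neighbors: dict mapping node id -> list of neighbour ids
    W_self, W_neigh: (d_in, d_out) learned weight matrices
    """
    n, d_in = features.shape
    out = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        neigh = neighbors.get(v, [])
        h_neigh = features[neigh].mean(axis=0) if neigh else np.zeros(d_in)
        out[v] = np.maximum(0.0, features[v] @ W_self + h_neigh @ W_neigh)  # ReLU
    norms = np.linalg.norm(out, axis=1, keepdims=True) + 1e-12
    return out / norms  # L2-normalized embeddings, as in GraphSAGE

# Example: a 4-node path graph with 8-dim features embedded into 4 dims.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
emb = mean_aggregate_layer(feats, adj, rng.standard_normal((8, 4)),
                           rng.standard_normal((8, 4)))
```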
Methodology (Contd.)
▰ Link Prediction using generated node embeddings
▻ Inter worker communication to collect node
embeddings
▻ No linear comparison with all other nodes (which would cost O(N) time per query)
▻ Apply Locality Sensitive Hashing to rank predictions
15
Link Prediction Algorithm
▰ Accepts a starting node, denoted the query node (q), and outputs a list of predicted nodes
▰ Uses the random projection method of LSH (sketched below)
16
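A minimal sketch of the random projection method of LSH over node embeddings, assuming the sign pattern of projections onto random hyperplanes forms the bucket id; parameter names such as num_planes are illustrative, not from the paper.

```python
import numpy as np

def lsh_buckets(embeddings, num_planes=16, seed=0):
    """Random projection LSH: the sign pattern of projections onto random
    hyperplanes gives each node an integer bucket id."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], num_planes))
    bits = embeddings @ planes > 0                       # (N, num_planes) bools
    return (bits * (1 << np.arange(num_planes))).sum(axis=1)

def candidate_links(q, buckets):
    """Nodes sharing the query node q's bucket are the candidate predictions."""
    cands = np.flatnonzero(buckets == buckets[q])
    return cands[cands != q]                             # exclude q itself

# Example: 1000 nodes with 64-dimensional embeddings, 8 hyperplanes.
emb = np.random.default_rng(1).standard_normal((1000, 64))
print(candidate_links(0, lsh_buckets(emb, num_planes=8)))
```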
Scheduling Algorithm
▰ Decides which partitions can be trained in parallel within the available memory
▰ Two main objectives:
▻ Utilize the available memory optimally
▻ Finish training all partitions in the minimum number of iterations
▰ Bin Packing Problem: given n items with weights and bins with capacity c, assign each item to a bin such that the total number of bins is minimized
▻ Bins - Training iterations
▻ Capacity - Available memory
▻ Items - Graph partitions
▻ Weights - Memory requirement of each partition (see the sketch below)
17
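The slides frame scheduling as bin packing; a standard greedy approximation for it is first-fit decreasing (FFD), sketched below with training iterations as bins and per-partition memory estimates as weights. This is a generic FFD sketch under those assumptions; the paper's exact algorithm may differ.

```python
def schedule_partitions(memory_requirements, capacity):
    """First-fit decreasing bin packing: pack partitions (items) into
    training iterations (bins) bounded by available memory (capacity).

    memory_requirements: dict partition_id -> estimated memory (e.g. GB)
    Returns a list of iterations, each a list of partition ids trained together.
    """
    iterations = []  # each entry: [remaining_capacity, [partition ids]]
    # Sort heaviest-first so large partitions claim iterations early.
    for pid, mem in sorted(memory_requirements.items(),
                           key=lambda kv: kv[1], reverse=True):
        if mem > capacity:
            raise ValueError(f"partition {pid} exceeds capacity")
        for it in iterations:
            if it[0] >= mem:           # first iteration with enough room
                it[0] -= mem
                it[1].append(pid)
                break
        else:                          # no fit: open a new iteration
            iterations.append([capacity - mem, [pid]])
    return [pids for _, pids in iterations]

# Example: 4 partitions on a worker with 8 GB of free memory.
print(schedule_partitions({0: 5, 1: 4, 2: 3, 3: 2}, 8))  # -> [[0, 2], [1, 3]]
```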
Data Sets
18
Dataset      | Vertices  | Edges      | Features | Edgelist File Size (MB) | Feature File Size (MB)
Twitter      | 81,306    | 1,768,149  | 1007     | 16                      | 157
Amazon Small | 548,551   | 1,244,636  | 250      | 19.4                    | 266
Reddit       | 232,965   | 11,606,919 | 602      | 145                     | 270
DBLP-V11     | 4,107,340 | 36,624,464 | 948      | 508                     | 9523
Data Sets (Contd.)
19
Experiments and the environments
20
Experiments:
▰ Vertical scalability
▻ Node embedding accuracy experiments
▻ Graph training time experiments
▰ Horizontal scalability
▻ Node embedding accuracy experiments
▻ Graph training time experiments
Environments:
▰ Server: CPU 80, RAM 64GB, OS Ubuntu 16.04.6 LTS, Disk 1.8TB
▰ Master: CPU 4, RAM 16GB, OS Ubuntu 16.04.6 LTS, Disk 100GB
▰ Worker: CPU 8, RAM 30GB, OS Ubuntu 16.04.6 LTS, Disk 10GB
Graph Training Experiments - Vertical
Scalability
21
Graph Training Experiments - Vertical
Scalability
22
Graph Training Experiments - Horizontal
Scalability
23
Training Times for Different Graph
Partitions
24
Our approach could run the training process on the partitioned DBLP-V11
Accuracy Comparison Experiments
▰ Mean Reciprocal Rank (MRR)
▰ Hit@1 score
▰ Hit@10 score
(computation of these metrics is sketched below)
25
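A minimal sketch of how MRR and Hit@k can be computed from the rank each query assigns to its true linked node; this follows the standard definitions of the metrics, not the paper's evaluation code.

```python
def mrr_and_hits(ranks, k=10):
    """Compute MRR and Hit@k from per-query ranks.

    ranks: list of 1-based ranks at which the true linked node appeared
           in each query's prediction list.
    """
    mrr = sum(1.0 / r for r in ranks) / len(ranks)       # mean of 1/rank
    hit_at_k = sum(r <= k for r in ranks) / len(ranks)   # fraction ranked in top k
    return mrr, hit_at_k

# Example: true links ranked 1st, 3rd, and 12th by three queries.
print(mrr_and_hits([1, 3, 12], k=10))  # MRR ~= 0.472, Hit@10 ~= 0.667
```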
MRR Value
26
Hit@1 Score Comparison
27
Hit@10 Score Comparison
28
Training Accuracy of Different Graph
Partitions
29
Conclusion
▰ Current graph link prediction approaches cannot scale well to large datasets
▰ A solution is to perform link prediction in a distributed manner
▰ Densely connected components play a critical role in determining the performance of the overall training process
30
Conclusion (Contd.)
▰ JasmineGraph was able to train a GCN on the largest dataset, DBLP-V11 (> 9.3GB), in 11 hours and 40 minutes using 16 workers on a single server.
▰ The original GraphSAGE implementation processed Reddit in 238 minutes, while JasmineGraph took only 100 minutes on the same hardware with 16 workers: a 2.4x performance improvement.
▰ Future work - graph stream processing, privacy preserving machine
learning
31
Thank you!
32
