DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
2023.03.21
Rami Al-Rfou, Dustin Zelle, and Bryan Perozzi
WWW ‘19
Nguyen Minh Duc
Contents
• Introduction
• Related Work
• Model Description
• DDGK Algorithm
• Experimental Results
• Extensions and Future Work
• Conclusion
Introduction
- Graph representation learning usually relies on:
  - Supervised learning
  - Feature engineering
- Generic representations of graphs:
  - Algorithmic approaches from theoretical computer science
- Measuring graph similarity is hard due to:
  - NP-hardness of classical measures (e.g., Graph Edit Distance, Maximum Common Subgraph)
  - Graph isomorphism (no known polynomial-time algorithm)
- DDGK learns graph representations without supervision or domain knowledge
Related Work
Traditional Graph Kernels:
- Graph Edit Distance (Gao et al., 2010) and Maximum Common Subgraph (Bunke et al., 2002)
- Weisfeiler-Lehman Graph Kernels (Kriege et al., 2016)
Node Embedding Methods:
- DeepWalk (Perozzi et al., 2014)
- Graph Attention (Abu-El-Haija et al., 2018)
Graph Statistics (Feature Engineering):
- NetSimile (Berlingerio et al., 2012)
- DeltaCon (Koutra et al., 2013)
Supervised Graph Similarity:
- CNNs for graphs (Niepert et al., 2016)
- Graph Convolutional Networks (T. Kipf and M. Welling, 2016)
Model Description
2. Cross-Graph Attention (Isomorphism Attention)
- Given two graphs S (source graph) and T (target graph)
- Provides a bidirectional mapping across the pair's nodes
- Input: a one-hot encoded vertex from T
- Output: the vertex's neighbors
Model Description
2. Cross-Graph Attention
The first attention network (M_{T→S}):
- Assigns every node in T a probability distribution over the nodes of S
- Consists of one linear layer
- Modeled as a multiclass classifier:

Pr(v_j \mid u_i) = \frac{e^{M_{T \to S}(v_j, u_i)}}{\sum_{v_k \in V_S} e^{M_{T \to S}(v_k, u_i)}}
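A minimal sketch of how this classifier could look (assuming PyTorch; the class and variable names are mine, not from the paper): a single linear layer over the one-hot target vertex, followed by a softmax over the source graph's vertices.

```python
import torch
import torch.nn as nn

class AttentionTtoS(nn.Module):
    """One linear layer mapping a one-hot target vertex u_i to a
    probability distribution Pr(v_j | u_i) over the source vertices."""
    def __init__(self, num_target_nodes, num_source_nodes):
        super().__init__()
        self.linear = nn.Linear(num_target_nodes, num_source_nodes)

    def forward(self, one_hot_target_vertex):
        scores = self.linear(one_hot_target_vertex)  # logits M_{T->S}(., u_i)
        return torch.softmax(scores, dim=-1)         # multiclass: sums to 1

# Example: a 4-node target graph and a 5-node source graph.
attn = AttentionTtoS(num_target_nodes=4, num_source_nodes=5)
u_2 = torch.eye(4)[2]        # one-hot encoding of target vertex u_2
print(attn(u_2))             # distribution over the 5 source vertices
```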
Model Description
2. Cross-Graph Attention
The reverse attention network (M_{S→T}):
- Maps a neighborhood in S to the corresponding neighborhood in T
- Consists of one linear layer
- Modeled as a multilabel classifier:

Pr(u_j \mid N(v_i)) = \frac{1}{1 + e^{-M_{S \to T}(u_j, N(v_i))}}
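A matching sketch of the reverse attention (again PyTorch, names are my own): a sigmoid on each output unit makes it a multilabel classifier, so several target vertices can be predicted as neighbors at once.

```python
import torch
import torch.nn as nn

class AttentionStoT(nn.Module):
    """One linear layer mapping a (multi-hot) source neighborhood N(v_i)
    to independent probabilities Pr(u_j | N(v_i)) for every target vertex."""
    def __init__(self, num_source_nodes, num_target_nodes):
        super().__init__()
        self.linear = nn.Linear(num_source_nodes, num_target_nodes)

    def forward(self, source_neighborhood):
        return torch.sigmoid(self.linear(source_neighborhood))  # multilabel

# Example: the predicted neighborhood {v_1, v_3} in a 5-node source graph.
rev = AttentionStoT(num_source_nodes=5, num_target_nodes=4)
n_vi = torch.tensor([0., 1., 0., 1., 0.])
print(rev(n_vi))   # per-vertex probability of being a neighbor in T
```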
Model Description
3. Attributes Consistency
Node attribute regularizer:
- Vertices and edges can carry their own attributes
- Cross-graph attention may yield several equally good structural mappings
- Solution: add regularizing losses that preserve node and edge attributes (see the sketch below)
- Q_n is the attribute distribution over nodes; replacing Q_n with Q_e gives the edge attribute regularizer
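The slide does not reproduce the exact regularizer, so the following is only a rough sketch of the idea under my own assumptions (the function name and the cross-entropy formulation are mine, not the authors'): penalize target nodes whose attention-weighted view of the source attributes Q_n disagrees with their own attributes.

```python
import torch

def node_attribute_regularizer(attn_t_to_s, source_attrs, target_attrs, eps=1e-9):
    """Hedged sketch of an attribute-consistency loss (my own formulation).

    attn_t_to_s:  (|V_T|, |V_S|) attention probabilities Pr(v_j | u_i)
    source_attrs: (|V_S|, A) node attribute distributions Q_n of the source graph
    target_attrs: (|V_T|, A) node attribute distributions of the target graph
    """
    induced = attn_t_to_s @ source_attrs                         # attributes each u_i "sees" through attention
    ce = -(target_attrs * torch.log(induced + eps)).sum(dim=-1)  # cross-entropy per target node
    return ce.mean()
```

Swapping the node attribute matrices Q_n for edge attribute matrices Q_e would give the edge-level counterpart, as the slide notes.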
DDGK Algorithm
1. The Algorithm
- The divergence score for every pair of source and target graphs is stored in the matrix Ψ (see the sketch below)
- Each row of Ψ can be used as a graph's representation vector
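A minimal sketch of assembling Ψ (plain Python/NumPy; `ddgk_divergence` is a hypothetical helper that trains the source encoder plus cross-graph attention for one pair and returns the divergence score, it is not shown here):

```python
import numpy as np

def divergence_matrix(source_graphs, target_graphs, ddgk_divergence):
    """Psi[i, j] = divergence of target graph i scored against source graph j."""
    psi = np.zeros((len(target_graphs), len(source_graphs)))
    for i, target in enumerate(target_graphs):
        for j, source in enumerate(source_graphs):
            psi[i, j] = ddgk_divergence(source, target)  # hypothetical helper
    return psi

# Each row of Psi is then used as the representation vector of one target graph.
```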
DDGK Algorithm
2. Graph Divergence
- Since the learned divergence is not exact, D(S||S) ≠ 0 can happen.
- Setting D'(S||T) := D(S||T) − D(S||S) ensures D'(S||S) = 0.
- If symmetry is required, we can define D''(S||T) := D'(S||T) + D'(T||S).
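Both corrections are simple operations on Ψ; a minimal sketch (NumPy, assuming the rows and columns of Ψ are indexed by the same graphs, so the diagonal holds D(S||S)):

```python
import numpy as np

def normalize(psi):
    """D'(S||T) := D(S||T) - D(S||S), so the diagonal becomes zero."""
    return psi - np.diag(psi)[:, None]

def symmetrize(psi_prime):
    """D''(S||T) := D'(S||T) + D'(T||S)."""
    return psi_prime + psi_prime.T
```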
DDGK Algorithm
3. Scalability
- DDGK requires O(T·N²·V) computations, where:
  - T = max(ρ, τ)
  - N = the number of graphs
  - V = the average number of nodes per graph
- The linear layers in cross-graph attention can be replaced by a DNN with fixed-size hidden layers, reducing the network size from O(|V_S| × |V_T|) to O(|V_S| + |V_T|) (see the sketch below)
- For a large number of source graphs, we can sample only 20% of them and DDGK still achieves high accuracy
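A quick illustration of the size argument (PyTorch; the sizes are made up for the example): with a fixed hidden width H, a two-layer bottleneck grows as O(|V_S| + |V_T|) parameters rather than O(|V_S| × |V_T|).

```python
import torch.nn as nn

V_S, V_T, H = 1000, 1200, 64   # example graph sizes; H is the fixed hidden width

# Single linear attention layer: O(|V_T| * |V_S|) parameters.
direct = nn.Linear(V_T, V_S)

# Fixed-size bottleneck: O(|V_T| * H + H * |V_S|) = O(|V_T| + |V_S|) parameters.
bottleneck = nn.Sequential(nn.Linear(V_T, H), nn.ReLU(), nn.Linear(H, V_S))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(direct), count(bottleneck))   # roughly 1.2M vs 0.14M
```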
Extensions & Future Work
Graph Encoders
- Edge-to-Nodes Encoder.
- Neighborhood Encoder.
Attention Mechanism
- Subgraph alignment.
Regularization
- Better regularization to avoid overfitting.
Feature Engineering
- Combining learned representations with engineered features could be useful for graph classification.
Scalability
- Perozzi’s newer work, “Just SLaQ When You Approximate: Accurate Spectral Distances for Web-Scale Graphs” (WWW ’20), can handle graphs with billions of nodes within an hour.
Conclusion
- Neural networks can learn powerful representations of graphs without feature engineering.
- Proposed DDGK:
  - Graph Encoder
  - Isomorphism-preserving attention
  - Provides interpretability into the alignment of graph pairs
  - Divergence score to measure the (dis)similarity between source and target graphs
- Representations produced by DDGK are competitive with challenging baselines.
Speaker Notes
Generic representations of graphs -> generic node alignment -> extract useful information.
Algorithmic approaches come from theoretical computer science.
Classical measures such as Graph Edit Distance and Maximum Common Subgraph are NP-hard.
Graph isomorphism is a hard problem (no known polynomial-time algorithm).
DeepWalk learns embeddings of a graph's vertices by modeling a stream of short random walks.
Overfit the model on the source graph to accurately capture the graph's structure.
A similar idea applies to the target graph.
The idea: given a vertex in the target graph, find the most similar vertex in the source graph.
The activation layer is a softmax.
The source graph encoder outputs the neighbors of the chosen vertex.
From that, the reverse attention predicts its corresponding position in the target graph.
The activation layer is a sigmoid.
Overall structure of the model
There can be many node mappings from the target graph to the source graph, but not all of them preserve the attributes on the graph's nodes and edges.
Solution?
This demonstrates the power of attribute regularization: the two graphs are identical, so the attention map should produce an identity matrix.
This is one application of DDGK: hierarchical clustering of 30 different graphs.
The graphs are sampled from different datasets, such as a neural network structure, a social network, a network of common nouns and adjectives in a novel, and chemistry-related graphs.
Dimension sampling: an experiment with different amounts of sampling of the source graph set.
Notice that the accuracy converges quickly from just 20% of the original size.
I also did my own experiment with this method: I implemented the model on Google Colab and measured the time taken to process graphs of different sizes.
SLaQ uses spectral analysis on graphs, which relies on linear-algebraic properties of the graph. I have looked at this paper, but it is quite hard to understand.