Graph Representation Learning:
Theories and Applications
Louis Wang
Apr 22nd 2020
1
Agenda
- Graph Representation Learning
- Shallow Methods
- DeepWalk [Perozzi et al., 2014 KDD]
- Node2Vec [Grover et al., 2016 KDD]
- Deep Methods: GNN
- Graph Convolutional Networks (GCN) [Kipf & Welling, ICLR 2017]
- GraphSAGE [Hamilton et al., 2017 NIPS]
- Graph Attention Networks (GAT) [Veličković et al., ICLR 2018]
- Applications
- Pins recommendation with PinSage by Pinterest
- Dish Recommendation on Uber Eats with GCN
2
Graph Data are everywhere
3
Graph Representation Learning
Node Embedding Graph Embedding
Representation learning is learning representations of input data, typically by transforming it or
extracting features from it (by some means), that make it easier to perform a task such as classification
or prediction. [Yoshua Bengio, 2014]
Embedding is ALL you need:
word2vec, doc2vec, node2vec, item2vec, struc2vec…
4
Tasks on Graph
Node Classification
- Predict the type of a given node.
Edge Classification/Link Prediction
- Predict whether two nodes are linked
or the type of the link.
Graph Classification / Community Detection
- Identify densely linked clusters of nodes, or classify entire graphs
Network Similarity
- How similar are two (sub)networks?
5
Goal: Encode nodes (with an encoder f) so that similarity in the embedding space (e.g., dot product) approximates similarity in
the original network
1. Define an Encoder
2. Define a similarity function
3. Optimization
Node Embedding
6
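As a concrete instance of steps 1–3 (a standard setup, sketched here rather than taken from the slide), a dot-product decoder trains the embeddings so that

z_u · z_v ≈ similarity(u, v),   where z_u = ENC(u), z_v = ENC(v)

and the optimization step adjusts the encoder parameters to make the two sides match for training pairs.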
Shallow Encoding —— an embedding-lookup table
ENC(v) = Zv, where Z is the embedding matrix (one learnable column per node) and v is the one-hot indicator vector of node v.
Methods: DeepWalk[Perozzi et al. 2014 KDD], Node2vec[Grover et al. 2016 KDD], etc.
7
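A minimal Python sketch of such a lookup encoder (the node count and embedding dimension are illustrative assumptions, not from the slide):

import numpy as np

num_nodes, dim = 5, 3                 # illustrative sizes
Z = np.random.randn(dim, num_nodes)   # one learnable d-dimensional column per node

def encode(v):
    # ENC(v) = Z v with v a one-hot indicator vector; equivalent to a column lookup.
    indicator = np.zeros(num_nodes)
    indicator[v] = 1.0
    return Z @ indicator

z = encode(2)                         # identical to Z[:, 2]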
Shallow Methods Framework
Generate ‘sentences’ with random walks (different walk strategies):
- unbiased walk: DeepWalk
- biased walk: Node2Vec
Idea: Optimize the node embeddings so that nodes have similar embeddings if they tend to co-occur on
short random walks over the graph.
8
DeepWalk
1. Run short, fixed-length random walks starting from each node of the graph using some strategy R.
2. For each node u, collect N(u): the multiset of nodes visited on random walks starting from u (see the sketch below).
3. Optimize embeddings according to: given node u, predict its neighbors N(u).
9
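A minimal Python sketch of steps 1–2 above (uniform, unbiased walks; the adjacency representation and walk hyperparameters are illustrative assumptions):

import random

def random_walk(adj, start, walk_length=10):
    # Step 1: a short, fixed-length, unbiased random walk starting at `start`.
    # adj: dict mapping each node to the list of its neighbors.
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def neighborhoods(adj, walks_per_node=5, walk_length=10):
    # Step 2: N(u), the multiset of nodes visited on walks starting from u.
    N = {}
    for u in adj:
        N[u] = []
        for _ in range(walks_per_node):
            N[u].extend(random_walk(adj, u, walk_length)[1:])
    return N

Step 3 then feeds the walks into a skip-gram model, exactly as word2vec does with sentences.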
DeepWalk Optimization
Optimizing this loss function naively is slow because…
1. the nested sum over node pairs gives O(|V|^2) complexity
2. the normalization term of the softmax runs over all nodes
Solution: negative sampling
• Sample k negative nodes with probability
proportional to their degree, instead of
normalizing over all nodes.
• k trades off predictive accuracy against
computational efficiency.
10
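Written out, the negative-sampled objective takes the standard skip-gram form (a reference sketch, not copied from the slide; σ is the sigmoid function and P_V samples negatives with probability proportional to node degree):

L = Σ_{u∈V} Σ_{v∈N(u)} ( −log σ(z_u · z_v) − Σ_{i=1..k} log σ(−z_u · z_{n_i}) ),   n_i ~ P_V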
Node2Vec —— Let’s generate biased walks
Idea: a flexible notion of a node’s network neighborhood leads to richer node embeddings.
11
Two parameters:
• return parameter p:
• controls the likelihood of returning to the previous node
• ‘walk away’ parameter q:
• controls moving outwards (DFS) vs. inwards (BFS)
• intuitively, q is the ratio of BFS vs. DFS
Node2Vec —— Explore neighborhoods in both a BFS and a DFS fashion.
The walker has just traversed edge (s1, w) and is now at w.
The next node can only be one of:
- s2: at the same distance from s1
- s1: back to s1
- s3/s4: farther from s1
12
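A minimal Python sketch of this biased step (unnormalized transition weights; the adjacency representation is an illustrative assumption):

def transition_weights(adj, prev, curr, p, q):
    # The walker came from `prev` (s1) and is now at `curr` (w).
    # 1/p: return to prev; 1: move to a node at the same distance from prev;
    # 1/q: move farther away from prev (DFS-like exploration).
    prev_neighbors = set(adj[prev])
    weights = {}
    for x in adj[curr]:
        if x == prev:
            weights[x] = 1.0 / p
        elif x in prev_neighbors:
            weights[x] = 1.0
        else:
            weights[x] = 1.0 / q
    return weights

Normalizing these weights and sampling the next node yields the biased walks, which are then fed to skip-gram exactly as in DeepWalk.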
Limitations of Shallow Encoders
• O(|V|) parameters are needed:
• Each node has a unique embedding.
• No sharing of parameters between nodes.
• Inherently “transductive”:
• Either impossible or very time-consuming to generate
embeddings for nodes not seen during training.
• Does not incorporate node features
• many graphs have features that we can and should leverage
13
Graph Convolutional Networks
Idea:
Node’s neighborhood defines a computation graph.
To obtain node representations, use a neural network to aggregate information from a node’s neighbors recursively, via a limited-depth BFS.
14
Graph Convolutional Networks
• Each layer corresponds to one level of depth in the BFS.
• Nodes have an embedding at each layer.
• The layer-0 embedding of node u is its input
feature vector.
• The layer-K embedding aggregates information from
nodes up to K hops away; these are the final embeddings.
So we need…
1. AGG: an aggregator for collecting information
from a node’s neighborhood.
2. NNs: neural networks for the neighborhood
representation (e.g., NN W1) and the node’s self
embedding (e.g., NN B1).
3. A loss function for optimization.
15
Mathematically…
16
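The equations on this slide did not survive extraction; a standard formulation consistent with the AGG / W1 / B1 notation of the previous slide is (a reconstruction, not the original slide content):

h_v^0 = x_v                                                              (layer 0: input features)
h_v^k = σ( W_k · ( Σ_{u∈N(v)} h_u^(k-1) / |N(v)| ) + B_k · h_v^(k-1) ),   k = 1 … K
z_v   = h_v^K                                                            (final embedding)

where σ is a non-linearity (e.g., ReLU), W_k transforms the averaged neighborhood and B_k transforms the node’s own previous-layer embedding.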
Supervised Training vs Unsupervised Training
For the shallow methods, we train the model in an unsupervised manner:
• use only the graph structure
• similar nodes get similar embeddings
• feed the ‘sentences’ into a skip-gram model
For GCN, we can directly train the model for a supervised task, such as node classification.
We can feed the embeddings into any loss function and run stochastic gradient descent to train the parameters.
17
Inductive capability
1. In many real applications, new nodes are often added to the graph.
We need to generate embeddings for new nodes without retraining,
which is hard to do with shallow methods.
2. The same aggregation parameters are shared across all nodes, so the number of model parameters is sublinear in |V|
and the model generalizes to unseen nodes.
18
GraphSAGE —— Graph SAmple and aggreGatE
GCN aggregates the neighbor messages simply by taking a weighted average. How can we do better?
Idea: generalize the aggregation over a node’s neighbors and concatenate the aggregated neighborhood features with the node’s own features.
19
Neighborhood Aggregator
Mean: take a weighted average of the neighbors’ embeddings.
Pooling: apply element-wise mean or max pooling.
LSTM: apply an LSTM to a random permutation of the neighbors.
(A sketch of one mean-aggregator layer follows below.)
20
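A minimal NumPy sketch of one GraphSAGE layer with the mean aggregator (W_self, W_neigh and the L2 normalization are illustrative choices, not prescribed by the slides):

import numpy as np

def graphsage_mean_layer(h, adj, W_self, W_neigh):
    # h: (num_nodes, d_in) embeddings from the previous layer.
    # adj: dict mapping each node index to a list of neighbor indices.
    h_next = []
    for v in range(h.shape[0]):
        neigh = adj[v]
        agg = h[neigh].mean(axis=0) if neigh else np.zeros(h.shape[1])
        # Transform self and aggregated-neighbor parts separately and add them
        # (equivalent to concatenating [h_v, agg] and applying one weight matrix).
        out = np.maximum(0.0, h[v] @ W_self + agg @ W_neigh)   # ReLU
        h_next.append(out / (np.linalg.norm(out) + 1e-8))      # optional L2 normalization
    return np.stack(h_next)

Stacking K such layers, each with its own weight matrices, produces the K-hop embeddings described on the GCN slides.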
Recap for GCN, GraphSAGE
Key Idea: Generate node embeddings based on local neighborhoods using neural networks
Graph Convolutional Network:
• Average neighborhood information and stack neural network layers
GraphSAGE:
• Generalized neighborhood aggregation (AVG, POOLING, LSTM, etc.)
21
Graph Attention Network —— Learnable Aggregator for GCN
Idea: Borrow the idea of attention mechanisms and learn to assign different weights to different
neighbors in the aggregation process.
Attention Is All You Need [Vaswani et al., 2017 NIPS]
22
Graph Attention Network —— Learnable Aggregator for GCN
a is the attention mechanism function
e_uv indicates the importance of node u’s message to node v
α_uv is the coefficient, normalized across v’s neighborhood with a softmax function
Compute the embedding of each node in the graph following an attention strategy:
• nodes attend over their neighbors’ messages
• implicitly specifying different weights to different nodes in a neighborhood
23
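Written out, these quantities take the standard GAT form (reconstructed here for reference; the slide’s own equations did not survive extraction):

e_uv = a(W h_u, W h_v)                        (importance of u’s message to v)
α_uv = exp(e_uv) / Σ_{k∈N(v)} exp(e_kv)       (softmax over v’s neighborhood)
h'_v = σ( Σ_{u∈N(v)} α_uv · W h_u )           (attention-weighted aggregation)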
Attention Mechanism
Attention mechanism a:
The approach is agnostic to the choice of a:
• the original paper uses a simple single-layer neural network
• multi-head attention can stabilize the learning process of the attention mechanism
• a can have parameters, which need to be estimated
Parameters of a are trained jointly:
• learned together with the weight matrices in an end-to-end fashion
Benefits:
• Computationally efficient:
computation of the attention coefficients can be parallelized across all edges of the graph;
aggregation can be parallelized across all nodes
• Storage efficient:
sparse matrix operations do not require more than O(V + E) entries to be stored;
fixed number of parameters, irrespective of graph size
• Trivially localized:
only attends over local network neighborhoods (masked attention)
• Inductive capability:
a is a shared, edge-wise mechanism
that does not depend on the global graph structure
24
Applications —— PinSage
Challenge for Pinterest:
scaling up GCN-based node embeddings to training and inference on a graph with
300M+ users, 4B+ pins, and 2B+ boards is difficult.
Innovations:
• Importance-based neighborhood sampling: simulate random walks and select the neighbors
with the highest visit counts (importance pooling).
• Selecting a fixed number of nodes to aggregate from controls the memory footprint of the
algorithm during training.
25
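A minimal Python sketch of the importance-based sampling idea (the function name and hyperparameters are illustrative, not PinSage’s actual implementation):

import random
from collections import Counter

def importance_neighbors(adj, v, top_t=10, num_walks=200, walk_length=3):
    # Simulate short random walks from v and keep the top-T most-visited nodes
    # as v's neighborhood; normalized visit counts can also serve as weights
    # for importance pooling.
    visits = Counter()
    for _ in range(num_walks):
        curr = v
        for _ in range(walk_length):
            if not adj[curr]:
                break
            curr = random.choice(adj[curr])
            visits[curr] += 1
    visits.pop(v, None)   # exclude the node itself
    return [u for u, _ in visits.most_common(top_t)]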
26
Applications —— Uber Eats
Innovations:
max-margin loss: a customized loss function used when training GraphSAGE; well suited to weighted graphs.
27
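A common form of such a max-margin (hinge) ranking loss, sketched here for reference (Δ is the margin; the exact variant used by Uber Eats may differ):

L = Σ_{(u, pos, neg)} max( 0, z_u · z_neg − z_u · z_pos + Δ )

i.e., the score of a positive (e.g., previously ordered) item pos must beat that of a sampled negative item neg by at least the margin Δ.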
28
Applications —— Uber Eats
