Paper study: Attention, learn to solve routing problems!
1. Attention, Learn to Solve
Routing Problems!
ICLR 2019
University of Amsterdam
Wouter Kool, Herke van Hoof and Max Welling
2. Abstract
• Learning heuristics for combinatorial optimization problems can save
costly development.
• Propose a model based on attention layers and train this model using
REINFORCE with a baseline based on deterministic greedy rollout.
• Outperform recent learned heuristics for TSP.
3. Introduction
• Approaches to solving combinatorial optimization problems can be
divided into
• Exact methods: guarantee finding optimal solutions
• Heuristics: trade off optimality for computational cost; usually expressed
in the form of rules (like a policy for making decisions)
• Train a model to parameterize policies to obtain new and stronger
algorithms for routing problems.
4. Introduction (cont’d)
• Propose a model based on attention and train it using REINFORCE
with greedy rollout baseline.
• Show the flexibility of the proposed approach on multiple routing
problems.
6. Attention mechanism
• For an encoder-decoder model, use attention to obtain a new context vector.
• h_j denotes an encoder hidden state, s_i a decoder hidden state.
• Alignment model (compatibility): the relationship between the current
decoding state and every encoding state
• e_ij = a(s_{i−1}, h_j)
• Attention weight
• α_ij = exp(e_ij) / Σ_{k=1}^T exp(e_ik)
• Context vector
• c_i = Σ_{j=1}^T α_ij h_j
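As a sketch, the attention step above can be written in NumPy. The additive form of the alignment model a(·,·) shown here (v^T tanh(W_s s + W_h h)) and all the shapes are illustrative assumptions, not something the slides specify:

```python
import numpy as np

def attention_context(s_prev, H, W_s, W_h, v):
    """Additive alignment e_ij = a(s_{i-1}, h_j) with
    a(s, h) = v^T tanh(W_s s + W_h h), then softmax and weighted sum."""
    e = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v       # compatibilities e_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                              # attention weights alpha_ij
    return alpha @ H, alpha                           # context c_i and the weights

rng = np.random.default_rng(0)
T, d = 5, 4                       # T encoder states of dimension d
H = rng.normal(size=(T, d))       # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=d)       # previous decoder state s_{i-1}
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
c, alpha = attention_context(s_prev, H, W_s, W_h, v)
```

The weights α form a distribution over the T encoder states, and the context is their weighted average.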
7. Transformer
• Multi-head attention: project the input embeddings into several different
subspaces, one per head
• Self-attention: no additional decoding state, just the encoding states
themselves
• Each head has its own attention mechanism
9. Problem definition
• Define a problem instance s as a graph with n nodes, where node i ∈
{1, …, n} is represented by features x_i.
• For TSP, x_i is the coordinate of node i (in 2D space).
• Define a solution π = (π_1, …, π_n) as a permutation of the nodes.
• Given a problem s, the model outputs a policy p(π|s) for selecting a
solution π.
10. Encoder-decoder model
• Encoder-decoder model defines stochastic policy 𝑝(𝜋|𝑠) for selecting a solution 𝜋
given a problem instance 𝑠.
p_θ(π | s) = Π_{t=1}^n p_θ(π_t | s, π_{1:t−1})
• The encoder produces embeddings of all input nodes.
• The decoder produces the sequence π one node at a time, based on the node
embeddings, a mask, and a context.
• For TSP,
• node embeddings: from the encoder
• mask: the nodes remaining (unvisited) during decoding
• context: embeddings of the first and last node of the tour so far
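The decoding loop above can be sketched schematically. The policy here is a uniform stand-in over unvisited nodes, an assumption replacing the paper's attention decoder; only the mask/context bookkeeping is the point:

```python
import numpy as np

def decode_tour(embeddings, policy, rng):
    """Build a tour one node at a time from node embeddings, a visited mask,
    and a context of (first, last) tour node, as described above."""
    n = embeddings.shape[0]
    visited = np.zeros(n, dtype=bool)   # mask: nodes already in the tour
    tour = []
    for t in range(n):
        context = (tour[0], tour[-1]) if tour else None  # first and last node
        p = policy(embeddings, visited, context)         # masked distribution
        node = int(rng.choice(n, p=p))
        tour.append(node)
        visited[node] = True
    return tour

def uniform_policy(embeddings, visited, context):
    # stand-in for the attention decoder: uniform over unvisited nodes
    p = (~visited).astype(float)
    return p / p.sum()

rng = np.random.default_rng(0)
tour = decode_tour(np.zeros((6, 2)), uniform_policy, rng)  # a permutation of 0..5
```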
13. Multi-head attention
• MHA_i^l(h_1^{l−1}, …, h_n^{l−1})
• Let number of heads 𝑀 = 8, embedding dimension 𝑑ℎ = 128.
• Each head has its own attention mechanism.
14. Result vector of each head
• Each node has its own query 𝑞𝑖, key 𝑘𝑖 and value 𝑣𝑖.
• q_i = W^Q h_i, k_i = W^K h_i, v_i = W^V h_i
• W^Q and W^K are (d_k × d_h) matrices; W^V is a (d_v × d_h) matrix.
• Given node 𝑖 and another node 𝑗:
• 𝑞𝑖 and 𝑘𝑗 determine the importance of 𝑣𝑗
• Compatibility u_ij = q_i^T k_j / √d_k if node i is adjacent to node j,
else −∞.
• Attention weight a_ij = e^{u_ij} / Σ_{j′} e^{u_ij′} ∈ [0, 1]
• Result vector h_i′ = Σ_j a_ij v_j (size is d_v)
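One head of this mechanism is a few lines of NumPy. For TSP the graph is fully connected, so the −∞ case never fires and is omitted here; the dimensions are small illustrative choices:

```python
import numpy as np

def attention_head(H, W_Q, W_K, W_V):
    """One head: q_i = W^Q h_i, k_i = W^K h_i, v_i = W^V h_i, then
    u_ij = q_i^T k_j / sqrt(d_k), row-wise softmax, h_i' = sum_j a_ij v_j."""
    Q, K, V = H @ W_Q.T, H @ W_K.T, H @ W_V.T
    d_k = K.shape[1]
    U = Q @ K.T / np.sqrt(d_k)                    # compatibilities u_ij
    A = np.exp(U - U.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)             # attention weights a_ij
    return A @ V                                  # result vectors h_i' (n x d_v)

rng = np.random.default_rng(1)
n, d_h, d_k, d_v = 5, 8, 4, 4
H = rng.normal(size=(n, d_h))                     # node embeddings h_i
W_Q, W_K = rng.normal(size=(d_k, d_h)), rng.normal(size=(d_k, d_h))
W_V = rng.normal(size=(d_v, d_h))
H_out = attention_head(H, W_Q, W_K, W_V)
```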
15. 1. Compute the compatibility
2. Compute the attention weight
3. Compute the weighted sum of the v_j with weights a_ij
16. Final result vector
• Let h_im′ denote the result vector of node i in head m (size is d_v).
• In the Transformer, concatenate the result vectors first and then transform:
• MHA_i(h_1, …, h_n) = W^O concat(h_i1′, …, h_iM′), where W^O is a
d_h × (M·d_v) matrix
• In the proposed method, transform each result vector and sum them:
• MHA_i(h_1, …, h_n) = Σ_{m=1}^M W_m^O h_im′, where each W_m^O is a
d_h × d_v matrix
• Both methods output a d_h-dimensional vector for each node.
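A quick NumPy sanity check of why the two variants are equivalent: projecting the concatenation with one d_h × (M·d_v) matrix equals transforming each head's vector with its own d_h × d_v block and summing, when W^O is the block concatenation of the W_m^O:

```python
import numpy as np

rng = np.random.default_rng(2)
M, d_v, d_h = 8, 16, 128
heads = [rng.normal(size=d_v) for _ in range(M)]            # h_im' for one node i
W_blocks = [rng.normal(size=(d_h, d_v)) for _ in range(M)]  # per-head W_m^O

# Transformer style: one d_h x (M*d_v) matrix applied to the concatenation
W_O = np.concatenate(W_blocks, axis=1)
out_concat = W_O @ np.concatenate(heads)

# Proposed style: transform each head's vector with its own block and sum
out_sum = sum(W @ h for W, h in zip(W_blocks, heads))
```

The two outputs agree up to floating-point rounding, since block matrix multiplication distributes over the concatenation.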
17. Decoder
• At decoding time, the decoder context consists of the embeddings of the
graph, the last node, and the first node:
• h_c^(N) = [h̄^(N), h_{π_{t−1}}^(N), h_{π_1}^(N)] if t > 1, else
[h̄^(N), v^l, v^f]
• The (3·d_h)-dimensional result vector h_c^(N) is the embedding of the
special context node (c).
• [·, ·, ·] is the horizontal concatenation operator.
• v^l and v^f are learnable d_h-dimensional parameters.
18. Update context node embedding
• Obtain the new context node embedding h_c^(N+1) using M-head attention.
• The keys and values come from the node embeddings h_i^(N); the query comes
from the context node.
• q_c = W^Q h_c, k_i = W^K h_i, v_i = W^V h_i
• Compatibility u_(c)j = q_c^T k_j / √d_k (with d_k = d_h / M) if node j has
not been visited, else −∞.
• Apply MHA as before to get h_c^(N+1) (size is d_h).
19. Final output probability
• Compute p_θ(π_t | s, π_{1:t−1}) using a single attention head (M = 1,
d_k = d_h), but only compute compatibilities (no need for v_i).
• u_(c)j = C · tanh(q_c^T k_j / √d_k) ∈ [−C, C] if node j has not been
visited, else −∞ (C = 10).
• Compute the final output probability vector p using softmax:
• p_i = p_θ(π_t = i | s, π_{1:t−1}) = e^{u_(c)i} / Σ_j e^{u_(c)j}
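This final output step, clipped compatibilities plus a masked softmax, can be sketched as follows; the values and dimensions are illustrative:

```python
import numpy as np

def output_probs(q_c, K, visited, C=10.0):
    """Single-head compatibilities clipped by C*tanh, masked at visited
    nodes, then softmax over the rest."""
    d_k = K.shape[1]
    u = C * np.tanh(K @ q_c / np.sqrt(d_k))  # u_(c)j in [-C, C]
    u = np.where(visited, -np.inf, u)        # visited nodes get probability 0
    p = np.exp(u - u[~visited].max())        # exp(-inf) = 0 handles the mask
    return p / p.sum()

rng = np.random.default_rng(3)
n, d_k = 6, 8
q_c = rng.normal(size=d_k)                   # query from the context node
K = rng.normal(size=(n, d_k))                # keys from the node embeddings
visited = np.array([True, False, False, True, False, False])
p = output_probs(q_c, K, visited)
```

Subtracting the max over unvisited nodes before exponentiating is the usual numerical-stability trick; it does not change the resulting distribution.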
21. REINFORCE with baseline
• Define the loss ℒ(θ | s) = E_{p_θ(π|s)}[L(π)].
• Optimize ℒ by gradient descent using REINFORCE.
• Introducing a baseline reduces gradient variance and thus speeds up
learning:
• ∇ℒ(θ | s) = E_{p_θ(π|s)}[(L(π) − b(s)) ∇ log p_θ(π|s)]
• Common baselines
• Exponential moving average b(s) = M with decay β:
• M_0 = L(π), M_{t+1} = βM_t + (1 − β)L(π)
• Learned value function (critic) v̂(s, ω), where ω is learned from
pairs (s, L(π))
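The gradient formula above can be illustrated on a toy problem: a categorical policy over three "solutions" with fixed lengths, trained with REINFORCE and a constant baseline. This is only a sketch of the estimator, not the paper's model; the lengths, learning rate, and iteration count are arbitrary assumptions:

```python
import numpy as np

def reinforce_grad(theta, L, b, rng, n_samples=8):
    """Monte Carlo estimate of E[(L(pi) - b) * grad log p_theta(pi)]."""
    p = np.exp(theta - theta.max()); p /= p.sum()   # softmax policy p_theta
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(p), p=p)                 # sample a 'solution'
        grad_logp = -p.copy(); grad_logp[a] += 1.0  # d log p(a) / d theta
        g += (L[a] - b) * grad_logp                 # (L(pi) - b(s)) grad log p
    return g / n_samples

rng = np.random.default_rng(4)
theta = np.zeros(3)                 # logits of a categorical policy
L = np.array([3.0, 1.0, 2.0])       # fixed 'tour length' of each solution
for _ in range(500):                # gradient descent on E[L(pi)]
    theta -= 0.1 * reinforce_grad(theta, L, b=L.mean(), rng=rng)
p = np.exp(theta - theta.max()); p /= p.sum()
```

After training, the policy concentrates on the lowest-cost solution (index 1): sampled solutions cheaper than the baseline get their log-probability pushed up, costlier ones pushed down.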
22. Proposed baseline
• Two models: one for training, another for the baseline.
• Sample a solution π_i based on p_θ.
• Greedily pick a baseline solution π_i^BL based on p_{θ^BL}.
• Calculate the gradient of the loss with REINFORCE, using the length of
π_i^BL as the baseline.
• Replace the baseline parameters if the improvement is significant, by
copying the training parameters to the baseline.
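The steps above can be sketched as a skeleton. The policy, rollout, and length functions are placeholder lambdas (assumptions standing in for the attention model), and the significance test result is passed in as a boolean rather than the paper's one-sided t-test:

```python
import numpy as np

def greedy_rollout_baseline_step(theta, theta_bl, s, sample_tour, greedy_tour,
                                 tour_length):
    """One step of the scheme above: the advantage L(pi) - b(s), where b(s)
    is the length of a greedy rollout of the frozen baseline policy."""
    pi = sample_tour(theta, s)        # sample pi_i from p_theta
    pi_bl = greedy_tour(theta_bl, s)  # greedily decode pi_i^BL from p_theta_BL
    return tour_length(pi, s) - tour_length(pi_bl, s)

def maybe_replace_baseline(theta, theta_bl, improved_significantly):
    # Paper: compare greedy costs on held-out instances with a one-sided
    # t-test; here the test outcome is supplied as a boolean.
    return theta.copy() if improved_significantly else theta_bl

# toy usage with stand-in policies and lengths (assumptions, not the model)
theta, theta_bl = np.array([1.0]), np.array([0.0])
adv = greedy_rollout_baseline_step(
    theta, theta_bl, s=None,
    sample_tour=lambda th, s: [0, 1],        # pretend sampled tour
    greedy_tour=lambda th, s: [1, 0],        # pretend greedy baseline tour
    tour_length=lambda pi, s: float(pi[0]))  # pretend length
theta_bl = maybe_replace_baseline(theta, theta_bl, improved_significantly=True)
```

Freezing the baseline parameters between significance checks keeps b(s) stable, which is what makes the advantage a meaningful "did I beat my own greedy self?" signal.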
27. Discussion
• Introduce a model and training method which both contribute to
significantly improved results on learned heuristics for TSP.
• Using attention instead of recurrence introduces invariance to the
input order of the nodes, increasing learning efficiency.
• The multi-head attention mechanism allows nodes to communicate
relevant information over different channels.