CVRP Solver with Multi-Head Attention
Rintaro Sato
Kyushu University, Japan
[Figure: example CVRP instance with a depot (node 0) and numbered customer nodes]
What do we want?
Input:
・each customer's location and demand
・vehicle capacity
・depot location
Output:
・a tour with minimum cost (= total distance)
Training Overview
[Diagram: input features → Embedding ($h_i$) → Encoder (MHA layers) → node embeddings $h_i^{(N)}$ → Decoder (MHA layer): generate context vector $h_C^{(N)}$, select next node → Evaluate cost & update Encoder, Decoder]
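As a rough sketch, the whole pipeline can be read as the following pseudo-Python. Here encoder and decoder are stand-ins for the networks detailed on the next slides, and all names are assumptions of the sketch, not the official implementation:

import numpy as np

def tour_length(coords, tour):
    # Total Euclidean length of the route sequence.
    return sum(np.linalg.norm(coords[a] - coords[b])
               for a, b in zip(tour[:-1], tour[1:]))

def training_step(coords, demands, capacity, encoder, decoder):
    # Input features per node: (x, y, demand); node 0 is the depot.
    x = np.concatenate([coords, demands[:, None]], axis=1)
    h = encoder(x)                               # node embeddings h_i^(N)
    tour, log_p = decoder(h, demands, capacity)  # select nodes one by one
    cost = tour_length(coords, tour)             # evaluate cost (= distance)
    return cost, log_p                           # used to update both networks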
Initial Embedding in Encoder
Input feature: $x_i = (x, y, \text{demand})_i \in \mathbb{R}^3$ for each node.
Initial embedding: $h_i = W x_i + b \in \mathbb{R}^{d_h}$, e.g. $d_h = 128$.
A learned linear layer maps each input feature into a $d_h$-dimensional vector space, giving the embedding vector $h_i$.
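A minimal NumPy sketch of this first projection, with random parameters standing in for the learned $W$ and $b$:

import numpy as np

d_h = 128                           # embedding dimension from the slide
x = np.random.rand(8, 3)            # 8 nodes, features x_i = (x, y, demand)
W = np.random.randn(3, d_h) * 0.1   # learned weight matrix (random here)
b = np.zeros(d_h)                   # learned bias
h = x @ W + b                       # embeddings h_i, shape (8, 128)

In [1] the depot, which has no demand, is embedded with its own separate parameters; the sketch ignores that detail.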
Multi-Head Attention layer (= MHA layer) [2]
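Since [2] defines this layer precisely, here is a minimal NumPy sketch of one multi-head self-attention pass over the node embeddings (the full encoder layer also adds skip connections, normalization, and a feed-forward sublayer, which are omitted here):

import numpy as np

def mha(h, Wq, Wk, Wv, Wo, n_heads=8):
    """Multi-head self-attention over node embeddings h: (n, d_h)."""
    n, d_h = h.shape
    d_k = d_h // n_heads
    Q = (h @ Wq).reshape(n, n_heads, d_k).transpose(1, 0, 2)  # (heads, n, d_k)
    K = (h @ Wk).reshape(n, n_heads, d_k).transpose(1, 0, 2)
    V = (h @ Wv).reshape(n, n_heads, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # scaled dot-product
    a = np.exp(scores - scores.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)                     # softmax over keys
    out = (a @ V).transpose(1, 0, 2).reshape(n, d_h)  # concatenate heads
    return out @ Wo                                   # final output projection

With $d_h = 128$ and 8 heads, each head attends in $d_k = 16$ dimensions.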
How Decoder works
The context vector $h_C^{(N)}$ contains the state information:
$h_C^{(N)} = [\bar{h}^{(N)},\ h_{\pi_{t-1}}^{(N)},\ D_t]$
・$\bar{h}^{(N)}$: graph embedding (= output of the Encoder)
・$h_{\pi_{t-1}}^{(N)}$: embedding of the last visited node
・$D_t$: remaining vehicle capacity
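A minimal sketch of assembling $h_C^{(N)}$; following [1], the graph embedding is taken as the mean of the encoder's node embeddings, and the function and variable names are assumptions:

import numpy as np

def make_context(h, last_idx, D_t):
    """Build h_C^(N) = [graph embedding, last visited node embedding, D_t]."""
    h_graph = h.mean(axis=0)   # graph embedding: mean over encoder outputs
    h_last = h[last_idx]       # h_{pi_{t-1}}^(N)
    return np.concatenate([h_graph, h_last, [D_t]])  # shape (2*d_h + 1,)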
How Decoder works
The Decoder's MHA layer takes the context vector and the node embeddings and outputs the probability $p_\theta$ of selecting each node as the next one to visit.
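A minimal sketch of producing $p_\theta$ from the context vector: a single attention head scores each node against the context, infeasible nodes are masked out, and a softmax gives the distribution. The tanh clipping follows [1]; parameter names are assumptions:

import numpy as np

def next_node_probs(h_c, h, Wq, Wk, mask, clip=10.0):
    """h_c: context vector, h: (n, d_h) node embeddings, mask: True = forbidden."""
    q = h_c @ Wq                                       # query from the context
    k = h @ Wk                                         # keys from the nodes
    u = clip * np.tanh(q @ k.T / np.sqrt(k.shape[1]))  # clipped compatibilities
    u = np.where(mask, -np.inf, u)                     # mask visited/infeasible nodes
    e = np.exp(u - u[~mask].max())                     # numerically stable softmax
    return e / e.sum()                                 # p_theta over next nodes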
Training Overview
After each selection, the state is updated: $D_t$ and $h_{\pi_{t-1}}^{(N)}$ are refreshed before generating the next context vector.
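A minimal sketch of that update under the usual CVRP convention (visiting a customer consumes its demand; returning to the depot, node 0, refills the vehicle); this convention is an assumption of the sketch:

def update_capacity(selected, demands, D_t, capacity):
    # Returning to the depot (node 0) refills the vehicle; otherwise
    # the selected customer's demand is subtracted from what remains.
    return capacity if selected == 0 else D_t - demands[selected]

Updating $h_{\pi_{t-1}}^{(N)}$ is simply remembering the index of the node chosen at step $t$.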
Training Overview
The Decoder repeats this select-and-update loop (a while loop) until all routes are completed; the resulting cost is then evaluated and used to update the Encoder and Decoder.
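Putting the pieces together, a minimal sketch of that while loop, reusing make_context, next_node_probs, and update_capacity from the earlier sketches (all hypothetical names):

import numpy as np

def decode(h, demands, capacity, Wq, Wk, rng=np.random.default_rng(0)):
    n = h.shape[0]
    visited = np.zeros(n, dtype=bool)
    tour, log_p, D_t, last = [0], 0.0, capacity, 0
    while not visited[1:].all():        # loop until every customer is served
        mask = visited.copy()
        mask |= demands > D_t           # not enough capacity left for these nodes
        mask[0] = (last == 0)           # no depot-to-depot moves
        p = next_node_probs(make_context(h, last, D_t), h, Wq, Wk, mask)
        sel = int(rng.choice(n, p=p))   # sample the next node from p_theta
        log_p += np.log(p[sel])         # accumulate log p_theta(pi | s)
        D_t = update_capacity(sel, demands, D_t, capacity)
        visited[sel] = sel != 0         # the depot may be revisited
        tour.append(sel)
        last = sel
    return tour + [0], log_p            # close the final route at the depot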
Approximation of gradient using REINFORCE
REINFORCE (Williams, 1992), policy gradient.
Minimize the loss function $J_\theta(s) = \mathbb{E}[L(\pi \mid s)]$.
Gradient: $\nabla_\theta J_\theta(s) \approx \mathbb{E}\big[\big(L(\pi \mid s) - b(s)\big) \cdot \nabla_\theta \log p_\theta(\pi \mid s)\big]$
$\log p_\theta(\pi \mid s) = \log \prod_{i=1}^{n} p(\pi_i \mid \pi_{<i}, s) = \sum_{i=1}^{n} \log p(\pi_i \mid \pi_{<i}, s)$, nodes $i \in \{0, 1, \dots, n\}$
Each term is a LogSoftmax: $\log p = \log \frac{\exp(x_i)}{\sum_j \exp(x_j)} \in (-\infty, 0]$
・$L(\pi \mid s)$: length of the path
・$\pi$: path (index permutation)
・$s$: graph
・$b(s)$: baseline
・$\nabla_\theta$: gradient with respect to $\theta$
・$p_\theta$: probability
Training updates $\theta$ so that $\nabla_\theta J_\theta(s) \to 0$.
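A minimal numerical sketch of the estimator over a batch of sampled tours; in an autodiff framework this scalar would be the loss, and its gradient with respect to $\theta$ (flowing only through $\log p_\theta$) matches the expression above:

import numpy as np

def reinforce_loss(lengths, baselines, log_probs):
    # The advantage (L(pi|s) - b(s)) is treated as a constant; the gradient
    # of this mean w.r.t. theta flows only through log_probs, reproducing
    # E[(L - b) * grad log p_theta(pi|s)].
    advantage = lengths - baselines
    return np.mean(advantage * log_probs)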
The baseline $b(s)$ is introduced to reduce the variance of this estimate. In [1] it is the cost of a greedy rollout by the best model so far, and the baseline model is replaced at the end of each epoch if the current model performs significantly better.
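A minimal sketch of that epoch-level check; the paired significance test used in [1] is reduced here to a plain mean comparison, an assumption made for brevity:

import numpy as np

def should_replace_baseline(current_costs, baseline_costs):
    # Adopt the current model as the new baseline when its greedy rollout
    # costs on evaluation instances beat the baseline model's costs.
    return np.mean(current_costs) < np.mean(baseline_costs)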
Results
[Figure: training results; the gradient $\nabla_\theta J_\theta(s)$ approaches 0 as training converges]
Reference
Paper
・[1] Wouter Kool et al., "Attention, Learn to Solve Routing Problems!", 2019
・[2] Vaswani et al., "Attention Is All You Need", 2017
Article
・https://qiita.com/ohtaman/items/0c383da89516d03c3ac0 ("Solving Mathematical Optimization Problems with Deep Learning, Part 1", in Japanese)
Implementation
・https://github.com/Rintarooo/VRP_MHA (my own TensorFlow 2 implementation)
・https://github.com/wouterkool/attention-learn-to-route (official PyTorch implementation)
・https://github.com/alexeypustynnikov/AM-VRP
・https://github.com/d-eremeev/ADM-VRP
