Paper study: Attention, Learn to Solve Routing Problems!
ICLR 2019
University of Amsterdam
Wouter Kool, Herke van Hoof and Max Welling
Abstract
• Learning heuristics for combinatorial optimization problems can save costly development.
• The paper proposes a model based on attention layers and trains it using REINFORCE with a baseline based on a deterministic greedy rollout.
• The model outperforms recent learned heuristics for TSP.
Introduction
• Approaches to solving combinatorial optimization problems can be divided into:
  • Exact methods: guarantee finding optimal solutions.
  • Heuristics: trade off optimality for computational cost, usually expressed in the form of rules (comparable to a policy for making decisions).
• Idea: train a model to parameterize such policies, to obtain new and stronger algorithms for routing problems.
Introduction (cont’d)
• Propose a model based on attention and train it using REINFORCE with a greedy rollout baseline.
• Show the flexibility of the proposed approach on multiple routing problems.
Background
Attention mechanism
• In an encoder-decoder model, attention is used to obtain a new context vector at each decoding step.
• $h_j$ denotes an encoder hidden state; $s_i$ denotes a decoder hidden state.
• Alignment model (compatibility): the relationship between the current decoding state and every encoding state.
  • $e_{ij} = a(s_{i-1}, h_j)$
• Attention weight
  • $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$
• Context vector
  • $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$
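The three quantities above (alignment scores, attention weights, context vector) can be sketched in a few lines of NumPy. This is only an illustration: the slides leave the alignment model $a$ unspecified, so a plain dot product is assumed here, and the function name `attention_context` is made up for this example.

```python
import numpy as np

def attention_context(s_prev, H):
    """Return the attention weights alpha_i and the context vector c_i.

    s_prev: decoder state s_{i-1}, shape (d,)
    H:      encoder states h_1..h_T stacked as rows, shape (T, d)
    Assumption: the alignment model a(s, h) is a plain dot product.
    """
    e = H @ s_prev                 # e_ij = a(s_{i-1}, h_j), shape (T,)
    e = e - e.max()                # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()   # softmax over encoder positions
    c = alpha @ H                  # c_i = sum_j alpha_ij h_j, shape (d,)
    return alpha, c

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))        # T = 5 encoder states of size 4
s_prev = rng.normal(size=4)
alpha, c = attention_context(s_prev, H)
```

The weights `alpha` are non-negative and sum to one, so `c` is a convex combination of the encoder states.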
Transformer
• Multi-head attention: project the input encodings into several different subspaces, one per head.
• Self-attention: there is no separate decoding state; the encoding states attend to each other.
• Each head has its own attention mechanism.
Attention model
Problem definition
• Define a problem instance $s$ as a graph with $n$ nodes, where node $i \in \{1, \dots, n\}$ is represented by features $x_i$.
• For TSP, $x_i$ is the coordinate of node $i$ (in 2D space).
• Define a solution $\pi = (\pi_1, \dots, \pi_n)$ as a permutation of the nodes.
• Given a problem instance $s$, the model outputs a policy $p(\pi \mid s)$ for selecting a solution $\pi$.
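As a concrete reading of this definition, the sketch below evaluates a TSP solution given as a permutation of the nodes. It assumes the usual TSP convention that the tour is closed (it returns to the first node) and uses 0-based node indices; the helper name `tour_length` is hypothetical.

```python
import numpy as np

def tour_length(x, pi):
    """Length of the closed tour that visits the nodes in the order pi.

    x:  (n, 2) array of node coordinates
    pi: permutation of 0..n-1 (0-based, unlike the 1-based slides)
    """
    ordered = x[np.asarray(pi)]
    # Vector from each node to its successor, wrapping back to the start.
    diffs = ordered - np.roll(ordered, -1, axis=0)
    return float(np.linalg.norm(diffs, axis=1).sum())

x = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
square = tour_length(x, [0, 1, 2, 3])    # walk around the unit square
crossed = tour_length(x, [0, 2, 1, 3])   # same nodes, using both diagonals
```

Both permutations are valid solutions for the same instance; the second one is longer because it traverses the two diagonals.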
Encoder-decoder model
• The encoder-decoder model defines a stochastic policy $p(\pi \mid s)$ for selecting a solution $\pi$ given a problem instance $s$:
$$p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1})$$
• The encoder produces embeddings of all input nodes.
• The decoder produces the sequence $\pi$ one node at a time, based on the node embeddings, a mask, and a context.
• For TSP:
  • node embeddings: from the encoder
  • mask: marks which nodes remain unvisited during decoding
  • context: embeddings of the first and last nodes of the partial tour during decoding
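The factorisation of the policy into per-step probabilities, together with the mask, can be illustrated with a small sampling loop. This is a sketch under stated assumptions: `step_logits` is a hypothetical stand-in for the decoder, and the example uses constant scores, so every remaining node is equally likely at each step.

```python
import math
import numpy as np

def sample_tour(step_logits, n, rng):
    """Sample a permutation node by node and return it with log p(pi | s).

    step_logits(partial_tour) -> array of n unnormalised scores; this callable
    is a hypothetical placeholder for the decoder.
    """
    visited = np.zeros(n, dtype=bool)
    tour, log_p = [], 0.0
    for _ in range(n):
        # Mask: nodes already in the tour get score -inf, hence probability 0.
        logits = np.where(visited, -np.inf, step_logits(tour))
        probs = np.exp(logits - logits[~visited].max())
        probs /= probs.sum()
        node = int(rng.choice(n, p=probs))
        log_p += math.log(probs[node])   # chain rule: sum of per-step log-probs
        visited[node] = True
        tour.append(node)
    return tour, log_p

rng = np.random.default_rng(0)
n = 5
tour, log_p = sample_tour(lambda partial: np.zeros(n), n, rng)
```

With constant scores the step probabilities are 1/5, 1/4, 1/3, 1/2 and 1, so `log_p` equals $-\log(5!)$ regardless of which tour is sampled.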
Encoder
• $d_x$-dimensional input features $x_i$. For TSP, $d_x = 2$.
• $d_h$-dimensional node embeddings. Let $d_h = 128$.
• Initial embedding: $h_i^{(0)} = W^x x_i + b^x$
• The embeddings $h_i^{(l)}$ are updated using $N$ attention layers:
$$\hat{h}_i = \mathrm{BN}^l\left(h_i^{(l-1)} + \mathrm{MHA}_i^l\left(h_1^{(l-1)}, \dots, h_n^{(l-1)}\right)\right)$$
$$h_i^{(l)} = \mathrm{BN}^l\left(\hat{h}_i + \mathrm{FF}^l(\hat{h}_i)\right)$$
• Graph embedding: $\bar{h}^{(N)} = \frac{1}{n} \sum_{i=1}^{n} h_i^{(N)}$
$i$ denotes the node index.
The superscript $l$ denotes the output of the $l$-th attention layer.
FF: node-wise feed-forward
MHA: multi-head attention
BN: batch normalization
Multi-head attention
• $\mathrm{MHA}_i^l\left(h_1^{(l-1)}, \dots, h_n^{(l-1)}\right)$
• Let the number of heads be $M = 8$ and the embedding dimension be $d_h = 128$.
• Each head has its own attention mechanism.
Result vector of each head
• Each node has its own query $q_i$, key $k_i$ and value $v_i$.
  • $q_i = W^Q h_i$, $k_i = W^K h_i$, $v_i = W^V h_i$
  • $W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices; $W^V$ is a $(d_v \times d_h)$ matrix.
• Given node $i$ and another node $j$:
  • $q_i$ and $k_j$ determine the importance of $v_j$.
  • Compatibility: $u_{ij} = \dfrac{q_i^T k_j}{\sqrt{d_k}}$ if node $i$ is adjacent to node $j$, else $-\infty$.
  • Attention weight: $a_{ij} = \dfrac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}} \in [0, 1]$
  • Result vector: $h_i' = \sum_j a_{ij} v_j$ (size $d_v$)
1. Compute the compatibility
2. Compute the attention weight
3. Take the linear combination of the values $v_j$, weighted by $a_{ij}$
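The three steps can be written out for a single head as follows. This is a minimal NumPy sketch, not the authors' implementation: the weight matrices are random placeholders, and $d_k = d_v = 16$ is an assumed choice (it corresponds to $d_h / M$ with $d_h = 128$ and $M = 8$).

```python
import numpy as np

def single_head_attention(h, W_q, W_k, W_v, adjacency=None):
    """One attention head over node embeddings.

    h: (n, d_h); W_q, W_k: (d_k, d_h); W_v: (d_v, d_h).
    adjacency: optional boolean (n, n) matrix; False entries get compatibility
    -inf. Every row must contain at least one True entry.
    """
    q, k, v = h @ W_q.T, h @ W_k.T, h @ W_v.T
    d_k = W_k.shape[0]
    u = q @ k.T / np.sqrt(d_k)                     # step 1: compatibilities u_ij
    if adjacency is not None:
        u = np.where(adjacency, u, -np.inf)
    a = np.exp(u - u.max(axis=1, keepdims=True))   # step 2: softmax over j
    a /= a.sum(axis=1, keepdims=True)
    return a @ v, a                                # step 3: h'_i = sum_j a_ij v_j

rng = np.random.default_rng(0)
n, d_h, d_k, d_v = 6, 128, 16, 16
h = rng.normal(size=(n, d_h))
W_q, W_k, W_v = (rng.normal(size=(d, d_h)) / np.sqrt(d_h) for d in (d_k, d_k, d_v))
h_prime, a = single_head_attention(h, W_q, W_k, W_v)
```

Leaving `adjacency` as `None` treats the graph as fully connected, so every node attends to every node.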
Final result vector
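• Let $h_{im}'$ denote the result vector of node $i$ in head $m$ (size $d_v$).
• In the Transformer, the result vectors are concatenated first and then transformed:
  • $\mathrm{MHA}_i(h_1, \dots, h_n) = W^O \, \mathrm{concat}(h_{i1}', \dots, h_{iM}')$, where the concatenation has size $M \cdot d_v$ and $W^O$ is a $(d_h \times (M \cdot d_v))$ matrix.
• In the proposed method, each result vector is transformed and the results are summed:
  • $\mathrm{MHA}_i(h_1, \dots, h_n) = \sum_{m=1}^{M} W_m^O h_{im}'$, where each $W_m^O$ is a $(d_h \times d_v)$ matrix.
• Both methods output a $d_h$-dimensional vector for each node.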
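The two ways of combining heads give the same result when the single output matrix is built from the per-head matrices as column blocks. The check below is a small numerical illustration with random placeholder values; the variable names and the choice $M = 8$, $d_v = 16$, $d_h = 128$ are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d_v, d_h = 8, 16, 128

# Hypothetical per-head result vectors h'_{im} for one node i, and per-head W_m^O.
heads = rng.normal(size=(M, d_v))
W_o_per_head = rng.normal(size=(M, d_h, d_v))

# Sum form: transform each head's result vector, then add them up.
summed = sum(W_o_per_head[m] @ heads[m] for m in range(M))

# Concatenation form: one (d_h x M*d_v) matrix whose column blocks are the
# per-head matrices, applied to concat(h'_{i1}, ..., h'_{iM}).
W_o = np.concatenate(list(W_o_per_head), axis=1)   # shape (d_h, M * d_v)
concatenated = W_o @ heads.reshape(-1)
```

Here `summed` and `concatenated` agree up to floating-point error, which is why both formulations yield a $d_h$-dimensional vector per node.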
Decoder
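• At decoding time, the decoder context consists of the embedding of the graph, the last visited node, and the first node:
$$h_{(c)}^{(N)} = \begin{cases} \left[\bar{h}^{(N)}, h_{\pi_{t-1}}^{(N)}, h_{\pi_1}^{(N)}\right] & \text{if } t > 1 \\ \left[\bar{h}^{(N)}, v^l, v^f\right] & \text{otherwise.} \end{cases}$$
• The $(3 \cdot d_h)$-dimensional vector $h_{(c)}^{(N)}$ is the embedding of the special context node $(c)$.
• $[\cdot, \cdot, \cdot]$ is the horizontal concatenation operator.
• $v^l$ and $v^f$ are learnable $d_h$-dimensional parameters.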
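Building the context vector is a simple concatenation, sketched below. The function name `decoder_context`, the 0-based node indices, and the small dimension $d_h = 4$ are assumptions made for the example; the placeholder vectors `v_l` and `v_f` would be learned parameters in the real model.

```python
import numpy as np

def decoder_context(node_emb, tour, v_l, v_f):
    """Context vector for the next decoding step.

    node_emb: (n, d_h) final encoder embeddings h_i^(N)
    tour:     list of nodes visited so far (empty at t = 1)
    v_l, v_f: (d_h,) placeholders used before any node has been visited
    """
    graph_emb = node_emb.mean(axis=0)     # graph embedding: mean over all nodes
    if tour:                              # t > 1: last and first visited node
        last, first = node_emb[tour[-1]], node_emb[tour[0]]
    else:                                 # t = 1: learned placeholders
        last, first = v_l, v_f
    return np.concatenate([graph_emb, last, first])   # shape (3 * d_h,)

rng = np.random.default_rng(0)
d_h = 4
node_emb = rng.normal(size=(5, d_h))
v_l, v_f = rng.normal(size=d_h), rng.normal(size=d_h)
ctx_first = decoder_context(node_emb, [], v_l, v_f)
ctx_later = decoder_context(node_emb, [2, 0, 4], v_l, v_f)
```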
Update context node embedding
• Obtain the new context node embedding $h_{(c)}^{(N+1)}$ using $M$-head attention.
• The keys and values come from the node embeddings $h_i^{(N)}$; the query comes from the context node.
  • $q_{(c)} = W^Q h_{(c)}$, $k_i = W^K h_i$, $v_i = W^V h_i$
  • Compatibility: $u_{(c)j} = \dfrac{q_{(c)}^T k_j}{\sqrt{d_k}}$ (with $d_k = \frac{d_h}{M}$) if node $j$ has not been visited yet, else $-\infty$.
• Apply the same kind of multi-head attention as in the encoder to get $h_{(c)}^{(N+1)}$ (size $d_h$).
Final output probability
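• Compute $p_\theta(\pi_t \mid s, \pi_{1:t-1})$ using a single attention head ($M = 1$, $d_k = d_h$), computing only the compatibilities (the values $v_i$ are not needed).
• $u_{(c)j} = C \cdot \tanh\left(\dfrac{q_{(c)}^T k_j}{\sqrt{d_k}}\right) \in [-C, C]$ if node $j$ has not been visited yet, else $-\infty$ (with $C = 10$).
• Compute the final output probability vector $p$ using a softmax:
$$p_i = p_\theta(\pi_t = i \mid s, \pi_{1:t-1}) = \frac{e^{u_{(c)i}}}{\sum_j e^{u_{(c)j}}}$$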
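The final step (clipping with tanh, masking visited nodes, softmax) can be sketched as below. The function name and the random inputs are illustrative only, and the scaling by $\sqrt{d_k}$ is assumed to match the compatibility formulas used in the earlier attention steps.

```python
import numpy as np

def output_probabilities(q_c, K, visited, C=10.0):
    """Probability of selecting each node as the next node of the tour.

    q_c:     (d_k,) query of the context node
    K:       (n, d_k) keys of the nodes
    visited: boolean (n,); at least one entry must be False
    """
    d_k = K.shape[1]
    u = C * np.tanh(K @ q_c / np.sqrt(d_k))   # compatibilities clipped to [-C, C]
    u = np.where(visited, -np.inf, u)         # visited nodes get probability 0
    e = np.exp(u - u[~visited].max())         # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n, d_k = 5, 128
K = rng.normal(size=(n, d_k))
q_c = rng.normal(size=d_k)
visited = np.array([True, False, False, True, False])
p = output_probabilities(q_c, K, visited)
```

Sampling from `p` gives a stochastic decoder, while taking the arg-max gives the greedy decoding used for the rollout baseline.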
REINFORCE with greedy rollout baseline
REINFORCE with baseline
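• Define the loss $\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}[L(\pi)]$, where $L(\pi)$ is the cost of solution $\pi$ (the tour length for TSP).
• Optimize $\mathcal{L}$ by gradient descent using REINFORCE.
• Introducing a baseline $b(s)$ reduces the gradient variance and therefore speeds up learning:
$$\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\left[\left(L(\pi) - b(s)\right) \nabla \log p_\theta(\pi \mid s)\right]$$
• Common baselines:
  • Exponential moving average $b(s) = M$ with decay $\beta$:
    • $M_0 = L(\pi)$, $M_{t+1} = \beta M_t + (1 - \beta) L(\pi)$
  • Learned value function (critic) $\hat{v}(s, \omega)$:
    • the parameters $\omega$ are learned from pairs $(s, L(\pi))$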
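The exponential moving average baseline is simple enough to write out directly. In this sketch the class name `EmaBaseline`, the decay `beta=0.8`, and the cost values are arbitrary choices for illustration.

```python
class EmaBaseline:
    """Exponential moving average of observed costs, used as b(s)."""

    def __init__(self, beta=0.8):
        self.beta = beta
        self.value = None

    def update(self, cost):
        if self.value is None:
            self.value = cost      # M_0 = L(pi)
        else:
            # M_{t+1} = beta * M_t + (1 - beta) * L(pi)
            self.value = self.beta * self.value + (1 - self.beta) * cost
        return self.value

baseline = EmaBaseline(beta=0.8)
history = [baseline.update(cost) for cost in [10.0, 8.0, 9.0]]

# The REINFORCE weight for a new sample is its cost minus the baseline.
advantage = 8.5 - baseline.value
```

A negative `advantage` means the sampled solution was better (shorter) than the running average, so gradient descent on the loss increases its probability.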
Proposed baseline
• Two models are kept: one being trained ($p_\theta$) and one used as the baseline ($p_{\theta^{BL}}$).
• Sample a solution $\pi_i$ based on $p_\theta$.
• Greedily pick a baseline solution $\pi_i^{BL}$ based on $p_{\theta^{BL}}$.
• Calculate the gradient of the loss with REINFORCE, using the length of $\pi_i^{BL}$ as the baseline.
• Replace the baseline parameters (copy the training parameters to the baseline model) if the improvement is significant.
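One possible way to make "significant" concrete is a paired, one-sided t-test on the greedy rollout costs of both models over the same set of instances. The sketch below assumes that interpretation; the function name, the threshold of 1.645, and the synthetic cost arrays are all illustrative and not taken from the slides.

```python
import math
import numpy as np

def baseline_should_update(candidate_costs, baseline_costs, t_threshold=1.645):
    """Decide whether to copy the training parameters to the baseline model.

    Both arguments hold per-instance costs of greedy rollouts on the same
    instances. Returns True when the candidate is better on average and the
    paired t-statistic exceeds t_threshold (an illustrative value).
    """
    diff = np.asarray(baseline_costs) - np.asarray(candidate_costs)  # > 0: candidate better
    mean = diff.mean()
    if mean <= 0:
        return False
    std_err = diff.std(ddof=1) / math.sqrt(len(diff))
    if std_err == 0:
        return True        # the same positive improvement on every instance
    return bool(mean / std_err > t_threshold)

rng = np.random.default_rng(0)
baseline_costs = rng.uniform(4.0, 5.0, size=200)          # synthetic tour lengths
improved = baseline_costs - 0.1 + rng.normal(0.0, 0.05, size=200)
update_improved = baseline_should_update(improved, baseline_costs)
update_same = baseline_should_update(baseline_costs, baseline_costs)
```

Comparing on the same instances removes the instance-to-instance variation in tour length, so a small but consistent improvement can still be detected.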
Experiments
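• The proposed method is compared to heuristic solvers, non-learned baselines and learned heuristics.
• The learned heuristics in the comparison include structure2vec, the pointer network (PN), and PN + RL.
• PN: pointer network; AM: attention model (proposed method).
[Table: results compared to heuristic solvers, non-learned baselines and learned heuristics]
[Figure: TSP20 results compared to the pointer network (10000 instances)]
[Figure: generalization ability]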
Discussion
• The paper introduces a model and a training method that both contribute to significantly improved results on learned heuristics for TSP.
• Using attention instead of recurrence introduces invariance to the
input order of the nodes, increasing learning efficiency.
• The multi-head attention mechanism allows nodes to communicate
relevant information over different channels.

Paper study: Attention, learn to solve routing problems!

  • 1. Attention, Learn to Solve Routing Problems! ICLR 2019 University of Amsterdam Wouter Kool, Herke van Hoof and Max Welling
  • 2. Abstract • Learn heuristics for combinatorial optimization problems can save costly development. • Propose a model based on attention layers and train this model using REINFORCE with a baseline based on deterministic greedy rollout. • Outperform recent learned heuristics for TSP.
  • 3. Introduction • Approaches to solve combinatorial optimization problem can be divided into • Exact methods: guarantee finding optimal solutions • Heuristics: trade off optimality for computational cost, usually expressed in the form of rules (like the policy to make decisions) • Train a model to parameterize policies to obtain new and stronger algorithm for routing problem.
  • 4. Introduction (cont’d) • Propose a model based on attention and train it using REINFORCE with greedy rollout baseline. • Show the flexibility of proposed approach on multiple routing problems.
  • 6. Attention mechanism • For encoder-decoder model, use attention to obtain new context vector. • ℎ𝑗 denotes encoder hidden state, 𝑠𝑖 denotes decoder hidden state. • Alignment model, compatibility: relationship between current decoding state and every encoding state. • 𝑒𝑖𝑗 = 𝑎(𝑠𝑖−1, ℎ𝑗) • Attention weight • 𝛼𝑖𝑗 = exp(𝑒 𝑖𝑗) σ 𝑘=1 𝑇 exp 𝑒 𝑖𝑘 • Context vector • 𝑐𝑖 = σ 𝑗=1 𝑇 𝛼𝑖𝑗ℎ𝑗
  • 7. Transformer • Multi-head attention: project the input encoding to different number of spaces • Self-attention: no additional decoding state, just encoding states themselves • Each head has its own attention mechanism
  • 9. Problem definition • Define a problem instance 𝑠 as a graph with 𝑛 nodes, where node 𝑖 ∈ {1, … , 𝑛} is represented by features 𝑥𝑖. • For TSP, 𝑥𝑖 is the coordinate of node 𝑖 (in 2d space). • Define a solution 𝜋 = (𝜋1, … , 𝜋 𝑛) as a permutation of the nodes. • Given a problem 𝑠, model output a policy 𝑝(𝜋|𝑠) for selecting a solution 𝜋
  • 10. Encoder-decoder model • Encoder-decoder model defines stochastic policy 𝑝(𝜋|𝑠) for selecting a solution 𝜋 given a problem instance 𝑠. 𝑝 𝜃 𝜋 𝑠 = ෑ 𝑡=1 𝑛 𝑝 𝜃(𝜋 𝑡|𝑠, 𝜋1:𝑡−1) • The encoder produces embeddings of all input nodes. • The decoder produces the sequence 𝜋, one node at a time, based on embedding nodes and mask and context. • For TSP, • embedding nodes: from encoder • mask: remaining nodes during decoding • context: First and last node embedding in tour during decoding
  • 11. Encoder • $d_x$-dimensional input feature $x_i$. For TSP, $d_x = 2$. • $d_h$-dimensional node embedding. Let $d_h = 128$. • Initial embedding: $h_i^{(0)} = W^x x_i + b^x$ • The embeddings $h_i^{(l)}$ are updated using $N$ attention layers: $\hat{h}_i = \mathrm{BN}^l\left(h_i^{(l-1)} + \mathrm{MHA}_i^l\left(h_1^{(l-1)}, \dots, h_n^{(l-1)}\right)\right)$, $h_i^{(l)} = \mathrm{BN}^l\left(\hat{h}_i + \mathrm{FF}^l(\hat{h}_i)\right)$ • Graph embedding: $\bar{h}^{(N)} = \frac{1}{n} \sum_{i=1}^{n} h_i^{(N)}$ • Notation: $i$ denotes the node index; the superscript $l$ denotes the output of the $l$-th attention layer; FF: node-wise feed-forward; MHA: multi-head attention; BN: batch normalization
  • 12.
  • 13. Multi-head attention • $\mathrm{MHA}_i^l\left(h_1^{(l-1)}, \dots, h_n^{(l-1)}\right)$ • Let the number of heads be $M = 8$ and the embedding dimension $d_h = 128$. • Each head has its own attention mechanism.
  • 14. Result vector of each head • Each node has its own query $q_i$, key $k_i$ and value $v_i$. • $q_i = W^Q h_i$, $k_i = W^K h_i$, $v_i = W^V h_i$ • $W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices, $W^V$ is a $(d_v \times d_h)$ matrix. • Given node $i$ and another node $j$: • $q_i$ and $k_j$ determine the importance of $v_j$ • Compatibility $u_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}}$ if node $i$ is adjacent to node $j$, else $-\infty$. • Attention weight $a_{ij} = \frac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}} \in [0, 1]$ • Result vector $h_i' = \sum_j a_{ij} v_j$ (size $d_v$)
  • 15. 1. Compute the compatibilities $u_{ij}$ 2. Compute the attention weights $a_{ij}$ 3. Take the linear combination of the values $v_j$ weighted by $a_{ij}$
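The per-head computation (queries, keys, values, scaled compatibilities, softmax, weighted sum) can be sketched as follows. This is an illustrative plain-Python version, not the authors' implementation: the helper names, the optional `adjacent` matrix and the toy inputs are assumptions, and every node is assumed to be adjacent to at least one node so that the softmax is well defined.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def single_head_attention(H, W_Q, W_K, W_V, adjacent=None):
    """Return one result vector of size d_v per node."""
    n = len(H)
    Q = [matvec(W_Q, h) for h in H]
    K = [matvec(W_K, h) for h in H]
    V = [matvec(W_V, h) for h in H]
    d_k = len(Q[0])
    results = []
    for i in range(n):
        # Compatibility: scaled dot product, or -inf for non-adjacent nodes.
        u = []
        for j in range(n):
            if adjacent is not None and not adjacent[i][j]:
                u.append(-math.inf)
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                u.append(dot / math.sqrt(d_k))
        # Softmax over j; masked entries get weight exactly 0.
        m = max(u)
        exps = [math.exp(x - m) for x in u]
        z = sum(exps)
        a = [e / z for e in exps]
        # Weighted sum of the value vectors.
        results.append(
            [sum(a[j] * V[j][d] for j in range(n)) for d in range(len(V[0]))]
        )
    return results

# Hypothetical toy input: 3 nodes, d_h = d_k = d_v = 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = single_head_attention(H, identity, identity, identity)
```

For TSP the graph is fully connected, so `adjacent` can be left as `None`; the mask only matters for problems with restricted neighbourhoods.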
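  • 16. Final result vector • Let $h_{im}'$ denote the result vector of node $i$ in head $m$ (size $d_v$) • In the Transformer, the result vectors are concatenated first and then transformed. • $\mathrm{MHA}_i(h_1, \dots, h_n) = W^O \, \mathrm{concat}(h_{i1}', \dots, h_{iM}')$ • In the proposed method, each result vector is transformed and the results are summed. • $\mathrm{MHA}_i(h_1, \dots, h_n) = \sum_{m=1}^{M} W_m^O h_{im}'$ • Both methods output a $d_h$-dimensional vector for each node. • Dimensions: the concatenation has size $M \cdot d_v$, $W^O$ is $d_h \times (M \cdot d_v)$, and each $W_m^O$ is $d_h \times d_v$.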
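The two formulations give the same output whenever the large output matrix is the column-wise concatenation of the per-head matrices, since a block matrix times a stacked vector is the sum of the block products. A small numerical check, with hypothetical helper names and toy values (2 heads, value size 2, embedding size 3):

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mha_concat(W_O, heads):
    """Transformer form: concatenate the head outputs, then project."""
    concat = [x for h in heads for x in h]
    return matvec(W_O, concat)

def mha_sum(W_O_blocks, heads):
    """Summed form: project each head output separately, then add."""
    d_h = len(W_O_blocks[0])
    out = [0.0] * d_h
    for W_m, h in zip(W_O_blocks, heads):
        for r, val in enumerate(matvec(W_m, h)):
            out[r] += val
    return out

# Hypothetical per-head output matrices, each of shape 3 x 2.
W_O_blocks = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]],
]
# Build the 3 x 4 matrix by placing the blocks side by side.
W_O = [sum((block[r] for block in W_O_blocks), []) for r in range(3)]
heads = [[1.0, 2.0], [3.0, 4.0]]

out_concat = mha_concat(W_O, heads)
out_sum = mha_sum(W_O_blocks, heads)
```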
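  • 17. Decoder • At decoding time, the decoding context consists of the embedding of the graph, the last node and the first node • $h_{(c)}^{(N)} = \begin{cases} \left[\bar{h}^{(N)}, h_{\pi_{t-1}}^{(N)}, h_{\pi_1}^{(N)}\right] & \text{if } t > 1 \\ \left[\bar{h}^{(N)}, v^l, v^f\right] & \text{otherwise} \end{cases}$ • The result is a $(3 \cdot d_h)$-dimensional vector • $h_{(c)}^{(N)}$: embedding of the special context node $(c)$ • $[\cdot, \cdot, \cdot]$: horizontal concatenation operator • $v^l$ and $v^f$ are learnable $d_h$-dimensional parameters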
  • 18. Update context node embedding • Obtain the new context node embedding $h_{(c)}^{(N+1)}$ using $M$-head attention. • The keys and values come from the node embeddings $h_i^{(N)}$; the query comes from the context node. • $q_{(c)} = W^Q h_{(c)}$, $k_i = W^K h_i$, $v_i = W^V h_i$ • Compatibility $u_{(c)j} = \frac{q_{(c)}^T k_j}{\sqrt{d_k}}$ (with $d_k = \frac{d_h}{M}$) if node $j$ has not been visited yet, else $-\infty$. • Apply the same kind of $\mathrm{MHA}$ as in the encoder to get $h_{(c)}^{(N+1)}$ (size $d_h$).
  • 19. Final output probability • Compute $p_\theta(\pi_t \mid s, \pi_{1:t-1})$ using a single attention head ($M = 1$, $d_k = d_h$), but only compute the compatibilities (the values $v_i$ are not needed) • $u_{(c)j} = C \cdot \tanh\left(\frac{q_{(c)}^T k_j}{\sqrt{d_k}}\right) \in [-C, C]$ if node $j$ has not been visited yet, else $-\infty$ (with $C = 10$). • Compute the final output probability vector $p$ using a softmax: $p_i = p_\theta(\pi_t = i \mid s, \pi_{1:t-1}) = \frac{e^{u_{(c)i}}}{\sum_j e^{u_{(c)j}}}$
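The output step (tanh-clipped compatibilities, a mask for visited nodes, then a softmax) can be sketched as below. The function name, the toy query and keys, and the choice to pass the visited nodes as a set of indices are assumptions made for illustration; at least one node is assumed to be unvisited.

```python
import math

def output_probabilities(q_c, keys, visited, C=10.0):
    """Probability of selecting each node as the next node in the tour."""
    d_k = len(q_c)
    logits = []
    for j, k_j in enumerate(keys):
        if j in visited:
            logits.append(-math.inf)       # visited nodes can never be chosen
        else:
            dot = sum(a * b for a, b in zip(q_c, k_j))
            logits.append(C * math.tanh(dot / math.sqrt(d_k)))  # clipped to [-C, C]
    m = max(logits)
    exps = [math.exp(u - m) for u in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical context query and node keys; node 3 has already been visited.
q_c = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 5.0], [0.0, 1.0], [-1.0, 0.0]]
probs = output_probabilities(q_c, keys, visited={3})
```

Nodes 0 and 1 have the same dot product with the query, so they receive equal probability, and the visited node receives exactly zero.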
  • 20. REINFORCE with greedy rollout baseline
  • 21. REINFORCE with baseline • Define the loss $\mathcal{L}(\theta|s) = \mathbb{E}_{p_\theta(\pi|s)}[L(\pi)]$ • Optimize $\mathcal{L}$ by gradient descent using REINFORCE • Introducing a baseline $b(s)$ reduces the gradient variance and thereby speeds up learning: $\nabla \mathcal{L}(\theta|s) = \mathbb{E}_{p_\theta(\pi|s)}\left[\left(L(\pi) - b(s)\right) \nabla \log p_\theta(\pi|s)\right]$ • Common baselines • Exponential moving average $b(s) = M$ with decay $\beta$. • $M_0 = L(\pi)$, $M_{t+1} = \beta M_t + (1 - \beta) L(\pi)$ • Learned value function (critic) $\hat{v}(s, \omega)$ • $\omega$ is learned from pairs $(s, L(\pi))$
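The exponential moving average baseline and the resulting REINFORCE weight can be sketched as follows. The class and function names, the decay value and the toy costs are assumptions; updating the average before using it at each step is also a choice made for this sketch, since the slide does not fix that order.

```python
class ExponentialBaseline:
    """Running average M of observed tour costs with decay beta."""

    def __init__(self, beta=0.8):
        self.beta = beta
        self.M = None

    def update(self, cost):
        if self.M is None:
            self.M = cost  # first iteration: M_0 = L(pi)
        else:
            self.M = self.beta * self.M + (1.0 - self.beta) * cost
        return self.M

def reinforce_weight(cost, baseline_value):
    """Scalar (L(pi) - b(s)) that multiplies grad log p(pi | s)."""
    return cost - baseline_value

# Hypothetical sequence of sampled tour costs.
baseline = ExponentialBaseline(beta=0.5)
costs = [10.0, 6.0, 8.0]
history = [baseline.update(c) for c in costs]
weight = reinforce_weight(6.0, history[1])
```

A negative weight means the sampled tour was shorter than the baseline, so a gradient descent step on the loss increases its log-probability.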
  • 22. Proposed baseline • Two models: one being trained (parameters $\theta$) and one used as the baseline (parameters $\theta^{BL}$) • Sample a solution $\pi_i$ from $p_\theta$ • Greedily pick a baseline solution $\pi_i^{BL}$ from $p_{\theta^{BL}}$ • Calculate the gradient of the loss with REINFORCE, using the length of $\pi_i^{BL}$ as the baseline • If the improvement is significant, replace the baseline parameters by copying the training parameters to the baseline model
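The loop described on this slide can be sketched with a deliberately crude stand-in for the model. Everything here is an assumption made for illustration: a "policy" is just a list of per-node preference scores (its length must match the number of nodes), and the significance check is replaced by a simple comparison of mean greedy costs with a `margin` parameter.

```python
import math
import random

def tour_length(coords, tour):
    """Length of the closed tour visiting coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))

def rollout(coords, scores, greedy, rng=None):
    """Decode a tour from per-node scores, greedily or by sampling."""
    remaining = list(range(len(coords)))
    tour = []
    while remaining:
        if greedy:
            node = max(remaining, key=lambda j: scores[j])
        else:
            weights = [math.exp(scores[j]) for j in remaining]
            node = rng.choices(remaining, weights=weights)[0]
        tour.append(node)
        remaining.remove(node)
    return tour

def advantage(coords, theta, theta_bl, rng):
    """Cost of a sampled tour minus cost of the greedy baseline tour."""
    sampled = rollout(coords, theta, greedy=False, rng=rng)
    baseline = rollout(coords, theta_bl, greedy=True)
    return tour_length(coords, sampled) - tour_length(coords, baseline)

def maybe_update_baseline(theta, theta_bl, eval_instances, margin=0.0):
    """Copy theta into the baseline if its mean greedy cost is lower.
    Simplification: a plain mean comparison, not a statistical test."""
    def mean_cost(params):
        costs = [tour_length(c, rollout(c, params, greedy=True)) for c in eval_instances]
        return sum(costs) / len(costs)
    if mean_cost(theta) < mean_cost(theta_bl) - margin:
        return list(theta)
    return theta_bl

# Hypothetical instance: the four corners of the unit square.
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
theta = [3.0, 2.0, 1.0, 0.0]      # greedy tour 0, 1, 2, 3 (length 4)
theta_bl = [3.0, 1.0, 2.0, 0.0]   # greedy tour 0, 2, 1, 3 (crosses the square)
rng = random.Random(0)
adv = advantage(coords, theta, theta_bl, rng)
new_baseline = maybe_update_baseline(theta, theta_bl, [coords])
```

In a real training loop the advantage would multiply the gradient of the log-probability of the sampled tour, and the baseline check would run once per epoch on held-out instances.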
  • 24. Comparison with heuristic solvers, non-learned baselines and learned heuristics (results table not reproduced in this transcript) • Row groups: heuristic solver, non-learned baseline, learned heuristic • Annotated methods: structure2vec, Pointer network (PN), PN + RL
  • 25. TSP20 results compared to the pointer network (10000 instances) • PN: pointer network • AM: attention model (proposed method)
  • 27. Discussion • Introduce a model and a training method that both contribute to significantly improved results for learned TSP heuristics. • Using attention instead of recurrence introduces invariance to the input order of the nodes, increasing learning efficiency. • The multi-head attention mechanism allows nodes to communicate relevant information over different channels.