Paper study: Attention, learn to solve routing problems!
ChenYiHuang5
About This Presentation
This is a study of a paper from ICLR 2019. The authors use an attention mechanism (in particular, the Transformer architecture) to solve routing problems, including the TSP.
Slide Content
Attention, Learn to Solve
Routing Problems!
ICLR 2019
University of Amsterdam
Wouter Kool, Herke van Hoof and Max Welling
Abstract
•Learning heuristics for combinatorial optimization problems can save costly development.
•Propose a model based on attention layers and train this model using
REINFORCE with a baseline based on deterministic greedy rollout.
•Outperform recent learned heuristics for TSP.
Introduction
•Approaches to solving combinatorial optimization problems can be divided into
•Exact methods: guarantee finding optimal solutions
•Heuristics: trade off optimality for computational cost, usually expressed in
the form of rules (like a policy for making decisions)
•Train a model to parameterize policies to obtain new and stronger
algorithms for routing problems.
Introduction (cont’d)
•Propose a model based on attention and train it using REINFORCE
with a greedy rollout baseline.
•Show the flexibility of the proposed approach on multiple routing
problems.
Background
Attention mechanism
•For an encoder-decoder model, attention is used to obtain a new context vector.
•$h_i$ denotes an encoder hidden state, $s_t$ denotes a decoder hidden state.
•Alignment model (compatibility): relationship between the current decoding state and every encoding state
  $e_{ti} = a(s_{t-1}, h_i)$
•Attention weight
  $\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{j=1}^{n} \exp(e_{tj})}$
•Context vector
  $c_t = \sum_{i=1}^{n} \alpha_{ti} h_i$
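A minimal NumPy sketch of this attention step. The alignment function $a$ is assumed here to be a small additive (Bahdanau-style) scorer; all dimensions and random weights are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # hidden size (illustrative)
n = 5                       # number of encoder states
H = rng.normal(size=(n, d)) # encoder hidden states h_1..h_n
s_prev = rng.normal(size=d) # previous decoder state s_{t-1}

# Alignment model a(s_{t-1}, h_i): a small additive scorer (assumption).
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
e = np.tanh(s_prev @ W_s + H @ W_h) @ v      # compatibilities e_{t,1..n}

# Attention weights: softmax over the encoder positions.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_t = sum_i alpha_{ti} h_i
c = alpha @ H
print(c.shape)  # (16,)
```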
Transformer
•Multi-head attention: project the input encodings into several different subspaces
•Self-attention: no additional decoding state, just encoding states
themselves
•Each head has its own attention mechanism
Attention model
Problem definition
•Define a problem instance $s$ as a graph with $n$ nodes, where node $i \in \{1,\dots,n\}$ is represented by features $x_i$.
•For TSP, $x_i$ is the coordinate of node $i$ (in 2D space).
•Define a solution $\pi = (\pi_1, \dots, \pi_n)$ as a permutation of the nodes.
•Given a problem $s$, the model outputs a policy $p(\pi \mid s)$ for selecting a solution $\pi$.
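For concreteness, a small sketch of a random TSP instance and the tour length $L(\pi)$ that the policy tries to minimize (instance size and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(size=(n, 2))      # node features: 2D coordinates x_1..x_n

def tour_length(x, pi):
    """Length L(pi) of the closed tour visiting the nodes in order pi."""
    coords = x[pi]
    diffs = coords - np.roll(coords, shift=-1, axis=0)  # includes the return edge
    return np.linalg.norm(diffs, axis=1).sum()

pi = rng.permutation(n)           # a solution: some permutation of the nodes
print(tour_length(x, pi))
```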
Encoder-decoder model
•The encoder-decoder model defines a stochastic policy $p(\pi \mid s)$ for selecting a solution $\pi$
given a problem instance $s$:
  $p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1})$
•The encoder produces embeddings of all input nodes.
•The decoder produces the sequence $\pi$, one node at a time, based on the node embeddings, a mask and a context (see the sketch after this list).
•For TSP,
•embedding nodes: from the encoder
•mask: the remaining (unvisited) nodes during decoding
•context: the first and last node embeddings of the tour so far during decoding
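A sketch of how this decoding loop might look. The `step_probs` callback is a placeholder standing in for the attention decoder detailed on the following slides; it only needs to return probabilities that are zero on masked nodes.

```python
import numpy as np

def decode(node_emb, step_probs, rng):
    """Produce a permutation pi one node at a time.

    node_emb:  (n, d_h) embeddings from the encoder
    step_probs(node_emb, mask, first, last) -> (n,) probabilities,
               zero on masked (already visited) nodes
    """
    n = node_emb.shape[0]
    mask = np.zeros(n, dtype=bool)      # True = already visited
    pi, first, last = [], None, None
    for _ in range(n):
        p = step_probs(node_emb, mask, first, last)
        i = rng.choice(n, p=p)          # or p.argmax() for greedy decoding
        pi.append(i)
        mask[i] = True
        first, last = pi[0], i
    return pi
```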
Multi-head attention
•$\mathrm{MHA}^{\ell}_i\!\left(h_1^{(\ell-1)}, \dots, h_n^{(\ell-1)}\right)$: multi-head attention of layer $\ell$ applied to the node embeddings of the previous layer.
•Let the number of heads be $M = 8$ and the embedding dimension $d_h = 128$.
•Each head has its own attention mechanism.
Result vector of each head
•Each node $i$ has its own query $q_i$, key $k_i$ and value $v_i$:
  $q_i = W^Q h_i,\quad k_i = W^K h_i,\quad v_i = W^V h_i$
•$W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices, $W^V$ is a $(d_v \times d_h)$ matrix.
•Given node $i$ and another node $j$:
•$q_i$ and $k_j$ determine the importance of $v_j$
•Compatibility $u_{ij} = \dfrac{q_i^{\top} k_j}{\sqrt{d_k}}$ if node $i$ is adjacent to node $j$, else $-\infty$.
•Attention weight $a_{ij} = \dfrac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}} \in [0, 1]$
•Result vector $h_i' = \sum_j a_{ij} v_j$ (size is $d_v$)
•In short: 1. compute the compatibilities; 2. compute the attention weights; 3. take the linear combination of the $a_{ij}$ and $v_j$.
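A NumPy sketch of one head's computation over all node embeddings (the weight matrices are stored transposed relative to the slide's notation, so that $q_i = h_i W^Q$; for TSP the graph is fully connected, so no $-\infty$ masking is applied here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h, M = 6, 128, 8
d_k = d_v = d_h // M                     # 16 per head

h = rng.normal(size=(n, d_h))            # node embeddings from the previous layer
W_Q = rng.normal(size=(d_h, d_k))        # slide: (d_k x d_h); here transposed
W_K = rng.normal(size=(d_h, d_k))
W_V = rng.normal(size=(d_h, d_v))

q, k, v = h @ W_Q, h @ W_K, h @ W_V      # per-node queries, keys, values

# 1. compatibility u_ij = q_i^T k_j / sqrt(d_k)
u = q @ k.T / np.sqrt(d_k)

# 2. attention weights a_ij = softmax over j of u_ij
a = np.exp(u - u.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)

# 3. result vector h_i' = sum_j a_ij v_j  (one d_v-dimensional vector per node)
h_prime = a @ v
print(h_prime.shape)                     # (6, 16)
```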
Final result vector
•Let $h'_{im}$ denote the result vector of node $i$ in head $m$ (size is $d_v$).
•In the Transformer, the result vectors are concatenated first and then transformed:
  $\mathrm{MHA}_i(h_1, \dots, h_n) = W^O\,\mathrm{concat}(h'_{i1}, \dots, h'_{iM})$, with $W^O$ of size $d_h \times (M \cdot d_v)$.
•In the proposed method, each result vector is transformed and the results are summed:
  $\mathrm{MHA}_i(h_1, \dots, h_n) = \sum_{m=1}^{M} W^O_m h'_{im}$, with each $W^O_m$ of size $d_h \times d_v$.
•Both methods output a $d_h$-dimensional vector for each node.
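A sketch contrasting the two ways of combining the $M$ per-head results (the matrices are again stored transposed relative to the slide; both variants yield a $d_h$-dimensional vector per node, and they are equivalent parameterizations when $W^O$ is split column-wise into $M$ blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, d_h = 6, 8, 128
d_v = d_h // M

h_heads = rng.normal(size=(M, n, d_v))       # h'_{im}: result of node i in head m

# Transformer style: concatenate the M results, then one joint projection.
W_O = rng.normal(size=(M * d_v, d_h))
out_concat = h_heads.transpose(1, 0, 2).reshape(n, M * d_v) @ W_O

# Proposed style: project each head's result with its own matrix, then sum.
W_O_m = rng.normal(size=(M, d_v, d_h))
out_sum = sum(h_heads[m] @ W_O_m[m] for m in range(M))

print(out_concat.shape, out_sum.shape)       # (6, 128) (6, 128)
```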
Decoder
•At decoding time, the decoder context consists of the embedding of the graph, the last node and the first node:
  $h^{(N)}_{(c)} = \begin{cases} \left[\bar{h}^{(N)}, h^{(N)}_{\pi_{t-1}}, h^{(N)}_{\pi_1}\right] & \text{if } t > 1 \\ \left[\bar{h}^{(N)}, v^{\mathrm{l}}, v^{\mathrm{f}}\right] & \text{otherwise.} \end{cases}$
•The $(3 \cdot d_h)$-dimensional result vector $h^{(N)}_{(c)}$ is the embedding of the special context node $(c)$.
•$[\cdot,\cdot,\cdot]$ is the horizontal concatenation operator.
•$v^{\mathrm{l}}$ and $v^{\mathrm{f}}$ are learnable $d_h$-dimensional parameters.
Update context node embedding
•Obtain a new context node embedding $h^{(N+1)}_{(c)}$ using $M$-head attention.
•The keys and values come from the node embeddings $h^{(N)}_i$; the query comes from the context node:
  $q_{(c)} = W^Q h_{(c)},\quad k_i = W^K h_i,\quad v_i = W^V h_i$
•Compatibility $u_{(c)j} = \dfrac{q_{(c)}^{\top} k_j}{\sqrt{d_k}}$ (with $d_k = \frac{d_h}{M}$) if node $j$ has not been visited yet, else $-\infty$.
•Apply the same $\mathrm{MHA}$ as before to get $h^{(N+1)}_{(c)}$ (size is $d_h$).
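A single-head sketch of this context update (one head with $d_k = d_h$ instead of $M$ heads, to keep it short; the visited mask sets compatibilities to $-\infty$ before the softmax):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h = 20, 128
d_k = d_h                                   # single head for brevity
h = rng.normal(size=(n, d_h))               # node embeddings h_i^(N)
h_c = rng.normal(size=3 * d_h)              # context embedding h_(c)^(N)
visited = np.zeros(n, dtype=bool)
visited[[3, 7]] = True

W_Q = rng.normal(size=(3 * d_h, d_k))
W_K = rng.normal(size=(d_h, d_k))
W_V = rng.normal(size=(d_h, d_h))

q_c = h_c @ W_Q                             # query from the context node only
k, v = h @ W_K, h @ W_V                     # keys/values from the node embeddings

u = k @ q_c / np.sqrt(d_k)                  # compatibilities u_(c)j
u[visited] = -np.inf                        # already visited nodes are masked out

a = np.exp(u - u[~visited].max())
a /= a.sum()
h_c_new = a @ v                             # updated context embedding, size d_h
print(h_c_new.shape)                        # (128,)
```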
Final output probability
•Compute $p_\theta(\pi_t \mid s, \pi_{1:t-1})$ using a single attention head ($M = 1$, $d_k = d_h$), but only compute the compatibilities (no $v_j$ is needed):
  $u_{(c)j} = C \cdot \tanh\!\left(\dfrac{q_{(c)}^{\top} k_j}{\sqrt{d_k}}\right) \in [-C, C]$ if node $j$ has not been visited yet, else $-\infty$ (with $C = 10$).
•Compute the final output probability vector $p$ using a softmax:
  $p_i = p_\theta(\pi_t = i \mid s, \pi_{1:t-1}) = \dfrac{e^{u_{(c)i}}}{\sum_j e^{u_{(c)j}}}$
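A sketch of this final step: a single head produces only compatibilities, which are tanh-clipped to $[-C, C]$, masked, and turned into probabilities with a softmax (weights and the visited set are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h, C = 20, 128, 10.0
h = rng.normal(size=(n, d_h))               # node embeddings (keys come from these)
h_c = rng.normal(size=d_h)                  # updated context embedding h_(c)^(N+1)
visited = np.zeros(n, dtype=bool)
visited[[3, 7]] = True

W_Q = rng.normal(size=(d_h, d_h))           # single head, d_k = d_h
W_K = rng.normal(size=(d_h, d_h))
q_c, k = h_c @ W_Q, h @ W_K

u = C * np.tanh(k @ q_c / np.sqrt(d_h))     # clipped compatibilities in [-C, C]
u[visited] = -np.inf                        # visited nodes get zero probability

p = np.exp(u - u[~visited].max())
p /= p.sum()                                # p_i = p_theta(pi_t = i | s, pi_{1:t-1})
print(p.round(3), p.sum())
```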
REINFORCE with greedy rollout baseline
REINFORCE with baseline
•Define the loss $\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}[L(\pi)]$.
•Optimize $\mathcal{L}$ by gradient descent using REINFORCE.
•Introducing a baseline reduces the gradient variance and thus speeds up learning:
  $\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\!\left[\left(L(\pi) - b(s)\right) \nabla \log p_\theta(\pi \mid s)\right]$
•Common baselines
•Exponential moving average $b(s) = M$ with decay $\beta$:
  $M_0 = L(\pi),\quad M_{t+1} = \beta M_t + (1 - \beta)\, L(\pi)$
•Learned value function (critic) $\hat{v}(s, w)$, where $w$ is learned from $(s, L(\pi))$.
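A PyTorch-style sketch of the resulting gradient estimator. The tensors here are toy values; in practice `tour_length`, `baseline` and `log_prob` come from rolling out the policy, with `log_prob` being the sum of log-probabilities of the chosen nodes.

```python
import torch

def reinforce_loss(tour_length, baseline, log_prob):
    """Surrogate loss whose gradient is (L(pi) - b(s)) * grad log p_theta(pi|s).

    tour_length: (batch,) cost L(pi) of each sampled solution
    baseline:    (batch,) baseline value b(s), treated as a constant
    log_prob:    (batch,) log p_theta(pi|s) of each sampled solution
    """
    advantage = (tour_length - baseline).detach()   # no gradient through L or b
    return (advantage * log_prob).mean()

# Toy numbers only, to show the mechanics.
L = torch.tensor([5.2, 4.8, 6.1])
b = torch.tensor([5.0, 5.0, 5.0])
logp = torch.tensor([-3.1, -2.7, -3.4], requires_grad=True)
reinforce_loss(L, b, logp).backward()
print(logp.grad)
```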
Proposed baseline
•Two models: one for training, another for the baseline.
•Sample a solution $\pi$ based on $p_\theta$.
•Greedily pick a baseline solution $\pi^{\mathrm{BL}}$ based on $p_{\theta^{\mathrm{BL}}}$.
•Calculate the gradient of the loss with REINFORCE, using the length of $\pi^{\mathrm{BL}}$ as the baseline.
•Copy the training parameters to the baseline, i.e. replace the baseline parameters, only if the improvement is significant.
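A high-level sketch of one training epoch with this greedy rollout baseline. `rollout`, `greedy_rollout` and `significantly_better` are placeholders for the sampled decoding, the greedy decoding, and the comparison used to decide when the baseline is updated; they are not part of any specific library.

```python
import copy
import torch

def train_epoch(model, baseline_model, optimizer, batches,
                rollout, greedy_rollout, significantly_better):
    for instances in batches:
        # Sample a solution pi with the training policy p_theta ...
        tour_len, log_prob = rollout(model, instances)
        # ... and greedily decode a baseline solution pi^BL with p_theta^BL.
        with torch.no_grad():
            baseline_len = greedy_rollout(baseline_model, instances)

        # REINFORCE with the greedy rollout length as baseline b(s).
        advantage = tour_len - baseline_len
        loss = (advantage.detach() * log_prob).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Replace the baseline parameters only if the improvement is significant.
    if significantly_better(model, baseline_model):
        baseline_model.load_state_dict(copy.deepcopy(model.state_dict()))
```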
Results
•PN: pointer network; AM: attention model (proposed method).
•TSP20 results compared to the pointer network (10,000 instances).
•Generalization ability.
Discussion
•Introduce a model and a training method which both contribute to
significantly improved results for learned heuristics on the TSP.
•Using attention instead of recurrence introduces invariance to the
input order of the nodes, increasing learning efficiency.
•The multi-head attention mechanism allows nodes to communicate
relevant information over different channels.