Paper study: Attention, learn to solve routing problems!


About This Presentation

This presentation covers a paper from ICLR 2019.
The authors use the attention mechanism (in particular, a Transformer-style architecture) to solve routing problems, including the TSP.


Slide Content

Attention, Learn to Solve
Routing Problems!
ICLR 2019
University of Amsterdam
Wouter Kool, Herke van Hoof and Max Welling

Abstract
• Learning heuristics for combinatorial optimization problems can save costly development.
• Propose a model based on attention layers and train it using REINFORCE with a baseline based on a deterministic greedy rollout.
• Outperforms recent learned heuristics for TSP.

Introduction
• Approaches to solving combinatorial optimization problems can be divided into
  • Exact methods: guarantee finding optimal solutions
  • Heuristics: trade off optimality for computational cost, usually expressed in the form of rules (like a policy for making decisions)
• Train a model to parameterize policies in order to obtain new and stronger algorithms for routing problems.

Introduction (cont’d)
• Propose a model based on attention and train it using REINFORCE with a greedy rollout baseline.
• Show the flexibility of the proposed approach on multiple routing problems.

Background

Attention mechanism
• For an encoder-decoder model, attention is used to obtain a new context vector.
• $h_j$ denotes an encoder hidden state, $s_i$ denotes a decoder hidden state.
• Alignment model (compatibility): the relationship between the current decoding state and every encoding state.
  • $e_{ij} = a(s_{i-1}, h_j)$
• Attention weight
  • $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'=1}^{n} \exp(e_{ij'})}$
• Context vector (a sketch of these steps follows below)
  • $c_i = \sum_{j=1}^{n} \alpha_{ij} h_j$
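A minimal NumPy sketch of these three steps. The names (`e`, `alpha`, `c`) mirror the symbols above, and the alignment model here is simplified to a plain dot product rather than the learned function $a(\cdot,\cdot)$, so this is an illustration rather than the exact model.

```python
import numpy as np

def attention_context(s_prev, H):
    """One attention step for a decoder state over encoder states.

    s_prev : (d,) previous decoder hidden state s_{i-1}
    H      : (n, d) encoder hidden states h_1..h_n
    Returns the attention weights alpha and the context vector c_i.
    """
    # Alignment scores e_{ij} = a(s_{i-1}, h_j); here a(.,.) is a dot product.
    e = H @ s_prev                       # (n,)
    # Attention weights: softmax over the encoder positions.
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()          # (n,)
    # Context vector: weighted sum of encoder states.
    c = alpha @ H                        # (d,)
    return alpha, c
```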

Transformer
• Multi-head attention: project the input encoding into several different subspaces.
• Self-attention: no additional decoding state, just the encoding states attending to themselves.
• Each head has its own attention mechanism.

Attention model

Problem definition
• Define a problem instance $s$ as a graph with $n$ nodes, where node $i \in \{1, \dots, n\}$ is represented by features $x_i$.
  • For TSP, $x_i$ is the coordinate of node $i$ (in 2D space).
• Define a solution $\pi = (\pi_1, \dots, \pi_n)$ as a permutation of the nodes.
• Given a problem $s$, the model outputs a policy $p(\pi \mid s)$ for selecting a solution $\pi$ (a small sketch follows below).
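For concreteness, a small sketch of how a TSP instance and a solution look under this definition: `coords` holds the node features $x_i$ and `tour` is a permutation $\pi$ (the names are illustrative, not from the paper's code).

```python
import numpy as np

def tour_length(coords, tour):
    """Total Euclidean length of a TSP tour.

    coords : (n, 2) node coordinates x_i
    tour   : permutation of 0..n-1 (the solution pi)
    """
    ordered = coords[tour]
    # Distance between consecutive nodes, wrapping back to the start.
    diffs = np.roll(ordered, -1, axis=0) - ordered
    return np.linalg.norm(diffs, axis=1).sum()

coords = np.random.rand(20, 2)           # a TSP20 instance
tour = np.random.permutation(20)         # a (random) solution
print(tour_length(coords, tour))
```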

Encoder-decoder model
• The encoder-decoder model defines a stochastic policy $p(\pi \mid s)$ for selecting a solution $\pi$ given a problem instance $s$:
  $p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1})$
• The encoder produces embeddings of all input nodes.
• The decoder produces the sequence $\pi$ one node at a time, based on the node embeddings, a mask, and a context (a rollout sketch follows below).
• For TSP,
  • node embeddings: from the encoder
  • mask: the remaining (unvisited) nodes during decoding
  • context: the first and last node embeddings of the partial tour during decoding
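A minimal sketch of this factorized policy: the tour is built one node at a time, and the log-probability of the full tour is the sum of per-step log-probabilities. Here `step_probs` is an assumed stand-in for the decoder described on the following slides; it must return zero probability for already-visited nodes (the mask).

```python
import numpy as np

def rollout(step_probs, n, greedy=False, rng=np.random):
    """Construct a tour pi one node at a time and accumulate log p(pi | s).

    step_probs(visited, last, first) -> (n,) probabilities over nodes,
    with zero probability on already-visited nodes (the mask).
    """
    visited, tour, log_p = np.zeros(n, bool), [], 0.0
    for t in range(n):
        last = tour[-1] if tour else None
        first = tour[0] if tour else None
        p = step_probs(visited, last, first)          # p(pi_t | s, pi_{1:t-1})
        i = int(p.argmax()) if greedy else int(rng.choice(n, p=p))
        tour.append(i)
        visited[i] = True
        log_p += np.log(p[i])
    return tour, log_p
```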

Encoder
• $d_x$-dimensional input features $x_i$. For TSP, $d_x = 2$.
• $d_h$-dimensional node embeddings. Let $d_h = 128$.
• Initial embedding: $h_i^{(0)} = W^x x_i + b^x$
• The embeddings $h_i^{(\ell)}$ are updated using $N$ attention layers (a sketch follows below):
  $\hat{h}_i = \mathrm{BN}^\ell\!\left(h_i^{(\ell-1)} + \mathrm{MHA}_i^\ell\!\left(h_1^{(\ell-1)}, \dots, h_n^{(\ell-1)}\right)\right)$
  $h_i^{(\ell)} = \mathrm{BN}^\ell\!\left(\hat{h}_i + \mathrm{FF}^\ell(\hat{h}_i)\right)$
• Graph embedding: $\bar{h}^{(N)} = \frac{1}{n}\sum_{i=1}^{n} h_i^{(N)}$
$i$ denotes the node index
$\ell$ denotes the output of the $\ell$-th attention layer
FF: node-wise feed-forward layer
MHA: multi-head attention
BN: batch normalization
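A minimal NumPy sketch of the encoder, under a few simplifying assumptions that are not the paper's exact setup: single-head self-attention (the paper uses $M = 8$ heads, sketched on the next slides), a 512-unit ReLU feed-forward layer, and batch normalization reduced to a plain per-feature normalization over the nodes with no learned scale or shift.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, n, N = 2, 128, 20, 3

def self_attention(H, Wq, Wk, Wv):
    """Single-head self-attention over node embeddings H of shape (n, d_h)."""
    Q, K, V = H @ Wq.T, H @ Wk.T, H @ Wv.T
    u = Q @ K.T / np.sqrt(K.shape[1])              # compatibilities u_ij
    a = np.exp(u - u.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)              # attention weights
    return a @ V

def batch_norm(H):
    """Simplified BN: per-feature normalization over the nodes."""
    return (H - H.mean(0)) / (H.std(0) + 1e-6)

def feed_forward(H, W1, W2):
    """Node-wise feed-forward with one ReLU hidden layer."""
    return np.maximum(H @ W1.T, 0) @ W2.T

# Initial embedding h_i^(0) = W^x x_i + b^x
x = rng.random((n, d_x))
Wx, bx = rng.normal(size=(d_h, d_x)) * 0.1, np.zeros(d_h)
H = x @ Wx.T + bx

# N attention layers with skip connections and (simplified) batch normalization
for _ in range(N):
    Wq, Wk, Wv = (rng.normal(size=(d_h, d_h)) * 0.05 for _ in range(3))
    W1, W2 = rng.normal(size=(512, d_h)) * 0.05, rng.normal(size=(d_h, 512)) * 0.05
    H_hat = batch_norm(H + self_attention(H, Wq, Wk, Wv))
    H = batch_norm(H_hat + feed_forward(H_hat, W1, W2))

graph_embedding = H.mean(axis=0)                   # h_bar^(N), shape (d_h,)
```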

Multi-head attention
• $\mathrm{MHA}_i^\ell\!\left(h_1^{(\ell-1)}, \dots, h_n^{(\ell-1)}\right)$
• Let the number of heads $M = 8$ and the embedding dimension $d_h = 128$.
• Each head has its own attention mechanism.

Result vector of each head
• Each node has its own query $q_i$, key $k_i$ and value $v_i$:
  $q_i = W^Q h_i, \quad k_i = W^K h_i, \quad v_i = W^V h_i$
• $W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices, $W^V$ is a $(d_v \times d_h)$ matrix.
• Given node $i$ and another node $j$:
  • $q_i$ and $k_j$ determine the importance of $v_j$
  • Compatibility: $u_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d_k}}$ if node $i$ is adjacent to node $j$, else $-\infty$.
  • Attention weight: $a_{ij} = \frac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}} \in [0, 1]$
  • Result vector: $h_i' = \sum_j a_{ij} v_j$ (size is $d_v$)

1. Compute the compatibility
2. Compute the attention weight
3. Take the linear combination of the $a_{ij}$ and $v_j$ (see the sketch below)
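A minimal sketch of a single head, following the three steps above. Adjacency masking is omitted here since for TSP the graph is fully connected; the parameter names are illustrative.

```python
import numpy as np

def attention_head(H, Wq, Wk, Wv):
    """One attention head over node embeddings H of shape (n, d_h).

    Wq, Wk : (d_k, d_h)    Wv : (d_v, d_h)
    Returns the per-node result vectors h_i' of shape (n, d_v).
    """
    Q, K, V = H @ Wq.T, H @ Wk.T, H @ Wv.T
    # 1. Compatibility u_ij = q_i . k_j / sqrt(d_k)
    u = Q @ K.T / np.sqrt(Wk.shape[0])
    # 2. Attention weights a_ij = softmax over j
    a = np.exp(u - u.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    # 3. Linear combination of the values: h_i' = sum_j a_ij v_j
    return a @ V
```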

Final result vector
• Let $h_{im}'$ denote the result vector of node $i$ in head $m$ (size is $d_v$).
• In the Transformer, the result vectors are first concatenated and then transformed:
  $\mathrm{MHA}_i(h_1, \dots, h_n) = W^O \, \mathrm{concat}(h_{i1}', \dots, h_{iM}')$, where $W^O$ is $d_h \times (M \cdot d_v)$
• In the proposed method, each result vector is transformed and then summed:
  $\mathrm{MHA}_i(h_1, \dots, h_n) = \sum_{m=1}^{M} W_m^O h_{im}'$, where each $W_m^O$ is $d_h \times d_v$
• Both methods output a $d_h$-dimensional vector for each node (see the equivalence sketch below).
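A short sketch illustrating that the two formulations coincide when $W^O$ is the horizontal concatenation of the per-head matrices $W_m^O$. The per-head result vectors are filled with random values here just to show the shapes; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, d_h = 5, 8, 128
d_v = d_h // M

# Per-head result vectors h'_{im} and per-head output matrices W^O_m
head_outputs = [rng.normal(size=(n, d_v)) for _ in range(M)]
W_O_m = [rng.normal(size=(d_h, d_v)) for _ in range(M)]

# Transformer style: concatenate, then transform with W^O = [W^O_1 ... W^O_M]
W_O = np.concatenate(W_O_m, axis=1)                          # (d_h, M*d_v)
concat_then_project = np.concatenate(head_outputs, axis=1) @ W_O.T

# Proposed style: transform each head's result, then sum
project_then_sum = sum(h @ W.T for h, W in zip(head_outputs, W_O_m))

print(np.allclose(concat_then_project, project_then_sum))    # True
```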

Decoder
• At decoding time, the decoder context consists of the embeddings of the graph, the last node and the first node (see the sketch below):
  $h_{(c)}^{(N)} = \begin{cases} \left[\bar{h}^{(N)},\, h_{\pi_{t-1}}^{(N)},\, h_{\pi_1}^{(N)}\right] & \text{if } t > 1 \\ \left[\bar{h}^{(N)},\, v^{l},\, v^{f}\right] & \text{otherwise.} \end{cases}$
• The $(3 \cdot d_h)$-dimensional result vector $h_{(c)}^{(N)}$ is the embedding of the special context node $(c)$.
$[\cdot, \cdot, \cdot]$ is the horizontal concatenation operator
$v^{l}$ and $v^{f}$ are learnable $d_h$-dimensional parameters
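A small sketch of this context construction, assuming the encoder provides `H` (the final node embeddings) and `graph_emb` (their mean); `v_l` and `v_f` stand for the learned placeholder parameters used at the first step.

```python
import numpy as np

def decoder_context(H, graph_emb, tour, v_l, v_f):
    """Build the (3 * d_h)-dimensional context node embedding h_(c).

    H         : (n, d_h) final node embeddings h_i^(N)
    graph_emb : (d_h,)   graph embedding h_bar^(N)
    tour      : node indices chosen so far (pi_1 .. pi_{t-1})
    v_l, v_f  : (d_h,)   learned placeholders used when the tour is empty
    """
    if len(tour) > 0:          # t > 1: last and first node of the partial tour
        last, first = H[tour[-1]], H[tour[0]]
    else:                      # t = 1: no nodes chosen yet
        last, first = v_l, v_f
    return np.concatenate([graph_emb, last, first])   # shape (3 * d_h,)
```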

Update context node embedding
• Obtain the new context node embedding $h_{(c)}^{(N+1)}$ using $M$-head attention.
• The keys and values come from the node embeddings $h_i^{(N)}$; the query comes from the context node:
  $q_{(c)} = W^Q h_{(c)}, \quad k_i = W^K h_i, \quad v_i = W^V h_i$
• Compatibility: $u_{(c)j} = \frac{q_{(c)}^{\top} k_j}{\sqrt{d_k}}$ if node $j$ has not been visited yet, else $-\infty$.
• Apply the same kind of MHA to get $h_{(c)}^{(N+1)}$ (size is $d_h$); a sketch follows below.
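A sketch of this update with a single head for brevity (the paper uses $M = 8$ heads here as well); the mask sets the compatibilities of visited nodes to $-\infty$ before the softmax, so they receive zero attention weight.

```python
import numpy as np

def update_context(h_c, H, visited, Wq, Wk, Wv):
    """Glimpse: attend from the context node to the unvisited node embeddings.

    h_c     : (3*d_h,) context embedding          visited : (n,) boolean mask
    H       : (n, d_h) node embeddings
    Wq : (d_k, 3*d_h)    Wk : (d_k, d_h)    Wv : (d_h, d_h)
    """
    q = Wq @ h_c
    K, V = H @ Wk.T, H @ Wv.T
    u = K @ q / np.sqrt(K.shape[1])       # compatibilities u_(c)j
    u[visited] = -np.inf                  # mask already-visited nodes
    a = np.exp(u - u[~visited].max())
    a /= a.sum()
    return a @ V                          # new context embedding, shape (d_h,)
```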

Final output probability
• Compute $p_\theta(\pi_t \mid s, \pi_{1:t-1})$ using a single attention head ($M = 1$, $d_k = d_h$), but only compute the compatibility (no values $v_j$ are needed).
• $u_{(c)j} = C \cdot \tanh\!\left(\frac{q_{(c)}^{\top} k_j}{\sqrt{d_k}}\right) \in [-C, C]$ if node $j$ has not been visited yet, else $-\infty$ (with $C = 10$).
• Compute the final output probability vector $p$ using a softmax (see the sketch below):
  $p_i = p_\theta(\pi_t = i \mid s, \pi_{1:t-1}) = \frac{e^{u_{(c)i}}}{\sum_j e^{u_{(c)j}}}$
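A sketch of this final step: clipped compatibilities with $C = 10$, masking of visited nodes, then a softmax. The names continue the illustrative conventions of the earlier sketches.

```python
import numpy as np

def output_probabilities(h_c_next, H, visited, Wq, Wk, C=10.0):
    """Final single-head step: probabilities over the next node to visit.

    h_c_next : (d_h,) updated context embedding h_(c)^(N+1)
    H        : (n, d_h) node embeddings          visited : (n,) boolean mask
    Wq, Wk   : (d_h, d_h) (single head, d_k = d_h); no values are needed.
    """
    q = Wq @ h_c_next
    K = H @ Wk.T
    # Clipped compatibilities u_(c)j in [-C, C], -inf for visited nodes
    u = C * np.tanh(K @ q / np.sqrt(K.shape[1]))
    u[visited] = -np.inf
    # Softmax gives p_i = p_theta(pi_t = i | s, pi_{1:t-1})
    p = np.exp(u - u[~visited].max())
    return p / p.sum()
```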

REINFORCE with greedy rollout baseline

REINFORCE with baseline
• Define the loss $\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}[L(\pi)]$.
• Optimize $\mathcal{L}$ by gradient descent using REINFORCE.
• Introducing a baseline reduces the gradient variance and thereby speeds up learning (a sketch follows below):
  $\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\!\left[\left(L(\pi) - b(s)\right) \nabla \log p_\theta(\pi \mid s)\right]$
• Common baselines
  • Exponential moving average $b(s) = M$ with decay $\beta$:
    $M_0 = L(\pi), \quad M_{t+1} = \beta M_t + (1 - \beta) L(\pi)$
  • Learned value function (critic) $\hat{v}(s, w)$, where the parameters $w$ are learned from pairs $(s, L(\pi))$.
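A minimal sketch of how the baseline enters the gradient estimate: each sampled tour is weighted by $(L(\pi) - b(s))$, and that weight multiplies $\nabla \log p_\theta(\pi \mid s)$ (the gradient itself would come from an autodiff framework and is not shown). The exponential-moving-average baseline below follows the update above, with the batch mean used for the first value as an assumption of this sketch.

```python
import numpy as np

def reinforce_weights(tour_lengths, baseline):
    """Per-sample REINFORCE weights (L(pi) - b(s)).

    The policy gradient is estimated as the mean over samples of
    (L(pi) - b(s)) * grad log p_theta(pi | s).
    """
    return tour_lengths - baseline

class ExponentialMovingAverageBaseline:
    """b(s) = M with decay beta: M <- beta * M + (1 - beta) * L(pi)."""
    def __init__(self, beta=0.99):
        self.beta, self.M = beta, None
    def update(self, tour_lengths):
        mean_L = float(np.mean(tour_lengths))   # batch mean of L(pi)
        self.M = mean_L if self.M is None else self.beta * self.M + (1 - self.beta) * mean_L
        return self.M
```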

Proposed baseline
• Two models: one for training, another for the baseline.
• Sample a solution $\pi$ based on $p_\theta$.
• Greedily pick a baseline solution $\pi^{\mathrm{BL}}$ based on $p_{\theta^{\mathrm{BL}}}$.
• Calculate the gradient of the loss with REINFORCE, using the length of $\pi^{\mathrm{BL}}$ as the baseline.
• Replace the baseline parameters (copy the training parameters to the baseline) if the improvement is significant.
(A training-loop sketch follows below.)
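A high-level sketch of the training loop with the greedy rollout baseline, assuming a `model` object that can sample tours, decode greedily, and apply a gradient step from the REINFORCE weights; `significantly_better` stands in for the significance test used to decide when to replace the baseline. All of these interfaces are assumptions of this sketch, not the paper's code.

```python
import copy
import numpy as np

def tour_lengths(instances, tours):
    """Lengths L(pi) of a batch of tours; instances are (n, 2) coordinate arrays."""
    lengths = []
    for coords, tour in zip(instances, tours):
        ordered = coords[np.asarray(tour)]
        diffs = np.roll(ordered, -1, axis=0) - ordered
        lengths.append(np.linalg.norm(diffs, axis=1).sum())
    return np.array(lengths)

def train(model, sample_instances, epochs, batches_per_epoch, significantly_better):
    """REINFORCE with a deterministic greedy rollout baseline (sketch).

    model.sample(instances) -> (tours, log_probs)   stochastic rollouts
    model.greedy(instances) -> tours                deterministic rollouts
    model.update(instances, tours, log_probs, weights) applies the gradient step
    """
    baseline_model = copy.deepcopy(model)            # frozen copy used as b(s)
    for epoch in range(epochs):
        for _ in range(batches_per_epoch):
            instances = sample_instances()
            tours, log_probs = model.sample(instances)          # pi ~ p_theta
            baseline_tours = baseline_model.greedy(instances)   # pi^BL
            # REINFORCE weights: L(pi) - L(pi^BL)
            weights = tour_lengths(instances, tours) - tour_lengths(instances, baseline_tours)
            model.update(instances, tours, log_probs, weights)
        # Replace the baseline if the trained model is significantly better
        if significantly_better(model, baseline_model):
            baseline_model = copy.deepcopy(model)
    return model
```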

Experiments

• Compare to a heuristic solver, non-learned baselines and learned heuristics (structure2vec, Pointer Network (PN), PN + RL).
[Results table: TSP20 results compared to the Pointer Network, evaluated on 10,000 instances. PN: Pointer Network, AM: Attention Model (proposed method).]

Generalization ability

Discussion
• Introduces a model and a training method that both contribute to significantly improved results for learned heuristics on TSP.
• Using attention instead of recurrence introduces invariance to the input order of the nodes, increasing learning efficiency.
• The multi-head attention mechanism allows nodes to communicate relevant information over different channels.