Paper study: Attention, learn to solve routing problems!


About This Presentation

This presentation covers a paper from ICLR 2019.
The authors use the attention mechanism (in particular, a Transformer-style architecture) to solve routing problems, including the TSP.


Slide Content

Attention, Learn to Solve
Routing Problems!
ICLR 2019
University of Amsterdam
Wouter Kool, Herke van Hoof and Max Welling

Abstract
• Learning heuristics for combinatorial optimization problems can save costly development.
• Propose a model based on attention layers and train it using REINFORCE with a baseline based on a deterministic greedy rollout.
• Outperforms recent learned heuristics for TSP.

Introduction
• Approaches to solving combinatorial optimization problems can be divided into
  • Exact methods: guarantee finding optimal solutions
  • Heuristics: trade off optimality for computational cost, usually expressed in the form of rules (like a policy for making decisions)
• Train a model to parameterize policies in order to obtain new and stronger algorithms for routing problems.

Introduction (cont’d)
• Propose a model based on attention and train it using REINFORCE with a greedy rollout baseline.
• Show the flexibility of the proposed approach on multiple routing problems.

Background

Attention mechanism
• For an encoder-decoder model, attention is used to obtain a new context vector.
• $h_j$ denotes an encoder hidden state, $s_i$ denotes a decoder hidden state.
• Alignment model (compatibility): the relationship between the current decoding state and every encoding state.
  • $e_{ij} = a(s_{i-1}, h_j)$
• Attention weight
  • $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'=1}^{n} \exp(e_{ij'})}$
• Context vector (a sketch of these steps follows below)
  • $c_i = \sum_{j=1}^{n} \alpha_{ij} h_j$
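A minimal NumPy sketch of these three steps. The names (`e`, `alpha`, `c`) mirror the symbols above, and the alignment model here is simplified to a plain dot product rather than the learned function $a(\cdot,\cdot)$, so this is an illustration rather than the exact model.

```python
import numpy as np

def attention_context(s_prev, H):
    """One attention step for a decoder state over encoder states.

    s_prev : (d,) previous decoder hidden state s_{i-1}
    H      : (n, d) encoder hidden states h_1..h_n
    Returns the attention weights alpha and the context vector c_i.
    """
    # Alignment scores e_{ij} = a(s_{i-1}, h_j); here a(.,.) is a dot product.
    e = H @ s_prev                       # (n,)
    # Attention weights: softmax over the encoder positions.
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()          # (n,)
    # Context vector: weighted sum of encoder states.
    c = alpha @ H                        # (d,)
    return alpha, c
```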

Transformer
• Multi-head attention: project the input encoding into several different subspaces.
• Self-attention: no additional decoding state, just the encoding states attending to themselves.
• Each head has its own attention mechanism.

Attention model

Problem definition
• Define a problem instance $s$ as a graph with $n$ nodes, where node $i \in \{1, \dots, n\}$ is represented by features $x_i$.
  • For TSP, $x_i$ is the coordinate of node $i$ (in 2D space).
• Define a solution $\pi = (\pi_1, \dots, \pi_n)$ as a permutation of the nodes.
• Given a problem $s$, the model outputs a policy $p(\pi \mid s)$ for selecting a solution $\pi$ (a small sketch follows below).
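For concreteness, a small sketch of how a TSP instance and a solution look under this definition: `coords` holds the node features $x_i$ and `tour` is a permutation $\pi$ (the names are illustrative, not from the paper's code).

```python
import numpy as np

def tour_length(coords, tour):
    """Total Euclidean length of a TSP tour.

    coords : (n, 2) node coordinates x_i
    tour   : permutation of 0..n-1 (the solution pi)
    """
    ordered = coords[tour]
    # Distance between consecutive nodes, wrapping back to the start.
    diffs = np.roll(ordered, -1, axis=0) - ordered
    return np.linalg.norm(diffs, axis=1).sum()

coords = np.random.rand(20, 2)           # a TSP20 instance
tour = np.random.permutation(20)         # a (random) solution
print(tour_length(coords, tour))
```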

Encoder-decoder model
• The encoder-decoder model defines a stochastic policy $p(\pi \mid s)$ for selecting a solution $\pi$ given a problem instance $s$:
  $p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1})$
• The encoder produces embeddings of all input nodes.
• The decoder produces the sequence $\pi$ one node at a time, based on the node embeddings, a mask, and a context (a rollout sketch follows below).
• For TSP,
  • node embeddings: from the encoder
  • mask: the remaining (unvisited) nodes during decoding
  • context: the first and last node embeddings of the partial tour during decoding
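A minimal sketch of this factorized policy: the tour is built one node at a time, and the log-probability of the full tour is the sum of per-step log-probabilities. Here `step_probs` is an assumed stand-in for the decoder described on the following slides; it must return zero probability for already-visited nodes (the mask).

```python
import numpy as np

def rollout(step_probs, n, greedy=False, rng=np.random):
    """Construct a tour pi one node at a time and accumulate log p(pi | s).

    step_probs(visited, last, first) -> (n,) probabilities over nodes,
    with zero probability on already-visited nodes (the mask).
    """
    visited, tour, log_p = np.zeros(n, bool), [], 0.0
    for t in range(n):
        last = tour[-1] if tour else None
        first = tour[0] if tour else None
        p = step_probs(visited, last, first)          # p(pi_t | s, pi_{1:t-1})
        i = int(p.argmax()) if greedy else int(rng.choice(n, p=p))
        tour.append(i)
        visited[i] = True
        log_p += np.log(p[i])
    return tour, log_p
```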

Encoder
• $d_x$-dimensional input features $x_i$. For TSP, $d_x = 2$.
• $d_h$-dimensional node embeddings. Let $d_h = 128$.
• Initial embedding: $h_i^{(0)} = W^x x_i + b^x$
• The embeddings $h_i^{(\ell)}$ are updated using $N$ attention layers (a sketch follows below):
  $\hat{h}_i = \mathrm{BN}^\ell\!\left(h_i^{(\ell-1)} + \mathrm{MHA}_i^\ell\!\left(h_1^{(\ell-1)}, \dots, h_n^{(\ell-1)}\right)\right)$
  $h_i^{(\ell)} = \mathrm{BN}^\ell\!\left(\hat{h}_i + \mathrm{FF}^\ell(\hat{h}_i)\right)$
• Graph embedding: $\bar{h}^{(N)} = \frac{1}{n}\sum_{i=1}^{n} h_i^{(N)}$
$i$ denotes the node index
$\ell$ denotes the output of the $\ell$-th attention layer
FF: node-wise feed-forward layer
MHA: multi-head attention
BN: batch normalization
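A minimal NumPy sketch of the encoder, under a few simplifying assumptions that are not the paper's exact setup: single-head self-attention (the paper uses $M = 8$ heads, sketched on the next slides), a 512-unit ReLU feed-forward layer, and batch normalization reduced to a plain per-feature normalization over the nodes with no learned scale or shift.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, n, N = 2, 128, 20, 3

def self_attention(H, Wq, Wk, Wv):
    """Single-head self-attention over node embeddings H of shape (n, d_h)."""
    Q, K, V = H @ Wq.T, H @ Wk.T, H @ Wv.T
    u = Q @ K.T / np.sqrt(K.shape[1])              # compatibilities u_ij
    a = np.exp(u - u.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)              # attention weights
    return a @ V

def batch_norm(H):
    """Simplified BN: per-feature normalization over the nodes."""
    return (H - H.mean(0)) / (H.std(0) + 1e-6)

def feed_forward(H, W1, W2):
    """Node-wise feed-forward with one ReLU hidden layer."""
    return np.maximum(H @ W1.T, 0) @ W2.T

# Initial embedding h_i^(0) = W^x x_i + b^x
x = rng.random((n, d_x))
Wx, bx = rng.normal(size=(d_h, d_x)) * 0.1, np.zeros(d_h)
H = x @ Wx.T + bx

# N attention layers with skip connections and (simplified) batch normalization
for _ in range(N):
    Wq, Wk, Wv = (rng.normal(size=(d_h, d_h)) * 0.05 for _ in range(3))
    W1, W2 = rng.normal(size=(512, d_h)) * 0.05, rng.normal(size=(d_h, 512)) * 0.05
    H_hat = batch_norm(H + self_attention(H, Wq, Wk, Wv))
    H = batch_norm(H_hat + feed_forward(H_hat, W1, W2))

graph_embedding = H.mean(axis=0)                   # h_bar^(N), shape (d_h,)
```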

Multi-head attention
• $\mathrm{MHA}_i^\ell\!\left(h_1^{(\ell-1)}, \dots, h_n^{(\ell-1)}\right)$
• Let the number of heads $M = 8$ and the embedding dimension $d_h = 128$.
• Each head has its own attention mechanism.

Result vector of each head
• Each node has its own query $q_i$, key $k_i$ and value $v_i$:
  $q_i = W^Q h_i, \quad k_i = W^K h_i, \quad v_i = W^V h_i$
• $W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices, $W^V$ is a $(d_v \times d_h)$ matrix.
• Given node $i$ and another node $j$:
  • $q_i$ and $k_j$ determine the importance of $v_j$
  • Compatibility: $u_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d_k}}$ if node $i$ is adjacent to node $j$, else $-\infty$.
  • Attention weight: $a_{ij} = \frac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}} \in [0, 1]$
  • Result vector: $h_i' = \sum_j a_{ij} v_j$ (size is $d_v$)

1. Compute the compatibility
2. Compute the attention weight
3. Take the linear combination of the $a_{ij}$ and $v_j$ (see the sketch below)
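A minimal sketch of a single head, following the three steps above. Adjacency masking is omitted here since for TSP the graph is fully connected; the parameter names are illustrative.

```python
import numpy as np

def attention_head(H, Wq, Wk, Wv):
    """One attention head over node embeddings H of shape (n, d_h).

    Wq, Wk : (d_k, d_h)    Wv : (d_v, d_h)
    Returns the per-node result vectors h_i' of shape (n, d_v).
    """
    Q, K, V = H @ Wq.T, H @ Wk.T, H @ Wv.T
    # 1. Compatibility u_ij = q_i . k_j / sqrt(d_k)
    u = Q @ K.T / np.sqrt(Wk.shape[0])
    # 2. Attention weights a_ij = softmax over j
    a = np.exp(u - u.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    # 3. Linear combination of the values: h_i' = sum_j a_ij v_j
    return a @ V
```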

Final result vector
• Let $h_{im}'$ denote the result vector of node $i$ in head $m$ (size is $d_v$).
• In the Transformer, the result vectors are first concatenated and then transformed:
  $\mathrm{MHA}_i(h_1, \dots, h_n) = W^O \, \mathrm{concat}(h_{i1}', \dots, h_{iM}')$, where $W^O$ is $d_h \times (M \cdot d_v)$
• In the proposed method, each result vector is transformed and then summed:
  $\mathrm{MHA}_i(h_1, \dots, h_n) = \sum_{m=1}^{M} W_m^O h_{im}'$, where each $W_m^O$ is $d_h \times d_v$
• Both methods output a $d_h$-dimensional vector for each node (see the equivalence sketch below).
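A short sketch illustrating that the two formulations coincide when $W^O$ is the horizontal concatenation of the per-head matrices $W_m^O$. The per-head result vectors are filled with random values here just to show the shapes; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, d_h = 5, 8, 128
d_v = d_h // M

# Per-head result vectors h'_{im} and per-head output matrices W^O_m
head_outputs = [rng.normal(size=(n, d_v)) for _ in range(M)]
W_O_m = [rng.normal(size=(d_h, d_v)) for _ in range(M)]

# Transformer style: concatenate, then transform with W^O = [W^O_1 ... W^O_M]
W_O = np.concatenate(W_O_m, axis=1)                          # (d_h, M*d_v)
concat_then_project = np.concatenate(head_outputs, axis=1) @ W_O.T

# Proposed style: transform each head's result, then sum
project_then_sum = sum(h @ W.T for h, W in zip(head_outputs, W_O_m))

print(np.allclose(concat_then_project, project_then_sum))    # True
```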

Decoder
• At decoding time, the decoder context consists of the embeddings of the graph, the last node and the first node (see the sketch below):
  $h_{(c)}^{(N)} = \begin{cases} \left[\bar{h}^{(N)},\, h_{\pi_{t-1}}^{(N)},\, h_{\pi_1}^{(N)}\right] & \text{if } t > 1 \\ \left[\bar{h}^{(N)},\, v^{l},\, v^{f}\right] & \text{otherwise.} \end{cases}$
• The $(3 \cdot d_h)$-dimensional result vector $h_{(c)}^{(N)}$ is the embedding of the special context node $(c)$.
$[\cdot, \cdot, \cdot]$ is the horizontal concatenation operator
$v^{l}$ and $v^{f}$ are learnable $d_h$-dimensional parameters
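A small sketch of this context construction, assuming the encoder provides `H` (the final node embeddings) and `graph_emb` (their mean); `v_l` and `v_f` stand for the learned placeholder parameters used at the first step.

```python
import numpy as np

def decoder_context(H, graph_emb, tour, v_l, v_f):
    """Build the (3 * d_h)-dimensional context node embedding h_(c).

    H         : (n, d_h) final node embeddings h_i^(N)
    graph_emb : (d_h,)   graph embedding h_bar^(N)
    tour      : node indices chosen so far (pi_1 .. pi_{t-1})
    v_l, v_f  : (d_h,)   learned placeholders used when the tour is empty
    """
    if len(tour) > 0:          # t > 1: last and first node of the partial tour
        last, first = H[tour[-1]], H[tour[0]]
    else:                      # t = 1: no nodes chosen yet
        last, first = v_l, v_f
    return np.concatenate([graph_emb, last, first])   # shape (3 * d_h,)
```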

Update context node embedding
• Obtain the new context node embedding $h_{(c)}^{(N+1)}$ using $M$-head attention.
• The keys and values come from the node embeddings $h_i^{(N)}$; the query comes from the context node:
  $q_{(c)} = W^Q h_{(c)}, \quad k_i = W^K h_i, \quad v_i = W^V h_i$
• Compatibility: $u_{(c)j} = \frac{q_{(c)}^{\top} k_j}{\sqrt{d_k}}$ if node $j$ has not been visited yet, else $-\infty$.
• Apply the same kind of MHA to get $h_{(c)}^{(N+1)}$ (size is $d_h$); a sketch follows below.
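A sketch of this update with a single head for brevity (the paper uses $M = 8$ heads here as well); the mask sets the compatibilities of visited nodes to $-\infty$ before the softmax, so they receive zero attention weight.

```python
import numpy as np

def update_context(h_c, H, visited, Wq, Wk, Wv):
    """Glimpse: attend from the context node to the unvisited node embeddings.

    h_c     : (3*d_h,) context embedding          visited : (n,) boolean mask
    H       : (n, d_h) node embeddings
    Wq : (d_k, 3*d_h)    Wk : (d_k, d_h)    Wv : (d_h, d_h)
    """
    q = Wq @ h_c
    K, V = H @ Wk.T, H @ Wv.T
    u = K @ q / np.sqrt(K.shape[1])       # compatibilities u_(c)j
    u[visited] = -np.inf                  # mask already-visited nodes
    a = np.exp(u - u[~visited].max())
    a /= a.sum()
    return a @ V                          # new context embedding, shape (d_h,)
```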

Final output probability
• Compute $p_\theta(\pi_t \mid s, \pi_{1:t-1})$ using a single attention head ($M = 1$, $d_k = d_h$), but only compute the compatibility (no values $v_j$ are needed).
• $u_{(c)j} = C \cdot \tanh\!\left(\frac{q_{(c)}^{\top} k_j}{\sqrt{d_k}}\right) \in [-C, C]$ if node $j$ has not been visited yet, else $-\infty$ (with $C = 10$).
• Compute the final output probability vector $p$ using a softmax (see the sketch below):
  $p_i = p_\theta(\pi_t = i \mid s, \pi_{1:t-1}) = \frac{e^{u_{(c)i}}}{\sum_j e^{u_{(c)j}}}$
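A sketch of this final step: clipped compatibilities with $C = 10$, masking of visited nodes, then a softmax. The names continue the illustrative conventions of the earlier sketches.

```python
import numpy as np

def output_probabilities(h_c_next, H, visited, Wq, Wk, C=10.0):
    """Final single-head step: probabilities over the next node to visit.

    h_c_next : (d_h,) updated context embedding h_(c)^(N+1)
    H        : (n, d_h) node embeddings          visited : (n,) boolean mask
    Wq, Wk   : (d_h, d_h) (single head, d_k = d_h); no values are needed.
    """
    q = Wq @ h_c_next
    K = H @ Wk.T
    # Clipped compatibilities u_(c)j in [-C, C], -inf for visited nodes
    u = C * np.tanh(K @ q / np.sqrt(K.shape[1]))
    u[visited] = -np.inf
    # Softmax gives p_i = p_theta(pi_t = i | s, pi_{1:t-1})
    p = np.exp(u - u[~visited].max())
    return p / p.sum()
```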

REINFORCE with greedy rollout baseline

REINFORCE with baseline
• Define the loss $\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}[L(\pi)]$.
• Optimize $\mathcal{L}$ by gradient descent using REINFORCE.
• Introducing a baseline reduces the gradient variance and thereby speeds up learning (a sketch follows below):
  $\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\!\left[\left(L(\pi) - b(s)\right) \nabla \log p_\theta(\pi \mid s)\right]$
• Common baselines
  • Exponential moving average $b(s) = M$ with decay $\beta$:
    $M_0 = L(\pi), \quad M_{t+1} = \beta M_t + (1 - \beta) L(\pi)$
  • Learned value function (critic) $\hat{v}(s, w)$, where the parameters $w$ are learned from pairs $(s, L(\pi))$.
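A minimal sketch of how the baseline enters the gradient estimate: each sampled tour is weighted by $(L(\pi) - b(s))$, and that weight multiplies $\nabla \log p_\theta(\pi \mid s)$ (the gradient itself would come from an autodiff framework and is not shown). The exponential-moving-average baseline below follows the update above, with the batch mean used for the first value as an assumption of this sketch.

```python
import numpy as np

def reinforce_weights(tour_lengths, baseline):
    """Per-sample REINFORCE weights (L(pi) - b(s)).

    The policy gradient is estimated as the mean over samples of
    (L(pi) - b(s)) * grad log p_theta(pi | s).
    """
    return tour_lengths - baseline

class ExponentialMovingAverageBaseline:
    """b(s) = M with decay beta: M <- beta * M + (1 - beta) * L(pi)."""
    def __init__(self, beta=0.99):
        self.beta, self.M = beta, None
    def update(self, tour_lengths):
        mean_L = float(np.mean(tour_lengths))   # batch mean of L(pi)
        self.M = mean_L if self.M is None else self.beta * self.M + (1 - self.beta) * mean_L
        return self.M
```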

Proposed baseline
• Two models: one for training, another for the baseline.
• Sample a solution $\pi$ based on $p_\theta$.
• Greedily pick a baseline solution $\pi^{\mathrm{BL}}$ based on $p_{\theta^{\mathrm{BL}}}$.
• Calculate the gradient of the loss with REINFORCE, using the length of $\pi^{\mathrm{BL}}$ as the baseline.
• Replace the baseline parameters (copy the training parameters to the baseline) if the improvement is significant.
(A training-loop sketch follows below.)
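A high-level sketch of the training loop with the greedy rollout baseline, assuming a `model` object that can sample tours, decode greedily, and apply a gradient step from the REINFORCE weights; `significantly_better` stands in for the significance test used to decide when to replace the baseline. All of these interfaces are assumptions of this sketch, not the paper's code.

```python
import copy
import numpy as np

def tour_lengths(instances, tours):
    """Lengths L(pi) of a batch of tours; instances are (n, 2) coordinate arrays."""
    lengths = []
    for coords, tour in zip(instances, tours):
        ordered = coords[np.asarray(tour)]
        diffs = np.roll(ordered, -1, axis=0) - ordered
        lengths.append(np.linalg.norm(diffs, axis=1).sum())
    return np.array(lengths)

def train(model, sample_instances, epochs, batches_per_epoch, significantly_better):
    """REINFORCE with a deterministic greedy rollout baseline (sketch).

    model.sample(instances) -> (tours, log_probs)   stochastic rollouts
    model.greedy(instances) -> tours                deterministic rollouts
    model.update(instances, tours, log_probs, weights) applies the gradient step
    """
    baseline_model = copy.deepcopy(model)            # frozen copy used as b(s)
    for epoch in range(epochs):
        for _ in range(batches_per_epoch):
            instances = sample_instances()
            tours, log_probs = model.sample(instances)          # pi ~ p_theta
            baseline_tours = baseline_model.greedy(instances)   # pi^BL
            # REINFORCE weights: L(pi) - L(pi^BL)
            weights = tour_lengths(instances, tours) - tour_lengths(instances, baseline_tours)
            model.update(instances, tours, log_probs, weights)
        # Replace the baseline if the trained model is significantly better
        if significantly_better(model, baseline_model):
            baseline_model = copy.deepcopy(model)
    return model
```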

Experiments

• Compare to a heuristic solver, non-learned baselines and learned heuristics (structure2vec, Pointer Network (PN), PN + RL).
[Results table: TSP20 results compared to the Pointer Network, evaluated on 10,000 instances. PN: Pointer Network, AM: Attention Model (proposed method).]

Generalization ability

Discussion
• Introduces a model and a training method that both contribute to significantly improved results for learned heuristics on TSP.
• Using attention instead of recurrence introduces invariance to the input order of the nodes, increasing learning efficiency.
• The multi-head attention mechanism allows nodes to communicate relevant information over different channels.