Neural Processes Family


About This Presentation

A summary of conditional neural processes, neural processes, attentive neural processes, Bayesian optimization with neural processes, and the connection between neural processes and Gaussian processes.


Slide Content

Neural Processes Family
Kota Matsui
RIKEN AIP Data Driven Biomedical Science Team
August 20, 2019

Table of contents
1.Conditional Neural Processes [Garnelo+ (ICML2018)]
2.Neural Processes [Garnelo+ (ICML2018WS)]
3.Attentive Neural Processes [Kim+ (ICLR2019)]
4.Meta-Learning surrogate models for sequential decision
making [Galashov+ (ICLR2019WS)]
5.On the Connection between Neural Processes and Gaussian
Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6.Conclusion
K. Matsui (RIKEN AIP) Neural Processes Family 1 / 60

Table of Contents
1.Conditional Neural Processes [Garnelo+ (ICML2018)]
2.Neural Processes [Garnelo+ (ICML2018WS)]
3.Attentive Neural Processes [Kim+ (ICLR2019)]
4.Meta-Learning surrogate models for sequential decision
making [Galashov+ (ICLR2019WS)]
5.On the Connection between Neural Processes and Gaussian
Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6.Conclusion
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes 2 / 60

Motivation
Neural Net vs Gaussian Processes
Neural Net (NN)
•Function approximation ability
•New functions are learned from scratch each time
•Uncertainty of functions can not be considered
Gaussian Processes (GP)
•Can use prior knowledge to quickly estimate the shape of
new function
•Can model uncertainty of functions
•Computationally expensive
•Hard to design prior distribution
Aim
Combine the benefits of NN and GP
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Introduction 3 / 60

Conditional Neural Processes (CNPs)

•A conditional distribution over functions trained to model the empirical conditional distributions of functions
•permutation invariant in training/test data
•scalable : running time complexity of O(n + m)
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Introduction 4 / 60

Stochastic Processes i
•observations : $O = \{(x_i, y_i)\}_{i=0}^{n-1} \subset \mathcal{X} \times \mathcal{Y}$
•targets : $T = \{x_i\}_{i=n}^{n+m-1}$
•generative model (stochastic process) :
  •$y_i = f(x_i)$, $f : \mathcal{X} \to \mathcal{Y}$ (noiseless case)
  •$f \sim P$ (prior process)
  •$P, \; P(f(T) \mid O, T)$ (predictive distribution)
Task
Predict the output values $f(x)$ for all $x \in T$ given $O$
Example 1 (Gaussian Processes)
$P = \mathcal{GP}(\mu(x), k(x, x'))$
⟹ predictive distribution : $f(x) \sim \mathcal{N}(\mu_n(x), \sigma_n^2(x))$ with
$\mu_n(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (y - m)$
$\sigma_n^2(x) = k(x, x) - k(x)^\top (K + \sigma^2 I)^{-1} k(x)$
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Model 5 / 60
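As a concrete illustration of the GP predictive equations on the slide above, here is a minimal NumPy sketch of posterior prediction with an RBF kernel. The kernel choice, zero prior mean, and noise level are assumptions made for this example, not part of the slides.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-||a - b||^2 / (2 l^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_predict(X_obs, y_obs, X_star, noise=1e-2):
    """Posterior mean mu_n(x) and variance sigma_n^2(x) of a zero-mean GP."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))     # K + sigma^2 I
    k_star = rbf_kernel(X_obs, X_star)                            # k(x) for every target x
    mean = k_star.T @ np.linalg.solve(K, y_obs)                   # mu_n(x)
    var = rbf_kernel(X_star, X_star).diagonal() \
        - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)     # sigma_n^2(x)
    return mean, var

# toy usage: n = 10 observations, m = 50 target locations
X_obs = np.linspace(-2, 2, 10)[:, None]
y_obs = np.sin(3 * X_obs[:, 0])
mu, var = gp_predict(X_obs, y_obs, np.linspace(-2, 2, 50)[:, None])
```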

Stochastic Processes ii
1D Gaussian process regression
Difficulties of ordinary SP approaches
1. It is difficult to design appropriate priors
2. GPs (a typical example) do not scale with the number of data points
→ $O((n+m)^3)$ computational cost is required
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Model 6 / 60

Conditional Neural Processes i
conditional stochastic process $Q_\theta(f(T) \mid O, T)$
Predictive ability of NNs + uncertainty modeling of SPs
Assumption 1
1. (permutation invariance)
$Q_\theta(f(T) \mid O, T) = Q_\theta(f(T') \mid O, T') = Q_\theta(f(T) \mid O', T)$
•$O', T'$ : permutations of $O, T$ respectively
2. (factorizability)
$Q_\theta(f(T) \mid O, T) = \prod_{x \in T} Q_\theta(f(x) \mid O, x)$
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Model 7 / 60

Conditional Neural Processes ii Architecture
[Figure: Observe → Aggregate → Predict pipeline with encoder $h$, aggregator $a$, and decoder $g$]
$r_i = h_\theta(x_i, y_i) \quad \forall (x_i, y_i) \in O$
$r = r_1 \oplus r_2 \oplus \dots \oplus r_{n-1} \oplus r_n$
$\phi_i = g_\theta(x_i, r) \quad \forall x_i \in T$
•$h_\theta, g_\theta$ : neural networks
•the aggregation is the mean : $r_1 \oplus r_2 \oplus \dots \oplus r_{n-1} \oplus r_n = \frac{1}{n} \sum_{i=1}^{n} r_i$
•$Q_\theta(f(x_i) \mid O, x_i) = Q(f(x_i) \mid \phi_i)$, with $\phi_i = (\mu_i, \sigma_i^2)$ parametrising $\mathcal{N}(\mu_i, \sigma_i^2)$
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Model 8 / 60
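The architecture above maps directly onto a few lines of code. The following is a minimal PyTorch sketch of the encoder, aggregator, and decoder; the layer widths and the softplus used to keep the predictive standard deviation positive are assumptions of this sketch, the slides only fix the structure r_i = h_θ(x_i, y_i), r = mean(r_i), φ_i = g_θ(x_i, r).

```python
import torch
import torch.nn as nn

class CNP(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        # encoder h_theta : (x_i, y_i) -> r_i
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))
        # decoder g_theta : (x_target, r) -> (mu, sigma)
        self.g = nn.Sequential(nn.Linear(x_dim + r_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, 2 * y_dim))

    def forward(self, x_ctx, y_ctx, x_tgt):
        # r_i = h_theta(x_i, y_i) for every context pair, aggregated by the mean
        r_i = self.h(torch.cat([x_ctx, y_ctx], dim=-1))               # (n_ctx, r_dim)
        r = r_i.mean(dim=0, keepdim=True).expand(x_tgt.size(0), -1)   # same r for all targets
        # phi_i = g_theta(x_i, r) = (mu_i, sigma_i^2)
        mu, raw_sigma = self.g(torch.cat([x_tgt, r], dim=-1)).chunk(2, dim=-1)
        sigma = 0.1 + 0.9 * torch.nn.functional.softplus(raw_sigma)   # positive std
        return mu, sigma
```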

Conditional Neural Processes ii Architecture
•The structure is very close to a VAE
→ $h$ and $a$ correspond to the VAE encoder, obtaining a latent representation $r$ from the input data
•Difference from a VAE, part 1 : the latent representation is learned from the output $y$ in addition to the input $x$
•Difference from a VAE, part 2 : the latent representation $r$ is not a random variable; it is determined by the sum (mean) of the per-datum representations $r_1, \dots, r_n$
•Difference 2, i.e. using latent representations computed independently for each data point, is the cause of the "completion that is not coherent across the whole image" issue explained later
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Model 9 / 60

Conditional Neural Processes iii Training
Optimization Problem
minimization of the negative conditional log probability
$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta)$
$\mathcal{L}(\theta) = -\mathbb{E}_{f \sim P}\left[\mathbb{E}_{N}\left[\log Q_\theta\left(\{y_i\}_{i=0}^{n-1} \mid O_N, \{x_i\}_{i=0}^{n-1}\right)\right]\right]$
•$f \sim P$ : prior process
•$N \sim \mathrm{Unif}(0, n-1)$
•$O_N = \{(x_i, y_i)\}_{i=0}^{N} \subset O$
practical implementation : gradient descent
1. sample $f$ and $N$
2. Monte Carlo estimate of the gradient of $\mathcal{L}(\theta)$
3. gradient descent with the estimated gradient
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural Processes Model 10 / 60
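A sketch of the training procedure listed above, reusing the CNP module from the previous snippet. The task sampler `sample_task` (standing in for drawing f ∼ P) and the optimizer settings are assumptions of this example.

```python
import torch

def train_cnp(model, sample_task, n_steps=10000, lr=1e-4):
    """Monte Carlo gradient descent on the negative conditional log-likelihood."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_steps):
        # 1. sample a function f ~ P and a random context size N
        x, y = sample_task()                           # all n points of one task (tensors)
        N = torch.randint(1, x.size(0), (1,)).item()
        x_ctx, y_ctx = x[:N], y[:N]                    # O_N = the first N observations
        # 2. Monte Carlo estimate of L(theta): score all n points given O_N
        mu, sigma = model(x_ctx, y_ctx, x)
        loss = -torch.distributions.Normal(mu, sigma).log_prob(y).mean()
        # 3. gradient step with the estimated gradient
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```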

Function Regression i Setting
Dataset
1. random samples from a GP with a fixed kernel and hyperparameters
2. random samples from a GP switching between two kernels
network architectures
•$h_\theta$ : 3-layer MLP with 128-dim outputs $r_i$, $i = 1, \dots, 128$
•aggregation : $r = \frac{1}{128} \sum_{i=1}^{128} r_i$
•$g_\theta$ : 5-layer MLP, $g_\theta(x_i, r) = (\mu_i, \sigma_i^2)$ (mean and variance of a Gaussian)
•Adam (optimizer)
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results11 / 60
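One possible way to generate the training data described in the setting above is to sample a fresh function from a GP prior at every step, with kernel hyperparameters drawn at random, and split it into context and target points. The sampling ranges below are assumptions for illustration.

```python
import numpy as np

def sample_gp_task(n_points=100, x_range=(-2.0, 2.0), random_hypers=True):
    """Draw one function from a GP prior with an RBF kernel and return (x, y)."""
    lengthscale = np.random.uniform(0.1, 1.0) if random_hypers else 0.4
    variance = np.random.uniform(0.1, 1.0) if random_hypers else 1.0
    x = np.random.uniform(*x_range, size=(n_points, 1))
    K = variance * np.exp(-0.5 * (x - x.T) ** 2 / lengthscale**2) + 1e-6 * np.eye(n_points)
    y = np.random.multivariate_normal(np.zeros(n_points), K)[:, None]
    return x.astype(np.float32), y.astype(np.float32)

# split one sampled task into a context set and a target set
x, y = sample_gp_task()
n_ctx = np.random.randint(3, 50)
x_ctx, y_ctx, x_tgt, y_tgt = x[:n_ctx], y[:n_ctx], x, y
```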

Function Regression ii Results











K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results12 / 60

Image Completion i Setting
Dataset
1. MNIST ($f : [0,1]^2 \to [0,1]$)
Complete the entire image from a small number of observations
2. CelebA ($f : [0,1]^2 \to [0,1]^3$)
Complete the entire image from a small number of observations
network architectures
•the same model architecture as for 1D function regression, except for
  •input layer : 2D pixel coordinates normalized to $[0,1]^2$
  •output layer : color intensity of the corresponding pixel
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results13 / 60
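A small sketch of how image completion is cast as the 2D regression problem above: normalized pixel coordinates in [0,1]^2 are the inputs and pixel intensities the outputs. The context size and the random image used in the usage lines are assumptions.

```python
import numpy as np

def image_to_regression(img, n_context=100):
    """Turn an (H, W) grayscale image into context / target sets for a CNP.

    Inputs are pixel coordinates scaled to [0, 1]^2, outputs are intensities in [0, 1].
    """
    H, W = img.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([rows.ravel() / (H - 1), cols.ravel() / (W - 1)], axis=1)  # (H*W, 2)
    values = img.reshape(-1, 1).astype(np.float32)                               # (H*W, 1)
    idx = np.random.choice(H * W, size=n_context, replace=False)
    # context: a few observed pixels; target: every pixel of the image
    return coords[idx], values[idx], coords, values

# usage with a random "image"
img = np.random.rand(28, 28)
x_ctx, y_ctx, x_tgt, y_tgt = image_to_regression(img, n_context=50)
```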

Image Completion ii Results




•with 1 (non-informative) observation point
→ the prediction corresponds to the average over all digits
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results14 / 60

Image Completion ii Results
[Figure: image completion results]

            Random Context             Ordered Context
# context   10      100     1000       10      100     1000
kNN         0.215   0.052   0.007      0.370   0.273   0.007
GP          0.247   0.137   0.001      0.257   0.220   0.002
CNP         0.039   0.016   0.009      0.057   0.047   0.021
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results15 / 60

Image Completion iii Latent Variable Model
Original CNPs
•The model returns factored outputs (sample-wise independent modeling)
→ the best prediction with limited data points is to average over all possible predictions
•It cannot sample different coherent images of all the possible digits conditioned on the observations
(GPs can do this thanks to the kernel function)
•By adding latent variables, CNPs can maintain this property
The latent variable model of CNPs is the same as the Neural Processes described later
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results16 / 60

Image Completion iv Latent Variable Model
[Figure: completions sampled from the latent variable model]
$z \sim \mathcal{N}(\mu, \sigma^2)$
$r = (\mu, \sigma^2) = h_\theta(X, Y)$
$\phi_i = (\mu_i, \sigma_i^2) = g_\theta(x_i, z)$
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results17 / 60

Classification i Settings
Dataset
•Omniglot
•1,623 classes of characters from 50 different alphabets
•suitable for few-shot learning
•N-way classification task
•N classes are randomly chosen at each training step
network architectures
•encoder h : includes convolution layers
•aggregation r : class-wise aggregation & concatenation
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results18 / 60

Classification ii Results
[Figure: CNP for few-shot classification. Context images from classes A–E are encoded by h, aggregated class-wise into r_1, ..., r_5, concatenated into r, and the decoder g predicts the class of each target image]

         5-way Acc           20-way Acc          Runtime
         1-shot   5-shot     1-shot   5-shot
MANN     82.8%    94.9%      -        -           O(nm)
MN       98.1%    98.9%      93.8%    98.5%       O(nm)
CNP      95.3%    98.5%      89.9%    96.8%       O(n+m)
K. Matsui (RIKEN AIP) Neural Processes Family Conditional Neural ProcessesExperimental Results19 / 60
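A rough sketch of the class-wise aggregation described above for N-way few-shot classification: context examples are encoded together with their one-hot labels, averaged per class, and the per-class representations are concatenated into r before decoding. The convolutional encoder, 28x28 input size, and layer widths are assumptions of this sketch, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class CNPClassifier(nn.Module):
    def __init__(self, n_way=5, r_dim=64):
        super().__init__()
        self.n_way = n_way
        # shared image feature extractor (assumes 28x28 single-channel inputs)
        self.conv = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
                                  nn.Flatten())
        # encoder h : image features + one-hot label -> per-example representation
        self.h = nn.Linear(32 * 7 * 7 + n_way, r_dim)
        # decoder g : target features + concatenated class representations -> logits
        self.g = nn.Linear(32 * 7 * 7 + n_way * r_dim, n_way)

    def forward(self, x_ctx, y_ctx, x_tgt):
        # y_ctx : one-hot labels of the context images, shape (n_ctx, n_way)
        feats = self.conv(x_ctx)
        r_i = self.h(torch.cat([feats, y_ctx], dim=-1))           # (n_ctx, r_dim)
        # class-wise aggregation: mean representation per class, then concatenate
        counts = y_ctx.sum(dim=0).clamp(min=1).unsqueeze(-1)      # (n_way, 1)
        r_class = (y_ctx.t() @ r_i) / counts                      # (n_way, r_dim)
        r = r_class.reshape(1, -1).expand(x_tgt.size(0), -1)
        return self.g(torch.cat([self.conv(x_tgt), r], dim=-1))   # class logits
```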

Table of Contents
1.Conditional Neural Processes [Garnelo+ (ICML2018)]
2.Neural Processes [Garnelo+ (ICML2018WS)]
3.Attentive Neural Processes [Kim+ (ICLR2019)]
4.Meta-Learning surrogate models for sequential decision
making [Galashov+ (ICLR2019WS)]
5.On the Connection between Neural Processes and Gaussian
Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6.Conclusion
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes 20 / 60

Generative Model i
Assumption 2
1. (Exchangeability)
The distribution is invariant to permutations of the inputs $x$ and the outputs $y$ :
$\rho_{x_{1:n}}(y_{1:n}) = \rho_{\pi(x_{1:n})}(\pi(y_{1:n}))$
2. (Consistency)
The distribution of a sequence $D_m = \{(x_i, y_i)\}_{i=1}^{m}$ equals the one obtained from a longer sequence containing it by marginalizing out everything except $D_m$ :
$\rho_{x_{1:m}}(y_{1:m}) = \int \rho_{x_{1:n}}(y_{1:n}) \, dy_{m+1:n}$
3. (Decomposability)
The observation model factorizes independently :
$p(y_{1:n} \mid f, x_{1:n}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2)$
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes Model 21 / 60

Generative Model ii
The posterior distribution of the observations when $f$ is a sample from a stochastic process :
$\rho_{x_{1:n}}(y_{1:n}) = \int p(y_{1:n} \mid f, x_{1:n})\, p(f) \, df = \int \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2)\, p(f) \, df$
When $f$ is modeled by a neural network $g(x, z)$ with a latent variable $z$, the generative model becomes
$p(z, y_{1:n} \mid x_{1:n}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid g(x_i, z), \sigma^2)\, p(z)$
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes Model 22 / 60

Evidence Lower-Bound (ELBO)
Writing $q(z \mid x_{1:n}, y_{1:n})$ for the variational posterior of the latent variable $z$, the ELBO is
$\log p(y_{1:n} \mid x_{1:n}) \geq \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})}\left[\sum_{i=1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z)}{q(z \mid x_{1:n}, y_{1:n})}\right]$
At prediction time in particular, splitting the data into observed and test points,
$\log p(y_{m+1:n} \mid x_{1:m}, x_{m+1:n}, y_{1:m})$
$\geq \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
$\approx \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{q(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
The reason $p$ is approximated by $q$ is to avoid the $O(m^3)$ cost of computing the distribution conditioned on the observed data.
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes Model 23 / 60

Architectures
[Figure: contexts $(x_1, y_1), (x_2, y_2), (x_3, y_3)$ are encoded by $h_\theta$ into $r_1, r_2, r_3$ and aggregated by $a$ into $r$; the latent $z$ is drawn given $r$; the decoder $g_\theta$ maps targets $x_4, x_5, x_6$ and $z$ to predictions $\hat{y}_4, \hat{y}_5, \hat{y}_6$]
$z \sim \mathcal{N}(\mu(r), \sigma^2(r) I)$
$g_\theta(x_i) = P(y \mid z, x_i)$
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes Model 24 / 60
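A minimal PyTorch sketch of the latent-only NP shown above, together with the ELBO from the previous slides: the encoder aggregates the context into r, maps it to a Gaussian over the global latent z, and the decoder maps (x_i, z) to a predictive Gaussian. Layer widths and the single-sample reparameterised ELBO estimate are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class NeuralProcess(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128, z_dim=64):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, r_dim))
        self.to_mu = nn.Linear(r_dim, z_dim)
        self.to_logvar = nn.Linear(r_dim, z_dim)
        self.g = nn.Sequential(nn.Linear(x_dim + z_dim, r_dim), nn.ReLU(),
                               nn.Linear(r_dim, 2 * y_dim))

    def q_z(self, x, y):
        """q(z | x, y): encode each pair, aggregate by the mean, map to a Gaussian over z."""
        r = self.h(torch.cat([x, y], dim=-1)).mean(dim=0)
        return Normal(self.to_mu(r), torch.exp(0.5 * self.to_logvar(r)))

    def decode(self, x_tgt, z):
        """p(y | z, x): Gaussian likelihood for each target input."""
        out = self.g(torch.cat([x_tgt, z.expand(x_tgt.size(0), -1)], dim=-1))
        mu, raw_sigma = out.chunk(2, dim=-1)
        return Normal(mu, 0.1 + 0.9 * torch.nn.functional.softplus(raw_sigma))

    def elbo(self, x_ctx, y_ctx, x_tgt, y_tgt):
        """E_q(z|target)[log p(y_T | z, x_T)] - KL(q(z|target) || q(z|context))."""
        q_tgt, q_ctx = self.q_z(x_tgt, y_tgt), self.q_z(x_ctx, y_ctx)
        z = q_tgt.rsample()                                    # one reparameterised sample
        log_lik = self.decode(x_tgt, z).log_prob(y_tgt).sum()
        return log_lik - kl_divergence(q_tgt, q_ctx).sum()
```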

Comparing Architectures : VAE, CNPs&NPs
[Figure: three encoder/decoder diagrams]
•VAE : encoder $q_\phi(z \mid X)$, prior $z \sim \mathcal{N}(0, I)$, decoder $p_\theta(X \mid z)$ reconstructs $\hat{X}$
•CNPs : per-pair encoder $h_\theta(x_i, y_i)$, aggregation $r = \sum_{i=1}^{n} r_i$, decoder $g_\theta(Y \mid \hat{X}, r)$
•NPs : per-pair encoder $h_\theta(x_i, y_i)$, aggregation $r = \sum_{i=1}^{n} r_i$, latent $z \sim \mathcal{N}(\mu(r), \sigma^2(r) I)$, decoder $g_\theta(Y \mid \hat{X}, z)$
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes Model 25 / 60

Black-Box Optimization with Thompson Sampling
[Figure: Thompson sampling on a 1D objective with a Neural Process surrogate]

               Neural process   Gaussian process   Random Search
               0.26             0.14               1.00
(values normalized so that Random Search = 1.00)
K. Matsui (RIKEN AIP) Neural Processes Family Neural Processes Experiments 26 / 60
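A sketch of how Thompson sampling can be driven by an NP surrogate, reusing the hypothetical NeuralProcess module from the previous snippet: draw one z from the posterior given the evaluations so far, treat the decoder mean under that z as one realisation of the objective, and evaluate the black box at its minimiser over a candidate set. This is an illustrative loop, not the exact procedure used in the paper.

```python
import torch

def thompson_step(np_model, x_obs, y_obs, candidates, objective):
    """One Thompson-sampling iteration with an NP surrogate (minimisation)."""
    with torch.no_grad():
        q_ctx = np_model.q_z(x_obs, y_obs)      # posterior over the global latent z
        z = q_ctx.sample()                      # one realisation of the random function
        pred = np_model.decode(candidates, z).mean
        x_next = candidates[pred.argmin()].unsqueeze(0)
    y_next = objective(x_next)                  # evaluate the black-box function
    return torch.cat([x_obs, x_next]), torch.cat([y_obs, y_next])

# usage sketch (grid of candidate inputs, T iterations):
# for t in range(T):
#     x_obs, y_obs = thompson_step(model, x_obs, y_obs, grid, objective)
```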

Table of Contents
1.Conditional Neural Processes [Garnelo+ (ICML2018)]
2.Neural Processes [Garnelo+ (ICML2018WS)]
3.Attentive Neural Processes [Kim+ (ICLR2019)]
4.Meta-Learning surrogate models for sequential decision
making [Galashov+ (ICLR2019WS)]
5.On the Connection between Neural Processes and Gaussian
Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6.Conclusion
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes 27 / 60

Recall : Neural Processes
[Figure: NP encoder with a deterministic path ($r_1, \dots, r_n \to$ mean $\to r_C$) and a latent path ($s_1, \dots, s_n \to$ mean $\to s_C \to z$); the decoder MLP maps $(x^*, r_C, z)$ to the prediction $y^*$]
•Version that incorporates both the latent representation $r$ and the latent variable $z$ into the model
•Trained with the ELBO as the objective :
$\log p(y_T \mid x_T, x_C, y_C) \geq \mathbb{E}_{q(z \mid s_T)}[\log p(y_T \mid x_T, r_C, z)] - D_{\mathrm{KL}}(q(z \mid s_T) \,\|\, q(z \mid s_C))$
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Introduction 28 / 60

Motivation i
The original NP tends to underfit the context set
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Introduction 29 / 60

Motivation ii
Hypothesis about the cause of the underfitting :
the operation that averages the latent representations of the inputs is the bottleneck
[Figure: encoder diagram, $r_i = h_\theta(x_i, y_i)$ aggregated by $a$ into a single $r$; every context point receives the same weight]
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Introduction 30 / 60

Contribution
Key observation : why doesn't GP regression underfit?
In GP regression, the kernel function measures the similarity between two points
→ it indicates which observations $(x_i, y_i)$ are important for the prediction at $x^*$
•if $x_i$ is close to $x^*$, the corresponding prediction $y^*$ is expected to be close to $y_i$
Contribution : Attentive Neural Processes (ANPs)
•implement the above property in NPs via (differentiable) attention
•at the same time, permutation invariance with respect to the observations is preserved
•performance is evaluated on 1D regression and 2D image completion problems
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Introduction 31 / 60

Attention
Notation
•key-value pairs $(x_i, r_i)$
  •key $x_i$ : input vector
  •value $r_i$ : latent representation of the observation $(x_i, y_i)$ (output of the encoder)
•query $x^*$
attention mechanism
1. compute the weight $\alpha_i$ of $x_i$ with respect to $x^*$
2. take the weighted sum $r^* = \sum_{i=1}^{n} \alpha_i r_i$ as the value for $x^*$
$r^*$ does not depend on the ordering of $(x_i, r_i)$ (permutation invariance)
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Attention 32 / 60

Attention : Examples
Laplace
$W_i = \mathrm{softmax}\left(\{-\|Q_i - X_j\|_1\}_{j=1}^{n}\right) \in \mathbb{R}^{n}$
$\mathrm{Laplace}(Q, X, R) := W R \in \mathbb{R}^{m \times d_v}$
DotProduct
$\mathrm{DotProduct}(Q, X, R) := \mathrm{softmax}\left(\frac{1}{\sqrt{d_k}} Q X^\top\right) R \in \mathbb{R}^{m \times d_v}$
MultiHead
$\mathrm{MultiHead}(Q, X, R) := \mathrm{concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W \in \mathbb{R}^{m \times d_v}$
$\mathrm{head}_h = \mathrm{DotProduct}(\mathrm{linear}(Q), \mathrm{linear}(X), \mathrm{linear}(R))$
•design matrix $X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times d_k}$
•corresponding latent representation matrix $R = (r_1, \dots, r_n)^\top \in \mathbb{R}^{n \times d_v}$
•query matrix $Q = (x^*_1, \dots, x^*_m)^\top \in \mathbb{R}^{m \times d_k}$
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Attention 33 / 60
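A small NumPy sketch of the Laplace and (single-head) dot-product attention defined above, with the contexts as keys/values and the targets as queries; the shapes follow the slide (Q: m x d_k, X: n x d_k, R: n x d_v). The toy dimensions in the usage lines are arbitrary.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def laplace_attention(Q, X, R):
    """W_ij = softmax_j(-||Q_i - X_j||_1); output = W R, shape (m, d_v)."""
    l1 = np.abs(Q[:, None, :] - X[None, :, :]).sum(-1)   # (m, n) pairwise L1 distances
    return softmax(-l1, axis=-1) @ R

def dot_product_attention(Q, X, R):
    """softmax(Q X^T / sqrt(d_k)) R, shape (m, d_v)."""
    return softmax(Q @ X.T / np.sqrt(Q.shape[-1]), axis=-1) @ R

# toy usage; uniform weights instead of these would recover the plain NP mean aggregation
n, m, d_k, d_v = 10, 5, 2, 16
X, R, Q = np.random.randn(n, d_k), np.random.randn(n, d_v), np.random.randn(m, d_k)
r_star = laplace_attention(Q, X, R)            # query-specific representations
```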

Attentive Neural Processes : architectures
[Figure: ANP encoder/decoder. In the deterministic path, each context pair $(x_i, y_i)$ passes through an MLP and self-attention to give $r_i$; a cross-attention block with the context $x$'s as keys, the $r_i$'s as values, and the target $x^*$ as query produces $r^*$. In the latent path, the context representations $s_i$ pass through self-attention, are averaged, and parametrise $z$. The decoder MLP maps $(x^*, r^*, z)$ to the prediction $y^*$]
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Attention 34 / 60

Attentive Neural Processes : interpretation
Computing the latent representations with self-attention
•models the interactions between observations (corresponding to the kernel-based similarity computation between observations in GP regression)
•if many observations overlap, the model can place a large weight on one or a few of them
Computing query-specific latent representations with cross-attention
•the part that lets each query point attend more closely to the observations considered important for its prediction
•to preserve the global latent (the global structure of the stochastic process it induces), no attention is put in the latent path
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes ANPs 35 / 60

Attentive Neural Processes : Remarks
•Even with self-attention and cross-attention, permutation invariance with respect to the observations is preserved
•With uniform attention (every observation receives the same weight), the model reduces to the original NP
•Trained by maximizing the same ELBO as the original NP :
$\log p(y_T \mid x_T, x_C, y_C) \geq \mathbb{E}_{q(z \mid s_T)}[\log p(y_T \mid x_T, r^*, z)] - D_{\mathrm{KL}}(q(z \mid s_T) \,\|\, q(z \mid s_C))$
•$r^* = r^*(x_C, y_C, x_T)$ : output of the cross-attention (latent representation)
•Because of the added attention computation (weights for each observation), the computational complexity at prediction time increases from $O(n+m)$ to $O(n(n+m))$
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes ANPs 36 / 60

Experiment 1 : 1D Function regression on synthetic GP data
[Slide annotations on Figure 3: reconstruction error on the context points and negative log likelihood on the targets, plotted against training iterations and wall clock time]
Published as a conference paper at ICLR 2019
Figure 3: Qualitative and quantitative results of different attention mechanisms for 1D GP function
regression with random kernel hyperparameters.Left: moving average of context reconstruction
error (top) and target negative log likelihood (NLL) given contexts (bottom) plotted against training
iterations (left) and wall clock time (right).ddenotes the bottleneck size i.e. hidden layer size of all
MLPs and the dimensionality ofrandz.Right: predictive mean and variance of different attention
mechanisms given the same context. Best viewed in colour.
1D Function regression on synthetic GP data. We first explore the (A)NPs trained on data that is generated from a Gaussian Process with a squared-exponential kernel and small likelihood noise.
We emphasise that (A)NPs need not be trained on GP data or data generated from a known stochastic
process, and this is just an illustrative example. We explore two settings: one where the hyperpa-
rameters of the kernel are fixed throughout training, and another where they vary randomly at each
training iteration. The number of contexts (n) and number of targets (m) are chosen randomly at
each iteration (n ∼ U[3,100], m ∼ n + U[0, 100−n]). Each x-value is drawn uniformly at random in [−2,2]. For this simple 1D data, we do not use self-attention and just explore the use of
cross-attention in the deterministic path (c.f. Figure 2). Thus we use the same encoder/decoder ar-
chitecture for NP and ANP, except for the cross-attention. See Appendix B for experimental details.
Figure 3 (left) shows context reconstruction error $-\frac{1}{|C|}\sum_{i \in C} \mathbb{E}_{q(z \mid s_C)}\left[\log p(y_i \mid x_i, r^{*}(x_C, y_C, x_i), z)\right]$ and NLL of targets given contexts $-\frac{1}{|T|}\sum_{i \in T} \mathbb{E}_{q(z \mid s_C)}\left[\log p(y_i \mid x_i, r^{*}(x_C, y_C, x_i), z)\right]$ for the different attention mechanisms, trained on a GP with random kernel hyperparameters. ANP shows
a much more rapid decrease in reconstruction error and lower values at convergence compared to
the NP, especially for dot product and multihead attention. This holds not only against training
iteration but also against wall clock time, so learning is fast despite the added computational cost
of attention. The right column plots show that the computation times of Laplace and dot-product
ANP are similar to the NP for the same value ofd, and multihead ANP takes around twice the time.
We also show how the size of the bottleneck (d) in the deterministic and latent paths of the NP
affects the underfitting behaviour of NPs. The figure shows that raisingddoes help achieve better
reconstructions, but there appears to be a limit in how much reconstructions can improve. Beyond a
certain value ofd, the learning for the NP becomes too slow, and the value of reconstruction error at
convergence is still higher than that achieved by multihead ANP with 10% of the wall-clock time.
Hence using ANPs has significant benefits over simply raising the bottleneck size in NPs.
In Figure 3 (right) we visualise the learned conditional distribution for a qualitative comparison of
the attention mechanisms. The context is drawn from the GP with the hyperparameter values that
give the most fluctuation. Note that the predictive mean of the NP underfits the context, and tries to
explain the data by learning a large likelihood noise. Laplace shows similar behaviour, whereas dot-
product attention gives predictive means that accurately predict almost all context points. Note that
Laplace attention is parameter-free (keys and queries are the x-coordinates) whereas for dot-product
attention we have set the keys and queries to be parameterised representations of the x-values (output
of learned MLP that takes x-coordinates as inputs). So the dot-product similarities are computed in
a learned representation space, whereas for Laplace attention the similarities are computed based on
L1 distance in the x-coordinate domain, hence it is expected that dot-product attention outperforms
Laplace attention. However dot-product attention displays non-smooth predictions, shown more
Code is available at https://github.com/deepmind/neural-processes/blob/master/attentive_neural_process.ipynb
d = hidden layer size of all MLPs = dimensionality of r and z
•1D functions are sampled from a GP with random kernel hyperparameters
•the numbers of context points n and target points m are resampled at every iteration
•the ANPs here use only cross-attention (no self-attention)
•plotted quantities : context reconstruction error and target NLL given the contexts (formulas above)
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Experiments 37 / 60

Experiment 1 : 1D Function regression on synthetic GP data
Figure 1 [Kim+ (ICLR2019)]: Comparison of predictions given by a fully trained NP and Attentive NP (ANP) in 1D function regression (left) / 2D image regression (right). The contexts (crosses / top-half pixels) are used to predict the target outputs (y-values of all x ∈ [−2,2] / all pixels in the image). The ANP predictions are noticeably more accurate than for NP at the context points.
provide relevant information for a given target prediction. In theory, increasing the dimensionality
of the representation could address this issue, but we show in Section 4 that in practice, this is not
sufficient.
To address this issue, we draw inspiration from GPs, which also define a family of conditional
distributions for regression. In GPs, the kernel can be interpreted as a measure of similarity among
two points in the input domain, and shows which context points (x_i, y_i) are relevant for a given query x*. Hence when x* is close to some x_i, its y-value prediction y* is necessarily close to y_i (assuming small likelihood noise), and there is no risk of underfitting. We implement a similar
mechanism in NPs using differentiable attention that learns to attend to the contexts relevant to the
given target, while preserving the permutation invariance in the contexts. We evaluate the resulting
Attentive Neural Processes(ANPs) on 1D function regression and on 2D image regression. Our
results show that ANPs greatly improve upon NPs in terms of reconstruction of contexts as well as
speed of training, both against iterations and wall clock time. We also demonstrate that ANPs show
enhanced expressiveness relative to the NP and is able to model a wider range of functions.
2 BACKGROUND
2.1 NEURAL PROCESSES
The NP is a model for regression functions that map an input $x_i \in \mathbb{R}^{d_x}$ to an output $y_i \in \mathbb{R}^{d_y}$. In
particular, the NP defines a (infinite) family of conditional distributions, where one may condition
on an arbitrary number of observed contexts $(x_C, y_C) := (x_i, y_i)_{i \in C}$ to model an arbitrary number of targets $(x_T, y_T) := (x_i, y_i)_{i \in T}$ in a way that is invariant to the ordering of the contexts and the ordering of the targets. The model is defined for arbitrary C and T but in practice we use $C \subset T$. The
deterministic NP models these conditional distributions as:
$p(y_T \mid x_T, x_C, y_C) := p(y_T \mid x_T, r_C)$ (1)
with $r_C := r(x_C, y_C) \in \mathbb{R}^d$, where r is a deterministic function that aggregates $(x_C, y_C)$ into a finite dimensional representation with permutation invariance in C. In practice, each context (x, y) pair is passed through an MLP to form a representation of each pair, and these are aggregated by taking the mean to form $r_C$. The likelihood $p(y_T \mid x_T, r_C)$ is modelled by a Gaussian factorised across the targets $(x_i, y_i)_{i \in T}$ with mean and variance given by passing $x_i$ and $r_C$ through an MLP. The unconditional distribution $p(y_T \mid x_T)$ (when $C = \emptyset$) is defined by letting $r_\emptyset$ be a fixed vector.
The latent variable version of the NP model includes a global latent z to account for uncertainty in the predictions of $y_T$ for a given observed $(x_C, y_C)$. It is incorporated into the model via a latent path that complements the deterministic path described above. Here z is modelled by a factorised Gaussian parametrised by $s_C := s(x_C, y_C)$, with s being a function with the same properties as r:
$p(y_T \mid x_T, x_C, y_C) := \int p(y_T \mid x_T, r_C, z)\, q(z \mid s_C)\, dz$ (2)
with $q(z \mid s_\emptyset) := p(z)$, the prior on z. The likelihood is referred to as the decoder, and q, r, s form the encoder. See Figure 2 for diagrams of these models.
The motivation for having a global latent is to model different realisations of the data generating
stochastic process — each sample ofzwould correspond to one realisation of the stochastic process.
One can define the model using either just the deterministic path, just the latent path, or both.
NP :
•inaccurate predictive means
•overestimated variances at the input locations
ANP (multihead attention) :
•predictions are noticeably more accurate at the context points
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Experiments 38 / 60

Experiment 1 : 1D Function regression on synthetic GP data
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Experiments 39 / 60

Experiment 2 : 2D Function regression on image data
•the predictive distribution is modeled by the decoder as $p(y_T \mid x_T, r_C, z)$
K. Matsui (RIKEN AIP) Neural Processes Family Attentive Neural Processes Experiments 40 / 60

NPs as Meta-Learning Model
Use neural processes as a model M because
1.statistical efficiency
Accurate predictions of function values based on small
numbers of evaluations
2.calibrated uncertainties
balance exploration and exploitation
3.O(n+m)computational complexity
4.non-parametric modeling
→ not necessary to set hyperparameters such as the learning rate and update frequency, as in MAML
K. Matsui (RIKEN AIP) Neural Processes Family BO via Neural Processes 45 / 60

Experiments : Bayesian Optimization via NPs
Adversarial task search for RL agents[Ruderman+ (2018)]
•Search Problem of adversarially designed 3D Maze
•trivially solvable by human players
•But RL agents will catastrophically fail
•Notation
•$f_A$ : given an agent, the mapping from task parameters to its performance $r$
•parameters of the task
  •$M$ : maze layout
  •$p_s, p_g$ : start and goal positions
Problem setup
1. Position search : $(p_s^{*}, p_g^{*}) = \arg\min_{p_s, p_g} f_A(M, p_s, p_g)$
2. Full maze search : $(M^{*}, p_s^{*}, p_g^{*}) = \arg\min_{M, p_s, p_g} f_A(M, p_s, p_g)$
K. Matsui (RIKEN AIP) Neural Processes Family BO via Neural Processes 46 / 60

Experiments : Bayesian Optimization via NPs
(a) Position search results   (b) Full maze search results
Figure 2: Bayesian Optimisation results. Left: Position search. Right: Full maze search. We report the minimum up to iteration t (scaled in [0,1]) as a function of the number of iterations. Bold
lines show the mean performance over 4 unseen agents on a set of held-out mazes. We also show
20% of the standard deviation.Baselines:GP: Gaussian Process (with a linear and Matern 3/2
product kernel [Bonilla et al., 2008]), BBB: Bayes by Backprop [Blundell et al., 2015], AlphaDiv:
AlphaDivergence [Hernández-Lobato et al., 2016], DKL: Deep Kernel Learning [Wilson et al., 2016].
K. Matsui (RIKEN AIP) Neural Processes Family BO via Neural Processes 47 / 60

Table of Contents
1.Conditional Neural Processes [Garnelo+ (ICML2018)]
2.Neural Processes [Garnelo+ (ICML2018WS)]
3.Attentive Neural Processes [Kim+ (ICLR2019)]
4.Meta-Learning surrogate models for sequential decision
making [Galashov+ (ICLR2019WS)]
5.On the Connection between Neural Processes and Gaussian
Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6.Conclusion
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GP 48 / 60

Gaussian Processes with Deep Kernels i
Notation
•x1:n,y1:n:観測点
•f:R
p
!R:真関数
•GP model : p(fjx1:n) =N(m;K)
p(y1:njf) =N(f;
1
I)
ここで,f= (f(x1); :::; f(xn)),m= (m(x1); :::; m(xn)),
Kij=k(xi;xj)
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GPGP with Deep Kernels50 / 60

Gaussian Processes with Deep Kernels ii
Definition 1 (deep kernel [Tsuda+ (2002)])
$k(x_i, x_j) := \frac{1}{d} \sum_{l, l'=1}^{d} \sigma\left(w_l^\top x_i + b_l\right) \Sigma_{l l'}\, \sigma\left(w_{l'}^\top x_j + b_{l'}\right)$
•$\sigma\left(w_l^\top x_i + b_l\right)$ is a one-layer NN, $w, b$ are model parameters, and $\sigma(\cdot)$ is an activation function
•$\Sigma = (\Sigma_{l l'})_{l, l'=1}^{d}$ is a positive semi-definite matrix
Matrix notation
Putting $\phi_i := \phi(x_i; W, b) = \frac{1}{\sqrt{d}}\, \sigma(W^\top x_i + b) \in \mathbb{R}^d$ and $\Phi = [\phi_1, \dots, \phi_n]$, we have $k(X, X) = \Phi^\top \Sigma \Phi$.
In the following, the mean function of the GP is assumed to be of the form
$m(X) = \Phi^\top \mu, \quad \mu \in \mathbb{R}^d$
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GPGP with Deep Kernels51 / 60
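A NumPy sketch of the deep kernel in its matrix form above: one-layer features φ(x) = σ(W^T x + b) / sqrt(d) stacked into Φ, and the induced Gram matrix K = Φ^T Σ Φ. The tanh activation, random parameters, and Σ = I are assumptions made for this example.

```python
import numpy as np

def deep_kernel_features(X, W, b):
    """phi_i = (1/sqrt(d)) * sigma(W^T x_i + b), stacked as the columns of Phi (d x n)."""
    d = W.shape[1]
    return np.tanh(X @ W + b).T / np.sqrt(d)

def deep_kernel_gram(X, W, b, Sigma):
    """K = Phi^T Sigma Phi: the deep kernel evaluated on all pairs of rows of X."""
    Phi = deep_kernel_features(X, W, b)
    return Phi.T @ Sigma @ Phi                  # (n, n)

# toy usage: p-dimensional inputs, d features, Sigma = I (positive semi-definite)
p, d, n = 3, 50, 20
W, b = np.random.randn(p, d), np.random.randn(d)
X = np.random.randn(n, p)
K = deep_kernel_gram(X, W, b, np.eye(d))
```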

Gaussian Processes with Deep Kernels iii
The evidence (marginal likelihood) obtained by integrating out the latent function :
$p(y \mid X) = \int p(y, f \mid X) \, df = \int p(y \mid f)\, p(f \mid X) \, df = \mathcal{N}\left(\Phi^\top \mu, \; \Phi^\top \Sigma \Phi + \beta^{-1} I_n\right)$
To relate this to the generative model of NPs, introduce a latent variable
$z \sim \mathcal{N}(\mu, \Sigma)$
Then the evidence above is also obtained by marginalizing over $z$ :
$p(y \mid X) = \int p(y \mid X, z)\, p(z) \, dz = \int \mathcal{N}\left(\Phi^\top z, \; \beta^{-1} I_n\right) \mathcal{N}(\mu, \Sigma) \, dz = \mathcal{N}\left(\Phi^\top \mu, \; \Phi^\top \Sigma \Phi + \beta^{-1} I_n\right)$
In particular, when $z \sim \mathcal{N}(0, I_d)$ we get $p(y \mid X) = \mathcal{N}(0, \; \Phi^\top \Phi + \beta^{-1} I_n)$
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GPGP with Deep Kernels52 / 60

Gaussian Processes with Deep Kernels iv
Computing the evidence $p(y \mid X)$ requires inverting the covariance matrix
$\Phi^\top \Sigma \Phi + \beta^{-1} I_n$
which costs $O(n^3)$
→ replace it with the evidence lower bound (ELBO) to reduce the cost
$\log p(Y \mid X) \geq \mathbb{E}_{q(z \mid X)}\left[\log p(Y \mid z, X)\right] - \mathrm{KL}\left(q(z \mid X) \,\|\, p(z)\right)$
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GPGP with Deep Kernels53 / 60

Matching of the ELBOs at prediction time i
Write the evidence lower bound of deep kernel GPs with the observed data and the test data explicitly separated ($C = 1\!:\!m$ and $T = m\!+\!1\!:\!n$ denote the observed data and the test data, respectively) :
$\log p(Y_T \mid X_T, X_C, Y_C) \geq \mathbb{E}_{q(z \mid X_T, Y_T)}\left[\log p(Y_T \mid z, X_T)\right] - \mathrm{KL}\left(q(z \mid X_T, Y_T) \,\|\, p(z \mid X_C, Y_C)\right)$
Here $p(z \mid X_C, Y_C)$ is a "data-driven" prior determined from the observed data $X_C, Y_C$ :
$p(z \mid X_C, Y_C) = \mathcal{N}(\mu(X_C, Y_C), \Sigma(X_C, Y_C))$
As in NPs, this is approximated by the variational posterior :
$p(z \mid X_C, Y_C) \approx q(z \mid X_C, Y_C)$
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GP NP as GP 54 / 60

Matching of the ELBOs at prediction time ii
Under the above, the ELBO of the deep kernel GP is
$\mathbb{E}_{q(z \mid X_T, Y_T)}\left[\log p(Y_T \mid z, X_T)\right] - \mathrm{KL}\left(q(z \mid X_T, Y_T) \,\|\, q(z \mid X_C, Y_C)\right)$
On the other hand, the ELBO of NPs is
$\log p(y_{m+1:n} \mid x_{1:m}, x_{m+1:n}, y_{1:m})$
$\geq \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
$\approx \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})}\left[\sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{q(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})}\right]$
so if the two generative models are the same, the ELBOs coincide as well.
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GP NP as GP 55 / 60

Generative Models
Generative model of NPs :
$p(Y \mid z, X)\, p(z) = \mathcal{N}\left(Y; \; g_\theta(z, X), \; \beta^{-1} I\right) \mathcal{N}(z; \; \mu, \Sigma)$
Generative model of deep kernel GPs with a latent variable :
$p(Y \mid z, X)\, p(z) = \mathcal{N}\left(Y; \; \Phi^\top z, \; \beta^{-1} I\right) \mathcal{N}(z; \; \mu, \Sigma)$
Comparing the two, they coincide if we take $g_\theta(z, X) = \Phi^\top z$. More generally, they coincide when using an affine decoder of the form
$g_\theta(z, X) = \Phi_\theta(X)^\top z$
where $\Phi_\theta(\cdot)$ is an $L$-layer deep NN with parameters $\theta = \{W^{(\ell)}, b^{(\ell)}\}_{\ell=1}^{L}$
K. Matsui (RIKEN AIP) Neural Processes Family Connection between NP and GP NP as GP 56 / 60
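A numerical check of the equivalence above, under the same assumptions as the previous sketch (tanh features, Σ = I): with the affine decoder g_θ(z, X) = Φ(X)^T z and z ∼ N(μ, Σ), the marginal of Y sampled through the NP generative model matches the deep-kernel GP evidence N(Φ^T μ, Φ^T Σ Φ + β^{-1} I).

```python
import numpy as np

def deep_kernel_features(X, W, b):               # same feature map as in the earlier sketch
    return np.tanh(X @ W + b).T / np.sqrt(W.shape[1])

p, d, n, beta = 3, 50, 20, 100.0
W, b = np.random.randn(p, d), np.random.randn(d)
X = np.random.randn(n, p)
Phi = deep_kernel_features(X, W, b)              # (d, n)
mu, Sigma = np.random.randn(d), np.eye(d)

# deep-kernel GP evidence: N(Phi^T mu, Phi^T Sigma Phi + beta^{-1} I)
gp_mean = Phi.T @ mu
gp_cov = Phi.T @ Sigma @ Phi + np.eye(n) / beta

# NP with affine decoder g(z, X) = Phi^T z and z ~ N(mu, Sigma):
# sample z, decode, add observation noise, and compare the empirical moments
zs = np.random.multivariate_normal(mu, Sigma, size=20000)       # (S, d)
ys = zs @ Phi + np.random.randn(20000, n) / np.sqrt(beta)       # (S, n)
print(np.allclose(ys.mean(0), gp_mean, atol=0.05),
      np.allclose(np.cov(ys.T), gp_cov, atol=0.05))
```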

Table of Contents
1.Conditional Neural Processes [Garnelo+ (ICML2018)]
2.Neural Processes [Garnelo+ (ICML2018WS)]
3.Attentive Neural Processes [Kim+ (ICLR2019)]
4.Meta-Learning surrogate models for sequential decision
making [Galashov+ (ICLR2019WS)]
5.On the Connection between Neural Processes and Gaussian
Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6.Conclusion
K. Matsui (RIKEN AIP) Neural Processes Family Conclusion 57 / 60

Summary
•The NPs family directly models the conditional distribution used to predict the output $y$
•The $O((m+n)^3)$ computational cost of prediction in GP regression is reduced to $O(m+n)$
•Applications to BO have already been considered (for some problems they outperform GP-based BO)
•ANPs, which use attention to derive the latent representations and variables, return regression results closer to those of a GP
•NPs can be regarded as equivalent to using a deep kernel in GP regression
K. Matsui (RIKEN AIP) Neural Processes Family Conclusion 58 / 60

Further Neural Processes
•Functional neural processes [Louizos+ (arXiv2019)]
•Recurrent neural processes [Willi+ (arXiv2019)]
•Sequential neural processes [Singh+ (arXiv2019)]
•Conditional neural adaptive processes [Requeima+
(arXiv2019)]
K. Matsui (RIKEN AIP) Neural Processes Family Conclusion 59 / 60

References
[1] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
[2] Alexandre Galashov, Jonathan Schwarz, Hyunjik Kim, Marta Garnelo, David Saxton, Pushmeet Kohli, SM Eslami, and Yee Whye Teh. Meta-learning surrogate models for sequential decision making. arXiv preprint arXiv:1903.11907, 2019.
[3] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018.
[4] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
[5] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.
[6] Tim GJ Rudner, Vincent Fortuin, Yee Whye Teh, and Yarin Gal. On the connection between neural processes and Gaussian processes with deep kernels. In Workshop on Bayesian Deep Learning, NeurIPS, 2018.
K. Matsui (RIKEN AIP) Neural Processes Family Conclusion 60 / 60