Unbiased Hamiltonian Monte Carlo


About This Presentation

Talk at Bayes4Health Workshop in Lancaster, September 2019.


Slide Content

Unbiased Hamiltonian Monte Carlo
Jeremy Heng
Information Systems, Decision Sciences and Statistics (IDS) Department, ESSEC
Joint work with Pierre Jacob
Department of Statistics, Harvard University
Bayes4Health & CoSInES Workshop
17 September 2019
Jeremy Heng Unbiased HMC 1/ 37

Unbiased estimation

Target distribution $\pi(dx) = \pi(x)\,dx$, $x \in \mathbb{R}^d$.

Objective: compute the expectation
$$\mathbb{E}_\pi[h(X)] = \int_{\mathbb{R}^d} h(x)\,\pi(x)\,dx$$
for some test function $h : \mathbb{R}^d \to \mathbb{R}$.

Unbiased estimators of the expectation (from Pierre's talk):
$$H_k(X, Y) = h(X_k) + \sum_{t=k+1}^{\infty} \{h(X_t) - h(Y_{t-1})\}$$
$$H_{k:m}(X, Y) = \frac{1}{m-k+1} \sum_{t=k}^{m} H_t(X, Y)$$
for any $0 \le k \le m$.
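The estimator $H_{k:m}$ can be computed directly from realized coupled trajectories once the meeting time $\tau$ is known, since every correction term with $t \ge \tau$ vanishes by faithfulness. A minimal sketch (not the authors' R package code; the function name and interface are illustrative):

```python
def h_km(h, X, Y, k, m, tau):
    """Unbiased estimator H_{k:m}(X, Y) from coupled trajectories.

    X: states X_0, ..., X_n with n >= max(m, tau - 1)
    Y: lagged states Y_0, ..., Y_{n-1}
    tau: meeting time, the smallest t with X_t = Y_{t-1};
         faithfulness (X_t = Y_{t-1} for all t >= tau) is assumed.
    """
    # Standard MCMC average over X_k, ..., X_m
    est = sum(h(X[t]) for t in range(k, m + 1)) / (m - k + 1)
    # Bias-correction terms: the difference h(X_t) - h(Y_{t-1}) appears
    # in min(t - k, m - k + 1) of the H_t's being averaged
    for t in range(k + 1, tau):
        weight = min(t - k, m - k + 1) / (m - k + 1)
        est += weight * (h(X[t]) - h(Y[t - 1]))
    return est
```

Once $k \ge \tau$, the correction loop is empty and the estimator reduces to a plain MCMC average.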

Simulating coupled chains

These unbiased estimators require coupled chains $X = (X_t)$ and $Y = (Y_t)$ satisfying:
(i) same marginal law:
    $X_0 \sim \pi_0$, $X_t \sim P(X_{t-1}, \cdot)$ for $t \ge 1$
    $Y_0 \sim \pi_0$, $Y_t \sim P(Y_{t-1}, \cdot)$ for $t \ge 1$
(ii) faithful meeting: $X_t = Y_{t-1}$ for $t \ge \tau$

We will consider couplings for Markov kernels $P$ defined by:
(i) random walk Metropolis–Hastings (RWMH)
(ii) Gibbs samplers
(iii) Metropolis-adjusted Langevin algorithm (MALA)
(iv) Hamiltonian Monte Carlo (HMC)

Maximal coupling

A key tool to simulate chains that meet is a maximal coupling between two distributions $p(x)$ and $q(y)$ on $\mathbb{R}^d$.

A maximal coupling $c(x, y)$ is a joint distribution on $\mathbb{R}^d \times \mathbb{R}^d$ such that:
(i) $(X, Y) \sim c$ implies $X \sim p$ and $Y \sim q$
(ii) $\mathbb{P}(X = Y)$ is maximized

There is an algorithm to sample from a maximal coupling if:
(i) sampling from $p$ and $q$ is possible
(ii) evaluating the densities of $p$ and $q$ is tractable

Thorisson. Coupling, Stationarity, and Regeneration (2000)

Independent coupling of Gamma and Normal

Maximal coupling of Gamma and Normal

Maximal coupling: algorithm

Sampling $(X, Y)$ from a maximal coupling of $p$ and $q$:
1. Sample $X \sim p$ and $U \sim \mathcal{U}([0, 1])$.
   If $U \le q(X)/p(X)$, output $(X, X)$.
2. Otherwise, sample $Y^\star \sim q$ and $U^\star \sim \mathcal{U}([0, 1])$
   until $U^\star > p(Y^\star)/q(Y^\star)$, and output $(X, Y^\star)$.

[Figure: densities of Normal(0, 1) and Gamma(2, 2)]
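The two steps above translate almost line for line into code, under the stated requirements (ability to sample from and to evaluate the density of both $p$ and $q$). A sketch; the function and argument names are illustrative:

```python
import math
import random

def maximal_coupling(sample_p, pdf_p, sample_q, pdf_q, rng):
    """Sample a pair (X, Y) from a maximal coupling of p and q."""
    # Step 1: accept X into the overlap min{p, q} w.p. min{1, q(X)/p(X)}
    x = sample_p()
    if rng.random() * pdf_p(x) <= pdf_q(x):
        return x, x
    # Step 2: otherwise draw Y* from the residual of q by rejection,
    # independently of X
    while True:
        y = sample_q()
        if rng.random() * pdf_q(y) > pdf_p(y):
            return x, y
```

When $p = q$ this returns $X = Y$ with probability one; in general $\mathbb{P}(X = Y) = 1 - \mathrm{TV}(p, q)$.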

Maximal coupling: algorithm

Step 1 samples from the overlap $\min\{p(x), q(x)\}$.

Maximality follows from the coupling inequality:
$$\mathbb{P}(X = Y) = \int_{\mathbb{R}^d} \min\{p(x), q(x)\}\,dx = 1 - \mathrm{TV}(p, q)$$

Step 2 samples $Y^\star$ independently of $X$ (can sometimes do better!).

The expected cost does not depend on $p$ and $q$ (but the variance does!).

[Figure: densities of Normal(0, 1) and Gamma(2, 2)]

Metropolis–Hastings (kernel $P$)

At iteration $t$, the Markov chain is at state $X_t$:
1. Propose $X^\star \sim q(X_t, \cdot)$, e.g.
   for RWMH, $X^\star \sim \mathcal{N}(X_t, \sigma^2 I_d)$;
   for MALA, $X^\star \sim \mathcal{N}(X_t + \tfrac{\sigma^2}{2} \nabla \log \pi(X_t), \sigma^2 I_d)$.
2. Sample $U \sim \mathcal{U}([0, 1])$.
3. If
   $$U \le \min\left\{1, \frac{\pi(X^\star)\, q(X^\star, X_t)}{\pi(X_t)\, q(X_t, X^\star)}\right\},$$
   set $X_{t+1} = X^\star$; otherwise set $X_{t+1} = X_t$.

Coupled Metropolis–Hastings (kernel $\bar{P}$)

At iteration $t$, the two Markov chains are at states $X_t$ and $Y_{t-1}$:
1. Propose $(X^\star, Y^\star)$ from a maximal coupling of $q(X_t, \cdot)$ and $q(Y_{t-1}, \cdot)$.
2. Sample $U \sim \mathcal{U}([0, 1])$.
3. If
   $$U \le \min\left\{1, \frac{\pi(X^\star)\, q(X^\star, X_t)}{\pi(X_t)\, q(X_t, X^\star)}\right\},$$
   set $X_{t+1} = X^\star$; otherwise set $X_{t+1} = X_t$.
   If
   $$U \le \min\left\{1, \frac{\pi(Y^\star)\, q(Y^\star, Y_{t-1})}{\pi(Y_{t-1})\, q(Y_{t-1}, Y^\star)}\right\},$$
   set $Y_t = Y^\star$; otherwise set $Y_t = Y_{t-1}$.
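For a symmetric Gaussian RWMH proposal this coupled kernel is short to write down. A one-dimensional sketch with the Gaussian maximal coupling inlined (names are illustrative, not the authors' implementation):

```python
import math
import random

def coupled_rwmh_step(x, y, log_pi, sigma, rng):
    """One step of the coupled RWMH kernel: maximally coupled
    N(x, sigma^2) / N(y, sigma^2) proposals and a common uniform."""
    def kern(z, mu):  # unnormalized proposal density (constants cancel)
        return math.exp(-0.5 * ((z - mu) / sigma) ** 2)
    # Maximal coupling of the two proposal distributions
    xp = rng.gauss(x, sigma)
    if rng.random() * kern(xp, x) <= kern(xp, y):
        yp = xp                        # proposals coincide
    else:
        while True:                    # rejection step for the residual
            yp = rng.gauss(y, sigma)
            if rng.random() * kern(yp, y) > kern(yp, x):
                break
    # Common uniform for both accept/reject decisions
    u = rng.random()
    x_next = xp if u <= math.exp(min(0.0, log_pi(xp) - log_pi(x))) else x
    y_next = yp if u <= math.exp(min(0.0, log_pi(yp) - log_pi(y))) else y
    return x_next, y_next
```

Once the proposals coincide and the common uniform accepts both moves, the chains have met; from then on they remain equal, which gives faithfulness by construction.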

Coupled RWMH on Normal target: trajectories

$\pi = \mathcal{N}(0, 1)$, $\pi_0 = \mathcal{N}(10, 3^2)$, $P$ = RWMH with proposal std $0.5$

[Figure]

Coupled RWMH on Normal target: meetings

$\pi = \mathcal{N}(0, 1)$, $\pi_0 = \mathcal{N}(10, 3^2)$, $P$ = RWMH with proposal std $0.5$

[Figure]

Coupled RWMH on Normal target: meeting times

$\pi = \mathcal{N}(0, 1)$, $\pi_0 = \mathcal{N}(10, 3^2)$, $P$ = RWMH with proposal std $0.5$

[Figure: density of meeting times]

Coupled RWMH on Normal target: scaling with dimension

$\pi = \mathcal{N}(0, \Sigma)$ with $\Sigma \sim \mathcal{W}^{-1}(I_d, d)$
$\pi_0 = \pi$ (target) or $\pi_0 = \mathcal{N}(1, I_d)$ (offset)
$P$ = RWMH with proposal covariance $d^{-1}\Sigma$
Roberts, Gelman and Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms (1997)

$\bar{P}$ = coupled RWMH using maximal coupling of Normal proposals
$\bar{P}$ = coupled RWMH using maximal coupling of Normal proposals with reflection coupling of residuals
Bou-Rabee, Eberle and Zimmer. Coupling and convergence for Hamiltonian Monte Carlo (2018)

Coupled RWMH on Normal target: maximal coupling

[Figure: average meeting time (log scale) against dimension, for target and offset initializations]

Coupled RWMH on Normal target: maximal + reflection

[Figure: average meeting time against dimension, for target and offset initializations]

Coupled RWMH on Normal target: scaling with dimension

Reflection coupling can be effective when the chains are far apart (understanding this scaling is an open question).
Lindvall. On coupling of Brownian motions (1982)
Lindvall and Rogers. Coupling of multidimensional diffusions by reflection (1986)

A larger proposal covariance leads to smaller meeting times, but to poorer mixing of the marginal RWMH chain.

This tradeoff should be studied by optimizing the asymptotic inefficiency of $H_{k:m}(X, Y)$ as the compute budget $\to \infty$:
$$\underbrace{\mathbb{E}[2(\tau - 1) + \max(1, m + 1 - \tau)]}_{\text{expected cost}} \times \underbrace{\mathbb{V}[H_{k:m}(X, Y)]}_{\text{variance}}$$
Glynn and Whitt. The asymptotic efficiency of simulation estimators (1992)

Coupled HMC on Normal target: scaling with dimension

[Figure: average meeting time against dimension, for target and offset initializations]

Coupled Gibbs sampler

Gibbs samplers update each component, leaving $\pi(dx_i \mid x_{-i})$ invariant.

If the full conditional distributions belong to a standard family of distributions, we can apply maximal couplings directly.

Otherwise, we can couple MH-within-Gibbs updates as before.

The two chains meet when all components have met.

Coupled Gibbs on Normal target: strong correlations

[Figure: median meeting time against dimension, for target and offset initializations]

Coupled Gibbs on Normal target: weak correlations

[Figure: average meeting time against dimension, for target and offset initializations]

Hamiltonian Monte Carlo (HMC)

Define the potential energy $U(q) = -\log \pi(q)$ and the Hamiltonian $E(q, p) = U(q) + \tfrac{1}{2}|p|^2$.

Hamiltonian dynamics $(q(t), p(t)) \in \mathbb{R}^d \times \mathbb{R}^d$, for $t \ge 0$:
$$\frac{d}{dt} q(t) = \nabla_p E(q(t), p(t)) = p(t)$$
$$\frac{d}{dt} p(t) = -\nabla_q E(q(t), p(t)) = -\nabla U(q(t))$$

Ideal HMC algorithm. At iteration $t$, the Markov chain is at state $X_t$:
1. Set $q(0) = X_t$ and sample $p(0) \sim \mathcal{N}(0, I_d)$.
2. Solve the dynamics over a time length $T$ to get $(q(T), p(T))$.
3. Set $X_{t+1} = q(T)$.

Hamiltonian Monte Carlo (HMC)

Solving the Hamiltonian dynamics exactly is typically intractable.

HMC algorithm:
1. Set $q_0 = X_t$ and sample $p_0 \sim \mathcal{N}(0, I_d)$.
2. Leap-frog integrator: for $\ell = 0, \ldots, L - 1$, compute
   $$p_{\ell+1/2} = p_\ell - \frac{\varepsilon}{2} \nabla U(q_\ell)$$
   $$q_{\ell+1} = q_\ell + \varepsilon\, p_{\ell+1/2}$$
   $$p_{\ell+1} = p_{\ell+1/2} - \frac{\varepsilon}{2} \nabla U(q_{\ell+1})$$
3. Sample $U^\star \sim \mathcal{U}([0, 1])$.
4. If
   $$U^\star \le \min\{1, \exp[E(q_0, p_0) - E(q_L, p_L)]\},$$
   set $X_{t+1} = q_L$; otherwise set $X_{t+1} = X_t$.
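Steps 1–4 combine into a single HMC transition. A minimal dimension-agnostic sketch using plain Python lists (not the authors' implementation):

```python
import math
import random

def hmc_step(x, U, grad_U, eps, L, rng):
    """One HMC transition with L leap-frog steps of size eps."""
    q = list(x)
    p = [rng.gauss(0.0, 1.0) for _ in q]            # p_0 ~ N(0, I_d)
    e0 = U(q) + 0.5 * sum(pi * pi for pi in p)      # E(q_0, p_0)
    for _ in range(L):
        g = grad_U(q)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]   # half kick
        q = [qi + eps * pi for qi, pi in zip(q, p)]         # drift
        g = grad_U(q)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]   # half kick
    e1 = U(q) + 0.5 * sum(pi * pi for pi in p)      # E(q_L, p_L)
    # Metropolis correction for the leap-frog discretization error
    return q if rng.random() <= math.exp(min(0.0, e0 - e1)) else list(x)
```

Because the leap-frog integrator is volume-preserving and reversible, the accept/reject step only needs the energy difference $E(q_0, p_0) - E(q_L, p_L)$.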

Coupled Hamiltonian dynamics

Consider coupling two particles $(q^i(t), p^i(t))$, $i = 1, 2$, following the Hamiltonian dynamics.

The difference $\Delta(t) = q^1(t) - q^2(t)$ satisfies
$$\frac{1}{2} \frac{d}{dt} |\Delta(t)|^2 = \Delta(t)^T \{p^1(t) - p^2(t)\},$$
therefore if $p^1(0) = p^2(0)$ then $t \mapsto |\Delta(t)|^2$ has a stationary point at $t = 0$.

To characterize the stationary point:
$$\frac{1}{2} \frac{d^2}{dt^2} |\Delta(0)|^2 = -\Delta(0)^T \{\nabla U(q^1(0)) - \nabla U(q^2(0))\} \le -\alpha\, |\Delta(0)|^2$$
if $q^1(0), q^2(0) \in S$, where $U$ is $\alpha$-strongly convex.

Coupled Hamiltonian dynamics

Since $t = 0$ is a strict local maximum point, there exists $\bar{T} > 0$ such that for any $t \in (0, \bar{T}]$
$$|q^1(t) - q^2(t)| \le \rho_t\, |q^1(0) - q^2(0)|, \quad \rho_t \in [0, 1)$$

Assuming $\nabla U$ is $\beta$-Lipschitz, we established this contraction using a Taylor expansion around $t = 0$ (Lemma 1).

More quantitative results by Mangoubi and Smith (2017, Theorem 6) and Bou-Rabee et al. (2018, Theorem 2.1) give $\bar{T} = \sqrt{\alpha}/\beta$ and $\rho_t = 1 - \tfrac{\alpha}{2} t^2$.

Coupling can be effective in high dimensions if the problem is well-conditioned.

Logistic regression: distance against integration time

[Figure: realizations of three coupled Hamiltonian trajectories; distance against integration time]

Coupled HMC kernel ($\bar{K}_{\varepsilon,L}$)

At iteration $t$, the two Markov chains are at states $X_t$ and $Y_{t-1}$:
1. Set $q^1_0 = X_t$, $q^2_0 = Y_{t-1}$ and sample a common momentum $p_0 \sim \mathcal{N}(0, I_d)$.
2. Perform leap-frog integration to obtain $(q^i_L, p^i_L)$, $i = 1, 2$.
3. Sample $U^\star \sim \mathcal{U}([0, 1])$.
4. If
   $$U^\star \le \min\{1, \exp[E(q^1_0, p_0) - E(q^1_L, p^1_L)]\},$$
   set $X_{t+1} = q^1_L$; otherwise set $X_{t+1} = X_t$.
5. If
   $$U^\star \le \min\{1, \exp[E(q^2_0, p_0) - E(q^2_L, p^2_L)]\},$$
   set $Y_t = q^2_L$; otherwise set $Y_t = Y_{t-1}$.
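The common momentum $p_0$ makes the two leap-frog trajectories contract on strongly convex potentials, and the common uniform $U^\star$ keeps the accept/reject decisions synchronized. A sketch with illustrative names and plain Python lists:

```python
import math
import random

def leapfrog(q, p, grad_U, eps, L):
    """L leap-frog steps of size eps; returns the final (q, p)."""
    q, p = list(q), list(p)
    for _ in range(L):
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, grad_U(q))]
        q = [qi + eps * pi for qi, pi in zip(q, p)]
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, grad_U(q))]
    return q, p

def coupled_hmc_step(x, y, U, grad_U, eps, L, rng):
    """One step of the coupled HMC kernel: common momentum and uniform."""
    p0 = [rng.gauss(0.0, 1.0) for _ in x]           # common momentum
    qx, px = leapfrog(x, p0, grad_U, eps, L)
    qy, py = leapfrog(y, p0, grad_U, eps, L)
    energy = lambda q, p: U(q) + 0.5 * sum(pi * pi for pi in p)
    u = rng.random()                                 # common uniform
    x_next = qx if u <= math.exp(min(0.0, energy(x, p0) - energy(qx, px))) else list(x)
    y_next = qy if u <= math.exp(min(0.0, energy(y, p0) - energy(qy, py))) else list(y)
    return x_next, y_next
```

Note that this kernel only contracts the distance between the chains; it does not produce exact meetings on its own, which motivates the mixture with coupled RWMH on the next slide.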

Coupled HMC chains

[Figure: coupled HMC trajectories in the $(x_1, x_2)$ plane]

Logistic regression: distance after 1000 iterations

[Figure: distance after 1000 iterations (log scale) against integration time, for $L = 10, 20, 30$]

Mixture of coupled kernels (kernel $\bar{K}$)

To enable exact meetings, we consider, for $\gamma \in (0, 1)$,
$$\bar{K} = (1 - \gamma) \underbrace{\bar{K}_{\varepsilon,L}}_{\text{coupled HMC}} + \gamma \underbrace{\bar{P}}_{\text{coupled RWMH}}$$

Choice of the RWMH proposal std $\sigma$:
distance between chains $< \sigma <$ spread of $\pi$

Advocate a small RWMH probability $\gamma$ to minimize inefficiency.
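The mixture amounts to flipping a $\gamma$-coin at every iteration. A sketch where `coupled_hmc` and `coupled_rwmh` stand in for the two coupled kernels (hypothetical names):

```python
import random

def mixture_step(x, y, coupled_hmc, coupled_rwmh, gamma, rng):
    """One draw from K-bar = (1 - gamma) coupled HMC + gamma coupled RWMH.
    The HMC component contracts the chains; the RWMH component, via its
    maximal coupling, can make them meet exactly."""
    if rng.random() < gamma:
        return coupled_rwmh(x, y)
    return coupled_hmc(x, y)
```

Both component kernels preserve the same marginal law and faithfulness, so the mixture does too.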

Sensitivity of RWMH proposal std

[Figure: meeting time against $\sigma$, for logistic regression (left) and Cox process (right)]

Sensitivity of RWMH probability

[Figure: meeting time against $\gamma$, for logistic regression (left) and Cox process (right)]

Conditions

To ensure validity of the unbiased estimators:
1. Convergence of the marginal chain (inherited from HMC)
2. The meeting time has geometric tails (Theorem 2, if $\varepsilon, L, \sigma, \gamma$ are small enough)
3. Faithfulness (by construction)

Main assumptions of Theorem 2:
1. $\nabla U$ is globally Lipschitz
2. $U$ is strongly convex on $S \subset \mathbb{R}^d$
3. Geometric drift condition on the HMC kernel

The assumptions can be verified for Gaussian targets and Bayesian logistic regression, relying on Theorem 9 of
Durmus, Moulines, Saksman. On the convergence of Hamiltonian Monte Carlo (2017)

Cox process: effect of dimension and algorithm

[Figure: meeting times against dimension (256, 1024, 4096), for RHMC and HMC]

Efficiency

Asymptotic inefficiency of $H_{k:m}(X, Y)$:
$$\underbrace{\mathbb{E}[2(\tau - 1) + \max(1, m + 1 - \tau)]}_{\text{expected cost}} \times \underbrace{\mathbb{V}[H_{k:m}(X, Y)]}_{\text{variance}}$$
to be compared with the asymptotic variance of optimally tuned HMC.

Bias removal leads to variance inflation.

Variance inflation can be mitigated by increasing $k$ and $m$:
as $k \to \infty$, $H_{k:m}(X, Y)$ equals a standard MCMC average with increasing probability, so its variance should be similar.

If $\tau \le k \le m$, the asymptotic inefficiency is approximately
$$\frac{m\, \sigma^2(h)}{m - k + 1} \ge \sigma^2(h),$$
where $\sigma^2(h)$ is the asymptotic variance of marginal HMC.

Logistic regression: impact of k and m

k                        m     Cost   Relative inefficiency
1                        k     436    1989.07
1                        5k    436    1671.93
1                        10k   436    1403.28
90% quantile of tau      k     553    38.11
90% quantile of tau      5k    1868   1.23
90% quantile of tau      10k   3518   1.05

Relative inefficiency = asymptotic inefficiency / asymptotic variance of optimal HMC,
which, as $k, m \to \infty$, tends to
asymptotic variance of marginal HMC / asymptotic variance of optimal HMC.

Concluding remarks

In non-convex cases, for another coupling of HMC see
Bou-Rabee, Eberle and Zimmer. Coupling and convergence for Hamiltonian Monte Carlo (2018)

Could combine synchronous coupling ($L = 1$) and maximal coupling for MALA.

Extensions to other variants of HMC:
1. No-U-Turn Sampler (Hoffman and Gelman, 2014)
2. Partial momentum refreshment (Horowitz, 1991)
3. Different choices of kinetic energy (Livingstone et al., 2017)
4. Hamiltonian bouncy particle sampler (Vanetti et al., 2017)

Heng and Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 2019
R package: https://github.com/pierrejacob/debiasedhmc