Markov Chain Monte Carlo Methods

Francesco Casalegno · 4,230 views · 32 slides · Sep 27, 2018

About This Presentation

Why should you care about Markov Chain Monte Carlo methods?
→ They are in the list of "Top 10 Algorithms of 20th Century"
→ They allow you to make inference with Bayesian Networks
→ They are used everywhere in Machine Learning and Statistics



Slide Content

Markov Chain Monte Carlo
→ Gibbs Sampling → Metropolis–Hastings
→ Hamiltonian Monte Carlo → Reversible-Jump MCMC
Francesco Casalegno

Francesco Casalegno – Markov Chain Monte Carlo
Outline
1. Motivation
2. Basic Principles of MCMC
3. Gibbs Sampling
4. Metropolis–Hastings
5. Hamiltonian Monte Carlo
6. Reversible-Jump Markov Chain Monte Carlo
7. Conclusion

Motivation


1. Motivation

●“Computing in Science and Engineering” put MCMC among the Top 10 Algorithms of the 20th Century (together with the Fast Fourier Transform, Quicksort, the Simplex Method, …)

●MCMC methods are used to draw samples from complex distributions: X_1 … X_M ~ p(x)
○Why complex distributions? → If the distribution is simple, use Inverse Transform Sampling or Rejection Sampling. Here p(x) is high-dimensional, multi-modal, or known only up to a constant.
○Why Markov Chain? → Samples are drawn in a sequence!
○Why Monte Carlo? → Samples are used to approximate the pdf or to compute the mean/variance/…

●OK, but where do we find these complex distributions?
○Typical scenario: sample from a posterior, i.e. use observations Y = (y_1 … y_N) to make inference on θ
○So we want to sample θ ~ p(θ|Y) = p(Y|θ) p(θ) / p(Y) ∝ p(Y|θ) p(θ)
■θ is often high-dimensional
■The normalization constant, i.e. the evidence p(Y) = ∫ p(Y|θ) p(θ) dθ, is computationally intractable
○MCMC methods are an essential tool to make inference with Bayesian Networks

●In the next slides we see Bayesian Networks where MCMC can be applied with success!

1. Motivation

●Bayesian Networks (aka Belief Networks) are powerful models representing variables and their conditional dependencies in a DAG (directed acyclic graph).

●Notation: nodes are observed quantities, unobserved variables, or fixed parameters.

●Inference on unobserved variables is done by computing posterior distributions (Bayes’ Theorem).

●Posterior distributions are computed using the following tools:
○Law of Total Probability
○Chain Rule (aka Product Rule)
○MCMC methods

1. Example: Hierarchical Regression

●Observations
○Different countries c = 1 … 4, different number of samples N_c
○Each sample is x_i = mother longevity, y_i = child longevity
●Model
○Linear regression, but samples are given per country…
■No Pool: treat each country independently, fit 4 independent θ_c
■Complete Pool: forget about the country, fit one θ on all samples together
■Hierarchical Regression: there are 4 different θ_c, but they are related!
→ Best approach, in particular for countries with few samples (c=3)!
●How can we make inference on θ_1 … θ_C, μ_θ, and σ_θ?
○Note: in Bayesian Nets, all parameters of interest have priors!

1. Example: Mine Fatality (Change-Point Model)

●Observations
○Number of coal mine fatalities y_t during year t = 1900 … 1960
●Model
○The number of fatalities follows a Poisson law
○The fatality rate changed (e.g. new OSH law) at some point
●How can we compute posteriors for ν, λ_1, λ_2?
○Year ν when the rate changed
○Fatality rate λ_1 before the change-point
○Fatality rate λ_2 after the change-point

1. Example: Latent Dirichlet Allocation

●Observations: words from D documents
●Model
○Assume there are T topics in total
○Assume Bag-Of-Words (only word counts matter, not order)
○φ_t distribution of words in topic t ∊ {1 … T}
○θ_d distribution of topics in document d ∊ {1 … D}
○z_d,n topic of word n ∊ {1 … N_d} within document d ∊ {1 … D}
○w_d,n word appearing at position n ∊ {1 … N_d} of document d ∊ {1 … D}
●How to automatically discover (infer the posterior distribution of)
○the content of topics, in terms of the words associated with them?
○the content of documents, in terms of their topic distribution?

Basic Principles of MCMC


2. Basic Principles of MCMC

●As we have seen, to use Bayesian Networks we need to sample from θ ~ p(θ|Y).
But MCMC methods are generic: in the following we just talk about sampling from p.

●MCMC methods build a Markov Chain of samples X_1, X_2, … converging to p. We need:
1. An initial sample x_0
2. A simple way to draw a new X_{n+1} given X_n = x_n (i.e. the Markov process)
3. A mathematical proof that, for n large enough, the process generates samples X_n ~ p
We will see how different MCMC methods differ in the way they draw X_{n+1} given X_n = x_n.

●In this way, we draw X_1, X_2, … ~ p and we then compute Monte Carlo approximations like
I_M = (1/M) Σ_{m=1…M} f(X_m) ≈ E[f(X)] = ∫ f(x) p(x) dx

And the error in I_M? X_1, X_2, … are correlated, so it is worse than for standard Monte Carlo:
Var(I_M) ≈ Var(f(X)) / M*,  with M* = M / (1 + 2 Σ_{k≥1} ρ_k)

M* is known as the effective sample size, and ρ_k is the k-lag autocorrelation of X_1, X_2, …
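The effective sample size M* can be estimated directly from a chain by truncating the autocorrelation sum. A minimal sketch (the 0.05 truncation threshold is an arbitrary choice, not from the slides):

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """Estimate M* = M / (1 + 2 * sum_k rho_k), truncating the sum at the
    first lag where the empirical autocorrelation becomes negligible."""
    x = np.asarray(x, dtype=float)
    M = len(x)
    if max_lag is None:
        max_lag = M // 3
    x = x - x.mean()
    var = x.var()
    rho_sum = 0.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / (M * var)  # k-lag autocorrelation
        if rho_k < 0.05:                           # arbitrary cut-off
            break
        rho_sum += rho_k
    return M / (1.0 + 2.0 * rho_sum)

rng = np.random.default_rng(42)
iid = rng.normal(size=10_000)            # i.i.d. draws: M* ~ M
ar = np.empty(10_000)                    # AR(1) chain: strong autocorrelation
ar[0] = 0.0
for n in range(1, len(ar)):
    ar[n] = 0.95 * ar[n - 1] + rng.normal()

print(effective_sample_size(iid))        # ≈ 10000 (no significant autocorrelation)
print(effective_sample_size(ar))         # a few hundred: correlated samples carry less information
```

This is exactly why the MCMC estimates below come with a larger error bar than i.i.d. Monte Carlo with the same number of draws.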

2. Basic Principles of MCMC

X_1, X_2, … drawn i.i.d. from p
●looks like noise
●pdf approximation is quite good
●no autocorrelation

X_1, X_2, … generated by MCMC
●wait for convergence to p: discard burn-in!
●pdf approximation is worse (effective sample size M*!)
●strong autocorrelation of samples

2. Basic Principles of MCMC

●But how do we draw a new X_{n+1} given X_n = x_n?
○There is no single solution: depending on the situation we can use a different MCMC method!
○Each MCMC method has its own way of drawing a new X_{n+1} given X_n = x_n.

●In these slides we will present the most important MCMC methods:
○Gibbs Sampling
○Metropolis–Hastings
○Hamiltonian Monte Carlo
○Reversible-Jump MCMC

Gibbs Sampling


3. Gibbs Sampling

●Gibbs sampling is an MCMC method used to draw from a multivariate distribution p when
1. sampling from the joint distribution p(x) = p(x_1 … x_D) is difficult → so we need MCMC!
2. sampling from the (univariate) conditionals p(x_1|x_2 … x_D), p(x_2|x_1, x_3 … x_D), …, p(x_D|x_1 … x_{D-1}) is easy

●Algorithm
→ Choose an initial point x_0 and find a way to draw from the conditionals (e.g. by inverse sampling)
→ For n = 0, …
  → Draw x_1^{n+1} ~ p(x_1 | x_2^n … x_D^n)
  → Draw x_2^{n+1} ~ p(x_2 | x_1^{n+1}, x_3^n … x_D^n)
  …
  → Draw x_D^{n+1} ~ p(x_D | x_1^{n+1} … x_{D-1}^{n+1})
→ Set x^{n+1} = (x_1^{n+1} … x_D^{n+1})

3. Gibbs Sampling: Example

●Let us sample from a bivariate normal distribution.
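For a standard bivariate normal with correlation ρ, the univariate conditionals are known in closed form: x_1 | x_2 ~ Normal(ρ x_2, 1 − ρ²). A minimal sketch of the resulting Gibbs sweep (ρ = 0.8 is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # target: Normal(0, [[1, rho], [rho, 1]])
n_samples = 20_000

x1, x2 = 0.0, 0.0              # initial point x_0
chain = np.empty((n_samples, 2))
for n in range(n_samples):
    # conditionals of the bivariate normal: x_i | x_j ~ Normal(rho * x_j, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    chain[n] = (x1, x2)

print(np.corrcoef(chain.T)[0, 1])   # close to rho = 0.8
```

Each iteration updates one coordinate at a time, always accepting, exactly as in the algorithm above.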

3. Gibbs Sampling: Mine Fatality

●The joint posterior probability (assuming Gamma(a, b) priors on the rates) is
p(λ_1, λ_2, ν | y_{1:T}) ∝ Π_{t≤ν} Poisson(y_t | λ_1) · Π_{t>ν} Poisson(y_t | λ_2) · Gamma(λ_1 | a, b) · Gamma(λ_2 | a, b)

●Direct sampling from this multivariate, mixed (discrete-continuous) distribution would be too hard!
→ Use Gibbs sampling: draw from the conditional posteriors!
○The posterior for λ_1 is Gamma(a + Σ_{t≤ν} y_t, b + ν)
○The posterior for λ_2 is Gamma(a + Σ_{t>ν} y_t, b + (T − ν))
○The posterior for ν is a categorical distribution with weights
p(ν | λ_1, λ_2, y_{1:T}) ∝ λ_1^{Σ_{t≤ν} y_t} e^{−ν λ_1} · λ_2^{Σ_{t>ν} y_t} e^{−(T−ν) λ_2}
from which we can draw using Inverse Transform Sampling.
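The conditional posteriors above translate directly into a Gibbs sampler. A sketch on synthetic data, assuming Gamma(a, b) priors with a = 2, b = 1 (the slide does not specify the prior, so these hyper-parameters and the synthetic rates are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data: rate 3.0 for the first 40 "years", rate 1.0 afterwards
T = 60
true_nu = 40
y = np.concatenate([rng.poisson(3.0, true_nu), rng.poisson(1.0, T - true_nu)])

a, b = 2.0, 1.0                        # Gamma(a, b) prior on both rates (assumption)
nu = T // 2                            # initial change-point
draws = []
for it in range(4000):
    # lambda_1 | nu, y ~ Gamma(a + sum_{t<=nu} y_t, b + nu)
    lam1 = rng.gamma(a + y[:nu].sum(), 1.0 / (b + nu))
    # lambda_2 | nu, y ~ Gamma(a + sum_{t>nu} y_t, b + (T - nu))
    lam2 = rng.gamma(a + y[nu:].sum(), 1.0 / (b + (T - nu)))
    # nu | lambda_1, lambda_2, y: categorical weights, computed in log-space
    log_w = np.array([y[:n].sum() * np.log(lam1) - n * lam1
                      + y[n:].sum() * np.log(lam2) - (T - n) * lam2
                      for n in range(1, T)])
    w = np.exp(log_w - log_w.max())
    nu = 1 + rng.choice(T - 1, p=w / w.sum())   # inverse-transform style draw
    draws.append((nu, lam1, lam2))

nu_hat = np.median([d[0] for d in draws[500:]])
print(nu_hat)        # near the true change-point (40 in this synthetic setup)
```

Note how the discrete variable ν and the continuous rates λ_1, λ_2 are each handled by a conditional that is easy to sample, which is exactly the strength of Gibbs on mixed distributions.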

3. Gibbs Sampling: Pros and Cons

Pros
1. Unlike other MCMC methods, x^{n+1} is always accepted as the next step (no rejection)
2. Useful to treat very high-dimensional problems
3. Useful if we have both continuous and discrete components, to work with fully discrete/continuous separate conditionals

Cons
1. All the conditionals must be known
→ often known only up to a normalizing constant!
2. Must know how to sample from the conditionals
→ if it is hard, sample from the conditionals with another MCMC method such as “Metropolis-within-Gibbs”
3. If the components are strongly correlated, the Markov chain converges slowly and has highly auto-correlated samples

Metropolis–Hastings


4. Metropolis–Hastings

●Metropolis–Hastings generates a chain of samples from p by using the following ideas.
○Draw a new candidate x*_{n+1} for X_{n+1} given X_n = x_n using some proposal distribution Q(x*_{n+1} | x_n)
○Accept the candidate (x_{n+1} = x*_{n+1}) with some acceptance prob. A(x*_{n+1}, x_n); otherwise reject.

●Algorithm
→ Choose an initial point x_0 and a proposal distribution Q(x*_{n+1} | x_n)
→ For n = 0, …
  → Draw a new candidate x*_{n+1} ~ Q(x*_{n+1} | x_n)
  → Compute the acceptance probability
    A(x*_{n+1}, x_n) = min{1, [p(x*_{n+1}) Q(x_n | x*_{n+1})] / [p(x_n) Q(x*_{n+1} | x_n)]}
  → Accept the candidate with probability A(x*_{n+1}, x_n);
    if the candidate is rejected, set x_{n+1} = x_n.

Note: to compute the acceptance prob. we only need to know p up to a multiplicative constant → typical Bayesian posterior!

●How do we choose the proposal distribution Q(x*_{n+1} | x_n)?
○A common choice is x*_{n+1} ~ Normal(x_n, σ²)
○This is called Random Walk Metropolis, as we can also write x*_{n+1} = x_n + ε with ε ~ Normal(0, σ²)
○Large σ² → low auto-correlation between samples (“big jumps”), but high rejection rate
○Small σ² → high auto-correlation between samples (“small jumps”), but low rejection rate
○If we use a symmetric proposal distribution (e.g. Q = Normal), the acceptance probability reduces to
  A(x*_{n+1}, x_n) = min{1, p(x*_{n+1}) / p(x_n)}
○This is called the Metropolis method, historically invented before Metropolis–Hastings
○But sometimes we need an asymmetric proposal, e.g. for 1-tailed target distributions (e.g. p = Gamma(α, β))

4. Metropolis–Hastings: Example

●Let us sample from a bivariate normal distribution using a Normal proposal distribution.
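A minimal Random Walk Metropolis sketch for this target; the correlation ρ = 0.8 and proposal scale σ = 1.0 are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8
cov_inv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))

def log_p(x):                      # target known only up to a constant
    return -0.5 * x @ cov_inv @ x

sigma = 1.0                        # proposal std: big jumps vs rejection trade-off
x = np.zeros(2)                    # initial point x_0
chain, accepted = [], 0
for n in range(20_000):
    x_star = x + rng.normal(0.0, sigma, size=2)   # symmetric Normal proposal
    # symmetric proposal -> Metropolis acceptance A = min(1, p(x*) / p(x))
    if np.log(rng.uniform()) < log_p(x_star) - log_p(x):
        x = x_star
        accepted += 1
    chain.append(x)                # on rejection the chain repeats x_n

chain = np.array(chain)
print(accepted / len(chain))               # acceptance rate
print(np.corrcoef(chain[1000:].T)[0, 1])   # close to rho = 0.8
```

Because the Normal proposal is symmetric, the Q terms cancel and only the ratio p(x*)/p(x_n) matters, so the normalization constant of p is never needed.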

4. Metropolis–Hastings: Covariation Model

●The posterior distribution p(ρ | y_{1:N}) does not look like anything familiar.

●Using Inverse Transform or Rejection Sampling would be difficult in this case. So we use Metropolis–Hastings.
○p(ρ | y_{1:N}) is known up to a constant, and that is OK
○Q(ρ*_{n+1} | ρ_n) cannot be a Normal as our domain is bounded: ρ ∊ (−1, +1)
→ Cannot use Random Walk Metropolis!
○Use Q(ρ*_{n+1} | ρ_n) = TruncatedNormal on (−1, +1): an asymmetric proposal
→ The pdf of the TruncatedNormal on (−1, +1) will appear in the acceptance prob!
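A sketch of this idea using `scipy.stats.truncnorm`. The synthetic data, the flat prior on (−1, +1), and the proposal scale σ = 0.1 are all illustrative assumptions, not taken from the slide; the key point is that the asymmetric proposal pdfs do not cancel in the acceptance probability:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)
true_rho = 0.6
cov = np.array([[1.0, true_rho], [true_rho, 1.0]])
y = rng.multivariate_normal([0.0, 0.0], cov, size=200)   # synthetic observations

def log_post(rho):
    # flat prior on (-1, 1) + bivariate normal log-likelihood, up to a constant
    q = (y[:, 0]**2 - 2*rho*y[:, 0]*y[:, 1] + y[:, 1]**2) / (1 - rho**2)
    return -0.5 * len(y) * np.log(1 - rho**2) - 0.5 * q.sum()

sigma = 0.1
def proposal(mean):
    # TruncatedNormal on (-1, +1) centered at the current state
    a, b = (-1 - mean) / sigma, (1 - mean) / sigma
    return truncnorm(a, b, loc=mean, scale=sigma)

rho_n = 0.0
samples = []
for n in range(5000):
    q_fwd = proposal(rho_n)
    rho_star = q_fwd.rvs(random_state=rng)
    q_rev = proposal(rho_star)
    # asymmetric proposal: include the Q ratio in the acceptance probability
    log_A = (log_post(rho_star) - log_post(rho_n)
             + q_rev.logpdf(rho_n) - q_fwd.logpdf(rho_star))
    if np.log(rng.uniform()) < log_A:
        rho_n = rho_star
    samples.append(rho_n)

print(np.mean(samples[1000:]))   # close to the sample correlation of y (~0.6)
```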

4. Metropolis–Hastings: Pros and Cons

Pros
1. Works also if we know p only up to a multiplicative constant
○Can sample from a Bayesian posterior without calculating the evidence p(Y)
2. Can be used within Gibbs sampling:
○Gibbs splits the joint into conditionals
○Sample x_1^{n+1} ~ p(x_1 | x_2^n), x_2^{n+1} ~ p(x_2 | x_1^{n+1}) using Metropolis–Hastings
3. Can be used when it is not practical to derive all the conditional posteriors

Cons
1. Choice of the best proposal distribution?
2. Choice of the variance of the proposal distribution?
○too small → high autocorrelation
○too large → high rejection rate

Hamiltonian Monte Carlo


5. Hamiltonian Monte Carlo

●Hamiltonian Monte Carlo has two advantages with respect to other MCMC methods:
○Little or no autocorrelation of samples
○Fast mixing, i.e. the chain quickly converges to the distribution p

●Hamiltonian Monte Carlo is based on the Hamiltonian (total energy) H(x, v) = U(x) + K(v)
○Imagine a ball in a space with potential energy U(x) = −log p(x) and put the ball in the initial position x_n
○Give the ball an initial random velocity v ~ q and define its kinetic energy K(v) = −log q(v)
○Compute the trajectory for a time T, then take the final position: x(T) = x_{n+1}

●Algorithm
→ Choose an initial point x_0 and a velocity distribution q(v)
→ For n = 0, …
  → Set the initial position to x(t=0) = x_n
  → Draw a new random initial velocity v(t=0) ~ q(v)
  → Numerically integrate the trajectory with total energy H(x, v) = −log p(x) − log q(v) for a time T
  → Set x_{n+1} = x(t=T)

●How do we choose the distribution for the velocity q(v)?
○A common choice is v ~ Normal(0, Σ), so that the kinetic energy reads K(v) = ½ vᵀ Σ⁻¹ v
○If we have an understanding of p(x) we can choose Σ in a smart way, otherwise just set Σ = σ² I

5. Hamiltonian Monte Carlo: Trajectories

●In a system with energy H(x, v) = U(x) + K(v), position x and velocity v evolve according to Hamilton’s equations:
dx/dt = ∂H/∂v = ∇K(v),  dv/dt = −∂H/∂x = −∇U(x)

●In most cases these equations cannot be solved exactly, so we use a numerical scheme:
○Choose a discrete time step τ
○Compute the numerical solution using the Leapfrog Method (or another symplectic method), e.g. for K(v) = ½|v|²:
  v(t + τ/2) = v(t) − (τ/2) ∇U(x(t))
  x(t + τ) = x(t) + τ v(t + τ/2)
  v(t + τ) = v(t + τ/2) − (τ/2) ∇U(x(t + τ))
○The energy H(x, v) = U(x) + K(v) should be preserved over time, but we use a numerical discretization…
○Symplectic methods are good because they preserve H(x, v) up to O(τ^s), with s = 2 for the Leapfrog Method
○When using numerical methods to compute trajectories, accept x_{n+1} = x(t=T) with acceptance probability
  A = min{1, exp(H(x_n, v_n) − H(x_{n+1}, v_{n+1}))}

Notice that H(x_{n+1}, v_{n+1}) = H(x_n, v_n) + O(τ^s), so the acceptance probability is ≈ 1 for τ small enough.
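The leapfrog scheme and its O(τ²) energy error can be checked on a 1-d harmonic oscillator, U(x) = x²/2 (an illustrative potential with a known, periodic exact solution):

```python
import numpy as np

def leapfrog(x, v, grad_U, tau, n_steps):
    """Kick-drift-kick leapfrog for dx/dt = v, dv/dt = -grad_U(x)."""
    v = v - 0.5 * tau * grad_U(x)       # initial half kick
    for _ in range(n_steps - 1):
        x = x + tau * v                 # drift
        v = v - tau * grad_U(x)         # full kick
    x = x + tau * v
    v = v - 0.5 * tau * grad_U(x)       # final half kick
    return x, v

# harmonic oscillator: U(x) = x^2/2, K(v) = v^2/2, H = U + K
grad_U = lambda x: x
H = lambda x, v: 0.5 * x**2 + 0.5 * v**2

x0, v0, T = 1.0, 0.3, 2.0
for tau in (0.2, 0.1, 0.05):
    x, v = leapfrog(x0, v0, grad_U, tau, int(T / tau))
    print(tau, abs(H(x, v) - H(x0, v0)))  # energy error shrinks ~4x per halving of tau
```

The energy drift stays bounded and scales as τ², which is what makes the HMC acceptance probability close to 1 for small τ.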

5. Hamiltonian Monte Carlo: Example

●Let us sample from a bivariate normal distribution.
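Putting the pieces together, a minimal HMC sketch for this target; the step size τ = 0.1, the number of leapfrog steps (so T = 2.0), and ρ = 0.8 are illustrative tuning choices:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8
cov_inv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))

def U(x):      return 0.5 * x @ cov_inv @ x     # -log p(x) up to a constant
def grad_U(x): return cov_inv @ x

tau, n_steps = 0.1, 20          # leapfrog step and number of steps (T = 2.0)
x = np.zeros(2)
chain = []
for n in range(5000):
    v = rng.normal(size=2)                        # v ~ Normal(0, I), K(v) = |v|^2/2
    x_new, v_new = x.copy(), v.copy()
    v_new = v_new - 0.5 * tau * grad_U(x_new)     # leapfrog integration
    for _ in range(n_steps - 1):
        x_new = x_new + tau * v_new
        v_new = v_new - tau * grad_U(x_new)
    x_new = x_new + tau * v_new
    v_new = v_new - 0.5 * tau * grad_U(x_new)
    # accept with probability min(1, exp(H(x_n, v_n) - H(x_new, v_new)))
    dH = (U(x_new) + 0.5 * v_new @ v_new) - (U(x) + 0.5 * v @ v)
    if np.log(rng.uniform()) < -dH:
        x = x_new
    chain.append(x)

chain = np.array(chain)
print(np.corrcoef(chain[500:].T)[0, 1])    # close to rho = 0.8
```

With a well-chosen τ the energy error along each trajectory is tiny, so almost every proposal is accepted while still making long, nearly independent moves.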

5. Hamiltonian Monte Carlo: Pros and Cons

Pros
1. Best method for continuous distributions
2. Samples have almost zero autocorrelation
3. Only requires knowing p up to a constant
4. Can be extended to have the velocity depend on the location, q = q(v|x), but then K = K(x, v)

Cons
1. Choice of the symplectic integrator and of τ?
○τ too small → slow integration
○τ too large → higher rejection rate
→ adaptive methods automatically choose τ
2. Choice of q(v)? If q(v) = Normal(0, Σ), choice of Σ?
3. Choice of the integration time T?
○T too small → samples may be correlated
○T too large → Hamiltonian trajectories are closed, so we waste time
→ the NUTS method automatically chooses T!
4. Must evaluate the derivatives of p(x) and q(v)
5. Works only for continuous distributions

Reversible-Jump MCMC


6. Reversible-Jump MCMC

●Reversible-Jump MCMC extends MCMC methods to the case where the variable space has an unknown/variable number of dimensions.
○Hierarchical Regression. In the example we fit lines, i.e. we used θ ∊ ℝ². We could also decide to use polynomials of another degree k, so that θ ∊ ℝ^{k+1} → how do we choose k?
○Change-Point Model. In the example we assumed that the mine fatality rate changed at some point. We could also assume that the rate changed k times, so we need inference on the change-points ν_1 … ν_k as well as on the rates λ_1 … λ_{k+1} → how do we choose k?

●Reversible-Jump MCMC is a powerful method for model selection!
○It also works for multiple hyper-parameters k_1 … k_m

6. Reversible-Jump MCMC

●Consider the meta-space ∪_k ({k} × ℝ^{d_k}), where k is the model index and d_k is the dimension of that model’s space.
○k is treated as just another variable in the meta-space
○For our change-point model, k = number of change-points, d_k = 2k + 1

●How do we jump from dimension d_k to d_k’?
○Sample an extra random variable u ~ Q(u)
○If d_k < d_k’ it is called a “birth”; if d_k > d_k’ it is called a “death”

●Algorithm
a. draw a jump u ~ Q(u)
b. compute the proposal x*_{n+1} = g(x_n, u)
c. compute the reverse jump u* s.t. x_n = g(x*_{n+1}, u*)
d. accept the proposal with acceptance probability
   A = min{1, [p(x*_{n+1}) Q(u*)] / [p(x_n) Q(u)] · |det ∂g(x_n, u)/∂(x_n, u)|}

Conclusions


Conclusions
1. Bayesian Networks are a powerful tool of Machine Learning and Statistical Modelling.
2. Bayesian Networks use MCMC to sample from computationally intractable posteriors.
3. Gibbs Sampling reduces drawing from a hard joint posterior to drawing from easy conditionals.
4. Metropolis–Hastings is useful when the posterior has no closed form / is known only up to a constant.
5. Hamiltonian Monte Carlo is the best choice for the continuous case: low correlation, low rejection.
6. Reversible-Jump MCMC is an extension used when the number of parameters is unknown/variable.