Insufficient Gibbs sampling (A. Luciano, C.P. Robert and R. Ryder)
xianblog
Aug 27, 2024
About This Presentation
Slides of a presentation at the COMPSTAT 2024 conference in Gießen, Hesse, Germany, on 28 Aug 2024
Insufficient Gibbs sampling
Christian P. Robert
U. Paris Dauphine PSL & Warwick U.
Joint work with Antoine Luciano and Robin Ryder
ERCall for oceaneers
Ongoing 2023-2030 ERC funding for PhD and postdoctoral
collaborations with
▶Michael Jordan (Berkeley)
▶Eric Moulines (Paris)
▶ Gareth O. Roberts (Warwick)
▶myself (Paris)
on OCEAN (On IntelligenCE And Networks) project
Outline
1 Introduction
2 Methodology for different statistics
  Median, MAD
  Quantiles
  Median, IQR
3 Impact on Bayesian Model Choice
4 Experiments
  Gaussian Case
  Cauchy Example
  3-Parameter Weibull Example
  Normal-Laplace Comparison
5 Conclusion
Motivations
▶ Framework: Only robust statistics T(X) of data X ∈ ℝ^N are available, attached with a parametric model F_θ on X.
▶ Goal: Sample from the posterior distribution of the parameters θ given T(X), π(θ | T(X)).
▶ Issue: Posterior intractable for most robust statistics.
▶ State of the art: Approximate Bayesian Computation (ABC) methods with T(X) as a summary statistic only deliver a sample from an approximation of π(θ | X) (that may be close to π(θ | T(X))).
[Marin & al., 2011]
▶ Connections with
Contribution
Turning X into a latent variable:
π(θ | T(x)) ∝ ∫_{ℝ^N} π(θ, x | T(x)) dx ∝ ∫_{ℝ^N} π(θ | x) π(x | T(x)) dx
Two-step simulation scheme:
1. X from π(X | T(X), θ), under the assumption X ∼ F_θ
2. θ from the posterior given the augmented data, π(θ | X)
Gibbs Sampler
Resolution by data completion, since the full posterior can be simulated directly (e.g., conjugate prior) or via a MH step
[Tanner & Wong, 1987]
X^0 ← X_init(N, T(X), F);
for t = 1, ..., T do
    X^t | T(X), θ^{t−1} ∼ π(· | X^{t−1}, T(X), F_{θ^{t−1}});
    θ^t | X^t ∼ π(· | X^t);
end
Gibbs Sampler / Data Augmentation
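To fix ideas, a minimal Python skeleton of this sampler (names are illustrative; the constrained refresh of X given T(X), the actual difficulty, is detailed per statistic in the next section):

```python
import numpy as np

def robust_gibbs(T_obs, init_X, init_theta, sample_X_given_T,
                 sample_theta_given_X, n_iter):
    """Data-augmentation Gibbs sampler targeting pi(theta | T(X)).

    sample_X_given_T(X, T_obs, theta) must refresh the latent data while
    keeping T(X) = T_obs exactly; sample_theta_given_X(X) draws from the
    full-data posterior pi(theta | X), directly (conjugate prior) or by MH."""
    X, theta = np.array(init_X, dtype=float), init_theta
    thetas = []
    for _ in range(n_iter):
        X = sample_X_given_T(X, T_obs, theta)    # X^t | T(X), theta^{t-1}
        theta = sample_theta_given_X(X)          # theta^t | X^t
        thetas.append(theta)
    return thetas
```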
Methodology for different statistics
1 Introduction
2 Methodology for different statistics
  Median, MAD
  Quantiles
  Median, IQR
3 Impact on Bayesian Model Choice
4 Experiments
  Gaussian Case
  Cauchy Example
  3-Parameter Weibull Example
  Normal-Laplace Comparison
5 Conclusion
Median, MAD
Here we have T(X) = (median(X), MAD(X)) where:
(a) the median is the 50% quantile of the empirical distribution, i.e., for a sample X = (X_1, ..., X_N)
median(X) = X_(k+1) if N = 2k+1, and (X_(k) + X_(k+1))/2 if N = 2k,
where X_(i) denotes the ith order statistic of the sample.
(b) the median absolute deviation (MAD) is a measure of statistical dispersion that is commonly used as a robust alternative to the standard deviation, defined as:
MAD(X) = median(|X − median(X)|)
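As a quick sanity check of these definitions, a few lines of NumPy (illustrative only):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=11)   # N = 2k+1 with k = 5
xs = np.sort(x)
med = xs[5]                                    # X_(k+1), at 0-based index k
mad = np.median(np.abs(x - med))               # MAD(X) = median(|X - median(X)|)
assert med == np.median(x)
```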
Median, MAD
Basic Idea:
[Diagram: the cuts med − MAD, med, med + MAD split the data into four zones containing (50−k)%, k%, (50−k)% and k% of the observations, with k ∈ [0, 50]; 50% of the data lie below med and 50% within med ± MAD.]
The distribution of k is unknown, thus a direct sampling of the data vector is impossible.
Gibbs alternative: randomly select (or not) two coordinates (i, j) ∈ {1, ..., N}² for resampling (see the sketch below)
X_i^{t+1}, X_j^{t+1} | T(X^t), X^t_{−(i,j)}, θ^t
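A minimal sketch of the within-zone part of this move, assuming a Gaussian model F_θ = N(μ, σ²) and odd N (function names are illustrative): the observations realizing med and MAD are pinned, and other coordinates are redrawn from F_θ truncated to their current zone, which preserves T(X) exactly. The paper's additional pair moves that swap zone allocations, needed for ergodicity and motivating the pair (i, j) above, are omitted here.

```python
import numpy as np
from scipy.stats import truncnorm

def zone(v, med, mad):
    """Bounds of the zone (cut at med-MAD, med, med+MAD) containing v."""
    cuts = [-np.inf, med - mad, med, med + mad, np.inf]
    k = int(np.searchsorted(cuts, v)) - 1
    return cuts[k], cuts[k + 1]

def refresh_within_zones(x, mu, sigma, n_pairs=1, rng=np.random.default_rng()):
    """Partial refresh of x | (med, MAD), theta for a N(mu, sigma^2) model."""
    x = x.copy()
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # observations pinned by the statistics (N odd): the median point and the
    # point whose absolute deviation from the median equals the MAD
    pinned = {int(np.argmin(np.abs(x - med))),
              int(np.argmin(np.abs(np.abs(x - med) - mad)))}
    free = [i for i in range(len(x)) if i not in pinned]
    for i in rng.choice(free, size=2 * n_pairs, replace=False):
        lo, hi = zone(x[i], med, mad)
        a, b = (lo - mu) / sigma, (hi - mu) / sigma    # standardized bounds
        x[i] = truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)
    return x

x = np.random.default_rng(0).normal(size=101)
med0, mad0 = np.median(x), np.median(np.abs(x - np.median(x)))
x_new = refresh_within_zones(x, mu=0.0, sigma=1.0, n_pairs=10)
assert np.isclose(np.median(x_new), med0)
assert np.isclose(np.median(np.abs(x_new - np.median(x_new))), mad0)
```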
Median, MAD: Odd Case
Since median and MAD are both based on order statistics, odd and even N lead to two different settings.
Quantiles
Now assume one only observes sample quantiles (q_1, ..., q_K) of orders (p_1, ..., p_K), with p_k ∈ [0, 1] and q_k ∈ ℝ for k ∈ {1, ..., K}, such that
Q(p_k) = q_k,
where Q denotes the empirical quantile function (the generalized inverse of the empirical cdf P̂(X_i ≤ ·)). So
T(X) = ((q_1, ..., q_K), (p_1, ..., p_K))
Quantiles
General case:
[Diagram: the observed quantiles q_1 < q_2 < ... < q_K cut the data into segments containing p_1%, (p_2 − p_1)%, ..., (p_K − p_{K−1})% and (100 − p_K)% of the observations.]
Example (K = 3): If (p_k)_{k∈{1,...,K}} = (.25, .5, .75) and (q_k)_{k∈{1,...,K}} = (q_1, med, q_3), then we have:
[Diagram: q_1, med, q_3 cut the data into four segments of 25% each.]
Quantile specificity
Definition
Q(X, p) = (1 − g) X_(i) + g X_(i+1), where h = (N − 1)p + 1, i is the integer part of h, i.e. i = ⌊h⌋, and g its fractional part, i.e. g = h − i.
[Hyndman, 1996]
▶ used by default in R, Julia, NumPy, and Excel
▶ If g_k = 0, we have Q(X, p_k) = X_(i_k), a "deterministic" order statistic, where i_k denotes its index.
▶ If g_k > 0, we have Q(X, p_k) = (1 − g_k) X_(i_k) + g_k X_(i_k+1), a linear combination of the order statistics of indexes i_k and i_k + 1. One of the two needs to be "simulated" (e.g., the smallest one) and the other is then deterministic.
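A direct transcription of this definition, checked against NumPy's default quantile rule, which (as noted above) implements the same formula:

```python
import numpy as np

def Q(x, p):
    """Empirical quantile Q(X, p) = (1 - g) X_(i) + g X_(i+1) [Hyndman, 1996]."""
    xs = np.sort(x)
    h = (len(x) - 1) * p + 1       # 1-based position
    i = int(np.floor(h))           # integer part: index of lower order statistic
    g = h - i                      # fractional part
    return xs[i - 1] if g == 0 else (1 - g) * xs[i - 1] + g * xs[i]

x = np.random.default_rng(1).normal(size=11)
for p in (0.25, 0.5, 0.75):
    assert np.isclose(Q(x, p), np.quantile(x, p))   # matches NumPy's default
```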
Quantiles
Recall the joint density of an arbitrary number K of order statistics
f_(i_1,...,i_K)(x_1, ..., x_K) = N! ∏_{k=1}^K f(x_k) ∏_{k=0}^K (F(x_{k+1}) − F(x_k))^{i_{k+1}−i_k−1} / (i_{k+1} − i_k − 1)!
where f and F are the pdf and cdf of F_θ, resp., with the conventions x_0 = −∞, x_{K+1} = +∞, i_0 = 0 and i_{K+1} = N + 1.
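For completeness, the log of this joint density is easy to evaluate; a sketch assuming F_θ is scipy's standard normal (any frozen continuous distribution works):

```python
import numpy as np
from math import lgamma
from scipy import stats

def log_joint_order_stats(xs, idx, N, dist=stats.norm):
    """log f_(i_1,...,i_K)(x_1,...,x_K), with the conventions
    x_0 = -inf, x_{K+1} = +inf, i_0 = 0, i_{K+1} = N + 1."""
    x = np.concatenate(([-np.inf], xs, [np.inf]))
    i = np.concatenate(([0], idx, [N + 1]))
    logp = lgamma(N + 1) + dist.logpdf(xs).sum()     # N! and prod of f(x_k)
    for k in range(len(idx) + 1):                    # gaps between order stats
        gap = i[k + 1] - i[k] - 1
        logp += gap * np.log(dist.cdf(x[k + 1]) - dist.cdf(x[k])) - lgamma(gap + 1)
    return logp

# e.g. the three quartile order statistics of a N(0,1) sample of size N = 999
print(log_joint_order_stats([-0.67, 0.0, 0.67], [250, 500, 750], 999))
```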
Quantiles
Denote
K_D = {k ∈ {1, ..., K} | g_k = 0}
K_S = {k ∈ {1, ..., K} | g_k > 0}
We have {1, ..., K} = K_D ∪ K_S, with
X_(i_k) = q_k for all k ∈ K_D
X_(i_k+1) = (q_k − X_(i_k)(1 − g_k)) / g_k for all k ∈ K_S
We need to sample |K_S| order statistics to set the K + |K_S| order statistics of X that define the values of (Q(X, p_k))_{k∈{1,...,K}}.
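In code, the split into K_D and K_S and the reconstruction of the companion order statistic might look like this (hypothetical helper; x_sim holds the simulated values X_(i_k) for k ∈ K_S):

```python
import numpy as np

def split_and_reconstruct(N, ps, qs, x_sim):
    """For each quantile order p_k: either X_(i_k) = q_k (k in K_D, g_k = 0),
    or X_(i_k) = x_sim[k] is simulated and X_(i_k + 1) follows from q_k."""
    out = []
    for k, (p, q) in enumerate(zip(ps, qs)):
        h = (N - 1) * p + 1
        i = int(np.floor(h))
        g = h - i
        if g == 0:
            out.append(("K_D", i, q))                 # X_(i_k) = q_k
        else:
            upper = (q - (1 - g) * x_sim[k]) / g      # X_(i_k + 1)
            out.append(("K_S", i, x_sim[k], upper))
    return out

# Quartile orders are all deterministic for N = 1001, all simulated for N = 1000
print(split_and_reconstruct(1001, [0.25, 0.5, 0.75], [-0.67, 0.0, 0.67], [0, 0, 0]))
print(split_and_reconstruct(1000, [0.25, 0.5, 0.75], [-0.67, 0.0, 0.67],
                            [-0.7, -0.1, 0.6]))
```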
Quantiles
f_{(X_(i_k))_{k∈K_S} | (Q(X,p_1),...,Q(X,p_K))=(q_1,...,q_K)}(x_1, ..., x_{|K_S|})
∝ f_{(X_(i_k))_{k∈K_S}, (Q(X,p_1),...,Q(X,p_K))=(q_1,...,q_K)}(x_1, ..., x_{|K_S|}, q_1, ..., q_K)
∝ f_{(i)_{i∈I}}(φ^{−1}(x_1, ..., x_{|K_S|}, q_1, ..., q_K))    (1)
Median, IQR
The median is the 50% quantile of a distribution. For a sample X = (X_1, ..., X_N) of random variables, the median is defined as:
median(X) = X_(k+1) if N = 2k+1, and (X_(k) + X_(k+1))/2 if N = 2k,
where X_(i) denotes the ith order statistic of X.
The interquartile range is the difference between the 75% and 25% quantiles of a distribution. For a sample X = (X_1, ..., X_N) of i.i.d. random variables, the IQR is defined as:
IQR = Q(X, 0.75) − Q(X, 0.25)
where Q is the empirical quantile function defined above.
Median, IQR
Basic Idea:
[Diagram: Q_1, med, Q_3 cut the data into four segments of 25% each, with Q_3 − Q_1 = IQR.]
Median, IQR
If N = 4k+1 (the simplest case) we have:
Q_1 = X_(k+1),  m = X_(2k+1),  Q_3 = X_(3k+1),  i = Q_3 − Q_1
⟺ X_(k+1) = Q_1,  X_(2k+1) = m,  X_(3k+1) = Q_1 + i
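A quick numerical check of this correspondence, using the fact that NumPy's default quantile rule returns exact order statistics when N = 4k + 1:

```python
import numpy as np

k = 250                                        # N = 4k + 1 = 1001
x = np.sort(np.random.default_rng(2).normal(size=4 * k + 1))
Q1, m, Q3 = x[k], x[2 * k], x[3 * k]           # X_(k+1), X_(2k+1), X_(3k+1)
assert np.isclose(np.median(x), m)
assert np.isclose(np.quantile(x, 0.75) - np.quantile(x, 0.25), Q3 - Q1)
```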
Impact on Bayesian Model Choice
1 Introduction
2 Methodology for different statistics
  Median, MAD
  Quantiles
  Median, IQR
3 Impact on Bayesian Model Choice
4 Experiments
  Gaussian Case
  Cauchy Example
  3-Parameter Weibull Example
  Normal-Laplace Comparison
5 Conclusion
Bayesian Model Choice
Several models M_1, M_2, ... are considered simultaneously for a dataset y and the model index M is part of the inference.
Use of a prior distribution, π(M = m), plus a prior distribution on the parameter conditional on the value m of the model index, π_m(θ_m).
The goal is to derive the posterior distribution of M, a challenging computational target when models are complex.
Bayesian model choice
Comparison of models M_i by a Bayesian approach
Probabilises (!) the entire model/parameter space:
▶ allocate probabilities p_i to all models M_i
▶ define priors π_i(θ_i) for each parameter space Θ_i
▶ compute
π(M_i | x) = p_i ∫_{Θ_i} f_i(x | θ_i) π_i(θ_i) dθ_i / Σ_j p_j ∫_{Θ_j} f_j(x | θ_j) π_j(θ_j) dθ_j
[Wrinch & Jeffreys, 1919]
Bayesian model choice
Comparison of models M_i by a Bayesian approach
Relies on a central notion: the evidence
ev_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k,
aka the marginal likelihood.
[Wrinch & Jeffreys, 1919]
Generic ABC for model choice
for t = 1 to T do
    repeat
        Generate m from the prior π(M = m)
        Generate θ_m from the prior π_m(θ_m)
        Generate z from the model f_m(z | θ_m)
    until ρ{η(z), η(y)} < ε
    Set m^(t) = m and θ^(t) = θ_m
end
ABC-MC
[Toni, Welch, Strelkowa, Ipsen & Stumpf, 2009]
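A minimal Python sketch of this ABC-MC loop, instantiated for illustration on the Normal vs Laplace pair used later, with η(y) = (med(y), MAD(y)) and Euclidean ρ; the prior on θ_m and the tolerance ε are arbitrary choices here:

```python
import numpy as np
rng = np.random.default_rng(3)

def eta(z):                                      # summary statistic eta(z)
    med = np.median(z)
    return np.array([med, np.median(np.abs(z - med))])

def abc_model_choice(y, T=200, eps=0.1):
    """Accepted model indices estimate pi(M = m | eta(y))."""
    eta_y, N = eta(y), len(y)
    sample_z = [lambda th, n=N: rng.normal(th, 1.0, n),
                lambda th, n=N: rng.laplace(th, 1.0 / np.sqrt(2), n)]
    ms = []
    for _ in range(T):
        while True:
            m = int(rng.integers(2))             # m ~ pi(M = m), uniform here
            th = rng.normal(0.0, 10.0)           # th ~ pi_m(theta_m)
            z = sample_z[m](th)                  # z ~ f_m(z | theta_m)
            if np.linalg.norm(eta(z) - eta_y) < eps:
                break
        ms.append(m)
    return np.bincount(ms, minlength=2) / T      # ABC posterior of (M1, M2)

print(abc_model_choice(rng.normal(0.0, 1.0, 100)))
```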
Bridge Sampling
Approximation of Bayes factors (and other ratios of integrals)
c_1/c_2 = B_12 = ∫ π̃_1(θ | x) α(θ) π_2(θ | x) dθ / ∫ π̃_2(θ | x) α(θ) π_1(θ | x) dθ for any α(·)
≈ [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(θ_{2i} | x) α(θ_{2i}) ] / [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(θ_{1i} | x) α(θ_{1i}) ],    θ_{ji} ∼ π_j(θ | x)
[Bennett, 1976; Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
Optimal Bridge Sampling
Optimal choice
α⋆(θ) = (n_1 + n_2) / (n_1 π_1(θ | x) + n_2 π_2(θ | x))
leading to
B̂_12 ≈ [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(θ_{2i} | x) / (n_1 π_1(θ_{2i} | x) + n_2 π_2(θ_{2i} | x)) ] / [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(θ_{1i} | x) / (n_1 π_1(θ_{1i} | x) + n_2 π_2(θ_{1i} | x)) ]
(to be iterated).
Insufficient Bridge Sampling
Inclusion of the missing data into the bridge estimate
B̂_12 ≈ [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(θ_{2i}, y_{2i}) / (n_1 π_1(θ_{2i}, y_{2i}) + n_2 π_2(θ_{2i}, y_{2i})) ] / [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(θ_{1i}, y_{1i}) / (n_1 π_1(θ_{1i}, y_{1i}) + n_2 π_2(θ_{1i}, y_{1i})) ]
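The estimator, standard or insufficient, reduces to a few lines once the log density ratios are available; a sketch of the iterated optimal-bridge fixed point, where logr_k holds log(π̃_1/π̃_2) evaluated at draws from π_k — on θ alone for the standard version, or on the augmented (θ, y) for the insufficient one:

```python
import numpy as np

def bridge_B12(logr_1, logr_2, iters=100):
    """Iterated optimal bridge sampling estimate of B12 = c1/c2 given
    logr_k = log(pi1_tilde / pi2_tilde) at draws from pi_k."""
    n1, n2 = len(logr_1), len(logr_2)
    l1, l2 = np.exp(logr_1), np.exp(logr_2)
    B = 1.0
    for _ in range(iters):                        # fixed point in alpha*
        num = np.mean(l2 / (n1 * l2 / B + n2))    # ~ E_pi2[pi1_tilde * alpha]
        den = np.mean(1.0 / (n1 * l1 / B + n2))   # ~ E_pi1[pi2_tilde * alpha]
        B = num / den
    return B
```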
Results
1 Introduction
2 Methodology for different statistics
  Median, MAD
  Quantiles
  Median, IQR
3 Impact on Bayesian Model Choice
4 Experiments
  Gaussian Case
  Cauchy Example
  3-Parameter Weibull Example
  Normal-Laplace Comparison
5 Conclusion
Gaussian Case
Conjugate Normal-InverseGamma prior:
▶ Prior: μ, σ² ∼ Normal-InverseGamma(μ_0, ν, α, β)
▶ Posterior: μ, σ² | X ∼ Normal-InverseGamma(M, C, A, B) where
M = (ν μ_0 + N X̄)/(ν + N) and C = ν + N
A = α + N/2 and B = β + (1/2)( N S² + (N ν)/(ν + N) (X̄ − μ_0)² )
Robust statistics:
▶ μ ≈ med and σ ≈ MAD/Φ⁻¹(3/4) = IQR/(2 Φ⁻¹(3/4)) when N large enough
▶ efficiency: eff_med = 2/π ≈ 0.64 and eff_MAD = eff_IQR ≈ 0.36
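Step 2 of the Gibbs sampler is then exact: the hyperparameter update and the joint draw of (μ, σ²) transcribe directly (illustrative helper names):

```python
import numpy as np

def nig_posterior(x, mu0, nu, alpha, beta):
    """Normal-InverseGamma posterior hyperparameters given full data x;
    S2 is the biased empirical variance, matching the N*S^2 term above."""
    N, xbar = len(x), np.mean(x)
    S2 = np.var(x)                               # (1/N) sum (x_i - xbar)^2
    M = (nu * mu0 + N * xbar) / (nu + N)
    C = nu + N
    A = alpha + N / 2
    B = beta + 0.5 * (N * S2 + N * nu / (nu + N) * (xbar - mu0) ** 2)
    return M, C, A, B

def draw_mu_sigma2(M, C, A, B, rng=np.random.default_rng()):
    sigma2 = 1.0 / rng.gamma(A, 1.0 / B)         # sigma^2 ~ InvGamma(A, B)
    mu = rng.normal(M, np.sqrt(sigma2 / C))      # mu | sigma^2 ~ N(M, sigma^2/C)
    return mu, sigma2
```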
Gaussian Example
For comparison purposes, consider approximating the simulated posterior π̃(μ, σ² | med, MAD) by Normal-InverseGamma(M̃, C̃, Ã, B̃) where
M̃ = (ν μ_0 + N·eff_med·med)/(ν + N·eff_med),  C̃ = ν + N·eff_med,
Ã = α + (N·eff_MAD)/2,
B̃ = β + (1/2)( N·eff_MAD·(C·MAD)² + (N·eff_MAD·ν)/(ν + N·eff_MAD) (med − μ_0)² ),
with C = 1/Φ⁻¹(3/4) the consistency constant turning MAD into an estimate of σ.
Gaussian Example
Figure: μ and σ for a sample of size N = 1000, med = −5, MAD = 3, given T(X) = (med, MAD), with the Robust Gibbs in blue and the approximation in red.
Cauchy Example
▶ Density:
f_θ(x) = 1/(πγ) · 1/(1 + ((x − x_0)/γ)²)
so θ = (x_0, γ) where x_0 ∈ ℝ (location parameter) and γ > 0 (scale parameter)
▶ Heavy-tailed distribution: E_θ(X) is not defined and Var_θ(X) = ∞
▶ Convergent estimators: x_0 ≈ med and γ ≈ MAD when N → ∞
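Since the quartiles of a Cauchy(x_0, γ) distribution sit at x_0 ± γ, the MAD converges to γ exactly; a quick simulation check:

```python
import numpy as np
rng = np.random.default_rng(4)

x0, gamma, N = 0.0, 2.0, 100_000
x = x0 + gamma * rng.standard_cauchy(N)
med = np.median(x)
mad = np.median(np.abs(x - med))
print(med, mad)    # both close to (x0, gamma), despite undefined moments
```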
Cauchy Example
Figure: x_0 and γ for a sample of size N = 1000, med = 0, MAD = 1, given T(X) = (med, MAD) in blue and given X in orange.
3-Parameter Weibull Example
Used to model income data.
Density:
f(x) = (α/β) ((x − x_0)/β)^{α−1} exp(−((x − x_0)/β)^α)
where α > 0 (shape parameter), β > 0 (scale parameter) and x_0 ∈ ℝ (location parameter).
Quantiles method for different values of K.
Weibull Example
Figure: parameter posteriors for a sample of size N = 1000 given T(X) = ((q_k)_{k=1,...,K}, (p_k)_{k=1,...,K}), with the Robust Gibbs for K = 3 (in blue), 4 (in orange) and 9 (in green).
Normal-Laplace Comparison
Comparison of model M_1: y ∼ N(θ_1, 1) against model M_2: y ∼ L(θ_2, 1/√2), where the Laplace distribution has mean θ_2 and scale parameter 1/√2, resulting in a variance of 1.
Since the median and the first- and second-order empirical moments asymptotically match for these models, these summary statistics are useless for model selection.
Model selection is possible when using the MAD as summary statistic.
[Marin & al., 2013]
Normal-Laplace Comparison
Computation of the Bayes factor when
T(y) = (med(y), MAD(y))
by bridge sampling (and not an ABC approximation).
Figure: posterior probability p(1 | T_0) of model M_1 as the number of observations N increases, for Gaussian and Laplace simulated samples, resp. Boxplots based on 100 Monte Carlo runs with 10^4 bridge sampling simulations.
Conclusion
▶ Among the three robust statistics addressed in this paper, two exhibited similarities, yielding results comparable to those of existing methods.
▶ The case of the median absolute deviation (MAD) poses a novel challenge, for which our partial data augmentation ensures ergodicity.
Conclusion
Exact method for Bayesian model choice by extending bridge sampling estimators while relying solely on observed statistics, albeit limited by both the standard conditions of bridge sampling and the size of the simulated data.
While focussed on continuous univariate distributions, the extension to various observed statistics, discrete distributions and multivariate data is formally identical.
The elephant in the ocean
The method falls victim to the curse of dimensionality when the dimension or the size of the data grows too large.
▶ A novel intermediary step, similar to principal components, towards identifying the most discriminating subspace would be necessary for an extension.
▶ Since the best summary statistic for ABC model choice is the pair of evidences ev_k (k = 1, 2), simulating this pair directly sounds like our Holy Grail...
[Prangle & al., 2013]