Insufficient Gibbs sampling (A. Luciano, C.P. Robert and R. Ryder)


About This Presentation

Slides of a presentation at the COMPSTAT 2024 conference in Gießen, Hesse, Germany, on 28 Aug 2024


Slide Content

Insufficient Gibbs sampling
Christian P. Robert
U. Paris Dauphine PSL & Warwick U.
Joint work with Antoine Luciano and Robin Ryder

ERC call for oceaneers
Ongoing 2023-2030 ERC funding for PhD and postdoctoral collaborations with
▶ Michael Jordan (Berkeley)
▶ Eric Moulines (Paris)
▶ Gareth O. Roberts (Warwick)
▶ myself (Paris)
on the OCEAN (On IntelligenCE And Networks) project

Outline
1. Introduction
2. Methodology for different statistics
   Median, MAD
   Quantiles
   Median, IQR
3. Impact on Bayesian Model Choice
4. Experiments
   Gaussian Case
   Cauchy Example
   3-Parameter Weibull Example
   Normal-Laplace Comparison
5. Conclusion

Motivations
▶ Framework: only robust statistics T(X) of data X ∈ R^N are available, attached with a parametric model F_θ on X.
▶ Goal: sample from the posterior distribution of the parameters θ given T(X), π(θ | T(X)).
▶ Issue: the posterior is intractable for most robust statistics.
▶ State-of-the-art: Approximate Bayesian Computation (ABC) methods with T(X) as a summary statistic only deliver a sample from an approximation of π(θ | X) (that may be close to π(θ | T(X))).
[Marin & al., 2011]
▶ Connections with


Contribution
Turning X into a latent variable:

π(θ | T(x)) ∝ ∫_{R^N} π(θ, x | T(x)) dx ∝ ∫_{R^N} π(θ | x) π(x | T(x)) dx

Two-step simulation scheme:
1. X from π(X | T(X), θ), under the assumption X ∼ F_θ
2. θ from the posterior given the augmented data, π(θ | X)

Gibbs Sampler
Resolution by data completion, since the full posterior can be directly simulated (e.g., conjugate prior) or handled with a MH step
[Tanner & Wong, 1987]

X^0 ← X_init(N, T(X), F);
for t = 1, ..., T do
    X^t | T(X), θ^{t−1} ∼ π(· | X^{t−1}, T(X), F_{θ^{t−1}});
    θ^t | X^t ∼ π(· | X^t);
end

Gibbs Sampler / Data Augmentation
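
To fix ideas, here is a minimal Python sketch of this data-augmentation loop. The two conditional samplers, sample_x_given_T and sample_theta_given_x, are hypothetical placeholders for the statistic-specific steps detailed in the next slides, not the authors' released code.

    import numpy as np

    def insufficient_gibbs(T_obs, theta_init, x_init,
                           sample_x_given_T, sample_theta_given_x, n_iter=1000):
        # Alternate the two conditional draws of the sampler above.
        x, theta = x_init, theta_init
        draws = []
        for _ in range(n_iter):
            # 1. refresh the latent data while keeping T(x) at its observed value
            x = sample_x_given_T(x, T_obs, theta)
            # 2. draw theta from the full-data posterior pi(theta | x)
            theta = sample_theta_given_x(x)
            draws.append(theta)
        return np.array(draws)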

Methodology for different statistics
1. Introduction
2. Methodology for different statistics
   Median, MAD
   Quantiles
   Median, IQR
3. Impact on Bayesian Model Choice
4. Experiments
   Gaussian Case
   Cauchy Example
   3-Parameter Weibull Example
   Normal-Laplace Comparison
5. Conclusion

Median, MAD
Here we have T(X) = (median(X), MAD(X)) where:
(a) the median is the 50% quantile of the empirical distribution, i.e., for a sample X = (X_1, ..., X_N),

median(X) = X_(k+1)                  if N = 2k+1
median(X) = (X_(k) + X_(k+1)) / 2    if N = 2k

where X_(i) denotes the ith order statistic of the sample;
(b) the median absolute deviation (MAD) is a measure of statistical dispersion that is commonly used as a robust alternative to the standard deviation, defined as:

MAD(X) = median(|X − median(X)|)
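
In code, these two definitions are a direct transcription (numpy's median handles the odd/even distinction internally):

    import numpy as np

    def median_mad(x):
        # median: middle order statistic (odd N) or midpoint of the two middle ones (even N)
        med = np.median(x)
        # MAD: median of the absolute deviations from the median
        mad = np.median(np.abs(x - med))
        return med, mad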

Median, MAD
Basic Idea:

[Diagram: the real line split at med − MAD, med and med + MAD into four zones of proportions 50 − k%, k%, 50 − k%, k%, with k ∈ [0, 50]; the central interval [med − MAD, med + MAD] contains 50% of the sample]

The distribution of k is unknown, thus a direct sampling of the data vector is impossible.
Gibbs alternative: randomly select (or not) two coordinates (i, j) ∈ {1, ..., N}² for resampling

X_i^{t+1}, X_j^{t+1} | T(X^t), X^t_{−(i,j)}, θ^t


Median, MAD: Odd Case
Median and MAD being based on order statistics, two different settings arise for odd and even N.

Median, MAD: Odd Case
Randomly pick indices i and j:
1. (X_i, X_j) = (m, X_MAD): (X̃_i, X̃_j) = (X_i, X_j)
2. X_j = m: X̃_i ∼ F_θ 1_{Zone(X_i)} and X̃_j = X_j
3. X_j = X_MAD:
   3.1 (X_i − m)(X_j − m) > 0: X̃_i ∼ F_θ 1_{Zone(X_i)} and X̃_j = X_MAD
   3.2 otherwise:
       ▶ X̃_i ∼ F_θ 1_{Zone(X_i) ∪ Zone_S(X_i)}
       ▶ X̃_j = m − s if X̃_i > m, m + s else
4. otherwise:
   4.1 (Zone(X_i), Zone(X_j)) = (Z_1, Z_3) or (Z_2, Z_4):
       ▶ X̃_i ∼ F_θ
       ▶ X̃_j ∼ F_θ 1_{Zone_C(X̃_i)}
   4.2 otherwise: X̃_i ∼ F_θ 1_{Zone(X_i)} and X̃_j ∼ F_θ 1_{Zone(X_j)}
Then set (X_i, X_j) = (X̃_i, X̃_j).

Median, MAD: Even Case

Quantiles
Now assume one observes only sample quantiles, (q_1, ..., q_K) and (p_1, ..., p_K), with p_k ∈ [0, 1] and q_k ∈ R for k ∈ {1, ..., K}, such that

Q(p_k) = q_k,

Q denoting the empirical quantile function of the sample. So

T(X) = ((q_1, ..., q_K), (p_1, ..., p_K))

Quantiles
General case:

[Diagram: the real line split by q_1, q_2, ..., q_{K−1}, q_K into zones of proportions p_1%, p_2 − p_1%, ..., p_K − p_{K−1}%, 100 − p_K%]

Example (K = 3): if (p_k)_{k∈{1,...,K}} = (.25, .5, .75) and (q_k)_{k∈{1,...,K}} = (q_1, med, q_3), then we have:

[Diagram: four zones of 25% each, delimited by q_1, med, q_3]

Quantile specificity
Definition

Q(X, p) = (1 − g) X_(i) + g X_(i+1)

where h = (N − 1)p + 1, i is the integer part of h, i.e. i = ⌊h⌋, and g its fractional part, i.e. g = h − i.
[Hyndman, 1996]
▶ used by default in R, Julia, NumPy, and Excel
▶ If g_k = 0, we have Q(X, p_k) = X_(i_k), a "deterministic" order statistic, and i_k denotes its index.
▶ If g_k > 0, we have Q(X, p_k) = (1 − g_k) X_(i_k) + g_k X_(i_k+1), a linear combination of the order statistics of indexes i_k and i_k + 1. One of the two needs to be "simulated" (e.g., the smallest one) and the other is then deterministic.
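
A short Python transcription of this definition, which reproduces the default (type-7) behaviour of np.quantile:

    import numpy as np

    def sample_quantile(x, p):
        # Q(X, p) = (1 - g) X_(i) + g X_(i+1), with h = (N - 1) p + 1
        xs = np.sort(x)
        h = (len(xs) - 1) * p + 1   # 1-based position of the quantile
        i = int(h)                  # integer part, i = floor(h)
        g = h - i                   # fractional part, g = h - i
        if g == 0:                  # "deterministic" case: an exact order statistic
            return xs[i - 1]
        return (1 - g) * xs[i - 1] + g * xs[i]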

Quantiles
Recall the joint density of an arbitrary number K of order statistics:

f_{(i_1,...,i_K)}(x_1, ..., x_K) = N! ∏_{k=1}^{K} f(x_k) ∏_{k=0}^{K} (F(x_{k+1}) − F(x_k))^{i_{k+1} − i_k − 1} / (i_{k+1} − i_k − 1)!

where f and F are the pdf and cdf of F_θ, respectively (with the conventions x_0 = −∞, x_{K+1} = +∞, i_0 = 0 and i_{K+1} = N + 1).

Quantiles
Denote

K_D = {k ∈ {1, ..., K} | g_k = 0}
K_S = {k ∈ {1, ..., K} | g_k > 0}

We have {1, ..., K} = K_D ∪ K_S, with

q_k = X_(i_k)                             ∀ k ∈ K_D
q_k = (1 − g_k) X_(i_k) + g_k X_(i_k+1)   ∀ k ∈ K_S

Quantiles
Denote

K_D = {k ∈ {1, ..., K} | g_k = 0}
K_S = {k ∈ {1, ..., K} | g_k > 0}

We have {1, ..., K} = K_D ∪ K_S, with

X_(i_k) = q_k                                  ∀ k ∈ K_D
X_(i_k+1) = (q_k − (1 − g_k) X_(i_k)) / g_k    ∀ k ∈ K_S

We need to sample |K_S| order statistics to set the K + |K_S| order statistics of X that define the values of (Q(X, p_k))_{k∈{1,...,K}}.
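
For illustration, the partition into K_D and K_S follows directly from the (h, i, g) decomposition above; a small sketch, not the authors' code:

    import math

    def split_indices(N, ps):
        # classify each p_k as deterministic (g_k = 0) or to-be-simulated (g_k > 0)
        KD, KS = [], []
        for k, p in enumerate(ps, start=1):
            h = (N - 1) * p + 1
            g = h - math.floor(h)
            (KD if g == 0 else KS).append(k)
        return KD, KS

    # e.g. split_indices(1000, [.25, .5, .75]) -> ([], [1, 2, 3])
    #      split_indices(101,  [.25, .5, .75]) -> ([1, 2, 3], [])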

Quantiles

f_{(X_(i_k))_{k∈K_S} | (Q(X,p_1),...,Q(X,p_K)) = (q_1,...,q_K)}(x_1, ..., x_{|K_S|})
  ∝ f_{(X_(i_k))_{k∈K_S}, (Q(X,p_1),...,Q(X,p_K)) = (q_1,...,q_K)}(x_1, ..., x_{|K_S|}, q_1, ..., q_K)
  ∝ f_{(i)_{i∈I}}(x_1, ..., x_{|K_S|}, q_1, ..., q_K)    (1)

where I collects the indexes of the K + |K_S| order statistics involved (the arguments being recovered through the inverse of the linear relations of the previous slide).

Median, IQR
The median is the 50% quantile of a distribution. For a sample X = (X_1, ..., X_N) of random variables, the median is defined as:

median(X) = X_(k+1)                  if N = 2k+1
median(X) = (X_(k) + X_(k+1)) / 2    if N = 2k

where X_(i) denotes the ith order statistic of X.
The interquartile range (IQR) is the difference between the 75% and 25% quantiles of a distribution. For a sample X = (X_1, ..., X_N) of i.i.d. random variables, the IQR is defined as:

IQR = Q(X, 0.75) − Q(X, 0.25)

where Q(X, ·) is the sample quantile function defined above.

Median, IQR
Basic Idea:

[Diagram: four zones of 25% each delimited by Q_1, med, Q_3, with Q_3 − Q_1 = IQR]

Median, IQR
If N = 4k + 1 (simplest case), we have:

{ Q_1 = X_(k+1),  m = X_(2k+1),  Q_3 = X_(3k+1),  i = Q_3 − Q_1 }
  ⟺  { X_(k+1) = Q_1,  X_(2k+1) = m,  X_(3k+1) = Q_1 + i }

Impact on Bayesian Model Choice
1. Introduction
2. Methodology for different statistics
   Median, MAD
   Quantiles
   Median, IQR
3. Impact on Bayesian Model Choice
4. Experiments
   Gaussian Case
   Cauchy Example
   3-Parameter Weibull Example
   Normal-Laplace Comparison
5. Conclusion

Bayesian Model Choice
Several models M_1, M_2, ... are considered simultaneously for a dataset y and the model index M is part of the inference.
Use of a prior distribution, π(M = m), plus a prior distribution on the parameter conditional on the value m of the model index, π_m(θ_m).
The goal is to derive the posterior distribution of M, a challenging computational target when models are complex.

Bayesian model choice
Comparison of models M_i by a Bayesian approach
Probabilises (!) the entire model/parameter space:
▶ allocate probabilities p_i to all models M_i
▶ define priors π_i(θ_i) for each parameter space Θ_i
▶ compute

π(M_i | x) = p_i ∫_{Θ_i} f_i(x|θ_i) π_i(θ_i) dθ_i / Σ_j p_j ∫_{Θ_j} f_j(x|θ_j) π_j(θ_j) dθ_j

[Wrinch & Jeffreys, 1919]
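
As a small numerical illustration of this formula, assuming the evidences (the integrals) are already available on the log scale:

    import numpy as np

    def posterior_model_probs(log_evidences, prior_probs):
        # pi(M_i | x) proportional to p_i times evidence_i, computed stably in logs
        log_post = np.log(prior_probs) + np.asarray(log_evidences, dtype=float)
        log_post -= log_post.max()      # avoid overflow before exponentiating
        w = np.exp(log_post)
        return w / w.sum()

    # e.g. posterior_model_probs([-512.3, -515.1], [0.5, 0.5]) -> approx. [0.943, 0.057]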

Bayesian model choice
Comparison of models M_i by a Bayesian approach
Relies on a central notion: the evidence

ev_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k,

aka the marginal likelihood.
[Wrinch & Jeffreys, 1919]

Generic ABC for model choice

for t = 1 to T do
    repeat
        Generate m from the prior π(M = m)
        Generate θ_m from the prior π_m(θ_m)
        Generate z from the model f_m(z|θ_m)
    until ρ{η(z), η(y)} < ε
    Set m^(t) = m and θ^(t) = θ_m
end

ABC-MC
[Toni, Welch, Strelkowa, Ipsen & Stumpf, 2009]
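
A compact Python rendering of this rejection scheme; the model prior is taken uniform for simplicity, and priors, simulators, eta and rho are user-supplied callables whose names are illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)

    def abc_model_choice(y, priors, simulators, eta, rho, eps, T=1000):
        eta_y, draws = eta(y), []
        for _ in range(T):
            while True:
                m = rng.integers(len(priors))   # m ~ pi(M = m), here uniform
                theta = priors[m]()             # theta_m ~ pi_m(theta_m)
                z = simulators[m](theta)        # z ~ f_m(z | theta_m)
                if rho(eta(z), eta_y) < eps:    # keep only close pseudo-samples
                    break
            draws.append((m, theta))
        return draws   # the frequency of each m estimates pi(M = m | y)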

Back to sufficiency
'Sufficient statistics for individual models are unlikely to be very informative for the model probability.'
[Scott Sisson, Jan. 31, 2011, X.'Og]
If η_1(x) is a sufficient statistic for model m = 1 and parameter θ_1, and η_2(x) is a sufficient statistic for model m = 2 and parameter θ_2, then (η_1(x), η_2(x)) is not always sufficient for (m, θ_m)
© Potential loss of information at the testing level
[Marin & al., 2013]


Limiting behaviour of B12 (under sufficiency)
If η(y) is a sufficient statistic for both models,

f_i(y|θ_i) = g_i(y) f_i^η(η(y)|θ_i)

Thus

B12(y) = ∫_{Θ_1} π(θ_1) g_1(y) f_1^η(η(y)|θ_1) dθ_1 / ∫_{Θ_2} π(θ_2) g_2(y) f_2^η(η(y)|θ_2) dθ_2
       = g_1(y) ∫ π_1(θ_1) f_1^η(η(y)|θ_1) dθ_1 / [ g_2(y) ∫ π_2(θ_2) f_2^η(η(y)|θ_2) dθ_2 ]
       = g_1(y)/g_2(y) × B12^η(y).

[Didelot, Everitt, Johansen & Lawson, 2011]
© No discrepancy only when cross-model sufficiency holds


Bridge Sampling
Approximation of Bayes factors (and other ratios of integrals):

c_1/c_2 = B12 = ∫ π̃_1(θ|x) α(θ) π_2(θ|x) dθ / ∫ π̃_2(θ|x) α(θ) π_1(θ|x) dθ,   for any α(·),

≈ [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(θ_{2i}|x) α(θ_{2i}) ] / [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(θ_{1i}|x) α(θ_{1i}) ],   θ_{ji} ∼ π_j(θ|x)

[Bennett, 1976; Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

Optimal Bridge Sampling
Optimal choice:

α*(θ) = (n_1 + n_2) / (n_1 π_1(θ|x) + n_2 π_2(θ|x))

leading to

B̂12 ≈ [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(θ_{2i}|x) / {n_1 π_1(θ_{2i}|x) + n_2 π_2(θ_{2i}|x)} ]
      / [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(θ_{1i}|x) / {n_1 π_1(θ_{1i}|x) + n_2 π_2(θ_{1i}|x)} ]

(to be iterated).
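
A sketch of the corresponding fixed-point iteration in Python, with log-scale inputs for stability; log_pi1_t and log_pi2_t stand for the unnormalised log-posteriors, and the interface is illustrative:

    import numpy as np

    def optimal_bridge(log_pi1_t, log_pi2_t, theta1, theta2, n_steps=50):
        # theta1 ~ pi_1(.|x), theta2 ~ pi_2(.|x); returns an estimate of B12
        n1, n2 = len(theta1), len(theta2)
        l11 = np.array([log_pi1_t(t) for t in theta1])
        l21 = np.array([log_pi2_t(t) for t in theta1])
        l12 = np.array([log_pi1_t(t) for t in theta2])
        l22 = np.array([log_pi2_t(t) for t in theta2])
        shift = max(l.max() for l in (l11, l21, l12, l22))   # common rescaling, cancels in the ratio
        q11, q21, q12, q22 = (np.exp(l - shift) for l in (l11, l21, l12, l22))
        B = 1.0
        for _ in range(n_steps):   # iterate the optimal-alpha estimator to its fixed point
            num = np.mean(q12 / (n1 * q12 + n2 * B * q22))
            den = np.mean(q21 / (n1 * q11 + n2 * B * q21))
            B = num / den
        return B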

Insufficient Bridge Sampling
Inclusion of the missing data into the bridge estimate:

B̂12 ≈ [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(θ_{2i}, y_{2i}) / {n_1 π_1(θ_{2i}, y_{2i}) + n_2 π_2(θ_{2i}, y_{2i})} ]
      / [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(θ_{1i}, y_{1i}) / {n_1 π_1(θ_{1i}, y_{1i}) + n_2 π_2(θ_{1i}, y_{1i})} ]

Results
1. Introduction
2. Methodology for different statistics
   Median, MAD
   Quantiles
   Median, IQR
3. Impact on Bayesian Model Choice
4. Experiments
   Gaussian Case
   Cauchy Example
   3-Parameter Weibull Example
   Normal-Laplace Comparison
5. Conclusion

Gaussian Case
Conjugate Normal-Inverse Gamma prior:
▶ Prior: µ, σ² ∼ Normal-InverseGamma(µ_0, ν, α, β)
▶ Posterior: µ, σ² | X ∼ Normal-InverseGamma(M, C, A, B)
where

M = (ν µ_0 + N X̄) / (ν + N)   and   C = ν + N
A = α + N/2   and   B = β + (1/2) ( N S² + (N ν / (ν + N)) (X̄ − µ_0)² )

Robust statistics:
▶ µ ≈ med and σ ≈ MAD / Φ^{−1}(3/4) = IQR / (2 Φ^{−1}(3/4)) when N is large enough
▶ efficiency: eff_med = 2/π ≈ 0.64 and eff_MAD = eff_IQR ≈ 0.36
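
These plug-in relations are one-liners in Python (Φ^{−1}(3/4) via scipy):

    import numpy as np
    from scipy.stats import norm

    def robust_gaussian_estimates(x):
        c = norm.ppf(0.75)          # Phi^{-1}(3/4), about 0.6745
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        iqr = np.quantile(x, 0.75) - np.quantile(x, 0.25)
        # mu estimated by the median; sigma by MAD or IQR, rescaled for consistency
        return med, mad / c, iqr / (2 * c)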

Gaussian Example
For comparison purposes, consider approximating the simulated posterior π̃(µ, σ² | med, MAD) by Normal-InverseGamma(M̃, C̃, Ã, B̃) where

M̃ = (ν µ_0 + N · eff_med · med) / (ν + N · eff_med),   C̃ = ν + N · eff_med,
Ã = α + (N · eff_MAD) / 2,
B̃ = β + (1/2) ( N · eff_MAD · (C · MAD)² + (N · eff_MAD · ν / (ν + N · eff_MAD)) (X̄ − µ_0)² ).

Gaussian Example
Figure: posteriors of µ and σ for a sample of size N = 1000 with med = −5, MAD = 3, given T(X) = (med, MAD), with the Robust Gibbs in blue and the approximation in red.

Cauchy Example
▶ Density:

f_θ(x) = 1 / (πγ [1 + ((x − x_0)/γ)²])

so θ = (x_0, γ), where x_0 ∈ R (location parameter) and γ > 0 (scale parameter)
▶ Heavy-tailed distribution: E_θ(X) is not defined and Var_θ(X) = ∞
▶ Convergent estimators: x_0 ≈ med and γ ≈ MAD when N → ∞
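
A quick simulation check of these convergences (med → x_0 and MAD → γ, the MAD of a Cauchy(x_0, γ) being exactly γ):

    import numpy as np

    rng = np.random.default_rng(1)
    x0, gamma, N = 0.0, 1.0, 100_000
    x = x0 + gamma * rng.standard_cauchy(N)   # Cauchy(x0, gamma) sample
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    print(f"med = {med:.3f} (x0 = {x0}), MAD = {mad:.3f} (gamma = {gamma})")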

Cauchy Example
Figure: posteriors of x_0 and γ for a sample of size N = 1000 with med = 0, MAD = 1, given T(X) = (med, MAD) in blue and given X in orange.

3-Parameter Weibull Example
Used to model income data.
Density:

f(x) = (α/β) ((x − x_0)/β)^{α−1} e^{−((x − x_0)/β)^α}

where α > 0 (shape parameter), β > 0 (scale parameter) and x_0 ∈ R (location parameter).
Quantiles method for different values of K.
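
Simulating test samples from this model is straightforward by inverse transform of the cdf implied by the density above; a minimal sketch:

    import numpy as np

    rng = np.random.default_rng(2)

    def rweibull3(n, alpha, beta, x0):
        # invert F(x) = 1 - exp(-((x - x0) / beta)^alpha)
        u = rng.uniform(size=n)
        return x0 + beta * (-np.log(u)) ** (1 / alpha)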

Weibull Example
Figure: posteriors for a sample of size N = 1000 given T(X) = ((q_k)_{k=1,...,K}, (p_k)_{k=1,...,K}) with the Robust Gibbs for K = 3 (in blue), 4 (in orange) and 9 (in green).

Normal-Laplace Comparison
Comparison of model M_1: y ∼ N(θ_1, 1) against model M_2: y ∼ L(θ_2, 1/√2), where the Laplace distribution has mean θ_2 and scale parameter 1/√2, resulting in a variance of 1.
Since the median and the first- and second-order empirical moments asymptotically match for these models, such summary statistics are useless for model selection.
Model selection is possible when using the MAD as summary statistic.
[Marin & al., 2013]
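
A quick numerical confirmation that the MAD separates the two models while the first two moments match; theoretically MAD ≈ Φ^{−1}(3/4) ≈ 0.674 under the Gaussian and ln(2)/√2 ≈ 0.490 under the Laplace:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 100_000
    samples = {"Gauss":   rng.normal(0.0, 1.0, N),
               "Laplace": rng.laplace(0.0, 1 / np.sqrt(2), N)}   # variance 2 b^2 = 1
    for name, x in samples.items():
        mad = np.median(np.abs(x - np.median(x)))
        print(f"{name}: mean {x.mean():.3f}, var {x.var():.3f}, MAD {mad:.3f}")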

Normal-Laplace Comparison
Computation of the Bayes factor when T(y) = (med(y), MAD(y)) by bridge sampling (and not an ABC approximation).

[Figure: boxplots of the posterior probability p(M_1 | T_0) of model M_1 as the number of observations N increases, for Gaussian and Laplace simulated samples, resp. Boxplots based on 100 Monte Carlo runs with 10^4 bridge sampling simulations.]

Conclusion
▶ Among the three robust statistics addressed in this paper, two exhibited similarities, yielding comparable results to existing methods.
▶ The case of the median absolute deviation (MAD) is a novel challenge, for which our partial data augmentation ensures ergodicity.

Conclusion
Exact method for Bayesian model choice by extending bridge sampling estimators, while relying solely on observed statistics, if limited by both the standard conditions of bridge sampling and the size of the simulated data.
While focussed on continuous univariate distributions, the extension to various observed statistics, discrete distributions and multivariate data is formally identical.


The elephant in the ocean
Method falling victim to the curse of dimensionality when the dimension or size of the data is too large.
▶ A novel intermediary step, similar to principal components, towards identifying the most discriminating subspace would be necessary for an extension.
▶ Since the best summary statistic for ABC model choice is the pair of evidences ev_k (k = 1, 2), simulating directly this pair sounds like our Holy Grail...
[Prangle & al., 2013]