bayes_machine_learning_book for data scientists

About This Presentation

An overview of Bayesian theory, which is very important in machine learning, and of how we can use it for ML.


Slide Content

BML lecture #1: Bayesics
http://github.com/rbardenet/bml-course
Rémi Bardenet
[email protected]
CNRS & CRIStAL, Univ. Lille, France
1 / 38

What comes to your mind when you hear "Bayesian ML"?
2 / 38

Course outline
3 / 38

Outline
1. A warmup: Estimation in regression models
2. ML as data-driven decision-making
3. Subjective expected utility
4. Specifying joint models
5. 50 shades of Bayes
4 / 38

Quotes from Gelman et al., 2013 on Bayesian methods
▶ [...] practical methods for making inferences from data, using probability models for quantities we observe and for quantities about which we wish to learn.
▶ The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis.
▶ Three steps:
1. Setting up a full probability model,
2. Conditioning on observed data, calculating and interpreting the appropriate “posterior distribution”,
3. Evaluating the fit of the model and the implications of the resulting posterior distribution. In response, one can alter or expand the model and repeat the three steps.
5 / 38
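To make the three steps concrete, here is a minimal sketch on an assumed toy coin-flipping problem (my own example, not from the slides): a Beta prior with a Bernoulli likelihood gives a closed-form posterior, and a simple posterior predictive check probes the fit.

```python
# Toy illustration of the three steps (assumed Beta-Bernoulli example, not from the slides).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Step 1: set up a full probability model.
# theta ~ Beta(a, b), y_i | theta ~ Bernoulli(theta), i = 1, ..., n.
a, b = 1.0, 1.0                      # uniform prior on theta
y = rng.binomial(1, 0.7, size=50)    # observed data (simulated here for the example)

# Step 2: condition on the observed data.
# Conjugacy gives the posterior in closed form: theta | y ~ Beta(a + sum(y), b + n - sum(y)).
posterior = stats.beta(a + y.sum(), b + len(y) - y.sum())
print("posterior mean:", posterior.mean())

# Step 3: evaluate the fit.
# Draw replicated datasets from the posterior predictive and compare a test
# statistic (here the total number of successes) with the observed value.
theta_rep = posterior.rvs(size=1000, random_state=rng)
replicated_sums = rng.binomial(len(y), theta_rep)
print("posterior predictive p-value:", np.mean(replicated_sums >= y.sum()))
```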

Notation that I will try to stick to
▶ $y_{1:n} = (y_1, \dots, y_n) \in \mathcal{Y}^n$ denote observable data/labels.
▶ $x_{1:n} \in \mathcal{X}^n$ denote covariates/features/hidden states.
▶ $z_{1:n} \in \mathcal{Z}^n$ denote hidden variables.
▶ $\theta \in \Theta$ denotes parameters.
▶ $X$ denotes an $\mathcal{X}$-valued random variable. Lowercase $x$ denotes either a point in $\mathcal{X}$ or an $\mathcal{X}$-valued random variable.
6 / 38

More notation
▶ Whenever it can easily be made formal, we write densities for our random variables and let the context indicate what is meant. So if $X \sim \mathcal{N}(0, \sigma^2)$, we write
$$\mathbb{E}\, h(X) = \int h(x)\, \frac{e^{-x^2/2\sigma^2}}{\sigma\sqrt{2\pi}}\, \mathrm{d}x = \int h(x)\, p(x)\, \mathrm{d}x.$$
Similarly, for $X \sim \mathcal{P}(\lambda)$, we write
$$\mathbb{E}\, h(X) = \sum_{k=0}^{\infty} h(k)\, e^{-\lambda} \frac{\lambda^k}{k!} = \int h(x)\, p(x)\, \mathrm{d}x.$$
▶ All pdfs are denoted by $p$, so that, e.g.,
$$\mathbb{E}\, h(Y, \theta) = \int h(y, \theta)\, p(y, \theta)\, \mathrm{d}y\, \mathrm{d}\theta
= \int h(y, \theta)\, p(y, x, \theta)\, \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}\theta
= \int h(y, \theta)\, p(y, \theta \mid x)\, p(x)\, \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}\theta.$$
7 / 38
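As a quick numerical sanity check of this convention (my own illustration, not part of the slides), both expectations can be approximated with NumPy/SciPy and compared to their known closed-form values.

```python
# Checking E h(X) for the Gaussian and Poisson examples (assumed illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
h = lambda x: x ** 2                         # any test function; here E h(X) is known

# X ~ N(0, sigma^2): E h(X) = integral of h(x) p(x) dx = sigma^2 for h(x) = x^2.
sigma = 2.0
samples = rng.normal(0.0, sigma, size=200_000)
print(h(samples).mean(), "vs", sigma ** 2)   # Monte Carlo vs exact

# X ~ Poisson(lambda): E h(X) = sum_k h(k) e^{-lambda} lambda^k / k! = lambda + lambda^2.
lam = 3.0
k = np.arange(0, 200)
print(np.sum(h(k) * stats.poisson.pmf(k, lam)), "vs", lam + lam ** 2)
```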

Outline
1. A warmup: Estimation in regression models
2. ML as data-driven decision-making
3. Subjective expected utility
4. Specifying joint models
5. 50 shades of Bayes
8 / 38

Outline
1. A warmup: Estimation in regression models
2. ML as data-driven decision-making
3. Subjective expected utility
4. Specifying joint models
5. 50 shades of Bayes
9 / 38

Inference in regression models
10 / 38

Inference in regression models
11 / 38

Inference in regression models
12 / 38

Inference in regression models
13 / 38

Inference in regression models
14 / 38

Outline
1. A warmup: Estimation in regression models
2. ML as data-driven decision-making
3. Subjective expected utility
4. Specifying joint models
5. 50 shades of Bayes
15 / 38

Describing a decision problem under uncertainty
▶ A state space $\mathcal{S}$: every quantity you need to consider to make your decision.
▶ Actions $\mathcal{A} \subset \mathcal{F}(\mathcal{S}, \mathcal{Z})$: making a decision means picking one of the available actions.
▶ A reward space $\mathcal{Z}$: encodes how you feel about having picked a particular action.
▶ A loss function $L : \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$: how much you would suffer from picking action $a$ in state $s$.
16 / 38
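One way to make these four ingredients concrete in code is the following sketch (the names and the squared-error example are my own assumptions, not from the course): actions are functions of the state, and the loss scores an action in a given state.

```python
# Minimal encoding of a decision problem under uncertainty (illustrative names only).
from typing import Callable

State = float                          # S: here a single unknown quantity
Reward = float                         # Z: here a real-valued guess
Action = Callable[[State], Reward]     # A is a subset of F(S, Z)

def loss(a: Action, s: State) -> float:
    """L(a, s): how much you suffer from picking action a in state s."""
    return (a(s) - s) ** 2             # e.g. squared error for an estimation problem

guess_zero: Action = lambda s: 0.0     # one particular (rather stubborn) action
print(loss(guess_zero, 1.5))           # 2.25
```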

Classification as a decision problem
▶ $\mathcal{S} = \mathcal{X}^n \times \mathcal{Y}^n \times \mathcal{X} \times \mathcal{Y}$, i.e. $s = (x_{1:n}, y_{1:n}, x, y)$.
▶ $\mathcal{Z} = \{0, 1\}$.
▶ $\mathcal{A} = \{ a_g : s \mapsto 1_{y \neq g(x;\, x_{1:n}, y_{1:n})},\ g \in \mathcal{G} \}$.
▶ $L(a_g, s) = 1_{y \neq g(x;\, x_{1:n}, y_{1:n})}$.
PAC bounds; see e.g. (Shalev-Shwartz and Ben-David, 2014)
Let $(x_{1:n}, y_{1:n}) \sim P^{\otimes n}$, and independently $(x, y) \sim P$; we want an algorithm $g(\cdot\,; x_{1:n}, y_{1:n}) \in \mathcal{G}$ such that if $n \geqslant n(\delta, \varepsilon)$,
$$P^{\otimes n}\Big( \mathbb{E}_{(x,y) \sim P}\, L(a_g, s) \leqslant \varepsilon \Big) \geqslant 1 - \delta.$$
17 / 38
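As an assumed illustration of the 0-1 loss above (the distribution $P$ and the classifier $g$ are my own toy choices), the generalization error $\mathbb{E}_{(x,y)\sim P}\, L(a_g, s)$ of a fixed classifier can be estimated by Monte Carlo on fresh draws from $P$.

```python
# Monte Carlo estimate of the expected 0-1 loss of a fixed classifier (toy example).
import numpy as np

rng = np.random.default_rng(0)

def sample_P(m):
    """Draw m pairs (x, y): x ~ N(0, 1), y = 1 if x plus some noise is positive."""
    x = rng.normal(size=m)
    y = (x + 0.5 * rng.normal(size=m) > 0).astype(int)
    return x, y

# A fixed classifier g; in the PAC setting g would be learned from (x_{1:n}, y_{1:n}).
g = lambda x: (x > 0).astype(int)

x_test, y_test = sample_P(100_000)
zero_one_risk = np.mean(g(x_test) != y_test)   # estimates E_{(x,y)~P} 1_{y != g(x)}
print(zero_one_risk)
```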

Regression as a decision problem
▶ $\mathcal{S} =$
▶ $\mathcal{Z} =$
▶ $\mathcal{A} =$

18 / 38

Estimation as a decision problem
▶ $\mathcal{S} =$
▶ $\mathcal{Z} =$
▶ $\mathcal{A} =$

19 / 38

Clustering as a decision problem
▶ $\mathcal{S} =$
▶ $\mathcal{Z} =$
▶ $\mathcal{A} =$

20 / 38

Outline
1. A warmup: Estimation in regression models
2. ML as data-driven decision-making
3. Subjective expected utility
4. Specifying joint models
5. 50 shades of Bayes
21 / 38

SEU is what defines the Bayesian approach
The subjective expected utility principle
1. Choose $\mathcal{S}$, $\mathcal{Z}$, $\mathcal{A}$ and a loss function $L(a, s)$,
2. Choose $p$ over $\mathcal{S}$,
3. Take the corresponding
$$a^\star \in \arg\min_{a \in \mathcal{A}} \mathbb{E}_{s \sim p}\, L(a, s). \qquad (1)$$
Corollary: minimize the posterior expected loss
Now partition $s = (s_{\mathrm{obs}}, s_u)$, then
$$a^\star \in \arg\min_{a \in \mathcal{A}} \mathbb{E}_{s_{\mathrm{obs}}}\, \mathbb{E}_{s_u \mid s_{\mathrm{obs}}}\, L(a, s).$$
In ML, $\mathcal{A} = \{a_g\}$, with $g = g(s_{\mathrm{obs}})$, so that (1) is equivalent to $a^\star = a_{g^\star}$, with
$$g^\star(s_{\mathrm{obs}}) \triangleq \arg\min_{g}\, \mathbb{E}_{s_u \mid s_{\mathrm{obs}}}\, L(a_g, s).$$
22 / 38
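A small numerical sketch of the corollary (an assumed toy example, not from the slides): with a discrete posterior over the unobserved part of the state and a grid of candidate point estimates under squared-error loss, the SEU action is the grid point minimizing the posterior expected loss, which lands at the posterior mean as expected.

```python
# Choosing the action that minimizes the posterior expected loss (assumed toy example).
import numpy as np

# Posterior p(s_u | s_obs) over a discrete unknown, already conditioned on the data.
s_u_values = np.array([0.0, 1.0, 2.0])
posterior = np.array([0.2, 0.5, 0.3])              # sums to 1

# Candidate actions: point estimates of s_u on a grid, scored by squared-error loss.
candidate_actions = np.linspace(0.0, 2.0, 201)
loss = lambda a, s: (a - s) ** 2

# E_{s_u | s_obs} L(a, s) for every candidate action, then arg min over actions.
expected_loss = np.array([np.sum(posterior * loss(a, s_u_values))
                          for a in candidate_actions])
a_star = candidate_actions[np.argmin(expected_loss)]
print(a_star)   # approximately 1.1, the posterior mean, as expected for squared loss
```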

Outline
1. A warmup: Estimation in regression models
2. ML as data-driven decision-making
3. Subjective expected utility
4. Specifying joint models
5. 50 shades of Bayes
23 / 38

A recap on probabilistic graphical models 1/2
▶ PGMs (aka “Bayesian” networks) represent the dependencies in a joint distribution $p(s)$ by a graph $G = (E, V)$.
▶ Two important properties:
$$p(s) = \prod_{v \in V} p(s_v \mid s_{\mathrm{pa}(v)}) \quad \text{and} \quad y_v \perp y_{\mathrm{nd}(v)} \mid y_{\mathrm{pa}(v)}.$$
24 / 38
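A tiny numerical example of the factorization property (the chain graph and the numbers are my own assumptions): for binary variables on the DAG $s_1 \to s_2 \to s_3$, the joint is the product of each node's conditional given its parents.

```python
# Joint distribution factorized over the small DAG s1 -> s2 -> s3 (assumed example).
import numpy as np

p_s1 = np.array([0.6, 0.4])                 # p(s1)
p_s2_given_s1 = np.array([[0.7, 0.3],       # p(s2 | s1); rows indexed by s1
                          [0.2, 0.8]])
p_s3_given_s2 = np.array([[0.9, 0.1],       # p(s3 | s2); rows indexed by s2
                          [0.5, 0.5]])

# p(s) = prod_v p(s_v | s_pa(v)), evaluated on every configuration of (s1, s2, s3).
joint = np.zeros((2, 2, 2))
for s1 in range(2):
    for s2 in range(2):
        for s3 in range(2):
            joint[s1, s2, s3] = p_s1[s1] * p_s2_given_s1[s1, s2] * p_s3_given_s2[s2, s3]

print(joint.sum())   # 1.0: the factorization defines a valid joint distribution
```

In this chain, $s_3$ is independent of its non-descendant $s_1$ given its parent $s_2$, which is the second property above.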

A recap on probabilistic graphical models 2/2
Also good to know how to determine whether $A \perp B \mid C$; see (Murphy, 2012, Section 10.5).
d-blocking
An undirected path $P$ in $G$ is d-blocked by $E \subset V$ if at least one of the following conditions holds.
▶ $P$ contains a “chain” $a \to b \to c$ and $b \in E$.
▶ $P$ contains a “tent” $a \leftarrow b \to c$ and $b \in E$.
▶ $P$ contains a “v-structure” $a \to b \leftarrow c$ and neither $b$ nor any of its descendants are in $E$.
Theorem
$A \perp B \mid C$ if and only if every undirected path between a node of $A$ and a node of $B$ is d-blocked by $C$.
25 / 38
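The d-blocking conditions above translate almost literally into code. Below is a sketch of a path-based d-separation check for small DAGs (my own implementation, not code from the course): it enumerates the undirected paths between two nodes and tests each consecutive triple against the three conditions.

```python
# Path-based d-separation check for small DAGs (illustrative sketch, not course code).
def descendants(dag, v):
    """All descendants of v; dag maps each node to the set of its children."""
    found, stack = set(), [v]
    while stack:
        for child in dag.get(stack.pop(), set()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def undirected_paths(dag, a, b):
    """All simple paths from a to b in the undirected skeleton of the DAG."""
    nbrs = {}
    for u, children in dag.items():
        for c in children:
            nbrs.setdefault(u, set()).add(c)
            nbrs.setdefault(c, set()).add(u)
    paths, stack = [], [(a, [a])]
    while stack:
        node, path = stack.pop()
        if node == b:
            paths.append(path)
            continue
        for n in nbrs.get(node, set()):
            if n not in path:
                stack.append((n, path + [n]))
    return paths

def d_blocked(dag, path, E):
    """Is the undirected path d-blocked by the conditioning set E?"""
    for u, v, w in zip(path, path[1:], path[2:]):
        is_v_structure = v in dag.get(u, set()) and v in dag.get(w, set())  # u -> v <- w
        if is_v_structure:
            if v not in E and not (descendants(dag, v) & E):
                return True        # blocked v-structure
        elif v in E:
            return True            # blocked chain or tent
    return False

def d_separated(dag, a, b, E):
    """Single-node version of the theorem: a ⊥ b | E iff every path is d-blocked by E."""
    return all(d_blocked(dag, p, E) for p in undirected_paths(dag, a, b))

# Example: the v-structure x1 -> x3 <- x2.
dag = {"x1": {"x3"}, "x2": {"x3"}, "x3": set()}
print(d_separated(dag, "x1", "x2", set()))     # True: marginally independent
print(d_separated(dag, "x1", "x2", {"x3"}))    # False: conditioning opens the collider
```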

Exercise
▶ Does $x_2 \perp x_6 \mid x_5, x_1$?
▶ Does $x_2 \perp x_6 \mid x_1$?
▶ Write the joint distribution as factorized over the graph.
26 / 38

Estimation as a decision problem: point estimates
27 / 38

Estimation as a decision problem: credible intervals
28 / 38

Choosing priors (see Exercises)
29 / 38

Classification as a decision problem
30 / 38

Regression as a decision problem 1/2
31 / 38

Regression as a decision problem 2/2
32 / 38

Dimensionality reduction as a decision problem
33 / 38

Clustering as a decision problem
34 / 38

Topic modelling as a decision problem
35 / 38

Outline
1. A warmup: Estimation in regression models
2. ML as data-driven decision-making
3. Subjective expected utility
4. Specifying joint models
5. 50 shades of Bayes
36 / 38

50 shades of Bayes
An issue (or is it?)
Depending on how they interpret and how they implement SEU, you will meet many types of Bayesians (46,656, according to Good).
A few divisive questions
▶ Using data or the likelihood to choose your prior; see Lecture #5.
▶ Using MAP estimators for their computational tractability, like in inverse problems:
$$\hat{x}_\lambda \in \arg\min_x \|y - Ax\| + \lambda\, \Omega(x).$$
▶ When and how should you revise your model (likelihood or prior)?
▶ MCMC vs variational Bayes (more in Lectures #2 and #3)
37 / 38
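For the MAP example above, a common concrete instance (an assumption on my part; the slide leaves the data-fit term and $\Omega$ generic) is a squared data-fit with $\Omega(x) = \|x\|^2$, i.e. ridge/Tikhonov regularization, whose minimizer is available in closed form.

```python
# MAP estimate for a linear inverse problem, assuming the quadratic instance
# argmin_x ||y - A x||^2 + lam * ||x||^2 (Gaussian likelihood and Gaussian prior).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.normal(size=(n, d))            # forward operator
x_true = rng.normal(size=d)
y = A @ x_true + 0.1 * rng.normal(size=n)

lam = 1.0
# Closed form: x_hat = (A^T A + lam * I)^{-1} A^T y.
x_hat = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)
print(np.linalg.norm(x_hat - x_true))  # small reconstruction error on this toy problem
```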

References
[1] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. 3rd edition. CRC Press, 2013.
[2] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[3] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
38 / 38