Probabilistic Machine Learning - Continuous Variables


Slide Content

Probabilistic Machine Learning
Lecture 03
Continuous Variables
Philipp Hennig
24 April 2023
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning

– Warning –
Theory incoming

We need to talk about
▶ Derived quantities: functions of elementary events
▶ Continuous quantities: measures on the continuum
▶ Functions of continuous quantities

Recap from Lecture 1
Plausibility as a Measure [A. N. Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung, 1933]

Definition (σ-algebra, measurable sets & spaces)
Let $E$ be a space of elementary events. Consider the power set $2^E$, and let $\mathcal{F} \subseteq 2^E$ be a set of subsets of $E$. Elements of $\mathcal{F}$ are called random events. If $\mathcal{F}$ satisfies the following properties, it is called a σ-algebra:
1. $E \in \mathcal{F}$ (II.)
2. $(A \in \mathcal{F}) \Rightarrow (\neg A := E \setminus A \in \mathcal{F})$ (I.)
3. $(A_1, A_2, \dots \in \mathcal{F}) \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (I.)
(This implies, by de Morgan's laws, $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$, and thus also $\emptyset \in \mathcal{F}$. If $E$ is countable, then $2^E$ is a σ-algebra.) If $\mathcal{F}$ is a σ-algebra, its elements are called measurable sets, and $(E, \mathcal{F})$ is called a measurable space (or Borel space).

If $\Omega$ is countable, then $2^\Omega$ is a σ-algebra, and everything is easy.

Recap from Lecture 1
Plausibility as a Measure [A. N. Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung, 1933]

Definition (Measure & Probability Measure)
Let $(\Omega, \mathcal{F})$ be a measurable space (aka Borel space). A nonnegative real function $P: \mathcal{F} \to \mathbb{R}_{0,+}$ (III.) is called a measure if it satisfies the following properties:
1. $P(\emptyset) = 0$
2. For any countable sequence $\{A_i \in \mathcal{F}\}_{i=1,\dots}$ of pairwise disjoint sets ($A_i \cap A_j = \emptyset$ if $i \neq j$), $P$ satisfies countable additivity (aka σ-additivity):
$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$ (V.)
The measure $P$ is called a probability measure if $P(\Omega) = 1$. (IV.)
(For probability measures, 1. is unnecessary.) Then $(\Omega, \mathcal{F}, P)$ is called a probability space.

What about derived quantities?
Random variables

A bent coin has probability $f$ of coming up heads. The coin is tossed $N$ times. What is the probability distribution of the number of heads $r$?

[Graphical model: node $R$ with parents $X_1, X_2, \dots, X_{N-1}, X_N$]

$R := \sum_{i=1}^{N} X_i$, where $X_i := 1$ if the $i$-th toss is heads, and $0$ otherwise.

▶ For $X = [X_1, \dots, X_N]$, we have $\Omega = \{0, 1\}^N$.
▶ But what about $R \in \{0, \dots, N\} \subset \mathbb{N}$? It's not part of $\Omega$.

Building new Probability Distributions from old ones
Random Variables

Definition (Measurable Functions, Random Variables)
Let $(\Omega, \mathcal{F})$ and $(\Gamma, \mathcal{G})$ be two measurable spaces (i.e. spaces with σ-algebras). A function $X: \Omega \to \Gamma$ is called measurable if $X^{-1}(G) \in \mathcal{F}$ for all $G \in \mathcal{G}$. If there is, additionally, a probability measure $P$ on $(\Omega, \mathcal{F})$, then $X$ is called a random variable.

Definition (Distribution Measure)
Let $X: \Omega \to \Gamma$ be a random variable. Then the distribution measure (or law) $P_X$ of $X$ is defined for any $G \in \mathcal{G}$ as
$P_X(G) = P(X^{-1}(G)) = P(\{\omega \mid X(\omega) \in G\}).$

Note: we will call essentially every variable $X$ over which we construct probability measures a random variable, even if the measure is constructed directly on $X$, because every variable can be thought of as the output of a function (if necessary, the identity).

Example: the Binomial Distribution
statistics of accumulated Bernoulli experiments

A bent coin has probability $f$ of coming up heads. The coin is tossed $N$ times. What is the probability distribution of the number of heads $r$?

[Graphical model: node $R$ with parents $X_1, X_2, \dots, X_{N-1}, X_N$]

$R := \sum_{i=1}^{N} X_i$, where $X_i := 1$ if the $i$-th toss is heads, and $0$ otherwise.

$P(R = r) = \sum_{\omega \in \{X \mid R = r\}} \prod_{i=1}^{N} P(X_i) = \sum_{\omega \in \{X \mid R = r\}} f^r (1 - f)^{N - r} =: P(r \mid f, N)$

▶ original space: $\Omega = \{0, 1\}^N$ (countably finite)
▶ σ-algebra: $2^\Omega$ (the power set)
▶ random variable: $R = \sum_{i=1}^{N} X_i \in \{0, \dots, N\} =: \Gamma \subset \mathbb{N}$
▶ distribution (measure) / law of $R$: …

Example: the Binomial Distribution
statistics of accumulated Bernoulli experiments

Definition: Let $R: \Omega \to \Gamma$ be a random variable. Then the distribution measure (or law) $P_R$ of $R$ is defined for any $G \subseteq \Gamma$ as
$P_R(G) = P(R^{-1}(G)) = P(\{\omega \mid R(\omega) \in G\}).$

The distribution measure of $R := \sum_{i=1}^{N} X_i$ is
$P(r \mid f, N) = (\text{\# ways to choose } r \text{ from } N)\, f^r (1 - f)^{N - r} = \frac{N!}{(N - r)!\, r!}\, f^r (1 - f)^{N - r} = \binom{N}{r} f^r (1 - f)^{N - r}$

Note: in the remainder of the course, we will often abuse notation by writing $P(r)$ instead of $P(R = r)$ (recall again that $P(X) \neq P(Y)$!).

[Plot: the binomial pmf $P(r \mid f = 1/3, N = 10)$ over $r = 0, \dots, 10$.]
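
The law above is easy to check numerically. Below is a minimal sketch (not part of the slides) comparing the closed-form pmf $P(r \mid f, N)$ against a Monte Carlo estimate obtained by actually summing simulated Bernoulli variables $X_i$; the helper name binom_pmf and the sample size are illustrative choices.

# Sketch: the law of R = sum_i X_i, checked by simulation (assumption: numpy available).
import math
import numpy as np

def binom_pmf(r, N, f):
    # P(r | f, N) = (N choose r) f^r (1 - f)^(N - r)
    return math.comb(N, r) * f**r * (1.0 - f) ** (N - r)

N, f = 10, 1.0 / 3.0
rng = np.random.default_rng(0)
tosses = rng.random((200_000, N)) < f   # rows are draws of (X_1, ..., X_N), X_i ~ Bernoulli(f)
R = tosses.sum(axis=1)                  # the derived quantity R(omega)

for r in range(N + 1):
    print(f"r={r:2d}  exact={binom_pmf(r, N, f):.4f}  empirical={(R == r).mean():.4f}")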

Now for the Real case …
some complications for continuous spaces

▶ In a countable space $\Omega$, even $2^\Omega$ is a σ-algebra.
▶ But in continuous spaces, such as $\Omega = \mathbb{R}^d$, not all sets are measurable.

Example: Consider the set $S := \{e^{i x \pi} : x \in [0, 2)\}$ of all points on the unit circle, and the group $G$ (also a set) of rotations of $s \in S$ by rational numbers, $s \mapsto s\, e^{i q \pi}$, $q \in \mathbb{Q}$.
▶ We note that $S$ is uncountable, but $G$ is countable. Hence $S$ breaks up into uncountably many orbits $\{s\, e^{i q \pi} : q \in \mathbb{Q}\}$ under $G$.
▶ If we have the axiom of choice, we can pick one point from each orbit, forming an uncountable subset $Z \subseteq S$ with the property that the sets of the form $\{e^{i q \pi} z : z \in Z\}$ for $q \in \mathbb{Q}$ are pairwise disjoint.
▶ Thus $S = \bigcup_{q \in \mathbb{Q}} \{e^{i q \pi} z : z \in Z\}$, hence we have partitioned the circle into a countable collection of pairwise disjoint sets.

Proposition: $Z$ is not measurable (with respect to any rotation-invariant measure that assigns $S$ a finite, nonzero measure).
Proof: If $Z$ has zero measure, then, by σ-additivity, $S$ has zero measure. If $Z$ has non-zero measure, then, by σ-additivity and rotation invariance, $S$ has infinite measure.


The Banach-Tarski Paradox
You thought you would get away without ever thinking about countability again after your BSc, didn't you?

Stefan Banach (1892, Kraków – 1945, Lviv); Alfred Tarski (1901, Warsaw – 1983, Berkeley)

There is no way to define volume in three dimensions unless one of the following five concessions is made:
1. The volume of a set can change under rotation.
2. The volume of a union of two disjoint sets can be different from the sum of their volumes.
3. Some sets have to be called non-measurable, and we will have to check whether a set is measurable before taking its volume.
4. The ZFC (Zermelo-Fraenkel, with axiom of choice) axioms of set theory have to be changed.
5. The volume of $[0, 1]^3$ is either $0$ or $\infty$.

Probably best to take option 3 …

Taking the easy way out
Accepting that not everything is measurable

▶ In a countable space $\Omega$, even $2^\Omega$ is a σ-algebra.
▶ But in continuous spaces, such as $\Omega = \mathbb{R}^d$, not all sets are measurable.
▶ However, $\mathbb{R}^d$ is a topological space.

Definition (Topology)
Let $\Omega$ be a space and $\tau$ be a collection of sets. We say $\tau$ is a topology on $\Omega$ if
▶ $\Omega \in \tau$ and $\emptyset \in \tau$,
▶ any union of elements of $\tau$ is in $\tau$,
▶ any intersection of finitely many elements of $\tau$ is in $\tau$.
The elements of the topology $\tau$ are called open sets. In the Euclidean vector space $\mathbb{R}^d$, the canonical topology is that of all sets $U$ that satisfy $x \in U \Rightarrow \exists\, \varepsilon > 0 : (\|y - x\| < \varepsilon) \Rightarrow (y \in U)$.

From topologies to σ-algebras
for topological spaces, it's easy to define σ-algebras

Note that a topology is almost a σ-algebra:

Definition (σ-algebra, measurable sets & spaces)
Let $E$ be a space, and $\mathcal{F} \subseteq 2^E$ a set of subsets. We say $\mathcal{F}$ is a σ-algebra if
1. $E \in \mathcal{F}$ (II.)
2. $(A \in \mathcal{F}) \Rightarrow (\neg A := E \setminus A \in \mathcal{F})$ (I.)
3. $(A_1, A_2, \dots \in \mathcal{F}) \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (I.)
(This implies, by de Morgan's laws, $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$, and thus also $\emptyset \in \mathcal{F}$. If $E$ is countable, then $2^E$ is a σ-algebra.) If $\mathcal{F}$ is a σ-algebra, its elements are called measurable sets, and $(E, \mathcal{F})$ is called a measurable space (or Borel space).

Definition (Topology)
Let $\Omega$ be a space and $\tau$ be a collection of sets. We say $\tau$ is a topology on $\Omega$ if
▶ $\Omega \in \tau$ and $\emptyset \in \tau$,
▶ any union of elements of $\tau$ is in $\tau$,
▶ any intersection of finitely many elements of $\tau$ is in $\tau$.
The elements of the topology $\tau$ are called open sets. In the Euclidean vector space $\mathbb{R}^d$, the canonical topology is that of all sets $U$ that satisfy $x \in U \Rightarrow \exists\, \varepsilon > 0 : (\|y - x\| < \varepsilon) \Rightarrow (y \in U)$.

Almost done!
Standard settings

Definition (Borel algebra)
Let $(\Omega, \tau)$ be a topological space. The Borel σ-algebra is the σ-algebra generated by $\tau$: the smallest σ-algebra containing $\tau$, obtained by closing $\tau$ under complements (in $\Omega$) and countable unions and intersections.

▶ In this lecture, we will almost exclusively consider (random) variables defined on discrete or Euclidean spaces. In the latter case, the σ-algebra will not be mentioned but assumed to be the Borel σ-algebra.
▶ Consider $(\Omega, \mathcal{F})$ and $(\Gamma, \mathcal{G})$. If both $\mathcal{F}$ and $\mathcal{G}$ are Borel σ-algebras, then any continuous function $X$ is measurable (and can thus be used to define a random variable). This is because, for continuous functions, pre-images of open sets are open sets.

Now that we can define (Borel) σ-algebras on continuous spaces, we can define probability distribution measures. They might just be a bit unwieldy.

▶ Random Variables allow us to define derived quantities from atomic events.
▶ Borel σ-algebras can be defined on all topological spaces, allowing us to define probabilities if the elementary space is continuous.

Note the connection to computability theory: measurable functions and computable functions. "Not all sets are measurable", and "not all languages are computable".

Cumulative Distributions
Connecting probabilities to integration

Definition (Cumulative Distribution Function (CDF))
Let $\mathcal{B}$ be the Borel σ-algebra on $\mathbb{R}^d$. For probability measures $P$ on $(\mathbb{R}^d, \mathcal{B})$, the cumulative distribution function is the function
$F(x) = P\left(\prod_{i=1}^{d} (X_i \leq x_i)\right).$
In particular, for the univariate case $d = 1$, we have $F(x) = P((-\infty, x])$.

Probability Densities
a convenient way to write things down

Definition (Probability Density Functions (pdf's))
A probability measure $P$ on $(\mathbb{R}^d, \mathcal{B})$ has a density $p$ if $p$ is a non-negative (Borel) measurable function on $\mathbb{R}^d$ satisfying, for all $B \in \mathcal{B}$,
$P(B) = \int_B p(x)\, dx =: \int_B p(x_1, \dots, x_d)\, dx_1 \cdots dx_d.$
In particular, if the CDF $F$ of $P$ is sufficiently differentiable, then $P$ has a density, given by
$p(x) = \left. \frac{\partial^d F}{\partial x_1 \cdots \partial x_d} \right|_{x},$
and, for $d = 1$,
$P(a \leq X < b) = F(b) - F(a) = \int_a^b p(x)\, dx.$

Densities Satisfy the Laws of Probability Theory
because integrals are linear operators (proof in the Math for ML lecture)

▶ For probability densities $p$ on $(\mathbb{R}^d, \mathcal{B})$ we have
$P(E) \overset{\text{(IV)}}{=} 1 = \int_{\mathbb{R}^d} p(x)\, dx.$
▶ Let $X = (X_1, X_2) \in \mathbb{R}^2$ be a random variable with density $p_X$ on $\mathbb{R}^2$. Then the marginal densities of $X_1$ and $X_2$ are given by the sum rule
$p_{X_1}(x_1) = \int_{\mathbb{R}} p_X(x_1, x_2)\, dx_2, \qquad p_{X_2}(x_2) = \int_{\mathbb{R}} p_X(x_1, x_2)\, dx_1.$
▶ The conditional density $p(x_1 \mid x_2)$ (for $p(x_2) > 0$) is given by the product rule
$p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)}.$
▶ Bayes' Theorem holds:
$p(x_1 \mid x_2) = \frac{p(x_1)\, p(x_2 \mid x_1)}{\int p(x_1)\, p(x_2 \mid x_1)\, dx_1}.$
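
As a concrete illustration (not from the slides), the sum rule, product rule and Bayes' theorem can be checked on a grid, with the integrals replaced by Riemann sums; the correlated Gaussian joint density below is just an arbitrary example.

# Sketch: sum rule, product rule and Bayes' theorem on a grid (assumption: numpy available).
import numpy as np

x1 = np.linspace(-5.0, 5.0, 401)
x2 = np.linspace(-5.0, 5.0, 401)
dx = x1[1] - x1[0]
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

rho = 0.7   # joint density of a correlated bivariate Gaussian (unit variances)
p_joint = np.exp(-(X1**2 - 2 * rho * X1 * X2 + X2**2) / (2 * (1 - rho**2)))
p_joint /= 2.0 * np.pi * np.sqrt(1.0 - rho**2)

p_x1 = p_joint.sum(axis=1) * dx               # sum rule: p(x1) = integral of p(x1, x2) over x2
print("normalization:", p_x1.sum() * dx)      # ~ 1

j = 300                                        # condition on the fixed value x2[j]
p_x2_j = p_joint[:, j].sum() * dx              # p(x2[j]) by the sum rule
p_cond = p_joint[:, j] / p_x2_j                # product rule: p(x1 | x2) = p(x1, x2) / p(x2)

p_x2_given_x1 = p_joint[:, j] / p_x1           # likelihood p(x2[j] | x1), as a function of x1
evidence = (p_x1 * p_x2_given_x1).sum() * dx   # integral of p(x1) p(x2 | x1) over x1
bayes = p_x1 * p_x2_given_x1 / evidence        # Bayes' theorem
print(np.allclose(p_cond, bayes))              # True: both routes give the same conditional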

A Graphical View
sketch

[Sketch: a joint density $p(x, y)$ over the plane, together with a conditional slice $p(x \mid y)$.]

Change of Measure
The transformation law

Theorem (Change of Variable for Probability Density Functions)
Let $X$ be a continuous random variable with pdf $p_X(x)$ over $c_1 < x < c_2$, and let $Y = u(X)$ be a monotonic, differentiable function with inverse $X = v(Y)$. Then the pdf of $Y$ is
$p_Y(y) = p_X(v(y)) \left| \frac{dv(y)}{dy} \right| = p_X(v(y)) \left| \frac{du(x)}{dx} \right|^{-1}.$

Proof, case $u'(X) > 0$: for all $d_1 = u(c_1) < y < u(c_2) = d_2$,
$F_Y(y) = P(Y \leq y) = P(u(X) \leq y) = P(X \leq v(y)) = \int_{c_1}^{v(y)} p_X(x)\, dx$
$p_Y(y) = \frac{dF_Y(y)}{dy} = p_X(v(y)) \frac{dv(y)}{dy} = p_X(v(y)) \left| \frac{dv(y)}{dy} \right|.$

Proof, case $u'(X) < 0$: for all $d_2 = u(c_2) < y < u(c_1) = d_1$,
$F_Y(y) = P(Y \leq y) = P(u(X) \leq y) = P(X \geq v(y)) = 1 - P(X \leq v(y)) = 1 - \int_{c_1}^{v(y)} p_X(x)\, dx$
$p_Y(y) = \frac{dF_Y(y)}{dy} = -p_X(v(y)) \frac{dv(y)}{dy} = p_X(v(y)) \left| \frac{dv(y)}{dy} \right|.$

Change of Measure
The transformation law

Theorem (Transformation Law, general)
Let $X = (X_1, \dots, X_d)$ have a joint density $p_X$. Let $g: \mathbb{R}^d \to \mathbb{R}^d$ be continuously differentiable and injective, with non-vanishing Jacobian $J_g$. Then $Y = g(X)$ has density
$p_Y(y) = \begin{cases} p_X(g^{-1}(y))\, |\det J_{g^{-1}}(y)| & \text{if } y \text{ is in the range of } g, \\ 0 & \text{otherwise.} \end{cases}$
The Jacobian $J_g$ is the $d \times d$ matrix with
$[J_g(x)]_{ij} = \frac{\partial g_i(x)}{\partial x_j}.$
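
A quick numerical sanity check of the univariate case (not from the slides): take $X \sim \mathrm{Uniform}(0, 1)$ and $Y = u(X) = -\log X$, so $v(y) = e^{-y}$ and the theorem predicts $p_Y(y) = p_X(v(y))\, |dv/dy| = e^{-y}$, i.e. an Exponential(1) density; the sample size and bin choices are arbitrary.

# Sketch: checking the transformation law by pushing samples through u (assumption: numpy available).
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(500_000)          # samples of X ~ Uniform(0, 1), so p_X = 1 on (0, 1)
y = -np.log(x)                   # samples of Y = u(X)

edges = np.linspace(0.0, 6.0, 61)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
p_Y = np.exp(-centers)           # density predicted by p_X(v(y)) |dv(y)/dy|

print(np.max(np.abs(hist - p_Y)))   # small: the histogram matches the predicted density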

DEMO
live: streamlit cloud
local:
▶ git clone https://github.com/philipphennig/ProbML_Apps.git
▶ cd ProbML_Apps/03
▶ pip install -r requirements.txt
▶ streamlit run Lecture_03.py

▶ Probability Density Functions (pdf's) distribute probability across continuous domains.
▶ They satisfy "the rules of probability":
$\int_{\mathbb{R}^d} p(x)\, dx = 1$
$p_{X_1}(x_1) = \int_{\mathbb{R}} p_X(x_1, x_2)\, dx_2$ (sum rule)
$p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)}$ (product rule)
$p(x_1 \mid x_2) = \frac{p(x_1)\, p(x_2 \mid x_1)}{\int p(x_1)\, p(x_2 \mid x_1)\, dx_1}$ (Bayes' Theorem)
▶ Not every measure has a density, but all pdfs define measures.
▶ Densities transform under continuously differentiable, injective functions $g: x \mapsto y$ with non-vanishing Jacobian as
$p_Y(y) = \begin{cases} p_X(g^{-1}(y))\, |\det J_{g^{-1}}(y)| & \text{if } y \text{ is in the range of } g, \\ 0 & \text{otherwise.} \end{cases}$

An example
Pierre-Simon Laplace, 1814

What is the probability $\pi$ for a person to be wearing glasses?
▶ model the probability as a random variable $\pi$ ranging in $[0, 1]$
▶ $X$ = "person is wearing glasses". How do we do inference?
$p(\pi \mid X) = \frac{p(X \mid \pi)\, p(\pi)}{p(X)} = \frac{p(X \mid \pi)\, p(\pi)}{\int p(X \mid \pi)\, p(\pi)\, d\pi}$

What is a good prior? Uniform for $\pi \in [0, 1]$, i.e. $p(\pi) = 1$ there, zero elsewhere.
(If we sample independently) What is the likelihood for a positive or a negative observation? $p(X = 1 \mid \pi) = \pi$; $p(X = 0 \mid \pi) = 1 - \pi$.
What is the posterior after $n$ positive and $m$ negative observations?
$p(\pi \mid n, m) = \frac{\pi^n (1 - \pi)^m \cdot 1}{\int \pi^n (1 - \pi)^m \cdot 1\, d\pi} = \frac{\pi^n (1 - \pi)^m}{B(n + 1, m + 1)}$

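The resulting posterior is easy to evaluate. A small sketch (not part of the slides), using the Beta function from scipy to normalize: it confirms that $\pi^n (1 - \pi)^m / B(n + 1, m + 1)$ integrates to one, and that its mean is $(n + 1)/(n + m + 2)$.

# Sketch: the posterior p(pi | n, m) under the uniform prior (assumptions: numpy and scipy available).
import numpy as np
from scipy.special import beta as B    # the Beta function B(a, b)

n, m = 5, 2                            # n positive, m negative observations
pi = np.linspace(0.0, 1.0, 2001)
dpi = pi[1] - pi[0]

posterior = pi**n * (1.0 - pi) ** m / B(n + 1, m + 1)
print(posterior.sum() * dpi)           # ~ 1: the posterior density is normalized
print((pi * posterior).sum() * dpi)    # posterior mean, ~ (n + 1) / (n + m + 2)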

DEMO
live: streamlit cloud
local:
▶ git clone https://github.com/philipphennig/ProbML_Apps.git
▶ cd ProbML_Apps/03
▶ pip install -r requirements.txt
▶ streamlit run Lecture_03.py

The Invention of Inference
It's actually "Laplace's Theorem"

Pierre-Simon, marquis de Laplace (1749–1827)

La probabilité de la plupart des événemens simples, est inconnue; en la considérant à priori, elle nous paraît susceptible de toutes les valeurs comprises entre zéro et l'unité; mais si l'on a observé un résultat composé de plusieurs de ces événemens, la manière dont ils y entrent, rend quelques-unes de ces valeurs plus probables que les autres. Ainsi à mesure que le résultat observé se compose par le développement des événemens simples, leur vraie possibilité se fait de plus en plus connaître, et il devient de plus en plus probable qu'elle tombe dans des limites qui se resserrant sans cesse, finiraient par coïncider, si le nombre des événemens simples devenait infini.
Théorie Analytique des Probabilités, 1814, p. 363
Translated by a Deep Network, assisted by a human (see the next slide)

The Invention of Inference
It's actually "Laplace's Theorem"

Pierre-Simon, marquis de Laplace (1749–1827)

The probability of most simple events is unknown. Considering it a priori, it seems susceptible to all values between zero and unity. But if one has observed a result composed of several of these events, the way they enter it makes some of these values more probable than the others. Thus, as the observed result is composed by the development of simple events, their real possibility becomes more and more known, and it becomes more and more probable that it falls within limits which, tightening constantly, would end up coinciding if the number of simple events became infinite.
Théorie Analytique des Probabilités, 1814, p. 363
Translated by a Deep Network, assisted by a human

Let’s be more careful with notation!
(but only once more, then we’ll be sloppy)

Example — inferring probability of wearing glasses (2)
Step 1: construct the σ-algebra

Represent all unknowns as random variables (RVs)
▶ the probability to wear glasses is represented by the RV $Y$
▶ the five observations are represented by the RVs $X_1, X_2, X_3, X_4, X_5$

Possible values of the RVs
▶ $Y$ takes values $\pi \in [0, 1]$
▶ $X_1, X_2, X_3, X_4, X_5$ are binary, i.e. take the values 0 and 1

Graphical representation
[Graphical model: node $Y$ with children $X_1, X_2, X_3, X_4, X_5$]

Generative model and joint probability
▶ we abbreviate $Y = \pi$ as $\pi$, and $X_i = x_i$ as $x_i$
▶ $p(\pi)$ is the prior of $Y$, written in full $p(Y = \pi)$
▶ $p(x_i \mid \pi)$ is the likelihood of observation $x_i$
▶ note that the likelihood is a function of $\pi$

Example — inferring probability of wearing glasses (3)
Step 2: define the probability space, taking care of conditional independence

Probability of wearing glasses without observations
$p(\pi \mid \text{"nothing"}) = p(\pi)$

Probability of wearing glasses after one observation
$p(\pi \mid x_1) = \frac{p(x_1 \mid \pi)\, p(\pi)}{\int p(x_1 \mid \pi)\, p(\pi)\, d\pi} = Z_1^{-1}\, p(x_1 \mid \pi)\, p(\pi)$

Probability of wearing glasses after two observations
$p(\pi \mid x_1, x_2) = Z_2^{-1}\, p(x_2 \mid x_1, \pi)\, p(x_1 \mid \pi)\, p(\pi) = Z_2^{-1}\, p(x_2 \mid \pi)\, p(x_1 \mid \pi)\, p(\pi)$

Probability of wearing glasses after five observations
$p(\pi \mid x_1, x_2, x_3, x_4, x_5) = Z_5^{-1} \left( \prod_{i=1}^{5} p(x_i \mid \pi) \right) p(\pi)$

Example — inferring probability of wearing glasses (4)
Step 3: define analytic forms of the generative model

What is the likelihood?
$p(x_1 \mid \pi) = \begin{cases} \pi & \text{for } x_1 = 1 \\ 1 - \pi & \text{for } x_1 = 0 \end{cases}$

More helpful RVs:
▶ RV $N$ for the number of observations equal to 1 (with value $n$)
▶ RV $M$ for the number of observations equal to 0 (with value $m$)

Probability of wearing glasses after five observations
$p(\pi \mid x_1, x_2, x_3, x_4, x_5) = Z_5^{-1} \left( \prod_{i=1}^{5} p(x_i \mid \pi) \right) p(\pi) = Z_5^{-1}\, \pi^n (1 - \pi)^m\, p(\pi) = p(\pi \mid n, m)$

Example — inferring probability of wearing glasses (5)
Step 4: make computationally convenient choices. Here: a conjugate prior

Posterior after seeing five observations:
$p(\pi \mid n, m) = Z_5^{-1}\, \pi^n (1 - \pi)^m\, p(\pi)$

What prior $p(\pi)$ would make the calculations easy?
$p(\pi) = Z^{-1}\, \pi^{a-1} (1 - \pi)^{b-1}$ with parameters $a > 0$, $b > 0$
the Beta distribution with parameters $a$ and $b$

Let's give the normalization factor $Z$ of the Beta distribution a name:
$B(a, b) = \int_0^1 \pi^{a-1} (1 - \pi)^{b-1}\, d\pi$
the Beta function with parameters $a$ and $b$
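
With this prior the whole computation stays inside the Beta family: the posterior after $n$ ones and $m$ zeros is again a Beta density, with parameters $a + n$ and $b + m$. The sketch below (not part of the slides) verifies this conjugate update against a brute-force grid computation; the prior parameters are arbitrary example values.

# Sketch: conjugacy of the Beta prior (assumptions: numpy and scipy available).
import numpy as np
from scipy.stats import beta

a, b = 2.0, 3.0          # example prior parameters
n, m = 5, 2              # observed counts

pi = np.linspace(0.0, 1.0, 2001)
dpi = pi[1] - pi[0]
unnormalized = beta.pdf(pi, a, b) * pi**n * (1.0 - pi) ** m   # prior times likelihood
grid_posterior = unnormalized / (unnormalized.sum() * dpi)    # normalize numerically

conjugate_posterior = beta.pdf(pi, a + n, b + m)              # closed-form conjugate update
print(np.max(np.abs(grid_posterior - conjugate_posterior)))   # small (grid discretization error)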

Aside: The Eulerian Integrals
For an evening at the fireplace: [Philip J. Davis, "Leonhard Euler's Integral: A Historical Profile of the Gamma Function", 1959]

Leonhard Euler (1707–1783)
[Plot: $\Gamma(x + 1)$ (Euler), $x!$, and Hadamard's $H(x)$ on a logarithmic scale over $x \in [0, 4]$.]

For $m, n \in \mathbb{N}$ and $x, y, z \in \mathbb{C}$:
$B(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1}\, dt = \frac{\Gamma(x)\, \Gamma(y)}{\Gamma(x + y)}$ if $x + \bar{x},\, y + \bar{y} > 0$
$B(m, n) = \frac{(m - 1)!\, (n - 1)!}{(m + n - 1)!}$
$\Gamma(z) = \int_0^\infty e^{-t}\, t^{z-1}\, dt = \int_0^1 (-\log x)^{z-1}\, dx$
$\Gamma(n) = (n - 1)!$
Hadamard:
$H(x) = \frac{1}{\Gamma(1 - x)} \frac{d}{dx} \log\!\left( \frac{\Gamma\!\left(\frac{1 - x}{2}\right)}{\Gamma\!\left(1 - \frac{x}{2}\right)} \right)$
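
These identities are easy to check numerically. A tiny sketch (not part of the slides; assumes scipy is available):

# Sketch: Euler's Beta-Gamma identity and Gamma(n) = (n - 1)!
import math
from scipy.special import beta, gamma

for x, y in [(2.0, 3.0), (0.5, 0.5), (7.0, 11.0)]:
    print(beta(x, y), gamma(x) * gamma(y) / gamma(x + y))   # the two columns agree

print(gamma(6), math.factorial(5))   # Gamma(n) = (n - 1)!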

The Gamma / Beta Functions
intractable integrals, for Laplace [graph: E. Jahnke & F. Emde, Tafeln höherer Funktionen, 4th ed., Leipzig 1948; image: public domain]

Jacques Salomon Hadamard (1865–1963)
[Figure (from Davis, 1959): the absolute value of the complex Gamma function, exhibiting the poles at the negative integers.]

Laplace's Approximation
What you do when it's 1814 and you don't have a computer [P.-S. Laplace, 1814]

Pierre-Simon, marquis de Laplace (1749–1827)

$p(x \mid a, b) = \frac{x^{a-1} (1 - x)^{b-1}}{B(a, b)}$
$\log p(x \mid a, b) = (a - 1) \log x + (b - 1) \log(1 - x) - \text{const.}$
$\frac{\partial \log p(x \mid a, b)}{\partial x} = \frac{a - 1}{x} - \frac{b - 1}{1 - x} \quad \Rightarrow \quad \hat{x} := \frac{a - 1}{a + b - 2}$
$\left. \frac{\partial^2 \log p(x \mid a, b)}{\partial x^2} \right|_{x = \hat{x}} = -\frac{a - 1}{\hat{x}^2} - \frac{b - 1}{(1 - \hat{x})^2} = -(a + b - 2)^2 \left( \frac{1}{a - 1} + \frac{1}{b - 1} \right) =: \Psi$
$\log p(x) \approx \log p(\hat{x}) + 0 + \frac{1}{2} (x - \hat{x})^2\, \Psi$
$\int p(x)\, dx \approx p(\hat{x}) \int \exp\!\left( -\frac{(x - \hat{x})^2}{2 (-\Psi^{-1})} \right) dx = 1$

A little help from a friend
what a nice chap, I wonder whether we'll ever encounter him again …

Carl Friedrich Gauss (1777–1855)
$\int_{-\infty}^{\infty} \exp\!\left( -\frac{1}{2} \frac{(x - m)^2}{v} \right) dx = \sqrt{2 \pi v}$

Laplace's Approximation
What you do when it's 1814 and you don't have a computer [P.-S. Laplace, 1814]

Pierre-Simon, marquis de Laplace (1749–1827)

$p(x \mid a, b) = \frac{x^{a-1} (1 - x)^{b-1}}{B(a, b)}$
$\hat{x} := \frac{a - 1}{a + b - 2}, \qquad \Psi := -(a + b - 2)^2 \left( \frac{1}{a - 1} + \frac{1}{b - 1} \right)$
$\int p(x)\, dx \approx p(\hat{x}) \int \exp\!\left( -\frac{(x - \hat{x})^2}{2 (-\Psi^{-1})} \right) dx = 1$
$B(a, b) \approx \hat{x}^{a-1} (1 - \hat{x})^{b-1} \sqrt{2 \pi (-\Psi^{-1})}$

Note: for $N = a + b \to \infty$ we have $\Psi \approx -\frac{(a + b)^3}{ab}$, so with $a \approx fN$ and $b \approx (1 - f)N$ the width is $\sqrt{-\Psi^{-1}} \approx \frac{\sqrt{ab}}{N^{3/2}} = \frac{\sqrt{f (1 - f)}}{\sqrt{N}}$, i.e. the approximation concentrates like $1/\sqrt{N}$.
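
A short sketch (not part of the slides) of this approximation in code: it computes the mode $\hat{x}$ and curvature $\Psi$ from the formulas above and compares the resulting Gaussian-integral estimate of $B(a, b)$ with the exact Beta function; the parameter values are arbitrary.

# Sketch: Laplace's approximation of the Beta normalization constant (assumptions: numpy and scipy available).
import numpy as np
from scipy.special import beta as B

a, b = 20.0, 35.0
x_hat = (a - 1.0) / (a + b - 2.0)                                      # mode of x^(a-1) (1-x)^(b-1)
Psi = -((a + b - 2.0) ** 2) * (1.0 / (a - 1.0) + 1.0 / (b - 1.0))      # second derivative of log p at the mode
B_laplace = x_hat ** (a - 1.0) * (1.0 - x_hat) ** (b - 1.0) * np.sqrt(2.0 * np.pi * (-1.0 / Psi))

print(B_laplace, B(a, b))   # close for moderately large a and b, as the 1/sqrt(N) argument suggests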

Choose your prior, if you're willing to pay for it
inference is never "intractable" if you have enough compute

[Plot: prior $p(\pi)$, likelihood $p(n = 5, m = 2 \mid \pi)$, and posterior $p(\pi \mid n = 5, m = 2)$ over $\pi \in [0, 1]$.]

# grid approximation of the posterior; prior and likelihood are functions evaluated on the grid
import jax.numpy as jnp
N = 120
xp = jnp.linspace(0, 1, N)
posterior = prior(xp) * likelihood(xp)
posterior = posterior / posterior.sum() * (N - 1)   # normalize: grid spacing is 1 / (N - 1)
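
The snippet above assumes that prior and likelihood are already defined. Below is a self-contained sketch (not part of the slides), with a made-up, non-conjugate prior, to make the point that any prior you can evaluate on a grid is fair game:

# Sketch: grid posterior with an arbitrary (hypothetical) prior (assumption: jax installed).
import jax.numpy as jnp

def prior(p):
    # a made-up, non-conjugate prior density on [0, 1] (nonnegative, integrates to one)
    return 1.0 + 0.5 * jnp.cos(4.0 * jnp.pi * p)

def likelihood(p, n=5, m=2):
    # p(n, m | pi) up to a constant, for n positive and m negative observations
    return p**n * (1.0 - p) ** m

N = 120
xp = jnp.linspace(0, 1, N)
posterior = prior(xp) * likelihood(xp)
posterior = posterior / posterior.sum() * (N - 1)   # normalize on the grid (spacing 1 / (N - 1))

print((xp * posterior).sum() / (N - 1))   # posterior mean under this prior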

▶ Random Variables allow us to define derived quantities from atomic events.
▶ Borel σ-algebras can be defined on all topological spaces, allowing us to define probabilities if the elementary space is continuous.
▶ Probability Density Functions (pdf's) distribute probability across continuous domains.
▶ They satisfy "the rules of probability" (integrate to one, sum rule, product rule, hence Bayes' Theorem).
▶ Not every measure has a density, but all pdfs define measures.
▶ Densities transform under continuously differentiable, injective transformations.
▶ Probabilistic inference can even be used to infer probabilities!
▶ With the right prior, the posterior can be computed simply by adding floats, if you know how to compute one special integral. If you don't, just use a Laplace approximation.
▶ If you don't like that prior, we can use another one (Laplace couldn't!), because we've got computers!

Please cite this course as:
@techreport{Tuebingen_ProbML23,
  title       = {Probabilistic Machine Learning},
  author      = {Hennig, Philipp},
  series      = {Lecture Notes in Machine Learning},
  year        = {2023},
  institution = {T{\"u}bingen AI Center}
}