Prior for the weights

Let's consider a prior for the weights of the form

    p(w) = \frac{\exp(-\alpha E_w)}{Z_w(\alpha)}

where α is a hyperparameter (a parameter of a prior distribution over another parameter; for now we will assume α is known) and the normalizer is

    Z_w(\alpha) = \int \exp(-\alpha E_w) \, dw.

When we considered weight decay we argued that smaller weights generalize better, so we should set E_w to

    E_w = \frac{1}{2} \|w\|^2 = \frac{1}{2} \sum_{i=1}^{W} w_i^2.

With this E_w, the prior becomes a Gaussian.
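Concretely, with this quadratic E_w the prior is a zero-mean Gaussian with covariance α^{-1} I, so Z_w(α) = (2π/α)^{W/2} in closed form. A minimal sketch in Python (the function names and the sampling demo are ours, not from the slides):

```python
import numpy as np

def log_prior(w, alpha):
    """Log of p(w) = exp(-alpha * E_w) / Z_w(alpha) with E_w = 0.5 * ||w||^2.

    With this E_w the prior is Gaussian, so Z_w(alpha) = (2*pi/alpha)^(W/2).
    """
    W = w.size
    E_w = 0.5 * np.sum(w ** 2)
    log_Z = 0.5 * W * np.log(2.0 * np.pi / alpha)
    return -alpha * E_w - log_Z

# Sampling from the prior: w ~ N(0, (1/alpha) I)
rng = np.random.default_rng(0)
alpha = 2.0
w_sample = rng.normal(scale=1.0 / np.sqrt(alpha), size=10)
print(log_prior(w_sample, alpha))
```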
Example prior

[Figure: a prior over two weights.]
Likelihood of the data

Just as we did for the prior, let's consider a likelihood function of the form

    p(D|w) = \frac{\exp(-\beta E_D)}{Z_D(\beta)}

where β is another hyperparameter and the normalization factor is

    Z_D(\beta) = \int \exp(-\beta E_D) \, dD, \quad \text{where} \quad \int dD = \int dt_1 \cdots dt_N.
If we assume that the target data t ∈ D obey a Gaussian distribution centred on the network output y(x; w), then the likelihood function is given by

    p(D|w) = \prod_{n=1}^{N} p(t_n | x_n, w) = \frac{1}{Z_D(\beta)} \exp\left(-\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n; w) - t_n\}^2\right)
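The exponent is just β times the familiar sum-of-squares error, and for this Gaussian model Z_D(β) = (2π/β)^{N/2}. A minimal sketch (function names are ours):

```python
import numpy as np

def data_error(y_pred, t):
    """E_D = 0.5 * sum_n (y(x_n; w) - t_n)^2, the sum-of-squares error."""
    return 0.5 * np.sum((y_pred - t) ** 2)

def log_likelihood(y_pred, t, beta):
    """log p(D|w) = -beta * E_D - log Z_D(beta), with Z_D = (2*pi/beta)^(N/2)."""
    N = t.size
    return -beta * data_error(y_pred, t) - 0.5 * N * np.log(2.0 * np.pi / beta)
```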
Posterior over the weights

With p(w) and p(D|w) defined, we can now combine them according to Bayes' rule to get the posterior distribution,

    p(w|D) = \frac{p(D|w) \, p(w)}{p(D)} = \frac{1}{Z_S} \exp(-\beta E_D) \exp(-\alpha E_w) = \frac{1}{Z_S} \exp(-S(w))

where

    S(w) = \beta E_D + \alpha E_w

and

    Z_S(\alpha, \beta) = \int \exp(-\beta E_D - \alpha E_w) \, dw.
Posterior over the weights (cont.)

If we imagine we want to find the maximum a posteriori weights, w_MP (the maximum of the posterior distribution), we could minimize the negative logarithm of p(w|D), which is equivalent to minimizing

    S(w) = \frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n; w) - t_n\}^2 + \frac{\alpha}{2} \sum_{i=1}^{W} w_i^2.

We've seen this before: it's the error function minimized with weight decay! The ratio α/β determines how much we penalize large weights.
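A minimal sketch of finding w_MP by gradient descent on S(w), here for a linear model y(x; w) = w·x (the model, step size, and synthetic data are our illustrative choices, not from the slides):

```python
import numpy as np

def fit_map(X, t, alpha, beta, lr=1e-3, steps=2000):
    """Minimize S(w) = beta/2 * sum_n (y(x_n; w) - t_n)^2 + alpha/2 * ||w||^2
    for a linear model y(x; w) = x @ w, by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ w - t                        # y(x_n; w) - t_n
        grad = beta * (X.T @ residual) + alpha * w  # gradient of S(w)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
w_map = fit_map(X, t, alpha=1.0, beta=10.0)
print(w_map)
```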
Example of Bayesian Learning

[Figure: a classification problem with two inputs and one logistic output.]
2. Finding a distribution over outputs

Once we have the posterior of the weights, we can consider the output of the whole distribution of weight values to produce a distribution over the network outputs:

    p(y|x,D) = \int p(y|x,w) \, p(w|D) \, dw

where we are marginalizing over the weights. In general, we require an approximation to evaluate this integral.
Distribution over outputs (cont.)

If we approximate p(w|D) as a sufficiently narrow Gaussian, we arrive at a Gaussian distribution over the outputs of the network,

    p(y|x,D) \approx \frac{1}{(2\pi\sigma_y^2)^{1/2}} \exp\left(-\frac{(y - y_{MP})^2}{2\sigma_y^2}\right).

The mean y_MP is the maximum a posteriori network output and the variance is \sigma_y^2 = \beta^{-1} + g^T A^{-1} g, where A is the Hessian of S(w) and g \equiv \nabla_w y |_{w_{MP}}.
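The variance formula translates directly into code. A minimal sketch, assuming the Hessian A and the output gradient g at w_MP have already been computed (how to obtain them is not covered on this slide):

```python
import numpy as np

def predictive_variance(beta, A, g):
    """sigma_y^2 = 1/beta + g^T A^{-1} g, where A is the Hessian of S(w)
    and g is the gradient of the network output y evaluated at w_MP."""
    # Solve A x = g rather than forming A^{-1} explicitly.
    return 1.0 / beta + g @ np.linalg.solve(A, g)
```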
3. Bayesian Classification with ANNs

We can apply the same techniques to classification problems where, for the two classes, the likelihood function is given by

    p(D|w) = \prod_n y(x_n)^{t_n} (1 - y(x_n))^{1 - t_n} = \exp(-G(D|w))

where G(D|w) is the cross-entropy error function

    G(D|w) = -\sum_n \{t_n \ln y(x_n) + (1 - t_n) \ln(1 - y(x_n))\}.
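A minimal sketch of G(D|w) for binary targets (the function name and the numerical guard are ours):

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """G(D|w) = -sum_n [t_n * ln y(x_n) + (1 - t_n) * ln(1 - y(x_n))].

    y holds the network's sigmoid outputs in (0, 1); t holds 0/1 targets.
    eps guards the logs against outputs that have saturated to 0 or 1.
    """
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```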
Classification (cont.)

If we use a logistic sigmoid y(x;w) as the output activation function and interpret it as P(C_1|x,w), then the output distribution is given by

    P(C_1|x,D) = \int y(x;w) \, p(w|D) \, dw.

Once again we have marginalized out the weights. As we did in the case of regression, we could now apply approximations to evaluate this integral (details in the reading).
Example of Bayesian Classification

[Figures 1 and 2.] The three lines in Figure 2 correspond to network outputs of 0.1, 0.5, and 0.9. (a) shows the predictions made by w_MP. (b) and (c) show the predictions made by the weights w^(1) and w^(2). (d) shows P(C_1|x,D), the prediction after marginalizing over the distribution of weights; for point C, far from the training data, the output is close to 0.5.
What about α and β?

Until now, we have assumed that the hyperparameters are known a priori, but in practice we will almost never know the correct form of the prior. There are two possible solutions to this problem:

1. We could find their maximum a posteriori values in an iterative optimization procedure where we alternate between optimizing w_MP and the hyperparameters α_MP and β_MP (a sketch of this alternation follows below).

2. We could be proper Bayesians and marginalize (or integrate) over the hyperparameters. For example,

    p(w|D) = \frac{1}{p(D)} \int\!\!\int p(D|w,\beta) \, p(w|\alpha) \, p(\alpha) \, p(\beta) \, d\alpha \, d\beta.
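A minimal sketch of option 1 for a linear model, using the standard evidence-approximation re-estimation formulas (these update rules come from MacKay's evidence framework; the slide alludes to the alternation but does not spell out the updates, so treat the details as our assumption):

```python
import numpy as np

def evidence_updates(X, t, alpha, beta, iters=20):
    """Alternate between w_MP and (alpha, beta) for a linear model y = X @ w.

    MacKay-style re-estimation: gamma = sum_i lam_i / (lam_i + alpha),
    alpha <- gamma / (2 E_w), beta <- (N - gamma) / (2 E_D), where lam_i
    are the eigenvalues of beta * X^T X (the data term's Hessian).
    """
    N, W = X.shape
    for _ in range(iters):
        # w_MP for the current (alpha, beta): a regularized least-squares solve.
        A = beta * X.T @ X + alpha * np.eye(W)
        w = np.linalg.solve(A, beta * X.T @ t)
        E_w = 0.5 * w @ w
        E_D = 0.5 * np.sum((X @ w - t) ** 2)
        lam = np.linalg.eigvalsh(beta * X.T @ X)
        gamma = np.sum(lam / (lam + alpha))   # effective number of parameters
        alpha = gamma / (2.0 * E_w)
        beta = (N - gamma) / (2.0 * E_D)
    return w, alpha, beta
```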
Illustration of the Occam factor

[Figure.]
5. Committee of models

We can go even further with Bayesian methods. Rather than picking a single model, we can marginalize over a number of different models:

    p(y|x,D) = \sum_i p(y|x,H_i) \, P(H_i|D)

The result is a weighted average of the probability distributions over the outputs of the models in the committee.
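A minimal sketch of a committee prediction, assuming each model's predictive density and its posterior probability P(H_i|D) are already available (computing the model posteriors is beyond this slide; all names are ours):

```python
def committee_predictive(y, model_preds, model_post):
    """p(y|x,D) = sum_i p(y|x,H_i) * P(H_i|D).

    model_preds: list of callables, each returning p(y|x,H_i) at this x.
    model_post:  posterior model probabilities P(H_i|D), summing to 1.
    """
    return sum(p * pred(y) for pred, p in zip(model_preds, model_post))
```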
Monte Carlo Sampling Methods

We wish to evaluate integrals of the form

    I = \int F(w) \, p(w|D) \, dw.

The idea is to approximate the integral with a finite sum,

    I \approx \frac{1}{L} \sum_{i=1}^{L} F(w_i)

where w_i is a sample of the weights generated from the distribution p(w|D). The challenge in the Monte Carlo method is that it is often difficult to sample from p(w|D) directly.
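A minimal sketch of the finite sum, assuming we somehow have samples from p(w|D) (e.g. from an MCMC run; the sampler itself is the hard part this slide points out):

```python
import numpy as np

def mc_estimate(F, w_samples):
    """I ~= (1/L) * sum_i F(w_i) for samples w_i drawn from p(w|D).

    F takes a weight vector; w_samples is an (L, W) array of posterior samples.
    """
    return np.mean([F(w) for w in w_samples])
```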
Importance Sampling

If sampling from the distribution p(w|D) is impractical, we could instead sample from a simpler distribution q(w) from which it is easy to sample. Then we can write

    I = \int F(w) \, \frac{p(w|D)}{q(w)} \, q(w) \, dw \approx \frac{1}{L} \sum_{i=1}^{L} F(w_i) \, \frac{p(w_i|D)}{q(w_i)}

In general we cannot normalize p(w|D), so we use a modified form of the approximation with an unnormalized \tilde{p}(w_i|D),

    I \approx \frac{\sum_{i=1}^{L} F(w_i) \, \tilde{p}(w_i|D) / q(w_i)}{\sum_{i=1}^{L} \tilde{p}(w_i|D) / q(w_i)}
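A minimal sketch of this self-normalized estimator, working in log densities for numerical stability (the interface and all names are our illustrative choices):

```python
import numpy as np

def importance_estimate(F, log_p_tilde, q_sample, log_q, L=10_000, seed=0):
    """Self-normalized importance sampling:
    I ~= sum_i F(w_i) r_i / sum_i r_i, with r_i = p~(w_i|D) / q(w_i).

    Accepts the unnormalized posterior p~ as a log density; the shared
    normalization constant cancels in the ratio.
    """
    rng = np.random.default_rng(seed)
    w = q_sample(rng, L)                  # draw w_1, ..., w_L ~ q(w)
    log_r = log_p_tilde(w) - log_q(w)     # log importance ratios
    r = np.exp(log_r - np.max(log_r))     # stabilized weights (scale cancels)
    f = np.array([F(wi) for wi in w])
    return np.sum(f * r) / np.sum(r)
```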