Prior for the weights

Let's consider a prior for the weights of the form

    p(w) = \frac{\exp(-\alpha E_w)}{Z_w(\alpha)}

where α is a hyperparameter (a parameter of a prior distribution over another parameter; for now we will assume α is known) and the normalizer is

    Z_w(\alpha) = \int \exp(-\alpha E_w) \, dw.

When we considered weight decay we argued that smaller weights generalize better, so we should set E_w to

    E_w = \frac{1}{2} \|w\|^2 = \frac{1}{2} \sum_{i=1}^{W} w_i^2.

With this E_w, the prior becomes a Gaussian.
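Concretely, with this quadratic E_w the prior is a zero-mean Gaussian with covariance α^{-1} I, so Z_w(α) = (2π/α)^{W/2} in closed form. A minimal sketch in Python (the function names and the sampling demo are ours, not from the slides):

```python
import numpy as np

def log_prior(w, alpha):
    """Log of p(w) = exp(-alpha * E_w) / Z_w(alpha) with E_w = 0.5 * ||w||^2.

    With this E_w the prior is Gaussian, so Z_w(alpha) = (2*pi/alpha)^(W/2).
    """
    W = w.size
    E_w = 0.5 * np.sum(w ** 2)
    log_Z = 0.5 * W * np.log(2.0 * np.pi / alpha)
    return -alpha * E_w - log_Z

# Sampling from the prior: w ~ N(0, (1/alpha) I)
rng = np.random.default_rng(0)
alpha = 2.0
w_sample = rng.normal(scale=1.0 / np.sqrt(alpha), size=10)
print(log_prior(w_sample, alpha))
```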
Example prior

[Figure: a prior over two weights.]
Likelihood of the data

Just as we did for the prior, let's consider a likelihood function of the form

    p(D|w) = \frac{\exp(-\beta E_D)}{Z_D(\beta)}

where β is another hyperparameter and the normalization factor is

    Z_D(\beta) = \int \exp(-\beta E_D) \, dD, \quad \text{where} \quad \int dD = \int dt_1 \cdots dt_N.
If we assume that the target data t ∈ D obey a Gaussian distribution centred on the network output y(x; w), then the likelihood function is given by

    p(D|w) = \prod_{n=1}^{N} p(t_n | x_n, w) = \frac{1}{Z_D(\beta)} \exp\left(-\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n; w) - t_n\}^2\right)
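The exponent is just β times the familiar sum-of-squares error, and for this Gaussian model Z_D(β) = (2π/β)^{N/2}. A minimal sketch (function names are ours):

```python
import numpy as np

def data_error(y_pred, t):
    """E_D = 0.5 * sum_n (y(x_n; w) - t_n)^2, the sum-of-squares error."""
    return 0.5 * np.sum((y_pred - t) ** 2)

def log_likelihood(y_pred, t, beta):
    """log p(D|w) = -beta * E_D - log Z_D(beta), with Z_D = (2*pi/beta)^(N/2)."""
    N = t.size
    return -beta * data_error(y_pred, t) - 0.5 * N * np.log(2.0 * np.pi / beta)
```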
Posterior over the weights

With p(w) and p(D|w) defined, we can now combine them according to Bayes' rule to get the posterior distribution,

    p(w|D) = \frac{p(D|w) \, p(w)}{p(D)} = \frac{1}{Z_S} \exp(-\beta E_D) \exp(-\alpha E_w) = \frac{1}{Z_S} \exp(-S(w))

where

    S(w) = \beta E_D + \alpha E_w

and

    Z_S(\alpha, \beta) = \int \exp(-\beta E_D - \alpha E_w) \, dw.
Posterior over the weights (cont.)

If we imagine we want to find the maximum a posteriori weights, w_MP (the maximum of the posterior distribution), we could minimize the negative logarithm of p(w|D), which is equivalent to minimizing

    S(w) = \frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n; w) - t_n\}^2 + \frac{\alpha}{2} \sum_{i=1}^{W} w_i^2.

We've seen this before: it's the error function minimized with weight decay! The ratio α/β determines how much we penalize large weights.
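A minimal sketch of finding w_MP by gradient descent on S(w), here for a linear model y(x; w) = w·x (the model, step size, and synthetic data are our illustrative choices, not from the slides):

```python
import numpy as np

def fit_map(X, t, alpha, beta, lr=1e-3, steps=2000):
    """Minimize S(w) = beta/2 * sum_n (y(x_n; w) - t_n)^2 + alpha/2 * ||w||^2
    for a linear model y(x; w) = x @ w, by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ w - t                        # y(x_n; w) - t_n
        grad = beta * (X.T @ residual) + alpha * w  # gradient of S(w)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
w_map = fit_map(X, t, alpha=1.0, beta=10.0)
print(w_map)
```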
Example of Bayesian Learning

[Figure: a classification problem with two inputs and one logistic output.]
2. Finding a distribution over outputs

Once we have the posterior of the weights, we can consider the output of the whole distribution of weight values to produce a distribution over the network outputs:

    p(y|x,D) = \int p(y|x,w) \, p(w|D) \, dw

where we are marginalizing over the weights. In general, we require an approximation to evaluate this integral.
Distribution over outputs (cont.)

If we approximate p(w|D) as a sufficiently narrow Gaussian, we arrive at a Gaussian distribution over the outputs of the network,

    p(y|x,D) \approx \frac{1}{(2\pi\sigma_y^2)^{1/2}} \exp\left(-\frac{(y - y_{MP})^2}{2\sigma_y^2}\right).

The mean y_MP is the maximum a posteriori network output and the variance is \sigma_y^2 = \beta^{-1} + g^T A^{-1} g, where A is the Hessian of S(w) and g \equiv \nabla_w y |_{w_{MP}}.
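The variance formula translates directly into code. A minimal sketch, assuming the Hessian A and the output gradient g at w_MP have already been computed (how to obtain them is not covered on this slide):

```python
import numpy as np

def predictive_variance(beta, A, g):
    """sigma_y^2 = 1/beta + g^T A^{-1} g, where A is the Hessian of S(w)
    and g is the gradient of the network output y evaluated at w_MP."""
    # Solve A x = g rather than forming A^{-1} explicitly.
    return 1.0 / beta + g @ np.linalg.solve(A, g)
```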
3. Bayesian Classification with ANNs

We can apply the same techniques to classification problems where, for the two classes, the likelihood function is given by

    p(D|w) = \prod_n y(x_n)^{t_n} (1 - y(x_n))^{1 - t_n} = \exp(-G(D|w))

where G(D|w) is the cross-entropy error function

    G(D|w) = -\sum_n \{t_n \ln y(x_n) + (1 - t_n) \ln(1 - y(x_n))\}.
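A minimal sketch of G(D|w) for binary targets (the function name and the numerical guard are ours):

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """G(D|w) = -sum_n [t_n * ln y(x_n) + (1 - t_n) * ln(1 - y(x_n))].

    y holds the network's sigmoid outputs in (0, 1); t holds 0/1 targets.
    eps guards the logs against outputs that have saturated to 0 or 1.
    """
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```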
Classification (cont.)

If we use a logistic sigmoid y(x;w) as the output activation function and interpret it as P(C_1|x,w), then the output distribution is given by

    P(C_1|x,D) = \int y(x;w) \, p(w|D) \, dw.

Once again we have marginalized out the weights. As we did in the case of regression, we could now apply approximations to evaluate this integral (details in the reading).
Example of Bayesian Classification

[Figures 1 and 2.] The three lines in Figure 2 correspond to network outputs of 0.1, 0.5, and 0.9. (a) shows the predictions made by w_MP. (b) and (c) show the predictions made by the weights w^(1) and w^(2). (d) shows P(C_1|x,D), the prediction after marginalizing over the distribution of weights; for point C, far from the training data, the output is close to 0.5.
What about α and β?

Until now, we have assumed that the hyperparameters are known a priori, but in practice we will almost never know the correct form of the prior. There are two possible solutions to this problem:

1. We could find their maximum a posteriori values in an iterative optimization procedure where we alternate between optimizing w_MP and the hyperparameters α_MP and β_MP (a sketch of this alternation follows below).

2. We could be proper Bayesians and marginalize (or integrate) over the hyperparameters. For example,

    p(w|D) = \frac{1}{p(D)} \int\!\!\int p(D|w,\beta) \, p(w|\alpha) \, p(\alpha) \, p(\beta) \, d\alpha \, d\beta.
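A minimal sketch of option 1 for a linear model, using the standard evidence-approximation re-estimation formulas (these update rules come from MacKay's evidence framework; the slide alludes to the alternation but does not spell out the updates, so treat the details as our assumption):

```python
import numpy as np

def evidence_updates(X, t, alpha, beta, iters=20):
    """Alternate between w_MP and (alpha, beta) for a linear model y = X @ w.

    MacKay-style re-estimation: gamma = sum_i lam_i / (lam_i + alpha),
    alpha <- gamma / (2 E_w), beta <- (N - gamma) / (2 E_D), where lam_i
    are the eigenvalues of beta * X^T X (the data term's Hessian).
    """
    N, W = X.shape
    for _ in range(iters):
        # w_MP for the current (alpha, beta): a regularized least-squares solve.
        A = beta * X.T @ X + alpha * np.eye(W)
        w = np.linalg.solve(A, beta * X.T @ t)
        E_w = 0.5 * w @ w
        E_D = 0.5 * np.sum((X @ w - t) ** 2)
        lam = np.linalg.eigvalsh(beta * X.T @ X)
        gamma = np.sum(lam / (lam + alpha))   # effective number of parameters
        alpha = gamma / (2.0 * E_w)
        beta = (N - gamma) / (2.0 * E_D)
    return w, alpha, beta
```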
Illustration of the Occam factor

[Figure.]
5. Committee of models

We can go even further with Bayesian methods. Rather than picking a single model, we can marginalize over a number of different models:

    p(y|x,D) = \sum_i p(y|x,H_i) \, P(H_i|D)

The result is a weighted average of the probability distributions over the outputs of the models in the committee.
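A minimal sketch of a committee prediction, assuming each model's predictive density and its posterior probability P(H_i|D) are already available (computing the model posteriors is beyond this slide; all names are ours):

```python
def committee_predictive(y, model_preds, model_post):
    """p(y|x,D) = sum_i p(y|x,H_i) * P(H_i|D).

    model_preds: list of callables, each returning p(y|x,H_i) at this x.
    model_post:  posterior model probabilities P(H_i|D), summing to 1.
    """
    return sum(p * pred(y) for pred, p in zip(model_preds, model_post))
```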
Monte Carlo Sampling Methods

We wish to evaluate integrals of the form

    I = \int F(w) \, p(w|D) \, dw.

The idea is to approximate the integral with a finite sum,

    I \approx \frac{1}{L} \sum_{i=1}^{L} F(w_i)

where w_i is a sample of the weights generated from the distribution p(w|D). The challenge in the Monte Carlo method is that it is often difficult to sample from p(w|D) directly.
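A minimal sketch of the finite sum, assuming we somehow have samples from p(w|D) (e.g. from an MCMC run; the sampler itself is the hard part this slide points out):

```python
import numpy as np

def mc_estimate(F, w_samples):
    """I ~= (1/L) * sum_i F(w_i) for samples w_i drawn from p(w|D).

    F takes a weight vector; w_samples is an (L, W) array of posterior samples.
    """
    return np.mean([F(w) for w in w_samples])
```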
Importance Sampling

If sampling from the distribution p(w|D) is impractical, we could instead sample from a simpler distribution q(w) from which it is easy to sample. Then we can write

    I = \int F(w) \, \frac{p(w|D)}{q(w)} \, q(w) \, dw \approx \frac{1}{L} \sum_{i=1}^{L} F(w_i) \, \frac{p(w_i|D)}{q(w_i)}

In general we cannot normalize p(w|D), so we use a modified form of the approximation with an unnormalized \tilde{p}(w_i|D),

    I \approx \frac{\sum_{i=1}^{L} F(w_i) \, \tilde{p}(w_i|D) / q(w_i)}{\sum_{i=1}^{L} \tilde{p}(w_i|D) / q(w_i)}
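A minimal sketch of this self-normalized estimator, working in log densities for numerical stability (the interface and all names are our illustrative choices):

```python
import numpy as np

def importance_estimate(F, log_p_tilde, q_sample, log_q, L=10_000, seed=0):
    """Self-normalized importance sampling:
    I ~= sum_i F(w_i) r_i / sum_i r_i, with r_i = p~(w_i|D) / q(w_i).

    Accepts the unnormalized posterior p~ as a log density; the shared
    normalization constant cancels in the ratio.
    """
    rng = np.random.default_rng(seed)
    w = q_sample(rng, L)                  # draw w_1, ..., w_L ~ q(w)
    log_r = log_p_tilde(w) - log_q(w)     # log importance ratios
    r = np.exp(log_r - np.max(log_r))     # stabilized weights (scale cancels)
    f = np.array([F(wi) for wi in w])
    return np.sum(f * r) / np.sum(r)
```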