We can apply this process to neural networks and come up with the probability distribution over the network weights, w, given the training data, p(w|D).


Bayesian Methods for Neural Networks
Readings: Bishop, Neural Networks for Pattern Recognition, Chapter 10.
Aaron Courville
Bayesian Methods for Neural Networks – p.1/29

Bayesian Inference
We've seen Bayesian inference before; recall:
∙ p(θ) is the prior probability of a parameter θ before having seen the data.
∙ p(D|θ) is called the likelihood. It is the probability of the data D given θ.
We can use Bayes' rule to determine the posterior probability of θ given the data, D,

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

In general this will provide an entire distribution over possible values of θ rather than the single most likely value of θ.
Bayesian Methods for Neural Networks – p.2/29

Bayesian ANNs?
We can apply this process to neural networks and come up with the probability distribution over the network weights, w, given the training data, p(w|D).
As we will see, we can also come up with a posterior distribution over:
∙ the network output
∙ a set of different-sized networks
∙ the outputs of a set of different-sized networks
Bayesian Methods for Neural Networks – p.3/29

Why should we bother?
Instead of considering a single answer to a question, Bayesian methods allow us to consider an entire distribution of answers. With this approach we can naturally address issues like:
∙ regularization (overfitting or not),
∙ model selection/comparison,
without the need for a separate cross-validation data set.
With these techniques we can also put error bars on the output of the network, by considering the shape of the output distribution p(y|D).
Bayesian Methods for Neural Networks – p.4/29

Overview
We will be looking at how, using Bayesian methods, we can explore the following questions:
1. p(w|D,H)? What is the distribution over weights w given the data and a fixed model, H?
2. p(y|D,H)? What is the distribution over network outputs y given the data and a model (for regression problems)?
3. p(C|D,H)? What is the distribution over predicted class labels C given the data and model (for classification problems)?
4. p(H|D)? What is the distribution over models given the data?
5. p(y|D)? What is the distribution over network outputs given the data (not conditioned on a particular model!)?
Bayesian Methods for Neural Networks – p.5/29

Overview (cont.)
We will also look briefly at Monte Carlo sampling methods to deal with using Bayesian methods in the “real world”. A good deal of current research is going into applying such methods to deal with Bayesian inference in difficult problems.
Bayesian Methods for Neural Networks – p.6/29

Maximum Likelihood Learning
Optimization methods focus on finding a single weight assignment that minimizes some error function (typically a least squared-error function).
This is equivalent to finding a maximum of the likelihood function, i.e. finding a w that maximizes the probability of the data given those weights, p(D|w).
Bayesian Methods for Neural Networks – p.7/29

1. Bayesian learning of the weights
Here we consider finding a posterior distribution over weights,

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} = \frac{p(D \mid w)\, p(w)}{\int p(D \mid w)\, p(w)\, dw}.$$

In the Bayesian formalism, learning the weights means changing our belief about the weights from the prior, p(w), to the posterior, p(w|D), as a consequence of seeing the data.
Bayesian Methods for Neural Networks – p.8/29

Prior for the weights
Let's consider a prior for the weights of the form

$$p(w) = \frac{\exp(-\alpha E_W)}{Z_W(\alpha)}$$

where α is a hyperparameter (a parameter of a prior distribution over another parameter; for now we will assume α is known) and the normalizer Z_W(α) = ∫ exp(−α E_W) dw.
When we considered weight decay we argued that smaller weights generalize better, so we should set E_W to

$$E_W = \frac{1}{2}\|w\|^2 = \frac{1}{2}\sum_{i=1}^{W} w_i^2.$$

With this E_W, the prior becomes a Gaussian.
Bayesian Methods for Neural Networks – p.9/29
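Since E_W is quadratic, the prior on the slide above is just a zero-mean Gaussian with variance 1/α on each weight, so its normalizer has a closed form. A minimal NumPy sketch (the function name and test values are illustrative, not from the slides):

```python
import numpy as np

def log_prior(w, alpha):
    """Log of p(w) = exp(-alpha * E_W) / Z_W(alpha) with E_W = 0.5 * ||w||^2.

    Because E_W is quadratic, p(w) is a zero-mean Gaussian with covariance
    (1/alpha) * I, so Z_W(alpha) = (2*pi/alpha)^(W/2).
    """
    W = w.size
    E_W = 0.5 * np.sum(w ** 2)
    log_Z_W = 0.5 * W * np.log(2.0 * np.pi / alpha)
    return -alpha * E_W - log_Z_W

w = np.array([0.3, -1.2, 0.7])
print(log_prior(w, alpha=2.0))
```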

Example prior
A prior over two weights.
Bayesian Methods for Neural Networks – p.10/29

Likelihood of the data
Just as we did for the prior, let's consider a likelihood function of the form

$$p(D \mid w) = \frac{\exp(-\beta E_D)}{Z_D(\beta)}$$

where β is another hyperparameter and the normalization factor Z_D(β) = ∫ exp(−β E_D) dD (where ∫ dD = ∫ dt^1 … dt^N).
If we assume that, after training, the target data t ∈ D obeys a Gaussian distribution with mean y(x; w), then the likelihood function is given by

$$p(D \mid w) = \prod_{n=1}^{N} p(t^n \mid x^n, w) = \frac{1}{Z_D(\beta)} \exp\!\left(-\frac{\beta}{2}\sum_{n=1}^{N}\{y(x^n; w) - t^n\}^2\right)$$
Bayesian Methods for Neural Networks – p.11/29
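The Gaussian likelihood above can be evaluated the same way; for N scalar targets the normalizer is Z_D(β) = (2π/β)^{N/2}. A small sketch, assuming NumPy and using y_pred to stand for the network outputs y(x^n; w) at the training inputs:

```python
import numpy as np

def log_likelihood(y_pred, t, beta):
    """Log of p(D|w) = exp(-beta * E_D) / Z_D(beta) for Gaussian noise.

    E_D = 0.5 * sum_n (y(x^n; w) - t^n)^2; each target is Gaussian around
    the network output with variance 1/beta, so Z_D(beta) = (2*pi/beta)^(N/2).
    """
    N = t.size
    E_D = 0.5 * np.sum((y_pred - t) ** 2)
    log_Z_D = 0.5 * N * np.log(2.0 * np.pi / beta)
    return -beta * E_D - log_Z_D

print(log_likelihood(np.array([0.4, 0.9]), np.array([0.5, 1.0]), beta=25.0))
```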

Posterior over the weights
With p(w) and p(D|w) defined, we can now combine them according to Bayes' rule to get the posterior distribution,

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} = \frac{1}{Z_S}\exp(-\beta E_D)\exp(-\alpha E_W) = \frac{1}{Z_S}\exp(-S(w))$$

where

$$S(w) = \beta E_D + \alpha E_W$$

and

$$Z_S(\alpha, \beta) = \int \exp(-\beta E_D - \alpha E_W)\, dw$$
Bayesian Methods for Neural Networks – p.12/29

Posterior over the weights (cont.)
If we imagine we want to find the maximum a posteriori weights, w_MP (the maximum of the posterior distribution), we could minimize the negative logarithm of p(w|D), which is equivalent to minimizing

$$S(w) = \frac{\beta}{2}\sum_{n=1}^{N}\{y(x^n; w) - t^n\}^2 + \frac{\alpha}{2}\sum_{i=1}^{W} w_i^2.$$

We've seen this before: it's the error function minimized with weight decay! The ratio α/β determines the amount we penalize large weights.
Bayesian Methods for Neural Networks – p.13/29
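To make the weight-decay correspondence concrete, here is a sketch that minimizes S(w) for a linear stand-in model y(x; w) = w·x, where the MAP solution has a closed form; for an actual network one would instead minimize S(w) by gradient descent/backpropagation. The data and hyperparameter values below are illustrative only:

```python
import numpy as np

def w_map_linear(X, t, alpha, beta):
    """Minimise S(w) = beta*E_D + alpha*E_W for a linear 'network' y(x; w) = X @ w.

    Setting dS/dw = 0 gives (beta * X^T X + alpha * I) w = beta * X^T t,
    i.e. ordinary least squares with weight decay lambda = alpha / beta.
    """
    W = X.shape[1]
    return np.linalg.solve(beta * X.T @ X + alpha * np.eye(W), beta * X.T @ t)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(w_map_linear(X, t, alpha=1.0, beta=100.0))  # ridge solution, lambda = alpha/beta
```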

Example of Bayesian Learning
A classification problem with two inputs and one logistic output.
Bayesian Methods for Neural Networks – p.14/29

2. Finding a distribution over outputs
Once we have the posterior over the weights, we can consider the output of the whole distribution of weight values to produce a distribution over the network outputs.

$$p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw$$

where we are marginalizing over the weights. In general, we require an approximation to evaluate this integral.
Bayesian Methods for Neural Networks – p.15/29

Distribution over outputs (cont.)
If we approximate p(w|D) as a sufficiently narrow Gaussian, we arrive at a Gaussian distribution over the outputs of the network,

$$p(y \mid x, D) \approx \frac{1}{(2\pi\sigma_y^2)^{1/2}} \exp\!\left(-\frac{(y - y_{MP})^2}{2\sigma_y^2}\right).$$

The mean y_MP is the maximum a posteriori network output and the variance is σ_y² = β⁻¹ + gᵀA⁻¹g, where A is the Hessian of S(w) and g ≡ ∇_w y evaluated at w_MP.
Bayesian Methods for Neural Networks – p.16/29
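A minimal sketch of the variance formula above, assuming the Hessian A and the output gradient g have already been computed at w_MP (the numbers below are made up for illustration):

```python
import numpy as np

def predictive_variance(beta, A, g):
    """sigma_y^2 = 1/beta + g^T A^{-1} g, from the narrow-Gaussian approximation of p(w|D).

    A : Hessian of S(w) evaluated at w_MP (W x W array)
    g : gradient of the network output w.r.t. the weights at w_MP (length W)
    """
    return 1.0 / beta + g @ np.linalg.solve(A, g)

A = np.array([[4.0, 0.2, 0.0],
              [0.2, 3.0, 0.1],
              [0.0, 0.1, 5.0]])   # illustrative positive-definite Hessian
g = np.array([0.5, -0.3, 0.8])
print(predictive_variance(beta=25.0, A=A, g=g))
```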

Example of Bayesian Regression
The figure is an example of the application of Bayesian methods to a regression problem. The data (circles) was generated from the function h(x) = 0.5 + 0.4 sin(2πx).
Bayesian Methods for Neural Networks – p.17/29

3. Bayesian Classification with ANNs
We can apply the same techniques to classification problems where, for the two classes, the likelihood function is given by

$$p(D \mid w) = \prod_{n} y(x^n)^{t^n}\,(1 - y(x^n))^{1 - t^n} = \exp(-G(D \mid w))$$

where G(D|w) is the cross-entropy error function

$$G(D \mid w) = -\sum_{n}\{t^n \ln y(x^n) + (1 - t^n)\ln(1 - y(x^n))\}$$
Bayesian Methods for Neural Networks – p.18/29
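A short sketch of the cross-entropy error G(D|w) and the corresponding likelihood exp(−G(D|w)), assuming NumPy; the clipping constant is a numerical guard, not part of the slide's formula:

```python
import numpy as np

def cross_entropy_error(y, t, eps=1e-12):
    """G(D|w) = -sum_n { t^n ln y(x^n) + (1 - t^n) ln(1 - y(x^n)) }.

    y : network outputs y(x^n) in (0, 1), e.g. logistic-sigmoid activations
    t : binary targets t^n in {0, 1}
    """
    y = np.clip(y, eps, 1.0 - eps)   # guard the logarithms
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

y = np.array([0.9, 0.2, 0.7])
t = np.array([1, 0, 1])
G = cross_entropy_error(y, t)
print(G, np.exp(-G))                 # error and the likelihood p(D|w)
```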

Classification (cont.)
If we use a logistic sigmoid y(x; w) as the output activation function and interpret it as P(C1|x, w), then the output distribution is given by

$$P(C_1 \mid x, D) = \int y(x; w)\, p(w \mid D)\, dw$$

Once again we have marginalized out the weights.
As we did in the case of regression, we could now apply approximations to evaluate this integral (details in the reading).
Bayesian Methods for Neural Networks – p.19/29
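In practice the integral above is often approximated by averaging the network output over samples from p(w|D). A sketch under simplifying assumptions (a one-layer logistic model stands in for the network, and the posterior samples are simply drawn from a Gaussian for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def marginal_class_prob(x, weight_samples):
    """Approximate P(C1|x, D) = integral of y(x; w) p(w|D) dw by averaging
    the sigmoid output over samples w_i drawn from p(w|D)."""
    return np.mean([sigmoid(w @ x) for w in weight_samples])

rng = np.random.default_rng(1)
# Pretend the posterior over a 2-weight "network" is a Gaussian (illustrative only).
weight_samples = rng.normal(loc=[1.0, -0.5], scale=0.8, size=(1000, 2))
print(marginal_class_prob(np.array([0.2, 0.4]), weight_samples))
```

Far from the training data the sampled outputs disagree, so the averaged prediction is pulled toward 0.5, which is the moderation effect shown on the next slide.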

Example of Bayesian Classification
Figure 1 and Figure 2
The three lines in Figure 2 correspond to network outputs of 0.1, 0.5, and 0.9. (a) shows the predictions made by w_MP. (b) and (c) show the predictions made by the weights w^(1) and w^(2). (d) shows P(C1|x, D), the prediction after marginalizing over the distribution of weights; for point C, far from the training data, the output is close to 0.5.
Bayesian Methods for Neural Networks – p.20/29

What about α and β?
Until now, we have assumed that the hyperparameters are known a priori, but in practice we will almost never know the correct form of the prior. There exist two possible alternative solutions to this problem:
1. We could find their maximum a posteriori values in an iterative optimization procedure where we alternate between optimizing w_MP and the hyperparameters α_MP and β_MP (a sketch of this alternation follows below).
2. We could be proper Bayesians and marginalize (or integrate) over the hyperparameters. For example,

$$p(w \mid D) = \frac{1}{p(D)} \iint p(D \mid w, \beta)\, p(w \mid \alpha)\, p(\alpha)\, p(\beta)\, d\alpha\, d\beta.$$
Bayesian Methods for Neural Networks – p.21/29
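A sketch of option 1, the alternating re-estimation loop. The update formulas (effective number of parameters γ, α = γ/2E_W, β = (N−γ)/2E_D) are the standard evidence-framework re-estimates from Bishop's chapter rather than anything spelled out on this slide, and a linear model is used so that the Hessian of E_D is simply XᵀX:

```python
import numpy as np

def evidence_reestimate(X, t, alpha=1.0, beta=1.0, iters=20):
    """Alternate between the MAP weights and the hyperparameters alpha, beta.

    Linear model y(x; w) = X @ w, so the Hessian of E_D is X^T X.
    Assumed re-estimation formulas (evidence framework, Bishop Ch. 10):
        gamma = sum_i lambda_i / (lambda_i + alpha),  lambda_i eigvals of beta * H
        alpha <- gamma / (2 * E_W),   beta <- (N - gamma) / (2 * E_D)
    """
    N, W = X.shape
    H = X.T @ X
    for _ in range(iters):
        w = np.linalg.solve(beta * H + alpha * np.eye(W), beta * X.T @ t)  # w_MP
        E_W = 0.5 * w @ w
        E_D = 0.5 * np.sum((X @ w - t) ** 2)
        lam = beta * np.linalg.eigvalsh(H)
        gamma = np.sum(lam / (lam + alpha))       # effective number of parameters
        alpha = gamma / (2.0 * E_W)
        beta = (N - gamma) / (2.0 * E_D)
    return w, alpha, beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
t = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.2 * rng.normal(size=100)
print(evidence_reestimate(X, t))   # beta should approach 1 / 0.2^2 = 25
```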

4. Bayesian Model Comparison
Until now, we have been dealing with the application of Bayesian methods to a neural network with a fixed number of units and a fixed architecture.
With Bayesian methods, we can generalize learning to include learning the appropriate model size and even model type.
Consider a set of candidate models H_i that could include neural networks with different numbers of hidden units, RBF networks and other models.
Bayesian Methods for Neural Networks – p.22/29

Model Comparison (cont.)
We can apply Bayes' theorem to compute the posterior distribution over models, then pick the model with the largest posterior.

$$P(H_i \mid D) = \frac{p(D \mid H_i)\, P(H_i)}{p(D)}$$

The term p(D|H_i) is called the evidence for H_i and is given by

$$p(D \mid H_i) = \int p(D \mid w, H_i)\, p(w \mid H_i)\, dw.$$

The evidence term balances between fitting the data well and avoiding overly complex models.
Bayesian Methods for Neural Networks – p.23/29

Model evidence p(D|H_i)
Consider a single weight, w. If we assume that the posterior is sharply peaked around the most probable value, w_MP, with width Δw_posterior, we can approximate the integral with the expression

$$p(D \mid H_i) \approx p(D \mid w_{MP}, H_i)\, p(w_{MP} \mid H_i)\, \Delta w_{\text{posterior}}.$$

If we also take the prior over the weights to be uniform over a large interval Δw_prior, then the approximation to the evidence becomes

$$p(D \mid H_i) \approx p(D \mid w_{MP}, H_i)\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right).$$

The ratio Δw_posterior/Δw_prior is called the Occam factor and penalizes complex models.
Bayesian Methods for Neural Networks – p.24/29
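A tiny numerical illustration of the approximation above: the evidence is the best-fit likelihood discounted by the Occam factor, so a complex model that fits slightly better can still lose. The numbers are invented for illustration:

```python
import numpy as np

def approx_log_evidence(log_lik_at_wMP, dw_posterior, dw_prior):
    """log p(D|H_i) ~ log p(D|w_MP, H_i) + log(dw_posterior / dw_prior).

    The second term is the (log) Occam factor: a model whose posterior
    collapses to a tiny fraction of its prior range pays a penalty.
    """
    return log_lik_at_wMP + np.log(dw_posterior / dw_prior)

print(approx_log_evidence(-10.0, dw_posterior=0.5, dw_prior=2.0))    # simple model
print(approx_log_evidence(-8.0, dw_posterior=0.01, dw_prior=5.0))    # complex model
# The complex model fits better (-8 vs -10) but its Occam factor is far
# smaller, so its approximate log evidence is lower.
```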

Illustration of the Occam factor
Bayesian Methods for Neural Networks – p.25/29

5. Committee of models
We can go even further with Bayesian methods. Rather than picking a single model we can marginalize over a number of different models.

$$p(y \mid x, D) = \sum_{i} p(y \mid x, D, H_i)\, P(H_i \mid D)$$

The result is a weighted average of the probability distributions over the outputs of the models in the committee.
Bayesian Methods for Neural Networks – p.26/29
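A sketch of the committee average, assuming each model's predictive distribution p(y|x, D, H_i) has already been summarized by a Gaussian mean and variance, and that the model prior P(H_i) is flat so that P(H_i|D) ∝ p(D|H_i):

```python
import numpy as np

def committee_prediction(model_means, model_variances, log_evidences):
    """Combine per-model predictive Gaussians with weights P(H_i|D).

    Returns the mean and variance of the mixture p(y|x, D).
    """
    log_post = log_evidences - np.max(log_evidences)
    post = np.exp(log_post)
    post /= post.sum()                                   # P(H_i|D)
    mean = np.sum(post * model_means)
    # variance of a mixture of Gaussians: E[y^2] - (E[y])^2
    var = np.sum(post * (model_variances + model_means ** 2)) - mean ** 2
    return mean, var

means = np.array([0.62, 0.58, 0.75])        # illustrative per-model predictions
variances = np.array([0.01, 0.02, 0.05])
log_ev = np.array([-20.0, -21.0, -25.0])    # illustrative log evidences
print(committee_prediction(means, variances, log_ev))
```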

Bayesian Methods in Practice
Bayesian methods are almost always difficult to apply directly. They involve integrals that are intractable except in the most trivial cases.
Until now, we have made assumptions about the shape of the distributions in the integrations (Gaussians). For a wide array of problems these assumptions do not hold and may lead to very poor performance.
Typical numerical integration techniques are unsuitable for the integrations involved in applying Bayesian methods, where the integrals are over a large number of dimensions. Monte Carlo techniques offer a way around this problem.
Bayesian Methods for Neural Networks – p.27/29

Monte Carlo Sampling Methods
We wish to evaluate integrals of the form:

$$I = \int F(w)\, p(w \mid D)\, dw$$

The idea is to approximate the integral with a finite sum,

$$I \approx \frac{1}{L}\sum_{i=1}^{L} F(w_i)$$

where w_i is a sample of the weights generated from the distribution p(w|D). The challenge in the Monte Carlo method is that it is often difficult to sample from p(w|D) directly.
Bayesian Methods for Neural Networks – p.28/29
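A minimal sketch of the finite-sum approximation, checked on a one-dimensional case where the expectation is known analytically (the target distribution and F below are illustrative):

```python
import numpy as np

def mc_expectation(F, posterior_samples):
    """I = integral of F(w) p(w|D) dw ~ (1/L) * sum_i F(w_i), with w_i ~ p(w|D)."""
    return np.mean([F(w) for w in posterior_samples])

# If p(w|D) were N(1, 0.5^2) and F(w) = w^2, the true expectation is 1 + 0.25 = 1.25.
rng = np.random.default_rng(3)
samples = rng.normal(loc=1.0, scale=0.5, size=10_000)
print(mc_expectation(lambda w: w ** 2, samples))   # close to 1.25
```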

Importance Sampling
If sampling from the distribution p(w|D) is impractical, we could sample from a simpler distribution q(w), from which it is easy to sample. Then we can write

$$I = \int F(w)\,\frac{p(w \mid D)}{q(w)}\, q(w)\, dw \approx \frac{1}{L}\sum_{i=1}^{L} F(w_i)\,\frac{p(w_i \mid D)}{q(w_i)}$$

In general we cannot normalize p(w|D), so we use a modified form of the approximation with an unnormalized p̃(w_i|D),

$$I \approx \frac{\sum_{i=1}^{L} F(w_i)\,\tilde{p}(w_i \mid D)/q(w_i)}{\sum_{i=1}^{L} \tilde{p}(w_i \mid D)/q(w_i)}$$
Bayesian Methods for Neural Networks – p.29/29
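A sketch of the self-normalized estimator above, assuming NumPy; the unnormalized target and the proposal below are toy choices so the answer can be checked (the true mean of the target is 2):

```python
import numpy as np

def importance_sampling(F, log_p_tilde, q_sampler, log_q, L=10_000):
    """Self-normalised importance sampling:
        I ~ sum_i F(w_i) r_i / sum_i r_i,  r_i = p_tilde(w_i|D) / q(w_i),
    with the w_i drawn from the simpler proposal q(w).
    """
    w = q_sampler(L)
    log_r = log_p_tilde(w) - log_q(w)          # log importance ratios
    r = np.exp(log_r - np.max(log_r))          # stabilise before exponentiating
    return np.sum(F(w) * r) / np.sum(r)

# Toy check: unnormalised target proportional to exp(-0.5*(w-2)^2), proposal N(0, 3^2).
rng = np.random.default_rng(4)
est = importance_sampling(
    F=lambda w: w,                                        # estimate E[w] (= 2)
    log_p_tilde=lambda w: -0.5 * (w - 2.0) ** 2,
    q_sampler=lambda L: rng.normal(0.0, 3.0, size=L),
    log_q=lambda w: -0.5 * (w / 3.0) ** 2 - np.log(3.0 * np.sqrt(2 * np.pi)),
    L=50_000,
)
print(est)   # should be close to 2
```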