Classification and Regression PowerPoint

About This Presentation

Classification and regression


Slide Content

Sep 25th, 2001. Copyright © 2001, 2003, Andrew W. Moore
Regression and
Classification with
Neural Networks
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
Note to other teachers and users of
these slides. Andrew would be delighted
if you found this source material useful in
giving your own lectures. Feel free to use
these slides verbatim, or to modify them
to fit your own needs. PowerPoint
originals are available. If you make use
of a significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials.
Comments and corrections gratefully
received.

Neural Networks: Slide 2Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.

DATASET (inputs, outputs):
    x_1 = 1      y_1 = 1
    x_2 = 3      y_2 = 2.2
    x_3 = 2      y_3 = 2
    x_4 = 1.5    y_4 = 1.9
    x_5 = 4      y_5 = 3.1

[Figure: the datapoints with a fitted line of slope w through the origin]

Neural Networks: Slide 3Copyright © 2001, 2003, Andrew W. Moore
1-parameter linear regression
Assume that the data is formed by
    y_i = w x_i + noise_i
where…
•the noise signals are independent
•the noise has a normal distribution with mean 0 and unknown variance σ²
So P(y | w, x) has a normal distribution with
•mean wx
•variance σ²

Neural Networks: Slide 4Copyright © 2001, 2003, Andrew W. Moore
Bayesian Linear Regression
P(y | w, x) = Normal(mean wx, var σ²)
We have a set of datapoints (x_1, y_1), (x_2, y_2), …, (x_n, y_n) which are EVIDENCE about w.
We want to infer w from the data:
    P(w | x_1, x_2, x_3, …, x_n, y_1, y_2, …, y_n)
•You can use BAYES rule to work out a posterior distribution for w given the data.
•Or you could do Maximum Likelihood Estimation

Neural Networks: Slide 5Copyright © 2001, 2003, Andrew W. Moore
Maximum likelihood estimation of w
Asks the question:
"For which value of w is this data most likely to have happened?"
<=>
For what w is
    P(y_1, y_2, …, y_n | x_1, x_2, …, x_n, w) maximized?
<=>
For what w is
    ∏_{i=1}^n P(y_i | w, x_i) maximized?

Neural Networks: Slide 6Copyright © 2001, 2003, Andrew W. Moore
For what w is
    ∏_{i=1}^n P(y_i | w, x_i) maximized?
<=>
For what w is
    ∏_{i=1}^n (1/(σ√(2π))) exp( −(1/2) ((y_i − w x_i)/σ)² ) maximized?
<=>
For what w is
    Σ_{i=1}^n −(1/2) ((y_i − w x_i)/σ)² maximized?
<=>
For what w is
    Σ_{i=1}^n (y_i − w x_i)² minimized?

Neural Networks: Slide 7Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:
    E(w) = Σ_i (y_i − w x_i)²
         = Σ_i y_i² − 2w Σ_i x_i y_i + w² Σ_i x_i²
We want to minimize a quadratic function of w.
[Figure: E(w) plotted against w, a parabola with a single minimum]

Neural Networks: Slide 8Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
Easy to show the sum of squares is minimized when
    w = Σ_i x_i y_i / Σ_i x_i²
The maximum likelihood model is Out(x) = wx.
We can use it for prediction.

Neural Networks: Slide 9Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
Easy to show the sum of squares is minimized when
    w = Σ_i x_i y_i / Σ_i x_i²
The maximum likelihood model is Out(x) = wx.
We can use it for prediction.
Note: In Bayesian stats you'd have ended up with a prob dist of w, and predictions would have given a prob dist of expected output.
Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
[Figure: p(w) plotted against w]

Neural Networks: Slide 10Copyright © 2001, 2003, Andrew W. Moore
Multivariate Regression
What if the inputs are vectors?
Dataset has the form
    x_1   y_1
    x_2   y_2
    x_3   y_3
     :     :
    x_R   y_R
[Figure: a 2-d input example, datapoints plotted in the (x_1, x_2) plane with their output values written beside them]

Neural Networks: Slide 11Copyright © 2001, 2003, Andrew W. Moore
Multivariate Regression
Write matrix X and vector Y thus:

    X = [ x_1 ]   [ x_11  x_12  ...  x_1m ]        Y = [ y_1 ]
        [ x_2 ] = [ x_21  x_22  ...  x_2m ]            [ y_2 ]
        [  :  ]   [   :     :          :  ]            [  :  ]
        [ x_R ]   [ x_R1  x_R2  ...  x_Rm ]            [ y_R ]

(there are R datapoints. Each input has m components)
The linear regression model assumes a vector w such that
    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + … + w_m x[m]
The max. likelihood w is w = (X^T X)^-1 (X^T Y)

Neural Networks: Slide 12Copyright © 2001, 2003, Andrew W. Moore
Multivariate Regression
Write matrix X and vector Y thus:

    X = [ x_1 ]   [ x_11  x_12  ...  x_1m ]        Y = [ y_1 ]
        [ x_2 ] = [ x_21  x_22  ...  x_2m ]            [ y_2 ]
        [  :  ]   [   :     :          :  ]            [  :  ]
        [ x_R ]   [ x_R1  x_R2  ...  x_Rm ]            [ y_R ]

(there are R datapoints. Each input has m components)
The linear regression model assumes a vector w such that
    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + … + w_m x[m]
The max. likelihood w is w = (X^T X)^-1 (X^T Y)
IMPORTANT EXERCISE:
PROVE IT !!!!!

Neural Networks: Slide 13Copyright © 2001, 2003, Andrew W. Moore
Multivariate Regression (con't)
The max. likelihood w is w = (X^T X)^-1 (X^T Y)
X^T X is an m x m matrix: its i,j'th elt is Σ_{k=1}^R x_ki x_kj
X^T Y is an m-element vector: its i'th elt is Σ_{k=1}^R x_ki y_k
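
A minimal NumPy sketch (mine, on synthetic data) of the closed form w = (X^T X)^-1 (X^T Y); in practice np.linalg.solve (or lstsq) is preferable to forming an explicit inverse.

import numpy as np

rng = np.random.default_rng(0)
R, m = 100, 3                              # R datapoints, each input has m components
X = rng.normal(size=(R, m))                # row k is the input vector x_k
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=R)  # y_k = w . x_k + noise

w = np.linalg.solve(X.T @ X, X.T @ Y)      # solves (X^T X) w = X^T Y
print("estimated w:", w)                   # should be close to true_w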

Neural Networks: Slide 14Copyright © 2001, 2003, Andrew W. Moore
What about a constant term?
We may expect
linear data that does
not go through the
origin.
Statisticians and
Neural Net Folks all
agree on a simple
obvious hack.
Can you guess??

Neural Networks: Slide 15Copyright © 2001, 2003, Andrew W. Moore
The constant term
•The trick is to create a fake input "X_0" that always takes the value 1

Before:              After:
X_1  X_2   Y         X_0  X_1  X_2   Y
 2    4   16          1    2    4   16
 3    4   17          1    3    4   17
 5    5   20          1    5    5   20

Before: Y = w_1 X_1 + w_2 X_2
…has to be a poor model
After: Y = w_0 X_0 + w_1 X_1 + w_2 X_2
         = w_0 + w_1 X_1 + w_2 X_2
…has a fine constant term
In this example, you should be able to see the MLE w_0, w_1 and w_2 by inspection
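
A sketch (mine) of the trick on the little table above: prepend the fake X_0 = 1 column, then solve the same least-squares problem. The three equations are exactly determined, so the weights can indeed be read off.

import numpy as np

X = np.array([[2.0, 4.0],        # columns X1, X2 from the slide's table
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

X0 = np.hstack([np.ones((3, 1)), X])         # the fake input X0 = 1
w, *_ = np.linalg.lstsq(X0, Y, rcond=None)
print("w0, w1, w2 =", w)                     # exact fit: w0 = 10, w1 = 1, w2 = 1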

Neural Networks: Slide 16Copyright © 2001, 2003, Andrew W. Moore
Regression with varying noise
•Suppose you know the variance of the noise that was added to each datapoint.

    x_i    y_i    σ_i²
    1/2    1/2    4
    1      1      1
    2      1      1/4
    2      3      4
    3      2      1/4

Assume y_i ~ N(w x_i, σ_i²)
[Figure: the five datapoints plotted on axes x = 0…3, y = 0…3, each with an error bar of width σ_i]

Neural Networks: Slide 17Copyright © 2001, 2003, Andrew W. Moore
MLE estimation with varying noise
argmax_w log p(y_1, y_2, …, y_R | x_1, …, x_R, σ_1², …, σ_R², w)
  = argmin_w Σ_{i=1}^R (y_i − w x_i)² / σ_i²
        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
  = the w such that Σ_{i=1}^R x_i (y_i − w x_i) / σ_i² = 0
        (setting dLL/dw equal to zero)
  = ( Σ_{i=1}^R x_i y_i / σ_i² ) / ( Σ_{i=1}^R x_i² / σ_i² )
        (trivial algebra)

Neural Networks: Slide 18Copyright © 2001, 2003, Andrew W. Moore
This is Weighted Regression
•We are asking to minimize the weighted sum of squares
    argmin_w Σ_{i=1}^R (y_i − w x_i)² / σ_i²
where the weight for the i'th datapoint is 1/σ_i²
[Figure: the same datapoints, with error bars of width σ_i]

Neural Networks: Slide 19Copyright © 2001, 2003, Andrew W. Moore
Weighted Multivariate Regression
The max. likelihood w is w = (WX^T WX)^-1 (WX^T WY)
(WX^T WX) is an m x m matrix: its i,j'th elt is Σ_{k=1}^R x_ki x_kj / σ_k²
(WX^T WY) is an m-element vector: its i'th elt is Σ_{k=1}^R x_ki y_k / σ_k²
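
A sketch (mine) of the 1-parameter weighted MLE from Slide 17, w = (Σ x_i y_i/σ_i²)/(Σ x_i²/σ_i²), applied to the varying-noise table of Slide 16.

import numpy as np

# x_i, y_i and noise variance sigma_i^2 from Slide 16
x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])

w = np.sum(x * y / var) / np.sum(x ** 2 / var)   # each datapoint weighted by 1/sigma_i^2
print("weighted MLE w =", w)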

Neural Networks: Slide 20Copyright © 2001, 2003, Andrew W. Moore
Non-linear Regression
•Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

    x_i    y_i
    1/2    1/2
    1      2.5
    2      3
    3      2
    3      3

Assume y_i ~ N(√(w + x_i), σ²)
[Figure: the datapoints plotted on axes x = 0…3, y = 0…3]

Neural Networks: Slide 21Copyright © 2001, 2003, Andrew W. Moore
Non-linear MLE estimation
argmax_w log p(y_1, …, y_R | x_1, …, x_R, σ, w)
  = argmin_w Σ_{i=1}^R (y_i − √(w + x_i))²
        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
  = the w such that Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0
        (setting dLL/dw equal to zero)

Neural Networks: Slide 22Copyright © 2001, 2003, Andrew W. Moore
Non-linear MLE estimation
argmax_w log p(y_1, …, y_R | x_1, …, x_R, σ, w)
  = argmin_w Σ_{i=1}^R (y_i − √(w + x_i))²
        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
  = the w such that Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0
        (setting dLL/dw equal to zero)
We're down the algebraic toilet

Neural Networks: Slide 23Copyright © 2001, 2003, Andrew W. Moore
Non-linear MLE estimation
argmax_w log p(y_1, …, y_R | x_1, …, x_R, σ, w)
  = argmin_w Σ_{i=1}^R (y_i − √(w + x_i))²
        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
  = the w such that Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0
        (setting dLL/dw equal to zero)
We're down the algebraic toilet
Common (but not only) approach:
Numerical Solutions:
•Line Search
•Simulated Annealing
•Gradient Descent
•Conjugate Gradient
•Levenberg-Marquardt
•Newton's Method
Also, special purpose statistical-optimization-specific tricks such as E.M. (See Gaussian Mixtures lecture for introduction)
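
As a concrete (and purely illustrative) example of the numerical route, here is plain gradient descent on the non-linear sum of squares, assuming the model reconstructed above, Out(x) = √(w + x), and the Slide 20 datapoints; the starting point, step size and iteration count are my own arbitrary choices.

import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])   # datapoints from Slide 20
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):        # sum of squared residuals for Out(x) = sqrt(w + x)
    return np.sum((y - np.sqrt(w + x)) ** 2)

def dsse_dw(w):    # its derivative with respect to w
    return np.sum(-(y - np.sqrt(w + x)) / np.sqrt(w + x))

w, eta = 1.0, 0.05
for _ in range(500):
    w -= eta * dsse_dw(w)      # gradient descent step
print("w =", w, "  SSE =", sse(w))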

Neural Networks: Slide 24Copyright © 2001, 2003, Andrew W. Moore
GRADIENT DESCENT
Suppose we have a scalar function f(w) : ℝ → ℝ
We want to find a local minimum.
Assume our current weight is w
GRADIENT DESCENT RULE:
    w ← w − η ∂f/∂w (w)
η is called the LEARNING RATE. A small positive number, e.g. η = 0.05

Neural Networks: Slide 25Copyright © 2001, 2003, Andrew W. Moore
GRADIENT DESCENT
Suppose we have a scalar function f(w) : ℝ → ℝ
We want to find a local minimum.
Assume our current weight is w
GRADIENT DESCENT RULE:
    w ← w − η ∂f/∂w (w)
η is called the LEARNING RATE. A small positive number, e.g. η = 0.05
QUESTION: Justify the Gradient Descent Rule
(Recall Andrew's favorite default value for anything)
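
A tiny sketch (mine) of the rule on a toy function f(w) = (w − 3)² + 1, using the slide's η = 0.05.

def f(w):
    return (w - 3.0) ** 2 + 1.0

def df_dw(w):
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.05            # current weight and learning rate
for _ in range(200):
    w = w - eta * df_dw(w)    # w <- w - eta * df/dw
print("w =", w, " f(w) =", f(w))   # w approaches the minimum at 3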

Neural Networks: Slide 26Copyright © 2001, 2003, Andrew W. Moore
Gradient Descent in "m" Dimensions
Given f(w) : ℝ^m → ℝ
    ∇f(w) = [ ∂f/∂w_1 (w), …, ∂f/∂w_m (w) ]^T   points in the direction of steepest ascent.
    |∇f(w)| is the gradient in that direction.
GRADIENT DESCENT RULE:
    w ← w − η ∇f(w)
Equivalently
    w_j ← w_j − η ∂f/∂w_j (w)      ….where w_j is the jth weight
"just like a linear feedback system"

Neural Networks: Slide 27Copyright © 2001, 2003, Andrew W. Moore
What’s all this got to do with Neural
Nets, then, eh??
For supervised learning, neural nets are also models with
vectors of w parameters in them. They are now called
weights.
As before, we want to compute the weights to minimize sum-
of-squared residuals.
Which turns out, under “Gaussian i.i.d noise”
assumption to be max. likelihood.
Instead of explicitly solving for max. likelihood weights, we
use GRADIENT DESCENT to SEARCH for them.

Neural Networks: Slide 28Copyright © 2001, 2003, Andrew W. Moore
Linear Perceptrons
They are multivariate linear models:
    Out(x) = w^T x
And "training" consists of minimizing sum-of-squared residuals by gradient descent:
    E = Σ_k (y_k − Out(x_k))² = Σ_k (y_k − w^T x_k)²
QUESTION: Derive the perceptron training rule.

Neural Networks: Slide 29Copyright © 2001, 2003, Andrew W. Moore
Linear Perceptron Training Rule
    E = Σ_{k=1}^R (y_k − w^T x_k)²
Gradient descent tells us we should update w thusly if we wish to minimize E:
    w_j ← w_j − η ∂E/∂w_j
So what's ∂E/∂w_j ?

Neural Networks: Slide 30Copyright © 2001, 2003, Andrew W. Moore
Linear Perceptron Training Rule
    E = Σ_{k=1}^R (y_k − w^T x_k)²
Gradient descent tells us we should update w thusly if we wish to minimize E:
    w_j ← w_j − η ∂E/∂w_j
So what's ∂E/∂w_j ?
    ∂E/∂w_j = ∂/∂w_j Σ_{k=1}^R (y_k − w^T x_k)²
            = Σ_{k=1}^R 2 (y_k − w^T x_k) ∂/∂w_j (y_k − w^T x_k)
            = −2 Σ_{k=1}^R δ_k ∂/∂w_j (w^T x_k)          …where δ_k = y_k − w^T x_k
            = −2 Σ_{k=1}^R δ_k ∂/∂w_j Σ_{i=1}^m w_i x_ki
            = −2 Σ_{k=1}^R δ_k x_kj

Neural Networks: Slide 31Copyright © 2001, 2003, Andrew W. Moore
Linear Perceptron Training Rule
    E = Σ_{k=1}^R (y_k − w^T x_k)²
Gradient descent tells us we should update w thusly if we wish to minimize E:
    w_j ← w_j − η ∂E/∂w_j
…where…
    ∂E/∂w_j = −2 Σ_{k=1}^R δ_k x_kj      (with δ_k = y_k − w^T x_k)
So:
    w_j ← w_j + 2η Σ_{k=1}^R δ_k x_kj
We frequently neglect the 2 (meaning we halve the learning rate)

Neural Networks: Slide 32Copyright © 2001, 2003, Andrew W. Moore
The “Batch” perceptron algorithm
1) Randomly initialize weights w_1, w_2, …, w_m
2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).
3) for i = 1 to R:   δ_i := y_i − w^T x_i
4) for j = 1 to m:   w_j ← w_j + η Σ_{i=1}^R δ_i x_ij
5) if Σ_i δ_i² stops improving then stop. Else loop back to 3.

Neural Networks: Slide 33 Copyright © 2001, 2003, Andrew W. Moore
A RULE KNOWN BY MANY NAMES
    δ_i := y_i − w^T x_i
    w_j ← w_j + η δ_i x_ij
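
As an aside (not in the original deck), here is a minimal NumPy sketch of the batch perceptron algorithm of Slide 32, whose inner update is exactly the rule above; the synthetic data, learning rate and stopping tolerance are my own choices.

import numpy as np

rng = np.random.default_rng(1)
R, m = 200, 3
# Step 2: dataset with 1's appended so the fit need not go through the origin
X = np.hstack([np.ones((R, 1)), rng.normal(size=(R, m - 1))])
true_w = np.array([1.0, 2.0, -1.5])
y = X @ true_w + 0.05 * rng.normal(size=R)

w = 0.01 * rng.normal(size=m)        # Step 1: random initial weights
eta, prev_sse = 0.001, np.inf
while True:
    delta = y - X @ w                # Step 3: delta_i = y_i - w.x_i
    sse = np.sum(delta ** 2)
    if prev_sse - sse < 1e-9:        # Step 5: stop when sum(delta_i^2) stops improving
        break
    prev_sse = sse
    w += eta * X.T @ delta           # Step 4: w_j <- w_j + eta * sum_i delta_i * x_ij
print("learned w:", w)               # should be close to true_w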

Neural Networks: Slide 34Copyright © 2001, 2003, Andrew W. Moore
If data is voluminous and arrives fast
Input-output pairs (x, y) come streaming in very quickly. THEN
Don't bother remembering old ones.
Just keep using new ones.
    observe (x, y)
    δ ← y − w^T x
    w_j ← w_j + η δ x_j

Neural Networks: Slide 35Copyright © 2001, 2003, Andrew W. Moore
GD Advantages (MI disadvantages):
•Biologically plausible
•With very very many attributes each iteration costs only O(mR). If
fewer than m iterations needed we’ve beaten Matrix Inversion
•More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
•It’s moronic
•It’s essentially a slow implementation of a way to build the X^T X matrix
and then solve a set of linear equations
•If m is small it’s especially outrageous. If m is large then the direct
matrix inversion method gets fiddly but not impossible if you want to
be efficient.
•Hard to choose a good learning rate
•Matrix inversion takes predictable time. You can’t be sure when
gradient descent will stop.
Gradient Descent vs Matrix Inversion
for Linear Perceptrons

Neural Networks: Slide 36Copyright © 2001, 2003, Andrew W. Moore
GD Advantages (MI disadvantages):
•Biologically plausible
•With very very many attributes each iteration costs only O(mR). If
fewer than m iterations needed we’ve beaten Matrix Inversion
•More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
•It’s moronic
•It’s essentially a slow implementation of a way to build the X^T X matrix
and then solve a set of linear equations
•If m is small it’s especially outrageous. If m is large then the direct
matrix inversion method gets fiddly but not impossible if you want to
be efficient.
•Hard to choose a good learning rate
•Matrix inversion takes predictable time. You can’t be sure when
gradient descent will stop.
Gradient Descent vs Matrix Inversion
for Linear Perceptrons

Neural Networks: Slide 37Copyright © 2001, 2003, Andrew W. Moore
GD Advantages (MI disadvantages):
•Biologically plausible
•With very very many attributes each iteration costs only O(mR). If
fewer than m iterations needed we’ve beaten Matrix Inversion
•More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
•It’s moronic
•It’s essentially a slow implementation of a way to build the X^T X matrix
and then solve a set of linear equations
•If m is small it’s especially outrageous. If m is large then the direct
matrix inversion method gets fiddly but not impossible if you want to
be efficient.
•Hard to choose a good learning rate
•Matrix inversion takes predictable time. You can’t be sure when
gradient descent will stop.
Gradient Descent vs Matrix Inversion
for Linear Perceptrons
But we’ll
soon see that
GD
has an important extra
trick up its sleeve

Neural Networks: Slide 38Copyright © 2001, 2003, Andrew W. Moore
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤1/2
1 if out(x)>1/2
WHAT’S THE BIG PROBLEM WITH THIS???

Neural Networks: Slide 39Copyright © 2001, 2003, Andrew W. Moore
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤½
1 if out(x)>½
WHAT’S THE BIG PROBLEM WITH THIS???
Blue = Out(x)

Neural Networks: Slide 40Copyright © 2001, 2003, Andrew W. Moore
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤½
1 if out(x)>½
Blue = Out(x)
Green = Classification

Neural Networks: Slide 41Copyright © 2001, 2003, Andrew W. Moore
Classification with Perceptrons I
Don't minimize Σ_i (y_i − w^T x_i)².
Minimize the number of misclassifications instead, using Round(w^T x_i) as the prediction,
where Round(x) = −1 if x < 0
                  1 if x ≥ 0
[Assume outputs are +1 & −1, not +1 & 0]
The gradient descent rule can be changed to:
    if (x_i, y_i) correctly classed, don't change
    if wrongly predicted as 1:   w ← w − x_i
    if wrongly predicted as −1:  w ← w + x_i
NOTE: CUTE &
NON OBVIOUS WHY
THIS WORKS!!

Neural Networks: Slide 42Copyright © 2001, 2003, Andrew W. Moore
Classification with Perceptrons II:
Sigmoid Functions
Least squares fit useless
This fit would classify much
better. But not a least
squares fit.

Neural Networks: Slide 43Copyright © 2001, 2003, Andrew W. Moore
Classification with Perceptrons II:
Sigmoid Functions
Least squares fit useless
This fit would classify much better. But not a least squares fit.
SOLUTION:
Instead of Out(x) = w^T x
we'll use Out(x) = g(w^T x)
where g : ℝ → (0,1) is a squashing function

Neural Networks: Slide 44Copyright © 2001, 2003, Andrew W. Moore
The Sigmoid
    g(h) = 1 / (1 + exp(−h))
Note that if you rotate this curve through 180° centered on (0, 1/2) you get the same curve.
i.e. g(h) = 1 − g(−h)
Can you prove this?
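
A quick numerical check (my addition) of the symmetry claimed above, g(h) = 1 − g(−h), plus the derivative identity g'(h) = g(h)(1 − g(h)) that is used a few slides later.

import numpy as np

def g(h):
    """The sigmoid g(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + np.exp(-h))

h = np.linspace(-5.0, 5.0, 101)
print(np.allclose(g(h), 1.0 - g(-h)))                  # symmetry: g(h) = 1 - g(-h)
numeric = np.gradient(g(h), h)                         # rough numerical derivative
print(np.allclose(numeric, g(h) * (1.0 - g(h)), atol=1e-3))   # matches g(h)(1 - g(h))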

Neural Networks: Slide 45Copyright © 2001, 2003, Andrew W. Moore
The Sigmoid
    g(h) = 1 / (1 + exp(−h))
Now we choose w to minimize
    Σ_{i=1}^R (y_i − Out(x_i))² = Σ_{i=1}^R (y_i − g(w^T x_i))²

Neural Networks: Slide 46Copyright © 2001, 2003, Andrew W. Moore
Linear Perceptron Classification Regions
[Figure: points labelled 0 and 1 in the (X_1, X_2) plane, separated by a line]
We'll use the model Out(x) = g( w^T (x,1) )
                           = g( w_1 x_1 + w_2 x_2 + w_0 )
Which region of the above diagram is classified with +1, and which with 0 ??

Neural Networks: Slide 47Copyright © 2001, 2003, Andrew W. Moore
Gradient descent with sigmoid on a perceptron

First, notice that g'(x) = g(x)(1 − g(x)):
Because g(x) = 1/(1 + e^(−x)),
    g'(x) = e^(−x) / (1 + e^(−x))²
          = (1/(1 + e^(−x))) · (e^(−x)/(1 + e^(−x)))
          = (1/(1 + e^(−x))) · (1 − 1/(1 + e^(−x)))
          = g(x)(1 − g(x))

Writing net_i = Σ_k w_k x_ik, so that Out(x_i) = g(net_i):
    ∂/∂w_j Σ_i (y_i − g(Σ_k w_k x_ik))²
      = Σ_i 2 (y_i − g(net_i)) · (−g'(net_i)) · ∂net_i/∂w_j
      = −2 Σ_i (y_i − g_i) g_i (1 − g_i) x_ij        where g_i = g(net_i)

The sigmoid perceptron update rule:
    w_j ← w_j + η Σ_{i=1}^R δ_i g_i (1 − g_i) x_ij
where g_i = g( Σ_{j=1}^m w_j x_ij )  and  δ_i = y_i − g_i
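
A minimal sketch (mine, not from the slides) of the sigmoid perceptron update rule just derived, applied to synthetic 0/1 data; the data, learning rate and iteration count are all assumptions.

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(2)
R, m = 200, 3
X = np.hstack([np.ones((R, 1)), rng.normal(size=(R, m - 1))])   # constant-term trick
y = (X @ np.array([-0.5, 2.0, -2.0]) > 0).astype(float)         # 0/1 class labels

w, eta = np.zeros(m), 0.05
for _ in range(2000):
    gi = g(X @ w)                                # g_i = g(w . x_i)
    delta = y - gi                               # delta_i = y_i - g_i
    w += eta * X.T @ (delta * gi * (1.0 - gi))   # the sigmoid perceptron update rule
pred = g(X @ w) > 0.5
print("training accuracy:", np.mean(pred == (y == 1)))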

Neural Networks: Slide 48Copyright © 2001, 2003, Andrew W. Moore
Other Things about Perceptrons
•Invented and popularized by Rosenblatt (1962)
•Even with sigmoid nonlinearity, correct
convergence is guaranteed
•Stable behavior for overconstrained and
underconstrained problems

Neural Networks: Slide 49Copyright © 2001, 2003, Andrew W. Moore
Perceptrons and Boolean Functions
If inputs are all 0's and 1's and outputs are all 0's and 1's…
•Can learn the function x_1 ∧ x_2
•Can learn the function x_1 ∨ x_2
•Can learn any conjunction of literals, e.g.
    x_1 ∧ ~x_2 ∧ ~x_3 ∧ x_4 ∧ x_5
QUESTION: WHY?
[Figure: small perceptron diagrams with inputs X_1 and X_2]

Neural Networks: Slide 50Copyright © 2001, 2003, Andrew W. Moore
Perceptrons and Boolean Functions
•Can learn any disjunction of literals
    e.g. x_1 ∨ ~x_2 ∨ ~x_3 ∨ x_4 ∨ x_5
•Can learn majority function
    f(x_1, x_2, …, x_n) = 1 if n/2 or more x_i's are = 1
                          0 if fewer than n/2 x_i's are = 1
•What about the exclusive or function?
    f(x_1, x_2) = x_1 ⊕ x_2 = (x_1 ∧ ~x_2) ∨ (~x_1 ∧ x_2)

Neural Networks: Slide 51Copyright © 2001, 2003, Andrew W. Moore
Multilayer Networks
The class of functions representable by perceptrons is limited:
    Out(x) = g(w^T x) = g( Σ_j w_j x_j )
Use a wider representation!
    Out(x) = g( Σ_j W_j g( Σ_k w_jk x_k ) )
This is a nonlinear function
of a linear combination
of non linear functions
of linear combinations of inputs

Neural Networks: Slide 52Copyright © 2001, 2003, Andrew W. Moore
A 1-HIDDEN LAYER NET
N_INPUTS = 2   N_HIDDEN = 3

    v_1 = g( Σ_{k=1}^{N_INS} w_1k x_k )
    v_2 = g( Σ_{k=1}^{N_INS} w_2k x_k )
    v_3 = g( Σ_{k=1}^{N_INS} w_3k x_k )

    Out = g( Σ_{k=1}^{N_HID} W_k v_k )

[Figure: inputs x_1, x_2 feeding hidden units v_1, v_2, v_3 through weights w_11, w_21, w_31, w_12, w_22, w_32, which feed the output through weights W_1, W_2, W_3]
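
A minimal forward-pass sketch (mine) of the 1-hidden-layer net above with N_INPUTS = 2 and N_HIDDEN = 3; the weight values are arbitrary placeholders.

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

w = np.array([[0.4, -0.7],    # w_11, w_12: weights into hidden unit 1
              [0.1,  0.9],    # w_21, w_22: hidden unit 2
              [-0.3, 0.5]])   # w_31, w_32: hidden unit 3
W = np.array([1.2, -0.8, 0.6])   # W_1, W_2, W_3: hidden-to-output weights

def out(x):
    v = g(w @ x)              # v_k = g( sum_j w_kj x_j )
    return g(W @ v)           # Out = g( sum_k W_k v_k )

print(out(np.array([1.0, 2.0])))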

Neural Networks: Slide 53Copyright © 2001, 2003, Andrew W. Moore
OTHER NEURAL NETS
2-Hidden layers + Constant Term
1
x
1
x
2
x
3
x
2
x
1
“JUMP” CONNECTIONS







 

HIDINS N
k
kk
N
k
kk vWxwg
11
0 Out

Neural Networks: Slide 54Copyright © 2001, 2003, Andrew W. Moore
Backpropagation
    Out(x) = g( Σ_j W_j g( Σ_k w_jk x_k ) )
Find a set of weights {W_j}, {w_jk} to minimize
    Σ_i (y_i − Out(x_i))²
by gradient descent.
That's it!
That's the backpropagation algorithm.
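
To make "that's it" concrete, here is a compact sketch (my own, with made-up data and hyperparameters) of gradient descent on the 1-hidden-layer squared error, i.e. backpropagation for the network defined above; the derivatives use the sigmoid identity g' = g(1 − g).

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(3)
R, n_in, n_hid = 200, 2, 3
X = rng.normal(size=(R, n_in))
y = (np.sin(X[:, 0]) + 0.5 * X[:, 1] > 0).astype(float)   # some nonlinear target

w = 0.5 * rng.normal(size=(n_hid, n_in))   # input-to-hidden weights w_jk
W = 0.5 * rng.normal(size=n_hid)           # hidden-to-output weights W_j
eta = 0.05

for _ in range(5000):
    V = g(X @ w.T)                              # hidden activations v_ij
    out = g(V @ W)                              # network outputs
    d_out = (y - out) * out * (1 - out)         # output "delta" (factor 2 folded into eta)
    d_hid = np.outer(d_out, W) * V * (1 - V)    # deltas backpropagated to the hidden layer
    W += eta * V.T @ d_out                      # gradient step on hidden-to-output weights
    w += eta * d_hid.T @ X                      # gradient step on input-to-hidden weights

print("mean squared error:", np.mean((y - g(g(X @ w.T) @ W)) ** 2))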

Neural Networks: Slide 55Copyright © 2001, 2003, Andrew W. Moore
Backpropagation Convergence
Convergence to a global minimum is not
guaranteed.
•In practice, this is not a problem, apparently.
Tweaking to find the right number of hidden
units, or a useful learning rate η, is more
hassle, apparently.
IMPLEMENTING BACKPROP: Differentiate the monster sum-square residual and write down the Gradient Descent Rule. It turns out to be easier & computationally efficient to use lots of local variables with names like h_j, o_k, v_j, net_i etc…

Neural Networks: Slide 56Copyright © 2001, 2003, Andrew W. Moore
Choosing the learning rate
•This is a subtle art.
•Too small: can take days instead of minutes
to converge
•Too large: diverges (MSE gets larger and
larger while the weights increase and
usually oscillate)
•Sometimes the “just right” value is hard to
find.

Neural Networks: Slide 57Copyright © 2001, 2003, Andrew W. Moore
Learning-rate problems
From J. Hertz, A. Krogh, and R.
G. Palmer. Introduction to the
Theory of Neural Computation.
Addison-Wesley, 1994.

Neural Networks: Slide 58Copyright © 2001, 2003, Andrew W. Moore
Improving Simple Gradient Descent
Momentum
Don’t just change weights according to the current datapoint.
Re-use changes from earlier iterations.
Let ∆w(t) = weight changes at time t.
Let −η ∂E/∂w be the change we would make with regular gradient descent.
Instead we use
    Δw(t+1) = −η ∂E/∂w + α Δw(t)
    w(t+1) = w(t) + Δw(t+1)
where α is the momentum parameter.
Momentum damps oscillations.
A hack? Well, maybe.
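
A small sketch (mine) of the momentum update above on a toy elongated quadratic error surface; the value of the momentum parameter α is an arbitrary choice.

import numpy as np

A = np.diag([10.0, 1.0])          # toy error surface E(w) = 0.5 * w^T A w (elongated bowl)
def grad_E(w):
    return A @ w

w = np.array([1.0, 1.0])
dw = np.zeros(2)                  # Delta-w(t)
eta, alpha = 0.05, 0.9            # learning rate and momentum parameter

for _ in range(200):
    dw = -eta * grad_E(w) + alpha * dw   # Delta-w(t+1) = -eta*dE/dw + alpha*Delta-w(t)
    w = w + dw                           # w(t+1) = w(t) + Delta-w(t+1)
print("w after 200 steps:", w)           # heads toward the minimum at the origin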

Neural Networks: Slide 59Copyright © 2001, 2003, Andrew W. Moore
Momentum illustration

Neural Networks: Slide 60Copyright © 2001, 2003, Andrew W. Moore
Improving Simple Gradient Descent
Newton's method
    E(w + h) = E(w) + h^T ∂E/∂w + (1/2) h^T ∂²E/∂w² h + O(|h|³)
If we neglect the O(|h|³) terms, this is a quadratic form.
Quadratic form fun facts:
If y = c + b^T x − (1/2) x^T A x
and if A is SPD
then x_opt = A^-1 b is the value of x that maximizes y

Neural Networks: Slide 61Copyright © 2001, 2003, Andrew W. Moore
Improving Simple Gradient Descent
Newton's method
    E(w + h) = E(w) + h^T ∂E/∂w + (1/2) h^T ∂²E/∂w² h + O(|h|³)
If we neglect the O(|h|³) terms, this is a quadratic form, so jump to its minimum:
    w ← w − (∂²E/∂w²)^-1 ∂E/∂w
This should send us directly to the global minimum if the function is truly quadratic.
And it might get us close if it's locally quadraticish

Neural Networks: Slide 62Copyright © 2001, 2003, Andrew W. Moore
Improving Simple Gradient Descent
Newton's method
    E(w + h) = E(w) + h^T ∂E/∂w + (1/2) h^T ∂²E/∂w² h + O(|h|³)
If we neglect the O(|h|³) terms, this is a quadratic form, so jump to its minimum:
    w ← w − (∂²E/∂w²)^-1 ∂E/∂w
This should send us directly to the global minimum if the function is truly quadratic.
And it might get us close if it's locally quadraticish
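
A tiny sketch (mine) of the Newton update w ← w − (∂²E/∂w²)^-1 ∂E/∂w on a genuinely quadratic E, where a single step lands exactly on the minimum, as the slide says.

import numpy as np

# A truly quadratic error: E(w) = 0.5 w^T A w - b^T w, with A symmetric positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

grad = lambda w: A @ w - b        # dE/dw
hess = lambda w: A                # d^2E/dw^2 (constant for a quadratic)

w = np.array([5.0, 5.0])                          # arbitrary starting point
w = w - np.linalg.solve(hess(w), grad(w))         # one Newton step
print("after one Newton step:", w)
print("true minimum (A^-1 b) :", np.linalg.solve(A, b))   # identical here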

Neural Networks: Slide 63Copyright © 2001, 2003, Andrew W. Moore
Improving Simple Gradient Descent
Conjugate Gradient
Another method which attempts to exploit the "local quadratic bowl" assumption,
but does so while only needing to use ∂E/∂w and not ∂²E/∂w².
It is also more stable than Newton's method if the local quadratic bowl assumption is violated.
It's complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.

Neural Networks: Slide 64Copyright © 2001, 2003, Andrew W. Moore
BEST GENERALIZATION
Intuitively, you want to use the smallest,
simplest net that seems to fit the data.
HOW TO FORMALIZE THIS INTUITION?
1.Don’t. Just use intuition
2.Bayesian Methods Get it Right
3.Statistical Analysis explains what’s going on
4.Cross-validation
Discussed in the next
lecture

Neural Networks: Slide 65Copyright © 2001, 2003, Andrew W. Moore
What You Should Know
•How to implement multivariate Least-
squares linear regression.
•Derivation of least squares as max.
likelihood estimator of linear coefficients
•The general gradient descent rule

Neural Networks: Slide 66Copyright © 2001, 2003, Andrew W. Moore
What You Should Know
•Perceptrons
Linear output, least squares
Sigmoid output, least squares
•Multilayer nets
The idea behind back prop
Awareness of better minimization methods
•Generalization. What it means.

Neural Networks: Slide 67Copyright © 2001, 2003, Andrew W. Moore
APPLICATIONS
To Discuss:
•What can non-linear regression be useful for?
•What can neural nets (used as non-linear
regressors) be useful for?
•What are the advantages of N. Nets for
nonlinear regression?
•What are the disadvantages?

Neural Networks: Slide 68Copyright © 2001, 2003, Andrew W. Moore
Other Uses of Neural Nets…
•Time series with recurrent nets
•Unsupervised learning (clustering, principal
components and non-linear versions
thereof)
•Combinatorial optimization with Hopfield
nets, Boltzmann Machines
•Evaluation function learning (in
reinforcement learning)