Historical Background
1943 McCulloch and Pitts proposed the first computational model of a neuron.
1949 Hebb proposed the first learning rule.
1958 Rosenblatt's work on perceptrons.
1969 Minsky and Papert exposed the limitations of perceptron theory.
1970s Decade of dormancy for neural networks.
1980-90s Revival of neural networks (self-organization, back-propagation algorithms, etc.).
Nervous Systems
The human brain contains ~10^11 neurons.
Each neuron is connected to ~10^4 others.
Some scientists have compared the brain to a "complex, nonlinear, parallel computer".
The largest modern neural networks achieve complexity comparable to the nervous system of a fly.
Neurons
The main purpose of neurons is to receive, analyze and transmit
information in the form of signals (electric pulses).
When a neuron sends information, we say that the neuron "fires".
Neurons
[Animation: signal transmission across a synapse, from the pre-synaptic
terminal of one neuron to the soma (cell body) of another neuron.]
Acting through specialized projections known as dendrites and axons,
neurons carry information throughout the neural network.
A Model of Artificial Neuron

[Diagram: inputs x_1, x_2, ..., x_m = -1 with weights w_i1, w_i2, ..., w_im = theta_i
feed a summation unit f(.) followed by an activation unit a(.), producing output y_i.
Fixing the last input at x_m = -1 makes the last weight w_im act as the
bias (threshold) theta_i.]

The neuron computes the weighted net input

    f_i = sum_{j=1}^{m} w_ij x_j

and produces its output through the activation function a(.):

    y_i(t+1) = a( f_i(t) ),   where   a(f) = 1 if f >= 0, and 0 otherwise.
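The unit above can be sketched in a few lines of code. This is a minimal illustration, not the slides' notation: the function name `threshold_neuron` and the AND example are assumptions, with the bias written explicitly as a threshold `theta` instead of the fixed input x_m = -1.

```python
# A minimal sketch of the artificial neuron above: the net input is a
# weighted sum of the inputs, and the activation a(.) is a hard threshold.

def threshold_neuron(x, w, theta):
    """Return 1 if sum_j w_j * x_j - theta >= 0, else 0."""
    f = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return 1 if f >= 0 else 0

# A neuron with w = (1, 1) and theta = 1.5 computes logical AND:
print([threshold_neuron(x, [1, 1], 1.5) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 0, 0, 1]
```

Choosing theta = 0.5 instead would make the same unit compute logical OR, which is the kind of weight-dependence of behavior the later slides build on.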
Feed-Forward Neural Networks
Graph representation:
- nodes: neurons
- arrows: signal flow directions
A neural network that does not contain cycles (feedback loops) is
called a feed-forward network (or perceptron).

[Diagram: inputs x_1, x_2, ..., x_m enter the Input Layer, pass through
one or more Hidden Layer(s), and produce outputs y_1, y_2, ..., y_n at
the Output Layer.]
Layered Structure

[Diagram: the same layered network, with inputs x_1, ..., x_m and
outputs y_1, ..., y_n.]
Knowledge and Memory

[Diagram: the layered network with inputs x_1, ..., x_m and outputs y_1, ..., y_n.]

The output behavior of a network is determined by the weights.
Weights are the memory of an NN.
Knowledge is distributed across the network.
A large number of nodes
- increases the storage "capacity";
- ensures that the knowledge is robust;
- provides fault tolerance.
New information is stored by changing weights.
Pattern Classification

[Diagram: the layered network mapping an input pattern x to an output pattern y.]

Function: x -> y
The NN's output is used to distinguish between and recognize different
input patterns.
Different output patterns correspond to particular classes of input
patterns.
Networks with hidden layers can be used to solve problems more complex
than linear pattern classification.
Training

Training Set:  T = { (x^(1), d^(1)), (x^(2), d^(2)), ..., (x^(k), d^(k)) }
with  x^(i) = (x_i1, x_i2, ..., x_im)^T  and  d^(i) = (d_i1, d_i2, ..., d_in)^T.

[Diagram: pattern x^(i) = (x_i1, ..., x_im) is fed to the network, which
produces y^(i) = (y_i1, ..., y_in); the desired output is d^(i) = (d_i1, ..., d_in).]

Goal:

    Min E = sum_i error( d^(i), y^(i) ) = sum_i || d^(i) - y^(i) ||^2
Generalization

[Diagram: the layered network with inputs x_1, ..., x_m and outputs y_1, ..., y_n.]

A properly trained neural network may produce reasonable answers for
input patterns not seen during training (generalization).
Generalization is particularly useful for the analysis of noisy data
(e.g., time series).
[Plot: a curve fitted to samples over the range -1.5 to 1.5 on both
axes, shown without noise and with noise.]
Applications
Pattern classification
Object recognition
Function approximation
Data compression
Time series analysis and forecasting
. . .
The Single-Layered Perceptron

[Diagram: inputs x_1, x_2, ..., x_{m-1} and bias input x_m = -1 are fully
connected to output neurons y_1, y_2, ..., y_n through weights w_11, w_12,
..., w_nm.]
Training a Single-Layered Perceptron

[Diagram: the single-layered perceptron with weights w_11, ..., w_nm,
outputs y_1, ..., y_n and desired outputs d_1, ..., d_n.]

Training Set:  T = { (x^(1), d^(1)), (x^(2), d^(2)), ..., (x^(p), d^(p)) }

Goal:

    y_i^(k) = a( sum_{l=1}^{m} w_il x_l^(k) ) = a( w_i^T x^(k) ) = d_i^(k)

    for i = 1, 2, ..., n and k = 1, 2, ..., p.
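The forward pass above can be sketched compactly for all output neurons and all patterns at once. This is an illustrative assumption, not the slides' code: the weight values are made up, and the hard-limiting activation a = sgn anticipates the LTU case introduced next.

```python
import numpy as np

# Sketch of the forward pass: y_i^(k) = a( sum_l w_il x_l^(k) ),
# computed for every output neuron i and every pattern k at once.
# The last input component is fixed at -1 so that w_im acts as a threshold.

def forward(W, X):
    """W: (n, m) weight matrix; X: (p, m) patterns whose last column is -1."""
    return np.where(X @ W.T >= 0.0, 1, -1)   # sgn, with sgn(0) taken as +1

W = np.array([[1.0, 1.0, 1.5],    # neuron 1 fires iff x1 + x2 >= 1.5
              [1.0, 1.0, 0.5]])   # neuron 2 fires iff x1 + x2 >= 0.5
X = np.array([[0.0, 0.0, -1.0],
              [1.0, 1.0, -1.0]])
print(forward(W, X))   # pattern 1 -> (-1, -1); pattern 2 -> (+1, +1)
```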
Learning Rules

[Diagram: the same single-layered perceptron with desired outputs d_1, ..., d_n.]

Linear Threshold Units (LTUs): Perceptron Learning Rule
Linearly Graded Units (LGUs): Widrow-Hoff Learning Rule
Perceptron

Linear Threshold Unit:

    y_i^(k) = sgn( w_i^T x^(k) )

[Diagram: inputs x_1, x_2, ..., x_m = -1 with weights w_i1, ..., w_im = theta_i
feed the summation w_i^T x^(k), followed by the hard limiter sgn(.),
whose output is +1 or -1.]
Goal:

    y_i^(k) = sgn( w_i^T x^(k) ) = d_i^(k) in {-1, +1}

    for i = 1, 2, ..., n and k = 1, 2, ..., p.
Example

[Diagram: a perceptron with inputs x_1, x_2 and bias input x_3 = -1,
weights (-2, 1, -2), and output y.]

Class 1 (+1):  [1, 0]^T, [1.5, 1]^T, [1, 2]^T
Class 2 (-1):  [2, 0]^T, [2.5, 1]^T, [1, -2]^T

[Plot: the two classes in the (x_1, x_2) plane, separated by the line
g(x) = -2 x_1 + x_2 + 2 = 0, with class 1 on the side g(x) >= 0.]

Goal:  y^(k) = sgn( w^T x^(k) ) = d^(k) in {-1, +1} for k = 1, 2, ..., p,
where w = (w_1, w_2, w_3)^T and each x^(k) is an augmented input vector:

Class 1 (+1):  x = (1, 0, -1)^T, (1.5, 1, -1)^T, (1, 2, -1)^T,  d = +1
Class 2 (-1):  x = (2, 0, -1)^T, (2.5, 1, -1)^T, (1, -2, -1)^T,  d = -1
Augmented input vector

[Plot: the augmented patterns in (x_1, x_2, x_3) space. Class 1:
(1, 2, -1), (1.5, 1, -1), (1, 0, -1). Class 2: (1, -2, -1), (2.5, 1, -1),
(2, 0, -1). The decision boundary g(x) = -2 x_1 + x_2 - 2 x_3 = 0, i.e.
w^T x = 0, passes through the origin (0, 0, 0).]

The decision boundary is a plane passing through the origin
in the augmented input space.
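The worked example can be checked numerically. Note that with the slide's line two class-1 points fall exactly on the boundary, so the convention sgn(0) = +1 is assumed here.

```python
# Checking the worked example: with w = (-2, 1, -2)^T and augmented
# inputs (x1, x2, -1), g(x) = w^T x = -2*x1 + x2 + 2 is >= 0 exactly
# on class 1 and < 0 on class 2.

w = (-2.0, 1.0, -2.0)
class1 = [(1.0, 0.0, -1.0), (1.5, 1.0, -1.0), (1.0, 2.0, -1.0)]
class2 = [(2.0, 0.0, -1.0), (2.5, 1.0, -1.0), (1.0, -2.0, -1.0)]

def g(x):
    return sum(wi * xi for wi, xi in zip(w, x))

print([g(x) for x in class1])   # -> [0.0, 0.0, 2.0]  (all >= 0)
print([g(x) for x in class2])   # -> [-2.0, -2.0, -2.0]  (all < 0)
```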
Linearly Separable vs. Linearly Non-Separable

[Plots: the truth tables of AND, OR and XOR drawn on the unit square,
with inputs 0/1 on each axis.]

AND: Linearly Separable.  OR: Linearly Separable.  XOR: Linearly Non-Separable.
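The claim above can be probed by brute force: search a small grid of LTU weights for each Boolean function. The grid and helper names are illustrative assumptions; the grid search is evidence, not a proof, but for XOR no LTU exists at all, so every grid fails.

```python
import itertools

# Exhaustively search LTU weights (w1, w2, threshold t) on a small grid.
# AND and OR admit a solution; XOR does not, being non-separable.

def ltu(x1, x2, w1, w2, t):
    return 1 if w1 * x1 + w2 * x2 - t >= 0 else 0

def separable(target):
    grid = [i / 2 for i in range(-4, 5)]            # -2.0 .. 2.0, step 0.5
    for w1, w2, t in itertools.product(grid, repeat=3):
        if all(ltu(x1, x2, w1, w2, t) == target(x1, x2)
               for x1, x2 in itertools.product((0, 1), repeat=2)):
            return True
    return False

print(separable(lambda a, b: a & b))   # True  (AND)
print(separable(lambda a, b: a | b))   # True  (OR)
print(separable(lambda a, b: a ^ b))   # False (XOR)
```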
Goal

Given training sets T_1 (a subset of C_1) and T_2 (a subset of C_2) with
elements in the form x = (x_1, x_2, ..., x_{m-1}, x_m)^T, where
x_1, x_2, ..., x_{m-1} are real and x_m = -1.

Assume T_1 and T_2 are linearly separable.

Find w = (w_1, w_2, ..., w_m)^T such that

    sgn( w^T x ) = +1 if x in T_1,
    sgn( w^T x ) = -1 if x in T_2.
w^T x = 0 is a hyperplane passing through the origin of the augmented
input space.
Observation

[Plot: candidate weight vectors w_1, ..., w_6 drawn in the (x_1, x_2)
plane together with a pattern x labeled d = +1 and another labeled d = -1.]

Which w's correctly classify x?
What trick can be used?
Observation

[Plot: the line w_1 x_1 + w_2 x_2 = 0 with its normal vector w, and a
pattern x with d = +1 lying on the negative side.]

Is this w ok?  No:  w^T x < 0.
Observation

[Plot: the same situation; the weight vector must be turned toward x.]

How to adjust w?   Delta-w = ?
Observation

Try  Delta-w = eta * x.  Then

    ( w + Delta-w )^T x = w^T x + eta * x^T x

where w^T x < 0 and eta * x^T x > 0, so the update pushes w^T x toward
the correct (positive) side.  Reasonable?
Observation

[Plot: a pattern with d = +1 and a pattern with d = -1 on either side of
the boundary.]

Is this w ok?  w^T x < 0.
Delta-w = ?   +eta*x or -eta*x, depending on the desired output.
Perceptron Learning Rule

Upon misclassification of a pattern x:

    Delta-w = +eta*x  if d = +1,
    Delta-w = -eta*x  if d = -1.

Define the error  r = d - y, so r is +2, -2, or 0 (no error).
Absorbing the factor 2 into the learning rate, the rule becomes

    Delta-w = eta * r * x.
Summary
Perceptron Learning Rule

Based on the general weight learning rule:

    Delta-w_i(t) = eta * r * x_i(t),   with  r = d_i - y_i

    w_i(t+1) = w_i(t) + eta * ( d_i - y_i(t) ) * x_i(t)

    d_i - y_i = 0     if y_i is correct,
    d_i - y_i = +/-2  if y_i is incorrect.
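The rule just summarized can be sketched end to end on the six augmented patterns of the earlier worked example. The learning rate and epoch cap are illustrative assumptions; convergence itself is guaranteed by the theorem on the next slide, since the data are linearly separable.

```python
import numpy as np

# Perceptron learning rule: w <- w + eta * (d - y) * x on every
# misclassified pattern, repeated until a full pass makes no errors.

X = np.array([[1.0, 0.0, -1.0], [1.5, 1.0, -1.0], [1.0, 2.0, -1.0],   # class 1
              [2.0, 0.0, -1.0], [2.5, 1.0, -1.0], [1.0, -2.0, -1.0]]) # class 2
d = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(3)
eta = 0.5
for epoch in range(1000):
    errors = 0
    for x, t in zip(X, d):
        y = 1 if w @ x >= 0 else -1        # sgn, with sgn(0) taken as +1
        if y != t:
            w += eta * (t - y) * x         # (t - y) is +2 or -2 here
            errors += 1
    if errors == 0:                        # converged (data are separable)
        break

print(w, all((1 if w @ x >= 0 else -1) == t for x, t in zip(X, d)))
```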
Summary
Perceptron Learning Rule

[Diagram: the network maps x to y; the error (d - y) drives the update
Delta-w = eta * (d - y) * x.]

Converge?
Perceptron Convergence Theorem
If the given training set is linearly separable, the learning process
will converge in a finite number of steps.
Exercise: consult papers or textbooks for a proof of the theorem.
The Learning Scenario

[Plots: a linearly separable training set in the (x_1, x_2) plane, with
positive patterns x^(1), x^(2) and negative patterns x^(3), x^(4). The
weight vector is updated step by step, w_0 -> w_1 -> w_2 -> w_3, with
w_4 = w_3 once every pattern is classified correctly; the final w
separates the two classes.]

The demonstration is in augmented space.
Conceptually, in augmented space, we adjust the weight vector to fit
the data.
Weight Space

[Plots: the (w_1, w_2) weight plane. For a positive example +x, any
weight in the shaded half-plane w^T x > 0 gives correct classification,
and a misclassified w is corrected by Delta-w = eta*x. For a negative
example -x, any weight not in the shaded half-plane gives correct
classification, and a misclassified w is corrected by Delta-w = -eta*x.]
The Learning Scenario in Weight Space

[Plots: the (w_1, w_2) weight plane with the constraints induced by
patterns x^(1), x^(2) (positive) and x^(3), x^(4) (negative). To
correctly classify the training set, the weight must move into the
shaded (feasible) area. The trajectory w_0 -> w_1 -> w_2 = w_3 -> w_4 ->
... -> w_11 ends inside the feasible region.]

Conceptually, in weight space, we move the weight into the feasible
region.
Adaline (Adaptive Linear Element)
Widrow [1962]

[Diagram: inputs x_1, x_2, ..., x_m with weights w_i1, ..., w_im feed a
linear unit whose output equals its net input.]

    y_i^(k) = w_i^T x^(k)

Goal:

    y_i^(k) = w_i^T x^(k) = d_i^(k)   for i = 1, 2, ..., n and k = 1, 2, ..., p.

Under what condition is the goal reachable?
LMS (Least Mean Square)
Minimize the cost function (error function):

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2
         = (1/2) sum_{k=1}^{p} ( d^(k) - w^T x^(k) )^2
         = (1/2) sum_{k=1}^{p} ( d^(k) - sum_{l=1}^{m} w_l x_l^(k) )^2
Gradient Descent Algorithm

[Plot: the error surface E(w) over the (w_1, w_2) plane, with its
contour map below.]

Our goal is to go downhill.
How do we find the steepest descent direction?
Gradient Operator
Let f(w) = f(w_1, w_2, ..., w_m) be a function over R^m.

    df = (df/dw_1) dw_1 + (df/dw_2) dw_2 + ... + (df/dw_m) dw_m

Define

    grad f = ( df/dw_1, df/dw_2, ..., df/dw_m )^T
    dw = ( dw_1, dw_2, ..., dw_m )^T

so that  df = < grad f, dw >.
The Steepest Descent Direction

    df = < grad f, dw >

    df > 0: go uphill;   df = 0: plain;   df < 0: go downhill.

To minimize f, we choose  Delta-w = -eta * grad f.
LMS (Least Mean Square)
Minimize the cost function (error function):

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - sum_{l=1}^{m} w_l x_l^(k) )^2

    dE(w)/dw_j = - sum_{k=1}^{p} ( d^(k) - sum_l w_l x_l^(k) ) x_j^(k)
               = - sum_{k=1}^{p} ( d^(k) - w^T x^(k) ) x_j^(k)
               = - sum_{k=1}^{p} ( d^(k) - y^(k) ) x_j^(k)

With the error  delta^(k) = d^(k) - y^(k):

    dE(w)/dw_j = - sum_{k=1}^{p} delta^(k) x_j^(k)
Adaline Learning Rule
Minimize the cost function (error function):

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - sum_{l=1}^{m} w_l x_l^(k) )^2

    grad E(w) = ( dE(w)/dw_1, dE(w)/dw_2, ..., dE(w)/dw_m )^T

Weight Modification Rule:

    Delta-w = -eta * grad E(w),  i.e.  Delta-w_j = eta sum_{k=1}^{p} delta^(k) x_j^(k),
    with  delta^(k) = d^(k) - y^(k).
Learning Modes

Batch Learning Mode:

    Delta-w_j = eta sum_{k=1}^{p} delta^(k) x_j^(k)

Incremental Learning Mode:

    Delta-w_j = eta delta^(k) x_j^(k)

with  delta^(k) = d^(k) - y^(k).
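The incremental mode can be sketched on a small linear regression task. The target function d = 2 x_1 - x_2 + 1 and all sizes are illustrative assumptions, chosen so that with the bias input fixed at x_3 = -1 the exact solution is w = (2, -1, -1).

```python
import numpy as np

# Incremental (per-pattern) LMS: w <- w + eta * (d - y) * x,
# with the linear Adaline output y = w^T x.

rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(-1, 1, size=(50, 2)), -np.ones((50, 1))])  # x3 = -1
d = 2 * X[:, 0] - X[:, 1] + 1          # realizable target: w* = (2, -1, -1)

w = np.zeros(3)
eta = 0.1                              # well below the stability bound
for _ in range(500):                   # epochs
    for x, t in zip(X, d):
        y = w @ x                      # linear output
        w += eta * (t - y) * x         # incremental LMS step

print(np.round(w, 3))                  # approaches (2, -1, -1)
```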
Summary
Adaline Learning Rule

[Diagram: the linear unit maps x to y; the error delta = d - y drives
the update Delta-w = eta * delta * x.]

Also known as the delta-Learning Rule, the LMS Algorithm, or the
Widrow-Hoff Learning Rule.

Converge?
LMS Convergence
Based on the independence theory (Widrow, 1976):
1. The successive input vectors are statistically independent.
2. At time t, the input vector x(t) is statistically independent of all
previous samples of the desired response, namely d(1), d(2), ..., d(t-1).
3. At time t, the desired response d(t) depends on x(t), but is
statistically independent of all previous values of the desired response.
4. The input vector x(t) and desired response d(t) are drawn from
Gaussian-distributed populations.
LMS Convergence
It can be shown that LMS is convergent if

    0 < eta < 2 / lambda_max

where lambda_max is the largest eigenvalue of the correlation matrix
of the inputs,

    R_x = lim_{n -> inf} (1/n) sum_{i=1}^{n} x_i x_i^T.

Since lambda_max is hardly available in practice, we commonly use

    0 < eta < 2 / tr(R_x).
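The practical bound works because tr(R_x) >= lambda_max for a correlation matrix (its eigenvalues are nonnegative and sum to the trace), so 2/tr(R_x) is the more conservative limit. A quick numerical check, with arbitrary sample data assumed for illustration:

```python
import numpy as np

# Compare the exact LMS step-size bound 2/lambda_max with the
# practical trace bound 2/tr(R_x) on a sample correlation matrix.

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
R = X.T @ X / len(X)                    # sample estimate of R_x = E[x x^T]
lam_max = np.linalg.eigvalsh(R).max()

print(2 / lam_max, 2 / np.trace(R))     # the trace bound is the smaller one
```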
Adaline

[Diagram: inputs x_1, ..., x_m with weights w_i1, ..., w_im and linear
output y_i^(k) = w_i^T x^(k).]
Unipolar Sigmoid

[Diagram: the same unit with a sigmoidal activation.]

    a(net_i) = 1 / ( 1 + e^(-lambda * net_i) )

    y_i^(k) = a( w_i^T x^(k) )
Bipolar Sigmoid

[Diagram: the same unit with a bipolar sigmoidal activation.]

    a(net_i) = 2 / ( 1 + e^(-lambda * net_i) ) - 1

    y_i^(k) = a( w_i^T x^(k) )
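The two squashing functions can be compared directly in code; note the bipolar form is just a rescaling of the unipolar one, bipolar(net) = 2 * unipolar(net) - 1. The sample points are illustrative.

```python
import numpy as np

# The unipolar and bipolar sigmoids above, with steepness lambda (lam).

def unipolar(net, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * net))

def bipolar(net, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

net = np.array([-5.0, 0.0, 5.0])
print(np.round(unipolar(net), 3))   # -> [0.007 0.5   0.993]
print(np.round(bipolar(net), 3))    # -> [-0.987  0.     0.987]
```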
Goal
Minimize

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2
         = (1/2) sum_{k=1}^{p} ( d^(k) - a( w^T x^(k) ) )^2
Gradient Descent Algorithm
Minimize  E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - a( w^T x^(k) ) )^2

    Delta-w = -eta * grad E(w),
    grad E(w) = ( dE(w)/dw_1, dE(w)/dw_2, ..., dE(w)/dw_m )^T
The Gradient
With  y = a( w^T x ):

    dE(w)/dw_j = - sum_{k=1}^{p} ( d^(k) - y^(k) ) dy^(k)/dw_j

    dy^(k)/dw_j = a'( net^(k) ) * dnet^(k)/dw_j,
    net^(k) = w^T x^(k) = sum_i w_i x_i^(k)

    dnet^(k)/dw_j = x_j^(k)

a'(net) depends on the activation function used.
Weight Modification Rule
Minimize  E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2,  with  y^(k) = a( net^(k) ).

    dE(w)/dw_j = - sum_{k=1}^{p} delta^(k) a'( net^(k) ) x_j^(k),
    delta^(k) = d^(k) - y^(k)

Learning Rule:

    Batch:        Delta-w_j = eta sum_{k=1}^{p} delta^(k) a'( net^(k) ) x_j^(k)
    Incremental:  Delta-w_j = eta delta^(k) a'( net^(k) ) x_j^(k)
The Learning Efficacy
How a'(net) enters the update depends on the activation function:

    Adaline:           a(net) = net,                          a'(net) = 1
    Unipolar sigmoid:  a(net) = 1 / (1 + e^(-lambda net)),    a'(net) = lambda * y * (1 - y)
    Bipolar sigmoid:   a(net) = 2 / (1 + e^(-lambda net)) - 1,  a'(net) = ?  (Exercise)
Learning Rule: Unipolar Sigmoid
Minimize  E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2.

Using a'(net) = y (1 - y) (taking lambda = 1):

    dE(w)/dw_j = - sum_{k=1}^{p} ( d^(k) - y^(k) ) y^(k) (1 - y^(k)) x_j^(k)
               = - sum_{k=1}^{p} delta^(k) y^(k) (1 - y^(k)) x_j^(k)

Weight Modification Rule:

    Delta-w_j = eta sum_{k=1}^{p} delta^(k) y^(k) (1 - y^(k)) x_j^(k)
Comparisons

    Adaline, batch:        Delta-w_j = eta sum_{k=1}^{p} delta^(k) x_j^(k)
    Adaline, incremental:  Delta-w_j = eta delta^(k) x_j^(k)

    Sigmoid, batch:        Delta-w_j = eta sum_{k=1}^{p} delta^(k) y^(k) (1 - y^(k)) x_j^(k)
    Sigmoid, incremental:  Delta-w_j = eta delta^(k) y^(k) (1 - y^(k)) x_j^(k)
The Learning Efficacy

[Plots: y = a(net) for the Adaline (a straight line, so a'(net) = 1) and
for the sigmoid (an S-shaped curve whose derivative lambda * y * (1 - y)
peaks at y = 0.5 and vanishes as y approaches 0 or 1).]

The learning efficacy of the Adaline is constant, meaning that the
Adaline will never get saturated.
The sigmoid's efficacy depends on its output, so the sigmoid will get
saturated if its output value nears the two extremes.
Initialization for Sigmoid Neurons

[Diagram: a sigmoid neuron with a(net_i) = 1 / (1 + e^(-lambda net_i))
and y_i^(k) = a( w_i^T x^(k) ).]

Before training, its weights must be sufficiently small. Why?
Multilayer Perceptron

[Diagram: inputs x_1, ..., x_m enter the Input Layer, pass through the
Hidden Layer, and produce outputs y_1, ..., y_n at the Output Layer.]

The hidden layer performs analysis of the input; the output layer
performs the classification. Where does the knowledge come from?
Learning.
How an MLP Works?
Example: XOR

[Plot: the XOR truth table on the unit square; the positive and
negative points cannot be separated by a single line.]

Not linearly separable.
Is a single-layer perceptron workable?
How an MLP Works?
Example: XOR

[Plots: two lines L_1 and L_2 partition the input square; the patterns
00, 01, 11, 10 fall into three stripes. A first layer with inputs
x_1, x_2 and bias x_3 = -1 computes y_1 (side of L_1) and y_2 (side of
L_2); in the (y_1, y_2) plane the images of the patterns become
linearly separable by a third line L_3.]
How an MLP Works?

[Diagram: the complete two-layer network. The hidden units implement
L_1 and L_2 from inputs x_1, x_2, x_3 = -1; the output unit implements
L_3 from y_1, y_2, y_3 = -1 and produces z.]
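The hand-wired two-layer solution can be sketched as code. The particular line equations used here (L_1: x_1 + x_2 = 0.5, L_2: x_1 + x_2 = 1.5, L_3: y_1 - y_2 = 0.5) are one workable assumption, not necessarily the exact weights on the slide.

```python
# Two-layer LTU network computing XOR: y1 fires above line L1, y2 fires
# above line L2, and the output fires only for points between the lines.

def step(v):
    return 1 if v >= 0 else 0

def xor_mlp(x1, x2):
    y1 = step(x1 + x2 - 0.5)       # side of L1
    y2 = step(x1 + x2 - 1.5)       # side of L2
    return step(y1 - y2 - 0.5)     # L3: fires iff y1 = 1 and y2 = 0

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 1, 1, 0]
```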
Example:
Parity Problem

The 3-bit parity function:

    x_1 x_2 x_3 | output
    0   0   0   |   0
    0   0   1   |   1
    0   1   0   |   1
    0   1   1   |   0
    1   0   0   |   1
    1   0   1   |   0
    1   1   0   |   0
    1   1   1   |   1

Is the problem linearly separable?
Parity Problem

[Plot: the eight corners of the unit cube in (x_1, x_2, x_3) space, cut
by three parallel planes P_1, P_2, P_3 that group the corners by their
number of 1s: 000 | 001, 010, 100 | 011, 101, 110 | 111.]
Parity Problem

[Diagram: three hidden units P_1, P_2, P_3, each fed by x_1, x_2, x_3,
produce y_1, y_2, y_3 indicating on which side of each plane the input
lies; a fourth unit P_4 then combines y_1, y_2, y_3.]
Parity Problem

[Diagram: the complete network. The first layer computes y_1, y_2, y_3
from x_1, x_2, x_3 via P_1, P_2, P_3; the output unit P_4 computes z
from y_1, y_2, y_3.]
General Problem

[Plots: an arbitrary arrangement of positive and negative regions in
the input space.]

Hyperspace Partition

[Plot: three lines L_1, L_2, L_3 partition the input plane.]
Region Encoding

[Plot: the lines L_1, L_2, L_3 partition the plane into regions encoded
000, 001, 010, 100, 101, 110, 111.]

Hyperspace Partition & Region Encoding Layer

[Diagram: a first layer of units L_1, L_2, L_3, fed by x_1, x_2, x_3,
maps an input point to the code of its region, e.g. 101.]
Region Identification Layer

[Diagrams: for each region code (101, 001, 000, 110, 010, 100, 111), a
second-layer unit fires exactly when the first layer outputs that code,
identifying the region that contains the input.]
Classification

[Diagram: a final layer combines the region-identification units,
mapping each region code to its class label (0 or 1) and producing the
overall classification.]
Feed-Forward Neural Networks
Back Propagation Learning Algorithm

Activation Function: Sigmoid

    y = a(net) = 1 / ( 1 + e^(-net) )

    a'(net) = e^(-net) / ( 1 + e^(-net) )^2 = y ( 1 - y )

[Plot: the sigmoid rising from 0 through 0.5 at net = 0 toward 1.]

Remember this:  a'(net) = y ( 1 - y ).
Supervised Learning

Training Set:  T = { (x^(1), d^(1)), (x^(2), d^(2)), ..., (x^(p), d^(p)) }

[Diagram: inputs x_1, ..., x_m produce outputs o_1, ..., o_n, compared
with the desired outputs d_1, ..., d_n.]

Sum of Squared Errors:

    E^(l) = (1/2) sum_{j=1}^{n} ( d_j^(l) - o_j^(l) )^2

    E = sum_{l=1}^{p} E^(l)

Goal: Minimize E.
Back Propagation Learning Algorithm

[Diagram: the multilayer network with outputs o_1, ..., o_n and targets
d_1, ..., d_n.]

    E = sum_{l=1}^{p} E^(l),   E^(l) = (1/2) sum_{j=1}^{n} ( d_j^(l) - o_j^(l) )^2

The algorithm has two parts:
- Learning on output neurons
- Learning on hidden neurons
Learning on Output Neurons

[Diagram: hidden neuron i feeds output neuron j through weight w_ji;
output o_j is compared with target d_j.]

    E = sum_{l=1}^{p} E^(l),   E^(l) = (1/2) sum_{j=1}^{n} ( d_j^(l) - o_j^(l) )^2

    Delta-w_ji = -eta dE/dw_ji = -eta sum_{l=1}^{p} dE^(l)/dw_ji

Using the chain rule, with  o_j^(l) = a( net_j^(l) )  and
net_j^(l) = sum_i w_ji o_i^(l):

    dE^(l)/dw_ji = ( dE^(l)/dnet_j^(l) ) ( dnet_j^(l)/dw_ji )

    dE^(l)/dnet_j^(l) = ( dE^(l)/do_j^(l) ) ( do_j^(l)/dnet_j^(l) )

    dE^(l)/do_j^(l) = -( d_j^(l) - o_j^(l) )

    do_j^(l)/dnet_j^(l) = o_j^(l) ( 1 - o_j^(l) )   (using the sigmoid)

    dnet_j^(l)/dw_ji = o_i^(l)

Define

    delta_j^(l) = -dE^(l)/dnet_j^(l) = ( d_j^(l) - o_j^(l) ) o_j^(l) ( 1 - o_j^(l) ).

Then dE^(l)/dw_ji = -delta_j^(l) o_i^(l), so the weights connecting to
output neurons are trained by

    Delta-w_ji = eta sum_{l=1}^{p} delta_j^(l) o_i^(l).
Learning on Hidden Neurons

[Diagram: neuron k feeds hidden neuron i through weight w_ik; hidden
neuron i feeds output neuron j through w_ji.]

    Delta-w_ik = -eta dE/dw_ik = -eta sum_{l=1}^{p} dE^(l)/dw_ik

    dE^(l)/dw_ik = ( dE^(l)/dnet_i^(l) ) ( dnet_i^(l)/dw_ik ),
    dnet_i^(l)/dw_ik = o_k^(l)

    dE^(l)/dnet_i^(l) = ( dE^(l)/do_i^(l) ) ( do_i^(l)/dnet_i^(l) ),
    do_i^(l)/dnet_i^(l) = o_i^(l) ( 1 - o_i^(l) )

The error reaches o_i only through the net inputs of the next layer:

    dE^(l)/do_i^(l) = sum_j ( dE^(l)/dnet_j^(l) ) ( dnet_j^(l)/do_i^(l) )
                    = - sum_j delta_j^(l) w_ji

Define

    delta_i^(l) = -dE^(l)/dnet_i^(l) = o_i^(l) ( 1 - o_i^(l) ) sum_j delta_j^(l) w_ji.

Then the weights of hidden neurons are trained by

    Delta-w_ik = eta sum_{l=1}^{p} delta_i^(l) o_k^(l).
Back Propagation

[Diagram: activations flow forward from x_1, ..., x_m to o_1, ..., o_n;
the deltas flow backward from the output layer toward the input.]

Output neurons:

    delta_j^(l) = -dE^(l)/dnet_j^(l) = ( d_j^(l) - o_j^(l) ) o_j^(l) ( 1 - o_j^(l) ),
    Delta-w_ji = eta sum_{l=1}^{p} delta_j^(l) o_i^(l)

Hidden neurons:

    delta_i^(l) = -dE^(l)/dnet_i^(l) = o_i^(l) ( 1 - o_i^(l) ) sum_j delta_j^(l) w_ji,
    Delta-w_ik = eta sum_{l=1}^{p} delta_i^(l) o_k^(l)
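The two delta formulas can be checked against a numerical derivative. This is a sketch under stated assumptions: one hidden layer of sigmoid units, a single pattern, biases omitted for brevity, and arbitrary made-up weights and targets.

```python
import numpy as np

# Verify the back-propagation deltas numerically for E = 1/2 ||d - o||^2:
#   delta_j = (d_j - o_j) o_j (1 - o_j)              (output neuron)
#   delta_i = o_i (1 - o_i) sum_j delta_j w_ji       (hidden neuron)
#   dE/dw_ji = -delta_j o_i,   dE/dw_ik = -delta_i x_k

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(W1, W2, x):
    h = sigmoid(W1 @ x)            # hidden outputs
    o = sigmoid(W2 @ h)            # network outputs
    return h, o

def error(W1, W2, x, d):
    _, o = forward(W1, W2, x)
    return 0.5 * np.sum((d - o) ** 2)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))       # hidden layer: 3 units, 2 inputs
W2 = rng.normal(size=(2, 3))       # output layer: 2 units, 3 hidden inputs
x = np.array([0.8, -0.3])
d = np.array([1.0, 0.0])

h, o = forward(W1, W2, x)
delta_o = (d - o) * o * (1 - o)               # output deltas
delta_h = h * (1 - h) * (W2.T @ delta_o)      # back-propagated hidden deltas
grad_W2 = -np.outer(delta_o, h)               # analytic dE/dW2
grad_W1 = -np.outer(delta_h, x)               # analytic dE/dW1

# Numerical check on one weight of each layer:
eps = 1e-6
E0 = error(W1, W2, x, d)
W1p = W1.copy(); W1p[1, 0] += eps
W2p = W2.copy(); W2p[0, 2] += eps
num_W1 = (error(W1p, W2, x, d) - E0) / eps
num_W2 = (error(W1, W2p, x, d) - E0) / eps
print(abs(num_W1 - grad_W1[1, 0]) < 1e-4, abs(num_W2 - grad_W2[0, 2]) < 1e-4)
# -> True True
```

Repeating the update over many patterns and epochs (with a bias input per layer) gives the full training procedure summarized above.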
Learning Factors
•Initial Weights
•Learning Constant (eta)
•Cost Functions
•Momentum
•Update Rules
•Training Data and Generalization
•Number of Layers
•Number of Hidden Nodes
Reading Assignments
Shi Zhong and Vladimir Cherkassky, "Factors Controlling Generalization Ability of
MLP Networks." In Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 625-630,
Washington DC, July 1999. (http://www.cse.fau.edu/~zhong/pubs.htm)
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning Internal
Representations by Error Propagation," in Parallel Distributed Processing: Explorations
in the Microstructure of Cognition, vol. I, D. E. Rumelhart, J. L. McClelland, and the
PDP Research Group, eds. MIT Press, Cambridge (1986).
(http://www.cnbc.cmu.edu/~plaut/85-419/papers/RumelhartETAL86.backprop.pdf)