Historical Background
1943 McCulloch and Pitts proposed the first computational model of a neuron.
1949 Hebb proposed the first learning rule.
1958 Rosenblatt's work on perceptrons.
1969 Minsky and Papert exposed the limitations of perceptron theory.
1970s Decade of dormancy for neural networks.
1980-90s Revival of neural networks (self-organization, back-propagation algorithms, etc.).
Nervous Systems
The human brain contains ~10^11 neurons.
Each neuron is connected to ~10^4 others.
Some scientists have compared the brain to a "complex, nonlinear, parallel computer".
The largest modern neural networks achieve complexity comparable to the nervous system of a fly.
Neurons
The main purpose of neurons is to receive, analyze and transmit
information in the form of signals (electric pulses).
When a neuron sends information, we say that the neuron "fires".
Neurons
[Animation: signal transmission across a synapse, from the pre-synaptic
terminal of one neuron to the soma (cell body) of another neuron.]
Acting through specialized projections known as dendrites and axons,
neurons carry information throughout the neural network.
A Model of Artificial Neuron

[Diagram: inputs x_1, x_2, ..., x_m = -1 with weights w_i1, w_i2, ..., w_im = theta_i
feed a summation unit f(.) followed by an activation unit a(.), producing output y_i.
Fixing the last input at x_m = -1 makes the last weight w_im act as the
bias (threshold) theta_i.]

The neuron computes the weighted net input

    f_i = sum_{j=1}^{m} w_ij x_j

and produces its output through the activation function a(.):

    y_i(t+1) = a( f_i(t) ),   where   a(f) = 1 if f >= 0, and 0 otherwise.
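The unit above can be sketched in a few lines of code. This is a minimal illustration, not the slides' notation: the function name `threshold_neuron` and the AND example are assumptions, with the bias written explicitly as a threshold `theta` instead of the fixed input x_m = -1.

```python
# A minimal sketch of the artificial neuron above: the net input is a
# weighted sum of the inputs, and the activation a(.) is a hard threshold.

def threshold_neuron(x, w, theta):
    """Return 1 if sum_j w_j * x_j - theta >= 0, else 0."""
    f = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return 1 if f >= 0 else 0

# A neuron with w = (1, 1) and theta = 1.5 computes logical AND:
print([threshold_neuron(x, [1, 1], 1.5) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 0, 0, 1]
```

Choosing theta = 0.5 instead would make the same unit compute logical OR, which is the kind of weight-dependence of behavior the later slides build on.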
Feed-Forward Neural Networks
Graph representation:
- nodes: neurons
- arrows: signal flow directions
A neural network that does not contain cycles (feedback loops) is
called a feed-forward network (or perceptron).

[Diagram: inputs x_1, x_2, ..., x_m enter the Input Layer, pass through
one or more Hidden Layer(s), and produce outputs y_1, y_2, ..., y_n at
the Output Layer.]
Layered Structure

[Diagram: the same layered network, with inputs x_1, ..., x_m and
outputs y_1, ..., y_n.]
Knowledge and Memory

[Diagram: the layered network with inputs x_1, ..., x_m and outputs y_1, ..., y_n.]

The output behavior of a network is determined by the weights.
Weights are the memory of an NN.
Knowledge is distributed across the network.
A large number of nodes
- increases the storage "capacity";
- ensures that the knowledge is robust;
- provides fault tolerance.
New information is stored by changing weights.
Pattern Classification

[Diagram: the layered network mapping an input pattern x to an output pattern y.]

Function: x -> y
The NN's output is used to distinguish between and recognize different
input patterns.
Different output patterns correspond to particular classes of input
patterns.
Networks with hidden layers can be used to solve problems more complex
than linear pattern classification.
Training

Training Set:  T = { (x^(1), d^(1)), (x^(2), d^(2)), ..., (x^(k), d^(k)) }
with  x^(i) = (x_i1, x_i2, ..., x_im)^T  and  d^(i) = (d_i1, d_i2, ..., d_in)^T.

[Diagram: pattern x^(i) = (x_i1, ..., x_im) is fed to the network, which
produces y^(i) = (y_i1, ..., y_in); the desired output is d^(i) = (d_i1, ..., d_in).]

Goal:

    Min E = sum_i error( d^(i), y^(i) ) = sum_i || d^(i) - y^(i) ||^2
Generalization

[Diagram: the layered network with inputs x_1, ..., x_m and outputs y_1, ..., y_n.]

A properly trained neural network may produce reasonable answers for
input patterns not seen during training (generalization).
Generalization is particularly useful for the analysis of noisy data
(e.g., time series).
[Plot: a curve fitted to samples over the range -1.5 to 1.5 on both
axes, shown without noise and with noise.]
Applications
Pattern classification
Object recognition
Function approximation
Data compression
Time series analysis and forecasting
. . .
The Single-Layered Perceptron

[Diagram: inputs x_1, x_2, ..., x_{m-1} and bias input x_m = -1 are fully
connected to output neurons y_1, y_2, ..., y_n through weights w_11, w_12,
..., w_nm.]
Training a Single-Layered Perceptron

[Diagram: the single-layered perceptron with weights w_11, ..., w_nm,
outputs y_1, ..., y_n and desired outputs d_1, ..., d_n.]

Training Set:  T = { (x^(1), d^(1)), (x^(2), d^(2)), ..., (x^(p), d^(p)) }

Goal:

    y_i^(k) = a( sum_{l=1}^{m} w_il x_l^(k) ) = a( w_i^T x^(k) ) = d_i^(k)

    for i = 1, 2, ..., n and k = 1, 2, ..., p.
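The forward pass above can be sketched compactly for all output neurons and all patterns at once. This is an illustrative assumption, not the slides' code: the weight values are made up, and the hard-limiting activation a = sgn anticipates the LTU case introduced next.

```python
import numpy as np

# Sketch of the forward pass: y_i^(k) = a( sum_l w_il x_l^(k) ),
# computed for every output neuron i and every pattern k at once.
# The last input component is fixed at -1 so that w_im acts as a threshold.

def forward(W, X):
    """W: (n, m) weight matrix; X: (p, m) patterns whose last column is -1."""
    return np.where(X @ W.T >= 0.0, 1, -1)   # sgn, with sgn(0) taken as +1

W = np.array([[1.0, 1.0, 1.5],    # neuron 1 fires iff x1 + x2 >= 1.5
              [1.0, 1.0, 0.5]])   # neuron 2 fires iff x1 + x2 >= 0.5
X = np.array([[0.0, 0.0, -1.0],
              [1.0, 1.0, -1.0]])
print(forward(W, X))   # pattern 1 -> (-1, -1); pattern 2 -> (+1, +1)
```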
Learning Rules

[Diagram: the same single-layered perceptron with desired outputs d_1, ..., d_n.]

Linear Threshold Units (LTUs): Perceptron Learning Rule
Linearly Graded Units (LGUs): Widrow-Hoff Learning Rule
Perceptron

Linear Threshold Unit:

    y_i^(k) = sgn( w_i^T x^(k) )

[Diagram: inputs x_1, x_2, ..., x_m = -1 with weights w_i1, ..., w_im = theta_i
feed the summation w_i^T x^(k), followed by the hard limiter sgn(.),
whose output is +1 or -1.]
Goal:

    y_i^(k) = sgn( w_i^T x^(k) ) = d_i^(k) in {-1, +1}

    for i = 1, 2, ..., n and k = 1, 2, ..., p.
Example

[Diagram: a perceptron with inputs x_1, x_2 and bias input x_3 = -1,
weights (-2, 1, -2), and output y.]

Class 1 (+1):  [1, 0]^T, [1.5, 1]^T, [1, 2]^T
Class 2 (-1):  [2, 0]^T, [2.5, 1]^T, [1, -2]^T

[Plot: the two classes in the (x_1, x_2) plane, separated by the line
g(x) = -2 x_1 + x_2 + 2 = 0, with class 1 on the side g(x) >= 0.]

Goal:  y^(k) = sgn( w^T x^(k) ) = d^(k) in {-1, +1} for k = 1, 2, ..., p,
where w = (w_1, w_2, w_3)^T and each x^(k) is an augmented input vector:

Class 1 (+1):  x = (1, 0, -1)^T, (1.5, 1, -1)^T, (1, 2, -1)^T,  d = +1
Class 2 (-1):  x = (2, 0, -1)^T, (2.5, 1, -1)^T, (1, -2, -1)^T,  d = -1
Augmented input vector

[Plot: the augmented patterns in (x_1, x_2, x_3) space. Class 1:
(1, 2, -1), (1.5, 1, -1), (1, 0, -1). Class 2: (1, -2, -1), (2.5, 1, -1),
(2, 0, -1). The decision boundary g(x) = -2 x_1 + x_2 - 2 x_3 = 0, i.e.
w^T x = 0, passes through the origin (0, 0, 0).]

The decision boundary is a plane passing through the origin
in the augmented input space.
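The worked example can be checked numerically. Note that with the slide's line two class-1 points fall exactly on the boundary, so the convention sgn(0) = +1 is assumed here.

```python
# Checking the worked example: with w = (-2, 1, -2)^T and augmented
# inputs (x1, x2, -1), g(x) = w^T x = -2*x1 + x2 + 2 is >= 0 exactly
# on class 1 and < 0 on class 2.

w = (-2.0, 1.0, -2.0)
class1 = [(1.0, 0.0, -1.0), (1.5, 1.0, -1.0), (1.0, 2.0, -1.0)]
class2 = [(2.0, 0.0, -1.0), (2.5, 1.0, -1.0), (1.0, -2.0, -1.0)]

def g(x):
    return sum(wi * xi for wi, xi in zip(w, x))

print([g(x) for x in class1])   # -> [0.0, 0.0, 2.0]  (all >= 0)
print([g(x) for x in class2])   # -> [-2.0, -2.0, -2.0]  (all < 0)
```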
Linearly Separable vs. Linearly Non-Separable

[Plots: the truth tables of AND, OR and XOR drawn on the unit square,
with inputs 0/1 on each axis.]

AND: Linearly Separable.  OR: Linearly Separable.  XOR: Linearly Non-Separable.
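The claim above can be probed by brute force: search a small grid of LTU weights for each Boolean function. The grid and helper names are illustrative assumptions; the grid search is evidence, not a proof, but for XOR no LTU exists at all, so every grid fails.

```python
import itertools

# Exhaustively search LTU weights (w1, w2, threshold t) on a small grid.
# AND and OR admit a solution; XOR does not, being non-separable.

def ltu(x1, x2, w1, w2, t):
    return 1 if w1 * x1 + w2 * x2 - t >= 0 else 0

def separable(target):
    grid = [i / 2 for i in range(-4, 5)]            # -2.0 .. 2.0, step 0.5
    for w1, w2, t in itertools.product(grid, repeat=3):
        if all(ltu(x1, x2, w1, w2, t) == target(x1, x2)
               for x1, x2 in itertools.product((0, 1), repeat=2)):
            return True
    return False

print(separable(lambda a, b: a & b))   # True  (AND)
print(separable(lambda a, b: a | b))   # True  (OR)
print(separable(lambda a, b: a ^ b))   # False (XOR)
```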
Goal

Given training sets T_1 (a subset of C_1) and T_2 (a subset of C_2) with
elements in the form x = (x_1, x_2, ..., x_{m-1}, x_m)^T, where
x_1, x_2, ..., x_{m-1} are real and x_m = -1.

Assume T_1 and T_2 are linearly separable.

Find w = (w_1, w_2, ..., w_m)^T such that

    sgn( w^T x ) = +1 if x in T_1,
    sgn( w^T x ) = -1 if x in T_2.
w^T x = 0 is a hyperplane passing through the origin of the augmented
input space.
Observation

[Plot: candidate weight vectors w_1, ..., w_6 drawn in the (x_1, x_2)
plane together with a pattern x labeled d = +1 and another labeled d = -1.]

Which w's correctly classify x?
What trick can be used?
Observation

[Plot: the line w_1 x_1 + w_2 x_2 = 0 with its normal vector w, and a
pattern x with d = +1 lying on the negative side.]

Is this w ok?  No:  w^T x < 0.
Observation

[Plot: the same situation; the weight vector must be turned toward x.]

How to adjust w?   Delta-w = ?
Observation

Try  Delta-w = eta * x.  Then

    ( w + Delta-w )^T x = w^T x + eta * x^T x

where w^T x < 0 and eta * x^T x > 0, so the update pushes w^T x toward
the correct (positive) side.  Reasonable?
Observation

[Plot: a pattern with d = +1 and a pattern with d = -1 on either side of
the boundary.]

Is this w ok?  w^T x < 0.
Delta-w = ?   +eta*x or -eta*x, depending on the desired output.
Perceptron Learning Rule

Upon misclassification of a pattern x:

    Delta-w = +eta*x  if d = +1,
    Delta-w = -eta*x  if d = -1.

Define the error  r = d - y, so r is +2, -2, or 0 (no error).
Absorbing the factor 2 into the learning rate, the rule becomes

    Delta-w = eta * r * x.
Summary
Perceptron Learning Rule

Based on the general weight learning rule:

    Delta-w_i(t) = eta * r * x_i(t),   with  r = d_i - y_i

    w_i(t+1) = w_i(t) + eta * ( d_i - y_i(t) ) * x_i(t)

    d_i - y_i = 0     if y_i is correct,
    d_i - y_i = +/-2  if y_i is incorrect.
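The rule just summarized can be sketched end to end on the six augmented patterns of the earlier worked example. The learning rate and epoch cap are illustrative assumptions; convergence itself is guaranteed by the theorem on the next slide, since the data are linearly separable.

```python
import numpy as np

# Perceptron learning rule: w <- w + eta * (d - y) * x on every
# misclassified pattern, repeated until a full pass makes no errors.

X = np.array([[1.0, 0.0, -1.0], [1.5, 1.0, -1.0], [1.0, 2.0, -1.0],   # class 1
              [2.0, 0.0, -1.0], [2.5, 1.0, -1.0], [1.0, -2.0, -1.0]]) # class 2
d = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(3)
eta = 0.5
for epoch in range(1000):
    errors = 0
    for x, t in zip(X, d):
        y = 1 if w @ x >= 0 else -1        # sgn, with sgn(0) taken as +1
        if y != t:
            w += eta * (t - y) * x         # (t - y) is +2 or -2 here
            errors += 1
    if errors == 0:                        # converged (data are separable)
        break

print(w, all((1 if w @ x >= 0 else -1) == t for x, t in zip(X, d)))
```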
Summary
Perceptron Learning Rule

[Diagram: the network maps x to y; the error (d - y) drives the update
Delta-w = eta * (d - y) * x.]

Converge?
Perceptron Convergence Theorem
If the given training set is linearly separable, the learning process
will converge in a finite number of steps.
Exercise: consult papers or textbooks for a proof of the theorem.
The Learning Scenario

[Plots: a linearly separable training set in the (x_1, x_2) plane, with
positive patterns x^(1), x^(2) and negative patterns x^(3), x^(4). The
weight vector is updated step by step, w_0 -> w_1 -> w_2 -> w_3, with
w_4 = w_3 once every pattern is classified correctly; the final w
separates the two classes.]

The demonstration is in augmented space.
Conceptually, in augmented space, we adjust the weight vector to fit
the data.
Weight Space

[Plots: the (w_1, w_2) weight plane. For a positive example +x, any
weight in the shaded half-plane w^T x > 0 gives correct classification,
and a misclassified w is corrected by Delta-w = eta*x. For a negative
example -x, any weight not in the shaded half-plane gives correct
classification, and a misclassified w is corrected by Delta-w = -eta*x.]
The Learning Scenario in Weight Space

[Plots: the (w_1, w_2) weight plane with the constraints induced by
patterns x^(1), x^(2) (positive) and x^(3), x^(4) (negative). To
correctly classify the training set, the weight must move into the
shaded (feasible) area. The trajectory w_0 -> w_1 -> w_2 = w_3 -> w_4 ->
... -> w_11 ends inside the feasible region.]

Conceptually, in weight space, we move the weight into the feasible
region.
Adaline (Adaptive Linear Element)
Widrow [1962]

[Diagram: inputs x_1, x_2, ..., x_m with weights w_i1, ..., w_im feed a
linear unit whose output equals its net input.]

    y_i^(k) = w_i^T x^(k)

Goal:

    y_i^(k) = w_i^T x^(k) = d_i^(k)   for i = 1, 2, ..., n and k = 1, 2, ..., p.

Under what condition is the goal reachable?
LMS (Least Mean Square)
Minimize the cost function (error function):

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2
         = (1/2) sum_{k=1}^{p} ( d^(k) - w^T x^(k) )^2
         = (1/2) sum_{k=1}^{p} ( d^(k) - sum_{l=1}^{m} w_l x_l^(k) )^2
Gradient Descent Algorithm

[Plot: the error surface E(w) over the (w_1, w_2) plane, with its
contour map below.]

Our goal is to go downhill.
How do we find the steepest descent direction?
Gradient Operator
Let f(w) = f(w_1, w_2, ..., w_m) be a function over R^m.

    df = (df/dw_1) dw_1 + (df/dw_2) dw_2 + ... + (df/dw_m) dw_m

Define

    grad f = ( df/dw_1, df/dw_2, ..., df/dw_m )^T
    dw = ( dw_1, dw_2, ..., dw_m )^T

so that  df = < grad f, dw >.
The Steepest Descent Direction

    df = < grad f, dw >

    df > 0: go uphill;   df = 0: plain;   df < 0: go downhill.

To minimize f, we choose  Delta-w = -eta * grad f.
LMS (Least Mean Square)
Minimize the cost function (error function):

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - sum_{l=1}^{m} w_l x_l^(k) )^2

    dE(w)/dw_j = - sum_{k=1}^{p} ( d^(k) - sum_l w_l x_l^(k) ) x_j^(k)
               = - sum_{k=1}^{p} ( d^(k) - w^T x^(k) ) x_j^(k)
               = - sum_{k=1}^{p} ( d^(k) - y^(k) ) x_j^(k)

With the error  delta^(k) = d^(k) - y^(k):

    dE(w)/dw_j = - sum_{k=1}^{p} delta^(k) x_j^(k)
Adaline Learning Rule
Minimize the cost function (error function):

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - sum_{l=1}^{m} w_l x_l^(k) )^2

    grad E(w) = ( dE(w)/dw_1, dE(w)/dw_2, ..., dE(w)/dw_m )^T

Weight Modification Rule:

    Delta-w = -eta * grad E(w),  i.e.  Delta-w_j = eta sum_{k=1}^{p} delta^(k) x_j^(k),
    with  delta^(k) = d^(k) - y^(k).
Learning Modes

Batch Learning Mode:

    Delta-w_j = eta sum_{k=1}^{p} delta^(k) x_j^(k)

Incremental Learning Mode:

    Delta-w_j = eta delta^(k) x_j^(k)

with  delta^(k) = d^(k) - y^(k).
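The incremental mode can be sketched on a small linear regression task. The target function d = 2 x_1 - x_2 + 1 and all sizes are illustrative assumptions, chosen so that with the bias input fixed at x_3 = -1 the exact solution is w = (2, -1, -1).

```python
import numpy as np

# Incremental (per-pattern) LMS: w <- w + eta * (d - y) * x,
# with the linear Adaline output y = w^T x.

rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(-1, 1, size=(50, 2)), -np.ones((50, 1))])  # x3 = -1
d = 2 * X[:, 0] - X[:, 1] + 1          # realizable target: w* = (2, -1, -1)

w = np.zeros(3)
eta = 0.1                              # well below the stability bound
for _ in range(500):                   # epochs
    for x, t in zip(X, d):
        y = w @ x                      # linear output
        w += eta * (t - y) * x         # incremental LMS step

print(np.round(w, 3))                  # approaches (2, -1, -1)
```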
Summary
Adaline Learning Rule

[Diagram: the linear unit maps x to y; the error delta = d - y drives
the update Delta-w = eta * delta * x.]

Also known as the delta-Learning Rule, the LMS Algorithm, or the
Widrow-Hoff Learning Rule.

Converge?
LMS Convergence
Based on the independence theory (Widrow, 1976):
1. The successive input vectors are statistically independent.
2. At time t, the input vector x(t) is statistically independent of all
previous samples of the desired response, namely d(1), d(2), ..., d(t-1).
3. At time t, the desired response d(t) depends on x(t), but is
statistically independent of all previous values of the desired response.
4. The input vector x(t) and desired response d(t) are drawn from
Gaussian-distributed populations.
LMS Convergence
It can be shown that LMS is convergent if

    0 < eta < 2 / lambda_max

where lambda_max is the largest eigenvalue of the correlation matrix
of the inputs,

    R_x = lim_{n -> inf} (1/n) sum_{i=1}^{n} x_i x_i^T.

Since lambda_max is hardly available in practice, we commonly use

    0 < eta < 2 / tr(R_x).
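The practical bound works because tr(R_x) >= lambda_max for a correlation matrix (its eigenvalues are nonnegative and sum to the trace), so 2/tr(R_x) is the more conservative limit. A quick numerical check, with arbitrary sample data assumed for illustration:

```python
import numpy as np

# Compare the exact LMS step-size bound 2/lambda_max with the
# practical trace bound 2/tr(R_x) on a sample correlation matrix.

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
R = X.T @ X / len(X)                    # sample estimate of R_x = E[x x^T]
lam_max = np.linalg.eigvalsh(R).max()

print(2 / lam_max, 2 / np.trace(R))     # the trace bound is the smaller one
```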
Adaline

[Diagram: inputs x_1, ..., x_m with weights w_i1, ..., w_im and linear
output y_i^(k) = w_i^T x^(k).]
Unipolar Sigmoid

[Diagram: the same unit with a sigmoidal activation.]

    a(net_i) = 1 / ( 1 + e^(-lambda * net_i) )

    y_i^(k) = a( w_i^T x^(k) )
Bipolar Sigmoid

[Diagram: the same unit with a bipolar sigmoidal activation.]

    a(net_i) = 2 / ( 1 + e^(-lambda * net_i) ) - 1

    y_i^(k) = a( w_i^T x^(k) )
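The two squashing functions can be compared directly in code; note the bipolar form is just a rescaling of the unipolar one, bipolar(net) = 2 * unipolar(net) - 1. The sample points are illustrative.

```python
import numpy as np

# The unipolar and bipolar sigmoids above, with steepness lambda (lam).

def unipolar(net, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * net))

def bipolar(net, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

net = np.array([-5.0, 0.0, 5.0])
print(np.round(unipolar(net), 3))   # -> [0.007 0.5   0.993]
print(np.round(bipolar(net), 3))    # -> [-0.987  0.     0.987]
```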
Goal
Minimize

    E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2
         = (1/2) sum_{k=1}^{p} ( d^(k) - a( w^T x^(k) ) )^2
Gradient Descent Algorithm
Minimize  E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - a( w^T x^(k) ) )^2

    Delta-w = -eta * grad E(w),
    grad E(w) = ( dE(w)/dw_1, dE(w)/dw_2, ..., dE(w)/dw_m )^T
The Gradient
With  y = a( w^T x ):

    dE(w)/dw_j = - sum_{k=1}^{p} ( d^(k) - y^(k) ) dy^(k)/dw_j

    dy^(k)/dw_j = a'( net^(k) ) * dnet^(k)/dw_j,
    net^(k) = w^T x^(k) = sum_i w_i x_i^(k)

    dnet^(k)/dw_j = x_j^(k)

a'(net) depends on the activation function used.
Weight Modification Rule
Minimize  E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2,  with  y^(k) = a( net^(k) ).

    dE(w)/dw_j = - sum_{k=1}^{p} delta^(k) a'( net^(k) ) x_j^(k),
    delta^(k) = d^(k) - y^(k)

Learning Rule:

    Batch:        Delta-w_j = eta sum_{k=1}^{p} delta^(k) a'( net^(k) ) x_j^(k)
    Incremental:  Delta-w_j = eta delta^(k) a'( net^(k) ) x_j^(k)
The Learning Efficacy
How a'(net) enters the update depends on the activation function:

    Adaline:           a(net) = net,                          a'(net) = 1
    Unipolar sigmoid:  a(net) = 1 / (1 + e^(-lambda net)),    a'(net) = lambda * y * (1 - y)
    Bipolar sigmoid:   a(net) = 2 / (1 + e^(-lambda net)) - 1,  a'(net) = ?  (Exercise)
Learning Rule: Unipolar Sigmoid
Minimize  E(w) = (1/2) sum_{k=1}^{p} ( d^(k) - y^(k) )^2.

Using a'(net) = y (1 - y) (taking lambda = 1):

    dE(w)/dw_j = - sum_{k=1}^{p} ( d^(k) - y^(k) ) y^(k) (1 - y^(k)) x_j^(k)
               = - sum_{k=1}^{p} delta^(k) y^(k) (1 - y^(k)) x_j^(k)

Weight Modification Rule:

    Delta-w_j = eta sum_{k=1}^{p} delta^(k) y^(k) (1 - y^(k)) x_j^(k)
Comparisons

    Adaline, batch:        Delta-w_j = eta sum_{k=1}^{p} delta^(k) x_j^(k)
    Adaline, incremental:  Delta-w_j = eta delta^(k) x_j^(k)

    Sigmoid, batch:        Delta-w_j = eta sum_{k=1}^{p} delta^(k) y^(k) (1 - y^(k)) x_j^(k)
    Sigmoid, incremental:  Delta-w_j = eta delta^(k) y^(k) (1 - y^(k)) x_j^(k)
The Learning Efficacy

[Plots: y = a(net) for the Adaline (a straight line, so a'(net) = 1) and
for the sigmoid (an S-shaped curve whose derivative lambda * y * (1 - y)
peaks at y = 0.5 and vanishes as y approaches 0 or 1).]

The learning efficacy of the Adaline is constant, meaning that the
Adaline will never get saturated.
The sigmoid's efficacy depends on its output, so the sigmoid will get
saturated if its output value nears the two extremes.
Initialization for Sigmoid Neurons

[Diagram: a sigmoid neuron with a(net_i) = 1 / (1 + e^(-lambda net_i))
and y_i^(k) = a( w_i^T x^(k) ).]

Before training, its weights must be sufficiently small. Why?
Multilayer Perceptron

[Diagram: inputs x_1, ..., x_m enter the Input Layer, pass through the
Hidden Layer, and produce outputs y_1, ..., y_n at the Output Layer.]

The hidden layer performs analysis of the input; the output layer
performs the classification. Where does the knowledge come from?
Learning.
How an MLP Works?
Example: XOR

[Plot: the XOR truth table on the unit square; the positive and
negative points cannot be separated by a single line.]

Not linearly separable.
Is a single-layer perceptron workable?
How an MLP Works?
Example: XOR

[Plots: two lines L_1 and L_2 partition the input square; the patterns
00, 01, 11, 10 fall into three stripes. A first layer with inputs
x_1, x_2 and bias x_3 = -1 computes y_1 (side of L_1) and y_2 (side of
L_2); in the (y_1, y_2) plane the images of the patterns become
linearly separable by a third line L_3.]
How an MLP Works?

[Diagram: the complete two-layer network. The hidden units implement
L_1 and L_2 from inputs x_1, x_2, x_3 = -1; the output unit implements
L_3 from y_1, y_2, y_3 = -1 and produces z.]
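The hand-wired two-layer solution can be sketched as code. The particular line equations used here (L_1: x_1 + x_2 = 0.5, L_2: x_1 + x_2 = 1.5, L_3: y_1 - y_2 = 0.5) are one workable assumption, not necessarily the exact weights on the slide.

```python
# Two-layer LTU network computing XOR: y1 fires above line L1, y2 fires
# above line L2, and the output fires only for points between the lines.

def step(v):
    return 1 if v >= 0 else 0

def xor_mlp(x1, x2):
    y1 = step(x1 + x2 - 0.5)       # side of L1
    y2 = step(x1 + x2 - 1.5)       # side of L2
    return step(y1 - y2 - 0.5)     # L3: fires iff y1 = 1 and y2 = 0

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 1, 1, 0]
```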
Example:
Parity Problem

The 3-bit parity function:

    x_1 x_2 x_3 | output
    0   0   0   |   0
    0   0   1   |   1
    0   1   0   |   1
    0   1   1   |   0
    1   0   0   |   1
    1   0   1   |   0
    1   1   0   |   0
    1   1   1   |   1

Is the problem linearly separable?
Parity Problem

[Plot: the eight corners of the unit cube in (x_1, x_2, x_3) space, cut
by three parallel planes P_1, P_2, P_3 that group the corners by their
number of 1s: 000 | 001, 010, 100 | 011, 101, 110 | 111.]
Parity Problem

[Diagram: three hidden units P_1, P_2, P_3, each fed by x_1, x_2, x_3,
produce y_1, y_2, y_3 indicating on which side of each plane the input
lies; a fourth unit P_4 then combines y_1, y_2, y_3.]
Parity Problem

[Diagram: the complete network. The first layer computes y_1, y_2, y_3
from x_1, x_2, x_3 via P_1, P_2, P_3; the output unit P_4 computes z
from y_1, y_2, y_3.]
General Problem

[Plots: an arbitrary arrangement of positive and negative regions in
the input space.]

Hyperspace Partition

[Plot: three lines L_1, L_2, L_3 partition the input plane.]
Region Encoding

[Plot: the lines L_1, L_2, L_3 partition the plane into regions encoded
000, 001, 010, 100, 101, 110, 111.]

Hyperspace Partition & Region Encoding Layer

[Diagram: a first layer of units L_1, L_2, L_3, fed by x_1, x_2, x_3,
maps an input point to the code of its region, e.g. 101.]
Region Identification Layer

[Diagrams: for each region code (101, 001, 000, 110, 010, 100, 111), a
second-layer unit fires exactly when the first layer outputs that code,
identifying the region that contains the input.]
Classification

[Diagram: a final layer combines the region-identification units,
mapping each region code to its class label (0 or 1) and producing the
overall classification.]
Feed-Forward Neural Networks
Back Propagation Learning Algorithm

Activation Function: Sigmoid

    y = a(net) = 1 / ( 1 + e^(-net) )

    a'(net) = e^(-net) / ( 1 + e^(-net) )^2 = y ( 1 - y )

[Plot: the sigmoid rising from 0 through 0.5 at net = 0 toward 1.]

Remember this:  a'(net) = y ( 1 - y ).
Supervised Learning

Training Set:  T = { (x^(1), d^(1)), (x^(2), d^(2)), ..., (x^(p), d^(p)) }

[Diagram: inputs x_1, ..., x_m produce outputs o_1, ..., o_n, compared
with the desired outputs d_1, ..., d_n.]

Sum of Squared Errors:

    E^(l) = (1/2) sum_{j=1}^{n} ( d_j^(l) - o_j^(l) )^2

    E = sum_{l=1}^{p} E^(l)

Goal: Minimize E.
Back Propagation Learning Algorithm

[Diagram: the multilayer network with outputs o_1, ..., o_n and targets
d_1, ..., d_n.]

    E = sum_{l=1}^{p} E^(l),   E^(l) = (1/2) sum_{j=1}^{n} ( d_j^(l) - o_j^(l) )^2

The algorithm has two parts:
- Learning on output neurons
- Learning on hidden neurons
Learning on Output Neurons

[Diagram: hidden neuron i feeds output neuron j through weight w_ji;
output o_j is compared with target d_j.]

    E = sum_{l=1}^{p} E^(l),   E^(l) = (1/2) sum_{j=1}^{n} ( d_j^(l) - o_j^(l) )^2

    Delta-w_ji = -eta dE/dw_ji = -eta sum_{l=1}^{p} dE^(l)/dw_ji

Using the chain rule, with  o_j^(l) = a( net_j^(l) )  and
net_j^(l) = sum_i w_ji o_i^(l):

    dE^(l)/dw_ji = ( dE^(l)/dnet_j^(l) ) ( dnet_j^(l)/dw_ji )

    dE^(l)/dnet_j^(l) = ( dE^(l)/do_j^(l) ) ( do_j^(l)/dnet_j^(l) )

    dE^(l)/do_j^(l) = -( d_j^(l) - o_j^(l) )

    do_j^(l)/dnet_j^(l) = o_j^(l) ( 1 - o_j^(l) )   (using the sigmoid)

    dnet_j^(l)/dw_ji = o_i^(l)

Define

    delta_j^(l) = -dE^(l)/dnet_j^(l) = ( d_j^(l) - o_j^(l) ) o_j^(l) ( 1 - o_j^(l) ).

Then dE^(l)/dw_ji = -delta_j^(l) o_i^(l), so the weights connecting to
output neurons are trained by

    Delta-w_ji = eta sum_{l=1}^{p} delta_j^(l) o_i^(l).
Learning on Hidden Neurons

[Diagram: neuron k feeds hidden neuron i through weight w_ik; hidden
neuron i feeds output neuron j through w_ji.]

    Delta-w_ik = -eta dE/dw_ik = -eta sum_{l=1}^{p} dE^(l)/dw_ik

    dE^(l)/dw_ik = ( dE^(l)/dnet_i^(l) ) ( dnet_i^(l)/dw_ik ),
    dnet_i^(l)/dw_ik = o_k^(l)

    dE^(l)/dnet_i^(l) = ( dE^(l)/do_i^(l) ) ( do_i^(l)/dnet_i^(l) ),
    do_i^(l)/dnet_i^(l) = o_i^(l) ( 1 - o_i^(l) )

The error reaches o_i only through the net inputs of the next layer:

    dE^(l)/do_i^(l) = sum_j ( dE^(l)/dnet_j^(l) ) ( dnet_j^(l)/do_i^(l) )
                    = - sum_j delta_j^(l) w_ji

Define

    delta_i^(l) = -dE^(l)/dnet_i^(l) = o_i^(l) ( 1 - o_i^(l) ) sum_j delta_j^(l) w_ji.

Then the weights of hidden neurons are trained by

    Delta-w_ik = eta sum_{l=1}^{p} delta_i^(l) o_k^(l).
Back Propagation

[Diagram: activations flow forward from x_1, ..., x_m to o_1, ..., o_n;
the deltas flow backward from the output layer toward the input.]

Output neurons:

    delta_j^(l) = -dE^(l)/dnet_j^(l) = ( d_j^(l) - o_j^(l) ) o_j^(l) ( 1 - o_j^(l) ),
    Delta-w_ji = eta sum_{l=1}^{p} delta_j^(l) o_i^(l)

Hidden neurons:

    delta_i^(l) = -dE^(l)/dnet_i^(l) = o_i^(l) ( 1 - o_i^(l) ) sum_j delta_j^(l) w_ji,
    Delta-w_ik = eta sum_{l=1}^{p} delta_i^(l) o_k^(l)
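The two delta formulas can be checked against a numerical derivative. This is a sketch under stated assumptions: one hidden layer of sigmoid units, a single pattern, biases omitted for brevity, and arbitrary made-up weights and targets.

```python
import numpy as np

# Verify the back-propagation deltas numerically for E = 1/2 ||d - o||^2:
#   delta_j = (d_j - o_j) o_j (1 - o_j)              (output neuron)
#   delta_i = o_i (1 - o_i) sum_j delta_j w_ji       (hidden neuron)
#   dE/dw_ji = -delta_j o_i,   dE/dw_ik = -delta_i x_k

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(W1, W2, x):
    h = sigmoid(W1 @ x)            # hidden outputs
    o = sigmoid(W2 @ h)            # network outputs
    return h, o

def error(W1, W2, x, d):
    _, o = forward(W1, W2, x)
    return 0.5 * np.sum((d - o) ** 2)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))       # hidden layer: 3 units, 2 inputs
W2 = rng.normal(size=(2, 3))       # output layer: 2 units, 3 hidden inputs
x = np.array([0.8, -0.3])
d = np.array([1.0, 0.0])

h, o = forward(W1, W2, x)
delta_o = (d - o) * o * (1 - o)               # output deltas
delta_h = h * (1 - h) * (W2.T @ delta_o)      # back-propagated hidden deltas
grad_W2 = -np.outer(delta_o, h)               # analytic dE/dW2
grad_W1 = -np.outer(delta_h, x)               # analytic dE/dW1

# Numerical check on one weight of each layer:
eps = 1e-6
E0 = error(W1, W2, x, d)
W1p = W1.copy(); W1p[1, 0] += eps
W2p = W2.copy(); W2p[0, 2] += eps
num_W1 = (error(W1p, W2, x, d) - E0) / eps
num_W2 = (error(W1, W2p, x, d) - E0) / eps
print(abs(num_W1 - grad_W1[1, 0]) < 1e-4, abs(num_W2 - grad_W2[0, 2]) < 1e-4)
# -> True True
```

Repeating the update over many patterns and epochs (with a bias input per layer) gives the full training procedure summarized above.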
Learning Factors
•Initial Weights
•Learning Constant (eta)
•Cost Functions
•Momentum
•Update Rules
•Training Data and Generalization
•Number of Layers
•Number of Hidden Nodes
Reading Assignments
Shi Zhong and Vladimir Cherkassky, "Factors Controlling Generalization Ability of
MLP Networks." In Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 625-630,
Washington DC, July 1999. (http://www.cse.fau.edu/~zhong/pubs.htm)
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning Internal
Representations by Error Propagation," in Parallel Distributed Processing: Explorations
in the Microstructure of Cognition, vol. I, D. E. Rumelhart, J. L. McClelland, and the
PDP Research Group, eds. MIT Press, Cambridge (1986).
(http://www.cnbc.cmu.edu/~plaut/85-419/papers/RumelhartETAL86.backprop.pdf)