Lecture2---Feed-Forward Neural Networks.ppt

KassahunAwoke 24 views 150 slides Oct 18, 2024

About This Presentation

Feed forward


Slide Content

Feed-Forward Neural Networks

Content
Introduction
Single-Layer Perceptron Networks
Learning Rules for Single-Layer Perceptron Networks
–Perceptron Learning Rule
–Adaline Learning Rule
–δ-Learning Rule
Multilayer Perceptron
Back-Propagation Learning Algorithm

Feed-Forward Neural Networks
Introduction

Historical Background
1943 McCulloch and Pitts proposed the first computational model of the neuron.
1949 Hebb proposed the first learning rule.
1958 Rosenblatt's work on perceptrons.
1969 Minsky and Papert exposed the limitations of the theory.
1970s Decade of dormancy for neural networks.
1980–90s Neural networks return (self-organization, back-propagation algorithms, etc.)

Nervous Systems
Human brain contains ~10^11 neurons.
Each neuron is connected to ~10^4 others.
Some scientists compared the brain with a "complex, nonlinear, parallel computer".
The largest modern neural networks achieve a complexity comparable to the nervous system of a fly.

Neurons
The main purpose of neurons is to receive, analyze, and transmit information in the form of signals (electric pulses).
When a neuron sends information, we say that the neuron "fires".

Neurons
This animation demonstrates the firing of a synapse from the pre-synaptic terminal of one neuron to the soma (cell body) of another neuron.
Acting through specialized projections known as dendrites and axons, neurons carry information throughout the neural network.

A Model of Artificial Neuron
Inputs x_1, x_2, …, x_m are weighted by w_i1, w_i2, …, w_im; the bias enters as a fixed input x_m = −1 with weight w_im = θ_i. The summation f(·) and the activation function a(·) produce the output y_i:
f_i = Σ_{j=1}^{m} w_ij x_j
y_i(t+1) = a(f_i)
a(f) = 1 if f ≥ 0, 0 otherwise
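As a sketch of this neuron model (NumPy assumed; the AND-gate weights below are illustrative, not from the slides), the weighted sum and hard-limit activation can be written as:

```python
import numpy as np

def neuron_output(w, x):
    # f = sum_j w_ij * x_j, then hard-limit activation: a(f) = 1 if f >= 0 else 0
    f = np.dot(w, x)
    return 1 if f >= 0 else 0

# Illustrative AND gate: threshold theta = 1.5 carried by the bias input x_m = -1
w = np.array([1.0, 1.0, 1.5])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2, -1.0])
    print((x1, x2), "->", neuron_output(w, x))
```

With this weight choice, only the input (1, 1) pushes the weighted sum over the threshold.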

Feed-Forward Neural Networks
Graph representation:
–nodes: neurons
–arrows: signal flow directions
A neural network that does not contain cycles (feedback loops) is called a feed-forward network (or perceptron).
[Figure: a layered, acyclic graph mapping inputs x_1, …, x_m to outputs y_1, …, y_n]

Layered Structure
[Figure: inputs x_1, …, x_m enter the Input Layer, pass through the Hidden Layer(s), and leave the Output Layer as y_1, …, y_n]

Knowledge and Memory
The output behavior of a network is determined by the weights.
Weights ≡ the memory of an NN.
Knowledge ≡ distributed across the network.
A large number of nodes
–increases the storage "capacity";
–ensures that the knowledge is robust;
–provides fault tolerance.
New information is stored by changing weights.

Pattern Classification
Function: x → y
The NN's output is used to distinguish between and recognize different input patterns.
Different output patterns correspond to particular classes of input patterns.
Networks with hidden layers can be used for solving more complex problems than just linear pattern classification.

Training
Training Set: T = {(x^(1), d^(1)), (x^(2), d^(2)), …, (x^(p), d^(p))}
x^(i) = (x_i1, x_i2, …, x_im)^T
d^(i) = (d_i1, d_i2, …, d_in)^T
Goal: Min E(error) = Σ_i ||d^(i) − y^(i)||^2

Generalization
A properly trained neural network may produce reasonable answers for input patterns not seen during training (generalization).
Generalization is particularly useful for the analysis of "noisy" data (e.g., time series).
[Figure: the same fitted curve on data without noise and with noise; both axes range from −1.5 to 1.5]

Applications
Pattern classification
Object recognition
Function approximation
Data compression
Time series analysis and forecast
. . .

Feed-Forward Neural Networks
Single-Layer Perceptron Networks

The Single-Layered Perceptron
[Figure: inputs x_1, …, x_{m−1} plus the bias input x_m = −1, fully connected through weights w_11, w_12, …, w_nm to outputs y_1, …, y_n]

Training a Single-Layered Perceptron
Training Set: T = {(x^(1), d^(1)), (x^(2), d^(2)), …, (x^(p), d^(p))}
Goal:
y_i^(k) = a(Σ_{l=1}^{m} w_il x_l^(k)) = a(w_i^T x^(k)) = d_i^(k)
for i = 1, 2, …, n and k = 1, 2, …, p.

Learning Rules
Training Set: T = {(x^(1), d^(1)), (x^(2), d^(2)), …, (x^(p), d^(p))}
Goal: y_i^(k) = a(w_i^T x^(k)) = d_i^(k), i = 1, 2, …, n; k = 1, 2, …, p.
 Linear Threshold Units (LTUs): Perceptron Learning Rule
 Linearly Graded Units (LGUs): Widrow-Hoff Learning Rule

Feed-Forward Neural Networks
Learning Rules for Single-Layered Perceptron Networks
 Perceptron Learning Rule
 Adaline Learning Rule
 δ-Learning Rule

Perceptron
Linear Threshold Unit: the activation is sgn(·), which outputs +1 or −1.
y_i^(k) = sgn(w_i^T x^(k))
Goal:
y_i^(k) = sgn(w_i^T x^(k)) = d_i^(k) ∈ {−1, +1}, i = 1, 2, …, n; k = 1, 2, …, p.

Example
A single LTU with inputs x_1, x_2, bias input x_3 = −1, and weights w = (−2, 1, −2)^T computes y = sgn(w^T x).
Class 1 (+1): (−1, 0)^T, (−1.5, 1)^T, (−1, 2)^T
Class 2 (−1): (2, 0)^T, (2.5, 1)^T, (1, −2)^T
In the (x_1, x_2) plane the decision boundary is the line g(x) = −2x_1 + x_2 + 2 = 0.

Augmented Input Vector
Appending the bias input x_3 = −1 to every pattern gives
Class 1 (+1): x^(1) = (−1, 0, −1)^T, x^(2) = (−1.5, 1, −1)^T, x^(3) = (−1, 2, −1)^T, with d^(1) = d^(2) = d^(3) = +1
Class 2 (−1): x^(4) = (2, 0, −1)^T, x^(5) = (2.5, 1, −1)^T, x^(6) = (1, −2, −1)^T, with d^(4) = d^(5) = d^(6) = −1
Goal: y^(k) = sgn(w^T x^(k)) = d^(k), where w = (w_1, w_2, w_3)^T.
In the augmented input space the boundary g(x) = −2x_1 + x_2 − 2x_3 = 0 is a plane that passes through the origin.

Linearly Separable vs. Linearly Non-Separable
AND and OR are linearly separable: a single line can divide the inputs with output 1 from those with output 0. XOR is linearly non-separable: no single line can do so.

Goal
Given training sets T_1 ⊂ C_1 and T_2 ⊂ C_2 with elements in the form x = (x_1, x_2, …, x_{m−1}, x_m)^T, where x_1, x_2, …, x_{m−1} ∈ R and x_m = −1.
Assume T_1 and T_2 are linearly separable.
Find w = (w_1, w_2, …, w_m)^T such that
sgn(w^T x) = +1 if x ∈ T_1; sgn(w^T x) = −1 if x ∈ T_2.
w^T x = 0 is a hyperplane that passes through the origin of the augmented input space.

Observation
Consider a sample x with d = +1 and a candidate weight vector w. Which w's correctly classify x? What trick can be used?
A w with w^T x > 0 classifies x correctly; a w with w^T x < 0 does not. How to adjust w?
Take Δw = ηx. Then
(w + Δw)^T x = w^T x + η x^T x > w^T x,
so the update moves w^T x toward the positive side, which is reasonable since x^T x > 0.
Symmetrically, for a sample with d = −1 that is misclassified (w^T x > 0), take Δw = −ηx, which decreases w^T x.

Perceptron Learning Rule
Upon misclassification, update with Δw = ηx if d = +1 and Δw = −ηx if d = −1; no update when the sample is classified correctly.
Define the error r = d − y:
r = 0 (no error); r = +2 when d = +1, y = −1; r = −2 when d = −1, y = +1.
Then the rule becomes
Δw = η r x
η: learning rate; r = (d − y): error; x: input.

Summary: Perceptron Learning Rule
Based on the general weight learning rule:
Δw_i(t) = η r x_i(t), with r_i = d_i − y_i
Δw_i(t) = η (d_i − y_i) x_i(t)
d_i − y_i = 0 if correct; +2 if d_i = +1, y_i = −1 (incorrect); −2 if d_i = −1, y_i = +1 (incorrect).
Does it converge?
Perceptron Convergence Theorem: if the given training set is linearly separable, the learning process will converge in a finite number of steps.
Exercise: consult papers or textbooks to prove the theorem.
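A minimal sketch of the rule Δw = η(d − y)x (NumPy assumed; the AND data with labels in {−1, +1} and the bias convention x_3 = −1 are illustrative):

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, epochs=100):
    # X: rows are augmented input vectors; d: desired outputs in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, t in zip(X, d):
            y = 1 if w @ x >= 0 else -1      # y = sgn(w^T x)
            if y != t:
                w += eta * (t - y) * x       # delta w = eta (d - y) x
                errors += 1
        if errors == 0:                      # finite convergence if separable
            break
    return w

# AND function with bias input x_3 = -1 appended (linearly separable)
X = np.array([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]], dtype=float)
d = np.array([-1, -1, -1, 1])
w = train_perceptron(X, d)
preds = np.where(X @ w >= 0, 1, -1)
```

Because the AND data are linearly separable, the loop stops after a finite number of epochs with every sample classified correctly, in line with the convergence theorem.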

The Learning Scenario
[Figure sequence: four linearly separable samples +x^(1), +x^(2), −x^(3), −x^(4) in the (x_1, x_2) plane. Starting from w_0, each misclassified sample rotates the weight vector: w_0 → w_1 → w_2 → w_3, and w_4 = w_3 once all samples are classified correctly.]
The demonstration is in augmented space.
Conceptually, in augmented space, we adjust the weight vector to fit the data.

Weight Space
For a positive example x, a weight in the shaded area (the half-plane w^T x > 0 of the (w_1, w_2) plane) will give correct classification; the update Δw = ηx moves w toward that area.
For a negative example x, a weight not in that shaded area will give correct classification; the update Δw = −ηx moves w away from it.

The Learning Scenario in Weight Space
[Figure sequence: in the (w_1, w_2) plane, each sample +x^(1), +x^(2), −x^(3), −x^(4) defines a half-plane of correct weights. To correctly classify the training set, the weight must move into the shaded (feasible) intersection. The trajectory w_0 → w_1 → … → w_11 steps into that region; some steps leave the weight unchanged (e.g., w_2 = w_3).]
Conceptually, in weight space, we move the weight into the feasible region.

Feed-Forward Neural Networks
Learning Rules for Single-Layered Perceptron Networks
 Perceptron Learning Rule
 Adaline Learning Rule
 δ-Learning Rule

Adaline (Adaptive Linear Element)
Widrow [1962]
A linear neuron: y_i^(k) = w_i^T x^(k).
Goal: y_i^(k) = w_i^T x^(k) = d_i^(k), i = 1, 2, …, n; k = 1, 2, …, p.
Under what condition is the goal reachable?

LMS (Least Mean Square)
Minimize the cost function (error function):
E(w) = (1/2) Σ_{k=1}^{p} (d^(k) − y^(k))^2
     = (1/2) Σ_{k=1}^{p} (d^(k) − w^T x^(k))^2
     = (1/2) Σ_{k=1}^{p} (d^(k) − Σ_{l=1}^{m} w_l x_l^(k))^2

Gradient Descent Algorithm
Our goal is to go downhill on the error surface E(w) over (w_1, w_2); the contour map shows the descent path.
How do we find the steepest descent direction?

Gradient Operator
Let f(w) = f(w_1, w_2, …, w_m) be a function over R^m.
df = (∂f/∂w_1) dw_1 + (∂f/∂w_2) dw_2 + … + (∂f/∂w_m) dw_m
Define Δw = (dw_1, dw_2, …, dw_m)^T and ∇f = (∂f/∂w_1, ∂f/∂w_2, …, ∂f/∂w_m)^T.
Then df = ⟨Δw, ∇f⟩.
If Δw has a positive component along ∇f, df > 0: go uphill. If Δw ⊥ ∇f, df = 0: plain. If Δw has a negative component along ∇f, df < 0: go downhill.
The Steepest Descent Direction: to minimize f, choose Δw = −η ∇f.

LMS (Least Mean Square)
Minimize the cost function (error function):
E(w) = (1/2) Σ_{k=1}^{p} (d^(k) − Σ_{l=1}^{m} w_l x_l^(k))^2
∂E(w)/∂w_j = −Σ_{k=1}^{p} (d^(k) − Σ_{l=1}^{m} w_l x_l^(k)) x_j^(k)
           = −Σ_{k=1}^{p} (d^(k) − w^T x^(k)) x_j^(k)
           = −Σ_{k=1}^{p} (d^(k) − y^(k)) x_j^(k)
With δ^(k) = d^(k) − y^(k):
∂E(w)/∂w_j = −Σ_{k=1}^{p} δ^(k) x_j^(k)

Adaline Learning Rule
Minimize E(w) with ∇E(w) = (∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_m)^T.
Weight Modification Rule: Δw = −η ∇E(w)
where ∂E(w)/∂w_j = −Σ_{k=1}^{p} δ^(k) x_j^(k) and δ^(k) = d^(k) − y^(k).

Learning Modes
Batch Learning Mode: Δw_j = η Σ_{k=1}^{p} δ^(k) x_j^(k)
Incremental Learning Mode: Δw_j = η δ^(k) x_j^(k)
where δ^(k) = d^(k) − y^(k).
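The incremental mode can be sketched in a few lines (NumPy assumed; the synthetic noiseless linear target d = 2x_1 − x_2 is illustrative):

```python
import numpy as np

def adaline_train(X, d, eta=0.05, epochs=200):
    # Incremental LMS / Widrow-Hoff: delta w = eta * (d - w^T x) * x
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, d):
            y = w @ x                  # linear output y = w^T x
            w += eta * (t - y) * x     # gradient step on (1/2)(t - y)^2
    return w

# Illustrative noiseless linear target: d = 2*x1 - x2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
true_w = np.array([2.0, -1.0])
d = X @ true_w
w = adaline_train(X, d)
```

On noiseless linear data with a small step size, the weight vector converges asymptotically toward the generating weights, unlike the perceptron rule, which stops exactly once all signs are correct.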

Summary: Adaline Learning Rule
Δw = η δ x, with δ = d − y.

δ-Learning Rule
Also called the LMS algorithm or the Widrow-Hoff learning rule.
Does it converge?

LMS Convergence
Based on the independence theory (Widrow, 1976):
1. The successive input vectors are statistically independent.
2. At time t, the input vector x(t) is statistically independent of all previous samples of the desired response, namely d(1), d(2), …, d(t−1).
3. At time t, the desired response d(t) is dependent on x(t), but statistically independent of all previous values of the desired response.
4. The input vector x(t) and desired response d(t) are drawn from Gaussian-distributed populations.

LMS Convergence
It can be shown that LMS is convergent if
0 < η < 2 / λ_max
where λ_max is the largest eigenvalue of the correlation matrix R_x for the inputs:
R_x = lim_{n→∞} (1/n) Σ_{i=1}^{n} x_i x_i^T
Since λ_max is hardly available, we commonly use
0 < η < 2 / tr(R_x)
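A quick numerical check of this bound (NumPy assumed; the Gaussian input samples are illustrative). Since tr(R_x) is the sum of the eigenvalues of R_x, 2/tr(R_x) ≤ 2/λ_max, so the practical bound is the more conservative of the two:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))          # illustrative input samples
R = X.T @ X / len(X)                    # sample correlation matrix R_x
lam_max = np.linalg.eigvalsh(R).max()   # largest eigenvalue lambda_max
eta_exact = 2.0 / lam_max               # bound 0 < eta < 2 / lambda_max
eta_practical = 2.0 / np.trace(R)       # bound 0 < eta < 2 / tr(R_x)
print(eta_practical, "<=", eta_exact)
```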

Comparisons

              Perceptron Learning Rule   Adaline Learning Rule (Widrow-Hoff)
Fundamental   Hebbian assumption         Gradient descent
Convergence   In finite steps            Converges asymptotically
Constraint    Linearly separable         Linear independence

Feed-Forward Neural Networks
Learning Rules for Single-Layered Perceptron Networks
 Perceptron Learning Rule
 Adaline Learning Rule
 δ-Learning Rule

Adaline
y_i^(k) = w_i^T x^(k) (linear activation)

Unipolar Sigmoid
y_i^(k) = a(w_i^T x^(k)), with a(net_i) = 1 / (1 + e^{−λ net_i})

Bipolar Sigmoid
y_i^(k) = a(w_i^T x^(k)), with a(net_i) = 2 / (1 + e^{−λ net_i}) − 1

Goal
Minimize
E(w) = (1/2) Σ_{k=1}^{p} (d^(k) − y^(k))^2 = (1/2) Σ_{k=1}^{p} (d^(k) − a(w^T x^(k)))^2

Gradient Descent Algorithm
Δw = −η ∇E(w), where ∇E(w) = (∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_m)^T.

The Gradient
With y^(k) = a(net^(k)) and net^(k) = w^T x^(k) = Σ_i w_i x_i^(k):
∂E(w)/∂w_j = −Σ_{k=1}^{p} (d^(k) − y^(k)) ∂y^(k)/∂w_j
           = −Σ_{k=1}^{p} (d^(k) − y^(k)) (∂a(net^(k))/∂net^(k)) (∂net^(k)/∂w_j)
∂net^(k)/∂w_j = x_j^(k), while ∂a(net^(k))/∂net^(k) depends on the activation function used.

Weight Modification Rule
∂E(w)/∂w_j = −Σ_{k=1}^{p} (d^(k) − y^(k)) (∂a(net^(k))/∂net^(k)) x_j^(k)
With δ^(k) = d^(k) − y^(k), the learning rule is
Batch: Δw_j = η Σ_{k=1}^{p} δ^(k) (∂a(net^(k))/∂net^(k)) x_j^(k)
Incremental: Δw_j = η δ^(k) (∂a(net^(k))/∂net^(k)) x_j^(k)

The Learning Efficacy

              Adaline         Unipolar Sigmoid                  Bipolar Sigmoid
Activation    a(net) = net    a(net) = 1/(1 + e^{−λ net})       a(net) = 2/(1 + e^{−λ net}) − 1
∂a/∂net       1               λ y^(k) (1 − y^(k))               Exercise

Learning Rule: Unipolar Sigmoid
∂E(w)/∂w_j = −Σ_{k=1}^{p} (d^(k) − y^(k)) λ y^(k) (1 − y^(k)) x_j^(k)
With δ^(k) = d^(k) − y^(k), the Weight Modification Rule is
Δw_j = η λ Σ_{k=1}^{p} δ^(k) y^(k) (1 − y^(k)) x_j^(k)

Comparisons
Adaline
  Batch: Δw_j = η Σ_{k=1}^{p} δ^(k) x_j^(k)
  Incremental: Δw_j = η δ^(k) x_j^(k)
Sigmoid
  Batch: Δw_j = η λ Σ_{k=1}^{p} δ^(k) y^(k) (1 − y^(k)) x_j^(k)
  Incremental: Δw_j = η λ δ^(k) y^(k) (1 − y^(k)) x_j^(k)

The Learning Efficacy
Adaline: y = a(net) = net, so ∂a/∂net = 1 is constant. The learning efficacy of Adaline is constant, meaning that Adaline will never get saturated.
Sigmoid: ∂a/∂net = λ y (1 − y) depends on the output. The sigmoid will get saturated if its output value nears the two extremes, since ∂a/∂net → 0 as y → 0 or y → 1.

Initialization for Sigmoid Neurons
y_i^(k) = a(w_i^T x^(k)), with a(net_i) = 1 / (1 + e^{−λ net_i})
Before training, the weights must be sufficiently small. Why? Large weights drive the sigmoid into its saturated region, where ∂a/∂net ≈ 0 and learning stalls.

Feed-Forward Neural Networks
Multilayer Perceptron

Multilayer Perceptron
[Figure: inputs x_1, …, x_m enter the Input Layer, pass through the Hidden Layer, and leave the Output Layer as y_1, …, y_n]

Multilayer Perceptron
Input → Analysis → Classification → Output, acquired by Learning.
Where does the knowledge come from?

How an MLP Works?
Example: XOR is not linearly separable, so a single-layer perceptron is not workable.
Two lines L_1 and L_2 partition the (x_1, x_2) plane. A first layer of two LTUs (inputs x_1, x_2 and bias x_3 = −1) computes y_1 and y_2, mapping the four corners of the unit square into points 00, 01, 11 in the (y_1, y_2) plane.
In the (y_1, y_2) plane the two classes become linearly separable by a single line L_3, so a second-layer LTU (inputs y_1, y_2 and bias y_3 = −1) produces the final output z.
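A hand-wired sketch of such a two-layer LTU network for XOR (NumPy assumed; these particular line weights are illustrative, not the slides' L_1, L_2, L_3):

```python
import numpy as np

def step(v):
    # Hard-limit activation: 1 if v >= 0, else 0
    return (np.asarray(v) >= 0).astype(int)

def xor_mlp(x1, x2):
    x = np.array([x1, x2, -1.0])          # bias input x_3 = -1
    W1 = np.array([[1.0, 1.0, 0.5],       # line 1: x1 + x2 - 0.5 >= 0
                   [1.0, 1.0, 1.5]])      # line 2: x1 + x2 - 1.5 >= 0
    y = step(W1 @ x)                      # first-layer code (y1, y2)
    yb = np.append(y, -1.0)               # bias input y_3 = -1
    w2 = np.array([1.0, -1.0, 0.5])       # line 3: y1 - y2 - 0.5 >= 0
    return int(step(w2 @ yb))
```

The first layer maps the corners to codes 00, 10, 11; in that code space a single line separates the class with output 1 (code 10) from the rest.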

Parity Problem
Inputs (x_1, x_2, x_3) ∈ {0, 1}^3; the output is 1 when the number of 1s is odd:
000→0, 001→1, 010→1, 011→0, 100→1, 101→0, 110→0, 111→1
Is the problem linearly separable? No.
Three parallel planes P_1, P_2, P_3 slice the cube between the layers of corners with zero, one, two, and three 1s. A first layer of three LTUs computes y_1, y_2, y_3: a corner with s ones lies on the positive side of the first s planes, so it maps to the code 000 (s = 0), 001 (s = 1), 011 (s = 2), or 111 (s = 3).
A fourth LTU P_4 then separates the odd-parity codes {001, 111} from the even ones {000, 011}, producing the output z.

General Problem
Hyperspace Partition & Region Encoding
Lines L_1, L_2, L_3 partition the input plane into regions; each region is labeled by the binary code of the three LTU outputs: 000, 001, 010, 100, 101, 110, 111.
The hyperspace-partition (region-encoding) layer computes the code (L_1, L_2, L_3) from the input x = (x_1, x_2, x_3 = −1); the region-identification layer activates the unit for the code of the region containing x (101, 001, 000, 110, 010, 100, or 111); a final layer assigns each region its class label (0 or 1) to produce the classification.

Feed-Forward Neural Networks
Back-Propagation Learning Algorithm

Activation Function — Sigmoid
y = a(net) = 1 / (1 + e^{−net})
a′(net) = e^{−net} / (1 + e^{−net})^2 = y (1 − y)
Remember this: a′(net) = y(1 − y).
[Figure: the sigmoid rises from 0 to 1, with value 0.5 at net = 0]
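The identity a′(net) = y(1 − y) can be checked numerically against a central finite difference (NumPy assumed):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

net = np.linspace(-4.0, 4.0, 9)
y = sigmoid(net)
analytic = y * (1.0 - y)                                   # a'(net) = y(1 - y)
h = 1e-6
numeric = (sigmoid(net + h) - sigmoid(net - h)) / (2 * h)  # central difference
print(np.max(np.abs(analytic - numeric)))
```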

Supervised Learning
Training Set: T = {(x^(1), d^(1)), (x^(2), d^(2)), …, (x^(p), d^(p))}
Sum of Squared Errors:
E = Σ_{l=1}^{p} E^(l),  E^(l) = (1/2) Σ_{j=1}^{n} (d_j^(l) − o_j^(l))^2
Goal: minimize E.

Back Propagation Learning Algorithm
E = Σ_{l=1}^{p} E^(l),  E^(l) = (1/2) Σ_{j=1}^{n} (d_j^(l) − o_j^(l))^2
The algorithm has two parts: learning on output neurons and learning on hidden neurons.

Learning on Output Neurons
How to train the weights connecting to output neurons? Consider the weight w_ji from hidden neuron i to output neuron j, with o_j^(l) = a(net_j^(l)) and net_j^(l) = Σ_i w_ji o_i^(l).
∂E/∂w_ji = Σ_{l=1}^{p} ∂E^(l)/∂w_ji
∂E^(l)/∂w_ji = (∂E^(l)/∂net_j^(l)) (∂net_j^(l)/∂w_ji)
First factor:
∂E^(l)/∂net_j^(l) = (∂E^(l)/∂o_j^(l)) (∂o_j^(l)/∂net_j^(l)) = −(d_j^(l) − o_j^(l)) (∂o_j^(l)/∂net_j^(l))
Using the sigmoid, ∂o_j^(l)/∂net_j^(l) = o_j^(l) (1 − o_j^(l)), so define
δ_j^(l) = −∂E^(l)/∂net_j^(l) = (d_j^(l) − o_j^(l)) o_j^(l) (1 − o_j^(l))
Second factor: ∂net_j^(l)/∂w_ji = o_i^(l). Hence
∂E^(l)/∂w_ji = −δ_j^(l) o_i^(l) and ∂E/∂w_ji = −Σ_{l=1}^{p} δ_j^(l) o_i^(l)
Weight update for output neurons:
Δw_ji = η Σ_{l=1}^{p} δ_j^(l) o_i^(l)

Learning on Hidden Neurons
Consider the weight w_ik from neuron k to hidden neuron i.
∂E/∂w_ik = Σ_{l=1}^{p} ∂E^(l)/∂w_ik
∂E^(l)/∂w_ik = (∂E^(l)/∂net_i^(l)) (∂net_i^(l)/∂w_ik), with ∂net_i^(l)/∂w_ik = o_k^(l).
First factor:
∂E^(l)/∂net_i^(l) = (∂E^(l)/∂o_i^(l)) (∂o_i^(l)/∂net_i^(l)), with ∂o_i^(l)/∂net_i^(l) = o_i^(l) (1 − o_i^(l)).
The error depends on o_i^(l) only through the neurons j of the next layer:
∂E^(l)/∂o_i^(l) = Σ_j (∂E^(l)/∂net_j^(l)) (∂net_j^(l)/∂o_i^(l)) = −Σ_j δ_j^(l) w_ji
Define
δ_i^(l) = −∂E^(l)/∂net_i^(l) = o_i^(l) (1 − o_i^(l)) Σ_j δ_j^(l) w_ji
Hence the weight update for hidden neurons:
Δw_ik = η Σ_{l=1}^{p} δ_i^(l) o_k^(l)

Back Propagation
Errors are propagated backward from the output layer toward the input:
Output neurons: δ_j^(l) = (d_j^(l) − o_j^(l)) o_j^(l) (1 − o_j^(l)),  Δw_ji = η Σ_{l=1}^{p} δ_j^(l) o_i^(l)
Hidden neurons: δ_i^(l) = o_i^(l) (1 − o_i^(l)) Σ_j δ_j^(l) w_ji,  Δw_ik = η Σ_{l=1}^{p} δ_i^(l) o_k^(l)

Learning Factors
•Initial Weights
•Learning Constant (η)
•Cost Functions
•Momentum
•Update Rules
•Training Data and Generalization
•Number of Layers
•Number of Hidden Nodes
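The two back-propagation updates above can be combined into a short training loop. This is a sketch (NumPy assumed; the XOR task, network size, η = 0.5, iteration count, and bias convention x_m = −1 are illustrative choices, with λ = 1):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(W1, W2, X):
    Xb = np.hstack([X, -np.ones((len(X), 1))])   # bias inputs x_m = -1
    H = sigmoid(Xb @ W1.T)                       # hidden outputs o_i
    Hb = np.hstack([H, -np.ones((len(X), 1))])
    return sigmoid(Hb @ W2.T)                    # network outputs o_j

# Illustrative XOR training set
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(42)
W1 = rng.normal(0.0, 0.5, (2, 3))   # hidden layer: 2 units, last column = threshold
W2 = rng.normal(0.0, 0.5, (1, 3))   # output layer: 1 unit, last column = threshold
eta = 0.5

for _ in range(20000):
    for x, d in zip(X, D):
        xb = np.append(x, -1.0)
        o_h = sigmoid(W1 @ xb)                                  # hidden outputs
        ob = np.append(o_h, -1.0)
        o = sigmoid(W2 @ ob)                                    # network output
        delta_o = (d - o) * o * (1.0 - o)                       # output delta
        delta_h = o_h * (1.0 - o_h) * (W2[:, :2].T @ delta_o)   # back-propagated delta
        W2 += eta * np.outer(delta_o, ob)                       # delta w_ji = eta delta_j o_i
        W1 += eta * np.outer(delta_h, xb)                       # delta w_ik = eta delta_i o_k
```

Note the hidden delta is formed from the output deltas weighted by the connections w_ji (the bias column of W2 is excluded), which is exactly the backward error propagation described above.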

Reading Assignments
Shi Zhong and Vladimir Cherkassky, "Factors Controlling Generalization Ability of MLP Networks." In Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 625-630, Washington DC, July 1999. (http://www.cse.fau.edu/~zhong/pubs.htm)
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. MIT Press, Cambridge (1986). (http://www.cnbc.cmu.edu/~plaut/85-419/papers/RumelhartETAL86.backprop.pdf)