Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

MohammedBennamoun 17,959 views 113 slides May 15, 2016


CS407 Neural Computation
Lecture 4:
Single Layer Perceptron (SLP)
Classifiers
Lecturer: A/Prof. M. Bennamoun

Outline
• What's a SLP and what's classification?
• Limitations of a single perceptron
• Foundations of classification and Bayes decision-making theory
• Discriminant functions, linear machines, and minimum-distance classification
• Training and classification using the discrete perceptron
• Single-layer continuous perceptron networks for linearly separable classifications
• Appendix A: Unconstrained optimization techniques
• Appendix B: Perceptron convergence proof
• Suggested reading and references

What is a perceptron and what is a Single Layer Perceptron (SLP)?

Perceptron
• The simplest form of a neural network
• Consists of a single neuron with adjustable synaptic weights and bias
• Performs pattern classification with only two classes
• Perceptron convergence theorem:
  - Patterns (vectors) are drawn from two linearly separable classes
  - During training, the perceptron algorithm converges and positions the decision surface in the form of a hyperplane between the two classes by adjusting the synaptic weights

What is a perceptron?
[Diagram: input signals x1, x2, …, xm; synaptic weights wk1, wk2, …, wkm; a summing junction Σ with bias bk; and an activation function φ(.) applied to vk]

  vk = Σ_{j=1}^{m} wkj xj + bk
  yk = φ(vk)

Discrete perceptron: φ(.) = sgn(.)
Continuous perceptron: φ(.) = S-shaped (sigmoid) function

Activation Function of a perceptron
[Figure: the signum (sign) function of vi, a hard limiter taking the values -1 and +1; and an S-shaped (sigmoid) curve rising smoothly from -1 to +1]

Discrete perceptron: φ(.) = sgn(.)
Continuous perceptron: φ(v) = S-shaped (sigmoid) function of v
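The model above can be sketched in a few lines of code. This is a minimal illustration only; the function name and the bipolar sigmoid chosen for the continuous case are my own assumptions, not taken from the slides.

```python
import math

def perceptron(x, w, b, discrete=True):
    """Single perceptron: weighted sum plus bias, then activation."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b   # summing junction
    if discrete:
        return 1 if v >= 0 else -1                  # signum activation
    return 2.0 / (1.0 + math.exp(-v)) - 1.0         # bipolar sigmoid in (-1, 1)

print(perceptron([1, 1], [1, 1], -1.5))   # discrete output is +1 or -1
```

The discrete variant hard-limits the net input, while the continuous variant returns a value strictly between -1 and +1, which is what makes gradient-based training possible later in the lecture.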

SLP Architecture
Single layer perceptron
Input layer
Output layer

Where are we heading? Different Non-Linearly Separable Problems
http://www.zsolutions.com/light.htm

Structure and the types of decision regions it can form:
• Single-layer: half plane bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary regions (complexity limited by the number of nodes)
[Figure: for each structure, sketches of the exclusive-OR problem, classes with meshed regions, and the most general region shapes, with the two classes labelled A and B]

Review from last lectures:

Implementing Logic Gates with Perceptrons
http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf
• We can use the perceptron to implement the basic logic gates (AND, OR and NOT).
• All we need to do is find the appropriate connection weights and neuron thresholds to produce the right outputs for each set of inputs.
• We saw how we can construct simple networks that perform NOT, AND, and OR.
• It is a well-known result from logic that we can construct any logical function from these three operations.
• The resulting networks, however, will usually have a much more complex architecture than a simple perceptron.
• We generally want to avoid decomposing complex problems into simple logic gates, and instead find the weights and thresholds that work directly in a perceptron architecture.

Implementation of Logical NOT, AND, and OR
• In each case we have inputs in_i and outputs out, and need to determine the weights and thresholds. It is easy to find solutions by inspection:

The Need to Find Weights Analytically
• Constructing simple networks by hand is one thing. But what about harder problems?
• How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error.
• Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.

Finding Weights Analytically for the AND Network
• We have two weights w1 and w2 and the threshold θ, and for each training pattern we need to satisfy out = sgn(w1 in1 + w2 in2 − θ).
• So the training data lead to four inequalities:
• It is easy to see that there are an infinite number of solutions. Similarly, there are an infinite number of solutions for the NOT and OR networks.
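One of those infinitely many solutions can be checked quickly in code. The particular values w1 = w2 = 1, θ = 1.5 and the 0/1 output convention below are my own choices for illustration, not taken from the slides:

```python
def and_perceptron(in1, in2, w1=1.0, w2=1.0, theta=1.5):
    """Threshold unit: fires (1) iff the weighted sum exceeds theta."""
    return 1 if w1 * in1 + w2 * in2 > theta else 0

# The four training patterns of the AND gate and their target outputs.
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(all(and_perceptron(a, b) == target for (a, b), target in patterns))
```

Any weights satisfying the four inequalities work equally well, which is exactly the "infinite number of solutions" point made above.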

Limitations of Simple Perceptrons
• We can follow the same procedure for the XOR network:
• Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution. We need more complex networks, e.g. ones that combine together many simple networks, or use different activation/thresholding/transfer functions.
• It then becomes much more difficult to determine all the weights and thresholds by hand.
• These weights are instead adapted using learning rules. Hence the need to consider learning rules (see the previous lecture), and more complex architectures.
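The impossibility can also be checked numerically. This sketch (mine, not from the slides) scans a grid of weights and thresholds and finds that none of them implements XOR with a single threshold unit:

```python
import itertools

def solves_xor(w1, w2, theta):
    """True if a single threshold unit with these parameters computes XOR."""
    xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    return all((1 if w1 * a + w2 * b > theta else 0) == out
               for (a, b), out in xor.items())

grid = [v / 2.0 for v in range(-8, 9)]   # candidate values -4.0 .. 4.0
found = any(solves_xor(w1, w2, t)
            for w1, w2, t in itertools.product(grid, repeat=3))
print(found)
```

Because XOR is not linearly separable, the search fails for any grid, however fine; this is the numerical counterpart of the incompatible inequalities above.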

E.g. Decision Surface of a Perceptron
[Figure: left, + and - points in the (x1, x2) plane that no straight line can separate: non-linearly separable. Right, two clusters of + and - points split by a straight line: linearly separable.]

The perceptron is able to represent some useful functions, but functions that are not linearly separable (e.g. XOR) are not representable.

What is classification?

Classification?
http://140.122.185.120
• Pattern classification/recognition: assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories).
• The block diagram of the recognition and classification system:

Classification: an example
• Automate the process of sorting incoming fish on a conveyor belt according to species (salmon or sea bass):
  - Set up a camera
  - Take some sample images
  - Note the physical differences between the two types of fish: length, lightness, width, number and shape of the fins ("sanfirim"), position of the mouth
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
Duda & Hart, Chapter 1

Classification: an example…
Cost of misclassification depends on the application. Is it better to misclassify salmon as bass or vice versa?
• Put salmon in a can of bass: lose profit
• Put bass in a can of salmon: lose customers
• There is a cost associated with our decision.
• Make a decision to minimize a given cost.

Feature Extraction:
• Problem and domain dependent
• Requires knowledge of the domain
• A good feature extractor would make the job of the classifier trivial.

Bayesian decision theory

Bayesian Decision Theory
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
Duda & Hart, Chapter 2
• Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
  - Decision making when all the probabilistic information is known.
  - For given probabilities the decision is optimal.
  - When new information is added, it is assimilated in optimal fashion for the improvement of decisions.

Bayesian Decision Theory …
• Fish example: each fish is in one of 2 states: sea bass or salmon.
• Let ω denote the state of nature, with
  - ω = ω1 for sea bass
  - ω = ω2 for salmon

Bayesian Decision Theory …
• The state of nature is unpredictable: ω is a variable that must be described probabilistically.
• If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
• Define:
  - P(ω1): a priori probability that the next fish is sea bass
  - P(ω2): a priori probability that the next fish is salmon

Bayesian Decision Theory …
• If other types of fish are irrelevant: P(ω1) + P(ω2) = 1.
• Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …).
• Simple decision rule: make a decision without seeing the fish:
  - Decide ω1 if P(ω1) > P(ω2); decide ω2 otherwise.
  - OK if deciding for one fish.
  - If there are several fish, all are assigned to the same class.

Bayesian Decision Theory ...
• In general, we will have some features and more information.
• Feature: lightness measurement = x
  - Different fish yield different lightness readings (x is a random variable)

Bayesian Decision Theory ….
• Define p(x|ω1) = class-conditional probability density: the probability density function for x given that the state of nature is ω1.
• The difference between p(x|ω1) and p(x|ω2) describes the difference in lightness between sea bass and salmon.

Class-conditional probability density: p(x|ω)
[Figure: hypothetical class-conditional probability density functions; the density functions are normalized (the area under each curve is 1.0)]


Bayesian Decision Theory ...
• Suppose that we know the prior probabilities P(ω1) and P(ω2), and the conditional densities p(x|ω1) and p(x|ω2).
• We measure the lightness of a fish = x.
• What is the category of the fish?

Bayes Formula
• Given:
  - Prior probabilities P(ωj)
  - Conditional probabilities p(x|ωj)
  - Measurement of a particular item: feature value x
• Bayes formula (from p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)):

  P(ωj|x) = p(x|ωj) P(ωj) / p(x)

  where p(x) = Σ_j p(x|ωj) P(ωj), so that Σ_j P(ωj|x) = 1.

• In words: Posterior = (Likelihood × Prior) / Evidence

Bayes' formula ...
• p(x|ωj) is called the likelihood of ωj with respect to x (the ωj category for which p(x|ωj) is large is more "likely" to be the true category).
• p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.
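A small numeric sketch of the formula; the likelihood values at a particular x are invented for illustration, while the priors 2/3 and 1/3 follow the example used in the figure below:

```python
def posteriors(likelihoods, priors):
    """Bayes formula: posterior_j = likelihood_j * prior_j / evidence."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical densities p(x|w1)=0.6, p(x|w2)=0.2 at some x; priors 2/3, 1/3.
post = posteriors([0.6, 0.2], [2 / 3, 1 / 3])
print(post)
```

Whatever the likelihoods, dividing by the evidence forces the posteriors to sum to 1, which is exactly the role of p(x) described above.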

Posterior Probability
[Figure: posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3; at every x the posteriors sum to 1.]

Error

  P(error|x) = P(ω1|x)  if we decide ω2
  P(error|x) = P(ω2|x)  if we decide ω1

For a given x, we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x), and ω2 otherwise.

Bayes' Decision Rule (minimizes the probability of error)

  ω1: if P(ω1|x) > P(ω2|x)
  ω2: otherwise

or, equivalently (using Bayes formula),

  ω1: if p(x|ω1) P(ω1) > p(x|ω2) P(ω2)
  ω2: otherwise

and the probability of error is

  P(error|x) = min [ P(ω1|x), P(ω2|x) ]

Likelihood-ratio form: decide ω1 if

  p(x|ω1) / p(x|ω2) > P(ω2) / P(ω1)

i.e. the likelihood ratio is compared against a threshold that depends only on the priors.
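The likelihood-ratio rule can be sketched with Gaussian class-conditional densities; the means, common variance, and priors below are my own illustrative choices, not values from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density, used here as a class-conditional likelihood."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_decide(x, mu1=2.0, mu2=4.0, sigma=1.0, p1=2 / 3, p2=1 / 3):
    """Decide class 1 iff the likelihood ratio exceeds the prior ratio."""
    ratio = gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu2, sigma)
    return 1 if ratio > p2 / p1 else 2

print(bayes_decide(1.5), bayes_decide(4.5))
```

Measurements near μ1 favor class 1 and measurements near μ2 favor class 2, with the prior ratio shifting the threshold between them.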

Decision Boundaries
• Classification as division of the feature space into non-overlapping regions X1, …, XK, such that x ∈ Xk means x is assigned to ωk.
• Boundaries between these regions are known as decision surfaces or decision boundaries.

Optimum decision boundaries
• Criterion: minimize misclassification, i.e. maximize correct classification.
• Classify x ∈ Xk if

  p(x|ωk) P(ωk) > p(x|ωj) P(ωj)  for all j ≠ k

  i.e. P(ωk|x) > P(ωj|x): choose the class of maximum posterior probability.
• The probability of correct classification is

  P(correct) = Σ_{k=1}^{R} P(x ∈ Xk, ωk) = Σ_{k=1}^{R} ∫_{Xk} p(x|ωk) P(ωk) dx    (here R = 2)

Discriminant functions
• Discriminant functions determine classification by comparison of their values: classify x ∈ Xk if gk(x) > gj(x) for all j ≠ k.
• Optimum classification is based on the posterior probability: gk(x) = P(ωk|x).
• Any monotone function g may be applied without changing the decision boundaries, e.g. gk(x) = g(P(ωk|x)), in particular gk(x) = ln(P(ωk|x)).

The Two-Category Case
• Use 2 discriminant functions g1 and g2, and assign x to ω1 if g1 > g2.
• Alternative: define a single discriminant function g(x) = g1(x) - g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.

Two-category case

  g(x) = P(ω1|x) − P(ω2|x)

  g(x) = ln [ p(x|ω1) / p(x|ω2) ] + ln [ P(ω1) / P(ω2) ]

Summary
• Bayes approach:
  - Estimate the class-conditional probability density
  - Combine with the prior class probability
  - Determine the posterior class probability
  - Derive the decision boundaries
• Alternate approach, implemented by neural networks:
  - Estimate the posterior probability directly, i.e. determine the decision boundaries directly

DISCRIMINANT FUNCTIONS

Discriminant Functions
http://140.122.185.120
• Determine the membership of x in a category by comparing R discriminant functions g1(x), g2(x), …, gR(x).
• x is within the region Xk if gk(x) has the largest value.
Do not mix up: n = dimension of each input vector (dimension of the feature space); P = number of input vectors; R = number of classes.

Discriminant Functions…
[Figures spanning several slides: examples of discriminant functions and the decision regions they induce]

Linear Machine and Minimum Distance Classification
• Find the linear-form discriminant function for two-class classification when the class prototypes are known.
• Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes.

Linear Machine and Minimum Distance Classification… (dichotomizer)
• The dichotomizer's discriminant function g(x), with class prototype centers x1 and x2:

  g(x) = (x1 − x2)^t x + ½ (x2^t x2 − x1^t x1)

Linear Machine and Minimum Distance Classification… (multiclass classification)
• The linear-form discriminant functions for multiclass classification.
• There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes (i.e. classes next to or touching another).

Linear Machine and Minimum Distance Classification… (multiclass classification)
• Linear machine or minimum-distance classifier:
  - Assume the class prototypes are known for all classes.
  - Euclidean distance between the input pattern x and the center of class i, xi: ||x − xi||; minimizing it is equivalent to maximizing the linear discriminant gi(x) = xi^t x − ½ xi^t xi.

Linear Machine and Minimum Distance Classification… (multiclass classification)
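A minimum-distance classifier of this kind can be sketched directly; the three class centers below are chosen for illustration (they follow the style of the slides' worked example, but treat them as placeholders):

```python
def classify_min_distance(x, prototypes):
    """Assign x to the class whose prototype (class center) is nearest in
    Euclidean distance; equivalent to maximizing the linear discriminant
    g_i(x) = x_i^t x - 0.5 * x_i^t x_i."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda i: sq_dist(x, prototypes[i]))

centers = {1: (10, 2), 2: (2, -5), 3: (-5, 5)}   # hypothetical class centers
print(classify_min_distance((9, 1), centers))     # nearest center wins
```

Because the squared distance differs from the linear discriminant only by the class-independent term x^t x, comparing distances and comparing the gi(x) give identical decisions.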

Linear Machine and Minimum Distance Classification…
Note: to find S12 we need to compute (g1 - g2).
P1, P2, P3 are the centers of gravity of the prototype points, and we need to design a minimum-distance classifier. Using the formulas from the previous slide, we get the weights wi.

Linear Machine and Minimum Distance Classification…
• If R linear discriminant functions exist for a set of patterns such that

  gi(x) > gj(x)   for x in Class i;  i, j = 1, 2, …, R;  j ≠ i

• then the classes are linearly separable.

Linear Machine and Minimum Distance Classification… Example:

Linear Machine and Minimum Distance Classification… Example…

Linear Machine and Minimum Distance Classification…
• Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if a priori information about the sets of patterns and their class membership is known.
• In the next section (discrete perceptron) we will examine neural networks that derive their weights during the learning cycle.

Linear Machine and Minimum Distance Classification… •The example of linearly non-separable patterns

Linear Machine and Minimum Distance Classification…
Input space (x), image space (o):

  o1 = sgn(x1 + x2 + 1)
  o2 = sgn(−x1 − x2 + 1)

[Figure: the patterns plotted in the input space (x1, x2) and in the image space (o1, o2); the truth table lists the ±1 pattern components and the resulting o1, o2 values. Two of the inputs map to the same point (1, 1) in the image space.]

The Discrete Perceptron

Discrete Perceptron Training Algorithm
• So far, we have shown that the coefficients of linear discriminant functions, called weights, can be determined based on a priori information about sets of patterns and their class membership.
• In what follows, we will begin to examine neural network classifiers that derive their weights during the learning cycle.
• The sample pattern vectors x1, x2, …, xp, called the training sequence, are presented to the machine along with the correct response.

Discrete Perceptron Training Algorithm - Geometrical Representations
http://140.122.185.120
Zurada, Chapter 3
[Figure: decision surfaces in the weight space; each intersects the origin (the point w = 0); 5 prototype patterns in this case: y1, y2, …, y5]
If the dimension of the augmented pattern vector is > 3, our powers of visualization are no longer of assistance. In this case, the only recourse is to use the analytical approach.

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Devise an analytic approach based on the geometrical representations, e.g. the decision surface for the training pattern y1:
  - If y1 is in Class 1 but was misclassified: w' = w + c y1
  - If y1 is in Class 2 but was misclassified: w' = w − c y1
• c (> 0) is the correction increment (it is two times the learning constant ρ introduced before); c controls the size of the adjustment.
• In the weight space, the gradient gives the direction of steepest increase (see the previous slide), and the correction is made in the negative gradient direction.

Discrete Perceptron Training Algorithm - Geometrical Representations…
Note 1: p = distance, so p > 0.
Note 2: c is not constant and depends on the current training pattern, as expressed by the equation above: choosing the correction so that the new weight reaches the pattern hyperplane gives

  c = w^t y / (y^t y)

Discrete Perceptron Training Algorithm - Geometrical Representations…
• For the fixed correction rule, c = constant: the correction of the weights is always the same fixed portion of the current training vector.
  - The weights can be initialised at any value.
• For the dynamic correction rule, c depends on the distance from the weight vector to the decision surface in the weight space. Hence:
  - The initial weight should be different from 0 (if w1 = 0, then c y = 0 and w' = w1 + c y = 0, therefore no adjustments are possible).
Here y is the current input pattern and w1 is the current weight.

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Dynamic correction rule: using the value of c from the previous slide as a reference, we devise an adjustment technique which depends on the length w2 − w1.
Note: λ is the ratio of the distance between the old weight vector w1 and the new w2, to the distance from w1 to the pattern hyperplane.
  - λ = 2: symmetrical reflection w.r.t. the decision plane
  - λ = 0: no weight adjustment

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Example:

  Class 1: x1 = 1, x3 = 3, with d1 = d3 = 1
  Class 2: x2 = −0.5, x4 = −2, with d2 = d4 = −1

• The augmented input vectors are:

  y1 = [1 1]^t,  y2 = [−0.5 1]^t,  y3 = [3 1]^t,  y4 = [−2 1]^t

• The decision lines w^t yi = 0, for i = 1, 2, 3, 4, are sketched on the augmented weight space as follows:

Discrete Perceptron Training Algorithm - Geometrical Representations…
[Figure: the four decision lines in the augmented weight space]

Discrete Perceptron Training Algorithm - Geometrical Representations…
For c = 1 and w1 = [−2.5 1.75]^t, the weight training at each step can be summarized as

  w^{k+1} = w^k ± c y^k,  i.e.  Δw^k = (c/2) [dk − sgn(w^{k t} y^k)] y^k

We obtain the following outputs and weight updates:
• Step 1: Pattern y1 is input

  o1 = sgn([−2.5 1.75] [1 1]^t) = −1 ≠ d1,
  so  w2 = w1 + y1 = [−1.5 2.75]^t

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Step 2: Pattern y2 is input

  o2 = sgn([−1.5 2.75] [−0.5 1]^t) = 1 ≠ d2,
  so  w3 = w2 − y2 = [−1 1.75]^t

• Step 3: Pattern y3 is input

  o3 = sgn([−1 1.75] [3 1]^t) = −1 ≠ d3,
  so  w4 = w3 + y3 = [2 2.75]^t

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Since we have no evidence of correct classification of weight w4, the training set consisting of the ordered sequence of patterns y1, y2, y3, y4 needs to be recycled. We thus have y5 = y1, y6 = y2, etc. (the superscript is used to denote the training step number).
• Steps 4, 5: w6 = w5 = w4 (no misclassification, thus no weight adjustments).
• You can check that the adjustments in steps 6 through 10 are as follows:

  w7 = [2.5 1.75]^t,  w8 = w9 = w10 = w7,  w11 = [3 0.75]^t

w11 is in the solution area.
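The whole run above can be reproduced in a few lines. This sketch (mine) uses the fixed correction rule with c = 1, the example's augmented patterns, and the same initial weight; the update w + c·d·y on a misclassification is algebraically the same as adding (c/2)(d − o) y, since d − o = 2d whenever o ≠ d:

```python
def train_discrete_perceptron(patterns, targets, w, c=1.0, cycles=3):
    """Fixed-correction discrete perceptron training on augmented patterns."""
    for _ in range(cycles):
        for y, d in zip(patterns, targets):
            o = 1 if sum(wi * yi for wi, yi in zip(w, y)) >= 0 else -1
            if o != d:                                   # adjust only on error
                w = [wi + c * d * yi for wi, yi in zip(w, y)]
    return w

ys = [(1, 1), (-0.5, 1), (3, 1), (-2, 1)]   # augmented patterns y1..y4
ds = [1, -1, 1, -1]                          # desired responses
print(train_discrete_perceptron(ys, ds, [-2.5, 1.75]))  # ends in solution area
```

Running three cycles through the four patterns reproduces the hand computation, ending at [3, 0.75], after which every pattern is classified correctly and the weights stop changing.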

The Continuous Perceptron

Continuous Perceptron Training Algorithm
http://140.122.185.120
Zurada, Chapter 3
• Replace the TLU (Threshold Logic Unit) with the sigmoid activation function, for two reasons:
  - Gain finer control over the training procedure
  - Facilitate the differential characteristics, to enable computation of the error gradient (of the current error function)
The factor ½ does not affect the location of the error minimum.

Continuous Perceptron Training Algorithm…
• The new weights are obtained by moving in the direction of the negative gradient along the multidimensional error surface.
By definition of the steepest descent concept, each elementary move should be perpendicular to the current error contour.

Continuous Perceptron Training Algorithm…
• Define the error as the squared difference between the desired output and the actual output.
• Since net = w^t y, we have ∂(net)/∂wi = yi for i = 1, 2, …, n+1.
This yields the training rule of the continuous perceptron (equivalent to the delta training rule).

Continuous Perceptron Training Algorithm…

Continuous Perceptron Training Algorithm…
Same as the previous example (of the discrete perceptron), but with a continuous activation function and using the delta rule, on the same training pattern set.
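The delta rule for a single continuous (bipolar sigmoid) perceptron can be sketched as follows; the learning constant and iteration count are my own choices, not values from the slides:

```python
import math

def train_continuous_perceptron(patterns, targets, w, eta=0.5, epochs=1000):
    """Delta rule: w += eta * (d - o) * f'(net) * y, with the bipolar sigmoid
    o = 2/(1+exp(-net)) - 1, whose derivative is f'(net) = 0.5*(1 - o*o)."""
    for _ in range(epochs):
        for y, d in zip(patterns, targets):
            net = sum(wi * yi for wi, yi in zip(w, y))
            o = 2.0 / (1.0 + math.exp(-net)) - 1.0
            delta = eta * (d - o) * 0.5 * (1.0 - o * o)
            w = [wi + delta * yi for wi, yi in zip(w, y)]
    return w

ys = [(1, 1), (-0.5, 1), (3, 1), (-2, 1)]   # same augmented patterns as before
ds = [1, -1, 1, -1]
w = train_continuous_perceptron(ys, ds, [-2.5, 1.75])
print(w)
```

Unlike the discrete rule, every presentation adjusts the weights a little, by an amount scaled with both the output error and the sigmoid's slope; after training, the sign of w^t y matches the desired response for every pattern.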

Continuous Perceptron Training Algorithm…

  Ek = ½ [ dk − ( 2 / (1 + exp(−λ netk)) − 1 ) ]²

Setting λ = 1 and reducing the terms simplifies this expression to the following form:

  E1(w) = 2 / [1 + exp(w1 + w2)]²

and similarly

  E2(w) = 2 / [1 + exp(0.5 w1 − w2)]²
  E3(w) = 2 / [1 + exp(3 w1 + w2)]²
  E4(w) = 2 / [1 + exp(2 w1 − w2)]²

These error surfaces are as shown on the previous slide.

Continuous Perceptron Training Algorithm…
[Figure: the total error surface, with the weight trajectory descending to the minimum]

Multicategory SLP

Multi-category Single Layer Perceptron nets
• Treat the last fixed component of the input pattern vector as the neuron activation threshold: T = w_{n+1}, with y_{n+1} = −1 (it is irrelevant whether it is equal to +1 or −1).

Multi-category Single layer Perceptron nets…
• R-category linear classifier using R discrete bipolar perceptrons.
• Goal: the i-th TLU response of +1 is indicative of class i, and all other TLUs respond with −1.
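The one-perceptron-per-class scheme can be sketched directly; the weight rows below are placeholders I chose for illustration, not trained values from the example:

```python
def multicategory_response(x, W):
    """R discrete bipolar perceptrons, one weight vector per class.
    Returns a list of +1/-1 responses, ideally +1 only at the true class."""
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1 for w in W]

# Hypothetical weight rows for a 3-class classifier on augmented 3-vectors.
W = [(1.0, 0.0, -0.5), (-1.0, 1.0, -0.5), (0.0, -1.0, -0.5)]
print(multicategory_response((2.0, 0.0, 1.0), W))
```

A response vector with exactly one +1 identifies a class; anything else (e.g. two +1s, or all −1s) is an indecisive response of the kind discussed next.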

Multi-category Single layer Perceptron nets…
• Example 3.5: the response should be [−1 −1 1]^t.
• Indecision regions = regions where no class membership of an input pattern can be uniquely determined based on the response of the classifier (patterns in shaded areas are not assigned any reasonable classification, e.g. point Q, for which o = [1 1 −1]^t, an indecisive response). However, no patterns such as Q have been used for training in the example.

Multi-category Single layer Perceptron nets…
For c = 1 and initial weights w1^1 = [1 −2 0]^t, w2^1 = [0 −1 2]^t, w3^1 = [1 3 −1]^t:

• Step 1: Pattern y1 = [10 2 −1]^t is input:

  sgn([1 −2 0] y1) = 1,  sgn([0 −1 2] y1) = −1,  sgn([1 3 −1] y1) = 1*

Since the only incorrect response is provided by TLU3, we have

  w1^2 = w1^1,  w2^2 = w2^1,  w3^2 = w3^1 − y1 = [−9 1 0]^t

Multi-category Single layer Perceptron nets…
• Step 2: Pattern y2 = [2 −5 −1]^t is input:

  sgn([1 −2 0] y2) = 1*,  sgn([0 −1 2] y2) = 1,  sgn([−9 1 0] y2) = −1

This time only TLU1 responds incorrectly, so

  w1^3 = w1^2 − y2 = [−1 3 1]^t,  w2^3 = w2^2,  w3^3 = w3^2

Multi-category Single layer Perceptron nets…
• Step 3: Pattern y3 = [−5 5 −1]^t is input:

  sgn([−1 3 1] y3) = 1*,  sgn([0 −1 2] y3) = −1,  sgn([−9 1 0] y3) = 1

One can verify that the only adjusted weights from now on are those of TLU1:

  w1^4 = w1^3 − y3 = [4 −2 2]^t,  w2^4 = w2^3,  w3^4 = w3^3

During the second cycle:

  w1^5 = w1^4,  w1^6 = w1^5 − y2 = [2 3 3]^t,  w1^7 = w1^6 − y3 = [7 −2 4]^t,
  w1^8 = w1^7,  w1^9 = w1^8 − y2 = [5 3 5]^t

Multi-category Single layer Perceptron nets…

• R-category linear classifier using R continuous bipolar perceptrons

Comparison between Perceptron and Bayes' Classifier
• The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes classifier can work on nonseparable patterns.
• The Bayes classifier minimizes the probability of misclassification, which is independent of the underlying distribution.
• The Bayes classifier is a linear classifier on the assumption of Gaussianity.
• The perceptron is non-parametric, while the Bayes classifier is parametric (its derivation is contingent on the assumption of the underlying distributions).
• The perceptron is adaptive and simple to implement; the Bayes classifier could be made adaptive too, but at the expense of increased storage and more complex computations.

APPENDIX A
Unconstrained Optimization Techniques

Unconstrained Optimization Techniques
http://ai.kaist.ac.kr/~jkim/cs679/
Haykin, Chapter 3
• Cost function E(w):
  - continuously differentiable
  - a measure of how to choose w of an adaptive filtering algorithm so that it behaves in an optimum manner
• We want to find an optimal solution w* that minimizes E(w), i.e. ∇E(w*) = 0.
• Local iterative descent: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), …, such that the cost function E(w) is reduced at each iteration of the algorithm, as shown by E(w(n+1)) < E(w(n)).
• Methods: steepest descent, Newton's, Gauss-Newton's.

Method of Steepest Descent
• Here the successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to grad(E(w)):

  w(n+1) = w(n) − a g(n)

  where a is a small positive constant called the step size or learning-rate parameter, and g(n) = grad(E(w(n))).
• The method of steepest descent converges to the optimal solution w* slowly.
• The learning-rate parameter a has a profound influence on the convergence behavior: overdamped, underdamped, or even unstable (diverges).
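The iteration can be sketched on a toy quadratic cost; the cost function, its gradient, and the step size below are my own illustrative choices:

```python
def steepest_descent(grad, w, a=0.1, steps=100):
    """w(n+1) = w(n) - a * g(n), with g the gradient of the cost at w(n)."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - a * gi for wi, gi in zip(w, g)]
    return w

# Toy cost E(w) = (w1 - 3)^2 + (w2 + 1)^2, with gradient 2*(w - w*).
grad = lambda w: [2 * (w[0] - 3.0), 2 * (w[1] + 1.0)]
w = steepest_descent(grad, [0.0, 0.0])
print(w)  # approaches the minimum at (3, -1)
```

With a = 0.1 the distance to the minimum shrinks by a constant factor each step (the overdamped case); too large an a would make the iterates overshoot or diverge, as noted above.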

Newton's Method
• Using a second-order Taylor series expansion of the cost function around the point w(n):

  ΔE(w(n)) = E(w(n+1)) − E(w(n)) ≈ g^T(n) Δw(n) + ½ Δw^T(n) H(n) Δw(n)

  where Δw(n) = w(n+1) − w(n), and H(n) is the Hessian matrix of E(n).
• We want the Δw*(n) that minimizes ΔE(w(n)), so we differentiate ΔE(w(n)) with respect to Δw(n):

  g(n) + H(n) Δw*(n) = 0,  so  Δw*(n) = −H⁻¹(n) g(n)

Newton's Method…
Finally,

  w(n+1) = w(n) + Δw(n) = w(n) − H⁻¹(n) g(n)

• Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior.
• The Hessian H(n) has to be a positive definite matrix for all n.

Gauss-Newton Method
• The Gauss-Newton method is applicable to a cost function expressed as a sum of error squares:

  E(w) = ½ Σ_{i=1}^{n} e²(i)

• Because the error signal e(i) is a function of w, we linearize the dependence of e(i) on w by writing

  e'(i, w) = e(i) + [∂e(i)/∂w]^T (w − w(n))

• Equivalently, by using matrix notation we may write

  e'(n, w) = e(n) + J(n) (w − w(n))

Gauss-Newton Method
• where J(n) is the n-by-m Jacobian matrix of e(n), with entries ∂e(i)/∂wj.
• We want the updated weight vector w(n+1) defined by

  w(n+1) = arg min_w { ½ ||e'(n, w)||² }

• A simple algebraic calculation gives

  ½ ||e'(n, w)||² = ½ ||e(n)||² + e^T(n) J(n) (w − w(n)) + ½ (w − w(n))^T J^T(n) J(n) (w − w(n))

• Now differentiate this expression with respect to w and set the result to 0.

Gauss-Newton Method…
We obtain

  J^T(n) e(n) + J^T(n) J(n) (w − w(n)) = 0

Thus we get

  w(n+1) = w(n) − (J^T(n) J(n))⁻¹ J^T(n) e(n)

• To guard against the possibility that the matrix product J^T(n) J(n) is singular, the customary practice is to use

  w(n+1) = w(n) − (J^T(n) J(n) + δI)⁻¹ J^T(n) e(n)

  where δ is a small positive constant. This modification's effect is progressively reduced as the number of iterations n is increased.

Linear Least-Squares Filter
• The single neuron around which it is built is linear.
• The cost function consists of the sum of error squares.
• Using y(i) = x^T(i) w and e(i) = d(i) − y(i), the error vector is

  e(n) = d(n) − X(n) w(n)

• Differentiating with respect to w(n) gives, correspondingly,

  ∇e(n) = −X^T(n),  so  J(n) = −X(n)

• From the Gauss-Newton method (eq. 3.22):

  w(n+1) = (X^T(n) X(n))⁻¹ X^T(n) d(n) = X⁺(n) d(n)

LMS Algorithm
• Based on the use of instantaneous values for the cost function: E(w) = ½ e²(n).
• Differentiating with respect to w: ∂E(w)/∂w = e(n) ∂e(n)/∂w.
• The error signal in the LMS algorithm is e(n) = d(n) − x^T(n) w(n); hence ∂e(n)/∂w(n) = −x(n), so

  ∂E(w)/∂w(n) = −x(n) e(n)

LMS Algorithm …
• Using ĝ(n) = −x(n) e(n) as an estimate for the gradient vector, and substituting it into the steepest descent method, the LMS algorithm follows:

  ŵ(n+1) = ŵ(n) + η x(n) e(n)

  where η is the learning-rate parameter.
• The inverse of η is a measure of the memory of the LMS algorithm: when η is small, the adaptive process progresses slowly, more of the past data are remembered, and a more accurate filtering action results.
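The update can be sketched on a toy system-identification task; the input samples, true weights, and η below are invented for illustration:

```python
def lms_step(w, x, d, eta=0.05):
    """One LMS update: e = d - x.w ; w += eta * e * x."""
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * e * xi for wi, xi in zip(w, x)], e

# Identify a 2-tap filter with true weights (0.5, -0.3) from noiseless data.
true_w = (0.5, -0.3)
w = [0.0, 0.0]
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)] * 200
for x in samples:
    d = sum(ti * xi for ti, xi in zip(true_w, x))
    w, _ = lms_step(w, x, d)
print(w)  # drifts toward (0.5, -0.3)
```

Each sample nudges the weights using only the instantaneous error, which is why the trajectory is random rather than the smooth descent path of the previous methods.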

LMS Characteristics
• The LMS algorithm produces an estimate ŵ(n) of the weight vector w(n), sacrificing a distinctive feature:
  - Steepest descent algorithm: follows a well-defined trajectory
  - LMS algorithm: follows a random trajectory; as the number of iterations goes to infinity, ŵ(n) performs a random walk
• But importantly, the LMS algorithm does not require knowledge of the statistics of the environment.

Convergence Consideration
• Two distinct quantities, η and the input vector x(n), determine the convergence: the user supplies η, and the selection of η is important for the LMS algorithm to converge.
• Convergence of the mean: E[ŵ(n)] → w0 as n → ∞. This is not a practical criterion.
• Convergence in the mean square: E[e²(n)] → constant as n → ∞.
• Convergence condition for the LMS algorithm in the mean square:

  0 < η < 2 / (sum of mean-square values of the sensor inputs)

APPENDIX B
Perceptron Convergence Proof

Perceptron Convergence Proof
Haykin, Chapter 3
Consider the following perceptron:

  v(n) = Σ_{i=0}^{m} wi(n) xi(n) = w^T(n) x(n)

  w^T x > 0  for every input vector x belonging to class C1
  w^T x ≤ 0  for every input vector x belonging to class C2

Perceptron Convergence Proof…
• The algorithm for the weight adjustment of the perceptron:
  - If x(n) is correctly classified, no adjustments to w:

    w(n+1) = w(n)  if  w^T(n) x(n) > 0 and x(n) belongs to class C1
    w(n+1) = w(n)  if  w^T(n) x(n) ≤ 0 and x(n) belongs to class C2

  - Otherwise:

    w(n+1) = w(n) − η(n) x(n)  if  w^T(n) x(n) > 0 and x(n) belongs to class C2
    w(n+1) = w(n) + η(n) x(n)  if  w^T(n) x(n) ≤ 0 and x(n) belongs to class C1

  - The learning-rate parameter η(n) controls the adjustment applied to the weight vector.

Perceptron Convergence Proof
For η(n) = 1 and w(0) = 0, suppose the perceptron incorrectly classifies the vectors x(1), x(2), …, such that w^T(n) x(n) ≤ 0 while x(n) belongs to C1, so that

  w(n+1) = w(n) + η(n) x(n);  but since η(n) = 1:  w(n+1) = w(n) + x(n)

Since w(0) = 0, we iteratively find

  w(n+1) = x(1) + x(2) + … + x(n)     (B1)

Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution w0 for which w0^T x(n) > 0 for the vectors x(1), …, x(n) belonging to the subset H1 (the subset of training vectors that belong to class C1).

Perceptron Convergence Proof
For a fixed solution w0, we may then define a positive number α as

  α = min_{x(n) ∈ H1} w0^T x(n)     (B2)

Hence equation (B1) above implies

  w0^T w(n+1) = w0^T x(1) + w0^T x(2) + … + w0^T x(n)

Using equation B2 (since each term is greater than or equal to α), we have

  w0^T w(n+1) ≥ n α

Now we use the Cauchy-Schwarz inequality:

  ||w0||² ||w(n+1)||² ≥ (w0^T w(n+1))²,  i.e.  ||w(n+1)||² ≥ (w0^T w(n+1))² / ||w0||²  for ||w0||² > 0

Perceptron Convergence Proof
This implies that:

  ||w(n+1)||² ≥ n² α² / ||w0||²     (B3)

Now let's follow another development route (notice the index k):

  w(k+1) = w(k) + x(k),  for k = 1, …, n and x(k) ∈ H1

By taking the squared Euclidean norm of both sides, we get:

  ||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w^T(k) x(k)

But under the assumption that the perceptron incorrectly classifies an input vector x(k) belonging to the subset H1, we have w^T(k) x(k) < 0, hence

  ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²

Perceptron Convergence Proof
Or equivalently,

  ||w(k+1)||² − ||w(k)||² ≤ ||x(k)||²,  k = 1, …, n

Adding these inequalities for k = 1, …, n, and invoking the initial condition w(0) = 0, we get the following inequality:

  ||w(n+1)||² ≤ Σ_{k=1}^{n} ||x(k)||² ≤ n β     (B4)

where β is a positive number defined by

  β = max_{x(k) ∈ H1} ||x(k)||²

Eq. B4 states that the squared Euclidean norm of w(n+1) grows at most linearly with the number of iterations n.

Perceptron Convergence Proof
The second result (B4) is clearly in conflict with Eq. B3 for sufficiently large n.
• Indeed, we can state that n cannot be larger than some value n_max for which Eq. B3 and B4 are both satisfied with the equality sign. That is, n_max is the solution of

  n_max² α² / ||w0||² = n_max β

• Solving for n_max given a solution w0, we find that

  n_max = β ||w0||² / α²

We have thus proved that for η(n) = 1 for all n, and for w(0) = 0, given that a solution vector w0 exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most n_max iterations.

MORE READING

Suggested Reading
• S. Haykin, "Neural Networks", Prentice-Hall, 1999, chapter 3.
• L. Fausett, "Fundamentals of Neural Networks", Prentice-Hall, 1994, chapter 2.
• R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern Classification", 2nd edition, Wiley, 2001. Appendix A4, chapter 2, and chapter 5.
• J. M. Zurada, "Introduction to Artificial Neural Systems", West Publishing Company, 1992, chapter 3.

References:
These lecture notes were based on the references of the previous slide, and the following references:
1. Berlin Chen, lecture notes, National Taiwan Normal University, Taipei, Taiwan, ROC. http://140.122.185.120
2. Ehud Rivlin, IIT: http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
3. Jin Hyung Kim, KAIST Computer Science Dept., CS679 Neural Network lecture notes: http://ai.kaist.ac.kr/~jkim/cs679/detail.htm
4. Dr John A. Bullinaria, Course Material, Introduction to Neural Networks: http://www.cs.bham.ac.uk/~jxb/inn.html