Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

MohammedBennamoun 17,959 views 113 slides May 15, 2016


CS407 Neural Computation
Lecture 4:
Single Layer Perceptron (SLP)
Classifiers
Lecturer: A/Prof. M. Bennamoun

Outline
• What's a SLP and what's classification?
• Limitations of a single perceptron
• Foundations of classification and Bayes decision-making theory
• Discriminant functions, linear machines, and minimum-distance classification
• Training and classification using the discrete perceptron
• Single-layer continuous perceptron networks for linearly separable classifications
• Appendix A: Unconstrained optimization techniques
• Appendix B: Perceptron convergence proof
• Suggested reading and references

What is a perceptron and what is a Single Layer Perceptron (SLP)?

Perceptron
• The simplest form of a neural network
• Consists of a single neuron with adjustable synaptic weights and bias
• Performs pattern classification with only two classes
• Perceptron convergence theorem:
  - Patterns (vectors) are drawn from two linearly separable classes
  - During training, the perceptron algorithm converges and positions the decision surface in the form of a hyperplane between the two classes by adjusting the synaptic weights

What is a perceptron?
[Diagram: input signals x1, x2, …, xm; synaptic weights wk1, wk2, …, wkm; a summing junction Σ with bias bk; and an activation function φ(.) applied to vk]

  vk = Σ_{j=1}^{m} wkj xj + bk
  yk = φ(vk)

Discrete perceptron: φ(.) = sgn(.)
Continuous perceptron: φ(.) = S-shaped (sigmoid) function

Activation Function of a perceptron
[Figure: the signum (sign) function of vi, a hard limiter taking the values -1 and +1; and an S-shaped (sigmoid) curve rising smoothly from -1 to +1]

Discrete perceptron: φ(.) = sgn(.)
Continuous perceptron: φ(v) = S-shaped (sigmoid) function of v
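The model above can be sketched in a few lines of code. This is a minimal illustration only; the function name and the bipolar sigmoid chosen for the continuous case are my own assumptions, not taken from the slides.

```python
import math

def perceptron(x, w, b, discrete=True):
    """Single perceptron: weighted sum plus bias, then activation."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b   # summing junction
    if discrete:
        return 1 if v >= 0 else -1                  # signum activation
    return 2.0 / (1.0 + math.exp(-v)) - 1.0         # bipolar sigmoid in (-1, 1)

print(perceptron([1, 1], [1, 1], -1.5))   # discrete output is +1 or -1
```

The discrete variant hard-limits the net input, while the continuous variant returns a value strictly between -1 and +1, which is what makes gradient-based training possible later in the lecture.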

SLP Architecture
Single layer perceptron
Input layer
Output layer

Where are we heading? Different Non-Linearly Separable Problems
http://www.zsolutions.com/light.htm

Structure and the types of decision regions it can form:
• Single-layer: half plane bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary regions (complexity limited by the number of nodes)
[Figure: for each structure, sketches of the exclusive-OR problem, classes with meshed regions, and the most general region shapes, with the two classes labelled A and B]

Review from last lectures:

Implementing Logic Gates with Perceptrons
http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf
• We can use the perceptron to implement the basic logic gates (AND, OR and NOT).
• All we need to do is find the appropriate connection weights and neuron thresholds to produce the right outputs for each set of inputs.
• We saw how we can construct simple networks that perform NOT, AND, and OR.
• It is a well-known result from logic that we can construct any logical function from these three operations.
• The resulting networks, however, will usually have a much more complex architecture than a simple perceptron.
• We generally want to avoid decomposing complex problems into simple logic gates, and instead find the weights and thresholds that work directly in a perceptron architecture.

Implementation of Logical NOT, AND, and OR
• In each case we have inputs in_i and outputs out, and need to determine the weights and thresholds. It is easy to find solutions by inspection:

The Need to Find Weights Analytically
• Constructing simple networks by hand is one thing. But what about harder problems?
• How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error.
• Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.

Finding Weights Analytically for the AND Network
• We have two weights w1 and w2 and the threshold θ, and for each training pattern we need to satisfy out = sgn(w1 in1 + w2 in2 − θ).
• So the training data lead to four inequalities:
• It is easy to see that there are an infinite number of solutions. Similarly, there are an infinite number of solutions for the NOT and OR networks.
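One of those infinitely many solutions can be checked quickly in code. The particular values w1 = w2 = 1, θ = 1.5 and the 0/1 output convention below are my own choices for illustration, not taken from the slides:

```python
def and_perceptron(in1, in2, w1=1.0, w2=1.0, theta=1.5):
    """Threshold unit: fires (1) iff the weighted sum exceeds theta."""
    return 1 if w1 * in1 + w2 * in2 > theta else 0

# The four training patterns of the AND gate and their target outputs.
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(all(and_perceptron(a, b) == target for (a, b), target in patterns))
```

Any weights satisfying the four inequalities work equally well, which is exactly the "infinite number of solutions" point made above.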

Limitations of Simple Perceptrons
• We can follow the same procedure for the XOR network:
• Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution. We need more complex networks, e.g. ones that combine together many simple networks, or use different activation/thresholding/transfer functions.
• It then becomes much more difficult to determine all the weights and thresholds by hand.
• These weights are instead adapted using learning rules. Hence the need to consider learning rules (see the previous lecture), and more complex architectures.
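The impossibility can also be checked numerically. This sketch (mine, not from the slides) scans a grid of weights and thresholds and finds that none of them implements XOR with a single threshold unit:

```python
import itertools

def solves_xor(w1, w2, theta):
    """True if a single threshold unit with these parameters computes XOR."""
    xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    return all((1 if w1 * a + w2 * b > theta else 0) == out
               for (a, b), out in xor.items())

grid = [v / 2.0 for v in range(-8, 9)]   # candidate values -4.0 .. 4.0
found = any(solves_xor(w1, w2, t)
            for w1, w2, t in itertools.product(grid, repeat=3))
print(found)
```

Because XOR is not linearly separable, the search fails for any grid, however fine; this is the numerical counterpart of the incompatible inequalities above.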

E.g. Decision Surface of a Perceptron
[Figure: left, + and - points in the (x1, x2) plane that no straight line can separate: non-linearly separable. Right, two clusters of + and - points split by a straight line: linearly separable.]

The perceptron is able to represent some useful functions, but functions that are not linearly separable (e.g. XOR) are not representable.

What is classification?

Classification?
http://140.122.185.120
• Pattern classification/recognition: assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories).
• The block diagram of the recognition and classification system:

Classification: an example
• Automate the process of sorting incoming fish on a conveyor belt according to species (salmon or sea bass):
  - Set up a camera
  - Take some sample images
  - Note the physical differences between the two types of fish: length, lightness, width, number and shape of the fins ("sanfirim"), position of the mouth
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
Duda & Hart, Chapter 1

Classification: an example…
Cost of misclassification depends on the application. Is it better to misclassify salmon as bass or vice versa?
• Put salmon in a can of bass: lose profit
• Put bass in a can of salmon: lose customers
• There is a cost associated with our decision.
• Make a decision to minimize a given cost.

Feature Extraction:
• Problem and domain dependent
• Requires knowledge of the domain
• A good feature extractor would make the job of the classifier trivial.

Bayesian decision theory

Bayesian Decision Theory
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
Duda & Hart, Chapter 2
• Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
  - Decision making when all the probabilistic information is known.
  - For given probabilities the decision is optimal.
  - When new information is added, it is assimilated in optimal fashion for the improvement of decisions.

Bayesian Decision Theory …
• Fish example: each fish is in one of 2 states: sea bass or salmon.
• Let ω denote the state of nature, with
  - ω = ω1 for sea bass
  - ω = ω2 for salmon

Bayesian Decision Theory …
• The state of nature is unpredictable: ω is a variable that must be described probabilistically.
• If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
• Define:
  - P(ω1): a priori probability that the next fish is sea bass
  - P(ω2): a priori probability that the next fish is salmon

Bayesian Decision Theory …
• If other types of fish are irrelevant: P(ω1) + P(ω2) = 1.
• Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …).
• Simple decision rule: make a decision without seeing the fish:
  - Decide ω1 if P(ω1) > P(ω2); decide ω2 otherwise.
  - OK if deciding for one fish.
  - If there are several fish, all are assigned to the same class.

Bayesian Decision Theory ...
• In general, we will have some features and more information.
• Feature: lightness measurement = x
  - Different fish yield different lightness readings (x is a random variable)

Bayesian Decision Theory ….
• Define p(x|ω1) = class-conditional probability density: the probability density function for x given that the state of nature is ω1.
• The difference between p(x|ω1) and p(x|ω2) describes the difference in lightness between sea bass and salmon.

Class-conditional probability density: p(x|ω)
[Figure: hypothetical class-conditional probability density functions; the density functions are normalized (the area under each curve is 1.0)]


Bayesian Decision Theory ...
• Suppose that we know the prior probabilities P(ω1) and P(ω2), and the conditional densities p(x|ω1) and p(x|ω2).
• We measure the lightness of a fish = x.
• What is the category of the fish?

Bayes Formula
• Given:
  - Prior probabilities P(ωj)
  - Conditional probabilities p(x|ωj)
  - Measurement of a particular item: feature value x
• Bayes formula (from p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)):

  P(ωj|x) = p(x|ωj) P(ωj) / p(x)

  where p(x) = Σ_j p(x|ωj) P(ωj), so that Σ_j P(ωj|x) = 1.

• In words: Posterior = (Likelihood × Prior) / Evidence

Bayes' formula ...
• p(x|ωj) is called the likelihood of ωj with respect to x (the ωj category for which p(x|ωj) is large is more "likely" to be the true category).
• p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.
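A small numeric sketch of the formula; the likelihood values at a particular x are invented for illustration, while the priors 2/3 and 1/3 follow the example used in the figure below:

```python
def posteriors(likelihoods, priors):
    """Bayes formula: posterior_j = likelihood_j * prior_j / evidence."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical densities p(x|w1)=0.6, p(x|w2)=0.2 at some x; priors 2/3, 1/3.
post = posteriors([0.6, 0.2], [2 / 3, 1 / 3])
print(post)
```

Whatever the likelihoods, dividing by the evidence forces the posteriors to sum to 1, which is exactly the role of p(x) described above.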

Posterior Probability
[Figure: posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3; at every x the posteriors sum to 1.]

Error

  P(error|x) = P(ω1|x)  if we decide ω2
  P(error|x) = P(ω2|x)  if we decide ω1

For a given x, we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x), and ω2 otherwise.

Bayes' Decision Rule (minimizes the probability of error)

  ω1: if P(ω1|x) > P(ω2|x)
  ω2: otherwise

or, equivalently (using Bayes formula),

  ω1: if p(x|ω1) P(ω1) > p(x|ω2) P(ω2)
  ω2: otherwise

and the probability of error is

  P(error|x) = min [ P(ω1|x), P(ω2|x) ]

Likelihood-ratio form: decide ω1 if

  p(x|ω1) / p(x|ω2) > P(ω2) / P(ω1)

i.e. the likelihood ratio is compared against a threshold that depends only on the priors.
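The likelihood-ratio rule can be sketched with Gaussian class-conditional densities; the means, common variance, and priors below are my own illustrative choices, not values from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density, used here as a class-conditional likelihood."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_decide(x, mu1=2.0, mu2=4.0, sigma=1.0, p1=2 / 3, p2=1 / 3):
    """Decide class 1 iff the likelihood ratio exceeds the prior ratio."""
    ratio = gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu2, sigma)
    return 1 if ratio > p2 / p1 else 2

print(bayes_decide(1.5), bayes_decide(4.5))
```

Measurements near μ1 favor class 1 and measurements near μ2 favor class 2, with the prior ratio shifting the threshold between them.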

Decision Boundaries
• Classification as division of the feature space into non-overlapping regions X1, …, XK, such that x ∈ Xk means x is assigned to ωk.
• Boundaries between these regions are known as decision surfaces or decision boundaries.

Optimum decision boundaries
• Criterion: minimize misclassification, i.e. maximize correct classification.
• Classify x ∈ Xk if

  p(x|ωk) P(ωk) > p(x|ωj) P(ωj)  for all j ≠ k

  i.e. P(ωk|x) > P(ωj|x): choose the class of maximum posterior probability.
• The probability of correct classification is

  P(correct) = Σ_{k=1}^{R} P(x ∈ Xk, ωk) = Σ_{k=1}^{R} ∫_{Xk} p(x|ωk) P(ωk) dx    (here R = 2)

Discriminant functions
• Discriminant functions determine classification by comparison of their values: classify x ∈ Xk if gk(x) > gj(x) for all j ≠ k.
• Optimum classification is based on the posterior probability: gk(x) = P(ωk|x).
• Any monotone function g may be applied without changing the decision boundaries, e.g. gk(x) = g(P(ωk|x)), in particular gk(x) = ln(P(ωk|x)).

The Two-Category Case
• Use 2 discriminant functions g1 and g2, and assign x to ω1 if g1 > g2.
• Alternative: define a single discriminant function g(x) = g1(x) - g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.

Two-category case

  g(x) = P(ω1|x) − P(ω2|x)

  g(x) = ln [ p(x|ω1) / p(x|ω2) ] + ln [ P(ω1) / P(ω2) ]

Summary
• Bayes approach:
  - Estimate the class-conditional probability density
  - Combine with the prior class probability
  - Determine the posterior class probability
  - Derive the decision boundaries
• Alternate approach, implemented by neural networks:
  - Estimate the posterior probability directly, i.e. determine the decision boundaries directly

DISCRIMINANT FUNCTIONS

Discriminant Functions
http://140.122.185.120
• Determine the membership of x in a category by comparing R discriminant functions g1(x), g2(x), …, gR(x).
• x is within the region Xk if gk(x) has the largest value.
Do not mix up: n = dimension of each input vector (dimension of the feature space); P = number of input vectors; R = number of classes.

Discriminant Functions…
[Figures spanning several slides: examples of discriminant functions and the decision regions they induce]

Linear Machine and Minimum Distance Classification
• Find the linear-form discriminant function for two-class classification when the class prototypes are known.
• Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes.

Linear Machine and Minimum Distance Classification… (dichotomizer)
• The dichotomizer's discriminant function g(x), with class prototype centers x1 and x2:

  g(x) = (x1 − x2)^t x + ½ (x2^t x2 − x1^t x1)

Linear Machine and Minimum Distance Classification… (multiclass classification)
• The linear-form discriminant functions for multiclass classification.
• There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes (i.e. classes next to or touching another).

Linear Machine and Minimum Distance Classification… (multiclass classification)
• Linear machine or minimum-distance classifier:
  - Assume the class prototypes are known for all classes.
  - Euclidean distance between the input pattern x and the center of class i, xi: ||x − xi||; minimizing it is equivalent to maximizing the linear discriminant gi(x) = xi^t x − ½ xi^t xi.

Linear Machine and Minimum Distance Classification… (multiclass classification)
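A minimum-distance classifier of this kind can be sketched directly; the three class centers below are chosen for illustration (they follow the style of the slides' worked example, but treat them as placeholders):

```python
def classify_min_distance(x, prototypes):
    """Assign x to the class whose prototype (class center) is nearest in
    Euclidean distance; equivalent to maximizing the linear discriminant
    g_i(x) = x_i^t x - 0.5 * x_i^t x_i."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda i: sq_dist(x, prototypes[i]))

centers = {1: (10, 2), 2: (2, -5), 3: (-5, 5)}   # hypothetical class centers
print(classify_min_distance((9, 1), centers))     # nearest center wins
```

Because the squared distance differs from the linear discriminant only by the class-independent term x^t x, comparing distances and comparing the gi(x) give identical decisions.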

Linear Machine and Minimum Distance Classification…
Note: to find S12 we need to compute (g1 - g2).
P1, P2, P3 are the centers of gravity of the prototype points, and we need to design a minimum-distance classifier. Using the formulas from the previous slide, we get the weights wi.

Linear Machine and Minimum Distance Classification…
• If R linear discriminant functions exist for a set of patterns such that

  gi(x) > gj(x)   for x in Class i;  i, j = 1, 2, …, R;  j ≠ i

• then the classes are linearly separable.

Linear Machine and Minimum Distance Classification… Example:

Linear Machine and Minimum Distance Classification… Example…

Linear Machine and Minimum Distance Classification…
• Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if a priori information about the sets of patterns and their class membership is known.
• In the next section (discrete perceptron) we will examine neural networks that derive their weights during the learning cycle.

Linear Machine and Minimum Distance Classification… •The example of linearly non-separable patterns

Linear Machine and Minimum Distance Classification…
Input space (x), image space (o):

  o1 = sgn(x1 + x2 + 1)
  o2 = sgn(−x1 − x2 + 1)

[Figure: the patterns plotted in the input space (x1, x2) and in the image space (o1, o2); the truth table lists the ±1 pattern components and the resulting o1, o2 values. Two of the inputs map to the same point (1, 1) in the image space.]

The Discrete Perceptron

Discrete Perceptron Training Algorithm
• So far, we have shown that the coefficients of linear discriminant functions, called weights, can be determined based on a priori information about sets of patterns and their class membership.
• In what follows, we will begin to examine neural network classifiers that derive their weights during the learning cycle.
• The sample pattern vectors x1, x2, …, xp, called the training sequence, are presented to the machine along with the correct response.

Discrete Perceptron Training Algorithm - Geometrical Representations
http://140.122.185.120
Zurada, Chapter 3
[Figure: decision surfaces in the weight space; each intersects the origin (the point w = 0); 5 prototype patterns in this case: y1, y2, …, y5]
If the dimension of the augmented pattern vector is > 3, our powers of visualization are no longer of assistance. In this case, the only recourse is to use the analytical approach.

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Devise an analytic approach based on the geometrical representations, e.g. the decision surface for the training pattern y1:
  - If y1 is in Class 1 but was misclassified: w' = w + c y1
  - If y1 is in Class 2 but was misclassified: w' = w − c y1
• c (> 0) is the correction increment (it is two times the learning constant ρ introduced before); c controls the size of the adjustment.
• In the weight space, the gradient gives the direction of steepest increase (see the previous slide), and the correction is made in the negative gradient direction.

Discrete Perceptron Training Algorithm - Geometrical Representations…
Note 1: p = distance, so p > 0.
Note 2: c is not constant and depends on the current training pattern, as expressed by the equation above: choosing the correction so that the new weight reaches the pattern hyperplane gives

  c = w^t y / (y^t y)

Discrete Perceptron Training Algorithm - Geometrical Representations…
• For the fixed correction rule, c = constant: the correction of the weights is always the same fixed portion of the current training vector.
  - The weights can be initialised at any value.
• For the dynamic correction rule, c depends on the distance from the weight vector to the decision surface in the weight space. Hence:
  - The initial weight should be different from 0 (if w1 = 0, then c y = 0 and w' = w1 + c y = 0, therefore no adjustments are possible).
Here y is the current input pattern and w1 is the current weight.

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Dynamic correction rule: using the value of c from the previous slide as a reference, we devise an adjustment technique which depends on the length w2 − w1.
Note: λ is the ratio of the distance between the old weight vector w1 and the new w2, to the distance from w1 to the pattern hyperplane.
  - λ = 2: symmetrical reflection w.r.t. the decision plane
  - λ = 0: no weight adjustment

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Example:

  Class 1: x1 = 1, x3 = 3, with d1 = d3 = 1
  Class 2: x2 = −0.5, x4 = −2, with d2 = d4 = −1

• The augmented input vectors are:

  y1 = [1 1]^t,  y2 = [−0.5 1]^t,  y3 = [3 1]^t,  y4 = [−2 1]^t

• The decision lines w^t yi = 0, for i = 1, 2, 3, 4, are sketched on the augmented weight space as follows:

Discrete Perceptron Training Algorithm - Geometrical Representations…
[Figure: the four decision lines in the augmented weight space]

Discrete Perceptron Training Algorithm - Geometrical Representations…
For c = 1 and w1 = [−2.5 1.75]^t, the weight training at each step can be summarized as

  w^{k+1} = w^k ± c y^k,  i.e.  Δw^k = (c/2) [dk − sgn(w^{k t} y^k)] y^k

We obtain the following outputs and weight updates:
• Step 1: Pattern y1 is input

  o1 = sgn([−2.5 1.75] [1 1]^t) = −1 ≠ d1,
  so  w2 = w1 + y1 = [−1.5 2.75]^t

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Step 2: Pattern y2 is input

  o2 = sgn([−1.5 2.75] [−0.5 1]^t) = 1 ≠ d2,
  so  w3 = w2 − y2 = [−1 1.75]^t

• Step 3: Pattern y3 is input

  o3 = sgn([−1 1.75] [3 1]^t) = −1 ≠ d3,
  so  w4 = w3 + y3 = [2 2.75]^t

Discrete Perceptron Training Algorithm - Geometrical Representations…
• Since we have no evidence of correct classification of weight w4, the training set consisting of the ordered sequence of patterns y1, y2, y3, y4 needs to be recycled. We thus have y5 = y1, y6 = y2, etc. (the superscript is used to denote the training step number).
• Steps 4, 5: w6 = w5 = w4 (no misclassification, thus no weight adjustments).
• You can check that the adjustments in steps 6 through 10 are as follows:

  w7 = [2.5 1.75]^t,  w8 = w9 = w10 = w7,  w11 = [3 0.75]^t

w11 is in the solution area.
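The whole run above can be reproduced in a few lines. This sketch (mine) uses the fixed correction rule with c = 1, the example's augmented patterns, and the same initial weight; the update w + c·d·y on a misclassification is algebraically the same as adding (c/2)(d − o) y, since d − o = 2d whenever o ≠ d:

```python
def train_discrete_perceptron(patterns, targets, w, c=1.0, cycles=3):
    """Fixed-correction discrete perceptron training on augmented patterns."""
    for _ in range(cycles):
        for y, d in zip(patterns, targets):
            o = 1 if sum(wi * yi for wi, yi in zip(w, y)) >= 0 else -1
            if o != d:                                   # adjust only on error
                w = [wi + c * d * yi for wi, yi in zip(w, y)]
    return w

ys = [(1, 1), (-0.5, 1), (3, 1), (-2, 1)]   # augmented patterns y1..y4
ds = [1, -1, 1, -1]                          # desired responses
print(train_discrete_perceptron(ys, ds, [-2.5, 1.75]))  # ends in solution area
```

Running three cycles through the four patterns reproduces the hand computation, ending at [3, 0.75], after which every pattern is classified correctly and the weights stop changing.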

The Continuous Perceptron

Continuous Perceptron Training Algorithm
http://140.122.185.120
Zurada, Chapter 3
• Replace the TLU (Threshold Logic Unit) with the sigmoid activation function, for two reasons:
  - Gain finer control over the training procedure
  - Facilitate the differential characteristics, to enable computation of the error gradient (of the current error function)
The factor ½ does not affect the location of the error minimum.

Continuous Perceptron Training Algorithm…
• The new weights are obtained by moving in the direction of the negative gradient along the multidimensional error surface.
By definition of the steepest descent concept, each elementary move should be perpendicular to the current error contour.

Continuous Perceptron Training Algorithm…
• Define the error as the squared difference between the desired output and the actual output.
• Since net = w^t y, we have ∂(net)/∂wi = yi for i = 1, 2, …, n+1.
This yields the training rule of the continuous perceptron (equivalent to the delta training rule).

Continuous Perceptron Training Algorithm…

Continuous Perceptron Training Algorithm…
Same as the previous example (of the discrete perceptron), but with a continuous activation function and using the delta rule, on the same training pattern set.
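The delta rule for a single continuous (bipolar sigmoid) perceptron can be sketched as follows; the learning constant and iteration count are my own choices, not values from the slides:

```python
import math

def train_continuous_perceptron(patterns, targets, w, eta=0.5, epochs=1000):
    """Delta rule: w += eta * (d - o) * f'(net) * y, with the bipolar sigmoid
    o = 2/(1+exp(-net)) - 1, whose derivative is f'(net) = 0.5*(1 - o*o)."""
    for _ in range(epochs):
        for y, d in zip(patterns, targets):
            net = sum(wi * yi for wi, yi in zip(w, y))
            o = 2.0 / (1.0 + math.exp(-net)) - 1.0
            delta = eta * (d - o) * 0.5 * (1.0 - o * o)
            w = [wi + delta * yi for wi, yi in zip(w, y)]
    return w

ys = [(1, 1), (-0.5, 1), (3, 1), (-2, 1)]   # same augmented patterns as before
ds = [1, -1, 1, -1]
w = train_continuous_perceptron(ys, ds, [-2.5, 1.75])
print(w)
```

Unlike the discrete rule, every presentation adjusts the weights a little, by an amount scaled with both the output error and the sigmoid's slope; after training, the sign of w^t y matches the desired response for every pattern.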

Continuous Perceptron Training Algorithm…

  Ek = ½ [ dk − ( 2 / (1 + exp(−λ netk)) − 1 ) ]²

Setting λ = 1 and reducing the terms simplifies this expression to the following form:

  E1(w) = 2 / [1 + exp(w1 + w2)]²

and similarly

  E2(w) = 2 / [1 + exp(0.5 w1 − w2)]²
  E3(w) = 2 / [1 + exp(3 w1 + w2)]²
  E4(w) = 2 / [1 + exp(2 w1 − w2)]²

These error surfaces are as shown on the previous slide.

Continuous Perceptron Training Algorithm…
[Figure: the total error surface, with the weight trajectory descending to the minimum]

Multicategory SLP

Multi-category Single Layer Perceptron nets
• Treat the last fixed component of the input pattern vector as the neuron activation threshold: T = w_{n+1}, with y_{n+1} = −1 (it is irrelevant whether it is equal to +1 or −1).

Multi-category Single layer Perceptron nets…
• R-category linear classifier using R discrete bipolar perceptrons.
• Goal: the i-th TLU response of +1 is indicative of class i, and all other TLUs respond with −1.
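The one-perceptron-per-class scheme can be sketched directly; the weight rows below are placeholders I chose for illustration, not trained values from the example:

```python
def multicategory_response(x, W):
    """R discrete bipolar perceptrons, one weight vector per class.
    Returns a list of +1/-1 responses, ideally +1 only at the true class."""
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1 for w in W]

# Hypothetical weight rows for a 3-class classifier on augmented 3-vectors.
W = [(1.0, 0.0, -0.5), (-1.0, 1.0, -0.5), (0.0, -1.0, -0.5)]
print(multicategory_response((2.0, 0.0, 1.0), W))
```

A response vector with exactly one +1 identifies a class; anything else (e.g. two +1s, or all −1s) is an indecisive response of the kind discussed next.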

Multi-category Single layer Perceptron nets…
• Example 3.5: the response should be [−1 −1 1]^t.
• Indecision regions = regions where no class membership of an input pattern can be uniquely determined based on the response of the classifier (patterns in shaded areas are not assigned any reasonable classification, e.g. point Q, for which o = [1 1 −1]^t, an indecisive response). However, no patterns such as Q have been used for training in the example.

Multi-category Single layer Perceptron nets…
For c = 1 and initial weights w1^1 = [1 −2 0]^t, w2^1 = [0 −1 2]^t, w3^1 = [1 3 −1]^t:

• Step 1: Pattern y1 = [10 2 −1]^t is input:

  sgn([1 −2 0] y1) = 1,  sgn([0 −1 2] y1) = −1,  sgn([1 3 −1] y1) = 1*

Since the only incorrect response is provided by TLU3, we have

  w1^2 = w1^1,  w2^2 = w2^1,  w3^2 = w3^1 − y1 = [−9 1 0]^t

Multi-category Single layer Perceptron nets…
• Step 2: Pattern y2 = [2 −5 −1]^t is input:

  sgn([1 −2 0] y2) = 1*,  sgn([0 −1 2] y2) = 1,  sgn([−9 1 0] y2) = −1

This time only TLU1 responds incorrectly, so

  w1^3 = w1^2 − y2 = [−1 3 1]^t,  w2^3 = w2^2,  w3^3 = w3^2

Multi-category Single layer Perceptron nets…
• Step 3: Pattern y3 = [−5 5 −1]^t is input:

  sgn([−1 3 1] y3) = 1*,  sgn([0 −1 2] y3) = −1,  sgn([−9 1 0] y3) = 1

One can verify that the only adjusted weights from now on are those of TLU1:

  w1^4 = w1^3 − y3 = [4 −2 2]^t,  w2^4 = w2^3,  w3^4 = w3^3

During the second cycle:

  w1^5 = w1^4,  w1^6 = w1^5 − y2 = [2 3 3]^t,  w1^7 = w1^6 − y3 = [7 −2 4]^t,
  w1^8 = w1^7,  w1^9 = w1^8 − y2 = [5 3 5]^t

Multi-category Single layer Perceptron nets…

• R-category linear classifier using R continuous bipolar perceptrons

Comparison between Perceptron and Bayes' Classifier
• The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes classifier can work on nonseparable patterns.
• The Bayes classifier minimizes the probability of misclassification, which is independent of the underlying distribution.
• The Bayes classifier is a linear classifier on the assumption of Gaussianity.
• The perceptron is non-parametric, while the Bayes classifier is parametric (its derivation is contingent on the assumption of the underlying distributions).
• The perceptron is adaptive and simple to implement; the Bayes classifier could be made adaptive too, but at the expense of increased storage and more complex computations.

APPENDIX A
Unconstrained Optimization Techniques

Unconstrained Optimization Techniques
http://ai.kaist.ac.kr/~jkim/cs679/
Haykin, Chapter 3
• Cost function E(w):
  - continuously differentiable
  - a measure of how to choose w of an adaptive filtering algorithm so that it behaves in an optimum manner
• We want to find an optimal solution w* that minimizes E(w), i.e. ∇E(w*) = 0.
• Local iterative descent: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), …, such that the cost function E(w) is reduced at each iteration of the algorithm, as shown by E(w(n+1)) < E(w(n)).
• Methods: steepest descent, Newton's, Gauss-Newton's.

Method of Steepest Descent
• Here the successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to grad(E(w)):

  w(n+1) = w(n) − a g(n)

  where a is a small positive constant called the step size or learning-rate parameter, and g(n) = grad(E(w(n))).
• The method of steepest descent converges to the optimal solution w* slowly.
• The learning-rate parameter a has a profound influence on the convergence behavior: overdamped, underdamped, or even unstable (diverges).
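The iteration can be sketched on a toy quadratic cost; the cost function, its gradient, and the step size below are my own illustrative choices:

```python
def steepest_descent(grad, w, a=0.1, steps=100):
    """w(n+1) = w(n) - a * g(n), with g the gradient of the cost at w(n)."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - a * gi for wi, gi in zip(w, g)]
    return w

# Toy cost E(w) = (w1 - 3)^2 + (w2 + 1)^2, with gradient 2*(w - w*).
grad = lambda w: [2 * (w[0] - 3.0), 2 * (w[1] + 1.0)]
w = steepest_descent(grad, [0.0, 0.0])
print(w)  # approaches the minimum at (3, -1)
```

With a = 0.1 the distance to the minimum shrinks by a constant factor each step (the overdamped case); too large an a would make the iterates overshoot or diverge, as noted above.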

Newton's Method
• Using a second-order Taylor series expansion of the cost function around the point w(n):

  ΔE(w(n)) = E(w(n+1)) − E(w(n)) ≈ g^T(n) Δw(n) + ½ Δw^T(n) H(n) Δw(n)

  where Δw(n) = w(n+1) − w(n), and H(n) is the Hessian matrix of E(n).
• We want the Δw*(n) that minimizes ΔE(w(n)), so we differentiate ΔE(w(n)) with respect to Δw(n):

  g(n) + H(n) Δw*(n) = 0,  so  Δw*(n) = −H⁻¹(n) g(n)

Newton's Method…
Finally,

  w(n+1) = w(n) + Δw(n) = w(n) − H⁻¹(n) g(n)

• Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior.
• The Hessian H(n) has to be a positive definite matrix for all n.

Gauss-Newton Method
• The Gauss-Newton method is applicable to a cost function expressed as a sum of error squares:

  E(w) = ½ Σ_{i=1}^{n} e²(i)

• Because the error signal e(i) is a function of w, we linearize the dependence of e(i) on w by writing

  e'(i, w) = e(i) + [∂e(i)/∂w]^T (w − w(n))

• Equivalently, by using matrix notation we may write

  e'(n, w) = e(n) + J(n) (w − w(n))

Gauss-Newton Method
• where J(n) is the n-by-m Jacobian matrix of e(n), with entries ∂e(i)/∂wj.
• We want the updated weight vector w(n+1) defined by

  w(n+1) = arg min_w { ½ ||e'(n, w)||² }

• A simple algebraic calculation gives

  ½ ||e'(n, w)||² = ½ ||e(n)||² + e^T(n) J(n) (w − w(n)) + ½ (w − w(n))^T J^T(n) J(n) (w − w(n))

• Now differentiate this expression with respect to w and set the result to 0.

Gauss-Newton Method…
We obtain

  J^T(n) e(n) + J^T(n) J(n) (w − w(n)) = 0

Thus we get

  w(n+1) = w(n) − (J^T(n) J(n))⁻¹ J^T(n) e(n)

• To guard against the possibility that the matrix product J^T(n) J(n) is singular, the customary practice is to use

  w(n+1) = w(n) − (J^T(n) J(n) + δI)⁻¹ J^T(n) e(n)

  where δ is a small positive constant. This modification's effect is progressively reduced as the number of iterations n is increased.

Linear Least-Squares Filter
• The single neuron around which it is built is linear.
• The cost function consists of the sum of error squares.
• Using y(i) = x^T(i) w and e(i) = d(i) − y(i), the error vector is

  e(n) = d(n) − X(n) w(n)

• Differentiating with respect to w(n) gives, correspondingly,

  ∇e(n) = −X^T(n),  so  J(n) = −X(n)

• From the Gauss-Newton method (eq. 3.22):

  w(n+1) = (X^T(n) X(n))⁻¹ X^T(n) d(n) = X⁺(n) d(n)

LMS Algorithm
• Based on the use of instantaneous values for the cost function: E(w) = ½ e²(n).
• Differentiating with respect to w: ∂E(w)/∂w = e(n) ∂e(n)/∂w.
• The error signal in the LMS algorithm is e(n) = d(n) − x^T(n) w(n); hence ∂e(n)/∂w(n) = −x(n), so

  ∂E(w)/∂w(n) = −x(n) e(n)

LMS Algorithm …
• Using ĝ(n) = −x(n) e(n) as an estimate for the gradient vector, and substituting it into the steepest descent method, the LMS algorithm follows:

  ŵ(n+1) = ŵ(n) + η x(n) e(n)

  where η is the learning-rate parameter.
• The inverse of η is a measure of the memory of the LMS algorithm: when η is small, the adaptive process progresses slowly, more of the past data are remembered, and a more accurate filtering action results.
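The update can be sketched on a toy system-identification task; the input samples, true weights, and η below are invented for illustration:

```python
def lms_step(w, x, d, eta=0.05):
    """One LMS update: e = d - x.w ; w += eta * e * x."""
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * e * xi for wi, xi in zip(w, x)], e

# Identify a 2-tap filter with true weights (0.5, -0.3) from noiseless data.
true_w = (0.5, -0.3)
w = [0.0, 0.0]
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)] * 200
for x in samples:
    d = sum(ti * xi for ti, xi in zip(true_w, x))
    w, _ = lms_step(w, x, d)
print(w)  # drifts toward (0.5, -0.3)
```

Each sample nudges the weights using only the instantaneous error, which is why the trajectory is random rather than the smooth descent path of the previous methods.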

LMS Characteristics
• The LMS algorithm produces an estimate ŵ(n) of the weight vector w(n), sacrificing a distinctive feature:
  - Steepest descent algorithm: follows a well-defined trajectory
  - LMS algorithm: follows a random trajectory; as the number of iterations goes to infinity, ŵ(n) performs a random walk
• But importantly, the LMS algorithm does not require knowledge of the statistics of the environment.

Convergence Consideration
• Two distinct quantities, η and the input vector x(n), determine the convergence: the user supplies η, and the selection of η is important for the LMS algorithm to converge.
• Convergence of the mean: E[ŵ(n)] → w0 as n → ∞. This is not a practical criterion.
• Convergence in the mean square: E[e²(n)] → constant as n → ∞.
• Convergence condition for the LMS algorithm in the mean square:

  0 < η < 2 / (sum of mean-square values of the sensor inputs)

APPENDIX B
Perceptron Convergence Proof

Perceptron Convergence Proof
Haykin, Chapter 3
Consider the following perceptron:

  v(n) = Σ_{i=0}^{m} wi(n) xi(n) = w^T(n) x(n)

  w^T x > 0  for every input vector x belonging to class C1
  w^T x ≤ 0  for every input vector x belonging to class C2

Perceptron Convergence Proof…
• The algorithm for the weight adjustment of the perceptron:
  - If x(n) is correctly classified, no adjustments to w:

    w(n+1) = w(n)  if  w^T(n) x(n) > 0 and x(n) belongs to class C1
    w(n+1) = w(n)  if  w^T(n) x(n) ≤ 0 and x(n) belongs to class C2

  - Otherwise:

    w(n+1) = w(n) − η(n) x(n)  if  w^T(n) x(n) > 0 and x(n) belongs to class C2
    w(n+1) = w(n) + η(n) x(n)  if  w^T(n) x(n) ≤ 0 and x(n) belongs to class C1

  - The learning-rate parameter η(n) controls the adjustment applied to the weight vector.

Perceptron Convergence Proof
For η(n) = 1 and w(0) = 0, suppose the perceptron incorrectly classifies the vectors x(1), x(2), …, such that w^T(n) x(n) ≤ 0 while x(n) belongs to C1, so that

  w(n+1) = w(n) + η(n) x(n);  but since η(n) = 1:  w(n+1) = w(n) + x(n)

Since w(0) = 0, we iteratively find

  w(n+1) = x(1) + x(2) + … + x(n)     (B1)

Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution w0 for which w0^T x(n) > 0 for the vectors x(1), …, x(n) belonging to the subset H1 (the subset of training vectors that belong to class C1).

Perceptron Convergence Proof
For a fixed solution w0, we may then define a positive number α as

  α = min_{x(n) ∈ H1} w0^T x(n)     (B2)

Hence equation (B1) above implies

  w0^T w(n+1) = w0^T x(1) + w0^T x(2) + … + w0^T x(n)

Using equation B2 (since each term is greater than or equal to α), we have

  w0^T w(n+1) ≥ n α

Now we use the Cauchy-Schwarz inequality:

  ||w0||² ||w(n+1)||² ≥ (w0^T w(n+1))²,  i.e.  ||w(n+1)||² ≥ (w0^T w(n+1))² / ||w0||²  for ||w0||² > 0

Perceptron Convergence Proof
This implies that:

  ||w(n+1)||² ≥ n² α² / ||w0||²     (B3)

Now let's follow another development route (notice the index k):

  w(k+1) = w(k) + x(k),  for k = 1, …, n and x(k) ∈ H1

By taking the squared Euclidean norm of both sides, we get:

  ||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w^T(k) x(k)

But under the assumption that the perceptron incorrectly classifies an input vector x(k) belonging to the subset H1, we have w^T(k) x(k) < 0, hence

  ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²

Perceptron Convergence Proof
Or equivalently,

  ||w(k+1)||² − ||w(k)||² ≤ ||x(k)||²,  k = 1, …, n

Adding these inequalities for k = 1, …, n, and invoking the initial condition w(0) = 0, we get the following inequality:

  ||w(n+1)||² ≤ Σ_{k=1}^{n} ||x(k)||² ≤ n β     (B4)

where β is a positive number defined by

  β = max_{x(k) ∈ H1} ||x(k)||²

Eq. B4 states that the squared Euclidean norm of w(n+1) grows at most linearly with the number of iterations n.

Perceptron Convergence Proof
The second result (B4) is clearly in conflict with Eq. B3 for sufficiently large n.
• Indeed, we can state that n cannot be larger than some value n_max for which Eq. B3 and B4 are both satisfied with the equality sign. That is, n_max is the solution of

  n_max² α² / ||w0||² = n_max β

• Solving for n_max given a solution w0, we find that

  n_max = β ||w0||² / α²

We have thus proved that for η(n) = 1 for all n, and for w(0) = 0, given that a solution vector w0 exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most n_max iterations.

MORE READING

Suggested Reading
• S. Haykin, "Neural Networks", Prentice-Hall, 1999, chapter 3.
• L. Fausett, "Fundamentals of Neural Networks", Prentice-Hall, 1994, chapter 2.
• R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern Classification", 2nd edition, Wiley, 2001. Appendix A4, chapter 2, and chapter 5.
• J. M. Zurada, "Introduction to Artificial Neural Systems", West Publishing Company, 1992, chapter 3.

References:
These lecture notes were based on the references of the previous slide, and the following references:
1. Berlin Chen, lecture notes, National Taiwan Normal University, Taipei, Taiwan, ROC. http://140.122.185.120
2. Ehud Rivlin, IIT: http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
3. Jin Hyung Kim, KAIST Computer Science Dept., CS679 Neural Network lecture notes: http://ai.kaist.ac.kr/~jkim/cs679/detail.htm
4. Dr John A. Bullinaria, Course Material, Introduction to Neural Networks: http://www.cs.bham.ac.uk/~jxb/inn.html