Artificial Neural Networks Lect3: Neural Network Learning rules

Mohammed Bennamoun, May 15, 2016


CS407 Neural Computation
Lecture 3: Neural Network Learning Rules
Lecturer: A/Prof. M. Bennamoun

Learning -- Definition
Learning is a process by which the free parameters of a NN are adapted through stimulation from the environment.

Sequence of Events
- The network is stimulated by an environment
- It undergoes changes in its free parameters
- It responds in a new way to the environment

Learning Algorithm
- Prescribed steps of a process that make a system learn
- Ways to adjust the synaptic weights of a neuron
- There is no unique learning algorithm; rather, a kit of tools

The Lecture Covers
- Five learning rules and learning paradigms
- Probabilistic and statistical aspects of learning

Review: Gradients and Derivatives; Gradient Descent Minimisation

Gradients and Derivatives
Differential calculus is the branch of mathematics concerned with computing gradients. Consider a function y = f(x). The gradient, or rate of change, of f(x) at a particular value of x, as we change x, can be approximated by Δy/Δx. Or we can write it exactly as

    ∂f/∂x = lim_{Δx→0} Δy/Δx

which is known as the partial derivative of f(x) with respect to x.

Examples of Computing Derivatives
Some simple examples should make this clearer. Other derivatives can be computed in the same way; some useful standard results are given below.
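For instance (standard calculus facts, supplied here since the slide's own list is not reproduced):

    d/dx (x^n)   = n x^(n-1)
    d/dx (e^x)   = e^x
    d/dx (ln x)  = 1/x
    d/dx (sin x) = cos x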

Gradient Descent Minimisation
Suppose we have a function f(x) and we want to change the value of x to minimise f(x). What we need to do depends on the derivative of f(x). There are three cases to consider:
- If ∂f/∂x > 0, then f(x) increases as x increases, so we should decrease x
- If ∂f/∂x < 0, then f(x) decreases as x increases, so we should increase x
- If ∂f/∂x = 0, then f(x) is at a maximum or minimum, so we should not change x
In summary, we can decrease f(x) by changing x by the amount:

    Δx = -η ∂f/∂x

where η is a small positive constant specifying how much we change x by, and the derivative ∂f/∂x tells us which direction to go in. If we repeatedly use this equation, f(x) will (assuming η is sufficiently small) keep descending towards its minimum, and hence this procedure is known as gradient descent minimisation.
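As an illustration, here is a minimal gradient-descent sketch in Python; the example function f(x) = (x - 3)^2, its derivative, the starting point, and the value of eta are illustrative choices, not from the slides.

    # Gradient descent: repeatedly apply x <- x - eta * df/dx.
    def gradient_descent(df_dx, x0, eta=0.1, steps=100):
        x = x0
        for _ in range(steps):
            x -= eta * df_dx(x)  # step against the gradient
        return x

    # For f(x) = (x - 3)**2 the derivative is df/dx = 2 * (x - 3),
    # so x descends towards the minimum at x = 3.
    print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0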

Types of Learning

Learning with a Teacher
- Supervised learning
- The teacher has knowledge of the environment to be learned
- Input and desired-output pairs are given as a training set
- Parameters are adjusted step-by-step, based on an error signal
- The desired response of the system is provided by the teacher, e.g., the distance ρ[d,o] as an error measure
- Estimate the negative error-gradient direction and reduce the error accordingly
- Modify the synaptic weights to perform a stochastic minimization of the error in multidimensional weight space
- Move toward a minimum point of the error surface, which may not be a global minimum
- Use the gradient of the error surface: the direction of steepest descent
- Good for pattern recognition and function approximation

Unsupervised Learning
- Self-organized learning
- The desired response is unknown, so no explicit error information can be used to improve network behavior
- E.g. finding the cluster boundaries of input patterns
- Suitable weight self-adaptation mechanisms have to be embedded in the trained network
- No external teacher or critic
- A task-independent measure of quality is required to learn
- Network parameters are optimized with respect to this measure
- The competitive learning rule is a case of unsupervised learning

Learning without a Teacher
- Reinforcement learning
- No teacher provides a direct (desired) response at each step
- Example: good/bad, win/lose
- (Block diagram: Environment, Critic, Learning System, with primary reinforcement and heuristic reinforcement signals.)

Terminology
- Training set: The ensemble of "inputs" used to train the system. For a supervised network, it is the ensemble of "input-desired response" pairs used to train the system.
- Validation set: The ensemble of samples used to validate the parameters used in the training (not to be confused with the test set, which assesses the performance of the classifier).
- Test set: The ensemble of "input-desired response" data used to verify the performance of a trained system. This data is not used for training.
- Training epoch: One cycle through the set of training patterns.
- Generalization: The ability of a NN to produce reasonable responses to input patterns that are similar, but not identical, to training patterns.

Terminology
- Asynchronous: A process in which weights or activations are updated one at a time, rather than all being updated simultaneously.
- Synchronous updates: All weights are adjusted at the same time.
- Inhibitory connection: A connection link between two neurons such that a signal sent over this link will reduce the activation of the neuron that receives the signal. This may result from the connection having a negative weight, or from the signal received being used to reduce the activation of a neuron by scaling the net input the neuron receives from other neurons.
- Activation: A node's level of activity; the result of applying the activation function to the net input to the node. Typically this is also the value the node transmits.

Review: Vectors -- Overview

Vectors -- A Brief Review
A 2-D vector v, with respect to Cartesian axes, has length

    ||v|| = √(v_1² + v_2²)

Inner Product -- A Brief Review

    v · w = v_1 w_1 + v_2 w_2 = Σ_i v_i w_i = ||v|| ||w|| cos(φ)

The projection of v onto w is given by:

    v_w = ||v|| cos(φ) = (v · w) / ||w||

Learning Rules (LR)

The General Learning Rule
- The weight adjustment is proportional to the product of the input x and the learning signal r
- c is a positive learning constant:

    Δw_i(t) = c · r[w_i(t), x(t), d_i(t)] · x(t)

    w_i(t+1) = w_i(t) + Δw_i(t) = w_i(t) + c · r[w_i(t), x(t), d_i(t)] · x(t)
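A minimal sketch of this general update in Python (NumPy). The learning signal is passed in as a function r(w, x, d); all names and constants here are illustrative, not from the slides.

    import numpy as np

    def general_update(w, x, d, r, c=0.1):
        """One step of the general rule: w(t+1) = w(t) + c * r[w, x, d] * x."""
        return w + c * r(w, x, d) * x

    # Example: with r = d - sgn(w.x) this template reduces to the
    # perceptron learning rule discussed below.
    w = np.zeros(3)
    w = general_update(w, x=np.array([1.0, 0.5, -1.0]), d=1.0,
                       r=lambda w, x, d: d - np.sign(w @ x))

Each learning rule in the remainder of the lecture corresponds to a particular choice of the learning signal r.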
Learning Rule 1: Error-Correction Learning Rule

LR1: Error-Correction Learning
- Error signal: e_k(n) = d_k(n) - y_k(n), where n denotes the time step
- The error signal activates a control mechanism for corrective adjustment of the synaptic weights
- Minimizing a cost function E(n), or index of performance:

    E(n) = (1/2) e_k²(n)

- E(n) is also called the instantaneous value of the error energy
- Step-by-step adjustment continues until the system reaches a steady state, i.e. the synaptic weights are stabilized
- Also called the delta rule, or the Widrow-Hoff rule

Error-Correction Learning…

    Δw_kj(n) = η e_k(n) x_j(n)

- η: the rate of learning, or learning-rate parameter

    w_kj(n+1) = w_kj(n) + Δw_kj(n)
    w_kj(n) = Z⁻¹[w_kj(n+1)]

- Z⁻¹ is the unit-delay operator
- The adjustment is proportional to the product of the error signal and the input signal
- Error-correction learning is local
- The learning rate η determines the stability or convergence
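A minimal error-correction sketch in Python (NumPy), applying Δw_kj(n) = η e_k(n) x_j(n) over a sequence of time steps for a single linear output neuron; the data layout and η are illustrative.

    import numpy as np

    def error_correction_train(X, d, eta=0.05, epochs=50):
        """e_k(n) = d_k(n) - y_k(n); then w <- w + eta * e_k * x (local update)."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):              # step-by-step until weights stabilize
            for x, d_k in zip(X, d):
                y_k = w @ x                  # neuron output at time step n
                w += eta * (d_k - y_k) * x   # proportional to error * input
        return w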

E.g. 1: Perceptron Learning Rule
- Supervised learning, only applicable for a binary neuron response (e.g. [-1,1])
- The learning signal is equal to: r = d_i - o_i, with o_i = sgn(w_iᵀx). E.g., in a classification task, the weight is adapted only when a classification error occurs
- The weight initialisation is random
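A sketch of the perceptron rule under the assumptions above (bipolar sign response, random weight initialisation); the constants are illustrative.

    import numpy as np

    def perceptron_train(X, d, c=0.1, epochs=20, seed=0):
        """Perceptron rule: w <- w + c * (d_i - sgn(w.x)) * x.
        The weights change only when a classification error occurs."""
        w = np.random.default_rng(seed).normal(scale=0.1, size=X.shape[1])
        for _ in range(epochs):
            for x, d_i in zip(X, d):
                o = 1.0 if w @ x >= 0 else -1.0  # bipolar response in [-1, 1]
                w += c * (d_i - o) * x           # zero update when o == d_i
        return w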

E.g. 2: Delta Learning Rule
- Supervised learning, only applicable for a continuous activation function
- The learning signal r is called delta and is defined as:

    r = [d_i - f(w_iᵀx)] f′(w_iᵀx)

- Derived by calculating the gradient vector, with respect to w_i, of the squared error
- The weight initialization is random
- Also called the continuous perceptron training rule
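A sketch of a single delta-rule step; tanh is used as an illustrative continuous (bipolar sigmoidal) activation, since the slides do not fix a particular f.

    import numpy as np

    def delta_step(w, x, d_i, c=0.1):
        """Delta rule: w <- w + c * [d_i - f(net)] * f'(net) * x, net = w.x.
        With f = tanh, the derivative is f'(net) = 1 - f(net)**2."""
        net = w @ x
        o = np.tanh(net)  # continuous activation f(net)
        return w + c * (d_i - o) * (1.0 - o**2) * x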

E.g. 3: Widrow-Hoff LR (Widrow, 1962)
- Supervised learning, independent of the activation function of the neuron
- Minimizes the squared error between the desired output value and the neuron's activation value, net = w_iᵀx
- Sometimes called the LMS (Least Mean Square) learning rule
- The learning signal r is:

    r = d_i - w_iᵀx

- Considered a special case of the delta learning rule when the activation function is the identity, f(net) = net

Learning Rule 2: Memory-Based Learning Rule

LR2: Memory-Based Learning
- In memory-based learning, all (or most) of the past experiences are explicitly stored in a large memory of correctly classified input-output examples

    {(x_i, d_i)}, i = 1, …, N

  where x_i denotes an input vector and d_i denotes the corresponding desired response.
- When classification of a test vector x_test (not seen before) is required, the algorithm responds by retrieving and analyzing the training data in a "local neighborhood" of x_test.

LR2: Memory-Based Learning
- All memory-based learning algorithms involve two essential ingredients (which make them different from each other):
  - The criterion used for defining the local neighborhood of x_test
  - The learning rule applied to the training examples in the local neighborhood of x_test

Nearest Neighbor Rule (NNR)
- The vector X′_N ∈ {X_1, X_2, …, X_N} is the nearest neighbor of X_test if

    min_i d(X_i, X_test) = d(X′_N, X_test)

- The class of X′_N is then reported as the class of X_test

LR2: Nearest Neighbor Rule (NNR)
- Cover and Hart (1967) assume:
  - The examples (x_i, d_i) are independent and identically distributed (iid), according to the joint pdf of the example (x, d)
  - The sample size N is infinitely large
- The rule works well if there is no feature or class noise
- As the number of training cases grows large, the error rate of 1-NN is at most 2 times the Bayes optimal rate
- ⇒ Half of the "classification information" in a training set of infinite size is contained in the nearest neighbor!

LR2: k-Nearest Neighbor Rule
- The k-nearest neighbor rule (a variant of the NNR):
  - Identify the k classified patterns that lie nearest to X_test, for some integer k
  - Assign X_test to the class that is most frequently represented among the k nearest neighbors of X_test
- KNN: find the k nearest neighbors of an object
- The radial-basis function network is a memory-based classifier

K Nearest Neighbors
- Data are represented as high-dimensional vectors
- KNN requires:
  - A distance metric
  - A choice of k
  - Potentially a choice of element weighting in the vectors
- Given a new example:
  - Compute distances to each known example
  - Choose the class of the most popular

K Nearest Neighbors (figure sequence)
- New item: compute distances; pick the k best distances; assign a class to the new example (see the sketch below)
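A minimal k-NN classifier in Python (NumPy) following exactly these steps; the Euclidean metric and k = 3 are illustrative choices.

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, d_train, x_test, k=3):
        """Compute distances to every stored example, pick the k nearest,
        and assign the class most frequently represented among them."""
        dists = np.linalg.norm(X_train - x_test, axis=1)  # distance metric
        nearest = np.argsort(dists)[:k]                   # k best distances
        return Counter(d_train[i] for i in nearest).most_common(1)[0][0]

With k = 1 this is the nearest neighbor rule (NNR) of the previous slides.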

Example: Image Search
- Query image: "Find the 10 most similar images for the query image"
- Images are represented as features (color histogram, texture moments, etc.)
- Similarity search is performed using these features

Other Applications
- Web-page search: "Find the 100 most similar pages for a given page"
  - A page is represented as a word-frequency vector
  - Similarity: vector distance
- GIS: "Find the 5 closest cities to Brisbane"

Learning Rule 3: Hebbian Learning Rule (D. Hebb)

LR3: Hebbian Learning
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" (Hebb, 1949).
In other words:
1. If two neurons on either side of a synapse (connection) are activated simultaneously (i.e. synchronously), then the strength of that synapse is selectively increased.
This rule is often supplemented by:
2. If two neurons on either side of a synapse are activated asynchronously, then that synapse is selectively weakened or eliminated, so that chance coincidences do not build up connection strengths.

LR3: Hebbian Learning
- A purely feedforward, unsupervised learning rule
- The learning signal is equal to the neuron's output
- The weights are initialised at small random values around w_i = 0 prior to learning
- If the cross product of output and input (or correlation) is positive, it results in an increase of the weight; otherwise the weight decreases
- It can be seen that the output is strengthened in turn for each input presented.

LR3: Hebbian Learning…
Therefore, frequent input patterns will have the most influence on the neuron's weight vector and will eventually produce the largest output.
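A minimal Hebbian sketch in Python (NumPy); the bipolar sign activation is an illustrative choice of f.

    import numpy as np

    def hebbian_step(w, x, c=0.01):
        """Hebbian rule: the learning signal is the neuron's own output o,
        so w <- w + c * o * x. The weight grows when output and input
        correlate positively and shrinks otherwise."""
        o = np.sign(w @ x)  # neuron output, f = sgn here
        return w + c * o * x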

LR3: Hebbian Learning…
- In some cases, the Hebbian rule needs to be modified to counteract unconstrained growth of the weight values, which takes place when excitations and responses consistently agree in sign.
- This corresponds to the Hebbian learning rule with saturation of the weights at a certain, preset level.

A single-layer network that learns a set of input-output training vectors with the Hebb rule is called a HEBB NET.

LR3: Hebbian Learning
- If the two neurons of a connection are activated:
  - simultaneously (synchronously), then the connection's strength is increased
  - asynchronously, then the strength is weakened or eliminated
- A Hebbian synapse is:
  - time dependent: it depends on the exact times of occurrence of the two signals
  - local: only locally available information is used
  - an interactive mechanism: learning is done by the interaction of two signals
  - a conjunctional or correlational mechanism: based on the co-occurrence of two signals
- Hebbian learning is found in the hippocampus (presynaptic & postsynaptic signals)

Special Case: Correlation LR
- Supervised learning, applicable for recording data in memory networks with binary response neurons
- The learning signal r is simply equal to the desired output d_i
- A special case of the Hebbian learning rule, with a binary activation function and with o_i = d_i
- The weights are initialized at small random values around w_i = 0 prior to learning (just like the Hebbian rule)

Learning Rule 4: Competitive Learning Rule = Winner-Take-All LR

LR4: Competitive Learning
- Unsupervised network training, applicable to an ensemble of neurons (e.g. a layer of p neurons), not to a single neuron
- The output neurons of the NN compete to become active
- Adapt the neuron m which has the maximum response to the input x
- Only a single neuron is active at any one time
  - a salient feature for pattern classification
- Neurons learn to specialize on ensembles of similar patterns; therefore, they become feature detectors

LR4: Competitive Learning…
- Basic elements:
  - A set of neurons that are all the same except for their synaptic weight distributions, and that therefore respond differently to a given set of input patterns
  - A mechanism for the neurons to compete to respond to a given input
  - The neuron that wins the competition is called the "winner-takes-all" neuron

LR4: Competitive NN…
- Feedforward connections are excitatory
- Lateral connections are inhibitory: lateral inhibition
- (Figure: a single layer of source input nodes connected to a single layer of output neurons)

LR4: Competitive Learning…
- Competitive learning rule: adapt the neuron m which has the maximum response to the input x
- Weights are typically initialised at random values, and their strengths are normalized during learning, e.g.

    Σ_j w_mj = 1   for all m

- If a neuron does not respond to a particular input, no learning takes place in it

LR4: Competitive Learning…
- x has some constant Euclidean length, and

    Σ_j w_mj² = 1   for all m

- The network performs clustering through competitive learning

LR4: Competitive Learning…
- What is required for the net to encode the training set is that the weight vectors become aligned with any clusters present in this set and that each cluster is represented by at least one node. Then, when a vector is presented to the net, there will be a node, or group of nodes, which responds maximally to the input, and which responds in this way only when this vector is shown at the input.
- If the net can learn a weight vector configuration like this, without being told explicitly of the existence of clusters at the input, then it is said to undergo a process of self-organised or unsupervised learning. This is to be contrasted with nets which were trained with the delta rule, for example, where a target vector or output had to be supplied.

LR4: Competitive Learning…
- In order to achieve this goal, the weight vectors must be rotated around the sphere so that they line up with the training set.
- The first thing to notice is that this may be achieved in a gradual and efficient way by moving the weight vector which is closest (in an angular sense) to the current input vector towards that vector slightly.
- The node k with the closest vector is that which gives the greatest input excitation v = w · x, since this is just the dot product of the weight and input vectors. The weight vector of node k may be aligned more closely with the input if a change is made according to

    Δw_mj = α (x_j - w_mj)

LR4: Winner-Take-All Learning…
- The winner neighbourhood is sometimes extended beyond the single winning neuron to include the neighbouring neurons, as in the sketch below.
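A sketch of one winner-take-all step under the assumptions above (the rows of W are the neurons' weight vectors, kept at unit Euclidean length; α is illustrative).

    import numpy as np

    def winner_take_all_step(W, x, alpha=0.1):
        """Only the winning neuron m (maximum response w_m . x) learns:
        Delta w_mj = alpha * (x_j - w_mj); w_m is then re-normalised."""
        m = int(np.argmax(W @ x))     # winner: maximum response to input x
        W[m] += alpha * (x - W[m])    # rotate the winner's weights toward x
        W[m] /= np.linalg.norm(W[m])  # keep sum_j w_mj^2 = 1
        return W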

Learning Rule 5: Boltzmann Learning Rule

LR5: Boltzmann Learning
- Rooted in statistical mechanics
- Boltzmann machine: a NN based on Boltzmann learning
- The neurons constitute a recurrent structure (see next slide)
- They are stochastic neurons, operating in a binary manner: "on" = +1 and "off" = -1
- There are visible neurons and hidden neurons
- Energy function of the machine (x_j = state of neuron j):

    E = -(1/2) Σ_j Σ_k (j ≠ k) w_kj x_k x_j

- j ≠ k means there is no self-feedback

Boltzmann Machine
Fig: Architecture of the Boltzmann machine. K is the number of visible neurons and L is the number of hidden neurons.

Boltzmann Machine Operation
- Choose a neuron at random, say k, then flip the state of that neuron from state x_k to state -x_k (a random perturbation) with probability

    P(x_k → -x_k) = 1 / (1 + exp(-ΔE_k / T))

  where ΔE_k is the energy change of the machine resulting from such a flip (from state x_k to state -x_k).
- If this rule is applied repeatedly, the machine reaches thermal equilibrium (note that T is a pseudo-temperature).
- Two modes of operation:
  - Clamped condition: the visible neurons are clamped onto specific states determined by the environment (i.e. under the influence of the training set).
  - Free-running condition: all neurons (visible and hidden) are allowed to operate freely (i.e. with no environmental input).
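A sketch of the flip probability alone, in Python; how ΔE_k is obtained from the weights is deferred to the later lecture, so here it is simply an input, and the values below are illustrative.

    import numpy as np

    def flip_probability(delta_E_k, T):
        """P(x_k -> -x_k) = 1 / (1 + exp(-delta_E_k / T)), with T the
        pseudo-temperature: large T makes flips nearly random (P -> 1/2)."""
        return 1.0 / (1.0 + np.exp(-delta_E_k / T))

    rng = np.random.default_rng(0)
    x_k = 1
    if rng.random() < flip_probability(delta_E_k=0.5, T=1.0):  # illustrative values
        x_k = -x_k  # accept the random perturbation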

Boltzmann Machine Operation…
- Such a network can be used for pattern completion.
- The goal of Boltzmann learning is to maximize the likelihood function (by gradient ascent):

    L(w) = log Π_{x_α ∈ ℑ} P(X_α = x_α) = Σ_{x_α ∈ ℑ} log P(X_α = x_α)

- ℑ denotes the set of training examples drawn from a pdf of interest
- x_α represents the state of the visible neurons
- x_β represents the state of the hidden neurons
- The set of synaptic weights is called a model of the environment if it leads to the same probability distribution of the states of the visible units

LR5: Boltzmann Learning Rule…
- Let ρ⁺_kj denote the correlation between the states of neurons j and k with the network in the clamped condition:

    ρ⁺_kj = Σ_{x_α, x_β} x_k x_j P(X_β = x_β | X_α = x_α)

- Let ρ⁻_kj denote the correlation between the states of neurons j and k with the network in the free-running condition:

    ρ⁻_kj = Σ_x x_k x_j P(X = x)

- The Boltzmann learning rule (Hinton and Sejnowski, 1986):

    Δw_kj = η (ρ⁺_kj - ρ⁻_kj),   j ≠ k

  where η is a learning-rate parameter.
- ρ⁺_kj and ρ⁻_kj range in value from -1 to +1.

Note: DON'T PANIC. The Boltzmann machine will be presented in detail in future lectures.
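Even so, the update itself is compact; a one-step sketch in Python (NumPy), where the correlation matrices are assumed to be given, since estimating them is deferred to that later lecture:

    import numpy as np

    def boltzmann_update(W, rho_plus, rho_minus, eta=0.1):
        """Hinton-Sejnowski rule: Delta w_kj = eta * (rho+_kj - rho-_kj),
        applied only for j != k (no self-feedback)."""
        dW = eta * (rho_plus - rho_minus)
        np.fill_diagonal(dW, 0.0)  # exclude the j = k terms
        return W + dW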

End of Learning Rules (LR)

Network Complexity
- No formal methods exist for determining the network architecture, e.g. the number of layers in a feedforward network, the number of nodes in each layer, etc.
- The next lectures will focus on specific networks.

Suggested Reading
- S. Haykin, "Neural Networks", Prentice-Hall, 1999: Chapter 2, and Section 11.7 of Chapter 11 (for Boltzmann learning).
- L. Fausett, "Fundamentals of Neural Networks", Prentice-Hall, 1994: Chapter 2, and Section 7.2.2 of Chapter 7 (for the Boltzmann machine).
- R.P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE Magazine on Acoustics, Signal and Speech Processing, April 1987: 4-22.
- B. Widrow, "Generalization and Information Storage in Networks of Adaline 'Neurons'", in Self-Organizing Systems, ed. M.C. Jovitz, G.T. Jacobi, G. Goldstein, Spartan Books, 1962: 435-461.

References
In addition to the references on the previous slide, the following were also used to prepare these lecture notes.
1. Berlin Chen, lecture notes, Normal University, Taipei, Taiwan, ROC. http://140.122.185.120
2. Jin Hyung Kim, KAIST Computer Science Dept., CS679 Neural Network lecture notes. http://ai.kaist.ac.kr/~jkim/cs679/detail.htm
3. Kevin Gurney, lecture notes, "Neural Nets", Univ. of Sheffield, UK. http://www.shef.ac.uk/psychology/gurney/notes/contents.html
4. Dr John A. Bullinaria, course material, Introduction to Neural Networks. http://www.cs.bham.ac.uk/~jxb/inn.html
5. Richard Caruana, lecture notes, Cornell Univ. http://courses.cs.cornell.edu/cs578/2002fa/
6. http://www.free-graphics.com/main.html
7. Rothrock-Ling, Wright State Univ., lecture notes. www.ie.psu.edu/Rothrock/hfe890Spr01/ANN_part1.ppt
8. L. Jin, N. Koudas, C. Li, "NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms". www.ics.uci.edu/~chenli/pub/NNH.ppt
9. Ajay Jain, UCSF. http://www.cgl.ucsf.edu/Outreach/bmi203/lecture_notes02/lecture7.pdf