Adaline and Madaline
Adaline: Adaptive Linear Neuron
Madaline: Multiple Adaline
2.1 Adaline (Bernard Widrow, Stanford Univ.)
The Adaline computes a linear combination of its inputs,
$y = \sum_{j=0}^{n} w_j x_j = \mathbf{w}^T\mathbf{x},$
where $x_0 = 1$ supplies the bias term and $w_0$ is the bias weight. The output is $d = f(y)$. During training, the error is fed back through a gain (adjustment) term to update the weights.
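A minimal sketch of this forward pass in Python/NumPy (the input and weight values are illustrative, not from the slides):

```python
import numpy as np

def adaline_output(w, x, f=lambda y: y):
    """Adaline forward pass: y = w^T x, output d = f(y).
    x includes the bias input x_0 = 1 as its first component."""
    y = w @ x          # linear combination
    return f(y)

# Illustrative values only.
x = np.array([1.0, 0.5, -0.2])   # x_0 = 1 supplies the bias term
w = np.array([0.1, 0.4, -0.3])   # w_0 is the bias weight
print(adaline_output(w, x))      # with f(y) = y, the output is the linear sum
```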
2.1.1 Least Mean Square (LMS) Learning
◎ Input vectors: $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_L\}$
Ideal outputs: $\{d_1, d_2, \ldots, d_L\}$
Actual outputs: $\{y_1, y_2, \ldots, y_L\}$
Assume the output function f(y) = y, so the actual output is $y_k = \mathbf{w}^T\mathbf{x}_k$.

Error for the k-th pattern: $\varepsilon_k = d_k - y_k = d_k - \mathbf{w}^T\mathbf{x}_k$

Mean square error:
$\xi = \langle\varepsilon_k^2\rangle = \frac{1}{L}\sum_{k=1}^{L}\varepsilon_k^2 = \langle d_k^2\rangle - 2\langle d_k\mathbf{x}_k^T\rangle\mathbf{w} + \mathbf{w}^T\langle\mathbf{x}_k\mathbf{x}_k^T\rangle\mathbf{w}$  -- (2.4)

Let $\mathbf{p} = \langle d_k\mathbf{x}_k\rangle$ and $R = \langle\mathbf{x}_k\mathbf{x}_k^T\rangle$ (correlation matrix), so that
$\xi(\mathbf{w}) = \langle d_k^2\rangle - 2\mathbf{p}^T\mathbf{w} + \mathbf{w}^T R\,\mathbf{w}.$
Setting the gradient of $\xi$ to zero:
$\frac{d\xi}{d\mathbf{w}} = -2\mathbf{p} + 2R\,\mathbf{w} = 0.$
Obtain $\mathbf{w}^* = R^{-1}\mathbf{p}$.
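As a sketch, the analytical solution can be estimated from training data by replacing the expectations with sample averages (the data below are synthetic and for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: d_k = w_true^T x_k plus a little noise.
L, n = 200, 3
X = rng.normal(size=(L, n))                  # rows are x_k^T
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true + 0.01 * rng.normal(size=L)

R = (X.T @ X) / L                            # R = <x_k x_k^T>
p = (X.T @ d) / L                            # p = <d_k x_k>
w_star = np.linalg.solve(R, p)               # w* = R^{-1} p
print(w_star)                                # close to w_true
```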
Practical difficulties of the analytical formula:
1. Large dimensions: $R^{-1}$ is difficult to calculate.
2. $\langle\cdot\rangle$ is an expected value: knowledge of the underlying probabilities is required.
Idea: $\mathbf{w}^* = \arg\min_{\mathbf{w}}\xi(\mathbf{w})$, where from (2.4)
$\xi(\mathbf{w}) = \langle d_k^2\rangle - 2\mathbf{p}^T\mathbf{w} + \mathbf{w}^T R\,\mathbf{w}.$
The graph of $\xi(\mathbf{w})$ is a paraboloid.
2.1.2 Steepest Descent
Steps:
1. Initialize the weight values w(0).
2. Determine the steepest descent direction:
$\Delta\mathbf{w}(t) = -\frac{d\xi(\mathbf{w}(t))}{d\mathbf{w}(t)} = 2(\mathbf{p} - R\,\mathbf{w}(t)).$
3. Modify the weight values:
$\mathbf{w}(t+1) = \mathbf{w}(t) + \mu\,\Delta\mathbf{w}(t)$, where $\mu$ is the step size.
4. Repeat 2~3.
No calculation of $R^{-1}$ is required.
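A minimal sketch of these steps, assuming R and p are already known (the values below are illustrative):

```python
import numpy as np

def steepest_descent(R, p, w0, mu=0.05, n_iters=500):
    """Batch steepest descent on xi(w): w(t+1) = w(t) + mu * 2(p - R w(t))."""
    w = w0.astype(float)
    for _ in range(n_iters):
        delta = 2.0 * (p - R @ w)   # steepest descent direction, -grad xi
        w = w + mu * delta          # step size mu must be small enough to converge
    return w

# Illustrative 2-D problem.
R = np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([1.0, 0.5])
print(steepest_descent(R, p, np.zeros(2)))   # approaches R^{-1} p without inverting R
```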
Drawbacks: i) Knowing R and p is equivalent to knowing the error surface in advance. ii) Steepest descent training is a batch training method.
2.1.3 Stochastic Gradient Descent
Approximate the gradient by randomly selecting one training example at a time:
1. Apply an input vector $\mathbf{x}_k$.
2. Compute the squared error:
$\varepsilon_k^2(t) = (d_k - y_k)^2 = (d_k - \mathbf{w}^T(t)\,\mathbf{x}_k)^2.$
3. Compute its gradient with respect to the weights:
$\nabla_{\mathbf{w}}\,\varepsilon_k^2(t) = -2\,\varepsilon_k(t)\,\mathbf{x}_k.$
4. Update the weights:
$\mathbf{w}(t+1) = \mathbf{w}(t) + 2\mu\,\varepsilon_k(t)\,\mathbf{x}_k.$
5. Repeat 1~4 with the next input vector.
No calculation of R and p is required.
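A minimal sketch of this LMS update loop (synthetic data and an illustrative step size):

```python
import numpy as np

def lms_train(X, d, mu=0.01, epochs=20, seed=0):
    """Stochastic gradient (LMS): w <- w + 2 mu eps_k x_k, one example at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for k in rng.permutation(len(X)):     # visit examples in random order
            eps = d[k] - w @ X[k]             # error for this example
            w = w + 2 * mu * eps * X[k]       # LMS weight update
    return w

# Synthetic data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true + 0.01 * rng.normal(size=200)
print(lms_train(X, d))                        # close to w_true; no R or p needed
```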
○Practical Considerations:
(a) No. of training vectors, (b) Stopping criteria
(c) Initial weights, (d) Step size
Drawback: time consuming.
Improvement: mini-batch training method.
2.1.4 Conjugate Gradient Descent
-- Drawback: can only minimize quadratic functions, e.g.,
$f(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T A\,\mathbf{w} - \mathbf{b}^T\mathbf{w} + c.$
Advantage: guarantees to find the optimum solution in at most n iterations, where n is the size of the matrix A.
A-Conjugate Vectors:
Let $A \in \mathbb{R}^{n\times n}$ be a square, symmetric, positive-definite matrix. The vectors $S = \{\mathbf{s}(0), \mathbf{s}(1), \ldots, \mathbf{s}(n-1)\}$ are A-conjugate if
$\mathbf{s}^T(i)\,A\,\mathbf{s}(j) = 0, \quad i \neq j.$
* If A = I (the identity matrix), conjugacy reduces to orthogonality.
The set S forms a basis for the space $\mathbb{R}^n$.
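As a quick numerical illustration (NumPy assumed): the eigenvectors of a symmetric positive-definite A are one example of an A-conjugate set, since they are mutually orthogonal and A only rescales them.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                  # symmetric, positive definite
_, V = np.linalg.eigh(A)
s0, s1 = V[:, 0], V[:, 1]                   # candidate A-conjugate directions

print(s0 @ A @ s1)                          # ~0: s0^T A s1 = 0, so s0 and s1 are A-conjugate
print(s0 @ s1)                              # also ~0 here; with A = I conjugacy is just orthogonality
```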
The solution $\mathbf{w}^*$ in $\mathbb{R}^n$ can therefore be written as
$\mathbf{w}^* = \sum_{i=0}^{n-1} a_i\,\mathbf{s}(i).$
• The conjugate-direction method for minimizing f(w) is defined by
$\mathbf{w}(i+1) = \mathbf{w}(i) + \eta(i)\,\mathbf{s}(i), \quad i = 0, 1, \ldots, n-1,$
where w(0) is an arbitrary starting vector and $\eta(i)$ is determined by $\min_{\eta} f(\mathbf{w}(i) + \eta\,\mathbf{s}(i))$.
How to determine $\mathbf{s}(i)$?
Let $\mathbf{s}(i) = \mathbf{r}(i) + \beta(i)\,\mathbf{s}(i-1), \quad i = 1, 2, \ldots, n-1.$  -- (A)
Define $\mathbf{r}(i) = \mathbf{b} - A\,\mathbf{w}(i)$, which is the steepest descent direction of f, since $-\nabla_{\mathbf{w}} f(\mathbf{w}) = \mathbf{b} - A\,\mathbf{w}$.
Multiplying (A) by $\mathbf{s}^T(i-1)\,A$:
$\mathbf{s}^T(i-1)\,A\,\mathbf{s}(i) = \mathbf{s}^T(i-1)\,A\,(\mathbf{r}(i) + \beta(i)\,\mathbf{s}(i-1)).$
For A-conjugacy, $\mathbf{s}^T(i)\,A\,\mathbf{s}(j) = 0$ for $i \neq j$, so
$0 = \mathbf{s}^T(i-1)\,A\,\mathbf{r}(i) + \beta(i)\,\mathbf{s}^T(i-1)\,A\,\mathbf{s}(i-1),$
$\beta(i) = -\frac{\mathbf{s}^T(i-1)\,A\,\mathbf{r}(i)}{\mathbf{s}^T(i-1)\,A\,\mathbf{s}(i-1)}.$  -- (B)
The vectors $\mathbf{s}(1), \mathbf{s}(2), \ldots, \mathbf{s}(n-1)$ generated by Eqs. (A) and (B) are A-conjugate.
• It is desirable to evaluate $\beta(i)$ without needing to know A.
Polak-Ribiere formula:
$\beta(i) = \frac{\mathbf{r}^T(i)\,(\mathbf{r}(i) - \mathbf{r}(i-1))}{\mathbf{r}^T(i-1)\,\mathbf{r}(i-1)}$
Fletcher-Reeves formula:
$\beta(i) = \frac{\mathbf{r}^T(i)\,\mathbf{r}(i)}{\mathbf{r}^T(i-1)\,\mathbf{r}(i-1)}$
* The conjugate-direction method for minimizing $\xi(\mathbf{w}) = \langle d_k^2\rangle - 2\mathbf{p}^T\mathbf{w} + \mathbf{w}^T R\,\mathbf{w}$:
Let $\mathbf{w}(i+1) = \mathbf{w}(i) + \eta(i)\,\mathbf{s}(i), \quad i = 0, 1, \ldots, n-1,$
where w(0) is an arbitrary starting vector and $\eta(i)$ is determined by $\min_{\eta}\xi(\mathbf{w}(i) + \eta\,\mathbf{s}(i))$, with
$\mathbf{s}(i) = \mathbf{r}(i) + \beta(i)\,\mathbf{s}(i-1), \quad \mathbf{r}(i) = \mathbf{p} - R\,\mathbf{w}(i), \quad \beta(i) = -\frac{\mathbf{s}^T(i-1)\,R\,\mathbf{r}(i)}{\mathbf{s}^T(i-1)\,R\,\mathbf{s}(i-1)}.$
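A minimal sketch of this conjugate-direction iteration for $\xi(\mathbf{w})$ (NumPy; the R and p values are illustrative, and $\eta(i)$ is obtained in closed form because $\xi$ is quadratic):

```python
import numpy as np

def conjugate_direction(R, p, w0):
    """Minimize xi(w) = <d^2> - 2 p^T w + w^T R w by the conjugate-direction
    method above: r(i) = p - R w(i), s(i) = r(i) + beta(i) s(i-1)."""
    w = w0.astype(float)
    r = p - R @ w                        # steepest descent direction at w(0)
    s = r.copy()                         # s(0) = r(0)
    for i in range(len(p)):              # at most n iterations
        eta = (r @ s) / (s @ R @ s)      # exact line search along s(i)
        w = w + eta * s                  # w(i+1) = w(i) + eta(i) s(i)
        r = p - R @ w                    # new residual r(i+1)
        beta = -(s @ R @ r) / (s @ R @ s)   # formula (B) with A = R
        s = r + beta * s                 # next R-conjugate direction, Eq. (A)
    return w

# Illustrative 2-D problem.
R = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
p = np.array([1.0, 1.0])
w = conjugate_direction(R, p, np.zeros(2))
print(w, np.linalg.solve(R, p))          # reaches R^{-1} p in n = 2 steps
```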
Nonlinear Conjugate Gradient Algorithm:
Initialize w(0) by an appropriate process, then iterate the conjugate-direction updates, computing $\beta(i)$ with the Polak-Ribiere or Fletcher-Reeves formula.
Example: a comparison of the convergence of gradient descent (green) and conjugate gradient (red) for minimizing a quadratic function. Conjugate gradient converges in at most n steps, where n is the size of the matrix of the system (here n = 2).
2.3. Applications
2.3.1. Echo Cancellation in Telephone Circuits
n: incoming voice; s: outgoing voice
ñ: noise (leakage of the incoming voice)
y: the output of the adaptive filter, which mimics ñ
Hybrid circuit: deals with the leakage issue; it attempts to isolate the incoming signal from the outgoing signal.
Adaptive filter: deals with the choppy-speech issue; it mimics the leakage of the incoming voice so that the leakage can be suppressed (subtracted) from the outgoing signal.

$\langle\varepsilon^2\rangle = \langle(s + \tilde{n} - y)^2\rangle = \langle s^2\rangle + \langle(\tilde{n} - y)^2\rangle + 2\langle s(\tilde{n} - y)\rangle$

$\langle s(\tilde{n} - y)\rangle = 0$ (s is not correlated with y or ñ), so

$\langle\varepsilon^2\rangle = \langle s^2\rangle + \langle(\tilde{n} - y)^2\rangle$ and $\min\langle\varepsilon^2\rangle = \langle s^2\rangle + \min\langle(\tilde{n} - y)^2\rangle.$

Minimizing the error power therefore drives y toward ñ while leaving s untouched.
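A rough simulation sketch of this idea, assuming the leakage ñ is an unknown FIR filtering of the incoming voice n; the filter length, leakage taps, and step size below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, taps, mu = 20000, 8, 0.01

n = rng.normal(size=T)                    # incoming voice (reference input)
s = 0.5 * rng.normal(size=T)              # outgoing voice, uncorrelated with n
h_leak = np.array([0.6, 0.3, -0.2, 0.1])  # unknown hybrid leakage path (assumed FIR)
n_leak = np.convolve(n, h_leak)[:T]       # leakage of the incoming voice onto the line

w = np.zeros(taps)                        # adaptive filter weights
for t in range(taps, T):
    x = n[t - taps + 1 : t + 1][::-1]     # recent incoming samples (filter input)
    y = w @ x                             # filter output, mimics the leakage
    eps = (s[t] + n_leak[t]) - y          # line signal minus estimate = cleaned output
    w = w + 2 * mu * eps * x              # LMS update driven by the error

print(w[:4])                              # approaches h_leak; eps approaches s alone
```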
2.3.2 Predict Signal
An adaptive filter is trained to predict a signal: the input presented to the filter is a delayed version of the actual signal, and the expected (target) output is the current signal value.
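A minimal sketch of this arrangement, assuming an LMS-trained FIR predictor whose input is the delayed signal and whose target is the current sample (signal, delay, and step size are illustrative):

```python
import numpy as np

T, taps, delay, mu = 5000, 4, 1, 0.02
t_axis = np.arange(T)
sig = np.sin(0.05 * t_axis) + 0.05 * np.random.default_rng(0).normal(size=T)

w = np.zeros(taps)
for t in range(taps + delay, T):
    x = sig[t - delay - taps + 1 : t - delay + 1][::-1]  # delayed past samples (filter input)
    y = w @ x                                            # prediction of the current sample
    eps = sig[t] - y                                     # target is the current signal value
    w = w + 2 * mu * eps * x                             # LMS update

print(w)   # learned one-step-ahead predictor coefficients
```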
2.3.3 Reproduce Signal
2.3.4. Adaptive Beam-Forming Antenna Arrays
Antenna: a spatial array of sensors that are directional in their reception characteristics.
The adaptive filter learns to steer the antenna array so that it responds to incoming signals regardless of their direction of arrival, while reducing its response to unwanted noise signals coming from other directions.
2.4 Madaline: Many Adalines
○ XOR function: can a single Adaline realize it? No; a single Adaline has a linear (hyperplane) decision boundary, and XOR is not linearly separable. A Madaline with two Adalines feeding a third can realize it, as the sketch below illustrates.
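A small sketch with hand-picked bipolar weights (my illustration, not taken from the slides): two hidden Adalines each detect one of the two "on" corners of XOR, and a third Adaline ORs them together.

```python
import numpy as np

def adaline(w, x):
    """Bipolar Adaline: sign of w^T [1, x] (first weight is the bias)."""
    return 1 if w @ np.concatenate(([1.0], x)) >= 0 else -1

# Hand-picked weights (illustrative): bias first, then the two inputs.
w_h1 = np.array([-1.0,  1.0, -1.0])   # fires only for (+1, -1)
w_h2 = np.array([-1.0, -1.0,  1.0])   # fires only for (-1, +1)
w_out = np.array([ 1.0,  1.0,  1.0])  # OR of the two hidden units

for x1 in (-1, 1):
    for x2 in (-1, 1):
        h = np.array([adaline(w_h1, [x1, x2]), adaline(w_h2, [x1, x2])])
        print((x1, x2), "->", adaline(w_out, h))   # +1 exactly when x1 != x2
```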
2.4.2. Madaline Rule II (MRII)
○ Training algorithm: a trial-and-error procedure with a minimum disturbance principle (those nodes that can affect the output error while incurring the least change in their weights should have precedence in the learning process).
○ Procedure:
1. Input a training pattern.
2. Count the number of incorrect values in the output layer.
3. For all units on the output layer:
3.1. Select the first previously unselected error node whose analog output is closest to zero (this node can reverse its bipolar output with the least change in its weights).
3.2. Change the weights on the selected unit such that the bipolar output of the unit changes.
3.3. Input the same training pattern.
3.4. If the number of errors is reduced, accept the weight change; otherwise restore the original weights.
4. Repeat Step 3 for all layers except the input layer.
5. For all units on the output layer:
5.1. Select the previously unselected pair of units whose analog outputs are closest to zero.
5.2. Apply a weight correction to both units in order to change their bipolar outputs.
5.3. Input the same training pattern.
5.4. If the number of errors is reduced, accept the correction; otherwise restore the original weights.
6. Repeat Step 5 for all layers except the input layer.
※ Steps 5 and 6 can be repeated with triplets, quadruplets, or longer combinations of units until satisfactory results are obtained.
The MRII learning rule as described considers a network with only one hidden layer. For networks with more hidden layers, the backpropagation learning strategy to be discussed later can be employed.
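A minimal sketch of the single-unit trial pass (Steps 1 to 4 above) for a one-hidden-layer Madaline with bipolar units; the helper names, network shapes, and weight nudge are my illustrative assumptions, the trial is shown for the hidden layer only, and the pair/triplet trials of Steps 5 and 6 are omitted.

```python
import numpy as np

def bipolar(a):
    return np.where(a >= 0, 1, -1)

def forward(x, W_h, W_o):
    """x includes the bias 1; returns hidden analog outputs, hidden bipolar
    outputs (with bias prepended), and bipolar network outputs."""
    a_h = W_h @ x
    h = np.concatenate(([1.0], bipolar(a_h)))
    return a_h, h, bipolar(W_o @ h)

def mrii_step(x, target, W_h, W_o, delta=0.1):
    """One MRII trial pass over the hidden layer, minimum disturbance first."""
    a_h, h, y = forward(x, W_h, W_o)
    errors = np.sum(y != target)
    if errors == 0:
        return W_h, W_o
    # Try hidden units in order of |analog output| (closest to zero = least disturbance).
    for j in np.argsort(np.abs(a_h)):
        old = W_h[j].copy()
        flip = -bipolar(a_h[j])
        # Nudge the weights until this unit's bipolar output flips.
        W_h[j] += delta * flip * x
        while bipolar(W_h[j] @ x) != flip:
            W_h[j] += delta * flip * x
        _, _, y_new = forward(x, W_h, W_o)
        if np.sum(y_new != target) < errors:   # accept only if the error count drops
            errors = np.sum(y_new != target)
        else:
            W_h[j] = old                       # otherwise restore the original weights
    return W_h, W_o
```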
2.4.3. A Madaline for Translation-Invariant Pattern Recognition
○ Relationships among the weight matrices of the Adalines
○ Extension -- multiple slabs with different key weight matrices for discriminating more than two classes of patterns.