Perceptron

Slide Content

Carla P. Gomes
CS4700
Basic Concepts
Neural Network
(Figure: a network mapping inputs Input 0 … Input n to outputs Output 0 … Output m.)
A Neural Network maps a set of inputs to a set of outputs.
The number of inputs/outputs is variable.
The network itself is composed of an arbitrary number of nodes or units, connected by links, with an arbitrary topology.
A link from unit i to unit j serves to propagate the activation a_i from i to j, and it has a weight W_{i,j} associated with it.
What can a neural network do?
Compute a known function / Approximate an unknown function
Pattern Recognition / Signal Processing
Learn to do any of the above

Different types of nodes

An Artificial Neuron
Node or Unit: A Mathematical Abstraction
An artificial neuron (node, unit, or processing unit i) is a processing element producing an output based on a function of its inputs.
Input edges, each with a weight (positive or negative; the weights change over time through learning).
Output edges, each with a weight (positive or negative; the weights change over time through learning).
Input function (in_i): the weighted sum of its inputs, including the fixed input a_0:

in_i = \sum_{j=0}^{n} W_{j,i} \, a_j

Activation function (g), applied to the input function (typically non-linear):

a_i = g(in_i) = g\!\left( \sum_{j=0}^{n} W_{j,i} \, a_j \right)

Note: the fixed input and bias weight are a convention; here a_0 = -1 with bias weight W_{0,i} (some authors instead use a_0 = 1 and a weight of -W_{0,i}).
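The unit's computation is compact enough to state in code. Below is a minimal illustrative sketch (Python; the function name and example weights are mine, not the slides'): the unit forms the weighted sum in_i and passes it through an activation function g.

```python
# Minimal sketch of a single unit (illustrative; names and example values are assumptions).
# a: incoming activations, with a[0] the fixed input (here -1, matching the slides' convention)
# w: weights W_{0,i} ... W_{n,i} on the incoming links; w[0] is the bias weight
# g: activation function (step, sigmoid, sign, ...)

def unit_output(w, a, g):
    in_i = sum(w_j * a_j for w_j, a_j in zip(w, a))   # in_i = sum_j W_{j,i} * a_j
    return g(in_i)                                     # a_i = g(in_i)

# Example: a threshold unit with bias weight 1.5 on the fixed input a_0 = -1.
step = lambda x: 1 if x >= 0 else 0
print(unit_output([1.5, 1.0, 1.0], [-1, 1, 1], step))  # 1.5*(-1) + 1 + 1 = 0.5 >= 0 -> 1
```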

Activation Functions
(a) Threshold activation function: a step function that outputs 1 when the input is positive and 0 otherwise.
(b) Sigmoid (or logistic) activation function (key advantage: differentiable).
(c) Sign function: +1 if the input is positive, otherwise -1.
Changing the bias weight W_{0,i} moves the threshold location.
These functions have a threshold (either hard or soft) at zero.
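As a concrete sketch (Python; the function names are mine, not the slides'), the three activation functions could be written as follows; the boundary case in = 0 is a matter of convention.

```python
import math

def threshold(x):      # (a) step function: 1 when the input is positive, 0 otherwise
    return 1 if x > 0 else 0

def sigmoid(x):        # (b) logistic function; differentiable, with g' = g * (1 - g)
    return 1.0 / (1.0 + math.exp(-x))

def sign(x):           # (c) +1 if the input is positive, otherwise -1
    return 1 if x > 0 else -1
```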

Threshold Activation Function
A threshold unit is activated (outputs 1) when

in_i = \sum_{j=0}^{n} W_{j,i} \, a_j \ge 0.

Equivalently, with t_i the threshold value associated with unit i, the unit is activated when

\sum_{j=1}^{n} W_{j,i} \, a_j \ge t_i ;

defining a_0 = -1 and W_{0,i} = t_i folds the threshold into the weighted sum:

in_i = \sum_{j=1}^{n} W_{j,i} \, a_j + W_{0,i} \, a_0 \ge 0.

Implementing Boolean Functions
Units with a threshold activation function can act as logic gates; we can use such units to compute Boolean functions of their inputs.
Activation of threshold units when:

\sum_{j=1}^{n} W_{j,i} \, a_j \ge W_{0,i}

Boolean AND

input x1   input x2   output
   0          0          0
   0          1          0
   1          0          0
   1          1          1

Weights: w1 = 1 on x1, w2 = 1 on x2, and bias weight W0 = 1.5 on the fixed input -1.
Activation of threshold units when:

\sum_{j=1}^{n} W_{j,i} \, a_j \ge W_{0,i}

Boolean OR

input x1   input x2   output
   0          0          0
   0          1          1
   1          0          1
   1          1          1

Weights: w1 = 1 on x1, w2 = 1 on x2, and bias weight w0 = 0.5 on the fixed input -1.
Activation of threshold units when:

\sum_{j=1}^{n} W_{j,i} \, a_j \ge W_{0,i}

Inverter

input x1   output
   0          1
   1          0

Weights (one workable choice): w1 = -1 on x1 and bias weight w0 = -0.5 on the fixed input -1.
Activation of threshold units when:

\sum_{j=1}^{n} W_{j,i} \, a_j \ge W_{0,i}

So, units with a threshold activation function can act as logic gates given the appropriate input and bias weights.
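These gate constructions are easy to check mechanically. The sketch below (Python; the helper name and the inverter's exact weights are my assumptions) implements "fire when the weighted input sum reaches the bias weight W0" and prints the truth tables.

```python
def threshold_gate(w, w0, inputs):
    """Output 1 if sum_j w[j]*inputs[j] >= w0, else 0 (the fixed input -1 carries weight w0)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, inputs)) >= w0 else 0

AND = lambda x1, x2: threshold_gate([1, 1], 1.5, [x1, x2])
OR  = lambda x1, x2: threshold_gate([1, 1], 0.5, [x1, x2])
NOT = lambda x1:     threshold_gate([-1], -0.5, [x1])        # one workable choice of weights

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
print("NOT 0 ->", NOT(0), " NOT 1 ->", NOT(1))
```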

Network Structures
Acyclic or feed-forward networks
Activation flows from the input layer to the output layer
–single-layer perceptrons
–multi-layer perceptrons
Recurrent networks
–Feed the outputs back into the network's own inputs
–The network is a dynamical system (stable states, oscillations, chaotic behavior)
–The response of the network depends on the initial state
–Can support short-term memory
–More difficult to understand
Feed-forward networks implement functions and have no internal state (only weights); they are our focus.

Feed-forward Network: Represents a Function of Its Input
(Figure: two input units, two hidden units, one output unit; the bias unit is omitted for simplicity.)
Each unit receives input only from units in the immediately preceding layer.
Given an input vector x = (x1, x2), the activations of the input units are set to the values of the input vector, i.e., (a1, a2) = (x1, x2), and the network computes its output by propagating these activations forward through the hidden units.
A feed-forward network thus computes a parameterized family of functions h_W(x); the weights are the parameters of the function.
Note: the input layer in general does not include computing units.
By adjusting the weights we get different functions: that is how learning is done in neural networks!
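To make h_W(x) concrete, here is a minimal sketch of the two-input, two-hidden-unit, one-output network with sigmoid units (Python; the weight values, unit numbering, and names are illustrative assumptions, and bias inputs are omitted as in the figure).

```python
import math

def g(x):                                     # sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def h_W(x1, x2, W):
    """Forward pass of a 2-2-1 feed-forward network; the weights W are its parameters."""
    a3 = g(W["1,3"] * x1 + W["2,3"] * x2)     # hidden unit 3
    a4 = g(W["1,4"] * x1 + W["2,4"] * x2)     # hidden unit 4
    return g(W["3,5"] * a3 + W["4,5"] * a4)   # output unit 5

W = {"1,3": 0.5, "2,3": -1.0, "1,4": 1.5, "2,4": 0.25, "3,5": 1.0, "4,5": -2.0}  # illustrative
print(h_W(1.0, 0.0, W))   # changing W changes the function computed: that is what learning adjusts
```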

Perceptron
–Invented by Frank Rosenblatt in 1957 in an attempt to understand human memory, learning, and cognitive processes.
–The first computational neural network model, with a remarkable learning algorithm: if a function can be represented by a perceptron, the learning algorithm is guaranteed to converge quickly to the hidden function!
–Became the foundation of pattern recognition research.
Rosenblatt & the Mark I Perceptron: the first machine that could "learn" to recognize and identify optical patterns (Cornell Aeronautical Laboratory).
One of the earliest and most influential neural networks: an important milestone in AI.

Perceptron
Rosenblatt, Frank (Cornell Aeronautical Laboratory at Cornell University). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review, Vol. 65, No. 6, pp. 386-408, November 1958.

Single-Layer Feed-forward Neural Networks: Perceptrons
Single-layer neural network (perceptron network): a network with all the inputs connected directly to the outputs.
–Output units all operate separately: no shared weights.
Since each output unit is independent of the others, we can limit our study to single-output perceptrons.

Perceptron to Learn to Identify Digits (from Patrick Winston, MIT)
Seven line segments are enough to produce all 10 digits.
(Figure: seven-segment display with segments numbered 0-6.)

Digit  x0  x1  x2  x3  x4  x5  x6
  0     0   1   1   1   1   1   1
  9     1   1   1   1   1   1   0
  8     1   1   1   1   1   1   1
  7     0   0   1   1   1   0   0
  6     1   1   1   0   1   1   1
  5     1   1   1   0   1   1   0
  4     1   1   0   1   1   0   0
  3     1   0   1   1   1   1   0
  2     1   0   1   1   0   1   1
  1     0   0   0   1   1   0   0

Perceptron to Learn to Identify Digits (from Patrick Winston, MIT)
Seven line segments are enough to produce all 10 digits.
A vision system reports which of the seven segments in the display are on, therefore producing the inputs for the perceptron.
(Figure: seven-segment display with segments numbered 0-6.)

Perceptron to Learn to Identify Digit 0
Seven line segments are enough to produce all 10 digits.
A vision system reports which of the seven segments in the display are on, therefore producing the inputs for the perceptron.
(Figure: a perceptron wired to the seven segment inputs x0-x6 and a fixed input x7, with one weight per connection.)

Digit  x0  x1  x2  x3  x4  x5  x6  x7 (fixed input)
  0     0   1   1   1   1   1   1   1

Sum > 0 → output = 1; else output = 0.
When the input digit is 0, what is the value of the sum?
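The exact weights drawn on this slide are not recoverable from the text, so here is a hedged sketch with one hypothetical choice of weights under which the sum is positive only for the segment pattern of digit 0 (x0 off, x1-x6 on); it answers the slide's question for that choice.

```python
# Hypothetical weights (not the slide's): penalize segment x0 heavily, reward x1..x6,
# and put a bias of -5.5 on the fixed input x7 = 1.
weights = [-6, 1, 1, 1, 1, 1, 1, -5.5]        # weights for x0..x6 and the fixed input x7

digits = {                                    # segment patterns from the earlier table
    0: [0, 1, 1, 1, 1, 1, 1],
    8: [1, 1, 1, 1, 1, 1, 1],
    9: [1, 1, 1, 1, 1, 1, 0],
    7: [0, 0, 1, 1, 1, 0, 0],
}

for d, segments in digits.items():
    s = sum(w * x for w, x in zip(weights, segments + [1]))   # append the fixed input x7 = 1
    print(d, s, 1 if s > 0 else 0)            # only digit 0 yields a positive sum -> output 1
```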

Single-Layer Feed-forward Neural Networks: Perceptrons
A two-input perceptron unit with a sigmoid (logistic) activation function: adjusting the weights moves the location, orientation, and steepness of the cliff.

Perceptron Learning: Intuition
Weight Update
Inputs I_j (j = 1, 2, …, n); a single output O; target output T.
Consider some initial weights.
Define the example error: Err = T − O.
Now just move the weights in the right direction!
If the error is positive, then we need to increase O:
Err > 0 → need to increase O;
Err < 0 → need to decrease O.
Each input unit j contributes W_j I_j to the total input:
if I_j is positive, increasing W_j tends to increase O;
if I_j is negative, decreasing W_j tends to increase O.
So, use:

W_j ← W_j + α × I_j × Err

Perceptron Learning Rule (Rosenblatt, 1960); α is the learning rate.

Perceptron Learning: Simple Example
Let's consider an example (adapted from Patrick Winston's book, MIT).
Framework and notation:
0/1 signals.
Input vector: X = <x0, x1, x2, …, xn>
Weight vector: W = <w0, w1, w2, …, wn>
x0 = 1 and θ0 = −w0 simulate the threshold.
O is the output (0 or 1) (single output).
Threshold function:

S = \sum_{k=0}^{n} w_k x_k;  if S > 0 then O = 1, else O = 0.

Learning rate = 1.

Perceptron Learning: Simple Example
Set of examples: each example is a pair (x_i, y_i), i.e., an input vector and a label y (0 or 1).
Learning procedure, called the "error-correcting method":
•Start with the all-zero weight vector.
•Cycle (repeatedly) through the examples, and for each example do:
–If the perceptron outputs 0 while it should be 1, add the input vector to the weight vector.
–If the perceptron outputs 1 while it should be 0, subtract the input vector from the weight vector.
–Otherwise do nothing.
This is the rule W_j ← W_j + α × I_j × Err with Err = T − O, and it is intuitively correct (e.g., if the output is 0 but it should be 1, the weights are increased)!
This procedure provably converges (in a polynomial number of steps) if the function is representable by a perceptron (i.e., linearly separable).

Perceptron Learning: Simple Example
Consider learning the logical OR function.
Our examples are:

Sample  x0  x1  x2  label
  1      1   0   0    0
  2      1   0   1    1
  3      1   1   0    1
  4      1   1   1    1

Activation function: S = \sum_{k=0}^{n} w_k x_k;  if S > 0 then O = 1, else O = 0.
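The error-correcting method on these four examples is short enough to run directly. A minimal sketch (Python; the names are mine) follows the procedure above with learning rate 1; it reproduces the weight sequence traced on the next slides and ends with W = <0, 1, 1>.

```python
# Error-correcting method for the OR examples (0/1 threshold output, learning rate 1).
examples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = [0, 0, 0]                                  # start with the all-zero weight vector

for epoch in range(1, 5):                      # four epochs suffice for this toy problem
    for x, label in examples:
        s = sum(wk * xk for wk, xk in zip(w, x))
        output = 1 if s > 0 else 0             # threshold function: O = 1 iff S > 0
        if output == 0 and label == 1:         # output too low: add the input vector
            w = [wk + xk for wk, xk in zip(w, x)]
        elif output == 1 and label == 0:       # output too high: subtract the input vector
            w = [wk - xk for wk, xk in zip(w, x)]
    print("after epoch", epoch, "W =", w)      # <1,0,1>, <1,1,1>, <0,1,1>, <0,1,1>
```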

Perceptron Learning: Simple Example
We'll use a single perceptron with three inputs I0, I1, I2, weights w0, w1, w2, and output O.
We'll start with all weights 0: W = <0,0,0>.
Example 1: I = <1,0,0>, label = 0, W = <0,0,0>.
Perceptron: S = 1·0 + 0·0 + 0·0 = 0, so the output is 0.
It classifies it as 0, which is correct: do nothing.
Example 2: I = <1,0,1>, label = 1, W = <0,0,0>.
Perceptron: S = 1·0 + 0·0 + 1·0 = 0, so the output is 0.
It classifies it as 0 while it should be 1, so we add the input to the weights:
W = <0,0,0> + <1,0,1> = <1,0,1>.
Error-correcting method: if the perceptron outputs 0 while it should be 1, add the input vector to the weight vector; if it outputs 1 while it should be 0, subtract the input vector from the weight vector; otherwise do nothing.

Example 3: I = <1,1,0>, label = 1, W = <1,0,1>.
Perceptron: S = 1·1 + 1·0 + 0·1 > 0, so the output is 1.
It classifies it as 1, correct: do nothing. W = <1,0,1>.
Example 4: I = <1,1,1>, label = 1, W = <1,0,1>.
Perceptron: S = 1·1 + 1·0 + 1·1 > 0, so the output is 1.
It classifies it as 1, correct: do nothing. W = <1,0,1>.

Perceptron Learning: Simple Example
Epoch 2 through the examples, starting with W = <1,0,1>.
Example 1: I = <1,0,0>, label = 0, W = <1,0,1>.
Perceptron: S = 1·1 + 0·0 + 0·1 > 0, so the output is 1.
It classifies it as 1 while it should be 0, so we subtract the input from the weights:
W = <1,0,1> − <1,0,0> = <0,0,1>.
Example 2: I = <1,0,1>, label = 1, W = <0,0,1>.
Perceptron: S = 1·0 + 0·0 + 1·1 > 0, so the output is 1.
It classifies it as 1, correct: do nothing.
Error-correcting method: if the perceptron outputs 0 while it should be 1, add the input vector to the weight vector; if it outputs 1 while it should be 0, subtract the input vector from the weight vector; otherwise do nothing.

Example 3: I = <1,1,0>, label = 1, W = <0,0,1>.
Perceptron: S = 1·0 + 1·0 + 0·1 = 0, so the output is 0.
It classifies it as 0 while it should be 1, so we add the input to the weights:
W = <0,0,1> + <1,1,0> = <1,1,1>.
Example 4: I = <1,1,1>, label = 1, W = <1,1,1>.
Perceptron: S = 1·1 + 1·1 + 1·1 > 0, so the output is 1.
It classifies it as 1, correct: do nothing. W = <1,1,1>.

Perceptron Learning: Simple Example
Epoch 3 through the examples, starting with W = <1,1,1>.
Example 1: I = <1,0,0>, label = 0, W = <1,1,1>.
Perceptron: S = 1·1 + 0·1 + 0·1 > 0, so the output is 1.
It classifies it as 1 while it should be 0, so we subtract the input from the weights:
W = <1,1,1> − <1,0,0> = <0,1,1>.
Example 2: I = <1,0,1>, label = 1, W = <0,1,1>.
Perceptron: S = 1·0 + 0·1 + 1·1 > 0, so the output is 1.
It classifies it as 1, correct: do nothing.

Example 3: I = <1,1,0>, label = 1, W = <0,1,1>.
Perceptron: S = 1·0 + 1·1 + 0·1 > 0, so the output is 1.
It classifies it as 1, correct: do nothing.
Example 4: I = <1,1,1>, label = 1, W = <0,1,1>.
Perceptron: S = 1·0 + 1·1 + 1·1 > 0, so the output is 1.
It classifies it as 1, correct: do nothing. W = <0,1,1>.

Perceptron Learning: Simple Example
Epoch 4 through the examples, starting with W = <0,1,1>.
Example 1: I = <1,0,0>, label = 0, W = <0,1,1>.
Perceptron: S = 1·0 + 0·1 + 0·1 = 0, so the output is 0.
It classifies it as 0, correct: do nothing.
So the final weight vector W = <0,1,1> (W0 = 0, W1 = 1, W2 = 1) classifies all examples correctly, and the perceptron has learned the OR function!
Aside: in more realistic cases the bias (W0) will not be 0. (This was just a toy example!)
Also, in general, there are many more inputs (100 to 1000).

Trace of the error-correcting method on the OR examples, one row per example presentation:

Epoch  Example    x0 x1 x2  Target  w0 w1 w2  Output  Error  New w0 w1 w2
  1    example 1   1  0  0    0      0  0  0    0        0        0  0  0
  1    example 2   1  0  1    1      0  0  0    0        1        1  0  1
  1    example 3   1  1  0    1      1  0  1    1        0        1  0  1
  1    example 4   1  1  1    1      1  0  1    1        0        1  0  1
  2    example 1   1  0  0    0      1  0  1    1       -1        0  0  1
  2    example 2   1  0  1    1      0  0  1    1        0        0  0  1
  2    example 3   1  1  0    1      0  0  1    0        1        1  1  1
  2    example 4   1  1  1    1      1  1  1    1        0        1  1  1
  3    example 1   1  0  0    0      1  1  1    1       -1        0  1  1
  3    example 2   1  0  1    1      0  1  1    1        0        0  1  1
  3    example 3   1  1  0    1      0  1  1    1        0        0  1  1
  3    example 4   1  1  1    1      0  1  1    1        0        0  1  1
  4    example 1   1  0  0    0      0  1  1    0        0        0  1  1

Derivation of a Learning Rule for Perceptrons: Minimizing Squared Errors
Threshold perceptrons have some advantages; in particular, there is a simple learning algorithm that fits a threshold perceptron to any linearly separable training set.
Key idea: learn by adjusting the weights to reduce the error on the training set.
Update the weights repeatedly (over epochs), once for each example.
We'll use the sum of squared errors (as used, e.g., in linear regression), the classical error measure.
Learning is then an optimization search problem in weight space.

Derivation of a Learning Rule for Perceptrons: Minimizing Squared Errors
Let S = {(x_i, y_i): i = 1, 2, ..., m} be a training set. (Note: x is a vector of inputs, and y is the true output.)
Let h_w be the perceptron classifier represented by the weight vector w.
Definition (squared error on a single example):

E(x) = SquaredError(x) = \frac{1}{2} \bigl( y(x) - h_w(x) \bigr)^2

Derivation of a Learning Rule for Perceptrons: Minimizing Squared Errors
The squared error for a single training example with input x and true output y is E(x) above, where h_w(x) is the output of the perceptron on the example and y is the true output value.
We can use gradient descent to reduce the squared error by calculating the partial derivative of E with respect to each weight.
Note: g'(in) is the derivative of the activation function; for the sigmoid, g' = g(1 - g). For threshold perceptrons, where g'(in) is undefined, the original perceptron learning rule simply omits it.
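The derivative computation itself is not reproduced in the extracted text, so the following is a standard reconstruction (my own rendering, using the definitions above) of the step the slide refers to:

```latex
\frac{\partial E}{\partial w_j}
  = \frac{\partial}{\partial w_j}\,\tfrac{1}{2}\bigl(y - h_w(x)\bigr)^2
  = -\bigl(y - h_w(x)\bigr)\,
     \frac{\partial}{\partial w_j}\, g\!\Bigl(\sum_{k} w_k x_k\Bigr)
  = -\,\mathit{Err}\cdot g'(in)\cdot x_j,
\qquad\text{so}\qquad
w_j \;\leftarrow\; w_j + \alpha\,\mathit{Err}\,g'(in)\,x_j .
```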

Gradient descent algorithm: we want to reduce E, so for each weight w_j we change the weight in the direction of steepest descent:

W_j ← W_j + α × I_j × Err

Intuitively:
Err = y − h_W(x) positive → the output is too small → weights are increased for positive inputs and decreased for negative inputs.
Err = y − h_W(x) negative → the opposite.
α is the learning rate.

Perceptron Learning: Intuition
The rule is intuitively correct!
Greedy search: gradient descent through weight space!
Surprising proof of convergence: the weight space has no local minima!
With enough examples, it will find the target function!
(provided α is not too large)

Gradient descent in weight space
(Figure from T. M. Mitchell, Machine Learning.)

Perceptron learning rule:
1. Start with random weights, w = (w1, w2, ..., wn).
2. Select a training example (x, y) ∈ S.
3. Run the perceptron with input x and weights w to obtain the output g(in).
4. Let α be the training rate (a user-set parameter), and update each weight:

   w_i ← w_i + Δw_i,  where  Δw_i = α (y − g(in)) g'(in) x_i

5. Go to 2.
(This is the rule W_j ← W_j + α × I_j × Err, with the extra factor g'(in).)
Epoch = a cycle through the examples.
The stochastic gradient method selects examples randomly from the training set rather than cycling through them.
Epochs are repeated until some stopping criterion is reached, typically that the weight changes have become very small.
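As an illustrative rendering of steps 1-5 (Python; the function names, learning-rate value, and exact stopping test are my assumptions), using stochastic example selection and stopping once the weight changes become very small:

```python
import math, random

def g(x):                                   # sigmoid activation; g'(x) = g(x) * (1 - g(x))
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptron(S, n, alpha=0.5, tol=1e-4, max_updates=100000):
    """S: list of (x, y) examples, each x a length-n tuple (x[0] may be a fixed input)."""
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]        # 1. start with random weights
    for _ in range(max_updates):
        x, y = random.choice(S)                              # 2. select an example (stochastic)
        out = g(sum(wi * xi for wi, xi in zip(w, x)))        # 3. run the perceptron
        dw = [alpha * (y - out) * out * (1 - out) * xi       # 4. delta_w_i = a(y - g)g'(in)x_i
              for xi in x]
        w = [wi + di for wi, di in zip(w, dw)]
        if max(abs(d) for d in dw) < tol:                    # stop: weight changes very small
            break                                            # 5. otherwise go to 2
    return w

# Example: learn OR, with x[0] = 1 acting as the bias input.
S = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
print(train_perceptron(S, 3))
```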

Perceptron Learning: Gradient Descent Learning Algorithm

Expressiveness of Perceptrons

Expressiveness of Perceptrons
What hypothesis space can a perceptron represent?
Boolean gates such as AND, OR, and NOT, and even more complex Boolean functions such as the majority function.
But can it represent any arbitrary Boolean function?

Expressiveness of Perceptrons
A threshold perceptron returns 1 iff the weighted sum of its inputs (including the bias) is positive, i.e., iff the input is on one side of the hyperplane the weights define.
This is a linear discriminant function or linear decision surface: the weights determine the slope, and the bias determines the offset.
Perceptron → Linear Separator

Linear Separability
Perceptron used for classification.
Consider an example with two inputs, x1 and x2: we can view the trained network as defining a "separation line" in the (x1, x2) plane.
(Figure: positive (+) and negative examples in the (x1, x2) plane, separated by a line.)
What is its equation? The unit is active on the side of the line where

w_0 + w_1 x_1 + w_2 x_2 \ge 0,

i.e., the boundary line is

x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{w_0}{w_2}.

Linear Separability: OR
(Figure: the OR examples in the (x1, x2) plane; one line separates the positive from the negative examples.)

Linear Separability: AND
(Figure: the AND examples in the (x1, x2) plane; one line separates the positive from the negative examples.)

Linear Separability: XOR
(Figure: the XOR examples in the (x1, x2) plane.)

Linear Separability: XOR
(Figure: the XOR examples in the (x1, x2) plane; no single line separates the positive from the negative examples.)
XOR is not linearly separable.
Minsky & Papert (1969)
Bad news: perceptrons can only represent linearly separable functions.

Linear Separability: XOR
Consider a threshold perceptron for the logical XOR function (two inputs), which outputs 1 when

w_1 x_1 + w_2 x_2 > T.

Our examples are:

      x1   x2   label
(1)    0    0     0
(2)    1    0     1
(3)    0    1     1
(4)    1    1     0

Given our examples, we have the following inequalities for the perceptron:
From (1): 0 + 0 ≤ T, so T ≥ 0.
From (2): w1 + 0 > T, so w1 > T.
From (3): 0 + w2 > T, so w2 > T.
From (4): w1 + w2 ≤ T.
But (2) and (3) give w1 + w2 > 2T ≥ T (since T ≥ 0), contradicting (4).
So XOR is not linearly separable.

Convergence of the Perceptron Learning Algorithm
The perceptron converges to a consistent function, if…
… the training data are linearly separable,
… the step size α is sufficiently small,
… there are no "hidden" units.

The perceptron learns the majority function easily, while decision-tree learning (DTL) is hopeless at it.
DTL learns the restaurant function easily, while the perceptron cannot even represent it.
Minsky & Papert (1969)
Good news: adding a hidden layer allows more target functions to be represented.