Simple and Compound Perceptrons in AI Algorithms

About This Presentation

Neuro-fuzzy systems


Slide Content

Linear Discriminators
Chapter 20
Only relevant parts

Concerns
• Generalization accuracy
• Efficiency
• Noise
• Irrelevant features
• Generality: when does this work?

Linear Model
• Let f1, …, fn be the feature values of an example.
• Let the class be denoted {+1, -1}.
• Define f0 = -1 (the bias input).
• The linear model defines weights w0, w1, …, wn.
– w0 is the threshold (since f0 = -1)
• Classification rule:
– If w0*f0 + w1*f1 + … + wn*fn > 0, predict class +; else predict class -.
• Briefly: W*F > 0, where * is the inner product of the weight vector and the feature vector, and F has been augmented with an extra -1.

Augmentation Trick
• Suppose the data has features f1 and f2.
• The classifier is 2*f1 + 3*f2 > 4.
• Equivalently: <4, 2, 3> * <-1, f1, f2> > 0.
• Mapping the data <f1, f2> to <-1, f1, f2> allows the threshold to be learned/represented as just another feature.
• Mapping data into higher dimensions is the key idea behind SVMs.
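
A minimal sketch of this rule in Python, using the slide's own example <4, 2, 3> * <-1, f1, f2> (the predict helper and the sample points are illustrative, not from the slides):

    # Linear threshold classifier on an augmented feature vector.
    # Convention from the slides: f0 = -1, so w0 acts as the threshold.
    def predict(w, f):
        s = sum(wi * fi for wi, fi in zip(w, f))
        return 1 if s > 0 else -1

    w = [4, 2, 3]                      # <w0, w1, w2>: threshold 4
    print(predict(w, [-1, 2.0, 1.0]))  # 2*2 + 3*1 = 7 > 4 -> +1
    print(predict(w, [-1, 0.5, 0.5]))  # 2*0.5 + 3*0.5 = 2.5 < 4 -> -1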

Mapping to enable Linear Separation
• Let x1, …, xm be m vectors in R^N.
• Map each xi into R^(N+m) by xi -> <xi, 0, …, 1, …, 0>, where the 1 is in position N+i.
• For any labelling of the xi by classes +/-, this embedding makes the data linearly separable:
– Define wj = 0 for j <= N.
– Define w(N+i) = +1 if xi is positive, -1 if xi is negative.
– Then W * xi = w(N+i), which is positive exactly on the positive examples.
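
A small sketch of this embedding (the helper names embed and separating_weights are illustrative):

    # Append the i-th indicator coordinate to each of m points in R^N,
    # then build weights that separate them under any labelling.
    def embed(xs):
        m = len(xs)
        return [x + [1.0 if j == i else 0.0 for j in range(m)]
                for i, x in enumerate(xs)]

    def separating_weights(xs, labels):
        return [0.0] * len(xs[0]) + [float(y) for y in labels]  # wj = 0, j <= N

    xs = [[0.3, 0.7], [0.3, 0.7], [5.0, -2.0]]  # even duplicated points work
    ys = [1, -1, 1]                             # any labelling at all
    w = separating_weights(xs, ys)
    for x, y in zip(embed(xs), ys):
        print(sum(wi * xi for wi, xi in zip(w, x)), y)  # W*x equals the label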

Representational Power
• "Or" of n 0/1 features:
– wi = 1, threshold = 0
• "And" of n features:
– wi = 1, threshold = n - 1
• "At least k of n" features (prototype):
– wi = 1, threshold = k - 1
• Can't do XOR.
• Combining linear threshold units yields any Boolean function.
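
A quick check of these constructions (the threshold_unit helper is illustrative; features are 0/1):

    from itertools import product

    def threshold_unit(ws, theta):
        return lambda fs: int(sum(w * f for w, f in zip(ws, fs)) > theta)

    n = 3
    OR = threshold_unit([1] * n, 0)            # any feature on
    AND = threshold_unit([1] * n, n - 1)       # all features on
    two_of_n = threshold_unit([1] * n, 2 - 1)  # at least k = 2 on

    for fs in product([0, 1], repeat=n):
        assert OR(fs) == int(any(fs))
        assert AND(fs) == int(all(fs))
        assert two_of_n(fs) == int(sum(fs) >= 2)
    print("OR, AND, and 2-of-3 all match")  # no single unit can match XOR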

Classical Perceptron
• Goal: any W which separates the data.
• Algorithm (each X is augmented with -1):
• W = 0
• Repeat:
– If X is positive and W*X is wrong, W = W + X.
– Else if X is negative and W*X is wrong, W = W - X.
• Until no errors, or a very large number of passes.
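
A runnable version of this loop (a minimal sketch; the toy data set and the max_epochs cap are illustrative):

    # Classical perceptron: add misclassified positives, subtract misclassified
    # negatives; y * W*X <= 0 covers both "wrong" cases at once.
    def perceptron(data, max_epochs=1000):
        w = [0.0] * len(data[0][0])
        for _ in range(max_epochs):
            errors = 0
            for x, y in data:
                if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                    w = [wi + y * xi for wi, xi in zip(w, x)]
                    errors += 1
            if errors == 0:
                break
        return w

    # Each example is augmented with a leading -1, as on the earlier slides.
    data = [([-1, 2, 1], 1), ([-1, 3, 3], 1), ([-1, 0, 0], -1), ([-1, 1, 0], -1)]
    print(perceptron(data))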

Classical Perceptron
• Theorem: if the concept is linearly separable, then the algorithm finds a solution.
• Training time can be exponential in the number of features.
• An epoch is a single pass through the entire data set.
• Convergence can take exponentially many epochs, but it is guaranteed to work.
• If |Xi| < R and the margin is m, then the number of mistakes is < R^2/m^2.
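
An empirical check of the mistake bound on toy data built around a known separator (the data and the unit-norm separator W* = <0, 1> are constructed for illustration; m = 1 is only a lower bound on the true margin, so R^2/m^2 still upper-bounds the mistakes):

    import math

    # W* = <0, 1> separates this data with margin at least 1; R = max |Xi|.
    data = [([3, 1], 1), ([-2, 1], 1), ([3, -1], -1), ([-2, -1], -1)]
    R = max(math.sqrt(x[0] ** 2 + x[1] ** 2) for x, _ in data)

    w, mistakes, done = [0.0, 0.0], 0, False
    while not done:
        done = True
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1]) <= 0:
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                mistakes += 1
                done = False
    print(mistakes, "mistakes; bound R^2/m^2 =", R ** 2)  # 4 <= 10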

Hill-Climbing Search
• This is an optimization problem.
• The solution is found by hill-climbing, so there is no guarantee of finding the optimal solution.
• While derivatives tell you the direction (the negative gradient), they do not tell you how much to change each weight Wi.
• On the plus side, it is fast.
• On the minus side, there is no guarantee of separation.

Hill-climbing View
• Goal: minimize the squared error.
• Let the class yi be +1 or -1.
• Let Erri = W*Xi - yi, where Xi is the i-th example; the total error is the sum over i of Erri^2.
• This is a function only of the weights.
• Use calculus: take partial derivatives with respect to Wj.
• To move to a lower value, move in the direction of the negative gradient, i.e.
• the change in Wj is proportional to -2 * Erri * Xij, summed over the examples.
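
A minimal gradient-descent sketch of this update (the learning rate lr and step count are illustrative choices, not from the slides):

    # Batch gradient descent on squared error: dE/dWj = sum_i 2*(W*Xi - yi)*Xij.
    def gradient_descent(data, lr=0.02, steps=1000):
        w = [0.0] * len(data[0][0])
        for _ in range(steps):
            grad = [0.0] * len(w)
            for x, y in data:
                err = sum(wi * xi for wi, xi in zip(w, x)) - y
                for j, xj in enumerate(x):
                    grad[j] += 2 * err * xj
            w = [wi - lr * g for wi, g in zip(w, grad)]
        return w

    data = [([-1, 2, 1], 1), ([-1, 3, 3], 1), ([-1, 0, 0], -1), ([-1, 1, 0], -1)]
    w = gradient_descent(data)
    print(w)  # approaches [1.2, 0.8, 0.0], which classifies all four correctly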

Support Vector Machine
• Goal: maximize the margin.
• Assuming the line separates the data, the margin is the distance from the line to the closest example, positive or negative.
• Good news: this can be solved as a quadratic program.
• Implemented in Weka as SMO.
• If the data is not linearly separable, the SVM will add more features.
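
The slides point to Weka's SMO; as an illustrative sketch in Python, scikit-learn's SVC fits the same kind of maximum-margin linear classifier (assumes scikit-learn is installed):

    # Hard-margin linear SVM, approximated with a large C.
    from sklearn.svm import SVC

    X = [[2, 1], [3, 3], [0, 0], [1, 0]]
    y = [1, 1, -1, -1]
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    print(clf.coef_, clf.intercept_)  # the separating line
    print(clf.support_vectors_)       # the examples that pin down the margin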

If not Linearly Separable
1. Add more nodes: Neural Nets
   1. Can represent any Boolean function: why?
   2. No guarantees about learning
   3. Slow
   4. Incomprehensible
2. Add more features: SVM
   1. Can represent any Boolean function
   2. Learning guarantees
   3. Fast
   4. Semi-comprehensible

Adding features
• Suppose a point (x, y) is positive if it lies in the unit disk, else negative.
• Clearly not linearly separable.
• Map (x, y) -> (x, y, x^2 + y^2).
• Now, in 3-space, the data is easily separable.
• This works for any learning algorithm, but an SVM will almost do it for you (set its parameters).
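
A small sketch of this lift (the random sample points are illustrative; after the mapping, the plane z = 1 separates the classes exactly):

    import random

    # Points in the unit disk are +, points outside are -.
    pts = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(1000)]
    labels = [1 if x * x + y * y <= 1 else -1 for x, y in pts]

    # Lift (x, y) -> (x, y, x^2 + y^2); a linear rule in 3-space now suffices:
    # predict + iff z <= 1, i.e. the plane z = 1 is the separator.
    lifted = [(x, y, x * x + y * y) for x, y in pts]
    preds = [1 if z <= 1 else -1 for _, _, z in lifted]
    print(all(p == l for p, l in zip(preds, labels)))  # True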