Support Vector Machines

nikitabhagat28 · 47 slides · Oct 09, 2024

Machine Learning:
k-Nearest Neighbor and
Support Vector Machines
skim 20.4, 20.6-20.7
CMSC 471

Revised End-of-Semester Schedule

Wed 11/21   Machine Learning IV

Mon 11/26   Philosophy of AI (You must read the three articles!)

Wed 11/28   Special Topics

Mon 12/3    Special Topics

Wed 12/5    Review / Tournament dry run #2 (HW6 due)

Mon 12/10   Tournament

Wed 12/19   FINAL EXAM (1:00pm - 3:00pm) (Project and final report due)
NO LATE SUBMISSIONS ALLOWED!

Special Topics

Robotics

AI in Games

Natural language processing

Multi-agent systems

k-Nearest Neighbor
Instance-Based Learning
Some material adapted from slides by Andrew Moore, CMU.
Visit http://www.autonlab.org/tutorials/ for
Andrew's repository of Data Mining tutorials.

1-Nearest Neighbor1-Nearest Neighbor

One of the simplest of all machine One of the simplest of all machine
learning classifierslearning classifiers

Simple idea: label a new point the same as Simple idea: label a new point the same as
the closest known pointthe closest known point
Label it red.

1-Nearest Neighbor

A type of instance-based learning

Also known as "memory-based" learning

Forms a Voronoi tessellation of the instance space
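
A minimal sketch of the 1-NN rule in NumPy (the toy points and labels below are illustrative, not from the slides):

```python
import numpy as np

def one_nn_predict(train_X, train_y, query):
    """Label a query point the same as its single closest training point."""
    # Euclidean distances from the query to every stored instance
    dists = np.linalg.norm(train_X - query, axis=1)
    return train_y[np.argmin(dists)]

# Toy data: two red points and two blue points in the plane
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array(["red", "red", "blue", "blue"])

print(one_nn_predict(train_X, train_y, np.array([0.5, 0.2])))  # -> "red"
```

Each prediction scans the whole stored training set, which is exactly the "memory-based" flavor of the method.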

Distance Metrics

Different metrics can change the decision surface

Standard Euclidean distance metric:
Two-dimensional: Dist(a,b) = sqrt((a1 - b1)^2 + (a2 - b2)^2)
Multivariate: Dist(a,b) = sqrt(∑ (ai - bi)^2)

[Figures: decision surfaces for the same data under two metrics:
Dist(a,b) = (a1 - b1)^2 + (a2 - b2)^2
Dist(a,b) = (a1 - b1)^2 + (3a2 - 3b2)^2]
Adapted from “Instance-Based Learning”
lecture slides by Andrew Moore, CMU.
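
A small sketch of how the two metrics above can disagree about which stored point is nearest, and hence change the decision surface; the points are made up for illustration:

```python
import numpy as np

def dist_plain(a, b):
    # (a1 - b1)^2 + (a2 - b2)^2  (squared Euclidean)
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def dist_scaled(a, b):
    # (a1 - b1)^2 + (3*a2 - 3*b2)^2  -- weights the second feature 3x
    return (a[0] - b[0]) ** 2 + (3 * a[1] - 3 * b[1]) ** 2

a = np.array([0.0, 0.0])
p = np.array([2.0, 0.0])   # far in x1, identical in x2
q = np.array([0.0, 1.0])   # identical in x1, off by 1 in x2

# Under the plain metric q is closer to a; under the scaled metric p is closer.
print(dist_plain(a, p), dist_plain(a, q))    # 4.0 1.0
print(dist_scaled(a, p), dist_scaled(a, q))  # 4.0 9.0
```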

Four Aspects of an Instance-Based Learner:
1. A distance metric
2. How many nearby neighbors to look at?
3. A weighting function (optional)
4. How to fit with the local points?
Adapted from “Instance-Based Learning”
lecture slides by Andrew Moore, CMU.

1-NN's Four Aspects as an Instance-Based Learner:
1. A distance metric
   Euclidean
2. How many nearby neighbors to look at?
   One
3. A weighting function (optional)
   Unused
4. How to fit with the local points?
   Just predict the same output as the nearest neighbor.
Adapted from “Instance-Based Learning”
lecture slides by Andrew Moore, CMU.

Zen Gardens
Mystery of renowned zen garden revealed [CNN Article]
Thursday, September 26, 2002 Posted: 10:11 AM EDT (1411 GMT)
LONDON (Reuters) -- For centuries visitors to the renowned Ryoanji Temple garden in
Kyoto, Japan have been entranced and mystified by the simple arrangement of rocks.
The five sparse clusters on a rectangle of raked gravel are said to be pleasing to the eyes
of the hundreds of thousands of tourists who visit the garden each year.
Scientists in Japan said on Wednesday they now believe they have discovered its
mysterious appeal.
"We have uncovered the implicit structure of the Ryoanji garden's visual ground and
have shown that it includes an abstract, minimalist depiction of natural scenery," said
Gert Van Tonder of Kyoto University.
The researchers discovered that the empty space of the garden evokes a hidden image of a
branching tree that is sensed by the unconscious mind.
"We believe that the unconscious perception of this pattern contributes to the enigmatic
appeal of the garden," Van Tonder added.
He and his colleagues believe that whoever created the garden during the Muromachi era
between 1333-1573 knew exactly what they were doing and placed the rocks around the
tree image.
By using a concept called medial-axis transformation, the scientists showed that the
hidden branched tree converges on the main area from which the garden is viewed.
The trunk leads to the prime viewing site in the ancient temple that once overlooked the
garden. It is thought that abstract art may have a similar impact.
"There is a growing realisation that scientific analysis can reveal unexpected structural
features hidden in controversial abstract paintings," Van Tonder said.
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.

k-Nearest Neighbor

Generalizes 1-NN to smooth away noise in the labels

A new point is now assigned the most frequent label of its k nearest neighbors
Label it red, when k = 3
Label it blue, when k = 7
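
A minimal k-NN sketch in NumPy matching the majority-vote rule above; the toy data is arranged so that k = 3 and k = 7 give different answers, as on the slide:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Assign the most frequent label among the k nearest training points."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]

# Three "red" points close to the origin, five "blue" points a bit further out
train_X = np.array([[1, 0], [0, 1], [-1, 0],
                    [0, 2], [2, 0], [0, -2], [-2, 0], [2, 2]], dtype=float)
train_y = np.array(["red"] * 3 + ["blue"] * 5)

query = np.array([0.0, 0.0])
print(knn_predict(train_X, train_y, query, k=3))  # "red"
print(knn_predict(train_X, train_y, query, k=7))  # "blue" (4 of the 7 nearest are blue)
```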

k-Nearest Neighbor (k = 9)

[Figures: 9-NN fits to three noisy regression datasets, with this commentary:]
A magnificent job of noise smoothing. Three cheers for 9-nearest-neighbor.
But the lack of gradients and the jerkiness isn't good.
Appalling behavior! Loses all the detail that 1-nearest-neighbor would give. The tails are horrible!
Fits much less of the noise, captures trends. But still, frankly, pathetic compared with linear regression.
Adapted from “Instance-Based Learning”
lecture slides by Andrew Moore, CMU.

Support Vector Machines and Kernels
Adapted from slides by Tim Oates
Cognition, Robotics, and Learning (CORAL) Lab
University of Maryland Baltimore County
Doing Really Well with Linear Decision Surfaces

Outline

Prediction

Why might predictions be wrong?

Support vector machines

Doing really well with linear models

Kernels

Making the non-linear linear

Supervised ML = Prediction

Given training instances (x, y)

Learn a model f

Such that f(x) = y

Use f to predict y for new x

Many variations on this basic theme
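
A minimal sketch of this fit/predict loop, assuming scikit-learn is available; the estimator and toy data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training instances (x, y)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Learn a model f such that f(x) ~= y ...
f = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# ... then use f to predict y for a new x
print(f.predict([[2.5]]))  # [1]
```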

Why might predictions be wrong?

True Non-Determinism

Flip a biased coin

p(heads) = θ

Estimate θ̂

If θ̂ > 0.5 predict heads, else tails

Lots of ML research on problems like this

Learn a model

Do the best you can in expectation

Why might predictions be wrong?

Partial Observability

Something needed to predict y is missing from observation x

N-bit parity problem

x contains N-1 bits (hard PO)

x contains N bits but learner ignores some of them (soft PO)
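
A small sketch of hard partial observability on the parity problem: once one bit is hidden, the observed bits carry no information about the label. The data here is randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
X = rng.integers(0, 2, size=(1000, N))   # N-bit inputs
y = X.sum(axis=1) % 2                    # parity label

# Hard partial observability: drop one bit from the observation
X_obs = X[:, : N - 1]

# For any fixed observed prefix, the label still depends on the hidden bit,
# so the best achievable accuracy from X_obs alone is ~50%.
prefix = X_obs[0]
mask = np.all(X_obs == prefix, axis=1)
print(y[mask].mean())   # close to 0.5
```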

Why might predictions be wrong?

True non-determinism

Partial observability

hard, soft

Representational bias

Algorithmic bias

Bounded resources

Representational Bias

Having the right features (x) is crucial

[Figure: the pattern X O O O O X X X along a single feature axis, which no single threshold can separate]

Support Vector Machines
Doing Really Well with Linear Decision Surfaces

Strengths of SVMs

Good generalization in theory

Good generalization in practice

Work well with few training instances

Find globally best model

Efficient algorithms

Amenable to the kernel trick

Linear Separators

Training instances

x ∈ ℜ^n

y ∈ {-1, 1}

w ∈ ℜ^n

b ∈ ℜ

Hyperplane

<w, x> + b = 0
w1x1 + w2x2 + ... + wnxn + b = 0

Decision function

f(x) = sign(<w, x> + b)

Math Review
Inner (dot) product:
<a, b> = a · b = ∑ ai*bi = a1b1 + a2b2 + ... + anbn
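
A tiny NumPy sketch of the decision function f(x) = sign(<w, x> + b); the weight vector, bias, and query points are made up for illustration:

```python
import numpy as np

def linear_decision(w, b, x):
    """f(x) = sign(<w, x> + b): +1 on one side of the hyperplane, -1 on the other."""
    return int(np.sign(np.dot(w, x) + b))

# Hyperplane x1 + x2 - 1 = 0 in the plane (w = (1, 1), b = -1)
w, b = np.array([1.0, 1.0]), -1.0
print(linear_decision(w, b, np.array([2.0, 2.0])))   # +1
print(linear_decision(w, b, np.array([0.0, 0.0])))   # -1
```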

Intuitions

[Figures: four slides showing different candidate linear separators drawn through the same X and O training points]

A "Good" Separator

[Figure: the X and O training points with a well-placed linear separator]

Noise in the Observations

[Figure: the same X and O data with noisy observations near the boundary]

Ruling Out Some Separators

[Figure: separators that pass too close to the X and O points are ruled out]

Lots of Noise

[Figure: the X and O data under heavy observation noise]

Maximizing the Margin

[Figure: the separator chosen to maximize the margin to the nearest X and O points]

"Fat" Separators

[Figure: separators drawn as thick bands; the widest band that still separates the X and O points]

Support Vectors

[Figure: the X and O points lying on the margin boundary, i.e. the support vectors]

The Math

Training instances

x ∈ ℜ^n

y ∈ {-1, 1}

Decision function

f(x) = sign(<w, x> + b)

w ∈ ℜ^n

b ∈ ℜ

Find w and b that

Perfectly classify training instances

Assuming linear separability

Maximize margin

The Math

For perfect classification, we want
yi (<w, xi> + b) ≥ 0 for all i

Why?

To maximize the margin, we want
w that minimizes |w|^2
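
A small NumPy check on toy data (made up for illustration). In the usual normalization the constraint is tightened so the closest points satisfy yi (<w, xi> + b) = 1, and the margin width is then 2/|w|, which is why minimizing |w|^2 maximizes the margin:

```python
import numpy as np

# Toy separable data with labels y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A candidate hyperplane, scaled so the closest points satisfy y*(<w,x>+b) = 1
w = np.array([0.25, 0.25])
b = 0.0

# Perfect classification: every training instance lies on its correct side
print(np.all(y * (X @ w + b) >= 1))   # True

# Geometric margin width is 2 / |w|; a smaller |w| means a wider margin
print(2 / np.linalg.norm(w))          # ~5.66
```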

Dual Optimization Problem

Maximize over α
W(α) = ∑i αi - 1/2 ∑i,j αi αj yi yj <xi, xj>

Subject to
αi ≥ 0
∑i αi yi = 0

Decision function
f(x) = sign(∑i αi yi <x, xi> + b)
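
In practice the dual is solved by a library. A sketch using scikit-learn's SVC (assumed available; a large C approximates the hard-margin problem) shows where the support vectors and the products αi·yi surface:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)       # the xi with alpha_i > 0
print(clf.dual_coef_)             # the products alpha_i * y_i for those xi
print(clf.coef_, clf.intercept_)  # w and b, with w = sum_i alpha_i y_i xi
```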

Strengths of SVMs

Good generalization in theory

Good generalization in practice

Work well with few training instances

Find globally best model

Efficient algorithms

Amenable to the kernel trick …

What if Surface is Non-Linear?

[Figure: a cluster of X points surrounded by O points, so no straight line can separate the classes]
Image from http://www.atrandomresearch.com/iclass/

Kernel Methods
Making the Non-Linear Linear

When Linear Separators Fail

[Figure: the pattern X O O O O X X X plotted against (x1, x2) is not linearly separable, but replotted against (x1, x1^2) it is]

Mapping into a New Feature Space

Rather than run SVM on xi, run it on Φ(xi)

Find non-linear separator in input space

What if Φ(xi) is really big?

Use kernels to compute it implicitly!

Φ: x → X = Φ(x)
Φ(x1, x2) = (x1, x2, x1^2, x2^2, x1x2)
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
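
A sketch of the explicit quadratic map Φ(x1, x2) = (x1, x2, x1^2, x2^2, x1x2); the separator weights below are made up to show that a concept that is non-linear in (x1, x2) becomes linear in the mapped features:

```python
import numpy as np

def phi(x):
    """Phi(x1, x2) = (x1, x2, x1^2, x2^2, x1*x2), the quadratic feature map above."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# Hypothetical linear separator in the mapped space: it tests x1^2 + x2^2 - 1,
# i.e. inside vs. outside the unit circle -- clearly non-linear in (x1, x2).
w, b = np.array([0.0, 0.0, 1.0, 1.0, 0.0]), -1.0

for x in [np.array([0.1, 0.2]), np.array([2.0, 0.0])]:
    print(int(np.sign(np.dot(w, phi(x)) + b)))  # -1 (inside), +1 (outside)
```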

Kernels

Find kernel K such that
K(x1, x2) = <Φ(x1), Φ(x2)>

Computing K(x1, x2) should be efficient, much more so than computing Φ(x1) and Φ(x2)

Use K(x1, x2) in SVM algorithm rather than <x1, x2>

Remarkably, this is possible

The Polynomial Kernel

K(x1, x2) = <x1, x2>^2

x1 = (x11, x12)
x2 = (x21, x22)

<x1, x2> = (x11x21 + x12x22)

<x1, x2>^2 = (x11^2 x21^2 + x12^2 x22^2 + 2 x11 x12 x21 x22)

Φ(x1) = (x11^2, x12^2, √2 x11 x12)
Φ(x2) = (x21^2, x22^2, √2 x21 x22)

K(x1, x2) = <Φ(x1), Φ(x2)>
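
A quick numeric check of the identity above, with arbitrary values:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x1, x2):
    """K(x1, x2) = <x1, x2>^2, computed without ever forming phi."""
    return np.dot(x1, x2) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])

# Both routes give the same number: the kernel computes the feature-space
# inner product implicitly.
print(poly_kernel(x1, x2))        # 121.0
print(np.dot(phi(x1), phi(x2)))   # 121.0
```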

The Polynomial Kernel

Φ(x) contains all monomials of degree d

Useful in visual pattern recognition

Number of monomials

16x16 pixel image

10^10 monomials of degree 5

Never explicitly compute Φ(x)!

Variation - K(x1, x2) = (<x1, x2> + 1)^2

A Few Good Kernels

Dot product kernel

K(x1, x2) = <x1, x2>

Polynomial kernel

K(x1, x2) = <x1, x2>^d (monomials of degree d)

K(x1, x2) = (<x1, x2> + 1)^d (all monomials of degree 1, 2, …, d)

Gaussian kernel

K(x1, x2) = exp(-|x1 - x2|^2 / 2σ^2)

Radial basis functions

Sigmoid kernel

K(x1, x2) = tanh(<x1, x2> + θ)

Neural networks

Establishing "kernel-hood" from first principles is non-trivial
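
Illustrative NumPy implementations of these kernels (the default parameter values for d, σ, and θ are free choices, not from the slides):

```python
import numpy as np

def dot_kernel(x1, x2):
    # K(x1, x2) = <x1, x2>
    return np.dot(x1, x2)

def poly_kernel(x1, x2, d=2, c=0.0):
    # c = 0: monomials of degree d; c = 1: monomials up to degree d
    return (np.dot(x1, x2) + c) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    # K(x1, x2) = exp(-|x1 - x2|^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(x1, x2, theta=0.0):
    # K(x1, x2) = tanh(<x1, x2> + theta); not positive definite for all choices
    return np.tanh(np.dot(x1, x2) + theta)
```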

The Kernel Trick

"Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2"

SVMs can use the kernel trick

Using a Different Kernel in the Dual Optimization Problem

For example, using the polynomial kernel with d = 4 (including lower-order terms).

Maximize over α

W(α) = ∑i αi - 1/2 ∑i,j αi αj yi yj <xi, xj>
with <xi, xj> replaced by (<xi, xj> + 1)^4

Subject to

αi ≥ 0

∑i αi yi = 0

Decision function

f(x) = sign(∑i αi yi <x, xi> + b)
with <x, xi> replaced by (<x, xi> + 1)^4

These are kernels!
So by the kernel trick, we just replace them!
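
A sketch of the same substitution via scikit-learn's SVC (assumed available): kernel='poly' with degree=4, coef0=1, gamma=1 corresponds to (<xi, xj> + 1)^4. The XOR-style toy data, which no linear separator handles, is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the input space
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])

# scikit-learn's polynomial kernel is (gamma*<xi, xj> + coef0)^degree,
# so gamma=1, coef0=1, degree=4 matches the slide's (<xi, xj> + 1)^4.
clf = SVC(kernel="poly", degree=4, coef0=1, gamma=1, C=1e6).fit(X, y)

print(clf.predict(X))   # should recover the XOR labels: [-1 -1  1  1]
```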

Conclusion

SVMs find optimal linear separator

The kernel trick makes SVMs non-linear learning algorithms