course slides of Support-Vector-Machine.pdf

Slide Content

Support Vector Machines

Here we approach the two-class classification problem in a direct way:

We try and find a plane that separates the classes in feature space.

If we cannot, we get creative in two ways:
We soften what we mean by "separates", and
We enrich and enlarge the feature space so that separation is possible.

What is a Hyperplane?

A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p - 1$.

In general the equation for a hyperplane has the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0.$$

In $p = 2$ dimensions a hyperplane is a line.

If $\beta_0 = 0$, the hyperplane goes through the origin, otherwise not.

The vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is called the normal vector: it points in a direction orthogonal to the surface of the hyperplane.

Hyperplane in 2 Dimensions

[Figure: a hyperplane in the $(X_1, X_2)$ plane given by $\beta_1 X_1 + \beta_2 X_2 - 6 = 0$, with normal vector $\beta = (\beta_1, \beta_2) = (0.8, 0.6)$; points with $\beta_1 X_1 + \beta_2 X_2 - 6 = 1.6$ lie on one side and points with $\beta_1 X_1 + \beta_2 X_2 - 6 = -4$ on the other.]

Separating Hyperplanes

[Figure: two panels of two-class data in the $(X_1, X_2)$ plane with separating hyperplanes.]

If $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, then $f(X) > 0$ for points on one side of the hyperplane, and $f(X) < 0$ for points on the other.

If we code the colored points as $Y_i = +1$ for blue, say, and $Y_i = -1$ for mauve, then if $Y_i f(X_i) > 0$ for all $i$, $f(X) = 0$ defines a separating hyperplane.

Maximal Margin Classifier

Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes.

[Figure: separable two-class data in the $(X_1, X_2)$ plane with the maximal margin hyperplane and its margins.]

Constrained optimization problem:
$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M \;\; \text{for all } i = 1, \ldots, N.$$

This can be rephrased as a convex quadratic program and solved efficiently; the function svm() in package e1071 does exactly that.
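The slide points to svm() in the R package e1071; below is a minimal sketch of such a call (the toy data and the specific cost value are my own illustration, not from the slides). A very large cost leaves essentially no budget for margin violations, so the fit approximates the maximal margin classifier.

# A minimal sketch (toy data, not from the slides): a linear support vector
# classifier fit with e1071::svm on separable two-class data.
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)                  # 20 observations, p = 2 features
y <- factor(rep(c("blue", "mauve"), each = 10))   # two classes, as in the slides
x[y == "blue", ] <- x[y == "blue", ] + 3          # shift one class so the data separate
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# A very large cost leaves essentially no budget for margin violations,
# so the fit approximates the maximal margin classifier.
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)

summary(fit)        # number of support vectors, etc.
fit$index           # which training observations are support vectors
predict(fit, dat)   # fitted class labels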


Non-separable Data

[Figure: two-class data in the $(X_1, X_2)$ plane that no linear boundary can separate.]

The data on the left are not separable by a linear boundary.

This is often the case, unless $N < p$.

Noisy Data

[Figure: two panels of separable but noisy two-class data in the $(X_1, X_2)$ plane.]

Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal-margin classifier.

The support vector classifier maximizes a soft margin.


Support Vector Classifier

[Figure: two panels in the $(X_1, X_2)$ plane with numbered observations, showing the soft-margin classifier and the points that violate the margin.]

$$\underset{\beta_0, \beta_1, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1,$$
$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.$$

$C$ is a regularization parameter

[Figure: four panels showing the support vector classifier fit to the same data in the $(X_1, X_2)$ plane for different values of the regularization parameter $C$.]
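A minimal sketch of how one might choose this tuning parameter in practice, assuming the toy data frame dat from the earlier sketch (not from the slides). Note that e1071's cost argument penalizes margin violations, so it plays roughly the inverse role of the budget $C$ in the formulation above.

# A minimal sketch (assumed data frame dat): cross-validating over cost
# with e1071::tune().
library(e1071)

set.seed(1)
tuned <- tune(svm, y ~ ., data = dat, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))

summary(tuned)            # cross-validation error for each value of cost
best <- tuned$best.model  # the model refit at the selected cost
summary(best)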

Linear boundary can fail

[Figure: two-class data in the $(X_1, X_2)$ plane whose true class boundary is nonlinear.]

Sometimes a linear boundary simply won't work, no matter what value of $C$.

The example on the left is such a case.

What to do?

Feature Expansion

Enlarge the space of features by including transformations; e.g. $X_1^2$, $X_1^3$, $X_1 X_2$, $X_1 X_2^2$, $\ldots$. Hence go from a $p$-dimensional space to an $M > p$ dimensional space.

Fit a support-vector classifier in the enlarged space.

This results in non-linear decision boundaries in the original space.

Example: Suppose we use $(X_1, X_2, X_1^2, X_2^2, X_1 X_2)$ instead of just $(X_1, X_2)$. Then the decision boundary would be of the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 = 0.$$

This leads to nonlinear decision boundaries in the original space (quadratic conic sections).
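A minimal sketch of this idea (toy data, not from the slides): add the quadratic terms by hand and fit a linear support vector classifier in the enlarged space.

# A minimal sketch (toy data with a circular class boundary): a *linear*
# classifier in the enlarged 5-feature space gives a quadratic boundary
# in the original (x1, x2) space.
library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)                                   # 200 observations
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "outer", "inner"))    # nonlinear boundary
dat2 <- data.frame(x1 = x[, 1], x2 = x[, 2],
                   x1sq = x[, 1]^2, x2sq = x[, 2]^2, x1x2 = x[, 1] * x[, 2],
                   y = y)

fit2 <- svm(y ~ x1 + x2 + x1sq + x2sq + x1x2, data = dat2,
            kernel = "linear", cost = 1)
mean(predict(fit2, dat2) == dat2$y)   # training accuracy of the enlarged-space fit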


Cubic Polynomials

Here we use a basis expansion of cubic polynomials: from 2 variables to 9.

The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space.

[Figure: two panels in the $(X_1, X_2)$ plane showing the resulting nonlinear decision boundary.]

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \beta_6 X_1^3 + \beta_7 X_2^3 + \beta_8 X_1 X_2^2 + \beta_9 X_1^2 X_2 = 0$$


Nonlinearities and Kernels

Polynomials (especially high-dimensional ones) get wild rather fast.

There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers: through the use of kernels.

Before we discuss these, we must understand the role of inner products in support-vector classifiers.

Inner products and support vectors

$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j} \qquad \text{(inner product between vectors)}$$

The linear support vector classifier can be represented as
$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \qquad (n \text{ parameters}).$$

To estimate the parameters $\alpha_1, \ldots, \alpha_n$ and $\beta_0$, all we need are the $\binom{n}{2}$ inner products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations.

It turns out that most of the $\hat\alpha_i$ can be zero:
$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat\alpha_i \langle x, x_i \rangle,$$
where $\mathcal{S}$ is the support set of indices $i$ such that $\hat\alpha_i > 0$.
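A small illustrative sketch (toy data; the interpretation of the fitted object's fields is my assumption, not stated in the slides): the $n \times n$ matrix of pairwise inner products is all the fit needs, and only the support set ends up with nonzero coefficients.

# A minimal sketch (toy data): pairwise inner products and the support set
# of a fitted linear e1071::svm.
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- factor(rep(c("blue", "mauve"), each = 10))
x[y == "blue", ] <- x[y == "blue", ] + 2

K <- x %*% t(x)     # n x n matrix of pairwise inner products <x_i, x_i'>
dim(K)              # 20 x 20

fit <- svm(x, y, kernel = "linear", cost = 10, scale = FALSE)
fit$index           # the support set S: observations with nonzero alpha-hat
fit$coefs           # corresponding nonzero coefficients (assumed y_i * alpha_i-hat)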


Kernels and Support Vector Machines

If we can compute inner products between observations, we can fit a SV classifier. Can be quite abstract!

Some special kernel functions can do this for us. E.g.
$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^{d}$$
computes the inner products needed for $d$-dimensional polynomials: $\binom{p+d}{d}$ basis functions! Try it for $p = 2$ and $d = 2$.

The solution has the form
$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat\alpha_i K(x, x_i).$$
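The slide suggests trying $p = 2$ and $d = 2$; here is that expansion, a worked check that is not part of the original slides:
$$(1 + x_1 z_1 + x_2 z_2)^2 = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 = \langle \varphi(x), \varphi(z) \rangle,$$
with
$$\varphi(u) = \bigl(1,\ \sqrt{2}\,u_1,\ \sqrt{2}\,u_2,\ u_1^2,\ u_2^2,\ \sqrt{2}\,u_1 u_2\bigr),$$
so the degree-2 polynomial kernel computes the inner product in a $\binom{2+2}{2} = 6$-dimensional basis without ever constructing $\varphi$ explicitly.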


Radial Kernel

$$K(x_i, x_{i'}) = \exp\!\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right)$$

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat\alpha_i K(x, x_i)$$

[Figure: two panels in the $(X_1, X_2)$ plane showing the nonlinear decision boundary produced by the radial kernel.]

Implicit feature space; very high dimensional.

Controls variance by squashing down most dimensions severely.
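A minimal sketch of a radial-kernel fit with e1071::svm (toy data and parameter values are illustrative, not from the slides; gamma and cost would normally be tuned).

# A minimal sketch (toy data): radial-kernel SVM with e1071::svm.
library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "outer", "inner"))
dat3 <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit3 <- svm(y ~ ., data = dat3, kernel = "radial", gamma = 1, cost = 1)
plot(fit3, dat3)                        # nonlinear decision boundary in (x1, x2)
mean(predict(fit3, dat3) == dat3$y)     # training accuracy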

Example: Heart Data

[Figure: ROC curves (true positive rate vs. false positive rate) on the Heart training data. Left panel: support vector classifier vs. LDA. Right panel: support vector classifier vs. SVMs with radial kernel, $\gamma = 10^{-3}, 10^{-2}, 10^{-1}$.]

The ROC curve is obtained by changing the threshold 0 to threshold $t$ in $\hat f(X) > t$, and recording the false positive and true positive rates as $t$ varies. Here we see ROC curves on training data.

Example continued: Heart Test Data

[Figure: the same ROC curves evaluated on the Heart test data. Left panel: support vector classifier vs. LDA. Right panel: support vector classifier vs. SVMs with radial kernel, $\gamma = 10^{-3}, 10^{-2}, 10^{-1}$.]

SVMs: more than 2 classes?

The SVM as defined works for $K = 2$ classes. What do we do if we have $K > 2$ classes?

OVA: Fit $K$ different 2-class SVM classifiers $\hat f_k(x)$, $k = 1, \ldots, K$, each class versus the rest. Classify $x^*$ to the class for which $\hat f_k(x^*)$ is largest.

OVO: Fit $\binom{K}{2}$ pairwise classifiers $\hat f_{k\ell}(x)$. Classify $x^*$ to the class that wins the most pairwise competitions.

Which to choose? If $K$ is not too large, use OVO.
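For context, e1071::svm accepts a factor response with more than two levels directly; as far as I know, the underlying libsvm library uses the one-versus-one scheme described above. A minimal sketch on the built-in iris data (an illustration, not from the slides):

# A minimal sketch: K = 3 classes, so choose(3, 2) = 3 pairwise classifiers
# are fit internally (one-versus-one).
library(e1071)

data(iris)
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predicted = predict(fit, iris), truth = iris$Species)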


Support Vector versus Logistic Regression?

With $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, we can rephrase the support-vector classifier optimization as
$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{minimize}} \left\{ \sum_{i=1}^{n} \max\bigl[0,\, 1 - y_i f(x_i)\bigr] + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

[Figure: SVM loss and logistic regression loss plotted against $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$.]

This has the form loss plus penalty.

The loss is known as the hinge loss.

Very similar to the "loss" in logistic regression (negative log-likelihood).
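A minimal base-R sketch (not from the slides) that plots the hinge loss alongside the logistic regression loss as functions of $y \cdot f(x)$, with the classes coded as $y \in \{-1, +1\}$:

# Hinge loss vs. logistic (negative log-likelihood) loss as functions of y * f(x).
yf <- seq(-6, 2, length.out = 400)

hinge    <- pmax(0, 1 - yf)         # SVM hinge loss
logistic <- log(1 + exp(-yf))       # logistic regression loss for y in {-1, +1}

plot(yf, hinge, type = "l", xlab = "y * f(x)", ylab = "Loss")
lines(yf, logistic, lty = 2)
legend("topright", legend = c("SVM (hinge) loss", "Logistic regression loss"),
       lty = c(1, 2))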

Which to use: SVM or Logistic Regression?

When classes are (nearly) separable, SVM does better than LR. So does LDA.

When not, LR (with ridge penalty) and SVM very similar.

If you wish to estimate probabilities, LR is the choice.

For nonlinear boundaries, kernel SVMs are popular. Can use kernels with LR and LDA as well, but computations are more expensive.