course slides of Support-Vector-Machine.pdf

Slide Content

Support Vector Machines

Here we approach the two-class classification problem in a direct way:

We try and find a plane that separates the classes in feature space.

If we cannot, we get creative in two ways:
We soften what we mean by "separates", and
We enrich and enlarge the feature space so that separation is possible.

What is a Hyperplane?

A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p - 1$.

In general the equation for a hyperplane has the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0.$$

In $p = 2$ dimensions a hyperplane is a line.

If $\beta_0 = 0$, the hyperplane goes through the origin, otherwise not.

The vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is called the normal vector: it points in a direction orthogonal to the surface of the hyperplane.

Hyperplane in 2 Dimensions

[Figure: a hyperplane in the $(X_1, X_2)$ plane given by $\beta_1 X_1 + \beta_2 X_2 - 6 = 0$, with normal vector $\beta = (\beta_1, \beta_2) = (0.8, 0.6)$; points with $\beta_1 X_1 + \beta_2 X_2 - 6 = 1.6$ lie on one side and points with $\beta_1 X_1 + \beta_2 X_2 - 6 = -4$ on the other.]

Separating Hyperplanes

[Figure: two panels of two-class data in the $(X_1, X_2)$ plane with separating hyperplanes.]

If $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, then $f(X) > 0$ for points on one side of the hyperplane, and $f(X) < 0$ for points on the other.

If we code the colored points as $Y_i = +1$ for blue, say, and $Y_i = -1$ for mauve, then if $Y_i f(X_i) > 0$ for all $i$, $f(X) = 0$ defines a separating hyperplane.

Maximal Margin Classifier

Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes.

[Figure: separable two-class data in the $(X_1, X_2)$ plane with the maximal margin hyperplane and its margins.]

Constrained optimization problem:
$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M \;\; \text{for all } i = 1, \ldots, N.$$

This can be rephrased as a convex quadratic program and solved efficiently; the function svm() in package e1071 does exactly that.
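The slide points to svm() in the R package e1071; below is a minimal sketch of such a call (the toy data and the specific cost value are my own illustration, not from the slides). A very large cost leaves essentially no budget for margin violations, so the fit approximates the maximal margin classifier.

# A minimal sketch (toy data, not from the slides): a linear support vector
# classifier fit with e1071::svm on separable two-class data.
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)                  # 20 observations, p = 2 features
y <- factor(rep(c("blue", "mauve"), each = 10))   # two classes, as in the slides
x[y == "blue", ] <- x[y == "blue", ] + 3          # shift one class so the data separate
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# A very large cost leaves essentially no budget for margin violations,
# so the fit approximates the maximal margin classifier.
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)

summary(fit)        # number of support vectors, etc.
fit$index           # which training observations are support vectors
predict(fit, dat)   # fitted class labels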


Non-separable Data

[Figure: two-class data in the $(X_1, X_2)$ plane that no linear boundary can separate.]

The data on the left are not separable by a linear boundary.

This is often the case, unless $N < p$.

Noisy Data

[Figure: two panels of separable but noisy two-class data in the $(X_1, X_2)$ plane.]

Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal-margin classifier.

The support vector classifier maximizes a soft margin.


Support Vector Classifier

[Figure: two panels in the $(X_1, X_2)$ plane with numbered observations, showing the soft-margin classifier and the points that violate the margin.]

$$\underset{\beta_0, \beta_1, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n}{\text{maximize}} \; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1,$$
$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.$$

$C$ is a regularization parameter

[Figure: four panels showing the support vector classifier fit to the same data in the $(X_1, X_2)$ plane for different values of the regularization parameter $C$.]
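A minimal sketch of how one might choose this tuning parameter in practice, assuming the toy data frame dat from the earlier sketch (not from the slides). Note that e1071's cost argument penalizes margin violations, so it plays roughly the inverse role of the budget $C$ in the formulation above.

# A minimal sketch (assumed data frame dat): cross-validating over cost
# with e1071::tune().
library(e1071)

set.seed(1)
tuned <- tune(svm, y ~ ., data = dat, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))

summary(tuned)            # cross-validation error for each value of cost
best <- tuned$best.model  # the model refit at the selected cost
summary(best)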

Linear boundary can fail

[Figure: two-class data in the $(X_1, X_2)$ plane whose true class boundary is nonlinear.]

Sometimes a linear boundary simply won't work, no matter what value of $C$.

The example on the left is such a case.

What to do?

Feature Expansion

Enlarge the space of features by including transformations; e.g. $X_1^2$, $X_1^3$, $X_1 X_2$, $X_1 X_2^2$, $\ldots$. Hence go from a $p$-dimensional space to an $M > p$ dimensional space.

Fit a support-vector classifier in the enlarged space.

This results in non-linear decision boundaries in the original space.

Example: Suppose we use $(X_1, X_2, X_1^2, X_2^2, X_1 X_2)$ instead of just $(X_1, X_2)$. Then the decision boundary would be of the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 = 0.$$

This leads to nonlinear decision boundaries in the original space (quadratic conic sections).
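A minimal sketch of this idea (toy data, not from the slides): add the quadratic terms by hand and fit a linear support vector classifier in the enlarged space.

# A minimal sketch (toy data with a circular class boundary): a *linear*
# classifier in the enlarged 5-feature space gives a quadratic boundary
# in the original (x1, x2) space.
library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)                                   # 200 observations
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "outer", "inner"))    # nonlinear boundary
dat2 <- data.frame(x1 = x[, 1], x2 = x[, 2],
                   x1sq = x[, 1]^2, x2sq = x[, 2]^2, x1x2 = x[, 1] * x[, 2],
                   y = y)

fit2 <- svm(y ~ x1 + x2 + x1sq + x2sq + x1x2, data = dat2,
            kernel = "linear", cost = 1)
mean(predict(fit2, dat2) == dat2$y)   # training accuracy of the enlarged-space fit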


Cubic Polynomials

Here we use a basis expansion of cubic polynomials: from 2 variables to 9.

The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space.

[Figure: two panels in the $(X_1, X_2)$ plane showing the resulting nonlinear decision boundary.]

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \beta_6 X_1^3 + \beta_7 X_2^3 + \beta_8 X_1 X_2^2 + \beta_9 X_1^2 X_2 = 0$$


Nonlinearities and Kernels

Polynomials (especially high-dimensional ones) get wild rather fast.

There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers: through the use of kernels.

Before we discuss these, we must understand the role of inner products in support-vector classifiers.

Inner products and support vectors

$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j} \qquad \text{(inner product between vectors)}$$

The linear support vector classifier can be represented as
$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \qquad (n \text{ parameters}).$$

To estimate the parameters $\alpha_1, \ldots, \alpha_n$ and $\beta_0$, all we need are the $\binom{n}{2}$ inner products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations.

It turns out that most of the $\hat\alpha_i$ can be zero:
$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat\alpha_i \langle x, x_i \rangle,$$
where $\mathcal{S}$ is the support set of indices $i$ such that $\hat\alpha_i > 0$.
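A small illustrative sketch (toy data; the interpretation of the fitted object's fields is my assumption, not stated in the slides): the $n \times n$ matrix of pairwise inner products is all the fit needs, and only the support set ends up with nonzero coefficients.

# A minimal sketch (toy data): pairwise inner products and the support set
# of a fitted linear e1071::svm.
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- factor(rep(c("blue", "mauve"), each = 10))
x[y == "blue", ] <- x[y == "blue", ] + 2

K <- x %*% t(x)     # n x n matrix of pairwise inner products <x_i, x_i'>
dim(K)              # 20 x 20

fit <- svm(x, y, kernel = "linear", cost = 10, scale = FALSE)
fit$index           # the support set S: observations with nonzero alpha-hat
fit$coefs           # corresponding nonzero coefficients (assumed y_i * alpha_i-hat)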


Kernels and Support Vector Machines

If we can compute inner products between observations, we can fit a SV classifier. Can be quite abstract!

Some special kernel functions can do this for us. E.g.
$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^{d}$$
computes the inner products needed for $d$-dimensional polynomials: $\binom{p+d}{d}$ basis functions! Try it for $p = 2$ and $d = 2$.

The solution has the form
$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat\alpha_i K(x, x_i).$$
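The slide suggests trying $p = 2$ and $d = 2$; here is that expansion, a worked check that is not part of the original slides:
$$(1 + x_1 z_1 + x_2 z_2)^2 = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 = \langle \varphi(x), \varphi(z) \rangle,$$
with
$$\varphi(u) = \bigl(1,\ \sqrt{2}\,u_1,\ \sqrt{2}\,u_2,\ u_1^2,\ u_2^2,\ \sqrt{2}\,u_1 u_2\bigr),$$
so the degree-2 polynomial kernel computes the inner product in a $\binom{2+2}{2} = 6$-dimensional basis without ever constructing $\varphi$ explicitly.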


Radial Kernel

$$K(x_i, x_{i'}) = \exp\!\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right)$$

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat\alpha_i K(x, x_i)$$

[Figure: two panels in the $(X_1, X_2)$ plane showing the nonlinear decision boundary produced by the radial kernel.]

Implicit feature space; very high dimensional.

Controls variance by squashing down most dimensions severely.
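A minimal sketch of a radial-kernel fit with e1071::svm (toy data and parameter values are illustrative, not from the slides; gamma and cost would normally be tuned).

# A minimal sketch (toy data): radial-kernel SVM with e1071::svm.
library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "outer", "inner"))
dat3 <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit3 <- svm(y ~ ., data = dat3, kernel = "radial", gamma = 1, cost = 1)
plot(fit3, dat3)                        # nonlinear decision boundary in (x1, x2)
mean(predict(fit3, dat3) == dat3$y)     # training accuracy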

Example: Heart Data

[Figure: ROC curves (true positive rate vs. false positive rate) on the Heart training data. Left panel: support vector classifier vs. LDA. Right panel: support vector classifier vs. SVMs with radial kernel, $\gamma = 10^{-3}, 10^{-2}, 10^{-1}$.]

The ROC curve is obtained by changing the threshold 0 to threshold $t$ in $\hat f(X) > t$, and recording the false positive and true positive rates as $t$ varies. Here we see ROC curves on training data.

Example continued: Heart Test Data

[Figure: the same ROC curves evaluated on the Heart test data. Left panel: support vector classifier vs. LDA. Right panel: support vector classifier vs. SVMs with radial kernel, $\gamma = 10^{-3}, 10^{-2}, 10^{-1}$.]

SVMs: more than 2 classes?

The SVM as defined works for $K = 2$ classes. What do we do if we have $K > 2$ classes?

OVA: Fit $K$ different 2-class SVM classifiers $\hat f_k(x)$, $k = 1, \ldots, K$, each class versus the rest. Classify $x^*$ to the class for which $\hat f_k(x^*)$ is largest.

OVO: Fit $\binom{K}{2}$ pairwise classifiers $\hat f_{k\ell}(x)$. Classify $x^*$ to the class that wins the most pairwise competitions.

Which to choose? If $K$ is not too large, use OVO.
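For context, e1071::svm accepts a factor response with more than two levels directly; as far as I know, the underlying libsvm library uses the one-versus-one scheme described above. A minimal sketch on the built-in iris data (an illustration, not from the slides):

# A minimal sketch: K = 3 classes, so choose(3, 2) = 3 pairwise classifiers
# are fit internally (one-versus-one).
library(e1071)

data(iris)
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predicted = predict(fit, iris), truth = iris$Species)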


Support Vector versus Logistic Regression?

With $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, we can rephrase the support-vector classifier optimization as
$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{minimize}} \left\{ \sum_{i=1}^{n} \max\bigl[0,\, 1 - y_i f(x_i)\bigr] + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

[Figure: SVM loss and logistic regression loss plotted against $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$.]

This has the form loss plus penalty.

The loss is known as the hinge loss.

Very similar to the "loss" in logistic regression (negative log-likelihood).
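A minimal base-R sketch (not from the slides) that plots the hinge loss alongside the logistic regression loss as functions of $y \cdot f(x)$, with the classes coded as $y \in \{-1, +1\}$:

# Hinge loss vs. logistic (negative log-likelihood) loss as functions of y * f(x).
yf <- seq(-6, 2, length.out = 400)

hinge    <- pmax(0, 1 - yf)         # SVM hinge loss
logistic <- log(1 + exp(-yf))       # logistic regression loss for y in {-1, +1}

plot(yf, hinge, type = "l", xlab = "y * f(x)", ylab = "Loss")
lines(yf, logistic, lty = 2)
legend("topright", legend = c("SVM (hinge) loss", "Logistic regression loss"),
       lty = c(1, 2))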

Which to use: SVM or Logistic Regression?

When classes are (nearly) separable, SVM does better than LR. So does LDA.

When not, LR (with ridge penalty) and SVM very similar.

If you wish to estimate probabilities, LR is the choice.

For nonlinear boundaries, kernel SVMs are popular. Can use kernels with LR and LDA as well, but computations are more expensive.