MIS 720 Electronic Business and Big Data
Infrastructures
Lecture 07: Support Vector Machines
Dr. Xialu Liu
Management Information Systems Department
San Diego State University
(MIS Department) Dr. Xialu Liu 1 / 32
Classification Problems
Classification is the problem of identifying which of a set of categories
an observation belongs to.
Given a feature matrix $X$ and a qualitative response $Y$ taking values
in the set $\mathcal{S}$, the classification task is to build a function $C(X)$ that
takes as input the feature vector $X$ and predicts a value for $Y$; i.e.
$C(X) \in \mathcal{S}$.
Often we are more interested in estimating the probabilities that $Y$
belongs to each category in $\mathcal{S}$.
For example, it is more valuable to have an estimate of the probability
that a credit cardholder defaults than a bare prediction of default or
not.
Classification Problems
Here the response variable $Y$ is qualitative - e.g. email is one of $\mathcal{S} = \{\text{spam}, \text{ham}\}$
(ham = good email), and digit class is one of $\mathcal{S} = \{0, 1, \ldots, 9\}$. Our goals are to:
Build a classifier $C(X)$ that assigns a class label from $\mathcal{S}$ to a future
unlabeled observation.
Assess the uncertainty in each classification.
Understand the roles of the different predictors among
$X = (X_1, X_2, \ldots, X_p)$.
Example: Credit Card Default
We are interested in predicting whether an individual will default on his or her credit
card payment, on the basis of annual income and monthly credit card balance. The Default
data set contains the annual income and monthly credit card balance for a subset of
10,000 individuals.
Blue circles represent "not default" and orange pluses represent "default".
Overview
We discuss support vector machines, an approach for classification that
was developed in the computer science community in the 1990s and that has
grown in popularity since then.
Maximal margin classifier: simple and elegant, but it requires that the
classes be separable by a linear boundary.
Support vector classifier: can be applied in a broader range of cases.
Support vector machine: accommodates non-linear class boundaries.
Note: people often loosely refer to the maximal margin classifier, the support vector
classifier, and the support vector machine as "support vector machines".
To avoid confusion, we will carefully distinguish between these three notions in
this lecture.
Support Vector Machines
Here we approach the two-class classification problem in a direct way:
We try to find a hyperplane that separates the classes in feature space.
If we cannot, we get creative in two ways:
We soften what we mean by "separates", and
We enrich and enlarge the feature space so that separation is possible.
Hyperplane
A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p-1$.
For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace,
in other words, a line. A hyperplane is
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$$
If a point $X = (X_1, X_2)$ satisfies $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$, it lies on the line; if it satisfies
$\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0$, it lies above the line; if it satisfies $\beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0$, it
lies below the line.
Hyperplane in 2 Dimensions
The following plot shows a hyperplane $-6 + \beta_1 X_1 + \beta_2 X_2 = 0$ in a 2-dimensional space
(the blue line).
One point above the line gives $-6 + \beta_1 X_1 + \beta_2 X_2 = 1.6$, and another point below the
line gives $-6 + \beta_1 X_1 + \beta_2 X_2 = -4$.
To determine whether a point is above or below the line, we need to calculate the inner
product of $X = (X_1, X_2)$ and $\beta = (\beta_1, \beta_2)$:
$$\langle X, \beta \rangle = \beta_1 X_1 + \beta_2 X_2$$
Hyperplane in 3 Dimensions
For three dimensions, a hyperplane is a flat two-dimensional plane. A hyperplane is
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 = 0$$
If a point $X = (X_1, X_2, X_3)$ satisfies $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 = 0$, it lies on
the plane; if it satisfies $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 > 0$, it lies above the plane; if
it satisfies $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 < 0$, it lies below the plane.
To determine whether a point is above or below the plane, we need to calculate
the inner product of $X = (X_1, X_2, X_3)$ and $\beta = (\beta_1, \beta_2, \beta_3)$:
$$\langle X, \beta \rangle = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$
Hyperplane
In general the equation for a hyperplane has the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p = 0$$
It is hard to visualize the hyperplane when $p > 3$, but the notion of a $(p-1)$-dimensional flat
subspace still applies.
It divides the $p$-dimensional space into two halves.
If $f(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$, then $f(X) > 0$ for points on one side of the
hyperplane, and $f(X) < 0$ for points on the other.
The vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is called the normal vector; it points in a
direction orthogonal to the surface of the hyperplane.
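The sign test above is easy to carry out numerically. A minimal sketch in Python (the coefficients here are hypothetical, chosen only to illustrate the computation):

```python
import numpy as np

# Hypothetical coefficients for illustration: f(X) = beta0 + <beta, X>
beta0 = -6.0
beta = np.array([1.0, 2.0])

def side_of_hyperplane(points):
    """Return +1 for points with f(X) > 0, -1 for f(X) < 0, 0 on the line."""
    return np.sign(beta0 + points @ beta)

pts = np.array([[4.0, 2.0],    # f = -6 + 4 + 4 = 2  -> one side
                [1.0, 0.5]])   # f = -6 + 1 + 1 = -4 -> the other
sides = side_of_hyperplane(pts)
```

Only the sign of $f(X)$ matters for deciding the side; its magnitude reflects how far the point is from the hyperplane (up to the scale of $\beta$).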
Separating Hyperplanes
If we code the colored points as $Y_i = +1$ for blue, say, and $Y_i = -1$ for purple,
then if $Y_i f(X_i) > 0$ for all $i$, $f(X) = 0$ defines a separating hyperplane.
Maximal Margin Classier
Suppose it is possible to construct a hyperplane that separates the data perfectly
according to their labels. Label observations from the blue class with $y_i = 1$ and
from the purple class with $y_i = -1$; then
$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} > 0 \quad \text{if } y_i = 1,$$
and
$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} < 0 \quad \text{if } y_i = -1.$$
It follows that a separating hyperplane has the property that
$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) > 0$$
for all $i = 1, \ldots, n$.
Separating Hyperplanes
If the data can be perfectly separated using a hyperplane, then there exist an
infinite number of such hyperplanes, as shown in the plot.
Which one should we choose?
Maximal Margin Classier
A natural choice is the maximal margin hyperplane, which is the separating hyperplane
that is farthest from the data.
We compute the distance from
each data point to a given hyperplane; the smallest such distance is known as the margin.
The maximal margin hyperplane is the one for which the margin is largest.
Maximal Margin Classier
Among all separating hyperplanes, find the one that makes the biggest gap, or margin,
between the two classes.
The constraint $y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) \geq M$ guarantees that each observation will be
on the correct side of the hyperplane, provided that $M$ is positive.
$M$ represents the margin of the hyperplane, and the optimization problem chooses
$\beta_0, \beta_1, \ldots, \beta_p$ to maximize $M$.
Maximal Margin Classier
We see three data points that are equidistant from the maximal margin hyperplane and
lie along the dashed lines indicating the width of the margin.
They are known as support vectors, since they "support" the maximal margin
hyperplane in the sense that if they were moved slightly then the hyperplane would
move as well.
It is interesting that the maximal margin hyperplane depends directly on the
support vectors but not on the other observations.
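These ideas can be sketched with scikit-learn. A very large penalty parameter approximates the hard-margin (maximal margin) classifier on separable data; the toy data below are synthetic, made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two well-separated clusters (hypothetical, for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2)),
               rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large penalty approximates the hard-margin (maximal margin) classifier
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The fitted hyperplane is determined by a handful of support vectors;
# moving any other observation slightly would leave it unchanged.
n_support = clf.support_vectors_.shape[0]
```

Inspecting `clf.support_vectors_` confirms the slide's point: only the few observations on the margin determine the hyperplane.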
Non-separable Data
The data on the left are not separable by a linear boundary.
We cannot exactly separate the two classes.
Noisy Data
Sometimes even when the data are separable, they are noisy. This can lead to
a poor solution for the maximal margin classifier.
We want to consider a classifier that does not perfectly separate the two classes, in
the interest of
Greater robustness to individual observations, and
Better classification of most of the training observations.
It could be worthwhile to misclassify a few data points in order to do a better job
in classifying the remaining data.
Rather than seeking the largest possible margin so that every observation is on the
correct side of the hyperplane, we instead allow some observations to be on the
incorrect side of the margin, or even the hyperplane.
The support vector classifier, sometimes called a soft margin classifier, does exactly this.
The margin is soft because it can be violated by some data points.
Support Vector Classier
The support vector classifier solves
$$\max_{\beta_0, \beta_1, \ldots, \beta_p, \, \epsilon_1, \ldots, \epsilon_n} M$$
subject to
$$\sum_{j=1}^{p} \beta_j^2 = 1,$$
$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) \geq M(1 - \epsilon_i),$$
$$\epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C.$$
$C$ is a nonnegative tuning parameter.
$M$ is the width of the margin, and we seek to make it as large as possible.
$\epsilon_1, \ldots, \epsilon_n$ are slack variables that allow individual data points to be on the wrong side of the margin or
hyperplane.
If $\epsilon_i = 0$, the $i$-th observation is on the correct side of the margin.
If $0 < \epsilon_i < 1$, then the $i$-th observation is on the wrong side of the margin,
and we say the $i$-th observation has violated the margin.
If $\epsilon_i > 1$, it is on the wrong side of the hyperplane.
$C$ is a regularization parameter.
If $C = 0$, there is no budget for violations to the margin, and $\epsilon_1 = \epsilon_2 = \ldots = \epsilon_n = 0$.
If $C > 0$, no more than $C$ observations can be on the wrong side of the hyperplane.
As $C$ increases, we become more tolerant of violations.
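A hedged sketch of this trade-off with scikit-learn, on synthetic overlapping classes. Note that scikit-learn's `C` is a penalty on margin violations, so it acts roughly inversely to the budget $C$ in the formulation above:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes: no hyperplane separates them perfectly (toy data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[-1.0, -1.0], scale=1.2, size=(50, 2)),
               rng.normal(loc=[1.0, 1.0], scale=1.2, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Caution: scikit-learn's C is a penalty on margin violations, so it acts
# roughly inversely to the budget C above.  A small penalty tolerates many
# violations (wide margin); a large penalty tolerates few (narrow margin).
tolerant = SVC(kernel="linear", C=0.01).fit(X, y)
strict = SVC(kernel="linear", C=100.0).fit(X, y)

n_sv_tolerant = tolerant.support_vectors_.shape[0]
n_sv_strict = strict.support_vectors_.shape[0]
```

The tolerant fit ends up with more support vectors, since a wider margin is violated by more observations; in practice the penalty is tuned by cross-validation.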
Linear Boundary Can Fail
Sometimes a linear boundary simply won't work, no matter what value of $C$ we choose.
What to do?
Feature Expansion
Enlarge the space of features by including transformations; e.g. $X_1^2$, $X_1^3$, $X_1 X_2$,
$X_1 X_2^2$, .... Hence we go from a $p$-dimensional space to a space of dimension
greater than $p$.
Fit a support vector classifier in the enlarged space.
This results in non-linear decision boundaries in the original space.
Example: Suppose we use $(X_1, X_2, X_1^2, X_2^2, X_1 X_2)$ instead of just $(X_1, X_2)$. Then the
decision boundary would be of the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 = 0$$
This leads to nonlinear decision boundaries in the original space (quadratic conic
sections).
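The quadratic expansion above can be sketched explicitly: build the enlarged features, then fit a linear support vector classifier in that space. The circular class boundary below is synthetic, chosen so a degree-2 expansion suffices:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Circular class boundary: not linearly separable in (X1, X2) (synthetic)
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)

# Enlarge the feature space to (X1, X2, X1^2, X1*X2, X2^2) and fit a
# *linear* support vector classifier there; back in the original two
# dimensions the decision boundary is a conic section (here, a circle).
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearSVC(C=1.0, max_iter=10000))
model.fit(X, y)
acc = model.score(X, y)
```

The linear boundary in the enlarged five-dimensional space maps back to a quadratic boundary in $(X_1, X_2)$, exactly as the slide's equation describes.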
Cubic Polynomial
Feature Expansion
Polynomials (especially high-dimensional ones) get wild rather fast.
There is a more elegant and controlled way to introduce nonlinearities in support
vector classifiers: through the use of kernels.
Before we discuss these, we must understand the role of inner products in support
vector classifiers.
We will not discuss exactly how the support vector classifier is computed, because
the details are quite technical. But it turns out that the solution involves the inner
products of the data.
Inner Products and Support Vectors
Inner product between vectors:
$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j}$$
The linear support vector classifier can be represented as
$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \quad (1)$$
with $n$ parameters $\alpha_1, \ldots, \alpha_n$.
To estimate the parameters $\alpha_1, \ldots, \alpha_n$ and $\beta_0$, all we need are the $\binom{n}{2}$ inner
products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations.
It turns out that most of the $\hat{\alpha}_i$ are zero; $\hat{\alpha}_i$ is nonzero for the support vectors only.
$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat{\alpha}_i \langle x, x_i \rangle \quad (2)$$
$\mathcal{S}$ is the set of indices $i$ such that $\hat{\alpha}_i > 0$. (2) involves far fewer terms than (1).
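The sparsity of the $\hat{\alpha}_i$ is visible directly in scikit-learn, which stores only the nonzero dual terms. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

# Two separated clusters (toy data for illustration)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.6, size=(30, 2)),
               rng.normal(loc=[2.0, 0.0], scale=0.6, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# In the representation f(x) = beta_0 + sum_i alpha_i <x, x_i>, the
# alpha_i are nonzero only for support vectors.  scikit-learn stores just
# those nonzero terms: dual_coef_ holds y_i * alpha_i for each support
# vector, and support_ holds the support vectors' indices.
n_support = len(clf.support_)
n_dual_terms = clf.dual_coef_.shape[1]
```

There is exactly one stored dual coefficient per support vector, mirroring the reduced sum in (2).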
Kernels and Support Vector Machines
If we can compute inner products between observations, we can fit a support vector classifier.
Here we replace the inner product with a generalization of the form
$$K(x_i, x_{i'}),$$
where $K$ is some function that we will refer to as a kernel. A kernel is a function
that quantifies the similarity of two observations.
For example,
$$K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}, \quad (3)$$
which gives us back the support vector classifier. (3) is known as a linear kernel.
We can also choose the following kernel, which is called a polynomial kernel of degree $d$:
$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d \quad (4)$$
It gives a much more flexible decision boundary, because it amounts to fitting a support vector classifier
in a higher-dimensional space.
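A sketch of the polynomial kernel in scikit-learn, again on the synthetic circular-boundary data; `kernel="poly"` evaluates the kernel implicitly, without ever forming the enlarged feature space:

```python
import numpy as np
from sklearn.svm import SVC

# Quadratic class boundary: a circle in (X1, X2) (synthetic data)
rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)

# kernel="poly" evaluates K(x, x') = (coef0 + gamma * <x, x'>)^degree
# implicitly, without ever constructing the enlarged feature space
poly_svm = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1.0)
poly_svm.fit(X, y)
acc = poly_svm.score(X, y)
```

With `coef0=1.0` and `gamma=1.0` this matches the form of kernel (4) with $d = 2$.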
Kernels and Support Vector Machines
When the support vector classifier is combined with a non-linear kernel such as
(4), the resulting classifier is known as a support vector machine. In this case, the
classifier can be written as
$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat{\alpha}_i K(x, x_i).$$
Radial Kernel
$$K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right), \quad \text{where } \gamma > 0.$$
$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat{\alpha}_i K(x, x_i)$$
Advantage: implicit feature space; very high dimensional.
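The radial kernel handles boundaries that frustrate low-degree polynomials. A sketch on synthetic concentric-ring data (made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two concentric rings (synthetic): a hard case for linear boundaries
rng = np.random.default_rng(5)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
radius = np.concatenate([rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)])
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)

# gamma controls how local each support vector's influence is:
# larger gamma means a bumpier, more local decision boundary
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0)
rbf_svm.fit(X, y)
acc = rbf_svm.score(X, y)
```

Because the radial kernel's similarity decays with distance, training points far from a test point contribute almost nothing to $f(x)$, which is why the boundary can curve tightly around each ring.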
Heart Test Data
We apply the support vector machines to the Heart data. These data contain a binary
outcome HD for 303 patients who presented with chest pain. An outcome value of Yes
indicates the presence of heart disease and No means no heart disease. The aim is to use
13 predictors such as Age, Sex, and Chol (a cholesterol measurement) in order to predict
whether an individual has heart disease.
The ROC curve is obtained by changing the threshold 0 to threshold $t$ in $\hat{f}(X) > t$, and
recording the false positive and true positive rates as $t$ varies. Here we see ROC curves on
test data.
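The mechanics of sweeping the threshold $t$ can be sketched with scikit-learn; since the Heart data are not bundled with the library, the data below are synthetic, used only to show the procedure:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# The Heart data are not bundled with scikit-learn, so this sketch uses
# synthetic two-class data just to show the mechanics.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=-0.7, scale=1.0, size=(150, 5)),
               rng.normal(loc=0.7, scale=1.0, size=(150, 5))])
y = np.array([0] * 150 + [1] * 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma=0.2, C=1.0)
clf.fit(X_tr, y_tr)

# decision_function returns f_hat(X); sweeping the threshold t over its
# range traces out the ROC curve
scores = clf.decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)
```

Each threshold $t$ yields one (false positive rate, true positive rate) point; plotting `fpr` against `tpr` gives the ROC curve, and the area under it summarizes test-set performance.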
Summary
Maximal margin classier
Support vector classier
Support vector machine