Linear Discriminant Analysis and Its Generalization

About This Presentation

A brief introduction to linear discriminant analysis and some of its extensions. Much of the material is taken from The Elements of Statistical Learning by Hastie et al. (2008).


Slide Content

Linear Discriminant Analysis and Its Generalization
Chapters 4 and 12 of The Elements of Statistical Learning
Presented by Ilsang Ohn
Department of Statistics, Seoul National University
September 3, 2014

Contents
1. Linear Discriminant Analysis
2. Flexible Discriminant Analysis
3. Penalized Discriminant Analysis
4. Mixture Discriminant Analysis

Review of
Linear Discriminant Analysis

LDA: Overview
Linear discriminant analysis (LDA) does classification by assuming that the data within each class are normally distributed:
$$f_k(x) = P(X = x \mid G = k) = N(\mu_k, \Sigma).$$
We allow each class to have its own mean $\mu_k \in \mathbb{R}^p$, but we assume a common covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. Thus
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right).$$
We want to find $k$ so that $P(G = k \mid X = x) \propto f_k(x)\pi_k$ is the largest.

LDA: Overview
The linear discriminant functions are derived from the relation
$$\log(f_k(x)\pi_k) = -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log(\pi_k) + C = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k) + C',$$
and we denote
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k).$$
The decision rule is $G(x) = \arg\max_k \delta_k(x)$.
The Bayes classifier is a linear classifier.
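As a concrete illustration of the discriminant functions above, here is a minimal numpy sketch (the function and variable names are mine, not from the slides): it evaluates $\delta_k(x)$ for given class means, shared covariance and priors, and picks the argmax.

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        for mu, prior in zip(mus, priors)
    ])

# Toy example: two classes in R^2 with a shared covariance.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.6, 0.4]
x = np.array([1.0, 0.5])
print(int(np.argmax(lda_discriminants(x, mus, Sigma, priors))))  # G(x) = argmax_k delta_k(x)
```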

LDA: Overview
We need to estimate the parameters based on the training data $x_i \in \mathbb{R}^p$ and $y_i \in \{1, \dots, K\}$ by
- $\hat{\pi}_k = N_k / N$
- $\hat{\mu}_k = N_k^{-1} \sum_{y_i = k} x_i$, the centroid of class $k$
- $\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$, the pooled sample covariance matrix
The decision boundary between each pair of classes $k$ and $l$ is given by
$$\{x : \delta_k(x) = \delta_l(x)\},$$
which is equivalent to
$$(\hat{\mu}_k - \hat{\mu}_l)^T \hat{\Sigma}^{-1} x = \frac{1}{2}(\hat{\mu}_k + \hat{\mu}_l)^T \hat{\Sigma}^{-1} (\hat{\mu}_k - \hat{\mu}_l) - \log(\hat{\pi}_k / \hat{\pi}_l).$$
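A short numpy sketch of the plug-in estimates listed above, assuming the labels are supplied as an integer vector `y`; the names are illustrative, not from the slides.

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in LDA estimates: class priors, class centroids, pooled covariance."""
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, centroids):
        R = X[y == k] - mu
        Sigma += R.T @ R
    return priors, centroids, Sigma / (N - K)   # pooled within-class covariance

# Toy usage with two simulated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.repeat([0, 1], 50)
priors, centroids, Sigma_hat = lda_fit(X, y)
```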

Fisher's discriminant analysis
Fisher's idea is to find a covariate $v$ such that
$$\max_v \; \frac{v^T B v}{v^T W v},$$
where
- $B = \sum_{k=1}^{K} (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T$: between-class covariance matrix
- $W = \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T$: within-class covariance matrix, previously denoted by $(N - K)\hat{\Sigma}$
This ratio is maximized by $v_1 = e_1$, the eigenvector of $W^{-1}B$ with the largest eigenvalue. The linear combination $v_1^T X$ is called the first discriminant. Similarly one can find the next direction $v_2$, orthogonal in $W$ to $v_1$.
Fisher's canonical discriminant analysis finds $L \le K - 1$ canonical coordinates (or a rank-$L$ subspace) that best separate the categories.
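The canonical directions can be sketched directly from the definitions above by solving the eigenproblem for $W^{-1}B$. The sketch below follows the slide's (unweighted) definition of $B$; function and variable names are my own.

```python
import numpy as np

def fisher_directions(X, y, L):
    """Leading L eigenvectors of W^{-1} B (Fisher's canonical directions)."""
    classes = np.unique(y)
    xbar = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        dk = Xk.mean(axis=0) - xbar
        B += np.outer(dk, dk)            # between-class scatter (unweighted, as on the slide)
        Rk = Xk - Xk.mean(axis=0)
        W += Rk.T @ Rk                   # within-class scatter, (N - K) * Sigma_hat
    # W^{-1} B is not symmetric, so take real parts of the numerical eigen-decomposition.
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:L]]      # columns v_1, ..., v_L
```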

Fisher's discriminant analysis
Consequently, we have $v_1, \dots, v_L$, $L \le K - 1$, which are the eigenvectors with non-zero eigenvalues.
Fisher's discriminant rule assigns $x$ to the class closest in Mahalanobis distance, so the rule is given by
$$G'(x) = \arg\min_k \sum_{l=1}^{L} [v_l^T (x - \bar{x}_k)]^2 = \arg\min_k (x - \bar{x}_k)^T \hat{\Sigma}^{-1} (x - \bar{x}_k) = \arg\min_k \left( -2\delta_k(x) + x^T \hat{\Sigma}^{-1} x + 2\log\pi_k \right) = \arg\max_k \left( \delta_k(x) - \log\pi_k \right).$$
Thus Fisher's rule is equivalent to the Gaussian classification rule with equal prior probabilities.
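A minimal sketch of Fisher's rule as stated above: classify to the centroid closest in Mahalanobis distance under the pooled covariance (names are illustrative; `centroids` and `Sigma_hat` are as produced by the `lda_fit` sketch earlier).

```python
import numpy as np

def fisher_classify(x, centroids, Sigma_hat):
    """Assign x to the class whose centroid is closest in Mahalanobis distance."""
    Sinv = np.linalg.inv(Sigma_hat)
    d2 = [(x - mu) @ Sinv @ (x - mu) for mu in centroids]
    return int(np.argmin(d2))   # equals argmax_k (delta_k(x) - log pi_k)
```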

LDA by optimal scoring
The standard way of carrying out a (Fisher's) canonical discriminant analysis is by way of a suitable SVD.
There is a somewhat different approach: optimal scoring.
This method performs LDA using linear regression on derived responses.

LDA by optimal scoring
Recall $\mathcal{G} = \{1, \dots, K\}$.
$\theta : \mathcal{G} \to \mathbb{R}$ is a function that assigns scores to the classes such that the transformed class labels are optimally predicted by linear regression on $X$.
We find $L \le K - 1$ sets of independent scorings for the class labels, $\{\theta_1, \dots, \theta_L\}$, and $L$ corresponding linear maps $\eta_l(X) = X^T \beta_l$ chosen to be optimal for multiple regression in $\mathbb{R}^p$.
The $\theta_l$ and $\beta_l$ are chosen to minimize
$$\mathrm{ASR} = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left( \theta_l(g_i) - x_i^T \beta_l \right)^2 \right].$$

LDA by optimal scoring
Notation
- $Y$: $N \times K$ indicator response matrix
- $P_X = X(X^T X)^{-1} X^T$: projection matrix onto the column space of the predictors
- $\Theta$: $K \times L$ matrix of $L$ score vectors for the $K$ classes
- $\Theta^* = Y\Theta$: $N \times L$ matrix with $\Theta^*_{ij} = \theta_j(g_i)$

LDA by optimal scoring
Problem
Minimize ASR by regressing $\Theta^*$ on $X$. This says to find the $\Theta$ that minimizes
$$\mathrm{ASR}(\Theta) = \mathrm{tr}\left(\Theta^{*T}(I - P_X)\Theta^*\right)/N = \mathrm{tr}\left(\Theta^T Y^T (I - P_X) Y \Theta\right)/N.$$
$\mathrm{ASR}(\Theta)$ is minimized by taking $\Theta$ to be the $L$ largest eigenvectors of $Y^T P_X Y$, with normalization $\Theta^T D_p \Theta = I_L$.
Here $D_p = Y^T Y / N$ is a diagonal matrix of the sample class proportions $N_j / N$.

LDA by optimal scoring
Way to the solution
1. Form the $N \times K$ indicator response matrix $Y$.
2. Compute $\hat{Y} = P_X Y$ and denote the $p \times K$ coefficient matrix by $B$, so that $\hat{Y} = XB$.
3. Obtain the eigenvector matrix $\Theta$ of $Y^T \hat{Y} = Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I$.
4. Update the coefficient matrix in step 2 to reflect the optimal scores: $B \leftarrow B\Theta$. The final optimally scaled regression fit is the $(K-1)$-vector function $\eta(x) = B^T x$.
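A compact numpy sketch of steps 1–4, under the simplifying assumption that $X$ is centered so no explicit intercept is needed; the trivial constant score then has eigenvalue zero and is dropped automatically. Function and variable names are mine, not from the slides.

```python
import numpy as np

def lda_optimal_scoring(X, y):
    """LDA via optimal scoring (steps 1-4 above). Returns the p x (K-1) matrix B @ Theta,
    so that eta(x) = (B Theta)^T (x - xbar)."""
    N, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    Y = (y[:, None] == classes[None, :]).astype(float)   # step 1: indicator responses
    Xc = X - X.mean(axis=0)                              # center instead of adding an intercept
    B, *_ = np.linalg.lstsq(Xc, Y, rcond=None)           # step 2: Yhat = P_X Y = X B
    Yhat = Xc @ B
    d = np.sqrt(Y.mean(axis=0))                          # D_p^{1/2}, class proportions N_k / N
    M = (Y.T @ Yhat / N) / np.outer(d, d)                # D_p^{-1/2} (Y^T Yhat / N) D_p^{-1/2}
    evals, U = np.linalg.eigh((M + M.T) / 2)
    keep = np.argsort(evals)[::-1][:K - 1]               # step 3: the K-1 non-trivial scores
    Theta = (U / d[:, None])[:, keep]                    # satisfies Theta^T D_p Theta = I
    return B @ Theta                                     # step 4: B <- B Theta
```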

LDA by optimal scoring
The sequence of discriminant vectors $\nu_l$ in LDA is identical, up to a constant, to the sequence $\beta_l$ obtained from optimal scoring.
That is, the coefficient matrix $B$ is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix:
$$V^T x = D B^T x = D\,\eta(x),$$
where $D_{ll} = 1/[\alpha_l^2(1 - \alpha_l^2)]$ and $x$ is a test point. Here $\alpha_l^2$ is the $l$th largest eigenvalue obtained in the eigen-decomposition of step 3.
Then the Mahalanobis distance is given by
$$\delta_J(x, \hat{\mu}_k) = \sum_{l=1}^{K-1} w_l \left( \hat{\eta}_l(x) - \bar{\eta}_l^{\,k} \right)^2 + D(x),$$
where $\bar{\eta}_l^{\,k} = N_k^{-1} \sum_{g_i = k} \hat{\eta}_l(x_i)$ and $w_l = 1/[\alpha_l^2(1 - \alpha_l^2)]$.

Generalization of LDA
FDA: allow non-linear decision boundaries.
PDA: expand the predictors into a large basis set, and then penalize the coefficients to be smooth.
MDA: model each class by a mixture of two or more Gaussians with different centroids but a common covariance, rather than by a single Gaussian distribution as in LDA.

Flexible Discriminant Analysis
(Hastie et al., 1994)

FDA: Overview
The optimal scoring method provides a starting point for generalizing LDA to a nonparametric version.
We replace the linear projection operator $P_X$ by a nonparametric regression procedure, which we denote by the linear operator $S_\lambda$.
One simple and effective approach toward this end is to expand $X$ into a larger set of basis variables $h(X)$ and then simply use $S_\lambda = P_{h(X)}$ in place of $P_X$.

FDA: Overview
These regression problems are defined via the criterion
$$\mathrm{ASR}\left( \{\theta_l, \eta_l\}_{l=1}^{L} \right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left( \theta_l(g_i) - \eta_l(x_i) \right)^2 + \lambda J(\eta_l) \right],$$
where $J$ is a regularizer appropriate for some forms of nonparametric regression (e.g., smoothing splines, additive splines and lower-order ANOVA models).

FDA by optimal scoring
Way to the solution
1. Form the $N \times K$ indicator response matrix $Y$.
2. Fit a multi-response adaptive nonparametric regression of $Y$ on $X$, giving fitted values $\hat{Y}$. Let $S_\lambda$ be the linear operator that fits the final chosen model and let $\eta^*(x)$ be the vector of fitted regression functions.
3. Compute the eigen-decomposition of $Y^T \hat{Y} = Y^T S_\lambda Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update the final model from step 2 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$.
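Since the simplest choice $S_\lambda = P_{h(X)}$ just replaces $X$ by a basis expansion, an FDA fit can reuse the `lda_optimal_scoring` sketch given earlier. The quadratic basis below is only one illustrative choice of $h(X)$, not the adaptive regression of the slide, and the function names are mine.

```python
import numpy as np

def quadratic_basis(X):
    """Illustrative basis expansion h(X): linear terms plus squares and pairwise products."""
    n, p = X.shape
    cross = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.hstack([X, np.column_stack(cross)])

def fda_fit(X, y, basis=quadratic_basis):
    """FDA with S = P_{h(X)}: run the optimal-scoring steps on the expanded predictors.
    Assumes lda_optimal_scoring() from the LDA sketch above is in scope."""
    H = basis(X)
    return lda_optimal_scoring(H, y)   # coefficients act on centered h(x), not on x itself
```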

Penalized Discriminant Analysis
(Hastie et al., 1995)

PDA: Overview
Although FDA is motivated by generalizing optimal scoring, it can also be viewed directly as a form of regularized discriminant analysis.
Suppose the regression procedure used in FDA amounts to a linear regression onto a basis expansion $h(X)$, with a quadratic penalty on the coefficients:
$$\mathrm{ASR}\left( \{\theta_l, \beta_l\}_{l=1}^{L} \right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left( \theta_l(g_i) - h^T(x_i)\beta_l \right)^2 + \lambda \beta_l^T \Omega \beta_l \right],$$
where the penalty matrix $\Omega$ penalizes "rough" coefficient vectors.
The steps in FDA can then be viewed as a generalized form of LDA, which we call PDA.

PDA: Overview
Enlarge the set of predictors $X$ via a basis expansion $h(X)$.
Use (penalized) LDA in the enlarged space, where the penalized Mahalanobis distance is given by
$$D(x, \mu) = (h(x) - h(\mu))^T (W + \lambda\Omega)^{-1} (h(x) - h(\mu)),$$
where $W$ is the within-class covariance matrix of the derived variables $h(x_i)$.
Decompose the classification subspace using a penalized metric:
$$\max_u \; u^T B u \quad \text{subject to} \quad u^T (W + \lambda\Omega) u = 1,$$
where $B$ is the corresponding between-class covariance matrix.
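The penalized Mahalanobis distance above takes only a few lines of numpy; this is a sketch with illustrative argument names ($\Omega$ is assumed symmetric positive semi-definite).

```python
import numpy as np

def penalized_mahalanobis(hx, hmu, W, Omega, lam):
    """D(x, mu) = (h(x) - h(mu))^T (W + lam * Omega)^{-1} (h(x) - h(mu))."""
    d = hx - hmu
    return d @ np.linalg.solve(W + lam * Omega, d)
```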

PDA by optimal scoring
Way to the solution
1. Form $Y$ and the basis matrix $H = (h_{ij}) = (h_j(x_i))$.
2. Fit a penalized multi-response regression of $Y$ on $H$, giving fitted values $\hat{Y} = S(\lambda)Y$. Let $S(\lambda) = H(H^T H + \lambda\Omega)^{-1} H^T$ be the smoother matrix of $H$ regularized by $\Omega$, and let $B = (H^T H + \lambda\Omega)^{-1} H^T Y$ be the penalized least squares estimate.
3. Compute the eigen-decomposition of $Y^T \hat{Y} = Y^T S(\lambda) Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update the coefficient matrix from step 2 using the optimal scores: $B \leftarrow B\Theta$.
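A numpy sketch of steps 1–4 with an explicit ridge-type penalty $\lambda\Omega$. Centering the basis matrix so the trivial constant score drops out is a simplification of this sketch, not part of the slide; names are mine.

```python
import numpy as np

def pda_fit(H, y, Omega, lam):
    """PDA by optimal scoring on a basis matrix H with penalty lam * Omega."""
    Hc = H - H.mean(axis=0)                                     # simplification: no intercept
    N, _ = Hc.shape
    classes = np.unique(y)
    K = len(classes)
    Y = (y[:, None] == classes[None, :]).astype(float)          # step 1
    B = np.linalg.solve(Hc.T @ Hc + lam * Omega, Hc.T @ Y)      # step 2: penalized least squares
    Yhat = Hc @ B                                               # = S(lambda) Y
    d = np.sqrt(Y.mean(axis=0))
    M = (Y.T @ Yhat / N) / np.outer(d, d)                       # step 3: scores in the D_p metric
    evals, U = np.linalg.eigh((M + M.T) / 2)
    Theta = (U / d[:, None])[:, np.argsort(evals)[::-1][:K - 1]]
    return B @ Theta                                            # step 4: B <- B Theta
```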

Mixture Discriminant Analysis
(Hastie and Tibshirani, 1996)

MDA: Overview
Linear discriminant analysis can be viewed as a prototype classifier. Each class is represented by its centroid, and we classify to the closest one using an appropriate metric.
In many situations a single prototype is not sufficient to represent inhomogeneous classes, and mixture models are more appropriate.

MDA: Overview
A Gaussian mixture model for the $k$th class has density
$$P(X \mid G = k) = \sum_{r=1}^{R_k} \pi_{kr}\, \phi(X; \mu_{kr}, \Sigma),$$
where the mixing proportions $\pi_{kr}$ sum to one and $R_k$ is the number of prototypes for the $k$th class.
The class posterior probabilities are given by
$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R_k} \pi_{kr}\, \phi(x; \mu_{kr}, \Sigma)\, \Pi_k}{\sum_{l=1}^{K} \sum_{r=1}^{R_l} \pi_{lr}\, \phi(x; \mu_{lr}, \Sigma)\, \Pi_l},$$
where the $\Pi_k$ represent the class prior probabilities.
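A direct numpy transcription of the mixture density and class posterior above; the nested-list data layout and names are my own choices, not the slides'.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """phi(x; mu, Sigma) for a single observation x."""
    p = len(x)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

def mda_class_posteriors(x, mus, pis, Pis, Sigma):
    """P(G=k | X=x): mus[k][r] and pis[k][r] are the subclass means and mixing
    proportions of class k, Pis[k] the class priors, Sigma the shared covariance."""
    scores = np.array([
        Pi_k * sum(pi_kr * gaussian_density(x, mu_kr, Sigma)
                   for pi_kr, mu_kr in zip(pis_k, mus_k))
        for mus_k, pis_k, Pi_k in zip(mus, pis, Pis)
    ])
    return scores / scores.sum()
```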

MDA: Estimation
We estimate the parameters by maximum likelihood, using the joint log-likelihood based on $P(G, X)$:
$$\sum_{k=1}^{K} \sum_{g_i = k} \log \left[ \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)\, \Pi_k \right].$$
We solve for the MLEs by the EM algorithm.

MDA: Estimation
E-step: Given the current parameters, compute the responsibility of subclass $c_{kr}$ within class $k$ for each of the class-$k$ observations ($g_i = k$):
$$\hat{p}(c_{kr} \mid x_i, g_i) = \frac{\pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)}{\sum_{r'=1}^{R_k} \pi_{kr'}\, \phi(x_i; \mu_{kr'}, \Sigma)}.$$
M-step: Compute the weighted MLEs for the parameters of each of the component Gaussians within each of the classes, using the weights from the E-step.
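The E- and M-steps above as one numpy sweep, computing the responsibilities and then the weighted MLEs directly (rather than through the weighted-LDA route described on the next slides). The data layout and names are my own.

```python
import numpy as np

def mda_em_step(X, y, mus, pis, Sigma):
    """One EM sweep for MDA with a shared covariance. mus[k] is an (R_k x p) array of
    subclass means and pis[k] the corresponding R_k mixing proportions for class k."""
    N, p = X.shape
    classes = np.unique(y)
    Sigma_inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    resp = {}
    # E-step: responsibilities p(c_kr | x_i, g_i) for the class-k observations.
    for k_idx, k in enumerate(classes):
        Xk = X[y == k]
        diffs = Xk[:, None, :] - mus[k_idx][None, :, :]           # (N_k, R_k, p)
        quad = np.einsum('irp,pq,irq->ir', diffs, Sigma_inv, diffs)
        w = pis[k_idx] * np.exp(-0.5 * quad) / norm               # pi_kr * phi(x_i; mu_kr, Sigma)
        resp[k_idx] = w / w.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs for subclass means, mixing proportions, and pooled Sigma.
    new_mus, new_pis, Sigma_new = [], [], np.zeros((p, p))
    for k_idx, k in enumerate(classes):
        Xk, R = X[y == k], resp[k_idx]
        mu_k = (R.T @ Xk) / R.sum(axis=0)[:, None]
        new_mus.append(mu_k)
        new_pis.append(R.sum(axis=0) / len(Xk))
        for r in range(R.shape[1]):
            D = Xk - mu_k[r]
            Sigma_new += (R[:, r, None] * D).T @ D
    return new_mus, new_pis, Sigma_new / N
```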

MDA: Estimation
The M-step is a weighted version of LDA, with $R = \sum_{k=1}^{K} R_k$ "classes" and $\sum_{k=1}^{K} N_k R_k$ "observations".
We can use optimal scoring as before to solve the weighted LDA problem, which allows us to use a weighted version of FDA or PDA at this stage.

MDA: Estimation
The indicator matrix $Y_{N \times K}$ collapses in this case to a blurred response matrix $Z_{N \times R}$.
For example,

           c11  c12  c13  c21  c22  c23  c31  c32  c33
  g1 = 2    0    0    0   0.3  0.5  0.2   0    0    0
  g2 = 1   0.9  0.1  0.0   0    0    0    0    0    0
  g3 = 1   0.1  0.8  0.1   0    0    0    0    0    0
  g4 = 3    0    0    0    0    0    0   0.5  0.4  0.1
  ...
  gN = 3    0    0    0    0    0    0   0.5  0.4  0.1

where the entries in a class-$k$ row correspond to $\hat{p}(c_{kr} \mid x_i, g_i)$.
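Building the blurred response matrix $Z$ from the subclass responsibilities is mechanical; here is a short sketch, assuming the responsibilities are stored per class as in the EM sketch above (names are illustrative).

```python
import numpy as np

def blurred_response(y, resp, classes, R):
    """Assemble the N x R blurred response matrix Z. resp[k_idx] is the (N_k x R_k)
    responsibility matrix for class classes[k_idx], and R is the list of subclass counts R_k."""
    Z = np.zeros((len(y), sum(R)))
    offsets = np.cumsum([0] + list(R[:-1]))   # first column of each class-k block
    for k_idx, k in enumerate(classes):
        rows = np.where(y == k)[0]
        Z[rows, offsets[k_idx]:offsets[k_idx] + R[k_idx]] = resp[k_idx]
    return Z
```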

MDA: Estimation by optimal scoring
Optimal scoring within the EM steps of MDA:
1. Start with a set of $R_k$ subclasses $c_{kr}$ per class, and associated subclass probabilities $\hat{p}(c_{kr} \mid x_i, g_i)$.
2. Form the blurred response matrix $Z$: if $g_i = k$, fill the $k$th block of $R_k$ entries in the $i$th row with the values $\hat{p}(c_{kr} \mid x_i, g_i)$, and the rest with 0s.
3. Fit a multi-response adaptive nonparametric regression of $Z$ on $X$, giving fitted values $\hat{Z}$. Let $\eta^*(x)$ be the vector of fitted regression functions.
4. Let $\Theta$ be the largest $K$ non-trivial eigenvectors of $Z^T \hat{Z}$, with normalization $\Theta^T D_p \Theta = I_K$.
5. Update the final model from step 3 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$, and update $\hat{p}(c_{kr} \mid x_i, g_i)$ and $\hat{\pi}_{kr}$.

Performance