Linear Discriminant Analysis and Its Generalization

About This Presentation

A brief introduction to linear discriminant analysis and some of its extensions. Much of the material is taken from The Elements of Statistical Learning by Hastie et al. (2008).


Slide Content

Linear Discriminant Analysis and Its Generalization
Chapters 4 and 12 of The Elements of Statistical Learning
Presented by Ilsang Ohn
Department of Statistics, Seoul National University
September 3, 2014

Contents
1. Linear Discriminant Analysis
2. Flexible Discriminant Analysis
3. Penalized Discriminant Analysis
4. Mixture Discriminant Analysis

Review of
Linear Discriminant Analysis

LDA: Overview
Linear discriminant analysis (LDA) does classification by assuming that the data within each class are normally distributed:
$$f_k(x) = P(X = x \mid G = k) = N(\mu_k, \Sigma).$$
We allow each class to have its own mean $\mu_k \in \mathbb{R}^p$, but we assume a common covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. Thus
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right).$$
We want to find $k$ so that $P(G = k \mid X = x) \propto f_k(x)\pi_k$ is the largest.

LDA: Overview
The linear discriminant functions are derived from the relation
$$\log(f_k(x)\pi_k) = -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log(\pi_k) + C = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k) + C',$$
and we denote
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k).$$
The decision rule is $G(x) = \arg\max_k \delta_k(x)$.
The Bayes classifier is a linear classifier.
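As a concrete illustration of the discriminant functions above, here is a minimal numpy sketch (the function and variable names are mine, not from the slides): it evaluates $\delta_k(x)$ for given class means, shared covariance and priors, and picks the argmax.

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        for mu, prior in zip(mus, priors)
    ])

# Toy example: two classes in R^2 with a shared covariance.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.6, 0.4]
x = np.array([1.0, 0.5])
print(int(np.argmax(lda_discriminants(x, mus, Sigma, priors))))  # G(x) = argmax_k delta_k(x)
```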

LDA: Overview
We need to estimate the parameters based on the training data $x_i \in \mathbb{R}^p$ and $y_i \in \{1, \dots, K\}$ by
- $\hat{\pi}_k = N_k / N$
- $\hat{\mu}_k = N_k^{-1} \sum_{y_i = k} x_i$, the centroid of class $k$
- $\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$, the pooled sample covariance matrix
The decision boundary between each pair of classes $k$ and $l$ is given by
$$\{x : \delta_k(x) = \delta_l(x)\},$$
which is equivalent to
$$(\hat{\mu}_k - \hat{\mu}_l)^T \hat{\Sigma}^{-1} x = \frac{1}{2}(\hat{\mu}_k + \hat{\mu}_l)^T \hat{\Sigma}^{-1} (\hat{\mu}_k - \hat{\mu}_l) - \log(\hat{\pi}_k / \hat{\pi}_l).$$
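A short numpy sketch of the plug-in estimates listed above, assuming the labels are supplied as an integer vector `y`; the names are illustrative, not from the slides.

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in LDA estimates: class priors, class centroids, pooled covariance."""
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, centroids):
        R = X[y == k] - mu
        Sigma += R.T @ R
    return priors, centroids, Sigma / (N - K)   # pooled within-class covariance

# Toy usage with two simulated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.repeat([0, 1], 50)
priors, centroids, Sigma_hat = lda_fit(X, y)
```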

Fisher's discriminant analysis
Fisher's idea is to find a covariate $v$ such that
$$\max_v \; \frac{v^T B v}{v^T W v},$$
where
- $B = \sum_{k=1}^{K} (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T$: between-class covariance matrix
- $W = \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T$: within-class covariance matrix, previously denoted by $(N - K)\hat{\Sigma}$
This ratio is maximized by $v_1 = e_1$, the eigenvector of $W^{-1}B$ with the largest eigenvalue. The linear combination $v_1^T X$ is called the first discriminant. Similarly one can find the next direction $v_2$, orthogonal in $W$ to $v_1$.
Fisher's canonical discriminant analysis finds $L \le K - 1$ canonical coordinates (or a rank-$L$ subspace) that best separate the categories.
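The canonical directions can be sketched directly from the definitions above by solving the eigenproblem for $W^{-1}B$. The sketch below follows the slide's (unweighted) definition of $B$; function and variable names are my own.

```python
import numpy as np

def fisher_directions(X, y, L):
    """Leading L eigenvectors of W^{-1} B (Fisher's canonical directions)."""
    classes = np.unique(y)
    xbar = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        dk = Xk.mean(axis=0) - xbar
        B += np.outer(dk, dk)            # between-class scatter (unweighted, as on the slide)
        Rk = Xk - Xk.mean(axis=0)
        W += Rk.T @ Rk                   # within-class scatter, (N - K) * Sigma_hat
    # W^{-1} B is not symmetric, so take real parts of the numerical eigen-decomposition.
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:L]]      # columns v_1, ..., v_L
```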

Fisher's discriminant analysis
Consequently, we have $v_1, \dots, v_L$, $L \le K - 1$, which are the eigenvectors with non-zero eigenvalues.
Fisher's discriminant rule assigns $x$ to the class closest in Mahalanobis distance, so the rule is given by
$$G'(x) = \arg\min_k \sum_{l=1}^{L} [v_l^T (x - \bar{x}_k)]^2 = \arg\min_k (x - \bar{x}_k)^T \hat{\Sigma}^{-1} (x - \bar{x}_k) = \arg\min_k \left( -2\delta_k(x) + x^T \hat{\Sigma}^{-1} x + 2\log\pi_k \right) = \arg\max_k \left( \delta_k(x) - \log\pi_k \right).$$
Thus Fisher's rule is equivalent to the Gaussian classification rule with equal prior probabilities.
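A minimal sketch of Fisher's rule as stated above: classify to the centroid closest in Mahalanobis distance under the pooled covariance (names are illustrative; `centroids` and `Sigma_hat` are as produced by the `lda_fit` sketch earlier).

```python
import numpy as np

def fisher_classify(x, centroids, Sigma_hat):
    """Assign x to the class whose centroid is closest in Mahalanobis distance."""
    Sinv = np.linalg.inv(Sigma_hat)
    d2 = [(x - mu) @ Sinv @ (x - mu) for mu in centroids]
    return int(np.argmin(d2))   # equals argmax_k (delta_k(x) - log pi_k)
```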

LDA by optimal scoring
The standard way of carrying out a (Fisher's) canonical discriminant analysis is by way of a suitable SVD.
There is a somewhat different approach: optimal scoring.
This method performs LDA using linear regression on derived responses.

LDA by optimal scoring
Recall $\mathcal{G} = \{1, \dots, K\}$.
$\theta : \mathcal{G} \to \mathbb{R}$ is a function that assigns scores to the classes such that the transformed class labels are optimally predicted by linear regression on $X$.
We find $L \le K - 1$ sets of independent scorings for the class labels, $\{\theta_1, \dots, \theta_L\}$, and $L$ corresponding linear maps $\eta_l(X) = X^T \beta_l$ chosen to be optimal for multiple regression in $\mathbb{R}^p$.
The $\theta_l$ and $\beta_l$ are chosen to minimize
$$\mathrm{ASR} = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left( \theta_l(g_i) - x_i^T \beta_l \right)^2 \right].$$

LDA by optimal scoring
Notation
- $Y$: $N \times K$ indicator response matrix
- $P_X = X(X^T X)^{-1} X^T$: projection matrix onto the column space of the predictors
- $\Theta$: $K \times L$ matrix of $L$ score vectors for the $K$ classes
- $\Theta^* = Y\Theta$: $N \times L$ matrix with $\Theta^*_{ij} = \theta_j(g_i)$

LDA by optimal scoring
Problem
Minimize ASR by regressing $\Theta^*$ on $X$. This says to find the $\Theta$ that minimizes
$$\mathrm{ASR}(\Theta) = \mathrm{tr}\left(\Theta^{*T}(I - P_X)\Theta^*\right)/N = \mathrm{tr}\left(\Theta^T Y^T (I - P_X) Y \Theta\right)/N.$$
$\mathrm{ASR}(\Theta)$ is minimized by taking $\Theta$ to be the $L$ largest eigenvectors of $Y^T P_X Y$, with normalization $\Theta^T D_p \Theta = I_L$.
Here $D_p = Y^T Y / N$ is a diagonal matrix of the sample class proportions $N_j / N$.

LDA by optimal scoring
Way to the solution
1. Form the $N \times K$ indicator response matrix $Y$.
2. Compute $\hat{Y} = P_X Y$ and denote the $p \times K$ coefficient matrix by $B$, so that $\hat{Y} = XB$.
3. Obtain the eigenvector matrix $\Theta$ of $Y^T \hat{Y} = Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I$.
4. Update the coefficient matrix in step 2 to reflect the optimal scores: $B \leftarrow B\Theta$. The final optimally scaled regression fit is the $(K-1)$-vector function $\eta(x) = B^T x$.
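A compact numpy sketch of steps 1–4, under the simplifying assumption that $X$ is centered so no explicit intercept is needed; the trivial constant score then has eigenvalue zero and is dropped automatically. Function and variable names are mine, not from the slides.

```python
import numpy as np

def lda_optimal_scoring(X, y):
    """LDA via optimal scoring (steps 1-4 above). Returns the p x (K-1) matrix B @ Theta,
    so that eta(x) = (B Theta)^T (x - xbar)."""
    N, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    Y = (y[:, None] == classes[None, :]).astype(float)   # step 1: indicator responses
    Xc = X - X.mean(axis=0)                              # center instead of adding an intercept
    B, *_ = np.linalg.lstsq(Xc, Y, rcond=None)           # step 2: Yhat = P_X Y = X B
    Yhat = Xc @ B
    d = np.sqrt(Y.mean(axis=0))                          # D_p^{1/2}, class proportions N_k / N
    M = (Y.T @ Yhat / N) / np.outer(d, d)                # D_p^{-1/2} (Y^T Yhat / N) D_p^{-1/2}
    evals, U = np.linalg.eigh((M + M.T) / 2)
    keep = np.argsort(evals)[::-1][:K - 1]               # step 3: the K-1 non-trivial scores
    Theta = (U / d[:, None])[:, keep]                    # satisfies Theta^T D_p Theta = I
    return B @ Theta                                     # step 4: B <- B Theta
```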

LDA by optimal scoring
The sequence of discriminant vectors $\nu_l$ in LDA is identical, up to a constant, to the sequence $\beta_l$ obtained from optimal scoring.
That is, the coefficient matrix $B$ is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix:
$$V^T x = D B^T x = D\,\eta(x),$$
where $D_{ll} = 1/[\alpha_l^2(1 - \alpha_l^2)]$ and $x$ is a test point. Here $\alpha_l^2$ is the $l$th largest eigenvalue obtained in the eigen-decomposition of step 3.
Then the Mahalanobis distance is given by
$$\delta_J(x, \hat{\mu}_k) = \sum_{l=1}^{K-1} w_l \left( \hat{\eta}_l(x) - \bar{\eta}_l^{\,k} \right)^2 + D(x),$$
where $\bar{\eta}_l^{\,k} = N_k^{-1} \sum_{g_i = k} \hat{\eta}_l(x_i)$ and $w_l = 1/[\alpha_l^2(1 - \alpha_l^2)]$.

Generalization of LDA
FDA: allow non-linear decision boundaries.
PDA: expand the predictors into a large basis set, and then penalize the coefficients to be smooth.
MDA: model each class by a mixture of two or more Gaussians with different centroids but a common covariance, rather than by a single Gaussian distribution as in LDA.

Flexible Discriminant Analysis
(Hastie et al., 1994)

FDA: Overview
The optimal scoring method provides a starting point for generalizing LDA to a nonparametric version.
We replace the linear projection operator $P_X$ by a nonparametric regression procedure, which we denote by the linear operator $S_\lambda$.
One simple and effective approach toward this end is to expand $X$ into a larger set of basis variables $h(X)$ and then simply use $S_\lambda = P_{h(X)}$ in place of $P_X$.

FDA: Overview
These regression problems are defined via the criterion
$$\mathrm{ASR}\left( \{\theta_l, \eta_l\}_{l=1}^{L} \right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left( \theta_l(g_i) - \eta_l(x_i) \right)^2 + \lambda J(\eta_l) \right],$$
where $J$ is a regularizer appropriate for some forms of nonparametric regression (e.g., smoothing splines, additive splines and lower-order ANOVA models).

FDA by optimal scoring
Way to the solution
1. Form the $N \times K$ indicator response matrix $Y$.
2. Fit a multi-response adaptive nonparametric regression of $Y$ on $X$, giving fitted values $\hat{Y}$. Let $S_\lambda$ be the linear operator that fits the final chosen model and let $\eta^*(x)$ be the vector of fitted regression functions.
3. Compute the eigen-decomposition of $Y^T \hat{Y} = Y^T S_\lambda Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update the final model from step 2 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$.
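Since the simplest choice $S_\lambda = P_{h(X)}$ just replaces $X$ by a basis expansion, an FDA fit can reuse the `lda_optimal_scoring` sketch given earlier. The quadratic basis below is only one illustrative choice of $h(X)$, not the adaptive regression of the slide, and the function names are mine.

```python
import numpy as np

def quadratic_basis(X):
    """Illustrative basis expansion h(X): linear terms plus squares and pairwise products."""
    n, p = X.shape
    cross = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.hstack([X, np.column_stack(cross)])

def fda_fit(X, y, basis=quadratic_basis):
    """FDA with S = P_{h(X)}: run the optimal-scoring steps on the expanded predictors.
    Assumes lda_optimal_scoring() from the LDA sketch above is in scope."""
    H = basis(X)
    return lda_optimal_scoring(H, y)   # coefficients act on centered h(x), not on x itself
```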

Penalized Discriminant Analysis
(Hastie et al., 1995)

PDA: Overview
Although FDA is motivated by generalizing optimal scoring, it can also be viewed directly as a form of regularized discriminant analysis.
Suppose the regression procedure used in FDA amounts to a linear regression onto a basis expansion $h(X)$, with a quadratic penalty on the coefficients:
$$\mathrm{ASR}\left( \{\theta_l, \beta_l\}_{l=1}^{L} \right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left( \theta_l(g_i) - h^T(x_i)\beta_l \right)^2 + \lambda \beta_l^T \Omega \beta_l \right],$$
where the penalty matrix $\Omega$ penalizes "rough" coefficient vectors.
The steps in FDA can then be viewed as a generalized form of LDA, which we call PDA.

PDA: Overview
Enlarge the set of predictors $X$ via a basis expansion $h(X)$.
Use (penalized) LDA in the enlarged space, where the penalized Mahalanobis distance is given by
$$D(x, \mu) = (h(x) - h(\mu))^T (W + \lambda\Omega)^{-1} (h(x) - h(\mu)),$$
where $W$ is the within-class covariance matrix of the derived variables $h(x_i)$.
Decompose the classification subspace using a penalized metric:
$$\max_u \; u^T B u \quad \text{subject to} \quad u^T (W + \lambda\Omega) u = 1,$$
where $B$ is the corresponding between-class covariance matrix.
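The penalized Mahalanobis distance above takes only a few lines of numpy; this is a sketch with illustrative argument names ($\Omega$ is assumed symmetric positive semi-definite).

```python
import numpy as np

def penalized_mahalanobis(hx, hmu, W, Omega, lam):
    """D(x, mu) = (h(x) - h(mu))^T (W + lam * Omega)^{-1} (h(x) - h(mu))."""
    d = hx - hmu
    return d @ np.linalg.solve(W + lam * Omega, d)
```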

PDA by optimal scoring
Way to the solution
1. Form $Y$ and the basis matrix $H = (h_{ij}) = (h_j(x_i))$.
2. Fit a penalized multi-response regression of $Y$ on $H$, giving fitted values $\hat{Y} = S(\lambda)Y$. Let $S(\lambda) = H(H^T H + \lambda\Omega)^{-1} H^T$ be the smoother matrix of $H$ regularized by $\Omega$, and let $B = (H^T H + \lambda\Omega)^{-1} H^T Y$ be the penalized least squares estimate.
3. Compute the eigen-decomposition of $Y^T \hat{Y} = Y^T S(\lambda) Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update the coefficient matrix from step 2 using the optimal scores: $B \leftarrow B\Theta$.
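A numpy sketch of steps 1–4 with an explicit ridge-type penalty $\lambda\Omega$. Centering the basis matrix so the trivial constant score drops out is a simplification of this sketch, not part of the slide; names are mine.

```python
import numpy as np

def pda_fit(H, y, Omega, lam):
    """PDA by optimal scoring on a basis matrix H with penalty lam * Omega."""
    Hc = H - H.mean(axis=0)                                     # simplification: no intercept
    N, _ = Hc.shape
    classes = np.unique(y)
    K = len(classes)
    Y = (y[:, None] == classes[None, :]).astype(float)          # step 1
    B = np.linalg.solve(Hc.T @ Hc + lam * Omega, Hc.T @ Y)      # step 2: penalized least squares
    Yhat = Hc @ B                                               # = S(lambda) Y
    d = np.sqrt(Y.mean(axis=0))
    M = (Y.T @ Yhat / N) / np.outer(d, d)                       # step 3: scores in the D_p metric
    evals, U = np.linalg.eigh((M + M.T) / 2)
    Theta = (U / d[:, None])[:, np.argsort(evals)[::-1][:K - 1]]
    return B @ Theta                                            # step 4: B <- B Theta
```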

Mixture Discriminant Analysis
(Hastie and Tibshirani, 1996)

MDA: Overview
Linear discriminant analysis can be viewed as a prototype classifier. Each class is represented by its centroid, and we classify to the closest one using an appropriate metric.
In many situations a single prototype is not sufficient to represent inhomogeneous classes, and mixture models are more appropriate.

MDA: Overview
A Gaussian mixture model for the $k$th class has density
$$P(X \mid G = k) = \sum_{r=1}^{R_k} \pi_{kr}\, \phi(X; \mu_{kr}, \Sigma),$$
where the mixing proportions $\pi_{kr}$ sum to one and $R_k$ is the number of prototypes for the $k$th class.
The class posterior probabilities are given by
$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R_k} \pi_{kr}\, \phi(x; \mu_{kr}, \Sigma)\, \Pi_k}{\sum_{l=1}^{K} \sum_{r=1}^{R_l} \pi_{lr}\, \phi(x; \mu_{lr}, \Sigma)\, \Pi_l},$$
where the $\Pi_k$ represent the class prior probabilities.
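A direct numpy transcription of the mixture density and class posterior above; the nested-list data layout and names are my own choices, not the slides'.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """phi(x; mu, Sigma) for a single observation x."""
    p = len(x)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

def mda_class_posteriors(x, mus, pis, Pis, Sigma):
    """P(G=k | X=x): mus[k][r] and pis[k][r] are the subclass means and mixing
    proportions of class k, Pis[k] the class priors, Sigma the shared covariance."""
    scores = np.array([
        Pi_k * sum(pi_kr * gaussian_density(x, mu_kr, Sigma)
                   for pi_kr, mu_kr in zip(pis_k, mus_k))
        for mus_k, pis_k, Pi_k in zip(mus, pis, Pis)
    ])
    return scores / scores.sum()
```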

MDA: Estimation
We estimate the parameters by maximum likelihood, using the joint log-likelihood based on $P(G, X)$:
$$\sum_{k=1}^{K} \sum_{g_i = k} \log \left[ \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)\, \Pi_k \right].$$
We solve for the MLEs by the EM algorithm.

MDA: Estimation
E-step: Given the current parameters, compute the responsibility of subclass $c_{kr}$ within class $k$ for each of the class-$k$ observations ($g_i = k$):
$$\hat{p}(c_{kr} \mid x_i, g_i) = \frac{\pi_{kr}\, \phi(x_i; \mu_{kr}, \Sigma)}{\sum_{r'=1}^{R_k} \pi_{kr'}\, \phi(x_i; \mu_{kr'}, \Sigma)}.$$
M-step: Compute the weighted MLEs for the parameters of each of the component Gaussians within each of the classes, using the weights from the E-step.
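The E- and M-steps above as one numpy sweep, computing the responsibilities and then the weighted MLEs directly (rather than through the weighted-LDA route described on the next slides). The data layout and names are my own.

```python
import numpy as np

def mda_em_step(X, y, mus, pis, Sigma):
    """One EM sweep for MDA with a shared covariance. mus[k] is an (R_k x p) array of
    subclass means and pis[k] the corresponding R_k mixing proportions for class k."""
    N, p = X.shape
    classes = np.unique(y)
    Sigma_inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    resp = {}
    # E-step: responsibilities p(c_kr | x_i, g_i) for the class-k observations.
    for k_idx, k in enumerate(classes):
        Xk = X[y == k]
        diffs = Xk[:, None, :] - mus[k_idx][None, :, :]           # (N_k, R_k, p)
        quad = np.einsum('irp,pq,irq->ir', diffs, Sigma_inv, diffs)
        w = pis[k_idx] * np.exp(-0.5 * quad) / norm               # pi_kr * phi(x_i; mu_kr, Sigma)
        resp[k_idx] = w / w.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs for subclass means, mixing proportions, and pooled Sigma.
    new_mus, new_pis, Sigma_new = [], [], np.zeros((p, p))
    for k_idx, k in enumerate(classes):
        Xk, R = X[y == k], resp[k_idx]
        mu_k = (R.T @ Xk) / R.sum(axis=0)[:, None]
        new_mus.append(mu_k)
        new_pis.append(R.sum(axis=0) / len(Xk))
        for r in range(R.shape[1]):
            D = Xk - mu_k[r]
            Sigma_new += (R[:, r, None] * D).T @ D
    return new_mus, new_pis, Sigma_new / N
```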

MDA: Estimation
The M-step is a weighted version of LDA, with $R = \sum_{k=1}^{K} R_k$ "classes" and $\sum_{k=1}^{K} N_k R_k$ "observations".
We can use optimal scoring as before to solve the weighted LDA problem, which allows us to use a weighted version of FDA or PDA at this stage.

MDA: Estimation
The indicator matrix $Y_{N \times K}$ collapses in this case to a blurred response matrix $Z_{N \times R}$.
For example,

           c11  c12  c13  c21  c22  c23  c31  c32  c33
  g1 = 2    0    0    0   0.3  0.5  0.2   0    0    0
  g2 = 1   0.9  0.1  0.0   0    0    0    0    0    0
  g3 = 1   0.1  0.8  0.1   0    0    0    0    0    0
  g4 = 3    0    0    0    0    0    0   0.5  0.4  0.1
  ...
  gN = 3    0    0    0    0    0    0   0.5  0.4  0.1

where the entries in a class-$k$ row correspond to $\hat{p}(c_{kr} \mid x_i, g_i)$.
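Building the blurred response matrix $Z$ from the subclass responsibilities is mechanical; here is a short sketch, assuming the responsibilities are stored per class as in the EM sketch above (names are illustrative).

```python
import numpy as np

def blurred_response(y, resp, classes, R):
    """Assemble the N x R blurred response matrix Z. resp[k_idx] is the (N_k x R_k)
    responsibility matrix for class classes[k_idx], and R is the list of subclass counts R_k."""
    Z = np.zeros((len(y), sum(R)))
    offsets = np.cumsum([0] + list(R[:-1]))   # first column of each class-k block
    for k_idx, k in enumerate(classes):
        rows = np.where(y == k)[0]
        Z[rows, offsets[k_idx]:offsets[k_idx] + R[k_idx]] = resp[k_idx]
    return Z
```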

MDA: Estimation by optimal scoring
Optimal scoring within the EM steps of MDA:
1. Start with a set of $R_k$ subclasses $c_{kr}$ per class, and associated subclass probabilities $\hat{p}(c_{kr} \mid x_i, g_i)$.
2. Form the blurred response matrix $Z$: if $g_i = k$, fill the $k$th block of $R_k$ entries in the $i$th row with the values $\hat{p}(c_{kr} \mid x_i, g_i)$, and the rest with 0s.
3. Fit a multi-response adaptive nonparametric regression of $Z$ on $X$, giving fitted values $\hat{Z}$. Let $\eta^*(x)$ be the vector of fitted regression functions.
4. Let $\Theta$ be the largest $K$ non-trivial eigenvectors of $Z^T \hat{Z}$, with normalization $\Theta^T D_p \Theta = I_K$.
5. Update the final model from step 3 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$, and update $\hat{p}(c_{kr} \mid x_i, g_i)$ and $\hat{\pi}_{kr}$.

Performance