3. The Multivariate Normal Distribution
3.1 Introduction
A generalization of the familiar bell-shaped normal density to several
dimensions plays a fundamental role in multivariate analysis.

While real data are never exactly multivariate normal, the normal density
is often a useful approximation to the "true" population distribution because
of a central limit effect.

One advantage of the multivariate normal distribution stems from the fact
that it is mathematically tractable and "nice" results can be obtained.
To summarize, many real-world problems fall naturally within the framework
of normal theory. The importance of the normal distribution rests on its dual
role as both population model for certain natural phenomena and approximate
sampling distribution for many statistics.
3.2 The Multivariate Normal Density and Its Properties
Recall that the univariate normal distribution, with mean $\mu$ and variance $\sigma^2$,
has the probability density function
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2}, \qquad -\infty < x < \infty
\]
The term
\[
\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu)
\]
measures the square of the distance from $x$ to $\mu$ in standard deviation units.
This can be generalized for a $p\times 1$ vector $x$ of observations on several variables
as
\[
(x-\mu)'\Sigma^{-1}(x-\mu)
\]
The $p\times 1$ vector $\mu$ represents the expected value of the random vector $X$,
and the $p\times p$ matrix $\Sigma$ is the variance-covariance matrix of $X$.
A $p$-dimensional normal density for the random vector $X' = [X_1, X_2, \ldots, X_p]$
has the form
\[
f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(x-\mu)'\Sigma^{-1}(x-\mu)/2}
\]
where $-\infty < x_i < \infty$, $i = 1, 2, \ldots, p$. We denote this $p$-dimensional
normal density by $N_p(\mu, \Sigma)$.
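As a computational companion, here is a minimal numpy sketch that evaluates the $N_p(\mu, \Sigma)$ density at a point. The function name and the parameter values are illustrative, not from the text.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the N_p(mu, Sigma) density at the point x."""
    p = len(mu)
    diff = x - mu
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu); solve() avoids an explicit inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / 2) / norm_const

# Illustrative bivariate example (arbitrary values)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_density(np.array([0.5, -0.5]), mu, Sigma))
```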
Example 3.1 (Bivariate normal density) Let us evaluate the $p = 2$ variate
normal density in terms of the individual parameters $\mu_1 = E(X_1)$, $\mu_2 = E(X_2)$,
$\sigma_{11} = \mathrm{Var}(X_1)$, $\sigma_{22} = \mathrm{Var}(X_2)$, and
$\rho_{12} = \sigma_{12}/(\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}) = \mathrm{Corr}(X_1, X_2)$.
Result 3.1 If $\Sigma$ is positive definite, so that $\Sigma^{-1}$ exists, then
\[
\Sigma e = \lambda e \quad\text{implies}\quad \Sigma^{-1} e = \frac{1}{\lambda}\, e
\]
so $(\lambda, e)$ is an eigenvalue-eigenvector pair for $\Sigma$ corresponding to the pair
$(1/\lambda, e)$ for $\Sigma^{-1}$. Also $\Sigma^{-1}$ is positive definite.
Constant probability density contour
\[
= \{\text{all } x \text{ such that } (x-\mu)'\Sigma^{-1}(x-\mu) = c^2\}
= \text{surface of an ellipsoid centered at } \mu.
\]
Contours of constant density for the $p$-dimensional normal distribution are
ellipsoids defined by $x$ such that
\[
(x-\mu)'\Sigma^{-1}(x-\mu) = c^2
\]
These ellipsoids are centered at $\mu$ and have axes $\pm c\sqrt{\lambda_i}\, e_i$, where
$\Sigma e_i = \lambda_i e_i$ for $i = 1, 2, \ldots, p$.
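A short numpy sketch of computing these contour axes from the eigendecomposition of $\Sigma$; the covariance matrix below is illustrative.

```python
import numpy as np

# Axes of the ellipsoid (x - mu)' Sigma^{-1} (x - mu) = c^2 point along the
# eigenvectors e_i of Sigma, with half-lengths c * sqrt(lambda_i).
Sigma = np.array([[4.0, 1.0],
                  [1.0, 3.0]])   # illustrative covariance matrix
c = 1.0

# eigh handles symmetric matrices; eigenvalues are returned in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
for lam, e in zip(eigvals, eigvecs.T):
    print("half-length:", c * np.sqrt(lam), " direction:", e)
```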
Example 3.2 (Contours of the bivariate normal density) Obtain the axes
of constant probability density contours for a bivariate normal distribution when
$\sigma_{11} = \sigma_{22}$.
The solid ellipsoid of $x$ values satisfying
\[
(x-\mu)'\Sigma^{-1}(x-\mu) \le \chi_p^2(\alpha)
\]
has probability $1-\alpha$, where $\chi_p^2(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square
distribution with $p$ degrees of freedom.
Additional Properties of the Multivariate Normal
Distribution

The following are true for a random vector $X$ having a multivariate normal
distribution:
1. Linear combinations of the components of $X$ are normally distributed.
2. All subsets of the components of $X$ have a (multivariate) normal distribution.
3. Zero covariance implies that the corresponding components are independently
distributed.
4. The conditional distributions of the components are (multivariate) normal.
Result 3.2 If $X$ is distributed as $N_p(\mu, \Sigma)$, then any linear combination of
variables $a'X = a_1X_1 + a_2X_2 + \cdots + a_pX_p$ is distributed as $N(a'\mu, a'\Sigma a)$. Also,
if $a'X$ is distributed as $N(a'\mu, a'\Sigma a)$ for every $a$, then $X$ must be $N_p(\mu, \Sigma)$.
Example 3.3 (The distribution of a linear combination of the components
of a normal random vector) Consider the linear combination $a'X$ of a
multivariate normal random vector determined by the choice $a' = [1, 0, \ldots, 0]$.
Result 3.3 If $X$ is distributed as $N_p(\mu, \Sigma)$, the $q$ linear combinations
\[
A_{(q\times p)} X_{(p\times 1)} =
\begin{bmatrix}
a_{11}X_1 + \cdots + a_{1p}X_p \\
a_{21}X_1 + \cdots + a_{2p}X_p \\
\vdots \\
a_{q1}X_1 + \cdots + a_{qp}X_p
\end{bmatrix}
\]
are distributed as $N_q(A\mu, A\Sigma A')$. Also $X_{(p\times 1)} + d_{(p\times 1)}$, where $d$ is a vector of
constants, is distributed as $N_p(\mu + d, \Sigma)$.
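Result 3.3 is easy to check numerically. The sketch below compares the exact parameters $A\mu$ and $A\Sigma A'$ with Monte Carlo moments of $AX$; all numerical values are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for X ~ N_3(mu, Sigma)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])   # q = 2 linear combinations

# Exact parameters of AX from Result 3.3
print("A mu      :", A @ mu)
print("A Sigma A':\n", A @ Sigma @ A.T)

# Monte Carlo check: sample X and compare sample moments of AX
X = rng.multivariate_normal(mu, Sigma, size=100_000)
AX = X @ A.T
print("sample mean:", AX.mean(axis=0))
print("sample cov :\n", np.cov(AX, rowvar=False))
```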
Example 3.4 (The distribution of two linear combinations of the
components of a normal random vector) For $X$ distributed as $N_3(\mu, \Sigma)$,
find the distribution of
\[
\begin{bmatrix} X_1 - X_2 \\ X_2 - X_3 \end{bmatrix}
=
\begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}
= AX
\]
Result 3.4 All subsets of $X$ are normally distributed. If we respectively partition
$X$, its mean vector $\mu$, and its covariance matrix $\Sigma$ as
\[
X_{(p\times 1)} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad
\mu_{(p\times 1)} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad
\Sigma_{(p\times p)} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\]
where $X_1$ and $\mu_1$ are $q\times 1$, $X_2$ and $\mu_2$ are $(p-q)\times 1$, $\Sigma_{11}$ is $q\times q$,
$\Sigma_{12}$ is $q\times(p-q)$, $\Sigma_{21}$ is $(p-q)\times q$, and $\Sigma_{22}$ is $(p-q)\times(p-q)$,
then $X_1$ is distributed as $N_q(\mu_1, \Sigma_{11})$.
Example 3.5 (The distribution of a subset of a normal random vector)
If $X$ is distributed as $N_5(\mu, \Sigma)$, find the distribution of $[X_2, X_4]'$.
Result 3.5
(a) If $X_1$ and $X_2$ are independent, then $\mathrm{Cov}(X_1, X_2) = 0$, a $q_1\times q_2$ matrix of
zeros, where $X_1$ is a $q_1\times 1$ random vector and $X_2$ is a $q_2\times 1$ random vector.
(b) If $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ is
$N_{q_1+q_2}\!\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)$,
then $X_1$ and $X_2$ are
independent if and only if $\Sigma_{12} = \Sigma_{21}' = 0$.
(c) If $X_1$ and $X_2$ are independent and are distributed as $N_{q_1}(\mu_1, \Sigma_{11})$
and $N_{q_2}(\mu_2, \Sigma_{22})$, respectively, then $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$
has the multivariate normal distribution
\[
N_{q_1+q_2}\!\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}\right)
\]
Example 3.6 (The equivalence of zero covariance and independence for
normal variables) Let $X_{(3\times 1)}$ be $N_3(\mu, \Sigma)$ with
\[
\Sigma = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix}
\]
Are $X_1$ and $X_2$ independent? What about $(X_1, X_2)$ and $X_3$?
Result 3.6 Let $X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ be distributed as $N_p(\mu, \Sigma)$ with
$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$,
$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$,
and $|\Sigma_{22}| > 0$. Then the conditional distribution of $X_1$, given
that $X_2 = x_2$, is normal and has
\[
\text{Mean} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)
\]
and
\[
\text{Covariance} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
\]
Note that the covariance does not depend on the value $x_2$ of the conditioning
variable.
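The conditional mean and covariance of Result 3.6 translate directly into code. Below is a sketch with illustrative numbers; `conditional_mvn` is a hypothetical helper, not a library routine.

```python
import numpy as np

def conditional_mvn(mu, Sigma, q, x2):
    """Mean and covariance of X1 | X2 = x2 for X ~ N_p(mu, Sigma),
    where X1 holds the first q components (Result 3.6)."""
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    # Sigma_12 Sigma_22^{-1}, computed via a solve for numerical stability
    reg = np.linalg.solve(S22.T, S12.T).T
    cond_mean = mu1 + reg @ (x2 - mu2)
    cond_cov = S11 - reg @ S21
    return cond_mean, cond_cov

# Illustrative numbers, not from the text
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
m, C = conditional_mvn(mu, Sigma, q=1, x2=np.array([0.0, 0.0]))
print(m, C)
```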
Example 3.7 (The conditional density of a bivariate normal distribution)
Obtain the conditional density of $X_1$, given that $X_2 = x_2$, for any bivariate
normal distribution.
Result 3.7 Let $X$ be distributed as $N_p(\mu, \Sigma)$ with $|\Sigma| > 0$. Then
(a) $(X-\mu)'\Sigma^{-1}(X-\mu)$ is distributed as $\chi_p^2$, where $\chi_p^2$ denotes the chi-square
distribution with $p$ degrees of freedom.
(b) The $N_p(\mu, \Sigma)$ distribution assigns probability $1-\alpha$ to the solid ellipsoid
$\{x : (x-\mu)'\Sigma^{-1}(x-\mu) \le \chi_p^2(\alpha)\}$, where $\chi_p^2(\alpha)$ denotes the upper $(100\alpha)$th
percentile of the $\chi_p^2$ distribution.
Result 3.8 Let $X_1, X_2, \ldots, X_n$ be mutually independent with $X_j$ distributed
as $N_p(\mu_j, \Sigma)$. (Note that each $X_j$ has the same covariance matrix $\Sigma$.) Then
\[
V_1 = c_1X_1 + c_2X_2 + \cdots + c_nX_n
\]
is distributed as $N_p\!\left(\sum_{j=1}^n c_j\mu_j,\; \Big(\sum_{j=1}^n c_j^2\Big)\Sigma\right)$.
Moreover, $V_1$ and $V_2 = b_1X_1 + b_2X_2 + \cdots + b_nX_n$ are jointly multivariate normal with covariance matrix
\[
\begin{bmatrix}
\Big(\sum_{j=1}^n c_j^2\Big)\Sigma & (b'c)\Sigma \\[1ex]
(b'c)\Sigma & \Big(\sum_{j=1}^n b_j^2\Big)\Sigma
\end{bmatrix}
\]
Consequently, $V_1$ and $V_2$ are independent if $b'c = \sum_{j=1}^n c_jb_j = 0$.
Example 3.8 (Linear combinations of random vectors) Let $X_1, X_2, X_3$,
and $X_4$ be independent and identically distributed $3\times 1$ random vectors with
\[
\mu = \begin{bmatrix} 3 \\ -1 \\ 1 \end{bmatrix}
\quad\text{and}\quad
\Sigma = \begin{bmatrix} 3 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 2 \end{bmatrix}
\]
(a) Find the distribution of the linear combination $a'X_1$ of the three
components of $X_1$, where $a = [a_1\; a_2\; a_3]'$.
(b) Consider the two linear combinations of random vectors
\[
\tfrac{1}{2}X_1 + \tfrac{1}{2}X_2 + \tfrac{1}{2}X_3 + \tfrac{1}{2}X_4
\quad\text{and}\quad
X_1 + X_2 + X_3 - 3X_4.
\]
Find the mean vector and covariance matrix for each linear combination of
vectors and also the covariance between them.
3.3 Sampling from a Multivariate Normal Distribution and
Maximum Likelihood Estimation

The Multivariate Normal Likelihood

The joint density function of all $p\times 1$ observed random vectors $X_1, X_2, \ldots, X_n$ is
\[
\begin{aligned}
\left\{\begin{matrix}\text{Joint density}\\ \text{of } X_1, X_2, \ldots, X_n\end{matrix}\right\}
&= \prod_{j=1}^n \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(x_j-\mu)'\Sigma^{-1}(x_j-\mu)/2} \\
&= \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\, e^{-\sum_{j=1}^n (x_j-\mu)'\Sigma^{-1}(x_j-\mu)/2} \\
&= \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\,
e^{-\mathrm{tr}\left[\Sigma^{-1}\left(\sum_{j=1}^n (x_j-\bar{x})(x_j-\bar{x})' + n(\bar{x}-\mu)(\bar{x}-\mu)'\right)\right]/2}
\end{aligned}
\]
Likelihood

When the numerical values of the observations become available, they may
be substituted for the $x_j$ in the equation above. The resulting expression,
now considered as a function of $\mu$ and $\Sigma$ for the fixed set of observations
$x_1, x_2, \ldots, x_n$, is called the likelihood.

Maximum likelihood estimation

One meaning of "best" is to select the parameter values that maximize
the joint density evaluated at the observations. This technique is called
maximum likelihood estimation, and the maximizing parameter values are
called maximum likelihood estimates.
Result 3.9 Let $A$ be a $k\times k$ symmetric matrix and $x$ be a $k\times 1$ vector. Then
(a) $x'Ax = \mathrm{tr}(x'Ax) = \mathrm{tr}(Axx')$
(b) $\mathrm{tr}(A) = \sum_{i=1}^k \lambda_i$, where the $\lambda_i$ are the eigenvalues of $A$.
Maximum Likelihood Estimation of $\mu$ and $\Sigma$

Result 3.10 Given a $p\times p$ symmetric positive definite matrix $B$ and a scalar
$b > 0$, it follows that
\[
\frac{1}{|\Sigma|^b}\, e^{-\mathrm{tr}(\Sigma^{-1}B)/2} \le \frac{1}{|B|^b}\,(2b)^{pb}\, e^{-bp}
\]
for all positive definite $\Sigma_{(p\times p)}$, with equality holding only for $\Sigma = (1/2b)B$.
Result 3.11 Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population
with mean $\mu$ and covariance $\Sigma$. Then
\[
\hat{\mu} = \bar{X}
\quad\text{and}\quad
\hat{\Sigma} = \frac{1}{n}\sum_{j=1}^n (X_j - \bar{X})(X_j - \bar{X})' = \frac{n-1}{n}\,S
\]
are the maximum likelihood estimators of $\mu$ and $\Sigma$, respectively. Their
observed values, $\bar{x}$ and $(1/n)\sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})'$, are called the maximum
likelihood estimates of $\mu$ and $\Sigma$.
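A small numpy sketch of Result 3.11: the MLE of $\Sigma$ divides by $n$, while the sample covariance $S$ divides by $n - 1$. The data below are simulated purely for illustration.

```python
import numpy as np

def mvn_mle(X):
    """Maximum likelihood estimates of mu and Sigma from an n x p data matrix
    (Result 3.11): mu_hat = x_bar, Sigma_hat = ((n-1)/n) S."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / n      # divide by n, not n - 1
    return mu_hat, Sigma_hat

# Illustrative simulated data
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=500)
mu_hat, Sigma_hat = mvn_mle(X)
S = np.cov(X, rowvar=False)                    # unbiased version, divisor n - 1
print(mu_hat)
print(Sigma_hat)
print((X.shape[0] - 1) / X.shape[0] * S)       # matches Sigma_hat
```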
Invariance Property of Maximum Likelihood Estimators

Let $\hat{\theta}$ be the maximum likelihood estimator of $\theta$, and consider the parameter
$h(\theta)$, which is a function of $\theta$. Then the maximum likelihood estimate of
$h(\theta)$ is given by $h(\hat{\theta})$.

For example:
1. The maximum likelihood estimator of $\mu'\Sigma^{-1}\mu$ is $\hat{\mu}'\hat{\Sigma}^{-1}\hat{\mu}$, where
$\hat{\mu} = \bar{X}$ and $\hat{\Sigma} = \frac{n-1}{n}S$ are the maximum likelihood estimators of $\mu$ and $\Sigma$,
respectively.
2. The maximum likelihood estimator of $\sqrt{\sigma_{ii}}$ is $\sqrt{\hat{\sigma}_{ii}}$, where
\[
\hat{\sigma}_{ii} = \frac{1}{n}\sum_{j=1}^n (X_{ij} - \bar{X}_i)^2
\]
is the maximum likelihood estimator of $\sigma_{ii} = \mathrm{Var}(X_i)$.
Sufficient Statistics

Let $X_1, X_2, \ldots, X_n$ be a random sample from a multivariate normal
population with mean $\mu$ and covariance $\Sigma$. Then
\[
\bar{X} \quad\text{and}\quad S = \frac{1}{n-1}\sum_{j=1}^n (X_j - \bar{X})(X_j - \bar{X})'
\]
are sufficient statistics.

The importance of sufficient statistics for normal populations is that all of
the information about $\mu$ and $\Sigma$ in the data matrix $X$ is contained in $\bar{X}$ and
$S$, regardless of the sample size $n$.

This generally is not true for nonnormal populations.

Since many multivariate techniques begin with sample means and covariances,
it is prudent to check on the adequacy of the multivariate normal assumption.
If the data cannot be regarded as multivariate normal, techniques that depend
solely on $\bar{X}$ and $S$ may be ignoring other useful sample information.
3.4 The Sampling Distribution of $\bar{X}$ and $S$

The univariate case ($p = 1$):

– $\bar{X}$ is normal with mean $\mu$ (the population mean) and variance $\frac{1}{n}\sigma^2$.
– $(n-1)s^2 = \sum_{j=1}^n (X_j - \bar{X})^2$ is distributed
as $\sigma^2$ times a chi-square variable having $n-1$ degrees of freedom (d.f.).
– The chi-square is the distribution of a sum of squares of independent standard
normal random variables. That is, $(n-1)s^2$ is distributed as
$\sigma^2(Z_1^2 + \cdots + Z_{n-1}^2) = (\sigma Z_1)^2 + \cdots + (\sigma Z_{n-1})^2$. The individual terms $\sigma Z_i$ are
independently distributed as $N(0, \sigma^2)$.
Wishart distribution
\[
W_m(\cdot \mid \Sigma) = \text{Wishart distribution with } m \text{ d.f.}
= \text{distribution of } \sum_{j=1}^m Z_jZ_j'
\]
where the $Z_j$ are each independently distributed as $N_p(0, \Sigma)$.

Properties of the Wishart Distribution
1. If $A_1$ is distributed as $W_{m_1}(A_1 \mid \Sigma)$ independently of $A_2$, which is
distributed as $W_{m_2}(A_2 \mid \Sigma)$, then $A_1 + A_2$ is distributed as $W_{m_1+m_2}(A_1 + A_2 \mid \Sigma)$. That is, the degrees of freedom add.
2. If $A$ is distributed as $W_m(A \mid \Sigma)$, then $CAC'$ is distributed as
$W_m(CAC' \mid C\Sigma C')$.
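The definition suggests a direct way to simulate Wishart draws: sum $m$ outer products of $N_p(0, \Sigma)$ vectors. A sketch with illustrative values; since $E[Z_jZ_j'] = \Sigma$, the average of many draws divided by $m$ should be close to $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(10)
p, m = 2, 5
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # illustrative covariance matrix

def wishart_draw():
    # One draw from W_m(. | Sigma): sum of m outer products Z_j Z_j'
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
    return Z.T @ Z

print(wishart_draw())

# E[W] = m * Sigma; check by averaging many draws
mean_W = np.mean([wishart_draw() for _ in range(10_000)], axis=0)
print(mean_W / m)   # approximately Sigma
```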
The Sampling Distribution of $\bar{X}$ and $S$

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $p$-variate normal
distribution with mean $\mu$ and covariance matrix $\Sigma$. Then
1. $\bar{X}$ is distributed as $N_p(\mu, \frac{1}{n}\Sigma)$.
2. $(n-1)S$ is distributed as a Wishart random matrix with $n-1$ d.f.
3. $\bar{X}$ and $S$ are independent.
3.5 Large-Sample Behavior of $\bar{X}$ and $S$

Result 3.12 (Law of large numbers) Let $Y_1, Y_2, \ldots, Y_n$ be independent
observations from a population with mean $E(Y_i) = \mu$. Then
\[
\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}
\]
converges in probability to $\mu$ as $n$ increases without bound. That is, for any
prescribed accuracy $\varepsilon > 0$, $P[-\varepsilon < \bar{Y} - \mu < \varepsilon]$ approaches unity as $n \to \infty$.

Result 3.13 (The central limit theorem) Let $X_1, X_2, \ldots, X_n$ be independent
observations from any population with mean $\mu$ and finite covariance $\Sigma$. Then
\[
\sqrt{n}(\bar{X} - \mu) \text{ has an approximate } N_p(0, \Sigma) \text{ distribution}
\]
for large sample sizes. Here $n$ should also be large relative to $p$.
Large-Sample Behavior of $\bar{X}$ and $S$

Let $X_1, X_2, \ldots, X_n$ be independent observations from a population with mean
$\mu$ and finite (nonsingular) covariance $\Sigma$. Then
\[
\sqrt{n}(\bar{X} - \mu) \text{ is approximately } N_p(0, \Sigma)
\]
and
\[
n(\bar{X} - \mu)'S^{-1}(\bar{X} - \mu) \text{ is approximately } \chi_p^2
\]
for $n - p$ large.
3.6 Assessing the Assumption of Normality

Most of the statistical techniques discussed assume that each vector
observation $X_j$ comes from a multivariate normal distribution.

In situations where the sample size is large and the techniques depend
solely on the behavior of $\bar{X}$, or on distances involving $\bar{X}$ of the form
$n(\bar{X} - \mu)'S^{-1}(\bar{X} - \mu)$, the assumption of normality for the individual observations is
less crucial.

But to some degree, the quality of inferences made by these methods
depends on how closely the true parent population resembles the multivariate
normal form.
Therefore, we address these questions:
1. Do the marginal distributions of the elements of $X$ appear to be normal?
What about a few linear combinations of the components $X_j$?
2. Do the scatter plots of pairs of observations on different characteristics give the
elliptical appearance expected from normal populations?
3. Are there any "wild" observations that should be checked for accuracy?
Evaluating the Normality of the Univariate Marginal
Distributions

Dot diagrams for smaller $n$ and histograms for $n > 25$ or so help reveal
situations where one tail of a univariate distribution is much longer than the
other.

If the histogram for a variable $X_i$ appears reasonably symmetric, we can
check further by counting the number of observations in certain intervals. For
example, a univariate normal distribution assigns probability 0.683 to the interval
\[
(\mu_i - \sqrt{\sigma_{ii}},\; \mu_i + \sqrt{\sigma_{ii}})
\]
and probability 0.954 to the interval
\[
(\mu_i - 2\sqrt{\sigma_{ii}},\; \mu_i + 2\sqrt{\sigma_{ii}})
\]
Consequently, with a large sample size $n$, we expect the observed proportion $\hat{p}_{i1}$ of the
observations lying in the interval $(\bar{x}_i - \sqrt{s_{ii}},\; \bar{x}_i + \sqrt{s_{ii}})$ to be about 0.683,
and the observed proportion $\hat{p}_{i2}$ in the interval $(\bar{x}_i - 2\sqrt{s_{ii}},\; \bar{x}_i + 2\sqrt{s_{ii}})$ to be about 0.954.
Using the normal approximation to the sampling distribution of $\hat{p}_i$, observe that either
\[
|\hat{p}_{i1} - 0.683| > 3\sqrt{\frac{(0.683)(0.317)}{n}} = \frac{1.396}{\sqrt{n}}
\]
or
\[
|\hat{p}_{i2} - 0.954| > 3\sqrt{\frac{(0.954)(0.046)}{n}} = \frac{0.628}{\sqrt{n}}
\]
would indicate departures from an assumed normal distribution for the $i$th
characteristic.
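A sketch of this interval check for a single variable, using the three-standard-error limits above; the sample is simulated for illustration and `interval_check` is a hypothetical helper.

```python
import numpy as np

def interval_check(x):
    """Compare observed coverage of the 1 and 2 standard-deviation intervals
    with the normal proportions 0.683 and 0.954 (x is a 1-D sample)."""
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    p1 = np.mean(np.abs(x - xbar) <= s)        # proportion in (xbar - s, xbar + s)
    p2 = np.mean(np.abs(x - xbar) <= 2 * s)
    # three-standard-error limits from the text
    ok1 = abs(p1 - 0.683) <= 1.396 / np.sqrt(n)
    ok2 = abs(p2 - 0.954) <= 0.628 / np.sqrt(n)
    return p1, p2, ok1, ok2

rng = np.random.default_rng(2)
print(interval_check(rng.normal(size=200)))    # illustrative normal sample
```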
Plots are always useful devices in any data analysis. Special plots called
Q-Q plots can be used to assess the assumption of normality.

Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ represent the observations after they are
ordered according to magnitude. For a standard normal distribution, the
quantiles $q_{(j)}$ are defined by the relation
\[
P[Z \le q_{(j)}] = \int_{-\infty}^{q_{(j)}} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = p_{(j)} = \frac{j - \frac{1}{2}}{n}
\]
Here $p_{(j)}$ is the probability of getting a value less than or equal to $q_{(j)}$ in a
single drawing from a standard normal population.

The idea is to look at the pairs of quantiles $(q_{(j)}, x_{(j)})$ with the same
associated cumulative probability $(j - \frac{1}{2})/n$. If the data arise from a normal
population, the pairs $(q_{(j)}, x_{(j)})$ will be approximately linearly related, since
$\sigma q_{(j)} + \mu$ is nearly the expected sample quantile.
Example 3.9 (Constructing a Q-Q plot) A sample of $n = 10$ observations
gives the values in the following table:

The steps leading to a Q-Q plot are as follows:
1. Order the original observations to get $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and their
corresponding probability values $(1 - \frac{1}{2})/n, (2 - \frac{1}{2})/n, \ldots, (n - \frac{1}{2})/n$;
2. Calculate the standard normal quantiles $q_{(1)}, q_{(2)}, \ldots, q_{(n)}$; and
3. Plot the pairs of observations $(q_{(1)}, x_{(1)}), (q_{(2)}, x_{(2)}), \ldots, (q_{(n)}, x_{(n)})$, and
examine the "straightness" of the outcome (a computational sketch follows below).
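A minimal sketch of these three steps using scipy's standard normal quantile function; the sample is simulated rather than taken from the $n = 10$ values of the example.

```python
import numpy as np
from scipy.stats import norm

def qq_pairs(x):
    """Return the pairs (q_(j), x_(j)) used in a normal Q-Q plot."""
    x_ordered = np.sort(x)                     # step 1: order the observations
    n = len(x)
    probs = (np.arange(1, n + 1) - 0.5) / n    # probabilities (j - 1/2)/n
    quantiles = norm.ppf(probs)                # step 2: standard normal quantiles
    return quantiles, x_ordered                # step 3: plot these pairs

rng = np.random.default_rng(3)
q, x_ord = qq_pairs(rng.normal(loc=5.0, scale=2.0, size=10))
for qj, xj in zip(q, x_ord):
    print(f"{qj:8.3f}  {xj:8.3f}")
```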
Example 3.10 (A Q-Q plot for radiation data) The quality-control
department of a manufacturer of microwave ovens is required by the federal
government to monitor the amount of radiation emitted when the doors of the
ovens are closed. Observations of the radiation emitted through closed doors of
$n = 42$ randomly selected ovens were made. The data are listed in the following
table.
The straightness of the Q-Q plot can be measured by calculating the
correlation coefficient of the points in the plot. The correlation coefficient for
the Q-Q plot is defined by
\[
r_Q = \frac{\displaystyle\sum_{j=1}^n (x_{(j)} - \bar{x})(q_{(j)} - \bar{q})}
{\sqrt{\displaystyle\sum_{j=1}^n (x_{(j)} - \bar{x})^2}\,\sqrt{\displaystyle\sum_{j=1}^n (q_{(j)} - \bar{q})^2}}
\]
and a powerful test of normality can be based on it. Formally, we reject the
hypothesis of normality at level of significance $\alpha$ if $r_Q$ falls below the appropriate
value in the following table.
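Computing $r_Q$ takes only a few lines; the critical values of the table are not reproduced here, so this sketch returns the statistic only. Samples are simulated for illustration.

```python
import numpy as np
from scipy.stats import norm

def r_Q(x):
    """Correlation coefficient of the normal Q-Q plot points."""
    x_ordered = np.sort(x)
    n = len(x)
    q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    xc, qc = x_ordered - x_ordered.mean(), q - q.mean()
    return (xc @ qc) / np.sqrt((xc @ xc) * (qc @ qc))

rng = np.random.default_rng(4)
print(r_Q(rng.normal(size=42)))        # close to 1 for normal data
print(r_Q(rng.exponential(size=42)))   # noticeably smaller for skewed data
```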
Example 3.11 (A correlation coefficient test for normality) Let us calculate
the correlation coefficient $r_Q$ from the Q-Q plot of Example 3.9 and test for
normality.
Linear combinations of more than one characteristic can be investigated.
Many statisticians suggest plotting
\[
\hat{e}_1'x_j \quad\text{where}\quad S\hat{e}_1 = \hat{\lambda}_1\hat{e}_1
\]
in which $\hat{\lambda}_1$ is the largest eigenvalue of $S$. Here $x_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}]$ is
the $j$th observation on the $p$ variables $X_1, X_2, \ldots, X_p$. The linear combination
$\hat{e}_p'x_j$ corresponding to the smallest eigenvalue is also frequently singled out for
inspection.
Evaluating Bivariate Normality

By Result 3.7, the set of bivariate outcomes $x$ such that
\[
(x-\mu)'\Sigma^{-1}(x-\mu) \le \chi_2^2(0.5)
\]
has probability 0.5.

Thus we should expect roughly the same percentage, 50%, of sample
observations to lie in the ellipse given by
\[
\{\text{all } x \text{ such that } (x-\bar{x})'S^{-1}(x-\bar{x}) \le \chi_2^2(0.5)\}
\]
where $\mu$ is replaced by $\bar{x}$ and $\Sigma^{-1}$ by its estimate $S^{-1}$. If not, the normality
assumption is suspect.
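A sketch of this check: compute the generalized distances and count the proportion falling inside the estimated 50% ellipse. The data are simulated for illustration.

```python
import numpy as np
from scipy.stats import chi2

def ellipse_coverage(X):
    """Proportion of bivariate observations inside the estimated 50% ellipse
    {x : (x - xbar)' S^{-1} (x - xbar) <= chi2_2(0.5)}."""
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    diffs = X - xbar
    d2 = np.einsum("ij,ij->i", diffs @ np.linalg.inv(S), diffs)
    return np.mean(d2 <= chi2.ppf(0.5, df=2))  # near 0.5 for normal data

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
print(ellipse_coverage(X))
```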
Example 3.12 (Checking bivariate normality) Although not a random sample,
data consisting of the pairs of observations ($x_1 =$ sales, $x_2 =$ profits) for the 10
largest companies in the world are listed in the following table. Check whether $(x_1, x_2)$
follows a bivariate normal distribution.
A somewhat more formal method for judging the joint normality of a data set is based
on the squared generalized distances
\[
d_j^2 = (x_j - \bar{x})'S^{-1}(x_j - \bar{x}), \qquad j = 1, 2, \ldots, n
\]
When the parent population is multivariate normal and both $n$ and $n - p$
are greater than 25 or 30, each of the squared distances $d_1^2, d_2^2, \ldots, d_n^2$ should
behave like a chi-square random variable.

Although these distances are not independent or exactly chi-square
distributed, it is helpful to plot them as if they were. The resulting
plot is called a chi-square plot or gamma plot, because the chi-square
distribution is a special case of the more general gamma distribution. To
construct the chi-square plot:
1. Order the squared distances from smallest to largest as $d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(n)}^2$.
2. Graph the pairs $\big(q_{c,p}((j - \frac{1}{2})/n),\; d_{(j)}^2\big)$, where $q_{c,p}((j - \frac{1}{2})/n)$ is the
$100(j - \frac{1}{2})/n$ quantile of the chi-square distribution with $p$ degrees of freedom
(a computational sketch follows below).
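A minimal sketch of the two steps, returning the pairs to be plotted; the data are simulated for illustration.

```python
import numpy as np
from scipy.stats import chi2

def chisquare_plot_pairs(X):
    """Pairs (chi-square quantile, ordered d^2_(j)) for a chi-square plot."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - xbar
    d2 = np.sort(np.einsum("ij,ij->i", diffs @ S_inv, diffs))  # step 1
    q = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)        # step 2
    return q, d2    # normal data give roughly a line through the origin

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0, 0], [[2, 0.5], [0.5, 1]], size=30)
q, d2 = chisquare_plot_pairs(X)
print(np.corrcoef(q, d2)[0, 1])
```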
Example 3.13 (Constructing a chi-square plot) Let us construct a chi-square
plot of the generalized distances given in Example 3.12. The ordered distances and
the corresponding chi-square percentiles for $p = 2$ and $n = 10$ are listed in the
following table:
Example 3.14 (Evaluating multivariate normality for a four-variable data
set) The data in Table 4.3 were obtained by taking four different measures of
stiffness, $x_1, x_2, x_3$, and $x_4$, of each of $n = 30$ boards. The first measurement
involves sending a shock wave down the board, the second measurement
is determined while vibrating the board, and the last two measurements are
obtained from static tests. The squared distances $d_j^2 = (x_j - \bar{x})'S^{-1}(x_j - \bar{x})$ are
also presented in the table.
3.7 Detecting Outliers and Cleaning Data

Outliers are best detected visually whenever this is possible.

For a single random variable, the problem is one-dimensional, and we look
for observations that are far from the others.

In the bivariate case, the situation is more complicated. Figure 4.10 shows a
situation with two unusual observations.

In higher dimensions, there can be outliers that cannot be detected from
the univariate plots or even the bivariate scatter plots. Here a large value
of $(x_j - \bar{x})'S^{-1}(x_j - \bar{x})$ will suggest an unusual observation, even though it
cannot be seen visually.
Steps for Detecting Outliers
1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values $z_{jk} = (x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$ for $j = 1, 2, \ldots, n$
and each column $k = 1, 2, \ldots, p$. Examine these standardized values for large
or small values.
4. Calculate the generalized squared distances $(x_j - \bar{x})'S^{-1}(x_j - \bar{x})$. Examine
these distances for unusually large values. In a chi-square plot, these would be the
points farthest from the origin (a computational sketch of steps 3 and 4 follows below).
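A sketch of steps 3 and 4; `outlier_screens` is a hypothetical helper, and the data are simulated for illustration.

```python
import numpy as np

def outlier_screens(X):
    """Standardized values and generalized squared distances (steps 3 and 4)."""
    xbar = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)
    Z = (X - xbar) / s                               # z_jk = (x_jk - xbar_k)/sqrt(s_kk)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - xbar
    d2 = np.einsum("ij,ij->i", diffs @ S_inv, diffs) # (x_j - xbar)' S^{-1} (x_j - xbar)
    return Z, d2

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0, 0], np.eye(3), size=30)
Z, d2 = outlier_screens(X)
print("largest |z|:", np.abs(Z).max(), " largest d^2:", d2.max())
```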
Example 3.15 (Detecting outliers in the data on lumber) Table 4.4 contains
the data in Table 4.3, along with the standardized observations. These data
consist of four different measurements of stiffness, $x_1, x_2, x_3$, and $x_4$, on each of
$n = 30$ boards. Detect outliers in these data.
3.8 Transformations to Near Normality

If normality is not a viable assumption, what is the next step?

One option is to ignore the findings of a normality check and proceed as if the data were
normally distributed. (Not recommended.)

A better option is to make nonnormal data more "normal looking" by considering
transformations of the data. Normal-theory analyses can then be carried
out with the suitably transformed data.

Appropriate transformations are suggested by
1. theoretical considerations, or
2. the data themselves (or both).
Helpful Transformations to Near Normality

Original Scale → Transformed Scale
1. Counts, $y$ → $\sqrt{y}$
2. Proportions, $\hat{p}$ → $\mathrm{logit}(\hat{p}) = \frac{1}{2}\log\left(\frac{\hat{p}}{1-\hat{p}}\right)$
3. Correlations, $r$ → Fisher's $z(r) = \frac{1}{2}\log\left(\frac{1+r}{1-r}\right)$

Box-Cox transformation
\[
x^{(\lambda)} =
\begin{cases}
\dfrac{x^\lambda - 1}{\lambda} & \lambda \ne 0 \\[1.5ex]
\ln x & \lambda = 0
\end{cases}
\qquad\text{or}\qquad
y_j^{(\lambda)} = \frac{x_j^\lambda - 1}{\lambda\left[\left(\prod_{i=1}^n x_i\right)^{1/n}\right]^{\lambda-1}},
\quad j = 1, \ldots, n
\]
Given the observations $x_1, x_2, \ldots, x_n$, the Box-Cox choice of an appropriate
power $\lambda$ is the solution that maximizes the expression
\[
\ell(\lambda) = -\frac{n}{2}\ln\left[\frac{1}{n}\sum_{j=1}^n \left(x_j^{(\lambda)} - \overline{x^{(\lambda)}}\right)^2\right]
+ (\lambda - 1)\sum_{j=1}^n \ln x_j
\]
where
\[
\overline{x^{(\lambda)}} = \frac{1}{n}\sum_{j=1}^n \frac{x_j^\lambda - 1}{\lambda}.
\]
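A grid-search sketch of maximizing $\ell(\lambda)$; in practice one might instead use `scipy.stats.boxcox`, which computes the maximum likelihood $\lambda$ directly. The grid range and the simulated data are illustrative.

```python
import numpy as np

def boxcox_loglik(lmbda, x):
    """The profile log-likelihood l(lambda) from the text (x must be positive)."""
    n = len(x)
    xt = np.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda
    return -n / 2 * np.log(np.mean((xt - xt.mean()) ** 2)) \
           + (lmbda - 1) * np.log(x).sum()

def boxcox_lambda(x, grid=np.linspace(-2, 2, 401)):
    """Grid-search maximizer of l(lambda); a simple stand-in for a formal optimizer."""
    lls = [boxcox_loglik(l, x) for l in grid]
    return grid[int(np.argmax(lls))]

rng = np.random.default_rng(8)
x = rng.lognormal(size=100)              # skewed, positive data
print("lambda_hat =", boxcox_lambda(x))  # near 0 here, since log(x) is normal
```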
Example 3.16 (Determining a power transformation for univariate data)
We gave readings of microwave radiation emitted through the closed doors of
$n = 42$ ovens in Example 3.10. The Q-Q plot of these data in Figure 4.6
indicates that the observations deviate from what would be expected if they
were normally distributed. Since all the observations are positive, let
us perform a power transformation of the data which, we hope, will produce
results that are more nearly normal. We must find the value of $\lambda$ that maximizes the
function $\ell(\lambda)$.
Transforming Multivariate Observations

With multivariate observations, a power transformation must be selected for
each of the variables.

Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the power transformations for the $p$ measured
characteristics. Each $\lambda_k$ can be selected by maximizing
\[
\ell_k(\lambda_k) = -\frac{n}{2}\ln\left[\frac{1}{n}\sum_{j=1}^n
\left(x_{jk}^{(\lambda_k)} - \overline{x_k^{(\lambda_k)}}\right)^2\right]
+ (\lambda_k - 1)\sum_{j=1}^n \ln x_{jk}
\]
where $x_{1k}, x_{2k}, \ldots, x_{nk}$ are the $n$ observations on the $k$th variable,
$k = 1, 2, \ldots, p$. Here
\[
\overline{x_k^{(\lambda_k)}} = \frac{1}{n}\sum_{j=1}^n \frac{x_{jk}^{\lambda_k} - 1}{\lambda_k}
\]
Let $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_p$ be the values that individually maximize the equation
above. Then the $j$th transformed multivariate observation is
\[
x_j^{(\hat{\lambda})} = \left[\frac{x_{j1}^{\hat{\lambda}_1} - 1}{\hat{\lambda}_1},\;
\frac{x_{j2}^{\hat{\lambda}_2} - 1}{\hat{\lambda}_2},\; \ldots,\;
\frac{x_{jp}^{\hat{\lambda}_p} - 1}{\hat{\lambda}_p}\right]'
\]
The procedure just described is equivalent to making each marginal
distribution approximately normal. Although normal marginals are not
sufficient to ensure that the joint distribution is normal, in practical
applications this may be good enough.

If not, the values $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_p$ obtained from the preceding
transformations can be used as starting points to iterate toward the set of values
$\lambda' = [\lambda_1, \lambda_2, \ldots, \lambda_p]$, which collectively maximizes
\[
\ell(\lambda_1, \lambda_2, \ldots, \lambda_p) = -\frac{n}{2}\ln|S(\lambda)|
+ (\lambda_1 - 1)\sum_{j=1}^n \ln x_{j1}
+ (\lambda_2 - 1)\sum_{j=1}^n \ln x_{j2}
+ \cdots + (\lambda_p - 1)\sum_{j=1}^n \ln x_{jp}
\]
where $S(\lambda)$ is the sample covariance matrix computed from
\[
x_j^{(\lambda)} = \left[\frac{x_{j1}^{\lambda_1} - 1}{\lambda_1},\;
\frac{x_{j2}^{\lambda_2} - 1}{\lambda_2},\; \ldots,\;
\frac{x_{jp}^{\lambda_p} - 1}{\lambda_p}\right]', \qquad j = 1, 2, \ldots, n
\]
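A sketch of the joint maximization with `scipy.optimize.minimize`, started from the marginal estimates as the text suggests. It assumes every $\lambda_k \ne 0$ along the search path, and all numerical values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def joint_loglik(lmbdas, X):
    """l(lambda_1, ..., lambda_p) from the text, for an n x p positive data matrix.
    Assumes every lambda_k != 0 (otherwise the log transform applies)."""
    n, p = X.shape
    Xt = np.column_stack([(X[:, k] ** lmbdas[k] - 1) / lmbdas[k] for k in range(p)])
    sign, logdet = np.linalg.slogdet(np.cov(Xt, rowvar=False))
    return -n / 2 * logdet + sum(
        (lmbdas[k] - 1) * np.log(X[:, k]).sum() for k in range(p))

def joint_lambdas(X, start):
    """Maximize numerically, starting from the marginal estimates lambda_hat_k."""
    res = minimize(lambda l: -joint_loglik(l, X), x0=np.asarray(start),
                   method="Nelder-Mead")
    return res.x

rng = np.random.default_rng(9)
X = rng.lognormal(size=(42, 2))     # illustrative positive bivariate data
print(joint_lambdas(X, start=[0.1, 0.1]))
```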
Example 3.17 (Determining power transformations for bivariate data)
Radiation measurements were also recorded through the open doors of the
$n = 42$ microwave ovens introduced in Example 3.10. The amount of radiation
emitted through the open doors of these ovens is listed in Table 4.5. Denote the
door-closed data $x_{11}, x_{21}, \ldots, x_{42,1}$ and the door-open data $x_{12}, x_{22}, \ldots, x_{42,2}$.
Consider the joint distribution of $x_1$ and $x_2$, and choose a power transformation
for $(x_1, x_2)$ to make the joint distribution of $(x_1, x_2)$ approximately bivariate
normal.
If the data include some large negative values and have a single long tail, a
more general transformation should be applied:
\[
x^{(\lambda)} =
\begin{cases}
\{(x+1)^\lambda - 1\}/\lambda & x \ge 0,\; \lambda \ne 0 \\
\ln(x+1) & x \ge 0,\; \lambda = 0 \\
-\{(-x+1)^{2-\lambda} - 1\}/(2-\lambda) & x < 0,\; \lambda \ne 2 \\
-\ln(-x+1) & x < 0,\; \lambda = 2
\end{cases}
\]