Solutions to Statistical Inference by George Casella


About This Presentation

Almost all solutions to Statistical Inference by George Casella.


STAT 6720
Mathematical Statistics II
Spring Semester 2008
Dr. Jürgen Symanzik
Utah State University
Department of Mathematics and Statistics
3900 Old Main Hill
Logan, UT 84322–3900
Tel.: (435) 797–0696
FAX: (435) 797–1822
e-mail:[email protected]

Contents

Acknowledgements 1
6 Limit Theorems 1
6.1 Modes of Convergence 2
6.2 Weak Laws of Large Numbers 15
6.3 Strong Laws of Large Numbers 19
6.4 Central Limit Theorems 29
7 Sample Moments 36
7.1 Random Sampling 36
7.2 Sample Moments and the Normal Distribution 39
8 The Theory of Point Estimation 44
8.1 The Problem of Point Estimation 44
8.2 Properties of Estimates 45
8.3 Sufficient Statistics 48
8.4 Unbiased Estimation 58
8.5 Lower Bounds for the Variance of an Estimate 67
8.6 The Method of Moments 75
8.7 Maximum Likelihood Estimation 77
8.8 Decision Theory – Bayes and Minimax Estimation 83
9 Hypothesis Testing 91
9.1 Fundamental Notions 91
9.2 The Neyman–Pearson Lemma 96
9.3 Monotone Likelihood Ratios 102
9.4 Unbiased and Invariant Tests 106
10 More on Hypothesis Testing 116
10.1 Likelihood Ratio Tests 116
10.2 Parametric Chi-Squared Tests 121
10.3 t-Tests and F-Tests 125
10.4 Bayes and Minimax Tests 129
11 Confidence Estimation 134
11.1 Fundamental Notions 134
11.2 Shortest-Length Confidence Intervals 138
11.3 Confidence Intervals and Hypothesis Tests 143
11.4 Bayes Confidence Intervals 149
12 Nonparametric Inference 152
12.1 Nonparametric Estimation 152
12.2 Single-Sample Hypothesis Tests 158
12.3 More on Order Statistics 165
13 Some Results from Sampling 169
13.1 Simple Random Samples 169
13.2 Stratified Random Samples 172
14 Some Results from Sequential Statistical Inference 176
14.1 Fundamentals of Sequential Sampling 176
14.2 Sequential Probability Ratio Tests 180
Index 184

Acknowledgements

I would like to thank all my students who helped from the Fall 1999 through the Spring 2006 semesters with the creation and improvement of these lecture notes and for their suggestions on how to improve some of the material presented in class.

In addition, I particularly would like to thank Mike Minnotte and Dan Coster, who previously taught this course at Utah State University, for providing me with their lecture notes and other materials related to this course. Their lecture notes, combined with additional material from a variety of textbooks listed below, form the basis of the script presented here.

The textbook required for this class is:
• Casella, G., and Berger, R. L. (2002): Statistical Inference (Second Edition), Duxbury/Thomson Learning, Pacific Grove, CA.

A Web page dedicated to this class is accessible at:
http://www.math.usu.edu/~symanzik/teaching/2006_stat6720/stat6720.html

This course follows Casella and Berger (2002) as described in the syllabus. Additional material originates from the lectures of Professors Hering, Trenkler, Gather, and Kreienbrock I have attended while studying at the Universität Dortmund, Germany, the collection of Masters and PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and the following textbooks:
• Bandelow, C. (1981): Einführung in die Wahrscheinlichkeitstheorie, Bibliographisches Institut, Mannheim, Germany.
• Büning, H., and Trenkler, G. (1978): Nichtparametrische statistische Methoden, Walter de Gruyter, Berlin, Germany.
• Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole, Pacific Grove, CA.
• Fisz, M. (1989): Wahrscheinlichkeitsrechnung und mathematische Statistik, VEB Deutscher Verlag der Wissenschaften, Berlin, German Democratic Republic.
• Gibbons, J. D., and Chakraborti, S. (1992): Nonparametric Statistical Inference (Third Edition, Revised and Expanded), Dekker, New York, NY.
• Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate Distributions, Volume 1 (Second Edition), Wiley, New York, NY.
• Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate Distributions, Volume 2 (Second Edition), Wiley, New York, NY.
• Kelly, D. G. (1994): Introduction to Probability, Macmillan, New York, NY.
• Lehmann, E. L. (1983): Theory of Point Estimation (1991 Reprint), Wadsworth & Brooks/Cole, Pacific Grove, CA.
• Lehmann, E. L. (1986): Testing Statistical Hypotheses (Second Edition – 1994 Reprint), Chapman & Hall, New York, NY.
• Mood, A. M., Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory of Statistics (Third Edition), McGraw-Hill, Singapore.
• Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York, NY.
• Rohatgi, V. K. (1976): An Introduction to Probability Theory and Mathematical Statistics, John Wiley and Sons, New York, NY.
• Rohatgi, V. K., and Saleh, A. K. E. (2001): An Introduction to Probability and Statistics (Second Edition), John Wiley and Sons, New York, NY.
• Searle, S. R. (1971): Linear Models, Wiley, New York, NY.
• Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis – From Elementary to Intermediate, Prentice Hall, Upper Saddle River, NJ.

Additional definitions, integrals, sums, etc. originate from the following formula collections:
• Bronstein, I. N., and Semendjajew, K. A. (1985): Taschenbuch der Mathematik (22. Auflage), Verlag Harri Deutsch, Thun, Switzerland.
• Bronstein, I. N., and Semendjajew, K. A. (1986): Ergänzende Kapitel zu Taschenbuch der Mathematik (4. Auflage), Verlag Harri Deutsch, Thun, Switzerland.
• Sieber, H. (1980): Mathematische Formeln – Erweiterte Ausgabe E, Ernst Klett, Stuttgart, Germany.

Jürgen Symanzik, January 7, 2006

Lecture 02:
We 01/07/04

6 Limit Theorems
(Based on Rohatgi, Chapter 6, Rohatgi/Saleh, Chapter 6 & Casella/Berger, Section 5.5)

Motivation:
I found this slide from my Stat 250, Section 003, "Introductory Statistics" class (an undergraduate class I taught at George Mason University in Spring 1999): What does this mean at a more theoretical level???

6.1 Modes of Convergence

Definition 6.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F_X(x)$. Let $T = T(X)$ be any statistic, i.e., a Borel-measurable function of $X$ that does not involve the population parameter(s) $\vartheta$, defined on the support $\mathcal{X}$ of $X$. The induced probability distribution of $T(X)$ is called the sampling distribution of $T(X)$.
Note:
(i) Commonly used statistics are:
Sample Mean: $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$
Sample Variance: $S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$
Sample Median, Order Statistics, Min, Max, etc.
(ii) Recall that if $X_1, \ldots, X_n$ are iid and if $E(X)$ and $Var(X)$ exist, then $E(\bar{X}_n) = \mu = E(X)$, $E(S_n^2) = \sigma^2 = Var(X)$, and $Var(\bar{X}_n) = \frac{\sigma^2}{n}$.
(iii) Recall that if $X_1, \ldots, X_n$ are iid and if $X$ has mgf $M_X(t)$ or characteristic function $\Phi_X(t)$, then $M_{\bar{X}_n}(t) = (M_X(\frac{t}{n}))^n$ and $\Phi_{\bar{X}_n}(t) = (\Phi_X(\frac{t}{n}))^n$.
Note: Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's on some probability space $(\Omega, \mathcal{L}, P)$. Is there any meaning behind the expression $\lim_{n \to \infty} X_n = X$? Not immediately under the usual definitions of limits. We first need to define modes of convergence for rv's and probabilities.
Definition 6.1.2:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^{\infty}$ and let $X$ be a rv with cdf $F$. If $F_n(x) \to F(x)$ at all continuity points of $F$, we say that $X_n$ converges in distribution to $X$ ($X_n \stackrel{d}{\longrightarrow} X$), or $X_n$ converges in law to $X$ ($X_n \stackrel{L}{\longrightarrow} X$), or $F_n$ converges weakly to $F$ ($F_n \stackrel{w}{\longrightarrow} F$).
Example 6.1.3:
Let $X_n \sim N(0, \frac{1}{n})$. Then
$$F_n(x) = \int_{-\infty}^{x} \frac{\sqrt{n}}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} n t^2\right) dt = \int_{-\infty}^{\sqrt{n}\,x} \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} s^2\right) ds = \Phi(\sqrt{n}\,x)$$
$$\Longrightarrow F_n(x) \to \begin{cases} \Phi(\infty) = 1, & \text{if } x > 0 \\ \Phi(0) = \tfrac{1}{2}, & \text{if } x = 0 \\ \Phi(-\infty) = 0, & \text{if } x < 0 \end{cases}$$
If $F_X(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$, the only point of discontinuity is at $x = 0$. Everywhere else, $\Phi(\sqrt{n}\,x) = F_n(x) \to F_X(x)$, where $\Phi(z) = P(Z \leq z)$ with $Z \sim N(0,1)$.
So $X_n \stackrel{d}{\longrightarrow} X$, where $P(X = 0) = 1$, or $X_n \stackrel{d}{\longrightarrow} 0$ since the limiting rv here is degenerate, i.e., it has a Dirac(0) distribution.
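The convergence in Example 6.1.3 is easy to check numerically. The following sketch (not part of the original notes; it assumes Python with NumPy and SciPy) evaluates $F_n(x) = \Phi(\sqrt{n}\,x)$ for growing $n$ and shows the values approaching the degenerate cdf at 0:

```python
import numpy as np
from scipy.stats import norm

# F_n(x) = Phi(sqrt(n) * x) is the cdf of X_n ~ N(0, 1/n).
xs = np.array([-0.5, -0.1, 0.0, 0.1, 0.5])
for n in [1, 10, 100, 10000]:
    Fn = norm.cdf(np.sqrt(n) * xs)
    print(f"n = {n:6d}:", np.round(Fn, 4))
# As n grows, F_n(x) -> 0 for x < 0, -> 1/2 at x = 0, -> 1 for x > 0,
# i.e., F_n converges to the Dirac(0) cdf at every continuity point
# (everywhere except x = 0).
```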
Example 6.1.4:
In this example, the sequence $\{F_n\}_{n=1}^{\infty}$ converges pointwise to something that is not a cdf: Let $X_n \sim$ Dirac$(n)$, i.e., $P(X_n = n) = 1$. Then
$$F_n(x) = \begin{cases} 0, & x < n \\ 1, & x \geq n \end{cases}$$
It is $F_n(x) \to 0 \; \forall x$, which is not a cdf. Thus, there is no rv $X$ such that $X_n \stackrel{d}{\longrightarrow} X$.
Example 6.1.5:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = n) = \frac{1}{n}$, and let $X \sim$ Dirac$(0)$, i.e., $P(X = 0) = 1$. It is
$$F_n(x) = \begin{cases} 0, & x < 0 \\ 1 - \frac{1}{n}, & 0 \leq x < n \\ 1, & x \geq n \end{cases} \qquad F_X(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$$
It holds that $F_n \stackrel{w}{\longrightarrow} F_X$ but
$$E(X_n^k) = 0^k \cdot \left(1 - \tfrac{1}{n}\right) + n^k \cdot \tfrac{1}{n} = n^{k-1} \not\to E(X^k) = 0.$$
Thus, convergence in distribution does not imply convergence of moments/means.

Note:
Convergence in distribution does not say that the $X_i$'s are close to each other or to $X$. It only means that their cdf's are (eventually) close to some cdf $F$. The $X_i$'s do not even have to be defined on the same probability space.
Example 6.1.6:
Let $X$ and $\{X_n\}_{n=1}^{\infty}$ be iid $N(0,1)$. Obviously, $X_n \stackrel{d}{\longrightarrow} X$ but $\lim_{n \to \infty} X_n \neq X$.
Theorem 6.1.7:
Let $X$ and $\{X_n\}_{n=1}^{\infty}$ be discrete rv's with supports $\mathcal{X}$ and $\{\mathcal{X}_n\}_{n=1}^{\infty}$, respectively. Define the countable set $A = \mathcal{X} \cup \bigcup_{n=1}^{\infty} \mathcal{X}_n = \{a_k : k = 1, 2, 3, \ldots\}$. Let $p_k = P(X = a_k)$ and $p_{nk} = P(X_n = a_k)$. Then it holds that $p_{nk} \to p_k \; \forall k$ iff $X_n \stackrel{d}{\longrightarrow} X$.
Theorem 6.1.8:
Let $X$ and $\{X_n\}_{n=1}^{\infty}$ be continuous rv's with pdf's $f$ and $\{f_n\}_{n=1}^{\infty}$, respectively. If $f_n(x) \to f(x)$ for almost all $x$ as $n \to \infty$, then $X_n \stackrel{d}{\longrightarrow} X$.
Theorem 6.1.9:
Let $X$ and $\{X_n\}_{n=1}^{\infty}$ be rv's such that $X_n \stackrel{d}{\longrightarrow} X$. Let $c \in \mathbb{R}$ be a constant. Then it holds:
(i) $X_n + c \stackrel{d}{\longrightarrow} X + c$.
(ii) $cX_n \stackrel{d}{\longrightarrow} cX$.
(iii) If $a_n \to a$ and $b_n \to b$, then $a_n X_n + b_n \stackrel{d}{\longrightarrow} aX + b$.

Proof:
Part (iii):
Suppose that $a > 0$, $a_n > 0$. (If $a < 0$, $a_n < 0$, the result follows via (ii) and $c = -1$.)
Let $Y_n = a_n X_n + b_n$ and $Y = aX + b$. It is
$$F_Y(y) = P(Y < y) = P(aX + b < y) = P\left(X < \frac{y - b}{a}\right) = F_X\left(\frac{y - b}{a}\right).$$
Likewise, $F_{Y_n}(y) = F_{X_n}\left(\frac{y - b_n}{a_n}\right)$.
If $y$ is a continuity point of $F_Y$, then $\frac{y - b}{a}$ is a continuity point of $F_X$. Since $a_n \to a$, $b_n \to b$ and $F_{X_n}(x) \to F_X(x)$, it follows that $F_{Y_n}(y) \to F_Y(y)$ for every continuity point $y$ of $F_Y$. Thus, $a_n X_n + b_n \stackrel{d}{\longrightarrow} aX + b$.

Lecture 38:
We 11/29/00
Definition 6.1.10:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's defined on a probability space $(\Omega, \mathcal{L}, P)$. We say that $X_n$ converges in probability to a rv $X$ ($X_n \stackrel{p}{\longrightarrow} X$, $P\text{-}\lim_{n \to \infty} X_n = X$) if
$$\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0 \quad \forall \epsilon > 0.$$
Note:
The following are equivalent:
$$\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0 \iff \lim_{n \to \infty} P(|X_n - X| \leq \epsilon) = 1 \iff \lim_{n \to \infty} P(\{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}) = 0$$
If $X$ is degenerate, i.e., $P(X = c) = 1$, we say that $X_n$ is consistent for $c$. For example, let $X_n$ be such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = 1) = \frac{1}{n}$. Then
$$P(|X_n| > \epsilon) = \begin{cases} \frac{1}{n}, & 0 < \epsilon < 1 \\ 0, & \epsilon \geq 1 \end{cases}$$
Therefore, $\lim_{n \to \infty} P(|X_n| > \epsilon) = 0 \; \forall \epsilon > 0$. So $X_n \stackrel{p}{\longrightarrow} 0$, i.e., $X_n$ is consistent for 0.
Theorem 6.1.11:
(i) $X_n \stackrel{p}{\longrightarrow} X \iff X_n - X \stackrel{p}{\longrightarrow} 0$.
(ii) $X_n \stackrel{p}{\longrightarrow} X$, $X_n \stackrel{p}{\longrightarrow} Y \Longrightarrow P(X = Y) = 1$.
(iii) $X_n \stackrel{p}{\longrightarrow} X$, $X_m \stackrel{p}{\longrightarrow} X \Longrightarrow X_n - X_m \stackrel{p}{\longrightarrow} 0$ as $n, m \to \infty$.
(iv) $X_n \stackrel{p}{\longrightarrow} X$, $Y_n \stackrel{p}{\longrightarrow} Y \Longrightarrow X_n \pm Y_n \stackrel{p}{\longrightarrow} X \pm Y$.
(v) $X_n \stackrel{p}{\longrightarrow} X$, $k \in \mathbb{R}$ a constant $\Longrightarrow kX_n \stackrel{p}{\longrightarrow} kX$.
(vi) $X_n \stackrel{p}{\longrightarrow} k$, $k \in \mathbb{R}$ a constant $\Longrightarrow X_n^r \stackrel{p}{\longrightarrow} k^r \; \forall r \in \mathbb{N}$.
(vii) $X_n \stackrel{p}{\longrightarrow} a$, $Y_n \stackrel{p}{\longrightarrow} b$, $a, b \in \mathbb{R} \Longrightarrow X_n Y_n \stackrel{p}{\longrightarrow} ab$.
(viii) $X_n \stackrel{p}{\longrightarrow} 1 \Longrightarrow X_n^{-1} \stackrel{p}{\longrightarrow} 1$.
(ix) $X_n \stackrel{p}{\longrightarrow} a$, $Y_n \stackrel{p}{\longrightarrow} b$, $a \in \mathbb{R}$, $b \in \mathbb{R} - \{0\} \Longrightarrow \frac{X_n}{Y_n} \stackrel{p}{\longrightarrow} \frac{a}{b}$.
(x) $X_n \stackrel{p}{\longrightarrow} X$, $Y$ an arbitrary rv $\Longrightarrow X_n Y \stackrel{p}{\longrightarrow} XY$.
(xi) $X_n \stackrel{p}{\longrightarrow} X$, $Y_n \stackrel{p}{\longrightarrow} Y \Longrightarrow X_n Y_n \stackrel{p}{\longrightarrow} XY$.

Proof:
See Rohatgi, pages 244–245, and Rohatgi/Saleh, pages 260–261, for partial proofs.
Theorem 6.1.12:
Let $X_n \stackrel{p}{\longrightarrow} X$ and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \stackrel{p}{\longrightarrow} g(X)$.

Proof:
Preconditions:
1.) $X$ rv $\Longrightarrow \forall \epsilon > 0 \; \exists k = k(\epsilon) : P(|X| > k) < \frac{\epsilon}{2}$
2.) $g$ is continuous on $\mathbb{R}$
$\Longrightarrow g$ is also uniformly continuous on $[-k, k]$ (see the definition of uniform continuity in Theorem 3.3.3 (iii))
$\Longrightarrow \exists \delta = \delta(\epsilon, k) : |X| \leq k, |X_n - X| < \delta \Rightarrow |g(X_n) - g(X)| < \epsilon$
Let
$$A = \{|X| \leq k\} = \{\omega : |X(\omega)| \leq k\}, \quad B = \{|X_n - X| < \delta\} = \{\omega : |X_n(\omega) - X(\omega)| < \delta\}, \quad C = \{|g(X_n) - g(X)| < \epsilon\} = \{\omega : |g(X_n(\omega)) - g(X(\omega))| < \epsilon\}$$
If $\omega \in A \cap B$, then by 2.) $\omega \in C$
$\Longrightarrow A \cap B \subseteq C \Longrightarrow C^C \subseteq (A \cap B)^C = A^C \cup B^C \Longrightarrow P(C^C) \leq P(A^C \cup B^C) \leq P(A^C) + P(B^C)$
Now:
$$P(|g(X_n) - g(X)| \geq \epsilon) \leq \underbrace{P(|X| > k)}_{< \frac{\epsilon}{2} \text{ by 1.)}} + \underbrace{P(|X_n - X| \geq \delta)}_{< \frac{\epsilon}{2} \text{ for } n \geq n_0(\epsilon, \delta, k) \text{ since } X_n \stackrel{p}{\longrightarrow} X} \leq \epsilon \quad \text{for } n \geq n_0(\epsilon, \delta, k)$$

Corollary 6.1.13:
(i) Let $X_n \stackrel{p}{\longrightarrow} c$, $c \in \mathbb{R}$, and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \stackrel{p}{\longrightarrow} g(c)$.
(ii) Let $X_n \stackrel{d}{\longrightarrow} X$ and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \stackrel{d}{\longrightarrow} g(X)$.
(iii) Let $X_n \stackrel{d}{\longrightarrow} c$, $c \in \mathbb{R}$, and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \stackrel{d}{\longrightarrow} g(c)$.
Theorem 6.1.14:
$X_n \stackrel{p}{\longrightarrow} X \Longrightarrow X_n \stackrel{d}{\longrightarrow} X$.

Proof:
$X_n \stackrel{p}{\longrightarrow} X \Leftrightarrow P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty \; \forall \epsilon > 0$.
It holds:
$$P(X \leq x - \epsilon) = P(X \leq x - \epsilon, |X_n - X| \leq \epsilon) + P(X \leq x - \epsilon, |X_n - X| > \epsilon) \stackrel{(A)}{\leq} P(X_n \leq x) + P(|X_n - X| > \epsilon)$$
(A) holds since $X \leq x - \epsilon$ and $X_n$ within $\epsilon$ of $X$ imply $X_n \leq x$.
Similarly, it holds:
$$P(X_n \leq x) = P(X_n \leq x, |X_n - X| \leq \epsilon) + P(X_n \leq x, |X_n - X| > \epsilon) \leq P(X \leq x + \epsilon) + P(|X_n - X| > \epsilon)$$
Combining the 2 inequalities from above gives:
$$P(X \leq x - \epsilon) - \underbrace{P(|X_n - X| > \epsilon)}_{\to 0 \text{ as } n \to \infty} \leq \underbrace{P(X_n \leq x)}_{= F_n(x)} \leq P(X \leq x + \epsilon) + \underbrace{P(|X_n - X| > \epsilon)}_{\to 0 \text{ as } n \to \infty}$$
Therefore, $P(X \leq x - \epsilon) \leq F_n(x) \leq P(X \leq x + \epsilon)$ as $n \to \infty$.
Since the cdf's $F_n(\cdot)$ are not necessarily left continuous, we get the following result for $\epsilon \downarrow 0$:
$$P(X < x) \leq F_n(x) \leq P(X \leq x) = F_X(x)$$
Let $x$ be a continuity point of $F$. Then it holds:
$$F(x) = P(X < x) \leq F_n(x) \leq F(x) \Longrightarrow F_n(x) \to F(x) \Longrightarrow X_n \stackrel{d}{\longrightarrow} X$$

Theorem 6.1.15:
Let $c \in \mathbb{R}$ be a constant. Then it holds:
$$X_n \stackrel{d}{\longrightarrow} c \iff X_n \stackrel{p}{\longrightarrow} c.$$
Example 6.1.16:
In this example, we will see that $X_n \stackrel{d}{\longrightarrow} X \not\Longrightarrow X_n \stackrel{p}{\longrightarrow} X$ for some rv $X$. Let $X_n$ be identically distributed rv's and let $(X_n, X)$ have the following joint distribution:

            X = 0   X = 1
  X_n = 0     0      1/2    | 1/2
  X_n = 1    1/2      0     | 1/2
             1/2     1/2    |  1

Obviously, $X_n \stackrel{d}{\longrightarrow} X$ since all have exactly the same cdf, but for any $\epsilon \in (0,1)$, it is
$$P(|X_n - X| > \epsilon) = P(|X_n - X| = 1) = 1 \; \forall n,$$
so $\lim_{n \to \infty} P(|X_n - X| > \epsilon) \neq 0$. Therefore, $X_n \stackrel{p}{\not\longrightarrow} X$.
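A small simulation (a sketch added here for illustration, assuming NumPy) makes the point of Example 6.1.16 concrete: the marginal distributions agree, yet the pair is never close.

```python
import numpy as np

rng = np.random.default_rng(0)
# Sample from the joint distribution in Example 6.1.16:
# (X_n, X) is (0,1) or (1,0), each with probability 1/2.
u = rng.random(100_000) < 0.5
Xn = np.where(u, 0, 1)
X = np.where(u, 1, 0)

print("P(X_n = 1) ~", Xn.mean(), "  P(X = 1) ~", X.mean())     # both ~ 0.5
print("P(|X_n - X| > 0.5) ~", (np.abs(Xn - X) > 0.5).mean())   # ~ 1.0
# Same marginal distribution (so trivially X_n -> X in distribution),
# yet |X_n - X| = 1 with probability 1, so no convergence in probability.
```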
Theorem 6.1.17:
Let $\{X_n\}_{n=1}^{\infty}$ and $\{Y_n\}_{n=1}^{\infty}$ be sequences of rv's and $X$ be a rv defined on a probability space $(\Omega, \mathcal{L}, P)$. Then it holds:
$$Y_n \stackrel{d}{\longrightarrow} X, \; |X_n - Y_n| \stackrel{p}{\longrightarrow} 0 \Longrightarrow X_n \stackrel{d}{\longrightarrow} X.$$

Proof:
Similar to the proof of Theorem 6.1.14. See also Rohatgi, page 253, Theorem 14, and Rohatgi/Saleh, page 269, Theorem 14.
Lecture 41:
We 12/06/00
Theorem 6.1.18: Slutsky's Theorem
Let $\{X_n\}_{n=1}^{\infty}$ and $\{Y_n\}_{n=1}^{\infty}$ be sequences of rv's and $X$ be a rv defined on a probability space $(\Omega, \mathcal{L}, P)$. Let $c \in \mathbb{R}$ be a constant. Then it holds:
(i) $X_n \stackrel{d}{\longrightarrow} X$, $Y_n \stackrel{p}{\longrightarrow} c \Longrightarrow X_n + Y_n \stackrel{d}{\longrightarrow} X + c$.
(ii) $X_n \stackrel{d}{\longrightarrow} X$, $Y_n \stackrel{p}{\longrightarrow} c \Longrightarrow X_n Y_n \stackrel{d}{\longrightarrow} cX$. If $c = 0$, then also $X_n Y_n \stackrel{p}{\longrightarrow} 0$.
(iii) $X_n \stackrel{d}{\longrightarrow} X$, $Y_n \stackrel{p}{\longrightarrow} c \Longrightarrow \frac{X_n}{Y_n} \stackrel{d}{\longrightarrow} \frac{X}{c}$ if $c \neq 0$.

Proof:
(i) $Y_n \stackrel{p}{\longrightarrow} c \stackrel{\text{Th. 6.1.11(i)}}{\iff} Y_n - c \stackrel{p}{\longrightarrow} 0$
$\Longrightarrow Y_n - c = Y_n + (X_n - X_n) - c = (X_n + Y_n) - (X_n + c) \stackrel{p}{\longrightarrow} 0$ (A)
$X_n \stackrel{d}{\longrightarrow} X \stackrel{\text{Th. 6.1.9(i)}}{\Longrightarrow} X_n + c \stackrel{d}{\longrightarrow} X + c$ (B)
Combining (A) and (B), it follows from Theorem 6.1.17: $X_n + Y_n \stackrel{d}{\longrightarrow} X + c$.

(ii) Case $c = 0$: $\forall \epsilon > 0 \; \forall k > 0$, it is
$$P(|X_n Y_n| > \epsilon) = P(|X_n Y_n| > \epsilon, |Y_n| \leq \tfrac{\epsilon}{k}) + P(|X_n Y_n| > \epsilon, |Y_n| > \tfrac{\epsilon}{k}) \leq P(|X_n| \tfrac{\epsilon}{k} > \epsilon) + P(|Y_n| > \tfrac{\epsilon}{k}) \leq P(|X_n| > k) + P(|Y_n| > \tfrac{\epsilon}{k})$$
Since $X_n \stackrel{d}{\longrightarrow} X$ and $Y_n \stackrel{p}{\longrightarrow} 0$, it follows
$$\lim_{n \to \infty} P(|X_n Y_n| > \epsilon) \leq P(|X_n| > k) \to 0 \text{ as } k \to \infty.$$
Therefore, $X_n Y_n \stackrel{p}{\longrightarrow} 0$.
Case $c \neq 0$: Since $X_n \stackrel{d}{\longrightarrow} X$ and $Y_n \stackrel{p}{\longrightarrow} c$, it follows from (ii), case $c = 0$, that $X_n Y_n - cX_n = X_n(Y_n - c) \stackrel{p}{\longrightarrow} 0$. Since $cX_n \stackrel{d}{\longrightarrow} cX$ by Theorem 6.1.9 (ii), it follows from Theorem 6.1.17: $X_n Y_n \stackrel{d}{\longrightarrow} cX$.

(iii) Let $Z_n \stackrel{p}{\longrightarrow} 1$ and let $Y_n = cZ_n$. Since $c \neq 0$,
$$\frac{1}{Y_n} = \frac{1}{Z_n} \cdot \frac{1}{c} \stackrel{\text{Th. 6.1.11(v, viii)}}{\Longrightarrow} \frac{1}{Y_n} \stackrel{p}{\longrightarrow} \frac{1}{c}$$
With part (ii) above, it follows:
$$X_n \stackrel{d}{\longrightarrow} X \text{ and } \frac{1}{Y_n} \stackrel{p}{\longrightarrow} \frac{1}{c} \Longrightarrow \frac{X_n}{Y_n} \stackrel{d}{\longrightarrow} \frac{X}{c}$$
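Slutsky's Theorem is the workhorse behind studentization. As a numerical check (a sketch added for illustration, assuming NumPy; the Exponential population and sample sizes are arbitrary choices), we replace the unknown $\sigma$ in the standardized mean by the sample standard deviation $Y_n \stackrel{p}{\longrightarrow} \sigma$ and observe that the limit is still $N(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 2_000, 50_000

# X_n = sqrt(n)*(mean - mu)/sigma  -> N(0,1) in distribution (CLT),
# Y_n = sample standard deviation  -> sigma in probability,
# so by Slutsky's Theorem  X_n * (sigma / Y_n) -> N(0,1) as well.
samples = rng.exponential(scale=2.0, size=(reps, n))   # mu = sigma = 2
Xn = np.sqrt(n) * (samples.mean(axis=1) - 2.0) / 2.0
Yn = samples.std(axis=1, ddof=1)
Tn = Xn * (2.0 / Yn)

print("Var of X_n*(sigma/Y_n):", round(Tn.var(), 3))       # ~ 1
print("P(T_n <= 1.96):", round((Tn <= 1.96).mean(), 3))    # ~ 0.975
```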

Definition 6.1.19:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's such that $E(|X_n|^r) < \infty$ for some $r > 0$. We say that $X_n$ converges in the $r^{th}$ mean to a rv $X$ ($X_n \stackrel{r}{\longrightarrow} X$) if $E(|X|^r) < \infty$ and
$$\lim_{n \to \infty} E(|X_n - X|^r) = 0.$$
Example 6.1.20:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's defined by $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = 1) = \frac{1}{n}$. It is $E(|X_n|^r) = \frac{1}{n} \; \forall r > 0$. Therefore, $X_n \stackrel{r}{\longrightarrow} 0 \; \forall r > 0$.
Note:
The special cases $r = 1$ and $r = 2$ are called convergence in absolute mean for $r = 1$ ($X_n \stackrel{1}{\longrightarrow} X$) and convergence in mean square for $r = 2$ ($X_n \stackrel{ms}{\longrightarrow} X$ or $X_n \stackrel{2}{\longrightarrow} X$).
Theorem 6.1.21:
Assume that $X_n \stackrel{r}{\longrightarrow} X$ for some $r > 0$. Then $X_n \stackrel{p}{\longrightarrow} X$.

Proof:
Using Markov's Inequality (Corollary 3.5.2), it holds for any $\epsilon > 0$:
$$\frac{E(|X_n - X|^r)}{\epsilon^r} \geq P(|X_n - X| \geq \epsilon) \geq P(|X_n - X| > \epsilon)$$
$$X_n \stackrel{r}{\longrightarrow} X \Longrightarrow \lim_{n \to \infty} E(|X_n - X|^r) = 0 \Longrightarrow \lim_{n \to \infty} P(|X_n - X| > \epsilon) \leq \lim_{n \to \infty} \frac{E(|X_n - X|^r)}{\epsilon^r} = 0 \Longrightarrow X_n \stackrel{p}{\longrightarrow} X$$
Example 6.1.22:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's defined by $P(X_n = 0) = 1 - \frac{1}{n^r}$ and $P(X_n = n) = \frac{1}{n^r}$ for some $r > 0$.
For any $\epsilon > 0$, $P(|X_n| > \epsilon) \to 0$ as $n \to \infty$; so $X_n \stackrel{p}{\longrightarrow} 0$.
For $0 < s < r$, $E(|X_n|^s) = \frac{1}{n^{r-s}} \to 0$ as $n \to \infty$; so $X_n \stackrel{s}{\longrightarrow} 0$. But $E(|X_n|^r) = 1 \not\to 0$ as $n \to \infty$; so $X_n \stackrel{r}{\not\longrightarrow} 0$.

Theorem 6.1.23:
If $X_n \stackrel{r}{\longrightarrow} X$, then it holds:
(i) $\lim_{n \to \infty} E(|X_n|^r) = E(|X|^r)$; and
(ii) $X_n \stackrel{s}{\longrightarrow} X$ for $0 < s < r$.

Proof:
(i) For $0 < r \leq 1$, it holds:
$$E(|X_n|^r) = E(|X_n - X + X|^r) \stackrel{(*)}{\leq} E(|X_n - X|^r + |X|^r)$$
$$\Longrightarrow E(|X_n|^r) - E(|X|^r) \leq E(|X_n - X|^r) \Longrightarrow \lim_{n \to \infty} E(|X_n|^r) - E(|X|^r) \leq \lim_{n \to \infty} E(|X_n - X|^r) = 0 \Longrightarrow \lim_{n \to \infty} E(|X_n|^r) \leq E(|X|^r) \quad (A)$$
(*) holds due to Bronstein/Semendjajew (1986), page 36 (see Handout).
Similarly,
$$E(|X|^r) = E(|X - X_n + X_n|^r) \leq E(|X_n - X|^r + |X_n|^r) \Longrightarrow E(|X|^r) - E(|X_n|^r) \leq E(|X_n - X|^r) \Longrightarrow E(|X|^r) \leq \lim_{n \to \infty} E(|X_n|^r) \quad (B)$$
Combining (A) and (B) gives $\lim_{n \to \infty} E(|X_n|^r) = E(|X|^r)$.
For $r > 1$, it follows from Minkowski's Inequality (Theorem 4.8.3):
$$[E(|X - X_n + X_n|^r)]^{\frac{1}{r}} \leq [E(|X - X_n|^r)]^{\frac{1}{r}} + [E(|X_n|^r)]^{\frac{1}{r}}$$
$$\Longrightarrow [E(|X|^r)]^{\frac{1}{r}} - \lim_{n \to \infty} [E(|X_n|^r)]^{\frac{1}{r}} \leq \lim_{n \to \infty} [E(|X_n - X|^r)]^{\frac{1}{r}} = 0 \text{ since } X_n \stackrel{r}{\longrightarrow} X \Longrightarrow [E(|X|^r)]^{\frac{1}{r}} \leq \lim_{n \to \infty} [E(|X_n|^r)]^{\frac{1}{r}} \quad (C)$$
Similarly,
$$[E(|X_n - X + X|^r)]^{\frac{1}{r}} \leq [E(|X_n - X|^r)]^{\frac{1}{r}} + [E(|X|^r)]^{\frac{1}{r}} \Longrightarrow \lim_{n \to \infty} [E(|X_n|^r)]^{\frac{1}{r}} - [E(|X|^r)]^{\frac{1}{r}} \leq \lim_{n \to \infty} [E(|X_n - X|^r)]^{\frac{1}{r}} = 0 \Longrightarrow \lim_{n \to \infty} [E(|X_n|^r)]^{\frac{1}{r}} \leq [E(|X|^r)]^{\frac{1}{r}} \quad (D)$$
Combining (C) and (D) gives $\lim_{n \to \infty} [E(|X_n|^r)]^{\frac{1}{r}} = [E(|X|^r)]^{\frac{1}{r}}$, hence $\lim_{n \to \infty} E(|X_n|^r) = E(|X|^r)$.

Lecture 42/1:
Fr 12/08/00

(ii) For $1 \leq s < r$, it follows from Lyapunov's Inequality (Theorem 3.5.4):
$$[E(|X_n - X|^s)]^{\frac{1}{s}} \leq [E(|X_n - X|^r)]^{\frac{1}{r}} \Longrightarrow E(|X_n - X|^s) \leq [E(|X_n - X|^r)]^{\frac{s}{r}} \Longrightarrow \lim_{n \to \infty} E(|X_n - X|^s) \leq \lim_{n \to \infty} [E(|X_n - X|^r)]^{\frac{s}{r}} = 0 \text{ since } X_n \stackrel{r}{\longrightarrow} X \Longrightarrow X_n \stackrel{s}{\longrightarrow} X$$
An additional proof is required for $0 < s < r < 1$.
Definition 6.1.24:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's on $(\Omega, \mathcal{L}, P)$. We say that $X_n$ converges almost surely to a rv $X$ ($X_n \stackrel{a.s.}{\longrightarrow} X$), or $X_n$ converges with probability 1 to $X$ ($X_n \stackrel{w.p.1}{\longrightarrow} X$), or $X_n$ converges strongly to $X$, iff
$$P(\{\omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}) = 1.$$

Note:
An interesting characterization of convergence with probability 1 and convergence in probability can be found in Parzen (1960), "Modern Probability Theory and Its Applications", on page 416 (see Handout).
Example 6.1.25:
Let $\Omega = [0,1]$ and $P$ a uniform distribution on $\Omega$. Let $X_n(\omega) = \omega + \omega^n$ and $X(\omega) = \omega$.
For $\omega \in [0,1)$, $\omega^n \to 0$ as $n \to \infty$. So $X_n(\omega) \to X(\omega) \; \forall \omega \in [0,1)$.
However, for $\omega = 1$, $X_n(1) = 2 \neq 1 = X(1) \; \forall n$, i.e., convergence fails at $\omega = 1$.
Anyway, since $P(\{\omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}) = P(\{\omega \in [0,1)\}) = 1$, it is $X_n \stackrel{a.s.}{\longrightarrow} X$.

Theorem 6.1.26:
$X_n \stackrel{a.s.}{\longrightarrow} X \Longrightarrow X_n \stackrel{p}{\longrightarrow} X$.

Proof:
Choose $\epsilon > 0$ and $\delta > 0$. Find $n_0 = n_0(\epsilon, \delta)$ such that
$$P\left(\bigcap_{n=n_0}^{\infty} \{|X_n - X| \leq \epsilon\}\right) \geq 1 - \delta.$$
Since $\bigcap_{n=n_0}^{\infty} \{|X_n - X| \leq \epsilon\} \subseteq \{|X_n - X| \leq \epsilon\} \; \forall n \geq n_0$, it is
$$P(\{|X_n - X| \leq \epsilon\}) \geq P\left(\bigcap_{n=n_0}^{\infty} \{|X_n - X| \leq \epsilon\}\right) \geq 1 - \delta \quad \forall n \geq n_0.$$
Therefore, $P(\{|X_n - X| \leq \epsilon\}) \to 1$ as $n \to \infty$. Thus, $X_n \stackrel{p}{\longrightarrow} X$.
Example 6.1.27:
$X_n \stackrel{p}{\longrightarrow} X \not\Longrightarrow X_n \stackrel{a.s.}{\longrightarrow} X$:
Let $\Omega = (0,1]$ and $P$ a uniform distribution on $\Omega$. Define $A_n$ by
$$A_1 = (0, \tfrac{1}{2}], \; A_2 = (\tfrac{1}{2}, 1], \quad A_3 = (0, \tfrac{1}{4}], \; A_4 = (\tfrac{1}{4}, \tfrac{1}{2}], \; A_5 = (\tfrac{1}{2}, \tfrac{3}{4}], \; A_6 = (\tfrac{3}{4}, 1], \quad A_7 = (0, \tfrac{1}{8}], \; A_8 = (\tfrac{1}{8}, \tfrac{1}{4}], \ldots$$
Let $X_n(\omega) = I_{A_n}(\omega)$.
It is $P(|X_n - 0| \geq \epsilon) \to 0 \; \forall \epsilon > 0$ since $X_n$ is 0 except on $A_n$ and $P(A_n) \downarrow 0$. Thus $X_n \stackrel{p}{\longrightarrow} 0$.
But $P(\{\omega : X_n(\omega) \to 0\}) = 0$ (and not 1) because any $\omega$ keeps being in some $A_n$ beyond any $n_0$, i.e., $X_n(\omega)$ looks like $0\ldots010\ldots010\ldots010\ldots$, so $X_n \stackrel{a.s.}{\not\longrightarrow} 0$.
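The "sliding intervals" of Example 6.1.27 can be traced on a computer. The following sketch (added for illustration, assuming NumPy; the block-indexing formula is one way to encode the $A_n$ from the example) fixes one $\omega$ and shows that 1's keep recurring, even though they become ever rarer:

```python
import numpy as np

def X(n, omega):
    """Indicator of the n-th sliding interval A_n from Example 6.1.27."""
    k = int(np.floor(np.log2(n + 1)))   # block number: block k has 2^k intervals
    j = n - (2**k - 1)                   # position of A_n within block k
    lo, hi = j / 2**k, (j + 1) / 2**k    # A_n = (lo, hi], an interval of length 2^-k
    return int(lo < omega <= hi)

omega = 0.3                              # a fixed sample point
vals = [X(n, omega) for n in range(1, 2**14)]
print("last index n <= 16383 with X_n(omega) = 1:",
      1 + max(i for i, v in enumerate(vals) if v))
print("fraction of ones among n <= 16383:", round(np.mean(vals), 5))
# X_n(omega) equals 1 once in every block, so it never settles at 0
# (no a.s. convergence), yet P(X_n = 1) = P(A_n) = 2^-k -> 0,
# which is exactly convergence in probability to 0.
```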
Example 6.1.28:
$X_n \stackrel{r}{\longrightarrow} X \not\Longrightarrow X_n \stackrel{a.s.}{\longrightarrow} X$:
Let $X_n$ be independent rv's such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = 1) = \frac{1}{n}$.
It is $E(|X_n - 0|^r) = E(|X_n|^r) = E(|X_n|) = \frac{1}{n} \to 0$ as $n \to \infty$, so $X_n \stackrel{r}{\longrightarrow} 0 \; \forall r > 0$ (and due to Theorem 6.1.21, also $X_n \stackrel{p}{\longrightarrow} 0$).
But
$$P(X_n = 0 \; \forall m \leq n \leq n_0) = \prod_{n=m}^{n_0} \left(1 - \frac{1}{n}\right) = \left(\frac{m-1}{m}\right)\left(\frac{m}{m+1}\right)\left(\frac{m+1}{m+2}\right) \cdots \left(\frac{n_0 - 2}{n_0 - 1}\right)\left(\frac{n_0 - 1}{n_0}\right) = \frac{m-1}{n_0}$$
As $n_0 \to \infty$, it is $P(X_n = 0 \; \forall m \leq n \leq n_0) \to 0 \; \forall m$, so $X_n \stackrel{a.s.}{\not\longrightarrow} 0$.
Example 6.1.29:
$X_n \stackrel{a.s.}{\longrightarrow} X \not\Longrightarrow X_n \stackrel{r}{\longrightarrow} X$:
Let $\Omega = [0,1]$ and $P$ a uniform distribution on $\Omega$. Let $A_n = [0, \frac{1}{\ln n}]$. Let $X_n(\omega) = n \, I_{A_n}(\omega)$ and $X(\omega) = 0$.
It holds that $\forall \omega > 0 \; \exists n_0 : \frac{1}{\ln n_0} < \omega \Longrightarrow X_n(\omega) = 0 \; \forall n > n_0$, and $P(\omega = 0) = 0$. Thus, $X_n \stackrel{a.s.}{\longrightarrow} 0$.
But $E(|X_n - 0|^r) = \frac{n^r}{\ln n} \to \infty \; \forall r > 0$, so $X_n \stackrel{r}{\not\longrightarrow} X$.

Lecture 39:
Fr 12/01/00
6.2 Weak Laws of Large Numbers
Theorem 6.2.1: WLLN: Version I
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's with mean $E(X_i) = \mu$ and variance $Var(X_i) = \sigma^2 < \infty$. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. Then it holds
$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| \geq \epsilon) = 0 \quad \forall \epsilon > 0,$$
i.e., $\bar{X}_n \stackrel{p}{\longrightarrow} \mu$.

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all $\epsilon > 0$:
$$P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{E((\bar{X}_n - \mu)^2)}{\epsilon^2} = \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \longrightarrow 0 \text{ as } n \to \infty$$
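A quick Monte Carlo check of this WLLN (a sketch added for illustration, assuming NumPy; the Bernoulli population and tolerance are arbitrary choices) compares the estimated tail probability with the Chebyshev-type bound used in the proof:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps = 0.5, 0.05                   # Bernoulli(0.5): mean 0.5, variance 0.25

for n in [10, 100, 1000, 10000]:
    # Xbar_n for a Bernoulli(p) sample is Binomial(n, p)/n.
    xbar = rng.binomial(n, mu, size=100_000) / n
    p_hat = (np.abs(xbar - mu) >= eps).mean()
    bound = min(0.25 / (n * eps**2), 1.0)        # sigma^2 / (n * eps^2)
    print(f"n = {n:5d}: P(|Xbar - mu| >= eps) ~ {p_hat:.4f}  (bound {bound:.4f})")
# Both the estimated probability and the bound go to 0: Xbar_n ->p mu.
```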
Note:
For iid rv's with finite variance, $\bar{X}_n$ is consistent for $\mu$. A more general way to derive a "WLLN" follows in the next Definition.

Definition 6.2.2:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We say that $\{X_i\}$ obeys the WLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^{\infty}$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^{\infty}$ such that
$$B_n^{-1}(T_n - A_n) \stackrel{p}{\longrightarrow} 0.$$
Theorem 6.2.3:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of pairwise uncorrelated rv's with $E(X_i) = \mu_i$ and $Var(X_i) = \sigma_i^2$, $i \in \mathbb{N}$. If $\sum_{i=1}^n \sigma_i^2 \to \infty$ as $n \to \infty$, we can choose $A_n = \sum_{i=1}^n \mu_i$ and $B_n = \sum_{i=1}^n \sigma_i^2$ and get
$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{\sum_{i=1}^n \sigma_i^2} \stackrel{p}{\longrightarrow} 0.$$

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all $\epsilon > 0$:
$$P\left(\left|\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i\right| > \epsilon \sum_{i=1}^n \sigma_i^2\right) \leq \frac{E\left(\left(\sum_{i=1}^n (X_i - \mu_i)\right)^2\right)}{\epsilon^2 \left(\sum_{i=1}^n \sigma_i^2\right)^2} = \frac{1}{\epsilon^2 \sum_{i=1}^n \sigma_i^2} \longrightarrow 0 \text{ as } n \to \infty$$

Note:
To obtain Theorem 6.2.1, we choose $A_n = n\mu$ and $B_n = n\sigma^2$.
Theorem 6.2.4:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of rv's. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. A necessary and sufficient condition for $\{X_i\}$ to obey the WLLN with respect to $B_n = n$ is that
$$E\left(\frac{\bar{X}_n^2}{1 + \bar{X}_n^2}\right) \to 0 \quad \text{as } n \to \infty.$$

Proof:
Rohatgi, page 258, Theorem 2, and Rohatgi/Saleh, page 275, Theorem 2.
Example 6.2.5:
Let $(X_1, \ldots, X_n)$ be jointly Normal with $E(X_i) = 0$, $E(X_i^2) = 1$ for all $i$, and $Cov(X_i, X_j) = \rho$ if $|i - j| = 1$ and $Cov(X_i, X_j) = 0$ if $|i - j| > 1$.
Let $T_n = \sum_{i=1}^n X_i$. Then, $T_n \sim N(0, n + 2(n-1)\rho) = N(0, \sigma^2)$. It is
$$E\left(\frac{\bar{X}_n^2}{1 + \bar{X}_n^2}\right) = E\left(\frac{T_n^2}{n^2 + T_n^2}\right) = \frac{2}{\sqrt{2\pi}\,\sigma} \int_0^{\infty} \frac{x^2}{n^2 + x^2}\, e^{-\frac{x^2}{2\sigma^2}}\, dx \qquad \Big|\; y = \frac{x}{\sigma}, \; dy = \frac{dx}{\sigma}$$
$$= \frac{2}{\sqrt{2\pi}} \int_0^{\infty} \frac{\sigma^2 y^2}{n^2 + \sigma^2 y^2}\, e^{-\frac{y^2}{2}}\, dy = \frac{2}{\sqrt{2\pi}} \int_0^{\infty} \frac{(n + 2(n-1)\rho)\, y^2}{n^2 + (n + 2(n-1)\rho)\, y^2}\, e^{-\frac{y^2}{2}}\, dy \leq \frac{n + 2(n-1)\rho}{n^2} \underbrace{\int_0^{\infty} \frac{2}{\sqrt{2\pi}}\, y^2 e^{-\frac{y^2}{2}}\, dy}_{= 1, \text{ since Var of } N(0,1) \text{ distribution}} \longrightarrow 0 \text{ as } n \to \infty$$
$$\Longrightarrow \bar{X}_n \stackrel{p}{\longrightarrow} 0$$
Note:
We would like to have a WLLN that just depends on means but does not depend on the existence of finite variances. To approach this, we consider the following:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We truncate each $|X_i|$ at $c > 0$ and get
$$X_i^c = \begin{cases} X_i, & |X_i| \leq c \\ 0, & \text{otherwise} \end{cases}$$
Let $T_n^c = \sum_{i=1}^n X_i^c$ and $m_n = \sum_{i=1}^n E(X_i^c)$.
Lemma 6.2.6:
For $T_n$, $T_n^c$ and $m_n$ as defined in the Note above, it holds:
$$P(|T_n - m_n| > \epsilon) \leq P(|T_n^c - m_n| > \epsilon) + \sum_{i=1}^n P(|X_i| > c) \quad \forall \epsilon > 0$$

Proof:
It holds for all $\epsilon > 0$:
$$P(|T_n - m_n| > \epsilon) = P(|T_n - m_n| > \epsilon \text{ and } |X_i| \leq c \; \forall i \in \{1, \ldots, n\}) + P(|T_n - m_n| > \epsilon \text{ and } |X_i| > c \text{ for at least one } i \in \{1, \ldots, n\})$$
$$\stackrel{(*)}{\leq} P(|T_n^c - m_n| > \epsilon) + P(|X_i| > c \text{ for at least one } i \in \{1, \ldots, n\}) \leq P(|T_n^c - m_n| > \epsilon) + \sum_{i=1}^n P(|X_i| > c)$$
(*) holds since $T_n^c = T_n$ when $|X_i| \leq c \; \forall i \in \{1, \ldots, n\}$.

Note:
If the $X_i$'s are identically distributed, then
$$P(|T_n - m_n| > \epsilon) \leq P(|T_n^c - m_n| > \epsilon) + nP(|X_1| > c) \quad \forall \epsilon > 0.$$
If the $X_i$'s are iid, then
$$P(|T_n - m_n| > \epsilon) \leq \frac{nE((X_1^c)^2)}{\epsilon^2} + nP(|X_1| > c) \quad \forall \epsilon > 0. \quad (*)$$
Note that $P(|X_i| > c) = P(|X_1| > c) \; \forall i \in \mathbb{N}$ if the $X_i$'s are identically distributed and that $E((X_i^c)^2) = E((X_1^c)^2) \; \forall i \in \mathbb{N}$ if the $X_i$'s are iid.

Lecture 42/2:
Fr 12/08/00
Theorem 6.2.7:Khintchine’s WLLN
Let{Xi}

i=1
be a sequence of iid rv’s with finite meanE(Xi) =. Then it holds:
Xn=
1
n
Tn
p
−→
Proof:
If we takec=nand replaceǫbynǫin (∗) in the Note above, we get
P
`



Tn−mn
n




> ǫ
´
=P(|Tn−mn|> nǫ)≤
E((X
n
1
)
2
)

2
+nP(|X1|> n).
SinceE(|X1|)<∞, it isnP(|X1|> n)→0 asn→ ∞by Theorem 3.1.9. From Corollary
3.1.12 we know thatE(|X|
α
) =α
Z

0
x
α−1
P(|X|> x)dx. Therefore,
E((X
n
1)
2
) = 2
Z
n
0
xP(|X
n
1|> x)dx
= 2
Z
A
0
xP(|X
n
1|> x)dx+ 2
Z
n
A
xP(|X
n
1|> x)dx
(+)
≤K+δ
Z
n
A
dx
≤K+nδ
In (+),Ais chosen sufficiently large such thatxP(|X
n
1
|> x)<
δ
2
∀x≥Afor an arbitrary
constantδ >0 andK >0 a constant.
Therefore,
E((X
n
1
)
2
)

2

K

2
+
δ
ǫ
2
Sinceδis arbitrary, we can make the right hand side of this last inequality arbitrarily small
for sufficiently largen.
SinceE(Xi) =∀i, it is
mn
n
=
n
X
i=1
E(X
n
i)
n
→asn→ ∞.
Note:
Theorem 6.2.7 meets the previously stated goal of not havinga finite variance requirement.
18

6.3 Strong Laws of Large Numbers
Definition 6.3.1:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We say that $\{X_i\}$ obeys the SLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^{\infty}$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^{\infty}$ such that
$$B_n^{-1}(T_n - A_n) \stackrel{a.s.}{\longrightarrow} 0.$$

Note:
Unless otherwise specified, we will only use the case that $B_n = n$ in this section.
Theorem 6.3.2:
$$X_n \stackrel{a.s.}{\longrightarrow} X \iff \lim_{n \to \infty} P(\sup_{m \geq n} |X_m - X| > \epsilon) = 0 \quad \forall \epsilon > 0.$$

Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that $X = 0$ since $X_n \stackrel{a.s.}{\longrightarrow} X$ implies $X_n - X \stackrel{a.s.}{\longrightarrow} 0$. Thus, we have to prove:
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff \lim_{n \to \infty} P(\sup_{m \geq n} |X_m| > \epsilon) = 0 \quad \forall \epsilon > 0$$
Choose $\epsilon > 0$ and define
$$A_n(\epsilon) = \{\sup_{m \geq n} |X_m| > \epsilon\}, \qquad C = \{\lim_{n \to \infty} X_n = 0\}$$
"⟹":
Since $X_n \stackrel{a.s.}{\longrightarrow} 0$, we know that $P(C) = 1$ and therefore $P(C^c) = 0$.
Let $B_n(\epsilon) = C \cap A_n(\epsilon)$. Note that $B_{n+1}(\epsilon) \subseteq B_n(\epsilon)$ and for the limit set $\bigcap_{n=1}^{\infty} B_n(\epsilon) = \emptyset$. It follows that
$$\lim_{n \to \infty} P(B_n(\epsilon)) = P\left(\bigcap_{n=1}^{\infty} B_n(\epsilon)\right) = 0.$$
We also have
$$P(B_n(\epsilon)) = P(A_n \cap C) = 1 - P(C^c \cup A_n^c) = 1 - \underbrace{P(C^c)}_{=0} - P(A_n^c) + \underbrace{P(C^c \cap A_n^c)}_{=0} = P(A_n)$$
$$\Longrightarrow \lim_{n \to \infty} P(A_n(\epsilon)) = 0$$
"⟸":
Assume that $\lim_{n \to \infty} P(A_n(\epsilon)) = 0 \; \forall \epsilon > 0$ and define $D(\epsilon) = \{\overline{\lim}_{n \to \infty} |X_n| > \epsilon\}$.
Since $D(\epsilon) \subseteq A_n(\epsilon) \; \forall n \in \mathbb{N}$, it follows that $P(D(\epsilon)) = 0 \; \forall \epsilon > 0$. Also,
$$C^c = \{\lim_{n \to \infty} X_n \neq 0\} \subseteq \bigcup_{k=1}^{\infty} \{\overline{\lim}_{n \to \infty} |X_n| > \tfrac{1}{k}\}.$$
$$\Longrightarrow 1 - P(C) \leq \sum_{k=1}^{\infty} P(D(\tfrac{1}{k})) = 0 \Longrightarrow X_n \stackrel{a.s.}{\longrightarrow} 0$$
Note:
(i) $X_n \stackrel{a.s.}{\longrightarrow} 0$ implies that $\forall \epsilon > 0 \; \forall \delta > 0 \; \exists n_0 \in \mathbb{N} : P(\sup_{n \geq n_0} |X_n| > \epsilon) < \delta$.
(ii) Recall that for a given sequence of events $\{A_n\}_{n=1}^{\infty}$,
$$A = \lim_{n \to \infty} A_n = \lim_{n \to \infty} \bigcup_{k=n}^{\infty} A_k = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$$
is the event that infinitely many of the $A_n$ occur. We write $P(A) = P(A_n \; i.o.)$ where i.o. stands for "infinitely often".
(iii) Using the terminology defined in (ii) above, we can rewrite Theorem 6.3.2 as
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff P(|X_n| > \epsilon \; i.o.) = 0 \quad \forall \epsilon > 0.$$

Lecture 02:
We 01/10/01
Theorem 6.3.3: Borel–Cantelli Lemma
(i) 1st BC-Lemma: Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of events such that $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then $P(A) = 0$.
(ii) 2nd BC-Lemma: Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of independent events such that $\sum_{n=1}^{\infty} P(A_n) = \infty$. Then $P(A) = 1$.

Proof:
(i):
$$P(A) = P\left(\lim_{n \to \infty} \bigcup_{k=n}^{\infty} A_k\right) = \lim_{n \to \infty} P\left(\bigcup_{k=n}^{\infty} A_k\right) \leq \lim_{n \to \infty} \sum_{k=n}^{\infty} P(A_k) = \lim_{n \to \infty} \left(\sum_{k=1}^{\infty} P(A_k) - \sum_{k=1}^{n-1} P(A_k)\right) \stackrel{(*)}{=} 0$$
(*) holds since $\sum_{n=1}^{\infty} P(A_n) < \infty$.
(ii): We have $A^c = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c$. Therefore,
$$P(A^c) = P\left(\lim_{n \to \infty} \bigcap_{k=n}^{\infty} A_k^c\right) = \lim_{n \to \infty} P\left(\bigcap_{k=n}^{\infty} A_k^c\right).$$
If we choose $n_0 > n$, it holds that $\bigcap_{k=n}^{\infty} A_k^c \subseteq \bigcap_{k=n}^{n_0} A_k^c$. Therefore,
$$P\left(\bigcap_{k=n}^{\infty} A_k^c\right) \leq \lim_{n_0 \to \infty} P\left(\bigcap_{k=n}^{n_0} A_k^c\right) \stackrel{indep.}{=} \lim_{n_0 \to \infty} \prod_{k=n}^{n_0} (1 - P(A_k)) \stackrel{(+)}{\leq} \lim_{n_0 \to \infty} \exp\left(-\sum_{k=n}^{n_0} P(A_k)\right) = 0 \Longrightarrow P(A) = 1$$
(+) holds since
$$1 - \exp\left(-\sum_{j=n}^{n_0} \alpha_j\right) \leq 1 - \prod_{j=n}^{n_0} (1 - \alpha_j) \leq \sum_{j=n}^{n_0} \alpha_j \quad \text{for } n_0 > n \text{ and } 0 \leq \alpha_j \leq 1$$
Example 6.3.4:
Independence is necessary for the 2nd BC-Lemma:
Let $\Omega = (0,1)$ and $P$ a uniform distribution on $\Omega$. Let $A_n = (0, \frac{1}{n})$. Therefore,
$$\sum_{n=1}^{\infty} P(A_n) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty.$$
But for any $\omega \in \Omega$, $A_n$ occurs only for $n = 1, 2, \ldots, \lfloor \frac{1}{\omega} \rfloor$, where $\lfloor \frac{1}{\omega} \rfloor$ denotes the largest integer ("floor") that is $\leq \frac{1}{\omega}$. Therefore, $P(A) = P(A_n \; i.o.) = 0$.
Lemma 6.3.5:Kolmogorov’s Inequality
Let{Xi}

i=1
be a sequence of independent rv’s with common mean 0 and variancesσ
2
i
. Let
Tn=
n
X
i=1
Xi. Then it holds:
P( max
1≤k≤n
|Tk|≥ǫ)≤
n
X
i=1
σ
2
i
ǫ
2
∀ǫ >0
Proof:
See Rohatgi, page 268, Lemma 2, and Rohatgi/Saleh, page 284,Lemma 1.
Lemma 6.3.6:Kronecker’s Lemma
For any real numbersxn, if

X
n=1
xnconverges tos <∞andBn↑ ∞, then it holds:
1
Bn
n
X
k=1
Bkxk→0 asn→ ∞
22

Proof:
See Rohatgi, page 269, Lemma 3, and Rohatgi/Saleh, page 285,Lemma 2.
Theorem 6.3.7: Cauchy Criterion
$$X_n \stackrel{a.s.}{\longrightarrow} X \iff \lim_{n \to \infty} P(\sup_m |X_{n+m} - X_n| \leq \epsilon) = 1 \quad \forall \epsilon > 0.$$

Proof:
See Rohatgi, page 270, Theorem 5.
Theorem 6.3.8:
If $\sum_{n=1}^{\infty} Var(X_n) < \infty$, then $\sum_{n=1}^{\infty} (X_n - E(X_n))$ converges almost surely.

Proof:
See Rohatgi, page 272, Theorem 6, and Rohatgi/Saleh, page 286, Theorem 4.
Corollary 6.3.9:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's. Let $\{B_i\}_{i=1}^{\infty}$, $B_i > 0$, $B_i \uparrow \infty$, be a sequence of norming constants. Let $T_n = \sum_{i=1}^n X_i$.
If $\sum_{i=1}^{\infty} \frac{Var(X_i)}{B_i^2} < \infty$, then it holds:
$$\frac{T_n - E(T_n)}{B_n} \stackrel{a.s.}{\longrightarrow} 0$$

Proof:
This Corollary follows directly from Theorem 6.3.8 and Lemma 6.3.6.
Lemma 6.3.10: Equivalence Lemma
Let $\{X_i\}_{i=1}^{\infty}$ and $\{X_i^*\}_{i=1}^{\infty}$ be sequences of rv's. Let $T_n = \sum_{i=1}^n X_i$ and $T_n^* = \sum_{i=1}^n X_i^*$.
If the series $\sum_{i=1}^{\infty} P(X_i \neq X_i^*) < \infty$, then the series $\{X_i\}$ and $\{X_i^*\}$ are tail-equivalent and $T_n$ and $T_n^*$ are convergence-equivalent, i.e., for $B_n \uparrow \infty$ the sequences $\frac{1}{B_n} T_n$ and $\frac{1}{B_n} T_n^*$ converge on the same event and to the same limit, except for a null set.

Proof:
See Rohatgi, page 266, Lemma 1.

Lemma 6.3.11:
Let $X$ be a rv with $E(|X|) < \infty$. Then it holds:
$$\sum_{n=1}^{\infty} P(|X| \geq n) \leq E(|X|) \leq 1 + \sum_{n=1}^{\infty} P(|X| \geq n)$$

Proof:
Continuous case only:
Let $X$ have a pdf $f$. Then it holds:
$$E(|X|) = \int_{-\infty}^{\infty} |x| f(x)\, dx = \sum_{k=0}^{\infty} \int_{k \leq |x| \leq k+1} |x| f(x)\, dx$$
$$\Longrightarrow \sum_{k=0}^{\infty} k\, P(k \leq |X| \leq k+1) \leq E(|X|) \leq \sum_{k=0}^{\infty} (k+1)\, P(k \leq |X| \leq k+1)$$
It is
$$\sum_{k=0}^{\infty} k\, P(k \leq |X| \leq k+1) = \sum_{k=0}^{\infty} \sum_{n=1}^{k} P(k \leq |X| \leq k+1) = \sum_{n=1}^{\infty} \sum_{k=n}^{\infty} P(k \leq |X| \leq k+1) = \sum_{n=1}^{\infty} P(|X| \geq n)$$
Similarly,
$$\sum_{k=0}^{\infty} (k+1)\, P(k \leq |X| \leq k+1) = \sum_{n=1}^{\infty} P(|X| \geq n) + \sum_{k=0}^{\infty} P(k \leq |X| \leq k+1) = \sum_{n=1}^{\infty} P(|X| \geq n) + 1$$
Lecture 03:
Fr 01/12/01
Theorem 6.3.12:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's. Then it holds:
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff \sum_{n=1}^{\infty} P(|X_n| > \epsilon) < \infty \quad \forall \epsilon > 0$$

Proof:
See Rohatgi, page 265, Theorem 3.

Theorem 6.3.13:Kolmogorov’s SLLN
Let{Xi}

i=1
be a sequence of iid rv’s. LetTn=
n
X
i=1
Xi. Then it holds:
Tn
n
=Xn
a.s.
−→ <∞ ⇐⇒ E(|X|)<∞(and then=E(X))
Proof:
“=⇒”:
Suppose thatXn
a.s.
−→ <∞. It is
Tn=
n
X
i=1
Xi=
n−1
X
i=1
Xi+Xn=Tn−1+Xn.
=⇒
Xn
n
=
Tn
n
|{z}
a.s.
−→

n−1
n
|{z}
→1
Tn−1
n−1
|{z}
a.s.
−→
a.s.
−→0
By Theorem 6.3.12, we have

X
n=1
P(|
Xn
n
|≥1)<∞,i.e.,

X
n=1
P(|Xn|≥n)<∞.
Lemma6.3.11
=⇒ E(|X|)<∞
T h.6.2.7 (W LLN)
=⇒ Xn
p
−→E(X)
SinceXn
a.s.
−→, it follows by Theorem 6.1.26 thatXn
p
−→. Therefore, it must hold that
=E(X) by Theorem 6.1.11 (ii).
“⇐=”:
LetE(|X|)<∞. Define truncated rv’s:
X

k=
(
Xk,if|Xk|≤k
0,otherwise
T

n
=
n
X
k=1
X

k
X

n
=
T

n
n
Then it holds:

X
k=1
P(Xk6=X

k
) =

X
k=1
P(|Xk|> k)


X
k=1
P(|Xk|≥k)
25

iid
=

X
k=1
P(|X|≥k)
Lemma6.3.11
≤ E(|X|)
< ∞
By Lemma 6.3.10, it follows thatTnandT

nare convergence–equivalent. Thus, it is sufficient
to prove thatX

n
a.s.
−→E(X).
We now establish the conditions needed in Corollary 6.3.9. It is
V ar(X

n)≤E((X

n)
2
)
=
Z
n
−n
x
2
fX(x)dx
=
n−1
X
k=0
Z
k≤|x|<k+1
x
2
fX(x)dx

n−1
X
k=0
(k+ 1)
2
Z
k≤|x|<k+1
fX(x)dx
=
n−1
X
k=0
(k+ 1)
2
P(k≤|X|< k+ 1)

n
X
k=0
(k+ 1)
2
P(k≤|X|< k+ 1)
=⇒

X
n=1
1
n
2
V ar(X

n)≤

X
n=1
n
X
k=0
(k+ 1)
2
n
2
P(k≤|X|< k+ 1)
=

X
n=1
n
X
k=1
(k+ 1)
2
n
2
P(k≤|X|< k+ 1) +

X
n=1
1
n
2
P(0≤|X|<1)
(∗)


X
k=1
(k+ 1)
2
P(k≤|X|< k+ 1)


X
n=k
1
n
2
!
+ 2P(0≤|X|<1) (A)
(∗) holds since

X
n=1
1
n
2
=
π
2
6
≈1.65<2 and the first two sums can be rearranged as follows:
nk
11
21,2
31,2,3
.
.
.
.
.
.
=⇒
k n
11,2,3, . . .
22,3, . . .
33, . . .
.
.
.
.
.
.
26

It is

X
n=k
1
n
2
=
1
k
2
+
1
(k+ 1)
2
+
1
(k+ 2)
2
+. . .

1
k
2
+
1
k(k+ 1)
+
1
(k+ 1)(k+ 2)
+. . .
=
1
k
2
+

X
n=k+1
1
n(n−1)
From Bronstein, page 30, # 7, we know that
1 =
1
1∆2
+
1
2∆3
+
1
3∆4
+. . .+
1
n(n+ 1)
+. . .
=
1
1∆2
+
1
2∆3
+
1
3∆4
+. . .+
1
(k−1)∆k
+

X
n=k+1
1
n(n−1)
=⇒

X
n=k+1
1
n(n−1)
= 1−
1
1∆2

1
2∆3

1
3∆4
−. . .−
1
(k−1)∆k
=
1
2

1
2∆3

1
3∆4
−. . .−
1
(k−1)∆k
=
1
3

1
3∆4
−. . .−
1
(k−1)∆k
=
1
4
−. . .−
1
(k−1)∆k
=. . .
=
1
k
=⇒

X
n=k
1
n
2

1
k
2
+

X
n=k+1
1
n(n−1)
=
1
k
2
+
1
k

2
k
27

Using this result in (A), we get

X
n=1
1
n
2
V ar(X

n
)≤2

X
k=1
(k+ 1)
2
k
P(k≤|X|< k+ 1) + 2P(0≤|X|<1)
= 2

X
k=0
kP(k≤|X|< k+ 1) + 4

X
k=1
P(k≤|X|< k+ 1)
+ 2

X
k=1
1
k
P(k≤|X|< k+ 1) + 2P(0≤|X|<1)
(B)
≤2E(|X|) + 4 + 2 + 2
<∞
To establish (B), we use an inequality from the Proof of Lemma 6.3.11, i.e.,

X
k=0
kP(k≤|X|< k+ 1)
P roof


X
n=1
P(|X|≥n)
Lemma6.3.11
≤ E(|X|)
Thus, the conditions needed in Corollary 6.3.9 are met. WithBn=n, it follows that
1
n
T

n

1
n
E(T

n
)
a.s.
−→0 (C)
SinceE(X

n)→E(X) asn→ ∞, it follows by Kronecker’s Lemma (6.3.6) that
1
n
E(T

n)→
E(X). Thus, when we replace
1
n
E(T

n
) byE(X) in (C), we get
1
n
T

n
a.s.
−→E(X)
Lemma6.3.10
=⇒
1
n
Tn
a.s.
−→E(X)
sinceTnandT

n
are convergence–equivalent (as defined in Lemma 6.3.10).
28
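The "iff" in Kolmogorov's SLLN can be seen numerically by watching running means (a sketch added for illustration, assuming NumPy; the Exponential and Cauchy populations are arbitrary choices with and without a finite mean):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# The running mean of iid rv's settles down to E(X) exactly when E|X| < infinity.
exp_path = np.cumsum(rng.exponential(2.0, N)) / np.arange(1, N + 1)     # E|X| < inf
cauchy_path = np.cumsum(rng.standard_cauchy(N)) / np.arange(1, N + 1)   # E|X| = inf

for n in [1_000, 10_000, 100_000, 200_000]:
    print(f"n = {n:7d}: Exponential running mean = {exp_path[n-1]:7.4f}   "
          f"Cauchy running mean = {cauchy_path[n-1]:9.4f}")
# The Exponential(2) running mean converges to 2; the Cauchy running mean
# keeps wandering, in line with Theorem 6.3.13.
```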

Lecture 04:
We 01/17/01
6.4 Central Limit Theorems
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^{\infty}$. Suppose that the mgf $M_n(t)$ of $X_n$ exists.
Questions: Does $M_n(t)$ converge? Does it converge to a mgf $M(t)$? If it does converge, does it hold that $X_n \stackrel{d}{\longrightarrow} X$ for some rv $X$?

Example 6.4.1:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's such that $P(X_n = -n) = 1$. Then the mgf is $M_n(t) = E(e^{tX_n}) = e^{-tn}$. So
$$\lim_{n \to \infty} M_n(t) = \begin{cases} 0, & t > 0 \\ 1, & t = 0 \\ \infty, & t < 0 \end{cases}$$
So $M_n(t)$ does not converge to a mgf and $F_n(x) \to F(x) = 1 \; \forall x$. But $F(x)$ is not a cdf.

Note:
Due to Example 6.4.1, the existence of mgf's $M_n(t)$ that converge to something is not enough to conclude convergence in distribution.
Conversely, suppose that $X_n$ has mgf $M_n(t)$, $X$ has mgf $M(t)$, and $X_n \stackrel{d}{\longrightarrow} X$. Does it hold that $M_n(t) \to M(t)$? Not necessarily! See Rohatgi, page 277, Example 2, and Rohatgi/Saleh, page 289, Example 2, as a counterexample. Thus, convergence in distribution of rv's that all have mgf's does not imply the convergence of mgf's.
However, we can make the following statement in the next Theorem:

Theorem 6.4.2: Continuity Theorem
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^{\infty}$ and mgf's $\{M_n(t)\}_{n=1}^{\infty}$. Suppose that $M_n(t)$ exists for $|t| \leq t_0 \; \forall n$. If there exists a rv $X$ with cdf $F$ and mgf $M(t)$, which exists for $|t| \leq t_1 < t_0$, such that $\lim_{n \to \infty} M_n(t) = M(t) \; \forall t \in [-t_1, t_1]$, then $F_n \stackrel{w}{\longrightarrow} F$, i.e., $X_n \stackrel{d}{\longrightarrow} X$.

Example 6.4.3:
Let $X_n \sim Bin(n, \frac{\lambda}{n})$. Recall (e.g., from Theorem 3.3.12 and related Theorems) that for $X \sim Bin(n, p)$ the mgf is $M_X(t) = (1 - p + pe^t)^n$. Thus,
$$M_n(t) = \left(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^t\right)^n = \left(1 + \frac{\lambda(e^t - 1)}{n}\right)^n \stackrel{(*)}{\longrightarrow} e^{\lambda(e^t - 1)} \text{ as } n \to \infty.$$
In (*) we use the fact that $\lim_{n \to \infty} (1 + \frac{x}{n})^n = e^x$. Recall that $e^{\lambda(e^t - 1)}$ is the mgf of a rv $X$ where $X \sim Poisson(\lambda)$. Thus, we have established the well-known result that the Binomial distribution approaches the Poisson distribution, given that $n \to \infty$ in such a way that $np = \lambda > 0$.
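The convergence in Example 6.4.3 is easy to see on the pmf scale as well. The following sketch (added for illustration, assuming SciPy; the value $\lambda = 3$ is an arbitrary choice) compares $Bin(n, \lambda/n)$ with $Poisson(\lambda)$:

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0
ks = np.arange(0, 11)

# Compare Bin(n, lambda/n) pmf's with the Poisson(lambda) pmf as n grows.
for n in [10, 100, 10_000]:
    max_diff = np.max(np.abs(binom.pmf(ks, n, lam / n) - poisson.pmf(ks, lam)))
    print(f"n = {n:6d}: max |Bin pmf - Poisson pmf| over k = 0..10 is {max_diff:.5f}")
# The difference shrinks toward 0, matching the mgf argument above.
```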
Note:
Recall Theorem 3.3.11: Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of rv's with characteristic functions $\{\Phi_n(t)\}_{n=1}^{\infty}$. Suppose that
$$\lim_{n \to \infty} \Phi_n(t) = \Phi(t) \quad \forall t \in (-h, h) \text{ for some } h > 0,$$
and $\Phi(t)$ is the characteristic function of a rv $X$. Then $X_n \stackrel{d}{\longrightarrow} X$.
Theorem 6.4.4: Lindeberg–Lévy Central Limit Theorem
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of iid rv's with $E(X_i) = \mu$ and $0 < Var(X_i) = \sigma^2 < \infty$. Then it holds for $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ that
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\longrightarrow} Z$$
where $Z \sim N(0,1)$.

Proof:
Let $Z \sim N(0,1)$. According to Theorem 3.3.12 (v), the characteristic function of $Z$ is $\Phi_Z(t) = \exp(-\frac{1}{2} t^2)$.
Let $\Phi(t)$ be the characteristic function of $X_i$. We now determine the characteristic function $\Phi_n(t)$ of $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$:
$$\Phi_n(t) = E\left[\exp\left(it\, \frac{\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu\right)}{\sigma}\right)\right] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left(it\, \frac{\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n x_i - \mu\right)}{\sigma}\right) dF_X(x)$$
$$= \exp\left(-\frac{it\sqrt{n}\,\mu}{\sigma}\right) \int_{-\infty}^{\infty} \exp\left(\frac{itx_1}{\sqrt{n}\,\sigma}\right) dF_{X_1}(x_1) \cdots \int_{-\infty}^{\infty} \exp\left(\frac{itx_n}{\sqrt{n}\,\sigma}\right) dF_{X_n}(x_n) = \left(\Phi\left(\frac{t}{\sqrt{n}\,\sigma}\right) \exp\left(-\frac{it\mu}{\sqrt{n}\,\sigma}\right)\right)^n$$
Recall from Theorem 3.3.5 that if the $k^{th}$ moment exists, then $\Phi^{(k)}(0) = i^k E(X^k)$. In particular, it holds for the given distribution that $\Phi^{(1)}(0) = iE(X) = i\mu$ and $\Phi^{(2)}(0) = i^2 E(X^2) = i^2(\sigma^2 + \mu^2) = -(\sigma^2 + \mu^2)$. Also recall the definition of a Taylor series in MacLaurin's form:
$$f(x) = f(0) + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots + \frac{f^{(n)}(0)}{n!} x^n + \ldots, \quad \text{e.g., } f(x) = e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$$
Thus, if we develop a Taylor series for $\Phi(\frac{t}{\sqrt{n}\,\sigma})$ around $t = 0$, we get:
$$\Phi\left(\frac{t}{\sqrt{n}\,\sigma}\right) = \Phi(0) + \frac{t}{\sqrt{n}\,\sigma}\Phi'(0) + \frac{1}{2}\frac{t^2}{n\sigma^2}\Phi''(0) + \ldots = 1 + \frac{it\mu}{\sqrt{n}\,\sigma} - \frac{1}{2}\frac{t^2(\sigma^2 + \mu^2)}{n\sigma^2} + o\left(\left(\frac{t}{\sqrt{n}\,\sigma}\right)^2\right)$$
Here we make use of the Landau symbol "o". In general, if we write $u(x) = o(v(x))$ for $x \to L$, this implies $\lim_{x \to L} \frac{u(x)}{v(x)} = 0$, i.e., $u(x)$ goes to 0 faster than $v(x)$ or $v(x)$ goes to $\infty$ faster than $u(x)$. We say that $u(x)$ is of smaller order than $v(x)$ as $x \to L$. Examples are $\frac{1}{x^3} = o(\frac{1}{x^2})$ and $x^2 = o(x^3)$ for $x \to \infty$. See Rohatgi, page 6, for more details on the Landau symbols "O" and "o".
Similarly, if we develop a Taylor series for $\exp(-\frac{it\mu}{\sqrt{n}\,\sigma})$ around $t = 0$, we get:
$$\exp\left(-\frac{it\mu}{\sqrt{n}\,\sigma}\right) = 1 - \frac{it\mu}{\sqrt{n}\,\sigma} - \frac{1}{2}\frac{t^2\mu^2}{n\sigma^2} + o\left(\left(\frac{t}{\sqrt{n}\,\sigma}\right)^2\right)$$
Combining these results, we get:
$$\Phi_n(t) = \left[\left(1 + \frac{it\mu}{\sqrt{n}\,\sigma} - \frac{t^2(\sigma^2 + \mu^2)}{2n\sigma^2} + o\left(\left(\tfrac{t}{\sqrt{n}\,\sigma}\right)^2\right)\right)\left(1 - \frac{it\mu}{\sqrt{n}\,\sigma} - \frac{t^2\mu^2}{2n\sigma^2} + o\left(\left(\tfrac{t}{\sqrt{n}\,\sigma}\right)^2\right)\right)\right]^n$$
$$= \left(1 - \frac{it\mu}{\sqrt{n}\,\sigma} - \frac{t^2\mu^2}{2n\sigma^2} + \frac{it\mu}{\sqrt{n}\,\sigma} + \frac{t^2\mu^2}{n\sigma^2} - \frac{t^2(\sigma^2 + \mu^2)}{2n\sigma^2} + o\left(\left(\tfrac{t}{\sqrt{n}\,\sigma}\right)^2\right)\right)^n = \left(1 + \frac{-\frac{1}{2}t^2}{n} + o\left(\frac{1}{n}\right)\right)^n \stackrel{(*)}{\longrightarrow} \exp\left(-\frac{t^2}{2}\right) \text{ as } n \to \infty$$
Thus, $\lim_{n \to \infty} \Phi_n(t) = \Phi_Z(t) \; \forall t$. For a proof of (*), see Rohatgi, page 278, Lemma 1. According to the Note above, it holds that
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\longrightarrow} Z.$$
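A short Monte Carlo illustration of Theorem 6.4.4 (a sketch added here, assuming NumPy and SciPy; the skewed Exponential population is an arbitrary choice) compares a standardized-mean probability with the standard normal cdf:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu, sigma = 3.0, 3.0                  # Exponential(scale=3): mean 3, sd 3
reps = 50_000

for n in [5, 30, 200]:
    x = rng.exponential(scale=3.0, size=(reps, n))
    z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma
    print(f"n = {n:3d}: P(Z_n <= 1) ~ {np.mean(z <= 1):.4f}  (Phi(1) = {norm.cdf(1):.4f})")
# Even for a heavily skewed population, the standardized sample mean
# approaches the N(0,1) distribution as n grows.
```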
Lecture 05:
Fr 01/19/01
Definition 6.4.5:
Let $X_1, X_2$ be iid non-degenerate rv's with common cdf $F$. Let $a_1, a_2 > 0$. We say that $F$ is stable if there exist constants $A$ and $B$ (depending on $a_1$ and $a_2$) such that $B^{-1}(a_1 X_1 + a_2 X_2 - A)$ also has cdf $F$.

Note:
When generalizing the previous definition to sequences of rv's, we have the following examples for stable distributions:
• $X_i$ iid Cauchy. Then $\frac{1}{n} \sum_{i=1}^n X_i \sim$ Cauchy (here $B_n = n$, $A_n = 0$).
• $X_i$ iid $N(0,1)$. Then $\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \sim N(0,1)$ (here $B_n = \sqrt{n}$, $A_n = 0$).

Definition 6.4.6:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's with common cdf $F$. Let $T_n = \sum_{i=1}^n X_i$. $F$ belongs to the domain of attraction of a distribution $V$ if there exist norming and centering constants $\{B_n\}_{n=1}^{\infty}$, $B_n > 0$, and $\{A_n\}_{n=1}^{\infty}$ such that
$$P(B_n^{-1}(T_n - A_n) \leq x) = F_{B_n^{-1}(T_n - A_n)}(x) \to V(x) \text{ as } n \to \infty$$
at all continuity points $x$ of $V$.

Note:
A very general Theorem from Loève states that only stable distributions can have domains of attraction. From the practical point of view, a wide class of distributions $F$ belong to the domain of attraction of the Normal distribution.

Theorem 6.4.7: Lindeberg Central Limit Theorem
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent non-degenerate rv's with cdf's $\{F_i\}_{i=1}^{\infty}$. Assume that $E(X_k) = \mu_k$ and $Var(X_k) = \sigma_k^2 < \infty$. Let $s_n^2 = \sum_{k=1}^n \sigma_k^2$.
If the $F_k$ are absolutely continuous with pdf's $f_k = F_k'$, assume that it holds for all $\epsilon > 0$ that
$$(A) \quad \lim_{n \to \infty} \frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x - \mu_k| > \epsilon s_n\}} (x - \mu_k)^2 f_k(x)\, dx = 0.$$
If the $X_k$ are discrete rv's with support $\{x_{kl}\}$ and probabilities $\{p_{kl}\}$, $l = 1, 2, \ldots$, assume that it holds for all $\epsilon > 0$ that
$$(B) \quad \lim_{n \to \infty} \frac{1}{s_n^2} \sum_{k=1}^n \sum_{|x_{kl} - \mu_k| > \epsilon s_n} (x_{kl} - \mu_k)^2 p_{kl} = 0.$$
The conditions (A) and (B) are called the Lindeberg Condition (LC). If either LC holds, then
$$\frac{\sum_{k=1}^n (X_k - \mu_k)}{s_n} \stackrel{d}{\longrightarrow} Z$$
where $Z \sim N(0,1)$.

Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alternative proof is given in Rohatgi, pages 282–288.

Note:
Feller shows that the LC is a necessary condition if $\frac{\sigma_n^2}{s_n^2} \to 0$ and $s_n^2 \to \infty$ as $n \to \infty$.
Corollary 6.4.8:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's such that $\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$ has the same distribution for all $n$. If $E(X_i) = 0$ and $Var(X_i) = 1$, then $X_i \sim N(0,1)$.

Proof:
Let $F$ be the common cdf of $\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$ for all $n$ (including $n = 1$). By the CLT,
$$\lim_{n \to \infty} P\left(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \leq x\right) = \Phi(x),$$
where $\Phi(x)$ denotes $P(Z \leq x)$ for $Z \sim N(0,1)$. Also, $P(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \leq x) = F(x)$ for each $n$. Therefore, we must have $F(x) = \Phi(x)$.

Note:
In general, if $X_1, X_2, \ldots$ are independent rv's such that there exists a constant $A$ with $P(|X_n| \leq A) = 1 \; \forall n$, then the LC is satisfied if $s_n^2 \to \infty$ as $n \to \infty$. Why??
Suppose that $s_n^2 \to \infty$ as $n \to \infty$. Since the $|X_k|$'s are uniformly bounded (by $A$), so are the rv's $(X_k - E(X_k))$. Thus, for every $\epsilon > 0$ there exists an $N_{\epsilon}$ such that if $n \geq N_{\epsilon}$ then
$$P(|X_k - E(X_k)| < \epsilon s_n, \; k = 1, \ldots, n) = 1.$$
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the set $\{|x - \mu_k| > \epsilon s_n\} = \emptyset$.
The converse also holds. For a sequence of uniformly bounded independent rv's, a necessary and sufficient condition for the CLT to hold is that $s_n^2 \to \infty$ as $n \to \infty$.
Example 6.4.9:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's such that $E(X_k) = 0$, $\alpha_k = E(|X_k|^{2+\delta}) < \infty$ for some $\delta > 0$, and $\sum_{k=1}^n \alpha_k = o(s_n^{2+\delta})$.
Does the LC hold? It is:
$$\frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x| > \epsilon s_n\}} x^2 f_k(x)\, dx \stackrel{(A)}{\leq} \frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x| > \epsilon s_n\}} \frac{|x|^{2+\delta}}{\epsilon^{\delta} s_n^{\delta}} f_k(x)\, dx \leq \frac{1}{s_n^2 \epsilon^{\delta} s_n^{\delta}} \sum_{k=1}^n \int_{-\infty}^{\infty} |x|^{2+\delta} f_k(x)\, dx = \frac{1}{s_n^2 \epsilon^{\delta} s_n^{\delta}} \sum_{k=1}^n \alpha_k = \frac{1}{\epsilon^{\delta}} \cdot \frac{\sum_{k=1}^n \alpha_k}{s_n^{2+\delta}} \stackrel{(B)}{\longrightarrow} 0 \text{ as } n \to \infty$$
(A) holds since for $|x| > \epsilon s_n$, it is $\frac{|x|^{\delta}}{\epsilon^{\delta} s_n^{\delta}} > 1$. (B) holds since $\sum_{k=1}^n \alpha_k = o(s_n^{2+\delta})$.
Thus, the LC is satisfied and the CLT holds.

Note:
(i) In general, if there exists a $\delta > 0$ such that
$$\frac{\sum_{k=1}^n E(|X_k - \mu_k|^{2+\delta})}{s_n^{2+\delta}} \longrightarrow 0 \text{ as } n \to \infty,$$
then the LC holds.
(ii) Both the CLT and the WLLN hold for a large class of sequences of rv's $\{X_i\}_{i=1}^n$. If the $\{X_i\}$'s are independent uniformly bounded rv's, i.e., if $P(|X_n| \leq M) = 1 \; \forall n$, the WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that $s_n^2 \to \infty$ as $n \to \infty$.
If the rv's $\{X_i\}$ are iid, then the CLT is a stronger result than the WLLN since the CLT provides an estimate of the probability $P(\frac{1}{n}|\sum_{i=1}^n X_i - n\mu| \geq \epsilon) \approx 1 - P(|Z| \leq \frac{\epsilon\sqrt{n}}{\sigma})$, where $Z \sim N(0,1)$, and the WLLN follows. However, note that the CLT requires the existence of a 2nd moment while the WLLN does not.
(iii) If the $\{X_i\}$ are independent (but not identically distributed) rv's, the CLT may apply while the WLLN does not.
(iv) See Rohatgi, pages 289–293, and Rohatgi/Saleh, pages 299–303, for additional details and examples.

7 Sample Moments
7.1 Random Sampling
(Based on Casella/Berger, Section 5.1 & 5.2)
Definition 7.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F$. We say that $\{X_1, \ldots, X_n\}$ is a (random) sample of size $n$ from the population distribution $F$. The vector of values $\{x_1, \ldots, x_n\}$ is called a realization of the sample. A rv $g(X_1, \ldots, X_n)$ which is a Borel-measurable function of $X_1, \ldots, X_n$ and does not depend on any unknown parameter is called a (sample) statistic.
Definition 7.1.2:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Then
$$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$$
is called the sample mean and
$$S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 = \frac{1}{n-1} \left(\sum_{i=1}^n X_i^2 - n\bar{X}^2\right)$$
is called the sample variance.
Definition 7.1.3:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. The function
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I_{(-\infty, x]}(X_i)$$
is called the empirical cumulative distribution function (empirical cdf).

Note:
For any fixed $x \in \mathbb{R}$, $\hat{F}_n(x)$ is a rv.
Theorem 7.1.4:
The rv $\hat{F}_n(x)$ has pmf
$$P\left(\hat{F}_n(x) = \frac{j}{n}\right) = \binom{n}{j} (F(x))^j (1 - F(x))^{n-j}, \quad j \in \{0, 1, \ldots, n\},$$
with $E(\hat{F}_n(x)) = F(x)$ and $Var(\hat{F}_n(x)) = \frac{F(x)(1 - F(x))}{n}$.

Proof:
It is $I_{(-\infty, x]}(X_i) \sim Bin(1, F(x))$. Then $n\hat{F}_n(x) \sim Bin(n, F(x))$. The results follow immediately.
Corollary 7.1.5:
By the WLLN, it follows that $\hat{F}_n(x) \stackrel{p}{\longrightarrow} F(x)$.

Corollary 7.1.6:
By the CLT, it follows that
$$\frac{\sqrt{n}(\hat{F}_n(x) - F(x))}{\sqrt{F(x)(1 - F(x))}} \stackrel{d}{\longrightarrow} Z,$$
where $Z \sim N(0,1)$.

Theorem 7.1.7: Glivenko–Cantelli Theorem
$\hat{F}_n(x)$ converges uniformly to $F(x)$, i.e., it holds for all $\epsilon > 0$ that
$$\lim_{n \to \infty} P\left(\sup_{-\infty < x < \infty} |\hat{F}_n(x) - F(x)| > \epsilon\right) = 0.$$
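The uniform convergence of the empirical cdf can be watched directly (a sketch added for illustration, assuming NumPy and SciPy; the sup distance is approximated on a grid, and the $N(0,1)$ population is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
grid = np.linspace(-3, 3, 601)

# Empirical cdf of an N(0,1) sample vs. the true cdf: the worst-case
# distance shrinks as n grows, illustrating Glivenko-Cantelli.
for n in [50, 500, 5000, 50_000]:
    x = np.sort(rng.standard_normal(n))
    Fn = np.searchsorted(x, grid, side="right") / n   # Fhat_n on the grid
    sup_dist = np.max(np.abs(Fn - norm.cdf(grid)))
    print(f"n = {n:6d}: sup_x |Fhat_n(x) - F(x)| ~ {sup_dist:.4f}")
```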
Lecture 06:
Mo 01/22/01
Definition 7.1.8:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. We call
$$a_k = \frac{1}{n} \sum_{i=1}^n X_i^k$$
the sample moment of order $k$ and
$$b_k = \frac{1}{n} \sum_{i=1}^n (X_i - a_1)^k = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^k$$
the sample central moment of order $k$.

Note:
It is $b_1 = 0$ and $b_2 = \frac{n-1}{n} S^2$.

Theorem 7.1.9:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Assume that $E(X) = \mu$, $Var(X) = \sigma^2$, and $E((X - \mu)^k) = \mu_k$ exist. Then it holds:
(i) $E(a_1) = E(\bar{X}) = \mu$
(ii) $Var(a_1) = Var(\bar{X}) = \frac{\sigma^2}{n}$
(iii) $E(b_2) = \frac{n-1}{n}\sigma^2$
(iv) $Var(b_2) = \frac{\mu_4 - \mu_2^2}{n} - \frac{2(\mu_4 - 2\mu_2^2)}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}$
(v) $E(S^2) = \sigma^2$
(vi) $Var(S^2) = \frac{\mu_4}{n} - \frac{n-3}{n(n-1)}\mu_2^2$

Proof:
(i) $E(\bar{X}) = \frac{1}{n} \sum_{i=1}^n E(X_i) = \frac{n\mu}{n} = \mu$
(ii) $Var(\bar{X}) = \left(\frac{1}{n}\right)^2 \sum_{i=1}^n Var(X_i) = \frac{\sigma^2}{n}$
(iii)
$$E(b_2) = E\left(\frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2\right) = E\left(\frac{1}{n} \sum_{i=1}^n X_i^2 - \frac{1}{n^2}\left(\sum_{i=1}^n X_i\right)^2\right) = E(X^2) - \frac{1}{n^2} E\left(\sum_{i=1}^n X_i^2 + \sum\sum_{i \neq j} X_i X_j\right)$$
$$\stackrel{(*)}{=} E(X^2) - \frac{1}{n^2}\left(nE(X^2) + n(n-1)\mu^2\right) = \frac{n-1}{n}\left(E(X^2) - \mu^2\right) = \frac{n-1}{n}\sigma^2$$
(*) holds since $X_i$ and $X_j$ are independent and then, due to Theorem 4.5.3, it holds that $E(X_i X_j) = E(X_i)E(X_j)$.
See Casella/Berger, page 214, and Rohatgi, pages 303–306, for the proof of parts (iv) through (vi) and results regarding the 3rd and 4th moments and covariances.

7.2 Sample Moments and the Normal Distribution
(Based on Casella/Berger, Section 5.3)
Theorem 7.2.1:
Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$ rv's. Then $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ are independent.

Proof:
By computing the joint mgf of $(\bar{X}, X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X})$, we can use Theorem 4.6.3 (iv) to show independence. We will use the following two facts:
(1):
$$M_{\bar{X}}(t) = M_{\frac{1}{n}\sum_{i=1}^n X_i}(t) \stackrel{(A)}{=} \prod_{i=1}^n M_{X_i}\left(\frac{t}{n}\right) \stackrel{(B)}{=} \left[\exp\left(\frac{\mu t}{n} + \frac{\sigma^2 t^2}{2n^2}\right)\right]^n = \exp\left(\mu t + \frac{\sigma^2 t^2}{2n}\right)$$
(A) holds by Theorem 4.6.4 (i). (B) follows from Theorem 3.3.12 (vi) since the $X_i$'s are iid.
(2):
$$M_{X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X}}(t_1, t_2, \ldots, t_n) \stackrel{\text{Def. 4.6.1}}{=} E\left[\exp\left(\sum_{i=1}^n t_i(X_i - \bar{X})\right)\right] = E\left[\exp\left(\sum_{i=1}^n t_i X_i - \bar{X}\sum_{i=1}^n t_i\right)\right] = E\left[\exp\left(\sum_{i=1}^n X_i(t_i - \bar{t})\right)\right], \text{ where } \bar{t} = \frac{1}{n}\sum_{i=1}^n t_i$$
$$= E\left[\prod_{i=1}^n \exp(X_i(t_i - \bar{t}))\right] \stackrel{(C)}{=} \prod_{i=1}^n E(\exp(X_i(t_i - \bar{t}))) = \prod_{i=1}^n M_{X_i}(t_i - \bar{t}) \stackrel{(D)}{=} \prod_{i=1}^n \exp\left(\mu(t_i - \bar{t}) + \frac{\sigma^2(t_i - \bar{t})^2}{2}\right) = \exp\left(\mu\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0} + \frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\right) = \exp\left(\frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\right)$$
(C) follows from Theorem 4.5.3 since the $X_i$'s are independent. (D) holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = t_i - \bar{t}$.
From (1) and (2), it follows:
$$M_{\bar{X}, X_1 - \bar{X}, \ldots, X_n - \bar{X}}(t, t_1, \ldots, t_n) \stackrel{\text{Def. 4.6.1}}{=} E\left[\exp(t\bar{X} + t_1(X_1 - \bar{X}) + \ldots + t_n(X_n - \bar{X}))\right] = E\left[\exp\left(\sum_{i=1}^n X_i t_i - \left(\sum_{i=1}^n t_i - t\right)\bar{X}\right)\right]$$
$$= E\left[\exp\left(\sum_{i=1}^n X_i t_i - \frac{(t_1 + \ldots + t_n - t)\sum_{i=1}^n X_i}{n}\right)\right] = E\left[\exp\left(\sum_{i=1}^n X_i\left(t_i - \frac{t_1 + \ldots + t_n - t}{n}\right)\right)\right] = E\left[\prod_{i=1}^n \exp\left(X_i\,\frac{n t_i - n\bar{t} + t}{n}\right)\right], \text{ where } \bar{t} = \frac{1}{n}\sum_{i=1}^n t_i$$
$$\stackrel{(E)}{=} \prod_{i=1}^n E\left[\exp\left(X_i\,\frac{t + n(t_i - \bar{t})}{n}\right)\right] = \prod_{i=1}^n M_{X_i}\left(\frac{t + n(t_i - \bar{t})}{n}\right) \stackrel{(F)}{=} \prod_{i=1}^n \exp\left(\frac{\mu[t + n(t_i - \bar{t})]}{n} + \frac{\sigma^2}{2}\frac{1}{n^2}[t + n(t_i - \bar{t})]^2\right)$$
$$= \exp\left(\frac{\mu}{n}\left(nt + n\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0}\right)\right)\exp\left(\frac{\sigma^2}{2n^2}\left(nt^2 + 2nt\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0} + n^2\sum_{i=1}^n (t_i - \bar{t})^2\right)\right) = \exp\left(\mu t + \frac{\sigma^2}{2n}t^2\right)\exp\left(\frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\right) \stackrel{(1)\&(2)}{=} M_{\bar{X}}(t)\, M_{X_1 - \bar{X}, \ldots, X_n - \bar{X}}(t_1, \ldots, t_n)$$
Thus, $\bar{X}$ and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ are independent by Theorem 4.6.3 (iv). (E) follows from Theorem 4.5.3 since the $X_i$'s are independent. (F) holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = \frac{t + n(t_i - \bar{t})}{n}$.
Corollary 7.2.2:
$\bar{X}$ and $S^2$ are independent.

Proof:
This can be seen since $S^2$ is a function of the vector $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$, and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is independent of $\bar{X}$, as previously shown in Theorem 7.2.1. We can use Theorem 4.2.7 to formally complete this proof.
Corollary 7.2.3:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Proof:
Recall the following facts:
• If $Z \sim N(0,1)$ then $Z^2 \sim \chi^2_1$.
• If $Y_1, \ldots, Y_n \sim$ iid $\chi^2_1$, then $\sum_{i=1}^n Y_i \sim \chi^2_n$.
• For $\chi^2_n$, the mgf is $M(t) = (1 - 2t)^{-n/2}$.
• If $X_i \sim N(\mu, \sigma^2)$, then $\frac{X_i - \mu}{\sigma} \sim N(0,1)$ and $\frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_1$.
Therefore,
$$\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n \quad \text{and} \quad \frac{(\bar{X} - \mu)^2}{(\sigma/\sqrt{n})^2} = n\,\frac{(\bar{X} - \mu)^2}{\sigma^2} \sim \chi^2_1. \quad (*)$$
Now consider
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n ((X_i - \bar{X}) + (\bar{X} - \mu))^2 = \sum_{i=1}^n \left((X_i - \bar{X})^2 + 2(X_i - \bar{X})(\bar{X} - \mu) + (\bar{X} - \mu)^2\right) = (n-1)S^2 + 0 + n(\bar{X} - \mu)^2$$
Therefore,
$$\underbrace{\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}}_{W} = \underbrace{\frac{n(\bar{X} - \mu)^2}{\sigma^2}}_{U} + \underbrace{\frac{(n-1)S^2}{\sigma^2}}_{V}$$
We have an expression of the form $W = U + V$. Since $U$ and $V$ are functions of $\bar{X}$ and $S^2$, we know by Corollary 7.2.2 that they are independent and also that their mgf's factor by Theorem 4.6.3 (iv). Now we can write:
$$M_W(t) = M_U(t) M_V(t) \Longrightarrow M_V(t) = \frac{M_W(t)}{M_U(t)} \stackrel{(*)}{=} \frac{(1-2t)^{-n/2}}{(1-2t)^{-1/2}} = (1-2t)^{-(n-1)/2}$$
Note that this is the mgf of $\chi^2_{n-1}$; by the uniqueness of mgf's, $V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$.
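Corollary 7.2.3 can be verified by simulation (a sketch added for illustration, assuming NumPy and SciPy; $n$, $\mu$, $\sigma$ are arbitrary choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, mu, sigma = 8, 5.0, 2.0
reps = 200_000

x = rng.normal(mu, sigma, size=(reps, n))
V = (n - 1) * x.var(axis=1, ddof=1) / sigma**2   # (n-1) S^2 / sigma^2

# Compare simulated moments and a tail probability with chi^2_{n-1}.
print("mean:", round(V.mean(), 3), " (chi2 mean =", n - 1, ")")
print("var :", round(V.var(), 3), " (chi2 var  =", 2 * (n - 1), ")")
q = chi2.ppf(0.95, df=n - 1)
print("P(V > 95% quantile) ~", round((V > q).mean(), 4), "(should be ~0.05)")
```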
Corollary 7.2.4:
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} \sim t_{n-1}.$$

Proof:
Recall the following facts:
• If $Z \sim N(0,1)$, $Y \sim \chi^2_n$ and $Z, Y$ independent, then $\frac{Z}{\sqrt{Y/n}} \sim t_n$.
• $Z_1 = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0,1)$, $Y_{n-1} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, and $Z_1, Y_{n-1}$ are independent.
Therefore,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\dfrac{S/\sqrt{n}}{\sigma/\sqrt{n}}} = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{S^2(n-1)}{\sigma^2(n-1)}}} = \frac{Z_1}{\sqrt{\dfrac{Y_{n-1}}{n-1}}} \sim t_{n-1}.$$

Corollary 7.2.5:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{[(m-1)S_1^2/\sigma_1^2] + [(n-1)S_2^2/\sigma_2^2]}} \cdot \sqrt{\frac{m+n-2}{\sigma_1^2/m + \sigma_2^2/n}} \sim t_{m+n-2}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{(m-1)S_1^2 + (n-1)S_2^2}} \cdot \sqrt{\frac{mn(m+n-2)}{m+n}} \sim t_{m+n-2}$$

Proof:
Homework.
Corollary 7.2.6:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{m-1, n-1}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$$

Proof:
Recall that, if $Y_1 \sim \chi^2_m$ and $Y_2 \sim \chi^2_n$ are independent, then
$$F = \frac{Y_1/m}{Y_2/n} \sim F_{m,n}.$$
Now, $C_1 = \frac{(m-1)S_1^2}{\sigma_1^2} \sim \chi^2_{m-1}$ and $C_2 = \frac{(n-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n-1}$. Therefore,
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\dfrac{(m-1)S_1^2}{\sigma_1^2(m-1)}}{\dfrac{(n-1)S_2^2}{\sigma_2^2(n-1)}} = \frac{C_1/(m-1)}{C_2/(n-1)} \sim F_{m-1, n-1}.$$
If $\sigma_1 = \sigma_2$, then $\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$.

Lecture 07:
We 01/24/01
8 The Theory of Point Estimation
(Based on Casella/Berger, Chapters 6 & 7)
8.1 The Problem of Point Estimation
Let $X$ be a rv defined on a probability space $(\Omega, \mathcal{L}, P)$. Suppose that the cdf $F$ of $X$ depends on some set of parameters and that the functional form of $F$ is known except for a finite number of these parameters.

Definition 8.1.1:
The set of admissible values of $\theta$ is called the parameter space $\Theta$. If $F_{\theta}$ is the cdf of $X$ when $\theta$ is the parameter, the set $\{F_{\theta} : \theta \in \Theta\}$ is the family of cdf's. Likewise, we speak of the family of pdf's if $X$ is continuous, and the family of pmf's if $X$ is discrete.

Example 8.1.2:
$X \sim Bin(n, p)$, $p$ unknown. Then $\theta = p$ and $\Theta = \{p : 0 < p < 1\}$.
$X \sim N(\mu, \sigma^2)$, $(\mu, \sigma^2)$ unknown. Then $\theta = (\mu, \sigma^2)$ and $\Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty, \sigma^2 > 0\}$.

Definition 8.1.3:
Let $X$ be a sample from $F_{\theta}$, $\theta \in \Theta \subseteq \mathbb{R}$. Let a statistic $T(X)$ map $\mathbb{R}^n$ to $\Theta$. We call $T(X)$ an estimator of $\theta$ and $T(x)$, for a realization $x$ of $X$, a (point) estimate of $\theta$. In practice, the term estimate is used for both.

Example 8.1.4:
Let $X_1, \ldots, X_n$ be iid $Bin(1, p)$, $p$ unknown. Estimates of $p$ include:
$$T_1(X) = \bar{X}, \quad T_2(X) = X_1, \quad T_3(X) = \frac{1}{2}, \quad T_4(X) = \frac{X_1 + X_2}{3}$$
Obviously, not all estimates are equally good.

8.2 Properties of Estimates
Definition 8.2.1:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's with cdf $F_{\theta}$, $\theta \in \Theta$. A sequence of point estimates $T_n(X_1, \ldots, X_n) = T_n$ is called
• (weakly) consistent for $\theta$ if $T_n \stackrel{p}{\longrightarrow} \theta$ as $n \to \infty \; \forall \theta \in \Theta$
• strongly consistent for $\theta$ if $T_n \stackrel{a.s.}{\longrightarrow} \theta$ as $n \to \infty \; \forall \theta \in \Theta$
• consistent in the $r^{th}$ mean for $\theta$ if $T_n \stackrel{r}{\longrightarrow} \theta$ as $n \to \infty \; \forall \theta \in \Theta$

Example 8.2.2:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid $Bin(1, p)$ rv's. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. Since $E(X_i) = p$, it follows by the WLLN that $\bar{X}_n \stackrel{p}{\longrightarrow} p$, i.e., consistency, and by the SLLN that $\bar{X}_n \stackrel{a.s.}{\longrightarrow} p$, i.e., strong consistency.
However, a consistent estimate may not be unique. We may even have infinitely many consistent estimates, e.g.,
$$\frac{\sum_{i=1}^n X_i + a}{n + b} \stackrel{p}{\longrightarrow} p \quad \forall \text{ finite } a, b \in \mathbb{R}.$$
Theorem 8.2.3:
If $T_n$ is a sequence of estimates such that $E(T_n) \to \theta$ and $Var(T_n) \to 0$ as $n \to \infty$, then $T_n$ is consistent for $\theta$.

Proof:
$$P(|T_n - \theta| > \epsilon) \stackrel{(A)}{\leq} \frac{E((T_n - \theta)^2)}{\epsilon^2} = \frac{E[((T_n - E(T_n)) + (E(T_n) - \theta))^2]}{\epsilon^2} = \frac{Var(T_n) + 2E[(T_n - E(T_n))(E(T_n) - \theta)] + (E(T_n) - \theta)^2}{\epsilon^2} = \frac{Var(T_n) + (E(T_n) - \theta)^2}{\epsilon^2} \stackrel{(B)}{\longrightarrow} 0 \text{ as } n \to \infty$$
(A) holds due to Corollary 3.5.2 (Markov's Inequality). (B) holds since $Var(T_n) \to 0$ as $n \to \infty$ and $E(T_n) \to \theta$ as $n \to \infty$.
Definition 8.2.4:
Let $\mathcal{G}$ be a group of Borel-measurable functions of $\mathbb{R}^n$ onto itself which is closed under composition and inverse. A family of distributions $\{P_{\theta} : \theta \in \Theta\}$ is invariant under $\mathcal{G}$ if for each $g \in \mathcal{G}$ and for all $\theta \in \Theta$, there exists a unique $\theta' = \bar{g}(\theta)$ such that the distribution of $g(X)$ is $P_{\theta'}$ whenever the distribution of $X$ is $P_{\theta}$. We call $\bar{g}$ the induced function on $\theta$ since $P_{\theta}(g(X) \in A) = P_{\bar{g}(\theta)}(X \in A)$.

Example 8.2.5:
Let $(X_1, \ldots, X_n)$ be iid $N(\mu, \sigma^2)$ with pdf
$$f(x_1, \ldots, x_n) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right).$$
The group of linear transformations $\mathcal{G}$ has elements
$$g(x_1, \ldots, x_n) = (ax_1 + b, \ldots, ax_n + b), \quad a > 0, \; -\infty < b < \infty.$$
The pdf of $g(X)$ is
$$f^*(x_1^*, \ldots, x_n^*) = \frac{1}{(\sqrt{2\pi}\,a\sigma)^n} \exp\left(-\frac{1}{2a^2\sigma^2} \sum_{i=1}^n (x_i^* - a\mu - b)^2\right), \quad x_i^* = ax_i + b, \; i = 1, \ldots, n.$$
So $\{f : -\infty < \mu < \infty, \sigma^2 > 0\}$ is invariant under this group $\mathcal{G}$, with $\bar{g}(\mu, \sigma^2) = (a\mu + b, a^2\sigma^2)$, where $-\infty < a\mu + b < \infty$ and $a^2\sigma^2 > 0$.
Definition 8.2.6:
Let $\mathcal{G}$ be a group of transformations that leaves $\{F_{\theta} : \theta \in \Theta\}$ invariant. An estimate $T$ is invariant under $\mathcal{G}$ if
$$T(g(X_1), \ldots, g(X_n)) = T(X_1, \ldots, X_n) \quad \forall g \in \mathcal{G}.$$

Definition 8.2.7:
An estimate $T$ is:
• location invariant if $T(X_1 + a, \ldots, X_n + a) = T(X_1, \ldots, X_n)$, $a \in \mathbb{R}$
• scale invariant if $T(cX_1, \ldots, cX_n) = T(X_1, \ldots, X_n)$, $c \in \mathbb{R} - \{0\}$
• permutation invariant if $T(X_{i_1}, \ldots, X_{i_n}) = T(X_1, \ldots, X_n)$ for all permutations $(i_1, \ldots, i_n)$ of $1, \ldots, n$

Example 8.2.8:
Let $F_{\theta} \sim N(\mu, \sigma^2)$.
$S^2$ is location invariant.
$\bar{X}$ and $S^2$ are both permutation invariant.
Neither $\bar{X}$ nor $S^2$ is scale invariant.

Note:
Different sources make different use of the term invariant. Mood, Graybill & Boes (1974), for example, define location invariant as $T(X_1 + a, \ldots, X_n + a) = T(X_1, \ldots, X_n) + a$ (page 332) and scale invariant as $T(cX_1, \ldots, cX_n) = cT(X_1, \ldots, X_n)$ (page 336). According to their definition, $\bar{X}$ is location invariant and scale invariant.

8.3 Sufficient Statistics
(Based on Casella/Berger, Section 6.2)
Definition 8.3.1:
Let $X = (X_1, \ldots, X_n)$ be a sample from $\{F_{\theta} : \theta \in \Theta \subseteq \mathbb{R}^k\}$. A statistic $T = T(X)$ is sufficient for $\theta$ (or for the family of distributions $\{F_{\theta} : \theta \in \Theta\}$) iff the conditional distribution of $X$ given $T = t$ does not depend on $\theta$ (except possibly on a null set $A$ where $P_{\theta}(T \in A) = 0 \; \forall \theta$).

Note:
(i) The sample $X$ is always sufficient, but this is not particularly interesting and usually is excluded from further considerations.
(ii) Idea: Once we have "reduced" from $X$ to $T(X)$, we have captured all the information in $X$ about $\theta$.
(iii) Usually, there are several sufficient statistics for a given family of distributions.
Example 8.3.2:
Let $X = (X_1, \ldots, X_n)$ be iid $Bin(1, p)$ rv's. To estimate $p$, can we ignore the order and simply count the number of "successes"?
Let $T(X) = \sum_{i=1}^n X_i$. It is
$$P\left(X_1 = x_1, \ldots, X_n = x_n \,\Big|\, \sum_{i=1}^n X_i = t\right) = \frac{P(X_1 = x_1, \ldots, X_n = x_n, T = t)}{P(T = t)} = \begin{cases} \dfrac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(T = t)}, & \sum_{i=1}^n x_i = t \\ 0, & \text{otherwise} \end{cases}$$
$$= \begin{cases} \dfrac{p^t (1-p)^{n-t}}{\binom{n}{t} p^t (1-p)^{n-t}}, & \sum_{i=1}^n x_i = t \\ 0, & \text{otherwise} \end{cases} = \begin{cases} \dfrac{1}{\binom{n}{t}}, & \sum_{i=1}^n x_i = t \\ 0, & \text{otherwise} \end{cases}$$
This does not depend on $p$. Thus, $T = \sum_{i=1}^n X_i$ is sufficient for $p$.
Example 8.3.3:
Let $X = (X_1, \ldots, X_n)$ be iid Poisson$(\lambda)$. Is $T = \sum_{i=1}^n X_i$ sufficient for $\lambda$? It is
$$P(X_1 = x_1, \ldots, X_n = x_n \mid T = t) = \frac{P(X_1 = x_1, \ldots, X_n = x_n, T = t)}{P(T = t)} = \begin{cases} \dfrac{\prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}}{\frac{e^{-n\lambda} (n\lambda)^t}{t!}}, & \sum_{i=1}^n x_i = t \\ 0, & \text{otherwise} \end{cases}$$
$$= \begin{cases} \dfrac{\frac{e^{-n\lambda} \lambda^{\sum x_i}}{\prod x_i!}}{\frac{e^{-n\lambda} (n\lambda)^t}{t!}}, & \sum_{i=1}^n x_i = t \\ 0, & \text{otherwise} \end{cases} = \begin{cases} \dfrac{t!}{n^t \prod_{i=1}^n x_i!}, & \sum_{i=1}^n x_i = t \\ 0, & \text{otherwise} \end{cases}$$
This does not depend on $\lambda$. Thus, $T = \sum_{i=1}^n X_i$ is sufficient for $\lambda$.
Example 8.3.4:
Let X_1, X_2 be iid Poisson(λ). Is T = X_1 + 2X_2 sufficient for λ? It is
   P(X_1 = 0, X_2 = 1 | X_1 + 2X_2 = 2)
      = P(X_1 = 0, X_2 = 1, X_1 + 2X_2 = 2) / P(X_1 + 2X_2 = 2)
      = P(X_1 = 0, X_2 = 1) / P(X_1 + 2X_2 = 2)
      = P(X_1 = 0, X_2 = 1) / [ P(X_1 = 0, X_2 = 1) + P(X_1 = 2, X_2 = 0) ]
      = e^{−λ} (e^{−λ} λ) / [ e^{−λ} (e^{−λ} λ) + (e^{−λ} λ²/2) e^{−λ} ]
      = 1 / (1 + λ/2),
i.e., this is a counter–example. This expression still depends on λ. Thus, T = X_1 + 2X_2 is
not sufficient for λ.
Note:
Definition 8.3.1 can be difficult to check. In addition, it requires a candidate statistic. We
need something constructive that helps in finding sufficient statistics without having to check
Definition 8.3.1. The next Theorem helps in finding such statistics.
Lecture 08:
Fr 01/26/01
Theorem 8.3.5:Factorization Criterion
LetX1, . . . , Xnbe rv’s with pdf (or pmf)f(x1, . . . , xn|θ), θ∈Θ. ThenT(X1, . . . , Xn) is
sufficient forθiff we can write
f(x1, . . . , xn|θ) =h(x1, . . . , xn)g(T(x1, . . . , xn)|θ),
wherehdoes not depend onθandgdoes not depend onx1, . . . , xnexcept as a function ofT.
Proof:
Discrete case only.
“=⇒”:
SupposeT(X) is sufficient forθ. Let
g(t|θ) =Pθ(T(X) =t)
h(x) =P(X=x|T(X) =t)
Then it holds:
f(x|θ) =Pθ(X=x)
(∗)
=Pθ(X=x, T(X) =T(x) =t)
=Pθ(T(X) =t)P(X=x|T(X) =t)
=g(t|θ)h(x)
(∗) holds sinceX=ximplies thatT(X) =T(x) =t.
“⇐=”:
Suppose the factorization holds. For fixedt0, it is
Pθ(T(X) =t0) =
X
{x:T(x)=t0}
Pθ(X=x)
50

=
X
{x:T(x)=t0}
h(x)g(T(x)|θ)
=g(t0|θ)
X
{x:T(x)=t0}
h(x) ( A)
IfPθ(T(X) =t0)>0, it holds:
Pθ(X=x|T(X) =t0) =
Pθ(X=x, T(X) =t0)
Pθ(T(X) =t0)
=





Pθ(X=x)
Pθ(T(X) =t0)
,ifT(x) =t0
0, otherwise
(A)
=









g(t0|θ)h(x)
g(t0|θ)
X
{x:T(x)=t0}
h(x)
,ifT(x) =t0
0, otherwise
=









h(x)
X
{x:T(x)=t0}
h(x)
,ifT(x) =t0
0, otherwise
This last expression does not depend onθ. Thus,T(X) is sufficient forθ.
Note:
(i) In the Theorem above,θandTmay be vectors.
(ii) IfTis sufficient forθ, then also any 1–to–1 mapping ofTis sufficient forθ. However,
this does not hold for arbitrary functions ofT.
Example 8.3.6:
Let X_1, ..., X_n be iid Bin(1, p). It is
   P(X_1 = x_1, ..., X_n = x_n | p) = p^{Σ x_i} (1−p)^{n − Σ x_i}.
Thus, h(x_1, ..., x_n) = 1 and g(Σ x_i | p) = p^{Σ x_i} (1−p)^{n − Σ x_i}.
Hence, T = Σ_{i=1}^n X_i is sufficient for p.
Example 8.3.7:
Let X_1, ..., X_n be iid Poisson(λ). It is
   P(X_1 = x_1, ..., X_n = x_n | λ) = Π_{i=1}^n e^{−λ} λ^{x_i} / x_i! = e^{−nλ} λ^{Σ x_i} / Π x_i!.
Thus, h(x_1, ..., x_n) = 1 / Π x_i! and g(Σ x_i | λ) = e^{−nλ} λ^{Σ x_i}.
Hence, T = Σ_{i=1}^n X_i is sufficient for λ.
Example 8.3.8:
Let X_1, ..., X_n be iid N(μ, σ²) where μ ∈ IR and σ² > 0 are both unknown. It is
   f(x_1, ..., x_n | μ, σ²) = (1/(√(2π) σ)^n) exp( −Σ (x_i − μ)² / (2σ²) )
                            = (1/(√(2π) σ)^n) exp( −Σ x_i² / (2σ²) + μ Σ x_i / σ² − n μ² / (2σ²) ).
Hence, T = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) is sufficient for (μ, σ²).
Example 8.3.9:
Let X_1, ..., X_n be iid U(θ, θ+1) where −∞ < θ < ∞. It is
   f(x_1, ..., x_n | θ) = { 1,  if θ < x_i < θ+1 ∀ i ∈ {1, ..., n};   0, otherwise }
                        = Π_{i=1}^n I_(θ,∞)(x_i) I_(−∞,θ+1)(x_i)
                        = I_(θ,∞)(min(x_i)) I_(−∞,θ+1)(max(x_i))
Hence, T = (X_(1), X_(n)) is sufficient for θ.
Definition 8.3.10:
Let {f_θ(x) : θ ∈ Θ} be a family of pdf's (or pmf's). We say the family is complete if
   E_θ(g(X)) = 0 ∀ θ ∈ Θ
implies that
   P_θ(g(X) = 0) = 1 ∀ θ ∈ Θ.
We say a statistic T(X) is complete if the family of distributions of T is complete.
Example 8.3.11:
Let X_1, ..., X_n be iid Bin(1, p). We have seen in Example 8.3.6 that T = Σ_{i=1}^n X_i is sufficient
for p. Is it also complete?
We know that T ∼ Bin(n, p). Thus,
   E_p(g(T)) = Σ_{t=0}^n g(t) (n choose t) p^t (1−p)^{n−t} = 0 ∀ p ∈ (0,1)
implies that
   (1−p)^n Σ_{t=0}^n g(t) (n choose t) (p/(1−p))^t = 0 ∀ p ∈ (0,1).
However, Σ_{t=0}^n g(t) (n choose t) (p/(1−p))^t is a polynomial in p/(1−p) which is only equal
to 0 for all p ∈ (0,1) if all of its coefficients are 0.
Therefore, g(t) = 0 for t = 0, 1, ..., n. Hence, T is complete.
Lecture 09:
Mo 01/29/01
Example 8.3.12:
Let X_1, ..., X_n be iid N(θ, θ²). We know from Example 8.3.8 that T = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²)
is sufficient for θ. Is it also complete?
We know that Σ_{i=1}^n X_i ∼ N(nθ, nθ²). Therefore,
   E( (Σ_{i=1}^n X_i)² ) = nθ² + n²θ² = n(n+1)θ²
   E( Σ_{i=1}^n X_i² ) = n(θ² + θ²) = 2nθ²
It follows that
   E( 2 (Σ_{i=1}^n X_i)² − (n+1) Σ_{i=1}^n X_i² ) = 0 ∀ θ.
But g(x_1, ..., x_n) = 2 (Σ_{i=1}^n x_i)² − (n+1) Σ_{i=1}^n x_i² is not identically 0.
Therefore, T is not complete.
Note:
Recall from Section 5.2 what it means if we say the family of distributions{fθ:θ∈Θ}is a
one–parameter (ork–parameter) exponential family.
Theorem 8.3.13:
Let {f_θ : θ ∈ Θ} be a k–parameter exponential family. Let T_1, ..., T_k be statistics. Then the
family of distributions of (T_1(X), ..., T_k(X)) is also a k–parameter exponential family given by
   g_θ(t) = exp( Σ_{i=1}^k t_i Q_i(θ) + D(θ) + S*(t) )
for suitable S*(t).
Proof:
The proof follows from our Theorems regarding the transformation of rv’s.
Theorem 8.3.14:
Let {f_θ : θ ∈ Θ} be a k–parameter exponential family with k ≤ n and let T_1, ..., T_k be
statistics as in Theorem 8.3.13. Suppose that the range of Q = (Q_1, ..., Q_k) contains an open
set in IR^k. Then T = (T_1(X), ..., T_k(X)) is a complete sufficient statistic.

Proof:
Discrete case and k = 1 only.
Write Q(θ) = θ and let (a, b) ⊆ Θ.
It follows from the Factorization Criterion (Theorem 8.3.5) that T is sufficient for θ. Thus,
we only have to show that T is complete, i.e., that
   E_θ(g(T(X))) = Σ_t g(t) P_θ(T(X) = t)
             (A)= Σ_t g(t) exp(θt + D(θ) + S*(t)) = 0 ∀ θ                      (B)
implies g(t) = 0 ∀ t. Note that in (A) we make use of a result established in Theorem 8.3.13.
We now define functions g⁺ and g⁻ as:
   g⁺(t) = { g(t),  if g(t) ≥ 0;   0, otherwise }
   g⁻(t) = { −g(t),  if g(t) < 0;   0, otherwise }
It is g(t) = g⁺(t) − g⁻(t) where both functions, g⁺ and g⁻, are non–negative functions. Using
g⁺ and g⁻, it turns out that (B) is equivalent to
   Σ_t g⁺(t) exp(θt + S*(t)) = Σ_t g⁻(t) exp(θt + S*(t)) ∀ θ                   (C)
where the term exp(D(θ)) in (A) drops out as a constant on both sides.
If we fix θ_0 ∈ (a, b) and define
   p⁺(t) = g⁺(t) exp(θ_0 t + S*(t)) / Σ_t g⁺(t) exp(θ_0 t + S*(t)),
   p⁻(t) = g⁻(t) exp(θ_0 t + S*(t)) / Σ_t g⁻(t) exp(θ_0 t + S*(t)),
it is obvious that p⁺(t) ≥ 0 ∀ t and p⁻(t) ≥ 0 ∀ t and by construction Σ_t p⁺(t) = 1 and
Σ_t p⁻(t) = 1. Hence, p⁺ and p⁻ are both pmf's.
From (C), it follows for the mgf's M⁺ and M⁻ of p⁺ and p⁻ that
   M⁺(δ) = Σ_t e^{δt} p⁺(t)
         = Σ_t g⁺(t) exp((θ_0 + δ)t + S*(t)) / Σ_t g⁺(t) exp(θ_0 t + S*(t))
      (C)= Σ_t g⁻(t) exp((θ_0 + δ)t + S*(t)) / Σ_t g⁻(t) exp(θ_0 t + S*(t))
         = Σ_t e^{δt} p⁻(t)
         = M⁻(δ) ∀ δ ∈ (a − θ_0, b − θ_0), where a − θ_0 < 0 and b − θ_0 > 0.
By the uniqueness of mgf's it follows that p⁺(t) = p⁻(t) ∀ t.
⇒ g⁺(t) = g⁻(t) ∀ t
⇒ g(t) = 0 ∀ t
⇒ T is complete
Definition 8.3.15:
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k} and let T = T(X) be a sufficient
statistic for θ. T = T(X) is called a minimal sufficient statistic for θ if, for any other
sufficient statistic T* = T*(X), T(x) is a function of T*(x).
Note:
(i) A minimal sufficient statistic achieves the greatest possible data reduction for a sufficient
statistic.
(ii) IfTis minimal sufficient forθ, then also any 1–to–1 mapping ofTis minimal sufficient
forθ. However, this does not hold for arbitrary functions ofT.
Definition 8.3.16:
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k}. A statistic T = T(X) is called
ancillary if its distribution does not depend on the parameter θ.
Example 8.3.17:
Let X_1, ..., X_n be iid U(θ, θ+1) where −∞ < θ < ∞. As shown in Example 8.3.9,
T = (X_(1), X_(n)) is sufficient for θ. Define
   R_n = X_(n) − X_(1).
Use the result from Stat 6710, Homework Assignment 5, Question (viii) (a) to obtain
   f_{R_n}(r | θ) = f_{R_n}(r) = n(n−1) r^{n−2} (1−r) I_(0,1)(r).
This means that R_n ∼ Beta(n−1, 2). Moreover, R_n does not depend on θ and, therefore,
R_n is ancillary.
Theorem 8.3.18: Basu's Theorem
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k}. If T = T(X) is a complete and
minimal sufficient statistic, then T is independent of any ancillary statistic.

Theorem 8.3.19:
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k}. If any minimal sufficient
statistic T = T(X) exists for θ, then any complete statistic is also a minimal sufficient statistic.
Note:
(i) Due to the last Theorem, Basu’s Theorem often only is stated in terms of a complete
sufficient statistic (which automatically is also a minimal sufficient statistic).
(ii) As already shown in Corollary 7.2.2, X̄ and S² are independent when sampling from a
N(μ, σ²) population. As outlined in Casella/Berger, page 289, we could also use Basu's
Theorem to obtain the same result.
(iii) The converse of Basu’s Theorem is false, i.e., ifT(X) is independent of any ancillary
statistic, it does not necessarily follow thatT(X) is a complete, minimal sufficient statis-
tic.
(iv) As seen in Examples 8.3.8 and 8.3.12, T = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) is sufficient for θ
but it is not complete when X_1, ..., X_n are iid N(θ, θ²). However, it can be shown that T is
minimal sufficient. So, there may be distributions where a minimal sufficient statistic exists
but a complete statistic does not exist.
(v) As with invariance, there exist several different definitions of ancillarity within the lit-
erature — the one defined in this chapter being the most commonly used.
8.4 Unbiased Estimation
(Based on Casella/Berger, Section 7.3)
Definition 8.4.1:
Let {F_θ : θ ∈ Θ}, Θ ⊆ IR, be a nonempty set of cdf's. A Borel–measurable function T from
IR^n to Θ is called unbiased for θ (or an unbiased estimate for θ) if
   E_θ(T) = θ ∀ θ ∈ Θ.
Any function d(θ) for which an unbiased estimate T exists is called an estimable function.
If T is biased,
   b(θ, T) = E_θ(T) − θ
is called the bias of T.
Example 8.4.2:
If the k-th population moment exists, the k-th sample moment is an unbiased estimate. If
Var(X) = σ², the sample variance S² is an unbiased estimate of σ².
However, note that for X_1, ..., X_n iid N(μ, σ²), S is not an unbiased estimate of σ:
   (n−1)S²/σ² ∼ χ²_{n−1} = Gamma((n−1)/2, 2)
⇒ E( √((n−1)S²/σ²) ) = ∫_0^∞ √x · x^{(n−1)/2 − 1} e^{−x/2} / ( 2^{(n−1)/2} Γ((n−1)/2) ) dx
                      = ( √2 Γ(n/2) / Γ((n−1)/2) ) ∫_0^∞ x^{n/2 − 1} e^{−x/2} / ( 2^{n/2} Γ(n/2) ) dx
                   (*)= √2 Γ(n/2) / Γ((n−1)/2)
⇒ E(S) = σ √(2/(n−1)) Γ(n/2) / Γ((n−1)/2)
(*) holds since x^{n/2 − 1} e^{−x/2} / ( 2^{n/2} Γ(n/2) ) is the pdf of a Gamma(n/2, 2) distribution
and thus the integral is 1.
So S is biased for σ and
   b(σ, S) = σ ( √(2/(n−1)) Γ(n/2) / Γ((n−1)/2) − 1 ).
Note:
IfTis unbiased forθ,g(T) is not necessarily unbiased forg(θ) (unlessgis a linear function).
Lecture 10:
We 01/31/01
Example 8.4.3:
Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd
as in the following case:
Let X ∼ Poisson(λ) and let d(λ) = e^{−2λ}. Consider T(X) = (−1)^X as an estimate. It is
   E_λ(T(X)) = e^{−λ} Σ_{x=0}^∞ (−1)^x λ^x / x! = e^{−λ} Σ_{x=0}^∞ (−λ)^x / x! = e^{−λ} e^{−λ} = e^{−2λ} = d(λ)
Hence T is unbiased for d(λ) but since T alternates between −1 and 1 while d(λ) > 0, T is not
a good estimate.
Note:
If there exist 2 unbiased estimatesT1andT2ofθ, then any estimate of the formαT1+(1−α)T2
for 0< α <1 will also be an unbiased estimate ofθ. Which one should we choose?
Definition 8.4.4:
The mean square error of an estimate T of θ is defined as
   MSE(θ, T) = E_θ((T − θ)²) = Var_θ(T) + (b(θ, T))².
Let {T_i}_{i=1}^∞ be a sequence of estimates of θ. If
   lim_{i→∞} MSE(θ, T_i) = 0 ∀ θ ∈ Θ,
then {T_i} is called a mean–squared–error consistent (MSE–consistent) sequence of estimates of θ.
Note:
(i) If we allow all estimates and compare their MSE, generally it will depend on θ which
estimate is better. For example, θ̂ = 17 is perfect if θ = 17, but it is lousy otherwise.
(ii) If we restrict ourselves to the class of unbiased estimates, then MSE(θ, T) = Var_θ(T).
(iii) MSE–consistency means that both the bias and the variance of T_i approach 0 as i → ∞.
Definition 8.4.5:
Let θ_0 ∈ Θ and let U(θ_0) be the class of all unbiased estimates T of θ_0 such that E_{θ_0}(T²) < ∞.
Then T_0 ∈ U(θ_0) is called a locally minimum variance unbiased estimate (LMVUE) at θ_0 if
   E_{θ_0}((T_0 − θ_0)²) ≤ E_{θ_0}((T − θ_0)²) ∀ T ∈ U(θ_0).

Definition 8.4.6:
Let U be the class of all unbiased estimates T of θ ∈ Θ such that E_θ(T²) < ∞ ∀ θ ∈ Θ. Then
T_0 ∈ U is called a uniformly minimum variance unbiased estimate (UMVUE) of θ if
   E_θ((T_0 − θ)²) ≤ E_θ((T − θ)²) ∀ θ ∈ Θ ∀ T ∈ U.
An Excursion into Logic II
In our first “Excursion into Logic” in Stat 6710 MathematicalStatistics I, we have established
the following results:
A ⇒ B is equivalent to ¬B ⇒ ¬A is equivalent to ¬A ∨ B:

   A B | A⇒B | ¬A | ¬B | ¬B⇒¬A | ¬A∨B
   1 1 |  1  |  0 |  0 |   1   |  1
   1 0 |  0  |  0 |  1 |   0   |  0
   0 1 |  1  |  1 |  0 |   1   |  1
   0 0 |  1  |  1 |  1 |   1   |  1

When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equivalently,
we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means,
assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always
false, i.e., a contradiction. And here is the corresponding truth table:

   A B | A⇒B | ¬B | A∧¬B | (A∧¬B)⇒0
   1 1 |  1  |  0 |   0  |    1
   1 0 |  0  |  1 |   1  |    0
   0 1 |  1  |  0 |   0  |    1
   0 0 |  1  |  1 |   0  |    1

Note:
We make use of this proof technique in the Proof of the next Theorem.

Example:
Let A: x = 5 and B: x² = 25. Obviously A ⇒ B.
But we can also prove this in the following way:
   A: x = 5 and ¬B: x² ≠ 25
   ⇒ x² = 25 ∧ x² ≠ 25
This is impossible, i.e., a contradiction. Thus, A ⇒ B.
Theorem 8.4.7:
Let U be the class of all unbiased estimates T of θ ∈ Θ with E_θ(T²) < ∞ ∀ θ, and suppose
that U is non–empty. Let U_0 be the set of all unbiased estimates of 0, i.e.,
   U_0 = {ν : E_θ(ν) = 0, E_θ(ν²) < ∞ ∀ θ ∈ Θ}.
Then T_0 ∈ U is UMVUE iff
   E_θ(ν T_0) = 0 ∀ θ ∈ Θ ∀ ν ∈ U_0.

Proof:
Note that E_θ(ν T_0) always exists. This follows from the Cauchy–Schwarz–Inequality (Theorem
4.5.7 (ii)):
   (E_θ(ν T_0))² ≤ E_θ(ν²) E_θ(T_0²) < ∞
because E_θ(ν²) < ∞ and E_θ(T_0²) < ∞. Therefore, also E_θ(ν T_0) < ∞.
"⇒":
We suppose that T_0 ∈ U is UMVUE and that E_{θ_0}(ν_0 T_0) ≠ 0 for some θ_0 ∈ Θ and some ν_0 ∈ U_0.
It holds
   E_θ(T_0 + λν_0) = E_θ(T_0) = θ ∀ λ ∈ IR ∀ θ ∈ Θ.
Therefore, T_0 + λν_0 ∈ U ∀ λ ∈ IR.
Also, E_{θ_0}(ν_0²) > 0 (since otherwise, P_{θ_0}(ν_0 = 0) = 1 and then E_{θ_0}(ν_0 T_0) = 0).
Now let
   λ = − E_{θ_0}(T_0 ν_0) / E_{θ_0}(ν_0²).
Then,
   E_{θ_0}((T_0 + λν_0)²) = E_{θ_0}(T_0² + 2λ T_0 ν_0 + λ² ν_0²)
      = E_{θ_0}(T_0²) − 2 (E_{θ_0}(T_0 ν_0))² / E_{θ_0}(ν_0²) + (E_{θ_0}(T_0 ν_0))² / E_{θ_0}(ν_0²)
      = E_{θ_0}(T_0²) − (E_{θ_0}(T_0 ν_0))² / E_{θ_0}(ν_0²)
      < E_{θ_0}(T_0²),
and therefore,
   Var_{θ_0}(T_0 + λν_0) < Var_{θ_0}(T_0).
This means, T_0 is not UMVUE, i.e., a contradiction!
"⇐":
Let E_θ(ν T_0) = 0 for some T_0 ∈ U for all θ ∈ Θ and all ν ∈ U_0.
We choose T ∈ U, then also T_0 − T ∈ U_0 and
   E_θ(T_0 (T_0 − T)) = 0 ∀ θ ∈ Θ,
i.e.,
   E_θ(T_0²) = E_θ(T_0 T) ∀ θ ∈ Θ.
It follows from the Cauchy–Schwarz–Inequality (Theorem 4.5.7 (ii)) that
   E_θ(T_0²) = E_θ(T_0 T) ≤ (E_θ(T_0²))^{1/2} (E_θ(T²))^{1/2}.
This implies
   (E_θ(T_0²))^{1/2} ≤ (E_θ(T²))^{1/2}
and
   Var_θ(T_0) ≤ Var_θ(T),
where T is an arbitrary unbiased estimate of θ. Thus, T_0 is UMVUE.
Lecture 11:
Mo 02/05/01
Theorem 8.4.8:
Let U be the non–empty class of unbiased estimates of θ ∈ Θ as defined in Theorem 8.4.7.
Then there exists at most one UMVUE T ∈ U for θ.

Proof:
Suppose T_0, T_1 ∈ U are both UMVUE.
Then T_1 − T_0 ∈ U_0, Var_θ(T_0) = Var_θ(T_1), and E_θ(T_0 (T_1 − T_0)) = 0 ∀ θ ∈ Θ
⇒ E_θ(T_0²) = E_θ(T_0 T_1)
⇒ Cov_θ(T_0, T_1) = E_θ(T_0 T_1) − E_θ(T_0) E_θ(T_1) = E_θ(T_0²) − (E_θ(T_0))² = Var_θ(T_0) = Var_θ(T_1) ∀ θ ∈ Θ
⇒ ρ_{T_0 T_1} = 1 ∀ θ ∈ Θ
⇒ P_θ(a T_0 + b T_1 = 0) = 1 for some a, b ∀ θ ∈ Θ
⇒ θ = E_θ(T_0) = E_θ(−(b/a) T_1) = E_θ(T_1) ∀ θ ∈ Θ
⇒ −b/a = 1
⇒ P_θ(T_0 = T_1) = 1 ∀ θ ∈ Θ
Theorem 8.4.9:
(i) If an UMVUETexists for a real functiond(θ), thenλTis the UMVUE forλd(θ), λ∈IR.
(ii) If UMVUE’sT1andT2exist for real functionsd1(θ) andd2(θ), respectively, thenT1+T2
is the UMVUE ford1(θ) +d2(θ).
Proof:
Homework.
Theorem 8.4.10:
If a sample consists ofnindependent observationsX1, . . . , Xnfrom the same distribution, the
UMVUE, if it exists, is permutation invariant.
Proof:
Homework.
Theorem 8.4.11: Rao–Blackwell
Let {F_θ : θ ∈ Θ} be a family of cdf's, and let h be any statistic in U, where U is the non–empty
class of all unbiased estimates of θ with E_θ(h²) < ∞. Let T be a sufficient statistic for
{F_θ : θ ∈ Θ}. Then the conditional expectation E_θ(h | T) is independent of θ and it is an
unbiased estimate of θ. Additionally,
   E_θ((E(h | T) − θ)²) ≤ E_θ((h − θ)²) ∀ θ ∈ Θ
with equality iff h = E(h | T).

Proof:
By Theorem 4.7.3, E_θ(E(h | T)) = E_θ(h) = θ.
Since X | T does not depend on θ due to sufficiency, neither does E(h | T) depend on θ.
Thus, we only have to show that
   E_θ((E(h | T))²) ≤ E_θ(h²) = E_θ(E(h² | T)).
Thus, we only have to show that
   (E(h | T))² ≤ E(h² | T).
But the Cauchy–Schwarz–Inequality (Theorem 4.5.7 (ii)) gives us
   (E(h | T))² ≤ E(h² | T) E(1 | T) = E(h² | T).
Equality holds iff
   E_θ((E(h | T))²) = E_θ(h²) = E_θ(E(h² | T))
⇔ E_θ(E(h² | T) − (E(h | T))²) = 0
⇔ E_θ(Var(h | T)) = 0
⇔ E_θ(E((h − E(h | T))² | T)) = 0
⇔ E((h − E(h | T))² | T) = 0
⇔ h is a function of T and h = E(h | T).
For the proof of the last step, see Rohatgi, page 170–171, Theorem 2, Corollary, and Proof of
the Corollary.
Theorem 8.4.12: Lehmann–Scheffé
If T is a complete sufficient statistic and if there exists an unbiased estimate h of θ, then
E(h | T) is the (unique) UMVUE.

Proof:
Suppose that h_1, h_2 ∈ U. Then E_θ(E(h_1 | T)) = E_θ(E(h_2 | T)) = θ by Theorem 8.4.11.
Therefore,
   E_θ(E(h_1 | T) − E(h_2 | T)) = 0 ∀ θ ∈ Θ.
Since T is complete, E(h_1 | T) = E(h_2 | T).
Therefore, E(h | T) must be the same for all h ∈ U and E(h | T) improves all h ∈ U. Therefore,
E(h | T) is UMVUE by Theorem 8.4.11.
Note:
We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient
statisticT:
(i) If we can find an unbiased estimateh(T), it will be the UMVUE sinceE(h(T)|T) =
h(T).
(ii) If we have any unbiased estimatehand if we can calculateE(h|T), thenE(h|T)
will be the UMVUE. The process of determining the UMVUE this way often is called
Rao–Blackwellization.
(iii) Even if a complete sufficient statistic does not exist, the UMVUE may still exist (see
Rohatgi, page 357–358, Example 10).
Example 8.4.13:
Let X_1, ..., X_n be iid Bin(1, p). Then T = Σ_{i=1}^n X_i is a complete sufficient statistic as
seen in Examples 8.3.6 and 8.3.11.
Since E(X_1) = p, X_1 is an unbiased estimate of p. However, due to part (i) of the Note above,
since X_1 is not a function of T, X_1 is not the UMVUE.
We can use part (ii) of the Note above to construct the UMVUE. It is
   P(X_1 = x | T) = { T/n,  x = 1;   (n−T)/n,  x = 0 }
⇒ E(X_1 | T) = T/n = X̄
⇒ X̄ is the UMVUE for p
If we are interested in the UMVUE for d(p) = p(1−p) = p − p² = Var(X), we can find it in
the following way:
   E(T) = np
   E(T²) = E( Σ_{i=1}^n X_i² + Σ_{i=1}^n Σ_{j=1, j≠i}^n X_i X_j ) = np + n(n−1)p²
⇒ E( nT / (n(n−1)) ) = np/(n−1)
   E( T² / (n(n−1)) ) = p/(n−1) + p²
⇒ E( (nT − T²) / (n(n−1)) ) = np/(n−1) − p/(n−1) − p² = (n−1)p/(n−1) − p² = p − p² = d(p)
Thus, due to part (i) of the Note above, (nT − T²)/(n(n−1)) is the UMVUE for d(p) = p(1−p).
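A minimal simulation sketch (not part of the original notes) of Example 8.4.13: with T = Σ X_i, the estimate (nT − T²)/(n(n−1)) should average out to d(p) = p(1−p). The values of n, p, and the number of replications below are arbitrary illustrative choices.

# Sketch: Monte Carlo check that (nT - T^2)/(n(n-1)) is unbiased for p(1-p).
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 8, 0.35, 200000

t = rng.binomial(n, p, size=reps)              # T = sum of n Bernoulli(p) rv's
umvue = (n * t - t**2) / (n * (n - 1))

print("target d(p)  :", p * (1 - p))
print("mean of UMVUE:", umvue.mean())          # close to p(1-p)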
Lecture 12:
We 02/07/01
8.5 Lower Bounds for the Variance of an Estimate
(Based on Casella/Berger, Section 7.3)
Theorem 8.5.1: Cramér–Rao Lower Bound (CRLB)
Let Θ be an open interval of IR. Let {f_θ : θ ∈ Θ} be a family of pdf's or pmf's. Assume
that the set {x : f_θ(x) = 0} is independent of θ.
Let ψ(θ) be defined on Θ and let it be differentiable for all θ ∈ Θ. Let T be an unbiased
estimate of ψ(θ) such that E_θ(T²) < ∞ ∀ θ ∈ Θ. Suppose that
(i) ∂f_θ(x)/∂θ is defined for all θ ∈ Θ,
(ii) for a pdf f_θ
   ∂/∂θ ( ∫ f_θ(x) dx ) = ∫ ∂f_θ(x)/∂θ dx = 0 ∀ θ ∈ Θ
or for a pmf f_θ
   ∂/∂θ ( Σ_x f_θ(x) ) = Σ_x ∂f_θ(x)/∂θ = 0 ∀ θ ∈ Θ,
(iii) for a pdf f_θ
   ∂/∂θ ( ∫ T(x) f_θ(x) dx ) = ∫ T(x) ∂f_θ(x)/∂θ dx ∀ θ ∈ Θ
or for a pmf f_θ
   ∂/∂θ ( Σ_x T(x) f_θ(x) ) = Σ_x T(x) ∂f_θ(x)/∂θ ∀ θ ∈ Θ.
Let χ : Θ → IR be any measurable function. Then it holds
   (ψ′(θ))² ≤ E_θ((T(X) − χ(θ))²) E_θ( (∂ log f_θ(X)/∂θ)² ) ∀ θ ∈ Θ            (A).
Further, for any θ_0 ∈ Θ, either ψ′(θ_0) = 0 and equality holds in (A) for θ = θ_0, or we have
   E_{θ_0}((T(X) − χ(θ_0))²) ≥ (ψ′(θ_0))² / E_{θ_0}( (∂ log f_θ(X)/∂θ)² )       (B).
Finally, if equality holds in (B), then there exists a real number K(θ_0) ≠ 0 such that
   T(X) − χ(θ_0) = K(θ_0) ∂ log f_θ(X)/∂θ |_{θ=θ_0}                             (C)
with probability 1, provided that T is not a constant.
Note:
(i) Conditions (i), (ii), and (iii) are calledregularity conditions. Conditions under which
they hold can be found in Rohatgi, page 11–13, Parts 12 and 13.
(ii) The right hand side of inequality (B) is calledCram´er–Rao Lower Boundofθ0, or, in
symbolsCRLB(θ0).
Proof:
From (ii), we get
   E_θ( ∂/∂θ log f_θ(X) ) = ∫ ( ∂/∂θ log f_θ(x) ) f_θ(x) dx
                          = ∫ ( ∂/∂θ f_θ(x) ) (1/f_θ(x)) f_θ(x) dx
                          = ∫ ∂/∂θ f_θ(x) dx
                          = 0
⇒ E_θ( χ(θ) ∂/∂θ log f_θ(X) ) = 0
From (iii), we get
   E_θ( T(X) ∂/∂θ log f_θ(X) ) = ∫ ( T(x) ∂/∂θ log f_θ(x) ) f_θ(x) dx
                               = ∫ ( T(x) ∂/∂θ f_θ(x) ) (1/f_θ(x)) f_θ(x) dx
                               = ∫ T(x) ∂/∂θ f_θ(x) dx
                          (iii)= ∂/∂θ ( ∫ T(x) f_θ(x) dx )
                               = ∂/∂θ E(T(X))
                               = ∂/∂θ ψ(θ)
                               = ψ′(θ)
⇒ E_θ( (T(X) − χ(θ)) ∂/∂θ log f_θ(X) ) = ψ′(θ)
⇒ (ψ′(θ))² = ( E_θ( (T(X) − χ(θ)) ∂/∂θ log f_θ(X) ) )²
         (*)≤ E_θ( (T(X) − χ(θ))² ) E_θ( (∂/∂θ log f_θ(X))² ),
i.e., (A) holds. (*) follows from the Cauchy–Schwarz–Inequality (Theorem 4.5.7 (ii)).
If ψ′(θ_0) ≠ 0, then the left–hand side of (A) is > 0. Therefore, the right–hand side is > 0.
Thus,
   E_{θ_0}( (∂/∂θ log f_θ(X))² ) > 0,
and (B) follows directly from (A).
If ψ′(θ_0) = 0, but equality does not hold in (A), then
   E_{θ_0}( (∂/∂θ log f_θ(X))² ) > 0,
and (B) follows directly from (A) again.
Finally, if equality holds in (B), then ψ′(θ_0) ≠ 0 (because T is not constant). Thus,
MSE(χ(θ_0), T(X)) > 0. The Cauchy–Schwarz–Inequality (Theorem 4.5.7 (iii)) gives equality
iff there exist constants (α, β) ∈ IR² − {(0,0)} such that
   P( α (T(X) − χ(θ_0)) + β ∂/∂θ log f_θ(X)|_{θ=θ_0} = 0 ) = 1.
This implies K(θ_0) = −β/α and (C) holds. Since T is not a constant, it also holds that
K(θ_0) ≠ 0.
Example 8.5.2:
If we take χ(θ) = ψ(θ), we get from (B)
   Var_θ(T(X)) ≥ (ψ′(θ))² / E_θ( (∂ log f_θ(X)/∂θ)² )                          (*).
If we have ψ(θ) = θ, the inequality (*) above reduces to
   Var_θ(T(X)) ≥ ( E_θ( (∂ log f_θ(X)/∂θ)² ) )^{−1}.
Finally, if X = (X_1, ..., X_n) iid with identical f_θ(x), the inequality (*) reduces to
   Var_θ(T(X)) ≥ (ψ′(θ))² / ( n E_θ( (∂ log f_θ(X_1)/∂θ)² ) ).
Example 8.5.3:
Let X_1, ..., X_n be iid Bin(1, p). Let X ∼ Bin(n, p), p ∈ Θ = (0,1) ⊂ IR. Let
   ψ(p) = E(T(X)) = Σ_{x=0}^n T(x) (n choose x) p^x (1−p)^{n−x}.
ψ(p) is differentiable with respect to p under the summation sign since it is a finite polynomial
in p.
Since X = Σ_{i=1}^n X_i with f_p(x_1) = p^{x_1} (1−p)^{1−x_1}, x_1 ∈ {0,1}
⇒ log f_p(x_1) = x_1 log p + (1 − x_1) log(1−p)
⇒ ∂/∂p log f_p(x_1) = x_1/p − (1 − x_1)/(1−p) = ( x_1(1−p) − p(1 − x_1) ) / ( p(1−p) ) = (x_1 − p)/( p(1−p) )
⇒ E_p( (∂/∂p log f_p(X_1))² ) = Var(X_1) / ( p²(1−p)² ) = 1/( p(1−p) )
So, if ψ(p) = χ(p) = p and if T is unbiased for p, then
   Var_p(T(X)) ≥ 1 / ( n · 1/(p(1−p)) ) = p(1−p)/n.
Since Var(X̄) = p(1−p)/n, X̄ attains the CRLB. Therefore, X̄ is the UMVUE.
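A small simulation sketch (not part of the original notes) of Example 8.5.3: the empirical variance of X̄ for Bernoulli(p) data should match the CRLB p(1−p)/n. The values of n, p, and the number of replications below are arbitrary illustrative choices.

# Sketch: empirical Var(Xbar) versus the Cramer-Rao lower bound p(1-p)/n.
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 25, 0.4, 200000

xbar = rng.binomial(n, p, size=reps) / n       # Xbar = (sum of Bernoullis)/n
print("CRLB p(1-p)/n      :", p * (1 - p) / n)
print("empirical Var(Xbar):", xbar.var())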
Lecture 13:
Fr 02/09/01
Example 8.5.4:
Let X ∼ U(0, θ), θ ∈ Θ = (0, ∞) ⊂ IR.
   f_θ(x) = (1/θ) I_(0,θ)(x)
⇒ log f_θ(x) = −log θ
⇒ (∂/∂θ log f_θ(x))² = 1/θ²
⇒ E_θ( (∂/∂θ log f_θ(X))² ) = 1/θ²
Thus, the CRLB is θ²/n.
We know that ((n+1)/n) X_(n) is the UMVUE since it is a function of a complete sufficient
statistic X_(n) (see Homework) and E(X_(n)) = (n/(n+1)) θ. It is
   Var( ((n+1)/n) X_(n) ) = θ² / ( n(n+2) ) < θ²/n ???
How is this possible? Quite simple, since one of the required conditions for Theorem 8.5.1
does not hold. The support of X depends on θ.
Theorem 8.5.5: Chapman, Robbins, Kiefer Inequality (CRK Inequality)
Let Θ ⊆ IR. Let {f_θ : θ ∈ Θ} be a family of pdf's or pmf's. Let ψ(θ) be defined on Θ. Let
T be an unbiased estimate of ψ(θ) such that E_θ(T²) < ∞ ∀ θ ∈ Θ.
If θ ≠ ϑ, θ and ϑ ∈ Θ, assume that f_θ(x) and f_ϑ(x) are different. Also assume that there
exists such a ϑ ∈ Θ such that θ ≠ ϑ and
   S(θ) = {x : f_θ(x) > 0} ⊃ S(ϑ) = {x : f_ϑ(x) > 0}.
Then it holds that
   Var_θ(T(X)) ≥ sup_{ϑ : S(ϑ) ⊂ S(θ), ϑ ≠ θ} (ψ(ϑ) − ψ(θ))² / Var_θ( f_ϑ(X)/f_θ(X) ) ∀ θ ∈ Θ.

Proof:
Since T is unbiased, it follows
   E_ϑ(T(X)) = ψ(ϑ) ∀ ϑ ∈ Θ.
For ϑ ≠ θ and S(ϑ) ⊂ S(θ), it follows
   ∫_{S(θ)} T(x) ( f_ϑ(x) − f_θ(x) )/f_θ(x) · f_θ(x) dx = E_ϑ(T(X)) − E_θ(T(X)) = ψ(ϑ) − ψ(θ)
and
   0 = ∫_{S(θ)} ( f_ϑ(x) − f_θ(x) )/f_θ(x) · f_θ(x) dx = E_θ( f_ϑ(X)/f_θ(X) − 1 ).
Therefore
   Cov_θ( T(X), f_ϑ(X)/f_θ(X) − 1 ) = ψ(ϑ) − ψ(θ).
It follows by the Cauchy–Schwarz–Inequality (Theorem 4.5.7 (ii)) that
   (ψ(ϑ) − ψ(θ))² = ( Cov_θ( T(X), f_ϑ(X)/f_θ(X) − 1 ) )²
                  ≤ Var_θ(T(X)) Var_θ( f_ϑ(X)/f_θ(X) − 1 )
                  = Var_θ(T(X)) Var_θ( f_ϑ(X)/f_θ(X) ).
Thus,
   Var_θ(T(X)) ≥ (ψ(ϑ) − ψ(θ))² / Var_θ( f_ϑ(X)/f_θ(X) ).
Finally, we take the supremum of the right–hand side with respect to {ϑ : S(ϑ) ⊂ S(θ), ϑ ≠ θ},
which completes the proof.
Note:
(i) The CRK inequality holds without the previous regularity conditions.
(ii) An alternative form of the CRK inequality is:
Let θ, θ + δ, δ ≠ 0, be distinct with S(θ + δ) ⊂ S(θ). Let ψ(θ) = θ. Define
   J = J(θ, δ) = (1/δ²) ( ( f_{θ+δ}(X)/f_θ(X) )² − 1 ).
Then the CRK inequality reads as
   Var_θ(T(X)) ≥ 1 / inf_δ E_θ(J)
with the infimum taken over δ ≠ 0 such that S(θ + δ) ⊂ S(θ).
(iii) The CRK inequality works for discrete Θ, the CRLB does not work in such cases.
Example 8.5.6:
Let X ∼ U(0, θ), θ > 0. The required conditions for the CRLB are not met. Recall from
Example 8.5.4 that ((n+1)/n) X_(n) is UMVUE with Var( ((n+1)/n) X_(n) ) = θ²/(n(n+2)) < θ²/n = CRLB.
Let ψ(θ) = θ. If ϑ < θ, then S(ϑ) ⊂ S(θ). It is
   E_θ( ( f_ϑ(X)/f_θ(X) )² ) = ∫_0^ϑ (θ/ϑ)² (1/θ) dx = θ/ϑ
   E_θ( f_ϑ(X)/f_θ(X) ) = ∫_0^ϑ (θ/ϑ) (1/θ) dx = 1
⇒ Var_θ(T(X)) ≥ sup_{ϑ : 0 < ϑ < θ} (ϑ − θ)² / (θ/ϑ − 1) = sup_{ϑ : 0 < ϑ < θ} ( ϑ(θ − ϑ) ) (*)= θ²/4
See Homework for a proof of (*).
Since X is complete and sufficient and 2X is unbiased for θ, T(X) = 2X is the UMVUE.
It is
   Var_θ(2X) = 4 Var_θ(X) = 4 θ²/12 = θ²/3 > θ²/4.
Since the CRK lower bound is not achieved by the UMVUE, it is not achieved by any unbiased
estimate of θ.
Definition 8.5.7:
Let T_1, T_2 be unbiased estimates of θ with E_θ(T_1²) < ∞ and E_θ(T_2²) < ∞ ∀ θ ∈ Θ. We define
the efficiency of T_1 relative to T_2 by
   eff_θ(T_1, T_2) = Var_θ(T_1) / Var_θ(T_2)
and say that T_1 is more efficient than T_2 if eff_θ(T_1, T_2) < 1.

Definition 8.5.8:
Assume the regularity conditions of Theorem 8.5.1 are satisfied by a family of cdf's
{F_θ : θ ∈ Θ}, Θ ⊆ IR. An unbiased estimate T for θ is most efficient for the family {F_θ} if
   Var_θ(T) = ( E_θ( (∂ log f_θ(X)/∂θ)² ) )^{−1}.

Definition 8.5.9:
Let T be the most efficient estimate for the family of cdf's {F_θ : θ ∈ Θ}, Θ ⊆ IR. Then the
efficiency of any unbiased T_1 of θ is defined as
   eff_θ(T_1) = eff_θ(T_1, T) = Var_θ(T_1) / Var_θ(T).

Definition 8.5.10:
T_1 is asymptotically (most) efficient if T_1 is asymptotically unbiased, i.e., lim_{n→∞} E_θ(T_1) = θ,
and lim_{n→∞} eff_θ(T_1) = 1, where n is the sample size.
Lecture 14:
Mo 02/12/01
Theorem 8.5.11:
A necessary and sufficient condition for an estimate T of θ to be most efficient is that T is
sufficient and
   (1/K(θ)) (T(x) − θ) = ∂ log f_θ(x)/∂θ ∀ θ ∈ Θ                               (*),
where K(θ) is defined as in Theorem 8.5.1 and the regularity conditions for Theorem 8.5.1 hold.

Proof:
"⇒":
Theorem 8.5.1 says that if T is most efficient, then (*) holds.
Assume that Θ = IR. We define
   C(θ_0) = ∫_{−∞}^{θ_0} 1/K(θ) dθ,   ψ(θ_0) = ∫_{−∞}^{θ_0} θ/K(θ) dθ,   and λ(x) = lim_{θ→−∞} log f_θ(x) − c(x).
Integrating (*) with respect to θ gives
   ∫_{−∞}^{θ_0} (1/K(θ)) T(x) dθ − ∫_{−∞}^{θ_0} θ/K(θ) dθ = ∫_{−∞}^{θ_0} ∂ log f_θ(x)/∂θ dθ
⇒ T(x) C(θ_0) − ψ(θ_0) = log f_θ(x) |_{−∞}^{θ_0} + c(x)
⇒ T(x) C(θ_0) − ψ(θ_0) = log f_{θ_0}(x) − lim_{θ→−∞} log f_θ(x) + c(x)
⇒ T(x) C(θ_0) − ψ(θ_0) = log f_{θ_0}(x) − λ(x)
Therefore,
   f_{θ_0}(x) = exp( T(x) C(θ_0) − ψ(θ_0) + λ(x) )
which belongs to an exponential family. Thus, T is sufficient.
"⇐":
From (*), we get
   E_θ( (∂ log f_θ(X)/∂θ)² ) = (1/(K(θ))²) Var_θ(T(X)).
Additionally, it holds
   E_θ( (T(X) − θ) ∂ log f_θ(X)/∂θ ) = 1
as shown in the Proof of Theorem 8.5.1.
Using (*) in the line above, we get
   K(θ) E_θ( (∂ log f_θ(X)/∂θ)² ) = 1,
i.e.,
   K(θ) = ( E_θ( (∂ log f_θ(X)/∂θ)² ) )^{−1}.
Therefore,
   Var_θ(T(X)) = ( E_θ( (∂ log f_θ(X)/∂θ)² ) )^{−1},
i.e., T is most efficient for θ.
Note:
Instead of saying “a necessary and sufficient condition for an estimateTofθto be most
efficient ...” in the previous Theorem, we could say that “an estimateTofθis most efficient
iff ...”, i.e., “necessary and sufficient” means the same as “iff”.
8.6 The Method of Moments
(Based on Casella/Berger, Section 7.2.1)
Definition 8.6.1:
Let X_1, ..., X_n be iid with pdf (or pmf) f_θ, θ ∈ Θ. We assume that the first k moments
m_1, ..., m_k of f_θ exist. If θ can be written as
   θ = h(m_1, ..., m_k),
where h : IR^k → IR is a Borel–measurable function, the method of moments estimate (mom) of θ is
   θ̂_mom = T(X_1, ..., X_n) = h( (1/n) Σ_{i=1}^n X_i, (1/n) Σ_{i=1}^n X_i², ..., (1/n) Σ_{i=1}^n X_i^k ).

Note:
(i) The Definition above can also be used to estimate joint moments. For example, we use
(1/n) Σ_{i=1}^n X_i Y_i to estimate E(XY).
(ii) Since E( (1/n) Σ_{i=1}^n X_i^j ) = m_j, method of moment estimates are unbiased for the
population moments. The WLLN and the CLT say that these estimates are consistent and
asymptotically Normal as well.
(iii) If θ is not a linear function of the population moments, θ̂_mom will, in general, not be
unbiased. However, it will be consistent and (usually) asymptotically Normal.
(iv) Method of moments estimates do not exist if the related moments do not exist.
(v) Method of moments estimates may not be unique. If there exist multiple choices for the
mom, one usually takes the estimate involving the lowest–order sample moment.
(vi) Alternative method of moment estimates can be obtained from central moments (rather
than from raw moments) or by using moments other than the first k moments.
Example 8.6.2:
Let X_1, ..., X_n be iid N(μ, σ²).
Since μ = m_1, it is μ̂_mom = X̄.
This is an unbiased, consistent and asymptotically Normal estimate.
Since σ = √(m_2 − m_1²), it is σ̂_mom = √( (1/n) Σ_{i=1}^n X_i² − X̄² ).
This is a consistent, asymptotically Normal estimate. However, it is not unbiased.
Example 8.6.3:
Let X_1, ..., X_n be iid Poisson(λ).
We know that E(X_1) = Var(X_1) = λ.
Thus, X̄ and (1/n) Σ_{i=1}^n (X_i − X̄)² are possible choices for the mom of λ. Due to part (v)
of the Note above, one uses λ̂_mom = X̄.
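A minimal sketch (not part of the original notes) of the method of moments estimates from Examples 8.6.2 and 8.6.3; the parameter values and sample sizes below are arbitrary illustrative choices.

# Sketch: method of moments estimates for Normal and Poisson samples.
import numpy as np

rng = np.random.default_rng(4)

x = rng.normal(5.0, 2.0, size=1000)            # N(mu, sigma^2) sample
mu_mom = x.mean()                              # mu_hat = first sample moment
sigma_mom = np.sqrt((x**2).mean() - x.mean()**2)
print("mu_mom, sigma_mom:", mu_mom, sigma_mom)

y = rng.poisson(3.5, size=1000)                # Poisson(lambda) sample
print("lambda_mom:", y.mean())                 # lowest-order moment choice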
8.7 Maximum Likelihood Estimation
(Based on Casella/Berger, Section 7.2.2)
Definition 8.7.1:
Let (X_1, ..., X_n) be an n–rv with pdf (or pmf) f_θ(x_1, ..., x_n), θ ∈ Θ. We call the function
   L(θ; x_1, ..., x_n) = f_θ(x_1, ..., x_n)
of θ the likelihood function.

Note:
(i) Often θ is a vector of parameters.
(ii) If (X_1, ..., X_n) are iid with pdf (or pmf) f_θ(x), then L(θ; x_1, ..., x_n) = Π_{i=1}^n f_θ(x_i).

Definition 8.7.2:
A maximum likelihood estimate (MLE) is a non–constant estimate θ̂_ML such that
   L(θ̂_ML; x_1, ..., x_n) = sup_{θ∈Θ} L(θ; x_1, ..., x_n).

Note:
It is often convenient to work with log L when determining the maximum likelihood estimate.
Since the log is monotone, the maximum is the same.
Example 8.7.3:
Let X_1, ..., X_n be iid N(μ, σ²), where μ and σ² are unknown.
   L(μ, σ²; x_1, ..., x_n) = (1/(σ^n (2π)^{n/2})) exp( −Σ_{i=1}^n (x_i − μ)² / (2σ²) )
⇒ log L(μ, σ²; x_1, ..., x_n) = −(n/2) log σ² − (n/2) log(2π) − Σ_{i=1}^n (x_i − μ)² / (2σ²)
The MLE must satisfy
   ∂ log L/∂μ = (1/σ²) Σ_{i=1}^n (x_i − μ) = 0                                 (A)
   ∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − μ)² = 0                 (B)
These are the two likelihood equations. From equation (A) we get μ̂_ML = X̄. Substituting
this for μ into equation (B) and solving for σ², we get σ̂²_ML = (1/n) Σ_{i=1}^n (X_i − X̄)².
Note that σ̂²_ML is biased for σ².
Formally, we still have to verify that we found a maximum (and not a minimum) and that the
likelihood function does not take its absolute maximum at a parameter θ at the edge of the
parameter space Θ, which would not be detectable by our approach for local extrema.
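A minimal sketch (not part of the original notes) of the closed-form MLEs in Example 8.7.3, computed for a simulated N(μ, σ²) sample; the parameter values and sample size below are arbitrary illustrative choices.

# Sketch: MLEs mu_hat = xbar and sigma2_hat = (1/n) * sum((x_i - xbar)^2).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(1.5, 3.0, size=500)

mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()          # divides by n, hence biased for sigma^2

print("mu_ML    :", mu_ml)
print("sigma2_ML:", sigma2_ml)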
Lecture 15:
We 02/14/01
Example 8.7.4:
Let X_1, ..., X_n be iid U(θ − 1/2, θ + 1/2).
   L(θ; x_1, ..., x_n) = { 1,  if θ − 1/2 ≤ x_i ≤ θ + 1/2 ∀ i = 1, ..., n;   0, otherwise }
Therefore, any θ̂(X) such that max(X) − 1/2 ≤ θ̂(X) ≤ min(X) + 1/2 is an MLE. Obviously, the
MLE is not unique.
Example 8.7.5:
Let X ∼ Bin(1, p), p ∈ [1/4, 3/4].
   L(p; x) = p^x (1−p)^{1−x} = { p,  if x = 1;   1−p,  if x = 0 }
This is maximized by
   p̂ = { 3/4,  if x = 1;   1/4,  if x = 0 } = (2x + 1)/4
It is
   E_p(p̂) = (3/4) p + (1/4)(1−p) = (1/2) p + 1/4
   MSE_p(p̂) = E_p((p̂ − p)²)
            = E_p( ( (2X + 1)/4 − p )² )
            = (1/16) E_p( (2X + 1 − 4p)² )
            = (1/16) E_p( 4X² + 2·2X − 2·8pX − 2·4p + 1 + 16p² )
            = (1/16) ( 4(p(1−p) + p²) + 4p − 16p² − 8p + 1 + 16p² )
            = 1/16
So p̂ is biased with MSE_p(p̂) = 1/16. If we compare this with p̃ = 1/2 regardless of the data,
we have
   MSE_p(1/2) = E_p((1/2 − p)²) = (1/2 − p)² ≤ 1/16 ∀ p ∈ [1/4, 3/4].
Thus, in this example the MLE is worse than the trivial estimate when comparing their MSE's.
Theorem 8.7.6:
LetTbe a sufficient statistic forfθ(x), θ∈Θ. If a unique MLE ofθexists, it is a function
ofT.
Proof:
Since T is sufficient, we can write
   f_θ(x) = h(x) g_θ(T(x))
due to the Factorization Criterion (Theorem 8.3.5). Maximizing the likelihood function with
respect to θ takes h(x) as a constant and therefore is equivalent to maximizing g_θ(T(x)) with
respect to θ. But g_θ(T(x)) involves x only through T.
Note:
(i) MLE’s may not be unique (however they frequently are).
(ii) MLE’s are not necessarily unbiased.
(iii) MLE’s may not exist.
(iv) If a unique MLE exists, it is a function of a sufficient statistic.
(v) Often (but not always), the MLE will be a sufficient statistic itself.
Theorem 8.7.7:
Suppose the regularity conditions of Theorem 8.5.1 hold and θ belongs to an open interval in
IR. If an estimate θ̂ of θ attains the CRLB, it is the unique MLE.

Proof:
If θ̂ attains the CRLB, it follows by Theorem 8.5.1 that
   ∂ log f_θ(X)/∂θ = (1/K(θ)) (θ̂(X) − θ) w.p. 1.
Thus, θ̂ satisfies the likelihood equations.
We define A(θ) = 1/K(θ). Then it follows
   ∂² log f_θ(X)/∂θ² = A′(θ) (θ̂(X) − θ) − A(θ).
The Proof of Theorem 8.5.11 gives us
   A(θ) = E_θ( (∂ log f_θ(X)/∂θ)² ) > 0.
So
   ∂² log f_θ(X)/∂θ² |_{θ=θ̂} = −A(θ̂) < 0,
i.e., log f_θ(X) has a maximum in θ̂. Thus, θ̂ is the MLE.
Note:
The previous Theorem does not imply that every MLE is most efficient.
Theorem 8.7.8:
Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's) with Θ ⊆ IR^k, k ≥ 1. Let h : Θ → ∆ be a
mapping of Θ onto ∆ ⊆ IR^p, 1 ≤ p ≤ k. If θ̂ is an MLE of θ, then h(θ̂) is an MLE of h(θ).

Proof:
For each δ ∈ ∆, we define
   Θ_δ = {θ : θ ∈ Θ, h(θ) = δ}
and
   M(δ; x) = sup_{θ∈Θ_δ} L(θ; x),
the likelihood function induced by h.
Let θ̂ be an MLE and a member of Θ_δ̂, where δ̂ = h(θ̂). It holds
   M(δ̂; x) = sup_{θ∈Θ_δ̂} L(θ; x) ≥ L(θ̂; x),
but also
   M(δ̂; x) ≤ sup_{δ∈∆} M(δ; x) = sup_{δ∈∆} ( sup_{θ∈Θ_δ} L(θ; x) ) = sup_{θ∈Θ} L(θ; x) = L(θ̂; x).
Therefore,
   M(δ̂; x) = L(θ̂; x) = sup_{δ∈∆} M(δ; x).
Thus, δ̂ = h(θ̂) is an MLE.
Example 8.7.9:
Let X_1, ..., X_n be iid Bin(1, p). Let h(p) = p(1−p).
Since the MLE of p is p̂ = X̄, the MLE of h(p) is h(p̂) = X̄(1 − X̄).
Theorem 8.7.10:
Consider the following conditions a pdf f_θ can fulfill:
(i) ∂ log f_θ/∂θ, ∂² log f_θ/∂θ², ∂³ log f_θ/∂θ³ exist for all θ ∈ Θ for all x. Also,
   ∫_{−∞}^∞ ∂f_θ(x)/∂θ dx = E_θ( ∂ log f_θ(X)/∂θ ) = 0 ∀ θ ∈ Θ.
(ii) ∫_{−∞}^∞ ∂²f_θ(x)/∂θ² dx = 0 ∀ θ ∈ Θ.
(iii) −∞ < ∫_{−∞}^∞ ( ∂² log f_θ(x)/∂θ² ) f_θ(x) dx < 0 ∀ θ ∈ Θ.
(iv) There exists a function H(x) such that for all θ ∈ Θ:
   | ∂³ log f_θ(x)/∂θ³ | < H(x) and ∫_{−∞}^∞ H(x) f_θ(x) dx = M(θ) < ∞.
(v) There exists a function g(θ) that is positive and twice differentiable for every θ ∈ Θ and
there exists a function H(x) such that for all θ ∈ Θ:
   | ∂²/∂θ² ( g(θ) ∂ log f_θ(x)/∂θ ) | < H(x) and ∫_{−∞}^∞ H(x) f_θ(x) dx = M(θ) < ∞.
In case that multiple of these conditions are fulfilled, we can make the following statements:
(i) (Cramér) Conditions (i), (iii), and (iv) imply that, with probability approaching 1, as
n → ∞, the likelihood equation has a consistent solution.
(ii) (Cramér) Conditions (i), (ii), (iii), and (iv) imply that a consistent solution θ̂_n of the
likelihood equation is asymptotically Normal, i.e.,
   (√n/σ) (θ̂_n − θ) →_d Z
where Z ∼ N(0,1) and σ² = ( E_θ( (∂ log f_θ(X)/∂θ)² ) )^{−1}.
(iii) (Kulldorf) Conditions (i), (iii), and (v) imply that, with probability approaching 1, as
n → ∞, the likelihood equation has a consistent solution.
(iv) (Kulldorf) Conditions (i), (ii), (iii), and (v) imply that a consistent solution θ̂_n of the
likelihood equation is asymptotically Normal.
Note:
In case of a pmffθ, we can define similar conditions as in Theorem 8.7.10.
Lecture 16:
Fr 02/16/01
8.8 Decision Theory — Bayes and Minimax Estimation
(Based on Casella/Berger, Section 7.2.3 & 7.3.4)
Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's). Let X_1, ..., X_n be a sample from f_θ. Let
A be the set of possible actions (or decisions) that are open to the statistician in a given
situation, e.g.,
   A = {reject H_0, do not reject H_0} (Hypothesis testing, see Chapter 9)
   A = artefact found is of {Greek, Roman} origin (Classification)
   A = Θ (Estimation)
Definition 8.8.1:
Adecision functiondis a statistic, i.e., a Borel–measurable function, that mapsIR
n
into
A. IfX=xis observed, the statistician takes actiond(x)∈A.
Note:
For the remainder of this Section, we are restricting ourselves toA= Θ, i.e., we are facing
the problem of estimation.
Definition 8.8.2:
A non–negative functionLthat maps Θ×AintoIRis called aloss function. The value
L(θ, a) is the loss incurred to the statistician if he/she takes actionawhenθis the true pa-
rameter value.
Definition 8.8.3:
LetDbe a class of decision functions that mapIR
n
intoA. LetLbe a loss function on Θ×A.
The functionRthat maps Θ×DintoIRis defined as
R(θ, d) =Eθ(L(θ, d(X)))
and is called therisk functionofdatθ.
Example 8.8.4:
Let A = Θ ⊆ IR. Let L(θ, a) = (θ − a)². Then it holds that
   R(θ, d) = E_θ(L(θ, d(X))) = E_θ((θ − d(X))²) = E_θ((θ − θ̂)²).
Note that this is just the MSE. If θ̂ is unbiased, this would just be Var(θ̂).
Note:
The basic problem of decision theory is that we would like to find a decision functiond∈D
such thatR(θ, d) is minimized for allθ∈Θ. Unfortunately, this is usually not possible.
Definition 8.8.5:
The minimax principle is to choose the decision function d* ∈ D such that
   max_{θ∈Θ} R(θ, d*) ≤ max_{θ∈Θ} R(θ, d) ∀ d ∈ D.

Note:
If the problem of interest is an estimation problem, we call a d* that satisfies the condition
in Definition 8.8.5 a minimax estimate of θ.
Example 8.8.6:
Let X ∼ Bin(1, p), p ∈ Θ = {1/4, 3/4} = A.
We consider the following loss function:

   p     a     L(p, a)
   1/4   1/4   0
   1/4   3/4   2
   3/4   1/4   5
   3/4   3/4   0

The set of decision functions consists of the following four functions:
   d_1(0) = 1/4, d_1(1) = 1/4
   d_2(0) = 1/4, d_2(1) = 3/4
   d_3(0) = 3/4, d_3(1) = 1/4
   d_4(0) = 3/4, d_4(1) = 3/4
First, we evaluate the loss function for these four decision functions:
   L(1/4, d_1(0)) = L(1/4, 1/4) = 0      L(1/4, d_1(1)) = L(1/4, 1/4) = 0
   L(3/4, d_1(0)) = L(3/4, 1/4) = 5      L(3/4, d_1(1)) = L(3/4, 1/4) = 5
   L(1/4, d_2(0)) = L(1/4, 1/4) = 0      L(1/4, d_2(1)) = L(1/4, 3/4) = 2
   L(3/4, d_2(0)) = L(3/4, 1/4) = 5      L(3/4, d_2(1)) = L(3/4, 3/4) = 0
   L(1/4, d_3(0)) = L(1/4, 3/4) = 2      L(1/4, d_3(1)) = L(1/4, 1/4) = 0
   L(3/4, d_3(0)) = L(3/4, 3/4) = 0      L(3/4, d_3(1)) = L(3/4, 1/4) = 5
   L(1/4, d_4(0)) = L(1/4, 3/4) = 2      L(1/4, d_4(1)) = L(1/4, 3/4) = 2
   L(3/4, d_4(0)) = L(3/4, 3/4) = 0      L(3/4, d_4(1)) = L(3/4, 3/4) = 0
Then, the risk function
   R(p, d_i(X)) = E_p(L(p, d_i(X))) = L(p, d_i(0)) · P_p(X = 0) + L(p, d_i(1)) · P_p(X = 1)
takes the following values:

   i   R(1/4, d_i)                   R(3/4, d_i)                  max_{p∈{1/4,3/4}} R(p, d_i)
   1   0                             5                            5
   2   (3/4)·0 + (1/4)·2 = 1/2       (1/4)·5 + (3/4)·0 = 5/4      5/4
   3   (3/4)·2 + (1/4)·0 = 3/2       (1/4)·0 + (3/4)·5 = 15/4     15/4
   4   2                             0                            2

Hence,
   min_{i∈{1,2,3,4}} max_{p∈{1/4,3/4}} R(p, d_i) = 5/4.
Thus, d_2 is the minimax estimate.
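A short sketch (not part of the original notes) that reproduces the risk table of Example 8.8.6 directly from the loss function and the Bin(1, p) probabilities.

# Sketch: risk table and minimax rule for Example 8.8.6.
loss = {(0.25, 0.25): 0, (0.25, 0.75): 2, (0.75, 0.25): 5, (0.75, 0.75): 0}
rules = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]   # (d(0), d(1))

for i, (d0, d1) in enumerate(rules, start=1):
    risks = []
    for p in (0.25, 0.75):
        # R(p, d) = L(p, d(0)) P(X=0) + L(p, d(1)) P(X=1)
        risks.append(loss[(p, d0)] * (1 - p) + loss[(p, d1)] * p)
    print(f"d{i}: R(1/4)={risks[0]:.2f}  R(3/4)={risks[1]:.2f}  max={max(risks):.2f}")
# d2 attains the smallest maximum risk (5/4), matching the minimax choice above.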
Note:
Minimax estimation does not require any unusual assumptions. However, it tends to be very
conservative.
Definition 8.8.7:
Suppose we consider θ to be a rv with pdf π(θ) on Θ. We call π the a priori distribution
(or prior distribution).

Note:
f(x | θ) is the conditional density of x given a fixed θ. The joint density of x and θ is
   f(x, θ) = π(θ) f(x | θ),
the marginal density of x is
   g(x) = ∫ f(x, θ) dθ,
and the a posteriori distribution (or posterior distribution), which gives the distribution
of θ after sampling, has pdf (or pmf)
   h(θ | x) = f(x, θ) / g(x).

Definition 8.8.8:
The Bayes risk of a decision function d is defined as
   R(π, d) = E_π(R(θ, d)),
where π is the a priori distribution.

Note:
If θ is a continuous rv and X is of continuous type, then
   R(π, d) = E_π(R(θ, d))
           = ∫ R(θ, d) π(θ) dθ
           = ∫ E_θ(L(θ, d(X))) π(θ) dθ
           = ∫ ( ∫ L(θ, d(x)) f(x | θ) dx ) π(θ) dθ
           = ∫∫ L(θ, d(x)) f(x | θ) π(θ) dx dθ
           = ∫∫ L(θ, d(x)) f(x, θ) dx dθ
           = ∫ g(x) ( ∫ L(θ, d(x)) h(θ | x) dθ ) dx
Similar expressions can be written if θ and/or X are discrete.
Lecture 17:
Tu 02/20/01
Definition 8.8.9:
A decision function d* is called a Bayes rule if d* minimizes the Bayes risk, i.e., if
   R(π, d*) = inf_{d∈D} R(π, d).

Theorem 8.8.10:
Let A = Θ ⊆ IR. Let L(θ, d(x)) = (θ − d(x))². In this case, a Bayes rule is
   d(x) = E(θ | X = x).

Proof:
Minimizing
   R(π, d) = ∫ g(x) ( ∫ (θ − d(x))² h(θ | x) dθ ) dx,
where g is the marginal pdf of X and h is the conditional pdf of θ given x, is the same as
minimizing
   ∫ (θ − d(x))² h(θ | x) dθ.
However, this is minimized when d(x) = E(θ | X = x) as shown in Stat 6710, Homework 3,
Question (ii), for the unconditional case.
Note:
Under the conditions of Theorem 8.8.10,d(x) =E(θ|X=x) is called theBayes estimate.
Example 8.8.11:
Let X ∼ Bin(n, p). Let L(p, d(x)) = (p − d(x))².
Let π(p) = 1 ∀ p ∈ (0,1), i.e., π ∼ U(0,1), be the a priori distribution of p.
Then it holds:
   f(x, p) = (n choose x) p^x (1−p)^{n−x}
   g(x) = ∫ f(x, p) dp = ∫_0^1 (n choose x) p^x (1−p)^{n−x} dp
   h(p | x) = f(x, p)/g(x) = (n choose x) p^x (1−p)^{n−x} / ∫_0^1 (n choose x) p^x (1−p)^{n−x} dp
            = p^x (1−p)^{n−x} / ∫_0^1 p^x (1−p)^{n−x} dp
   E(p | x) = ∫_0^1 p h(p | x) dp
            = ∫_0^1 p^{x+1} (1−p)^{n−x} dp / ∫_0^1 p^x (1−p)^{n−x} dp
            = B(x+2, n−x+1) / B(x+1, n−x+1)
            = [ Γ(x+2) Γ(n−x+1) / Γ(x+2+n−x+1) ] / [ Γ(x+1) Γ(n−x+1) / Γ(x+1+n−x+1) ]
            = (x+1)/(n+2)
Thus, by Theorem 8.8.10, the Bayes rule is
   p̂_Bayes = d*(X) = (X+1)/(n+2).
The Bayes risk of d*(X) is
   R(π, d*(X)) = E_π(R(p, d*(X)))
      = ∫_0^1 π(p) R(p, d*(X)) dp
      = ∫_0^1 π(p) E_p(L(p, d*(X))) dp
      = ∫_0^1 π(p) E_p((p − d*(X))²) dp
      = ∫_0^1 E_p( ( (X+1)/(n+2) − p )² ) dp
      = ∫_0^1 E_p( ( (X+1)/(n+2) )² − 2p (X+1)/(n+2) + p² ) dp
      = (1/(n+2)²) ∫_0^1 E_p( (X+1)² − 2p(n+2)(X+1) + p²(n+2)² ) dp
      = (1/(n+2)²) ∫_0^1 E_p( X² + 2X + 1 − 2p(n+2)(X+1) + p²(n+2)² ) dp
      = (1/(n+2)²) ∫_0^1 ( np(1−p) + (np)² + 2np + 1 − 2p(n+2)(np+1) + p²(n+2)² ) dp
      = (1/(n+2)²) ∫_0^1 ( np − np² + n²p² + 2np + 1 − 2n²p² − 2np − 4np² − 4p + n²p² + 4np² + 4p² ) dp
      = (1/(n+2)²) ∫_0^1 ( 1 − 4p + np − np² + 4p² ) dp
      = (1/(n+2)²) ∫_0^1 ( 1 + (n−4)p + (4−n)p² ) dp
      = (1/(n+2)²) ( p + ((n−4)/2) p² + ((4−n)/3) p³ ) |_0^1
      = (1/(n+2)²) ( 1 + (n−4)/2 + (4−n)/3 )
      = (1/(n+2)²) (6 + 3n − 12 + 8 − 2n)/6
      = (1/(n+2)²) (n+2)/6
      = 1/(6(n+2))
Now we compare the Bayes rule d*(X) with the MLE p̂_ML = X/n. This estimate has Bayes risk
   R(π, X/n) = ∫_0^1 E_p((X/n − p)²) dp
             = ∫_0^1 (1/n²) E_p((X − np)²) dp
             = ∫_0^1 np(1−p)/n² dp
             = ∫_0^1 p(1−p)/n dp
             = (1/n) ( p²/2 − p³/3 ) |_0^1
             = 1/(6n)
which is, as expected, larger than the Bayes risk of d*(X).
Theorem 8.8.12:
Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's). Suppose that an estimate d* of θ is a
Bayes estimate corresponding to some prior distribution π on Θ. If the risk function R(θ, d*)
is constant on Θ, then d* is a minimax estimate of θ.
Proof:
Homework.
Definition 8.8.13:
Let F denote the class of pdf's (or pmf's) f_θ(x). A class Π of prior distributions is a
conjugate family for F if the posterior distribution is in the class Π for all f ∈ F, all priors
in Π, and all x ∈ X.
Note:
The beta family is conjugate for the binomial family. Thus, if we start with a beta prior, we
will end up with a beta posterior. (See Homework.)
9 Hypothesis Testing
9.1 Fundamental Notions
(Based on Casella/Berger, Section 8.1 & 8.3)
We assume thatX= (X1, . . . , Xn) is a random sample from a population distribution
Fθ, θ∈Θ⊆IR
k
, where the functional form ofFθis known, except for the parameterθ.
We also assume that Θ contains at least two points.
Definition 9.1.1:
Aparametric hypothesisis an assumption about the unknown parameterθ.
Thenull hypothesisis of the form
H0:θ∈Θ0⊂Θ.
Thealternative hypothesisis of the form
H1:θ∈Θ1= Θ−Θ0.
Definition 9.1.2:
If Θ0(or Θ1) contains only one point, we say thatH0and Θ0(orH1and Θ1) aresimple. In
this case, the distribution ofXis completely specified under the null (or alternative) hypoth-
esis.
If Θ0(or Θ1) contains more than one point, we say thatH0and Θ0(orH1and Θ1) are
composite.
Example 9.1.3:
Let X_1, ..., X_n be iid Bin(1, p). Examples for hypotheses are p = 1/2 (simple), p ≥ 1/2
(composite), p ≠ 1/4 (composite), etc.
Note:
The problem of testing a hypothesis can be described as follows: Given a sample point x, find
a decision rule that will lead to a decision to accept or reject the null hypothesis. This means,
we partition the space IR^n into two disjoint sets C and C^c such that, if x ∈ C, we reject
H_0 : θ ∈ Θ_0 (and we accept H_1). Otherwise, if x ∈ C^c, we accept H_0 that X ∼ F_θ, θ ∈ Θ_0.
Definition 9.1.4:
Let X ∼ F_θ, θ ∈ Θ. Let C be a subset of IR^n such that, if x ∈ C, then H_0 is rejected (with
probability 1), i.e.,
   C = {x ∈ IR^n : H_0 is rejected for this x}.
The set C is called the critical region.
Definition 9.1.5:
If we rejectH0when it is true, we call this aType I error. If we fail to rejectH0when it
is false, we call this aType II error. Usually,H0andH1are chosen such that the Type I
error is considered more serious.
Example 9.1.6:
We first consider a non–statistical example, in this case a jury trial. Our hypotheses are that
the defendant is innocent or guilty. Our possible decisions are guilty or not guilty. Since it is
considered worse to punish the innocent than to let the guilty go free, we make innocence the
null hypothesis. Thus, we have

                              Truth (unknown)
   Decision (known)       Innocent (H_0)     Guilty (H_1)
   Not Guilty (H_0)       Correct            Type II Error
   Guilty (H_1)           Type I Error       Correct

The jury tries to make a decision "beyond a reasonable doubt", i.e., it tries to make the
probability of a Type I error small.
Definition 9.1.7:
If C is the critical region, then P_θ(C), θ ∈ Θ_0, is a probability of Type I error, and
P_θ(C^c), θ ∈ Θ_1, is a probability of Type II error.
Note:
We would like both error probabilities to be 0, but this is usually not possible. We usually
settle for fixing the probability of Type I error to be small, e.g., 0.05 or 0.01, and minimizing
the Type II error.
Lecture 18:
We 02/21/01
Definition 9.1.8:
Every Borel–measurable mapping φ of IR^n → [0,1] is called a test function. φ(x) is the
probability of rejecting H_0 when x is observed.
If φ is the indicator function of a subset C ⊆ IR^n, φ is called a nonrandomized test and C
is the critical region of this test function.
Otherwise, if φ is not an indicator function of a subset C ⊆ IR^n, φ is called a randomized test.
Definition 9.1.9:
Let φ be a test function of the hypothesis H_0 : θ ∈ Θ_0 against the alternative H_1 : θ ∈ Θ_1.
We say that φ has a level of significance of α (or φ is a level–α–test or φ is of size α) if
   E_θ(φ(X)) = P_θ(reject H_0) ≤ α ∀ θ ∈ Θ_0.
In short, we say that φ is a test for the problem (α, Θ_0, Θ_1).

Definition 9.1.10:
Let φ be a test for the problem (α, Θ_0, Θ_1). For every θ ∈ Θ, we define
   β_φ(θ) = E_θ(φ(X)) = P_θ(reject H_0).
We call β_φ(θ) the power function of φ. For any θ ∈ Θ_1, β_φ(θ) is called the power of φ
against the alternative θ.
Definition 9.1.11:
Let Φ_α be the class of all tests for (α, Θ_0, Θ_1). A test φ_0 ∈ Φ_α is called a most powerful
(MP) test against an alternative θ ∈ Θ_1 if
   β_{φ_0}(θ) ≥ β_φ(θ) ∀ φ ∈ Φ_α.

Definition 9.1.12:
Let Φ_α be the class of all tests for (α, Θ_0, Θ_1). A test φ_0 ∈ Φ_α is called a uniformly most
powerful (UMP) test if
   β_{φ_0}(θ) ≥ β_φ(θ) ∀ φ ∈ Φ_α ∀ θ ∈ Θ_1.
Example 9.1.13:
Let X_1, ..., X_n be iid N(μ, 1), μ ∈ Θ = {μ_0, μ_1}, μ_0 < μ_1.
Let H_0 : X_i ∼ N(μ_0, 1) vs. H_1 : X_i ∼ N(μ_1, 1).
Intuitively, reject H_0 when X̄ is too large, i.e., if X̄ ≥ k for some k.
Under H_0 it holds that X̄ ∼ N(μ_0, 1/n).
For a given α, we can solve the following equation for k:
   P_{μ_0}(X̄ > k) = P( (X̄ − μ_0)/(1/√n) > (k − μ_0)/(1/√n) ) = P(Z > z_α) = α
Here, (X̄ − μ_0)/(1/√n) = Z ∼ N(0,1) and z_α is defined in such a way that P(Z > z_α) = α,
i.e., z_α is the upper α–quantile of the N(0,1) distribution. It follows that
(k − μ_0)/(1/√n) = z_α and therefore, k = μ_0 + z_α/√n.
Thus, we obtain the nonrandomized test
   φ(x) = { 1,  if x̄ > μ_0 + z_α/√n;   0, otherwise }
φ has power
   β_φ(μ_1) = P_{μ_1}( X̄ > μ_0 + z_α/√n )
            = P( (X̄ − μ_1)/(1/√n) > (μ_0 − μ_1)√n + z_α )
            = P( Z > z_α − √n(μ_1 − μ_0) ),  where √n(μ_1 − μ_0) > 0,
            > α
The probability of a Type II error is
   P(Type II error) = 1 − β_φ(μ_1).
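A small sketch (not part of the original notes) of Example 9.1.13, computing the cutoff, size, and power numerically; the values of μ_0, μ_1, n, and α below are arbitrary illustrative choices.

# Sketch: size and power of the test that rejects when xbar > mu0 + z_alpha/sqrt(n).
from scipy.stats import norm
import numpy as np

mu0, mu1, n, alpha = 0.0, 0.5, 25, 0.05    # assumed illustrative values, mu0 < mu1
z_alpha = norm.ppf(1 - alpha)              # upper alpha-quantile of N(0,1)
k = mu0 + z_alpha / np.sqrt(n)             # rejection cutoff for xbar

size = 1 - norm.cdf((k - mu0) * np.sqrt(n))                 # equals alpha
power = 1 - norm.cdf(z_alpha - np.sqrt(n) * (mu1 - mu0))    # beta_phi(mu1) > alpha

print("cutoff k:", k, " size:", size, " power:", power)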
Example 9.1.14:
Let X ∼ Bin(6, p), p ∈ Θ = (0,1).
H_0 : p = 1/2, H_1 : p ≠ 1/2.
Desired level of significance: α = 0.05.
Reasonable plan: Since E_{p=1/2}(X) = 3, reject H_0 when |X − 3| ≥ c for some constant c. But
how should we select c?

   x       c = |x − 3|   P_{p=1/2}(X = x)   P_{p=1/2}(|X − 3| ≥ c)
   0, 6        3             0.015625            0.03125
   1, 5        2             0.093750            0.21875
   2, 4        1             0.234375            0.68750
   3           0             0.312500            1.00000

Thus, there is no nonrandomized test with α = 0.05.
What can we do instead? — Three possibilities:
(i) Reject if |X − 3| = 3, i.e., use a nonrandomized test of size α = 0.03125.
(ii) Reject if |X − 3| ≥ 2, i.e., use a nonrandomized test of size α = 0.21875.
(iii) Reject if |X − 3| = 3, do not reject if |X − 3| ≤ 1, and reject with probability
(0.05 − 0.03125)/(2 · 0.093750) = 0.1 if |X − 3| = 2. Thus, we obtain the randomized test
   φ(x) = { 1,  if x = 0, 6;   0.1,  if x = 1, 5;   0,  if x = 2, 3, 4 }
This test has size
   α = E_{p=1/2}(φ(X)) = 1 · 0.015625 · 2 + 0.1 · 0.093750 · 2 + 0 · (0.234375 · 2 + 0.3125) = 0.05
as intended. The power of φ can be calculated for any p ≠ 1/2 and it is
   β_φ(p) = P_p(X = 0 or X = 6) + 0.1 · P_p(X = 1 or X = 5)
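A minimal sketch (not part of the original notes) that evaluates the size and the power function of the randomized test in Example 9.1.14; the alternative value p = 0.8 below is an arbitrary illustrative choice.

# Sketch: size and power of the randomized test for X ~ Bin(6, p).
from scipy.stats import binom

def power(p, n=6):
    # E_p(phi(X)) = P(X in {0,6}) + 0.1 * P(X in {1,5})
    return (binom.pmf([0, 6], n, p).sum()
            + 0.1 * binom.pmf([1, 5], n, p).sum())

print("size at p = 1/2 :", power(0.5))      # 0.05 by construction
print("power at p = 0.8:", power(0.8))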
Lecture 19:
Fr 02/23/01
9.2 The Neyman–Pearson Lemma
(Based on Casella/Berger, Section 8.3.2)
Let{fθ:θ∈Θ ={θ0, θ1}}be a family of possible distributions ofX.fθrepresents the pdf
(or pmf) ofX. For convenience, we writef0(x) =fθ0
(x) andf1(x) =fθ1
(x).
Theorem 9.2.1: Neyman–Pearson Lemma (NP Lemma)
Suppose we wish to test H_0 : X ∼ f_0(x) vs. H_1 : X ∼ f_1(x), where f_i is the pdf (or pmf) of
X under H_i, i = 0, 1, where both, H_0 and H_1, are simple.
(i) Any test of the form
   φ(x) = { 1,  if f_1(x) > k f_0(x);   γ(x),  if f_1(x) = k f_0(x);   0,  if f_1(x) < k f_0(x) }     (*)
for some k ≥ 0 and 0 ≤ γ(x) ≤ 1, is most powerful of its significance level for testing
H_0 vs. H_1.
If k = ∞, the test
   φ(x) = { 1,  if f_0(x) = 0;   0,  if f_0(x) > 0 }     (**)
is most powerful of size (or significance level) 0 for testing H_0 vs. H_1.
(ii) Given 0 ≤ α ≤ 1, there exists a test of the form (*) or (**) with γ(x) = γ (i.e., a
constant) such that
   E_{θ_0}(φ(X)) = α.

Proof:
We prove the continuous case only.
(i):
Let φ be a test satisfying (*). Let φ* be any other test with size E_{θ_0}(φ*(X)) ≤ E_{θ_0}(φ(X)).
It holds that
   ∫ (φ(x) − φ*(x))(f_1(x) − k f_0(x)) dx = ∫_{f_1 > k f_0} (φ(x) − φ*(x))(f_1(x) − k f_0(x)) dx
                                          + ∫_{f_1 < k f_0} (φ(x) − φ*(x))(f_1(x) − k f_0(x)) dx
since
   ∫_{f_1 = k f_0} (φ(x) − φ*(x))(f_1(x) − k f_0(x)) dx = 0.
It is
   ∫_{f_1 > k f_0} (φ(x) − φ*(x))(f_1(x) − k f_0(x)) dx = ∫_{f_1 > k f_0} (1 − φ*(x))(f_1(x) − k f_0(x)) dx ≥ 0
(both factors ≥ 0) and
   ∫_{f_1 < k f_0} (φ(x) − φ*(x))(f_1(x) − k f_0(x)) dx = ∫_{f_1 < k f_0} (0 − φ*(x))(f_1(x) − k f_0(x)) dx ≥ 0
(both factors ≤ 0).
Therefore,
   0 ≤ ∫ (φ(x) − φ*(x))(f_1(x) − k f_0(x)) dx
     = E_{θ_1}(φ(X)) − E_{θ_1}(φ*(X)) − k (E_{θ_0}(φ(X)) − E_{θ_0}(φ*(X))).
Since E_{θ_0}(φ(X)) ≥ E_{θ_0}(φ*(X)), it holds that
   β_φ(θ_1) − β_{φ*}(θ_1) ≥ k (E_{θ_0}(φ(X)) − E_{θ_0}(φ*(X))) ≥ 0,
i.e., φ is most powerful.
If k = ∞, any test φ* of size 0 must be 0 on the set {x | f_0(x) > 0}. Therefore,
   β_φ(θ_1) − β_{φ*}(θ_1) = E_{θ_1}(φ(X)) − E_{θ_1}(φ*(X)) = ∫_{x | f_0(x) = 0} (1 − φ*(x)) f_1(x) dx ≥ 0,
i.e., φ is most powerful of size 0.
(ii):
If α = 0, then use (**). Otherwise, assume that 0 < α ≤ 1 and γ(x) = γ. It is
   E_{θ_0}(φ(X)) = P_{θ_0}(f_1(X) > k f_0(X)) + γ P_{θ_0}(f_1(X) = k f_0(X))
              = 1 − P_{θ_0}(f_1(X) ≤ k f_0(X)) + γ P_{θ_0}(f_1(X) = k f_0(X))
              = 1 − P_{θ_0}( f_1(X)/f_0(X) ≤ k ) + γ P_{θ_0}( f_1(X)/f_0(X) = k ).
Note that the last step is valid since P_{θ_0}(f_0(X) = 0) = 0.
Therefore, given 0 < α ≤ 1, we want to find k and γ such that E_{θ_0}(φ(X)) = α, i.e.,
   P_{θ_0}( f_1(X)/f_0(X) ≤ k ) − γ P_{θ_0}( f_1(X)/f_0(X) = k ) = 1 − α.
Note that f_1(X)/f_0(X) is a rv and, therefore, P_{θ_0}( f_1(X)/f_0(X) ≤ k ) is a cdf.
If there exists a k_0 such that
   P_{θ_0}( f_1(X)/f_0(X) ≤ k_0 ) = 1 − α,
we choose γ = 0 and k = k_0.
Otherwise, if there exists no such k_0, then there exists a k_1 such that
   P_{θ_0}( f_1(X)/f_0(X) < k_1 ) ≤ 1 − α < P_{θ_0}( f_1(X)/f_0(X) ≤ k_1 ),
i.e., the cdf has a jump at k_1. In this case, we choose k = k_1 and
   γ = ( P_{θ_0}( f_1(X)/f_0(X) ≤ k_1 ) − (1 − α) ) / P_{θ_0}( f_1(X)/f_0(X) = k_1 ).

(Figure: cdf of f_1(X)/f_0(X) with a jump at k_1, marking P(f_1/f_0 < k_1), 1 − α, and
P(f_1/f_0 ≤ k_1).)

Let us verify that these values for k_1 and γ meet the necessary conditions:
Obviously,
   P_{θ_0}( f_1(X)/f_0(X) ≤ k_1 ) − γ P_{θ_0}( f_1(X)/f_0(X) = k_1 ) = 1 − α.
Also, since
   P_{θ_0}( f_1(X)/f_0(X) ≤ k_1 ) > (1 − α),
it follows that γ ≥ 0 and since
   ( P_{θ_0}( f_1(X)/f_0(X) ≤ k_1 ) − (1 − α) ) / P_{θ_0}( f_1(X)/f_0(X) = k_1 )
      ≤ ( P_{θ_0}( f_1(X)/f_0(X) ≤ k_1 ) − P_{θ_0}( f_1(X)/f_0(X) < k_1 ) ) / P_{θ_0}( f_1(X)/f_0(X) = k_1 )
      = P_{θ_0}( f_1(X)/f_0(X) = k_1 ) / P_{θ_0}( f_1(X)/f_0(X) = k_1 )
      = 1,
it follows that γ ≤ 1. Overall, 0 ≤ γ ≤ 1 as required.
Lecture 21:
We 02/28/01
Theorem 9.2.2:
If a sufficient statisticTexists for the family{fθ:θ∈Θ ={θ0, θ1}}, then the Neyman–
Pearson most powerful test is a function ofT.
Proof:
Homework
Example 9.2.3:
We want to test H_0 : X ∼ N(0,1) vs. H_1 : X ∼ Cauchy(1,0), based on a single observation.
It is
   f_1(x)/f_0(x) = ( (1/π) · 1/(1 + x²) ) / ( (1/√(2π)) exp(−x²/2) ) = √(2/π) exp(x²/2)/(1 + x²).
The MP test is
   φ(x) = { 1,  if √(2/π) exp(x²/2)/(1 + x²) > k;   0, otherwise }
where k is determined such that E_{H_0}(φ(X)) = α.
If α < 0.113, we reject H_0 if |x| > z_{α/2}, where z_{α/2} is the upper α/2 quantile of a N(0,1)
distribution.
If α > 0.113, we reject H_0 if |x| > k_1 or if |x| < k_2, where k_1 > 0, k_2 > 0, such that
   exp(k_1²/2)/(1 + k_1²) = exp(k_2²/2)/(1 + k_2²)
and
   ∫_{k_2}^{k_1} (1/√(2π)) exp(−x²/2) dx = (1 − α)/2.

(Figure: left panel shows the likelihood ratio f_1(x)/f_0(x) for −2 ≤ x ≤ 2 with the cutoffs
−k_1, −k_2, k_2, k_1 and the points ±1.585 marked; right panel shows the N(0,1) density f_0(x)
with tail areas α/2 = 0.113/2 beyond ±1.585.)

Why is α = 0.113 so interesting?
For x = 0, it is
   f_1(x)/f_0(x) = √(2/π) ≈ 0.7979.
Similarly, for x ≈ −1.585 and x ≈ 1.585, it is
   f_1(x)/f_0(x) = √(2/π) exp((±1.585)²/2)/(1 + (±1.585)²) ≈ 0.7979 ≈ f_1(0)/f_0(0).
More importantly, P_{H_0}(|X| > 1.585) = 0.113.
9.3 Monotone Likelihood Ratios
(Based on Casella/Berger, Section 8.3.2)
Suppose we want to test H_0 : θ ≤ θ_0 vs. H_1 : θ > θ_0 for a family of pdf's {f_θ : θ ∈ Θ ⊆ IR}.
In general, it is not possible to find a UMP test. However, there exist conditions under which
UMP tests exist.

Definition 9.3.1:
Let {f_θ : θ ∈ Θ ⊆ IR} be a family of pdf's (pmf's) on a one–dimensional parameter space.
We say the family {f_θ} has a monotone likelihood ratio (MLR) in statistic T(X) if for
θ_1 < θ_2, whenever f_{θ_1} and f_{θ_2} are distinct, the ratio f_{θ_2}(x)/f_{θ_1}(x) is a
nondecreasing function of T(x) for the set of values x for which at least one of f_{θ_1} and
f_{θ_2} is > 0.

Note:
We can also define families of densities with nonincreasing MLR in T(X), but such families
can be treated by symmetry.
Example 9.3.2:
Let X_1, ..., X_n ∼ U[0, θ], θ > 0. Then the joint pdf is
   f_θ(x) = { 1/θ^n,  0 ≤ x_max ≤ θ;   0, otherwise } = (1/θ^n) I_[0,θ](x_max),
where x_max = max_{i=1,...,n} x_i.
Let θ_2 > θ_1, then
   f_{θ_2}(x)/f_{θ_1}(x) = (θ_1/θ_2)^n I_[0,θ_2](x_max) / I_[0,θ_1](x_max).
It is
   I_[0,θ_2](x_max) / I_[0,θ_1](x_max) = { 1,  x_max ∈ [0, θ_1];   ∞,  x_max ∈ (θ_1, θ_2] }
since for x_max ∈ [0, θ_1], it holds that x_max ∈ [0, θ_2].
But for x_max ∈ (θ_1, θ_2], it is I_[0,θ_1](x_max) = 0.
⇒ as T(X) = X_max increases, the density ratio goes from (θ_1/θ_2)^n to ∞.
⇒ f_{θ_2}/f_{θ_1} is a nondecreasing function of T(X) = X_max
⇒ the family of U[0, θ] distributions has a MLR in T(X) = X_max
Theorem 9.3.3:
The one-parameter exponential family fθ(x) = exp(Q(θ)T(x) + D(θ) + S(x)), where Q(θ) is nondecreasing, has a MLR in T(X).

Proof:
Homework.

Example 9.3.4:
Let X = (X1, . . . , Xn) be a random sample from the Poisson family with parameter λ > 0. Then the joint pdf is
fλ(x) = ∏_{i=1}^n ( e^{−λ} λ^{xi} / xi! ) = e^{−nλ} λ^{∑xi} ∏_{i=1}^n (1/xi!) = exp( −nλ + (∑_{i=1}^n xi) log(λ) − ∑_{i=1}^n log(xi!) ),
which belongs to the one-parameter exponential family.
Since Q(λ) = log(λ) is a nondecreasing function of λ, it follows by Theorem 9.3.3 that the Poisson family with parameter λ > 0 has a MLR in T(X) = ∑_{i=1}^n Xi.
We can verify this result by Definition 9.3.1:
f_{λ2}(x)/f_{λ1}(x) = (λ2^{∑xi}/λ1^{∑xi}) · (e^{−nλ2}/e^{−nλ1}) = (λ2/λ1)^{∑xi} e^{−n(λ2−λ1)}.
If λ2 > λ1, then λ2/λ1 > 1 and (λ2/λ1)^{∑xi} is a nondecreasing function of ∑xi.
Therefore, fλ has a MLR in T(X) = ∑_{i=1}^n Xi.
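As a quick numerical illustration (not part of the original notes), the following Python sketch evaluates the Poisson likelihood ratio f_{λ2}(x)/f_{λ1}(x) for a few hypothetical samples and shows that it increases with ∑xi; the parameter values and samples are arbitrary choices.

```python
import numpy as np
from scipy.stats import poisson

lam1, lam2 = 1.0, 2.5  # hypothetical parameter values with lam2 > lam1

def likelihood_ratio(x, lam1, lam2):
    """Joint-pmf ratio f_{lam2}(x) / f_{lam1}(x) for an iid Poisson sample x."""
    return np.prod(poisson.pmf(x, lam2)) / np.prod(poisson.pmf(x, lam1))

# samples ordered by sum(x); the printed ratio should be nondecreasing in sum(x),
# matching (lam2/lam1)**sum(x) * exp(-n*(lam2-lam1))
samples = [np.array([0, 1, 0]), np.array([1, 1, 1]), np.array([2, 3, 1]), np.array([4, 4, 5])]
for x in samples:
    print(x.sum(), likelihood_ratio(x, lam1, lam2))
```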
Theorem 9.3.5:
Let X ∼ fθ, θ ∈ Θ ⊆ IR, where the family {fθ} has a MLR in T(X).
For testing H0: θ ≤ θ0 vs. H1: θ > θ0, θ0 ∈ Θ, any test of the form
φ(x) = 1 if T(x) > t0; γ if T(x) = t0; 0 if T(x) < t0    (∗)
has a nondecreasing power function and is UMP of its size E_{θ0}(φ(X)) = α, if the size is not 0.
Also, for every 0 ≤ α ≤ 1 and every θ0 ∈ Θ, there exist a t0 and a γ (−∞ ≤ t0 ≤ ∞, 0 ≤ γ ≤ 1) such that the test of form (∗) is the UMP size α test of H0 vs. H1.

Proof:
"⟹":
Let θ1 < θ2, θ1, θ2 ∈ Θ and suppose E_{θ1}(φ(X)) > 0, i.e., the size is > 0. Since fθ has a MLR in T, f_{θ2}(x)/f_{θ1}(x) is a nondecreasing function of T. Therefore, any test of form (∗) is equivalent to a test
φ(x) = 1 if f_{θ2}(x)/f_{θ1}(x) > k; γ if f_{θ2}(x)/f_{θ1}(x) = k; 0 if f_{θ2}(x)/f_{θ1}(x) < k    (∗∗)
which by the Neyman–Pearson Lemma (Theorem 9.2.1) is MP of size α for testing H0: θ = θ1 vs. H1: θ = θ2.

Lecture 22:
Fr 03/02/01

Let Φα be the class of tests of size α, and let φt ∈ Φα be the trivial test φt(x) = α ∀x. Then φt has size and power α. The power of the MP test φ of form (∗) must be at least α as the MP test φ cannot have power less than the trivial test φt, i.e.,
E_{θ2}(φ(X)) ≥ E_{θ2}(φt(X)) = α = E_{θ1}(φ(X)).
Thus, for θ2 > θ1,
E_{θ2}(φ(X)) ≥ E_{θ1}(φ(X)),
i.e., the power function of the test φ of form (∗) is a nondecreasing function of θ.
Now let θ1 = θ0 and θ2 > θ0. We know that a test φ of form (∗) is MP for H0: θ = θ0 vs. H1: θ = θ2 > θ0, provided that its size α = E_{θ0}(φ(X)) is > 0.
Notice that the test φ of form (∗) does not depend on θ2. It only depends on t0 and γ. Therefore, the test φ of form (∗) is MP for all θ2 ∈ Θ1. Thus, this test is UMP for simple H0: θ = θ0 vs. composite H1: θ > θ0 with size E_{θ0}(φ(X)) = α0.
Since φ is UMP for a class Φ'' of tests (φ'' ∈ Φ'') satisfying
E_{θ0}(φ''(X)) ≤ α0,
φ must also be UMP for the more restrictive class Φ''' of tests (φ''' ∈ Φ''') satisfying
Eθ(φ'''(X)) ≤ α0 ∀θ ≤ θ0.
But since the power function of φ is nondecreasing, it holds for φ that
Eθ(φ(X)) ≤ E_{θ0}(φ(X)) = α0 ∀θ ≤ θ0.
Thus, φ is the UMP size α0 test of H0: θ ≤ θ0 vs. H1: θ > θ0 if α0 > 0.

"⟸":
Use the Neyman–Pearson Lemma (Theorem 9.2.1).

Note:
By interchanging inequalities throughout Theorem 9.3.5 and its proof, we see that this Theorem also provides a solution of the dual problem H'0: θ ≥ θ0 vs. H'1: θ < θ0.
Theorem 9.3.6:
For the one-parameter exponential family, there exists a UMP two-sided test of H0: θ ≤ θ1 or θ ≥ θ2 (where θ1 < θ2) vs. H1: θ1 < θ < θ2 of the form
φ(x) = 1 if c1 < T(x) < c2; γi if T(x) = ci, i = 1, 2; 0 if T(x) < c1 or if T(x) > c2.

Note:
UMP tests for H0: θ1 ≤ θ ≤ θ2 and H'0: θ = θ0 do not exist for one-parameter exponential families.
9.4 Unbiased and Invariant Tests
(Based on Rohatgi, Section 9.5, Rohatgi/Saleh, Section 9.5 & Casella/Berger, Section 8.3.2)
If we look at all size α tests in the class Φα, there exists no UMP test for many hypotheses. Can we find UMP tests if we reduce Φα by reasonable restrictions?

Definition 9.4.1:
A size α test φ of H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1 is unbiased if
Eθ(φ(X)) ≥ α ∀θ ∈ Θ1.

Note:
This condition means that βφ(θ) ≤ α ∀θ ∈ Θ0 and βφ(θ) ≥ α ∀θ ∈ Θ1. In other words, the power of this test is never less than α.

Definition 9.4.2:
Let Uα be the class of all unbiased size α tests of H0 vs. H1. If there exists a test φ ∈ Uα that has maximal power for all θ ∈ Θ1, we call φ a UMP unbiased (UMPU) size α test.

Note:
It holds that Uα ⊆ Φα. A UMP test φα ∈ Φα will have β_{φα} ≥ α ∀θ ∈ Θ1 since we must compare all tests φα with the trivial test φ(x) = α. Thus, if a UMP test exists in Φα, it is also a UMPU test in Uα.

Example 9.4.3:
Let X1, . . . , Xn be iid N(μ, σ²), where σ² > 0 is known. Consider H0: μ = μ0 vs. H1: μ ≠ μ0.
From the Neyman–Pearson Lemma, we know that for μ1 > μ0, the MP test is of the form
φ1(X) = 1 if X̄ > μ0 + (σ/√n) zα; 0 otherwise,
and for μ2 < μ0, the MP test is of the form
φ2(X) = 1 if X̄ < μ0 − (σ/√n) zα; 0 otherwise.
If a test is UMP, it must have the same rejection region as φ1 and φ2. However, these 2 rejection regions are different (actually, their intersection is empty). Thus, there exists no UMP test.
We next state a helpful Theorem and then continue with this example and see how we can find a UMPU test.
Theorem 9.4.4:
Let c1, . . . , cn ∈ IR be constants and f1(x), . . . , f_{n+1}(x) be real-valued functions. Let C be the class of functions φ(x) satisfying 0 ≤ φ(x) ≤ 1 and
∫_{−∞}^{∞} φ(x) fi(x) dx = ci ∀i = 1, . . . , n.
If φ∗ ∈ C satisfies
φ∗(x) = 1 if f_{n+1}(x) > ∑_{i=1}^n ki fi(x); 0 if f_{n+1}(x) < ∑_{i=1}^n ki fi(x)
for some constants k1, . . . , kn ∈ IR, then φ∗ maximizes ∫_{−∞}^{∞} φ(x) f_{n+1}(x) dx among all φ ∈ C.

Proof:
Let φ∗(x) be as above. Let φ(x) be any other function in C. Since 0 ≤ φ(x) ≤ 1 ∀x, it is
(φ∗(x) − φ(x)) · ( f_{n+1}(x) − ∑_{i=1}^n ki fi(x) ) ≥ 0 ∀x.
This holds since if φ∗(x) = 1, the left factor is ≥ 0 and the right factor is ≥ 0. If φ∗(x) = 0, the left factor is ≤ 0 and the right factor is ≤ 0.
Therefore,
0 ≤ ∫ (φ∗(x) − φ(x)) ( f_{n+1}(x) − ∑_{i=1}^n ki fi(x) ) dx
  = ∫ φ∗(x) f_{n+1}(x) dx − ∫ φ(x) f_{n+1}(x) dx − ∑_{i=1}^n ki ( ∫ φ∗(x) fi(x) dx − ∫ φ(x) fi(x) dx ),
where each term in the last sum equals ci − ci = 0. Thus,
∫ φ∗(x) f_{n+1}(x) dx ≥ ∫ φ(x) f_{n+1}(x) dx.

Note:
(i) If f_{n+1} is a pdf, then φ∗ maximizes the power.
(ii) The Theorem above is the Neyman–Pearson Lemma if n = 1, f1 = f_{θ0}, f2 = f_{θ1}, and c1 = α.
Example 9.4.3: (continued)
So far, we have seen that there exists no UMP test for H0: μ = μ0 vs. H1: μ ≠ μ0.
We will show that
φ3(x) = 1 if X̄ < μ0 − (σ/√n) z_{α/2} or if X̄ > μ0 + (σ/√n) z_{α/2}; 0 otherwise
is a UMPU size α test.
Due to Theorem 9.2.2, we only have to consider functions of the sufficient statistic T(X) = X̄. Let τ² = σ²/n.
To be unbiased and of size α, a test φ must have
(i) ∫ φ(t) f_{μ0}(t) dt = α, and
(ii) (∂/∂μ) ∫ φ(t) f_{μ}(t) dt |_{μ=μ0} = ∫ φ(t) ( (∂/∂μ) f_{μ}(t) ) |_{μ=μ0} dt = 0, i.e., we have a minimum at μ0.
We want to maximize ∫ φ(t) f_{μ}(t) dt, μ ≠ μ0, such that conditions (i) and (ii) hold.

Lecture 23:
Mo 03/05/01

We choose an arbitrary μ1 ≠ μ0 and let
f1(t) = f_{μ0}(t),  f2(t) = (∂/∂μ) f_{μ}(t) |_{μ=μ0},  f3(t) = f_{μ1}(t).
We now consider how the conditions on φ∗ in Theorem 9.4.4 can be met:
f3(t) > k1 f1(t) + k2 f2(t)
⟺ (1/(√(2π) τ)) exp(−(x−μ1)²/(2τ²)) > (k1/(√(2π) τ)) exp(−(x−μ0)²/(2τ²)) + (k2/(√(2π) τ)) exp(−(x−μ0)²/(2τ²)) · ((x−μ0)/τ²)
⟺ exp(−(x−μ1)²/(2τ²)) > k1 exp(−(x−μ0)²/(2τ²)) + k2 exp(−(x−μ0)²/(2τ²)) · ((x−μ0)/τ²)
⟺ exp( ((x−μ0)² − (x−μ1)²)/(2τ²) ) > k1 + k2 ((x−μ0)/τ²)
⟺ exp( x(μ1−μ0)/τ² − (μ1² − μ0²)/(2τ²) ) > k1 + k2 ((x−μ0)/τ²)
Note that the left hand side of this inequality is increasing in x if μ1 > μ0 and decreasing in x if μ1 < μ0. Either way, we can choose k1 and k2 such that the linear function in x crosses the exponential function in x at the two points
μ_L = μ0 − (σ/√n) z_{α/2},  μ_U = μ0 + (σ/√n) z_{α/2}.
Obviously, φ3 satisfies (i). We still need to check that φ3 satisfies (ii) and that β_{φ3}(μ) has a minimum at μ0 but omit this part from our proof here.
φ3 is of the form φ∗ in Theorem 9.4.4 and therefore φ3 is UMP in C. But the trivial test φt(x) = α also satisfies (i) and (ii) above. Therefore, β_{φ3}(μ) ≥ α ∀μ ≠ μ0. This means that φ3 is unbiased.
Overall, φ3 is a UMPU test of size α.
Definition 9.4.5:
A test φ is said to be α–similar on a subset Θ∗ of Θ if
βφ(θ) = Eθ(φ(X)) = α ∀θ ∈ Θ∗.
A test φ is said to be similar on Θ∗ ⊆ Θ if it is α–similar on Θ∗ for some α, 0 ≤ α ≤ 1.

Note:
The trivial test φ(x) = α is α–similar on every Θ∗ ⊆ Θ.

Theorem 9.4.6:
Let φ be an unbiased test of size α for H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1 such that βφ(θ) is a continuous function in θ. Then φ is α–similar on the boundary Λ = Θ̄0 ∩ Θ̄1, where Θ̄0 and Θ̄1 are the closures of Θ0 and Θ1, respectively.

Proof:
Let θ ∈ Λ. There exist sequences {θn} and {θ'n} with θn ∈ Θ0 and θ'n ∈ Θ1 such that
lim_{n→∞} θn = θ and lim_{n→∞} θ'n = θ.
By continuity, βφ(θn) → βφ(θ) and βφ(θ'n) → βφ(θ).
Since βφ(θn) ≤ α implies βφ(θ) ≤ α and since βφ(θ'n) ≥ α implies βφ(θ) ≥ α, it must hold that βφ(θ) = α ∀θ ∈ Λ.
Lecture 24:
We 03/07/01
Definition 9.4.7:
A test φ that is UMP among all α–similar tests on the boundary Λ = Θ̄0 ∩ Θ̄1 is called a UMP α–similar test.

Theorem 9.4.8:
Suppose βφ(θ) is continuous in θ for all tests φ of H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1. If a size α test of H0 vs. H1 is UMP α–similar, then it is UMP unbiased.

Proof:
Let φ0 be UMP α–similar and of size α. This means that Eθ(φ0(X)) ≤ α ∀θ ∈ Θ0.
Since the trivial test φ(x) = α is α–similar, it must hold for φ0 that β_{φ0}(θ) ≥ α ∀θ ∈ Θ1 since φ0 is UMP α–similar. This implies that φ0 is unbiased.
Since βφ(θ) is continuous in θ, we see from Theorem 9.4.6 that the class of unbiased tests is a subclass of the class of α–similar tests. Since φ0 is UMP in the larger class, it is also UMP in the subclass. Thus, φ0 is UMPU.

Note:
The continuity of the power function βφ(θ) cannot always be checked easily.
Example 9.4.9:
Let X1, . . . , Xn ∼ N(μ, 1).
Let H0: μ ≤ 0 vs. H1: μ > 0.
Since the family of densities has a MLR in ∑_{i=1}^n Xi, we could use Theorem 9.3.5 to find a UMP test. However, we want to illustrate the use of Theorem 9.4.8 here.
It is Λ = {0} and the power function
βφ(μ) = ∫_{IRⁿ} φ(x) (1/√(2π))ⁿ exp( −(1/2) ∑_{i=1}^n (xi − μ)² ) dx
of any test φ is continuous in μ. Thus, due to Theorem 9.4.6, any unbiased size α test of H0 is α–similar on Λ.
We need a UMP test of H'0: μ = 0 vs. H1: μ > 0.
By the NP Lemma, a MP test of H''0: μ = 0 vs. H''1: μ = μ1, where μ1 > 0, is given by
φ(x) = 1 if exp( ∑xi²/2 − ∑(xi − μ1)²/2 ) > k'; 0 otherwise,
or equivalently, by Theorem 9.2.2,
φ(x) = 1 if T = ∑_{i=1}^n Xi > k; 0 otherwise.
Since under H0, T ∼ N(0, n), k is determined by α = P_{μ=0}(T > k) = P(T/√n > k/√n), i.e., k = √n zα.
φ is independent of μ1 for every μ1 > 0. So φ is UMP α–similar for H'0 vs. H1.
Finally, φ is of size α, since for μ ≤ 0, it holds that
E_μ(φ(X)) = P_μ(T > √n zα) = P_μ( (T − nμ)/√n > zα − √n μ ) (∗) ≤ P(Z > zα) = α.
(∗) holds since (T − nμ)/√n ∼ N(0,1) for μ ≤ 0 and zα − √n μ ≥ zα for μ ≤ 0.
Thus all the requirements are met for Theorem 9.4.8, i.e., βφ is continuous and φ is UMP α–similar and of size α, and thus φ is UMPU.
Note:
Rohatgi, page 428–430, lists Theorems (without proofs), stating that for Normal data, one– and two–tailed t–tests, one– and two–tailed χ²–tests, two–sample t–tests, and F–tests are all UMPU.

Note:
Recall from Definition 8.2.4 that a class of distributions is invariant under a group G of transformations, if for each g ∈ G and for each θ ∈ Θ there exists a unique θ' ∈ Θ such that if X ∼ Pθ, then g(X) ∼ P_{θ'}.

Definition 9.4.10:
A group G of transformations on X leaves a hypothesis testing problem invariant if G leaves both {Pθ : θ ∈ Θ0} and {Pθ : θ ∈ Θ1} invariant, i.e., if y = g(x) ∼ hθ(y), then
{fθ(x) : θ ∈ Θ0} ≡ {hθ(y) : θ ∈ Θ0} and {fθ(x) : θ ∈ Θ1} ≡ {hθ(y) : θ ∈ Θ1}.

Note:
We want two types of invariance for our tests:
Measurement Invariance: If y = g(x) is a 1–to–1 mapping, the decision based on y should be the same as the decision based on x. If φ(x) is the test based on x and φ'(y) is the test based on y, then it must hold that φ(x) = φ'(g(x)) = φ'(y).
Formal Invariance: If two tests have the same structure, i.e., the same Θ, the same pdf's (or pmf's), and the same hypotheses, then we should use the same test in both problems. So, if the transformed problem in terms of y has the same formal structure as that of the problem in terms of x, we must have that φ'(y) = φ(x) = φ'(g(x)).
We can combine these two requirements in the following definition:

Definition 9.4.11:
An invariant test with respect to a group G of transformations is any test φ such that
φ(x) = φ(g(x)) ∀x ∀g ∈ G.
Example 9.4.12:
Let X ∼ Bin(n, p). Let H0: p = 1/2 vs. H1: p ≠ 1/2.
Let G = {g1, g2}, where g1(x) = n − x and g2(x) = x.
If φ is invariant, then φ(x) = φ(n − x). Is the test problem invariant? For g2, the answer is obvious.
For g1, we get:
g1(X) = n − X ∼ Bin(n, 1 − p)
H0: p = 1/2: {fp(x) : p = 1/2} = {hp(g1(x)) : p = 1/2} = Bin(n, 1/2)
H1: p ≠ 1/2: {fp(x) : p ≠ 1/2} = {hp(g1(x)) : p ≠ 1/2}, both families being Bin(n, p ≠ 1/2).
So all the requirements in Definition 9.4.10 are met. If, for example, n = 10, the test
φ(x) = 1 if x = 0, 1, 2, 8, 9, 10; 0 otherwise
is invariant under G. For example, φ(4) = 0 = φ(10 − 4) = φ(6), and, in general,
φ(x) = φ(10 − x) ∀x ∈ {0, 1, . . . , 9, 10}.
Lecture 25:
Fr 03/09/01
Example 9.4.13:
Let X1, . . . , Xn ∼ N(μ, σ²) where both μ and σ² > 0 are unknown. It is X̄ ∼ N(μ, σ²/n) and (n−1)S²/σ² ∼ χ²_{n−1}, and X̄ and S² are independent.
Let H0: μ ≤ 0 vs. H1: μ > 0.
Let G be the group of scale changes:
G = {gc(x̄, s²), c > 0 : gc(x̄, s²) = (c x̄, c² s²)}.
The problem is invariant because, when gc(x̄, s²) = (c x̄, c² s²), then
(i) c X̄ and c² S² are independent.
(ii) c X̄ ∼ N(cμ, c²σ²/n) ∈ {N(η, τ²/n)}.
(iii) ((n−1)/(c²σ²)) c² S² ∼ χ²_{n−1}.
So, this is the same family of distributions and Definition 9.4.10 holds because μ ≤ 0 implies that cμ ≤ 0 (for c > 0).
An invariant test satisfies φ(x̄, s²) ≡ φ(c x̄, c² s²), c > 0, s² > 0, x̄ ∈ IR.
Let c = 1/s. Then φ(x̄, s²) ≡ φ(x̄/s, 1), so invariant tests depend on (x̄, s²) only through x̄/s.
If x̄1/s1 ≠ x̄2/s2, then there exists no c > 0 such that (x̄2, s2²) ≡ (c x̄1, c² s1²). So invariance places no restrictions on φ across points with different ratios x̄1/s1 ≠ x̄2/s2. Thus, invariant tests are exactly those that depend only on x̄/s, which are equivalent to tests that are based only on t = x̄/(s/√n). Since this mapping is 1–to–1, the invariant test will use T = X̄/(S/√n) ∼ t_{n−1} if μ = 0. Note that this test does not depend on the nuisance parameter σ². Invariance often produces such results.
Definition 9.4.14:
Let G be a group of transformations on the space of X. We say a statistic T(x) is maximal invariant under G if
(i) T is invariant, i.e., T(x) = T(g(x)) ∀g ∈ G, and
(ii) T is maximal, i.e., T(x1) = T(x2) implies that x1 = g(x2) for some g ∈ G.

Example 9.4.15:
Let x = (x1, . . . , xn) and gc(x) = (x1 + c, . . . , xn + c).
Consider T(x) = (xn − x1, xn − x2, . . . , xn − x_{n−1}).
It is T(gc(x)) = (xn − x1, xn − x2, . . . , xn − x_{n−1}) = T(x), so T is invariant.
If T(x) = T(x'), then xn − xi = x'n − x'i ∀i = 1, 2, . . . , n − 1.
This implies that xi − x'i = xn − x'n = c ∀i = 1, 2, . . . , n − 1.
Thus, gc(x') = (x'1 + c, . . . , x'n + c) = x.
Therefore, T is maximal invariant.
Definition 9.4.16:
Let Iα be the class of all invariant tests of size α of H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1. If there exists a UMP member in Iα, it is called the UMP invariant test of H0 vs. H1.

Theorem 9.4.17:
Let T(x) be maximal invariant with respect to G. A test φ is invariant under G iff φ is a function of T.

Proof:
"⟹":
Let φ be invariant under G. If T(x1) = T(x2), then there exists a g ∈ G such that x1 = g(x2). Thus, it follows from invariance that φ(x1) = φ(g(x2)) = φ(x2). Since φ is the same whenever T(x1) = T(x2), φ must be a function of T.
"⟸":
Let φ be a function of T, i.e., φ(x) = h(T(x)). It follows that
φ(g(x)) = h(T(g(x))) (∗) = h(T(x)) = φ(x).
(∗) holds since T is invariant.
This means that φ is invariant.
Example 9.4.18:
Consider the test problem
H0: X ∼ f0(x1 − θ, . . . , xn − θ) vs. H1: X ∼ f1(x1 − θ, . . . , xn − θ),
where θ ∈ IR.
Let G be the group of transformations with
gc(x) = (x1 + c, . . . , xn + c),
where c ∈ IR and n ≥ 2.
As shown in Example 9.4.15, a maximal invariant statistic is T(X) = (X1 − Xn, . . . , X_{n−1} − Xn) = (T1, . . . , T_{n−1}). Due to Theorem 9.4.17, an invariant test φ depends on X only through T.
Since the transformation
(T1, . . . , T_{n−1}, Z) = (X1 − Xn, . . . , X_{n−1} − Xn, Xn)
is 1–to–1, there exist inverses Xn = Z and Xi = Ti + Xn = Ti + Z ∀i = 1, . . . , n − 1.
Applying Theorem 4.3.5 and integrating out the last component Z (= Xn) gives us the joint pdf of T = (T1, . . . , T_{n−1}).
Thus, under Hi, i = 0, 1, the joint pdf of T is given by
∫_{−∞}^{∞} fi(t1 + z, t2 + z, . . . , t_{n−1} + z, z) dz,
which is independent of θ. The problem is thus reduced to testing a simple hypothesis against a simple alternative. By the NP Lemma (Theorem 9.2.1), the MP test is
φ(t1, . . . , t_{n−1}) = 1 if λ(t) > c; 0 if λ(t) < c,
where t = (t1, . . . , t_{n−1}) and
λ(t) = ∫_{−∞}^{∞} f1(t1 + z, . . . , t_{n−1} + z, z) dz / ∫_{−∞}^{∞} f0(t1 + z, . . . , t_{n−1} + z, z) dz.
In the homework assignment, we use this result to construct a UMP invariant test of
H0: X ∼ N(θ, 1) vs. H1: X ∼ Cauchy(1, θ),
where a Cauchy(1, θ) distribution has pdf f(x; θ) = (1/π) · 1/(1 + (x − θ)²), where θ ∈ IR.
Lecture 26:
Mo 03/19/01
10 More on Hypothesis Testing
10.1 Likelihood Ratio Tests
(Based on Casella/Berger, Section 8.2.1)

Definition 10.1.1:
The likelihood ratio test statistic for
H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1 = Θ − Θ0
is
λ(x) = sup_{θ∈Θ0} fθ(x) / sup_{θ∈Θ} fθ(x).
The likelihood ratio test (LRT) is the test function
φ(x) = I_{[0,c)}(λ(x)),
for some constant c ∈ [0, 1], where c is usually chosen in such a way to make φ a test of size α.

Note:
(i) We have to select c such that 0 ≤ c ≤ 1 since 0 ≤ λ(x) ≤ 1.
(ii) LRT's are strongly related to MLE's. If θ̂ is the unrestricted MLE of θ over Θ and θ̂0 is the MLE of θ over Θ0, then λ(x) = f_{θ̂0}(x) / f_{θ̂}(x).
Example 10.1.2:
Let X1, . . . , Xn be a sample from N(μ, 1). We want to construct a LRT for
H0: μ = μ0 vs. H1: μ ≠ μ0.
It is μ̂0 = μ0 and μ̂ = x̄. Thus,
λ(x) = (2π)^{−n/2} exp(−(1/2) ∑(xi − μ0)²) / [ (2π)^{−n/2} exp(−(1/2) ∑(xi − x̄)²) ] = exp(−(n/2)(x̄ − μ0)²).
The LRT rejects H0 if λ(x) ≤ c, or equivalently, |x̄ − μ0| ≥ √(−2 log c / n). This means, the LRT rejects H0: μ = μ0 if x̄ is too far from μ0.
Theorem 10.1.3:
If T(X) is sufficient for θ and λ∗(t) and λ(x) are LRT statistics based on T and X respectively, then
λ∗(T(x)) = λ(x) ∀x,
i.e., the LRT can be expressed as a function of every sufficient statistic for θ.

Proof:
Since T is sufficient, it follows from Theorem 8.3.5 that its pdf (or pmf) factorizes as fθ(x) = gθ(T) h(x). Therefore we get:
λ(x) = sup_{θ∈Θ0} fθ(x) / sup_{θ∈Θ} fθ(x) = sup_{θ∈Θ0} gθ(T) h(x) / sup_{θ∈Θ} gθ(T) h(x) = sup_{θ∈Θ0} gθ(T) / sup_{θ∈Θ} gθ(T) = λ∗(T(x)).
Thus, our simplified expression for λ(x) indeed only depends on a sufficient statistic T.

Theorem 10.1.4:
If for a given α, 0 ≤ α ≤ 1, and for a simple hypothesis H0 and a simple alternative H1 a non-randomized test based on the NP Lemma and a LRT exist, then these tests are equivalent.

Proof:
See Homework.

Note:
Usually, LRT's perform well since they are often UMP or UMPU size α tests. However, this does not always hold. Rohatgi, Example 4, page 440–441, cites an example where the LRT is not unbiased and it is even worse than the trivial test φ(x) = α.

Theorem 10.1.5:
Under some regularity conditions on fθ(x), the rv −2 log λ(X) under H0 has asymptotically a chi-squared distribution with ν degrees of freedom, where ν equals the difference between the number of independent parameters in Θ and Θ0, i.e.,
−2 log λ(X) →_d χ²_ν under H0.

Note:
The regularity conditions required for Theorem 10.1.5 are basically the same as for Theorem 8.7.10. Under "independent" parameters we understand parameters that are unspecified, i.e., free to vary.
Example 10.1.6:
Let X1, . . . , Xn ∼ N(μ, σ²) where μ ∈ IR and σ² > 0 are both unknown.
Let H0: μ = μ0 vs. H1: μ ≠ μ0.
We have θ = (μ, σ²), Θ = {(μ, σ²) : μ ∈ IR, σ² > 0} and Θ0 = {(μ0, σ²) : σ² > 0}.
It is θ̂0 = (μ0, (1/n) ∑_{i=1}^n (xi − μ0)²) and θ̂ = (x̄, (1/n) ∑_{i=1}^n (xi − x̄)²).
Now, the LR test statistic λ(x) can be determined:
λ(x) = f_{θ̂0}(x) / f_{θ̂}(x)
 = [ (1/(√(2π))ⁿ) ((1/n)∑(xi−μ0)²)^{−n/2} exp( −∑(xi−μ0)² / (2·(1/n)∑(xi−μ0)²) ) ] / [ (1/(√(2π))ⁿ) ((1/n)∑(xi−x̄)²)^{−n/2} exp( −∑(xi−x̄)² / (2·(1/n)∑(xi−x̄)²) ) ]
 = ( ∑(xi − x̄)² / ∑(xi − μ0)² )^{n/2}
 = ( (∑xi² − n x̄²) / (∑xi² − 2μ0 ∑xi + nμ0² − n x̄² + n x̄²) )^{n/2}
 = ( 1 / ( 1 + n(x̄ − μ0)² / ∑(xi − x̄)² ) )^{n/2}.
Note that this is a decreasing function of
t(X) = √n (X̄ − μ0) / √( (1/(n−1)) ∑(Xi − X̄)² ) = √n (X̄ − μ0)/S (∗) ∼ t_{n−1}.
(∗) holds due to Corollary 7.2.4.
So we reject H0 if |t(x)| is large. Now,
−2 log λ(x) = −2·(−n/2) log( 1 + n(x̄ − μ0)²/∑(xi − x̄)² ) = n log( 1 + n(x̄ − μ0)²/∑(xi − x̄)² ).
Under H0, √n(X̄ − μ0)/σ ∼ N(0,1) and ∑(Xi − X̄)²/σ² ∼ χ²_{n−1}, and both are independent according to Theorem 7.2.1.
Therefore, under H0,
n(X̄ − μ0)² / ( (1/(n−1)) ∑(Xi − X̄)² ) ∼ F_{1,n−1}.
Thus, the mgf of −2 log λ(X) under H0 is
Mn(t) = E_{H0}( exp(−2t log λ(X)) )
 = E_{H0}( exp( nt log(1 + F/(n−1)) ) )
 = E_{H0}( exp( log(1 + F/(n−1))^{nt} ) )
 = E_{H0}( (1 + F/(n−1))^{nt} )
 = ∫_0^∞ [ Γ(n/2) / ( Γ((n−1)/2) Γ(1/2) (n−1)^{1/2} ) ] (1/√f) (1 + f/(n−1))^{−n/2} (1 + f/(n−1))^{nt} df.
Note that
f_{1,n−1}(f) = [ Γ(n/2) / ( Γ((n−1)/2) Γ(1/2) (n−1)^{1/2} ) ] (1/√f) (1 + f/(n−1))^{−n/2} I_{[0,∞)}(f)
is the pdf of a F_{1,n−1} distribution.
Let y = (1 + f/(n−1))^{−1}, then f/(n−1) = (1−y)/y and df = −((n−1)/y²) dy.
Thus,
Mn(t) = [ Γ(n/2) / ( Γ((n−1)/2) Γ(1/2) (n−1)^{1/2} ) ] ∫_0^∞ (1/√f) (1 + f/(n−1))^{nt − n/2} df
 = [ Γ(n/2) / ( Γ((n−1)/2) Γ(1/2) (n−1)^{1/2} ) ] (n−1)^{1/2} ∫_0^1 y^{(n−3)/2 − nt} (1−y)^{−1/2} dy
 (∗) = [ Γ(n/2) / ( Γ((n−1)/2) Γ(1/2) ) ] B( (n−1)/2 − nt, 1/2 )
 = [ Γ(n/2) / ( Γ((n−1)/2) Γ(1/2) ) ] · Γ((n−1)/2 − nt) Γ(1/2) / Γ(n/2 − nt)
 = [ Γ(n/2) / Γ((n−1)/2) ] · Γ((n−1)/2 − nt) / Γ(n/2 − nt),  t < 1/2 − 1/(2n).
(∗) holds since the integral represents the Beta function (see also Example 8.8.11).
As n → ∞, we can apply Stirling's formula which states that
Γ(α(n) + 1) ≈ (α(n))! ≈ √(2π) (α(n))^{α(n) + 1/2} exp(−α(n)).
Lecture 27:
We 03/21/01
So,
Mn(t) ≈ [ √(2π) ((n−2)/2)^{(n−1)/2} exp(−(n−2)/2) · √(2π) ((n(1−2t)−3)/2)^{(n(1−2t)−2)/2} exp(−(n(1−2t)−3)/2) ]
      / [ √(2π) ((n−3)/2)^{(n−2)/2} exp(−(n−3)/2) · √(2π) ((n(1−2t)−2)/2)^{(n(1−2t)−1)/2} exp(−(n(1−2t)−2)/2) ]
 = ( (n−2)/(n−3) )^{(n−2)/2} ( (n−2)/2 )^{1/2} ( (n(1−2t)−3)/(n(1−2t)−2) )^{(n(1−2t)−2)/2} ( (n(1−2t)−2)/2 )^{−1/2}
 = [ (1 + 1/(n−3))^{n−2} ]^{1/2} · [ (1 − 1/(n(1−2t)−2))^{n(1−2t)−2} ]^{1/2} · [ (n−2)/(n(1−2t)−2) ]^{1/2},
where, as n → ∞, the three factors converge to e^{1/2}, e^{−1/2}, and (1−2t)^{−1/2}, respectively.
Thus,
Mn(t) → 1/(1−2t)^{1/2} as n → ∞.
Note that this is the mgf of a χ²_1 distribution. Therefore, it follows by the Continuity Theorem (Theorem 6.4.2) that
−2 log λ(X) →_d χ²_1.
Obviously, we could use Theorem 10.1.5 (after checking that the regularity conditions hold) as well to obtain the same result. Since under H0 1 parameter (σ²) is unspecified and under H1 2 parameters (μ, σ²) are unspecified, it is ν = 2 − 1 = 1.
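A small simulation (my own illustration, not from the notes) makes the limit plausible: for this normal model, −2 log λ(X) = n log(1 + t²/(n−1)) with t the usual one-sample t-statistic, and its simulated distribution under H0 is close to χ²_1 already for moderate n. The sample size, σ, and seed below are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, mu0, reps = 50, 0.0, 20000   # sample size, H0 mean, number of simulated samples

stats = np.empty(reps)
for r in range(reps):
    x = rng.normal(mu0, 1.7, size=n)             # sigma is unknown to the test; 1.7 is arbitrary
    t = np.sqrt(n) * (x.mean() - mu0) / x.std(ddof=1)
    stats[r] = n * np.log(1.0 + t**2 / (n - 1))  # -2 log lambda(x) from Example 10.1.6

# simulated upper-tail probability at the chi^2_1 95% point; should be close to 0.05
print((stats > chi2.ppf(0.95, df=1)).mean())
```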
10.2 Parametric Chi–Squared Tests
(Based on Rohatgi, Section 10.3 & Rohatgi/Saleh, Section 10.3)

Definition 10.2.1: Normal Variance Tests
Let X1, . . . , Xn be a sample from a N(μ, σ²) distribution where μ may be known or unknown and σ² > 0 is unknown. The following table summarizes the χ² tests that are typically being used:

                                 Reject H0 at level α if
      H0        H1         μ known                                 μ unknown
 I    σ ≥ σ0    σ < σ0     ∑(xi − μ)² ≤ σ0² χ²_{n;1−α}             s² ≤ (σ0²/(n−1)) χ²_{n−1;1−α}
 II   σ ≤ σ0    σ > σ0     ∑(xi − μ)² ≥ σ0² χ²_{n;α}               s² ≥ (σ0²/(n−1)) χ²_{n−1;α}
 III  σ = σ0    σ ≠ σ0     ∑(xi − μ)² ≤ σ0² χ²_{n;1−α/2}           s² ≤ (σ0²/(n−1)) χ²_{n−1;1−α/2}
                           or ∑(xi − μ)² ≥ σ0² χ²_{n;α/2}          or s² ≥ (σ0²/(n−1)) χ²_{n−1;α/2}

Note:
(i) In Definition 10.2.1, σ0 is any fixed positive constant.
(ii) Tests I and II are UMPU if μ is unknown and UMP if μ is known.
(iii) In test III, the constants have been chosen in such a way to give equal probability to each tail. This is the usual approach. However, this may result in a biased test.
(iv) χ²_{n;1−α} is the (lower) α quantile and χ²_{n;α} is the (upper) 1−α quantile, i.e., for X ∼ χ²_n, it holds that P(X ≤ χ²_{n;1−α}) = α and P(X ≤ χ²_{n;α}) = 1 − α.
(v) We can also use χ² tests to test for equality of binomial probabilities as shown in the next few Theorems.
Theorem 10.2.2:
Let X1, . . . , Xk be independent rv's with Xi ∼ Bin(ni, pi), i = 1, . . . , k. Then it holds that
T = ∑_{i=1}^k ( (Xi − ni pi) / √(ni pi (1 − pi)) )² →_d χ²_k
as n1, . . . , nk → ∞.

Proof:
Homework

Corollary 10.2.3:
Let X1, . . . , Xk be as in Theorem 10.2.2 above. We want to test the hypothesis that H0: p1 = p2 = . . . = pk = p, where p is a known constant (vs. the alternative H1 that at least one of the pi's is different from the other ones). An approximate level–α test rejects H0 if
y = ∑_{i=1}^k ( (xi − ni p) / √(ni p (1 − p)) )² ≥ χ²_{k;α}.
Theorem 10.2.4:
Let X1, . . . , Xk be independent rv's with Xi ∼ Bin(ni, p), i = 1, . . . , k. Then the MLE of p is
p̂ = ∑_{i=1}^k xi / ∑_{i=1}^k ni.

Proof:
This can be shown by using the joint likelihood function or by the fact that ∑Xi ∼ Bin(∑ni, p) and for X ∼ Bin(n, p), the MLE is p̂ = x/n.

Theorem 10.2.5:
Let X1, . . . , Xk be independent rv's with Xi ∼ Bin(ni, pi), i = 1, . . . , k. An approximate level–α test of H0: p1 = p2 = . . . = pk = p, where p is unknown (vs. the alternative H1 that at least one of the pi's is different from the other ones), rejects H0 if
y = ∑_{i=1}^k ( (xi − ni p̂) / √(ni p̂ (1 − p̂)) )² ≥ χ²_{k−1;α},
where p̂ = ∑xi / ∑ni.
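A minimal computational sketch of the test in Theorem 10.2.5 (not from the notes; the counts below are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2

x = np.array([18, 25, 30])   # hypothetical numbers of successes
n = np.array([50, 60, 90])   # corresponding numbers of trials
alpha = 0.05

p_hat = x.sum() / n.sum()                                      # pooled MLE of p (Theorem 10.2.4)
y = np.sum((x - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat)))   # test statistic of Theorem 10.2.5
crit = chi2.ppf(1 - alpha, df=len(x) - 1)                      # chi^2_{k-1;alpha} (upper alpha quantile)

print(y, crit, "reject H0" if y >= crit else "fail to reject H0")
```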
Theorem 10.2.6:
Let (X1, . . . , Xk) be a multinomial rv with parameters n, p1, p2, . . . , pk where ∑_{i=1}^k pi = 1 and ∑_{i=1}^k Xi = n. Then it holds that
Uk = ∑_{i=1}^k (Xi − n pi)² / (n pi) →_d χ²_{k−1}
as n → ∞.
An approximate level–α test of H0: p1 = p1⁰, p2 = p2⁰, . . . , pk = pk⁰ rejects H0 if
∑_{i=1}^k (xi − n pi⁰)² / (n pi⁰) > χ²_{k−1;α}.

Proof:
Case k = 2 only:
U2 = (X1 − np1)²/(np1) + (X2 − np2)²/(np2)
 = (X1 − np1)²/(np1) + (n − X1 − n(1−p1))²/(n(1−p1))
 = (X1 − np1)² ( 1/(np1) + 1/(n(1−p1)) )
 = (X1 − np1)² ( ((1−p1) + p1) / (np1(1−p1)) )
 = (X1 − np1)² / (np1(1−p1)).
By the CLT, (X1 − np1)/√(np1(1−p1)) →_d N(0,1). Therefore, U2 →_d χ²_1.
Lecture 28:
Fr 03/23/01
Theorem 10.2.7:
Let X1, . . . , Xn be a sample from X. Let H0: X ∼ F, where the functional form of F is known completely. We partition the real line into k disjoint Borel sets A1, . . . , Ak and let P(X ∈ Ai) = pi, where pi > 0 ∀i = 1, . . . , k.
Let Yj = # Xi's in Aj = ∑_{i=1}^n I_{Aj}(Xi), ∀j = 1, . . . , k.
Then (Y1, . . . , Yk) has a multinomial distribution with parameters n, p1, p2, . . . , pk.

Theorem 10.2.8:
Let X1, . . . , Xn be a sample from X. Let H0: X ∼ Fθ, where θ = (θ1, . . . , θr) is unknown. Let the MLE θ̂ exist. We partition the real line into k disjoint Borel sets A1, . . . , Ak and let P_{θ̂}(X ∈ Ai) = p̂i, where p̂i > 0 ∀i = 1, . . . , k.
Let Yj = # Xi's in Aj = ∑_{i=1}^n I_{Aj}(Xi), ∀j = 1, . . . , k.
Then it holds that
Vk = ∑_{i=1}^k (Yi − n p̂i)² / (n p̂i) →_d χ²_{k−r−1}.
An approximate level–α test of H0: X ∼ Fθ rejects H0 if
∑_{i=1}^k (yi − n p̂i)² / (n p̂i) > χ²_{k−r−1;α},
where r is the number of parameters in θ that have to be estimated.
10.3 t–Tests and F–Tests
(Based on Rohatgi, Section 10.4 & 10.5 & Rohatgi/Saleh, Section 10.4 & 10.5)

Definition 10.3.1: One– and Two–Tailed t-Tests
Let X1, . . . , Xn be a sample from a N(μ, σ²) distribution where σ² > 0 may be known or unknown and μ is unknown. Let X̄ = (1/n) ∑_{i=1}^n Xi and S² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)².
The following table summarizes the z– and t–tests that are typically being used:

                              Reject H0 at level α if
      H0        H1        σ² known                          σ² unknown
 I    μ ≤ μ0    μ > μ0    x̄ ≥ μ0 + (σ/√n) zα                x̄ ≥ μ0 + (s/√n) t_{n−1;α}
 II   μ ≥ μ0    μ < μ0    x̄ ≤ μ0 + (σ/√n) z_{1−α}           x̄ ≤ μ0 + (s/√n) t_{n−1;1−α}
 III  μ = μ0    μ ≠ μ0    |x̄ − μ0| ≥ (σ/√n) z_{α/2}         |x̄ − μ0| ≥ (s/√n) t_{n−1;α/2}

Note:
(i) In Definition 10.3.1, μ0 is any fixed constant.
(ii) These tests are based on just one sample and are often called one sample t–tests.
(iii) Tests I and II are UMP and test III is UMPU if σ² is known. Tests I, II, and III are UMPU and UMP invariant if σ² is unknown.
(iv) For large n (≥ 30), we can use z–tables instead of t-tables. Also, for large n we can drop the Normality assumption due to the CLT. However, for small n, none of these simplifications is justified.
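The decision rules in Definition 10.3.1 translate directly into code. Here is a minimal sketch (not from the notes) of the two-sided test III with σ² unknown; the data and μ0 are invented:

```python
import numpy as np
from scipy.stats import t

x = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.7, 5.2])   # hypothetical sample
mu0, alpha = 5.0, 0.05

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)
t_crit = t.ppf(1 - alpha / 2, df=n - 1)              # t_{n-1; alpha/2}

reject = abs(xbar - mu0) >= (s / np.sqrt(n)) * t_crit   # test III of Definition 10.3.1
print(xbar, s, reject)
```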
Definition 10.3.2: Two–Sample t-Tests
Let X1, . . . , Xm be a sample from a N(μ1, σ1²) distribution where σ1² > 0 may be known or unknown and μ1 is unknown. Let Y1, . . . , Yn be a sample from a N(μ2, σ2²) distribution where σ2² > 0 may be known or unknown and μ2 is unknown.
Let X̄ = (1/m) ∑_{i=1}^m Xi and S1² = (1/(m−1)) ∑_{i=1}^m (Xi − X̄)².
Let Ȳ = (1/n) ∑_{i=1}^n Yi and S2² = (1/(n−1)) ∑_{i=1}^n (Yi − Ȳ)².
Let Sp² = ( (m−1)S1² + (n−1)S2² ) / (m + n − 2).
The following table summarizes the z– and t–tests that are typically being used:

                                      Reject H0 at level α if
      H0            H1            σ1², σ2² known                                 σ1², σ2² unknown, σ1 = σ2
 I    μ1−μ2 ≤ δ     μ1−μ2 > δ     x̄ − ȳ ≥ δ + zα √(σ1²/m + σ2²/n)               x̄ − ȳ ≥ δ + t_{m+n−2;α} sp √(1/m + 1/n)
 II   μ1−μ2 ≥ δ     μ1−μ2 < δ     x̄ − ȳ ≤ δ + z_{1−α} √(σ1²/m + σ2²/n)           x̄ − ȳ ≤ δ + t_{m+n−2;1−α} sp √(1/m + 1/n)
 III  μ1−μ2 = δ     μ1−μ2 ≠ δ     |x̄ − ȳ − δ| ≥ z_{α/2} √(σ1²/m + σ2²/n)         |x̄ − ȳ − δ| ≥ t_{m+n−2;α/2} sp √(1/m + 1/n)

Note:
(i) In Definition 10.3.2, δ is any fixed constant.
(ii) All tests are UMPU and UMP invariant.
(iii) If σ1² = σ2² = σ² (which is unknown), then Sp² is an unbiased estimate of σ². We should check that σ1² = σ2² with an F–test.
(iv) For large m + n, we can use z–tables instead of t-tables. Also, for large m and large n we can drop the Normality assumption due to the CLT. However, for small m or small n, none of these simplifications is justified.
Definition 10.3.3: Paired t-Tests
Let (X1, Y1), . . . , (Xn, Yn) be a sample from a bivariate N(μ1, μ2, σ1², σ2², ρ) distribution where all 5 parameters are unknown.
Let Di = Xi − Yi ∼ N(μ1 − μ2, σ1² + σ2² − 2ρσ1σ2).
Let D̄ = (1/n) ∑_{i=1}^n Di and Sd² = (1/(n−1)) ∑_{i=1}^n (Di − D̄)².
The following table summarizes the t–tests that are typically being used:

      H0            H1            Reject H0 at level α if
 I    μ1−μ2 ≤ δ     μ1−μ2 > δ     d̄ ≥ δ + (sd/√n) t_{n−1;α}
 II   μ1−μ2 ≥ δ     μ1−μ2 < δ     d̄ ≤ δ + (sd/√n) t_{n−1;1−α}
 III  μ1−μ2 = δ     μ1−μ2 ≠ δ     |d̄ − δ| ≥ (sd/√n) t_{n−1;α/2}

Note:
(i) In Definition 10.3.3, δ is any fixed constant.
(ii) These tests are special cases of one–sample tests. All the properties stated in the Note following Definition 10.3.1 hold.
(iii) We could do a test based on Normality assumptions if σd² = σ1² + σ2² − 2ρσ1σ2 were known, but that is a very unrealistic assumption.
Definition 10.3.4: F–Tests
Let X1, . . . , Xm be a sample from a N(μ1, σ1²) distribution where μ1 may be known or unknown and σ1² is unknown. Let Y1, . . . , Yn be a sample from a N(μ2, σ2²) distribution where μ2 may be known or unknown and σ2² is unknown.
Recall that
∑_{i=1}^m (Xi − X̄)² / σ1² ∼ χ²_{m−1},  ∑_{i=1}^n (Yi − Ȳ)² / σ2² ∼ χ²_{n−1},
and
[ ∑_{i=1}^m (Xi − X̄)² / ((m−1)σ1²) ] / [ ∑_{i=1}^n (Yi − Ȳ)² / ((n−1)σ2²) ] = (σ2²/σ1²) (S1²/S2²) ∼ F_{m−1,n−1}.
The following table summarizes the F–tests that are typically being used:

                                  Reject H0 at level α if
      H0          H1           μ1, μ2 known                                            μ1, μ2 unknown
 I    σ1² ≤ σ2²   σ1² > σ2²    [(1/m)∑(xi−μ1)²] / [(1/n)∑(yi−μ2)²] ≥ F_{m,n;α}          s1²/s2² ≥ F_{m−1,n−1;α}
 II   σ1² ≥ σ2²   σ1² < σ2²    [(1/n)∑(yi−μ2)²] / [(1/m)∑(xi−μ1)²] ≥ F_{n,m;α}          s2²/s1² ≥ F_{n−1,m−1;α}
 III  σ1² = σ2²   σ1² ≠ σ2²    [(1/m)∑(xi−μ1)²] / [(1/n)∑(yi−μ2)²] ≥ F_{m,n;α/2}        s1²/s2² ≥ F_{m−1,n−1;α/2} if s1² ≥ s2²
                               or [(1/n)∑(yi−μ2)²] / [(1/m)∑(xi−μ1)²] ≥ F_{n,m;α/2}     or s2²/s1² ≥ F_{n−1,m−1;α/2} if s1² < s2²

Note:
(i) Tests I and II are UMPU and UMP invariant if μ1 and μ2 are unknown.
(ii) Test III uses equal tails and therefore may not be unbiased.
(iii) If an F–test (at level α1) and a t–test (at level α2) are both performed, the combined test has level α = 1 − (1−α1)(1−α2) ≥ max(α1, α2) (≈ α1 + α2 if both are small).
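As a sketch of test I in Definition 10.3.4 with μ1, μ2 unknown (my own illustration; both samples below are simulated):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=12)   # sample from N(mu1, sigma1^2)
y = rng.normal(0.0, 1.5, size=15)   # sample from N(mu2, sigma2^2)
alpha = 0.05

s1sq, s2sq = x.var(ddof=1), y.var(ddof=1)
stat = s1sq / s2sq
crit = f.ppf(1 - alpha, dfn=len(x) - 1, dfd=len(y) - 1)   # F_{m-1, n-1; alpha}

print(stat, crit, "reject H0: sigma1^2 <= sigma2^2" if stat >= crit else "fail to reject")
```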
Lecture 29:
Mo 03/26/01
10.4 Bayes and Minimax Tests
(Based on Rohatgi, Section 10.6 & Rohatgi/Saleh, Section 10.6)
Hypothesis testing may be conducted in a decision–theoretic framework. Here our action space A consists of two options: a0 = fail to reject H0 and a1 = reject H0.
Usually, we assume no loss for a correct decision. Thus, our loss function looks like:
L(θ, a0) = 0 if θ ∈ Θ0; a(θ) if θ ∈ Θ1.
L(θ, a1) = b(θ) if θ ∈ Θ0; 0 if θ ∈ Θ1.
We consider the following special cases:
0–1 loss: a(θ) = b(θ) = 1, i.e., all errors are equally bad.
Generalized 0–1 loss: a(θ) = cII, b(θ) = cI, i.e., all Type I errors are equally bad and all Type II errors are equally bad and Type I errors are worse than Type II errors or vice versa.
Then, the risk function can be written as
R(θ, d(X)) = L(θ, a0) Pθ(d(X) = a0) + L(θ, a1) Pθ(d(X) = a1)
 = a(θ) Pθ(d(X) = a0) if θ ∈ Θ1; b(θ) Pθ(d(X) = a1) if θ ∈ Θ0.
The minimax rule minimizes
max_θ { a(θ) Pθ(d(X) = a0), b(θ) Pθ(d(X) = a1) }.

Theorem 10.4.1:
The minimax rule d for testing
H0: θ = θ0 vs. H1: θ = θ1
under the generalized 0–1 loss function rejects H0 if
f_{θ1}(x)/f_{θ0}(x) ≥ k,
where k is chosen such that
R(θ1, d(X)) = R(θ0, d(X))
⟺ cII P_{θ1}(d(X) = a0) = cI P_{θ0}(d(X) = a1)
⟺ cII P_{θ1}( f_{θ1}(X)/f_{θ0}(X) < k ) = cI P_{θ0}( f_{θ1}(X)/f_{θ0}(X) ≥ k ).

Proof:
Let d' be any other rule.
• If R(θ0, d) < R(θ0, d'), then
R(θ0, d) = R(θ1, d) < max{R(θ0, d'), R(θ1, d')}.
So, d' is not minimax.
• If R(θ0, d) ≥ R(θ0, d'), i.e.,
cI P_{θ0}(d = a1) = R(θ0, d) ≥ R(θ0, d') = cI P_{θ0}(d' = a1),
then
P(reject H0 | H0 true) = P_{θ0}(d = a1) ≥ P_{θ0}(d' = a1).
By the NP Lemma, the rule d is MP of its size. Thus,
P_{θ1}(d = a1) ≥ P_{θ1}(d' = a1) ⟺ P_{θ1}(d = a0) ≤ P_{θ1}(d' = a0)
⟹ R(θ1, d) ≤ R(θ1, d')
⟹ max{R(θ0, d), R(θ1, d)} = R(θ1, d) ≤ R(θ1, d') ≤ max{R(θ0, d'), R(θ1, d')} ∀d'
⟹ d is minimax
Example 10.4.2:
Let X1, . . . , Xn be iid N(μ, 1). Let H0: μ = μ0 vs. H1: μ = μ1 > μ0.
As we have seen before, f_{θ1}(x)/f_{θ0}(x) ≥ k1 is equivalent to x̄ ≥ k2.
Therefore, we choose k2 such that
cII P_{μ1}(X̄ < k2) = cI P_{μ0}(X̄ ≥ k2)
⟺ cII Φ(√n(k2 − μ1)) = cI (1 − Φ(√n(k2 − μ0))),
where Φ(z) = P(Z ≤ z) for Z ∼ N(0,1).
Given cI, cII, μ0, μ1, and n, we can solve (numerically) for k2 using Normal tables.
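The defining equation for k2 can be handed to a one-dimensional root finder; this sketch is my own addition, and the constants cI, cII, μ0, μ1, n are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

c_I, c_II = 1.0, 2.0
mu0, mu1, n = 0.0, 1.0, 25

def risk_difference(k2):
    # c_II * P_{mu1}(Xbar < k2) - c_I * P_{mu0}(Xbar >= k2); the minimax k2 is its root
    return c_II * norm.cdf(np.sqrt(n) * (k2 - mu1)) - c_I * (1 - norm.cdf(np.sqrt(n) * (k2 - mu0)))

k2 = brentq(risk_difference, mu0, mu1)   # for these constants the root lies between mu0 and mu1
print(k2)
```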
Note:
Now suppose we have a prior distribution π(θ) on Θ. Then the Bayes risk of a decision rule d (under the loss function introduced before) is
R(π, d) = Eπ(R(θ, d(X))) = ∫_Θ R(θ, d) π(θ) dθ = ∫_{Θ0} b(θ) π(θ) Pθ(d(X) = a1) dθ + ∫_{Θ1} a(θ) π(θ) Pθ(d(X) = a0) dθ
if π is a pdf.
The Bayes risk for a pmf π looks similar (see Rohatgi, page 461).

Theorem 10.4.3:
The Bayes rule for testing H0: θ = θ0 vs. H1: θ = θ1 under the prior π(θ0) = π0 and π(θ1) = π1 = 1 − π0 and the generalized 0–1 loss function is to reject H0 if
f_{θ1}(x)/f_{θ0}(x) ≥ cI π0 / (cII π1).

Proof:
We wish to minimize R(π, d). We know that
R(π, d) = Eπ(R(θ, d))    (Def. 8.8.8)
 = Eπ( Eθ(L(θ, d(X))) )    (Def. 8.8.3)
 = ∫ g(x) ( ∫ L(θ, d(x)) h(θ | x) dθ ) dx    (g is the marginal, h the posterior)
 = ∫ g(x) Eθ(L(θ, d(X)) | X = x) dx    (Def. 4.7.1)
 = E_X( Eθ(L(θ, d(X)) | X) ).
Therefore, it is sufficient to minimize Eθ(L(θ, d(X)) | X).
The a posteriori distribution of θ is
h(θ | x) = π(θ) fθ(x) / ∑_θ π(θ) fθ(x),
i.e., h(θ0 | x) = π0 f_{θ0}(x) / (π0 f_{θ0}(x) + π1 f_{θ1}(x)) and h(θ1 | x) = π1 f_{θ1}(x) / (π0 f_{θ0}(x) + π1 f_{θ1}(x)).
Therefore, the posterior expected loss is
Eθ(L(θ, d(X)) | X = x) = cI h(θ0 | x) if d(x) = a1 (loss only when θ = θ0), and cII h(θ1 | x) if d(x) = a0 (loss only when θ = θ1).
This will be minimized if we reject H0, i.e., d(x) = a1, when cI h(θ0 | x) ≤ cII h(θ1 | x)
⟹ cI π0 f_{θ0}(x) ≤ cII π1 f_{θ1}(x)
⟹ f_{θ1}(x)/f_{θ0}(x) ≥ cI π0 / (cII π1).
Note:
For minimax rules and Bayes rules, the significance level α is no longer predetermined.

Example 10.4.4:
Let X1, . . . , Xn be iid N(μ, 1). Let H0: μ = μ0 vs. H1: μ = μ1 > μ0. Let cI = cII.
By Theorem 10.4.3, the Bayes rule d rejects H0 if
f_{θ1}(x)/f_{θ0}(x) ≥ π0/(1 − π0)
⟹ exp( −∑(xi − μ1)²/2 + ∑(xi − μ0)²/2 ) ≥ π0/(1 − π0)
⟹ exp( (μ1 − μ0) ∑xi + n(μ0² − μ1²)/2 ) ≥ π0/(1 − π0)
⟹ (μ1 − μ0) ∑xi + n(μ0² − μ1²)/2 ≥ ln(π0/(1 − π0))
⟹ (1/n) ∑xi ≥ (1/n) · ln(π0/(1 − π0))/(μ1 − μ0) + (μ0 + μ1)/2.
If π0 = 1/2, then we reject H0 if x̄ ≥ (μ0 + μ1)/2.

Note:
We can generalize Theorem 10.4.3 to the case of classifying among k options θ1, . . . , θk. If we use the 0–1 loss function
L(θi, d) = 1 if d(X) = θj for some j ≠ i; 0 if d(X) = θi,
then the Bayes rule is to pick θi if
πi f_{θi}(x) ≥ πj f_{θj}(x) ∀j ≠ i.
Lecture 30:
We 03/28/01
Example 10.4.5:
Let X1, . . . , Xn be iid N(μ, 1). Let μ1 < μ2 < μ3 and let π1 = π2 = π3.
Choose μ = μi if
πi exp( −∑(xk − μi)²/2 ) ≥ πj exp( −∑(xk − μj)²/2 ), j ≠ i, j = 1, 2, 3.
Similar to Example 10.4.4, these conditions can be transformed as follows:
x̄ (μi − μj) ≥ (μi − μj)(μi + μj)/2, j ≠ i, j = 1, 2, 3.
In our particular example, we get the following decision rules:
(i) Choose μ1 if x̄ ≤ (μ1 + μ2)/2 (and x̄ ≤ (μ1 + μ3)/2).
(ii) Choose μ2 if x̄ ≥ (μ1 + μ2)/2 and x̄ ≤ (μ2 + μ3)/2.
(iii) Choose μ3 if x̄ ≥ (μ2 + μ3)/2 (and x̄ ≥ (μ1 + μ3)/2).
Note that in (i) and (iii) the condition in parentheses automatically holds when the other condition holds.
If μ1 = 0, μ2 = 2, and μ3 = 4, we have the decision rules:
(i) Choose μ1 if x̄ ≤ 1.
(ii) Choose μ2 if 1 ≤ x̄ ≤ 3.
(iii) Choose μ3 if x̄ ≥ 3.
We do not have to worry how to handle the boundary since the probability that the rv will realize on any of the two boundary points is 0.
11 Confidence Estimation
11.1 Fundamental Notions
(Based on Casella/Berger, Section 9.1 & 9.3.2)
Let X be a rv and a, b be fixed positive numbers, a < b. Then
P(a < X < b) = P(a < X and X < b)
 = P(a < X and X/b < 1)
 = P(a < X and aX/b < a)
 = P(aX/b < a < X).
The interval I(X) = (aX/b, X) is an example of a random interval. I(X) contains the value a with a certain fixed probability.
For example, if X ∼ U(0,1), a = 1/4, and b = 3/4, then the interval I(X) = (X/3, X) contains 1/4 with probability 1/2.

Definition 11.1.1:
Let Pθ, θ ∈ Θ ⊆ IR^k, be a set of probability distributions of a rv X. A family of subsets S(x) of Θ, where S(x) depends on x but not on θ, is called a family of random sets. In particular, if θ ∈ Θ ⊆ IR and S(x) is an interval (θ̲(x), θ̄(x)) where θ̲(x) and θ̄(x) depend on x but not on θ, we call S(X) a random interval, with θ̲(X) and θ̄(X) as lower and upper bounds, respectively. θ̲(X) may be −∞ and θ̄(X) may be +∞.

Note:
Frequently in inference, we are not interested in estimating a parameter or testing a hypothesis about it. Instead, we are interested in establishing a lower or upper bound (or both) for one or multiple parameters.

Definition 11.1.2:
A family of subsets S(x) of Θ ⊆ IR^k is called a family of confidence sets at confidence level 1 − α if
Pθ(S(X) ∋ θ) ≥ 1 − α ∀θ ∈ Θ,
where 0 < α < 1 is usually small.
The quantity
inf_θ Pθ(S(X) ∋ θ) = 1 − α
is called the confidence coefficient (i.e., the smallest probability of true coverage is 1 − α).

Definition 11.1.3:
For k = 1, we use the following names for some of the confidence sets defined in Definition 11.1.2:
(i) If S(x) = (θ̲(x), ∞), then θ̲(x) is called a level 1 − α lower confidence bound.
(ii) If S(x) = (−∞, θ̄(x)), then θ̄(x) is called a level 1 − α upper confidence bound.
(iii) S(x) = (θ̲(x), θ̄(x)) is called a level 1 − α confidence interval (CI).

Definition 11.1.4:
A family of 1 − α level confidence sets {S(x)} is called uniformly most accurate (UMA) if
Pθ(S(X) ∋ θ') ≤ Pθ(S'(X) ∋ θ') ∀θ, θ' ∈ Θ, θ ≠ θ',
and for any 1 − α level family of confidence sets S'(X) (i.e., S(x) minimizes the probability of false (or incorrect) coverage).
Theorem 11.1.5:
Let X1, . . . , Xn ∼ Fθ, θ ∈ Θ, where Θ is an interval on IR. Let T(X, θ) be a function on IRⁿ × Θ such that for each θ, T(X, θ) is a statistic, and as a function of θ, T is strictly monotone (either increasing or decreasing) in θ at every value of x ∈ IRⁿ.
Let Λ ⊆ IR be the range of T and let the equation λ = T(x, θ) be solvable for θ for every λ ∈ Λ and every x ∈ IRⁿ.
If the distribution of T(X, θ) is independent of θ, then we can construct a confidence interval for θ at any level.

Proof:
Choose α such that 0 < α < 1. Then we can choose λ1(α) < λ2(α) (which may not necessarily be unique) such that
Pθ(λ1(α) < T(X, θ) < λ2(α)) ≥ 1 − α ∀θ.
Since the distribution of T(X, θ) is independent of θ, λ1(α) and λ2(α) also do not depend on θ.
If T(X, θ) is increasing in θ, solve the equations λ1(α) = T(X, θ) for θ̲(X) and λ2(α) = T(X, θ) for θ̄(X).
If T(X, θ) is decreasing in θ, solve the equations λ1(α) = T(X, θ) for θ̄(X) and λ2(α) = T(X, θ) for θ̲(X).
In either case, it holds that
Pθ(θ̲(X) < θ < θ̄(X)) ≥ 1 − α ∀θ.

Note:
(i) Solvability is guaranteed if T is continuous and strictly monotone as a function of θ.
(ii) If T is not monotone, we can still use this Theorem to get confidence sets that may not be confidence intervals.

Example 11.1.6:
Let X1, . . . , Xn ∼ N(μ, σ²), where μ and σ² > 0 are both unknown. We seek a 1 − α level confidence interval for μ.
Note that by Corollary 7.2.4
T(X, μ) = (X̄ − μ)/(S/√n) ∼ t_{n−1}
and T(X, μ) is independent of μ and monotone and decreasing in μ.
We choose λ1(α) and λ2(α) such that
P(λ1(α) < T(X, μ) < λ2(α)) = 1 − α
and solve for μ, which yields
P( λ1(α) < (X̄ − μ)/(S/√n) < λ2(α) ) =
P( λ1(α) S/√n < X̄ − μ < λ2(α) S/√n ) =
P( λ1(α) S/√n − X̄ < −μ < λ2(α) S/√n − X̄ ) =
P( X̄ − S λ2(α)/√n < μ < X̄ − S λ1(α)/√n ) = 1 − α.
Thus,
( X̄ − S λ2(α)/√n, X̄ − S λ1(α)/√n )
is a 1 − α level CI for μ. We commonly choose λ2(α) = −λ1(α) = t_{n−1;α/2}.
Example 11.1.7:
Let X1, . . . , Xn ∼ U(0, θ).
We know that θ̂ = max(Xi) = Max_n is the MLE for θ and sufficient for θ.
The pdf of Max_n is given by
fn(y) = (n y^{n−1}/θⁿ) I_{(0,θ)}(y).
Then the rv Tn = Max_n/θ has the pdf
hn(t) = n t^{n−1} I_{(0,1)}(t),
which is independent of θ. Tn is monotone and decreasing in θ.
We now have to find numbers λ1(α) and λ2(α) such that
P(λ1(α) < Tn < λ2(α)) = 1 − α
⟹ n ∫_{λ1}^{λ2} t^{n−1} dt = 1 − α
⟹ λ2ⁿ − λ1ⁿ = 1 − α.
If we choose λ2 = 1 and λ1 = α^{1/n}, then (Max_n, α^{−1/n} Max_n) is a 1 − α level CI for θ. This holds since
1 − α = P( α^{1/n} < Max_n/θ < 1 ) = P( α^{−1/n} > θ/Max_n > 1 ) = P( α^{−1/n} Max_n > θ > Max_n ).
Lecture 31:
Fr 03/30/01
11.2 Shortest–Length Confidence Intervals
(Based on Casella/Berger, Section 9.2.2 & 9.3.1)
In practice, we usually want not only an interval with coverage probability 1 − α for θ, but if possible the shortest (most precise) such interval.

Definition 11.2.1:
A rv T(X, θ) whose distribution is independent of θ is called a pivot.

Note:
The methods we will discuss here can provide the shortest interval based on a given pivot. They will not guarantee that there is no other pivot with a shorter minimal interval.

Example 11.2.2:
Let X1, . . . , Xn ∼ N(μ, σ²), where σ² > 0 is known. The obvious pivot for μ is
T(X) = (X̄ − μ)/(σ/√n) ∼ N(0,1).
Suppose that (a, b) is an interval such that P(a < Z < b) = 1 − α, where Z ∼ N(0,1).
A 1 − α level CI based on this pivot is found by
1 − α = P( a < (X̄ − μ)/(σ/√n) < b ) = P( X̄ − b σ/√n < μ < X̄ − a σ/√n ).
The length of the interval is L = (b − a) σ/√n.
To minimize L, we must choose a and b such that b − a is minimal while
Φ(b) − Φ(a) = (1/√(2π)) ∫_a^b e^{−x²/2} dx = 1 − α,
where Φ(z) = P(Z ≤ z).
To find a minimum, we can differentiate these expressions with respect to a. However, b is not a constant but is an implicit function of a. Formally, we could write d b(a)/da. However, this is usually shortened to db/da.
Here we get
d/da (Φ(b) − Φ(a)) = φ(b) db/da − φ(a) = d/da (1 − α) = 0
and
dL/da = (σ/√n)(db/da − 1) = (σ/√n)(φ(a)/φ(b) − 1).
The minimum occurs when φ(a) = φ(b) which happens when a = b or a = −b. If we select a = b, then Φ(b) − Φ(a) = Φ(a) − Φ(a) = 0 ≠ 1 − α. Thus, we must have that b = −a = z_{α/2}.
Thus, the shortest CI based on T is
( X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n ).

Definition 11.2.3:
A pdf f(x) is unimodal iff there exists an x∗ such that f(x) is nondecreasing for x ≤ x∗ and f(x) is nonincreasing for x ≥ x∗.
Theorem 11.2.4:
Let f(x) be a unimodal pdf. If the interval [a, b] satisfies
(i) ∫_a^b f(x) dx = 1 − α,
(ii) f(a) = f(b) > 0, and
(iii) a ≤ x∗ ≤ b, where x∗ is a mode of f(x),
then the interval [a, b] is the shortest of all intervals which satisfy condition (i).

Proof:
Let [a', b'] be any interval with b' − a' < b − a. We will show that this implies
∫_{a'}^{b'} f(x) dx < 1 − α,
i.e., a contradiction.
We assume that a' ≤ a. The case a < a' is similar.
• Suppose that b' ≤ a. Then a' ≤ b' ≤ a ≤ x∗.
[Figure: Theorem 11.2.4a — unimodal density with the points a' ≤ b' ≤ a ≤ x∗ ≤ b marked.]
It follows
∫_{a'}^{b'} f(x) dx ≤ f(b')(b' − a')    | x ≤ b' ≤ x∗ ⟹ f(x) ≤ f(b')
 ≤ f(a)(b' − a')    | b' ≤ a ≤ x∗ ⟹ f(b') ≤ f(a)
 < f(a)(b − a)    | b' − a' < b − a and f(a) > 0
 ≤ ∫_a^b f(x) dx    | f(x) ≥ f(a) for a ≤ x ≤ b
 = 1 − α    | by (i)
• Suppose b' > a. We can immediately exclude that b' > b since then b' − a' > b − a, i.e., b' − a' wouldn't be of shorter length than b − a. Thus, we have to consider the case that a' ≤ a < b' < b.
[Figure: Theorem 11.2.4b — unimodal density with the points a' ≤ a < b' < b and the mode x∗ marked.]
It holds that
∫_{a'}^{b'} f(x) dx = ∫_a^b f(x) dx + ∫_{a'}^a f(x) dx − ∫_{b'}^b f(x) dx.
Note that ∫_{a'}^a f(x) dx ≤ f(a)(a − a') and ∫_{b'}^b f(x) dx ≥ f(b)(b − b'). Therefore, we get
∫_{a'}^a f(x) dx − ∫_{b'}^b f(x) dx ≤ f(a)(a − a') − f(b)(b − b')
 = f(a)((a − a') − (b − b'))    since f(a) = f(b)
 = f(a)((b' − a') − (b − a))
 < 0.
Thus,
∫_{a'}^{b'} f(x) dx < 1 − α.
Note:
Example 11.2.2 is a special case of Theorem 11.2.4. However, Theorem 11.2.4 is not immediately applicable in the following example since the length of that interval is proportional to 1/a − 1/b (and not to b − a).

Example 11.2.5:
Let X1, . . . , Xn ∼ N(μ, σ²), where μ is known. The obvious pivot for σ² is
T_{σ²}(X) = ∑(Xi − μ)²/σ² ∼ χ²_n.
So
P( a < ∑(Xi − μ)²/σ² < b ) = 1 − α
⟺ P( ∑(Xi − μ)²/b < σ² < ∑(Xi − μ)²/a ) = 1 − α.
We wish to minimize
L = (1/a − 1/b) ∑(Xi − μ)²
such that ∫_a^b fn(t) dt = 1 − α, where fn(t) is the pdf of a χ²_n distribution.
We get
fn(b) db/da − fn(a) = 0
and
dL/da = ( −1/a² + (1/b²) db/da ) ∑(Xi − μ)² = ( −1/a² + (1/b²) fn(a)/fn(b) ) ∑(Xi − μ)².
We obtain a minimum if a² fn(a) = b² fn(b).
Note that in practice equal tails χ²_{n;α/2} and χ²_{n;1−α/2} are used, which do not result in shortest-length CI's. The reason for this selection is simple: When these tests were developed, computers did not exist that could solve these equations numerically. People in general had to rely on tabulated values. Manually solving the equation above for each case obviously wasn't a feasible solution.
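With today's software the two conditions, coverage 1 − α and a² fn(a) = b² fn(b), can be solved directly. The sketch below is my own addition; the degrees of freedom n and the level α are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import fsolve

n, alpha = 10, 0.05        # chi^2_n pivot (mu known), confidence level 1 - alpha

def equations(v):
    a, b = v
    cover = chi2.cdf(b, n) - chi2.cdf(a, n) - (1 - alpha)    # condition (i): coverage 1 - alpha
    shape = a**2 * chi2.pdf(a, n) - b**2 * chi2.pdf(b, n)    # condition for shortest length L
    return [cover, shape]

# equal-tail quantiles are a reasonable starting point for the solver
start = [chi2.ppf(alpha / 2, n), chi2.ppf(1 - alpha / 2, n)]
a, b = fsolve(equations, start)
print(a, b)   # these differ from the equal-tail values chi^2_{n;1-alpha/2}, chi^2_{n;alpha/2}
```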
Example 11.2.6:
Let X1, . . . , Xn ∼ U(0, θ). Let Max_n = max Xi = X_(n). Since Tn = Max_n/θ has pdf n t^{n−1} I_{(0,1)}(t) which does not depend on θ, Tn can be selected as our pivot. The density of Tn is strictly increasing for n ≥ 2, so we cannot find constants a and b as in Example 11.2.5.
If P(a < Tn < b) = 1 − α, then P( Max_n/b < θ < Max_n/a ) = 1 − α.
We wish to minimize
L = Max_n (1/a − 1/b)
such that ∫_a^b n t^{n−1} dt = bⁿ − aⁿ = 1 − α.
We get
n b^{n−1} − n a^{n−1} da/db = 0 ⟹ da/db = b^{n−1}/a^{n−1}
and
dL/db = Max_n( −(1/a²) da/db + 1/b² ) = Max_n( −b^{n−1}/a^{n+1} + 1/b² ) = Max_n( (a^{n+1} − b^{n+1}) / (b² a^{n+1}) ) < 0 for 0 ≤ a < b ≤ 1.
Thus, L does not have a local minimum. However, since dL/db < 0, L is strictly decreasing as a function of b. It is minimized when b = 1, i.e., when b is as large as possible. The corresponding a is selected as a = α^{1/n}.
The shortest 1 − α level CI based on Tn is (Max_n, α^{−1/n} Max_n).
Lecture 32:
Mo 04/02/01
11.3 Confidence Intervals and Hypothesis Tests
(Based on Casella/Berger, Section 9.2)

Example 11.3.1:
Let X1, . . . , Xn ∼ N(μ, σ²), where σ² > 0 is known. In Example 11.2.2 we have shown that the interval
( X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n )
is a 1 − α level CI for μ.
Suppose we define a test φ of H0: μ = μ0 vs. H1: μ ≠ μ0 that rejects H0 iff μ0 does not fall in this interval. Then,
P_{μ0}(Type I error) = P_{μ0}(Reject H0 when H0 is true)
 = P_{μ0}( ( X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n ) ∌ μ0 )
 = 1 − P_{μ0}( ( X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n ) ∋ μ0 )
 = 1 − P_{μ0}( X̄ − z_{α/2} σ/√n ≤ μ0 and μ0 ≤ X̄ + z_{α/2} σ/√n )
 = P_{μ0}( X̄ − z_{α/2} σ/√n ≥ μ0 or μ0 ≥ X̄ + z_{α/2} σ/√n )
 = P_{μ0}( X̄ − μ0 ≥ z_{α/2} σ/√n or X̄ − μ0 ≤ −z_{α/2} σ/√n )
 = P_{μ0}( (X̄ − μ0)/(σ/√n) ≥ z_{α/2} or (X̄ − μ0)/(σ/√n) ≤ −z_{α/2} )
 = P_{μ0}( |X̄ − μ0|/(σ/√n) ≥ z_{α/2} )
 = α,
i.e., φ has size α. So a test based on the shortest 1 − α level CI obtained in Example 11.2.2 is equivalent to the UMPU test III of size α introduced in Definition 10.3.1 (when σ² is known).
Conversely, if φ(x, μ0) is a family of size α tests of H0: μ = μ0, the set {μ0 | φ(x, μ0) fails to reject H0} is a level 1 − α confidence set for μ0.
Theorem 11.3.2:
Denote H0(θ0) for H0: θ = θ0, and H1(θ0) for the alternative. Let A(θ0), θ0 ∈ Θ, denote the acceptance region of a level–α test of H0(θ0). For each possible observation x, define
S(x) = {θ : x ∈ A(θ), θ ∈ Θ}.
Then S(x) is a family of 1 − α level confidence sets for θ.
If, moreover, A(θ0) is UMP for (α, H0(θ0), H1(θ0)), then S(x) minimizes Pθ(S(X) ∋ θ') ∀θ ∈ H1(θ') among all 1 − α level families of confidence sets, i.e., S(x) is UMA.

Proof:
It holds that S(x) ∋ θ iff x ∈ A(θ). Therefore,
Pθ(S(X) ∋ θ) = Pθ(X ∈ A(θ)) ≥ 1 − α.
Let S'(X) be any other family of 1 − α level confidence sets. Define A'(θ) = {x : S'(x) ∋ θ}. Then,
Pθ(X ∈ A'(θ)) = Pθ(S'(X) ∋ θ) ≥ 1 − α.
Since A(θ0) is UMP, it holds that
Pθ(X ∈ A'(θ0)) ≥ Pθ(X ∈ A(θ0)) ∀θ ∈ H1(θ0).
This implies that
Pθ(S'(X) ∋ θ0) ≥ Pθ(X ∈ A(θ0)) = Pθ(S(X) ∋ θ0) ∀θ ∈ H1(θ0).

Example 11.3.3:
Let X be a rv that belongs to a one-parameter exponential family with pdf
fθ(x) = exp(Q(θ)T(x) + S'(x) + D(θ)),
where Q(θ) is non-decreasing.
We consider a test H0: θ = θ0 vs. H1: θ < θ0. By Theorem 9.3.3, the family {fθ} has a MLR in T(X). It follows by the Note after Theorem 9.3.5 that the acceptance region of a UMP size α test of H0 has the form A(θ0) = {x : T(x) > c(θ0)} and this test has a non-increasing power function.
Now consider a similar test H'0: θ = θ1 vs. H'1: θ < θ1. The acceptance region of a UMP size α test of H'0 also has the form A(θ1) = {x : T(x) > c(θ1)}.
Thus, for θ1 ≥ θ0,
P_{θ0}(T(X) ≤ c(θ0)) = α = P_{θ1}(T(X) ≤ c(θ1)) ≤ P_{θ0}(T(X) ≤ c(θ1))
(since for a UMP test, it holds that power ≥ size). Therefore, we can choose c(θ) as non-decreasing.
A level 1 − α CI for θ is then
S(x) = {θ : x ∈ A(θ)} = (−∞, c^{−1}(T(x))),
where c^{−1}(T(x)) = sup_θ {θ : c(θ) ≤ T(x)}.
Example 11.3.4:
Let X ∼ Exp(θ) with fθ(x) = (1/θ) e^{−x/θ} I_{(0,∞)}(x), which belongs to a one-parameter exponential family. Then Q(θ) = −1/θ is non-decreasing and T(x) = x.
We want to test H0: θ = θ0 vs. H1: θ < θ0.
The acceptance region of a UMP size α test of H0 has the form A(θ0) = {x : x ≥ c(θ0)}, where
α = ∫_0^{c(θ0)} f_{θ0}(x) dx = ∫_0^{c(θ0)} (1/θ0) e^{−x/θ0} dx = 1 − e^{−c(θ0)/θ0}.
Thus,
e^{−c(θ0)/θ0} = 1 − α ⟹ c(θ0) = θ0 log(1/(1−α)).
Therefore, the UMA family of 1 − α level confidence sets is of the form
S(x) = {θ : x ∈ A(θ)} = {θ : θ ≤ x/log(1/(1−α))} = [0, x/log(1/(1−α))].

Note:
Just as we frequently restrict the class of tests (when UMP tests don't exist), we can make the same sorts of restrictions on CI's.

Definition 11.3.5:
A family S(x) of confidence sets for parameter θ is said to be unbiased at level 1 − α if
Pθ(S(X) ∋ θ) ≥ 1 − α and Pθ(S(X) ∋ θ') ≤ 1 − α ∀θ, θ' ∈ Θ, θ ≠ θ'.
If S(x) is unbiased and minimizes Pθ(S(X) ∋ θ') among all unbiased CI's at level 1 − α, it is called uniformly most accurate unbiased (UMAU).
Theorem 11.3.6:
Let A(θ0) be the acceptance region of a UMPU size α test of H0: θ = θ0 vs. H1: θ ≠ θ0 (for all θ0). Then S(x) = {θ : x ∈ A(θ)} is a UMAU family of confidence sets at level 1 − α.

Proof:
Since A(θ) is unbiased, it holds that
Pθ(S(X) ∋ θ') = Pθ(X ∈ A(θ')) ≤ 1 − α.
Thus, S is unbiased.
Let S'(x) be any other unbiased family of level 1 − α confidence sets, where
A'(θ) = {x : S'(x) ∋ θ}.
It holds that
Pθ(X ∈ A'(θ')) = Pθ(S'(X) ∋ θ') ≤ 1 − α.
Therefore, A'(θ) is the acceptance region of an unbiased size α test. Thus,
Pθ(S'(X) ∋ θ') = Pθ(X ∈ A'(θ')) (∗) ≥ Pθ(X ∈ A(θ')) = Pθ(S(X) ∋ θ').
(∗) holds since A(θ) is the acceptance region of a UMPU test.
Lecture 33:
We 04/04/01
Theorem 11.3.7:
Let Θ be an interval on IR and fθ be the pdf of X. Let S(X) be a family of 1 − α level CI's, where S(X) = (θ̲(X), θ̄(X)), θ̲ and θ̄ increasing functions of X, and θ̄(X) − θ̲(X) is a finite rv.
Then it holds that
Eθ(θ̄(X) − θ̲(X)) = ∫ (θ̄(x) − θ̲(x)) fθ(x) dx = ∫_{θ' ≠ θ} Pθ(S(X) ∋ θ') dθ'  ∀θ ∈ Θ.

Proof:
It holds that θ̄ − θ̲ = ∫_{θ̲}^{θ̄} dθ'. Thus, for all θ ∈ Θ,
Eθ(θ̄(X) − θ̲(X)) = ∫_{IRⁿ} (θ̄(x) − θ̲(x)) fθ(x) dx
 = ∫_{IRⁿ} ( ∫_{θ̲(x)}^{θ̄(x)} dθ' ) fθ(x) dx
 = ∫_{IR} ( ∫_{θ̄^{−1}(θ')}^{θ̲^{−1}(θ')} fθ(x) dx ) dθ'    (interchanging the order of integration; the inner integral is over x ∈ IRⁿ)
 = ∫_{IR} Pθ( X ∈ [θ̄^{−1}(θ'), θ̲^{−1}(θ')] ) dθ'
 = ∫_{IR} Pθ(S(X) ∋ θ') dθ'
 = ∫_{θ' ≠ θ} Pθ(S(X) ∋ θ') dθ'.

Note:
Theorem 11.3.7 says that the expected length of the CI is the probability that S(X) includes the false θ', averaged over all false values of θ'.

Corollary 11.3.8:
If S(X) is UMAU, then Eθ(θ̄(X) − θ̲(X)) is minimized among all unbiased families of CI's.

Proof:
In Theorem 11.3.7 we have shown that
Eθ(θ̄(X) − θ̲(X)) = ∫_{θ' ≠ θ} Pθ(S(X) ∋ θ') dθ'.
Since a UMAU CI minimizes this probability for all θ', the entire integral is minimized.

Example 11.3.9:
Let X1, . . . , Xn ∼ N(μ, σ²), where σ² > 0 is known.
By Example 11.2.2, ( X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n ) is the shortest 1 − α level CI for μ.
By Example 9.4.3, the equivalent test is UMPU. So by Theorem 11.3.6 this interval is UMAU and by Corollary 11.3.8 it has shortest expected length as well.
Example 11.3.10:
Let X1, . . . , Xn ∼ N(μ, σ²), where μ and σ² > 0 are both unknown.
Note that
T(X, σ²) = (n−1)S²/σ² = Tσ ∼ χ²_{n−1}.
Thus,
P_{σ²}( λ1 < (n−1)S²/σ² < λ2 ) = 1 − α ⟺ P_{σ²}( (n−1)S²/λ2 < σ² < (n−1)S²/λ1 ) = 1 − α.
We now define P(γ) as
P_{σ²}( (n−1)S²/λ2 < σ'² < (n−1)S²/λ1 ) = P( Tσ/λ2 < γ < Tσ/λ1 ) = P( λ1γ < Tσ < λ2γ ) = P(γ),
where γ = σ'²/σ².
If our test is unbiased, then it follows from Definition 11.3.5 that
P(1) = 1 − α and P(γ) < 1 − α ∀γ ≠ 1.
This implies that we can find λ1, λ2 such that P(1) = 1 − α and
dP(γ)/dγ |_{γ=1} = ( d/dγ ∫_{λ1γ}^{λ2γ} f_{Tσ}(t) dt ) |_{γ=1}
 (∗) = ( f_{Tσ}(λ2γ)·λ2 − f_{Tσ}(λ1γ)·λ1 + 0 ) |_{γ=1}
 = λ2 f_{Tσ}(λ2) − λ1 f_{Tσ}(λ1)
 = 0,
where f_{Tσ} is the pdf that relates to Tσ, i.e., the pdf of a χ²_{n−1} distribution. (∗) follows from Leibniz's Rule (Theorem 3.2.4). We can solve for λ1, λ2 numerically. Then,
( (n−1)S²/λ2, (n−1)S²/λ1 )
is an unbiased 1 − α level CI for σ².
Rohatgi, Theorem 4(b), page 428–429, states that the related test is UMPU. Therefore, by Theorem 11.3.6 and Corollary 11.3.8, our CI is UMAU with shortest expected length among all unbiased intervals.
Note that this CI is different from the equal-tail CI based on Definition 10.2.1, III, and from the shortest-length CI obtained in Example 11.2.5.
11.4 Bayes Confidence Intervals
(Based on Casella/Berger, Section 9.2.4)

Definition 11.4.1:
Given a posterior distribution h(θ | x), a level 1 − α credible set (Bayesian confidence set) is any set A such that
P(θ ∈ A | x) = ∫_A h(θ | x) dθ = 1 − α.

Note:
If A is an interval, we speak of a Bayesian confidence interval.

Example 11.4.2:
Let X ∼ Bin(n, p) and π(p) ∼ U(0, 1).
In Example 8.8.11, we have shown that
h(p | x) = p^x (1−p)^{n−x} / ∫_0^1 p^x (1−p)^{n−x} dp · I_{(0,1)}(p)
 = B(x+1, n−x+1)^{−1} p^x (1−p)^{n−x} I_{(0,1)}(p)
 = [ Γ(n+2) / ( Γ(x+1) Γ(n−x+1) ) ] p^x (1−p)^{n−x} I_{(0,1)}(p)
 ∼ Beta(x+1, n−x+1),
where B(a, b) = Γ(a)Γ(b)/Γ(a+b) is the beta function evaluated for a and b and Beta(x+1, n−x+1) represents a Beta distribution with parameters x+1 and n−x+1.
Using the observed value for x and tables for incomplete beta integrals or a numerical approach, we can find λ1 and λ2 such that P_{p|x}(λ1 < p < λ2) = 1 − α. So (λ1, λ2) is a credible interval for p.

Note:
(i) The definitions and interpretations of credible intervals and confidence intervals are quite different. Therefore, very different intervals may result.
(ii) We can often use Theorem 11.2.4 to find the shortest credible interval (if the preconditions hold).
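Numerically, the equal-tail credible interval of Example 11.4.2 is just a pair of Beta quantiles; a minimal sketch (my own addition, with invented data):

```python
from scipy.stats import beta

n, x, alpha = 20, 7, 0.05            # hypothetical binomial data and level
posterior = beta(x + 1, n - x + 1)   # Beta(x+1, n-x+1) posterior under the U(0,1) prior

lam1 = posterior.ppf(alpha / 2)      # lower and upper posterior quantiles
lam2 = posterior.ppf(1 - alpha / 2)
print(lam1, lam2)                    # equal-tail 1 - alpha credible interval for p
```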
Lecture 34:
Mo 04/09/01
Example 11.4.3:
Let X1, . . . , Xn be iid N(μ, 1) and π(μ) ∼ N(0, 1). We want to construct a Bayesian level 1 − α CI for μ.
By Definition 8.8.7, the posterior distribution of μ given x is
h(μ | x) = π(μ) f(x | μ) / g(x),
where
g(x) = ∫_{−∞}^{∞} f(x, μ) dμ
 = ∫_{−∞}^{∞} π(μ) f(x | μ) dμ
 = ∫_{−∞}^{∞} (1/√(2π)) exp(−μ²/2) · (1/(√(2π))ⁿ) exp( −(1/2) ∑_{i=1}^n (xi − μ)² ) dμ
 = ∫_{−∞}^{∞} (1/(2π)^{(n+1)/2}) exp( −(1/2)( ∑_{i=1}^n xi² − 2μ ∑_{i=1}^n xi + nμ² ) − (1/2)μ² ) dμ
 = ∫_{−∞}^{∞} (1/(2π)^{(n+1)/2}) exp( −(1/2) ∑_{i=1}^n xi² + nμx̄ − (n/2)μ² − (1/2)μ² ) dμ
 = (1/(2π)^{(n+1)/2}) exp( −(1/2) ∑_{i=1}^n xi² ) ∫_{−∞}^{∞} exp( −((n+1)/2)( μ² − 2μ·nx̄/(n+1) ) ) dμ
 = (1/(2π)^{(n+1)/2}) exp( −(1/2) ∑_{i=1}^n xi² ) ∫_{−∞}^{∞} exp( −((n+1)/2)( μ − nx̄/(n+1) )² + ((n+1)/2)( nx̄/(n+1) )² ) dμ
 = (1/(2π)^{(n+1)/2}) exp( −(1/2) ∑_{i=1}^n xi² + n²x̄²/(2(n+1)) ) √(2π/(n+1)) ∫_{−∞}^{∞} (1/√(2π/(n+1))) exp( −(1/2)·(1/(1/(n+1)))·( μ − nx̄/(n+1) )² ) dμ
   [the last integral equals 1 since its integrand is the pdf of a N( nx̄/(n+1), 1/(n+1) ) distribution]
 = ( (n+1)^{−1/2} / (2π)^{n/2} ) exp( −(1/2) ∑_{i=1}^n xi² + n²x̄²/(2(n+1)) ).
Therefore,
h(μ | x) = π(μ) f(x | μ) / g(x)
 = [ (1/√(2π)) exp(−μ²/2) (1/(√(2π))ⁿ) exp( −(1/2) ∑_{i=1}^n (xi − μ)² ) ] / [ ( (n+1)^{−1/2}/(2π)^{n/2} ) exp( −(1/2) ∑_{i=1}^n xi² + n²x̄²/(2(n+1)) ) ]
 = ( √(n+1)/√(2π) ) exp( −(1/2) ∑_{i=1}^n xi² + nμx̄ − ((n+1)/2)μ² + (1/2) ∑_{i=1}^n xi² − n²x̄²/(2(n+1)) )
 = ( √(n+1)/√(2π) ) exp( −((n+1)/2)( μ² − 2μ·nx̄/(n+1) + n²x̄²/(n+1)² ) )
 = ( 1/√(2π/(n+1)) ) exp( −(1/2)·(1/(1/(n+1)))·( μ − nx̄/(n+1) )² ),
i.e.,
h(μ | x) ∼ N( nx̄/(n+1), 1/(n+1) ).
Therefore, a Bayesian level 1 − α CI for μ is
( (n/(n+1)) X̄ − z_{α/2}/√(n+1), (n/(n+1)) X̄ + z_{α/2}/√(n+1) ).
The shortest ("classical") level 1 − α CI for μ (treating μ as fixed) is
( X̄ − z_{α/2}/√n, X̄ + z_{α/2}/√n )
as seen in Example 11.2.2.
Thus, the Bayesian CI is slightly shorter than the classical CI since we use additional information in constructing the Bayesian CI.
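For concreteness, the two intervals at the end of Example 11.4.3 can be computed side by side; this is my own illustration with simulated data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, mu_true, alpha = 30, 1.0, 0.05
x = rng.normal(mu_true, 1.0, size=n)
xbar, z = x.mean(), norm.ppf(1 - alpha / 2)

bayes = (n / (n + 1) * xbar - z / np.sqrt(n + 1), n / (n + 1) * xbar + z / np.sqrt(n + 1))
classical = (xbar - z / np.sqrt(n), xbar + z / np.sqrt(n))
print(bayes, classical)   # the Bayesian interval is shorter: width 2z/sqrt(n+1) vs. 2z/sqrt(n)
```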
12 Nonparametric Inference
12.1 Nonparametric Estimation

Definition 12.1.1:
A statistical method which does not rely on assumptions about the distributional form of a rv (except, perhaps, that it is absolutely continuous, or purely discrete) is called a nonparametric or distribution-free method.

Note:
Unless otherwise specified, we make the following assumptions for the remainder of this chapter: Let X1, . . . , Xn be iid ∼ F, where F is unknown. Let P be the class of all possible distributions of X.

Definition 12.1.2:
A statistic T(X) is sufficient for a family of distributions P if the conditional distribution of X given T = t is the same for all F ∈ P.

Example 12.1.3:
Let X1, . . . , Xn be absolutely continuous. Let T = (X_(1), . . . , X_(n)) be the order statistics. It holds that
f(x | T = t) = 1/n!,
so T is sufficient for the family of absolutely continuous distributions on IR.

Definition 12.1.4:
A family of distributions P is complete if the only unbiased estimate of 0 is the 0 itself, i.e.,
E_F(h(X)) = 0 ∀F ∈ P ⟹ h(x) = 0 ∀x.

Definition 12.1.5:
A statistic T(X) is complete in relation to P if the class of induced distributions of T is complete.

Theorem 12.1.6:
The order statistic (X_(1), . . . , X_(n)) is a complete sufficient statistic, provided that X1, . . . , Xn are of either (pure) discrete or (pure) continuous type.

Definition 12.1.7:
A parameter g(F) is called estimable if it has an unbiased estimate, i.e., if there exists a T(X) such that
E_F(T(X)) = g(F) ∀F ∈ P.

Example 12.1.8:
Let P be the class of distributions for which second moments exist. Then X̄ is unbiased for μ(F) = ∫ x dF(x). Thus, μ(F) is estimable.

Definition 12.1.9:
The degree m of an estimable parameter g(F) is the smallest sample size for which an unbiased estimate exists for all F ∈ P.
An unbiased estimate based on a sample of size m is called a kernel.

Lemma 12.1.10:
There exists a symmetric kernel for every estimable parameter.

Proof:
Let T(X1, . . . , Xm) be a kernel of g(F). Define
Ts(X1, . . . , Xm) = (1/m!) ∑ T(X_{i1}, . . . , X_{im}),
where the summation is over all m! permutations (i1, . . . , im) of {1, . . . , m}.
Clearly Ts is symmetric and E(Ts) = g(F).
Example 12.1.11:
(i) E(X1) = μ(F), so μ(F) has degree 1 with kernel X1.
(ii) E(I_{(c,∞)}(X1)) = P_F(X > c), where c is a known constant. So g(F) = P_F(X > c) has degree 1 with kernel I_{(c,∞)}(X1).
(iii) There exists no T(X1) such that E(T(X1)) = σ²(F) = ∫ (x − μ(F))² dF(x).
But E(T(X1, X2)) = E(X1² − X1X2) = σ²(F). So σ²(F) has degree 2 with kernel X1² − X1X2. Note that X2² − X2X1 is another kernel.
(iv) A symmetric kernel for σ²(F) is
Ts(X1, X2) = (1/2)( (X1² − X1X2) + (X2² − X1X2) ) = (1/2)(X1 − X2)².
Lecture 35:
We 04/11/01
Definition 12.1.12:
Let g(F) be an estimable parameter of degree m. Let X_1, \ldots, X_n be a sample of size n, n \geq m.
Given a kernel T(X_{i_1}, \ldots, X_{i_m}) of g(F), we define a U–statistic by

U(X_1, \ldots, X_n) = \frac{1}{\binom{n}{m}} \sum_{c} T_s(X_{i_1}, \ldots, X_{i_m}),

where T_s is defined as in Lemma 12.1.10 and the summation c is over all \binom{n}{m} combinations of m integers (i_1, \ldots, i_m) from \{1, \ldots, n\}. U(X_1, \ldots, X_n) is symmetric in the X_i's and E_F(U(X)) = g(F) for all F.
Example 12.1.13:
For estimating \mu(F), where the degree m of \mu(F) is 1:

Symmetric kernel:
T_s(X_i) = X_i, \quad i = 1, \ldots, n

U–statistic:
U(X) = \frac{1}{\binom{n}{1}} \sum_{c} X_i = \frac{1!\,(n-1)!}{n!} \sum_{c} X_i = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}

For estimating \sigma^2(F), where the degree m of \sigma^2(F) is 2:

Symmetric kernel:
T_s(X_{i_1}, X_{i_2}) = \frac{1}{2}(X_{i_1} - X_{i_2})^2, \quad i_1, i_2 = 1, \ldots, n,\; i_1 \neq i_2

U–statistic:
U_{\sigma^2}(X) = \frac{1}{\binom{n}{2}} \sum_{i_1 < i_2} \frac{1}{2}(X_{i_1} - X_{i_2})^2
= \frac{1}{\binom{n}{2}} \cdot \frac{1}{4} \sum_{i_1 \neq i_2} (X_{i_1} - X_{i_2})^2
= \frac{(n-2)!\,2!}{n!} \cdot \frac{1}{4} \sum_{i_1 \neq i_2} (X_{i_1} - X_{i_2})^2

154

= \frac{1}{2n(n-1)} \sum_{i_1} \sum_{i_2 \neq i_1} \left( X_{i_1}^2 - 2 X_{i_1} X_{i_2} + X_{i_2}^2 \right)
= \frac{1}{2n(n-1)} \left( (n-1)\sum_{i_1=1}^{n} X_{i_1}^2 - 2\Big(\sum_{i_1=1}^{n} X_{i_1}\Big)\Big(\sum_{i_2=1}^{n} X_{i_2}\Big) + 2\sum_{i=1}^{n} X_i^2 + (n-1)\sum_{i_2=1}^{n} X_{i_2}^2 \right)
= \frac{1}{2n(n-1)} \left( n\sum_{i_1=1}^{n} X_{i_1}^2 - \sum_{i_1=1}^{n} X_{i_1}^2 - 2\Big(\sum_{i_1=1}^{n} X_{i_1}\Big)^2 + 2\sum_{i=1}^{n} X_i^2 + n\sum_{i_2=1}^{n} X_{i_2}^2 - \sum_{i_2=1}^{n} X_{i_2}^2 \right)
= \frac{1}{n(n-1)} \left[ n\sum_{i=1}^{n} X_i^2 - \Big(\sum_{i=1}^{n} X_i\Big)^2 \right]
= \frac{1}{n(n-1)} \left[ n\sum_{i=1}^{n} \Big(X_i - \frac{1}{n}\sum_{j=1}^{n} X_j\Big)^2 \right]
= \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
= S^2
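The identity U_{\sigma^2}(X) = S^2 can be verified numerically. A minimal sketch in S (the data vector is an arbitrary illustration):

x <- c(2.3, -0.7, 1.1, 0.4, 3.2)                 # arbitrary illustration data
n <- length(x)
d <- outer(x, x, "-")                            # matrix of all differences x_i - x_j
u.sigma2 <- sum(d^2) / (4 * (n * (n - 1) / 2))   # (1 / C(n,2)) * (1/4) * sum_{i1 != i2} (x_i1 - x_i2)^2
u.sigma2
var(x)                                           # S^2 with divisor n - 1; agrees with u.sigma2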
Theorem 12.1.14:
LetPbe the class of all absolutely continuous or all purely discrete distribution functions on
IR. Any estimable functiong(F), F∈P, has a unique estimate that is unbiased and sym-
metric in the observations and has uniformly minimum variance among all unbiased estimates.
Proof:
Let X_1, \ldots, X_n \stackrel{iid}{\sim} F \in P, with T(X_1, \ldots, X_n) an unbiased estimate of g(F).
We define

T_i = T_i(X_1, \ldots, X_n) = T(X_{i_1}, X_{i_2}, \ldots, X_{i_n}), \quad i = 1, 2, \ldots, n!,

where i runs over all possible permutations of \{1, \ldots, n\}. Let

\bar{T} = \frac{1}{n!} \sum_{i=1}^{n!} T_i.
155

Then

E_F(\bar{T}) = g(F)

and

Var(\bar{T}) = E(\bar{T}^2) - (E(\bar{T}))^2
= E\left[ \Big( \frac{1}{n!} \sum_{i=1}^{n!} T_i \Big)^2 \right] - [g(F)]^2
= E\left[ \Big(\frac{1}{n!}\Big)^2 \sum_{i=1}^{n!} \sum_{j=1}^{n!} T_i T_j \right] - [g(F)]^2
\leq E\left[ \Big(\frac{1}{n!}\Big)^2 \sum_{i=1}^{n!} \sum_{j=1}^{n!} \frac{T_i^2 + T_j^2}{2} \right] - [g(F)]^2 \qquad \text{(since } T_i T_j \leq \tfrac{1}{2}(T_i^2 + T_j^2)\text{)}
= E\left[ \frac{1}{n!} \sum_{i=1}^{n!} T_i^2 \right] - [g(F)]^2
= E(T^2) - [g(F)]^2 \qquad \text{(since each } T_i \text{ has the same distribution as } T\text{)}
= Var(T)

Equality holds iff T_i = T_j \;\forall\, i, j = 1, \ldots, n!
\Longrightarrow T is symmetric in (X_1, \ldots, X_n) and \bar{T} = T
\Longrightarrow by Rohatgi, Problem 4, page 538, T is a function of the order statistics
\Longrightarrow by Rohatgi, Theorem 1, page 535, T is a complete sufficient statistic
\Longrightarrow by Note (i) following Theorem 8.4.12, T is UMVUE
Corollary 12.1.15:
IfT(X1, . . . , Xn) is unbiased forg(F), F∈P, the correspondingU–statistic is an essentially
unique UMVUE.
156

Definition 12.1.16:
Suppose we have independent samples X_1, \ldots, X_m \stackrel{iid}{\sim} F \in P and Y_1, \ldots, Y_n \stackrel{iid}{\sim} G \in P (G may or may not equal F). Let g(F, G) be an estimable function with unbiased estimator T(X_1, \ldots, X_k, Y_1, \ldots, Y_l). Define

T_s(X_1, \ldots, X_k, Y_1, \ldots, Y_l) = \frac{1}{k!\, l!} \sum_{P_X} \sum_{P_Y} T(X_{i_1}, \ldots, X_{i_k}, Y_{j_1}, \ldots, Y_{j_l})

(where P_X and P_Y are permutations of X and Y) and

U(X, Y) = \frac{1}{\binom{m}{k} \binom{n}{l}} \sum_{C_X} \sum_{C_Y} T_s(X_{i_1}, \ldots, X_{i_k}, Y_{j_1}, \ldots, Y_{j_l})

(where C_X and C_Y are combinations of X and Y).
U is called a generalized U–statistic.
Example 12.1.17:
Let X_1, \ldots, X_m and Y_1, \ldots, Y_n be independent random samples from F and G, respectively, with F, G \in P. We wish to estimate

g(F, G) = P_{F,G}(X \leq Y).

Let us define

Z_{ij} = \begin{cases} 1, & X_i \leq Y_j \\ 0, & X_i > Y_j \end{cases}

for each pair X_i, Y_j, i = 1, 2, \ldots, m, j = 1, 2, \ldots, n.
Then \sum_{i=1}^{m} Z_{ij} is the number of X's \leq Y_j, and \sum_{j=1}^{n} Z_{ij} is the number of Y's > X_i.
It holds that E(I(X_i \leq Y_j)) = g(F, G) = P_{F,G}(X \leq Y), and the degrees k and l both equal 1, so we use

U(X, Y) = \frac{1}{\binom{m}{1} \binom{n}{1}} \sum_{C_X} \sum_{C_Y} T_s(X_{i_1}, Y_{j_1})
= \frac{(m-1)!\,(n-1)!}{m!\, n!} \sum_{C_X} \sum_{C_Y} \frac{1}{1!\,1!} \sum_{P_X} \sum_{P_Y} T(X_{i_1}, Y_{j_1})
= \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I(X_i \leq Y_j).

This Mann–Whitney estimator (or Wilcoxon 2–sample estimator) is unbiased and symmetric in the X's and Y's. It follows by Corollary 12.1.15 that it has minimum variance.
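A minimal numerical sketch of this estimator in S (the two data vectors are arbitrary illustrations):

x <- c(1.2, 0.3, 2.5, 1.9)
y <- c(0.8, 2.2, 3.1, 1.4, 2.7)
m <- length(x)
n <- length(y)
u <- sum(outer(x, y, "<=")) / (m * n)   # (1/(mn)) * sum_i sum_j I(x_i <= y_j)
u                                       # estimate of P(X <= Y)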
157

Lecture 36:
Fr 04/13/01
12.2 Single-Sample Hypothesis Tests
Let X_1, \ldots, X_n be a sample from a distribution F. The problem of fit is to test the hypothesis that the sample X_1, \ldots, X_n is from some specified distribution against the alternative that it is from some other distribution, i.e., H_0 : F = F_0 vs. H_1 : F(x) \neq F_0(x) for some x.
Definition 12.2.1:
Let X_1, \ldots, X_n \stackrel{iid}{\sim} F, and let the corresponding empirical cdf be

F_n^*(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i).

The statistic

D_n = \sup_{x} |F_n^*(x) - F(x)|

is called the two–sided Kolmogorov–Smirnov statistic (K–S statistic).
The one–sided K–S statistics are

D_n^+ = \sup_{x} [F_n^*(x) - F(x)] \quad \text{and} \quad D_n^- = \sup_{x} [F(x) - F_n^*(x)].
Theorem 12.2.2:
For any continuous distribution F, the K–S statistics D_n, D_n^-, D_n^+ are distribution free.

Proof:
Let X_{(1)}, \ldots, X_{(n)} be the order statistics of X_1, \ldots, X_n, i.e., X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}, and define X_{(0)} = -\infty and X_{(n+1)} = +\infty.
Then,

F_n^*(x) = \frac{i}{n} \quad \text{for } X_{(i)} \leq x < X_{(i+1)}, \; i = 0, \ldots, n.

Therefore,

D_n^+ = \max_{0 \leq i \leq n} \left\{ \sup_{X_{(i)} \leq x < X_{(i+1)}} \left[ \frac{i}{n} - F(x) \right] \right\}
= \max_{0 \leq i \leq n} \left\{ \frac{i}{n} - \left[ \inf_{X_{(i)} \leq x < X_{(i+1)}} F(x) \right] \right\}
\stackrel{(*)}{=} \max_{0 \leq i \leq n} \left\{ \frac{i}{n} - F(X_{(i)}) \right\}
= \max\left\{ \max_{1 \leq i \leq n} \left( \frac{i}{n} - F(X_{(i)}) \right),\, 0 \right\}

(*) holds since F is nondecreasing on [X_{(i)}, X_{(i+1)}).
158

Note that D_n^+ is a function of F(X_{(i)}). In order to make some inference about D_n^+, the distribution of F(X_{(i)}) must be known. We know from the Probability Integral Transformation (see Rohatgi, page 203, Theorem 1) that for a rv X with continuous cdf F_X it holds that F_X(X) \sim U(0, 1).
Thus, F(X_{(i)}) is the i^{th} order statistic of a sample from U(0, 1), independent of F. Therefore, the distribution of D_n^+ is independent of F.
Similarly, the distribution of

D_n^- = \max\left\{ \max_{1 \leq i \leq n} \left( F(X_{(i)}) - \frac{i-1}{n} \right),\, 0 \right\}

is independent of F.
Since

D_n = \sup_{x} |F_n^*(x) - F(x)| = \max\{ D_n^+, D_n^- \},

the distribution of D_n is also independent of F.
Theorem 12.2.3:
If F is continuous, then

P\left( D_n \leq \nu + \frac{1}{2n} \right) =
\begin{cases}
0, & \text{if } \nu \leq 0 \\[4pt]
\displaystyle \int_{\frac{1}{2n}-\nu}^{\nu+\frac{1}{2n}} \int_{\frac{3}{2n}-\nu}^{\nu+\frac{3}{2n}} \cdots \int_{\frac{2n-1}{2n}-\nu}^{\nu+\frac{2n-1}{2n}} f(u)\, du, & \text{if } 0 < \nu < \frac{2n-1}{2n} \\[4pt]
1, & \text{if } \nu \geq \frac{2n-1}{2n}
\end{cases}

where

f(u) = f(u_1, \ldots, u_n) =
\begin{cases}
n!, & \text{if } 0 < u_1 < u_2 < \ldots < u_n < 1 \\
0, & \text{otherwise}
\end{cases}

is the joint pdf of the order statistic of a sample of size n from U(0, 1).
Note:
As Gibbons & Chakraborti (1992), pages 108–109, point out, this result must be interpreted carefully. Consider the case n = 2.
For 0 < \nu < \frac{3}{4}, it holds that

P\left( D_2 \leq \nu + \frac{1}{4} \right) = \underset{0 < u_1 < u_2 < 1}{\int_{\frac{1}{4}-\nu}^{\nu+\frac{1}{4}} \int_{\frac{3}{4}-\nu}^{\nu+\frac{3}{4}}} 2!\; du_2\, du_1.

Note that the integration limits overlap if

\nu + \frac{1}{4} \geq -\nu + \frac{3}{4} \iff \nu \geq \frac{1}{4}.
159

When 0< ν <
1
4
, it automatically holds that 0< u1< u2<1. Thus, for 0< ν <
1
4
, it holds
that
P(D2≤ν+
1
4
) =
Z
ν+
1
4
1
4
−ν
Z
ν+
3
4
3
4
−ν
2!du2du1
= 2!
Z
ν+
1
4
1
4
−ν
`
u2|
ν+
3
4
3
4
−ν
´
du1
= 2!
Z
ν+
1
4
1
4
−ν
2ν du1
= 2! (2ν)u1|
ν+
1
4
1
4
−ν
= 2! (2ν)
2
For \frac{1}{4} \leq \nu < \frac{3}{4}, the region of integration is as follows:

[Figure (Note to Theorem 12.2.3): the unit square in the (u_1, u_2) plane. The part of the region 0 < u_1 < u_2 < 1 covered by the integration limits is split by the lines u_1 = \frac{3}{4} - \nu and u_2 = \frac{3}{4} - \nu into Area 1 and Area 2, with u_1 running up to \nu + \frac{1}{4}.]
Thus, for \frac{1}{4} \leq \nu < \frac{3}{4}, it holds that

P\left( D_2 \leq \nu + \frac{1}{4} \right) = \underset{0 < u_1 < u_2 < 1}{\int_{\frac{1}{4}-\nu}^{\nu+\frac{1}{4}} \int_{\frac{3}{4}-\nu}^{\nu+\frac{3}{4}}} 2!\; du_2\, du_1
= \int_{\frac{3}{4}-\nu}^{\nu+\frac{1}{4}} \int_{u_1}^{1} 2!\; du_2\, du_1 + \int_{0}^{\frac{3}{4}-\nu} \int_{\frac{3}{4}-\nu}^{1} 2!\; du_2\, du_1
= 2 \left[ \int_{\frac{3}{4}-\nu}^{\nu+\frac{1}{4}} \left( u_2 \Big|_{u_1}^{1} \right) du_1 + \int_{0}^{\frac{3}{4}-\nu} \left( u_2 \Big|_{\frac{3}{4}-\nu}^{1} \right) du_1 \right]
= 2 \left[ \int_{\frac{3}{4}-\nu}^{\nu+\frac{1}{4}} (1 - u_1)\, du_1 + \int_{0}^{\frac{3}{4}-\nu} \left( 1 - \frac{3}{4} + \nu \right) du_1 \right]

160

= 2 \left[ \left( u_1 - \frac{u_1^2}{2} \right) \Big|_{\frac{3}{4}-\nu}^{\nu+\frac{1}{4}} + \left( \frac{u_1}{4} + \nu u_1 \right) \Big|_{0}^{\frac{3}{4}-\nu} \right]
= 2 \left[ \left( \nu + \frac{1}{4} \right) - \frac{(\nu + \frac{1}{4})^2}{2} - \left( -\nu + \frac{3}{4} \right) + \frac{(-\nu + \frac{3}{4})^2}{2} + \frac{-\nu + \frac{3}{4}}{4} + \nu \left( -\nu + \frac{3}{4} \right) \right]
= 2 \left[ \nu + \frac{1}{4} - \frac{\nu^2}{2} - \frac{\nu}{4} - \frac{1}{32} + \nu - \frac{3}{4} + \frac{\nu^2}{2} - \frac{3}{4}\nu + \frac{9}{32} - \frac{\nu}{4} + \frac{3}{16} - \nu^2 + \frac{3}{4}\nu \right]
= 2 \left[ -\nu^2 + \frac{3}{2}\nu - \frac{1}{16} \right]
= -2\nu^2 + 3\nu - \frac{1}{8}
Combining these results gives

P\left( D_2 \leq \nu + \frac{1}{4} \right) =
\begin{cases}
0, & \text{if } \nu \leq 0 \\
2!\,(2\nu)^2, & \text{if } 0 < \nu < \frac{1}{4} \\
-2\nu^2 + 3\nu - \frac{1}{8}, & \text{if } \frac{1}{4} \leq \nu < \frac{3}{4} \\
1, & \text{if } \nu \geq \frac{3}{4}
\end{cases}
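This piecewise formula can be checked by simulation. A small Monte Carlo sketch in S (the choices ν = 0.4 and 10000 replications are arbitrary):

nu <- 0.4
nsim <- 10000
count <- 0
for (k in 1:nsim) {
  u <- sort(runif(2))              # order statistics of a U(0,1) sample of size n = 2
  dplus <- max((1:2)/2 - u, 0)     # D_2^+ as in the proof of Theorem 12.2.2
  dminus <- max(u - (0:1)/2, 0)    # D_2^-
  if (max(dplus, dminus) <= nu + 1/4) count <- count + 1
}
count / nsim                       # simulated P(D_2 <= nu + 1/4)
-2 * nu^2 + 3 * nu - 1/8           # exact value 0.755 for nu = 0.4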
Lecture 37:
Mo 04/16/01
Theorem 12.2.4:
Let F be a continuous cdf. Then it holds \forall z \geq 0:

\lim_{n \to \infty} P\left( D_n \leq \frac{z}{\sqrt{n}} \right) = L_1(z) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} \exp(-2 i^2 z^2).
Theorem 12.2.5:
Let F be a continuous cdf. Then it holds:

P(D_n^+ \leq z) = P(D_n^- \leq z) =
\begin{cases}
0, & \text{if } z \leq 0 \\[4pt]
\displaystyle \int_{1-z}^{1} \int_{\frac{n-1}{n}-z}^{u_n} \cdots \int_{\frac{2}{n}-z}^{u_3} \int_{\frac{1}{n}-z}^{u_2} f(u)\, du, & \text{if } 0 < z < 1 \\[4pt]
1, & \text{if } z \geq 1
\end{cases}

where f(u) is defined in Theorem 12.2.3.
Note:
It should be obvious that the statistics D_n^+ and D_n^- have the same distribution because of symmetry.
161

Theorem 12.2.6:
Let F be a continuous cdf. Then it holds \forall z \geq 0:

\lim_{n \to \infty} P\left( D_n^+ \leq \frac{z}{\sqrt{n}} \right) = \lim_{n \to \infty} P\left( D_n^- \leq \frac{z}{\sqrt{n}} \right) = L_2(z) = 1 - \exp(-2 z^2)

Corollary 12.2.7:
Let V_n = 4n\,(D_n^+)^2. Then it holds V_n \stackrel{d}{\longrightarrow} \chi^2_2, i.e., this transformation of D_n^+ has an asymptotic \chi^2_2 distribution.
Proof:
Let x \geq 0. Then it follows:

\lim_{n \to \infty} P(V_n \leq x) \stackrel{x = 4z^2}{=} \lim_{n \to \infty} P(V_n \leq 4z^2)
= \lim_{n \to \infty} P(4n\,(D_n^+)^2 \leq 4z^2)
= \lim_{n \to \infty} P(\sqrt{n}\, D_n^+ \leq z)
\stackrel{Th.\,12.2.6}{=} 1 - \exp(-2z^2)
\stackrel{4z^2 = x}{=} 1 - \exp(-x/2)

Thus, \lim_{n \to \infty} P(V_n \leq x) = 1 - \exp(-x/2) for x \geq 0. Note that this is the cdf of a \chi^2_2 distribution.
Definition 12.2.8:
Let D_{n;\alpha} be the smallest value such that P(D_n > D_{n;\alpha}) \leq \alpha. Likewise, let D^+_{n;\alpha} be the smallest value such that P(D_n^+ > D^+_{n;\alpha}) \leq \alpha.
The Kolmogorov–Smirnov test (K–S test) rejects H_0 : F(x) = F_0(x) \;\forall x at level \alpha if D_n > D_{n;\alpha}.
It rejects H_0' : F(x) \geq F_0(x) \;\forall x at level \alpha if D_n^- > D^+_{n;\alpha}, and it rejects H_0'' : F(x) \leq F_0(x) \;\forall x at level \alpha if D_n^+ > D^+_{n;\alpha}.
Note:
Rohatgi, Table 7, page 661, gives values of D_{n;\alpha} and D^+_{n;\alpha} for selected values of \alpha and small n. Theorems 12.2.4 and 12.2.6 allow the approximation of D_{n;\alpha} and D^+_{n;\alpha} for large n.
162

Example 12.2.9:
Let X_1, \ldots, X_n \sim C(1, 0). We want to test whether H_0 : X \sim N(0, 1).
The following data has been observed for x_{(1)}, \ldots, x_{(10)}:
-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, and 4.68
The results for the K–S test have been obtained through the following S–Plus session, i.e., D_{10}^+ = 0.02219616, D_{10}^- = 0.3025681, and D_{10} = 0.3025681:
> x _ c(-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68)
> FX _ pnorm(x)
> FX
[1] 0.07780384 0.33359782 0.42465457 0.60256811 0.61791142 0.67364478
[7] 0.73891370 0.83147239 0.97558081 0.99999857
> Dp _ (1:10)/10 - FX
> Dp
[1] 2.219616e-02 -1.335978e-01 -1.246546e-01 -2.025681e-01 -1.179114e-01
[6] -7.364478e-02 -3.891370e-02 -3.147239e-02 -7.558081e-02 1.434375e-06
> Dm _ FX - (0:9)/10
> Dm
[1] 0.07780384 0.23359782 0.22465457 0.30256811 0.21791142 0.17364478
[7] 0.13891370 0.13147239 0.17558081 0.09999857
> max(Dp)
[1] 0.02219616
> max(Dm)
[1] 0.3025681
> max(max(Dp), max(Dm))
[1] 0.3025681
>
> ks.gof(x, alternative = "two.sided", mean = 0, sd = 1)
One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = normal
data: x
ks = 0.3026, p-value = 0.2617
alternative hypothesis:
True cdf is not the normal distn. with the specified parameters
Using Rohatgi, Table 7, page 661, we have to use D_{10; 0.20} = 0.323 for \alpha = 0.20. Since D_{10} = 0.3026 < 0.323 = D_{10; 0.20}, it follows that p > 0.20. The K–S test does not reject H_0 at level \alpha = 0.20. As S–Plus shows, the precise p–value is even p = 0.2617.
163

Note:
Comparison between \chi^2 and K–S goodness of fit tests:
• K–S uses all available data; \chi^2 bins the data and loses information
• K–S works for all sample sizes; \chi^2 requires large sample sizes
• it is more difficult to modify K–S for estimated parameters; \chi^2 can be easily adapted for estimated parameters
• K–S is “conservative” for discrete data, i.e., it tends to accept H_0 for such data
• the order matters for K–S; \chi^2 is better for unordered categorical data
164

12.3 More on Order Statistics
Definition 12.3.1:
LetFbe a continuous cdf. Atolerance intervalforFwithtolerance coefficientγis
a random interval such that the probability isγthat this random interval covers at least a
specified percentage 100p% of the distribution.
Theorem 12.3.2:
If order statistics X_{(r)} < X_{(s)} are used as the endpoints of a tolerance interval for a continuous cdf F, it holds that

\gamma = \sum_{i=0}^{s-r-1} \binom{n}{i} p^i (1-p)^{n-i}.
Proof:
According to Definition 12.3.1, it holds that

\gamma = P_{X_{(r)}, X_{(s)}}\left( P_X(X_{(r)} < X < X_{(s)}) \geq p \right).

Since F is continuous, it holds that F_X(X) \sim U(0, 1). Therefore,

P_X(X_{(r)} < X < X_{(s)}) = P(X < X_{(s)}) - P(X \leq X_{(r)}) = F(X_{(s)}) - F(X_{(r)}) = U_{(s)} - U_{(r)},

where U_{(s)} and U_{(r)} are the order statistics of a U(0, 1) distribution.
Thus,

\gamma = P_{X_{(r)}, X_{(s)}}\left( P_X(X_{(r)} < X < X_{(s)}) \geq p \right) = P(U_{(s)} - U_{(r)} \geq p).

By Theorem 4.4.4, we can determine the joint distribution of order statistics and calculate \gamma as

\gamma = \int_{p}^{1} \int_{0}^{y-p} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\; x^{r-1} (y-x)^{s-r-1} (1-y)^{n-s}\, dx\, dy.

Rather than solving this integral directly, we make the transformation

U = U_{(s)} - U_{(r)}, \qquad V = U_{(s)}.

Then the joint pdf of U and V is

f_{U,V}(u, v) =
\begin{cases}
\frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\; (v-u)^{r-1} u^{s-r-1} (1-v)^{n-s}, & \text{if } 0 < u < v < 1 \\
0, & \text{otherwise}
\end{cases}

165

and the marginal pdf of U is

f_U(u) = \int_{0}^{1} f_{U,V}(u, v)\, dv
= \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\; u^{s-r-1} I_{(0,1)}(u) \int_{u}^{1} (v-u)^{r-1} (1-v)^{n-s}\, dv
\stackrel{(A)}{=} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\; u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u) \underbrace{\int_{0}^{1} t^{r-1} (1-t)^{n-s}\, dt}_{B(r,\, n-s+1)}
= \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\; u^{s-r-1} (1-u)^{n-s+r}\; \frac{(r-1)!\,(n-s)!}{(n-s+r)!}\; I_{(0,1)}(u)
= \frac{n!}{(n-s+r)!\,(s-r-1)!}\; u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u)
= n \binom{n-1}{s-r-1} u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u).

(A) is based on the transformation t = \frac{v-u}{1-u}, so that v - u = (1-u)t, \; 1 - v = 1 - u - (1-u)t = (1-u)(1-t), and dv = (1-u)\, dt.
It follows that

\gamma = P(U_{(s)} - U_{(r)} \geq p) = P(U \geq p)
= \int_{p}^{1} n \binom{n-1}{s-r-1} u^{s-r-1} (1-u)^{n-s+r}\, du
\stackrel{(B)}{=} P(Y < s-r), \quad \text{where } Y \sim Bin(n, p)
= \sum_{i=0}^{s-r-1} \binom{n}{i} p^i (1-p)^{n-i}.

(B) holds due to Rohatgi, Remark 3 after Theorem 5.3.18, page 216, since for X \sim Bin(n, p) it holds that

P(X < k) = \int_{p}^{1} n \binom{n-1}{k-1} x^{k-1} (1-x)^{n-k}\, dx.
166

Lecture 38:
We 04/18/01
Example 12.3.3:
Let s = n and r = 1. Then,

\gamma = \sum_{i=0}^{n-2} \binom{n}{i} p^i (1-p)^{n-i} = 1 - p^n - n\, p^{n-1} (1-p).

If p = 0.8 and n = 10, then

\gamma_{10} = 1 - (0.8)^{10} - 10 \cdot (0.8)^9 \cdot (0.2) = 0.624,

i.e., (X_{(1)}, X_{(10)}) defines a 62.4% tolerance interval for 80% probability.
If p = 0.8 and n = 20, then

\gamma_{20} = 1 - (0.8)^{20} - 20 \cdot (0.8)^{19} \cdot (0.2) = 0.931,

and if p = 0.8 and n = 30, then

\gamma_{30} = 1 - (0.8)^{30} - 30 \cdot (0.8)^{29} \cdot (0.2) = 0.989.
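These tolerance coefficients are quickly computed in S (a sketch; p = 0.8 and the three sample sizes mirror the example):

p <- 0.8
n <- c(10, 20, 30)
1 - p^n - n * p^(n - 1) * (1 - p)    # 0.624, 0.931, 0.989
pbinom(n - 2, n, p)                  # the same values via the binomial sum of Theorem 12.3.2 (r = 1, s = n)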
Theorem 12.3.4:
Let k_p be the p^{th} quantile of a continuous cdf F. Let X_{(1)}, \ldots, X_{(n)} be the order statistics of a sample of size n from F. Then it holds that

P(X_{(r)} \leq k_p \leq X_{(s)}) = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}.

Proof:
It holds that

P(X_{(r)} \leq k_p) = P(\text{at least } r \text{ of the } X_i\text{'s are } \leq k_p) = \sum_{i=r}^{n} \binom{n}{i} p^i (1-p)^{n-i}.

Therefore,

P(X_{(r)} \leq k_p \leq X_{(s)}) = P(X_{(r)} \leq k_p) - P(X_{(s)} < k_p)
= \sum_{i=r}^{n} \binom{n}{i} p^i (1-p)^{n-i} - \sum_{i=s}^{n} \binom{n}{i} p^i (1-p)^{n-i}
= \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}.
167

Corollary 12.3.5:
(X_{(r)}, X_{(s)}) is a level \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i} confidence interval for k_p.
Example 12.3.6:
Let n = 10. We want a 95% confidence interval for the median, i.e., k_p where p = \frac{1}{2}.
We get the following probabilities p_{r,s} = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i} that (X_{(r)}, X_{(s)}) covers k_{0.5}:

p_{r,s}     s = 2     3      4      5      6      7      8      9     10
  r = 1      0.01   0.05   0.17   0.38   0.62   0.83   0.94   0.99   0.998
      2             0.04   0.16   0.37   0.61   0.82   0.93   0.98   0.99
      3                    0.12   0.32   0.57   0.77   0.89   0.93   0.94
      4                           0.21   0.45   0.66   0.77   0.82   0.83
      5                                  0.25   0.45   0.57   0.61   0.62
      6                                         0.21   0.32   0.37   0.38
      7                                                0.12   0.16   0.17
      8                                                       0.04   0.05
      9                                                              0.01

Only the random intervals (X_{(1)}, X_{(9)}), (X_{(1)}, X_{(10)}), (X_{(2)}, X_{(9)}), and (X_{(2)}, X_{(10)}) give the desired coverage probability. Therefore, we use the one that comes closest to 95%, i.e., (X_{(2)}, X_{(9)}), as the 95% confidence interval for the median.
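The table of coverage probabilities can be reproduced in S (a sketch for n = 10 and p = 1/2 as in the example):

n <- 10
p <- 0.5
prs <- function(r, s) pbinom(s - 1, n, p) - pbinom(r - 1, n, p)   # sum_{i=r}^{s-1} C(n,i) p^i (1-p)^(n-i)
prs(2, 9)                        # 0.9785..., the coverage of (X_(2), X_(9))
round(outer(1:9, 2:10, prs), 2)  # the full table; entries with s <= r are not meaningful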
168

13 Some Results from Sampling
13.1 Simple Random Samples
Definition 13.1.1:
Let \Omega be a population of size N with mean \mu and variance \sigma^2. A sampling method (of size n) is called simple if the set S of possible samples contains all combinations of n elements of \Omega (without repetition) and the probability for each sample s \in S to be selected depends only on n, i.e., p(s) = 1 / \binom{N}{n} \;\forall s \in S. Then we call s \in S a simple random sample (SRS) of size n.
Theorem 13.1.2:
Let \Omega be a population of size N with mean \mu and variance \sigma^2. Let Y : \Omega \to IR be a measurable function. Let n_i be the total number of times the value \tilde{y}_i occurs in the population and p_i = \frac{n_i}{N} be the relative frequency with which \tilde{y}_i occurs in the population. Let (y_1, \ldots, y_n) be a SRS of size n with respect to Y, where P(Y = \tilde{y}_i) = p_i = \frac{n_i}{N}.
Then the components y_i, i = 1, \ldots, n, are identically distributed as Y and it holds for i \neq j:

P(y_i = \tilde{y}_k,\; y_j = \tilde{y}_l) = \frac{1}{N(N-1)}\, n_{kl}, \quad \text{where } n_{kl} =
\begin{cases}
n_k n_l, & k \neq l \\
n_k (n_k - 1), & k = l
\end{cases}
Note:
(i) In Sampling, many authors use capital letters to denote properties of the population and small letters to denote properties of the random sample. In particular, x_i's and y_i's are considered as random variables related to the sample. They are not seen as specific realizations.

(ii) The following equalities hold in the scenario of Theorem 13.1.2:

N = \sum_{i} n_i, \qquad \mu = \frac{1}{N} \sum_{i} n_i \tilde{y}_i, \qquad \sigma^2 = \frac{1}{N} \sum_{i} n_i (\tilde{y}_i - \mu)^2 = \frac{1}{N} \sum_{i} n_i \tilde{y}_i^2 - \mu^2
169

Theorem 13.1.3:
Let the same conditions hold as in Theorem 13.1.2. Let \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i be the sample mean of a SRS of size n. Then it holds:

(i) E(\bar{y}) = \mu, i.e., the sample mean is unbiased for the population mean.

(ii) Var(\bar{y}) = \frac{1}{n}\, \frac{N-n}{N-1}\, \sigma^2 = \frac{1}{n} (1-f)\, \frac{N}{N-1}\, \sigma^2, where f = \frac{n}{N}.
Proof:
(i)

E(\bar{y}) = \frac{1}{n} \sum_{i=1}^{n} E(y_i) = \mu, \quad \text{since } E(y_i) = \mu \;\forall i.

(ii)

Var(\bar{y}) = \frac{1}{n^2} \left( \sum_{i=1}^{n} Var(y_i) + 2 \sum_{i<j} Cov(y_i, y_j) \right)

Cov(y_i, y_j) = E(y_i\, y_j) - E(y_i) E(y_j)
= E(y_i\, y_j) - \mu^2
= \sum_{k,l} \tilde{y}_k \tilde{y}_l\, P(y_i = \tilde{y}_k, y_j = \tilde{y}_l) - \mu^2
\stackrel{Th.\,13.1.2}{=} \frac{1}{N(N-1)} \left( \sum_{k \neq l} \tilde{y}_k \tilde{y}_l\, n_k n_l + \sum_{k} \tilde{y}_k^2\, n_k (n_k - 1) \right) - \mu^2
= \frac{1}{N(N-1)} \left( \sum_{k,l} \tilde{y}_k \tilde{y}_l\, n_k n_l - \sum_{k} \tilde{y}_k^2\, n_k \right) - \mu^2
= \frac{1}{N(N-1)} \left( \Big( \sum_{k} \tilde{y}_k n_k \Big) \Big( \sum_{l} \tilde{y}_l n_l \Big) - \Big( \sum_{k} \tilde{y}_k^2 n_k \Big) \right) - \mu^2
\stackrel{Note\,(ii)}{=} \frac{1}{N(N-1)} \left( N^2 \mu^2 - N(\sigma^2 + \mu^2) \right) - \mu^2
= \frac{1}{N-1} \left( N \mu^2 - \sigma^2 - \mu^2 - \mu^2 (N-1) \right)
= -\frac{1}{N-1}\, \sigma^2, \quad \text{for } i \neq j
170

Lecture 39:
Fr 04/20/01
\Longrightarrow Var(\bar{y}) = \frac{1}{n^2} \left( \sum_{i=1}^{n} Var(y_i) + 2 \sum_{i<j} Cov(y_i, y_j) \right)
= \frac{1}{n^2} \left( n \sigma^2 + n(n-1) \left( -\frac{1}{N-1}\, \sigma^2 \right) \right)
= \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \sigma^2
= \frac{1}{n}\, \frac{N-n}{N-1}\, \sigma^2
= \frac{1}{n} \left( 1 - \frac{n}{N} \right) \frac{N}{N-1}\, \sigma^2
= \frac{1}{n} (1-f)\, \frac{N}{N-1}\, \sigma^2
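The finite population correction in Theorem 13.1.3 (ii) can be illustrated by simulation. A sketch in S (the small population and the sample size are arbitrary illustration choices):

pop <- c(2, 5, 7, 8, 11, 13)                 # population of size N = 6
N <- length(pop)
n <- 3
sigma2 <- mean((pop - mean(pop))^2)          # population variance (divisor N)
(1/n) * (N - n)/(N - 1) * sigma2             # Var(ybar) according to Theorem 13.1.3 (ii)
nsim <- 20000
ybar <- numeric(nsim)
for (k in 1:nsim) ybar[k] <- mean(sample(pop, n))   # SRS of size n, drawn without replacement
var(ybar)                                    # Monte Carlo estimate; close to the exact value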
Theorem 13.1.4:
Let \bar{y}_n be the sample mean of a SRS of size n. Then it holds that

\sqrt{\frac{n}{1-f}}\; \frac{\bar{y}_n - \mu}{\sqrt{\frac{N}{N-1}}\, \sigma} \stackrel{d}{\longrightarrow} N(0, 1),

where N \to \infty and f = \frac{n}{N} is a constant.
In particular, when the y_i's are 0–1–distributed with E(y_i) = P(y_i = 1) = p \;\forall i, then it holds that

\sqrt{\frac{n}{1-f}}\; \frac{\bar{y}_n - p}{\sqrt{\frac{N}{N-1}\, p(1-p)}} \stackrel{d}{\longrightarrow} N(0, 1),

where N \to \infty and f = \frac{n}{N} is a constant.
171

13.2 Stratified Random Samples
Definition 13.2.1:
Let \Omega be a population of size N that is split into m disjoint sets \Omega_j, called strata, of size N_j, j = 1, \ldots, m, where N = \sum_{j=1}^{m} N_j. If we independently draw a random sample of size n_j in each stratum, we speak of a stratified random sample.
Note:
(i) The random samples within the individual strata are not always SRS's.
(ii) Stratified random samples are used in practice as a means to reduce the sample variance in the case that the data within each stratum are homogeneous and the data among different strata are heterogeneous.
(iii) Frequently used strata in practice are gender, state (or county), income range, ethnic background, etc.
Definition 13.2.2:
Let Y : \Omega \to IR be a measurable function. In case of a stratified random sample, we use the following notation:
Let Y_{jk}, j = 1, \ldots, m, k = 1, \ldots, N_j, be the elements in \Omega_j. Then, we define

(i) Y_j = \sum_{k=1}^{N_j} Y_{jk}, the total in the j^{th} stratum,

(ii) \mu_j = \frac{1}{N_j}\, Y_j, the mean in the j^{th} stratum,

(iii) \mu = \frac{1}{N} \sum_{j=1}^{m} N_j \mu_j, the expectation (or grand mean),

(iv) N\mu = \sum_{j=1}^{m} Y_j = \sum_{j=1}^{m} \sum_{k=1}^{N_j} Y_{jk}, the total,

(v) \sigma_j^2 = \frac{1}{N_j} \sum_{k=1}^{N_j} (Y_{jk} - \mu_j)^2, the variance in the j^{th} stratum, and

(vi) \sigma^2 = \frac{1}{N} \sum_{j=1}^{m} \sum_{k=1}^{N_j} (Y_{jk} - \mu)^2, the variance.

172

(vii) We denote an (ordered) sample in \Omega_j of size n_j as (y_{j1}, \ldots, y_{j n_j}) and \bar{y}_j = \frac{1}{n_j} \sum_{k=1}^{n_j} y_{jk} the sample mean in the j^{th} stratum.
Theorem 13.2.3:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. Let \hat{\mu}_j be an unbiased estimate of \mu_j and \widehat{Var}(\hat{\mu}_j) be an unbiased estimate of Var(\hat{\mu}_j). Then it holds:

(i) \hat{\mu} = \frac{1}{N} \sum_{j=1}^{m} N_j \hat{\mu}_j is unbiased for \mu, and

Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^{m} N_j^2\, Var(\hat{\mu}_j).

(ii) \widehat{Var}(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^{m} N_j^2\, \widehat{Var}(\hat{\mu}_j) is unbiased for Var(\hat{\mu}).
Proof:
(i)

E(\hat{\mu}) = \frac{1}{N} \sum_{j=1}^{m} N_j\, E(\hat{\mu}_j) = \frac{1}{N} \sum_{j=1}^{m} N_j\, \mu_j = \mu

By independence of the samples within the strata,

Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^{m} N_j^2\, Var(\hat{\mu}_j).

(ii)

E(\widehat{Var}(\hat{\mu})) = \frac{1}{N^2} \sum_{j=1}^{m} N_j^2\, E(\widehat{Var}(\hat{\mu}_j)) = \frac{1}{N^2} \sum_{j=1}^{m} N_j^2\, Var(\hat{\mu}_j) = Var(\hat{\mu})
Theorem 13.2.4:
Let the same conditions hold as in Theorem 13.2.3. If we draw a SRS in each stratum, then it holds:

(i) \hat{\mu} = \frac{1}{N} \sum_{j=1}^{m} N_j\, \bar{y}_j is unbiased for \mu, where \bar{y}_j = \frac{1}{n_j} \sum_{k=1}^{n_j} y_{jk}, j = 1, \ldots, m, and

Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^{m} N_j^2\; \frac{1}{n_j} (1 - f_j)\, \frac{N_j}{N_j - 1}\, \sigma_j^2, \quad \text{where } f_j = \frac{n_j}{N_j}.

173

(ii) \widehat{Var}(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^{m} N_j^2\; \frac{1}{n_j} (1 - f_j)\, s_j^2 is unbiased for Var(\hat{\mu}), where

s_j^2 = \frac{1}{n_j - 1} \sum_{k=1}^{n_j} (y_{jk} - \bar{y}_j)^2.
Proof:
For a SRS in the j^{th} stratum, it follows by Theorem 13.1.3 that

E(\bar{y}_j) = \mu_j \quad \text{and} \quad Var(\bar{y}_j) = \frac{1}{n_j} (1 - f_j)\, \frac{N_j}{N_j - 1}\, \sigma_j^2.

Also, we can show that

E(s_j^2) = \frac{N_j}{N_j - 1}\, \sigma_j^2.

Now the proof follows directly from Theorem 13.2.3.
Definition 13.2.5:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. If the sample in each stratum is of size n_j = n\, \frac{N_j}{N}, j = 1, \ldots, m, where n is the total sample size, then we speak of proportional selection.

Note:
(i) In the case of proportional selection, it holds that f_j = \frac{n_j}{N_j} = \frac{n}{N} = f, j = 1, \ldots, m.
(ii) Proportional strata cannot always be obtained for each combination of m, n, and N.
Theorem 13.2.6:
Let the same conditions hold as in Definition 13.2.5. If we draw a SRS in each stratum, then it holds in case of proportional selection that

Var(\hat{\mu}) = \frac{1}{N^2}\; \frac{1-f}{f} \sum_{j=1}^{m} N_j\, \tilde{\sigma}_j^2,

where \tilde{\sigma}_j^2 = \frac{N_j}{N_j - 1}\, \sigma_j^2.

Proof:
The proof follows directly from Theorem 13.2.4 (i).
174

Theorem 13.2.7:
If we draw (1) a stratified random sample that consists of SRS's of sizes n_j under proportional selection and (2) a SRS of size n = \sum_{j=1}^{m} n_j from the same population, then it holds that

Var(\bar{y}) - Var(\hat{\mu}) = \frac{1}{n}\; \frac{N-n}{N(N-1)} \left( \sum_{j=1}^{m} N_j (\mu_j - \mu)^2 - \frac{1}{N} \sum_{j=1}^{m} (N - N_j)\, \tilde{\sigma}_j^2 \right).

Proof:
See Homework.
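The gain from stratification can be illustrated numerically. A sketch in S comparing Var(\bar{y}) under a SRS with Var(\hat{\mu}) under proportional stratified sampling (the two strata and all sizes are arbitrary illustration choices):

y1 <- c(1, 2, 2, 3, 3, 4)          # stratum 1
y2 <- c(10, 11, 12, 12, 13, 14)    # stratum 2
N1 <- length(y1); N2 <- length(y2); N <- N1 + N2
n <- 4; n1 <- n * N1 / N; n2 <- n * N2 / N; f <- n / N    # proportional selection: n1 = n2 = 2
mu <- mean(c(y1, y2))
sig2 <- mean((c(y1, y2) - mu)^2)
sig2.1 <- mean((y1 - mean(y1))^2); sig2.2 <- mean((y2 - mean(y2))^2)
var.srs <- (1/n) * (N - n)/(N - 1) * sig2                 # Theorem 13.1.3 (ii)
var.strat <- (1/N^2) * (N1^2 * (1/n1) * (1 - f) * N1/(N1 - 1) * sig2.1 +
                        N2^2 * (1/n2) * (1 - f) * N2/(N2 - 1) * sig2.2)   # Theorem 13.2.4 (i)
var.srs
var.strat        # much smaller here, since the stratum means differ strongly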
175

Lecture 41:
We 04/25/01
14 Some Results from Sequential Statistical Inference
14.1 Fundamentals of Sequential Sampling
Example 14.1.1:
A particular machine produces a large number of items every day. Each item can be either “defective” or “non–defective”. The unknown proportion of defective items in the production of a particular day is p.
Let (X_1, \ldots, X_m) be a sample from the daily production where x_i = 1 when the item is defective and x_i = 0 when the item is non–defective. Obviously, S_m = \sum_{i=1}^{m} X_i \sim Bin(m, p) denotes the total number of defective items in the sample (assuming that m is small compared to the daily production).
We might be interested in testing H_0 : p \leq p_0 vs. H_1 : p > p_0 at a given significance level \alpha and in using this decision to trash the entire daily production and have the machine fixed if indeed p > p_0. A suitable test could be

\Phi_1(x_1, \ldots, x_m) =
\begin{cases}
1, & \text{if } s_m > c \\
0, & \text{if } s_m \leq c
\end{cases}

where c is chosen such that \Phi_1 is a level–\alpha test.
However, wouldn't it be more beneficial if we sampled the items sequentially (e.g., take item # 57, 623, 1005, 1286, 2663, etc.) and stopped the machine as soon as it becomes obvious that it produces too many bad items? (Alternatively, we could also stop the time consuming and expensive process of determining whether items are defective or non–defective early if it has become impossible to surpass a certain proportion of defectives.) For example, if for some j < m it already holds that s_j > c, then we could stop (and immediately call maintenance) and reject H_0 after only j observations.
More formally, let us define T = \min\{ j \mid S_j > c \} and T^* = \min\{ T, m \}. We can now consider a decision rule that stops the sampling process at the random time T^* and rejects H_0 if T \leq m. Thus, if we consider R_0 = \{ (x_1, \ldots, x_m) \mid t \leq m \} and R_1 = \{ (x_1, \ldots, x_m) \mid s_m > c \} as critical regions of two tests \Phi_0 and \Phi_1, then these two tests are equivalent.
176

Definition 14.1.2:
Let Θ be the parameter space andAthe set of actions the statistician can take. We assume
that the rv’sX1, X2, . . .are observed sequentially and iid with common pdf (or pmf)fθ(x).
Asequential decision procedureis defined as follows:
(i) Astopping rulespecifies whether an element ofAshould be chosen without taking
any further observation. If at least one observation is taken, this rule specifies for every
set of observed values (x1, x2, . . . , xn),n≥1, whether to stop sampling and choose an
action inAor to take another observationxn+1.
(ii) A decision rule specifies the decision to be taken. If no observation has been taken, then we take action d_0 \in A. If n \geq 1 observations have been taken, then we take action d_n(x_1, \ldots, x_n) \in A, where d_n(x_1, \ldots, x_n) specifies the action that has to be taken for the set (x_1, \ldots, x_n) of observed values. Once an action has been taken, the sampling process is stopped.
Note:
In the remainder of this chapter, we assume that the statistician takes at least one observation.
Definition 14.1.3:
Let R_n \subseteq IR^n, n = 1, 2, \ldots, be a sequence of Borel–measurable sets such that the sampling process is stopped after observing X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n if (x_1, \ldots, x_n) \in R_n. If (x_1, \ldots, x_n) \notin R_n, then another observation x_{n+1} is taken. The sets R_n, n = 1, 2, \ldots, are called stopping regions.
Definition 14.1.4:
With every sequential stopping rule we associate astopping random variableNwhich
takes on the values 1,2,3, . . .. Thus,Nis a rv that indicates the total number of observations
taken before the sampling is stopped.
Note:
We use the (sloppy) notation \{N = n\} to denote the event that sampling is stopped after observing exactly n values x_1, \ldots, x_n (i.e., sampling is not stopped before taking n samples). Then the following equalities hold:

\{N = 1\} = R_1

177

\{N = n\} = \{ (x_1, \ldots, x_n) \in IR^n \mid \text{sampling is stopped after } n \text{ observations but not before} \}
= (R_1 \cup R_2 \cup \ldots \cup R_{n-1})^c \cap R_n
= R_1^c \cap R_2^c \cap \ldots \cap R_{n-1}^c \cap R_n

\{N \leq n\} = \bigcup_{k=1}^{n} \{N = k\}

Here we will only consider closed sequential sampling procedures, i.e., procedures where sampling eventually stops with probability 1, i.e.,

P(N < \infty) = 1, \qquad P(N = \infty) = 1 - P(N < \infty) = 0.
Theorem 14.1.5:Wald’s Equation
Let X_1, X_2, \ldots be iid rv's with E(|X_1|) < \infty. Let N be a stopping variable. Let S_N = \sum_{k=1}^{N} X_k.
If E(N) < \infty, then it holds

E(S_N) = E(X_1)\, E(N).

Proof:
Define a sequence of rv's Y_i, i = 1, 2, \ldots, where

Y_i =
\begin{cases}
1, & \text{if no decision is reached up to the } (i-1)^{th} \text{ stage, i.e., } N > (i-1) \\
0, & \text{otherwise}
\end{cases}

Then each Y_i is a function of X_1, X_2, \ldots, X_{i-1} only and Y_i is independent of X_i.
Consider the rv \sum_{n=1}^{\infty} X_n Y_n. Obviously, it holds that

S_N = \sum_{n=1}^{\infty} X_n Y_n.

Thus, it follows that

E(S_N) = E\left( \sum_{n=1}^{\infty} X_n Y_n \right). \qquad (*)

It holds that

\sum_{n=1}^{\infty} E(|X_n Y_n|) = \sum_{n=1}^{\infty} E(|X_n|)\, E(|Y_n|)
= E(|X_1|) \sum_{n=1}^{\infty} P(N \geq n)

178

= E(|X_1|) \sum_{n=1}^{\infty} \sum_{k=n}^{\infty} P(N = k)
\stackrel{(A)}{=} E(|X_1|) \sum_{n=1}^{\infty} n\, P(N = n)
= E(|X_1|)\, E(N)
< \infty

(A) holds due to the following rearrangement of indices: for n = 1 we sum over k = 1, 2, 3, \ldots, for n = 2 over k = 2, 3, \ldots, for n = 3 over k = 3, \ldots, and so on, so that each term P(N = k) appears exactly k times.

Lecture 42:
Fr 04/27/01

We may therefore interchange the expectation and summation signs in (*) and get

E(S_N) = E\left( \sum_{n=1}^{\infty} X_n Y_n \right)
= \sum_{n=1}^{\infty} E(X_n Y_n)
= \sum_{n=1}^{\infty} E(X_n)\, E(Y_n)
= E(X_1) \sum_{n=1}^{\infty} P(N \geq n)
= E(X_1)\, E(N),

which completes the proof.
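Wald's equation is easy to illustrate by simulation. A sketch in S (the Exp(1) distribution for the X_i and the stopping rule "stop as soon as the partial sum exceeds 5" are arbitrary illustration choices):

nsim <- 10000
SN <- numeric(nsim)
Nstop <- numeric(nsim)
for (k in 1:nsim) {
  s <- 0
  n <- 0
  repeat {
    n <- n + 1
    s <- s + rexp(1)        # X_i iid Exp(1), so E(X_1) = 1
    if (s > 5) break        # stopping rule: depends only on the observations so far
  }
  SN[k] <- s
  Nstop[k] <- n
}
mean(SN)                    # approximates E(S_N)
mean(Nstop)                 # approximates E(N); Wald: E(S_N) = E(X_1) E(N) = E(N) here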
179

14.2 Sequential Probability Ratio Tests
Definition 14.2.1:
Let X_1, X_2, \ldots be a sequence of iid rv's with common pdf (or pmf) f_\theta(x). We want to test a simple hypothesis H_0 : X \sim f_{\theta_0} vs. a simple alternative H_1 : X \sim f_{\theta_1} when the observations are taken sequentially.
Let f_{0n} and f_{1n} denote the joint pdf's (or pmf's) of X_1, \ldots, X_n under H_0 and H_1, respectively, i.e.,

f_{0n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{\theta_0}(x_i) \quad \text{and} \quad f_{1n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{\theta_1}(x_i).

Finally, let

\lambda_n(x_1, \ldots, x_n) = \frac{f_{1n}(x)}{f_{0n}(x)},

where x = (x_1, \ldots, x_n). Then a sequential probability ratio test (SPRT) for testing H_0 vs. H_1 is the following decision rule:

(i) If at any stage of the sampling process it holds that \lambda_n(x) \geq A, then stop and reject H_0.

(ii) If at any stage of the sampling process it holds that \lambda_n(x) \leq B, then stop and accept H_0, i.e., reject H_1.

(iii) If B < \lambda_n(x) < A, then continue sampling by taking another observation x_{n+1}.
Note:
(i) It is usually convenient to define

Z_i = \log \frac{f_{\theta_1}(X_i)}{f_{\theta_0}(X_i)},

where Z_1, Z_2, \ldots are iid rv's. Then, we work with

\log \lambda_n(x) = \sum_{i=1}^{n} z_i = \sum_{i=1}^{n} \left( \log f_{\theta_1}(x_i) - \log f_{\theta_0}(x_i) \right)

instead of using \lambda_n(x). Obviously, we now have to use constants b = \log B and a = \log A instead of the original constants B and A.
180

(ii) A and B (where A > B) are constants such that the SPRT will have strength (\alpha, \beta), where

\alpha = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0) \quad \text{and} \quad \beta = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1).

If N is the stopping rv, then

\alpha = P_{\theta_0}(\lambda_N(X) \geq A) \quad \text{and} \quad \beta = P_{\theta_1}(\lambda_N(X) \leq B).
Example 14.2.2:
Let X_1, X_2, \ldots be iid N(\mu, \sigma^2), where \mu is unknown and \sigma^2 > 0 is known. We want to test H_0 : \mu = \mu_0 vs. H_1 : \mu = \mu_1, where \mu_0 < \mu_1.
If our data is sampled sequentially, we can construct a SPRT as follows:

\log \lambda_n(x) = \sum_{i=1}^{n} \left( -\frac{1}{2\sigma^2}(x_i - \mu_1)^2 - \Big( -\frac{1}{2\sigma^2}(x_i - \mu_0)^2 \Big) \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( (x_i - \mu_0)^2 - (x_i - \mu_1)^2 \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \mu_0 + \mu_0^2 - x_i^2 + 2 x_i \mu_1 - \mu_1^2 \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( -2 x_i \mu_0 + \mu_0^2 + 2 x_i \mu_1 - \mu_1^2 \right)
= \frac{1}{2\sigma^2} \left( \sum_{i=1}^{n} 2 x_i (\mu_1 - \mu_0) + n (\mu_0^2 - \mu_1^2) \right)
= \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^{n} x_i - n\, \frac{\mu_0 + \mu_1}{2} \right)

We decide for H_0 if

\log \lambda_n(x) \leq b
\iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^{n} x_i - n\, \frac{\mu_0 + \mu_1}{2} \right) \leq b
\iff \sum_{i=1}^{n} x_i \leq n\, \frac{\mu_0 + \mu_1}{2} + b^*,

where b^* = \frac{\sigma^2}{\mu_1 - \mu_0}\, b.

181

We decide for H_1 if

\log \lambda_n(x) \geq a
\iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^{n} x_i - n\, \frac{\mu_0 + \mu_1}{2} \right) \geq a
\iff \sum_{i=1}^{n} x_i \geq n\, \frac{\mu_0 + \mu_1}{2} + a^*,

where a^* = \frac{\sigma^2}{\mu_1 - \mu_0}\, a.


[Figure (Example 14.2.2): \sum x_i plotted against n = 1, 2, 3, 4, \ldots; the two parallel lines n(\mu_0 + \mu_1)/2 + b^* and n(\mu_0 + \mu_1)/2 + a^* split the plane into the regions "accept H_0" (below), "continue" (between), and "accept H_1" (above).]
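This SPRT is straightforward to run on a data stream. A sketch in S (all parameter values and the simulated data are arbitrary illustration choices; the stopping bounds use A^* = (1-\beta)/\alpha and B^* = \beta/(1-\alpha), the Wald approximation discussed in Theorem 14.2.4 below):

mu0 <- 0; mu1 <- 1; sigma <- 1
alpha <- 0.05; beta <- 0.10
a <- log((1 - beta) / alpha)                 # a = log A*
b <- log(beta / (1 - alpha))                 # b = log B*
astar <- sigma^2 / (mu1 - mu0) * a
bstar <- sigma^2 / (mu1 - mu0) * b
x <- rnorm(100, mean = 0.2, sd = sigma)      # sequentially observed data (true mean 0.2)
s <- 0; n <- 0; decision <- "continue"
while (decision == "continue" && n < length(x)) {
  n <- n + 1
  s <- s + x[n]
  if (s >= n * (mu0 + mu1)/2 + astar) decision <- "reject H0"
  if (s <= n * (mu0 + mu1)/2 + bstar) decision <- "accept H0"
}
n
decision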
Theorem 14.2.3:
For a SPRT with stopping bounds A and B, A > B, and strength (\alpha, \beta), we have

A \leq \frac{1-\beta}{\alpha} \quad \text{and} \quad B \geq \frac{\beta}{1-\alpha},

where 0 < \alpha < 1 and 0 < \beta < 1.

Theorem 14.2.4:
Assume we select, for given \alpha, \beta \in (0, 1) where \alpha + \beta \leq 1, the stopping bounds

A^* = \frac{1-\beta}{\alpha} \quad \text{and} \quad B^* = \frac{\beta}{1-\alpha}.

Then it holds that the SPRT with stopping bounds A^* and B^* has strength (\alpha^*, \beta^*), where

\alpha^* \leq \frac{\alpha}{1-\beta}, \quad \beta^* \leq \frac{\beta}{1-\alpha}, \quad \text{and} \quad \alpha^* + \beta^* \leq \alpha + \beta.
182

Note:
(i) The approximationA

=
1−β
α
andB

=
β
1−α
in Theorem 14.2.4 is calledWald–
Approximationfor the optimal stopping bounds of a SPRT.
(ii)A

andB

are functions ofαandβonly and do not depend on the pdf’s (or pmf’s)fθ0
andfθ1
. Therefore, they can be computed once and for allfθi
’s,i= 0,1.
THE END!!!
183

Index
α–similar, 109
0–1 Loss, 129
A Posteriori Distribution, 86
A Priori Distribution, 86
Action, 83
Alternative Hypothesis, 91
Ancillary, 56
Asymptotically (Most) Efficient, 73
Basu’s Theorem, 56
Bayes Estimate, 87
Bayes Risk, 86
Bayes Rule, 87
Bayesian Confidence Interval, 149
Bayesian Confidence Set, 149
Bias, 58
Borel–Cantelli Lemma, 21
Cauchy Criterion, 23
Centering Constants, 15, 19
Central Limit Theorem, Lindeberg, 33
Central Limit Theorem, Lindeberg–L`evy, 30
Chapman, Robbins, Kiefer Inequality, 71
CI, 135
Closed, 178
Complete, 52, 152
Complete in Relation toP, 152
Composite, 91
Confidence Bound, Lower, 135
Confidence Bound, Upper, 135
Confidence Coefficient, 135
Confidence Interval, 135
Confidence Level, 134
Confidence Sets, 134
Conjugate Family, 90
Consistent, 5
Consistent in the rth Mean, 45
Consistent, Mean–Squared–Error, 59
Consistent, Strongly, 45
Consistent, Weakly, 45
Continuity Theorem, 29
Contradiction, Proof by, 61
Convergence, Almost Sure, 12
Convergence, In rth Mean, 10
Convergence, In Absolute Mean, 10
Convergence, In Distribution, 2
Convergence, In Law, 2
Convergence, In Mean Square, 10
Convergence, In Probability, 5
Convergence, Strong, 12
Convergence, Weak, 2
Convergence, With Probability 1, 12
Convergence–Equivalent, 23
Cram´er–Rao Lower Bound, 67, 68
Credible Set, 149
Critical Region, 92
CRK Inequality, 71
CRLB, 67, 68
Decision Function, 83
Decision Rule, 177
Degree, 153
Distribution, A Posteriori, 86
Distribution, Population, 36
Distribution, Sampling, 2
Distribution–Free, 152
Domain of Attraction, 32
Efficiency, 73
Efficient, Asymptotically (Most), 73
Efficient, More, 73
Efficient, Most, 73
Empirical CDF, 36
Empirical Cumulative Distribution Function, 36
Equivalence Lemma, 23
Error, Type I, 92
Error, Type II, 92
Estimable, 153
Estimable Function, 58
Estimate, Bayes, 87
Estimate, Maximum Likelihood, 77
Estimate, Method of Moments, 75
Estimate, Minimax, 84
Estimate, Point, 44
Estimator, 44
Estimator, Mann–Whitney, 157
Estimator, Wilcoxon 2–Sample, 157
Exponential Family, One–Parameter, 53
F–Test, 127
Factorization Criterion, 50
Family of CDF’s, 44
Family of Confidence Sets, 134
Family of PDF’s, 44
Family of PMF’s, 44
Family of Random Sets, 134
Formal Invariance, 112
GeneralizedU–Statistic, 157
Glivenko–Cantelli Theorem, 37
Hypothesis, Alternative, 91
Hypothesis, Null, 91
184

Independence of X̄ and S², 41
Induced Function, 46
Inequality, Kolmogorov’s, 22
Interval, Random, 134
Invariance, Measurement, 112
Invariant, 46, 111
Invariant Test, 112
Invariant, Location, 47
Invariant, Maximal, 113
Invariant, Permutation, 47
Invariant, Scale, 47
K–S Statistic, 158
K–S Test, 162
Kernel, 153
Kernel, Symmetric, 153
Khintchine’s Weak Law of Large Numbers, 18
Kolmogorov’s Inequality, 22
Kolmogorov’s SLLN, 25
Kolmogorov–Smirnov Statistic, 158
Kolmogorov–Smirnov Test, 162
Kronecker’s Lemma, 22
Landau Symbols O and o, 31
Lehmann–Scheffé, 65
Level of Significance, 93
Level–α–Test, 93
Likelihood Function, 77
Likelihood Ratio Test, 116
Likelihood Ratio Test Statistic, 116
Lindeberg Central Limit Theorem, 33
Lindeberg Condition, 33
Lindeberg–L`evy Central Limit Theorem, 30
LMVUE, 60
Locally Minimum Variance Unbiased Estimate, 60
Location Invariant, 47
Logic, 61
Loss Function, 83
Lower Confidence Bound, 135
LRT, 116
Mann–Whitney Estimator, 157
Maximal Invariant, 113
Maximum Likelihood Estimate, 77
Mean Square Error, 59
Mean–Squared–Error Consistent, 59
Measurement Invariance, 112
Method of Moments Estimate, 75
Minimax Estimate, 84
Minimax Principle, 84
Minimal Sufficient, 56
MLE, 77
MLR, 102
MOM, 75
Monotone Likelihood Ratio, 102
More Efficient, 73
Most Efficient, 73
Most Powerful Test, 93
MP, 93
MSE–Consistent, 59
Neyman–Pearson Lemma, 96
Nonparametric, 152
Nonrandomized Test, 93
Normal Variance Tests, 121
Norming Constants, 15, 19
NP Lemma, 96
Null Hypothesis, 91
One Samplet–Test, 125
One–Tailedt-Test, 125
Pairedt-Test, 126
Parameter Space, 44
Parametric Hypothesis, 91
Permutation Invariant, 47
Pivot, 138
Point Estimate, 44
Point Estimation, 44
Population Distribution, 36
Posterior Distribution, 86
Power, 93
Power Function, 93
Prior Distribution, 86
Probability Integral Transformation, 159
Probability Ratio Test, Sequential, 180
Problem of Fit, 158
Proof by Contradiction, 61
Proportional Selection, 174
Random Interval, 134
Random Sample, 36
Random Sets, 134
Random Variable, Stopping, 177
Randomized Test, 93
Rao–Blackwell, 64
Rao–Blackwellization, 65
Realization, 36
Regularity Conditions, 68
Risk Function, 83
Risk, Bayes, 86
Sample, 36
Sample Central Moment of Orderk, 37
Sample Mean, 36
Sample Moment of Orderk, 37
Sample Statistic, 36
Sample Variance, 36
Sampling Distribution, 2
Scale Invariant, 47
185

Selection, Proportional, 174
Sequential Decision Procedure, 177
Sequential Probability Ratio Test, 180
Significance Level, 93
Similar, 109
Similar,α, 109
Simple, 91, 169
Simple Random Sample, 169
Size, 93
Slutsky’s Theorem, 8
SPRT, 180
SRS, 169
Stable, 32
Statistic, 2, 36
Statistic, Kolmogorov–Smirnov, 158
Statistic, Likelihood Ratio Test, 116
Stopping Random Variable, 177
Stopping Regions, 177
Stopping Rule, 177
Strata, 172
Stratified Random Sample, 172
Strong Law of Large Numbers, Kolmogorov’s, 25
Strongly Consistent, 45
Sufficient, 48, 152
Sufficient, Minimal, 56
Symmetric Kernel, 153
t–Test, 125
Tail–Equivalent, 23
Taylor Series, 31
Test Function, 93
Test, Invariant, 112
Test, Kolmogorov–Smirnov, 162
Test, Likelihood Ratio, 116
Test, Most Powerful, 93
Test, Nonrandomized, 93
Test, Randomized, 93
Test, Uniformly Most Powerful, 93
Tolerance Coefficient, 165
Tolerance Interval, 165
Two–Samplet-Test, 125
Two–Tailedt-Test, 125
Type I Error, 92
Type II Error, 92
U–Statistic, 154
U–Statistic, Generalized, 157
UMA, 135
UMAU, 145
UMP, 93
UMPα–similar, 110
UMP Invariant, 114
UMP Unbiased, 106
UMPU, 106
UMVUE, 60
Unbiased, 58, 106, 145
Uniformly Minimum Variance Unbiased Estimate, 60
Uniformly Most Accurate, 135
Uniformly Most Accurate Unbiased, 145
Uniformly Most Powerful Test, 93
Unimodal, 139
Upper Confidence Bound, 135
Wald’s Equation, 178
Wald–Approximation, 183
Weak Law Of Large Numbers, 15
Weak Law Of Large Numbers, Khintchine’s, 18
Weakly Consistent, 45
Wilcoxon 2–Sample Estimator, 157
186