Chapter 4: Summarizing Data Collected in the Sample

WilheminaRossi174 41 views 117 slides Sep 21, 2022


Chapter 4
Summarizing Data Collected in the Sample



Learning Objectives (1 of 3)
- Distinguish between dichotomous, ordinal, categorical, and continuous variables
- Identify appropriate numerical and graphical summaries for each variable type
- Compute a mean, median, standard deviation, quartiles, and range for a continuous variable



Learning Objectives (2 of 3)
- Construct a frequency distribution table for dichotomous, categorical, and ordinal variables
- Provide an example of when the mean is a better measure of location than the median
- Interpret the standard deviation of a continuous variable



Learning Objectives (3 of 3)
- Generate and interpret a box plot for a continuous variable
- Produce and interpret side-by-side box plots
- Differentiate between a histogram and a bar chart



Variable Types
- Dichotomous variables have two possible responses (e.g., yes/no).
- Ordinal and categorical variables have more than two responses, and responses are ordered and unordered, respectively.
- Continuous (or measurement) variables assume in theory any values between a theoretical minimum and maximum.



Biostatistics
Two areas of applied biostatistics
- Descriptive statistics: summarize a sample selected from a population
- Inferential statistics: make inferences about population parameters based on sample statistics



Vocabulary
- Data elements/data points
- Subjects/units of measurement
- Population versus sample



Sample vs. Population
- Any summary measure computed on a sample is a statistic.
- Any summary measure computed on a population is a parameter.

n = Sample Size
N = Population Size



Example 4.1.

Dichotomous Variable
Frequency Distribution Table



Relative Frequency Bar Chart for Dichotomous Variable

Categorical Outcome (1 of 2)
Sample: n = 50
Population: Patients at health center
Variable: Marital status

Marital Status    Number of Patients
Married           24
Separated         5
Divorced          8
Widowed           2
Never married     11
Total             50














Categorical Outcome (2 of 2)
Frequency Distribution Table

Marital Status    Number of Patients (f)    Relative Frequency (f/n)
Married           24                        0.48
Separated         5                         0.10
Divorced          8                         0.16
Widowed           2                         0.04
Never married     11                        0.22
Total             50                        1.00

Frequency Bar Chart



Ordinal Outcome (1 of 2)
Sample: n = 50
Population: Patients at health center
Variable: Self-reported current health status

Health Status    Number of Patients
Excellent        19
Very good        12
Good             9
Fair             6
Poor             4
Total            50














Ordinal Outcome (2 of 2)
Frequency Distribution Table

Health Status    Freq.    Rel. Freq.    Cumulative Freq.    Cumulative Rel. Freq.
Excellent        19       38%           19                  38%
Very good        12       24%           31                  62%
Good             9        18%           40                  80%
Fair             6        12%           46                  92%
Poor             4        8%            50                  100%
Total            50       100%

Relative Frequency Histogram
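
The frequency distribution table for the ordinal health-status variable can be rebuilt from the raw counts. The following is a minimal sketch, not part of the original slides; the function name `frequency_table` is an illustrative assumption.

```python
# Counts from the Ordinal Outcome slide (self-reported health status, n = 50).
counts = {"Excellent": 19, "Very good": 12, "Good": 9, "Fair": 6, "Poor": 4}

def frequency_table(counts):
    """Return rows of (category, freq, rel_freq, cum_freq, cum_rel_freq)."""
    n = sum(counts.values())
    rows, running = [], 0
    for category, freq in counts.items():
        running += freq  # cumulative frequency only makes sense for ordered categories
        rows.append((category, freq, freq / n, running, running / n))
    return rows

table = frequency_table(counts)
for category, f, rel, cum, cum_rel in table:
    print(f"{category:<10} {f:>3} {rel:>5.0%} {cum:>3} {cum_rel:>5.0%}")
```

Cumulative columns are meaningful here because the categories are ordered; for a purely categorical variable such as marital status, only the frequency and relative frequency columns apply.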




Example 4.2.

Ordinal Variable
Frequency Distribution Table



Relative Frequency Histogram

for Ordinal Variable


Continuous Variable (1 of 9)
- Assume, in theory, any value between a theoretical minimum and maximum
- Quantitative, measurement variables

Continuous Variable (2 of 9)
Population: Patients 50 years of age with coronary artery disease
Sample: n = 7 patients
Outcome: Systolic blood pressure (mmHg)



Continuous Variable (3 of 9)
Sample data
X: 100 110 114 121 130 130 160






Continuous Variable (4 of 9)
X: 100 110 114 121 130 130 160
ΣX = 865





Continuous Variable (5 of 9)
Consider a second sample from the same population.
We record SBP on each subject in the second sample:
120 121 122 124 125 126 127

n = 7
X̄ = 865 / 7 = 123.6
What is different between the two samples?




Continuous Variable (6 of 9)
Dispersion

X      (X – X̄)
100    –23.6
110    –13.6
114    –9.6
121    –2.6
130    6.4
130    6.4
160    36.4
865    0

















Continuous Variable (7 of 9)
Dispersion
Mean absolute deviation (MAD): the mean of the absolute deviations |X – X̄|

X      (X – X̄)
100    –23.6
110    –13.6
114    –9.6
121    –2.6
130    6.4
130    6.4
160    36.4
865    0

















Continuous Variable (8 of 9)
Sample variance

X      (X – X̄)    (X – X̄)²
100    –23.6      556.96
110    –13.6      184.96
114    –9.6       92.16
121    –2.6       6.76
130    6.4        40.96
130    6.4        40.96
160    36.4       1324.96
865    0          2247.72

Continuous Variable (9 of 9)
Sample standard deviation
Standard summary: n = 7, X̄ = 123.6, s = 19.4
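
The summary statistics for the seven SBP values can be checked with Python's standard `statistics` module. This is a sketch for verification only; the variable names are illustrative and not from the slides.

```python
import statistics

# SBP values (mmHg) from the sample of n = 7 patients.
sbp = [100, 110, 114, 121, 130, 130, 160]

n = len(sbp)
mean = statistics.mean(sbp)                  # sum of X divided by n
mad = sum(abs(x - mean) for x in sbp) / n    # mean absolute deviation
variance = statistics.variance(sbp)          # sum of squared deviations / (n - 1)
std_dev = statistics.stdev(sbp)              # square root of the sample variance

print(f"n = {n}, mean = {mean:.1f}, s = {std_dev:.1f}")
print(f"MAD = {mad:.1f}, variance = {variance:.1f}")
```

The slide's sum of squared deviations (2247.72) uses deviations rounded to one decimal place; the unrounded calculation differs slightly but gives the same variance and standard deviation to one decimal place.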





Median
100 110 114 121 130 130 160
- Median holds 50% of values above and 50% of values below
- Order the data
- For n odd, the median is the middle value
- For n even, the median is the mean of the two middle values



Quartiles
- Q1 = first quartile holds approximately 25% of the scores at or below it.
- Q3 = third quartile holds approximately 25% of the scores at or above it.
- Q2 = ??



Continuous Variable
Median
Order data
100 110 114 121 130 130 160
Q1
Q3



Box and Whisker Plot

100 110 120 130 140 150 160

Min Q1 Median Q3 Max



Comparing Samples with

Box and Whisker Plots






100 110 120 130 140 150 160





Summarizing Location and Variability
- When there are no outliers, the sample mean and standard deviation summarize location and variability.
- When there are outliers, the median and interquartile range (IQR) summarize location and variability, where IQR = Q3 – Q1.



Example (1 of 2)
Sample: n = 51 participants in a study of cardiovascular risk factors.
Variable: age (years)

60 62 63 64 64 65 65 65 65 65 65
66 66 66 66 66 67 67 67 68 68 68
70 70 70 71 71 72 72 73 73 73 73
73 73 75 75 75 76 76 77 77 77 77
79 82 83 85 85 87



Example (2 of 2)
Sample mean:

Sample variance:

Sample standard deviation:

Standard summary: n = 51, X̄ = 71.3, s = 6.4






Outliers
IQR = Interquartile Range = Q3 – Q1 = Range of middle half of the data
Outliers are values that either:
- Exceed Q3 + 1.5 IQR
- Fall below Q1 – 1.5 IQR
- Or fall outside X̄ ± 3s




Check for Outliers in Example
Q1 = 66, Q3 = 76, IQR = 10
Lower = 66 – 1.5(10) = 51
Upper = 76 + 1.5(10) = 91
X̄ ± 3s = 71.3 ± 3(6.4) = 52.1 to 90.5
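
Both outlier rules reduce to a pair of limits. The sketch below recomputes them from the summary values on the slides; the helper names `tukey_fences` and `three_sd_limits` are assumptions made for illustration.

```python
def tukey_fences(q1, q3):
    """Limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def three_sd_limits(mean, s):
    """Limits mean - 3s and mean + 3s."""
    return mean - 3 * s, mean + 3 * s

lower, upper = tukey_fences(66, 76)
lo3, hi3 = three_sd_limits(71.3, 6.4)

# Smallest and largest ages listed in the example.
age_min, age_max = 60, 87
no_outliers = lower <= age_min and age_max <= upper and lo3 <= age_min and age_max <= hi3
print(lower, upper, round(lo3, 1), round(hi3, 1), no_outliers)
```

Since the smallest age (60) and largest age (87) fall inside both sets of limits, neither rule flags an outlier in this sample.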

Presenting Data (1 of 2)
Suppose we collapse ages into six mutually exclusive and exhaustive categories.

Age Class    Number of Individuals (freq.)
60–64        5
65–69        17
70–74        12
75–79        12
80–84        2
85–89        3



Presenting Data (2 of 2)

Age Class    Freq.    Rel. Freq.    Cumulative Freq.    Cumulative Rel. Freq.
60-64        5        0.10          5                   0.10
65-69        17       0.33          22                  0.43
70-74        12       0.24          34                  0.67
75-79        12       0.24          46                  0.91
80-84        2        0.04          48                  0.95
85-89        3        0.06          51                  1.00
Total        51       1.00



Frequency Histogram




Example 4.3.

Summarizing Continuous Variables
Diastolic blood pressures in n = 10 randomly selected participants attending the seventh examination of the Framingham Offspring Study

76 64 62 81 70
72 81 63 67 77



Summarizing Location
What is a typical diastolic blood pressure?

Sample mean:
X̄ = Sum of diastolic blood pressures / n
= 713/10 = 71.3



Notation
Let X represent the outcome of interest (e.g., X = diastolic blood pressure)




Summarizing Variability
Sample range:
= maximum – minimum = 81 – 62 = 19
Sample variance:






Sample Variance (1 of 2)

DBP    Deviation from Mean
76     (76 – 71.3) = 4.7
64     (64 – 71.3) = –7.3
62     (62 – 71.3) = –9.3
81     9.7
70     –1.3
72     0.7
81     9.7
63     –8.3
67     –4.3
77     5.7
ΣX = 713    Σ Deviations from Mean = 0



Sample Variance (2 of 2)

DBP    Deviation from Mean      Squared Deviations
76     (76 – 71.3) = 4.7        22.09
64     (64 – 71.3) = –7.3       53.29
62     (62 – 71.3) = –9.3       86.49
81     9.7                      94.09
70     –1.3                     1.69
72     0.7                      0.49
81     9.7                      94.09
63     –8.3                     68.89
67     –4.3                     18.49
77     5.7                      32.49
ΣX = 713    Σ Deviations = 0    Σ Deviations² = 472.10




Sample Variance and

Sample Standard Deviation

Median
- Median holds 50% of values above and 50% of values below
- Order the data
- For n odd, the median is the middle value
- For n even, the median is the mean of the two middle values

Median = 71
62 63 64 67 70 | 72 76 77 81 81



Quartiles
- Q1 = first quartile holds 25% of values below it
- Q3 = third quartile holds 25% of values above it

Median = 71
62 63 64 67 70 | 72 76 77 81 81
Q1 = 64, Q3 = 77



Determining Outliers
- Outliers: values below Q1 – 1.5(Q3 – Q1) or above Q3 + 1.5(Q3 – Q1)
- In Example 4.3: lower limit = 64 – 1.5(77 – 64) = 44.5 and upper limit = 77 + 1.5(77 – 64) = 96.5
- Outliers?
- Mean or median? s or IQR?
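
The median, quartiles, and outlier limits for Example 4.3 can be reproduced as follows. This sketch assumes the "median of each half" convention, which matches the Q1 = 64 and Q3 = 77 used on the slides; other quartile conventions can give different values. The function name is illustrative.

```python
import statistics

# Diastolic blood pressures for the n = 10 participants in Example 4.3.
dbp = [76, 64, 62, 81, 70, 72, 81, 63, 67, 77]

def median_and_quartiles(values):
    """Q1 and Q3 taken as the medians of the lower and upper halves of the ordered data."""
    data = sorted(values)
    half = len(data) // 2          # for odd n the middle value is excluded from both halves
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    return q1, statistics.median(data), q3

q1, median, q3 = median_and_quartiles(dbp)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in dbp if x < lower_limit or x > upper_limit]
print(q1, median, q3, lower_limit, upper_limit, outliers)
```

No value falls outside the limits, so under the rule stated earlier in the deck the mean and standard deviation are suitable summaries for this sample.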



Box Plot for Continuous Variable



Numerical and Graphical Summaries (1 of 2)
Dichotomous and categorical
- Frequencies and relative frequencies
- Bar charts (freq. or relative freq.)
Ordinal
- Frequencies, relative frequencies, cumulative frequencies, and cumulative relative frequencies
- Histograms (freq. or relative freq.)



Numerical and Graphical Summaries (2 of 2)
Continuous
- Mean, standard deviation, minimum, maximum, range, median, quartiles, interquartile range
- Box plot
[Figure: relative frequency histogram of health status; x-axis: Health Status (Poor, Fair, Good, Very Good, Excellent); y-axis: %, 0 to 40]

Formulas for the SBP example (n = 7):
Sample mean: X̄ = ΣX / n = 865 / 7 = 123.6
MAD = Σ|X – X̄| / n
Sample variance: s² = Σ(X – X̄)² / (n – 1) = 2247.72 / 6 = 374.6
Sample standard deviation: s = √s² = √374.6 = 19.4

Formulas for the age example (n = 51):
X̄ = ΣX / n = 3637 / 51 = 71.3
s² = [ΣX² – (ΣX)²/n] / (n – 1) = [261,439 – (3637)²/51] / 50 = 41.4
s = √41.4 = 6.4

[Figure: frequency histogram of age; x-axis: Age Class (60-64 to 85-89); y-axis: Frequency, 0 to 18]

Formulas for Example 4.3 (DBP, n = 10):
s² = Σ(x – x̄)² / (n – 1) = 472.10 / 9 = 52.46
s = √[Σ(x – x̄)² / (n – 1)] = √52.46 = 7.2

[Figure: box plots of DBP; axis: dbp, 60 to 80]





Chapter 5
The Role of Probability

Learning Objectives (1 of 3)
- Define the terms “equally likely” and “at random”
- Compute and interpret unconditional and conditional probabilities
- Evaluate and interpret independence of events
- Explain the key features of the binomial distribution model



Learning Objectives (2 of 3)
- Calculate probabilities using the binomial formula
- Explain the key features of the normal distribution model
- Calculate probabilities using the standard normal distribution table
- Compute and interpret percentiles of the normal distribution



Learning Objectives (3 of 3)
- Define and interpret the standard error
- Explain sampling variability
- Apply and interpret the results of the Central Limit Theorem



Two Areas of Biostatistics
Goal: Statistical Inference

[Figure: a SAMPLE, summarized by descriptive statistics (n, X̄), drawn from a POPULATION]







Sampling from a Population
[Figure: a population of size N from which repeated samples, each of size n, are drawn]

Sampling: Population Size = N, Sample Size = n (1 of 2)
- Simple random sample: Enumerate all members of population N (sampling frame), select n individuals at random (each has same probability of being selected).
- Systematic sample: Start with sampling frame; determine sampling interval (N/n); select first person at random from first (N/n) and every (N/n)th person thereafter.


Sampling: Population Size = N, Sample Size = n (2 of 2)
- Stratified sample: Organize population into mutually exclusive strata; select individuals at random within each stratum.
- Convenience sample: Non-probability sample (not for inference)
- Quota sample: Select a predetermined number of individuals into sample from groups of interest.



Basics
- Probability reflects the likelihood that outcome will occur.
- 0 ≤ Probability ≤ 1

Example 5.1.

Basic Probability (1 of 2)
P(Select any child) = 1/5290 = 0.0002



Example 5.1.

Basic Probability (2 of 2)
P(Select a boy) = 2560/5290 = 0.484
P(Select boy age 10) = 418/5290 = 0.079
P(Select child at least 8 years of age)
= (846 + 881 + 918)/5290
= 2645/5290 = 0.500



Conditional Probability
Probability of outcome in a specific subpopulation

Example 5.1.
P(Select 9-year-old from among girls)
= P(Select 9-year-old | girl)
= 461/2730 = 0.169
P(Select boy | 6 years of age)
= 379/892 = 0.425



Example 5.2.

Conditional Probability (1 of 2)

Example 5.2.

Conditional Probability (2 of 2)
P(Prostate cancer | Low PSA)
= 3/64 = 0.047
P(Prostate cancer | Moderate PSA)
= 13/41 = 0.317
P(Prostate cancer | High PSA)
= 12/15 = 0.80



Sensitivity and Specificity
Sensitivity = True positive fraction
= P(test+ | disease)
Specificity = True negative fraction
= P(test– | disease free)

False negative fraction = P(test– | disease)
False positive fraction = P(test+ | disease free)



Example 5.4.

Sensitivity and Specificity



Sensitivity and Specificity
Sensitivity = P(test+ | disease) = 9/10 = 0.90

Specificity = P(test– | disease free)
= 4449/4800 = 0.927

False negative fraction = P(test– | disease)
= 1/10 = 0.10
False positive fraction = P(test+ | disease free)
= 351/4800 = 0.073
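
The four screening-test fractions in Example 5.4 all come from the same 2×2 table of counts. A minimal sketch, with a hypothetical helper `screening_metrics` and the counts implied by the fractions above (9 true positives, 1 false negative, 351 false positives, 4449 true negatives):

```python
def screening_metrics(tp, fn, fp, tn):
    """Conditional probabilities of the test result given true disease status."""
    diseased = tp + fn
    disease_free = fp + tn
    return {
        "sensitivity": tp / diseased,              # P(test+ | disease)
        "specificity": tn / disease_free,          # P(test- | disease free)
        "false_negative_fraction": fn / diseased,  # P(test- | disease)
        "false_positive_fraction": fp / disease_free,  # P(test+ | disease free)
    }

metrics = screening_metrics(tp=9, fn=1, fp=351, tn=4449)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and the false negative fraction sum to 1, as do specificity and the false positive fraction, because each pair conditions on the same group.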



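Independence
Two events, A and B, are independent if P(A | B) = P(A) or if P(B | A) = P(B).

Example 5.2. Is screening test independent of prostate cancer diagnosis?
P(Prostate cancer) = 28/120 = 0.233
P(Prostate cancer | Low PSA) = 0.047
P(Prostate cancer | Moderate PSA) = 0.317
P(Prostate cancer | High PSA) = 0.80
The conditional probabilities differ from the unconditional probability, so the screening test result and prostate cancer diagnosis are not independent.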



Bayes’ Theorem (1 of 2)
- Using Bayes’ Theorem we revise or update a probability based on additional information.
- Prior probability is an initial probability.
- Posterior probability is a probability that is revised or updated based on additional information.



Bayes’ Theorem (2 of 2)





Example (1 of 2)
- In Boston, 51% of adults are male.
- One adult is randomly selected to participate in a study.
Prior probability of selecting a male = 0.51

Example (2 of 2)
- Selected participant is a smoker.
- 9.5% of males in Boston smoke as compared to 1.7% of females.
- Find the probability that we selected a male given he is a smoker.








Example 5.8.

Bayes’ Theorem
P(disease) = 0.002
Sensitivity = 0.85 = P(test+ | disease)
P(test+) = 0.08 and P(test–) = 0.92

What is P(disease | test+)?
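
Both Bayes' Theorem examples can be worked numerically. The sketch below is an illustration rather than the slides' own calculation; the function name `posterior` and its parameter names are assumptions. The first call expands the denominator by total probability (the Boston smoker example); the second uses P(test+) directly, as given in Example 5.8.

```python
def posterior(prior, likelihood, likelihood_complement):
    """P(A | B) = P(B | A)P(A) / [P(A)P(B | A) + P(A')P(B | A')]."""
    numerator = likelihood * prior
    return numerator / (numerator + likelihood_complement * (1 - prior))

# Boston example: A = male, B = smoker.
p_male_given_smoker = posterior(prior=0.51, likelihood=0.095, likelihood_complement=0.017)

# Example 5.8: P(test+) is supplied, so the simple form P(B | A)P(A) / P(B) applies.
p_disease_given_positive = 0.85 * 0.002 / 0.08

print(round(p_male_given_smoker, 3))
print(round(p_disease_given_positive, 4))
```

Under the values given for Example 5.8, the posterior works out to about 0.021: a positive test raises the probability of disease roughly tenfold from the prior of 0.002, yet it remains small because the disease is rare.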





Binomial Distribution (1 of 2)
- Model for discrete outcome
- Process or experiment has two possible outcomes: success and failure.
- Replications of process are independent.
- P(success) is constant for each replication.



Binomial Distribution (2 of 2)
Notation
n = number of times process is replicated
p = P(success)
x = number of successes of interest
0 ≤ x ≤ n




Example 5.9.

Binomial Distribution
Medication for allergies is effective in reducing symptoms in 80% of patients. If medication is given to 10 patients, what is the probability it is effective in 7?
P(7 successes) = 120(0.2097)(0.008) = 0.2013

Binomial Distribution (1 of 4)
Antibiotic is claimed to be effective in 70% of the patients. If antibiotic is given to five patients, what is the probability it is effective on exactly three?
Success = Antibiotic is effective: n = 5, p = 0.7, x = 3
P(X = 3) = 10(0.343)(0.09) = 0.3087



Binomial Distribution (2 of 4)
What is the probability that the antibiotic is effective on all five?

Binomial Distribution (3 of 4)
What is the probability that the antibiotic is effective on at least three?

P(X ≥ 3) = P(3) + P(4) + P(5)
= 0.3087 + 0.3601 + 0.1681 = 0.8369



Binomial Distribution (4 of 4)
Mean and variance of the binomial distribution

μ = np
σ² = np(1 – p)

For example, the mean (or expected) number of patients in whom the antibiotic is effective is 5 × 0.7 = 3.5.
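
The binomial probabilities in Example 5.9 and the antibiotic example can be verified with a few lines of Python. This is a sketch using only the standard library; `binomial_pmf` is an illustrative name.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes) = n! / (x!(n - x)!) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

p_allergy_7_of_10 = binomial_pmf(7, 10, 0.8)
p_exactly_3 = binomial_pmf(3, 5, 0.7)
p_all_5 = binomial_pmf(5, 5, 0.7)
p_at_least_3 = sum(binomial_pmf(x, 5, 0.7) for x in range(3, 6))

# Mean and variance for n = 5, p = 0.7.
mean = 5 * 0.7
variance = 5 * 0.7 * (1 - 0.7)

print(round(p_allergy_7_of_10, 4), round(p_exactly_3, 4),
      round(p_all_5, 4), round(p_at_least_3, 4))
print(mean, round(variance, 2))
```

The expected count of 3.5 is not itself a possible outcome; it is the long-run average number of successes over many repetitions of giving the antibiotic to five patients.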

Normal Distribution (1 of 3)
- Model for continuous outcome
- Mean = median = mode

Normal Distribution (2 of 3)
Notation: μ = mean and σ = standard deviation

μ–3σ  μ–2σ  μ–σ  μ  μ+σ  μ+2σ  μ+3σ



Normal Distribution (3 of 3)
Properties of the normal distribution
i) The normal distribution is symmetric about the mean (i.e., P(X > μ) = P(X < μ) = 0.5).
ii) The mean and variance, μ and σ², completely characterize the normal distribution.
iii) The mean = the median = the mode.
P(μ – σ < X < μ + σ) ≈ 0.68
P(μ – 2σ < X < μ + 2σ) ≈ 0.95
P(μ – 3σ < X < μ + 3σ) ≈ 0.997
iv) P(a < X < b) = the area under the normal curve from a to b.


Example 5.11.

Normal Distribution (1 of 10)
Body mass index (BMI) for men age 60 is normally distributed with a mean of 29 and standard deviation of 6. What is the probability that a male has BMI less than 29?

Example 5.11.

Normal Distribution (2 of 10)





11 17 23 29 35 41 47
P(X<29)=0.5
0.5 0.5
Example 5.11.

Normal Distribution (3 of 10)


Example 5.11.

Normal Distribution (4 of 10)
Body mass index (BMI) for men age 60 is normally distributed with a mean of 29 and standard deviation of 6. What is the probability that a male has BMI less than 35?





11 17 23 29 35 41 47
P(X<35)=?
Example 5.11.

Normal Distribution (5 of 10)



Example 5.11.

Normal Distribution (6 of 10)


11 17 23 29 35 41 47
P(X < 35) = 0.5 + 0.34 = 0.84
0.5 0.34



Standard Normal Distribution Z
Normal distribution with μ = 0 and σ = 1
-3 -2 -1 0 1 2 3





11 17 23 29 35 41 47
P(X < 35) = P(Z < 1) = ?
Example 5.11.

Normal Distribution (7 of 10)




Example 5.11.

Normal Distribution (8 of 10)
P(X < 35) = P(Z < 1).
Using Table 1, P(Z < 1.00) = 0.8413

Table 1. Probabilities of Z
Table entries represent P(Z < Zi)

Zi     .00      .01      .02      .03      .04      …
0.0    0.5000   0.5040   0.5080   0.5120   0.5160   …
0.1    0.5398   0.5438   0.5478   0.5517   0.5557   …
…
1.0    0.8413   0.8438   0.8461   0.8485   0.8508   …





11 17 23 29 35 41 47
P(X<30)=?
What is the probability that a male has BMI less than 30?
Example 5.11.

Normal Distribution (9 of 10)



Example 5.11.

Normal Distribution (10 of 10)


P(X < 30)= P(Z < 0.17) = 0.5675
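
The table lookups in Example 5.11 can be reproduced with `statistics.NormalDist` from the Python standard library. This is a sketch for checking the arithmetic; the variable names are illustrative. The slides round Z to two decimals before using Table 1, so the last probability is computed both ways.

```python
from statistics import NormalDist

bmi_men = NormalDist(mu=29, sigma=6)   # BMI for men age 60
standard_normal = NormalDist()         # mean 0, standard deviation 1

p_below_29 = bmi_men.cdf(29)
p_below_35 = bmi_men.cdf(35)

# Standardize X = 30, then round Z to two decimals as a printed table requires.
z_30 = (30 - 29) / 6
p_below_30_table = standard_normal.cdf(round(z_30, 2))
p_below_30_exact = bmi_men.cdf(30)

print(p_below_29, round(p_below_35, 4))
print(round(z_30, 2), round(p_below_30_table, 4), round(p_below_30_exact, 4))
```

The small gap between the table-based and unrounded answers for P(X < 30) comes entirely from rounding Z = 0.1667 up to 0.17.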




Percentiles of the Normal Distribution
The kth percentile is defined as the score that holds k percent of the scores below it (e.g., 90th percentile is the score that holds 90% of the scores below it).
- Q1 = 25th percentile
- Median = 50th percentile
- Q3 = 75th percentile

Percentiles (1 of 2)
For the normal distribution, the following is used to compute percentiles:
X = μ + Zσ
where
μ = mean of the random variable X,
σ = standard deviation, and
Z = value from the standard normal distribution for the desired percentile (See Table 1A, next slide).


Percentiles (2 of 2)
Percentiles of the standard normal distribution (Table 1A)

Percentile    Z
1st           –2.326
2.5th         –1.960
5th           –1.645
10th          –1.282
50th          0
90th          1.282
95th          1.645
97.5th        1.960
99th          2.326
















[Figure: standard normal curve, axis from –4 to 4, with area 0.95 below Z = 1.645 and area 0.05 above]







Example 5.12.

Percentiles of the Normal Distribution
BMI in men follows a normal distribution with μ = 29, σ = 6; BMI in women follows a normal distribution with μ = 28, σ = 7.
The 90th percentile of BMI for men:
X = 29 + 1.282(6) = 36.69.
The 90th percentile of BMI for women:
X = 28 + 1.282(7) = 36.97.
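
The percentile formula X = μ + Zσ can be applied without a printed table by asking the standard normal distribution for Z directly. A minimal sketch, with `normal_percentile` as a hypothetical helper name:

```python
from statistics import NormalDist

def normal_percentile(mu, sigma, k):
    """kth percentile of a normal distribution: X = mu + Z * sigma."""
    z = NormalDist().inv_cdf(k / 100)   # Z value with k percent of the area below it
    return mu + z * sigma

z_90 = NormalDist().inv_cdf(0.90)
men_90th = normal_percentile(29, 6, 90)
women_90th = normal_percentile(28, 7, 90)

print(round(z_90, 3), round(men_90th, 2), round(women_90th, 2))
```

Although women have the lower mean BMI in this example, their larger standard deviation pushes their 90th percentile slightly above that of men.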



Central Limit Theorem
Suppose we have a population with known mean μ and standard deviation σ. If we take simple random samples of size n with replacement, then for large n, the sampling distribution of the sample means is approximately normal with mean μ and standard deviation σ/√n.





Application
- Non-normal population
- Take samples of size n, as long as n is sufficiently large (usually n ≥ 30 suffices).
- The distribution of the sample mean is approximately normal, therefore can use Z to compute probabilities.



Example 5.18.

Central Limit Theorem (1 of 2)
HDL cholesterol has a mean of 54 and standard deviation of 17 in patients over 50. A physician has 40 patients over age 50 and wants to know the probability that their mean cholesterol is above 60.




Example 5.18.

Central Limit Theorem (2 of 2)
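
Example 5.18 can be worked by standardizing the sample mean with the standard error σ/√n. The sketch below is an illustration under the values stated in the example; variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 54, 17, 40      # HDL mean, standard deviation, number of patients
threshold = 60

standard_error = sigma / sqrt(n)            # standard deviation of the sample mean
z = (threshold - mu) / standard_error
p_mean_above_60 = 1 - NormalDist().cdf(z)   # upper-tail area

print(round(standard_error, 2), round(z, 2), round(p_mean_above_60, 4))
```

Carrying full precision gives a Z value slightly above the 2.22 used with the printed table, so the probability differs a little from the table-based answer; either way, a mean HDL above 60 in 40 patients would be unusual.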





Example
Suppose we wish to estimate the mean of a population (μ) whose standard deviation is known and equal to 12, and a simple random sample of 100 individuals is selected from the population. Find the probability that the sample mean is no more than 2 units from the population mean.




Sampling Distribution of Sample Mean
[Figure: sampling distribution of X̄ centered at μ, with μ – 2 and μ + 2 marked]

Central Limit Theorem
P(μ – 2 < X̄ < μ + 2) = ??
Z = {(μ – 2) – μ} / (12/√100) = –2/1.2 = –1.67
Z = {(μ + 2) – μ} / (12/√100) = 2/1.2 = 1.67
Then: P(–1.67 < Z < 1.67) = 0.9525 – 0.0475 = 0.905
The probability that the sample mean is no more than 2 units from the population mean is 0.905, or 90.5%.



[Figure: a SAMPLE (n, X̄) drawn from a POPULATION with unknown mean μ = ?]
[Figure: a population of size N from which repeated samples of size n are drawn]

Probability = Number with outcome / N

Bayes’ Theorem:
P(A | B) = P(B | A)P(A) / P(B)
P(A | B) = P(B | A)P(A) / [P(A)P(B | A) + P(A')P(B | A')]

Boston example (M = male, S = smoker):
P(M | S) = P(S | M)P(M) / [P(M)P(S | M) + P(M')P(S | M')]
= 0.095(0.51) / [0.51(0.095) + 0.49(0.017)] = 0.853

Example 5.8:
P(disease | test+) = P(test+ | disease)P(disease) / P(test+)

Binomial formula:
P(x successes) = [n! / (x!(n – x)!)] p^x (1 – p)^(n – x)
Example 5.9: P(7 successes) = [10! / (7!(10 – 7)!)] 0.8^7 (1 – 0.8)^(10 – 7)
P(X = 3) = [5! / (3!(5 – 3)!)] 0.7^3 (1 – 0.7)^(5 – 3)
P(X = 5) = [5! / (5!(5 – 5)!)] 0.7^5 (1 – 0.7)^(5 – 5) = 1(0.1681)(1) = 0.1681

Standardizing a normal variable:
Z = (x – μ) / σ
Example 5.11: Z = (30 – 29) / 6 = 0.17

[Figure: standard normal curve, axis from –4 to 4, with area 0.95 below Z = 1.645 and area 0.05 above]

Central Limit Theorem:
Mean of the sample means = μ; standard deviation of the sample means = σ / √n
Z = (X̄ – μ) / (σ / √n)
Example 5.18: P(X̄ > 60) = ?
Z = (60 – 54) / (17 / √40) = 2.22
P(X̄ > 60) = P(Z > 2.22) = 1 – 0.9868 = 0.0132

[Figure: sampling distribution of X̄ with μ – 2, μ, and μ + 2 marked]

Chapter 3
Quantifying the Extent of Disease



Learning Objectives
- Define and differentiate prevalence and incidence
- Select, compute, and interpret the appropriate measure to compare the extent of disease between groups
- Compare and contrast relative risks, risk differences, and odds ratios
- Compute and interpret relative risks, risk differences, and odds ratios



Critical Components of RCT
- Randomization
- Control group – ethical issues
- Monitoring
- Interim analysis
- Data and safety monitoring board
- Data management
- Reporting



Prevalence
Proportion of participants with disease at a particular point in time





Example 3.1.

Computing Prevalence
Prevalence of CVD = 379/3799 = 0.0998 = 9.98%
Prevalence of CVD in Men = 244/1792 = 0.1362 = 13.62%
Prevalence of CVD in Women = 135/2007 = 0.0673 = 6.73%



Example: H1N1 Outbreak
- H1N1 outbreak first noticed in Mexico.
- Large outbreak early on in La Gloria, a small village outside of Mexico City
- Studied extensively in the first report on H1N1 (Fraser, Donnelly, et al. “Pandemic potential of a strain of Influenza (H1N1): Early findings,” Science Express, 11 May 2009.)
- Important questions: Who is most likely to be impacted? What are characteristics of people commonly impacted?

Data on H1N1 outbreak in La Gloria, Mexico: n = 1575 villagers (out of 2155) were surveyed to determine if they had influenza-like illness (ILI) between 2/15/09 and 4/27/09.

Computing Prevalence (1 of 2)

Age           No ILI    ILI    Total
≤ 44 years    703       522    1225
> 44 years    256       94     350
Total         959       616    1575













Computing Prevalence (2 of 2)
Prevalence of ILI = 616/1575 = 0.3911 = 39.11%
Prevalence of ILI in ≤ 44 = 522/1225 = 0.4261 = 42.61%
Prevalence of ILI in > 44 = 94/350 = 0.2686 = 26.86%

Incidence
Likelihood of developing disease among persons free of disease who are at risk of developing disease






Computing Incidence
- Cumulative incidence requires complete follow-up on all participants.
- Person-time data is used to take full advantage of available information in incidence rate.
- Incidence rate often expressed as an integer per multiple of participants over a specified time.



Incidence of CVD?



Incidence Rate





Incidence of CVD


Incidence = 2/(10 + 9 + 3 + 10 + 5) = 2/37
= 0.054

5.4 per 100 person-years

Example 3.2.

Computing Incidence

         Develop CVD    Total Follow-Up Time (years)
Men      190            9984
Women    119            12153
Total    309            22137

Incidence Rate of CVD in Men = 190/9984 = 0.01903
= 190 per 10,000 person-years
Incidence Rate of CVD in Women = 119/12153 = 0.00979
= 98 per 10,000 person-years












Computing Incidence

              Developed ILI    Total Follow-Up Time (years)
≤ 44 years    522              20,064
> 44 years    94               3,514
Total         616              23,578

Incidence Rate of ILI in ≤ 44 = 522/20,064 = 0.0260
= 260 per 10,000 person-years
Incidence Rate of ILI in > 44 = 94/3514 = 0.0268
= 268 per 10,000 person-years

Risk difference (excess risk)

Comparing Extent of Disease

Between Groups (1 of 2)




Comparing Extent of Disease Between Groups (2 of 2)
Risk difference of prevalent CVD in smokers versus nonsmokers
= 81/744 – 298/3055 = 0.1089 – 0.0975 = 0.0114




Population Attributable Risk of CVD in Smokers vs.
Nonsmokers

= (0.0998 – 0.0975) / 0.0998 = 0.023 = 2.3%



Comparing Extent of Disease Between Groups (1 of 7)
Risk difference of history of ILI in females versus males in La Gloria
= 356/798 – 260/777 = 0.4461 – 0.3346 = 0.1115

           No ILI    ILI    Total
Males      517       260    777
Females    442       356    798
Total      959       616    1575













Relative risk
Comparing Extent of Disease

Between Groups (2 of 7)




Comparing Extent of Disease Between Groups (3 of 7)
Relative risk of CVD in smokers versus nonsmokers
= 0.1089/0.0975 = 1.12



Comparing Extent of Disease Between Groups (4 of 7)
Relative risk of ILI in females versus males
= 0.4461/0.3346 = 1.33

           No ILI    ILI    Total
Males      517       260    777
Females    442       356    798
Total      959       616    1575
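
The risk difference and relative risk for ILI in females versus males follow directly from the two prevalences. A minimal sketch using the La Gloria counts; the dictionary layout and variable names are illustrative assumptions.

```python
# (ILI cases, total surveyed) by sex, from the La Gloria table.
ili_by_sex = {"males": (260, 777), "females": (356, 798)}

def prevalence(cases, total):
    """Proportion of the group with the condition."""
    return cases / total

p_females = prevalence(*ili_by_sex["females"])
p_males = prevalence(*ili_by_sex["males"])

risk_difference = p_females - p_males   # absolute excess risk
relative_risk = p_females / p_males     # ratio of prevalences

print(round(p_females, 4), round(p_males, 4))
print(round(risk_difference, 4), round(relative_risk, 2))
```

The two measures answer different questions: the risk difference says females had about 11 more cases per 100 people, while the relative risk says their prevalence was about 1.33 times that of males.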













Odds ratio
Comparing Extent of Disease

Between Groups (5 of 7)



Comparing Extent of Disease Between Groups (6 of 7)
Odds ratio of CVD in hypertensives versus nonhypertensives




Comparing Extent of Disease Between Groups (7 of 7)
Odds ratio of ILI in younger group versus older group

Age           No ILI    ILI    Total
≤ 44 years    703       522    1225
> 44 years    256       94     350
Total         959       616    1575

Relative Risks and Odds Ratios
- Not possible to estimate relative risk in case-control studies
- Possible to estimate odds ratio because of its invariance property


Invariance Property of Odds Ratio (1 of 2)
Case-control study to assess association between smoking and cancer

Invariance Property of Odds Ratio (2 of 2)
Odds ratio for cancer in smokers versus nonsmokers
= (40/29) / (10/21) = 2.90
Odds ratio for smoking in patients with cancer versus not
= (40/10) / (29/21) = 2.90(!)
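
The invariance property can be seen by computing the odds ratio both ways from the same four counts. In this sketch the function `odds_ratio` and its argument order are illustrative assumptions; the counts are those implied by the fractions above (40 smokers and 10 nonsmokers with cancer, 29 smokers and 21 nonsmokers without).

```python
def odds_ratio(a, b, c, d):
    """(a/b) / (c/d): ratio of the odds in the first group to the odds in the second."""
    return (a / b) / (c / d)

# Odds of cancer: smokers (40 with, 29 without) versus nonsmokers (10 with, 21 without).
or_cancer_by_smoking = odds_ratio(40, 29, 10, 21)

# Odds of smoking: cancer patients (40 smokers, 10 nonsmokers)
# versus those without cancer (29 smokers, 21 nonsmokers).
or_smoking_by_cancer = odds_ratio(40, 10, 29, 21)

print(round(or_cancer_by_smoking, 2), round(or_smoking_by_cancer, 2))
```

Both expressions simplify to the same cross-product, (40 × 21) / (29 × 10), which is why the odds ratio can be estimated in a case-control study even though subjects are sampled by outcome rather than by exposure.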
Point Prevalence = Number of persons with disease / Number of persons examined at baseline

Cumulative Incidence = Number of persons who develop disease during a specified period / Number of persons at risk at baseline

Incidence Rate = Number of persons who develop disease during a specified period / Sum of the lengths of time during which persons are disease-free

Risk difference:
= Prevalence(exposed) – Prevalence(unexposed)
= Cumulative Incidence(exposed) – Cumulative Incidence(unexposed)
= Incidence Rate(exposed) – Incidence Rate(unexposed)

Risk difference, smokers versus nonsmokers = Prevalence(smokers) – Prevalence(nonsmokers)

Population attributable risk = [Prevalence(overall) – Prevalence(nonsmokers)] / Prevalence(overall)

Risk difference, females versus males = Prevalence(females) – Prevalence(males)

Relative risk = Prevalence(exposed) / Prevalence(unexposed)
Relative risk, smokers versus nonsmokers = Prevalence(smokers) / Prevalence(nonsmokers)
Relative risk, females versus males = Prevalence(females) / Prevalence(males)

Odds ratio = [Prevalence(exposed) / (1 – Prevalence(exposed))] / [Prevalence(unexposed) / (1 – Prevalence(unexposed))]

Odds ratio of CVD, hypertensives versus nonhypertensives
= [(181/840) / (1 – 181/840)] / [(188/2942) / (1 – 188/2942)] = 0.275 / 0.068 = 4.04

Odds ratio of ILI, younger group versus older group
= [(522/1225) / (1 – 522/1225)] / [(94/350) / (1 – 94/350)] = (0.426/0.574) / (0.269/0.731) = 2.02




Chapter 6
Confidence Interval
Estimates



Learning Objectives (1 of 2)
- Define point estimate, standard error, confidence level, and margin of error
- Compare and contrast standard error and margin of error
- Compute and interpret confidence intervals for means and proportions
- Differentiate independent and matched or paired samples



Learning Objectives (2 of 2)
- Compute confidence intervals for the difference in means and proportions in independent samples and for the mean difference in paired samples
- Identify the appropriate confidence interval formula based on type of outcome variable and number of samples



Statistical Inference (1 of 2)
- There are two broad areas of statistical inference: estimation and hypothesis testing.
- Estimation: the population parameter is unknown, and sample statistics are used to generate estimates of the unknown parameter.



Statistical Inference (2 of 2)
- Hypothesis testing: an explicit statement or hypothesis is generated about the population parameter; sample statistics are analyzed and determined to either support or reject the hypothesis about the parameter.
- In both estimation and hypothesis testing, it is assumed that the sample drawn from the population is a random sample.



Estimation (1 of 2)
- Process of determining likely values for unknown population parameter
- Point estimate is best single-valued estimate for parameter.
- Confidence interval is range of values for parameter.
point estimate ± margin of error



Estimation (2 of 2)A point estimate for a population parameter
is the “best” single number estimate of that parameter. A
confidence interval estimate is a range of values for the
population parameter with a level of confidence attached (e.g.,
95% confidence that the range or interval contains the
parameter).



Confidence Interval Estimates
point estimate ± margin of error

point estimate ± Z SE (point estimate)

where Z = value from standard normal distribution for desired confidence level and SE (point estimate) = standard error of the point estimate



Confidence Intervals for μ
- Continuous outcome
- One sample

n ≥ 30: X̄ ± Z s/√n (Find Z in Table 1B)

n < 30: X̄ ± t s/√n (Find t in Table 2 [next slide], df = n – 1)







Table 2. Critical Values of the t Distribution
Table entries represent values from t distribution with upper tail area equal to α.

Confidence level 80% 90% 95% 98% 99%
Two-sided test α .20 .10 .05 .02 .01
One-sided test α .10 .05 .025 .01 .005
df
1 3.078 6.314 12.71 31.82 63.66
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250

10 1.372 1.812 2.228 2.764 3.169



Example 6.1.

Confidence Interval for μ (1 of 2)
In the Framingham Offspring Study (n = 3534), the mean systolic blood pressure (SBP) was 127.3 with a standard deviation of 19.0. Generate a 95% confidence interval for the true mean SBP.
127.3 ± 0.63
(126.7, 127.9)





Example 6.2.

Confidence Interval for μ (2 of 2)
In a subset of n = 10 participants attending the Framingham Offspring Study, the mean SBP was 121.2 with a standard deviation of 11.1. Generate a 95% confidence interval for the true mean SBP.
df = n – 1 = 9, t = 2.262
121.2 ± 7.94
(113.3, 129.1)
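
Examples 6.1 and 6.2 use the same calculation with different critical values. The sketch below assumes the interval has the form mean ± critical value × s/√n, which reproduces the margins shown on the slides; `mean_ci` is an illustrative name, and the critical values 1.96 and 2.262 are taken from the tables rather than computed.

```python
from math import sqrt

def mean_ci(mean, sd, n, critical_value):
    """Return (lower, upper, margin) for mean +/- critical_value * sd / sqrt(n)."""
    margin = critical_value * sd / sqrt(n)
    return mean - margin, mean + margin, margin

# Example 6.1: large sample, Z = 1.96 for 95% confidence.
lower_61, upper_61, margin_61 = mean_ci(127.3, 19.0, 3534, 1.96)

# Example 6.2: n = 10, so t with df = 9 (2.262 from Table 2) replaces Z.
lower_62, upper_62, margin_62 = mean_ci(121.2, 11.1, 10, 2.262)

print(round(margin_61, 2), round(lower_61, 1), round(upper_61, 1))
print(round(margin_62, 2), round(lower_62, 1), round(upper_62, 1))
```

The second interval is far wider, mostly because the standard error shrinks with √n: 10 participants give much less precision than 3534, and the t value is also larger than Z.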





New Scenario
- Outcome is dichotomous (p = population proportion): result of surgery (success, failure), cancer remission (yes/no)
- One study sample
- Data: On each participant, measure outcome (yes/no); n, x = number of positive responses

Confidence Intervals for pDichotomous outcomeOne sample




(Find Z in Table 1B)








Example 6.3.

Confidence Interval for p
In the Framingham Offspring Study (n = 3532), 1219 patients were
on antihypertensive medications.
Generate a 95% confidence interval for the true proportion on
antihypertensive medication.
0.345 ± 0.016
(0.329, 0.361)
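
Example 6.3 can be reproduced as follows. This is a sketch assuming the usual large-sample interval for a proportion, p̂ ± Z√(p̂(1 – p̂)/n), together with the rule of thumb min[np̂, n(1 – p̂)] ≥ 5; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts quoted in Example 6.3
x, n = 1219, 3532

p_hat = x / n
# Large-sample check: at least 5 expected successes and 5 expected failures
if min(n * p_hat, n * (1 - p_hat)) < 5:
    raise ValueError("sample too small for the normal approximation")

z = NormalDist().inv_cdf(0.975)
margin = z * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"{p_hat:.3f} ± {margin:.3f} -> ({lower:.3f}, {upper:.3f})")
```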






New Scenario
Outcome is continuous
  SBP, weight, cholesterol
Two independent study samples
Data
  On each participant, identify group and measure outcome

Two Independent Samples (1 of 2)
RCT: Set of Subjects Who Meet
Study Eligibility Criteria
Randomize


Treatment 1 Treatment 2
Mean Trt 1 Mean Trt 2



Two Independent Samples (2 of 2)
Cohort Study: Set of Subjects Who
Meet Study Inclusion Criteria


Group 1 Group 2
Mean Group 1 Mean Group 2



Confidence Intervals for (μ1 – μ2)
Continuous outcome
Two independent samples

n1 ≥ 30 and n2 ≥ 30 (Find Z in Table 1B)

n1 < 30 or n2 < 30 (Find t in Table 2, df = n1 + n2 – 2)









Pooled Estimate of Common Standard Deviation, Sp
Previous formulas assume equal variances (σ1² = σ2²)
If 0.5 ≤ s1²/s2² ≤ 2, assumption is reasonable





Example 6.5.

Confidence Interval for (μ1 – μ2)
Using data collected in the Framingham Offspring Study, generate a
95% confidence interval for the difference in mean SBP between men
and women.

         n      Mean    Std Dev
Men      1623   128.2   17.5
Women    1911   126.5   20.1



Assess Equality of Variances
Ratio of sample variances: 17.5²/20.1² = 0.76

Confidence Intervals for (μ1 – μ2)

1.7 ± 1.26
(0.44, 2.96)
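
The steps of Example 6.5 (variance-ratio check, pooled standard deviation, interval) can be sketched in a few lines. The code assumes the pooled formula Sp = √[((n1 – 1)s1² + (n2 – 1)s2²)/(n1 + n2 – 2)]. Because it carries Sp unrounded rather than rounding it to 19.0 first, the margin comes out slightly below the 1.26 shown on the slide.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in Example 6.5
n1, mean1, sd1 = 1623, 128.2, 17.5     # men
n2, mean2, sd2 = 1911, 126.5, 20.1     # women

# Step 1: the equal-variance assumption is reasonable if the ratio is in [0.5, 2]
variance_ratio = sd1**2 / sd2**2

# Step 2: pooled estimate of the common standard deviation
sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

# Step 3: large-sample interval for the difference in means
z = NormalDist().inv_cdf(0.975)
difference = mean1 - mean2
margin = z * sp * sqrt(1 / n1 + 1 / n2)
lower, upper = difference - margin, difference + margin

print(f"variance ratio = {variance_ratio:.2f}, Sp = {sp:.2f}")
print(f"{difference:.1f} ± {margin:.2f} -> ({lower:.2f}, {upper:.2f})")
```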





New Scenario
Outcome is continuous
  SBP, weight, cholesterol
Two matched study samples
Data
  On each participant, measure outcome under each experimental condition.
  Compute differences (D = X1 – X2)







Two Dependent/Matched Samples
Subject ID Measure 1 Measure 2
1 55 70
2 42 60
.
.

Measures taken serially in time or under different
experimental conditions

Crossover Trial
Treatment Treatment


Eligible R
Participants

Placebo Placebo

Each participant measured on treatment and placebo



Confidence Intervals for μd
Continuous outcome
Two matched/paired samples

n ≥ 30 (Find Z in Table 1B)

n < 30 (Find t in Table 2, df = n – 1)







Example 6.8.

Confidence Interval for μd (1 of 3)
In a crossover trial to evaluate a new medication for depressive
symptoms, patients’ depressive symptoms were measured after taking
the new drug and after taking placebo. Depressive symptoms were
measured on a scale of 0 to 100, with higher scores indicative of
more symptoms.



Example 6.8.

Confidence Interval for μd (2 of 3)
Construct a 95% confidence interval for the mean difference in
depressive symptoms between drug and placebo.
The mean difference in the sample (n = 100) is –12.7 with a
standard deviation of 8.9.



Example 6.8.

Confidence Interval for μd (3 of 3)

–12.7 ± 1.74
(–14.4, –11.0)
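
A short check of Example 6.8, assuming the large-sample paired formula (mean difference ± Z × sd/√n). The limits are simply –12.7 – 1.74 and –12.7 + 1.74, which round to –14.4 and –11.0.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics of the paired differences quoted in Example 6.8
n, mean_diff, sd_diff = 100, -12.7, 8.9

z = NormalDist().inv_cdf(0.975)
margin = z * sd_diff / sqrt(n)         # Z * s_d / sqrt(n)
lower, upper = mean_diff - margin, mean_diff + margin

print(f"{mean_diff} ± {margin:.2f} -> ({lower:.1f}, {upper:.1f})")
```

Both limits are negative, so the interval excludes zero: symptom scores are lower on the drug than on placebo.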





New Scenario
Outcome is dichotomous
  Result of surgery (success, failure)
  Cancer remission (yes/no)
Two independent study samples
Data
  On each participant, identify group and measure outcome (yes/no)



Confidence Intervals for (p1 – p2)
Dichotomous outcome
Two independent samples

(Find Z in Table 1B)







Example 6.10.

Confidence Interval for (p1 – p2) (1 of 3)
A clinical trial compares a new pain reliever to that considered
standard care in patients undergoing joint replacement surgery; the
outcome of interest is reduction in pain by 3+ scale points.
Construct a 95% confidence interval for the difference in
proportions of patients reporting a reduction between treatments.



Example 6.10.

Confidence Interval for (p1 – p2) (2 of 3)
Reduction of 3+ Points
Treatment n Number Proportion
New 50 23 0.46
Standard 50 11 0.22



Example 6.10.

Confidence Interval for (p1 – p2) (3 of 3)

0.24 ± 0.18
(0.06, 0.42)
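
Example 6.10 can be verified with the sketch below, which assumes the unpooled large-sample interval for a difference in proportions; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the Example 6.10 table: (number with 3+ point reduction, n)
x1, n1 = 23, 50    # new pain reliever
x2, n2 = 11, 50    # standard care

p1, p2 = x1 / n1, x2 / n2
z = NormalDist().inv_cdf(0.975)
standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
difference = p1 - p2
margin = z * standard_error
lower, upper = difference - margin, difference + margin

print(f"{difference:.2f} ± {margin:.2f} -> ({lower:.2f}, {upper:.2f})")
```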





Confidence Intervals for Relative Risk (RR)
Dichotomous outcome
Two independent samples

Compute the interval for ln(RR), then take
exp(lower limit), exp(upper limit)

(Find Z in Table 1B)






Example 6.12.

Confidence Interval for RR (1 of 2)
Reduction of 3+ Points
Treatment n Number Proportion
New 50 23 0.46
Standard 50 11 0.22


Construct a 95% CI for the relative risk.

Example 6.12.

Confidence Interval for RR (2 of 2)
ln(RR): 0.737 ± 0.602, or (0.135, 1.339)
RR: (exp(0.135), exp(1.339)) = (1.14, 3.82)
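
The two-step calculation in Example 6.12 (interval on the log scale, then exponentiate) is sketched below. It assumes the standard error √[((n1 – x1)/x1)/n1 + ((n2 – x2)/x2)/n2] for ln(RR). Carrying unrounded intermediate values can move the last printed digit relative to the slide.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Counts from the Example 6.12 table
x1, n1 = 23, 50    # new treatment
x2, n2 = 11, 50    # standard treatment

relative_risk = (x1 / n1) / (x2 / n2)
z = NormalDist().inv_cdf(0.975)
se_log_rr = sqrt(((n1 - x1) / x1) / n1 + ((n2 - x2) / x2) / n2)
margin = z * se_log_rr

# Interval for ln(RR), then back-transform to the RR scale
log_lower, log_upper = log(relative_risk) - margin, log(relative_risk) + margin
lower, upper = exp(log_lower), exp(log_upper)

print(f"RR = {relative_risk:.2f}")
print(f"ln(RR) interval: ({log_lower:.3f}, {log_upper:.3f})")
print(f"RR interval: ({lower:.2f}, {upper:.2f})")
```

The interval is not symmetric around the point estimate because it is symmetric on the log scale.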





Confidence Intervals for Odds Ratio (OR)
Dichotomous outcome
Two independent samples

Compute the interval for ln(OR), then take
exp(lower limit), exp(upper limit)

(Find Z in Table 1B)






Example 6.14.

Confidence Interval for OR (1 of 2)
Reduction of 3+ Points
Treatment n Number Proportion
New 50 23 0.46
Standard 50 11 0.22

Construct a 95% CI for the odds ratio.



Example 6.14.

Confidence Interval for OR (2 of 2)
ln(OR): 1.105 ± 0.870, or (0.235, 1.975)
OR: (exp(0.235), exp(1.975)) = (1.26, 7.21)
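
Example 6.14 follows the same pattern as the relative risk, with a different standard error. The sketch assumes SE[ln(OR)] = √(1/x1 + 1/(n1 – x1) + 1/x2 + 1/(n2 – x2)); small last-digit differences from the slide can arise because intermediate values are not rounded.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Counts from the Example 6.14 table
x1, n1 = 23, 50    # new treatment
x2, n2 = 11, 50    # standard treatment

odds_ratio = (x1 / (n1 - x1)) / (x2 / (n2 - x2))
z = NormalDist().inv_cdf(0.975)
se_log_or = sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
margin = z * se_log_or

# Interval for ln(OR), then back-transform to the OR scale
log_lower, log_upper = log(odds_ratio) - margin, log(odds_ratio) + margin
lower, upper = exp(log_lower), exp(log_upper)

print(f"OR = {odds_ratio:.2f}")
print(f"ln(OR) interval: ({log_lower:.3f}, {log_upper:.3f})")
print(f"OR interval: ({lower:.2f}, {upper:.2f})")
```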


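Formulas and worked calculations for the confidence interval slides

Confidence interval for μ, one sample (n ≥ 30, then n < 30):
\[ \bar{X} \pm Z \frac{s}{\sqrt{n}} \qquad\qquad \bar{X} \pm t \frac{s}{\sqrt{n}} \]

Example 6.1:
\[ 127.3 \pm 1.96 \frac{19.0}{\sqrt{3534}} \]

Example 6.2:
\[ 121.2 \pm 2.262 \frac{11.1}{\sqrt{10}} \]

Confidence interval for p, one sample:
\[ \hat{p} = \frac{x}{n}, \qquad \min[n\hat{p},\, n(1 - \hat{p})] \ge 5 \]
\[ \hat{p} \pm Z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

Example 6.3:
\[ \hat{p} = \frac{1219}{3532} = 0.345 \]
\[ 0.345 \pm 1.96 \sqrt{\frac{0.345(1 - 0.345)}{3532}} \]

Two independent samples, continuous outcome. Data summaries:
\[ n_1,\ \bar{X}_1,\ s_1^2 \text{ (or } s_1\text{)}, \qquad n_2,\ \bar{X}_2,\ s_2^2 \text{ (or } s_2\text{)} \]

Confidence interval for (μ1 – μ2) (n1 ≥ 30 and n2 ≥ 30, then n1 < 30 or n2 < 30):
\[ (\bar{X}_1 - \bar{X}_2) \pm Z S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \qquad\qquad (\bar{X}_1 - \bar{X}_2) \pm t S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]

Pooled estimate of the common standard deviation:
\[ S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

Example 6.5:
\[ S_p = \sqrt{\frac{(1623 - 1)17.5^2 + (1911 - 1)20.1^2}{1623 + 1911 - 2}} = \sqrt{359.12} = 19.0 \]
\[ (128.2 - 126.5) \pm 1.96\,(19.0) \sqrt{\frac{1}{1623} + \frac{1}{1911}} \]

Two matched samples. Data summaries:
\[ n,\ \bar{X}_d,\ s_d \]

Confidence interval for μd (n ≥ 30, then n < 30):
\[ \bar{X}_d \pm Z \frac{s_d}{\sqrt{n}} \qquad\qquad \bar{X}_d \pm t \frac{s_d}{\sqrt{n}} \]

Example 6.8:
\[ -12.7 \pm 1.96 \frac{8.9}{\sqrt{100}} \]

Two independent samples, dichotomous outcome. Data summaries:
\[ n_1,\ \hat{p}_1, \qquad n_2,\ \hat{p}_2 \]
\[ \min[n_1\hat{p}_1,\ n_1(1 - \hat{p}_1),\ n_2\hat{p}_2,\ n_2(1 - \hat{p}_2)] \ge 5 \]

Confidence interval for (p1 – p2):
\[ (\hat{p}_1 - \hat{p}_2) \pm Z \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \]

Example 6.10:
\[ (0.46 - 0.22) \pm 1.96 \sqrt{\frac{0.46(1 - 0.46)}{50} + \frac{0.22(1 - 0.22)}{50}} \]

Confidence interval for ln(RR):
\[ \ln(\widehat{RR}) \pm Z \sqrt{\frac{(n_1 - x_1)/x_1}{n_1} + \frac{(n_2 - x_2)/x_2}{n_2}} \]

Example 6.12:
\[ \widehat{RR} = \frac{\hat{p}_1}{\hat{p}_2} = \frac{0.46}{0.22} = 2.09 \]
\[ \ln(2.09) \pm 1.96 \sqrt{\frac{27/23}{50} + \frac{39/11}{50}} \]

Confidence interval for ln(OR):
\[ \ln(\widehat{OR}) \pm Z \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}} \]

Example 6.14:
\[ \widehat{OR} = \frac{x_1/(n_1 - x_1)}{x_2/(n_2 - x_2)} = \frac{23/27}{11/39} = 3.02 \]
\[ \ln(3.02) \pm 1.96 \sqrt{\frac{1}{23} + \frac{1}{27} + \frac{1}{11} + \frac{1}{39}} \]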





Chapter 2
Study Designs



Learning Objectives (1 of 2)
List and define the components of a good study design
Compare and contrast observational and experimental study designs
Summarize the advantages and disadvantages of alternative study designs



Learning Objectives (2 of 2)
Describe the key features of a randomized controlled trial
Identify the study designs used in public health and medical studies



Study Designs
Observational studies
  Case report/case series
  Cross-sectional (prevalence) survey
  Case-control study
  Cohort study
Experimental studies
  Randomized controlled (clinical) trial



Inferences
Observational studies: inferences limited to descriptions and associations; with carefully designed analysis, can make stronger inferences (statistical adjustment)
Experimental studies: cause and effect

In all studies: need careful definition of disease (outcome) and exposure (risk factor)



Which Design Is Best?
Depends on the study question
What is current knowledge on topic?
How common is disease (and risk factors)?
How long would study take; what are costs?
Ethical issues



Case Report/Case Series
Observational study
Case report: detailed report of specific features of case
Case series: systematic review of common features of a small number of cases
Advantage: cost-efficient
Disadvantages: no comparison group, no specific research question



Case Series (1 of 2)
Simplest design: description of interesting observations in a small number of individuals
Usually case series do not involve control patients (i.e., patients free of disease)
Usually lead to generation of hypotheses for more formal testing
Criticisms: not planned, no research hypotheses

Case Series (2 of 2)
Gottlieb (1981) studied five young homosexual men with a rare form of pneumonia and other unusual infections.
Initial report was followed by more series (26 cases in NY and CA; “cluster” in southern CA; 34 cases among Haitians, etc.)
Condition termed AIDS in 1982.



Cross-Sectional Survey (1 of 2)
Observational study conducted at a point in time
Advantages: cost-efficient, easy to implement, ethical
Disadvantages: no temporal information, non-response bias



Cross-Sectional Survey (2 of 2)
Is there an association between diabetes and cardiovascular disease (CVD)?

Patients with Diabetes

Patients without Diabetes


Patients with CVD



Prospective Cohort Study
Observational study involving a group (cohort) of individuals who meet inclusion criteria followed prospectively in time for risk factor and outcome information
Advantages: can assess temporal relationships
Disadvantages: need large numbers for rare outcomes, confounding

Cohort Study (1 of 3)
Is there an association between hypertension and cardiovascular disease?

CVD
Hypertension
No CVD

Cohort

CVD
No Hypertension
No CVD

Study Start Time










Cohort Study (2 of 3)
Identify a group of individuals that meet inclusion criteria.
Follow prospectively in time.
Assess exposure.
Evaluate outcome status.



Cohort Study (3 of 3)
Includes persons exposed and not exposed to risk factor at outset; usually persons are disease free.
Can assess temporal relationship
Problem if disease is rare (small numbers)
Bias is less of an issue than in case-control.
Confounding may be a problem.



The Framingham Heart Study
5000+ men and women enrolled in 1948
Longitudinal cohort study
Exams every 2 years for cardiovascular risk factors (surveillance)
Ancillary studies: hearing, exercise, nutrition, neurological studies
5000+ offspring and spouses enrolled in 1976
Third generation enrolled in 2002



Selection of Study Sample
Exposure group
  Common risk factors: general population (e.g., Framingham Study)
  Rare risk factors: special exposure cohort (e.g., soldiers exposed to Agent Orange)
Comparison group
  Similar on all other factors that might affect outcome



Case-Control Study (1 of 3)
Observational study involving individuals with (cases) and without (controls) outcome of interest
Advantages: cost and time efficient for rare outcomes
Disadvantages: need careful selection of cases and controls, bias



Case-Control Study (2 of 3)
Is there an association between sleep position and sudden infant death syndrome (SIDS)?

Sleep prone
SIDS
Other

Sleep prone
No SIDS
Other

Study Start Time









Case-Control Study (3 of 3)
Select subjects on the basis of outcome.
Cases have disease.
Controls are free of disease.
Compare groups with respect to proportions with a history of exposure (possible cause).
Investigation is retrospective in time.



Sampling
Selection of cases
  Need explicit definition to make cases as homogeneous as possible
  Debate over whether cases should represent all persons with disease or specific subgroup (limit inferences)
Selection of controls
  Should be comparable to cases (same exclusions)
  Controls represent non-diseased persons who would have been included as cases if they had disease.



Features
Retrospective design
Cost and time efficient
Can get sufficient number of cases (useful for rare conditions)
Can investigate array of exposures
Best for diseases with long latency

Issues
Ascertainment of exposure and disease status
Both exposure and disease have occurred: hard to establish temporal relationship
Bias
  Selection bias: select cases or controls and some drop out, leaving groups not comparable
  Observation bias: knowledge of disease might influence reporting of exposure (over-reporting among cases)
  Recall bias: retrospective (long term)



Randomized Controlled Trial (1 of 2)
Experimental study where patients are randomized to receive one of several comparison treatments
Advantages: gold standard from a statistical point of view, minimizes bias and confounding
Disadvantages: expensive, requires extensive monitoring, inclusion criteria can limit generalizability



Randomized Controlled Trial (2 of 2)
Is new drug effective in reducing hyperlipidemia (high total serum cholesterol)?

Hyperlipidemia
Drug
No Hyperlipidemia
Sample RANDOMIZE

Hyperlipidemia
Placebo
No Hyperlipidemia


Study Start Time

Randomized Controlled Trial (Clinical Trial)
Subjects are randomized to one of two (or more) treatments, one of which may be a control treatment.
In the long run, treatment groups will be balanced in known and unknown prognostic factors.
Important that the treatments are concurrent: the active and control treatments occur in the same period of time
Single- versus multicenter



Features
If possible, a study should be double blinded: neither the investigator nor the participant is aware of what treatment the participant is undergoing.
Sometimes it is impossible to blind the participants (for example, when the treatments being compared are medical versus surgical), but often it is possible to ensure that the people evaluating the outcome are unaware of the treatment.



Phase I: Safety
First time in humans; main objective to assess toxicity and safety in humans (pharmacokinetics)
Usually involves 10 to 15 patients
Subjects are usually healthy.
Some are placebo-controlled.

Phase II: Feasibility Study
Focus still on safety
Side effects and adverse events
Efficacy is important; goal is to determine optimal dosage.
Involves a control group, and subjects are randomized.



Phase III: Clinical Trial
Focus is efficacy.
Data are collected to monitor safety.
Involves a control group (placebo, active control)
Usually involves 200 to 500 subjects
Subjects are randomized.
At least two centers



Phase IV: Post-Marketing
After approval by FDA (based on efficacy proven statistically in two or more studies, New Drug Application (NDA) reviewed within 1 year)
Focus is effectiveness.



Critical Components of RCT
Randomization
Control group: ethical issues
Monitoring
Interim analysis
Data and safety monitoring board
Data management
Reporting





Chapter 1
Introduction



Learning Objectives (1 of 2)
Define biostatistical applications and their objectives
Explain the limitations of biostatistical analysis
Compare and contrast a population and a sample
Explain the importance of random sampling



Learning Objectives (2 of 2)
Develop research questions and select appropriate outcome variables to address important public health problems
Identify the general principles and explain the role and importance of biostatistical analysis in medical, public health, and biological research



What Is Biostatistics? (1 of 2)
Application of statistical principles to medical, public health, and biological applications
Collecting, summarizing, and interpreting information, and
Making inferences that appropriately account for uncertainty



What Is Biostatistics? (2 of 2)



Population
(unknown information)
Sample

Summarize sample
Make inferences about population



Issues and Limitations (1 of 2)
Must clearly define research question
Must choose appropriate study design (i.e., the way in which data are collected)
Must select a sufficiently large, representative sample
Must carefully collect and summarize data



Issues and Limitations (2 of 2)
Must quantify uncertainty
Must appropriately account for relationships among characteristics
Must limit inferences to appropriate population



Important Questions
H1N1 outbreak
Risk factors for heart disease
Drug safety and efficacy
High-risk health behaviors
Genetic determinants of disease
Risk factors for autism
Impact of diet and exercise on health
Impact of Gulf oil spill on health



Issues for Biostatisticians (1 of 2)
Children: Obesity, immunizations, asthma, autism, etc.
Adolescents: Alcohol and tobacco use, depression, STDs, traffic accidents, etc.
Adults: Cancer, CVD, substance abuse, HIV/AIDS, mental health, etc.
What is number one killer of men and women in United States?
What are the risk factors?



Issues for Biostatisticians (2 of 2)
Research question
Study sample
Sample size
Analytic techniques
Inferences: cause/effect
Limitations




Types of Studies
Laboratory studies
Animal studies
Clinical studies
Observational studies
Experimental trials



Research Teams
Principal investigator
Biostatistician
Co-investigators
Project manager
Statistical programmers
Research assistants



Biostatistician’s Role on Team
Study design
Research question
Study sample
Sample size
Enrollment/follow-up strategies
Ongoing monitoring
Interim and final analysis
Reporting of results



Careers
Pharmaceutical industry
Government
Academia
Health insurance

Demand far exceeds supply of qualified biostatisticians today.



Training/Skills
Mathematics background
Biostatistics/statistics
Public health/biology
Computer skills
Communication skills
Analytic skills
Organizational skills
Attention to detail

HSA-6752 Statistics in Health Care Management: Assignment 3
Case Study: Chapters 5 and 6

Objective: Students will complete a case study assignment that
gives them the opportunity to apply the concepts learned in
this and previous projects to a real-world scenario.
The case study illustrates, through example, the practical
importance and implications of the various roles and functions of a
Health Care Administrator in probability and interval estimates.
The analysis will advance students’ understanding of, and
ability to think critically about, basic concepts of
probability and an introduction to estimation.
ASSIGNMENT GUIDELINES (10%):
Students will critically evaluate the readings from Chapters 5
and 6 in the textbook. This assignment is designed to help you
examine, evaluate, and apply the readings and strategies
to your Health Care organization.
Read the chapters assigned for week 4 and write
a 3-4 page paper demonstrating your understanding of the readings
and your ability to apply them to your Health Care organization. Each
paper must be typewritten in 12-point font and double-spaced
with standard margins. Follow APA style (7th edition)
when referring to the selected articles and include a reference
page.
EACH PAPER SHOULD INCLUDE THE FOLLOWING:

1. Introduction (25%) Provide a brief synopsis, in your own words,
of the meaning (not a description) of each chapter you read as it
applies to the case study presented.
2. Your Critique (50%)
Case Studies
The Effect of Maternal Healthcare on the Probability of Child
Survival in Azerbaijan
Abstract
This study assesses the effects of maternal healthcare on child
survival by using nonrandomized data from a cross-sectional
survey in Azerbaijan. Using 2SLS and simultaneous equation
bivariate probit models, we estimate the effect of delivering in
a healthcare facility on the probability of child survival, taking
into account self-selection into the treatment. For women who
delivered at healthcare facilities, the probability of child
survival increases by approximately 18%. Furthermore, if every
woman had the opportunity to deliver in a healthcare facility,
then the probability of child survival in Azerbaijan as a whole
would have increased by approximately 16%.
1. Introduction
Poor child outcomes are usually associated with underutilization
of maternal healthcare. Given unusually high mortality rates in
countries of Central Asia and Caucasus, poor child outcomes
and maternal healthcare should become important topics for
research. Nevertheless, there are very few studies on these
topics in the region. The available studies can be divided into
two broader groups. The first group explored determinants of
child mortality. The second group explored determinants of
maternal healthcare utilization. Although these studies have
important contributions, their main limitation is that the most
important question on whether healthcare has an effect on the
reduction of child mortality is overlooked. However, designing
and implementing effective health policy require concrete
information on the effectiveness of the existing maternal
healthcare.
The contribution of the present study is that it attempts to fill
the gap in the existing literature by quantifying the direct effect
of delivery in a healthcare facility on the probability of child
survival. A robust evaluation of a program’s effect on a population
usually involves a randomized controlled trial (RCT). In many
cases, including the evaluation of maternal healthcare, conducting
an RCT is not possible from an ethical perspective (it would mean
withholding a vital service) or from a technical perspective (lack
of the money and time required to conduct a countrywide RCT). To
overcome these difficulties, we assess the effect of healthcare and
homecare on child survival by using a quasiexperimental
evaluation of nonrandomized data from a cross-sectional survey.
In this way, this study contributes to the recent discussion on
appropriate methods for evaluating the effect of healthcare
programs when an RCT is not feasible.
Azerbaijan, a low-income transitional country in the Caucasus, is
an interesting setting for examining the above-mentioned issues
for several reasons. First, Azerbaijan has the highest infant
mortality rate and one of the highest proportions of child
deliveries outside of healthcare facilities, even compared with
other transitional countries in the region. Second, by studying
Azerbaijan, we benefit from the recently available 2006 Azerbaijan
Demographic and Health Survey, which contains high-quality,
nationally representative data on the issues of interest.
Third, there is a current theoretical debate on the actual
effectiveness of maternal healthcare in transitional countries.
On the one hand, maternal healthcare is universal, officially
free of charge, fully funded, and operated by the government. It
has an extensive network of facilities which is adequately
staffed with qualified personnel. Hence, a fairly strong positive
impact on child survival could be expected and some authors
underscore the importance of maternal healthcare utilization in
transitional countries to improve child outcomes. On the other
hand, the system is characterized by chronic underfunding, lack
of drugs and supplies, dilapidated facilities, lack of systematic
and effective treatments, and high levels of unofficial out-of-
pocket expenditures for personnel. Hence, no impact or only a weak
impact on child survival could be expected. Therefore, by
focusing on Azerbaijan, a transitional country, this study
provides necessary empirical evidence which will contribute to
the current theoretical debate on the effectiveness of maternal
healthcare in transitional countries.
2. Materials and Methods
2.1. Conceptual Framework
We are guided by Mosley and Chen’s framework for studying
the determinants of child survival. According to the framework,
socioeconomic determinants at individual (e.g., women’s
education), household (e.g., household income), and community
(e.g., healthcare input) levels affect a total of 14 proximate
determinants of mortality which are grouped into several
categories, namely, maternal factors, environmental
contamination, nutrient deficiency, and personal illness control.

However, the model has a few limitations for applied research.
Some proximate determinants, for instance, environmental
contamination, are notoriously difficult to define and measure
adequately, especially in population-based surveys.
Furthermore, if a model includes all socioeconomic and all
proximate determinants, then the coefficients on the
socioeconomic variables should not be statistically significant
given that the proximate determinants will pick up all
significance by definition. Consequently, we reduced the
number of independent variables to women’s age at birth and
education, birth order, low child birthweight, household wealth,
and healthcare input. As a result, we used the reduced set of
independent variables which is similar to previous studies on
child survival in the region and international comparative
studies.
2.2. Method
We are interested in estimating the effect of the treatment (having
the child delivered at a healthcare facility) on the outcome (the
probability of child survival). Here we face a problem of
self-selection: the sampled individuals who receive the treatment
differ from those who do not in unobservable ways that are also
correlated with the outcome. To address the self-selection, we use
simultaneous equation regression, which tackles the endogeneity by
specifying and estimating a joint model of the treatment and the
outcome. Since both the treatment and the outcome variable in our
case are binary, we use a simultaneous equation bivariate probit,
the so-called biprobit. The model consists of a first equation and a
main equation. In the first equation, a dummy treatment variable is
regressed on all control variables and one or more instruments. In
the main equation, a dummy outcome variable is regressed on all
control variables and the value of the treatment variable estimated
in the first stage. Importantly, the instruments are excluded from
the main equation. This statistical specification is estimated
using the biprobit command in the Stata software package. After
the biprobit is estimated, we compute the average treatment effect
(ATE) and the average treatment effect on the treated (ATT).
The value of the ATE indicates the expected mean effect of the
treatment for a woman drawn at random from the population. By
contrast, the value of the ATT indicates the expected mean effect of
the treatment for a woman who actually participates in the
program and receives treatment. The ATT permits us to evaluate the
effect on women who received treatment, who can be considered a
more relevant subpopulation for the purposes of evaluating the
effect of a specific program. The full details of the biprobit, ATE,
and ATT computations can be found in Greene and Wooldridge.
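
The first-equation/main-equation logic described above can be illustrated with a linear two-stage least squares sketch on synthetic data. This is not the authors' biprobit model and does not use the AZDHS data: the variables are continuous, there is a single instrument and no control variables, and every coefficient below (including `TRUE_EFFECT`) is invented purely for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(z, x, y):
    """Stage 1: regress treatment x on instrument z. Stage 2: regress y on fitted x."""
    intercept, slope = ols(z, x)
    x_hat = [intercept + slope * zi for zi in z]
    return ols(x_hat, y)[1]


# Synthetic data: u is an unobserved factor that drives both treatment and outcome
rng = random.Random(0)
TRUE_EFFECT = 2.0
z, x, y = [], [], []
for _ in range(20000):
    u = rng.gauss(0.0, 1.0)
    zi = 1.0 if rng.random() < 0.5 else 0.0      # instrument, unrelated to u
    xi = zi + u + rng.gauss(0.0, 1.0)            # treatment depends on z and u
    yi = TRUE_EFFECT * xi + 3.0 * u + rng.gauss(0.0, 1.0)
    z.append(zi)
    x.append(xi)
    y.append(yi)

naive_effect = ols(x, y)[1]                      # biased by self-selection via u
iv_effect = two_stage_least_squares(z, x, y)     # uses only variation driven by z

print(f"true effect {TRUE_EFFECT}, naive OLS {naive_effect:.2f}, 2SLS {iv_effect:.2f}")
```

The naive regression overstates the effect because the unobserved factor raises both treatment and outcome, while the two-stage estimate stays close to the value used to generate the data.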
2.3. Data
This study uses data from the 2006 Azerbaijan Demographic
and Health Survey (the AZDHS). The AZDHS is conducted by
the national statistical authority, the State Statistical Committee
of Azerbaijan, with technical assistance of Macro International,
USA, and with financial support from USAID and UNICEF. The
AZDHS is a cross-sectional survey of 8,444 women aged 15 to
49 from 7,180 households. Field work was conducted from July
to November 2006. The household gross response rate exceeds
90 percent. The AZDHS gathered information on demographics,
educational level, household wealth, healthcare utilization, and
child mortality. The AZDHS collected information about the
outcome of each respondent’s pregnancy for the period, whether
the pregnancy ended in a live birth, a stillbirth, a miscarriage,
or an induced abortion. The survey used the international
definition of child mortality, under which any birth in which a
child showed any sign of life such as breathing, beating of the
heart, or movement of voluntary muscles is defined as a live
birth. The AZDHS collected information on child mortality for
births in 2001 or later, covering a period of 5 years before the
date of the survey only. Among the 13,565 recorded observations,
about 92% of children survived between birth and their fifth
birthday and about 8% died. However, our sample is further
reduced because the questions about place of delivery were asked
only about the most recent birth during the 5 years
before the date of the survey. This means that if a woman had
multiple births during the last 5 years, the questions about place
of delivery were asked only about the latest birth. Consequently,
our final sample consists of 2,285 observations for analysis.
2.4. Outcome and Treatment Variables
The outcome variable of this study is child survival, defined as
survival through 60 months (5 years). This
variable is binary; it takes the value of 1 if the child survives
60 months and the value of 0 otherwise. There are two
endogenous instrumented variables of interest which denote
treatment and serve to gauge healthcare input. The instrumented
treatment variable is “delivery in a healthcare facility,” which
takes the value of 1 if the child was delivered in a healthcare
facility and the value of 0 otherwise. A healthcare
facility is defined as a government or private hospital, maternity
home, polyclinic, woman’s consultation, or primary healthcare
post. Overall, from the sample of 2,285 women who answered
the questions about place of delivery in the AZDHS,
approximately 79% delivered babies in a healthcare facility.
2.5. Instrumental Variables
The instrumental variables used to estimate the endogenous
treatment variables are taken from a previous study that used
instrumental variables to estimate the effect of prenatal
healthcare utilization on child birthweight in Azerbaijan. There
are two instrumental variables—“women from wealthier
households” and “birth order.” The AZDHS contains a variable
representing 5 quintiles of household wealth—poorest, poor,
middle, richer, and richest. We create a “wealthier households”
dummy variable which denotes women from richest and richer
households, and this variable is used in our model 1 and model
2. Finally, “birth order” is a straightforward continuous variable
denoting number of births.
2.6. Exogenous Variables
The exogenous variables used to explain child survival are
taken from the previous studies on the determinants of child
mortality conducted in the countries of the region of Caucasus

and Central Asia. We have two dummy variables representing
women’s age: variable “age 20” indicates women aged 20 or
younger at the time of delivery, while variable “age 36”
indicates women aged 36 and older at the time of delivery.
Dummy variable “low birthweight” indicates if a child’s
birthweight was 2500 grams or lower. Dummy variable “higher
education” indicates women with bachelor education or higher.
Previous studies reported that having delivery at age <20 and
age >35 is associated with higher probability of child mortality.
Likewise, previous studies reported that having low birthweight
is associated with higher probability of child mortality, while
having higher educational achievements is associated with
lower probability of child mortality.
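
The dummy variables described in Sections 2.5 and 2.6 could be coded along the following lines. This is a sketch only: the function name, the field names, and the education labels are hypothetical, the cutoffs follow the definitions given in the text, and the example record is invented rather than taken from the AZDHS.

```python
# Quintile labels as listed in Section 2.5
WEALTHIER_QUINTILES = {"richer", "richest"}
# Hypothetical labels for "bachelor education or higher"
HIGHER_EDUCATION_LEVELS = {"bachelor", "postgraduate"}


def encode_covariates(age_at_delivery: int, birthweight_grams: float,
                      education: str, wealth_quintile: str,
                      birth_order: int) -> dict:
    """Turn one respondent's raw values into the model's dummy variables."""
    return {
        "age_20": int(age_at_delivery <= 20),            # aged 20 or younger
        "age_36": int(age_at_delivery >= 36),            # aged 36 or older
        "low_birthweight": int(birthweight_grams <= 2500),
        "higher_education": int(education in HIGHER_EDUCATION_LEVELS),
        "wealthier_household": int(wealth_quintile in WEALTHIER_QUINTILES),
        "birth_order": birth_order,                      # kept as a count
    }


# Invented example record, for illustration only
example = encode_covariates(19, 2400, "secondary", "richer", 1)
print(example)
```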
2.7. Estimation
We commence with the 2SLS model because tests for
overidentifying restrictions and for the adequacy of the instruments
are readily available for 2SLS but not for biprobit
[11, 12, 32]. Since the number of instrumental variables exceeds
the number of endogenous variables in our case, the
Hansen statistic is employed to evaluate the overidentifying
restrictions. If the Hansen statistic cannot reject the null
hypothesis, the selected instrumental variables are treated as
exogenous. In addition, the Kleibergen-Paap LM statistic is used to
test the adequacy of the instruments. If the test rejects the null
hypothesis, the instruments are adequate to identify the
equation. Lastly, we conduct the Durbin-Wu-Hausman test for
potential endogeneity. A significant test result confirms the
presence of endogeneity and suggests that estimating the
equations without taking endogeneity into account will lead to
biased results. All the above-described tests have been passed in
all estimated models.
Next we estimate the biprobit, which is the more relevant model
because of the binary nature of the outcome and treatment variables.
A straightforward Wald test of endogeneity is available in
biprobit. If the test statistic is significantly different from zero,
then the biprobit should be estimated because of the presence of
endogeneity. In all estimated models, the Wald tests have
confirmed endogeneity. After biprobit model estimation, the ATE
and ATT are computed and reported.
3. Results
The results are reported in Table 1. In the first equation, four
variables are significant with the predicted signs in the 2SLS
estimation. Giving birth at age 20 or earlier and having a
higher birth order are associated with a lower probability of
delivery in a healthcare facility, while having higher
educational achievements and being from a wealthier household
are associated with a higher probability of delivery in a
healthcare facility. Looking at the main equation in 2SLS, we
can see that delivering in a facility improves the chances of
child survival. The results of the biprobit estimation are
consistent with those of the 2SLS estimation. The same four
variables are significant in the first equation, with the same
signs.
4. Discussion and Policy Implications
In this study, we identified and then attempted to fill an
important gap in the literature regarding the effectiveness of
maternal healthcare in reducing under-five child mortality in
Central Asia and the Caucasus. We assessed the
effects of delivering in a healthcare facility on child survival by
using a quasiexperimental evaluation based on nonrandomized
data from a cross-sectional survey in Azerbaijan, a low-income
country in transition. The empirical evidence presented in this
paper allows for drawing several conclusions.
First, delivering children in healthcare facilities increases the
probability of survival. Since reducing child mortality is the
raison d’être for maternal healthcare programs, such a finding
could be expected. However, we were able to confirm that the
effect of delivering at a healthcare facility on child survival
is statistically significant at the national level. We also
quantified the positive effect of such treatment. For women who
delivered at healthcare facilities, the probability of child
survival increases by approximately 18%. Furthermore, if every
woman in Azerbaijan had the opportunity to deliver in a
healthcare facility, the probability of child survival in the
country would increase by approximately 16%. These findings
suggest that utilization of maternal services in transitional
countries should be encouraged and promoted in spite of the
limitations and deficiencies in the current maternal healthcare
system.
Second, our study demonstrates that the wealth gradient is an
important barrier to utilization and hence influences child
outcomes. Since maternal healthcare is officially free, prior
studies explained the effect of the wealth gradient by the high
level of unofficial out-of-pocket expenditures for healthcare
personnel, supplies, and medication. As a result, the wealthier
use healthcare facilities that the poorer cannot afford. This is
in line with our finding that the wealthier deliver in healthcare
facilities, while the poorer have to deliver outside of
healthcare facilities, at home. In this context, one promising
way to reduce the effect of the wealth gradient on utilization is
to introduce benefits for pregnant women, which could be linked
to receipt of targeted social assistance programs.
Third, our study demonstrates that the risk of not delivering at
a healthcare facility is higher for less educated women. Higher
education is strongly associated with delivering in medical
settings and hence with higher chances of child survival.
Habibov reported that there is no significant gender gap in the
level of literacy and education in general in Azerbaijan and
concluded that an increase in nonacademic educational activities
promoting antenatal care should be a priority. Habibov and Fan
confirmed these conclusions using the example of Tajikistan,
another transitional country, showing that having limited
knowledge on matters related to sex is associated with a lower
probability of maternal healthcare utilization. The authors
underlined that the significant effect of knowledge about sex is
independent of formal educational level and persists even when
formal educational level is controlled for. The effectiveness of
communication campaigns designed to explain the benefits of
maternal healthcare and encourage healthcare utilization is well
documented in developing countries. In addition, intensive
communication campaigns aimed at encouraging healthcare
utilization have slowly but steadily become appreciated in some
transitional countries. This positive experience should be shared
across the region.
Finally, population-based, nationally representative surveys
such as the Demographic and Health Surveys by Macro
International and the Living Standards Measurement Surveys by
the World Bank have become an important tool for measuring
policy effects on health outcomes in many transitional and
developing countries. Most of these surveys include modules on
healthcare utilization and childbirth outcomes. Having
high-quality microdata to conduct evaluation of healthcare
programs is an effective way to save time, effort, and costs
while providing nationally representative results. From this
standpoint, our results illustrate empirical strategies for
evaluating nonrandomized data from cross-sectional surveys using
a standard statistical software package.




CASE STUDY CHALLENGE
1. Students should read the case, discuss all of the procedures
performed, and suggest a solution program.
3. Conclusion (15%)
Briefly summarize your thoughts and the conclusion of your
critique of the case study, and provide a possible outcome for
The Effect of Maternal Healthcare on the Probability of Child
Survival in Azerbaijan. How did these articles and chapters
influence your thoughts on the effect of maternal healthcare on
the probability of child survival?
Evaluation will be based on how clearly you respond to the
above, in particular:
a) The clarity with which you critique the case study;
b) The depth, scope, and organization of your paper; and,
c) Your conclusions, including a description of the impact of
this case study on any Health Care Setting.



HSA-6752 Statistics in Health Care Management: Assignment 1
Student PowerPoint Presentation: Chapters 1 and 2

Objectives: The presentation assignment has several goals. It
requires students to apply concepts from biostatistics,
including the issues it addresses and study designs. Working
through study designs will allow students to develop an
understanding of observational studies, randomized studies, and
clinical trials, a skill they will use as healthcare
administrators.
Format and Guidelines: The student will create a PowerPoint
presentation from Chapters 1 and 2 of the textbook related to
Week 1 (choose your desired topic from these chapters). The
presentation should have a minimum of 12 slides, including
Title Page, Introduction, Conclusion, and References.
The student must use other textbooks, research papers, and
articles as references (minimum 3).
EACH PAPER SHOULD INCLUDE THE FOLLOWING:

1. Title Page: Topic Name, Student Name
2. Introduction: Provide a brief synopsis of the meaning (not a
description) of the topic you chose, in your own words.
3. Content Body: Develop your theme and provide material,
illustrations, and diagrams to explain, describe, and clarify the
topic you chose.
4. Conclusion: Briefly summarize your thoughts & conclusion
to your critique of the articles and Chapter you read.
5. References: The student must use other textbooks, research
papers, and articles as references (minimum 3).

HSA-6752 Statistics in Health Care Management: Assignment 2

Critical Reflection Paper: Chapters 3 & 4

Objective: To critically reflect on your understanding of the
readings and your ability to apply them to your Health Care
Setting.
ASSIGNMENT GUIDELINES (10%):
Students will critically analyze the readings from Chapters 3
and 4 in the textbook. This assignment is designed to help you
review, critique, and apply the readings to your Health Care
Setting and to serve as the foundation for all of your
remaining assignments.
You need to read the chapters assigned for week 2 and develop
a 2-3 page paper reflecting your understanding and ability to
apply the readings to your Health Care Setting. Each paper must
be typewritten in 12-point font and double-spaced with
standard margins. Follow APA style 7th edition format when
referring to the selected articles and include a reference page.
EACH PAPER SHOULD INCLUDE THE FOLLOWING:

1. Introduction (25%) Provide a brief synopsis of the meaning
(not a description) of each chapter and article you read, in
your own words.
2. Your Critique (50%)
What is your reaction to the content of the chapters?
What did you learn about Prevalence, Incidence and the
relationships between them?
What did you learn about Ordinal and Categorical Variables, and
how can you apply them?
Did these Chapters change your thoughts about Comparing the
extent of Disease between groups? If so, how? If not, what
remained the same?
3. Conclusion (15%)
Briefly summarize your thoughts & conclusion to your critique
of the Chapter you read. How did these Chapters impact your
thoughts on Quantifying the extent of disease and summarizing
Data Collected in the sample?

Evaluation will be based on how clearly you respond to the
above, in particular:
a) The clarity with which you critique the chapters.
b) The depth, scope, and organization of your paper; and,
c) Your conclusions, including a description of the impact of
these Chapters on any Health Care Setting.