Hypothesis testing for non-parametric data
By Emmanuel BIKORIMANA
Overview of most common statistical tests

Numerical Data, normally distributed
  One group:                 One-sample t-test / Z test
  Two groups, dependent:     Paired two-sample t-test
  Two groups, independent:   t-test
  >= 3 groups:               ANOVA

Numerical Data, not normally distributed
  One group:                 Sign test
  Two groups, dependent:     Wilcoxon
  Two groups, independent:   Mann-Whitney (Wilcoxon)
  >= 3 groups:               Kruskal-Wallis

Categorical Data
  Binary:    Z-test for proportion; McNemar test; Chi-square
  Nominal:   Chi-square goodness of fit; Chi-square
  Ordinal:   Chi-square for trend
How to check for normality?
•Several methods
•Look at the similarity between mean and median
•Graphical methods (Eyeballing)
•Statistical tests
Look at the mean and median
          Variable 1   Variable 2
mean      100          130
median    101          100
sd        20           25
Variable 1: mean ≈ median, consistent with a normal distribution
Variable 2: mean well above the median, so not normally distributed (skewed)
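The mean-versus-median check can be sketched in a few lines of Python. The two samples below are hypothetical (invented for illustration, not taken from the slides), and the helper name mean_median_gap is an assumption; dividing by the standard deviation is just one convenient way to make the gap scale-free.

```python
from statistics import mean, median, stdev

def mean_median_gap(values):
    """Return (mean - median) / sd, a rough, scale-free signal of skewness."""
    return (mean(values) - median(values)) / stdev(values)

# Hypothetical samples, invented for illustration only.
symmetric = [80, 90, 95, 100, 100, 105, 110, 120]
right_skewed = [90, 92, 95, 98, 100, 105, 180, 280]

gap_symmetric = mean_median_gap(symmetric)   # mean equals median here
gap_skewed = mean_median_gap(right_skewed)   # mean is pulled up by the large values

print(round(gap_symmetric, 2), round(gap_skewed, 2))
```

A gap near zero does not prove normality; it only fails to flag skewness, so the graphical methods and formal tests listed above are still worth running.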
Graphical methods (eyeballing)
Histogram
[Figure: histograms of bp_before by age group (30-45, 46-59, 60+), each with a normal density curve overlaid; x-axis: Before, y-axis: Density]
Table of Parametric & Nonparametric Tests

Parametric Test                    Nonparametric Test                         Purpose of Test
Two-Sample t-Test (either case)    Mann-Whitney/Wilcoxon Rank Sum Test        Compare two independent samples
Paired t-Test                      Sign Test or Wilcoxon Signed-Rank Test     Compare dependent samples
Oneway ANOVA                       Kruskal-Wallis Test                        Compare three or more (k) independent samples
1.Independent samples
Mann-Whitney/Wilcoxon Rank Sum Test
•Alternative to two-sample t-Test
•Use when…
-populations being sampled are not normally
distributed.
-sample sizes are small, so assessing normality is
not possible (n_i < 20).
-response is ordinal
Mann-Whitney/Wilcoxon Rank Sum Test
General Hypotheses
Ho: the distributions of pop. A and pop. B are the
same, i.e. A = B
HA: the distributions of pop. A and pop. B are NOT
the same, i.e. A ≠ B
HA: the distribution of pop. A is shifted to the right
of pop. B, i.e. A > B
HA: the distribution of pop. A is shifted to the left of
pop. B, i.e. A < B
Mann-Whitney/Wilcoxon Rank Sum Test
Ho: A = B vs. HA: A > B
Q: Is there evidence that the values in
population A are generally larger than
those in population B?
Mann-Whitney/Wilcoxon Rank Sum Test
(Test Procedure)
1.Rank all N = n_A + n_B observations in the combined
sample from both populations in ascending order.
Assign the average rank to tied observations.
2.Sum the ranks of the observations from populations A
and B separately and denote the sums w_A and w_B.
3.For HA: A < B reject Ho if w_A is “small” or w_B is
“big”.
For HA: A > B reject Ho if w_A is “big” or w_B is
“small”.
4.Use tables to determine how “big” or “small” the rank
sums must be in order to reject Ho, or use software to
conduct the test.
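The ranking and summing steps can be sketched as follows. The two samples are hypothetical (invented for illustration), and the function names average_ranks and rank_sums are assumptions rather than anything defined in the slides. Deciding whether a rank sum is "big" or "small" still needs a table or statistical software.

```python
def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie block
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def rank_sums(sample_a, sample_b):
    """Return (w_A, w_B): the rank sums of each sample within the combined sample."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    w_a = sum(ranks[:len(sample_a)])
    w_b = sum(ranks[len(sample_a):])
    return w_a, w_b

# Hypothetical data, invented for illustration only (note the three tied 15s).
sample_a = [12, 15, 15, 20]
sample_b = [9, 11, 15, 13, 10]
w_a, w_b = rank_sums(sample_a, sample_b)
print(w_a, w_b)
```

A quick sanity check on any hand calculation: the two rank sums must add up to N(N + 1)/2.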
Mann-Whitney/Wilcoxon Rank Sum Test
(Critical Value Table)
This table contains the value the smaller rank sum must be
less than in order to reject Ho for a one-tailed test
at two significance levels (α = .05 & .01).
Tables exist for the two-tailed tests as well.
n is the sample size of the group with the smaller rank sum.
Example: Huntington’s Disease and
Fasting Glucose Levels
Davidson et al. studied the responses to oral
glucose in patients with Huntington’s disease and
in a group of control subjects. The five-hour
responses are shown below. Is there evidence to
suggest the five-hour glucose (mg percent) is
greater for patients with Huntington’s disease?
Ho: Control = Huntington’s, i.e. C = H
HA: Control < Huntington’s, i.e. C < H
Example: Critical Value Table
Here,
n_C = 10 (control)
n_H = 11 (Huntington’s)
we will reject Ho: C = H in favor of HA: C < H
if the rank sum for the control group is less than
86 at the α = .05 level and less than 77 at the α = .01
level.
Example: Decision/Conclusion
Using the Wilcoxon Rank Sum Test we have
evidence to suggest that the five hour glucose
level for individuals with Huntington’s disease is
greater than that for healthy controls (p < .05).
Note: p < .05 because the observed rank sum for
the control group is less than 86, which is the
critical value for α = .05.
2.Dependent Samples
•Sign Test
•Wilcoxon Signed-Rank Test
Sign Test
•The sign test can be used in place of the paired t-
test when we have evidence that the paired
differences are NOT normally distributed.
•It can be used when the response is ordinal.
•Best used when the response is difficult to
quantify and only improvement can be measured,
i.e. subject got better, got worse, or no change.
•Magnitude of the paired difference is lost when
using this test.
Sign Test
•The sign test looks at the number of (+) and (-)
differences amongst the nonzero paired
differences.
•A preponderance of +’s or –’s can indicate that
some type of change has occurred.
•If the null hypothesis of no change is true we
expect +’s and –’s to be equally likely to occur,
i.e. P(+) = P(-) = .50 and the number of each
observed follows a binomial distribution.
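Because the count of each sign is binomial with p = .50 under the null hypothesis, a one-sided p-value is just a binomial tail probability. The sketch below assumes hypothetical counts (9 plus signs and 1 minus sign, invented for illustration); the function name binomial_lower_tail is an assumption.

```python
from math import comb

def binomial_lower_tail(k, n):
    """P(X <= k) for X ~ Binomial(n, 0.5): the chance of k or fewer of one sign."""
    return sum(comb(n, j) for j in range(k + 1)) / 2 ** n

# Hypothetical result: among 10 nonzero paired differences, 9 are (+) and 1 is (-).
n_plus, n_minus = 9, 1
n_nonzero = n_plus + n_minus

# One-sided p-value when the alternative predicts that (-) signs are rare.
p_one_sided = binomial_lower_tail(n_minus, n_nonzero)
print(round(p_one_sided, 4))
```

The count passed in should be the sign that the alternative hypothesis predicts to be rare; zero differences are dropped before counting, as described above.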
Example: Sign Test
Consider a clinical investigation to assess the
effectiveness of a new drug designed to reduce
repetitive behaviors in children affected with
autism. If the drug is effective, children will
exhibit fewer repetitive behaviors on treatment
as compared to when they are untreated. A total
of 8 children with autism enroll in the study.
Each child is observed by the study psychologist
for a period of 3 hours both before treatment
and then again after taking the new drug for 1
week.
•The time that each child is engaged in repetitive
behavior during each 3 hour observation period is
measured. Repetitive behavior is scored on a scale of 0
to 100 and scores represent the percent of the
observation time in which the child is engaged in
repetitive behavior. For example, a score of 0 indicates
that during the entire observation period the child did
not engage in repetitive behavior while a score of 100
indicates that the child was constantly engaged in
repetitive behavior. The data are shown below.
•The critical values for the Sign Test are tabulated (see the
Critical Values for the Sign Test table).
•To determine the appropriate critical value we need
the sample size, which is equal to the number of
matched pairs (n=8), and our one-sided level of
significance α=0.05. For this example, the critical
value is 1, and the decision rule is to reject Ho if the
smaller of the number of positive or negative
signs is ≤ 1. Here the smaller count is 2, so we do
not reject Ho because 2 > 1. We
do not have sufficient evidence at α=0.05 to show
that there is improvement in repetitive behavior
after taking the drug as compared to before.
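The critical value of 1 for n = 8 can be reproduced from the binomial distribution instead of a printed table. This is a minimal sketch: it takes the critical value to be the largest count c whose lower-tail probability stays at or below α, which is an assumption about how the table was built. The function names are hypothetical.

```python
from math import comb

def binomial_lower_tail(k, n):
    """P(X <= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, j) for j in range(k + 1)) / 2 ** n

def sign_test_critical_value(n, alpha):
    """Largest c with P(X <= c) <= alpha, or None if even c = 0 is too likely."""
    critical = None
    for k in range(n + 1):
        if binomial_lower_tail(k, n) <= alpha:
            critical = k
        else:
            break
    return critical

n_pairs = 8
critical = sign_test_critical_value(n_pairs, alpha=0.05)

# One-sided p-value for the observed smaller count of 2 in the example above.
p_observed = binomial_lower_tail(2, n_pairs)
print(critical, round(p_observed, 4))
```

With a smaller count of 2, the tail probability is 37/256, well above 0.05, which agrees with the decision not to reject the null hypothesis.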
Wilcoxon Signed-Rank Test
•The problem with the sign test is that the
magnitude or size of the paired differences is lost.
•The Wilcoxon Signed-Rank Test uses ranks of
the paired differences to retain some sense of
their size.
•Use when the distribution of the paired
differences is NOT normal or when the sample size
is small.
•Can be used with an ordinal response.
Wilcoxon Signed Rank Test
(Test Procedure)
•Exclude any differences which are zero.
•Put the rest of differences in ascending
order ignoring their signs.
•Assign them ranks.
•If any absolute differences are equal, average
their ranks.
•Attach the sign of each difference to its rank.
Example: Wilcoxon Signed Rank Test
Resting Energy Expenditure (REE) for
Patients with Cystic Fibrosis
•A researcher believes that patients with cystic
fibrosis (CF) expend greater energy during
resting than those without CF. To obtain a fair
comparison she matches 13 patients with CF to
13 patients without CF on the basis of age, sex,
height, and weight.
Example: Wilcoxon Signed Rank Test
Pair   CF (C)   Healthy (H)   Difference d = C - H   Signed Rank
1      1153     996           157                    6
2      1132     1080          52                     3
3      1165     1182          -17                    -2
4      1460     1452          8                      1
5      1634     1162          472                    13
6      1493     1619          -126                   -5
7      1358     1140          218                    8
8      1453     1123          330                    11
9      1185     1113          72                     4
10     1824     1463          361                    12
11     1793     1632          161                    7
12     1930     1614          316                    10
13     2075     1836          239                    9
We then calculate the sum of the positive ranks (T+)
and the sum of the negative ranks (T-).
Here we have
T+ = 6 + 3 + 1 + 13 + 8 + 11 + 4 + 12 + 7 + 10 + 9 = 84
and
T- = 2 + 5 = 7
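The rank sums can be recomputed directly from the CF and healthy values in the table. This is a minimal sketch; the function name signed_rank_sums is an assumption. Ties in the absolute differences would receive average ranks, although this data set has none.

```python
# REE values from the table above, in pair order 1..13.
cf = [1153, 1132, 1165, 1460, 1634, 1493, 1358, 1453, 1185, 1824, 1793, 1930, 2075]
healthy = [996, 1080, 1182, 1452, 1162, 1619, 1140, 1123, 1113, 1463, 1632, 1614, 1836]

def signed_rank_sums(first, second):
    """Return (T+, T-): rank sums of the positive and negative paired differences."""
    # Zero differences are excluded before ranking.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    magnitudes = sorted(abs(d) for d in diffs)

    def average_rank(magnitude):
        lowest = magnitudes.index(magnitude) + 1           # first 1-based position
        highest = lowest + magnitudes.count(magnitude) - 1  # last position in a tie block
        return (lowest + highest) / 2

    t_plus = sum(average_rank(abs(d)) for d in diffs if d > 0)
    t_minus = sum(average_rank(abs(d)) for d in diffs if d < 0)
    return t_plus, t_minus

t_plus, t_minus = signed_rank_sums(cf, healthy)
print(t_plus, t_minus)
```

As a check, T+ and T- together must equal n(n + 1)/2, which is 91 for the 13 nonzero differences here.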
Wilcoxon Signed Rank Test
(Test Statistic)
•Intuitively we will reject Ho, which
states that there is no difference between
the populations, if either one of these rank
sums is “large” and the other is “small”.
•The Wilcoxon Signed Rank Test uses the
smaller rank sum, T = min(T+, T-), as
the test statistic.
Example: Wilcoxon Signed Rank Test
For the cystic fibrosis example we have the
following hypotheses:
Ho: there is no difference in the resting energy
expenditure of individuals with CF and healthy
controls who are the same gender, age, height,
and weight (median paired difference = 0).
HA: the resting energy expenditure of individuals
with CF is greater than that of healthy individuals
who are the same gender, age, height, and weight
(median paired difference > 0).
Example: Wilcoxon Signed Rank Test
HA: the resting energy expenditure of individuals
with CF is greater than that of healthy individuals
who are the same gender, age, height, and weight.
•The alternative is clearly supported if T+ is
“large” or T- is “small”.
•The test statistic T = min(T+, T-) = 7
•Is T = 7 considered small, i.e. what is the
corresponding p-value?
•To answer this question we need a Wilcoxon
Signed Rank Test table or statistical software.
Example: Wilcoxon Signed Rank Test
This table gives the value that our observed
T = min(T+, T-) must be less than in order to reject Ho for
both the two- and one-tailed tests.
Here we have n = 13 & T = 7.
We can see that our test statistic is less than 21 (α = .05)
and 12 (α = .01), so we will reject Ho, and we also estimate
that our p-value < .01.
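Without a printed table, the one-sided p-value for T = 7 with n = 13 can be computed exactly. The sketch below assumes no zero or tied differences, so that under the null hypothesis each rank 1..n is independently positive or negative with probability 1/2. The function name signed_rank_lower_tail is an assumption.

```python
def signed_rank_lower_tail(t, n):
    """P(rank sum <= t) under Ho, where each rank 1..n gets either sign with prob. 1/2."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of {1, ..., n} whose ranks add up to s.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Go downward so each rank is used at most once per subset.
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return sum(counts[: t + 1]) / 2 ** n

p_exact = signed_rank_lower_tail(7, 13)
print(p_exact < 0.01)
```

Only 19 of the 8192 equally likely sign assignments give a rank sum of 7 or less, which agrees with the table-based conclusion that p < .01.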
Example: Wilcoxon Signed Rank Test
•We conclude that individuals with cystic
fibrosis (CF) have a larger resting energy
expenditure when compared to healthy
individuals who are the same gender,
age, height, and weight (p < .01).
Independent Samples
•If we have three or more
populations to compare we use…
Kruskal-Wallis Test
Kruskal-Wallis Test
•One-way ANOVA for a completely randomized
design is based on the assumption of normality and
equality of variance.
•The nonparametric alternative not relying on these
assumptions is called the Kruskal-Wallis Test.
•Like the Mann-Whitney/Wilcoxon Rank Sum Test
we use the sum of the ranks assigned to each group
when considering the combined sample as the basis
for our test statistic.
Kruskal-Wallis Test
Basic Idea:
1) Looking at all observations together,
rank them.
2) Let R_1, R_2, …, R_k be the sums of the ranks
of each group
3) If some R_i’s are much larger than others,
it indicates the response values in
different groups come from different
populations.
Kruskal-Wallis Test
•The test statistic is
\[
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i \left( \frac{R_i}{n_i} - \frac{N+1}{2} \right)^2 \sim \chi^2_{k-1}
\]
where,
N = total sample size = n_1 + n_2 + ... + n_k
R_i / n_i = average rank for group i
(N + 1) / 2 = overall average rank
Kruskal-Wallis Test
•The test statistic is
\[
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i \left( \frac{R_i}{n_i} - \frac{N+1}{2} \right)^2
\]
•Under the null hypothesis, this has an
approximate chi-square distribution with
df = k - 1, i.e. \( H \sim \chi^2_{k-1} \).
•The approximation is OK when each group
contains at least 5 observations.
•N = total sample size = n_1 + n_2 + ... + n_k
Chi-squared Distribution and p-value
[Figure: chi-square density with k - 1 df; Area = p-value]
The null and alternative hypotheses are stated
verbally.
For example
Ho: The plans A, B and C are equally
effective.
HA: At least one of the following is true: A is
different from B, A is different from C, or B is
different from C.
Example: Kruskal-Wallis Test
A clinical trial evaluating the fever reducing effects
of aspirin, ibuprofen, and acetaminophen was
conducted. Study subjects were adults seen in
an ER with diagnoses of flu with body
temperatures between 100 °F and 100.9 °F.
Subjects were randomly assigned to treatment.
Changes in body temperature were recorded
2 hrs. after administration of treatments.
Example: Kruskal-Wallis Test
Resulting Data: Temperature Decrease (deg. F)

Aspirin   Rank   Ibuprofen   Rank   Acetaminophen                Rank
.95       8      .39         5      .19                          4
1.48      14     .44         6      1.02                         9
1.33      12     1.31        11     .07                          3
1.28      10     2.48        15     .01                          2
                 1.39        13     .62                          7
                                    -.39 (i.e. temp increase)    1

N = 15    R_1 = 44    R_2 = 50    R_3 = 26
          n_1 = 4     n_2 = 5     n_3 = 6
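Example: Kruskal-Wallis Test
N = 15    R_1 = 44    R_2 = 50    R_3 = 26
          n_1 = 4     n_2 = 5     n_3 = 6
\[
\begin{aligned}
H &= \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i \left( \frac{R_i}{n_i} - \frac{N+1}{2} \right)^2 \\
  &= \frac{12}{15(15+1)} \left[ 4\left(\frac{44}{4} - \frac{15+1}{2}\right)^2 + 5\left(\frac{50}{5} - \frac{15+1}{2}\right)^2 + 6\left(\frac{26}{6} - \frac{15+1}{2}\right)^2 \right] \\
  &= 6.833 \sim \chi^2_2
\end{aligned}
\]
i.e. chi-square distribution with df = 2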
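The hand calculation can be checked with a short script that ranks the pooled temperature decreases and applies the H formula. This is a minimal sketch that assumes no tied observations (true for this data set); the function name kruskal_wallis_h is an assumption. The closed-form tail probability exp(-H/2) is valid only for a chi-square distribution with exactly 2 degrees of freedom, as in this three-group example.

```python
from math import exp

# Temperature decreases (deg. F) from the example, by treatment group.
groups = {
    "aspirin": [0.95, 1.48, 1.33, 1.28],
    "ibuprofen": [0.39, 0.44, 1.31, 2.48, 1.39],
    "acetaminophen": [0.19, 1.02, 0.07, 0.01, 0.62, -0.39],
}

def kruskal_wallis_h(samples):
    """Return (H, rank sums), assuming no tied observations across the samples."""
    pooled = sorted(value for sample in samples for value in sample)
    rank_of = {value: position + 1 for position, value in enumerate(pooled)}
    n_total = len(pooled)
    overall_mean_rank = (n_total + 1) / 2
    rank_sums = [sum(rank_of[value] for value in sample) for sample in samples]
    spread = sum(
        len(sample) * (r / len(sample) - overall_mean_rank) ** 2
        for sample, r in zip(samples, rank_sums)
    )
    return 12 / (n_total * (n_total + 1)) * spread, rank_sums

h_stat, rank_sums = kruskal_wallis_h(list(groups.values()))

# Upper-tail area of a chi-square with df = 2; this shortcut does not hold for other df.
p_value = exp(-h_stat / 2)
print(rank_sums, round(h_stat, 3), round(p_value, 3))
```

For a different number of groups, or for data with ties, a statistics package is the safer route, since H then needs a tie correction and a general chi-square tail function.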
Chi-squared Distribution and p-value
[Figure: chi-square density with 2 df; the area beyond H = 6.833 is the p-value, approximately .033]
Decision/Conclusion
•Using the Kruskal-Wallis test we have evidence to
suggest that the temperature changes after taking
the different drugs are not the same (p = .033).
•Now we might like to know which drugs
significantly differ from one another.