The jackknife and bootstrap

30 slides, May 10, 2018

About This Presentation

BIOL309: Experimental Design and Data Analysis for Biologists
Lecture block 2


Slide Content

BIOL309: The Jackknife & Bootstrap
Paul Gardner
September 25, 2017
Paul Gardner BIOL309: The Jackknife & Bootstrap

What is Resampling?
"I do not believe in any statistical test unless I can prove it with a permutation test." - R.A. Fisher
• Resampling is a statistical technique in which multiple new samples are drawn from a sample or from the population.
• Statistics of interest (e.g. sample median) are calculated for each new sample. The distribution of the new statistics can be analysed to investigate different properties (e.g. confidence intervals, the error, the bias) of the statistics.
[Figure: diagram labelled "Sampling" and "Inference"]

First, some definitions & reminders
• Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
• Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
• Standard error: $SE_{\bar{x}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n-1)}}$
• Bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated
• Confidence interval
[Figure: two bar charts of counts in the categories "in" and "out"]
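
As a quick check of the reminders on this slide, the sketch below computes each quantity directly from its formula and compares it with R's built-in functions. The data vector is made up for illustration and is not from the lecture.

```r
# Hypothetical example data (not from the lecture)
x <- c(2.1, 3.4, 1.9, 2.8, 3.0, 2.5)
n <- length(x)

x_bar <- sum(x) / n                                 # mean
s2    <- sum((x - x_bar)^2) / (n - 1)               # variance
s     <- sqrt(s2)                                   # standard deviation
se    <- sqrt(sum((x - x_bar)^2) / (n * (n - 1)))   # standard error of the mean

# Compare the hand-computed values with R's built-in functions
c(mean = x_bar, var = s2, sd = s, se = se)
c(mean = mean(x), var = var(x), sd = sd(x), se = sd(x) / sqrt(n))
```

The two printed vectors should agree, since var() and sd() in R use the n - 1 denominator.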

Jackknifing
• a resampling technique especially useful for finding standard error, variance and bias of estimators
• the jackknife is a small, handy tool
• also called leave-one-out (LOO)
• This approach tests that some outlier datapoint is not having a disproportionate influence on the outcome.

Jackknifing
• The jackknife deletes each observation and calculates an estimate based on the remaining $n-1$ values
• It uses this collection of estimates to do things like estimate the bias and the standard error
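
The delete-one idea can be sketched in a few lines of base R, without any package. The data (including the outlier 9.7) and the choice of the standard deviation as the statistic are assumptions made for illustration only.

```r
# Hypothetical data with one outlier (not from the lecture)
x <- c(2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 9.7)
n <- length(x)

# theta: the statistic of interest (the standard deviation, as an illustration)
theta <- function(v) sd(v)

# Delete each observation in turn and re-estimate on the remaining n - 1 values
jack_values <- sapply(seq_len(n), function(i) theta(x[-i]))

theta(x)                 # estimate from the entire dataset
jack_values              # the n delete-one estimates
which.min(jack_values)   # the deletion that shrinks the sd the most
```

Here the smallest delete-one estimate comes from removing the outlier, which is one way to spot a datapoint with a disproportionate influence.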

Jackknifing: definition
• Let $x_1, \ldots, x_n$ be a dataset
• $\theta$ is a parameter you want to estimate from the data (e.g. mean, median, standard deviation, ...)
• Let $\hat{\theta}$ be the estimate based upon the entire dataset
• Let $\hat{\theta}_{-i}$ be the estimate of $\theta$ obtained by deleting observation $x_i$
• Let $\bar{\theta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{-i}$
• Sometimes $\bar{\theta}$ is written $\hat{\theta}_{(\cdot)}$

Jackknifing: estimating bias of a method, and correcting it
• This provides an estimated correction of bias due to the estimation method. The jackknife does not correct for a biased sample. (Wikipedia/Jackknife resampling)
• The jackknife estimate of bias is $B = (n-1)(\bar{\theta} - \hat{\theta})$
• In other words, $B$ is the difference between the average of the delete-one estimates and the estimate from the entire dataset, scaled by $n-1$.
• We can then correct $\hat{\theta}$ (the estimator on the entire dataset), using:
• $\hat{\theta}_{corrected} = \hat{\theta} - B$
• With the magic of algebra:
• $\hat{\theta}_{corrected} = n\hat{\theta} - (n-1)\bar{\theta}$
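
The "magic of algebra" step can be written out in full. Here $\hat{\theta}$ is the estimate on the entire dataset, $\bar{\theta}$ is the average of the delete-one estimates and $B$ is the jackknife estimate of bias, as defined on this slide:

```latex
\begin{aligned}
\hat{\theta}_{corrected} &= \hat{\theta} - B \\
                         &= \hat{\theta} - (n-1)(\bar{\theta} - \hat{\theta}) \\
                         &= \hat{\theta} + (n-1)\hat{\theta} - (n-1)\bar{\theta} \\
                         &= n\hat{\theta} - (n-1)\bar{\theta}
\end{aligned}
```

With made-up numbers $n = 5$, $\hat{\theta} = 1.00$ and $\bar{\theta} = 0.98$: $B = 4 \times (0.98 - 1.00) = -0.08$, so $\hat{\theta}_{corrected} = 1.00 + 0.08 = 1.08$, which matches $5 \times 1.00 - 4 \times 0.98 = 1.08$.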

Jackknifing
• The jackknife estimate of the standard error is:
$SE_{JK}(\hat{\theta}) = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{-i} - \bar{\theta})^2}$
• This simplifies to the standard error ($SE_{\bar{x}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n-1)}}$) when $\theta$ is the mean
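
The claim that the jackknife standard error reduces to the usual standard error when the statistic is the mean can be checked numerically. This is a minimal base R sketch on simulated data; the seed and sample size are arbitrary choices.

```r
set.seed(309)                      # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 2, sd = 1)   # simulated data (an assumption of this sketch)
n <- length(x)

# Delete-one estimates of the mean and their average
jack_means <- sapply(seq_len(n), function(i) mean(x[-i]))
theta_bar  <- mean(jack_means)

# Jackknife standard error
se_jk <- sqrt((n - 1) / n * sum((jack_means - theta_bar)^2))

# Usual standard error of the mean
se_mean <- sd(x) / sqrt(n)

# The two agree up to floating point tolerance
all.equal(se_jk, se_mean)
```

The agreement is exact in algebra: each delete-one mean differs from the overall mean by $(\bar{x} - x_i)/(n-1)$, and substituting that into the jackknife formula gives $\sum (x_i - \bar{x})^2 / (n(n-1))$ under the square root.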

Example
x1 <- rnorm(1000, mean = 2, sd = 1)
#add a single outlier
x <- c(x1, -10)
hist(x, breaks = 500)
library(bootstrap)
#define theta function
theta <- function(x){sd(x)}
#jackknife the standard deviation
j <- jackknife(x, theta)
mean(j$jack.values)
#check j$jack.values is normal
hist(j$jack.values, breaks = 500)
#What is the bias corrected sd?
[Figure: "Histogram of x" (Frequency against x) and "Histogram of j$jack.values" (Frequency against j$jack.values, axis from 1.02 to 1.10)]
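
One way to approach the question on this slide without the bootstrap package is to apply the bias formulas directly. The sketch below uses base R only and rebuilds the same kind of dataset; the seed is an arbitrary assumption, so the numbers will differ from the lecture's.

```r
set.seed(309)                                  # arbitrary seed
x <- c(rnorm(1000, mean = 2, sd = 1), -10)     # same construction as the slide
n <- length(x)

theta_hat   <- sd(x)                                      # estimate on the entire dataset
jack_values <- sapply(seq_len(n), function(i) sd(x[-i]))  # delete-one estimates
theta_bar   <- mean(jack_values)                          # their average

B <- (n - 1) * (theta_bar - theta_hat)   # jackknife estimate of bias
corrected <- theta_hat - B               # bias-corrected estimate

c(sd = theta_hat, bias = B, corrected = corrected)

# The algebraic shortcut gives the same value
all.equal(corrected, n * theta_hat - (n - 1) * theta_bar)
```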

CORRECTION: testing bias corrected values...
The lab example...
library(bootstrap)
#i = sample size, j = value of the single outlier
for (i in c(10, 100, 1000)) {
  for (j in c(-100, -10, -1, 1, 10, 100)) {
    x <- c(rnorm(i, mean = 2, sd = 1), j)
    jk <- jackknife(x, sd)
    #bias corrected sd
    corr <- sd(x) - jk$jack.bias
    cat(paste(round(corr, digits = 2), " "))
  }
  #start a new row for the next sample size
  cat("\n")
}
Jackknife bias corrected values (i = N, j = outlier, expected = 1)
i \ j     -100    -10     -1      1     10    100
10       44.29   4.76   1.31   0.65   2.96  42.25
100      14.28   1.67   0.94   0.96   1.34  13.64
1000      4.21   1.07   0.99   1.01   1.05   4.03

Example: more bad...
The jackknife estimate of variance is slightly biased upward!
Efron & Stein (1981) The jackknife estimate of variance. The Annals of Statistics, pp. 586-596

Jackknifing issues
• When the estimator is not normally distributed jackknifing may fail
• May be unreliable on datasets with a small number of observations
• This provides an estimated correction of bias due to the estimation method. The jackknife does not correct for a biased sample. (Wikipedia/Jackknife resampling)
• Not great when $\theta$ is the standard deviation!

What is bootstrapping?
• Bootstrapping is a useful means for assessing the reliability of your data (e.g. confidence intervals, bias, variance, prediction error, ...).
• It refers to any metric that relies on random sampling with replacement.
• Used to estimate SE, confidence intervals, and test for significance

First, a definition
• Central limit theorem:
• the means of a large number of independent random samples will be approximately normally distributed, regardless of the underlying distribution (provided it has finite variance and each sample is reasonably large)
[Figure: histograms of two samples X (N = 1,000 each), each shown next to a histogram of the bootstrap means of X]
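
A figure of this kind can be approximated with a short base R sketch. The exponential distribution, the seed and the number of resamples are assumptions chosen for illustration; the lecture's own figure may have used different data.

```r
set.seed(309)                 # arbitrary seed
x <- rexp(1000, rate = 1)     # a strongly skewed sample (assumed for illustration)
R <- 1000                     # number of bootstrap resamples

# Mean of each resample drawn with replacement from x
boot_means <- replicate(R, mean(sample(x, replace = TRUE)))

# The sample is skewed, but the bootstrap means look roughly normal
par(mfrow = c(1, 2))
hist(x, breaks = 50, main = "X")
hist(boot_means, breaks = 50, main = "Bootstrap means of X")
```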

Bootstrapping illustrated
[Figure: the (unknown) true distribution and (unknown) true value of θ; the empirical distribution of the sample and its estimate of θ; bootstrap replicates 1, 2 and 3; the distribution of estimates of θ]
• Bootstrap sampling from a distribution (a mixture of 3 normal distributions) to estimate the variance of the mean

Bootstrapping is used a lot in phylogenetics
Yang & Rannala (2012) Molecular phylogenetics: principles and practice. Nature Reviews Genetics.

Application: DNA surveillance
http://dna-surveillance.fos.auckland.ac.nz/

Bootstrap sampling
To infer the error in a quantity, $\theta$, estimated from a dataset $x_1, x_2, \ldots, x_n$ we do the following $R$ times (e.g. $R = 1000$):
1. Draw $n$ times with replacement from the sample. Call these $X^*_1, X^*_2, \ldots, X^*_n$. Note that some points are represented more than once in the bootstrap samples, some once, some not at all.
2. Estimate $\theta$ from the bootstrap sample, call this $\hat{\theta}^*_k$ ($k = 1, 2, \ldots, R$).
3. Once $R$ bootstrap samples have been done, the distribution of $\hat{\theta}^*_k$ estimates the distribution one would get if one were able to draw repeated samples of $n$ points from the unknown true distribution.
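
The three steps can be sketched in base R as follows. The simulated data, the seed and the choice of the mean as the quantity of interest are assumptions for illustration; the mean is used because its analytic standard error is available for comparison.

```r
set.seed(309)                        # arbitrary seed
x <- rnorm(100, mean = 2, sd = 1)    # simulated dataset (assumed for illustration)
n <- length(x)
R <- 1000                            # number of bootstrap samples
theta <- function(v) mean(v)         # quantity of interest

theta_star <- numeric(R)
for (k in seq_len(R)) {
  x_star <- sample(x, size = n, replace = TRUE)  # step 1: draw n times with replacement
  theta_star[k] <- theta(x_star)                 # step 2: estimate from the bootstrap sample
}

# step 3: the spread of theta_star estimates the error in theta(x)
sd(theta_star)        # bootstrap standard error
sd(x) / sqrt(n)       # analytic standard error of the mean, for comparison
```

Swapping theta for another statistic (e.g. median) needs no other change, which is the main attraction of the approach.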

Example: confidence intervals for the median
#simulate a bimodal sample
x1 <- rnorm(500, mean = 2, sd = 1)
x2 <- rnorm(500, mean = -2, sd = 1)
x <- c(x1, x2)
hist(x, breaks = 50)
summary(x)
library(bootstrap)
#define theta function
theta <- function(x){median(x)}
#50 bootstrap estimates of the median
bs <- bootstrap(x, 50, theta)
summary(bs$thetastar)
#What is the 50% confidence
#interval for bootstrap estimates
#of median?
#bootstrap-t confidence points at 2.5% and 97.5%
boott(x, theta, nboott = 1000, perc = c(0.025, 0.975))
[Figure: "Histogram of x" (Frequency against x, x axis from -4 to 4)]
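
For comparison, a percentile interval can be read straight off the bootstrap estimates using base R. Note that this is the simple percentile method, not the bootstrap-t method that boott() implements, so the two need not agree exactly. The seed and the number of resamples are arbitrary assumptions.

```r
set.seed(309)                                  # arbitrary seed
# Bimodal sample built the same way as in the example
x <- c(rnorm(500, mean = 2, sd = 1), rnorm(500, mean = -2, sd = 1))
R <- 1000                                      # number of bootstrap resamples

# Median of each resample drawn with replacement
med_star <- replicate(R, median(sample(x, replace = TRUE)))

ci50 <- quantile(med_star, probs = c(0.25, 0.75))     # 50% percentile interval
ci95 <- quantile(med_star, probs = c(0.025, 0.975))   # 95% percentile interval
ci50
ci95
```

The 50% interval corresponds to the first and third quartiles that summary() reports for the bootstrap estimates.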

Example: regression (I)
#create a simulated dataset, sampling x from a uniform distribution:
x <- runif(1000, -10, 10)
#generate a y dataset with a little noise:
#y = m * x + c, with slope m ~ N(1, 0.1) and intercept noise c ~ N(0, 1)
y <- rnorm(length(x), 1, 0.1) * x + rnorm(length(x), mean = 0, sd = 1)
#plot a regression
reg1 <- lm(y ~ x)
plot(x, y, type = "p")
abline(reg1, col = "red", lwd = 3)

Example: regression (II)
library(bootstrap)
#column bind x & y
xdata <- cbind(x, y)
#create functions, theta1 & theta2,
#1 returns the intercept, 2 returns the slope
#i is a vector of resampled row indices
theta1 <- function(i, xdata){
  coef(lm(xdata[i, 2] ~ xdata[i, 1]))[1]
}
theta2 <- function(i, xdata){
  coef(lm(xdata[i, 2] ~ xdata[i, 1]))[2]
}
#bootstrap!
#the same seed before each call is intended to give both calls the same
#resamples, so that intercept i and slope i belong to the same fit
set.seed(309)
bs1 <- bootstrap(1:length(x), 1000, theta1, xdata)
set.seed(309)
bs2 <- bootstrap(1:length(x), 1000, theta2, xdata)
#95% percentile interval for the slope
quantile(bs2$thetastar, probs = c(0.025, 0.975))
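
An alternative sketch in base R resamples row indices once per replicate and keeps the intercept and slope from the same fit, so each pair is guaranteed to describe one bootstrap regression line. The simplified model (fixed slope of 1), the smaller sample size, the seed and the number of resamples are assumptions made to keep the example short.

```r
set.seed(309)                                  # arbitrary seed
x <- runif(200, -10, 10)
y <- x + rnorm(length(x), mean = 0, sd = 1)    # simplified model, assumed for this sketch
R <- 500                                       # number of bootstrap resamples

# Each replicate: resample row indices, refit, keep both coefficients
coefs <- t(replicate(R, {
  i <- sample(seq_along(x), replace = TRUE)
  coef(lm(y[i] ~ x[i]))
}))
colnames(coefs) <- c("intercept", "slope")

# 95% percentile intervals for the intercept and the slope
apply(coefs, 2, quantile, probs = c(0.025, 0.975))
```

Each row of coefs can then be passed to abline() as an (intercept, slope) pair.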

Example: regression (III)
#plot the resulting lines:
for (i in 1:length(bs1$thetastar)){
  abline(bs1$thetastar[i], bs2$thetastar[i], lty = 2, col = "pink")
}
abline(reg1, col = "red", lwd = 3)
hist(bs1$thetastar, breaks = 100, main = "Intercepts")
hist(bs2$thetastar, breaks = 100, main = "Slopes")
[Figure: scatterplot of y against x (both axes from -10 to 10) with the bootstrap regression lines; histograms "Intercepts" (Frequency against bs1$thetastar) and "Slopes" (Frequency against bs2$thetastar)]

Bootstrap issues
• Need a large number of bootstrap samples (e.g. $R \geq 1000$). The larger the number, the better the estimates.
• If $\theta$ is hard to calculate (e.g. tree building) then bootstrapping can be very computationally intensive.

Another example:
[Figure: robust Z-score (F-measure), axis from -4 to 2, for each tool: EBI-mg, NBC, MetaPhlan, MLTreeMap, Treephyler, RITA, MEGAN, taxator-tk, RAIphy, MetaPhyler, mothur, Kraken, phymmBL, Taxy-Pro, Genometa, Quickr, BMP, QIIME, metaCV, GOTTCHA, LMAT, mOTU, TIPP, CLARK, FOCUS, MG-RAST, MetaBin, CLARK-S, PhyloPythiaS, OneCodex, DUDes, CARMA3, commonkmers, DiScRIBinATE, MetaPhlAn2.0, TACOA. Legend: Bazinet 2012, Lindgreen 2016, McIntyre 2017, Peabody 2015, Sczyrba 2017, Siegwald 2017]

For more information
Chapters 1, 2 and 3:
Manly, B. F. (2006). Randomization, bootstrap and Monte Carlo
methods in biology (Vol. 70). CRC Press.
https://books.google.co.nz/books?id=j2UN5xDMbIsC&rediresc=y

UC Summer undergraduate Research Scholarships
A list of Summer Research Scholarships is now available at
http://www.canterbury.ac.nz/summer-school/summer-scholarships/
Scholarships are for final year undergraduates only
Scholarships are for 10 weeks (Nov-Feb) and valued at $5,000
Students should apply by the 19th September by completing the Application form found at the above website

Need to talk things over?
Advice, help and support on campus:
• Student Care: practical guidance, advice and support for our domestic and international students.
• Māori Student Development Team: are you Māori and need advice, cultural or academic support?
• Disability Resource Service: a disability or medical condition affecting your study?
• Academic Skills Centre: upskill your academic writing and study skills.
• Pacific Development Team: are you Pasifika and need advice, cultural or academic support?
• Students' Association (UCSA): have issues? Need help?
• UC Health Centre: medical care, counselling, travel advice, or physiotherapy.
• UC RecCentre (@UC RecCentre) and UC Sport (@UC Sport): feel more energised. Lift. Move. Play. Compete. Excel.
• UC Careers: develop your employability. Visit www.canterbury.ac.nz/careers
• UC Security: feeling unsafe or need emergency help? 0800 823 637

Not tested: Visualisation: good vs bad
Image source: https://commons.wikimedia.org/wiki/File:Piecharts.svg

Not tested: Visualisation: good vs bad
Weissgerber et al. (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLOS Biology.

The End