Super Learning in Mathematical Algorithms

razigineer 17 views 181 slides Sep 16, 2025


Introduction to Super Learning

Ted Westling, PhD
Postdoctoral Researcher
Center for Causal Inference
Perelman School of Medicine
University of Pennsylvania
September 25, 2018

Learning Goals

Conceptual understanding of Super Learning (SL)
Comfort with the SuperLearner R package
Awareness of the mathematical backbone of SL

Outline

I. Motivation and description of SL (30 minutes)
II. Lab 1: Vanilla SL for a continuous outcome (30 minutes)
III. Mathematical presentation of SL (20 minutes)
IV. Lab 2: Vanilla SL for a binary outcome (30 minutes)
15 minute break
V. Bells and whistles: Screens, weights, and CV-SL (30 minutes)
VI. Lab 3: Binary outcome redux (40 minutes)
VII. Lab 4: Case-control analysis of Fluzone vaccine (30 minutes)

I. Motivation and description of Super Learning

Notation

Y is a univariate outcome.
X is a p-variate set of predictors.
We observe n independent copies (Y_1, X_1), …, (Y_n, X_n) from the joint distribution of (Y, X).

The problem

We want to estimate an unknown function of the data-generating distribution. Super Learning can be applied in many such settings; we will focus on estimating the regression function μ(x) := E[Y | X = x].

Why?

1. Exploratory analysis
2. Imputation of missing values
3. Prediction for new observations
4. Evaluating prediction quality / comparing competing estimators
5. Nuisance parameter estimation
6. Confirmatory analysis / hypothesis testing (not our goal here)

We want to estimate μ(x) = E[Y | X = x]. How should we do it?

GAM
Random Forest
Neural network
GLM

How do we choose which algorithm to use?

Super Learning is: an ensemble method for combining predictions from many candidate machine learning algorithms.

Measuring algorithm performance

Suppose μ̂_1, …, μ̂_K are candidate estimators of μ. (k will always index estimators, and i will always index observations, e.g. study participants.)

The mean squared error of μ̂_k,

    MSE(μ̂_k) = E[(Y − μ̂_k(X))²],

measures the performance of μ̂_k as an estimator of μ. If we knew MSE(μ̂_k), we could choose the μ̂_k with the smallest MSE(μ̂_k).

Estimating MSE

    MSE(μ̂_k) = E[(Y − μ̂_k(X))²]

It is tempting to estimate this by the in-sample average (1/n) Σ_{i=1}^n [Y_i − μ̂_k(X_i)]². This estimator will favor μ̂_k which are overfit, because the μ̂_k are trained on the same data used to evaluate the MSE. Analogy: a student has the exam questions before taking the exam!

Instead, we estimate MSE using cross-validation.
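The overfitting problem can be seen in a small simulation. This is an illustrative sketch, not from the slides: all data and learner choices below are hypothetical. A 1-nearest-neighbor rule interpolates its training data, so its naive in-sample MSE is exactly zero, yet a correctly specified least-squares line predicts fresh data far better.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulation: the truth is Y = X + noise, so an ordinary
# least-squares line is correctly specified.
n = 200
x = rng.uniform(-1.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

lin_coef = np.polyfit(x, y, deg=1)
def lin_pred(xs):
    return np.polyval(lin_coef, xs)

def nn1_pred(xs):
    """1-nearest-neighbor: interpolates the training data exactly."""
    idx = np.abs(x[:, None] - np.asarray(xs)[None, :]).argmin(axis=0)
    return y[idx]

def mse(pred, xs, ys):
    return float(np.mean((ys - pred(xs)) ** 2))

# The naive in-sample MSE declares the interpolating 1-NN "perfect"...
assert mse(nn1_pred, x, y) == 0.0

# ...but on fresh data from the same distribution the line is far better.
x_new = rng.uniform(-1.0, 1.0, 10_000)
y_new = x_new + rng.normal(0.0, 0.5, 10_000)
print(mse(lin_pred, x_new, y_new) < mse(nn1_pred, x_new, y_new))  # True
```

The in-sample comparison would pick the nearest-neighbor interpolator every time; the fresh-data comparison reverses the ranking, which is exactly what cross-validation approximates.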

Cross-validation

1. Split the data into V "folds" of size roughly n/V.
2. For each v = 1, …, V: the data in folds other than v is called the training set; the data in fold v is called the test/validation set.

[Figure: schematic of 10-fold cross-validation. Gray: training sets. Yellow: validation sets.]

Cross-validation

1. Split the data into V "folds" of size roughly n/V.
2. For each v = 1, …, V:
   - the data in folds other than v is called the training set; the data in fold v is called the test/validation set;
   - we obtain μ̂_{k,v} using the training set;
   - we obtain μ̂_{k,v}(X_i) for X_i in the validation set V_v.
3. The cross-validated MSE is

       MSE_CV(μ̂_k) = (1/V) Σ_{v=1}^V (1/|V_v|) Σ_{i∈V_v} [Y_i − μ̂_{k,v}(X_i)]².

We average the MSEs of the V validation sets.

[Figure: schematic of 10-fold cross-validation with the resulting row of cross-validated predictions. Gray: training sets. Yellow: validation sets.]
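The cross-validated MSE above takes only a few lines to sketch in Python. The `cv_mse` helper and the least-squares learner below are illustrative stand-ins, not the SuperLearner package's implementation.

```python
import numpy as np

def cv_mse(fit, x, y, V=10, seed=1):
    """V-fold cross-validated MSE of the learner `fit`.

    `fit(x_train, y_train)` must return a prediction function.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % V            # assign each i to a fold
    fold_mses = []
    for v in range(V):
        test = folds == v
        mu_hat = fit(x[~test], y[~test])           # train on the other folds
        fold_mses.append(np.mean((y[test] - mu_hat(x[test])) ** 2))
    return float(np.mean(fold_mses))               # average over the V folds

# Illustrative learner: an ordinary least-squares line.
def ols_line(x_train, y_train):
    coef = np.polyfit(x_train, y_train, deg=1)
    return lambda xs: np.polyval(coef, xs)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = x + rng.normal(0, 0.5, 500)
print(cv_mse(ols_line, x, y))  # close to the noise variance of 0.25
```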

How do we choose V?

Large V: more data in each training set, so better for small n.
Small V: more data in each test set.
(People typically use V = 5 or V = 10.)

"Discrete" Super Learner

At this point, we have cross-validated MSE estimates MSE_CV(μ̂_1), …, MSE_CV(μ̂_K) for each of our candidate algorithms. We could simply take as our estimator the μ̂_k minimizing these cross-validated MSEs. We call this the "discrete Super Learner".

Super Learner

Let α = (α_1, …, α_K) be an element of S_K, the K-dimensional simplex: each α_k ∈ [0, 1] and Σ_k α_k = 1.

Super Learner considers as its set of candidate algorithms all convex combinations μ̂_α := Σ_{k=1}^K α_k μ̂_k. The Super Learner is μ̂_α̂, where

    α̂ := arg min_{α ∈ S_K} MSE_CV( Σ_{k=1}^K α_k μ̂_k ).

(We use constrained optimization to compute the argmin.)

Super Learner

    α̂ := arg min_{α ∈ S_K} MSE_CV( Σ_{k=1}^K α_k μ̂_k ),

where

    MSE_CV( Σ_{k=1}^K α_k μ̂_k ) = (1/V) Σ_{v=1}^V (1/|V_v|) Σ_{i∈V_v} [ Y_i − Σ_{k=1}^K α_k μ̂_{k,v}(X_i) ]².
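The simplex-constrained minimization can be sketched with projected gradient descent. This is a minimal illustration under stated assumptions: the SuperLearner package solves this step with its own methods, and the matrix Z, function names, and toy data below are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex S_K."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def sl_weights(Z, y, lr=0.1, steps=5000):
    """argmin over alpha in S_K of mean((y - Z @ alpha)^2).

    Z[i, k] holds the cross-validated prediction of candidate k for
    observation i, i.e. mu-hat_{k,v}(X_i) with i in fold v.
    """
    n, K = Z.shape
    alpha = np.full(K, 1.0 / K)                  # start at equal weights
    for _ in range(steps):
        grad = -2.0 * Z.T @ (y - Z @ alpha) / n  # gradient of the CV MSE
        alpha = project_simplex(alpha - lr * grad)
    return alpha

# Toy check: candidate 0 is the truth plus small noise, candidate 1 is junk,
# so nearly all weight should land on candidate 0.
rng = np.random.default_rng(0)
truth = rng.normal(size=300)
y = truth + rng.normal(0, 0.1, 300)
Z = np.column_stack([truth, rng.normal(size=300)])
print(sl_weights(Z, y).round(2))
```

Because the objective is a convex quadratic and the simplex is convex, projected gradient descent converges to the constrained minimizer for a small enough step size.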

Super Learner: steps

Putting it all together:
1. Choose a library of candidate algorithms μ̂_1, …, μ̂_K.
2. Compute the CV-predictions μ̂_{k,v}(X_i) for all k, v, and i ∈ V_v.
3. Compute the SL weights α̂ := arg min_{α ∈ S_K} MSE_CV( Σ_{k=1}^K α_k μ̂_k ).
4. Set μ̂_SL = Σ_{k=1}^K α̂_k μ̂_k.
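The four steps can be sketched end to end. Everything here (the toy data, the two-learner library, and the grid search over the K = 2 simplex) is an illustrative stand-in for the real SuperLearner workflow, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a mildly nonlinear truth (illustrative, not from the slides).
n = 400
x = rng.uniform(-2, 2, n)
y = np.sin(x) + rng.normal(0, 0.3, n)

# Step 1: a library of two candidate learners (deliberately simple here).
def fit_line(xt, yt):
    c = np.polyfit(xt, yt, 1)
    return lambda xs: np.polyval(c, xs)

def fit_cubic(xt, yt):
    c = np.polyfit(xt, yt, 3)
    return lambda xs: np.polyval(c, xs)

library = [fit_line, fit_cubic]

# Step 2: cross-validated predictions Z[i, k] = mu-hat_{k,v}(X_i), i in fold v.
V = 5
folds = rng.permutation(n) % V
Z = np.empty((n, len(library)))
for v in range(V):
    test = folds == v
    for k, fit in enumerate(library):
        Z[test, k] = fit(x[~test], y[~test])(x[test])

# Step 3: SL weights over the simplex (grid search suffices for K = 2;
# a is the weight on the line, 1 - a the weight on the cubic).
grid = np.linspace(0, 1, 1001)
cv_mse = [np.mean((y - (a * Z[:, 0] + (1 - a) * Z[:, 1])) ** 2) for a in grid]
a_hat = grid[int(np.argmin(cv_mse))]

# Step 4: refit each learner on all the data and combine with the weights.
full = [fit(x, y) for fit in library]
mu_sl = lambda xs: a_hat * full[0](xs) + (1 - a_hat) * full[1](xs)

best_single = min(cv_mse[-1], cv_mse[0])   # a = 1 is pure line, a = 0 pure cubic
print(min(cv_mse) <= best_single)          # True: the combination cannot lose
```

Since the grid includes the endpoints a = 0 and a = 1, the combined learner's cross-validated MSE is never worse than the best single candidate's, which is the basic appeal of the convex-combination step.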

II. Lab 1: Vanilla SL for a continuous outcome

III. Into the weeds: a mathematical presentation of SL

Review

Recall the construction of SL for a continuous outcome:
1. Choose a library of candidate algorithms μ̂_1, …, μ̂_K.
2. Compute the CV-predictions μ̂_{k,v}(X_i) for all k, v, and i ∈ V_v.
3. Compute the SL weights α̂ := arg min_{α ∈ S_K} MSE_CV( Σ_{k=1}^K α_k μ̂_k ).
4. Set μ̂_SL = Σ_{k=1}^K α̂_k μ̂_k.

In this section, we generalize this procedure to estimation of a general parameter using an appropriate loss function.

Loss and risk: setup

Denote by O the observed data unit, e.g. O = (Y, X).
Denote by 𝒪 the sample space of O.
Let M denote our statistical model.
Denote by P_0 ∈ M the true distribution of O.
Thus, we observe i.i.d. copies O_1, …, O_n ~ P_0.
Suppose we want to estimate a parameter θ : M → Θ.
Denote by θ_0 := θ(P_0) the true parameter value.

Loss and risk

Let L be a map from 𝒪 × Θ to R. We call L a loss function for θ if it holds that

    θ_0 = arg min_{θ ∈ Θ} E_{P_0}[L(O, θ)].

R_0(θ) = E_{P_0}[L(O, θ)] is called the oracle risk.

These definitions of loss and risk come from the statistical learning literature (see, e.g., Vapnik, 1992, 1999, 2013) and are not to be confused with loss and risk from the decision theory literature (e.g., Ferguson, 2014).

Loss and risk: MSE example

MSE is the oracle risk corresponding to a squared-error loss function:

O = (Y, X).
θ(P) = μ(P) = {x ↦ E_P[Y | X = x]}.
L(O, θ) = [Y − θ(X)]² is the squared-error loss.
R_0(θ) = MSE(θ) = E_{P_0}[Y − θ(X)]².

Estimating the oracle risk

    θ_0 = arg min_{θ ∈ Θ} R_0(θ),   where   R_0(θ) = E_{P_0}[L(O, θ)].

Suppose that θ̂_1, …, θ̂_K are candidate estimators. As before, we need to estimate R_0(θ) to evaluate each θ̂_k. The naive estimator is R̂(θ̂_k) = (1/n) Σ_{i=1}^n L(O_i, θ̂_k). We instead estimate R_0(θ) using the cross-validated risk

    R̂_CV(θ̂_k) = (1/V) Σ_{v=1}^V (1/|V_v|) Σ_{i∈V_v} L(O_i, θ̂_{k,v}).
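The cross-validated risk for a generic loss is the same computation as the cross-validated MSE with the squared error swapped out. The helper below is an illustrative sketch; `cv_risk`, the learner, and the data are hypothetical names, not the SuperLearner API.

```python
import numpy as np

def cv_risk(fit, loss, x, y, V=10, seed=1):
    """V-fold cross-validated risk of `fit` under an arbitrary loss.

    `fit(x_train, y_train)` returns an estimated function theta-hat, and
    `loss(y_obs, theta_hat(x_obs))` evaluates L(O_i, theta) elementwise.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % V
    fold_risks = []
    for v in range(V):
        test = folds == v
        theta_hat = fit(x[~test], y[~test])
        fold_risks.append(np.mean(loss(y[test], theta_hat(x[test]))))
    return float(np.mean(fold_risks))

# Squared-error loss recovers the cross-validated MSE from earlier; other
# losses plug into the same recipe unchanged.
sq_loss = lambda y_obs, pred: (y_obs - pred) ** 2

# Illustrative learner: predict the training mean everywhere.
fit_mean = lambda xt, yt: (lambda xs: np.full(len(xs), yt.mean()))

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = rng.normal(size=300)
print(cv_risk(fit_mean, sq_loss, x, y, V=5))  # near Var(Y) = 1
```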

Super Learner: general steps

Using this framework, we can generalize the SL recipe:
1. Choose a library of candidate algorithms θ̂_1, …, θ̂_K.
2. Compute the CV-risks R̂_CV(θ̂_k), k = 1, …, K.
3. Compute the SL weights α̂ := arg min_{α ∈ S_K} R̂_CV( Σ_{k=1}^K α_k θ̂_k ).
4. Set θ̂_SL = Σ_{k=1}^K α̂_k θ̂_k.

Theoretical guarantees

van der Vaart et al. (2006) showed that, under some conditions, the oracle risk of the SL estimator is as good as the oracle risk of the oracle minimizer up to a multiple of (log n)/n, as long as the number of candidate algorithms is polynomial in n.

Loss functions for a binary outcome

We return to O = (Y, X), θ = μ. For continuous Y, we used squared-error loss. For binary Y, squared-error loss is still valid. However, there are (at least) two other alternative loss functions for a binary outcome:

– Negative log-likelihood loss: L(O, θ) = −Y log θ(X) − [1 − Y] log[1 − θ(X)].
– AUC loss.
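Squared-error and negative log-likelihood loss for a binary outcome can be compared directly in a small sketch (the data and helper names are illustrative): both losses are minimized on average at the true success probability, which is what makes them valid losses for μ.

```python
import numpy as np

def nll_loss(y, p, eps=1e-12):
    """Negative log-likelihood loss -y*log(p) - (1-y)*log(1-p), elementwise."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def sq_loss(y, p):
    return (y - p) ** 2

# With P(Y = 1) = 0.3, the constant prediction p = 0.3 beats p = 0.5 under
# both losses, reflecting that each is minimized at the true probability.
rng = np.random.default_rng(0)
y = (rng.uniform(size=100_000) < 0.3).astype(float)
print(nll_loss(y, 0.3).mean() < nll_loss(y, 0.5).mean())   # True
print(sq_loss(y, 0.3).mean() < sq_loss(y, 0.5).mean())     # True
```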

IV. Lab 2: Vanilla SL for a binary outcome

15 minute break

V. Bells and whistles: Screens, weights, and CV-SL

Overview

In this section, we will introduce three of the add-ons to SL that are frequently useful in practice: variable screens, observation weights, and cross-validated SL.

Variable screens

We think of a candidate algorithm as a two-step procedure:
1. Select a subset of the covariates.
2. Fit a model.

We call step 1 a screening procedure. While we could program steps 1 and 2 by hand into each candidate algorithm, the SuperLearner package has built-in functionality to ease this process. Screening algorithms allow us to incorporate domain knowledge.

Example use-cases of screening
If we have a high-dimensional set of covariates, we can try different ways of reducing the dimensionality.
If we have a large number of “raw” measurements, we might try providing a smaller number of summary measures, e.g. mean, median, min, max.
If we have measurements collected at multiple time points, we might try providing just baseline, or just the last time point, or some summaries of the trajectory.
We can force certain variables to always be used.


Observation weights
In some applications, we need to include observation weights in the procedure, e.g. for case-control sampling, or as a simple way to account for loss-to-followup.
Observation weights can be included directly in a call to SuperLearner, but method.AUC does not make correct use of weights!
Note that some SuperLearner wrappers might not make use of observation weights.
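Weights enter through the `obsWeights` argument of `SuperLearner()`. A minimal sketch with simulated data (sample size, weights, and the two-learner library are illustrative):

```r
library(SuperLearner)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1))
w <- runif(n, 0.5, 2)  # illustrative observation weights

fit <- SuperLearner(Y = Y, X = X, family = binomial(),
                    SL.library = c("SL.glm", "SL.mean"),
                    obsWeights = w,
                    method = "method.NNLS")  # avoid method.AUC with weights
```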


Case-control weights
Let Y represent disease status at the end of a study.
Suppose specimens from all n_case cases (Y_i = 1) are assayed.
A random subset of N_control controls (Y_i = 0) (out of n_control total controls) are assayed.
We will use this case-control cohort to predict disease status using the results of the assay and other covariates.


Case-control weights
We can use SL with observation weights.
Cases have weight w_i = 1. Controls have weight w_i = n_control / N_control.
Control weights could also be estimated using a logistic regression of the indicator of inclusion in the control cohort on baseline covariates.
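As a concrete sketch of the inverse-probability-of-sampling weights (the counts here are illustrative, not the study's):

```r
# Controls were sampled with probability N_control / n_control, so each
# sampled control stands in for n_control / N_control cohort controls.
n_case    <- 52    # cases, all assayed
n_control <- 500   # total controls in the cohort (illustrative)
N_control <- 52    # controls randomly sampled for assay

Y <- c(rep(1, n_case), rep(0, N_control))      # assayed subjects only
w <- ifelse(Y == 1, 1, n_control / N_control)  # w_i = 1 for cases,
                                               # n_control / N_control for controls
# w is then supplied as obsWeights in the SuperLearner call.
```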


Right-censored outcomes
Suppose Y = I(T ≤ t0) indicates that disease occurs before time t0.
T is subject to right-censoring by C: we observe Ỹ = min{T, C} and Δ = I(T ≤ C).
We want to estimate μ0(x) = P(T ≤ t0 | X = x) = E[Y | X = x].


Right-censored outcomes
μ0 = argmin_μ E_P0 [ (Δ / G0(Ỹ | X)) L((Ỹ, X); μ) ]
Here, G0(t | x) = P0(C > t | X = x).
L is either squared-error or negative log-likelihood loss.
If we knew G0, we could use SL with weight Δ / G0(Ỹ | X). Instead, we estimate G0 and plug in this estimator to obtain an estimated weight.
If C ⊥⊥ T, we can use a Kaplan–Meier estimator for G0; otherwise we might use a Cox model.
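Under C ⊥⊥ T, the plug-in weights can be sketched with a Kaplan–Meier fit to the censoring distribution (simulated data; a minimal illustration, not a complete analysis):

```r
library(survival)

set.seed(1)
n  <- 200
T  <- rexp(n, rate = 0.2)  # event times
C  <- rexp(n, rate = 0.1)  # censoring times
Yt <- pmin(T, C)           # observed follow-up time, min{T, C}
D  <- as.numeric(T <= C)   # Delta = I(T <= C)

# Kaplan-Meier estimate of G0(t) = P(C > t): censoring is the "event".
Gfit <- survfit(Surv(Yt, 1 - D) ~ 1)
Gfun <- stepfun(Gfit$time, c(1, Gfit$surv))

# IPCW weight Delta / G0-hat(Y); zero for censored rows, guarded against
# a zero estimate at the largest observed time.
w <- D / pmax(Gfun(Yt), 1e-6)
```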


CV-Super Learner
The standard SL framework gives us CV risks for each candidate algorithm.
However, the SL and discrete SL are obtained using all the data, so their estimated risks will be optimistic.
We can rectify this using a second layer of cross-validation.


CV-Super Learner
1. Split the data into V1 folds.
2. For each v = 1, ..., V1:
 a. Fit the Super Learner on the training data for fold v using V2-fold CV.
 b. Evaluate its predictions on the validation set for fold v.
3. Average the validation-set risks to obtain honest risk estimates for the discrete SL and SL.
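This nested scheme is implemented by `CV.SuperLearner()`. A sketch with simulated data (fold counts and the library are illustrative):

```r
library(SuperLearner)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1))

# Outer V1 folds evaluate the SL itself; each inner SL uses V2-fold CV.
cvsl <- CV.SuperLearner(Y = Y, X = X, family = binomial(),
                        SL.library = c("SL.glm", "SL.mean"),
                        cvControl = list(V = 5),             # V1
                        innerCvControl = list(list(V = 5)))  # V2
summary(cvsl)  # CV risks for SL, Discrete SL, and each candidate
```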


VI. Lab 3: Binary outcome redux

VII. Lab 4: Case-control analysis of Fluzone vaccine

FLUVACS trial
Healthy adults aged 18–49 years, Michigan, 2007–2008.
Randomly assigned to:
–
–
–
We are only interested in Fluzone vs placebo.
Followed for one flu season.
Endpoint = laboratory-confirmed influenza.



FLUVACS trial
All 52 cases and 52 random controls were assayed for a variety of markers (HAI, NAI, MN, AM titers, proteins/virus/peptide magnitude/breadth).
Measured variables:
– (EVERVAX)
–
–
–


Variable sets
1.
2.
3.
4.
5. Day 0 markers
6. Day 30 markers
7. Diff. markers
8. (Day 0 + Day 30)
9. (Day 0 + Diff.)


Analysis goals
We want to compare the quality of these nine sets of variables for predicting flu status in the placebo and Fluzone arms separately.
We also want to compare the predictive quality of IgA, IgG, and both IgA + IgG measurements.
We will use cross-validated Super Learning to do this.
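One way to sketch the comparison: fit a separate CV-SL to each variable set and compare the resulting CV risks (the data and column names are simulated illustrations, not the FLUVACS variables):

```r
library(SuperLearner)

set.seed(1)
n <- 150
X <- data.frame(hai_d0 = rnorm(n), hai_d30 = rnorm(n))
X$hai_diff <- X$hai_d30 - X$hai_d0
Y <- rbinom(n, 1, plogis(X$hai_d30))

# One CV-SL per candidate variable set; compare honest CV risks.
var_sets <- list(day0 = "hai_d0", day30 = "hai_d30", diff = "hai_diff")
cv_risk <- sapply(var_sets, function(vars) {
  fit <- CV.SuperLearner(Y = Y, X = X[, vars, drop = FALSE],
                         family = binomial(),
                         SL.library = c("SL.glm", "SL.mean"),
                         cvControl = list(V = 5))
  mean((Y - fit$SL.predict)^2)  # CV risk (MSE) of the SL on this set
})
```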


[Figure (slide 46): cross-validated AUC by learner (SL, Discrete SL, SL.glm, SL.bayesglm, SL.glmnet, SL.earth, SL.gam, SL.xgboost, SL.ranger, SL.mean) for the nine variable sets (Baseline; Day 0; Day 30; Day 30 − Day 0; EV x Day 0; EV x Day 30; EV x (Day 30 − Day 0); EV x (Day 0, Day 30); EV x (Day 0, Diff)), by screen (All, screen.marginal.05, screen.marginal.10) and antibody set (Both, IgA, IgG, Neither).]

[Figure (slide 47): cross-validated AUC of the SL, Discrete SL, and top candidate learner within each of the nine variable-set panels (SL.xgboost, SL.bayesglm, SL.glm, or SL.gam depending on the panel), by screen and antibody set.]

[Figure (slide 48): cross-validated AUC by learner for the nine variable sets, same layout as the slide 46 figure, by screen and antibody set.]

[Figure (slide 49): cross-validated AUC of the SL, Discrete SL, and top candidate learner per variable-set panel (mostly SL.bayesglm; SL.ranger in one panel), by antibody set (Both, IgA, IgG, Neither).]
