CMSC 177 Regression and Prediction


REGRESSION AND PREDICTION
CMSC 177
Dr. Vincent Peter Magboo

USING LINEAR REGRESSION
Consider a dataset consisting of the sales of a product in 200 different markets,
along with advertising budgets for the product in each of those markets for three
different media: TV, radio, and newspaper.
Our goal is to develop an accurate model that can be used to predict sales on the
basis of the three media budgets.

SOME QUESTIONS
1. Is there a relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict future sales?
6. Is the relationship linear?
7. Is there synergy among the advertising media?

WHAT IS SIMPLE LINEAR REGRESSION?
Simple linear regression is an approach for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$.
It assumes an approximately linear relationship between $X$ and $Y$. That is,
$$Y \approx \beta_0 + \beta_1 X$$
Here $\beta_0$ and $\beta_1$ are unknown constants that represent the intercept and slope terms in the linear model. Together, they are known as the model coefficients or parameters.

ESTIMATING THE COEFFICIENTS
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ observation pairs. Our goal is to obtain coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model fits the available data well – that is, $\hat{y}_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i$ for $i = 1, 2, \ldots, n$.
The most common criterion to define closeness is the least-squares criterion.

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i$th residual – the difference between the $i$th observed response value and the $i$th response value that is predicted by our linear model.
The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares (RSS):
$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2$$

THE LEAST-SQUARES CRITERION

Using some calculus, it can be shown that the minimizers are
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
These values define the least squares coefficient estimates for simple linear regression.
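
For concreteness, here is a minimal NumPy sketch of these least-squares formulas; the arrays x and y are made-up observation pairs, not data from the slides.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical response values

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of (x_i - x_bar)(y_i - y_bar) over sum of (x_i - x_bar)^2
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: y_bar - beta1_hat * x_bar
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)
```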

MULTIPLE REGRESSION
Assuming that we have $p$ distinct predictors, the multiple linear regression model takes the form
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$
As in the simple linear regression setting, we choose $\beta_0, \beta_1, \ldots, \beta_p$ to minimize the sum of squared residuals
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip} \right)^2$$
We can then make predictions using the formula
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p$$
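
As an illustrative sketch (not part of the original slides), the multiple least-squares coefficients can be obtained with NumPy's lstsq; the matrix X and response y below are simulated stand-ins for the advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # hypothetical predictors (e.g. TV, radio, newspaper)
y = 3.0 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(scale=0.5, size=200)

# Prepend a column of ones so the first coefficient is the intercept beta0_hat.
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, rss, rank, _ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat           # predictions from the fitted plane
print(beta_hat)
```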

THE LEAST-SQUARES CRITERION IN 3-D

INTERPRETATION OF REGRESSION COEFFICIENTS
In the regression equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p$:
The intercept $\hat{\beta}_0$ represents the value of $\hat{y}$ when all the independent variables have values equal to 0.
The predicted value $\hat{y}$ changes by $\hat{\beta}_j$ for each unit change in $x_j$ ($j = 1, 2, \ldots, p$), assuming all other variables $x_k$ ($k \neq j$) remain the same.

Sales vs. TV advertising budget
              Coefficient   Std. error   T-statistic   P-value
Intercept     7.0325        0.4578       15.36         < 0.0001
TV            0.0475        0.0027       17.67         < 0.0001
Note: Here the p-values are for the results of testing the significance of each coefficient. The null hypothesis is $H_0: \beta_i = 0$ vs. the alternative hypothesis $H_a: \beta_i \neq 0$.

SIMPLE LINEAR REGRESSION RESULTS
Sales vs. Radio advertising budget
              Coefficient   Std. error   T-statistic   P-value
Intercept     9.312         0.563        16.54         < 0.0001
Radio         0.203         0.020        9.92          < 0.0001

Sales vs. Newspaper advertising budget
              Coefficient   Std. error   T-statistic   P-value
Intercept     12.351        0.621        19.88         < 0.0001
Newspaper     0.055         0.017        3.30          < 0.0001

MULTIPLE LINEAR REGRESSION RESULTS
              Coefficient   Std. error   T-statistic   P-value
Intercept     2.939         0.3119       9.42          < 0.0001
TV            0.046         0.0014       32.81         < 0.0001
Radio         0.189         0.0086       21.89         < 0.0001
Newspaper     -0.001        0.0059       -0.18         0.8599

CORRELATION MATRIX
            TV       Radio    Newspaper   Sales
TV          1.0000   0.0548   0.0567      0.7822
Radio                1.0000   0.3541      0.5762
Newspaper                     1.0000      0.2283
Sales                                     1.0000

SOME IMPORTANT QUESTIONS
1. Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
2. Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

1: IS THERE A RELATIONSHIP BETWEEN THE RESPONSE AND PREDICTORS?
We test the null hypothesis
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{vs.} \quad H_a: \text{at least one } \beta_j \text{ is non-zero.}$$
This involves the use of the test statistic
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$
where $\mathrm{TSS} = \sum (y_i - \bar{y})^2$ and $\mathrm{RSS} = \sum (y_i - \hat{y}_i)^2$.
It can be shown that, if the linear model assumptions are correct and $H_0$ is true, then the F-statistic will take on a value close to 1. On the other hand, if $H_a$ is true, then $F$ will be greater than 1.
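
A minimal sketch of this F-statistic in NumPy, assuming the response vector y, the fitted values y_hat, and the number of predictors p are already available:

```python
import numpy as np

def f_statistic(y, y_hat, p):
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
    rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
    return ((tss - rss) / p) / (rss / (n - p - 1))

# The p-value could then be looked up in an F(p, n - p - 1) distribution,
# e.g. via scipy.stats.f.sf(F, p, n - p - 1).
```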

Quantity                    Value
Residual standard error     1.69
$R^2$                       0.897
F-statistic                 570

2: DECIDING ON IMPORTANT VARIABLES
Suppose that, on the basis of the p-value for the F-test, we are able to conclude that at least one of the predictors is related to the response. We would now like to know which ones they are.
Classical methods:
Forward selection
Backward selection
Mixed selection

FORWARD SELECTION
We begin with the null model –a model that contains an intercept but no
predictors.
We then fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS.
We then add to that model the variable that results in the lowest RSS for the new
two-variable model.
This approach is continued until some stopping rule is satisfied.
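
A rough sketch of forward selection by lowest RSS, assuming a predictor matrix X and response y as NumPy arrays; the stopping rule here (a cap on the number of variables) is just one possible choice:

```python
import numpy as np

def rss_of(cols, X, y):
    """Fit an intercept plus the given columns by least squares and return the RSS."""
    Xd = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

def forward_selection(X, y, max_vars):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:   # simple stopping rule on model size
        best = min(remaining, key=lambda j: rss_of(selected + [j], X, y))
        selected.append(best)
        remaining.remove(best)
    return selected
```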

BACKWARD SELECTION
We start with all variables in the model, and remove the variable with the largest
p-value –that is, the variable that is the least statistically significant.
The new $(p - 1)$-variable model is then fit, and the variable with the largest p-value is removed.
This procedure continues until a stopping rule is reached. For instance, we may
stop when all remaining variables have a p-value below some threshold.

MIXED SELECTION
Combination of forward and backward selection.
We start with no variables in the model, and as with forward selection, we add the
variable that provides the best fit.
We continue to add variables one-by-one. However, if at any point, the p-value
for one of the variables in the model rises above a certain threshold, then we
remove that variable from the model.
We continue to perform these forward and backward steps until all variables in
the model have a sufficiently low p-value, and all variables outside the model
would have a large p-value if added to the model.

SOME ADDITIONAL REMARKS
Let $p$ be the number of predictors, and $n$ the number of observations. Backward selection cannot be used if $p > n$, while forward selection can always be used.
Forward selection is a greedy approach, and might include variables early that later become redundant. This can be remedied by mixed selection.
Ideally, we would like to perform variable selection by trying out a lot of different models, each containing a different subset of the predictors. To determine which model is best, we can use one of various statistics:
Mallow's $C_p$
Akaike information criterion (AIC)
Bayesian information criterion (BIC)
Adjusted $R^2$

3: MODEL FIT
Residual Standard Error (RSE)
The coefficient of determination $R^2$
Graphical summaries
Cross-validation

ON THE RSE
Residual Standard Error (RSE):
$$\mathrm{RSE} = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
The RSE provides an absolute measure of lack of fit of the linear model to the data.
If $\hat{y}_i \approx y_i$ for $i = 1, 2, \ldots, n$, then the RSE will be small, and we can conclude that the model fits the data very well. Otherwise, if $\hat{y}_i$ is very far from $y_i$ for one or more observations, then the RSE may be quite large.
Since it is measured in the units of $Y$, it is not always clear what constitutes a good RSE.
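
A one-function sketch of the RSE, assuming NumPy arrays y and y_hat and p predictors:

```python
import numpy as np

def residual_standard_error(y, y_hat, p):
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    return np.sqrt(rss / (len(y) - p - 1))  # same units as the response Y
```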

ON THE $R^2$ STATISTIC
The coefficient of determination $R^2$:
$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
where $\mathrm{TSS} = \sum (y_i - \bar{y})^2$ and $\mathrm{RSS} = \sum (y_i - \hat{y}_i)^2$.
The $R^2$ statistic measures the proportion of variability in $Y$ that can be explained using the independent variables.
Its value is always between 0 and 1. An $R^2$ value near 0 indicates that the regression did not explain much of the variability in the response, which suggests that either the linear model is wrong, the inherent error is high, or both.
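
A matching sketch for $R^2$, under the same assumptions about y and y_hat:

```python
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)       # unexplained variability
    tss = np.sum((y - y.mean()) ** 2)    # total variability in the response
    return 1.0 - rss / tss
```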

Using all three predictors:
Quantity                    Value
Residual standard error     1.69
$R^2$                       0.897
F-statistic                 570

Using only TV advertising budget as predictor:
Quantity                    Value
Residual standard error     3.26
$R^2$                       0.612
F-statistic                 312.1

GRAPHICAL SUMMARIES

(K-FOLD) CROSS-VALIDATION
1. Set aside $1/k$ of the data as a holdout sample.
2. Train the model on the remaining data.
3. Apply (score) the model to the $1/k$ holdout, and record the needed model assessment metrics.
4. Restore the first $1/k$ of the data, and set aside the next $1/k$ (excluding any records that were picked the first time).
5. Repeat steps 2 and 3.
6. Repeat until each record has been used in the holdout portion.
7. Average or otherwise combine the model assessment metrics.
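
A minimal sketch of this k-fold procedure for a least-squares fit, assuming NumPy arrays X and y; the assessment metric used here is the holdout mean squared error:

```python
import numpy as np

def k_fold_cv_mse(X, y, k=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)                   # each fold is held out exactly once
    scores = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(fold)), X[fold]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        scores.append(np.mean((y[fold] - Xte @ beta) ** 2))   # holdout MSE for this fold
    return np.mean(scores)                           # combine the fold metrics
```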

4: PREDICTIONS
Once we have fit the (multiple) regression model, it is straightforward to apply the regression equation to predict the response $Y$ on the basis of a set of values for the predictors $X_1, X_2, \ldots, X_p$. There are three sorts of uncertainty associated with this prediction:
The least squares plane $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p$ is only an estimate for the true population regression plane $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$. We can compute a confidence interval in order to determine how close $\hat{Y}$ will be to $f(X)$.
Assuming a linear model for $f(X)$ is almost always an approximation of reality (which gives rise to an error known as model bias).
Even if we knew $f(X)$, the response value cannot be predicted perfectly because of the random error $\epsilon$ in the model. To determine how much $Y$ will vary from $\hat{Y}$, we can use prediction intervals.
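
As a hedged illustration, statsmodels can report both kinds of interval for a fitted OLS model; the arrays X and y and the new budget values in x_new are assumptions, with X taken to have three predictor columns (e.g. TV, radio, newspaper):

```python
import numpy as np
import statsmodels.api as sm

X1 = sm.add_constant(X)                    # add the intercept column
res = sm.OLS(y, X1).fit()

# A hypothetical new observation; has_constant='add' forces the intercept column.
x_new = sm.add_constant(np.array([[100.0, 20.0, 30.0]]), has_constant='add')
pred = res.get_prediction(x_new).summary_frame(alpha=0.05)
# mean_ci_lower / mean_ci_upper -> confidence interval for the mean response
# obs_ci_lower  / obs_ci_upper  -> prediction interval for an individual response
print(pred)
```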

QUALITATIVE PREDICTORS
To incorporate qualitative predictors (factors) with only two levels, we can simply create an indicator or dummy variable that takes on two possible numerical values.
For example, for the gender variable in the Credit data set, we can create the following new variable:
$$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{if the } i\text{th person is male} \end{cases}$$
This results in the model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is female} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is male} \end{cases}$$

QUALITATIVE PREDICTORS
When a qualitative predictor has $k$ levels, we will need to create $k - 1$ dummy variables. The level with no dummy variable is known as the baseline.
For example, for the ethnicity variable in the Credit dataset, we can define
$$x_{i1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{if the } i\text{th person is not Asian} \end{cases}$$
and the second could be
$$x_{i2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{if the } i\text{th person is not Caucasian} \end{cases}$$

QUALITATIVE PREDICTORS
Then both of these variables can be used in the regression equation, in order to obtain the model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if the } i\text{th person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is African American} \end{cases}$$
In this case, we can interpret the regression coefficients as follows:
$\beta_0$ is the average credit card balance for African Americans
$\beta_1$ is the difference in the average balance between the Asian and African American categories
$\beta_2$ is the difference in the average balance between the Caucasian and African American categories
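
A small pandas sketch of this encoding; the DataFrame and its 'Ethnicity' and 'Balance' columns are hypothetical stand-ins for the Credit data:

```python
import pandas as pd

df = pd.DataFrame({
    'Ethnicity': ['Asian', 'Caucasian', 'African American', 'Asian'],
    'Balance':   [512.0, 480.0, 530.0, 505.0],
})

# k levels -> k - 1 dummy columns; drop_first makes the omitted level the baseline.
dummies = pd.get_dummies(df['Ethnicity'], drop_first=True)
X = dummies.to_numpy(dtype=float)
y = df['Balance'].to_numpy()
print(dummies.columns.tolist())    # the baseline level has no column of its own
```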

LINEAR MODELS: CREDIT DATASET
                        Coefficient   Std. error   T-statistic   P-value
Intercept               509.80        33.13        15.389        < 0.0001
Gender [Female]         19.73         46.05        0.429         0.6690

                        Coefficient   Std. error   T-statistic   P-value
Intercept               531.00        46.32        11.464        < 0.0001
Ethnicity [Asian]       -18.69        65.02        -0.287        0.7740
Ethnicity [Caucasian]   -12.50        56.68        -0.221        0.8260

EXTENSIONS TO THE LINEAR MODEL
While the linear regression model provides interpretable results and works quite well on many real-world problems, it makes several highly restrictive assumptions that are often violated in practice.
In particular, two of the most important assumptions state that the relationship between the predictors and the response is additive and linear:
Additive: the effect of changes in a predictor $X_j$ on the response $Y$ is independent of the values of the other predictors.
Linear: the change in the response $Y$ due to a one-unit change in $X_j$ is constant, regardless of the value of $X_j$.

INTERACTION / SYNERGY
There are cases where the relationship between a predictor variable and the response is not independent of the other predictor variables.
This can be taken into account in the model by adding a product term. For example, adding an interaction term for the predictors $X_1$ and $X_2$ gives the model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$$
Hierarchical principle: If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

MODEL WITH INTERACTION
              Coefficient   Std. error   T-statistic   P-value
Intercept     6.7502        0.248        27.23         < 0.0001
TV            0.0191        0.002        12.70         < 0.0001
Radio         0.0289        0.009        3.24          0.0014
TV x Radio    0.0011        0.000        20.73         < 0.0001
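
A brief sketch of fitting such an interaction with the statsmodels formula interface; the DataFrame df and the column names Sales, TV, and Radio are assumptions, not the course's own code:

```python
import statsmodels.formula.api as smf

# 'TV * Radio' expands to TV + Radio + TV:Radio, so the main effects are kept
# alongside the interaction, in line with the hierarchical principle.
model = smf.ols('Sales ~ TV * Radio', data=df).fit()
print(model.summary())
```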

COMMON PROBLEMS WHEN APPLYING A LINEAR REGRESSION MODEL
1. Non-linearity of the response-predictor relationships
2. Correlation of error terms
3. Non-constant variance of error terms
4. Outliers
5. High-leverage points
6. Collinearity

1: NON-LINEARITY
The linear regression model assumes that there is a straight-line relationship between the predictors and the response.
We can use residual plots to identify non-linearity:
In a simple linear regression model, we plot the residuals $e_i = y_i - \hat{y}_i$ versus the predictor $x_i$.
In a multiple regression model, we instead plot the residuals versus the predicted (or fitted) values $\hat{y}_i$.
Ideally, the residual plot will show no discernible pattern.
If the residual plot indicates that there are non-linear associations in the data, then one can use non-linear transformations of the predictors, such as $\log X$, $\sqrt{X}$, or $X^2$, in the regression model.
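
A minimal matplotlib sketch of such a residual plot for a multiple regression, assuming the fitted values y_hat and responses y are already available as NumPy arrays:

```python
import matplotlib.pyplot as plt

residuals = y - y_hat
plt.scatter(y_hat, residuals, s=10)
plt.axhline(0.0, linestyle='--')   # residuals should scatter evenly around zero
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```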

DEALING WITH NON-LINEARITY
Polynomial regression: adds polynomial terms to the regression equation.
For example, a quadratic regression between the response $Y$ and the predictor $X$ would take the form $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$.
Spline regression: fits a series of piecewise continuous polynomials to the data.
The polynomial pieces are smoothly connected at a series of fixed points in a predictor variable, referred to as knots.
Splines are more flexible than the polynomial model; however, the coefficients for a spline term are not interpretable.
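
A short NumPy sketch of the quadratic case, assuming 1-D arrays x and y:

```python
import numpy as np

# Design matrix with columns 1, X, X^2, matching the quadratic model above.
X_poly = np.column_stack([np.ones(len(x)), x, x ** 2])
beta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
y_hat = X_poly @ beta   # fitted quadratic curve
```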

2: CORRELATION OF ERROR TERMS
An important assumption of the linear regression model is that the error terms $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are uncorrelated.
If the error terms are correlated, the estimated standard errors will tend to underestimate the true standard errors, which may lead us to erroneously conclude that a parameter is statistically significant.
Such correlations frequently occur in the context of time series data.
In order to determine if this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no discernible pattern.
On the other hand, if the error terms are positively correlated, then we may see tracking in the residuals—that is, adjacent residuals may have similar values.

3: NON-CONSTANT VARIANCE OF ERROR TERMS
Another important assumption of the linear regression model is that the error terms have a constant variance, $\mathrm{Var}(\epsilon_i) = \sigma^2$.
Unfortunately, it is often the case that the variances of the error terms are non-constant. One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.
When faced with this problem, one possible solution is to transform the response, for example to $\log Y$ or $\sqrt{Y}$.
If we have a good idea of the variance of each response, another remedy would be to fit our model by weighted least squares, with weights proportional to the inverse variances.
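
A hedged sketch of the weighted least squares remedy using statsmodels, assuming X, y, and per-observation variance estimates var_i are available:

```python
import statsmodels.api as sm

weights = 1.0 / var_i   # weights proportional to the inverse variances
res = sm.WLS(y, sm.add_constant(X), weights=weights).fit()
print(res.params)
```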

4: OUTLIERS
An outlier is a point for which $y_i$ is far from the value predicted by the model.
Outliers can arise for a variety of reasons, such as incorrect recording of an
observation during data collection.
For big data problems, outliers are generally not a problem in fitting the
regression to be used in predicting new data. However, they can be central to
anomaly detection. The outlier could also correspond to a case of fraud or an
accidental action.
If we believe that an outlier has occurred due to an error in data collection or
recording, then one solution is to simply remove the observation. However, care
should be taken, since an outlier may instead indicate a deficiency with the model,
such as a missing predictor.

OUTLIERS AND RESIDUAL PLOTS

5: HIGH LEVERAGE POINTS
Observations with high leverage have an unusual value for $x_i$.
In general, high leverage observations tend to have a sizable impact on the
estimated regression line. It is important to identify these points, as any problems
with these points may invalidate the entire fit.
In a simple linear regression, we can easily identify high leverage observations,
since we can simply look for observations for which the predictor value is outside
the normal range of the observations.
In a multiple regression with many predictors, it is possible to have an observation
that is well within the range of each individual predictor’s values, but that is
unusual in terms of the full set of predictors.

HIGH LEVERAGE POINTS

QUANTIFYING LEVERAGE
For a simple linear regression, we can quantify an observation's ($x_i$'s) leverage by computing the leverage statistic:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}$$
Note that $h_i$ increases with the distance of $x_i$ from $\bar{x}$.
There is a more general formula for the case of $p$ predictors, which we do not include here.
The leverage statistic $h_i$ is always between $1/n$ and 1, and the average leverage over all observations is always equal to $(p + 1)/n$. Thus, if a given observation has a leverage statistic that greatly exceeds $(p + 1)/n$, then we may suspect that the corresponding point has high leverage.
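
A small NumPy sketch of the leverage statistic for the simple regression case, assuming a 1-D predictor array x; the cutoff used to flag points is an ad hoc choice, not a rule from the slides:

```python
import numpy as np

def leverage(x):
    x_bar = x.mean()
    # h_i = 1/n + (x_i - x_bar)^2 / sum (x_i' - x_bar)^2
    return 1.0 / len(x) + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)

h_i = leverage(x)
# Average leverage is (p + 1)/n = 2/n here; flag points well above it.
print(np.where(h_i > 3 * h_i.mean()))
```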

6: COLLINEARITY
Collinearity refers to the situation in which two or more predictor variables are closely related to one another.
The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response.
Collinearity reduces the accuracy of the estimates of the regression coefficients, and thus causes the standard error for $\hat{\beta}_j$ to grow. As a result, the power of the t-test for the significance of the regression coefficients decreases.

COLLINEAR VARIABLES

DETECTING COLLINEARITY
Elements of the correlation matrix of the predictors that have large absolute values indicate highly correlated variables, and therefore a collinearity problem in the data.
However, it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. This is known as multicollinearity.
Instead of inspecting the correlation matrix, a better approach would be to compute the variance inflation factor (VIF).

THE VARIANCE INFLATION FACTOR
The VIF is the ratio of the variance of $\hat{\beta}_j$ when fitting the full model divided by the variance of $\hat{\beta}_j$ if fit on its own. Its formula is given as follows:
$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}$$
where $R^2_{X_j \mid X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors.
The smallest possible value for VIF is 1, which indicates the complete absence of collinearity.
As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
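
A minimal NumPy sketch of the VIF computed directly from this definition, assuming an n × p predictor matrix X:

```python
import numpy as np

def vif(X, j):
    # Regress predictor j onto all of the other predictors (plus an intercept).
    others = np.delete(X, j, axis=1)
    Xd = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(Xd, X[:, j], rcond=None)
    resid = X[:, j] - Xd @ beta
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]   # values above 5 or 10 flag collinearity
```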

DEALING WITH COLLINEARITY
Drop one of the problematic variables from the regression.
Combine the collinear variables together into a single predictor.