regression and statistics basic ECO101 class

navbali 12 views 75 slides Oct 15, 2024


Regression Analysis
Multiple Regression
[ Cross-Sectional Data ]

Learning Objectives
Explain the linear multiple regression
model [for cross-sectional data]
Interpret linear multiple regression
computer output
Explain multicollinearity
Describe the types of multiple regression
models

Regression Modeling Steps
Define problem or question
Specify model
Collect data
Do descriptive data analysis
Estimate unknown parameters
Evaluate model
Use model for prediction

Simple vs. Multiple
Simple regression:
$\beta$ represents the unit change in Y per unit change in X.
Does not take into account any other variable besides the single independent variable.
Multiple regression:
$\beta_i$ represents the unit change in Y per unit change in $X_i$.
Takes into account the effect of the other $\beta_i$s.
“Net regression coefficient.”

Assumptions
Linearity - the Y variable is linearly related to the value of the X variable.
Independence of Error - the error (residual) is independent for each value of X.
Homoscedasticity - the variation around the line of regression is constant for all values of X.
Normality - the values of Y are normally distributed at each value of X.

Goal
Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.

Simple Regression
A statistical model that utilizes one quantitative independent variable “X” to predict the quantitative dependent variable “Y.”

Multiple Regression
A statistical model that utilizes two or more quantitative and qualitative explanatory variables ($x_1, \ldots, x_P$) to predict a quantitative dependent variable Y.
Caution: have at least two quantitative explanatory variables (rule of thumb)

Multiple Regression Model
[Figure: multiple regression model, with axes $X_1$, $X_2$ and Y and error e]

Hypotheses
$H_0$: $\beta_1 = \beta_2 = \beta_3 = \cdots = \beta_P = 0$
$H_1$: At least one regression coefficient is not equal to zero

Hypotheses (alternate format)
$H_0$: $\beta_i = 0$
$H_1$: $\beta_i \ne 0$

Types of Models
Positive linear relationship
Negative linear relationship
No relationship between X and Y
Positive curvilinear relationship
U-shaped curvilinear
Negative curvilinear relationship

Multiple Regression Models
  Linear
    Linear
    Dummy Variable
    Interaction
  Non-Linear
    Polynomial
    Square Root
    Log
    Reciprocal
    Exponential

Multiple Regression Equations
This is too
complicated! You’ve got to
be kiddin’!

Linear Model
Relationship between one dependent and two or more independent variables is a linear function:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_P X_P + \varepsilon$$
where
Y = dependent (response) variable
$X_1, \ldots, X_P$ = independent (explanatory) variables
$\beta_1, \ldots, \beta_P$ = population slopes
$\beta_0$ = population Y-intercept
$\varepsilon$ = random error

Method of Least Squares
The straight line that best fits the data.
Determine the straight line for which the squared differences between the actual values (Y) and the values that would be predicted from the fitted line of regression (Y-hat) sum to as small a total as possible.
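Stated as a formula, the least-squares criterion is conventionally the sum of squared differences. A sketch, writing $b_0, \ldots, b_P$ for the sample estimates (notation assumed here, in line with the $b_0$ and $b_1$ used for the fitted line elsewhere in these slides):

```latex
\min_{b_0, b_1, \ldots, b_P} \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2,
\qquad \hat{Y}_i = b_0 + b_1 X_{1i} + \cdots + b_P X_{Pi}
```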

Measures of Variation
Explained variation (sum of
squares due to regression)
Unexplained variation (error sum
of squares)
Total sum of squares
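The three measures are linked by a standard identity that holds for least-squares fits with an intercept. SSR and SST follow the abbreviations used later in the slides; SSE is an assumed abbreviation for the error sum of squares.

```latex
SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \quad
SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \quad
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad
SST = SSR + SSE
```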

Coefficient of Multiple Determination
When null hypothesis
is rejected, a
relationship between
Y and the X variables
exists.
Strength measured by $R^2$ [ several types ]

Coefficient of Multiple Determination
$R^2_{Y.12 \ldots P}$
The proportion of the variation in Y that is explained by the set of explanatory variables selected

Standard Error of the Estimate
$s_{Y.X}$
the measure of variability around the line of regression

Confidence interval estimates
»True mean $\mu_{Y.X}$
»Individual $\hat{Y}_i$

Interval Bands [from simple regression]
[Figure: interval bands around the fitted line $\hat{Y}_i = b_0 + b_1 X$, plotted as Y against X, marking $\bar{X}$ and a given X]

Multiple Regression Equation
$$\hat{Y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_P x_P + \varepsilon$$
where:
$\beta_0$ = y-intercept {a constant value}
$\beta_1$ = slope of Y with variable $x_1$, holding the effects of variables $x_2, x_3, \ldots, x_P$ constant
$\beta_P$ = slope of Y with variable $x_P$, holding all other variables’ effects constant

Who is in Charge?

Mini-Case
Predict the consumption of home
heating oil during January for
homes located around Screne Lakes.
Two explanatory variables are selected: average daily atmospheric temperature (°F) and the amount of attic insulation (inches).

Mini-Case
Develop a model for estimating heating oil used for a single family home in the month of January based on average temperature and amount of insulation in inches.
Oil (Gal)   Temp (°F)   Insulation (in.)
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10

Mini-Case
What preliminary conclusions can home
owners draw from the data?
What could a home owner expect heating oil consumption (in gallons) to be if the outside temperature is 15°F when the attic insulation is 10 inches thick?

Multiple Regression Equation
[mini-case]
Dependent variable: Gallons Consumed
----------------------------------------------------------------
                             Standard       T
Parameter       Estimate     Error          Statistic    P-Value
----------------------------------------------------------------
CONSTANT        562.151      21.0931        26.6509      0.0000
Insulation      -20.0123     2.34251        -8.54313     0.0000
Temperature     -5.43658     0.336216       -16.1699     0.0000
----------------------------------------------------------------
R-squared = 96.561 percent
R-squared (adjusted for d.f.) = 95.9879 percent
Standard Error of Est. = 26.0138
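A minimal sketch of how estimates like those in the output might be reproduced from the 15 mini-case observations. It assumes Python with NumPy; the slides do not say which package produced the output, and all variable names here are illustrative.

```python
import numpy as np

# Mini-case data: oil (gallons), temperature (deg F), attic insulation (inches)
data = np.array([
    [275.30, 40, 3], [363.80, 27, 3], [164.30, 40, 10],
    [40.80, 73, 6], [94.30, 64, 6], [230.90, 34, 6],
    [366.70, 9, 6], [300.60, 8, 10], [237.80, 23, 10],
    [121.40, 63, 3], [31.40, 65, 10], [203.50, 41, 6],
    [441.10, 21, 3], [323.00, 38, 3], [52.50, 58, 10],
])
oil, temp, insul = data[:, 0], data[:, 1], data[:, 2]

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(len(oil)), temp, insul])

# Ordinary least squares: minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, oil, rcond=None)
b0, b_temp, b_insul = coef

# Variation measures
fitted = X @ coef
sse = np.sum((oil - fitted) ** 2)      # unexplained variation
sst = np.sum((oil - oil.mean()) ** 2)  # total variation
r_squared = 1 - sse / sst

n, k = X.shape                         # k counts the intercept too
std_err_est = np.sqrt(sse / (n - k))   # standard error of the estimate

print(f"constant={b0:.3f} temperature={b_temp:.5f} insulation={b_insul:.4f}")
print(f"R-squared={100 * r_squared:.3f} percent, std. error of est.={std_err_est:.4f}")
```

The printed values can be compared against the CONSTANT, Temperature and Insulation rows and the summary lines of the output.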

Multiple Regression Equation
[mini-case]
$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$
where: $x_1$ = temperature [degrees F]
$x_2$ = attic insulation [inches]

Multiple Regression Equation
[mini-case]
$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$
thus:
For a home with zero inches of attic insulation and an outside temperature of 0°F, 562.15 gallons of heating oil would be consumed.
[ caution .. data boundaries .. extrapolation ]

Extrapolation
[Figure: Y plotted against X; interpolation inside the relevant range of X, extrapolation on either side of it]

Multiple Regression Equation
[mini-case]
$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$
For a home with zero attic insulation and an outside temperature of zero, 562.15 gallons of heating oil would be consumed. [ caution .. data boundaries .. extrapolation ]
For each one-degree (°F) increase in temperature, for a given amount of attic insulation, heating oil consumption drops 5.44 gallons.
For each one-inch increase in attic insulation, at a given temperature, heating oil consumption drops 20.01 gallons.

Multiple Regression Prediction
[mini-case]
$\hat{Y} = 562.15 - 5.44x_1 - 20.01x_2$
with $x_1$ = 15°F and $x_2$ = 10 inches
$\hat{Y}$ = 562.15 - 5.44(15) - 20.01(10)
= 280.45 gallons consumed
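The same prediction can be written as a small function. This is only a sketch using the rounded coefficients from the fitted equation; the function name `predict_oil` is made up for illustration.

```python
def predict_oil(temp_f, insulation_in):
    """Predicted January heating oil use (gallons) from the rounded mini-case equation."""
    return 562.15 - 5.44 * temp_f - 20.01 * insulation_in


gallons = predict_oil(15, 10)
print(round(gallons, 2))  # 280.45
```

Inputs far outside the observed temperatures and insulation levels would be extrapolation, so the function is only meaningful inside the relevant range of the data.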

Coefficient of Multiple Determination
[mini-case]
$R^2_{Y.12}$ = .9656
96.56 percent of the variation in heating oil can be explained by the variation in temperature and insulation.

Coefficient of Multiple Determination
Proportion of variation in Y ‘explained’ by all
X variables taken together
$$R^2_{Y.12} = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST}$$
Never decreases when new X variable is added to model
–Only Y values determine SST
–Disadvantage when comparing models

Coefficient of Multiple Determination
Adjusted
Proportion of variation in Y ‘explained’ by all X variables taken together
Reflects
–Sample size
–Number of independent variables
Smaller [more conservative] than $R^2_{Y.12}$
Used to compare models

Coefficient of Multiple Determination
(adjusted)
$R^2_{adj,\,Y.12 \ldots P}$
The proportion of the variation in Y that is explained by the set of independent [explanatory] variables selected, adjusted for the number of independent variables and the sample size.

Coefficient of Multiple Determination
(adjusted) [Mini-Case]
$R^2_{adj}$ = 0.9599
95.99 percent of the variation in
heating oil consumption can be
explained by the model - adjusted
for number of independent variables
and the sample size
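The slides do not show the adjustment formula. A sketch using the usual textbook form, $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - P - 1}$, which is assumed here, applied to the mini-case values (n = 15 homes, P = 2 explanatory variables, R-squared of 96.561 percent from the output):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for a model with p explanatory variables fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


adj = adjusted_r_squared(0.96561, n=15, p=2)
print(round(adj, 4))  # 0.9599
```

Because the penalty factor (n - 1)/(n - P - 1) grows with P, adding a weak X variable can lower the adjusted value even though the unadjusted R-squared never decreases.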

Coefficient of Partial Determination
Proportion of variation in Y ‘explained’ by variable $X_P$ holding all others constant
Must estimate separate models
Denoted $R^2_{Y1.2}$ in two X variables case
–Coefficient of partial determination of $X_1$ with Y holding $X_2$ constant
Useful in selecting X variables

Coefficient of Partial
Determination [p. 878]
$R^2_{Y1.234 \ldots P}$
The coefficient of partial determination of variable Y with $x_1$, holding constant the effects of variables $x_2, x_3, x_4, \ldots, x_P$.

Coefficient of Partial Determination
[Mini-Case]
$R^2_{Y1.2}$ = 0.9561
For a fixed (constant) amount of
insulation, 95.61 percent of the variation
in heating oil can be explained by the
variation in average atmospheric
temperature. [p. 879]

Coefficient of Partial Determination
[Mini-Case]
$R^2_{Y2.1}$ = 0.8588
For a fixed (constant) temperature,
85.88 percent of the variation in
heating oil can be explained by the
variation in amount of insulation.
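One way to obtain these two values, sketched under the assumption that the coefficient of partial determination is computed as the share of the variation left unexplained by the other variable that the added variable then explains. This takes one reduced model per coefficient plus the full model, which is why separate models must be estimated. The helper name `sse` and the use of NumPy are illustrative choices, not taken from the slides.

```python
import numpy as np

# Mini-case data
oil = np.array([275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
                237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58], dtype=float)
insul = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10], dtype=float)


def sse(y, *xs):
    """Error sum of squares from a least-squares fit of y on the given predictors."""
    X = np.column_stack([np.ones(len(y)), *xs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


sse_full = sse(oil, temp, insul)   # both variables in the model
sse_insul_only = sse(oil, insul)   # temperature left out
sse_temp_only = sse(oil, temp)     # insulation left out

# Temperature (x1) holding insulation (x2) constant, and the reverse
r2_y1_2 = (sse_insul_only - sse_full) / sse_insul_only
r2_y2_1 = (sse_temp_only - sse_full) / sse_temp_only

print(round(r2_y1_2, 4), round(r2_y2_1, 4))
```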

Testing Overall Significance
Shows if there is a linear relationship between
all X variables together & Y
Uses p-value
Hypotheses
–$H_0$: $\beta_1 = \beta_2 = \cdots = \beta_P = 0$
»No linear relationship
–$H_1$: At least one coefficient is not 0
»At least one X variable affects Y
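The slides mention only the p-value. As a sketch, the overall F statistic behind that p-value can be recovered from $R^2$ using the standard relation $F = \frac{R^2 / P}{(1 - R^2)/(n - P - 1)}$, which is assumed here rather than taken from the slides. The numbers are the mini-case values.

```python
def overall_f(r2, n, p):
    """Overall F statistic with p and n - p - 1 degrees of freedom."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


f_stat = overall_f(0.96561, n=15, p=2)
print(round(f_stat, 1))
```

The statistic is then referred to an F distribution with P and n - P - 1 degrees of freedom (2 and 12 in the mini-case) to obtain the p-value.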

Testing Model Portions
Examines the contribution of a set of X variables to the relationship with Y
Null hypothesis:
–Variables in set do not significantly improve the model when all other variables are included
Must estimate separate models
Used in selecting X variables

Diagnostic Checking
$H_0$: retain or reject
If reject - {p-value $\le$ 0.05}
$R^2_{adj}$
Correlation matrix
Partial correlation matrix

Multicollinearity
High correlation between X variables
Coefficients measure combined effect
Leads to unstable coefficients depending on X
variables in model
Always exists; matter of degree
Example: Using both total number of rooms
and number of bedrooms as explanatory
variables in same model

Detecting Multicollinearity
Examine correlation matrix
–Correlations between pairs of X variables are higher than their correlations with the Y variable
Few remedies
–Obtain new sample data
–Eliminate one correlated X variable
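A sketch of the correlation-matrix check applied to the mini-case data, assuming NumPy's `np.corrcoef`; the row and column order (oil, temperature, insulation) is a choice made here.

```python
import numpy as np

# Mini-case data
oil = np.array([275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
                237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58], dtype=float)
insul = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10], dtype=float)

# Each input array is treated as one variable; the result is a 3 x 3 matrix
corr = np.corrcoef([oil, temp, insul])

r_temp_insul = corr[1, 2]  # correlation between the two X variables
r_oil_temp = corr[0, 1]    # correlation of each X variable with Y
r_oil_insul = corr[0, 2]

print(np.round(corr, 3))
```

If the X-to-X correlation is small next to the X-to-Y correlations, as this check suggests for temperature and insulation, multicollinearity is unlikely to be a concern for this model.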

Evaluating Multiple Regression Model Steps
Examine variation measures
Do residual analysis
Test parameter significance
–Overall model
–Portions of model
–Individual coefficients
Test for multicollinearity

Dummy-Variable Regression Model
Involves categorical X variable with two levels
–e.g., female-male, employed-not employed, etc.
Variable levels coded 0 & 1
Assumes only intercept is different
–Slopes are constant across categories
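A small sketch of the 0/1 coding. The data below are made up and noise-free, purely for illustration (they do not come from the slides), and NumPy is an assumed tool. The fit recovers one common slope and two intercepts.

```python
import numpy as np

# Hypothetical data: x1 is quantitative, d is the dummy (0 = one category, 1 = the other)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = 2.0 + 3.0 * x1 + 5.0 * d   # invented relationship with no noise

# Columns: intercept, x1, dummy
X = np.column_stack([np.ones(len(y)), x1, d])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Category coded 0: intercept b0;  category coded 1: intercept b0 + b2;  same slope b1
print(round(b0, 3), round(b1, 3), round(b0 + b2, 3))
```

The dummy coefficient shifts the line up or down for the category coded 1 without changing its slope, which is the "only intercept is different" assumption.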

Dummy-Variable Model Relationships
[Figure: Y plotted against $X_1$; lines for Females and Males with the same slope $b_1$ and intercepts $b_0$ and $b_0 + b_2$]

Dummy Variables
Permits use of qualitative data (e.g., seasonal, class standing, location, gender).
0, 1 coding (nominal data)
As part of diagnostic checking, incorporate outliers (i.e., large residuals) and influence measures.

Interaction Regression Model
Hypothesizes interaction between pairs of X
variables
–Response to one X variable varies at different
levels of another X variable
Contains two-way cross product terms
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$
Can be combined with other models
e.g. dummy variable models

Effect of Interaction
Given: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
Without interaction term, effect of $X_1$ on Y is measured by $\beta_1$
With interaction term, effect of $X_1$ on Y is measured by $\beta_1 + \beta_3 X_2$
–Effect increases as $X_{2i}$ increases

Interaction Example
$Y = 1 + 2X_1 + 3X_2 + 4X_1X_2$
With $X_2 = 0$: $Y = 1 + 2X_1 + 3(0) + 4X_1(0) = 1 + 2X_1$
With $X_2 = 1$: $Y = 1 + 2X_1 + 3(1) + 4X_1(1) = 4 + 6X_1$
Effect (slope) of $X_1$ on Y does depend on $X_2$ value
[Figure: Y (0 to 12) plotted against $X_1$ (0 to 1.5), showing both lines]
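The two lines in the example can be checked numerically. A sketch in plain Python; the function names are illustrative.

```python
def y(x1, x2):
    """The example model Y = 1 + 2*X1 + 3*X2 + 4*X1*X2."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def slope_x1(x2):
    """Change in Y for a one-unit increase in X1 at a fixed value of X2."""
    return y(1, x2) - y(0, x2)


for x2 in (0, 1):
    print(x2, slope_x1(x2))
# prints "0 2" then "1 6"
```

The slope moves from 2 to 6 as $X_2$ goes from 0 to 1, matching $\beta_1 + \beta_3 X_2$ with $\beta_1 = 2$ and $\beta_3 = 4$.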

Inherently Linear Models
Non-linear models that can be expressed in
linear form
–Can be estimated by least squares in linear form
Require data transformation
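A sketch of the idea for one inherently linear form, an exponential model $Y = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}$: taking the natural log of Y turns it into a linear model that least squares can estimate. The data are made up and noise-free, purely for illustration, and NumPy is an assumed tool.

```python
import numpy as np

# Hypothetical predictors and an invented exponential relationship (no noise)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.exp(0.5 + 0.3 * x1 - 0.2 * x2)

# Data transformation: ln(Y) is linear in X1 and X2
log_y = np.log(y)

# Least squares on the transformed response
X = np.column_stack([np.ones(len(y)), x1, x2])
coef = np.linalg.lstsq(X, log_y, rcond=None)[0]

print(np.round(coef, 3))  # estimates of beta_0, beta_1, beta_2 on the log scale
```

With real, noisy data the coefficients are estimated on the log scale, so predictions of Y itself need to be transformed back with the exponential function.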

Curvilinear Model Relationships
[Figure: four panels of Y plotted against $X_1$]

Logarithmic Transformation
$$Y = \beta_0 + \beta_1 \ln x_1 + \beta_2 \ln x_2 + \varepsilon$$
[Figure: Y plotted against $X_1$, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]

Square-Root Transformation
$$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$$
[Figure: Y plotted against $X_1$, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]

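Reciprocal Transformation
$$Y_i = \beta_0 + \beta_1 \frac{1}{X_{1i}} + \beta_2 \frac{1}{X_{2i}} + \varepsilon_i$$
[Figure: Y plotted against $X_1$, with curves for $\beta_1 > 0$ and $\beta_1 < 0$ approaching an asymptote]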

Exponential Transformation
$$Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \, \varepsilon_i$$
[Figure: Y plotted against $X_1$, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]

Overview
Explained the linear multiple regression
model
Interpreted linear multiple regression
computer output
Explained multicollinearity
Described the types of multiple regression
models

Source of Elaborate Slides
Prentice Hall, Inc
Levine et al., First Edition

Regression Analysis
[Multiple Regression]
*** End of Presentation ***
Questions?