Simple Linear Regression in R-Programming

Subrahmanya6 18 views 129 slides Jan 29, 2025


Chapter 13:
SIMPLE LINEAR REGRESSION

2
SIMPLE LINEAR REGRESSION

Simple Regression

Linear Regression

3
Simple Regression
Definition
A regression model is a mathematical equation
that describes the relationship between two or
more variables. A simple regression model
includes only two variables: one independent
and one dependent. The dependent variable is
the one being explained, and the independent
variable is the one used to explain the variation
in the dependent variable.

4
Linear Regression
Definition
A (simple) regression model that gives a
straight-line relationship between two
variables is called a linear regression
model.

5
Figure 13.1 Relationship between food expenditure and income. (a) Linear relationship. (b) Nonlinear relationship.
[Figure: two panels, each plotting food expenditure (vertical axis) against income (horizontal axis).]

6
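Figure 13.2 Plotting a linear equation.
[Figure: the line y = 50 + 5x, passing through the points (x = 0, y = 50) and (x = 10, y = 100). The x-axis is marked at 5, 10, 15 and the y-axis at 50, 100, 150.]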

7
Figure 13.3 y-intercept and slope of a line.
[Figure: y plotted against x. The y-intercept is 50, and each change of 1 in x produces a change of 5 in y.]

8
SIMPLE LINEAR REGRESSION
ANALYSIS

Scatter Diagram

Least Square Line

Interpretation of a and b

Assumptions of the Regression Model

9
SIMPLE LINEAR REGRESSION
ANALYSIS cont.
y = A + Bx
where y is the dependent variable, A is the constant term or y-intercept, B is the slope, and x is the independent variable.

10
SIMPLE LINEAR REGRESSION
ANALYSIS cont.
Definition
In the regression model y = A + Bx + ε, A
is called the y-intercept or constant term,
B is the slope, and ε is the random error
term. The dependent and independent
variables are y and x, respectively.

11
SIMPLE LINEAR REGRESSION
ANALYSIS
Definition
In the model ŷ = a + bx, a and b, which
are calculated using sample data, are
called the estimates of A and B.

12
Table 13.1 Incomes (in hundreds of dollars) and
Food Expenditures of Seven Households
Income   Food Expenditure
35        9
49       15
21        7
39       11
15        5
28        8
25        9

13
Scatter Diagram
Definition
A plot of paired observations is called a
scatter diagram.

14
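Figure 13.4 Scatter diagram.
[Figure: food expenditure (vertical axis) plotted against income (horizontal axis) for the seven households; the points for the first and seventh households are labeled.]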

15
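Figure 13.5 Scatter diagram and straight lines.
[Figure: scatter diagram of food expenditure (vertical axis) against income (horizontal axis) with straight lines drawn through the points.]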

16
Least Squares Line
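Figure 13.6 Regression line and random errors.
[Figure: scatter diagram of food expenditure (vertical axis) against income (horizontal axis) with the regression line; e marks the random error of a point relative to the line.]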

17
Error Sum of Squares (SSE)
The error sum of squares, denoted SSE, is
$$\text{SSE} = \sum e^2 = \sum (y - \hat{y})^2$$
The values of a and b that give the minimum
SSE are called the least squares estimates of A
and B, and the regression line obtained with
these estimates is called the least squares line.

18
The Least Squares Line
For the least squares regression line
ŷ = a + bx,
$$b = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

19
The Least Squares Line cont.
where
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \quad \text{and} \quad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$
and SS stands for “sum of squares”. The
least squares regression line ŷ = a + bx
is also called the regression of y on x.
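The formulas for a and b can be obtained by minimizing SSE with respect to a and b. The derivation below is a sketch added for illustration; it is not part of the original slides and uses only the symbols already defined (SSE, SS_xy, SS_xx, and the sample means of x and y).

```latex
\begin{align*}
\text{SSE}(a, b) &= \sum (y - a - bx)^2 \\
\frac{\partial\,\text{SSE}}{\partial a} = -2\sum (y - a - bx) = 0
  &\;\Longrightarrow\; a = \bar{y} - b\bar{x} \\
\frac{\partial\,\text{SSE}}{\partial b} = -2\sum x\,(y - a - bx) = 0
  &\;\Longrightarrow\; \sum xy - a\sum x - b\sum x^2 = 0 \\
\text{substituting } a = \bar{y} - b\bar{x}: \quad
  \sum xy - \frac{(\sum x)(\sum y)}{n}
  &= b\left(\sum x^2 - \frac{(\sum x)^2}{n}\right) \\
b &= \frac{SS_{xy}}{SS_{xx}}
\end{align*}
```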

20
Example 13-1
Find the least squares regression line
for the data on incomes and food
expenditures of the seven households
given in Table 13.1. Use income as
the independent variable and food
expenditure as the dependent variable.

21
Table 13.2
Income x   Food Expenditure y    xy     x²
35          9                    315    1225
49         15                    735    2401
21          7                    147     441
39         11                    429    1521
15          5                     75     225
28          8                    224     784
25          9                    225     625
Σx = 212   Σy = 64   Σxy = 2150   Σx² = 7222

22
Solution 13-1
$$\sum x = 212 \qquad \sum y = 64$$
$$\bar{x} = \sum x / n = 212/7 = 30.2857$$
$$\bar{y} = \sum y / n = 64/7 = 9.1429$$

23
Solution 13-1
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 2150 - \frac{(212)(64)}{7} = 211.7143$$
$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 7222 - \frac{(212)^2}{7} = 801.4286$$

24
Solution 13-1
$$b = \frac{SS_{xy}}{SS_{xx}} = \frac{211.7143}{801.4286} = .2642$$
$$a = \bar{y} - b\bar{x} = 9.1429 - (.2642)(30.2857) = 1.1414$$
Thus,
ŷ = 1.1414 + .2642x
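The calculation in Solution 13-1 can be reproduced with a few lines of code. The sketch below is written in Python for illustration; the function name `least_squares_line` is hypothetical and not from the slides. It applies the SS_xy and SS_xx formulas to the Table 13.1 data.

```python
# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]


def least_squares_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # SS_xy = sum(xy) - (sum x)(sum y)/n and SS_xx = sum(x^2) - (sum x)^2/n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    b = ss_xy / ss_xx
    a = sum_y / n - b * (sum_x / n)
    return a, b


a, b = least_squares_line(x, y)
print(f"y-hat = {a:.4f} + {b:.4f}x")
```

Because b is not rounded before a is computed here, a can differ from the slide value 1.1414 in the third or fourth decimal place.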

25
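Figure 13.7 Error of prediction.
[Figure: scatter diagram of food expenditure (vertical axis) against income (horizontal axis) with the regression line ŷ = 1.1414 + .2642x. For one household: Actual = $900, Predicted = $1038.84, Error e = -$138.84.]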

26
Interpretation of a and b
Interpretation of a
Consider a household with zero income

ŷ = 1.1414 + .2642(0) = $1.1414 hundred
Thus, we can state that a household with
no income is expected to spend $114.14
per month on food
However, the sample contains no household
with zero income; the regression line is valid
only for values of x between 15 and 49

27
Interpretation of a and b
cont.
Interpretation of b

The value of b in the regression model
gives the change in y due to a change of
one unit in x

Because x and y are both measured in hundreds
of dollars, we can state that, on average, a $1
increase in income of a household will
increase the food expenditure by $.2642

28
Figure 13.8 Positive and negative linear relationships between x and y.
[Figure: (a) Positive linear relationship, b > 0. (b) Negative linear relationship, b < 0. Both panels plot y against x.]

29
Assumptions of the
Regression Model
Assumption 1:
The random error term ε has a mean
equal to zero for each x

30
Assumptions of the
Regression Model cont.
Assumption 2:
The errors associated with different
observations are independent

31
Assumptions of the
Regression Model cont.
Assumption 3:
For any given x, the distribution of errors
is normal

32
Assumptions of the
Regression Model cont.
Assumption 4:
The distribution of population errors for
each x has the same (constant) standard
deviation, which is denoted σ_ε.

33
Figure 13.11 (a) Errors for households with an income of $2000 per month.
[Figure: normal distribution of errors for households with income = $2000, with E(ε) = 0 and (constant) standard deviation σ_ε.]

34
Figure 13.11 (b) Errors for households with an income of $3500 per month.
[Figure: normal distribution of errors for households with income = $3500, with E(ε) = 0 and (constant) standard deviation σ_ε.]

35
Figure 13.12 Distribution of errors around the population regression line.
[Figure: population regression line of food expenditure (vertical axis, marked 4 to 16) against income (horizontal axis, marked 10 to 50), with the error distributions shown at x = 20 and x = 35.]

36
Figure 13.13 Nonlinear relations between x and y.
[Figure: panels (a) and (b), each plotting y against x.]

37
Figure 13.14 Spread of errors for x = 20 and x = 35.
[Figure: population regression line of food expenditure (vertical axis, marked 4 to 16) against income (horizontal axis, marked 10 to 50), with the spread of errors shown at x = 20 and x = 35.]

38
STANDARD DEVIATION OF
RANDOM ERRORS
Degrees of Freedom for a Simple Linear
Regression Model
The degrees of freedom for a simple
linear regression model are
df = n – 2

39
STANDARD DEVIATION OF
RANDOM ERRORS cont.

The standard deviation of errors is
calculated as
$$s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}}$$
where
$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$

40
Example 13-2
Compute the standard deviation of
errors s_e for the data on monthly
incomes and food expenditures of the
seven households given in Table 13.1.

41
Table 13.3
Income x   Food Expenditure y    y²
35          9                     81
49         15                    225
21          7                     49
39         11                    121
15          5                     25
28          8                     64
25          9                     81
Σx = 212   Σy = 64   Σy² = 646

42
Solution 13-2
$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 646 - \frac{(64)^2}{7} = 60.8571$$
$$s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\frac{60.8571 - .2642(211.7143)}{7 - 2}} = .9922$$
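As a check on Solution 13-2, the standard deviation of errors can be computed directly from the Table 13.1 data. This is a minimal Python sketch; the variable names are illustrative and not from the slides.

```python
import math

# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]
n = len(x)

ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

b = ss_xy / ss_xx
df = n - 2  # degrees of freedom for simple linear regression
s_e = math.sqrt((ss_yy - b * ss_xy) / df)
print(f"SS_yy = {ss_yy:.4f}, s_e = {s_e:.4f}")
```

The slides round b to .2642 before substituting, so the unrounded result can differ slightly from .9922 in the later decimal places.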

43
COEFFICIENT OF
DETERMINATION
Total Sum of Squares (SST)
The total sum of squares, denoted by
SST, is calculated as
$$SST = \sum y^2 - \frac{(\sum y)^2}{n}$$


44
Figure 13.15 Total errors.
[Figure: scatter diagram of food expenditure (vertical axis, marked 4 to 16) against income (horizontal axis, marked 10 to 50), with the mean ȳ = 9.1429 marked.]

45
Table 13.4
x     y     ŷ = 1.1414 + .2642x    e = y – ŷ    e² = (y – ŷ)²
35     9    10.3884                -1.3884      1.9277
49    15    14.0872                  .9128       .8332
21     7     6.6896                  .3104       .0963
39    11    11.4452                 -.4452       .1982
15     5     5.1044                 -.1044       .0109
28     8     8.5390                 -.5390       .2905
25     9     7.7464                 1.2536      1.5715
Σe² = Σ(y – ŷ)² = 4.9283

46
Figure 13.16 Errors of prediction when regression model is used.
[Figure: scatter diagram of food expenditure (vertical axis) against income (horizontal axis) with the regression line ŷ = 1.1414 + .2642x.]

47
COEFFICIENT OF
DETERMINATION cont.
Regression Sum of Squares (SSR)
The regression sum of squares, denoted
by SSR, is
$$SSR = SST - SSE$$
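The three sums of squares can be computed from the residuals of the fitted line. The Python sketch below is illustrative only; it uses the rounded estimates a = 1.1414 and b = .2642 from Solution 13-1, and the variable names are assumptions rather than anything defined in the slides.

```python
# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]

a, b = 1.1414, 0.2642  # rounded estimates from Solution 13-1
y_bar = sum(y) / len(y)

# Predicted values and errors of prediction, as in Table 13.4
y_hat = [a + b * xi for xi in x]
errors = [yi - yh for yi, yh in zip(y, y_hat)]

sse = sum(e ** 2 for e in errors)           # error sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)    # total sum of squares
ssr = sst - sse                             # regression sum of squares
print(f"SST = {sst:.4f}, SSE = {sse:.4f}, SSR = {ssr:.4f}")
```

Here SST is computed as the sum of squared deviations from the mean of y, which is algebraically the same as the computational formula for SST.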

48
COEFFICIENT OF
DETERMINATION cont.
Coefficient of Determination
The coefficient of determination, denoted
by r², represents the proportion of SST
that is explained by the use of the
regression model. The computational
formula for r² is
$$r^2 = \frac{b\,SS_{xy}}{SS_{yy}}$$
and 0 ≤ r² ≤ 1

49
Example 13-3
For the data of Table 13.1 on monthly
incomes and food expenditures of seven
households, calculate the coefficient of
determination.

50
Solution 13-3
From earlier calculations
b = .2642, SS_xy = 211.7143,
and SS_yy = 60.8571
$$r^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{(.2642)(211.7143)}{60.8571} = .92$$
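The value of r² can be cross-checked in code. The Python sketch below (illustrative names, not from the slides) computes r² with the computational formula and compares it with 1 minus SSE/SST, which should agree when the unrounded least squares estimates are used.

```python
# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]
n = len(x)

ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

b = ss_xy / ss_xx
a = sum(y) / n - b * sum(x) / n

# Computational formula: r^2 = b * SS_xy / SS_yy
r_squared = b * ss_xy / ss_yy

# Cross-check: proportion of SST explained, with SST = SS_yy
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r_squared_check = 1 - sse / ss_yy
print(f"r^2 = {r_squared:.4f} (check: {r_squared_check:.4f})")
```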

51
INFERENCES ABOUT B

Sampling Distribution of b

Estimation of B

Hypothesis Testing About B

52
Sampling Distribution of b
Mean, Standard Deviation, and Sampling
Distribution of b
The mean and standard deviation of b,
denoted by μ_b and σ_b, respectively, are
$$\mu_b = B \quad \text{and} \quad \sigma_b = \frac{\sigma_\epsilon}{\sqrt{SS_{xx}}}$$

53
Estimation of B
Confidence Interval for B
The (1 – α)100% confidence interval for B
is given by
$$b \pm t\,s_b$$
where
$$s_b = \frac{s_e}{\sqrt{SS_{xx}}}$$

54
Example 13-4
Construct a 95% confidence interval for B
for the data on incomes and food
expenditures of seven households given
in Table 13.1.

55
Solution 13-4
$$s_b = \frac{s_e}{\sqrt{SS_{xx}}} = \frac{.9922}{\sqrt{801.4286}} = .0350$$
df = n – 2 = 7 – 2 = 5
α/2 = .5 – (.95/2) = .025
t = 2.571
$$b \pm t\,s_b = .2642 \pm 2.571(.0350) = .2642 \pm .0900 = .17 \text{ to } .35$$
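The confidence interval for B can be sketched in Python as follows. The critical value t = 2.571 is taken from the solution above rather than computed, and the helper name `slope_and_std_error` is hypothetical.

```python
import math

# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]

T_CRITICAL = 2.571  # from the slides: df = 5, area .025 in each tail


def slope_and_std_error(x, y):
    """Return (b, s_b): the slope estimate and its standard deviation."""
    n = len(x)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    b = ss_xy / ss_xx
    s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))
    return b, s_e / math.sqrt(ss_xx)


b, s_b = slope_and_std_error(x, y)
margin = T_CRITICAL * s_b
lower, upper = b - margin, b + margin
print(f"95% CI for B: {lower:.2f} to {upper:.2f}")
```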

56
Hypothesis Testing About B
Test Statistic for b
The value of the test statistic t for b is
calculated as
$$t = \frac{b - B}{s_b}$$
The value of B is substituted from the null
hypothesis.

57
Example 13-5
Test at the 1% significance level whether
the slope of the regression line for the
example on incomes and food
expenditures of seven households is
positive.

58
Solution 13-5
H₀: B = 0 (the slope is zero)
H₁: B > 0 (the slope is positive)

59
Solution 13-5

n = 7 < 30

σ_ε is not known

Hence, we will use the t distribution to
make the test about B

Area in the right tail = α = .01

df = n – 2 = 7 – 2 = 5

The critical value of t is 3.365

60
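Figure 13.17
[Figure: t distribution curve. The critical value of t is 3.365; the area α = .01 in the right tail beyond it is the "Reject H₀" region, and the area to its left is the "Do not reject H₀" region.]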

61
Solution 13-5
$$t = \frac{b - B}{s_b} = \frac{.2642 - 0}{.0350} = 7.549$$
(B = 0 is taken from H₀)

62
Solution 13-5

The value of the test statistic t = 7.549

It is greater than the critical value of t

It falls in the rejection region

Hence, we reject the null hypothesis
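The test in Example 13-5 can be sketched in Python. The critical value 3.365 is taken from the slides rather than computed, and the variable names are illustrative. Because the code does not round b and s_b first, the test statistic comes out slightly different from the slide value 7.549, but the conclusion is the same.

```python
import math

# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]
n = len(x)

B_NULL = 0.0        # value of B under H0
T_CRITICAL = 3.365  # from the slides: df = 5, area .01 in the right tail

ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

b = ss_xy / ss_xx
s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))
s_b = s_e / math.sqrt(ss_xx)

t_stat = (b - B_NULL) / s_b
reject_h0 = t_stat > T_CRITICAL  # right-tailed test
print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```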

63
LINEAR CORRELATION

Linear Correlation Coefficient

Hypothesis Testing About the Linear
Correlation Coefficient

64
Linear Correlation
Coefficient
Value of the Correlation Coefficient
The value of the correlation coefficient
always lies in the range of –1 to 1; that is,
-1 ≤ ρ ≤ 1 and -1 ≤ r ≤ 1

65
Figure 13.18 Linear correlation between two
variables.
(a) Perfect positive linear correlation, r = 1
[Figure: y plotted against x.]

66
Figure 13.18 Linear correlation between two
variables.
(b) Perfect negative linear correlation, r = -1
[Figure: y plotted against x.]

67
Figure 13.18 Linear correlation between two
variables.
(c) No linear correlation, r ≈ 0
[Figure: y plotted against x.]

68
Figure 13.19 Linear correlation between
variables.
(a) Strong positive linear correlation (r is close to 1)
[Figure: y plotted against x.]
69
Figure 13.19 Linear correlation between
variables.
(b) Weak positive linear correlation (r is positive but close to 0)
[Figure: y plotted against x.]

70
Figure 13.19 Linear correlation between
variables.
(c) Strong negative linear correlation (r is close to -1)
[Figure: y plotted against x.]

71
Figure 13.19 Linear correlation between
variables.
(d) Weak negative linear correlation (r is negative and close to 0)
[Figure: y plotted against x.]

72
Linear Correlation
Coefficient cont.
Linear Correlation Coefficient
The simple linear correlation, denoted by
r, measures the strength of the linear
relationship between two variables for a
sample and is calculated as
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

73
Example 13-6
Calculate the correlation coefficient for
the example on incomes and food
expenditures of seven households.

74
Solution 13-6
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{211.7143}{\sqrt{(801.4286)(60.8571)}} = .96$$
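The same correlation coefficient can be computed directly from the data. This Python sketch uses a hypothetical function name, `correlation`, and the Table 13.1 values.

```python
import math

# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]


def correlation(x, y):
    """Return the sample linear correlation coefficient r."""
    n = len(x)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)


r = correlation(x, y)
print(f"r = {r:.2f}")
```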

75
Hypothesis Testing About
the Linear Correlation
Coefficient
Test Statistic for r
If both variables are normally distributed
and the null hypothesis is H₀: ρ = 0, then
the value of the test statistic t is
calculated as
$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$
Here n – 2 are the degrees of freedom.

76
Example 13-7
Using the 1% level of significance and the
data from Example 13-1, test whether the
linear correlation coefficient between
incomes and food expenditures is
positive. Assume that the populations of
both variables are normally distributed.

77
Solution 13-7
H₀: ρ = 0 (the linear correlation coefficient is zero)
H₁: ρ > 0 (the linear correlation coefficient is positive)

78
Solution 13-7

Area in the right tail = .01

df = n – 2 = 7 – 2 = 5

The critical value of t = 3.365

79
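Figure 13.20
[Figure: t distribution curve. The critical value of t is 3.365; the area α = .01 in the right tail beyond it is the "Reject H₀" region, and the area to its left is the "Do not reject H₀" region.]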

80
Solution 13-7
$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = .96\sqrt{\frac{7 - 2}{1 - (.96)^2}} = 7.667$$

81
Solution 13-7

The value of the test statistic t = 7.667

It is greater than the critical value of t

It falls in the rejection region

Hence, we reject the null hypothesis
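The test statistic for r can be sketched in Python as below. The function name `t_for_correlation` is hypothetical, and the critical value 3.365 is taken from the slides. The sketch evaluates the statistic twice: once with r rounded to two decimals, as in the solution above, and once with the unrounded r, to show how much of the slide value is due to rounding.

```python
import math

# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]
n = len(x)

T_CRITICAL = 3.365  # from the slides: df = 5, area .01 in the right tail


def t_for_correlation(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
r = ss_xy / math.sqrt(ss_xx * ss_yy)

t_rounded = t_for_correlation(round(r, 2), n)  # uses r = .96 as in the slides
t_exact = t_for_correlation(r, n)              # uses the unrounded r
print(f"t (r rounded) = {t_rounded:.3f}, t (r unrounded) = {t_exact:.3f}")
print("reject H0:", t_exact > T_CRITICAL)
```

Either way the statistic is well above the critical value, so the conclusion does not depend on the rounding.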

82
REGRESSION ANALYSIS:
COMPLETE EXAMPLE
Example 13-8
A random sample of eight drivers insured
with a company and having similar auto
insurance policies was selected. The
following table lists their driving
experience (in years) and monthly auto
insurance premiums.

83
Example 13-8
Driving Experience (years)   Monthly Auto Insurance Premium
 5                           $64
 2                            87
12                            50
 9                            71
15                            44
 6                            56
25                            42
16                            60

84
Example 13-8
a)Does the insurance premium depend
on the driving experience or does the
driving experience depend on the
insurance premium? Do you expect a
positive or a negative relationship
between these two variables?

85
Solution 13-8
a)The insurance premium depends on
driving experience

The insurance premium is the dependent
variable

The driving experience is the independent
variable

86
Example 13-8
b) Compute SS_xx, SS_yy, and SS_xy.

87
Table 13.5
Experience x   Premium y    xy      x²     y²
 5             64           320     25     4096
 2             87           174      4     7569
12             50           600    144     2500
 9             71           639     81     5041
15             44           660    225     1936
 6             56           336     36     3136
25             42          1050    625     1764
16             60           960    256     3600
Σx = 90   Σy = 474   Σxy = 4739   Σx² = 1396   Σy² = 29,642

88
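Solution 13-8
b)
$$\bar{x} = \sum x / n = 90/8 = 11.25$$
$$\bar{y} = \sum y / n = 474/8 = 59.25$$
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5000$$
$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5000$$
$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 29{,}642 - \frac{(474)^2}{8} = 1557.5000$$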

89
Example 13-8
c)Find the least squares regression line
by choosing appropriate dependent
and independent variables based on
your answer in part a.

90
Solution 13-8
c)
$$b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5000}{383.5000} = -1.5476$$
$$a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605$$
$$\hat{y} = 76.6605 - 1.5476x$$

91
Example 13-8
d)Interpret the meaning of the values of
a and b calculated in part c.

92
Solution 13-8
d) The value of a = 76.6605 gives the
value of ŷ for x = 0; that is, a driver with
no driving experience is expected to pay a
monthly premium of $76.66
Here, b = -1.5476 indicates that, on
average, for every extra year of driving
experience, the monthly auto
insurance premium decreases by
$1.55.

93
Example 13-8
e)Plot the scatter diagram and the
regression line.

94
Figure 13.21 Scatter diagram and the regression line.
e)
[Figure: scatter diagram of insurance premium (vertical axis) against experience (horizontal axis) with the regression line ŷ = 76.6605 – 1.5476x.]

95
Example 13-8
f) Calculate r and r² and explain what they
mean.

96
Solution 13-8
f)
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{-593.5000}{\sqrt{(383.5000)(1557.5000)}} = -.77$$
$$r^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{(-1.5476)(-593.5000)}{1557.5000} = .59$$

97
Solution 13-8
f) The value of r = -0.77 indicates that
driving experience and the monthly auto
insurance premium are negatively related
The (linear) relationship is strong but not
very strong
The value of r² = 0.59 states that 59% of
the total variation in insurance premiums
is explained by years of driving
experience and 41% is not

98
Example 13-8
g) Predict the monthly auto insurance
premium for a driver with 10 years of
driving experience.

99
Solution 13-8
g) The predicted value of y for x = 10 is
ŷ = 76.6605 – 1.5476(10) = $61.18
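Parts b, c, and g of this example can be reproduced together. The Python sketch below is illustrative; the function names `fit_line` and `predict` are hypothetical and not from the slides.

```python
# Example 13-8: driving experience in years (x), monthly premium in dollars (y)
x = [5, 2, 12, 9, 15, 6, 25, 16]
y = [64, 87, 50, 71, 44, 56, 42, 60]


def fit_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    b = ss_xy / ss_xx
    a = sum(y) / n - b * sum(x) / n
    return a, b


def predict(a, b, x0):
    """Point prediction of y at x = x0."""
    return a + b * x0


a, b = fit_line(x, y)
premium_at_10 = predict(a, b, 10)
print(f"y-hat = {a:.4f} + ({b:.4f})x")
print(f"Predicted premium for 10 years of experience: ${premium_at_10:.2f}")
```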

100
Example 13-8
h)Compute the standard deviation of
errors.

101
Solution 13-8
h)
$$s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\frac{1557.5000 - (-1.5476)(-593.5000)}{8 - 2}} = 10.3199$$

102
Example 13-8
i)Construct a 90% confidence interval
for B.

103
Solution 13-8
i)
$$s_b = \frac{s_e}{\sqrt{SS_{xx}}} = \frac{10.3199}{\sqrt{383.5000}} = .5270$$
α/2 = .5 – (.90/2) = .05
df = n – 2 = 8 – 2 = 6
t = 1.943
$$b \pm t\,s_b = -1.5476 \pm 1.943(.5270) = -1.5476 \pm 1.0240 = -2.57 \text{ to } -.52$$

104
Example 13-8
j)Test at the 5% significance level
whether B is negative.

105
Solution 13-8
j)
H₀: B = 0 (B is not negative)
H₁: B < 0 (B is negative)

106
Solution 13-8

Area in the left tail = α = .05

df = n – 2 = 8 – 2 = 6

The critical value of t is -1.943

107
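Figure 13.22
[Figure: t distribution curve. The critical value of t is -1.943; the area α = .05 in the left tail below it is the "Reject H₀" region, and the area to its right is the "Do not reject H₀" region.]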

108
Solution 13-8
$$t = \frac{b - B}{s_b} = \frac{-1.5476 - 0}{.5270} = -2.937$$
(B = 0 is taken from H₀)

109
Solution 13-8

The value of the test statistic t = -2.937

It falls in the rejection region

Hence, we reject the null hypothesis and
conclude that B is negative
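Parts h, i, and j can be checked with one short script. The Python sketch below is illustrative; the critical value 1.943 (df = 6, area .05 in one tail) is taken from the slides rather than computed, and the variable names are assumptions.

```python
import math

# Example 13-8: driving experience in years (x), monthly premium in dollars (y)
x = [5, 2, 12, 9, 15, 6, 25, 16]
y = [64, 87, 50, 71, 44, 56, 42, 60]
n = len(x)

T_CRITICAL = 1.943  # from the slides: df = 6, area .05 in one tail

ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

b = ss_xy / ss_xx
s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))  # part h
s_b = s_e / math.sqrt(ss_xx)

# Part i: 90% confidence interval for B
lower, upper = b - T_CRITICAL * s_b, b + T_CRITICAL * s_b

# Part j: left-tailed test of H0: B = 0 against H1: B < 0
t_stat = (b - 0) / s_b
reject_h0 = t_stat < -T_CRITICAL

print(f"s_e = {s_e:.4f}, s_b = {s_b:.4f}")
print(f"90% CI for B: {lower:.2f} to {upper:.2f}")
print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```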

110
Example 13-8
k) Using α = .05, test whether ρ is
different from zero.

111
Solution 13-8
k)
H₀: ρ = 0 (the linear correlation coefficient is zero)
H₁: ρ ≠ 0 (the linear correlation coefficient is different from zero)

112
Solution 13-8

Area in each tail = .05/2 = .025

df = n – 2 = 8 – 2 = 6

The critical values of t are -2.447 and
2.447

113
Figure 13.23
[Figure: t distribution curve with two critical values of t, -2.447 and 2.447. Each tail has area α/2 = .025 and is a "Reject H₀" region; the region between the critical values is "Do not reject H₀".]

114
Solution 13-8
$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = (-.77)\sqrt{\frac{8 - 2}{1 - (-.77)^2}} = -2.956$$

115
Solution 13-8

The value of the test statistic t = -2.956

It falls in the rejection region

Hence, we reject the null hypothesis

116
USING THE REGRESSION
MODEL

Using the Regression Model for
Estimating the Mean Value of y

Using the Regression Model for
Predicting a Particular Value of y

117
Figure 13.24 Population and sample regression lines.
[Figure: y plotted against x, showing the population regression line μ_y|x = A + Bx and several regression lines ŷ = a + bx estimated from different samples.]

118
Using the Regression Model for
Estimating the Mean Value of y
Confidence Interval for μ_y|x
The (1 – α)100% confidence interval for μ_y|x
for x = x₀ is
$$\hat{y} \pm t\,s_{\hat{y}_m}$$

119
Confidence Interval for μ_y|x
where the value of t is obtained from the t
distribution table for α/2 area in the right
tail of the t distribution curve and df = n – 2.
The value of s_ŷm is calculated as follows:
$$s_{\hat{y}_m} = s_e\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}}$$

120
Example 13-9
Refer to Example 13-1 on incomes and
food expenditures. Find a 99% confidence
interval for the mean food expenditure for
all households with a monthly income of
$3500.

121
Solution 13-9

Using the regression line, we find the
point estimate of the mean food
expenditure for x = 35

ŷ = 1.1414 + .2642(35) = $10.3884 hundred

Area in each tail = α/2 = .5 – (.99/2) = .005

df = n – 2 = 7 – 2 = 5

t = 4.032

122
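Solution 13-9
$$s_e = .9922, \quad \bar{x} = 30.2857, \quad \text{and} \quad SS_{xx} = 801.4286$$
$$s_{\hat{y}_m} = s_e\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}} = (.9922)\sqrt{\frac{1}{7} + \frac{(35 - 30.2857)^2}{801.4286}} = .4098$$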

123
Solution 13-9
Hence, the 99% confidence interval for μ_y|35 is
$$\hat{y} \pm t\,s_{\hat{y}_m} = 10.3884 \pm 4.032(.4098) = 10.3884 \pm 1.6523 = 8.7361 \text{ to } 12.0407$$
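The interval in Solution 13-9 can be reproduced with the Python sketch below. The function name `mean_confidence_interval` is hypothetical, and the critical value t = 4.032 (df = 5, area .005 in each tail) is taken from the slides. Since intermediate values are not rounded, the endpoints agree with the slide values only to about two decimal places.

```python
import math

# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]

T_CRITICAL = 4.032  # from the slides: df = 5, area .005 in each tail


def mean_confidence_interval(x, y, x0, t_crit):
    """Confidence interval for the mean of y at x = x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))
    # Standard deviation of y-hat when estimating the mean of y
    s_ym = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_xx)
    y_hat = a + b * x0
    return y_hat - t_crit * s_ym, y_hat + t_crit * s_ym


lower, upper = mean_confidence_interval(x, y, 35, T_CRITICAL)
print(f"99% CI for mean food expenditure at x = 35: {lower:.2f} to {upper:.2f}")
```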

124
Using the Regression Model for
Predicting a Particular Value of y
Prediction Interval for y_p
The (1 – α)100% prediction interval for
the predicted value of y, denoted by y_p,
for x = x₀ is
$$\hat{y} \pm t\,s_{\hat{y}_p}$$
125
Prediction Interval for y_p
The value of s_ŷp is calculated as follows:
$$s_{\hat{y}_p} = s_e\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}}$$

126
Example 13-10
Refer to Example 13-1 on incomes and
food expenditures. Find a 99%
prediction interval for the predicted
food expenditure for a randomly
selected household with a monthly
income of $3500.

127
Solution 13-10

Using the regression line, we find the
point estimate of the predicted food
expenditure for x = 35

ŷ = 1.1414 + .2642(35) = $10.3884 hundred

Area in each tail = α/2 = .5 – (.99/2) = .005

df = n – 2 = 7 – 2 = 5

t = 4.032

128
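Solution 13-10
$$s_e = .9922, \quad \bar{x} = 30.2857, \quad \text{and} \quad SS_{xx} = 801.4286$$
$$s_{\hat{y}_p} = s_e\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}} = (.9922)\sqrt{1 + \frac{1}{7} + \frac{(35 - 30.2857)^2}{801.4286}} = 1.0735$$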

129
Solution 13-10
Hence, the 99% prediction interval for y_p for x = 35 is
$$\hat{y} \pm t\,s_{\hat{y}_p} = 10.3884 \pm 4.032(1.0735) = 10.3884 \pm 4.3284 = 6.0600 \text{ to } 14.7168$$
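The prediction interval differs from the confidence interval for the mean only by the extra 1 under the square root, which makes it wider. The Python sketch below illustrates this for x = 35; the function name `prediction_interval` is hypothetical, and t = 4.032 is taken from the slides. Endpoints agree with the slide values to about two decimal places because intermediate values are not rounded.

```python
import math

# Table 13.1: income (x) and food expenditure (y), in hundreds of dollars
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]

T_CRITICAL = 4.032  # from the slides: df = 5, area .005 in each tail


def prediction_interval(x, y, x0, t_crit):
    """Prediction interval for a particular value of y at x = x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))
    # The leading 1 accounts for the variation of an individual y around its mean
    s_yp = s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_xx)
    y_hat = a + b * x0
    return y_hat - t_crit * s_yp, y_hat + t_crit * s_yp


lower, upper = prediction_interval(x, y, 35, T_CRITICAL)
print(f"99% prediction interval at x = 35: {lower:.2f} to {upper:.2f}")
```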