Lecture 3 - Linear Regression_imran .pdf

imrensindhu 5 views 46 slides Oct 21, 2025

About This Presentation

These slides cover the basics of linear regression.


Slide Content

Regression

What is Regression
•Regression analysis is a statistical method that helps us analyze and understand the relationship between two or more variables of interest.
•It helps us understand which factors are important, which can be ignored, and how they influence each other. In other words, it analyzes the specific relationships between the independent variables and the dependent variable.
•In regression, we normally have one dependent variable and one or more independent variables, and we forecast the value of the dependent variable (Y) from the values of the independent variables (X1, X2, …, Xk).
•We try to "regress" the value of the dependent variable Y with the help of the independent variables.

Types of Regression approaches
•There are many types of regression approaches; we will study some of them here:
•Simple Linear Regression
•Multiple Linear Regression
•Polynomial Regression
•Support Vector Regression (SVR)
•Decision Tree Regression
•Random Forest Regression

Simple linear regression
•In statistics, simple linear regression is a linear regression model with a single explanatory variable.
•It concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system).
•It finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent-variable values as a function of the independent variable.
•Simply put, simple linear regression is used to estimate the relationship between two quantitative variables.

What can simple linear regression be used for?
•You can use simple linear regression when you want to know:
•How strong the relationship is between two variables (e.g. the relationship
between rainfall and soil erosion).
•The value of the dependent variable at a certain value of the independent
variable (e.g. the amount of soil erosion at a certain level of rainfall).

Model for simple linear regression
•Consider the equation of a line:
Ŷ = b0 + b1X
•Here y is the dependent variable, x is the independent variable, α (the intercept b0) is the y-intercept, and β (the slope b1) is the slope of the line.
•We need to find α and β to estimate y using x, such that the error ε between the predicted value of y and the original value of y is minimized.

The Model
•The model has a deterministic and a probabilistic component.
•The deterministic component is the line Ŷ = b0 + b1X.
[Figure: house cost vs. house size; most lots sell for $25,000]
•However, house costs vary even among same-size houses! Since cost behaves unpredictably, we add a random component.

•The first-order linear model: Y = b0 + b1X + e
•Y = dependent variable
•X = independent variable
•b0 = Y-intercept
•b1 = slope of the line (rise/run)
•e = error variable
•b0 and b1 are unknown population parameters, and are therefore estimated from the data.

Estimating the Coefficients
•The estimates are determined by
•drawing a sample from the population of interest,
•calculating sample statistics,
•producing a straight line that cuts into the data.
[Scatter plot of Y vs. X] Question: What should be considered a good line?

The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the differences between the data points and the line, use the formulas:
b1 = cov(X,Y) / sX² = sXY / sX²
b0 = Ȳ − b1X̄
The regression equation that estimates the equation of the first-order linear model is:
Ŷ = b0 + b1X
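As a sketch, the two formulas above can be checked numerically with NumPy's sample covariance and variance (the x/y values below are made up for illustration, not from the slides):

```python
import numpy as np

# Illustrative data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# b1 = cov(X, Y) / s_X^2, using the sample (n - 1) convention throughout
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
b1 = cov_xy / var_x

# b0 = Ybar - b1 * Xbar
b0 = y.mean() - b1 * x.mean()
print(b1, b0)
```

Using `ddof=1` in both `np.cov` and `np.var` matches the n − 1 denominators in the deck's definitions of sX² and cov(X,Y).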

Working concept of simple linear regression
•The ordinary least squares (OLS) method is usually used to implement simple linear regression.
•A good line is one that minimizes the sum of squared differences between the points and the line.
•The accuracy of each predicted value is measured by its squared residual (the vertical distance between the data point and the fitted line), and the goal is to make the sum of these squared deviations as small as possible.
[Scatter plot of Y vs. X with a candidate fitted line]

The Simple Linear Regression Line
•Example 17.2 (Xm17-02)
•A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
•A random sample of 100 cars is selected, and the data recorded. Find the regression line.

Car  Odometer (X)  Price (Y)
1    37388         14636
2    44758         14122
3    45833         14016
4    30862         15590
5    31705         15568
6    34010         14718
…    …             …

(Odometer is the independent variable X; Price is the dependent variable Y.)

•Solution
–Solving by hand: calculate a number of statistics (n = 100):
X̄ = 36,009.45;  Ȳ = 14,822.823
sX² = Σ(Xi − X̄)² / (n − 1) = 43,528,690
cov(X,Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1) = −2,712,511
b1 = cov(X,Y) / sX² = −2,712,511 / 43,528,690 = −.06232
b0 = Ȳ − b1X̄ = 14,822.82 − (−.06232)(36,009.45) = 17,067
Ŷ = b0 + b1X = 17,067 − .0623X
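The hand computation can be replayed from the reported summary statistics alone (only the numbers on the slide are used):

```python
# Summary statistics reported on the slide (n = 100 cars)
cov_xy = -2_712_511        # cov(X, Y)
var_x = 43_528_690         # s_X^2
x_bar = 36_009.45          # mean odometer reading
y_bar = 14_822.823         # mean price

b1 = cov_xy / var_x        # slope
b0 = y_bar - b1 * x_bar    # intercept
print(round(b1, 5), round(b0))
```

This reproduces the slope −.06232 and intercept ≈ 17,067 from the slide, which confirms the covariance must be −2,712,511 (not −1,712,511).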

Interpreting the Linear Regression Equation
Ŷ = 17,067 − .0623X
•This is the slope of the line: for each additional mile on the odometer, the price decreases by an average of $0.0623.
[Odometer line-fit plot: Price (13,000–16,000) vs. Odometer]
•The intercept is b0 = $17,067. There are no data near X = 0, so do not interpret the intercept as the "price of cars that have not been driven."

Error Variable: Required Conditions for better performance of simple linear regression
•The error e is a critical part of the regression model.
•Four requirements involving the distribution of e must be satisfied:
•The probability distribution of e is normal.
•The mean of e is zero: E(e) = 0.
•The standard deviation of e is σe for all values of X.
•The set of errors associated with different values of Y are all independent.

The Normality of e
From the first three assumptions we have: Y is normally distributed with mean E(Y) = b0 + b1X and a constant standard deviation σe.
[Figure: normal curves centered at E(Y|X1) = b0 + b1X1, E(Y|X2) = b0 + b1X2, and E(Y|X3) = b0 + b1X3]
The standard deviation remains constant, but the mean value changes with X.
Assessing the Model
•The least squares method will produce a regression line whether or not there is a linear relationship between X and Y.
•Consequently, it is important to assess how well the linear model fits the data.
•Several methods are used to assess the model. All are based on the sum of squares for errors, SSE:
•RMSE
•Coefficient of variation of RMSE
•Normalized MBE (mean difference between actual values and model predictions)
•Coefficient of determination
•Corrected coefficient of determination
•Durbin–Watson statistic
•t-test

•RMSE = ( (1/n) Σi=1..n (Zi − Zi,pred)² )^(1/2)
•CV(RMSE) = RMSE / Zavg, where Zavg is the average of the original values.

How to determine overfitting and underfitting
•Durbin–Watson statistic, DW ∈ (0, 4):
•DW ≈ 2: the residuals are uncorrelated and the model is well fitted; DW < 2: positive autocorrelation in the residuals (a symptom of underfitting); DW > 2: negative autocorrelation (a symptom of overfitting).

Sum of Squares for Errors
•This is the sum of squared differences between the points and the regression line.
•It can serve as a measure of how well the line fits the data. SSE is defined by
SSE = Σi=1..n (Yi − Ŷi)²
•A shortcut formula:
SSE = (n − 1)( sY² − cov(X,Y)² / sX² )
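The shortcut formula can be checked against the direct definition on any dataset; a sketch with illustrative values:

```python
import numpy as np

# Illustrative data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
var_y = np.var(y, ddof=1)

# Least squares coefficients
b1 = cov_xy / var_x
b0 = y.mean() - b1 * x.mean()

# Direct definition: SSE = sum of (Yi - Yhat_i)^2
sse_direct = np.sum((y - (b0 + b1 * x)) ** 2)

# Shortcut: SSE = (n - 1) * (s_Y^2 - cov(X,Y)^2 / s_X^2)
sse_shortcut = (n - 1) * (var_y - cov_xy ** 2 / var_x)
print(sse_direct, sse_shortcut)
```

The two values agree up to floating-point error, since the shortcut is an algebraic identity for the least-squares line.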

Let us compare two lines
[Figure: the four data points (1,2), (2,4), (3,1.5), (4,3.2) with two candidate lines; the second line is the horizontal line Y = 2.5]
•For the first line: sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
•For the second (horizontal) line: sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
•The smaller the sum of squared differences, the better the fit of the line to the data.
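The two sums can be recomputed directly. The first line is assumed to predict 1, 2, 3, 4 at X = 1, 2, 3, 4, as the squared terms imply:

```python
# The four data points from the slide
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

# Line 1: predicts Yhat = X at each point (assumed from the slide's terms)
sse_line1 = sum((y - x) ** 2 for x, y in points)

# Line 2: the horizontal line Yhat = 2.5
sse_line2 = sum((y - 2.5) ** 2 for x, y in points)

print(round(sse_line1, 2), round(sse_line2, 2))  # 7.89 3.99
```

The horizontal line happens to fit these four points better, which is exactly why we need a systematic criterion rather than eyeballing.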

Standard Error of Estimate
•The mean error is equal to zero.
•If σe is small, the errors tend to be close to zero (close to the mean error); then the model fits the data well.
•Therefore, we can use σe as a measure of the suitability of using a linear model.
•An estimator of σe is given by
se = √( SSE / (n − 2) )
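A minimal computation of the estimator (the SSE and n values below are illustrative, not from the slides):

```python
import math

sse = 3.99   # illustrative sum of squared errors
n = 4        # illustrative number of observations

# s_e = sqrt(SSE / (n - 2)); n - 2 because two coefficients were estimated
s_e = math.sqrt(sse / (n - 2))
print(round(s_e, 4))
```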

Example: Data that doesn't meet the assumptions
•Suppose you think there is a linear relationship between meat consumption and the incidence of cancer in the U.S.
•However, you find that much more data has been collected at high rates of meat consumption than at low rates,
•with the result that there is much more variation in the estimate of cancer rates at the low range than at the high range.
•Because the data violate the assumption of homoscedasticity, linear regression is not appropriate for them.

Implementing simple linear regression in Python
1. Import the packages and classes.
2. Import the data.
3. Visualize the data.
4. Handle missing values and clean the data.
5. Split the data into training and test sets.
6. Build the regression model and train it.
7. Check the results of model fitting to know whether the model is satisfactory, using plots.
8. Make predictions using unseen data.
9. Evaluate the model.

Importing packages and data
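The slide shows a code screenshot; here is a minimal stand-in sketch. The column names and values are hypothetical, and in practice the data would come from a file via `pd.read_csv`:

```python
import numpy as np
import pandas as pd

# Stand-in for the deck's dataset. In practice this would be something like
#   df = pd.read_csv(...)   # path/file name not given in the deck
df = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.1, 6.8, 8.2, np.nan, 3.0, 9.5],
    "Salary": [39343, 43525, 54445, 61111, 66029, 91738, 113812,
               57189, -999, 116969],
})
print(df.shape)
print(df.head(3))
```

Note the deliberately included missing value and negative salary; they are handled in the cleaning step.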

Visualize the data
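A scatter plot of the raw data, sketched with matplotlib (the data values are hypothetical; the Agg backend is used so the example runs headless — interactively you would call `plt.show()` instead):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical raw data: years of experience vs. salary
years = [1.1, 2.0, 3.2, 4.5, 5.1, 6.8, 8.2, 9.5]
salary = [39343, 43525, 54445, 61111, 66029, 91738, 113812, 116969]

fig, ax = plt.subplots()
ax.scatter(years, salary)
ax.set_xlabel("Years of experience")
ax.set_ylabel("Salary")
ax.set_title("Raw data")
fig.savefig("raw_data.png")
```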

Handle missing values and clean the data
•Missing data present
•Data cleaning is required as
salary cannot be negative

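A sketch of the cleaning step described above, assuming hypothetical column names: drop rows with missing values, then remove the impossible negative salaries:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one negative salary
df = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, np.nan, 4.5, 5.1],
    "Salary": [39343, 43525, 54445, -999, 66029],
})

# Drop rows with missing values, then keep only non-negative salaries
clean = df.dropna()
clean = clean[clean["Salary"] >= 0]
print(clean.shape)  # two bad rows removed
```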

Visualizing the processed data

Split the data into training and test sets
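The split can be done with scikit-learn's `train_test_split` (a sketch on synthetic data; the 80/20 split and seed are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic cleaned data: one feature column and a linear target
X = np.arange(10).reshape(-1, 1).astype(float)
y = 2.0 * X.ravel() + 1.0

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```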

Build the regression model and train it
•Import the linear regression class from the linear model module.
•Make an instance of the linear regression class.
•Then train the model using the training data.
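The three bullets above map directly onto scikit-learn (a sketch on synthetic data with a known slope of 2 and intercept of 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # import the class

# Synthetic training data: y = 2x + 1 exactly
X_train = np.arange(10).reshape(-1, 1).astype(float)
y_train = 2.0 * X_train.ravel() + 1.0

model = LinearRegression()   # make an instance
model.fit(X_train, y_train)  # train on the training data
print(model.coef_[0], model.intercept_)
```

Because the data are exactly linear, the fitted coefficients recover the true slope and intercept.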

Check the results of model fitting to know whether the model is satisfactory, using plots
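One common way to check the fit visually is to overlay the fitted line on the scatter and plot residuals against fitted values; a sketch on synthetic data (the deck's actual plots are screenshots and not reproduced here):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: linear trend plus a little noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 1.0, 30)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(X, y)                  # data
ax1.plot(X, y_hat, color="red")    # fitted line
ax1.set_title("Fit")
ax2.scatter(y_hat, y - y_hat)      # residuals vs. fitted
ax2.axhline(0, color="red")
ax2.set_title("Residuals vs. fitted")
fig.savefig("fit_check.png")
```

A satisfactory fit shows residuals scattered evenly around zero with no visible pattern.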

Make predictions using unseen data.
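Predicting for unseen X values is a single call to `predict` (synthetic data with slope 2 and intercept 1, so the expected outputs are known):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on y = 2x + 1
X_train = np.arange(10).reshape(-1, 1).astype(float)
y_train = 2.0 * X_train.ravel() + 1.0
model = LinearRegression().fit(X_train, y_train)

# Predict for X values the model has not seen
X_new = np.array([[12.0], [15.0]])
y_pred = model.predict(X_new)
print(y_pred)  # ≈ [25., 31.]
```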

Evaluating the model
•RMSE
•R^2
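Both metrics are available in scikit-learn; a sketch on hypothetical test-set values (RMSE is computed as the square root of `mean_squared_error`):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical test-set targets and model predictions
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.9, 5.2, 6.8, 9.1])

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE
r2 = r2_score(y_test, y_pred)                       # R^2
print(round(rmse, 4), round(r2, 4))
```

A small RMSE (in the units of Y) and an R² close to 1 both indicate a good fit.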

Durbin Watson statistical test
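The Durbin–Watson statistic on the residuals is available in statsmodels (a sketch on independent synthetic residuals, which should give a value near 2):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Synthetic residuals from a hypothetical fitted regression;
# independent errors should give DW near 2
rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 200)

dw = durbin_watson(residuals)
print(round(dw, 2))
```

In practice you would pass the residuals of your fitted model (e.g. `y_test - model.predict(X_test)`) instead of synthetic values.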

Another Example

Import dataset and visualize

Data cleaning

Visualize the data

Splitting the data

Build model and train it

Predicting the output for unseen data