Regression


About This Presentation

Describing Data using Regression - Machine Learning. It also explains the concept of regression to the mean.


Slide Content

Describing Data: REGRESSION

Data is present everywhere and has become a part of our lives.

Data itself has no meaning unless it is contextually processed into information, from which knowledge can be derived.
 Data: raw and unprocessed, obtained from end devices.
 Information: data that has been filtered, processed, categorized, and condensed.
 Knowledge: information organized and structured to achieve a specific objective.

Correlation
Correlation quantifies the degree to which two variables are associated. It does not fit a line through the data points; it shows how much one variable tends to change as the other changes. When it is zero, there is no linear relationship. When it is positive, one variable goes up as the other goes up. When it is negative, one variable goes up as the other goes down. The Pearson correlation measures the degree to which a set of data points forms a straight-line relationship.
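
As a quick illustration (a minimal sketch; the paired data here is made up for the example), NumPy's np.corrcoef computes the Pearson correlation directly:

import numpy as np

# illustrative paired observations (not from the slides)
x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(r)  # about 0.98: close to +1, y rises almost linearly with x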

Introduction to Linear Regression
Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data. Two terms are essential to understanding regression analysis:
 Dependent variable - the factor that we want to understand or predict.
 Independent variables - the factors that influence the dependent variable.

Introduction to Linear Regression (cont.)
Any straight line can be represented by an equation of the form Y = mX + c, where m and c are constants. The value of m is called the slope and determines the direction and degree to which the line is tilted. The value of c is called the Y-intercept and determines the point where the line crosses the Y-axis.

Regression

LEAST SQUARES REGRESSION
We can place the line "by eye": try to have the line as close as possible to all points, with a similar number of points above and below the line. For better accuracy, let's see how to calculate the line using Least Squares Regression. The least squares method obtains the line of best fit for a given data set by minimizing the sum of the squares of the offsets (residuals) of the points from the line.

LEAST SQUARES METHOD

STEPS TO CALCULATE LINE OF BEST FIT

STEP 1: m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)

STEP 2: c = (Σy − m Σx) / N

EXAMPLE 1
Given the points (x, y): (1, 2), (2, 5), (3, 3), (4, 8), (5, 7).

x    y    xy    x²
1    2    2     1
2    5    10    4
3    3    9     9
4    8    32    16
5    7    35    25

∑x = 15, ∑y = 25, ∑xy = 88, ∑x² = 55

EXAMPLE 1 (contd.)
Find the value of m by using the formula m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²):
m = [(5 × 88) − (15 × 25)] / [(5 × 55) − (15)²]
m = (440 − 375) / (275 − 225)
m = 65/50 = 1.3

EXAMPLE 1 (contd.)
Find the value of c by using the formula c = (∑y − m∑x) / n:
c = (25 − 1.3 × 15) / 5
c = (25 − 19.5) / 5
c = 5.5/5 = 1.1
So the required least squares line is y = mx + c = 1.3x + 1.1.
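
As a check on the hand calculation, NumPy's np.polyfit (degree 1) performs the same least squares fit; this sketch just verifies m and c:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

# polyfit with degree 1 fits a line by least squares;
# it returns coefficients highest power first: [slope, intercept]
m, c = np.polyfit(x, y, 1)
print(m, c)  # 1.3 and 1.1, matching the hand calculation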

EXAMPLE 2 " x“ Hours of Sunshine "y" Ice Creams Sold 2 4 3 5 5 7 7 10 9 15 Sam found how many hours of sunshine  vs how many ice creams were sold at the shop from Monday to Friday :

EXAMPLE 2 (contd.)
Here are the (x, y) points and the line y = 1.518x + 0.305 on a graph:

x    y     y = 1.518x + 0.305    error
2    4     3.34                  −0.66
3    5     4.86                  −0.14
5    7     7.89                  0.89
7    10    10.93                 0.93
9    15    13.97                 −1.03

Nice fit!

EXAMPLE 2 (contd.)
Sam hears the weather forecast, which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 ice creams.
Sam makes fresh waffle cone mixture for 13 ice creams, just in case. Yum.
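
The same fit and forecast can be sketched with np.polyfit; the variable names are ours, and the printed values match the slide's line up to rounding:

import numpy as np

sunshine = np.array([2, 3, 5, 7, 9])       # hours of sun, Mon-Fri
ice_creams = np.array([4, 5, 7, 10, 15])   # ice creams sold

# least squares line: slope and intercept
m, c = np.polyfit(sunshine, ice_creams, 1)
print(m, c)  # about 1.518 and 0.305

# forecast for 8 hours of sun tomorrow
print(m * 8 + c)  # about 12.45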

USING SUMS OF CROSS AND SQUARED DEVIATIONS
Equivalently, the coefficients can be written as b₁ = SS_xy / SS_xx and b₀ = ȳ − b₁x̄,
where the sum of cross-deviations of y and x is
SS_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − n x̄ ȳ
and the sum of squared deviations of x is
SS_xx = Σ(xᵢ − x̄)² = Σxᵢ² − n x̄²

PYTHON CODE

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x

    return (b_0, b_1)

PYTHON CODE

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1] * x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # show plot
    plt.show()

PYTHON CODE

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

OUTPUT:
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697

Implementation of Linear Regression using sklearn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# creating a dummy dataset
np.random.seed(10)
x = np.random.rand(50, 1)
y = 3 + 3 * x + np.random.rand(50, 1)

# scatterplot
plt.scatter(x, y, s=10)
plt.xlabel('x_dummy')
plt.ylabel('y_dummy')
plt.show()

Implementation of Linear Regression using sklearn

# creating a model
from sklearn.linear_model import LinearRegression

# creating an object
regressor = LinearRegression()

# training the model
regressor.fit(x, y)

# using the training dataset for prediction
pred = regressor.predict(x)

Implementation of Linear Regression using sklearn

# model performance
from sklearn.metrics import r2_score, mean_squared_error
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)

# best fit line
plt.scatter(x, y)
plt.plot(x, pred, color='Black', marker='o')

# results
print("Mean Squared Error :", mse)
print("R-Squared :", r2)
print("Y-intercept :", regressor.intercept_)
print("Slope :", regressor.coef_)

OUTPUT:
R-Squared : 0.9068822972556425
Y-intercept : [3.41354381]
Slope : [[3.11024701]]

POINTS TO REMEMBER
 Least squares is sensitive to outliers: a strange value will pull the line towards it (see the sketch below).
 Least squares also works for non-linear data, but the formulas (and the steps taken) will be very different!
 The difference between the actual value of y and the predicted value of y is called the residual.
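
A minimal sketch of the outlier sensitivity, on made-up data: corrupting one point and refitting shows how strongly the line is pulled.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)  # perfect line y = 2x

# fit on clean data, then corrupt one observation and refit
m_clean, c_clean = np.polyfit(x, y, 1)
y_outlier = y.copy()
y_outlier[4] = 30.0  # a strange value
m_out, c_out = np.polyfit(x, y_outlier, 1)

print(m_clean, c_clean)  # about 2.0 and 0.0: the uncorrupted line
print(m_out, c_out)      # 6.0 and -8.0: slope tripled by one outlier

# residuals: actual y minus predicted y
residuals = y_outlier - (m_out * x + c_out)
print(residuals)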

Correlation vs. Regression
Linear regression finds the best line that predicts y from x; correlation does not fit a line. Correlation is used when we measure both variables, while linear regression is mostly applied when x is a variable that is manipulated.

Utility of Regression
 Used in economic and business research
 Estimation of relationships
 Prediction

Types
There are several linear regression analyses:
 Simple linear regression: one dependent variable, one independent variable.
 Multiple linear regression: one dependent variable, two or more independent variables.

Multiple Regression Model
The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is:
y = b0 + b1x1 + b2x2 + . . . + bpxp + e
where b0, b1, b2, . . . , bp are the parameters, and e is a random variable called the error term.

Assumptions about the Error Term
 The error ε is a random variable with a mean of zero.
 The variance of ε, denoted by σ², is the same for all values of the independent variables.
 The values of ε are independent.
 The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.

Multiple Regression Equation
The equation that describes how the mean value of y is related to x1, x2, . . . , xp is:
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

Estimated Multiple Regression Equation
A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp:
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

Estimation Process
Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
The unknown parameters are β0, β1, β2, . . . , βp.
Sample data (x1, x2, . . . , xp, y) are used to compute the Estimated Multiple Regression Equation, whose sample statistics b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp.

Least Squares Method
Least Squares Criterion: choose the coefficients to minimize Σ(yi − ŷi)², the sum of squared differences between the observed and predicted values of y.
Computation of Coefficient Values: the formulas for the regression coefficients b0, b1, b2, . . . , bp involve the use of matrix algebra. Computer software packages are available to perform the calculations.
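
As a sketch of what such software does under the hood, NumPy's np.linalg.lstsq solves the least squares problem in matrix form; the tiny two-predictor dataset below is illustrative, constructed so the true coefficients are known:

import numpy as np

# illustrative data: two predictors, five observations,
# built so that exactly y = 1 + 2*x1 + 1*x2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])

# prepend a column of ones so the first coefficient is the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# least squares solution of X1 @ b = y
b, residuals, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
print(b)  # [1. 2. 1.], i.e. [b0, b1, b2]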

Example: Programmer Salary Survey (Multiple Regression Model)
A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine whether salary was related to years of experience and the score on the firm's programmer aptitude test.

Multiple Regression Model

Exper.  Score  Salary      Exper.  Score  Salary
4       78     24.0        9       88     38.0
7       100    43.0        2       73     26.6
1       86     23.7        10      75     36.2
5       82     34.3        5       81     31.6
8       86     35.8        6       74     29.0
10      84     38.0        8       87     34.0
0       75     22.2        4       79     30.1
1       80     23.1        6       94     33.9
6       83     30.0        3       70     28.2
6       91     33.0        3       89     30.0

Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:
y = β0 + β1x1 + β2x2 + ε
where:
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test

Solving for the Estimates of b0, b1, b2
The input data (the x1, x2, y values: 4, 78, 24; 7, 100, 43; . . . ; 3, 89, 30) are fed into a computer package for solving multiple regression problems. The least squares output gives b0, b1, b2, R², etc.

Estimated Regression Equation SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE) Note: Predicted salary will be in thousands of dollars.
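
Using the estimated equation is then plain arithmetic. This sketch (coefficients from the slide; the example programmer is hypothetical) predicts a salary:

# coefficients from the estimated regression equation on the slide
b0, b1, b2 = 3.174, 1.404, 0.251

def predicted_salary(exper, score):
    # returns predicted annual salary in thousands of dollars
    return b0 + b1 * exper + b2 * score

# hypothetical programmer: 4 years of experience, aptitude score 78
print(predicted_salary(4, 78))  # about 28.37, i.e. roughly $28,400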

Interpreting the Coefficients
In multiple regression analysis, we interpret each regression coefficient as follows: bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.

Interpreting the Coefficients
b1 = 1.404: salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).

Interpreting the Coefficients
b2 = 0.251: salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the years of experience are held constant).

Standard Error of Estimate
s = √(SSE / (n − p − 1)), where n is the number of observations and p is the number of independent variables.

Interpret R-squared in Regression Analysis
 Determines how well the model fits the data.
 A goodness-of-fit measure for linear regression models.
 Measures the strength of the relationship between our model and the dependent variable on a convenient 0–100% scale.

Interpret R-squared in Regression Analysis (contd.)
We need to calculate two things:
var(avg) = ∑(yi − ȳ)²
var(model) = ∑(yi − ŷi)²
R² = 1 − [var(model) / var(avg)] = 1 − [∑(yi − ŷi)² / ∑(yi − ȳ)²]

Limitations of R-squared
 R-squared cannot be used to check whether the coefficient estimates and predictions are biased.
 R-squared does not indicate whether the regression model has an adequate fit.

Multiple Coefficient of Determination
R² = SSR / SST
Relationship among SST, SSR, and SSE:
SST = SSR + SSE
∑(yi − ȳ)² = ∑(ŷi − ȳ)² + ∑(yi − ŷi)²
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
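
A quick numerical sketch, reusing the Example 1 data from earlier, verifying the decomposition and the two equivalent forms of R²:

import numpy as np

# Example 1 data and its least squares line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])
m, c = np.polyfit(x, y, 1)
y_hat = m * x + c

sst = np.sum((y - y.mean())**2)      # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)  # sum of squares due to regression
sse = np.sum((y - y_hat)**2)         # sum of squares due to error

print(sst, ssr + sse)            # both 26.0: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)  # both 0.65: two forms of R²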

Testing for Significance: F Test
The F test is referred to as the test for overall significance. It is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.

Testing for Significance: t Test
If the F test shows overall significance, the t test is used to determine whether each of the individual independent variables is significant. A separate t test is conducted for each of the independent variables in the model; we refer to each of these t tests as a test for individual significance.

Testing for Significance
In simple linear regression, the F and t tests provide the same conclusion. In multiple regression, the F and t tests have different purposes.

Testing for Significance: Multicollinearity The term multicollinearity refers to the correlation among the independent variables. When the independent variables are highly correlated (say, | r | > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.

Testing for Significance: Multicollinearity Every attempt should be made to avoid including independent variables that are highly correlated. If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.
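
A minimal sketch of a multicollinearity screen using the |r| > .7 rule of thumb; the three predictor vectors are made up, with x2 deliberately constructed as almost a multiple of x1:

import numpy as np

# illustrative predictors: x2 is nearly a rescaled copy of x1
x1 = np.array([4.0, 7.0, 1.0, 5.0, 8.0, 10.0])
x2 = 2.0 * x1 + np.array([0.1, -0.2, 0.3, 0.0, -0.1, 0.2])
x3 = np.array([3.0, 9.0, 6.0, 1.0, 7.0, 2.0])

# pairwise correlations among the independent variables
r12 = np.corrcoef(x1, x2)[0, 1]
r13 = np.corrcoef(x1, x3)[0, 1]
print(abs(r12) > 0.7)  # True: x1 and x2 are highly correlated
print(abs(r13) > 0.7)  # False here: x3 carries separate information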

Using the Estimated Regression Equation for Estimation and Prediction
The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression. We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the resulting value of ŷ as the point estimate.

Qualitative Independent Variables
In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc. For example, x2 might represent gender, where x2 = 0 indicates male and x2 = 1 indicates female. In this case, x2 is called a dummy or indicator variable.

Qualitative Independent Variables
Example: Programmer Salary Survey. As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems. The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000) for each of the 20 sampled programmers are shown on the next slide.

Qualitative Independent Variables

Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
4       78     No     24.0        9       88     Yes    38.0
7       100    Yes    43.0        2       73     No     26.6
1       86     No     23.7        10      75     Yes    36.2
5       82     Yes    34.3        5       81     No     31.6
8       86     Yes    35.8        6       74     No     29.0
10      84     Yes    38.0        8       87     Yes    34.0
0       75     No     22.2        4       79     No     30.1
1       80     No     23.1        6       94     Yes    33.9
6       83     No     30.0        3       70     No     28.2
6       91     Yes    33.0        3       89     No     30.0

Estimated Regression Equation
ŷ = b0 + b1x1 + b2x2 + b3x3
where:
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if the individual does not have a graduate degree, 1 if the individual does
x3 is a dummy variable.
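
A small sketch of fitting with a dummy variable, using the first six rows of the table above (the encoding step is the point; sklearn's LinearRegression is as used earlier):

import numpy as np
from sklearn.linear_model import LinearRegression

# first six programmers: [experience, aptitude score], degree, salary
exper_score = np.array([[4, 78], [7, 100], [1, 86],
                        [5, 82], [8, 86], [10, 84]], dtype=float)
degree = np.array(["No", "Yes", "No", "Yes", "Yes", "Yes"])
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0])

# encode the qualitative variable as a 0/1 dummy: x3 = 1 if graduate degree
x3 = (degree == "Yes").astype(float).reshape(-1, 1)
X = np.hstack([exper_score, x3])

model = LinearRegression().fit(X, salary)
print(model.intercept_, model.coef_)  # b0 and [b1, b2, b3]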

Regression to the Mean

The Simple Explanation...
When you select a group from the extreme end of a distribution, the group will do better on a subsequent measure: the group mean on the first measure appears to "regress toward the mean" of the second measure.
[Figure: two distributions, marking the selected group's mean, the overall mean, where the group's mean would have been with no regression, and the regression to the mean.]

Example I
If the first measure is a pretest and you select the low scorers, and the second measure is a posttest, regression to the mean will make it appear as though the group gained from pre to post.
[Figure: pretest and posttest distributions, with the apparent gain labeled as a pseudo-effect.]

Example II
If the first measure is a pretest and you select the high scorers, and the second measure is a posttest, regression to the mean will make it appear as though the group lost from pre to post.
[Figure: pretest and posttest distributions, with the apparent loss labeled as a pseudo-effect.]

Some Facts
 This is purely a statistical phenomenon.
 This is a group phenomenon.
 Some individuals will move opposite to this group trend.

Why Does It Happen?
 Regression artifacts occur whenever you sample asymmetrically from a distribution.
 Regression artifacts occur with any two variables (not just pretest and posttest), and even backwards in time!

What Does It Depend On?
The absolute amount of regression to the mean depends on two factors:
 the degree of asymmetry (i.e., how far the selected group's mean is from the overall mean of the first measure)
 the correlation between the two measures

A Simple Formula
The percent of regression to the mean is
Prm = 100(1 − r)
where r is the correlation between the two measures.

For Example:
Prm = 100(1 − r)
 If r = 1, there is no (i.e., 0%) regression to the mean.
 If r = 0, there is 100% regression to the mean.
 If r = .2, there is 80% regression to the mean.
 If r = .5, there is 50% regression to the mean.
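
A simulation sketch of this formula under an assumed setup: two standardized measures correlated at r = .5, selecting the bottom 10% on the first:

import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.5

# two measures with correlation r (both mean 0, variance 1)
pre = rng.standard_normal(n)
post = r * pre + np.sqrt(1 - r**2) * rng.standard_normal(n)

# select the extreme low end on the first measure
low = pre < np.quantile(pre, 0.1)
pre_mean = pre[low].mean()
post_mean = post[low].mean()

# the selected group's mean moves (1 - r) of the way back to 0
print(pre_mean, post_mean)               # post_mean is about r * pre_mean
print(100 * (1 - post_mean / pre_mean))  # close to Prm = 100(1 - .5) = 50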

Example
Assume a standardized test with a mean of 50. You give your program to the lowest scorers, and their pretest mean is 30. Assume that the pre-post correlation is .5. The formula gives
Prm = 100(1 − r) = 100(1 − .5) = 50%
Therefore the mean will regress up 50% of the way (from 30 toward 50), leaving a posttest mean of 40 and a 10-point pseudo-gain.
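
The same arithmetic as a two-line check, using the standard expected-posttest form (overall mean plus r times the group's deviation):

overall_mean, group_mean, r = 50.0, 30.0, 0.5

# expected posttest mean: only r of the group's deviation remains
post_mean = overall_mean + r * (group_mean - overall_mean)
print(post_mean)               # 40.0
print(post_mean - group_mean)  # 10.0: the pseudo-gain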

ANY QUERIES???

THANK YOU