Regression


About This Presentation

Describing Data using Regression - Machine Learning. It also explains the concept of regression to the mean.


Slide Content

Describing Data: REGRESSION

Data is present everywhere and has become a part of our lives.

Data itself has no meaning unless it is contextually processed into information, from which knowledge can be derived.
 Data: raw and unprocessed, obtained from end devices.
 Information: data that has been filtered, processed, categorized, and condensed.
 Knowledge: information organized and structured to achieve a specific objective.

Correlation
Correlation quantifies the degree to which two variables are associated. It does not fit a line through the data points; it shows how much one variable tends to change as the other changes. When it is zero, there is no linear relationship. When it is positive, one variable goes up as the other goes up. When it is negative, one variable goes up as the other goes down. The Pearson correlation measures the degree to which a set of data points forms a straight-line relationship.
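
As a quick illustration (a minimal sketch; the paired data here is made up for the example), NumPy's np.corrcoef computes the Pearson correlation directly:

import numpy as np

# illustrative paired observations (not from the slides)
x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(r)  # about 0.98: close to +1, y rises almost linearly with x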

Introduction to Linear Regression
Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data. Two terms are essential to understanding regression analysis:
 Dependent variable - the factor that we want to understand or predict.
 Independent variables - the factors that influence the dependent variable.

Introduction to Linear Regression (cont.)
Any straight line can be represented by an equation of the form Y = mX + c, where m and c are constants. The value of m is called the slope and determines the direction and degree to which the line is tilted. The value of c is called the Y-intercept and determines the point where the line crosses the Y-axis.

Regression

LEAST SQUARES REGRESSION
We can place the line "by eye": try to have the line as close as possible to all points, with a similar number of points above and below the line. For better accuracy, let's see how to calculate the line using Least Squares Regression. The least squares method obtains the line of best fit for a given data set by minimizing the sum of the squares of the offsets (residuals) of the points from the line.

LEAST SQUARES METHOD

STEPS TO CALCULATE LINE OF BEST FIT

STEP 1: m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)

STEP 2: c = (Σy − m Σx) / N

EXAMPLE 1
Given the points (x, y): (1, 2), (2, 5), (3, 3), (4, 8), (5, 7).

x    y    xy    x²
1    2    2     1
2    5    10    4
3    3    9     9
4    8    32    16
5    7    35    25

∑x = 15, ∑y = 25, ∑xy = 88, ∑x² = 55

EXAMPLE 1 (contd.)
Find the value of m by using the formula m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²):
m = [(5 × 88) − (15 × 25)] / [(5 × 55) − (15)²]
m = (440 − 375) / (275 − 225)
m = 65/50 = 1.3

EXAMPLE 1 (contd.)
Find the value of c by using the formula c = (∑y − m∑x) / n:
c = (25 − 1.3 × 15) / 5
c = (25 − 19.5) / 5
c = 5.5/5 = 1.1
So the required least squares line is y = mx + c = 1.3x + 1.1.
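
As a check on the hand calculation, NumPy's np.polyfit (degree 1) performs the same least squares fit; this sketch just verifies m and c:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

# polyfit with degree 1 fits a line by least squares;
# it returns coefficients highest power first: [slope, intercept]
m, c = np.polyfit(x, y, 1)
print(m, c)  # 1.3 and 1.1, matching the hand calculation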

EXAMPLE 2 " x“ Hours of Sunshine "y" Ice Creams Sold 2 4 3 5 5 7 7 10 9 15 Sam found how many hours of sunshine  vs how many ice creams were sold at the shop from Monday to Friday :

EXAMPLE 2 (contd.)
Here are the (x, y) points and the line y = 1.518x + 0.305 on a graph:

x    y     y = 1.518x + 0.305    error
2    4     3.34                  −0.66
3    5     4.86                  −0.14
5    7     7.89                  0.89
7    10    10.93                 0.93
9    15    13.97                 −1.03

Nice fit!

EXAMPLE 2 (contd.)
Sam hears the weather forecast, which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 ice creams.
Sam makes fresh waffle cone mixture for 13 ice creams, just in case. Yum.
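
The same fit and forecast can be sketched with np.polyfit; the variable names are ours, and the printed values match the slide's line up to rounding:

import numpy as np

sunshine = np.array([2, 3, 5, 7, 9])       # hours of sun, Mon-Fri
ice_creams = np.array([4, 5, 7, 10, 15])   # ice creams sold

# least squares line: slope and intercept
m, c = np.polyfit(sunshine, ice_creams, 1)
print(m, c)  # about 1.518 and 0.305

# forecast for 8 hours of sun tomorrow
print(m * 8 + c)  # about 12.45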

USING SUMS OF CROSS AND SQUARED DEVIATIONS
Equivalently, the coefficients can be written as b₁ = SS_xy / SS_xx and b₀ = ȳ − b₁x̄,
where the sum of cross-deviations of y and x is
SS_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − n x̄ ȳ
and the sum of squared deviations of x is
SS_xx = Σ(xᵢ − x̄)² = Σxᵢ² − n x̄²

PYTHON CODE

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x

    return (b_0, b_1)

PYTHON CODE

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1] * x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # show plot
    plt.show()

PYTHON CODE

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

OUTPUT:
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697

Implementation of Linear Regression using sklearn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# creating a dummy dataset
np.random.seed(10)
x = np.random.rand(50, 1)
y = 3 + 3 * x + np.random.rand(50, 1)

# scatterplot
plt.scatter(x, y, s=10)
plt.xlabel('x_dummy')
plt.ylabel('y_dummy')
plt.show()

Implementation of Linear Regression using sklearn

# creating a model
from sklearn.linear_model import LinearRegression

# creating an object
regressor = LinearRegression()

# training the model
regressor.fit(x, y)

# using the training dataset for prediction
pred = regressor.predict(x)

Implementation of Linear Regression using sklearn

# model performance
from sklearn.metrics import r2_score, mean_squared_error
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)

# best fit line
plt.scatter(x, y)
plt.plot(x, pred, color='Black', marker='o')

# results
print("Mean Squared Error :", mse)
print("R-Squared :", r2)
print("Y-intercept :", regressor.intercept_)
print("Slope :", regressor.coef_)

OUTPUT:
R-Squared : 0.9068822972556425
Y-intercept : [3.41354381]
Slope : [[3.11024701]]

POINTS TO REMEMBER
 Least squares is sensitive to outliers: a strange value will pull the line towards it (see the sketch below).
 Least squares also works for non-linear data, but the formulas (and the steps taken) will be very different!
 The difference between the actual value of y and the predicted value of y is called the residual.
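
A minimal sketch of the outlier sensitivity, on made-up data: corrupting one point and refitting shows how strongly the line is pulled.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)  # perfect line y = 2x

# fit on clean data, then corrupt one observation and refit
m_clean, c_clean = np.polyfit(x, y, 1)
y_outlier = y.copy()
y_outlier[4] = 30.0  # a strange value
m_out, c_out = np.polyfit(x, y_outlier, 1)

print(m_clean, c_clean)  # about 2.0 and 0.0: the uncorrupted line
print(m_out, c_out)      # 6.0 and -8.0: slope tripled by one outlier

# residuals: actual y minus predicted y
residuals = y_outlier - (m_out * x + c_out)
print(residuals)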

Correlation vs. Regression
Linear regression finds the best line that predicts y from x; correlation does not fit a line. Correlation is used when we measure both variables, while linear regression is mostly applied when x is a variable that is manipulated.

Utility of Regression
 Used in economic and business research
 Estimation of relationships
 Prediction

Types
There are several linear regression analyses:
 Simple linear regression: one dependent variable, one independent variable.
 Multiple linear regression: one dependent variable, two or more independent variables.

Multiple Regression Model
The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is:
y = b0 + b1x1 + b2x2 + . . . + bpxp + e
where b0, b1, b2, . . . , bp are the parameters, and e is a random variable called the error term.

Assumptions about the Error Term
 The error ε is a random variable with a mean of zero.
 The variance of ε, denoted by σ², is the same for all values of the independent variables.
 The values of ε are independent.
 The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.

Multiple Regression Equation
The equation that describes how the mean value of y is related to x1, x2, . . . , xp is:
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

Estimated Multiple Regression Equation
A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp:
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

Estimation Process
Multiple Regression Model: y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
The unknown parameters are β0, β1, β2, . . . , βp.
Sample data (x1, x2, . . . , xp, y) are used to compute the Estimated Multiple Regression Equation, whose sample statistics b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp.

Least Squares Method
Least Squares Criterion: choose the coefficients to minimize Σ(yi − ŷi)², the sum of squared differences between the observed and predicted values of y.
Computation of Coefficient Values: the formulas for the regression coefficients b0, b1, b2, . . . , bp involve the use of matrix algebra. Computer software packages are available to perform the calculations.
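
As a sketch of what such software does under the hood, NumPy's np.linalg.lstsq solves the least squares problem in matrix form; the tiny two-predictor dataset below is illustrative, constructed so the true coefficients are known:

import numpy as np

# illustrative data: two predictors, five observations,
# built so that exactly y = 1 + 2*x1 + 1*x2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])

# prepend a column of ones so the first coefficient is the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# least squares solution of X1 @ b = y
b, residuals, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
print(b)  # [1. 2. 1.], i.e. [b0, b1, b2]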

Example: Programmer Salary Survey (Multiple Regression Model)
A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine whether salary was related to years of experience and the score on the firm's programmer aptitude test.

Multiple Regression Model

Exper.  Score  Salary      Exper.  Score  Salary
4       78     24.0        9       88     38.0
7       100    43.0        2       73     26.6
1       86     23.7        10      75     36.2
5       82     34.3        5       81     31.6
8       86     35.8        6       74     29.0
10      84     38.0        8       87     34.0
0       75     22.2        4       79     30.1
1       80     23.1        6       94     33.9
6       83     30.0        3       70     28.2
6       91     33.0        3       89     30.0

Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:
y = β0 + β1x1 + β2x2 + ε
where:
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test

Solving for the Estimates of b0, b1, b2
The input data (the x1, x2, y values: 4, 78, 24; 7, 100, 43; . . . ; 3, 89, 30) are fed into a computer package for solving multiple regression problems. The least squares output gives b0, b1, b2, R², etc.

Estimated Regression Equation SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE) Note: Predicted salary will be in thousands of dollars.
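
Using the estimated equation is then plain arithmetic. This sketch (coefficients from the slide; the example programmer is hypothetical) predicts a salary:

# coefficients from the estimated regression equation on the slide
b0, b1, b2 = 3.174, 1.404, 0.251

def predicted_salary(exper, score):
    # returns predicted annual salary in thousands of dollars
    return b0 + b1 * exper + b2 * score

# hypothetical programmer: 4 years of experience, aptitude score 78
print(predicted_salary(4, 78))  # about 28.37, i.e. roughly $28,400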

Interpreting the Coefficients
In multiple regression analysis, we interpret each regression coefficient as follows: bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.

Interpreting the Coefficients
b1 = 1.404: salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).

Interpreting the Coefficients
b2 = 0.251: salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the years of experience are held constant).

Standard Error of Estimate
s = √(SSE / (n − p − 1)), where n is the number of observations and p is the number of independent variables.

Interpret R-squared in Regression Analysis
 Determines how well the model fits the data.
 A goodness-of-fit measure for linear regression models.
 Measures the strength of the relationship between our model and the dependent variable on a convenient 0–100% scale.

Interpret R-squared in Regression Analysis (contd.)
We need to calculate two things:
var(avg) = ∑(yi − ȳ)²
var(model) = ∑(yi − ŷi)²
R² = 1 − [var(model) / var(avg)] = 1 − [∑(yi − ŷi)² / ∑(yi − ȳ)²]

Limitations of R-squared
 R-squared cannot be used to check whether the coefficient estimates and predictions are biased.
 R-squared does not indicate whether the regression model has an adequate fit.

Multiple Coefficient of Determination
R² = SSR / SST
Relationship among SST, SSR, and SSE:
SST = SSR + SSE
∑(yi − ȳ)² = ∑(ŷi − ȳ)² + ∑(yi − ŷi)²
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
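
A quick numerical sketch, reusing the Example 1 data from earlier, verifying the decomposition and the two equivalent forms of R²:

import numpy as np

# Example 1 data and its least squares line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])
m, c = np.polyfit(x, y, 1)
y_hat = m * x + c

sst = np.sum((y - y.mean())**2)      # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)  # sum of squares due to regression
sse = np.sum((y - y_hat)**2)         # sum of squares due to error

print(sst, ssr + sse)            # both 26.0: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)  # both 0.65: two forms of R²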

Testing for Significance: F Test
The F test is referred to as the test for overall significance. It is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.

Testing for Significance: t Test
If the F test shows overall significance, the t test is used to determine whether each of the individual independent variables is significant. A separate t test is conducted for each of the independent variables in the model; we refer to each of these t tests as a test for individual significance.

Testing for Significance
In simple linear regression, the F and t tests provide the same conclusion. In multiple regression, the F and t tests have different purposes.

Testing for Significance: Multicollinearity The term multicollinearity refers to the correlation among the independent variables. When the independent variables are highly correlated (say, | r | > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.

Testing for Significance: Multicollinearity Every attempt should be made to avoid including independent variables that are highly correlated. If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.
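
A minimal sketch of a multicollinearity screen using the |r| > .7 rule of thumb; the three predictor vectors are made up, with x2 deliberately constructed as almost a multiple of x1:

import numpy as np

# illustrative predictors: x2 is nearly a rescaled copy of x1
x1 = np.array([4.0, 7.0, 1.0, 5.0, 8.0, 10.0])
x2 = 2.0 * x1 + np.array([0.1, -0.2, 0.3, 0.0, -0.1, 0.2])
x3 = np.array([3.0, 9.0, 6.0, 1.0, 7.0, 2.0])

# pairwise correlations among the independent variables
r12 = np.corrcoef(x1, x2)[0, 1]
r13 = np.corrcoef(x1, x3)[0, 1]
print(abs(r12) > 0.7)  # True: x1 and x2 are highly correlated
print(abs(r13) > 0.7)  # False here: x3 carries separate information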

Using the Estimated Regression Equation for Estimation and Prediction
The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression. We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the resulting value of ŷ as the point estimate.

Qualitative Independent Variables
In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc. For example, x2 might represent gender, where x2 = 0 indicates male and x2 = 1 indicates female. In this case, x2 is called a dummy or indicator variable.

Qualitative Independent Variables
Example: Programmer Salary Survey. As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems. The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000) for each of the 20 sampled programmers are shown on the next slide.

Qualitative Independent Variables

Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
4       78     No     24.0        9       88     Yes    38.0
7       100    Yes    43.0        2       73     No     26.6
1       86     No     23.7        10      75     Yes    36.2
5       82     Yes    34.3        5       81     No     31.6
8       86     Yes    35.8        6       74     No     29.0
10      84     Yes    38.0        8       87     Yes    34.0
0       75     No     22.2        4       79     No     30.1
1       80     No     23.1        6       94     Yes    33.9
6       83     No     30.0        3       70     No     28.2
6       91     Yes    33.0        3       89     No     30.0

Estimated Regression Equation
ŷ = b0 + b1x1 + b2x2 + b3x3
where:
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if the individual does not have a graduate degree, 1 if the individual does
x3 is a dummy variable.
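
A small sketch of fitting with a dummy variable, using the first six rows of the table above (the encoding step is the point; sklearn's LinearRegression is as used earlier):

import numpy as np
from sklearn.linear_model import LinearRegression

# first six programmers: [experience, aptitude score], degree, salary
exper_score = np.array([[4, 78], [7, 100], [1, 86],
                        [5, 82], [8, 86], [10, 84]], dtype=float)
degree = np.array(["No", "Yes", "No", "Yes", "Yes", "Yes"])
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0])

# encode the qualitative variable as a 0/1 dummy: x3 = 1 if graduate degree
x3 = (degree == "Yes").astype(float).reshape(-1, 1)
X = np.hstack([exper_score, x3])

model = LinearRegression().fit(X, salary)
print(model.intercept_, model.coef_)  # b0 and [b1, b2, b3]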

Regression to the Mean

The Simple Explanation...
When you select a group from the extreme end of a distribution, the group will do better on a subsequent measure: the group mean on the first measure appears to "regress toward the mean" of the second measure.
[Figure: two distributions, marking the selected group's mean, the overall mean, where the group's mean would have been with no regression, and the regression to the mean.]

Example I
If the first measure is a pretest and you select the low scorers, and the second measure is a posttest, regression to the mean will make it appear as though the group gained from pre to post.
[Figure: pretest and posttest distributions, with the apparent gain labeled as a pseudo-effect.]

Example II
If the first measure is a pretest and you select the high scorers, and the second measure is a posttest, regression to the mean will make it appear as though the group lost from pre to post.
[Figure: pretest and posttest distributions, with the apparent loss labeled as a pseudo-effect.]

Some Facts
 This is purely a statistical phenomenon.
 This is a group phenomenon.
 Some individuals will move opposite to this group trend.

Why Does It Happen?
 Regression artifacts occur whenever you sample asymmetrically from a distribution.
 Regression artifacts occur with any two variables (not just pretest and posttest), and even backwards in time!

What Does It Depend On?
The absolute amount of regression to the mean depends on two factors:
 the degree of asymmetry (i.e., how far the selected group's mean is from the overall mean of the first measure)
 the correlation between the two measures

A Simple Formula
The percent of regression to the mean is
Prm = 100(1 − r)
where r is the correlation between the two measures.

For Example:
Prm = 100(1 − r)
 If r = 1, there is no (i.e., 0%) regression to the mean.
 If r = 0, there is 100% regression to the mean.
 If r = .2, there is 80% regression to the mean.
 If r = .5, there is 50% regression to the mean.
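
A simulation sketch of this formula under an assumed setup: two standardized measures correlated at r = .5, selecting the bottom 10% on the first:

import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.5

# two measures with correlation r (both mean 0, variance 1)
pre = rng.standard_normal(n)
post = r * pre + np.sqrt(1 - r**2) * rng.standard_normal(n)

# select the extreme low end on the first measure
low = pre < np.quantile(pre, 0.1)
pre_mean = pre[low].mean()
post_mean = post[low].mean()

# the selected group's mean moves (1 - r) of the way back to 0
print(pre_mean, post_mean)               # post_mean is about r * pre_mean
print(100 * (1 - post_mean / pre_mean))  # close to Prm = 100(1 - .5) = 50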

Example
Assume a standardized test with a mean of 50. You give your program to the lowest scorers, and their pretest mean is 30. Assume that the pre-post correlation is .5. The formula gives
Prm = 100(1 − r) = 100(1 − .5) = 50%
Therefore the mean will regress up 50% of the way (from 30 toward 50), leaving a posttest mean of 40 and a 10-point pseudo-gain.
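
The same arithmetic as a two-line check, using the standard expected-posttest form (overall mean plus r times the group's deviation):

overall_mean, group_mean, r = 50.0, 30.0, 0.5

# expected posttest mean: only r of the group's deviation remains
post_mean = overall_mean + r * (group_mean - overall_mean)
print(post_mean)               # 40.0
print(post_mean - group_mean)  # 10.0: the pseudo-gain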

ANY QUERIES???

THANK YOU