Objectives
1. Describe the Linear Regression Model
2. State the Regression Modeling Steps
3. Explain Ordinary Least Squares
4. Compute Regression Coefficients
5. Understand and Check Model Assumptions
6. Compute the Residual Sum of Squares (RSS) and R² (R-squared)
7. Predict the Response Variable
Simple Linear Regression
The most elementary type of regression model is simple linear regression, which explains the relationship between a dependent variable and one independent variable using a straight line. The straight line is fitted on the scatter plot of these two variables.
Simple Linear Regression
In simple linear regression, one variable is the independent (predictor) variable X, and the other is the dependent (outcome) variable Y, which is continuous in nature.
Scatter Plot
Regression Model
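For reference, the simple linear regression model is usually written as (a standard form, not taken from the slide image):

Y = β0 + β1 X + ε

where β0 is the intercept, β1 is the slope, and ε is the random error term.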
Intercept and Slope
Since X is given and we need to predict Y, we require two other parameters: the slope and the intercept. The intercept is the value of Y when X is zero. A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
Simple Linear Regression
Regression Line
Intercept of a Straight Line
What is the intercept of the given line? Use the graph given above to answer this question.
A) 0
B) 3
C) 4
D) 1/2
Feedback: The value of y when x = 0 in the given straight line is 3. So, 3 would be the intercept in this case.
Slope of a Straight Line
What is the slope of the given line? Use the graph given above to answer this question.
A) 1/2
B) 1/3
C) 1
D) 2
Feedback: The slope of any straight line can be calculated as (y₂ - y₁)/(x₂ - x₁), where (x₁, y₁) and (x₂, y₂) are any two points through which the given line passes. This line passes through (0, 3) and (2, 4), so its slope is (4 - 3)/(2 - 0) = ½.
Equation of a Straight Line
What would be the equation of the given line?
A) Y = X/2 + 3
B) Y = 2X + 3
C) Y = X/3 + ½
D) Y = 3X + ½
Feedback: The standard equation of a straight line is y = mx + c, where m is the slope and c is the intercept. In this case, m = ½ and c = 3, so the equation would be Y = X/2 + 3.
Strength of the Linear Regression Model
The strength of the linear regression model can be assessed using two metrics:
1. R² or the Coefficient of Determination
2. Residual Standard Error (RSE)
Least Squares Regression Line
The coefficients of the least squares regression line are determined by the Ordinary Least Squares method, which basically means minimising the sum of the squares of the:
A) x-coordinates
B) y-coordinates of actual data
C) y-coordinates of predicted data
D) y-coordinates of actual data - y-coordinates of predicted data
Feedback: The Ordinary Least Squares method uses the criterion of minimising the sum of squares of residuals. Residuals are defined as the difference between the y-coordinates of actual data and the y-coordinates of predicted data.
Best Fit Line
The best-fit line is found by minimising the RSS (Residual Sum of Squares), which is equal to the sum of the squared residuals over all data points in the plot. The residual for any data point is found by subtracting the predicted value of the dependent variable from its actual value:
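A minimal Python sketch of this computation, using hypothetical data and an assumed fitted line (the data and coefficients below are illustrative, not from the deck):

import numpy as np

# Toy data (illustrative only)
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 5, 4, 8])

# Assume a fitted line y_pred = b0 + b1 * x (coefficients are illustrative)
b0, b1 = 2.3, 0.9
y_pred = b0 + b1 * x

residuals = y - y_pred          # actual minus predicted
rss = np.sum(residuals ** 2)    # Residual Sum of Squares
print(rss)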
Residuals
Best Fit Regression Line
What is the main criterion used to determine the best-fitting regression line?
A) The line that goes through the largest number of points
B) The line that has an equal number of points above it and below it
C) The line that minimises the sum of squares of distances of points from the regression line
D) Either B or C (they are the same criterion)
Feedback (Answer C): The criterion is given by the Ordinary Least Squares (OLS) method, which states that the sum of squares of residuals should be minimum.
R-Square Formula
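For reference, the formula this slide refers to can be written in standard notation (where ŷᵢ is the predicted value and ȳ is the mean of the actual values):

R² = 1 - RSS/TSS = 1 - Σᵢ(yᵢ - ŷᵢ)² / Σᵢ(yᵢ - ȳ)²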
RSS - Residual Sum of Squares
In the previous example of marketing spend (in lakhs) and sales amount (in crores), let's assume you get the same data in different units: marketing spend (in lakhs) and sales amount (in dollars). Do you think there will be any change in the value of RSS due to the change in units in this case (as compared to the value calculated in the Excel demonstration)?
A) Yes, the value of RSS would change because the units are changing.
B) No, the value won't change.
C) Can't say.
Feedback: The RSS for any regression line is given by the expression Σᵢ(yᵢ - ŷᵢ)². RSS is the sum of the squared differences between the actual and the predicted values, and its value will change if the units change, since it has units of y². For example, (140 rupees - 70 rupees)² = 4900, whereas (2 USD - 1 USD)² = 1. So the value of RSS is different in the two cases because of the different units.
RSS and TSS
RSS (Residual Sum of Squares): the total of the squared errors across the whole sample. It is a measure of the difference between the expected (predicted) and the actual output. A small RSS indicates a tight fit of the model to the data.
TSS (Total Sum of Squares): the sum of the squared differences of the data points from the mean of the response variable.
RSS Plot
Residual Sum of Squares (RSS)
Find the value of RSS for this regression line.
A) 0.25
B) 6.25
C) 6.5
D) -0.5
Feedback: The residuals for all 5 points are -0.5, 1, 0, -2, 1. The sum of squares of all 5 residuals would be 0.25 + 1 + 0 + 4 + 1 = 6.25.
Coefficient of Determination
R-squared is a number that explains what portion of the variation in the given data is explained by the developed model. It always takes a value between 0 and 1. In general terms, it provides a measure of how well actual outcomes are replicated by the model, based on the proportion of total variation in outcomes explained by the model (i.e. by the expected outcomes). Overall, the higher the R-squared, the better the model fits your data.
Adjusted R-Squared
Adjusted R-squared is a better metric than R-squared for assessing how well the model fits the data. Adjusted R-squared penalises R-squared for the unnecessary addition of variables. So, if an added variable does not increase the accuracy adequately, adjusted R-squared decreases even though R-squared might increase.
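For reference, the usual adjustment (a standard formula, not shown in the deck) is:

Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)

where n is the number of observations and p is the number of predictors; the penalty grows as more predictors are added.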
Total Sum of Squares (TSS)
Find the value of TSS for this regression line.
A) 11.5
B) 7.5
C) 0
D) 14
Feedback: The average y-value for all data points is (3 + 5 + 5 + 4 + 8)/5 = 25/5 = 5. So the (yᵢ - ȳ) term for each data point would be -2, 0, 0, -1, 3. The squared sum of these terms would be 4 + 0 + 0 + 1 + 9 = 14.
R²
The RSS for this example comes out to be 6.25 and the TSS comes out to be 14. What would be the R² for this regression line?
A) 1 - (14/6.25)
B) (1 - 14)/6.25
C) 1 - (6.25/14)
D) (1 - 6.25)/14
Feedback: The R² value is given by 1 - (RSS / TSS). So, in this case, the R² value would be 1 - (6.25 / 14).
4 Questions We Need to Ask Ourselves
Whenever you are about to build a model, ask yourself these 4 questions:
1. What is my objective function?
2. What are my hyper-parameters?
3. What are my parameters?
4. How can I regularize this model?
Linear Regression Model
1. What is my objective function? Find the line that minimises the RSS (equivalently, the RMSE).
2. What are my hyper-parameters? n_jobs, fit_intercept, normalize.
3. What are my parameters? Intercept (B0) and slope coefficient (B1).
4. How can I regularize this model? L1 norm, L2 norm, AIC, BIC.
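A minimal scikit-learn sketch of these pieces, assuming toy arrays X and y (illustrative only): the hyper-parameters are set when the estimator is created, and the fitted parameters are read off afterwards.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 5, 4, 8])

# Hyper-parameters: fit_intercept and n_jobs are chosen before fitting
model = LinearRegression(fit_intercept=True, n_jobs=None)
model.fit(X, y)

# Parameters learned from the data: intercept (B0) and slope coefficient (B1)
print(model.intercept_, model.coef_)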
Multiple Linear Regression
Multiple linear regression is a statistical technique for understanding the relationship between one dependent variable and several independent (explanatory) variables. The objective of multiple regression is to find a linear equation that can best determine the value of the dependent variable Y for different values of the independent variables in X.
Understanding the Regression Output
We need to understand that not all variables are used to build a model. Some independent variables are insignificant and add nothing to your understanding of the outcome/response/dependent variable.
Standard Error
The standard error measures the variability in the estimate of a coefficient. A lower value of the standard error is good, but it is somewhat relative to the value of the coefficient. For example, the standard error of the intercept is about 0.38 whereas its estimate is 2.6, so the variability of the intercept can be interpreted as 2.6 ± 0.38. Note that the standard error is absolute in nature, so it is often difficult to judge from it alone whether the model is good or not.
t-value
The t-value is the ratio of an estimated coefficient to its standard error. It measures whether or not the coefficient for this variable is meaningful for the model. It is used to calculate the p-value and the significance levels, which are used for building the final model.
p-value
The p-value is used for hypothesis testing. In regression model building, the null hypothesis corresponding to each p-value is that the corresponding independent variable does not impact the dependent variable; the alternate hypothesis is that it does. The p-value indicates the probability of observing a coefficient estimate at least this extreme if the null hypothesis were true. Therefore, a low p-value, i.e. less than 0.05, indicates that you can reject the null hypothesis.
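One way to produce this coefficient table (estimate, standard error, t-value, p-value) is with statsmodels; a minimal sketch on simulated data (the data and coefficients here are made up for illustration):

import numpy as np
import statsmodels.api as sm

# Simulated data with two predictors (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.6 + 1.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)        # adds the intercept column
results = sm.OLS(y, X_const).fit()
print(results.summary())            # coef, std err, t and P>|t| for each term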
P-Value
Assumptions
Linear regression assumptions:
1. The relationship between X and Y is linear (linearity).
2. Y is distributed normally at each value of X (normality).
3. The variance of Y at every value of X is the same (no heteroscedasticity).
4. The observations are independent (no autocorrelation).
5. Independent variables should not be correlated (no multicollinearity).
6. No outliers (outlier test).
7. No influential observations.
Residual Analysis for Linearity (plots of Y and residuals vs. x for the non-linear and linear cases)
Heteroscedasticity and Homoscedasticity (plots of Y and residuals vs. x showing non-constant vs. constant variance)
Residual Analysis for Independence (plots of residuals vs. X for the not-independent and independent cases)
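A minimal sketch of these residual checks in Python, assuming a fitted statsmodels OLS results object named results (as in the earlier sketch):

import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

resid = results.resid
fitted = results.fittedvalues

# Linearity and constant variance: residuals vs. fitted values should show no pattern
plt.scatter(fitted, resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Independence: a Durbin-Watson statistic close to 2 suggests no autocorrelation
print(durbin_watson(resid))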
Gradient Descent
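A minimal sketch of gradient descent for the simple linear regression coefficients, minimising the mean squared error on toy data (the learning rate and iteration count are arbitrary choices, not from the deck):

import numpy as np

# Toy data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 5.0, 4.0, 8.0])

b0, b1 = 0.0, 0.0   # start from zero
lr = 0.01           # learning rate (arbitrary choice)

for _ in range(10000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(b0, b1)       # approaches the OLS intercept and slope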
Multicollinearity
Multicollinearity refers to a situation where multiple predictor variables are correlated with each other. Since more than two variables can be involved, you cannot rely on the rather simplified pairwise correlation coefficient to measure collinearity (it only measures the correlation between two variables).
Multicollinearity
Since one of the major goals of linear regression is identifying the important explanatory variables, it is important to assess the impact of each and then keep those which have a significant impact on the outcome. This is the major issue with multicollinearity: it makes it difficult to assess the effect of individual predictors.
Multicollinearity
A simple way to detect multicollinearity is to look at the correlation matrix; a heat map of this matrix makes highly correlated pairs easy to spot. The statistical measure VIF (Variance Inflation Factor) is also often used to detect multicollinearity.
VIF (Variance Inflation Factor)
The Variance Inflation Factor is a useful measure of multicollinearity: it measures how strongly one independent variable is explained by all the other independent variables together.
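A minimal sketch of a VIF computation with statsmodels, assuming X_train is a DataFrame of the candidate predictors (the name X_train is hypothetical):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X_train is assumed to be a DataFrame of the candidate predictors
X_vif = sm.add_constant(X_train)
vif = pd.DataFrame({
    "feature": X_vif.columns,
    "VIF": [variance_inflation_factor(X_vif.values, i)
            for i in range(X_vif.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))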
VIF (Variance Inflation Factor)
A variable with a high VIF means it can be largely explained by other independent variables. Thus, after checking p-values, you should check and remove variables with a high VIF, since their impact on the outcome can largely be explained by other variables. But remember, variables with a high VIF (multicollinearity) may still be statistically significant (p < 0.05), in which case you will first have to check for other insignificant variables before removing the variables with a higher VIF and lower p-values.
Variable Selection (RFE)
Recursive feature elimination is based on the idea of repeatedly constructing a model and then choosing either the best or the worst performing feature, setting that feature aside and then repeating the process with the rest of the features. This process is applied until all the features in the dataset are exhausted. Features are then ranked according to the order in which they were eliminated. As such, it is a greedy optimisation for finding the best performing subset of features.
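A minimal sketch of RFE with scikit-learn, assuming X_train and y_train already exist (those names, and the choice of keeping 10 features, are illustrative):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Keep 10 features (an arbitrary choice for illustration)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe.fit(X_train, y_train)

# support_ flags the retained columns; ranking_ is 1 for kept features
print(list(zip(X_train.columns, rfe.support_, rfe.ranking_)))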
VIF - Check Our Understanding
If a variable "A" has a high VIF (>5), which of the following is true?
A) Variable "A" explains the variation in Y better than variables with a lower VIF
B) Variable "A" is highly correlated with other independent variables in the model
C) Variable A is insignificant (p>0.05)
D) Removing A from the model will increase the adjusted R-squared
Linear Regression Model Building Process
Once you have understood the business objective, you prepare the data, perform EDA, and divide the data into training and test datasets. The next step is the selection of variables for the creation of the model. Variable selection is critical because you cannot just include all the variables in the model; otherwise, you run the risk of including insignificant variables too. This is where RFE can be used to quickly shortlist some significant variables and save time. However, these significant independent variables might be related to each other. This is where you need to check for multicollinearity amongst variables using the variance inflation factor (VIF) and remove variables with high VIF and low significance (p > 0.05).
Linear Regression Model Building Process
Variables with a high VIF or multicollinearity may be statistically significant (p < 0.05), in which case you will first have to check for other insignificant variables (p > 0.05) before removing the variables with a higher VIF and lower p-values. Continue removing variables until all remaining variables are significant (p < 0.05) and have low VIFs. Finally, you arrive at a model where all variables are significant and there is no threat of multicollinearity. The final step is to check the model accuracy on the test data.
Model Building - Test
An analyst observes a positive relationship between digital marketing expenses and online sales for a firm. However, she intuitively feels that she should add an additional independent variable, one which has a high correlation with marketing expenses. If the analyst adds this independent variable to the model, which of the following could happen? More than one choice could be correct. (Find both.)
A) The model's R-squared will decrease
B) The model's adjusted R-squared could decrease
C) The beta coefficient for the predictor digital marketing expenses will remain the same
D) The relationship between marketing expenses and sales can become insignificant
Feedback: Adjusted R-squared could decrease if the variable does not add much to the model to explain online sales. The relationship between marketing expenses and sales can also become insignificant with the addition of the new variable.
Dummy Variables
Suppose you need to build a model on a data set which contains 2 categorical variables with 2 and 4 levels respectively. How many dummy variables should you create for model building?
A) 6
B) 4
C) 2
D) 8
Feedback: Since n-1 dummy variables can be used to describe a variable with n levels, you will get 1 dummy variable for the variable with two levels, and 3 dummy variables for the variable with 4 levels, i.e. 4 in total.
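A minimal pandas sketch of this counting, using a hypothetical data frame with one 2-level and one 4-level categorical column:

import pandas as pd

# Hypothetical categorical data: 'fuel' has 2 levels, 'body' has 4 levels
df = pd.DataFrame({
    "fuel": ["petrol", "diesel", "petrol", "diesel"],
    "body": ["sedan", "suv", "hatchback", "wagon"],
})

# drop_first=True keeps n-1 dummies per variable: 1 + 3 = 4 dummy columns here
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.columns.tolist())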
Data Partition
Whenever I am building a model, I want the model to predict unseen (new) cases. To facilitate this, we split the given dataset into 2 datasets:
1. Training dataset: used to build the model.
2. Validation dataset: used to evaluate the model.
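A minimal sketch of such a split with scikit-learn, assuming a DataFrame df with a response column named target (both names are hypothetical); the 70/30 ratio is a common but arbitrary choice:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

# Hold out 30% of the rows for evaluation; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)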
Model Validation
It is desired that the R-squared between the predicted values and the actual values in the test set should be high. In general, the R-squared on the test data should be high and as close as possible to the R-squared on the training set. Note that R-squared is only one of many metrics used to assess the accuracy of a linear regression model.
Generalization vs. Memorization
Generalization: the ability to predict or assign a label to a "new" observation based on the "model" built from past experience.
Model SIGNAL, not NOISE
Model is too simple: UNDER-LEARN
Model is too complex: MEMORIZE
Model is just right: GENERALIZE
Generalize, don't Memorize! (plot of training set accuracy and validation set accuracy against model complexity, marking the right level of model complexity)
Overfitting
In multivariate modelling, you can get highly significant but meaningless results if you put too many predictors in the model. Such a model fits the training data perfectly but has no predictive ability on a new sample (the validation data).
LASSO Regression
LASSO (Least Absolute Shrinkage and Selection Operator) uses the L1 regularisation technique. It is generally used when we have a large number of features, because it automatically performs feature selection. The main problem with lasso regression is that when we have correlated variables, it retains only one of them and sets the other correlated variables to zero. That can lead to some loss of information, resulting in lower accuracy of our model.
Ridge Regression
Ridge regression shrinks the parameters and is therefore mostly used to prevent multicollinearity. It reduces model complexity through coefficient shrinkage. It uses the L2 regularisation technique.
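A minimal scikit-learn sketch of both penalised models, assuming X_train and y_train already exist (the alpha values are arbitrary illustrations of the penalty strength):

from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1 penalty: can zero out coefficients
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2 penalty: shrinks coefficients

print(lasso.coef_)   # some coefficients may be exactly zero (feature selection)
print(ridge.coef_)   # coefficients shrunk but typically non-zero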
Using Linear Regression
In which of the following cases can linear regression be used?
A) An institute is looking to admit new students in its Data Analytics Program. Potential students are asked to fill in various parameters such as previous company, grades, experience, etc. They need this data to figure out if an applicant would be a good fit for the program.
B) Flipkart is trying to analyse user details and past purchases to identify segments where users can be targeted for advertisements.
C) A researcher wishes to find out the amount of rainfall on a given day, given that pressure, temperature and wind conditions are known.
D) A start-up is analysing the data of potential customers. They need to figure out which people they should reach out to for a sales pitch.
Feedback: Past data could be used to predict what the rainfall will be based on the given predictors (option C).