Linear Regression in Machine Learning YLP

Linear Regression
Y. Lakshmi Prasad, 08978784848

Objectives
1. Describe the linear regression model
2. State the regression modeling steps
3. Explain ordinary least squares (OLS)
4. Compute regression coefficients
5. Understand and check model assumptions
6. Compute the residual sum of squares (RSS) and R² (R-squared)
7. Predict the response variable

Simple Linear Regression
The most elementary type of regression model is simple linear regression, which explains the relationship between a dependent variable and one independent variable using a straight line. The line is drawn on the scatter plot of these two variables.

Simple Linear Regression
In simple linear regression, one variable is the independent (predictor) variable X, and the other is the dependent (outcome) variable Y, which is continuous in nature.

Scatter Plot (figure)

Regression Model
Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is the random error term.

Intercept and Slope
Since X is given and we need to predict Y, we require two other parameters: the slope and the intercept. The intercept is the value of Y when X is zero. A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Simple Linear Regression (figure)

Regression Line (figure)

Intercept of a Straight Line
What is the intercept of the given line? Use the graph above to answer this question.
A) 0  B) 3  C) 4  D) 1/2

Feedback: Answer B. The value of y when x = 0 on the given line is 3, so 3 is the intercept in this case.

Slope of a Straight Line
What is the slope of the given line? Use the graph above to answer this question.
A) 1/2  B) 1/3  C) 1  D) 2

Feedback: Answer A. The slope of any straight line can be calculated as (y₂ - y₁)/(x₂ - x₁), where (x₁, y₁) and (x₂, y₂) are any two points through which the line passes. This line passes through (0, 3) and (2, 4), so its slope is (4 - 3)/(2 - 0) = 1/2.

Equation of a Straight Line
What would be the equation of the given line?
A) Y = X/2 + 3  B) Y = 2X + 3  C) Y = X/3 + 1/2  D) Y = 3X + 1/2

Feedback: Answer A. The standard equation of a straight line is y = mx + c, where m is the slope and c is the intercept. In this case, m = 1/2 and c = 3, so the equation is Y = X/2 + 3.
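
To tie this back to code, here is a minimal Python sketch, using the two points from the example above, that recovers the same slope, intercept, and equation:

```python
# Recover the slope and intercept of the line through (0, 3) and (2, 4).
x1, y1 = 0, 3
x2, y2 = 2, 4

m = (y2 - y1) / (x2 - x1)   # slope = (y2 - y1)/(x2 - x1) = 0.5
c = y1 - m * x1             # intercept = value of y when x = 0, so c = 3

print(f"y = {m}x + {c}")    # y = 0.5x + 3.0
```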

Strength of the Linear Regression Model
The strength of the linear regression model can be assessed using two metrics:
1. R² (coefficient of determination)
2. Residual standard error (RSE)

Least Squares Regression Line
The coefficients of the least squares regression line are determined by the ordinary least squares method, which basically means minimising the sum of the squares of the:
A) x-coordinates
B) y-coordinates of actual data
C) y-coordinates of predicted data
D) y-coordinates of actual data minus y-coordinates of predicted data

Feedback: Answer D. The ordinary least squares method minimises the sum of squares of residuals. A residual is the difference between the y-coordinate of the actual data point and the y-coordinate of the predicted point.

Best Fit Line
The best-fit line is found by minimising the RSS (residual sum of squares), which equals the sum of the squared residuals over all data points in the plot. The residual for any data point is the actual value of the dependent variable minus its predicted value: eᵢ = yᵢ - ŷᵢ.

Residuals (figure)

Best Fit Regression Line
What is the main criterion used to determine the best-fitting regression line?
A) The line that goes through the greatest number of points
B) The line that has an equal number of points above and below it
C) The line that minimises the sum of squares of distances of points from the regression line
D) Either B or C (they are the same criterion)

Feedback: Answer C. The criterion is given by the ordinary least squares (OLS) method, which states that the sum of squares of residuals should be a minimum.

R-Square Formula
R² = 1 - (RSS / TSS)

RSS: Residual Sum of Squares
In the earlier example of marketing spend (in lakhs) and sales amount (in crores), assume you get the same data in different units: marketing spend (in lakhs) and sales amount (in dollars). Do you think the value of RSS will change due to the change in units (as compared to the value calculated in the Excel demonstration)?
A) Yes, the value of RSS would change because the units are changing
B) No, the value won't change
C) Can't say

Feedback: Answer A. The RSS for any regression line is given by Σ(yᵢ - ŷᵢ)². RSS is the sum of the squared differences between the actual and the predicted values, and its value will change if the units change, since it carries the units of y². For example, (140 rupees - 70 rupees)² = 4900, whereas (2 USD - 1 USD)² = 1. So the value of RSS differs in the two cases because of the different units.

RSS and TSS
RSS (residual sum of squares) is the total squared error across the whole sample; it measures the difference between the predicted and the actual output. A small RSS indicates a tight fit of the model to the data. TSS (total sum of squares) is the sum of squared deviations of the data points from the mean of the response variable.

RSS Plot (figure)

Residual Sum of Squares (RSS)
Find the value of RSS for this regression line.
A) 0.25  B) 6.25  C) 6.5  D) -0.5

Feedback: Answer B. The residuals for the 5 points are -0.5, 1, 0, -2, and 1. The sum of squares of these residuals is 0.25 + 1 + 0 + 4 + 1 = 6.25.

Coefficient of Determination
R-squared is a number that states what proportion of the variation in the data is explained by the model. It always takes a value between 0 and 1. In general terms, it measures how well actual outcomes are replicated by the model, based on the proportion of the total variation in outcomes that the model explains. Overall, the higher the R-squared, the better the model fits your data.

Adjusted R-Squared
Adjusted R-squared is a better metric than R-squared for assessing how well the model fits the data, because it penalises R-squared for the unnecessary addition of variables. If an added variable does not increase accuracy adequately, adjusted R-squared decreases even though R-squared might increase.

Total Sum of Squares (TSS)
Find the value of TSS for this regression line.
A) 11.5  B) 7.5  C) 0  D) 14

Feedback: Answer D. The mean of the y-values of all data points is (3 + 5 + 5 + 4 + 8)/5 = 25/5 = 5. The (yᵢ - ȳ) term for each data point is then -2, 0, 0, -1, and 3, and the sum of the squares of these terms is 4 + 0 + 0 + 1 + 9 = 14.

R²
The RSS for this example comes out to be 6.25 and the TSS comes out to be 14. What is the R² for this regression line?
A) 1 - (14/6.25)  B) (1 - 14)/6.25  C) 1 - (6.25/14)  D) (1 - 6.25)/14

Feedback: Answer C. R² is given by 1 - (RSS / TSS). In this case, R² = 1 - (6.25 / 14) ≈ 0.55.
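
Putting the last few slides together, a short Python sketch using the y-values and residuals quoted in the feedback above reproduces the RSS, TSS, and R² figures:

```python
# Actual y-values and residuals from the worked example above.
y_actual = [3, 5, 5, 4, 8]
residuals = [-0.5, 1, 0, -2, 1]
y_pred = [ya - r for ya, r in zip(y_actual, residuals)]  # predicted = actual - residual

rss = sum(r ** 2 for r in residuals)                # 6.25
y_mean = sum(y_actual) / len(y_actual)              # 5.0
tss = sum((ya - y_mean) ** 2 for ya in y_actual)    # 14.0
r_squared = 1 - rss / tss                           # ~0.55

print(y_pred)
print(rss, tss, round(r_squared, 2))                # 6.25 14.0 0.55
```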

4 Questions to Ask Yourself
Whenever you are about to build a model, ask yourself these 4 questions:
1. What is my objective function?
2. What are my hyper-parameters?
3. What are my parameters?
4. How can I regularize this model?

Linear Regression Model
1. What is my objective function? Find the line that minimizes the RMSE.
2. What are my hyper-parameters? fit_intercept, normalize, and n_jobs (the number of jobs), as exposed by scikit-learn's LinearRegression.
3. What are my parameters? The intercept (B0) and the slope coefficient (B1).
4. How can I regularize this model? L1 norm, L2 norm, AIC, BIC.
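
As a concrete illustration of where these pieces live in scikit-learn (toy data assumed; note that the normalize argument has been removed from recent scikit-learn releases), the hyper-parameters are constructor arguments and the fitted parameters come back as intercept_ and coef_:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y roughly follows 3 + 0.5x (illustrative values only).
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([3.1, 3.4, 4.1, 4.4, 5.0])

# Hyper-parameters are set on the estimator itself.
# (normalize was removed in recent scikit-learn; scale the data beforehand instead.)
model = LinearRegression(fit_intercept=True, n_jobs=None)
model.fit(X, y)

# Learned parameters: intercept (B0) and slope coefficient (B1).
print(model.intercept_, model.coef_)
```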

Multiple Linear Regression
Multiple linear regression is a statistical technique for understanding the relationship between one dependent variable and several independent (explanatory) variables. The objective of multiple regression is to find a linear equation that best determines the value of the dependent variable Y for different values of the independent variables in X.

Multiple Linear Regression (figure)

Understanding the Regression Output (figure)
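
Since the output table itself is not reproduced here, a minimal statsmodels sketch on assumed toy data generates the kind of regression output the next few slides walk through, with a coefficient estimate, standard error, t-value, and p-value per variable:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Toy data: two predictors and one response (illustrative values only).
X = rng.normal(size=(100, 2))
y = 2.6 + 1.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=100)

# statsmodels does not add an intercept automatically.
X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()

# The summary table reports coef, std err, t, and P>|t| for each variable.
print(results.summary())
```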

We need to understand that not all variables should be used to build a model. Some independent variables are insignificant and add nothing to your understanding of the outcome (response/dependent) variable.

Standard Error
The standard error measures the variability in the estimate of a coefficient. A lower value is good, but it is relative to the size of the coefficient. For example, if the standard error of the intercept is about 0.38 and its estimate is 2.6, the intercept estimate can be read as 2.6 ± 0.38. Note that the standard error is absolute in nature, so on its own it is often difficult to judge whether the model is good or not.

t-value
The t-value is the ratio of an estimated coefficient to its standard error. It measures whether or not the coefficient for the variable is meaningful for the model, and it is used to calculate the p-value and the significance levels used in building the final model.

p-value
The p-value is used for hypothesis testing. In regression model building, the null hypothesis corresponding to each p-value is that the corresponding independent variable does not impact the dependent variable; the alternate hypothesis is that it does. The p-value is the probability of observing an effect at least this strong if the null hypothesis were true, so a low p-value (less than 0.05) indicates that you can reject the null hypothesis.

P-Value (figure)

Assumptions
Linear regression assumptions:
1. The relationship between X and Y is linear (linearity).
2. Y is distributed normally at each value of X (normality).
3. The variance of Y is the same at every value of X (no heteroscedasticity).
4. The observations are independent (no autocorrelation).
5. The independent variables are not correlated with each other (no multicollinearity).
6. No outliers (outlier test).
7. No influential observations.

Residual Analysis for Linearity (figure: residuals plotted against x for a non-linear and a linear relationship)

Heteroscedasticity and Homoscedasticity (figure: residuals with non-constant vs. constant variance)

Residual Analysis for Independence (figure: residuals plotted against X for dependent and independent observations)

Gradient Descent
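
As a stand-in for the slide's illustration, here is a minimal NumPy sketch, with an assumed learning rate and toy data, of gradient descent finding the intercept and slope that minimise the squared error:

```python
import numpy as np

# Toy data: y roughly follows 3 + 0.5x (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 3.4, 4.1, 4.4, 5.0])

b0, b1 = 0.0, 0.0   # intercept and slope, starting from zero
lr = 0.05           # learning rate (assumed)

for _ in range(5000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # Gradients of the mean squared error with respect to b0 and b1.
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * x).mean()

print(b0, b1)       # approaches the OLS intercept and slope
```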

Multicollinearity
Multicollinearity refers to a situation where multiple predictor variables are correlated with each other. Since multiple variables are involved, you cannot use the simple correlation coefficient alone to measure collinearity (it only measures the correlation between two variables).

Multicollinearity
Since one of the major goals of linear regression is identifying the important explanatory variables, it is important to assess the impact of each and keep those which have a significant impact on the outcome. This is the major issue with multicollinearity: it makes it difficult to assess the effect of individual predictors.

Multicollinearity
A simple way to detect multicollinearity is to look at the correlation matrix; a heat map of it makes correlated pairs easy to spot. The VIF (variance inflation factor) statistic is often used to detect multicollinearity.

VIF (Variance Inflation Factor)
The variance inflation factor is a useful measure of multicollinearity: it measures the correlation of one variable with all the other independent variables taken together.

VIF (Variance Inflation Factor)
A variable with a high VIF can be largely explained by the other independent variables, implying that its impact on the outcome can largely be captured by those variables; such variables should be checked and removed after checking the p-values. But remember, a variable with a high VIF may still be statistically significant (p < 0.05), in which case you should first remove the other insignificant variables before removing variables with a higher VIF and lower p-values.
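
statsmodels provides a ready-made VIF helper; a short sketch on assumed toy data (with x2 deliberately constructed to track x1) computes one VIF per column of the design matrix:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)

# Toy data: x2 is deliberately close to x1, so both should show a high VIF.
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# One VIF per column (the constant's VIF is usually ignored).
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```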

Variable Selection (RFE)
Recursive feature elimination is based on the idea of repeatedly constructing a model, choosing the best or the worst performing feature, setting that feature aside, and then repeating the process with the remaining features. This is applied until all the features in the dataset are exhausted. Features are then ranked according to when they were eliminated. As such, it is a greedy optimisation for finding the best performing subset of features.
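
scikit-learn implements this idea as RFE. A minimal sketch on assumed toy data keeps the 2 strongest of 5 features:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: only the first two of five features actually drive y.
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Repeatedly fit, drop the weakest feature, and refit, until 2 remain.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; higher = eliminated earlier
```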

VIF: Check Our Understanding
If a variable "A" has a high VIF (> 5), which of the following is true?
A) Variable "A" explains the variation in Y better than variables with a lower VIF
B) Variable "A" is highly correlated with other independent variables in the model
C) Variable "A" is insignificant (p > 0.05)
D) Removing "A" from the model will increase the adjusted R-squared

Linear Regression Model Building Process
Once you have understood the business objective, you prepare the data, perform EDA, and divide the data into training and test datasets. The next step is selecting variables for the model. Variable selection is critical: you cannot simply include all the variables, or you run the risk of including insignificant ones. This is where RFE can be used to quickly shortlist significant variables and save time. However, these significant independent variables might be related to each other, so you then check for multicollinearity using the variance inflation factor (VIF) and remove variables with a high VIF and low significance (p > 0.05).

Linear Regression Model Building Process
Variables with a high VIF may nevertheless be statistically significant (p < 0.05), in which case you first remove other insignificant variables (p > 0.05) before removing variables with a higher VIF and lower p-values. Continue removing variables until all remaining variables are significant (p < 0.05) and have low VIFs, arriving at a model where every variable is significant and there is no threat of multicollinearity. The final step is to check the model's accuracy on the test data.

Model Building Test
An analyst observes a positive relationship between digital marketing expenses and online sales for a firm. However, she intuitively feels she should add an additional independent variable, one which has a high correlation with marketing expenses. If the analyst adds this variable to the model, which of the following could happen? (More than one choice is correct; find both.)
A) The model's R-squared will decrease
B) The model's adjusted R-squared could decrease
C) The beta coefficient for the predictor digital marketing expenses will remain the same
D) The relationship between marketing expenses and sales can become insignificant

Feedback: Answers B and D. Adjusted R-squared could decrease if the variable does not add much to the model's explanation of online sales, and the relationship between marketing expenses and sales can become insignificant with the addition of the new variable.

Dummy Variables
Suppose you need to build a model on a dataset containing 2 categorical variables with 2 and 4 levels respectively. How many dummy variables should you create for model building?
A) 6  B) 4  C) 2  D) 8

Feedback: Answer B. Since n - 1 dummy variables can describe a variable with n levels, you get 1 dummy variable for the variable with two levels and 3 dummy variables for the variable with 4 levels, so 4 in total.
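
In pandas this encoding is a one-liner; a small sketch with assumed column names shows drop_first=True producing the n - 1 dummies per variable:

```python
import pandas as pd

# Toy data: one 2-level and one 4-level categorical column (names assumed).
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "region": ["north", "south", "east", "west"],
})

# drop_first=True keeps n - 1 dummies per variable: 1 + 3 = 4 columns.
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.columns.tolist())
```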

Data Partition
Whenever I am building a model, I want it to predict unseen (new) cases. To facilitate this, we split the given dataset into 2 datasets:
1. Training dataset: used to build the model.
2. Validation dataset: used to evaluate the model.

Model Validation
The R-squared between the predicted and actual values on the test set should be high; ideally, it should also be as close as possible to the R-squared on the training set. Note that R-squared is only one of many metrics for assessing the accuracy of a linear regression model.
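
A minimal scikit-learn sketch of this split-fit-validate loop, on assumed toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data (illustrative values only).
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Train and test R² should both be high and close to each other.
print(r2_score(y_train, model.predict(X_train)))
print(r2_score(y_test, model.predict(X_test)))
```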

Generalization vs. Memorization
Generalization: the ability to predict or assign a label to a "new" observation based on the "model" built from past experience.

Model SIGNAL, not NOISE
Model is too simple → underlearn.
Model is too complex → memorize.
Model is just right → generalize.

Generalize, Don't Memorize! (figure: training vs. validation set accuracy plotted against model complexity, marking the right level of model complexity)

Overfitting
In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. The model fits the training data perfectly but has no predictive ability on a new sample (the validation data).

LASSO Regression
LASSO (Least Absolute Shrinkage and Selection Operator) uses the L1 regularization technique. It is generally used when we have a large number of features, because it automatically performs feature selection. The main problem with lasso regression is that, when we have correlated variables, it retains only one of them and sets the other correlated variables to zero, which can lose information and lower the model's accuracy.

Ridge Regression
Ridge regression shrinks the parameters, so it is mostly used to prevent multicollinearity. It reduces model complexity by coefficient shrinkage, using the L2 regularization technique.
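
A short scikit-learn sketch, with assumed toy data and alpha values, contrasts the two penalties: lasso drives some coefficients exactly to zero, while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Toy data: 5 features, only the first two matter (illustrative values only).
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1: zeroes out weak coefficients
ridge = Ridge(alpha=0.5).fit(X, y)   # L2: shrinks but rarely zeroes

print(lasso.coef_)   # expect exact zeros on the irrelevant features
print(ridge.coef_)   # expect small but non-zero values instead
```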

Using Linear Regression
In which of the following cases can linear regression be used?
A) An institute is looking to admit new students to its data analytics program. Potential students fill in various parameters such as previous company, grades, experience, etc. The institute needs this data to figure out whether an applicant would be a good fit for the program.
B) Flipkart is trying to analyse user details and past purchases to identify segments of users who can be targeted for advertisements.
C) A researcher wishes to find out the amount of rainfall on a given day, given that pressure, temperature, and wind conditions are known.
D) A start-up is analysing the data of potential customers. They need to figure out which people they should reach out to for a sales pitch.

Feedback: Answer C. Rainfall amount is a continuous quantity, so past data could be used to predict the rainfall from the given predictors.

Questions?
Y. Lakshmi Prasad, 08978784848