Logistic Regression.pptx

About This Presentation

A detailed look at Logistic Regression, a supervised learning algorithm in Machine Learning.


Slide Content

Logistic Regression in Machine Learning

Introduction. Logistic regression is used to predict binary outcomes for a given set of independent variables. It is one of the algorithms used for classification, since the values it predicts are categorical. The name may be a little confusing because it has 'regression' in it, but it actually performs classification: the output is discrete instead of a continuous numerical value.

Explanation. Logistic Regression is a type of statistical model that is used to predict the probability of a certain event happening. It works by taking input variables and transforming them into a probability value between 0 and 1, where 0 represents a low probability and 1 represents a high probability. For example, imagine you want to predict whether someone will buy a product based on their age and income. Logistic Regression would take these input variables and use them to calculate the probability of the person buying the product, as the sketch below illustrates. It's called "logistic" because the transformation of the input variables is done using a mathematical function called the logistic function, which creates an S-shaped curve. Overall, Logistic Regression is a useful tool for making predictions and understanding the relationship between variables in a dataset.
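As a rough illustration of the age/income example above, here is a minimal Python sketch using scikit-learn; the data and numbers are invented for demonstration only:

    # Sketch: predicting the probability of a purchase from age and income.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy features [age, income in $1000s] and labels (1 = bought, 0 = did not)
    X = np.array([[22, 25], [25, 30], [35, 60], [42, 80],
                  [48, 95], [30, 40], [55, 110], [28, 35]])
    y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

    model = LogisticRegression().fit(X, y)

    # predict_proba returns [P(no purchase), P(purchase)] for each input
    new_customer = np.array([[40, 70]])
    print(model.predict_proba(new_customer))
    print(model.predict(new_customer))  # final class label, 0 or 1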

Example. Imagine it's been several years since you serviced your car. One day you find yourself wondering whether your car will break down in the near future or not. This is a classification problem, since the answer will be either 'Yes' or 'No'. [Figure: a regression curve of probability of breakdown against years since service, built from other users' experience.] As we can imagine, when the number of years since service is on the lower side, say 1, 2, or 3 years, the chance of the car breaking down is very limited. Here, the dependent variable's output is discrete.

Why not Linear Regression? Take, for example, data on employee ratings along with whether each employee got a promotion. If we plot this for linear regression with 'Yes' or 'No' outcomes (coding 0 as 'No' and 1 as 'Yes'), the graph will certainly look like this. [Figure: scatter of promotion outcome (0 = No, 1 = Yes) against employee rating.] In the graph we can see that the output is either 0 or 1 with nothing in between, since the output is discrete in this case, whereas employee rating is a continuous number, so there is no issue plotting it on the x-axis.

Why not Linear Regression? (continued) [Figure: a straight line fitted through the 0/1 promotion data (0 = No, 1 = Yes) against employee rating.] As you can see, the graph doesn't look right: there would be a lot of error and the RMSE would be very high. Also, the output values cannot go below 0 or above 1, yet a straight line does both. Therefore, instead of using linear regression, we need to come up with something different, and so the logistic model came into the picture. The sketch below makes the problem concrete.
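Here is a minimal Python sketch, with invented toy data, contrasting a straight-line fit with a logistic fit on a 0/1 outcome:

    # Why a straight line is a poor fit for 0/1 outcomes (toy data).
    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rating = np.array([[1.0], [1.5], [2.0], [2.5], [3.0],
                       [3.5], [4.0], [4.5], [5.0]])
    promoted = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1])  # 0 = No, 1 = Yes

    lin = LinearRegression().fit(rating, promoted)
    log = LogisticRegression().fit(rating, promoted)

    test = np.array([[0.5], [3.0], [6.0]])
    print(lin.predict(test))              # the line dips below 0 and exceeds 1
    print(log.predict_proba(test)[:, 1])  # the sigmoid stays within (0, 1)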

Odds of Success. To understand logistic regression, let's talk about the odds of success.

Odds: θ = p / (1 − p), i.e. (probability of success) / (probability of failure).

The value of the odds ranges from 0 to ∞, while probability ranges from 0 to 1.
If p = 0, θ = 0 / (1 − 0) = 0/1 = 0.
If p = 1, θ = 1 / (1 − 1) = 1/0 = ∞.
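A quick numeric check of the odds formula in plain Python (the probabilities are arbitrary examples):

    # Odds of success: theta = p / (1 - p)
    def odds(p):
        return p / (1 - p)

    for p in [0.0, 0.2, 0.5, 0.8]:
        print(p, odds(p))  # 0.0 -> 0.0, 0.2 -> 0.25, 0.5 -> 1.0, 0.8 -> 4.0
    # As p approaches 1, the odds grow without bound (toward infinity).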

Predicting Odds of Success. The logistic model assumes the log-odds is a linear function of x:

ln( p(x) / (1 − p(x)) ) = β₀ + β₁x   (β₀ = constant)

Exponentiating both sides:

p(x) / (1 − p(x)) = e^(β₀ + β₁x)

Let Y = e^(β₀ + β₁x). Then p(x) / (1 − p(x)) = Y.

Predicting Odds of Success (continued). From p(x) / (1 − p(x)) = Y:

p(x) = Y(1 − p(x))
or, p(x) = Y − Y·p(x)
or, p(x) + Y·p(x) = Y
or, p(x)(1 + Y) = Y
or, p(x) = Y / (1 + Y)

Predicting Odds of Success (continued). Substituting Y = e^(β₀ + β₁x) back in:

p(x) = e^(β₀ + β₁x) / (1 + e^(β₀ + β₁x))
or, p(x) = 1 / (1 + e^−(β₀ + β₁x))   [Sigmoid]

This is exactly the equation of the sigmoid function, σ(z) = 1 / (1 + e^−z), with z = β₀ + β₁x.
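The two forms derived above can be checked numerically; the coefficients below are illustrative, not from the slides:

    # Both expressions for p(x) trace the same S-curve.
    import numpy as np

    beta0, beta1 = -3.0, 1.5  # illustrative coefficients

    def p_via_odds(x):
        Y = np.exp(beta0 + beta1 * x)  # Y = e^(b0 + b1*x)
        return Y / (1 + Y)

    def p_via_sigmoid(x):
        return 1 / (1 + np.exp(-(beta0 + beta1 * x)))

    x = np.linspace(-5.0, 10.0, 7)
    print(np.allclose(p_via_odds(x), p_via_sigmoid(x)))  # True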

Compare Linear Regression and Logistic Regression

Linear Regression:
- Used to solve regression problems.
- The response variable is continuous in nature.
- It helps estimate the dependent variable when there is a change in the independent variable.
- The fitted model is a straight line.

Logistic Regression:
- Used to solve classification problems.
- The response variable is categorical in nature.
- It helps calculate the probability of a particular event taking place.
- The fitted model is an S-curve (S = sigmoid).

Compare Linear Regression and Logistic Regression: Example (Weather Prediction)

Linear Regression: if we need to predict the temperature for the coming week, the output is a continuous number.
Logistic Regression: if we need to predict whether or not it will rain tomorrow, the output is a discrete value; the prediction will be either 'Yes' or 'No'.

Compare Logistic Regression and Classification. Logistic regression is a statistical modeling technique used to analyze and model the relationship between a dependent variable (binary or dichotomous) and one or more independent variables. In logistic regression, the dependent variable is categorical (i.e., it takes on a limited number of values), but the quantity the model actually outputs, a probability, is continuous in nature. The goal of logistic regression is to predict the probability of an event occurring (i.e., the dependent variable taking a certain value) based on the values of the independent variables. Classification, on the other hand, is a machine learning task that involves assigning an input to one of several predefined categories. Classification can be thought of as a kind of prediction problem, where the goal is to predict the class or category of a given input.

Applications of Logistic Regression
1. Fraud Detection: the binary outcome variable is either 'Detected' or 'Not Detected'.
2. Disease Diagnosis: the outcome is either 'Positive' or 'Negative'.
3. Emergency Detection: the binary outcome variable is either 'Emergency' or 'Not Emergency'.
4. Spam Filter: the outcome is either 'Spam' or 'Not Spam'.

Logistic Regression Assumptions: Binary Outcome. The dependent variable, also known as the outcome variable or response variable, is binary in nature. This means that it takes on one of two possible values, typically coded as 0 and 1, or as "success" and "failure", "yes" and "no", "true" and "false", or some other binary coding. The logistic regression model is designed to estimate the probability of the "success" outcome as a function of one or more independent variables, also known as predictors or covariates. The logistic function, which transforms a linear combination of the predictors into a probability between 0 and 1, is used to model the relationship between the predictors and the outcome.

Logistic Regression Assumptions: Independence of Errors. Independence of errors or residuals is a critical assumption of logistic regression. This means that the error or residual term for each observation in the dataset should not be related to the error or residual term for any other observation. Violation of this assumption can result in biased and inefficient estimates of the logistic regression parameters, which can lead to incorrect inferences and predictions.

One way to check for violation of the independence assumption is to examine the residual plot, which should not show any discernible patterns or trends over time, across groups, or as a function of the predicted values; a sketch of such a check follows. If violations of independence are detected, this may indicate the need to consider a different model or to account for correlation or clustering in the data using more sophisticated methods, such as generalized estimating equations or mixed-effects models.
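A minimal sketch of that residual check, using toy data and scikit-learn (names and data are illustrative only):

    # Plot raw residuals (y - fitted probability) against fitted values;
    # visible structure in this plot would hint at dependent errors.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                                   # toy predictors
    y = (rng.random(200) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)  # toy labels

    model = LogisticRegression().fit(X, y)
    p_hat = model.predict_proba(X)[:, 1]

    plt.scatter(p_hat, y - p_hat, alpha=0.5)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Predicted probability")
    plt.ylabel("Raw residual (y - p_hat)")
    plt.show()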

Logistic Regression Assumptions: Linearity of the Logit. Linearity of the logit is a key assumption of logistic regression: the relationship between the independent variables and the log-odds of the outcome is linear. In other words, the effect of the independent variables on the log-odds of the outcome is constant across the range of the independent variables.

One way to check for linearity is to examine the relationship between each independent variable and the log-odds of the outcome using a scatterplot or other graphical method (see the sketch below). If there is evidence of non-linearity, such as a curve or a pattern in the plot, it may be necessary to add polynomial terms, interaction terms, or other nonlinear transformations of the independent variables to the model. Alternatively, if the relationship is complex, a different model may be more appropriate, such as a generalized additive model or a machine learning algorithm.
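One common graphical check is to bin a predictor and compute the empirical log-odds per bin; a roughly straight trend supports the assumption. A sketch with toy data:

    # Empirical log-odds per bin of a predictor (toy data).
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 1000)                 # toy predictor
    p_true = 1 / (1 + np.exp(-(-3 + 0.6 * x)))   # generated with a linear logit
    y = (rng.random(1000) < p_true).astype(int)  # toy binary outcome

    bins = np.linspace(0, 10, 6)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (x >= lo) & (x < hi)
        p = np.clip(y[in_bin].mean(), 1e-3, 1 - 1e-3)  # avoid log(0)
        print(f"x in [{lo:.0f}, {hi:.0f}): log-odds = {np.log(p / (1 - p)):+.2f}")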

Logistic Regression Assumptions: No Multicollinearity. The assumption of no or low multicollinearity among the independent variables is important in logistic regression. Multicollinearity refers to a situation where two or more independent variables are highly correlated with each other, which can lead to problems in the estimation of the model parameters and in the interpretation of the results. Multicollinearity can cause unstable and imprecise estimates of the logistic regression parameters, and may make it difficult to identify which independent variable(s) are driving the observed effects on the outcome variable.

One way to check for multicollinearity is to calculate the correlation matrix between the independent variables and look for high correlations (i.e., correlations greater than 0.7 or 0.8); see the sketch below. If high correlations are detected, several strategies can be used to address multicollinearity, such as removing one of the correlated variables, combining the variables into a single index or factor, or using regularization techniques like ridge regression or lasso regression. Resolving multicollinearity is important to ensure accurate and reliable estimates of the logistic regression parameters.
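A sketch of both checks mentioned above, a correlation matrix and variance inflation factors (VIFs), on deliberately correlated toy data:

    # Correlation matrix plus VIFs; column names are illustrative.
    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    age = rng.normal(40, 10, 300)
    income = 2 * age + rng.normal(0, 5, 300)  # deliberately correlated with age
    tenure = rng.normal(5, 2, 300)
    X = pd.DataFrame({"age": age, "income": income, "tenure": tenure})

    print(X.corr().round(2))  # look for |r| greater than about 0.7 or 0.8

    # A VIF above roughly 5-10 is another common warning sign
    for i, col in enumerate(X.columns):
        print(col, round(variance_inflation_factor(X.values, i), 1))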

Logistic Regression Assumptions: Large Sample Size. Sample size is an important consideration in logistic regression. A relatively large sample size is typically required to ensure stable estimates and adequate statistical power to detect meaningful effects. The sample size requirements depend on several factors, such as the number and complexity of the independent variables, the prevalence of the outcome in the population, and the desired level of statistical power. As a general rule of thumb, a sample size of at least 10-15 observations per independent variable is often recommended; for example, a model with five predictors would call for roughly 50-75 observations at a minimum.

If the sample size is too small, the logistic regression model may suffer from issues such as overfitting, where the model fits the noise in the data instead of the underlying signal, and underpowered statistical tests, where important effects may be missed. In summary, a relatively large sample size is important to ensure accurate and stable estimates, as well as adequate statistical power to detect meaningful effects.

Confusion Matrix. A confusion matrix is a table used to evaluate the performance of a machine learning algorithm for classification tasks. It is a square matrix that compares the actual and predicted values of a classifier. Let's consider an example of a binary classification problem where we have a dataset of 100 patients, and we want to build a model that can predict whether a patient has diabetes or not based on their medical data. The model output will be either "Positive" or "Negative". By examining the values in the confusion matrix, we can calculate various performance metrics, such as accuracy, precision, recall, and F1-score, which help us evaluate the model's performance. The confusion matrix provides a clear and concise way of visualizing the model's ability to correctly classify positive and negative cases.

Confusion Matrix (continued). Suppose the model has made predictions on the test set and we have the following results:

                      Predicted Positive    Predicted Negative
    Actual Positive           60                    10
    Actual Negative           15                    15

Here we have a 2x2 matrix, where the rows represent the actual values and the columns represent the predicted values. The diagonal elements of the matrix represent the correctly classified cases, and the off-diagonal elements represent the incorrectly classified cases.

The values in the confusion matrix are as follows:
True Positives (TP): the number of cases correctly classified as positive (60 in this case).
False Positives (FP): the number of cases incorrectly classified as positive (15 in this case).
True Negatives (TN): the number of cases correctly classified as negative (15 in this case).
False Negatives (FN): the number of cases incorrectly classified as negative (10 in this case).
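These four counts give the standard metrics directly; a short Python sketch using the numbers above:

    # Metrics derived from the confusion matrix above.
    TP, FP, TN, FN = 60, 15, 15, 10

    accuracy = (TP + TN) / (TP + FP + TN + FN)          # 75 / 100 = 0.75
    precision = TP / (TP + FP)                          # 60 / 75  = 0.80
    recall = TP / (TP + FN)                             # 60 / 70  ~ 0.857
    f1 = 2 * precision * recall / (precision + recall)  # ~ 0.828

    print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
          f"recall={recall:.3f} f1={f1:.3f}")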

Thank you