Logistic Regression
Logistic regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1. It measures the probability of a binary response based on a mathematical equation relating it to the predictor variables. In other words, logistic regression estimates the probability of an event occurring, such as voted or didn't vote, given a dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
Linear: when there is a linear relationship between the independent and dependent variables, the model is known as linear regression.
Logistic: when the dependent variable is categorical in nature, it is known as logistic regression.
Polynomial: when the power of the independent variables is more than 1, it is referred to as polynomial regression.
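In R's formula interface, the three cases look roughly as follows; this is only a sketch, with placeholder variables x, y and a placeholder data frame d:
lm(y ~ x, data = d)                       # linear regression
glm(y ~ x, data = d, family = binomial)   # logistic regression (binary y)
lm(y ~ poly(x, 2), data = d)              # polynomial regression (degree 2)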
Why Logistic Regression
Whenever the outcome of the dependent variable (y) is discrete, like 0/1, we use logistic regression. In linear regression, y takes values in a continuous range, but in our case the value of y is discrete, i.e., it will be either 0 or 1. Logistic regression instead gives a probability: what are the chances that y will become 1? The probability is then compared against a threshold to decide the class. Ex: a basketball shot with a predicted scoring probability of 0.8 is above a 0.5 threshold, so y is 1; if the probability were below the threshold, y would be 0 (see the sketch below).
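A minimal sketch of that thresholding step in R (p_hat here is an assumed, already-computed predicted probability):
p_hat <- 0.8                           # predicted P(y = 1)
y_class <- ifelse(p_hat > 0.5, 1, 0)   # above the 0.5 threshold, so y = 1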
The general mathematical equation for logistic regression is:
y = 1 / (1 + e^-(a + b1*x1 + b2*x2 + b3*x3 + ...))
Following is the description of the parameters used:
y is the response variable.
x1, x2, ... are the predictor variables.
a and b1, b2, ... are the coefficients, which are numeric constants.
The function used to create the regression model is the glm() function.
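The same curve can be written directly in R; this is just an illustrative sketch, and base R's built-in plogis() computes the same function:
logistic <- function(z) 1 / (1 + exp(-z))
logistic(0)                          # 0.5, the curve's midpoint
all.equal(logistic(2), plogis(2))    # TRUE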
Syntax
The basic syntax for the glm() function in logistic regression is:
glm(formula, data, family)
Following is the description of the parameters used:
formula is the symbol presenting the relationship between the variables.
data is the data set giving the values of these variables.
family is an R object specifying the details of the model. Its value is binomial for logistic regression.
Example
The built-in data set "mtcars" describes different models of cars with their various engine specifications. In the "mtcars" data set, the transmission mode (automatic or manual) is described by the column am, which is a binary value (0 or 1). We can create a logistic regression model between the column "am" and 3 other columns: hp, wt and cyl.
# Select some columns from mtcars.
input <- mtcars[, c("am", "cyl", "hp", "wt")]
print(head(input))
When we execute the above code, it produces the following result:
                  am cyl  hp    wt
Mazda RX4          1   6 110 2.620
Mazda RX4 Wag      1   6 110 2.875
Datsun 710         1   4  93 2.320
Hornet 4 Drive     0   6 110 3.215
Hornet Sportabout  0   8 175 3.440
Valiant            0   6 105 3.460
Create Regression Model
We use the glm() function to create the regression model and get its summary for analysis.
input <- mtcars[, c("am", "cyl", "hp", "wt")]
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
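Once the model is fitted, predicted probabilities can be obtained with predict(); type = "response" returns probabilities on the 0-1 scale rather than log-odds. A brief sketch:
p_hat <- predict(am.data, type = "response")   # P(am = 1) for each car
head(ifelse(p_hat > 0.5, 1, 0))                # classify with a 0.5 threshold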
Advantages
Logistic regression is easy to implement and interpret, and very efficient to train.
It makes no assumptions about the distributions of classes in feature space.
It extends easily to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.
It provides not only a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative).
It is very fast at classifying unknown records.
Uses of Logistic Regression
Classification problems: an important category of problems in which a decision maker classifies customers into two or more categories.
Discrete choice models: estimating the probability that a customer selects a particular brand when several brands are available.
Probability: measuring the probability of the occurrence of an event.
Generalized Linear Model
The Generalized Linear Model (GLiM, or GLM) is an advanced statistical modelling technique formulated by John Nelder and Robert Wedderburn in 1972. It is an umbrella term that encompasses many other models and allows the response variable y to have an error distribution other than a normal distribution. The models include Linear Regression, Logistic Regression, and Poisson Regression.
Why GLM?
The Linear Regression model is not suitable if:
The relationship between X and y is not linear; for example, y increases exponentially as X increases.
The variance of the errors in y is not constant, and varies with X (violating the homoscedasticity assumption of Linear Regression).
The response variable is not continuous, but discrete/categorical. Linear Regression assumes a normal distribution of the response variable, which applies only to continuous data. If we try to build a linear regression model on a discrete/binary y variable, the model can predict values below 0 (or above 1) for the response, which is inappropriate (see the sketch below).
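To see the last point concretely, here is a small sketch (reusing the mtcars data from earlier) comparing a linear and a logistic fit on the binary am column; the linear fit can produce predictions outside [0, 1], while the logistic fit cannot:
lin_fit <- lm(am ~ wt, data = mtcars)
log_fit <- glm(am ~ wt, data = mtcars, family = binomial)
range(predict(lin_fit))                       # extends below 0 and above 1
range(predict(log_fit, type = "response"))    # always within (0, 1)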
Assumptions of GLM
Similar to the Linear Regression model, there are some basic assumptions for Generalized Linear Models as well. Most of the assumptions are similar to those of Linear Regression, while some of them are modified.
The data should be independent and random (each random variable has the same probability distribution).
The response variable y does not need to be normally distributed, but its distribution must be from an exponential family (e.g. binomial, Poisson, multinomial, normal).
The original response variable need not have a linear relationship with the independent variables, but the transformed response variable (through the link function) is linearly dependent on the independent variables.
Binomial Logistic Regression
A binomial logistic regression (often referred to simply as logistic regression) predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable based on one or more independent variables that can be either continuous or categorical. For example, you could use binomial logistic regression to understand whether exam performance can be predicted based on revision time, test anxiety and lecture attendance (i.e., where the dependent variable is "exam performance", measured on a dichotomous scale – "passed" or "failed" – and you have three independent variables: "revision time", "test anxiety" and "lecture attendance").
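Expressed in R's glm() notation, that example would look something like the following; the data frame exams and its column names are hypothetical, chosen only to match the description:
# 'exams' is a hypothetical data frame with a pass/fail outcome coded 0/1
fit <- glm(passed ~ revision_time + test_anxiety + attendance,
           data = exams, family = binomial)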
Logistic Function
The logistic (logit) model estimates various parameters and checks whether they are statistically significant and influence the probability of an event. One of the big assumptions of linear models is that the residuals are normally distributed. This doesn't mean that Y, the response variable, has to be normally distributed as well, but it does have to be continuous, unbounded and measured on an interval or ratio scale. Unfortunately, categorical response variables are none of these.
The Logit Link Function
A link function is simply a function of the mean of the response variable Y that we use as the response instead of Y itself. All that means is that when Y is categorical, we use the logit of Y as the response in our regression equation instead of Y itself:
logit(P) = ln(P / (1 - P)) = a + b1*x1 + b2*x2 + ...
The logit function is the natural log of the odds that Y equals one of the categories. For mathematical simplicity, we're going to assume Y has only two categories and code them as 0 and 1. This is entirely arbitrary – we could have used any numbers – but these make the math work out nicely, so let's stick with them. P is defined as the probability that Y = 1. So, for example, those Xs could be specific risk factors, like age, high blood pressure, and cholesterol level, and P would be the probability that a patient develops heart disease.
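In base R, the logit and its inverse are available as qlogis() and plogis(); a quick sketch:
p <- 0.8
qlogis(p)           # logit: log(p / (1 - p)), about 1.386
plogis(qlogis(p))   # the inverse logit recovers 0.8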
Optim Function
The function optim provides algorithms for general-purpose optimisation, and the documentation is perfectly reasonable, but it can take a little while to get your head around how to pass data and parameters to it. The basic call is:
optim(par, fn, data, ...)
where:
par: initial values for the parameters to be optimized over.
fn: a function to be minimized (or maximized); it should return a scalar result.
gr: a function to return the gradient for the "BFGS", "CG" and "L-BFGS-B" methods; if it is NULL, a finite-difference approximation will be used. For the "SANN" method it specifies a function to generate a new candidate point.
data: the name of the object in R that contains the data, passed through ... to fn.
#define data
df <- data.frame(x = c(1, 3, 3, 5, 6, 7, 9, 12),
                 y = c(4, 5, 8, 6, 9, 10, 13, 17))
#define function to minimize residual sum of squares
min_residuals <- function(data, par) {
  with(data, sum((par[1] + par[2] * x - y)^2))
}
#find coefficients of linear regression model
optim(par = c(0, 1), fn = min_residuals, data = df)
The result is a list with components $par (the optimal intercept and slope), $value (the minimized residual sum of squares), $counts, $convergence (0 indicates success) and $message.
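The same machinery can fit the logistic regression from the mtcars example by minimizing the negative log-likelihood instead of the residual sum of squares. This is only a sketch, not a replacement for glm(), but its estimates should land close to the glm() coefficients:
X <- as.matrix(cbind(1, mtcars[, c("cyl", "hp", "wt")]))   # design matrix with intercept
y <- mtcars$am
neg_log_lik <- function(beta) {
  p <- plogis(X %*% beta)                    # inverse logit gives P(y = 1)
  -sum(y * log(p) + (1 - y) * log(1 - p))    # negative Bernoulli log-likelihood
}
fit <- optim(par = rep(0, ncol(X)), fn = neg_log_lik, method = "BFGS")
fit$par   # compare with coef(glm(am ~ cyl + hp + wt, data = mtcars, family = binomial))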
Maximum Likelihood Estimator
We can use MLE in order to get more robust parameter estimates. MLE can be defined as a method for estimating population parameters (such as the mean and variance for a Normal distribution, or the rate lambda for a Poisson distribution) from sample data such that the probability (likelihood) of obtaining the observed data is maximized.
n <- 1000
x <- rnorm(n, 2, 3)   # simulated sample with true mean = 2, sd = 3
The essential part of MLE is to specify the likelihood function. In R, you can easily use dnorm to obtain the density, specifying log = TRUE. The objective is then to minimize the negative sum of the log-likelihood, which is equivalent to maximizing the positive sum.
LL <- function(beta, sigma) {
  R <- dnorm(x, beta, sigma, log = TRUE)
  -sum(R)
}
Following that, the parameters to be estimated can be passed to the mle2 function, available in the package bbmle, which uses numerical optimization to find the solution.
library(bbmle)
fit_norm <- mle2(LL, start = list(beta = 0, sigma = 1),
                 lower = c(-Inf, 0), upper = c(Inf, Inf),
                 method = 'L-BFGS-B')
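Assuming the fit converges, the estimates should land near the true simulation values beta = 2 and sigma = 3, up to sampling noise:
summary(fit_norm)   # estimates should be close to beta = 2, sigma = 3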