University of Gondar
College of Medicine and Health Science
Department of Epidemiology and Biostatistics
Logistic Regression
Lemma Derseh (BSc. MPH)
Logistic regression
In linear regression, we can fit a model consisting of a
continuous dependent variable with independent variable/s
of any measurement scale (categorical or numeric)
What can we do if the dependent variable is dichotomous?
(The outcome can also have more than 2 categories; these cases are
handled by multinomial or ordinal logistic regression.)
Logistic regression cont…
The above question refers to the following type of problem:
Relationship between Coronary Heart Disease (CHD; a binary
outcome variable, i.e. +ve or -ve) and age (a continuous variable).
Note: CHD = 0 implies -ve for CHD, and CHD = 1 implies +ve

Age  CHD     Age  CHD     Age  CHD
22   0       40   0       54   0
23   0       41   1       55   1
24   0       46   0       58   1
27   0       47   0       60   1
28   0       48   0       60   0
30   0       49   1       62   1
30   0       49   0       65   1
32   0       50   1       67   1
33   0       51   0       71   1
35   1       51   1       77   1
38   0       52   0       81   1
Logistic regression cont…
One possible statistical method is to use a t-test to compare the mean
ages of the two groups, or to use ANOVA (even though the outcome has
only two levels)
Of course, in this regard we will get a statistically significant age
difference between the two groups (CHD +ve against -ve)
(p<0.0001)
However, all these tests tell us only that the mean age differs
significantly between the two groups, not the magnitude of the effect
of age on CHD
Therefore, what if our research goal is to know the probability
of being +ve for CHD (i.e. to predict the outcome status of each
individual)? Or
What happens when you have several covariates that you
believe contribute to CHD?
Shall we use linear regression?
Logistic regression cont…
First, draw a scatter plot of CHD status versus age
Second, add the possible linear regression line of probability on age
[Figure: scatter plot with a fitted straight line; y-axis: Probability for CHD; x-axis: age]
Problem! If we try to fit an ordinary linear regression, we will
predict probabilities greater than 1 or less than 0, which is impossible
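The problem can be checked numerically. The sketch below is illustrative only: it assumes the 33 (age, CHD) pairs from the data table and fits an ordinary least-squares line with the closed-form formulas. The names `ages`, `chd` and `ols_line` are introduced here and are not part of the original slides.

```python
# Illustrative sketch: fit a straight line to the 0/1 CHD outcome.
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
chd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = ols_line(ages, chd)
pred_youngest = intercept + slope * min(ages)
pred_oldest = intercept + slope * max(ages)
print(f"predicted 'probability' at age {min(ages)}: {pred_youngest:.3f}")
print(f"predicted 'probability' at age {max(ages)}: {pred_oldest:.3f}")
```

With these data the fitted line falls below 0 at the youngest age and above 1 at the oldest age, so it cannot be read as a probability over the whole age range.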
So what shall we do?
Rather than dealing with individual ages and their binary outcomes,
let us group the ages so that we can get the proportions
(probabilities) of success (1s) in different age groups
In doing so, we can get intermediate proportions between 0 and 1
[Figure: y-axis: Sign of coronary disease, 1 (Yes) and 0 (No)]
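The grouping step can be sketched as follows. The 10-year age bins are a hypothetical choice made for illustration (the age groups used in the original slides are not shown in the text), and the helper name `grouped_proportions` is likewise an assumption.

```python
# Illustrative sketch: proportion with CHD in each (hypothetical) age group.
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
chd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

# Assumed bins (inclusive bounds); the last bin collects everyone aged 70+.
bins = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 89)]


def grouped_proportions(x, y, age_bins):
    """Return one (low, high, n, cases, proportion) row per age bin."""
    rows = []
    for low, high in age_bins:
        outcomes = [yi for xi, yi in zip(x, y) if low <= xi <= high]
        n = len(outcomes)
        cases = sum(outcomes)
        rows.append((low, high, n, cases, cases / n if n else float("nan")))
    return rows


table = grouped_proportions(ages, chd, bins)
for low, high, n, cases, prop in table:
    print(f"{low}-{high}: n={n}, CHD+={cases}, proportion={prop:.2f}")
```

Each proportion is an estimate of P(CHD = 1) within that age group, which is what gets plotted against age group.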
Logistic regression cont…
The probabilities in the above table are the same as the
proportions of individuals with CHD in each age category.
The scatter plot of these proportions against the age groups
gives an S-shaped curve (red color)
[Figure: S-shaped curve of proportions; x-axis: Age group; y-axis: Probability for CHD]
Logistic regression cont…
Again, such an S-shaped (sigmoidal) curve is difficult to describe
with a linear equation, for two reasons.
First, even though it seems linear at the center of the curve, the
extremes do not follow a linear trend;
Second, the errors are neither normally distributed nor constant
across the entire range of data
Question! So what do we do with this S-shaped curve?
Answer:
First: Find a function that best fits (can be linked with) this
S-shaped graph
Second: Find another function that transforms the S-shaped
graph into a linear function
(I) Finding a function that best fits with
the S-shaped graph of probability
$$P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
P = P(y | x) = P(success given x
occurred) = P(a person is +ve for CHD
given that his age is x)
[Figure: S-shaped curve; y-axis from 0.0 to 1.0; x-axis from 20 to 100]
We call the above mathematical expression a logistic function
It always has an S-shaped curve within the range of 0 and 1
for any x
That is why we linked it with p (probability), which has the
same S-shape in the same range of 0 to 1
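A minimal sketch of the logistic function, assuming the standard form P = e^(α+βx) / (1 + e^(α+βx)). The values of `alpha` and `beta` below are arbitrary illustrative numbers, not estimates from the slides.

```python
import math


def logistic(x, alpha, beta):
    """Logistic function P = e^(alpha+beta*x) / (1 + e^(alpha+beta*x)).

    The two branches are algebraically identical; choosing by the sign of z
    simply avoids overflow in math.exp for large |z|.
    """
    z = alpha + beta * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Arbitrary illustrative parameters (assumption, not from the slides).
alpha, beta = -5.0, 0.1
for x in (20, 40, 50, 60, 80, 100):
    print(f"x={x:3d}  P={logistic(x, alpha, beta):.3f}")
```

For these parameters the values rise monotonically with x and stay between 0 and 1, passing through 0.5 where α + βx = 0.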
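(II) Transforming to linear function
using logit function
$$P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
This is the logistic function
$$\frac{P}{1 - P} = e^{\alpha + \beta x}$$
The odds of an event
$$\operatorname{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = \alpha + \beta x$$
α = log odds of event in unexposed
β = log Odds Ratio associated with being exposed
$e^{\beta}$ = Odds Ratio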
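The effect of the logit transform can be illustrated numerically. This sketch assumes the logistic form P = 1 / (1 + e^-(α+βx)) and arbitrary illustrative values for `alpha` and `beta`; it shows that ln(P / (1 - P)) recovers the straight line α + βx.

```python
import math


def logistic(z):
    """Probability corresponding to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Arbitrary illustrative parameters (assumption, not from the slides).
alpha, beta = -5.0, 0.1
for x in (20, 40, 60, 80, 100):
    p = logistic(alpha + beta * x)
    print(f"x={x:3d}  P={p:.4f}  odds={p / (1 - p):.4f}  logit={logit(p):.4f}")
```

Equal steps in x produce equal steps in the logit column, although the steps in P itself are unequal.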
The linking and transformation process
[Flow diagram, from Start to End: Outcome (Yes/No) against a single predictor; predictor grouped to give proportions Pi; link function (logistic function); logit transform; linear function]
Characteristics of the logistic function
The S-shaped curve of the logistic function has the following
characteristics:
Function: $P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
If β is the slope of the linear function after logit transformation, then:
The S-shaped curve has a slope equal to p(1-p)β, where p is the
probability at X = x
As we move to the two extremes of x or p, the slope approaches 0
The (x, p) coordinate at which the slope reaches its peak is (-α/β,
0.5). The value of x at this point is called the median effective level,
denoted by $EL_{50}$
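These characteristics can be checked with a short sketch. It uses the estimates reported for the CHD example (α = -6.708, β = 0.132) purely as illustrative inputs; the function names and the finite-difference check are assumptions added here.

```python
import math

# Estimates reported in the SPSS output for the CHD example.
alpha, beta = -6.708, 0.132


def prob(x):
    """Logistic probability at x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def slope(x):
    """Slope of the S-shaped curve at x, using p(1-p)*beta."""
    p = prob(x)
    return p * (1.0 - p) * beta


el50 = -alpha / beta  # x at which p = 0.5 and the slope peaks

# Central finite difference as an independent check of the slope formula.
h = 1e-5
numeric_slope = (prob(el50 + h) - prob(el50 - h)) / (2 * h)

print(f"EL50 = {el50:.2f}")
print(f"p at EL50 = {prob(el50):.3f}")
print(f"slope at EL50: formula {slope(el50):.5f}, numeric {numeric_slope:.5f}")
print(f"slope at x=20: {slope(20):.5f}, at x=100: {slope(100):.5f}")
```

At the median effective level the slope equals β/4, since p(1-p) is largest at p = 0.5.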
Logistic regression cont…
Example on the data given above:
The analysis of logistic regression is computer intensive
After entering the above data in SPSS and running binary
logistic regression, the following result was obtained.

Variable    β        S.E.    Wald    Sig.    Exp(β)   95.0% C.I. for Exp(β)
                                                      Lower    Upper
age         0.132    0.046   8.053   0.005   1.141    1.042    1.249
Constant    -6.708   2.354   8.121   0.004   0.001

For a unit increase in the age of a person, the odds of being
positive for CHD increase by a factor of 1.141
The 95% CI for this estimate (i.e. Odds Ratio) is (1.04, 1.25)
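For readers without SPSS, the fit can be approximated with a few lines of Python. This is a minimal sketch, not the SPSS algorithm itself: it assumes the 33 (age, CHD) pairs from the data table and maximises the likelihood with Newton-Raphson iterations started from zero. The function names are hypothetical.

```python
import math

ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
chd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]


def score_and_information(a, b, x, y):
    """Score vector (g0, g1) and information matrix entries at (a, b)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_logistic(x, y, iterations=25):
    """Return (intercept, slope, se_intercept, se_slope)."""
    a = b = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(a, b, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * delta = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    _, _, h00, h01, h11 = score_and_information(a, b, x, y)
    det = h00 * h11 - h01 * h01
    # Standard errors from the diagonal of the inverse information matrix.
    return a, b, math.sqrt(h11 / det), math.sqrt(h00 / det)


a_hat, b_hat, se_a, se_b = fit_logistic(ages, chd)
print(f"constant = {a_hat:.3f} (S.E. {se_a:.3f})")
print(f"age      = {b_hat:.3f} (S.E. {se_b:.3f}), Exp(B) = {math.exp(b_hat):.3f}")
```

The estimates should agree with the SPSS table up to rounding; small differences can arise from convergence criteria.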
Logistic Regression cont…
Example on categorical variable (residence) vs patient satisfaction on service

                 Patient satisfaction
Residence    Unsatisfied   Satisfied   Total
Rural        98            17          115
Urban        205           154         359
Total        303           171         474

Odds for Rural: $\frac{p}{1-p} = \frac{0.852}{0.148} = 5.76$
Odds for Urban: $\frac{p}{1-p} = \frac{0.571}{0.429} = 1.33$
Odds Ratio = $\frac{5.76}{1.33} = 4.33$
ln Odds (rural): ln(5.76) = 1.75
ln Odds (urban): ln(1.33) = 0.285
∆ ln Odds = ln OR = β = 1.75 - 0.285 = 1.465 ≈ 1.47
OR = $e^{1.465}$ = 4.33
OR remains the same by the two methods
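The two routes to the odds ratio can be reproduced directly from the counts in the residence by satisfaction table. The variable names below are illustrative.

```python
import math

# Counts from the residence vs. patient satisfaction table.
rural_unsat, rural_sat = 98, 17
urban_unsat, urban_sat = 205, 154

# Proportion unsatisfied in each residence group.
p_rural = rural_unsat / (rural_unsat + rural_sat)
p_urban = urban_unsat / (urban_unsat + urban_sat)

# Method 1: ratio of the two odds.
odds_rural = p_rural / (1 - p_rural)
odds_urban = p_urban / (1 - p_urban)
odds_ratio = odds_rural / odds_urban

# Method 2: exponentiate the difference of the log odds.
log_or = math.log(odds_rural) - math.log(odds_urban)

print(f"odds rural = {odds_rural:.2f}, odds urban = {odds_urban:.2f}")
print(f"OR (ratio of odds)       = {odds_ratio:.2f}")
print(f"ln OR = {log_or:.3f}, exp(ln OR) = {math.exp(log_or):.2f}")
```

Both methods give the same odds ratio because exp(ln a - ln b) = a / b.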
Interpreting the Logistic Regression Model
The model for this example is:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1$$
For urban ($x_1 = 0$) we have:
$$\ln\left(\frac{p_0}{1-p_0}\right) = \beta_0 + \beta_1(0) = \beta_0 = 0.285$$
(Always we make the unexposed category 0)
Thus the estimate of the intercept is equal to $\beta_0$, which is the log
odds for urban (unexposed).
Interpreting the Logistic cont…
The estimate of the slope is the difference between the log
odds for rural (exposed) and the log odds for
urban (unexposed):
$$\beta_1 = \ln\left(\frac{p_1}{1-p_1}\right) - \ln\left(\frac{p_0}{1-p_0}\right) = 1.75 - 0.285 = 1.465$$
The fitted model for log(Odds of dissatisfaction) is:
log(Odds) = 0.285 + 1.465 $x_1$, where $x_1$ = 1 for rural and $x_1$ = 0 for urban residence
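As a check on the fitted model, the log odds can be converted back to probabilities and compared with the observed proportions of unsatisfied patients (98/115 for rural, 205/359 for urban). This is an illustrative sketch; `predicted_probability` is a hypothetical helper, and residence is coded 1 for rural and 0 for urban as in the slides.

```python
import math

# Fitted model from the slides: log(Odds) = 0.285 + 1.465 * x1
b0, b1 = 0.285, 1.465


def predicted_probability(x1):
    """Probability of dissatisfaction for x1 = 1 (rural) or x1 = 0 (urban)."""
    log_odds = b0 + b1 * x1
    return 1.0 / (1.0 + math.exp(-log_odds))


p_urban = predicted_probability(0)
p_rural = predicted_probability(1)
print(f"urban: fitted {p_urban:.3f}, observed {205 / 359:.3f}")
print(f"rural: fitted {p_rural:.3f}, observed {98 / 115:.3f}")
```

With a single binary predictor the model has as many parameters as there are groups, so the fitted probabilities match the observed proportions up to rounding of the coefficients.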
Meaning of the Odds Ratio
The odds ratio is:
$$\frac{\text{Odds}_{\text{rural}}}{\text{Odds}_{\text{urban}}} = \frac{e^{0.285 + 1.465}}{e^{0.285}} = e^{1.465} = 4.33$$
Or, Odds Ratio = exp($\beta_1$) = exp(1.465) = $e^{1.465}$ = 4.33
SPSS output

Variable     B       S.E.    Wald    P-value   Exp(B)   95.0% C.I. for Exp(B)
                                                        Lower    Upper
Residence    1.47    0.284   26.72   <0.001    4.33     2.48     7.55
Constant     0.286   0.107   7.196   0.007     1.33

Interpretation: the odds of dissatisfaction with the service received
among rural patients are 4.33 times those among urban residents
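The Residence row of the SPSS output can be reproduced by hand from the 2x2 counts. The sketch below assumes the usual large-sample (Woolf) standard error of the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), and a normal critical value of 1.96; the variable names are illustrative.

```python
import math

# Cell counts: rural (unsatisfied, satisfied) and urban (unsatisfied, satisfied).
a, b = 98, 17
c, d = 205, 154

log_or = math.log((a * d) / (b * c))          # coefficient B for residence
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Woolf standard error of ln(OR)
wald = (log_or / se) ** 2                      # Wald chi-square statistic

z = 1.96  # normal critical value for a 95% interval
ci_lower = math.exp(log_or - z * se)
ci_upper = math.exp(log_or + z * se)

print(f"B = {log_or:.3f}, S.E. = {se:.3f}, Wald = {wald:.2f}")
print(f"Exp(B) = {math.exp(log_or):.2f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```

The interval is computed on the log scale and then exponentiated, which is why it is not symmetric around the odds ratio.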
Multiple logistic regression
This model includes more than one independent variable
The independent variables could be dichotomous, ordinal,
nominal, or continuous
$$\operatorname{logit}(p) = \ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
Interpretation of $\beta_i$
It is the increase in log-odds for a one unit increase in $x_i$
with all the other $x_i$s held constant
It measures the association between $x_i$ and the log-odds, adjusted
for all other $x_i$s
Multiple logistic regression cont..
Example-1: Assume we have a second variable ‘sex’ which is
added to the existing data (CHD data) as indicated in the SPSS
data view in the exercise.
Variable    B        S.E.    Wald    Sig.    Exp(B)   95.0% C.I. for Exp(B)
                                                      Lower    Upper
age         0.114    0.053   4.733   0.030   1.121    1.011    1.243
sex(1)      2.952    1.276   5.356   0.021   19.153   1.571    233.459
Constant    -7.787   2.689   7.367   0.007   0.000
Interpretation: For females, the odds of developing CHD are 19.15
times those of males, adjusted for age. (Males are taken as the reference)
Note that the 95% CI is very wide because of the small
sample size used in the analysis (there should be at least 10 ‘yes’s and
10 ‘no’s, preferably 20, for each category of each variable)
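The meaning of an adjusted odds ratio can be illustrated with the coefficients reported above. This sketch assumes sex is coded 1 for females and 0 for males (males being the reference, as stated in the interpretation) and uses a normal critical value of 1.96; the function name `log_odds` is hypothetical.

```python
import math

# Coefficients and standard error as reported in the SPSS output.
b_const, b_age, b_sex = -7.787, 0.114, 2.952
se_sex = 1.276


def log_odds(age, sex):
    """Fitted log odds of CHD; sex = 1 for female, 0 for male (assumed coding)."""
    return b_const + b_age * age + b_sex * sex


# The female-to-male odds ratio is the same at every age.
for age in (40, 60):
    ratio = math.exp(log_odds(age, 1)) / math.exp(log_odds(age, 0))
    print(f"age {age}: odds ratio female vs male = {ratio:.2f}")

# 95% confidence interval for the odds ratio, built on the log scale.
z = 1.96
ci_lower = math.exp(b_sex - z * se_sex)
ci_upper = math.exp(b_sex + z * se_sex)
print(f"95% CI for Exp(B) of sex: ({ci_lower:.2f}, {ci_upper:.1f})")
```

The large standard error of the sex coefficient, once exponentiated, is what stretches the interval across two orders of magnitude.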
Take Notice of:
How we should enter the variables (characteristics) and the
corresponding categories
How to enter the frequencies in relation to the coding of the
categories of the dependent variable in the SPSS variable view
(because SPSS always interprets ORs in terms of the larger code
given in the ‘value’ column, e.g. here dissatisfaction)
Overall p-values are important for variables having more than two
categories, especially if there are both significant and insignificant
categories in that particular variable
The overall p-values are written on the same row as the variable name,
and the specific p-values on the rows of the respective categories
Specific p-values may look redundant given the
confidence intervals; however, they tell us the level (degree)
of significance (e.g. strong, marginal or weak association)