Logistic Regression.ppt

1,264 views 22 slides Apr 04, 2023


University of Gondar
College of medicine and health science
Department of Epidemiology and
Biostatistics
Logistic Regression
Lemma Derseh (BSc. MPH)

Logistic regression
In linear regression, we fit a model with a continuous dependent variable and independent variable(s) of any measurement scale (categorical or numeric).
What can we do if the dependent variable is dichotomous? (The outcome can also have more than two categories, which leads to multinomial or ordinal logistic regression.)

Logistic regression cont…
The above question refers to problems of the following type:
Relationship between Coronary Heart Disease (CHD, a binary outcome variable, i.e. +ve or -ve) and age (a continuous variable).
Note: CHD = 0 implies -ve for CHD, and CHD = 1 implies +ve.

Age  CHD    Age  CHD    Age  CHD
22   0      40   0      54   0
23   0      41   1      55   1
24   0      46   0      58   1
27   0      47   0      60   1
28   0      48   0      60   0
30   0      49   1      62   1
30   0      49   0      65   1
32   0      50   1      67   1
33   0      51   0      71   1
35   1      51   1      77   1
38   0      52   0      81   1

Logistic regression cont…
One possible statistical method is to use a t-test to compare the mean ages of the two groups, or ANOVA (even though the outcome has only two levels).
In this example we do get a statistically significant age difference between the two groups (CHD +ve against -ve) (p<0.0001).
However, these tests tell us only that the mean age differs significantly between the two groups; they do not give the magnitude of the effect of age on CHD.
Therefore, what if our research goal is to know the probability of being +ve for CHD (i.e. to predict the outcome status of each individual)? Or
What happens when you have several covariates that you believe contribute to CHD?
Shall we use linear regression?

Logistic regression cont…
First, draw a scatter plot of CHD status versus age.
Second, add the candidate linear regression line of probability on age.
Problem! If we try to fit an ordinary linear regression, we will predict probabilities greater than 1 or less than 0, which is impossible.
[Figure: probability of CHD versus age, with the fitted straight line]
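The problem with the straight line can be checked on the CHD data listed earlier in this deck. The following is a minimal sketch, assuming ordinary least squares on the 0/1 outcome; the function and variable names are illustrative and not part of the original slides.

```python
# CHD data transcribed from the table earlier in this deck: (age, chd) pairs.
DATA = [
    (22, 0), (23, 0), (24, 0), (27, 0), (28, 0), (30, 0), (30, 0), (32, 0),
    (33, 0), (35, 1), (38, 0), (40, 0), (41, 1), (46, 0), (47, 0), (48, 0),
    (49, 1), (49, 0), (50, 1), (51, 0), (51, 1), (52, 0), (54, 0), (55, 1),
    (58, 1), (60, 1), (60, 0), (62, 1), (65, 1), (67, 1), (71, 1), (77, 1),
    (81, 1),
]


def fit_line(data):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


INTERCEPT, SLOPE = fit_line(DATA)
# "Probabilities" predicted by the straight line at the youngest and oldest ages.
PRED_YOUNGEST = INTERCEPT + SLOPE * 22
PRED_OLDEST = INTERCEPT + SLOPE * 81
print(f"age 22: {PRED_YOUNGEST:.3f}, age 81: {PRED_OLDEST:.3f}")
```

On these data the line falls below 0 at age 22 and rises above 1 at age 81, which is the impossibility described above.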

Logistic regression cont…
So what shall we do?
Rather than dealing with individual ages and their binary outcomes, let us group the age data so that we can get proportions (probabilities) of success (1s) in the different age groups.
In doing so, we get intermediate proportions between 0 and 1.

Age group   # in group   # diseased   Probability
20 - 29     5            0            0
30 - 39     6            1            0.17
40 - 49     7            2            0.29
50 - 59     7            4            0.57
60 - 69     5            4            0.80
70 - 79     2            2            1.00
80 - 89     1            1            1.00
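The grouping step can be sketched in a few lines. This is an illustrative example only, assuming ten-year age bands starting at 20 as in the grouped table; the helper name `group_by_decade` is hypothetical.

```python
from collections import defaultdict

# CHD data transcribed from the table earlier in this deck: (age, chd) pairs.
DATA = [
    (22, 0), (23, 0), (24, 0), (27, 0), (28, 0), (30, 0), (30, 0), (32, 0),
    (33, 0), (35, 1), (38, 0), (40, 0), (41, 1), (46, 0), (47, 0), (48, 0),
    (49, 1), (49, 0), (50, 1), (51, 0), (51, 1), (52, 0), (54, 0), (55, 1),
    (58, 1), (60, 1), (60, 0), (62, 1), (65, 1), (67, 1), (71, 1), (77, 1),
    (81, 1),
]


def group_by_decade(data):
    """Return {band start: (n in group, n diseased, proportion diseased)}."""
    counts = defaultdict(lambda: [0, 0])
    for age, chd in data:
        start = (age // 10) * 10  # e.g. 35 falls in the 30 - 39 band
        counts[start][0] += 1
        counts[start][1] += chd
    return {start: (n, d, d / n) for start, (n, d) in sorted(counts.items())}


GROUPS = group_by_decade(DATA)
for start, (n, diseased, proportion) in GROUPS.items():
    print(f"{start} - {start + 9}: n={n}, diseased={diseased}, p={proportion:.2f}")
```

The printed proportions should match the probability column of the grouped table.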

Logistic regression cont…
The probabilities in the above table are simply the proportions of individuals with CHD in each age category.
A scatter plot of these proportions against the age groups gives an S-shaped curve (drawn in red on the slide).
[Figure: sign of coronary disease (0 = No, 1 = Yes) and probability of CHD plotted against age group]

Logistic regression cont…
Again, such an S-shaped (sigmoidal) curve is difficult to describe with a linear equation, for two reasons.
First, even though it looks linear at the center of the curve, the extremes do not follow a linear trend;
Second, the errors are neither normally distributed nor of constant variance across the entire range of the data.
Question! So what do we do with this S-shaped curve?
Answer:
First: find a function that best fits (can be linked with) this S-shaped graph.
Second: find another function that transforms the S-shaped graph into a linear function.

(I) Finding a function that best fits the S-shaped graph of probability
$$P = P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
P = P(y | x) = P(success given x occurred) = P(a person is +ve for CHD given that his age is x)
We call the above mathematical expression a logistic function.
It always has an S-shaped curve within the range of 0 and 1 for any x.
That is why we link it with p (probability), which has the same S-shape in the same range of 0 to 1.
[Figure: logistic curve; vertical axis from 0.0 to 1.0, horizontal axis from 20 to 100]
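A small sketch of the logistic function may help show why it suits a probability. The intercept and slope used here (`ALPHA = -5.0`, `BETA = 0.1`) are hypothetical values chosen only for illustration; they are not estimates from the deck.

```python
import math


def logistic(x, alpha, beta):
    """P(y = 1 | x) for a logistic model with intercept alpha and slope beta."""
    z = alpha + beta * x
    # Two algebraically equal forms, chosen so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


ALPHA, BETA = -5.0, 0.1  # hypothetical values, for illustration only
AGES = list(range(20, 101, 20))
PROBS = [logistic(age, ALPHA, BETA) for age in AGES]
for age, p in zip(AGES, PROBS):
    print(f"age {age}: P = {p:.3f}")
```

Whatever x is supplied, the result stays strictly between 0 and 1 and rises monotonically when the slope is positive.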

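(II) Transforming to a linear function using the logit function
logit of P:
$$\operatorname{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = \alpha + \beta x$$
P / (1 - P) is the odds of an event.
$$P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
This is the logistic function.
α = log odds of the event in the unexposed
β = log odds ratio associated with being exposed
$e^{\beta}$ = odds ratio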
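The relationship between probability, odds and log odds can be sketched as follows. The coefficient values are hypothetical and serve only to show that the logit of a logistic probability is linear in x, and that exp(β) is the odds ratio for a one-unit increase in x.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def logit(p):
    """Log odds of an event with probability p."""
    return math.log(odds(p))


def inv_logit(z):
    """Logistic function: maps a log odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


ALPHA, BETA = -5.0, 0.1  # hypothetical values, for illustration only

# Applying the logit to the logistic probability recovers the linear predictor.
LOG_ODDS_AT_60 = logit(inv_logit(ALPHA + BETA * 60))

# The odds ratio between x = 41 and x = 40 equals exp(BETA).
ODDS_RATIO_PER_UNIT = (
    odds(inv_logit(ALPHA + BETA * 41)) / odds(inv_logit(ALPHA + BETA * 40))
)
print(LOG_ODDS_AT_60, ODDS_RATIO_PER_UNIT, math.exp(BETA))
```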

The linking and transformation process
[Diagram: Start: outcome (Yes / No) against a single predictor; grouped predictor giving proportions Pi; link function: logistic function; logit transform function; linear function: End]

Characteristics of the logistic function
The S-shaped curve of the logistic function has the following characteristics:
Function:
$$P(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
If β is the slope of the linear function after the logit transformation, then:
The S-shaped curve has a slope equal to p(1-p)β, where p is the probability at X = x.
As we move to the two extremes of x or p, the slope approaches 0.
The (x, p) coordinate at which the slope reaches its peak is (-α/β, 0.5). The value of x at this point is called the median effective level, denoted by $EL_{50}$.
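These properties can be checked numerically. The sketch below assumes the intercept and slope reported for age in the SPSS output of this deck (constant -6.708, β = 0.132) and compares the formula p(1-p)β with a finite-difference slope; the function names are illustrative.

```python
import math

# Coefficients taken from the SPSS output for the CHD example in this deck.
ALPHA, BETA = -6.708, 0.132


def logistic(x):
    return 1.0 / (1.0 + math.exp(-(ALPHA + BETA * x)))


def slope_formula(x):
    """Slope of the S-shaped curve from the closed form p(1 - p) * beta."""
    p = logistic(x)
    return p * (1.0 - p) * BETA


def slope_numeric(x, h=1e-4):
    """Central finite-difference approximation of the same slope."""
    return (logistic(x + h) - logistic(x - h)) / (2.0 * h)


EL50 = -ALPHA / BETA  # age at which the predicted probability is 0.5
print(f"EL50 = {EL50:.2f} years, peak slope = {slope_formula(EL50):.4f}")
for age in (30, 50, 70):
    print(age, round(slope_formula(age), 5), round(slope_numeric(age), 5))
```

Under these coefficients the median effective level is about 50.8 years, and the peak slope equals β/4 because p(1-p) is largest at p = 0.5.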

Logistic regression cont…
Example on the data given above:
The analysis of logistic regression is computer intensive.
After entering the above data in SPSS and running a binary logistic regression, the following result was obtained.

Variable   β        S.E.    Wald    Sig.    Exp(β)   95.0% C.I. for Exp(β)
                                                     Lower    Upper
age        0.132    0.046   8.053   0.005   1.141    1.042    1.249
Constant   -6.708   2.354   8.121   0.004   0.001

For a one-year increase in the age of a person, the odds of being positive for CHD increase by a factor of 1.141.
The 95% CI for this estimate (i.e. the odds ratio) is (1.04, 1.25).
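To make the "computer intensive" step concrete, here is a minimal sketch of maximum likelihood fitting by Newton-Raphson in plain Python. It is not how SPSS is implemented; it only assumes that the 33 observations listed earlier in this deck are the data behind the output, in which case the estimates should land close to the reported -6.708 and 0.132.

```python
import math

# CHD data transcribed from the table earlier in this deck: (age, chd) pairs.
DATA = [
    (22, 0), (23, 0), (24, 0), (27, 0), (28, 0), (30, 0), (30, 0), (32, 0),
    (33, 0), (35, 1), (38, 0), (40, 0), (41, 1), (46, 0), (47, 0), (48, 0),
    (49, 1), (49, 0), (50, 1), (51, 0), (51, 1), (52, 0), (54, 0), (55, 1),
    (58, 1), (60, 1), (60, 0), (62, 1), (65, 1), (67, 1), (71, 1), (77, 1),
    (81, 1),
]


def fit_logistic(data, iterations=25):
    """Newton-Raphson for logit(p) = alpha + beta * x with one predictor."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w             # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (X'WX)^-1 times the score vector.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


ALPHA_HAT, BETA_HAT = fit_logistic(DATA)
print(f"constant = {ALPHA_HAT:.3f}, beta(age) = {BETA_HAT:.3f}, "
      f"Exp(beta) = {math.exp(BETA_HAT):.3f}")
```

At the maximum likelihood estimate the fitted probabilities sum to the observed number of CHD cases, which is a quick sanity check on any fit.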

Logistic Regression cont…
Example on a categorical variable: residence vs patient satisfaction with the service

            Patient satisfaction
Residence   Unsatisfied   Satisfied   Total
Rural       98            17          115
Urban       205           154         359
Total       303           171         474

Odds for Rural: $\frac{p}{1-p} = \frac{0.85}{0.15} = 5.76$
Odds for Urban: $\frac{p}{1-p} = \frac{0.571}{0.429} = 1.33$
Odds Ratio = 5.76 / 1.33 = 4.33
ln Odds (Rural): ln(5.76) = 1.75
ln Odds (Urban): ln(1.33) = 0.285
∆ ln Odds = ln OR = β = 1.47
OR = $e^{1.47}$ = 4.33
OR remains the same by the two methods.
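The arithmetic of this example can be reproduced directly from the 2x2 table. The sketch below computes the odds ratio in two ways (as a ratio of group odds and as a cross-product of cell counts); the names are illustrative.

```python
import math

# (unsatisfied, satisfied) counts from the residence table in this deck.
TABLE = {"Rural": (98, 17), "Urban": (205, 154)}


def odds_from_counts(unsatisfied, satisfied):
    """Odds of being unsatisfied, computed as p / (1 - p)."""
    p = unsatisfied / (unsatisfied + satisfied)
    return p / (1.0 - p)


ODDS_RURAL = odds_from_counts(*TABLE["Rural"])
ODDS_URBAN = odds_from_counts(*TABLE["Urban"])

# Method 1: ratio of the two odds.
OR_FROM_ODDS = ODDS_RURAL / ODDS_URBAN
# Method 2: cross-product of the cell counts.
OR_CROSS_PRODUCT = (98 * 154) / (17 * 205)
# The difference in log odds is the logistic regression slope.
BETA_1 = math.log(ODDS_RURAL) - math.log(ODDS_URBAN)

print(f"odds rural = {ODDS_RURAL:.2f}, odds urban = {ODDS_URBAN:.2f}")
print(f"OR = {OR_FROM_ODDS:.2f}, ln OR = {BETA_1:.3f}")
```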

Interpreting the Logistic Regression Model
The model for this example is:
$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1$$
For urban ($x_1 = 0$) we have:
$$\ln\left(\frac{p_0}{1 - p_0}\right) = \beta_0 + \beta_1 (0) = \beta_0 = 0.285$$
(We always code the unexposed category as 0.)
Thus the estimate of the intercept is equal to $\beta_0$, which is the log odds for urban (unexposed).

Interpreting the Logistic cont…
The estimate of the slope is the difference between the log odds for rural (exposed) and the log odds for urban (unexposed):
$$\beta_1 = \ln\left(\frac{p_1}{1 - p_1}\right) - \ln\left(\frac{p_0}{1 - p_0}\right) = 1.75 - 0.285 = 1.465$$
The fitted model for log(Odds of dissatisfaction) is:
log(Odds) = 0.285 + 1.465X, where X is residence (1 = rural, 0 = urban)
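As a check on the fitted model log(Odds) = 0.285 + 1.465X, the sketch below converts the log odds back to probabilities and compares them with the observed proportions of unsatisfied patients in the residence table. It assumes X = 1 for rural and X = 0 for urban, as in the text; the function name is illustrative.

```python
import math

BETA_0, BETA_1 = 0.285, 1.465  # intercept and slope from the fitted model


def predicted_probability(x):
    """Probability of dissatisfaction for residence code x (1 = rural, 0 = urban)."""
    log_odds = BETA_0 + BETA_1 * x
    return 1.0 / (1.0 + math.exp(-log_odds))


P_URBAN = predicted_probability(0)
P_RURAL = predicted_probability(1)

# Observed proportions of unsatisfied patients from the 2x2 table.
OBSERVED_URBAN = 205 / 359
OBSERVED_RURAL = 98 / 115

print(f"urban: fitted {P_URBAN:.3f}, observed {OBSERVED_URBAN:.3f}")
print(f"rural: fitted {P_RURAL:.3f}, observed {OBSERVED_RURAL:.3f}")
```

With a single binary predictor the model has as many parameters as there are groups, so the fitted probabilities reproduce the observed proportions up to rounding of the coefficients.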

Meaning of the Odds Ratio
The odds ratio is:
$$\frac{\text{Odds}_{\text{rural}}}{\text{Odds}_{\text{urban}}} = \frac{e^{0.285 + 1.465}}{e^{0.285}} = e^{1.465} = 4.33$$
Or, Odds Ratio = exp($\beta_1$) = exp(1.465) = $e^{1.465}$ = 4.33
Interpretation: the odds of rural patients being unsatisfied with the service they received are 4.33 times those of urban residents.
SPSS output

Variable    B       S.E.    Wald    P-value   Exp(B)   95.0% C.I. for Exp(B)
                                                       Lower   Upper
Residence   1.47    0.284   26.72   <0.001    4.33     2.48    7.55
Constant    0.286   0.107   7.196   0.007     1.33
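The Exp(B) column and its confidence limits can be recovered from B and its standard error. The sketch below assumes a Wald-type interval, exp(B ± 1.96 × S.E.), and uses the unrounded slope 1.465 quoted in the text rather than the 1.47 shown in the table.

```python
import math


def odds_ratio_with_ci(b, se, z=1.96):
    """Return (OR, lower, upper) for a Wald-type 95% confidence interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


B_RESIDENCE, SE_RESIDENCE = 1.465, 0.284
OR, LOWER, UPPER = odds_ratio_with_ci(B_RESIDENCE, SE_RESIDENCE)
# Wald statistic: squared ratio of the estimate to its standard error.
WALD = (B_RESIDENCE / SE_RESIDENCE) ** 2

print(f"OR = {OR:.2f}, 95% CI = ({LOWER:.2f}, {UPPER:.2f}), Wald = {WALD:.2f}")
```

The small difference between the Wald value computed here and the one in the output is expected, because B and S.E. are shown rounded.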

Multiple logistic regression
This model includes more than one independent variable.
The independent variables can be dichotomous, ordinal, nominal, continuous, etc.
$$\operatorname{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
Interpretation of $\beta_i$
It is the increase in log-odds for a one-unit increase in $x_i$ with all the other $x$'s held constant.
It measures the association between $x_i$ and the log-odds, adjusted for all the other $x$'s.

Multiple logistic regression cont..
Example-1: Assume we have a second variable, 'sex', which is added to the existing data (the CHD data) as shown in the SPSS data view in the exercise.

Variable   B        S.E.    Wald    Sig.    Exp(B)   95.0% C.I. for Exp(B)
                                                     Lower   Upper
age        0.114    0.053   4.733   0.030   1.121    1.011   1.243
sex(1)     2.952    1.276   5.356   0.021   19.153   1.571   233.459
Constant   -7.787   2.689   7.367   0.007   0.000

Interpretation: For females, the odds of developing CHD are 19.15 times those of males, adjusted for age. (Males are taken as the reference.)
Note that the 95% CI is very wide because the sample size used in the analysis is small. (There should be at least 10 'yes's and 10 'no's, preferably 20, for each category of each variable.)
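The adjusted interpretation can be illustrated with the coefficients in the table. This sketch assumes that sex(1) is coded 1 for females and 0 for males, consistent with males being the reference; the function names are illustrative.

```python
import math

# Coefficients copied from the SPSS output for the age + sex model.
CONSTANT, B_AGE, B_SEX = -7.787, 0.114, 2.952


def predicted_probability(age, female):
    """female is 1 for females and 0 for males (assumed coding of sex(1))."""
    log_odds = CONSTANT + B_AGE * age + B_SEX * female
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(p):
    return p / (1.0 - p)


P_MALE_50 = predicted_probability(50, 0)
P_FEMALE_50 = predicted_probability(50, 1)
# Holding age fixed, the ratio of odds equals exp(B_SEX) at any age.
OR_SEX_AT_50 = odds(P_FEMALE_50) / odds(P_MALE_50)
OR_SEX_AT_70 = odds(predicted_probability(70, 1)) / odds(predicted_probability(70, 0))

print(f"P(CHD | age 50, male)   = {P_MALE_50:.3f}")
print(f"P(CHD | age 50, female) = {P_FEMALE_50:.3f}")
print(f"adjusted OR for sex     = {OR_SEX_AT_50:.2f}")
```

The odds ratio is the same at age 50 and at age 70, which is what "adjusted for age" means in a model without an interaction term.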

Multiple Logistic Regression output

                      Unsatisfied
Characteristics       Yes   No    Crude OR (95% CI)      Adjusted OR (95% CI)    P-value
Cost of treatment                                                                0.522 (overall)
  Very cheap          59    69    1.0                    1.0
  Cheap               70    42    1.95 (1.16, 3.27)      1.36 (0.66, 2.81)       0.400
  Moderate            30    8     4.39 (1.87, 10.30)     2.12 (0.64, 7.05)       0.220
  Expensive           97    41    2.77 (1.67, 4.58)      1.54 (0.73, 3.26)       0.255
  Highly expensive    47    11    5.00 (2.38, 10.50)     2.35 (0.74, 7.53)       0.150
Residence
  Urban               205   154   1.0                    1.0
  Rural               98    17    4.33 (2.48, 7.55)      2.71 (1.19, 6.16)       0.017*
Extra job                                                                        <0.001** (overall)
  Gov't worker        19    13    1.0                    1.0
  Need part-time      210   58    2.48 (1.16, 5.31)      3.18 (1.09, 9.06)       0.034
  Part-timer          60    96    0.43 (0.20, 0.93)      1.04 (0.35, 3.15)       0.94
  Has his own firm    14    4     2.40 (0.64, 8.93)      6.26 (3.21, 32.27)      0.028
Diagnosis type
  Complex             108   154   1.0                    1.0
  Simple              195   17    16.36 (9.41, 28.44)    13.55 (6.96, 26.37)     <0.001**
Total                 303   171

α = 0.05; * shows significant; ** shows highly significant; overall p-values (underlined on the slide) are marked "(overall)"
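The crude odds ratios in the table can be reproduced from the Yes/No frequencies by dividing each category's odds by the odds of the reference category. The sketch below does this for cost of treatment, with "Very cheap" as the reference; the adjusted ORs cannot be obtained this way because they come from the multivariable model.

```python
# (category, unsatisfied = yes, unsatisfied = no) from the output table.
COST_OF_TREATMENT = [
    ("Very cheap", 59, 69),  # reference category (OR = 1.0)
    ("Cheap", 70, 42),
    ("Moderate", 30, 8),
    ("Expensive", 97, 41),
    ("Highly expensive", 47, 11),
]


def crude_odds_ratios(rows):
    """Odds of 'yes' in each category divided by the odds in the first row."""
    _, ref_yes, ref_no = rows[0]
    ref_odds = ref_yes / ref_no
    return {name: (yes / no) / ref_odds for name, yes, no in rows}


CRUDE_OR = crude_odds_ratios(COST_OF_TREATMENT)
for name, value in CRUDE_OR.items():
    print(f"{name}: crude OR = {value:.2f}")
```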

Take Notice of:
How the variables (characteristics) and their corresponding categories should be laid out.
How to enter the frequencies in relation to the coding of the dependent variable's categories in the SPSS variable view (because SPSS always interprets ORs in terms of the larger code given in the 'Values' column, e.g. here dissatisfaction).
Overall p-values are important for variables having more than two categories, especially if that variable has both significant and non-significant categories.
The overall p-values are written on the same row as the variable name, and the specific p-values next to the respective categories.
Specific p-values may look redundant alongside the confidence intervals; however, they tell us the level (degree) of significance (e.g. strong, marginal or weak association).

Exercise on the given data