7. Logistic regression using SPSS

NishaArora1 1,146 views 76 slides Mar 14, 2021

About This Presentation

Logistic Regression for binary classification using SPSS


Slide Content

Dr Nisha Arora

Logistic Regression using SPSS


Object-wise Analysis
Steps to select the appropriate statistical test

Define clearly the objective of the study.

Define the level of measurement (metric/non-metric) of each variable to be included in the analysis.

Selecting the appropriate technique
Bivariate techniques

Columns: Response Variable (DV). Rows: Explanatory Variable (IDV).

Explanatory Variable (IDV) | DV Metric | DV Non-metric
Metric | Regression | Logistic Regression/LDA
Non-metric | Dummy Var Reg./Hypothesis Test* | Chi-square test

Make sure to check all assumptions before applying any statistical technique.

Selecting the appropriate technique

Columns: Response Variable(s) (DVs). Rows: Explanatory Variable(s) (IDVs).

Explanatory Variable(s) (IDVs) | One DV, Metric | One DV, Non-metric | More than one DV, Metric
One IDV, Metric | Simple Regression | Binary/Multinomial (Logistic) Reg | Path Analysis
One IDV, Non-metric | t test/Anova | Chi Square Test | Manova
More than one IDV, All Metric | Multiple Reg | Multiple Logit Reg/Multiple Multinomial | Path Analysis
More than one IDV, All Non-metric | n-way Anova | Complex Crosstab/Log-linear analysis | n-way Manova
More than one IDV, Mixed | n-way Ancova/Dummy var Regression | Multiple Logit Reg/Multiple Multinomial | n-way Mancova

Selecting the appropriate Technique
Binary (Binomial) Logistic Regression
Multi-Nominal Logistic Regression
Ordinal Logistic regression
Poisson Regression

Types of Logistic Regression

Binary
•Response has only two possible outcomes.
•E.g.: Spam or not

Multinomial
•Three or more categories without ordering.
•E.g.: Predicting which food is preferred more (Veg, Non-Veg, Vegan)

Ordinal
•Three or more categories with ordering.
•E.g.: Movie rating from 1 to 5

Prediction or Classification?

Types of Classification Problems
Multi-Label Classification
Multi-Class Classification
Binary Classification

Classification Problem
To predict in advance whether a product launch will be successful or not
An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent
Benign or malignant tumor
Spam detection
Movie genre classification

Box-Tidwell Test
In the model, include interactions between the continuous predictors and
their logs.
If such an interaction is significant, then the assumption has been
violated.
If any interaction is significant, try adding to the model powers of the
predictor (that is, going polynomial)

Caution:
This is not a very robust test, as it is affected by sample size.
You should not be very concerned about a marginally significant interaction when sample sizes are large.
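The interaction terms used by the Box-Tidwell test can be built before refitting the model. The sketch below only constructs the x·ln(x) column for each continuous predictor; the function name, the column suffix, and the sample rows are hypothetical, and the refit and significance check would still be done in the modelling tool of your choice.

```python
import math

def add_box_tidwell_terms(rows, continuous):
    """Return copies of rows with one extra x * ln(x) column per continuous predictor."""
    augmented = []
    for row in rows:
        new_row = dict(row)
        for name in continuous:
            x = row[name]
            if x <= 0:
                # ln(x) is undefined for x <= 0; such predictors need shifting first.
                raise ValueError(f"{name} must be positive to take its log")
            new_row[f"{name}_x_ln"] = x * math.log(x)
        augmented.append(new_row)
    return augmented

# Hypothetical data: two continuous predictors.
rows = [{"age": 1.0, "income": 2.0}, {"age": math.e, "income": 4.0}]
augmented = add_box_tidwell_terms(rows, ["age", "income"])
```

Each new column is then entered alongside the original predictor; a significant coefficient on the new column suggests the linearity-in-the-logit assumption is violated for that predictor.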

Assumptions of Logit Regression
Binary response variable with mutually exclusive and exhaustive categories.
One or more predictor variable(s).
Independent observations.
Linear relationship between the continuous independent variable(s) and the logit transformation of the dependent variable.
This assumption can be tested using the Box-Tidwell test: include in the model interactions between the continuous predictors and their logs. If such an interaction is significant, then the assumption has been violated.
What about collinearity, perfect collinearity, and multicollinearity?
https://stats.stackexchange.com/a/432543/79100

More Discussion on Multicollinearity
•What happens when you have multicollinearity?

Multicollinearity isn't as deleterious for prediction but may affect the variables' significance.
https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning

•Can you safely ignore multicollinearity?
https://statisticalhorizons.com/multicollinearity

•How to handle multicollinearity?
https://www.researchgate.net/post/how_to_deal_with_multicolinearity#view=580ef132ed99e1c1046fcf01

•Why not use the stepwise method?
http://www.philender.com/courses/linearmodels/notes4/swprobs.html
http://www.danielezrajohnson.com/stepwise.pdf

Assumptions: More Considerations
Logistic regression typically requires a large sample size because it uses maximum likelihood estimation. [Maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares.]
It is also important to keep in mind that when the outcome is rare, even if the overall dataset is large, it can be difficult to estimate a logit model.
Empty cells or small cells: You should check for empty or small cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases (a small cell), the model may become unstable or it might not run at all.
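The crosstab check for empty or small cells can be sketched outside SPSS as follows. The helper names, the sample data, and the cut-off of 5 cases per cell are assumptions made for illustration; the slide does not state a specific minimum.

```python
from collections import Counter

def crosstab(predictor, outcome):
    """Count cases for every (predictor category, outcome category) pair."""
    return Counter(zip(predictor, outcome))

def small_cells(predictor, outcome, min_count=5):
    """Return cells with fewer than min_count cases, including empty cells."""
    counts = crosstab(predictor, outcome)
    cells = [(p, o) for p in sorted(set(predictor)) for o in sorted(set(outcome))]
    # Counter returns 0 for pairs that never occur, so empty cells are flagged too.
    return {cell: counts[cell] for cell in cells if counts[cell] < min_count}

# Hypothetical data: region "S" has no events, region "E" has only two.
predictor = ["N"] * 10 + ["S"] * 6 + ["E"] * 7
outcome = [0] * 5 + [1] * 5 + [0] * 6 + [0] * 5 + [1] * 2
flagged = small_cells(predictor, outcome)
```

Here `flagged` lists the (region, outcome) combinations that would deserve a closer look before fitting the model.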

Why can’t we use Linear Regression for Classification Problems?

Why not Linear Regression?

What is Logistic Regression?
The logistic regression curve is called the “sigmoid curve”, also known as the S-curve.
How to decide whether the value is 0 or 1 from this curve? Set a threshold.

How to set a threshold?
Default: 0.5
Based on group sizes (as we do in LDA)
Based on a performance evaluation metric using cross-validation
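A minimal sketch of applying a threshold to predicted probabilities. The function names and numbers are hypothetical, and reading "based on group sizes" as "use the observed share of events as the cut-off" is an assumption, not something the slide spells out.

```python
def classify(probabilities, threshold=0.5):
    """Label a case 1 when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def group_size_threshold(y):
    """Assumed reading of 'based on group sizes': the observed share of events."""
    return sum(y) / len(y)

probs = [0.10, 0.35, 0.45, 0.80]   # hypothetical predicted probabilities
y_train = [0, 0, 0, 1]             # hypothetical training outcomes, 25% events

default_labels = classify(probs)                               # cut-off 0.5
prior_labels = classify(probs, group_size_threshold(y_train))  # cut-off 0.25
```

With the default cut-off only the last case is labelled 1; with the lower, group-size-based cut-off three cases are, which shows how strongly the choice of threshold drives the classification.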

Logistic Regression Equation
Rather than modeling the response Y directly, logistic regression models the probability that Y belongs to a particular category.
P(Y = 1 | X), or P(X), can take values from 0 to 1.

\[
\log\left(\frac{P(X)}{1 - P(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n
\]

How to interpret?

Logistic Regression Equation

Alternatively, we can write

\[
\frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}
\]

Or

\[
P(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}
\]
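The relationship between the linear predictor, the odds, and the probability can be checked numerically. This is a small sketch with made-up coefficients; the function names are hypothetical.

```python
import math

def linear_predictor(intercept, coefs, x):
    """b0 + b1*x1 + ... + bn*xn, i.e. the log odds."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))

def probability(intercept, coefs, x):
    """P(X) = e^z / (1 + e^z), written in the equivalent form 1 / (1 + e^-z)."""
    z = linear_predictor(intercept, coefs, x)
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logit transform: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical model: z = -1 + 0.5 * 2.0 + 2.0 * 0.25 = 0.5
p = probability(-1.0, [0.5, 2.0], [2.0, 0.25])
```

Taking `log_odds(p)` returns the linear predictor again, which is the sense in which logistic regression is linear in the logit and not in the probability.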

Understanding the Odds

\[
\frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}
\]

Exp(B) represents the ratio-change in the odds of the event of interest for a one-unit change in the predictor.

Odds & Odds Ratio

Understanding Odds
Logit = log(Odds) = log(p / (1 - p))
= log(probability of event happening / probability of event not happening)

\[
\text{Odds Ratio (OR)} = \frac{\text{Odds in favor of event } Y \mid X = X_0 + 1}{\text{Odds in favor of event } Y \mid X = X_0}
\]
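The odds ratio for a one-unit increase in a predictor equals exp(B) wherever the increase starts. The sketch below illustrates this for a single-predictor model; the coefficients and function names are hypothetical.

```python
import math

def probability(b0, b1, x):
    """P(X) for a single-predictor logistic model."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    """Odds in favor of the event: p / (1 - p)."""
    return p / (1.0 - p)

b0, b1 = -2.0, 0.7   # hypothetical coefficients

# Odds at x0 + 1 divided by odds at x0, for several starting points x0.
ratios = [
    odds(probability(b0, b1, x0 + 1.0)) / odds(probability(b0, b1, x0))
    for x0 in (0.0, 1.5, 4.0)
]
```

Every entry of `ratios` agrees with `math.exp(b1)` up to floating-point error, although the change in probability itself differs from one starting point to the next.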



Interpreting the coefficients
Response: default [Y/N] Predictor: [Account] balance
Estimated coefficients of the logistic regression model that predicts the probability of default using balance.

A one-unit increase in balance is associated with
an increase in the log odds of default by 0.0055 units, OR
a multiplicative change in the odds by exp(0.0055), i.e., about 1.0055.

Interpreting the coefficients
Probability of default for an individual with a balance of $1,000 is

\[
P(X) = \frac{e^{-10.6513 + 0.0055 \times 1000}}{1 + e^{-10.6513 + 0.0055 \times 1000}} = 0.00576 = 0.576\%
\]

Probability of default for an individual with a balance of $2,000 is

\[
P(X) = \frac{e^{-10.6513 + 0.0055 \times 2000}}{1 + e^{-10.6513 + 0.0055 \times 2000}} = 0.586 = 58.6\%
\]
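The worked example can be reproduced in a few lines. The sketch assumes an intercept of -10.6513 and a slope of 0.0055; the negative sign on the intercept is inferred, since it is the value that reproduces the 0.576% and 58.6% quoted on the slide. The names are hypothetical.

```python
import math

INTERCEPT = -10.6513   # assumed sign, inferred from the reported probabilities
SLOPE = 0.0055         # change in log odds per one-unit increase in balance

def default_probability(balance):
    """P(default | balance) = e^z / (1 + e^z) with z = INTERCEPT + SLOPE * balance."""
    z = INTERCEPT + SLOPE * balance
    return math.exp(z) / (1.0 + math.exp(z))

p_1000 = default_probability(1000)   # about 0.00576
p_2000 = default_probability(2000)   # about 0.586
odds_multiplier = math.exp(SLOPE)    # about 1.0055 per one-unit increase
```

Doubling the balance from $1,000 to $2,000 moves the predicted probability from well under 1% to more than half, even though each single unit only multiplies the odds by about 1.0055.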

Parameter Estimation

Maximum Likelihood Estimation
The objective: not to “correctly” estimate the logit, but to make better classification.
Parameters should take values which result in a score [probabilities or p] that enables us to have a good cutoff.
Meaning this “score” should be high for one class and low for the other.

Maximum Likelihood Estimation
If P(Y_i = 1 | X_i) = P(X_i) = P_i, then

\[
L_i = P_i^{Y_i} \, (1 - P_i)^{1 - Y_i}
\]

To maximize the collective form of this function for all observations:

\[
\max L = \max \prod_{i=1}^{n} L_i
\]


Log Likelihood Function
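A small sketch of the likelihood and its logarithm for binary outcomes, using hypothetical predicted probabilities. Taking logs turns the product over observations into a sum, Σ [Y_i log P_i + (1 - Y_i) log(1 - P_i)], which is the form usually maximized in practice.

```python
import math

def likelihood(y, p):
    """Product over observations of P_i^Y_i * (1 - P_i)^(1 - Y_i)."""
    total = 1.0
    for yi, pi in zip(y, p):
        total *= (pi ** yi) * ((1.0 - pi) ** (1 - yi))
    return total

def log_likelihood(y, p):
    """Sum over observations of Y_i*log(P_i) + (1 - Y_i)*log(1 - P_i)."""
    return sum(
        yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )

y = [1, 0, 1, 1]                # hypothetical observed outcomes
good = [0.9, 0.2, 0.8, 0.7]     # scores high for events, low for non-events
poor = [0.5, 0.5, 0.5, 0.5]     # uninformative scores

ll_good = log_likelihood(y, good)
ll_poor = log_likelihood(y, poor)
```

The informative scores give the higher log-likelihood, which matches the idea that good parameter values push the score up for one class and down for the other.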
45

Let’s See It In Action

Note Points
For a standard logistic regression you should ignore the Previous and Next buttons because they are for sequential (hierarchical) logistic regression.
The Method: option needs to be kept at the default value, which is Enter. "Enter" is the name given by SPSS Statistics to standard regression analysis.

SPSS Statistics requires you to define all the categorical predictor values in the logistic regression model. It does not do this automatically.
The default behaviour in SPSS Statistics is for the last category (numerically) to be selected as the reference category.
If we change the method from Enter to Forward: Wald, the quality of the logistic regression improves. Now only the significant coefficients are included in the logistic regression equation.
https://statistics.laerd.com/
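To make the "last category as reference" idea concrete, here is a sketch of indicator (dummy) coding in which the numerically last category gets no column of its own. It only illustrates the coding scheme; it is not SPSS's implementation, and the function name and data are hypothetical.

```python
def indicator_code(values):
    """Dummy-code a categorical variable, using the last sorted category as reference."""
    categories = sorted(set(values))
    reference = categories[-1]      # last category (numerically) is the reference
    kept = categories[:-1]          # one indicator column per remaining category
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, reference, rows

# Hypothetical predictor coded 1, 2, 3.
kept, reference, rows = indicator_code([1, 2, 3, 2])
```

Cases in the reference category are coded as all zeros, so each coefficient is interpreted relative to that category.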

Interpretation of SPSS Output

We do not report this

Omnibus Test Output

Using ROC to find Optimal Cut-Off
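One common way to pick a cut-off from the ROC curve is to maximize Youden's J, that is, sensitivity minus the false positive rate. The slide does not say which criterion it uses, so this choice is an assumption; the function names and data below are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (threshold, TPR, FPR) for each distinct score used as a cut-off."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points

def best_cutoff(y_true, scores):
    """Cut-off maximizing Youden's J = TPR - FPR (an assumed criterion)."""
    return max(roc_points(y_true, scores), key=lambda pt: pt[1] - pt[2])

# Hypothetical outcomes and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.45, 0.55, 0.6, 0.8, 0.9]
cutoff, tpr, fpr = best_cutoff(y_true, scores)
```

Other criteria, such as weighting false positives and false negatives by their business cost, would generally select a different point on the same curve.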

How to report the results SPSS
A logistic regression was performed to ascertain the effects of x, y, and gender on the likelihood that participants have the event (positive response).
1. The logistic regression model was statistically significant, χ²(df) = 28.605, p < 0.05 [Omnibus test].
2. A non-significant result (p = 0.78) of the Hosmer-Lemeshow test is an indicator of good model fit.
3. The pseudo R² measures for explained variation are: 56.4% (Cox & Snell R²) and 67.8% (Nagelkerke R²) [For validation data, pseudo R² …]
… … … … … … … … … … … … … … … …

How to report the results SPSS
1. The model correctly classified 81.0% (model accuracy) of cases for the training data set and 76% of cases for the validation data set.
[The data set was randomly divided into training and validation sets, with 70% of observations in the training set and the rest in the validation set.]
2. The model specificity
3. Sensitivity
At cut-off value
4. The ROC curve was used to optimize the cut-off point
… … … … … … … … … … … … … … … …
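The accuracy, sensitivity, and specificity reported above all come from the classification table at a chosen cut-off. A minimal sketch with hypothetical labels and hypothetical function names:

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_report(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical observed and predicted classes at some cut-off.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
report = classification_report(y_true, y_pred)
```

Running the same function separately on the training and validation sets gives the pair of accuracies that the report quotes.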

How to report the results SPSS
Report the results from the "Variables in the Equation" table, including which of the predictor variables were statistically significant and what predictions can be made based on the odds ratios. E.g.,
Males were 6.02 times more likely to do this (event) than females.
Increasing x was associated with an increase in the likelihood of the event, but increasing y was associated with a reduction in the likelihood of the event.
… … … … … … … … … … … … … … … …

How to report the results SPSS
Box-Tidwell (1962) Test:
Source: Andy Field

Model Evaluation

Evaluation Metrics
AIC
Null & Residual Deviance
Accuracy & Misclassification Error
Sensitivity & Specificity
ROC & AUC
Precision & Recall
Lift & Gain
KS Statistics
F Scores
FDR & FOR
FPR & FNR
Hosmer-Lemeshow Test
Customized function specific to business requirement
https://learnerworld.tumblr.com/
How to choose appropriate evaluation metrics?

ROC Curve Applet
https://kennis-research.shinyapps.io/ROC-Curves/

Other Considerations
Categorical Predictors
Accuracy Paradox
Balanced, Unbalanced & Rare Event Data
Complete or Quasi-separation
Pseudo R² Measures
Multinomial and Ordinal Logistic Regression

Homoskedasticity is not an assumption in logistic regression

Pseudo R² Measures

Evaluation Metrics
Efron’s R²
McFadden’s R²
McFadden’s Adjusted R²
Cox & Snell R²
Nagelkerke / Cragg & Uhler’s R²
McKelvey & Zavoina R²
Count R²
Adjusted Count R²
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
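Three of the measures listed above can be computed directly from the log-likelihoods of the intercept-only (null) model and the fitted model. The sketch uses the standard textbook formulas; the function names and the log-likelihood values are hypothetical.

```python
import math

def mcfadden_r2(ll_null, ll_model):
    """McFadden: 1 - LL(model) / LL(null)."""
    return 1.0 - ll_model / ll_null

def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell: 1 - exp(2 * (LL(null) - LL(model)) / n)."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke: Cox & Snell rescaled by its maximum, 1 - exp(2 * LL(null) / n)."""
    max_cs = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_cs

# Hypothetical log-likelihoods for a sample of 200 cases.
ll_null, ll_model, n = -100.0, -60.0, 200
mcf = mcfadden_r2(ll_null, ll_model)
cs = cox_snell_r2(ll_null, ll_model, n)
nk = nagelkerke_r2(ll_null, ll_model, n)
```

Because Cox & Snell cannot reach 1, Nagelkerke's rescaled value is always at least as large, which is why the two are usually reported side by side.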

References

Field, A. P. (2013). Discovering statistics using IBM SPSS Statistics: and sex and drugs and rock 'n' roll (4th ed.). London: Sage Publications.
Field, A. P., Miles, J. N. V., & Field, Z. C. (2012). Discovering statistics using R: and sex and drugs and rock 'n' roll. London: Sage Publications.
Field, A. P., & Miles, J. N. V. (2010). Discovering statistics using SAS: and sex and drugs and rock 'n' roll. London: Sage Publications.
Kothari, C. R. (2004). Research methodology: methods & techniques. New Age publications.

My Interesting answers/posts
To understand results of logistic regression or other classifiers
https://learnerworld.tumblr.com/post/152327498485/enjoystatisticswithmebinaryclassifierperformance
Hypothesis testing in layman’s terms
https://learnerworld.tumblr.com/search/hypothesis
Understanding mediation effect
https://learnerworld.tumblr.com/post/146541892120/mediation-effectenjoystatisticswtihme
Dependence vs. Correlation
https://www.quora.com/What-is-the-difference-between-dependence-and-correlation/answer/Nisha-Arora-9
Collinearity & Correlation
https://www.quora.com/In-statistics-what-is-the-difference-between-collinearity-and-correlation/answer/Nisha-Arora-9

My Expertise
Technical Topics:

Python for Data Science or Data Analysis
R Programming
Data Visualization & Storytelling
Machine Learning/Data Science
Statistics [For researchers/Data Science practitioners/university students]: theory/mathematical proofs/application based/using interactive tools/playing with data using some software
Data Analysis using SPSS
Mathematics [Don't want to write too much but depends on what is the requirement]
Excel [Basic to intermediate/tools for data analysis/operations research/operations management/specific course for academicians, etc.]

My Expertise
Non-technical Topics:

Interactive pedagogical tools/web resources
The art of effective use of Information & Communication Tools (ICT)
Tools/Platform for hosting online lectures/meetings/live sessions
Effective Googling for finding the right resources (books/research papers/answers)
Leveraging online research communities, Q/A sites, groups, meet-ups to dive deep in a particular topic of interest
Bridging the gap between industry & academia
Creating a personal brand by leveraging power of social media
Getting smart with MS Office (Word, Excel, Power Point, etc.)
Learning Google products (mentioned in the slide)
Learning how to learn
Learning how to teach
Note Taking
Effective Communication & Presentation
[email protected]

Follow me
http://stats.stackexchange.com/users/79100/nisha-arora
http://stackoverflow.com/users/5114585/nisha-arora
https://www.researchgate.net/profile/Nisha_Arora2/contributions
https://www.quora.com/profile/Nisha-Arora-9
http://learnerworld.tumblr.com/
https://www.slideshare.net/NishaArora1
https://www.youtube.com/channel/UCniyhvrD_8AM2jXki3eEErw/videos
https://www.linkedin.com/in/drnishaarora/

Any other topic which you want to hear/learn from me?
Feel free to leave a comment on my YouTube or mail me at
[email protected]

Thank You
