Simple & Multiple Regression Analysis

9,121 views 20 slides Jun 23, 2017

About This Presentation

Regression Analysis is simplified in this presentation. Moving from simple linear to multiple regression analysis, it covers all the statistics and the interpretation of various diagnostic plots. Besides, how to verify regression assumptions and some advanced concepts of choosing best models makes the ...


Slide Content

REGRESSION ANALYSIS
Shailendra Tomar

Index
• What is Regression Analysis?
• Simple Regression Theory
• Example 1: House Price Model
• Run Simple Regression Using SAS
• Steps & Assumptions of Regression
• Multiple Regression Analysis
• Significance Testing
• Coefficient of Determination
• Example 2: Credit Card Model
• Model Selection
• Verify Regression Assumptions
• Regression Diagnostics
• Run Multiple Regression Using SAS

WHAT IS REGRESSION ANALYSIS?
• When two or more things are related to each other and we want to quantify the relationship between them, regression analysis is the right technique
• It goes beyond correlation by creating a mathematical equation to estimate or predict values within the range framed by the data
• The regression procedure requires one dependent variable and one or more independent variables
• The dependent variable (also known as the outcome or response variable) is modelled as a function of the independent variables (also called explanatory or predictor variables)
• Regression analysis examines the associative relationships between these variables
• It is commonly used in forecasting, time series modelling, financial analysis, and market research to study cause-and-effect relationships between variables
Figure: Scatter Diagram

SIMPLE REGRESSION THEORY
• Let's begin with simple linear regression, which is easier to understand
• Remember the linear equation 'y = mx + c' from high school, which fits a straight line to data
• In simple regression, this equation becomes 'y = β₀ + β₁x + ε', where y is the dependent variable and x is the independent variable
• β₀, like the y-intercept c, is the estimated value of y when x is zero; β₁, like the slope m, is the estimated change in the average value of y resulting from a unit change in x; and ε is the error
• The error term is needed because the relationship is not exact, and because the model is estimated from a sample rather than the population, the sample estimates generally differ from the true population values
• That is why the Ordinary Least-Squares (OLS) procedure is used to select the parameter estimates (b₀ and b₁) that minimize the sum of the squared differences between y and ŷ and so determine the best-fitting line
• The objective is always to minimize the error, which is the difference between the observed values and the predicted values generated by the model 'ŷ = b₀ + b₁x'
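For reference, minimizing the sum of squared errors has a well-known closed-form solution when there is a single predictor. The block below states it in the notation of this slide; the sample means x̄ and ȳ and the sample size n are symbols introduced here for illustration and do not appear in the original slides.

```latex
\min_{b_0,\,b_1} \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
\quad\Longrightarrow\quad
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```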

EXAMPLE 1: HOUSE PRICE MODEL
• A real estate company wants to examine the relationship between the selling price of a home (in $1000s) and its size (in square feet) for a specific region
• It selects a random sample of 10 houses
• The scatter plot of the data points shows a positive linear relationship
• The larger the house, the higher its price

STATISTICS: HOUSE PRICE MODEL
Dependent Variable (Y): House Price (in $1000s)
Independent Variable (X): Size (in square feet)

R-Square     0.5808      Dependent Mean   286.5
Adj R-Sq     0.5284      Coeff Var        14.42594
Root MSE     41.33032    Observations     10
Parameters   2

Analysis of Variance (ANOVA)
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   18935            18935         11.08     0.0104
Error              8   13666            1708.19565
Corrected Total    9   32601

Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1   98.24833             58.03348         1.69      0.1289
X           Size         1   0.10977              0.03297          3.33      0.0104

INTERPRETATION: HOUSE PRICE MODEL
• First, look at the ANOVA results: the Pr > F value (0.0104) is less than 0.05, meaning the null hypothesis is rejected
• Second, the R-Square value is 0.58082, which means that 58.08% of the variation in house prices is explained by the size of the house
• The regression model makes sense only when it fits the data better than the baseline model, meaning the slope of the regression line is not equal to zero
• From the parameter estimates, the House Price Model is ŷ = 98.24833 + 0.10977x
• Since the prices are in thousands of dollars, each additional square foot increases the average value of a house by 0.10977 ($1000) = $109.77
• For example, the expected price of a 2000 square foot house would be 98.24833 + 0.10977 × 2000 = 317.788 ($1000), or about $317,790
• Estimation and prediction should happen only within the range of the data that was used for the regression analysis; otherwise the results are doubtful
• The remaining statistics will be discussed in Multiple Regression Analysis
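The prediction rule and the "stay within the data range" caveat can be sketched in a few lines of Python. This is only an illustration: the coefficients are the ones reported in the Parameter Estimates table, the range limits (1100 and 2450 square feet) are the smallest and largest sizes in the 10-house sample, and the function name predict_price is hypothetical.

```python
# Coefficients reported in the Parameter Estimates table (price in $1000s).
B0 = 98.24833
B1 = 0.10977

# Smallest and largest house sizes in the 10-house sample (square feet).
SIZE_MIN, SIZE_MAX = 1100, 2450


def predict_price(size_sqft):
    """Return the fitted price in $1000s, refusing to extrapolate."""
    if not SIZE_MIN <= size_sqft <= SIZE_MAX:
        raise ValueError(
            f"{size_sqft} sq ft is outside the sampled range "
            f"{SIZE_MIN}-{SIZE_MAX}; the fitted line is unreliable there"
        )
    return B0 + B1 * size_sqft


price = predict_price(2000)
print(f"2000 sq ft -> ${price * 1000:,.0f}")
```

Raising an error for out-of-range sizes is one possible design choice; returning the value with a warning would serve equally well.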

RUN SIMPLE REGRESSION USING SAS
• Copy and paste the code below into the SAS program editor
/* Read the 10 sampled houses: Y = price ($1000s), X = size (sq ft) */
DATA House;
  input Y X;
  label Y = 'House Price in $1000s';
  label X = 'Size in Square Feet';
  datalines;
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
;
ods graphics on;
title1 'Simple Regression Analysis';
title2 'House Price Model';
/* Fit Y on X and request only the fit plot */
proc reg data=House plots(only)=fitplot;
  model Y = X;
run;
quit;
ods graphics off;
title;
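As a cross-check that does not require SAS, the same fit can be reproduced with the closed-form least-squares formulas. The sketch below uses only the Python standard library and the ten observations from the data step above; the function name simple_ols is made up for this illustration.

```python
# The ten observations from the SAS data step: size in sq ft, price in $1000s.
size = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def simple_ols(x, y):
    """Fit y = b0 + b1*x by ordinary least squares and return key statistics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # error
    ssm = sst - sse                                                # model
    return {
        "b0": b0,
        "b1": b1,
        "r2": ssm / sst,
        "f": (ssm / 1) / (sse / (n - 2)),  # one predictor, n - 2 error DF
    }


fit = simple_ols(size, price)
for name, value in fit.items():
    print(f"{name}: {value:.5f}")
```

The printed values should agree with the SAS output tables up to rounding.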

STEPS & ASSUMPTIONS OF REGRESSION
Step 1
Formulate the problem
Step 2
Define dependent & independent variables
Step 3
Build the general model
Step 4
Plot the scatter diagram
Step 5
Estimate the parameters
Step 6
Estimate the regression coefficient
Step 7
Test for significance
Step 8
Find the strength of the association
Step 9
Check the prediction accuracy
Step 10
Examine the residuals
Step 11
Cross-validate the model
• Linearity of the phenomenon measured, meaning the mean of the dependent variable is linearly related to the independent variable
• Errors are normally distributed with a mean of zero
• Errors have equal variances, or in other words the variance of the error term is constant (homoscedasticity)
• Errors are independent, meaning uncorrelated

MULTIPLE REGRESSION ANALYSIS
• More powerful, as it involves a single dependent variable and two or more independent variables
• The dependent variable should be interval-scaled, and the other variables metric or appropriately transformed
• It analyzes the impact of a set of independent variables on the dependent variable
• The equation for multiple regression is 'y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε', where y is the dependent variable and x₁, x₂, …, xₙ are the independent variables
• The predicted values are generated by the model 'ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ', where b₀, b₁, b₂, …, bₙ are the estimators of β₀, β₁, β₂, …, βₙ
• The model parameters are estimated using the Ordinary Least-Squares (OLS) procedure, which minimizes the sum of the squared differences between y and ŷ and so determines the best-fitting model
• Before performing multiple regression, it is always recommended to check the correlation among the independent variables to avoid multicollinearity issues

SIGNIFICANCE TESTING
• Significance testing provides justification for accepting or rejecting a given hypothesis
• In ANOVA, the null hypothesis is that all population means are equal and the alternative hypothesis is that not all of the population means are equal. It is assumed that the populations are normal and that they have equal variances. In regression, the equivalent null hypothesis is that all slope coefficients are zero, so the model is no better than the baseline model
• To test the hypothesis, the F ratio is calculated; it has to be higher than the critical value of the F distribution (which depends on the degrees of freedom) to show that the model fits the data better than the baseline model
• The results include a p-value, which should be lower than 0.05 to conclude that a relationship exists between the dependent and independent variables
• The significance of the individual model parameters is tested in a similar manner, but using t-test statistics
• In regression, there are three types of sums of squares: variation explained by the model (SSM), unexplained variation or error (SSE), and total variation (SST)
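The link between the three sums of squares and the F ratio can be written out explicitly. In the block below, p (the number of independent variables) and n (the number of observations) are symbols introduced here for illustration; the worked line plugs in the values from the House Price Model ANOVA table.

```latex
\mathrm{SST} = \mathrm{SSM} + \mathrm{SSE},
\qquad
F = \frac{\mathrm{SSM} / p}{\mathrm{SSE} / (n - p - 1)}

% House Price Model (p = 1, n = 10), using the ANOVA table values:
F = \frac{18935 / 1}{13666 / 8} \approx 11.08
```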

COEFFICIENT OF DETERMINATION
• The coefficient of determination (R²) expresses the strength of association
• R² = SSM / SST
• It measures the percentage of the variation in the dependent variable that is explained by the independent variables
• A value of R² closer to 1 means the regression line fits the data well (R² = 1 is a perfect fit), whereas a value closer to 0 means it does not fit the data well
• R² never decreases when more independent variables are added to the model, so the results can be misleading
• After the first few variables are added, additional independent variables do not make much contribution
• Adjusted R² corrects for this by penalizing the number of independent variables, so it rises only when an added variable genuinely improves the fit
• For example, in the R² values below, variables beyond the third do not add any value to the model
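The usual formula for adjusted R² makes the penalty explicit. As before, n (observations) and p (independent variables) are symbols introduced here; the worked line uses the House Price Model values reported earlier in the deck.

```latex
R^2_{\text{adj}} = 1 - \left( 1 - R^2 \right) \frac{n - 1}{n - p - 1}

% House Price Model: R^2 = 0.5808, n = 10, p = 1
R^2_{\text{adj}} = 1 - (1 - 0.5808) \cdot \frac{9}{8} \approx 0.5284
```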

EXAMPLE 2: CREDIT CARD MODEL
• A bank wants to predict the number of credit cards that a family uses (Y) based on the following data: Family Number (ID), Family Size (X1), Family Income in thousand dollars (X2), and Number of automobiles owned (X3)
• A sample of 8 families is used in the analysis
• The objective is to find a model that predicts better than the baseline, that is, with a smaller sum of squared prediction errors

Family ID   Actual No. of      Baseline Prediction   Prediction Error   Prediction Error
            Credit Cards (y)   (ŷ = ȳ)               (y - ȳ)            Squared (y - ȳ)²
1           4                  7                     -3                 9
2           6                  7                     -1                 1
3           6                  7                     -1                 1
4           7                  7                      0                 0
5           8                  7                      1                 1
6           7                  7                      0                 0
7           8                  7                      1                 1
8           10                 7                      3                 9
Total       56                 ȳ = 56/8 = 7           0                 22
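The baseline column can be reproduced with a few lines of Python: the baseline model simply predicts the sample mean for every family, and its total squared error is the benchmark any regression model has to beat. The variable names are illustrative only.

```python
# Actual number of credit cards for the 8 families in the table.
cards = [4, 6, 6, 7, 8, 7, 8, 10]

# The baseline model predicts the sample mean for every family.
baseline = sum(cards) / len(cards)

# Prediction errors (y - y_bar) and their squares.
errors = [y - baseline for y in cards]
sse_baseline = sum(e ** 2 for e in errors)

print("baseline prediction:", baseline)
print("sum of errors:", sum(errors))
print("sum of squared errors:", sse_baseline)
```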

STATISTICS: CREDIT CARD MODEL
Dependent Variable (Y): No. of Credit Cards
Independent Variables (X1 & X2): Family Size & Family Income

R-Square     0.8614     Dependent Mean   7.0
Adj R-Sq     0.8059     Coeff Var        11.157
Root MSE     0.78099    Observations     8
Parameters   3

Analysis of Variance (ANOVA)
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   18.95027         9.47514       15.53     0.0072
Error              5   3.04973          0.60995
Corrected Total    7   22

Parameter Estimates
Variable    Label           DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept        1   0.48169              1.46141          0.33      0.7551
X1          Family Size      1   0.63224              0.25231          2.51      0.0541
X2          Family Income    1   0.21585              0.10801          2.00      0.1021

INTERPRETATION: CREDIT CARD MODEL
• The ANOVA results show that the Pr > F value (0.0072) is less than 0.05, meaning the null hypothesis is rejected and a linear relationship exists between Y and X1 & X2
• In this model, the unexplained variation (error sum of squares) is 3.04973, which is much less than that of the baseline model (where the prediction error squared is 22)
• The R-Square value is 0.8614, which means that 86.14% of the variation in the number of credit cards is explained by this model
• When we included X3, the adjusted R-square decreased. Hence, we did not include X3 in this model, as it was statistically insignificant
• From the parameter estimates, ŷ = 0.482 + 0.63*X1 + 0.216*X2
• Assuming the family size (X1) is 4 and its annual income (X2) is 17.5 (in thousand dollars), the predicted number of credit cards would be 6.782 (using the above equation). Rounding this up to 7 credit cards introduces a difference of 0.218
• Estimation and prediction should happen only within the range of the data that was used for the regression analysis; otherwise the results are doubtful

MODEL SELECTION
• For effective modeling, one should always choose the best model, validate the regression assumptions, and detect influential observations and collinearity
• Let's understand model selection. Whether the regression is run manually or using stepwise selection, the objective is always a better model that explains more variation (a higher value of R-square) without adding variables that do not contribute
• Stepwise regression is often used when there are many variables, because this method automatically selects a combination of variables based on their p-values
• Below is the summary of statistics, which shows how each variable entered in the model influenced R-square and adjusted R-square
• The p-value of the X3 coefficient is more than the significance level of 0.05, and adding X3 lowers the adjusted R-square from 0.8059 to 0.7761, suggesting that the variable be dropped from the model (the Pr > F column below is the overall F test of each model)

Variables in model   R-Square   Adjusted R-Square   F Value   Pr > F
X1                   0.7506     0.7091              18.06     0.0054
X1, X2               0.8614     0.8059              15.53     0.0072
X1, X2, X3           0.8720     0.7761              9.09      0.0294
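The comparison in the table can be sketched in Python by recomputing adjusted R-square from each R-square and picking the model where it peaks. Each row is read here as the model after that variable has entered (one, two, then three predictors), with n = 8 families; the function and variable names are illustrative, and small differences from the table come from rounding R-square to four decimals.

```python
N_OBS = 8  # families in the credit card sample

# (variables in the model, number of predictors, R-square from the table)
models = [
    ("X1", 1, 0.7506),
    ("X1, X2", 2, 0.8614),
    ("X1, X2, X3", 3, 0.8720),
]


def adjusted_r2(r2, n, p):
    """Adjusted R-square for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


scored = [(name, adjusted_r2(r2, N_OBS, p)) for name, p, r2 in models]
for name, adj in scored:
    print(f"{name:<12} adjusted R-square = {adj:.4f}")

# R-square keeps rising with every added variable, but adjusted R-square
# peaks at the two-predictor model.
best_name, best_adj = max(scored, key=lambda item: item[1])
print("preferred model:", best_name)
```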

VERIFY REGRESSION ASSUMPTIONS
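• To confirm the normality of the error term, check the histogram of the residuals and its distribution curves
• Looking at the residual plot, one can verify two other assumptions, equal variance and independence: the errors should be scattered randomly around zero
• The same residual plot also checks linearity: a curved pattern in the residuals indicates a non-linear relationship. The non-zero intercept on the previous slides (0.482) does not by itself verify the linearity assumption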

REGRESSION DIAGNOSTICS
Influential observations
The R-square value can be affected by outliers or influential observations, so it is necessary to look at the RStudent plot. Usually, observations whose studentized residual is greater than 2 in absolute value are considered outliers (3 for large sample sizes). Cook's D, DFFITS and DFBETAS are other useful statistics.
Multicollinearity
It occurs when two or more independent variables are highly correlated with each other, which leads to instability in the regression model. To measure the magnitude of collinearity in a model, VIF (Variance Inflation Factor) is used; values up to 10 are generally accepted.

Variables   VIF
X1          1.82692
X2          1.93492
X3          1.09976

In the credit card example, there is no collinearity issue, as all VIF values are well below 10.
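The VIF values can be reproduced without SAS. VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones; with exactly three predictors, that R² has a closed form in terms of the pairwise correlations, which the sketch below uses. The data are the X1, X2 and X3 columns of the credit card sample, and the helper names pearson and vif are made up for this illustration.

```python
from math import sqrt

# Predictor columns from the credit card sample (8 families).
x1 = [2, 2, 4, 4, 5, 5, 6, 6]          # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]  # family income in $000
x3 = [1, 2, 2, 1, 3, 2, 1, 2]          # automobiles owned


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def vif(target, other_a, other_b):
    """VIF of `target` when the model also contains `other_a` and `other_b`."""
    r_ta = pearson(target, other_a)
    r_tb = pearson(target, other_b)
    r_ab = pearson(other_a, other_b)
    # R-square of regressing target on the two other predictors.
    r2 = (r_ta ** 2 + r_tb ** 2 - 2 * r_ta * r_tb * r_ab) / (1 - r_ab ** 2)
    return 1 / (1 - r2)


vifs = {
    "X1": vif(x1, x2, x3),
    "X2": vif(x2, x1, x3),
    "X3": vif(x3, x1, x2),
}
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.5f}")
```

For more than three predictors the closed form no longer applies, and each R² would have to come from a full regression fit.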

RUN MULTIPLE REGRESSION USING SAS
• Copy and paste the code below into the SAS program editor
/* Read the 8 sampled families */
DATA CreditCard;
  INPUT ID Y X1 X2 X3;
  LABEL ID = 'Family Number';
  LABEL Y = 'Number of Credit Cards';
  LABEL X1 = 'Family Size';
  LABEL X2 = 'Family income in $000';
  LABEL X3 = 'Number of cars owned';
  DATALINES;
1 4 2 14 1
2 6 2 16 2
3 6 4 14 2
4 7 4 17 1
5 8 5 18 3
6 7 5 21 2
7 8 6 17 1
8 10 6 25 2
;
ODS GRAPHICS ON;
TITLE1 'Multiple Regression Analysis';
TITLE2 'Credit Card Model';
/* Fit Y on X1 and X2; all plot requests go in one PLOTS= list */
PROC REG DATA=CreditCard
  PLOTS(ONLY)=(RESIDUALHISTOGRAM RESIDUALBYPREDICTED RSTUDENTBYPREDICTED
               COOKSD DFFITS DFBETAS DIAGNOSTICS);
  MODEL Y = X1 X2;
RUN;
QUIT;
ODS GRAPHICS OFF;
TITLE;
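As a cross-check outside SAS, the two-predictor model can be fitted by solving the normal equations (XᵀX)b = Xᵀy directly. The sketch below uses only the Python standard library; the helper names solve and ols are made up for this illustration, and the plain Gauss-Jordan solver is a simple stand-in that is adequate for a tiny, well-conditioned problem like this one rather than a general-purpose routine.

```python
# Columns from the credit card data step (8 families).
y = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards
x1 = [2, 2, 4, 4, 5, 5, 6, 6]          # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]  # family income in $000


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, response):
    """Return (coefficients, r_squared); coefficients[0] is the intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(response))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, response)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(response) / len(response)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(response, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in response)
    return beta, 1 - sse / sst


beta, r2 = ols([x1, x2], y)
print("intercept, b1, b2:", [round(b, 5) for b in beta])
print("R-square:", round(r2, 4))
```

The coefficients and R-square should agree with the Parameter Estimates and fit statistics reported for the Credit Card Model, up to rounding.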

Thank You