Regression
Used to estimate the linear relation between a dependent variable (y) and the independent variables (xi) that are associated with it:

y = β0 + β1 x1 + β2 x2 + β3 x3 + ... + e

where:
- y, x1, x2, ... are columns of data generated from a sample that capture how these variables move together
- e is a column of residuals (errors)
- the β's are estimated by the regression from the data
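As a minimal sketch of estimating the β's, assuming made-up sample data and NumPy's least-squares solver (a stats package would also report standard errors and p-values):

```python
import numpy as np

# Made-up sample data: y depends on two regressors plus noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones so beta_0 (the intercept) is estimated
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of beta_0, beta_1, beta_2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals e = y - X @ beta
e = y - X @ beta
print(beta)   # close to the true values [1.0, 2.0, -0.5]
```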
Regression Assumptions:
- y must actually be linearly related to the x's
- e is Normally distributed with mean 0 (check normal plots or a histogram)
- e has constant variance σ2 (no heteroskedasticity)
- the e's are independent of each other (no patterns in residual plots)
- the x's are not linear combinations of each other (no multicollinearity)
- all required x's are in the model (no omitted variables)
- outliers are not present (are they driving the results?)
- tests of high influence: remove the point and refit to see how much the estimates change
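The last check above can be sketched directly: drop one observation, refit, and compare the coefficients (made-up data with a planted outlier; a stats package would report formal influence measures such as Cook's distance instead):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Plant one extreme high-leverage outlier to illustrate influence
x[0], y[0] = 5.0, -20.0

X = np.column_stack([np.ones(n), x])

def fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_full = fit(X, y)                                     # all points
beta_drop = fit(np.delete(X, 0, axis=0), np.delete(y, 0)) # point removed

# A large shift in the slope flags the dropped point as influential
print(beta_full[1], beta_drop[1])
```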
Regression Continued
Key ideas:
- have a model in mind first, based on previous research, your experience, and logic (don't fish around for a result)
- understand how to examine residual plots to verify that you have a valid model (histogram, normality plot, no outliers, no heteroskedasticity)
- understand what multicollinearity is and what you should do about it
- understand what the overall F test means: Ho: all β's = 0
- understand what R2 means and how large the standard error is
- understand what the individual β test means: Ho: βi = 0
- understand how to interpret a p-value: the probability of getting a test result as large or larger than what you estimated, given that Ho is true
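A minimal sketch of computing R2, the regression standard error, and the overall F statistic from a fit (made-up data; the formulas follow the standard sums-of-squares decomposition):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 2                      # n observations, k slope regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

ss_res = e @ e                               # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()         # total sum of squares
r2 = 1 - ss_res / ss_tot                     # R-squared
s = np.sqrt(ss_res / (n - k - 1))            # standard error of the regression

# Overall F statistic for Ho: all slope betas = 0
f_stat = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
print(r2, s, f_stat)
```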
Regression Continued
p-values:
- p > 0.10 – no evidence to reject Ho
- 0.05 < p ≤ 0.10 – weak evidence to reject Ho
- 0.01 < p ≤ 0.05 – good evidence to reject Ho
- p ≤ 0.01 – strong evidence to reject Ho
1-tailed tests: Ha: β < 0 or Ha: β > 0
2-tailed tests: Ha: β ≠ 0
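A minimal sketch of turning a test statistic into 1- and 2-tailed p-values, using the standard-Normal approximation from Python's standard library (a stats package would use the t distribution with the model's degrees of freedom):

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value (Ha: beta != 0) for test statistic z, Normal approx."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def one_tailed_p_greater(z):
    """One-tailed p-value for Ha: beta > 0, Normal approximation."""
    return 1 - NormalDist().cdf(z)

print(round(two_tailed_p(1.96), 3))   # ~0.05: good evidence to reject Ho
print(round(two_tailed_p(2.58), 3))   # ~0.01: strong evidence to reject Ho
```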
Regression Continued
Be careful interpreting p-values!
- they are not the probability your model is incorrect
- they are tests of association, not cause!
- a significant p-value does not indicate whether the result is economically or meaningfully significant
- note you can fail to reject Ho for 2 reasons:
  - insufficient number of observations to statistically see the difference
  - the true β is equal to zero
- understand Type I versus Type II errors:
  - Type I: reject Ho when it is true
  - Type II: do not reject Ho when it is false
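As an illustration of the Type I error rate, a made-up simulation drawing test statistics under a true Ho (β = 0): about 5% of tests still reject at the 0.05 level, by construction:

```python
import random

random.seed(0)
trials = 10_000
crit = 1.96     # two-tailed 5% critical value (Normal approximation)

# Under Ho the test statistic is z ~ N(0, 1); count false rejections
rejections = sum(abs(random.gauss(0, 1)) > crit for _ in range(trials))
rate = rejections / trials
print(rate)     # close to 0.05: the Type I error rate we chose
```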
Regression Continued
Categorical variables:
- if a variable has k levels (or attributes), use dummies to code up k-1 levels
- the omitted level gets picked up by the intercept, i.e. the "Base Case"
- the β on each dummy captures the difference between that level and the Base Case
- choose which level to omit based on the tests you would like to run
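A minimal sketch of k-1 dummy coding in plain Python, assuming a hypothetical three-level variable with "small" chosen as the omitted Base Case:

```python
levels = ["small", "medium", "large"]                # k = 3 levels
base = "small"                                       # omitted -> picked up by intercept
dummy_levels = [lv for lv in levels if lv != base]   # code k-1 = 2 dummies

def encode(value):
    """Return the k-1 dummy columns for one observation."""
    return [1 if value == lv else 0 for lv in dummy_levels]

data = ["small", "large", "medium", "small"]
rows = [encode(v) for v in data]
print(rows)   # [[0, 0], [0, 1], [1, 0], [0, 0]]
```

The Base Case rows are all zeros, so its mean lands in the intercept and each dummy's β is the shift relative to it.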