SC968 Panel data methods for sociologists, Lecture 1, part 1
SC968 Panel data methods for sociologists. Lecture 1, part 1: A review of concepts for regression modelling (or things you should know already)
Overview: models (OLS, logit and probit), mathematically and practically; interpretation of results, measures of fit and regression diagnostics; model specification; post-estimation commands; STATA competence.
Ordinary Least Squares (OLS)
The model: y_i = β_0 + β_1·x_1i + β_2·x_2i + … + β_K·x_Ki + ε_i
where y_i is the value of the dependent variable for individual i (the LHS variable); x_1i is the value of explanatory variable 1 for person i; β_1 is the coefficient on variable 1; β_0 is the intercept (constant); and ε_i is the residual (disturbance, error term). The total number of explanatory variables (RHS variables or regressors) is K.
Examples: y_i = mental health, with x_1 = sex, x_2 = age, x_3 = marital status, x_4 = employment status, x_5 = physical health; or y_i = hourly pay, with x_1 = sex, x_2 = age, x_3 = education, x_4 = job tenure, x_5 = industry, x_6 = region.
OLS
In vector form: y_i = x_i′β + ε_i, where x_i is the vector of explanatory variables and β is the vector of coefficients.
In matrix form: y = Xβ + ε.
Note: you will often see x′β written as xβ.
OLS
Also called “linear regression”. Assumes the dependent variable is a linear combination of the explanatory variables, plus a disturbance. “Least squares”: the β’s are estimated so as to minimise the sum of the squared ε’s.
Basic Assumptions
Disturbances are iid: normally distributed, zero mean, constant variance.
- Residuals have zero mean.
- It follows that the ε’s and the X’s are uncorrelated. Violated if a regressor is endogenous, e.g. number of children in female labour supply models. Cure by (e.g.) instrumental variables.
- Homoscedasticity: all ε’s have the same variance. Classic violation: food consumption and income. Cure by using weighted least squares.
- Nonautocorrelation: the ε’s are uncorrelated with each other. Violated in data sets where the same individual appears multiple times. Adjust the standard errors: the ‘cluster’ option in STATA.
When is OLS appropriate? When you have a continuous dependent variable: e.g., you would use it to estimate regressions for height, but not for whether a person has a university degree. When the assumptions are not obviously violated. And as a first step in research to get ball-park estimates; we will use OLS a lot for this purpose.
Worked examples: coefficients, p-values, t-statistics; measures of fit (R-squared, adjusted R-squared); thinking about specification; post-estimation commands; regression diagnostics.
A note on the data: all examples (in lectures and practicals) are drawn from a 20% sample of the British Household Panel Survey (BHPS) – more about the data later!
Summarize monthly earned income
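The command behind this slide is shown only as output; a minimal sketch, assuming the monthly labour income variable in the teaching extract is called paygu (this name is an assumption, substitute your extract's income variable):

  * Summarize monthly earned income for people earning at least £1
  summarize paygu if paygu >= 1, detail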
First worked example
Monthly labour income, for people whose labour income is >= £1. For illustrative purposes only – not an example of good practice.
Reading the analysis of variance (ANOVA) table: R-squared = Model SS / Total SS; MS = SS/df; Root MSE = sqrt(residual MS). The F-statistic tests whether all coefficients except the constant are jointly zero. t-statistic = coefficient / standard error; the 95% confidence interval is the coefficient plus or minus 1.96 standard errors.
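A hedged sketch of the kind of command that would produce this output; the regressor names (female, age, age2, partner, degree, month) are assumptions about the teaching extract, not the slide's exact specification:

  * OLS on monthly labour income for earners; variable names are illustrative
  regress paygu female age age2 partner degree month if paygu >= 1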
What do the results tell us?
All coefficients except month of interview are significant. 29% of the variation is explained. Being female reduces income by nearly £600 per month. Income goes up with age and then down. 16458 observations… oops, this is from panel data, so there are repeated observations on individuals.
Add ,cluster(pid) as an option
Coefficients, R-squared etc. are unchanged from the previous specification, but the standard errors are adjusted: standard errors are larger, t-statistics are lower.
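A sketch of the clustered version, with the same assumed variable names as above (modern Stata also accepts vce(cluster pid)):

  * Cluster standard errors on the person identifier pid
  regress paygu female age age2 partner degree month if paygu >= 1, cluster(pid)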
Let’s get rid of the “month” variable, and think about the female coefficient a bit more: could it be to do with women working shorter hours?
Control for weekly hours of work
Is the coefficient on hours of work reasonable? £5.65 for every additional hour worked – certainly in the right ball park.
Looking at 2 specifications together
R-squared jumps from 29% to 46%. The coefficient on female goes from -595 to -315: almost half the effect of gender is explained by women’s shorter hours of work. The age, partner and education coefficients are also reduced in magnitude, for similar reasons. The number of observations falls from 16460 to 13998 – missing data on hours.
Interesting post-estimation activities
Is the effect of university qualifications statistically different from the effect of secondary education?
At what age does income peak? With Income = … + β_1·age + β_2·age², we have d(Income)/d(age) = β_1 + 2·β_2·age. The derivative is zero when age = -β_1/(2·β_2) = -79.552/(2 × -0.873) = 45.5.
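Both activities can be done with standard post-estimation commands; a sketch, assuming regressors named univ, secondary, age and age2 (these names are assumptions):

  * Wald test: is the university coefficient equal to the secondary-education one?
  test univ = secondary
  * Age at which predicted income peaks, with a delta-method standard error
  nlcom -_b[age]/(2*_b[age2])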
A closer look at “partner” coefficient
Men who are part of a couple earn much more than men who are not – women less so. Other coefficients also differ between men and women, but with the current specification we can’t test whether the differences are significant (see the sketch below).
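One way to make such tests possible is to interact sex with the other regressors; a minimal sketch using factor variables (variable names assumed as before):

  * Interact sex with partnership status; ## includes the main effects automatically
  regress paygu i.female##i.partner age age2 if paygu >= 1, cluster(pid)
  * Test whether the partner premium differs between men and women
  test 1.female#1.partner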
Logit and Probit
Developed for discrete (categorical) dependent variables, e.g. psychological morbidity, whether one has a job… think of other examples. The outcome variable is always 0 or 1. We estimate Pr(y = 1) = F(x′β). OLS (the linear probability model) would set F(x, β) = x′β + ε. This is inappropriate because:
- Heteroscedasticity: the outcome variable is always 0 or 1, so ε can only take the value -x′β or 1 - x′β.
- More seriously, one cannot constrain the estimated probabilities to lie between 0 and 1.
Logit and Probit
Solution: we need a link function that will transform our dichotomous Y into a continuous form Y′. We are looking for a function which lies between 0 and 1:
- Cumulative normal distribution: the Probit model, Pr(y = 1) = Φ(x′β) – z-scores assuming the cumulative normal distribution Φ.
- Logistic distribution: the Logit (logistic) model, Pr(y = 1) = exp(x′β)/(1 + exp(x′β)) – the logged odds of the probability are linear in x′β.
They are very similar! [Plot of the two link functions: note how both lie between 0 and 1 on the vertical axis.]
Maximum likelihood estimation
Likelihood function: the product of Pr(y = 1) = F(x′β) over all observations where y = 1, and of Pr(y = 0) = 1 - F(x′β) over all observations where y = 0. (Think of the probability of flipping exactly four heads and two tails with six coins.) The log likelihood is written as
log L = Σ_{y=1} log F(x′β) + Σ_{y=0} log[1 - F(x′β)]
Estimated using an iterative procedure: STATA chooses starting values for the β’s, computes the slopes of the likelihood function at these values, adjusts the β’s accordingly, and stops when the slope of the LF is ≈ 0. This can take time!
Let’s look at whether a person works:

  * work = 1 if jbstat indicates paid work (codes 1 or 2), 0 for other valid
  * statuses, and missing when jbstat itself is missing
  gen byte work = (jbstat == 1 | jbstat == 2) if jbstat >= 1 & jbstat != .
Logit regression: whether have a job
The output shows all the iterations. The LR chi² statistic is 2 × (LL of this model - LL of the null model). The pseudo-R² is a measure of the amount explained, but with a less intuitive interpretation. From these coefficients we can tell whether the estimated effects are positive or negative, whether they’re significant, and something about effect sizes – but it is difficult to draw inferences from the coefficients themselves.
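A sketch of the command that would produce this output (the explanatory variables are assumptions, carried over from the OLS examples):

  * Logit of whether the person works, with person-clustered standard errors
  logit work female age age2 partner, cluster(pid)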
Comparing logit and probit
Scaling factor proposed by Amemiya (1981): multiply Probit coefficients by 1.6 to get an approximation to Logit. Other authors have suggested a factor of 1.8.
Marginal effects
After logit or probit estimation, use the margins command. It calculates the marginal effect of each RHS variable on the dependent variable: the slope of the function for continuous variables, and the effect of a change from 0 to 1 for a dummy variable. It can also provide predicted probabilities, linear combinations, plots, and much more!
MEM: Marginal Effects at the Means – margins, dydx(*) atmeans
AME: Average Marginal Effects – margins, dydx(*)
MER: Marginal Effects at Representative Values – margins, dydx(*) at(age=(20 30 40 50))
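A sketch putting the three variants together after a logit (same assumed regressors as above; note the at() numlist needs parentheses):

  logit work female age age2 partner
  margins, dydx(*) atmeans                 // MEM: marginal effects at the means
  margins, dydx(*)                         // AME: average marginal effects
  margins, dydx(*) at(age=(20 30 40 50))   // MER: at representative ages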
Marginal effects
The logit and probit marginal effects are very similar indeed. OLS is actually not too bad.
Odds ratios
Only an option with logit. Add “or” after the comma as an option. Reports odds ratios: that is, how many times greater (or smaller) the odds of the outcome become if the variable is 1 rather than 0, in the case of a dichotomous variable, or for each unit increase of the variable, for a continuous variable. Results >1 show increased odds, results <1 show decreased odds.
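A one-line sketch, reusing the assumed specification from above:

  * Same logit, reported as odds ratios rather than coefficients
  logit work female age age2 partner, or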
Other post-estimation commands
Likelihood ratio test: “lrtest”. Adding an extra variable to the RHS always increases the likelihood – but does it add “enough” to the likelihood? The LR test takes the ratio L0/L1 (L_restricted/L_unrestricted) and forms -2·log(L0/L1), a chi-squared statistic with d.f. equal to the number of variables you are dropping. Null hypothesis: the restricted specification. Only works on nested models, i.e. where the RHS variables in one model are a subset of the RHS variables in the other.
How to do it: run the full model; type “estimates store NAME”; run a smaller model; type “estimates store ANOTHERNAME”; … and so on for as many models as you like; then type “lrtest NAME ANOTHERNAME”.
Be careful: the sample sizes must be the same for both models. This won’t happen if a dropped variable is missing for some observations. Solve the problem by running the biggest model first and using e(sample), as in the sketch below.
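A sketch of the whole sequence, including the e(sample) trick; the regional dummies region2-region18 are an assumption about how the extract codes regions:

  * Biggest model first, so its estimation sample can be reused
  logit work female age age2 partner region2-region18
  estimates store full
  * Restricted model, forced onto the same sample via e(sample)
  logit work female age age2 partner if e(sample)
  estimates store restricted
  lrtest full restricted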
LR test - example
A similar but not identical regression to the previous examples. Add regional variables, then decide which ones to keep. It looks as though Scotland might stay; possibly also SW, NW and N.
LR test - example
- Reject dropping all regional variables, against keeping the full set.
- Don’t reject dropping all but 4, against keeping the full set.
- Don’t reject dropping all but Scotland, against keeping the full set.
- Don’t reject dropping all but Scotland, against dropping all but 4.
- [And just to check: DO reject dropping all regional variables against dropping all but Scotland.]
(Legend: REJECT = reject the nested specification; DON’T REJECT = don’t reject the nested specification.)
Model specification
Again, the specification is illustrative only – this is not an example of a “finished” labour supply model! How could one improve the model? Theoretical considerations; empirical considerations; parsimony; stepwise regression techniques; regression diagnostics; interpreting results; spotting “unreasonable” results.
Other models
Other models to be aware of, but not covered on this course:
- Extensions to logit and probit. Ordered models (ologit, oprobit) for ordered outcomes: levels of education; number of children; excellent, good, fair or poor health. Multinomial models (mlogit, mprobit) for multiple outcomes with no obvious ordering: working in the public, private or voluntary sector; choice of nursery, childminder or playgroup for pre-school care.
- Heckman selection model, for modelling two-stage procedures: earnings, conditional on having a job at all. Having a job is modelled as a probit; earnings are modelled by OLS. Used particularly for women’s earnings.
- Tobit model, for censored or truncated data. Typically for data where there are lots of zeros: expenditure on rarely-purchased items, e.g. cars; or children’s weights, in an experiment where the scales broke and gave a minimum reading of 10kg.
Competence in STATA
You will get the best results from this course if you already know how to use STATA competently. Check you know how to: get data into STATA (use and using commands); manipulate data (merge, append, rename, drop, save); describe your data (describe, tabulate, table); create new variables (gen, egen); work with subsets of data (if, in, by); do basic regressions (regress, logit, probit); run sessions interactively and in batch mode; and organise your datasets and do-files so you can find them again. If you can’t do these, upgrade your knowledge ASAP! You could enroll in STATA NetCourse 101: it costs $110, the ESRC might pay, and courses run regularly – www.stata.com
SC968 Panel data methods for sociologists Lecture 1, part 2 Introducing Longitudinal Data
Overview Cross-sectional and longitudinal data Types of longitudinal data Types of analysis possible with panel data Data management – merging, appending, long and wide forms Simple models using longitudinal data
Cross-sectional and longitudinal data
First, draw the distinction between macro- and micro-level data. Micro level: firms, individuals. Macro level: local authorities, travel-to-work areas, countries, commodity prices. Both may exist in cross-sectional or longitudinal forms. We are interested in micro-level data, but macro-level variables are often used in conjunction with micro-data.
Cross-sectional data contain information collected at a given point in time (more strictly, during a given time window), e.g. the European Social Survey (ESS) and the Programme for International Student Assessment (PISA). Many cross-sectional surveys are repeated, but on different individuals. Longitudinal data contain repeated observations on the same subjects.
Types of longitudinal data
- Time-series data, e.g. commodity prices, exchange rates.
- Repeated interviews at irregular intervals: the UK cohort studies – NCDS (1958), BCS (1970), MCS (2000).
- Repeated interviews at regular intervals: “panel” surveys, usually at annual intervals, sometimes two-yearly – BHPS, SLID, PSID, SOEP. Some surveys have both cross-sectional and panel elements; panels are more expensive to collect. The LFS and EU-SILC both have a “rolling panel” element.
- Other sources of longitudinal data: retrospective data (e.g. work or relationship histories) and linkage with external data (e.g. tax or benefit records) – particularly in Scandinavia. These may be present in either cross-sectional or longitudinal data sets.
Analysis with longitudinal data
The “snapshot” versus the “movie”: essentially, longitudinal data allow us to observe how events evolve – to study “flows” as well as “stocks”. Example: unemployment. Cross-sectional analysis shows a steady 5% unemployment rate. Does this mean that everyone is unemployed one year out of twenty? That 5% of people are unemployed all the time? Or something in between? Very different implications for equality, social policy, etc.
The BHPS Interviews about 10,000 adults in about 6,000 households Interviews repeated annually People followed when they move People join the sample if they move in with a sample member Household-level information collected from “head of household” Individual-level information collected from people aged 17+ Young people aged 11-16 fill in a youth questionnaire BHPS is now part of Understanding Society Much larger and wider-ranging survey 40,000 households Data set used for this course is a 20% sample of BHPS, with selected variables
The BHPS
All files are prefixed with a letter indicating the year, and all variables within each file are also prefixed with this letter: 1991 = a, 1992 = b… and so on. There are several files each year, containing different information:
- hhsamp: information on sample households
- hhresp: household-level information on households that actually responded
- indall: information on all individuals in responding households
- indresp: information on respondents to the main questionnaire (adults)
- egoalt: file showing the relationship of household members to one another
- income: incomes
There are extra files each year containing derived variables (work histories, net income files), and others with occasional modules, e.g. life histories in wave 2: bjobhist, blifemst, bmarriag, bcohabit, bchildnt.
Person and household identifiers
The BHPS (along with other panels such as the ECHP and SOEP) is a household survey – so everyone living in a sample household becomes a member. We need identifiers to: associate the same individual with him- or herself in different waves; and link members of the same household with each other in the same wave – the HID identifier. Note: there is no such thing as a longitudinal household! Household composition changes, household location changes… HID is a cross-sectional concept only!
[Screenshot: what it looks like – 4 waves of data, sorted by pid and wave. Observations in rows, variables in columns; blue stripes show where one individual ends and another begins. Annotations: one person not present at the 2nd wave; a child, so no data on job or marital status; one person surveyed twice in their 70th year.]
(Can also use the ,nol option to display the numeric codes rather than the value labels.)
Joining data sets together Adding extra observations: “append” command Adding extra variables: “merge” command
Whether appending or merging
The data set you are using at the time is called the “master” data; the data set you want to combine it with is called the “using” data. Make sure you can identify observations properly beforehand, and that you can identify observations uniquely afterwards.
Appending
Use this command to add more observations. Relatively easy. Check first that you are really adding observations you don’t already have (or that, if you are adding duplicates, you really want to do this). Syntax: append using using_data. STATA simply sticks the “using” data on the end of the “master” data, re-ordering the variables if necessary. If the using data contain variables not present in the master data, STATA sets those variables to missing for the master observations (and vice versa if the master data contain variables not present in the using data).
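A minimal sketch with hypothetical file names:

  use wave1_data, clear        // the "master" data
  append using wave2_data     // observations from the "using" data go on the end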
Merging is more complicated
Use “merge” to add more variables to a data set. Notice that the two data sets below don’t contain the same observations.

Master data: age.dta
  pid    wave  age
  28005  1     30
  19057  1     59
  28005  2     31
  19057  3     61
  19057  4     62
  28005  4     33

Using data: sex.dta
  pid    wave  sex
  19057  1     female
  19057  3     female
  28005  1     male
  28005  2     male
  28005  4     male
  42571  1     male
  42571  3     male

merge 1:1 pid wave using sex

Result:
  pid    wave  age  sex     _merge
  19057  1     59   female  3
  19057  3     61   female  3
  19057  4     62   .       1
  28005  1     30   male    3
  28005  2     31   male    3
  28005  4     33   male    3
  42571  1     .    male    2
  42571  3     .    male    2
Merging STATA creates a variable called _merge after merging 1: observation in master but not using data 2: observation in using but not master data 3: observation in both data sets Options available for discarding some observations – see manual
More on merging
The previous example showed one-to-one merging: not every observation was in both data sets, but every observation in the master data was matched with at most one observation in the using data – and vice versa.
Many-to-one merging: household-level data sets contain only one observation per household (usually fewer than one per person); regional data (e.g. regional unemployment data) usually contain one observation per region. Sample syntax: merge m:1 hid wave using hhinc_data

Master data (person-level):
  hid   pid    age
  1604  19057  59
  2341  28005  30
  3569  42571  59
  4301  51538  22
  4301  51562  4
  4956  59377  46
  5421  64966  70
  6363  76166  77
  6827  81763  71
  6827  81798  72

Using data (household-level):
  hid   h/h income
  1604  780
  2341  1501
  3569  268
  4301  394
  4956  1601
  5421  225
  6363  411
  6827  743

Result:
  hid   pid    age  h/h income
  1604  19057  59   780
  2341  28005  30   1501
  3569  42571  59   268
  4301  51538  22   394
  4301  51562  4    394
  4956  59377  46   1601
  5421  64966  70   225
  6363  76166  77   411
  6827  81763  71   743
  6827  81798  72   743

One-to-many merging: job and relationship files contain one observation per episode (potentially more than one per person); income files contain one observation per source of income (potentially more than one per person). Sample syntax: merge 1:m pid wave using births_data
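A sketch of the many-to-one merge above, with a check on the _merge variable (the master file name is hypothetical; hhinc_data is the slide's):

  use person_level_data, clear            // person-level master data
  merge m:1 hid wave using hhinc_data     // one household record per hid/wave
  tabulate _merge                         // check how the observations matched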
Long and wide forms
The data we have here are in “long” form: one row for each person/wave combination, as in the screenshot from a few slides back.
Wide form
However, it’s also possible to put longitudinal data into “wide” form: one observation per person, with different variables relating to different years of data – age at wave 1, and so on. Sex doesn’t change [usually], so it needs only one variable.
The reshape command
Switching from long to wide: reshape wide [stubnames], i(id) j(year). In the BHPS, this becomes reshape wide [stubnames], i(pid) j(wave).
What are stub names? A list of the variables which vary between years; variables like sex or ethnicity would not normally be included in this list.
Switching from wide to long is exactly the opposite: reshape long [stubnames], i(id) j(wave). Lots more information and examples in the STATA manual; see the sketch below.
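A sketch with assumed stub names age and work (assuming these are the time-varying variables kept in the file):

  * Long to wide: creates age1, age2, ... and work1, work2, ... per pid
  reshape wide age work, i(pid) j(wave)
  * And back to long again
  reshape long age work, i(pid) j(wave)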
Simple models using longitudinal data Auto-regressive and time-lagged models Models of change
But first: the GHQ
We use this for lots of analysis in the lectures and practical sessions. General Health Questionnaire: there are different versions; the BHPS carries the GHQ-12, with 12 questions. Have you recently:
- been able to concentrate on whatever you’re doing?
- lost much sleep over worry?
- felt that you were playing a useful part in things?
- felt capable of making decisions about things?
- felt constantly under strain?
- felt you couldn’t overcome your difficulties?
- been able to enjoy your normal day to day activities?
- been able to face up to problems?
- been feeling unhappy or depressed?
- been losing confidence in yourself?
- been thinking of yourself as a worthless person?
- been feeling reasonably happy, all things considered?
Each question is answered on a 4-point scale: not at all – no more than usual – rather more – much more.
GHQ
HLGHQ2 is the caseness scale: it recodes answers 3–4 as 1 (and 1–2 as 0) and adds up across the questions. Scores above 2 are used to indicate psychological morbidity.
Time-lagged models
Start with a simple OLS model of the Likert score, a measure of psychological wellbeing derived from a battery of questions.
Generate lagged variable
NB: the “in 1/30” here is just so the listing will fit on the page. You should check many more observations than this!
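The generating code is shown only as a screenshot; a sketch of the usual idiom (the lag variable name is an assumption):

  sort pid wave
  * The lag is defined only where the previous row is the same person, one wave earlier
  gen lag_LIKERT = LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
  list pid wave LIKERT lag_LIKERT in 1/30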
OLS, with lagged dependent variable
R-squared rockets from 5% to 26%. There is a big and very significant coefficient on the lagged variable. The coefficient on “ue_sick” falls from 3.6 to 2.1. It is also possible to include lagged explanatory variables.
Models of change
Start with an OLS model [simplified, but imagine more variables], with a separate model for each year – the suffix denotes the year:
y_i1 = β_0 + β_1·x_i1 + ε_i1
y_i2 = β_0 + β_1·x_i2 + ε_i2
Subtract the 1st from the 2nd model:
y_i2 - y_i1 = β_1·(x_i2 - x_i1) + (ε_i2 - ε_i1)
Or, express it in terms of change:
Δy_i = β_1·Δx_i + Δε_i
Generate difference variables

  capture drop dif*
  sort pid wave
  gen dif_LIKERT  = LIKERT - LIKERT[_n-1]   if pid == pid[_n-1] & wave == wave[_n-1] + 1
  gen dif_age     = age - age[_n-1]         if pid == pid[_n-1] & wave == wave[_n-1] + 1
  gen dif_age2    = age2 - age2[_n-1]       if pid == pid[_n-1] & wave == wave[_n-1] + 1
  gen dif_female  = female - female[_n-1]   if pid == pid[_n-1] & wave == wave[_n-1] + 1
  gen dif_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
  gen dif_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1

Check you understand why dif_female will [very nearly] always be zero.
Check for sensible results!
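A sketch of the sort of checks meant here, using the difference variables generated above:

  * Eyeball the differences against the originals for the first few people
  list pid wave LIKERT dif_LIKERT ue_sick dif_ue_sick in 1/30, sepby(pid)
  * Distributions: most differences should be zero or small
  summarize dif_LIKERT dif_ue_sick dif_partner
  tabulate dif_ue_sick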
More checking….
Issues Most differences are zero Moving into unemployment or partnership is given equal and opposite weighting to moving out. No real reason why this should be the case There are MUCH better ways to use these data! Nevertheless, let’s proceed!
Results
Female drops out. The coefficients on sick and partner are significant.