Pablo de Pedraza
JRC, Ispra; AIAS, University of Amsterdam
Survey vs scraped data: comparing time series properties of web and survey vacancy data
Survey vs scraped data: comparing web and survey vacancy data
- More and more activities take place online, including the matching process between labour supply and labour demand.
- Big Data enthusiasm mainly comes from the commercial sector (profit), BUT does it serve credible science (truth)?
- "Vs" definition: Velocity, Variety, Volatility, Validity & Veracity
- Statistical offices: noise, bias, accuracy...
- BUT the scientific data generation process can provide advice about the quality and validity of various sources of Big Data.
- CASE STUDY (quality test): the relationship between the population of vacancies collected online and the population inferred by a traditional method
- A preliminary step before benefiting from richer data
Survey vs scraped data: comparing web and survey vacancy data
1.- Data generation process & data quality
2.- Comparison strategy
- Test whether there is a significant long-term connection
- Time series components
3.- Results
4.- Conclusions
1. Data generation & data quality
National Statistical Office of the Netherlands (CBS)
- Stratified random sample of companies
- Phone survey of employers
- Population numbers are inferred with sound and scientifically valid statistical methods
By-product of employers' activity: Textkernel
- Scraping online vacancies since 2017
- Provides figures on scraped vacancies to the Netherlands Employment Office
- Does not include vacancies that are not posted online (newspapers, supermarkets)
Big Data definitions (Einav and Levin 2013, Schroeder 2014): i) available in real time, ii) large in size, iii) covers aspects difficult to observe using traditional methods, iv) unstructured...
- Cleaned, structured and aggregated to give it a meaningful structure for our purpose
- New vacancies by quarter over seven years (28 quarters)
2. Comparison strategy
Two-step comparison
0.- Visual inspection of the two time series
1.- Is there a statistically significant connection? Are they generated by the same underlying phenomenon?
- Cointegration analysis, Augmented Dickey-Fuller (ADF) test
- If the WEB and NSO time series are cointegrated, then we can assume that they have been generated by the same long-term phenomenon, namely the real number of vacancies in the Dutch labour market.
2.- We decompose the time series into their main components: Y = TC*S*I (Trend-Cycle, Seasonal, Irregular)
- Similar cycles and seasonal components in the NSO and WEB time series would confirm the hypothesis that both series have been generated by the same underlying phenomenon.
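The multiplicative decomposition Y = TC*S*I can be illustrated with a classical ratio-to-moving-average sketch. This is only an illustration under stated assumptions: it is not necessarily the decomposition method used in the study, the function name decompose_multiplicative is hypothetical, and the quarterly series below is synthetic rather than the WEB or NSO data.

```python
from statistics import mean

def decompose_multiplicative(y, period=4):
    """Classical ratio-to-moving-average decomposition, Y = TC * S * I.

    Hypothetical helper for illustration; assumes an even period (4 = quarterly).
    """
    n = len(y)
    half = period // 2
    # A centered moving average estimates the trend-cycle (TC).
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half:t + half + 1]
        # End points get half weight so the window spans exactly one year.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    # Ratios Y / TC contain the seasonal and irregular parts (S * I).
    ratios = [y[t] / trend[t] if trend[t] is not None else None for t in range(n)]
    # Average the ratios per quarter, then normalise so the indices average to 1.
    raw = []
    for q in range(period):
        raw.append(mean(ratios[t] for t in range(q, n, period) if ratios[t] is not None))
    scale = mean(raw)
    seasonal = [raw[t % period] / scale for t in range(n)]
    # Whatever is left after removing TC and S is the irregular component (I).
    irregular = [y[t] / (trend[t] * seasonal[t]) if trend[t] is not None else None
                 for t in range(n)]
    return trend, seasonal, irregular

# Synthetic example: 28 quarters with a linear trend and a fixed seasonal pattern.
true_seasonal = [1.10, 0.90, 1.05, 0.95]
y = [(100 + 2 * t) * true_seasonal[t % 4] for t in range(28)]
trend, seasonal, irregular = decompose_multiplicative(y)
print("estimated seasonal indices:", [round(s, 3) for s in seasonal[:4]])
```

Running the same function on both series and comparing the resulting trend-cycle and seasonal components side by side is one way to carry out step 2.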
3. Results
0.- Visual inspection: very similar
- Higher variance at the beginning
- Maximums and peaks
- NSO is always higher
3. Results
1.- Unit root: we reject the null hypothesis of non-stationarity
- The series share the same underlying trend
- Generated by the same phenomenon: the number of vacancies in the NL labour market
Spread regression and unit root tests (dep. NSO)
coeff. -b/a: 0.743
test statistic: -4.94 (ADF test), -21.457 (PP test)
p-value: > 0.01 0.0198 0.946
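The logic of a residual-based cointegration check can be sketched in two steps: regress one series on the other, then test the residual spread for a unit root. The sketch below is an assumption-laden illustration, not the specification behind the figures above (which cannot be recovered from the slide): the helper names ols and df_stat are hypothetical, the data are synthetic, and the Dickey-Fuller regression uses no lag augmentation.

```python
import random

def ols(x, y):
    """Simple OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def df_stat(u):
    """Dickey-Fuller t-statistic from the regression du_t = rho * u_{t-1} + e_t."""
    lag = u[:-1]
    diff = [b - a for a, b in zip(u[:-1], u[1:])]
    sxx = sum(l * l for l in lag)
    rho = sum(l * d for l, d in zip(lag, diff)) / sxx
    resid = [d - rho * l for l, d in zip(lag, diff)]
    sigma2 = sum(e * e for e in resid) / (len(diff) - 1)
    return rho / (sigma2 / sxx) ** 0.5

random.seed(0)
# Synthetic stand-in: a random-walk vacancy level observed by both sources.
level = [100.0]
for _ in range(27):
    level.append(level[-1] + random.gauss(0, 5))
web = [0.7 * v + random.gauss(0, 1) for v in level]  # hypothetical WEB series
nso = [v + random.gauss(0, 1) for v in level]        # hypothetical NSO series

# Step 1: long-run regression of NSO on WEB.
a, b = ols(web, nso)
# Step 2: unit root test on the residual spread.
spread = [y_nso - (a + b * x_web) for y_nso, x_web in zip(nso, web)]
stat = df_stat(spread)
print("slope:", round(b, 3), "DF statistic on spread:", round(stat, 2))
```

A strongly negative statistic points towards a stationary spread, but it has to be compared with critical values for residual-based cointegration tests (for example MacKinnon's tables) rather than standard t-tables. In practice a library routine such as `statsmodels.tsa.stattools.coint` would normally be preferred, since it handles lag selection and p-values.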
3. Results
2.- Time series components
- Web: 1st, 3rd
- NSO: 2nd, 4th
4. Conclusions
Similar time series properties
Cointegration
- Similar underlying trends in the long term.
Non-parametric decomposition method
- A similar impact of the economic crisis and business cycles.
- Different behaviour over short periods of time.
Research agenda
- How does the comparison above apply to each sector of the economy?
Pablo de Pedraza AIAS, Amsterdam Institute for Advanced Labour Studies, University of Amsterdam 2016 Thank you
By sectors
6/19 sectors where the activity is a bit below but is catching up and follows a similar evolution:
- B Mining & quarrying
- C Manufacturing
- F Construction
- G Wholesales, retail trade & repair motor
- H Transport & storage
- O Public Administration & Social security
9/19 sectors where the activity level is very similar and follows the same evolution:
- D Electricity, gas, steam supply
- J Information and communication
- K Financial Institutions
- L Renting and buying of real estate
- M Consultancy, research & other specialized services
- P Education
- Q Health & social work
- R Culture, sports & recreation
- S Other services
1/19 sector where the whole activity is not captured but the evolution is the same:
- I Accommodation and food
1/19 sector with a similar level but differences in the ups and downs:
- E water sup
2/19 sectors where there are big differences:
- N Renting & leasing
- A Agriculture