eduworks-network.eu
facebook.com/eduworksnetwork
@EduworksNetwork
This project has been funded with support from the European Commission.
This communication reflects the views only of the author, and the Commission cannot be held responsible for any use which may be
made of the information contained therein.
Web Data and Labour
Market Matching
Brian Fabo (CEU, CEPS, CELSI, CentERdata)
Eduworks fellow (9/2015 – 7/2017)
Eduworks final workshop. June 26 2017
Chapter I Web Data SoA
• 2 underlying drivers:
• Rise of the Internet (as a research object and source of data)
• Societal developments have made traditional data sources less useful
(growing non-response in surveys, a more mobile population,
fragmentation of the labour market)
• There is a wide variety of sources (web surveys (including web
panels), vacancies (with platforms), social media, Google
Trends…)
Chapter 1a – Labour Market
Matching SoA
• Huge number of jobs out there; systematizing them is a challenge
• Job minus its specific context = bundle of tasks = occupation
(Tijdens 2015)
• This is also supported by qualitative research, which shows
that employers think of tasks when hiring
• Ability to perform tasks depends on skills
There is so much you need to learn,
young Jedi
“It is, in fact, amazing how little labor economists know
about the actual mechanics of how workers get assigned
to jobs.”
Peter J. Kuhn
Labor market matching logic
• As long as there is a vacancy and a sufficiently skilled
candidate, matching can happen (unless prevented by, e.g., the
employer's inability to pay the demanded wage, or by legislation
such as a minimum wage)
• Mismatch = a situation where there are both available workers and
vacancies, but matching does not happen
What prevents matching?
• It is not the level of skills: mismatch also exists in societies with
an abundant skilled but unemployed workforce.
• Rather, it appears that, due to labor market changes (technological
progress, polarization, fragmentation, teleworking,
outsourcing/offshoring, AI and deep learning), demand has
shifted to workers who are a) capable of controlling the machines and
b) able to communicate. This requires specific skills (IT, languages),
which are often lacking.
• In other words, it is increasingly about transferable skills, about which
we know little and can learn very little from most traditional datasets.
Hence a role for web data
Chapter Ib Web Data SoA
• Research increasingly published in top journals
• But nearly always as a “prototype” or methodological inquiry.
Subject-matter research based on web data is still in its infancy
• Representativeness worries used to dominate, but are now being
mitigated (in high-income countries, everyone is online (Askitas
and Zimmermann 2015))
Nonetheless, the Brazil problem
• Brazil is the country of the future!
• Brazil has always been the country of the future!
• Brazil will always be the country of the future!
• Task before us: Moving from exploration to installation phase.
Chapter II Survey
• Two types of web surveys
• Probabilistic (CentERpanel, “offline” sampling frame)
• Non-probabilistic (self-selected participants)
• Probabilistic web surveys (esp. panels) are the dominant method of
data collection
• Non-probabilistic surveys are more experimental (but e.g.
WageIndicator-based research is also increasingly published)
State of the Art
• WageIndicator survey
• Can reliably identify the effect of the main labour market variables
• Cannot reliably capture the strength of these effects
• The reason is that the population surveyed by WI differs
from the general population (e.g. Tijdens and Steinmetz for African
countries)
• Experiments with weighting have not been very successful: weighting
mainly increases standard errors, raising the chance of false negatives
(de Pedraza)
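Why weighting inflates standard errors can be illustrated with Kish's approximate effective sample size, (sum of weights)^2 / (sum of squared weights). The sketch below is only an illustration of that general formula; the weights are hypothetical and this is not the procedure used in the cited weighting experiments.

```python
def kish_effective_sample_size(weights):
    """Kish's approximate effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def standard_error_inflation(weights):
    """Approximate factor by which weighting inflates the standard error of a mean."""
    n = len(weights)
    return (n / kish_effective_sample_size(weights)) ** 0.5


# Hypothetical weights: equal weights vs. one under-represented respondent weighted up
equal = [1.0, 1.0, 1.0, 1.0]
uneven = [1.0, 1.0, 1.0, 3.0]

print(kish_effective_sample_size(equal))
print(kish_effective_sample_size(uneven))
print(standard_error_inflation(uneven))
```

The more uneven the weights needed to correct a self-selected sample, the smaller the effective sample and the wider the confidence intervals, which is where the higher risk of false negatives comes from.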
Our contribution
• We took 10 years (2005-2014) of WI data from the Netherlands
(by far the strongest dataset) to see how the sample
composition changes
• Interestingly enough, not much. The WI sample seems to be biased,
but in a systematic way, which can be treated through controls.
                       2005  2006  2007  2008  2009  2010  2011  2012  2013  2014
SILC (N = 5k)
  mean experience      17.2  18.0  18.6  19.2  19.3  19.5  19.4  19.7  20.2  20.5
  sd experience         9.8   9.6   9.7   9.8   9.8   9.6   9.9  10.1  10.0  10.0
  mean education       14.0  14.1  14.2  14.3  14.4  14.5  14.6  14.6  14.8  14.9
  sd education          3.0   3.0   3.0   2.9   2.8   2.9   2.9   2.9   3.0   2.9
WI (N = 50k)
  mean experience      14.8  14.7  16.1  16.8  17.5  17.4  17.6  17.8  16.8  16.7
  sd experience         9.8   9.9  10.2   9.9  10.0   9.9  10.0  10.1  10.1  10.3
  mean education       14.3  14.5  14.2  14.3  14.0  14.3  14.4  14.4  14.7  14.8
  sd education          2.8   2.8   2.9   2.9   2.9   3.0   2.9   2.8   2.9   2.8
Source: Fabo and Kahanec: The Potential for Using Voluntary Web Surveys Beyond Exploratory Research
(upcoming)
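One way to read the table is to compute the year-by-year gap between the WI and SILC means. The sketch below does this with the figures copied from the table; the helper name `yearly_gaps` is hypothetical and the comparison is a simple illustration, not the authors' analysis.

```python
YEARS = list(range(2005, 2015))

# Means copied from the table (SILC vs. WageIndicator, Netherlands)
SILC_EXPERIENCE = [17.2, 18.0, 18.6, 19.2, 19.3, 19.5, 19.4, 19.7, 20.2, 20.5]
WI_EXPERIENCE = [14.8, 14.7, 16.1, 16.8, 17.5, 17.4, 17.6, 17.8, 16.8, 16.7]
SILC_EDUCATION = [14.0, 14.1, 14.2, 14.3, 14.4, 14.5, 14.6, 14.6, 14.8, 14.9]
WI_EDUCATION = [14.3, 14.5, 14.2, 14.3, 14.0, 14.3, 14.4, 14.4, 14.7, 14.8]


def yearly_gaps(web, reference):
    """Web-survey mean minus reference mean, one value per year."""
    return [round(w - r, 1) for w, r in zip(web, reference)]


experience_gaps = yearly_gaps(WI_EXPERIENCE, SILC_EXPERIENCE)
education_gaps = yearly_gaps(WI_EDUCATION, SILC_EDUCATION)

for year, exp_gap, edu_gap in zip(YEARS, experience_gaps, education_gaps):
    print(f"{year}: experience {exp_gap:+.1f}, education {edu_gap:+.1f}")
```

The experience gap has the same sign in every year and the education gap stays small, which is the kind of stable, one-directional bias that controls can plausibly absorb.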
Pooled estimates with year dummies

                                Men SILC       Men WI         Women SILC     Women WI
Years of potential experience   0.0217***      0.0197***      0.0168***      0.0193***
                                (0.00128)      (0.000756)     (0.00115)      (0.000693)
Years of experience squared     -0.000309***   -0.000251***   -0.000265***   -0.000300***
                                (2.98e-05)     (1.86e-05)     (2.78e-05)     (1.77e-05)
Years of education              0.0474***      0.0392***      0.0443***      0.0334***
                                (0.000986)     (0.000685)     (0.000979)     (0.000684)
Source: Fabo and Kahanec: The Potential for Using Voluntary Web Surveys Beyond Exploratory Research (upcoming)
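Whether the SILC and WI coefficients differ in strength can be checked roughly with a z statistic for the difference of two estimates. This is a minimal sketch that assumes the two samples are independent; it is not a test reported in the source, and the function name is hypothetical.

```python
import math


def coefficient_difference_z(b1, se1, b2, se2):
    """z statistic for b1 - b2, assuming the two estimates are independent."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)


# (coefficient, standard error) pairs copied from the table, men only
MEN_SILC = {"experience": (0.0217, 0.00128), "education": (0.0474, 0.000986)}
MEN_WI = {"experience": (0.0197, 0.000756), "education": (0.0392, 0.000685)}

z_scores = {
    name: coefficient_difference_z(*MEN_SILC[name], *MEN_WI[name])
    for name in MEN_SILC
}

for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
```

Under that assumption, the experience coefficients for men are statistically indistinguishable across the two sources, while the education coefficients share sign and significance but differ in magnitude.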
Conclusion
• Non-probabilistic surveys might have potential even
beyond exploratory analysis
• We should understand better the way the sample is
formed (as it does not appear to be random) - this is
the key
• If we understand the sample, treatment of the bias
through econometric tools seems quite straightforward
Chapter III Job vacancies
• Analysis of web content is a trend in and beyond academia
(social media mining allegedly won an unwinnable election for
Trump)
• Increasing attention from economists (e.g. the Billion Prices Project,
with a publication in AER)
• Vacancies in particular are very popular, offering information that is
complex but quite structured
• Slovakia is a powerhouse in the field (Mytna Kurekova and co.
within the framework of NEUJOBS and InGRID). Progress within
Eduworks
State of the Art
• Not much thought given to representativeness (an exception is de Pedraza,
who found that the number of vacancies is quite dependable when
controlled for the economic cycle and detrended)
• A lot of exploratory research (Mytna Kurekova, Beblavy,
Eduworks guys…)
• Common methods: text mining, machine learning
• Data collection increasingly systematized (Textkernel, Burning
Glass)
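The detrending step mentioned in the first bullet can be sketched as follows. The source does not say which detrending method was used, so the linear OLS trend here is an assumption, the vacancy counts are hypothetical, and the control for the economic cycle is not shown.

```python
def linear_detrend(series):
    """Remove an OLS-fitted linear time trend; returns the residuals."""
    n = len(series)
    times = range(n)
    mean_t = sum(times) / n
    mean_y = sum(series) / n
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, series)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in zip(times, series)]


# Hypothetical monthly vacancy counts scraped from one portal
vacancy_counts = [1000, 1040, 1100, 1090, 1180, 1210, 1230, 1300]
residuals = linear_detrend(vacancy_counts)

print([round(r, 1) for r in residuals])
```

The residuals are what remains once the portal's own growth is removed; these, rather than the raw counts, would be compared against an official vacancy series.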
Our Contributions
1. Focus on metadata (the occupation observatory project):
regular collection not of vacancies but of portal-specific
vacancy classifications (occupation tags)
2. Focus on the platforms (Uber, TaskRabbit), which make it possible to
observe the matching process in its entirety, although with limited
scope so far
Chapter IV Languages
• RQ: What is the role of foreign language knowledge in
labour market matching?
• SoA: Languages mainly studied in connection with migrant
integration, though the theoretical literature increasingly
recognizes the importance of the topic
• Scope: Visegrad countries. Interesting due to a large role of
TNCs and relatively low extent to which local populations
are proficient in English (Eurobarometer)
Method
• Scraped the share of vacancies requiring foreign language
skills from the most important portal in each of the countries
• Metadata-based analysis, evaluated on two tags: the
“occupation” tag (coded to ISCO using the CASCOT
application) and the language tag (based on the employer’s
indication; overall, only English and German were found relevant)
• Compared with wages calculated for each occupation using
the WI survey
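The tag-based share computation described above might look like the sketch below. The record layout (an occupation tag paired with a set of language tags), the function name, and the sample data are all hypothetical; real portals expose these tags in their own formats, and the ISCO coding step via CASCOT is not shown.

```python
from collections import defaultdict


def language_requirement_share(vacancies, language):
    """Share of vacancies per occupation tag that list `language` as required."""
    totals = defaultdict(int)
    requiring = defaultdict(int)
    for occupation, languages in vacancies:
        totals[occupation] += 1
        if language in languages:
            requiring[occupation] += 1
    return {occ: requiring[occ] / totals[occ] for occ in totals}


# Hypothetical (occupation tag, language tags) pairs as they might come from a portal
sample = [
    ("software developer", {"English"}),
    ("software developer", {"English", "German"}),
    ("shop assistant", set()),
    ("shop assistant", {"English"}),
    ("shop assistant", set()),
    ("shop assistant", set()),
]
shares = language_requirement_share(sample, "English")

for occupation, share in shares.items():
    print(f"{occupation}: {share:.0%} of vacancies require English")
```

Each occupation's share can then be paired with the occupation-level wage computed from the WI survey.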
Result: English language
requirement correlates with wages
[Figure: chart with one fitted trend line per country (Czechia, Hungary, Slovakia); horizontal axis 0% to 120%, vertical axis 0 to 14. Fitted lines shown: y = 4.936x + 3.054; y = 2.0458x + 2.8214; y = 3.5278x + 3.197]
Also present in the individual level WI data

                                (1)            (2)            (3)
Years of Experience             0.0295***      0.0252***      0.0212***
                                (0.00419)      (0.00416)      (0.00404)
Years of Experience Squared     -0.000582***   -0.000511***   -0.000397***
                                (0.000110)     (0.000109)     (0.000106)
Education                       0.0444***      0.0330***      0.0273***
                                (0.00491)      (0.00508)      (0.00492)
Woman                           -0.181***      -0.191***      -0.165***
                                (0.0238)       (0.0249)       (0.0241)
English skill level (reference category: no English)
- Basic                         0.0313         0.0214         -0.0186
                                (0.0337)       (0.0332)       (0.0320)
- Rather skilled                0.154***       0.128***       0.0618*
                                (0.0377)       (0.0374)       (0.0360)
- Skilled                       0.364***       0.306***       0.197***
                                (0.0417)       (0.0418)       (0.0409)
Country and year dummies        YES            YES            YES
Basic occupation dummies        NO             YES            YES
Extended occupation dummies     NO             NO             YES
Observations                    1,988          1,988          1,947
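To read the "Skilled" coefficients as wage premiums, one common conversion is exp(b) - 1. The sketch below assumes the dependent variable is the logarithm of wages, which the slide does not state; if the regression was specified differently, the conversion does not apply.

```python
import math


def log_points_to_percent(coefficient):
    """Convert a coefficient from a log-wage regression into a percent difference."""
    return 100.0 * (math.exp(coefficient) - 1.0)


# "Skilled" English coefficients from columns (1)-(3) of the table
SKILLED_ENGLISH = {"(1)": 0.364, "(2)": 0.306, "(3)": 0.197}

premiums = {col: log_points_to_percent(b) for col, b in SKILLED_ENGLISH.items()}

for col, pct in premiums.items():
    print(f"column {col}: about {pct:.1f}% higher wage than no English")
```

The premium shrinks as occupation dummies are added, which suggests that part of the raw association runs through sorting into better-paid occupations.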
Chapter V Computer skills
• Puzzle: Determining the relevance of computer skills for
individual occupations
• SoA: Skills typically proxied by education only; a more
detailed look has emerged only recently, in connection with web
data (Visintin and Tijdens)
• Combination of H2 2017 WageIndicator data and job
vacancies from Textkernel
Methodology
• WageIndicator data: “Where do you use computer or
tablet?”. Counted the share answering “at work” or “both at work and
at home” per occupation.
• Job vacancies: text-mined computer requirements per
occupation
• Two approaches to aggregating occupations: based on the
apparent usefulness of computers for the tasks associated with
them, or based on task complexity
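The survey-side computation can be sketched as below. The answer labels and sample responses are hypothetical, and aggregation by ISCO major group (the first digit of the 4-digit code) is used here only as a simple stand-in for the two aggregation approaches named above, which the slides do not spell out.

```python
from collections import defaultdict

# Hypothetical answer labels; the actual WageIndicator coding may differ
AT_WORK_ANSWERS = {"at work", "both at work and at home"}


def computer_use_share_by_major_group(responses):
    """Share of respondents using a computer at work, per ISCO major group.

    `responses` holds (4-digit ISCO code as a string, answer) pairs; the major
    group is the first digit of the code.
    """
    totals = defaultdict(int)
    users = defaultdict(int)
    for isco_code, answer in responses:
        major_group = isco_code[0]
        totals[major_group] += 1
        if answer in AT_WORK_ANSWERS:
            users[major_group] += 1
    return {group: users[group] / totals[group] for group in sorted(totals)}


# Hypothetical survey responses
responses = [
    ("2512", "both at work and at home"),  # software developers
    ("2511", "at work"),                   # systems analysts
    ("5223", "at home"),                   # shop sales assistants
    ("5223", "at work"),
    ("9112", "nowhere"),                   # cleaners and helpers
    ("9112", "at home"),
]
shares = computer_use_share_by_major_group(responses)

for group, share in shares.items():
    print(f"ISCO major group {group}: {share:.0%} use a computer at work")
```

Keeping ISCO codes as strings preserves leading zeros, which matters for the armed forces group.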
Occupation complexity dimension
[Figure: chart by ISCO occupation group comparing two series, WI (share of self-reported use of computer) and Vacancies (share of job vacancies demanding computer skills); two percentage axes, 0% to 18% and 0% to 100%]
Conclusions
• Web data are a huge, exciting and diverse field
• Increasingly mainstream in academia, yet still short of
applications for the most part
• Nonetheless, robust analysis is increasingly possible,
though the need for benchmarking against a
representative data source is still present
Thank you for your attention

Brian Fabo
