Squeezing biggish job market data
onto a laptop
Alan Mark Berg BSc.MSc.PGCE.
a.m.berg@uva.nl
Agenda
•Overview
• Who am I and what am I doing?
• Area of research
• Technique
•Example Results:
• Stereotypes
• Female
• IT
• Discrimination
•Questions and Answers
• Refinements
• References
Who am I?
1. A rather mature, external PhD candidate in Learning Analytics.
2. Hard science background: physics, microelectronics with computational engineering, experimental science
3. Pragmatic: for the last 17 years involved in the design and development of large-scale IT systems @UvA
○ Wishes to use the simplest technique possible for a given task.
4. Author of 4 books
5. Busy with open source communities.
○ Considers them the best place to curate software
6. Stephan and Gabor are my co-supervisors. Prof Robin Boast is my supervisor.
7. Status: In the process of writing up the research and then finishing the PhD.
8. Initial infrastructure and standards papers published (see references)
Technique
❑ 3 million UK job adverts (1,150 million words). Thank you, Monsterboard.
❑ Simplest possible scenario
❑ Bag of words
❑ Unigrams
❑ Perl to process the text
❑ R language: inferential statistics and visualization
❑ CATA (computer-aided text analysis): frequency of words
❑ Mapped job dataset to SOC 2010 occupation categories
❑ Merged with the UK Labour Force Survey via the SOC 2010 categories
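The pipeline above can be sketched in a few lines. The talk used Perl for text processing and R for statistics; this is a Python stand-in, not the original code. The word lists, the sample adverts, and the function names are hypothetical, and the SOC 2010 codes are only illustrative labels.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries; the actual word lists are not given in the slides.
DICTIONARIES = {
    "female": {"supportive", "caring", "collaborative"},
    "male": {"competitive", "dominant", "ambitious"},
}

TOKEN = re.compile(r"[a-z]+")


def unigram_counts(text):
    """Bag of words: lowercase unigram frequencies, word order discarded."""
    return Counter(TOKEN.findall(text.lower()))


def dictionary_share(text, words):
    """Fraction of an advert's tokens that belong to a dictionary."""
    counts = unigram_counts(text)
    total = sum(counts.values())
    hits = sum(counts[w] for w in words)
    return hits / total if total else 0.0


def share_by_soc(adverts, words):
    """Pool adverts per SOC 2010 code and compute the dictionary share per code."""
    hits, totals = Counter(), Counter()
    for soc_code, text in adverts:
        counts = unigram_counts(text)
        totals[soc_code] += sum(counts.values())
        hits[soc_code] += sum(counts[w] for w in words)
    return {soc: hits[soc] / totals[soc] for soc in totals if totals[soc]}


# Invented sample adverts, each already mapped to an illustrative SOC 2010 code.
adverts = [
    ("2136", "Ambitious developer wanted for a competitive team"),
    ("6141", "Caring and supportive nursing assistant"),
]

print(share_by_soc(adverts, DICTIONARIES["female"]))
```

Because only running counts per category are kept, the adverts can be streamed one at a time, which is what keeps a corpus of this size within a laptop's memory. The per-category shares can then be joined to survey figures on the SOC code.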
UK Labour Market Survey
Representation in adverts
❏ Qualifications +
❏ Male +
❏ Head hunting -
❏ Female -
Chart labels: Under Represented / Over Represented; Male Dominated / Female Dominated
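The slides do not say how over- and under-representation was measured. One possible reading, sketched here in Python as an assumption, compares each occupation's share of adverts with its share of the workforce. The function name, category labels, and numbers are invented for illustration.

```python
def representation_ratio(advert_counts, workforce_counts):
    """Ratio of advert share to workforce share per occupation category.

    A value above 1 means the category appears in adverts more often than
    its workforce size would suggest; below 1 means less often.
    """
    advert_total = sum(advert_counts.values())
    workforce_total = sum(workforce_counts.values())
    if advert_total == 0 or workforce_total == 0:
        return {}
    ratios = {}
    for soc, employed in workforce_counts.items():
        if employed == 0:
            continue  # no workforce baseline to compare against
        advert_share = advert_counts.get(soc, 0) / advert_total
        workforce_share = employed / workforce_total
        ratios[soc] = advert_share / workforce_share
    return ratios


# Invented toy numbers, not taken from the dataset.
advert_counts = {"A": 300, "B": 100}
workforce_counts = {"A": 500, "B": 500}
ratios = representation_ratio(advert_counts, workforce_counts)
print(ratios)
```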
Salary
Stereotypical distribution
Dispersion: IT skills
- Monitor skill dispersion
- Has implications for policy and training
- Has implications for risks within occupations, such as the deployment of IT projects.
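The slides do not define the dispersion measure. As one possible sketch, assuming dispersion means how evenly a skill word is spread across occupation categories, normalized Shannon entropy gives 0 when a word is confined to one category and 1 when it is spread evenly. The function name and counts are hypothetical.

```python
import math


def dispersion(counts_by_category):
    """Normalized Shannon entropy of a word's counts across categories.

    Returns 0.0 when all occurrences fall in a single category and 1.0
    when they are spread evenly over every category.
    """
    total = sum(counts_by_category.values())
    n_categories = len(counts_by_category)
    if total == 0 or n_categories < 2:
        return 0.0
    probs = [c / total for c in counts_by_category.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n_categories)


# Invented counts of one skill word per occupation category.
print(dispersion({"A": 10, "B": 10}))
print(dispersion({"A": 20, "B": 0}))
```

Tracking such a score per year would be one way to monitor whether a skill is moving out of specialist occupations into the wider job market.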
Dispersion: female words
Is discrimination wording attracted to female-worded job adverts?
Discrimination
Diffusion process into the central region where men and women are more equally represented.
Color: Red = Highest percentage of female wording
Notice the large amount of green (less wording) in 2013
Questions
Refinements
❏ Inferential statistics
❏ From unigram to bigram
❏ Cleaner data sources
❏ Multiple languages
❏ Compare to specific surveys
❏ Generation of many dictionaries
❏ From dictionary to taxonomies
❏ From research to practice
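The step from unigrams to bigrams listed above is small in code terms. A minimal Python sketch, with a hypothetical function name and an invented sample phrase, pairs each token with its successor:

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs, which keeps some of the word order that unigrams drop."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(tokens, tokens[1:]))


counts = bigram_counts("project manager and project manager")
print(counts.most_common(1))
```

Bigrams separate phrases such as "project manager" from the individual words, at the cost of a much larger vocabulary to hold in memory.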
References
Motivation: We can develop large-scale systems without using sensitive data.
Berg, A. M., Mol, S. T., Kismihók, G., & Sclater, N. (2016). The role of a reference synthetic data generator within the field of learning analytics. Journal of Learning Analytics, 3, 107–128. http://doi.org/10.18608/jla.2016.31.7
Motivation: We need to add new xAPI profiles and be consistent to avoid issues with connecting systems.
Berg, A., Scheffel, M., Drachsler, H., Ternier, S., & Specht, M. (2016). The Dutch xAPI experience. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge - LAK ’16 (pp. 544–545). New York, New York, USA: ACM Press. http://doi.org/10.1145/2883851.2883968
Motivation: We need to add new xAPI profiles and be consistent to avoid issues with connecting systems.
Berg, A., Scheffel, M., Drachsler, H., Ternier, S., & Specht, M. (2016). Dutch Cooking with xAPI Recipes: The Good, the Bad, and the Consistent. In 2016 IEEE 16th International Conference on Advanced Learning Technologies (ICALT) (pp. 234–236). IEEE. http://doi.org/10.1109/ICALT.2016.48
Motivation: To contribute to the discussion around LA infrastructural elements, hence providing a means to consistency.
Sclater, N., Berg, A., & Webb, M. (2015). Developing an open architecture for learning analytics. Proceedings of the EUNIS 2015 Congress. ISSN: 2409-1340
