A presentation on recent developments in the open source Workforce Data Initiative (WDI) Skill Labeler, a community-based labelling system for producing open skills data. Context, best practices, data sets, and thoughts on ongoing work. Presented by Kwame, CEO of Kwamata, a civic hacker contributing time to WDI.
From prototype to production - The journey of re-designing SmartUp.io - Máté Lang
A talk about the journey of a small tech team re-designing SmartUp.io from scratch, and the technical path from MVP to production.
A high-level overview of architecture and tech-stack decisions, best practices, and culture.
Fighting legacy with hexagonal architecture and frameworkless PHP - Fabio Pellegrini
Very often we come into contact with rather dated legacy applications, the classic monoliths that have grown out of all proportion over time, accumulating technical debt.
Because of companies' business priorities, it is not always possible to allocate the budget and time needed to start the required architectural restructuring and data remodelling right away.
In this talk I will present a solution I recently happened to adopt to start redefining the structure of a legacy project, using an approach based on Domain Driven Design, hexagonal architecture, and framework-free PHP.
We will see how a new “satellite” service was created from scratch, how the main components were implemented, how the legacy code was kept at the edges of the application, and how testing was approached, all with a view to breaking the monolith up into microservices at a later stage.
Database automation guide - Oracle Community Tour LATAM 2023 - Nelson Calero
The tasks of the DBA role are in permanent evolution: there are new and changed functionalities in database versions, cloud services, integrations, and new tools. Automation has always been a big portion of DBA work, and it constantly challenges our processes. This presentation explores these automation changes using examples from the experience of supporting hundreds of Oracle installations of varying size and complexity, including the process of choosing the right tool for the task, implementation, and subsequent maintenance, mainly using Ansible.
WTF is a Microservice - Rafael Schloming, Datawire - Ambassador Labs
Rafael Schloming, Chief Architect at Datawire and AMQP spec author, breaks down an understanding of microservices into People, Processes, and Technology, and recommends that teams adopting microservices start with People first rather than with Technology.
The working architecture of Node.js applications - Open Tech Week JavaScript - Viktor Turskyi
We have launched more than 60 projects and developed a web application architecture that suits projects of completely different sizes. In the talk I'll analyze this architecture, consider the question of what to choose, monolith or microservices, and show the main architectural mistakes that developers make.
Elasticsearch Performance Testing and Scaling @ Signal - Joachim Draeger
In this talk I describe the specific challenges that we faced at Signal to make our use case scale. I then go into detail on how we benchmarked single queries and different shard configurations. You can try the experiments yourself using The Signal Media One-Million News Articles Dataset, a Docker Compose stack and some scripts provided here: https://github.com/joachimdraeger/elasticsearch-performance-experiments.
I also got the great advice to have a look at https://github.com/elastic/rally which can also give you summaries for test runs.
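The benchmarking workflow described above boils down to timing individual queries and summarizing the latency distribution. Here is a minimal sketch of that step in Python; `summarize_latencies` and `benchmark` are hypothetical helper names, and a real run would wrap an actual Elasticsearch query rather than a stub.

```python
import statistics
import time

def summarize_latencies(samples_ms):
    """Summarize a list of query latencies (milliseconds)."""
    ordered = sorted(samples_ms)
    def pct(p):
        # nearest-rank percentile over the sorted samples
        idx = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
        return ordered[idx]
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": pct(95),
        "max": ordered[-1],
    }

def benchmark(run_query, n=50):
    """Time n invocations of run_query() and return a latency summary."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return summarize_latencies(samples)
```

Tools like Rally automate exactly this kind of run-and-summarize loop, along with warm-up handling and result comparison across shard configurations.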
Decathlon’s mission is to make sport accessible to more people. Decathlon SportMeeting, its new social network, was created to take this one step further, allowing everyone to find people who share their sport and their passion.
DSM was built from scratch to support its current traffic: more than 100k registered users and 1,000 active sport proposals across more than 30 sports.
This web platform is built entirely with Groovy & Grails, but there are also Android and iOS applications that use its RESTful API. During development, several plugins were created and open-sourced to the community.
In this talk Kaleidos will explain how the platform was developed, some of the technical decisions that were made, lessons learned, pitfalls, how the infrastructure has evolved over almost 3 years, and much more.
Monitoring Big Data Systems - "The Simple Way" - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we'll cover the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data flowing through the system?
We'll also cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Demi Ben-Ari is a Co-Founder and CTO @ Panorays.
Demi has over 9 years of experience building various systems, both in the field of near-real-time applications and in Big Data distributed systems.
Describing himself as a software development groupie, he is interested in tackling cutting-edge technologies.
Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
Viktor Turskyi, "Effective NodeJS Application Development" - Fwdays
In 15 years of development, I have taken part in creating a large number of different projects. I have already given a number of talks on the working architecture of web applications, but that is only part of the efficient-development puzzle. We will consider the whole process from the start of a project to its launch in production. I'll tell you how we approach the ideas of the "12 Factor App", how we use Docker, and discuss environment deployment, security, testing, the nuances of the SDLC, and much more.
If we could only predict the future of the software industry, we could make better investments and decisions. We could waste fewer resources on technology and processes we know will not last, or at least be conscious in our decisions to choose solutions with a limited lifetime. It turns out that for data engineering, we can predict the future, because it has already happened. Not in our workplace, but at a few leading companies that are blazing ahead. It has also already happened in the neighbouring field of software engineering, which is two decades ahead of data engineering in process maturity. In this presentation, we will glimpse into the future of data engineering. Data engineering has gone from legacy data warehouses with stored procedures, to big data with Hadoop and data lakes, on to a new form of modern data warehouses and low-code tools, aka "the modern data stack". Where does it go from here? We will look at the points where data leaders differ from the crowd and combine them with observations on how software engineering has evolved, to see that it points towards a new, more industrialised form of data engineering - "data factory engineering".
Migrating to an Agile Architecture, Will Demaine, Engineer, Fat Llama - UXDXConf
Will Demaine, Engineer, Fat Llama. Setup decisions: planning your Agile architecture (cloud migration path, platform choice, microservices/container architecture). Before you know everything about your product, how are you supposed to set it up?
Totango is an Analytics platform for Customer Success.
Our data pipeline converts usage information into actionable analytics. The pipeline is managed using the Luigi workflow engine, and data transformations are done in Spark.
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry - Marcus Hanwell
The Open Chemistry project is developing an ambitious platform to facilitate reproducible quantum chemistry workflows by integrating the best-of-breed open source projects currently available into a cohesive platform, with extensions specific to the needs of quantum chemistry. The core of the project is a Python-based data server capable of storing metadata, executing quantum chemistry calculations, and processing the output. The platform exposes language-agnostic RESTful web endpoints, and uses Linux container technology to package quantum codes that are often difficult to build.
The Jupyter project has been leveraged as a web-based frontend offering reproducibility as a core principle. This has been coupled with the data server to initiate quantum chemistry calculations, cache results, make them searchable, and even visualize the results within a modern browser environment. The Avogadro libraries have been reused for visualization workflows, coupled with Open Babel for file translation, and examples of the use of NWChem and Psi4 will be demonstrated.
The core of the platform is built upon JSON data standards, encouraging the wider adoption of JSON/HDF5 as the principal storage media. A single-page web application using React at its core will be shown for sharing simple views of data output and linking to the Jupyter notebooks that document how they were made. Command line tools and links to the Avogadro graphical interface will be shown, demonstrating capabilities from web through to desktop.
Applications need data, but the legacy approach of n-tiered application architecture doesn’t solve for today’s challenges. Developers aren’t empowered to build and iterate their code quickly without lengthy review processes from other teams. New data sources cannot be quickly adopted into application development cycles, and developers are not able to control their own requirements when it comes to data platforms.
Part of the challenge here is the existing relationship between two groups: developers and DBAs. Developers are trying to go faster, automating build/test/release cycles with CI/CD, and thrive on the autonomy provided by microservices architectures. DBAs are stewards of data protection, governance, and security. Both of these groups are critically important to running data platforms, but many organizations deal with high friction between these teams. As a result, applications get to market more slowly, and it takes longer for customers to see value.
What if we changed the orientation between developers and DBAs? What if developers consumed data products from data teams? In this session, Pivotal’s Dormain Drewitz and Solstice’s Mike Koleno will speak about:
- Product mindset and how balanced teams can reduce internal friction
- Creating data as a product to align with cloud-native application architectures, like microservices and serverless
- Getting started bringing lean principles into your data organization
- Balancing data usability with data protection, governance, and security
Presenter : Dormain Drewitz, Pivotal & Mike Koleno, Solstice
Similar to Labeling all the Things with the WDI Skill Labeler (20)
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
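As a reference point for the abstract above, here is a minimal pure-Python sketch of the standard ("Monolithic") power-iteration PageRank that the report benchmarks against. The dict-of-lists graph representation and the uniform redistribution of dead-end rank are illustrative choices, not code from the report.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank on a dict {node: [out-neighbours]}.

    Dead ends (nodes with no out-links) have their rank mass
    redistributed uniformly, a common teleport-based handling.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # rank held by dead ends is spread evenly over all nodes
        dead = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1.0 - damping) / n + damping * dead / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank
```

Levelwise PageRank differs by first condensing the graph into a DAG of strongly connected components and running this iteration one topological level at a time, which removes the need for per-iteration global communication.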
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
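The storage-type experiments above hinge on the fact that accumulator precision changes the result of a reduction. Python has no bfloat16, but the same effect can be sketched with doubles by comparing naive left-to-right accumulation against a correctly-rounded sum; `naive_sum` is an illustrative helper, not code from the notes.

```python
import math

def naive_sum(xs):
    """Left-to-right accumulation, like a plain sequential reduction loop."""
    total = 0.0
    for x in xs:
        total += x  # each addition rounds, so error accumulates
    return total

values = [0.1] * 100
exact = math.fsum(values)   # correctly-rounded floating-point sum
approx = naive_sum(values)  # carries accumulated rounding error
```

The gap between `exact` and `approx` is tiny for doubles but grows sharply with a 16-bit storage type like bfloat16, which is exactly what the float-vs-bfloat16 element-sum experiments measure.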
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
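For the reliability analysis listed above, Cronbach's alpha can be computed directly from its definition. Here is a minimal pure-Python sketch, assuming scores arrive as one list per item; `cronbach_alpha` is an illustrative helper name, and real analyses would typically use a statistics package.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of k items, each a list of the same
    respondents' scores.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(items)
    # per-respondent total score across all items
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / variance(totals))
```

Values near 1 indicate high internal consistency; a common rule of thumb treats 0.7 as an acceptable threshold.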
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Labeling all the Things with the WDI Skill Labeler
1. Labeling All The Things With the Workforce Data Initiative Skill Labeler
Kwame Robinson, CEO @ Kwamata, LLC
www.kwamata.com
February 15, 2018
2. preamble
● Opinions and views are my own. This is a 42-slide presentation.
● This talk covers:
○ A brief introduction to the Workforce Data Initiative (WDI)
○ Several motivating examples and context for open skills data
○ An overview and technical deep dive into the WDI Skill Labeler
○ Data sets, including job posting data sets, related to the Skill Labeler
○ Data sets covering industry and occupations, connecting the skills labeler to a larger workforce context
○ Next Steps
3. The Workforce Data Initiative (WDI): why and what
Data At Work
● WDI is housed within Data At Work
● An open, public-private partnership supporting a 21st-century workforce data ecosystem
● See: www.dataatwork.org
Workforce Data Initiative (WDI)
● Mission to create tools for and conduct applied research on skills data
● Additional mission to create skill taxonomies to better inform state and national stakeholders, and people like us
● See: www.github.com/workforce-data-initiative
4. The Workforce Data Initiative (WDI): who and where
Home and Current/Former Participants
● Academia: University of Chicago; Matt Gee, Tristan Crockett, Eddie Lin, Hyunzoo Chai, Nathan Bartley, etc.
● Corporate: Pairin, Upwork, Microsoft, LinkedIn, etc.
● Government: State of Michigan, Dept. Labor (2016; E.J. Kalafarski), White House (2016; Natalie Harris), CFPB (2016; Sam Leitner), etc.
● Civic Hackers: Greg Mundy, Kwame Robinson, etc.
*See: dataatwork.org/partners/
5. About myself and the WDI
● Inspired by the mission, put in nearly two years of pro bono effort
● Contribute:
○ Data science
○ Machine learning engineering
○ Machine learning research
○ Tool development
6. So, why do we need open skills data in the first place?
7. Skills are the foundation of work
● To motivate why open skills data is important, let's use a story about someone, “Eve”, to illustrate.
[Figure: “The Workforce” - a hierarchy from a state (or region), through industries (e.g. Retail) and occupations (e.g. Cashier), down to skills (e.g. Adding)]
8. What’s Free Isn’t As Good
● Most open skill data is static, some not updated in over 10 years (e.g. O*NET Worker skills)
9. Hard to Pin Down
● “Acting like a team player?” … What does that really mean? Soft skills can be hard to put a finger on.
● In written language, context and intent play important roles: “she’s great with a bat!” vs “a bat is not a bird”
10. If You Want to Know You Gotta Pay
● Costs money, or limited by terms of use: Google Jobs API, LinkedIn Skills, etc.
● Biases from focusing on market needs: tech jobs vs. playwrights
● Biases from overlooking soft skills: C# vs bedside manner
11. The Future Ain’t
What it Used to Be*
● Deloitte, McKinsey,
Brookings, etc. all say:
“Automation, AI to
eliminate large swaths of
jobs!Ӡ
● As jobs disappear, new
jobs, skills will appear
that have never existed.
* Yogi Berra
† Essentially this is what they’ve said
12. We need open skills data so the community can understand skill demands in occupations, industries, and states, on their own terms, free of biases, a profit focus, and other issues
13. And now ... about the WDI Skill Labeler
https://github.com/workforce-data-initiative/skills-labeller
14. WDI Skill Labeler: Project Details
MIT Licensed
● Primary contributors: Kwame Robinson,
Tristan Crockett and Greg Mundy
A service anyone
can run
● System for community-based labelling of skills
Goals
● Open dataset of skills data and their context
● Foundation for open workforce skill research
Ongoing
● In active development
● Welcome any and all contributions
15. WDI Skill Labeler: Project Details
We’re on Slack ● slack@workdatainitiative.slack.com
Many other WDI repos ● https://github.com/workforce-data-initiative
17. WDI Skill Labeler: Alternatives
NextML
● Web scale active learning for labeling data
● Very friendly devs, contact@nextml.org
● Used by New York Times, Google,
Facebook, Yahoo Research division
● Modular
● Greater custom code complexity; interacts
with several subsystems
● Requires an older version of Docker Compose
● Python 2
18. WDI Skill Labeler: Deep Dive
Organizing Principles
● Old School Way:
○ The Monolithic App, does everything, changes rebuild entire app
● New School Way:
○ Microservices Architecture - Martin Fowler
■ See: martinfowler.com/articles/microservices.html
■ Treat functionality as separate services
■ Replicate services as needed for scale
■ Each service is as independent as possible (testing, deployment, code, etc.)
○ The 12 Factor App (12factor.net): Considerations, modern factors for software-as-a-service,
lessons learned, best practices.
● Leads to faster delivery, more stable product, easier participation and integration
19. WDI Skill Labeler: Deep Dive
System Architecture (implementation in progress)
20. WDI Skill Labeler: Deep Dive
ETL Service
● Docker, Docker Compose: MongoDB as a container
● Preprocessor: Textacy, unsupervised key term extraction
○ Uses graph theory, frequency, built on spaCy NLP
○ Combat very unbalanced classes by artificially
lowering recall to boost precision
● ETL: Pymongo, Pytest, Unittest, Mock
○ Pymongo ORM map to database, Mock DB
○ Pulls job posting data from VA’s CCARS
○ Housed w/ Preprocessor for speed
● Offers HTTP endpoint, to be moved to Service Listener
● Houses Skill Candidates for community to label
● Houses Labeled Skills for community, research
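The “combat unbalanced classes by lowering recall to boost precision” point can be sketched as a simple score threshold over extracted key terms. This is an illustrative sketch only: the scores stand in for what the textacy/spaCy key-term extractor would return, and the candidate-document fields are hypothetical, not the project’s real schema.

```python
# Keep only high-scoring key terms as skill candidates. Since true skills
# are a small minority of extracted terms, raising the threshold drops
# borderline terms (lower recall) so survivors are more likely to be real
# skills (higher precision).

def select_candidates(scored_terms, threshold=0.5):
    return [
        {"candidate": term, "score": score, "labels": []}
        for term, score in scored_terms
        if score >= threshold
    ]

scored = [("customer service", 0.81), ("the store", 0.12), ("cash handling", 0.66)]
candidates = select_candidates(scored, threshold=0.5)
# Each surviving dict is shaped like a document that could be stored in
# the Mongo skill-candidates collection for the community to label.
```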
21. WDI Skill Labeler: Deep Dive
ETL Service: Testing
● Pytest, image runs ETL service specific tests in test/
● Unittest setUp to instantiate a database using stored
test data
○ from unittest.mock import patch
○ with patch(...) as mock_write_url: … used as a context manager
○ On exit, patch operations are undone and the copied test file removed
● Reduced test data saves testing time
● All failures are service related, not external to other services
● Test Driven Development
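The patching pattern above can be shown with a minimal, self-contained sketch: a mock stands in for the database writer inside a `with` block, and the patch is undone automatically on exit, so no test ever touches the real store. The names (`real_write_url`, `load_posting`, the `db` namespace) are illustrative stand-ins, not the project’s actual ETL code.

```python
import types
from unittest.mock import patch

def real_write_url(url):
    # Pretend this persists to MongoDB; it must never run during tests.
    raise RuntimeError("tests must not touch the database")

# A namespace standing in for a hypothetical ETL module.
db = types.SimpleNamespace(write_url=real_write_url)

def load_posting(posting):
    return db.write_url(posting["url"])

# patch.object swaps in a Mock only for the duration of the with-block
# and undoes the patch on exit (the "undo patch operations" step), so
# failures stay local to this service's tests.
with patch.object(db, "write_url", return_value="ok") as mock_write_url:
    result = load_posting({"url": "http://example.com/job/1"})

mock_write_url.assert_called_once_with("http://example.com/job/1")
assert db.write_url is real_write_url  # patch was undone on exit
```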
22. WDI Skill Labeler: Deep Dive
ETL Service: ORM
● Pymongo ORM (strictly an object document mapper, since Mongo is not relational)
● Sets up a class, API for specific object to be stored
● Easy to use, test
● PyMongo Aggregate
○ Pipeline of operations
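A PyMongo aggregation pipeline is just an ordered list of stage documents, each a plain dict. The pipeline below is a hypothetical example (the field names are not the project’s real schema) that would count labeled skill candidates per label; with a live connection it would run as `collection.aggregate(pipeline)`.

```python
# Each dict is one pipeline stage; stages run in order, each consuming
# the previous stage's output.
pipeline = [
    {"$match": {"labels": {"$ne": []}}},               # only labeled candidates
    {"$unwind": "$labels"},                            # one doc per label
    {"$group": {"_id": "$labels", "n": {"$sum": 1}}},  # count per label
    {"$sort": {"n": -1}},                              # most common first
]
```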
23. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Vowpal Wabbit
● Leans heavily on Vowpal Wabbit
○ Microsoft Research; created by Dr. John Langford
○ Extremely fast, extremely flexible
○ Out of core, Online, Active Learning
○ Cluster mode, high performance
● See vw_hyperopt.py for parameter search
● Using Active Learning mode
○ Learn one example at a time
○ Assumes labeled data is very costly, ask person to label only the example/instance
it is most uncertain about
○ Ranked instances are backed by a Redis priority queue, keyed on importance (> 1 → important)
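The selection rule behind active learning can be shown in a few lines: with labels costly, ask a person to label only the example the model is most uncertain about. The scores below stand in for a model’s decision values (0 = the decision boundary, i.e. maximal uncertainty); VW’s active mode does this internally, so this sketch shows only the idea, not its implementation.

```python
# Pick the example whose score is closest to the decision boundary;
# that's the one a human label would teach the model the most about.

def most_uncertain(scored_examples):
    return min(scored_examples, key=lambda pair: abs(pair[1]))

examples = [("python", 0.9), ("team player", 0.05), ("the office", -0.8)]
text, score = most_uncertain(examples)
# "team player" is nearest the boundary, so it goes to a human labeler.
```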
24. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Vowpal Wabbit
● Takes a new, quasi-Hogwild-inspired approach
○ Do not revise older importances
○ Randomly permute the last few importance digits to make examples unique while weakly preserving ranking
○ Backed by a Redis queue
■ ZSET (sorted set)
■ Priority queue with O(log N) add, pop
■ Pop: ZRANGEBYSCORE(..., -1) to get the highest importance
■ https://github.com/workforce-data-initiative/skills-labeller/blob/master/skilloracle/skilloracle/__init__.py#L147-L188
○ Not having to update importance rankings simplifies things quite a bit
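The two tricks above (an importance-keyed priority queue, plus jittering the low digits so equal importances become unique without reordering peers) can be sketched with the stdlib. Note this uses `heapq` purely as a stand-in for the Redis sorted set; the real implementation lives at the repo link above.

```python
import heapq
import random

def jittered(importance, scale=1e-6):
    # Perturb only the far-lower digits: every score becomes unique, but
    # peers that differ by more than `scale` keep their relative order.
    return importance + random.random() * scale

queue = []  # heapq is a min-heap, so push negated importances for a max-queue
for example, importance in [("a", 1.2), ("b", 0.4), ("c", 1.2)]:
    heapq.heappush(queue, (-jittered(importance), example))

# Highest importance pops first, like reading the top of the sorted set.
_, first = heapq.heappop(queue)
```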
25. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Vowpal Wabbit
● Endpoint available over HTTP/TCP
● To be moved to service listener
26. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Frontend, REST
● User Interface is of primary importance
● Learn from the best: Tinder
○ Swipe left to reject
○ Swipe right to mark as skill
○ Near infinite list of skills
● Web page issues REST API calls
● REST API calls talk to Dispatcher
○ Drives entire system, indirectly
○ Other services emit events to the dispatcher (e.g., low on unlabeled skills)
○ Dispatcher enforces separation of concerns, micro services
● Angular, JS, HTML … any awesome front end developers out there? :)
27. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Dispatcher
● Work In Progress
● Dispatcher:
○ Coordinates communication across, between services
○ Services are only aware of a “Dispatcher”
○ Enforces microservice approach
○ Communicate over Redis Queue
○ Service Listener monitors its service queue, reacts
○ Service Listener can put event on event queue too (feedback loop)
● Dispatcher to offer simple REST API called by users
○ Uses the hug library, built on top of Falcon, a bare-metal web API framework
○ Translates vetted API calls to Redis queue messages for microservices
○ Older, dispatcher like functionality exists in Skill Oracle
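The dispatcher pattern above can be sketched as a toy, in-process version using stdlib queues in place of Redis queues: services never talk to each other directly, they emit events to the dispatcher, which routes each event onto the queue of the service meant to react. Names and the message format are illustrative, not the project’s.

```python
from queue import Queue

class Dispatcher:
    """Toy in-process dispatcher; stdlib queues stand in for Redis queues."""

    def __init__(self):
        self.service_queues = {}

    def register(self, service_name):
        # Each service gets its own queue, which its Service Listener monitors.
        self.service_queues[service_name] = Queue()
        return self.service_queues[service_name]

    def emit(self, target_service, event):
        # Services only ever know the dispatcher; the dispatcher routes the
        # event onto the target service's queue, preserving the boundary.
        self.service_queues[target_service].put(event)

dispatcher = Dispatcher()
etl_queue = dispatcher.register("etl")

# e.g. the Skill Oracle noticing it is low on unlabeled skill candidates:
dispatcher.emit("etl", {"event": "low_on_candidates", "need": 100})
event = etl_queue.get()
```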
29. ESCO: European Skills, Competences, Qualifications and Occupations
Occupations+Skills
● https://ec.europa.eu/esco/portal/download
● EU based
● Continuously Updated (!)
● Occupations: ISCO-08, SOC crosswalk
● Skills*: 13,485; Qualifications: 2,414
*not a full hierarchy
30. Kaggle
Job Posting (related)
● https://www.kaggle.com/c/job-recommendation
● https://www.kaggle.com/c/job-salary-prediction
● Recommend or predict salary based on Job Posting
data
● Ground truth data, interesting data sets
31. USA Jobs
Job Posting
● www.usajobs.gov
● ALL U.S. federal government openings
● API @ developer.usajobs.gov
● Includes:
○ Job Description, Responsibilities
○ Min/Max Salary
○ Location
○ Date
● Near real time, 2 hour lag
● Note: government jobs are qualitatively different from private-sector jobs
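As I recall the developer.usajobs.gov documentation, a search request is a keyword query against data.usajobs.gov with `Host`, `User-Agent` (your registered email), and `Authorization-Key` headers; treat the exact headers as an assumption to verify there. The key and email below are placeholders, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build (but do not send) a USAJOBS keyword search request.
params = urlencode({"Keyword": "data scientist", "ResultsPerPage": 25})
req = Request(
    "https://data.usajobs.gov/api/search?" + params,
    headers={
        "Host": "data.usajobs.gov",
        "User-Agent": "you@example.com",       # your registered email
        "Authorization-Key": "YOUR_API_KEY",   # from developer.usajobs.gov
    },
)
```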
32. National Labor Exchange
Job Posting Data
● http://us.jobs
● National Labor Exchange: a partnership between
National Assoc. State Workforce Agency +
DirectEmployers Assoc.
● Collects job postings from over 25,000 corporate
websites, state job banks and USAJobs
● More than 2 million job postings at any given time
● Can browse by Occupation or Industry, weak taxonomy
● No public API :( (that I could find), just host link
metadata
● See: www.naswa.org/nlx/?action=what for more detail
33. State of Virginia CCARS
Open Data for Job Listings
● See: https://opendata-cs-vt.github.io/ccars-jobpostings/
● Many gigabytes of Job Listings in Virginia
● Primary source of job listings for Skill Labeler
● Similar to Data At Work’s mission, but by VA
34. Other Sites to be aware of
Advocacy, General Data
● Data.gov
Holds a lot of federal, state, city related data, search for jobs, job
postings
● www.nationalskillscoalition.org
National Skills Coalition, non profit special interest group
35. Going Beyond Job Related Data Sets:
A Larger Workforce Context
36. Dataset Topics
Occupations
● The type of job or work that a person
does; e.g. Mr. Wyeth is an artist or John
is a cashier.
Industry
● The business activity of an employer or
company; e.g. Walmart is in retail sales,
employs those in cashier occupations.
Semco makes and sells paint and
employs painters (but not like Mr.
Wyeth)
Skills
● The ability to do something well;
expertise. Can include knowledge,
abilities, etc.
37. O*NET (Occupational Information Network)
Occupation/Industry/Skills
● www.onetcenter.org
● Semi-annual Occupational Database
○ SOC Code
○ Big 6/RIASEC Occup. personality
tests
○ Skills, Tasks
● Heavy Industrial/Organizational
Psychologist focus
● DoL sponsored, led by North Carolina
Dept of Commerce
● Surveys, data collection since 2000
database @ www.onetcenter.org/database.html
38. BLS: SAE, QCEW (Bureau of Labor Statistics)
Industry
● www.bls.gov/sae/
● www.bls.gov/qcew/
● Different in methodology*
● Monthly/quarterly industry surveys of
wages and employment
● Released monthly (SAE) and quarterly (QCEW)
● Rolled up by Industry (NAICS), by
State/Metro Area* or National levels
data @ https://data.bls.gov/cew/apps/data_views/data_views.htm
data @ https://www.bls.gov/sae/home.htm#tables
*method: https://www.bls.gov/cew/cewbultncur.htm#Comparison
39. BLS: Measuring SOC Concentration By NAICS
Audrey Watson, “Measuring occupational concentration by industry,”
Beyond the Numbers: Employment & Unemployment, vol. 3, no. 3
(U.S. Bureau of Labor Statistics, February 2014), https://www.bls.gov/opub/btn/volume-3/measuring-occupational-concentration-by-industry.htm
Industry+Occupation
“[T]he HHI and industry quotients offer additional
perspectives on industry staffing patterns, helping to provide
a more accurate picture of the distribution of occupations
across industries. Such information could be useful for
workers as they choose a career, jobseekers as they narrow or
broaden their job searches, and employers as they try to
recruit workers from other industries …”
40. US Open City Data Census
City Data, Business Listings
● Interesting data set to be aware of, although not
directly relevant to workforce research
● 2018: us-cities.survey.okfn.org
● 2017: us-city.census.okfn.org * note city vs. cities
● Wide variety of data on US cities
● Links to city categorized business listings
● Grades cities on open data access
41. So What’s Next?
Ready to help us label?
Twitter: @data_at_work, get
notified when WDI skill
labeler is deployed to
production
Talk to us on Slack!
workdatainitiative.slack.com
Code and Research
Git:
github.com/workforce-data-initiative
See:
dataatwork.org/get-involved/
Use Gamification
Make skill labeling
psychologically motivating,
allow user accounts, high
scores?
A Comment System?
Additional opinions, context
on skills, job postings from
the community