A presentation on recent developments in the open source Workforce Data Initiative (WDI) Skill Labeler, a community-based labelling system for producing open skills data. Context, best practices, data sets, and thoughts on ongoing work. Presented by Kwame, CEO of Kwamata, a civic hacker contributing time to WDI.
From prototype to production - The journey of re-designing SmartUp.io - Máté Lang
A talk about the journey of a small tech team re-designing SmartUp.io from scratch, and the technical path from MVP to production.
A high-level overview of architecture and tech-stack decisions, best practices, and culture.
Fighting legacy with hexagonal architecture and frameworkless PHP - Fabio Pellegrini
Very often we come into contact with rather dated legacy applications, the classic monoliths that have grown out of all proportion over time, accumulating technical debt.
Because of companies' business priorities, it is not always possible to allocate the budget and time needed to start the required architectural restructuring and data remodelling right away.
In this talk I will present a solution I recently happened to adopt to start redefining the structure of a legacy project, using an approach based on Domain Driven Design, hexagonal architecture, and framework-free PHP.
We will see how a new “satellite” service was created from scratch, how the main components were implemented, how the legacy code was kept at the edges of the application, and how testing was approached, all with a view to breaking the monolith up into microservices at a later stage.
Database automation guide - Oracle Community Tour LATAM 2023 - Nelson Calero
The tasks of the DBA role are in permanent evolution: there are new and changed functionalities in database versions, cloud services, integrations, and new tools. Automation has always been a big portion of DBA work, and it constantly challenges our processes. This presentation explores these automation changes using examples from the experience of supporting hundreds of Oracle installations of varying size and complexity, including the process of choosing the right tool for the task, implementation, and subsequent maintenance, mainly using Ansible.
WTF is a Microservice - Rafael Schloming, Datawire - Ambassador Labs
Rafael Schloming, Chief Architect at Datawire and AMQP spec author, breaks down an understanding of microservices into People, Processes, and Technology, and recommends that teams adopting microservices start with People first rather than with Technology.
The working architecture of Node.js applications - Open Tech Week JavaScript - Viktor Turskyi
We have launched more than 60 projects and developed a web application architecture that suits projects of completely different sizes. In the talk I'll analyze this architecture, consider the question of what to choose, monolith or microservices, and show the main architectural mistakes that developers make.
Elasticsearch Performance Testing and Scaling @ Signal - Joachim Draeger
In this talk I describe the specific challenges that we faced at Signal to make our use case scale. I then go into detail on how we benchmarked single queries and different shard configurations. You can try the experiments yourself using The Signal Media One-Million News Articles Dataset, a Docker Compose stack and some scripts provided here: https://github.com/joachimdraeger/elasticsearch-performance-experiments.
I also got the great advice to have a look at https://github.com/elastic/rally which can also give you summaries for test runs.
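The benchmarking workflow described above boils down to timing individual queries and summarizing the latency distribution. Here is a minimal sketch of that step in Python; `summarize_latencies` and `benchmark` are hypothetical helper names, and a real run would wrap an actual Elasticsearch query rather than a stub.

```python
import statistics
import time

def summarize_latencies(samples_ms):
    """Summarize a list of query latencies (milliseconds)."""
    ordered = sorted(samples_ms)
    def pct(p):
        # nearest-rank percentile over the sorted samples
        idx = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
        return ordered[idx]
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": pct(95),
        "max": ordered[-1],
    }

def benchmark(run_query, n=50):
    """Time n invocations of run_query() and return a latency summary."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return summarize_latencies(samples)
```

Tools like Rally automate exactly this kind of run-and-summarize loop, along with warm-up handling and result comparison across shard configurations.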
Decathlon’s mission is to make sport accessible to more people. Decathlon SportMeeting, its new social network, was created to take this one step further, allowing everyone to find people who share their sport and their passion.
DSM was built from scratch to support its current traffic: more than 100k registered users and 1,000 active sport proposals across more than 30 sports.
This web platform is built entirely with Groovy & Grails, but there are also Android and iOS applications that use its RESTful API. During development, several plugins were created and open-sourced to the community.
In this talk Kaleidos will explain how the platform was developed, some of the technical decisions that were made, lessons learned, pitfalls, how the infrastructure has evolved over almost 3 years, and much more.
Monitoring Big Data Systems - "The Simple Way" - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we'll cover the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data flowing through the system?
We'll also cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Demi Ben-Ari is a Co-Founder and CTO @ Panorays.
Demi has over 9 years of experience building various systems, both in the field of near-real-time applications and in Big Data distributed systems.
Describing himself as a software development groupie, he is interested in tackling cutting-edge technologies.
Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
Viktor Turskyi, "Effective NodeJS Application Development" - Fwdays
In 15 years of development, I have taken part in creating a large number of different projects. I have already given a number of talks on the working architecture of web applications, but that is only part of the efficient-development puzzle. We will consider the whole process from the start of a project to its launch in production. I'll tell you how we approach the ideas of the "12 Factor App", how we use Docker, and discuss environment deployment, security, testing, the nuances of the SDLC, and much more.
If we could only predict the future of the software industry, we could make better investments and decisions. We could waste fewer resources on technology and processes we know will not last, or at least be conscious in our decisions to choose solutions with a limited lifetime. It turns out that for data engineering, we can predict the future, because it has already happened. Not in our workplace, but at a few leading companies that are blazing ahead. It has also already happened in the neighbouring field of software engineering, which is two decades ahead of data engineering in process maturity. In this presentation, we will glimpse into the future of data engineering. Data engineering has gone from legacy data warehouses with stored procedures, to big data with Hadoop and data lakes, on to a new form of modern data warehouses and low-code tools, aka "the modern data stack". Where does it go from here? We will look at the points where data leaders differ from the crowd and combine them with observations on how software engineering has evolved, to see that it points towards a new, more industrialised form of data engineering - "data factory engineering".
Migrating to an Agile Architecture, Will Demaine, Engineer, Fat Llama - UXDXConf
Will Demaine, Engineer, Fat Llama. Setup decisions: planning your Agile architecture (cloud migration path, platform choice, microservices/container architecture). Before you know everything about your product, how are you supposed to set it up?
Totango is an Analytics platform for Customer Success.
Our data pipeline converts usage information into actionable analytics. The pipeline is managed using the Luigi workflow engine, and data transformations are done in Spark.
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry - Marcus Hanwell
The Open Chemistry project is developing an ambitious platform to facilitate reproducible quantum chemistry workflows by integrating the best-of-breed open source projects currently available into a cohesive platform, with extensions specific to the needs of quantum chemistry. The core of the project is a Python-based data server capable of storing metadata, executing quantum chemistry calculations, and processing the output. The platform exposes language-agnostic RESTful web endpoints, and uses Linux container technology to package quantum codes that are often difficult to build.
The Jupyter project has been leveraged as a web-based frontend offering reproducibility as a core principle. This has been coupled with the data server to initiate quantum chemistry calculations, cache results, make them searchable, and even visualize the results within a modern browser environment. The Avogadro libraries have been reused for visualization workflows, coupled with Open Babel for file translation, and examples of the use of NWChem and Psi4 will be demonstrated.
The core of the platform is built upon JSON data standards, encouraging the wider adoption of JSON/HDF5 as the principal storage media. A single-page web application using React at its core will be shown for sharing simple views of data output and linking to the Jupyter notebooks that document how they were made. Command line tools and links to the Avogadro graphical interface will be shown, demonstrating capabilities from web through to desktop.
Applications need data, but the legacy approach of n-tiered application architecture doesn’t solve for today’s challenges. Developers aren’t empowered to build and iterate their code quickly without lengthy review processes from other teams. New data sources cannot be quickly adopted into application development cycles, and developers are not able to control their own requirements when it comes to data platforms.
Part of the challenge here is the existing relationship between two groups: developers and DBAs. Developers are trying to go faster, automating build/test/release cycles with CI/CD, and thrive on the autonomy provided by microservices architectures. DBAs are stewards of data protection, governance, and security. Both of these groups are critically important to running data platforms, but many organizations deal with high friction between these teams. As a result, applications get to market more slowly, and it takes longer for customers to see value.
What if we changed the orientation between developers and DBAs? What if developers consumed data products from data teams? In this session, Pivotal’s Dormain Drewitz and Solstice’s Mike Koleno will speak about:
- Product mindset and how balanced teams can reduce internal friction
- Creating data as a product to align with cloud-native application architectures, like microservices and serverless
- Getting started bringing lean principles into your data organization
- Balancing data usability with data protection, governance, and security
Presenter : Dormain Drewitz, Pivotal & Mike Koleno, Solstice
Similar to Labeling all the Things with the WDI Skill Labeler (20)
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
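As a reference point for the abstract above, here is a minimal pure-Python sketch of the standard ("Monolithic") power-iteration PageRank that the report benchmarks against. The dict-of-lists graph representation and the uniform redistribution of dead-end rank are illustrative choices, not code from the report.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank on a dict {node: [out-neighbours]}.

    Dead ends (nodes with no out-links) have their rank mass
    redistributed uniformly, a common teleport-based handling.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # rank held by dead ends is spread evenly over all nodes
        dead = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1.0 - damping) / n + damping * dead / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank
```

Levelwise PageRank differs by first condensing the graph into a DAG of strongly connected components and running this iteration one topological level at a time, which removes the need for per-iteration global communication.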
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
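The storage-type experiments above hinge on the fact that accumulator precision changes the result of a reduction. Python has no bfloat16, but the same effect can be sketched with doubles by comparing naive left-to-right accumulation against a correctly-rounded sum; `naive_sum` is an illustrative helper, not code from the notes.

```python
import math

def naive_sum(xs):
    """Left-to-right accumulation, like a plain sequential reduction loop."""
    total = 0.0
    for x in xs:
        total += x  # each addition rounds, so error accumulates
    return total

values = [0.1] * 100
exact = math.fsum(values)   # correctly-rounded floating-point sum
approx = naive_sum(values)  # carries accumulated rounding error
```

The gap between `exact` and `approx` is tiny for doubles but grows sharply with a 16-bit storage type like bfloat16, which is exactly what the float-vs-bfloat16 element-sum experiments measure.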
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
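For the reliability analysis listed above, Cronbach's alpha can be computed directly from its definition. Here is a minimal pure-Python sketch, assuming scores arrive as one list per item; `cronbach_alpha` is an illustrative helper name, and real analyses would typically use a statistics package.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of k items, each a list of the same
    respondents' scores.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(items)
    # per-respondent total score across all items
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / variance(totals))
```

Values near 1 indicate high internal consistency; a common rule of thumb treats 0.7 as an acceptable threshold.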
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Labeling all the Things with the WDI Skill Labeler
1. Labeling All The Things With the Workforce Data Initiative Skill Labeler
Kwame Robinson, CEO @ Kwamata, LLC
www.kwamata.com
February 15, 2018
2. preamble
● Opinions and views are my own. This is a 42-slide presentation.
● This talk covers:
○ A brief introduction to the Workforce Data Initiative (WDI)
○ Several motivating examples and context for open skills data
○ An overview and technical deep dive into the WDI Skill Labeler
○ Data sets, including job posting data sets, related to the Skill Labeler
○ Data sets covering industry and occupations, connecting the skills labeler to a larger workforce context
○ Next Steps
3. The Workforce Data Initiative (WDI): why and what
Data At Work
● WDI is housed within Data At Work
● An open, public-private partnership supporting a 21st-century workforce data ecosystem
● See: www.dataatwork.org
Workforce Data Initiative (WDI)
● Mission to create tools for and conduct applied research on skills data
● Additional mission to create skill taxonomies to better inform state and national stakeholders, and people like us
● See: www.github.com/workforce-data-initiative
4. The Workforce Data Initiative (WDI): who and where
Home and Current/Former Participants
● Academia: University of Chicago; Matt Gee, Tristan Crockett, Eddie Lin, Hyunzoo Chai, Nathan Bartley, etc.
● Corporate: Pairin, Upwork, Microsoft, LinkedIn, etc.
● Government: State of Michigan, Dept. Labor (2016; E.J. Kalafarski), White House (2016; Natalie Harris), CFPB (2016; Sam Leitner), etc.
● Civic Hackers: Greg Mundy, Kwame Robinson, etc.
*See: dataatwork.org/partners/
5. About myself and the WDI
● Inspired by the mission, put in nearly two years of pro bono effort
● Contribute:
○ Data science
○ Machine learning engineering
○ Machine learning research
○ Tool development
6. So, why do we need open skills data in the first place?
7. Skills are the foundation of work
● To motivate why open skills data is important, let's use a story about someone, “Eve”, to illustrate.
[Figure: “The Workforce” - a hierarchy from a state (or region), through industries (e.g. Retail) and occupations (e.g. Cashier), down to skills (e.g. Adding)]
8. What’s Free Isn’t As Good
● Most open skill data is static, some not updated in over 10 years (e.g. O*NET Worker skills)
9. Hard to Pin Down
● “Acting like a team player?” … What does that really mean? Soft skills can be hard to put a finger on.
● In written language, context and intent play important roles: “she’s great with a bat!” vs “a bat is not a bird”
10. If You Want to Know You Gotta Pay
● Costs money, or limited by terms of use: Google Jobs API, LinkedIn Skills, etc.
● Biases from focusing on market needs: tech jobs vs. playwrights
● Biases from overlooking soft skills: C# vs bedside manner
11. The Future Ain’t
What it Used to Be*
● Deloitte, McKinsey,
Brookings, etc. all say:
“Automation, AI to
eliminate large swaths of
jobs!Ӡ
● As jobs disappear, new
jobs, skills will appear
that have never existed.
* Yogi Berra
† Essentially this is what they’ve said
12. We need open skills data so the community can understand skill demands in occupations, industries, and states, on their own terms, free of biases, a profit focus, and other issues
13. And now ... about the WDI Skill Labeler
https://github.com/workforce-data-initiative/skills-labeller
14. WDI Skill Labeler: Project Details
MIT Licensed
● Primary contributors: Kwame Robinson,
Tristan Crockett and Greg Mundy
A service anyone
can run
● System for community-based labelling of skills
Goals
● Open dataset of skills data and their context
● Foundation for open workforce skill research
Ongoing
● In active development
● Welcome any and all contributions
15. WDI Skill Labeler: Project Details
We’re on Slack ● slack@workdatainitiative.slack.com
Many other WDI repos ● https://github.com/workforce-data-initiative
17. WDI Skill Labeler: Alternatives
NextML
● Web scale active learning for labeling data
● Very friendly devs, contact@nextml.org
● Used by New York Times, Google,
Facebook, Yahoo Research division
● Modular
● Greater custom code complexity; interacts
with several subsystems
● Requires an older version of Docker Compose
● Python 2
18. WDI Skill Labeler: Deep Dive
Organizing Principles
● Old School Way:
○ The Monolithic App, does everything, changes rebuild entire app
● New School Way:
○ Microservices Architecture - Martin Fowler
■ See: martinfowler.com/articles/microservices.html
■ Treat functionality as separate services
■ Replicate services as needed for scale
■ Each service is as independent as possible (testing, deployment, code, etc.)
○ The 12 Factor App (12factor.net): Considerations, modern factors for software-as-a-service,
lessons learned, best practices.
● Leads to faster delivery, more stable product, easier participation and integration
19. WDI Skill Labeler: Deep Dive
System Architecture (implementation in progress)
20. WDI Skill Labeler: Deep Dive
ETL Service
● Docker, Docker Compose: MongoDB as a container
● Preprocessor: Textacy, unsupervised key term extraction
○ Uses graph theory, frequency, built on spaCy NLP
○ Combat very unbalanced classes by artificially
lowering recall to boost precision
● ETL: Pymongo, Pytest, Unittest, Mock
○ Pymongo ORM map to database, Mock DB
○ Pulls job posting data from VA’s CCARS
○ Housed w/ Preprocessor for speed
● Offers HTTP endpoint, to be moved to Service Listener
● Houses Skill Candidates for community to label
● Houses Labeled Skills for community, research
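The “combat unbalanced classes by lowering recall to boost precision” point can be sketched as a simple score threshold over extracted key terms. This is an illustrative sketch only: the scores stand in for what the textacy/spaCy key-term extractor would return, and the candidate-document fields are hypothetical, not the project’s real schema.

```python
# Keep only high-scoring key terms as skill candidates. Since true skills
# are a small minority of extracted terms, raising the threshold drops
# borderline terms (lower recall) so survivors are more likely to be real
# skills (higher precision).

def select_candidates(scored_terms, threshold=0.5):
    return [
        {"candidate": term, "score": score, "labels": []}
        for term, score in scored_terms
        if score >= threshold
    ]

scored = [("customer service", 0.81), ("the store", 0.12), ("cash handling", 0.66)]
candidates = select_candidates(scored, threshold=0.5)
# Each surviving dict is shaped like a document that could be stored in
# the Mongo skill-candidates collection for the community to label.
```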
21. WDI Skill Labeler: Deep Dive
ETL Service: Testing
● Pytest, image runs ETL service specific tests in test/
● Unittest setUp to instantiate a database using stored
test data
○ from unittest.mock import patch
○ with patch(...) as mock_write_url: … used as a context manager
○ On exit, patch operations are undone and the copied test file removed
● Reduced test data saves testing time
● All failures are service related, not external to other services
● Test Driven Development
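The patching pattern above can be shown with a minimal, self-contained sketch: a mock stands in for the database writer inside a `with` block, and the patch is undone automatically on exit, so no test ever touches the real store. The names (`real_write_url`, `load_posting`, the `db` namespace) are illustrative stand-ins, not the project’s actual ETL code.

```python
import types
from unittest.mock import patch

def real_write_url(url):
    # Pretend this persists to MongoDB; it must never run during tests.
    raise RuntimeError("tests must not touch the database")

# A namespace standing in for a hypothetical ETL module.
db = types.SimpleNamespace(write_url=real_write_url)

def load_posting(posting):
    return db.write_url(posting["url"])

# patch.object swaps in a Mock only for the duration of the with-block
# and undoes the patch on exit (the "undo patch operations" step), so
# failures stay local to this service's tests.
with patch.object(db, "write_url", return_value="ok") as mock_write_url:
    result = load_posting({"url": "http://example.com/job/1"})

mock_write_url.assert_called_once_with("http://example.com/job/1")
assert db.write_url is real_write_url  # patch was undone on exit
```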
22. WDI Skill Labeler: Deep Dive
ETL Service: ORM
● Pymongo ORM (strictly an object document mapper, since Mongo is not relational)
● Sets up a class, API for specific object to be stored
● Easy to use, test
● PyMongo Aggregate
○ Pipeline of operations
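A PyMongo aggregation pipeline is just an ordered list of stage documents, each a plain dict. The pipeline below is a hypothetical example (the field names are not the project’s real schema) that would count labeled skill candidates per label; with a live connection it would run as `collection.aggregate(pipeline)`.

```python
# Each dict is one pipeline stage; stages run in order, each consuming
# the previous stage's output.
pipeline = [
    {"$match": {"labels": {"$ne": []}}},               # only labeled candidates
    {"$unwind": "$labels"},                            # one doc per label
    {"$group": {"_id": "$labels", "n": {"$sum": 1}}},  # count per label
    {"$sort": {"n": -1}},                              # most common first
]
```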
23. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Vowpal Wabbit
● Leans heavily on Vowpal Wabbit
○ Microsoft Research; created by Dr. John Langford
○ Extremely fast, extremely flexible
○ Out of core, Online, Active Learning
○ Cluster mode, high performance
● See vw_hyperopt.py for parameter search
● Using Active Learning mode
○ Learn one example at a time
○ Assumes labeled data is very costly, ask person to label only the example/instance
it is most uncertain about
○ Ranked instances are backed by a Redis priority queue, keyed on importance (> 1 → important)
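The selection rule behind active learning can be shown in a few lines: with labels costly, ask a person to label only the example the model is most uncertain about. The scores below stand in for a model’s decision values (0 = the decision boundary, i.e. maximal uncertainty); VW’s active mode does this internally, so this sketch shows only the idea, not its implementation.

```python
# Pick the example whose score is closest to the decision boundary;
# that's the one a human label would teach the model the most about.

def most_uncertain(scored_examples):
    return min(scored_examples, key=lambda pair: abs(pair[1]))

examples = [("python", 0.9), ("team player", 0.05), ("the office", -0.8)]
text, score = most_uncertain(examples)
# "team player" is nearest the boundary, so it goes to a human labeler.
```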
24. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Vowpal Wabbit
● Takes a new, quasi-Hogwild-inspired approach
○ Do not revise older importances
○ Randomly permute the last few importance digits to make examples unique while weakly preserving ranking
○ Backed by a Redis queue
■ ZSET (sorted set)
■ Priority queue with O(log N) add, pop
■ Pop: ZRANGEBYSCORE(..., -1) to get the highest importance
■ https://github.com/workforce-data-initiative/skills-labeller/blob/master/skilloracle/skilloracle/__init__.py#L147-L188
○ Not having to update importance rankings simplifies things quite a bit
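The two tricks above (an importance-keyed priority queue, plus jittering the low digits so equal importances become unique without reordering peers) can be sketched with the stdlib. Note this uses `heapq` purely as a stand-in for the Redis sorted set; the real implementation lives at the repo link above.

```python
import heapq
import random

def jittered(importance, scale=1e-6):
    # Perturb only the far-lower digits: every score becomes unique, but
    # peers that differ by more than `scale` keep their relative order.
    return importance + random.random() * scale

queue = []  # heapq is a min-heap, so push negated importances for a max-queue
for example, importance in [("a", 1.2), ("b", 0.4), ("c", 1.2)]:
    heapq.heappush(queue, (-jittered(importance), example))

# Highest importance pops first, like reading the top of the sorted set.
_, first = heapq.heappop(queue)
```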
25. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Vowpal Wabbit
● Endpoint available over HTTP/TCP
● To be moved to service listener
26. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Frontend, REST
● User Interface is of primary importance
● Learn from the best: Tinder
○ Swipe left to reject
○ Swipe right to mark as skill
○ Near infinite list of skills
● Web page issues REST API calls
● REST API calls talk to Dispatcher
○ Drives entire system, indirectly
○ Other services emit events to the dispatcher (e.g., low on unlabeled skills)
○ Dispatcher enforces separation of concerns, micro services
● Angular, JS, HTML … any awesome front end developers out there? :)
27. WDI Skill Labeler: Deep Dive
Skill Oracle Service: Dispatcher
● Work In Progress
● Dispatcher:
○ Coordinates communication across, between services
○ Services are only aware of a “Dispatcher”
○ Enforces microservice approach
○ Communicate over Redis Queue
○ Service Listener monitors its service queue, reacts
○ Service Listener can put event on event queue too (feedback loop)
● Dispatcher to offer simple REST API called by users
○ Uses the hug library, built on top of Falcon, a bare-metal web API framework
○ Translates vetted API calls to Redis queue messages for microservices
○ Older, dispatcher like functionality exists in Skill Oracle
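The dispatcher pattern above can be sketched as a toy, in-process version using stdlib queues in place of Redis queues: services never talk to each other directly, they emit events to the dispatcher, which routes each event onto the queue of the service meant to react. Names and the message format are illustrative, not the project’s.

```python
from queue import Queue

class Dispatcher:
    """Toy in-process dispatcher; stdlib queues stand in for Redis queues."""

    def __init__(self):
        self.service_queues = {}

    def register(self, service_name):
        # Each service gets its own queue, which its Service Listener monitors.
        self.service_queues[service_name] = Queue()
        return self.service_queues[service_name]

    def emit(self, target_service, event):
        # Services only ever know the dispatcher; the dispatcher routes the
        # event onto the target service's queue, preserving the boundary.
        self.service_queues[target_service].put(event)

dispatcher = Dispatcher()
etl_queue = dispatcher.register("etl")

# e.g. the Skill Oracle noticing it is low on unlabeled skill candidates:
dispatcher.emit("etl", {"event": "low_on_candidates", "need": 100})
event = etl_queue.get()
```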
29. ESCO: European Skills, Competences, Qualifications and Occupations
Occupations+Skills
● https://ec.europa.eu/esco/portal/download
● EU based
● Continuously Updated (!)
● Occupations: ISCO-08, SOC crosswalk
● Skills*: 13,485; Qualifications: 2,414
*not a full hierarchy
30. Kaggle
Job Posting (related)
● https://www.kaggle.com/c/job-recommendation
● https://www.kaggle.com/c/job-salary-prediction
● Recommend or predict salary based on Job Posting
data
● Ground truth data, interesting data sets
31. USA Jobs
Job Posting
● www.usajobs.gov
● ALL U.S. federal government openings
● API @ developer.usajobs.gov
● Includes:
○ Job Description, Responsibilities
○ Min/Max Salary
○ Location
○ Date
● Near real time, 2 hour lag
● Note: government jobs are qualitatively different from private-sector jobs
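As I recall the developer.usajobs.gov documentation, a search request is a keyword query against data.usajobs.gov with `Host`, `User-Agent` (your registered email), and `Authorization-Key` headers; treat the exact headers as an assumption to verify there. The key and email below are placeholders, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build (but do not send) a USAJOBS keyword search request.
params = urlencode({"Keyword": "data scientist", "ResultsPerPage": 25})
req = Request(
    "https://data.usajobs.gov/api/search?" + params,
    headers={
        "Host": "data.usajobs.gov",
        "User-Agent": "you@example.com",       # your registered email
        "Authorization-Key": "YOUR_API_KEY",   # from developer.usajobs.gov
    },
)
```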
32. National Labor Exchange
Job Posting Data
● http://us.jobs
● National Labor Exchange: a partnership between
National Assoc. State Workforce Agency +
DirectEmployers Assoc.
● Collects job postings from over 25,000 corporate
websites, state job banks and USAJobs
● More than 2 million job postings at any given time
● Can browse by Occupation or Industry, weak taxonomy
● No public API :( (that I could find), just host link
metadata
● See: www.naswa.org/nlx/?action=what for more detail
33. State of Virginia CCARS
Open Data for Job Listings
● See: https://opendata-cs-vt.github.io/ccars-jobpostings/
● Many gigabytes of Job Listings in Virginia
● Primary source of job listings for Skill Labeler
● Similar to Data At Work’s mission, but by VA
34. Other Sites to be aware of
Advocacy, General Data
● Data.gov
Holds a lot of federal, state, city related data, search for jobs, job
postings
● www.nationalskillscoalition.org
National Skills Coalition, non profit special interest group
35. Going Beyond Job Related Data Sets:
A Larger Workforce Context
36. Dataset Topics
Occupations
● The type of job or work that a person
does; e.g. Mr. Wyeth is an artist or John
is a cashier.
Industry
● The business activity of an employer or
company; e.g. Walmart is in retail sales,
employs those in cashier occupations.
Semco makes and sells paint and
employs painters (but not like Mr.
Wyeth)
Skills
● The ability to do something well;
expertise. Can include knowledge,
abilities, etc.
37. O*NET (Occupational Information Network)
Occupation/Industry/Skills
● www.onetcenter.org
● Semi-annual Occupational Database
○ SOC Code
○ Big 6/RIASEC Occup. personality
tests
○ Skills, Tasks
● Heavy Industrial/Organizational
Psychologist focus
● DoL sponsored, led by North Carolina
Dept of Commerce
● Surveys, data collection since 2000
database @ www.onetcenter.org/database.html
38. BLS: SAE, QCEW (Bureau of Labor Statistics)
Industry
● www.bls.gov/sae/
● www.bls.gov/qcew/
● Different in methodology*
● Monthly/quarterly industry surveys of
wages and employment
● Released monthly (SAE) and quarterly (QCEW)
● Rolled up by Industry (NAICS), by
State/Metro Area* or National levels
data @ https://data.bls.gov/cew/apps/data_views/data_views.htm
data @ https://www.bls.gov/sae/home.htm#tables
*method: https://www.bls.gov/cew/cewbultncur.htm#Comparison
39. BLS: Measuring SOC Concentration By NAICS
Audrey Watson, “Measuring occupational concentration by industry,”
Beyond the Numbers: Employment & Unemployment, vol. 3, no. 3
(U.S. Bureau of Labor Statistics, February 2014), https://www.bls.gov/opub/btn/volume-3/measuring-occupational-concentration-by-industry.htm
Industry+Occupation
“[T]he HHI and industry quotients offer additional
perspectives on industry staffing patterns, helping to provide
a more accurate picture of the distribution of occupations
across industries. Such information could be useful for
workers as they choose a career, jobseekers as they narrow or
broaden their job searches, and employers as they try to
recruit workers from other industries …”
40. US Open City Data Census
City Data, Business Listings
● Interesting data set to be aware of, although not
directly relevant to workforce research
● 2018: us-cities.survey.okfn.org
● 2017: us-city.census.okfn.org * note city vs. cities
● Wide variety of data on US cities
● Links to city categorized business listings
● Grades cities on open data access
41. So What’s Next?
Ready to help us label?
Twitter: @data_at_work, get
notified when WDI skill
labeler is deployed to
production
Talk to us on Slack!
workdatainitiative.slack.com
Code and Research
Git:
github.com/workforce-data-initiative
See:
dataatwork.org/get-involved/
Use Gamification
Make skill labeling
psychologically motivating,
allow user accounts, high
scores?
A Comment System?
Additional opinions, context
on skills, job postings from
the community