Exploring French Job Ads, Lynn Cherny

Pôle Systematic Paris-Region
Lynn Cherny, Assoc Prof Data Science, emlyon business
school
& Students!
@arnicas
PyData Paris 2017
Why am I here?
• Starting up a program in data science/analytics at
a business school: emlyon business school
• My courses first year: Python bootcamp, Data
analysis with Pandas, Text analysis/NLP, Business
Analytics (Excel pivot tables, SQL, Tableau).
• Next year: an intro AI course, some web & db stuff,
plus above.
–faculty in the marketing department when I introduced myself
“What do our students really need to know?”
–me, who likes NLP problems
“Hey, let’s find out by looking at job ads in
France.”
Also, This Project Course
• “Business Data Science Projects” — combine students
from
• École Centrale de Lyon (engineering school, so
presumably coders) +
• emlyon business students (presumably non-coders)
for product design/research/plan
In practice, coding skills in the teams were not distributed
as expected, but my project had strong skills on both
sides (we had already taught a few Python courses by then).
The student team
• Mathilde TRÉARDE (superb
project manager)
• Thomas PUCCI (amazing
reactjs front-end dev)
• Yann VAGINAY (great python
data scientist)
• Imen FEHRI
• Mohamed Amine MEJRI
• Roxane MARCILHACY (great
python data scientist)
• Julien RAULT
• Eric DUPRAZ
• Sophie REISER (great market
research/analyst)
• Nicolas LOUVIGNE (top notch
visual designer/branding)
• Grégoire CANER-CHABRAN
• Sarah DAIEN
Data Sources
Indeed API: targeted searches, text collection
apec.fr: targeted searches (and siphoning from API)
“JT” (CSV data dump from an edu provider)
Data collection began in February 2017 in earnest.
I beefed it up in April/May.
Demo
Filter: A PDF resume uploaded… maybe a bit imperfect now:
Biz students:
95 student interviews of job searchers
Excellent creative work
UI mockup suggestions from
biz team
Architecture
Lynn said we should do these (Mongo, ES, Flask)
and set up (poorly managed and insecure) Mongo / Elastic / EC2 crawler host
herself on AWS.
Dev team did their own github/react & nodejs/Heroku plan.
Some discoveries in the
code after it was over.
• Databases didn’t store the date the items were added
to them (date of scrape)
• Scraping was based on rather random sets of
words, and not consistent across site sources
• No automation of the indexing in Elastic - manual
job from Jupyter notebook (they knew this was an
issue too)
• Scraper code was never put on github.
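The missing scrape date in the first discovery above is cheap to avoid. A minimal sketch, assuming each scraped ad is a plain dict before it is written to the database; the function name `stamp_record` and the field names `scraped_at` and `source` are hypothetical, not taken from the project's code:

```python
from datetime import datetime, timezone

def stamp_record(ad, source, now=None):
    """Return a copy of a scraped ad with scrape metadata attached.

    `scraped_at` and `source` are hypothetical field names for illustration.
    """
    stamped = dict(ad)  # copy so the caller's dict is not mutated
    when = now or datetime.now(timezone.utc)
    stamped["scraped_at"] = when.isoformat()  # date of scrape, not of publication
    stamped["source"] = source
    return stamped

# Toy ad, invented for illustration
ad = {"title": "Data analyst (stage)", "published": "2017-03-02"}
record = stamp_record(ad, "apec", now=datetime(2017, 3, 5, tzinfo=timezone.utc))
print(record["scraped_at"])
```

Keeping the publication date and the scrape date as separate fields would make the date charts later in the talk easier to interpret.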
My security issues
• Tried and failed to secure Mongo with my own SSH key generation;
ended up using tunneling from the scraping machine (that works
fine).
• Elastic is wide open and had been written to by a virus
(Amazon just sent me a warning), creating extra tables.
• We had a lot of issues with university firewalls and the cloud.
We all had to tether to phones to access the dbs from
school.
• AWS security stuff is really confusing. (One student team
didn’t succeed in using AWS at all— no one helped them.)
the data in more
detail…
Total Data Now by Source
• “JT,” an academic partner (given to us as a dump in
January, now “out of date”): 78K
• Apec: 25K
• Indeed: 10K
Apec - cadres (managers/executives)
My student: “they would never hire someone like me”
Indeed - international feed (API)
with links - need to scrape text
more english:
Data in the db : the search
terms requested by API (!?)
apec.fr Indeed
Dates in the db (remember,
not the date scraped…)
Indeed’s date of
publication counts
Apec
student work ended March/Apr - I added new terms and increased scraping into May/June
JT provided data dates
No, this spike is real,
they are different ads and dated
this same day.
Job type labels
on JT data
Largest cats are
Marketing, Bizdev,
Communication
(Dev/IT not small tho)
“JT”: more “stages” (internships)
Revisit the word2vec
part
Or create your own list
and see the related
skills in the
“neighborhood”:
scikit-learn is not in the skills list? but is found in a job ad!
What is that graph?
a few “closely related skills” (by word2vec distance) in
a simple TSNE layout, computed and passed over API.
Awesome idea… but caveat: “Skills” were pre-filtered from the
word2vec model of the job ads, using the list of LinkedIn
skills.
link
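The caveat above can be sketched in a few lines: rank neighbors by cosine similarity, but only among words that appear in a skills list. This is an illustration under stated assumptions, not the team's code. The vectors are toy 3-d values invented here (a real word2vec model would supply them), and `related_skills` is a hypothetical helper name:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_skills(query, vectors, skills, topn=3):
    """Rank words from `skills` by cosine similarity to `query`."""
    q = vectors[query]
    scored = [
        (word, cosine(q, vec))
        for word, vec in vectors.items()
        if word != query and word in skills  # the pre-filtering step
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for illustration only
vectors = {
    "python": [0.9, 0.1, 0.0],
    "pandas": [0.8, 0.2, 0.1],
    "scikit-learn": [0.85, 0.15, 0.05],
    "sql": [0.5, 0.5, 0.0],
    "marketing": [0.0, 0.1, 0.9],
}
skills = {"python", "pandas", "sql", "marketing"}  # no "scikit-learn" here

neighbors = related_skills("python", vectors, skills)
print([word for word, _ in neighbors])
```

Because "scikit-learn" is missing from the skills list, it can never show up as a neighbor, however close its vector is. That is the same effect as the scikit-learn surprise noted earlier.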
A few related links
Radim’s tutorial on using word2vec in gensim:
https://rare-technologies.com/word2vec-tutorial/
My 8 million links on w2v papers/code etc:
https://pinboard.in/search/u:arnicas?query=word2vec
Interactive demo of w2v tsne layout of Yelp text reviews:
https://bl.ocks.org/arnicas/dd2ef348ad8854e40ef2
Useful warnings/info about making tsne layouts (we need
a grid search option):
http://distill.pub/2016/misread-tsne/
LinkedIn Skillz list: English,
Mysterious, —Garbage?
LI skills only
from the w2v model
in March
Zoom in…
Word2Vec updated (a week ago)
?!
Python also didn’t make the
“top 50 words
per search term,” which is
sad.
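For reference, a "top words per search term" count can be sketched as below. This assumes the ads are available as (search term, text) pairs and uses naive whitespace tokenization; the function name, the input shape, and the toy ads are all hypothetical:

```python
from collections import Counter, defaultdict

def top_words_per_term(ads, k=50, stopwords=frozenset()):
    """Map each search term to its k most frequent words across its ads."""
    counts = defaultdict(Counter)
    for term, text in ads:
        words = [w for w in text.lower().split() if w not in stopwords]
        counts[term].update(words)
    return {term: [w for w, _ in c.most_common(k)] for term, c in counts.items()}

# Toy ads, invented for illustration only
ads = [
    ("data", "Python SQL SQL"),
    ("data", "SQL Excel Python"),
    ("marketing", "Excel Excel CRM"),
]
top = top_words_per_term(ads, k=2)
print(top)
```

With real ads, a French and English stopword list and better tokenization would matter a lot for which words make the cut.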
My shitty tsne layout that took
40 minutes on my laptop
Tensorboard projector view
convert your gensim model to tensorflow tsv files and upload
http://projector.tensorflow.org/
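The projector accepts a tab-separated vectors file plus a metadata file with one label per line. A minimal sketch of writing both, assuming the words and vectors have already been pulled out of the model into a dict (how to do that depends on the gensim version, so that step is left out); `write_projector_tsv` is a hypothetical helper name:

```python
import io

def write_projector_tsv(vectors, vec_file, meta_file):
    """Write one tab-separated vector per line, and the matching word per line."""
    for word, vec in vectors.items():
        vec_file.write("\t".join(str(x) for x in vec) + "\n")
        meta_file.write(word + "\n")  # single-column metadata, no header row

# Toy vectors, invented for illustration only
vectors = {"python": [0.9, 0.1], "sql": [0.5, 0.5]}
vec_out, meta_out = io.StringIO(), io.StringIO()
write_projector_tsv(vectors, vec_out, meta_out)
print(vec_out.getvalue())
```

In practice the two `StringIO` objects would be replaced by files opened for writing, which are then uploaded on the projector page.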
english
Tableau app
vis in Tableau, more UI options
Most frequent data-related words, sized by frequency in search on source.
Note: few JT ads words (pink)
Sales, logistics, supply chain - lot of JT.
Let’s look at job ads
again…
“skills” are often soft or “previous
experience doing” in business job ads
link
Market research with
students:
Algorithm to determine “skill” “matches” is interesting but
worrying. It has to be really “good.”
–one of my students (who did better after tips on searching for skills I’d
taught on other job sites) :)
“I feel like we’re all looking at the same vague
job ads and competing with each other.”
Search by courses taken?
some of these descriptions are really short and vague; what’s
a good criterion for match?
sure, with 2 words, we get
some matches…
Teaching vs. Jobs, a Gap.
Entrepreneurs are constantly called on to solve problems, with little
time and few resources to step back, in a highly uncertain environment.
Drawing on research findings in management and cognitive psychology,
this course aims to provide a few simple inputs for developing and
supporting participants' decision-making ability.
“decision-making” course:
Job ad: “You can make decisions”?
So, Extension Ideas
• For student job search improvement:
• Return to skill extraction problem; use some training data. (Do some qualitative
analysis.)
• CV matching problem: revisit. Use different skills extraction (n-grams)
• Compare description of ALL courses taken (and liked) vs. jobs out there; is this
better?
• Curriculum development:
• Evaluate course descriptions by how well they match jobs
• Find “gaps” in teaching — what’s not being taught? (E.g., SQL.)
• Could course descriptions (and content) be better? Make this easier for
students?
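One way to start on the n-gram and course-matching ideas above is a plain overlap score between a course description and a job ad. This is a sketch under assumptions, not a proposed final method: word uni- and bigrams, Jaccard overlap, no stemming or stopword handling, and toy texts invented for illustration:

```python
import re

def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) in lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_score(course, ad):
    """Jaccard overlap of the n-gram sets of a course description and an ad."""
    return jaccard(ngrams(course), ngrams(ad))

# Toy texts, invented for illustration only
vague = match_score("decision making under uncertainty",
                    "you can make decisions and use SQL")
concrete = match_score("data analysis with pandas", "data analysis with sql")
print(vague, concrete)
```

The vague pair scores zero even though a human sees the link between "decision making" and "make decisions", which is the teaching-versus-jobs gap in miniature. Exact n-gram overlap alone is unlikely to be a good enough match criterion.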
My plan now
• Generally, starting up a Data Science Institute in
EM-Lyon. Money —> DS and data vis visitors/
confs/talks.
• Looking for help with teaching/workshops/tutorials
(Paris, Lyon, St. Etienne, Shanghai, Casablanca,
India)
• Contact me at cherny@em-lyon.com or @arnicas
Reminder: The student team
• Mathilde TRÉARDE (superb
project manager)
• Thomas PUCCI (amazing
reactjs front-end dev, multiply
employed)
• Yann VAGINAY (great python
data scientist, doing NLP in a
German internship now)
• Imen FEHRI
• Mohamed Amine MEJRI
• Roxane MARCILHACY (great python
data scientist) - now also web dev.
Looking for an internship in Paris.
• Julien RAULT
• Eric DUPRAZ
• Sophie REISER (great market
research/analyst, not dev, but looking)
• Nicolas LOUVIGNE (top notch visual
designer/branding)
• Grégoire CANER-CHABRAN
• Sarah DAIEN