1. Lynn Cherny, Assoc Prof Data Science, emlyon business
school
& Students!
@arnicas
PyData Paris 2017
2. Why am I here?
• Starting up a program in data science/analytics at
a business school: emlyon business school
• My courses first year: Python bootcamp, Data
analysis with Pandas, Text analysis/NLP, Business
Analytics (Excel pivot tables, SQL, Tableau).
• Next year: an intro AI course, some web & db stuff,
plus above.
3. –faculty in the marketing department when I introduced myself
“What do our students really need to know?”
4. –me, who likes NLP problems
“Hey, let’s find out by looking at job ads in
France.”
5. Also, This Project Course
• “Business Data Science Projects” — combine students
from
• École Centrale de Lyon (engineering school, so
presumably coders) +
• emlyon business students (presumably non-coders)
for product design/research/plan
In practice, coding skills in the teams were not distributed
as expected, but my project had strong skills on both
sides (we had already taught a few Python courses by then).
6. The student team
• Mathilde TRÉARDE (superb
project manager)
• Thomas PUCCI (amazing
reactjs front-end dev)
• Yann VAGINAY (great python
data scientist)
• Imen FEHRI
• Mohamed Amine MEJRI
• Roxane MARCILHACY (great
python data scientist)
• Julien RAULT
• Eric DUPRAZ
• Sophie REISER (great market
research/analyst)
• Nicolas LOUVIGNE (top notch
visual designer/branding)
• Grégoire CANER-CHABRAN
• Sarah DAIEN
7. Data Sources
Indeed API: targeted searches, text collection
apec.fr: targeted searches (and siphoning from API)
“JT” (CSV data dump from an edu provider)
Data collection began in February 2017 in earnest.
I beefed it up in April/May.
18. Architecture
Lynn said we should use these (Mongo, ES, Flask)
and set up the (poorly managed and insecure) Mongo / Elastic / EC2 crawler host
herself on AWS.
The dev team made their own plan: GitHub, React & Node.js, Heroku.
19. Some discoveries in the
code after it was over.
• Databases didn’t record the date the items were added
to them (the date of the scrape)
• Scraping was based on rather random sets of
words, and was not consistent across site sources
• No automation of the indexing in Elastic: it was a manual
job run from a Jupyter notebook (they knew this was an
issue too)
• Scraper code was never put on GitHub.
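The missing scrape date is cheap to add at collection time. A minimal sketch, assuming each scraped ad is held as a plain dict before it is written to the database; the helper `stamp_record` and the field names `source` and `scraped_at` are hypothetical, not the project's code:

```python
from datetime import datetime, timezone


def stamp_record(record, source, now=None):
    """Return a copy of a scraped ad with provenance fields added."""
    stamped = dict(record)  # copy, so the caller's dict is not mutated
    stamped["source"] = source
    # Store the scrape time in UTC, separate from the site's own publication date.
    stamped["scraped_at"] = (now or datetime.now(timezone.utc)).isoformat()
    return stamped


# Toy record (hypothetical values).
ad = {"title": "Data Analyst", "published": "2017-03-02"}
stamped_ad = stamp_record(ad, source="indeed")
```

Keeping `scraped_at` apart from the ad's publication date lets later analysis distinguish "when the job was posted" from "when we collected it".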
20. My security issues
• Tried and failed to secure Mongo by generating my own SSH keys;
ended up tunneling from the scraping machine (that works
fine).
• Elastic was wide open and had been written to by a virus
(Amazon just sent me a warning), which created extra tables.
• We had a lot of issues with university firewalls and the cloud.
We all had to tether to phones to access the dbs from
school.
• AWS security stuff is really confusing. (One student team
didn’t succeed in using AWS at all; no one helped them.)
26. Data in the db: the search
terms requested by API (!?)
[Charts: apec.fr; Indeed]
27. Dates in the db (remember,
not the date scraped…)
[Charts: Indeed’s date-of-publication counts; Apec]
Student work ended March/Apr. I added new terms and increased scraping into May/June.
33. Or create your own list
and see the related
skills in the
“neighborhood”:
scikit-learn is not in the skills list (?), but it is found in a job ad!
34. What is that graph?
A few “closely related skills” (by word2vec distance) in
a simple t-SNE layout, computed and passed over the API.
Awesome idea… but caveat: “skills” were pre-filtered from the
word2vec model of the job ads, using the list of LinkedIn
skills.
link
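The caveat about pre-filtering can be illustrated with a small sketch. It assumes toy 2-d vectors in place of a trained word2vec model and a made-up skills list in place of the LinkedIn one; every value and name below is hypothetical, and the t-SNE layout step is left out:

```python
import math

# Toy 2-d vectors standing in for a trained word2vec model (hypothetical values).
vectors = {
    "python": (0.9, 0.1),
    "pandas": (0.8, 0.2),
    "scikit-learn": (0.85, 0.15),
    "excel": (0.1, 0.9),
}

# Stand-in for the external skills list used as a pre-filter (hypothetical).
skills_list = {"python", "pandas", "excel"}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def neighbors(term, vocab, topn=2):
    """Rank the other words in `vocab` by cosine similarity to `term`."""
    scored = [(w, cosine(vectors[term], vectors[w])) for w in vocab if w != term]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in ranked[:topn]]


unfiltered = neighbors("python", vectors)
filtered = neighbors("python", [w for w in vectors if w in skills_list])
# unfiltered -> ['scikit-learn', 'pandas']
# filtered   -> ['pandas', 'excel']
```

In this toy setup the closest neighbor of "python" disappears once the whitelist is applied, because it is missing from the skills list even though the model learned it from the ads.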
35. A few related links
Radim’s tutorial on using word2vec in gensim:
https://rare-technologies.com/word2vec-tutorial/
My 8 million links on w2v papers/code etc:
https://pinboard.in/search/u:arnicas?query=word2vec
Interactive demo of w2v tsne layout of Yelp text reviews:
https://bl.ocks.org/arnicas/dd2ef348ad8854e40ef2
Useful warnings/info about making tsne layouts (we need
a grid search option):
http://distill.pub/2016/misread-tsne/
52. –one of my students (who did better after tips on searching for skills I’d
taught on other job sites) :)
“I feel like we’re all looking at the same vague
job ads and competing with each other.”
53. Search by courses taken?
some of these descriptions are really short and vague; what’s
a good criterion for match?
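One candidate criterion, shown only to make the question concrete, is token overlap (Jaccard similarity) between a course description and a job ad. A minimal sketch with made-up texts; the helper names are hypothetical and this is not the matching the project used:

```python
import re


def tokens(text):
    """Lowercase word tokens, keeping characters common in skill names."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def jaccard(a, b):
    """Share of distinct tokens that the two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Toy texts (hypothetical).
course = "Data analysis with Pandas"
job_ad = "Analyst role: data analysis in Python with Pandas and SQL"
score = jaccard(course, job_ad)  # 4 shared tokens out of 10 distinct -> 0.4
```

With very short descriptions a single shared stopword such as "with" moves the score noticeably, which is one reason a plain overlap measure is a weak criterion here.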
55. Teaching vs. Jobs, a Gap.
“decision-making” course (description translated from French):
Entrepreneurs are constantly called on to solve
problems with little time and few resources to
step back, in a highly uncertain environment.
Drawing on research findings in management
and cognitive psychology, this course aims to provide
a few simple inputs to develop and
support participants’ decision-making ability.
Job ad: “You can make decisions”?
56. So, Extension Ideas
• For student job search improvement:
• Return to skill extraction problem; use some training data. (Do some qualitative
analysis.)
• CV matching problem: revisit. Use different skills extraction (n-grams)
• Compare description of ALL courses taken (and liked) vs. jobs out there; is this
better?
• Curriculum development:
• Evaluate course descriptions by how well they match jobs
• Find “gaps” in teaching — what’s not being taught? (E.g., SQL.)
• Could course descriptions (and content) be better? Make this easier for
students?
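The n-gram skill extraction and "gap" ideas above could be prototyped along these lines. This is a minimal sketch under stated assumptions: the skills list, job ads, and course descriptions are toy data, and `ngrams` and `extract_skills` are hypothetical helpers:

```python
import re

# Hypothetical skills list; multi-word entries are why n-grams are needed.
SKILLS = {"sql", "python", "machine learning", "pivot tables"}


def ngrams(text, max_n=2):
    """All word n-grams of length 1..max_n, lowercased."""
    words = re.findall(r"[a-z0-9+#-]+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def extract_skills(text):
    """Skills from the list that appear as a unigram or bigram in the text."""
    return ngrams(text) & SKILLS


# Toy data (hypothetical).
job_ads = ["Machine learning engineer with Python and SQL", "Analyst: SQL, pivot tables"]
courses = ["Python bootcamp", "Business analytics with pivot tables"]

wanted = set().union(*(extract_skills(t) for t in job_ads))
taught = set().union(*(extract_skills(t) for t in courses))
gaps = wanted - taught  # skills asked for in ads but absent from course descriptions
```

Exact string matching like this misses synonyms and spelling variants, which is where the training data and qualitative analysis mentioned above would come in.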
57. My plan now
• Generally, starting up a Data Science Institute in
EM-Lyon. Money → DS and data vis visitors/
confs/talks.
• Looking for help with teaching/workshops/tutorials
(Paris, Lyon, St. Etienne, Shanghai, Casablanca,
India)
• Contact me at cherny@em-lyon.com or @arnicas
58. Reminder: The student team
• Mathilde TRÉARDE (superb
project manager)
• Thomas PUCCI (amazing
reactjs front-end dev, multiply
employed)
• Yann VAGINAY (great python
data scientist, now doing NLP in a
German internship)
• Imen FEHRI
• Mohamed Amine MEJRI
• Roxane MARCILHACY (great python
data scientist), now also a web dev.
Looking for an internship in Paris.
• Julien RAULT
• Eric DUPRAZ
• Sophie REISER (great market
research/analyst, not dev, but looking)
• Nicolas LOUVIGNE (top notch visual
designer/branding)
• Grégoire CANER-CHABRAN
• Sarah DAIEN