Exploring French Job Ads, Lynn Cherny

PyParis 2017

  1. 1. Lynn Cherny, Assoc Prof Data Science, emlyon business school & Students! @arnicas PyData Paris 2017
  2. 2. Why am I here? • Starting up a program in data science/analytics at a business school: emlyon business school • My courses first year: Python bootcamp, Data analysis with Pandas, Text analysis/NLP, Business Analytics (Excel pivot tables, SQL, Tableau). • Next year: an intro AI course, some web & db stuff, plus above.
  3. 3. “What do our students really need to know?” – faculty in the marketing department, when I introduced myself
  4. 4. “What do our students really need to know?” – faculty in the marketing department, when I introduced myself. “Hey, let’s find out by looking at job ads in France.” – me, who likes NLP problems
  5. 5. Also, This Project Course • “Business Data Science Projects”: combine students from • École Centrale de Lyon (engineering school, so presumably coders) + • emlyon business students (presumably non-coders) for product design/research/plan • In practice, coding skills in the teams were not distributed as expected, but my project had strong skills on both sides (we had already taught a few Python courses by then)
  6. 6. The student team • Mathilde TRÉARDE (superb project manager) • Thomas PUCCI (amazing reactjs front-end dev) • Yann VAGINAY (great python data scientist) • Imen FEHRI • Mohamed Amine MEJRI • Roxane MARCILHACY (great python data scientist) • Julien RAULT • Eric DUPRAZ • Sophie REISER (great market research/analyst) • Nicolas LOUVIGNE (top notch visual designer/branding) • Grégoire CANER-CHABRAN • Sarah DAIEN
  7. 7. Data Sources • Indeed API: targeted searches, text collection • targeted searches (and siphoning from API) • “JT” (CSV data dump from an edu provider) • Data collection began in earnest in February 2017. I beefed it up in April/May.
  8. 8. Demo
  9. 9. Filter: A PDF resume uploaded… maybe a bit imperfect now:
  10. 10. Biz students: 95 student interviews of job searchers
  11. 11. Excellent creative work
  12. 12. UI mockup suggestions from biz team
  13. 13. Architecture Lynn said we should do these (Mongo, ES, Flask) and set up (poorly managed and insecure) Mongo / Elastic / EC2 crawler host herself on AWS. Dev team did their own github/react & nodejs/Heroku plan.
  14. 14. Some discoveries in the code after it was over. • The databases didn’t record the date the items were added to them (date of scrape) • Scraping was based on rather random sets of words, which were not consistent across site sources • No automation of the indexing in Elastic: it was a manual job run from a Jupyter notebook (they knew this was an issue too) • Scraper code was never put on GitHub.
  15. 15. My security issues • Tried and failed to secure Mongo with my own SSH key generation; ended up tunneling from the scraping machine (that works fine). • Elastic is wide open and had been written to by a virus (Amazon just sent me a warning), which created extra tables. • We had a lot of issues with university firewalls and the cloud. We all had to tether to phones to access the dbs from school. • AWS security stuff is really confusing. (One student team didn’t succeed in using AWS at all; no one helped them.)
  16. 16. the data in more detail…
  17. 17. Total Data Now by Source • “JT,” an academic partner (given us as dump in Jan, now “out of date”): 78K • Apec: 25K • Indeed: 10K
  18. 18. Apec - cadres My student: “they would never hire someone like me”
  19. 19. Indeed - international feed (API) with links - need to scrape text
  20. 20. more english:
  21. 21. Data in the db : the search terms requested by API (!?) Indeed
  22. 22. Dates in the db (remember, not the date scraped…) Indeed’s date of publication counts Apec student work ended March/Apr - I added new terms and increased scraping into May/June
  23. 23. JT provided data dates
  24. 24. JT provided data dates No, this spike is real, they are different ads and dated this same day.
  25. 25. Job type labels on JT data Largest cats are Marketing, Bizdev, Communication (Dev/IT not small tho)
  26. 26. “JT” : more “stages”
  27. 27. Revisit the word2vec part
  28. 28. Or create your own list and see the related skills in the “neighborhood”: scikit-learn is not in the skills list? but is found in a job ad!
  29. 29. What is that graph? A few “closely related skills” (by word2vec distance) in a simple TSNE layout, computed and passed over the API. Awesome idea… but caveat: “skills” were pre-filtered from the word2vec model of the job ads, using the list of LinkedIn skills.
  30. 30. A few related links • Radim’s tutorial on using word2vec in gensim • My 8 million links on w2v papers/code etc. • Interactive demo of w2v tsne layout of Yelp text reviews • Useful warnings/info about making tsne layouts (we need a grid search option)
  31. 31. LinkedIn Skillz list: English, Mysterious, —Garbage?
  32. 32. LI skills only from the w2v model in March
  33. 33. Zoom in…
  34. 34. Word2Vec updated (a week ago) ?! Python also didn’t make the “top 50 words per search term,” which is sad.
  35. 35. My shitty tsne layout that took 40 minutes on my laptop
  36. 36. Tensorboard projector view: convert your gensim model to TensorFlow TSV files and upload (English)
  37. 37. Tableau app vis in Tableau, more UI options
  38. 38. Most frequent data-related words, sized by frequency in search on source. Note: few JT ads words (pink)
  39. 39. Sales, logistics supply chain - lot of JT.
  40. 40. Let’s look at job ads again…
  41. 41. “Skills” are often soft, or “previous experience doing”, in business job ads
  42. 42. Market research with students: Algorithm to determine “skill” “matches” is interesting but worrying. It has to be really “good.”
  43. 43. –one of my students (who did better after tips on searching for skills I’d taught on other job sites) :) “I feel like we’re all looking at the same vague job ads and competing with each other.”
  44. 44. Search by courses taken? some of these descriptions are really short and vague; what’s a good criterion for match?
  45. 45. sure, with 2 words, we get some matches…
  46. 46. Teaching vs. Jobs, a Gap. “Decision-making” course description: “Entrepreneurs are constantly called on to solve problems with little time and few resources to step back, in a highly uncertain environment. Drawing on research findings in management and cognitive psychology, this course aims to provide a few simple inputs to develop and support participants’ decision-making ability.” Job ad: “You can make decisions”?
  47. 47. So, Extension Ideas • For student job search improvement: • Return to skill extraction problem; use some training data. (Do some qualitative analysis.) • CV matching problem: revisit. Use different skills extraction (n-grams) • Compare description of ALL courses taken (and liked) vs. jobs out there; is this better? • Curriculum development: • Evaluate course descriptions by how well they match jobs • Find “gaps” in teaching — what’s not being taught? (E.g., SQL.) • Could course descriptions (and content) be better? Make this easier for students?
  48. 48. My plan now • Generally, starting up a Data Science Institute in EM-Lyon. Money -> DS and data vis visitors/confs/talks. • Looking for help with teaching/workshops/tutorials (Paris, Lyon, St. Etienne, Shanghai, Casablanca, India) • Contact me at @arnicas
  49. 49. Reminder: The student team • Mathilde TRÉARDE (superb project manager) • Thomas PUCCI (amazing reactjs front-end dev, multiply employed) • Yann VAGINAY (great python data scientist doing NLP in German stage now) • Imen FEHRI • Mohamed Amine MEJRI • Roxane MARCILHACY (great python data scientist) - now also web dev. Looking for stage in Paris. • Julien RAULT • Eric DUPRAZ • Sophie REISER (great market research/analyst, not dev, but looking) • Nicolas LOUVIGNE (top notch visual designer/branding) • Grégoire CANER-CHABRAN • Sarah DAIEN
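
Slide 14 notes that the databases never recorded the date of scrape or a consistent search term. A minimal sketch of one way to stamp each ad before storing it is below. The function name, field names and sample ad are hypothetical and are not taken from the project's code.

```python
from datetime import datetime, timezone


def stamp_scraped(record, source, search_term, now=None):
    """Return a copy of a scraped ad with provenance fields added.

    The field names (source, search_term, scraped_at) are illustrative
    assumptions, not the schema the project actually used.
    """
    stamped = dict(record)  # copy, so the caller's dict is not mutated
    stamped["source"] = source
    stamped["search_term"] = search_term
    # Store the scrape time in UTC, separate from the ad's publication date.
    stamped["scraped_at"] = (now or datetime.now(timezone.utc)).isoformat()
    return stamped


# Invented sample ad, for illustration only.
ad = {"title": "Data analyst (stage)", "published": "2017-03-02"}
stamped_ad = stamp_scraped(ad, source="indeed", search_term="data analyst")
```

Keeping `scraped_at` apart from the site's own publication date is what would let the date charts on slides 22 to 24 be read against collection effort.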
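
Slides 28 and 29 describe "closely related skills" by word2vec distance, pre-filtered with a skills list. The sketch below illustrates that idea with plain cosine similarity over a dictionary of vectors. The toy vectors and the skills set are invented for illustration, and the project itself presumably queried a gensim model rather than code like this.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_skills(query, vectors, skills, topn=3):
    """Rank words that are on the skills list by similarity to `query`."""
    if query not in vectors:
        return []
    scored = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query and word in skills  # the pre-filter from slide 29
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-d vectors, invented for illustration (real models use 100+ dims).
toy_vectors = {
    "python": [0.9, 0.1, 0.0],
    "pandas": [0.8, 0.2, 0.1],
    "scikit-learn": [0.85, 0.15, 0.05],
    "sql": [0.5, 0.5, 0.0],
    "negotiation": [0.0, 0.1, 0.9],
}
# Hypothetical skills list that happens to lack scikit-learn.
skills_list = {"python", "pandas", "sql", "negotiation"}
neighbors = related_skills("python", toy_vectors, skills_list, topn=2)
```

Here `scikit-learn` is close to `python` in the toy space but never appears in the result, because the skills filter drops it. That is the caveat the slides raise.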
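
Slide 36 mentions converting a gensim model to TSV files for the Tensorboard projector. A hedged sketch of the export step is below. It assumes the projector's usual two-file input: a vectors file with one tab-separated row per word, and a single-column metadata file with one label per line and no header. The function name and file names are hypothetical.

```python
import os
import tempfile


def write_projector_files(words, vectors, vectors_path, metadata_path):
    """Write one tab-separated vector per line, plus a matching label file."""
    with open(vectors_path, "w", encoding="utf-8") as vec_file, \
         open(metadata_path, "w", encoding="utf-8") as meta_file:
        for word, vector in zip(words, vectors):
            vec_file.write("\t".join(f"{x:.6f}" for x in vector) + "\n")
            meta_file.write(word + "\n")


# With a trained gensim 4.x model (an assumption; older releases expose
# different attribute names), the inputs would be:
#   words = model.wv.index_to_key
#   vectors = model.wv.vectors
out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "vectors.tsv")
meta_path = os.path.join(out_dir, "metadata.tsv")
write_projector_files(
    ["python", "sql"],
    [[0.1, 0.2], [0.3, 0.4]],  # toy vectors, invented for illustration
    vec_path,
    meta_path,
)
```

The two files can then be loaded in the projector's upload dialog, which runs the layout in the browser and so avoids a slow local TSNE run.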
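
Slide 47 proposes evaluating course descriptions by how well they match jobs, using n-grams. One simple baseline, sketched below, scores each course by the Jaccard overlap between its n-grams and a job ad's n-grams. The course descriptions and the job ad are invented for illustration, and the scoring rule is an assumption rather than anything the project implemented.

```python
import re


def ngrams(text, n_max=2):
    """Return the set of lowercase word n-grams of length 1..n_max."""
    tokens = re.findall(r"[\w+#]+", text.lower())
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_courses(job_ad, courses, n_max=2):
    """Sort courses by n-gram overlap with the job ad, best first."""
    ad_grams = ngrams(job_ad, n_max)
    scores = {
        name: jaccard(ngrams(description, n_max), ad_grams)
        for name, description in courses.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented course descriptions and job ad, for illustration only.
courses = {
    "Data analysis with Pandas": "Clean and analyse tabular data with Python and pandas",
    "Decision-making": "Develop decision making under uncertainty with limited time",
}
job_ad = "Junior analyst: SQL and Python required, pandas a plus"
ranked = rank_courses(job_ad, courses)
```

No stop words are removed, so short, vague descriptions can match on words like "and". That is one form of the "what's a good criterion for match?" problem from slide 44. A course that scores zero against every ad mentioning a skill such as SQL would also be a cheap way to surface teaching gaps.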