Could You Be a Data Scientist? 
Carlo Torniai, Ph.D. 
@carlotorniai
Could You Be a Data Scientist? Quantify Data Scientist Profiles using Machine Learning and LinkedIn API.

A short presentation on my final project at Zipfian Academy: quantifying data scientist profiles using LinkedIn data.
The prototype web app is available at: bit.ly/cybads

Published in: Technology, Education
  1. Could You Be a Data Scientist?
     Carlo Torniai, Ph.D.
     @carlotorniai
  2. Goal
     • Quantify the features of data scientist profiles
     • Analyze aspiring data scientist profiles
     • Provide useful feedback
  3. Why is this relevant?
     • A quantitative characterization of data scientist profiles can help close the loop between job seekers and recruiters
     Image: http://www.getelastic.com/wp-content/uploads/puzzle1.jpg
  4. Data Collection
     • LinkedIn API:
       – General information
       – Past work history
       – Education
     • Web scraping:
       – Skills
     • 1500 profiles:
       – Data Scientists
       – Software Engineers
       – Business Analysts
       – Mathematicians
       – Statisticians
  5. Data Analysis
     Feature Extraction
     [Figure panels: Software Engineers, Business Analysts, Data Scientists, Statisticians, Mathematicians]
  6. Data Analysis
     Feature Extraction
     [Figure: Number of PhDs by topic and profile. Topics: Astronomy, Bioinformatics, Biology, Computer Science, Economics, Electronics, Engineering, Math, Neuroscience, Other, Physics, Psychology, Stats]
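The slides do not show how skills and education were turned into model inputs. One plausible encoding, sketched here as an assumption rather than the author's actual method, is a 0/1 vector with one column per skill and one per PhD field. The profile dictionaries, the `skills` and `phd_field` keys, and the sample values are all hypothetical.

```python
def build_vocabulary(profiles):
    """Collect every distinct skill and PhD field, sorted for a stable column order."""
    skill_vocab = sorted({s for p in profiles for s in p["skills"]})
    field_vocab = sorted({p["phd_field"] for p in profiles if p["phd_field"]})
    return skill_vocab, field_vocab


def encode(profile, skill_vocab, field_vocab):
    """Return a 0/1 vector: one column per skill, then one column per PhD field."""
    skill_part = [1 if s in profile["skills"] else 0 for s in skill_vocab]
    field_part = [1 if profile["phd_field"] == f else 0 for f in field_vocab]
    return skill_part + field_part


# Hypothetical profiles; the real data came from LinkedIn and is not reproduced here.
profiles = [
    {"skills": ["python", "machine learning"], "phd_field": "Physics"},
    {"skills": ["java", "python"], "phd_field": None},
    {"skills": ["r", "sas"], "phd_field": "Stats"},
]

skill_vocab, field_vocab = build_vocabulary(profiles)
matrix = [encode(p, skill_vocab, field_vocab) for p in profiles]
```

Each row of `matrix` can then be fed to a classifier or a clustering algorithm. A profile without a PhD simply gets zeros in every field column.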
  7. Model Testing
     For this project I trained the following models on skills and education features:
     • Random Forest: classify the profile
     • Naïve Bayes: multi-class probabilities to assess the background components of a profile
     • K-means: suggest similar and relevant profiles
  8. Model Testing
     For this project I trained the following models on skills and education features:

     Model          Training set                          Purpose
     Random Forest  All 5 categories                      Classify the profile
     Naïve Bayes    4 classic categories: SE, BA, MT, ST  Assess profile background components with multi-class probabilities
     K-means        All 5 categories                      Identify similar profiles
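To illustrate the Naïve Bayes row above, here is a minimal, dependency-free sketch of a Bernoulli Naïve Bayes model that returns one probability per classic category (SE, BA, MT, ST). The project itself used scikit-learn, and the slides do not say which Naïve Bayes variant was chosen, so the Bernoulli form, the smoothing value `alpha`, and the toy training skills are all assumptions.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(samples, vocabulary, alpha=1.0):
    """samples: list of (set_of_skills, label). Returns (log_priors, feature_probs)."""
    by_label = defaultdict(list)
    for skills, label in samples:
        by_label[label].append(skills)
    log_priors, feature_probs = {}, {}
    for label, docs in by_label.items():
        log_priors[label] = math.log(len(docs) / len(samples))
        # Laplace-smoothed probability that a profile of this class lists each skill.
        feature_probs[label] = {
            skill: (sum(skill in d for d in docs) + alpha) / (len(docs) + 2 * alpha)
            for skill in vocabulary
        }
    return log_priors, feature_probs


def predict_proba(skills, vocabulary, log_priors, feature_probs):
    """Return a dict mapping each class label to its posterior probability."""
    log_scores = {}
    for label, log_prior in log_priors.items():
        score = log_prior
        for skill in vocabulary:
            p = feature_probs[label][skill]
            score += math.log(p if skill in skills else 1.0 - p)
        log_scores[label] = score
    # Normalize in log space (subtract the maximum) to avoid underflow.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {label: v / total for label, v in exp_scores.items()}


# Hypothetical training profiles for the four classic categories.
training = [
    ({"java", "git"}, "SE"),
    ({"python", "git"}, "SE"),
    ({"excel", "sql"}, "BA"),
    ({"excel", "reporting"}, "BA"),
    ({"proofs", "matlab"}, "MT"),
    ({"matlab", "latex"}, "MT"),
    ({"r", "sas"}, "ST"),
    ({"r", "regression"}, "ST"),
]
vocabulary = sorted({s for skills, _ in training for s in skills})
log_priors, feature_probs = train_bernoulli_nb(training, vocabulary)

# A mixed profile gets a share of probability from several backgrounds.
mix = predict_proba({"python", "r", "sql"}, vocabulary, log_priors, feature_probs)
pure = predict_proba({"java", "git"}, vocabulary, log_priors, feature_probs)
```

Training on only the four classic categories means a data scientist profile is never assigned its own label. Instead, the returned probabilities describe how much of each classic background the profile resembles.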
  9. Data Product
     bit.ly/cybads
  10. Data Product
      [Screenshot labels: Naïve Bayes multi-class probabilities; Random Forest]
  11. Data Product
      [Screenshot label: K-means clustering]
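The K-means view suggests profiles similar to the one being analyzed. A minimal sketch of that idea, assuming profiles are already encoded as 0/1 feature vectors and that "similar" means "in the same cluster", could look like the following. The project used scikit-learn; this plain-Python version of Lloyd's algorithm, its deterministic seeding with the first `k` points, and the toy vectors are illustrative assumptions only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def similar_profiles(index, assignments):
    """Indices of all other profiles that share a cluster with profile `index`."""
    return [
        i for i, a in enumerate(assignments)
        if a == assignments[index] and i != index
    ]


# Hypothetical 0/1 skill vectors for five profiles.
points = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]
assignments = kmeans(points, k=2)
suggestions = similar_profiles(0, assignments)
```

Seeding with the first `k` points keeps the example reproducible. A production setup would more likely use randomized or k-means++ initialization with several restarts.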
  12. Next Steps
      Data Collection, Data Analysis, Feature Extraction, Model Testing, Data Product
      Get more data:
      - Other websites
      - Indeed
      - User input on the web app
      - Fine-grained parsing of education
      - Experiment with additional features (industry, years of experience)
      • Extend the feature set and test more models
      • Fuzzy C-means
      • Add interactive data collection
      • Personalized links for skills
      • Explanation of similarity results
      Close the loop by analyzing job offers and suggesting matching profiles
  13. Thank you!
      Technologies
      Web app: Flask, jQuery, Vega, MongoDB
      NMF, HC, RF, DT, NB, K-means models: scikit-learn
      Visualizations: Vincent, Vega, NetworkX, Gephi
      Acknowledgements
      yatish27: Ruby LinkedIn public profile web scraper
      ozgut: LinkedIn API Python wrapper
