Could You Be a Data Scientist? 
Carlo Torniai, Ph.D. 
@carlotorniai
Goal 
• Quantify the features of data scientist profiles 
• Analyze the profiles of aspiring data scientists 
• Provide useful feedback 
Why is this relevant? 
• A quantitative characterization of data scientist 
profiles can help close the loop between job 
seekers and recruiters 
Image: http://www.getelastic.com/wp-content/uploads/puzzle1.jpg
Data Collection 
• LinkedIn API: 
– General Information 
– Past work history 
– Education 
• Web Scraping: 
– Skills 
• 1500 profiles 
– Data Scientists 
– Software Engineers 
– Business Analysts 
– Mathematicians 
– Statisticians
Data Analysis 
Feature Extraction 
[Figure legend, profile categories: Software Engineers, Business Analysts, Data Scientists, Statisticians, Mathematicians]
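As a minimal sketch of what extracting skill features could look like, the snippet below turns each profile's skill list into a binary vector over a shared vocabulary. The profile records, field names, and helper functions are hypothetical and invented for illustration; the slides do not say how the actual features were encoded.

```python
def build_vocabulary(profiles):
    """Collect the sorted set of distinct (lower-cased) skills across profiles."""
    return sorted({skill.lower() for p in profiles for skill in p["skills"]})


def encode(profile, vocabulary):
    """Return a 0/1 vector marking which vocabulary skills the profile lists."""
    skills = {s.lower() for s in profile["skills"]}
    return [1 if skill in skills else 0 for skill in vocabulary]


# Hypothetical profiles, not taken from the collected data set.
profiles = [
    {"title": "Data Scientist", "skills": ["Python", "Machine Learning", "Statistics"]},
    {"title": "Software Engineer", "skills": ["Java", "Python"]},
    {"title": "Statistician", "skills": ["R", "Statistics"]},
]

vocabulary = build_vocabulary(profiles)
X = [encode(p, vocabulary) for p in profiles]

for profile, row in zip(profiles, X):
    print(profile["title"], row)
```

Each row of `X` can then be fed to a classifier or a clustering model alongside education features.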
Data Analysis 
Feature Extraction 
[Figure: Number of PhDs by topic and profile. Topics: Astronomy, Bioinformatics, Biology, Computer Science, Economics, Electronics, Engineering, Math, Neuroscience, Other, Physics, Psychology, Stats]
Model Testing 
For this project I trained the following models on skills 
and education features: 
Random Forest 
• Classify the profile 
Naïve Bayes 
• Multi-class probabilities to assess the background 
components of a profile 
K-means 
• Suggest similar and relevant profiles
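The classification step can be sketched with scikit-learn, which the deck lists among its technologies. This is an illustrative example only: the skill vocabulary, the toy training profiles, and the hyperparameters are assumptions, not the data or settings used in the project.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical skill vocabulary; the real features came from scraped profiles.
SKILLS = ["java", "git", "excel", "sql", "proofs", "topology",
          "r", "sas", "machine learning", "hadoop"]


def encode(skills):
    """Turn a set of skill names into a binary vector over SKILLS."""
    return [1 if s in skills else 0 for s in SKILLS]


# Toy labelled profiles (invented for illustration), three per category.
training = [
    ("Software Engineer", {"java", "git"}),
    ("Software Engineer", {"java"}),
    ("Software Engineer", {"git"}),
    ("Business Analyst", {"excel", "sql"}),
    ("Business Analyst", {"excel"}),
    ("Business Analyst", {"sql"}),
    ("Mathematician", {"proofs", "topology"}),
    ("Mathematician", {"proofs"}),
    ("Mathematician", {"topology"}),
    ("Statistician", {"r", "sas"}),
    ("Statistician", {"r"}),
    ("Statistician", {"sas"}),
    ("Data Scientist", {"machine learning", "hadoop"}),
    ("Data Scientist", {"machine learning"}),
    ("Data Scientist", {"hadoop"}),
]
X = [encode(skills) for _, skills in training]
y = [label for label, _ in training]

# Fixed random_state so the toy run is reproducible.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Classify an unseen profile by its skills.
predicted = clf.predict([encode({"java", "git"})])[0]
print(predicted)
```

Real profiles overlap far more than this toy data, so held-out evaluation would be needed before trusting any predicted label.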
Model Testing 
For this project I trained the following models on skills 
and education features: 
Model | Training set | Purpose 
Random Forest | All 5 categories | Classify the profile 
Naïve Bayes | 4 classic categories: SE, BA, MT, ST | Assess the background components of a profile with multi-class probabilities 
K-means | All 5 categories | Identify similar profiles
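The Naïve Bayes row can be illustrated as follows: train only on the four classic categories, then read the class probabilities for a new profile as its background mix. This is a sketch under assumptions: the slides do not say which Naïve Bayes variant was used (scikit-learn's `MultinomialNB` is chosen here), and the skills and training profiles are invented.

```python
from sklearn.naive_bayes import MultinomialNB

# Hypothetical skill vocabulary for the four classic categories.
SKILLS = ["java", "git", "excel", "sql", "proofs", "topology", "r", "sas"]


def encode(skills):
    """Turn a set of skill names into a binary vector over SKILLS."""
    return [1 if s in skills else 0 for s in SKILLS]


# Toy training set: SE, BA, MT, ST only; no data scientist profiles.
training = [
    ("SE", {"java", "git"}), ("SE", {"java"}),
    ("BA", {"excel", "sql"}), ("BA", {"excel"}),
    ("MT", {"proofs", "topology"}), ("MT", {"proofs"}),
    ("ST", {"r", "sas"}), ("ST", {"r"}),
]
X = [encode(skills) for _, skills in training]
y = [label for label, _ in training]

nb = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
nb.fit(X, y)

# A candidate profile mixing engineering and statistics skills.
candidate = encode({"java", "r", "sas"})
probs = dict(zip(nb.classes_, nb.predict_proba([candidate])[0]))

# Print the background components, largest first.
for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.2f}")
```

Leaving data scientists out of the training set is what lets the probabilities be read as "how much of each classic background" a profile shows.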
Data Product 
bit.ly/cybads
Data Product 
Naïve Bayes 
Multi class 
probabilities 
Random Forest
Data Product 
K-means 
clustering
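One way K-means clustering could surface similar profiles is to assign a query profile to a cluster and return the other members of that cluster. The sketch below assumes binary skill vectors and uses made-up names and data; the number of clusters and the lookup strategy are illustrative choices, not the ones used in the web app.

```python
from sklearn.cluster import KMeans

# Hypothetical profiles encoded as binary skill vectors (six skills).
names = ["ann", "bob", "cat", "dan", "eve", "fay"]
X = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]

# Two clusters for the toy data; n_init and random_state fixed for repeatability.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Assign a new profile to its nearest cluster and list that cluster's members.
query = [1, 1, 1, 0, 0, 0]
cluster = km.predict([query])[0]
similar = [n for n, label in zip(names, km.labels_) if label == cluster]
print(similar)
```

Ranking the cluster members by distance to the query would give a finer-grained notion of similarity than cluster membership alone.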
Next Steps 
Data Collection 
• Get more data: 
– Other websites 
– Indeed 
– User input on the web app 
Data Analysis / Feature Extraction 
• Fine-grained parsing of education 
• Experiment with additional features (industry, years of experience) 
Model Testing 
• Extend the feature set and test more models 
• Fuzzy C-means 
Data Product 
• Add interactive data collection 
• Personalized links for skills 
• Explanation of similarity results 
Close the loop by analyzing job offers and suggesting 
matching profiles
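Fuzzy C-means, listed under Next Steps, differs from K-means in that every profile receives a degree of membership in each cluster rather than a single label. The following is a minimal NumPy sketch of the standard update rules, written for illustration; the function name, parameters, and toy data are assumptions and not part of the project.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Return (centers, U) where U[i, j] is the membership of point i in cluster j."""
    rng = np.random.default_rng(seed)
    # Start from random memberships, normalized so each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are membership-weighted means (weights raised to the fuzzifier m).
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances from every point to every center, floored to avoid division by zero.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)
        # Membership update: proportional to dist^(-2 / (m - 1)), normalized per point.
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U


# Hypothetical binary skill vectors for six profiles.
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

centers, U = fuzzy_c_means(X, n_clusters=2)
print(np.round(U, 2))
```

Soft memberships fit the deck's premise that a data scientist profile blends several backgrounds, which a hard cluster assignment cannot express.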
Thank you! 
Technologies 
Web App: 
Flask, jQuery, Vega, MongoDB 
NMF, HC, RF, DT, NB, K-means models: 
scikit-learn 
Visualizations: 
Vincent, Vega, NetworkX, Gephi 
Acknowledgement 
yatish27: Ruby LinkedIn public profile web scraper 
ozgut: LinkedIn API Python wrapper

Could You be a Data Scientist? Quantify Data Scientist Profiles using Machine Learning and Linkedin API.
