TEXT MINING DATA SCIENCE JOBS IN R
Sung Park, MSPA Candidate
August 20, 2015
Northwestern University
PREDICT 422-DL Section 55
  
SUMMARY
•  Introduction
•  Resources
•  Data Source
•  Data Extraction
•  Data Preparation
•  Supervised Learning
  
INTRODUCTION
•  Exploration of web scraping and text mining capabilities in R
•  Unstructured data
•  Kaggle.com job postings
•  Classification using machine learning algorithms
•  Data scientists vs. non-data scientists
  
RESOURCES
•  Text Analytics Tutorial in R
   •  Timothy D’Auria, Boston Decision, LLC
   •  https://www.youtube.com/watch?v=j1V2McKbkLo
•  Web Scraping Tutorial in R
   •  Sharon Machlis, Computerworld
   •  https://www.youtube.com/watch?v=TPLMQnGw0Vk
•  Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
   •  Deborah Nolan and Duncan Temple Lang
•  Google and Stack Overflow
  
DATA SOURCE
•  Kaggle.com/jobs
•  August 17, 2015
•  1,025 Job Postings
•  Data Scientist
•  Big Data Engineer
•  Data Science Architect
•  Data Analyst
•  Marketing Analyst
•  Statistician
•  Data Science Director
  
DATA EXTRACTION
•  Extracted job links
   •  XML Package
   •  xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
•  Extracted job posting text
   •  rvest Package
   •  html_text(html_nodes(htmlpage, "div.postcontent"))
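The two extraction calls on this slide can be tried offline. The sketch below is a stand-in rather than the author's script: the HTML strings, the paths inside them, and the names listing_html, posting_html, job_links, and posting_text are invented for illustration. Only the XPath expression and the div.postcontent selector come from the slide.

```r
library(XML)    # htmlParse(), xpathSApply()
library(rvest)  # read_html(), html_nodes(), html_text()

# Hypothetical listing page: two job links and one unrelated link.
listing_html <- '
<html><body>
  <h3><a href="/jobs/101/data-scientist">Data Scientist</a></h3>
  <h3><a href="/jobs/102/marketing-analyst">Marketing Analyst</a></h3>
  <h3><a href="/about">About</a></h3>
</body></html>'

# Parse the string and keep only href attributes that start with "/jobs".
doc <- htmlParse(listing_html, asText = TRUE)
job_links <- as.character(
  xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
)

# Hypothetical posting page: the body text sits in div.postcontent.
posting_html <- '
<html><body>
  <div class="sidebar">Related jobs</div>
  <div class="postcontent">We are hiring a data scientist with R experience.</div>
</body></html>'

# Select the posting body with a CSS selector and extract its text.
htmlpage <- read_html(posting_html)
posting_text <- html_text(html_nodes(htmlpage, "div.postcontent"))

print(job_links)
print(posting_text)
```

On a live site the same calls would run on pages fetched from each extracted link. The fetching step is not shown on the slide, so it is left out here.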
  
DATA PREPARATION
•  Cleaned the text data
•  tm Package
•  tm_map()
•  Remove punctuation
•  Remove whitespace
•  Lower-casing
•  Remove stopwords
   •  “a”, “the”, “and”, “but”, etc.
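The cleaning steps listed on this slide might look roughly like the following with tm. This is a minimal sketch under assumptions: the two sample postings are invented, the English stopword list is a guess at what was used, and the final stripWhitespace() pass is an addition to tidy the gaps left by stopword removal.

```r
library(tm)  # VCorpus(), tm_map(), and the transformation functions

# Two invented postings standing in for the scraped text.
postings <- c(
  "We need a Data Scientist, and the team uses R!",
  "The   Marketing Analyst builds reports; but no modeling."
)

corpus <- VCorpus(VectorSource(postings))

# Steps in the order given on the slide.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Assumption: one more whitespace pass after the stopwords are dropped.
corpus <- tm_map(corpus, stripWhitespace)

cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```

Lower-casing has to come before stopword removal because the stopword list is lower-case, which matches the order on the slide.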
  
DATA PREPARATION
•  Created the term document matrix (TDM)
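As an illustration of what a term document matrix holds, the sketch below builds one from three invented, already-cleaned postings. The texts and the names docs, tdm, and m are assumptions for the example; the deck does not show the call that was used.

```r
library(tm)  # VCorpus(), TermDocumentMatrix()

# Three invented postings, already lower-cased and stripped of stopwords.
docs <- c(
  "data scientist machine learning python",
  "marketing analyst excel reports",
  "data scientist statistics machine learning"
)

corpus <- VCorpus(VectorSource(docs))

# Rows are terms, columns are documents, cells are term counts.
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

print(dim(m))       # number of terms, number of documents
print(m["data", ])  # how often "data" occurs in each posting
```

For modeling, the matrix is usually transposed so that each posting is a row and each term is a predictor column.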
  
DATA PREPARATION
•  TDM consists of 959 job postings and 73 terms
   •  375 data scientists and 584 non-data scientists
•  Split TDM into training set and test set
   •  864 job postings in training sample
   •  95 job postings in test sample
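The split sizes on this slide can be reproduced in base R. The sketch below assumes a simple random split, which the slide does not confirm, and uses random counts in place of the real matrix. The seed and the names x, y, and train_idx are arbitrary.

```r
set.seed(422)  # arbitrary seed so the split is reproducible

n_postings <- 959
n_terms    <- 73

# Synthetic stand-in for the real data: postings as rows, terms as columns.
x <- matrix(
  rpois(n_postings * n_terms, lambda = 0.5),
  nrow = n_postings, ncol = n_terms,
  dimnames = list(NULL, paste0("term", seq_len(n_terms)))
)

# Class labels with the counts reported on the slide.
y <- factor(rep(c("data scientist", "non-data scientist"), times = c(375, 584)))

# Assumption: rows are assigned to the training set at random.
train_idx <- sample(n_postings, size = 864)

train_x <- x[train_idx, , drop = FALSE]
test_x  <- x[-train_idx, , drop = FALSE]
train_y <- y[train_idx]
test_y  <- y[-train_idx]

print(c(train = nrow(train_x), test = nrow(test_x)))
print(table(y))
```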
  
SUPERVISED LEARNING
•  K-Nearest Neighbor
•  Find the K value with the highest classification accuracy
•  K=8 shows the best result with 82.98% accuracy rate
•  Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
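The search for the best K can be sketched with knn() from the class package. Everything about the data here is an assumption: the term counts are simulated, the four term names are invented, and the range of K values tried is a guess. It will not reproduce the 82.98% figure from the slide.

```r
library(class)  # knn()

set.seed(422)  # arbitrary seed

# Simulated term counts: 120 data scientist and 180 other postings.
n <- 300
is_ds <- rep(c(TRUE, FALSE), times = c(120, 180))
x <- cbind(
  machine   = rpois(n, ifelse(is_ds, 3, 0.3)),
  learning  = rpois(n, ifelse(is_ds, 3, 0.3)),
  marketing = rpois(n, ifelse(is_ds, 0.3, 2)),
  team      = rpois(n, 1)
)
y <- factor(ifelse(is_ds, "data scientist", "non-data scientist"))

train_idx <- sample(n, size = 270)
test_y <- y[-train_idx]

# Test-set accuracy for each candidate K.
ks <- 1:15
accuracy <- sapply(ks, function(k) {
  pred <- knn(x[train_idx, ], x[-train_idx, ], cl = y[train_idx], k = k)
  mean(pred == test_y)
})
best_k <- ks[which.max(accuracy)]

# Confusion matrix at the chosen K.
best_pred <- knn(x[train_idx, ], x[-train_idx, ], cl = y[train_idx], k = best_k)
confusion <- table(predicted = best_pred, actual = test_y)

print(best_k)
print(confusion)
```

Choosing K on the same test set that reports the final accuracy tends to flatter the result. A separate validation split or cross-validation would give a more honest estimate.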
  
SUPERVISED LEARNING
•  Classification Decision Tree (Gini index)
•  The classification accuracy rate is 96.8%
•  Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
•  Key terms for tree construction:
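A classification tree with Gini splitting can be fit with rpart, one of several R packages that could have been used here; the slide does not name one. The data below are simulated and the term names are invented, so neither the accuracy nor the important terms match the deck.

```r
library(rpart)  # rpart(), predict()

set.seed(422)  # arbitrary seed

# Simulated term counts with a class label per posting.
n <- 300
is_ds <- rep(c(TRUE, FALSE), times = c(120, 180))
postings <- data.frame(
  machine   = rpois(n, ifelse(is_ds, 3, 0.3)),
  learning  = rpois(n, ifelse(is_ds, 3, 0.3)),
  marketing = rpois(n, ifelse(is_ds, 0.3, 2)),
  team      = rpois(n, 1),
  label     = factor(ifelse(is_ds, "data scientist", "non-data scientist"))
)

train_idx <- sample(n, size = 270)
train <- postings[train_idx, ]
test  <- postings[-train_idx, ]

# Classification tree using the Gini index as the split criterion.
fit <- rpart(label ~ ., data = train, method = "class",
             parms = list(split = "gini"))

pred <- predict(fit, newdata = test, type = "class")
confusion <- table(predicted = pred, actual = test$label)
accuracy <- mean(pred == test$label)

print(confusion)
print(accuracy)
print(fit$variable.importance)  # terms that contributed most to the splits
```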
  
SUPERVISED LEARNING
•  Bagging
•  The classification accuracy rate is 96.8%
•  Confusion matrix shows the same results as the classification tree
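Bagging fits many trees on bootstrap resamples of the training set and combines them by majority vote. The sketch below hand-rolls that idea with rpart instead of calling a dedicated bagging package, since the slide does not say which one was used. The data are simulated and n_trees = 25 is an arbitrary choice.

```r
library(rpart)  # rpart(), predict()

set.seed(422)  # arbitrary seed

# Simulated term counts with a class label per posting.
n <- 300
is_ds <- rep(c(TRUE, FALSE), times = c(120, 180))
postings <- data.frame(
  machine   = rpois(n, ifelse(is_ds, 3, 0.3)),
  learning  = rpois(n, ifelse(is_ds, 3, 0.3)),
  marketing = rpois(n, ifelse(is_ds, 0.3, 2)),
  team      = rpois(n, 1),
  label     = factor(ifelse(is_ds, "data scientist", "non-data scientist"))
)

train_idx <- sample(n, size = 270)
train <- postings[train_idx, ]
test  <- postings[-train_idx, ]

# One tree per bootstrap resample; each column holds one tree's predictions.
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(label ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote across trees for each test posting.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

confusion <- table(predicted = bagged, actual = test$label)
accuracy <- mean(bagged == as.character(test$label))

print(confusion)
print(accuracy)
```

When one or two terms separate the classes strongly, most bootstrap trees pick the same splits. That is one plausible reason a bagged model can match a single tree, as the slide reports.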
  
QUESTIONS? COMMENTS?
  

Classifying Data Science Job Postings Using Text Mining and Machine Learning in R
