
SUNG PARK PREDICT 422 Group Project Presentation


  1. TEXT MINING DATA SCIENCE JOBS IN R
     Sung Park, MSPA Candidate
     August 20, 2015
     Northwestern University
     PREDICT 422-DL Section 55
  2. SUMMARY
     • Introduction
     • Resources
     • Data Source
     • Data Extraction
     • Data Preparation
     • Supervised Learning
  3. INTRODUCTION
     • Exploration of web scraping and text mining capabilities in R
     • Unstructured data
     • Kaggle.com job postings
     • Classification using machine learning algorithm
     • Data scientists vs. non-data scientists
  4. RESOURCES
     • Text Analytics Tutorial in R
       • Timothy D’Auria, Boston Decision, LLC
       • https://www.youtube.com/watch?v=j1V2McKbkLo
     • Web Scraping Tutorial in R
       • Sharon Machlis, Computerworld
       • https://www.youtube.com/watch?v=TPLMQnGw0Vk
     • Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
       • Deborah Nolan and Duncan Temple Lang
     • Google and Stack Overflow
  5. DATA SOURCE
     • Kaggle.com/jobs
     • August 17, 2015
     • 1,025 Job Postings
       • Data Scientist
       • Big Data Engineer
       • Data Science Architect
       • Data Analyst
       • Marketing Analyst
       • Statistician
       • Data Science Director
  6. DATA EXTRACTION
     • Extracted job links
       • XML Package
       • xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
     • Extracted job posting text
       • rvest Package
       • html_text(html_nodes(htmlpage, "div.postcontent"))
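The two extraction calls on this slide can be tried offline. The sketch below is a minimal illustration, assuming the XML and rvest packages are installed; the HTML snippets, paths, and variable names are made up for the example and are not the original site's markup.

```r
library(XML)
library(rvest)

# Hypothetical listing page (assumed markup, not the real site).
listing_html <- '<html><body>
  <h3><a href="/jobs/101/data-scientist">Data Scientist</a></h3>
  <h3><a href="/jobs/102/marketing-analyst">Marketing Analyst</a></h3>
  <h3><a href="/about">About</a></h3>
</body></html>'

doc <- htmlParse(listing_html, asText = TRUE)

# XPath from the slide: href attributes under h3/a that start with "/jobs".
job_links <- as.character(
  xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
)

# Hypothetical posting page (assumed markup).
posting_html <- '<html><body>
  <div class="postcontent">We use R and machine learning.</div>
  <div class="sidebar">Share</div>
</body></html>'

htmlpage <- read_html(posting_html)

# CSS selector from the slide: text of every div with class "postcontent".
post_text <- html_text(html_nodes(htmlpage, "div.postcontent"))
```

In a real run, each entry of `job_links` would presumably be appended to the site's base URL and fetched in a loop before the text is extracted.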
  7. DATA PREPARATION
     • Cleaned the text data
       • tm Package
       • tm_map()
         • Remove punctuation
         • Remove extra whitespace
         • Convert to lower case
         • Remove stopwords
           • “a”, “the”, “and”, “but”, etc.
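The cleaning steps listed here map onto standard tm transformations. The following is a sketch only, assuming the tm package is installed; the two sample postings are invented, and the exact transformations and order used in the project are not shown on the slide.

```r
library(tm)

# Invented sample postings for illustration.
postings <- c(
  "We need a Data Scientist, and the role uses Python!",
  "Marketing   Analyst: but Excel is the main tool."
)

corpus <- VCorpus(VectorSource(postings))

# Steps in the order listed on the slide.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Removing words leaves gaps, so collapse whitespace once more.
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(vapply(
  seq_along(corpus),
  function(i) content(corpus[[i]]),
  character(1)
))
```

Lower-casing has to happen before stopword removal, because the stopword list is lower case.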
  8. DATA PREPARATION
     • Created the term document matrix (TDM)
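A term document matrix counts how often each term occurs in each document. A small sketch with tm, assuming the package is installed and using three invented, already-cleaned postings:

```r
library(tm)

# Invented, already-cleaned postings.
cleaned <- c(
  "data scientist machine learning python",
  "marketing analyst excel reporting",
  "data scientist statistics machine learning"
)

corpus <- VCorpus(VectorSource(cleaned))

# Rows are terms, columns are documents.
tdm <- TermDocumentMatrix(corpus)
term_counts <- as.matrix(tdm)

# Transpose so each row is a posting and each column a term,
# which is the layout most classifiers in R expect.
posting_terms <- as.data.frame(t(term_counts))
```

Here `term_counts["data", ]` holds the count of "data" in each of the three postings.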
  9. DATA PREPARATION
     • TDM consists of 959 job postings and 73 terms
       • 375 data scientists and 584 non-data scientists
     • Split TDM into training set and test set
       • 864 job postings in training sample
       • 95 job postings in test sample
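The 864/95 split is roughly 90% training and 10% test. One way to draw such a split in base R is sketched below; random sampling and the seed are assumptions, since the slide does not say how the rows were chosen.

```r
set.seed(422)  # arbitrary seed, only for reproducibility

n_postings <- 959
n_test <- 95

# Randomly pick the test rows; everything else is training data.
test_idx <- sample(n_postings, n_test)
train_idx <- setdiff(seq_len(n_postings), test_idx)

# Share of postings held out for testing.
test_share <- n_test / n_postings
```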
  10. SUPERVISED LEARNING
     • K-Nearest Neighbor
       • Find the K value with the highest classification accuracy
       • K=8 shows the best result with 82.98% accuracy rate
       • Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
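The search for the best K can be sketched with `knn()` from the class package. The data below is simulated term counts, not the project's data, so the chosen K and the accuracy will not match the slide; the term names and the range of K values are assumptions.

```r
library(class)

set.seed(422)  # arbitrary seed

# Simulated term counts: data science postings mention
# "machine" and "learning" more often, others mention "marketing".
n <- 200
is_ds <- rep(c(TRUE, FALSE), each = n / 2)
x <- cbind(
  machine   = rpois(n, ifelse(is_ds, 3, 0.3)),
  learning  = rpois(n, ifelse(is_ds, 3, 0.3)),
  marketing = rpois(n, ifelse(is_ds, 0.3, 2))
)
y <- factor(ifelse(is_ds, "data scientist", "other"))

# Hold out 40 postings for testing.
test_idx <- sample(n, 40)
train_x <- x[-test_idx, ]
test_x <- x[test_idx, ]
train_y <- y[-test_idx]
test_y <- y[test_idx]

# Test-set accuracy for each candidate K.
k_values <- 1:15
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
best_k <- k_values[which.max(accuracy)]

# Confusion matrix for the selected K.
best_pred <- knn(train_x, test_x, cl = train_y, k = best_k)
confusion <- table(predicted = best_pred, actual = test_y)
```

Choosing K on the test set, as done here for brevity, makes the reported accuracy optimistic; a separate validation split or cross-validation avoids that.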
  11. SUPERVISED LEARNING
     • Classification Decision Tree (Gini index)
       • The classification accuracy rate is 96.8%
       • Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
       • Key terms for tree construction:
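A classification tree with Gini splitting can be fitted with rpart, which may or may not be the package used in the project. The sketch below uses simulated term counts with assumed term names, so its accuracy and selected terms are illustrative only.

```r
library(rpart)

set.seed(422)  # arbitrary seed

# Simulated term counts per posting (not the project's data).
n <- 200
is_ds <- rep(c(TRUE, FALSE), each = n / 2)
postings <- data.frame(
  machine   = rpois(n, ifelse(is_ds, 3, 0.3)),
  learning  = rpois(n, ifelse(is_ds, 3, 0.3)),
  marketing = rpois(n, ifelse(is_ds, 0.3, 2)),
  label     = factor(ifelse(is_ds, "data scientist", "other"))
)

test_idx <- sample(n, 40)
train <- postings[-test_idx, ]
test <- postings[test_idx, ]

# method = "class" fits a classification tree; split = "gini"
# selects the Gini index as the impurity measure.
fit <- rpart(label ~ ., data = train, method = "class",
             parms = list(split = "gini"))

pred <- predict(fit, test, type = "class")
accuracy <- mean(pred == test$label)
confusion <- table(predicted = pred, actual = test$label)

# Terms the tree actually split on.
used_terms <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
```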
  12. SUPERVISED LEARNING
     • Bagging
       • The classification accuracy rate is 96.8%
       • Confusion matrix shows the same results as the classification tree
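Bagging fits many trees on bootstrap resamples of the training set and takes a majority vote. The slide does not name the package used, so the sketch below hand-rolls bagging on top of rpart with simulated data; the number of trees and the term names are assumptions.

```r
library(rpart)

set.seed(422)  # arbitrary seed

# Simulated term counts per posting (not the project's data).
n <- 200
is_ds <- rep(c(TRUE, FALSE), each = n / 2)
postings <- data.frame(
  machine   = rpois(n, ifelse(is_ds, 3, 0.3)),
  marketing = rpois(n, ifelse(is_ds, 0.3, 2)),
  label     = factor(ifelse(is_ds, "data scientist", "other"))
)

test_idx <- sample(n, 40)
train <- postings[-test_idx, ]
test <- postings[test_idx, ]

# An odd number of trees avoids tied votes with two classes.
n_trees <- 25

# One column of test-set predictions per bootstrap tree.
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(label ~ ., data = boot, method = "class")
  as.character(predict(fit, test, type = "class"))
})

# Majority vote across trees for each test posting.
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(bagged_pred == as.character(test$label))
```

When a single tree already separates the classes well, bagged predictions often coincide with it, which is consistent with the identical confusion matrix reported on the slide.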
  13. QUESTIONS? COMMENTS?
     Sung Park, MSPA Candidate
     August 20, 2015
     Northwestern University
     PREDICT 422-DL Section 55
