Data Scientists

Data Science and Data Scientists

Data Scientists Presentation Transcript

  • 1. Data Scientists. Leonid Zhukov, Higher School of Economics, Moscow, 2013. www.hse.ru
  • 2. The Sexiest Job of the 21st Century. McKinsey estimates a shortage of 140,000-190,000 by 2018.
  • 3. Data Scientists wanted!
  • 4. Supply and demand
  • 5. Who are Data Scientists?
      Data Scientist:
      • Loves data
      • Has an investigative mindset
      • Works toward finding patterns in data and creating data-driven products
      • Is a practitioner, not a theorist
      • Has “hands-on” skills
      • Has domain expertise (*)
      • Is a team player
      Some backgrounds are better than others:
      • Computer science
      • Statistics (mathematics)
      • Natural sciences with a strong quantitative background
      • PhDs, but not exclusively
      The best source of new Data Science talent is (EMC Data Science Community Survey, 2011):
      • Students studying computer science: 34%
      • Professionals in disciplines other than IT or computer science: 27%
      • Students studying fields other than computer science: 24%
      • Today's BI professionals: 12%
      • Other: 3%
  • 6. What do Data Scientists do?
      • Designs customized systems and tools
      • Works with structured and unstructured data
      • Creates data processing pipelines
      • Analyzes massive datasets (TB, PB)
      • Builds predictive models
      • Creates visualizations
      • Designs data products
      • Uses Hadoop, MapReduce, Hive, Python, R
  • 7. Tools of the trade
      • Operating systems:
          • Linux + shell tools
      • Big data instruments:
          • Hadoop (MapReduce) + Hadoop tools
          • Hive, Pig
          • NoSQL (HBase, MongoDB, Cassandra, Neo4j)
      • Database:
          • SQL
      • Programming:
          • Python
          • Java
          • Scala
      • Machine Learning:
          • R
          • MATLAB
          • Python libraries (NumPy, SciPy, NLTK, scikit-learn)
          • Java libraries (Mahout)
  • 8. Required skills
      • Programming
      • Algorithms
      • Statistics
      • Data mining
      • Machine learning
      • NLP
      • Distributed systems
      • Big data tools
      • Databases
      • Visualization
      From: Swami Chandrasekaran, Executive Architect, IBM Watson Solutions
  • 9. Data Scientist roles. From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman, O’Reilly Strata 2012
  • 10. Data Science “dream team”. From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013
  • 11. Data Science project pipeline
      • Learning a problem
      • Acquiring data
      • Parsing data
      • Cleaning, filtering and organizing
      • Exploring and mining for patterns
      • Building models
      • Visualizing results
      • Communicating findings
  • 12. Business applications
      • Marketing:
          • Market segmentation
          • Product and media mix analysis
          • Customer acquisition and churn modeling
          • Recommendation systems and cross-selling
          • Social media analysis
      • Finance & Insurance:
          • Fraud prevention
          • Anomaly detection
          • Credit risk analysis
          • Usage-based insurance modeling
          • Portfolio optimization
      • Healthcare and Pharmaceuticals:
          • Genetic analysis
          • Clinical trials analysis
          • Clinical decision support systems
  • 13. Industry training. Course outline: Cloudera Introduction to Data Science: Building Recommender Systems; Cloudera Certified Professional: Data Scientist (CCP:DS)
      • Introduction
      • Data Science Overview
          • What Is Data Science?
          • The Growing Need for Data Science
          • The Role of a Data Scientist
      • Use Cases
          • Finance
          • Retail
          • Advertising
          • Defense and Intelligence
          • Telecommunications and Utilities
          • Healthcare and Pharmaceuticals
      • Project Lifecycle
          • Steps in the Project Lifecycle
          • Lab Scenario Explanation
      • Data Acquisition
          • Where to Source Data
          • Acquisition Techniques
      • Evaluating Input Data
          • Data Formats
          • Data Quantity
          • Data Quality
      • Data Transformation
          • Anonymization
          • File Format Conversion
          • Joining Datasets
      • Data Analysis and Statistical Methods
          • Relationship Between Statistics and Probability
          • Descriptive Statistics
          • Inferential Statistics
      • Fundamentals of Machine Learning
          • Overview
          • The Three Cs of Machine Learning
          • Spotlight: Naïve Bayes Classifiers
          • Importance of Data and Algorithms
      • Recommender Overview
          • What Is a Recommender System?
          • Types of Collaborative Filtering
          • Limitations of Recommender Systems
          • Fundamental Concepts
      • Introduction to Apache Mahout
          • What Apache Mahout Is (and Is Not)
          • A Brief History of Mahout
          • Availability and Installation
          • Demonstration: Using Mahout’s Item-Based Recommender
      • Implementing Recommenders with Apache Mahout
          • Overview
          • Similarity Metrics for Binary Preferences
          • Similarity Metrics for Numeric Preferences
          • Scoring
      • Experimentation and Evaluation
          • Measuring Recommender Effectiveness
          • Designing Effective Experiments
          • Conducting an Effective Experiment
          • User Interfaces for Recommenders
      • Production Deployment and Beyond
          • Deploying to Production
          • Tips and Techniques for Working at Scale
          • Summarizing and Visualizing Results
          • Considerations for Improvement
          • Next Steps for Recommenders
      • Conclusion
      • Appendix A: Hadoop Overview
      • Appendix B: Mathematical Formulas
      • Appendix C: Language and Tool Reference
  • 14. Industry training
  • 15. Educational programs
      University programs:
      • University of Washington: Certificate in Data Science
      • UC Berkeley: Master of Information and Data Science program
      • New York University: Data Science at NYU
      • Columbia University: Institute for Data Sciences and Engineering
      • University of Southern California (USC): Master of Science in Data Science
      Online courses (MOOCs):
      • Coursera
      • edX
      • Udacity
      Accelerated educational programs:
      • Zipfian Academy (12-week intensive program)
      • Insight Data Science Fellows Program (6-week postdoctoral training)
  • 16. Conferences
      • Industry conferences and meetings:
          • O’Reilly Strata Conference: Making Data Work
          • Hadoop World
          • Big Data TechCon
          • Big Data Innovation summits
      • Academic conferences (peer reviewed):
          • IEEE & ACM Supercomputing
          • IEEE Big Data
          • ACM KDD: Knowledge Discovery and Data Mining
          • ACM SIGIR: Information Retrieval
          • ICML: International Conference on Machine Learning
          • ICDM: International Conference on Data Mining
          • NIPS: Neural Information Processing
          • WWW: World Wide Web Conference
          • VLDB: Very Large Data Bases
          • ACM CIKM: Information and Knowledge Management
          • SIAM SDM: International Conference on Data Mining
          • IEEE ICDE: Data Engineering
          • IEEE Visualization
      • Meetups
  • 17. Textbooks
  • 18. Open questions
      • How important is domain expertise?
      • What is needed more: education or experience?
      • The future of data scientists: will they be replaced by software?
  • 19. 20, Myasnitskaya str., Moscow, Russia, 101000 Tel.: +7 (495) 628-8829, Fax: +7 (495) 628-7931 www.hse.ru
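
The project pipeline on slide 11 (acquiring data, parsing, cleaning, exploring, building models, visualizing results) can be illustrated with a minimal Python sketch. This is not taken from the presentation: the sample data, the function names, and the trivial per-group mean used as the "model" are all assumptions made for illustration, and a real project would substitute its own data sources and modeling tools at each stage.

```python
import csv
import io
from statistics import mean

# Hypothetical raw input; a real project would read from a file, API, or database.
RAW = """city,temp_c
Moscow,-5
Moscow,-7
Paris,4
Paris,
Paris,6
"""

def acquire():
    """Acquiring data: return the raw CSV text."""
    return RAW

def parse(text):
    """Parsing data: turn CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning, filtering and organizing: drop rows with missing values, convert types."""
    return [
        {"city": r["city"], "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"]
    ]

def explore(rows):
    """Exploring for patterns: group the measurements by city."""
    groups = {}
    for r in rows:
        groups.setdefault(r["city"], []).append(r["temp_c"])
    return groups

def build_model(groups):
    """Building models: a deliberately trivial model that predicts the per-city mean."""
    return {city: mean(values) for city, values in groups.items()}

def visualize(model):
    """Visualizing results: one text bar per city, sorted by city name."""
    return [
        f"{city:<8}{'#' * round(abs(value))} {value:+.1f}"
        for city, value in sorted(model.items())
    ]

def run_pipeline():
    """Chain the stages in the order listed on the slide."""
    model = build_model(explore(clean(parse(acquire()))))
    return model, visualize(model)

if __name__ == "__main__":
    model, chart = run_pipeline()
    # Communicating findings: here, simply printing the chart.
    print("\n".join(chart))
```

Each stage is a plain function that takes the previous stage's output, so any one of them can be replaced (for example, swapping the mean for a fitted model) without touching the others.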