Training a New Generation of Data Scientists
Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

In this video, Cloudera's Senior Director of Data Science, Josh Wills, discusses what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh answers questions about machine learning, analytics platforms, applications of data science in different industries, and Cloudera's Introduction to Data Science course.

Published in: Technology


  1. Training a New Generation of Data Scientists
     Josh Wills | Senior Director of Data Science
  2. About Me
  3. What Do Data Scientists Do?
  4. What I Think I Do
  5. What Other People Think I Do
  6. What I Actually Do
  7. The Emergence of Data Science
  8. Data Storage in 2001: Databases
     • Structured schemas
     • Intensive processing done where data is stored
     • Somewhat reliable
     • Expensive at scale
  9. Data Storage in 2001: Filers
     • No schemas, stores any kind of file
     • No data processing capability
     • Reliable
     • Expensive at scale
  10. And Then, This Happened
  11. Data Economics, Return on Byte
  12. Big Data Economics
     Value = f(Bytes)
     • No individual record is particularly valuable
     • Having every record is incredibly valuable
        • Web index
        • Recommendation systems
        • Sensor data
        • Market basket analysis
        • Online advertising
  13. Enter Hadoop
  14. The Hadoop Distributed File System
     • Based on the Google File System
     • Data stored in large files
        • Large block size: 64 MB to 256 MB per block
        • Blocks are replicated to multiple nodes in the cluster
  15. Simple, Reliable, Distributed Processing: MapReduce
     • Map Stage
        • Embarrassingly parallel
     • Shuffle Stage: Large-scale distributed sort
     • Reduce Stage
        • Process all the values that have the same key in a single step
     • Process the data where it is stored
     • Write once and you're done.
  16. Thinking Like a Data Scientist
  17. Solving Problems vs. Finding Insights
  18. Parallelize Everything
  19. Abundance vs. Scarcity
  20. Building Data Products
  21. Create a Data Science Team
  22. Choose Good Problems
  23. Design the Model
  24. Mind the Gap
  25. Amortize Costs
  26. Measure Everything
  27. Rinse and Repeat
  28. Work Like a Data Scientist
  29. Train Like a Data Scientist
     • Introduction to Data Science
     • Hive and Pig Training
     • Hadoop Developer Training
  30. Introduction to Data Science: Building Recommender Systems
     http://university.cloudera.com/
  31. • Submit questions in the Q&A panel
     • Watch on-demand video of this webinar at http://cloudera.com
     • Follow Josh on Twitter @josh_wills
     • Follow Cloudera University @ClouderaU
     • Thank you for attending!
     Register now for Cloudera training at http://university.cloudera.com
     Use discount code DSvideo_10 to save 10% on new enrollments in Cloudera-delivered training classes until June 1
     Use discount code 15off2 to save 15% on enrollments in two or more Cloudera-delivered training classes until June 1
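
The three MapReduce stages named on slide 15 can be illustrated with a small single-process sketch. This is not Hadoop code and is not taken from the slides: the word-count task and the function names `map_stage`, `shuffle_stage`, `reduce_stage`, and `run_job` are assumptions chosen for illustration. It only shows how map output is grouped by key before each key's values are reduced in a single step.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_stage(record: str) -> Iterator[tuple[str, int]]:
    """Map: each record is handled independently (embarrassingly parallel)."""
    for word in record.lower().split():
        yield word, 1


def shuffle_stage(pairs: Iterable[tuple[str, int]]) -> list[tuple[str, list[int]]]:
    """Shuffle: group values by key and sort the keys.

    On a real cluster this is a large-scale distributed sort; here it is
    an in-memory stand-in (an assumption made for this sketch).
    """
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: process all the values that share one key in a single step."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> dict[str, int]:
    """Chain the three stages over an iterable of text records."""
    mapped = (pair for record in records for pair in map_stage(record))
    return dict(reduce_stage(key, values) for key, values in shuffle_stage(mapped))


if __name__ == "__main__":
    print(run_job(["big data", "Big value from data", "data"]))
```

Because `map_stage` never looks at more than one record, the map work could be split across machines holding different blocks of the input, which is the sense in which the slide says to process the data where it is stored.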