
Training a New Generation of Data Scientists



Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

In this video, Cloudera's Senior Director of Data Science, Josh Wills, discusses what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh answers questions about machine learning, analytics platforms, applications of data science in different industries, and Cloudera's Introduction to Data Science course.

Published in: Technology

Training a New Generation of Data Scientists

  1. Training a New Generation of Data Scientists
     Josh Wills | Senior Director of Data Science
  2. About Me
  3. What Do Data Scientists Do?
  4. What I Think I Do
  5. What Other People Think I Do
  6. What I Actually Do
  7. The Emergence of Data Science
  8. Data Storage in 2001: Databases
     • Structured schemas
     • Intensive processing done where data is stored
     • Somewhat reliable
     • Expensive at scale
  9. Data Storage in 2001: Filers
     • No schemas, stores any kind of file
     • No data processing capability
     • Reliable
     • Expensive at scale
  10. And Then, This Happened
  11. Data Economics, Return on Byte
  12. Big Data Economics
      Value = f(Bytes)
      • No individual record is particularly valuable
      • Having every record is incredibly valuable
        • Web index
        • Recommendation systems
        • Sensor data
        • Market basket analysis
        • Online advertising
  13. Enter Hadoop
  14. The Hadoop Distributed File System
      • Based on the Google File System
      • Data stored in large files
        • Large block size: 64MB to 256MB per block
        • Blocks are replicated to multiple nodes in the cluster
  15. Simple, Reliable, Distributed Processing: MapReduce
      • Map Stage
        • Embarrassingly parallel
      • Shuffle Stage: Large-scale distributed sort
      • Reduce Stage
        • Process all the values that have the same key in a single step
      • Process the data where it is stored
      • Write once and you’re done.
  16. Thinking Like a Data Scientist
  17. Solving Problems vs. Finding Insights
  18. Parallelize Everything
  19. Abundance vs. Scarcity
  20. Building Data Products
  21. Create a Data Science Team
  22. Choose Good Problems
  23. Design the Model
  24. Mind the Gap
  25. Amortize Costs
  26. Measure Everything
  27. Rinse and Repeat
  28. Work Like a Data Scientist
  29. Train Like a Data Scientist
      • Introduction to Data Science
      • Hive and Pig Training
      • Hadoop Developer Training
  30. Introduction to Data Science: Building Recommender Systems
  31. • Submit questions in the Q&A panel
      • Watch on-demand video of this webinar at
      • Follow Josh on Twitter @josh_wills
      • Follow Cloudera University @ClouderaU
      • Thank you for attending!
      • Register now for Cloudera training at
        • Use discount code DSvideo_10 to save 10% on new enrollments in Cloudera-delivered training classes until June 1
        • Use discount code 15off2 to save 15% on enrollments in two or more Cloudera-delivered training classes until June 1
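
The three MapReduce stages named on slide 15 (map, shuffle, reduce) can be illustrated with a small single-process sketch. This is not the Hadoop API and not code from the talk: the function names `map_stage`, `shuffle_stage`, `reduce_stage`, and `run_job` are hypothetical, and word counting is simply an assumed example workload. On a real cluster, the map and reduce calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_stage(record):
    """Map: emit (key, value) pairs for one record, independently of all others."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_stage(pairs):
    """Shuffle: group values by key and return the groups in sorted key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_stage(key, values):
    """Reduce: process all the values that share a key in a single step."""
    return key, sum(values)


def run_job(records):
    """Chain the three stages; here everything runs sequentially in one process."""
    mapped = chain.from_iterable(map_stage(record) for record in records)
    return [reduce_stage(key, values) for key, values in shuffle_stage(mapped)]


# Assumed toy input: each string stands in for one record of a large file.
records = ["big data big value", "data science"]
counts = run_job(records)
print(counts)  # [('big', 2), ('data', 2), ('science', 1), ('value', 1)]
```

Because `map_stage` looks at one record at a time and `reduce_stage` looks at one key at a time, both could be distributed across machines without changing their logic, which is the sense in which the map stage is described as embarrassingly parallel.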