On Building a Data Science Curriculum

Data Science is a comparatively new field, and it changes constantly as new techniques, tools, and problems emerge. Traditionally, education has taken a top-down approach in which courses are developed over years and committees approve curricula based on what might be the most theoretically complete approach. This is at odds, however, with an evolving industry that needs data scientists faster than they can be (traditionally) trained.

If we are to sustainably push the field of Data Science forward, we must collectively figure out how best to scale this type of education. At Zipfian I have seen (and felt) firsthand what works (and what doesn't) when tools and theory are combined in a classroom environment. This talk will be a narrative about the lessons learned trying to integrate high-level theory with practical application, how leveraging the Python ecosystem (numpy, scipy, pandas, scikit-learn, etc.) has made this possible, and what happens when you treat curriculum like product (and the classroom like a team).

Published in: Technology

  1. On Building a Data Science Curriculum November 23rd, 2014
  2. Jonathan Dinu Director of Education, Galvanize jonathan@galvanize.com @clearspandex Questions? tweet @galvanize
  3. Formerly
  4. Formerly
  5. + Currently
  6. Challenge The Challenge
  7. Challenge
  8. Tools H2O (0xdata) Framework/Library Big Data (scalability) Small Data MapReduce (Java) MapReduce (Streaming) Bespoke Code Cloudera ML Mahout MLlib (amplab) C/C++ Cascading/Crunch Pig/Hive Vowpal Wabbit Giraph GraphLab Spark Storm R CRAN Python Java scikit-learn pandas mlpack Weka NumPy JavaScript
  9. Obligatory Name Drop At Scale Locally Snakebite (HDFS) Acquisition Parse Storage Transform/Explore Vectorization Train Model Expose Presentation requests BeautifulSoup4 pymongo pandas Flask scrapy Hadoop Streaming (w/ BeautifulSoup4) mrjob or Mortar (w/ Python UDF) MLlib (pySpark) Flask scikit-learn/NLTK
  10. Challenge
  11. Challenge Now do that in 8 weeks
  12. Challenge
  13. Intuition Iteration 0: Intuition
  14. Content Source: Metacademy
  15. Bottom Up Approach Content
  16. Content Source: Coursera
  17. Content Source: UC Berkeley Masters
  18. Not Everybody Learns This Way Issues
  19. Issues • Not Enough Context • Not Enough Concept Overlap • Takes too much Time • Nothing Happens in a Vacuum
  20. Digression Not Just for Data Science (relevant to learning any complex subject)
  21. Experience Iteration 1: Experience
  22. Theory Mathematics & Statistics Mathematics Statistical Analysis Distributions (Binomial, Poisson, etc.) Summary Statistics (Mean, Variance, etc.) Hypothesis Testing Bayesian Analysis Linear Algebra (Matrix Factorization) Calculus (Integrals, Derivatives, etc.) Graph Theory Probability/Combinatorics
  23. Worth the Upfront Investment Theory
  24. Technique Machine Learning & Software Engineering Distributed Computing Supervised (SVM, Random Forest) Unsupervised (K-means, LDA) NLP / Information Retrieval Algorithms & Data Structures Data Visualization Data Munging Machine Learning Software Engineering Validation, Model Comparison
  25. Just ask them! Network (the students)
  26. Context is King
  27. Network
  28. Network Iris Dataset Classification
  29. Network Iris Dataset Classification NYT Topic Modeling
  30. Network Iris Dataset Classification NYT Topic Modeling Real-time Fraud scoring service
  31. Network Iris Dataset Classification NYT Topic Modeling Real-time Fraud scoring service Personal Capstone Project
  32. Network “Domesticated Data” Learn the tools/theory Iris Dataset Classification NYT Topic Modeling Real-time Fraud scoring service Personal Capstone
  33. Network “Domesticated Data” Learn the tools/theory “Wild Data” Learn the application Iris Dataset Classification NYT Topic Modeling Real-time Fraud scoring service Personal Capstone
  34. Network “Domesticated Data” Learn the tools/theory “Wild Data” Learn the application Simulated Case Study Learn the process Iris Dataset Classification NYT Topic Modeling Real-time Fraud scoring service Personal Capstone
  35. Network “Domesticated Data” Learn the tools/theory “Wild Data” Learn the application Simulated Case Study Learn the process Greenfield Project Learn the practice/art Iris Dataset Classification NYT Topic Modeling Real-time Fraud scoring service Personal Capstone
  36. Theory Theory Application Synthesis $$$ PROFIT!!
  37. Just ask them! Network
  38. Network
  39. Just ask them! (and be flexible) Network
  40. Treat them like customers (because they are) Network
  41. Always Validate! Network
  42. Metrics Iteration 2: Data!
  43. METRICS Experience Iteration 2: Data! METRICS EVERYWHERE
  44. Metrics
  45. • Commits • Pull Requests • Passing Tests • Etc. Metrics
  46. Curriculum as Product
  47. Learning Techniques
  48. Industry Techniques Source: http://en.wikipedia.org/wiki/Extreme_programming
  49. Industry Techniques Source: http://lostechies.com/scottreynolds/2009/10/07/how-we-do-things-tdd-bdd/
  50. Industry Techniques Code Reviews Source: http://agile.dzone.com/articles/re-pair-programming
  51. Our House @Zipfian (now Galvanize)
  52. source: http://www.sebastienmillon.com/Rainbow-Immersion-Therapy-Art-Print-15
  53. Methodology Community Education Meetup Student Groups Corporate Training Industry
  54. Methodology • Outcomes focused • Project-based curriculum using real datasets • Guest lectures from leaders in the field • Mock interviews and hiring preparation • Full instructional staff + personal mentorship
  55. Employment Highest Employment Rates (2012) University of Massachusetts-Amherst School of Nursing 98% Georgetown University McDonough School of Business 94% Michigan State University College of Nursing 92% Syracuse University School of Architecture 90% University of Massachusetts-Amherst Isenberg School of Management 90% Michigan State University School of Hospitality Business 89% New York University 88% Boston College Connell School of Nursing 88% Boston College Carroll School of Management 87% Case Western Reserve University Frances Payne Bolton School of Nursing 86% U.S. News and World Report Ranking 1. Princeton University 2. Harvard University 3. Yale University 4. Columbia University 5. Stanford University 6. University of Chicago 7. Duke University 8. MIT 9. University of Pennsylvania 10. California Institute of Technology Source: http://www.nerdwallet.com/nerdscholar/grad_surveys/highest-employment-rates
  56. Timeline Data Science Immersive STRUCTURED CURRICULUM HIRING DAY CAPSTONE PROJECT GRADUATION 0 INTERVIEWS 8 10.5 12
  57. Industry Student Projects
  58. • Working knowledge of programming • Background in a quantitative discipline • Comfortable with mathematics and statistics • Child-like curiosity What We Look For Our Students
  59. Our Students Educational Background BS MS PhD
  60. Disciplines Software Engineering Analysts Finance/Economics Engineering Physics Physical Sciences Mathematics Statistics Astronomy Linguistics Professional Poker Our Students
  61. Data Science Immersive Masters in Data Science Data Engineering Immersive Weekend Workshops +
  62. Immersive Masters
  63. Immersive Masters (not to scale)
  64. Masters of Science - 1 year (Starts in Spring) http://www.galvanizeu.com/request-info
  65. Goals • Present a guest lecture or share a data story • Donate datasets and propose projects • Sponsor a scholarship • Attend our Hiring Day Get Involved
  66. Goals We’re Hiring! • Full-time Instructors • TAs • Mentor (volunteer)
  67. Questions? Thank You! Jonathan Dinu Director of Education, Galvanize jonathan@galvanize.com @clearspandex
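
The "Metrics" slides list commits, pull requests, and passing tests as signals worth tracking. The slides do not say how these were collected or used, so the following is only a minimal sketch of one way such signals might be tallied per student. The record layout, the thresholds, and the names `StudentWeek`, `pass_rate`, and `needs_check_in` are all hypothetical, and the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StudentWeek:
    """One student's activity for one week (hypothetical record layout)."""
    student: str
    commits: int
    pull_requests: int
    tests_passed: int
    tests_total: int


def pass_rate(week: StudentWeek) -> float:
    """Fraction of tests passing; 0.0 when no tests were run."""
    if week.tests_total == 0:
        return 0.0
    return week.tests_passed / week.tests_total


def needs_check_in(weeks, min_commits=5, min_pass_rate=0.6):
    """Return names of students below either threshold (thresholds are made up)."""
    return [
        w.student
        for w in weeks
        if w.commits < min_commits or pass_rate(w) < min_pass_rate
    ]


# Illustrative, invented numbers; not data from the talk.
cohort = [
    StudentWeek("student_a", commits=21, pull_requests=4, tests_passed=18, tests_total=20),
    StudentWeek("student_b", commits=3, pull_requests=1, tests_passed=9, tests_total=10),
    StudentWeek("student_c", commits=12, pull_requests=2, tests_passed=5, tests_total=10),
]

flagged = needs_check_in(cohort)
print(flagged)
```

A tally like this would only be a starting point for a conversation with a student, in keeping with the "Just ask them!" and "Always Validate!" slides, not a grade.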
