
Data Science Popup Austin: Data Do's and Dont's: Lessons From The Front Line



At Galvanize, hundreds of students complete capstone projects every year to showcase their skills to hiring partners in industry. This talk distills the main lessons from my experience advising students on some of their first data projects. Learn from their mistakes in choosing a development process, building projects with real business value, scoping projects properly, and delivering the final presentation.

Published in: Data & Analytics


  1. DATA SCIENCE POP UP AUSTIN | Data Do's and Dont's: Lessons From the Front Line | Ryan Orban, VP of Product and Strategy, Data Scientist, Galvanize | ryanorban
  2. DATA SCIENCE POP UP AUSTIN | #datapopupaustin | April 13, 2016 | Galvanize, Austin Campus
  3. Data Do’s and Dont’s: Lessons from the Frontline
  4. Ryan Orban (@ryanorban) | Co-Founder & CEO, Zipfian Academy | EVP of Product and Strategy, Galvanize
  5. We believe an opportunity belongs to anyone with aptitude and ambition.
  6. NODES ON THE NETWORK (Galvanize 2015)
     • Colorado (Boulder, Denver, Fort Collins). Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship
     • Seattle, WA. Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship
     • San Francisco, CA. Programs: Full Stack Immersive, Data Science Immersive, Data Engineering Immersive, Masters of Science in Data Science, Entrepreneurship
     • Austin, TX (opening Q1 2016). Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship
  7. 5 PROGRAMS
     • Full Stack Immersive
     • Data Science Immersive
     • Data Engineering Immersive
     • Master of Science in Data Science (University of New Haven)
     • Startup Membership
     Projected: over 500 student member graduates in 2015. Currently over 1500 members.
  8. PLACEMENT STATS
     • Full Stack Immersive: pre-program salary $43K | average starting salary $77K | placement rate 97%*
     • Data Science Immersive: pre-program salary $72K | average starting salary $114K | placement rate 94%*
     * Galvanize is a founding member of NESTA (New Economy Skills Training Association), a trade organization founded to regulate the new “bootcamp” market. This placement rate is more rigorous than that requested by state licensure agencies. The placement rate is calculated 6 months after graduation.
  9. How These Things Relate to Each Other [diagram]. Labels: Software Engineering, Data Science, Data Analysis, Data Engineering, Machine Learning, Java, Linux, UNIX, Mobile Development, Objective C, C, C++, C#, Web Development, Ruby on Rails, JavaScript, Front-end, PHP, Full-Stack, Excel, Python, SQL, NLP, Hadoop, Databases, Network Analysis, Assembly, Statistics, R. The orange words are the most important things we teach. Full-Stack Web Development and Data Science are in gray circles.
  10. DATA SCIENCE IMMERSIVE
     • Week 1 - Exploratory Data Analysis and Software Engineering Best Practices
     • Week 2 - Statistical Inference, Bayesian Methods, A/B Testing, Multi-Armed Bandit
     • Week 3 - Regression, Regularization, Gradient Descent
     • Week 4 - Supervised Machine Learning: Classification, Validation, Ensemble Methods
     • Week 5 - Clustering, Topic Modeling (NMF, LDA), NLP
     • Week 6 - Network Analysis, Matrix Factorization, and Time Series
     • Week 7 - Hadoop, Hive, and MapReduce
     • Week 8 - Data Visualization with D3.js, Data Products, and Fraud Detection Case Study
     • Weeks 9-10 - Capstone Projects
     • Week 12 - Onsite Interviews
  11. Data Manipulation | Model Creation | Prediction
  12. Data Manipulation
  13. Don’t:
     • Assume your data is friendly
     • ETL and feature engineering are largely opaque to others (and to yourself after enough time away)
     Do:
     • Automate cleaning and transformation pipelines
     • Jupyter and RStudio are great for EDA, but have issues with collaboration and version control
     • Build functional code to be reused; export it into plain code files and track it with Git
  14. Model Creation
  15. Don’t:
     • Use accuracy as your main metric
     • You can have 99% accuracy but 0% predictive power
     • Unbalanced classes; sampling
     Do:
     • Use metrics like precision and recall
     • Aggregate metrics such as F1-score and AUC are also good, as are model-selection criteria such as AIC/BIC
     • Remember that the models with the highest scores are not always the ones you need; choose permissive vs. conservative based on the use case
  16. Don’t:
     • Start with the most complicated models (deep learning, gradient boosting, SVMs, etc.)
     • Focus on the algorithm
     • “More data always beats better algorithms”
     • But better features usually beat better algorithms*
     Do:
     • Start with a baseline model, then continuously “close the loop”
     • Create a base case to optimize against
     • Does a 1% higher F1-score outweigh 10x the training time in production? Not usually, unless you’re Google-scale.
  17. Don’t:
     • Assume your cross-validation metrics will hold up against real-life data
     Do:
     • Separate your application and prediction code
     • Fast iteration cycles are key. Create a “scoring service” that is decoupled from application code.
     • APIs and service-oriented architectures typically work best
  18. Communication
  19. Don’t:
     • Focus on the “how”, i.e. covering every trial and tribulation
     Do:
     • Cut to the chase
     • After a presentation, I always ask the class two questions:
       • What is one sentence that describes what the speaker learned?
       • Why do I care?
  20. TALENT: Early Access to Students • Candidate Matching • Curriculum Development • Corporate Student Sponsorship • Diversity
  21. ACCESS: Membership • Organic Relationships • Course Content • Mentorship • Community • Events
  22. EXPERTISE: Galvanize Experts • Capstone Projects • Internship • Corporate Training
  23. THANK YOU | RYAN ORBAN | EVP, STRATEGY | @ryanorban
  24. DATA SCIENCE POP UP AUSTIN | @datapopup | #datapopupaustin
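
Slide 13 recommends automating cleaning and keeping it in reusable functions that live in plain, version-controlled code files. The following is only a minimal sketch of that idea, not the speaker's code. The record fields (`name`, `age`), the step functions, and the `run_pipeline` helper are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, List, Optional, Sequence

Record = dict
# A step returns the transformed record, or None to drop it.
Step = Callable[[Record], Optional[Record]]


def strip_whitespace(record: Record) -> Record:
    """Trim surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def parse_age(record: Record) -> Optional[Record]:
    """Convert the hypothetical 'age' field to int; drop unusable records."""
    try:
        age = int(record["age"])
    except (KeyError, TypeError, ValueError):
        return None
    if not 0 <= age <= 120:  # assumed plausibility range, for illustration only
        return None
    return {**record, "age": age}


def run_pipeline(records: Iterable[Record], steps: Sequence[Step]) -> List[Record]:
    """Apply each step in order; a step returning None drops the record."""
    cleaned = []
    for record in records:
        for step in steps:
            record = step(record)
            if record is None:
                break
        else:  # no step dropped the record
            cleaned.append(record)
    return cleaned


raw = [
    {"name": "  Ada ", "age": "36"},
    {"name": "Bob", "age": "unknown"},  # unfriendly data: not a number
    {"name": "Cy", "age": "-4"},        # unfriendly data: out of range
]
cleaned = run_pipeline(raw, [strip_whitespace, parse_age])
```

Because each step is an ordinary function in a plain `.py` file, it can be unit-tested, reviewed in a Git diff, and imported back into a notebook for exploratory work.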
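
Slide 15's claim that a model can have 99% accuracy but no predictive power can be checked with a small worked example. The sketch below assumes a made-up dataset of 1,000 labels with 10 positives and a "model" that always predicts the negative class. The helper names are hypothetical, and precision and recall are reported as 0.0 when their denominators are zero, which is a convention rather than a mathematical fact.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative, made-up data: 1% positive class.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that never predicts the positive class

result = scores(y_true, always_negative)
```

Here accuracy comes out at 0.99 while precision, recall, and F1 are all 0.0: the model never finds a single positive case, which is the failure mode the slide warns about for unbalanced classes.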
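
Slide 16 suggests starting with a baseline model and creating a base case to optimize against. One possible sketch, under assumptions of my own: the baseline always predicts the training mean, the candidate is a one-feature least-squares line, and the data is a tiny made-up set. None of these choices come from the talk.

```python
from statistics import mean


def fit_mean_baseline(xs, ys):
    """Baseline: ignore the feature and always predict the training mean."""
    center = mean(ys)
    return lambda x: center


def fit_line(xs, ys):
    """Candidate: ordinary least-squares line on a single feature."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mean_absolute_error(predict, xs, ys):
    """Average absolute difference between predictions and targets."""
    return mean(abs(predict(x) - y) for x, y in zip(xs, ys))


# Made-up data for illustration only.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
test_x, test_y = [5.0, 6.0], [10.5, 11.5]

baseline_mae = mean_absolute_error(fit_mean_baseline(train_x, train_y), test_x, test_y)
candidate_mae = mean_absolute_error(fit_line(train_x, train_y), test_x, test_y)
improvement = baseline_mae - candidate_mae
```

Every later model is then judged by how much it improves on `baseline_mae`, which makes it easier to ask the slide's question of whether a small gain justifies the extra training cost.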
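
Slide 17 recommends a "scoring service" that is decoupled from application code. The sketch below imitates that boundary in-process: the application only sends and receives JSON strings, so the model behind the endpoint can be replaced without touching application logic. In a real deployment the handler would sit behind an HTTP API. The model, the `amount` feature, and all function names are hypothetical.

```python
import json
from typing import Callable


def fraud_model_v1(features: dict) -> float:
    """Hypothetical stand-in for a trained model: flags large amounts."""
    return 0.9 if features.get("amount", 0) > 1000 else 0.1


def make_scoring_handler(model: Callable[[dict], float],
                         version: str) -> Callable[[str], str]:
    """Wrap a model behind a JSON-in, JSON-out boundary (the 'service')."""
    def handle(request_body: str) -> str:
        features = json.loads(request_body)
        return json.dumps({"score": model(features), "model_version": version})
    return handle


def should_review(order: dict, score_endpoint: Callable[[str], str],
                  threshold: float = 0.5) -> bool:
    """Application code: knows only the JSON contract, not the model."""
    response = json.loads(score_endpoint(json.dumps(order)))
    return response["score"] >= threshold


endpoint = make_scoring_handler(fraud_model_v1, version="v1")
flag_large = should_review({"amount": 5000}, endpoint)
flag_small = should_review({"amount": 20}, endpoint)
```

Swapping in a retrained model only means building a new handler with `make_scoring_handler`; `should_review` stays unchanged, which is what keeps iteration cycles short.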