MLconf NYC Josh Wills

Published in: Technology

  1. MLConf NYC 2014: Josh Wills, Senior Director of Data Science, Cloudera
  2. A Little Bit About Me
  3. An Experience I Had Recently
  4. The Two Kinds of Data Scientists
     • The Lab
       • Statisticians who got really good at programming
       • Neuroscientists, geneticists, etc.
     • The Factory
       • Software engineers who were in the wrong place at the wrong time
  5. The Lab and The Factory
     • Analytics in the Lab
       • Question-driven
       • Interactive
       • Ad-hoc, post-hoc
       • Fixed data
       • Focus on speed and flexibility
       • Output is embedded into a report or in-database scoring engine
     • Analytics in the Factory
       • Metric-driven
       • Automated
       • Systematic
       • Fluid data
       • Focus on transparency and reliability
       • Output is a production system that makes customer-facing decisions
  6. Data Science in the Factory
  7. On Icebergs
  8. The Impedance Mismatch
  9. What Do We Need?
  10. Apache Spark
  11. A Feature Extraction DSL for Spark
  12. The R Formula Specification
  13. So Why Doesn’t This Exist Yet?
  14. Functional Programming to the Rescue
  15. Data Science in the Lab
  16. Great Tools for Investigative Analytics
  17. Cloudera Impala
  18. LLVM and Numba
  19. Python UDFs for Impala
  20. Python UDFs for Impala
     • github.com/cloudera/impyla
     • Already There
       • Numeric and boolean types (as native Python objects)
     • In Progress
       • String support
       • C/C++ function integration
     • Planned
       • Struct/tuple and array types
       • UDAFs
       • Include support for PyData stack (scikit-learn, NLTK)
  21. Thank you! Josh Wills, Director of Data Science, Cloudera (@josh_wills)
