MLconf NYC Josh Wills


Published in: Technology

  • My major contribution to western civilization. See also:
  • Curt Monash makes a distinction between investigative analytics (which he defines here: ) and operational analytics that I like, and I have expanded it into my own set of differences that I want to walk through here. Investigative analytics is what we think of as traditional BI: an analyst or an executive searches for previously unknown patterns in a data set, either by looking at a series of visualizations mediated by database queries, or by applying statistical models to a prepared data set to tease out deeper explanations. This is where the vast majority of the BI market is focused right now. Operational analytics, on the other hand, is a nascent market, and I don’t believe the existing BI tools have done a good job of supporting companies that want to use their modeling and analytical prowess to make better decisions in real time. I’d like to shift some of the conversation and the focus in the market from the lab to the factory.
  • The tip of the iceberg metaphor. This has been a useful metaphor for me throughout my career: I feel like I am constantly exploring the tip of the iceberg, from the theory of model building to the practice of model building to operational model building. There is a ton of stuff I don’t know, but I hope that I can provide a useful sort of commentary on the culture of credit scoring from the perspective of an outsider, kind of like Alexis de Tocqueville or Borat.
  • Parser combinators, monoids, regular expressions, oh my!
  • Tools focus on speed and flexibility.
  • MLconf NYC Josh Wills

    1. MLConf NYC 2014, Josh Wills, Senior Director of Data Science, Cloudera
    2. A Little Bit About Me
    3. An Experience I Had Recently
    4. The Two Kinds of Data Scientists
        • The Lab
            • Statisticians who got really good at programming
            • Neuroscientists, geneticists, etc.
        • The Factory
            • Software engineers who were in the wrong place at the wrong time
    5. The Lab and The Factory
        • Analytics in the Lab
            • Question-driven
            • Interactive
            • Ad-hoc, post-hoc
            • Fixed data
            • Focus on speed and flexibility
            • Output is embedded into a report or in-database scoring engine
        • Analytics in the Factory
            • Metric-driven
            • Automated
            • Systematic
            • Fluid data
            • Focus on transparency and reliability
            • Output is a production system that makes customer-facing decisions
    6. Data Science In The Factory
    7. On Icebergs
    8. The Impedance Mismatch
    9. What Do We Need?
    10. Apache Spark
    11. A Feature Extraction DSL for Spark
    12. The R Formula Specification
    13. So Why Doesn’t This Exist Yet?
    14. Functional Programming to the Rescue
    15. Data Science in the Lab
    16. Great Tools for Investigative Analytics
    17. Cloudera Impala
    18. LLVM and NUMBA
    19. Python UDFs for Impala
    20. Python UDFs for Impala
        • Already There
            • Numeric and boolean types (as native python objects)
        • In Progress
            • String support
            • C/C++ function integration
        • Planned
            • Struct/tuple and array types
            • UDAFs
            • Include support for PyData stack (scikit-learn, NLTK)
    21. Josh Wills, Director of Data Science, Cloudera, @josh_wills. Thank you!
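
Slides 11 to 14 pair an R-style formula specification with a feature extraction DSL, and the speaker notes mention regular expressions among the tools involved. The transcript does not include the DSL itself, so the following is only a rough sketch of the general idea, not the design presented in the talk. It assumes a tiny subset of R formula syntax (`response ~ term + term`, with `:` for interactions) and uses plain Python dictionaries as records. The names `parse_formula` and `extract` are hypothetical.

```python
import re

# Hypothetical mini-parser for a small subset of R-style formulas,
# e.g. "y ~ x1 + x2 + x1:x2". This is an illustration only, not the
# DSL described in the talk.
_NAME = r"[A-Za-z_][A-Za-z0-9_.]*"
_TERM = re.compile(rf"{_NAME}(?::{_NAME})*")


def parse_formula(formula):
    """Split 'response ~ term + term' into (response, [terms]).

    Each term is a tuple of column names; a tuple with more than one
    name is an interaction (the ':' operator in R).
    """
    if formula.count("~") != 1:
        raise ValueError("expected exactly one '~' in formula")
    lhs, rhs = (side.strip() for side in formula.split("~"))
    if not re.fullmatch(_NAME, lhs):
        raise ValueError(f"bad response name: {lhs!r}")
    terms = []
    for raw in rhs.split("+"):
        term = raw.strip()
        if not _TERM.fullmatch(term):
            raise ValueError(f"bad term: {term!r}")
        terms.append(tuple(term.split(":")))
    return lhs, terms


def extract(formula, record):
    """Apply a formula to one record (a dict of numeric columns).

    Returns (label, features), where an interaction term becomes the
    product of its columns.
    """
    response, terms = parse_formula(formula)
    features = []
    for term in terms:
        value = 1.0
        for name in term:
            value *= float(record[name])
        features.append(value)
    return float(record[response]), features
```

Calling `extract("y ~ x1 + x2 + x1:x2", {"y": 1, "x1": 2, "x2": 3})` yields the label `1.0` and the features `[2.0, 3.0, 6.0]`. The sketch re-parses the formula on every call to stay short; a production version would presumably parse once and reuse the result across records.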