MLconf NYC Josh Wills




Usage Rights

© All Rights Reserved

  • My major contribution to western civilization. See also:
  • Curt Monash makes a distinction between investigative analytics (which he defines here: ) and operational analytics that I like, and I expanded it into my own set of differences that I want to walk through here. Investigative analytics is what we think of when we think of traditional BI: an analyst or an executive searches for previously unknown patterns in a data set, either by looking at a series of visualizations mediated by database queries, or by applying statistical models to a prepared data set to tease out deeper explanations. This is where the vast majority of the BI market is focused right now. Operational analytics, on the other hand, is a nascent market, and I don't believe the existing BI tools have done a good job of supporting companies that want to use their modeling and analytical prowess to make better decisions in real time. I'd like to shift some of the conversation and the focus in the market from the lab to the factory.
  • The tip of the iceberg metaphor. This has been a useful metaphor for me throughout my career. I feel like I am constantly exploring the tip of the iceberg, from the theory of model building to the practice of model building to operational model building. There is a ton of stuff I don't know, but I hope that I can provide a useful sort of commentary on the culture of credit scoring from the perspective of an outsider, kind of like Alexis de Tocqueville or Borat.
  • Parser combinators, monoids, regular expressions, oh my!
  • Tools focus on speed and flexibility.
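
The note on monoids above does not say how they would be used, so the following is only an illustrative sketch, not the speaker's design. A monoid is a type with an associative combine operation and an identity element, which is what lets per-partition summaries be computed independently and merged in any grouping. The `Summary` class, its fields, and the `summarize` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class Summary:
    """Hypothetical column summary; the default instance is the identity."""
    count: int = 0
    total: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    @staticmethod
    def of(x: float) -> "Summary":
        # Lift a single value into the monoid.
        return Summary(1, x, x, x)

    def combine(self, other: "Summary") -> "Summary":
        # Associative merge: how partial results are grouped does not matter.
        return Summary(
            self.count + other.count,
            self.total + other.total,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan


EMPTY = Summary()  # identity element: combining with it changes nothing


def summarize(values):
    """Fold a sequence of numbers into one Summary."""
    return reduce(Summary.combine, map(Summary.of, values), EMPTY)


# Simulate three partitions of one column, one of them empty.
partitions = [[3.0, 1.0], [], [4.0, 1.0, 5.0]]
partials = [summarize(p) for p in partitions]
merged = reduce(Summary.combine, partials, EMPTY)
```

Because `combine` is associative and `EMPTY` is an identity, `merged` equals the summary computed over all the values in a single pass, and empty partitions need no special handling.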

MLconf NYC Josh Wills Presentation Transcript

  • MLConf NYC 2014: Josh Wills, Senior Director of Data Science, Cloudera
  • A Little Bit About Me
  • An Experience I Had Recently
  • The Two Kinds of Data Scientists
    • The Lab
      • Statisticians who got really good at programming
      • Neuroscientists, geneticists, etc.
    • The Factory
      • Software engineers who were in the wrong place at the wrong time
  • The Lab and The Factory
    • Analytics in the Lab
      • Question-driven
      • Interactive
      • Ad-hoc, post-hoc
      • Fixed data
      • Focus on speed and flexibility
      • Output is embedded into a report or in-database scoring engine
    • Analytics in the Factory
      • Metric-driven
      • Automated
      • Systematic
      • Fluid data
      • Focus on transparency and reliability
      • Output is a production system that makes customer-facing decisions
  • Data Science In The Factory
  • On Icebergs
  • The Impedance Mismatch
  • What Do We Need?
  • Apache Spark
  • A Feature Extraction DSL for Spark
  • The R Formula Specification
  • So Why Doesn’t This Exist Yet?
  • Functional Programming to the Rescue
  • Data Science in the Lab
  • Great Tools for Investigative Analytics
  • Cloudera Impala
  • LLVM and NUMBA
  • Python UDFs for Impala
  • Python UDFs for Impala
    • Already There
      • Numeric and boolean types (as native python objects)
    • In Progress
      • String support
      • C/C++ function integration
    • Planned
      • Struct/tuple and array types
      • UDAFs
      • Include support for PyData stack (scikit-learn, NLTK)
  • Josh Wills, Director of Data Science, Cloudera @josh_wills Thank you!