
Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]


Jeremy Achin, CEO at DataRobot, presented at FirstMark's Data Driven NYC on November 23, 2015. Achin discussed technique-agnostic ways to assess and interpret predictive models.

DataRobot provides a predictive analytics platform for rapidly building and deploying predictive models in the cloud or within an enterprise.

Data Driven NYC is a monthly event covering Big Data and data-driven products and startups, hosted by Matt Turck, partner at FirstMark.

FirstMark is an early stage venture capital firm based in New York City. Find out more about Data Driven NYC at http://datadrivennyc.com and FirstMark Capital at http://firstmarkcap.com.




  1. 1. Black Boxes and Unicorns. Jeremy Achin | Data Scientist & CEO | DataRobot
  2. 2. Jeremy Achin?
  3. 3. DataRobot Company History (timeline from 2012 2H to 2015 2H). June ’12: Founded. June ’13: Seed Funding, $3.3M. July ’14: Series A, $21M. 2015 2H: Bigger & Better Announcements Coming Soon!
  4. 4. DataRobot: better predictive models faster
  5. 5. https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726 Leo Breiman (classification & regression trees, random forest, and my personal hero)
  6. 6. https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726 Leo Breiman (classification & regression trees, random forest, and my personal hero) 2001: Statistical Modeling: The Two Cultures ● An attack on statisticians who rely solely on regression models ● Argued we should be using the techniques that obtain the best results ● Even a carefully built regression model is just one of many possible representations of the underlying reality
  7. 7. “If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data [regression] models and adopt a more diverse set of tools.” https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
  8. 8. 14 Years Later Excellent progress in recent years but... ● still armies of people taking months to manually build regression models (especially in larger companies) ● non-regression methods still thought of as “black box”
  9. 9. Black Box (n) /blak bäks/
  10. 10. Black Box (n) /blak bäks/ A phrase people use when they’re scared of technology they don’t understand and want to keep doing the same thing they’ve been doing for the last twenty years.
  11. 11. What do we really need to know about a predictive model? 1. Overall Performance on Out-of-Sample (Validation) Data 2. Predicted vs Actual by Variable 3. How a model’s predictions change as values of input variables change
  12. 12. What do we really need to know about a predictive model? 1. Overall Performance on Out-of-Sample (Validation) Data 2. Predicted vs Actual by Variable 3. How a model’s predictions change as values of input variables change None of these depend on the specific algorithm you are using. Even #3!
  13. 13. Overall Out-of-Sample Performance: Mean Absolute Error, Weighted Mean Absolute Error, Root Mean Squared Error, Root Mean Squared Logarithmic Error, Mean F Score, Mean Consequential Error, Mean Average Precision, Multi-class Log Loss, Hamming Loss, Mean Utility, Continuous Ranked Probability Score, AUC, Average Precision (column-wise), Gini, Average Among Top P, Mean Average Precision (row-wise), Normalized Discounted Cumulative Gain@k, Mean Average Precision@n, Levenshtein Distance, Average Precision, Absolute Error
  14. 14. Hospital Readmission Model Assessment and Interpretation [chart: x-axis Number of Prior Visits to Hospital; y-axis Hospital Readmission Rate]
  15. 15. Hospital Readmission Model Assessment and Interpretation [chart: x-axis Number of Prior Visits to Hospital; y-axis Hospital Readmission Rate; series: Actual Hospital Readmission Rate]
  16. 16. Hospital Readmission Model Assessment and Interpretation [chart: x-axis Number of Prior Visits to Hospital; y-axis Hospital Readmission Rate; series: Predicted Hospital Readmission Rate]
  17. 17. Hospital Readmission Model Assessment and Interpretation [chart: x-axis Number of Prior Visits to Hospital; y-axis Hospital Readmission Rate]
  18. 18. Hospital Readmission Model Assessment and Interpretation [chart: x-axis Number of Prior Visits to Hospital; y-axis Hospital Readmission Rate; series: Partial Dependence]
  19. 19. Partial Dependence. See Section 10.13.2, "Partial Dependence Plots" (p. 369), in https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
  20. 20. Compliance (n) /kəmˈplīəns/
  21. 21. Compliance (n) /kəmˈplīəns/ A word people use as a last resort to defend the status quo after they realize that their 100-variable regression model is an arbitrary representation of reality that is less accurate, less robust, and less interpretable than modern alternatives.
  22. 22. Arbitrary Representations of Reality Three statisticians sitting at a bar... One more round? ftp://ftp.nhtsa.dot.gov/GES/GES12/ ● 153,077 Police-reported accidents ● 58 Variables Goal: Try to Predict Probability of a Fatality
  23. 23. Arbitrary Representations of Reality
      Statistician #1. Model Performance (Log Loss): 0.469. Regression coefficients: Restraint Misuse 0.509; Roll Over 0.355; Alcohol Involved 0.089; Is Driver -0.694. "Looks like as long as we use seat belts and don’t roll over, we’ll survive. Having alcohol in the system doesn’t make much of a difference. Also, being the driver is safe, so I’m driving home."
  24. 24. Arbitrary Representations of Reality
      Statistician #1. Model Performance (Log Loss): 0.469. Regression coefficients: Restraint Misuse 0.509; Roll Over 0.355; Alcohol Involved 0.089; Is Driver -0.694. "Looks like as long as we use seat belts and don’t roll over, we’ll survive. Having alcohol in the system doesn’t make much of a difference. Also, being the driver is safe, so I’m driving home."
      Statistician #2. Model Performance (Log Loss): 0.467. Regression coefficients: Alcohol Involved 1.866; Age 0.008; Restraint Misuse 0.000; Hour of Accident -0.019. "Hmmm... looks like drinking and driving leads to fatal crashes. Probably shouldn’t have another round. Also, the later the better, so let’s just wait here until midnight."
  25. 25. Arbitrary Representations of Reality
      Statistician #1. Model Performance (Log Loss): 0.469. Regression coefficients: Restraint Misuse 0.509; Roll Over 0.355; Alcohol Involved 0.089; Is Driver -0.694. "Looks like as long as we use seat belts and don’t roll over, we’ll survive. Having alcohol in the system doesn’t make much of a difference. Also, being the driver is safe, so I’m driving home."
      Statistician #2. Model Performance (Log Loss): 0.467. Regression coefficients: Alcohol Involved 1.866; Age 0.008; Restraint Misuse 0.000; Hour of Accident -0.019. "Hmmm... looks like drinking and driving leads to fatal crashes. Probably shouldn’t have another round. Also, the later the better, so let’s just wait here until midnight."
      Statistician #3. Model Performance (Log Loss): 0.422. Regression coefficients: Opening Door In Motion 0.449; Is Police Officer -0.412; Booster Seat Used -0.787; Lap And Shoulder Belt -1.897. "No, no, no, we just need to wear lap and shoulder belts with our booster seats, and be police officers. Look at those coefficients! Furthermore, my model is better, so I’m right."
  26. 26. The Killer Potato
  27. 27. The Killer Potato
  28. 28. Obligatory Data Scientist Definition Slide. Venn diagram of Hacking Skills, Maths & Stats, and Domain Knowledge, with Data Science at the intersection. ● Foundational Statistics ● Internals of Algorithms ● Practical Knowledge and Experience ● Programming ○ Get Data ○ Manipulate Data ○ Explore Data ○ Build Models ○ Implement Models ● Understand the Business Problem ● Understanding of the Data. Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  29. 29. The current path to becoming a Data Scientist
  30. 30. A Better Way AUTOMATED USING MODERN TOOLS AND COMPUTATIONAL POWER
  31. 31. Takeaways ● There are technique-agnostic ways to assess and interpret predictive models. ● The shortage of Data Scientists will be solved by a combination of pragmatic education and levels of automation currently not thought possible.
  32. 32. Three quick tips for entrepreneurs
  33. 33. Watch out for Lean Startup & MVP Zealots. Minimum viable product (MVP): get the smallest functional product into the market ASAP to derisk the investment.
  34. 34. Watch out for Lean Startup & MVP Zealots. Minimum viable product (MVP) is the product with the highest return on investment versus risk. Minimum viable product (MVP): get the smallest functional product into the market ASAP to derisk the investment.
  35. 35. Be Paranoid and Don’t Rely on Hope.
  36. 36. Choose the Right Investors & Advisors: CHRIS LYNCH, HARRY WELLER, Jason Seats, Jit Saxena, Kevin Dick, Ray Tacoma, Brad Gillespie
  37. 37. © DataRobot, Inc. All rights reserved. Confidential jeremy@datarobot.com
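
Slides 11 to 13 argue that overall out-of-sample performance does not depend on the algorithm: a metric needs only the actual outcomes and the model's predictions on validation data. As a minimal sketch of that point, the following computes three of the metrics named on slide 13 in plain Python. The function names and the toy validation numbers are illustrative assumptions and do not come from the talk.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def log_loss(actual, predicted, eps=1e-15):
    """Binary log loss; predicted probabilities are clipped away from 0 and 1."""
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(actual)

# Hypothetical validation-set outcomes and predicted probabilities
# (made up for illustration; any model could have produced p_valid).
y_valid = [1, 0, 0, 1]
p_valid = [0.9, 0.2, 0.4, 0.6]

for name, metric in [("MAE", mean_absolute_error),
                     ("RMSE", root_mean_squared_error),
                     ("Log loss", log_loss)]:
    print(name, round(metric(y_valid, p_valid), 4))
```

Nothing in these functions refers to how the predictions were produced, which is why the same scorecard can be applied to a regression model and to a so-called black box alike.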
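
Slides 14 to 19 show predicted versus actual readmission rates by variable, and partial dependence, for a hospital readmission model. Both can be computed with nothing more than a prediction function, which is the sense in which slide 12 calls them algorithm-independent. The sketch below is one simple way to do that under stated assumptions: `toy_model`, the column names `prior_visits` and `age`, and the four rows of data are all hypothetical stand-ins, not the model or data from the talk.

```python
from collections import defaultdict

def predicted_vs_actual_by_variable(rows, actuals, predict, variable):
    """For each value of `variable`, return (mean actual, mean predicted)."""
    groups = defaultdict(lambda: [0.0, 0.0, 0])  # actual sum, predicted sum, count
    for row, actual in zip(rows, actuals):
        group = groups[row[variable]]
        group[0] += actual
        group[1] += predict(row)
        group[2] += 1
    return {value: (a / n, p / n) for value, (a, p, n) in sorted(groups.items())}

def partial_dependence(rows, predict, variable, grid):
    """For each grid value, overwrite `variable` in a copy of every row
    and average the model's predictions over all rows."""
    curve = {}
    for value in grid:
        preds = [predict({**row, variable: value}) for row in rows]
        curve[value] = sum(preds) / len(preds)
    return curve

def toy_model(row):
    """Hypothetical stand-in for any fitted model's predict function."""
    return min(1.0, 0.05 + 0.04 * row["prior_visits"] + 0.002 * row["age"])

# Hypothetical validation rows and readmission outcomes (1 = readmitted).
rows = [
    {"prior_visits": 0, "age": 30},
    {"prior_visits": 0, "age": 70},
    {"prior_visits": 2, "age": 50},
    {"prior_visits": 5, "age": 80},
]
actuals = [0, 0, 1, 1]

pva = predicted_vs_actual_by_variable(rows, actuals, toy_model, "prior_visits")
pd_curve = partial_dependence(rows, toy_model, "prior_visits", grid=[0, 1, 2, 5])
print(pva)
print(pd_curve)
```

Swapping `toy_model` for the prediction function of a random forest or a regression model requires no change to either helper, since they only call `predict` on rows.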
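
Slides 22 to 25 make the point that regression coefficients depend on which variables each statistician chose to include. A small worked example can show the mechanism. Note the substitution: the sketch uses ordinary least squares on synthetic data rather than the fatality models from the talk, and the variables `x1`, `x2`, and `y` are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def centered_cross_sum(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b))."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

def ols_one_variable(x, y):
    """Slope of a least-squares fit of y on x (with intercept)."""
    return centered_cross_sum(x, y) / centered_cross_sum(x, x)

def ols_two_variables(x1, x2, y):
    """Slopes of a least-squares fit of y on x1 and x2 (with intercept),
    solved from the 2x2 normal equations on centered data."""
    s11 = centered_cross_sum(x1, x1)
    s22 = centered_cross_sum(x2, x2)
    s12 = centered_cross_sum(x1, x2)
    s1y = centered_cross_sum(x1, y)
    s2y = centered_cross_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Synthetic data: x2 is correlated with x1, and y = 1*x1 + 2*x2 exactly.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [0, 0, 1, 1, 2, 2]
y = [a + 2 * b for a, b in zip(x1, x2)]

slope_alone = ols_one_variable(x1, y)      # x2 left out of the model
b1, b2 = ols_two_variables(x1, x2, y)      # both variables included
print(round(slope_alone, 3), round(b1, 3), round(b2, 3))
```

With `x2` omitted, the coefficient on `x1` absorbs part of the effect of `x2` and comes out close to 1.9 instead of 1, so two analysts with different variable lists would tell different stories from the same data.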
