Dan Mallinger, Data Science Practice Manager, Think Big Analytics at MLconf NYC

Despite the wide array of advanced techniques available today, too many practitioners are forced back to an old toolkit of approaches deemed “more interpretable.” Whether because of non-legal policy or the difficulty of executive presentation, these constraints result from poor analytics communication and an inability to explain model risks and outcomes, not from a failing of the techniques.

From sampling to feature reduction to supervised modeling, the toolbox and communications of data scientists are limited by these constraints. But instead of simplifying their models, data scientists can re-introduce often-ignored statistical practices to describe the models, their risks, and the impact of changes in the customer environment.

Even in situations without restrictions, these approaches will improve how practitioners select models and communicate results. Through measurement and simulation, the approaches reviewed here can articulate the promises, risks, and assumptions of developed models without requiring deep statistical explanations.




  1. Analytics Communication: Re-Introducing Complex Models (Think Big, Start Smart, Scale Fast)
  2. Dan Mallinger
     • Director of Data Science at Think Big
     • I work at the intersection of statistics and technology, but also business and analytics
     • Too often I see data scientists limit themselves and their businesses
  3. Today
     • Importance of communication
     • Lost tools of analytics communication
     • Tricks for those in regulated environments
     • More communication
  4. Not Today
  5. The “Explainable” Model Fallacy
     • Familiar = Clear
     • Clear = Explainable
     • Explainable = Understood
     • Understood = Trustworthy
  6. Better Communication Yields…
  7. Bad Communication and Black Boxes…
  8. Why We Should Care: We Won’t Waste Money
     “Alas, not even a 250GB server was sufficient: even after waiting three days, the data couldn’t even be loaded. […] Steve said it would be difficult for managers to accept a process that involved sampling.”
  9. Why We Should Care: You Can’t Explain Your Models Anyway
     hlm.html('Test1', test1_score__eoy ~ test1_score__boy + ... is_special_ed * perc_free_lunch ... other_score * support_rec ... (is_focal | inst_sid), data=kinder)
     Technically this is a regression… so simple anyone can understand it!
  10. Why We Should Care: Some of Us Don’t Understand Our Models
      • If your model needs to be re-fit every month, it probably has an eating disorder
      • Be a better communicator to yourself
  11. Meet Bob
  12. Airline Data
      • Predicting “Membership” (not really; this is a dummy outcome)
      • Pick a “black box” model
      • Build understanding
  13. Manager Doesn’t Trust Samples?
      Danger! Does your manager know what strata are?!
  14. Manager Doesn’t Trust Samples? Bootstrap!
      • Simple, but underused
      • Resample the data, rebuild the models
      • Parametric and non-parametric bootstrapping (bias/variance)
      • Gist of the non-parametric version: do it a bunch of times, treat the results as a distribution for confidence intervals
      • Easy (a fuller sketch follows below):
          library(nnet)  # the black-box model used throughout

          models <- lapply(1:5000, function(i) {
            rand.rows <- sample.int(nrow(raw), size=10000)   # draw a random subsample
            df <- raw[rand.rows, c(dep.cols, ind.cols)]
            nnet(Member ~ ., data=df, size=10)               # refit on the resample
          })
      • Easier: library(bootstrap)
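A minimal sketch of turning those resampled fits into the confidence interval the slide alludes to, assuming the same hypothetical `raw` data frame with a factor `Member` outcome; the replicate count and the out-of-bag scoring below are illustrative choices, not from the deck:

    library(nnet)

    boot.acc <- sapply(1:500, function(i) {
      rows <- sample.int(nrow(raw), replace=TRUE)             # resample with replacement
      fit  <- nnet(Member ~ ., data=raw[rows, ], size=10, trace=FALSE)
      oob  <- setdiff(seq_len(nrow(raw)), rows)               # rows this fit never saw
      mean(predict(fit, raw[oob, ], type="class") == raw$Member[oob])
    })
    quantile(boot.acc, c(0.025, 0.975))  # 95% percentile interval for accuracy

Scoring each replicate on its out-of-bag rows gives an honest error estimate without a separate holdout, and the spread of `boot.acc` is one way to draw the model-stability picture on the next slide.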
  15. Stability of Model
  16. Now What?
      • Bob has convinced his manager that his sampling strategy is acceptable (good job, Bob!)
      • But he hasn’t built trust in the model
  17. Bob Doesn’t Explain Variables Like This…
  18. Variable Importance
      Shining a light into the parameters of our black box
      • If X matters, then shuffling it should hurt our model (see the sketch below)
      • Then bootstrap for confidence intervals
      • Most R models have a method for this (see caret)
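A hedged sketch of that shuffle test, reusing the hypothetical `fit` and `raw` from the bootstrap example; caret’s varImp() wraps model-specific variants of the same idea:

    perm.importance <- function(fit, df, outcome="Member") {
      base <- mean(predict(fit, df, type="class") == df[[outcome]])   # unshuffled accuracy
      sapply(setdiff(names(df), outcome), function(v) {
        shuffled <- df
        shuffled[[v]] <- sample(shuffled[[v]])   # break the link between X and the outcome
        base - mean(predict(fit, shuffled, type="class") == df[[outcome]])
      })
    }

    sort(perm.importance(fit, raw), decreasing=TRUE)  # bigger accuracy drop = more important

Wrapping this inside the resampling loop from slide 14 yields the confidence intervals the slide mentions.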
  19. Variable Importance: Bob’s Data
      Shining a light into the parameters of our black box
  20. Sensitivity and Robustness
      • Similar to variable importance
      • How do the relationships in our model play out in different settings?
      • How much does our model depend on accurate measurement?
  21. Sensitivity and Robustness Example
      My code wasn’t working, so thanks to: https://beckmw.wordpress.com/2013/10/07/sensitivity-analysis-for-neural-networks/
  22. More Sensitivity and Robustness
      • Manual variable permutation in R (see the sketch below)
      • library(sensitivity)
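One simple manual version is a what-if sweep: hold every predictor at its median and vary one across its observed range. A sketch, assuming the hypothetical `fit` and `raw` from earlier with all-numeric predictors ("miles.flown" is an illustrative name, not from the deck); library(sensitivity) offers formal designs such as Morris screening:

    sensitivity.curve <- function(fit, df, var, outcome="Member", n=50) {
      x <- df[setdiff(names(df), outcome)]
      grid <- as.data.frame(lapply(x, function(col) rep(median(col), n)))  # hold at medians
      grid[[var]] <- seq(min(x[[var]]), max(x[[var]]), length.out=n)       # sweep one input
      data.frame(grid[var], p.member=predict(fit, grid, type="raw"))
    }

    plot(sensitivity.curve(fit, raw, "miles.flown"), type="l")

Flat curves suggest robustness to measurement error in that variable; steep or non-monotonic ones are exactly what the audience should hear about.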
  23. Dang!
      • Bob’s manager has told him that black box models are not allowed
      • But Bob’s neural net performed better than anything else. Oh dear!
  24. Blackbox to Whitebox
      • Bob’s work in neural nets can be leveraged!
      • Generically: prototype selection, identifying points on the decision boundary to improve the model
      • Specifically: extracting decision trees from neural nets
  25. Blackbox to Whitebox: Methodology (see the sketch below)
      “Extracting Decision Trees from Trained Neural Networks”, Krishnan & Bhattacharya
      Also: https://github.com/dvro/scikit-protopy
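The paper describes a specific extraction algorithm; a cruder stand-in that captures the spirit is a surrogate tree fit to the network’s own predictions rather than the true labels. A sketch with the hypothetical `fit` and `raw` from earlier, with rpart standing in for whatever white-box learner policy allows:

    library(rpart)

    inputs <- raw[setdiff(names(raw), "Member")]                # drop the true outcome
    inputs$nn.label <- factor(predict(fit, raw, type="class"))  # the black box's answers
    surrogate <- rpart(nn.label ~ ., data=inputs)               # tree that imitates the net

    print(surrogate)                                            # human-readable split rules
    mean(predict(surrogate, inputs, type="class") == inputs$nn.label)  # fidelity to the net

If fidelity is high, the tree’s rules are a defensible explanation of the network; if it is low, that gap is itself worth reporting.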
  26. Now What?
      • Bob has shown how variables impact his black box
      • He’s shown how they behave in different contexts
      • He’s shown how robust they are to errors
      • But he hasn’t told us why we should care
  27. Metrics and Assessment
      Accuracy, false positive rates, and confusion matrices are CONSTRUCTS
  28. Conclusions
      • Enterprises are slow: predict KPIs, not KRIs
      • Give confidence bands, sensitivities, and the impact of context changes
      • Build a story about the model’s internals and assumptions; tie it to the audience’s domain knowledge
      • Explainability is up to the modeler, not the model*
        *Unless, of course, your regulator says otherwise!
  29. Thanks! We’re hiring: http://thinkbig.teradata.com
