Debugging machine-learning

My presentation from PyDataWarsaw 2017

Published in: Data & Analytics

  1. 1. Debugging Machine Learning. Mostly for profit but with a bit of fun too! Michał Łopuszyński PyData Warsaw, 19.10.2017
  2. 2. About me • In my previous professional life, I was a theoretical physicist. I got a PhD in solid state physics • For the last 5 years I have worked as a Data Scientist at ICM, University of Warsaw
  3. 3. Agenda • 4 failure modes of ML systems • 9 simple hints on what to do
  4. 4. Hey, my ML system does not work at all...
  5. 5. Hint #1 Check your code AKA it is engineering, stupid!
  6. 6. Write tests • Do not strive for 100% coverage; partial coverage is infinitely better than none! • In Python, doctests are your friends • "Hidden" benefits of tests: better code structure, executable documentation • Test the fragile parts first
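A minimal sketch of the doctest idea from this slide. The function `normalize` is a hypothetical example, not taken from the talk; the examples in its docstring serve as both documentation and tests.

```python
import doctest


def normalize(values):
    """Scale a list of numbers linearly to the range [0, 1].

    >>> normalize([2, 4, 6])
    [0.0, 0.5, 1.0]
    >>> normalize([3, 3])
    Traceback (most recent call last):
        ...
    ValueError: all values are equal, cannot normalize
    """
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("all values are equal, cannot normalize")
    return [(v - low) / (high - low) for v in values]


if __name__ == "__main__":
    # Runs every example embedded in the docstrings of this module.
    print(doctest.testmod())
```

Running the file directly (or `python -m doctest file.py`) executes the docstring examples and reports any mismatch.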
  7. 7. Test your code with Monte Carlo / synthetic data • Try to generate a trivial case for your ML system • This tests the whole pipeline (transforming/training) and lets you explore your system's performance under unusual conditions • If you have a generative model, prepare Monte Carlo data from the assumed distributions • Perturb the original data by generating the output with a known and learnable prescription
  8. 8. Code style • Single responsibility principle • Do not repeat yourself (DRY!) • Have and apply a coding standard (PEP8) • Consider using a linter • Short functions and methods • https://xkcd.com/844/ • pycodestyle checker (formerly pep8) • yapf formatter • pylint, pyflakes, pychecker
  9. 9. Naming • Minimal requirement: be consistent! • Interesting software development books offering chapters on naming • Freely available chapter on names
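One way the "known and learnable prescription" test could look, sketched with NumPy. Here `fit_linear` is only a stand-in for a real training pipeline, and the coefficients, noise level, and tolerance are assumptions chosen for illustration.

```python
import numpy as np


def fit_linear(X, y):
    """Stand-in for the real training step: least squares with an intercept."""
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # weights followed by the intercept


def make_synthetic(n_samples, true_coef, noise, seed=0):
    """Generate data from a known, learnable prescription."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, len(true_coef) - 1))
    y = X @ true_coef[:-1] + true_coef[-1] + rng.normal(scale=noise, size=n_samples)
    return X, y


true_coef = np.array([2.0, -1.0, 0.5])  # two weights and an intercept
X, y = make_synthetic(n_samples=500, true_coef=true_coef, noise=0.01)
learned = fit_linear(X, y)

# On a trivial, nearly noiseless problem the pipeline must recover the truth.
assert np.allclose(learned, true_coef, atol=0.01)
```

If this assertion fails on such an easy case, the bug is in the code, not in the data.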
  10. 10. Hint #2 Check your data
  11. 11. Data quality audits are difficult • "Happy families are all alike; every unhappy family is unhappy in its own way." (Leo Tolstoy) • "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." (Hadley Wickham) • H. Wickham, Tidy Data, JSS 59 (2014), doi: 10.18637/jss.v059.i10 • Image credits: Wikipedia
  12. 12. Data quality • Beware, your data providers usually overestimate the data quality • Think of outliers and missing values (and how they are represented) • Understand your data • Do exploratory data analysis • Visualize, visualize, visualize • Talk to the domain expert • Is your data correct, complete, coherent, stationary (seasonality!), deduplicated, up-to-date, representative (unbiased)?
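A small sketch of two of the checks listed above: how missing values are represented and whether the data is deduplicated. The set `MISSING_MARKERS` and the sample rows are hypothetical; real providers may encode missing values differently.

```python
from collections import Counter

# Hypothetical markers that a provider might use for "missing".
MISSING_MARKERS = {None, "", "NA", "N/A", "null", "-999"}


def audit(rows):
    """Summarize missing values per column and count exact duplicate rows."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value in MISSING_MARKERS:
                missing[column] += 1
    # Rows are compared as sorted (column, value) tuples.
    seen = Counter(tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


rows = [
    {"id": "1", "age": "34", "city": "Warsaw"},
    {"id": "2", "age": "-999", "city": ""},
    {"id": "3", "age": "NA", "city": "Krakow"},
    {"id": "1", "age": "34", "city": "Warsaw"},
]
report = audit(rows)
print(report)
```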
  13. 13. OK, my ML system works, but I think it should perform better...
  14. 14. Hint #3 Examine your features
  15. 15. Features • Features make a difference! • Be creative with your features • Try meaningful transformations, combinations (products/ratios), decorrelation... • Think of good representations for non-tabular data (text, time-series) • Make a conscious decision about missing values • Understand which features are important for your model • Use ML models offering feature ranking • Use feature selection methods • (Illustration: table with columns ID, F1, F2, F3)
  16. 16. Hint #4 Examine your data points
  17. 17. Data points • Find difficult data points (DDPs)! • DDP = notoriously misclassified (or high-error) cases in your cross-validation loop for a large variety of models • (Illustration: table with columns ID, P1, P2, P3, P4, P5) • Examine DDPs, understand them! • In the easiest case, remove DDPs from the dataset (think outliers, mislabeled examples)
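A rough sketch of hunting for difficult data points. As an assumption for brevity, the "large variety of models" is approximated by k-nearest-neighbour classifiers with different `k`, evaluated leave-one-out on synthetic data with one deliberately mislabeled point; in practice the models would be genuinely different.

```python
import numpy as np


def knn_loo_predictions(X, y, k):
    """Leave-one-out predictions of a k-nearest-neighbour classifier (labels 0/1)."""
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(distances, np.inf)  # a point may not vote for itself
    neighbours = np.argsort(distances, axis=1)[:, :k]
    votes = y[neighbours].mean(axis=1)
    return (votes > 0.5).astype(int)


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(30, 2)), rng.normal(2.0, 0.5, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)
y[5] = 1  # deliberately mislabeled example

# Count, for every point, how many models misclassify it.
models = [1, 3, 5, 7]
error_counts = sum((knn_loo_predictions(X, y, k) != y).astype(int) for k in models)

# Points that every model gets wrong are candidates for closer inspection.
difficult = np.flatnonzero(error_counts == len(models))
print(difficult)
```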
  18. 18. Influence functions
  19. 19. Influence functions Best Paper Award ICML 2017
  20. 20. Data points • Get more data! • (Illustration: table with columns ID, P1, P2, P3, P4, P5) • Trick 1. Extend your set with artificial data, e.g., data augmentation in image processing, SMOTE algorithm for imbalanced datasets. Good performance booster, rarely applicable • Trick 2. Automatically generate a noisily labeled dataset by heuristics, e.g., distant supervision in NLP (requires unlabeled data!) • Trick 3. Semi-supervised learning methods: self-training and co-training (requires unlabeled data!)
  21. 21. Hint #5 Examine your model
  22. 22. Why does my model predict what it predicts? (philosophical slide) • How do you answer why questions? • Inspiring homework: watch Richard Feynman, Fun to Imagine, on magnets (YouTube)
  23. 23. Model introspection • You can answer the why question only for very simple models (e.g., linear model, basic decision trees) • Sometimes it is instructive to run such a simple model on your dataset, even though it does not provide top-level performance • You can boost your simple model by feeding it more advanced (non-linearly transformed) features
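To illustrate the last point, here is a sketch in which a plain linear model gets a squared feature and stays fully inspectable. The quadratic target is synthetic and the helper `fit_and_score` is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = 1.0 + 2.0 * x**2 + rng.normal(scale=0.1, size=200)  # synthetic quadratic target


def fit_and_score(features, target):
    """Least-squares fit with intercept; returns coefficients and R^2."""
    design = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    r2 = 1.0 - residual.var() / target.var()
    return coef, r2


# Raw feature only: the linear model cannot capture the curvature.
_, r2_raw = fit_and_score(x[:, None], y)

# Adding x**2 keeps the model linear in its inputs and easy to read:
# coef holds the weights for x, x**2, and the intercept, in that order.
coef, r2_squared = fit_and_score(np.column_stack([x, x**2]), y)
print(r2_raw, r2_squared, coef)
```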
  24. 24. Complex model introspection • LIME algorithm = Local Interpretable Model-agnostic Explanations • ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
  25. 25. Complex model introspection – practical issues • LIME authors provided an open source Python implementation as the lime package • Another option: the eli5 package, aimed more generally at model introspection and explanation (includes a lime implementation), http://eli5.readthedocs.io/en/latest/tutorials
  26. 26. Visualizing models • Display model in a data space • Look at collection of models at once • Explore the process of model fitting • The ASA Data Science Journal 8, 203 (2015), doi: 10.1002/sam.11271
  27. 27. So my ML system works on test data, but you tell me it fails in production?
  28. 28. Hint #6 Watch out for overfitting
  29. 29. Overfitting • "If you torture the data long enough, it will confess." (Ronald Coase)
  30. 30. Hint #7 Watch out for data leakage
  31. 31. Data leakage • Some time ago, I used to think data leaks are trivial to avoid • They are not! (Look at the number of Kaggle competitions flawed by data leakage) • You may introduce them yourself, e.g., meaningful identifiers, past & future separation in time series • You may receive them in the data from your provider • Good paper
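For the "past & future separation" case, a minimal sketch of a time-ordered split. The record layout (a `day` field) and the cutoff date are assumptions made for illustration.

```python
from datetime import date


def time_split(records, cutoff):
    """Split records so that training data strictly precedes the test data."""
    train = [r for r in records if r["day"] < cutoff]
    test = [r for r in records if r["day"] >= cutoff]
    return train, test


records = [{"day": date(2017, 1, d), "value": d * 1.5} for d in range(1, 21)]
train, test = time_split(records, cutoff=date(2017, 1, 16))

# No training example may come from the period the model is evaluated on.
assert max(r["day"] for r in train) < min(r["day"] for r in test)
```

A random shuffle before splitting would mix future rows into the training set, which is exactly the leak this split avoids.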
  32. 32. Hint #8 Watch out for covariate shift
  33. 33. What is covariate shift? (Plot of y against X: training data)
  34. 34. What is covariate shift? (Plot of y against X: training data, model)
  35. 35. What is covariate shift? (Plot of y against X: training data, model, noiseless reality)
  36. 36. What is covariate shift? (Plot of y against X: training data, model, noiseless reality, production (test) data)
  37. 37. Covariate shift • Unlike overfitting and data leakage, it is easier to detect • Method: try to build a classifier differentiating train from production (test) data. If you succeed, you very likely have a problem • Basic remedy – reweighting data points. Give production-like data higher impact on your model
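A sketch of the detection method and the reweighting remedy. A hand-written logistic regression stands in for "any classifier", the data is synthetic, and the weight formula p / (1 - p) is one common choice for estimating importance weights, not necessarily what the talk used.

```python
import numpy as np


def fit_logistic(X, y, steps=500, lr=0.1):
    """Plain gradient-descent logistic regression (stand-in for any classifier)."""
    design = np.column_stack([X, np.ones(len(X))])
    w = np.zeros(design.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-design @ w))
        w -= lr * design.T @ (p - y) / len(y)
    return w


def predict_proba(w, X):
    """Estimated probability that each row comes from production."""
    design = np.column_stack([X, np.ones(len(X))])
    return 1.0 / (1.0 + np.exp(-design @ w))


def auc(scores, labels):
    """Probability that a random positive scores higher than a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties


def domain_auc(train_X, prod_X, seed=0):
    """Held-out AUC of a classifier telling training rows from production rows."""
    rng = np.random.default_rng(seed)
    X = np.vstack([train_X, prod_X])
    y = np.concatenate([np.zeros(len(train_X)), np.ones(len(prod_X))])
    order = rng.permutation(len(X))
    fit_idx, eval_idx = order[: len(X) // 2], order[len(X) // 2:]
    w = fit_logistic(X[fit_idx], y[fit_idx])
    return auc(predict_proba(w, X[eval_idx]), y[eval_idx]), w


rng = np.random.default_rng(1)
train_X = rng.normal(0.0, 1.0, size=(400, 2))
shifted_X = rng.normal(1.5, 1.0, size=(400, 2))  # production drifted away
similar_X = rng.normal(0.0, 1.0, size=(400, 2))  # production resembles training

auc_shifted, w = domain_auc(train_X, shifted_X)
auc_similar, _ = domain_auc(train_X, similar_X)
print(auc_shifted, auc_similar)  # well above 0.5 signals a shift; near 0.5 does not

# Remedy: weight each training row by how production-like it looks.
p = predict_proba(w, train_X)
weights = p / (1.0 - p)
```

The resulting `weights` can be passed as per-sample weights to whatever training routine supports them.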
  38. 38. The quality of my super ML system deteriorates with time, really? Really really?
  39. 39. ML system in production • 2009: Hooray, we can predict flu epidemics from Google query data! • 2014: Hmm... Can we?
  40. 40. ML system in production NIPS 2015 paper
  41. 41. Hint #9 Remember monitoring & maintenance AKA it is engineering again, stupid!
  42. 42. Thank you! Questions? @lopusz