ICML 2018 Reproducible Machine Learning - A. Gramfort

Workshop talk from:

https://mltrain.cc/events/enabling-reproducibility-in-machine-learning-mltrainrml-icml-2018/

Thoughts on the challenges of reproducibility in ML and computational sciences, and some engineering solutions based on my experience writing scikit-learn for the last 8 years.

Published in: Science

  1. Reproducible ML: software challenges, anecdotes and some engineering solutions. Alexandre Gramfort, http://alexandre.gramfort.net, GitHub: @agramfort, Twitter: @agramfort
  4. FreeSurfer: popular software for extracting features from MRI (e.g. cortical thickness, used to predict Alzheimer’s disease). Hardware and software differences can lead to different features, statistical results and scientific conclusions. https://surfer.nmr.mgh.harvard.edu/
  6. ICA: a popular matrix factorization problem. Infomax runs SGD on the non-convex log-likelihood function. Changing the BLAS/LAPACK backend changes the results. Even changing OMP_NUM_THREADS can change the results. https://github.com/mne-tools/mne-python/issues/4922
  8. Even on the same machine, numerical solvers can lead to different outcomes. https://github.com/scikit-learn/scikit-learn/issues/5545
  9. Some software engineering solutions for reproducible ML
  12. scikit-learn (http://scikit-learn.org): 526,000 users in the last 30 days and 42,000,000 page views in the last year. A big user base means a higher chance of spotting issues.
  14. Do not reinvent the wheel… We provide one component in the Python ecosystem (#JSM2016, Jake VanderPlas). Code reuse and tight community: bigger user base.
  15. Unit tests and docstring tests. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/tests/test_pca.py
  16. alex@:scikit-learn(master)$ cloc --not-match-d='tests' --force-lang=Python sklearn/
           427 text files.
           426 unique files.
            29 files ignored.
      […]
      -------------------------------------------------------------------------------
      Language                     files          blank        comment           code
      -------------------------------------------------------------------------------
      Python                         426          83679         395769         552905
      -------------------------------------------------------------------------------
      alex@:scikit-learn(master)$ cloc --match-d='tests' --force-lang=Python sklearn/
           168 text files.
           168 unique files.
            25 files ignored.
      […]
      -------------------------------------------------------------------------------
      Language                     files          blank        comment           code
      -------------------------------------------------------------------------------
      Python                         168          13153           7014          45710
      -------------------------------------------------------------------------------
      Not even counting docstrings… 45,000 lines of tests!
  17. Continuous integration on all platforms (Win/OSX/Linux), from Py 2.7 to 3.7. Coverage.
  18. Simplifying code reuse with sphinx-gallery: write doc by writing Python code. https://sphinx-gallery.readthedocs.io Extracted from scikit-learn and funded by:
  19. Sphinx-Gallery
  20. Sphinx-Gallery https://mybinder.org/
  21. Simplifying code reuse with sphinx-gallery. https://sphinx-gallery.readthedocs.io Configuring sphinx-gallery is really easy:
  22. How to go even further?
  23. Open Data. Some oldies… http://jaberg.github.io/skdata/ http://www.dmi.usherb.ca/~larocheh/mlpython/ Many more…
  24. Open Data. https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community
  26. Open Data
  28. But DATA isn’t much… without evaluation platforms. https://mlperf.org/
  29. Reproducible benchmarks. https://mlperf.org/
  30. RAMP: challenge with code submission. https://www.ramp.studio/
  31. Reproducible challenges. https://paris-saclay-cds.github.io/autism_challenge/
  33. RAMP (https://www.ramp.studio/) allows you to: • Run code on private data • Pick a model with a good accuracy/performance trade-off
  34. So in the end maybe we can easily do better?
  37. Wrapping up: • Even hardware/software replication is hard and costly • Yet, technology and engineering can make ML more replicable • Modern science is Open Science • Disclaimer: Not every problem has an engineering solution
  38. Contact: Alexandre Gramfort, http://alexandre.gramfort.net, GitHub: @agramfort, Twitter: @agramfort. "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." ~ John Tukey. Support:
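
The ICA and solver slides above note that changing the BLAS/LAPACK backend, or even OMP_NUM_THREADS, can change results. One common reason (an illustration added here, not taken from the slides) is that floating-point addition is not associative, so a reduction split into a different number of partial sums can round differently. The sketch below uses only the Python standard library and does not call BLAS or OpenMP; `naive_sum` and `chunked_sum` are hypothetical helpers that mimic a threaded reduction.

```python
import math
import random


def naive_sum(values):
    """Plain left-to-right accumulation, with no compensation for rounding."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum `values` as n_chunks partial sums, then add the partial sums.

    This mimics (as an assumption) how a multi-threaded reduction
    regroups the additions when the thread count changes.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Grouping alone changes the rounding of a three-term sum.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False with IEEE 754 doubles

# Fixed seed so the input data is identical on every run.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]

exact = math.fsum(values)  # correctly rounded reference sum
one_chunk = chunked_sum(values, 1)
four_chunks = chunked_sum(values, 4)

# The two groupings may differ from the reference by different amounts.
print(one_chunk - exact, four_chunks - exact)
```

In an iterative solver on a non-convex objective, differences of this size can be amplified over many iterations, which is one plausible route from a thread-count change to a different final result. Tests that compare against a tolerance rather than for exact equality are less sensitive to this.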

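
The slide on configuring sphinx-gallery showed a snippet that did not survive in this transcript. As a minimal sketch based on the sphinx-gallery documentation (not a reconstruction of the slide), a Sphinx `conf.py` enables the extension and points it at a folder of example scripts. The two directory paths below are assumptions about the project layout.

```python
# conf.py (Sphinx configuration file)

extensions = [
    "sphinx_gallery.gen_gallery",  # builds the gallery during the Sphinx run
]

sphinx_gallery_conf = {
    # Folder containing the example scripts (assumed layout).
    "examples_dirs": "../examples",
    # Folder, relative to conf.py, where the generated pages are written (assumed name).
    "gallery_dirs": "auto_examples",
}
```

With this in place, each Python script in the examples folder is turned into a documentation page when the docs are built.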