Automatic and Interpretable Machine Learning in R with H2O and LIME (Milan Edition)

H2O + LIME Workshop (Milan Edition)

Published in: Technology

  1. 1. Automatic and Interpretable Machine Learning in R with H2O and LIME. Jo-fai (Joe) Chow, Data Science Evangelist / Community Manager. joe@h2o.ai, @matlabulous. Download → https://bit.ly/joe_eRum_2018
  2. 2. Agenda (Time: Topics / Tasks)
     • 7:00 – 7:30 pm: Welcome + Data Hack Italia
     • 7:30 – 7:45 pm: Install h2o, lime, mlbench from CRAN (slides/code: bit.ly/joe_eRum_2018)
     • 7:45 – 8:00 pm: Introduction (H2O, AutoML, LIME)
     • 8:00 – 8:30 pm: Regression Example
     • 8:30 – 9:00 pm: 🍕 🍕 🍕 🍕 🍕
     • 9:00 – 9:20 pm: Classification Example
     • 9:20 – 9:30 pm: Quick Recap
     • 9:30 – 9:45 pm: Real Use-Case: Moneyball
     • 9:45 – 10:00 pm: Other H2O News + Q & A
  3. 3. 7:30 – 7:45 pm: Install h2o, lime, mlbench from CRAN (slides/code: bit.ly/joe_eRum_2018). mlbench provides the datasets: install.packages('mlbench')
  4. 4. Install H2O using tar.gz (if needed): Go to https://www.h2o.ai/download/ → Click → Click → Unzip → Install
  6. 6. Why?
     • Most users/organizations can benefit from automatic machine learning pipelines.
     • Eliminate time wasted on human errors, debugging etc.
     • Model interpretation is crucial for those who must explain their models to regulators or customers.
     You will learn …
     • How to build high-quality H2O models (almost) automatically.
     • How to explain predictions from complex H2O models with LIME.
     • Bonus: A real use-case that led to multimillion-dollar baseball decisions earlier this year.
  7. 7. AutoML + LIME + Shiny = Real Moneyball: a $20M trade 2 weeks prior to the season beginning
  8. 8. About Me
     • Before H2O: Water Engineer / EngD Researcher / Matlab Fan Boy (wonder why @matlabulous?); discovered R, Python, H2O … never looked back; Data Scientist at Virgin Media (UK), Domino Data Lab (US)
     • At H2O: Data Scientist / Evangelist / Sales Engineer / Solution Architect / Community Manager … the harsh reality of startup life …
     • H2O SWAG Photographer #AroundTheWorldWithH2Oai: Barcelona, Berlin, Brussels, Paris, Milan (2016)
     • Love H2O? Get some stickers!
  9. 9. Reminder: #360Selfie (Budapest, Cologne, Toulouse, Amsterdam, Berlin, London)
  10. 10. About H2O.ai … Have you seen Avengers: Infinity War? Do you know all the characters in the movie? (No spoilers, I promise)
  11. 11. We develop machine learning platforms
  12. 12. CONFIDENTIAL. Gartner names H2O as Leader with the most completeness of vision
     • H2O.ai recognized as a technology leader with most completeness of vision
     • H2O.ai was recognized for the mindshare, partner network and status as a quasi-industry standard for machine learning and AI.
     • H2O customers gave the highest overall score among all the vendors for sales relationship and account management, customer support (onboarding, troubleshooting, etc.) and overall service and support.
  13. 13. Platforms with H2O integration. H2O + KNIME: talk at KNIME Summit, Mar 2017
  14. 14. Company Overview
     • Founded: 2012, Series C in Nov 2017
     • Products: Driverless AI (Automated Machine Learning); H2O Open Source Machine Learning; Sparkling Water
     • Mission: Democratize AI. Do Good
     • Team: ~100 employees; Distributed Systems Engineers doing Machine Learning; world-class visualization designers
     • Offices: Mountain View, London, Prague
  15. 15. H2O Products: in-memory, distributed machine learning algorithms with H2O Flow GUI; H2O AI open source engine integration with Spark; lightning-fast machine learning on GPUs; automatic feature engineering, machine learning and interpretability; secure multi-tenant H2O clusters
  16. 16. H2O Products (slide 15 repeated). This workshop: in-memory, distributed machine learning algorithms with H2O Flow GUI
  17. 17. Worldwide Recognition in the H2O.ai Community. Open Source Community / Paying Customers: Financial, Insurance, Marketing, HW Vendors, Retail, Advisory & Accounting, Healthcare. “H2O.ai's reference customers gave it the highest overall score for sales relationship and overall service and support” (Gartner MQ 2018)
  18. 18. Why H2O?
  19. 19. Our Mission: Make Machine Learning Accessible to Everyone
  20. 20. Scientific Advisory Council
  21. 21. H2O Team
  22. 22. H2O Team: Origin of R Package `ggplot2`
  23. 23. H2O Team: Matt Dowle
  24. 24. H2O Team: Erin LeDell, Chief ML Scientist; Women in ML/DS & R-Ladies Global
  25. 25. H2O Team: Erin LeDell, Navdeep Gill (H2O AutoML)
  26. 26. H2O Team: Kaggle Grand Masters (and their highest rank): 1st, 4th, 25th, 48th, 33rd, 13th, out of about 80,000 Kagglers
  27. 27. H2O Team: 1st, 4th, 25th, 48th, 33rd, 13th … 181st. Hoping to get closer to them at some point …
  28. 28. https://www.meetup.com/pro/h2oai/
  29. 29. We sponsor meetups. We encourage diversity … and more. Encourage your friends/colleagues to give a talk: joe@h2o.ai
     Joe’s Meetups | Female Speaker | Female Speaker Ratio
     London Dec 2017 | Kasia Kulma | 1/3
     Amsterdam Feb 2018 | Andreea Bejinaru | 1/2
     London Mar 2018 | Cheuk Ting Ho | 1/3
     London Jun 2018 | Torgyn Shaikhina | 1/4
     Overall (so far) | | 4/12 = 33.3%
     Source: https://www.maddycoupons.in/blog/yummy-yummy-pizza/
  30. 30. About H2O AutoML: Automatic Machine Learning with H2O, http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
  31. 31. High Level Architecture
     • Load Data: HDFS, S3, NFS, Local, SQL → Distributed In-Memory, Loss-less Compression
     • H2O Compute Engine: Exploratory & Descriptive Analysis; Feature Engineering & Selection; Supervised & Unsupervised Modeling; Model Evaluation & Selection; Predict
     • Data & Model Storage
     • Production Scoring Environment: Model Export (Plain Old Java Object); Data Prep Export (Plain Old Java Object); Your Imagination
  32. 32. High Level Architecture (diagram of slide 31 repeated): Import Data from Multiple Sources
  33. 33. High Level Architecture (diagram of slide 31 repeated): Fast, Scalable & Distributed Compute Engine Written in Java
  34. 34. H2O-3 Algorithms Overview
     Supervised Learning
     • Statistical Analysis: Generalized Linear Models (Binomial, Gaussian, Gamma, Poisson and Tweedie); Naïve Bayes
     • Ensembles: Distributed Random Forest (classification or regression models); Gradient Boosting Machine (produces an ensemble of decision trees with increasingly refined approximations)
     • Deep Neural Networks: Deep Learning (creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations)
     Unsupervised Learning
     • Clustering: K-means (partitions observations into k clusters/groups of the same spatial size; automatically detects optimal k)
     • Dimensionality Reduction: Principal Component Analysis (linearly transforms correlated variables to independent components); Generalized Low Rank Models (extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data)
     • Anomaly Detection: Autoencoders (find outliers using nonlinear dimensionality reduction with deep learning)
  35. 35. H2O Core (diagram): H2O, Model Building, CPU
  36. 36. H2O Core (diagram): H2O, H2O, H2O
  37. 37. H2O Core (diagram): H2O Distributed In-Memory, Model Building, CPU, CPU, CPU
  38. 38. H2O Core (diagram): H2O Distributed In-Memory, Model Building, CPU, CPU, CPU, YARN; SQL, NFS, S3; Firewall or Cloud
  39. 39. High Level Architecture (diagram of slide 31 repeated): Multiple Interfaces
  40. 40. H2O Flow (Web)
  41. 41. Using H2O with R and Python
  42. 42. H2O Team: Erin LeDell, Navdeep Gill (H2O AutoML)
  44. 44. AutoML
  45. 45. H2O-3 Supervised Algorithms in AutoML (algorithm overview of slide 34 repeated)
  48. 48. High Level Architecture (diagram of slide 31 repeated): Export Standalone Models for Production
  49. 49. URL: docs.h2o.ai
  50. 50. About Machine Learning Interpretability: LIME (Local Interpretable Model-Agnostic Explanations) … and more
  51. 51. Acknowledgement
  52. 52. Why Should I Trust Your Model?
  54. 54. Propensity to Purchase vs. Age (two plots)
     • Linear Models: exact explanations for approximate models. “For a one unit increase in age, propensity to purchase increases by 3% on average.” Lost profits. Wasted marketing.
     • Machine Learning: approximate explanations for exact models. Slope begins to decrease here: act to optimize savings. Slope begins to increase here sharply: act to optimize profits.
  55. 55. LIME (Local Interpretable Model-Agnostic Explanations): How does it work?
     • Theory: LIME approximates the model locally as a logistic or linear model; repeats the process many times; outputs the features that are most important to the local models
     • Outcome: approximate reasoning; complex models can be interpreted (neural nets, Random Forest, ensembles etc.)
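The local-approximation idea on this slide can be sketched in a few lines of plain Python. This is a simplified illustration, not the `lime` package's API or its exact algorithm (the real package does more, for example feature selection). The names `black_box`, `explain_locally`, `scale` and `kernel_width` are hypothetical, and the model being explained is a made-up nonlinear function rather than an H2O model.

```python
import math
import random


def black_box(x1, x2):
    """Hypothetical stand-in for a complex model: nonlinear in both features."""
    return x1 ** 2 + 3.0 * math.sin(x2)


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def explain_locally(model, point, n_samples=500, scale=0.3, kernel_width=0.5, seed=42):
    """Fit a proximity-weighted linear surrogate around `point`.

    Returns [intercept, weight_1, ..., weight_k] of the local linear model.
    """
    rng = random.Random(seed)
    dim = len(point) + 1
    # Accumulate the weighted normal equations (X^T W X) beta = X^T W y.
    xtwx = [[0.0] * dim for _ in range(dim)]
    xtwy = [0.0] * dim
    for _ in range(n_samples):
        # 1. Perturb the instance being explained.
        sample = [p + rng.gauss(0.0, scale) for p in point]
        # 2. Ask the black-box model for its prediction.
        y = model(*sample)
        # 3. Weight the sample by its proximity to the original point.
        dist2 = sum((s - p) ** 2 for s, p in zip(sample, point))
        w = math.exp(-dist2 / kernel_width ** 2)
        # Centre the features on the point so the weights are local slopes.
        row = [1.0] + [s - p for s, p in zip(sample, point)]
        for i in range(dim):
            xtwy[i] += w * row[i] * y
            for j in range(dim):
                xtwx[i][j] += w * row[i] * row[j]
    return solve_linear_system(xtwx, xtwy)


coefs = explain_locally(black_box, point=[2.0, 0.0])
for name, weight in zip(["intercept", "x1", "x2"], coefs):
    print(f"{name}: {weight:.2f}")
```

The fitted weights approximate the slopes of the black box near the chosen point, so a feature with a larger absolute weight matters more to that one prediction. Explaining a different point gives different weights, which is what makes the explanation local.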
  56. 56. https://github.com/jphall663/awesome-machine-learning-interpretability
  57. 57. Agenda (slide 2 repeated). 8:00 – 8:30 pm: Regression Example, examplesregression_...Rmd
  58. 58. Learning from Boston Housing Data. Features (no. of rooms, crime rate, age …) → H2O Machine Learning (learn the pattern) → Predictions of Median House Value (medv)
  59. 59. 🍕🍕🍕 Pizzas & Drinks, 8:30 – 9:00 pm, sponsored by H2O.ai. @h2oai @matlabulous #AutoML #LIME
  60. 60. Agenda (slide 2 repeated). 9:00 – 9:20 pm: Classification Example, examplesclassification_...Rmd
  61. 61. Learning from Diabetes Data. Features (pregnant, age, glucose …) → H2O Machine Learning (learn the pattern) → Predictions (positive / negative)
  63. 63. Why?
     • Most users/organizations can benefit from automatic machine learning pipelines.
     • Eliminate time wasted on human errors, debugging etc.
     • Model interpretation is crucial for those who must explain their models to regulators or customers.
     You will learn …
     • How to build high-quality H2O models (almost) automatically.
     • How to explain predictions from complex H2O models with LIME.
     • Bonus: A real use-case that led to multimillion-dollar baseball decisions earlier this year.
  65. 65. SlideShare Link Podcast Link
  67. 67. H2O Products (slide 15 repeated), highlighting: lightning-fast machine learning on GPUs
  68. 68. Algorithms on H2O-3 (CPU) (algorithm overview of slide 34 repeated). “Confidential and property of H2O.ai. All rights reserved”
  69. 69. Algorithms on H2O4GPU (more to come) (algorithm overview of slide 34 repeated)
  70. 70. https://github.com/h2oai/h2o4gpu
  71. 71. blog.h2o.ai
  72. 72. Coming Up
     Date | City | Talk / Workshop
     25 June | Milan | H2O + LIME Workshop
     28 June | Rome | H2O Intro + Use Cases
     Mid-July | London | Next London Meetup (T.B.C.)
     Mid-Oct | London | H2O World London
     13-14 Nov | London | Big Data LDN Keynote by Sri (CEO) + Meetup
  73. 73. Thanks!
     • Organizers & Sponsors
     • Code, Slides & Documents: bit.ly/h2o_meetups, bit.ly/joe_eRum_2018, docs.h2o.ai
     • Contact: joe@h2o.ai, @matlabulous, github.com/woobe
     • Please search/ask questions on Stack Overflow, using the tag `h2o` (not h2 zero)
  74. 74. Appendix
  75. 75. Supported Formats & Data Sources (file type or folder of files)
     • 9 Formats: CSV, XLS, XLSX, ORC (*1), Hive (*2), SVMLight, ARFF, Parquet, Avro 1.8.0 (*3)
     • 5 Sources: HDFS, S3, NFS, LOCAL, SQL
     • *1: only if H2O is running as a Hadoop job. *2: Hive files that are saved in ORC format. *3: without multi-file parsing or column type modification
  76. 76. Distributed Algorithms: Foundation for Distributed Algorithms
     Advantageous Foundation
     • Foundation for in-memory distributed algorithm calculation: Distributed Data Frames and columnar compression
     • All algorithms are distributed in H2O: GBM, GLM, DRF, Deep Learning and more. Fine-grained map-reduce iterations.
     • Only enterprise-grade, open-source distributed algorithms in the market
     User Benefits
     • “Out-of-box” functionalities for all algorithms (NO MORE SCRIPTING) and uniform interface across all languages: R, Python, Java
     • Designed for all sizes of data sets, especially large data
     • Highly optimized Java code for model exports
     • In-house expertise for all algorithms
     Figures: Parallel Parse into Distributed Rows; Fine Grain Map Reduce Illustration: Scalable Distributed Histogram Calculation for GBM
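The "distributed histogram" illustration named on this slide can be sketched as a map step and a reduce step. The code below is a single-process Python stand-in meant only to show the idea. H2O's actual engine is written in Java and is not shown in the slides, and the function names, bin range and toy data here are all hypothetical.

```python
from functools import reduce


def local_histogram(chunk, lo, hi, n_bins):
    """Map step: one worker bins only the rows it holds."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for value in chunk:
        # Clamp so that value == hi falls into the last bin.
        idx = min(int((value - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def merge_histograms(left, right):
    """Reduce step: histograms over the same bins add element-wise."""
    return [a + b for a, b in zip(left, right)]


# A toy column, split into chunks as if spread across three nodes.
column = [0.5, 1.5, 2.5, 3.5, 0.1, 0.2, 3.9, 2.2, 1.1, 4.0]
chunks = [column[0:4], column[4:7], column[7:10]]

partials = [local_histogram(c, lo=0.0, hi=4.0, n_bins=4) for c in chunks]
global_hist = reduce(merge_histograms, partials)
print(global_hist)
```

Because the merge is a plain element-wise sum, only the small per-chunk count vectors need to travel between nodes, never the rows themselves. That is the property that lets this kind of computation scale with the number of nodes.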
  77. 77. Demo: H2O on a 320-Core Hadoop Cluster (Web Interface)
  78. 78. https://www.kaggle.com/c/higgs-boson
  79. 79. Learning from Higgs Boson Machine Data. Sensors (detector) data + historical outcome (is it a Higgs particle, yes/no) → learn the pattern → predicted outcome. 11M rows, 28 features, raw data size: 7.48 GB
  80. 80. 11M rows. Size (raw): 7.48 GB. Compressed: 2.00 GB (≈ 27% of raw)
  81. 81. 10 nodes: 10 x 32 = 320 cores, 10 x 29.6 = 296 GB memory
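The demo figures quoted on the last few slides can be checked with simple arithmetic. The inputs below are taken from the slides. The final "share of cluster memory" figure is derived here for illustration and does not appear in the deck.

```python
raw_gb = 7.48              # raw data size from the slides
compressed_gb = 2.00       # in-memory size after compression
nodes = 10
cores_per_node = 32
memory_per_node_gb = 29.6

compression_ratio = compressed_gb / raw_gb
total_cores = nodes * cores_per_node
total_memory_gb = nodes * memory_per_node_gb
# Derived for illustration only: share of cluster memory taken by the data.
memory_share = compressed_gb / total_memory_gb

print(f"compressed / raw: {compression_ratio:.1%}")
print(f"cores: {total_cores}, memory: {total_memory_gb:.0f} GB")
print(f"data uses {memory_share:.2%} of cluster memory")
```

The compressed data takes up only a small fraction of the cluster's memory, which leaves most of it free for model building.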
  82. 82. H2O Water Meter (CPU Monitor): 10 x 32 = 320 cores
