Scalable and Automatic Machine Learning with H2O

H2O is widely used for machine learning projects. A TechCrunch article, published in January 2017 by John Mannes, reported that around 20% of Fortune 500 companies use H2O.

Talk 1: Introduction to Scalable & Automatic Machine Learning with H2O

In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models.

In this presentation, Joe will introduce the AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface that automates the training of a large, comprehensive selection of candidate models, plus a stacked ensemble model that, in most cases, will be the top-performing model on the AutoML Leaderboard.
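
The AutoML workflow described in this abstract can be sketched with H2O's Python client. This is a minimal sketch, not the code from the talk (whose demo uses the R interface). It assumes the `h2o` package and a Java runtime are installed; the helper names `predictor_columns` and `run_automl`, and the `max_models` and `seed` values, are illustrative choices.

```python
def predictor_columns(columns, target):
    """Treat every column except the target as a predictor."""
    return [c for c in columns if c != target]


def run_automl(csv_path, target, max_models=10, seed=1):
    """Train AutoML candidate models and return the leaderboard."""
    # Imported inside the function so this file still loads
    # on a machine where h2o is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster
    frame = h2o.import_file(csv_path)
    x = predictor_columns(frame.columns, target)

    # max_models caps the number of candidate models (assumed setting).
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=x, y=target, training_frame=frame)

    # The leaderboard is an H2OFrame ranking the trained models.
    return aml.leaderboard
```

Calling `run_automl("housing.csv", "medv")` on a hypothetical CSV file with a numeric `medv` column would return the leaderboard for a regression problem similar to the Boston Housing demo.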

Talk 2: Making Multimillion-dollar Baseball Decisions with H2O AutoML and Shiny

Joe recently teamed up with IBM and Aginity to create a proof of concept "Moneyball" app for the IBM Think conference in Vegas. The original goal was to prove that different tools (e.g. H2O, Aginity AMP, IBM Data Science Experience, R and Shiny) could work together seamlessly for common business use-cases. Little did Joe know, the app would be used by Ari Kaplan (the real "Moneyball" guy) to validate the future performance of some baseball players. Ari recommended one player to a Major League Baseball team. The player was signed the next day with a multimillion-dollar contract. This talk is about Joe's journey to a real "Moneyball" application.

Bio: Jo-fai (or Joe) Chow is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media in the UK, where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist, promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD research engineer at STREAM Industrial Doctorate Centre, working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specializing in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.

Published in: Technology

  1. 1. Scalable and Automatic Machine Learning with H2O: Introduction, demos and a real-world use case • Jo-fai (Joe) Chow, Data Scientist / Community Manager • joe@h2o.ai • @matlabulous
  2. 2. Agenda • Talk 1: Introduction to H2O • Company and People • H2O Open Source ML Platform • Demos • H2O on Hadoop (320 Cores) • AutoML • Other News • Talk 2: Moneyball • From a proof-of-concept project to a multimillion-dollar contract
  3. 3. Company Overview • Founded: 2012, Series C in Nov 2017 • Products: Driverless AI (Automated Machine Learning), H2O Open Source Machine Learning, Sparkling Water • Mission: Democratize AI. Do Good • Team: ~100 employees, Distributed Systems Engineers doing Machine Learning, World-class visualization designers • Offices: Mountain View, London, Prague
  4. 4. Our Mission: Make Machine Learning Accessible to Everyone
  5. 5. Scientific Advisory Council
  6. 6. H2O Team
  7. 7. H2O Team • Arno Candel, CTO • Fortune’s 2014 Big Data All-Star • Sri Ambati, Co-founder & CEO
  8. 8. H2O Team • Origin of R Package `ggplot2`
  9. 9. H2O Team • Matt Dowle
  10. 10. H2O Team • Erin LeDell, Chief ML Scientist • Women in ML/DS & R-Ladies Global
  11. 11. H2O Team • Their Highest Rank in Kaggle (about 80,000 competitors): 1st, 4th, 25th, 48th, 33rd
  12. 12. H2O Team • 1st, 4th, 25th, 48th, 33rd, 181st • Trying to get closer to them at some point …
  13. 13. H2O Team • Joe, Avni, Priya • Bonsoir!
  14. 14. H2O Team • H2O Team in UK • Feb 2016 - Present • June 2017 - Present
  15. 15. Joe’s Roles at H2O.ai • Data Scientist / Sales Engineer / Speaker / Meetup Organiser / Community Evangelist (on paper) • Unofficial Photographer of H2O.ai SWAG (the travelling data scientist) • H2O.ai SWAG EMEA Distributor (please help yourself)
  16. 16. Joe’s Real Job at H2O.ai • Reminder: #360Selfie
  17. 17. H2O Products • In-Memory, Distributed Machine Learning Algorithms with H2O Flow GUI • H2O AI Open Source Engine Integration with Spark • Lightning Fast machine learning on GPUs • Automatic feature engineering, machine learning and interpretability • Secure multi-tenant H2O clusters
  18. 18. Worldwide Community Adoption (* data from Google Analytics embedded in the end-user product)
  19. 19. CONFIDENTIAL • Gartner names H2O as Leader with the most completeness of vision • H2O.ai recognized as a technology leader with most completeness of vision • H2O.ai was recognized for the mindshare, partner network and status as a quasi-industry standard for machine learning and AI. • H2O customers gave the highest overall score among all the vendors for sales relationship and account management, customer support (onboarding, troubleshooting, etc.) and overall service and support.
  20. 20. Platforms with H2O integration • H2O + KNIME • Talk at KNIME Summit, Mar 2017
  21. 21. H2O.ai Solution Leadership Across Verticals • Financial • Insurance • Marketing • Telecom • Healthcare • Retail • Advisory & Accounting
  22. 22. Community Expansion Find out more: www.h2o.ai/community/
  24. 24. H2O Products (same list as slide 17) • Highlighted: In-Memory, Distributed Machine Learning Algorithms with H2O Flow GUI
  25. 25. High Level Architecture • Data Sources: HDFS, S3, NFS, Local, SQL • Load Data: Distributed In-Memory, Loss-less Compression • H2O Compute Engine: Exploratory & Descriptive Analysis, Feature Engineering & Selection, Supervised & Unsupervised Modeling, Model Evaluation & Selection, Predict • Data & Model Storage • Model Export: Plain Old Java Object • Data Prep Export: Plain Old Java Object • Production Scoring Environment • Your Imagination
  26. 26. High Level Architecture (same diagram as slide 25) • Highlighted: Import Data from Multiple Sources
  27. 27. Supported Formats & Data Sources • 9 Formats (file type or folder of files): CSV, XLS, XLSX, ORC*, Hive*, SVMLight, ARFF, Parquet, Avro 1.8.0* • 5 Sources: HDFS, S3, NFS, LOCAL, SQL • Notes: * 1. only if H2O is running as a Hadoop job; * 2. Hive files that are saved in ORC format; * 3. without multi-file parsing or column type modification
  28. 28. High Level Architecture (same diagram as slide 25) • Highlighted: Fast, Scalable & Distributed Compute Engine Written in Java
  29. 29. H2O Core CPU Model Building H2O
  30. 30. H2O Core H2O H2O H2O
  31. 31. H2O Core CPU CPU CPU Model Building H2O Distributed In-Memory
  32. 32. H2O Core YARN CPU CPU CPU Model Building H2O Distributed In-Memory SQL NFS S3 Firewall or Cloud
  33. 33. Distributed Algorithms: Foundation for Distributed Algorithms • Advantageous Foundation: Foundation for In-Memory Distributed Algorithm Calculation (Distributed Data Frames and columnar compression); All algorithms are distributed in H2O: GBM, GLM, DRF, Deep Learning and more, with fine-grained map-reduce iterations; Only enterprise-grade, open-source distributed algorithms in the market • User Benefits: “Out-of-box” functionalities for all algorithms (NO MORE SCRIPTING) and uniform interface across all languages: R, Python, Java; Designed for all sizes of data sets, especially large data; Highly optimized Java code for model exports; In-house expertise for all algorithms • Figures: Parallel Parse into Distributed Rows; Fine Grain Map Reduce Illustration: Scalable Distributed Histogram Calculation for GBM
  34. 34. H2O-3 Algorithms Overview • Supervised Learning • Statistical Analysis: Generalized Linear Models (Binomial, Gaussian, Gamma, Poisson and Tweedie); Naïve Bayes • Ensembles: Distributed Random Forest (classification or regression models); Gradient Boosting Machine (produces an ensemble of decision trees with increasingly refined approximations) • Deep Neural Networks: Deep learning (create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations) • Unsupervised Learning • Clustering: K-means (partitions observations into k clusters/groups of the same spatial size; automatically detect optimal k) • Dimensionality Reduction: Principal Component Analysis (linearly transforms correlated variables to independent components); Generalized Low Rank Models (extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data) • Anomaly Detection: Autoencoders (find outliers using nonlinear dimensionality reduction with deep learning)
  35. 35. High Level Architecture (same diagram as slide 25) • Highlighted: Multiple Interfaces
  36. 36. H2O Flow (Web): First Demo
  37. 37. Using H2O with R and Python: Second Demo
  38. 38. High Level Architecture (same diagram as slide 25) • Highlighted: Export Standalone Models for Production
  39. 39. URL: docs.h2o.ai
  40. 40. Demo: H2O on a 320-Core Hadoop Cluster (Web Interface)
  41. 41. https://www.kaggle.com/c/higgs-boson
  42. 42. Learning from Higgs Boson Machine Data • Sensors (Detector) Data • Historical Outcome: Is it a Higgs Particle (Yes/No) • Learn the Pattern • Predicted Outcome • 11M Rows, 28 Features, Raw Data Size: 7.48 GB
  43. 43. 11M Rows • Size (Raw): 7.48 GB • Compressed: 2.00 GB (≈ 27% of Raw)
  44. 44. 10 nodes • 10 x 32 = 320 Cores • 10 x 29.6 = 296 GB Memory
  45. 45. H2O Water Meter (CPU Monitor) • 10 x 32 = 320 Cores
  46. 46. Demo: AutoML • Automatic Machine Learning with H2O (R Interface)
  47. 47. Think 2018 / 3456 / March, 2018 / © 2018 IBM Corporation
  48. 48. AutoML
  52. 52. Learning from Boston Housing Data • Crime, No. of rooms, Age … • Historical House Price • H2O AutoML: Learn the Pattern • Predicted House Price
  60. 60. Other H2O News • Latest Developments • Events
  61. 61. H2O Products (same list as slide 17) • Highlighted: Lightning Fast machine learning on GPUs
  62. 62. Algorithms on H2O-3 (CPU) • Slide text repeats the algorithm list from slide 34 • “Confidential and property of H2O.ai. All rights reserved”
  63. 63. Algorithms on H2O4GPU (more to come) • Slide text repeats the algorithm list from slide 34
  64. 64. https://github.com/h2oai/h2o4gpu
  66. 66. End of First Talk • Any Questions?
  67. 67. Making Multimillion-Dollar Decisions with H2O AutoML, LIME and Shiny • My journey to a real Moneyball application
  68. 68. About Moneyball • The first rule of Moneyball: You do not ask me about the names of the team and player involved. • The second rule of Moneyball: You do not ask me about the names of the team and player involved. (… for legal reasons …) • The third rule of Moneyball: If you happen to guess the names right, I can neither confirm nor deny.
  69. 69. About Moneyball • Billy Beane • Peter Brand (based on Paul DePodesta)
  70. 70. Ari Kaplan: the Real “Moneyball” Guy • The real characters in the movie (Billy Beane and Paul DePodesta) did not want to work with Hollywood. • The filmmaker interviewed Ari instead and created the Paul DePodesta character based on Ari’s real-life story. • Ari happens to work at Aginity so we have a real “Moneyball” guy for this demo.
  71. 71. A Proof-of-Concept Demo for IBM Think Conference
  72. 72. Enterprise Solution • The Workflow: 1. Data loaded into the databases 2. Connected diverse data sources to Amp 3. Amp used to create derived attributes and publish them and data to DSX and H2O 4. DSX and H2O to build and tweak statistical and machine learning models 5. Visualizations tested in Immersive Insights 6. Steps 4 and 5 repeated to get settled data 7. Statistical and machine learning models saved in Amp 8. Data exported to Immersive Insights for final visualizations • The Architecture: DB2; Machine Learning & AI Libraries; Data Science Modeling Tools; Augmented Reality Visualization; Analytic Management and Reuse Layer; Hadoop Data Environment; High-performing Database for Analytics
  73. 73. Approach One: Learning from Lahman only • Lahman: Age, Height, Weight … • Historical Performance Stats: Home Runs, Batting Average … • Sliding Windows (Stats from previous n years) • About 300 Lahman Features • H2O AutoML: Learn the Pattern • Predictions
  74. 74. Approach Two: Learning from Lahman & AriDB • Lahman: Age, Height, Weight … • AriDB: Fastball, curveball, slider, velocity … • Historical Performance Stats: Home Runs, Batting Average … • Sliding Windows (Stats from previous n years) • About 300 Lahman Features + 200 AriDB Features • H2O AutoML: Learn the Pattern • Predictions
  75. 75. Timeline • March 19: AutoML Predictions finalized. Initial presentation in Excel. • March 20: Version 1 of Shiny app. Ari used the app to validate some players he had in mind and recommended one player to his team. • March 21: Multimillion-dollar contract finalized. • March 22: Moneyball presentation at IBM Think
  76. 76. Shiny App • Presentation • Green: Predictions based on Lahman only • Orange: Predictions based on AriDB + Lahman
  77. 77. Acknowledgement
  78. 78. Merci beaucoup! • Organisers & Sponsors: Alexia Audevart, Christophe Regouby, HarryCow Coworking • H2O’s Mission: Democratize AI; Make Machine Learning Accessible to Everyone • Code, Slides & Documents: bit.ly/h2o_meetups, docs.h2o.ai • Contact: joe@h2o.ai, @matlabulous, github.com/woobe • Please search/ask questions on Stack Overflow: use the tag `h2o` (not h2 zero)
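
Slide 33 mentions a scalable distributed histogram calculation for GBM built on fine-grained map-reduce. The idea can be sketched on a single machine as follows. This is a simplified illustration, not H2O's implementation: it counts values per bin (a real GBM would typically accumulate gradient statistics), and all function names, the chunking scheme and the equal-width bins are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Split the data into chunks, standing in for data held on different nodes."""
    if not values:
        return []
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def local_histogram(chunk, lo, hi, n_bins):
    """Map step: each chunk builds its own histogram independently."""
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in chunk:
        # Clamp so that v == hi falls into the last bin.
        bin_index = min(int((v - lo) / width), n_bins - 1)
        counts[bin_index] += 1
    return counts


def merge_histograms(a, b):
    """Reduce step: partial histograms are combined by summing bin counts."""
    return a + b


def distributed_histogram(values, n_chunks, lo, hi, n_bins):
    partials = [local_histogram(c, lo, hi, n_bins)
                for c in split_into_chunks(values, n_chunks)]
    total = reduce(merge_histograms, partials, Counter())
    return [total[b] for b in range(n_bins)]
```

Because the merge is a plain sum, the result does not depend on how the rows are split across chunks, which is what lets the map step run in parallel.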

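
Slides 73 and 74 describe sliding-window features built from a player's stats in the previous n years. A minimal sketch of that feature construction is shown below. It is not the code used in the Moneyball app: the function name, the `_lag` naming scheme and the choice to skip players with incomplete history are assumptions, and the numbers in the usage example are made up for illustration.

```python
def sliding_window_features(stats_by_year, target_year, n):
    """Build one feature row for target_year from the previous n seasons.

    stats_by_year maps a season (int) to a dict of stat name -> value.
    Returns None when any of the previous n seasons is missing.
    """
    features = {}
    for lag in range(1, n + 1):
        season = stats_by_year.get(target_year - lag)
        if season is None:
            return None  # not enough history for this window
        for name, value in season.items():
            # e.g. "HR_lag1" is the home-run count one season back
            features[f"{name}_lag{lag}"] = value
    return features


# Made-up stats for a hypothetical player.
example_stats = {
    2015: {"HR": 20, "AVG": 0.28},
    2016: {"HR": 25, "AVG": 0.30},
}
row = sliding_window_features(example_stats, 2017, 2)
```

Each row built this way can be paired with the stat actually observed in the target year, giving a table that a tool such as AutoML can learn from.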