
Dive into H2O: NYC


This session took place in New York City on November 4th, 2019.

Speaker Bio:
Chemere is a Senior Data Science Training Specialist. She holds a Master's in Business Administration with a focus in Marketing Analytics from the University of North Carolina at Charlotte, and is an experienced data scientist with a diverse background in transformational decision-making across industries including banking, manufacturing, logistics, and medical devices. Chemere joins us from Venus Concept/2two5, where she was the Lead Data Scientist, building predictive models on Internet of Things (IoT) data and for a subscription-based marketing product for B2B customers. Before that, she was a Senior Data Scientist at Wells Fargo Bank, working on applied predictive analytics solutions.

More details about the event can be found here:



  1. Introduction to Driverless AI (Chemere Davis)
  2. Please Create an Account on Aquarium
  3. Please Sign Into Aquarium
  4. Product Suite
     • H2O (open source): in-memory, distributed machine learning algorithms with the H2O Flow GUI. 100% open source (Apache V2 licensed). Built for data scientists, with interfaces in R, Python, Scala, and H2O Flow (an interactive notebook interface). Enterprise support subscriptions available.
     • Driverless AI (enterprise software): automatic feature engineering, machine learning, and interpretability. Built for domain users, analysts, and data scientists, with a GUI-based interface for end-to-end data science. Fully automated machine learning from ingest to deployment. User licenses on a per-seat basis (annual subscription).
     • H2O open source engine integration with Spark.
     • Lightning-fast machine learning on GPUs.
  5. The Workflow of Driverless AI
     1. Drag and Drop Data: ingest the modelling dataset from SQL, HDFS, Amazon S3, Google BigQuery, Azure Blob Storage, or Snowflake.
     2. Automatic Visualization.
     3. Bring Your Own Recipes: upload your own recipes (transformations, algorithms, scorers); Driverless AI executes its automation on them.
     4. Automatic Model Optimization: advanced feature engineering + algorithm + model tuning ("survival of the fittest"), covering feature engineering, model selection, hyper-parameter tuning, and overfitting protection. Model recipes handle i.i.d. data, time-series, and more on the way.
     5. Automatic Scoring Pipelines: deploy low-latency scoring to production, with automatic model documentation.
  6. Driverless AI: Supervised Learning
     • Regression: How much will a customer spend?
     • Classification: Will a customer make a purchase? (Yes or No)
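The distinction on this slide can be sketched in a few lines of Python: the same feature matrix X supports either a numeric target (regression) or a categorical target (classification). The data and the `infer_task` helper below are made up for illustration; Driverless AI itself detects the problem type from the target column you select.

```python
# Hypothetical customer records: the same features (X) can feed either task.
X = [[25, 1200], [34, 560], [45, 3100], [29, 80]]  # [age, site_minutes]

y_regression = [310.0, 95.5, 870.0, 12.0]      # spend in dollars -> regression
y_classification = ["yes", "no", "yes", "no"]  # made a purchase  -> classification

def infer_task(target):
    """Mimic the type check an AutoML tool might run on the target column."""
    return "regression" if all(isinstance(v, (int, float)) for v in target) else "classification"

print(infer_task(y_regression))      # regression
print(infer_task(y_classification))  # classification
```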
  7. Typical Enterprise ML Workflow: data integration → data quality and transformation → modeling table (features + target) → model building → model
  8. Driverless AI Modeling
     • Data types: numeric, categorical, time/date, text; missing values allowed.
     • Model types: regression; classification (binary, multinomial).
     • Build process: feature engineering (including NLP for text) and automated hyperparameter tuning.
     • Both i.i.d. and time series: single or grouped series, with a gap allowed between the last observation and the prediction.
  9. How Well Does Driverless AI Work?
  10. Top 10 Finish in BNP Kaggle Competition: a single, fully automated run (2h on a DGX Station, 6h on a PC) took 10th place on the private leaderboard at Kaggle (out of 2,926).
  11. Top 5% in the Amazon Kaggle competition.
  12. Other Kaggle Competitions: Driverless AI Results. [Chart: relative error (lower is better) on Allstate, BNP Paribas, Amazon, Homesite, and Otto Group, comparing the Kaggle Grandmaster best, Driverless AI, and a GBM baseline.]
  13. Credit Card Example
  14. Credit Card Payment Default
     • Dataset: from a lender in Taiwan (April – August 2005), with information on default payments, demographic factors, credit data, history of payment, etc. Source: UCI Machine Learning Library.
     • Our goal: predict whether someone will default on their next credit card payment.
  15. The Data
      Column               Description
      ID                   ID of each customer
      Default              Defaulted on next payment (1 = yes, 0 = no)
      CreditLimit          Credit limit in NT dollars
      Sex                  Gender (M, F)
      Education            1: graduate school, 2: university, 3: high school, 4: others, 5-6: unknown
      Marriage             Marital status (M, S, D, O)
      Age                  Age in years
      Status1 … Status6    Repayment status, September 2005 back to April 2005
      BillAmt1 … BillAmt6  Amount of bill statement (NT dollars), September 2005 back to April 2005
      PayAmt1 … PayAmt6    Amount of previous payment (NT dollars), September 2005 back to April 2005
  16. Payment History Data
      1 month ago:  Status1 (≤0, 1), BillAmt1, PayAmt1
      2 months ago: Status2 (≤0, 1, 2), BillAmt2, PayAmt2
      3 months ago: Status3 (≤0, 1, 2, 3), BillAmt3, PayAmt3
      ...
      6 months ago: Status6 (≤0, 1, ..., 6), BillAmt6, PayAmt6
      Status codes: -2 = no balance; -1 = paid in full; 0 = minimum balance paid; 1 = one month late; 2 = two months late; etc.
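The kind of feature an automated tool might engineer from these monthly status codes can be sketched by hand. The `delinquency_features` helper below is hypothetical (not part of Driverless AI), but the status codes follow the slide: -2 no balance, -1 paid in full, 0 minimum paid, k ≥ 1 means k months late.

```python
def delinquency_features(statuses):
    """Summarize six monthly repayment-status codes (Status1..Status6).

    Codes: -2 = no balance, -1 = paid in full, 0 = minimum paid,
           k >= 1 = k months late. Status1 is the most recent month.
    """
    late = [s for s in statuses if s >= 1]
    return {
        "months_late": len(late),                 # how often the customer was late
        "max_delinquency": max(late, default=0),  # worst delay observed
        "currently_late": statuses[0] >= 1,       # latest month is delinquent
    }

# Customer who was late twice, at worst two months behind, late right now:
print(delinquency_features([1, 2, 0, -1, -1, -2]))
```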
  17. Automatic Visualizations
  18. Automatic Visualization (AutoViz)
  19. Automatic Visualizations: scalable outlier detection, with novel statistical algorithms that show only the "relevant" aspects of the data (coming soon: automated data cleaning).
  20. Machine Learning Experimentation
  21. Experiment Settings: three key settings: Accuracy, Time, and Interpretability.
  22. Experiment Settings
     • Accuracy: relative accuracy; higher values should lead to higher confidence in model performance. Affects the level of data sampling, the number of models in the final ensemble, the parameter tuning level, and more.
     • Time: relative time for completing the experiment. Higher settings mean more iterations are performed to find the best set of features, and a longer early-stopping threshold.
     • Interpretability: relative interpretability; higher values favor more interpretable models, lowering the complexity of the engineered features and of the final model(s).
  23. Accuracy
      Accuracy | Max Rows x Cols | Ensemble Level | Target Transformation | Parameter Tuning Level | Num Folds | Only First Fold Model | Distribution Check
      1        | 100K | 0   | False | 0 | 3    | True | No
      2        | 1M   | 0   | False | 0 | 3    | True | No
      3        | 50M  | 0   | True  | 1 | 3    | True | No
      4        | 100M | 0   | True  | 1 | 3-4  | True | No
      5        | 200M | 1   | True  | 1 | 3-4  | True | Yes
      6        | 500M | 2   | True  | 1 | 3-5  | True | Yes
      7        | 750M | <=3 | True  | 2 | 3-10 | Auto | Yes
      8        | 1B   | <=3 | True  | 2 | 4-10 | Auto | Yes
      9        | 2B   | <=3 | True  | 3 | 4-10 | Auto | Yes
      10       | 10B  | <=4 | True  | 3 | 4-10 | Auto | Yes
  24. Time
      Time | Iterations | Early Stopping Rounds
      1    | 1-5  | None
      2    | 10   | 5
      3    | 30   | 5
      4    | 40   | 5
      5    | 50   | 10
      6    | 100  | 10
      7    | 150  | 15
      8    | 200  | 20
      9    | 300  | 30
      10   | 500  | 50
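The "Early Stopping Rounds" column can be read as patience: the search stops once the validation score has not improved for that many consecutive iterations. A minimal sketch of that logic, with fabricated scores (this is an illustration of the general early-stopping idea, not Driverless AI's internal code):

```python
def run_with_early_stopping(scores, max_iterations, patience):
    """Return the number of iterations actually run.

    Stops early once `patience` consecutive iterations fail to beat the
    best validation score seen so far (higher = better).
    """
    best, since_best = float("-inf"), 0
    for i, score in enumerate(scores[:max_iterations], start=1):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i  # stopped early
    return min(max_iterations, len(scores))

# Time setting 2 from the table: up to 10 iterations, patience of 5.
# The best score (0.75) arrives at iteration 3; five non-improving
# iterations later the run stops at iteration 8.
scores = [0.70, 0.74, 0.75, 0.75, 0.74, 0.75, 0.74, 0.73, 0.76, 0.77]
print(run_with_early_stopping(scores, max_iterations=10, patience=5))  # 8
```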
  25. Interpretability
      Interpretability | Ensemble Level | Target Transformation | Feature Engineering | Feature Pre-Pruning | Monotonicity Constraints
      1-3  | <=3 | -                       | -                                                                        | None              | Disabled
      4    | <=3 | Inverse                 | -                                                                        | None              | Disabled
      5    | <=3 | Anscombe                | Clustering (ID, distance); Truncated SVD                                 | None              | Disabled
      6    | <=2 | Logit; Sigmoid          | -                                                                        | Feature selection | Disabled
      7    | <=2 | -                       | Frequency Encoding                                                       | Feature selection | Enabled
      8    | <=1 | 4th Root                | -                                                                        | Feature selection | Enabled
      9    | <=1 | Square; Square Root     | Bulk Interactions (add, subtract, multiply, divide); Weight of Evidence  | Feature selection | Enabled
      10   | 0   | Identity; Unit Box; Log | Date Decompositions; Number Encoding; Target Encoding; Text (TF-IDF, Frequency) | Feature selection | Enabled
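One of the transformations named in the table, frequency encoding, replaces each category with how often it occurs, turning a high-cardinality categorical column into a single numeric one. A minimal sketch (the `Marriage`-style example data is made up):

```python
from collections import Counter

def frequency_encode(column):
    """Replace each categorical value with its count in the column."""
    counts = Counter(column)
    return [counts[v] for v in column]

# Marital-status style column: M appears 3 times, S twice, D once.
marriage = ["M", "S", "M", "M", "D", "S"]
print(frequency_encode(marriage))  # [3, 2, 3, 3, 1, 2]
```

In practice the counts would be learned on the training fold only and applied to validation data, to avoid leaking information about the holdout rows.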
  26. Scoring Options: separate scorers for classification and regression; precision and recall are best for imbalanced data.
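Precision and recall suit imbalanced data because, unlike accuracy, they are not dominated by the flood of true negatives. A quick sketch with made-up, heavily imbalanced labels (1 = default, the rare positive class):

```python
def precision_recall(actual, predicted, positive=1):
    """Precision and recall for one positive class (higher = better)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 20 customers, only 4 defaults; the model catches 2 of them.
actual    = [0] * 16 + [1] * 4
predicted = [0] * 15 + [1] + [1, 1] + [0, 0]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)                              # 0.85 -- looks fine, yet half the defaults are missed
print(precision_recall(actual, predicted))   # precision 2/3, recall 0.5 tell the real story
```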
  27. Driverless AI - Machine Learning Interpretability: gain confidence in models before deploying them!
  28. Why is Machine Learning Interpretability Difficult?
      Linear models: for a given well-understood dataset there is usually one best model.
      Machine learning: for a given well-understood dataset there are usually many good models. This is often referred to as "the multiplicity of good models."
      -- Leo Breiman, "Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)," Statistical Science, 2001.
  29. Interpretability
      • Complexity of learned functions: linear, monotonic; nonlinear, monotonic; nonlinear, non-monotonic.
      • Scope of interpretability: global vs. local.
      • Application domain.
      • Understanding: model-agnostic vs. model-specific.
      • Trust: enhancing trust and understanding means the mechanisms and results of an interpretable model should be both transparent AND dependable.
  30. Global and Local Interpretability
      • Linear models: exact explanations for approximate models.
      • Machine learning: approximate explanations for exact models.
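The "approximate explanations for exact models" idea can be illustrated with a local linear surrogate: fit a line to a complex model's predictions near one point and read the slope as a local explanation. This is the intuition behind techniques such as LIME; the `complex_model` and sampling scheme below are stand-ins chosen for illustration.

```python
def complex_model(x):
    return x * x  # stand-in for an opaque, nonlinear model

def local_slope(f, x0, delta=0.5):
    """Least-squares slope of f over points sampled symmetrically around x0."""
    xs = [x0 - delta, x0, x0 + delta]
    ys = [f(x) for x in xs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Near x0 = 3 the surrogate slope is 2 * x0 = 6: locally, a one-unit
# increase in x raises the prediction by about 6. The line is only an
# approximation globally, but an exact-enough explanation locally.
print(local_slope(complex_model, 3.0))  # 6.0
```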