Introduction to H2O and Model Stacking Use Cases

Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous
London Artificial Intelligence & Deep Learning @SHACK15hub
27th April, 2017

Thanks for joining us!
https://www.meetup.com/London-Artificial-Intelligence-Deep-Learning/members/
1st Official H2O Meetup in London

Our Friends in UK
• Data Science for IoT Meetup
  • Ajit Jaokar (Oxford Uni)
  • Barty Isola from La Fosse (Venue)
• London Kaggle Meetup
  • Alex Glaser, Wojtek Kostelecki & Sergiusz Bleja
• Big Data London
  • Bill Hammond
  • This year: Nov 15-16

Agenda
• Introduction
  • Company
  • Why H2O?
  • H2O Machine Learning Platform
• Model Stacking in H2O
  • Introduction / Why
  • Simple Examples
    • Regression / Binary Classification
  • Kaggle Example
    • Multi-class Classification
    • H2O + xgboost

About Me
• Civil (Water) Engineer
  • 2010 – 2015
  • Consultant (UK)
    • Utilities
    • Asset Management
    • Constrained Optimization
  • Industrial PhD (UK)
    • Infrastructure Design Optimization
    • Machine Learning + Water Engineering
    • Discovered H2O in 2014
• Data Scientist
  • 2015
    • Virgin Media (UK)
    • Domino Data Lab (Silicon Valley)
  • 2016 – Present
    • H2O.ai (Silicon Valley)

About Me – From Kaggle to H2O
R + H2O + Domino for Kaggle
Guest Blog Post for Domino & H2O (2014)
• The Long Story
• bit.ly/joe_kaggle_story

About Me – I Love DataViz
My First Data Viz & Shiny App Experience
CrimeMap (2013) Revolution Analytics’ Data Viz Contest
RUGSMAPS (2014)

Company Overview
Founded: 2011; venture-backed, debuted in 2012
Products:
• H2O Open Source In-Memory AI Prediction Engine
• Sparkling Water
• Steam
Mission: Operationalize Data Science, and provide a platform for users to build beautiful data products
Team: 70 employees
• Distributed Systems Engineers doing Machine Learning
• World-class visualization designers
Headquarters: Mountain View, CA

Scientific Advisory Council

#AroundTheWorldWithH2Oai
My first H2O talk
March 2016

Users In Various Verticals Adore H2O
Financial, Insurance, Marketing, Telecom, Healthcare

Joe (2015)
http://www.h2o.ai/gartner-magic-quadrant/

Check out our website: h2o.ai

Szilard Pafka’s ML Benchmark
https://github.com/szilard/benchm-ml
Gradient Boosting Machine Benchmark
[Figure: Time (s) and AUC versus n, where n = millions of samples]
• H2O is fastest at 10M samples
• H2O is as accurate as others at 10M samples

Szilard Pafka’s Comment
https://speakerdeck.com/szilard/machine-learning-with-h2o-dot-ai-budapest-data-science-meetup-july-2016

H2O Deep Learning in Action

H2O for Kaggle Competitions

H2O for Academic Research
http://www.sciencedirect.com/science/article/pii/S0377221716308657
https://arxiv.org/abs/1509.01199

H2O Machine Learning Platform

High Level Architecture
[Diagram: data sources (HDFS, S3, NFS, Local, SQL) → Load Data (distributed, in-memory, loss-less compression) → H2O Compute Engine (exploratory & descriptive analysis; feature engineering & selection; supervised & unsupervised modeling; model evaluation & selection; predict; data & model storage) → Data Prep Export and Model Export (Plain Old Java Object) → Production Scoring Environment / Your Imagination]

[Diagram: High Level Architecture]
Import Data from Multiple Sources

[Diagram: High Level Architecture]
Fast, Scalable & Distributed Compute Engine Written in Java

Algorithms Overview
Supervised Learning
• Statistical Analysis
  • Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
  • Naïve Bayes
• Ensembles
  • Distributed Random Forest: Classification or regression models
  • Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
• Deep Neural Networks
  • Deep learning: Creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
• Clustering
  • K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detects optimal k
• Dimensionality Reduction
  • Principal Component Analysis: Linearly transforms correlated variables to independent components
  • Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
• Anomaly Detection
  • Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning

[Diagram: High Level Architecture]
Multiple Interfaces

H2O + R
• Package ‘h2o’ from CRAN or H2O’s website
• Start a local H2O (Java Virtual Machine) cluster
• Simple ‘iris’ example

[Diagram: High Level Architecture]
Export Standalone Models for Production

https://www.slideshare.net/0xdata/stacked-ensembles-in-h2o

Stacking
• Inputs (numerical features): CV predictions from Model 1, Model 2, …, Model n
• Target (numerical or categorical labels): ground truth (real labels)
• Meta-learning: a meta-learner is trained on the CV predictions against the ground truth
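The stacking idea above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not H2O's implementation: the two base learners (a straight-line fit and a 3-nearest-neighbour average), the synthetic data, the modulo fold assignment and every function name are hypothetical choices made to keep the example self-contained. The meta-learner here is a least-squares blend of the cross-validated (CV) predictions.

```python
import math
import random

def fit_linear(xs, ys):
    """Base model 1: least-squares straight line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=3):
    """Base model 2: average target of the k nearest training points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def cv_predictions(fit, xs, ys, nfolds=5):
    """Predict every row with a model trained on the other folds only."""
    preds = [0.0] * len(xs)
    for fold in range(nfolds):
        train = [i for i in range(len(xs)) if i % nfolds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), nfolds):  # rows held out in this fold
            preds[i] = model(xs[i])
    return preds

def fit_meta(p1, p2, ys):
    """Meta-learner: weights (w1, w2) minimising sum((y - w1*p1 - w2*p2)^2)."""
    a11 = sum(u * u for u in p1)
    a12 = sum(u * v for u, v in zip(p1, p2))
    a22 = sum(v * v for v in p2)
    b1 = sum(u * y for u, y in zip(p1, ys))
    b2 = sum(v * y for v, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Synthetic data (assumption): a noisy non-linear trend.
rng = random.Random(1234)
xs = [rng.uniform(0, 6) for _ in range(200)]
ys = [math.sin(x) + 0.5 * x + rng.gauss(0, 0.2) for x in xs]

# Step 1: CV predictions from each base model become the new features.
cv_lin = cv_predictions(fit_linear, xs, ys)
cv_knn = cv_predictions(fit_knn, xs, ys)

# Step 2: the meta-learner is trained on those features against the real labels.
w_lin, w_knn = fit_meta(cv_lin, cv_knn, ys)
stacked = [w_lin * a + w_knn * b for a, b in zip(cv_lin, cv_knn)]

print("linear MSE :", round(mse(ys, cv_lin), 4))
print("kNN MSE    :", round(mse(ys, cv_knn), 4))
print("stacked MSE:", round(mse(ys, stacked), 4))
```

Note that the stacked MSE printed here is measured on the same CV predictions used to fit the weights, so it cannot be worse than either base model. A held-out test set gives the honest comparison.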

https://github.com/kaz-Anova/StackNet/blob/master/example/example_amazon/Data%20Festival%20Presentation%2024_4_2017.pdf

Regression Example
Wine Quality Dataset

Examples are based on my H2O Tutorials
• Introduction to Machine Learning with H2O and Python
  • Basic Extract, Transform and Load (ETL)
  • Supervised Learning
  • Parameter Tuning
  • Stacking
  • http://bit.ly/joe_h2o_tutorials
  • R Code Examples included
• Official H2O Tutorials
  • https://github.com/h2oai/h2o-tutorials

Improving Model Performance (Step-by-Step)
Model Settings                              MSE (CV)   MSE (Test)
GBM with default settings                   N/A        0.4551
GBM with manual settings                    N/A        0.4433
Manual settings + cross-validation          0.4502     0.4433
Manual + CV + early stopping                0.4429     0.4287
CV + early stopping + full grid search      0.4378     0.4196
CV + early stopping + random grid search    0.4227     0.4047
Stacking models from random grid search     N/A        0.3969
Lower Mean Square Error = Better Performance
For More Details: https://github.com/woobe/h2o_tutorials/tree/master/introduction_to_machine_learning
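The difference between a full and a random grid search can be sketched as follows. This is an illustration only: the search space, the parameter names (chosen to resemble common GBM settings) and the `toy_cv_mse` scoring function are all hypothetical stand-ins, and the numbers it produces are not results from the tutorial.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
search_space = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.7, 0.8, 0.9, 1.0],
}

def toy_cv_mse(params):
    """Stand-in for 'train with CV and return the CV MSE' (made-up formula)."""
    return (0.40
            + 0.010 * abs(params["max_depth"] - 7)
            + 0.500 * abs(params["learn_rate"] - 0.05)
            + 0.100 * abs(params["sample_rate"] - 0.9))

# Full grid search: every combination of every value is scored.
names = list(search_space)
full_grid = [dict(zip(names, combo))
             for combo in itertools.product(*(search_space[n] for n in names))]

# Random grid search: only a fixed budget of randomly chosen combinations is scored.
max_models = 10
random_grid = random.Random(1234).sample(full_grid, max_models)

best_full = min(full_grid, key=toy_cv_mse)
best_random = min(random_grid, key=toy_cv_mse)

print(len(full_grid), "combinations in the full grid")
print("best of full grid  :", best_full)
print("best of random grid:", best_random)
```

The random search cost stays fixed at `max_models` however many values are added to the space, which makes it practical to explore a much larger space than a full grid could cover.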

11 Numerical Features
Target

Regression Performance – MSE
Lower is better
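For reference, MSE is the average of the squared differences between actual and predicted values. A small sketch with made-up quality scores and two hypothetical models (none of these numbers come from the wine dataset):

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared prediction errors."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Made-up targets and predictions, for illustration only.
actual = [5, 6, 7, 5]
model_a = [5.5, 6.0, 6.0, 5.0]  # squared errors: 0.25, 0, 1, 0
model_b = [5.0, 6.0, 6.5, 5.5]  # squared errors: 0, 0, 0.25, 0.25

print(mse(actual, model_a))  # 0.3125
print(mse(actual, model_b))  # 0.125
```

Model B has the lower MSE, so by this metric it is the better of the two.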

Python Interface for H2O
Stacked Ensembles: Best GBM, DRF and DNN models from Random Grid Search
Lowest MSE = Best Performance

R Interface for H2O
Stacked Ensembles: Use the three models from previous steps
Lowest MSE = Best Performance

Binary Classification Example
Titanic Dataset

7 Features
Target

http://fastml.com/what-you-wanted-to-know-about-auc/

Highest AUC = Best Performance
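AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal pairwise sketch of that definition, with made-up labels and scores (this is not how H2O computes it, and the quadratic loop is only suitable for small inputs):

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = positive class, 0 = negative class.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]

# 5 of the 6 positive/negative pairs are ordered correctly.
print(auc(labels, scores))
```

A value of 1.0 means every positive outranks every negative; 0.5 is what random scores would give.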

Kaggle Example
Santander Product Recommendation

https://www.kaggle.com/c/santander-product-recommendation
Stacked Ensembles of H2O GBM (Joe) + Ensembles of xgboost (ZFTurbo)

• Predict new products that customers will add in the future
• Reframed as a Multiclass Classification problem
• Feature Engineering
  • Basic (Everyone)
  • Advanced (ZFTurbo, Yifan, Anokas)
  • Also see Yifan’s slides
• Models
  • H2O GBM (Joe) – Single Best Model
  • xgboost (ZFTurbo)

Reducing logloss by Model Stacking
https://bitbucket.org/woobe/kaggle_santander_product/src/
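Multi-class logloss is the mean negative log of the probability assigned to the true class. The sketch below computes it for two hypothetical models and for a plain average of their probabilities. Averaging is used here as a simpler stand-in for stacking, not as the method used in the competition, and all probabilities are made up.

```python
import math

def logloss(true_classes, prob_rows, eps=1e-15):
    """Multi-class logloss: mean negative log-probability of the true class."""
    total = 0.0
    for k, probs in zip(true_classes, prob_rows):
        p = min(max(probs[k], eps), 1.0)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_classes)

# Made-up probabilities for three rows and three classes.
truth = [0, 2, 1]
model_a = [[0.8, 0.1, 0.1], [0.4, 0.3, 0.3], [0.2, 0.5, 0.3]]
model_b = [[0.3, 0.4, 0.3], [0.1, 0.1, 0.8], [0.3, 0.5, 0.2]]

# Simplest possible blend: average the two models' probabilities per class.
blend = [[(a + b) / 2 for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(model_a, model_b)]

ll_a = logloss(truth, model_a)
ll_b = logloss(truth, model_b)
ll_blend = logloss(truth, blend)
print(round(ll_a, 4), round(ll_b, 4), round(ll_blend, 4))
```

In this toy case each model is confident on a row where the other is not, so the blend scores a lower logloss than either model alone. Whether that happens on real data depends on how different the models' errors are.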

Extract CV Predictions
• Inputs (numerical features): CV predictions from Model 1, Model 2, …, Model n
• Target (categorical labels): ground truth (real labels)
https://bitbucket.org/woobe/kaggle_santander_product/src/

Kaggle Example
Higgs (Small Version)

https://github.com/h2oai/h2o-tutorials/tree/master/tutorials/ensembles-stacking

Higher AUC =
Better Performance

Model Stacking in H2O
• Stacking made easy
• Laborious process automated
• Works in both R and Python
• Works with current and new algorithms in H2O
  • xgboost
  • Deep Water (MXNet, TensorFlow & Caffe)
  • … and more!
• Related Talk
  • www.slideshare.net/0xdata/stacked-ensembles-in-h2o
• Learning Resources
  • github.com/h2oai/h2o-tutorials/tree/master/tutorials/ensembles-stacking
  • bit.ly/joe_h2o_tutorials

H2O Supports Local Data Science Community
https://www.meetup.com/London-Kaggle-Meetup/
https://www.meetup.com/Women-in-Kaggle/

Thanks for joining us!
Next H2O Meetup: June 20 (T.B.C.)
https://www.meetup.com/London-Artificial-Intelligence-Deep-Learning/members/

Thanks!
• Code, Slides & Documents
  • bit.ly/h2o_meetups
  • docs.h2o.ai
• Contact
  • joe@h2o.ai
  • @matlabulous
  • github.com/woobe
• Please search/ask questions on Stack Overflow
  • Use the tag `h2o` (not H2 zero)
