Introduction to Machine Learning with H2O and Python
1. Introduction to Machine Learning with H2O and Python
Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous
H2O Tutorial at Analyx
20th April, 2017
8. About Me
R + H2O + Domino for Kaggle
Guest blog post for Domino & H2O (2014)
• The long story: bit.ly/joe_kaggle_story
9. Agenda
• About H2O.ai
  • Company
  • Machine Learning Platform
• Tutorial
  • H2O Python Module
  • Download & Install
  • Step-by-Step Examples:
    • Basic Data Import / Manipulation
    • Regression & Classification (Basics)
    • Regression & Classification (Advanced)
    • Using H2O in the Cloud
10. Agenda (annotated)
Same agenda as slide 9, with notes: the "About H2O.ai" section is background information; the tutorial is aimed at beginners and run as if I am working on Kaggle competitions; there is a short break in between.
12. Company Overview

Founded:      2011; venture-backed, debuted in 2012
Products:     • H2O open-source in-memory AI prediction engine
              • Sparkling Water
              • Steam
Mission:      Operationalize data science, and provide a platform for users to build beautiful data products
Team:         70 employees
              • Distributed-systems engineers doing machine learning
              • World-class visualization designers
Headquarters: Mountain View, CA
16. H2O Community Growth
Tremendous momentum globally:
• 65,000+ users from ~8,000 companies in 140 countries (Sept 2016). Top 5 from:
[Chart: # H2O users, Jan 2015 – Oct 2016, growing from 0 to 65,000+]
[Chart: # companies using H2O, Jan 2015 – Oct 2016, growing to ~8,000; +127% and +60% year-over-year growth]
* Data from Google Analytics embedded in the end-user product
27. High-Level Architecture
[Diagram] Data is loaded from HDFS, S3, NFS, local files or SQL into the distributed, in-memory H2O Compute Engine (with lossless compression). The workflow runs: exploratory & descriptive analysis → feature engineering & selection → supervised & unsupervised modeling → model evaluation & selection → predict, backed by data & model storage. Both models and data-prep steps export as Plain Old Java Objects (POJOs) into a production scoring environment, or wherever your imagination takes them.
28. High-Level Architecture (same diagram)
Highlight: import data from multiple sources (HDFS, S3, NFS, local files, SQL).
29. High-Level Architecture (same diagram)
Highlight: fast, scalable & distributed compute engine written in Java.
31. Algorithms Overview

Supervised Learning
• Statistical Analysis — Generalized Linear Models (binomial, Gaussian, gamma, Poisson and Tweedie families); Naïve Bayes
• Ensembles — Distributed Random Forest (classification or regression); Gradient Boosting Machine (produces an ensemble of decision trees with increasingly refined approximations)
• Deep Neural Networks — Deep Learning (multi-layer feed-forward neural networks: an input layer followed by multiple layers of nonlinear transformations)

Unsupervised Learning
• Clustering — K-means (partitions observations into k clusters/groups of the same spatial size; can automatically detect an optimal k)
• Dimensionality Reduction — Principal Component Analysis (linearly transforms correlated variables into independent components); Generalized Low Rank Models (extend the idea of PCA to arbitrary data consisting of numerical, Boolean, categorical and missing values)
• Anomaly Detection — Autoencoders (find outliers via nonlinear dimensionality reduction using deep learning)
33. High-Level Architecture (same diagram)
Highlight: multiple interfaces.
37. High-Level Architecture (same diagram)
Highlight: export standalone models for production.
40. Learning Objectives
• Start and connect to a local H2O cluster from Python.
• Import data from Python data frames, local files or the web.
• Perform basic data transformation and exploration.
• Train regression and classification models using various H2O machine learning algorithms.
• Evaluate models and make predictions.
• Improve performance by tuning and stacking.
• Connect to an H2O cluster in the cloud.
57. Algorithms Overview (recap of slide 31)
117. Learning Objectives (recap of slide 40)
118. Improving Model Performance (Step-by-Step)

Model Settings                                MSE (CV)   MSE (Test)
GBM with default settings                     N/A        0.4551
GBM with manual settings                      N/A        0.4433
Manual settings + cross-validation            0.4502     0.4433
Manual + CV + early stopping                  0.4429     0.4287
CV + early stopping + full grid search        0.4378     0.4196
CV + early stopping + random grid search      0.4227     0.4047
Stacking models from random grid search       N/A        0.3969

Lowest MSE = best performance.
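The metric in the table is mean squared error: the average of squared residuals between actual and predicted values, so lower is better. A plain-Python sketch with made-up numbers:

```python
def mse(actual, predicted):
    """Mean squared error: average of squared residuals."""
    assert len(actual) == len(predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical predictions on a four-row test set
actual    = [3.0, 2.5, 4.0, 1.0]
predicted = [2.8, 2.7, 3.5, 1.4]
print(mse(actual, predicted))  # 0.1225
```

In H2O the same number comes back from `model.model_performance(test).mse()`, which is what each row of the table reports.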
120. Thanks!
• Our Friends at
• Find us at the Poznan R Meetup
  • Today at 6:15 pm
  • Uniwersytet Ekonomiczny w Poznaniu, Centrum Edukacyjne Usług Elektronicznych
• Code, Slides & Documents
  • bit.ly/h2o_meetups
  • docs.h2o.ai
• Contact
  • joe@h2o.ai
  • @matlabulous
  • github.com/woobe
• Please search/ask questions on Stack Overflow using the tag `h2o` (the letter O, not the digit zero)