Introduction to Machine Learning with H2O and Python
1. Introduction to Machine Learning with H2O and Python
Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous
H2O Tutorial at Analyx
20th April, 2017
8. About Me
R + H2O + Domino for Kaggle
Guest blog post for Domino & H2O (2014)
• The long story: bit.ly/joe_kaggle_story
9. Agenda
• About H2O.ai
  • Company
  • Machine Learning Platform
• Tutorial
  • H2O Python Module
  • Download & Install
  • Step-by-Step Examples:
    • Basic Data Import / Manipulation
    • Regression & Classification (Basics)
    • Regression & Classification (Advanced)
    • Using H2O in the Cloud
10. Agenda (annotated)
Same agenda as slide 9, with notes: the "About H2O.ai" section is background information; the tutorial is aimed at beginners and run as if I am working on Kaggle competitions; there is a short break in between.
12. Company Overview

Founded:      2011; venture-backed, debuted in 2012
Products:     • H2O open-source in-memory AI prediction engine
              • Sparkling Water
              • Steam
Mission:      Operationalize data science, and provide a platform for users to build beautiful data products
Team:         70 employees
              • Distributed-systems engineers doing machine learning
              • World-class visualization designers
Headquarters: Mountain View, CA
16. H2O Community Growth
Tremendous momentum globally:
• 65,000+ users from ~8,000 companies in 140 countries (Sept 2016). Top 5 from:
[Chart: # H2O users, Jan 2015 – Oct 2016, growing from 0 to 65,000+]
[Chart: # companies using H2O, Jan 2015 – Oct 2016, growing to ~8,000; +127% and +60% year-over-year growth]
* Data from Google Analytics embedded in the end-user product
27. High-Level Architecture
[Diagram] Data is loaded from HDFS, S3, NFS, local files or SQL into the distributed, in-memory H2O Compute Engine (with lossless compression). The workflow runs: exploratory & descriptive analysis → feature engineering & selection → supervised & unsupervised modeling → model evaluation & selection → predict, backed by data & model storage. Both models and data-prep steps export as Plain Old Java Objects (POJOs) into a production scoring environment, or wherever your imagination takes them.
28. High-Level Architecture (same diagram)
Highlight: import data from multiple sources (HDFS, S3, NFS, local files, SQL).
29. High-Level Architecture (same diagram)
Highlight: fast, scalable & distributed compute engine written in Java.
31. Algorithms Overview

Supervised Learning
• Statistical Analysis — Generalized Linear Models (binomial, Gaussian, gamma, Poisson and Tweedie families); Naïve Bayes
• Ensembles — Distributed Random Forest (classification or regression); Gradient Boosting Machine (produces an ensemble of decision trees with increasingly refined approximations)
• Deep Neural Networks — Deep Learning (multi-layer feed-forward neural networks: an input layer followed by multiple layers of nonlinear transformations)

Unsupervised Learning
• Clustering — K-means (partitions observations into k clusters/groups of the same spatial size; can automatically detect an optimal k)
• Dimensionality Reduction — Principal Component Analysis (linearly transforms correlated variables into independent components); Generalized Low Rank Models (extend the idea of PCA to arbitrary data consisting of numerical, Boolean, categorical and missing values)
• Anomaly Detection — Autoencoders (find outliers via nonlinear dimensionality reduction using deep learning)
33. High-Level Architecture (same diagram)
Highlight: multiple interfaces.
37. High-Level Architecture (same diagram)
Highlight: export standalone models for production.
40. Learning Objectives
• Start and connect to a local H2O cluster from Python.
• Import data from Python data frames, local files or the web.
• Perform basic data transformation and exploration.
• Train regression and classification models using various H2O machine learning algorithms.
• Evaluate models and make predictions.
• Improve performance by tuning and stacking.
• Connect to an H2O cluster in the cloud.
57. Algorithms Overview (recap of slide 31)
117. Learning Objectives (recap of slide 40)
118. Improving Model Performance (Step-by-Step)

Model Settings                                MSE (CV)   MSE (Test)
GBM with default settings                     N/A        0.4551
GBM with manual settings                      N/A        0.4433
Manual settings + cross-validation            0.4502     0.4433
Manual + CV + early stopping                  0.4429     0.4287
CV + early stopping + full grid search        0.4378     0.4196
CV + early stopping + random grid search      0.4227     0.4047
Stacking models from random grid search       N/A        0.3969

Lowest MSE = best performance.
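The metric in the table is mean squared error: the average of squared residuals between actual and predicted values, so lower is better. A plain-Python sketch with made-up numbers:

```python
def mse(actual, predicted):
    """Mean squared error: average of squared residuals."""
    assert len(actual) == len(predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical predictions on a four-row test set
actual    = [3.0, 2.5, 4.0, 1.0]
predicted = [2.8, 2.7, 3.5, 1.4]
print(mse(actual, predicted))  # 0.1225
```

In H2O the same number comes back from `model.model_performance(test).mse()`, which is what each row of the table reports.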
120. Thanks!
• Our Friends at
• Find us at the Poznan R Meetup
  • Today at 6:15 pm
  • Uniwersytet Ekonomiczny w Poznaniu, Centrum Edukacyjne Usług Elektronicznych
• Code, Slides & Documents
  • bit.ly/h2o_meetups
  • docs.h2o.ai
• Contact
  • joe@h2o.ai
  • @matlabulous
  • github.com/woobe
• Please search/ask questions on Stack Overflow using the tag `h2o` (the letter O, not the digit zero)