2. H2O.ai
Machine Intelligence
Outline
• Introduction to H2O
• GLM Overview
• Quick demo on Airlines Data
• Overview of H2O GLM features
• Common usage patterns
  • finding optimal regularization
  • handling wide datasets
• Kaggle Example
  • Avito Dataset overview
  • basic model
  • feature engineering
  • final model building
H2O - Product Overview
• In-Memory ML: Memory-Efficient Data Structures; Cutting-Edge Algorithms
• Distributed: Use all your Data (No Sampling); Accuracy with Speed and Scale
• Open Source: Ownership of Methods (Apache V2); Easy to Deploy: Bare, Hadoop, Spark, etc.
• APIs: Java, Scala, R, Python, JavaScript, JSON; NanoFast Scoring Engine (POJO)
Scientific Advisory Council
• Stephen Boyd, Professor of Electrical Engineering, Stanford University
• Rob Tibshirani, Professor of Health Research and Policy, and Statistics, Stanford University
• Trevor Hastie, Professor of Statistics, Stanford University
Strong Community & Growth

Active Users       Mar 2014   July 2014   Mar 2015
Companies               103         634      2,789
Users                   463       2,887     13,237

Source: @kdnuggets, 5/25/15 (t.co/4xSgleSIdY)
Actual Customer Use Cases
- Ad Optimization: 200% CPA lift with H2O
- P2B Model Factory: 60k models, 15x faster with H2O than before
- Fraud Detection: 11% higher accuracy with H2O Deep Learning, saving millions
- Real-time Marketing: H2O is 10x faster than anything else
- …and many large insurance and financial services companies!
Generalized Linear Models
- Well-known statistical/machine-learning method
- Fits a linear model
  - link(y) = c1*x1 + c2*x2 + … + cn*xn + intercept
- Produces a (relatively) simple model
  - easy to fit
  - easy to understand and interpret
  - well-known statistical properties
- Regression problems
  - gaussian, poisson, gamma, tweedie
- Classification
  - binomial, multinomial
- Requires good features
  - not as powerful on raw data as some other models (GBM, Deep Learning)
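The linear predictor and link above can be sketched in plain Python (a toy illustration with made-up coefficient values, not H2O code):

```python
import math

def linear_predictor(coeffs, intercept, x):
    # link(y) = c1*x1 + c2*x2 + ... + cn*xn + intercept
    return sum(c * xi for c, xi in zip(coeffs, x)) + intercept

def predict_gaussian(coeffs, intercept, x):
    # Identity link (gaussian family): the prediction is the linear predictor itself.
    return linear_predictor(coeffs, intercept, x)

def predict_binomial(coeffs, intercept, x):
    # Logit link (binomial family): invert the link to get a probability.
    eta = linear_predictor(coeffs, intercept, x)
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit (sigmoid)

# Made-up coefficients for a two-predictor model
coeffs, intercept = [0.5, -1.2], 0.3
print(predict_gaussian(coeffs, intercept, [1.0, 2.0]))  # unbounded linear response
print(predict_binomial(coeffs, intercept, [1.0, 2.0]))  # probability in (0, 1)
```

The same coefficient vector serves both tasks; only the link (and the family's likelihood during fitting) changes.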
Generalized Linear Models 2
- Linear Model
  - defined by a vector of coefficients
  - 1 number per predictor
- Parametrized by Family and Link
  - Family
    - our assumption about the distribution of the response
    - e.g. poisson for regression on counts, binomial for two-class classification
  - Link
    - non-linear transform of the response
    - e.g. logit to generate the s-curve for logistic regression
- Fitted by maximum likelihood
  - pick the model with the maximum probability of seeing the data
  - needs an iterative solver (e.g. Newton's method, L-BFGS)
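A toy sketch of maximum-likelihood fitting via Newton's method, for a one-coefficient logistic regression on made-up data (illustrative only; H2O's solvers are distributed, multi-predictor versions of this idea):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, iters=50):
    """Newton's method for logit(P(y=1)) = b * x (toy model, no intercept)."""
    b = 0.0
    for _ in range(iters):
        # First and second derivatives of the log-likelihood w.r.t. b
        grad = sum((y - sigmoid(b * x)) * x for x, y in zip(xs, ys))
        hess = -sum(sigmoid(b * x) * (1 - sigmoid(b * x)) * x * x for x in xs)
        b -= grad / hess  # Newton step: follow the curvature to the maximum
    return b

# Made-up, non-separable data: larger x -> more likely y = 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
b = fit_logistic_1d(xs, ys)
print(b)  # positive coefficient: probability rises with x
```

At convergence the gradient of the log-likelihood is (numerically) zero, which is exactly the "max probability of seeing the data" condition.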
Generalized Linear Models 3
Simple 2-class classification example:
- Linear Regression fit (family = gaussian, link = identity)
- Logistic Regression fit (family = binomial, link = logit)
Penalized Generalized Linear Models
- Problems
  - can overfit: works great on training data, fails on test data
  - solution is not unique with correlated variables
- Solution: add regularization
  - add a penalty to reduce the size of the coefficient vector
  - L1 or L2 norm of the coefficient vector
- L1 versus L2
  - L2: dense solution
    - coefficients of correlated variables are pushed to the same value
  - L1: sparse solution
    - picks one of the correlated variables, discards the others
- Elastic Net
  - combination of L1 and L2
  - sparse solution; correlated variables are grouped and enter/leave the model together
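The elastic net penalty can be sketched as follows (a minimal illustration of the penalty term only; H2O's actual objective also normalizes by the number of observations):

```python
def l1_norm(coeffs):
    return sum(abs(c) for c in coeffs)

def l2_norm_sq(coeffs):
    return sum(c * c for c in coeffs)

def elastic_net_penalty(coeffs, lam, alpha):
    """lam sets the overall strength; alpha mixes L1 (alpha=1) and L2 (alpha=0)."""
    return lam * (alpha * l1_norm(coeffs) + (1 - alpha) * 0.5 * l2_norm_sq(coeffs))

def penalized_objective(neg_log_likelihood, coeffs, lam, alpha):
    # Minimize: data-fit term (negative log-likelihood) + regularization penalty
    return neg_log_likelihood + elastic_net_penalty(coeffs, lam, alpha)

coeffs = [0.5, -1.2, 0.0]
print(elastic_net_penalty(coeffs, lam=0.1, alpha=0.5))   # elastic net
print(elastic_net_penalty(coeffs, lam=0.1, alpha=1.0))   # pure L1 (lasso)
print(elastic_net_penalty(coeffs, lam=0.1, alpha=0.0))   # pure L2 (ridge)
```

Setting alpha between 0 and 1 is what gives the elastic net its behavior: the L1 part zeroes out weak coefficients, while the L2 part keeps correlated variables grouped.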
GLM on H2O
- Fully distributed and parallel
  - handles datasets with up to hundreds of thousands of predictors
  - scales linearly with the number of rows
  - processes datasets with hundreds of millions of rows in seconds
- All standard GLM features
  - standard families
  - support for observation weights and offsets
- Elastic Net regularization
  - lambda search: efficient computation of the optimal regularization strength
  - applies strong rules to filter out inactive coefficients
- Several solvers for different problems
  - iteratively re-weighted least squares (IRLSM) with an ADMM solver
  - L-BFGS for wide problems
  - Coordinate Descent (experimental)
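L1 solvers such as coordinate descent lean on the soft-thresholding operator, and lambda search amounts to sweeping a decreasing sequence of penalties. A toy sketch (the per-coordinate update values are made up; this is the operator's shape, not H2O's distributed implementation):

```python
def soft_threshold(z, lam):
    """Soft-thresholding: the closed-form L1 update used by coordinate descent."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # coefficient is zeroed out: this is what makes L1 sparse

# Lambda search: fit over a decreasing sequence of lambdas. Strong penalties
# zero out weak coefficients; weaker penalties let more predictors enter.
raw_updates = [0.05, -0.3, 1.4]  # made-up unpenalized per-coordinate updates
for lam in [1.0, 0.5, 0.1]:
    coeffs = [soft_threshold(z, lam) for z in raw_updates]
    nonzero = sum(1 for c in coeffs if c != 0.0)
    print(lam, coeffs, nonzero)
```

As lambda shrinks, the active set grows, which is also why strong rules can cheaply discard coefficients that will stay at zero for the current lambda.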
GLM on H2O 2
- Automatic handling of categorical variables
  - automatically expands categoricals into 1-hot encoded binary vectors
  - efficient handling (sparse access, sparse covariance matrix)
  - (unlike R) uses all levels by default when running with regularization
- Missing value handling
  - missing values are not handled; rows with any missing value are omitted from the training dataset
  - impute missing values up front if there are many
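Both behaviors can be sketched on toy Python lists (H2O does this internally on its own frames; the data here is made up):

```python
def one_hot(values):
    """Expand a categorical column into one binary column per level (all levels kept)."""
    levels = sorted(set(values))
    return levels, [[1 if v == lvl else 0 for lvl in levels] for v in values]

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

levels, encoded = one_hot(["ORD", "ATL", "ORD", "DFW"])
print(levels)                          # ['ATL', 'DFW', 'ORD']
print(encoded)                         # one binary vector per row
print(impute_mean([1.0, None, 3.0]))   # [1.0, 2.0, 3.0]
```

Note that all levels are encoded here; with regularization the penalty resolves the resulting collinearity, which is why H2O can keep every level while unregularized R glm must drop one.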
Results in Seconds on Big Data
- Logistic Regression: ~20s
  - elastic net, alpha=0.5, lambda=1.379e-4 (auto)
- Deep Learning: ~70s (+9% AUC)
  - 4 hidden ReLU layers of 20 neurons, 2 epochs
- 8-node EC2 cluster: 64 virtual cores, 1GbE; all cores maxed out
- Year, Month, Scheduled Departure Time have non-linear impact
- Chicago, Atlanta, Dallas: often delayed
Output Fields
- Standard metrics as in other H2O algos, plus:
  - residual deviance
  - null deviance
  - degrees of freedom
- Coefficients / standardized coefficients
  - the actual model: one number per predictor
  - the model is fitted on standardized data by default (controlled by a parameter)
    - standardized coefficients are the actual coefficients fitted on the standardized data
    - (non-standardized) coefficients are the de-scaled version of the standardized coefficients, so that they can be applied to the original dataset
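The de-scaling relationship can be sketched directly: if a predictor was standardized as (x - mu) / sigma, a standardized coefficient b_std maps back to b = b_std / sigma, with the intercept absorbing the shift (a toy illustration of the math with made-up values, not H2O code):

```python
def descale(std_coeffs, std_intercept, means, sds):
    """Map coefficients fitted on standardized data back to the original scale."""
    coeffs = [b / s for b, s in zip(std_coeffs, sds)]
    intercept = std_intercept - sum(b * m / s for b, m, s in zip(std_coeffs, means, sds))
    return coeffs, intercept

# Both forms must give the same linear predictor for any row x:
std_coeffs, std_intercept = [0.8, -0.4], 1.0
means, sds = [10.0, 2.0], [5.0, 0.5]
coeffs, intercept = descale(std_coeffs, std_intercept, means, sds)

x = [12.0, 1.5]
eta_std = sum(b * (xi - m) / s for b, xi, m, s in zip(std_coeffs, x, means, sds)) + std_intercept
eta_raw = sum(b * xi for b, xi in zip(coeffs, x)) + intercept
print(abs(eta_std - eta_raw) < 1e-9)  # True: the two models are the same model
```

This is also why standardized coefficients are the ones to compare for relative variable importance: they are on a common scale.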
Avito Dataset: Predict User Clicks
- Running GLM straight on the raw data is fast, but accuracy is not great
- We will try to improve it by:
  - turning variables into categoricals
  - imputing NAs with means
Avito Dataset: Predict User Clicks 2
- Further improvements
  - cut numerical columns into intervals to make new categoricals
    - use h2o.hist + h2o.cut
  - add interactions
    - use h2o.interaction
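The binning idea can be sketched in plain Python (a toy stand-in for the h2o.hist + h2o.cut workflow; the break points and values are made up):

```python
def cut(values, breaks):
    """Assign each value to a half-open interval label, like cutting a numeric column."""
    labels = []
    for v in values:
        label = None  # stays None if v falls outside all intervals
        for lo, hi in zip(breaks, breaks[1:]):
            if lo <= v < hi:
                label = "[%g,%g)" % (lo, hi)
                break
        labels.append(label)
    return labels

# Made-up break points, e.g. chosen from a histogram of the column
breaks = [0, 10, 100, 1000]
print(cut([5, 42, 999, -1], breaks))
# ['[0,10)', '[10,100)', '[100,1000)', None]
```

The resulting labels become a new categorical column, which lets the (linear) GLM model non-linear effects of the original numeric variable, one coefficient per interval.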
General Usage Guidelines
- Solver selection
  - IRLSM: the default choice with L1 penalty
    - works great with a small number of predictors
    - efficient L1 solver
    - can handle wide datasets with lambda search and L1 penalty
  - L-BFGS
    - handles wide data well, but
    - can take many iterations (a long time), especially with L1 penalty
    - tune the objective epsilon: often many iterations are spent on minor improvements
- Regularization selection
  - compare sparse versus dense solutions
    - compare runs with alpha >= 0.5 and alpha == 0
    - generally L1 does slightly better
  - run lambda search to pick the optimal regularization strength
General Usage Guidelines 2
- Do not pre-expand categorical variables
  - H2O expands categorical variables automatically, which is far more efficient
- Adding features
  - splitting numerical variables into intervals helps
  - adding categorical interactions helps
- Using lambda search
  - always use a validation dataset
    - otherwise it picks lambda.min
    - the validation dataset is used to pick the best lambda value, so you need a separate test set!
  - check that lambda.best > lambda.min
    - otherwise the search did not start overfitting, and a smaller lambda may be better
    - re-run with a smaller lambda.min