H2O.ai

Machine Intelligence
Generalized Linear Models with H2O
Tomas Nykodym
tomas@h2o.ai

Outline
• Introduction to H2O
• GLM Overview
• Quick demo on Airlines Data
• Overview of H2O GLM features
• Common usage patterns
  • finding optimal regularization
  • handling wide datasets
• Kaggle Example
  • Avito Dataset overview
  • basic model
  • feature engineering
  • final model building

H2O - Product Overview
- In-Memory ML
  - Memory-Efficient Data Structures
  - Cutting-Edge Algorithms
- Distributed
  - Use all your Data (No Sampling)
  - Accuracy with Speed and Scale
- Open Source
  - Ownership of Methods - Apache V2
  - Easy to Deploy: Bare, Hadoop, Spark, etc.
- APIs
  - Java, Scala, R, Python, JavaScript, JSON
  - NanoFast Scoring Engine (POJO)

Team Work @ H2O.ai
25,000 commits / 3yrs
H2O World Conference 2014
Join H2O World Nov 9-11 2015!

Scientific Advisory Council
Stephen Boyd
Professor of Electrical Engineering
Stanford University
Rob Tibshirani
Professor of Health Research and Policy, and Statistics
Stanford University
Trevor Hastie
Professor of Statistics
Stanford University

Strong Community & Growth
            Mar 2014   July 2014   Mar 2015
Companies   103        634         2,789
Users       463        2,887       13,237
Active Users
150+
5/25/15 @kdnuggets t.co/4xSgleSIdY

Actual Customer Use Cases
- Ad Optimization (200% CPA Lift with H2O)
- P2B Model Factory (60k models, 15x faster with H2O than before)
- Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
- Real-time marketing (H2O is 10x faster than anything else)
- …and many large insurance and financial services companies!

Accuracy with Speed and Scale
[Architecture diagram]
- Data sources: HDFS, S3, SQL, NoSQL
- Tasks (Supervised, Unsupervised): Classification, Regression, Feature Engineering, Matrix Factorization, Clustering, Munging
- Fast Modeling Engine: Distributed In-Memory, Map Reduce/Fork Join, Columnar Compression
- Algorithms: GLM, Deep Learning, K-Means, PCA, NB, Cox, Random Forest / GBM, Ensembles
- Streaming
- Nano Fast Java Scoring Engines (POJO code generation)
- Most code is written in-house from scratch

Generalized Linear Models
- Well known statistical/machine learning method
- Fits a linear model
  - link(y) = c1*x1 + c2*x2 + … + cn*xn + intercept
- Produces (relatively) simple model
  - easy to fit
  - easy to understand and interpret
  - well known statistical properties
- Regression problems
  - gaussian, poisson, gamma, tweedie
- Classification
  - binomial, multinomial
- Requires good features
  - not as powerful on raw data as some other models (gbm/deep learning)
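
The linear model above can be sketched in a few lines of plain Python. This is an illustration only, not H2O code: the function names and the table of inverse links are assumptions made for the example. It treats the link as acting on the expected response, so a prediction is the inverse link applied to the linear predictor.

```python
import math

# Inverse links map the linear predictor back to the response scale.
# The three links below are common choices; the dictionary itself is
# a hypothetical helper for this sketch.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                         # gaussian
    "log": lambda eta: math.exp(eta),                    # poisson
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),   # binomial
}


def linear_predictor(coefficients, intercept, x):
    """c1*x1 + c2*x2 + ... + cn*xn + intercept."""
    return intercept + sum(c * xi for c, xi in zip(coefficients, x))


def glm_predict(coefficients, intercept, x, link="identity"):
    """Apply the inverse link to the linear predictor."""
    eta = linear_predictor(coefficients, intercept, x)
    return INVERSE_LINKS[link](eta)
```

With the logit link, a linear predictor of zero corresponds to a predicted probability of 0.5.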

Generalized Linear Models 2
- Linear Model
  - defined by vector of coefficients
  - 1 number per predictor
- Parametrized by Family and Link
  - Family
    - our assumption about the distribution of the response
    - e.g. poisson for regression on counts, binomial for two-class classification
  - Link
    - non-linear transform of the response
    - e.g. logit to generate s-curve for logistic regression
- Fitted by maximum likelihood
  - pick the model with max probability of seeing the data
  - needs an iterative solver (e.g. Newton's method, L-BFGS)
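
To make "fitted by maximum likelihood with an iterative solver" concrete, here is a minimal sketch of Newton's method for logistic regression (family=binomial, link=logit) with one predictor and an intercept. It is plain Python written for this example, not H2O's solver; the toy data and function names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Binomial log-likelihood of the data under coefficients (b0, b1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic_newton(xs, ys, iterations=25, tol=1e-10):
    """Maximize the log-likelihood with Newton steps on (intercept, slope)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # negative Hessian (2x2, symmetric)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (negative Hessian) * step = gradient.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1


# Toy data with overlapping classes, so a finite optimum exists.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
```

Each Newton step here is equivalent to one weighted least squares solve, which is why this family of methods is also called iteratively re-weighted least squares.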

Generalized Linear Models 3
Simple 2-class classification example
[Figure: Linear Regression fit (family=gaussian, link=identity) versus Logistic Regression fit (family=binomial, link=logit)]

Penalized Generalized Linear Models
- Problems
  - can overfit - works great on training, fails on test
  - solution is not unique with correlated variables
- Solution - Add Regularization
  - add penalty to reduce size of the coefficient vector
  - L1 or L2 norm of the coefficient vector
- L1 versus L2
  - L2: dense solution
    - coefficients of correlated variables are pushed to the same value
  - L1: sparse solution
    - picks one correlated variable, others discarded
- Elastic Net
  - combination of L1 and L2
  - sparse solution; correlated variables are grouped and enter/leave the model together
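
The difference between the penalties can be seen on a single coefficient. The sketch below assumes the common elastic net parameterization lambda * (alpha * L1 + (1 - alpha)/2 * L2^2) and a standardized, uncorrelated predictor, where the penalized solution has a closed form. The function names are hypothetical and this is not H2O code.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


def soft_threshold(z, t):
    """Move z toward zero by t, and return exactly zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def shrink(z, lam, alpha):
    """Minimizer of 0.5*(b - z)^2 + lam*(alpha*|b| + 0.5*(1 - alpha)*b^2).

    z is the unpenalized estimate of a single coefficient.
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


unpenalized = [3.0, 0.4, -1.5]
lasso = [shrink(z, lam=0.5, alpha=1.0) for z in unpenalized]   # L1 only
ridge = [shrink(z, lam=0.5, alpha=0.0) for z in unpenalized]   # L2 only
```

With alpha=1 the small coefficient becomes exactly zero (sparse solution); with alpha=0 every coefficient is scaled down but none is removed (dense solution).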

GLM on H2O
- Fully Distributed and Parallel
  - handles datasets with up to 100s of thousands of predictors
  - scales linearly with number of rows
  - processes datasets with 100s of millions of rows in seconds
- All standard GLM features
  - standard families
  - support for observation weights and offset
- Elastic Net Regularization
  - lambda-search - efficient computation of optimal regularization strength
  - applies strong rules to filter out inactive coefficients
- Several solvers for different problems
  - Iterative re-weighted least squares with ADMM solver
  - L-BFGS for wide problems
  - Coordinate Descent (Experimental)
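
A rough sketch of what a lambda search and strong-rule screening involve, for a gaussian model with standardized predictors. The formulas follow the usual elastic net literature (largest useful lambda, geometric path, sequential strong rule); the function names and defaults are assumptions, and H2O's actual implementation may differ.

```python
import math


def lambda_max(columns, y, alpha=1.0):
    """Smallest lambda for which every penalized coefficient is zero.

    Assumes alpha > 0 and predictors given as standardized columns.
    """
    n = len(y)
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]
    largest = max(
        abs(sum(x * r for x, r in zip(col, residual))) / n for col in columns
    )
    return largest / alpha


def lambda_path(lam_max, n_lambdas=10, min_ratio=1e-3):
    """Geometric sequence from lam_max down to lam_max * min_ratio."""
    step = math.log(min_ratio) / (n_lambdas - 1)
    return [lam_max * math.exp(step * k) for k in range(n_lambdas)]


def strong_rule_keep(columns, residual, lam_prev, lam_next, alpha=1.0):
    """Sequential strong rule: indices of predictors that stay candidates.

    residual is y minus the fitted values at lam_prev. A predictor is
    screened out when its correlation with the residual is below
    alpha * (2 * lam_next - lam_prev).
    """
    n = len(residual)
    cutoff = alpha * (2.0 * lam_next - lam_prev)
    keep = []
    for j, col in enumerate(columns):
        score = abs(sum(x * r for x, r in zip(col, residual))) / n
        if score >= cutoff:
            keep.append(j)
    return keep
```

The strong rule is a heuristic, so a full implementation would re-check the discarded predictors against the optimality conditions after each fit.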

GLM on H2O 2
- Automatic handling of categorical variables
  - automatically expands categoricals into 1-hot encoded binary vectors
  - efficient handling (sparse access, sparse covariance matrix)
  - (unlike R) uses all levels by default if running with regularization
- Missing value handling
  - missing values are not handled, and rows with any missing value will be omitted from the training dataset
  - need to impute missing values up front if there are many
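
For readers unfamiliar with the two behaviors described above, this plain-Python sketch shows 1-hot expansion with all levels versus a dropped reference level, and what omitting rows with missing values means. It illustrates the idea only; the helper names are hypothetical and H2O does this internally in a sparse, distributed way.

```python
def one_hot(values, use_all_levels=True):
    """Expand a categorical column into 0/1 indicator columns.

    With use_all_levels=False the first level (in sorted order) is
    dropped and acts as the reference level.
    """
    levels = sorted(set(values))
    if not use_all_levels:
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def drop_rows_with_missing(rows):
    """Keep only rows where no value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row)]


levels, encoded = one_hot(["SFO", "ORD", "SFO", "ATL"])
```

Keeping all levels makes the indicator columns linearly dependent with the intercept, which is one reason the slide ties that default to running with regularization.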

EC2 Demo Cluster: 8 nodes, 64 cores
H2O Deep Learning: Expect good cluster utilization :)

Airline Data: Predict Delayed Departure
Predict dep. delay Y/N
20 years of domestic airline flight data
116M rows
31 columns
12 GB CSV
4 GB compressed

Results in Seconds on Big Data
8-node EC2 cluster: 64 virtual cores, 1GbE
All cores maxed out
Logistic Regression: ~20s
elastic net, alpha=0.5, lambda=1.379e-4 (auto)
Deep Learning: ~70s
4 hidden ReLU layers of 20 neurons, 2 epochs
+9% AUC
Year, Month, Sched. Dep. Time have non-linear impact
Chicago, Atlanta, Dallas: often delayed

Output Fields
- Standard metrics as in other H2O algos, plus
  - residual deviance
  - null deviance
  - degrees of freedom
- Coefficients / standardized coefficients
  - The actual model
  - One number per predictor
  - Model is fitted on standardized data by default (parameter)
    - standardized coefficients are the actual coefficients fitted on standardized data
    - (non-standardized) coefficients are a de-scaled version of the standardized coefficients (so that they can be applied to the original dataset)
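
The relationship between the two sets of coefficients can be written out directly. The sketch below assumes standardization means subtracting the column mean and dividing by the sample standard deviation; the helper names and the numbers are made up for illustration and are not H2O output.

```python
import math


def column_stats(column):
    """Mean and sample standard deviation of one numeric column."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / (n - 1)
    return mean, math.sqrt(variance)


def descale(std_coefs, std_intercept, means, sds):
    """Convert coefficients fitted on standardized data to the original scale.

    b_j = b_std_j / sd_j
    intercept = intercept_std - sum_j b_std_j * mean_j / sd_j
    """
    coefs = [b / s for b, s in zip(std_coefs, sds)]
    intercept = std_intercept - sum(
        b * m / s for b, m, s in zip(std_coefs, means, sds)
    )
    return coefs, intercept


# Hypothetical data and standardized coefficients.
columns = [[1.0, 2.0, 3.0, 4.0], [10.0, 30.0, 20.0, 60.0]]
stats = [column_stats(c) for c in columns]
means = [m for m, _ in stats]
sds = [s for _, s in stats]
std_coefs, std_intercept = [0.5, -1.5], 2.0
coefs, intercept = descale(std_coefs, std_intercept, means, sds)
```

Both forms give the same linear predictor for any row; the standardized coefficients are the ones to compare when judging relative importance of predictors measured in different units.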

Avito Dataset: Predict User Clicks
Kaggle competition
https://www.kaggle.com/c/avito-context-ad-clicks/data
20M rows
21 columns
~ 3 GB CSV
~ 1 GB compressed
Some munging needed, see
https://github.com/h2oai/0xdata.com/blob/master/src/blog/2015_10_DataMunging._md
http://www.slideshare.net/0xdata/400-million-search-results-predict-contextual-ad-clicks

Avito Dataset: Predict User Clicks
- Running GLM straight on the data is fast, but accuracy is not great
- We will try to improve it by:
  - turning variables into categoricals
  - imputing NAs with means
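
The two preparation steps named above are simple to state in code. This is a plain-Python sketch with hypothetical helper names, where None stands in for NA; it is not the H2O API.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def as_categorical(column):
    """Treat numeric codes (e.g. IDs) as category labels, not magnitudes."""
    return [str(v) for v in column]
```

For example, `impute_mean([1.0, None, 3.0])` fills the gap with 2.0. Converting an ID column with `as_categorical` matters for a linear model because otherwise a single coefficient would be applied to the numeric value of the ID.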

Avito Dataset: Predict User Clicks
- Further improvements
  - cut numerical columns into intervals to make new categoricals
    - use h2o.hist + h2o.cut
  - add interactions
    - use h2o.interaction
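
The slide points to h2o.hist, h2o.cut and h2o.interaction for these steps. The sketch below does not call them; it only shows the underlying idea in plain Python, with hypothetical function names and a label format chosen for the example.

```python
import bisect


def cut(values, breaks):
    """Map each number to an interval label.

    breaks must be sorted; intervals are closed on the left, e.g. [10,20).
    """
    labels = []
    for v in values:
        k = bisect.bisect_right(breaks, v)
        low = "-inf" if k == 0 else str(breaks[k - 1])
        high = "inf" if k == len(breaks) else str(breaks[k])
        labels.append("[" + low + "," + high + ")")
    return labels


def interaction(col_a, col_b):
    """Combine two categorical columns into one column of paired levels."""
    return [a + "_" + b for a, b in zip(col_a, col_b)]


price_bins = cut([5, 10, 25], breaks=[10, 20])
crossed = interaction(["mobile", "desktop"], ["cars", "phones"])
```

Binning lets a linear model assign a separate coefficient to each interval, which is one way to capture a non-linear effect of a numeric column; interactions do the same for combinations of levels.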

General Usage Guidelines
- Solver selection
  - IRLSM - default choice with L1 penalty
    - works great with a small number of predictors
    - efficient L1 solver
    - can handle wide datasets with lambda search and L1 penalty
  - L-BFGS
    - handles wide data well, but
    - can iterate a lot (take a long time), especially with L1 penalty
    - tune the objective epsilon - often many iterations are spent on minor improvements
- Regularization Selection
  - Compare sparse versus dense
    - compare runs with alpha >= .5, alpha == 0
    - generally L1 does slightly better
  - Run lambda search to pick optimal regularization strength
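
The effect of an objective tolerance can be seen with any iterative minimizer. The sketch below uses plain gradient descent on a toy quadratic and stops when the relative improvement of the objective falls below objective_epsilon. The stopping rule and the toy problem are assumptions made for illustration, not H2O's L-BFGS.

```python
def minimize(objective, gradient, start, step, objective_epsilon,
             max_iterations=10000):
    """Gradient descent that stops on small relative objective improvement."""
    x = start
    previous = objective(x)
    for iteration in range(1, max_iterations + 1):
        x = x - step * gradient(x)
        current = objective(x)
        if abs(previous - current) <= objective_epsilon * max(abs(previous), 1e-12):
            return x, iteration
        previous = current
    return x, max_iterations


def toy_objective(x):
    return (x - 3.0) ** 2 + 1.0


def toy_gradient(x):
    return 2.0 * (x - 3.0)


x_loose, iters_loose = minimize(toy_objective, toy_gradient, 0.0, 0.05, 1e-3)
x_tight, iters_tight = minimize(toy_objective, toy_gradient, 0.0, 0.05, 1e-8)
```

The tighter tolerance takes noticeably more iterations while the objective value barely changes, which is the trade-off behind the advice to tune the objective epsilon.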

General Usage Guidelines 2
- Do not pre-expand categorical variables
  - H2O expands categorical variables automatically
  - way more efficient
- Adding features
  - Splitting numerical variables into intervals helps
  - Adding categorical interactions helps
- Using Lambda-Search
  - always use a validation dataset
    - otherwise picks lambda.min
  - validation dataset is used to pick the best lambda value -> need a separate test set!
  - check that lambda.best > lambda.min
    - otherwise the model did not start overfitting, smaller lambda may be better
    - re-run with smaller lambda.min
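
The lambda-search check above can be expressed as a small helper. This sketch assumes you already have one validation error per lambda on the search path; the function name and return shape are hypothetical.

```python
def pick_lambda(lambdas, validation_errors):
    """Return (lambda_best, hit_lower_bound).

    lambda_best has the lowest validation error. hit_lower_bound is True
    when lambda_best equals the smallest lambda tried, meaning the search
    never reached the point where the model starts to overfit.
    """
    best_index = min(range(len(lambdas)), key=lambda i: validation_errors[i])
    lambda_best = lambdas[best_index]
    hit_lower_bound = lambda_best <= min(lambdas)
    return lambda_best, hit_lower_bound
```

When hit_lower_bound is True, the guideline is to re-run the search with a smaller lambda.min. Because the validation set chose lambda_best, the final error estimate should come from a separate test set.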

More Info in H2O Booklets
http://h2o.ai/resources

More Related Content

What's hot

2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
DB Tsai
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 

What's hot (20)

Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 

Viewers also liked

Final generalized linear modeling by idrees waris iugc
Final generalized linear modeling by idrees waris iugcFinal generalized linear modeling by idrees waris iugc
Final generalized linear modeling by idrees waris iugc
Id'rees Waris
 
Inferential statistics.ppt
Inferential statistics.pptInferential statistics.ppt
Inferential statistics.ppt
Nursing Path
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
saba khan
 
Simple linear regression (final)
Simple linear regression (final)Simple linear regression (final)
Simple linear regression (final)
Harsh Upadhyay
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
Ravi shankar
 

Viewers also liked (20)

General Linear Model | Statistics
General Linear Model | StatisticsGeneral Linear Model | Statistics
General Linear Model | Statistics
 
Generalized Linear Models
Generalized Linear ModelsGeneralized Linear Models
Generalized Linear Models
 
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimatorsReading the Lindley-Smith 1973 paper on linear Bayes estimators
Reading the Lindley-Smith 1973 paper on linear Bayes estimators
 
H2O 0xdata MLconf
H2O 0xdata MLconfH2O 0xdata MLconf
H2O 0xdata MLconf
 
H2O 3 REST API Overview
H2O 3 REST API OverviewH2O 3 REST API Overview
H2O 3 REST API Overview
 
Final generalized linear modeling by idrees waris iugc
Final generalized linear modeling by idrees waris iugcFinal generalized linear modeling by idrees waris iugc
Final generalized linear modeling by idrees waris iugc
 
Medical Statistics Part-II:Inferential statistics
Medical Statistics Part-II:Inferential  statisticsMedical Statistics Part-II:Inferential  statistics
Medical Statistics Part-II:Inferential statistics
 
Sparkling Water 2.0 - Michal Malohlava
Sparkling Water 2.0 - Michal MalohlavaSparkling Water 2.0 - Michal Malohlava
Sparkling Water 2.0 - Michal Malohlava
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data science
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Drools 6.0 (Red Hat Summit)
Drools 6.0 (Red Hat Summit)Drools 6.0 (Red Hat Summit)
Drools 6.0 (Red Hat Summit)
 
Inferential statistics.ppt
Inferential statistics.pptInferential statistics.ppt
Inferential statistics.ppt
 
Linear model of Curriculum
Linear model of CurriculumLinear model of Curriculum
Linear model of Curriculum
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Simple linear regression (final)
Simple linear regression (final)Simple linear regression (final)
Simple linear regression (final)
 
Presentation On Regression
Presentation On RegressionPresentation On Regression
Presentation On Regression
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Models of curriculum
Models of curriculumModels of curriculum
Models of curriculum
 

Similar to Generalized Linear Models with H2O

Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
Databricks
 
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochgArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
Sri Ambati
 
In-Memory Data Grids - Ampool (1)
In-Memory Data Grids - Ampool (1)In-Memory Data Grids - Ampool (1)
In-Memory Data Grids - Ampool (1)
Chinmay Kulkarni
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_Library
Saurabh Chauhan
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
ivascucristian
 

Similar to Generalized Linear Models with H2O (20)

System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
 
Artificial Intelligence Database Performance Tuning
Artificial Intelligence Database Performance TuningArtificial Intelligence Database Performance Tuning
Artificial Intelligence Database Performance Tuning
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
NoSQL meetup July 2011
NoSQL meetup July 2011NoSQL meetup July 2011
NoSQL meetup July 2011
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
 
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
 
Ssbse10.ppt
Ssbse10.pptSsbse10.ppt
Ssbse10.ppt
 
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochgArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
 
In-Memory Data Grids - Ampool (1)
In-Memory Data Grids - Ampool (1)In-Memory Data Grids - Ampool (1)
In-Memory Data Grids - Ampool (1)
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_Library
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 

More from Sri Ambati

More from Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Recently uploaded

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Recently uploaded (20)

Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 

Generalized Linear Models with H2O

  • 1. H2O.ai
 Machine Intelligence
 Generalized Linear Models with H2O
 Tomas Nykodym, tomas@h2o.ai
  • 2. H2O.ai
 Outline
 • Introduction to H2O
 • GLM Overview
 • Quick demo on Airlines Data
 • Overview of H2O GLM features
 • Common usage patterns
   • finding optimal regularization
   • handling wide datasets
 • Kaggle Example
   • Avito Dataset overview
   • basic model
   • feature engineering
   • final model building
  • 3. H2O.ai
 H2O - Product Overview
 - In-Memory ML
   - Memory-Efficient Data Structures
   - Cutting-Edge Algorithms
 - Distributed
   - Use all your Data (No Sampling)
   - Accuracy with Speed and Scale
 - Open Source
   - Ownership of Methods - Apache V2
   - Easy to Deploy: Bare, Hadoop, Spark, etc.
 - APIs
   - Java, Scala, R, Python, JavaScript, JSON
   - NanoFast Scoring Engine (POJO)
  • 4. H2O.ai
 Team Work @ H2O.ai
 - 25,000 commits / 3yrs
 - H2O World Conference 2014
 - Join H2O World Nov 9-11 2015!
  • 5. H2O.ai
 Scientific Advisory Council
 - Stephen Boyd, Professor of Electrical Engineering, Stanford University
 - Rob Tibshirani, Professor of Health Research and Policy, and Statistics, Stanford University
 - Trevor Hastie, Professor of Statistics, Stanford University
  • 6. H2O.ai
 Strong Community & Growth

            Mar 2014   July 2014   Mar 2015
 Companies  103        634         2,789
 Users      463        2,887       13,237

 Active Users 150+
 5/25/15 @kdnuggets t.co/4xSgleSIdY
  • 7. H2O.ai
 Actual Customer Use Cases
 - Ad Optimization (200% CPA Lift with H2O)
 - P2B Model Factory (60k models, 15x faster with H2O than before)
 - Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
 - Real-time marketing (H2O is 10x faster than anything else)
 - …and many large insurance and financial services companies!
  • 8. H2O.ai
 Accuracy with Speed and Scale
 - Data sources: HDFS, S3, SQL, NoSQL
 - Fast Modeling Engine: Distributed In-Memory Map Reduce/Fork Join, Columnar Compression
 - Supervised and unsupervised: Classification, Regression, Feature Engineering, Matrix Factorization, Clustering, Munging
 - Algorithms: GLM, Deep Learning, K-Means, PCA, NB, Cox, Random Forest / GBM, Ensembles
 - Streaming Nano Fast Java Scoring Engines (POJO code generation)
 - Most code is written in-house from scratch
  • 9. H2O.ai
 Generalized Linear Models
 - Well known statistical/machine learning method
 - Fits a linear model
   - link(y) = c1*x1 + c2*x2 + … + cn*xn + intercept
 - Produces (relatively) simple model
   - easy to fit
   - easy to understand and interpret
   - well known statistical properties
 - Regression problems
   - gaussian, poisson, gamma, tweedie
 - Classification
   - binomial, multinomial
 - Requires good features
   - not as powerful on raw data as some other models (gbm/deep learning)
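
The linear model on this slide can be sketched in a few lines of plain Python. This is only an illustration, not H2O code: the function names, the coefficient values, and the input row are made up for the example, and only three common link functions are shown.

```python
import math

# Inverse link functions map the linear predictor back to the response scale.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                        # usual choice for gaussian
    "log": lambda eta: math.exp(eta),                   # usual choice for poisson
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # usual choice for binomial
}

def linear_predictor(coefs, intercept, x):
    """Compute c1*x1 + c2*x2 + ... + cn*xn + intercept."""
    return sum(c * xi for c, xi in zip(coefs, x)) + intercept

def predict(coefs, intercept, x, link="identity"):
    """Apply the inverse link to the linear predictor."""
    return INVERSE_LINKS[link](linear_predictor(coefs, intercept, x))

# Hypothetical coefficients for two predictors and one input row.
coefs, intercept = [0.5, -1.0], 0.25
row = [2.0, 1.0]
eta = linear_predictor(coefs, intercept, row)
p = predict(coefs, intercept, row, link="logit")
```

The same coefficients give a different prediction under each link, which is why family and link have to be chosen to match the response.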
  • 10. H2O.ai
 Generalized Linear Models 2
 - Linear Model
   - defined by vector of coefficients
   - 1 number per predictor
 - Parametrized by Family and Link
   - Family
     - our assumption about distribution of the response
     - e.g. poisson for regression on counts, binomial for two class classification
   - Link
     - non-linear transform of the response
     - e.g. logit to generate s-curve for logistic regression
 - Fitted by maximum likelihood
   - pick the model with max probability of seeing the data
   - need an iterative solver (e.g. Newton's method, L-BFGS)
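
As a rough sketch of what "fitted by maximum likelihood with an iterative solver" means, the following plain-Python example fits a logistic regression with one predictor using Newton steps. It is not H2O's solver: the toy data, the fixed iteration count, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(beta, xs, ys):
    """Gradient of the binomial log-likelihood for intercept b0 and slope b1."""
    b0, b1 = beta
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = y - sigmoid(b0 + b1 * x)
        g0 += r
        g1 += r * x
    return g0, g1

def fit_logistic_newton(xs, ys, iterations=25):
    """Maximize the log-likelihood with Newton steps (single predictor)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1 = score((b0, b1), xs, ys)
        # Negative Hessian: sum of w * [1, x][1, x]^T with weights w = p * (1 - p).
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system H * step = g with Cramer's rule.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy, non-separable data: larger x makes y = 1 more likely.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
```

At the maximum the gradient returned by `score` is zero, which gives a simple convergence check.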
  • 11. H2O.ai
 Generalized Linear Models 3
 [Figure: simple 2-class classification example]
 - Linear Regression fit (family=gaussian, link=identity)
 - Logistic Regression fit (family=binomial, link=logit)
  • 12. H2O.ai
 Penalized Generalized Linear Models
 - Problems
   - can overfit
     - works great on training, fails on test
   - solution is not unique with correlated variables
 - Solution - Add Regularization
   - add penalty to reduce size of the coefficient vector
   - L1 or L2 norm of the coefficient vector
 - L1 versus L2
   - L2: dense solution
     - coefficients of correlated variables are pushed to the same value
   - L1: sparse solution
     - picks one correlated variable, others discarded
 - Elastic Net
   - combination of L1 and L2
   - sparse solution, correlated variables grouped, enter/leave the model together
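
The penalties on this slide can be written down directly. The sketch below assumes the common elastic net parameterization lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2); the helper names are hypothetical. The two shrinkage functions show, for a single coefficient, why L1 produces exact zeros while L2 only scales coefficients down.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

def soft_threshold(z, t):
    """L1 shrinkage: move z toward 0 by t, and set it to exactly 0 if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ridge_shrink(z, t):
    """L2 shrinkage: scale z toward 0, never exactly to 0 for nonzero z."""
    return z / (1.0 + t)

beta = [1.0, -2.0]
lasso = elastic_net_penalty(beta, lam=0.1, alpha=1.0)   # pure L1
ridge = elastic_net_penalty(beta, lam=0.1, alpha=0.0)   # pure L2
mixed = elastic_net_penalty(beta, lam=0.1, alpha=0.5)   # elastic net
```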
  • 13. H2O.ai
 GLM on H2O
 - Fully Distributed and Parallel
   - handles datasets with up to hundreds of thousands of predictors
   - scales linearly with number of rows
   - processes datasets with hundreds of millions of rows in seconds
 - All standard GLM features
   - standard families
   - support for observation weights and offset
 - Elastic Net Regularization
   - lambda-search: efficient computation of optimal regularization strength
   - applies strong rules to filter out inactive coefficients
 - Several solvers for different problems
   - Iteratively reweighted least squares with ADMM solver
   - L-BFGS for wide problems
   - Coordinate Descent (Experimental)
  • 14. H2O.ai
 GLM on H2O 2
 - Automatic handling of categorical variables
   - automatically expands categoricals into 1-hot encoded binary vectors
   - efficient handling (sparse access, sparse covariance matrix)
   - (unlike R) uses all levels by default if running with regularization
 - Missing value handling
   - missing values are not handled and rows with any missing value will be omitted from the training dataset
   - need to impute missing values up front if there are many
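
To make the 1-hot expansion concrete, here is a small plain-Python sketch. It is not how H2O implements the expansion (the slide notes H2O does it internally and sparsely); the `one_hot` helper, the column naming, and the `carrier` values are invented for the example. `drop_first=True` mimics the reference-level convention, while the default keeps all levels.

```python
def one_hot(name, values, drop_first=False):
    """Expand a categorical column into 0/1 indicator columns.

    drop_first=False keeps every level; drop_first=True drops the first
    (reference) level.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    columns = [f"{name}.{level}" for level in kept]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return columns, rows

carrier = ["AA", "UA", "DL", "AA"]
cols_all, rows_all = one_hot("carrier", carrier)
cols_ref, rows_ref = one_hot("carrier", carrier, drop_first=True)
```

With all levels kept, every row has exactly one indicator set; with a reference level dropped, rows of that level become all zeros.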
  • 15. H2O.ai
 EC2 Demo Cluster: 8 nodes, 64 cores
 H2O Deep Learning: Expect good cluster utilization :)
  • 16. H2O.ai
 Airline Data: Predict Delayed Departure
 - Predict dep. delay Y/N
 - 20 years of domestic airline flight data
 - 116M rows, 31 columns
 - 12 GB CSV, 4 GB compressed
  • 17. H2O.ai
 Results in Seconds on Big Data
 - Logistic Regression: ~20s
   - elastic net, alpha=0.5, lambda=1.379e-4 (auto)
 - Deep Learning: ~70s
   - 4 hidden ReLU layers of 20 neurons, 2 epochs
 - 8-node EC2 cluster: 64 virtual cores, 1GbE
 - All cores maxed out
 - Year, Month, Sched. Dep. Time have non-linear impact
 - Chicago, Atlanta, Dallas: often delayed
 - +9% AUC
  • 18. H2O.ai
 Output Fields
 - Same standard metrics as other H2O algos, plus:
   - residual deviance
   - null deviance
   - degrees of freedom
 - Coefficients / standardized coefficients
   - the actual model
   - one number per predictor
   - model is fitted on standardized data by default (parameter)
   - standardized coefficients are the actual coefficients fitted on standardized data
   - (non-standardized) coefficients are the de-scaled version of the standardized coefficients (so that they can be applied to the original dataset)
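
The relationship between the two coefficient sets can be sketched as follows. The example assumes standardization means subtracting each column's mean and dividing by its standard deviation; the `descale` helper and all numbers are hypothetical.

```python
def descale(std_coefs, std_intercept, means, sds):
    """Convert coefficients fitted on (x - mean) / sd back to the original scale."""
    coefs = [b / s for b, s in zip(std_coefs, sds)]
    shift = sum(b * m / s for b, m, s in zip(std_coefs, means, sds))
    return coefs, std_intercept - shift

def linear_predictor(coefs, intercept, x):
    return sum(c * xi for c, xi in zip(coefs, x)) + intercept

# Hypothetical standardized model for two predictors.
std_coefs, std_intercept = [2.0, -1.0], 0.5
means, sds = [10.0, 3.0], [4.0, 0.5]
coefs, intercept = descale(std_coefs, std_intercept, means, sds)

# Both forms give the same linear predictor for the same row.
row = [12.0, 2.5]
row_std = [(x - m) / s for x, m, s in zip(row, means, sds)]
eta_std = linear_predictor(std_coefs, std_intercept, row_std)
eta_raw = linear_predictor(coefs, intercept, row)
```

Standardized coefficients are the ones to compare for relative importance, since they are on a common scale; the de-scaled ones are what gets applied to raw data.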
  • 19. H2O.ai
 Avito Dataset: Predict User Clicks
 - Kaggle competition: https://www.kaggle.com/c/avito-context-ad-clicks/data
 - 20M rows, 21 columns
 - ~3 GB CSV, ~1 GB compressed
 - Some munging needed, see https://github.com/h2oai/0xdata.com/blob/master/src/blog/2015_10_DataMunging._md
 - http://www.slideshare.net/0xdata/400-million-search-results-predict-contextual-ad-clicks
  • 20. H2O.ai
 Avito Dataset: Predict User Clicks
 - Running GLM straight on the data is fast, but accuracy is not great
 - We will try to improve it by:
   - turning variables into categoricals
   - imputing NAs with means
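
Mean imputation, one of the two improvements listed, is simple enough to sketch in plain Python. This is an illustration only, not the H2O call used in the talk; missing values are represented by `None` and the `price` column is made up.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def count_complete_rows(columns):
    """Number of rows with no missing value in any column."""
    return sum(all(v is not None for v in row) for row in zip(*columns))

price = [10.0, None, 30.0, None]
clicks = [0, 1, 0, 1]

rows_before = count_complete_rows([price, clicks])
price_imputed = impute_mean(price)
rows_after = count_complete_rows([price_imputed, clicks])
```

Because rows with any missing value are omitted from training, imputing up front keeps rows that would otherwise be dropped.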
  • 21. H2O.ai
 Avito Dataset: Predict User Clicks
 - Further improvements
   - cut numerical columns into intervals to make new categoricals
     - use h2o.hist + h2o.cut
   - add interactions
     - use h2o.interaction
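
The two feature-engineering steps can be mimicked in plain Python to show what they produce. These helpers are hypothetical stand-ins, not the h2o.cut or h2o.interaction APIs: the break points are hand-picked rather than taken from a histogram, and the left-closed interval convention is an assumption of this sketch.

```python
from bisect import bisect_right

def cut(values, breaks):
    """Map each number to a left-closed interval label such as '[10, 20)'."""
    labels = []
    for v in values:
        i = bisect_right(breaks, v)
        lo = breaks[i - 1] if i > 0 else float("-inf")
        hi = breaks[i] if i < len(breaks) else float("inf")
        labels.append(f"[{lo}, {hi})")
    return labels

def interaction(a, b):
    """Combine two categorical columns into one with a level per observed pair."""
    return [f"{x}_{y}" for x, y in zip(a, b)]

# Made-up columns for illustration.
position = [5, 10, 25]
position_bin = cut(position, breaks=[10, 20])
category = ["cars", "phones", "cars"]
position_x_category = interaction(position_bin, category)
```

Binning lets a linear model give each interval its own coefficient, which is one way to capture a non-linear effect of a numeric column.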
  • 22. H2O.ai
 General Usage Guidelines
 - Solver selection
   - IRLSM
     - default choice with L1 penalty
     - works great with small number of predictors
     - efficient L1 solver
     - can handle wide datasets with lambda search and L1 penalty
   - L-BFGS
     - handles wide data well, but
     - can iterate a lot (take long time), especially with L1 penalty
     - tune the objective epsilon
       - often many iterations are spent on minor improvements
 - Regularization Selection
   - Compare sparse versus dense
     - compare runs with alpha >= .5, alpha == 0
     - generally L1 does slightly better
   - Run lambda search to pick optimal regularization strength
  • 23. H2O.ai
 General Usage Guidelines 2
 - Do not pre-expand categorical variables
   - H2O expands categorical variables automatically
   - way more efficient
 - Adding features
   - Splitting numerical variables into intervals helps
   - Adding categorical interactions helps
 - Using Lambda-Search
   - always use validation data set
     - otherwise picks lambda.min
   - validation dataset is used to pick the best lambda value -> need separate test set!
   - check that lambda.best > lambda.min
     - otherwise did not start overfitting, smaller lambda may be better
     - re-run with smaller lambda.min
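
The lambda-search checks above can be sketched in plain Python. This is not H2O's implementation: the helper names, the log-spaced path, and the validation deviances are all made up for illustration.

```python
import math

def lambda_path(lambda_max, min_ratio=1e-4, n=10):
    """Log-spaced values from lambda_max down to lambda_max * min_ratio."""
    step = math.log(min_ratio) / (n - 1)
    return [lambda_max * math.exp(i * step) for i in range(n)]

def pick_lambda(lambdas, validation_deviance):
    """Return the lambda with the lowest validation deviance, and whether it
    sits at the small end of the path (lambda.best == lambda.min)."""
    best = min(range(len(lambdas)), key=lambda i: validation_deviance[i])
    at_path_end = lambdas[best] == min(lambdas)
    return lambdas[best], at_path_end

lambdas = [1.0, 0.1, 0.01, 0.001]

# Hypothetical run 1: deviance rises again at the smallest lambda (overfitting began).
best_1, rerun_1 = pick_lambda(lambdas, [0.70, 0.61, 0.58, 0.60])

# Hypothetical run 2: deviance is still falling at the end of the path.
best_2, rerun_2 = pick_lambda(lambdas, [0.70, 0.61, 0.58, 0.55])
```

In the second run the flag is true, which corresponds to the slide's advice to re-run with a smaller lambda.min. Since the validation set chose lambda, final accuracy should be reported on a separate test set.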
  • 24. H2O.ai
 More Info in H2O Booklets: http://h2o.ai/resources