SlideShare a Scribd company logo
1 of 33
Download to read offline
Scalable Automatic Machine 

Learning with H2O
Erin LeDell Ph.D.

@ledell
WiMLDS Paris & Paris ML Joint Meetup

Sept 2018
What is H2O?
H2O.ai, the
company
H2O, the
platform
• Founded in 2012
• Advised by Stanford Professors Hastie, Tibshirani & Boyd
• Headquarters: Mountain View, California, USA
• Open Source Software (Apache 2.0 Licensed)
• R, Python, Scala, Java and Web Interfaces
• Distributed Machine Learning Algorithms for Big Data
Agenda
• H2O Platform
• Intro to Automatic Machine Learning (AutoML)
• H2O AutoML Overview
• Pro Tips
• Demo
Slides ⬇ https://tinyurl.com/paris-automl
H2O Platform
H2O Machine Learning Platform
• Distributed (multi-core + multi-node) implementations of
cutting edge ML algorithms.
• Core algorithms written in high performance Java.
• APIs available in R, Python, Scala; web GUI.
• Easily deploy models to production as pure Java code.
• Works on Hadoop, Spark, EC2, your laptop, etc.
H2O Distributed Computing
H2O Cluster
H2O Frame
• Multi-node cluster with shared memory model.
• All computations in memory.
• Each node sees only some rows of the data.
• No limit on cluster size.
• Distributed data frames (collection of vectors).
• Columns are distributed (across nodes) arrays.
• Works just like R’s data.frame or Python Pandas
DataFrame
H2O Machine Learning Features
• Supervised & unsupervised machine learning algos
(GBM, RF, DNN, GLM, Stacked Ensembles, etc.)
• Imputation, normalization & auto one-hot-encoding
• Automatic early stopping
• Cross-validation, grid search & random search
• Variable importance, model evaluation metrics, plots
Intro to Automatic
Machine Learning
Aspects of Automatic Machine Learning
Data Prep
Model

Generation
Ensembles
Aspects of Automatic Machine Learning
• Cartesian grid search or random grid search
• Bayesian Hyperparameter Optimization
• Individual models can be tuned using a validation set
Data
Preprocessing
Model

Generation
Ensembles
• Imputation, one-hot encoding, standardization
• Feature selection and/or feature extraction (e.g. PCA)
• Count/Label/Target encoding of categorical features
• Ensembles often out-perform individual models
• Stacking / Super Learning (Wolpert, Breiman)
• Ensemble Selection (Caruana)
H2O’s AutoML
H2O AutoML (current release)
• Cartesian grid search or random grid search
• Bayesian Hyperparameter Optimization
• Individual models can be tuned using a validation set
Data
Preprocessing
Model

Generation
Ensembles
• Imputation, one-hot encoding, standardization
• Feature selection and/or feature extraction (e.g. PCA)
• Count/Label/Target encoding of categorical features
• Ensembles often out-perform individual models:
• Stacking / Super Learning (Wolpert, Breiman)
• Ensemble Selection (Caruana)
Random Grid Search & Stacking
• Random Grid Search combined with Stacked
Ensembles is a powerful combination.

• Ensembles perform particularly well if the models
they are based on (1) are individually strong, 

and (2) make uncorrelated errors.

• Stacking uses a second-level metalearning algorithm
to find the optimal combination of base learners.
H2O AutoML
• Basic data pre-processing (as in all H2O algos).
• Trains a random grid of GBMs, DNNs, GLMs, etc.
using a carefully chosen hyper-parameter space
• Individual models are tuned using a validation set.
• Two Stacked Ensembles are trained (“All Models”
ensemble & a lightweight “Best of Family” ensemble).
• Returns a sorted “Leaderboard” of all models.
Available in H2O >=3.14

H2O AutoML in Flow GUI
H2O AutoML in R
H2O AutoML in Python
H2O AutoML Leaderboard
Example Leaderboard for binary classification
AutoML Pro Tips!
Before you press the “red button”
AutoML Pro Tips: Input Frames
• Don’t use leaderboard_frame unless you
really need to; use cross-validation metrics to
generate the leaderboard instead (default).

• If you only provide training_frame, it will
chop off 20% of your data for a validation set
to be used in early stopping. To control this
proportion, you can split the data yourself and
pass a validation_frame manually.
AutoML Pro Tips: Exclude Algos
• If you have sparse, wide data (e.g. text), use
the exclude_algos argument to turn off the
tree-based models (GBM, RF).

• If you want tree-based algos only, turn off
GLM and DNNs via exclude_algos.
AutoML Pro Tips: Time & Model Limits
• AutoML will stop after 1 hour unless you
change max_runtime_secs.

• Running with max_runtime_secs is not
reproducible since available resources on a
machine may change from run to run. Set
max_runtime_secs to a big number (e.g.
999999999) and use max_models instead.
AutoML Pro Tips: Cluster memory
• Reminder: All H2O models are stored in H2O
Cluster memory.
• Make sure to give the H2O Cluster a lot of
memory if you’re going to create hundreds or
thousands of models.
• e.g. h2o.init(max_mem_size = “80G”)
After you press the “red button”
AutoML Pro Tips: Early Stopping
• If you’re expecting more models than are listed
in the leaderboard, or the run is stopping
earlier than max_runtime_secs, this is a
result of the default “early stopping” settings.

• To allow more time, increase the number of
stopping_rounds and/or decrease value of
stopping_tolerance.
AutoML Pro Tips: Add More Models
• If you want to add (train) more models to an
existing AutoML project, just make sure to use
the same training set and project_name.

• If you set the same seed twice it will give you
identical models as the first run (not useful), so
change the seed or leave it unset.
AutoML Pro Tips: Saving Models
• You can save any of the individual models
created by the AutoML run. The model ids are
listed in the leaderboard.
• If you’re taking your leader model (probably a
Stacked Ensemble) to production, we’d
recommend using “Best of Family” since it only
contains 5 models and gets most of the
performance of the “All Models” ensemble.
H2O AutoML Tutorial
H2O AutoML Tutorial
https://tinyurl.com/automl-h2oworld17
Code available here
H2O Resources
• Documentation: http://docs.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slidedecks: https://github.com/h2oai/h2o-meetups
• Videos: https://www.youtube.com/user/0xdata
• Stack Overflow: https://stackoverflow.com/tags/h2o
• Google Group: https://tinyurl.com/h2ostream
• Gitter: http://gitter.im/h2oai/h2o-3
• Events & Meetups: http://h2o.ai/events
Contribute to H2O!
Get in touch over email, Gitter or JIRA.

https://tinyurl.com/h2o-contribute
Thank you!
@ledell on Github, Twitter
erin@h2o.ai
http://www.stat.berkeley.edu/~ledell

More Related Content

Similar to Scalable Automatic Machine Learning with H2O” by Erin LeDell, Chief Machine Learning Scientist at H2O.ai

Scalable Automatic Machine Learning in H2O
 Scalable Automatic Machine Learning in H2O Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2OSri Ambati
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)Julien SIMON
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Julien SIMON
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotAmazon Web Services
 
Scalable Automatic Machine Learning with H2O
Scalable Automatic Machine Learning with H2OScalable Automatic Machine Learning with H2O
Scalable Automatic Machine Learning with H2OSri Ambati
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Julien SIMON
 
Advanced Machine Learning with Amazon SageMaker
Advanced Machine Learning with Amazon SageMakerAdvanced Machine Learning with Amazon SageMaker
Advanced Machine Learning with Amazon SageMakerJulien SIMON
 
O365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin Timmermann
O365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin TimmermannO365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin Timmermann
O365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin TimmermannNCCOMMS
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...AWS Summits
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Amazon Web Services
 
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...Amazon Web Services
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Julien SIMON
 
From Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMakerFrom Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMakerAmazon Web Services
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks
 
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Amazon Web Services
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopJulien SIMON
 
Data meets AI - ATP Roadshow India
Data meets AI - ATP Roadshow IndiaData meets AI - ATP Roadshow India
Data meets AI - ATP Roadshow IndiaSandesh Rao
 
Cast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best PracticesCast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best PracticesSarath Ambadas
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Amazon Web Services
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
 

Similar to Scalable Automatic Machine Learning with H2O” by Erin LeDell, Chief Machine Learning Scientist at H2O.ai (20)

Scalable Automatic Machine Learning in H2O
 Scalable Automatic Machine Learning in H2O Scalable Automatic Machine Learning in H2O
Scalable Automatic Machine Learning in H2O
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
 
Scalable Automatic Machine Learning with H2O
Scalable Automatic Machine Learning with H2OScalable Automatic Machine Learning with H2O
Scalable Automatic Machine Learning with H2O
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Advanced Machine Learning with Amazon SageMaker
Advanced Machine Learning with Amazon SageMakerAdvanced Machine Learning with Amazon SageMaker
Advanced Machine Learning with Amazon SageMaker
 
O365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin Timmermann
O365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin TimmermannO365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin Timmermann
O365Con18 - Using ARM Templates to Deploy Solutions on Azure - Kevin Timmermann
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
 
From Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMakerFrom Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMaker
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS...
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
 
Data meets AI - ATP Roadshow India
Data meets AI - ATP Roadshow IndiaData meets AI - ATP Roadshow India
Data meets AI - ATP Roadshow India
 
Cast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best PracticesCast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best Practices
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 

More from Paris Women in Machine Learning and Data Science

More from Paris Women in Machine Learning and Data Science (20)

Managing international tech teams, by Natasha Dimban
Managing international tech teams, by Natasha DimbanManaging international tech teams, by Natasha Dimban
Managing international tech teams, by Natasha Dimban
 
Optimizing GenAI apps, by N. El Mawass and Maria Knorps
Optimizing GenAI apps, by N. El Mawass and Maria KnorpsOptimizing GenAI apps, by N. El Mawass and Maria Knorps
Optimizing GenAI apps, by N. El Mawass and Maria Knorps
 
Perspectives, by M. Pannegeon
Perspectives, by M. PannegeonPerspectives, by M. Pannegeon
Perspectives, by M. Pannegeon
 
Evaluation strategies for dealing with partially labelled or unlabelled data
Evaluation strategies for dealing with partially labelled or unlabelled dataEvaluation strategies for dealing with partially labelled or unlabelled data
Evaluation strategies for dealing with partially labelled or unlabelled data
 
Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...
Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...
Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...
 
An age-old question, by Caroline Jean-Pierre
An age-old question, by Caroline Jean-PierreAn age-old question, by Caroline Jean-Pierre
An age-old question, by Caroline Jean-Pierre
 
Applying Churn Prediction Approaches to the Telecom Industry, by Joëlle Lautré
Applying Churn Prediction Approaches to the Telecom Industry, by Joëlle LautréApplying Churn Prediction Approaches to the Telecom Industry, by Joëlle Lautré
Applying Churn Prediction Approaches to the Telecom Industry, by Joëlle Lautré
 
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure SoulierHow to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
 
Global Ambitions Local Realities, by Anna Abreu
Global Ambitions Local Realities, by Anna AbreuGlobal Ambitions Local Realities, by Anna Abreu
Global Ambitions Local Realities, by Anna Abreu
 
Plug-and-Play methods for inverse problems in imagine, by Julie Delon
Plug-and-Play methods for inverse problems in imagine, by Julie DelonPlug-and-Play methods for inverse problems in imagine, by Julie Delon
Plug-and-Play methods for inverse problems in imagine, by Julie Delon
 
Sales Forecasting as a Data Product by Francesca Iannuzzi
Sales Forecasting as a Data Product by Francesca IannuzziSales Forecasting as a Data Product by Francesca Iannuzzi
Sales Forecasting as a Data Product by Francesca Iannuzzi
 
Identifying and mitigating bias in machine learning, by Ruta Binkyte
Identifying and mitigating bias in machine learning, by Ruta BinkyteIdentifying and mitigating bias in machine learning, by Ruta Binkyte
Identifying and mitigating bias in machine learning, by Ruta Binkyte
 
“Turning your ML algorithms into full web apps in no time with Python" by Mar...
“Turning your ML algorithms into full web apps in no time with Python" by Mar...“Turning your ML algorithms into full web apps in no time with Python" by Mar...
“Turning your ML algorithms into full web apps in no time with Python" by Mar...
 
Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...
Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...
Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...
 
Sandrine Henry presents the BechdelAI project
Sandrine Henry presents the BechdelAI projectSandrine Henry presents the BechdelAI project
Sandrine Henry presents the BechdelAI project
 
Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...
Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...
Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...
 
Khrystyna Grynko WiMLDS - From marketing to Tech.pdf
Khrystyna Grynko WiMLDS - From marketing to Tech.pdfKhrystyna Grynko WiMLDS - From marketing to Tech.pdf
Khrystyna Grynko WiMLDS - From marketing to Tech.pdf
 
Iana Iatsun_ML in production_20Dec2022.pdf
Iana Iatsun_ML in production_20Dec2022.pdfIana Iatsun_ML in production_20Dec2022.pdf
Iana Iatsun_ML in production_20Dec2022.pdf
 
41 WiMLDS Kyiv Paris Poznan.pdf
41 WiMLDS Kyiv Paris Poznan.pdf41 WiMLDS Kyiv Paris Poznan.pdf
41 WiMLDS Kyiv Paris Poznan.pdf
 
Emergency plan to secure winter: what are the measures set up by RTE?
Emergency plan to secure winter: what are the measures set up by RTE?Emergency plan to secure winter: what are the measures set up by RTE?
Emergency plan to secure winter: what are the measures set up by RTE?
 

Recently uploaded

Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniquesugginaramesh
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 

Recently uploaded (20)

Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniques
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 

Scalable Automatic Machine Learning with H2O” by Erin LeDell, Chief Machine Learning Scientist at H2O.ai

  • 1. Scalable Automatic Machine 
 Learning with H2O Erin LeDell Ph.D.
 @ledell WiMLDS Paris & Paris ML Joint Meetup
 Sept 2018
  • 2. What is H2O? H2O.ai, the company H2O, the platform • Founded in 2012 • Advised by Stanford Professors Hastie, Tibshirani & Boyd • Headquarters: Mountain View, California, USA • Open Source Software (Apache 2.0 Licensed) • R, Python, Scala, Java and Web Interfaces • Distributed Machine Learning Algorithms for Big Data
  • 3. Agenda • H2O Platform • Intro to Automatic Machine Learning (AutoML) • H2O AutoML Overview • Pro Tips • Demo Slides ⬇ https://tinyurl.com/paris-automl
  • 5. H2O Machine Learning Platform • Distributed (multi-core + multi-node) implementations of cutting edge ML algorithms. • Core algorithms written in high performance Java. • APIs available in R, Python, Scala; web GUI. • Easily deploy models to production as pure Java code. • Works on Hadoop, Spark, EC2, your laptop, etc.
  • 6. H2O Distributed Computing H2O Cluster H2O Frame • Multi-node cluster with shared memory model. • All computations in memory. • Each node sees only some rows of the data. • No limit on cluster size. • Distributed data frames (collection of vectors). • Columns are distributed (across nodes) arrays. • Works just like R’s data.frame or Python Pandas DataFrame
  • 7. H2O Machine Learning Features • Supervised & unsupervised machine learning algos (GBM, RF, DNN, GLM, Stacked Ensembles, etc.) • Imputation, normalization & auto one-hot-encoding • Automatic early stopping • Cross-validation, grid search & random search • Variable importance, model evaluation metrics, plots
  • 9. Aspects of Automatic Machine Learning Data Prep Model
 Generation Ensembles
  • 10. Aspects of Automatic Machine Learning • Cartesian grid search or random grid search • Bayesian Hyperparameter Optimization • Individual models can be tuned using a validation set Data Preprocessing Model
 Generation Ensembles • Imputation, one-hot encoding, standardization • Feature selection and/or feature extraction (e.g. PCA) • Count/Label/Target encoding of categorical features • Ensembles often out-perform individual models • Stacking / Super Learning (Wolpert, Breiman) • Ensemble Selection (Caruana)
  • 12. H2O AutoML (current release) • Cartesian grid search or random grid search • Bayesian Hyperparameter Optimization • Individual models can be tuned using a validation set Data Preprocessing Model
 Generation Ensembles • Imputation, one-hot encoding, standardization • Feature selection and/or feature extraction (e.g. PCA) • Count/Label/Target encoding of categorical features • Ensembles often out-perform individual models: • Stacking / Super Learning (Wolpert, Breiman) • Ensemble Selection (Caruana)
  • 13. Random Grid Search & Stacking • Random Grid Search combined with Stacked Ensembles is a powerful combination.
 • Ensembles perform particularly well if the models they are based on (1) are individually strong, 
 and (2) make uncorrelated errors.
 • Stacking uses a second-level metalearning algorithm to find the optimal combination of base learners.
  • 14. H2O AutoML • Basic data pre-processing (as in all H2O algos). • Trains a random grid of GBMs, DNNs, GLMs, etc. using a carefully chosen hyper-parameter space • Individual models are tuned using a validation set. • Two Stacked Ensembles are trained (“All Models” ensemble & a lightweight “Best of Family” ensemble). • Returns a sorted “Leaderboard” of all models. Available in H2O >=3.14

  • 15. H2O AutoML in Flow GUI
  • 17. H2O AutoML in Python
  • 18. H2O AutoML Leaderboard Example Leaderboard for binary classification
  • 20. Before you press the “red button”
  • 21. AutoML Pro Tips: Input Frames • Don’t use leaderboard_frame unless you really need to; use cross-validation metrics to generate the leaderboard instead (default).
 • If you only provide training_frame, it will chop off 20% of your data for a validation set to be used in early stopping. To control this proportion, you can split the data yourself and pass a validation_frame manually.
  • 22. AutoML Pro Tips: Exclude Algos • If you have sparse, wide data (e.g. text), use the exclude_algos argument to turn off the tree-based models (GBM, RF).
 • If you want tree-based algos only, turn off GLM and DNNs via exclude_algos.
  • 23. AutoML Pro Tips: Time & Model Limits • AutoML will stop after 1 hour unless you change max_runtime_secs.
 • Running with max_runtime_secs is not reproducible since available resources on a machine may change from run to run. Set max_runtime_secs to a big number (e.g. 999999999) and use max_models instead.
  • 24. AutoML Pro Tips: Cluster memory • Reminder: All H2O models are stored in H2O Cluster memory. • Make sure to give the H2O Cluster a lot of memory if you’re going to create hundreds or thousands of models. • e.g. h2o.init(max_mem_size = “80G”)
  • 25. After you press the “red button”
  • 26. AutoML Pro Tips: Early Stopping • If you’re expecting more models than are listed in the leaderboard, or the run is stopping earlier than max_runtime_secs, this is a result of the default “early stopping” settings.
 • To allow more time, increase the number of stopping_rounds and/or decrease value of stopping_tolerance.
  • 27. AutoML Pro Tips: Add More Models • If you want to add (train) more models to an existing AutoML project, just make sure to use the same training set and project_name.
 • If you set the same seed twice it will give you identical models as the first run (not useful), so change the seed or leave it unset.
  • 28. AutoML Pro Tips: Saving Models • You can save any of the individual models created by the AutoML run. The model ids are listed in the leaderboard. • If you’re taking your leader model (probably a Stacked Ensemble) to production, we’d recommend using “Best of Family” since it only contains 5 models and gets most of the performance of the “All Models” ensemble.
  • 31. H2O Resources • Documentation: http://docs.h2o.ai • Tutorials: https://github.com/h2oai/h2o-tutorials • Slidedecks: https://github.com/h2oai/h2o-meetups • Videos: https://www.youtube.com/user/0xdata • Stack Overflow: https://stackoverflow.com/tags/h2o • Google Group: https://tinyurl.com/h2ostream • Gitter: http://gitter.im/h2oai/h2o-3 • Events & Meetups: http://h2o.ai/events
  • 32. Contribute to H2O! Get in touch over email, Gitter or JIRA.
 https://tinyurl.com/h2o-contribute
  • 33. Thank you! @ledell on Github, Twitter erin@h2o.ai http://www.stat.berkeley.edu/~ledell