Improving Model Predictions via Stacking and Hyper-parameters Tuning

Jo-fai Chow
Jo-fai ChowData Scientist at H2O.ai - Customer Success, Smart Apps, Machine Learning & Data Visualization
Improving Model Predictions via
Stacking and Hyper-Parameters Tuning
Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulus
About Me
• 2005 - 2015
• Water Engineer
o Consultant for Utilities
o EngD Research
• 2015 - Present
• Data Scientist
o Virgin Media
o Domino Data Lab
o H2O.ai
2
Mango Data Science Radar
3
About This Talk
• Predictive modelling
o Kaggle as an example
• Improve predictions
with simple tricks
• Use data science for
social good 👍
4
About Kaggle
• World’s biggest predictive
modelling competition
platform
• 560k members
• Competition types:
o Featured (prize)
o Recruitment
o Playground
o 101
5
Predicting Shelter Animal Outcomes
• X: Predictors
o Name
o Gender
o Type (🐱 or 🐶)
o Date & Time
o Age
o Breed
o Colour
• Y: Outcomes (5 types)
o Adoption
o Died
o Euthanasia
o Return to Owner
o Transfer
• Data
o Training (27k samples)
o Test (11k)
6
Basic Feature Engineering
X Raw (Before) Reformatted (After)
Name Elsa, Steve, Lassie [name_len]: 4, 5, 6
Date & Time 2014-02-12 18:22:00 [year]: 2014
[month]: 2
[weekday]: 4
[hour]: 18
Age 1 year, 3 weeks, 2 days [age_day]: 365, 21, 2
Breed German Shepherd, Pit Bull Mix [is_mix]: 0, 1
Colour Brown Brindle/White [simple_colour]: brown
7
Common Machine Learning Techniques
• Ensembles
o Bagging/boosting of
decision trees
o Reduces variance and
increase accuracy
o Popular R Packages
(used in next example)
• “randomForest”
• “xgboost”
• There are a lot more
machine learning
packages in R:
o “caret”, “caretEnsemble”
o “h2o”, “h2oEnsemble”
o “mlr”
8
Simple Trick – Model Averaging
• Stratified sampling
o 80% for training
o 20% for validation
• Evaluation metric
o Multi-class Log Loss
o Lower the better
o 0 = Perfect
• 50 runs
o different random seed
9
More Advanced Methods
• Model Stacking
o Uses a second-level
metalearner to learn the
optimal combination of
base learners
o R Packages:
• “SuperLearner”
• “subsemble”
• “h2oEnsemble”
• “caretEnsemble”
• Hyper-parameters Tuning
o Improves the performance
of individual machine
learning algorithms
o Grid search
• Full / Random
o R Packages:
• “caret”
• “h2o”
10
For more info, see
https://github.com/h2oai/h2o-meetups/tree/master/2016_05_20_MLconf_Seattle_Scalable_Ensembles
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.Rmd
Trade-Off of Advanced Methods
• Strength
o Model tuning + stacking
won nearly all Kaggle
competitions.
o Multi-algorithm
ensemble may better
approximate the true
predictive function than
any single algorithm.
• Weakness
o Increased training and
prediction times.
o Increased model
complexity.
o Requires large machines
or clusters for big data.
11
R + H2O = Scalable Machine Learning
• H2O is an open-source,
distributed machine
learning library written in
Java with APIs in R,
Python and more.
• ”h2oEnsemble” is the
scalable implementation
of the Super Learner
algorithm for H2O.
12
H2O Random Grid Search Example
13
Define search range and criteria
Best models
H2O Model Stacking Example
14
17 out of 717 teams (≈ top 2%)
Getting reasonable resultsUsing h2o.stack(…) to combine multiple models
Conclusions
• Many R packages for
predictive modelling.
• Use hyper-parameters
tuning to improve
individual models.
• Use model averaging /
stacking to improve
predictions.
• Trade-off between model
performance and
computational costs.
• Use R + H2O for scalable
machine learning.
• H2O random grid search
and stacking.
• Use data science for
social good 👍
15
Big Thank You!
• Mango Solutions
• RStudio
• Domino Data Lab
• H2O
o Erin LeDell
o Raymond Peck
o Arno Candel
16
1st LondonR Talk
Crime Map Shiny App
bit.ly/londonr_crimemap
2nd LondonR Talk
Domino API Endpoint
bit.ly/1cYbZbF
Any Questions?
• Contact
o joe@h2o.ai
o @matlabulous
o github.com/woobe
• Slides & Code
o github.com/h2oai/h2o-
meetups
• H2O in London
o Meetups / Office (soon)
o www.h2o.ai/careers
• More H2O at Strata
tomorrow
o Innards of H2O (11:15)
o Intro to Generalised Low-
Rank Models (14:05)
17
Extra Slide (Stratified Sampling)
18
1 of 18

Recommended

Introduction to Generalised Low-Rank Model and Missing Values by
Introduction to Generalised Low-Rank Model and Missing ValuesIntroduction to Generalised Low-Rank Model and Missing Values
Introduction to Generalised Low-Rank Model and Missing ValuesJo-fai Chow
2.3K views22 slides
Kaggle competitions, new friends, new skills and new opportunities by
Kaggle competitions, new friends, new skills and new opportunitiesKaggle competitions, new friends, new skills and new opportunities
Kaggle competitions, new friends, new skills and new opportunitiesJo-fai Chow
973 views28 slides
Using H2O Random Grid Search for Hyper-parameters Optimization by
Using H2O Random Grid Search for Hyper-parameters OptimizationUsing H2O Random Grid Search for Hyper-parameters Optimization
Using H2O Random Grid Search for Hyper-parameters OptimizationJo-fai Chow
2.5K views21 slides
Kaggle Competitions, New Friends, New Skills and New Opportunities by
Kaggle Competitions, New Friends, New Skills and New OpportunitiesKaggle Competitions, New Friends, New Skills and New Opportunities
Kaggle Competitions, New Friends, New Skills and New OpportunitiesJo-fai Chow
363 views31 slides
Deploying your Predictive Models as a Service via Domino by
Deploying your Predictive Models as a Service via DominoDeploying your Predictive Models as a Service via Domino
Deploying your Predictive Models as a Service via DominoJo-fai Chow
19.1K views33 slides
Stacked Ensembles in H2O by
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2OSri Ambati
2.1K views29 slides

More Related Content

What's hot

Using H2O AutoML for Kaggle Competitions by
Using H2O AutoML for Kaggle CompetitionsUsing H2O AutoML for Kaggle Competitions
Using H2O AutoML for Kaggle CompetitionsSri Ambati
2.8K views46 slides
Big data week 2018 - Graph Analytics on Big Data by
Big data week 2018 - Graph Analytics on Big DataBig data week 2018 - Graph Analytics on Big Data
Big data week 2018 - Graph Analytics on Big DataChristos Hadjinikolis
145 views43 slides
Mining and Managing Large-scale Linked Open Data by
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataAnsgar Scherp
1.4K views47 slides
Project "Deep Water" by
Project "Deep Water"Project "Deep Water"
Project "Deep Water"Jo-fai Chow
1.5K views95 slides
A Comparison of Different Strategies for Automated Semantic Document Annotation by
A Comparison of Different Strategies for Automated Semantic Document AnnotationA Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document AnnotationAnsgar Scherp
1K views44 slides
Congressional PageRank: Graph Analytics of US Congress With Neo4j by
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jWilliam Lyon
1.4K views65 slides

What's hot(20)

Using H2O AutoML for Kaggle Competitions by Sri Ambati
Using H2O AutoML for Kaggle CompetitionsUsing H2O AutoML for Kaggle Competitions
Using H2O AutoML for Kaggle Competitions
Sri Ambati2.8K views
Mining and Managing Large-scale Linked Open Data by Ansgar Scherp
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open Data
Ansgar Scherp1.4K views
Project "Deep Water" by Jo-fai Chow
Project "Deep Water"Project "Deep Water"
Project "Deep Water"
Jo-fai Chow1.5K views
A Comparison of Different Strategies for Automated Semantic Document Annotation by Ansgar Scherp
A Comparison of Different Strategies for Automated Semantic Document AnnotationA Comparison of Different Strategies for Automated Semantic Document Annotation
A Comparison of Different Strategies for Automated Semantic Document Annotation
Ansgar Scherp1K views
Congressional PageRank: Graph Analytics of US Congress With Neo4j by William Lyon
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
William Lyon1.4K views
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud by Ansgar Scherp
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
Ansgar Scherp1.1K views
Benchmarking graph databases on the problem of community detection by Symeon Papadopoulos
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
Symeon Papadopoulos9.2K views
Gephi, Graphx, and Giraph by Doug Needham
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
Doug Needham2.1K views
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr... by Ansgar Scherp
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Ansgar Scherp693 views
Analyzing Data With Python by Sarah Guido
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
Sarah Guido4.1K views
Apache Spark GraphX highlights. by Doug Needham
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights.
Doug Needham2K views
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be... by Martin Junghanns
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Be...
Martin Junghanns3.1K views
Knowledge Discovery in Social Media and Scientific Digital Libraries by Ansgar Scherp
Knowledge Discovery in Social Media and Scientific Digital LibrariesKnowledge Discovery in Social Media and Scientific Digital Libraries
Knowledge Discovery in Social Media and Scientific Digital Libraries
Ansgar Scherp1.5K views
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr... by Oscar Corcho
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
Oscar Corcho25.1K views
Summary of the Stream Reasoning workshop at ISWC 2016 by Daniele Dell'Aglio
Summary of the Stream Reasoning workshop at ISWC 2016Summary of the Stream Reasoning workshop at ISWC 2016
Summary of the Stream Reasoning workshop at ISWC 2016
Daniele Dell'Aglio1.4K views
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio by Open Knowledge Belgium
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
SF Python Meetup: TextRank in Python by Paco Nathan
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
Paco Nathan5.7K views

Similar to Improving Model Predictions via Stacking and Hyper-parameters Tuning

Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab by
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabSri Ambati
1.2K views32 slides
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud by
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
444 views38 slides
Introduction to H2O and Model Stacking Use Cases by
Introduction to H2O and Model Stacking Use CasesIntroduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use CasesJo-fai Chow
1.1K views78 slides
Strata San Jose 2016: Scalable Ensemble Learning with H2O by
Strata San Jose 2016: Scalable Ensemble Learning with H2OStrata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2OSri Ambati
1.9K views29 slides
G3 talk rld_2 by
G3 talk rld_2G3 talk rld_2
G3 talk rld_2Robert Davidson
248 views39 slides
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ... by
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
167 views28 slides

Similar to Improving Model Predictions via Stacking and Hyper-parameters Tuning(20)

Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab by Sri Ambati
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Sri Ambati1.2K views
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud by Databricks
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Databricks444 views
Introduction to H2O and Model Stacking Use Cases by Jo-fai Chow
Introduction to H2O and Model Stacking Use CasesIntroduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use Cases
Jo-fai Chow1.1K views
Strata San Jose 2016: Scalable Ensemble Learning with H2O by Sri Ambati
Strata San Jose 2016: Scalable Ensemble Learning with H2OStrata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Sri Ambati1.9K views
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ... by Lucidworks
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Lucidworks167 views
Demographics andweblogtargeting by Doug Chang
Demographics andweblogtargetingDemographics andweblogtargeting
Demographics andweblogtargeting
Doug Chang532 views
Data Wrangling For Kaggle Data Science Competitions by Krishna Sankar
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
Krishna Sankar7.7K views
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking by Bruce Kozuma
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
Bruce Kozuma229 views
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh... by Lucidworks
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...
Lucidworks564 views
The Quest for an Open Source Data Science Platform by QAware GmbH
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
QAware GmbH686 views
A Modern Introduction to Decision Tree Ensembles by Ichigaku Takigawa
A Modern Introduction to Decision Tree EnsemblesA Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree Ensembles
Beauty and Big Data by Sri Ambati
Beauty and Big DataBeauty and Big Data
Beauty and Big Data
Sri Ambati1.9K views
Intro to Machine Learning with H2O and AWS by Sri Ambati
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
Sri Ambati11.2K views
From Labelling Open data images to building a private recommender system by Pierre Gutierrez
From Labelling Open data images to building a private recommender systemFrom Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender system
Pierre Gutierrez1.9K views
H2O at Poznan R Meetup by Jo-fai Chow
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R Meetup
Jo-fai Chow753 views
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek by Jo-fai Chow
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data GeekFrom Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
Jo-fai Chow1.8K views

More from Jo-fai Chow

Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny by
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and ShinyMaking Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and ShinyJo-fai Chow
702 views50 slides
Automatic and Interpretable Machine Learning in R with H2O and LIME by
Automatic and Interpretable Machine Learning in R with H2O and LIMEAutomatic and Interpretable Machine Learning in R with H2O and LIME
Automatic and Interpretable Machine Learning in R with H2O and LIMEJo-fai Chow
313 views92 slides
Automatic and Interpretable Machine Learning with H2O and LIME by
Automatic and Interpretable Machine Learning with H2O and LIMEAutomatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIMEJo-fai Chow
3.7K views57 slides
H2O at Berlin R Meetup by
H2O at Berlin R MeetupH2O at Berlin R Meetup
H2O at Berlin R MeetupJo-fai Chow
447 views71 slides
H2O at BelgradeR Meetup by
H2O at BelgradeR MeetupH2O at BelgradeR Meetup
H2O at BelgradeR MeetupJo-fai Chow
473 views100 slides
Introduction to Machine Learning with H2O and Python by
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonJo-fai Chow
716 views120 slides

More from Jo-fai Chow(17)

Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny by Jo-fai Chow
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and ShinyMaking Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny
Jo-fai Chow702 views
Automatic and Interpretable Machine Learning in R with H2O and LIME by Jo-fai Chow
Automatic and Interpretable Machine Learning in R with H2O and LIMEAutomatic and Interpretable Machine Learning in R with H2O and LIME
Automatic and Interpretable Machine Learning in R with H2O and LIME
Jo-fai Chow313 views
Automatic and Interpretable Machine Learning with H2O and LIME by Jo-fai Chow
Automatic and Interpretable Machine Learning with H2O and LIMEAutomatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIME
Jo-fai Chow3.7K views
H2O at Berlin R Meetup by Jo-fai Chow
H2O at Berlin R MeetupH2O at Berlin R Meetup
H2O at Berlin R Meetup
Jo-fai Chow447 views
H2O at BelgradeR Meetup by Jo-fai Chow
H2O at BelgradeR MeetupH2O at BelgradeR Meetup
H2O at BelgradeR Meetup
Jo-fai Chow473 views
Introduction to Machine Learning with H2O and Python by Jo-fai Chow
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
Jo-fai Chow716 views
Introduction to Machine Learning with H2O and Python by Jo-fai Chow
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
Jo-fai Chow842 views
H2O Deep Water - Making Deep Learning Accessible to Everyone by Jo-fai Chow
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
Jo-fai Chow566 views
H2O Deep Water - Making Deep Learning Accessible to Everyone by Jo-fai Chow
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
Jo-fai Chow1K views
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain... by Jo-fai Chow
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
Jo-fai Chow1.2K views
Designing Sustainable Drainage Systems by Jo-fai Chow
Designing Sustainable Drainage SystemsDesigning Sustainable Drainage Systems
Designing Sustainable Drainage Systems
Jo-fai Chow2.2K views
Developing a New Decision Support System for SuDS by Jo-fai Chow
Developing a New Decision Support System for SuDSDeveloping a New Decision Support System for SuDS
Developing a New Decision Support System for SuDS
Jo-fai Chow1.3K views
Udacity Statement (Introduction to Statistics, August 2012) by Jo-fai Chow
Udacity Statement (Introduction to Statistics, August 2012)Udacity Statement (Introduction to Statistics, August 2012)
Udacity Statement (Introduction to Statistics, August 2012)
Jo-fai Chow365 views
Coursera Statement (Computational Investing, Part I, by Jo-fai Chow
Coursera Statement (Computational Investing, Part I, Coursera Statement (Computational Investing, Part I,
Coursera Statement (Computational Investing, Part I,
Jo-fai Chow532 views
Coursera Statement (Computing for Data Analysis, Oct 2013) by Jo-fai Chow
Coursera Statement (Computing for Data Analysis, Oct 2013)Coursera Statement (Computing for Data Analysis, Oct 2013)
Coursera Statement (Computing for Data Analysis, Oct 2013)
Jo-fai Chow283 views
Coursera Statement (Data Analysis, Mar 2013) by Jo-fai Chow
Coursera Statement (Data Analysis, Mar 2013)Coursera Statement (Data Analysis, Mar 2013)
Coursera Statement (Data Analysis, Mar 2013)
Jo-fai Chow332 views
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain... by Jo-fai Chow
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
A Systematic, Multi-Criteria Decision Support Framework for Sustainable Drain...
Jo-fai Chow5.3K views

Recently uploaded

MOSORE_BRESCIA by
MOSORE_BRESCIAMOSORE_BRESCIA
MOSORE_BRESCIAFederico Karagulian
5 views8 slides
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented GenerationDataScienceConferenc1
7 views29 slides
Ukraine Infographic_22NOV2023_v2.pdf by
Ukraine Infographic_22NOV2023_v2.pdfUkraine Infographic_22NOV2023_v2.pdf
Ukraine Infographic_22NOV2023_v2.pdfAnastosiyaGurin
1.3K views3 slides
UNEP FI CRS Climate Risk Results.pptx by
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptxpekka28
11 views51 slides
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptxayeshabaig2004
5 views30 slides
CRIJ4385_Death Penalty_F23.pptx by
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
6 views24 slides

Recently uploaded(20)

[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
Ukraine Infographic_22NOV2023_v2.pdf by AnastosiyaGurin
Ukraine Infographic_22NOV2023_v2.pdfUkraine Infographic_22NOV2023_v2.pdf
Ukraine Infographic_22NOV2023_v2.pdf
AnastosiyaGurin1.3K views
UNEP FI CRS Climate Risk Results.pptx by pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 views
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20045 views
CRIJ4385_Death Penalty_F23.pptx by yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1006 views
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra15 views
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx by DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
Short Story Assignment by Kelly Nguyen by kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0119 views
3196 The Case of The East River by ErickANDRADE90
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East River
ErickANDRADE9011 views
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf by vikas12611618
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfVikas 500 BIG DATA TECHNOLOGIES LAB.pdf
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf
vikas126116188 views
Cross-network in Google Analytics 4.pdf by GA4 Tutorials
Cross-network in Google Analytics 4.pdfCross-network in Google Analytics 4.pdf
Cross-network in Google Analytics 4.pdf
GA4 Tutorials6 views
SUPER STORE SQL PROJECT.pptx by khan888620
SUPER STORE SQL PROJECT.pptxSUPER STORE SQL PROJECT.pptx
SUPER STORE SQL PROJECT.pptx
khan88862012 views
Advanced_Recommendation_Systems_Presentation.pptx by neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
Organic Shopping in Google Analytics 4.pdf by GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials11 views

Improving Model Predictions via Stacking and Hyper-parameters Tuning

  • 1. Improving Model Predictions via Stacking and Hyper-Parameters Tuning Jo-fai (Joe) Chow Data Scientist joe@h2o.ai @matlabulus
  • 2. About Me • 2005 - 2015 • Water Engineer o Consultant for Utilities o EngD Research • 2015 - Present • Data Scientist o Virgin Media o Domino Data Lab o H2O.ai 2
  • 4. About This Talk • Predictive modelling o Kaggle as an example • Improve predictions with simple tricks • Use data science for social good 👍 4
  • 5. About Kaggle • World’s biggest predictive modelling competition platform • 560k members • Competition types: o Featured (prize) o Recruitment o Playground o 101 5
  • 6. Predicting Shelter Animal Outcomes • X: Predictors o Name o Gender o Type (🐱 or 🐶) o Date & Time o Age o Breed o Colour • Y: Outcomes (5 types) o Adoption o Died o Euthanasia o Return to Owner o Transfer • Data o Training (27k samples) o Test (11k) 6
  • 7. Basic Feature Engineering X Raw (Before) Reformatted (After) Name Elsa, Steve, Lassie [name_len]: 4, 5, 6 Date & Time 2014-02-12 18:22:00 [year]: 2014 [month]: 2 [weekday]: 4 [hour]: 18 Age 1 year, 3 weeks, 2 days [age_day]: 365, 21, 2 Breed German Shepherd, Pit Bull Mix [is_mix]: 0, 1 Colour Brown Brindle/White [simple_colour]: brown 7
  • 8. Common Machine Learning Techniques • Ensembles o Bagging/boosting of decision trees o Reduces variance and increase accuracy o Popular R Packages (used in next example) • “randomForest” • “xgboost” • There are a lot more machine learning packages in R: o “caret”, “caretEnsemble” o “h2o”, “h2oEnsemble” o “mlr” 8
  • 9. Simple Trick – Model Averaging • Stratified sampling o 80% for training o 20% for validation • Evaluation metric o Multi-class Log Loss o Lower the better o 0 = Perfect • 50 runs o different random seed 9
  • 10. More Advanced Methods • Model Stacking o Uses a second-level metalearner to learn the optimal combination of base learners o R Packages: • “SuperLearner” • “subsemble” • “h2oEnsemble” • “caretEnsemble” • Hyper-parameters Tuning o Improves the performance of individual machine learning algorithms o Grid search • Full / Random o R Packages: • “caret” • “h2o” 10 For more info, see https://github.com/h2oai/h2o-meetups/tree/master/2016_05_20_MLconf_Seattle_Scalable_Ensembles https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.Rmd
  • 11. Trade-Off of Advanced Methods • Strength o Model tuning + stacking won nearly all Kaggle competitions. o Multi-algorithm ensemble may better approximate the true predictive function than any single algorithm. • Weakness o Increased training and prediction times. o Increased model complexity. o Requires large machines or clusters for big data. 11
  • 12. R + H2O = Scalable Machine Learning • H2O is an open-source, distributed machine learning library written in Java with APIs in R, Python and more. • ”h2oEnsemble” is the scalable implementation of the Super Learner algorithm for H2O. 12
  • 13. H2O Random Grid Search Example 13 Define search range and criteria Best models
  • 14. H2O Model Stacking Example 14 17 out of 717 teams (≈ top 2%) Getting reasonable resultsUsing h2o.stack(…) to combine multiple models
  • 15. Conclusions • Many R packages for predictive modelling. • Use hyper-parameters tuning to improve individual models. • Use model averaging / stacking to improve predictions. • Trade-off between model performance and computational costs. • Use R + H2O for scalable machine learning. • H2O random grid search and stacking. • Use data science for social good 👍 15
  • 16. Big Thank You! • Mango Solutions • RStudio • Domino Data Lab • H2O o Erin LeDell o Raymond Peck o Arno Candel 16 1st LondonR Talk Crime Map Shiny App bit.ly/londonr_crimemap 2nd LondonR Talk Domino API Endpoint bit.ly/1cYbZbF
  • 17. Any Questions? • Contact o joe@h2o.ai o @matlabulous o github.com/woobe • Slides & Code o github.com/h2oai/h2o- meetups • H2O in London o Meetups / Office (soon) o www.h2o.ai/careers • More H2O at Strata tomorrow o Innards of H2O (11:15) o Intro to Generalised Low- Rank Models (14:05) 17
  • 18. Extra Slide (Stratified Sampling) 18

Editor's Notes

  1. All slides and code available online – sit back and relax, remember you’re here today for a good cause, care about shelter animals