H2O AutoML Roadmap 2016.10
Raymond Peck
Director of Product Engineering, H2O.ai
rpeck@h2o.ai
What Will We Cover?
• What is AutoML?
• What is the roadmap for H2O AutoML?
What is AutoML?
H2O AutoML automates parts of data preparation and model
training in order to help both Machine Learning / Data Science
experts and complete novices.
Other AutoML projects concentrate on novices.
Outside AutoML Projects
• auto-sklearn
• AutoCompete
• TPOT
• DataRobot
• Automatic Statistician
• BigML
• et al...
Who is the Target Audience?
• "Big green button" for novice users such as software
developers and business analysts;
• Iterative, interactive use and controls for expert users:
• Machine Learning experts
• Descriptive Data Scientists
What Are the Pieces?
• data cleaning
• feature engineering / feature generation
• feature selection
• for both the original and generated features
• model hyperparameter tuning
• automatic smart ensemble generation
Prior Work @ H2O
• ensembles (stacking), from Erin LeDell
• random hyperparameter search with automatic stopping,
from Raymond Peck
• some dataset characterization and feature engineering,
from Spencer Aiello
• hyperopt Bayesian hyperparameter optimization, from
Abhishek Malali
Current Work
• random hyperparameter search with parameter values
based on open datasets
• moving ensembles into the back end
• working on basic metalearning for hyperparameter vectors,
starting with 140 OpenML datasets
Future Work
• feature selection
• feature engineering for IID data
• Bayesian hyperparameter search with warm start
• feature engineering for non-IID data, e.g. time series
• iterate with larger datasets that are typical of our customers
• distribution guesser for regression
How Do We Evaluate Our Work?
• public datasets from
• OpenML
• ChaLearn AutoML challenge
• Kaggle
• our own Data Scientists' work with customer datasets
• customer feedback (soon)
Data Cleaning
• outlier analysis (with user feedback)
• sentinel value detection
• as a side-effect of outlier analysis
• type-based heuristics (e.g., 999999, 1970.01.01)
• identifier detection (e.g., customer ID)
• smart imputation (see the sketch after this list)
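A minimal sketch of how a few of these steps might look from the H2O Python client. The file path, the column names "age" and "income", and the 999999 sentinel are hypothetical, and real sentinel detection would come out of the outlier analysis rather than a hard-coded value.

    import h2o
    h2o.init()
    df = h2o.import_file("raw_data.csv")              # hypothetical path

    # Type-based sentinel heuristic: recode a suspicious magic number as missing (NA).
    df["age"] = (df["age"] == 999999).ifelse(float("nan"), df["age"])

    # Identifier detection: a column whose cardinality equals the row count
    # (e.g. a customer ID) carries no signal and can be dropped.
    id_like = [c for c in df.columns if df[c].unique().nrow == df.nrow]
    df = df[[c for c in df.columns if c not in id_like]]

    # Imputation: fill the remaining NAs in a numeric column with its median.
    df.impute("income", method="median")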
Feature Generation
We will be using several techniques, including the following (see the sketch after this list):
• type-based heuristics
• date/time expansion
• log and other transforms of numerics
• interactions (product, ratio, etc.)
• feature generation with Deep Learning deepfeatures()
• clustering
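The sketch below runs a few of these transforms on an H2OFrame from the Python client; the column names ("purchase_time", "income", "visits") are invented for illustration, and the autoencoder settings are arbitrary.

    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    # Date/time expansion: explode a timestamp column into model-friendly parts.
    df["purchase_year"]  = df["purchase_time"].year()
    df["purchase_month"] = df["purchase_time"].month()
    df["purchase_dow"]   = df["purchase_time"].dayOfWeek()
    df["purchase_hour"]  = df["purchase_time"].hour()

    # Log and other transforms of skewed numerics.
    df["log_income"] = (df["income"] + 1).log()

    # Simple interactions: products and ratios of numeric pairs.
    df["income_x_visits"]  = df["income"] * df["visits"]
    df["income_per_visit"] = df["income"] / (df["visits"] + 1)

    # Non-linear features from a Deep Learning autoencoder via deepfeatures().
    ae = H2ODeepLearningEstimator(autoencoder=True, hidden=[16], epochs=10)
    ae.train(x=["income", "visits", "purchase_hour"], training_frame=df)
    df = df.cbind(ae.deepfeatures(df, 0))             # hidden-layer activations as new columns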
Feature Selection
We will be evaluating several techniques including:
• Mutual Information (non-linear correlation)
• variable importance from GBM and Deep Learning
• PCA
• GLM with Elastic Net / LASSO
We may use different selectors for the original features and for the generated transforms / interactions,
trading off speed against the ability to detect non-linear relationships; a sketch of two of these selectors follows.
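In the sketch below, "predictors", "response", and "train" are assumed to exist already, and the 1% importance cutoff is arbitrary; it shows importance-based selection with GBM and LASSO-style selection with GLM.

    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    # Variable importance from a quick GBM; keep features above a cutoff.
    gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
    gbm.train(x=predictors, y=response, training_frame=train)
    vi = gbm.varimp(use_pandas=True)                  # variable, relative/scaled/percentage importance
    keep_gbm = vi.loc[vi["percentage"] > 0.01, "variable"].tolist()

    # LASSO-style selection: GLM with a pure L1 penalty (alpha=1) and lambda search,
    # then keep the features whose coefficients are non-zero.
    glm = H2OGeneralizedLinearEstimator(family="binomial", alpha=1.0, lambda_search=True)
    glm.train(x=predictors, y=response, training_frame=train)
    keep_glm = [n for n, c in glm.coef().items() if n != "Intercept" and c != 0.0]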
Hyperparameter Tuning
• currently do random hyperparameter search with metric-based
smart stopping (see the sketch below)
• hyperparameter values taken from hand-tuning 140 OpenML
datasets
• soon adding simple "nearest neighbors" warm start (basic
metalearning)
• then adding Bayesian hyperparameter optimization
• possibly integrating hyperopt into the back end
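Roughly what the current random search with metric-based stopping looks like from the Python client; the hyperparameter ranges and search budget are illustrative rather than the OpenML-derived values, and "predictors", "response", "train", and "valid" are assumed to exist.

    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    hyper_params = {                                  # illustrative ranges only
        "max_depth":       [3, 5, 7, 9],
        "learn_rate":      [0.01, 0.05, 0.1],
        "sample_rate":     [0.7, 0.8, 1.0],
        "col_sample_rate": [0.7, 0.8, 1.0],
    }
    search_criteria = {
        "strategy":           "RandomDiscrete",       # random search over the grid
        "max_models":         50,
        "seed":               42,
        "stopping_metric":    "AUC",                  # metric-based smart stopping: quit when
        "stopping_rounds":    5,                      # the best AUC stops improving by 0.001
        "stopping_tolerance": 1e-3,                   # over 5 consecutive models
    }
    grid = H2OGridSearch(H2OGradientBoostingEstimator(ntrees=500),
                         hyper_params=hyper_params,
                         search_criteria=search_criteria)
    grid.train(x=predictors, y=response,
               training_frame=train, validation_frame=valid)
    best = grid.get_grid(sort_by="auc", decreasing=True).models[0]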
Automatic Smart Ensemble Generation
• currently adding Erin LeDell's stacking / SuperLearner into the back end
• initially, ensemble top N models from hyperparameter searches
• optional "use original features"
• smarter ensemble generation for faster scoring and less overfitting:
• greedy ensemble creation (see the sketch below)
• ensemble models with uncorrelated residuals
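The stacking itself lives in the back end, but the greedy-selection idea is easy to sketch in plain Python. This is the generic Caruana-style loop, not H2O's implementation, and it assumes per-model holdout predictions are already in hand.

    import numpy as np

    def greedy_ensemble(preds, y, n_rounds=20):
        """Greedily add models (with replacement) while holdout MSE improves.

        preds: dict of model name -> 1-D array of holdout predictions
        y:     1-D array of holdout targets
        """
        chosen, ensemble, best = [], np.zeros_like(y, dtype=float), np.inf
        for _ in range(n_rounds):
            round_best, round_score = None, best
            for name, p in preds.items():
                cand = (ensemble * len(chosen) + p) / (len(chosen) + 1)
                score = np.mean((cand - y) ** 2)
                if score < round_score:
                    round_best, round_score = name, score
            if round_best is None:                    # nothing improves the ensemble; stop
                break
            chosen.append(round_best)
            ensemble = (ensemble * (len(chosen) - 1) + preds[round_best]) / len(chosen)
            best = round_score
        return chosen, best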
Possible Futures
• try to predict accuracy from dataset metadata
• training time prediction
• scoring time prediction
• multiple concurrent H2O clusters for speed
• freeze/thaw model training
• outlier analysis with user feedback
• residuals analysis with user feedback
• composite models using pre-clustering step
