Like-for-Like Comparisons of Machine Learning Algorithms - Dominik Dahlem, Boxever
1. Like-for-Like Comparison of Machine Learning
Algorithms
Sensitivity Analysis of ML Hyperparameters
Dominik Dahlem
2016-09-25 Sun
2. Who am I?
• Dominik Dahlem, Lead Data Scientist, Boxever
dominik.dahlem@boxever.com
http://ie.linkedin.com/in/ddahlem
http://github.com/dahlem
@dahlemd
4. Introduction
Boxever
• Boxever is a Data Science company with a Customer
Intelligence Cloud for Travel
• Our cloud analytical services need to be well-tuned and robust
• e.g., recommendation models, propensity models, etc.
6. Introduction
Goal
• Tuning
• ML algorithms tend to be governed by tunable parameters,
typically referred to as hyperparameters
• They are not trained
• Require trial-and-error fine tuning
• Sensitivity Analysis
• Does a small perturbation in the parameters change the output
dramatically?
• Visual inspection easy in ML algorithms with very few
hyperparameters
• But mathematical treatment necessary in high dimensions
8. Building ML Models
Evaluating a ML Model Hypothesis
• A hypothesis (given the hyperparameters) may overfit → How
do we know?
• We may observe a low training error while the model is still inaccurate on unseen data
• Test-driven development and debugging ↔ Statistical Diagnostics
1 With a given dataset, split into two sets: training and test
2 Fix hyperparameters
3 Learn model parameters and minimise the corresponding error
using the training set
4 Compute the test error using the test set
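The four steps above can be sketched as follows; the library (scikit-learn) and the ridge model are illustrative assumptions, not named in the talk:

```python
# Hypothetical sketch of steps 1-4 on a synthetic regression task.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 1. Split the dataset into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Fix the hyperparameters (here: the regularisation strength)
model = Ridge(alpha=1.0)

# 3. Learn the model parameters by minimising the training error
model.fit(X_tr, y_tr)

# 4. Compute the test error on the held-out test set
test_error = mean_squared_error(y_te, model.predict(X_te))
```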
9. Building ML Models
Model Selection
• Without the validation set
• Optimise ML parameters using the training set for each
hypothesis (e.g., polynomial degree)
• Select the hypothesis with the smallest test error
• Estimating the generalisation error on that same test set yields
optimistic error estimates
• With validation set
• Optimise ML parameters using the training set for each
hypothesis (e.g., polynomial degree)
• Select the hypothesis with the smallest cross-validation error
• Estimate the generalisation error also using the test set
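The validation-set variant can be sketched with the polynomial-degree example from the slide; the data and degree range are made up for illustration:

```python
# Hedged sketch: choose the polynomial degree on validation error,
# then estimate the generalisation error once on the test set.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 300)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=300)

x_tr, y_tr = x[:180], y[:180]        # training set
x_va, y_va = x[180:240], y[180:240]  # validation set
x_te, y_te = x[240:], y[240:]        # test set

def fit_and_error(degree):
    # Optimise the model parameters on the training set only
    coeffs = np.polyfit(x_tr, y_tr, degree)
    err = float(np.mean((np.polyval(coeffs, x_va) - y_va) ** 2))
    return coeffs, err

# Select the hypothesis (degree) with the smallest validation error
candidates = {d: fit_and_error(d) for d in range(1, 8)}
best_degree = min(candidates, key=lambda d: candidates[d][1])

# Estimate the generalisation error on the test set, used only once
coeffs, _ = candidates[best_degree]
test_error = float(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
```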
10. Building ML Models
General ML Pipeline
• Parameter search
• grid vs random vs active
learning
• We found a well-performing model,
but
• are the parameters sensitive to
minute changes?
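Grid and random search can be contrasted on an equal evaluation budget; the toy error surface below is a stand-in, not a real model:

```python
# Illustrative grid vs. random search with the same budget of 16 evaluations.
import itertools
import random

def cv_error(alpha, gamma):
    # Stand-in for a k-fold CV error surface (assumed, not a real model)
    return (alpha - 0.01) ** 2 + (gamma - 0.6) ** 2

# Grid search: a fixed 4 x 4 lattice of hyperparameter values
alphas = [0.0001, 0.01, 0.1, 0.3]
gammas = [0.1, 0.4, 0.7, 0.95]
grid_best = min(itertools.product(alphas, gammas), key=lambda p: cv_error(*p))

# Random search: the same budget, sampled uniformly from the ranges
rng = random.Random(0)
samples = [(rng.uniform(0.0001, 0.3), rng.uniform(0.01, 0.95)) for _ in range(16)]
rand_best = min(samples, key=lambda p: cv_error(*p))
```

Random search is not tied to the lattice, so it can land closer to the optimum along the dimensions that matter.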
11. Building ML Models
Sensitivity Analysis1
• Hyperparameter tuning using
e.g., Spearmint
• integrate uncertainty of the
k-fold CV
• model the parameter surface
on the mean error metric from
CV
• Characterise the nature of the
hyperparameter surface in the
vicinity of the optimal point (e.g.,
the one that minimised the error
of the ML algorithm)
1George E. P. Box and Norman R. Draper (2007). Response Surfaces, Mixtures, and Ridge Analyses. 2nd ed.
Wiley-Interscience.
12. Building ML Models
Bias vs. Variance
[Figure: J(θ) vs. polynomial degree d — Jtraining(θ) falls monotonically while JCV(θ) is U-shaped; underfitting (high bias) at low d, overfitting (high variance) at high d, optimum in between]
• What is the source of bad predictions?
13. Building ML Models
Regularisation and Bias/Variance
[Figure: J(θ) vs. regularisation strength λ — the roles reverse: overfitting (high variance) at small λ, underfitting (high bias) at large λ, with JCV(θ) minimised at an intermediate optimum]
14. Building ML Models
Learning Curves (High Bias)
[Figure: learning curves under high bias — Jtraining(θ) and Jtest(θ) converge quickly to a plateau above the desired error as N (training set size) grows]
• More training data will not help!
15. Building ML Models
Learning Curves (High Variance)
[Figure: learning curves under high variance — a large gap between Jtraining(θ) and Jtest(θ) that narrows slowly as N (training set size) grows]
• More training data will likely help!
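A learning curve can be computed by training on growing subsets and tracking training vs. held-out error; the ordinary-least-squares model and the data below are illustrative assumptions:

```python
# Sketch of computing a learning curve on a synthetic linear problem.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.2, size=500)
X_te, y_te = X[400:], y[400:]   # held-out evaluation set

def mse(w, A, b):
    return float(np.mean((A @ w - b) ** 2))

train_err, test_err = [], []
for n in (10, 25, 50, 100, 200, 400):
    # Fit ordinary least squares on the first n training examples
    w, *_ = np.linalg.lstsq(X[:n], y[:n], rcond=None)
    train_err.append(mse(w, X[:n], y[:n]))
    test_err.append(mse(w, X_te, y_te))
# Plotting train_err and test_err against n gives the learning curves;
# with more data the two curves converge towards the noise floor.
```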
17. Example: RL Gridworld
Overview
• Teach a computer to find a
path to a goal
• Actions: N, E, S, W
• Classifying grids?
• Trial and Error?
18. Example: RL Gridworld
SARSA(λ)
• SARSA update rule:
Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)] . (1)
• s: the state, i.e., cell on the grid
• a: the action, i.e., N, E, S, W
• s′, a′: the successor state and the action chosen in it
• Q(s, a): state-action value function
• here: lookup table
• r: the reward received for performing action a in state s
• α: the learning rate
• γ: the discount factor
19. Example: RL Gridworld
RL Gridworld Pipeline
• Optimise the learning rate and
the discount factor
• α ∈ [0.0001, 0.3]
• γ ∈ [0.01, 0.95]
• Fixed parameters for brevity:
• greedy policy
• the eligibility traces λ
• episodes N = 2000
21. Example: RL Gridworld
Overview: Canonical Analysis2
• Find the optimum point using a
constrained optimisation method
that can escape local minima
• α = 0.0001, γ = 0.577
• Restrict the canonical analysis to
a subset of the parameter space
around the optimum value
• α ∈ [0.0001, 0.03]
• γ ∈ [0.48, 0.67]
• Trace the α, γ, and estimated
number of steps along the
maximum path
• eigen-system analysis of the
covariance matrix of the
hyperparameter surface
2George E. P. Box and Norman R. Draper (2007). Response Surfaces, Mixtures, and Ridge Analyses. 2nd ed.
Wiley-Interscience.
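The eigen-system step can be sketched by fitting a second-order response surface to (α, γ, error) samples near the optimum and inspecting the eigenvalues of the quadratic-term matrix; the synthetic error surface below is an assumption for illustration:

```python
# Hedged sketch of canonical analysis on a synthetic hyperparameter surface.
import numpy as np

rng = np.random.default_rng(3)
alpha = rng.uniform(0.0001, 0.03, 100)   # restricted region around the optimum
gamma = rng.uniform(0.48, 0.67, 100)
# Synthetic error surface with a minimum near (0.0001, 0.577)
err = (50 * (alpha - 0.0001) ** 2 + 5 * (gamma - 0.577) ** 2
       + rng.normal(scale=1e-4, size=100))

# Second-order model: err ~ b0 + b1*a + b2*g + b11*a^2 + b22*g^2 + b12*a*g
A = np.column_stack([np.ones_like(alpha), alpha, gamma,
                     alpha**2, gamma**2, alpha * gamma])
coef, *_ = np.linalg.lstsq(A, err, rcond=None)
b0, b1, b2, b11, b22, b12 = coef

# Canonical analysis: eigen-system of the symmetric quadratic-term matrix
B = np.array([[b11, b12 / 2], [b12 / 2, b22]])
eigvals, eigvecs = np.linalg.eigh(B)
# All-positive eigenvalues indicate a minimum; their magnitudes give the
# steepness of the surface along the principal (canonical) axes.
```

A large spread between the eigenvalues signals a ridge: the error is far more sensitive to perturbations along one canonical axis than the other.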
25. Summary
• Enable like-for-like ML model evaluations
• Tuning, e.g.,
• Spearmint: https://github.com/JasperSnoek/spearmint
• SMAC: http://www.cs.ubc.ca/labs/beta/Projects/SMAC/
• hyperopt: http://hyperopt.github.io/hyperopt/
• Canonical Analysis
• Sensitivity of the hyperparameters when subjected to small
perturbations around the optimum
• Assess the hyperparameter sensitivity across competing ML models
• Choose an ML model that does not exhibit minima that are
surrounded by very steep slopes in the hyperparameter surface