Pricing like a Data Scientist
Matthew Evans
EMC Actuarial and Analytics
16 July 2020
Pricing like a Data Scientist
Augmenting traditional actuarial pricing with data
science approaches
16 July 2020
Pricing like a Data Scientist
This presentation sets out the key insights from my work with data scientists and
data science techniques over the last five years.
Examples are presented showing how traditional actuarial pricing can be
augmented with data science tools and approaches.
Suggestions of varying reward/effort are offered for enhancing your pricing
process.
Agenda
• The traditional actuarial pricing approach
• The data science process
• Enhancing our pricing processes
– What are they doing that we aren’t?
– Which bits could work for us and our datasets?
[Diagram: Actuarial Pricing (burning cost, credibility weighting, triangles, curve fitting) overlapping with Data Science (data visualisation, deep learning, NLP, vision, clustering); shared ground includes GAMs, PCA, the Cloud, GLMs, R, resampling and ML.]
We will focus on this adjacent area, on the techniques most relevant to actuarial work.
The traditional actuarial pricing approach
• MI and Analytics on corporate systems
• Risk premium modelling using GLMs
• MI supplemented with ad hoc analyses
• Market rates (Optimisation/Underwriter loads)
• Model deployment on corporate systems/tools.
Modelling frequency and severity by claims type
Train and test datasets
Poisson/Gamma GLM error structures
The data science process: EDA
• Exploratory Data Analysis (“EDA”) and Data Visualisation are an established part of the data science process
– Examination of features of data
– Distributions
– Correlations
– Suitability of proposed model structure
– Clustering & dimensionality reduction
Important checks for data scientists working
across multiple domains with different and
unfamiliar datasets.
The data science process: EDA
• Pricing actuaries use similar
processes to monitor developing
experience:
– Management information
– Ad hoc analyses.
Basic EDA tends to be less important in actuarial pricing, where datasets are familiar and characteristics are broadly stable.
But could more sophisticated methods work for us?
The data science process: EDA - PCA
• Dimensionality Reduction,
here using Principal Components
Analysis, can be used to reduce
a wide dataset to a smaller set of
features
• 139 original factors converted to
70 “Principal Components” which
may help modelling
• Less useful if dataset narrow.
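For illustration, a minimal PCA sketch in base R – the data frame rating_wide and the 95% variance threshold are assumptions, not from the deck:

```r
# Minimal PCA sketch in base R. 'rating_wide' is a hypothetical data
# frame of numeric rating factors; the 95% threshold is illustrative.
pca <- prcomp(rating_wide, center = TRUE, scale. = TRUE)

# Keep enough components to explain, say, 95% of the variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep <- which(var_explained >= 0.95)[1]

# Scores for the retained components become candidate model features
pcs <- as.data.frame(pca$x[, seq_len(n_keep)])
```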
The data science process: EDA - Clustering
• Clustering, here hierarchical
clustering, groups similar
records
• Clusters appear to show
differences in frequency in
our test dataset, which may
help modelling
• Less useful if dataset
broadly homogeneous.
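A hedged sketch of hierarchical clustering in base R – the policies data frame and its columns are assumptions for illustration:

```r
# Hierarchical clustering sketch; 'policies' and its columns are hypothetical.
num_cols <- c("driver_age", "vehicle_value", "annual_mileage")
d  <- dist(scale(policies[, num_cols]))  # distances on standardised factors
hc <- hclust(d, method = "ward.D2")
policies$cluster <- cutree(hc, k = 5)    # cut the tree into 5 groups

# Do the clusters differ in claim frequency?
policies$freq <- policies$claim_count / policies$exposure
aggregate(freq ~ cluster, data = policies, FUN = mean)
```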
The data science process: Imputation
• EDA may expose deficiencies in data or missing data. Imputation can
help.
• A simple thing but something we use often at EMC, especially in
commercial lines with more challenging datasets.
Build small model to predict missing value
Add predicted values back onto dataset where missing
Imputed values retain data volume for rest of process
Final model shows improved performance
The data science process: Imputation
• This graph shows how the
Previous Keepers field is imputed
using a tree-based machine
learning model
• Most missing values were
assigned to zero – manufactured
date was close to purchased date
• Various models downstream now have the benefit of a fully-populated field.
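A minimal sketch of tree-based imputation – rpart stands in for the tree model used on the slide, and the vehicles data frame and column names are assumptions:

```r
library(rpart)

# Tree-based imputation sketch: fit on rows where the field is present,
# predict where it is missing. Data and columns are hypothetical.
train <- subset(vehicles, !is.na(prev_keepers))
fit <- rpart(prev_keepers ~ manufacture_year + purchase_year + mileage,
             data = train, method = "anova")

# Write the predictions back where the field is missing
miss <- is.na(vehicles$prev_keepers)
vehicles$prev_keepers[miss] <- round(predict(fit, newdata = vehicles[miss, ]))
```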
The data science process: Feature Engineering
• Feature Engineering is the construction of new features from existing
data
– Using domain knowledge to construct features can help models fit better, faster,
or improve model interpretability
– Pricing actuaries already do this, with NCB level, say, or other engineered risk
scores, but it tends not to be a focus
• Data scientists focus more on this because their datasets are often
wider and individual features may lack predictive power.
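A small feature-engineering sketch in R – the data frame and column names are assumptions for illustration:

```r
# Feature engineering sketch: derive new features from existing fields.
# 'policies' and the column names are hypothetical.
policies$vehicle_age  <- policies$quote_year - policies$manufacture_year
policies$value_per_cc <- policies$vehicle_value / policies$engine_cc
policies$age_band     <- cut(policies$driver_age,
                             breaks = c(17, 21, 25, 30, 40, 60, Inf))
```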
The data science process: ML
• Machine Learning approaches can be shown to “beat” GLMs across a number of metrics; they are also easier to fit, at the expense of being more difficult to interpret
• If legacy constraints mean a traditional model must be used, ML approaches can still offer insights:
– Showing the most predictive factors, guiding model design
– Exposing interactions
– Showing optimum banding for numerical variables
– Giving a “target” model score for competing statistical approaches.
The data science process: ML
• Factor predictiveness
analyses from machine
learning can inform model
design
• This diagnostic from an
XGBoost model shows us
the most important
factors for use in a legacy
GLM, say.
XGBoost model less fussy than GLM so you can save time preparing data
Use results to design GLM model formula
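A hedged sketch of pulling factor importance from XGBoost (named in the references) – X, a numeric matrix of rating factors, and y, claim counts, are hypothetical stand-ins:

```r
library(xgboost)

# Factor-importance sketch; X (numeric matrix of rating factors) and
# y (claim counts) are hypothetical.
dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "count:poisson",
                               max_depth = 3, eta = 0.1),
                 data = dtrain, nrounds = 200)

imp <- xgb.importance(model = bst)
head(imp)  # the Gain column ranks factors to shortlist for the GLM
```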
The data science process: ML
• Similar approaches
allow interactions to be
exposed
• These approaches
involve digging into the
black box of machine
learning models.
These pairs of rating factors appear together most frequently in model trees.
We can experiment with high-scoring interactions in our GLM.
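One way to dig into the black box: count which factor pairs co-occur in the same trees. A rough sketch, clearest for shallow trees; bst is the hypothetical model from the previous sketch:

```r
library(xgboost)
library(data.table)

# Rough interaction screen: which factor pairs appear together in the
# same trees of the fitted model? A crude proxy, not the deck's exact method.
trees <- xgb.model.dt.tree(model = bst)
tree_feats <- trees[Feature != "Leaf",
                    .(feats = list(sort(unique(Feature)))), by = Tree]
pairs <- unlist(lapply(tree_feats$feats, function(f)
  if (length(f) >= 2) combn(f, 2, paste, collapse = " x ")))
sort(table(pairs), decreasing = TRUE)[1:5]  # candidate GLM interactions
```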
The data science process: ML
• A tree-based machine
learning model finds
optimal points to split
variables
• Seeing where the tree
model branches
numerical variables can
inform banding
• The biggest difference is
above or below £5,000.
Note that treating the variable as a smoother in a GAM is preferable: it avoids banding altogether.
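A sketch of extracting split points from a single tree – rpart stands in for the tree model, and the dataset and £5,000-style boundary are illustrative:

```r
library(rpart)

# Split-point sketch: a Poisson tree on one numeric factor; where it
# splits can seed GLM bands. 'policies' and columns are hypothetical.
fit <- rpart(cbind(exposure, claim_count) ~ vehicle_value,
             data = policies, method = "poisson",
             control = rpart.control(cp = 0.002))
sort(unique(fit$splits[, "index"]))  # candidate band boundaries (e.g. ~5000)
```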
The data science process: ML
• When not legacy-constrained, the full power of modern ML
approaches can be realised:
– InsurTech start-ups make common use of ML models for rating
– ML models are well-suited to non-linear, discontinuous data – for example in
modelling market rates using aggregator data
– ML models work well with wide data where traditional approaches break down
– ML models are easily deployed in the cloud.
The data science process: GAMs
• Generalised Additive Models have found wide application in data science, where they are often used in preference to GLMs
• GAMs do all that GLMs do:
– Same formula and error structures, so it is easy to convert a GLM
– Same structure for offsets (fixed relativities) where required
– Same output of multiplicative relativities for deployment
– Same diagnostic outputs for peer review and underwriter assessment
• But GAMs offer a much more convenient treatment of splines:
– “Smoothers” handle non-linear numerical variables
– Much less to go wrong; polynomials, knots, etc. can be avoided
– A shape-constrained fit can impose shapes on smoothers
– Regularised fitting.
GAM formula in R often the same as for a GLM; easy changes for numeric terms only
Outputs are the same, so no issue if you need multiplicative factor loads
Models fit better and make better use of data
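A minimal GLM-to-GAM conversion sketch with mgcv (named in the references) – the dataset and formula are assumptions:

```r
library(mgcv)

# GLM-style frequency model; data and columns are hypothetical.
glm_fit <- glm(claim_count ~ veh_group + driver_age + offset(log(exposure)),
               family = poisson(), data = policies)

# Same formula, but the numeric term is wrapped in s() as a smoother
gam_fit <- gam(claim_count ~ veh_group + s(driver_age) + offset(log(exposure)),
               family = poisson(), data = policies, method = "REML")

plot(gam_fit, pages = 1)  # inspect the fitted smoother
```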
The data science process: CV
• Cross-Validation is a widely used technique in data science for developing and testing models
– Rather than a single train/test split, the dataset is split into many “folds”, and a model is fitted and tested with each fold held out in turn
– This approach gives more stable results for model assessment, improving model selection and avoiding over-fitting
– CV is important in data science, where machine learning models are more likely to overfit than traditional statistical models
• The approach also lends itself to grid-searching for optimal hyper-parameters.
Time tests offer a means of achieving the same ends
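A k-fold cross-validation sketch with caret (named in the references) – the dataset and formula are assumptions:

```r
library(caret)

# 5-fold cross-validation sketch; 'policies' and columns are hypothetical.
ctrl <- trainControl(method = "cv", number = 5)
cv_fit <- train(claim_count ~ veh_group + driver_age, data = policies,
                method = "glm", family = poisson(), trControl = ctrl)
cv_fit$results  # performance averaged over the held-out folds
```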
The data science process: Grid-Search
• Machine Learning models use hyper-parameters which control the model-fitting process – they sit above the model parameters which reflect the fit
• A Grid-Search is an exhaustive search through a set of hyper-parameters until the best performing model is found
• Related techniques, including Bayesian optimisation, offer further means of finding the best hyper-parameters, and hence the best model, with less effort.
Choose a set of hyper-parameters, the “grid”
Fit and test each combination with cross-validation
Compare results of all combinations
Select the best set and refit using the whole dataset
The data science process: Grid-Search
• Grid-search from
machine learning
algorithm XGBoost
shown here
• Can also be applied to
traditional modelling:
– Setting Tweedie power
– Claim truncation
– Inflation assumption.
Here we test different “depths” of an XGBoost model
Impact is small, with wide variance within each hyper-parameter set
We choose depth=1 for the smallest error and the simplest model
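A sketch of the depth grid-search with cross-validation – dtrain is the hypothetical xgb.DMatrix from the earlier sketch, and the metric column name follows XGBoost's default Poisson metric:

```r
library(xgboost)

# Grid-search over tree depth, echoing the slide; all names illustrative.
grid <- data.frame(max_depth = 1:4)
grid$cv_error <- sapply(grid$max_depth, function(d) {
  cv <- xgb.cv(params = list(objective = "count:poisson",
                             max_depth = d, eta = 0.1),
               data = dtrain, nrounds = 200, nfold = 5, verbose = 0)
  min(cv$evaluation_log$test_poisson_nloglik_mean)
})
grid  # pick the depth with the smallest cross-validated error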
The data science process: Bootstrapping
• The traditional approach for pricing actuaries is to use confidence intervals
when examining models
– We can see exposure levels and have an idea of the potential range of a relativity or
prediction
– Underwriters might adjust relativities with low certainty
• But Bootstrapping offers the same and more…
The data science process: Bootstrapping
• Bootstrapping is a form of resampling
widely used in data science
• Bootstrapped model metrics are more flexible than confidence intervals:
– They give the whole distribution
– They work for non-statistical machine learning models
– We can use the results to make models safe, for example by refitting to the 65th percentile.
Sample from dataset with replacement
Refit model
Variation in models represents uncertainty
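A bootstrap sketch in base R – the formula, dataset and factor level are assumptions for illustration:

```r
set.seed(42)

# Bootstrap sketch: resample with replacement, refit, collect a relativity.
# 'policies' and the factor level "B" are hypothetical.
boots <- replicate(500, {
  idx <- sample(nrow(policies), replace = TRUE)
  fit <- glm(claim_count ~ veh_group + offset(log(exposure)),
             family = poisson(), data = policies[idx, ])
  exp(coef(fit)["veh_groupB"])  # multiplicative relativity for group B
})

# Whole distribution, including the 65th percentile used to "make safe"
quantile(boots, c(0.05, 0.50, 0.65, 0.95))
```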
The data science process: Bayesian Methods
• Bayesian approaches use an assumed
prior distribution
• The prior is combined with emerging data
to give a posterior distribution
• Like bootstrapping, this gives an idea of
uncertainty
• It also gives a robust framework for
credibility-weighting
• Models are generative so can easily be
used for simulations.
Bayesian programming language: Stan; R package rstanarm used here
[Chart: prior, actual cumulative observed, and posterior mean & distribution]
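A minimal sketch with rstanarm (named on the slide) – the prior, dataset and formula are illustrative assumptions:

```r
library(rstanarm)

# Bayesian frequency model sketch; 'policies' and the prior are hypothetical.
fit <- stan_glm(claim_count ~ veh_group + offset(log(exposure)),
                family = poisson(), data = policies,
                prior = normal(0, 0.5), chains = 4, iter = 2000)

posterior_interval(fit, prob = 0.90)  # uncertainty around each relativity
sims <- posterior_predict(fit)        # generative: simulate claim counts
```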
The data science process: Automation
• An automated process allows actuaries to focus on adding value
[Pipeline diagram: legacy claims system → automatic upload (secure and maintenance-free) → cloud warehouse → cloud R server processes data, runs models and produces reports & MI automatically → refreshed rating model, analytics cube, reports in MS Word and impact analysis → human and commercial judgement → updated rating model including adjustments → new rates go live on a “serverless” quote system.]
The data science process: The Cloud
• Many of the approaches discussed here rely on fitting multiple models,
either for cross-validation or bootstrapping
• Multiple model fits mean greater run times, however…
• Data science approaches make heavy use of parallel processing and cloud-
based infrastructure
– Parallel processing is possible on a local PC, using each core of a modern CPU or GPU
– Cloud-based machines are relatively inexpensive to rent and are well-specified, allowing
multiple models to be fitted quickly.
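A local parallel-processing sketch using base R's parallel package – the dataset and formula are assumptions:

```r
library(parallel)

# Parallel bootstrap sketch on a local PC; 'policies' is hypothetical.
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "policies")
fits <- parLapply(cl, 1:200, function(i) {
  idx <- sample(nrow(policies), replace = TRUE)
  glm(claim_count ~ veh_group + offset(log(exposure)),
      family = poisson(), data = policies[idx, ])
})
stopCluster(cl)
```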
The data science process: The Cloud
• Various platforms allow traditional or machine learning models to be deployed
robustly, and at scale, in the cloud
– Load-balanced virtual machines or containers running R can serve model results
through an API
– “Serverless” platforms allow models and code to be deployed directly.
No need to extract model parameters and recode… just deploy the native R model
Can scale all the way up to 100m quotations a year – enough for a motor aggregator
No servers to maintain, security, updates, etc.
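A sketch of serving a native R model through an API – plumber is one common package for this (it is not named in the deck), and the file and parameter names are hypothetical:

```r
# plumber.R: serve a saved model over HTTP; all names are illustrative.
library(plumber)

model <- readRDS("rating_model.rds")  # hypothetical saved GLM/GAM

#* @param driver_age:numeric Driver age for the quote
#* @get /relativity
function(driver_age) {
  newdata <- data.frame(driver_age = as.numeric(driver_age))
  predict(model, newdata = newdata, type = "response")
}

# Launch locally: plumber::pr_run(plumber::pr("plumber.R"), port = 8000)
```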
The data science process: The Cloud
RStudio running on an AWS Linux virtual machine
Online documentation for the API interface, running here in R on a virtual server
This API, running “serverless” on AWS Lambda, returns a little bit of JSON describing a database entry
Key Insights ranked by reward/effort
1. The R platform offers a new world of insight and flexibility
2. GAMs are an easy substitution to make for GLMs
3. Imputation of missing fields is easy and helps avoid losing data
4. Feature engineering can help models fit better
5. Cross-validation is so much better than test and train
6. Machine learning can offer insights even to legacy-constrained processes
7. Automation allows your team to concentrate on interpretation, not process
8. Bayesian approaches allow robust credibility-weighting and simulations
9. Bootstrapping offers an underwriter-friendly way of showing uncertainty
10. The cloud can speed analysis and support robust deployment of models
11. EDA examples might give ideas for your MI suite.
Steep learning curve but worth it
Easy fixes
Better models
Better actuaries, better business.
Handling uncertainty
Better actuaries, bigger scope, no IT reliance.
Our Offering
Pricing · Data Science · InsurTech · Capital Modelling · Solvency II · Reserving · IFRS 17 · Training · Process Efficiency
Reserving as a Service: API to Report
Dynamic MI: showing trends automatically
Risk Triage with Machine Learning
Cloud data pipeline engineering
References
• R and RStudio: r-project.org rstudio.com
• GAMs in the R package mgcv: cran.r-project.org/web/packages/mgcv
• Imputation using impute package (knn) or algorithm of your choice
• Cross-validation, Bootstrapping and other modelling functions with:
– caret package topepo.github.io/caret
– tidymodels suite of packages tidymodels.org
• ML examples shown here use XGBoost xgboost.readthedocs.io
• Amazon Web Services (AWS) and Digital Ocean used for Cloud examples
Contact us to ask about R code and packages used
16 July 2020
The views expressed in this [publication/presentation] are those of invited contributors and not necessarily those of the IFoA. The IFoA do not endorse any of the
views stated, nor any claims or representations made in this [publication/presentation] and accept no responsibility or liability to any person for loss or damage
suffered as a consequence of their placing reliance upon any view, claim or representation made in this [publication/presentation].
The information and expressions of opinion contained in this publication are not intended to be a comprehensive study, nor to provide actuarial advice or advice
of any nature and should not be treated as a substitute for specific advice concerning individual situations. On no account may any part of this
[publication/presentation] be reproduced without the written permission of the IFoA [or authors, in the case of non-IFoA research].
Questions? Comments?


Editor's Notes

• #9–#10 EDA (PCA and Clustering): experiments on a personal lines dataset showed no improvement. No quick win.
• #11 Imputation: statistical models are less robust under missing data, restricting dataset size. So build a small model to predict the missing variable, then add it back to the dataset where it is missing. The model is only used in the middle of the process, so it could be a black box. Imputed values help retain data volumes and preserve a realistic effect from the rating factor – an imputed value is often better than nothing.
• #14–#15 ML: first we will examine some of the ML approaches that can be used if you are legacy-constrained.
• #16 ML factor predictiveness: predictiveness output is available in the basic implementation of XGBoost, so it is relatively easy to get out.
• #19 ML at full power: non-linear, discontinuous data, and also complex interactions we can’t see. ML models are easy to deploy in the cloud.
• #21 CV: testing interactions with time is a traditional method used to achieve a similar result – checking consistency across different “folds” of the dataset.