Pricing like a Data Scientist
Matthew Evans
EMC Actuarial and Analytics
16 July 2020
Pricing like a Data Scientist
Augmenting traditional actuarial pricing with data
science approaches
16 July 2020
Pricing like a Data Scientist
This presentation sets out the key insights from my work with data scientists and
data science techniques over the last five years.
Examples are presented showing how traditional actuarial pricing can be
augmented with data science tools and approaches.
Suggestions of varying reward/effort are offered for enhancing your pricing
process.
Agenda
• The traditional actuarial pricing approach
• The data science process
• Enhancing our pricing processes
– What are they doing that we aren’t?
– Which bits could work for us and our datasets?
[Diagram: Actuarial Pricing (burning cost, credibility weighting, triangles, curve fitting) overlapping with Data Science (data visualisation, deep learning, NLP, vision, clustering); shared ground includes GAMs, PCA, the Cloud, GLMs, R, resampling and ML.]
We will focus on this adjacent area, on the techniques most relevant to actuarial work.
The traditional actuarial pricing approach
• MI and Analytics on corporate systems
• Risk premium modelling using GLMs
• MI supplemented with ad hoc analyses
• Market rates (Optimisation/Underwriter loads)
• Model deployment on corporate systems/tools.
Modelling frequency and severity by claims type
Train and test datasets
Poisson/Gamma GLM error structures
The data science process: EDA
• Exploratory Data Analysis (“EDA”) and Data Visualisation are an established part of the data science process
– Examination of features of data
– Distributions
– Correlations
– Suitability of proposed model structure
– Clustering & dimensionality reduction
Important checks for data scientists working
across multiple domains with different and
unfamiliar datasets.
The data science process: EDA
• Pricing actuaries use similar
processes to monitor developing
experience:
– Management information
– Ad hoc analyses.
Basic EDA tends to be less important in actuarial pricing, where datasets are familiar and characteristics are broadly stable.
But could more sophisticated methods work for us?
The data science process: EDA - PCA
• Dimensionality Reduction,
here using Principal Components
Analysis, can be used to reduce
a wide dataset to a smaller set of
features
• 139 original factors converted to
70 “Principal Components” which
may help modelling
• Less useful if dataset narrow.
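For illustration, a minimal PCA sketch in base R – the data frame rating_wide and the 95% variance threshold are assumptions, not from the deck:

```r
# Minimal PCA sketch in base R. 'rating_wide' is a hypothetical data
# frame of numeric rating factors; the 95% threshold is illustrative.
pca <- prcomp(rating_wide, center = TRUE, scale. = TRUE)

# Keep enough components to explain, say, 95% of the variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep <- which(var_explained >= 0.95)[1]

# Scores for the retained components become candidate model features
pcs <- as.data.frame(pca$x[, seq_len(n_keep)])
```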
The data science process: EDA - Clustering
• Clustering, here hierarchical
clustering, groups similar
records
• Clusters appear to show
differences in frequency in
our test dataset, which may
help modelling
• Less useful if dataset
broadly homogeneous.
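A hedged sketch of hierarchical clustering in base R – the policies data frame and its columns are assumptions for illustration:

```r
# Hierarchical clustering sketch; 'policies' and its columns are hypothetical.
num_cols <- c("driver_age", "vehicle_value", "annual_mileage")
d  <- dist(scale(policies[, num_cols]))  # distances on standardised factors
hc <- hclust(d, method = "ward.D2")
policies$cluster <- cutree(hc, k = 5)    # cut the tree into 5 groups

# Do the clusters differ in claim frequency?
policies$freq <- policies$claim_count / policies$exposure
aggregate(freq ~ cluster, data = policies, FUN = mean)
```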
The data science process: Imputation
• EDA may expose deficiencies in data or missing data. Imputation can
help.
• A simple thing but something we use often at EMC, especially in
commercial lines with more challenging datasets.
Build small model to predict missing value
Add predicted values back onto dataset where missing
Imputed values retain data volume for rest of process
Final model shows improved performance
The data science process: Imputation
• This graph shows how the
Previous Keepers field is imputed
using a tree-based machine
learning model
• Most missing values were
assigned to zero – manufactured
date was close to purchased date
• Various models downstream now have the benefit of a fully-populated field.
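A minimal sketch of tree-based imputation – rpart stands in for the tree model used on the slide, and the vehicles data frame and column names are assumptions:

```r
library(rpart)

# Tree-based imputation sketch: fit on rows where the field is present,
# predict where it is missing. Data and columns are hypothetical.
train <- subset(vehicles, !is.na(prev_keepers))
fit <- rpart(prev_keepers ~ manufacture_year + purchase_year + mileage,
             data = train, method = "anova")

# Write the predictions back where the field is missing
miss <- is.na(vehicles$prev_keepers)
vehicles$prev_keepers[miss] <- round(predict(fit, newdata = vehicles[miss, ]))
```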
The data science process: Feature Engineering
• Feature Engineering is the construction of new features from existing
data
– Using domain knowledge to construct features can help models fit better, faster,
or improve model interpretability
– Pricing actuaries already do this, with NCB level, say, or other engineered risk
scores, but it tends not to be a focus
• Data scientists focus more on this because their datasets are often
wider and individual features may lack predictive power.
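A small feature-engineering sketch in R – the data frame and column names are assumptions for illustration:

```r
# Feature engineering sketch: derive new features from existing fields.
# 'policies' and the column names are hypothetical.
policies$vehicle_age  <- policies$quote_year - policies$manufacture_year
policies$value_per_cc <- policies$vehicle_value / policies$engine_cc
policies$age_band     <- cut(policies$driver_age,
                             breaks = c(17, 21, 25, 30, 40, 60, Inf))
```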
The data science process: ML
• Machine Learning approaches can be shown to “beat” GLMs across a number of metrics; they are also easier to fit, at the expense of being more difficult to interpret
• If legacy constraints mean a traditional model must be used, ML approaches can still offer insights:
– Showing the most predictive factors, guiding model design
– Exposing interactions
– Showing optimum banding for numerical variables
– Giving a “target” model score for competing statistical approaches.
The data science process: ML
• Factor predictiveness
analyses from machine
learning can inform model
design
• This diagnostic from an
XGBoost model shows us
the most important
factors for use in a legacy
GLM, say.
XGBoost model less fussy than GLM so you can save time preparing data
Use results to design GLM model formula
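A hedged sketch of pulling factor importance from XGBoost (named in the references) – X, a numeric matrix of rating factors, and y, claim counts, are hypothetical stand-ins:

```r
library(xgboost)

# Factor-importance sketch; X (numeric matrix of rating factors) and
# y (claim counts) are hypothetical.
dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "count:poisson",
                               max_depth = 3, eta = 0.1),
                 data = dtrain, nrounds = 200)

imp <- xgb.importance(model = bst)
head(imp)  # the Gain column ranks factors to shortlist for the GLM
```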
The data science process: ML
• Similar approaches
allow interactions to be
exposed
• These approaches
involve digging into the
black box of machine
learning models.
These pairs of rating factors appear together most frequently in model trees.
We can experiment with high-scoring interactions in our GLM.
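One way to dig into the black box: count which factor pairs co-occur in the same trees. A rough sketch, clearest for shallow trees; bst is the hypothetical model from the previous sketch:

```r
library(xgboost)
library(data.table)

# Rough interaction screen: which factor pairs appear together in the
# same trees of the fitted model? A crude proxy, not the deck's exact method.
trees <- xgb.model.dt.tree(model = bst)
tree_feats <- trees[Feature != "Leaf",
                    .(feats = list(sort(unique(Feature)))), by = Tree]
pairs <- unlist(lapply(tree_feats$feats, function(f)
  if (length(f) >= 2) combn(f, 2, paste, collapse = " x ")))
sort(table(pairs), decreasing = TRUE)[1:5]  # candidate GLM interactions
```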
The data science process: ML
• A tree-based machine
learning model finds
optimal points to split
variables
• Seeing where the tree
model branches
numerical variables can
inform banding
• The biggest difference is
above or below £5,000.
Note that treating the variable as a smoother in a GAM is preferable: it avoids banding altogether.
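A sketch of extracting split points from a single tree – rpart stands in for the tree model, and the dataset and £5,000-style boundary are illustrative:

```r
library(rpart)

# Split-point sketch: a Poisson tree on one numeric factor; where it
# splits can seed GLM bands. 'policies' and columns are hypothetical.
fit <- rpart(cbind(exposure, claim_count) ~ vehicle_value,
             data = policies, method = "poisson",
             control = rpart.control(cp = 0.002))
sort(unique(fit$splits[, "index"]))  # candidate band boundaries (e.g. ~5000)
```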
The data science process: ML
• When not legacy-constrained, the full power of modern ML
approaches can be realised:
– InsurTech start-ups make common use of ML models for rating
– ML models are well-suited to non-linear, discontinuous data – for example in
modelling market rates using aggregator data
– ML models work well with wide data where traditional approaches break down
– ML models are easily deployed in the cloud.
The data science process: GAMs
• Generalised Additive Models have found wide application in data science, where they are often used in preference to GLMs
• GAMs do all that GLMs do:
– Same formula and error structures, so it is easy to convert a GLM
– Same structure for offsets (fixed relativities) where required
– Same output of multiplicative relativities for deployment
– Same diagnostic outputs for peer review and underwriter assessment
• But GAMs offer a much more convenient treatment of splines:
– “Smoothers” handle non-linear numerical variables
– Much less to go wrong; polynomials, knots, etc. can be avoided
– A shape-constrained fit can impose shapes on smoothers
– Regularised fitting.
GAM formula in R often the same as for a GLM; easy changes for numeric terms only
Outputs are the same, so no issue if you need multiplicative factor loads
Models fit better and make better use of data
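A minimal GLM-to-GAM conversion sketch with mgcv (named in the references) – the dataset and formula are assumptions:

```r
library(mgcv)

# GLM-style frequency model; data and columns are hypothetical.
glm_fit <- glm(claim_count ~ veh_group + driver_age + offset(log(exposure)),
               family = poisson(), data = policies)

# Same formula, but the numeric term is wrapped in s() as a smoother
gam_fit <- gam(claim_count ~ veh_group + s(driver_age) + offset(log(exposure)),
               family = poisson(), data = policies, method = "REML")

plot(gam_fit, pages = 1)  # inspect the fitted smoother
```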
The data science process: CV
• Cross-Validation is a widely used technique in data science for developing and testing models
– Rather than a single train/test split, the dataset is split into many “folds”, and a model is fitted and tested with each fold held out in turn
– This approach gives more stable results for model assessment, improving model selection and avoiding over-fitting
– CV is important in data science, where machine learning models are more likely to overfit than traditional statistical models
• The approach also lends itself to grid-searching for optimal hyper-parameters.
Time tests offer a means of achieving the same ends
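A k-fold cross-validation sketch with caret (named in the references) – the dataset and formula are assumptions:

```r
library(caret)

# 5-fold cross-validation sketch; 'policies' and columns are hypothetical.
ctrl <- trainControl(method = "cv", number = 5)
cv_fit <- train(claim_count ~ veh_group + driver_age, data = policies,
                method = "glm", family = poisson(), trControl = ctrl)
cv_fit$results  # performance averaged over the held-out folds
```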
The data science process: Grid-Search
• Machine Learning models use hyper-parameters which control the model-fitting process – they sit above the model parameters which reflect the fit
• A Grid-Search is an exhaustive search through a set of hyper-parameters until the best performing model is found
• Related techniques, including Bayesian optimisation, offer further means of finding the best hyper-parameters, and hence the best model, with less effort.
Choose a set of hyper-parameters, the “grid”
Fit and test each combination with cross-validation
Compare results of all combinations
Select the best set and refit using the whole dataset
The data science process: Grid-Search
• Grid-search from
machine learning
algorithm XGBoost
shown here
• Can also be applied to
traditional modelling:
– Setting Tweedie power
– Claim truncation
– Inflation assumption.
Here we test different “depths” of an XGBoost model
Impact is small, with wide variance within each hyper-parameter set
We choose depth=1 for the smallest error and the simplest model
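A sketch of the depth grid-search with cross-validation – dtrain is the hypothetical xgb.DMatrix from the earlier sketch, and the metric column name follows XGBoost's default Poisson metric:

```r
library(xgboost)

# Grid-search over tree depth, echoing the slide; all names illustrative.
grid <- data.frame(max_depth = 1:4)
grid$cv_error <- sapply(grid$max_depth, function(d) {
  cv <- xgb.cv(params = list(objective = "count:poisson",
                             max_depth = d, eta = 0.1),
               data = dtrain, nrounds = 200, nfold = 5, verbose = 0)
  min(cv$evaluation_log$test_poisson_nloglik_mean)
})
grid  # pick the depth with the smallest cross-validated error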
The data science process: Bootstrapping
• The traditional approach for pricing actuaries is to use confidence intervals
when examining models
– We can see exposure levels and have an idea of the potential range of a relativity or
prediction
– Underwriters might adjust relativities with low certainty
• But Bootstrapping offers the same and more…
The data science process: Bootstrapping
• Bootstrapping is a form of resampling
widely used in data science
• Bootstrapped model metrics are more flexible than confidence intervals:
– They give the whole distribution
– They work for non-statistical machine learning models
– We can use the results to make models safe, for example by refitting to the 65th percentile.
Sample from dataset with replacement
Refit model
Variation in models represents uncertainty
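A bootstrap sketch in base R – the formula, dataset and factor level are assumptions for illustration:

```r
set.seed(42)

# Bootstrap sketch: resample with replacement, refit, collect a relativity.
# 'policies' and the factor level "B" are hypothetical.
boots <- replicate(500, {
  idx <- sample(nrow(policies), replace = TRUE)
  fit <- glm(claim_count ~ veh_group + offset(log(exposure)),
             family = poisson(), data = policies[idx, ])
  exp(coef(fit)["veh_groupB"])  # multiplicative relativity for group B
})

# Whole distribution, including the 65th percentile used to "make safe"
quantile(boots, c(0.05, 0.50, 0.65, 0.95))
```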
The data science process: Bayesian Methods
• Bayesian approaches use an assumed
prior distribution
• The prior is combined with emerging data
to give a posterior distribution
• Like bootstrapping, this gives an idea of
uncertainty
• It also gives a robust framework for
credibility-weighting
• Models are generative so can easily be
used for simulations.
Bayesian programming language: Stan; R package rstanarm used here
[Chart: prior, actual cumulative observed, and posterior mean & distribution]
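A minimal sketch with rstanarm (named on the slide) – the prior, dataset and formula are illustrative assumptions:

```r
library(rstanarm)

# Bayesian frequency model sketch; 'policies' and the prior are hypothetical.
fit <- stan_glm(claim_count ~ veh_group + offset(log(exposure)),
                family = poisson(), data = policies,
                prior = normal(0, 0.5), chains = 4, iter = 2000)

posterior_interval(fit, prob = 0.90)  # uncertainty around each relativity
sims <- posterior_predict(fit)        # generative: simulate claim counts
```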
The data science process: Automation
• An automated process allows actuaries to focus on adding value
[Pipeline diagram: legacy claims system → automatic upload (secure and maintenance-free) → cloud warehouse → cloud R server processes data, runs models and produces reports & MI automatically → refreshed rating model, analytics cube, reports in MS Word and impact analysis → human and commercial judgement → updated rating model including adjustments → new rates go live on a “serverless” quote system.]
The data science process: The Cloud
• Many of the approaches discussed here rely on fitting multiple models,
either for cross-validation or bootstrapping
• Multiple model fits mean greater run times, however…
• Data science approaches make heavy use of parallel processing and cloud-
based infrastructure
– Parallel processing is possible on a local PC, using each core of a modern CPU or GPU
– Cloud-based machines are relatively inexpensive to rent and are well-specified, allowing
multiple models to be fitted quickly.
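A local parallel-processing sketch using base R's parallel package – the dataset and formula are assumptions:

```r
library(parallel)

# Parallel bootstrap sketch on a local PC; 'policies' is hypothetical.
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "policies")
fits <- parLapply(cl, 1:200, function(i) {
  idx <- sample(nrow(policies), replace = TRUE)
  glm(claim_count ~ veh_group + offset(log(exposure)),
      family = poisson(), data = policies[idx, ])
})
stopCluster(cl)
```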
The data science process: The Cloud
• Various platforms allow traditional or machine learning models to be deployed
robustly, and at scale, in the cloud
– Load-balanced virtual machines or containers running R can serve model results
through an API
– “Serverless” platforms allow models and code to be deployed directly.
No need to extract model parameters and recode… just deploy the native R model
Can scale all the way up to 100m quotations a year – enough for a motor aggregator
No servers to maintain, security, updates, etc.
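A sketch of serving a native R model through an API – plumber is one common package for this (it is not named in the deck), and the file and parameter names are hypothetical:

```r
# plumber.R: serve a saved model over HTTP; all names are illustrative.
library(plumber)

model <- readRDS("rating_model.rds")  # hypothetical saved GLM/GAM

#* @param driver_age:numeric Driver age for the quote
#* @get /relativity
function(driver_age) {
  newdata <- data.frame(driver_age = as.numeric(driver_age))
  predict(model, newdata = newdata, type = "response")
}

# Launch locally: plumber::pr_run(plumber::pr("plumber.R"), port = 8000)
```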
The data science process: The Cloud
RStudio running on an AWS Linux virtual machine
Online documentation for the API interface, running here in R on a virtual server
This API, running “serverless” on AWS Lambda, returns a little bit of JSON describing a database entry
Key Insights ranked by reward/effort
1. The R platform offers a new world of insight and flexibility
2. GAMs are an easy substitution to make for GLMs
3. Imputation of missing fields is easy and helps avoid losing data
4. Feature engineering can help models fit better
5. Cross-validation is so much better than test and train
6. Machine learning can offer insights even to legacy-constrained processes
7. Automation allows your team to concentrate on interpretation, not process
8. Bayesian approaches allow robust credibility-weighting and simulations
9. Bootstrapping offers an underwriter-friendly way of showing uncertainty
10. The cloud can speed analysis and support robust deployment of models
11. EDA examples might give ideas for your MI suite.
Steep learning curve but worth it
Easy fixes
Better models
Better actuaries, better business.
Handling uncertainty
Better actuaries, bigger scope, no IT reliance.
Our Offering
Pricing · Data Science · InsurTech · Capital Modelling · Solvency II · Reserving · IFRS 17 · Training · Process Efficiency
Reserving as a Service: API to Report
Dynamic MI: showing trends automatically
Risk Triage with Machine Learning
Cloud data pipeline engineering
References
• R and RStudio: r-project.org rstudio.com
• GAMs in the R package mgcv: cran.r-project.org/web/packages/mgcv
• Imputation using impute package (knn) or algorithm of your choice
• Cross-validation, Bootstrapping and other modelling functions with:
– caret package topepo.github.io/caret
– tidymodels suite of packages tidymodels.org
• ML examples shown here use XGBoost xgboost.readthedocs.io
• Amazon Web Services (AWS) and Digital Ocean used for Cloud examples
Contact us to ask about R code and packages used
16 July 2020
The views expressed in this [publication/presentation] are those of invited contributors and not necessarily those of the IFoA. The IFoA do not endorse any of the
views stated, nor any claims or representations made in this [publication/presentation] and accept no responsibility or liability to any person for loss or damage
suffered as a consequence of their placing reliance upon any view, claim or representation made in this [publication/presentation].
The information and expressions of opinion contained in this publication are not intended to be a comprehensive study, nor to provide actuarial advice or advice
of any nature and should not be treated as a substitute for specific advice concerning individual situations. On no account may any part of this
[publication/presentation] be reproduced without the written permission of the IFoA [or authors, in the case of non-IFoA research].
Questions? Comments?


Editor's Notes

• #9–#10 EDA (PCA and Clustering): experiments on a personal lines dataset showed no improvement. No quick win.
• #11 Imputation: statistical models are less robust under missing data, restricting dataset size. So build a small model to predict the missing variable, then add it back to the dataset where it is missing. The model is only used in the middle of the process, so it could be a black box. Imputed values help retain data volumes and preserve a realistic effect from the rating factor – an imputed value is often better than nothing.
• #14–#15 ML: first we will examine some of the ML approaches that can be used if you are legacy-constrained.
• #16 ML factor predictiveness: predictiveness output is available in the basic implementation of XGBoost, so it is relatively easy to get out.
• #19 ML at full power: non-linear, discontinuous data, and also complex interactions we can’t see. ML models are easy to deploy in the cloud.
• #21 CV: testing interactions with time is a traditional method used to achieve a similar result – checking consistency across different “folds” of the dataset.