2. Opportunities in P&C Modeling
A tremendous amount of effort is spent on data preparation and exploration that
can be automated and streamlined.
Let machines handle the tedious data work so that modelers can focus on modeling
methodology and statistical inference.
(Diagram) The model development data pipeline takes heterogeneous data sources
(credit, vehicle, telematics, geographic) through four stages:
- Data Screen: filter redundant data fields; retain relevant information.
- Anomaly Treatment: impute missing values; winsorize data outliers; recode special values.
- Data Transform: explore data distributions; identify the best transformation to improve linearity.
- Predictive Ranking: assess variable predictiveness; identify important model drivers.
Data preparation consumed 50+% of the time in model development.
3. Banking Practice
In retail credit risk models, the Weight of Evidence (WoE) transformation* has been
widely used to improve the efficiency of model development:
$$WoE_i = \ln\left(\frac{\#\text{ of }(Y{=}1)\text{ in }i^{th}\text{ category} \;/\; \#\text{ of }(Y{=}0)\text{ in }i^{th}\text{ category}}{\text{Total }\#\text{ of }(Y{=}1) \;/\; \text{Total }\#\text{ of }(Y{=}0)}\right)$$
The number of categories (i.e. bins) is derived from discretization of the $X$ vector, with
missing values handled separately.
In consideration of regulatory scrutiny and model interpretation, strict monotonicity is
assumed between $X$ and $WoE_X$.
All monotonic functions of $X$, e.g. logarithm, exponential, or linear, should converge to
the same monotonic $WoE_X$ transformation.
That is, $WoE_i$ is the log of the ratio between the odds in the $i^{th}$ category and the overall odds.
* https://pypi.org/project/py-mob/
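Below is a minimal sketch of this formula; the calc_woe() helper is hypothetical (not the
py-mob implementation) and assumes bin labels have already been assigned to each record.
import numpy as np, pandas as pd

def calc_woe(bin_idx, y):
    # y is binary (0/1); bin_idx assigns each record to a category
    g = pd.DataFrame({"bin": bin_idx, "y": y}).groupby("bin")["y"]
    ones, n = g.sum(), g.count()
    zeros = n - ones
    # log of (odds in the ith category) over (overall odds)
    return np.log((ones / zeros) / (ones.sum() / zeros.sum()))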
4. Adoption in P&C Models
For P&C loss models, a modified approach is proposed to mimic the idea of the
$WoE$ transformation, as shown below.
$$F(X_i) = \ln\left(\frac{\text{Losses in }i^{th}\text{ category} \;/\; \#\text{ of cases in }i^{th}\text{ category}}{\text{Total losses} \;/\; \text{Total }\#\text{ of cases}}\right)$$
The interpretation of $F(X_i)$ is intuitive: it is the logarithm of the ratio between the
average loss in the $i^{th}$ category and the overall average loss.
With missing values falling into a standalone category or combined with a similar
neighbor, no special treatment (i.e. imputation) is necessary.
Since the transformation projects raw values of $X$ into the data space of $Y$ based on
their rank order, concerns about outliers in $X$ are neutralized.
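Below is a minimal pandas sketch of this transformation; the calc_fx() helper is
hypothetical (not the loss_mob implementation) and assumes categories have already been assigned.
import numpy as np, pandas as pd

def calc_fx(bin_idx, loss):
    # average loss in each category relative to the overall average loss
    avg_bin = pd.DataFrame({"bin": bin_idx, "loss": loss}).groupby("bin")["loss"].mean()
    return np.log(avg_bin / np.mean(loss))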
5. Outline of Loss_Mob Package
The Python package loss_mob (https://pypi.org/project/loss-mob) is my weekend
project, an attempt to tackle the most tedious yet critical task in P&C loss
model development.
Core Functionality
- Variable Information: Coefficient of Variation; Spearman and Distance Correlation Coefficients; Mutual Information Score; Gini Coefficient.
- Binning Algorithms: fine binning based on GBM or isotonic regression; coarse binning based on density or value range; customized binning based on user inputs.
- Utility Functions: tabulation of binning results; application of binning outcomes to new data; verification of data transformations; sMAPE for model performance.
6. Demo Based on MTPL Data
The French Motor Third-Party Liability (MTPL) Claims dataset from OpenML* is used in
the subsequent demo.
import loss_mob as mob, pandas as pd, numpy as np, statsmodels.api as sm
dt = mob.get_mtpl() # https://github.com/dutangc/CASdatasets
dt.keys()
# dict_keys(['idpol', 'claimnb', 'exposure', 'area', 'vehpower', 'vehage', 'drivage', 'bonusmalus',
#            'vehbrand', 'vehgas', 'density', 'region', 'claimamount', 'purepremium'])
pd.DataFrame(dt).head(3)
… vehpower vehage drivage bonusmalus vehbrand vehgas density region claimamount purepremium
… 7 1 61 50 B12 Regular 27000 R11 303.00 404.0000
… 12 5 50 60 B12 Diesel 56 R25 1981.84 14156.0000
… 4 0 36 85 B12 Regular 4792 R11 1456.55 10403.9286
* https://www.openml.org
7. Variable Screening
The screen() function assesses the association between each $X$ and $Y$.
A consistent magnitude between the Spearman and Distance correlations indicates a
largely linear association in the context of a GLM.
# variable list to screen
vlst = ["vehpower", "vehage", "drivage", "bonusmalus", "density"]
# screen through each attribute
summ = [{"variable": _, **mob.screen(dt[_], dt["purepremium"])} for _ in vlst]
# sort the summary by distance correlation
pd.DataFrame(sorted(summ, key = lambda x: -x["distance correlation"]))
variable … coefficient of variation spearman correlation distance correlation gini coefficient
bonusmalus … 0.261651 0.057169 0.043454 0.364684
drivage … 0.310719 -0.004906 0.014289 0.319361
density … 2.208544 0.020221 0.011069 0.075396
vehage … 0.804375 0.019526 0.010801 0.093274
vehpower … 0.317741 0.002307 0.003570 0.026760
8. Variable Screening in Parallel
Scalability is at the heart of the development philosophy.
Functions in loss_mob can be easily parallelized and scaled to 1,000+ predictors.
# first, define a wrapper to be consumed by the parallel map
def pscreen(v):
    return {"variable": v, **mob.screen(dt[v], dt["purepremium"])}
# next, load necessary modules
from multiprocessing import Pool, cpu_count
from contextlib import closing
with closing(Pool(processes = cpu_count())) as pool:
    psum = pool.map(pscreen, vlst)
    pool.terminate()
pd.DataFrame(sorted(psum, key = lambda x: -x["gini coefficient"])).head(3)
variable ... spearman correlation distance correlation gini coefficient
bonusmalus ... 0.057169 0.043454 0.364684
drivage ... -0.004906 0.014289 0.319361
vehage ... 0.019526 0.010801 0.093274
10. Visual of Variable Transformation
Because $F(X)$, i.e. $NewX$, is strictly linear with respect to $\ln(Y)$, the linearity of model
predictors in the GLM has been enhanced.
Each category of $F(X)$ is an aggregation over a segment of records. As a result, the
model estimated with the transformed $X$ should be more stable and less prone to overfitting.
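This linearity can be checked directly: by construction, $\ln$ of the average loss per bin
differs from $NewX$ only by a constant, namely $\ln$ of the overall average loss. A minimal
sketch using the demo objects, assuming cal_newx() preserves the input record order:
bin0 = mob.gbm_bin(dt["bonusmalus"], dt["purepremium"])
new0 = mob.cal_newx(dt["bonusmalus"], bin0)
chk = pd.DataFrame({"newx": [_["newx"] for _ in new0], "y": dt["purepremium"]})
avg = chk.groupby("newx")["y"].mean()
# each element should be (approximately) the same constant, i.e. ln(overall average loss)
print(np.round(np.log(avg.values) - avg.index.values, 4))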
11. Treatment of Missing Values - I
Case I - The binning algorithm groups all missing values into a standalone category
and then assigns a value to $NewX$ based on the corresponding average loss.
np.random.seed(1)
test_x = np.where(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8, np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 0 | 135182 | 135182 | 35578527.4354 | 263.1898 | -0.37584005 | numpy.isnan($X$) |
| 1 | 307519 | 0 | 67711301.2865 | 220.1857 | -0.55424410 | $X$ <= 50.0 |
| 2 | 77102 | 0 | 19992914.3670 | 259.3047 | -0.39071162 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43211 | 0 | 15596493.6976 | 360.9380 | -0.06000929 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 890 | 0 | 414193.4095 | 465.3859 | 0.19415125 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 35072 | 0 | 18566866.2822 | 529.3929 | 0.32301519 | $X$ > 70.0 and $X$ <= 79.0 |
| 6 | 56945 | 0 | 46318023.6829 | 813.3817 | 0.75248495 | $X$ > 79.0 and $X$ <= 96.0 |
| 7 | 153 | 0 | 227944.8955 | 1489.8359 | 1.35770566 | $X$ > 96.0 and $X$ <= 99.0 |
| 8 | 21939 | 0 | 55449516.2623 | 2527.4405 | 1.88624679 | $X$ > 99.0 |
12. Treatment of Missing Values - II
Case II - When no loss was incurred for records with missing values, all such
records are merged into the category with the lowest average loss.
test_x = np.where(np.logical_and(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8,
np.array(dt["purepremium"]) == 0), np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 439893 | 130194 | 80777461.9272 | 183.6298 | -0.73579385 | $X$ <= 50.0 or numpy.isnan($X$) |
| 2 | 77806 | 0 | 25208175.3327 | 323.9876 | -0.16801052 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43781 | 0 | 17814900.3274 | 406.9094 | 0.05987494 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 901 | 0 | 505020.2406 | 560.5108 | 0.38013292 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 33871 | 0 | 19639592.0866 | 579.8350 | 0.41402801 | $X$ > 70.0 and $X$ <= 76.0 |
| 6 | 1550 | 0 | 933807.2930 | 602.4563 | 0.45229955 | $X$ > 76.0 and $X$ <= 79.0 |
| 7 | 57623 | 0 | 52056038.4906 | 903.3899 | 0.85743868 | $X$ > 79.0 and $X$ <= 96.0 |
| 8 | 157 | 0 | 238409.0720 | 1518.5291 | 1.37678185 | $X$ > 96.0 and $X$ <= 99.0 |
| 9 | 21991 | 0 | 60879166.6831 | 2768.3674 | 1.97729742 | $X$ > 99.0 and $X$ <= 139.0 |
| 10 | 440 | 0 | 1803209.8657 | 4098.2042 | 2.36958856 | $X$ > 139.0 |
14. Variable Importance after Transformation
Because the monotonic binning preserves the rank-ordering ability of each attribute,
the binning outcomes can be leveraged to calculate the Gini coefficient and evaluate the
predictiveness of each predictor after transformation.
The Gini outcome is highly consistent with the Distance Correlation.
# calculate the gini-coefficient for each binned attribute
# (bout is the dict of binning outcomes keyed by variable name, as created in the model fitting section)
gout = [{"variable": _, "gini": mob.bin_gini(bout[_])} for _ in vlst]
# sort all attributes by gini-coefficients
pd.DataFrame(sorted(gout, key = lambda x: -x["gini"]))
variable     gini      gini before binning
bonusmalus   0.373600  0.364684
drivage      0.335541  0.319361
vehage       0.130189  0.093274
density      0.129020  0.075396
vehpower     0.076282  0.026760
The Gini coefficient improved after binning for every attribute.
15. Functions to apply transformations to new data and to verify the outcome.
Transforming New Data
bin1 = mob.qtl_bin(dt["bonusmalus"], dt["purepremium"])
# score new data based on the binning outcome
out1 = mob.cal_newx(dt['bonusmalus'], bin1)
mob.head(out1, 3)
# {'x': 50, 'bin': 1, 'newx': -0.60031106}
# {'x': 60, 'bin': 3, 'newx': -0.24666305}
# {'x': 85, 'bin': 4, 'newx': 0.51388758}
mob.chk_newx(out1)
| bin | newx | freq | dist | xrng |
|-------|-------------|--------|------------|--------------------------------|
| 1 | -0.60031106 | 384156 | 56.6591% | 50 <==> 50 |
| 2 | -0.34039004 | 68334 | 10.0786% | 51 <==> 57 |
| 3 | -0.24666305 | 80831 | 11.9217% | 58 <==> 68 |
| 4 | 0.51388758 | 82308 | 12.1396% | 69 <==> 85 |
| 5 | 1.25057961 | 62384 | 9.2010% | 86 <==> 230 |
16. Estimate a Tweedie GLM with the aforementioned predictors without any transformation.
Only 1 variable is statistically significant.
Model Fitting without Transformation
Y = dt["purepremium"]
# use raw variables
X1 = sm.add_constant(pd.DataFrame({v: dt[v] for v in vlst}), prepend = True)
m1 = sm.GLM(Y, X1, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 3.8697 0.575 6.729 0.000 2.743 4.997
bonusmalus 0.0344 0.005 6.792 0.000 0.024 0.044 1.3222
drivage -0.0055 0.006 -0.907 0.364 -0.017 0.006 1.3018
vehage -0.0024 0.013 -0.183 0.855 -0.029 0.024 1.0165
density -7.882e-06 1.92e-05 -0.411 0.681 -4.55e-05 2.98e-05 1.0194
vehpower 0.0124 0.037 0.338 0.735 -0.060 0.085 1.0084
17. Estimate a Tweedie GLM with the same predictors after transformation.
There are 3 statistically significant variables.
Model Fitting with Transformation
bout = dict((v, mob.iso_bin(dt[v], dt["purepremium"])) for v in vlst)
xout = dict((v, mob.cal_newx(dt[v], bout[v])) for v in vlst)
X2 = sm.add_constant(pd.DataFrame(dict((v, [_["newx"] for _ in xout[v]]) for v in vlst)), prepend = True)
m2 = sm.GLM(Y, X2, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 5.9635 0.066 91.001 0.000 5.835 6.092
bonusmalus 0.4727 0.115 4.115 0.000 0.248 0.698 1.3983
drivage 0.6632 0.126 5.254 0.000 0.416 0.911 1.3794
vehage 0.0726 0.228 0.319 0.750 -0.374 0.519 1.0119
density 0.4690 0.227 2.063 0.039 0.023 0.915 1.0215
vehpower 0.5291 0.417 1.269 0.205 -0.288 1.347 1.0055
18. A performance comparison between the model without variable transformation and
the model with variable transformation is provided below.
Model Performance
| Statistical Metrics | Without Transformation | With Transformation |
|---------------------|------------------------|---------------------|
| AIC                 | 848,321                | 821,825             |
| Gini                | 0.3847                 | 0.4103              |
| sMAPE               | 1.9724                 | 1.9744              |
| MAE                 | 734.6655               | 717.6874            |
| D2 Tweedie Score    | 0.0393                 | 0.0553              |
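Below is a sketch of how such metrics could be reproduced from the two fitted models;
the argument orders of num_gini() and smape() are assumptions based on their
descriptions in Appendix II.
from sklearn.metrics import mean_absolute_error, d2_tweedie_score
for tag, m, X in [("without", m1, X1), ("with", m2, X2)]:
    pred = m.predict(X)
    print(tag, round(m.aic),                                # AIC from the fitted GLM
          round(mob.num_gini(pred, Y), 4),                  # gini between predictions and Y (assumed order)
          round(mob.smape(Y, pred), 4),                     # sMAPE between Y and Yhat (assumed order)
          round(mean_absolute_error(Y, pred), 4),           # MAE
          round(d2_tweedie_score(Y, pred, power = 1.8), 4)) # D2 Tweedie score at var_power = 1.8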
19. Distance correlation is a dependence measure between two paired vectors.
Appendix I: Distance Correlation
(Figure: side-by-side scatterplots comparing Distance Correlation and Spearman Correlation
across dependence patterns. Source: github.com/vnmabus/dcor)
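A small hypothetical illustration of the difference, using dcor.distance_correlation()
from the package above: a non-monotonic dependence is invisible to Spearman correlation
but is picked up by distance correlation.
import numpy as np, dcor
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
x = rng.normal(size = 1000)
y = x ** 2 + rng.normal(scale = 0.1, size = 1000)  # y depends on x, but not monotonically
print(spearmanr(x, y)[0])                # near zero: Spearman misses the dependence
print(dcor.distance_correlation(x, y))   # clearly positive: distance correlation detects it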
20. Appendix II: Core Functions of Loss_Mob
loss_mob
|-- qtl_bin() : Iterative discretization based on quantiles of X.
|-- los_bin() : Revised iterative discretization for records with Y > 0.
|-- iso_bin() : Discretization driven by the isotonic regression.
|-- val_bin() : Revised iterative discretization based on unique values of X.
|-- rng_bin() : Revised iterative discretization based on the equal-width range of X.
|-- kmn_bin() : Iterative discretization based on the k-means clustering of X.
|-- gbm_bin() : Discretization based on the gradient boosting machine (GBM).
|-- cus_bin() : Customized discretization based on pre-determined cut points.
|-- view_bin() : Displays the binning outcome in a tabular form.
|-- cal_newx() : Applies the variable transformation to a numeric vector based on binning outcome.
|-- chk_newx() : Verifies the transformation generated from the cal_newx() function.
|-- mi_score() : Calculates the Mutual Information (MI) score between X and Y.
|-- screen() : Calculates Spearman and Distance Correlations between X and Y.
|-- bin_gini() : Calculates the gini-coefficient between X and Y based on the binning object.
|-- num_gini() : Calculates the gini-coefficient between raw values of X and Y.
|-- smape() : Calculates the sMAPE value between Y and Yhat.
`-- get_mtpl() : Extracts the French Motor Third-Party Liability Claims dataset from OpenML.