Variable Transformation in P&C Loss Models Based on Monotonic Binning
WenSui Liu, Nov 2023
Opportunities in P&C Modeling
 A tremendous effort is spent on data preparation and exploration that could be automated and streamlined.
 Let machines handle the tedious data work so that modelers can focus on modeling methodology and statistical inference.
Pipeline: Model Development Data (heterogeneous data sources: credit, vehicle, telematics, geographic) → Data Screen → Anomaly Treatment → Data Transform → Predictive Ranking
 Data Screen: filter redundant data fields; retain relevant information.
 Anomaly Treatment: impute missing values; winsorize data outliers; recode special values.
 Data Transform: explore data distribution; identify the best transformation to improve linearity.
 Predictive Ranking: assess variable predictiveness; identify important model drivers.
Data preparation consumes 50+% of the time in model development.
Banking Practice
 In retail credit risk models, Weight of Evidence transformation* has been widely
used to improve the efficiency of model development:
$$WoE_i = \ln\!\left(\frac{\#\{Y=1 \text{ in } i^{th} \text{ category}\} \;/\; \#\{Y=0 \text{ in } i^{th} \text{ category}\}}{\text{Total } \#\{Y=1\} \;/\; \text{Total } \#\{Y=0\}}\right) = \ln\!\left(\frac{\text{Odds in } i^{th} \text{ category}}{\text{Overall odds}}\right)$$
 The number of categories (i.e. bins) is derived from a discretization of the $X$ vector, with missing values handled separately.
 In consideration of regulatory scrutiny and model interpretation, strict monotonicity is assumed between $X$ and $WoE_X$.
 All monotonic functions of $X$, e.g. logarithmic, exponential, or linear, should converge to the same monotonic $WoE_X$ transformation.
* https://pypi.org/project/py-mob/
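 A minimal sketch of the WoE calculation above, assuming precomputed bin labels and a binary outcome; woe_by_bin() is a hypothetical helper for illustration, not the py-mob API:
import numpy as np, pandas as pd
def woe_by_bin(bins, y):
    # bins: bin label per record; y: binary outcome coded 0/1
    df = pd.DataFrame({"bin": bins, "y": y})
    n1, n0 = (df["y"] == 1).sum(), (df["y"] == 0).sum()
    grp = df.groupby("bin")["y"].agg(bad = "sum", good = lambda s: (s == 0).sum())
    # WoE_i = Ln[(bad_i / good_i) / (total bad / total good)]
    return np.log((grp["bad"] / grp["good"]) / (n1 / n0))
woe_by_bin([1, 1, 1, 2, 2, 2], [0, 0, 1, 0, 1, 1])
# bin 1: Ln((1/2)/(3/3)) = -0.6931; bin 2: Ln((2/1)/(3/3)) = 0.6931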
Adoption in P&C Models
 For P&C loss models, a modified approach is proposed that mimics the idea of the WoE transformation, as shown below.
$$F(X_i) = \ln\!\left(\frac{\text{Losses in } i^{th} \text{ category} \;/\; \#\text{ of cases in } i^{th} \text{ category}}{\text{Total losses} \;/\; \text{Total } \#\text{ of cases}}\right) = \ln\!\left(\frac{\text{Average loss in } i^{th} \text{ category}}{\text{Overall average loss}}\right)$$
 The interpretation of $F(X_i)$ is intuitive: it is the logarithm of the ratio between the average loss in the $i^{th}$ category and the overall average loss.
 With missing values falling into a standalone category or combined with a similar neighbor, no special treatment (i.e. imputation) is necessary.
 Since the transformation projects the raw values of $X$ into the data space of $Y$ based on their rank order, concerns about outliers in $X$ are neutralized.
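 Similarly, a minimal sketch of $F(X_i)$, assuming precomputed bin labels and a non-negative loss per record; loss_newx() is a hypothetical helper for illustration, not the loss_mob API:
import numpy as np, pandas as pd
def loss_newx(bins, loss):
    # F(X_i) = Ln(average loss in the ith category / overall average loss)
    df = pd.DataFrame({"bin": bins, "loss": loss})
    return np.log(df.groupby("bin")["loss"].mean() / df["loss"].mean())
loss_newx([1, 1, 2, 2], [100.0, 200.0, 400.0, 500.0])
# bin 1: Ln(150/300) = -0.6931; bin 2: Ln(450/300) = 0.4055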
Outline of the Loss_Mob Package
 The Python package loss_mob (https://pypi.org/project/loss-mob) is my weekend project, attempting to tackle the most tedious yet critical task in P&C loss model development.
Core Functionality
 Variable Information: Coefficient of Variation; Spearman and Distance correlation coefficients; Mutual Information score; Gini coefficient.
 Binning Algorithms: fine binning based on GBM or isotonic regression; coarse binning based on density or value range; customized binning based on user inputs.
 Utility Functions: tabulation of the binning result; application of the binning outcome to new data; verification of the data transformation; sMAPE for model performance.
Demo Based on MTPL Data
 The French Motor Third-Party Liability (MTPL) claims dataset from OpenML* is used in the subsequent demo.
import loss_mob as mob, pandas as pd, numpy as np, statsmodels.api as sm
dt = mob.get_mtpl() # https://github.com/dutangc/CASdatasets
dt.keys()
# dict_keys(['idpol', 'claimnb', 'exposure', 'area', 'vehpower', 'vehage', 'drivage', 'bonusmalus',
#            'vehbrand', 'vehgas', 'density', 'region', 'claimamount', 'purepremium'])
pd.DataFrame(dt).head(3)
… vehpower vehage drivage bonusmalus vehbrand vehgas density region claimamount purepremium
… 7 1 61 50 B12 Regular 27000 R11 303.00 404.0000
… 12 5 50 60 B12 Diesel 56 R25 1981.84 14156.0000
… 4 0 36 85 B12 Regular 4792 R11 1456.55 10403.9286
* https://www.openml.org
Variable Screening
 The screen() function assesses the association between each 𝑿 and 𝒀.
 A consistent magnitude between the Spearman and Distance correlations indicates a largely monotonic association, i.e. a strong linear signal in the context of a GLM.
# variable list to screen
vlst = ["vehpower", "vehage", "drivage", "bonusmalus", "density"]
# screen through each attribute
summ = [{"variable": _, **mob.screen(dt[_], dt["purepremium"])} for _ in vlst]
# sort the summary by distance correlation
pd.DataFrame(sorted(summ, key = lambda x: -x["distance correlation"]))
variable … coefficient of variation spearman correlation distance correlation gini coefficient
bonusmalus … 0.261651 0.057169 0.043454 0.364684
drivage … 0.310719 -0.004906 0.014289 0.319361
density … 2.208544 0.020221 0.011069 0.075396
vehage … 0.804375 0.019526 0.010801 0.093274
vehpower … 0.317741 0.002307 0.003570 0.026760
Variable Screening in Parallel
 Scalability is at the heart of the development philosophy.
 Functions in loss_mob can be easily parallelized and scaled to ~1,000+ predictors.
# first, define a wrapper to be consumed by the parallel map
def pscreen(v):
    return {"variable": v, **mob.screen(dt[v], dt["purepremium"])}
# next, load the necessary modules
from multiprocessing import Pool, cpu_count
from contextlib import closing
with closing(Pool(processes = cpu_count())) as pool:
    psum = pool.map(pscreen, vlst)
    pool.terminate()
pd.DataFrame(sorted(psum, key = lambda x: -x["gini coefficient"])).head(3)
variable ... spearman correlation distance correlation gini coefficient
bonusmalus ... 0.057169 0.043454 0.364684
drivage ... -0.004906 0.014289 0.319361
vehage ... 0.019526 0.010801 0.093274
Variable Transformation
 Monotonic binning based on GBM (Gradient Boosting Machine).
 After binning, the transformed 𝑿, namely 𝑵𝒆𝒘𝑿, will replace the raw 𝑿 in the downstream model estimation.
bout = dict((v, mob.gbm_bin(dt[v], dt["purepremium"])) for v in vlst)
mob.view_bin(bout["bonusmalus"])
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------|
| 1 | 384156 | 0 | 80777461.9272 | 210.2726 | -0.60031106 | $X$ <= 50 |
| 2 | 96170 | 0 | 25208175.3327 | 262.1210 | -0.37990943 | $X$ > 50 and $X$ <= 61 |
| 3 | 54092 | 0 | 17814900.3274 | 329.3445 | -0.15161142 | $X$ > 61 and $X$ <= 69 |
| 4 | 1113 | 0 | 505020.2406 | 453.7468 | 0.16882383 | $X$ > 69 and $X$ <= 70 |
| 5 | 41863 | 0 | 19639592.0866 | 469.1396 | 0.20218482 | $X$ > 70 and $X$ <= 76 |
| 6 | 1877 | 0 | 933807.2930 | 497.4999 | 0.26087973 | $X$ > 76 and $X$ <= 79 |
| 7 | 71234 | 0 | 52056038.4906 | 730.7752 | 0.64539024 | $X$ > 79 and $X$ <= 96 |
| 8 | 184 | 0 | 238409.0720 | 1295.7015 | 1.21809190 | $X$ > 96 and $X$ <= 99 |
| 9 | 26798 | 0 | 60879166.6831 | 2271.7802 | 1.77960344 | $X$ > 99 and $X$ <= 139 |
| 10 | 526 | 0 | 1803209.8657 | 3428.1556 | 2.19106207 | $X$ > 139 |
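 As a hedged aside, the fine-binning idea behind isotonic-regression-based binning (cf. iso_bin() in the appendix) can be approximated with scikit-learn: runs of constant fitted values from a monotone fit of Y on X induce candidate cut points. A sketch under that assumption, not the loss_mob implementation:
import numpy as np
from sklearn.isotonic import IsotonicRegression
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size = 5000)
y = rng.gamma(2.0, np.exp(0.2 * x))            # average loss rises with x
iso = IsotonicRegression(increasing = "auto").fit(x, y)
step = iso.predict(np.sort(x))                 # monotone step function of x
cuts = np.sort(x)[:-1][np.diff(step) > 0]      # candidate bin boundaries
len(cuts) + 1                                  # number of candidate bins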
Visual of Variable Transformation
 Because $F(X)$, i.e. 𝑵𝒆𝒘𝑿, is strictly linear with respect to $Ln(Y)$ at the bin level, the linearity of model predictors in the GLM is enhanced (see the sketch below).
 Each category of $F(X)$ is an aggregation over a segment of records. As a result, the model estimated with the transformed 𝑿 should be more stable and less prone to overfitting.
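 To make the linearity claim concrete, each newx in the gbm_bin table above equals Ln(yavg) minus Ln(overall average loss), so newx is an exact linear function of Ln(yavg). A quick check using values copied from that table:
import numpy as np
yavg = np.array([210.2726, 262.1210, 329.3445, 453.7468, 469.1396,
                 497.4999, 730.7752, 1295.7015, 2271.7802, 3428.1556])
freq = np.array([384156, 96170, 54092, 1113, 41863,
                 1877, 71234, 184, 26798, 526])
overall_avg = (yavg * freq).sum() / freq.sum()   # total losses / total cases
np.log(yavg / overall_avg)
# agrees with the newx column up to the rounding of yavg, e.g. -0.600311... for bin 1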
Treatment of Missing Values - I
 Case I - The binning algorithm groups all missing values into a standalone category
and then assigns a value to 𝑵𝒆𝒘𝑿 based on the corresponding average loss.
np.random.seed(1)
test_x = np.where(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8, np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 0 | 135182 | 135182 | 35578527.4354 | 263.1898 | -0.37584005 | numpy.isnan($X$) |
| 1 | 307519 | 0 | 67711301.2865 | 220.1857 | -0.55424410 | $X$ <= 50.0 |
| 2 | 77102 | 0 | 19992914.3670 | 259.3047 | -0.39071162 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43211 | 0 | 15596493.6976 | 360.9380 | -0.06000929 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 890 | 0 | 414193.4095 | 465.3859 | 0.19415125 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 35072 | 0 | 18566866.2822 | 529.3929 | 0.32301519 | $X$ > 70.0 and $X$ <= 79.0 |
| 6 | 56945 | 0 | 46318023.6829 | 813.3817 | 0.75248495 | $X$ > 79.0 and $X$ <= 96.0 |
| 7 | 153 | 0 | 227944.8955 | 1489.8359 | 1.35770566 | $X$ > 96.0 and $X$ <= 99.0 |
| 8 | 21939 | 0 | 55449516.2623 | 2527.4405 | 1.88624679 | $X$ > 99.0 |
Treatment of Missing Values - II
 Case II - When no loss was incurred for the records with missing values, they are merged into the category with the lowest average loss.
test_x = np.where(np.logical_and(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8,
                                 np.array(dt["purepremium"]) == 0), np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 439893 | 130194 | 80777461.9272 | 183.6298 | -0.73579385 | $X$ <= 50.0 or numpy.isnan($X$) |
| 2 | 77806 | 0 | 25208175.3327 | 323.9876 | -0.16801052 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43781 | 0 | 17814900.3274 | 406.9094 | 0.05987494 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 901 | 0 | 505020.2406 | 560.5108 | 0.38013292 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 33871 | 0 | 19639592.0866 | 579.8350 | 0.41402801 | $X$ > 70.0 and $X$ <= 76.0 |
| 6 | 1550 | 0 | 933807.2930 | 602.4563 | 0.45229955 | $X$ > 76.0 and $X$ <= 79.0 |
| 7 | 57623 | 0 | 52056038.4906 | 903.3899 | 0.85743868 | $X$ > 79.0 and $X$ <= 96.0 |
| 8 | 157 | 0 | 238409.0720 | 1518.5291 | 1.37678185 | $X$ > 96.0 and $X$ <= 99.0 |
| 9 | 21991 | 0 | 60879166.6831 | 2768.3674 | 1.97729742 | $X$ > 99.0 and $X$ <= 139.0 |
| 10 | 440 | 0 | 1803209.8657 | 4098.2042 | 2.36958856 | $X$ > 139.0 |
Alternative Binning Algorithms
 The loss_mob package offers eight different binning algorithms to meet different business needs in various scenarios (as well as curiosity).
mob.view_bin(mob.kmn_bin(dt["bonusmalus"], dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 431099 | 0 | 96318760.7388 | 223.4261 | -0.53963497 | $X$ <= 55 |
| 2 | 80649 | 0 | 20223542.3051 | 250.7600 | -0.42421935 | $X$ > 55 and $X$ <= 66 |
| 3 | 67270 | 0 | 28283842.6718 | 420.4525 | 0.09261601 | $X$ > 66 and $X$ <= 78 |
| 4 | 54381 | 0 | 42028721.0647 | 772.8567 | 0.70137806 | $X$ > 78 and $X$ <= 92 |
| 5 | 44614 | 0 | 73000914.5385 | 1636.2782 | 1.45146393 | $X$ > 92 |
mob.view_bin(mob.los_bin(dt["bonusmalus"], dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 384156 | 0 | 80777461.9272 | 210.2726 | -0.60031106 | $X$ <= 50 |
| 2 | 94446 | 0 | 24910533.5905 | 263.7542 | -0.37369782 | $X$ > 50 and $X$ <= 60 |
| 3 | 77565 | 0 | 30475703.0135 | 392.9053 | 0.02485312 | $X$ > 60 and $X$ <= 72 |
| 4 | 76915 | 0 | 50625848.0929 | 658.2051 | 0.54080103 | $X$ > 72 and $X$ <= 90 |
| 5 | 44931 | 0 | 73066234.6948 | 1626.1876 | 1.44527805 | $X$ > 90 |
Variable Importance after Transformation
 Because monotonic binning provides a rank-ordering of each attribute, the binning outcome can be leveraged to calculate the Gini coefficient and evaluate the predictiveness of each predictor after transformation (see the sketch after the table).
 The Gini outcome is highly consistent with the Distance correlation.
# calculate the gini-coefficient for each binned attribute
gout = [{"variable": _, "gini": mob.bin_gini(bout[_])} for _ in vlst]
# sort all attributes by gini-coefficient
pd.DataFrame(sorted(gout, key = lambda x: -x["gini"]))
variable gini gini before binning
bonusmalus 0.373600 0.364684
drivage 0.335541 0.319361
vehage 0.130189 0.093274
density 0.129020 0.075396
vehpower 0.076282 0.026760
(Gini improved after binning.)
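 The internals of bin_gini() are not reproduced here; one common construction of a Gini coefficient for a continuous loss outcome takes twice the area between the diagonal and the Lorenz curve of losses accumulated in predictor order. A sketch under that assumption:
import numpy as np
def lorenz_gini(x, y):
    # accumulate the loss share with records sorted by the predictor
    order = np.argsort(x)
    cum_loss = np.cumsum(np.asarray(y, dtype = float)[order]) / np.sum(y)
    cum_pop = np.arange(1, len(order) + 1) / len(order)
    # twice the area between the diagonal and the Lorenz curve
    return 2.0 * np.mean(cum_pop - cum_loss)
rng = np.random.default_rng(0)
x = rng.uniform(size = 10000)
y = rng.gamma(2.0, x)   # expected loss increases with x
lorenz_gini(x, y)       # clearly positive for a predictive x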
Transforming New Data
 Functions to apply the transformation to new data and to verify the outcome.
bin1 = mob.qtl_bin(dt["bonusmalus"], dt["purepremium"])
# score new data based on the binning outcome
out1 = mob.cal_newx(dt['bonusmalus'], bin1)
mob.head(out1, 3)
# {'x': 50, 'bin': 1, 'newx': -0.60031106}
# {'x': 60, 'bin': 3, 'newx': -0.24666305}
# {'x': 85, 'bin': 4, 'newx': 0.51388758}
mob.chk_newx(out1)
| bin | newx | freq | dist | xrng |
|-------|-------------|--------|------------|--------------------------------|
| 1 | -0.60031106 | 384156 | 56.6591% | 50 <==> 50 |
| 2 | -0.34039004 | 68334 | 10.0786% | 51 <==> 57 |
| 3 | -0.24666305 | 80831 | 11.9217% | 58 <==> 68 |
| 4 | 0.51388758 | 82308 | 12.1396% | 69 <==> 85 |
| 5 | 1.25057961 | 62384 | 9.2010% | 86 <==> 230 |
Model Fitting without Transformation
 Estimate a Tweedie GLM with the aforementioned predictors, without any transformation.
 Only 1 variable is statistically significant.
Y = dt["purepremium"]
# use raw variables
X1 = sm.add_constant(pd.DataFrame({v: dt[v] for v in vlst}), prepend = True)
m1 = sm.GLM(Y, X1, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 3.8697 0.575 6.729 0.000 2.743 4.997
bonusmalus 0.0344 0.005 6.792 0.000 0.024 0.044 1.3222
drivage -0.0055 0.006 -0.907 0.364 -0.017 0.006 1.3018
vehage -0.0024 0.013 -0.183 0.855 -0.029 0.024 1.0165
density -7.882e-06 1.92e-05 -0.411 0.681 -4.55e-05 2.98e-05 1.0194
vehpower 0.0124 0.037 0.338 0.735 -0.060 0.085 1.0084
Model Fitting with Transformation
 Estimate a Tweedie GLM with the same predictors after transformation.
 There are 3 statistically significant variables.
bout = dict((v, mob.iso_bin(dt[v], dt["purepremium"])) for v in vlst)
xout = dict((v, mob.cal_newx(dt[v], bout[v])) for v in vlst)
X2 = sm.add_constant(pd.DataFrame(dict((v, [_["newx"] for _ in xout[v]]) for v in vlst)), prepend = True)
m2 = sm.GLM(Y, X2, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 5.9635 0.066 91.001 0.000 5.835 6.092
bonusmalus 0.4727 0.115 4.115 0.000 0.248 0.698 1.3983
drivage 0.6632 0.126 5.254 0.000 0.416 0.911 1.3794
vehage 0.0726 0.228 0.319 0.750 -0.374 0.519 1.0119
density 0.4690 0.227 2.063 0.039 0.023 0.915 1.0215
vehpower 0.5291 0.417 1.269 0.205 -0.288 1.347 1.0055
Model Performance
 A performance comparison between the model without variable transformation and the model with variable transformation is provided below (see the metric sketch after the table).
| Statistical Metric | Without Transformation | With Transformation |
|--------------------|------------------------|---------------------|
| AIC                | 848,321                | 821,825             |
| Gini               | 0.3847                 | 0.4103              |
| sMAPE              | 1.9724                 | 1.9744              |
| MAE                | 734.6655               | 717.6874            |
| D2 Tweedie Score   | 0.0393                 | 0.0553              |
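 For reference, the sMAPE values near 2 reflect the 0-to-2 scaling of the metric on a target that is zero for most policies; the sketch below assumes that scaling rather than reproducing the internals of loss_mob.smape():
import numpy as np
def smape(y, yhat):
    # symmetric MAPE on the 0-to-2 scale; assumes yhat > 0 (log link)
    y, yhat = np.asarray(y, dtype = float), np.asarray(yhat, dtype = float)
    return np.mean(2.0 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))
def mae(y, yhat):
    return np.mean(np.abs(np.asarray(y, dtype = float) - np.asarray(yhat, dtype = float)))
# usage with the fitted models, e.g. smape(Y, m2.predict(X2)) and mae(Y, m2.predict(X2))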
Appendix I: Distance Correlation
 Distance correlation is a dependence measure between two paired vectors.
[Figure: side-by-side scatter plots contrasting Distance Correlation with Spearman Correlation. Source: github.com/vnmabus/dcor]
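 A minimal usage sketch of the dcor package cited above (assuming it is installed), showing a dependence that distance correlation detects while Spearman stays near zero:
import numpy as np, dcor
from scipy.stats import spearmanr
rng = np.random.default_rng(1)
x = rng.normal(size = 2000)
y = x ** 2 + rng.normal(scale = 0.1, size = 2000)   # non-monotonic link
dcor.distance_correlation(x, y)    # well above zero
spearmanr(x, y).correlation        # near zero for this symmetric link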
Appendix II: Core Functions of Loss_Mob
loss_mob
|-- qtl_bin() : Iterative discretization based on quantiles of X.
|-- los_bin() : Revised iterative discretization for records with Y > 0.
|-- iso_bin() : Discretization driven by the isotonic regression.
|-- val_bin() : Revised iterative discretization based on unique values of X.
|-- rng_bin() : Revised iterative discretization based on the equal-width range of X.
|-- kmn_bin() : Iterative discretization based on the k-means clustering of X.
|-- gbm_bin() : Discretization based on the gradient boosting machine (GBM).
|-- cus_bin() : Customized discretization based on pre-determined cut points.
|-- view_bin() : Displays the binning outcome in a tabular form.
|-- cal_newx() : Applies the variable transformation to a numeric vector based on binning outcome.
|-- chk_newx() : Verifies the transformation generated from the cal_newx() function.
|-- mi_score() : Calculates the Mutual Information (MI) score between X and Y.
|-- screen() : Calculates Spearman and Distance Correlations between X and Y.
|-- bin_gini() : Calculates the gini-coefficient between X and Y based on the binning object.
|-- num_gini() : Calculates the gini-coefficient between raw values of X and Y.
|-- smape() : Calculates the sMAPE value between Y and Yhat.
`-- get_mtpl() : Extracts the French Motor Third-Party Liability claims dataset from OpenML.