Variable Transformation
in P&C Loss Models
Based on Monotonic Binning
WenSui Liu, Nov 2023
Opportunities in P&C Modeling
 A tremendous amount of effort is spent on data preparation and exploration, much of which
can be automated and streamlined.
 Let machines deal with the tedious data work so as to allow modelers to focus on modeling
methodology and statistical inference.
Model development pipeline:
Model Development Data → Data Screen → Anomaly Treatment → Data Transform → Predictive Ranking
 Data Screen: filter redundant data fields; retain relevant information.
 Anomaly Treatment: impute missing values; winsorize data outliers; recode special values.
 Data Transform: explore data distribution; identify the best transformation to improve linearity.
 Predictive Ranking: assess variable predictiveness; identify important model drivers.
Data Preparation Consumes 50+% of Time in Model Development
Heterogeneous Data Sources: Credit, Vehicle, Telematics, Geographic
Banking Practice
 In retail credit risk models, the Weight of Evidence (WoE) transformation* has been widely
used to improve the efficiency of model development:
WoE_i = Ln[ (# of Y=1 in i-th category / # of Y=0 in i-th category) ÷ (Total # of Y=1 / Total # of Y=0) ]
 The number of categories (i.e. bins) is derived from the discretization of the X vector, with
missing values handled separately.
 In consideration of regulatory scrutiny and model interpretability, strict monotonicity is
assumed between X and WoE_X.
 All monotonic functions of X, e.g. logarithmic, exponential, or linear, should converge to
the same monotonic WoE_X transformation.
(i.e. Ln of the odds in the i-th category over the overall odds)
* https://pypi.org/project/py-mob/
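As a toy illustration of the WoE formula above (a sketch with hypothetical data, not the py-mob implementation):

```python
import numpy as np

def woe(cat, y):
    """Weight of Evidence: Ln of the odds in each category over the overall odds."""
    tot1, tot0 = (y == 1).sum(), (y == 0).sum()
    out = {}
    for c in np.unique(cat):
        n1 = ((cat == c) & (y == 1)).sum()   # events in the category
        n0 = ((cat == c) & (y == 0)).sum()   # non-events in the category
        out[int(c)] = float(np.log((n1 / n0) / (tot1 / tot0)))
    return out

cat = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])   # three bins of X
y   = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0])   # binary outcome
print(woe(cat, y))   # bin 1 has below-average odds of Y=1 -> negative WoE
```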
Adoption in P&C Models
 For P&C loss models, a modified approach is proposed to mimic the idea of the WoE
transformation, as shown below.
F(X_i) = Ln[ (Losses in i-th category / # of cases in i-th category) ÷ (Total losses / Total # of cases) ]
 The interpretation of F(X_i) is intuitive: it is the log of the ratio between the average loss in
the i-th category and the overall average loss.
 With missing values falling into a standalone category or combined with a similar
neighbor, no special treatment (i.e. imputation) is necessary.
 Since the transformation projects the raw values of X into the data space of Y based on
their rank order, concerns around outliers in X are neutralized.
(i.e. Ln of the average loss in the i-th category over the overall average loss)
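A minimal sketch of this transformation on hypothetical data (the helper name loss_transform is illustrative, not part of loss_mob):

```python
import numpy as np

def loss_transform(cat, loss):
    """F(X_i): Ln of the average loss in each category over the overall average loss."""
    overall = loss.mean()
    return {int(c): float(np.log(loss[cat == c].mean() / overall))
            for c in np.unique(cat)}

cat  = np.array([1, 1, 1, 2, 2, 3])                   # three bins of X
loss = np.array([0., 100., 200., 300., 500., 900.])   # per-case losses
print(loss_transform(cat, loss))   # average losses 100, 400, 900 vs overall ~333.3
```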
Outline of the loss_mob Package
 The Python package loss_mob (https://pypi.org/project/loss-mob) is my weekend
project attempting to tackle the most tedious yet critical task in P&C loss
model development.
Core Functionality
 Variable Information: coefficient of variation; Spearman and Distance correlation
coefficients; Mutual Information score; Gini coefficient.
 Binning Algorithms: fine binning based on GBM or isotonic regression; coarse binning
based on density or value range; customized binning based on user inputs.
 Utility Functions: tabulation of binning results; application of the binning outcome to
new data; verification of the data transformation; sMAPE for model performance.
Demo Based on MTPL Data
 The French Motor Third-Party Liability (MTPL) Claims dataset from OpenML* is used in
the subsequent demo.
import loss_mob as mob, pandas as pd, numpy as np, statsmodels.api as sm
dt = mob.get_mtpl() # https://github.com/dutangc/CASdatasets
dt.keys()
# dict_keys(['idpol', 'claimnb', 'exposure', 'area', 'vehpower', 'vehage', 'drivage', 'bonusmalus',
# 'vehbrand', 'vehgas', 'density', 'region', 'claimamount', 'purepremium'])
pd.DataFrame(dt).head(3)
… vehpower vehage drivage bonusmalus vehbrand vehgas density region claimamount purepremium
… 7 1 61 50 B12 Regular 27000 R11 303.00 404.0000
… 12 5 50 60 B12 Diesel 56 R25 1981.84 14156.0000
… 4 0 36 85 B12 Regular 4792 R11 1456.55 10403.9286
* https://www.openml.org
Variable Screening
 The screen() function assesses the association between each X and Y.
 A consistent magnitude between the Spearman and Distance correlations indicates a
strong linear association in the context of GLM.
# variable list to screen
vlst = ["vehpower", "vehage", "drivage", "bonusmalus", "density"]
# screen through each attribute
summ = [{"variable": _, **mob.screen(dt[_], dt["purepremium"])} for _ in vlst]
# sort the summary by distance correlation
pd.DataFrame(sorted(summ, key = lambda x: -x["distance correlation"]))
variable … coefficient of variation spearman correlation distance correlation gini coefficient
bonusmalus … 0.261651 0.057169 0.043454 0.364684
drivage … 0.310719 -0.004906 0.014289 0.319361
density … 2.208544 0.020221 0.011069 0.075396
vehage … 0.804375 0.019526 0.010801 0.093274
vehpower … 0.317741 0.002307 0.003570 0.026760
Variable Screening in Parallel
 Scalability is at the heart of the development philosophy.
 Functions in loss_mob can be easily parallelized and scaled to ~1,000+ predictors.
# first, define a wrapper to be consumed by the parallel map
def pscreen(v):
    return {"variable": v, **mob.screen(dt[v], dt["purepremium"])}

# next, load necessary modules
from multiprocessing import Pool, cpu_count
from contextlib import closing

with closing(Pool(processes = cpu_count())) as pool:
    psum = pool.map(pscreen, vlst)
    pool.terminate()
pd.DataFrame(sorted(psum, key = lambda x: -x["gini coefficient"])).head(3)
variable ... spearman correlation distance correlation gini coefficient
bonusmalus ... 0.057169 0.043454 0.364684
drivage ... -0.004906 0.014289 0.319361
vehage ... 0.019526 0.010801 0.093274
Variable Transformation
 Monotonic binning based on GBM (Gradient Boosting Machine).
 After binning, the transformed X, namely NewX, replaces the raw X in the
downstream model estimation.
bout = dict((v, mob.gbm_bin(dt[v], dt["purepremium"])) for v in vlst)
mob.view_bin(bout["bonusmalus"])
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------|
| 1 | 384156 | 0 | 80777461.9272 | 210.2726 | -0.60031106 | $X$ <= 50 |
| 2 | 96170 | 0 | 25208175.3327 | 262.1210 | -0.37990943 | $X$ > 50 and $X$ <= 61 |
| 3 | 54092 | 0 | 17814900.3274 | 329.3445 | -0.15161142 | $X$ > 61 and $X$ <= 69 |
| 4 | 1113 | 0 | 505020.2406 | 453.7468 | 0.16882383 | $X$ > 69 and $X$ <= 70 |
| 5 | 41863 | 0 | 19639592.0866 | 469.1396 | 0.20218482 | $X$ > 70 and $X$ <= 76 |
| 6 | 1877 | 0 | 933807.2930 | 497.4999 | 0.26087973 | $X$ > 76 and $X$ <= 79 |
| 7 | 71234 | 0 | 52056038.4906 | 730.7752 | 0.64539024 | $X$ > 79 and $X$ <= 96 |
| 8 | 184 | 0 | 238409.0720 | 1295.7015 | 1.21809190 | $X$ > 96 and $X$ <= 99 |
| 9 | 26798 | 0 | 60879166.6831 | 2271.7802 | 1.77960344 | $X$ > 99 and $X$ <= 139 |
| 10 | 526 | 0 | 1803209.8657 | 3428.1556 | 2.19106207 | $X$ > 139 |
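The monotone smoothing underlying such binning can be sketched with the pool-adjacent-violators algorithm, the engine behind isotonic regression; this is an illustrative re-implementation, not the gbm_bin internals:

```python
def pava(y, w):
    """Pool-adjacent-violators: weighted, non-decreasing fit of y."""
    blocks = []                      # each block: [fitted value, weight, count]
    for yi, wi in zip(map(float, y), map(float, w)):
        blocks.append([yi, wi, 1])
        # merge backwards while the non-decreasing constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    out = []
    for v, _, n in blocks:
        out += [v] * n               # expand each pooled block back out
    return out

# hypothetical average losses per candidate bin (ordered by X), weighted by frequency
avg_loss = [210., 260., 250., 470., 450., 2270.]
freq     = [384., 96., 54., 42., 71., 27.]
print(pava(avg_loss, freq))  # adjacent violators pooled -> monotone bin averages
```

Candidate bins whose average losses violate monotonicity are pooled into a single bin, which is exactly why the final table above is strictly increasing in yavg.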
Visual of Variable Transformation
 Because F(X), i.e. NewX, is strictly linear with respect to Ln(Y), the linearity of model
predictors in the GLM is enhanced.
 Each category of F(X) is an aggregate over a segment of records. As a result, the
model estimated with the transformed X should be more stable and less prone to overfitting.
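This linearity can be checked numerically against the bonusmalus binning table: by construction, newx differs from Ln(yavg) by a constant, namely Ln of the overall average loss.

```python
import numpy as np

# yavg and newx columns copied from the bonusmalus GBM binning table
yavg = np.array([210.2726, 262.1210, 329.3445, 453.7468, 469.1396,
                 497.4999, 730.7752, 1295.7015, 2271.7802, 3428.1556])
newx = np.array([-0.60031106, -0.37990943, -0.15161142, 0.16882383, 0.20218482,
                 0.26087973, 0.64539024, 1.21809190, 1.77960344, 2.19106207])

# newx = Ln(yavg) - Ln(overall average loss), so the difference is constant
diff = np.log(yavg) - newx
print(diff.round(4))          # every entry ~5.9487 = Ln(overall average loss)
assert np.ptp(diff) < 1e-3    # slope 1 against Ln(yavg): strict linearity
```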
Treatment of Missing Values - I
 Case I - The binning algorithm groups all missing values into a standalone category
and then assigns a value to 𝑵𝒆𝒘𝑿 based on the corresponding average loss.
np.random.seed(1)
test_x = np.where(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8, np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 0 | 135182 | 135182 | 35578527.4354 | 263.1898 | -0.37584005 | numpy.isnan($X$) |
| 1 | 307519 | 0 | 67711301.2865 | 220.1857 | -0.55424410 | $X$ <= 50.0 |
| 2 | 77102 | 0 | 19992914.3670 | 259.3047 | -0.39071162 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43211 | 0 | 15596493.6976 | 360.9380 | -0.06000929 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 890 | 0 | 414193.4095 | 465.3859 | 0.19415125 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 35072 | 0 | 18566866.2822 | 529.3929 | 0.32301519 | $X$ > 70.0 and $X$ <= 79.0 |
| 6 | 56945 | 0 | 46318023.6829 | 813.3817 | 0.75248495 | $X$ > 79.0 and $X$ <= 96.0 |
| 7 | 153 | 0 | 227944.8955 | 1489.8359 | 1.35770566 | $X$ > 96.0 and $X$ <= 99.0 |
| 8 | 21939 | 0 | 55449516.2623 | 2527.4405 | 1.88624679 | $X$ > 99.0 |
Treatment of Missing Values - II
 Case II - When no loss was incurred for missing values, all records with missing
values are merged into the category with the lowest average loss.
test_x = np.where(np.logical_and(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8,
np.array(dt["purepremium"]) == 0), np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 439893 | 130194 | 80777461.9272 | 183.6298 | -0.73579385 | $X$ <= 50.0 or numpy.isnan($X$) |
| 2 | 77806 | 0 | 25208175.3327 | 323.9876 | -0.16801052 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43781 | 0 | 17814900.3274 | 406.9094 | 0.05987494 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 901 | 0 | 505020.2406 | 560.5108 | 0.38013292 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 33871 | 0 | 19639592.0866 | 579.8350 | 0.41402801 | $X$ > 70.0 and $X$ <= 76.0 |
| 6 | 1550 | 0 | 933807.2930 | 602.4563 | 0.45229955 | $X$ > 76.0 and $X$ <= 79.0 |
| 7 | 57623 | 0 | 52056038.4906 | 903.3899 | 0.85743868 | $X$ > 79.0 and $X$ <= 96.0 |
| 8 | 157 | 0 | 238409.0720 | 1518.5291 | 1.37678185 | $X$ > 96.0 and $X$ <= 99.0 |
| 9 | 21991 | 0 | 60879166.6831 | 2768.3674 | 1.97729742 | $X$ > 99.0 and $X$ <= 139.0 |
| 10 | 440 | 0 | 1803209.8657 | 4098.2042 | 2.36958856 | $X$ > 139.0 |
Alternative Binning Algorithms
 The loss_mob package offers eight different binning algorithms to meet different
business needs in various scenarios (as well as curiosity).
mob.view_bin(mob.kmn_bin(dt["bonusmalus"], dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 431099 | 0 | 96318760.7388 | 223.4261 | -0.53963497 | $X$ <= 55 |
| 2 | 80649 | 0 | 20223542.3051 | 250.7600 | -0.42421935 | $X$ > 55 and $X$ <= 66 |
| 3 | 67270 | 0 | 28283842.6718 | 420.4525 | 0.09261601 | $X$ > 66 and $X$ <= 78 |
| 4 | 54381 | 0 | 42028721.0647 | 772.8567 | 0.70137806 | $X$ > 78 and $X$ <= 92 |
| 5 | 44614 | 0 | 73000914.5385 | 1636.2782 | 1.45146393 | $X$ > 92 |
mob.view_bin(mob.los_bin(dt["bonusmalus"], dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 384156 | 0 | 80777461.9272 | 210.2726 | -0.60031106 | $X$ <= 50 |
| 2 | 94446 | 0 | 24910533.5905 | 263.7542 | -0.37369782 | $X$ > 50 and $X$ <= 60 |
| 3 | 77565 | 0 | 30475703.0135 | 392.9053 | 0.02485312 | $X$ > 60 and $X$ <= 72 |
| 4 | 76915 | 0 | 50625848.0929 | 658.2051 | 0.54080103 | $X$ > 72 and $X$ <= 90 |
| 5 | 44931 | 0 | 73066234.6948 | 1626.1876 | 1.44527805 | $X$ > 90 |
Variable Importance after Transformation
 Because monotonic binning provides the rank-ordering capability of each attribute, the
binning outcome can be leveraged to calculate the Gini coefficient and evaluate the
predictiveness of each predictor after transformation.
 The Gini ranking is highly consistent with the Distance Correlation ranking.
# calculate the gini-coefficient for each binned attribute
gout = [{"variable": _, "gini": mob.bin_gini(bout[_])} for _ in vlst]
# sort all attributes by gini-coefficient
pd.DataFrame(sorted(gout, key = lambda x: -x["gini"]))
variable gini gini before binning
bonusmalus 0.373600 0.364684
drivage 0.335541 0.319361
vehage 0.130189 0.093274
density 0.129020 0.075396
vehpower 0.076282 0.026760
 The Gini coefficient improved after binning for every attribute.
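One common way such a Gini coefficient is computed is from the Lorenz curve of losses ordered by the predictor; the sketch below illustrates that convention and is not necessarily the exact bin_gini() formula:

```python
import numpy as np

def gini(x, y):
    """Gini coefficient from the Lorenz curve of y ordered by x."""
    order = np.argsort(x)                                    # rank cases by the predictor
    cum_y = np.append(0, np.cumsum(y[order])) / y.sum()      # cumulative share of losses
    cum_n = np.append(0, np.arange(1, len(y) + 1)) / len(y)  # cumulative share of cases
    d = cum_n - cum_y                        # gap between the diagonal and the Lorenz curve
    area = np.sum((d[1:] + d[:-1]) / 2 * np.diff(cum_n))     # trapezoidal area
    return float(2 * area)                   # twice the (signed) area

x = np.array([1., 2., 3., 4., 5.])           # hypothetical predictor (e.g. binned newx)
y = np.array([10., 20., 30., 40., 100.])     # losses concentrated in high-x cases
print(round(gini(x, y), 4))                  # prints 0.4
```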
Transforming New Data
 Functions to apply transformations to new data and to verify the outcome.
bin1 = mob.qtl_bin(dt["bonusmalus"], dt["purepremium"])
# score new data based on the binning outcome
out1 = mob.cal_newx(dt['bonusmalus'], bin1)
mob.head(out1, 3)
# {'x': 50, 'bin': 1, 'newx': -0.60031106}
# {'x': 60, 'bin': 3, 'newx': -0.24666305}
# {'x': 85, 'bin': 4, 'newx': 0.51388758}
mob.chk_newx(out1)
| bin | newx | freq | dist | xrng |
|-------|-------------|--------|------------|--------------------------------|
| 1 | -0.60031106 | 384156 | 56.6591% | 50 <==> 50 |
| 2 | -0.34039004 | 68334 | 10.0786% | 51 <==> 57 |
| 3 | -0.24666305 | 80831 | 11.9217% | 58 <==> 68 |
| 4 | 0.51388758 | 82308 | 12.1396% | 69 <==> 85 |
| 5 | 1.25057961 | 62384 | 9.2010% | 86 <==> 230 |
Model Fitting without Transformation
 Estimate a Tweedie GLM with the aforementioned predictors without any transformation.
 Only 1 variable is statistically significant.
Y = dt["purepremium"]
# use raw variables
X1 = sm.add_constant(pd.DataFrame({v: dt[v] for v in vlst}), prepend = True)
m1 = sm.GLM(Y, X1, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 3.8697 0.575 6.729 0.000 2.743 4.997
bonusmalus 0.0344 0.005 6.792 0.000 0.024 0.044 1.3222
drivage -0.0055 0.006 -0.907 0.364 -0.017 0.006 1.3018
vehage -0.0024 0.013 -0.183 0.855 -0.029 0.024 1.0165
density -7.882e-06 1.92e-05 -0.411 0.681 -4.55e-05 2.98e-05 1.0194
vehpower 0.0124 0.037 0.338 0.735 -0.060 0.085 1.0084
Model Fitting with Transformation
 Estimate a Tweedie GLM with the same predictors after transformation.
 There are 3 statistically significant variables.
bout = dict((v, mob.iso_bin(dt[v], dt["purepremium"])) for v in vlst)
xout = dict((v, mob.cal_newx(dt[v], bout[v])) for v in vlst)
X2 = sm.add_constant(pd.DataFrame(dict((v, [_["newx"] for _ in xout[v]]) for v in vlst)), prepend = True)
m2 = sm.GLM(Y, X2, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 5.9635 0.066 91.001 0.000 5.835 6.092
bonusmalus 0.4727 0.115 4.115 0.000 0.248 0.698 1.3983
drivage 0.6632 0.126 5.254 0.000 0.416 0.911 1.3794
vehage 0.0726 0.228 0.319 0.750 -0.374 0.519 1.0119
density 0.4690 0.227 2.063 0.039 0.023 0.915 1.0215
vehpower 0.5291 0.417 1.269 0.205 -0.288 1.347 1.0055
Model Performance
 A performance comparison between the model without variable transformation and
the model with variable transformation is provided below.
| Statistical Metric | Without Transformation | With Transformation |
|--------------------|------------------------|---------------------|
| AIC                | 848,321                | 821,825             |
| Gini               | 0.3847                 | 0.4103              |
| sMAPE              | 1.9724                 | 1.9744              |
| MAE                | 734.6655               | 717.6874            |
| D2 Tweedie Score   | 0.0393                 | 0.0553              |
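The exact convention used by loss_mob's smape() is not shown here; a common definition, which ranges from 0 to 2 and is pushed toward 2 by the many zero-loss records in P&C data, is sketched below:

```python
import numpy as np

def smape(y, yhat):
    """Symmetric MAPE: mean of 2|y - yhat| / (|y| + |yhat|), ranging from 0 to 2."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    denom = np.abs(y) + np.abs(yhat)
    # treat 0/0 terms as zero error instead of dividing by zero
    ratio = np.divide(2 * np.abs(y - yhat), denom,
                      out = np.zeros_like(denom), where = denom > 0)
    return float(ratio.mean())

y    = np.array([0., 0., 100., 500.])   # zero losses contribute the maximum term of 2
yhat = np.array([80., 90., 120., 400.])
print(round(smape(y, yhat), 4))         # prints 1.101
```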
Appendix I: Distance Correlation
 Distance correlation is a dependence measure between two paired vectors.
[Figure: scatter-plot examples contrasting Distance Correlation with Spearman Correlation; source: github.com/vnmabus/dcor]
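Distance correlation can be computed from double-centered pairwise distance matrices; the compact numpy sketch below illustrates the definition (the dcor package provides an optimized implementation):

```python
import numpy as np

def dist_corr(x, y):
    """Sample distance correlation between two 1-D numeric vectors."""
    def dbl_center(a):
        d = np.abs(a[:, None] - a[None, :])      # pairwise distance matrix
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A = dbl_center(np.asarray(x, dtype=float))
    B = dbl_center(np.asarray(y, dtype=float))
    dcov2 = (A * B).mean()                       # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / denom))

x = np.linspace(-1, 1, 101)
print(round(dist_corr(x, 2 * x + 1), 4))   # perfectly linear -> 1.0
print(dist_corr(x, x ** 2) > 0.1)          # nonlinear yet dependent -> True
```

Unlike Pearson or Spearman correlation, distance correlation is zero only under independence, which is why it detects the quadratic relationship above even though the linear correlation is zero.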
Appendix II: Core Functions of loss_mob
loss_mob
|-- qtl_bin() : Iterative discretization based on quantiles of X.
|-- los_bin() : Revised iterative discretization for records with Y > 0.
|-- iso_bin() : Discretization driven by the isotonic regression.
|-- val_bin() : Revised iterative discretization based on unique values of X.
|-- rng_bin() : Revised iterative discretization based on the equal-width range of X.
|-- kmn_bin() : Iterative discretization based on the k-means clustering of X.
|-- gbm_bin() : Discretization based on the gradient boosting machine (GBM).
|-- cus_bin() : Customized discretization based on pre-determined cut points.
|-- view_bin() : Displays the binning outcome in a tabular form.
|-- cal_newx() : Applies the variable transformation to a numeric vector based on binning outcome.
|-- chk_newx() : Verifies the transformation generated from the cal_newx() function.
|-- mi_score() : Calculates the Mutual Information (MI) score between X and Y.
|-- screen() : Calculates Spearman and Distance Correlations between X and Y.
|-- bin_gini() : Calculates the gini-coefficient between X and Y based on the binning object.
|-- num_gini() : Calculates the gini-coefficient between raw values of X and Y.
|-- smape() : Calculates the sMAPE value between Y and Yhat.
`-- get_mtpl() : Extracts the French Motor Third-Party Liability Claims dataset from OpenML.
More Related Content

Similar to Variable Transformation in P&C Loss Models Based on Monotonic Binning

Mixed Numeric and Categorical Attribute Clustering Algorithm
Mixed Numeric and Categorical Attribute Clustering AlgorithmMixed Numeric and Categorical Attribute Clustering Algorithm
Mixed Numeric and Categorical Attribute Clustering Algorithm
Asoka Korale
 
Statistical Models for Proportional Outcomes
Statistical Models for Proportional OutcomesStatistical Models for Proportional Outcomes
Statistical Models for Proportional Outcomes
WenSui Liu
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...
prateek kumar
 
MNIST 10-class Classifiers
MNIST 10-class ClassifiersMNIST 10-class Classifiers
MNIST 10-class Classifiers
Sheetal Gangakhedkar
 
Writing Sample
Writing SampleWriting Sample
Writing Sample
Yiqun Li
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Gabriel Moreira
 
Model Presolve, Warmstart and Conflict Refining in CP Optimizer
Model Presolve, Warmstart and Conflict Refining in CP OptimizerModel Presolve, Warmstart and Conflict Refining in CP Optimizer
Model Presolve, Warmstart and Conflict Refining in CP Optimizer
Philippe Laborie
 
20190907 Julia the language for future
20190907 Julia the language for future20190907 Julia the language for future
20190907 Julia the language for future
岳華 杜
 
Tree building 2
Tree building 2Tree building 2
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
KonkoboUlrichArthur
 
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining ToolIRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET Journal
 
Engineering Data Analysis-ProfCharlton
Engineering Data  Analysis-ProfCharltonEngineering Data  Analysis-ProfCharlton
Engineering Data Analysis-ProfCharlton
CharltonInao1
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciences
fsmart01
 
Hidalgo jairo, yandun marco 595
Hidalgo jairo, yandun marco 595Hidalgo jairo, yandun marco 595
Hidalgo jairo, yandun marco 595
Marco Yandun
 
customer_profiling_based_on_fuzzy_principals_linkedin
customer_profiling_based_on_fuzzy_principals_linkedincustomer_profiling_based_on_fuzzy_principals_linkedin
customer_profiling_based_on_fuzzy_principals_linkedin
Asoka Korale
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Boston Institute of Analytics
 
Different Types of Machine Learning Algorithms
Different Types of Machine Learning AlgorithmsDifferent Types of Machine Learning Algorithms
Different Types of Machine Learning Algorithms
rahmedraj93
 
Dbms plan - A swiss army knife for performance engineers
Dbms plan - A swiss army knife for performance engineersDbms plan - A swiss army knife for performance engineers
Dbms plan - A swiss army knife for performance engineers
Riyaj Shamsudeen
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
Yanchang Zhao
 
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...
Fabricio de França
 

Similar to Variable Transformation in P&C Loss Models Based on Monotonic Binning (20)

Mixed Numeric and Categorical Attribute Clustering Algorithm
Mixed Numeric and Categorical Attribute Clustering AlgorithmMixed Numeric and Categorical Attribute Clustering Algorithm
Mixed Numeric and Categorical Attribute Clustering Algorithm
 
Statistical Models for Proportional Outcomes
Statistical Models for Proportional OutcomesStatistical Models for Proportional Outcomes
Statistical Models for Proportional Outcomes
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...
 
MNIST 10-class Classifiers
MNIST 10-class ClassifiersMNIST 10-class Classifiers
MNIST 10-class Classifiers
 
Writing Sample
Writing SampleWriting Sample
Writing Sample
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Model Presolve, Warmstart and Conflict Refining in CP Optimizer
Model Presolve, Warmstart and Conflict Refining in CP OptimizerModel Presolve, Warmstart and Conflict Refining in CP Optimizer
Model Presolve, Warmstart and Conflict Refining in CP Optimizer
 
20190907 Julia the language for future
20190907 Julia the language for future20190907 Julia the language for future
20190907 Julia the language for future
 
Tree building 2
Tree building 2Tree building 2
Tree building 2
 
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
 
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining ToolIRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining Tool
 
Engineering Data Analysis-ProfCharlton
Engineering Data  Analysis-ProfCharltonEngineering Data  Analysis-ProfCharlton
Engineering Data Analysis-ProfCharlton
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciences
 
Hidalgo jairo, yandun marco 595
Hidalgo jairo, yandun marco 595Hidalgo jairo, yandun marco 595
Hidalgo jairo, yandun marco 595
 
customer_profiling_based_on_fuzzy_principals_linkedin
customer_profiling_based_on_fuzzy_principals_linkedincustomer_profiling_based_on_fuzzy_principals_linkedin
customer_profiling_based_on_fuzzy_principals_linkedin
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Different Types of Machine Learning Algorithms
Different Types of Machine Learning AlgorithmsDifferent Types of Machine Learning Algorithms
Different Types of Machine Learning Algorithms
 
Dbms plan - A swiss army knife for performance engineers
Dbms plan - A swiss army knife for performance engineersDbms plan - A swiss army knife for performance engineers
Dbms plan - A swiss army knife for performance engineers
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ...
 

Recently uploaded

一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 

Recently uploaded (20)

一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理

Variable Transformation in P&C Loss Models Based on Monotonic Binning

  • 1. Variable Transformation in P&C Loss Models Based on Monotonic Binning
       WenSui Liu, Nov 2023
  • 2. Opportunities in P&C Modeling
      A tremendous amount of effort is spent on data preparation and exploration that can be automated and streamlined.
      Let machines deal with the tedious data work so that modelers can focus on modeling methodology and statistical inference.

    Data Preparation Consumes 50+% of the Time in Model Development

    Heterogeneous Data Sources: Credit; Vehicle; Telematics; Geographic

    Model Deve. Data -> Data Screen -> Anomaly Treatment -> Data Transform -> Predictive Ranking
      Data Screen: filter redundant data fields; retain relevant information.
      Anomaly Treatment: impute missing values; winsorize data outliers; recode special values.
      Data Transform: explore data distribution; identify the best transformation to improve linearity.
      Predictive Ranking: assess variable predictiveness; identify important model drivers.
  • 3. Banking Practice
      In retail credit risk models, the Weight of Evidence (WoE) transformation* has been widely used to improve the efficiency of model development:

    WoE_i = Ln[ (# of Y = 1 in ith Category / # of Y = 0 in ith Category)
                / (Total # of Y = 1 / Total # of Y = 0) ]

    i.e. the log of the odds in the ith category over the overall odds.
      The number of categories (i.e. bins) is derived from a discretization of the X vector, with missing values handled separately.
      In consideration of regulatory scrutiny and model interpretation, strict monotonicity is assumed between X and WoE_X.
      All monotonic functions of X, e.g. logarithm, exponential, or linear, should converge to the same monotonic WoE_X transformation.

    * https://pypi.org/project/py-mob/
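For concreteness, the WoE formula on this slide can be computed in a few lines of Python. This is an illustrative sketch only, not the py-mob API; the helper name `woe_by_bin` is made up:

```python
import numpy as np

def woe_by_bin(y, bins):
    """Weight of Evidence per bin for a binary target.

    y    : array of 0/1 outcomes
    bins : integer bin index assigned to each record
    Assumes every bin contains at least one Y = 1 and one Y = 0.
    """
    y, bins = np.asarray(y), np.asarray(bins)
    tot1, tot0 = (y == 1).sum(), (y == 0).sum()
    woe = {}
    for b in np.unique(bins):
        n1 = ((y == 1) & (bins == b)).sum()
        n0 = ((y == 0) & (bins == b)).sum()
        # log of (odds in this bin) over (overall odds)
        woe[int(b)] = float(np.log((n1 / n0) / (tot1 / tot0)))
    return woe
```

With two bins holding 2-good/1-bad and 1-good/2-bad respectively, the bins get WoE values of ±ln(2), symmetric around the overall odds.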
  • 4. Adoption in P&C Models
      For P&C loss models, a modified approach is proposed to mimic the idea of the WoE transformation:

    F(X_i) = Ln[ (Losses in ith Category / # of Cases in ith Category)
                 / (Total Losses / Total # of Cases) ]

    i.e. the log of the average loss in the ith category over the overall average loss.
      The interpretation of F(X_i) is intuitive in that it is the (log) ratio between the average loss in the ith category and the overall average loss.
      With missing values falling into a standalone category or combined with a similar neighbor, no special treatment (i.e. imputation) is necessary anymore.
      Since the transformation projects raw values of X into the data space of Y based on their rank order, concerns about outliers in each X are neutralized.
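A minimal sketch of the loss-based transform above, given a precomputed bin assignment (a hypothetical helper for illustration, not the loss_mob implementation):

```python
import numpy as np

def loss_woe(loss, bins):
    """Log ratio of each bin's average loss to the overall average loss.

    loss : array of per-record losses (overall mean must be positive)
    bins : bin index assigned to each record
    """
    loss, bins = np.asarray(loss, dtype=float), np.asarray(bins)
    overall = loss.mean()
    return {int(b): float(np.log(loss[bins == b].mean() / overall))
            for b in np.unique(bins)}
```

A bin whose average loss equals the overall average maps to 0; bins below average map to negative values, bins above average to positive ones, exactly the pattern seen in the `newx` columns later in the deck.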
  • 5. Outline of the loss_mob Package
      The Python package loss_mob (https://pypi.org/project/loss-mob) is my weekend project, an attempt to tackle the most tedious yet critical task in P&C loss model development.

    Core Functionality:
      Variable Information: coefficient of variation; Spearman and distance correlation coefficients; mutual information score; Gini coefficient.
      Binning Algorithms: fine binning based on GBM or isotonic regression; coarse binning based on density or value range; customized binning based on user inputs.
      Utility Functions: tabulation of binning results; application of binning outcomes to new data; verification of data transformations; sMAPE for model performance.
  • 6. Demo Based on MTPL Data
      The French Motor Third-Party Liability (MTPL) Claims dataset from OpenML* is used in the subsequent demo.

```python
import loss_mob as mob, pandas as pd, numpy as np, statsmodels.api as sm

dt = mob.get_mtpl()  # https://github.com/dutangc/CASdatasets
dt.keys()
# dict_keys(['idpol', 'claimnb', 'exposure', 'area', 'vehpower', 'vehage', 'drivage', 'bonusmalus',
#            'vehbrand', 'vehgas', 'density', 'region', 'claimamount', 'purepremium'])
pd.DataFrame(dt).head(3)
```

| ... | vehpower | vehage | drivage | bonusmalus | vehbrand | vehgas  | density | region | claimamount | purepremium |
|-----|----------|--------|---------|------------|----------|---------|---------|--------|-------------|-------------|
| ... | 7        | 1      | 61      | 50         | B12      | Regular | 27000   | R11    | 303.00      | 404.0000    |
| ... | 12       | 5      | 50      | 60         | B12      | Diesel  | 56      | R25    | 1981.84     | 14156.0000  |
| ... | 4        | 0      | 36      | 85         | B12      | Regular | 4792    | R11    | 1456.55     | 10403.9286  |

    * https://www.openml.org
  • 7. Variable Screening
      The screen() function assesses the association between each X and Y.
      The consistent magnitude between the Spearman and distance correlations indicates a strong linear association in the context of GLM.

```python
# variable list to screen
vlst = ["vehpower", "vehage", "drivage", "bonusmalus", "density"]
# screen each attribute
summ = [{"variable": _, **mob.screen(dt[_], dt["purepremium"])} for _ in vlst]
# sort the summary by distance correlation
pd.DataFrame(sorted(summ, key = lambda x: -x["distance correlation"]))
```

| variable   | ... | coefficient of variation | spearman correlation | distance correlation | gini coefficient |
|------------|-----|--------------------------|----------------------|----------------------|------------------|
| bonusmalus | ... | 0.261651                 | 0.057169             | 0.043454             | 0.364684         |
| drivage    | ... | 0.310719                 | -0.004906            | 0.014289             | 0.319361         |
| density    | ... | 2.208544                 | 0.020221             | 0.011069             | 0.075396         |
| vehage     | ... | 0.804375                 | 0.019526             | 0.010801             | 0.093274         |
| vehpower   | ... | 0.317741                 | 0.002307             | 0.003570             | 0.026760         |
  • 8. Variable Screening in Parallel
      Scalability is at the heart of the development philosophy.
      Functions in loss_mob can be easily parallelized and scaled to ~1,000+ predictors.

```python
# first, define a wrapper to be consumed by the parallel map
def pscreen(v):
    return {"variable": v, **mob.screen(dt[v], dt["purepremium"])}

# next, load the necessary modules
from multiprocessing import Pool, cpu_count
from contextlib import closing

with closing(Pool(processes = cpu_count())) as pool:
    psum = pool.map(pscreen, vlst)
    pool.terminate()

pd.DataFrame(sorted(psum, key = lambda x: -x["gini coefficient"])).head(3)
```

| variable   | ... | spearman correlation | distance correlation | gini coefficient |
|------------|-----|----------------------|----------------------|------------------|
| bonusmalus | ... | 0.057169             | 0.043454             | 0.364684         |
| drivage    | ... | -0.004906            | 0.014289             | 0.319361         |
| vehage     | ... | 0.019526             | 0.010801             | 0.093274         |
  • 9. Variable Transformation
      Monotonic binning based on GBM (Gradient Boosting Machine).
      After binning, the transformed X, namely NewX, will replace the raw X in the downstream model estimation.

```python
bout = dict((v, mob.gbm_bin(dt[v], dt["purepremium"])) for v in vlst)
mob.view_bin(bout["bonusmalus"])
```

| bin | freq   | miss | ysum          | yavg      | newx        | rule                    |
|-----|--------|------|---------------|-----------|-------------|-------------------------|
| 1   | 384156 | 0    | 80777461.9272 | 210.2726  | -0.60031106 | $X$ <= 50               |
| 2   | 96170  | 0    | 25208175.3327 | 262.1210  | -0.37990943 | $X$ > 50 and $X$ <= 61  |
| 3   | 54092  | 0    | 17814900.3274 | 329.3445  | -0.15161142 | $X$ > 61 and $X$ <= 69  |
| 4   | 1113   | 0    | 505020.2406   | 453.7468  | 0.16882383  | $X$ > 69 and $X$ <= 70  |
| 5   | 41863  | 0    | 19639592.0866 | 469.1396  | 0.20218482  | $X$ > 70 and $X$ <= 76  |
| 6   | 1877   | 0    | 933807.2930   | 497.4999  | 0.26087973  | $X$ > 76 and $X$ <= 79  |
| 7   | 71234  | 0    | 52056038.4906 | 730.7752  | 0.64539024  | $X$ > 79 and $X$ <= 96  |
| 8   | 184    | 0    | 238409.0720   | 1295.7015 | 1.21809190  | $X$ > 96 and $X$ <= 99  |
| 9   | 26798  | 0    | 60879166.6831 | 2271.7802 | 1.77960344  | $X$ > 99 and $X$ <= 139 |
| 10  | 526    | 0    | 1803209.8657  | 3428.1556 | 2.19106207  | $X$ > 139               |
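The core idea behind monotonic fine binning can be illustrated with a hand-rolled pool-adjacent-violators pass: sort by X, then merge adjacent blocks until the bin means increase monotonically. This is a sketch of the concept only; gbm_bin() and iso_bin() in loss_mob are more elaborate:

```python
import numpy as np

def pav_bins(x, y):
    """Monotonically increasing bins via pool-adjacent-violators.

    Sorts records by x and merges adjacent blocks whenever the mean of
    y fails to increase, so the final bin means are strictly increasing.
    Returns a list of (upper cut point of x, bin mean of y) pairs.
    """
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y, dtype=float)[order]
    blocks = []  # each block: [y-sum, count, right edge of x]
    for xi, yi in zip(xs, ys):
        blocks.append([yi, 1.0, xi])
        # merge backwards while monotonicity is violated
        while len(blocks) > 1 and \
              blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            s, n, e = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = e
    return [(b[2], b[0] / b[1]) for b in blocks]
```

On a toy sample where the second record's loss dips below the first, the two are pooled into one bin, restoring monotonicity of the bin averages.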
  • 10. Visual of Variable Transformation
      Because F(X), i.e. NewX, is strictly linear with respect to Ln(Y), the linearity of model predictors in the GLM has been enhanced.
      Each category of F(X) is an aggregation over a segment of records. As a result, the model estimated with the transformed X should be more stable and less prone to overfitting.
  • 11. Treatment of Missing Values - I
      Case I - The binning algorithm groups all missing values into a standalone category and then assigns a value to NewX based on the corresponding average loss.

```python
np.random.seed(1)
test_x = np.where(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8,
                  np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
```

| bin | freq   | miss   | ysum          | yavg      | newx        | rule                        |
|-----|--------|--------|---------------|-----------|-------------|-----------------------------|
| 0   | 135182 | 135182 | 35578527.4354 | 263.1898  | -0.37584005 | numpy.isnan($X$)            |
| 1   | 307519 | 0      | 67711301.2865 | 220.1857  | -0.55424410 | $X$ <= 50.0                 |
| 2   | 77102  | 0      | 19992914.3670 | 259.3047  | -0.39071162 | $X$ > 50.0 and $X$ <= 61.0  |
| 3   | 43211  | 0      | 15596493.6976 | 360.9380  | -0.06000929 | $X$ > 61.0 and $X$ <= 69.0  |
| 4   | 890    | 0      | 414193.4095   | 465.3859  | 0.19415125  | $X$ > 69.0 and $X$ <= 70.0  |
| 5   | 35072  | 0      | 18566866.2822 | 529.3929  | 0.32301519  | $X$ > 70.0 and $X$ <= 79.0  |
| 6   | 56945  | 0      | 46318023.6829 | 813.3817  | 0.75248495  | $X$ > 79.0 and $X$ <= 96.0  |
| 7   | 153    | 0      | 227944.8955   | 1489.8359 | 1.35770566  | $X$ > 96.0 and $X$ <= 99.0  |
| 8   | 21939  | 0      | 55449516.2623 | 2527.4405 | 1.88624679  | $X$ > 99.0                  |
  • 12. Treatment of Missing Values - II
      Case II - When no loss was incurred for the missing values, all records with missing values are merged into the category with the lowest average loss.

```python
test_x = np.where(np.logical_and(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8,
                                 np.array(dt["purepremium"]) == 0),
                  np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
```

| bin | freq   | miss   | ysum          | yavg      | newx        | rule                            |
|-----|--------|--------|---------------|-----------|-------------|---------------------------------|
| 1   | 439893 | 130194 | 80777461.9272 | 183.6298  | -0.73579385 | $X$ <= 50.0 or numpy.isnan($X$) |
| 2   | 77806  | 0      | 25208175.3327 | 323.9876  | -0.16801052 | $X$ > 50.0 and $X$ <= 61.0      |
| 3   | 43781  | 0      | 17814900.3274 | 406.9094  | 0.05987494  | $X$ > 61.0 and $X$ <= 69.0      |
| 4   | 901    | 0      | 505020.2406   | 560.5108  | 0.38013292  | $X$ > 69.0 and $X$ <= 70.0      |
| 5   | 33871  | 0      | 19639592.0866 | 579.8350  | 0.41402801  | $X$ > 70.0 and $X$ <= 76.0      |
| 6   | 1550   | 0      | 933807.2930   | 602.4563  | 0.45229955  | $X$ > 76.0 and $X$ <= 79.0      |
| 7   | 57623  | 0      | 52056038.4906 | 903.3899  | 0.85743868  | $X$ > 79.0 and $X$ <= 96.0      |
| 8   | 157    | 0      | 238409.0720   | 1518.5291 | 1.37678185  | $X$ > 96.0 and $X$ <= 99.0      |
| 9   | 21991  | 0      | 60879166.6831 | 2768.3674 | 1.97729742  | $X$ > 99.0 and $X$ <= 139.0     |
| 10  | 440    | 0      | 1803209.8657  | 4098.2042 | 2.36958856  | $X$ > 139.0                     |
  • 13. Alternative Binning Algorithms
      The loss_mob package offers eight different binning algorithms to meet different business needs in various scenarios (as well as curiosity).

```python
mob.view_bin(mob.kmn_bin(dt["bonusmalus"], dt["purepremium"]))
```

| bin | freq   | miss | ysum          | yavg      | newx        | rule                   |
|-----|--------|------|---------------|-----------|-------------|------------------------|
| 1   | 431099 | 0    | 96318760.7388 | 223.4261  | -0.53963497 | $X$ <= 55              |
| 2   | 80649  | 0    | 20223542.3051 | 250.7600  | -0.42421935 | $X$ > 55 and $X$ <= 66 |
| 3   | 67270  | 0    | 28283842.6718 | 420.4525  | 0.09261601  | $X$ > 66 and $X$ <= 78 |
| 4   | 54381  | 0    | 42028721.0647 | 772.8567  | 0.70137806  | $X$ > 78 and $X$ <= 92 |
| 5   | 44614  | 0    | 73000914.5385 | 1636.2782 | 1.45146393  | $X$ > 92               |

```python
mob.view_bin(mob.los_bin(dt["bonusmalus"], dt["purepremium"]))
```

| bin | freq   | miss | ysum          | yavg      | newx        | rule                   |
|-----|--------|------|---------------|-----------|-------------|------------------------|
| 1   | 384156 | 0    | 80777461.9272 | 210.2726  | -0.60031106 | $X$ <= 50              |
| 2   | 94446  | 0    | 24910533.5905 | 263.7542  | -0.37369782 | $X$ > 50 and $X$ <= 60 |
| 3   | 77565  | 0    | 30475703.0135 | 392.9053  | 0.02485312  | $X$ > 60 and $X$ <= 72 |
| 4   | 76915  | 0    | 50625848.0929 | 658.2051  | 0.54080103  | $X$ > 72 and $X$ <= 90 |
| 5   | 44931  | 0    | 73066234.6948 | 1626.1876 | 1.44527805  | $X$ > 90               |
  • 14. Variable Importance after Transformation
      Because monotonic binning provides a rank-order capability for each attribute, the binning outcomes can be leveraged to calculate the Gini coefficient in order to evaluate the predictiveness of each predictor after transformation.
      The Gini outcome is highly consistent with the distance correlation.

```python
# calculate the gini coefficient for each binned attribute
gout = [{"variable": _, "gini": mob.bin_gini(bout[_])} for _ in vlst]
# sort all attributes by gini coefficient
pd.DataFrame(sorted(gout, key = lambda x: -x["gini"]))
```

| variable   | gini     | gini before binning |
|------------|----------|---------------------|
| bonusmalus | 0.373600 | 0.364684            |
| drivage    | 0.335541 | 0.319361            |
| vehage     | 0.130189 | 0.093274            |
| density    | 0.129020 | 0.075396            |
| vehpower   | 0.076282 | 0.026760            |

    Gini improved after binning.
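One common way to compute such a Gini coefficient is from the Lorenz curve of losses with records ranked by the (binned) predictor. Below is a naive sketch assuming equal weight per record; the exact formula inside bin_gini() is an assumption here:

```python
import numpy as np

def lorenz_gini(y, score):
    """Gini coefficient of outcome y when records are ranked by score.

    Sorts records by score, accumulates the share of total y, and
    measures twice the area between the Lorenz curve and the diagonal.
    Returns 0 for a completely uninformative ranking.
    """
    y = np.asarray(y, dtype=float)[np.argsort(score)]
    cum = np.cumsum(y) / y.sum()
    n = len(y)
    # trapezoid-rule area under the Lorenz curve
    area = (np.concatenate(([0.0], cum[:-1])) + cum).sum() / (2 * n)
    return float(1 - 2 * area)
```

A constant outcome yields 0, and a perfectly rank-ordered outcome yields the maximum attainable value for that distribution.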
  • 15. Transforming New Data
      Functions to apply transformations to new data and to verify the outcome.

```python
bin1 = mob.qtl_bin(dt["bonusmalus"], dt["purepremium"])
# score new data based on the binning outcome
out1 = mob.cal_newx(dt['bonusmalus'], bin1)
mob.head(out1, 3)
# {'x': 50, 'bin': 1, 'newx': -0.60031106}
# {'x': 60, 'bin': 3, 'newx': -0.24666305}
# {'x': 85, 'bin': 4, 'newx': 0.51388758}
mob.chk_newx(out1)
```

| bin | newx        | freq   | dist     | xrng        |
|-----|-------------|--------|----------|-------------|
| 1   | -0.60031106 | 384156 | 56.6591% | 50 <==> 50  |
| 2   | -0.34039004 | 68334  | 10.0786% | 51 <==> 57  |
| 3   | -0.24666305 | 80831  | 11.9217% | 58 <==> 68  |
| 4   | 0.51388758  | 82308  | 12.1396% | 69 <==> 85  |
| 5   | 1.25057961  | 62384  | 9.2010%  | 86 <==> 230 |
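Conceptually, scoring new data amounts to locating each value among right-inclusive cut points and looking up that bin's newx. A numpy sketch using the cut points and values from the qtl_bin output above (not the actual cal_newx() code):

```python
import numpy as np

# right-inclusive upper cut points and per-bin newx values,
# taken from the qtl_bin output above (last bin is open-ended)
cuts = [50, 57, 68, 85]
newx = [-0.60031106, -0.34039004, -0.24666305, 0.51388758, 1.25057961]

def apply_bins(x, cuts, newx):
    """Map raw values to their bin's transformed value."""
    idx = np.digitize(np.asarray(x, dtype=float), cuts, right=True)
    return np.take(newx, idx)
```

For example, 50 falls in bin 1, 60 in bin 3, and any value above 85 in the open-ended top bin, matching the rows of the table above.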
  • 16. Model Fitting without Transformation
      Estimate a Tweedie GLM with the aforementioned predictors without any transformation.
      Only 1 variable is statistically significant.

```python
Y = dt["purepremium"]
# use raw variables
X1 = sm.add_constant(pd.DataFrame({v: dt[v] for v in vlst}), prepend = True)
m1 = sm.GLM(Y, X1, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
```

```
==============================================================================
                 coef     std err         z     P>|z|     [0.025     0.975]      VIF
------------------------------------------------------------------------------
const          3.8697       0.575     6.729     0.000      2.743      4.997
bonusmalus     0.0344       0.005     6.792     0.000      0.024      0.044   1.3222
drivage       -0.0055       0.006    -0.907     0.364     -0.017      0.006   1.3018
vehage        -0.0024       0.013    -0.183     0.855     -0.029      0.024   1.0165
density    -7.882e-06    1.92e-05    -0.411     0.681  -4.55e-05   2.98e-05   1.0194
vehpower       0.0124       0.037     0.338     0.735     -0.060      0.085   1.0084
==============================================================================
```
  • 17. Model Fitting with Transformation
      Estimate a Tweedie GLM with the same predictors after transformation.
      There are 3 statistically significant variables.

```python
bout = dict((v, mob.iso_bin(dt[v], dt["purepremium"])) for v in vlst)
xout = dict((v, mob.cal_newx(dt[v], bout[v])) for v in vlst)
X2 = sm.add_constant(pd.DataFrame(dict((v, [_["newx"] for _ in xout[v]]) for v in vlst)), prepend = True)
m2 = sm.GLM(Y, X2, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
```

```
==============================================================================
                 coef     std err         z     P>|z|     [0.025     0.975]      VIF
------------------------------------------------------------------------------
const          5.9635       0.066    91.001     0.000      5.835      6.092
bonusmalus     0.4727       0.115     4.115     0.000      0.248      0.698   1.3983
drivage        0.6632       0.126     5.254     0.000      0.416      0.911   1.3794
vehage         0.0726       0.228     0.319     0.750     -0.374      0.519   1.0119
density        0.4690       0.227     2.063     0.039      0.023      0.915   1.0215
vehpower       0.5291       0.417     1.269     0.205     -0.288      1.347   1.0055
==============================================================================
```
  • 18. Model Performance
      A performance comparison between the model without variable transformation and the model with variable transformation is provided below.

| Statistical Metric | Without Transformation | With Transformation |
|--------------------|------------------------|---------------------|
| AIC                | 848,321                | 821,825             |
| Gini               | 0.3847                 | 0.4103              |
| sMAPE              | 1.9724                 | 1.9744              |
| MAE                | 734.6655               | 717.6874            |
| D2 Tweedie Score   | 0.0393                 | 0.0553              |
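For reference, the sMAPE values in the table are on a 0-2 scale. A minimal sketch, assuming the common definition 2|y - yhat| / (|y| + |yhat|) averaged over records; the exact variant used by loss_mob's smape() is an assumption here:

```python
import numpy as np

def smape(y, yhat):
    """Symmetric mean absolute percentage error on the 0-2 scale.

    Pairs where both y and yhat are zero are skipped to avoid 0/0.
    """
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    denom = np.abs(y) + np.abs(yhat)
    keep = denom > 0
    return float(np.mean(2 * np.abs(y - yhat)[keep] / denom[keep]))
```

A perfect fit scores 0; predicting a positive value where the actual is zero (common with pure premium, where most policies have no loss) contributes the maximum of 2, which explains values near 2 in the table.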
  • 19. Appendix I: Distance Correlation
      Distance correlation is a dependence measure between two paired vectors.

    [Figure: scatter-plot panels comparing Distance Correlation against Spearman Correlation. Source: github.com/vnmabus/dcor]
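The distance correlation illustrated in this appendix can be computed naively from its definition via double-centered pairwise distance matrices (an O(n^2) sketch; the dcor package cited above is the optimized reference implementation):

```python
import numpy as np

def dist_corr(x, y):
    """Distance correlation between two 1-d samples (naive O(n^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def centered(v):
        # pairwise absolute distances, double-centered
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    a, b = centered(x), centered(y)
    dcov2 = (a * b).mean()
    dvar2 = (a * a).mean() * (b * b).mean()
    return float(np.sqrt(dcov2 / np.sqrt(dvar2)))
```

Unlike Spearman correlation, distance correlation is zero only under independence; for any increasing linear relationship it equals 1, since the two distance matrices are then proportional.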
  • 20. Appendix II: Core Functions of loss_mob

```
loss_mob
|-- qtl_bin()  : Iterative discretization based on quantiles of X.
|-- los_bin()  : Revised iterative discretization for records with Y > 0.
|-- iso_bin()  : Discretization driven by isotonic regression.
|-- val_bin()  : Revised iterative discretization based on unique values of X.
|-- rng_bin()  : Revised iterative discretization based on the equal-width range of X.
|-- kmn_bin()  : Iterative discretization based on k-means clustering of X.
|-- gbm_bin()  : Discretization based on the gradient boosting machine (GBM).
|-- cus_bin()  : Customized discretization based on pre-determined cut points.
|-- view_bin() : Displays the binning outcome in tabular form.
|-- cal_newx() : Applies the variable transformation to a numeric vector based on a binning outcome.
|-- chk_newx() : Verifies the transformation generated by the cal_newx() function.
|-- mi_score() : Calculates the Mutual Information (MI) score between X and Y.
|-- screen()   : Calculates the Spearman and distance correlations between X and Y.
|-- bin_gini() : Calculates the gini coefficient between X and Y based on the binning object.
|-- num_gini() : Calculates the gini coefficient between raw values of X and Y.
|-- smape()    : Calculates the sMAPE value between Y and Yhat.
`-- get_mtpl() : Extracts the French Motor Third-Party Liability Claims dataset from OpenML.
```