This document provides a guide for building generalized linear models (GLMs) in R to accurately model insurance claim frequency and severity. It outlines steps for data preparation, including handling missing data, transforming variables, and removing outliers. It then discusses modeling count/frequency data with Poisson or negative binomial models including an offset for exposure. Severity is typically modeled with a gamma or normal distribution. The document provides examples of investigating interactions and comparing models using AIC, BIC, and residual analysis.
Guide for building accurate, effective and efficient GLMs in R
By: Ali T. Lotia
Follow the guide from start to finish as it is written. It will allow for the creation of any frequency*severity GLM as efficiently as possible and will greatly aid in the application of any GLMs, avoiding pitfalls that slow progress and lead to inaccurate results.
Reading the data into R:
The RODBC package provides the connection to the database. Install and load it with:
install.packages("RODBC")   # only needed if the package is not already installed; administrator access may be required
library(RODBC)
The command to open a connection to the database server is:
channel <- odbcConnect("Server name", "username", "password");
The appropriate dataset is then selected using a SQL query within R:
dataset <- sqlQuery(channel, "select * from name of dataset");
Editing the dataset to make it appropriate for model fitting
Handling NA values:
This step is critical if the unavailable (NA) values are to be treated as 0s. By default, R's modeling functions tend to ignore or drop rows containing NA values.
dataset[is.na(dataset)] <- 0
Changing appropriate variables into categorical variables:
Certain numerical variables take only a small number of fixed values, such as limits, co-insurance or maximum deductibles. Others may not be linearly related to the response variable, such as age (young children, babies in particular, spend more on medical bills, which then decrease as they grow older; after a certain age, medical bills rise once again). In these scenarios, it is more appropriate to transform these variables into categorical variables. Age was turned into age-group strings directly in SQL (easier than doing it in R).
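If the grouping needs to be done in R instead, cut() offers a one-line alternative. The column name age and the break points below are purely illustrative:
Example:
dataset$age_group <- cut(dataset$age, breaks = c(0, 5, 17, 25, 30, 45, 65, Inf), include.lowest = TRUE)   # returns a factor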
Variables are transformed with the factor function; by default, the glm function in R will treat character (string) vectors as categorical variables:
Example:
dataset$variablename <- factor(dataset$variablename)
Changing reference levels to desired ones:
Reference levels (the levels absorbed into the intercept) may be changed for presentation purposes. In the previous project, they were changed to the categories with the highest exposure. This can be achieved using the “within” and “relevel” functions.
Example:
dataset <- within(dataset, Gender <- relevel(Gender, ref = "Male"))
dataset <- within(dataset, age_group <- relevel(age_group, ref = "26-30"))
Extreme values:
Large datasets typically contain extreme and illogical values that result from incorrect data entry when the dataset was constructed. Extreme and illogical values may be excluded by making appropriate changes to the SQL query while selecting the dataset.
dataset <- sqlQuery(channel, "select * from name of dataset where …");
What comes after the ‘where’ depends entirely on the names of the variables you are considering and
how you determine extreme values. For example:
Example:
dataset2 <- sqlQuery(channel, "select * from name of dataset where variable1 > 0 and variable1 < Extreme value");
Here, only positive values are considered for variable1. This may be done when, for example, you are looking at average or total cost. The 0s correspond to individuals without claims; this information is already accounted for in the count model and should not be included in the severity model.
Extreme value calculation:
This may be determined in multiple ways. A statistical approach is to treat values more than 2-3 standard deviations from the mean as extreme and to restrict the selection to values below that cutoff:
Extreme value = mean(variable_to_check) + k * sd(variable_to_check), with k typically 2 or 3
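A minimal R sketch of this cutoff under the 3-standard-deviation choice, reusing the placeholder column variable1 from the query above:
Example:
cutoff <- mean(dataset$variable1, na.rm = TRUE) + 3 * sd(dataset$variable1, na.rm = TRUE)
dataset2 <- subset(dataset, variable1 > 0 & variable1 < cutoff)   # keep positive, non-extreme values only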
Removing extreme values is a crucial step in the construction of GLMs involving multiple categorical variables, because the GLM divides the dataset into increasingly specific categories. For example, a model with 3 categorical variables, each with 4 levels, divides the data into 64 cells, assuming each combination of levels contains roughly the same number of data points. These small groups are highly affected by extreme values and may give highly inaccurate estimates, or even results that are directionally opposite to unbiased estimates.
General approach to building GLMs that estimate total claims made by an individual using the frequency*severity approach
Investigating interactions:
Interactions may be investigated using interaction plots, which show how the effect of one variable changes across the levels of another. Parallel lines indicate a purely additive effect of the second variable (the change is consistent across all levels), while intersecting lines show a change in direction due to the interaction effect. Diverging or converging lines show an additional interaction effect in the same or opposite direction respectively.
Three-way and higher-order interactions can also be investigated, but AICc should be used to compare the models. AICc assigns a greater penalty to additional parameters, which matters because as interaction terms between categorical variables accumulate it becomes increasingly difficult to control over-fitting.
Example:
interaction.plot(dataset$age_group, dataset$Gender, dataset$Count)
interaction.plot(dataset2$age_group, dataset2$Gender, dataset2$Amount)
Modeling frequency:
Theoretical discussions describe claim frequency as a Poisson distributed event. This may, however, not be appropriate where the model assumptions are broken; a negative binomial or an over-dispersed Poisson may be used instead.
Distplot: distplot (from the vcd package) is a Q-Q-like plot for count variables. It is a way of investigating the marginal distribution of the response variable. Although the conditional distribution may not follow the same distribution as the marginal one, I have found this investigation to be an effective starting point, and the most effective method of checking the model distribution (the residuals vs. fitted plot) has agreed with the marginal distribution on almost every prior occasion.
Example:
library(vcd)   # distplot is provided by the vcd package
distplot(dataset$Count, type = "nbinomial")
vs.
distplot(dataset$Count, type = "poisson")
Offset exposure term:
The model investigating frequency must account for exposure, as each data point has an individual exposure which greatly influences the number of claims a person makes. The log of the exposure is added to the model as an offset term, which accounts for the different exposure values.
count <- glm(Count ~ variables + offset(log(exposure_2015_2016)), data = dataset, family = 'poisson')
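If the distplot and residual checks point to over-dispersion, a negative binomial fit is a natural alternative. A minimal sketch using glm.nb from the MASS package, keeping the same offset and placeholder predictors:
Example:
library(MASS)
count_nb <- glm.nb(Count ~ variables + offset(log(exposure_2015_2016)), data = dataset)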
Modeling severity:
Severity theoretically follows a gamma distribution for medical insurance, but on certain occasions the normal distribution may be more appropriate. Both the Q-Q plot and the residuals vs. fitted plot agreed that a normal distribution was the better model assumption for the last frequency*claims project.
Modeling the average claim amount does not include an exposure term: theoretically, the amount a person spends per claim on average has nothing to do with how long the person is exposed. Even when tried in practice, offsetting exposure in the average-claims model reduced the goodness of fit and increased the inaccuracy of the results.
It is worth noting that counts and average amounts may not have the same significant interaction terms.
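A minimal sketch of the severity fit under these assumptions, using the positive-claims subset dataset2 and the Amount column, with no exposure offset; the gaussian alternative is shown commented out for when the diagnostics favour normality:
Example:
severity <- glm(Amount ~ variables, data = dataset2, family = Gamma(link = "log"))
# severity <- glm(Amount ~ variables, data = dataset2, family = gaussian(link = "log"))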
Final notes:
The predicted counts may be between 0 and 1, especially for in-patients. This may seem counter-intuitive at first, but the average person does not visit the hospital every year; a person may visit the hospital around once every 5 years, resulting in 0.2 counts per year.
The link function and transformation of results:
The link function, which relates the mean of the conditional response variable to the linear predictor, was found to be log for all models and will likely be the same for all frequency and severity related models. Since a log link was used, the coefficients must be exponentiated to interpret them on the original scale.
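A short sketch of recovering the multiplicative effects from the fitted frequency model count defined above:
Example:
exp(coef(count))              # multiplicative effects (rate ratios) relative to the reference levels
exp(confint.default(count))   # approximate Wald 95% confidence intervals on the same scale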
Checking model fit and competing models:
The residuals vs. fitted plot should finally be examined to determine whether the model fits well and whether the residuals are unrelated to the fitted values (homoscedasticity vs. heteroscedasticity). The points should be randomly scattered. If they are centred near 0, a large number of residuals (observed minus fitted values) were close to 0 and the model made accurate predictions.
plot(fitted(count), resid(count))
Competing models can be compared on the basis of AIC, BIC and AICc. AIC is included in the GLM output, and the other criteria can be computed from the fitted model. Although they give no absolute measure of model fit, they are useful for comparing competing models as they penalise over-fitting. A common rule of thumb for AIC is that a difference of 2 or more between two competing models is meaningful evidence in favour of the model with the lower AIC.
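A minimal sketch of such a comparison, fitting a hypothetical competing model count_int that adds an age-group/gender interaction to the earlier frequency model; AICc is not in base R, so the sketch assumes the AICcmodavg package:
Example:
count_int <- glm(Count ~ variables + age_group:Gender + offset(log(exposure_2015_2016)), data = dataset, family = 'poisson')
AIC(count, count_int)
BIC(count, count_int)
library(AICcmodavg)   # provides AICc() for fitted model objects
AICc(count)
AICc(count_int)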
Alternate models:
An alternate model was built which used the principle of total claims. This was modeled as (probability a claim is made given the factors) * (total amount claimed given the factors).
A logistic model was used to determine the probability of claims. It should be noted that logistic regression models return log-odds. The transformation required to turn them into probabilities is:
p = exp(Σᵢ₌₁ⁿ βᵢxᵢ) / (1 + exp(Σᵢ₌₁ⁿ βᵢxᵢ))
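A brief sketch of this alternate route, assuming a hypothetical 0/1 indicator column HasClaim; predict(..., type = "response") applies the transformation above automatically:
Example:
claim_prob <- glm(HasClaim ~ variables, data = dataset, family = binomial)
predicted_p <- predict(claim_prob, type = "response")   # probabilities rather than log-odds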
Alternatively, a linear model assuming a Gaussian family of distributions was used to model total amounts. This agreed with the Q-Q plots and, eventually, the residuals vs. fitted plots.
This model was not transformed and compared to the frequency*severity model, as it is not applied in actuarial practice but was hypothesized and created by me.
The reason probabilities and counts are separated out and analyzed individually is that the dataset is highly zero-inflated: most people do not make claims in a given year, especially in-patient claims.