Guide for building accurate, effective and efficient GLMs in R
By: Ali T. Lotia
Follow the guide from start to finish as it is written. It will allow you to build any
frequency*severity GLM as efficiently as possible and will help you avoid the pitfalls that slow
progress and lead to inaccurate results.
Reading the data into R:
Reading the data requires the RODBC package. Install and load it with the commands below
(installation is only needed the first time and may require administrator access):
install.packages("RODBC")  # only needed if the package is not already installed
library(RODBC)
The connection to the database server is then opened with:
channel <- odbcConnect("Server name", "username", "password");
The appropriate database is then selected using a SQL query within R:
dataset <- sqlQuery(channel, "select * from name of dataset");
Editing the dataset to make it appropriate for model fitting
Handling NA values:
This step is critical if unavailable values are to be treated as 0s. By default, R model-fitting
functions such as glm() drop rows containing NA values rather than treating them as zeros.
dataset[is.na(dataset)] <- 0
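If only specific columns should be zero-filled, the same idiom can be restricted to a subset of columns; treating Count and Amount as the columns to fill is an illustrative choice:
Example (sketch):
# Zero-fill only the selected columns
zero_cols <- c("Count", "Amount")
dataset[zero_cols][is.na(dataset[zero_cols])] <- 0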
Changing appropriate variables into categorical variables:
Certain numerical variables take only a small number of fixed values, such as limits, co-insurance
percentages or maximum deductibles. Others may not be linearly related to the response variable;
age is a typical example (young children, babies in particular, spend more on medical bills, which
then decrease as they grow older and rise again after a certain age). In these scenarios it is more
appropriate to transform these variables into categorical variables. Age was turned into age-group
strings directly in SQL (easier than doing it in R).
Variables are transformed with the factor() function; by default, glm() in R treats character
(string) vectors as categorical variables:
Example:
dataset$variablename <- factor(dataset$variablename)
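Although the age groups were created directly in SQL here, the same grouping can be done in R with cut(); the break points and labels below are illustrative only and would need to match the SQL grouping:
Example (sketch):
# Illustrative break points; cut() returns a factor directly
dataset$age_group <- cut(dataset$age,
                         breaks = c(0, 17, 25, 30, 40, 60, Inf),
                         labels = c("0-17", "18-25", "26-30", "31-40", "41-60", "60+"))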
Changing reference levels to desired ones:
Reference levels (the levels absorbed into the intercept) may be changed for presentation purposes.
In the previous project, they were changed to the categories with the highest exposure. This can be
achieved using the within() and relevel() functions.
Example:
dataset <- within(dataset, Gender <- relevel(Gender, ref = "Male"))
dataset <- within(dataset, age_group <- relevel(age_group, ref = "26-30"))
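To pick the reference level programmatically as the category with the highest total exposure rather than typing it by hand, something along these lines can be used; the exposure column is the one used later in the offset term:
Example (sketch):
# Reference level = Gender category with the largest total exposure
ref_level <- names(which.max(tapply(dataset$exposure_2015_2016, dataset$Gender, sum)))
dataset$Gender <- relevel(dataset$Gender, ref = ref_level)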
Extreme values:
Large datasets typically contain extreme and illogical values that result from incorrect data entry
when the data set was constructed. Extreme and illogical values may be excluded by making
appropriate changes to the SQL query used to select the dataset.
dataset <- sqlQuery(channel, "select * from name of dataset where …");
What comes after the 'where' depends entirely on the names of the variables you are considering and
how you determine extreme values.
Example:
dataset2 <- sqlQuery(channel, "select * from name of dataset where variable1 > 0 and variable1 < Extreme value");
Here, only positive values of variable1 are kept. This is appropriate when, for example, you are
modeling average or total cost: a 0 represents an individual without claims, which is already
accounted for in the count model and should not be included in the severity model.
Extreme value calculation:
This may be determined in multiple ways. A statistical method is to treat values more than 2-3
standard deviations above the mean as extreme and to restrict the selection accordingly:
Extreme value = mean(variable_to_check) + k * sd(variable_to_check), where k is typically 2-3
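As a sketch, the same threshold and filter can be computed directly in R, mirroring the SQL 'where' clause above; variable1 is the placeholder column being checked:
Example (sketch):
# Threshold at k standard deviations above the mean (k = 3 here)
k <- 3
extreme_value <- mean(dataset$variable1, na.rm = TRUE) + k * sd(dataset$variable1, na.rm = TRUE)
# Keep only positive values below the threshold
dataset2 <- dataset[dataset$variable1 > 0 & dataset$variable1 < extreme_value, ]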
Removing extreme values is a crucial step in the construction of GLMs involving multiple categorical
variables, because the model effectively divides the dataset into increasingly specific cells. For
example, a model with 3 categorical variables of 4 levels each divides the data into 64 cells,
assuming each combination of levels contains roughly the same number of data points. These small
groups are strongly affected by extreme values and may give highly inaccurate estimates, even ones
that are directionally opposite to unbiased estimates.
General approach to building GLMs that estimate total claims made by an
individual using the frequency*severity approach
Investigating interactions:
Interactions may be investigated using interaction plots, which show how the effect of one variable
differs across the levels of another. Parallel lines indicate a consistent (purely additive) effect
of the second variable across all levels; intersecting lines indicate that the direction of the
effect changes because of the interaction; diverging or converging lines indicate an additional
interaction effect in the same or the opposite direction, respectively.
Three-way and higher-order interactions can also be investigated, but AICc should be used to
compare the resulting models. AICc assigns a greater penalty to additional parameters, which
matters because as interaction terms between categorical variables accumulate, over-fitting
becomes increasingly difficult to control.
Example:
interaction.plot(dataset$age_group, dataset$Gender, dataset$Count)
interaction.plot(dataset2$age_group, dataset2$Gender, dataset2$Amount)
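If a plot suggests an interaction, the corresponding term can be included in the model formula and the two fits compared. A sketch, using the count model that is defined later in this guide:
Example (sketch):
# Main-effects model vs. a model with an age_group:Gender interaction
count_main <- glm(Count ~ age_group + Gender + offset(log(exposure_2015_2016)),
                  data = dataset, family = poisson)
count_int <- glm(Count ~ age_group * Gender + offset(log(exposure_2015_2016)),
                 data = dataset, family = poisson)
AIC(count_main, count_int)  # use AICc when many parameters are involved, as discussed above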
Modeling frequency:
Theoretically, claim frequency is described as a Poisson-distributed event. This may, however, not
be appropriate where the model assumptions are broken; a negative binomial or an over-dispersed
Poisson model may be used instead.
Distplot: distplot() (from the vcd package) produces a Q-Q-like plot for count variables. It is a
way of investigating the marginal distribution of the response variable. Although the conditional
distribution may not follow the same distribution as the marginal one, I have found this
investigation to be an effective starting point, and the most reliable method of checking the model
distribution (the residuals vs. fitted plot) has agreed with the marginal distribution on almost
every prior occasion.
Example:
distplot(dataset$Count, type = "nbinomial")
vs.
distplot(dataset$Count, type = "poisson")
Offset exposure term:
The frequency model must account for exposure, since each data point has an individual exposure
period that strongly influences the number of claims a person makes. The log of the exposure is
added to the model as an offset term, which accounts for the different exposure values.
count <- glm(Count ~ variables + offset(log(exposure_2015_2016)), data = dataset, family = 'poisson')
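Where the Poisson assumption breaks down because of over-dispersion, the alternatives mentioned above can be fitted in much the same way; glm.nb() comes from the MASS package. A sketch:
Example (sketch):
# Over-dispersed (quasi-)Poisson: same mean structure, dispersion estimated from the data
# (note: quasi-likelihood models do not report an AIC)
count_qp <- glm(Count ~ variables + offset(log(exposure_2015_2016)),
                data = dataset, family = quasipoisson)
# Negative binomial alternative; glm.nb() estimates the dispersion parameter theta
library(MASS)
count_nb <- glm.nb(Count ~ variables + offset(log(exposure_2015_2016)), data = dataset)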
Modeling severity:
Severity theoretically follows a gamma distribution for medical insurance, but on certain occasions
the normal distribution may be more appropriate. Both the Q-Q plot and the residuals vs. fitted
plots agreed that a normal distribution was the better assumption for the last frequency*severity
project.
The average-claim-amount model does not include an exposure term. Theoretically, the amount a
person spends per claim on average has nothing to do with how long the person is exposed for. In
practice, too, offsetting exposure in the average-claims model reduced the goodness of fit and made
the results less accurate.
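A sketch of the severity model, fitted to the positive-claims data set (dataset2) created earlier; the Gamma fit with a log link is the theoretical default and the Gaussian fit is the alternative that worked better in the last project:
Example (sketch):
# Gamma severity model with log link (the theoretical default)
severity_gamma <- glm(Amount ~ variables, data = dataset2, family = Gamma(link = "log"))
# Gaussian alternative, also with a log link so the coefficients stay on the same scale
severity_gauss <- glm(Amount ~ variables, data = dataset2, family = gaussian(link = "log"))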
It is worth noting that counts and average amounts may not have the same significant interaction terms.
Final notes:
The fitted counts may be between 0 and 1, especially for in-patient claims. This may seem counter-
intuitive at first, but on reflection it makes sense: the average person does not visit the hospital
every year. A person who visits the hospital roughly once every 5 years has an expected count of
0.2 claims per year.
The link function and transformation of results:
The link function, which relates the mean of the conditional response variable to the linear
predictor, was found to be the log link for all models and will likely be the same for all
frequency- and severity-related models. Since a log link was used, the coefficients must be
exponentiated to interpret them on the original scale.
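A sketch of the back-transformation for the count model fitted above:
Example (sketch):
# Coefficients on the multiplicative (response) scale
exp(coef(count))
# Approximate multiplicative confidence intervals (profile-likelihood CIs)
exp(confint(count))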
Checking model fit and competing models:
Finally, the residuals vs. fitted graph must be plotted to check that the model fits well and that
the residuals are not related to the fitted values (homoscedasticity rather than heteroscedasticity).
The points should be randomly scattered. If they are centred near 0, most residuals (observed minus
fitted values) are small and the model makes accurate predictions.
plot(fitted(count),resid(count))
Competing models can be compared on the basis of AIC, BIC and AICc. AIC is included in the GLM
output; the others can be computed from the fitted model. Although these measures give no absolute
indication of model fit, they are useful for comparing competing models because they penalize
over-fitting. The general rule for AIC is that a difference of 12 between two competing models is
significant evidence in favour of the model with the lower AIC.
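A sketch of comparing two fitted models (model_a and model_b are placeholders for any two competing fits); AIC() and BIC() are built into R, while AICc is available from add-on packages such as MuMIn (the package choice here is an assumption):
Example (sketch):
# Lower values indicate the preferred model
AIC(model_a, model_b)
BIC(model_a, model_b)
# Small-sample corrected AIC; MuMIn is one package that provides AICc()
library(MuMIn)
AICc(model_a, model_b)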
Alternate models:
An alternate model was built based on total claims. It was modeled as:
(probability a claim is made given the factors) * (total amount claimed given the factors)
A logistic model was used to estimate the probability of a claim. Note that logistic regression
returns coefficients on the log-odds scale; the transformation required to turn the linear
predictor into a probability is:
p = exp(∑ βi xi) / (1 + exp(∑ βi xi)), where the sum runs over i = 1, …, n
The amount component was modeled with a linear model assuming a Gaussian family of distributions
for the total amounts. This agreed with the Q-Q plots and, eventually, the residuals vs. fitted
plots.
This model was not transformed and compared to the frequency*severity model, as it is not applied
in actuarial practice but was hypothesized and created by me.
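A sketch of the two components of this alternate model; claimed (a 0/1 indicator of whether any claim was made) and total_amount are hypothetical column names:
Example (sketch):
# Probability component: logistic regression on a 0/1 claim indicator
claim_prob <- glm(claimed ~ variables, data = dataset, family = binomial)
p_hat <- predict(claim_prob, type = "response")  # the inverse-logit transformation shown above
# Amount component: Gaussian model for the total amount claimed, fitted to claimants only
total_model <- glm(total_amount ~ variables, data = dataset2, family = gaussian)
# Expected total claims per individual = probability of a claim * expected total amount
expected_total <- p_hat * predict(total_model, newdata = dataset, type = "response")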
The reason probabilities and counts are separated and analyzed individually is that the model's
data set is highly zero-inflated: most people do not make claims in a given year, especially
in-patient claims.
