Agency Performance Prediction
Contents
Business Case
Insights Required & Business Benefits
A Bit About Domain…
Data Preprocessing
Modelling
Approach
Evaluation Metric
Outcome
Best Model Comparison
Model Interpretation & Key Challenges
Business Case
Azure Insurance Group operates property and casualty (P&C) insurance, life
insurance, and insurance brokerage companies. Azure sells policies through
direct and indirect sales channels. For indirect selling, Azure has tie-ups with
1,600+ agencies across 6 states. Azure is interested in classifying existing
agencies into predefined performance categories in a supervised predictive
framework, based on each agency's past performance. Specifically, Azure
expects to better understand which agencies are likely to bring more growth
in the Personal Line (PL) of business.
Insights Required & Business Benefits
What Insights Are Required?
Classify each agency into one of the following categories
– GROW: Business from the agency is likely to grow > 5% in 2014
– STABLE: Business from the agency is likely to stay flat, with growth in the range [-5%, 5%] in 2014
– LOSS: Business from the agency is likely to shrink > 5% (< -5% growth) in 2014
Note: Business growth is measured as the % growth in the Average Monthly Written Premium Amount achieved by the agency for a given year
Potential Business Benefits
• Improved understanding of Agency Performance - at a micro level & macro level
– How is an individual agency likely to perform?
– How are all the agencies in a state likely to perform?
• Optimized utilization of Agency Development Funds
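The GROW/STABLE/LOSS labelling rule above is simple enough to sketch directly. A minimal Python version (the function name `performance_class` is illustrative; the original pipeline was built in R):

```python
def performance_class(growth_rate: float) -> str:
    """Map a yearly % growth figure to the predefined performance category.

    GROW   : growth > 5%
    STABLE : growth in [-5%, 5%]
    LOSS   : growth < -5%
    """
    if growth_rate > 5.0:
        return "GROW"
    if growth_rate >= -5.0:
        return "STABLE"
    return "LOSS"
```

Note that the boundary values ±5% fall into STABLE, matching the closed interval [-5%, 5%] in the definition above.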
A Bit About Domain…
What is Insurance?
• A risk management tool for the customer (individual or business), allowing them to transfer the risk of financial loss to the insurance company
• In exchange for a constant stream of premiums, insurance companies offer to pay consumers a sum of money upon the occurrence of a predetermined event, such as a natural catastrophe, a car crash, or death
• Broadly, from a business perspective, insurance is classified as Life or Non-Life (General)
Insurance type hierarchy (reconstructed from the slide diagram):
• Insurance
  – Life Insurance
  – General Insurance
    • Property & Casualty Insurance
    • Medical Insurance
    • Motor Vehicle Insurance
    • Marine Insurance
    • Fire Insurance
    • Homeowner's Insurance
Data Preprocessing
What Data Was Provided By Azure?
• 213K+ observations with 49 dimensions
• Each observation represents yearly aggregated data for an agency, for a year, for a state, for a product
• Key attribute summary:
– 1624 agencies
– 11 years of time duration (2005-2015)
– 6 states
– 29 products
– 2 product lines
• No target class in the data!
Attribute Analysis
Each input attribute was assessed from 3 different angles:
• Business meaning: What does it mean?
• Domain Expertise Based Predictive Importance: Can it help in predicting agency performance?
• Sparsity: Does it have enough values?
Data Preprocessing (Cont…)
Key Preprocessing Challenges:
1. Missing Values
   – Identified and dropped highly sparse attributes
   – Missing values encoded as "99999" or "Unknown" were converted to NA during file read in R
2. Unwanted Data
   – Agencies appointed as late as 2014, for which a 2014 growth rate cannot be calculated, were removed
   – The scope of the analysis is Personal Line (PL) data; hence Commercial Line (CL) data was filtered out
3. Unavailable Data
   – New attributes were created for all Quantity and Revenue attributes, averaging them over the number of months for which data is available
4. Incomplete Data
   – 2005 and 2015 data were removed, as data were available for only 8 and 5 months respectively
5. Repeating Data
   – Agency-specific attributes were detached from the raw data, processed separately, and later merged back with the main data
6. Format of Data for Modelling
   – All Quantity and Revenue attributes were aggregated by AGENCY_ID and YEAR
   – Each important attribute was expanded with AGENCY_ID in rows and a year identifier in columns; e.g. the WrittenPremAmount column was converted to 2006_WrittenPremAmount, 2007_WrittenPremAmount, ...
7. No Target Class Present
   – A lag variable was created for Written Premium Amount
   – The growth rate for each agency was calculated for all years (2006-2014)
   – Each agency was assigned a class label based on its 2014 growth rate:
     • GROW: 2014 growth rate > 5%
     • STABLE: 2014 growth rate in the range [-5%, 5%]
     • LOSS: 2014 growth rate < -5%
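Steps 6 and 7 above (aggregate, pivot to wide format, lag, compute growth, assign labels) were done in R; the same idea can be sketched in pandas. All column values below are toy data, and the column/variable names are illustrative stand-ins for the real attributes:

```python
import pandas as pd

# Toy yearly premium data; in the real pipeline this comes from aggregating
# WrittenPremAmount by AGENCY_ID and YEAR (challenge 6, step 1).
df = pd.DataFrame({
    "AGENCY_ID": [1, 1, 1, 2, 2, 2],
    "YEAR": [2012, 2013, 2014, 2012, 2013, 2014],
    "WrittenPremAmount": [100.0, 110.0, 120.0, 200.0, 190.0, 170.0],
})

# Challenge 6, step 2: one row per agency, one column per year
# (the analogue of 2006_WrittenPremAmount, 2007_WrittenPremAmount, ...).
wide = df.pivot(index="AGENCY_ID", columns="YEAR", values="WrittenPremAmount")

# Challenge 7, steps 1-2: lag the premium by one year and compute % growth.
growth = (wide - wide.shift(axis=1)) / wide.shift(axis=1) * 100

# Challenge 7, step 3: assign the class label from the 2014 growth rate.
def label(g):
    if g > 5:
        return "GROW"
    if g >= -5:
        return "STABLE"
    return "LOSS"

target = growth[2014].apply(label)
```

Here agency 1 grew (120 - 110) / 110 ≈ +9.1% in 2014 (GROW), while agency 2 shrank (170 - 190) / 190 ≈ -10.5% (LOSS).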
Modelling - Approach
• Important features were identified using Boruta package (11 attributes dropped)
• As this is a classification problem, the following algorithms were used:
– CART
– C5.0
– Random Forest
– K Nearest Neighbours
– Artificial Neural Network
– Support Vector Machine
– GBM
– Ensemble-Stacking
• The algorithms were tried on three flavours of the data:
– ASIS Data
– ASIS Data + Range transformation
– ASIS Data + Range transformation + Important Features
• 10-fold cross-validation (repeated 3-10 times) was performed to get an initial best estimate of the
hyperparameters ("caret" package)
• One or more rounds of grid search were used to fine-tune the hyperparameter values ("caret"
package)
• Cost-sensitive learning was used in CART and SVM
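The tuning above was done with R's caret (repeated 10-fold CV followed by grid search). A rough scikit-learn equivalent of the same procedure, with a random forest as one representative model and a deliberately small illustrative grid (the stand-in data and parameter values are not from the original study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Stand-in 3-class data; the real features come from the preprocessed agency table.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)

# Repeated 10-fold cross-validation, mirroring the caret step (3 repeats shown).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)

# Grid search to fine-tune hyperparameters over the repeated CV splits.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_features": ["sqrt", "log2"]},
    cv=cv,
    scoring="f1_macro",
)
grid.fit(X, y)
```

In scikit-learn, the cost-sensitive learning mentioned for CART and SVM would map to the `class_weight` parameter of `DecisionTreeClassifier` and `SVC`.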
Modelling – Evaluation Metric
• Interesting insight:
  – Only ~40% of the agencies achieved > 0% growth in 2014
  – Of that 40%, only ~50% of the agencies grew > 5%. The same is reflected in the
    2014 growth class distribution:
    GROW 21% | STABLE 37% | LOSS 42%
• Azure is interested in identifying agencies in the GROW class as accurately as
possible
• Model evaluation metric:
  – Higher recall for the GROW class, AND
  – Optimal F1 to balance the recall-precision tradeoff
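The evaluation metric above is recall and F1 restricted to one class of interest. With scikit-learn this can be computed by passing `labels=["GROW"]` so that only the GROW class contributes (the labels below are toy values, not model output):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy ground truth and predictions; the real evaluation uses 2014 predictions.
y_true = ["GROW", "GROW", "STABLE", "LOSS", "GROW", "LOSS"]
y_pred = ["GROW", "STABLE", "STABLE", "GROW", "GROW", "LOSS"]

# One-vs-rest metrics for the GROW class only.
grow_recall = recall_score(y_true, y_pred, labels=["GROW"], average="macro")
grow_precision = precision_score(y_true, y_pred, labels=["GROW"], average="macro")
grow_f1 = f1_score(y_true, y_pred, labels=["GROW"], average="macro")
```

In this toy sample, 2 of the 3 true GROW agencies are caught (recall 2/3) and 2 of the 3 predicted GROW labels are correct (precision 2/3).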
Modelling – Best Model Comparison
Best Model vs. Baseline Model:
• In the absence of a model, or as a baseline, the best estimate of the 2014 performance
class is the MODE of the 2014 performance class attribute.
• The baseline model would predict the "LOSS" class for all agencies since, with 42% of
observations, "LOSS" is the most frequent class

Metric                 | Baseline Model | Best Predictive Model
GROW Class – Recall    | 0              | 0.80
GROW Class – Precision | 0              | 0.35
GROW Class – F1        | NA             | 0.49
Overall Accuracy       | 41.78%         | 49.20%
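The baseline column follows mechanically from the mode rule: predicting the modal class for every agency gives zero GROW recall, and accuracy equal to the modal class share. A sketch on a toy sample of 100 agencies mirroring the 21/37/42 split from the deck:

```python
from collections import Counter

# Class distribution from the deck: GROW 21%, STABLE 37%, LOSS 42% (toy n=100).
y_true = ["GROW"] * 21 + ["STABLE"] * 37 + ["LOSS"] * 42

# Baseline: always predict the most frequent (modal) class.
mode_class = Counter(y_true).most_common(1)[0][0]
y_pred = [mode_class] * len(y_true)

baseline_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
grow_recall = sum(t == p == "GROW" for t, p in zip(y_true, y_pred)) / y_true.count("GROW")
```

On the real data the modal share is 41.78% rather than an even 42%, which is exactly the baseline accuracy in the table above.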
Modelling – Model Interpretation & Key Challenges
Model Interpretation
• If an agency is likely to grow > 5% in 2014:
  – The best predictive model accurately labels it as "GROW" in 4 out of 5 cases (recall = 0.80)
• If the best predictive model has labelled an agency as "GROW":
  – In ~1 out of 3 cases the agency will actually grow > 5% in 2014 (precision = 0.35)
  – In ~2 out of 3 cases the agency will turn out to be STABLE or LOSS in 2014
Key Challenges:
• GROW is a minority class: the class distribution is imbalanced and skewed toward the "LOSS" class
• For the majority of algorithms, learning is skewed toward classifying the LOSS class correctly,
which is not what Azure is interested in
• The data has a lot of variance; it is difficult to get test data truly representative of the training data!
• There is not enough data to overcome the class imbalance and the variance in the data