Agency Performance Prediction
Miraj Vashi
11-Dec-2016
Agency Performance Prediction | 2
Contents
 Business Case
 Insights Required & Business Benefits
 A Bit About Domain…
 Data Pre-Processing
 Modelling
 Approach
 Evaluation Metric
 Outcome
 Best Model Comparison
 Model Interpretation & Key Challenges
Business Case
Azure Insurance Group operates property and casualty (P&C) insurance, life
insurance and insurance brokerage companies. Azure sells policies through
direct and indirect sales channels. For indirect selling, Azure has tie-ups with
1,600+ agencies across 6 states. Azure is interested in classifying existing
agencies into predefined performance categories in a supervised predictive
framework, based on each agency's past performance. Specifically, Azure expects
to better understand which agencies are likely to bring more growth in the
Personal Line (PL) of business.
Insights Required & Business Benefits
What Insights Are Required?
Classify each agency into one of the following categories
– GROW: Business from the agency is likely to grow > 5% in 2014
– STABLE: Business from the agency is likely to stay flat with growth in the range [-5%,5%] in 2014
– LOSS: Business from the agency is likely to shrink > 5% (< -5% growth) in 2014
Note: Business growth is measured as % growth in Average Monthly Written Premium Amount achieved by the agency for a given year
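The three-way classification rule above can be written as a small function. A minimal sketch in Python (the function name and the use of fractional growth rates are illustrative, not from the original R-based work):

```python
def performance_class(growth_rate: float) -> str:
    """Map % growth in Average Monthly Written Premium (as a
    fraction, e.g. 0.07 for 7%) to a performance category."""
    if growth_rate > 0.05:      # grew more than 5%
        return "GROW"
    if growth_rate >= -0.05:    # flat, within [-5%, 5%]
        return "STABLE"
    return "LOSS"               # shrank more than 5%
```

Note the boundary convention: a growth rate of exactly 5% falls into the STABLE range [-5%, 5%].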
Potential Business Benefits
• Improved understanding of Agency Performance - at a micro level & macro level
– How is an individual agency likely to perform?
– How are all agencies in a state likely to perform?
• Optimized utilization of Agency Development Funds
A Bit About Domain…
What is Insurance?
• A risk management tool for the customer (individual or business), allowing them to transfer the risk of financial loss to the insurance company
• In exchange for a constant stream of premiums, insurance companies offer to pay consumers a sum of money upon the occurrence of a predetermined event, such as a natural catastrophe, a car crash, or death
• Broadly, from a business perspective, insurance is classified as Life or Non-Life (General)
Insurance type taxonomy (slide diagram):
• Life Insurance
• General Insurance: Property & Casualty, Medical, Motor Vehicle, Marine, Fire, and Homeowner’s Insurance
Data Preprocessing
What Data Was Provided By Azure?
• 213K+ observations with 49 dimensions
• Each observation represents yearly aggregated data for an agency, for a year, for a state, for a product
• Key attribute summary:
– 1624 agencies
– 11 years of time duration (2005-2015)
– 6 states
– 29 products
– 2 product lines
• No target class in the data!
Attribute Analysis
Each input attribute was assessed from 3 different angles:
• Business meaning: What does it mean?
• Domain Expertise Based Predictive Importance: Can it help in predicting agency performance?
• Sparsity: Does it have enough values?
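The sparsity angle is easy to automate. The original analysis was done in R; the pandas equivalent below is an illustrative sketch (the function name and the 90% cut-off in the comment are assumptions, not from the deck):

```python
import pandas as pd

def sparsity_report(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per attribute, worst first.
    Assumes sentinel codes such as "99999" / "Unknown" were already
    converted to NA at read time."""
    return df.isna().mean().sort_values(ascending=False)

# Attributes above some cut-off (say 90% missing) would be dropped:
# df = df.drop(columns=sparsity_report(df)[lambda s: s > 0.9].index)
```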
Data Preprocessing (Cont…)
Key Preprocessing Challenges:
1. Missing Values
– Identified and dropped highly sparse attributes
– Missing values encoded as "99999" or "Unknown" were converted to NA during file read in R
2. Unwanted Data
– Agencies appointed as late as 2014, for which a 2014 growth rate cannot be calculated, were removed
– The scope of analysis is Personal Line (PL) data; hence, Commercial Line (CL) data was filtered out
3. Unavailable Data
– New attributes were created for all Quantity and Revenue attributes to average them over the number of months for which data is available
4. Incomplete Data
– 2005 and 2015 data were removed, as data were available for only 8 and 5 months respectively
5. Repeating Data
– Agency-specific attributes were detached from the raw data, processed separately, and later merged back with the main data
6. Format of Data for Modelling
– All Quantity and Revenue attributes were aggregated by AGENCY_ID and YEAR
– Each important attribute was expanded with AGENCY_ID in rows and a year identifier in columns, e.g. the WrittenPremAmount column was converted to 2006_WrittenPremAmount, 2007_WrittenPremAmount, ...
7. No Target Class Present
– A lag variable was created for Written Premium Amount
– The growth rate for each agency for all years (2006-2014) was calculated
– Each agency was assigned a class label based on its 2014 growth rate:
• GROW := 2014 growth rate > 5%
• STABLE := 2014 growth rate in the range [-5%, 5%]
• LOSS := 2014 growth rate < -5%
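The target construction (aggregate by agency and year, pivot wide, lag, growth rate, label) can be sketched with pandas. The original pipeline was in R; only the AGENCY_ID, YEAR, and WrittenPremAmount names come from the slide, and the function itself is illustrative:

```python
import pandas as pd

def build_2014_target(df: pd.DataFrame) -> pd.Series:
    """Derive each agency's 2014 performance class from rows of
    (AGENCY_ID, YEAR, WrittenPremAmount)."""
    # 1. Aggregate premium by agency and year
    agg = df.groupby(["AGENCY_ID", "YEAR"])["WrittenPremAmount"].sum()
    # 2. Pivot wide: one row per agency, one column per year
    #    (2006_WrittenPremAmount, 2007_WrittenPremAmount, ... on the slide)
    wide = agg.unstack("YEAR")
    # 3. Growth rate against the lagged (previous-year) value
    growth = (wide[2014] - wide[2013]) / wide[2013]
    # 4. Label according to the slide's thresholds
    def label(g: float) -> str:
        if g > 0.05:
            return "GROW"
        if g >= -0.05:
            return "STABLE"
        return "LOSS"
    return growth.map(label)
```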
Modelling - Approach
• Important features were identified using the Boruta package (11 attributes dropped)
• As this is a classification problem, the following algorithms were used:
– CART
– C5.0
– Random Forest
– K Nearest Neighbours
– Artificial Neural Network
– Support Vector Machine
– GBM
– Ensemble-Stacking
• Many algorithms were tried on three flavours of data:
– ASIS Data
– ASIS Data + Range transformation
– ASIS Data + Range transformation + Important Features
• 10-fold cross-validation (repeated 3x-10x) was performed to get an initial best estimate of
hyperparameters ("caret" package)
• One or more rounds of grid search were used to fine-tune the hyperparameter values ("caret"
package)
• Cost-sensitive learning was used in CART and SVM
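The deck's tuning loop used R's caret; an analogous sketch with scikit-learn is shown below. The random forest, the grid values, and the class weights are illustrative assumptions — the weights merely stand in for the cost-sensitive learning mentioned for CART and SVM:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Repeated 10-fold CV, mirroring caret's trainControl(method = "repeatedcv")
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

# Cost-sensitive learning: misclassifying the minority GROW class
# is penalised more heavily than the other two classes (weights assumed)
grid = GridSearchCV(
    RandomForestClassifier(
        class_weight={"GROW": 3, "STABLE": 1, "LOSS": 1},
        random_state=42,
    ),
    param_grid={"max_features": [2, 4, 8], "n_estimators": [200, 500]},
    scoring="f1_macro",
    cv=cv,
)
# grid.fit(X_train, y_train); grid.best_params_
```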
Modelling – Evaluation Metric
• Interesting Insight:
– Only ~40% of the agencies achieved > 0% growth in 2014
– Of those, only ~50% grew > 5%. The same is reflected in the
2014 growth class distribution:
GROW: 21% | STABLE: 37% | LOSS: 42%
• Azure is interested in identifying agencies in the GROW class as accurately as
possible
• Model Evaluation Metric:
– Higher Recall For GROW class AND
– Optimal F1 to balance Recall-Precision tradeoff
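A sketch of how the GROW-class recall, precision, and F1 could be computed with scikit-learn (the helper's name is illustrative; the original work used R):

```python
from sklearn.metrics import precision_recall_fscore_support

def grow_metrics(y_true, y_pred):
    """Recall, precision and F1 for the GROW class only."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=["GROW"], zero_division=0
    )
    return {"recall": r[0], "precision": p[0], "f1": f1[0]}
```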
Modelling - Outcome
Modelling – Best Model Comparison
Best Model Vs. Baseline Model:
• In the absence of a model, or as a baseline, the best estimate of the 2014 Performance
Class is the mode of the 2014 Performance Class attribute.
• The baseline model would predict the "LOSS" class for all agencies since, at 42% of
observations, "LOSS" is the most frequent class
Model Metric            Baseline Model   Best Predictive Model
GROW Class – Recall     0                0.80
GROW Class – Precision  0                0.35
GROW Class – F1         NA               0.49
Overall Accuracy        41.78%           49.20%
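The baseline numbers follow directly from a majority-class (mode) predictor — a minimal sketch (the helper name is illustrative):

```python
from collections import Counter

def baseline_predict(y_train, n):
    """Majority-class baseline: always predict the mode of the
    training labels."""
    mode = Counter(y_train).most_common(1)[0][0]
    return [mode] * n

# With "LOSS" at ~42% of observations, the baseline predicts LOSS for
# every agency: overall accuracy ~42%, but GROW recall is exactly 0.
```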
Modelling – Model Interpretation & Key Challenges
Model Interpretation
• If an agency is actually going to grow > 5% in 2014:
– the Best Predictive Model correctly labels it "GROW" in 4/5 cases (recall 0.80)
• If the Best Predictive Model has labeled an agency "GROW":
– in 1/3 of cases the agency will actually grow > 5% in 2014 (precision 0.35)
– in 2/3 of cases the agency will actually be STABLE or LOSS in 2014
Key Challenges:
• The GROW class is a minority class: the class distribution is imbalanced and skewed toward the
"LOSS" class
• For the majority of algorithms, learning is biased toward predicting the LOSS class correctly - the
class Azure is least interested in
• The data has high variance; it is difficult to obtain test data that is truly representative of the
training data!
• There is not enough data to overcome the class imbalance and variance in the data