Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance
1. Applications of Unsupervised Learning
in Property and Casualty Insurance
with emphasis on fraud analysis
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial
Data Mining, Inc.
www.data-mines.com
Louise.francus@data-mines.com
2. Objectives
Review classic unsupervised learning
techniques
Introduce 2 new unsupervised
learning techniques
RandomForest
PRIDIT
Apply the techniques to insurance
data
Automobile Fraud data set
A publicly available automobile
insurance database
3. Motivation for Topic
New book: Predictive Modeling in
Actuarial Science
An introduction to predictive modeling for
actuaries and other insurance
professionals
Publisher: Cambridge University Press
Hope to Publish: Fall 2012
Chapter on Unsupervised Learning
Li Yang and Louise Francis
Li Yang – Variable grouping (PCA)
Louise Francis- record grouping
(clustering)
4. Book Project
Predictive Modeling 2 Volume Book Project
A joint project leading to a two volume pair of
books on Predictive Modeling in Actuarial Science.
Volume 1 would be on Theory and Methods and
Volume 2 would be on Property and Casualty
Applications.
The first volume will be introductory with basic
concepts and a wide range of techniques designed
to acquaint actuaries with this sector of problem
solving techniques. The second volume would be
a collection of applications to P&C problems,
written by authors who are well aware of the
advantages and disadvantages of the first volume
techniques but who can explore relevant
applications in detail with positive results.
5. The Fraud Study Data
• 1993 AIB closed PIP claims
• Dependent Variables
• Suspicion Score
• Expert assessment of likelihood of
fraud or abuse
• Predictor Variables
• Red flag indicators
• Claim file variables
Francis Analytics and Actuarial
6/26/2012 5
Data Mining, Inc.
6. The Fraud Problem
from: www.agentinsure.com
7. The Fraud Problem (2)
from Coalition Against Insurance Fraud
8. Fraud and Abuse
Planned fraud
Staged accidents
Abuse
Opportunistic
Exaggerate claim
9. The Fraud Red Flags
Binary variables that capture
characteristics of claims
associated with fraud and abuse
Accident variables (acc01 - acc19)
Injury variables (inj01 – inj12)
Claimant variables (clt01 – clt11)
Insured variables (ins01 – ins06)
Treatment variables (trt01 – trt09)
Lost wages variables (lw01 – lw07)
10. The Red Flag Variables
Red Flag Variables
Subject      Indicator Variable   Description
Accident     ACC01                No report by police officer at scene
             ACC04                Single vehicle accident
             ACC09                No plausible explanation for accident
             ACC10                Claimant in old, low valued vehicle
             ACC11                Rental vehicle involved in accident
             ACC14                Property damage was inconsistent with accident
             ACC15                Very minor impact collision
             ACC16                Claimant vehicle stopped short
             ACC19                Insured felt set up, denied fault
Claimant     CLT02                Had a history of previous claims
             CLT04                Was an out of state accident
             CLT07                Was one of three or more claimants in vehicle
Injury       INJ01                Injury consisted of strain or sprain only
             INJ02                No objective evidence of injury
             INJ03                Police report showed no injury or pain
             INJ05                No emergency treatment was given
             INJ06                Non-emergency treatment was delayed
             INJ11                Unusual injury for auto accident
Insured      INS01                Had history of previous claims
             INS03                Readily accepted fault for accident
             INS06                Was difficult to contact/uncooperative
             INS07                Accident occurred soon after effective date
Lost Wages   LW01                 Claimant worked for self or a family member
             LW03                 Claimant recently started employment
11. Dependent Variable
Problem
Insurance companies frequently do
not collect information as to
whether a claim is suspected of
fraud or abuse
Even when claims are referred for
special investigation
Solution: unsupervised learning
12. Supervised Learning
13. Dimension Reduction
          Frequency  Frequency  Frequency  PolicyCount      VehicleCount
ZipCode   BI         PD         Comb       NonBusinessUse   NonBusinessUse   SeverityBI   SeverityPD
90095     -          54.50      0.03       2.00             3.00                          1,973.50
93741     -          -          -          1.00             1.00
90015     22.65      43.93      0.04       1.00             2.00             10,181.16    2,442.36
90067     15.53      44.41      0.04       3.00             6.00             13,146.57    2,565.56
90004     26.71      48.45      0.04       11.00            17.00            8,538.56     2,354.08
14. The CAARP Data
This assigned risk automobile data was made
available to researchers in 2005 for the purpose of
studying the effect of a change in regulation on
territorial variables
The data contain exposure information (car counts,
premium) and claim and loss information (Bodily
Injury (BI) counts, BI ultimate losses, Property
Damage (PD) claim counts, PD ultimate losses)
Each record is a zip code
Good example of using unsupervised learning for
territory construction
15. R Cluster Library
The “cluster” library from R was used
Many of the functions in the library
are described in Kaufman and
Rousseeuw’s (1990) classic
book on clustering,
Finding Groups in Data.
16. Grouping Records
17. Dissimilarity
Euclidean distance: the square root of the
sum, over all variables, of the squared
differences between the values for a record
and the values for the record it is being
compared to.
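A minimal Python sketch of this dissimilarity measure follows. The deck itself uses R; the function names and the three toy claim records are hypothetical and serve only to illustrate the calculation.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two records."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(records):
    """Pairwise Euclidean distances for a list of records."""
    return [[euclidean_distance(r, s) for s in records] for r in records]

# Hypothetical records: three claims scored on four binary red flags.
claims = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]
dist = distance_matrix(claims)
print(dist[0][1], dist[0][2])  # claims 0 and 1 differ on one flag, 0 and 2 on all four
```

With binary red flags, the squared distance is simply the number of flags on which two claims disagree.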
18. RF Similarity
Varies between 0 and 1
Proximity matrix is an output of RF
After a tree is fit, all records are run through the model
If 2 records fall in the same terminal node, their
proximity is increased by 1
Proximities are divided by the number of trees, so
1 - proximity forms a distance
Can be used as an input to clustering and other
unsupervised learning procedures
See “Unsupervised Learning with Random
Forest Predictors” by Shi and Horvath
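The proximity calculation can be sketched in Python as below. This is an illustration only: it assumes the terminal-node assignments have already been exported from a fitted forest, and the `leaves` array is made-up data, not output from the fraud study.

```python
def proximity_matrix(leaf_ids):
    """Share of trees in which each pair of records lands in the same terminal node.

    leaf_ids[i][t] is the terminal-node id of record i in tree t.
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
            prox[i][j] = shared / n_trees  # normalize so values lie in [0, 1]
    return prox

# Hypothetical terminal-node ids for 3 records across 4 trees.
leaves = [
    [1, 4, 2, 3],
    [1, 4, 5, 3],
    [2, 6, 5, 7],
]
prox = proximity_matrix(leaves)

# 1 - proximity gives a dissimilarity usable by clustering or MDS.
dissim = [[1.0 - p for p in row] for row in prox]
```

Records 0 and 1 share a terminal node in three of the four trees, so their proximity is 0.75 and their dissimilarity 0.25.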
19. Clustering
Hierarchical clustering
K-Means clustering
This analysis uses k-means
20. K-means Clustering
An iterative procedure is used to assign
each record in the data to one of the k
clusters. The number of clusters, k, is
specified by the user.
The iteration begins with the initial centers
or medoids for the k groups.
The procedure uses a dissimilarity measure
to assign records to a group and to iterate
to a final grouping.
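A bare-bones Python sketch of the iteration follows, assuming Euclidean distance as the dissimilarity and cluster means as centers (a medoid-based variant would instead pick an actual record as each center). The points and starting centers are hypothetical.

```python
import math

def kmeans(records, centers, max_iter=100):
    """Assign records to the nearest center, recompute centers, repeat until stable."""
    centers = [list(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(r, centers[k]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the grouping is final
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [r for r, a in zip(records, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical two-variable records forming two obvious groups; k = 2.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centers, which is why implementations commonly try several starts.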
21. R Cluster Output
22. Cluster Plot
23. Silhouette Plot
30. RF Ranking of the “Predictors”: Top 10 of 44
Variable   MeanDecreaseGini   Description
acc10      10.50              Claimant in old low value vehicle
trt01      9.05               large # visits to chiro
inj01      8.64               strain or sprain
inj02      8.64               readily accepted fault
inj05      8.62               no emergency treatment given for injury
acc01      8.55               no police report
clt07      7.47               one of 3 or more claimants in vehicle
inj06      7.44               non emergency trt delayed
acc15      7.36               very minor collision
trt03      6.82               large # visits to PT
31. Problem: Categorical Variables
It is not clear how to best perform
Principal Components/Factor
Analysis on categorical variables
The categories may be coded as a
series of binary dummy variables
If the categories are ordered
categories, you may lose
important information
This is the problem that PRIDIT
addresses
32. RIDIT
Variables are ordered so that
lowest value is associated with
highest probability of fraud
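Use the cumulative distribution of
claims at each value, i, to create the
RIDIT statistic for variable t, value i:
\[ R_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj} \]
where \hat{p}_{tj} is the observed proportion of
claims with value j on variable t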
33. Example: RIDIT for Legal Representation
Legal Representation
Value   Code   Number   Proportion   Proportion Below   Proportion Above   RIDIT
Yes     1      706      0.504        0.000              0.496              -0.496
No      2      694      0.496        0.504              0.000              0.504
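The Legal Representation figures can be reproduced with a short Python sketch. It takes the RIDIT as the proportion of claims below a value minus the proportion above it, which is what the table shows; the function name is hypothetical.

```python
def ridit_scores(counts):
    """RIDIT for each ordered value: proportion below minus proportion above.

    counts[i] is the number of claims taking the i-th ordered value.
    """
    total = sum(counts)
    props = [c / total for c in counts]
    scores = []
    for i in range(len(props)):
        below = sum(props[:i])
        above = sum(props[i + 1:])
        scores.append(below - above)
    return scores

# Counts from the Legal Representation example: Yes = 706, No = 694.
legal_rep = ridit_scores([706, 694])
print([round(s, 3) for s in legal_rep])  # [-0.496, 0.504]
```

Weighted by the proportion of claims at each value, the scores average to zero, so each RIDIT variable is centered by construction.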
34. PRIDIT
Use RIDIT statistics in Principal
Components Analysis
Component Matrix (a)
                 Component 1
SIU              .248
Police Report    .220
At Fault         .709
Legal Rep        .752
Medical Audit    .341
Prior Claim      .406
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
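The idea of feeding RIDIT scores into a principal components analysis can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used to produce the component matrix above: it extracts the first component of the correlation matrix by power iteration, returns unit-length weights (which may be scaled differently from the loadings a statistical package reports), and runs on hypothetical RIDIT scores for three binary flags.

```python
import math

def first_component(columns, n_iter=500):
    """First principal component of the correlation matrix, by power iteration.

    columns[j][r] is the RIDIT score of claim r on variable j.
    Returns (unit-length weights, one component score per claim).
    """
    n = len(columns[0])
    std_cols = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        std_cols.append([(x - mean) / sd for x in col])
    m = len(std_cols)
    corr = [[sum(a * b for a, b in zip(std_cols[i], std_cols[j])) / (n - 1)
             for j in range(m)] for i in range(m)]
    w = [1.0] * m
    for _ in range(n_iter):
        # Multiply by the correlation matrix, then rescale to unit length.
        w = [sum(corr[i][j] * w[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    scores = [sum(w[j] * std_cols[j][r] for j in range(m)) for r in range(n)]
    return w, scores

# Hypothetical RIDIT scores for six claims on three binary red flags.
ridits = [
    [-0.5, -0.5, -0.5, 0.5, 0.5, 0.5],
    [-0.5, -0.5, 0.5, -0.5, 0.5, 0.5],
    [-0.5, 0.5, -0.5, -0.5, 0.5, 0.5],
]
weights, scores = first_component(ridits)
```

In this toy data every pair of flags has the same correlation, so the three weights come out equal. The claim with all flags at their lowest RIDIT value receives the lowest score and the claim with all flags at their highest value receives the highest.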
35. PRIDITs of Accident Flags
36. Fit Tree with PRIDITs for Each Type of Flag
37. Importance Ranking of PRIDITs
38. Importance Ranking of Factors
39. Add RF and Euclid Clusters to PRIDIT Factors
40. Use Salford RF MDS
Top variable in importance (acc10)
used as binary dependent variable
Run random forest with 1,000 trees
Output proximities and MDS
Use MDS scales as inputs to clustering
(k=3)
Run tree to get importance
ranking
41. MDS Graph
42. Rank of cluster procedures to Tree Prediction
43. Labeling Clusters
44. Relation Between PRIDIT Factor and Suspicion
45. Next Steps
Add claim file variables
Rerun clusters
Rerun PRIDITS
Do Random Forest proximities on
the RIDITS
Apply the procedures to other
fraud databases
46. PRIDIT REFERENCES
Ai, J., Brockett, Patrick L., and Golden, Linda L., (2009), Assessing Consumer
Fraud Risk in Insurance Claims with Discrete and Continuous Data,
North American Actuarial Journal, 13:438-458.
Brockett, Patrick L., Derrig, Richard A., Golden, Linda L., Levine, Albert and
Alpert, Mark, (2002), Fraud Classification Using Principal Component
Analysis of RIDITs, Journal of Risk and Insurance, 69:3, 341-373.
Brockett, Patrick L., Xia, Xiaohua and Derrig, Richard A., (1998), Using
Kohonen’s Self-Organizing Feature Map to Uncover Automobile Bodily
Injury Claims Fraud, Journal of Risk and Insurance, 65:245-274.
Bross, Irwin D.J., (1958), How To Use RIDIT Analysis, Biometrics,
14:18-38.
Chipman, H., George, E.I. and McCulloch, R.E., (2006), Bayesian Ensemble
Learning, Neural Information Processing Systems.
Lieberthal, Robert D., (2008), Hospital Quality: A PRIDIT Approach, Health
Services Research, 43:3, 988-1005.
47. Questions?