Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance
Applications of Unsupervised Learning in Property and Casualty Insurance, with emphasis on fraud analysis Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.email@example.com
Objectives
• Review classic unsupervised learning techniques
• Introduce 2 new unsupervised learning techniques
  • RandomForest
  • PRIDIT
• Apply the techniques to insurance data
  • Automobile fraud data set
  • A publicly available automobile insurance database
Motivation for Topic New book: Predictive Modeling in Actuarial Science An introduction to predictive modeling for actuaries and other insurance professionals Publisher: Cambridge University Press Hope to Publish: Fall 2012 Chapter on Unsupervised Learning Li Yang and Louise Francis Li Yang – Variable grouping (PCA) Louise Francis- record grouping (clustering)
Book Project Predictive Modeling 2 Volume Book Project A joint project leading to a two volume pair of books on Predictive Modeling in Actuarial Science. Volume 1 would be on Theory and Methods and Volume 2 would be on Property and Casualty Applications. The first volume will be introductory with basic concepts and a wide range of techniques designed to acquaint actuaries with this sector of problem solving techniques. The second volume would be a collection of applications to P&C problems, written by authors who are well aware of the advantages and disadvantages of the first volume techniques but who can explore relevant applications in detail with positive results.
The Fraud Study Data
• 1993 AIB closed PIP claims
• Dependent Variables
  • Suspicion Score
  • Expert assessment of likelihood of fraud or abuse
• Predictor Variables
  • Red flag indicators
  • Claim file variables
Francis Analytics and Actuarial Data Mining, Inc., 6/26/2012
The Fraud Problem from: www.agentinsure.com
The Fraud Problem (2) from Coalition Against Insurance Fraud
Fraud and Abuse
• Planned fraud
  • Staged accidents
• Abuse
  • Opportunistic
  • Exaggerate claim
The Fraud Red Flags Binary variables that capture characteristics of claims associated with fraud and abuse Accident variables (acc01 - acc19) Injury variables (inj01 – inj12) Claimant variables (ch01 – ch11) Insured variables (ins01 – ins06) Treatment variables (trt01 – trt09) Lost wages variables (lw01 – lw07)
The Red Flag Variables
Subject      Indicator Variable  Description
Accident     ACC01               No report by police officer at scene
             ACC04               Single vehicle accident
             ACC09               No plausible explanation for accident
             ACC10               Claimant in old, low valued vehicle
             ACC11               Rental vehicle involved in accident
             ACC14               Property damage was inconsistent with accident
             ACC15               Very minor impact collision
             ACC16               Claimant vehicle stopped short
             ACC19               Insured felt set up, denied fault
Claimant     CLT02               Had a history of previous claims
             CLT04               Was an out of state accident
             CLT07               Was one of three or more claimants in vehicle
Injury       INJ01               Injury consisted of strain or sprain only
             INJ02               No objective evidence of injury
             INJ03               Police report showed no injury or pain
             INJ05               No emergency treatment was given
             INJ06               Non-emergency treatment was delayed
             INJ11               Unusual injury for auto accident
Insured      INS01               Had history of previous claims
             INS03               Readily accepted fault for accident
             INS06               Was difficult to contact/uncooperative
             INS07               Accident occurred soon after effective date
Lost Wages   LW01                Claimant worked for self or a family member
             LW03                Claimant recently started employment
Dependent Variable Problem
• Insurance companies frequently do not collect information as to whether a claim is suspected of fraud or abuse
  • Even when claims are referred for special investigation
• Solution: unsupervised learning
Supervised Learning
Dimension Reduction
ZipCode  FrequencyBI  FrequencyPD  FrequencyComb  PolicyCountNonBusinessUse  VehicleCountNonBusinessUse  SeverityBI  SeverityPD
90095    -            54.50        0.03           2.00                       3.00                                    1,973.50
93741    -            -            -              1.00                       1.00
90015    22.65        43.93        0.04           1.00                       2.00                       10,181.16   2,442.36
90067    15.53        44.41        0.04           3.00                       6.00                       13,146.57   2,565.56
90004    26.71        48.45        0.04           11.00                      17.00                      8,538.56    2,354.08
The CAARP Data
• This assigned risk automobile data was made available to researchers in 2005 for the purpose of studying the effect of a change in regulation on territorial variables
• The data contain exposure information (car counts, premium) and claim and loss information (Bodily Injury (BI) counts, BI ultimate losses, Property Damage (PD) claim counts, PD ultimate losses)
• Each record is a zip code
• Good example of using unsupervised learning for territory construction
R Cluster Library
• The “cluster” library from R is used
• Many of the functions in the library are described in Kaufman and Rousseeuw’s (1990) classic book on clustering, Finding Groups in Data
Grouping Records
Dissimilarity
• Euclidean distance: the square root of the sum, over all variables, of the squared difference between a record's value and the value for the record it is being compared to
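As a rough illustration of the Euclidean dissimilarity between records, here is a minimal Python sketch. The slides use R's cluster library, so this is a stand-in and not the code behind the analysis; the function names and the three example claims are hypothetical.

```python
import math


def euclidean_distance(record_a, record_b):
    """Square root of the summed squared differences across variables."""
    if len(record_a) != len(record_b):
        raise ValueError("records must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(record_a, record_b)))


def dissimilarity_matrix(records):
    """Symmetric matrix of pairwise Euclidean distances between records."""
    n = len(records)
    return [[euclidean_distance(records[i], records[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical red-flag indicators (1 = flag present) for three claims.
claims = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]
distances = dissimilarity_matrix(claims)
```

With binary red flags the squared distance is simply the number of flags on which two claims disagree, so the first two claims (one disagreement) are closer than the first and third (four disagreements).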
RF Similarity
• Varies between 0 and 1 (counts are normalized by the number of trees)
• Proximity matrix is an output of RF
• After a tree is fit, all records are run through the model
• If 2 records are in the same terminal node, their proximity is increased by 1
• 1 - proximity forms a distance
• Can be used as an input to clustering and other unsupervised learning procedures
• See “Unsupervised Learning with Random Forest Predictors” by Shi and Horvath
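The proximity calculation can be sketched in a few lines of Python, assuming the terminal-node assignment of every record in every tree is already available. The `leaf_ids` input and both function names are hypothetical; a real Random Forest implementation produces the proximity matrix itself, and this sketch only shows the counting and normalizing step.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][r] is the terminal node of record r in tree t."""
    n_trees = len(leaf_ids)
    n_records = len(leaf_ids[0])
    counts = [[0.0] * n_records for _ in range(n_records)]
    for tree in leaf_ids:
        for i in range(n_records):
            for j in range(n_records):
                # Two records sharing a terminal node gain one unit of proximity.
                if tree[i] == tree[j]:
                    counts[i][j] += 1.0
    # Divide by the number of trees so proximities lie between 0 and 1.
    return [[c / n_trees for c in row] for row in counts]


def proximity_to_distance(prox):
    """Distance is 1 - proximity, usable as input to clustering or MDS."""
    return [[1.0 - p for p in row] for row in prox]


# Hypothetical terminal-node assignments: 4 trees, 3 records.
leaf_ids = [
    [1, 1, 2],
    [3, 3, 3],
    [5, 6, 6],
    [2, 2, 4],
]
prox = proximity_matrix(leaf_ids)
dist = proximity_to_distance(prox)
```

Records 0 and 1 share a terminal node in three of the four trees, so their proximity is 0.75 and their distance is 0.25.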
Clustering
• Hierarchical clustering
• K-Means clustering
• This analysis uses k-means
K-means Clustering
• The number of clusters, k, is specified by the user
• An iterative procedure is used to assign each record in the data to one of the k clusters
• The iteration begins with the initial centers or medoids for the k groups
• The procedure uses a dissimilarity measure to assign records to a group and to iterate to a final grouping
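The iteration described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the R cluster library routine used in the analysis: it uses cluster means as centers (not medoids), Euclidean distance as the dissimilarity, and initial centers supplied by the caller. The function names and example points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(records, initial_centers, max_iter=100):
    """Return (assignments, centers) after iterating to a stable grouping."""
    centers = [list(c) for c in initial_centers]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each record joins the nearest center.
        new_assignments = [
            min(range(len(centers)), key=lambda k: euclidean(rec, centers[k]))
            for rec in records
        ]
        if new_assignments == assignments:
            break  # no record changed cluster, so the grouping is final
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [rec for rec, a in zip(records, assignments) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers


# Hypothetical two-variable records forming two well-separated groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
assignments, centers = k_means(points, [points[0], points[3]])
```

Because the result depends on the starting centers, practical implementations typically rerun the procedure from several starts; this sketch takes the starting centers as an explicit argument to keep the example deterministic.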
R Cluster Output
Cluster Plot
Silhouette Plot
Testing using Expert Scores: Fit a Tree to Suspicion Score for Importance Ranking
Importance Ranking of the Clusters
Fit Tree to Binary Fraud Indicator
Importance Ranking (2)
RF Ranking of the “Predictors”: Top 10 of 44
Variable  MeanDecreaseGini  Description
acc10     10.50             Claimant in old, low value vehicle
trt01      9.05             Large # visits to chiro
inj01      8.64             Strain or sprain
inj02      8.64             Readily accepted fault
inj05      8.62             Non emergency treatment given for injury
acc01      8.55             No police report
clt07      7.47             One of 3 or more claimants in vehicle
inj06      7.44             Non emergency trt delayed
acc15      7.36             Very minor collision
trt03      6.82             Large # visits to PT
Problem: Categorical Variables
• It is not clear how best to perform Principal Components/Factor Analysis on categorical variables
• The categories may be coded as a series of binary dummy variables
• If the categories are ordered categories, you may lose important information
• This is the problem that PRIDIT addresses
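RIDIT
• Variables are ordered so that the lowest value is associated with the highest probability of fraud
• Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for variable t, value i:
$$R_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj}$$
where $\hat{p}_{tj}$ is the observed proportion of claims with value j on variable t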
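A small Python sketch of the RIDIT statistic follows: for each value of an ordered variable, the score is the proportion of claims at lower values minus the proportion at higher values. The function name and the example flag are hypothetical, and the sketch assumes the values are already coded so that they sort in the intended order.

```python
from collections import Counter


def ridit_scores(values):
    """Return {value: RIDIT score} for one ordered categorical variable."""
    n = len(values)
    counts = Counter(values)
    scores = {}
    below = 0  # number of claims at values lower than the current one
    for value in sorted(counts):
        above = n - below - counts[value]  # claims at higher values
        scores[value] = (below - above) / n
        below += counts[value]
    return scores


# Hypothetical binary flag, coded so the lowest value (0) is most suspicious;
# 2 of the 10 claims carry the suspicious value.
flag = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
scores = ridit_scores(flag)
claim_scores = [scores[v] for v in flag]
```

Here the rare suspicious value receives a score of -0.8 and the common value 0.2, and the scores average to zero across claims. Rarer suspicious responses are pushed further from zero, which is the ordering information that plain dummy coding discards.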
PRIDIT
• Use RIDIT statistics in Principal Components Analysis
Component Matrix (a)
Variable        Component 1
SIU             .248
Police Report   .220
At Fault        .709
Legal Rep       .752
Medical Audit   .341
Prior Claim     .406
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
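To make the PRIDIT step concrete, the sketch below runs a first-principal-component extraction on RIDIT-scored variables using only the Python standard library. It is an illustration under assumptions, not the software behind the component matrix above: it works on the correlation matrix, finds the leading eigenvector by power iteration, and reports loadings as eigenvector times the square root of the eigenvalue, which is the usual convention for a component matrix. All names and the example RIDIT scores are hypothetical.

```python
import math


def standardize(columns):
    """Center each variable and scale it to unit sample standard deviation."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        result.append([(x - mean) / sd for x in col])
    return result


def correlation_matrix(z_columns):
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]


def first_component(matrix, iterations=500):
    """Leading eigenvector (unit length) and eigenvalue by power iteration."""
    m = len(matrix)
    vec = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(m))
                     for i in range(m))
    return vec, eigenvalue


# Hypothetical RIDIT scores: one list per variable, one entry per claim.
ridit_columns = [
    [-0.8, -0.8, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
    [-0.7, -0.7, -0.7, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
    [-0.6, 0.4, -0.6, -0.6, 0.4, 0.4, 0.4, -0.6, 0.4, 0.4],
]
z = standardize(ridit_columns)
corr = correlation_matrix(z)
weights, eigenvalue = first_component(corr)
loadings = [w * math.sqrt(eigenvalue) for w in weights]
# One PRIDIT score per claim: the weighted sum of its standardized RIDITs.
pridit_scores = [sum(w * col[c] for w, col in zip(weights, z))
                 for c in range(len(z[0]))]
```

In this toy data the first claim carries the suspicious value on all three variables, so it receives the lowest PRIDIT score; lower scores point toward higher suspicion because the variables were ordered with the suspicious value lowest.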
PRIDITS of Accident Flags
Fit Tree with PRIDITS for Each Type of Flag
Importance Ranking of Pridits
Importance Ranking of Factors
Add RF and Euclid Clusters to PRIDIT Factors
Use Salford RF MDS
• Top variable in importance (acc10) used as binary dependent
• Run forest with 1,000 trees
• Output proximities and MDS
• Use MDS scales as inputs to cluster (k=3)
• Run tree to get importance ranking
MDS Graph
Rank of cluster procedures to Tree Prediction
Labeling Clusters
Relation Between PRIDIT Factor and Suspicion
Next Steps
• Add claim file variables
• Rerun clusters
• Rerun PRIDITS
• Do Random Forest proximities on the RIDITS
• Apply the procedures to other fraud databases
PRIDIT References
Ai, J., Brockett, Patrick L., and Golden, Linda L., (2009), Assessing Consumer Fraud Risk in Insurance Claims with Discrete and Continuous Data, North American Actuarial Journal, 13, 438-458.
Brockett, Patrick L., Derrig, Richard A., Golden, Linda L., Levine, Albert and Alpert, Mark, (2002), Fraud Classification Using Principal Component Analysis of RIDITs, Journal of Risk and Insurance, 69:3, 341-373.
Brockett, Patrick L., Xia, Xiaohua and Derrig, Richard A., (1998), Using Kohonen's Self-Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud, Journal of Risk and Insurance, 65, 245-274.
Bross, Irwin D.J., (1958), How To Use RIDIT Analysis, Biometrics, 14, 18-38.
Chipman, H., George, E.I. and McCulloch, R.E., (2006), Bayesian Ensemble Learning, Neural Information Processing Systems.
Lieberthal, Robert D., (2008), Hospital Quality: A PRIDIT Approach, Health Services Research, 43:3, 988-1005.
Questions?