  1. Predictive Modeling
     CAS Reinsurance Seminar, May 7, 2007
     Louise Francis, FCAS, MAAA, [email_address]
     Francis Analytics and Actuarial Data Mining, Inc.
  2. Why Predictive Modeling?
     - Better use of data than traditional methods
     - Advanced methods for dealing with messy data are now available
  3. Data Mining Goes Prime Time
  4. Becoming a Popular Tool in All Industries
  5. Real-Life Insurance Application – The “Boris Gang”
  6. The Predictive Modeling Family
     - Classical linear models
     - GLMs
     - Data mining
  7. A Casualty Actuary’s Perspective on Data Modeling
     - The Stone Age: 1914–...
       - Simple deterministic methods
         - Blunt instruments: the analytical analog of bows and arrows
       - Often ad hoc
       - Slice-and-dice data
       - Based on empirical data; little use of parametric models
     - The Pre-Industrial Age: 1970–...
       - Fit probability distributions to model tails
       - Simulation models and numerical methods for variability and uncertainty analysis
       - Focus on underwriting, not claims
     - The Industrial Age: 1985–...
       - Begin to use computer catastrophe models
     - The 20th Century: 1990–...
       - European actuaries begin to use GLMs
     - The Computer Age: 1996–...
       - Data mining begins to be discussed at conferences
       - At the end of the 20th century, large consulting firms start to build data mining practices
     - The Current Era: a mixture of the above
       - In personal lines, modeling is the rule rather than the exception
         - Often GLM based, though GLMs are evolving into GAMs
       - Commercial lines are beginning to embrace modeling
  8. Data Quality: A Data Mining Problem
     - An actuary reviewing a database
  9. CHAID: The First CAS Data Mining Paper, 1990
  10. A Problem: Nonlinear Functions
      An insurance example: provider bill vs. probability of an independent medical exam
  11. Classical Statistics: Regression
      - Parameter estimation: fit the line that minimizes the sum of squared deviations between actual and fitted values
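The least-squares fit just described can be sketched in a few lines (illustrative only, not from the slides): choose the intercept a and slope b that minimize the sum of squared deviations, using the closed-form normal-equation solution.

```python
# Minimal least-squares regression: fit y = a + b*x by minimizing the
# sum of squared deviations between actual and fitted values.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = covariance(x, y) / variance(x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx          # the fitted line passes through the mean point
    return a, b
```

For points lying exactly on a line, such as (1, 2), (2, 4), (3, 6), (4, 8), the fit recovers that line exactly.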
  12. Generalized Linear Models (GLMs)
      - Relax the normality assumption
        - Exponential family of distributions
      - Model some kinds of nonlinearity
  13. Generalized Linear Models: Common Links
      - The identity link: h(Y) = Y
      - The log link: h(Y) = ln(Y)
      - The inverse link: h(Y) = 1/Y
      - The logit link: h(Y) = ln(Y / (1 − Y))
      - The probit link: h(Y) = Φ⁻¹(Y), the inverse standard normal CDF
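The common GLM links can be written out as plain functions (a sketch; the probit uses the inverse standard normal CDF, available in Python's standard library as `statistics.NormalDist().inv_cdf`).

```python
import math
from statistics import NormalDist

# The five common GLM link functions h(Y).
def identity(y):
    return y

def log_link(y):
    return math.log(y)               # ln(Y)

def inverse_link(y):
    return 1.0 / y                   # 1/Y

def logit(y):
    return math.log(y / (1.0 - y))   # ln(Y / (1 - Y))

def probit(y):
    return NormalDist().inv_cdf(y)   # inverse standard normal CDF
```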
  14. Major Kinds of Data Mining
      - Supervised learning
        - The most common situation
        - Has a dependent variable, e.g.:
          - Frequency
          - Loss ratio
          - Fraud / no fraud
        - Some methods:
          - Regression
          - CART
          - Some neural networks
      - Unsupervised learning
        - No dependent variable
        - Groups like records together
          - A group of claims with similar characteristics might be more likely to be fraudulent
          - Examples: territory assignment, text mining
        - Some methods:
          - Association rules
          - K-means clustering
          - Kohonen neural networks
  15. Desirable Features of a Data Mining Method
      - Can approximate any nonlinear relationship
      - Works when the form of the nonlinearity is unknown
      - The effect of interactions can be easily determined and incorporated into the model
      - Generalizes well on out-of-sample data
  16. The Fraud Surrogates Used as Dependent Variables
      - Independent Medical Exam (IME) requested (and IME successful)
      - Special Investigation Unit (SIU) referral (and SIU successful)
      - Data: detailed auto injury claim database for Massachusetts, accident years 1995–1997
  17. Predictor Variables
      - Claim file variables
        - Provider bill, provider type
        - Injury
      - Derived from claim file variables
        - Attorneys per zip code
        - Doctors per zip code
      - From external data
        - Average household income
        - Households per zip code
  18. Different Kinds of Decision Trees
      - Single trees (CART, CHAID)
      - Ensemble trees, a more recent development (TreeNet, Random Forest)
        - A composite or weighted average of many trees (perhaps 100 or more)
  19. Non-Tree Methods
      - MARS – Multivariate Adaptive Regression Splines
      - Neural networks
      - Naïve Bayes (baseline)
      - Logistic regression (baseline)
  20. Illustrations
      - Dependent variables
        - Paid auto BI losses
        - Fraud surrogates: IME requested; SIU requested
      - Predictors
        - Provider 2 Bill and other variables
        - Known to have nonlinear relationships with the dependents
      - The following illustrations show how some of the techniques model nonlinearity using these predictor and dependent variables
  21. Classification and Regression Trees (CART)
      - Tree splits are binary
      - If the dependent variable is numeric, the split is based on R² or the sum (or mean) of squared errors
        - For each variable, choose the two-way split of the data that reduces the MSE the most
        - Do this for all independent variables
        - Choose the variable that reduces the squared errors the most
      - When the dependent variable is categorical, other goodness-of-fit measures (Gini index, deviance) are used
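The split search just described can be sketched as follows (illustrative, not the slides' software): for one numeric predictor, evaluate every candidate cutpoint and keep the one that most reduces the sum of squared errors around the node means.

```python
# Sum of squared errors of a group of values around its own mean.
def sse(ys):
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

# Find the binary split of a numeric predictor that reduces SSE the most.
def best_split(x, y):
    best_cut, best_sse = None, sse(y)        # start from the unsplit node
    for cut in sorted(set(x))[:-1]:          # the largest value cannot split
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_cut, best_sse = cut, total
    return best_cut, best_sse
```

Repeating this over all independent variables, and then recursively within each child node, grows the tree.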
  22. CART – Example of the First Split on Provider 2 Bill, with Paid as Dependent
      - For the entire database, the total squared deviation of paid losses around the predicted value (i.e., the mean) is 4.95 × 10^13. The SSE declines to 4.66 × 10^13 after the data are partitioned using $5,021 as the cutpoint.
      - Any other partition of the provider bill produces a larger SSE than 4.66 × 10^13. For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 × 10^13.
  23. Continue Splitting to Get More Homogeneous Groups at Terminal Nodes
  24. CART
  25. Ensemble Trees: Fit More Than One Tree
      - Fit a series of trees
      - Each added tree improves the fit of the model
      - Average or sum the results of the fits
      - There are many methods to fit the trees and prevent overfitting
        - Boosting: Iminer Ensemble and TreeNet
        - Bagging: Random Forest
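The bagging idea named above can be sketched with a toy ensemble (illustrative only; the cutpoint, stump form, and data are hypothetical, not the slides' software): each "tree" is a depth-one stump fit on a bootstrap resample, and the ensemble prediction is the average over all the stumps.

```python
import random

# A depth-one "tree": predict the mean of the training values on the
# same side of a fixed cutpoint as the query x.
def stump_predict(xs, ys, cut, x):
    side = [y for xi, y in zip(xs, ys) if (xi <= cut) == (x <= cut)]
    # fall back to the overall mean if the resample left this side empty
    return sum(side) / len(side) if side else sum(ys) / len(ys)

# Bagging: average many stumps, each fit on a bootstrap resample.
def bagged_predict(x_train, y_train, cut, x, n_trees=100, seed=0):
    rng = random.Random(seed)
    n = len(x_train)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap resample
        preds.append(stump_predict([x_train[i] for i in idx],
                                   [y_train[i] for i in idx], cut, x))
    return sum(preds) / n_trees                       # average the trees
```

Boosting differs in that each successive tree is fit to what the previous trees missed, and the results are summed with weights rather than simply averaged.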
  26. Create a Sequence of Predictions
      - First tree prediction
      - Second (weighted) tree prediction
  27. TreeNet Prediction of IME Requested
  28. Random Forest: IME vs. Provider 2 Bill
  29. Neural Networks
  30. Classical Statistics: Regression
      - Linear regression is one of the most common methods of fitting a function
      - Models the relationship between two variables by fitting a straight line through the points
      - A lot of infrastructure is available for assessing the appropriateness of the model
  31. Neural Networks
      - Also minimize the squared deviation between fitted and actual values
      - Can be viewed as nonparametric, nonlinear regression
  32. Hidden Layer of a Neural Network (Input Transfer Function)
  33. The Activation Function (Transfer Function)
      - The sigmoid (logistic) function
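The sigmoid (logistic) activation can be written out directly, with a one-node hidden layer to show how it squashes a weighted input into (0, 1). The weight and bias values here are arbitrary illustrations, not fitted parameters.

```python
import math

# The sigmoid logistic function: f(z) = 1 / (1 + exp(-z)).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One hidden node: pass the weighted input through the activation.
# w and b are illustrative values, not from any fitted network.
def hidden_node(x, w=0.8, b=-0.2):
    return sigmoid(w * x + b)
```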
  34. Neural Network: Provider 2 Bill vs. IME Requested
  35. MARS: Provider 2 Bill vs. IME Requested
  36. How MARS Fits a Nonlinear Function
      - MARS fits a piecewise regression
        - BF1 = max(0, X - 1,401.00)
        - BF2 = max(0, 1,401.00 - X)
        - BF3 = max(0, X - 70.00)
        - Y = 0.336 + .145626E-03 * BF1 - .199072E-03 * BF2 - .145947E-03 * BF3
        - BF1, BF2, and BF3 are basis functions
      - MARS uses statistical optimization to find the best basis function(s)
      - A basis function is similar to a dummy variable in regression: like a combination of a dummy indicator and a linear independent variable
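The piecewise-linear fit quoted above can be transcribed directly from the slide's basis functions and coefficients; each basis function is a hinge of the form max(0, ±(X − knot)).

```python
# The MARS fit from the slide, transcribed verbatim: three hinge basis
# functions with knots at 1,401 and 70, combined linearly.
def mars_predict(x):
    bf1 = max(0.0, x - 1401.00)
    bf2 = max(0.0, 1401.00 - x)
    bf3 = max(0.0, x - 70.00)
    return (0.336 + 0.145626e-3 * bf1
                  - 0.199072e-3 * bf2
                  - 0.145947e-3 * bf3)
```

Because every hinge is zero at its knot, the fitted function is continuous, with the slope changing at each knot.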
  37. The Basis Functions: Category Grouping and Interactions
      - Injury type 4 (neck sprain) and type 5 (back sprain) increase faster and have higher scores than the other injury types
        - BF1 = max(0, 2185 - X)
        - BF2 = (INJTYPE = 4 OR INJTYPE = 5)
        - BF3 = max(0, X - 159) * BF2
        - Y = 2.815 - 0.001 * BF1 + 0.685 * BF2 + .360E-03 * BF3
          where X is the provider bill and INJTYPE is the injury type
  38. Interactions: MARS Fit
  39. Baseline Method: Naïve Bayes Classifier
      - Naïve Bayes assumes the features (predictor variables) are independent conditional on each category
      - The probability that an observation X will have a specific set of values for the independent variables is the product of the conditional probabilities of observing each of the values given target category c_j, j = 1 to m (m is typically 2)
  40. Naïve Bayes Formula
      P(c_j | X) ∝ P(c_j) · Π_i P(x_i | c_j), where the denominator P(X) is a constant across categories
  41. Naïve Bayes (cont.)
      - Handles only classification problems (categorical dependents)
      - Handles only categorical independents
      - Create groupings for numeric variables such as provider bill and age
        - We split provider bill and age into quintiles (5 groups)
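A minimal naïve Bayes sketch with categorical predictors follows (the example data and feature values below are hypothetical, not the study's database): each class c is scored as P(c) times the product of P(feature value | c), and the constant denominator is dropped.

```python
from collections import defaultdict

# Count class frequencies and per-class feature-value frequencies.
def train(rows, labels):
    counts = defaultdict(int)             # counts[class]
    feature_counts = defaultdict(int)     # feature_counts[(class, i, value)]
    for row, c in zip(rows, labels):
        counts[c] += 1
        for i, v in enumerate(row):
            feature_counts[(c, i, v)] += 1
    return counts, feature_counts, len(labels)

# Score each class as prior * product of conditionals; pick the maximum.
def predict(model, row):
    counts, feature_counts, n = model
    def score(c):
        s = counts[c] / n                 # prior P(c)
        for i, v in enumerate(row):
            # conditional P(value | c), with add-one smoothing
            s *= (feature_counts[(c, i, v)] + 1) / (counts[c] + 2)
        return s
    return max(counts, key=score)
```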
  42. Naïve Bayes: Probability of Independent Variable Given No IME / IME
  43. Putting It Together to Make a Prediction
      - Plug in specific values for Provider 2 Bill and provider type to get a prediction
      - Chain together the product: the bill-group probability, times the provider-type probability, times the probability of IME
  44. Advantages/Disadvantages
      - Computationally efficient
      - Has performed well under many circumstances
      - The assumption of conditional independence often does not hold
      - Can’t be used directly for numeric variables
  45. Naïve Bayes Predicted IME vs. Provider 2 Bill
  46. True/False Positives and True/False Negatives (Type I and Type II Errors): The “Confusion” Matrix
      - Choose a “cut point” in the model score
      - Classify claims above the cut point as “yes”
  47. ROC Curves and Area Under the ROC Curve
      - Want good performance on both sensitivity and specificity
      - Sensitivity and specificity depend on the cut point chosen
        - Choose a series of different cut points and compute sensitivity and specificity for each
        - Graph the results
          - Plot sensitivity vs. 1 - specificity
          - Compute an overall measure of “lift”: the area under the curve
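The ROC construction just described can be sketched directly (illustrative): sweep the cut point over the observed model scores, compute (1 − specificity, sensitivity) at each, and integrate the area under the curve by trapezoids.

```python
# Build the ROC curve from model scores and 0/1 labels, then return the
# area under it (trapezoid rule over the swept cut points).
def roc_auc(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]                    # cut above every score
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / neg, tp / pos))  # (1 - specificity, sensitivity)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A model that ranks every positive above every negative scores 1.0; a model no better than chance scores about 0.5.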
  48. TreeNet ROC Curve – IME (AUROC = 0.701)
  49. Logistic ROC Curve – SIU (AUROC = 0.612)
  50. Ranking of Methods/Software – IME Requested
  51. Some Software Packages That Can Be Used
      - Excel
      - Access
      - Free software
        - R
        - Web-based software
      - S-Plus (a commercial package similar to R)
      - SPSS
      - CART/MARS
      - Data mining suites (SAS Enterprise Miner, SPSS Clementine)
  52. References
      - Derrig, R., and Francis, L., “Distinguishing the Forest from the Trees: A Comparison of Tree-Based Data Mining Methods”, CAS Winter Forum, March 2006
      - Derrig, R., and Francis, L., “A Comparison of Methods for Predicting Fraud”, Risk Theory Seminar, April 2006
      - Francis, L., “Taming Text: An Introduction to Text Mining”, CAS Winter Forum, March 2006
      - Francis, L.A., “Neural Networks Demystified”, Casualty Actuarial Society Forum, Winter 2001, pp. 254–319
      - Francis, L.A., “Martian Chronicles: Is MARS Better than Neural Networks?”, Casualty Actuarial Society Forum, Winter 2003, pp. 253–320
      - Dhar, V., Seven Methods for Transforming Corporate Data into Business Intelligence, Prentice Hall, 1997
      - The web site has some tutorials and presentations
  53. Predictive Modeling
      CAS Reinsurance Seminar, May, 2006
      [email_address]