  1. 1. Taming Text: An Introduction to Text Mining CAS 2006 Ratemaking Seminar Prepared by Louise Francis Francis Analytics and Actuarial Data Mining, Inc. April 1, 2006 [email_address] www.data-mines.com
  2. 2. Objectives <ul><li>Present a new data mining technology </li></ul><ul><li>Show how the technology uses a combination of </li></ul><ul><ul><li>String processing functions </li></ul></ul><ul><ul><li>Common multivariate procedures available in most statistical software </li></ul></ul><ul><li>Present a simple example of text mining </li></ul><ul><li>Discuss practical issues for implementing the methods </li></ul>
  3. 3. Actuarial Rocket Science <ul><li>Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection and other applications </li></ul><ul><li>The methods are typically applied to large, complex databases </li></ul><ul><li>One of the newest of these is text mining </li></ul>
  4. 4. Major Kinds of Modeling <ul><li>Supervised learning </li></ul><ul><ul><li>Most common situation </li></ul></ul><ul><ul><li>A dependent variable </li></ul></ul><ul><ul><ul><li>Frequency </li></ul></ul></ul><ul><ul><ul><li>Loss ratio </li></ul></ul></ul><ul><ul><ul><li>Fraud/no fraud </li></ul></ul></ul><ul><ul><li>Some methods </li></ul></ul><ul><ul><ul><li>Regression </li></ul></ul></ul><ul><ul><ul><li>CART </li></ul></ul></ul><ul><ul><ul><li>Some neural networks </li></ul></ul></ul><ul><li>Unsupervised learning </li></ul><ul><ul><li>No dependent variable </li></ul></ul><ul><ul><li>Group like records together </li></ul></ul><ul><ul><ul><li>A group of claims with similar characteristics might be more likely to be fraudulent </li></ul></ul></ul><ul><ul><ul><li>Ex: Territory assignment, Text Mining </li></ul></ul></ul><ul><ul><li>Some methods </li></ul></ul><ul><ul><ul><li>Association rules </li></ul></ul></ul><ul><ul><ul><li>K-means clustering </li></ul></ul></ul><ul><ul><ul><li>Kohonen neural networks </li></ul></ul></ul>
  5. 5. Text Mining: Uses Growing in Many Areas ECHELON Program
  6. 6. Lots of Information, but no Data
  7. 7. Example: Claim Description Field
  8. 8. Objective <ul><li>Create a new variable from free form text </li></ul><ul><li>Use words in injury description to create an injury code </li></ul><ul><li>New injury code can be used in a predictive model or in other analysis </li></ul>
  9. 9. A Two-Step Process <ul><li>Use string manipulation functions to parse the text </li></ul><ul><ul><li>Search for blanks, commas, periods and other word separators </li></ul></ul><ul><ul><li>Use the separators to extract words </li></ul></ul><ul><ul><li>Eliminate stopwords </li></ul></ul><ul><li>Use multivariate techniques to cluster like terms together into the same injury code </li></ul><ul><ul><li>Cluster analysis </li></ul></ul><ul><ul><li>Factor and Principal Components analysis </li></ul></ul>
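The parsing step above can be sketched in a few lines. This is a minimal illustration in Python rather than the Excel or Perl approach shown in the slides; the stopword list, the function name `parse_description`, and the sample claim text are all assumptions made for the example.

```python
import re

# Hypothetical stopword list for illustration; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to", "while"}

def parse_description(text):
    """Split a free-form description on separators and drop stopwords."""
    # Treat any run of non-alphanumeric characters (blanks, commas, periods) as a separator.
    words = re.split(r"[^A-Za-z0-9]+", text.lower())
    # Discard empty strings left by trailing separators, then remove stopwords.
    return [w for w in words if w and w not in STOPWORDS]

# Made-up claim description, not taken from the presentation's data.
terms = parse_description("Employee slipped on ice, injured the lower back.")
print(terms)
```

The extracted terms would then feed the second step, where multivariate techniques group them into injury codes.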
  10. 10. Parsing a Claim Description Field With Microsoft Excel String Functions
  11. 11. Extraction Creates Binary Indicator Variables
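One way to picture the binary indicator variables is as a table with one row per claim and one column per term. The sketch below assumes the claims have already been parsed into word lists; the function name `indicator_matrix` and the three sample claims are hypothetical.

```python
def indicator_matrix(documents):
    """Return the sorted vocabulary and one 0/1 row per document."""
    vocabulary = sorted({term for doc in documents for term in doc})
    rows = []
    for doc in documents:
        present = set(doc)
        # 1 if the term appears in this claim description, otherwise 0.
        rows.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, rows

# Made-up parsed claim descriptions.
claims = [["broken", "arm"], ["back", "strain"], ["broken", "leg"]]
vocabulary, rows = indicator_matrix(claims)
print(vocabulary)
for row in rows:
    print(row)
```

Each column of `rows` is one indicator variable, which is the form the clustering and factor methods later in the deck expect.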
  12. 12. Eliminate Stopwords <ul><li>Common words with no meaningful content </li></ul>
  13. 13. Stemming: Identify Synonyms and Words with Common Stem
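A rough sketch of the idea follows. It uses a naive suffix stripper and a one-entry synonym table, both invented for illustration; it is not the stemmer used in the presentation and is far cruder than an established algorithm such as Porter's.

```python
# Hypothetical suffix list and synonym table, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
SYNONYMS = {"laceration": "cut"}

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(word):
    """Map a synonym to its preferred term, then reduce it to a stem."""
    return stem(SYNONYMS.get(word, word))

for w in ["bruised", "bruises", "bruising", "laceration", "cuts"]:
    print(w, "->", normalize(w))
```

After normalization, several surface forms share one indicator column, which reduces the number of variables before clustering.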
  14. 14. Dimension Reduction
  15. 15. The Two Major Categories of Dimension Reduction <ul><li>Variable reduction </li></ul><ul><ul><li>Factor Analysis </li></ul></ul><ul><ul><li>Principal Components Analysis </li></ul></ul><ul><li>Record reduction </li></ul><ul><ul><li>Clustering </li></ul></ul><ul><li>Other methods tend to be developments on these </li></ul>
  16. 16. Correlated Dimensions
  17. 17. Clustering <ul><li>Common methods: k-means and hierarchical clustering </li></ul><ul><li>No dependent variable – records are grouped into classes with similar values on the variables </li></ul><ul><li>Start with a measure of similarity or dissimilarity </li></ul><ul><li>Maximize dissimilarity between members of different clusters </li></ul>
  18. 18. Dissimilarity (Distance) Measure – Continuous Variables <ul><li>Euclidean Distance </li></ul><ul><li>Manhattan Distance </li></ul>
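The two distance measures named on this slide can be written directly from their standard definitions. The function names and the sample points are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```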
  19. 19. Binary Variables
  20. 20. Binary Variables <ul><li>Simple Matching </li></ul><ul><li>Rogers and Tanimoto </li></ul>
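For binary indicator variables, both measures are built from counts of agreements and disagreements between two records. The sketch below uses the standard textbook definitions, since the formulas are not reproduced in this transcript; the function names and sample vectors are assumptions.

```python
def match_counts(x, y):
    """Count positions where two 0/1 vectors agree and where they disagree."""
    agree = sum(1 for a, b in zip(x, y) if a == b)
    disagree = len(x) - agree
    return agree, disagree

def simple_matching(x, y):
    """Share of positions on which the two records agree."""
    agree, disagree = match_counts(x, y)
    return agree / (agree + disagree)

def rogers_tanimoto(x, y):
    """Like simple matching, but disagreements receive double weight."""
    agree, disagree = match_counts(x, y)
    return agree / (agree + 2 * disagree)

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(simple_matching(x, y))   # 3 agreements out of 5 positions
print(rogers_tanimoto(x, y))   # 3 / (3 + 2 * 2)
```

Both are similarity measures; subtracting either from 1 gives a dissimilarity suitable for clustering.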
  21. 21. K-Means Clustering <ul><li>Determine ahead of time how many clusters or groups you want </li></ul><ul><li>Use dissimilarity measure to assign all records to one of the clusters </li></ul>
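A bare-bones version of the procedure might look like the following. It assumes squared Euclidean distance and, to keep the example deterministic, takes the first k records as starting centers; statistical packages typically choose starting centers more carefully. The function name and data are hypothetical.

```python
def k_means(records, k, iterations=20):
    """Assign each record to the nearest of k centers, then recompute the centers."""
    # Assumption: the first k records serve as the initial centers.
    centers = [list(r) for r in records[:k]]
    assignment = [0] * len(records)
    for _ in range(iterations):
        # Assignment step: pick the center with the smallest squared distance.
        assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(r, centers[j])))
            for r in records
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [r for r, c in zip(records, assignment) if c == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

records = [(1, 1), (1, 2), (8, 8), (9, 8)]
assignment, centers = k_means(records, k=2)
print(assignment)
print(centers)
```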
  22. 22. Hierarchical Clustering <ul><li>A stepwise procedure </li></ul><ul><li>At beginning, each record is its own cluster </li></ul><ul><li>Combine the most similar records into a single cluster </li></ul><ul><li>Repeat process until there is only one cluster with every record in it </li></ul>
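The stepwise procedure above can be sketched as follows. This version assumes single linkage (the distance between two clusters is the distance between their closest members), which is one of several common linkage choices and is not specified in the slides; the function name and the one-dimensional sample data are illustrative.

```python
def agglomerate(records, distance):
    """Merge the two closest clusters repeatedly; return the merge history."""
    # At the beginning, each record (identified by its index) is its own cluster.
    clusters = [[i] for i in range(len(records))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: the smallest distance between any pair of members.
                d = min(distance(records[a], records[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

records = [0.0, 1.0, 5.0, 5.5]
merges = agglomerate(records, lambda x, y: abs(x - y))
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The merge history is what a dendrogram displays; cutting it at a chosen distance yields a particular number of clusters.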
  23. 23. Hierarchical Clustering Example
  24. 24. How Many Clusters? <ul><li>Use statistics on strength of relationship to variables of interest </li></ul>
  25. 25. A Statistical Test for Number of Clusters <ul><li>Schwarz Bayesian Information Criterion </li></ul>
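For reference, the criterion is usually written in the general form below. The slide does not reproduce the formula, so this is the standard textbook definition rather than the exact variant computed by any particular clustering package; the symbols are assumptions introduced here.

```latex
\mathrm{BIC} = -2 \ln \hat{L} + k \ln n
```

Here \hat{L} is the maximized likelihood of the fitted cluster model, k is the number of estimated parameters, and n is the number of records. Under this sign convention, a candidate number of clusters with a lower BIC is preferred, because the k ln n term penalizes the extra parameters that each additional cluster introduces.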
  26. 26. Final Cluster Selection
  27. 27. Use New Injury Code in a Logistic Regression to Predict Serious Claims
  28. 28. Software for Text Mining – Commercial Software <ul><li>Most major software companies, as well as some specialists, sell text mining software </li></ul><ul><ul><li>These products tend to be for large complicated applications, such as classifying academic papers </li></ul></ul><ul><ul><li>They also tend to be expensive </li></ul></ul><ul><li>One inexpensive product reviewed by The American Statistician had disappointing performance </li></ul>
  29. 29. Software for Text Mining – Free Software <ul><li>A free product, TMSK, was used for much of the paper’s analysis </li></ul><ul><li>Parts of the analysis were done in widely available software packages, SPSS and S-PLUS (R) </li></ul><ul><li>Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org) </li></ul>
  30. 30. Software Used for Text Mining [diagram: Text Mining: Parse Terms, Feature Creation, Prediction; software labels: Perl, TMSK, S-PLUS, SPSS; SPSS, S-PLUS, SAS]
  31. 31. Perl <ul><li>Free open source programming language </li></ul><ul><li>www.perl.org </li></ul><ul><li>Used a lot for text processing </li></ul><ul><li>Perl for Dummies gives a good introduction </li></ul>
  32. 32. Perl Functions for Parsing <ul><li># Read the claims file and split each line into words </li></ul><ul><li>$TheFile = &quot;GLClaims.txt&quot;; </li></ul><ul><li>open(INFILE, $TheFile) or die &quot;File not found&quot;; </li></ul><ul><li># Initialize variables </li></ul><ul><li>$Linecount = 0; </li></ul><ul><li>@alllines = (); </li></ul><ul><li>while(<INFILE>){ </li></ul><ul><li>$Theline = $_; </li></ul><ul><li>chomp($Theline); # remove the trailing newline </li></ul><ul><li>$Linecount = $Linecount + 1; </li></ul><ul><li>$Linelength = length($Theline); </li></ul><ul><li>@Newitems = split(/ /, $Theline); # split on blanks </li></ul><ul><li>print &quot;@Newitems \n&quot;; </li></ul><ul><li>push(@alllines, [@Newitems]); # keep the words from every line </li></ul><ul><li>} # end while </li></ul><ul><li>close(INFILE); </li></ul>
  33. 33. References <ul><li>Hoffman, P., Perl for Dummies, Wiley, 2003 </li></ul><ul><li>Weiss, Sholom, Indurkhya, Nitin, Zhang, Tong and Damerau, Fred, Text Mining, Springer, 2005 </li></ul>
  34. 34. Questions?
