Data Mining for Fraud Detection


  • What can data mining do?
  • Let’s define data mining: “The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.” In short, data mining means finding patterns or relationships in your data that you can use to solve your organization’s problems.
  • How does one mine data? The Cross Industry Standard Process for Data Mining (CRISP-DM) provides a framework for all data mining efforts. In May of 1998, a consortium of data mining experts from NCR, ISL, SPSS, and Daimler-Benz met in London to plan a process model for data mining. This group continues to grow, with presently over 80 members worldwide. CRISP-DM moves away from a focus on technology by addressing the needs of all levels of users in deploying data mining technology to solve business problems. The model is designed to help businesses plan and work through the complete data mining process, from problem specification to deployment of results. This makes large data mining projects faster, more efficient, more reliable, more manageable, and less costly. CRISP-DM has been kept sufficiently lightweight, however, to benefit even small-scale data mining investigations. Issues addressed include: mapping from business issues to data mining problems; capturing and understanding data; identifying and solving problems within the data; applying data mining techniques; interpreting data mining results within the business context; deploying and maintaining data mining results; and capturing and transferring expertise to ensure future projects benefit from experience. As well as providing a process structure for carrying out data mining, the model also aims to provide guidance on potential problems and solutions that can occur in data mining projects.
  • At SPSS we have identified a few techniques that we have found to be especially successful in helping identify fraud. We can divide these into two sets of modeling techniques: techniques that predict or classify, and techniques that group or find associations. These can be further drilled down to specific sets of algorithms, as shown in the chart above. If we utilize techniques that classify or predict, we have constructed a data mining problem where we already know the outcome: we already know what “normal” should look like, or we already have cases of fraud. But remember, fraudsters are tricky; if you catch them they will try to find other ways to take advantage of the system. Many times when we are working with fraud, we do not know what it looks like. Grouping and association algorithms help us handle these situations. Using these techniques we might be able to learn what a “normal” case looks like; cases that appear abnormal can be pulled out and reviewed more carefully.
  • If we choose to use a technique to predict, we may have a set of audited billing cases, for example. We might be able to build a model to predict what the cost for a particular procedure or product should be. We then may apply this model to new cases and compare the actual expenditure with the expected expenditure. If costs exceed what was expected, those cases might be examined more closely.
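A minimal sketch of this expected-versus-actual comparison. The cost model, field names, and tolerance below are hypothetical; in practice the expected-cost function would be fit by regression on audited historical claims:

```python
# Sketch: flag claims whose actual cost far exceeds the expected cost.
# The expected-cost model here is a made-up linear rule, standing in for
# a regression model trained on audited cases.

def expected_cost(units, unit_price=120.0, fixed_fee=50.0):
    """Hypothetical model: a fixed fee plus a price per unit of service."""
    return fixed_fee + unit_price * units

def flag_claims(claims, tolerance=1.5):
    """Return IDs of claims whose actual cost exceeds tolerance x expected."""
    flagged = []
    for claim_id, units, actual in claims:
        if actual > tolerance * expected_cost(units):
            flagged.append(claim_id)
    return flagged

claims = [("A1", 2, 310.0),   # expected 290  -> within tolerance
          ("A2", 1, 900.0),   # expected 170  -> flagged
          ("A3", 4, 500.0)]   # expected 530  -> within tolerance
print(flag_claims(claims))    # -> ['A2']
```

The tolerance trades off audit workload against missed cases; a fitted model would also report a confidence band rather than a single multiplier.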
  • We may have audited cases where fraud has been identified. Those cases are flagged as TRUE for fraud. Additionally, we have a set of cases flagged as FALSE for fraud, meaning that while we carefully audited these cases, we could not identify any problems with them. We then may choose a classification algorithm that takes those TRUE/FALSE cases and builds a predictive model that will, in the future, automatically identify cases that require further scrutiny.
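As a sketch of this supervised approach, the toy learner below fits a one-level decision stump (a stand-in for the CART/C5.0-style trees the deck mentions) from TRUE/FALSE audited cases, then scores a new case. The features and thresholds are illustrative:

```python
# Sketch: learn a one-level decision stump from audited cases labelled
# True (fraud) / False (clean), then use it to route new cases for review.

def learn_stump(cases, thresholds):
    """Pick the (feature_index, threshold) split that best matches labels.
    cases: list of (feature_tuple, bool_label)."""
    best, best_acc = None, 0.0
    for i, candidate_ts in enumerate(thresholds):
        for t in candidate_ts:
            correct = sum((x[i] > t) == label for x, label in cases)
            acc = correct / len(cases)
            if acc > best_acc:
                best, best_acc = (i, t), acc
    return best

def predict(stump, x):
    i, t = stump
    return x[i] > t

# features: (claim_amount, claims_per_month) -- illustrative only
audited = [((100, 1), False), ((150, 2), False),
           ((900, 1), True),  ((800, 9), True)]
stump = learn_stump(audited, thresholds=[(200, 500), (3, 5)])
print(predict(stump, (700, 2)))   # -> True: flag for further scrutiny
```

A real tree learner would grow many such splits recursively and use an impurity measure rather than raw accuracy, but the workflow (train on flagged audits, score new cases) is the same.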
  • Grouping (otherwise known as clustering) algorithms may be employed in circumstances where we do not know what “fraud” looks like. Our goal is simply to get a good picture of what all the cases look like. We would like to believe that an individual who might be committing fraud would likely have some characteristics, or patterns of behavior, that somehow look different from other cases. Using clustering routines, we can find those outlier cases. Of course, in some of the situations cited in the earlier discussion today, waste, fraud and abuse are so prevalent that membership in such a cluster would be very high. In those cases, we generally build the clusters, then build profiles of each cluster to understand the characteristics of its membership. Our experience has shown that such a “cluster profile” may have telling characteristics that indicate a need for further investigation.
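A minimal pure-Python sketch of this cluster-then-review idea: a bare-bones k-means groups the cases, and anything far from every cluster centre is pulled out as abnormal. The data, k, and distance threshold are all illustrative:

```python
# Sketch: minimal k-means clustering, then flag the cases farthest from
# every cluster centre as outliers worth a closer look.
import math

def kmeans(points, k, iters=20):
    centres = points[:k]                          # naive initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        centres = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres

def outliers(points, centres, threshold):
    """Cases far from every cluster centre look 'abnormal'."""
    return [p for p in points
            if min(math.dist(p, c) for c in centres) > threshold]

cases = [(1, 1), (1.2, 0.9), (0.8, 1.1),      # one tight group
         (5, 5), (5.1, 4.9), (4.9, 5.2),      # another tight group
         (9, 0)]                              # lone unusual case
centres = kmeans(cases, k=2)
print(outliers(cases, centres, threshold=2.0))  # -> [(9, 0)]
```

Production clustering (K-means, Kohonen, TwoStep) adds better initialisation, scaling of features, and a principled choice of k, but the review step is the same: profile each cluster and investigate the cases that fit none of them.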
  • By utilizing the CRISP-DM process model and identifying the business issues and data mining objectives, the data mining process can: implement more data mining goals more quickly; be easier to understand for a new person entering the project; be more quantifiable to Congress and the GAO; and be easier to update and change when the actions of fraudsters change.
  • Now we’ll discuss three specific instances where data mining with SPSS products was able to help government clients predict and prevent fraud, waste and abuse. They are in three areas: payment error prevention, billing and payment fraud, and audit selection.
  • The US Health Care Finance Administration (now CMS) is using data mining to improve customer service. [Click] By analyzing incoming requests for help and information, the agency hopes to schedule its workforce to provide faster, more accurate answers to questions. NOTE TO PRESENTER: The new logo is not available on their site as yet, which is why we left the old logo and name on the slide.
  • The US DFAS needs to search through 2.5 million financial transactions that may indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS is mining the data to identify suspicious transactions. [Click] Using data mining, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. Using this information, DFAS could focus investigations, finding fraud more cost-effectively.
  • The State of Washington needed to detect erroneous tax payments and identify those who were non-compliant. [Click] Using data mining, the agency has started to examine tax returns more closely, isolating the factors that point to fraud and non-compliance, thereby maximizing its auditing effort.
  • To summarize – data mining is the key to detecting and preventing fraud, waste and abuse in your organization. Data mining can help you: Analyze – learn from the past. Predict – pre-empt future fraud, waste and abuse. React to changing circumstances – continuously learn from the newest cases.
  • Data Mining for Fraud Detection

    1. CS490D: Introduction to Data Mining. Prof. Chris Clifton, April 14, 2004. Fraud and Misuse Detection

    2. What is Fraud Detection?
       • Identify wrongful actions
         ◦ Is right and wrong universal?
         ◦ If so, why not just prevent wrong actions?
       • Identify actions by the wrong people
       • Identify suspect actions
         ◦ Legal
         ◦ But probably not right

    3. In data mining terms…
       • Classification?
         ◦ Classify into fraudulent and non-fraudulent behavior
         ◦ What do we need to do this?
       • Outlier detection
         ◦ Assume non-fraudulent behavior is normal
         ◦ Find the exceptions
       • Problems?
    4. Solution: Differential Profiling
       • Determine individual behavior
         ◦ What is normal for the individual
         ◦ What separates one individual from another
       • Gives a profile of individual behavior
       • How do we do this?
       (Diagram: +/– labelled examples feed classification mining, producing per-individual profiles)
    5. Has this been done? Intrusion Detection (Lane &amp; Brodley)
       • Profiled computer users based on command sequences
         ◦ Command
         ◦ Some (but not all) argument information
         ◦ Sequence information
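The profiling idea in the last two slides can be sketched very simply: build each user's profile from their own command history and score new sessions against it. This toy version uses command frequencies only; real work like Lane and Brodley's also used argument and sequence information. The history and sessions below are made up:

```python
# Sketch of differential profiling: a per-user profile (here, command
# frequencies) lets us score sessions against that user's own history
# rather than a global norm.
from collections import Counter

def build_profile(sessions):
    """Profile = relative frequency of each command across past sessions."""
    counts = Counter(cmd for s in sessions for cmd in s)
    total = sum(counts.values())
    return {cmd: n / total for cmd, n in counts.items()}

def session_score(profile, session):
    """Mean profile probability of the session's commands.
    Low scores mean the session looks unlike this user's history."""
    return sum(profile.get(cmd, 0.0) for cmd in session) / len(session)

history = [["ls", "cd", "vim", "make"], ["ls", "vim", "make", "make"]]
profile = build_profile(history)
print(session_score(profile, ["ls", "vim", "make"]))    # high: typical
print(session_score(profile, ["nc", "chmod", "wget"]))  # -> 0.0: anomalous
```

The "differential" part of the real technique also contrasts users against each other, keeping the features that best separate one individual from another.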
    6. Results (charts: accuracy and time to alarm)
    7. Scaling Issues
       • What happens with millions of users?
         ◦ Credit card
         ◦ Cell phone
       • What about new users?
       • Ideas?

    8. Multi-user profiles
       • Cluster users
       • Develop profiles for clusters
         ◦ E.g., differential profiling
       • Old customers: do they match the profile for their cluster?
         ◦ Allows a wider range of acceptable behavior
       • New customers: do they match any profile?
    9. Data mining for detection and prevention

    10. Data mining defined:
       • “The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”
         ◦ The Gartner Group

    11. Matching known fraud/non-compliance
       • Which new cases are similar to known cases?
       • How can we define similarity?
       • How can we rate or score similarity?

    12. Anomalies and irregularities
       • How can we detect anomalous or unusual behavior?
       • What do we mean by usual?
       • Can we rate or score cases on their degree of anomaly?

    13. Data mining is not
       • “Blind” application of analysis/modeling algorithms
       • Brute-force crunching of bulk data
       • Black-box technology
       • Magic

    14. How do you mine data?
       • Use the Cross Industry Standard Process for Data Mining (CRISP-DM)
       • Based on real-world lessons:
         ◦ Focus on business issues
         ◦ User-centric and interactive
         ◦ Full process
         ◦ Results are used
    15. Techniques used to identify fraud
       • Predict and classify
         ◦ Regression algorithms (predict a numeric outcome): neural networks, CART, regression, GLM
         ◦ Classification algorithms (predict a symbolic outcome): CART, C5.0, logistic regression
       • Group and find associations
         ◦ Clustering/grouping algorithms: K-means, Kohonen, TwoStep, factor analysis
         ◦ Association algorithms: apriori, GRI, Capri, Sequence
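Of the four families above, association analysis is the one not sketched elsewhere in these notes. At its simplest it finds item pairs that co-occur more often than a support threshold, the first step of an apriori-style search. The billing codes and threshold below are illustrative:

```python
# Sketch: association-style analysis at its simplest -- find pairs of
# items (e.g. billing codes) whose co-occurrence support across
# transactions meets a minimum threshold.
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    pair_counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            pair_counts[pair] += 1
    n = len(transactions)
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

billing = [["office_visit", "xray", "cast"],
           ["office_visit", "xray"],
           ["office_visit", "cast", "xray"],
           ["lab_test"]]
print(frequent_pairs(billing, min_support=0.5))
```

A full apriori implementation extends frequent pairs to larger itemsets and derives rules with confidence scores; for fraud work, rare combinations that *break* the frequent patterns are often the interesting ones.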
    16. Techniques for finding fraud: prediction
       • Predict the expected value for a claim and compare it with the actual value of the claim.
       • Cases that fall far outside the expected range should be evaluated more closely.

    17. Techniques for finding fraud: decision trees and rules
       • Build a profile of the characteristics of fraudulent behavior.
       • Pull out the cases that meet the historical characteristics of fraud.

    18. Techniques for finding fraud: clustering and associations
       • Group behavior using a clustering algorithm
       • Find groups of events using the association algorithms
       • Identify outliers and investigate
    19. Fraud detection using CRISP-DM
       • Provides a systematic way to detect fraud and abuse
       • Ensures auditing and investigative efforts are maximized
       • Continually assesses and updates models to identify newly emerging fraud patterns
       • Leads to higher recoupments

    20. Data mining in action: fraud, waste and abuse case studies

    21. How can data mining help?
       • Payment error prevention
       • Billing and payment fraud
       • Audit selection
    22. Payment error prevention
       • The US Health Care Finance Administration needed to isolate the likely causes of payment error by developing a profile of acceptable billing practices, and used this information to focus their auditing effort.

    23. Payment error prevention solution
       • Clementine™
       • Using audited discharge records, built profiles of appropriate decisions such as diagnosis coding and admission
       • Matched new cases
       • Cases not matching are audited

    24. Payment error prevention results
       • Detected 50% of past incorrect payments, resulting in significant recovery of funding lost to payment errors
       • PRO analysts able to use the resulting Clementine models to prevent future errors
    25. Billing and payment fraud
       • The US Defense Finance and Accounting Service needed to find fraud in millions of Dept of Defense transactions, and identified suspicious cases to focus investigations.

    26. Billing and payment fraud solution
       • Clementine
       • Detection models based on known fraud patterns
       • Analyzed all transactions, scoring each on similarity to these known patterns
       • High-scoring transactions flagged for investigation

    27. Billing and payment fraud results
       • Identified over 1,200 payments for further investigation
       • Integrated the detection process
       • Anomaly detection methods (e.g., clustering) will serve as ‘sentinel’ systems for previously undetected fraud patterns
    28. Audit selection
       • The Washington State Department of Revenue needed to detect erroneous tax returns, and focused audit investigations on cases with the highest likely adjustments.

    29. Audit selection solution
       • Clementine
       • Using previously audited returns, model adjustment (recovery) per auditor hour based on return information
       • Models will then score future returns, highlighting the highest potential adjustment

    30. Audit selection results
       • Maximizes auditors’ time by focusing on cases likely to yield the highest return
       • Closes the ‘tax gap’

    31. Data mining: key to detecting and preventing fraud, waste and abuse
       • Learn from the past
         ◦ High-quality, evidence-based decisions
       • Predict
         ◦ Prevent future instances
       • React to changing circumstances
         ◦ Models kept current from the latest data