Using Data Mining Techniques for Fraud Detection
Solving Business Problems Using SAS® Enterprise Miner™ Software
A SAS Institute Best Practices Paper
In conjunction with Federal Data Corporation
Abstract
Data mining combines data analysis techniques with high-end technology for use within a
process. The primary goal of data mining is to develop usable knowledge regarding future
events. This paper defines the steps in the data mining process, explains the importance of
the steps, and shows how the steps were used in two case studies involving fraud detection.1
The steps in the data mining process are:
• problem definition
• data collection and enhancement
• modeling strategies
• training, validation, and testing of models
• analyzing results
• modeling iterations
• implementing results.
The first case study uses the data mining process to analyze instances of fraud in the public sector health care industry. In this study, called "the health care case," the data contain recorded examples of known fraudulent cases. The objective of the health care case is to determine, through predictive modeling, which attributes characterize fraudulent claims.

In the second case study, a public sector organization deploys data mining in a purchase card domain with the aim of determining which transactions reflect fraud in the form of diverting public funds for private use. In this study, called "the purchase card case," no prior knowledge of fraud exists.
Problem Definition
Defining the business problem is the first and arguably the most important step in the data mining process. The problem definition should not be a discussion of the implementation or efficacy of the enabling technology, data mining itself. Instead, the problem definition should state the business objective.

A proper business objective uses clear, simple language that focuses on the business problem and clearly states how the results are to be measured. In addition, the problem definition should include estimates of the costs associated with making inaccurate predictions as well as estimates of the advantages of making accurate ones.
1 Due to the proprietary nature of the data used in this paper, the case studies focus on the data mining process instead of in-depth interpretation of the results.
Data Collection and Enhancement
Data mining algorithms are only as good as the data shown to them. Incomplete or biased
data produce incomplete or biased models with significant blind spots. As Figure 1 illus-
trates, data collection itself involves four distinct steps:
1. Define Data Sources
   Select from multiple databases. These data may include transaction databases, personnel
   databases, and accounting databases. Care should be taken, though, because the data used
   to train data mining models must match the data on which the models will be deployed in
   an operational setting. Thus, the data preparation tasks performed in data mining will have
   to be repeated when the models are deployed.
2. Join and Denormalize Data
   This step involves joining the multiple data sources into a flat file structure. It sometimes
   requires decisions on the level of measurement (for example, whether some data get
   summarized in order to facilitate joining).
3. Enrich Data
   As data from disparate sources are joined, it may become evident that the information
   contained in the records is insufficient. For instance, the data on vendors may not be
   specific enough. It may be necessary to enrich the data with external (or other) data.
4. Transform Data
   Data transformations enable a model to more readily extract the valuable information
   from data. Examples of data transformations include aggregating records, creating ratios,
   and summarizing very granular fields.
Figure 1: Data Collection and Associated Steps in the Data Mining Process
Defining the data sources should be a part of the details of the problem definition such as,
“We want to get a comprehensive picture of our business by incorporating legacy data with
transactions on our current systems.” Enriching data fills gaps found in the data during the
join process. Finally, data mining tools can often perform data transformations involving
aggregation, arithmetic operations, and other aspects of data preparation.
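As a concrete, hedged illustration of the join, enrich, and transform steps, the sketch below uses Python with pandas rather than the SAS tools discussed in this paper; all file names, column names, and the derived ratio are hypothetical stand-ins for whatever sources a given project actually has.

    import pandas as pd

    # Hypothetical sources: a transaction file and an account (personnel) file.
    transactions = pd.read_csv("transactions.csv")   # one row per purchase
    accounts = pd.read_csv("accounts.csv")           # one row per account holder

    # Join and denormalize: bring account attributes onto each transaction row.
    flat = transactions.merge(accounts, on="account_id", how="left")

    # Enrich: attach external reference data, e.g. richer vendor descriptions.
    vendors = pd.read_csv("vendor_reference.csv")
    flat = flat.merge(vendors, on="vendor_code", how="left")

    # Transform: aggregate to one row per account and derive a ratio-style field.
    profile = flat.groupby("account_id").agg(
        total_spend=("amount", "sum"),
        n_purchases=("amount", "size"),
        n_weekend=("is_weekend", "sum"),
    )
    profile["weekend_ratio"] = profile["n_weekend"] / profile["n_purchases"]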
Modeling Strategies
Data mining strategies fall into two broad categories: supervised learning and unsupervised
learning. Supervised learning methods are deployed when there exists a target variable 2
with known values and about which predictions will be made by using the values of other
variables as input.
Unsupervised learning methods tend to be deployed on data for which there does not exist a
target variable with known values, but for which input variables do exist. Although unsuper-
vised learning methods are used most often in cases where a target variable does not exist,
the existence of a target variable does not preclude the use of unsupervised learning.
Table 1 maps data mining techniques by modeling objective and supervised/unsupervised
distinctions using four modeling objectives: prediction, classification, exploration, and affinity.
Prediction
  Supervised: Regression and logistic regression, neural networks, decision trees.
  (Targets can be binary, interval, nominal, or ordinal.)
  Unsupervised: Not feasible.

Classification
  Supervised: Decision trees, neural networks, discriminant analysis.
  (Targets can be binary, nominal, or ordinal.)
  Unsupervised: Clustering (k-means, etc.), neural networks, self-organizing maps (Kohonen networks).

Exploration
  Supervised: Decision trees. (Targets can be binary, nominal, or ordinal.)
  Unsupervised: Principal components, clustering (k-means, etc.).

Affinity
  Supervised: Not applicable.
  Unsupervised: Associations, sequences, factor analysis.

Table 1: Modeling Objectives and Data Mining Techniques
Prediction algorithms determine models or rules to predict continuous or discrete target
values given input data. For example, a prediction problem could attempt to predict the
value of the S&P 500 Index given some input data such as a sudden change in a foreign
exchange rate.
Classification algorithms determine models to predict discrete values given input data.
A classification problem might involve trying to determine whether a transaction represents
fraudulent behavior based on indicators such as the type of establishment at which the purchase
was made, the time of day the purchase was made, and the amount of the purchase.
2 In some disciplines, the terms field and variable are synonymous.
Exploration uncovers dimensionality in input data. For example, trying to uncover groups of
similar customers based on spending habits for a large, targeted mailing is an exploration problem.
Affinity analysis determines which events are likely to occur in conjunction with one another.
Retailers use affinity analysis to analyze product purchase combinations.
Both supervised and unsupervised learning methods are useful for classification purposes.
In a particular business problem involving fraud, the objective may be to establish a classifi-
cation scheme for fraudulent transactions. Regression, decision trees, neural networks and
clustering can all address this type of problem. Decision trees and neural networks build
classification rules and other mechanisms for detecting fraud. Clustering can indicate what
types of groupings in a given population (based on a number of inputs) are more at risk for
exhibiting fraud.
Classification modeling tries to find models (or rules) that predict the values of one or more
variables in a data set (target) from the values of other variables in the data set (inputs).
After finding a good model, a data mining tool applies the model to new data sets that may
or may not contain the variable(s) being predicted. When applying a model to new data,
each record in the new data set receives a score based on the likelihood the record repre-
sents some target value. For example, in the health care case, fraud represents the target
value. In this paper, case study 1—the health care case—uses classification modeling.
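Because the analyses in this paper were performed with SAS Enterprise Miner, the following is only a minimal sketch of the fit-then-score pattern described above, written in Python with scikit-learn and hypothetical file and column names: a classification model is trained on records with a known fraud flag and then applied to new records, each of which receives a likelihood of representing the target value.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    claims = pd.read_csv("claims.csv")                # historical claims with a known fraud flag
    X = pd.get_dummies(claims.drop(columns="fraud"))  # encode nominal inputs as indicators
    y = claims["fraud"]

    model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

    # Score new claims: each record receives the likelihood that it represents fraud.
    new_claims = pd.read_csv("new_claims.csv")
    X_new = pd.get_dummies(new_claims).reindex(columns=X.columns, fill_value=0)
    new_claims["fraud_score"] = model.predict_proba(X_new)[:, 1]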
Exploration uses different forms of unsupervised learning. Clustering places objects into
groups or clusters that are suggested by the data and are based on the values of the input
variables. The objects in each cluster tend to have a similar set of values across the input
fields, and objects in different clusters tend to be dissimilar. Clustering differs from artificial
intelligence (AI) and online analytical processing (OLAP) in that clustering leverages the
relationships in the data themselves to inductively uncover structure rather than imposing
an analyst’s structure on the data. In this paper, case study 2—the purchase card case—uses
exploratory clustering.
Training, Validation, and Testing of Models
Model development begins by partitioning data sets into one set of data used to train a
model, another data set used to validate the model, and a third used to test the trained and
validated model.3 This splitting of data ensures that the model does not memorize a particular
subset of data. A model trains on one set of data, where it learns the underlying patterns in
that data, then gets validated on another set of data, which it has never seen before. If the
model does not perform satisfactorily in the validation phase (for example, it may accurately
predict too few cases in the target field), it will be re-trained.
The training, validating, and testing process occurs iteratively. Models are repeatedly trained
and validated until the tool reaches a time limit or an accuracy threshold. Data partitioning
typically splits the raw data randomly between training and validation sets. In some
instances, controlling the splitting of training and validation data is desirable. For instance, a
credit card fraud case may necessitate controlled partitioning to avoid distributing the fraudulent
transactions for one particular account between training and validation.
3 Some texts and software products refer to the test set as the validation set, and the validation set as the test set.
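A minimal sketch of the three-way partition, assuming a pandas DataFrame with the input fields, a fraud target, and an account identifier (all hypothetical), is shown below in Python. The grouped split mirrors the controlled-partitioning caveat above: all of one account's transactions land on the same side of each split.

    from sklearn.model_selection import GroupShuffleSplit

    def three_way_split(data, group_col="account_id", seed=0):
        # Carve off roughly 20 percent of accounts as the test set.
        gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
        rest_idx, test_idx = next(gss.split(data, groups=data[group_col]))
        rest, test = data.iloc[rest_idx], data.iloc[test_idx]

        # Split the remainder into training (~75 percent) and validation (~25 percent) sets.
        gss2 = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
        train_idx, valid_idx = next(gss2.split(rest, groups=rest[group_col]))
        return rest.iloc[train_idx], rest.iloc[valid_idx], test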
After model training and validation, algorithm parameters can be changed in an effort to
find a better model. This new model produces new results through training and validation.
Modeling is an iterative process. Before settling on an acceptable model, analysts should
generate several models from which the best choice can be made.
While training and validation are independent activities in the modeling process, they are
nonetheless indirectly linked, and this link can affect the generalizability of the developed
model. During validation, the model indirectly “sees” the validation data and tries to
improve its performance on it. The model may eventually memorize both the training data
and, indirectly, the validation data, making a third data set, known as the test data set,
instrumental to delivering unbiased results.
The test data set is used at the very end of the model building process. The test data set must
have a fully populated target variable. The test data set should only be used once, to evaluate
or compare models’ performance, not to determine how the model should be re-trained.
Analyzing Results
Diagnostics used in model evaluation vary in supervised and unsupervised learning. For
classification problems, analysts typically review gain, lift and profit charts, threshold charts,
confusion matrices, and statistics of fit for the training and validation sets, or for the test set.
Business domain knowledge is also of significant value to interpreting model results.
Clustering models can be evaluated for overall model performance or for the quality of cer-
tain groupings of data. Overall model diagnostics usually focus on determining how capable
the model was in dividing the input data into discrete sets of similar cases. However, analysts
may also determine the adequacy of certain individual clusters by analyzing descriptive statistics
for key fields in the cluster vis-à-vis remaining data. For instance, if an analyst seeks patterns
of misuse of a purchase card, clusters with high concentrations of questionable purchases
may be targeted for investigations.
Linking Techniques to Business Problems
Business problems can be solved using one or more of the data mining techniques listed in
Table 1. Choice of the data mining technique to deploy on a business problem depends on
business goals and the data involved. Rarely does a data mining effort rely on a single
algorithm to solve a particular business problem. In fact, multiple data mining approaches
are often deployed on a single problem.
Table 2 displays some uses of data mining for the different modeling objectives.

Prediction
  Supervised (target field data exists): Attrition/retention, cash used in ATMs, cost of hospital stay, fraud detection, campaign analysis.
  Unsupervised (no target field data exists): Not feasible.

Classification
  Supervised: Segmentation, brand switching, charge-offs, fraud detection, campaign analysis.
  Unsupervised: Segmentation, attrition/retention.

Exploration
  Supervised: Segmentation, attrition/retention, scorecard creation, fraud detection, campaign analysis.
  Unsupervised: Segmentation, profiling.

Affinity
  Supervised: Not applicable.
  Unsupervised: Cross-sell/up-sell, market basket analysis.

Table 2: Use of Data Mining by Modeling Objective and Learning Method
Case Study 1: Health Care Fraud Detection
Problem Definition
A public sector health care organization has begun to track fraudulent claims. In most cases,
the organization identifies fraudulent claims by receiving tips from concerned parties then
investigating those tips. In this case study, the business question centers on identifying pat-
terns that produce fraudulent claims for health care among program beneficiaries.
Data Collection and Enhancement
A public sector organization has collected data in which there exists a field indicating fraudulent
activity, along with some 15 input fields that may predict fraud. Figure 2 shows how
SAS® Enterprise Miner™ displays a table view of the health care fraud data.
Most of the records in Figure 2 have missing data for some fields; however, the missing data
actually represent important data values. The data came directly from a transaction processing
system that recorded “missing” as the default value. For instance, if a person was born in the
U.S., then no country of birth value is entered for that record. In order to become useful,
these transaction data must be cleansed and transformed. For example, “missing” values for
the country of birth variable are reset to “U.S.” prior to analysis.
Figure 2: Missing Health Care Fraud Data
Modeling Strategies
In choosing a modeling strategy for the data in this case study, the following factors come
into play:
• amount of missing data and how it is handled
• measurement level of the input variables
• percentage of data representing the target event (fraud)
• goal of the analysis – understanding predictive factors versus making good predictions.
The level of measurement of the target field is binary and the inputs are primarily nominal
with missing values that will be assigned to their own class before analysis. There is one
interval variable with no missing values. So our choice of modeling strategy is not highly
restricted yet. In fact, regression, decision trees, or neural networks may be appropriate for
the problem at hand. However, training a neural network will be slow with the nominal inputs.
The goals of the analysis are to understand how the input factors relate to predicting fraud
and to develop rules for identifying new cases to investigate. Because neural networks pro-
vide little if any feedback on how the inputs relate to the target, they may be inappropriate
for this analysis. Thus for this exercise, regression and decision trees were used. This leads to
the last consideration – the percentage of fraudulent records in the data. To understand the
factors in choosing a modeling strategy, you must understand how the model algorithms work.
Decision trees will try to group all the nominal values of an input into smaller groups that
are increasingly predictive of the target field. For example, for an input with 8 nominal levels,
the decision tree might automatically create two groups, one with 3 levels and one with 5, so
that the 3-level grouping contains the majority of fraudulent records. Because the tree puts the
data into a large group and then tries to split the large groups, the decision tree has most of
the data available to work with from the start. Data do become scarce as the tree grows,
because each new split subsets the data for further modeling.
In terms of data scarcity, regression works differently than decision trees. All levels of all
inputs are used to create a contingency table against the target. If there are not enough data
in a cell of the table, then the regression will have problems including that input in the
model. Scarcity of data becomes even more of a problem if interactions or cross-terms are
included in the model. For example, if your inputs include geographical region with 50 levels,
and product code with 20 levels, a regression model with a region by product interaction will
create 50+20+50*20 cells in the contingency table. You would need a large amount of data in
each of these 1070 cells to fit a regression model.
Training, Validation, and Testing of Models
In the health care fraud case study, only about 7 percent of the records in the data set repre-
sent fraudulent cases. In addition, many of the inputs have a high number of nominal levels.
Combined, these two factors make analysis with a regression model difficult. However, for
comparison a forward stepwise regression was tested, and it was determined that its selection
of significant predictors agreed with the decision tree’s fit.
Figure 3 shows how SAS Enterprise Miner was used to create a process flow diagram for
analyzing the health care fraud data.
Figure 3: Analysis Path for Health Care Data
In fitting the decision tree, many of the options were adjusted to find the best tree. Unfortunately,
there were only 2,107 records, of which only 153 were considered fraudulent.
This is not enough data to create a validation data set that would ensure generalizable
results. Options used include the following (a rough analogue of these settings is sketched after the list):
• CHAID type of Chi-square splitting criterion, which splits only when a statistically
significant threshold is achieved
• entropy reduction splitting criterion, which measures the achieved node purity at
each split
• evaluating the tree based on the node purity (distinct distribution in nodes)
• evaluating the tree based on the misclassification rate (general assessment)
• allowing more than the default number of splits (2) at each level in the tree
• adjusting the number of observations required in a node and required for splitting
to continue.
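The options above are SAS Enterprise Miner settings. As a rough, hypothetical analogue only, the sketch below shows how comparable controls (an entropy splitting criterion and minimum node sizes) are expressed for a scikit-learn decision tree in Python; CHAID-style significance testing and multi-way splits have no direct scikit-learn equivalent, so this does not reproduce the trees in the paper.

    from sklearn.tree import DecisionTreeClassifier

    # Entropy-reduction splitting with node-size controls (illustrative values).
    tree = DecisionTreeClassifier(
        criterion="entropy",    # entropy reduction splitting criterion
        min_samples_leaf=20,    # observations required in a node
        min_samples_split=40,   # observations required for splitting to continue
        random_state=0,
    )
    # tree.fit(X_train, y_train) would then be assessed on held-out data,
    # for example by misclassification rate: 1 - tree.score(X_valid, y_valid).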
Analyzing Results
To analyze the results of the health care study, we used lift charts and confusion matrices.
A lift chart displays results from different models allowing a quick initial model comparison.
A confusion matrix involves a comparison of predicted values to actual values.
Analyzing Results Using Lift Charts
Lift charts compare the percentage of fraudulent observations found by each of the decision
trees and the regression model. In Figure 4 the lift chart shows the percent of positive
response (or lift) on the vertical axis. The lift chart reveals that two of the trees, 4 and 5,
outperformed the other models.
Figure 4: Lift Chart Comparing Multiple Models
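The quantity behind a gains/lift comparison can be computed directly from scored data. The sketch below is a hedged Python illustration, assuming hypothetical arrays of actual labels and model scores; it returns the cumulative percent of all fraud captured in the top 10 percent, 20 percent, and so on of records ranked by score.

    import numpy as np

    def cumulative_gains(y_true, scores, n_bins=10):
        order = np.argsort(scores)[::-1]           # highest scores first
        y_sorted = np.asarray(y_true)[order]
        cuts = [int(len(y_sorted) * k / n_bins) for k in range(1, n_bins + 1)]
        total_fraud = y_sorted.sum()
        return [y_sorted[:c].sum() / total_fraud for c in cuts]

    # Example: gains = cumulative_gains(y_valid, model.predict_proba(X_valid)[:, 1])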
Tree 4 allows only 2 splits at each branch, requires at least 20 observations in a node to
continue splitting, and uses the entropy reduction algorithm with the best assessment option.
Tree 5 allows 4 splits at each branch, requires at least 20 observations in a node to continue
splitting, and uses the entropy reduction algorithm with the distinct distribution in nodes
option. To decide which of these trees to use, several factors need to be considered:
• What is the desired tradeoff between false alarms and false dismissals?
• How much fraud overall is each tree able to identify?
• How have the algorithmic options that have been set affected the tree results?
• Does one tree make more sense than the other does from a business perspective?
• Is one tree simpler to understand and implement than the other?
Analyzing Results Using Confusion Matrices
Individual model performance of supervised learning methods is often assessed using a
confusion matrix. The objective, typically, is to increase the number of correct fraud predictions
(sensitivity) while keeping the false alarm rate (the complement of specificity) at an
acceptable level. The two goals, correctly predicting as much of the target field as possible
versus keeping the false alarm rate low, tend to be inversely related. A simple example can
illustrate this point: to catch all the fraud in a data set, one need only call all health care claims
fraudulent, while to avoid any false alarms one need only call all claims non-fraudulent.
Reality resides between these two extremes.
The business question typically defines what false alarm rate is tolerable versus what amount
of fraud (or other target) needs to be caught.
Table 3 displays the layout of a confusion matrix. The confusion matrix compares actual
values of fraud (rows) versus model predictions of fraud (columns). If the model predicted
fraud perfectly, all observations in the confusion matrix would reside in the two shaded cells
labelled “Correct Dismissals” and “Correct Hits.” Generally, the objective is to maximize
correct predictions while managing the increase in false alarms.
                             Model Predicts Non-Fraud     Model Predicts Fraud
  Actual: Non-Fraudulent     Correct Dismissals           False Alarms
  Actual: Fraudulent         False Dismissals             Correct Hits

Table 3: Layout of a Confusion Matrix
When predicting for classification problems, each record receives a score based on the likeli-
hood that the record represents some target value. Because the likelihood is a probability, its
values range from zero to one inclusive. While most modeling packages apply a standard
cutoff or threshold to this likelihood and then determine the predicted classification value,
SAS Enterprise Miner enables the analyst to modify the default threshold. Changing the
threshold value changes the confusion matrix.
Figures 5 and 6 display the confusion matrix of tree 5 for thresholds of 0.10 and 0.50 respectively.
With a 0.10 threshold, records that have a 10 percent or higher chance of being fraudulent
are predicted as fraudulent. The 10 percent threshold predicts more false alarms while the
50 percent threshold predicts more false dismissals. A correct classification diagram enables
evaluation of the confusion matrix over a wide range of thresholds.
Figure 5: Tree 5 Confusion Matrix at 10% Threshold
Figure 6: Tree 5 Confusion Matrix at 50% Threshold
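How the threshold reshapes the confusion matrix can be illustrated in a few lines. The sketch below is a minimal Python example, assuming hypothetical label and score arrays, in the spirit of Figures 5 and 6.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def matrix_at_threshold(y_true, scores, threshold):
        predicted = (np.asarray(scores) >= threshold).astype(int)
        # Rows: actual non-fraud / fraud; columns: predicted non-fraud / fraud.
        return confusion_matrix(y_true, predicted, labels=[0, 1])

    # Lower thresholds flag more records as fraud: more correct hits but more false alarms.
    # matrix_at_threshold(y_valid, scores, 0.10)
    # matrix_at_threshold(y_valid, scores, 0.50)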
Figure 7 displays a correct classification rate diagram for tree 5. This plot shows the tradeoff
between sensitivity and specificity enabling the analyst to determine an appropriate cut-off
value for the likelihood that a particular record is fraudulent. For this example, the curve for
Target Level of both illustrates that the threshold can vary upwards of 20 percent without
significantly affecting the results.
Figure 7: Tree 5 Correct Classification Diagram
Table 4 displays statistics for decision trees 4 and 5. In this case, both trees produce quite similar
results, first splitting on gender then other inputs in different orders. Because of the options
chosen after the initial split, tree 4 has focused on isolating fraud in the males (68 percent of
all fraud) while tree 5 has found fraud for both males and females. At first glance, tree 4
with fewer splits may seem simpler, yet tree 5 with more splits has less depth and simpler
rules. Tree 5 also isolates 37 percent of all fraudulent cases whereas tree 4 only isolates 25
percent of all fraud.
  Tree    Node    # Fraud (%)    Gender
  4       61      22 (14%)       Male
  4       106     7 (4.6%)       Male
  4       93      8 (5.2%)       Male
  Tree 4 total    37 (25%)
  5       16      7 (4.6%)       Male
  5       49      4 (2.8%)       Male
  5       46      11 (7.2%)      Male
  5       43      15 (9.8%)      Male
  5       65      6 (3.9%)       Female
  5       63      7 (4.6%)       Female
  5       32      4 (2.6%)       Female
  5       2       3 (2%)         Female
  Tree 5 total    57 (37%)

Table 4: Table of Tree Statistics
The visualization techniques available in SAS Enterprise Miner also are helpful when analyzing
the trees. For example, tree ring diagrams provide holistic views of decision trees. Figure 8
displays a tree ring diagram for tree 4. The center of the ring represents the top tree node
including all the data. The first ring represents the first tree split, in this case on gender.
Subsequent rings correspond to subsequent levels in the tree. Colors are assigned to show
the proportion of fraudulent records correctly classified in each node of the tree. Light sec-
tions correspond to less fraud, dark sections to more fraud.
Figure 8: Tree Ring Diagram for Decision Tree 4
Using the diagnostics readily available in the software enables analysts to investigate quickly
the darker portions of the diagram and to generate a subset of the tree that displays the
rules required to create the associated data subset. Figure 9 displays rules for decision tree 4.
Figure 9: A Subset of the Rules for Tree 4
An example rule from Figure 9 specifies that 14 percent of the fraudulent records can be
described as follows:
• male
• from four specific person categories of file type A
• received payments of between $19,567 and $44,500
• one of three ‘pc’ status values.
Deriving these rules from Figure 9 is straightforward; however, notice that the payment
amount is addressed twice in the model. Model options for tree 4 specified that only one
split point could be defined on an input at each level of the tree. This algorithmic setting
often causes decision trees to create splits on a single input at multiple levels in the tree
making rules more difficult to understand.
Following a similar set of steps for tree 5 enables a comparison of the two trees at a more
granular level. Figure 10 displays the tree ring for decision tree 5 for which four split points
for each input were allowed in each level of the tree. Allowing the algorithm more freedom
in splitting the inputs resulted in a tree with fewer levels that addresses more of the data – in
particular both males and females. A quick glance at the tree ring may suggest tree 5 is more
complex than tree 4. However, each input appears at only one level in the tree, making the
rules easier to understand.
Figure 10: Tree Ring Diagram for Decision Tree 5
Decision tree 5 is displayed graphically in Figure 11 as a set of splits (decisions).
Figure 11: Rules for Tree 5
An example rule from Figure 11 specifies that 9.8 percent of the fraudulent records can be
described as follows:
• male
• from two specific person categories or the ‘missing’ category
• of payment status U
• received payments of between $11,567 and $40,851.
Conclusions for Case Study 1
Based on the amount and type of data available, a decision tree with rules that are simple to
follow provides the best insight into this data. Of course, the best recommendation is to
obtain more data, use a validation data set, and have subject matter expertise applied to
enhance this analysis.
Case Study 2: Purchase Card Fraud Detection
A federal agency has collected data on its employees’ purchase card transactions and on the
40,000 employees’ purchase card accounts. The transaction data contain information on the
date purchases are made, the purchase amount, the merchant’s name, the address of the
merchant, and the standard industrial classification (SIC) code of the merchant among other
fields. The account data contain information about the individuals’ accounts such as information
about the account holder, the single transaction limit of the account, the billing cycle purchase
limit for the account, and purchase histories for each account among other fields.
Problem Definition
A government organization seeks to determine what groups of purchases exist in its purchase
card program that may be indicative of a misuse of public funds. The organization has collect-
ed information on purchase characteristics that signify a misuse of government funds. This
information is resident in reports about purchase card risk. Additional information resides
with domain experts. The organization seeks to determine what other types of transactions
group together with the existing knowledge for the sake of preventing continued misuse of
funds by authorized individuals.
The organization wishes to build an effective fraud detection system using its own data as a
starting point.
Data Collection and Enhancement
After defining the business problem, the next step in the data mining process is to link the
disparate data sources. In this case study, data from account and transaction files are linked.
Data are joined at the transaction level because the business question is focused on deter-
mining the inherent properties of transactions that signify fraudulent use of funds.
Typically, not all of the data that have been joined will be selected for model inputs. Some
subset of the fields will be used to develop the data mining models.
Data transformations can be performed on the collected data. Data transformations involve
converting raw inputs; for example, they might group very granular categorical variables such
as SIC codes into more general groups, or aggregate records. Data transformations make more
efficient use of the information embedded in raw data and can be made with the assistance
of domain experts. In this case study, domain experts have indicated that purchases under
some SIC codes indicate a misuse of funds.
Typically, data mining requires the drawing of samples from the records in the joined data
due to the intense resources required in the training of data mining algorithms. Samples
need to be representative of the total population so that models have a chance to “see”
possible combinations of fields.
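One way to draw such a representative sample, continuing the hypothetical joined frame from the earlier data collection sketch, is a stratified draw in pandas so that rare merchant categories are not lost entirely; the sampling fraction is illustrative.

    # Draw 10 percent of transactions within each SIC-code group so that
    # infrequent merchant categories remain represented in the sample.
    sample = (flat.groupby("sic_code", group_keys=False)
                  .apply(lambda g: g.sample(frac=0.10, random_state=0)))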
Modeling Strategies
In this case study, no target field exists because the organization has never analyzed pur-
chase card data in search of fraud. Therefore, the decision is made to use unsupervised
learning methods to uncover meaningful patterns in the data. Unsupervised learning will
be used to group the data into sets of similar cases.
Figure 12 displays the selection of an unsupervised learning method using SAS Enterprise
Miner. A sample of approximately 13,000 accounts is created. Cluster analysis segments the
sample data into sets of similar records.
Figure 12: Selection of Clustering as an Unsupervised Learning Method
The unsupervised learning method selected in Figure 12 performs disjoint cluster analysis on
the basis of Euclidean distances computed from one or more quantitative variables and seeds
that are generated and updated by a clustering algorithm. Essentially the clustering method
bins the data into groups in such a way as to minimize differences within groups at the same
time that it maximizes differences between groups.
The cluster criterion used in this example is Ordinary Least Squares (OLS), wherein clusters are
constructed so that the sum of the squared distances of observations to the cluster means
is minimized.
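The least-squares criterion described above is essentially the objective minimized by k-means clustering. The sketch below, a Python/scikit-learn approximation with a hypothetical account-level file, standardizes the numeric inputs and segments them into disjoint clusters; it is not the Enterprise Miner procedure used in the case study.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    accounts = pd.read_csv("purchase_card_profiles.csv")   # hypothetical account-level inputs
    features = accounts.select_dtypes("number")

    # Standardize so that no single field dominates the Euclidean distances.
    X = StandardScaler().fit_transform(features)

    # k-means minimizes the sum of squared distances from observations to cluster means.
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    accounts["cluster"] = kmeans.labels_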
Training, Validation, and Testing of Models
Figure 13 displays a hypothetical clustering result. The large crosses represent cluster cen-
ters. The cases are assigned to three clusters (each with an ellipse drawn about it). In the
space represented, there is no better way to assign cases to clusters in order to minimize
distance from each data point to the center of each cluster. Of course, this example displays
a simple two-dimensional representation; cluster analysis performs its optimization routines
in m-dimensional space, where m is the number of fields or variables. Therefore, if there
are 20 variables in the clustering operation, the space in which clustering is performed is
20-dimensional space.
Figure 13: Cluster Analysis Efficiently Segments Data into Groups of Similar Cases
The difference between exploratory analysis and pattern discovery in clustering concerns
what constitutes a result and how the results will be put to use. Exploratory analysis may be
satisfied to discover some interesting cases in the data. Pattern discovery will leverage the
existing clusters and the general patterns associated with those clusters to assign new cases
to clusters. As a result of this more forward-looking objective, cluster analysis in pattern
discovery requires cluster models to be tested prior to deployment. This testing ensures a
reliable result, one that can help ensure that “discovered” clusters in the data persist in the
general case.
In this case study, cluster analysis is used as a pattern detection technique; therefore, the
resulting cluster model would need to be tested were it to be applied.
Part of the model training process involves selecting parameters for the cluster model.
Figure 14 shows the parameter settings for the current cluster model. In this case, four
cluster centers are selected.
Figure 14: Selecting Cluster Model Parameters
The model settings in Figure 14 will produce a cluster model with four centers. The algorithm
will try to arrange the data around the four clusters in such a way as to minimize differences
within clusters at the same time that it maximizes differences between clusters.
Analyzing Results
Figure 15 displays results for a cluster analysis using the purchase card data. The parameters
for the cluster analysis were set to 40 clusters. The height and color of each pie slice represent
the number of cases in the cluster. The slice width refers to the radius of the circle that covers
all data points in the cluster as measured from the center of the cluster. Cluster 31 holds
the largest number of cases at 6,334, while clusters 1, 11, and 19 each have in excess of
500 cases. Cluster 6 has 345 cases.
Figure 15: Results of Cluster Analysis
Figure 16 displays cluster statistics. The column titles represent standard industrial classification
(SIC) codes where purchases have taken place. The number in each cell corresponds to the
average frequency of purchases made by the account holders in that cluster. In this case,
cluster 6 (with 345 cases) is highlighted because account holders in this cluster make an average of
6.51 sports and leisure purchases each.
Figure 16: Cluster Statistics
Looking at the raw data for the cases in cluster 6, we find that account holders in that cluster
also make a large number of weekend and holiday purchases, restaurant purchases, and hotel
purchases. These accounts are problematic, as the patterns they exhibit clearly indicate
improper use of purchase cards for personal and/or unwarranted expenses.
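Per-cluster statistics of the kind shown in Figure 16 reduce to grouped means. The sketch below continues the hypothetical frame from the clustering sketch and profiles each cluster by the average purchase frequency in each SIC-code category; the column naming convention is assumed.

    # Average purchase frequency per SIC-code category within each cluster.
    sic_columns = [c for c in accounts.columns if c.startswith("sic_")]
    cluster_profile = accounts.groupby("cluster")[sic_columns].mean().round(2)

    # Clusters whose means for questionable categories (sports and leisure, weekend,
    # restaurant, hotel purchases) stand out from the rest are candidates for investigation.
    print(cluster_profile)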
As investigation of clusters proceeds, it is also necessary to ensure that sufficient separation
exists between the clusters. Lastly, it is important to identify the relevance of the clusters,
which is achieved with the aid of domain expertise. Individuals who are knowledgeable about
purchase card use can help indicate which groupings of data are promising given the
business question.
The model would still need to be tested by using new data to ensure that the clusters developed
are consistent with the current model.
Building from Unsupervised to Supervised Learning
Pattern detection provides more information on fraudulent behaviors than simply reporting
exceptions and can prove valuable in the future for building a knowledge base for predicting
fraud. For example, the cluster analysis in this case study yields interesting results. In fact, one
of the clusters holds the promise of uncovering fraudulent transactions, which may require
investigation through account audits. The ultimate findings of the investigations should be
stored in a knowledge base, which can be used to validate the cluster model. Should investigation
show the model’s judgment to be erroneous, the cluster analysis would need to be revisited.
The tested cluster model can continue to be applied to new data, producing cases for investi-
gation. In turn, the knowledge base will accumulate known fraud cases.
Conclusions for Case Study 2
Cluster analysis yields substantive results in the absence of a target field. Used wisely, cluster
analysis can help an organization interested in fraud detection build a knowledge base of
fraud. The ultimate objective would be the creation of a supervised learning model, such as a
neural network, focused on uncovering fraudulent transactions.
Overall Conclusions
Data mining uncovers patterns hidden in data to deliver knowledge for solving business
questions. Even in the absence of target fields, data mining can guide an organization’s
actions toward solving its business questions and building a growing knowledge base.
The powerful data mining tools found in SAS Enterprise Miner software make it easy for
organizations to extract knowledge from data for use in solving core business questions.
When followed, the steps in the data mining process (problem definition; data collection and
enhancement; modeling strategies; training and validating models; analyzing results; modeling
iterations; and implementing results) provide powerful results to organizations.
Biographies
I. Philip Matkovsky
Federal Data Corporation
4800 Hampden Lane
Bethesda, MD 20814
301.961.7024
pmatkovsky@feddata.com
As manager of operations for Federal Data Corporation’s Analytical Systems Group, Philip
Matkovsky provides technical leadership and guidance on data mining, quantitative analysis, and
management consulting engagements for both public and private sector clients. Philip has
a BA in Political Science from the University of Pennsylvania, an MA in Political Science/
Public Policy from American University and is currently completing his doctoral research in
Public Policy at American University. Philip has successfully applied numerous analytical/
research approaches (including survey research, game theoretic models, quantitative model-
ing, and data mining) for public and private sector clients.
Kristin Rahn Nauta
SAS Institute, Inc.
SAS Campus Drive
Cary, NC 27513
919.677.8000 x4346
saskrl@wnt.sas.com
As part of the Federal Technology Center at SAS Institute Inc., Kristin Rahn Nauta is the
federal program manager for data mining. Formerly SAS Institute’s data mining program
manager for Canada and the analytical products marketing manager for the US, Kristin
has a BS in mathematics from Clemson University and a Master of Statistics from North
Carolina State University. Kristin has consulted in a variety of fields including pharmaceutical
drug research and design, pharmaceutical NDA submissions, database marketing, and
customer relationship management.
Recommended Reading
Data Mining
Adriaans, Pieter and Dolf Zantinge (1996), Data Mining, Harlow, England: Addison Wesley.
Berry, Michael J. A. and Gordon Linoff (1997), Data Mining Techniques, New York: John Wiley & Sons, Inc.
SAS Institute Inc. (1997), SAS Institute White Paper, Business Intelligence Systems and Data Mining, Cary, NC: SAS Institute Inc.
SAS Institute Inc. (1998), SAS Institute White Paper, Finding the Solution to Data Mining: A Map of the Features and Components of SAS® Enterprise Miner™ Software, Cary, NC: SAS Institute Inc.
Weiss, Sholom M. and Nitin Indurkhya (1998), Predictive Data Mining: A Practical Guide, San Francisco, California: Morgan Kaufmann Publishers, Inc.

Data Warehousing

Berson, Alex and Stephen J. Smith (Contributor) (1997), Data Warehousing, Data Mining and OLAP, New York: McGraw Hill.
Inmon, W. H. (1993), Building the Data Warehouse, New York: John Wiley & Sons, Inc.
SAS Institute Inc. (1995), SAS Institute White Paper, Building a SAS® Data Warehouse, Cary, NC: SAS Institute Inc.
SAS Institute Inc. (1996), SAS Institute White Paper, SAS Institute’s Rapid Warehousing Methodology, Cary, NC: SAS Institute Inc.
Singh, Harry (1998), Data Warehousing Concepts, Technologies, Implementations, and Management, Upper Saddle River, New Jersey: Prentice-Hall, Inc.
Credits
Using Data Mining Techniques for Fraud Detection was a collaborative work. Contributors
to the development and production of this paper included the following persons:
Consultants
SAS Institute Inc.
Padraic G. Neville, Ph.D.
Writers
SAS Institute Inc.
Kristin Rahn Nauta, M.Stat.
Federal Data Corporation
I. Philip Matkovsky
Technical Reviewers
SAS Institute Inc.
Brent L. Cohen, Ph.D.
Bernd Drewes
Anne Milley
Warren S. Sarle, Ph.D.
Federal Data Corporation
Steve Sharp
Paul Simons
Technical Editor
SAS Institute Inc.
John S. Williams
Copy Editors
SAS Institute Inc.
Rebecca Autore
Sue W. Talley
Production Specialist
SAS Institute Inc.
Kevin Cournoyer