This document outlines an approach for improving the efficacy of alerts from anti-money laundering (AML) transaction monitoring models: regularly evaluating rule efficacy, acquiring historical transaction and alert data, analyzing the data to understand patterns, building a detection engine to test threshold combinations, running the data through the engine to calculate alert counts for each combination, analyzing combinations by case and SAR retention, qualitatively evaluating samples, and selecting the optimal combination based on efficacy tests. The goal is to reduce false positives while still detecting truly suspicious activity, improving compliance programs and controlling costs.
Reducing False Positives - BSA AML Transaction Monitoring Re-Tuning Approach
Reducing False Positives:
BSA AML Transaction Monitoring Re-Tuning Approach
Written by Mayank Johri and Erik De Monte
Introduction
Institutions waste millions of dollars per year analyzing false positives generated by low-efficacy models. In an era of heightened regulatory scrutiny, coupled with institutions' desire to control compliance costs, there is a need for a sound methodology to improve the overall efficacy of alerts. High efficacy and a sound methodology allow institutions to channel their time and resources toward truly suspicious activity and improve the overall quality of a BSA/AML program.
Certain proposed solutions to this problem include automated alert closures, whitelists, and the like. These solutions do not address the underlying problem of false positives and do not represent sound principles for a robust BSA/AML program.
Rather than relying on "out-of-the-box" rules from the transaction monitoring software, custom rules that encapsulate multiple scenarios, combined with automated learned behavior (based on past dispositions), customer segmentation, and peer group analysis, may help improve the efficacy of the alerts; however, these rules still have to be tuned to determine the most effective thresholds.
Below is a summary of a step-by-step approach that can stand the scrutiny of examiners and fulfill the desired objective of generating quality alerts.
Approach
Assessment & Prioritization
On a regular basis, evaluate the efficacy of the suspicious activity detection rules currently in production, rank the rules from lowest to highest efficacy, and create a prioritization list. This list then drives the tuning schedule and plan.
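For illustration only, a minimal R sketch of this prioritization step is included below; the rule_stats data frame and its column names are hypothetical placeholders for the institution's actual rule-level alert and disposition statistics.

# Hypothetical rule-level statistics: alerts generated and SARs filed per rule
rule_stats <- data.frame(
  rule_id = c("RULE_01", "RULE_02", "RULE_03"),
  alerts  = c(1200, 450, 300),
  sars    = c(6, 9, 1)
)

# Efficacy measured here as SARs per alert; the lowest-efficacy rules are tuned first
rule_stats$efficacy <- rule_stats$sars / rule_stats$alerts
priority_list <- rule_stats[order(rule_stats$efficacy), ]
priority_list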
Data Acquisition
Three sets of data are pulled for the re-tuning analysis:
1) All historic transaction data since the most recent tuning was implemented;
2) For the rule in question, all historic alerted transactions and their subsequent disposition (escalated case and SAR) data. This data can be collected by querying the backend databases of the transaction monitoring system (a query sketch follows this list);
3) Relevant customer data elements, such as entity/consumer, cash intensive business, AI, etc., for the customers alerted.
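A hedged sketch of the disposition-data pull in item 2 using RODBC (which the Appendix code already loads) is shown below; the DSN "TM_SYSTEM" and the table, column, and rule identifiers are hypothetical placeholders that will differ by transaction monitoring vendor.

library(RODBC)
# Hypothetical connection to the transaction monitoring system's backend database
conn <- odbcConnect("TM_SYSTEM")
# Pull alerted transactions and their dispositions for the rule being re-tuned;
# the rule identifier and date filter are placeholders
alerted_txns <- sqlQuery(conn, "
  SELECT a.alert_nbr, a.case_nbr, a.sar_nbr,
         t.transaction_key, t.transaction_date, t.amount
  FROM   alerts a
  JOIN   transactions t ON t.alert_nbr = a.alert_nbr
  WHERE  a.rule_id = 'RULE_01'
    AND  a.alert_date >= '2015-10-01'")
odbcClose(conn)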
Data Analysis
Stratify the data as required (for example, grouping like attributes or 'non-tunable parameters' such as entity/consumer, cash intensive businesses, etc.) to account for like-attribute behavior patterns. After stratification, perform a series of data analyses to better understand the data. This analysis consists of, but is not limited to, verifying that suitable transaction codes and details are available, confirming the completeness and accuracy of the data set, and performing a series of correlation tests to identify whether certain data elements are correlated. This stage helps the institution understand the data specific to its own clients and data set; for example, two data elements may prove to be correlated at one institution and not at another. If two data elements are found to be correlated, it may be in the institution's best interest (from a resource or time perspective) to analyze those two elements together rather than independently.
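As a simple illustration of the correlation check, a minimal R sketch is included below; it assumes the transactions data frame loaded in the Appendix shell code, and the attribute column names mirror that sample data rather than any particular institution's data elements.

# Pairwise correlation between two candidate data elements;
# a strong correlation suggests the corresponding thresholds can be anchored together
cor(transactions$Attribute_01, transactions$Attribute_02, use = "complete.obs")

# All pairwise correlations among the numeric attributes at once
cor(transactions[, c("Attribute_01", "Attribute_02", "Attribute_03")], use = "complete.obs")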
Build Detection Engine
Using the transaction monitoring system's manual as a guide, recreate the rule in an object-oriented programming language (a statistically driven language is preferable; R, MATLAB, or Python are recommended) to build an external engine for analyzing the rule thresholds.
A threshold range is determined for each threshold being tuned, and a matrix is created for all combinations of the different possible threshold values. As mentioned above, if two thresholds are found to be directly correlated, anchor them together as one to eliminate unnecessary noise in the permutation matrix.
Determine a de minimis value to serve as the lowest threshold value in the re-tuning range for that threshold. Professional judgment is used to identify the highest threshold value in the re-tuning range, which will typically mirror, in the opposite direction, the delta between the current threshold value and the de minimis value.
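As a small illustration, using the Threshold 1 values from Figure 1.1 below and an assumed increment of 500, the range can be constructed in R as:

current <- 10000; de_minimis <- 9500
# Mirror the delta between the current and de minimis values in the opposite direction
upper <- current + (current - de_minimis)
threshold_range <- seq(de_minimis, upper, by = 500)   # 9,500  10,000  10,500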
For some rules, it is expected that this permutation matrix can easily create upwards of a thousand
different threshold combinations. A simple example is included below to visualize the permutation
matrix discussed above.
Threshold | Current Threshold | Lower Range (de minimis) | Upper Range
1         | 10,000            | 9,500                    | 10,500
2         | 4                 | 2                        | 6
Figure 1.1 Sample Thresholds and Ranges
Permutation | Threshold 1 | Threshold 2
1           | 9,500       | 2
2           | 9,500       | 4
3           | 9,500       | 6
4           | 10,000      | 2
5           | 10,000      | 4
6           | 10,000      | 6
7           | 10,500      | 2
8           | 10,500      | 4
9           | 10,500      | 6
Figure 1.2 Sample Permutation Matrix (of Thresholds and Ranges from Figure 1.1)
Once both the permutation matrix and the rule engine have been built, the transaction data is clustered and any transactions falling into clusters outside of the threshold ranges are excluded. The two sets of transaction data (the full set of transaction data and the transaction data related to historic alerts, cases, and SARs) are then run through the rule engine in a loop over all threshold combinations in the permutation matrix.
The full set of all transactions is run through the engine first to output a count of events, or "alerts", for each permutation combination. Before proceeding, identify the threshold combination in the matrix that contains all of the current thresholds and compare its count against the actual alert count per the historic transaction monitoring system data. This provides a check for completeness over the data pull as well as validating the rule engine's accuracy.
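A minimal sketch of this completeness check, assuming the x_Final table and column names from the Appendix shell code and the current thresholds stated there (10, 2, and 1,000,000):

# Locate the permutation row that carries the current threshold values
current_row <- subset(x_Final,
                      Example_Threshold_01 == 10 &
                      Example_Threshold_02 == 2 &
                      Example_Threshold_03 == 1000000)
# The simulated count should equal the historic alert count pulled from the TM system
stopifnot(current_row$`Transaction Data - Count` ==
          current_row$`Transaction Alert - Historic`)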
Once confirmed, the second set of transaction data, linked to the historic cases and SARs, is run through the engine and logged as separate event counts in new columns in the matrix, as shown below.
Permutation | Threshold 1 | Threshold 2 | Transaction Alert – Historic | All Transaction Data – Count | Case Event Count | SAR Event Count
1 (Current) | 10,000      | 4           | 65                           | 65                           | 14               | 2
2 (New)     | 10,000      | 2           | 65                           | 58                           | 11               | 2
3 (New)     | 10,500      | 2           | 65                           | 51                           | 10               | 1
4 (New)     | …           | …           | …                            | …                            | …                | …
Figure 1.3 Re-Tuning Permutation Matrix Event Counts
As seen above, the "Transaction Alert – Historic" count and the "All Transaction Data – Count" for permutation 1 (the current thresholds) are equal, which confirms that the rule engine is simulating the rule accurately. In permutation 2, where the thresholds have been adjusted to a new combination, there is a slight decline in the "All Transaction Data – Count", as expected with the adjusted thresholds. (Note that the "Transaction Alert – Historic" column is anchored at 65, as this logic only produces the alert count at the current thresholds.)
Note that the SAR count of 2 will be used as an anchor in the analysis of the results to set the rule thresholds or parameters. Best practice holds that recent SARs serve as a benchmark for tuning thresholds and should be weighted heavily in the analysis. As seen above, the threshold combination in permutation 3 would cause one of the historic SARs to evade detection, and thus this permutation (and any additional permutations that do not detect both historic SARs) should be eliminated from any consideration for re-tuning.
A sample transaction data set and shell code (written in R) for the detection engine discussed above
is provided in the Appendix.
Quantitative Analysis
Identify the remaining permutation combinations and focus the analysis on the case and SAR retention proportions (the SAR proportion is usually weighted the most in the analysis). Any threshold combinations in the matrix with undesirable SAR and/or case retention ratios are eliminated from the list of possibilities. No single line of demarcation is identified at the end of the quantitative analysis for a re-tuning exercise. Instead, all remaining threshold combinations in the permutation matrix continue through to the qualitative assessment, and subsequent qualitative analysis is performed to solidify a new proposed line of demarcation.
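A hedged R sketch of this retention filter, again assuming the x_Final, cases, and sars objects from the Appendix shell code; the 90% case-retention cut-off is purely illustrative and should be set by the institution's risk appetite.

# Baseline counts: all transactions tied to historic cases and SARs
case_base <- length(cases$Transaction_Key)
sar_base  <- length(sars$Transaction_Key)

# Retention proportion for each threshold permutation
x_Final$case_retention <- x_Final$`Case Event Count` / case_base
x_Final$sar_retention  <- x_Final$`SAR Event Count` / sar_base

# Keep combinations that retain every SAR and, for example, at least 90% of case events
x_Remaining <- subset(x_Final, sar_retention >= 1 & case_retention >= 0.90)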
Qualitative Analysis
Using the historical data from the quantitative analysis, set indicators for Above-the-Line (ATL) and Below-the-Line (BTL) transactions and pull the qualitative samples to be reviewed by the FIU. The samples flagged as 'ATL' are essentially pseudo alerts and are treated as such in the FIU's investigative analysis. BTL samples are included in the sample to further validate the threshold line, as the expectation is that less than x% of BTL samples (this percentage will depend on the institution's risk appetite) would return as escalated cases.
Sampling
Determine the appropriate sample size using hypergeometric sampling (binomial sampling without replacement). The number of transactions that fall into the ATL or BTL category determines the number of random samples required for a statistically significant qualitative assessment; another random sample of the same size would have roughly the same chance of producing a similar result. The formula used to determine the sample size is shown below.
Included below is a sample size example:
N    = 620    Through data segmentation analysis (e.g., clustering), the BTL population is determined.
CI   = 1.96   Target significance level (confidence interval) is 95%; the associated factor ("z-value") is 1.96. In MS Excel this can be calculated using "=NORM.S.INV(1-((1-0.95)/2))".
Prec = 0.05   Precision, set by risk appetite; the smaller this value, the larger the sample size needs to be.
P    = 10%    Occurrence rate which needs to be detected.
n    = 113    Based on the values listed above, n = 113.
n = (CI^2 * P * Q / Prec^2) / (1 + (1/N) * (CI^2 * P * Q / Prec^2 - 1))
Legend
N = population size
P = expected occurrence rate of an attribute
Q = 1 - P
Prec = desired precision level
CI = associated factor at a given confidence level
The table below shows how each variable impacts sample size:
N   | Prec | P   | CI   | n
620 | 0.05 | 0.1 | 1.96 | 113
620 | 0.03 | 0.1 | 1.96 | 237
620 | 0.05 | 0.2 | 1.96 | 176
620 | 0.05 | 0.1 | 1.64 | 135
Figure 1.4 Sample Size
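For convenience, a small R helper implementing the sample-size formula above is sketched below; the function name and the rounding convention are illustrative (the figures in this paper are rounded to the nearest whole number).

# Finite-population sample size for attribute sampling without replacement
# N = population size, P = expected occurrence rate, Prec = precision, CI = z-value
sample_size <- function(N, P, Prec, CI) {
  Q <- 1 - P
  base <- CI^2 * P * Q / Prec^2          # infinite-population sample size
  round(base / (1 + (base - 1) / N))     # finite-population correction
}

sample_size(N = 620, P = 0.10, Prec = 0.05, CI = 1.96)   # returns 113, as in the first example above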
Investigator Analysis
The purpose of generating these samples is for the FIU to qualitatively evaluate the efficacy of the
quantitatively calculated thresholds. A group of investigators should be selected for the exercise and
randomly assigned pseudo ‘alerts’ to review as if they were authentic alerts from the transaction
monitoring system. In theory, if the threshold is appropriately tuned, then a transaction marked
‘ATL’ should most likely also be classified as ‘suspicious’ during this qualitative analysis, and all
sample transactions that are marked ‘BTL’ would be flagged as not suspicious.
The investigators' evaluation must consider the intent of each rule, and they will generally evaluate each transaction through a lens akin to: "Given what is known from KYC, the origin/destination of funds, the beneficiary, etcetera, is it explainable that this consumer/entity would transact this dollar amount at this ...frequency, velocity, pattern etc..." To maintain the integrity of this assessment, the investigator does not base the qualitative assessment solely on the value of the flagged transaction, but looks holistically at various qualities of the transaction, such as who the transaction is from and to (for example, is it a wire transfer between two branches of the same company, or of a similar commodity such as computers and semiconductors), and whether any fields, such as an individual's last name, contain key words that caused the rule to misinterpret the field and generate a false positive.
Proportion and Efficacy Tests
All threshold combinations need a review to identify which combination has the best efficacy, both from a quantitative and a qualitative perspective.
The outcome of the investigators' qualitative analysis and the subsequent statistical analysis decide whether the line of demarcation determined during the quantitative analysis remains at the current level or is revised. The risk appetite determines the acceptable magnitude of the proportion defective (the proportion of suspicious transactions), also known as the "efficacy rate". The range of outcomes and the corresponding decisions are listed below; a sketch of the statistical comparison follows the list.
1. BTL has an acceptable proportion of suspicious transactions and the ATL proportion is significantly different (i.e., larger) than the BTL proportion; the threshold remains at the current level. The threshold meaningfully separates the BTL and ATL populations, and the separation is at the "correct" level (in terms of the risk appetite).
2. Both BTL and ATL proportions are low. Regardless of the statistical difference between the two populations, if the proportions are low, the threshold most likely needs to become less stringent to reduce the level of false positives.
3. Both BTL and ATL proportions are higher than the acceptable level of suspicious transactions; the threshold needs to become more stringent.
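As a sketch of the statistical comparison referenced above, R's built-in two-sample proportion test can be used to check whether the ATL suspicious proportion is significantly larger than the BTL proportion; the counts below are illustrative only.

# Suspicious dispositions found by investigators in each sample (illustrative counts)
atl_suspicious <- 18; atl_total <- 113
btl_suspicious <- 3;  btl_total <- 113

# One-sided test: is the ATL proportion of suspicious transactions larger than the BTL one?
prop.test(x = c(atl_suspicious, btl_suspicious),
          n = c(atl_total, btl_total),
          alternative = "greater")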
Approval and Implementation
Per the institution's review and approval process, receive all necessary approvals from key personnel prior to making any changes in production. Once all pertinent parties are in agreement, create a functional specification document that includes a brief overview of the rule change, what is currently configured, and the desired configuration changes to be made. It is imperative that the functional specification document is thoroughly vetted and signed off, validating that it provides all the necessary and accurate information to make the desired implementation changes.
Authors
Mayank Johri and Erik De Monte both work in the Bank Secrecy Act/Anti-Money Laundering Analytics group at First Republic Bank in San Francisco, California. Their contact information is included below.
Mayank Johri, Vice President Analytics
https://www.linkedin.com/in/johrim
Erik De Monte, Data Scientist
https://www.linkedin.com/in/edemonte
Appendix: Detection Engine Shell Code (R)
Included below is a sample of transaction data and detection engine shell code, written in R, that the data can be run through to illustrate the methodology discussed above. Please note that the sample transaction data should be saved as a comma-separated file, with the headers included, as "Transactions.csv".
The R code was built using RStudio Version 0.99.902 and has been commented to navigate the user
through each step of the methodology.
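A minimal illustration of the expected "Transactions.csv" layout, inferred from the column references and type conversions in the shell code, is shown below; the name of the date column, the ordering of the alert, case, and SAR columns, and the sample values are assumptions.

Transaction_Key,Transaction_Date,Alert_Nbr,Case_Nbr,SAR_Nbr,Attribute_01,Attribute_02,Attribute_03
TXN0001,01/15/2016,A-1001,NULL,NULL,12,3,1050000
TXN0002,02/03/2016,NULL,NULL,NULL,5,1,400000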
Detection Engine Shell Code (R)
#//////////////////////////////////////////////////////////////////////
# Name: Re-Tuning Permutation Analysis - Example R Script
# Date: October 2016
# Developers: Erik De Monte, Mayank Johri
#//////////////////////////////////////////////////////////////////////
# Assumptions:
#
# i. There are 4 tables of Transactions available to be run through the engine:
# - All transactions for the date period identified
# - All transactions related to historic alerts for the date period identified
# - All transactions related to historic alerts that were escalated to case
# - All transactions related to historic alerts that were escalated to SAR
#
# ii. The data available for the relevant thresholds being re-tuned are available.
#//////////////////////////////////////////////////////////////////////
#0. Preliminary Procedures
#//////////////////////////////////////////////////////////////////////
# Load relevant preinstalled R Packages
library(cluster)
library(doBy)
library(base)
library(lubridate)
library(utils)
library(RODBC)
library(reshape)
library(dplyr)
# Upload and Format Data Frame
transactions <- read.csv(file='Transactions.csv', sep=',', header=TRUE, stringsAsFactors = FALSE)
transactions[,1] <- as.character(transactions[,1])
transactions[,2] <- as.Date(transactions[,2], format = "%m/%d/%Y")
transactions[,3] <- as.character(transactions[,3])
transactions[,4] <- as.character(transactions[,4])
transactions[,5] <- as.character(transactions[,5])
transactions[,6] <- as.numeric(transactions[,6])
transactions[,7] <- as.numeric(transactions[,7])
transactions[,8] <- as.numeric(transactions[,8])
#//////////////////////////////////////////////////////////////////////
# 1. Create a reference table for permutation matrix.
#//////////////////////////////////////////////////////////////////////
# 1a. Define Threshold Variables
# For the sake of this example, let us assume that the current thresholds are set at:
# threshold_01 = 10
# threshold_02 = 2
# threshold_03 = 1000000
# To define exact values to a threshold, assign it to a vector ("c")
# To define a sequence of values, use the "seq" function under the syntax:
# threshold = seq(a,b,c) ; Go from a to b in increments of c
threshold_01 = c(7, 10, 12)
threshold_02 = c(1,2,3)
threshold_03 = seq(800000,1200000,200000)
# 1b. Create the Threshold Table
x_Threshold_Table <- expand.grid(threshold_01,threshold_02,threshold_03)
# 1c. Accurately define the columns in the new table
names(x_Threshold_Table)[names(x_Threshold_Table) == 'Var1'] <- 'Example_Threshold_01'
names(x_Threshold_Table)[names(x_Threshold_Table) == 'Var2'] <- 'Example_Threshold_02'
names(x_Threshold_Table)[names(x_Threshold_Table) == 'Var3'] <- 'Example_Threshold_03'
# 1d. Clean up your environment and remove unnecessary variables.
rm(threshold_01)
rm(threshold_02)
rm(threshold_03)
#//////////////////////////////////////////////////////////////////////
# 2. Loop transactions through each permutation in the Permutation Matrix (x_Threshold_Table)
# Count the number of events
#//////////////////////////////////////////////////////////////////////
# Count of Transactions - Current Thresholds
#//////////////////////////////////////////////////////////////////////
# 2a. Set the baseline alert count based on current transactions
# For the sake of this example, let us assume that the current thresholds are set at:
# threshold_01 = 10
# threshold_02 = 2
# threshold_03 = 1000000
# In this example, there are 7 historic alerts for the transaction set.
alerts <- subset(transactions, transactions$Alert_Nbr != 'NULL')
alert_count <- as.numeric(length(alerts$Transaction_Key))
x_Final <- data.frame(x_Threshold_Table[1:3], alert_count)
names(x_Final)[names(x_Final) == 'alert_count'] <- 'Transaction Alert - Historic'
rm(alert_count)
#//////////////////////////////////////////////////////////////////////
# Count of Transactions - Permutation Thresholds
#//////////////////////////////////////////////////////////////////////
# 2b. Create a variable which logs the number of events which fit the respective loop
Var_Event <- rep(NA,nrow(x_Threshold_Table))
# 2c. Loop through all threshold permutation combinations and create a subset of the transactions that would alert
# var_index is used to temporarily hold the count of alerts between loops
for (i in 1:nrow(x_Threshold_Table)){
var_index <- subset(transactions, (
(transactions$Attribute_01 >= x_Threshold_Table$Example_Threshold_01[i])
& (transactions$Attribute_02 >= x_Threshold_Table$Example_Threshold_02[i])
& (transactions$Attribute_03 >= x_Threshold_Table$Example_Threshold_03[i])
))
#Count
Var_Event[i] <- as.numeric(length(var_index$Transaction_Key))
rm(var_index)
}
Event_Count=as.matrix(Var_Event)
x_Final = cbind(x_Final, Event_Count)
names(x_Final)[names(x_Final) == 'Event_Count'] <- 'Transaction Data - Count'
rm(Event_Count)
rm(Var_Event)
rm(i)
#//////////////////////////////////////////////////////////////////////
# Count of Historic Case Transactions
#//////////////////////////////////////////////////////////////////////
# Emulate the logic above using only the transactions related to historic cases.
# Append ("cbind") the results to the final permutation table as done above.
# Name it "Case Event Count"
cases <- subset(transactions, transactions$Case_Nbr != 'NULL')
Var_Event <- rep(NA,nrow(x_Threshold_Table))
for (i in 1:nrow(x_Threshold_Table)){
var_index <- subset(cases, (
(cases$Attribute_01 >= x_Threshold_Table$Example_Threshold_01[i])
& (cases$Attribute_02 >= x_Threshold_Table$Example_Threshold_02[i])
& (cases$Attribute_03 >= x_Threshold_Table$Example_Threshold_03[i])
))
#Count
Var_Event[i] <- as.numeric(length(var_index$Transaction_Key))
rm(var_index)
}
Event_Count=as.matrix(Var_Event)
x_Final = cbind(x_Final, Event_Count)
names(x_Final)[names(x_Final) == 'Event_Count'] <- 'Case Event Count'
rm(Event_Count)
rm(Var_Event)
rm(i)
#//////////////////////////////////////////////////////////////////////
# Count of Historic SAR Transactions
#//////////////////////////////////////////////////////////////////////
# Emulate the logic above using only the transactions related to historic SARs.
# Append ("cbind") the results to the final permutation table as done above.
# Name it "SAR Event Count"
sars <- subset(transactions, transactions$SAR_Nbr != 'NULL')
Var_Event <- rep(NA,nrow(x_Threshold_Table))
for (i in 1:nrow(x_Threshold_Table)){
var_index <- subset(sars, (
(sars$Attribute_01 >= x_Threshold_Table$Example_Threshold_01[i])
& (sars$Attribute_02 >= x_Threshold_Table$Example_Threshold_02[i])
& (sars$Attribute_03 >= x_Threshold_Table$Example_Threshold_03[i])
))
#Count
Var_Event[i] <- as.numeric(length(var_index$Transaction_Key))
rm(var_index)
}
Event_Count=as.matrix(Var_Event)
x_Final = cbind(x_Final, Event_Count)
names(x_Final)[names(x_Final) == 'Event_Count'] <- 'SAR Event Count'
rm(Event_Count)
rm(Var_Event)
rm(i)
#//////////////////////////////////////////////////////////////////////
# Anchor your analysis to the number of SARs filed, remove any combinations which would have
# missed a prior filed SAR.
sar_count <- as.numeric(length(sars$Transaction_Key))
x_Final <- subset(x_Final, x_Final$`SAR Event Count` >= sar_count)
rm(sar_count)
#//////////////////////////////////////////////////////////////////////
#//////////////////////////////////////////////////////////////////////
#//////////////////////////////////////////////////////////////////////
#//////////////////////////////////////////////////////////////////FIN.