This document outlines an eight-phase process for tuning automated transaction monitoring scenarios so that the identification of suspicious activity is balanced against resource efficiency. Key aspects of the process include using statistical techniques such as clustering to set baseline thresholds, qualitatively assessing above- and below-threshold samples, and iteratively refining thresholds based on case outcomes. The goal is to establish thresholds representative of the 85th percentile, which defines the cutoff between normal and unusual transactions.
Approach to AML Rule Thresholds
By Mayank Johri, Amin Ahmadi, Kevin Kinkade, Sam Day, Michael Spieler, Erik DeMonte
January 12, 2016
Introduction
Institutions constantly face the challenge of managing growing alert volumes from automated transaction monitoring systems, new money laundering typologies to surveil, and more robust regulatory guidance. The question is how BSA/AML departments will scale to meet demand while managing compliance cost. To effectively set baseline thresholds when configuring a new detection scenario, or to improve the efficacy of an existing scenario, apply statistical techniques and industry standards to identify the cut-off between "normal" and "abnormal" or "suspicious" activity. These estimated thresholds are then either challenged or reinforced by the qualitative judgement of professional investigators during a simulated 'pseudo' investigation or 'qualitative assessment'.
An effective AML transaction monitoring program includes a standardized process for tuning, optimizing, and testing
AML scenarios/typologies that is understandable, repeatable and consistent.
An appropriately tuned or optimized scenario balances two competing objectives: maximizing the identification of suspicious activity while simultaneously maximizing resource efficiency. The two objectives, which must remain in constant balance, are:
(1) Reduce the number of 'false positives': alerts generated on transactions that do not require further investigation or the filing of a Suspicious Activity Report (SAR).
(2) Reduce the number of 'false negatives': transactions that were not alerted but that do require further investigation or the filing of a SAR.
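To make the tradeoff concrete, here is a minimal Python sketch (not from the source): sweeping a candidate amount threshold over historically dispositioned transactions shows the two error counts moving in opposite directions. The data and the `amount`/`sar_filed` column names are hypothetical.

```python
import pandas as pd

# Hypothetical historical transactions with known dispositions
txns = pd.DataFrame({
    "amount":    [900, 2_500, 4_800, 9_900, 12_000, 15_500, 40_000],
    "sar_filed": [False, False, False, True, False, True, True],
})

for threshold in [1_000, 5_000, 10_000, 20_000]:
    alerted = txns["amount"] >= threshold
    false_pos = (alerted & ~txns["sar_filed"]).sum()   # alerted, but no SAR was needed
    false_neg = (~alerted & txns["sar_filed"]).sum()   # not alerted, but a SAR was needed
    print(f"threshold={threshold:>6}: false positives={false_pos}, false negatives={false_neg}")
```

Raising the threshold suppresses false positives but lets false negatives grow; tuning looks for the balance point.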
Phases
The following outlines the eight-phase process for initial tuning:
Phase 0 | Planning. The Policy Office (PO) works closely with the Analytics team to strategize the scenario, stratification, and parameters that will be used to conduct a threshold analysis.
Phase 1 | Assess Data. Analytics communicates to Information Technology (IT) which data fields will be required to perform the analysis. IT then determines whether the ETL of these fields into the Transaction Monitoring System is a near- or long-term effort.
Phase 2 | Query Data. Analytics queries the required transactional data for analysis.
Phase 3 | Quantitative Analysis. Analytics stratifies the data as required, grouping like attributes or 'non-tunable parameters' (such as entity/consumer, cash-intensive business/non-CIB, credit/debit, and high-/medium-risk destinations) to account for like-attribute behavior patterns.
Transformation
Once stratified, Analytics performs transformations to the data as required (such as a 90-day rolling count, sum, and standard deviation).
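As a hedged sketch of these rolling transformations (the source names no tooling), trailing 90-day aggregates can be computed per account with pandas; the `account_id`, `txn_date`, and `amount` column names are hypothetical.

```python
import pandas as pd

# Hypothetical transaction history
txns = pd.DataFrame({
    "account_id": ["A", "A", "A", "B", "B"],
    "txn_date": pd.to_datetime(
        ["2015-01-05", "2015-02-20", "2015-05-30", "2015-01-10", "2015-03-01"]),
    "amount": [500.0, 1_200.0, 700.0, 9_800.0, 10_200.0],
})

rolled = (
    txns.sort_values(["account_id", "txn_date"])
        .set_index("txn_date")
        .groupby("account_id")["amount"]
        .rolling("90D")                    # trailing 90-day window per account
        .agg(["count", "sum", "std"])      # rolling count / sum / standard deviation
        .reset_index()
)
print(rolled)
```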
Exploratory Data Analysis
Analytics performs a variety of visual and statistical exploratory data analysis (EDA) techniques to understand the correlation and impact that one or more parameters may have on the scenario, and therefore ultimately on alert-to-case efficacy. The objective of EDA is to further explore the recommended parameters (count, amount, standard deviation, etc.) proposed during the planning phase and to determine, with greater statistical precision, the best combination of parameters.
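A small EDA sketch under the same assumptions (hypothetical per-account aggregates) might inspect summary statistics and rank correlations between the candidate parameters before a combination is chosen.

```python
import pandas as pd

# Hypothetical 90-day aggregates: the candidate tunable parameters
events = pd.DataFrame({
    "count":  [3, 7, 2, 11, 5, 9],
    "amount": [4_200.0, 18_500.0, 900.0, 52_000.0, 7_300.0, 31_000.0],
    "std":    [300.0, 2_100.0, 50.0, 6_400.0, 800.0, 3_900.0],
})

print(events.describe())               # central tendency and spread per parameter
print(events.corr(method="spearman"))  # rank correlation between parameters
```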
Segmentation
Once stratified and transformed, Analytics clusters the data's 'tunable parameters' to account for skewness in the data population caused by outliers, in order to yield a statistically accurate threshold representative of the 85th percentile.
The 85th percentile is used as a standard when establishing a new rule: it sets an initial baseline threshold for defining the cutoff between "normal" and "unusual" transactional data. For normally distributed data with a bell-shaped curve (as depicted in the middle diagram of Figure 1.1 below), the mean value (i.e., the "expected" value) represents the central tendency of the data, and the 85th percentile lies approximately one standard deviation (σ) above this central tendency. The 85th percentile can therefore represent a reasonably conservative cutoff line or "threshold" for unusual activity. This baseline simply provides a starting point for further analysis and is later refined through qualitative judgement and alert-to-case efficacy.
If transactional data were always normally distributed, it would be easy to calculate one standard deviation above the mean to identify where to draw the line representing the 85th percentile of the data (a technique often referred to as 'quantiling'), thus establishing the threshold. In real-world applications, however, transactional data is often not normally distributed. It is frequently skewed by outliers (such as uniquely high-value customers); if statistical techniques that assume a normal distribution (such as quantiling) are applied to determine the 85th percentile (+1 standard deviation from the mean), the result will be a misrepresentative threshold that is offset by the outlier(s).
Figure 1.1 Distribution affected by Skewness
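A quick numerical check (illustrative, not from the source) shows the effect: on right-skewed data, the "mean + 1σ" shortcut lands well above the true 85th percentile.

```python
import numpy as np

rng = np.random.default_rng(0)
# Log-normal amounts: right-skewed, like transaction data with high-value outliers
amounts = rng.lognormal(mean=7, sigma=1.2, size=10_000)

naive_threshold = amounts.mean() + amounts.std()   # assumes a normal distribution
true_p85        = np.percentile(amounts, 85)       # empirical 85th percentile

print(f"mean + 1 std dev : {naive_threshold:,.0f}")
print(f"true 85th pctile : {true_p85:,.0f}")
# The naive threshold is dragged upward by the outliers, overstating the cutoff.
```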
Clustering
To account for skewness in the data, Analytics employs the clustering technique known as 'Partitioning Around Medoids' (PAM), or more specifically 'Clustering Large Applications' (CLARA). Clustering is an alternative method of data segmentation that is not predicated on the assumption that the data is normally distributed or has constant variance. Clustering works by breaking the dataset into distinct groups formed around one common entity of the dataset (which represents the group). This partition more accurately allows the assignment of a boundary (such as a target threshold to distinguish normal from unusual activity).
The first step of the clustering model is to determine the number of clusters to partition the data into. The methodology used to identify the optimal number of clusters takes into account two variables:
Approximation – how well the clustering model fits the current data set ("error measure")
Generalization – the cost of how well the clustering model could be re-performed with another, similar data set
The model for clustering can be seen in the figure below. As the number of clusters (x-axis) increases, the model becomes more complex and thus less stable. Increasing the number of clusters creates a more customized model catered to the current data set, resulting in a high level of approximation; however, cost increases because re-performing the model on a similar data set becomes more difficult. Inversely, the fewer the clusters, the less representative the model is of the current data set, but the more scalable it is to future, similar data sets. An objective function curve is plotted to map the tradeoff between the two competing objectives. This modelling methodology identifies the inflection point of the objective function of the two variables: the optimal number of clusters that accommodates both the current data set (approximation) and future data sets (generalization). Refer to Figure 1.2 below for a conceptual visual of the modelling methodology used to identify the optimal number of clusters.
Figure 1.2 Cluster Modeling – Identification of Number of Clusters
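A hedged sketch of this step: fit k-medoids for a range of cluster counts and inspect the cost curve for its inflection point. This assumes the scikit-learn-extra package's KMedoids (the source does not name a tool; CLARA is essentially PAM run on samples of the data).

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(1)
# Hypothetical skewed transaction amounts as a single tunable parameter
amounts = rng.lognormal(mean=7, sigma=1.2, size=2_000).reshape(-1, 1)

for k in range(2, 9):
    model = KMedoids(n_clusters=k, metric="euclidean", random_state=0).fit(amounts)
    # inertia_: total distance of points to their nearest medoid (the error measure)
    print(f"k={k}: cost={model.inertia_:,.0f}")
# Plotting cost against k, the elbow (inflection point) balances approximation
# against generalization and suggests the number of clusters.
```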
The basic approach to CLARA clustering is to partition objects/observations into several similar subsets. Data is partitioned based on 'Euclidean' distance to a common data point (called a medoid). A medoid, rather than being a calculated quantity (as is the case with the 'mean'), is the data point in the cluster that happens to have the minimal average dissimilarity to all other data points assigned to the same cluster. Euclidean distance is the most common measure of dissimilarity. The advantage of medoid-based cluster analysis is that no assumption is made about the structure of the data. In mean-based cluster analysis, by contrast, one makes the implicit, restrictive assumption that the data follows a Gaussian (bell-shaped) distribution.
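The medoid definition can be illustrated in a few lines of numpy: unlike a mean, the medoid is an observed data point, so a single outlier cannot drag it away from the bulk of the cluster. The values are hypothetical.

```python
import numpy as np

cluster = np.array([[1_000.0], [1_200.0], [1_150.0], [9_000.0]])  # one outlier
dist = np.abs(cluster - cluster.T)      # pairwise Euclidean distances (1-D case)
medoid_idx = dist.sum(axis=1).argmin()  # point with minimal total dissimilarity

print(cluster[medoid_idx])  # 1,200: an actual observation near the bulk
print(cluster.mean())       # 3,087.5: the mean is pulled toward the outlier
```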
The next step is to determine the number of dimensions for the parameter threshold analysis and to translate the transactional data into 'events'. An event is defined as a unique combination of all parameters for the identified scenario or rule. The full transactional data set is translated into a population of events. Event bands are formed based on the distribution of total events within the clusters; they can be thought of as the boundaries between the clusters (such that one or more parameters exhibit similarity).
Event Banding with One Parameter
When a scenario has only one tunable parameter (such as 'amount'), bands for this parameter are ideally generated in 5% increments beginning at the 50th percentile, resulting in ten bands: P50, P55, P60, P65, P70, P75, P80, P85, P90, and P95. The 50th percentile is chosen as a starting point to allow room for adjustment toward a more conservative cluster/threshold, pending the results of the qualitative analysis. In other words, it is important to include clusters well below, but still within reasonable consideration of, the target threshold definition of transaction activity that will be considered quantitatively suspicious. Refer to Figure 1.3 below.
Figure 1.3 85th Percentile and BTL/ATL
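A minimal sketch of one-parameter banding: percentile cut points in 5% increments from P50 to P95, each a prospective threshold for a hypothetical 'amount' parameter.

```python
import numpy as np

rng = np.random.default_rng(2)
amounts = rng.lognormal(mean=7, sigma=1.2, size=5_000)  # hypothetical amounts

percentiles = np.arange(50, 100, 5)          # 50, 55, ..., 95
bands = np.percentile(amounts, percentiles)  # one candidate threshold per band
for p, b in zip(percentiles, bands):
    print(f"P{p}: {b:,.0f}")
```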
Some parameters, such as 'transaction count', have a discrete range of values, and therefore the bands may not be able to be established at exactly the desired percentile level. In these cases, judgment is necessary to establish reasonable bands. Depending on their values, the bands will often be rounded to nearby numbers of a similar order of magnitude that are more easily socialized with internal and external business partners. Each of these bands corresponds to a parameter value to be tested as a prospective threshold for the scenario.
If the ten clusters have ranges that are drastically different from one another, adjustment to the bands may be necessary to make the clusters more reasonable while still maintaining a relatively even distribution of volume across the event bands. This process is subjective and will differ from scenario to scenario, especially in cases where a specific value for a parameter is inherent in the essence of the rule (e.g., $10,000 for cash structuring). In many cases the nature of the customer segment and activity being monitored may support creating fewer than ten event bands due to the low volume of activity for that segment.
Figure 1.4 Event Banding of 1 Parameter ‘Amount’
Event Banding with Two Parameters
When a scenario has two tunable parameters (such as 'count' and 'amount'), two independent sets of bands need to be established, one for each parameter, similar to the method used for one parameter.
Analysis of two tunable parameters may be thought of as 'two-dimensional': whereas one-parameter event banding is based on a single axis, event banding with two parameters is affected by two axes. For example, 'count' may represent the x-axis, while 'amount' may represent the y-axis. In this sense, the ultimate threshold is determined by a combination of both axes, and so are the event bands. Including additional parameters likewise adds additional dimensions and complexity.
As discussed above, while the 85th percentile is used to determine the threshold line, bands are created through clustering techniques starting at the 50th percentile to account for data points below, but still within reasonable consideration of, the target threshold definition of transaction activity that will be considered quantitatively suspicious. In the diagram below, we see banding between two parameters, count and value. Once the data is clustered, the 85th percentile is identified per the distribution (upper right-hand table in Figure 1.5 below), and qualitative judgement is exercised to set exact thresholds within the range, creating a model conducive to re-performance (refer to the discussion of "generalization" under clustering modelling above).
Figure 1.5 Event Banding of 2 Parameters ‘Value’/‘Count’
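A sketch of two-parameter banding under the same hypothetical setup: independent percentile bands are computed per axis, and an event falls above the line only when both parameters exceed the chosen band.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
events = pd.DataFrame({
    "count":  rng.poisson(5, 1_000) + 1,     # discrete x-axis parameter
    "amount": rng.lognormal(7, 1.2, 1_000),  # continuous y-axis parameter
})

pcts = np.arange(50, 100, 5)
count_bands  = np.round(np.percentile(events["count"], pcts))  # rounded: counts are discrete
amount_bands = np.percentile(events["amount"], pcts)

# Candidate joint threshold at the 85th percentile of each axis
c85 = count_bands[pcts == 85][0]
a85 = amount_bands[pcts == 85][0]
atl = (events["count"] >= c85) & (events["amount"] >= a85)
print(f"count >= {c85:.0f} and amount >= {a85:,.0f}: {atl.sum()} ATL events")
```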
Event Banding with More than Two Parameters
When a scenario has more than two tunable parameters (such as count, amount, and standard deviation), more than two independent sets of bands need to be established, one for each parameter, similar to the method used for two parameters.
Select Threshold(s)
The output of phase three is the 'threshold', or 'event' characteristics (a combination of thresholds in the case of multiple parameters), which serves as the baseline for 'suspicious' activity. If the threshold is set too low, too many alerts may be generated, creating extraneous noise and straining BSA/AML investigative resources. Conversely, if the threshold is set too high, suspicious activity may not generate alerts.
Phase 4 | Sampling. Analytics applies the thresholds determined during the quantitative analysis phase to the historical data in order to identify potential events for Above-the-Line (ATL) and Below-the-Line (BTL) analysis. Events flagged as 'ATL' are essentially the same as alerts, except that, since they are generated from historical data, they are referred to as 'pseudo alerts'. The number of transactions that fall into the ATL or BTL category determines the number of random samples required for a statistically significant qualitative assessment. The purpose of the samples is to evaluate the efficacy of Analytics' calculated thresholds: if the threshold is appropriately tuned, a larger percentage of events marked 'ATL' should be classified as 'suspicious' by an independent FIU investigator compared to the 'BTL' events. Analytics then packages these sample ATL and BTL transactions into a format that is understandable and readable by an FIU investigator (samples must include the transactional detail fields required for FIU to determine the nature of the transactions).
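The source does not specify how the sample size is derived; a common choice is the finite-population-corrected sample size for estimating a proportion (here at 95% confidence, a ±5% margin, and p = 0.5), sketched below with a hypothetical pseudo-alert count.

```python
import math
import pandas as pd

def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran-style sample size with finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2           # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite correction

atl_events = pd.DataFrame({"event_id": range(1, 1_201)})  # 1,200 ATL pseudo alerts
n = sample_size(len(atl_events))
sample = atl_events.sample(n=n, random_state=0)           # random sample for FIU review
print(f"reviewing {n} of {len(atl_events)} ATL pseudo alerts")
```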
Phase 5 | Training. Analytics orients the FIU investigators to the scenario, parameters and overall intent/spirit of
each rule so that during the qualitative analysis phase, the FIU investigators render appropriate independent
judgements for ATL and BTL samples.
Phase 6 | Qualitative Analysis. FIU assesses the sampled transactions from a qualitative perspective. During this phase, an independent FIU investigator analyzes each sampled pseudo alert as they would a real alert (without any bias regardless of the alert's classification as ATL or BTL). The investigator's evaluation must include consideration of the intent of each rule, and should include an assessment of both the qualitative and quantitative fields associated with each alert. The FIU investigator will generally evaluate each transaction through a lens akin to: "Given what is known from KYC, the origin/destination of funds, the beneficiary, etcetera, is it explainable that this consumer/entity would transact this dollar amount at this ...frequency, velocity, pattern, etc...?" FIU provides feedback to Analytics for each pseudo alert, classifying it as (a) 'Escalate-to-Case', (b) 'Alert Cleared – No Investigation Required (false positive)', (c) 'Alert Cleared – Error', or (d) 'Insufficient Information'. If the efficacy is deemed appropriate, a Business Review Session is scheduled to vote the rule into production.
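The efficacy check itself reduces to comparing escalate-to-case rates between the ATL and BTL samples; a well-placed threshold should show a markedly higher ATL rate. The dispositions below are hypothetical.

```python
import pandas as pd

# Hypothetical FIU dispositions for 200 ATL and 200 BTL sampled pseudo alerts
results = pd.DataFrame({
    "band":        ["ATL"] * 200 + ["BTL"] * 200,
    "disposition": (["Escalate-to-Case"] * 38 + ["Cleared"] * 162
                    + ["Escalate-to-Case"] * 4 + ["Cleared"] * 196),
})

rates = (results["disposition"].eq("Escalate-to-Case")
         .groupby(results["band"]).mean())
print(rates)  # e.g., ATL 0.19 vs BTL 0.02 supports the chosen threshold
```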
Phase 7 | Business Review Session. PO, Analytics and FIU present their findings for business review to voting
members.
Phase 8 | Implementation. Analytics provides functional specifications to IT to implement the scenario within the Transaction Monitoring System.