Anomaly detection (or outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and process monitoring in domains including energy, healthcare and finance.
In this workshop, we will cover the core techniques in anomaly detection and discuss advances in deep learning in this field.
Through case studies, we will discuss how anomaly detection techniques can be applied to various business problems. We will also demonstrate examples using R, Python, Keras and TensorFlow to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
1. Anomaly Detection
Techniques and Best Practices
2019 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
2. About us:
• Data Science, Quant Finance and Machine Learning startup
• Technologies using MATLAB, Python and R
• Programs
  ◦ Analytics Certificate Program
  ◦ Fintech programs
• Platform
3. Sri Krishnamurthy, Founder and CEO
• Founder of QuantUniversity LLC. and www.analyticscertificate.com
• Advisory and consultancy for financial analytics
• Prior experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers
• Regular columnist for the Wilmott Magazine
• Author of the forthcoming book "Financial Modeling: A Case Study Approach", published by Wiley
• Chartered Financial Analyst and Certified Analytics Professional
• Teaches analytics in the Babson College MBA program and at Northeastern University, Boston
4. What is anomaly detection?
• Anomalies or outliers are data points within a dataset that appear to deviate markedly from expected outputs under certain assumptions.
• Anomaly detection is the process of finding patterns in data that do not conform to prior expected behavior.
• Anomaly detection is increasingly employed on the big data captured by sensors, social media platforms, large networks, etc., in areas such as energy systems, medical devices, banking and network intrusion detection.
5. Anomaly vs Outliers
• Outliers are data points that are considered out of the ordinary or abnormal. This includes noise.
• Anomalies are a special kind of outlier that carries significant, critical or actionable information which could be of interest to analysts.
[Figure: two clusters, labeled 1 and 2. All points not in clusters 1 and 2 are outliers; point B, for which both X and Y are large, is an anomaly.]
7. Applications of Anomaly Detection
• Fraud detection
  ◦ Credit card fraud detection
    – By owner or by operation
  ◦ Mobile phone fraud/anomaly detection
    – Calling behavior, volume, etc.
  ◦ Insurance claim fraud detection
    – Medical malpractice
    – Auto insurance
  ◦ Insider trading detection
• E-commerce
  ◦ Pricing issues
  ◦ Network issues
8. Examples of Anomaly Detection
• Intrusion detection:
  ◦ Detect malicious activity in computer systems
  ◦ This could be host-based or network-based
• Medical anomalies
9. Examples of Anomaly Detection
• Manufacturing and sensors:
  ◦ Fault detection
  ◦ Heat and fire sensors
• Text data
  ◦ Novel topics and events
  ◦ Plagiarism
10. Anomaly Detection Methods
• Most outlier detection methods generate an output that falls into one of the following groups:
  ◦ Real-valued outlier score: quantifies the tendency of a data point to be an outlier by assigning it a score or probability.
  ◦ Binary label: obtained by applying a threshold to the outlier scores, labeling each point as an inlier or an outlier.
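A minimal base-R sketch of the two output types described above: a real-valued score (here simply the z-score magnitude) and the binary label obtained by thresholding it. The toy data and the 0.975 cut-off are illustrative assumptions, not part of the original slides.

```r
# Minimal sketch (base R): real-valued outlier scores vs. binary labels.
x <- c(2.1, 2.4, 2.2, 2.3, 9.7, 2.5, 2.2)   # toy data with one suspicious value

score <- abs(as.numeric(scale(x)))   # real-valued score: magnitude of the z-score
label <- score > qnorm(0.975)        # binary label: threshold converts scores to inlier/outlier

data.frame(x = x, score = round(score, 2), outlier = label)
```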
11. Four methodologies for anomaly detection
1. Graphical approaches
2. Statistical approaches
3. Machine learning approaches
4. Time series methods
13. Graphical approaches
• Graphical methods utilize extreme value analysis, in which outliers correspond to the statistical tails of probability distributions.
• Statistical tails are most commonly used for one-dimensional distributions, although the same concept can be applied to the multidimensional case.
• It is important to understand that all extreme values are outliers, but the reverse may not be true.
• For instance, in the one-dimensional dataset {1, 3, 3, 3, 50, 97, 97, 97, 100}, the observation 50 is close to the mean and isn't considered an extreme value, but since it is the most isolated point, it should be considered an outlier from a generative perspective.
14. Box plot
• A standardized way of displaying the variation of data based on the five-number summary: minimum, first quartile, median, third quartile and maximum.
• This plot does not make any assumptions about the underlying statistical distribution.
• Any data point falling outside the whiskers (the plotted "minimum" and "maximum") is considered an outlier.
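A minimal base-R sketch of the idea with illustrative data: boxplot.stats() applies the usual 1.5 × IQR whisker rule and returns the points plotted beyond the whiskers.

```r
# Minimal sketch (base R): boxplot-based outlier flagging.
x <- c(12, 13, 14, 15, 15, 16, 17, 18, 45)   # 45 lies well beyond the upper whisker

boxplot(x, main = "Box plot")
boxplot.stats(x)$out    # values outside the whiskers, i.e. the candidate outliers
```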
16. Scatter plot
• A mathematical diagram which uses Cartesian coordinates to plot ordered pairs, typically showing the relationship between two random variables.
• This plot is useful for detecting outliers.
• An outlier is a data point that doesn't seem to fit with the rest of the data points.
• A scatter plot can reveal outliers that are extreme in either one of the two variables or in their joint behavior.
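A minimal base-R sketch with simulated data: the injected point is unremarkable in each variable on its own but clearly breaks the joint relationship, which is exactly what a scatter plot exposes.

```r
# Minimal sketch (base R): a scatter plot revealing a joint outlier.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.3)
x[100] <- 1.5; y[100] <- -4          # injected point that violates the x-y relationship

plot(x, y, main = "Scatter plot")
points(x[100], y[100], col = "red", pch = 19)   # highlight the suspect point
```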
18. Q-Q plot
• In statistics, a Q-Q plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
• If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the line y = x.
Source: Wikipedia
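A minimal base-R sketch with simulated data: comparing the sample against a normal distribution, points that fall far from the reference line are candidate outliers.

```r
# Minimal sketch (base R): normal Q-Q plot with injected extreme values.
set.seed(1)
x <- c(rnorm(98), 6, -7)   # two injected extreme values

qqnorm(x)
qqline(x)                  # points far from this line deviate from the normal pattern
```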
19. Adjusted quantile plot
• This plot identifies possible multivariate outliers by calculating the Mahalanobis distance of each point from the center of the data.
• The Mahalanobis distance between vectors x and y in R^n can be formulated as:
  d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}
  where x and y are random vectors from the same distribution with covariance matrix S.
• An outlier is defined as a point whose distance is larger than some predetermined value.
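A minimal base-R sketch of the distance-based idea, using the built-in iris data as an illustrative example; the 97.5% chi-square quantile is an assumed cut-off, whereas the adjusted quantile plot on the following slides adapts the threshold to the data.

```r
# Minimal sketch (base R): Mahalanobis distances with a chi-square cut-off.
X <- as.matrix(iris[, 1:4])

d2     <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared distances from the center
cutoff <- qchisq(0.975, df = ncol(X))                          # assumed chi-square threshold
which(d2 > cutoff)                                             # indices of candidate outliers
```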
20. Adjusted quantile plot
• Before applying this method, and many other parametric multivariate methods, we first need to check whether the data are multivariate normally distributed, using multivariate normality tests such as Royston, Mardia, chi-square or univariate plots.
• In R, we use the 'mvoutlier' package, which implements the graphical approaches discussed above.
21. Adjusted quantile plot (see Graphical_Approach.ipynb)
• Min-max normalization is applied before diving into the analysis.
• A multivariate normality test is run first.
• The 'outliers' Boolean vector identifies the outliers.
• 'alpha' defines the maximum thresholding proportion.
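A minimal sketch of the workflow these annotations describe, assuming the 'mvoutlier' package is installed; the iris columns and the alpha value are illustrative choices, not taken from the notebook.

```r
# Minimal sketch: adjusted quantile plot with the 'mvoutlier' package.
library(mvoutlier)

X <- as.matrix(iris[, 1:4])
X <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))  # min-max normalization

res <- aq.plot(X, alpha = 0.05)   # alpha caps the maximum thresholding proportion
which(res$outliers)               # Boolean vector identifying the flagged observations
```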
24. Symbol plot
• This plot displays two-dimensional data using robust Mahalanobis distances based on the minimum covariance determinant (MCD) estimator with adjustment.
• The minimum covariance determinant (MCD) estimator looks for the subset of h data points whose covariance matrix has the smallest determinant.
• The four ellipsoids drawn in the plot show the Mahalanobis distances corresponding to the 25%, 50%, 75% and adjusted quantiles of the chi-square distribution.
25. Symbol plot (see Graphical_Approach.ipynb)
• The parameter 'quan' defines the fraction of observations used for the minimum covariance determinant estimate; the default is 0.5.
• 'alpha' defines the fraction of observations used for calculating the adjusted quantile.
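A minimal sketch assuming the 'mvoutlier' package; the two iris columns and the parameter values are illustrative, mirroring the 'quan' and 'alpha' arguments described above.

```r
# Minimal sketch: symbol plot of two-dimensional data with 'mvoutlier'.
library(mvoutlier)

X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])  # two-dimensional data

# 'quan': fraction of observations used for the MCD estimate (default 0.5)
# 'alpha': used for the adjusted quantile ellipsoid
symbol.plot(X, quan = 0.5, alpha = 0.025)
```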
27. Hypothesis testing
• This method draws conclusions about a sample point by testing whether it comes from the same distribution as the training data.
• Statistical tests, such as the t-test and the ANOVA table, can be used on multiple subsets of the data.
• Here, the level of significance, i.e., the probability of incorrectly rejecting the true null hypothesis, needs to be chosen.
• To apply this method in R, the 'outliers' package, which implements these statistical tests, is used.
28. Chi-square test
• The chi-square test is a simple test for detecting outliers in univariate data, based on the chi-square distribution of the squared difference between a data point and the sample mean.
• In this test, the sample variance is used as the estimator of the population variance.
• The chi-square test helps identify the lowest and highest values, since outliers can exist in both tails of the data.
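A minimal sketch assuming the 'outliers' package, with illustrative data; chisq.out.test() tests the value with the largest squared deviation from the sample mean, and opposite = TRUE tests the other tail.

```r
# Minimal sketch: chi-square outlier test from the 'outliers' package.
library(outliers)

x <- c(2.1, 2.4, 2.2, 2.3, 9.7, 2.5, 2.2)
chisq.out.test(x)                    # tests the most extreme value (here the highest, 9.7)
chisq.out.test(x, opposite = TRUE)   # tests the value in the opposite tail
```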
30. Grubbs' test
• This test is defined for the following hypotheses:
  H0: there are no outliers in the data set
  H1: there is exactly one outlier in the data set
• The Grubbs' test statistic is defined as:
  G = \frac{\max_{i} |x_i - \bar{x}|}{s}
  where \bar{x} is the sample mean and s is the sample standard deviation.
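A minimal sketch assuming the 'outliers' package, with illustrative data; grubbs.test() computes G as defined above together with its p-value.

```r
# Minimal sketch: Grubbs' test for a single outlier ('outliers' package).
library(outliers)

x <- c(2.1, 2.4, 2.2, 2.3, 9.7, 2.5, 2.2)
grubbs.test(x)   # a small p-value rejects H0, i.e. one outlier is present
```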
33. Scores
• Scores quantify the tendency of a data point to be an outlier by assigning it a score or probability.
• The most commonly used scores are:
  ◦ Normal (z) score: (x - \bar{x}) / s, the deviation from the mean divided by the standard deviation
  ◦ t-Student score: z \sqrt{n - 2} / \sqrt{n - 1 - z^2}
  ◦ Chi-square score: (x - \bar{x})^2 / s^2
  ◦ IQR score: the distance beyond the nearest quartile divided by the interquartile range (Q3 - Q1)
• By using the 'scores' function in R, p-values can be returned instead of scores.
35. Scores (see Statistical_Approach.ipynb)
• By setting 'prob' to a specific value, a logical vector is returned that flags as outliers the data points whose probabilities exceed this cut-off value.
• By setting 'type' to IQR, values below the first quartile or above the third quartile are considered, and the difference between each such value and the nearest quartile, divided by the IQR, is calculated.
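A minimal sketch assuming the 'outliers' package; the simulated data, the 0.95 probability cut-off and the 1.5 IQR limit are illustrative choices.

```r
# Minimal sketch: outlier scores with the 'outliers' package.
library(outliers)

set.seed(1)
x <- rnorm(50)
x[50] <- 6                                # one injected outlier

scores(x, type = "z")                     # normal (z) scores
scores(x, type = "chisq", prob = 0.95)    # TRUE where the chi-square probability exceeds 0.95
scores(x, type = "iqr", lim = 1.5)        # TRUE for points more than 1.5 IQRs beyond a quartile
```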
37. Linear regression
• Linear regression investigates the linear relationship between variables and predicts one variable based on one or more other variables. It can be formulated as:
  Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i
  where Y and the X_i are random variables, the \beta_i are regression coefficients and \beta_0 is a constant.
• In this model, the ordinary least squares estimator is usually used, minimizing the squared differences between the observed values of the dependent variable and the values predicted from the independent variables.
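A minimal base-R sketch of residual-based outlier flagging with an OLS fit on simulated data; the injected point and the |r| > 3 rule on standardized residuals are illustrative assumptions.

```r
# Minimal sketch (base R): flag points with large standardized residuals.
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.2)
y[10] <- y[10] + 3                    # injected outlier

fit <- lm(y ~ x)                      # ordinary least squares fit
r   <- rstandard(fit)                 # standardized residuals
which(abs(r) > 3)                     # candidate outliers
```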
38. Piecewise/segmented regression
• A method in regression analysis in which the independent variable is partitioned into intervals, allowing a separate linear model to be fitted to the data in each range.
• This model can be applied when there are "breakpoints" and clearly two different linear relationships in the data, with a sudden, sharp change in directionality. Below is a simple segmented regression with a single breakpoint:
  Y = c_1 + b_1 X  for  X < X_b
  Y = c_2 + b_2 X  for  X > X_b
  where Y is the predicted value, X is the independent variable, c_1 and c_2 are constants, b_1 and b_2 are regression coefficients, and X_b is the breakpoint.
40. Piecewise/segmented regression
• For this example, we use the 'segmented' package in R, first illustrating piecewise regression for a two-dimensional data set which has a breakpoint around z = 0.5.
• pmax() (parallel maximum) is used when simulating y so that the slope changes at the breakpoint.
See Piecewise_Regression.ipynb
42. Piecewise/segmented regression
• Finally, outliers can be detected in each segment by setting rules on the residuals of the model.
• Here, a rule is set on the residuals of the points with z less than 0.5: points whose residuals fall outside the chosen cut-off are flagged as outliers.
See Piecewise_Regression.ipynb
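A minimal sketch of this pipeline assuming the 'segmented' package; the simulated data (with a kink near z = 0.5, built with pmax()), the injected outlier and the 3-standard-deviation residual rule are illustrative, not taken from the notebook.

```r
# Minimal sketch: segmented regression, then residual-based outlier rules per segment.
library(segmented)

set.seed(1)
z <- runif(200)
y <- 2 + 1.5 * pmax(z - 0.5, 0) + rnorm(200, sd = 0.05)   # slope changes at z = 0.5
y[15] <- y[15] + 0.4                                      # injected outlier

fit     <- lm(y ~ z)
seg.fit <- segmented(fit, seg.Z = ~z, psi = 0.5)          # estimate the breakpoint near 0.5
res     <- residuals(seg.fit)

which(z <  0.5 & abs(res) > 3 * sd(res))   # illustrative rule for the first segment
which(z >= 0.5 & abs(res) > 3 * sd(res))   # and for the second segment
```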
44. Autoencoder
• The goal is to have the reconstruction x̂ approximate the input x.
• Interesting applications include:
  ◦ Data compression
  ◦ Visualization
  ◦ Pre-training neural networks
45. Demo in Keras¹
1. https://blog.keras.io/building-autoencoders-in-keras.html
2. https://keras.io/models/model/
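A minimal sketch in R, assuming the 'keras' package with a TensorFlow backend and loosely following the Keras blog post linked above; the layer sizes and the use of mean squared reconstruction error as an anomaly score are illustrative choices.

```r
# Minimal sketch: a small autoencoder whose reconstruction error serves as an anomaly score.
library(keras)

input   <- layer_input(shape = 784)                                    # e.g. flattened 28x28 images
encoded <- input %>% layer_dense(units = 32, activation = "relu")      # bottleneck representation
decoded <- encoded %>% layer_dense(units = 784, activation = "sigmoid")

autoencoder <- keras_model(input, decoded)
autoencoder %>% compile(optimizer = "adam", loss = "binary_crossentropy")

# After fitting on x_train (values scaled to [0, 1]):
# autoencoder %>% fit(x_train, x_train, epochs = 50, batch_size = 256)
# recon <- predict(autoencoder, x_test)
# score <- rowMeans((x_test - recon)^2)   # large reconstruction error suggests an anomaly
```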
46. Principal Component Analysis
• Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.
• In outlier analysis, we perform principal component analysis and compute p-values to test for outliers.
https://en.wikipedia.org/wiki/Principal_component_analysis
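A minimal base-R sketch of one common way to turn PCA into an outlier score, using reconstruction error from a reduced number of components; the iris data, the choice of two components and the 97.5% quantile cut-off are illustrative assumptions.

```r
# Minimal sketch (base R): PCA reconstruction error as an outlier score.
X <- scale(iris[, 1:4])      # center and scale the data
k <- 2                       # number of retained principal components

pca   <- prcomp(X)
recon <- pca$x[, 1:k] %*% t(pca$rotation[, 1:k])   # reconstruction from the first k components
score <- rowSums((X - recon)^2)                    # squared reconstruction error per observation

which(score > quantile(score, 0.975))              # candidate outliers
```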
47. Clustering-based approaches
• These methods are suitable for unsupervised anomaly detection.
• They aim to partition the data into meaningful groups (clusters) based on the similarities and relationships between the data points.
• Each data point is assigned a degree of membership for each of the clusters.
• Anomalies are those data points that:
  ◦ Do not fit into any cluster.
  ◦ Belong to a particular cluster but are far away from the cluster centroid.
  ◦ Form small or sparse clusters.
48. Clustering-based approaches
• These methods partition the data into K clusters by assigning each data point to its closest cluster centroid, minimizing the within-cluster sum of squares (WSS):
  \mathrm{WSS} = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{p} (x_{ij} - \mu_{kj})^2
  where S_k is the set of observations in the kth cluster and \mu_{kj} is the mean of the jth variable of the cluster center of the kth cluster.
• Then, the top n points that are farthest away from their nearest cluster centers are selected as outliers.
50. Clustering-based approaches
• The 'Kmod' package in R is used to show the application of the K-means model.
• In this example the number of clusters is chosen from the bend (elbow) graph and then passed to the k-mod function.
See Clustering_Approach.ipynb
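A minimal base-R sketch of the distance-to-centroid idea using kmeans() rather than the 'Kmod' package used in the notebook; the iris data, k = 3 and the top-5 selection are illustrative assumptions.

```r
# Minimal sketch (base R): k-means clustering, then flag the points farthest from their centroid.
set.seed(1)
X <- as.matrix(iris[, 1:4])
k <- 3

km         <- kmeans(X, centers = k, nstart = 25)
d_centroid <- sqrt(rowSums((X - km$centers[km$cluster, ])^2))   # distance to assigned centroid

head(order(d_centroid, decreasing = TRUE), 5)                   # top-5 candidate outliers
```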
54. Time-series method
• The time-series model is used to identify outliers in univariate time-series data only.
• To apply this model, we use the 'AnomalyDetection' package in R.
• This package was published by Twitter for detecting anomalies in time-series data in the presence of seasonality and an underlying trend, using statistical approaches.
• Since this package uses a specific algorithm to detect anomalies, we go over it in detail on the next slide.
55. Anomaly detection, R package
• Twitter's R package: https://github.com/twitter/AnomalyDetection
• Seasonal Hybrid ESD (S-H-ESD), which builds upon the generalized ESD test, is the underlying algorithm of this package.
• The algorithm employs time-series decomposition and statistical metrics together with the ESD test.
• Since time-series data exhibit a huge variety of patterns, time-series decomposition, a statistical method, is used to decompose the data into its four components.
• The four components are:
  1. Trend: the long-term progression of the series
  2. Cyclical: variations in recognizable cycles
  3. Seasonal: seasonal variations or fluctuations
  4. Irregular: random, irregular influences
• Find more about the ESD test in the tutorial slides.
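A minimal sketch assuming Twitter's 'AnomalyDetection' package is installed from the GitHub repository above; raw_data is the example series shipped with the package, and the max_anoms and direction settings are illustrative.

```r
# Minimal sketch: S-H-ESD anomaly detection on a seasonal time series.
# install via: devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

data(raw_data)                                 # (timestamp, count) example data from the package
res <- AnomalyDetectionTs(raw_data,
                          max_anoms = 0.02,    # flag at most 2% of the points
                          direction = "both",  # look for both high and low anomalies
                          plot      = TRUE)
res$anoms                                      # timestamps and values of the detected anomalies
```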
57. Summary
We have covered anomaly detection:
Introduction
✓ Definition of anomaly detection and its importance in energy systems
✓ Different types of anomaly detection methods: statistical, graphical and machine learning methods
Graphical approach
✓ Graphical methods include the box plot, scatter plot, adjusted quantile plot and symbol plot, which demonstrate outliers graphically
✓ The main assumption for applying graphical approaches is multivariate normality
✓ The Mahalanobis distance is mainly used for calculating the distance of a point from the center of a multivariate distribution
Statistical approach
✓ Statistical hypothesis testing includes the chi-square and Grubbs' tests
✓ Statistical methods may use either scores or p-values as thresholds to detect outliers
Machine learning approach
✓ Both supervised and unsupervised learning methods can be used for outlier detection
✓ Piecewise (segmented) regression can be used to identify outliers based on the residuals in each segment
✓ In the K-means clustering method, outliers are points that do not belong to any cluster, are far away from their cluster centroids, or form sparse clusters
✓ In PCA and autoencoder methods, points that are not reconstructed close to the original points are treated as anomalies
Time series
✓ Temporal outlier detection to detect anomalies, robust from a statistical standpoint, in the presence of seasonality and an underlying trend
60. Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.