Anomaly detection : QuantUniversity Workshop

www.globalbigdataconference.com
Twitter : @bigdataconf

Location:
Santa Clara, CA
August 31st 2016
Anomaly Detection
Techniques and Best Practices
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com

- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits

• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO

Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program in 2016

Local installation instructions
• If you don’t have R,
▫ Install R from https://cran.r-project.org/
▫ Install RStudio from
https://www.rstudio.com/products/rstudio/download2/
▫ Install IRkernel from
http://irkernel.github.io/installation/
▫ Refer to www.analyticscertificate.com/GBDC for slides and examples

What is anomaly detection?
• Anomalies or outliers are data points within a dataset
that appear to deviate markedly from expected outputs.
• An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism [1].
• Anomaly detection refers to the problem of finding
patterns in data that do not conform to expected behavior.
[1] D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.

Anomaly vs Outliers
• Outliers are data points that are considered out of the ordinary or
abnormal. This includes noise.
• Anomalies are a special kind of outlier that has significant/
critical/actionable information which could be of interest to
analysts.
Figure (clusters labeled 1 and 2): All points not in clusters 1 & 2 are Outliers.
Point B is an Anomaly (both X and Y are large).

Applications of Anomaly Detection
• Fraud Detection
▫ Credit card fraud detection
 By owner or by operation
▫ Mobile phone fraud/anomaly detection
 Calling behavior, volume etc.
▫ Insurance claim fraud detection
 Medical malpractice
 Auto insurance
▫ Insider trading detection
• E-commerce
▫ Pricing issues
▫ Network issues

Examples of Anomaly Detection
• Intrusion detection:
▫ Detect malicious activity in computer systems
▫ This could be host-based or network-based
• Medical anomalies

Examples of Anomaly Detection
• Manufacturing and sensors:
▫ Fault detection
▫ Heat, fire sensors
• Text data
▫ Novel topics, events
▫ Plagiarism

Anomaly Classification
• Anomalies can be classified into 3 major categories
1. Point Anomalies
 If an instance is anomalous compared with the rest of the instances, the anomaly
is considered a point anomaly
2. Contextual Anomalies
 If an instance is anomalous in a specific context, the anomaly is
considered a contextual anomaly
3. Collective Anomalies
 If a collection of related data records is anomalous with respect to the entire
data set, the anomaly is a collective anomaly

Point Anomalies
• In the figure, points o1 and o2 are considered point anomalies
• Examples:
▫ A 50% increase in daily stock price
▫ A credit card transaction attempt for $5000 (assuming you have never
had a single transaction for anything above $1000)

Contextual Anomalies
• In the figure, temperature t2 is an anomaly
• Note that t1 is lower than t2 but contextually, t1 is expected and t2
isn’t when compared to records around it.

Collective Anomalies
• Multiple Buy Stock transactions followed by a sequence of Sell
transactions around an earnings release date may be anomalous
and may indicate insider trading.
• Consider a recorded sequence of network activities.
• Though ssh, buffer-overflow and ftp themselves are not anomalous
activities, a sequence of the three indicates a web-based attack.
• Similarly, multiple HTTP requests from an IP address may indicate a
crawler in action.

Collective Anomalies
• In medicine, abnormal ECG pattern detection would involve looking
for collective anomalies like Premature Atrial Contraction
Source: http://www.fprmed.com/Pages/Cardio/PAC.html

Illustration of four approaches to Anomaly Detection in R
1. Graphical approach
2. Statistical approach
3. Machine learning approach
4. Density based approach

 Boxplot
 Scatter plot
 Adjusted quantile plot
 Symbol plot

Graphical approaches
• Graphical methods utilize extreme value analysis, in which outliers
correspond to the statistical tails of probability distributions.
• Statistical tails are most commonly used for one-dimensional
distributions, although the same concept can be applied to the
multidimensional case.
• It is important to understand that all extreme values are outliers,
but the reverse may not be true.
• For instance, in the one-dimensional dataset
{1,3,3,3,50,97,97,97,100}, observation 50 is approximately equal to the
mean and isn’t considered an extreme value, but since it is the
most isolated point, it should be considered an outlier.
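The difference between an extreme value and an isolated value can be checked numerically on the dataset above. The following is a minimal Python sketch (the workshop's own examples are in R); the helper names `z_scores` and `nearest_gap` are made up for this illustration.

```python
from statistics import mean, stdev

data = [1, 3, 3, 3, 50, 97, 97, 97, 100]

def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def nearest_gap(values):
    """Distance from each value to its closest other value (a crude isolation measure)."""
    return [
        min(abs(v - w) for j, w in enumerate(values) if j != i)
        for i, v in enumerate(values)
    ]

z = z_scores(data)
gaps = nearest_gap(data)

# The value with the largest gap to its nearest neighbour is the most isolated one.
most_isolated = data[gaps.index(max(gaps))]

for value, score, gap in zip(data, z, gaps):
    print(f"value={value:>3}  z={score:+.3f}  gap={gap}")
print("most isolated value:", most_isolated)
```

Observation 50 has a z-score close to zero, so a tail-based rule would never flag it, while the nearest-neighbour gap singles it out.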

Box plot
• A standardized way of displaying the
variation of data based on the
five-number summary, which includes
minimum, first quartile, median, third
quartile, and maximum.
• This plot does not make any assumptions
about the underlying statistical distribution.
• Here the “minimum” and “maximum” are the
whisker ends, conventionally the most extreme
values within 1.5 × IQR of the quartiles. Any data
point outside them is considered an outlier.
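The boxplot rule can be written down in a few lines. This is a hedged Python sketch, not the code in Graphical_Approach.R: it assumes the common 1.5 × IQR whisker convention, and the function name `boxplot_outliers` and the sample values are invented for illustration. Quartile definitions differ slightly between tools, so the flagged set may not match a particular plotting library exactly.

```python
from statistics import quantiles

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond the whisker fences of a boxplot."""
    # Quartiles by linear interpolation over the sorted data.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical measurements with one obviously large value.
sample = [2.0, 2.2, 2.4, 2.5, 2.7, 2.9, 3.0, 3.1, 3.3, 9.0]
flagged = boxplot_outliers(sample)
print(flagged)
```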

Boxplot
See Graphical_Approach.R
Side-by-side boxplot for each variable

Scatter plot
• A mathematical diagram that uses Cartesian coordinates to plot ordered
pairs, typically to show the correlation between two random variables.
• An outlier is defined as a data point that doesn't seem to fit with the rest of the
data points.
• A scatterplot can show points that are outliers in both variables (intersection)
or in either variable (union).

Scatterplot
Scatterplot of Sepal.Width and Sepal.Length

Q-Q plot
• In statistics, a Q–Q plot is a probability plot, which is a graphical
method for comparing two probability distributions by plotting their
quantiles against each other.
• If the two distributions being compared are similar, the points in the
Q–Q plot will approximately lie on the line y = x.
Source: Wikipedia

Adjusted quantile plot
• This plot identifies possible multivariate outliers by calculating the Mahalanobis
distance of each point from the center of the data.
• The multi-dimensional Mahalanobis distance between vectors x and y in $\mathbb{R}^n$ can be
formulated as:
$d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$
where x and y are random vectors of the same distribution with the covariance
matrix S.
• An outlier is defined as a point with a distance larger than some pre-determined
value.
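To make the distance concrete, here is a small two-dimensional Python sketch of the classical (non-robust) Mahalanobis distance of each point from the sample mean. It is an illustration only: the “mvoutlier” package used in the workshop relies on robust estimates, and the sample points, the function names and the 97.5% cut-off are assumptions made for this example.

```python
from math import log, sqrt
from statistics import mean

def covariance_2d(points):
    """Sample mean vector and 2x2 sample covariance matrix of (x, y) pairs."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_2d(point, center, cov):
    """sqrt((x - c)^T S^{-1} (x - c)) for a 2x2 covariance matrix S."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = point[0] - center[0], point[1] - center[1]
    # S^{-1} = 1/det * [[d, -b], [-b, a]], expanded by hand.
    return sqrt((d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)

# Hypothetical data: seven clustered points and one distant point.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2),
          (1.0, 0.8), (0.8, 1.0), (1.3, 1.1), (4.0, -2.0)]
center, cov = covariance_2d(points)
distances = [mahalanobis_2d(p, center, cov) for p in points]

# For 2 dimensions the chi-square quantile has a closed form: -2 * ln(1 - p).
cutoff = sqrt(-2 * log(1 - 0.975))
print([round(d, 2) for d in distances], round(cutoff, 3))
```

With only eight points the distant point inflates the covariance estimate and its classical distance may stay below the cut-off, which is one reason robust estimators such as MCD are preferred in practice.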

• Before applying this method and many other parametric
multivariate methods, we first need to check whether the data
follow a multivariate normal distribution, using
multivariate normality tests such as Royston, Mardia,
Chi-square, univariate plots, etc.
• In R, we use the “mvoutlier” package, which utilizes the
graphical approaches discussed above.

Min-Max normalization before diving into analysis
Multivariate normality test
Outlier Boolean vector identifies the
outliers
Alpha defines maximum thresholding proportion

Mahalanobis distances
Covariance matrix


Symbol plot
• This plot shows two-dimensional data using robust Mahalanobis distances based
on the minimum covariance determinant (MCD) estimator with adjustment.
• The MCD estimator looks for the subset of h
data points whose covariance matrix has the smallest determinant.
• The four ellipses drawn in the plot correspond to Mahalanobis distances at the
25%, 50%, 75% and adjusted quantiles of the chi-square distribution.

Symbol plot
Parameter “quan” defines the proportion of observations
used for the minimum covariance determinant
estimation. The default is 0.5.
Alpha defines the proportion of observations used for
calculating the adjusted quantile.

 Hypothesis testing (Chi-square test, Grubbs’ test)
 Scores

Hypothesis testing
• This method draws conclusions about a sample point by testing whether it
comes from the same distribution as the training data.
• Statistical tests, such as the t-test and the ANOVA table, can be used on
multiple subsets of the data.
• Here, the level of significance, i.e., the probability of incorrectly rejecting a
true null hypothesis, needs to be chosen.
• To apply this method in R, the “outliers” package, which provides statistical
tests, is used.

Chi-square test
• The chi-square test is a simple test for detecting outliers in univariate data,
based on the chi-square distribution of the squared difference between a data point and
the sample mean.
• In this test, the sample variance serves as the estimator of the population variance.
• The test can flag either the lowest or the highest value as an outlier, since outliers
can exist in both tails of the data.
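The idea behind this test can be sketched in Python as follows. This is an illustration of the statistic described above and not the code of the R “outliers” package: it takes the value farthest from the mean, divides the squared difference by the sample variance, and compares the result with a chi-square distribution with one degree of freedom. The function name and the sample data are assumptions.

```python
from math import erfc, sqrt
from statistics import mean, variance

def chisq_outlier_test(values):
    """Test the value farthest from the mean with a chi-square(1) statistic."""
    m, var = mean(values), variance(values)
    suspect = max(values, key=lambda v: abs(v - m))
    statistic = (suspect - m) ** 2 / var
    # Upper tail of chi-square with 1 df: P(Z^2 > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return suspect, statistic, p_value

# Hypothetical readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 15.0]
suspect, statistic, p_value = chisq_outlier_test(readings)
print(suspect, round(statistic, 3), round(p_value, 4))
```

A small p-value suggests that the suspect value does not belong with the rest; repeating the test after removing it mirrors the iterative use described for Statistical_Approach.R.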

Chi-square test
When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually
reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One
statistical test that addresses this issue is the chi-square goodness of fit test.
This test is commonly used to test association of variables in two-way tables where the assumed model of independence is
evaluated against the observed data. In general, the chi-square test statistic is of the form
$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the
data (anomaly).

Chi-square test
See Statistical_Approach.R
This function repeats the Chi-square test until it finds all
the outliers within the data.

Grubbs’ test
• A test for outliers in univariate data sets assumed to come from a normally
distributed population.
• Grubbs' test detects one outlier at a time. This outlier is expunged from the
dataset and the test is iterated until no outliers are detected.
• This test is defined for the following hypotheses:
H0: There are no outliers in the data set
H1: There is exactly one outlier in the data set
• The Grubbs' test statistic is defined as:
$G = \frac{\max_i |x_i - \bar{x}|}{s}$
where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.
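As a rough Python sketch, the Grubbs statistic is the largest absolute deviation from the sample mean divided by the sample standard deviation. The function names below are illustrative. The critical value needs a quantile of Student's t distribution, which the Python standard library does not provide, so `grubbs_critical` takes it as an argument (it could be obtained from a statistics library such as SciPy).

```python
from math import sqrt
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, index), where G = max |x_i - mean| / s and index locates that value."""
    m, s = mean(values), stdev(values)
    index = max(range(len(values)), key=lambda i: abs(values[i] - m))
    return abs(values[index] - m) / s, index

def grubbs_critical(n, t_crit):
    """Two-sided critical value for G.

    t_crit is the upper alpha / (2 n) quantile of Student's t with n - 2
    degrees of freedom, supplied by the caller.
    """
    return (n - 1) / sqrt(n) * sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))

# Hypothetical readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 15.0]
g, index = grubbs_statistic(readings)
print(round(g, 3), readings[index])
```

If G exceeds the critical value, the flagged point is removed and the test is repeated on the remaining data, as described in the bullets above.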

Grubbs’ test
The above function repeats the Grubbs’ test until it finds
all the outliers within the data.

Grubbs’ test
Histogram of normal observations vs outliers

Scores
• Scores quantify the tendency of a data point to be an outlier by assigning it a
score or probability.
• The most commonly used scores are:
▫ Normal score: $z_i = \frac{x_i - \text{mean}}{\text{standard deviation}}$
▫ T-student score: $t_i = \frac{z_i \sqrt{n - 2}}{\sqrt{n - 1 - z_i^2}}$
▫ Chi-square score: $\left(\frac{x_i - \text{mean}}{\text{sd}}\right)^2$
▫ IQR score: based on the interquartile range $Q_3 - Q_1$
• By using the “scores” function in R, p-values can be returned instead of scores.
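The four scores can be sketched in Python as below. This is an illustrative re-implementation under stated assumptions, not the R function itself: the t score is taken as z·sqrt(n−2)/sqrt(n−1−z²), the IQR score is the distance to the nearest quartile divided by the IQR (zero between the quartiles), and quartiles use linear interpolation.

```python
from math import sqrt
from statistics import mean, quantiles, stdev

def z_scores(values):
    """Normal scores: (x - mean) / sd."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def t_scores(values):
    """t-Student scores derived from the normal scores."""
    n = len(values)
    return [z * sqrt(n - 2) / sqrt(n - 1 - z * z) for z in z_scores(values)]

def chisq_scores(values):
    """Chi-square scores: squared normal scores."""
    return [z * z for z in z_scores(values)]

def iqr_scores(values):
    """Distance to the nearest quartile divided by the IQR; 0 inside the box."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    scores = []
    for v in values:
        if v < q1:
            scores.append((v - q1) / iqr)
        elif v > q3:
            scores.append((v - q3) / iqr)
        else:
            scores.append(0.0)
    return scores

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
print([round(s, 2) for s in z_scores(data)])
print([round(s, 2) for s in t_scores(data)])
print([round(s, 2) for s in chisq_scores(data)])
print([round(s, 2) for s in iqr_scores(data)])
```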

Scores
“type” defines the type of the score, such as
normal, t-student, etc.
“prob=1” returns the corresponding p-value.

Scores
By setting “prob” to a specific value, the function returns a logical vector
that marks as outliers the data points whose probabilities are
greater than this cut-off value.
By setting “type” to IQR, all values lower than the first
or greater than the third quartile are considered, and
the difference between each of them and the nearest quartile,
divided by the IQR, is calculated.

Twitter packages
• Anomaly Detection
▫ Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test for
detecting anomalies.
▫ Anomaly detection here refers to point-in-time anomalous data points that
could be global or local. A local anomaly is one that occurs inside a seasonal
pattern. Anomalies can be positive or negative.
▫ More details here: https://github.com/twitter/AnomalyDetection
• Breakout Detection
▫ A breakout is characterized in this package by two steady states and an
intermediate transition period that could be sudden or gradual.
▫ Uses the E-Divisive with Medians algorithm, which can detect one or multiple
breakouts in a given time series and employs energy statistics to detect
divergence in mean. More details here:
https://blog.twitter.com/2014/breakout-detection-in-the-wild

Demo
• Twitter-R-Anomaly Detection tutorial.ipynb

 Linear regression
 Piecewise/ segmented regression
 Clustering-based approaches

Linear regression
• Linear regression investigates the linear relationship between variables and
predicts one variable based on one or more other variables. It can be
formulated as:
$Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$
where Y and $X_i$ are random variables, $\beta_i$ is a regression coefficient and $\beta_0$ is a
constant.
• In this model, the ordinary least squares estimator is usually used; it minimizes the
sum of squared differences between the observed and predicted values of the dependent variable.
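A minimal Python sketch of regression-based outlier detection with a single predictor is shown below: fit the line by ordinary least squares, then flag points whose residual is large relative to the spread of all residuals. The cut-off of two standard deviations and the function names are assumptions made for this illustration.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def residual_outliers(xs, ys, cutoff=2.0):
    """Indices whose absolute residual exceeds cutoff * sd of the residuals."""
    b0, b1 = fit_line(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    scale = stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > cutoff * scale]

# Hypothetical data on the line y = 2x + 1 with one corrupted observation.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 10
print(residual_outliers(xs, ys))
```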

Piecewise/segmented regression
• A method in regression analysis in which the independent variable is
partitioned into intervals to allow multiple linear models to be fitted to data for
different ranges.
• This model can be applied when there are ‘breakpoints’ and clearly
different linear relationships in the data with a sudden, sharp change in
directionality. Below is a simple segmented regression for data with one
breakpoint:
$Y = C_0 + \varphi_1 X, \quad X < X_1$
$Y = C_1 + \varphi_2 X, \quad X > X_1$
where Y is a predicted value, X is an independent variable, $C_0$ and $C_1$ are
constant values, $\varphi_1$ and $\varphi_2$ are regression coefficients, and $X_1$ is the
breakpoint.

Anomaly detection vs Supervised learning

• For this example, we use the “segmented” package in R to first illustrate piecewise
regression for a two-dimensional data set, which has a breakpoint around z=0.5.
See Piecewise_Regression.R
“pmax” is used for parallel maximization to
create different values for y.

• Then, we use linear regression to predict y values for each segment of z.

• Finally, the outliers can be detected for each segment by setting some rules for
the residuals of the model.
Here, we set the rule for the residuals corresponding to z
less than 0.5, by which the points with residuals below
0.5 are defined as outliers.
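The three steps (split at the breakpoint, fit a line per segment, apply a rule to the residuals) can be sketched in Python as follows. This is not the code in Piecewise_Regression.R: the breakpoint is assumed to be known rather than estimated, the rule used here is an absolute-residual threshold chosen for illustration, and the data are synthetic.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def segment_outliers(zs, ys, breakpoint, threshold):
    """Fit one line on each side of the breakpoint and flag large residuals."""
    flagged = []
    for in_segment in (lambda z: z < breakpoint, lambda z: z >= breakpoint):
        idx = [i for i, z in enumerate(zs) if in_segment(z)]
        b0, b1 = fit_line([zs[i] for i in idx], [ys[i] for i in idx])
        flagged += [i for i in idx
                    if abs(ys[i] - (b0 + b1 * zs[i])) > threshold]
    return sorted(flagged)

# Synthetic data: flat for z < 0.5, rising afterwards, one corrupted point.
zs = [i / 10 for i in range(11)]
ys = [2 * max(z, 0.5) for z in zs]
ys[2] += 0.5
print(segment_outliers(zs, ys, breakpoint=0.5, threshold=0.2))
```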

Clustering-based approaches
• These methods are suitable for unsupervised anomaly detection.
• They aim to partition the data into meaningful groups (clusters) based on the
similarities and relationships found among the data points.
• Each data point is assigned a degree of membership for each of the clusters.
• Anomalies are those data points that:
▫ Do not fit into any cluster.
▫ Belong to a particular cluster but are far away from the cluster centroid.
▫ Form small or sparse clusters.

• These methods partition the data into K clusters by assigning each data point to
its closest cluster centroid, minimizing the within-cluster sum of squares
(WSS), which is:
$\mathrm{WSS} = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2$
where $S_k$ is the set of observations in the kth cluster and $\mu_{kj}$ is the mean of the jth
variable of the cluster center of the kth cluster.
• Then, they select the top n points that are the farthest away from their nearest
cluster centers as outliers.
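The procedure can be sketched with a plain k-means (Lloyd's algorithm) in Python. This is a simplified illustration and not the algorithm of the R package used in the workshop: initial centers are passed in by hand, and the points, function names and number of iterations are assumptions.

```python
from math import dist

def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm starting from the given initial centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[k]
            for k, members in enumerate(clusters)
        ]
    return centers

def wss(points, centers):
    """Within-cluster sum of squares, each point counted at its nearest center."""
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)

def top_outliers(points, centers, n):
    """The n points farthest from their nearest cluster center."""
    ranked = sorted(points, key=lambda p: min(dist(p, c) for c in centers), reverse=True)
    return ranked[:n]

# Hypothetical data: two tight groups and one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 15)]
centers = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers, round(wss(points, centers), 2))
print(top_outliers(points, centers, n=1))
```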

Anomaly Detection vs Unsupervised Learning

• “Kmod” package in R is used to show the application of the K-means model.
In this example, the number of clusters is chosen
from the bend (elbow) graph and then passed to the K-mod
function.
See Clustering_Approach.R

K=4 is the number of clusters and L=10 is
the number of outliers

Scatter plots of normal and outlier data points

Local Outlier Factor (LOF)
• The local outlier factor (LOF) algorithm first calculates the density of the local
neighborhood of each point.
• Then, for each object p, the LOF score is defined as the average of the ratios
of the density of each of p's nearest neighbors to the density of p. The number
of nearest neighbors, k, is given by the user.
• Points with the largest LOF scores are considered outliers.
• In R, both the “DMwR” and “Rlof” packages can be used to fit the LOF model.

Local Outlier Factor (LOF)
• The LOF scores for outlying points will be high because they are computed in
terms of the ratios to the average neighborhood reachability distances.
• As a result, for data points that are distributed homogeneously within a cluster, the
LOF scores will be close to one.
• When LOF is computed over a range of values for k, the maximum LOF score of each point
is taken as its local outlier score.
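A compact Python sketch of the LOF computation is given below. It follows the usual definitions (k-distance, reachability distance, local reachability density) but simplifies tie handling by taking exactly k neighbors, and it assumes there are no duplicate points. It is meant as an illustration, not as a replacement for the R packages named in the workshop.

```python
from math import dist

def lof_scores(points, k):
    """Local outlier factor of every point, using exactly k nearest neighbours."""
    n = len(points)
    neighbours, k_distance = [], []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        neighbours.append(ranked[:k])
        k_distance.append(dist(points[i], points[ranked[k - 1]]))

    def local_reachability_density(i):
        # Reachability distance of i from neighbour j: max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], dist(points[i], points[j])) for j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbour density divided by the point's own density.
    return [sum(density[j] for j in neighbours[i]) / (k * density[i])
            for i in range(n)]

# Hypothetical data: a 3 x 3 grid of points plus one distant point.
grid = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(grid, k=3)
print([round(s, 2) for s in scores])
```

Points inside the grid get scores near one, while the distant point gets a much larger score.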

Local Outlier Factor (R)
• LOF returns a numeric vector of scores for each observation in the data set.
k is the number of neighbors used in the
calculation of local outlier scores.
See Density_Approach.R
Outlier indexes

Local outliers are shown in red.

Histogram of regular observations vs outliers

Summary
We have covered Anomaly detection
Introduction
 Definition of anomaly detection and its importance in energy systems
 Different types of anomaly detection methods: statistical, graphical and machine
learning methods
Graphical approach
 Graphical methods consist of the boxplot, scatterplot, adjusted quantile plot and symbol
plot to demonstrate outliers graphically
 The main assumption for applying graphical approaches is multivariate normality
 The Mahalanobis distance is mainly used for calculating the distance of a point
from the center of a multivariate distribution
Statistical approach
 Statistical hypothesis testing includes the Chi-square test and Grubbs’ test
 Statistical methods may use either scores or p-values as thresholds to detect outliers
Machine learning approach
 Both supervised and unsupervised learning methods can be used for outlier detection
 Piecewise or segmented regression can be used to identify outliers based on the
residuals for each segment
 In the K-means clustering method, outliers are defined as points that don’t belong
to any cluster, are far away from the centroids of their cluster, or form sparse clusters
Density approach
 The local outlier factor algorithm is used to detect local outliers
 The relative density of a data point is compared with the density of its k nearest neighbors. K
is mainly chosen by the user

Demand response (DR)
• Demand response (DR) is defined by the U.S. Department of Energy
(DOE) as:
“Changes in electric usage by end‐use customers from their normal
consumption patterns in response to changes in the price of electricity over
time, or to incentive payments designed to induce lower electricity use at times
of high wholesale market prices or when system reliability is jeopardized.”

Figure: stages of a DR event (source: http://www.poweritsolutions.com/solutions/demand_response)
1. The DR customer is notified to participate in the DR event
2. The customer curtails energy usage
3. The usage returns to the normal level as the event ends

• In DR programs, customers are incentivized by either capacity payments or
energy payments:
▫ Capacity payments:
Payments to customers for standing by, ready to make electrical capacity available during an
emergency.
▫ Energy payments:
Payments based on the actual energy that a customer provides over a set period of time
during a DR event.
• DR program customers are compensated based on the extent to which they
reduce their energy consumption.
• Therefore, DR providers require a reliable system to measure energy reduction.

Demand response measurement and verification (M&V)
• Demand Response Measurement and Verification (M&V) is the application of
statistical techniques to measure and verify the load reduction during a DR
event.
• To measure the load reduction, DR providers should first estimate the baseline
for each of their customers.

Baseline (normal model)
• A baseline is an estimate of the electricity usage that a customer would have
consumed in the absence of a DR event.
• The baseline is critical for measuring curtailment during DR events.
• It enables DR providers to measure the performance of DR events.
• Since M&V processes are entirely dependent on the baseline calculation and
actual load, the baseline and actual electricity consumption must be calculated
as accurately as possible.

Load reduction
• Baseline = The amount of energy the customer would have consumed in the
absence of the DR event
• Actual consumption = The amount of energy the customer actually consumed
during the DR event
• Load reduction = Baseline – Actual consumption
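The arithmetic is a per-interval subtraction, sketched here in Python. The interval values are hypothetical and chosen only to show the calculation; how the baseline itself is estimated is outside this sketch.

```python
def load_reduction(baseline, actual):
    """Load reduction per interval: baseline minus actual consumption."""
    if len(baseline) != len(actual):
        raise ValueError("baseline and actual must cover the same intervals")
    return [b - a for b, a in zip(baseline, actual)]

# Hypothetical energy usage per interval during a DR event.
baseline_usage = [12.0, 12.5, 13.0, 12.8]
actual_usage = [9.0, 8.5, 9.5, 12.8]

reduction = load_reduction(baseline_usage, actual_usage)
total_reduction = sum(reduction)
print(reduction, total_reduction)
```

An anomalous reading in either series flows straight into the computed reduction, which is why the input data need to be screened first.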

Baseline, actual consumption, load reduction
Figure: load curve showing the baseline, the actual consumption, and the load reduction between them
Source: https://www.cozero.com.au/demand-response

Anomaly detection and its importance in DR programs
• The performance of DR events should be computed precisely, and customers
should not receive credit for more or less than the load reduction they actually
provide. The data used for analysis should therefore be free of anomalies.
• Anomalies can have a significant impact on the results of the statistical models
that are applied to the data to measure the performance of DR events.
• Anomaly detection techniques can be applied to identify anomalies and allow
decision makers to find the best way to deal with them.

Energy consumption data
• The primary data set is collected from EnerNOC public data sources.
• The data covers the 5 minute energy usage of a primary/secondary school, in
Georgetown, Delaware for 2012.
• EnerNOC collected this data through its smart electrical meters installed on the
site.
• The data consists of five variables:
▫ Timestamp : Date and time in local time zone
▫ Dttm_utc: Date and time in UTC time zone
▫ Value: 5 minute energy usage
▫ Estimated: If estimated equals 1, it indicates that the value (energy usage) is estimated. If it
equals 0, it indicates that the value is not estimated.
▫ Anomaly: If anomaly equals 1, it indicates that the value (energy usage) is anomalous. If
anomaly equals 0, it indicates that the value is normal.
• For the purpose of this case study, we use the data from the beginning of
January to the end of March. We also only use dttm_utc and value columns.

Energy consumption data
EnerNOC data: see 136.csv
EnerNOC metadata: see all_sites.csv

Weather data
• Energy consumption is correlated with temperature. For that reason, we
use weather data (temperature) as a variable impacting the energy usage.
• For this purpose, we collect hourly weather data for Georgetown, Delaware
using the “getDetailedWeather” function from the “weatherData” package in R.
• This function gathers hourly temperature data for a specific location and date
from www.wunderground.com using two arguments, “station id” and “date”.
• The weather data consists of two variables:
▫ Timestamp: Date and time in EST time zone
▫ Temperature: Hourly temperature in Fahrenheit
• The weather data timestamp is hourly and associated with the 54th minute of
each hour.
• The weather data is collected for January 2012 to March 2012.

Case study
• Refer to Case_study.R for more details

Upcoming QuantUniversity workshops
• Apache Spark : Core Statistics and Machine Learning
▫ September 12,13th, Cambridge, MA
▫ www.analyticscertificate.com/SparkWorkshop
• Anomaly Detection: Core techniques and Best Practices
▫ September 19,20th, New York
▫ www.analyticscertificate.com/AnomalyNYC
• Anomaly Detection: Core techniques and Best Practices
▫ October 26th, 27th, San Francisco
▫ http://www.analyticscertificate.com/GBDC/

Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
