www.globalbigdataconference.com
Twitter : @bigdataconf
Location:
Santa Clara, CA
August 31st 2016
Anomaly Detection
Techniques and Best Practices
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program in 2016
• If you don’t have R,
▫ Install R from https://cran.r-project.org/
▫ Install RStudio from
https://www.rstudio.com/products/rstudio/download2/
▫ Install IRKernel from
http://irkernel.github.io/installation/
▫ Refer to www.analyticscertificate.com/GBDC for slides and examples
Local installation instructions
What is anomaly detection?
• Anomalies or outliers are data points within the datasets
that appear to deviate markedly from expected outputs.
• An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism [1].
• Anomaly detection refers to the problem of finding
patterns in data that do not conform to expected behavior.
1. D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
• Outliers are data points that are considered out of the ordinary or
abnormal. This includes noise.
• Anomalies are a special kind of outlier that has significant/
critical/actionable information which could be of interest to
analysts.
Anomaly vs Outliers
[Figure: scatter plot with two clusters, labeled 1 and 2]
All points not in clusters 1 & 2 are Outliers
Point B is an Anomaly (Both X and Y are large)
• Fraud Detection
▫ Credit card fraud detection
 By owner or by operation
▫ Mobile phone fraud/anomaly detection
 Calling behavior, volume etc
▫ Insurance claim fraud detection
 Medical malpractice
 Auto insurance
▫ Insider trading detection
• E-commerce
▫ Pricing issues
▫ Network issues
Applications of Anomaly Detection
• Intrusion detection:
▫ Detect malicious activity in computer systems
▫ This could be host-based or network-based
• Medical anomalies
Examples of Anomaly Detection
• Manufacturing and sensors:
▫ Fault detection
▫ Heat, fire sensors
• Text data
▫ Novel topics, events
▫ Plagiarism
Examples of Anomaly Detection
• Anomalies can be classified into 3 major categories
1. Point Anomalies:
 If an instance is anomalous compared with the rest of the instances, the anomaly
is considered a point anomaly
2. Contextual Anomalies
 If an instance is anomalous in a specific context, the anomaly would be
considered as a contextual anomaly
3. Collective Anomalies
 If a collection of related data records are anomalous with respect to the entire
data set, the anomaly is a collective anomaly
Anomaly Classification
• In the figure, points o1 and o2 are considered point anomalies
• Examples:
▫ A 50% increase in daily stock price
▫ A credit card transaction attempt for $5000 (assuming you have never
had a single transaction for anything above $1000)
Point Anomalies
• In the figure, temperature t2 is an anomaly
• Note that t1 is lower than t2 but contextually, t1 is expected and t2
isn’t when compared to records around it.
Contextual Anomalies
• Multiple Buy Stock transactions and then a sequence of Sell
transactions around an earnings release date may be anomalous
and may indicate insider trading.
• Consider the sequence of network activities recorded
• Though ssh, buffer-overflow and ftp themselves are not anomalous
activities, a sequence of the three indicates a web-based attack
• Similarly, multiple http requests from an ip address may indicate a
crawler in action.
Collective Anomalies
• In medicine, abnormal ECG pattern detection would involve looking
for collective anomalies like Premature Atrial Contraction
Collective Anomalies
http://www.fprmed.com/Pages/Cardio/PAC.html
1. Graphical approach
2. Statistical approach
3. Machine learning approach
4. Density based approach
Illustration of four methodologies to Anomaly Detection in R
 Boxplot
 Scatter plot
 Adjusted quantile plot
 Symbol plot
Graphical approaches
• Graphical methods utilize extreme value analysis, by which outliers
correspond to the statistical tails of probability distributions.
• Statistical tails are most commonly used for one-dimensional
distributions, although the same concept can be applied to the
multidimensional case.
• It is important to understand that all extreme values are outliers,
but the reverse may not be true.
• For instance, in the one-dimensional dataset
{1,3,3,3,50,97,97,97,100}, observation 50 is approximately equal to the mean and is not
considered an extreme value, but since this observation is the
most isolated point, it should be considered an outlier.
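The isolation argument above can be checked numerically. The sketch below is in Python rather than the R used by the workshop scripts, and the helper name `nearest_neighbor_gap` is hypothetical. It compares each value's z-score with its distance to the closest other observation.

```python
from statistics import mean, stdev

data = [1, 3, 3, 3, 50, 97, 97, 97, 100]

def nearest_neighbor_gap(values, index):
    """Distance from values[index] to the closest other observation."""
    return min(abs(values[index] - v) for j, v in enumerate(values) if j != index)

mu, sigma = mean(data), stdev(data)
z_scores = [(x - mu) / sigma for x in data]
gaps = [nearest_neighbor_gap(data, i) for i in range(len(data))]

# The value with the largest gap to its nearest neighbor is the most isolated one.
most_isolated = data[gaps.index(max(gaps))]

for x, z, g in zip(data, z_scores, gaps):
    print(f"value={x:>3}  z={z:+.2f}  gap to nearest neighbor={g}")
print("most isolated value:", most_isolated)
```

The value 50 has a z-score close to zero, so a tail-based rule would not flag it, yet it has the largest gap to any neighbor.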
Box plot
• A standardized way of displaying the
variation of data based on the five
number summary, which includes
minimum, first quartile, median, third
quartile, and maximum.
• This plot does not make any assumptions
of the underlying statistical distribution.
• Any data point lying outside the whiskers, i.e. beyond the plotted
minimum or maximum (conventionally 1.5 × IQR past the
first and third quartiles), is considered an outlier.
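A minimal Python sketch of the box plot outlier rule, assuming the common 1.5 × IQR whisker convention. The function name `boxplot_outliers` and the sample readings are illustrative, and quartile conventions differ slightly between tools, so R's `boxplot` may place the fences a little differently.

```python
from statistics import quantiles

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside the whisker fences of a box plot."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]  # illustrative data
print(boxplot_outliers(readings))
```

No distributional assumption is needed here: the fences depend only on the quartiles of the sample.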
Boxplot
See Graphical_Approach.R
Side-by-side boxplot for each variable
Scatter plot
• A mathematical diagram, which uses Cartesian coordinates for plotting ordered
pairs to show the correlation between typically two random variables.
• An outlier is defined as a data point that doesn't seem to fit with the rest of the
data points.
• In scatterplots, outliers of either intersection or union sets of two variables can
be shown.
Scatterplot
See Graphical_Approach.R
Scatterplot of Sepal.Width and Sepal.Length
• In statistics, a Q–Q plot is a probability plot, which is a graphical
method for comparing two probability distributions by plotting their
quantiles against each other.
• If the two distributions being compared are similar, the points in the
Q–Q plot will approximately lie on the line y = x.
Q-Q plot
Source: Wikipedia
Adjusted quantile plot
• This plot identifies possible multivariate outliers by calculating the Mahalanobis
distance of each point from the center of the data.
• The multi-dimensional Mahalanobis distance between vectors x and y in $\mathbb{R}^n$ can be
formulated as:
$$d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$$
where x and y are random vectors of the same distribution with the covariance
matrix S.
• An outlier is defined as a point with a distance larger than some pre-determined
value.
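The distance rule can be sketched in Python for the two-dimensional case. This is only an illustration: the function name `mahalanobis_2d`, the covariance matrix, and the cut-off of 2.5 are assumptions, not values from the workshop scripts.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]].
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)

center = (0.0, 0.0)
cov = [[4.0, 0.0], [0.0, 1.0]]  # variance 4 along the first axis, 1 along the second
threshold = 2.5                 # illustrative cut-off, not taken from the slides

for point in [(2.0, 0.0), (0.0, 2.0), (6.0, 3.0)]:
    dist = mahalanobis_2d(point, center, cov)
    print(point, round(dist, 3), "outlier" if dist > threshold else "ok")
```

The points (2, 0) and (0, 2) are equally far from the center in Euclidean terms, but the second is twice as far in Mahalanobis terms because the data vary less along that axis.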
Adjusted quantile plot
• Before applying this method and many other parametric
multivariate methods, first we need to check if the data is
multivariate normally distributed using different
multivariate normality tests, such as Royston, Mardia, Chi-
square, univariate plots, etc.
• In R, we use the “mvoutlier” package, which utilizes
graphical approaches as discussed above.
Adjusted quantile plot
Min-Max normalization before diving into analysis
Multivariate normality test
Outlier Boolean vector identifies the
outliers
Alpha defines maximum thresholding proportion
See Graphical_Approach.R
Adjusted quantile plot
See Graphical_Approach.R
Mahalanobis distances
Covariance matrix
Adjusted quantile plot
See Graphical_Approach.R
Symbol plot
• This plot displays two-dimensional data, using robust Mahalanobis distances based
on the minimum covariance determinant (MCD) estimator with adjustment.
• The Minimum Covariance Determinant (MCD) estimator looks for the subset of h
data points whose covariance matrix has the smallest determinant.
• Four ellipsoids drawn in the plot show the Mahalanobis distances corresponding to the
25%, 50%, 75% and adjusted quantiles of the chi-square distribution.
Symbol plot
See Graphical_Approach.R
Parameter “quan” defines the amount of observations,
which are used for minimum covariance determinant
estimations. The default is 0.5.
Alpha defines the amount of observations used for
calculating the adjusted quantile.
 Hypothesis testing (Chi-square test, Grubbs’ test)
 Scores
Hypothesis testing
• This method draws conclusions about a sample point by testing whether it
comes from the same distribution as the training data.
• Statistical tests, such as the t-test and the ANOVA table, can be used on
multiple subsets of the data.
• Here, the level of significance, i.e., the probability of incorrectly rejecting a
true null hypothesis, needs to be chosen.
• To apply this method in R, the “outliers” package, which provides statistical
tests, is used.
Chi-square test
• Chi-square test performs a simple test for detecting outliers of univariate data
based on Chi-square distribution of squared difference between data and
sample mean.
• In this test, sample variance counts as the estimator of the population variance.
• Chi-square test helps us identify the lowest and highest values, since outliers
can exist in both tails of the data.
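A rough Python sketch of this kind of test, written from the description above rather than from the R package source. The function name `chisq_outlier_test` and the sample data are assumptions. The value farthest from the sample mean is tested with a chi-square statistic on one degree of freedom, using the sample variance as the variance estimate.

```python
import math
from statistics import mean, variance

def chisq_outlier_test(values):
    """Test the value farthest from the sample mean with a 1-df chi-square statistic."""
    mu = mean(values)
    var = variance(values)  # sample variance used as the population variance estimate
    suspect = max(values, key=lambda v: abs(v - mu))
    statistic = (suspect - mu) ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return suspect, statistic, p_value

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # illustrative data
suspect, statistic, p_value = chisq_outlier_test(data)
print(suspect, round(statistic, 2), p_value < 0.05)
```

Because the deviation is squared, the same statistic covers suspects in either tail.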
When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually
reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One
statistical test that addresses this issue is the chi-square goodness of fit test.
This test is commonly used to test association of variables in two-way tables where the assumed model of independence is
evaluated against the observed data. In general, the chi-square test statistic is of the form
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the
data (anomaly).
Chi-square test
Chi-square test
See Statistical_Approach.R
This function repeats the Chi-square test until it finds all
the outliers within the data.
Grubbs’ test
• Test for outliers for univariate data sets assumed to come from a normally
distributed population.
• Grubbs' test detects one outlier at a time. This outlier is expunged from the
dataset and the test is iterated until no outliers are detected.
• This test is defined for the following hypotheses:
H0: There are no outliers in the data set
H1: There is exactly one outlier in the data set
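• The Grubbs' test statistic is defined as:
$$G = \frac{\max_i |Y_i - \bar{Y}|}{s}$$
where $\bar{Y}$ is the sample mean and $s$ is the sample standard deviation.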
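The iterate-and-remove procedure can be sketched in Python as follows. The function names are hypothetical, and the critical value is supplied by the caller because the Python standard library has no t-distribution quantile. The constant used in the usage lines is a placeholder for illustration, not a tabulated Grubbs value.

```python
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, suspect): the largest absolute deviation from the mean in units of s."""
    mu, s = mean(values), stdev(values)
    if s == 0:
        return 0.0, values[0]
    suspect = max(values, key=lambda v: abs(v - mu))
    return abs(suspect - mu) / s, suspect

def iterative_grubbs(values, critical_value):
    """Remove one suspect at a time while G exceeds critical_value(n)."""
    remaining, outliers = list(values), []
    while len(remaining) > 2:
        g, suspect = grubbs_statistic(remaining)
        if g <= critical_value(len(remaining)):
            break
        remaining.remove(suspect)
        outliers.append(suspect)
    return outliers, remaining

def placeholder_critical(n):
    # Placeholder only: a real analysis would look up the Grubbs critical
    # value for sample size n and the chosen significance level.
    return 2.5

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # illustrative data
outliers, cleaned = iterative_grubbs(data, placeholder_critical)
print(outliers, cleaned)
```

Each pass tests a single suspect, which matches the hypothesis of exactly one outlier per test.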
Grubbs’ test
See Statistical_Approach.R
The above function repeats the Grubbs’ test until it finds
all the outliers within the data.
Grubbs’ test
See Statistical_Approach.R
Histogram of normal observations vs outliers
Scores
• Scores quantify the tendency of a data point to be an outlier by assigning it a
score or probability.
• The most commonly used scores are:
▫ Normal score: $z_i = \frac{x_i - \text{mean}}{\text{standard deviation}}$
▫ t-Student score: $t_i = \frac{z_i \sqrt{n - 2}}{\sqrt{n - 1 - z_i^2}}$
▫ Chi-square score: $\left(\frac{x_i - \text{mean}}{\text{sd}}\right)^2$
▫ IQR score: distance from the nearest quartile divided by $\text{IQR} = Q_3 - Q_1$
• By using the “scores” function in R, p-values can be returned instead of scores.
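A short Python sketch of the normal, t-Student and chi-square scores. The function name `outlier_scores` and the sample data are assumptions, and the t-Student form used here is z·sqrt(n−2)/sqrt(n−1−z²), which is an assumption about the intended formula.

```python
import math
from statistics import mean, stdev

def outlier_scores(values):
    """Return z, t and chi-square scores for every observation."""
    n = len(values)
    mu, sd = mean(values), stdev(values)
    z = [(x - mu) / sd for x in values]
    # With the sample standard deviation, z**2 is always below n - 1,
    # so the square root in the denominator is well defined.
    t = [zi * math.sqrt(n - 2) / math.sqrt(n - 1 - zi ** 2) for zi in z]
    chisq = [zi ** 2 for zi in z]
    return z, t, chisq

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # illustrative data
z, t, chisq = outlier_scores(data)
for x, zi, ti, ci in zip(data, z, t, chisq):
    print(f"{x:>3}  z={zi:+.2f}  t={ti:+.2f}  chisq={ci:.2f}")
```

All three scores rank the observations identically by absolute value; they differ in the reference distribution used to turn a score into a probability.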
Scores
See Statistical_Approach.R
“type” defines the type of the score, such as
normal, t-student, etc.
“prob=1” returns the corresponding p-value.
Scores
See Statistical_Approach.R
By setting “prob” to a specific value, a logical vector is
returned that flags as outliers the data points whose
probabilities exceed this cut-off value.
By setting “type” to IQR, only values below the first
or above the third quartile are considered; for each, the
difference between the value and the nearest quartile,
divided by the IQR, is calculated.
• Anomaly Detection
▫ Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test for
detecting anomalies.
▫ Anomaly detection refers to point-in-time anomalous data points that
could be global or local. A local anomaly is one that occurs inside a seasonal
pattern; it could be positive or negative.
▫ More details here: https://github.com/twitter/AnomalyDetection
• Breakout Detection
▫ A breakout is characterized in this package by two steady states and an
intermediate transition period that could be sudden or gradual.
▫ Uses the E-Divisive with Medians algorithm; it can detect one or multiple
breakouts in a given time series and employs energy statistics to detect
divergence in mean. More details here:
https://blog.twitter.com/2014/breakout-detection-in-the-wild
Twitter packages
• Twitter-R-Anomaly Detection tutorial.ipyb
Demo
 Linear regression
 Piecewise/ segmented regression
 Clustering-based approaches
Linear regression
• Linear regression investigates the linear relationships between variables and
predicts one variable based on one or more other variables. It can be
formulated as:
$$Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$$
where Y and $X_i$ are random variables, $\beta_i$ is a regression coefficient and $\beta_0$ is a
constant.
• In this model, the ordinary least squares estimator is usually used; it minimizes the
sum of squared differences between the observed values of the dependent variable and
the values predicted from the independent variables.
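For a single predictor, the least squares fit and a residual-based outlier rule can be sketched in Python as follows. The function names, the sample data, and the two-standard-deviation cut-off are illustrative assumptions.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_outliers(xs, ys, n_sd=2.0):
    """Indices whose residual exceeds n_sd standard deviations of the residuals."""
    b0, b1 = fit_line(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    cutoff = n_sd * stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > cutoff]

xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 30  # inject one anomalous response into otherwise linear data
print(fit_line(xs, ys), residual_outliers(xs, ys))
```

Note that the anomalous point also pulls the fitted line toward itself, which is one reason robust alternatives are sometimes preferred.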
Piecewise/segmented regression
• A method in regression analysis, in which the independent variable is
partitioned into intervals to allow multiple linear models to be fitted to data for
different ranges.
• This model can be applied when there are ‘breakpoints’ and clearly
different linear relationships in the data with a sudden, sharp change in
directionality. Below is a simple segmented regression for data with one
breakpoint:
$$Y = C_0 + \varphi_1 X, \quad X < X_1$$
$$Y = C_1 + \varphi_2 X, \quad X > X_1$$
where Y is a predicted value, X is an independent variable, $C_0$ and $C_1$ are
constant values, $\varphi_1$ and $\varphi_2$ are regression coefficients, and $X_1$ is the
breakpoint.
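A simplified Python sketch of segmented regression with residual-based outlier flagging. Unlike a dedicated segmented-regression package, it assumes the breakpoint is already known. The function names, the data, and the residual cut-off of 1.0 are illustrative.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def piecewise_residuals(xs, ys, breakpoint):
    """Fit one line on each side of a known breakpoint; return residuals in input order."""
    left = [(x, y) for x, y in zip(xs, ys) if x < breakpoint]
    right = [(x, y) for x, y in zip(xs, ys) if x >= breakpoint]
    left_model = fit_line(*zip(*left))
    right_model = fit_line(*zip(*right))
    residuals = []
    for x, y in zip(xs, ys):
        c, phi = left_model if x < breakpoint else right_model
        residuals.append(y - (c + phi * x))
    return residuals

zs = [i / 10 for i in range(11)]
ys = [max(1.0, 1.0 + 4 * (z - 0.5)) for z in zs]  # flat, then rising after z = 0.5
ys[2] += 3.0                                      # inject one anomaly in the flat segment

residuals = piecewise_residuals(zs, ys, breakpoint=0.5)
flagged = [i for i, r in enumerate(residuals) if abs(r) > 1.0]
print(flagged)
```

Fitting a single line to these data would spread the error across both regimes; fitting per segment keeps the residuals of well-behaved points small.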
Anomaly detection vs Supervised learning
Piecewise/segmented regression
• For this example, we use the “segmented” package in R to first illustrate piecewise
regression for a two-dimensional data set, which has a breakpoint around z=0.5.
See Piecewise_Regression.R
“pmax” is used for parallel maximization to
create different values for y.
Piecewise/segmented regression
• Then, we use linear regression to predict y values for each segment of z.
See Piecewise_Regression.R
Piecewise/segmented regression
• Finally, the outliers can be detected for each segment by setting rules on the
residuals of the model.
See Piecewise_Regression.R
Here, we set the rule for the residuals corresponding to z
less than 0.5: points with residuals below
0.5 are defined as outliers.
Clustering-based approaches
• These methods are suitable for unsupervised anomaly detection.
• They aim to partition the data into meaningful groups (clusters) based on the
similarities and relationships between the groups found in the data.
• Each data point is assigned a degree of membership for each of the clusters.
• Anomalies are those data points that:
▫ Do not fit into any clusters.
▫ Belong to a particular cluster but are far away from the cluster centroid.
▫ Form small or sparse clusters.
Clustering-based approaches
• These methods partition the data into k clusters by assigning each data point to
its closest cluster centroid by minimizing the within-cluster sum of squares
(WSS), which is:
$$\mathrm{WSS} = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2$$
where $S_k$ is the set of observations in the kth cluster and $\mu_{kj}$ is the mean of the jth
variable of the cluster center of the kth cluster.
• Then, they select the top n points that are the farthest away from their nearest
cluster centers as outliers.
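The two steps above (WSS and ranking by distance to the nearest center) can be sketched in Python. The sketch assumes the cluster centers have already been fitted, for example by k-means, and does not reproduce any particular R package. All names and data are illustrative.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    k = distances.index(min(distances))
    return k, distances[k]

def wss(points, centroids):
    """Within-cluster sum of squares, assigning each point to its nearest centroid."""
    return sum(nearest_centroid(p, centroids)[1] ** 2 for p in points)

def top_n_outliers(points, centroids, n):
    """The n points farthest from their nearest centroid."""
    ranked = sorted(points, key=lambda p: nearest_centroid(p, centroids)[1], reverse=True)
    return ranked[:n]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
centroids = [(1 / 3, 1 / 3), (31 / 3, 31 / 3)]  # assumed to come from a prior k-means fit
print(round(wss(points, centroids), 2), top_n_outliers(points, centroids, 1))
```

The point (5, 5) is assigned to a cluster like every other point, but its distance to that cluster's center sets it apart.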
Anomaly Detection vs Unsupervised Learning
Clustering-based approaches
• “Kmod” package in R is used to show the application of K-means model.
In this example, the number of clusters is chosen
from the bend (elbow) graph and then passed to the K-mod
function.
See Clustering_Approach.R
Clustering-based approaches
See Clustering_Approach.R
K=4 is the number of clusters and L=10 is
the number of outliers
Clustering-based approaches
See Clustering_Approach.R
Scatter plots of normal and outlier data points
 Local outlier factor
Local Outlier Factor (LOF)
• The local outlier factor (LOF) algorithm first calculates the density of the local
neighborhood of each point.
• Then, for each object p, the LOF score is defined as the average of the ratios
of the density of the nearest neighbors of p to the density of p itself. The number
of nearest neighbors, k, is given by the user.
• Points with the largest LOF scores are considered outliers.
• In R, both the “DMwR” and “Rlof” packages can be used to run the LOF model.
Local Outlier Factor (LOF)
• The LOF scores for outlying points will be high because they are computed in
terms of the ratios to the average neighborhood reachability distances.
• As a result, for data points that are distributed homogeneously within a cluster, the
LOF scores will be close to one.
• Over a range of values for k, the maximum LOF score will determine
the scores associated with the local outliers.
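A compact Python sketch of the LOF computation, written from the standard definition (k-distance, reachability distance, local reachability density). It is a simplification: ties at the k-th neighbor are broken by input order, and it assumes no point has more than k exact duplicates. The function name and data are illustrative.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of every point, using exactly k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn = order[:k]
        neighbors.append(knn)
        k_distance.append(dist[i][knn[-1]])  # distance to the k-th nearest neighbor

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]  # illustrative data
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

The four corner points sit in an evenly dense neighborhood and score about one, while the isolated point scores far higher.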
Local Outlier Factor (R)
• LOF returns a numeric vector of scores for each observation in the data set.
k is the number of neighbors used in the
calculation of local outlier scores.
See Density_Approach.R
Outlier indexes
Local Outlier Factor (R)
Local outliers are shown in
red.
See Density_Approach.R
Local Outlier Factor (R)
Histogram of regular observations vs outliers
See Density_Approach.R
Summary
We have covered anomaly detection:
• Introduction
▫ Definition of anomaly detection and its importance in energy systems
▫ Different types of anomaly detection methods: statistical, graphical and machine
learning methods
• Graphical approach
▫ Graphical methods consist of the boxplot, scatterplot, adjusted quantile plot and symbol
plot to demonstrate outliers graphically
▫ The main assumption for applying graphical approaches is multivariate normality
▫ The Mahalanobis distance is mainly used for calculating the distance of a point
from the center of a multivariate distribution
• Statistical approach
▫ Statistical hypothesis testing includes the Chi-square test and Grubbs’ test
▫ Statistical methods may use either scores or p-values as thresholds to detect outliers
• Machine learning approach
▫ Both supervised and unsupervised learning methods can be used for outlier detection
▫ Piecewise or segmented regression can be used to identify outliers based on the
residuals for each segment
▫ In the K-means clustering method, outliers are defined as points that do not belong
to any cluster, are far away from the cluster centroids, or form sparse clusters
• Density approach
▫ The local outlier factor algorithm is used to detect local outliers
▫ The relative density of a data point is compared with the density of its k nearest neighbors;
k is chosen by the user
Case study: Anomaly Detection in Energy Data
• Demand response (DR) is defined by the U.S. Department of Energy
(DOE) as:
“Changes in electric usage by end‐use customers from their normal
consumption patterns in response to changes in the price of electricity over
time, or to incentive payments designed to induce lower electricity use at time
of high wholesale market prices or when system reliability is jeopardized.”
Demand response (DR)
Demand response (DR)
http://www.poweritsolutions.com/solutions/demand_response
The DR customer gets notified to participate in the DR event
The customer curtails his energy usage
The usage returns to the normal level as the event ends
Demand response (DR)
• In DR programs, customers are incentivized by either capacity payments or
energy payments:
▫ Capacity payments:
The payments to customers to stand by to be ready to make electrical capacity available during an
emergency.
▫ Energy payments:
The payments based on the actual energy that a customer provides over a set period of time
during a DR event.
• DR programs’ customers are compensated based on the extent to which they
reduce their energy consumption.
• Therefore, DR providers require a reliable system to measure energy reduction.
• Demand Response Measurement and Verification (M&V) is the application of
statistical techniques to measure and verify the load reduction during a DR
event.
• To measure the load reduction, DR providers should first estimate the baseline
for each of their customers.
Demand response measurement and verification
(M&V)
Baseline (normal model)
• A baseline is an estimate of the electricity usage that a customer would have
consumed in the absence of a DR event.
• The baseline is critical for measuring curtailment during DR events.
• It enables DR providers to measure the performance of DR events.
• Since M&V processes are entirely dependent on the baseline calculation and
actual load, the baseline and actual electricity consumption must be calculated
as accurately as possible.
Load reduction
• Baseline = The amount of energy the customer would have consumed in the
absence of the DR event
• Actual consumption = The amount of energy the customer actually consumed
during the DR event
• Load reduction = Baseline – Actual consumption
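As a small worked example of the load reduction formula, the Python sketch below subtracts actual consumption from the baseline interval by interval. The function name and all numbers are made up for illustration and are not taken from the case-study data.

```python
def load_reduction(baseline_kwh, actual_kwh):
    """Interval-by-interval load reduction: baseline minus actual consumption."""
    if len(baseline_kwh) != len(actual_kwh):
        raise ValueError("baseline and actual series must have the same length")
    return [b - a for b, a in zip(baseline_kwh, actual_kwh)]

baseline = [40.0, 42.0, 45.0, 44.0]  # illustrative values only
actual = [30.0, 31.0, 33.0, 43.0]
reduction = load_reduction(baseline, actual)
print(reduction, sum(reduction))
```

An anomalous reading in either series shifts the computed reduction one for one, which is why the inputs need to be checked for anomalies before credits are calculated.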
Baseline, actual consumption, load reduction
[Figure labels: Actual consumption, Baseline, Load reduction]
https://www.cozero.com.au/demand-response
Anomaly detection and its importance in DR programs
• The performance of DR events should be computed precisely, and customers
should not receive credit for more or less than the load reduction they actually
provide. The data used for analysis should therefore be free of anomalies.
• Anomalies can have a significant impact on the results of statistical models,
which are applied to data for measuring the DR events’ performance.
• Anomaly detection techniques can be applied to identify anomalies and allow
the decision makers to find the best way to deal with them.
Energy consumption data
• The primary data set is collected from EnerNOC public data sources.
• The data covers the 5-minute energy usage of a primary/secondary school in
Georgetown, Delaware for 2012.
• EnerNOC collected this data through its smart electrical meters installed on the
site.
• The data consists of five variables:
▫ Timestamp : Date and time in local time zone
▫ Dttm_utc: Date and time in UTC time zone
▫ Value: 5 minute energy usage
▫ Estimated: If estimated equals 1, it indicates that the value (energy usage) is estimated. If it
equals 0, it indicates that the value is not estimated.
▫ Anomaly: If anomaly equals 1, it indicates that the value (energy usage) is anomalous. If
anomaly equals 0, it indicates that the value is normal.
• For the purpose of this case study, we use the data from the beginning of
January to the end of March. We also only use dttm_utc and value columns.
Energy consumption data
EnerNOC data: see 136.csv
EnerNOC metadata: see all_sites.csv
Weather data
• The energy consumption is correlated with temperature. For that reason, we
use weather data (temperature) as a variable impacting the energy usage.
• For this purpose, we collect hourly weather data for Georgetown in Delaware
using “getDetailedWeather” function from “weatherData” package in R.
• This function gathers hourly temperature data for any specific location and date
from www.wunderground.com using two arguments, “station id” and “date”.
• The weather data consists of two variables:
▫ Timestamp: Date and time in the EST time zone
▫ Temperature: Hourly temperature in Fahrenheit
• The weather data timestamp is hourly and associated with the 54th minute of
each hour.
• The weather data is collected for January 2012 to March 2012.
• Refer to Case_study.R for more details
Case study
• Apache Spark : Core Statistics and Machine Learning
▫ September 12,13th, Cambridge, MA
▫ www.analyticscertificate.com/SparkWorkshop
• Anomaly Detection: Core techniques and Best Practices
▫ September 19,20th, New York
▫ www.analyticscertificate.com/AnomalyNYC
• Anomaly Detection: Core techniques and Best Practices
▫ October 26th, 27th, San Francisco
▫ http://www.analyticscertificate.com/GBDC/
Upcoming QuantUniversity workshops
Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
81

Anomaly detection : QuantUniversity Workshop

  • 1.
  • 2.
    Location: Santa Clara, CA August31st 2016 Anomaly Detection Techniques and Best Practices 2016 Copyright QuantUniversity LLC. Presented By: Sri Krishnamurthy, CFA, CAP www.QuantUniversity.com sri@quantuniversity.com
  • 3.
    - Analytics Advisoryservices - Custom training programs - Architecture assessments, advice and audits
  • 4.
    • Founder ofQuantUniversity LLC. and www.analyticscertificate.com • Advisory and Consultancy for Financial Analytics • Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial services and energy customers • Regular Columnist for the Wilmott Magazine • Author of forthcoming book “Financial Modeling: A case study approach” published by Wiley • Charted Financial Analyst and Certified Analytics Professional • Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston Sri Krishnamurthy Founder and CEO 4
  • 5.
    5 Quantitative Analytics andBig Data Analytics Onboarding • Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R • Launching the Analytics Certificate Program in 2016
  • 6.
  • 7.
    7 • If youdon’t have R, ▫ Install R from https://cran.r-project.org/ ▫ Install Rstudio from https://www.rstudio.com/products/rstudio/download2/ ▫ Install IRKernel from http://irkernel.github.io/installation/ ▫ Refer to www.analyticscertificate.com/GBDC for slides and examples Local installation instructions
  • 8.
  • 9.
    What is anomalydetection? • Anomalies or outliers are data points within the datasets that appear to deviate markedly from expected outputs. • An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism1 • Anomaly detection refers to the problem of finding patterns in data that don’t confirm to expected behavior 9 1. D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
  • 10.
    10 • Outliers aredata points that are considered out of the ordinary or abnormal . This includes noise. • Anomalies are a special kind of outlier that has significant/ critical/actionable information which could be of interest to analysts. Anomaly vs Outliers 1 2 All points not in clusters 1 & 2 are Outliers Point B is an Anomaly (Both X and Y are large)
  • 11.
    11 • Fraud Detection ▫Credit card fraud detection  By owner or by operation ▫ Mobile phone fraud/anomaly detection  Calling behavior, volume etc ▫ Insurance claim fraud detection  Medical malpractice  Auto insurance ▫ Insider trading detection • E-commerce ▫ Pricing issues ▫ Network issues Applications of Anomaly Detection
  • 12.
    12 • Intrusion detection: ▫Detect malicious activity in computer systems ▫ This could be host-based or network-based • Medical anomalies Examples of Anomaly Detection
  • 13.
    13 • Manufacturing andsensors: ▫ Fault detection ▫ Heat, fire sensors • Text data ▫ Novel topics, events ▫ Plagiarism Examples of Anomaly Detection
  • 14.
    14 • Anomalies canbe classified into 3 major categories 1. Point Anomalies:  In an instance is anomalous compared with the rest of instances, the anomaly is considered a point anomaly 2. Contextual Anomalies  If an instance is anomalous in a specific context, the anomaly would be considered as a contextual anomaly 3. Collective Anomalies  If a collection of related data records are anomalous with respect to the entire data set, the anomaly is a collective anomaly Anomaly Classification
  • 15.
    15 • In thefigure, points o1 and o2 are considered point anomalies • Examples: ▫ A 50% increase in daily stock price ▫ A credit card transaction attempt for $5000 (assuming you have never had a single transaction for anything above $1000) Point Anomalies
  • 16.
    16 • In thefigure, temperature t2 is an anomaly • Note that t1 is lower than t2 but contextually, t1 is expected and t2 isn’t when compared to records around it. Contextual Anomalies
  • 17.
    17 • Multiple BuyStock transactions and then a sequence of Sell transactions around an earnings release date may be anomalous and may indicate insider trading. • Consider the sequence of network activities recorded • Though ssh, buffer-overflow and ftp themselves are not anomalous activities, a sequence of the three indicates a web-based attack • Similarly, multiple http requests from an ip address may indicate a crawler in action. Collective Anomalies
  • 18.
    18 • In medicine,abnormal ECG pattern detection would involve looking for collective anomalies like Premature Atrial Contraction Collective Anomalies http://www.fprmed.com/Pages/Cardio/PAC.html
  • 19.
    19 1. Graphical approach 2.Statistical approach 3. Machine learning approach 4. Density based approach Illustration of four methodologies to Anomaly Detection in R
  • 20.
    20  Boxplot  Scatterplot  Adjusted quantile plot  Symbol plot
  • 21.
    Graphical approaches • Graphicalmethods utilize extreme value analysis, by which outliers correspond to the statistical tails of probability distributions. • Statistical tails are most commonly used for one dimensional distributions, although the same concept can be applied to multidimensional case. • It is important to understand that all extreme values are outliers but the reverse may not be true. • For instance in one dimensional dataset of {1,3,3,3,50,97,97,97,100}, observation 50 equals to mean and isn’t considered as an extreme value, but since this observation is the most isolated point, it should be considered as an outlier. 21
  • 22.
    Box plot • Astandardized way of displaying the variation of data based on the five number summary, which includes minimum, first quartile, median, third quartile, and maximum. • This plot does not make any assumptions of the underlying statistical distribution. • Any data not included between the minimum and maximum are considered as an outlier. 22
  • 23.
  • 24.
    Scatter plot • Amathematical diagram, which uses Cartesian coordinates for plotting ordered pairs to show the correlation between typically two random variables. • An outlier is defined as a data point that doesn't seem to fit with the rest of the data points. • In scatterplots, outliers of either intersection or union sets of two variables can be shown. 24
  • 25.
  • 26.
    26 • In statistics,a Q–Q plotis a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. • If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. Q-Q plot Source: Wikipedia
  • 27.
    Adjusted quantile plot •This plot identifies possible multivariate outliers by calculating the Mahalanobis distance of each point from the center of the data. • Multi-dimensional Mahalanobis distance between vectors x and y in 𝑅 𝑛 can be formulated as: d(x,y) = x − y TS−1(x − y) where x and y are random vectors of the same distribution with the covariance matrix S. • An outlier is defined as a point with a distance larger than some pre-determined value. 27
  • 28.
    Adjusted quantile plot •Before applying this method and many other parametric multivariate methods, first we need to check if the data is multivariate normally distributed using different multivariate normality tests, such as Royston, Mardia, Chi- square, univariate plots, etc. • In R, we use the “mvoutlier” package, which utilizes graphical approaches as discussed above. 28
  • 29.
    Adjusted quantile plot
    • Min-max normalization is applied before diving into the analysis.
    • Multivariate normality test.
    • The outlier Boolean vector identifies the outliers.
    • Alpha defines the maximum thresholding proportion.
    See Graphical_Approach.R
  • 30.
    Adjusted quantile plot
    • Mahalanobis distances
    • Covariance matrix
    See Graphical_Approach.R
  • 31.
    Adjusted quantile plot
    See Graphical_Approach.R
  • 32.
    Symbol plot
    • This plot displays two-dimensional data, using robust Mahalanobis distances based on the minimum covariance determinant (MCD) estimator with adjustment.
    • The Minimum Covariance Determinant (MCD) estimator looks for the subset of h data points whose covariance matrix has the smallest determinant.
    • The four ellipses drawn in the plot correspond to Mahalanobis distances at the 25%, 50%, 75% and adjusted quantiles of the chi-square distribution.
  • 33.
    Symbol plot
    • The parameter “quan” defines the proportion of observations used for the minimum covariance determinant estimation. The default is 0.5.
    • Alpha defines the proportion of observations used for calculating the adjusted quantile.
    See Graphical_Approach.R
  • 34.
    • Hypothesis testing (Chi-square test, Grubbs’ test)
    • Scores
  • 35.
    Hypothesis testing
    • This method draws conclusions about a sample point by testing whether it comes from the same distribution as the training data.
    • Statistical tests, such as the t-test and the ANOVA table, can be used on multiple subsets of the data.
    • Here, the level of significance, i.e., the probability of incorrectly rejecting a true null hypothesis, needs to be chosen.
    • In R, this method can be applied with the “outliers” package, which implements statistical tests.
  • 36.
    Chi-square test
    • The chi-square test is a simple test for detecting outliers in univariate data, based on the chi-square distribution of the squared difference between a data point and the sample mean.
    • In this test, the sample variance serves as the estimator of the population variance.
    • The chi-square test helps us identify the lowest and highest values, since outliers can exist in both tails of the data.
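The test described on this slide can be sketched in Python as follows. This is an independent illustration, not the code of the R “outliers” package: the function name and data are made up, and the p-value uses the fact that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math
from statistics import mean, variance

def chisq_outlier_test(values):
    """Test the value farthest from the mean; returns (suspect, statistic, p_value)."""
    m = mean(values)
    var = variance(values)  # sample variance used as the population estimate
    suspect = max(values, key=lambda v: abs(v - m))
    statistic = (suspect - m) ** 2 / var
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return suspect, statistic, p_value

readings = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 9.7, 15.0]  # made-up data
suspect, statistic, p_value = chisq_outlier_test(readings)
print(suspect, round(statistic, 3), round(p_value, 4))
```

A small p-value suggests that the most extreme value is unlikely under the assumed model. Because the suspect value itself inflates the sample variance, the test becomes conservative for small samples.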
  • 37.
    Chi-square test
    When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One statistical test that addresses this issue is the chi-square goodness of fit test. This test is commonly used to test association of variables in two-way tables, where the assumed model of independence is evaluated against the observed data. In general, the chi-square test statistic is of the form $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$. If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the data (anomaly).
  • 38.
    Chi-square test
    • This function repeats the Chi-square test until it finds all the outliers within the data.
    See Statistical_Approach.R
  • 39.
    Grubbs’ test
    • A test for outliers in univariate data sets assumed to come from a normally distributed population.
    • Grubbs' test detects one outlier at a time. This outlier is expunged from the dataset and the test is iterated until no outliers are detected.
    • This test is defined for the following hypotheses:
      ▫ H0: There are no outliers in the data set
      ▫ H1: There is exactly one outlier in the data set
    • The Grubbs' test statistic is defined as: $G = \frac{\max_i |Y_i - \bar{Y}|}{s}$, where $\bar{Y}$ is the sample mean and $s$ is the sample standard deviation.
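To make the statistic concrete, here is a small Python sketch that computes G, the largest absolute deviation from the sample mean divided by the sample standard deviation, and reports the value that produced it. The function name and data are illustrative assumptions. The comparison against a critical value is left out because it depends on a t-distribution quantile that the Python standard library does not provide.

```python
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, suspect): the Grubbs statistic and the value farthest from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    suspect = max(values, key=lambda v: abs(v - m))
    return abs(suspect - m) / s, suspect

readings = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 9.7, 15.0]  # made-up data
G, suspect = grubbs_statistic(readings)
print(round(G, 3), suspect)
```

In an iterative procedure, the suspect would be removed when G exceeds the critical value for the current sample size, and the statistic recomputed on the remaining data.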
  • 40.
    Grubbs’ test
    • The above function repeats the Grubbs’ test until it finds all the outliers within the data.
    See Statistical_Approach.R
  • 41.
    Grubbs’ test
    • Histogram of normal observations vs. outliers
    See Statistical_Approach.R
  • 42.
    Scores
    • Scores quantify the tendency of a data point to be an outlier by assigning it a score or probability.
    • The most commonly used scores are:
      ▫ Normal score: $z_i = \frac{x_i - \text{mean}}{\text{standard deviation}}$
      ▫ T-student score: $t = \frac{z \sqrt{n - 2}}{\sqrt{n - 1 - z^2}}$, where z is the normal score and n is the sample size
      ▫ Chi-square score: $\left(\frac{x_i - \text{mean}}{\text{sd}}\right)^2$
      ▫ IQR score: based on the interquartile range $Q_3 - Q_1$
    • By using the “scores” function in R, p-values can be returned instead of scores.
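The normal, chi-square and IQR scores can be sketched in Python as below. This is an illustration rather than the R implementation: the function names and data are made up, the quartile method is an assumption, and the IQR score follows the convention of measuring the distance to the nearest quartile in units of IQR, with values inside the quartiles scored as zero.

```python
from statistics import mean, stdev, quantiles

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def chisq_scores(values):
    return [z ** 2 for z in z_scores(values)]

def iqr_scores(values):
    """Signed distance to the nearest quartile, divided by the IQR; 0 inside [Q1, Q3]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    scores = []
    for v in values:
        if v < q1:
            scores.append((v - q1) / iqr)
        elif v > q3:
            scores.append((v - q3) / iqr)
        else:
            scores.append(0.0)
    return scores

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up illustration data
print([round(z, 2) for z in z_scores(data)])
print([round(c, 2) for c in chisq_scores(data)])
print(iqr_scores(data))
```

A cut-off on any of these scores turns the continuous score into a yes/no outlier flag.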
  • 43.
    Scores
    • “type” defines the type of the score, such as normal, t-student, etc.
    • “prob=1” returns the corresponding p-value.
    See Statistical_Approach.R
  • 44.
    Scores
    • When “prob” is set to a specific value, a logical vector is returned that marks as outliers the data points whose probabilities are greater than this cut-off value.
    • When “type” is set to IQR, all values lower than the first quartile or greater than the third quartile are considered, and the difference between each of them and the nearest quartile, divided by the IQR, is calculated.
    See Statistical_Approach.R
  • 45.
    Twitter packages
    • Anomaly Detection
      ▫ Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test for detecting anomalies.
      ▫ Anomaly detection here refers to point-in-time anomalous data points, which can be global or local. A local anomaly is one that occurs inside a seasonal pattern. Anomalies can be positive or negative.
      ▫ More details here: https://github.com/twitter/AnomalyDetection
    • Breakout Detection
      ▫ A breakout is characterized in this package by two steady states and an intermediate transition period that could be sudden or gradual.
      ▫ Uses the E-Divisive with Medians algorithm. It can detect one or multiple breakouts in a given time series and employs energy statistics to detect divergence in mean. More details here: https://blog.twitter.com/2014/breakout-detection-in-the-wild
  • 46.
  • 47.
    • Linear regression
    • Piecewise/segmented regression
    • Clustering-based approaches
  • 48.
    Linear regression
    • Linear regression investigates the linear relationships between variables and predicts one variable based on one or more other variables. It can be formulated as: $Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$, where Y and $X_i$ are random variables, $\beta_i$ is a regression coefficient and $\beta_0$ is a constant.
    • In this model, the ordinary least squares estimator is usually used; it minimizes the sum of squared differences between the observed values of the dependent variable and the values predicted from the independent variables.
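For a single predictor, the least squares estimates have a closed form, which the following Python sketch implements. The function name and data are illustrative assumptions; large residuals from the fitted line are what a regression-based detector would inspect.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.0, 5.0, 7.0, 20.0]  # made-up data; the last point departs from y = 2x + 1
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1)
print([round(r, 2) for r in residuals])
```

Note that a single extreme point also pulls the fitted line toward itself, so residuals from a least squares fit can understate how unusual that point is.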
  • 49.
    Piecewise/segmented regression
    • A method in regression analysis in which the independent variable is partitioned into intervals, so that multiple linear models can be fitted to the data for different ranges.
    • This model can be applied when there are ‘breakpoints’ and clearly different linear relationships in the data, with a sudden, sharp change in directionality. Below is a simple segmented regression for data with one breakpoint: $Y = C_0 + \varphi_1 X$ for $X < X_1$ and $Y = C_1 + \varphi_2 X$ for $X > X_1$, where Y is the predicted value, X is the independent variable, $C_0$ and $C_1$ are constants, $\varphi_1$ and $\varphi_2$ are regression coefficients, and $X_1$ is the breakpoint.
  • 50.
    Anomaly detection vs. supervised learning
  • 51.
    Piecewise/segmented regression
    • For this example, we use the “segmented” package in R to first illustrate piecewise regression for a two-dimensional data set that has a breakpoint around z = 0.5.
    • “pmax” computes the parallel (element-wise) maximum and is used to create different values for y.
    See Piecewise_Regression.R
  • 52.
    Piecewise/segmented regression
    • Then, we use linear regression to predict y values for each segment of z.
    See Piecewise_Regression.R
  • 53.
    Piecewise/segmented regression
    • Finally, the outliers can be detected for each segment by setting rules for the residuals of the model.
    • Here, we set the rule for the residuals corresponding to z less than 0.5, by which points with residuals below 0.5 are defined as outliers.
    See Piecewise_Regression.R
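The three steps of this walk-through (split at the breakpoint, fit a line per segment, apply a residual rule) can be sketched in Python. Several things here are assumptions made for illustration: the breakpoint is taken as known at 0.5 rather than estimated as the “segmented” package would do, the rule flags absolute residuals above a threshold, and the data are synthetic with one injected anomaly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def segment_outliers(zs, ys, break_at, threshold):
    """Fit one line on each side of break_at and flag z values with large residuals."""
    flagged = []
    for in_segment in (lambda z: z < break_at, lambda z: z >= break_at):
        seg = [(z, y) for z, y in zip(zs, ys) if in_segment(z)]
        b0, b1 = fit_line([z for z, _ in seg], [y for _, y in seg])
        flagged += [z for z, y in seg if abs(y - (b0 + b1 * z)) > threshold]
    return flagged

zs = [i / 10 for i in range(11)]
ys = [2 * max(z - 0.5, 0.0) for z in zs]  # flat up to 0.5, then rising
ys[2] += 1.0                              # inject an anomaly at z = 0.2
flagged = segment_outliers(zs, ys, break_at=0.5, threshold=0.5)
print(flagged)
```

Fitting one line to the whole range would leave large residuals around the breakpoint and hide the injected point, which is the motivation for fitting the segments separately.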
  • 54.
    Clustering-based approaches
    • These methods are suitable for unsupervised anomaly detection.
    • They aim to partition the data into meaningful groups (clusters) based on the similarities and relationships between the groups found in the data.
    • Each data point is assigned a degree of membership for each of the clusters.
    • Anomalies are those data points that:
      ▫ Do not fit into any clusters.
      ▫ Belong to a particular cluster but are far away from the cluster centroid.
      ▫ Form small or sparse clusters.
  • 55.
    Clustering-based approaches
    • These methods partition the data into K clusters by assigning each data point to its closest cluster centroid, minimizing the within-cluster sum of squares (WSS), which is: $\sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2$, where $S_k$ is the set of observations in the kth cluster and $\mu_{kj}$ is the mean of the jth variable of the cluster center of the kth cluster.
    • Then, they select the top n points that are the farthest away from their nearest cluster centers as outliers.
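A minimal Python sketch of this idea follows: plain k-means (Lloyd's algorithm) and then a ranking of points by distance to the nearest centroid. It is not the algorithm of any particular R package; the function names, the fixed iteration count, the caller-supplied starting centroids and the data are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
    return centroids

def top_n_outliers(points, centroids, n):
    """The n points farthest from their nearest centroid."""
    def distance_to_nearest(p):
        return min(math.dist(p, c) for c in centroids)
    return sorted(points, key=distance_to_nearest, reverse=True)[:n]

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 14)]
centroids = kmeans(points, centroids=[(1, 1), (9, 9)])
print(centroids)
print(top_n_outliers(points, centroids, n=1))
```

In this plain version the outlier still takes part in the centroid update and drags its cluster center toward itself; variants that set suspected outliers aside during the update avoid that effect.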
  • 56.
    Anomaly detection vs. unsupervised learning
  • 57.
    Clustering-based approaches
    • The “Kmod” package in R is used to show the application of the K-means model.
    • In this example, the number of clusters is chosen from the bend (elbow) graph and then passed to the K-mod function.
    See Clustering_Approach.R
  • 58.
    Clustering-based approaches
    • K=4 is the number of clusters and L=10 is the number of outliers.
    See Clustering_Approach.R
  • 59.
  • 60.
  • 61.
    Local Outlier Factor (LOF)
    • The local outlier factor (LOF) algorithm first calculates the density of the local neighborhood of each point.
    • Then, for each object p, the LOF score is defined as the average of the ratios of the local density of each of p's nearest neighbors to the local density of p. The number of nearest neighbors, k, is given by the user.
    • Points with the largest LOF scores are considered outliers.
    • In R, both the “DMwR” and “Rlof” packages can be used to fit the LOF model.
  • 62.
    Local Outlier Factor (LOF)
    • The LOF scores for outlying points will be high because they are computed in terms of the ratios to the average neighborhood reachability distances.
    • As a result, for data points that are distributed homogeneously within a cluster, the LOF scores will be close to one.
    • Over a range of different values for k, the maximum LOF score will determine the scores associated with the local outliers.
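The LOF computation can be written out in a short Python sketch. It is a simplified illustration, not the code of the R packages named in the slides: it uses brute-force distances, keeps exactly k neighbors even when distances tie, and assumes no point is duplicated more than k times (otherwise a reachability sum could be zero). The function name and data are made up.

```python
import math

def lof_scores(points, k):
    """Local outlier factor for each point, using exactly k nearest neighbors."""
    n = len(points)
    # For each point: its k nearest neighbors as (distance, index), closest first
    neighbors = [
        sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)[:k]
        for i in range(n)
    ]
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    def local_reachability_density(i):
        # Reachability distance of i from neighbor j: max(k-distance of j, d(i, j))
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]  # last point is far away
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

Points inside the tight group score near one, while the distant point scores well above one because its neighbors are much denser than it is.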
  • 63.
    Local Outlier Factor (R)
    • LOF returns a numeric vector of scores for each observation in the data set.
    • k is the number of neighbors used in the calculation of the local outlier scores.
    • Outlier indexes
    See Density_Approach.R
  • 64.
    Local Outlier Factor (R)
    • Local outliers are shown in red.
    See Density_Approach.R
  • 65.
    Local Outlier Factor (R)
    • Histogram of regular observations vs. outliers
    See Density_Approach.R
  • 66.
    Summary
    We have covered anomaly detection:
    • Introduction
      ▫ Definition of anomaly detection and its importance in energy systems
      ▫ Different types of anomaly detection methods: statistical, graphical and machine learning methods
    • Graphical approach
      ▫ Graphical methods consist of the boxplot, scatterplot, adjusted quantile plot and symbol plot, which demonstrate outliers graphically
      ▫ The main assumption for applying graphical approaches is multivariate normality
      ▫ The Mahalanobis distance is mainly used for calculating the distance of a point from the center of a multivariate distribution
    • Statistical approach
      ▫ Statistical hypothesis testing includes the Chi-square test and Grubbs’ test
      ▫ Statistical methods may use either scores or p-values as thresholds to detect outliers
    • Machine learning approach
      ▫ Both supervised and unsupervised learning methods can be used for outlier detection
      ▫ Piecewise or segmented regression can be used to identify outliers based on the residuals for each segment
      ▫ In the K-means clustering method, outliers are defined as points that do not belong to any cluster, are far away from the centroid of their cluster, or form sparse clusters
    • Density approach
      ▫ The local outlier factor algorithm is used to detect local outliers
      ▫ The relative density of a data point is compared with the density of its k nearest neighbors. k is mainly chosen by the user
  • 67.
    Case study: Anomaly Detection in Energy Data
  • 68.
    Demand response (DR)
    • Demand response (DR) is defined by the U.S. Department of Energy (DOE) as: “Changes in electric usage by end‐use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to induce lower electricity use at time of high wholesale market prices or when system reliability is jeopardized.”
  • 69.
    Demand response (DR)
    • The DR customer gets notified to participate in the DR event.
    • The customer curtails their energy usage.
    • The usage returns to the normal level as the event ends.
    Source: http://www.poweritsolutions.com/solutions/demand_response
  • 70.
    Demand response (DR)
    • In DR programs, customers are incentivized by either capacity payments or energy payments:
      ▫ Capacity payments: Payments to customers for standing by, ready to make electrical capacity available during an emergency.
      ▫ Energy payments: Payments based on the actual energy that a customer provides over a set period of time during a DR event.
    • Customers of DR programs are compensated based on the extent to which they reduce their energy consumption.
    • Therefore, DR providers require a reliable system to measure energy reduction.
  • 71.
    Demand response measurement and verification (M&V)
    • Demand Response Measurement and Verification (M&V) is the application of statistical techniques to measure and verify the load reduction during a DR event.
    • To measure the load reduction, DR providers should first estimate the baseline for each of their customers.
  • 72.
    Baseline (normal model)
    • A baseline is an estimate of the electricity that a customer would have consumed in the absence of a DR event.
    • The baseline is critical for measuring curtailment during DR events.
    • It enables DR providers to measure the performance of DR events.
    • Since M&V processes are entirely dependent on the baseline calculation and actual load, the baseline and actual electricity consumption must be calculated as accurately as possible.
  • 73.
    Load reduction
    • Baseline = The amount of energy the customer would have consumed in the absence of the DR event
    • Actual consumption = The amount of energy the customer actually consumed during the DR event
    • Load reduction = Baseline – Actual consumption
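As a small worked example of the formula, the Python sketch below computes the load reduction per interval and in total. The numbers and the kWh unit are made up for illustration and do not come from the case study data.

```python
# Hypothetical interval values in kWh during one DR event
baseline = [120.0, 118.0, 121.0, 119.0]  # estimated usage without the event
actual = [90.0, 85.0, 88.0, 92.0]        # metered usage during the event

load_reduction = [b - a for b, a in zip(baseline, actual)]
total_reduction = sum(load_reduction)

print(load_reduction)
print(total_reduction)
```

An anomalous meter reading in either series changes the computed reduction directly, which is why the data is screened before this step.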
  • 74.
    Baseline, actual consumption, load reduction
    [Figure: actual consumption, baseline and load reduction. Source: https://www.cozero.com.au/demand-response]
  • 75.
    Anomaly detection and its importance in DR programs
    • The performance of DR events should be computed precisely, and customers should not receive credit for more or less than the load reduction they actually provide. The data used for the analyses should therefore be free of anomalies.
    • Anomalies can have a significant impact on the results of the statistical models that are applied to the data to measure the performance of DR events.
    • Anomaly detection techniques can be applied to identify anomalies and allow decision makers to find the best way to deal with them.
  • 76.
    Energy consumption data
    • The primary data set is collected from EnerNOC public data sources.
    • The data covers the 5-minute energy usage of a primary/secondary school in Georgetown, Delaware for 2012.
    • EnerNOC collected this data through its smart electrical meters installed on the site.
    • The data consists of five variables:
      ▫ Timestamp: Date and time in local time zone
      ▫ Dttm_utc: Date and time in UTC time zone
      ▫ Value: 5-minute energy usage
      ▫ Estimated: If estimated equals 1, the value (energy usage) is estimated. If it equals 0, the value is not estimated.
      ▫ Anomaly: If anomaly equals 1, the value (energy usage) is anomalous. If anomaly equals 0, the value is normal.
    • For the purpose of this case study, we use the data from the beginning of January to the end of March. We also only use the dttm_utc and value columns.
  • 77.
    Energy consumption data
    • EnerNOC data: see 136.csv
    • EnerNOC metadata: see all_sites.csv
  • 78.
    Weather data
    • The energy consumption is correlated with temperature. For that reason, we use weather data (temperature) as a variable impacting the energy usage.
    • For this purpose, we collect hourly weather data for Georgetown, Delaware using the “getDetailedWeather” function from the “weatherData” package in R.
    • This function gathers hourly temperature data for any specific location and date from www.wunderground.com using two arguments, “station id” and “date”.
    • The weather data consists of two variables:
      ▫ Timestamp: Date and time in EST time zone
      ▫ Temperature: Hourly temperature in Fahrenheit
    • The weather data timestamp is hourly and associated with the 54th minute of each hour.
    • The weather data is collected for January 2012 to March 2012.
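Because the temperature readings are stamped at the 54th minute of each hour while energy usage arrives every 5 minutes, the two series have to be aligned before they can be modeled together. The slides do not say how this is done in Case_study.R, so the Python sketch below shows just one possible approach: attach to each energy timestamp the most recent temperature reading. The function name and values are hypothetical, and time zone conversion is ignored.

```python
from bisect import bisect_right
from datetime import datetime

def latest_temperature(weather, when):
    """weather: list of (timestamp, temperature) sorted by timestamp.
    Returns the most recent reading at or before `when`, or None if there is none."""
    times = [t for t, _ in weather]
    i = bisect_right(times, when) - 1
    return weather[i][1] if i >= 0 else None

# Hypothetical hourly readings stamped at minute 54
weather = [
    (datetime(2012, 1, 1, 0, 54), 41.0),
    (datetime(2012, 1, 1, 1, 54), 39.9),
]

for when in (datetime(2012, 1, 1, 0, 30), datetime(2012, 1, 1, 1, 0), datetime(2012, 1, 1, 1, 55)):
    print(when, latest_temperature(weather, when))
```

Interpolating between the two surrounding readings would be an equally reasonable choice; the important part is that both series end up on the same time axis and in the same time zone.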
  • 79.
    Case study
    • Refer to Case_study.R for more details
  • 80.
    Upcoming QuantUniversity workshops
    • Apache Spark: Core Statistics and Machine Learning
      ▫ September 12th and 13th, Cambridge, MA
      ▫ www.analyticscertificate.com/SparkWorkshop
    • Anomaly Detection: Core techniques and Best Practices
      ▫ September 19th and 20th, New York
      ▫ www.analyticscertificate.com/AnomalyNYC
    • Anomaly Detection: Core techniques and Best Practices
      ▫ October 26th and 27th, San Francisco
      ▫ http://www.analyticscertificate.com/GBDC/
  • 81.
    Thank you!
    Sri Krishnamurthy, CFA, CAP
    Founder and CEO
    QuantUniversity LLC.
    srikrishnamurthy
    www.QuantUniversity.com
    Contact
    Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.