03 presentation-bothiesson

Anomaly Detection
From basic concepts to application cases at Microsoft
Bo Thiesson
thiesson@cs.aau.dk
http://people.cs.aau.dk/~thiesson
Aalborg University (until 1996)
Microsoft Research, Redmond, USA (1996 – 2013)
Aalborg University (2013 – )

Outline
Basic setup & motivation
•  1 class ↔ 2 class
•  Application examples
Standard algorithms
•  Parametric - Gaussian modeling, Replicator Neural Networks
•  Non-parametric - NN modeling (distance & density based, LOF)
•  Clustering
Contextual detection
The development process
•  Feature selection & model validation
Microsoft applications
•  Excel (conditional formatting rules)
•  OSD (Bing, MSN, AdCenter) – Capacity prediction
•  OSD, Azure – Latent Fault Detection in Data Centers
•  Corporate Security – Insider Threats
Future (and current) challenges
InfinIT seminar, March 2015

(Simplified) Anomaly Detection Example
Computer features: $x = (x_1, \dots, x_n)$
$x_1$: Memory use
$x_2$: CPU load
⋮
Training data: $\{x^{(1)}, \dots, x^{(m)}\}$
Run-time data: $x$
[Figure: scatter plot of the training data with axes Memory use and CPU load; run-time points are labeled "ok" and "anomaly"]

What is an anomaly?
•  An anomaly is a pattern in the data that does not conform to the expected behaviour
•  Also referred to as outliers, exceptions, peculiarities, surprises, etc.
•  Definition of Hawkins [Hawkins 1980]:
“An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism”

(Simplified) Anomaly Detection Example
Computer features: $x = (x_1, \dots, x_n)$
$x_1$: Memory use
$x_2$: CPU load
⋮
Training data:
ok: $\{x^{(1)}, \dots, x^{(m)}\}$
not-ok: $\{x^{(1)}, \dots, x^{(k)}\}$
Run-time data: $x$
[Figure: scatter plot with axes Memory use and CPU load, now showing both ok and not-ok training points; run-time points are labeled "ok" and "anomaly"]

Two Scenarios
Supervised scenario (standard classification)
•  Training data with both normal and abnormal data objects is provided
•  There may be multiple normal and/or abnormal classes
•  Often, the classification problem is highly imbalanced
•  Remedy: up-sample abnormal data or down-sample normal data
•  Future anomalies must look like the abnormal class(es) of training
objects

Two Scenarios (cont.)
Semi-supervised Scenario (anomaly detection)
•  Training data for only the normal class(es) is provided
•  There may be multiple normal classes
•  Often no (explicit) class label is given, but the data is assumed to be (mostly) normal
•  Relies on anomalies being very rare compared to normal data
•  Sometimes a very small number of anomalies is included in the training data (0 to 20 is common), but not enough to learn abnormal class(es)
•  Future anomalies may look nothing like any previously observed anomaly
(Focus of this talk)

Where to look for anomalies
w  Anomalous events occur relatively infrequently
w  However, when they do occur, their
consequences can be quite dramatic and quite
often in a negative sense
Some applications:
w  Fraud: Banking, Insurance, Health care, Click,..
w  Cyber intrusion: Insider/outsider, machine or network level
w  Medicine and health: abnormal test results, disease patterns (e.g. geo-spatial)
w  (Anti) terrorism
w  Data cleaning (measurement errors)
w  …


Modeling paradigms
Parametric
•  Learn (parametric) model representing normal data
•  Outliers deviate strongly from this model
•  Data represented by sufficient statistics (estimating model parameters)
•  Space efficient & fast run-time (usually)
•  Depends on the parametric structure (shape) of the model, which also restricts the number of normal classes
Non-parametric
•  All data is kept around; the “model” defines a distance measure between data points, and sometimes a kernel
•  Distance-based approach: outliers have a large distance to other data
•  (Kernel) density-based approach: the density around an outlier is significantly lower than around its neighbors
•  Space demanding & not as fast run-time (usually)
•  No structural model constraints or restrictions on the number of normal classes

Algorithm – Parametric
Example: Multivariate Gaussian (normal) model

$$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
•  $\mu$ is the mean value of all points (usually data is normalized so that $\mu = 0$)
•  $\Sigma$ is the covariance matrix (spread around the mean)
•  (Squared) Mahalanobis distance
•  $MDist(x, \mu) = (x-\mu)^T \Sigma^{-1} (x-\mu)$
•  follows a $\chi^2$-distribution with $d$ degrees of freedom ($d$ = data dimensionality)
•  Outliers: points $x$ with $MDist(x, \mu) > \chi^2_d(0.975)$ [≈ 3σ]
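A minimal sketch of the Mahalanobis test above, restricted to $d = 2$ so that it needs no external library. The training values, the helper names, and the closed-form quantile (valid only for a $\chi^2$-distribution with 2 degrees of freedom) are illustrative assumptions, not part of the talk.

```python
import math

def mean_and_cov_2d(points):
    """Estimate the mean vector and the 2x2 sample covariance matrix."""
    m = len(points)
    mu = [sum(p[0] for p in points) / m, sum(p[1] for p in points) / m]
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / (m - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / (m - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / (m - 1)
    return mu, [[sxx, sxy], [sxy, syy]]

def mdist(x, mu, cov):
    """MDist(x, mu) = (x - mu)^T cov^{-1} (x - mu), using the 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def chi2_quantile_2dof(p):
    """Quantile of chi-square with 2 degrees of freedom (CDF: 1 - exp(-x/2))."""
    return -2.0 * math.log(1.0 - p)

# Hypothetical training snapshots: (memory use, CPU load), all assumed normal.
train = [(40, 30), (42, 33), (38, 29), (41, 31),
         (39, 32), (43, 34), (37, 28), (40, 31)]
mu, cov = mean_and_cov_2d(train)
threshold = chi2_quantile_2dof(0.975)

def is_outlier(x):
    """Flag a run-time point whose MDist exceeds the chi-square threshold."""
    return mdist(x, mu, cov) > threshold

print(is_outlier((41, 32)), is_outlier((44, 27)))
```

The second point is not far from the mean on either axis, but it breaks the positive correlation between the two features, which is what the covariance term picks up.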

Parametric algorithm – Multivariate Gaussian
Tan, P.-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley.
$$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
Parameters: $\mu$, $\Sigma$
Data object: $x$

Algorithm – Parametric (cont.)
Challenge
A Fix
•  Mixture modeling
•  ⇒ contextual outlier detection (more later)
•  Demands a clustering approach
•  Not as flexible as non-parametric approaches

Replicator Neural Networks
S. Hawkins, et al. Outlier detection using replicator neural networks, DaWaK02 (2002)
•  Use a replicator 4-layer feed-forward neural network (RNN) with the
same number of input and output nodes
•  The input variables are also the target (output) variables, so the RNN forms a compressed model of the data during training
•  The reconstruction error of an individual data point serves as its anomaly measure
[Figure: replicator network diagram with labels "Input" and "Target variables"]

Algorithm – Non-Parametric
Nearest Neighbor (NN) Based Techniques
Key assumption: normal data records have close neighbors while anomalies
are located far from other records
General two-step approach
1. Compute the neighborhood of each data record
2. Analyze the neighborhood to determine whether the data record is an anomaly or not
Paradigms:
•  Distance based methods
•  Anomalies are data points most distant from other points
•  Density based methods
•  Anomalies are data points in low density regions

NN Distance-based Anomaly Detection
Knorr & Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98
Algorithm
•  For each data record $d$, compute the distance $d_k$ to its $k$-th nearest neighbor
•  The distance measure is the important design choice
•  Sort all data records according to the distance $d_k$
•  Outliers are records that have the largest distance $d_k$ and are therefore located in the sparsest neighborhoods
•  Usually, data records in the top $n\%$ by distance $d_k$ are identified as outliers
•  Multiple normal classes are not a problem
•  Not suitable for datasets that have modes with varying density
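The steps above can be sketched as a brute-force computation. Euclidean distance, the toy points, and the function names are assumptions made for illustration; the cited paper is about making this scale to large data sets, which this sketch does not attempt.

```python
import math

def kth_nn_distance(points, k):
    """For each record, the Euclidean distance d_k to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def top_fraction_outliers(points, k, fraction):
    """Indices of the records with the largest d_k (the top `fraction` of records)."""
    d_k = kth_nn_distance(points, k)
    n_out = max(1, round(fraction * len(points)))
    ranked = sorted(range(len(points)), key=lambda i: d_k[i], reverse=True)
    return ranked[:n_out]

# Hypothetical data: a tight group plus one distant record.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
outliers = top_fraction_outliers(points, k=2, fraction=0.2)
print(outliers)
```

The quadratic cost of comparing every pair is acceptable for a toy example only; the choice of distance measure remains the main design decision.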

Advantages of Density-based Techniques
Anomalies:
•  Distance-based: $p_1$, $p_3$, …
•  Density-based: $p_1$, $p_2$, …
[Figure: scatter plot with marked points $p_1$, $p_2$, $p_3$; annotations show the distance from $p_2$ to its nearest neighbor and the distance from $p_3$ to its nearest neighbor]
Fixes problem with different local densities

Local Outlier Factor (LOF)
(Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000)
Algorithm
•  For each data point $q$, compute the distance to the $k$-th NN ($k$-dist)
•  Compute the reachability distance ($r$-dist) for each data example $q$ with respect to data example $p$ as:
$$r\text{-dist}(q, p) = \max\{k\text{-dist}(p),\ \text{dist}(q, p)\}$$
•  Compute the local reachability density (lrd) of data example $q$ as the inverse of the average reachability distance based on the $k$-NN of data example $q$:
$$\text{lrd}(q) = \frac{k}{\sum_{p \in kNN(q)} r\text{-dist}(q, p)}$$
•  Compute $\text{LOF}_k(q)$ as the ratio of the average local reachability density of $q$'s $k$-NN to the local reachability density of the data record $q$:
$$\text{LOF}_k(q) = \frac{1}{k} \sum_{p \in kNN(q)} \frac{\text{lrd}(p)}{\text{lrd}(q)}$$
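A direct, brute-force transcription of the formulas above might look as follows. It assumes Euclidean distance, distinct points, and exactly $k$ neighbors per point (ties broken by index); the grid data and the function name are illustrative only.

```python
import math

def lof_scores(points, k):
    """LOF_k for every point, following the k-dist / r-dist / lrd definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors of each point, excluding the point itself.
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
           for i in range(n)]
    # k-dist(p): distance from p to its k-th nearest neighbor.
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def r_dist(q, p):
        """Reachability distance of q with respect to p."""
        return max(k_dist[p], dist[q][p])

    # lrd(q) = k / sum of reachability distances to q's neighbors.
    lrd = [k / sum(r_dist(q, p) for p in knn[q]) for q in range(n)]
    # LOF_k(q) = average of lrd(p) / lrd(q) over q's neighbors p.
    return [sum(lrd[p] for p in knn[q]) / (k * lrd[q]) for q in range(n)]

# Hypothetical data: a regular 3x3 grid plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
print([round(s, 2) for s in scores])
```

On this data the grid points score close to 1 and the isolated point scores far above 1, matching the interpretation of LOF values as relative local density.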

Local Outlier Factor (LOF)
Properties
•  LOF ≈ 1: point is in a cluster (region with homogeneous
density around the point and its neighbors)
•  LOF >> 1: point is an anomaly

Clustering Based Techniques
Key Assumption: Normal data instances belong to large and dense clusters,
while anomalies do not belong to any significant cluster.
General Approach:
•  Cluster data into a finite number of clusters.
•  Analyze each data instance with respect to its closest cluster.
•  Anomalous Instances
•  Data instances that do not fit into any cluster (residuals from clustering).
•  Data instances in small clusters.
•  Data instances that are far from other points within the same cluster.
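As a sketch of the general approach, the following uses plain k-means (one possible clustering choice, not one named in the talk) with fixed initial centroids, then flags instances in small clusters or far from their own centroid. The thresholds `min_size` and `max_dist`, the data, and the function names are assumptions.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, init_centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = list(init_centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assign(points, centroids)

def cluster_anomalies(points, centroids, labels, min_size, max_dist):
    """Indices of points in small clusters or far from their own centroid."""
    sizes = [labels.count(c) for c in range(len(centroids))]
    return [i for i, (p, lab) in enumerate(zip(points, labels))
            if sizes[lab] < min_size or math.dist(p, centroids[lab]) > max_dist]

# Hypothetical data: two dense groups, one straggler, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (4, 4), (20, 0)]
centroids, labels = kmeans(points, [(0, 0), (10, 10), (20, 0)])
anomalies = cluster_anomalies(points, centroids, labels, min_size=2, max_dist=2.5)
print(anomalies)
```

The straggler is caught by the distance rule and the isolated point by the cluster-size rule; both thresholds would need tuning on validation data in practice.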


Contextual Anomaly Detection
Key Assumption: All normal objects within a context will be similar (in
terms of behavioral attributes), while the anomalies will be different from
other objects within the context.
General Approach:
•  Identify a context around a data object (using a set of contextual
attributes).
•  Determine if the test data object is anomalous within the context (using a set of behavioral attributes).

Contextual Attributes
Contextual attributes define a neighborhood (context) for each instance
For example:
•  Spatial Context
•  Latitude, Longitude
•  Graph Context
•  Edges, Weights
•  Sequential Context
•  Position, Time
•  Profile Context
•  User demographics
•  Behavioral context (mixture modeling)
•  Contextual = behavioral attributes

Contextual Anomaly Detection Techniques
Reduction to global anomaly detection
•  Segment or influence data using contextual attributes
•  Apply a traditional anomaly detection technique within each context using behavioral attributes
•  Often, contextual attributes cannot be segmented easily: use your favourite clustering technique (from standard machine learning)
A “softer” alternative
•  Conditional anomaly detection*
* X. Song, M. Wu, C. Jermaine & S. Ranka, Conditional Anomaly Detection, IEEE Transactions on Knowledge and Data Engineering, 2006.
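The reduction to global anomaly detection can be illustrated with a small sketch: segment the records by a contextual attribute, then run a simple detector on the behavioral attribute inside each segment. The z-score detector, the threshold of 2.5, and the season/temperature data are assumptions chosen for illustration; any of the earlier techniques could take the detector's place.

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_outliers(records, z_threshold=2.5):
    """records: list of (context, value). Returns indices anomalous in their context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)
    # Per-context mean and standard deviation (needs >= 2 values per context).
    stats = {c: (mean(v), stdev(v)) for c, v in by_context.items()}
    flagged = []
    for i, (context, value) in enumerate(records):
        mu, sigma = stats[context]
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

# Hypothetical temperatures: contextual attribute = season, behavioral = degrees.
winter = [0, 1, -1, 2, 0, 1, -2, 0, 1, 25]
summer = [24, 25, 26, 23, 25, 27, 24, 26, 25, 25]
records = [("winter", t) for t in winter] + [("summer", t) for t in summer]
flagged = contextual_outliers(records)
print(flagged)
```

A reading of 25 degrees is flagged in the winter context but is entirely ordinary in the summer context, which a single global model over all readings could not express.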


The importance of real-number evaluation
When developing a learning algorithm (choosing features,
etc.), making decisions is much easier if we have a way of
evaluating our learning algorithm.
Data
•  Assume we have some labeled data, of normal and very few
anomalous examples.
•  Split data into
•  Training set, with only normal examples
•  Validation set, with both normal and half of the
anomalous examples
•  Test set with both normal and other half of the
anomalous examples
[Figure: data split into Training Data, Validation Data, and Test Data]

Example
Data
•  100,000 snapshots of well-running (normal) machines
•  20 snapshots of (latent) failing machines
Training data: 60K well-running machines
Validation data: 20K well-running machines, 10 failing machines
Test data: 20K well-running machines, 10 failing machines
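The split in this example can be reproduced with a short sketch. The 60/20/20 proportions and the halving of the anomalies come from the slide; the function name, the fixed seed, and the use of integer ids as stand-ins for machine snapshots are assumptions.

```python
import random

def split_for_anomaly_detection(normal, anomalous, seed=0):
    """Training set: normal only; validation and test: normal plus half the anomalies each."""
    rng = random.Random(seed)
    normal, anomalous = list(normal), list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)
    n_train = len(normal) * 6 // 10   # 60% of the normal examples
    n_val = len(normal) * 2 // 10     # 20% of the normal examples
    half = len(anomalous) // 2
    train = normal[:n_train]
    validation = normal[n_train:n_train + n_val] + anomalous[:half]
    test = normal[n_train + n_val:] + anomalous[half:]
    return train, validation, test

# Stand-in ids for 100,000 well-running and 20 failing machine snapshots.
normal_ids = [("ok", i) for i in range(100_000)]
failing_ids = [("failing", i) for i in range(20)]
train, validation, test = split_for_anomaly_detection(normal_ids, failing_ids)
print(len(train), len(validation), len(test))
```

Keeping the anomalies out of the training set matches the semi-supervised setup: the model is fit on normal data only, and the few labeled anomalies are reserved for tuning and final evaluation.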

The development process (cont.)
Algorithm evaluation
•  Fit normal model $M(x)$ on the training data
•  On validation/test data example $x$, predict
$$\hat{y} = M(x) = \begin{cases} \text{normal} \\ \text{not normal} \end{cases}$$
and compare to the true value $y$
Possible evaluation metrics:
–  True positive, false positive, false negative, true negative
–  Precision/Recall
–  F1-score
Use to compare algorithms with different features, tuning parameters, etc.
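For reference, the listed metrics can be computed from predicted and true labels as in the sketch below. The encoding (1 = not normal, 0 = normal), the function names, and the example labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN), treating label 1 ("not normal") as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical validation labels and model predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall, f1)
```

Plain accuracy is deliberately left out: with very rare anomalies, a model that always predicts "normal" would score near 100% while detecting nothing.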

We want to avoid…
•  false positive (Type I Error): “You’re pregnant”
•  false negative (Type II Error): “You’re not pregnant”

Validation (credit card fraud example)
(Inspired by David J. Hand’s talk at NATO ASI: Mining Massive Data sets for Security, 2007)
Unbalanced classes
Detector
•  correctly identifies 99 in 100 legitimate transactions, and
•  correctly identifies 99 in 100 fraudulent transactions
Pretty good?
But suppose only 1 in 10,000 transactions is fraudulent
⇒ Only about 1 in 100 transactions reported as fraudulent is in fact fraudulent
Detection rates per true class:
                   True class
Predicted class    Legit      Fraud
Legit              99%        1%
Fraud              1%         99%
Numbers            9999       1
Expected counts per 10,000 transactions:
                   True class
Predicted class    Legit      Fraud
Legit              9899.01    0.01
Fraud              99.99      0.99
Numbers            9999       1
Precision (share of reported frauds that are truly fraudulent) = 0.99 / (99.99 + 0.99) ≈ 0.98%
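The arithmetic behind this example can be checked with a few lines. The numbers (10,000 transactions, 1 in 10,000 fraudulent, 99% correct on each class) are the ones on the slide; the function and variable names are assumptions.

```python
def expected_confusion(n, fraud_rate, legit_accuracy, fraud_accuracy):
    """Expected (TP, FP, FN, TN) counts, with fraud as the positive class."""
    n_fraud = n * fraud_rate
    n_legit = n - n_fraud
    tp = n_fraud * fraud_accuracy   # frauds reported as fraud
    fn = n_fraud - tp               # frauds reported as legit
    tn = n_legit * legit_accuracy   # legit reported as legit
    fp = n_legit - tn               # legit reported as fraud
    return tp, fp, fn, tn

tp, fp, fn, tn = expected_confusion(10_000, 1 / 10_000, 0.99, 0.99)
reported_fraud_correct = tp / (tp + fp)   # share of fraud alerts that are real
print(round(tp, 2), round(fp, 2), round(fn, 2), round(tn, 2))
print(f"{reported_fraud_correct:.2%}")
```

Although the detector is right 99% of the time on both classes, the 99.99 expected false alarms swamp the 0.99 expected true detections, so fewer than 1 in 100 alerts is a real fraud.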

“The boy who cried wolf”
>99% of suspected frauds are in fact legitimate
This matters because:
•  operational decisions must be made (stop card?)
•  good customers must not be irritated
Same challenge for other anomaly detection systems
– cyber intrusion detection, detecting latent machine failures,
industrial damage detection, etc.


Excel (conditional formatting rules)
(Max Chickering, Allan Folting, David Heckerman, Eric Vigessa & Bo Thiesson)
Drill across
Region (outlier)

Drill into
cat3 (outlier)

Capacity Prediction
(Alexei Bocharov & Bo Thiesson)
Capacity prediction = prediction of traffic to sets of MSN web pages
•  “How many impressions can we sell on a property between certain start and end
dates in the future?”
•  Over-prediction = oversell (customer dissatisfaction, make good items,…)
•  Under-prediction = undersell (loss of revenue, dilution of product, …)
•  Microsoft adCenter essentially sells web page traffic futures to advertisers across
>20,000 MSN page groups
Forecast future from past data
Many existing forecasting methods
•  One series: ARIMA, Exponential Smoothing methods (e.g. Holt-Winters), ART,…
•  Multiple series: VARIMA, ARTxp , Neural nets,…
Gradually forget the past
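To make the idea of gradually forgetting the past concrete, here is a sketch of simple exponential smoothing, the most basic member of the exponential smoothing family (it has no trend or seasonal terms, unlike Holt-Winters). The smoothing factor `alpha` and the traffic numbers are illustrative assumptions, not values from the talk.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: level_t = alpha * x_t + (1 - alpha) * level_{t-1}."""
    levels = [series[0]]
    for x in series[1:]:
        levels.append(alpha * x + (1 - alpha) * levels[-1])
    return levels

def weight_of_past_observation(alpha, age):
    """Weight given to an observation `age` steps back: alpha * (1 - alpha) ** age."""
    return alpha * (1 - alpha) ** age

# Hypothetical daily page impressions (in thousands).
traffic = [100, 110, 120, 130]
levels = exponential_smoothing(traffic, alpha=0.5)
forecast_next = levels[-1]   # one-step-ahead forecast is the last level
print(levels, forecast_next)
```

The weights decay geometrically with age, so older observations fade from the forecast without ever being dropped outright.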

Breaking the forecasting models
•  Transient events
•  Cyclical (recurring, but not periodic)
•  FOMC meetings (interest rates)
•  Periodic (long periodicities with few data)
•  E.g., April 15th is Tax Day
•  CPS only has 2-3 yrs of data, not enough!
•  Sporadic
•  E.g., earthquake in Chile

Seasonal and Floating Patterns
TSF provides basis for predicting future traffic with
•  trend,
•  seasonality, and
•  floating calendar patterns
For trend and seasonality:
•  TSF model selection + spectral analysis
•  Industry-strength seasonal pattern prediction and superior seasonal pattern detection
For floating patterns:
•  a calendar of future events is needed (such as the one used by the MSN Channel editorial team → next slide)
•  Custom calendar event forecasters improve prediction accuracy by a factor of 2 or more (compared to traditional forecasters)

Example: MSN Editorial Calendar

The OSD Data Quality (DQ) system
•  Single source of Data Quality reports across the entire OSD
•  Reports are basis for investigating (abnormal) incidents with both user-facing Bing
properties and Bing infrastructure
•  Alert generator from online KPI measures (based in part on TSF)
•  Alerts are ranked:
•  Severity of anomaly (deviation from predicted value) +
•  importance of KPI
•  Involves the (un)certainty of prediction
•  Number of alerts can be matched to the available human resources (for report investigations)

Latent Fault Detection in Data Centers
Gabel, Schuster, Bachrach & Bjørner (2012)
Challenge:
•  machine failures ⇒ service outages, data loss
•  Proactively detect (latent) failures before they happen
•  Machine failures are often not the result of an abrupt change but rather of a slow degradation in performance
Solution
•  Agile & domain independent: no domain knowledge needed, uses only standard
performance counters collected from:
•  Hardware (e.g., temperature)
•  Operating system (e.g., number of threads)
•  Runtime system (e.g., garbage collected)
•  Application layer (e.g. transactions completed)
•  Unsupervised

Latent Fault Detection in Data Centers (cont.)
Assumptions:
•  Many machines, majority working properly at any point in
time
•  Machines are homogeneous (perform similar tasks on similar
hardware and software)
•  On average, workload is balanced across machines
•  Counters are ordinal and reported at same rate
•  Counters are memoryless
Standard for (groups of) machines in big data centers
Technical assumptions that (I believe) can be softened with some thought

Anomaly detection framework (slightly simplified):
•  $x(m, t)$: vector of performance counters for machine $m \in M$ at time $t$ in epoch $T$.
•  $x(t)$: collection of all $x(m, t)$ at time $t$
•  For each machine $m$ compute:
•  $S(m, x(t)) = \frac{1}{|M|-1} \sum_{m' \neq m} \frac{x(m, t) - x(m', t)}{\|x(m, t) - x(m', t)\|}$
•  Machine measure: $v(m) = \frac{1}{|T|-1} \sum_{t \in T} S(m, x(t))$
•  Compute centroid measure across all machines: $\bar{v}$
•  For each machine $m$, use statistical test $F$ (e.g. sign-test, Tukey-test, LOF-test)* to determine whether $v(m)$ deviates significantly from $\bar{v}$
*: See details in: Latent Fault Detection in Large Scale Services; Gabel, Schuster, Bachrach & Bjørner (2012)
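The framework above can be sketched on toy data as follows. The counter values and machine names are invented, the normalization by $|T| - 1$ follows the formula as written on the slide, and the final step simply picks the machine with the largest deviance from the centroid as a stand-in for the statistical tests in the cited paper.

```python
import math

def sign_score(m, snapshot):
    """S(m, x(t)): average unit direction from every other machine towards machine m."""
    x_m = snapshot[m]
    total = [0.0] * len(x_m)
    for other, x_o in snapshot.items():
        if other == m:
            continue
        diff = [a - b for a, b in zip(x_m, x_o)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0:  # identical counter vectors contribute no direction
            total = [s + d / norm for s, d in zip(total, diff)]
    return [s / (len(snapshot) - 1) for s in total]

def machine_measure(m, snapshots):
    """v(m): S(m, x(t)) summed over time, divided by |T| - 1 as on the slide."""
    total = [0.0] * len(snapshots[0][m])
    for snapshot in snapshots:
        total = [a + b for a, b in zip(total, sign_score(m, snapshot))]
    return [a / (len(snapshots) - 1) for a in total]

# Hypothetical counters (two per machine) at three points in time.
snapshots = [
    {"m1": (10, 5), "m2": (11, 5), "m3": (10, 6), "m4": (11, 6), "m5": (14, 9)},
    {"m1": (11, 6), "m2": (10, 5), "m3": (11, 5), "m4": (10, 6), "m5": (15, 9)},
    {"m1": (10, 6), "m2": (11, 6), "m3": (10, 5), "m4": (11, 5), "m5": (15, 10)},
]
machines = list(snapshots[0])
v = {m: machine_measure(m, snapshots) for m in machines}
centroid = [sum(v[m][i] for m in machines) / len(machines) for i in range(2)]
deviance = {m: math.dist(v[m], centroid) for m in machines}
suspect = max(deviance, key=deviance.get)   # stand-in for the statistical test F
print(suspect)
```

Healthy machines point in varying directions relative to their peers, so their contributions largely cancel over time, while a machine that drifts consistently in one direction accumulates a large $v(m)$.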

Results – detection of latent faults:
•  4500 machines
•  Considered fault latencies: 1 day, 7 days, 14 days
•  Incomplete health information labeled according to current system “watchdogs”
•  True failure not reported by “watchdog” → counted as false positive
•  Abrupt failure without preceding latent fault → counted as false negative
⇒ Conservative result claims

Corporate Security – Insider Threats
(Xiang Wang, Jay Stokes & Bo Thiesson)
•  User/machine “social” network of
•  Approx. 100K users
•  Approx. ½M machines
•  Rich contextual information:
•  User: name, location, organization, title, …
•  Machine: name, IP address, services, …

Corporate Network Security (cont.)

User-user similarity matrix User-user constraint matrix

Anomalies over time -- visualization

Who did we catch?


Future (and current) challenges – “Big Data”
Volume
•  many transactions - billions - algorithms must be efficient
-  [cannot just take sample]
Velocity
•  Streaming data – efficient online, adaptive algorithms
-  [reactive to drift]
Variety
•  Data from many sources – mixed variable types, large number
of variables (many irrelevant), network data
-  [very demanding on distance measure & modeling process]
Other issues
-  different misclassification costs
-  unbalanced class sizes
-  delay in labeling (e.g., in credit card fraud)
-  mislabeled classes
