SlideShare a Scribd company logo
Anomaly Detection
–from basic concepts to application cases at Microsoft
Bo Thiesson
thiesson@cs.aau.dk
http://people.cs.aau.dk/~thiesson
Aalborg University
– 1996
Microsoft Research,
Redmond USA
(1996 – 2013)
Aalborg University
(2013 –
Bo Thiesson
Outline
Basic setup & motivation
•  1 class ↔ 2 class
•  Application examples
Standard algorithms
•  Parametric - Gaussian modeling, Replicator Neural Networks
•  Non-parametric - NN modeling (distance & density based, LOF)
•  Clustering
Contextual detection
The development process
•  Feature selection & model validation
Microsoft applications
•  Excel (conditional formatting rules)
•  OSD (Bing, MSN, AdCenter) – Capacity prediction
•  OSD, Azure – Latent Fault Detection in Data Centers
•  Corporate Security – Insider Threats
Future (and current) challenges
2InfinIT seminar, March 2015
Bo Thiesson
(Simplified) Anomaly Detection Example
Computer features: 𝑥=(​ 𝑥↓1 ,…​ 𝑥↓𝑛 )
​ 𝑥↓1 : Memory use
​ 𝑥↓2 : CPU load
⋮
3InfinIT seminar, March 2015
Training data: {​ 𝑥↑(1) ,
…,​ 𝑥↑( 𝑚) }
Run-time data:     𝑥
CPU load
Memory use
𝑥 𝑥 𝑥
𝑥
𝑥
𝑥
𝑥𝑥 𝑥𝑥
𝑥
𝑥
𝑥
𝑥
𝑥𝑥
𝑥
𝑥 𝑥
𝑥
𝑥
𝑥
𝑥
𝑥𝑥
ok
𝑥
anomaly
Bo Thiesson
What is an anomaly
•  Anomaly is a pattern in the data that does not conform to the expected
behaviour
•  Also referred to as outliers, exceptions, peculiarities, surprises, etc.
•  Definition of Hawkins [Hawkins 1980]:
“An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism”
4InfinIT seminar, March 2015
Bo Thiesson
(Simplified) Anomaly Detection Example
Computer features: 𝑥=(​ 𝑥↓1 ,…​ 𝑥↓𝑛 )
​ 𝑥↓1 : Memory use
​ 𝑥↓2 : CPU load
⋮
5InfinIT seminar, March 2015
Training data:
ok: {​ 𝑥↑(1) ,…,​ 𝑥↑( 𝑚) }
not-ok: {​ 𝑥↑(1) ,…,​
𝑥↑( 𝑘) }
Run-time data:     𝑥
CPU load
Memory use
𝑥 𝑥 𝑥
𝑥
𝑥
𝑥
𝑥𝑥 𝑥𝑥
𝑥
𝑥
𝑥
𝑥
𝑥𝑥
𝑥
𝑥 𝑥
𝑥
𝑥
𝑥
𝑥
𝑥𝑥
ok
𝑥
anomaly
𝑥 𝑥𝑥𝑥
𝑥
𝑥
𝑥
𝑥
Bo Thiesson
Two Scenarios
Supervised scenario (standard classification)
•  Training data with both normal and abnormal data objects are
provided
•  There may be multiple normal and/or abnormal classes
•  Often, the classification problem is highly imbalanced
•  Up-sample abnormal data or down-sample normal data
•  Future anomalies must look like the abnormal class(es) of training
objects
6InfinIT seminar, March 2015
Bo Thiesson
Two Scenarios (cont.)
Semi-supervised Scenario (anomaly detection)
•  Training data for only the normal class(es) is provided
•  There may be multiple normal classes
•  Many times no (explicit) class label given, but data assumed (mostly)
normal
•  Relies on anomalies being very rare compared to normal data
•  Sometimes a very small number of anomalies included in training data
(0 to 20 is common), but not enough to learn abnormal class(es)
•  Future anomalies may look nothing like any previously observed
anomaly
7InfinIT seminar, March 2015
Focus
Bo Thiesson
Where to look for anomalies
w  Anomalous events occur relatively infrequently
w  However, when they do occur, their
consequences can be quite dramatic and quite
often in a negative sense
8InfinIT seminar, March 2015
Some applications:
w  Fraud: Banking, Insurance, Health care, Click,..
w  Cyber intrusion: Insider/outsider, machine or network level
w  Medicine and health: abnormal test results, disease patterns (e.g. geo-spatial)
w  (Anti) terrorism
w  Data cleaning (measurement errors)
w  …
Bo Thiesson
Outline
Basic setup & motivation
•  1 class ↔ 2 class
•  Application examples
Standard algorithms
•  Parametric - Gaussian modeling, Replicator Neural Networks
•  Non-parametric - NN modeling (distance & density based, LOF)
•  Clustering
Contextual detection
The development process
•  Feature selection & model validation
Microsoft applications
•  Excel (conditional formatting rules)
•  OSD (Bing, MSN, AdCenter) – Capacity prediction
•  OSD, Azure – Latent Fault Detection in Data Centers
•  Corporate Security – Insider Threats
Future (and current) challenges
9InfinIT seminar, March 2015
Bo Thiesson
Modeling paradigms
Parametric
•  Learn (parametric) model representing normal data
•  Outliers deviate strongly from this model
•  Data represented by sufficient statistics (estimating model parameters)
•  Space efficient & fast run-time (usually)
•  Depends on parametric structure (shape) of the model, also restricts number
of normal classes
Non-parametric
•  All data kept around - the “model” defines a distance measure between data –
and sometimes a kernel
•  Distance-based approach: outlier have large distance to other data
•  (Kernel) density-based approach: density around outlier is significantly smaller
than for its neighbors
•  Space demanding & not as fast run-time (usually)
•  No structural model constraints nor restrictions on number of normal classes
10InfinIT seminar, March 2015
Bo Thiesson
Algorithm – Parametric
Example: Multivariate Gaussian (normal) model

•  𝜇 is the mean value of all points (usually data is normalized so that 𝜇=0)
•  Σ  is the covariance matrix from the mean
•  Mahalanobis distance 
•  𝑀𝐷𝑖𝑠𝑡(𝑥, 𝜇)=  ​( 𝑥− 𝜇)↑𝑇 ​Σ↑−1 (𝑥− 𝜇)
•  follows a ​Χ↑2 -distribution with 𝑑  degrees of freedom ( 𝑑  = data
dimensionality)
•  Outliers: points 𝑥, with 𝑀𝐷𝑖𝑠𝑡( 𝑥, 𝜇) > ​Χ↑2 (0,975) [≈3 𝜎]
11InfinIT seminar, March 2015
( )
2
)()( 1
||2
1
),;(
µµ
π
µ
−−
−
−
Σ
=Σ
xx
d
exN
ΣT
Bo Thiesson
Parametric algorithm – Multivariate Gaussian
Tan, P.-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley.
12InfinIT seminar, March 2015
( )
2
)()( 1
||2
1
),;(
µµ
π
µ
−−
−
−
Σ
=Σ
xx
d
exN
ΣT
Parameters
Data object
Bo Thiesson
Algorithm – Parametric (cont.)
Challenge
A Fix
•  Mixture modeling
•  ⇒ contextual outlier detection
(more later)
•  Demands a clustering approach
•  Not as flexible as non-parametric approaches
13InfinIT seminar, March 2015
µDB
Bo Thiesson
Replicator Neural Networks
S. Hawkins, et al. Outlier detection using replicator neural networks, DaWaK02 (2002)
•  Use a replicator 4-layer feed-forward neural network (RNN) with the
same number of input and output nodes
•  Input variables are the output variables so that RNN forms a compressed
model of the data during training
•  A measure of anomaly is the reconstruction error of individual data points
14InfinIT seminar, March 2015
Target
variables
Input
Bo Thiesson
Algorithm – Non-Parametric
Nearest Neighbor (NN) Based Techniques
Key assumption: normal data records have close neighbors while anomalies
are located far from other records
General two-step approach
1. Compute neighborhood for each data record
2. Analyze the neighborhood to determine whether data record is anomaly
or not
Paradigms:
•  Distance based methods
•  Anomalies are data points most distant from other points
•  Density based methods
•  Anomalies are data points in low density regions
15InfinIT seminar, March 2015
Bo Thiesson
NN Distance-based Anomaly Detection
Knorr & Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98
Algorithm
•  For each data record   𝑑  compute the distance to the k-th nearest
neighbor ​ 𝑑↓𝑘 
•  The distance measure is the important design choice
•  Sort all data records according to the distance ​ 𝑑↓𝑘   
•  Outliers are records that have the largest distance ​ 𝑑↓𝑘  and therefore are
located in the more sparse neighborhoods
•  Usually data records that have top 𝑛% distance ​ 𝑑↓𝑘  are identified as
outliers
•  Multiple normal classes is not a problem
•  Not suitable for datasets that have modes with varying density
16InfinIT seminar, March 2015
Bo Thiesson
Advantages of Density-based Techniques
Anomalies:
•  Distance-based: ​ 𝒑↓ 𝟏 ,  ​
𝒑↓ 𝟑 ,…
•  Density-based: ​     𝒑↓ 𝟏 ,  ​
𝒑↓ 𝟐 ,  …
17InfinIT seminar, March 2015
p2
×	

 p1
×	

×	

p3
Distance from p3
to nearest
neighbor
Distance from p2
to nearest
neighbor
Fixes problem with different local densities
Bo Thiesson
Local Outlier Factor (LOF)
(Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000)
Algorithm
•  For each data point 𝑞  compute the distance to the 𝑘-th NN ( 𝑘-dist)-dist)
•  Compute reachability distance ( 𝑟-dist) for each data example 𝑞  with-dist) for each data example 𝑞  with
respect to data example 𝑝  as:
𝑟−dist( 𝑞,   𝑝)  =  max​{ 𝑘−dist( 𝑝),  dist( 𝑞, 𝑝)}
•  Compute local reachability density (lrd) of data example 𝑞  as inverse of the
average reachability distance based on the k-NN of data example 𝑞
lrd(𝑞)=​ 𝑘/∑𝑝∈ 𝑘𝑁𝑁( 𝑞)↑▒𝑟−dist( 𝑞, 𝑝)  
•  Compute LO​F↓𝑘 ( 𝑞) as ratio of average local reachability density of 𝑞’s
k-NN and local reachability density of the data record 𝑞
LO​F↓𝑘 (𝑞)=​1/𝑘 ∑𝑝∈ 𝑘𝑁𝑁( 𝑞)↑▒​lrd( 𝑝)/lrd( 𝑞)  
18InfinIT seminar, March 2015
Bo Thiesson
Local Outlier Factor (LOF)
Properties
•  LOF ≈ 1: point is in a cluster (region with homogeneous
density around the point and its neighbors)
•  LOF >> 1: point is an anomaly
19InfinIT seminar, March 2015
Bo Thiesson
Clustering Based Techniques
Key Assumption: Normal data instances belong to large and dense clusters,
while anomalies do not belong to any significant cluster.
General Approach:
•  Cluster data into a finite number of clusters.
•  Analyze each data instance with respect to its closest cluster.
•  Anomalous Instances
•  Data instances that do not fit into any cluster (residuals from clustering)‫‏‬.
•  Data instances in small clusters.
•  Data instances that are far from other points within the same cluster.
20InfinIT seminar, March 2015
Bo Thiesson
Outline
Basic setup & motivation
•  1 class ↔ 2 class
•  Application examples
Standard algorithms
•  Parametric - Gaussian modeling, Replicator Neural Networks
•  Non-parametric - NN modeling (distance & density based, LOF)
•  Clustering
Contextual detection
The development process
•  Feature selection & model validation
Microsoft applications
•  Excel (conditional formatting rules)
•  OSD (Bing, MSN, AdCenter) – Capacity prediction
•  OSD, Azure – Latent Fault Detection in Data Centers
•  Corporate Security – Insider Threats
Future (and current) challenges
21InfinIT seminar, March 2015
Bo Thiesson
Contextual Anomaly Detection
Key Assumption: All normal objects within a context will be similar (in
terms of behavioral attributes), while the anomalies will be different from
other objects within the context.
General Approach:
•  Identify a context around a data object (using a set of contextual
attributes).
•  Determine if the test data objects is anomalous within the context
(using a set of behavioral attributes).
22InfinIT seminar, March 2015
Bo Thiesson
Contextual Attributes
Contextual attributes define a neighborhood (context) for each instance
For example:
•  Spatial Context
•  Latitude, Longitude
•  Graph Context
•  Edges, Weights
•  Sequential Context
•  Position, Time
•  Profile Context
•  User demographics
•  Behavioral context (mixture modeling)
•  Contextual = behavioral attributes
23InfinIT seminar, March 2015
Bo Thiesson
Contextual Anomaly Detection Techniques
Reduction to global anomaly detection
•  Segment or influence data using contextual attributes
•  Apply a traditional anomaly outlier within each context using behavioral
attributes
•  Often, contextual attributes cannot be segmented easily – use favourite
clustering technique (from standard machine learning)
A “softer” alternative
•  Conditional anomaly detection*
24InfinIT seminar, March 2015
* X. Song, M. Wu, C. Jermaine & S. Ranka, Conditional Anomaly Detection, IEEE Transactions on Data and Knowledge Engineering, 2006.
Bo Thiesson
Outline
Basic setup & motivation
•  1 class ↔ 2 class
•  Application examples
Standard algorithms
•  Parametric - Gaussian modeling, Replicator Neural Networks
•  Non-parametric - NN modeling (distance & density based, LOF)
•  Clustering
Contextual detection
The development process
•  Feature selection & model validation
Microsoft applications
•  Excel (conditional formatting rules)
•  OSD (Bing, MSN, AdCenter) – Capacity prediction
•  OSD, Azure – Latent Fault Detection in Data Centers
•  Corporate Security – Insider Threats
Future (and current) challenges
25InfinIT seminar, March 2015
Bo Thiesson
The development process
The importance of real-number evaluation
When developing a learning algorithm (choosing features,
etc.), making decisions is much easier if we have a way of
evaluating our learning algorithm.
Data
•  Assume we have some labeled data, of normal and very few
anomalous examples.
•  Split data into
•  Training set, with only normal examples
•  Validation set, with both normal and half of the
anomalous examples
•  Test set with both normal and other half of the
anomalous examples
26InfinIT seminar, March 2015
Training
Data
Validation
Data
Test
Data
Bo Thiesson
Example
Data
•  100.000 snapshots of well (normal) running machines
•  20 snapshots of (latent) failing machines
Training data: 60K well-running machines
Validation data: 20K well-running machines, 10 failing machines
Test data: 20K well-running machines, 10 failing machines
27InfinIT seminar, March 2015
Bo Thiesson
The development process (cont.)
Algorithm evaluation
•  Fit normal model 𝑀( 𝑥) on the training data
•  On validation/test data example 𝑥, predict
​ 𝑦 = 𝑀(𝑥)={█■normal&− 𝑜𝑟−@not  normal&  
and compare to true value 𝑦
Possible evaluation metrics:
–  True positive, false positive, false negative, true negative
–  Precision/Recall
–  F1-score
Use to compare algorithms with different features, tuning parameters, etc.
28InfinIT seminar, March 2015
Bo Thiesson
false positive
(Type I Error)
We want to avoid…
false negative
(Type II Error)
You’re
not
pregnant
You’re
pregnant
InfinIT seminar, March 2015
29
29
Bo Thiesson
Validation (credit card fraud example)
(Inspired by David J. Hand’s talk at NATO ASI: Mining Massive Data sets for Security, 2007)
Unbalanced classes
Detector
•  correctly identifies 99 in 100 legitimate transactions, and
•  correctly identifies 99 in 100 fraudulent transactions
Pretty good?
But suppose only 1 in 10.000
transactions are fraudulent
Only 1 in 100 reported fraudulent
transactions are in fact fraudulent
30InfinIT seminar, March 2015
True class
Legit Fraud
Predicted
class
Legit 99% 1%
Fraud 1% 99%
Numbers 9999 1
True class
Legit Fraud
Predicted
class
Legit 9899,01 0,01
Fraud 99,99 0.99
Numbers 9999 1
TP rate=​
0,99/99,99+0.99 
=0,98%
Bo Thiesson
“The boy who cried wolf”
>99% of suspected frauds are in fact legitimate
This matters because:
•  operational decisions must be made (stop card?)
•  good customers must not be irritated
Same challenge for other anomaly detection systems
– cyber intrusion detection, detecting latent machine failures,
industrial damage detection, etc.
31InfinIT seminar, March 2015
Bo Thiesson
Outline
Basic setup & motivation
•  1 class ↔ 2 class
•  Application examples
Standard algorithms
•  Parametric - Gaussian modeling, Replicator Neural Networks
•  Non-parametric - NN modeling (distance & density based, LOF)
•  Clustering
Contextual detection
The development process
•  Feature selection & model validation
Microsoft applications
•  Excel (conditional formatting rules)
•  OSD (Bing, MSN, AdCenter) – Capacity prediction
•  OSD, Azure – Latent Fault Detection in Data Centers
•  Corporate Security – Insider Threats
Future (and current) challenges
32InfinIT seminar, March 2015
Bo Thiesson
Excel (conditional formatting rules)
(Max Chickering, Allan Folting, David Heckerman, Eric Vigessa & Bo Thiesson)
33InfinIT seminar, March 2015
Drill across
Region (outlier)
Bo Thiesson
Excel (conditional formatting rules)
34InfinIT seminar, March 2015
Drill into
cat3 (outlier)
Bo Thiesson
Excel (conditional formatting rules)
35InfinIT seminar, March 2015
Bo Thiesson
Capacity Prediction
(Alexei Bocharov & Bo Thiesson)
Capacity prediction = prediction of traffic to sets of MSN web pages
•  “How many impressions can we sell on a property between certain start and end
dates in the future?”
•  Over-prediction = oversell (customer dissatisfaction, make good items,…)
•  Under-prediction = undersell (loss of revenue, dilution of product, …)
•  Microsoft adCenter essentially sells web page traffic futures to advertisers across
>20,000 MSN page groups
Forecast future from past data
Many existing forecasting methods
•  One series: ARIMA, Exponential Smoothing methods (e.g. Holt-Winters), ART,…
•  Multiple series: VARIMA, ARTxp , Neural nets,…
Gradually forgt
the past
Bo Thiesson
Breaking the forecasting models
•  Transient events
•  Cyclical (recurring, but not
periodic)
•  FOMC-meetings (interest rates)
•  Periodic (long periodicities with few
data)
•  E.g., April 15th is Tax day
•  CPS only has 2-3 yrs of data, not
enough!
•  Sporadic
•  E.g., earthquake in Chile
37
Bo Thiesson
Seasonal and Floating Patterns
TSF provides basis for predicting future traffic with
•  trend,
•  seasonality, and
•  floating calendar patterns
For trend and seasonality:
•  TSF model selection + spectral analysis
•  Industry-strength seasonal pattern prediction and superior seasonal
pattern detection
For floating patterns:
•  a calendar of future events is needed (such as used by MSN Channel
editorial team → next slide)
•  Custom calendar event forecasters improve prediction accuracy by
factor of 2 or more (compared to traditional forecasters)
Bo Thiesson
Example: MSN Editorial Calendar
Bo Thiesson
The OSD Data Quality (DQ) system
•  Single source of Data Quality reports across the entire OSD
•  Reports are basis for investigating (abnormal) incidents with both user-facing Bing
properties and Bing infrastructure
•  Alert generator from online KPI measures (based on TSF, in parts)
•  Alerts are ranked:
•  Severity of anomaly (deviation from predicted value) +
•  importance of KPI
•  Involves the (un)certainty of prediction
•  Number of alerts can be matched human resources (for report investigations)
40InfinIT seminar, March 2015
Bo Thiesson
Latent Fault Detection in Data Centers
Gabel, Schuster, Bachrach & Bjørner (2012)
Challenge:
•  machine failures ⇒ service outages, data loss
•  Proactively detect (latent) failures before they happen
•  Machine failures are often not a result of abrupt change but rather a slow
degrade in performance
Solution
•  Agile & domain independent: no domain knowledge needed, uses only standard
performance counters collected from:
•  Hardware (e.g., temperature)
•  Operating system (e.g., number of threads)
•  Runtime system (e.g., garbage collected)
•  Application layer (e.g. transactions completed)
•  Un-supervised
41InfinIT seminar, March 2015
Bo Thiesson
Latent Fault Detection in Data Centers (cont.)
Assumptions:
•  Many machines, majority working properly at any point in
time
•  Machines are homogeneous (perform similar tasks on similar
hardware and software)
•  On average, workload is balanced across machines
•  Counters are ordinal and reported at same rate
•  Counters are memoryless
42InfinIT seminar, March 2015
Standard for (groups
of) machines in
big data centers
Technical assumptions that (I believe)
can be softened with some thought
Bo Thiesson
Latent Fault Detection in Data Centers (cont.)
Anomaly detection framework (slightly simplified):
•  𝑥( 𝑚,   𝑡): vector of performance counters for machine
𝑚∈ 𝑀 at time 𝑡 in epoch 𝑇.
•  𝑥( 𝑡): collection of all 𝑥( 𝑚,   𝑡) at time 𝑡
•  For each machine 𝑚  compute:
•  𝑆(𝑚, 𝑥( 𝑡))=​1/|𝑀|−1 ∑𝑚≠ 𝑚′↑▒​ 𝑥(𝑚, 𝑡)
− 𝑥(​ 𝑚↑′ , 𝑡)/‖𝑥(𝑚, 𝑡)− 𝑥(​ 𝑚↑′ , 𝑡)‖  
•  Machine measure: 𝑣(𝑚)=​1/|𝑇|−1 
∑𝑡∈ 𝑇↑▒𝑆(𝑚, 𝑥( 𝑡)) 
•  Compute centroid measure across all machines: ​ 𝑣 
•  For each machine 𝑚, use statistical test 𝐹(e.g. sign-test,
Tukey-test, LOF-test)* to determine if deviance: 43InfinIT seminar, March 2015
*: See details in: Latent Fault Detection in Large Scale Services; Gabel, Schuster, Bachrach & Bjørner (2012)
Bo Thiesson
Latent Fault Detection in Data Centers (cont.)
Results – detection of latent faults:
•  4500 machines
•  Considered fault latencies: 1 day, 7 days, 14 days
•  Incomplete health information labeled according to current
system “watchdogs”
•  True failure not reported by “watchdog” → counted as
false positive
•  Abrupt failure without proceeding latent faulting →
counted as false negative
44InfinIT seminar, March 2015
⇒ Conservative
result claims
Bo Thiesson
Corporate Security – Insider Threats
(Xiang Wang, Jay Stokes,& Bo Thiesson)
•  User/machine “social” network of
•  Approx. 100K users
•  Approx ½M machines
•  Rich contextual information:
•  User: name, location, organization,
title,…
•  Machine: name, IP address,
services,,…
45InfinIT seminar, March 2015
Bo Thiesson
Corporate Network Security (cont.)
46InfinIT seminar, March 2015
Bo Thiesson
Corporate Network Security (cont.)
47InfinIT seminar, March 2015
User-user similarity matrix User-user constraint matrix
Bo Thiesson
Corporate Network Security (cont.)
Anomalies over time -- visualization
48InfinIT seminar, March 2015
Bo Thiesson
Corporate Network Security (cont.)
Who did we catch?
49InfinIT seminar, March 2015
Bo Thiesson
Outline
Basic setup & motivation
•  1 class ↔ 2 class
•  Application examples
Standard algorithms
•  Parametric - Gaussian modeling, Replicator Neural Networks
•  Non-parametric - NN modeling (distance & density based, LOF)
•  Clustering
Contextual detection
The development process
•  Feature selection & model validation
Microsoft applications
•  Excel (conditional formatting rules)
•  OSD (Bing, MSN, AdCenter) – Capacity prediction
•  OSD, Azure – Latent Fault Detection in Data Centers
•  Corporate Security – Insider Threats
Future (and current) challenges
50InfinIT seminar, March 2015
Bo Thiesson
Future (and current) challenges – “Big Data”
Volume
•  many transactions - billions - algorithms must be efficient
-  [cannot just take sample]
Velocity
•  Streaming data – efficient online, adaptive algorithms
-  [reactive to drift]
Variety
•  Data from many sources – mixed variable types, large number
of variables (many irrelevant), network data
-  [very demanding on distance measure & modeling process]
Other issues
-  different misclassification costs
-  unbalanced class sizes
-  delay in labeling (e.g., in credit card fraud)
-  mislabeled classes
51InfinIT seminar, March 2015

More Related Content

What's hot

Data analysis – using computers for presentation
Data analysis – using computers for presentationData analysis – using computers for presentation
Data analysis – using computers for presentationNoonapau
 
Techniques for Context-Aware and Cold-Start Recommendations
Techniques for Context-Aware and Cold-Start RecommendationsTechniques for Context-Aware and Cold-Start Recommendations
Techniques for Context-Aware and Cold-Start Recommendations
Matthias Braunhofer
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
guest0edcaf
 
Connections b/w active learning and model extraction
Connections b/w active learning and model extractionConnections b/w active learning and model extraction
Connections b/w active learning and model extraction
Anmol Dwivedi
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
Davis David
 
Missing Data and data imputation techniques
Missing Data and data imputation techniquesMissing Data and data imputation techniques
Missing Data and data imputation techniques
Omar F. Althuwaynee
 
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Institute of Contemporary Sciences
 
Accounting serx
Accounting serxAccounting serx
Accounting serx
zeer1234
 
Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...
Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...
Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...
Waqas Tariq
 
Chap10 Anomaly Detection
Chap10 Anomaly DetectionChap10 Anomaly Detection
Chap10 Anomaly Detection
guest76d673
 
sigir2020
sigir2020sigir2020
sigir2020
Tetsuya Sakai
 
Clustering
ClusteringClustering
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
ShantanuDeosthale
 
Introduction to machine learning and deep learning
Introduction to machine learning and deep learningIntroduction to machine learning and deep learning
Introduction to machine learning and deep learning
Shishir Choudhary
 
Contextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender SystemsContextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender Systems
Matthias Braunhofer
 
Active learning
Active learningActive learning
Active learning
Akhilesh Ravi
 
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
ijiert bestjournal
 
Chapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionChapter 10 Anomaly Detection
Chapter 10 Anomaly Detection
Khalid Elshafie
 
Chapter 12 outlier
Chapter 12 outlierChapter 12 outlier
Chapter 12 outlier
Houw Liong The
 

What's hot (20)

Data analysis – using computers for presentation
Data analysis – using computers for presentationData analysis – using computers for presentation
Data analysis – using computers for presentation
 
Techniques for Context-Aware and Cold-Start Recommendations
Techniques for Context-Aware and Cold-Start RecommendationsTechniques for Context-Aware and Cold-Start Recommendations
Techniques for Context-Aware and Cold-Start Recommendations
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Connections b/w active learning and model extraction
Connections b/w active learning and model extractionConnections b/w active learning and model extraction
Connections b/w active learning and model extraction
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Intro scikitlearnstatsmodels
Intro scikitlearnstatsmodelsIntro scikitlearnstatsmodels
Intro scikitlearnstatsmodels
 
Missing Data and data imputation techniques
Missing Data and data imputation techniquesMissing Data and data imputation techniques
Missing Data and data imputation techniques
 
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
Deep Attention Model for Triage of Emergency Department Patients - Djordje Gl...
 
Accounting serx
Accounting serxAccounting serx
Accounting serx
 
Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...
Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...
Some Imputation Methods to Treat Missing Values in Knowledge Discovery in Dat...
 
Chap10 Anomaly Detection
Chap10 Anomaly DetectionChap10 Anomaly Detection
Chap10 Anomaly Detection
 
sigir2020
sigir2020sigir2020
sigir2020
 
Clustering
ClusteringClustering
Clustering
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
Introduction to machine learning and deep learning
Introduction to machine learning and deep learningIntroduction to machine learning and deep learning
Introduction to machine learning and deep learning
 
Contextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender SystemsContextual Information Elicitation in Travel Recommender Systems
Contextual Information Elicitation in Travel Recommender Systems
 
Active learning
Active learningActive learning
Active learning
 
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
 
Chapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionChapter 10 Anomaly Detection
Chapter 10 Anomaly Detection
 
Chapter 12 outlier
Chapter 12 outlierChapter 12 outlier
Chapter 12 outlier
 

Similar to 03 presentation-bothiesson

MLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for AnomaliesMLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for Anomalies
BigML, Inc
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with R
Wei Zhong Toh
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
Datamining Tools
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
DataminingTools Inc
 
Analyst’s Nightmare or Laundering Massive Spreadsheets
Analyst’s Nightmare or Laundering Massive SpreadsheetsAnalyst’s Nightmare or Laundering Massive Spreadsheets
Analyst’s Nightmare or Laundering Massive Spreadsheets
PyData
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Turi, Inc.
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
SaketBansal9
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Maninda Edirisooriya
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
PrashantYadav931011
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
DataTactics
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
Rich Heimann
 
Decision trees
Decision treesDecision trees
Decision trees
Ncib Lotfi
 
Data preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in BioinformaticsData preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in Bioinformatics
Elena Sügis
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
BigML, Inc
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
javed75
 
Seminar nov2017
Seminar nov2017Seminar nov2017
Seminar nov2017
Ahmed Youssef Ali Amer
 
BI Chapter 04.pdf business business business business
BI Chapter 04.pdf business business business businessBI Chapter 04.pdf business business business business
BI Chapter 04.pdf business business business business
JawaherAlbaddawi
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
AI Summary
 
Lecture-2 Applied ML .pptx
Lecture-2 Applied ML .pptxLecture-2 Applied ML .pptx
Lecture-2 Applied ML .pptx
ZainULABIDIN496386
 
Data in science
Data in science Data in science
Data in science
Sreejith Aravindakshan
 

Similar to 03 presentation-bothiesson (20)

MLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for AnomaliesMLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for Anomalies
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with R
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Analyst’s Nightmare or Laundering Massive Spreadsheets
Analyst’s Nightmare or Laundering Massive SpreadsheetsAnalyst’s Nightmare or Laundering Massive Spreadsheets
Analyst’s Nightmare or Laundering Massive Spreadsheets
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Decision trees
Decision treesDecision trees
Decision trees
 
Data preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in BioinformaticsData preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in Bioinformatics
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
Seminar nov2017
Seminar nov2017Seminar nov2017
Seminar nov2017
 
BI Chapter 04.pdf business business business business
BI Chapter 04.pdf business business business businessBI Chapter 04.pdf business business business business
BI Chapter 04.pdf business business business business
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Lecture-2 Applied ML .pptx
Lecture-2 Applied ML .pptxLecture-2 Applied ML .pptx
Lecture-2 Applied ML .pptx
 
Data in science
Data in science Data in science
Data in science
 

More from InfinIT - Innovationsnetværket for it

Erfaringer med-c kurt-noermark
Erfaringer med-c kurt-noermarkErfaringer med-c kurt-noermark
Erfaringer med-c kurt-noermark
InfinIT - Innovationsnetværket for it
 
Object orientering, test driven development og c
Object orientering, test driven development og cObject orientering, test driven development og c
Object orientering, test driven development og c
InfinIT - Innovationsnetværket for it
 
Embedded softwaredevelopment hcs
Embedded softwaredevelopment hcsEmbedded softwaredevelopment hcs
Embedded softwaredevelopment hcs
InfinIT - Innovationsnetværket for it
 
C og c++-jens lund jensen
C og c++-jens lund jensenC og c++-jens lund jensen
C og c++-jens lund jensen
InfinIT - Innovationsnetværket for it
 
201811xx foredrag c_cpp
201811xx foredrag c_cpp201811xx foredrag c_cpp
C som-programmeringssprog-bt
C som-programmeringssprog-btC som-programmeringssprog-bt
C som-programmeringssprog-bt
InfinIT - Innovationsnetværket for it
 
Infinit seminar 060918
Infinit seminar 060918Infinit seminar 060918
DCR solutions
DCR solutionsDCR solutions
Not your grandfathers BPM
Not your grandfathers BPMNot your grandfathers BPM
Not your grandfathers BPM
InfinIT - Innovationsnetværket for it
 
Kmd workzone - an evolutionary approach to revolution
Kmd workzone - an evolutionary approach to revolutionKmd workzone - an evolutionary approach to revolution
Kmd workzone - an evolutionary approach to revolution
InfinIT - Innovationsnetværket for it
 
EcoKnow - oplæg
EcoKnow - oplægEcoKnow - oplæg
Martin Wickins Chatbots i fronten
Martin Wickins Chatbots i frontenMartin Wickins Chatbots i fronten
Martin Wickins Chatbots i fronten
InfinIT - Innovationsnetværket for it
 
Marie Fenger ai kundeservice
Marie Fenger ai kundeserviceMarie Fenger ai kundeservice
Marie Fenger ai kundeservice
InfinIT - Innovationsnetværket for it
 
Mads Kaysen SupWiz
Mads Kaysen SupWizMads Kaysen SupWiz
Leif Howalt NNIT Service Support Center
Leif Howalt NNIT Service Support CenterLeif Howalt NNIT Service Support Center
Leif Howalt NNIT Service Support Center
InfinIT - Innovationsnetværket for it
 
Jan Neerbek NLP og Chatbots
Jan Neerbek NLP og ChatbotsJan Neerbek NLP og Chatbots
Jan Neerbek NLP og Chatbots
InfinIT - Innovationsnetværket for it
 
Anders Soegaard NLP for Customer Support
Anders Soegaard NLP for Customer SupportAnders Soegaard NLP for Customer Support
Anders Soegaard NLP for Customer Support
InfinIT - Innovationsnetværket for it
 
Stephen Alstrup infinit august 2018
Stephen Alstrup infinit august 2018Stephen Alstrup infinit august 2018
Stephen Alstrup infinit august 2018
InfinIT - Innovationsnetværket for it
 
Innovation og værdiskabelse i it-projekter
Innovation og værdiskabelse i it-projekterInnovation og værdiskabelse i it-projekter
Innovation og værdiskabelse i it-projekter
InfinIT - Innovationsnetværket for it
 
Rokoko infin it presentation
Rokoko infin it presentation Rokoko infin it presentation
Rokoko infin it presentation
InfinIT - Innovationsnetværket for it
 

More from InfinIT - Innovationsnetværket for it (20)

Erfaringer med-c kurt-noermark
Erfaringer med-c kurt-noermarkErfaringer med-c kurt-noermark
Erfaringer med-c kurt-noermark
 
Object orientering, test driven development og c
Object orientering, test driven development og cObject orientering, test driven development og c
Object orientering, test driven development og c
 
Embedded softwaredevelopment hcs
Embedded softwaredevelopment hcsEmbedded softwaredevelopment hcs
Embedded softwaredevelopment hcs
 
C og c++-jens lund jensen
C og c++-jens lund jensenC og c++-jens lund jensen
C og c++-jens lund jensen
 
201811xx foredrag c_cpp
201811xx foredrag c_cpp201811xx foredrag c_cpp
201811xx foredrag c_cpp
 
C som-programmeringssprog-bt
C som-programmeringssprog-btC som-programmeringssprog-bt
C som-programmeringssprog-bt
 
Infinit seminar 060918
Infinit seminar 060918Infinit seminar 060918
Infinit seminar 060918
 
DCR solutions
DCR solutionsDCR solutions
DCR solutions
 
Not your grandfathers BPM
Not your grandfathers BPMNot your grandfathers BPM
Not your grandfathers BPM
 
Kmd workzone - an evolutionary approach to revolution
Kmd workzone - an evolutionary approach to revolutionKmd workzone - an evolutionary approach to revolution
Kmd workzone - an evolutionary approach to revolution
 
EcoKnow - oplæg
EcoKnow - oplægEcoKnow - oplæg
EcoKnow - oplæg
 
Martin Wickins Chatbots i fronten
Martin Wickins Chatbots i frontenMartin Wickins Chatbots i fronten
Martin Wickins Chatbots i fronten
 
Marie Fenger ai kundeservice
Marie Fenger ai kundeserviceMarie Fenger ai kundeservice
Marie Fenger ai kundeservice
 
Mads Kaysen SupWiz
Mads Kaysen SupWizMads Kaysen SupWiz
Mads Kaysen SupWiz
 
Leif Howalt NNIT Service Support Center
Leif Howalt NNIT Service Support CenterLeif Howalt NNIT Service Support Center
Leif Howalt NNIT Service Support Center
 
Jan Neerbek NLP og Chatbots
Jan Neerbek NLP og ChatbotsJan Neerbek NLP og Chatbots
Jan Neerbek NLP og Chatbots
 
Anders Soegaard NLP for Customer Support
Anders Soegaard NLP for Customer SupportAnders Soegaard NLP for Customer Support
Anders Soegaard NLP for Customer Support
 
Stephen Alstrup infinit august 2018
Stephen Alstrup infinit august 2018Stephen Alstrup infinit august 2018
Stephen Alstrup infinit august 2018
 
Innovation og værdiskabelse i it-projekter
Innovation og værdiskabelse i it-projekterInnovation og værdiskabelse i it-projekter
Innovation og værdiskabelse i it-projekter
 
Rokoko infin it presentation
Rokoko infin it presentation Rokoko infin it presentation
Rokoko infin it presentation
 

Recently uploaded

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 

Recently uploaded (20)

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 

03 presentation-bothiesson

  • 1. Anomaly Detection –from basic concepts to application cases at Microsoft Bo Thiesson thiesson@cs.aau.dk http://people.cs.aau.dk/~thiesson Aalborg University – 1996 Microsoft Research, Redmond USA (1996 – 2013) Aalborg University (2013 –
  • 2. Bo Thiesson Outline Basic setup & motivation •  1 class ↔ 2 class •  Application examples Standard algorithms •  Parametric - Gaussian modeling, Replicator Neural Networks •  Non-parametric - NN modeling (distance & density based, LOF) •  Clustering Contextual detection The development process •  Feature selection & model validation Microsoft applications •  Excel (conditional formatting rules) •  OSD (Bing, MSN, AdCenter) – Capacity prediction •  OSD, Azure – Latent Fault Detection in Data Centers •  Corporate Security – Insider Threats Future (and current) challenges 2InfinIT seminar, March 2015
  • 3. Bo Thiesson (Simplified) Anomaly Detection Example Computer features: 𝑥=(​ 𝑥↓1 ,…​ 𝑥↓𝑛 ) ​ 𝑥↓1 : Memory use ​ 𝑥↓2 : CPU load ⋮ 3InfinIT seminar, March 2015 Training data: {​ 𝑥↑(1) , …,​ 𝑥↑( 𝑚) } Run-time data:     𝑥 CPU load Memory use 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥𝑥 𝑥𝑥 𝑥 𝑥 𝑥 𝑥 𝑥𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥𝑥 ok 𝑥 anomaly
  • 4. Bo Thiesson What is an anomaly •  Anomaly is a pattern in the data that does not conform to the expected behaviour •  Also referred to as outliers, exceptions, peculiarities, surprises, etc. •  Definition of Hawkins [Hawkins 1980]: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” 4InfinIT seminar, March 2015
  • 5. Bo Thiesson (Simplified) Anomaly Detection Example Computer features: 𝑥=(​ 𝑥↓1 ,…​ 𝑥↓𝑛 ) ​ 𝑥↓1 : Memory use ​ 𝑥↓2 : CPU load ⋮ 5InfinIT seminar, March 2015 Training data: ok: {​ 𝑥↑(1) ,…,​ 𝑥↑( 𝑚) } not-ok: {​ 𝑥↑(1) ,…,​ 𝑥↑( 𝑘) } Run-time data:     𝑥 CPU load Memory use 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥𝑥 𝑥𝑥 𝑥 𝑥 𝑥 𝑥 𝑥𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥𝑥 ok 𝑥 anomaly 𝑥 𝑥𝑥𝑥 𝑥 𝑥 𝑥 𝑥
  • 6. Bo Thiesson Two Scenarios Supervised scenario (standard classification) •  Training data with both normal and abnormal data objects are provided •  There may be multiple normal and/or abnormal classes •  Often, the classification problem is highly imbalanced •  Up-sample abnormal data or down-sample normal data •  Future anomalies must look like the abnormal class(es) of training objects 6InfinIT seminar, March 2015
  • 7. Bo Thiesson Two Scenarios (cont.) Semi-supervised Scenario (anomaly detection) •  Training data for only the normal class(es) is provided •  There may be multiple normal classes •  Many times no (explicit) class label given, but data assumed (mostly) normal •  Relies on anomalies being very rare compared to normal data •  Sometimes a very small number of anomalies included in training data (0 to 20 is common), but not enough to learn abnormal class(es) •  Future anomalies may look nothing like any previously observed anomaly 7InfinIT seminar, March 2015 Focus
  • 8. Bo Thiesson Where to look for anomalies w  Anomalous events occur relatively infrequently w  However, when they do occur, their consequences can be quite dramatic and quite often in a negative sense 8InfinIT seminar, March 2015 Some applications: w  Fraud: Banking, Insurance, Health care, Click,.. w  Cyber intrusion: Insider/outsider, machine or network level w  Medicine and health: abnormal test results, disease patterns (e.g. geo-spatial) w  (Anti) terrorism w  Data cleaning (measurement errors) w  …
  • 9. Bo Thiesson Outline Basic setup & motivation •  1 class ↔ 2 class •  Application examples Standard algorithms •  Parametric - Gaussian modeling, Replicator Neural Networks •  Non-parametric - NN modeling (distance & density based, LOF) •  Clustering Contextual detection The development process •  Feature selection & model validation Microsoft applications •  Excel (conditional formatting rules) •  OSD (Bing, MSN, AdCenter) – Capacity prediction •  OSD, Azure – Latent Fault Detection in Data Centers •  Corporate Security – Insider Threats Future (and current) challenges 9InfinIT seminar, March 2015
  • 10. Bo Thiesson Modeling paradigms Parametric •  Learn (parametric) model representing normal data •  Outliers deviate strongly from this model •  Data represented by sufficient statistics (estimating model parameters) •  Space efficient & fast run-time (usually) •  Depends on parametric structure (shape) of the model, also restricts number of normal classes Non-parametric •  All data kept around - the “model” defines a distance measure between data – and sometimes a kernel •  Distance-based approach: outlier have large distance to other data •  (Kernel) density-based approach: density around outlier is significantly smaller than for its neighbors •  Space demanding & not as fast run-time (usually) •  No structural model constraints nor restrictions on number of normal classes 10InfinIT seminar, March 2015
  • 11. Bo Thiesson Algorithm – Parametric Example: Multivariate Gaussian (normal) model •  𝜇 is the mean value of all points (usually data is normalized so that 𝜇=0) •  Σ  is the covariance matrix from the mean •  Mahalanobis distance •  𝑀𝐷𝑖𝑠𝑡(𝑥, 𝜇)=  ​( 𝑥− 𝜇)↑𝑇 ​Σ↑−1 (𝑥− 𝜇) •  follows a ​Χ↑2 -distribution with 𝑑  degrees of freedom ( 𝑑  = data dimensionality) •  Outliers: points 𝑥, with 𝑀𝐷𝑖𝑠𝑡( 𝑥, 𝜇) > ​Χ↑2 (0,975) [≈3 𝜎] 11InfinIT seminar, March 2015 ( ) 2 )()( 1 ||2 1 ),;( µµ π µ −− − − Σ =Σ xx d exN ΣT
  • 12. Bo Thiesson Parametric algorithm – Multivariate Gaussian Tan, P.-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley. 12InfinIT seminar, March 2015 ( ) 2 )()( 1 ||2 1 ),;( µµ π µ −− − − Σ =Σ xx d exN ΣT Parameters Data object
  • 13. Bo Thiesson Algorithm – Parametric (cont.) Challenge A Fix •  Mixture modeling •  ⇒ contextual outlier detection (more later) •  Demands a clustering approach •  Not as flexible as non-parametric approaches 13InfinIT seminar, March 2015 µDB
  • 14. Bo Thiesson Replicator Neural Networks S. Hawkins, et al. Outlier detection using replicator neural networks, DaWaK02 (2002) •  Use a replicator 4-layer feed-forward neural network (RNN) with the same number of input and output nodes •  Input variables are the output variables so that RNN forms a compressed model of the data during training •  A measure of anomaly is the reconstruction error of individual data points 14InfinIT seminar, March 2015 Target variables Input
  • 15. Bo Thiesson Algorithm – Non-Parametric Nearest Neighbor (NN) Based Techniques Key assumption: normal data records have close neighbors while anomalies are located far from other records General two-step approach 1. Compute neighborhood for each data record 2. Analyze the neighborhood to determine whether data record is anomaly or not Paradigms: •  Distance based methods •  Anomalies are data points most distant from other points •  Density based methods •  Anomalies are data points in low density regions 15InfinIT seminar, March 2015
  • 16. Bo Thiesson NN Distance-based Anomaly Detection Knorr & Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98 Algorithm •  For each data record   𝑑  compute the distance to the k-th nearest neighbor ​ 𝑑↓𝑘  •  The distance measure is the important design choice •  Sort all data records according to the distance ​ 𝑑↓𝑘    •  Outliers are records that have the largest distance ​ 𝑑↓𝑘  and therefore are located in the more sparse neighborhoods •  Usually data records that have top 𝑛% distance ​ 𝑑↓𝑘  are identified as outliers •  Multiple normal classes is not a problem •  Not suitable for datasets that have modes with varying density 16InfinIT seminar, March 2015
  • 17. Bo Thiesson Advantages of Density-based Techniques Anomalies: •  Distance-based: ​ 𝒑↓ 𝟏 ,  ​ 𝒑↓ 𝟑 ,… •  Density-based: ​     𝒑↓ 𝟏 ,  ​ 𝒑↓ 𝟐 ,  … 17InfinIT seminar, March 2015 p2 × p1 × × p3 Distance from p3 to nearest neighbor Distance from p2 to nearest neighbor Fixes problem with different local densities
  • 18. Bo Thiesson Local Outlier Factor (LOF) (Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000) Algorithm •  For each data point 𝑞  compute the distance to the 𝑘-th NN ( 𝑘-dist)-dist) •  Compute reachability distance ( 𝑟-dist) for each data example 𝑞  with-dist) for each data example 𝑞  with respect to data example 𝑝  as: 𝑟−dist( 𝑞,   𝑝)  =  max​{ 𝑘−dist( 𝑝),  dist( 𝑞, 𝑝)} •  Compute local reachability density (lrd) of data example 𝑞  as inverse of the average reachability distance based on the k-NN of data example 𝑞 lrd(𝑞)=​ 𝑘/∑𝑝∈ 𝑘𝑁𝑁( 𝑞)↑▒𝑟−dist( 𝑞, 𝑝)   •  Compute LO​F↓𝑘 ( 𝑞) as ratio of average local reachability density of 𝑞’s k-NN and local reachability density of the data record 𝑞 LO​F↓𝑘 (𝑞)=​1/𝑘 ∑𝑝∈ 𝑘𝑁𝑁( 𝑞)↑▒​lrd( 𝑝)/lrd( 𝑞)   18InfinIT seminar, March 2015
  • 19. Bo Thiesson Local Outlier Factor (LOF) Properties •  LOF ≈ 1: point is in a cluster (region with homogeneous density around the point and its neighbors) •  LOF >> 1: point is an anomaly 19InfinIT seminar, March 2015
  • 20. Bo Thiesson Clustering Based Techniques Key Assumption: Normal data instances belong to large and dense clusters, while anomalies do not belong to any significant cluster. General Approach: •  Cluster data into a finite number of clusters. •  Analyze each data instance with respect to its closest cluster. •  Anomalous Instances •  Data instances that do not fit into any cluster (residuals from clustering)‫‏‬. •  Data instances in small clusters. •  Data instances that are far from other points within the same cluster. 20InfinIT seminar, March 2015
  • 21. Bo Thiesson Outline Basic setup & motivation •  1 class ↔ 2 class •  Application examples Standard algorithms •  Parametric - Gaussian modeling, Replicator Neural Networks •  Non-parametric - NN modeling (distance & density based, LOF) •  Clustering Contextual detection The development process •  Feature selection & model validation Microsoft applications •  Excel (conditional formatting rules) •  OSD (Bing, MSN, AdCenter) – Capacity prediction •  OSD, Azure – Latent Fault Detection in Data Centers •  Corporate Security – Insider Threats Future (and current) challenges 21InfinIT seminar, March 2015
  • 22. Bo Thiesson Contextual Anomaly Detection Key Assumption: All normal objects within a context will be similar (in terms of behavioral attributes), while the anomalies will be different from other objects within the context. General Approach: •  Identify a context around a data object (using a set of contextual attributes). •  Determine if the test data objects is anomalous within the context (using a set of behavioral attributes). 22InfinIT seminar, March 2015
  • 23. Bo Thiesson Contextual Attributes Contextual attributes define a neighborhood (context) for each instance For example: •  Spatial Context •  Latitude, Longitude •  Graph Context •  Edges, Weights •  Sequential Context •  Position, Time •  Profile Context •  User demographics •  Behavioral context (mixture modeling) •  Contextual = behavioral attributes 23InfinIT seminar, March 2015
  • 24. Bo Thiesson Contextual Anomaly Detection Techniques Reduction to global anomaly detection •  Segment or influence data using contextual attributes •  Apply a traditional anomaly outlier within each context using behavioral attributes •  Often, contextual attributes cannot be segmented easily – use favourite clustering technique (from standard machine learning) A “softer” alternative •  Conditional anomaly detection* 24InfinIT seminar, March 2015 * X. Song, M. Wu, C. Jermaine & S. Ranka, Conditional Anomaly Detection, IEEE Transactions on Data and Knowledge Engineering, 2006.
  • 25. Bo Thiesson Outline Basic setup & motivation •  1 class ↔ 2 class •  Application examples Standard algorithms •  Parametric - Gaussian modeling, Replicator Neural Networks •  Non-parametric - NN modeling (distance & density based, LOF) •  Clustering Contextual detection The development process •  Feature selection & model validation Microsoft applications •  Excel (conditional formatting rules) •  OSD (Bing, MSN, AdCenter) – Capacity prediction •  OSD, Azure – Latent Fault Detection in Data Centers •  Corporate Security – Insider Threats Future (and current) challenges 25InfinIT seminar, March 2015
  • 26. Bo Thiesson The development process The importance of real-number evaluation When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm. Data •  Assume we have some labeled data, of normal and very few anomalous examples. •  Split data into •  Training set, with only normal examples •  Validation set, with both normal and half of the anomalous examples •  Test set with both normal and other half of the anomalous examples 26InfinIT seminar, March 2015 Training Data Validation Data Test Data
  • 27. Bo Thiesson Example Data •  100.000 snapshots of well (normal) running machines •  20 snapshots of (latent) failing machines Training data: 60K well-running machines Validation data: 20K well-running machines, 10 failing machines Test data: 20K well-running machines, 10 failing machines 27InfinIT seminar, March 2015
  • 28. Bo Thiesson The development process (cont.) Algorithm evaluation •  Fit normal model 𝑀( 𝑥) on the training data •  On validation/test data example 𝑥, predict ​ 𝑦 = 𝑀(𝑥)={█■normal&− 𝑜𝑟−@not  normal&   and compare to true value 𝑦 Possible evaluation metrics: –  True positive, false positive, false negative, true negative –  Precision/Recall –  F1-score Use to compare algorithms with different features, tuning parameters, etc. 28InfinIT seminar, March 2015
  • 29. Bo Thiesson false positive (Type I Error) We want to avoid… false negative (Type II Error) You’re not pregnant You’re pregnant InfinIT seminar, March 2015 29 29
  • 30. Bo Thiesson Validation (credit card fraud example) (Inspired by David J. Hand’s talk at NATO ASI: Mining Massive Data sets for Security, 2007) Unbalanced classes Detector •  correctly identifies 99 in 100 legitimate transactions, and •  correctly identifies 99 in 100 fraudulent transactions Pretty good? But suppose only 1 in 10.000 transactions are fraudulent Only 1 in 100 reported fraudulent transactions are in fact fraudulent 30InfinIT seminar, March 2015 True class Legit Fraud Predicted class Legit 99% 1% Fraud 1% 99% Numbers 9999 1 True class Legit Fraud Predicted class Legit 9899,01 0,01 Fraud 99,99 0.99 Numbers 9999 1 TP rate=​ 0,99/99,99+0.99  =0,98%
  • 31. Bo Thiesson “The boy who cried wolf” >99% of suspected frauds are in fact legitimate This matters because: •  operational decisions must be made (stop card?) •  good customers must not be irritated Same challenge for other anomaly detection systems – cyber intrusion detection, detecting latent machine failures, industrial damage detection, etc. 31InfinIT seminar, March 2015
  • 32. Bo Thiesson Outline Basic setup & motivation •  1 class ↔ 2 class •  Application examples Standard algorithms •  Parametric - Gaussian modeling, Replicator Neural Networks •  Non-parametric - NN modeling (distance & density based, LOF) •  Clustering Contextual detection The development process •  Feature selection & model validation Microsoft applications •  Excel (conditional formatting rules) •  OSD (Bing, MSN, AdCenter) – Capacity prediction •  OSD, Azure – Latent Fault Detection in Data Centers •  Corporate Security – Insider Threats Future (and current) challenges 32InfinIT seminar, March 2015
  • 33. Bo Thiesson Excel (conditional formatting rules) (Max Chickering, Allan Folting, David Heckerman, Eric Vigessa & Bo Thiesson) 33InfinIT seminar, March 2015 Drill across Region (outlier)
  • 34. Bo Thiesson Excel (conditional formatting rules) 34InfinIT seminar, March 2015 Drill into cat3 (outlier)
  • 35. Bo Thiesson Excel (conditional formatting rules) 35InfinIT seminar, March 2015
  • 36. Bo Thiesson Capacity Prediction (Alexei Bocharov & Bo Thiesson) Capacity prediction = prediction of traffic to sets of MSN web pages •  “How many impressions can we sell on a property between certain start and end dates in the future?” •  Over-prediction = oversell (customer dissatisfaction, make good items,…) •  Under-prediction = undersell (loss of revenue, dilution of product, …) •  Microsoft adCenter essentially sells web page traffic futures to advertisers across >20,000 MSN page groups Forecast future from past data Many existing forecasting methods •  One series: ARIMA, Exponential Smoothing methods (e.g. Holt-Winters), ART,… •  Multiple series: VARIMA, ARTxp , Neural nets,… Gradually forgt the past
  • 37. Bo Thiesson Breaking the forecasting models •  Transient events •  Cyclical (recurring, but not periodic) •  FOMC-meetings (interest rates) •  Periodic (long periodicities with few data) •  E.g., April 15th is Tax day •  CPS only has 2-3 yrs of data, not enough! •  Sporadic •  E.g., earthquake in Chile 37
  • 38. Bo Thiesson Seasonal and Floating Patterns TSF provides basis for predicting future traffic with •  trend, •  seasonality, and •  floating calendar patterns For trend and seasonality: •  TSF model selection + spectral analysis •  Industry-strength seasonal pattern prediction and superior seasonal pattern detection For floating patterns: •  a calendar of future events is needed (such as used by MSN Channel editorial team → next slide) •  Custom calendar event forecasters improve prediction accuracy by factor of 2 or more (compared to traditional forecasters)
  • 39. Bo Thiesson Example: MSN Editorial Calendar
  • 40. Bo Thiesson The OSD Data Quality (DQ) system •  Single source of Data Quality reports across the entire OSD •  Reports are basis for investigating (abnormal) incidents with both user-facing Bing properties and Bing infrastructure •  Alert generator from online KPI measures (based on TSF, in parts) •  Alerts are ranked: •  Severity of anomaly (deviation from predicted value) + •  importance of KPI •  Involves the (un)certainty of prediction •  Number of alerts can be matched human resources (for report investigations) 40InfinIT seminar, March 2015
  • 41. Bo Thiesson Latent Fault Detection in Data Centers Gabel, Schuster, Bachrach & Bjørner (2012) Challenge: •  machine failures ⇒ service outages, data loss •  Proactively detect (latent) failures before they happen •  Machine failures are often not a result of abrupt change but rather a slow degrade in performance Solution •  Agile & domain independent: no domain knowledge needed, uses only standard performance counters collected from: •  Hardware (e.g., temperature) •  Operating system (e.g., number of threads) •  Runtime system (e.g., garbage collected) •  Application layer (e.g. transactions completed) •  Un-supervised 41InfinIT seminar, March 2015
  • 42. Bo Thiesson Latent Fault Detection in Data Centers (cont.) Assumptions: •  Many machines, majority working properly at any point in time •  Machines are homogeneous (perform similar tasks on similar hardware and software) •  On average, workload is balanced across machines •  Counters are ordinal and reported at same rate •  Counters are memoryless 42InfinIT seminar, March 2015 Standard for (groups of) machines in big data centers Technical assumptions that (I believe) can be softened with some thought
  • 43. Bo Thiesson Latent Fault Detection in Data Centers (cont.) Anomaly detection framework (slightly simplified): •  𝑥( 𝑚,   𝑡): vector of performance counters for machine 𝑚∈ 𝑀 at time 𝑡 in epoch 𝑇. •  𝑥( 𝑡): collection of all 𝑥( 𝑚,   𝑡) at time 𝑡 •  For each machine 𝑚  compute: •  𝑆(𝑚, 𝑥( 𝑡))=​1/|𝑀|−1 ∑𝑚≠ 𝑚′↑▒​ 𝑥(𝑚, 𝑡) − 𝑥(​ 𝑚↑′ , 𝑡)/‖𝑥(𝑚, 𝑡)− 𝑥(​ 𝑚↑′ , 𝑡)‖   •  Machine measure: 𝑣(𝑚)=​1/|𝑇|−1  ∑𝑡∈ 𝑇↑▒𝑆(𝑚, 𝑥( 𝑡))  •  Compute centroid measure across all machines: ​ 𝑣  •  For each machine 𝑚, use statistical test 𝐹(e.g. sign-test, Tukey-test, LOF-test)* to determine if deviance: 43InfinIT seminar, March 2015 *: See details in: Latent Fault Detection in Large Scale Services; Gabel, Schuster, Bachrach & Bjørner (2012)
  • 44. Bo Thiesson Latent Fault Detection in Data Centers (cont.) Results – detection of latent faults: •  4500 machines •  Considered fault latencies: 1 day, 7 days, 14 days •  Incomplete health information labeled according to current system “watchdogs” •  True failure not reported by “watchdog” → counted as false positive •  Abrupt failure without proceeding latent faulting → counted as false negative 44InfinIT seminar, March 2015 ⇒ Conservative result claims
  • 45. Bo Thiesson Corporate Security – Insider Threats (Xiang Wang, Jay Stokes,& Bo Thiesson) •  User/machine “social” network of •  Approx. 100K users •  Approx ½M machines •  Rich contextual information: •  User: name, location, organization, title,… •  Machine: name, IP address, services,,… 45InfinIT seminar, March 2015
  • 46. Bo Thiesson Corporate Network Security (cont.) 46InfinIT seminar, March 2015
  • 47. Bo Thiesson Corporate Network Security (cont.) 47InfinIT seminar, March 2015 User-user similarity matrix User-user constraint matrix
  • 48. Bo Thiesson Corporate Network Security (cont.) Anomalies over time -- visualization 48InfinIT seminar, March 2015
  • 49. Bo Thiesson Corporate Network Security (cont.) Who did we catch? 49InfinIT seminar, March 2015
  • 50. Bo Thiesson Outline Basic setup & motivation •  1 class ↔ 2 class •  Application examples Standard algorithms •  Parametric - Gaussian modeling, Replicator Neural Networks •  Non-parametric - NN modeling (distance & density based, LOF) •  Clustering Contextual detection The development process •  Feature selection & model validation Microsoft applications •  Excel (conditional formatting rules) •  OSD (Bing, MSN, AdCenter) – Capacity prediction •  OSD, Azure – Latent Fault Detection in Data Centers •  Corporate Security – Insider Threats Future (and current) challenges 50InfinIT seminar, March 2015
  • 51. Bo Thiesson Future (and current) challenges – “Big Data” Volume •  many transactions - billions - algorithms must be efficient -  [cannot just take sample] Velocity •  Streaming data – efficient online, adaptive algorithms -  [reactive to drift] Variety •  Data from many sources – mixed variable types, large number of variables (many irrelevant), network data -  [very demanding on distance measure & modeling process] Other issues -  different misclassification costs -  unbalanced class sizes -  delay in labeling (e.g., in credit card fraud) -  mislabeled classes 51InfinIT seminar, March 2015