Anomaly Detection: Real World Scenarios,
Approaches and Live Implementation
WEBINAR | DECEMBER 15, 2017
Saurabh Dutta, Ravishankar Rao Vallabhajosyula
SENIOR DATA SCIENTIST, IMPETUS TECHNOLOGIES
TWITTER: @ImpetusTech
TECHNICAL PRODUCT MANAGER, STREAMANALYTIX
TWITTER: @StreamAnalytix
Agenda
• What’s an anomaly?
• Real world use cases of anomaly detection
• Key steps in anomaly detection
• A deep dive into building an anomaly detection model
• Types of anomaly detection
• Data attributes
• Approaches and methods
• A platform approach to anomaly detection
• Live implementation using StreamAnalytix
• Q & A
About Impetus
Mission critical
technology solutions
since 1996
Fortune 500:
Big data clients
1700 people; US,
India, global reach
Unique mix of
big data products
and services
What’s an Anomaly?
Anomaly: an observation that deviates greatly
from most of the other observations, i.e., a data
point, behavior, or pattern that appears to be
statistically unusual or 'anomalous'
Basic qualities of an anomaly:
1. Rare
2. Significantly different from others
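The two qualities above can be illustrated with a small scoring sketch. It uses a robust z-score (distance from the median, scaled by the median absolute deviation); this is one common choice for illustration, not a method named in the webinar. The sample amounts and the 3.5 threshold are assumptions.

```python
from statistics import median


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero: more than half of the values are identical")
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores
    # when the bulk of the data is roughly normal.
    return [0.6745 * (v - med) / mad for v in values]


def find_anomalies(values, threshold=3.5):
    """Return the values whose robust z-score exceeds the (assumed) threshold."""
    scores = robust_z_scores(values)
    return [v for v, s in zip(values, scores) if abs(s) > threshold]


# Hypothetical transaction amounts: seven typical values and one rare, very different one.
amounts = [42.0, 39.5, 41.2, 40.8, 43.1, 38.9, 40.1, 250.0]
anomalies = find_anomalies(amounts)
print(anomalies)
```

The median and MAD are used instead of the mean and standard deviation so that the anomaly itself does not distort the baseline it is compared against.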
What is different about modern anomaly detection?
• Rule-based methods are hard to scale
• Modern data science techniques are more efficient
• Can work with real-time data
• Improve detection across multiple channels
• Learn and detect variations
• Adaptable to multiple domains
Real world use cases of anomaly detection
Anomaly detection is influencing business decisions across verticals
MANUFACTURING
Detect abnormal machine
behavior to prevent cost
overruns
FINANCE & INSURANCE
Detect and prevent out-of-pattern
or fraudulent spend,
travel expenses
HEALTHCARE
Detect fraud in claims
and payments; events
from RFID and mobiles
BANKING
Flag abnormally high
purchases/deposits, detect
cyber intrusions
NETWORKING
Detect intrusion into
networks, prevent theft
of source code or IP
SOCIAL MEDIA
Detect compromised
accounts, bots that
generate fake reviews
VIDEO SURVEILLANCE
Detect or track objects
and persons of interest
in monotonous footage
SMART HOUSE
Detect energy leakage,
standardize smart
sensor datasets
TELECOM
Detect roaming abuse,
revenue fraud, service
disruptions
TRANSPORTATION
Ensure external
communications to the
vehicle are not intrusions
Key steps in anomaly detection
• Problem identification and setting expectations
• Defining the sources and schema
• Parsing and pre-processing
• Model development
• Model execution
• Investigation and feedback
• Model updating
• Operationalize model for scoring
Model development for anomaly detection
• Type of anomaly detection used
• Type of data available
• Whether the data has labels
Taxonomy of anomaly detection
Anomaly detection
• Collective anomaly
• Contextual anomaly
• Point anomaly
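As a rough illustration of the contextual case in the taxonomy above: a value can be unremarkable overall yet unusual for its context. The sketch below compares each reading against the history of its own context only. The month labels, temperatures, and threshold of 3 are made-up assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def contextual_anomalies(history, new_readings, threshold=3.0):
    """Flag readings that are unusual for their own context.

    Assumes every context in new_readings has at least two historical values.
    """
    by_context = defaultdict(list)
    for context, value in history:
        by_context[context].append(value)

    flagged = []
    for context, value in new_readings:
        baseline = by_context[context]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical temperature history, keyed by month (the context).
history = [
    ("jan", 2.0), ("jan", 4.0), ("jan", 3.0), ("jan", 1.0),
    ("jul", 29.0), ("jul", 31.0), ("jul", 30.0), ("jul", 28.0),
]
# The same reading, 29.0, arrives in two different contexts.
new_readings = [("jan", 29.0), ("jul", 29.0)]
flagged = contextual_anomalies(history, new_readings)
print(flagged)
```

Only the January reading is flagged, although 29.0 appears in the July history as a normal value.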
Data – Types of attributes
Data
• Categorical
  • Nominal: named categories
  • Ordinal: categories with an implied order
• Numerical
  • Discrete: only particular numbers
  • Continuous: any numerical value
• Binary: variables with only two options (Yes/No)
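The attribute types above usually need different numeric encodings before a model can use them. A minimal sketch follows; the category lists and values are hypothetical and chosen only for illustration.

```python
# Hypothetical category lists; not taken from the webinar data.
CARD_TYPES = ["visa", "amex", "mastercard"]   # nominal: no order
RISK_LEVELS = ["low", "medium", "high"]       # ordinal: implied order


def encode_nominal(value, categories):
    """One-hot encode: nominal values have no order, so each gets its own slot."""
    return [1 if value == c else 0 for c in categories]


def encode_ordinal(value, ordered_levels):
    """Rank encode: the position in the ordered list preserves the implied order."""
    return ordered_levels.index(value)


def encode_binary(value):
    """Map a Yes/No attribute to 1/0."""
    return 1 if value == "Yes" else 0


row = (
    encode_nominal("amex", CARD_TYPES)
    + [encode_ordinal("high", RISK_LEVELS), encode_binary("Yes")]
)
print(row)
```

Discrete and continuous numerical attributes can typically be passed through as numbers, often after scaling.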
Data – Choice of algorithm
Data has no labels
• When time-stamps are absent: apply K-means clustering
• When time-stamps are present: apply time-series anomaly detection algorithms
Data has labels
• When time-stamps are absent: use standard machine learning classifiers
• When time-stamps are present: use sequence classification algorithms
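For the unlabeled case, one way K-means can be turned into an anomaly score is to measure each point's distance to its nearest centroid. The sketch below is a plain-Python illustration under stated assumptions: centroids are initialised from the first k points for reproducibility, and the 2-D points are invented.

```python
import math


def centroid_of(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm; initial centroids are the first k points (an assumption)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [centroid_of(m) if m else centroids[i] for i, m in enumerate(clusters)]
    return centroids


def anomaly_scores(points, centroids):
    """Score = distance to the nearest centroid; larger means more unusual."""
    return [min(math.dist(p, c) for c in centroids) for p in points]


# Two tight groups plus one point far from both (hypothetical data).
points = [(1.0, 1.2), (8.0, 8.1), (0.8, 1.0), (1.1, 0.9), (7.9, 8.3), (8.2, 7.8), (4.5, 15.0)]
centroids = kmeans(points, k=2)
scores = anomaly_scores(points, centroids)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

In practice a threshold on the score, or a top-N cut, decides which points are sent for investigation.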
Approaches to anomaly detection
Supervised (Classification)
• Diagram: training data, model, test data, result
• Data skewness, lack of counter examples
Semi-supervised (Novelty detection)
• Diagram: training data, model, test data, result
• Requires a 'normal' training dataset
Unsupervised (Clustering)
• Diagram: unlabeled data, unsupervised algorithm, model, result
• Faces curse of dimensionality
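The semi-supervised (novelty detection) approach above can be sketched as follows: fit a profile on data assumed to be normal, then score unseen rows against it. The per-feature z-score profile, the feature meanings, and the threshold of 4 are illustrative assumptions rather than the presenters' model.

```python
from statistics import mean, stdev


class NoveltyDetector:
    """Learns per-feature mean and spread from 'normal' rows only."""

    def fit(self, normal_rows):
        columns = list(zip(*normal_rows))
        self.means = [mean(c) for c in columns]
        # Assumes every feature varies in the training data (non-zero spread).
        self.stds = [stdev(c) for c in columns]
        return self

    def score(self, row):
        """Largest absolute z-score across the features of one row."""
        return max(abs(x - m) / s for x, m, s in zip(row, self.means, self.stds))

    def is_novel(self, row, threshold=4.0):
        return self.score(row) > threshold


# Hypothetical 'normal' rows: (transaction amount, distance from residence).
normal_rows = [(100, 2), (120, 3), (90, 1), (110, 2), (105, 4), (95, 3)]
detector = NoveltyDetector().fit(normal_rows)

print(detector.is_novel((108, 2.5)))   # close to the training profile
print(detector.is_novel((11276, 2)))   # amount far outside the training range
```

No anomalous examples are needed for training, which is the appeal when counter examples are scarce; the cost is that the training set must really be free of anomalies.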
Methods for anomaly detection:
Categorical and numeric attributes
K-modes
• Uses Hamming distance to measure distance for categorical features
Generic mixture models
• Extend the framework of Gaussian mixture models
Robust SVM
• Kernel-based approach that identifies regions in which data resides in an alternate feature space
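To make the Hamming distance used by K-modes concrete, the sketch below computes the mode of a single cluster of categorical records and measures how many attributes a new record disagrees on. Full K-modes alternates between assigning records to the nearest mode and updating the modes; only the distance and mode steps are shown here, and the records are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(records):
    """Cluster 'centre' for categorical data: the most frequent value per attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical records: (card type, channel, country).
records = [
    ("visa", "online", "US"),
    ("visa", "online", "US"),
    ("visa", "store", "US"),
    ("amex", "online", "US"),
    ("visa", "online", "CA"),
]
cluster_mode = mode_of(records)
new_record = ("amex", "store", "RU")
distance = hamming(new_record, cluster_mode)
print(cluster_mode, distance)
```

A record that differs from every cluster mode on most attributes is a candidate anomaly.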
Methods for anomaly detection: Sequential data
State space models
• Model the evolution of data in time to enable forecasting, and flag an anomaly if the deviation from the forecast exceeds a threshold
Hidden Markov models
• Markov chains and HMMs measure the probability of different events happening in some sequence
Graph-based methods
• Graphs capture interdependencies and allow discovery of relational associations, such as in fraud
[Figure: block diagram with System, Behavior model, Observed behavior, Expected behavior, Observation, Model formation, Simulation, Anomaly detection]
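The idea of measuring the probability of events happening in a sequence can be sketched with a plain first-order Markov chain (not a full hidden Markov model). Transition probabilities are estimated from sequences assumed to be normal, and a new sequence is scored by its average log-likelihood. The event names, smoothing constant, and the -1.5 threshold are assumptions.

```python
import math
from collections import defaultdict


def fit_transitions(sequences, states, alpha=1.0):
    """Estimate P(next | previous) with add-alpha smoothing so unseen moves get a small probability."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for s in states:
        total = sum(counts[s].values()) + alpha * len(states)
        probs[s] = {t: (counts[s][t] + alpha) / total for t in states}
    return probs


def avg_log_likelihood(seq, probs):
    """Average log-probability per transition; lower means less expected."""
    steps = list(zip(seq, seq[1:]))
    return sum(math.log(probs[a][b]) for a, b in steps) / len(steps)


states = ["login", "browse", "pay", "logout"]
training = [
    ["login", "browse", "browse", "pay", "logout"],
    ["login", "browse", "pay", "logout"],
    ["login", "browse", "browse", "browse", "logout"],
]
probs = fit_transitions(training, states)

typical = avg_log_likelihood(["login", "browse", "pay", "logout"], probs)
unusual = avg_log_likelihood(["login", "pay", "pay", "pay"], probs)
THRESHOLD = -1.5  # assumed cut-off; in practice tuned on held-out data
print(typical, unusual, unusual < THRESHOLD)
```

Averaging per transition keeps scores comparable across sequences of different lengths.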
Latest methods for anomaly detection
Deep Learning (AutoEncoder)
• AutoEncoders can learn the latent representation of the data by using an encoder and a decoder together
Deep Learning (RNN-based)
• RNN-based architectures enable sequence prediction. The network can flag an anomaly when needed
Generative Adversarial Nets
• GANs combine two neural networks, a generator and a discriminator, and can be used to find anomalies
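A common way to use an encoder and decoder for anomaly detection is to score each row by its reconstruction error. The sketch below is a linear stand-in, not a neural network: the encoder projects 2-D rows onto their principal direction (found by power iteration) and the decoder maps the single code back to 2-D. A trained autoencoder would replace both steps with learned non-linear layers. The data is invented.

```python
import math


def fit_linear_codec(rows, iterations=100):
    """Return the data centre and the unit principal direction of 2-D rows."""
    n = len(rows)
    cx = sum(r[0] for r in rows) / n
    cy = sum(r[1] for r in rows) / n
    # Entries of the (unnormalised) covariance matrix.
    sxx = sum((r[0] - cx) ** 2 for r in rows)
    sxy = sum((r[0] - cx) * (r[1] - cy) for r in rows)
    syy = sum((r[1] - cy) ** 2 for r in rows)
    w = (1.0, 1.0)
    for _ in range(iterations):  # power iteration towards the dominant eigenvector
        wx = sxx * w[0] + sxy * w[1]
        wy = sxy * w[0] + syy * w[1]
        norm = math.hypot(wx, wy)
        w = (wx / norm, wy / norm)
    return (cx, cy), w


def reconstruction_error(row, centre, w):
    dx, dy = row[0] - centre[0], row[1] - centre[1]
    code = dx * w[0] + dy * w[1]          # encoder: two numbers -> one latent value
    rx, ry = code * w[0], code * w[1]     # decoder: one latent value -> two numbers
    return math.hypot(dx - rx, dy - ry)   # what the latent code failed to capture


# Hypothetical 'normal' rows lying close to the line y = 2x.
normal_rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9)]
centre, w = fit_linear_codec(normal_rows)

normal_error = reconstruction_error((3.0, 6.1), centre, w)
anomalous_error = reconstruction_error((3.0, 1.0), centre, w)
print(normal_error, anomalous_error)
```

Rows that follow the structure seen in training reconstruct well; rows that break it leave a large residual, which serves as the anomaly score.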
Anomaly detection algorithms
Host-based IDS
• Statistical profiling using histograms
• Mixture of models
• Neural networks
• SVM, Rule-based systems
Network intrusion detection
• Statistical profiling using histograms
• Parametric statistical modeling
• Non-parametric statistical modeling
• Bayesian networks, Neural networks
• SVM, Rule-based systems
• Clustering based, Nearest neighbor
• Spectral, Information Theoretic
Credit card fraud detection
• Neural networks
• Rule-based systems
• Clustering, Self-organizing map
• Artificial immune system
• Decision trees, SVM
Mobile phone fraud detection
• Statistical profiling using histograms
• Parametric statistical modeling
• Neural networks, Rule-based systems
Insider trading detection
• Statistical profiling using histograms
• Information theoretic
Medical and public health
• Parametric statistical modeling
• Neural networks, Bayesian networks
• Rule-based systems
• Nearest neighbor techniques
Fault detection in mechanical units
• Parametric statistical modeling
• Non-parametric statistical modeling
• Neural networks, Spectral methods
• Rule-based systems
Structural damage detection
• Statistical profiling using histograms
• Parametric statistical modeling
• Mixture of models, Neural networks
Image processing, Surveillance
• Mixture of models, Regression, SVM
• Bayesian networks, Neural networks
• Clustering, Nearest neighbor methods
Anomalous topic detection
• Mixture of models, Neural networks
• Statistical profiling using histograms
• Clustering, SVM
Anomaly detection in sensor networks
• Parametric statistical modeling
• Bayesian networks, Nearest neighbor
• Rule-based systems, Spectral
Poll question:
At what stage is your organization in implementing anomaly detection techniques /
solutions using advanced Data Science / Machine Learning / Real-time approaches?
Stage 0: We do not have any plans yet, I am here for education
Stage 1: We are at an initial planning stage
Stage 2: Currently evaluating platforms/ implementation partners
Stage 3: Implementation underway
Stage 4: Already using a modern anomaly detection platform/ solution
A modern platform approach to anomaly detection
• Multi-tenancy
• Rapidly develop and operationalize
• Apply data science / machine learning techniques with real-time data
• A/B testing
• Easily scalable
• Monitor, debug and diagnose at scale
• Version management
• Deployment workflow: Dev – Test – Prod
Real-time Stream Processing and Machine Learning Platform
ENABLING THE REAL-TIME ENTERPRISE
Implementing credit card fraud detection in real-time using
Schema overview
{
  "isMerchantCompromised": 0,
  "isfraudent": true,
  "transactionAmount": 11276.0,
  "phone": "1478523699",
  "radiusFromResidence": 2.0,
  "deviation": 10.0,
  "averageTransaction": 4608.0,
  "city": 3,
  "transactionTime": "1512979321050",
  "email": "ava@mail.com",
  "name": "Jean",
  "gender": "Male",
  "merchantName": "My_Company",
  "timeOfDay": "10:30:19",
  "merchantCity": 10
}
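A record with this schema could be parsed and turned into model inputs roughly as follows. This is a sketch under assumptions: the field names are copied from the sample record (with the quote around the merchant name straightened so it parses), "transactionTime" is assumed to be epoch milliseconds, and the derived feature names "amountRatio" and "hourUtc" are hypothetical, not part of the webinar's pipeline.

```python
import json
from datetime import datetime, timezone

record_json = """
{
  "isMerchantCompromised": 0,
  "isfraudent": true,
  "transactionAmount": 11276.0,
  "radiusFromResidence": 2.0,
  "deviation": 10.0,
  "averageTransaction": 4608.0,
  "transactionTime": "1512979321050",
  "merchantName": "My_Company"
}
"""


def extract_features(record):
    """Derive a few illustrative features from one transaction record."""
    # Assumption: transactionTime holds milliseconds since the Unix epoch.
    when = datetime.fromtimestamp(int(record["transactionTime"]) / 1000, tz=timezone.utc)
    return {
        # How large this transaction is relative to the customer's average.
        "amountRatio": record["transactionAmount"] / record["averageTransaction"],
        "radiusFromResidence": record["radiusFromResidence"],
        "isMerchantCompromised": record["isMerchantCompromised"],
        "hourUtc": when.hour,
        # The label field, spelled as it is in the sample record.
        "label": record["isfraudent"],
    }


record = json.loads(record_json)
features = extract_features(record)
print(features)
```

Because the sample carries a label ("isfraudent"), records like this one could feed a supervised classifier as well as the unsupervised methods described earlier.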
Build Apache Spark Applications Within Minutes
https://www.streamanalytix.com/download
Key takeaways
• Modern data science techniques significantly improve detection of anomalies
• Detection can run on streaming data in a scalable manner
• Modern platforms can simplify implementation and shorten the development cycle
Thank you.
Questions?
© 2017 Impetus Technologies
Email: inquiry@streamanalytix.com
Twitter: @ImpetusTech / @StreamAnalytix
