Anomaly Detection &
Spark Implementation
Presenters:
Maxim Shkarayev
Anand Venugopal
Punit Shah
DECEMBER 5, 2017
Meetup:
Stream Processing and Machine
Learning Platform for the Enterprise
Thought Leadership / Advisory
Impetus Introduction
Mission critical
technology solutions
since 1996
Global leaders are
our Big Data clients
1700 people: US,
India, global reach
Unique mix of
Big Data products
and services
• Real-time C360 and Churn
• Next Best Offer or Action
• Streaming ETL
• IoT and Log Analytics
• Fraud, Risk Anomaly detection
• Anomaly detection
• Predictive Maintenance
Enabling the Real-time Enterprise
Delightful Customer Experiences
Maximizing operational efficiency
with real-time insights
Build and Deploy use-cases fast
Pre-built ETL, Analytics, Read-write operators
Drag and Drop visual development and DevOps
Fast Data and Big Data; On-premise and Cloud
Enabling the Real-time Enterprise
“I could do my 1.5 month Spark app
in 1.5 days with this product”
- Analytics Lead at Tier 1 US Telco
Impetus Data Science Practice – Relevant Use-cases
Banking and Finance
Data Analytics & Modeling
Finding fraudulent travel and expenses
Text Mining & NLP
Detecting intent to commit fraud in e-communications
Graph Analytics
Business impact of customer loss
Insurance
Data Analytics & Modeling
Insurance premium determination using
Catastrophe Modeling
Text Mining & NLP
Detecting intent to commit fraud in e-communications (AML, Dodd-Frank, etc.)
Communication and Media
Data Analytics & Modeling
Finding root cause of No Dial Tone;
Self-learning Anomaly Detection System
Marketing Analytics
Lead generation and Multi-touch
Attribution for increasing conversion rates
Manufacturing and Logistics
Data Analytics & Modeling
Lowering rejection rate of silicon wafers
for a semiconductor company
Early detection of paint defects for
leading auto manufacturer
Correlating multiple data sources to
identify factors related to warranty issues
Energy & Utilities
Data Analytics & Modeling
Reinforcement Learning model to enable
bidding of electricity (price and quantity)
Information Extraction
Extract label information from P&IDs and
make them searchable
Create a Bill of Materials for Budgeting
Healthcare
Data Analytics & Modeling
Predicting Patient Readmission
Text Mining & NLP
Competitive analysis of medicines
Graph Analytics
Drug-disease co-occurrence with Medline
Anomaly Definition
Anomaly: an observation that deviates greatly from most of the other observations, i.e., a
data point, behavior, or pattern that appears to be statistically unusual
Basic qualities of anomaly:
1. Rare
2. Significantly different from others
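As a rough illustration of "rare" and "significantly different", the sketch below scores each value by its distance from the median, measured in units of the median absolute deviation (MAD). The function names, the sample amounts, and the 3.5 cut-off are assumptions made for this example, not part of the presentation.

```python
from statistics import median

def mad_scores(values):
    """Return a robust deviation score per value (0 means typical)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Degenerate case: more than half the values are identical,
        # so this simple sketch gives up and scores everything as 0.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to z-scores
    # when the bulk of the data is roughly normal.
    return [0.6745 * abs(v - center) / mad for v in values]

def find_anomalies(values, threshold=3.5):
    """Return the values whose score exceeds the (assumed) threshold."""
    return [v for v, s in zip(values, mad_scores(values)) if s > threshold]

# Hypothetical transaction amounts: eight typical values and one extreme one.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 500]
anomalies = find_anomalies(amounts)
print(anomalies)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value shifts them far less.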
Impetus DSP – Some Applications of Anomaly Detection
The problem of finding patterns in data that do not conform to expected behaviour
• Manufacturing: Detect abnormal machine behavior to prevent cost overruns
• Finance, Insurance: Detect and prevent out-of-pattern or fraudulent spend and travel expenses
• Healthcare: Detect fraud in claims and payments; events from RFID and mobiles
• Banking: Flag abnormally high purchases or deposits, detect cyber intrusions
• Networking: Detect intrusion into networks, prevent theft of source code or IP
• Social Media: Detect compromised accounts and bots that generate fake reviews
• Video Surveillance: Detect or track objects and persons of interest in monotonous footage
• Smart Homes: Detect energy leakage, standardize smart sensor datasets
• Telecom: Detect roaming abuse, revenue fraud, service disruptions, etc.
• Transportation: Ensure external communications to the vehicle are not intrusions
Deep Dive on Anomaly Detection
Anomaly Detection Algorithms Across Disciplines
Host-based IDS
• Statistical Profiling using histograms
• Mixture of Models, Neural Networks
• SVM, Rule-based systems
Network Intrusion Detection
• Statistical Profiling using histograms
• Parametric Statistical Modeling
• Non-parametric Statistical Modeling
• Bayesian Networks, Neural Networks
• SVM, Rule-based systems
• Clustering based, Nearest Neighbor
• Spectral, Information Theoretic
Credit Card Fraud Detection
• Neural Networks
• Rule-based systems
• Clustering, Self-Organizing Map
• Artificial Immune System
• Decision Trees, SVM
Mobile Phone Fraud Detection
• Statistical Profiling using Histograms
• Parametric Statistical Modeling
• Neural networks, Rule-based systems
Insider Trading Detection
• Statistical Profiling using Histograms
• Information Theoretic
Medical and Public Health
• Parametric Statistical Modeling
• Neural Networks, Bayesian Networks
• Rule-based systems
• Nearest Neighbor Techniques
Fault Detection in Mechanical Units
• Parametric Statistical Modeling
• Non-Parametric Statistical Modeling
• Neural Networks, Spectral Methods
• Rule-based Systems
Structural Damage Detection
• Statistical Profiling using histograms
• Parametric Statistical Modeling
• Mixture of Models, Neural Networks
Image Processing, Surveillance
• Mixture of Models, Regression, SVM
• Bayesian Networks, Neural Networks
• Clustering, Nearest Neighbor Methods
Anomalous Topic Detection
• Mixture of Models, Neural Networks
• Statistical Profiling using Histograms
• Clustering, SVM
Anomaly Detection in Sensor Networks
• Parametric Statistical Modeling
• Bayesian Networks, Nearest Neighbor
• Rule-based Systems, Spectral
Source: Chandola, V. et al. (2009). Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3), 15.
Taxonomy for Anomaly Detection Algorithms
Anomaly Detection
• Point Anomaly Detection: Data instance anomalous with respect to the rest of the data (e.g. a large transaction)
• Contextual Anomaly Detection: Data instance anomalous in a specific context (e.g. large power spike at night)
• Collective Anomaly Detection: A collection of related data instances is anomalous with respect to the entire data set
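To make the contextual case concrete, the sketch below scores each reading against other readings from the same context only, so a power level that is ordinary during the day can still be flagged at night. The context labels, the sample readings, and the 2.5 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_anomalies(readings, threshold=2.5):
    """readings: list of (context, value) pairs.

    Flags values that are unusual compared with other values
    observed in the same context.
    """
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)
    # Mean and population standard deviation per context.
    stats = {c: (mean(vs), pstdev(vs)) for c, vs in by_context.items()}

    flagged = []
    for context, value in readings:
        mu, sigma = stats[context]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical power readings in kW: about 5 kW by day, about 0.5 kW at night.
day = [4.8, 5.2, 5.0, 4.9, 5.1, 5.0, 5.0, 4.7, 5.3, 5.0]
night = [0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5, 0.5, 5.0]
readings = [("day", v) for v in day] + [("night", v) for v in night]

flagged = contextual_anomalies(readings)
print(flagged)
```

The 5.0 kW reading is flagged only in the night context; the same value during the day is not.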
Data – Types of Attributes
Data
• Categorical
  • Nominal: Named categories
  • Ordinal: Categories with an implied order
  • Binary: Variables with only two options (Yes/No)
• Numerical
  • Discrete: Only particular numbers
  • Continuous: Any numerical value
Anomaly Detection Approaches
• Supervised (Classification): Data skewness, lack of counter examples
• Unsupervised (Clustering): Faces curse of dimensionality
• Semi-supervised (Novelty detection): Requires a “normal” training dataset
• Anomalies are often a handful among millions of normal data points
• Given labeled training data, this is a class imbalance problem
• There are methods to address this, such as using SVM, Random Forests and ensemble learning
• If the data is auto-correlated, it may be necessary to use time-series classification
or Recurrent Neural Network based approaches
• When there is no training data, unsupervised or semi-supervised methods can be used
Source: https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/
Unsupervised Anomaly Detection Algorithms
Unsupervised AD Algorithms
• k-NN Global Anomaly Detection (uses average
distance to k neighbors)
• kth-NN (uses distance to kth neighbor)
• LOF – Local Outlier Factor
• COF – Connectivity based OF
• LoOP – Local Outlier Probability
• LOCI – Local Correlation Integral
• aLOCI – approximate LOCI
• INFLO – Influenced Outlierness
• CBLOF/ uCBLOF - Cluster-Based LOF
• LDCOF - Local Density Cluster-based OF
• CMGOS - Clustering-based Multivariate
Gaussian Outlier Score
• HBOS - Histogram-based Outlier Score
• One-class Support Vector Machine
• rPCA - Robust PCA
[Figure: LOF performance. Global anomalies (x1, x2), a local anomaly x3 and a micro-cluster c3.]
K-NN underperforms on local anomalies
Source: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152173
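The first two entries in the list, k-NN (average distance to the k nearest neighbors) and kth-NN (distance to the kth neighbor), can be sketched with a brute-force search. This is a minimal illustration on made-up 2-D points, with `k=3` chosen arbitrarily; it is quadratic in the number of points and is not how a distributed implementation would be written.

```python
from math import dist

def knn_scores(points, k=3):
    """Return one (avg_k_distance, kth_distance) pair per point.

    avg_k_distance is the k-NN global anomaly score,
    kth_distance is the kth-NN score.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append((sum(nearest) / k, nearest[-1]))
    return scores

# Hypothetical data: a tight cluster near the origin plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = knn_scores(points, k=3)

# Index of the point with the largest average k-NN distance.
most_anomalous = max(range(len(points)), key=lambda i: scores[i][0])
print(points[most_anomalous], scores[most_anomalous])
```

Both scores are global: they compare raw distances across the whole data set, which is consistent with the note above that k-NN underperforms on local anomalies.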
Some Anomaly Detection Methods
Data has a mix of Categorical and Numeric attributes
K-modes: Uses Hamming distance to measure distance for categorical features
• K-Means cannot handle data that is non-numeric
• K-Modes applies a dissimilarity measure for categorical items
Generic Mixture Models: Extends the framework of Gaussian Mixture Models
• Is based on GMMs, which are latent variable models
• A latent variable model is a probability model where some variables are never observed
Robust SVM: Kernel-based approach that identifies regions in which data resides in an alternate feature space
• Makes standard SVM robust, as it can be affected by outliers
• Retains strengths of SVM: fast computation, handling high-dimensional data and kernels
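The Hamming-distance idea behind K-modes can be sketched as follows. This is a simplified single-cluster illustration, not the full K-modes algorithm: it computes one per-attribute mode and scores each record by how many attributes differ from it. The record fields and values are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of attribute positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_record(records):
    """Most frequent category per attribute (the cluster 'mode')."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical card transactions: (card type, channel, country).
records = [
    ("visa", "online", "US"),
    ("visa", "online", "US"),
    ("visa", "store", "US"),
    ("amex", "online", "US"),
    ("amex", "atm", "RU"),
]

centre = mode_record(records)
scores = [hamming(r, centre) for r in records]
print(centre, scores)
```

A record that disagrees with the mode on every attribute, like the last one here, receives the highest dissimilarity and is the natural candidate for review.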
Some Anomaly Detection Methods
Data has a sequential nature (timestamps, or sequences)
State Space Models: Model the evolution of data in time to enable forecasting, and flag an anomaly if the deviation from the forecast exceeds a threshold
• Residual error between model and the real system is used to identify anomalous events
• This works with streaming data
Hidden Markov Models: Markov Chains and HMMs measure the probability of different events happening in some sequence
• Markov chains can be built from historical data
• This chain can be used to find the probability of an anomalous sequence of events
Graph-based Methods: Graphs capture interdependencies and allow discovery of relational associations, such as in fraud
• Network intrusion graph grows dynamically as events occur
• An activity vector obtained from the graph can detect anomalies
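The Markov-chain bullets above can be sketched as follows: estimate first-order transition probabilities from historical sequences, then multiply them along a new sequence to obtain its probability. This is a plain (fully observed) Markov chain, not an HMM; the event names and the `floor` probability assigned to unseen transitions are assumptions for the example.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Estimate P(next event | current event) from historical sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for cur, c in counts.items()
    }

def sequence_probability(seq, transitions, floor=1e-6):
    """Product of transition probabilities along seq.

    Transitions never seen in the history get a small floor
    probability instead of zero (an assumption of this sketch).
    """
    prob = 1.0
    for cur, nxt in zip(seq, seq[1:]):
        prob *= transitions.get(cur, {}).get(nxt, floor)
    return prob

# Hypothetical session histories.
history = [
    ["login", "browse", "pay", "logout"],
    ["login", "browse", "browse", "pay", "logout"],
    ["login", "browse", "logout"],
]
transitions = fit_transitions(history)

p_normal = sequence_probability(["login", "browse", "pay", "logout"], transitions)
p_odd = sequence_probability(["login", "pay", "pay", "logout"], transitions)
print(p_normal, p_odd)
```

A sequence whose probability falls far below that of typical sequences is a candidate anomaly; where to set that cut-off depends on the data.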
[Figure: model-based anomaly detection. Labels: System, Behavior model, Observed behavior, Expected behavior, Observation, Model Formation, Simulation, Anomaly Detection]
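The residual idea (compare expected behavior from a model with observed behavior from the system) can be sketched on a stream of readings. Here a simple exponential-smoothing forecast stands in for a full state space model; the smoothing factor `alpha`, the `threshold`, and the sample readings are assumptions made for illustration.

```python
def residual_anomalies(stream, alpha=0.3, threshold=5.0):
    """Yield (index, value, residual) for observations far from the forecast.

    Works on any iterable, so it can consume a stream one value at a time.
    """
    forecast = None
    for i, value in enumerate(stream):
        if forecast is None:
            forecast = value  # initialise the model with the first observation
            continue
        residual = value - forecast  # observed minus expected behavior
        if abs(residual) > threshold:
            yield i, value, residual
        else:
            # Update the model only with observations considered normal,
            # so an anomaly does not distort later forecasts.
            forecast += alpha * residual

# Hypothetical sensor readings with one spike.
readings = [20, 21, 20, 22, 21, 40, 21, 20]
flagged = list(residual_anomalies(readings))
print(flagged)
```

Skipping the model update on flagged values is a design choice of this sketch; it keeps a single spike from inflating the forecast, at the cost of reacting slowly to a genuine level shift.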
Some Anomaly Detection Methods
Other Methods
Deep Learning (AutoEncoder): AutoEncoders can learn the latent representation of the data by using an encoder and a decoder together
• The output of the AutoEncoder is compared to the input to detect and flag anomalies
• Anomalies are more likely to have a high reconstruction error
Deep Learning (RNN-based): RNN-based architectures enable sequence prediction. The network can flag an anomaly when the observed sequence departs from the prediction
• RNN-based models can detect anomalies in time series data
• More capable architectures such as LSTM are also possible
Generative Adversarial Nets: GANs combine two neural networks, a generator and a discriminator, and can be used to find anomalies
• Deep Convolutional GANs are being used to learn a manifold of normal variability
• This allows high accuracy in anomaly detection
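The reconstruction-error principle can be shown without a neural network. The sketch below uses a linear stand-in for an autoencoder: the "encoder" projects a 2-D point onto the first principal axis of the normal data (a 1-D latent value) and the "decoder" maps it back. This is PCA, not a trained AutoEncoder, and the data points are hypothetical.

```python
from math import atan2, cos, sin, hypot
from statistics import mean

def fit_linear_codec(points):
    """Fit the first principal axis of 2-D points: returns (mean, unit direction)."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    theta = 0.5 * atan2(2 * sxy, sxx - syy)  # orientation of the principal axis
    return (mx, my), (cos(theta), sin(theta))

def reconstruction_error(point, codec):
    """Distance between a point and its encode-then-decode reconstruction."""
    (mx, my), (ux, uy) = codec
    dx, dy = point[0] - mx, point[1] - my
    code = dx * ux + dy * uy                  # encode: 1-D latent value
    rx, ry = mx + code * ux, my + code * uy   # decode: back to 2-D
    return hypot(point[0] - rx, point[1] - ry)

# Hypothetical "normal" training data lying close to the line y = 2x.
normal = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1), (5, 9.9)]
codec = fit_linear_codec(normal)

train_errors = [reconstruction_error(p, codec) for p in normal]
anomaly_error = reconstruction_error((3, 0), codec)
print(max(train_errors), anomaly_error)
```

Points that follow the structure learned from normal data reconstruct almost exactly, while a point off that structure has a much larger error; a real AutoEncoder applies the same comparison with a nonlinear encoder and decoder.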
Impetus DSP - Out of Pattern Transaction Detection
The Challenge
• Major credit card company has
several thousand corporate
customers
• Customers have unique compliance
policies around acceptable spend
• Build a scalable product to identify out
of pattern spend behavior at card
level
Benefits Realized
• Value added service led to increase in
charge volumes of corporate
customers
• Demonstrated the value of external
facing product launches that leverage
machine learning
• Extending to fraud in travel expenses
Impetus Contribution
• Spend behavior of the card accounts
was analyzed to identify normal
spend
• Implemented algorithm to determine
out of pattern transactions and
scaled it to ~ 2M card accounts
• Launched the product in < 3 months
Case Study – “Out of Pattern” Financial Transactions
Two possible reasons:
1) Customer’s situation may have really changed
2) Fraudulent usage
Product Demo
i. Introduction to web user interface for StreamAnalytix
ii. Multi-tenancy feature support
iii. Introduction to Data360 in StreamAnalytix
• Data pipelines
• Deploying the jobs
• Real-time dashboards and monitoring in StreamAnalytix
iv. Data Science in StreamAnalytix:
• Network anomaly use case
• Customer transaction anomaly detection use case
• A/B testing use case
v. Enterprise level features in StreamAnalytix
• Versioning
• Import & export data pipelines
• Register entities
• Data pipeline inspect
Thank you.
Questions?
© 2017 Impetus Technologies
Email: inquiry@streamanalytix.com Twitter : @StreamAnalytix
