Anomaly Detection and Spark Implementation - Meetup Presentation.pptx


StreamAnalytix sponsored a meetup, “Anomaly Detection Techniques and Implementation using Apache Spark”, on Tuesday, December 5, 2017, at the Larkspur Landing Milpitas Hotel in Milpitas, CA. The meetup was led by Maxim Shkarayev, Lead Data Scientist at Impetus Technologies, together with Punit Shah, Solution Architect at StreamAnalytix, and Anand Venugopal, Product Head & AVP at StreamAnalytix. They introduced and summarized the field of anomaly detection and its applications to industry problems. The speakers also offered a structured approach to choosing anomaly detection techniques based on the use case and data characteristics, followed by a demonstration of real-world anomaly detection use cases on an Apache Spark based analytics platform.

Published in: Data & Analytics

  1. 1. Anomaly Detection & Spark Implementation. Meetup, December 5, 2017. Presenters: Maxim Shkarayev, Anand Venugopal, Punit Shah
  2. 2. Stream Processing and Machine Learning Platform for the Enterprise Thought Leadership / Advisory
  3. 3. Impetus Introduction Mission critical technology solutions since 1996 Global leaders are our Big Data clients 1700 people: US, India, global reach Unique mix of Big Data products and services
  4. 4. Enabling the Real-time Enterprise: Delightful Customer Experiences; Maximizing operational efficiency with real-time insights • Real-time C360 and Churn • Next Best Offer or Action • Streaming ETL • IoT and Log Analytics • Fraud, Risk • Anomaly detection • Predictive Maintenance
  5. 5. Build and Deploy use-cases fast Pre-built ETL, Analytics, Read-write operators Drag and Drop visual development and DevOps Fast Data and Big Data; On-premise and Cloud Enabling the Real-time Enterprise “I could do my 1.5 month Spark app in 1.5 days with this product” - Analytics Lead at Tier 1 US Telco
  6. 6. Impetus Data Science Practice – Relevant Use-cases
      • Banking and Finance. Data Analytics & Modeling: Finding fraudulent travel and expenses. Text Mining & NLP: Intent to Fraud Detection in e-coms. Graph Analytics: Business impact of customer loss.
      • Insurance. Data Analytics & Modeling: Insurance premium determination using Catastrophe Modeling. Text Mining & NLP: Detecting intent to commit fraud in e-communications (AML, Dodd Frank etc.).
      • Communication and Media. Data Analytics & Modeling: Finding root cause of No Dial Tone; Self-learning Anomaly Detection System. Marketing Analytics: Lead generation and Multi-touch Attribution for increasing conversion rates.
      • Manufacturing and Logistics. Data Analytics & Modeling: Lowering rejection rate of silicon wafers for a semiconductor company; Early detection of paint defects for leading auto manufacturer; Correlating multiple data sources to identify factors related to warranty issues.
      • Energy & Utilities. Data Analytics & Modeling: Reinforcement Learning model to enable bidding of electricity (price and quantity). Information Extraction: Extract label information from P&IDs and make them searchable; Create a Bill of Materials for Budgeting.
      • Healthcare. Data Analytics & Modeling: Predicting Patient Readmission. Text Mining & NLP: Competitive analysis of medicines. Graph Analytics: Drug-disease co-occurrence with Medline.
  7. 7. Anomaly Definition. An anomaly is an observation that deviates greatly from most of the other observations, i.e., a data point, behavior, or pattern that appears statistically unusual. Basic qualities of an anomaly: 1. Rare 2. Significantly different from others
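
The two qualities on slide 7 (rare, and significantly different from the rest) can be illustrated with a small sketch. This is not a method from the presentation; it assumes a single numeric feature and uses a modified z-score built on the median and the median absolute deviation (MAD). The function names, the sample readings, and the 3.5 threshold are illustrative assumptions.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-score per value, based on the median and the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Simplification for this sketch: with zero spread, report no deviation.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # when the data is roughly normal.
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Indices of values whose modified z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical sensor readings; the value 25.0 is both rare and far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
print(flag_anomalies(readings))
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value inflates the latter two and can mask itself.
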
  8. 8. Impetus DSP – Some Applications of Anomaly Detection. The problem of finding patterns in data that do not conform to expected behaviour.
      • Manufacturing: Detect abnormal machine behavior to prevent cost overruns
      • Finance, Insurance: Detect and prevent Out of Pattern or Fraudulent spend, travel expenses
      • Healthcare: Detect fraud in claims and payments; Events from RFID and mobiles
      • Banking: Flag abnormally high purchases or deposits, detect cyber intrusions
      • Networking: Detect intrusion into networks, prevent theft of source code or IP
      • Social Media: Detect compromised accounts, bots that generate fake reviews
      • Video Surveillance: Detect or track objects and persons of interest in monotonous footage
      • Smart Homes: Detect energy leakage, Standardize smart sensor datasets
      • Telecom: Detect roaming abuse, Revenue fraud, Service disruptions etc.
      • Transportation: Ensure external communications to the vehicle are not intrusions
  9. 9. Deep Dive on Anomaly Detection Thought Leadership / Advisory
  10. 10. Anomaly Detection Algorithms Across Disciplines
      • Host-based IDS: Statistical Profiling using Histograms; Mixture of Models, Neural Networks; SVM, Rule-based Systems
      • Network Intrusion Detection: Statistical Profiling using Histograms; Parametric Statistical Modeling; Non-parametric Statistical Modeling; Bayesian Networks, Neural Networks; SVM, Rule-based Systems; Clustering based, Nearest Neighbor; Spectral, Information Theoretic
      • Credit Card Fraud Detection: Neural Networks; Rule-based Systems; Clustering, Self-Organizing Map; Artificial Immune System; Decision Trees, SVM
      • Mobile Phone Fraud Detection: Statistical Profiling using Histograms; Parametric Statistical Modeling; Neural Networks, Rule-based Systems
      • Insider Trading Detection: Statistical Profiling using Histograms; Information Theoretic
      • Medical and Public Health: Parametric Statistical Modeling; Neural Networks, Bayesian Networks; Rule-based Systems; Nearest Neighbor Techniques
      • Fault Detection in Mechanical Units: Parametric Statistical Modeling; Non-parametric Statistical Modeling; Neural Networks, Spectral Methods; Rule-based Systems
      • Structural Damage Detection: Statistical Profiling using Histograms; Parametric Statistical Modeling; Mixture of Models, Neural Networks
      • Image Processing, Surveillance: Mixture of Models, Regression, SVM; Bayesian Networks, Neural Networks; Clustering, Nearest Neighbor Methods
      • Anomalous Topic Detection: Mixture of Models, Neural Networks; Statistical Profiling using Histograms; Clustering, SVM
      • Anomaly Detection in Sensor Networks: Parametric Statistical Modeling; Bayesian Networks, Nearest Neighbor; Rule-based Systems, Spectral
      Source: Chandola, V. et al. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 15.
  11. 11. Taxonomy for Anomaly Detection Algorithms
      • Point Anomaly Detection: a data instance is anomalous with respect to the rest of the data (e.g. a large transaction)
      • Contextual Anomaly Detection: a data instance is anomalous in a specific context (e.g. a large power spike at night)
      • Collective Anomaly Detection: a collection of related data instances is anomalous with respect to the entire data set
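
The difference between a point anomaly and a contextual anomaly on slide 11 can be sketched with the slide's own power-at-night example. The sketch below is an illustration under assumptions, not the presenters' implementation: the context labels ("day", "night"), the readings, and the threshold are made up, and each value is scored only against other values that share its context.

```python
from collections import defaultdict
from statistics import median


def contextual_anomalies(records, threshold=3.5):
    """records: list of (context, value). Returns indices of values that are
    unusual relative to other values sharing the same context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    # Per-context median and median absolute deviation (MAD).
    stats = {}
    for context, values in by_context.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        stats[context] = (med, mad)

    flagged = []
    for i, (context, value) in enumerate(records):
        med, mad = stats[context]
        # Contexts with zero spread are skipped in this sketch.
        if mad > 0 and 0.6745 * abs(value - med) / mad > threshold:
            flagged.append(i)
    return flagged


# Hypothetical power draw in kW. A reading of 5.0 is ordinary during the day
# but stands out at night.
records = [
    ("day", 5.0), ("day", 5.2), ("day", 4.8), ("day", 5.1), ("day", 4.9),
    ("night", 1.0), ("night", 1.1), ("night", 0.9), ("night", 1.2), ("night", 5.0),
]
print(contextual_anomalies(records))
```

Only the night-time 5.0 is flagged; the identical daytime value is not, which is what separates a contextual anomaly from a point anomaly.
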
  12. 12. Data – Types of Attributes
      • Categorical, Nominal: named categories
      • Categorical, Ordinal: categories with an implied order
      • Numerical, Discrete: only particular numbers
      • Numerical, Continuous: any numerical value
      • Binary: variables with only two options (Yes/No)
  13. 13. Anomaly Detection Approaches
      • Supervised (Classification): data skewness, lack of counter-examples
      • Unsupervised (Clustering): faces the curse of dimensionality
      • Semi-supervised (Novelty detection): requires a “normal” training dataset
      Notes:
      • Anomalies are often a handful among millions of normal data points
      • Given training data, this is a class imbalance problem
      • Methods such as SVM, Random Forests, and ensemble learning can address this
      • If the data is auto-correlated, it may be necessary to use time-series classification or Recurrent Neural Network based approaches
      • When there is no training data, unsupervised or semi-supervised methods can be used
  14. 14. Unsupervised Anomaly Detection Algorithms
      • k-NN Global Anomaly Detection (uses average distance to k neighbors)
      • kth-NN (uses distance to kth neighbor)
      • LOF: Local Outlier Factor
      • COF: Connectivity based OF
      • LoOP: Local Outlier Probability
      • LOCI: Local Correlation Integral
      • aLOCI: approximate LOCI
      • INFLO: Influenced Outlierness
      • CBLOF / uCBLOF: Cluster-Based LOF
      • LDCOF: Local Density Cluster-based OF
      • CMGOS: Clustering-based Multivariate Gaussian Outlier Score
      • HBOS: Histogram-based Outlier Score
      • One-class Support Vector Machine
      • rPCA: Robust PCA
      [Figure: LOF performance. Global anomalies (x1, x2), a local anomaly x3, and a micro-cluster c3. k-NN underperforms on local anomalies.]
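
The first two entries on slide 14 differ only in how they summarize neighbor distances: the k-NN global score averages the distances to the k nearest neighbors, while the kth-NN score uses the distance to the kth neighbor alone. A brute-force sketch of both follows, assuming Euclidean distance, a small in-memory dataset, and k smaller than the number of points; the function name and sample points are illustrative.

```python
import math


def knn_scores(points, k):
    """Return two lists of anomaly scores, one entry per point:
    (average distance to the k nearest neighbors, distance to the kth neighbor)."""
    avg_scores, kth_scores = [], []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        avg_scores.append(sum(nearest) / k)
        kth_scores.append(nearest[-1])
    return avg_scores, kth_scores


# Four points forming a unit square, plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
avg_scores, kth_scores = knn_scores(points, k=2)

# The point with the largest score is the strongest global anomaly candidate.
most_anomalous = max(range(len(points)), key=lambda i: avg_scores[i])
print(points[most_anomalous], avg_scores[most_anomalous])
```

Both scores are global: they compare raw distances across the whole dataset, which is why, as the slide notes, k-NN underperforms on local anomalies that sit close to a dense cluster. The pairwise loop is quadratic in the number of points and is meant only to show the scoring rule.
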
  15. 15. Some Anomaly Detection Methods: data has a mix of Categorical and Numeric attributes
      • K-modes: uses Hamming distance to measure distance for categorical features
          • K-Means cannot handle data that is non-numeric
          • K-Modes applies a dissimilarity measure for categorical items
      • Generic Mixture Models: extends the framework of Gaussian Mixture Models
          • Is based on GMMs, which are latent variable models
          • A latent variable model is a probability model where some variables are never observed
      • Robust SVM: kernel based approach that identifies regions in which data resides in an alternate feature space
          • Makes standard SVM robust, as it can be affected by outliers
          • Retains strengths of SVM: fast computation, handling high-dimensional data, and kernels
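
The K-modes entry on slide 15 rests on two pieces: a Hamming dissimilarity between categorical records, and a cluster "mode" that takes the place of the K-Means centroid. The sketch below shows only those two pieces, not the full K-modes iteration; the attribute values (card type, merchant category, country) and the function names are made-up illustrations.

```python
from collections import Counter


def hamming(a, b):
    """Number of attribute positions at which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Most frequent category for each attribute (a column-wise mode)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical records: (card type, merchant category, country).
cluster = [
    ("visa", "grocery", "US"),
    ("visa", "grocery", "US"),
    ("visa", "fuel", "US"),
    ("amex", "grocery", "US"),
]
mode = cluster_mode(cluster)

candidate = ("amex", "jewelry", "RU")
print(mode, hamming(candidate, mode))
```

A record whose dissimilarity to every cluster mode is large is a natural anomaly candidate; how large is "large" depends on the number of attributes and would need to be chosen per dataset.
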
  16. 16. Some Anomaly Detection Methods: data has a sequential nature (timestamps, or sequences)
      • State Space Models: model the evolution of data in time to enable forecasting, and flag an anomaly if the deviation from the forecast exceeds a threshold
          • Residual error between the model and the real system is used to identify anomalous events
          • This works with streaming data
      • Hidden Markov Models: Markov Chains and HMMs measure the probability of different events happening in some sequence
          • Markov chains can be built from historical data
          • This chain can be used to find the probability of an anomalous sequence of events
      • Graph based Methods: graphs capture interdependencies, and allow discovery of relational associations such as in fraud
          • Network intrusion graph grows dynamically as events occur
          • An activity vector obtained from the graph can detect anomalies
      [Diagram: System, Behavior model, Observed behavior, Expected behavior; Observation, Model Formation, Simulation, Anomaly Detection]
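
The Markov chain bullets on slide 16 (build the chain from historical data, then score how probable a new sequence is) can be sketched as follows. This is a first-order chain with additive smoothing, offered as an illustration rather than the presenters' model; the event names, the smoothing constant, and the assumption that every event in a scored sequence also appears in the history are all choices made for this sketch.

```python
import math
from collections import defaultdict


def fit_transitions(sequences, smoothing=1.0):
    """Estimate first-order transition probabilities from historical sequences.
    Additive smoothing keeps unseen transitions at a small non-zero probability."""
    states = sorted({s for seq in sequences for s in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in states:
        total = sum(counts[a].values()) + smoothing * len(states)
        probs[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return probs


def avg_log_likelihood(seq, probs):
    """Mean log-probability per transition; lower means more unusual.
    Assumes seq has at least two events, all of them seen in the history."""
    steps = list(zip(seq, seq[1:]))
    return sum(math.log(probs[a][b]) for a, b in steps) / len(steps)


# Hypothetical session logs used as history.
history = [
    ["login", "browse", "pay", "logout"],
    ["login", "browse", "browse", "pay", "logout"],
    ["login", "browse", "logout"],
]
probs = fit_transitions(history)

normal = ["login", "browse", "pay", "logout"]
odd = ["login", "pay", "pay", "pay"]
print(avg_log_likelihood(normal, probs), avg_log_likelihood(odd, probs))
```

Averaging per transition keeps long and short sequences comparable. A sequence would be flagged when its score falls below a threshold chosen from the scores of historical sequences.
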
  17. 17. Some Anomaly Detection Methods: other methods
      • Deep Learning (AutoEncoder): AutoEncoders can learn the latent representation of the data by using an encoder and a decoder together
          • The output of the AutoEncoder is compared to the input to detect and flag anomalies
          • Anomalies are more likely to have a high reconstruction error
      • Deep Learning (RNN-based): RNN-based architectures enable sequence prediction. The network can flag an anomaly when needed
          • RNN based models can detect anomalies in Time Series Data
          • More capable architectures such as LSTM are also possible
      • Generative Adversarial Nets: GANs combine two neural networks, a generator and a discriminator, and can be used to find anomalies
          • Deep Convolutional GANs are being used to learn a manifold of normal variability
          • This allows high accuracy in anomaly detection
  18. 18. Impetus DSP - Out of Pattern Transaction Detection
      The Challenge
      • Major credit card company has several thousand corporate customers
      • Customers have unique compliance policies around acceptable spend
      • Build a scalable product to identify out of pattern spend behavior at card level
      Impetus Contribution
      • Spend behavior of the card accounts was analyzed to identify normal spend
      • Implemented algorithm to determine out of pattern transactions and scaled it to ~ 2M card accounts
      • Launched the product in < 3 months
      Benefits Realized
      • Value added service led to increase in charge volumes of corporate customers
      • Demonstrated the value of external facing product launches that leverage machine learning
      • Extending to fraud in travel expenses
  19. 19. Case Study – “Out of Pattern” Financial Transactions. Two possible reasons: 1) The customer’s situation may have really changed 2) Fraudulent usage
  20. 20. Product Demo
  21. 21. Product demo agenda
      i. Introduction to web user interface for StreamAnalytix
      ii. Multi-tenancy feature support
      iii. Introduction to Data360 in StreamAnalytix • Data pipelines • Deploying the jobs • Real-time dashboards and monitoring in StreamAnalytix
      iv. Data Science in StreamAnalytix • Network anomaly use case • Customer transaction anomaly detection use case • A-B testing use case
      v. Enterprise level features in StreamAnalytix • Versioning • Import & export data pipelines • Register entities • Data pipeline inspect
  22. 22. Thank you. Questions? © 2017 Impetus Technologies Email: Twitter : @StreamAnalytix