This document discusses anomaly detection techniques. It provides an overview of basic concepts in anomaly detection including one-class versus two-class problems and potential application areas. Standard algorithms covered include parametric methods using Gaussian modeling and replicator neural networks, as well as non-parametric approaches such as nearest neighbor modeling, distance-based and density-based techniques, and clustering. The document also discusses contextual detection and gives examples of Microsoft applications of anomaly detection.
1. Anomaly Detection
– from basic concepts to application cases at Microsoft
Bo Thiesson
thiesson@cs.aau.dk
http://people.cs.aau.dk/~thiesson
Aalborg University ( – 1996)
Microsoft Research, Redmond, USA (1996 – 2013)
Aalborg University (2013 – )
2. Bo Thiesson
Outline
Basic setup & motivation
• 1 class ↔ 2 class
• Application examples
Standard algorithms
• Parametric - Gaussian modeling, Replicator Neural Networks
• Non-parametric - NN modeling (distance & density based, LOF)
• Clustering
Contextual detection
The development process
• Feature selection & model validation
Microsoft applications
• Excel (conditional formatting rules)
• OSD (Bing, MSN, AdCenter) – Capacity prediction
• OSD, Azure – Latent Fault Detection in Data Centers
• Corporate Security – Insider Threats
Future (and current) challenges
InfinIT seminar, March 2015
3. Bo Thiesson
(Simplified) Anomaly Detection Example
Computer features: $x = (x_1, \ldots, x_n)$
$x_1$: Memory use
$x_2$: CPU load
⋮
Training data: $\{x^{(1)}, \ldots, x^{(m)}\}$
Run-time data: $x$
[Figure: scatter plot of data points on axes "CPU load" and "Memory use", with one run-time point labeled "ok" and one labeled "anomaly"]
4. Bo Thiesson
What is an anomaly
• An anomaly is a pattern in the data that does not conform to the expected behaviour
• Also referred to as outliers, exceptions, peculiarities, surprises, etc.
• Definition of Hawkins [Hawkins 1980]:
“An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism”
5. Bo Thiesson
(Simplified) Anomaly Detection Example
Computer features: $x = (x_1, \ldots, x_n)$
$x_1$: Memory use
$x_2$: CPU load
⋮
Training data:
ok: $\{x^{(1)}, \ldots, x^{(m)}\}$
not-ok: $\{x^{(1)}, \ldots, x^{(k)}\}$
Run-time data: $x$
[Figure: scatter plot on axes "CPU load" and "Memory use" showing ok and not-ok training points, with one run-time point labeled "ok" and one labeled "anomaly"]
6. Bo Thiesson
Two Scenarios
Supervised scenario (standard classification)
• Training data with both normal and abnormal data objects is provided
• There may be multiple normal and/or abnormal classes
• Often, the classification problem is highly imbalanced
• Up-sample abnormal data or down-sample normal data
• Future anomalies must look like the abnormal class(es) of training
objects
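The up-sampling option mentioned above can be sketched as follows. This is only a minimal illustration of random over-sampling, assuming records are plain tuples; the function name `upsample_minority` and the toy data are hypothetical and not from the talk.

```python
import random

def upsample_minority(normal, abnormal, seed=0):
    """Randomly duplicate abnormal records until both classes have equal size.

    Hypothetical helper: plain random over-sampling with replacement.
    """
    if not abnormal:
        raise ValueError("need at least one abnormal record to up-sample")
    rng = random.Random(seed)
    missing = max(0, len(normal) - len(abnormal))
    extra = [rng.choice(abnormal) for _ in range(missing)]
    return list(normal), list(abnormal) + extra

# Toy (memory use, CPU load) records; values are made up for illustration.
normal = [(0.20, 0.30), (0.25, 0.35), (0.30, 0.20),
          (0.22, 0.28), (0.27, 0.31), (0.24, 0.26)]
abnormal = [(0.90, 0.95), (0.85, 0.99)]

normal_bal, abnormal_bal = upsample_minority(normal, abnormal)
print(len(normal_bal), len(abnormal_bal))
```

Down-sampling the normal class would be the mirror image: draw a random subset of `normal` of size `len(abnormal)`. Either way, the balanced set should be used for training only, not for validation.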
7. Bo Thiesson
Two Scenarios (cont.)
Semi-supervised Scenario (anomaly detection)
• Training data for only the normal class(es) is provided
• There may be multiple normal classes
• Often no (explicit) class label is given, but the data is assumed to be (mostly) normal
• Relies on anomalies being very rare compared to normal data
• Sometimes a very small number of anomalies is included in the training data (0 to 20 is common), but not enough to learn the abnormal class(es)
• Future anomalies may look nothing like any previously observed
anomaly
Focus
8. Bo Thiesson
Where to look for anomalies
• Anomalous events occur relatively infrequently
• However, when they do occur, their consequences can be quite dramatic, and quite often in a negative sense
Some applications:
• Fraud: Banking, Insurance, Health care, Click, ...
• Cyber intrusion: Insider/outsider, machine or network level
• Medicine and health: abnormal test results, disease patterns (e.g. geo-spatial)
• (Anti) terrorism
• Data cleaning (measurement errors)
• …
9. Bo Thiesson
Outline
Basic setup & motivation
• 1 class ↔ 2 class
• Application examples
Standard algorithms
• Parametric - Gaussian modeling, Replicator Neural Networks
• Non-parametric - NN modeling (distance & density based, LOF)
• Clustering
Contextual detection
The development process
• Feature selection & model validation
Microsoft applications
• Excel (conditional formatting rules)
• OSD (Bing, MSN, AdCenter) – Capacity prediction
• OSD, Azure – Latent Fault Detection in Data Centers
• Corporate Security – Insider Threats
Future (and current) challenges
10. Bo Thiesson
Modeling paradigms
Parametric
• Learn (parametric) model representing normal data
• Outliers deviate strongly from this model
• Data represented by sufficient statistics (estimating model parameters)
• Space efficient & fast run-time (usually)
• Depends on parametric structure (shape) of the model, also restricts number
of normal classes
Non-parametric
• All data is kept around; the “model” defines a distance measure between data points, and sometimes a kernel
• Distance-based approach: outliers have a large distance to other data
• (Kernel) density-based approach: the density around an outlier is significantly smaller than for its neighbors
• Space demanding & not as fast at run-time (usually)
• No structural model constraints and no restrictions on the number of normal classes
11. Bo Thiesson
Algorithm – Parametric
Example: Multivariate Gaussian (normal) model
• $\mu$ is the mean of all points (usually the data is normalized so that $\mu = 0$)
• $\Sigma$ is the covariance matrix of the data about the mean
• Mahalanobis distance
• $MDist(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu)$
• follows a $\chi^2$-distribution with $d$ degrees of freedom ($d$ = data dimensionality)
• Outliers: points $x$ with $MDist(x, \mu) > \chi^2(0.975)$ [$\approx 3\sigma$]
11InfinIT seminar, March 2015
( )
2
)()( 1
||2
1
),;(
µµ
π
µ
−−
−
−
Σ
=Σ
xx
d
exN
ΣT
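The Mahalanobis outlier rule can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it is restricted to 2-D data so that the covariance matrix can be inverted by hand and the $\chi^2$ quantile has a closed form (for 2 degrees of freedom the CDF is $1 - e^{-x/2}$). The function names and the toy data are assumptions made for the example.

```python
import math

def fit_gaussian_2d(points):
    """Estimate the mean vector and 2x2 sample covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mdist(x, mu, cov):
    """MDist(x, mu) = (x - mu)^T Sigma^{-1} (x - mu) for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def chi2_quantile_2dof(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)

# Toy "normal" data (illustrative values only).
normal_points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2)]
mu, cov = fit_gaussian_2d(normal_points)
threshold = chi2_quantile_2dof(0.975)

def is_outlier(x):
    """Flag x when its Mahalanobis distance exceeds the 0.975 quantile."""
    return mdist(x, mu, cov) > threshold
```

For higher dimensions one would typically use a linear-algebra library for the inverse and a statistics library for the quantile; the rule itself stays the same.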
12. Bo Thiesson
Parametric algorithm – Multivariate Gaussian
Tan, P.-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley.
[Figure: the density $N(x; \mu, \Sigma)$ annotated with its parameters ($\mu$, $\Sigma$) and the data object ($x$)]
13. Bo Thiesson
Algorithm – Parametric (cont.)
Challenge
A Fix
• Mixture modeling
• ⇒ contextual outlier detection (more later)
• Demands a clustering approach
• Not as flexible as non-parametric approaches
14. Bo Thiesson
Replicator Neural Networks
S. Hawkins, et al. Outlier detection using replicator neural networks, DaWaK02 (2002)
• Use a replicator 4-layer feed-forward neural network (RNN) with the same number of input and output nodes
• The input variables are also the target (output) variables, so the RNN forms a compressed model of the data during training
• A measure of anomaly is the reconstruction error of individual data points
[Figure: replicator network; the input variables also serve as the target variables]
15. Bo Thiesson
Algorithm – Non-Parametric
Nearest Neighbor (NN) Based Techniques
Key assumption: normal data records have close neighbors while anomalies
are located far from other records
General two-step approach
1. Compute the neighborhood for each data record
2. Analyze the neighborhood to determine whether the data record is an anomaly or not
Paradigms:
• Distance based methods
• Anomalies are data points most distant from other points
• Density based methods
• Anomalies are data points in low density regions
16. Bo Thiesson
NN Distance-based Anomaly Detection
Knorr & Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, VLDB98
Algorithm
• For each data record $d$, compute the distance $d_k$ to its $k$-th nearest neighbor
• The distance measure is the important design choice
• Sort all data records according to the distance $d_k$
• Outliers are the records with the largest distance $d_k$, which are therefore located in the sparsest neighborhoods
• Usually the data records with the top $n$% distances $d_k$ are identified as outliers
• Multiple normal classes are not a problem
• Not suitable for datasets that have modes with varying density
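The ranking by k-th nearest-neighbor distance can be sketched as follows. This is a brute-force illustration under assumed names (`kth_nn_distance`, `top_outliers`) with Euclidean distance as the assumed distance measure; it returns a fixed number of top-ranked records rather than a top percentage, and it is not the optimized algorithm from the cited paper.

```python
import math

def kth_nn_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def top_outliers(points, k, n_outliers):
    """Indices of the n_outliers points with the largest k-th NN distance."""
    d_k = kth_nn_distance(points, k)
    ranked = sorted(range(len(points)), key=lambda i: d_k[i], reverse=True)
    return ranked[:n_outliers]

# Two tight groups plus one isolated point (illustrative values only).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
outlier_indices = top_outliers(points, k=2, n_outliers=1)
```

Note that both groups count as normal here, which matches the remark that multiple normal classes are not a problem.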
17. Bo Thiesson
Advantages of Density-based Techniques
Anomalies:
• Distance-based: $p_1$, $p_3$, …
• Density-based: $p_1$, $p_2$, …
Fixes problem with different local densities
[Figure: points $p_1$, $p_2$ and $p_3$, with the distance from $p_2$ and from $p_3$ to their nearest neighbors marked]
18. Bo Thiesson
Local Outlier Factor (LOF)
(Breunig, et al, LOF: Identifying Density-Based Local Outliers, KDD 2000)
Algorithm
• For each data point $q$, compute the distance to its $k$-th nearest neighbor ($k$-dist)
• Compute the reachability distance ($r$-dist) of each data example $q$ with respect to data example $p$ as:
$r\text{-dist}(q, p) = \max\{k\text{-dist}(p),\ \mathrm{dist}(q, p)\}$
• Compute the local reachability density (lrd) of data example $q$ as the inverse of the average reachability distance over the $k$-NN of $q$:
$\mathrm{lrd}(q) = \dfrac{k}{\sum_{p \in kNN(q)} r\text{-dist}(q, p)}$
• Compute $\mathrm{LOF}_k(q)$ as the ratio of the average local reachability density of $q$'s $k$-NN to the local reachability density of $q$:
$\mathrm{LOF}_k(q) = \dfrac{1}{k} \sum_{p \in kNN(q)} \dfrac{\mathrm{lrd}(p)}{\mathrm{lrd}(q)}$
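The four LOF steps translate almost directly into code. The sketch below is a brute-force illustration with Euclidean distance; the helper names are assumptions, ties in the neighbor ranking are broken by index, and duplicate points (which would give a zero reachability sum) are not handled.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return others[:k]

def lof_scores(points, k):
    """LOF_k for every point, following the k-dist, r-dist, lrd, LOF steps."""
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    # k-dist: distance to the k-th nearest neighbor.
    k_dist = [math.dist(points[i], points[neighbors[i][-1]]) for i in range(n)]

    def r_dist(q, p):
        # Reachability distance of q with respect to p.
        return max(k_dist[p], math.dist(points[q], points[p]))

    # lrd: inverse of the average reachability distance over the k-NN.
    lrd = [k / sum(r_dist(q, p) for p in neighbors[q]) for q in range(n)]
    # LOF: average lrd of the neighbors divided by the point's own lrd.
    return [sum(lrd[p] for p in neighbors[q]) / (k * lrd[q]) for q in range(n)]

# A 3x3 unit grid (one dense cluster) plus one distant point (illustrative).
grid = [(x, y) for x in range(3) for y in range(3)]
scores = lof_scores(grid + [(10, 10)], k=3)
```

In this toy set the grid points score close to 1, while the distant point scores far above 1.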
19. Bo Thiesson
Local Outlier Factor (LOF)
Properties
• LOF ≈ 1: point is in a cluster (region with homogeneous
density around the point and its neighbors)
• LOF >> 1: point is an anomaly
20. Bo Thiesson
Clustering Based Techniques
Key Assumption: Normal data instances belong to large and dense clusters,
while anomalies do not belong to any significant cluster.
General Approach:
• Cluster data into a finite number of clusters.
• Analyze each data instance with respect to its closest cluster.
• Anomalous Instances
• Data instances that do not fit into any cluster (residuals from clustering).
• Data instances in small clusters.
• Data instances that are far from other points within the same cluster.
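As a rough illustration of the general approach, the sketch below clusters with a small k-means and then flags instances in very small clusters or far from their cluster centroid. Everything specific here is an assumption made for the example: k-means as the clustering method, the deterministic initialization by index, and the "farther than 3 times the median member distance" rule.

```python
import math
from statistics import median

def kmeans(points, init_indices, iterations=20):
    """Plain k-means with centroids initialized from the given point indices."""
    centroids = [points[i] for i in init_indices]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its closest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

def cluster_outliers(points, labels, centroids, factor=3.0, min_size=2):
    """Flag members of tiny clusters and points far from their own centroid."""
    flagged = []
    for c in range(len(centroids)):
        members = [i for i, l in enumerate(labels) if l == c]
        if not members:
            continue
        if len(members) < min_size:
            flagged.extend(members)  # small cluster: all members are suspect
            continue
        dists = {i: math.dist(points[i], centroids[c]) for i in members}
        cutoff = factor * median(dists.values())
        flagged.extend(i for i in members if dists[i] > cutoff)
    return sorted(flagged)

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels, centroids = kmeans(points, init_indices=[0, 4])
flagged = cluster_outliers(points, labels, centroids)
```

The isolated point is absorbed into the second cluster but sits much farther from the centroid than the other members, so it is the only one flagged.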
22. Bo Thiesson
Contextual Anomaly Detection
Key Assumption: All normal objects within a context will be similar (in
terms of behavioral attributes), while the anomalies will be different from
other objects within the context.
General Approach:
• Identify a context around a data object (using a set of contextual
attributes).
• Determine if the test data object is anomalous within the context (using a set of behavioral attributes).
23. Bo Thiesson
Contextual Attributes
Contextual attributes define a neighborhood (context) for each instance
For example:
• Spatial Context
• Latitude, Longitude
• Graph Context
• Edges, Weights
• Sequential Context
• Position, Time
• Profile Context
• User demographics
• Behavioral context (mixture modeling)
• Contextual = behavioral attributes
24. Bo Thiesson
Contextual Anomaly Detection Techniques
Reduction to global anomaly detection
• Segment or influence data using contextual attributes
• Apply a traditional anomaly detection technique within each context using behavioral attributes
• Often, contextual attributes cannot be segmented easily; in that case, use your favourite clustering technique (from standard machine learning)
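A minimal sketch of the reduction: group records by a contextual attribute, then apply an ordinary detector to the behavioral attribute within each group. The record layout `(context, value)`, the z-score detector, and the threshold of 2 are assumptions chosen to keep the example small; any traditional detector could take the z-score's place.

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_outliers(records, threshold=2.0):
    """records: list of (context, value); returns indices flagged in context."""
    by_context = defaultdict(list)
    for i, (ctx, _) in enumerate(records):
        by_context[ctx].append(i)
    flagged = []
    for ctx, idxs in by_context.items():
        values = [records[i][1] for i in idxs]
        if len(values) < 2:
            continue  # not enough data to judge this context
        m, s = mean(values), stdev(values)
        if s == 0:
            continue  # all values identical, nothing stands out
        flagged.extend(i for i in idxs
                       if abs(records[i][1] - m) / s > threshold)
    return sorted(flagged)

# Temperatures by season (illustrative values): 5 degrees is ordinary in
# winter but anomalous in summer.
records = ([("summer", t) for t in (24, 25, 26, 25, 24, 26, 25, 5)]
           + [("winter", t) for t in (4, 5, 6, 5, 4, 6, 5, 5)])
flagged = contextual_outliers(records)
```

The same value (5) appears in both contexts, but only the summer record is flagged, which is the point of conditioning on context.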
A “softer” alternative
• Conditional anomaly detection*
* X. Song, M. Wu, C. Jermaine & S. Ranka, Conditional Anomaly Detection, IEEE Transactions on Knowledge and Data Engineering, 2006.
26. Bo Thiesson
The development process
The importance of real-number evaluation
When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.
Data
• Assume we have some labeled data, consisting of normal examples and very few anomalous examples.
• Split data into
• Training set, with only normal examples
• Validation set, with normal examples and half of the anomalous examples
• Test set, with normal examples and the other half of the anomalous examples
[Figure: the data split into training data, validation data, and test data]
27. Bo Thiesson
Example
Data
• 100,000 snapshots of well-running (normal) machines
• 20 snapshots of (latent) failing machines
Training data: 60K well-running machines
Validation data: 20K well-running machines, 10 failing machines
Test data: 20K well-running machines, 10 failing machines
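The split in this example can be sketched as below. The function name, the 60/20/20 proportions as integer arithmetic, and the placeholder records are assumptions for illustration; the only rule taken from the slides is that training data holds normal examples only, while the anomalous examples are divided between validation and test.

```python
import random

def split_for_anomaly_detection(normal, anomalous, seed=0):
    """60/20/20 split of normal data; anomalies halved into validation/test."""
    rng = random.Random(seed)
    normal, anomalous = list(normal), list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)
    n_train = len(normal) * 6 // 10
    n_val = len(normal) * 2 // 10
    half = len(anomalous) // 2
    train = normal[:n_train]
    val = normal[n_train:n_train + n_val] + anomalous[:half]
    test = normal[n_train + n_val:] + anomalous[half:]
    return train, val, test

# Placeholder records standing in for machine snapshots.
normal = [("normal", i) for i in range(100_000)]
failing = [("failing", i) for i in range(20)]
train, val, test = split_for_anomaly_detection(normal, failing)
```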
28. Bo Thiesson
The development process (cont.)
Algorithm evaluation
• Fit normal model $M(x)$ on the training data
• On validation/test data example $x$, predict
$\hat{y} = M(x) \in \{\text{normal},\ \text{not normal}\}$
and compare to the true value $y$
Possible evaluation metrics:
– True positive, false positive, false negative, true negative
– Precision/Recall
– F1-score
Use to compare algorithms with different features, tuning parameters, etc.
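The listed metrics can be computed from predicted and true labels as in this small sketch. The encoding (1 = anomalous, 0 = normal) and the function names are assumptions for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN), treating label 1 (anomalous) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1-score; each is 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Accuracy is deliberately left out: with very few anomalous examples, a detector that always predicts "normal" would score near 100%.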
29. Bo Thiesson
We want to avoid…
• false positive (Type I error): “You’re pregnant”
• false negative (Type II error): “You’re not pregnant”
30. Bo Thiesson
Validation (credit card fraud example)
(Inspired by David J. Hand’s talk at NATO ASI: Mining Massive Data sets for Security, 2007)
Unbalanced classes
Detector
• correctly identifies 99 in 100 legitimate transactions, and
• correctly identifies 99 in 100 fraudulent transactions
Pretty good?
Detector rates per true class:
Predicted class    True: Legit    True: Fraud
Legit              99%            1%
Fraud              1%             99%
Numbers            9,999          1
But suppose only 1 in 10,000 transactions is fraudulent. Expected counts per 10,000 transactions:
Predicted class    True: Legit    True: Fraud
Legit              9,899.01       0.01
Fraud              99.99          0.99
Numbers            9,999          1
Fraction of reported frauds that are truly fraudulent (labeled “TP rate” on the slide; this is the precision) = 0.99 / (99.99 + 0.99) ≈ 0.98%
Only about 1 in 100 reported fraudulent transactions is in fact fraudulent
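The arithmetic behind this example is a direct application of Bayes' rule and can be checked with a few lines. The function and parameter names are assumptions for illustration; the three input numbers are the ones given in the example.

```python
def reported_fraud_precision(sensitivity, specificity, fraud_rate):
    """P(truly fraudulent | reported as fraudulent) via Bayes' rule."""
    true_positives = sensitivity * fraud_rate
    false_positives = (1.0 - specificity) * (1.0 - fraud_rate)
    return true_positives / (true_positives + false_positives)

# 99% of frauds caught, 99% of legitimate transactions passed,
# 1 in 10,000 transactions fraudulent.
precision = reported_fraud_precision(0.99, 0.99, 1 / 10_000)
```

Raising the base rate to 1 in 100 with the same detector would lift the precision to 50%, which shows how strongly the class imbalance, rather than the detector's accuracy, drives the result.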
31. Bo Thiesson
“The boy who cried wolf”
>99% of suspected frauds are in fact legitimate
This matters because:
• operational decisions must be made (stop card?)
• good customers must not be irritated
Same challenge for other anomaly detection systems
– cyber intrusion detection, detecting latent machine failures,
industrial damage detection, etc.
33. Bo Thiesson
Excel (conditional formatting rules)
(Max Chickering, Allan Folting, David Heckerman, Eric Vigessa & Bo Thiesson)
[Figure: Excel example with “Drill across” and “Region (outlier)” callouts]
36. Bo Thiesson
Capacity Prediction
(Alexei Bocharov & Bo Thiesson)
Capacity prediction = prediction of traffic to sets of MSN web pages
• “How many impressions can we sell on a property between certain start and end
dates in the future?”
• Over-prediction = oversell (customer dissatisfaction, make good items,…)
• Under-prediction = undersell (loss of revenue, dilution of product, …)
• Microsoft adCenter essentially sells web page traffic futures to advertisers across
>20,000 MSN page groups
Forecast future from past data
Many existing forecasting methods
• One series: ARIMA, Exponential Smoothing methods (e.g. Holt-Winters), ART,…
• Multiple series: VARIMA, ARTxp , Neural nets,…
Gradually forget the past
37. Bo Thiesson
Breaking the forecasting models
• Transient events
• Cyclical (recurring, but not periodic)
• FOMC meetings (interest rates)
• Periodic (long periodicities with few data)
• E.g., April 15th is Tax Day
• CPS only has 2-3 yrs of data, not enough!
• Sporadic
• E.g., earthquake in Chile
38. Bo Thiesson
Seasonal and Floating Patterns
TSF provides basis for predicting future traffic with
• trend,
• seasonality, and
• floating calendar patterns
For trend and seasonality:
• TSF model selection + spectral analysis
• Industry-strength seasonal pattern prediction and superior seasonal
pattern detection
For floating patterns:
• a calendar of future events is needed (such as used by MSN Channel
editorial team → next slide)
• Custom calendar event forecasters improve prediction accuracy by
factor of 2 or more (compared to traditional forecasters)
40. Bo Thiesson
The OSD Data Quality (DQ) system
• Single source of Data Quality reports across the entire OSD
• Reports are basis for investigating (abnormal) incidents with both user-facing Bing
properties and Bing infrastructure
• Alert generator from online KPI measures (based on TSF, in parts)
• Alerts are ranked:
• Severity of anomaly (deviation from predicted value) +
• importance of KPI
• Involves the (un)certainty of prediction
• Number of alerts can be matched to human resources (for report investigations)
41. Bo Thiesson
Latent Fault Detection in Data Centers
Gabel, Schuster, Bachrach & Bjørner (2012)
Challenge:
• machine failures ⇒ service outages, data loss
• Proactively detect (latent) failures before they happen
• Machine failures are often not a result of abrupt change but rather of a slow degradation in performance
Solution
• Agile & domain independent: no domain knowledge needed, uses only standard
performance counters collected from:
• Hardware (e.g., temperature)
• Operating system (e.g., number of threads)
• Runtime system (e.g., garbage collected)
• Application layer (e.g. transactions completed)
• Unsupervised
42. Bo Thiesson
Latent Fault Detection in Data Centers (cont.)
Assumptions:
• Many machines, majority working properly at any point in
time
• Machines are homogeneous (perform similar tasks on similar
hardware and software)
• On average, workload is balanced across machines
• Counters are ordinal and reported at same rate
• Counters are memoryless
Notes:
• The first three assumptions are standard for (groups of) machines in big data centers
• The last two are technical assumptions that (I believe) can be softened with some thought
43. Bo Thiesson
Latent Fault Detection in Data Centers (cont.)
Anomaly detection framework (slightly simplified):
• $x(m, t)$: vector of performance counters for machine $m \in M$ at time $t$ in epoch $T$.
• $x(t)$: collection of all $x(m, t)$ at time $t$
• For each machine $m$ compute:
• $S(m, x(t)) = \dfrac{1}{|M| - 1} \sum_{m' \neq m} \dfrac{x(m, t) - x(m', t)}{\lVert x(m, t) - x(m', t) \rVert}$
• Machine measure: $v(m) = \dfrac{1}{|T| - 1} \sum_{t \in T} S(m, x(t))$
• Compute centroid measure across all machines: $\bar{v}$
• For each machine $m$, use a statistical test $F$ (e.g. sign test, Tukey test, LOF test)* to determine whether $v(m)$ is deviant
*: See details in: Latent Fault Detection in Large Scale Services; Gabel, Schuster, Bachrach & Bjørner (2012)
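The framework can be sketched roughly as follows. $S$ and $v(m)$ follow the formulas on the slide, including its $1/(|T| - 1)$ normalization. The final step is only a stand-in: instead of one of the statistical tests $F$ from the cited paper, this example flags a machine when $\lVert v(m) \rVert$ exceeds 1.5 times the average norm across machines. That rule, the function names, and the toy counter values are all assumptions made for illustration.

```python
import math

def unit_direction(a, b):
    """(a - b) / ||a - b||, or a zero vector when a equals b."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff] if norm else [0.0] * len(diff)

def sign_vector(x_t, m):
    """S(m, x(t)): average unit direction from every other machine to m."""
    total = [0.0] * len(x_t[m])
    for other, counters in x_t.items():
        if other != m:
            u = unit_direction(x_t[m], counters)
            total = [s + ui for s, ui in zip(total, u)]
    return [s / (len(x_t) - 1) for s in total]

def machine_measure(series, m):
    """v(m): S summed over time, normalized by |T| - 1 as on the slide."""
    total = [0.0] * len(series[0][m])
    for x_t in series:
        total = [s + si for s, si in zip(total, sign_vector(x_t, m))]
    return [s / (len(series) - 1) for s in total]

def flag_machines(series, factor=1.5):
    """Stand-in test: flag machines whose ||v(m)|| is far above the average."""
    scores = {m: math.hypot(*machine_measure(series, m)) for m in series[0]}
    cutoff = factor * sum(scores.values()) / len(scores)
    return scores, sorted(m for m, s in scores.items() if s > cutoff)

# Three time steps, two counters per machine; "d" drifts on the first counter.
series = [
    {"a": (50, 100), "b": (52, 98), "c": (48, 102), "d": (90, 100)},
    {"a": (51, 99), "b": (49, 101), "c": (50, 100), "d": (91, 101)},
    {"a": (49, 101), "b": (50, 100), "c": (52, 99), "d": (89, 99)},
]
scores, flagged = flag_machines(series)
```

The healthy machines point in different directions from one time step to the next, so their sign vectors largely cancel; the drifting machine is pushed in the same direction every time, so its measure accumulates.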
44. Bo Thiesson
Latent Fault Detection in Data Centers (cont.)
Results – detection of latent faults:
• 4500 machines
• Considered fault latencies: 1 day, 7 days, 14 days
• Incomplete health information labeled according to current system “watchdogs”
• True failure not reported by “watchdog” → counted as false positive
• Abrupt failure without preceding latent faulting → counted as false negative
⇒ Conservative result claims
45. Bo Thiesson
Corporate Security – Insider Threats
(Xiang Wang, Jay Stokes,& Bo Thiesson)
• User/machine “social” network of
• Approx. 100K users
• Approx. ½M machines
• Rich contextual information:
• User: name, location, organization, title, …
• Machine: name, IP address, services, …
51. Bo Thiesson
Future (and current) challenges – “Big Data”
Volume
• many transactions - billions - algorithms must be efficient
- [cannot just take sample]
Velocity
• Streaming data – efficient online, adaptive algorithms
- [reactive to drift]
Variety
• Data from many sources – mixed variable types, large number
of variables (many irrelevant), network data
- [very demanding on distance measure & modeling process]
Other issues
- different misclassification costs
- unbalanced class sizes
- delay in labeling (e.g., in credit card fraud)
- mislabeled classes