This slide is prepared for a course offered by the Dept. of CSE, Islamic University of Technology (IUT).
Course: CSE 4739- Data Mining
This topic is based on:
Data Mining: Concepts and Techniques
Book by Jiawei Han
Chapter 12
Outlier analysis identifies outliers, which are data objects that are grossly different from or inconsistent with the remaining set of data. Outliers can be identified using statistical, distance-based, density-based, or deviation-based approaches. Statistical approaches assume an underlying data distribution and identify outliers based on significance probabilities. Distance-based approaches identify outliers as objects with too few neighbors within a given distance. Density-based approaches identify local outliers based on local density comparisons. Deviation-based approaches identify outliers as objects that deviate from the main characteristics of their data group.
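As a toy illustration of the statistical approach, a z-score test (the function name and thresholds below are illustrative, not from the chapter) flags values that lie many standard deviations from the mean. Note that in a small sample the outlier itself inflates the estimated standard deviation, which is why a lower threshold is used in the call here:

```python
# Illustrative sketch of a statistical (z-score) outlier test.
# Values whose z-score exceeds a chosen threshold are flagged.

def zscore_outliers(values, threshold=3.0):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5
    # flag values far from the mean in units of standard deviation
    return [v for v in values if std > 0 and abs(v - mean) / std > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 100]
# the extreme value inflates std, so threshold 3 would miss it here
print(zscore_outliers(data, threshold=2.0))  # → [100]
```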
Anomaly detection is a topic with many different applications. From social media tracking to cybersecurity, anomaly detection (or outlier detection) algorithms can have a huge impact on your organisation.
For the video please visit: https://www.youtube.com/watch?v=XEM2bYYxkTU
This slideshare has been produced by the Tesseract Academy (http://tesseract.academy), a company that educates decision makers in deep technical topics such as data science, analytics, machine learning and blockchain.
If you are interested in data science and related topics, make sure to also visit The Data Scientist: http://thedatascientist.com.
Outlier analysis is used to identify outliers, which are data objects that are inconsistent with the general behavior or model of the data. There are two main types of outlier detection - statistical distribution-based detection, which identifies outliers based on how far they are from the average statistical distribution, and distance-based detection, which finds outliers based on how far they are from other data objects. Outlier analysis is useful for tasks like fraud detection, where outliers may indicate fraudulent activity that is different from normal patterns in the data.
Anomaly detection techniques aim to identify outliers or anomalies in datasets. Statistical approaches assume a data distribution and detect anomalies that differ significantly. Distance-based approaches measure distances between data points to find outliers that are far from neighbors. Clustering approaches group normal data and detect outliers in small clusters or far from other clusters. Challenges include determining the number of outliers, handling unlabeled data, and scaling to high dimensions where distances become similar.
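The distance-based idea above can be sketched in one dimension: an object is flagged when too small a fraction of the other objects lies within a radius r of it. The parameter names and data below are illustrative, not from any particular algorithm:

```python
# Sketch of distance-based outlier detection: a point is an outlier
# if fewer than min_frac of the other points lie within distance r.

def distance_outliers(points, r, min_frac):
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and abs(p - q) <= r
        )
        if neighbors / (len(points) - 1) < min_frac:
            outliers.append(p)
    return outliers

pts = [1.0, 1.1, 0.9, 1.2, 5.0]
print(distance_outliers(pts, r=0.5, min_frac=0.5))  # → [5.0]
```

The naive double loop is quadratic in the number of points; real implementations use spatial indexes to prune the neighbor counting.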
This chapter discusses various methods for outlier detection in data mining, including statistical approaches that assume normal data fits a statistical model, proximity-based approaches that identify outliers as objects far from their nearest neighbors, and clustering-based approaches that find outliers as objects not belonging to large clusters. It also covers classification and semi-supervised approaches, detecting contextual and collective outliers, and challenges in high-dimensional outlier detection.
This document discusses different clustering methods in data mining. It begins by defining cluster analysis and its applications. It then categorizes major clustering methods into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based clustering methods. Finally, it provides details on partitioning methods like k-means and k-medoids clustering algorithms.
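As a minimal sketch of the partitioning idea behind k-means, here is a 1-D version of the assign-then-update loop. This is not the book's pseudocode; `kmeans_1d` and its defaults are assumptions for illustration:

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(values, k)
    for _ in range(iters):
        # assignment step: each value joins its nearest center
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[idx].append(v)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]
print(kmeans_1d(data, k=2))  # two centers, near 1.0 and 10.0
```

k-medoids follows the same loop but restricts each center to be an actual data object, which makes it less sensitive to outliers.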
The document discusses sequential pattern mining, which involves finding frequently occurring ordered sequences or subsequences in sequence databases. It covers key concepts like sequential patterns, sequence databases, support count, and subsequences. It also describes several algorithms for sequential pattern mining, including GSP (Generalized Sequential Patterns) which uses a candidate generation and test approach, SPADE which works on a vertical data format, and PrefixSpan which employs a prefix-projected sequential pattern growth approach without candidate generation.
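The support-counting step common to these algorithms can be sketched for the simple case of single-item elements. The toy database below is an illustrative assumption, not the chapter's notation:

```python
# Support counting for sequential patterns (single items per element).

def is_subsequence(pattern, sequence):
    # items of pattern must appear in sequence in order, gaps allowed
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    return sum(is_subsequence(pattern, seq) for seq in database)

db = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
]
print(support(["a", "c"], db))  # → 3: <a, c> occurs in every sequence
```

GSP repeatedly generates candidate patterns and counts their support this way; PrefixSpan avoids candidate generation by growing patterns inside projected databases.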
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
Chapter 10. Cluster Analysis: Basic Concepts and Methods (Subrata Kumer Paul)
This presentation gives a formal treatment of anomaly detection and outlier analysis, the types of anomalies and outliers, and different approaches to tackling anomaly detection problems.
Mining Frequent Patterns, Associations and Correlations (Justin Cletus)
This document summarizes Chapter 6 of the book "Data Mining: Concepts and Techniques" which discusses frequent pattern mining. It introduces basic concepts like frequent itemsets and association rules. It then describes several scalable algorithms for mining frequent itemsets, including Apriori, FP-Growth, and ECLAT. It also discusses optimizations to Apriori like partitioning the database and techniques to reduce the number of candidates and database scans.
The document introduces data preprocessing techniques for data mining. It discusses why data preprocessing is important due to real-world data often being dirty, incomplete, noisy, inconsistent or duplicate. It then describes common data types and quality issues like missing values, noise, outliers and duplicates. The major tasks of data preprocessing are outlined as data cleaning, integration, transformation and reduction. Specific techniques for handling missing values, noise, outliers and duplicates are also summarized.
This document discusses intrusion detection techniques. It describes misuse detection, which detects known attacks based on predefined rules, and anomaly detection, which detects deviations from normal behavior. Common misuse detection methods include rule-based, state transition analysis, and expert systems. Anomaly detection methods include statistical methods, machine learning, and data mining. The document also proposes ideas to improve intrusion detection, such as using association rule mining to detect patterns in audit data and discovering new patterns by analyzing existing rulesets.
This course is all about data mining and how we get optimized results. It covers all the types of data mining and how we use these techniques.
The ID3 algorithm generates a decision tree from training data using a top-down, greedy search. It calculates the entropy of attributes in the training data to determine which attribute best splits the data into pure subsets with maximum information gain. It then recursively builds the decision tree, using the selected attributes to split the data at each node until reaching leaf nodes containing only one class. The resulting decision tree can then classify new samples not in the training data.
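The entropy and information-gain computations that drive ID3's attribute choice can be sketched as follows. The row/label representation and the tiny weather-style example are assumptions for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class-label list, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    n = len(labels)
    gain = entropy(labels)
    # group the labels by the attribute's value, subtract weighted entropy
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    for subset in groups.values():
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
# the attribute splits the data into two pure subsets: gain = 1.0 bit
print(information_gain(rows, labels, 0))
```

ID3 computes this gain for every candidate attribute at a node, splits on the winner, and recurses on each subset.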
This document discusses various machine learning techniques for classification and prediction. It covers decision tree induction, tree pruning, Bayesian classification, Bayesian belief networks, backpropagation, association rule mining, and ensemble methods like bagging and boosting. Classification involves predicting categorical labels while prediction predicts continuous values. Key steps for preparing data include cleaning, transformation, and comparing different methods based on accuracy, speed, robustness, scalability, and interpretability.
This document summarizes a machine learning workshop on feature selection. It discusses typical feature selection methods like single feature evaluation using metrics like mutual information and Gini indexing. It also covers subset selection techniques like sequential forward selection and sequential backward selection. Examples are provided showing how feature selection improves performance for logistic regression on large datasets with more features than samples. The document outlines the workshop agenda and provides details on when and why feature selection is important for machine learning models.
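Sequential forward selection, mentioned above, can be sketched generically: greedily add whichever feature most improves a caller-supplied score. The toy additive score below is an assumption that ignores feature interactions, which real selection must handle:

```python
def forward_select(features, score, k):
    # greedy sequential forward selection against a score() callback
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy score: each feature has a fixed utility, no interactions
utility = {"f1": 0.2, "f2": 0.9, "f3": 0.5}
score = lambda feats: sum(utility[f] for f in feats)
print(forward_select(["f1", "f2", "f3"], score, k=2))  # → ['f2', 'f3']
```

Sequential backward selection is the mirror image: start from the full set and greedily drop the feature whose removal hurts the score least.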
The document provides an overview of neural networks for data mining. It discusses how neural networks can be used for classification tasks in data mining. It describes the structure of a multi-layer feedforward neural network and the backpropagation algorithm used for training neural networks. The document also discusses techniques like neural network pruning and rule extraction that can optimize neural network performance and interpretability.
Introduction
What is ML, DL, AI?
Decision Tree
Definition
Why Decision Tree?
Basic Terminology
Challenges
Random Forest
Definition
Why Random Forest
How does it work?
Advantages & Disadvantages
Definition: According to Arthur Samuel (1959), “Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed”.
Machine learning is the study and design of algorithms that can learn by processing input data (learning samples).
The most widely used definition of machine learning is that of Carnegie Mellon University Professor Tom Mitchell: “A computer program is said to learn from experience ‘E’, with respect to some class of tasks ‘T’ and performance measure ‘P’ if its performance at tasks in ‘T’ as measured by ‘P’ improves with experience ‘E’”.
Data mining involves multiple steps in the knowledge discovery process including data cleaning, integration, selection, transformation, mining, and pattern evaluation. It has various functionalities including descriptive mining to characterize data, predictive mining for inference, and different mining techniques like classification, association analysis, clustering, and outlier analysis.
The document defines data mining as extracting useful information from large datasets. It discusses two main types of data mining tasks: descriptive tasks like frequent pattern mining and classification/prediction tasks like decision trees. Several data mining techniques are covered, including association, classification, clustering, prediction, sequential patterns, and decision trees. Real-world applications of data mining are also outlined, such as market basket analysis, fraud detection, healthcare, education, and CRM.
This document provides an overview of outlier detection. It defines outliers as observations that deviate significantly from other observations. There are two types of outliers: univariate outliers found in a single feature and multivariate outliers found in multiple features. Common causes of outliers include data entry errors, measurement errors, experimental errors, intentional outliers, data processing errors, sampling errors, and natural outliers. Methods for detecting outliers include z-score analysis, statistical modeling, linear regression models, proximity based models, information theory models, and high dimensional detection methods.
This chapter discusses outlier analysis and various methods for outlier detection. It defines outliers as data objects that differ significantly from normal data. Several types of outliers are described, including global outliers that differ from all other data, contextual outliers that differ based on selected context attributes, and collective outliers where a group of objects collectively differ. Statistical, proximity-based, and clustering-based methods are some common approaches for outlier detection discussed in the chapter. Statistical approaches assume data follows a stochastic model, while proximity-based methods use distance measures and density-based methods to identify outliers. Clustering-based methods identify outliers as objects not belonging to large, dense clusters of normal data. Both supervised and unsupervised learning techniques can be applied to outlier detection.
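A minimal sketch of the clustering-based idea, assuming cluster centers have already been found: flag points whose distance to the nearest center is large relative to the average such distance. The cutoff rule and all names here are illustrative assumptions:

```python
# Clustering-based outlier sketch: points far from every cluster
# center, relative to the average nearest-center distance, are flagged.

def cluster_outliers(points, centers, factor=3.0):
    # distance of each point to its nearest center
    dists = [min(abs(p - c) for c in centers) for p in points]
    cutoff = factor * (sum(dists) / len(dists))
    return [p for p, d in zip(points, dists) if d > cutoff]

pts = [1.0, 1.2, 0.9, 10.0, 10.1, 9.9, 25.0]
print(cluster_outliers(pts, centers=[1.0, 10.0]))  # → [25.0]
```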
This document provides an overview of outlier detection techniques. It begins by defining outliers as data points that are considerably different from the majority of the data set. Several approaches to outlier detection are then discussed, including statistical-based methods using distributions and depth, deviation-based methods like sequential exceptions and OLAP data cubes, and distance-based methods using indexes and nearest neighbors. Specific algorithms are explained for each approach.
Data Mining Basics and Complete Description (Sulman Ahmed)
This document discusses data mining and provides examples of its applications. It begins by explaining why data is mined from both commercial and scientific viewpoints in order to discover useful patterns and information. It then discusses some of the challenges of data mining, such as dealing with large datasets, high dimensionality, complex data types, and distributed data sources. The document outlines common data mining tasks like classification, clustering, association rule mining, and regression. It provides real-world examples of how these techniques are used for applications like fraud detection, customer profiling, and scientific discovery.
Anomaly detection (or outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and process monitoring in various domains including energy, healthcare and finance.
In this workshop, we will discuss the core techniques in anomaly detection and discuss advances in Deep Learning in this field.
Through case studies, we will discuss how anomaly detection techniques could be applied to various business problems. We will also demonstrate examples using R, Python, Keras and Tensorflow applications to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
What you will learn:
Anomaly Detection: An introduction
Graphical and Exploratory analysis techniques
Statistical techniques in Anomaly Detection
Machine learning methods for Outlier analysis
Evaluating performance in Anomaly detection techniques
Detecting anomalies in time series data
Case study 1: Anomalies in Freddie Mac mortgage data
Case study 2: Auto-encoder based Anomaly Detection for Credit risk with Keras and Tensorflow
This document provides an introduction to unsupervised learning and clustering algorithms. It discusses how unsupervised learning is used to find patterns in unlabeled data. Clustering algorithms are introduced as a common unsupervised learning technique that groups similar data points together. Specific clustering algorithms covered include k-means, k-medoids, hierarchical clustering, density-based clustering, and grid-based clustering. The document also compares the k-means and k-medoids partitioning clustering algorithms.
Five Things I Learned While Building Anomaly Detection Tools (Toufic Boubez)
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
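A simple non-parametric alternative to Gaussian thresholds is to compare each point against a rolling median and median absolute deviation (MAD). This sketch and its window/threshold defaults are illustrative, not the speaker's algorithm:

```python
# Non-parametric sketch: flag points that deviate from the rolling
# median of the preceding window by more than k times the rolling MAD.

def rolling_mad_outliers(series, window=5, k=3.0):
    flagged = []
    for i in range(window, len(series)):
        recent = sorted(series[i - window:i])
        median = recent[window // 2]
        # median absolute deviation of the window
        mad = sorted(abs(v - median) for v in recent)[window // 2]
        if mad > 0 and abs(series[i] - median) > k * mad:
            flagged.append(i)
    return flagged

data = [10, 12, 11, 13, 10, 12, 11, 50, 12, 11]
print(rolling_mad_outliers(data))  # → [7], the index of the spike
```

Because the median and MAD are robust statistics, the spike itself does not distort the baseline the way it would distort a rolling mean and standard deviation.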
This document discusses outlier detection, including types of outliers (global, contextual, collective), challenges of outlier detection (modeling normal vs abnormal data, application dependence, handling noise), and methods for outlier detection (supervised, unsupervised, proximity-based, clustering-based). Global outliers significantly deviate from the overall data set. Contextual outliers deviate based on specific contextual attributes. Collective outliers involve a subset of data objects that together deviate from the data set, even if individually they are not outliers.
Outlier Detection Using Unsupervised Learning on High Dimensional Data (IJERA Editor)
Outliers in data mining can be detected using semi-supervised and unsupervised methods. Outlier detection in high-dimensional data faces various challenges arising from the curse of dimensionality: due to distance concentration, distances between points become uninformative in high-dimensional data. Distance-based techniques are used to detect outliers and label the corresponding points. To detect outliers effectively in high-dimensional data, we use unsupervised learning methods such as IQR and KNN with AntiHub.
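The IQR rule mentioned above can be sketched as follows; the quartile method and the conventional 1.5 factor are assumptions for illustration:

```python
import statistics

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.

def iqr_outliers(values, factor=1.5):
    # quartiles via linear interpolation (statistics.quantiles default)
    q1, _, q3 = statistics.quantiles(sorted(values), n=4)
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([1, 2, 2, 3, 3, 3, 4, 4, 5, 40]))  # → [40]
```

Unlike the z-score, the IQR cutoffs are built from quartiles, so a single extreme value barely moves them.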
The document summarizes an anomaly detection survey paper. It discusses different aspects of anomaly detection problems including the nature of input data, type of anomalies, availability of data labels, and output types. It also describes several anomaly detection techniques such as classification-based, nearest neighbor-based, clustering-based, statistical-based, and spectral-based methods. For each technique, it provides the basic idea, categories, examples, advantages, and disadvantages.
This presentation will present topics such as "What is Anomaly Detection? What are the different types of Data that may be used? What are the popular techniques may be used to identify anomalies. What are the best practices in anomaly detection? What is the Value of Anomaly Detection?
Pattern recognition at scale anomaly detection in banking on stream dataNUS-ISS
This document discusses anomaly detection techniques. It defines anomalies as observations that do not conform to expected patterns in a dataset. There are three main types of anomalies: point anomalies involving individual data instances, contextual anomalies where instances are anomalous in specific contexts, and collective anomalies involving anomalous groups of related instances. The document outlines unsupervised, supervised, and semi-supervised techniques for anomaly detection and provides examples of techniques like moving averages and autoencoders and how they can be used to model normal behavior and identify anomalous instances. It also discusses evaluating anomaly detection performance using metrics like confusion matrices.
This document presents an overview of outlier detection, including definitions of outliers, common applications of outlier detection, and an algorithm called the depth-based approach. The depth-based approach models outliers as data objects located on the outer layers of convex hulls of the data space. It works by organizing data objects into convex hull layers and identifying outliers as those on outer layers, based on the assumption that outliers are located at the border of the data space while normal data is at the center.
Anomaly detection (Unsupervised Learning) in Machine LearningKuppusamy P
Anomaly detection techniques are used to identify rare items, events or observations which raise suspicions by differing significantly from the majority of the data. There are various types of anomalies including point anomalies, contextual anomalies and collective anomalies. Anomaly detection algorithms typically build a model of normal behavior and then label new data as normal or anomalous based on how well it fits the model. Common techniques include clustering, statistical methods and distance-based approaches. Applications include fraud detection, system failure diagnosis and cybersecurity.
types of data in research, measurement level, sampling techniques, sampling t...SRM UNIVERSITY, SIKKIM
This document discusses various topics related to sampling and data collection, including:
1. It describes different types of data sources like primary data collected by the researcher and secondary data collected by others. It notes the advantages and disadvantages of each.
2. It discusses different levels of measurement for data like nominal, ordinal, interval, and ratio scales.
3. It covers sampling techniques including probability methods like simple random sampling, systematic sampling, stratified random sampling, and cluster sampling as well as non-probability methods like purposive sampling, quota sampling, snowball sampling, and convenience sampling.
4. It provides an overview of scale construction techniques for developing measurement scales.
This document discusses cluster analysis and clustering algorithms. It defines a cluster as a collection of similar data objects that are dissimilar from objects in other clusters. Unsupervised learning is used with no predefined classes. Popular clustering algorithms include k-means, hierarchical, density-based, and model-based approaches. Quality clustering produces high intra-class similarity and low inter-class similarity. Outlier detection finds dissimilar objects to identify anomalies.
Why are anomalies important? Because they tell us a different story from the norm. An anomaly or an event might signify a failing heart rate of a patient, a fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies or anomalous events.
In this talk, we will give an introduction to anomaly detection. Anomalies are rare events. As a result, standard accuracy measures do not apply. But then, how do we evaluate an Anomaly Detection (AD) method? If we want to compare two or more AD methods, what kind of simple tests can we do? What are the data repositories that are available for AD?
We will also discuss an ensemble method for AD. Constructing an AD ensemble is challenging because the class labels are not known. We will look at an unusual ally from psychometrics – Item Response Theory – to help us in this construction.
Graph Theory: Matrix representation of graphsAshikur Rahman
The document discusses different matrix representations of graphs:
1) Incidence matrices represent the relationship between vertices and edges, with each column having two 1s. Circuit matrices represent circuits, with each row as a circuit vector. Cut-set matrices represent edge sets whose removal disconnects the graph.
2) Path matrices represent paths between vertex pairs, with columns of all 0s/1s indicating edges not/in every path. Adjacency matrices directly encode vertex connectivity.
3) Exercises are provided to construct the incidence matrix, circuit matrix, fundamental circuit matrix, and cut-set matrix for a given graph.
This document provides guidance on writing a statement of purpose (SOP) for graduate school applications. It defines an SOP as a reflection of an applicant's personality and background that explains who they are, why they are applying, and what they want to achieve in the future. A good SOP allows admissions committees to learn about an applicant's experiences and skills, and can help overcome weaknesses. The document recommends including specifics about an applicant's area of interest, how their background prepares them, what goals they have for the program, and why the particular program is a good fit. It also provides formatting tips and suggests focusing on relevant strengths while maintaining flow and avoiding irrelevant or flattering information.
The document discusses cut-sets and cut-vertices in graphs. It defines a cut-set as a set of edges whose removal disconnects a connected graph. Cut-sets always separate a graph into two disconnected pieces and reduce the graph's rank by one. Theorems are presented regarding the relationship between cut-sets and spanning trees, including that every cut-set must contain at least one branch from every spanning tree. Fundamental cut-sets are also introduced with respect to spanning trees.
The document discusses properties and theorems related to trees in graph theory. Some key points include:
- A tree is a connected acyclic graph with n vertices that has n-1 edges.
- There is a one-to-one correspondence between labeled trees with n vertices and sequences of n-2 labels, as proven by Cayley's theorem.
- Every connected graph has at least one spanning tree, which is a subgraph that contains all vertices. Fundamental circuits are formed when a chord is added to a spanning tree.
- Cyclic interchange can be used to generate all possible spanning trees by adding and removing edges.
1. The document discusses different types of walks and paths in graphs, including closed walks, open walks, paths, and circuits.
2. It also covers Euler graphs and defines an Euler line as a closed walk that goes through every edge exactly once. It presents the theorem that a connected graph is an Euler graph if and only if all vertices have even degree.
3. The document discusses operations that can be performed on graphs, including union, intersection, and ring sum. It also covers decomposition of graphs into subgraphs.
This document discusses cybercrimes and cybercriminals. It defines cybercrime as a computer-oriented crime that threatens privacy, security and reliability in the virtual world. Some common cybercrimes include cyberbullying, cyber extortion, phishing, identity theft, and different types of online scams. The document also categorizes cybercriminals and hackers, distinguishing between non-professionals like script kiddies, social workers like hacktivists, professionals like white hat and red hat hackers, and criminals like cyber terrorists and black hat hackers. Insider threats from current and former employees are also addressed. Different hacking techniques like social engineering are outlined.
The document discusses online consumer behavior and e-commerce marketing strategies. Some key points:
- Around 75% of U.S. households now have broadband internet access, though growth is slowing. Intensity and scope of online usage is increasing.
- Common online marketing strategies discussed include search engine marketing, display ads, email marketing, affiliate marketing, social media marketing, and mobile marketing.
- Models of online consumer behavior are presented, outlining the 5 stages of an online purchasing decision. Trust and convenience are important factors for online purchases.
Signature verification Using SIFT FeaturesAshikur Rahman
This document presents research on offline signature verification using local keypoint features. It discusses existing challenges in offline signature verification like different signature orientations and image noise. The objective is to develop a robust method for offline signature verification that can handle noise, orientation variations, and different writing styles. The proposed method uses Harris corner detection to extract keypoints, and creates a 128-bin feature descriptor for each keypoint. Keypoint matching and classification using KNN is then used to verify signatures. Future work includes implementing the proposed method and improving its robustness to rotations, noise, and ink variations with minimal complexity.
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
Reimagining Your Library Space: How to Increase the Vibes in Your Library No ...Diana Rendina
Librarians are leading the way in creating future-ready citizens – now we need to update our spaces to match. In this session, attendees will get inspiration for transforming their library spaces. You’ll learn how to survey students and patrons, create a focus group, and use design thinking to brainstorm ideas for your space. We’ll discuss budget friendly ways to change your space as well as how to find funding. No matter where you’re at, you’ll find ideas for reimagining your space in this session.
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
Leveraging Generative AI to Drive Nonprofit InnovationTechSoup
In this webinar, participants learned how to utilize Generative AI to streamline operations and elevate member engagement. Amazon Web Service experts provided a customer specific use cases and dived into low/no-code tools that are quick and easy to deploy through Amazon Web Service (AWS.)
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
Outlier analysis,Chapter-12, Data Mining: Concepts and Techniques
1. Outlier Analysis
Based on- Chapter-12
Data Mining: Concepts and Techniques
Han, Kamber & Pei
A.B.M. Ashikur Rahman
Asst. Professor,
Dept. of CSE, IUT
2. What Are Outliers?
• Outlier: A data object that deviates significantly from the normal objects as if it were generated by a
different mechanism
• Ex.: Unusual credit card purchase; sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from the noise data
• Noise is random error or variance in a measured variable
• Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: at an early stage a novel pattern is reported as an outlier, but it is later merged into the model of normal data
• Applications:
• Credit card fraud detection
• Telecom fraud detection
• Customer segmentation
• Medical analysis
4. Types of Outliers (I)
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
• Object is Og if it significantly deviates from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
• Object is Oc if it deviates significantly based on a selected context
• Ex. 80° F in Urbana: outlier? (depends on whether it is summer or winter)
• Attributes of the data objects should be divided into two groups
• Contextual attributes: define the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• Can be viewed as a generalization of local outliers, whose density significantly deviates from the local area around them
• Issue: How to define or formulate a meaningful context?
[Figure: a global outlier]
5. Types of Outliers (II)
• Collective Outliers
• A subset of data objects collectively deviate significantly from the whole data set, even if the individual data objects may not be outliers
• Applications: e.g., intrusion detection:
• When a number of computers keep sending denial-of-service packets to each other
[Figure: a collective outlier]
Detection of collective outliers:
Consider not only the behavior of individual objects, but also that of groups of objects
Need background knowledge of the relationship among data objects, such as a distance or similarity measure on objects
A data set may have multiple types of outliers
One object may belong to more than one type of outlier
6. Challenges of Outlier Detection
Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in an application
The border between normal and outlier objects is often a gray area
Application-specific outlier detection
Choice of distance measure among objects and the model of relationship among objects are often
application-dependent
E.g., in clinical data a small deviation could be an outlier, while in marketing analysis larger fluctuations are considered normal
Handling noise in outlier detection
Noise may distort the normal objects and blur the distinction between normal objects and outliers.
It may help hide outliers and reduce the effectiveness of outlier detection
Understandability
Understand why these are outliers: Justification of the detection
Specify the degree of an outlier: the unlikelihood of the object being generated by a normal
mechanism
7. Outlier Detection I: Supervised Methods
• Two ways to categorize outlier detection methods:
• Based on whether user-labeled examples of outliers can be obtained:
• Supervised, semi-supervised vs. unsupervised methods
• Based on assumptions about normal data and outliers:
• Statistical, proximity-based, and clustering-based methods
• Outlier Detection I: Supervised Methods
• Modeling outlier detection as a classification problem
• Samples examined by domain experts used for training & testing
• Methods for Learning a classifier for outlier detection effectively:
• Model normal objects & report those not matching the model as outliers, or
• Model outliers and treat those not matching the model as normal
• Challenges
• Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some artificial
outliers
• Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not
mislabeling normal objects as outliers)
8. Outlier Detection II: Unsupervised Methods
• Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features
• An outlier is expected to be far away from any groups of normal objects
• Weakness: Cannot detect collective outlier effectively
• Normal objects may not share any strong patterns, but the collective outliers may share high
similarity in a small area
• Ex. In some intrusion or virus detection, normal activities are diverse
• Unsupervised methods may have a high false positive rate but still miss many real outliers.
• Supervised methods can be more effective, e.g., identify attacking some key resources
• Many clustering methods can be adapted for unsupervised methods
• Find clusters, then outliers (objects not belonging to any cluster)
• Problem 1: Hard to distinguish noise from outliers
• Problem 2: Costly. Non-target (normal) data must be processed before target (outlier) data
• Newer methods: tackle outliers directly
9. Outlier Detection III: Semi-Supervised Methods
• Situation: In many applications the amount of labeled data is often small: labels could be on outliers only, normal objects only, or both
• Semi-supervised outlier detection: Regarded as applications of semi-supervised learning
• If some labeled normal objects are available
• Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
• Those not fitting the model of normal objects are detected as outliers
• If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well
• To improve the quality of outlier detection, one can get help from models for normal objects learned
from unsupervised methods
10. Outlier Detection (1): Statistical Methods
• Statistical methods (also known as model-based methods) assume that the normal data follow some
statistical model (a stochastic model)
• The data not following the model are outliers.
The effectiveness of statistical methods highly depends on whether the assumed statistical model holds in the real data
A rich variety of alternative statistical models can be used, e.g., parametric vs. non-parametric
Example (right figure): First use a Gaussian distribution to model the normal data
For each object y in region R, estimate gD(y), the probability that y fits the Gaussian distribution
If gD(y) is very low, y is unlikely to have been generated by the Gaussian model, and is thus an outlier
11. Statistical Approaches
• Statistical approaches assume that the objects in a data set are generated by a stochastic process (a
generative model)
• Idea: learn a generative model fitting the given data set, and then identify the objects in low probability
regions of the model as outliers
• Methods are divided into two categories: parametric vs. non-parametric
• Parametric method
• Assumes that the normal data is generated by a parametric distribution with parameter θ
• The probability density function of the parametric distribution f(x, θ) gives the probability that object
x is generated by the distribution
• The smaller this value, the more likely x is an outlier
• Non-parametric method
• Does not assume an a priori statistical model; determines the model from the input data
• Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
• Examples: histogram and kernel density estimation
12. Parametric Methods I:
Detecting Univariate Outliers Based on the Normal Distribution
• Univariate data: A data set involving only one attribute or variable
• Often assume that data are generated from a normal distribution, learn the parameters from the input data, and
identify the points with low probability as outliers
• Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
• Use the maximum likelihood method to estimate μ and σ
Taking derivatives with respect to μ and σ², we derive the following maximum likelihood estimates:
μ̂ = (1/n) Σᵢ xᵢ   σ̂² = (1/n) Σᵢ (xᵢ − μ̂)²
For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.51
Then (24 − 28.61) / 1.51 = −3.04 < −3, so 24 is an outlier, since it lies more than 3σ̂ below the mean (under the normal assumption, the region μ ± 3σ contains 99.7% of the data)
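The computation above can be sketched in plain Python (a minimal illustration, not from the slides; the function names are invented, and a direct computation rounds σ̂ slightly differently from the slide's 1.51):

```python
import math

def mle_gaussian(data):
    """Maximum likelihood estimates of mean and std (population formulas)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)

def z_scores(data):
    """Standardize each value by the MLE mean and std."""
    mu, sigma = mle_gaussian(data)
    return [(x - mu) / sigma for x in data]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
zs = z_scores(temps)
# 24.0 is the most extreme value, roughly 3 standard deviations below the mean
```

Values whose absolute z-score exceeds 3 would then be flagged under the 3σ rule.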
13. Non-Parametric Methods: Detection Using Histogram
• The model of normal data is learned from the input data without any a priori
structure.
• Often makes fewer assumptions about the data, and thus can be applicable in
more scenarios
• Outlier detection using histogram:
Figure shows the histogram of purchase amounts in transactions
A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
Problem: Hard to choose an appropriate bin size for histogram
Too small a bin size → normal objects fall into empty/rare bins: false positives
Too big a bin size → outliers fall into frequent bins: false negatives
Solution: Adopt kernel density estimation to estimate the probability density distribution of the data. If
the estimated density function is high, the object is likely normal. Otherwise, it is likely an outlier.
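A histogram-based detector of this kind might be sketched as follows (the purchase amounts, bin width, and frequency threshold are all invented for illustration):

```python
from collections import Counter

def histogram_outliers(data, bin_width, min_frac):
    """Flag points whose bin holds less than min_frac of the data."""
    bins = Counter(int(x // bin_width) for x in data)
    n = len(data)
    return [x for x in data if bins[int(x // bin_width)] / n < min_frac]

# 20 ordinary purchase amounts plus one extreme transaction
amounts = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70,
           75, 80, 85, 90, 60, 55, 45, 35, 30, 50, 7500]
print(histogram_outliers(amounts, bin_width=100, min_frac=0.05))  # [7500]
```

Choosing `bin_width` is exactly the bin-size problem described above; kernel density estimation smooths this choice away.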
14. Outlier Detection (2): Proximity-Based Methods
• An object is an outlier if its nearest neighbors are far away, i.e., the proximity of the object significantly deviates from the proximity of most of the other objects in the same data set
The effectiveness of proximity-based methods highly relies on the proximity measure.
In some applications, proximity or distance measures cannot be obtained easily.
Often have difficulty finding a group of outliers that stay close to each other
Two major types of proximity-based outlier detection
Distance-based vs. density-based
Example (right figure): Model the proximity of an object using its 3 nearest
neighbors
Objects in region R are substantially different from other objects in the
data set.
Thus the objects in R are outliers
15. Distance-Based Outlier Detection
• For each object o, examine the # of other objects in the r-neighborhood of o, where r is a user-specified distance threshold
• An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o
• An object o is a DB(r, π) outlier if ‖{o′ | dist(o, o′) ≤ r}‖ / ‖D‖ ≤ π
• Equivalently, one can check the distance between o and its k-th nearest neighbor oₖ, where k = ⌈π‖D‖⌉. o is an outlier if dist(o, oₖ) > r
• Efficient computation: Nested loop algorithm
• For any object oi, calculate its distance from other objects, and count the # of other objects in the r-neighborhood.
• If π∙n other objects are within r distance, terminate the inner loop
• Otherwise, oi is a DB(r, π) outlier
• Efficiency: Actually the CPU time is not O(n²) but linear in the data set size, since for most non-outlier objects the inner loop terminates early
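The nested-loop algorithm can be sketched in plain Python (Euclidean distance; the toy points and the choices r = 2, π = 0.5 are illustrative only):

```python
import math

def db_outliers(data, r, pi):
    """Return the DB(r, pi) outliers using the nested-loop algorithm."""
    n = len(data)
    threshold = pi * n  # enough neighbors within r means o is not an outlier
    outliers = []
    for i, o in enumerate(data):
        count = 0
        for j, other in enumerate(data):
            if j != i and math.dist(o, other) <= r:
                count += 1
                if count >= threshold:  # early termination of the inner loop
                    break
        else:
            # inner loop ran to completion: too few neighbors within r
            outliers.append(o)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(points, r=2.0, pi=0.5))  # [(10, 10)]
```

The early break in the inner loop is what makes the running time close to linear for non-outlier objects.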
16. Distance-Based Outlier Detection: A Grid-Based Method
• Why is efficiency still a concern? When the complete set of objects cannot be held in main memory, there is the cost of I/O swapping
• The major cost: (1) each object is tested against the whole data set; why not only against its close neighbors? (2) objects are checked one by one; why not group by group?
• Grid-based method (CELL): Data space is partitioned into a multi-D grid. Each cell is a hyper cube with
diagonal length r/2
Pruning using the level-1 & level-2 cell properties:
For any possible point x in cell C and any possible point y in a level-1 cell, dist(x, y) ≤ r
For any possible point x in cell C and any point y beyond the level-2 cells, dist(x, y) > r
Thus we only need to check the objects that cannot be pruned, and even for such an object o, only need
to compute the distance between o and the objects in the level-2 cells (since beyond level-2, the distance
from o is more than r)
17. Example
Red: a certain cell C
Yellow: its level-1 neighbor cells
Blue: its level-2 neighbor cells
Notes:
The maximum distance between a point in the red cell and a point in its level-1 neighbor cells is at most r
The minimum distance between a point in the red cell and a point outside its level-2 neighbor cells is at least r
19. Density-Based Outlier Detection
• Local outliers: outliers relative to their local neighborhoods, instead of to the global data distribution
• In the figure, o1 and o2 are local outliers to C1, and o3 is a global outlier, but o4 is not an outlier. However, a global proximity-based method cannot find o1 and o2 as outliers (e.g., compared with o4).
Intuition (density-based outlier detection): The density around an outlier object is significantly different
from the density around its neighbors
Method: Use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier
k-distance of an object o, distk(o): distance between o and its k-th NN
k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
Nk(o) could be bigger than k since multiple objects may have identical distance to o
20. Local Outlier Factor: LOF
• Reachability distance from o′ to o: reachdist_k(o ← o′) = max{dist_k(o), dist(o, o′)}
• where k is a user-specified parameter
• Local reachability density of o: lrd_k(o) = ‖N_k(o)‖ / Σ_{o′ ∈ N_k(o)} reachdist_k(o′ ← o)
LOF (Local Outlier Factor) of an object o is the average of the ratios of the local reachability densities of o's k-nearest neighbors to that of o:
LOF_k(o) = [ Σ_{o′ ∈ N_k(o)} lrd_k(o′) / lrd_k(o) ] / ‖N_k(o)‖
The lower the local reachability density of o, and the higher the local reachability density of the
kNN of o, the higher LOF
This captures a local outlier whose local density is relatively low compared to the local densities of its kNN
21. LOF Example
Step 1: calculate all the distances between each pair of data points
Step 2: calculate all the dist2(o), i.e., the k-distances with k = 2
Step 3: calculate all the Nk(o)
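The steps above can be assembled into a brute-force LOF sketch (a minimal O(n²) illustration with an invented toy data set; it assumes distinct points so no local reachability density becomes infinite):

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for each point (brute force, O(n^2))."""
    n = len(points)
    # Step 1: all pairwise distances
    d = [[math.dist(p, q) for q in points] for p in points]

    def kdist(i):
        # Step 2: k-distance of point i, the distance to its k-th nearest neighbor
        return sorted(d[i][j] for j in range(n) if j != i)[k - 1]

    def neighbors(i):
        # Step 3: k-distance neighborhood; may hold more than k points on ties
        return [j for j in range(n) if j != i and d[i][j] <= kdist(i)]

    def lrd(i):
        # local reachability density of point i
        reach = sum(max(kdist(j), d[i][j]) for j in neighbors(i))
        return len(neighbors(i)) / reach

    # LOF: average ratio of neighbors' lrd to the point's own lrd
    return [sum(lrd(j) for j in neighbors(i)) / (len(neighbors(i)) * lrd(i))
            for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(pts, k=2)
# the isolated point (5, 5) gets a LOF well above 1; the cluster points stay near 1
```

Scores close to 1 indicate density comparable to the neighbors; substantially larger scores indicate local outliers.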
25. Outlier Detection (3): Clustering-Based Methods
• Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or
do not belong to any clusters
Since there are many clustering methods, there are many clustering-based outlier detection methods as
well
Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale up well for large data sets
Example (right figure): two clusters
All points not in R form a large cluster
The two points in R form a tiny cluster, thus are outliers
26. Clustering-Based Outlier Detection (1 & 2):
• An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster
Case I: Not belong to any cluster
Identify animals not part of a flock: Using a density-based clustering method
such as DBSCAN
Case 2: Far from its closest cluster
Using k-means, partition the data points into clusters
For each object o, assign an outlier score based on its distance from its closest
center co
If dist(o, co)/avg_dist(co) is large, where avg_dist(co) is the average distance from
co’s members to co, then o is likely an outlier
Ex. Intrusion detection: Consider the similarity between data points and the clusters
in a training data set
Use a training set to find patterns of “normal” data, e.g., frequent itemsets in each segment, and
cluster similar connections into groups
Compare new data points with the clusters mined—Outliers are possible attacks
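Case 2 can be sketched as follows, assuming the cluster centers are already available (e.g., from a k-means run); the score is the ratio dist(o, co)/avg_dist(co) from above:

```python
import math

def center_distance_scores(points, centers):
    """Score each point by dist(o, co) / avg_dist(co), where co is the
    closest center and avg_dist(co) is the average distance of the
    points assigned to co."""
    # assign each point to its closest center
    closest = {p: min(centers, key=lambda c: math.dist(p, c)) for p in points}
    # average distance to the center within each cluster
    avg = {}
    for c in centers:
        members = [p for p in points if closest[p] == c]
        avg[c] = sum(math.dist(p, c) for p in members) / len(members)
    return {p: math.dist(p, closest[p]) / avg[closest[p]] for p in points}
```

A score well above 1 means the object is unusually far from its own cluster's center.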
27. • FindCBLOF: Detect outliers in small clusters
• Find clusters, and sort them in decreasing size
• To each data point, assign a cluster-based local outlier factor (CBLOF):
• If obj p belongs to a large cluster, CBLOF = cluster size × similarity
between p and its cluster
• If p belongs to a small one, CBLOF = cluster size × similarity between p
and the closest large cluster
Clustering-Based Outlier Detection (3):
Detecting Outliers in Small Clusters
Ex. In the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is
small. For any point in C3, its closest large cluster is C2 but its similarity to C2 is low; moreover, |C3| = 3
is small
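A minimal sketch of the FindCBLOF idea; the 90% size cutoff for “large” clusters and the inverse-distance similarity are illustrative choices, not fixed by the method:

```python
import math

def cblof_scores(clusters, large_frac=0.9):
    """clusters: list of point-lists from any clustering step.
    Returns {point: CBLOF}; low scores flag outliers."""
    def centroid(c):
        return tuple(sum(xs) / len(c) for xs in zip(*c))

    def sim(p, c):
        # illustrative similarity: inverse distance to the centroid
        return 1.0 / (1.0 + math.dist(p, centroid(c)))

    clusters = sorted(clusters, key=len, reverse=True)  # decreasing size
    n = sum(len(c) for c in clusters)
    large, covered = [], 0
    for c in clusters:  # biggest clusters covering large_frac of the data
        if covered >= large_frac * n:
            break
        large.append(c)
        covered += len(c)

    scores = {}
    for c in clusters:
        for p in c:
            if c in large:
                scores[p] = len(c) * sim(p, c)
            else:  # small cluster: similarity to the closest large cluster
                scores[p] = len(c) * max(sim(p, lc) for lc in large)
    return scores
```

A point like o above, sitting in a tiny cluster far from every large cluster, receives a very low CBLOF.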
28. Clustering-Based Method: Strength and Weakness
• Strength
• Detect outliers without requiring any labeled data
• Work for many types of data
• Clusters can be regarded as summaries of the data
• Once the clusters are obtained, one need only compare any object against the clusters to determine
whether it is an outlier (fast)
• Weakness
• Effectiveness depends highly on the clustering method used—they may not be optimized for
outlier detection
• High computational cost: Need to first find clusters
• A method to reduce the cost: Fixed-width clustering
• A point is assigned to a cluster if the center of the cluster is within a pre-defined distance
threshold from the point
• If a point cannot be assigned to any existing cluster, a new cluster is created and the distance
threshold may be learned from the training data under certain conditions
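The fixed-width idea in a few lines: a single pass with first-fit assignment, where the width would be tuned or learned from training data:

```python
import math

def fixed_width_clusters(points, width):
    """Assign each point to the first cluster whose center lies within
    `width`; otherwise open a new cluster centered at the point.
    Very small resulting clusters are outlier candidates."""
    centers, clusters = [], []
    for p in points:
        for c, members in zip(centers, clusters):
            if math.dist(p, c) <= width:
                members.append(p)
                break
        else:  # no existing cluster is close enough
            centers.append(p)
            clusters.append([p])
    return clusters
```

One pass over the data suffices, which is what makes this cheaper than a full clustering step.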
29. Classification-Based Method I: One-Class Model
• Idea: Train a classification model that can distinguish “normal” data
from outliers
• A brute-force approach: Consider a training set that contains samples
labeled as “normal” and others labeled as “outlier”
• But the training set is typically heavily biased: the # of “normal”
samples likely far exceeds the # of outlier samples
• Cannot detect unseen anomalies
One-class model: A classifier is built to describe only the normal class.
Learn the decision boundary of the normal class using classification methods such as SVM
Any samples that do not belong to the normal class (not within the decision boundary) are declared
as outliers
Advantage: can detect new outliers that may not appear close to any outlier objects in the training set
Extension: Normal objects may belong to multiple classes
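To illustrate the one-class idea without an SVM library, here is a stand-in that learns a simple per-feature Gaussian boundary from normal samples only; the boundary shape, not the principle, differs from a one-class SVM:

```python
class OneClassGaussian:
    """Describe only the normal class: learn each feature's mean and
    standard deviation from normal data, and declare any sample more
    than z standard deviations away on some feature an outlier."""
    def __init__(self, z=3.0):
        self.z = z

    def fit(self, normal):
        cols = list(zip(*normal))
        self.mean = [sum(c) / len(c) for c in cols]
        # guard against zero std on a constant feature
        self.std = [max(1e-12, (sum((x - m) ** 2 for x in c) / len(c)) ** 0.5)
                    for c, m in zip(cols, self.mean)]
        return self

    def is_outlier(self, p):
        return any(abs(x - m) / s > self.z
                   for x, m, s in zip(p, self.mean, self.std))
```

Because only the normal class is modeled, a sample unlike anything in the training set is still flagged, which is the advantage noted above.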
30. Classification-Based Method II: Semi-Supervised Learning
• Semi-supervised learning: Combining classification-based and clustering-based methods
• Method
• Using a clustering-based approach, find a large cluster, C, and a small
cluster, C1
• Since some objects in C carry the label “normal”, treat all objects in C as
normal
• Use the one-class model of this cluster to identify normal objects in outlier
detection
• Since some objects in cluster C1 carry the label “outlier”, declare all
objects in C1 as outliers
• Any object that does not fall into the model for C (such as object a in the figure) is considered
an outlier as well
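A toy rendering of the procedure above, with a distance-to-center test standing in for the one-class model of C; the width threshold is an assumption of the sketch:

```python
import math

def semi_supervised_outliers(c_large, c1_small, unlabeled, width):
    """c_large: cluster C with some 'normal' labels -> treat all as normal.
    c1_small: cluster C1 with some 'outlier' labels -> declare all outliers.
    unlabeled: remaining objects; an object is an outlier if it does not
    fit C's model (here: farther than `width` from C's center)."""
    center = tuple(sum(xs) / len(c_large) for xs in zip(*c_large))
    return list(c1_small) + [p for p in unlabeled
                             if math.dist(p, center) > width]
```

A few labels thus propagate to whole clusters, which is what makes the approach semi-supervised.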
Comments on classification-based outlier detection methods
Strength: Outlier detection is fast
Bottleneck: Quality depends heavily on the availability and quality of the training set, but it is often difficult
to obtain representative, high-quality training data
31. Mining Contextual Outliers I: Transform into Conventional Outlier Detection
If the contexts can be clearly identified, transform the problem into conventional outlier detection
• Identify the context of the object using the contextual attributes
• Calculate the outlier score for the object in the context using a conventional outlier detection method
Ex. Detect outlier customers in the context of customer groups
• Contextual attributes: age group, postal code
• Behavioral attributes: # of transactions/yr, annual total transaction amount
Steps:
• (1) locate the customer c’s context,
• (2) compare c with the other customers in the same group, and
• (3) use a conventional outlier detection method
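The three steps in a toy sketch: records are (context, behavior value) pairs, e.g. (age group, # of transactions/yr), with a within-context z-score as the conventional detector:

```python
from collections import defaultdict

def contextual_outliers(records, z=2.0):
    """records: (context, behavior_value) pairs. (1) group by context,
    (2) compare each object with its peers in the same context,
    (3) flag it with a conventional detector (here a z-score)."""
    groups = defaultdict(list)
    for ctx, val in records:
        groups[ctx].append(val)
    flagged = []
    for ctx, val in records:
        peers = groups[ctx]
        m = sum(peers) / len(peers)
        s = (sum((v - m) ** 2 for v in peers) / len(peers)) ** 0.5
        if s > 0 and abs(val - m) / s > z:
            flagged.append((ctx, val))
    return flagged
```

The same behavior value can be normal in one context and an outlier in another, which is the point of contextual detection.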
If the context contains very few customers, generalize contexts
• Ex. Learn a mixture model U on the contextual attributes, and another mixture model V of the data on
the behavior attributes
• Learn a mapping p(Vi|Uj): the probability that a data object o belonging to cluster Uj on the contextual
attributes is generated by cluster Vi on the behavior attributes
• Outlier score: S(o) = ΣUj p(o ∈ Uj) ΣVi p(o ∈ Vi) p(Vi|Uj), i.e., the likelihood that o’s behavior
attributes are generated by the clusters expected for o’s context; a low score flags o as a contextual outlier
32. Mining Contextual Outliers II:
Modeling Normal Behavior with Respect to Contexts
• In some applications, one cannot clearly partition the data into contexts
• Ex. if a customer suddenly purchased a product that is unrelated to those she recently browsed, it is
unclear how many products browsed earlier should be considered as the context
• Model the “normal” behavior with respect to contexts
• Using a training data set, train a model that predicts the expected behavior attribute values with respect
to the contextual attribute values
• An object is a contextual outlier if its behavior attribute values significantly deviate from the values
predicted by the model
• Using a prediction model that links the contexts and behavior, these methods avoid the explicit identification
of specific contexts
• Methods: A number of classification and prediction techniques can be used to build such models, such as
regression, Markov models, and finite state automata
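One of the named techniques, regression, in a minimal sketch: fit the expected behavior value from the contextual value and flag large residuals; the residual threshold is an assumption of the example:

```python
def fit_line(xs, ys):
    """Least-squares fit y = a*x + b: expected behavior given context x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def residual_outliers(xs, ys, threshold):
    """Contextual outliers: objects whose behavior value deviates from
    the model's prediction by more than `threshold`."""
    a, b = fit_line(xs, ys)
    return [(x, y) for x, y in zip(xs, ys) if abs(y - (a * x + b)) > threshold]
```

No explicit context partition is needed: the model itself links context to expected behavior.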