Chapter 10 Anomaly Detection


Chapter 10 Anomaly Detection sections (10.1 ~ 10.3).

Published in: Technology


  1. 1. Anomaly Detection (10.1 ~ 10.3)<br />Khalid Elshafie<br />Database / Bioinformatics Lab.<br />Chungbuk National University<br />
  2. 2. Contents<br />1. Introduction<br />2. Statistical Approach<br />3. Proximity-based Approach<br />
  3. 3. Introduction (1/4)<br />Anomaly detection finds objects that are different from most other objects.<br />Anomalous objects are often known as outliers.<br />On a scatter plot of the data, they lie far away from the other data points.<br />Also known as:<br />Deviation detection: anomalous objects have attribute values that deviate significantly from the expected or typical attribute values.<br />Exception mining: anomalies are exceptional in some sense.<br />
  4. 4. Introduction (2/4)<br />Applications<br />Fraud detection: the purchasing behavior of someone who steals a credit card is probably different from that of the original owner.<br />Intrusion detection: attacks on computer systems and computer networks.<br />Ecosystem disturbances: hurricanes, floods, heat waves, etc.<br />Medicine: unusual symptoms or test results may indicate a potential health problem.<br />……<br />
  5. 5. Introduction (3/4)<br />What causes anomalies?<br />Data from different sources:<br />Someone who commits credit card fraud belongs to a different class than people who use credit cards legitimately.<br />Such anomalies are often of considerable interest and are the focus of anomaly detection in the field of data mining.<br />Hawkins' definition of an outlier: an outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.<br />Natural variation:<br />Many data sets can be modeled by statistical distributions in which the probability of a data object decreases rapidly as the distance of the object from the center of the distribution increases.<br />Most objects are near a center (an average object), and the likelihood that an object differs significantly from this average is small.<br />Anomalies that represent extreme or unlikely variations are often interesting.<br />Data measurement and collection errors:<br />Errors in the data collection or measurement process are another source of anomalies.<br />The goal is to eliminate such anomalies, since they provide no interesting information and only reduce the quality of the data and the subsequent data analysis.<br />
  6. 6. Introduction (4/4)<br />Approaches to anomaly detection<br />Model-based techniques:<br />Build a model of the data.<br />Anomalies are objects that do not fit the model very well.<br />Proximity-based techniques:<br />Many of the techniques in this area are based on distances and are referred to as distance-based outlier detection techniques.<br />Anomalous objects are those that are distant from most of the other objects.<br />Density-based techniques:<br />Objects that are in regions of low density are relatively distant from their neighbors and can be considered anomalous.<br />
  7. 7. Statistical Approach (1/2)<br />Statistical approaches are model-based approaches.<br />A model is created for the data, and objects are evaluated with respect to how well they fit the model.<br />Most statistical approaches to outlier detection are based on building a probability distribution model and considering how likely objects are under that model.<br />Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to a probability distribution model of the data.<br />
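As a rough illustration of the probabilistic definition, the sketch below fits a single normal distribution to one attribute and flags values with a large absolute z-score. For a univariate normal model, a large |z| corresponds to a low density under the model. The function name `gaussian_outliers`, the threshold of 3, and the sample readings are assumptions made for this example and do not come from the slides.

```python
import statistics


def gaussian_outliers(values, z_threshold=3.0):
    """Return indices of values whose |z-score| exceeds z_threshold.

    Assumes the attribute is roughly normally distributed.
    z_threshold=3.0 is an illustrative choice, not a prescribed value.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates from the model
    return [i for i, x in enumerate(values) if abs(x - mean) / std > z_threshold]


# Hypothetical sensor readings with one extreme value at the end.
readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 10.0, 25.0]
flagged = gaussian_outliers(readings)
```

Note that the extreme value itself inflates the estimated mean and standard deviation, which is one reason such tests need a sufficient amount of data to be effective.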
  8. 8. Statistical Approach (2/2)<br />Strengths and weaknesses<br />Statistical approaches have a firm foundation and build on standard statistical techniques.<br />When there is sufficient knowledge of the data and of the type of test that should be applied, these tests can be very effective.<br />A wide variety of statistical outlier tests exist for single attributes, but fewer options are available for multivariate data.<br />They can perform poorly for high-dimensional data.<br />
  9. 9. Proximity-based Approach (1/4)<br />The basic notion of this approach is straightforward: an object is an anomaly if it is distant from most points.<br />More general and more easily applied than statistical approaches.<br />It is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution.<br />One of the simplest ways to measure whether an object is distant from most points is to use the distance to its k-th nearest neighbor.<br />The outlier score of an object is given by the distance to its k-th nearest neighbor.<br />The lowest value of the outlier score is 0.<br />The highest value is the maximum possible value of the distance function (usually infinity).<br />
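The k-th nearest neighbor outlier score described on this slide can be sketched as follows. This is a minimal brute-force version that assumes Euclidean distance; the function name `knn_outlier_scores` and the sample points are hypothetical and chosen only for illustration.

```python
import math


def knn_outlier_scores(points, k):
    """Outlier score of each point = distance to its k-th nearest neighbor.

    Brute force: every pair of points is compared, so the cost grows
    quadratically with the number of points.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points forming a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
```

With these sample points, the distant point (10, 10) receives by far the largest score, while the four corners of the square all score 1.0.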
  10. 10. Proximity-based Approach (2/4)<br />Approach:<br />Compute the distance between every pair of data points.<br />There are various ways to define outliers:<br />Data points for which there are fewer than p neighboring points within a distance D.<br />The top n data points whose distance to the k-th nearest neighbor is greatest.<br />The top n data points whose average distance to the k nearest neighbors is greatest.<br />
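The three outlier definitions listed on this slide can be compared side by side with a small sketch. It assumes Euclidean distance and reads the third definition as the mean distance to the k nearest neighbors. All function names, parameter names (`min_neighbors` for p, `radius` for D) and sample values are hypothetical.

```python
import math


def neighbor_distances(points):
    """For each point, the sorted distances to all other points."""
    return [
        sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def few_neighbors(points, min_neighbors, radius):
    """Definition 1: fewer than min_neighbors points within radius."""
    return [
        i for i, dists in enumerate(neighbor_distances(points))
        if sum(1 for d in dists if d <= radius) < min_neighbors
    ]


def top_n_by_kth_distance(points, k, n):
    """Definition 2: the n points with the greatest k-th neighbor distance."""
    scores = [dists[k - 1] for dists in neighbor_distances(points)]
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


def top_n_by_mean_knn_distance(points, k, n):
    """Definition 3: the n points with the greatest mean distance
    to their k nearest neighbors."""
    scores = [sum(dists[:k]) / k for dists in neighbor_distances(points)]
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
by_count = few_neighbors(sample, min_neighbors=2, radius=1.5)
by_kth = top_n_by_kth_distance(sample, k=2, n=1)
by_mean = top_n_by_mean_knn_distance(sample, k=2, n=1)
```

On this small sample all three definitions single out the same point, index 4. On real data they can disagree, which is part of why the choice of definition and parameters matters.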
  11. 11. Proximity-based Approach (3/4)<br />The shading of each point indicates its outlier score, computed with k = 5.<br />The outlier score can be highly sensitive to the value of k.<br />If k is too small (e.g., 1), then a small number of nearby outliers can cause a low outlier score.<br />If k is too large, then it is possible for all objects in a cluster that has fewer objects than k to become outliers.<br />
  12. 12. Proximity-based Approach (4/4)<br />Strengths and weaknesses<br />Simple scheme.<br />Proximity-based approaches typically take O(m^2) time, where m is the number of objects.<br />For large data sets this can be too expensive.<br />Sensitive to the choice of parameters.<br />Cannot handle data sets with regions of widely differing densities.<br />
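The sensitivity of the outlier score to k can be seen in a small experiment. The sketch below assumes Euclidean distance and uses made-up data: a 3 x 3 grid of points acting as the main cluster, plus a tight group of three points far away. The function name `kth_nn_distance` and the data layout are hypothetical.

```python
import math


def kth_nn_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    return [
        sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[k - 1]
        for i, p in enumerate(points)
    ]


# Main cluster: nine points on a unit grid (indices 0..8).
big_cluster = [(float(x), float(y)) for x in range(3) for y in range(3)]
# Three points close to each other but far from the main cluster (indices 9..11).
small_group = [(20.0, 20.0), (20.1, 20.0), (20.0, 20.1)]
data = big_cluster + small_group

scores_k1 = kth_nn_distance(data, 1)  # k too small
scores_k3 = kth_nn_distance(data, 3)  # k larger than the small group allows
```

With k = 1, each of the three distant points has another distant point right next to it, so their scores are lower than those of the main cluster. With k = 3, each of them has only two close neighbors, so the third neighbor lies in the main cluster and all three receive very high scores.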
  15. 15. Thank You!<br />