
# Outlier Analysis.pdf


Outlier detection using machine learning, deep learning, and statistical analysis.
The slides also cover time series analysis and include hands-on exercises with code and data for a 3-day course.



### Outlier Analysis.pdf

1. Contents
2. • (1) Central tendency: Ungrouped Data • Mode, Mean, Median • Percentile, Quantile/Quartile • (2) Variability: Ungrouped Data • Range & IQR (Interquartile Range) • MAD (Mean Absolute Deviation), Variance, Standard Deviation • Variance and standard deviation • Unbiased estimator • Z-score • (3) Measures of Shape • Moment • Skewness and Kurtosis • (4) Measures of Association • Correlation
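As a rough illustration of how the variability measures on this slide turn into outlier rules, the sketch below computes z-scores and the IQR (box-plot) fences with the Python standard library. The sample values, the function names, and the multiplier `k=1.5` are assumptions made for the example, not part of the course material.

```python
import statistics

def z_scores(values):
    """Z-score of every value, using the sample (n - 1) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: eight ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 50]
flagged = iqr_outliers(data)
z = z_scores(data)
print("IQR rule flags:", flagged)
print("z-score of the extreme value:", round(z[-1], 2))
```

With only nine observations the extreme value inflates the standard deviation so much that its z-score stays below the common cutoff of 3, while the quartile-based rule still flags it. This is one reason the two measures are usually looked at together.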
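3. • Basic concepts • Experiment, (elementary) event, and sample space, … • Conditional probability and Bayes' rule • Law of conditional probability: $P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(X)\,P(Y \mid X)}{P(Y)}$ • Test for independence: $P(X \mid Y) = P(X)$ and $P(Y \mid X) = P(Y)$ • Bayes' rule: $P(X_i \mid Y) = \frac{P(X_i)\,P(Y \mid X_i)}{P(X_1)\,P(Y \mid X_1) + P(X_2)\,P(Y \mid X_2) + \dots + P(X_n)\,P(Y \mid X_n)}$ • Odds • Random variables and probability distributions • Random variable = a variable that contains the outcomes of a chance experiment • Probability distributions • Discrete: binomial, Poisson • Continuous: uniform, normal, t, exponential, $\chi^2$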
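A small numeric sketch of Bayes' rule as stated on this slide. The function name and the detector numbers (1% anomalies, 95% hit rate, 5% false-alarm rate) are hypothetical and chosen only to make the arithmetic visible.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(X_i | Y) for every i, given P(X_i) and P(Y | X_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(Y), by the law of total probability
    return [j / evidence for j in joint]

# Hypothetical setting: X_1 = "anomalous", X_2 = "normal", Y = "detector fires".
priors = [0.01, 0.99]        # P(X_1), P(X_2)
likelihoods = [0.95, 0.05]   # P(Y | X_1), P(Y | X_2)
posterior = bayes_posterior(priors, likelihoods)
print("P(anomalous | detector fires) =", round(posterior[0], 3))
```

Because anomalies are rare, the posterior probability of an anomaly given an alarm stays far below the detector's hit rate. This ties in with the rarity and class-imbalance issues raised later in the deck.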
4. • Estimation • Normal distribution • t-distribution • Hypothesis testing • Hypothesis testing using the p-value • p-value = observed level of significance • It is the smallest value of α for which H0 can be rejected. • "H0 can be rejected only when α is at least as large as p"
5. • Variate(s) • Univariate • Bivariate • Covariance matrix and SSCP • Multivariate • Multivariate probability distributions • Multivariate normal distribution • Multivariate analysis techniques • PCA • Factor Analysis • MDS • ANOVA/ANCOVA/MANOVA/MANCOVA/… • Multivariate multiple regression (MVMLR)
6. • Key issues • Subjectivity, interestingness, and noise • Subjective judgement as to what constitutes a "sufficient" deviation • In real applications, the data may be embedded in a significant amount of noise • The noise problem • The representation problem • Normality vs. anomaly • Feature engineering • Outlier analysis and data models
7. • Algorithm taxonomy (1) • Outlier scores: quantify the level of "outlierness" • Binary labels: "outlier or not?" • The threshold is based on the statistical distribution of the scores. • Algorithm taxonomy (2)
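8. • Qualitative techniques • Expert opinion • Time series analysis • Focus: patterns and pattern changes • Historical data plays an important role • Causal models • Cause-and-effect relationships • Historical data plays an important role • Regression • Independent and dependent variables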
9. Univariate time series models • AR model • Autocorrelation • Stationarity and the ADF test • Differencing a time series • Lags in autocorrelation • Partial autocorrelation • AR model definition • AR estimation using the Yule-Walker equations • MA model • MA model definition • Fitting an MA model • Stationarity • Choosing between AR and MA models • Multistep forecasting through model retraining • Grid search for the optimal MA order
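To make the Yule-Walker item concrete, here is a minimal sketch for the AR(2) case, where the two Yule-Walker equations can be solved in closed form from the lag-1 and lag-2 autocorrelations. The simulated series, its coefficients (0.6 and -0.3), and all function names are assumptions for illustration. The course exercises may well use a library routine instead.

```python
import random

def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return ck / c0

def yule_walker_ar2(x):
    """Estimate (phi1, phi2) of an AR(2) model.

    Solves r1 = phi1 + phi2*r1 and r2 = phi1*r1 + phi2 for phi1, phi2.
    """
    r1, r2 = autocorrelation(x, 1), autocorrelation(x, 2)
    denom = 1 - r1 ** 2
    return r1 * (1 - r2) / denom, (r2 - r1 ** 2) / denom

def difference(x):
    """First difference, a common step towards stationarity."""
    return [b - a for a, b in zip(x, x[1:])]

# Simulate a stationary AR(2) series with hypothetical coefficients 0.6 and -0.3.
rng = random.Random(0)
series = [0.0, 0.0]
for _ in range(20000):
    series.append(0.6 * series[-1] - 0.3 * series[-2] + rng.gauss(0, 1))

phi1, phi2 = yule_walker_ar2(series)
print("estimated phi1, phi2:", round(phi1, 2), round(phi2, 2))
```

The estimates should land close to the coefficients used in the simulation. For a series with a trend, `difference` would typically be applied first, and the ADF test mentioned on the slide is the usual check for whether that is needed.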
10. • ARMA model • Model definition • Fitting an ARMA(1,1) model • Automated hyperparameter tuning • Grid search • Tuning to improve performance • ARIMA model • Model definition • SARIMA model • Model definition • Regular AR part $\phi_p$ • Seasonal AR part • Regular MA part $\theta_q$ • Seasonal MA part • Regular integration part; order d • Seasonal integration part; order D • Coefficient of seasonality s
11. • Multivariate time series models • SARIMAX model • Adds exogenous variables (X) to the SARIMA model • VAR model • Because a VAR model proposes one model for multiple target variables, it groups those variables into a vector • Estimating the VAR coefficients • VARMAX model • Adds an MA term to the VAR model and allows exogenous variables • V for vector, indicating that it is a multivariate model • AR for autoregression • MA for moving average • X for the use of exogenous variables (in addition to the endogenous variables)
12. • Linear Regression • kNN, decision trees/Random Forest • XGBoost and LightGBM
13. • RNN/LSTM • Predicting a sequence rather than a value • SimpleRNN • GRU, LSTM • Prophet model (Facebook) • An automated procedure for building forecasting models, developed by Facebook. • Possible inputs: • Seasonality of any regular order • Holidays • Additional regressors • Hyperparameters • Fourier order of the seasonality: a higher order means more flexibility. • changepoint_prior_scale controls the trend: the higher the value, the more flexible the trend. • holidays_prior_scale: the lower it is, the less important the holidays are for the model. • Prior scale for the seasonality • DeepAR model
14. • Euclidean distance • Manhattan distance • Minkowski distance • Cosine distance
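The four distances listed on this slide can be sketched in a few lines. The function names are chosen for the example. Euclidean and Manhattan distance are written as the p = 2 and p = 1 cases of the Minkowski distance, and cosine distance is taken as one minus the cosine similarity.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

def manhattan(a, b):
    return minkowski(a, b, 1)

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not on magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

print(euclidean([0, 0], [3, 4]))        # straight-line distance
print(manhattan([0, 0], [3, 4]))        # sum of coordinate differences
print(cosine_distance([1, 2], [2, 4]))  # parallel vectors: approximately 0
```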
15. • Distance distribution-based techniques • Model the entire data set as normally distributed about its mean, in the form of a multivariate Gaussian distribution. • Let $\bar{\mu}$ be the d-dimensional (row vector) mean of the data set, and $\Sigma$ be its $d \times d$ covariance matrix. • Then the probability distribution $f(\bar{X})$ for a d-dimensional (row vector) data point $\bar{X}$ is: $f(\bar{X}) = \frac{1}{\sqrt{|\Sigma|}\,(2\pi)^{d/2}} \exp\left(-\frac{1}{2}(\bar{X}-\bar{\mu})\,\Sigma^{-1}\,(\bar{X}-\bar{\mu})^{T}\right)$ • $|\Sigma|$ = determinant of the covariance matrix. • Exponent: (half) the squared Mahalanobis distance of the data point $\bar{X}$ to the centroid $\bar{\mu}$ of the data = outlier score
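The squared Mahalanobis distance used as the outlier score here can be sketched for two-dimensional data without any linear-algebra library, since a 2 x 2 covariance matrix can be inverted by hand. The data points and the function name are hypothetical. The last point breaks the correlation pattern even though neither of its coordinates is extreme on its own.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the data centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # |Sigma|
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) Sigma^{-1} (dx, dy)^T with the 2x2 inverse written out.
        scores.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return scores

# Hypothetical data: eight points close to the line y = x, plus one point off it.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.9), (2, 7)]
scores = mahalanobis_sq_2d(points)
worst = scores.index(max(scores))
print("highest outlier score:", points[worst])
```

A per-attribute check would not single out (2, 7), because 2 and 7 both lie inside the observed ranges. The covariance term is what exposes it.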
16. • Extreme-value analysis • Identifying extreme values • = Probabilistic tail inequalities • Markov inequality • Chebyshev inequality • … • Determine the statistical tails of the underlying distribution. • Univariate • Box plots • Extreme-value analysis in multivariate data • Depth-based methods: convex hull analysis • Deviation-based methods • Angle-based • Extreme-value analysis is usually required as a final step on these modeled deviations
17. • Overview • In probabilistic models, the "likelihood fit of a data point to a generative model is the outlier score". • Examples • GMM • EM • Pros and cons • Pros • Applicable in many settings (any data type or mixed data type), as long as an appropriate generative model is available for each mixture component. • Cons • Hard to apply when the distribution is difficult to specify. • As the number of model parameters increases, over-fitting becomes more common.
18. • General form • Convex non-linear programming: OLS • Model the data along lower-dimensional subspaces using linear correlations • Distance between the hyperplane and a data point → outlier score • PCA • Matrix factorization • Spectral models • Some variations of matrix decomposition (e.g., PCA) used on certain types of data, such as graphs and networks, are called spectral models. • They are commonly used for clustering graph data and for identifying anomalous changes in temporal sequences of graphs.
19. • Concepts • Clustering methods • Density-based methods • Nearest-neighbor methods
20. • Concept • Outliers increase the minimum code length (i.e., the minimum length of the summary) required to describe a data set, because they represent deviations from natural attempts to summarize the data. • Example (1) • Example (2): multidimensional data sets • Probabilistic models: describe a data set in terms of generative model parameters, such as a mixture of Gaussian distributions or a mixture of exponential power distributions. • Clustering / density-based summarization: describes a data set in terms of cluster descriptions, histograms, or other summarized representations, along with maximum error tolerances. • PCA / spectral models: describe the data in terms of lower-dimensional subspaces of projection of multi-dimensional data or a latent representation of a network. • FP (frequent pattern) mining: describes the data in terms of an underlying code book of frequent patterns.
21. • High dimensionality • Subspace outlier detection • Assumption: "outliers are often hidden in the unusual local behavior of low-dimensional subspaces, and this deviant behavior is masked by full-dimensional analysis". • In high-dimensional space, data are sparse and almost equidistant. • → Outlier scores become less distinguishable. • Outliers are best emphasized in a lower-dimensional local subspace of relevant attributes.
22. • Max voting • Mainly applied to classification. • Each data point is predicted by multiple models, and each prediction is treated as a "vote". • Example: movie ratings • Techniques • Averaging and weighted averaging • Stacking • Blending • Bagging and boosting
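The two simplest combination rules on this slide, max voting and (weighted) averaging, can be sketched as follows. The model outputs and weights are made up for the example, and the function names are not from any particular library.

```python
from collections import Counter

def max_vote(predictions):
    """Majority label per sample; `predictions` holds one label list per model.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def weighted_average(scores, weights):
    """Weighted mean score per sample; `scores` holds one score list per model."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, column)) / total
            for column in zip(*scores)]

# Three hypothetical classifiers labelling three samples (1 = outlier, 0 = normal).
labels = max_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
# Two hypothetical scorers, the second trusted three times as much as the first.
combined = weighted_average([[0.2, 0.8], [0.4, 0.6]], weights=[1, 3])
print(labels, combined)
```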
23. • Two types of ensembles for outlier analysis: • Sequential ensembles • A given algorithm or set of algorithms is applied sequentially, so that future applications of the algorithms are influenced by previous applications, in terms of either modifications of the base data for analysis or the specific choices of the algorithms. • Final output: either a weighted combination of the results, or the final result of the last application. (Example) In classification, boosting methods may be considered examples of sequential ensembles. • Independent ensembles • Different algorithms, or different instantiations of the same algorithm, are applied to either the complete data or portions of the data. The choices made about the data and algorithms applied are independent of the results obtained from these different algorithmic executions. • Final output: the results of the executions are combined in order to obtain more robust outliers.
24. • Categorical data, text, and mixed attributes • Categorical attributes take on discrete, unordered values. • Mixed attribute data contain both numerical and categorical attributes. • Regression-based models can be used only in a limited way over discrete attribute values. • Remedy • Convert the discrete data to binary data by creating one attribute for each categorical value. Such methods can be more easily extended to text. • Applicable models • LSA (latent semantic analysis) • Clustering • Proximity-based methods • Probabilistic models • Frequent pattern mining • Dependencies within the data • Time series data • Discrete sequence data • Graphs, networks, …
25. • Modeling: estimating f? • Prediction • Inference • Resampling and cross-validation • Supervised vs. unsupervised learning
26. • Feature • A feature is a numeric representation of raw data. • Simple numbers • Scalars, vectors, spaces • Counts • Binarization, quantization or binning • Feature scaling (normalization) • Min-max scaling • Standardization (variance scaling) • Feature selection • Bucketing • Crossing • Hashing • Embedding
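A minimal sketch of the two feature-scaling methods named on this slide, using only the standard library. The function names and sample values are assumptions. Both functions assume the input is not constant, since a constant column would lead to a division by zero.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

raw = [2, 4, 6]
scaled = min_max_scale(raw)
standardized = standardize(raw)
print(scaled, [round(v, 3) for v in standardized])
```

Min-max scaling is sensitive to a single extreme value, because that value defines the range. This matters when the scaled features are later fed to distance-based outlier detectors.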
27. • Log transformation • Text data • Flat vectors • Bag-of-words, bag-of-n-grams • Filtering • Stopwords, frequency-based filtering, stemming • Semantic techniques • Parsing, tokenization, phrase detection, TF-IDF • Categorical variables: encoding • One-hot encoding, dummy coding • Dimensionality reduction and matrix factorization • PCA, SVD • Applicable models • LSA (latent semantic analysis), clustering, probabilistic models • Dependencies among data values
28. • kNN • kNN graph (k-nearest neighbor graph) • A graph in which two vertices p and q are connected by an edge if the distance between p and q is among the k smallest distances from p to the other objects in P. • It has a vertex for each point, and a directed edge from p to q whenever q is a nearest neighbor of p, i.e., a point whose distance from p is minimum among all the given points other than p itself. • (Variant 1) 1-NNG • The directions of the edges are ignored and the NNG is defined as an undirected graph, even though the nearest-neighbor relation is not symmetric. • (Variant 2) FNG (farthest neighbor graph)
29. • Outlier Detection using In-degree Number (ODIN) • Compute the in-degree of each data point • In-degree = the number of nearest-neighbour sets to which the point belongs. • A large in-degree gives more confidence that the point belongs to a dense region of the space. • A small in-degree means the point is not part of many nearest-neighbour sets, • i.e., it is relatively isolated in the space. • In effect, the reverse of kNN.
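The in-degree idea can be sketched directly: build each point's k-nearest-neighbour set, then count how many sets every point appears in. The toy points, `k=2`, and the function names are assumptions for illustration. The brute-force neighbour search is only suitable for small data.

```python
import math

def knn_indices(points, k):
    """For each point, the indices of its k nearest other points (brute force)."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        result.append(others[:k])
    return result

def odin_indegree(points, k):
    """In-degree of each point in the directed kNN graph."""
    indegree = [0] * len(points)
    for neighbours in knn_indices(points, k):
        for j in neighbours:
            indegree[j] += 1
    return indegree

# Hypothetical data: a tight square of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
indegree = odin_indegree(points, k=2)
print(indegree)
```

The distant point still has k neighbours of its own, but no other point lists it as a neighbour. Its in-degree is therefore zero, and a low in-degree is what ODIN treats as evidence of an outlier.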
30. • SVM • Concept • Linear SVM vs. non-linear SVM • One-class classification • 1. Outlier detection • 2. AD in acoustic signals • 3. Novelty detection, and many others. • One-class SVM (1) • To ensure the widest street (margin): • maximizing $2/\lVert w \rVert$ is equivalent to minimizing $\frac{1}{2}\lVert w \rVert^2$. • Introducing Lagrange multipliers, with: • w = weight vector (normal to the separating hyperplane), • α = Lagrange multiplier, • y = either +1 or -1, i.e., the class of the sample, • x = samples from the data.
31. Unsupervised learning: overview • Dimensionality reduction • Linear projection • PCA, SVD • Random projection • Manifold learning • Isomap • t-SNE • Dictionary learning • ICA, Latent Dirichlet Allocation • Clustering • K-Means • Hierarchical clustering • DBSCAN • Mixture models/EM • Deep learning-based unsupervised learning • Feature extraction • Autoencoders • Unsupervised pretraining • Generative models and network models • RBM • Deep Belief Networks • GAN • Application to sequential data • Hidden Markov model • Reinforcement learning and unsupervised learning • Semi-supervised learning
32. • Unsupervised learning • Goal: find interesting patterns and hidden properties of the data • Contrast with supervised learning • Can we visualize the data? • Can we find meaningful subgroups of observations or variables? • Challenges • EDA: the goal is not as clearly defined • Objective performance measurement is difficult: we don't know the "right answer" • High-dimensional data • Typical applications • As a stand-alone tool to get insight into the data distribution • As a preprocessing step for other algorithms
33. • Clustering models • Cluster = a subset of the data whose members are similar to one another.
34. • K-Means • K-Means-based anomaly detection • We define what counts as an outlier ourselves: • define what a "far" distance is, or • define how many data points should be outliers. • Outlier/anomaly • A data point far from the centroid of its cluster
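The sketch below shows one way to read this slide: run a plain K-Means (Lloyd's algorithm) and use each point's distance to its nearest centroid as the outlier score. The data, the hand-picked initial centroids, and the function names are assumptions. A real exercise would more likely use a library implementation with random restarts.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm starting from the given initial centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster went empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

def centroid_distance_scores(points, centroids):
    """Outlier score = distance from each point to its nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

# Hypothetical data: two compact groups and one stray point.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9), (5, 15)]
centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
scores = centroid_distance_scores(points, centroids)
top = scores.index(max(scores))
print("most anomalous point:", points[top])
```

Either choice from the slide can then be applied to `scores`: a distance threshold ("far"), or taking a fixed number of the highest-scoring points.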
35. • LOF (Local Outlier Factor) • Overview • Identifies an outlier by considering the density of its neighborhood. • Particularly effective when the density of the data is not uniform • LOF = ratio of the average LRD of the K neighbors of A to the LRD of A. • The LRD of each point is compared with the average LRD of its K neighbors. • If the point is not an outlier (i.e., it is an inlier), the average LRD of its neighbors is approximately equal to the LRD of the point, so LOF is close to 1. • If the point is an outlier, its LRD is lower than the average LRD of its neighbors, so the LOF value is high. • A point with LOF > 1 is considered a candidate outlier, although this is not always true. • Related concepts • Reachability distance (RD) • Local reachability density (LRD) • Local Outlier Factor (LOF)
36. • LOF ≈ 1: similar density to neighbors • LOF < 1: higher density than neighbors (normal point) • LOF > 1: lower density than neighbors (anomaly)
37. • K-distance and K-neighbors • K-distance = the distance between the point and its Kth nearest neighbor. • K-neighbors, $N_k(A)$: the set of points that lie in or on the circle of radius K-distance. • Reachability distance (RD) • $RD(X_i, X_j)$ = the maximum of the K-distance of $X_j$ and the distance between $X_i$ and $X_j$. • Local reachability density (LRD) • Figure: K-distance of A with K=2
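The LOF definitions from the last three slides (K-distance, reachability distance, LRD, and the final ratio) can be put together in a short brute-force sketch. The toy data and function name are assumptions. As a simplification, each neighbourhood is cut to exactly k points rather than including all ties at the K-distance, and duplicate points (which would give a zero reachability sum) are not handled.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of every point, computed by brute force."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]

    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])                 # simplified N_k(i): exactly k points
        k_distance.append(dist[i][order[k - 1]])     # distance to the k-th neighbour

    # LRD(i) = 1 / mean over neighbours j of RD(i, j),
    # where RD(i, j) = max(K-distance(j), d(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF(i) = average LRD of the neighbours divided by LRD(i).
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: a unit square of four points and one point far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
lof = lof_scores(points, k=2)
print([round(v, 2) for v in lof])
```

The four square corners have the same density as their neighbours and score 1, while the distant point's LRD is much lower than that of its neighbours, which pushes its LOF well above 1.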
38. • Mixture model • Models the data as a mixture of several components, where each component has a simple parametric form (e.g., Gaussian). • Assumes the mixture components are known and estimates class membership given the parameters. • Mixtures of {sequences, curves, …} • Generative model • Select a component $c_k$ for individual i • Generate data according to $p(D_i \mid c_k)$ • $p(D_i \mid c_k)$ can be very general • GMM (Gaussian Mixture Model) • Multivariate Gaussian models
39. • EM (Expectation-Maximization) • Latent variable model • Algorithm • Expectation • Maximization
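To connect the Expectation and Maximization steps with the mixture model of the previous slide, here is a minimal EM sketch for a one-dimensional Gaussian mixture. The log-density of a point under the fitted mixture then serves as a likelihood-based outlier score. The simulated data, the initial means, the variance floor, and all names are assumptions for the example. Multivariate data and convergence checks are left out.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, initial_means, iterations=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted data.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

def log_density(x, weights, means, variances):
    """Log-likelihood of x under the mixture; lower means more outlying."""
    return math.log(sum(w * normal_pdf(x, m, v)
                        for w, m, v in zip(weights, means, variances)))

# Hypothetical data: two Gaussian clusters centred at 0 and 8.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(8, 1) for _ in range(300)]
weights, means, variances = em_gmm_1d(data, initial_means=[1.0, 6.0])
print([round(m, 2) for m in means])
print(log_density(0.0, weights, means, variances),
      log_density(4.0, weights, means, variances))
```

A point halfway between the two clusters receives a much lower log-density than a point at a cluster centre. A threshold on this score, or a ranking by it, turns the fitted model into an outlier detector.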
40. • Neurons and artificial nodes • Factors that determine the characteristics of a neural network: • Activation function • step, sigmoid, tanh, ReLU • Network topology (or architecture) • Number of neurons in the model + number of connected layers • Training algorithm • Gradient descent, Newton's method, conjugate gradient, … • Learning: backpropagation (BP) through gradient descent • Computation graph
41. • Basic types • Supervised deep learning • CNN, RNN • Unsupervised deep learning • Autoencoder, RBM • Reinforcement learning • Q-Learning, Policy-Gradient • Applications • CNN • RNN
42. • Deep learning models https://link.springer.com/article/10.1007/s00530-020-00694-1
43. • Python-based deep learning frameworks • TensorFlow and Keras • TensorFlow • Using Keras • R interface • PyTorch • Other major libraries
44. • Concepts • Anomaly detection (AD) or novelty detection • Normality representation • ☞ Descriptive statistics • Measures of frequency • Measures of central tendency • Measures of dispersion • Anomaly representation • Two output formats of outlier detection algorithms • Outlier scores: quantify the level of "outlier-ness" (outlier tendency). • Binary labels: whether a data point is an outlier or not • Key issues • Subjectivity, interestingness, and noise • The data may be embedded in a significant amount of noise
45. • RNN • Concept • An RNN considers the current input together with the inputs it received in the past • Pros and cons • (Pro) It sees how the previous layer was stimulated → the network interprets sequences much better. • (Con) More parameters to compute • Figure: a recurrent neuron (left) unrolled through time (right)
46. • Long short-term memory (LSTM) models • Purpose: address the problems of conventional RNNs: • Vanishing and exploding gradients • Inability to remember or forget certain aspects of input sequences • Key feature: receives not only the previous output but also state information from the previous time step. • How it works • Output control: how much an output neuron is stimulated by the previous output and current state • Memory control: how much of the previous state will be forgotten • Input control: how much of the previous output and new state (memory) will be considered to determine the new current state • These controls are trainable and optimized
47. • Definition • Shallow, 2-layer NNs that constitute DBNs (deep belief networks) • Restriction = no intra-layer communication.
48. • Reconstruction • The activations of hidden layer 1 become the input in a backward pass. • Forward pass: the RBM predicts node activations, $p(a \mid x; w)$. • Backward pass: the RBM attempts to estimate $p(x \mid a; w)$. • Through reconstruction, the model estimates the PDF of the input data (= generative learning) • The distance between the estimated PDF and the actual PDF is computed with the Kullback-Leibler divergence. • Kullback-Leibler (KL) divergence measures the divergence of two probability distributions, p and q. Through this, $p(x, a)$ …
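For discrete distributions the KL divergence mentioned here is a one-line sum, $D_{KL}(p \,\|\, q) = \sum_i p_i \log(p_i / q_i)$. The sketch below uses made-up two-outcome distributions and assumes that q is non-zero wherever p is non-zero.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # hypothetical "actual" distribution
q = [0.9, 0.1]   # hypothetical "estimated" distribution
print(round(kl_divergence(p, q), 4), round(kl_divergence(q, p), 4))
```

The two printed values differ, which shows that KL divergence is not symmetric and therefore not a distance in the strict sense. It is zero only when the two distributions agree.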
49. • Autoencoder • Similar to PCA but more flexible. • The target output is the input (x) in a different form (x'). • Dimensionality of the input = dimensionality of the output • Essentially, what we want is x' = x, where x' = Decode(Encode(x)) • Anomaly detection using an autoencoder • The autoencoder learns the distribution of the majority of the points; a point in feature space that lies far away from that majority (meaning it holds different properties) is an anomaly. • That is, the model more or less correctly regenerates normal images, leading to low loss values, whereas anomalies are reconstructed poorly. • The reconstruction loss values are used as the anomaly scores • The higher the score, the higher the chance that the input is an anomaly.
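50. • LSTM-based Encoder-Decoder for Anomaly Detection • A model that learns normal data (multivariate time series) in an unsupervised manner and detects anomalies • Features • Consists of an LSTM encoder and an LSTM decoder • The encoder compresses the multivariate data into a feature representation. • The decoder uses the feature received from the encoder to reconstruct the multivariate data that was fed to the encoder • Computing the reconstruction error • Training uses MSE loss, but at inference time the error is computed as absolute error.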
51. • Self-Organizing Maps (SOM) • Features • Competitive learning (rather than error-correction learning such as backpropagation) • A kind of dimensionality reduction (DR) technique • How it works • Components • Initialization • Competition • Cooperation • Adaptation • Sources: https://arxiv.org/pdf/1312.5753.pdf; Wikipedia
52. • Detecting anomalies in cybersecurity • Malware analysis • Network traffic analysis • Sensor networks
53. • Motor fault diagnosis • = Fault diagnosis (whether the current state is faulty) + fault prediction • Example: induction motor • DBN (Deep Belief Network) • A generative graphical model • Stacking RBMs • Trained on vibration (frequency) measurement data • → Contrastive divergence using Gibbs sampling
54. • Chemistry/Materials Science • Medical Outlier
55. • Complexities of the anomaly detection problem • Unknownness • Anomalies are associated with many unknowns, e.g., instances with unknown abrupt behaviors, data structures, and distributions. They remain unknown until they actually occur. • Heterogeneous anomaly classes • Anomalies are irregular, and thus one class of anomalies may demonstrate completely different abnormal characteristics from another class of anomalies. • Rarity and class imbalance • → Large-scale labeled data are unavailable in most applications = class imbalance • Diverse types of anomaly • Three types of anomaly have been explored: • Point anomalies • Conditional anomalies (= contextual anomalies) • Group anomalies (= collective anomalies)
56. • Problems that deep anomaly detection attempts to solve • CH1: Low anomaly detection recall rate. • CH2: Anomaly detection in high-dimensional and/or not-independent data. • CH3: Data-efficient learning of normality/abnormality. • CH4: Noise-resilient anomaly detection. • CH5: Detection of complex anomalies. • CH6: Anomaly explanation.
57. • Three frameworks of deep anomaly detection approaches