Contents
• (1) Central Tendency: Ungrouped Data
• Mode, Mean, Median
• Percentile, Quantile/Quartile
• (2) Variability: Ungrouped Data
• Range & IQR (Interquartile Range)
• MAD (Mean Absolute Deviation), Variance, Standard Deviation
• Variance and standard deviation
• Unbiased estimator
• Z-score
• (3) Measures of Shape
• Moment
• Skewness and Kurtosis
• (4) Measures of Association
• Correlation — see the computational sketch below
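A minimal NumPy sketch of these ungrouped-data measures; the sample array is made up for illustration, and ddof=1 gives the unbiased variance estimator mentioned above.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9, 30])   # made-up sample; 30 is a likely outlier

# Central tendency
vals, counts = np.unique(x, return_counts=True)
mode = vals[counts.argmax()]                  # most frequent value (4)
mean, median = x.mean(), np.median(x)

# Variability
data_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                 # interquartile range
mad = np.mean(np.abs(x - mean))               # mean absolute deviation
var = x.var(ddof=1)                           # ddof=1 -> unbiased estimator (n-1 denominator)
std = x.std(ddof=1)

# Z-scores: |z| > 3 is a common rule of thumb for flagging outliers
z = (x - mean) / std
print(mode, mean, median, data_range, iqr, mad, var, z.round(2))
```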
• Basic concepts
• Experiment, (Elementary) Event, and Sample Space, …
• Conditional probability and Bayes' rule
• Conditional probability rule: P(X | Y) = P(X ∩ Y)/P(Y) = P(X)·P(Y | X)/P(Y)
• Test for independence: P(X | Y) = P(X) and P(Y | X) = P(Y)
• Bayes' Rule
• P(Xi | Y) = P(Xi)·P(Y | Xi) / [P(X1)·P(Y | X1) + P(X2)·P(Y | X2) + … + P(Xn)·P(Y | Xn)]
• Odds — see the numeric sketch below
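A small numeric sketch of the conditional-probability and Bayes'-rule formulas above; the priors and likelihoods are invented (a rare-disease testing scenario) purely for illustration.

```python
# Hypothetical two-hypothesis example: P(X1)=0.01 (disease), P(X2)=0.99,
# P(Y|X1)=0.95 (test positive given disease), P(Y|X2)=0.05 (false positive).
priors = [0.01, 0.99]
likelihoods = [0.95, 0.05]

# Denominator of Bayes' rule: total probability of Y
evidence = sum(p * l for p, l in zip(priors, likelihoods))
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]
print(posteriors)  # P(X1|Y) ~ 0.161: a positive test is still more likely a false alarm

# Odds form: posterior odds = prior odds * likelihood ratio
prior_odds = priors[0] / priors[1]
lr = likelihoods[0] / likelihoods[1]
posterior_odds = prior_odds * lr
print(posterior_odds / (1 + posterior_odds))  # same posterior, via odds
```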
• Random variables and probability distributions
• Random variable
• = a variable that contains the outcomes of a chance experiment
• Probability distribution
• Discrete distributions: binomial, Poisson
• Continuous distributions: uniform, normal, t, exponential, and χ² distributions
• Estimation
• Normal distribution
• t-distribution
• Hypothesis testing
• Hypothesis testing using the p-value
• p-value = observed level of significance
• defines the smallest value of α for which H0 can be rejected
• "H0 can be rejected only when α is greater than the p-value" — see the sketch below
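A hedged sketch of a p-value-based test with SciPy, on synthetic data; it illustrates the rule above (reject H0 exactly when p < α).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)  # synthetic data for illustration

# H0: population mean = 0. Reject H0 when the p-value is below alpha.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```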
• Variate(s)
• Univariate
• Bivariate
• Covariance matrix and SSCP
• Multivariate
• Multivariate probability distributions
• Multivariate normal distribution
• Multivariate analysis techniques
• PCA
• Factor Analysis
• MDS
• ANOVA/ANCOVA/MANOVA/MANCOVA/…
• Multivariate multiple linear regression (MVMLR)
• Key issues
• Subjectivity, Interestingness and noise
• Subjective judgment as to what constitutes a "sufficient" deviation
• In real applications, the data may be embedded in a significant amount of noise
• The noise problem
• The representation problem
• Normality vs. Anomaly
• Feature engineering
• Outlier analysis and data models
• Algorithm taxonomy (1)
• Outlier scores: quantify level of “outlierness”
• Binary labels: “outlier? or not?”
• the threshold is based on the statistical distribution of the scores.
• Algorithm taxonomy (2)
• Qualitative techniques
• Expert opinion
• Time series analysis
• Focus: patterns and pattern changes
• Historical data plays a key role
• Causal models
• Causal relationships
• Historical data plays a key role
• Regression
• Independent and dependent variables
Univariate Time Series Models
• AR Model
• Autocorrelation
• Stationarity and the ADF Test
• Differencing a Time Series
• Lags in autocorrelation
• Partial Autocorrelation
• AR model definition
• AR estimation using the Yule-Walker equations
• MA Model
• MA model definition
• Fitting the MA model
• Stationarity
• Choosing between AR and MA models
• Multi-step forecasting via model retraining
• Grid search for the optimal MA order — see the sketch after this list
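A sketch of the workflow above using statsmodels, assuming a synthetic AR(1) series: an ADF stationarity check, then a small grid search over the MA order by AIC.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
# Synthetic AR(1) series for illustration: y_t = 0.7 * y_{t-1} + e_t
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# Stationarity check: a small p-value rejects the unit root (series is stationary)
print("ADF p-value:", adfuller(y)[1])

# Grid search over the MA order q by AIC; ARIMA(0, 0, q) is a pure MA(q) model
best = min(
    ((q, ARIMA(y, order=(0, 0, q)).fit().aic) for q in range(0, 4)),
    key=lambda pair: pair[1],
)
print("best MA order and AIC:", best)
```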
• ARMA Model
• Model definition
• Fitting an ARMA(1,1) model
• Automated Hyperparameter Tuning
• Grid Search
• Tuning for performance improvement
• ARIMA Model
• Model definition
• SARIMA Model
• Model definition
• regular AR part φp
• seasonal AR part
• regular MA part θq
• seasonal MA part
• regular integration part; order d
• seasonal integration part; order D
• Seasonal period s
• Multivariate Time Series Models
• SARIMAX Model
• Adds exogenous variables (X) to the SARIMA model — see the sketch after this list
• VAR Model
• Since the VAR model proposes one model for multiple target variables, it regroups those variables as a vector
• Estimating the VAR coefficients
• VARMAX Model
• Adds an MA term to the VAR model and allows exogenous variables
• V for vector, indicating that it is a multivariate model
• AR for autoregression
• MA for moving average
• X for the use of exogenous variables (in addition to the endogenous variables)
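A minimal SARIMAX sketch with statsmodels on synthetic monthly data; the exogenous regressor and all orders are made up, only to show the (p, d, q)(P, D, Q, s) structure described above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(2)
idx = pd.date_range("2015-01-01", periods=120, freq="MS")
exog = pd.Series(rng.normal(size=120), index=idx, name="x")   # made-up exogenous variable
y = pd.Series(
    10 + 3 * np.sin(2 * np.pi * np.arange(120) / 12)          # yearly seasonality (s = 12)
    + 0.5 * exog.to_numpy() + rng.normal(0, 0.3, 120),
    index=idx,
)

# order=(p, d, q) = regular AR/integration/MA parts;
# seasonal_order=(P, D, Q, s) = seasonal parts with period s
model = SARIMAX(y, exog=exog, order=(1, 0, 1), seasonal_order=(1, 0, 1, 12))
res = model.fit(disp=False)

# Forecasting requires future values of the exogenous variable as well
future_x = np.zeros(12).reshape(-1, 1)
print(res.forecast(steps=12, exog=future_x))
```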
• Linear Regression
• kNN, decision trees / Random Forest
• XGBoost and LightGBM
• RNN/LSTM
• Predicting a Sequence Rather Than a Value
• SimpleRNN
• GRU, LSTM
• Prophet Model (Facebook)
• An automated procedure for building forecasting models, developed by Facebook
• Input possibilities are
• Seasonality of any regular order
• Holidays
• Additional regressors
• Hyperparameters
• Fourier order of the seasonality: a higher order means more flexibility
• changepoint_prior_scale acts on the trend: the higher the value, the more flexible the trend
• holidays_prior_scale: the lower it is, the less important the holidays are for the model
• prior scale for the seasonality (a usage sketch follows this list)
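A hedged Prophet usage sketch. The file name sales.csv and the promo regressor column are hypothetical; the hyperparameter values only illustrate the knobs listed above.

```python
import pandas as pd
from prophet import Prophet

# df is assumed to have columns 'ds' (dates), 'y' (values),
# and a hypothetical extra regressor column 'promo'
df = pd.read_csv("sales.csv", parse_dates=["ds"])

m = Prophet(
    changepoint_prior_scale=0.1,   # higher -> more flexible trend
    holidays_prior_scale=5.0,      # lower -> holidays matter less
    seasonality_prior_scale=10.0,  # prior scale for the seasonality
)
# Seasonality of any regular order: a higher Fourier order = more flexibility
m.add_seasonality(name="monthly", period=30.5, fourier_order=5)
m.add_regressor("promo")           # hypothetical additional regressor

m.fit(df)
future = m.make_future_dataframe(periods=90)
future["promo"] = 0                # future regressor values must be supplied
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```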
• DeepAR Model
• Euclidean distance
• Manhattan distance
• Minkowski distance
• Cosine distance — see the sketch below
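The four distances above, computed with scipy.spatial.distance on two made-up vectors:

```python
from scipy.spatial import distance

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]

print(distance.euclidean(a, b))        # Euclidean (L2)
print(distance.cityblock(a, b))        # Manhattan (L1)
print(distance.minkowski(a, b, p=3))   # Minkowski with p = 3
print(distance.cosine(a, b))           # cosine distance = 1 - cosine similarity
```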
• Distance Distribution-based Techniques
• Models the entire data set as normally distributed about its mean, in the form of a multivariate Gaussian distribution
• Let μ be the d-dimensional (row) mean vector of the data set, and Σ its d × d covariance matrix
• Then the probability density f(X) of a d-dimensional (row vector) data point X is:
• f(X) = (1 / √((2π)^d |Σ|)) · exp(−½ (X − μ) Σ⁻¹ (X − μ)ᵀ)
• |Σ| = determinant of the covariance matrix
• Exponent: the (half) squared Mahalanobis distance of the data point X to the centroid μ of the data = outlier score (see the sketch below)
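A NumPy sketch of the Mahalanobis outlier score above, on synthetic correlated Gaussian data with one injected outlier:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
X = np.vstack([X, [[4.0, -4.0]]])        # inject an obvious outlier

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

# Squared Mahalanobis distance of each point to the centroid = outlier score
diff = X - mu
scores = np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)
print("top scores:", np.sort(scores)[-3:])  # the injected point dominates
```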
• Extreme-Value Analysis
• Identifying extreme values
• = Probabilistic Tail Inequalities
• Markov Inequality
• Chebyshev Inequality
• …
• determine statistical tails of the underlying distribution
• Univariate
• Box Plots (the IQR whisker rule is sketched below)
• Extreme-value analysis for multivariate data
• Depth-Based Methods – convex hull analysis
• Deviation-Based Methods
• Angle-based
• Extreme-value analysis is usually required as a final step on these modeled deviations
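A minimal sketch of the univariate box-plot rule on a made-up sample; the 1.5 × IQR whisker multiplier is the conventional choice.

```python
import numpy as np

x = np.array([3.1, 2.9, 3.0, 3.2, 3.1, 9.8, 3.0, 2.8])  # made-up sample

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # standard box-plot whisker rule
print("extreme values:", x[(x < lo) | (x > hi)])  # flags 9.8
```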
• Overview
• In probabilistic models, the "likelihood fit of a data point to a generative model is the outlier score"
• Examples
• GMM
• EM
• Pros and cons
• Pros
• Applicable in many settings (any data type or mixed data types), as long as an appropriate generative model is available for each mixture component
• Cons
• Cases where the distribution is hard to specify
• As the number of model parameters increases, over-fitting becomes more common
• General form
• A convex nonlinear program – OLS
• Model the data along lower-dimensional subspaces using linear correlations
• Distance between the hyperplane and the data → outlier scores
• PCA
• Matrix decomposition
• Spectral Models
• Some variations of matrix decomposition (e.g., PCA) used on certain types of data, such as graphs and networks, are called spectral models
• They are commonly used for clustering graph data, and are often used to identify anomalous changes in temporal sequences of graphs
• Concept
• Clustering methods
• Density-based methods
• Nearest-neighbor methods
• Concept
• Outliers increase the minimum code length (i.e., the minimum length of the summary) required to describe a data set, since they represent deviations from natural attempts to summarize the data
• Example (1)
• Example (2): multidimensional data sets
• Probabilistic models: describe a data set in terms of generative model parameters, such as a mixture of Gaussian distributions or a mixture of exponential power distributions
• Clustering / density-based summarization: describes a data set in terms of cluster descriptions, histograms, or other summarized representations, along with maximum error tolerances
• PCA / spectral models: describe the data in terms of lower-dimensional subspaces of projections of multidimensional data, or a latent representation of a network
• FP mining: describes the data in terms of an underlying code book of frequent patterns
• High dimensionality
• Subspace outlier detection
• Assumption: "outliers are often hidden in the unusual local behavior of low-dimensional subspaces, and this deviant behavior is masked by full-dimensional analysis"
• In high-dimensional space, data become sparse and almost equidistant
• → outlier scores become less distinguishable
• Outliers are best emphasized in a lower-dimensional local subspace of relevant attributes
• Max Voting
• Applied mainly to classification
• Multiple models predict each data point – each prediction is treated as a 'vote'
• Example: movie ratings
• Techniques
• Averaging and Weighted Averaging
• Stacking
• Blending
• Bagging and Boosting
• Two types of outlier-analysis ensembles:
• sequential ensembles
• A given algorithm or set of algorithms is applied sequentially, so that future applications of the algorithms are influenced by previous applications, either through modifications of the base data for analysis or through the specific choices of the algorithms
• Final output: either a weighted combination of the applications, or the final result of the last application. (Example) In classification, boosting methods may be considered examples of sequential ensembles
• independent ensembles
• Different algorithms, or different instantiations of the same algorithm, are applied to either the complete data or portions of the data. The choices made about the data and algorithms are independent of the results obtained from the different algorithmic executions
• Final output: the executions are combined to obtain more robust outliers
• Categorical Data, Text, and Mixed Attributes
• Categorical attributes take on discrete, unordered values
• Mixed-attribute data contain both numerical and categorical attributes
• Regression-based models can be used in a limited way over discrete attribute values
• Countermeasures
• Convert the discrete data to binary data by creating one attribute for each categorical value. Such methods can be more easily extended to text
• Model application
• LSA (latent semantic analysis)
• Clustering
• proximity-based methods
• probabilistic models
• frequent pattern mining
• Dependency issues within the data
• Time series data
• Discrete sequence data
• Graphs, networks, …
• Modeling: estimating f?
• Prediction
• Inference
• Resampling and Cross-Validation
• Supervised vs. unsupervised learning
• Features
• A feature is a numeric representation of raw data.
• Simple Numbers
• Scalars, vectors, spaces
• Counts
• Binarization; quantization or binning
• Feature Scaling (Normalization) — see the sketch after this list
• Min-max scaling
• Standardization (variance scaling)
• Feature Selection
• Bucketing, Crossing, Hashing, Embedding
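A short scikit-learn sketch of the two scaling methods above, on a toy feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])  # toy feature matrix

# Min-max scaling: rescales each column to [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization (variance scaling): zero mean, unit variance per column
print(StandardScaler().fit_transform(X))
```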
• Log transformation
• Text data
• Flat Vectors
• Bag-of-words, Bag-of-N-Grams
• Filtering
• Stopwords, Frequency-based filtering, Stemming
• Semantic techniques
• Parsing, tokenization, Phrase Detection, TF-IDF
• Categorical variables – encoding
• One-hot encoding, Dummy coding
• Dimensionality reduction and matrix decomposition
• PCA, SVD
• Model application
• LSA (latent semantic analysis), Clustering, probabilistic models
• Dependency issues in data values
• kNN
• What is a kNN graph (k-nearest-neighbor graph)?
• A graph in which two vertices p and q are connected by an edge if the distance between p and q is among the k smallest distances from p to the other objects in P
• Has a vertex for each point, and a directed edge from p to q whenever q is a nearest neighbor of p, i.e., a point whose distance from p is minimum among all the given points other than p itself
• (Variant 1) 1-NNG
• Directions of the edges are ignored, and the NNG is instead defined as an undirected graph. However, the nearest-neighbor relation is not symmetric
• (Variant 2) FNG (farthest-neighbor graph)
• Outlier Detection using In-degree Number (ODIN)
• Compute the in-degree of each data point
• in-degree = the number of nearest-neighbor sets to which this point belongs
• A large in-degree gives more confidence that the point belongs to some dense region of the space
• A small in-degree means the point is not part of many nearest-neighbor sets
• i.e., it is somewhat isolated in the space
• ODIN is the reverse of kNN — see the sketch below
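A rough ODIN sketch with scikit-learn's NearestNeighbors: compute each point's in-degree in the kNN graph and treat low in-degree as isolation. The data and the choice of k are made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[6.0, 6.0]]])  # one isolated point

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
_, idx = nn.kneighbors(X)

# In-degree = how many k-NN sets each point belongs to; low in-degree -> outlier
in_degree = np.bincount(idx[:, 1:].ravel(), minlength=len(X))
print("lowest in-degrees:", np.argsort(in_degree)[:3])  # the isolated point ranks low
```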
• SVM
• Concept
• Linear SVM vs. Non-Linear SVM
• One-Class Classification
• 1. Outlier Detection
• 2. AD in Acoustic Signals
• 3. Novelty Detection and many others.
• One-class SVM (1)
• To ensure the widest "street" (margin):
• maximizing 2/|w| is equivalent to minimizing ½|w|²
• + Lagrange multipliers →
• w = the weight vector,
• alpha = the Lagrange multipliers,
• y = either +1 or -1, i.e., the class of the sample,
• x = samples from the data (a usage sketch follows this list)
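A hedged one-class SVM usage sketch with scikit-learn, trained on synthetic "normal-only" data; setting nu to the expected outlier fraction is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(0, 1, size=(200, 2))              # assumed "normal" data only

oc = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")  # nu ~ expected outlier fraction
oc.fit(X_train)

X_test = np.array([[0.1, -0.2], [5.0, 5.0]])
print(oc.predict(X_test))             # +1 = inlier, -1 = outlier
print(oc.decision_function(X_test))   # signed distance to the boundary
```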
Unsupervised Learning: General Overview
• Dimensionality reduction
• Linear Projection
• PCA, SVD
• Random projection
• Manifold Learning
• Isomap
• t-SNE
• Dictionary learning
• ICA, Latent Dirichlet Allocation
• Clustering
• K-Means
• Hierarchical Clustering
• DBSCAN
• Mixture models / EM
• Deep-learning-based unsupervised learning
• Feature Extraction
• Autoencoders
• Unsupervised Pretraining
• Generative models and network models
• RBM
• Deep Belief Networks
• GAN
• Application to sequential data
• Hidden Markov model
• Reinforcement learning and unsupervised learning
• Semi-supervised Learning
• Unsupervised learning
• Goal: find interesting patterns and hidden properties of the data
• = unsupervised learning (vs. supervised learning)
• Can we visualize data?
• Can we find meaningful subgroups of observations or variables?
• Challenges
• EDA – the goal is not as clearly defined
• Objective performance measurement is hard – we don't know the "right answer"
• High-dimensional data
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
• Clustering models
• Cluster = a subset of data points that are similar
• K-Means
• K-Means-based anomaly detection — see the sketch after this list
• We can define outliers ourselves:
• define what counts as a 'far' distance
• define how many data points should be flagged as outliers
• outlier/anomaly
• a data point far from the centroid of its cluster
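A minimal sketch of K-Means-based outlier detection on synthetic data; the "far" threshold (top 1% of centroid distances) is our own illustrative choice, per the "define it ourselves" point above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(8, 1, (150, 2)), [[4.0, 4.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Outlier score = distance of each point to the centroid of its own cluster
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# "Far" is defined here as the top 1% of distances (an arbitrary choice)
threshold = np.quantile(dist, 0.99)
print("outlier indices:", np.where(dist > threshold)[0])
```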
• LOF (Local Outlier Factor)
• Overview
• Identifies outliers by considering the density of the neighborhood
• Especially effective when the density of the data is not uniform
• LOF = the ratio of the average LRD of the K neighbors of A to the LRD of A
• The LRD of each point is compared with the average LRD of its K neighbors
• If the point is an inlier (not an outlier), the average LRD of its neighbors is approximately equal to the point's own LRD; in that case LOF is nearly equal to 1
• If the point is an outlier, its LRD is smaller than the average LRD of its neighbors, so the LOF value will be high
• A point with LOF > 1 is considered an outlier, but this is not always true
• Related concepts
• Reachability distance (RD)
• Local reachability density (LRD)
• Local Outlier Factor (LOF)
• LOF ≈ 1: similar density as neighbors
• LOF < 1: higher density than neighbors (normal point)
• LOF > 1: lower density than neighbors (anomaly)
• K-distance and K-neighbors
• K-distance
• = the distance between the point and its Kth nearest neighbor
• K-neighbors Nₖ(A) = the set of points that lie in or on the circle of radius K-distance
• Reachability Distance (RD)
• RD(Xi, Xj) = max(K-distance of Xj, distance(Xi, Xj))
• Local RD (LRD) — see the sketch after this list
(Figure: K-distance of A with K = 2)
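A short LOF sketch with scikit-learn on synthetic data of non-uniform density; scores near 1 mean inlier and scores clearly above 1 mean anomaly, matching the interpretation above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),   # dense cluster
               rng.normal(5, 2.0, (100, 2)),   # sparse cluster
               [[2.5, 2.5]]])                  # isolated point between clusters

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_         # ~1 for inliers, > 1 for outliers
print("max LOF score:", scores.max(), "outliers:", (labels == -1).sum())
```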
• Mixture Model
• Models the data as a mixture of several components, where each component has a simple parametric form (e.g., Gaussian)
• Assumes the form of the mixture components is known and estimates class membership given the parameters
• Mixtures of {Sequences, Curves, …}
• Generative model
• select a component ck for individual i
• generate data according to p(Di | ck)
• p(Di | ck) can be very general
• GMM (Gaussian Mixture Model)
• Multivariate Gaussian models
• EM (Expectation-Maximization)
• Latent variable model
• Algorithm — see the sketch after this list
• Expectation
• Maximization
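A sketch of GMM-based scoring with scikit-learn (EM runs inside fit): the log-likelihood of each point under the fitted mixture serves as an inverse outlier score, per the "likelihood fit" idea earlier. The data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2)), [[3.0, -5.0]]])

# EM fits the two-component mixture; score_samples gives per-point log-likelihood
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_lik = gmm.score_samples(X)                 # low log-likelihood = poor fit = outlier
print("most anomalous index:", np.argmin(log_lik))
```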
• Neurons and Artificial Nodes
• Factors that determine the characteristics of a neural network:
• Activation function
• step, sigmoid, tanh, ReLU
• Network topology (or architecture)
• number of neurons in the model + number of connected layers
• Training algorithm
• Gradient descent, Newton's method, Conjugate gradient, …
• Learning – backpropagation (BP) through gradient descent
• Computation Graph
• Basic types
• Supervised deep learning
• CNN, RNN
• Unsupervised deep learning
• Autoencoder, RBM
• Reinforcement learning
• Q-Learning, Policy Gradient
• Applications
• CNN
• RNN
• Deep learning models
https://link.springer.com/article/10.1007/s00530-020-00694-1
• Python-based deep learning frameworks
• TensorFlow and Keras
• TensorFlow
• Using Keras
• R interface
• PyTorch
• Other major libraries
• Concept
• Anomaly Detection (AD) or novelty detection
• Normality Representation
• ☞ Descriptive statistics
• Measures of Frequency
• Measures of Central Tendency
• Measures of Dispersion
• Anomaly representation
• Two output formats of outlier detection algorithms
• Outlier scores: quantify the level of "outlier-ness" → outlier tendency
• Binary labels: whether a data point is an outlier or not
• Key issues
• Subjectivity, interestingness, and noise
• the data may be embedded in a significant amount of noise
• RNN
• Concept
• An RNN considers the current input together with past inputs
• Pros and cons
• (Pro) Sees how the previous layer was stimulated → the NN interprets sequences much better
• (Con) More parameters to calculate
(Figure: a recurrent neuron (left), unrolled through time (right))
• Long short-term memory models
• Purpose: solve the problems of conventional RNNs:
• Vanishing and exploding gradients
• Inability to remember or forget certain aspects of input sequences
• Feature: receives not only the previous output but also state information from the previous time step
• How it works
• Output control: how much an output neuron is stimulated by the previous output and current state
• Memory control: how much of the previous state will be forgotten
• Input control: how much of the previous output and new state (memory) will be considered to determine the new current state
• These controls are trainable and optimized
• Definition
• Shallow, two-layer NNs that constitute DBNs (deep belief networks)
• Restriction = no intra-layer communication
• Reconstruction
• Activations of hidden layer 1 become the input in a backward pass
• Forward pass – the RBM predicts node activations: p(a | x; w)
• Backward pass – the RBM attempts to estimate p(x | a; w)
• Reconstruction estimates the PDF of the input data (= generative learning)
• The distance between the estimated PDF and the true PDF is measured with the Kullback-Leibler divergence
• The Kullback-Leibler (KL) divergence measures the divergence between two probability distributions p and q; through this, the joint p(x, a) is approximated
• Autoencoder
• Similar to PCA but more flexible
• The target output is its input (x) in a different form (x')
• dimensionality of the input = dimensionality of the output
• essentially what we want is x' = x, where x' = Decode(Encode(x))
• Anomaly detection with autoencoders
• If a point in feature space lies far from the majority of points (meaning it has different properties), the autoencoder, having learned the majority distribution, reconstructs it poorly – an anomaly
• That is, the model more or less correctly regenerates normal inputs, leading to low loss values
• We use these reconstruction loss values as the anomaly scores
• The higher the score, the higher the chance the input is an anomaly (see the sketch below)
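A minimal Keras autoencoder sketch for reconstruction-error anomaly scoring, on synthetic data; the 20-8-20 architecture and training settings are arbitrary illustrative choices.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(9)
X_train = rng.normal(0, 1, size=(1000, 20)).astype("float32")  # assumed normal-only data

# Simple dense autoencoder: input dimension = output dimension (20 -> 8 -> 20)
model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(20,)),  # encoder
    keras.layers.Dense(20),                                       # decoder
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, X_train, epochs=20, batch_size=32, verbose=0)  # target = input

# Reconstruction error per sample = anomaly score
X_test = np.vstack([rng.normal(0, 1, (5, 20)), rng.normal(6, 1, (1, 20))]).astype("float32")
err = np.mean((model.predict(X_test, verbose=0) - X_test) ** 2, axis=1)
print(err)  # the shifted sample should get a much higher score
```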
• LSTM-based Encoder-Decoder for Anomaly Detection
• A model that learns normal data (multivariate time series) in an unsupervised fashion and detects outliers
• Features
• Composed of an LSTM encoder and an LSTM decoder
• The encoder compresses the multivariate data into features
• The decoder reconstructs the multivariate input from the features received from the encoder
• Computing the reconstruction error
• Trained with an MSE loss, but at inference time the error is computed as the absolute error
• Self-Organizing Maps
• Self-organizing map (SOM)
• Features
• Competitive learning (rather than error-correction learning such as BP)
• A kind of dimensionality-reduction (DR) technique
• How it works
• Components
• Initialization
• Competition
• Cooperation
• Adaptation
https://arxiv.org/pdf/1312.5753.pdf
source: Wikipedia
• Detecting cybersecurity anomalies
• Malware analysis
• Network traffic analysis
• Sensor networks
• Motor Fault Diagnosis
• = fault diagnosis (whether the current state is faulty) + fault prediction
• Example: induction motor
• DBN (Deep Belief Network)
• A generative graphical model
• Stacking RBMs
• Trained on vibration (frequency) measurement data
• → Contrastive divergence using Gibbs sampling
• Chemistry/Materials Science
• Medical outlier detection
• Anomaly detection problem complexities
• Unknownness
• Anomalies are associated with many unknowns, e.g., instances with unknown abrupt behaviors, data structures, and distributions. They remain unknown until they actually occur
• Heterogeneous anomaly classes
• Anomalies are irregular, and thus one class of anomalies may demonstrate completely different abnormal characteristics from another class of anomalies
• Rarity and class imbalance
• → unavailability of large-scale labeled data in most applications = class imbalance
• Diverse types of anomaly
• Three different types of anomaly have been explored:
• Point anomalies
• Conditional anomalies = contextual anomalies
• Group anomalies = collective anomalies
• Problems that Deep Anomaly Detection attempts to solve
• CH1: Low anomaly detection recall rate.
• CH2: Anomaly detection in high-dimensional and/or not-independent data.
• CH3: Data-efficient learning of normality/abnormality.
• CH4: Noise-resilient anomaly detection.
• CH5: Detection of complex anomalies.
• CH6: Anomaly explanation.
• Three frameworks of Deep Anomaly Detection approaches