Outlier Detection Using Integrated
Feature Selection Algorithms in
High-Dimensional Data
Under the esteemed guidance of
Dr. D. Naga Raju, Ph.D.
Professor
Presented by
M. Rao Batchanaboyina
A seminar
on
OUTLIERS
Definition:
An outlier is an observation which deviates so much from the
other observations as to arouse suspicions that it was
generated by a different mechanism.
Outliers are also referred to as abnormalities, discordants,
deviants, or anomalies.
Useful Applications:
Intrusion Detection Systems
Interesting Sensor Events
Medical Diagnosis
Law Enforcement
Earth Science
Outlier Detection Models:
Extreme Value Analysis
Probabilistic and Statistical Models
Linear Models
Proximity-Based Models
Information-Theoretic Models
High Dimensional Outlier Detection
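The first of these models, extreme value analysis, can be sketched with a simple z-score test (a minimal Python illustration; the function name and toy data are hypothetical, not from the seminar):

```python
def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score (distance from the mean in units of
    standard deviation) exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if std > 0 and abs(v - mean) / std > threshold]

# one extreme value hidden among ordinary observations
data = [10, 11, 9, 10, 12, 10, 11, 95]
print(zscore_outliers(data, threshold=2.0))  # -> [95]
```

Note that this works only in low dimensionality; it is exactly the kind of full-space test that breaks down for high-dimensional data, as the later slides explain.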
High-Dimensional Data:
The dimensionality N is considered large if it is in the range
of hundreds.
In recent applications of feature selection, however, the
dimensionality can be not only in the tens but also in the
hundreds or thousands.
High-Dimensional Outlier Detection Methods:
The Subspace Method:
Projected Outliers with Grids
Distance-Based Subspace Outlier Detection
Combining Outliers from Multiple Subspaces
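The grid-based projected-outlier idea can be sketched as follows: project the data onto a small subspace, bucket the projections into grid cells, and flag points that fall in sparse cells (a toy Python sketch; the function name and data are illustrative):

```python
from collections import Counter

def sparse_cells(points, dims, bins=3):
    """Project points onto the subspace `dims`, bucket each projection
    into a grid with `bins` cells per dimension, and report points that
    sit alone in a cell as candidate projected outliers."""
    cells = Counter()
    cell_of = {}
    for i, p in enumerate(points):
        cell = tuple(min(int(p[d] * bins), bins - 1) for d in dims)
        cells[cell] += 1
        cell_of[i] = cell
    return [i for i, c in cell_of.items() if cells[c] == 1]

# three clustered points and one isolated point in the (0, 1) subspace
pts = [(0.1, 0.1, 0.5), (0.15, 0.12, 0.9), (0.12, 0.14, 0.1), (0.9, 0.95, 0.5)]
print(sparse_cells(pts, dims=(0, 1)))  # -> [3]
```

A real projected-outlier algorithm would search over many candidate subspaces rather than fixing one; this sketch only shows the per-subspace grid test.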
High-Dimensional Outlier Detection:
When outlier detection is performed on high-dimensional data
with conventional algorithms, they suffer from the well-known
artifact called the "curse of dimensionality":
in high-dimensional space the data becomes sparse, and true
outliers become masked by the noise effects of multiple
dimensions when the data is analyzed in full dimensionality.
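The masking effect can be demonstrated directly: as dimensionality grows, distances between random points concentrate, so the contrast between the nearest and farthest points shrinks (a small stdlib-only Python demonstration; the parameters are arbitrary):

```python
import math
import random

def distance_ratio(dim, n_points=200, seed=0):
    """Relative contrast (max - min) / min of distances from the origin
    for random points in [0,1]^dim; it shrinks as dim grows, which is
    one face of the curse of dimensionality."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 20, 200):
    print(d, round(distance_ratio(d), 3))  # contrast drops with dimension
```

With little contrast between near and far neighbours, distance-based outlier scores computed in full dimensionality lose their discriminative power, which motivates subspace methods and feature selection.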
Integrating Feature Selection Algorithms:
Feature subset selection involves four steps:
1. Subset Generation
2. Subset Evaluation
3. Stopping Criteria
4. Result Validation
Feature subset selection:
Subset Generation:
Subset generation follows two approaches:
1. Forward approach
2. Backward approach
To implement these approaches efficiently, the following
search strategies are used:
a. Complete search: branch and bound
b. Sequential search: greedy hill climbing
c. Random search: random-start hill climbing
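The forward approach with a greedy (hill-climbing) sequential search can be sketched as follows (a hypothetical Python sketch; the scoring function is a toy stand-in for a real evaluation criterion):

```python
def forward_select(features, score, k):
    """Greedy forward search: start from the empty subset and repeatedly
    add the single feature that most improves the subset score; stop at
    k features or when no addition improves the score."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # no addition produces a better subset
        selected.append(best)
        remaining.remove(best)
    return selected

# toy criterion: reward known-useful features, penalize subset size
useful = {"f1", "f3"}
score = lambda s: sum(1 for f in s if f in useful) - 0.1 * len(s)
print(forward_select(["f1", "f2", "f3", "f4"], score, k=3))  # -> ['f1', 'f3']
```

The backward approach is symmetric: start from the full set and greedily remove the feature whose removal most improves the score.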
Feature subset selection:
Subset Evaluation:
Subset evaluation is carried out in two ways:
1. Independent criteria (measured from the data alone)
2. Dependent criteria (measured through the mining algorithm)
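An independent (filter-style) criterion scores a feature from the data alone, without running any mining algorithm; variance is one of the simplest examples (a toy Python sketch; the data is illustrative):

```python
def variance_score(column):
    """Independent criterion: rank a feature by its variance, computed
    from the data itself; a near-constant feature carries no information."""
    n = len(column)
    mean = sum(column) / n
    return sum((x - mean) ** 2 for x in column) / n

cols = {"age": [20, 45, 33, 60], "const": [1, 1, 1, 1]}
ranked = sorted(cols, key=lambda c: variance_score(cols[c]), reverse=True)
print(ranked)  # -> ['age', 'const']
```

Other independent criteria (distance, dependency, consistency measures) follow the same pattern: a cheap statistic of the data replaces a full run of the learner.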
Feature subset selection:
Stopping Criteria:
A stopping criterion determines when the feature selection
process stops.
Some frequently used stopping criteria are:
1. The search completes.
2. A given bound is reached.
3. Subsequent addition (or deletion) of any feature
does not produce a better subset.
4. A sufficiently good subset is selected.
Feature subset selection:
Result Validation:
A straightforward way to validate the result is to measure it
directly using prior knowledge about the data.
In real-world applications, however, we usually do not have
such prior knowledge; hence we have to rely on indirect
methods, monitoring the change in mining performance as the
feature set changes.
Feature selection steps:
Original feature set (from high-dimensional data) → Subset Generation →
Subset Evaluation (goodness of subset) → Stopping Criteria:
if not met (NO), generate the next subset; if met (YES), proceed to
Result Validation.
[Figure: the four key steps of feature selection]
Design of an Integrated System for Intelligent Feature Selection:
It is a two-step process.
1. A Unifying Platform:
The unifying platform serves two purposes:
a) to group existing algorithms with similar
characteristics and investigate their strengths and
weaknesses on the same platform;
b) to provide a guideline for building an intelligent feature
selection system.
Categorizing Framework for Feature Selection Algorithms:
[Figure: categorizing framework for feature selection algorithms in a
three-dimensional framework]
Design of an Integrated System for Intelligent Feature Selection:
2. An Integrated System
[Figure: a preliminary integrated system]
Feature Selection with Large Dimensionality:
From the framework for feature selection, filter model
algorithms appear preferable to wrapper model algorithms
when dealing with large dimensionality, since filter model
algorithms use evaluation criteria that are less
computationally expensive than those of wrapper model
algorithms.
Recently, hybrid model algorithms have been considered for
handling data sets with high dimensionality.
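The cost difference can be made concrete: a dependent (wrapper-style) criterion must run the mining algorithm itself for every candidate subset, for example leave-one-out nearest-neighbour accuracy (a toy Python sketch; data and names are illustrative):

```python
def loo_accuracy(X, y, subset):
    """Wrapper criterion: leave-one-out accuracy of a 1-nearest-neighbour
    vote using only the chosen feature subset; every evaluation runs the
    learner itself, which is what makes wrappers expensive."""
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in subset)
    correct = 0
    for i in range(len(X)):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: dist(X[i], X[k]))
        correct += y[j] == y[i]
    return correct / len(X)

X = [(0, 5), (1, 9), (0, 2), (9, 4), (8, 7), (9, 1)]
y = [0, 0, 0, 1, 1, 1]
print(loo_accuracy(X, y, subset=(0,)))  # first feature separates classes -> 1.0
print(loo_accuracy(X, y, subset=(1,)))  # second feature is noise -> 0.0
```

Each call costs a full pass of the learner over the data, and a subset search makes many such calls; a filter criterion (e.g. variance or correlation) costs one pass over the data per feature, which is why filters scale better to large dimensionality.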
Real-World Applications of Feature Selection:
Intrusion Detection
Genomic Analysis
Image Retrieval
Customer Relationship Management
Existing System vs. Proposed System
Conclusion:
A new framework that applies high-dimensional outlier
detection algorithms integrated with feature selection is
used to detect outliers in high-dimensional data.
References
[1] H. Liu and L. Yu, "Toward Integrating Feature Selection Algorithms for Classification
and Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4,
April 2005.
[2] C. C. Aggarwal, Outlier Analysis, Springer Science+Business Media, New York, 2013.
[3] M. Dash and H. Liu, "Feature Selection for Clustering," Proc.
Fourth Pacific-Asia Conf. Knowledge Discovery and Data Mining
(PAKDD-2000), pp. 110-121, 2000.
[4] M. Devaney and A. Ram, "Efficient Feature Selection in
Conceptual Clustering," Proc. 14th Int'l Conf. Machine Learning,
pp. 92-97, 1997.
[5] "Feature Selection: An Ever Evolving Frontier in Data Mining," JMLR: Workshop and
Conference Proceedings 10: 4-13, The Fourth Workshop on Feature Selection in Data
Mining.
Queries?
Thank you
