With the advancement of time and technology, outlier mining methodologies help to sift through large volumes of data for interesting patterns and to winnow out malicious data entering any field of concern. It has become indispensable to build not only a robust, generalised model for anomaly detection but also to equip that model with high accuracy and precision. Although the K-means algorithm is one of the most popular and simplest unsupervised clustering algorithms, it can be dovetailed with PCA, hubness, and a robust Gaussian mixture model to build a highly generalised and robust anomaly detection system. A major weakness of the K-means algorithm is its tendency to converge to local minima, producing clusters that lead to ambiguity. In this paper, an attempt is made to combine the K-means algorithm with the PCA technique, resulting in more closely centred clusters on which K-means works more accurately.
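The PCA-then-cluster pipeline described above can be sketched in a few lines. The example below is a stdlib-only illustration, not the paper's implementation: the data, the 2-D-to-1-D reduction, and all parameter choices are assumptions for the sketch. It extracts the dominant principal direction of 2-D points, projects onto it, and runs 1-D k-means on the projections.

```python
import math, random

def pca_1d(points):
    """Return (dominant eigenvector, mean) of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Larger eigenvalue of [[sxx, sxy], [sxy, syy]], then its eigenvector.
    lam = 0.5 * (sxx + syy) + math.sqrt(0.25 * (sxx - syy) ** 2 + sxy * sxy)
    v = (lam - syy, sxy) if abs(sxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm), (mx, my)

def project(points, axis, mean):
    """Project centred points onto the principal axis (2-D -> 1-D)."""
    return [(p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1] for p in points]

def kmeans_1d(xs, k=2, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return sorted(centers)

data = [(0, 0), (1, 1), (1, 0), (9, 9), (10, 10), (10, 9)]
axis, mean = pca_1d(data)
centers = kmeans_1d(project(data, axis, mean), k=2)
```

Because the projections live in the denoised principal subspace, the two centroids separate the blobs cleanly, which is the "more closely centred clusters" effect the abstract claims.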
Detection of Outliers in Large Dataset using Distributed Approach (Editor IJMTER)
In this paper, a distributed method is introduced for detecting distance-based outliers in very large data sets. The approach is based on the concept of an outlier detection solving set, a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel computation to obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed scheme exhibits excellent performance. From a theoretical point of view, for common settings, the temporal cost of the algorithm is expected to be at least three orders of magnitude lower than that of the classical nested-loop approach to outlier detection. Experimental results show that the algorithm is efficient and that its running time scales well with an increasing number of nodes. A variant of the basic strategy is also discussed which reduces the amount of data to be transferred, improving both the communication cost and the overall runtime. Importantly, the solving set computed in a distributed environment has the same quality as that produced by the corresponding centralized method.
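The nested-loop baseline that the paper compares against is easy to state: a point is a distance-based outlier if fewer than k other points lie within radius R of it (the usual DB(k, R) definition). The sketch below is a minimal stdlib version of that baseline, not the paper's distributed algorithm; the data and thresholds are illustrative.

```python
import math

def nested_loop_outliers(points, k, R):
    """Classical O(n^2) nested-loop distance-based outlier detection."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = 0
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= R:
                neighbours += 1
                if neighbours >= k:        # early exit once p is proven an inlier
                    break
        if neighbours < k:
            outliers.append(p)
    return outliers

pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
print(nested_loop_outliers(pts, k=2, R=1.0))   # -> [(5, 5)]
```

The quadratic cost of this loop over very large data sets is exactly what motivates the solving-set and distributed strategies in the paper.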
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS (ijdkp)
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high dimensional data. Many significant subspace clustering algorithms exist, each with different characteristics arising from the techniques, assumptions, and heuristics used. A comprehensive classification scheme is essential that considers all such characteristics to divide subspace clustering approaches into families, where the algorithms belonging to the same family satisfy common characteristics. Such a categorization will help future developers better understand the quality criteria to use and the similar algorithms against which to compare their proposed clustering algorithms. In this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms' Family). The characteristics of an SCAF are based on classes such as cluster orientation and overlap of dimensions. As an illustration, we further provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the "Axis parallel, overlapping, density based" SCAF.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation, and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
Clustering heterogeneous categorical data using enhanced mini batch K-means ... (IJECEIAES)
Clustering methods in data mining aim to group a set of patterns based on their similarity. In survey data, heterogeneous information is recorded on various types of scales, such as nominal, ordinal, binary, and Likert scales. A lack of treatment of heterogeneous data and information leads to loss of information and poor decision-making. Although many similarity measures have been established, solutions for heterogeneous data in clustering are still lacking. The recent entropy distance measure appears to provide good results for heterogeneous categorical data, but it requires many experiments and evaluations. This article presents a proposed framework for heterogeneous categorical data using mini batch k-means with an entropy measure (MBKEM), which investigates the effectiveness of the similarity measure when clustering heterogeneous categorical data. Secondary data from a public survey were used. The findings demonstrate that the proposed framework improves clustering quality. MBKEM outperformed other clustering algorithms, with accuracy of 0.88, V-measure (VM) of 0.82, adjusted Rand index (ARI) of 0.87, and Fowlkes-Mallows index (FMI) of 0.94. The average minimum elapsed time for cluster generation was 0.26 s. In the future, the proposed solution could improve the quality of clustering for heterogeneous categorical data problems in many domains.
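The mini-batch k-means core that MBKEM builds on can be sketched compactly: each batch updates only the nearest centre, with a per-centre learning rate of 1/count (the standard mini-batch scheme). This stdlib sketch uses Euclidean distance on toy numeric points; the paper's entropy-based measure for categorical data is its contribution and is not reproduced here.

```python
import math, random

def mini_batch_kmeans(points, k, batch_size=4, epochs=30, seed=1):
    """Mini-batch k-means: per-centre learning rate eta = 1/count."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(epochs):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            counts[i] += 1
            eta = 1.0 / counts[i]               # learning rate decays per centre
            centres[i] = [(1 - eta) * c + eta * x for c, x in zip(centres[i], p)]
    return centres

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
centres = sorted(mini_batch_kmeans(pts, k=2), key=lambda c: c[0])
```

Because each batch touches only a handful of points, the per-iteration cost is independent of the data set size, which is why mini-batch variants suit large survey data.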
Anomaly Detection using Multidimensional Reduction Principal Component Analysis (IOSR Journals)
Anomaly detection has been an important research topic in data mining and machine learning. Many real-world applications, such as intrusion or credit card fraud detection, require an effective and efficient framework to identify deviating data instances. However, most anomaly detection methods are typically implemented in batch mode and thus cannot easily be extended to large-scale problems without sacrificing computation and memory requirements. In this paper, we propose the multidimensional reduction principal component analysis (MdrPCA) algorithm to address this problem, aiming to detect the presence of outliers in a large amount of data via an online updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not store the entire data matrix or covariance matrix, so our approach is especially of interest for online or large-scale problems. By applying multidimensional reduction PCA to the target instance and extracting the principal direction of the data, the proposed MdrPCA determines the anomaly of the target instance according to the variation of the resulting dominant eigenvector. Since MdrPCA need not perform eigen-analysis explicitly, the proposed framework is well suited to online applications with computation or memory limitations. The approach is compared with the well-known power method for PCA and other popular anomaly detection algorithms.
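The online-PCA idea of tracking the dominant eigenvector without storing the data or covariance matrix can be sketched with Oja's rule. This is a generic stand-in, not MdrPCA itself (the paper's exact update is not given here); the stream, learning rate, and residual-based anomaly score are all assumptions of the sketch.

```python
import math, random

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def oja_update(w, x, eta=0.05):
    """One online step of Oja's rule toward the dominant eigenvector."""
    y = sum(wi * xi for wi, xi in zip(w, x))      # projection onto current w
    return normalise([wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)])

def anomaly_score(w, x):
    """Fraction of x's energy outside the learned principal direction."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    residual = [xi - y * wi for wi, xi in zip(w, x)]
    return math.sqrt(sum(r * r for r in residual)) / math.sqrt(sum(xi * xi for xi in x))

random.seed(0)
w = normalise([1.0, 0.5])
for _ in range(500):                # stream of points along the direction (1, 1)
    t = random.gauss(0, 1)
    w = oja_update(w, [t, t + random.gauss(0, 0.05)])
```

After the stream, `w` has drifted to roughly (1, 1)/sqrt(2), so an instance aligned with the data scores near 0 while one orthogonal to it scores near 1; only the current eigenvector estimate is kept in memory, matching the abstract's online, low-memory setting.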
Data mining is used to manage the huge volumes of information stored in data warehouses and databases, in order to discover required information and knowledge. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering, and the area has attracted attention for many years. Among the available data mining strategies, clustering of the dataset is one of the best known and most effective. It groups the dataset into a number of clusters based on certain predefined guidelines and can reliably discover the relationships between the distinct characteristics of the data.
In the k-means clustering algorithm, the feature is selected on the basis of its relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed for clustering the data points. In this work, the author enhanced the Euclidean distance formula to increase cluster quality.
The problem of accuracy and of redundant, dissimilar points in the clusters remains in the improved k-means, for which a new enhanced approach has been proposed that uses a similarity function to check the similarity level of a point before including it in a cluster.
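The similarity-gated assignment described above can be sketched as follows: a point joins its nearest cluster only if its similarity to that centroid clears a threshold, and is otherwise held out as a potential outlier. The similarity function (inverse distance) and the threshold are illustrative assumptions, not the paper's exact formula.

```python
import math

def gated_assign(points, centroids, min_sim=0.2):
    """Assign each point to its nearest centroid only if similar enough."""
    clusters = {i: [] for i in range(len(centroids))}
    held_out = []                                   # potential outliers
    for p in points:
        i = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        sim = 1.0 / (1.0 + math.dist(p, centroids[i]))   # similarity in (0, 1]
        (clusters[i] if sim >= min_sim else held_out).append(p)
    return clusters, held_out

cents = [(0.0, 0.0), (10.0, 10.0)]
pts = [(0, 1), (1, 0), (9, 10), (50, 50)]
clusters, held = gated_assign(pts, cents)           # held -> [(50, 50)]
```

Gating the assignment this way keeps a distant point like (50, 50) from inflating a cluster and dragging its centroid, which is the redundancy/accuracy problem the enhanced approach targets.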
An H-K clustering algorithm for high dimensional data using ensemble learning (ijitcs)
Advances made to traditional clustering algorithms solve various problems such as the curse of dimensionality and the sparsity of data across multiple attributes. The traditional H-K clustering algorithm can address the randomness and a-priori choice of the initial centers in the K-means clustering algorithm, but when applied to high dimensional data it suffers from the dimensional disaster problem due to high computational complexity. Advanced clustering algorithms such as subspace and ensemble clustering improve the performance of clustering high dimensional datasets from different aspects and to different extents, yet each improves performance from only a single perspective. The objective of the proposed model is to improve the performance of traditional H-K clustering and overcome its limitations, namely high computational complexity and poor accuracy on high dimensional data, by combining subspace clustering and ensemble clustering with the H-K clustering algorithm.
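The core H-K idea, deriving k-means' initial centres from a hierarchical pass so that seeding is no longer random, can be sketched deterministically. The sketch below uses centroid-linkage agglomerative merging (an assumption; H-K variants differ) and omits the subspace and ensemble extensions the abstract proposes.

```python
import math

def centroid(cluster):
    dims = range(len(cluster[0]))
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in dims)

def hk_seeds(points, k):
    """Agglomerative (centroid-linkage) merging down to k seed centres."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [centroid(c) for c in clusters]

def kmeans(points, centres, iters=10):
    """Standard k-means refinement starting from the hierarchical seeds."""
    for _ in range(iters):
        groups = [[] for _ in centres]
        for p in points:
            groups[min(range(len(centres)),
                       key=lambda i: math.dist(p, centres[i]))].append(p)
        centres = [centroid(g) if g else centres[i] for i, g in enumerate(groups)]
    return centres

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
centres = sorted(kmeans(pts, hk_seeds(pts, 2)))
```

The hierarchical pass is O(n^3) as written, which illustrates the "high computational complexity" limitation the abstract says the proposed model must overcome for high dimensional data.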
SURVEY PAPER ON OUTLIER DETECTION USING FUZZY LOGIC BASED METHOD (IJCI JOURNAL)
Fuzzy logic can be used to reason like humans and can deal with uncertainty beyond randomness. Outlier detection is a difficult task because of the uncertainty involved in it: the outlier itself is a fuzzy concept and is difficult to determine in a deterministic way. Fuzzy logic systems are therefore promising, since they directly tackle the situations associated with outliers and address the seemingly conflicting goals of (i) removing noise and (ii) smoothing out outliers while preserving other salient features. This paper provides a detailed review of fuzzy logic methods used for outlier detection, discussing their pros and cons, and is thus a helpful document for new researchers in this field.
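One common way the fuzzy view of outliers is operationalised is to give each point a graded membership in every cluster (as in fuzzy c-means) and flag points whose strongest membership is still weak. The centres, fuzzifier m, and cutoff below are illustrative assumptions, not from any specific surveyed method.

```python
import math

def memberships(x, centres, m=2.0):
    """Fuzzy c-means style memberships of x in each centre; they sum to 1."""
    d = [max(math.dist(x, c), 1e-12) for c in centres]
    return [1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(len(centres)))
            for i in range(len(centres))]

def fuzzy_outlier_score(x, centres):
    """High score when even the best membership is weak: a graded outlier notion."""
    return 1.0 - max(memberships(x, centres))

cents = [(0.0, 0.0), (10.0, 10.0)]
print(fuzzy_outlier_score((0.0, 0.0), cents))   # near a centre: ~0.0
print(fuzzy_outlier_score((5.0, 5.0), cents))   # equidistant point: 0.5
```

A known caveat: because memberships sum to 1, a point far from every centre but nearest to one still gets a high membership, so practical fuzzy detectors often add a distance term (possibilistic variants) on top of this score.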
A Survey on Cluster Based Outlier Detection Techniques in Data Stream (IIRindia)
In recent years, Data Mining (DM) has been an emerging area of computational intelligence that provides new techniques, algorithms, and tools for processing large volumes of data. Clustering is the most popular data mining technique today; it separates a dataset into groups that exhibit intra-group similarity and inter-group dissimilarity. Outlier (anomaly) detection finds small groups of data objects that differ from the rest of the data, and it is an essential part of mining in data streams. A Data Stream (DS) is mined as a continuous arrival of high-speed data items and plays an important role in telecommunication services, e-commerce, tracking customer behaviour, and medical analysis. Detecting outliers over data streams is an active research area. This survey presents an overview of fundamental outlier detection approaches and the various types of outlier detection methods for data streams.
Principle Component Analysis Based on Optimal Centroid Selection Model for Su... (ijtsrd)
Clustering large, sparse, large-scale data is an open research problem in data mining. Discovering significant information through clustering algorithms remains inadequate, as most of the data turns out to be non-actionable, and existing clustering techniques are not feasible for time-varying data in high dimensional space. Subspace clustering addresses these problems through the incorporation of domain knowledge and parameter-sensitive prediction; the sensitiveness of the data is also predicted through a thresholding mechanism. The problems of usability and usefulness in 3D subspace clustering are important issues, and determining the correct dimension is an inconsistent and challenging problem in subspace clustering. The solution is highly helpful for police departments and law enforcement organisations to better understand stock issues, providing insights that enable them to track activities and predict their likelihood. In this work, a Centroid-based Subspace Forecasting Framework with constraints (must-link and must-not-link) and domain knowledge is proposed. In the unsupervised subspace clustering algorithm, inconsistent constraints correlating to dimensions are resolved through singular value decomposition. Principal component analysis is used to estimate the actionable strength of particular attributes, utilizing domain knowledge to refine and validate the optimal centroids dynamically. Experimental results show that the proposed framework outperforms competing subspace clustering techniques in terms of efficiency, F-measure, parameter insensitiveness, and accuracy.
G. Raj Kamal, A. Deepika, D. Pavithra, J. Mohammed Nadeem, V. Prasath Kumar, "Principle Component Analysis Based on Optimal Centroid Selection Model for SubSpace Clustering Model", International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-4, June 2020. URL: https://www.ijtsrd.com/papers/ijtsrd31374.pdf
An Efficient Unsupervised Adaptive Antihub Technique for Outlier Detection in ... (theijes)
The International Journal of Engineering & Science aims to provide a platform for researchers, engineers, scientists, and educators to publish their original research results, exchange new ideas, and disseminate information on innovative designs, engineering experiences, and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal are blind peer-reviewed, and only original articles are published.
Papers for publication in The International Journal of Engineering & Science are selected through rigorous peer review to ensure originality, timeliness, relevance, and readability.
K-Medoids Clustering Using Partitioning Around Medoids for Performing Face Re... (ijscmcj)
Face recognition is one of the most unobtrusive biometric techniques and can be used for access control as well as surveillance purposes. Various methods for implementing face recognition have been proposed, with varying degrees of performance in different scenarios. The most common issue with effective facial biometric systems is their high susceptibility to variations in the face owing to factors such as changes in pose, varying illumination, different expressions, and the presence of outliers and noise. This paper explores a novel technique for face recognition that classifies face images with an unsupervised learning approach, K-Medoids clustering, using the Partitioning Around Medoids (PAM) algorithm. The results suggest increased robustness to noise and outliers in comparison to other clustering methods. The technique can therefore be used to increase the overall robustness of a face recognition system, increasing its invariance and making it a reliably usable biometric modality.
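PAM's robustness claim rests on two properties: medoids are actual data points, and a swap is kept only when it lowers the total assignment cost, so a single extreme point cannot drag a centre the way it drags a mean. A compact stdlib sketch of the swap loop (initialisation and data are illustrative assumptions):

```python
import math

def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Greedy PAM: accept any medoid/non-medoid swap that lowers total cost."""
    medoids = list(points[:k])              # deterministic initial medoids
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                trial = [p if x == m else x for x in medoids]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return medoids

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (30, 30)]
medoids = sorted(pam(pts, 2))
```

With the lone point (30, 30) present, both medoids still land inside the two dense blobs, whereas a k-means mean for the second cluster would be pulled toward the outlier.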
Comprehensive Survey of Data Classification & Prediction Techniques (ijsrd.com)
In this paper, we present a literature survey of modern data classification and prediction algorithms. These algorithms are very important in real-world applications such as heart disease prediction and cancer prediction. Classification of data is a popular and computationally expensive task. The fundamentals of data classification are also discussed in brief.
Outlier detection is an interesting, useful, and challenging problem in the field of data mining. Because the data is sparse, clustering algorithms that are based on distance will not work to find outliers in spatial data, so the problem of finding irregular features in spatial data needs to be explored. Many existing approaches have been proposed to overcome the problem of outlier detection in spatial geographic data. In this paper, an efficient clustering- and density-based outlier detection framework is proposed. The process of outlier detection is divided into two steps: in the first step, the data is clustered using the density-based DBSCAN algorithm, and in the second step, outlier detection is performed using LOF (Local Outlier Factor). The purpose is to perform clustering and outlier mining simultaneously to improve the feasibility of the framework. To verify the efficiency and robustness of the proposed method, a comparative study of the proposed approach and several existing approaches is presented in detail, and various simulation results demonstrate the effectiveness of the proposed approach.
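The two-stage idea can be sketched minimally: a density pass (the DBSCAN core-point test) first separates clearly clustered points, and the textbook LOF score is then computed only for the remaining candidates. This is a simplification of the proposed framework, not its implementation; eps, min_pts, k, and the data are illustrative.

```python
import math

def is_core(points, p, eps, min_pts):
    """DBSCAN-style core test: at least min_pts neighbours within radius eps."""
    return sum(1 for q in points if q != p and math.dist(p, q) <= eps) >= min_pts

def knn(points, p, k):
    return sorted((q for q in points if q != p), key=lambda q: math.dist(p, q))[:k]

def k_dist(points, p, k):
    return math.dist(p, knn(points, p, k)[-1])

def lrd(points, p, k):
    """Local reachability density of p."""
    nn = knn(points, p, k)
    reach = [max(k_dist(points, q, k), math.dist(p, q)) for q in nn]
    return len(nn) / sum(reach)

def lof(points, p, k):
    """LOF: average neighbour density relative to p's own density."""
    nn = knn(points, p, k)
    return sum(lrd(points, q, k) for q in nn) / (len(nn) * lrd(points, p, k))

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
candidates = [p for p in pts if not is_core(pts, p, eps=1.5, min_pts=3)]
scores = {p: lof(pts, p, k=3) for p in candidates}
```

Points inside the dense blob pass the core test and are never scored, so only the isolated point (8, 8) reaches the LOF stage, where its score far exceeds the usual ~1 value of inliers; this filtering is what makes the combined pipeline cheaper than scoring every point.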
Many data mining and knowledge discovery methodologies and process models have been developed with varying degrees of success. There are three main methods used to discover patterns in data: KDD, SEMMA, and CRISP-DM. They are presented in many publications in the area and are used in practice. To our knowledge, there is no clear methodology developed to support link mining. However, there is a well-known methodology in knowledge discovery in databases, the Cross Industry Standard Process for Data Mining (CRISP-DM), developed by a consortium of several industrial companies, which can be relevant to the study of link mining. In this study, CRISP-DM has been adapted to the field of link mining to detect anomalies. An important goal in link mining is the task of inferring links that are not yet known in a given network. This approach is implemented through a case study of real-world co-citation data. The case study uses mutual information to interpret the semantics of anomalies identified in the co-citation dataset, which can provide valuable insights in determining the nature of a given link and potentially identifying important future link relationships.
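The mutual-information step can be sketched as pointwise mutual information over co-citation counts: links whose observed co-occurrence departs strongly from what the marginal citation frequencies predict are surfaced for inspection. The toy counts below are illustrative, not from the case study.

```python
import math
from collections import Counter

# Each tuple is one observed co-citation of two papers (toy data).
cocitations = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("D", "E")]

pair_counts = Counter(frozenset(p) for p in cocitations)
item_counts = Counter(x for p in cocitations for x in p)
n_pairs = len(cocitations)

def pmi(x, y):
    """Pointwise mutual information of the co-citation link (x, y)."""
    p_xy = pair_counts[frozenset((x, y))] / n_pairs
    p_x = item_counts[x] / (2 * n_pairs)      # 2n item slots across all pairs
    p_y = item_counts[y] / (2 * n_pairs)
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("A", "B"), 2), round(pmi("D", "E"), 2))   # 2.17 4.58
```

Here the rare pair (D, E) co-occurs far more than its marginals predict, so its PMI is the highest: exactly the kind of link an anomaly-oriented CRISP-DM pass would flag for semantic interpretation.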
Democratizing Fuzzing at Scale by Abhishek Arya (abh.arya)
Presented at NUS: Fuzzing and Software Security Summer School 2024
This keynote discusses the democratization of fuzzing at scale, highlighting the collaboration between open source communities, academia, and industry to advance the field of fuzzing. It delves into the history of fuzzing, the development of scalable fuzzing platforms, and the empowerment of community-driven research, and it covers recent advancements leveraging AI/ML along with insights into the future evolution of the fuzzing landscape.
An Efficient Unsupervised AdaptiveAntihub Technique for Outlier Detection in ...theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering& Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
K-Medoids Clustering Using Partitioning Around Medoids for Performing Face Re...ijscmcj
Face recognition is one of the most unobtrusive biometric techniques that can be used for access control as well as surveillance purposes. Various methods for implementing face recognition have been proposed with varying degrees of performance in different scenarios. The most common issue with effective facial biometric systems is high susceptibility of variations in the face owing to different factors like changes in pose, varying illumination, different expression, presence of outliers, noise etc. This paper explores a novel technique for face recognition by performing classification of the face images using unsupervised learning approach through K-Medoids clustering. Partitioning Around Medoids algorithm (PAM) has been used for performing K-Medoids clustering of the data. The results are suggestive of increased robustness to noise and outliers in comparison to other clustering methods. Therefore the technique can also be used to increase the overall robustness of a face recognition system and thereby increase its invariance and make it a reliably usable biometric modality.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Comprehensive Survey of Data Classification & Prediction Techniquesijsrd.com
In this paper, we present an literature survey of the modern data classification and prediction algorithms. All these algorithms are very important in real world applications like- heart disease prediction, cancer prediction etc. Classification of data is a very popular and computationally expensive task. The fundamentals of data classification are also discussed in brief.
Outlier detection is very interesting, useful and challenging problem in the field of data mining. Because of
sparse data clustering algorithm which are based on distance will not work to find outliers in spatial data.
Problem of finding irregular feature in spatial data need to be explore. Many existing approaches have
been proposed to overcome the problem of outlier detection in spatial Geographic data. In this paper an
efficient clustering and density based outlier detection framework has been proposed. The process of
outlier detection has been categorized into two steps in the first step data has been clustered together based
on any density based DBSCAN algorithm and in the second stage outlier detection is performed using LOF.
The purpose is to perform clustering and outlier mining simultaneously to improve feasibility of framework.
To verify the efficiency and robustness of proposed method, comparative study of proposed approach and
several existing approaches are presented in detail, various simulation results demonstrate the
effectiveness of the proposed approach.
Many data mining and knowledge discovery methodologies and process models have been developed,
with varying degrees of success, there are three main methods used to discover patterns in data; KDD,
SEMMA and CRISP-DM. They are presented in many of the publications of the area and are used in
practice. To our knowledge, there is no clear methodology developed to support link mining. However,
there is a well known methodology in knowledge discovery in databases, known as Cross Industry
Standard Process for Data Mining (CRISPDM), developed by a consortium of several industrial
companies which can be relevant to the study of link mining. In this study CRISP-DM has been adapted to
the field of Link mining to detect anomalies. An important goal in link mining is the task of inferring links
that are not yet known in a given network. This approach is implemented through the use of a case study
of realworld data (co-citation data). This case study aims to use mutual information to interpret the
semantics of anomalies identified in co-citation, dataset that can provide valuable insights in determining
the nature of a given link and potentially identifying important future link relationships.
Many data mining and knowledge discovery methodologies and process models have been developed,
with varying degrees of success, there are three main methods used to discover patterns in data; KDD,
SEMMA and CRISP-DM. They are presented in many of the publications of the area and are used in
practice. To our knowledge, there is no clear methodology developed to support link mining. However,
there is a well known methodology in knowledge discovery in databases, known as Cross Industry
Standard Process for Data Mining (CRISPDM), developed by a consortium of several industrial
companies which can be relevant to the study of link mining. In this study CRISP-DM has been adapted to
the field of Link mining to detect anomalies. An important goal in link mining is the task of inferring links
that are not yet known in a given network. This approach is implemented through the use of a case study
of realworld data (co-citation data). This case study aims to use mutual information to interpret the
semantics of anomalies identified in co-citation, dataset that can provide valuable insights in determining
the nature of a given link and potentially identifying important future link relationships
Democratizing Fuzzing at Scale by Abhishek Aryaabh.arya
Presented at NUS: Fuzzing and Software Security Summer School 2024
This keynote talks about the democratization of fuzzing at scale, highlighting the collaboration between open source communities, academia, and industry to advance the field of fuzzing. It delves into the history of fuzzing, the development of scalable fuzzing platforms, and the empowerment of community-driven research. The talk will further discuss recent advancements leveraging AI/ML and offer insights into the future evolution of the fuzzing landscape.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
Maintaining high-quality standards in the production of TMT bars is crucial for ensuring structural integrity in construction. Addressing common defects through careful monitoring, standardized processes, and advanced technology can significantly improve the quality of TMT bars. Continuous training and adherence to quality control measures will also play a pivotal role in minimizing these defects.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Event Management System Vb Net Project Report.pdfKamal Acharya
In present era, the scopes of information technology growing with a very fast .We do not see any are untouched from this industry. The scope of information technology has become wider includes: Business and industry. Household Business, Communication, Education, Entertainment, Science, Medicine, Engineering, Distance Learning, Weather Forecasting. Carrier Searching and so on.
My project named “Event Management System” is software that store and maintained all events coordinated in college. It also helpful to print related reports. My project will help to record the events coordinated by faculties with their Name, Event subject, date & details in an efficient & effective ways.
In my system we have to make a system by which a user can record all events coordinated by a particular faculty. In our proposed system some more featured are added which differs it from the existing system such as security.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
International Journal of Information Technology, Modeling and Computing (IJITMC) Vol. 3, No. 1/2/3/4, November 2015
DOI: 10.5121/ijitmc.2015.3401
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS

Kamal Malik¹ and Harsh Sadawarti²
¹Research Scholar, PTU, ²Director, RIMT, Punjab Technical University, India
ABSTRACT
With the advancement of time and technology, outlier mining methodologies help to sift through large amounts of data for interesting patterns and to winnow out malicious data entering any field of concern. It has become indispensable to build not only a robust and generalised model for anomaly detection but also to equip that model with features such as high accuracy and precision. Although the K-means algorithm is one of the most popular and simplest unsupervised clustering algorithms, it can be dovetailed with PCA, hubness, and the robust model formed from a Gaussian mixture to build a very generalised and robust anomaly detection system. A major loophole of the K-means algorithm is its tendency to converge to a local minimum, resulting in clusters that lead to ambiguity. In this paper, an attempt is made to combine the K-means algorithm with PCA, which results in the formation of more closely centred clusters on which K-means works more accurately. This combination not only provides a great boost to the detection of outliers but also enhances accuracy and precision.
KEYWORDS
PCA, Hubness, Clustering, HDD, Projected outliers
1. INTRODUCTION
Outlier or anomaly detection refers to the task of identifying deviated patterns that do not conform to the established regular behaviour [1]. Although there is no rigid mathematical definition of an outlier, outliers have very strong practical applicability [2]. They are widely used in many applications, such as intrusion detection, fraud detection, and medical diagnosis, where crucial and actionable information is required. The detection of outliers can be categorised into supervised, semi-supervised, and unsupervised, depending on the existence of labels for the outliers and the regular instances. Accurate and representative labels are comparatively expensive and difficult to obtain; hence, among these three, unsupervised methods are usually applied. Distance-based methods are among the most important unsupervised approaches; they usually rely on a measure of distance or similarity in order to detect deviated observations. A common and well-known problem in higher dimensional data, due to the "curse of dimensionality", is that distance becomes meaningless [3], since distance measures concentrate, i.e., pairs of data points become equidistant from each other as the dimensionality grows. Hence every point in a higher dimensional space becomes almost equally an outlier, which makes detection quite challenging. According to the nature of the data processing, outlier detection methods can be categorised as either supervised or unsupervised. The former is based on supervised classification methods that need the availability of ground truth in order to derive a suitable training set for the learning process of the classifiers. The latter approach does not have training data available and hence does not assume any generalised pattern.
The hubness phenomenon is the observation that, as the dimensionality of a data set increases, the distribution of the number of times each data point occurs among the k-nearest neighbours of other data points becomes increasingly skewed to the right. Effectively and efficiently detecting projected outliers in higher dimensional data is itself a great challenge for traditional data mining techniques. In this paper, instead of attempting to avoid the curse of dimensionality by restricting attention to lower dimensional feature subspaces, the dimensionality is embraced by taking advantage of an inherent higher dimensional phenomenon: hubness. Hubness is the tendency of higher dimensional data to contain hubs, points that frequently occur in the k-nearest-neighbour lists of other points, and it can be successfully exploited in clustering.
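As a small sketch of this phenomenon (not taken from the paper; sample sizes and k are illustrative), the k-occurrence counts N_k and their skewness can be computed directly with NumPy; the right-skew of N_k grows with the dimensionality of the data:

```python
# Hubness sketch: N_k[i] counts how often point i appears among the k nearest
# neighbours of the other points; its skewness grows with dimensionality.
import numpy as np

def k_occurrences(X, k=5):
    # Squared Euclidean distances via ||a-b||^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbour
    knn = np.argsort(d2, axis=1)[:, :k]   # indices of each point's k-NN
    return np.bincount(knn.ravel(), minlength=len(X))

def skewness(x):
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
low_d = skewness(k_occurrences(rng.standard_normal((500, 3))))
high_d = skewness(k_occurrences(rng.standard_normal((500, 100))))
print(low_d, high_d)  # the 100-dimensional sample is markedly more skewed
```

With Gaussian data the 3-dimensional sample shows little skew, while at 100 dimensions a few hubs dominate the neighbour lists.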
Our motivation is based on the following factors:
- The notion that distance becomes indiscernible as the dimensionality of the data grows, so that each point becomes an equally good outlier in higher dimensional data space.
- Hubness can serve as a good measure of the local centrality of a point in higher dimensional data clusters, and the major hubs can be effectively used as cluster prototypes while searching for a centroid-based cluster configuration.
Our contribution in this paper is:
- First, the various challenges of higher dimensionality in the case of unsupervised learning are discussed, including the curse of dimensionality and attribute relevance analysis.
- The phenomenon of hubness, which has a great impact on reverse-nearest-neighbour counts, is discussed.
- Projected outliers are highlighted by linking the K-means algorithm with PCA and the hybrid model formed from a Gaussian mixture.
2. RELATED WORKS
In the past, many outlier detection methodologies have been proposed [4][5][6][7][8][9][10][11][12]. They are broadly categorised into distribution-based, distance-based, and density-based methods. Statistical (distribution-based) methods make assumptions about the underlying data distribution and aim to find the outliers that deviate from it. Most statistical distribution models are univariate, as multivariate distributions lack robustness. The solutions of statistical models suffer from noise present in the data, since the assumptions or prior knowledge of the data distribution cannot easily be determined for practical problems. In distance-based methods, the distance between each point of interest and its neighbours is calculated and compared with a predetermined threshold; data points that lie beyond the threshold are termed outliers, otherwise they are considered normal points [13]. In the case of multi-clustered structured data, the data distribution is quite complex and no prior knowledge of it is available. In such cases, improper neighbours are determined, which raises the false positive rate.
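The distance-based rule just described can be sketched as follows; the neighbourhood size k and the threshold value are hypothetical choices for illustration, not values prescribed in the literature cited above:

```python
# Distance-based outlier sketch: flag a point when the distance to its k-th
# nearest neighbour exceeds a predetermined threshold.
import numpy as np

def distance_based_outliers(X, k=3, threshold=2.0):
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)                       # exclude self-distance
    # Distance to the k-th nearest neighbour of each point.
    kth = np.sqrt(np.maximum(np.sort(d2, axis=1)[:, k - 1], 0.0))
    return np.flatnonzero(kth > threshold)             # indices beyond threshold

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), [[5.0, 5.0]]])  # one far-away point
print(distance_based_outliers(X))  # → [50]
```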
To alleviate this problem, density-based methods were proposed. LOF is one of the important density-based methods; it measures the outlierness of each data instance, not merely labelling outliers but quantifying a degree of outlierness that provides suspiciousness ranking scores for all samples. Although it identifies outliers in the local data structure via density estimation and also makes the user aware of outliers that are sheltered under the global data structure, the estimation of the local density for each instance is computationally expensive, especially when the data size is very large. Apart from the above work, there are other outlier
detection approaches that have been proposed recently [6][9][10]. Among them, ABOD (angle-based outlier detection) is crucial and unique, as it calculates the variation of the angles between each target instance and the remaining data points; it is usually observed that outliers produce a smaller angle variance than normal points. An extension named Fast ABOD was later proposed to approximate the original ABOD solution. The K-means algorithm is an important clustering algorithm that clusters n objects, based on their attributes, into k partitions, where k < n. It is quite similar to the expectation-maximisation algorithm for Gaussian mixtures in that both attempt to find the centres of the natural clusters in the data. Moreover, it assumes that the object attributes form a vector space. A major drawback of the K-means algorithm is its tendency to converge to a local minimum, resulting in clusters that lead to ambiguity. In this paper, we have tried to mortise the K-means algorithm and a robust Gaussian model with principal component analysis and hubness to enhance their effectiveness, efficiency, and precision.
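A minimal, NumPy-only sketch of this kind of combination (illustrative parameters and synthetic data; not the authors' exact implementation) projects the data onto its leading principal components and then runs K-means on the reduced representation:

```python
# PCA-then-K-means sketch: reduce to the top-q principal components, then cluster.
import numpy as np

def pca_project(X, q):
    Xc = X - X.mean(axis=0)
    V = Xc.T @ Xc / len(Xc)                  # covariance matrix
    _, vecs = np.linalg.eigh(V)              # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :q]                 # top-q principal directions
    return Xc @ W

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]   # centroids seeded from the data
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        # Keep the old centroid if a cluster happens to empty out.
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels, C

rng = np.random.default_rng(0)
# Two well-separated clusters embedded in 20 dimensions.
X = np.vstack([rng.normal(0, 1, (100, 20)), rng.normal(8, 1, (100, 20))])
labels, _ = kmeans(pca_project(X, q=2), k=2)
print(len(set(labels[:100].tolist())), len(set(labels[100:].tolist())))
```

On this well-separated example each true cluster receives a single label; the point of the PCA step is that K-means then operates on compact, closely centred clusters rather than on the raw high-dimensional coordinates.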
This section describes related works on clustering higher dimensional data. Due to the implicit sparsity of the points, HDD has become an emerging challenge for clustering algorithms [14]. In [14], a complete survey of the usual concept of projected subspaces for searching projected clusters is provided. Eliminating the sparse subsets of each cluster and then fitting the points into the subspaces in which the greatest likelihood occurs is a much generalised idea of clustering in full-dimensional subspaces; this technique is used in [15] for effective HDD visualisation. SUBCLU [16] (density-connected subspace clustering) is an indispensable approach to the subspace clustering problem; in [16], the various paradigms are used systematically within a common framework. Projected clustering is a usual data mining task for unsupervised grouping of objects [17]. In [18], a detailed probabilistic approach to k-nearest-neighbour classification is presented. The feasibility of incorporating hubness data for Bayesian class prediction is examined in [15]. In [16], it is shown that nearest-neighbour methods can be further enhanced by taking a point's hubness into account. In HDD, ordinary machine learning algorithms are very difficult to apply, which particularly characterises the problem of the curse of dimensionality. In [19], better assured proposed labels are provided by exploiting fuzzy nearest-neighbour classification, and the existing crisp hubness-based approach is enlarged into a fuzzy counterpart; moreover, hybrid fuzzy functions are available and tested in detail in [19].
3. CHALLENGES IN HIGHER DIMENSIONAL DATA
3.1 CURSE OF DIMENSIONALITY
In higher dimensional space, unsupervised methods detect every point as an equally good outlier, because distances become indiscernible as the dimensionality increases: all the data points become equidistant from each other, suggesting that all of them carry useful and important information. Due to the sparsity of the data, outliers are hidden in lower dimensional subspaces and are usually not prominently highlighted when the components are treated independently.
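This distance-concentration effect can be illustrated numerically (sample sizes here are arbitrary): the relative contrast (max − min)/min between a query point's distances to all other points shrinks sharply as the dimensionality grows:

```python
# Distance concentration sketch: contrast between farthest and nearest
# neighbour, measured from one query point, collapses in high dimensions.
import numpy as np

def relative_contrast(n=200, d=2, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))                         # uniform points in [0,1]^d
    dist = np.linalg.norm(X - X[0], axis=1)[1:]    # distances from the query point
    return (dist.max() - dist.min()) / dist.min()

print(relative_contrast(d=2), relative_contrast(d=1000))
```

In 2 dimensions the farthest point is many times farther than the nearest; in 1000 dimensions all distances are nearly equal, which is exactly why every point starts to look like an outlier.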
3.2 ATTRIBUTE RELEVANT ANALYSIS
In case of the higher dimensional clustering, the relevant attributes contain the projected clusters
and the irrelevant ones contain outliers or noise [20]. Cluster structure is the region with the
higher density of the points than its surroundings. Such dense regions represent the 1
dimensional projection of the cluster. So, by detecting the dense regions in each dimension, it
becomes easier to differentiate between the dimensions of the relevant and irrelevant clusters. The
dense regions can be distinguished from the sparse one using the predefined density threshold, but
in such a case, this value of density threshold may affect the accuracy and the goodness of the
cluster.
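The per-dimension dense-region test can be sketched as below; the histogram bin count and the density threshold are illustrative assumptions, not values prescribed in the paper:

```python
# Dense-region sketch: a bin is "dense" when its count clearly exceeds the
# uniform-density baseline by a predefined factor (the threshold).
import numpy as np

def dense_regions(values, bins=20, density_threshold=2.0):
    counts, edges = np.histogram(values, bins=bins)
    expected = len(values) / bins                  # uniform-density baseline
    dense = counts > density_threshold * expected  # bins markedly above baseline
    return [(edges[i], edges[i + 1]) for i in np.flatnonzero(dense)]

rng = np.random.default_rng(0)
# A dimension with a projected cluster vs. a purely noisy dimension.
clustered = np.concatenate([rng.normal(0, 0.1, 500), rng.uniform(-5, 5, 100)])
uniform = rng.uniform(-5, 5, 600)
print(len(dense_regions(clustered)) > 0, len(dense_regions(uniform)) == 0)
```

The clustered dimension yields one or more dense intervals and would be kept as relevant; the uniform dimension yields none and would be discarded as noise.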
4. PRINCIPAL COMPONENT ANALYSIS
Principal component analysis is one of the most important techniques of dimensionality reduction, where "respecting structure" means "variance preservation". PCA has been rediscovered by many researchers in many fields; it is also known as the method of empirical orthogonal functions and singular value decomposition, and it determines the principal directions of the data distribution. To obtain these principal directions, one constructs the data covariance matrix and calculates its dominant eigenvectors [23]. Among the vectors in the original data space, these eigenvectors are the most informative and so are considered the principal directions. If we want to summarise p-dimensional feature vectors in a q-dimensional feature subspace, then the principal components are the projections of the original vectors onto the q directions that span the subspace. Mathematically, although the principal components can be derived in various different ways, the most prominent one is to find the projections whose variance is maximal. The first principal component is the direction in feature space along which the projections have the largest variance, and the second is the direction that maximises the variance among all directions orthogonal to the first. In general, the k-th component is the variance-maximising direction orthogonal to the previous k−1 components. The variance to be maximised is given by
σ² = (1/n) Σᵢ₌₁ⁿ (xᵢᵀw)²    (1)

where the n data vectors xᵢ are stacked into an n×p matrix X, so that the projection is given by Xw, an n×1 vector. Then,

σ² = (1/n) (Xw)ᵀ(Xw)    (2)
σ² = (1/n) wᵀXᵀXw    (3)
σ² = wᵀ((XᵀX)/n) w    (4)
σ² = wᵀVw    (5)

where V = (XᵀX)/n is the data covariance matrix.
The maximisation must be constrained to unit vectors, i.e., wᵀw = 1, otherwise the variance could be made arbitrarily large.
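The derivation in equations (1)–(5) can be checked numerically on synthetic data: the variance of the projection Xw equals wᵀVw, and among unit vectors it is maximised by the dominant eigenvector of V:

```python
# Numerical check of equations (1)-(5): projection variance equals w^T V w,
# and the first principal component attains the maximal variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))
X -= X.mean(axis=0)                     # centre the data
V = X.T @ X / len(X)                    # covariance matrix V of eq. (5)

w = rng.standard_normal(5)
w /= np.linalg.norm(w)                  # unit-vector constraint w^T w = 1
proj_var = ((X @ w) ** 2).mean()        # (1/n)(Xw)^T(Xw), eq. (2)
print(np.isclose(proj_var, w @ V @ w))  # True: agrees with w^T V w, eq. (5)

_, vecs = np.linalg.eigh(V)
w1 = vecs[:, -1]                        # dominant eigenvector = first PC
print(w1 @ V @ w1 >= w @ V @ w)         # True: the first PC maximises the variance
```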
5. HUBNESS
Hubness is usually treated as a measure of local centrality that enhances clustering in an
efficient and effective way. The underlying notion is that hubs lie near the centres of compact
sub-clusters in high-dimensional data, which suggests using them to estimate cluster centres and
comparing the hub-based approach with centroid-based techniques. We therefore use the K-means
algorithm iteratively to define clusters around well-separated high-hubness data elements. The
medoids and centroids in K-means tend to converge towards high-hubness elements, leading to
promising subspaces in large datasets. Medoids and centroids depend on the current cluster
elements, whereas hubs depend on their neighbouring elements and therefore carry locally
centralised information. According to [22], there are two types of hubness, local and global,
where local hubness is the restriction of global hubness to a given cluster.
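The hubness notion can be made concrete with the usual k-occurrence count N_k(x): a point's hubness score is the number of times it appears among the k nearest neighbours of the other points. The brute-force sketch below is an illustrative assumption about that definition, not code from the paper.

```python
import numpy as np

def hubness_scores(X, k=5):
    """N_k(x): how often each point appears among the k nearest
    neighbours of the other points (brute force, for illustration)."""
    n = X.shape[0]
    # pairwise squared distances between all points
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)        # a point is not its own neighbour
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:  # k nearest neighbours of point i
            counts[j] += 1
    return counts

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))          # high-dimensional sample
scores = hubness_scores(X, k=5)
hubs = np.argsort(scores)[-3:]          # the three strongest hubs
print(scores.sum())                     # every point casts k votes: prints 500
```

Points with unusually large N_k(x) are the hubs; in high dimensions their distribution becomes strongly skewed, which is what the hub-based centring exploits.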
Fig 1. Hubness based Clustering [22]
6. PROPOSED METHODOLOGY
To validate the centroids formed by the K-means clustering algorithm, a hubness-based
clustering technique is employed. Hubness can be considered a local centrality measure: hubs
are assumed to lie near the centres of compact sub-clusters in high-dimensional data, which
makes them suitable for testing the feasibility of the centroids. In this way, the hubness-based
approach lets us assess the proximity and validity of the centroids formed by the K-means
clusters. The stepwise procedure employed for the detection of the true projected outliers is:

- The large, high-dimensional dataset is loaded. In our experiment, we have taken the iris
dataset from the UCI repository, with n = 7500 d-dimensional points whose components are
independently drawn.
- The dimensionality of this large dataset is reduced using the Principal Component Analysis
technique, where the k-th principal component is the variance-maximising direction orthogonal
to the previous k-1 principal components.
- On these reduced dimensions, the iterative K-means algorithm is applied to obtain concrete
clusters via their centroids and medoids. The high-dimensional data space is thereby divided
into proper subspaces, and the outliers can be detected and defined properly.
- To obtain more refined, true outliers, the phenomenon of hubness, which is based on local
centrality, is also used. Hubness not only converges the medoids and centroids closer to their
accurate locations but also improves the quality of the clustering and of the subspaces, so as
to yield the true projected outliers.
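The steps above can be sketched end-to-end. This is a simplified NumPy illustration on synthetic data standing in for the paper's UCI sample; the hubness refinement step is reduced here to flagging the points farthest from their cluster centre, which is a deliberate simplification of the method described, not the paper's actual refinement.

```python
import numpy as np

def pca_reduce(X, q):
    """Step 2: project centred data onto the top-q principal directions."""
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh((Xc.T @ Xc) / len(Xc))
    return Xc @ eigvecs[:, -q:]

def kmeans(X, k, iters=50, seed=0):
    """Step 3: plain iterative (Lloyd) K-means."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

# Step 1: synthetic stand-in dataset, two dense groups plus injected outliers.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (95, 10)),
               rng.normal(6, 1, (95, 10)),
               rng.normal(3, 8, (10, 10))])   # last 10 rows: scattered outliers

Z = pca_reduce(X, q=2)                 # step 2: dimensionality reduction
labels, centres = kmeans(Z, k=2)       # step 3: clustering in the subspace

# Step 4 (simplified stand-in): flag points far from their cluster centre.
dist = np.linalg.norm(Z - centres[labels], axis=1)
outliers = np.argsort(dist)[-10:]
print(len(outliers))                   # prints 10
```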
Fig 2. Block Diagram for the detection of outliers
7. RESULTS AND DISCUSSIONS
The experiment is implemented in MATLAB, with the iris data taken from the UCI repository. This
is an n = 7500 d-dimensional dataset with independent components. Principal Component Analysis
is then performed, which is based simply on maximising the variance: the first principal
component lies in the direction of maximum variance, and each subsequent component maximises
the variance among all directions orthogonal to the previous ones.
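The two properties relied on here, that successive principal directions are mutually orthogonal and are ordered by decreasing captured variance, can be verified directly. A small NumPy check on synthetic correlated data (an illustration, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))  # correlated features
X = X - X.mean(axis=0)

V = (X.T @ X) / len(X)
eigvals, eigvecs = np.linalg.eigh(V)
order = np.argsort(eigvals)[::-1]           # sort components by variance, descending
components, variances = eigvecs[:, order], eigvals[order]

# Successive principal directions are mutually orthogonal unit vectors ...
assert np.allclose(components.T @ components, np.eye(6))
# ... and each captures no more variance than the one before it.
assert np.all(np.diff(variances) <= 1e-12)
print("ok")                                  # prints ok
```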
The K-means algorithm is then applied iteratively, and the resulting cluster centroids and
clusters are shown in Fig 3(a) below. We now want to find the exact lower-dimensional
subspaces, because outliers in high-dimensional data are mostly hidden among points that are
nearly equidistant from one another, so it becomes important to locate subspaces in a
lower-dimensional space. To find appropriate lower-dimensional subspaces, the hubness
phenomenon is used as a tactic to overcome the curse of dimensionality. Hubness is a very
effective tool for eliminating the sparse subspaces of every cluster and projecting the points
into those subspaces in which greater similarity occurs. Moreover, the K-means algorithm is
used iteratively to define clusters around well-separated high-hubness data elements; an
important point is that the centroids obtained from K-means depend on all the cluster elements,
while hubs depend mostly on their neighbouring elements and therefore contain localised
centrality information. Table 1 compares the various outlier detection methodologies.
Fig 3(a) Cluster centroids along with clusters
Fig 3(b) Refined subspaces with projected outliers
Table 1 Comparison of various existing outlier detection methodologies

Outlier Detection Methodology | Anomaly Detection Ratio for H.D.D. | Compatibility with H.D.D.
Statistical-based methods     | Low                                | No
Distance-based methods        | Very Low                           | No
Density-based methods         | Low                                | No
Clustering-based methods      | High                               | Not all
Subspace methods              | Very High                          | Yes
Hubness method                | Very High                          | Yes

Fig 4. Graphical representation of various parameters against values and dimensions
8. CONCLUSION
In this paper, we have used the technique of Principal Component Analysis for data reduction
and hubness-based clustering to obtain appropriate, concrete clusters. First, the clusters are
formed using iterative K-means clustering; their centroids are then converged effectively using
hubness, which enhances the goodness and quality of the clusters; finally, this is combined with
the Principal Component Analysis technique to find the global, refined outliers. Table 1 and
Fig 4 show the compatibility of the various outlier detection methodologies with
high-dimensional data.
9. REFERENCES
[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Comput. Surveys,
vol. 41, no. 3, Article 15, 2009.
[2] P. J. Rousseuw and A. M. Leroy, Robust Regression and Outlier Detection. Hoboken, NJ, USA:
Wiley, 1987.
[3] J. Zhang, "Towards Outlier Detection for Higher Dimensional Data Streams Using Projected
Outlier Analysis Strategy," PhD thesis, 2009.
[4] D.M. Hawkins, Identification of Outliers. Chapman and Hall, 1980.
[5] M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,”
Proc. ACM SIGMOD Int’l Conf. Management of Data, 2000.
[6] H.-P. Kriegel, M. Schubert, and A. Zimek, “Angle-Based Outlier Detection in High-Dimensional
Data,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and data Mining, 2008.
[7] F. Angiulli, S. Basta, and C. Pizzuti, “Distance-Based Detection and Prediction of Outliers,” IEEE
Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 145-160, 2006.
[8] V. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley&Sons, 1994.
[9] W. Jin, A.K.H. Tung, J. Han, and W. Wang, “Ranking Outliers Using Symmetric Neighborhood
Relationship,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2006.
[10] N.L.D. Khoa and S. Chawla, “Robust Outlier Detection Using Commute Time and Eigenspace
Embedding,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2010.
[11] E.M. Knorr and R.T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Data Sets," Proc.
Int'l Conf. Very Large Data Bases, 1998.
[12] H.-P. Kriegel, P. Kro¨ger, E. Schubert, and A. Zimek, “Outlier Detection in Axis-Parallel Subspaces
of High Dimensional Data,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2009.
[13] C.C. Aggarwal and P.S. Yu, “Outlier Detection for High Dimensional Data,” Proc. ACM SIGMOD
Int’l Conf. Management of Data, 2001.
[14] C.C. Aggarwal and P.S. Yu, "Finding Generalized Projected Clusters in High Dimensional Spaces,"
Proc. 26th ACM SIGMOD Int'l Conf. Management of Data, 2000, pp. 70–8.
[15] K. Kailing, H.-P. Kriegel, P. Kröger, and S. Wanka, "Ranking Subspaces for Clustering High
Dimensional Data," Proc. 7th European Conf. Principles and Practice of Knowledge Discovery
in Databases (PKDD), 2003, pp. 241–252.
[16] D. François, V. Wertz, and M. Verleysen, "The Concentration of Fractional Distances," IEEE
Trans. Knowledge and Data Eng., vol. 19, no. 7, pp. 873–886, 2007.
[17] A. Kabán, "Non-Parametric Detection of Meaningless Distances in High Dimensional Data,"
Statistics and Computing, vol. 22, no. 2, pp. 375–385, 2011.
[18] C.C. Aggarwal, A. Hinneburg, and D.A. Keim, "On the Surprising Behavior of Distance Metrics in
High Dimensional Spaces," Proc. 8th Int'l Conf. Database Theory (ICDT), 2001, pp. 420–434.
[19] I.S. Dhillon, Y. Guan, and B. Kulis, "Kernel k-means: Spectral Clustering and Normalized Cuts,"
Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004, pp. 551.
[20] M. Radovanović, A. Nanopoulos, and M. Ivanović, "On the Existence of Obstinate Results in
Vector Space Models," Proc. 33rd Annual Int'l ACM SIGIR Conf. Research and Development in
Information Retrieval, 2010, pp. 186–193.
[21] Y.-J. Lee, Y.-R. Yeh, and Y.-C.F. Wang, "Anomaly Detection via Online Oversampling Principal
Component Analysis," IEEE Trans. Knowledge and Data Eng., vol. 25, no. 7, July 2013.
[22] N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović, "The Role of Hubness in Clustering
High-Dimensional Data," IEEE Trans. Knowledge and Data Eng., 2014.
[23] Y.-J. Lee, Y.-R. Yeh, and Y.-C.F. Wang, "Anomaly Detection via Online Oversampling Principal
Component Analysis," IEEE Trans. Knowledge and Data Eng., vol. 25, no. 7, July 2013.
[24] R. Sindhupriya and X. Ignatius Selvarani, "K-means Based Clustering in Higher Dimensional
Data," Int'l Journal of Advanced Research in Computer Engineering and Technology, vol. 3,
issue 2, Feb. 2014.