IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULAR DESCRIPTOR DATASET USING PRINCIPAL COMPONENT OUTLIER DETECTION ALGORITHM AND COMPARATIVE NUMERICAL STUDY OF OTHER ROBUST ESTIMATORS
This document summarizes an algorithm called Principal Component Outlier Detection (PrCmpOut) for identifying outliers in high-dimensional molecular descriptor datasets. PrCmpOut uses principal component analysis to transform the data into a lower-dimensional space, where it can more efficiently detect outliers using robust estimators of location and covariance. The properties of PrCmpOut are analyzed and compared to other robust outlier detection methods through simulation studies using a dataset of oxazoline and oxazole molecular descriptors. Numerical results show PrCmpOut performs well at outlier detection in high-dimensional data.
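As a rough illustration of the approach summarized above, and not the authors' exact PrCmpOut algorithm, the following Python sketch projects a descriptor matrix onto its leading principal components and then flags points whose robust Mahalanobis distance (from a minimum covariance determinant estimator) exceeds a chi-squared cutoff; the number of components, the cutoff quantile and the synthetic data are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.covariance import MinCovDet
    from scipy.stats import chi2

    def pca_robust_outliers(X, n_components=5, quantile=0.975):
        """Project X onto leading PCs, then flag points whose robust
        Mahalanobis distance exceeds a chi-squared cutoff."""
        scores = PCA(n_components=n_components).fit_transform(X)
        mcd = MinCovDet().fit(scores)            # robust location/covariance
        d2 = mcd.mahalanobis(scores)             # squared robust distances
        cutoff = chi2.ppf(quantile, df=n_components)
        return d2 > cutoff                       # boolean outlier mask

    # Example with synthetic high-dimensional data (stand-in for descriptors)
    X = np.random.randn(200, 50)
    X[:5] += 6.0                                 # inject a few outliers
    print(np.where(pca_robust_outliers(X))[0])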
Data warehouses are structures holding large amounts of data collected from heterogeneous sources for use in decision support systems. Analysing a data warehouse uncovers hidden, initially unexpected patterns, but such analysis carries a high memory and computation cost, and data reduction methods have been proposed to make it easier. In this paper, we present a hybrid approach that combines Genetic Algorithms (GAs), as evolutionary algorithms, with Multiple Correspondence Analysis (MCA), as a factor analysis method, to perform this reduction. Our approach identifies a reduced subset of dimensions p' from the initial set p, where p' < p, and seeks the fact profile that is closest to a reference. The GAs enumerate candidate subsets and the chi-squared formula of MCA evaluates the quality of each subset. The study is based on a distance measurement between the reference and the n fact profiles extracted from the warehouse.
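To make the distance criterion concrete, the sketch below computes a chi-squared distance between a reference profile and each fact profile, using the column-mass weighting that is standard in correspondence analysis; whether this matches the paper's exact formula is an assumption.

    import numpy as np

    def chi2_profile_distance(reference, facts):
        """Chi-squared distance between a reference row profile and each
        fact row profile, weighted by the overall column masses."""
        table = np.vstack([reference, facts]).astype(float)
        col_mass = table.sum(axis=0) / table.sum()           # column masses
        profiles = table / table.sum(axis=1, keepdims=True)  # row profiles
        ref, rows = profiles[0], profiles[1:]
        return np.sqrt((((rows - ref) ** 2) / col_mass).sum(axis=1))

    reference = np.array([10, 5, 5])
    facts = np.array([[9, 6, 5], [2, 10, 8]])
    print(chi2_profile_distance(reference, facts))  # smaller = closer to reference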
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
The huge amount of healthcare data, coupled with the need for data analysis tools, has made data mining an interesting research area. Data mining tools and techniques help to discover and understand hidden patterns in a dataset that may not be visible through visualization of the data alone. Selecting an appropriate clustering method and the optimal number of clusters for healthcare data can often be confusing and difficult. A large number of clustering algorithms are currently available for clustering healthcare data, but it is hard for people with little knowledge of data mining to choose a suitable one. This paper analyzes clustering techniques on a healthcare dataset in order to determine suitable algorithms that can produce well-optimized clusters. The performances of two clustering algorithms (K-means and DBSCAN) were compared using silhouette score values. Firstly, we analyzed the K-means algorithm using different numbers of clusters (K) and different distance metrics. Secondly, we analyzed the DBSCAN algorithm using different minimum numbers of points required to form a cluster (minPts) and different distance metrics. The experimental results indicate that both K-means and DBSCAN produce clusters with strong intra-cluster cohesion and inter-cluster separation. Based on the analysis, the K-means algorithm performed better than DBSCAN in terms of clustering accuracy and execution time.
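A minimal sketch of this kind of comparison, using scikit-learn; the synthetic dataset, parameter grids and eps value are illustrative assumptions rather than the paper's setup.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, DBSCAN
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # stand-in data

    # K-means over several K values
    for k in (2, 3, 4, 5):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(f"K-means k={k}: silhouette={silhouette_score(X, labels):.3f}")

    # DBSCAN over several minPts values (noise points labelled -1 are kept)
    for min_pts in (3, 5, 10):
        labels = DBSCAN(eps=1.0, min_samples=min_pts).fit_predict(X)
        if len(set(labels)) > 1:                 # silhouette needs >= 2 labels
            print(f"DBSCAN minPts={min_pts}: "
                  f"silhouette={silhouette_score(X, labels):.3f}")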
PREDICTION OF MALIGNANCY IN SUSPECTED THYROID TUMOUR PATIENTS BY THREE DIFFER...
In the present study, the abilities of three data mining classification methods, namely artificial neural networks with the feed-forward back-propagation algorithm, the J48 decision tree method, and logistic regression analysis, are compared on a real medical dataset. The objective of the study is the prediction of malignancy in suspected thyroid tumour patients. The accuracy of the correct predictions (the minimum error rate), the time consumed by the modelling process, and the interpretability and simplicity of the results for clinical experts are the factors considered in choosing the best method.
A comparative analysis of classification techniques on medical data sets
Comparative Study of Diabetic Patient Data Using Classification Alg...
Data mining refers to extracting knowledge from large amounts of data. Real-life data mining applications are interesting because they often present a different set of problems, and diabetic patient data is one such research area; classification is one of the main problems in the field. This research gives an algorithmic discussion of the J48, J48 Graft, Random Tree, REP and LAD classifiers, which are compared on computing time, correctly classified instances, kappa statistics, MAE, RMSE, RAE and RRSE, in order to find the error-rate measurement for the different classifiers in Weka. The diabetic patient dataset used for classification was developed by collecting data from a hospital repository and consists of 1865 instances with different attributes; the instances fall into two categories, blood tests and urine tests. The Weka tool is used to classify the data, which is evaluated using 10-fold cross-validation, and the results are compared. Comparing the performance of the algorithms, we found J48 to be the better algorithm in most of the cases.
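The Weka classifiers named above are not directly available in Python; as a hedged analogue, the sketch below runs 10-fold cross-validation over two scikit-learn tree learners (a pruned tree standing in for J48 and an extremely randomized tree standing in for Random Tree) and reports accuracy, MAE and fit time on a stand-in dataset.

    import numpy as np
    from sklearn.datasets import load_breast_cancer   # stand-in for the diabetic dataset
    from sklearn.model_selection import cross_validate
    from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "pruned tree (J48-like)": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
        "random tree": ExtraTreeClassifier(random_state=0),
    }

    for name, model in models.items():
        cv = cross_validate(model, X, y, cv=10,
                            scoring=("accuracy", "neg_mean_absolute_error"))
        print(f"{name}: accuracy={cv['test_accuracy'].mean():.3f}, "
              f"MAE={-cv['test_neg_mean_absolute_error'].mean():.3f}, "
              f"fit_time={cv['fit_time'].mean():.3f}s")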
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
Dichotomous data is a type of categorical data that is binary, with categories zero and one. Health care data is one of the most heavily used kinds of categorical data. Binary data is the simplest form of data used in health care databases, where close-ended questions can be used; it is very efficient, in terms of computational cost and memory, for representing categorical data. Clustering health care or medical data is very tedious due to its complex data representation models, high dimensionality and data sparsity. In this paper, clustering is performed after transforming the dichotomous data into real values by the Wiener transformation. The proposed algorithm can be used to determine the correlation of health disorders and symptoms observed in large medical and health binary databases. Computational results show that clustering based on the Wiener transformation is very efficient in terms of objectivity and subjectivity.
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
The healthcare industry contains big and complex data that can be mined to discover fascinating disease patterns and to make effective decisions with the help of different machine learning techniques. Advanced data mining techniques are used to discover knowledge in databases and for medical research. This paper analyses prediction systems for diabetes, kidney and liver disease using a larger number of input attributes. The data mining classification techniques, namely Support Vector Machine (SVM) and Random Forest (RF), are analysed on the diabetes, kidney and liver disease databases, and their performance is compared based on precision, recall, accuracy, F-measure and time. As a result of the study, the proposed algorithm is designed using the SVM and RF algorithms, and the experimental results show accuracies of 99.35%, 99.37% and 99.14% on the diabetes, kidney and liver disease data respectively.
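A small, illustrative sketch of the kind of SVM versus Random Forest comparison described, using scikit-learn on a stand-in medical dataset; the hyperparameters and the train/test split are assumptions.

    import time
    from sklearn.datasets import load_breast_cancer        # stand-in medical dataset
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import precision_recall_fscore_support, accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {"SVM": SVC(kernel="rbf", gamma="scale"),
              "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0)}

    for name, model in models.items():
        start = time.time()
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        prec, rec, f1, _ = precision_recall_fscore_support(y_te, y_pred, average="binary")
        print(f"{name}: precision={prec:.3f} recall={rec:.3f} f_measure={f1:.3f} "
              f"accuracy={accuracy_score(y_te, y_pred):.3f} time={time.time()-start:.2f}s")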
Classification is a data mining technique based on machine learning that is used to assign each item in a dataset to one of a set of predefined labelled classes or groups. Classification is used for different applications such as text classification, image classification, class prediction and data classification. In this paper, we present the major classification techniques used for predicting classes from a supervised learning dataset, covering several major types of classification method including Random Forest, Naive Bayes and Support Vector Machine (SVM). The goal of this review paper is to provide a review of, and an accuracy comparison between, different classification techniques in data mining.
Propose an Enhanced Framework for Prediction of Heart Disease
Heart disease diagnosis is a complex task that requires considerable experience. Traditionally, heart disease is predicted from a number of medical tests prescribed by the doctor, such as a cardiac MRI, an ECG and a stress test. Today, the healthcare industry holds a huge amount of health care data containing hidden information, and effective decisions can be made by means of this hidden information; for appropriate results, advanced computer-based data mining techniques are used. Artificial neural networks (ANNs) are mathematical techniques used in the empirical sciences for inference and categorisation, and also for modelling real neural networks; mental phenomena such as acting, wanting, knowing, remembering, perceiving, thinking and inferring can be understood using ANN theory, and ANNs are powerful instruments for inference and classification. In this paper, classification techniques, namely the Naive Bayes classification algorithm and artificial neural networks, are used to classify the attributes in the given dataset, and attribute filtering techniques, namely PCA (Principal Component Analysis) and Information Gain attribute subset evaluation, are used for feature selection on the given dataset to predict heart disease symptoms. A new framework is proposed based on these techniques: the input dataset is fed into the feature selection block, which selects whichever technique yields the fewest attributes; classification is then performed using the two algorithms, and the attributes selected by both classification tasks are used for the prediction of heart disease. This framework reduces the time taken to predict heart disease symptoms and lets the user know which attributes are most important under the proposed framework.
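A hedged scikit-learn analogue of the proposed pipeline, with PCA and a mutual-information selector standing in for the paper's PCA and Information Gain attribute evaluators, and Gaussian Naive Bayes and a small MLP standing in for the two classifiers; the dataset and the number of selected attributes are assumptions.

    from sklearn.datasets import load_breast_cancer      # stand-in for a heart dataset
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    selectors = {"PCA": PCA(n_components=8),
                 "InfoGain-like": SelectKBest(mutual_info_classif, k=8)}
    classifiers = {"Naive Bayes": GaussianNB(),
                   "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                        random_state=0)}

    for s_name, selector in selectors.items():
        for c_name, clf in classifiers.items():
            pipe = make_pipeline(StandardScaler(), selector, clf)
            score = cross_val_score(pipe, X, y, cv=5).mean()
            print(f"{s_name:>13} + {c_name:<11}: accuracy={score:.3f}")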
We conducted a comparative analysis of different supervised dimension reduction techniques by integrating a set of different data splitting algorithms, and demonstrate that the relative efficacy of the learning algorithms depends on sample complexity. The issue of sample complexity is discussed in terms of its dependence on the data splitting algorithms. In line with expectations, every supervised learning classifier demonstrated different capability for different data splitting algorithms, and no way to calculate an overall ranking of techniques was directly available. We specifically focused on the dependence of classifier ranking on the data splitting algorithms and devised a weighted-average-rank model, the Weighted Mean Rank Risk Adjusted Model (WMRRAM), for consensus ranking of learning classifier algorithms.
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...
Parkinson's disease is a complex chronic neurodegenerative disorder of the central nervous system. One of the common symptoms for Parkinson's disease subjects is vocal performance degradation. Patients are usually advised to follow personalized rehabilitative treatment sessions with speech experts. Recent research trends aim to investigate the potential of using sustained vowel phonations to replicate the speech experts' assessments of Parkinson's disease subjects' voices. With the purpose of improving the accuracy and efficiency of Parkinson's disease treatment, this article proposes a two-stage diagnosis model to evaluate an LSVT dataset. Firstly, we propose a modified minimum Redundancy-Maximum Relevance (mRMR) feature selection approach based on Cuckoo Search and Tabu Search to reduce the number of features. Secondly, we apply a simple random sampling technique to the dataset to increase the number of samples of the minority class. Promisingly, the developed approach obtained a classification accuracy rate of 95% with 24 features under the 10-fold CV method.
Life is the most precious gift to man, and safeguarding this gift is of utmost importance. With an increasing number of diseases and fast-paced lives, people have less time to look after themselves and their family members, or even to visit the doctor for regular check-ups. Our E-Health patient monitoring system can remotely monitor the health of patients and inform the doctor of critical conditions without human intervention. Some of the existing E-Health systems include the telemedicine network for Francophone African countries (RAFT) and LOBIN. RAFT is implemented in Java and uses asymmetric public-private key encryption; however, it is expensive, does not support mobility and is not a context-aware system. LOBIN is a hardware/software platform to locate and monitor a set of physiological and context parameters of several patients within hospital facilities; although it is a context-aware system, it cannot handle high and concurrent data traffic loads.
To overcome the above flaws, our proposed system puts forward an idea for patient monitoring using various knowledge-based techniques such as K-means clustering, the Gaussian kernel function, ANNs and a fuzzy inference engine. In our project we intend to do remote patient health monitoring, in which three to four machines send various sensed health parameters to a centralised server; the server clusters the sensed health parameters based on the criticality of the health condition. Then, depending on the clusters formed and on comparison with threshold values, appropriate reports are generated and sent to the doctors and caretakers.
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has his own definition of toughness and easiness, and there isn't any absolute scale for measuring knowledge, but examination scores indicate the performance of the student. In this case study, knowledge of data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data; it allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. In this project, a live experiment was conducted on students: an exam was given to students of the computer science major using MOODLE (an LMS), the generated data was analysed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps to identify students who need special advising or counselling from the teacher, in order to provide a high quality of education.
A statistical data fusion technique in virtual data integration environment
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated records from the different integrated data sources. It refers to the process of selecting or fusing attribute values from the clustered duplicates into a single record representing the real-world object. In this paper, a statistical technique for data fusion is introduced, based on probabilistic scores from both the data sources and the clustered duplicates.
Similar to IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULAR DESCRIPTOR DATASET USING PRINCIPAL COMPONENT OUTLIER DETECTION ALGORITHM AND COMPARATIVE NUMERICAL STUDY OF OTHER ROBUST ESTIMATORS
Multiple Linear Regression Models in Outlier Detection
Identifying anomalous values in a real-world database is important both for improving the quality of the original data and for reducing the impact of anomalous values in the process of knowledge discovery in databases. Such anomalous values give useful information to the data analyst in discovering useful patterns; through isolation, these data may be separated and analyzed. The analysis of outliers and influential points is an important step of regression diagnostics. In this paper, our aim is to detect the points which are very different from the other points: they do not seem to belong to a particular population and behave differently. If these influential points are removed, the result is a different model. The distinction between these points is not always obvious and clear, hence several indicators are used for identifying and analyzing outliers. Existing methods of outlier detection are based on manual inspection of graphically represented data. In this paper, we present a new approach to automating the process of detecting and isolating outliers. The impact of anomalous values on the dataset is established using two indicators, DFFITS and Cook's D. The process is based on modeling the human perception of exceptional values by using multiple linear regression analysis.
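As an illustration of the two influence indicators mentioned, and not the paper's own implementation, the following sketch fits an ordinary least squares model with statsmodels and flags observations whose DFFITS or Cook's D exceed the usual rule-of-thumb cutoffs; the synthetic data and cutoffs are assumptions.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)
    y[:3] += 8.0                                   # inject a few influential points

    model = sm.OLS(y, sm.add_constant(X)).fit()
    influence = model.get_influence()

    dffits, dffits_cutoff = influence.dffits       # values and 2*sqrt(k/n) threshold
    cooks_d, _ = influence.cooks_distance

    flagged = np.where((np.abs(dffits) > dffits_cutoff) |
                       (cooks_d > 4.0 / len(y)))[0]
    print("Potentially influential observations:", flagged)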
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
With the advancement of time and technology, outlier mining methodologies help to sift through large amounts of interesting data patterns and to winnow out malicious data entering any field of concern. It has become indispensable not only to build a robust and generalised model for anomaly detection but also to equip that model with extra features such as high accuracy and precision. Although the K-means algorithm is one of the most popular, simplest unsupervised clustering algorithms, it can be combined with PCA, hubness and a robust Gaussian mixture model to build a very generalised and robust anomaly detection system. A major loophole of the K-means algorithm is its tendency to settle in local minima, resulting in clusters that lead to ambiguity. In this paper, an attempt is made to combine the K-means algorithm with the PCA technique, which results in the formation of more closely centred clusters on which K-means works more accurately. This combination not only gives a great boost to the detection of outliers but also enhances its accuracy and precision.
HEALTH PREDICTION ANALYSIS USING DATA MINING
As we know, the health care industry relies heavily on assumptions that are tested and verified via various tests, and the patient has to depend on the doctor's knowledge of the topic. We therefore built a system that uses data mining techniques to predict a person's health based on various medical test results, so that the person's health can be predicted from the analysis performed by the system. The system is currently designed only for heart issues; for that we used the Statlog (Heart) Data Set from the UCI Machine Learning Repository, which includes attributes such as age, sex, chest pain type, cholesterol, sugar and outcomes, for training the system. Only a few general inputs need to be supplied in order to generate the prediction, and the prediction results from all the algorithms are merged by calculating their mean value; that value is the actual outcome of the prediction process, which runs entirely in the background.
COMPARISON OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
The issue of incomplete data exists across the entire field of data mining. In this paper, mean imputation, median imputation and standard deviation imputation are used to deal with the challenges of incomplete data in classification problems. Using different imputation methods, an incomplete dataset is converted into a complete dataset. On the complete dataset, the suitable imputation method is applied, and the percentage errors of the imputation methods are compared along with their results.
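A minimal sketch of mean and median imputation with scikit-learn; "standard deviation imputation" is not a built-in strategy, so the last variant shown is a hypothetical reading (column mean plus one standard deviation), not the paper's definition.

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[7.0, np.nan, 3.0],
                  [4.0, 5.0, np.nan],
                  [np.nan, 6.0, 9.0],
                  [8.0, 2.0, 1.0]])

    mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
    median_filled = SimpleImputer(strategy="median").fit_transform(X)

    # Hypothetical "standard deviation imputation": fill each missing entry
    # with the column mean plus one column standard deviation.
    col_mean = np.nanmean(X, axis=0)
    col_std = np.nanstd(X, axis=0)
    std_filled = np.where(np.isnan(X), col_mean + col_std, X)

    print(mean_filled, median_filled, std_filled, sep="\n\n")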
Detection of Outliers in Large Dataset using Distributed Approach
In this paper, a distributed method is introduced for detecting distance-based outliers in very large data sets. The approach is based on the concept of an outlier detection solving set, which is a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed schema exhibits excellent performance: from the theoretical point of view, for common settings, the temporal cost of our algorithm is expected to be at least three orders of magnitude lower than the classical nested-loop-like approach to detecting outliers. Experimental results show that the algorithm is efficient and that its running time scales quite well for an increasing number of nodes. We also discuss a variant of the basic strategy that reduces the amount of data to be transferred, in order to improve both the communication cost and the overall runtime. Importantly, the solving set computed in a distributed environment has the same quality as that produced by the corresponding centralized method.
A ROBUST MISSING VALUE IMPUTATION METHOD MIFOIMPUTE FOR INCOMPLETE MOLECULAR ...
Missing data imputation is an important research topic in data mining. Large-scale molecular descriptor data may contain missing values (MVs); however, some methods for downstream analyses, including some prediction tools, require a complete descriptor data matrix. We propose and evaluate an iterative imputation method, MiFoImpute, based on a random forest. By averaging over many unpruned regression trees, a random forest intrinsically constitutes a multiple imputation scheme. Using the NRMSE and NMAE estimates of the random forest, we are able to estimate the imputation error. Evaluation is performed on two molecular descriptor datasets generated from a diverse selection of pharmaceutical fields, with artificially introduced missing values ranging from 10% to 30%. The experimental results demonstrate that missing values have a great impact on the effectiveness of imputation techniques, and that our method MiFoImpute is more robust to missing values than the other ten imputation methods used as benchmarks. Additionally, MiFoImpute exhibits attractive computational efficiency and can cope with high-dimensional data.
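MiFoImpute itself is not reproduced here; the sketch below shows an analogous random-forest-based iterative imputation with scikit-learn and estimates the imputation error on artificially removed entries with an NRMSE, in the spirit of the evaluation described. The data, missingness rate and forest size are assumptions.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(0)
    X_true = rng.normal(size=(200, 10))            # stand-in descriptor matrix
    X_miss = X_true.copy()
    mask = rng.random(X_true.shape) < 0.2          # ~20% missing values
    X_miss[mask] = np.nan

    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        max_iter=5, random_state=0)
    X_filled = imputer.fit_transform(X_miss)

    # Normalised RMSE over the artificially removed entries
    nrmse = (np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2))
             / np.std(X_true[mask]))
    print(f"NRMSE on imputed entries: {nrmse:.3f}")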
Outlier Detection Using Unsupervised Learning on High Dimensional Data
Outliers in data mining can be detected using semi-supervised and unsupervised methods. Outlier detection in high-dimensional data faces various challenges arising from the curse of dimensionality: owing to distance concentration, distances between points become less informative in high-dimensional data. Among outlier detection techniques, distance-based methods are used to detect outliers and to label points as outliers or inliers. To detect outliers effectively in high-dimensional data, we use unsupervised learning methods such as the IQR rule and KNN with Antihub.
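A minimal sketch of the IQR rule mentioned above, for a single variable; the 1.5-IQR multiplier and the planted outliers are illustrative assumptions.

    import numpy as np

    def iqr_outliers(x, k=1.5):
        """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for a 1-D array."""
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        return (x < q1 - k * iqr) | (x > q3 + k * iqr)

    x = np.concatenate([np.random.normal(0, 1, 100), [8.0, -9.5]])  # two planted outliers
    print(np.where(iqr_outliers(x))[0])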
Performance analysis of regularized linear regression models for oxazolines a...
Regularized regression techniques for linear regression have been developed over the last few decades to reduce the flaws of ordinary least squares regression with regard to prediction accuracy. In this paper, new methods for using regularized regression in model choice are introduced, and we distinguish the conditions in which regularized regression improves our ability to discriminate between models. We applied all five methods that use penalty-based (regularization) shrinkage to handle an Oxazolines and Oxazoles derivatives descriptor dataset with far more predictors than observations. The lasso, ridge, elastic net, LARS and relaxed lasso further possess the desirable property that they simultaneously select relevant predictive descriptors and optimally estimate their effects. Here, we comparatively evaluate the performance of five regularized linear regression methods. Assessing the performance of each model by means of benchmark experiments is an established exercise: cross-validation and resampling methods are generally used to arrive at point estimates of the efficiencies, which are compared to recognize methods with acceptable features. Predictive accuracy was evaluated using the root mean squared error (RMSE) and the square of the usual correlation between the predictions and the observed mean inhibitory concentration of antitubercular activity (R squared). We found that all five regularized regression models were able to produce feasible models that efficiently captured the linearity in the data. The elastic net and LARS had similar accuracies, as did the lasso and relaxed lasso, and all outperformed ridge regression in terms of the RMSE and R squared metrics.
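An illustrative scikit-learn comparison in the spirit of the study; relaxed lasso is not available in scikit-learn and is omitted, and the synthetic wide matrix, penalty strengths, and RMSE/R-squared computation are assumptions rather than the paper's data or settings.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet, Lars
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Stand-in for a descriptor matrix with far more predictors than observations
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 500))
    coef = np.zeros(500)
    coef[:10] = rng.normal(size=10)                 # sparse true signal
    y = X @ coef + rng.normal(scale=0.5, size=60)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {"ridge": Ridge(alpha=1.0),
              "lasso": Lasso(alpha=0.1),
              "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
              "lars": Lars(n_nonzero_coefs=10)}

    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rmse = np.sqrt(mean_squared_error(y_te, pred))
        print(f"{name:>11}: RMSE={rmse:.3f}, R^2={r2_score(y_te, pred):.3f}")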
On Tracking Behavior of Streaming Data: An Unsupervised Approach
In recent years, data streams have been the focus of a large number of researchers in different domains. All of these researchers share the same difficulty when discovering unknown patterns within data streams, namely concept change. The notion of concept change refers to the places where the underlying distribution of the data changes from time to time. Different methods have been proposed to detect changes in data streams, but most of them are based on the unrealistic assumption that data labels are available to the learning algorithms. Nonetheless, in real-world problems the labels of streaming data are rarely available, which is the main reason why the data stream community has recently focused on the unsupervised domain. This study is based on the observation that unsupervised approaches for learning from data streams are not yet mature; namely, they provide only mediocre performance, especially when applied to multi-dimensional data streams. In this paper, we propose a method for Tracking Changes in the behavior of instances using the Cumulative Density Function, abbreviated as TrackChCDF. Our method is able to detect change points along an unlabeled data stream accurately, and is also able to determine the trend of the data, called closing or opening. The advantages of our approach are threefold. First, it detects change points accurately. Second, it works well on multi-dimensional data streams. Last but not least, it can determine the type of change, namely the closing or opening of instances over time, which has broad applications in different fields such as economics, the stock market and medical diagnosis. We compare our algorithm to the state-of-the-art method for concept change detection in data streams, and the obtained results are very promising.
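TrackChCDF itself is not reproduced here; as a rough illustration of detecting distribution change in an unlabeled stream from cumulative distribution functions, the sketch below compares the empirical CDFs of consecutive sliding windows with a two-sample Kolmogorov-Smirnov test (window size and significance level are assumptions).

    import numpy as np
    from scipy.stats import ks_2samp

    def detect_changes(stream, window=100, alpha=0.01):
        """Flag positions where the empirical CDF of the current window differs
        significantly from that of the previous window (two-sample KS test)."""
        changes = []
        for i in range(2 * window, len(stream), window):
            prev = stream[i - 2 * window:i - window]
            curr = stream[i - window:i]
            if ks_2samp(prev, curr).pvalue < alpha:
                changes.append(i - window)          # approximate change location
        return changes

    rng = np.random.default_rng(0)
    stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])
    print(detect_changes(stream))                   # should flag around index 500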
Data Analytics on Solar Energy Using Hadoop
ABSTRACT: Missing data is one of the major issues in data mining and pattern recognition. The knowledge contained in attributes with missing data values is important for improving an organization's regression correlation process, and the learning process on each instance is necessary as it may contain some exceptional knowledge. There are various methods to handle missing data in regression correlation. We analyse photovoltaic cells and the sunlight striking different geographical locations in order to identify defective or disconnected photovoltaic cell plates. Our main aims are to show the energy produced at different geographical locations, to find the defective plates, and to analyse the datasets of energy produced along with the current weather in a particular area, so as to know the status of the photovoltaic plates. In this project, we used the Hadoop Map-Reduce framework to analyse the solar energy datasets.
Anomaly Detection using multidimensional reduction Principal Component Analysis
Anomaly detection has been an important research topic in data mining and machine learning. Many real-world applications, such as intrusion or credit card fraud detection, require an effective and efficient framework to identify deviated data instances. However, most anomaly detection methods are typically implemented in batch mode, and thus cannot be easily extended to large-scale problems without sacrificing computation and memory requirements. In this paper, we propose the multidimensional reduction principal component analysis (MdrPCA) algorithm to address this problem, and we aim at detecting the presence of outliers in a large amount of data via an online updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not store the entire data matrix or covariance matrix, so our approach is especially of interest for online or large-scale problems. By applying multidimensional reduction PCA to the target instance and extracting the principal direction of the data, the proposed MdrPCA allows us to determine the anomaly of the target instance according to the variation of the resulting dominant eigenvector. Since MdrPCA need not perform eigen-analysis explicitly, the proposed framework is favoured for online applications with computation or memory limitations, compared with the well-known power method for PCA and other popular anomaly detection algorithms.
Similar to IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULAR DESCRIPTOR DATASET USING PRINCIPAL COMPONENT OUTLIER DETECTION ALGORITHM AND COMPARATIVE NUMERICAL STUDY OF OTHER ROBUST ESTIMATORS (20)
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features that provide convenience and capability do so at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol.3, No.4, July 2013. DOI: 10.5121/ijdkp.2013.3405
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULAR DESCRIPTOR DATASET USING PRINCIPAL COMPONENT OUTLIER DETECTION ALGORITHM AND COMPARATIVE NUMERICAL STUDY OF OTHER ROBUST ESTIMATORS
Doreswamy^1 and Chanabasayya M. Vastrad^2
^1 Department of Computer Science, Mangalore University, Mangalagangotri-574 199, Karnataka, India
^1 Doreswamyh@yahoo.com
^2 channu.vastrad@gmail.com
ABSTRACT
Outlier detection has been in use for more than a decade and has robust applications in the medical and pharmaceutical sciences. It is used to detect anomalous behaviour in data, and typical problems in bioinformatics can be addressed by it. A computationally fast method for detecting outliers is presented that is particularly effective in high dimensions. The PrCmpOut algorithm makes use of simple properties of principal components to detect outliers in the transformed space, leading to significant computational advantages for high-dimensional data. This procedure requires considerably less computational time than existing methods for outlier detection. The properties of this estimator (outlier error rate (FN), non-outlier error rate (FP) and computational cost) are analyzed and compared with those of other robust estimators described in the literature through simulation studies. Numerical evidence based on the Oxazolines and Oxazoles molecular descriptor dataset shows that the proposed method performs well in a variety of situations of practical interest. It is thus a valuable companion to the existing outlier detection methods.
KEYWORDS
Robust distance, Mahalanobis distance, median absolute deviation, Principal Component Analysis, Projection Pursuit
1. INTRODUCTION
Accurate detection of outliers plays an important role in statistical analysis. If classical statistical models are blindly applied to data containing outliers, the results can be deceptive at best. An outlier is an observation that lies an abnormal distance from the other values in a dataset, and the identification of such observations is often the main purpose of the investigation. Classical methods based on the mean and covariance matrix are often unable to detect all the multivariate outliers
in a descriptor dataset, due to the masking effect [1], with the result that methods based on classical measures are inappropriate for general use unless it is certain that outliers are not present. Erroneous data are found in many situations, so robust methods that detect or downweight outliers are important tools for statisticians. The objective of this examination is to provide a detection of outliers prior to whatever modelling process is envisaged. Sometimes the detection of outliers is the primary purpose of the analysis; at other times the outliers need to be removed or downweighted prior to fitting non-robust models. We do not distinguish between the various reasons for outlier detection; we simply aim to inform the analyst of observations that are considerably different from the majority. Our procedures are therefore exploratory, and applicable to a wide variety of settings.
Most methods with a high resistance to outliers are computationally intensive; not coincidentally, the availability of cheap computing resources has enabled this field to develop significantly in recent years.
Among other proposals, there currently exists a wide variety of statistical models, ranging from regression to principal components [2], that can accommodate outliers without being excessively influenced, as well as several algorithms that explicitly focus on outlier detection.
There are several applications where multi-dimensional outlier identification and/or robust estimation are important. The field of computational drug discovery, for instance, has recently received a lot of attention from statisticians (e.g. the Bioconductor project, http://www.bioconductor.org). Improvements in computing power have allowed pharmacists to record and store extremely large databases of information. Such information is likely to contain a fair number of large errors, however, so robust methods are needed to prevent these errors from influencing the statistical model. Clearly, algorithms that take a long time to compute are not ideal, or even practical, for such large data sets.
There is a further difficulty encountered in molecular descriptor data. The number of dimensions is typically several orders of magnitude larger than the number of observations, leading to a singular covariance matrix, so the majority of statistical methods cannot be applied in the usual way. As will be discussed later, this situation can be handled through singular value decomposition, but it does require special attention. It can thus be seen that there are a number of important applications in which current robust statistical methods are impractical.
2. MATERIALS AND ALGORITHMS
2.1 The Data Set
The molecular descriptors of 100 Oxazoline and Oxazole derivatives [30-31], which act as H37Rv inhibitors, were analyzed. These molecular descriptors were generated using the PaDEL-Descriptor tool [32]. The dataset covers a diverse set of molecular descriptors with a wide range of inhibitory activities against H37Rv.
2.2 A Brief Overview of Outlier Detection
There are two basic approaches to outlier detection: distance-based methods and projection pursuit. Distance-based methods aim to detect outliers by computing a measure of how far a particular point is from the centre of the data. The common measure of "outlyingness" for a data point x_i \in \mathbb{R}^p, i = 1, \ldots, n, is a robust version of the Mahalanobis distance,
RD_i = \sqrt{ (x_i - T)^T C^{-1} (x_i - T) }    (1)
where T is a robust measure of the location of the descriptor data set X and C is a robust estimate of the covariance matrix. Problems faced by distance-based methods include (i) obtaining reliable estimates of T and C and (ii) deciding how large RD_i should be before a point is classified as outlying. This highlights the close connection between outlier detection and robust estimation: the latter is required as part of the former. Obtaining good robust estimators of T and C is essential for distance-based outlier detection methods. It is then crucial to find a metric (based on T and C) that separates outliers from regular points. The final separation boundary commonly depends on user-specified penalties for misclassification of outliers as well as regular points (inliers).
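As a simple illustration of equation (1), the following Python sketch (our own, not part of the original paper) computes the robust distances for a given robust location T and scatter C; here a coordinatewise median and a diagonal MAD-based matrix are used as a crude stand-in for a proper robust scatter estimator such as the MCD.

import numpy as np

def robust_distance(X, T, C):
    # RD_i = sqrt((x_i - T)^T C^{-1} (x_i - T)), as in equation (1)
    diff = X - T
    Cinv = np.linalg.inv(C)
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, Cinv, diff))

# crude stand-in for a robust (T, C): coordinatewise median, diagonal MAD^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
T = np.median(X, axis=0)
mad = 1.4826 * np.median(np.abs(X - T), axis=0)
RD = robust_distance(X, T, np.diag(mad ** 2))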
2.2.1 Robust Estimation as Main Goal
A simple robust estimate of location is the coordinatewise median. This estimator is not orthogonally equivariant (it does not transform correctly under orthogonal transformations); if this property is important, the L1 median should be used instead, expressed as
\hat{\mu} = \arg\min_{\mu \in \mathbb{R}^p} \sum_{i=1}^{n} \| x_i - \mu \| ,    (2)
where \| \cdot \| denotes the Euclidean norm. The L1 median has maximal breakdown point, and a fast algorithm for its computation is given in [3].
A simple robust estimate of scale is the MAD (median absolute deviation), expressed for a dataset \{x_1, \ldots, x_n\} \subset \mathbb{R} as
\mathrm{MAD}(x_1, \ldots, x_n) = 1.4826 \cdot \mathrm{med}_i \, | x_i - \mathrm{med}_j \, x_j | .    (3)
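A small Python sketch (illustrative, not from the paper) of the MAD of equation (3), together with a simple Weiszfeld-type iteration for the L1 median of equation (2); the fast algorithm of [3] is not reproduced here.

import numpy as np

def mad(x):
    # MAD(x_1,...,x_n) = 1.4826 * med_i | x_i - med_j x_j |, equation (3)
    med = np.median(x)
    return 1.4826 * np.median(np.abs(x - med))

def l1_median(X, n_iter=100, eps=1e-8):
    # Weiszfeld-style iteration for the spatial (L1) median of equation (2)
    mu = np.median(X, axis=0)              # start at the coordinatewise median
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), eps)
        w = 1.0 / d
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < eps:
            break
        mu = mu_new
    return mu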
More complex estimators of location and scale are given by the class of S-estimators [4], defined as the vector T and positive definite symmetric matrix C that satisfy
\min |C| \ \text{ s.t. } \ \frac{1}{n} \sum_{i=1}^{n} \rho(d_i / c) = b , \qquad d_i = \sqrt{ (x_i - T)^T C^{-1} (x_i - T) } ,    (4)
where \rho(\cdot) is a non-decreasing function on [0, \infty), and c and b are tuning constants that can be chosen to give particular breakdown properties. It is commonly easier to work with \psi = \partial\rho / \partial d, which has a root where \rho has a minimum.
Distance-based algorithms that pursue robust estimation as a main goal – without explicit outlier detection – include the OGK estimate [5], the minimum volume ellipsoid (MVE) and the minimum covariance determinant (MCD) [6-7].
The MCD tries to find the covariance matrix of minimum determinant covering at least h data points, where h determines the robustness of the estimator; it should be at least (n + p − 1)/2.
The MCD and MVE are instances of S-estimators with non-differentiable \rho(d_i), because \rho(d_i) is either 0 or 1. The MCD shows good performance on data sets of low dimension, but on larger data sets the computational burden can be prohibitive, as the exact solution requires a combinatorial search. In the latter case good starting points need to be obtained, yielding an approximately correct method. Equivariant procedures for obtaining these starting points, however, are based on subsampling, and the number of subsamples needed to achieve a tolerable level of accuracy increases rapidly with dimension. Rousseeuw and van Driessen (1999) developed a faster version of the MCD, which was a considerable advancement, but it is still quite computationally intensive. The OGK estimator [5] is based on pairwise robust estimates of the covariance.
Gnanadesikan and Kettenring (1972) computed a robust covariance estimate for two variables X and Y based on the identity
\mathrm{COV}(X, Y) = \frac{1}{4} \left( \sigma(X + Y)^2 - \sigma(X - Y)^2 \right) ,    (5)
where \sigma is a robust scale estimate. The matrix built from these pairwise estimates will not necessarily be positive semidefinite, so Maronna and Zamar (2002) proceeded by performing an eigendecomposition of this matrix. Since the variables in the eigenvector space are orthogonal, the covariances are zero and it is enough to obtain robust variance estimates of the data projected onto each eigenvector direction. The eigenvalues are then replaced with these robust variances, and the eigenvector transformation is applied in reverse to yield a positive semidefinite robust covariance matrix. If the original data matrix is robustly scaled (every component divided by its robust scale), the OGK is scale invariant. This method can be iterated, although Maronna and Zamar (2002) find that this is not consistently better. Maronna and Zamar (2002) also find that using weighted estimates is somewhat better, in which case the observations are weighted according to their robust distances d_i as scaled by the robust covariance matrix. They use a weighting function of the form I(d_i < d_0), where I(\cdot) is the indicator function and d_0 is taken to be
d_0 = \frac{ \chi^2_p(\beta) \, \mathrm{med}(d_1, \ldots, d_n) }{ \chi^2_p(0.5) } ,    (6)
where \chi^2_p(\beta) is the \beta-quantile of the \chi^2_p distribution. Observations thus receive full weight unless their robust distance d_i > d_0, in which case they receive zero weight. Maronna and Zamar (2002) note that the robust distances can be computed quickly in the eigenvector space, without the need for matrix inversion, because the components are orthogonal in this space. That is,
d_i = \sum_{j=1}^{p} \left( \frac{ z_{ij} - \mu(Z_j) }{ \sigma(Z_j) } \right)^2 , \quad i = 1, \ldots, n ,    (7)
where z_{ij} are the data in the space of eigenvectors, Z_j are the components in this space, \mu is a robust location estimate and \sigma is a robust scale estimate.
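The pairwise construction of equations (5)-(7) can be sketched in Python as follows; this is a simplified illustration of the idea (robust scale taken to be the MAD, no reweighting step), not the full OGK estimator of Maronna and Zamar (2002).

import numpy as np

def mad_scale(v):
    # robust scale: 1.4826 * median absolute deviation
    return 1.4826 * np.median(np.abs(v - np.median(v)))

def gk_cov(x, y):
    # Gnanadesikan-Kettenring pairwise covariance, equation (5)
    return 0.25 * (mad_scale(x + y) ** 2 - mad_scale(x - y) ** 2)

def ogk_style_distances(X):
    # sketch of equation (7): build the pairwise covariance matrix, move to its
    # eigenvector space (where components are orthogonal) and sum the squared,
    # robustly standardized coordinates -- no matrix inversion is needed
    p = X.shape[1]
    C = np.array([[gk_cov(X[:, j], X[:, k]) for k in range(p)] for j in range(p)])
    _, V = np.linalg.eigh(C)
    Z = X @ V
    mu = np.median(Z, axis=0)
    sig = 1.4826 * np.median(np.abs(Z - mu), axis=0)
    return (((Z - mu) / sig) ** 2).sum(axis=1)   # squared distances, equation (7)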
2.2.2 Explicit Outlier Detection
Extending robust estimation to outlier detection requires some knowledge of the distribution of the robust distances. If the data follow a multivariate normal distribution, the squared classical Mahalanobis distance (based upon the sample mean and covariance matrix) follows a \chi^2_p distribution [8]. In addition, if robust estimators T and C are applied to a large data set in which the non-outliers are normally distributed, Hardin and Rocke (2005) showed that the squared distances can be approximated by a scaled F-distribution. However, for non-normal data it is not clear how the outlier boundary should be chosen to give optimal classification rates. These considerations form the basis for the use of d_0 by Maronna and Zamar (2002). The transformation in equation (6) helps the distribution of the d_i's resemble that of \chi^2_p for non-normal original data, leading to better results for the cutoff value than simply \chi^2_p(\beta).
Promising algorithms that focus on the detection of outliers include the Rocke estimator [25], the S-fast estimator [26], the M-estimator [27], the MVE estimator [28], the NNC estimator [29], BACON [9], PCDist [10-11], Sign1 [12-13] and Sign2 [14], as well as outlier identification using robust (Mahalanobis) distances based on a robust multivariate location and covariance matrix [15]. BACON and the robust multivariate location and covariance matrix approach are distance-based and, appropriately, direct the larger part of their computational effort toward obtaining robust estimators T and C. BACON begins with a small subset of observations believed to be outlier-free, to which it iteratively adds points that have a small Mahalanobis distance based on the T and C of the current subset. One reason that makes the MCD unreliable for high p is that its contamination bias grows very quickly with p [16]. The robust multivariate location and covariance matrix approach aims to reduce the computational burden by subdividing the data into cells and running the MCD on each cell, i.e. reducing the number of observations that the MCD operates on while keeping the same number of dimensions. It then combines the results from each cell to yield a starting point for an S-estimator [17], which solves a complex minimization problem to give a robust estimate of the covariance matrix. S-estimators can occasionally converge to an inaccurate local solution, so a good starting point is needed. However, relying on the MCD in the first stage prevents this approach from handling large data sets, particularly those of high dimension. It would appear that methods based on combinatorial search, and variants thereof, have an inherent inability to handle large data sets.
2.2.3 Projection Pursuit
In contrast to distance-based procedures, projection pursuit methods [18] can equally be applied to robust estimation as a main goal or directed towards explicit outlier detection. The fundamental purpose of projection pursuit methods is to find suitable projections of the data in which the outliers are readily visible and can thus be downweighted to produce a robust estimator, which in turn can be used to detect the outliers. Because they do not assume the data to come from a particular distribution but only search for useful projections, projection pursuit methods are not affected by non-normality and can be applied in diverse data situations. The penalty for such independence comes in the form of increased computational burden, because it is not clear which projections should be examined; an exact method would require that all attainable directions be examined. The first equivariant robust estimator with a high breakdown point in arbitrary dimension was the Stahel-Donoho estimator [19-20]. A computational approximation based on directions from random subsamples was developed by Stahel (1981), but a large amount of time is necessary to obtain acceptable results. Even
though projection pursuit algorithms have the advantage of being applicable in different data situations, their computational demands seem formidable.
2.3 The High Dimensional Situation
High dimensional data present several problems to classical statistical analysis. As discussed earlier, computation time increases more rapidly with p than with n. For combinatorial and projection pursuit algorithms, this increase is large enough to call into question the practicability of such methods for high dimensional molecular descriptor data. Among the fast distance-based methods, the computation times of the algorithms increase linearly with n and cubically with p. This indicates that for very high dimensional molecular descriptor data the computational cost of inverting the scatter matrix is nontrivial. This is particularly evident in iterative methods which require many iterations to converge, since the covariance matrix is inverted on each iteration. Thus, while the Mahalanobis distance is a very useful metric for finding correlated multivariate outliers, it is expensive to calculate. Other methods of outlier detection fare even worse, usually giving up either computational time or detection accuracy. The MCD is a good example of this, in that the exact solution is very accurate but infeasible to compute for all but small molecular descriptor data sets, whereas a faster solution can be obtained if random subsampling is used to produce an approximate solution. It will be investigated in the results and discussion whether the subsampling version of the MCD is competitive regarding both accuracy and computation time. Projection pursuit methods, including the Stahel-Donoho estimator, have computation times that increase very rapidly in higher dimensions and are often at least an order of magnitude slower than distance-based methods, since their search for appropriate projections is an inherently time-consuming task. Thus, even if the Mahalanobis distance is computationally demanding due to the matrix inversion step, the robust version – RD_i, as defined in equation (1) – is an accurate metric for outlier detection and may well be more computationally reasonable than other approaches.
This is relevant to many biological applications, where the data commonly have orders of magnitude more dimensions than observations. This is also the typical situation in chemometrics, which led to the development of Partial Least Squares (PLS) [21] among other methods. Because the covariance matrix is singular, the robust Mahalanobis distance cannot be calculated. This is not as big a problem as it initially appears, however, because the data can be transformed via singular value decomposition to an equivalent space of dimension n − 1 [2] and the analysis carried out in the same way as for p < n. However, this situation needs special attention, and most outlier algorithms have to be modified to handle this type of high-dimensional molecular descriptor data.
High dimensional molecular descriptor data have various interesting geometrical properties, discussed in [22]. One such property that is particularly relevant to outlier detection is that high dimensional data points lie near the surface of an expanding sphere. For example, if \|x\| is the norm of x = (x_1, \ldots, x_p)^T drawn from a normal distribution with zero mean and identity covariance matrix, then, for large p, we have
\frac{ \| x \| }{ \sqrt{p} } = \frac{ \sqrt{ \sum_{j=1}^{p} x_j^2 } }{ \sqrt{p} } \longrightarrow 1 ,
because the summation follows a \chi^2_p distribution. In this manner, if the outliers have even a slightly different covariance structure from the inliers (non-outliers), they will lie on a different sphere. This does not help low dimensional outlier detection, but if an algorithm is capable of processing high dimensional molecular descriptor data, it should not be too hard to discover the different spheres of the outliers and inliers.
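A quick numerical check of this concentration property (our own illustration, not from the paper):

import numpy as np

# for x ~ N(0, I_p), ||x|| / sqrt(p) concentrates around 1 as p grows,
# so the points lie close to a sphere of radius sqrt(p)
rng = np.random.default_rng(1)
for p in (10, 100, 1000, 10000):
    X = rng.normal(size=(500, p))
    r = np.linalg.norm(X, axis=1) / np.sqrt(p)
    print(p, round(r.mean(), 3), round(r.std(), 3))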
Principal components are a well known method of dimension reduction that also suggest a way to detect high dimensional outliers. Recall that principal components are the directions that maximize the variance along each component, subject to the constraint of orthogonality. Because outliers increase the variance along their respective directions, it seems intuitive that outliers will become more visible in principal component space than in the original data space; i.e., at least some of the directions of maximum variance are likely to be those that enable the outliers to stand out. Searching for outliers in principal component space should, at the very least, not be any worse than searching for them in the original data space. If the data originally exist in a high-dimensional space, many of these dimensions likely do not provide significant additional information and are irrelevant. Principal components thus pick out a handful of highly informative components (relative to the total number of components), thereby performing a high degree of dimension reduction and making the data set much more computationally manageable without losing much information.
For high dimensional molecular descriptor data, a substantial portion of the smaller principal components are actually noise [23]. In particular, if p ≫ n, the larger part of the principal components will actually be noise and will contribute little to the total variance. By considering only those principal components that account for some agreed level of the total variance, the number of components can be substantially reduced so that only those components that are truly meaningful are kept. In practice we found good results using a level of 99%. It can be argued that this yields similar results to transforming the data via SVD to a dimension less than the minimum of n and p. Thus, instead of imposing a level of contribution to the variance such as 99%, it would also be possible to choose the n − 1 (or fewer) components with the largest variance.
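The 99% rule can be written down directly; the following Python sketch (illustrative, using the ordinary sample covariance rather than the semi-robust one introduced later) selects the p* leading components.

import numpy as np

def components_for_variance(X, level=0.99):
    # keep the smallest number p* of principal components whose eigenvalues
    # account for at least `level` (here 99%) of the total variance
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    p_star = int(np.searchsorted(ratio, level) + 1)
    return vecs[:, :p_star], p_star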
As noted in equation (7) of the OGK method [5], after dividing by the MAD the Euclidean distance in principal component space is equivalent to a robust Mahalanobis distance, because the off-diagonal elements of the scatter matrix are zero. Hence, it is not necessary to invert a p × p matrix when computing a measure of outlyingness for every point (i.e. the robust Mahalanobis distance); instead one simply divides (or "standardizes") each principal component by its respective variance element. Because the eigenvector decomposition has computational complexity O(p^3), comparable to matrix inversion, doing the robust distance computations in principal component space is not more time-consuming than in the original data space. If this transformation helps the outliers become more visible and reduces the number of iterations required to detect them, the result is a net saving in computational time.
It can be seen that the above approaches are based on simple, basic properties of principal components; this is further evidence of how principal components continue to offer appealing properties to both theoretical and applied statisticians.
2.4 Details of the Proposed PrCmpOut Method
The method we propose consists of two basic parts: the first aims to detect location outliers, and the second aims to detect scatter outliers. Scatter outliers have a different scatter matrix than the rest of the data, while location outliers are characterised by a different location parameter. To begin, it is useful to robustly rescale or sphere each component using the coordinatewise median and the MAD, leading to
x^{*}_{i,j} = \frac{ x_{i,j} - \mathrm{med}(x_{1,j}, \ldots, x_{n,j}) }{ \mathrm{MAD}(x_{1,j}, \ldots, x_{n,j}) } , \quad j = 1, \ldots, p .    (8)
Dimensions with a MAD of zero should either be excluded, or another scale measure has to be used. Starting from the rescaled data x^{*}_{i,j}, we calculate a covariance matrix, from which we compute the eigenvalues and eigenvectors and hence a semi-robust principal components decomposition. We keep only those eigenvectors whose eigenvalues contribute at least 99% of the total variance; call this new dimension p^{*}. The remaining components are generally useless noise and only serve to obscure any underlying structure. For the case p ≫ n, this also resolves the singularity problem because p^{*} < n. With the p × p^{*} matrix V of eigenvectors we thus obtain the matrix of principal components as
Z = X^{*} V ,    (9)
where X^{*} is the matrix with elements x^{*}_{i,j}. These principal components are rescaled by the median and the MAD, analogously to equation (8),
z^{*}_{i,j} = \frac{ z_{i,j} - \mathrm{med}(z_{1,j}, \ldots, z_{n,j}) }{ \mathrm{MAD}(z_{1,j}, \ldots, z_{n,j}) } , \quad j = 1, \ldots, p^{*} .    (10)
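Equations (8)-(10) amount to the following pre-processing sketch (Python, illustrative; the plain sample covariance of the rescaled data is used here as the semi-robust covariance matrix):

import numpy as np

def med_mad_cols(A):
    med = np.median(A, axis=0)
    return med, 1.4826 * np.median(np.abs(A - med), axis=0)

def sphere_and_project(X, level=0.99):
    # equation (8): rescale each column by its median and MAD
    med, s = med_mad_cols(X)
    keep = s > 0                          # drop columns with MAD = 0
    Xs = (X[:, keep] - med[keep]) / s[keep]
    # equation (9): principal components of the rescaled data, keeping the
    # p* components that carry at least 99% of the total variance
    vals, vecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    p_star = int(np.searchsorted(np.cumsum(vals) / vals.sum(), level) + 1)
    Z = Xs @ vecs[:, :p_star]
    # equation (10): rescale the principal components in the same way
    zmed, zs = med_mad_cols(Z)
    return (Z - zmed) / zs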
Keep Z^{*} for the second stage of the algorithm. After the above pre-processing steps, the location outlier stage begins by computing the absolute value of a robust kurtosis measure for each component:
w_j = \left| \frac{1}{n} \sum_{i=1}^{n} \frac{ \left( z^{*}_{i,j} - \mathrm{med}(z^{*}_{1,j}, \ldots, z^{*}_{n,j}) \right)^4 }{ \mathrm{MAD}(z^{*}_{1,j}, \ldots, z^{*}_{n,j})^4 } - 3 \right| , \quad j = 1, \ldots, p^{*} .    (11)
We use the absolute value because, similarly to the multivariate outlier detection and robust covariance matrix estimation of [24], both small and large values of the kurtosis coefficient can be characteristic of outliers. This allows us to assign a weight to each component according to how likely we think it is to reveal the outliers. We use the relative weights w_j / \sum_i w_i to obtain a common scale 0 \le w_j \le 1. If no outliers are present in a given component, we expect the principal components to be nearly normally distributed, like the original data, producing a kurtosis close to zero. Because the presence of outliers is likely to cause the kurtosis to differ from zero, we weight each of the p^{*} dimensions in proportion to the absolute value of its kurtosis coefficient. Assigning equal weights to all components (during the computation of
robust Mahalanobis distances) reduces the discriminatory power, because if outliers clearly stick out in one component, the information in that component will be diluted unless it is given a higher weight. Especially in principal component space, outliers are more likely to be clearly visible in one specific component than slightly visible in several components, so it is important to assign that component a higher weight. Because the components are uncorrelated, we compute a robust Mahalanobis distance using the distance from the median (as scaled by the MAD), weighting each component according to the relative weights w_j / \sum_i w_i with the kurtosis measure w_j defined in equation (11). The kurtosis measure in equation (11) helps to guarantee that important information contained in a particular component is not diluted by components which do not separate the outliers.
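A Python sketch of the kurtosis weights of equation (11) (illustrative):

import numpy as np

def kurtosis_weights(Zstar):
    # equation (11): absolute robust kurtosis of each rescaled component,
    # normalised to relative weights w_j / sum_i w_i
    med = np.median(Zstar, axis=0)
    madv = 1.4826 * np.median(np.abs(Zstar - med), axis=0)
    w = np.abs(((Zstar - med) ** 4).mean(axis=0) / madv ** 4 - 3.0)
    return w / w.sum()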
To complete the first stage of the algorithm, we must determine how large the robust Mahalanobis distance should be to obtain an accurate classification of outliers and non-outliers. The kurtosis weighting strategy destroys any similarity to a \chi^2_{p^{*}} distribution that might have existed, so it is not possible to use a \chi^2_{p^{*}} quantile as a separation boundary. However, analogously to Maronna and Zamar (2002) and to equation (6), we found that transforming the robust distances \{RD_i\} as
d_i = RD_i \cdot \frac{ \sqrt{ \chi^2_{p^{*}, 0.5} } }{ \mathrm{med}(RD_1, \ldots, RD_n) } \quad \text{for } i = 1, \ldots, n    (12)
makes the empirical distances \{d_i\} have the same median as the assumed distribution, and thus brings the former reasonably close to \chi^2_{p^{*}}, where \chi^2_{p^{*}, 0.5} is the 50th percentile of the \chi^2_{p^{*}} distribution. We use the translated biweight function [25] to assign a weight to every observation and use these weights as a measure of outlyingness. The translated biweight fits into the general scheme of S-estimators defined by equation (4) and is similar to Tukey's biweight function, except that \psi is translated by some distance M from the origin; that is, observations closer than the scaled distance M to the location estimate receive a full weight of 1. The \psi function of the translated biweight is thus given by
\psi(d; c, M) = \begin{cases} d , & 0 \le d < M \\ d \left( 1 - \left( \frac{d - M}{c} \right)^2 \right)^2 , & M \le d \le M + c \\ 0 , & d > M + c \end{cases}    (13)
which corresponds to the weighting function
w(d; c, M) = \begin{cases} 1 , & 0 \le d < M \\ \left( 1 - \left( \frac{d - M}{c} \right)^2 \right)^2 , & M \le d \le M + c \\ 0 , & d > M + c \end{cases}    (14)
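The weight function of equation (14) translates directly into code (Python, illustrative):

import numpy as np

def translated_biweight_weight(d, c, M):
    # equation (14): full weight inside M, smooth biweight decay on [M, M + c],
    # zero weight beyond M + c
    d = np.asarray(d, dtype=float)
    w = np.zeros_like(d)
    w[d < M] = 1.0
    mid = (d >= M) & (d <= M + c)
    w[mid] = (1.0 - ((d[mid] - M) / c) ** 2) ** 2
    return w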
Precisely, assigning known non-outliers full weights of one while assigning known outliers weights of zero improves the performance of the estimators (provided the classifications are correct), and is also computationally faster. Between these extremes is a subset of points that receive weights similar to the usual biweight function. To maintain a high level of robustness, it is best to be conservative in assigning weights of one, since if any outliers enter the procedure with a weight of one (or close to it) they will make the other outliers harder to detect due to the masking effect. Recall that, since the principal components have been scaled by the median and MAD, these robust distances measure a weighted distance from the median (in units of the MAD). We found good experimental results assigning a weight of one to the 1/3 of points having the smallest robust distances. At the other end of the weighting scheme, we assign zero weight to points with d_i > c_1, where
c_1 = \mathrm{med}(d_1, \ldots, d_n) + 2.5 \cdot \mathrm{MAD}(d_1, \ldots, d_n) ,    (15)
corresponding roughly to classical outlier boundaries. Similarly to equation (14), the weights for each observation are calculated by the translated biweight function as
w_{1i} = \begin{cases} 0 , & d_i \ge c_1 \\ \left( 1 - \left( \frac{d_i - M_1}{c_1 - M_1} \right)^2 \right)^2 , & M_1 < d_i < c_1 \\ 1 , & d_i \le M_1 \end{cases}    (16)
where i = 1, \ldots, n and M_1 is the 33 1/3 % quantile of the distances \{d_1, \ldots, d_n\}. Alternative weighting strategies were tested; the benefit of the translated biweight is that it allows a subset of points (that we are quite sure are non-outliers) to be given full weight, while another subset of points that is likely to contain outliers can be given weights of zero, thereby excluding undue influence by potential outliers, with a smooth weighting curve for the points in between.
algorithm. The second stage of our algorithm is similar to the first except that we don’t use the
kurtosis weighting layout. Principal components addresses on those directions that have large
variance, so it is possibly not unexpected that we find good results searching for scatter outliers in
the semi robust principal component space expressed at the beginning of this section. That is, we
search for outliers in the space determined by ܼ∗
from equation (10). Earlier , computing the
Euclidian norm for data in principal component space is similar to the Mahalanobis distance in
the original data space, other than that it is faster to compute.
Because the distribution of these distances has not been modified like it was through the kurtosis
weighting layout and assuming we start with normally distributed non-outliers, transforming the
robust distances as before via equation (12) results in a distribution that is fairly close to χ୮∗
ଶ
. In
establishing the interpreted biweight as in equation (16), then, acceptable results can be acquired
by assigning ܯଶ
equal to the χ୮∗
ଶ
25th
quantile and ܿଶ
equal to the χ୮∗
ଶ
99th
quantile. This
distribution is certainly not exactly equal to χ୮∗
ଶ
, so there are chances when visual examination of
these distances could head to a better boundary than this automated algorithm. Denote the
weights calculated in this way, ݓଶ , ݅ = 1, … . . , ݊.
Finally, we combine the weights from these two stages to compute final weights w_i, i = 1, \ldots, n, as
w_i = \frac{ (w_{1i} + s)(w_{2i} + s) }{ (1 + s)^2 } ,    (17)
where the scaling constant is usually s = 0.25. The justification for introducing s is that frequently too many non-outliers receive a weight of 0 in only one of the two stages; setting s ≠ 0 helps to ensure that the final weight w_i = 0 only if both stages assign a low weight. Outliers are then classified as points that have weight w_i < 0.25. These values mean that if one of the two weights w_{1i} or w_{2i} equals one, the other must be less than 0.0625 for the point x_i to be classified as an outlier. Or, if w_{1i} = w_{2i}, this common value must be less than 0.375 for x_i to be classified as outlying.
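Equation (17) and the 0.25 cutoff can be expressed as (Python, illustrative):

import numpy as np

def final_weights(w1, w2, s=0.25):
    # equation (17): combine the stage-1 and stage-2 weights; observations
    # with a final weight below 0.25 are flagged as outliers
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    w = (w1 + s) * (w2 + s) / (1.0 + s) ** 2
    return w, w < 0.25

For example, with s = 0.25 a point with w_{1i} = 1 is flagged only if w_{2i} < 0.0625, matching the figures quoted above.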
We will hereafter refer to this algorithm as PrCmpOut. It is useful to recap the algorithm briefly.
Stage 1: Detection of location outliers
a) Robustly sphere the data as in equation (8). Compute the sample covariance matrix of the transformed data X^{*}.
b) Compute a principal component decomposition of the semi-robust covariance matrix from Step a and retain only those p^{*} eigenvectors whose eigenvalues contribute at least 99% of the total variance. Robustly sphere the transformed data as in equation (10).
c) Compute the robust kurtosis weights for each component as in equation (11), and from them weighted norms for the sphered data from Step b. Because the data have been scaled by the MAD, these Euclidean norms in principal component space are equivalent to robust Mahalanobis distances. Transform these distances as in equation (12).
d) Assign a weight w_{1i} to every robust distance according to the translated biweight in equation (16), with M_1 equal to the 33 1/3 % quantile of the distances \{d_1, \ldots, d_n\} and c_1 = \mathrm{med}(d_1, \ldots, d_n) + 2.5 \cdot \mathrm{MAD}(d_1, \ldots, d_n).
Stage 2: Detection of scatter outliers
e) Use the same semi-robust principal component decomposition computed in Step b and compute the (unweighted) Euclidean norms of the data in principal component space. Transform them as in equation (12) to produce a set of distances for use in Step f.
f) Assign a weight w_{2i} to every robust distance according to the translated biweight in equation (16), with c_2 equal to the 99th percentile and M_2 equal to the 25th percentile of \chi^2_{p^{*}}.
Combining Stage 1 and Stage 2: Use the weights from Steps d and f to compute final weights for all observations as in equation (17).
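For concreteness, a compact self-contained Python sketch of Steps a-f and the final combination follows. It is a simplified reading of the algorithm, not the authors' R implementation: it uses the plain sample covariance in Step a, one particular interpretation of the kurtosis-weighted distance in Step c, and takes the chi-square cut-offs of Step f on the same (root-chi-square) scale as the transformed distances.

import numpy as np
from scipy.stats import chi2

def prcmpout_sketch(X, level=0.99, s=0.25):
    def med_mad(A):
        m = np.median(A, axis=0)
        return m, 1.4826 * np.median(np.abs(A - m), axis=0)

    def biweight(d, M, c):
        # hard 0/1 ends with a smooth biweight transition, as in equation (16)
        w = np.zeros_like(d)
        w[d <= M] = 1.0
        mid = (d > M) & (d < c)
        w[mid] = (1.0 - ((d[mid] - M) / (c - M)) ** 2) ** 2
        return w

    # Steps a-b: robust sphering, PCA, keep 99% of the variance, re-sphere
    m, sd = med_mad(X)
    Xs = (X[:, sd > 0] - m[sd > 0]) / sd[sd > 0]
    vals, vecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    p_star = int(np.searchsorted(np.cumsum(vals) / vals.sum(), level) + 1)
    Z = Xs @ vecs[:, :p_star]
    zm, zs = med_mad(Z)
    Z = (Z - zm) / zs

    # Steps c-d: kurtosis-weighted distances and location-outlier weights w1
    kurt = np.abs((Z ** 4).mean(axis=0) - 3.0)    # Z has median 0, MAD 1
    wj = kurt / kurt.sum()
    d1 = np.sqrt((wj * Z ** 2).sum(axis=1))
    d1 *= np.sqrt(chi2.ppf(0.5, p_star)) / np.median(d1)        # equation (12)
    M1 = np.quantile(d1, 1.0 / 3.0)
    c1 = np.median(d1) + 2.5 * 1.4826 * np.median(np.abs(d1 - np.median(d1)))
    w1 = biweight(d1, M1, c1)

    # Steps e-f: unweighted distances and scatter-outlier weights w2
    d2 = np.sqrt((Z ** 2).sum(axis=1))
    d2 *= np.sqrt(chi2.ppf(0.5, p_star)) / np.median(d2)
    w2 = biweight(d2, np.sqrt(chi2.ppf(0.25, p_star)), np.sqrt(chi2.ppf(0.99, p_star)))

    # combine the two stages, equation (17); weight < 0.25 marks an outlier
    w = (w1 + s) * (w2 + s) / (1.0 + s) ** 2
    return w, w < 0.25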
2.5 Preliminary Investigation of the Developed Method
As a preliminary step to developing a new outlier detection method, we briefly examined and compared existing methods to determine possible areas of improvement. We selected nine algorithms that appeared to have good potential for finding outliers: the Rocke estimator [25], the S-fast estimator [26], the M-estimator [27], the MVE estimator [28], the NNC estimator [29], BACON [9], PCDist [10-11], Sign1 [12-13] and Sign2 [14]. Somewhat similar to our method, the Sign procedure is also based on a type of robust principal component analysis. It obtains robust estimates of location and spread by projecting the data onto a sphere. In this way, the effects of outlying observations are limited, since they are placed on the boundary of the ellipsoid, and the resulting mean and covariance matrix are robust. Standard principal components can thus be computed on the sphered data without undue influence by any single point (or small subset of points). We considered numbers of attributes p = 10, 20, 30, 40 and critical values α = 0.05, 0.1, 0.15, 0.2, so for each parameter combination we carried out 14 simulations with n = 100 observations.
Frequently, results of this type are presented in a 2 × 2 table showing success and failure in identifying outliers and non-outliers, which we henceforth call inliers. In this paper, we present the percentage of false negatives (FN) followed by the percentage of false positives (FP) in the same cell; we henceforth refer to these respective values as the outlier error rate and the inlier error rate (non-outlier error rate). We propose these names because the outlier error rate specifies the percentage of errors recorded within the group of true outliers, and similarly for the inlier error rate. The true outliers among the 100 observations are 10, 16, 18, 22, 23, 25, 27, 29, 30, 47, 66, 70, 72, 80, 84, 90, 99 and 100. This presentation is more compact than a series of 2 × 2 tables and allows us to easily focus on bringing the error rates down to zero.
Table 1. Sample 2 × 2 table illustrating the notation used in this paper

                  Predicted Outliers   Predicted Inliers
True Outliers             a                    b
True Inliers              c                    d

\text{Outlier error rate (FN)} = \frac{b}{a + b} , \qquad \text{Inlier error rate (FP)} = \frac{c}{c + d}
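In code, the two error rates of Table 1 are simply (Python, with made-up counts for illustration):

def error_rates(a, b, c, d):
    # a: true outliers flagged, b: true outliers missed,
    # c: inliers flagged as outliers, d: inliers kept (cells of Table 1)
    return b / (a + b), c / (c + d)    # (outlier error rate FN, inlier error rate FP)

# e.g. 15 of 18 outliers found, 5 of 82 inliers wrongly flagged (hypothetical numbers)
print(error_rates(15, 3, 5, 77))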
3. RESULTS AND DISCUSSION
Although this algorithm was designed primarily for computational performance in high dimensions, we compare its performance against other outlier algorithms in low dimension, because a high dimensional comparison is not feasible. In the following we examine a variety of outlier configurations in molecular descriptor data. Examination of Table 2 reveals that PrCmpOut performs well at identifying outliers, although it has a higher percentage of false positives than some of the other methods. PrCmpOut has the lowest percentage of false negatives, often by a comfortable margin, and is a competitive outlier detection method.
Table 2 Outlyingness measures: average percentage of outliers that were not identified and
average percentage of regular observations that were declared outliers
We present simulation results in which the dimension was increased from p = 10 to p = 40, based on the mean of 16 simulations at each level. In contrast to the previous simulation experiment in dimension p = 10, in this case we were not able to examine the performance of the other algorithms, since they were not computationally feasible for these dimensions. The number of observations was held constant at n = 100, as was the number of outliers at 18. With increasing dimension PrCmpOut can identify almost all outliers. None of the known methods has much success in identifying outliers for small dimensions because, geometrically, the outliers are not very different from the non-outliers. However, as the dimension increases, the outliers separate from the non-outliers and become easier to detect. At p = 20 dimensions, barely more than half of the outliers can be detected; at p = 30 dimensions almost 90% are detected; and at p = 40 dimensions more than 95% of the outliers are detected.
Figure 1. Average outlier error rate (left) and average non-outlier error rate (right) for varying fraction of outliers
The results for PrCmpOut and the remaining estimators are presented in Figure 1. For an outlier fraction of 10%, all estimators except PCDist perform excellently in terms of the outlier error rate (FN) and detect outliers independently of the percentage of outliers, as seen in the left panel of Figure 1. The average percentage of non-outliers that were declared outliers (FP) differs: PrCmpOut performs best, followed closely by Sign2 and PCDist (below 10%), with the rest of the estimators coming next at more than 20%. MVE declares somewhat more than 30% of the regular observations as outliers. Sign1 performs worst, with the average non-outlier error rate increasing as the fraction of outliers increases. In terms of the outlier error rate, PrCmpOut performs best (11%), followed by the NNC, Sign1, BACON and MVE estimators with error rates between 13% and 34%; PCDist (89%) is again last.
Exploratory data analysis is often used to get some understanding of the data at hand, one important aspect being the possible occurrence of outliers. Sometimes the detection of these outliers is the goal of the data analysis; more often, however, they must be identified and dealt with in order to assure the validity of inferential methods. Most outlier detection methods are not very robust when the multinormality assumption is violated, and in particular when outliers are present. Robust distance plots are commonly used to identify outliers. To demonstrate this idea, we first compute the distances based on the sample mean vector and the sample covariance matrix. Points which have distances larger than \chi^2_{p, 1-\alpha} are usually viewed as potential outliers, and so we label such points accordingly. Figure 2 shows the distance plots of all the methods.
Figure 2. Robust distance plots for the Oxazolines and Oxazoles molecular descriptor data set. The red points are outliers according to the robust distances.
It is important to consider the computational performance of the different outlier detection algorithms for the large Oxazolines and Oxazoles molecular descriptor dataset with n = 100. To evaluate and compare the computation times, a simulation experiment was carried out. The experiment was performed on an Intel Core i3 with 6 GB RAM running Windows 7 Professional. All computations were performed in R i386 2.15.3. The Rocke, M, S-fast, MVE, NNC, BACON, PCDist, Sign1, Sign2 and proposed PrCmpOut algorithms were used. The fastest is PrCmpOut, followed closely by Sign1, Sign2, PCDist, NNC and BACON. The slowest is the S-fast estimator, followed by the Rocke estimator, the M-estimator and MVE. These computation times are presented graphically in Figure 3 on a scale of seconds.
Figure 3. Comparing the computation time of outlier detection algorithms in the presence of outliers
A fast multivariate outlier detection method is particularly useful in the field of bioinformatics, where hundreds or even thousands of molecular descriptors need to be analyzed. Here we focus only on outlier detection among the molecular descriptors, clearly a high dimensional data set. First, columns with MAD equal to zero were removed, and the remaining columns were investigated for outliers. Figure 4 shows the results from PrCmpOut. The intermediate graphs provide more insight into the workflow of the method: the kurtosis weights of Step c of the PrCmpOut algorithm are shown in the upper left panel, together with the weight boundaries described in Step d, leading to the weights in the upper right panel. Similarly, the distances from Step e and the weights from Step f are shown in the left and right panels of the second row, respectively. The lower left panel shows the combined weights (Stages 1 and 2 combined) together with the outlier boundary 0.25, which results in 0/1 weights (lower right panel). The outlying observations are clearly visible in the intermediate steps of the algorithm.
Figure 4. The panels show the intermediate steps of the PrCmpOut algorithm (distances and
weights) for analyzing Oxazolines and Oxazoles molecular descriptor dataset.
We compare the performance of PrCmpOut on this data set with the Sign2 method; the other
algorithms are not feasible due to the high dimensionality. Figure 5 shows the distances (left) and
weights (right) as calculated by the Sign2 method.
Figure 5. Distances and weights for analyzing the Oxazolines and Oxazoles molecular descriptor dataset with the Sign2 method.
A possible explanation for the difficulty experienced by the Sign2 method is a masking effect for PCA. It is evident that PrCmpOut performs better than the Sign2 method, which is also apparent from the results in Table 2. We infer that PrCmpOut is a competitive outlier detection algorithm with regard to detection accuracy as well as computation time.
4. CONCLUSIONS
PrCmpOut is a method for detecting outliers in multivariate data that utilizes inherent properties of the principal components decomposition together with a robust kurtosis measure. It demonstrates very good performance for high dimensional data. In this paper we tested several approaches for identifying outliers in the Oxazolines and Oxazoles molecular descriptor dataset. Two aspects seem to be of major importance: the computation time and the accuracy of the outlier detection method. For the latter we used the fraction of false negatives (FN), outliers that were not identified, and the fraction of false positives (FP), non-outliers that were declared outliers. PrCmpOut is very fast to compute and can easily handle high dimensions. It can therefore be applied in fields such as bioinformatics and data mining, where the computational feasibility of statistical routines has usually been a limiting factor. At lower dimensions, it still produces competitive results when compared to well-known outlier detection methods.
ACKNOWLEDGEMENTS
We gratefully thank the Department of Computer Science, Mangalore University, Mangalore, India, for technical support of this research.
REFERENCES
[1] Ben-Gal I., “Outlier detection” , In: Maimon O. and Rockach L. (Eds.) Data Mining and Knowledge
Discovery Handbook: A Complete Guide for Practitioners and Researchers," Kluwer Academic
Publishers, 2005, ISBN 0-387-24435-2.
[2] Hubert, M., Rousseeuw, P., Branden, K. V., “Robpca: A new approach to robust principal
components analysis”,Technometrics 2005.47, 64–79
[3] Hössjer, O., Croux, C., “Generalizing univariate signed rank statistics for testing and estimating a
multivariate location parameter. “ ,Journal of Nonparametric Statistics 4 1995. 293–308.
[4] Maronna, R., Martin, R., Yohai, V., “Robust Statistics: Theory and Methods.” ,2006. John Wiley.
[5] Maronna, R., Zamar, R., “Robust estimates of location and dispersion for high-dimensional data
sets.”,Technometrics 2002. 44 (4), 307–317.
[6] Rousseeuw, P., In: Grossmann, W., Pflug, G., Vincze, I., Wertz, W. (Eds.), “Multivariate estimation
with high breakdown point. “Mathematical Statistics and Applications. Vol. B. Reidel Publishing
Company, pp. 1985. 283–297.
[7] Rousseeuw, P., van Driessen, K., “A fast algorithm for the minimum covariance determinant
estimator.”, Technometrics 1999. 41, 212–223.
[8] Johnson, R., Wichern, D. W “Applied Multivariate Statistical Analysis”, 4th Edition. , 1998. Prentice-
Hall.
[9] Billor, N., Hadi, A., Velleman, P., “BACON: Blocked adaptive computationally-efficient outlier
nominators.”, Computational Statistics and Data Analysis 2000. 34, 279–298.
[10] Peter Filzmoser,, Valentin Todorov, “Robust Tools for the Imperfect World”, Elsevier September 25,
2012 http://www.statistik.tuwien.ac.at/public/filz/papers/2012IS.pdf
[11] A.D. Shieh and Y.S. Hung,“Detecting Outlier Samples in Microarray Data”, Statistical Applications
in Genetics and Molecular Biology (2009)Vol. 8
[12] Jerome H. Friedman ,“Exploratory Projection Pursuit”, Journal Of The American Statistical
Association Vol. 82, No. 397(Mar., 1987) , 249-266.
[13] Robert Serfling and Satyaki Mazumder ,” Computationally Easy Outlier Detection via Projection
Pursuit with Finitely Many Directions” ,
http://www.utdallas.edu/~serfling/papers/Serfling_Mazumder_final_version.pdf
[14] Locantore, N., Marron, J., Simpson, D., Tripoli, N., Zhang, J., Cohen, K.,. “Robust principal
components for functional data.” Test 1999 8, 1–73.
[15] P. J. Rousseeuw and B. C. Van Zomeren. “Unmasking multivariate outliers and leverage points.”,
Journal of the American Statistical Association. Vol. 85(411), (1990) pp. 633-651.
[16] Adrover, J., Yohai, V. ,”Projection estimates of multivariate location.” ,The Annals of Statistics 30,
2002 1760–1781.
[17] Davies, P.,“Asymptotic behaviour of s-estimates of multivariate location parameters and dispersion
matrices.“,The Annals of Statistics 15 (3), 1987. 1269–1292.
[18] Huber, P.,” Projection pursuit.“, Annals of Statistics 1985 13 (2), 435–475,
[19] Donoho, D., ”Breakdown properties of multivariate location estimators.”, 1982. Ph.D. thesis,
Harvard University.
[20] Stahel, W., “Breakdown of covariance estimators.”, Research Report 1981. 31, Fachgruppe
für Statistik, E.T.H. Zürich.
[21] Tenenhaus, M., Vinzi, V. E., Chatelin, Y.-M., Lauro, C., “Pls path modeling”, 48 (1), 2005. 159–205.
[22] Hall, P., Marron, J., Neeman, A., “Geometric representation of high dimension, low sample size
data”, Journal of the Royal Statistical Society, Series B: Statistical Methodology 67, 2005. 427–444
[23] Jackson, J., “A User’s Guide to Principal Components. “,1991.,Wiley & Sons, New York.
[24] Peña, D., Prieto, F., “Multivariate outlier detection and robust covariance matrix estimation.”,
Technometrics 43 (3) 2001., 286–310.
[25] Rocke, D., ,”Robustness properties of s-estimators of multivariate location and shape in high
dimension.”, Annals of Statistics 24 (3), 1996. 1327–1345.
[26] C. Croux, C. Dehon, A.Yadine ,“ON THE OPTIMALITY OF MULTIVARIATE S-ESTIMATORS”,
Scandinavian Journal of Statistics, 38(2), 332-341
[27] Leonard A. Stefanski and Dennis D. Boos ,“Calculus of M-Estimation” ,The American Statistician,
February 2002. Vol 56. No. 1.
[28] Christophe Croux, Gentiane Haesbroeck, “Location Adjustment for the Minimum Volume Ellipsoid
Estimator” , Journal of Statistical Computation and Simulation , 72, 585-596
[29] Croux, C., and Van Aelst, C., "Nearest Neighbor Variance Estimation (NNVE): Robust Covariance
Estimation via Nearest Neighbor Cleaning" by Wang, N., and Raftery, A.E., The Journal of the
American Statistical Association, 97, (2002) 1006-1009.
[30] Andrew J. Phillips, Yoshikazu Uto, Peter Wipf, Michael J. Reno, and David R. Williams, “Synthesis
of Functionalized Oxazolines and Oxazoles with DAST and Deoxo-Fluor” Organic Letters 2000 Vol
2 ,No.8 1165-1168
[31] Moraski GC, Chang M, Villegas-Estrada A, Franzblau SG, Möllmann U, Miller MJ.,”Structure-
activity relationship of new anti-tuberculosis agents derived from oxazoline and oxazole benzyl
esters”, Eur J Med Chem. 2010 May;45(5):1703-16. doi: 10.1016/j.ejmech.2009.12.074. Epub 2010
Jan 14.
[32] “Padel-Descriptor” http://padel.nus.edu.sg/software/padeldescriptor/
Authors
Doreswamy received his B.Sc degree in Computer Science and M.Sc degree in Computer Science from the University of Mysore in 1993 and 1995 respectively, and his Ph.D degree in Computer Science from Mangalore University in 2007. After completing his post-graduate degree, he joined and served as Lecturer in Computer Science at St. Joseph's College, Bangalore, from 1996 to 1999. He was then elevated to the position of Reader in Computer Science at Mangalore University in 2003. He was Chairman of the Department of Post-Graduate Studies and Research in Computer Science from 2003 to 2005 and again from 2008 to 2009, and has served in various capacities at Mangalore University; at present he is Chairman of the Board of Studies and Associate Professor in Computer Science at Mangalore University. His research interests include Data Mining and Knowledge Discovery, Artificial Intelligence and Expert Systems, Bioinformatics, Molecular Modelling and Simulation, Computational Intelligence, Nanotechnology, Image Processing and Pattern Recognition. He has been granted a major research project entitled “Scientific Knowledge Discovery Systems (SKDS) for Advanced Engineering Materials Design Applications” by the funding agency University Grants Commission, New Delhi, India. He has published about 30 peer-reviewed papers in national and international journals and conferences. He received the SHIKSHA RATTAN PURASKAR for his outstanding achievements in 2009 and the RASTRIYA VIDYA SARASWATHI AWARD for outstanding achievement in his chosen field of activity in 2010.
Chanabasayya M. Vastrad received his B.E. degree and M.Tech. degree in 2001 and 2006, respectively. He is currently working towards his Ph.D degree in Computer Science and Technology under the guidance of Dr. Doreswamy in the Department of Post-Graduate Studies and Research in Computer Science, Mangalore University.