A current trend in industry is to analyze large data sets with data mining and machine learning techniques in order to identify patterns. A key challenge with such data sets is their high dimensionality: in data analytics applications, very wide data can degrade performance, and because many data mining algorithms operate column-wise, too many columns slow them down. Dimensionality reduction is therefore an important step in data analysis. It converts high dimensional data into a much lower dimensional representation, such that the maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural network techniques for data reduction; each method has a different rationale for preserving the relationships between input parameters during analysis. Principal Component Analysis (PCA), a multivariate technique, and the Self Organising Map (SOM), a neural network technique, are presented, and a hierarchical clustering approach is applied to the reduced data set. A case study of air quality measurements is used to evaluate the performance of the proposed techniques.
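A minimal sketch of the pipeline this abstract describes: PCA for dimensionality reduction followed by hierarchical clustering on the reduced data. The synthetic data and the 95% variance threshold are illustrative assumptions, not the paper's setup, and the SOM branch is omitted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))            # stand-in for air quality measurements

X_std = StandardScaler().fit_transform(X)  # PCA assumes comparable scales
pca = PCA(n_components=0.95)               # keep components explaining 95% variance
X_red = pca.fit_transform(X_std)
print("retained dimensions:", X_red.shape[1])
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# hierarchical clustering on the reduced representation
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_red)
print("cluster sizes:", np.bincount(labels))
```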
COMPARISON OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN... (ijiert bestjournal)
The issue of incomplete data exists across the entire field of data mining. In this paper, Mean Imputation, Median Imputation, and Standard Deviation Imputation are used to deal with the challenges that incomplete data poses for classification problems. Each imputation method converts the incomplete dataset into a complete dataset; the suitable imputation method is then applied to the complete dataset, and the percentage errors of the imputation methods are compared.
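A small sketch of the comparison described above, under the assumption of numeric exam scores: mask some values, fill them with the column mean and median, and report each method's percentage error against the true values. The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
true = rng.normal(60, 15, size=200)            # hypothetical mid-term marks
mask = rng.random(200) < 0.2                   # 20% of entries go missing
observed = np.where(mask, np.nan, true)

def pct_error(imputed):
    # mean absolute percentage error on the masked entries only
    return np.mean(np.abs(imputed[mask] - true[mask]) / true[mask]) * 100

for name, value in [("mean", np.nanmean(observed)),
                    ("median", np.nanmedian(observed))]:
    filled = np.where(mask, value, observed)
    print(f"{name} imputation: {pct_error(filled):.2f}% error")
```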
Data warehouses are structures holding large amounts of data collected from heterogeneous sources to be used in a decision support system. Data warehouse analysis identifies hidden, initially unexpected patterns, but this analysis requires great memory and computation cost. Data reduction methods have been proposed to make the analysis easier. In this paper, we present a hybrid approach based on Genetic Algorithms (GA), as evolutionary algorithms, and Multiple Correspondence Analysis (MCA), as a factor analysis method, to conduct this reduction. Our approach identifies a reduced subset of dimensions p' from the initial set p, where p' < p, and aims to find the fact profile that is closest to a reference. The GAs identify the possible subsets, and the chi-squared (χ²) formula of the MCA evaluates the quality of each subset. The study is based on a distance measurement between the reference and the n fact profiles extracted from the warehouse.
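A heavily simplified sketch of the GA-plus-chi-squared idea: individuals are binary masks over the p dimensions, and the fitness below is a plain chi-squared distance between a reference profile and each candidate profile over the selected dimensions, standing in for the paper's MCA-based evaluation. All data, operators, and the size penalty are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 20
reference = rng.random(p); reference /= reference.sum()
facts = rng.random((50, p)); facts /= facts.sum(axis=1, keepdims=True)

def fitness(mask):
    if mask.sum() == 0:
        return -np.inf
    sel = mask.astype(bool)
    # chi-squared-style distance of each fact profile to the reference
    d = ((facts[:, sel] - reference[sel]) ** 2 / reference[sel]).sum(axis=1)
    # reward subsets that keep the closest fact close while staying small
    return -(d.min() + 0.01 * mask.sum())

pop = rng.integers(0, 2, size=(30, p))
for _ in range(100):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]           # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(0, 10, 2)]
        cut = rng.integers(1, p)
        child = np.concatenate([a[:cut], b[cut:]])    # one-point crossover
        flip = rng.random(p) < 0.05                   # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected dimensions:", np.flatnonzero(best))
```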
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A... (IJMIT JOURNAL)
We aim at developing and improving imbalanced business risk modeling by jointly using proper evaluation criteria, resampling, cross-validation, classifier regularization, and ensembling techniques. The Area Under the Receiver Operating Characteristic Curve (AUC of ROC) is used for model comparison based on 10-fold cross-validation. Two undersampling strategies, random undersampling (RUS) and cluster centroid undersampling (CCUS), as well as two oversampling methods, random oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE), are applied. Three highly interpretable classifiers are implemented: logistic regression without regularization (LR), L1-regularized LR (L1LR), and a decision tree (DT). Two ensembling techniques, Bagging and Boosting, are applied to the DT classifier for further model improvement. The results show that Boosting on DT using data oversampled to 50% positives via SMOTE is the optimal model, achieving an AUC, recall, and F1 score of 0.8633, 0.9260, and 0.8907, respectively.
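A sketch of one arm of this comparison: SMOTE oversampling to 50% positives combined with a boosted decision tree, scored by AUC under 10-fold cross-validation. The imbalanced-learn pipeline keeps resampling inside each training fold; the data is synthetic, and the `estimator` keyword assumes a recent scikit-learn (≥ 1.2).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

model = Pipeline([
    ("smote", SMOTE(sampling_strategy=1.0, random_state=0)),  # 50% positives
    ("boost", AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=3), random_state=0)),
])
aucs = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC over 10 folds: {aucs.mean():.4f}")
```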
High-Dimensional Data Visualization, Geometry, and Stock Market Crashes (Colleen Farrelly)
Miami Data Science SALON (Nov 2018) talk regarding geometric methods for dimensionality reduction, data visualization, and stock market analysis (India's NSE).
A Magnified Application of Deficient Data Using Bolzano Classifier (journal ijrtem)
Abstract: Deficient and inconsistent knowledge is a pervasive and lasting problem in large information sets. Simple imputation is often used to fill in missing data, whereas multiple imputation generates several plausible replacement values; consequently, a variety of machine learning (ML) techniques have been developed to reprocess incomplete information. This work evaluates multiple imputation of missing information in large datasets, examining the performance of MI for several unsupervised ML algorithms (mean, median, standard deviation) and for supervised ML techniques with probabilistic algorithms such as the NBI classifier. The research is carried out over a comprehensive range of databases: missing cases are first filled in by various sets of plausible values to create multiple completed datasets, standard complete-data operations are applied to each completed dataset, and finally the multiple sets of results are combined to generate a single inference. The main aim of this report is to offer general guidelines for selecting suitable data imputation algorithms based on the characteristics of the data. The Bolzano-Weierstrass theorem is applied within the machine learning techniques, using the facts that every sequence of rational and irrational numbers has a monotonic subsequence and that every bounded sequence has a finite bound. To estimate the imputation of missing data, a standard machine learning repository dataset is used. Tentative results show the proposed approach to have good accuracy, with accuracy measured in percent. Keywords: Bolzano Classifier, Maximum Likelihood, NBI classifier, posterior probability, predictor probability, prior probability, Supervised ML, Unsupervised ML.
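A minimal sketch of the multiple-imputation workflow the abstract describes: fill the missing cases several times with plausible values, run the same complete-data analysis on each completed dataset, and pool the results into a single inference. Drawing the "plausible values" from a normal distribution fitted to the observed data is an illustrative assumption, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(50, 10, size=300)
data[rng.random(300) < 0.15] = np.nan           # introduce 15% missingness
obs = data[~np.isnan(data)]

m = 5                                           # number of imputations
estimates = []
for _ in range(m):
    draws = rng.normal(obs.mean(), obs.std(), size=int(np.isnan(data).sum()))
    completed = data.copy()
    completed[np.isnan(completed)] = draws
    estimates.append(completed.mean())          # the complete-data analysis

print("pooled estimate:", np.mean(estimates))   # Rubin-style pooling of results
```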
Presentation summarizing the main content of Farrelly, C. M. (2017). Extensions of Morse-Smale Regression with Application to Actuarial Science. arXiv preprint arXiv:1708.05712.
The paper was accepted in December 2017 by the Casualty Actuarial Society.
A ROBUST MISSING VALUE IMPUTATION METHOD MIFOIMPUTE FOR INCOMPLETE MOLECULAR ... (ijcsa)
Missing data imputation is an important research topic in data mining. Large-scale molecular descriptor data may contain missing values (MVs), yet some methods for downstream analyses, including some prediction tools, require a complete descriptor data matrix. We propose and evaluate an iterative imputation method, MiFoImpute, based on a random forest. By averaging over many unpruned regression trees, a random forest intrinsically constitutes a multiple imputation scheme, and using its NRMSE and NMAE estimates we are able to estimate the imputation error. Evaluation is performed on two molecular descriptor datasets generated from a diverse selection of pharmaceutical fields, with artificially introduced missing values ranging from 10% to 30%. The experimental results demonstrate that missing values have a great impact on the effectiveness of imputation techniques, and that MiFoImpute is more robust to missing values than the ten imputation methods used as benchmarks. Additionally, MiFoImpute exhibits attractive computational efficiency and can cope with high-dimensional data.
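Not the authors' MiFoImpute, but an analogous sketch using scikit-learn: iterative imputation with a random forest as the per-column regressor, and NRMSE computed against artificially masked ground truth.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X_true = rng.normal(size=(200, 8))
mask = rng.random(X_true.shape) < 0.2          # 20% missing values
X_miss = np.where(mask, np.nan, X_true)

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0)
X_imp = imputer.fit_transform(X_miss)

# normalized RMSE on the masked entries only
nrmse = np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)) / X_true[mask].std()
print(f"NRMSE: {nrmse:.3f}")
```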
Re-Mining Association Mining Results Through Visualization, Data Envelopment ... (ertekg)
Download link > https://ertekprojects.com/gurdal-ertek-publications/blog/re-mining-association-mining-results-through-visualization-data-envelopment-analysis-and-decision-trees/
Re-mining is a general framework which suggests the execution of additional data mining steps based on the results of an original data mining process. This study investigates the multi-faceted re-mining of association mining results, develops and presents a practical methodology, and shows the applicability of the developed methodology through real world data. The methodology suggests re-mining using data visualization, data envelopment analysis, and decision trees. Six hypotheses, regarding how re-mining can be carried out on association mining results, are answered in the case study through empirical analysis.
This presentation is based on "Statistical Modeling: The Two Cultures" by Leo Breiman. It compares the data modeling culture (statistics) and the algorithmic modeling culture (machine learning).
There is no doubt that the artificial neural network (ANN) is one of the most acclaimed universal approximators, and it has been applied in many fields. This is due to its ability to learn any pattern automatically, with no prior assumptions and no loss of generality. ANNs have contributed significantly to the field of time series prediction, but the presence of outliers, which commonly occur in time series data, may pollute the network training data. The most common algorithm for training the network is backpropagation (BP), which relies on minimizing the ordinary least squares (OLS) estimator in the form of the mean squared error (MSE). However, this algorithm is not fully robust in the presence of outliers and may produce false predictions of future values. Therefore, in this paper we implement a new algorithm that applies the firefly algorithm to the least median of squares (FA-LMedS) estimator for backpropagation neural network nonlinear autoregressive (BPNN-NAR) and backpropagation neural network nonlinear autoregressive moving average (BPNN-NARMA) models, to handle different degrees of the outlier problem in time series data. In addition, the performance of the proposed robust estimator is compared with the original MSE, the robust iterative least median of squares (ILMedS), and particle swarm optimization on least median of squares (PSO-LMedS) estimators using simulated data, evaluated by root mean squared error (RMSE). The robustified backpropagation learning algorithm using FA-LMedS was found to outperform the original and the other robust estimators, ILMedS and PSO-LMedS. In conclusion, evolutionary algorithms outperform the original MSE error function in providing robust training of artificial neural networks.
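A small illustration of why the least median of squares (LMedS) criterion is more outlier-resistant than MSE, using a linear model and plain random search in place of the paper's firefly algorithm. The data, search box, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)
y[:10] += 30                                   # contaminate 10% with outliers

def lmeds(params):
    a, b = params
    return np.median((y - (a * x + b)) ** 2)   # median, not mean, of squares

best, best_score = None, np.inf
for _ in range(5000):                          # crude stand-in for firefly search
    cand = rng.uniform([-5, -5], [5, 5])
    s = lmeds(cand)
    if s < best_score:
        best, best_score = cand, s

mse_fit = np.polyfit(x, y, 1)                  # OLS/MSE fit for comparison
print("LMedS slope/intercept:", best)          # close to the true (2, 1)
print("OLS slope/intercept:  ", mse_fit)       # dragged off by the outliers
```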
Neural networks have gained a great deal of importance in the area of soft computing and are widely used for making predictions. The work presented in this paper concerns the development of Artificial Neural Network (ANN) based models for the prediction of sugarcane yield in India. The ANN models were experimented with using different partitions of training patterns and different combinations of ANN parameters.
Experiments were also conducted for different numbers of neurons in the hidden layer and different ANN training algorithms. Data was obtained from the website of the Directorate of Economics and Statistics, Ministry of Agriculture, Government of India, and experiments were conducted for 2160 different ANN models. The least Root Mean Square Error (RMSE) value that could be achieved on
test data was 4.03%. This was achieved when the data was partitioned such that 10% of records were in the test data, with 10 neurons in the hidden layer, a learning rate of 0.001, an error goal of 0.01, and the traincgb algorithm in MATLAB used for ANN training.
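A Python analogue (not the paper's MATLAB setup) of the reported best configuration: 10% of records held out for testing, 10 hidden neurons, a learning rate of 0.001, and RMSE on the test split. The synthetic regression data stands in for the yield records.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.random((400, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 2.0]) + rng.normal(0, 0.1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(10,), learning_rate_init=0.001,
                     max_iter=2000, random_state=0).fit(X_tr, y_tr)
rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))
print(f"test RMSE: {rmse:.3f}")
```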
Missing data exploration in air quality data set using R-package data visuali... (journalBEEI)
Missing values often occur in data sets across many research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated, or the missing values replaced, using an imputation method, and the pattern of missing values must be explored beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, including the missing data mechanism (MCAR, MAR and MNAR) and the distribution pattern of missingness in terms of both percentage and gap size. It presents several data visualisation tools from five R-packages, namely visdat, VIM, ggplot2, Amelia and UpSetR, for exploring data missingness. As an illustration, based on an air quality data set from Malaysia, several graphics were produced and discussed to show the contribution of the visualisation tools in providing insight into the pattern of data missingness. The results show that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness but long gaps.
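A pandas sketch of the two quantities the paper inspects visually: the percentage of missing values per variable and the longest consecutive gap. The column names and the injected missingness are made up; a real air quality frame would be used instead of the R packages named above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["PM10", "SO2", "NO2"])
df.iloc[100:160, 0] = np.nan                     # one long gap in PM10
df.loc[rng.random(1000) < 0.02, "SO2"] = np.nan  # scattered misses in SO2

def longest_gap(s):
    na = s.isna().astype(int)
    # label consecutive runs, sum within each run, take the largest missing run
    runs = na.groupby((na != na.shift()).cumsum()).sum()
    return int(runs.max())

print("percent missing:")
print((df.isna().mean() * 100).round(2))
print("longest gap per column:", {c: longest_gap(df[c]) for c in df.columns})
```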
International Journal of Computer Science, Engineering and Information Techno... (IJCSEIT Journal)
In the field of proteomics, as more data is added, computational methods need to become more efficient. The parts of molecular sequences that are functionally more important to the molecule are more resistant to change, and comparative approaches are used to ensure the reliability of sequence alignment. The problem of multiple sequence alignment is a proposition of evolutionary history: for each column in the alignment, the explicit homologous correspondence of each individual sequence position is established. Different pair-wise sequence alignment methods are elaborated in the present work, but these methods are only suitable for aligning a limited number of sequences of small length. A new method is introduced for aligning sequences based on local alignment with consensus sequences. Triticum (wheat) varieties are loaded from the NCBI databank, and phylogenetic trees are constructed for the divided parts of the dataset. A single new tree is then constructed from the previously generated trees using an advanced pruning technique. Finally, the closely related sequences are extracted by applying threshold conditions, and an optimal sequence alignment is obtained using shift operations in both directions.
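Since the work builds on pair-wise alignment methods, here is a compact global pairwise alignment (Needleman-Wunsch) sketch. The simple match/mismatch/gap scores are assumptions; real sequence work would use substitution matrices.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score matrix with gap-initialised first row and column
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # global alignment score
```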
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches that are relevant to data mining: feature selection, clustering, and robust representation. The paper covers two clustering algorithms, the k-means and k-median algorithms. Clustering is illustrated by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to group observations into clusters of related observations without any prior knowledge of those relationships; it is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
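A minimal Lloyd-style loop showing the only difference between the two algorithms named above: k-means updates centers with the mean, k-median with the coordinate-wise median, which is more robust to extreme records. The two-blob data is synthetic, and the sketch ignores empty-cluster handling.

```python
import numpy as np

def cluster(X, k, center_fn, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)              # assign to the nearest center
        centers = np.array([center_fn(X[labels == j], axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
labels_mean, _ = cluster(X, 2, np.mean)        # k-means update
labels_med, _ = cluster(X, 2, np.median)       # k-median update
print("agreement between the two:", (labels_mean == labels_med).mean())
```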
Comparative study of various supervised classification methods for analysing def... (eSAT Publishing House)
Multilinear Kernel Mapping for Feature Dimension Reduction in Content Based M... (ijma)
In content-based multimedia retrieval, multimedia information is processed to obtain descriptive features. Descriptive representation of features results in a huge feature count, which in turn creates processing overhead. To reduce this overhead, various dimensionality reduction approaches have been used, among which PCA and LDA are the most common. However, these methods do not reflect the significance of feature content in terms of the inter-relations among all dataset features. To achieve dimension reduction based on histogram transformation, features with low significance can be eliminated. In this paper, we propose a feature dimension reduction approach based on multi-linear kernel (MLK) modeling. A benchmark dataset is used for the experimental work, and the proposed approach is observed to improve on the conventional system in the analysis.
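A short sketch contrasting the two baselines named above: PCA (unsupervised, variance-driven) and LDA (supervised, class-separation-driven) on the same feature matrix. The iris data is just a stand-in for multimedia descriptors; labels y are only required by LDA.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)                       # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised
print(X_pca.shape, X_lda.shape)
```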
Comparison of Cost Estimation Methods using Hybrid Artificial Intelligence on... (IJERA Editor)
Cost estimation at the schematic design stage, as the basis of project evaluation, engineering design, and cost management, plays an important role in project decisions made under a limited definition of scope, constraints on available information and time, and the presence of uncertainties. The purpose of this study is to compare the performance of cost estimation models built with two different hybrid artificial intelligence approaches: a regression analysis-adaptive neuro fuzzy inference system (RANFIS) and case based reasoning with a genetic algorithm (CBR-GA). The models were developed on the same 50 low-cost apartment project datasets in Indonesia. Tested on another five data points, the models proved to perform very well in terms of accuracy. The CBR-GA model was the best performer, but it had the disadvantage of needing 15 cost drivers, compared to only 4 cost drivers required by RANFIS for on-par performance.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... (ijdmtaiir)
In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against unsupervised techniques, namely fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against the unsupervised fuzzy technique when reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than techniques such as LSI and PCA. The results show that clustering of features improves the accuracy of document classification.
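A sketch of the LSI baseline from the study: TF-IDF features reduced with truncated SVD, which is the usual LSI implementation. Fuzzy C-means is not in scikit-learn, so plain KMeans stands in here purely to complete the pipeline; the four toy documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["pca reduces the feature space",
        "fuzzy clustering groups related features",
        "lsi applies svd to tfidf matrices",
        "document classification with reduced features"]
X = TfidfVectorizer().fit_transform(docs)
X_lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # LSI
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsi)
print(labels)
```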
Oversampling technique in student performance classification from engineering... (IJECEIAES)
The first year of an engineering degree is important for proper academic planning, as all first-year subjects form the basis of engineering study. Student performance prediction helps academics and students improve: if students are aware that their performance is low, they can work towards better results. This research focused on combining oversampling of the minority class with various classifier models. The oversampling techniques were SMOTE, Borderline-SMOTE, SVM-SMOTE, and ADASYN, and four classifiers were applied: MLP, gradient boosting, AdaBoost, and random forest. The results showed that Borderline-SMOTE gave the best minority-class predictions across several classifiers.
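A sketch of the comparison loop this abstract describes: each oversampler from imbalanced-learn is paired with one classifier and scored on a held-out split. Synthetic data stands in for the first-year engineering records, and a single random forest replaces the paper's four classifiers for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

X, y = make_classification(n_samples=1500, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for sampler in [SMOTE(), BorderlineSMOTE(), SVMSMOTE(), ADASYN()]:
    X_os, y_os = sampler.fit_resample(X_tr, y_tr)   # oversample training data only
    clf = RandomForestClassifier(random_state=0).fit(X_os, y_os)
    print(type(sampler).__name__, round(f1_score(y_te, clf.predict(X_te)), 3))
```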
Multimode system condition monitoring using sparsity reconstruction for quali... (IJECEIAES)
In this paper, we introduce an improved multivariate statistical monitoring method based on the stacked sparse autoencoder (SSAE). Our contribution focuses on the choice of an SSAE model based on neural networks to solve diagnostic problems in complex systems. To monitor process performance, the squared prediction error (SPE) chart is combined with nonparametric adaptive confidence bounds derived from kernel density estimation, to minimize erroneous alerts. Faults are then localized using two methods: contribution plots and the sensor validity index (SVI). The results are obtained from experiments and real data from a drinking water processing plant, demonstrating how the applied technique performs. The simulation results of the SSAE model show a better ability to detect and identify sensor failures.
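A sketch of the monitoring logic described above, with a plain MLP autoencoder assumed in place of the paper's stacked sparse autoencoder: the SPE is the squared reconstruction residual, and the control limit is a high quantile of a kernel density estimate fitted to the SPE of healthy training data. All data is synthetic.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(9)
X_train = rng.normal(size=(500, 6))                  # healthy operating data
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=3000,
                  random_state=0).fit(X_train, X_train)  # autoencoder: X -> X

spe_train = ((X_train - ae.predict(X_train)) ** 2).sum(axis=1)
kde = gaussian_kde(spe_train)
grid = np.linspace(0, spe_train.max() * 2, 2000)
cdf = np.cumsum(kde(grid)); cdf /= cdf[-1]
limit = grid[np.searchsorted(cdf, 0.99)]             # 99% nonparametric bound

x_new = rng.normal(size=(1, 6)) + 4.0                # a clearly faulty sample
spe_new = ((x_new - ae.predict(x_new)) ** 2).sum()
print("alarm!" if spe_new > limit else "normal", round(spe_new, 2), round(limit, 2))
```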
Clustering heterogeneous categorical data using enhanced mini batch K-means ... (IJECEIAES)
Clustering methods in data mining aim to group a set of patterns based on their similarity. In survey data, heterogeneous information arises from various types of data scales, such as nominal, ordinal, binary, and Likert scales. A lack of treatment for heterogeneous data leads to loss of information and poor decision-making. Although many similarity measures have been established, solutions for heterogeneous data in clustering are still lacking. The recent entropy distance measure seems to provide good results for heterogeneous categorical data, but it requires many experiments and evaluations. This article presents a proposed framework for heterogeneous categorical data using mini batch k-means with an entropy measure (MBKEM), to investigate the effectiveness of the similarity measure in clustering heterogeneous categorical data. Secondary data from a public survey was used. The findings demonstrate that the proposed framework improves clustering quality: MBKEM outperformed other clustering algorithms with an accuracy of 0.88, v-measure (VM) of 0.82, adjusted rand index (ARI) of 0.87, and Fowlkes-Mallows index (FMI) of 0.94. The average minimum elapsed time for cluster generation, k, was 0.26 s. In the future, the proposed solution should be beneficial for improving the quality of clustering for heterogeneous categorical data problems in many domains.
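A sketch of the mini batch k-means part of MBKEM: categorical survey responses are one-hot encoded and clustered with MiniBatchKMeans. The entropy-based similarity measure from the paper is not reproduced, the fake survey data is an assumption, and `sparse_output` assumes scikit-learn ≥ 1.2.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(10)
cats = rng.choice(["low", "medium", "high"], size=(300, 4))  # fake survey data
X = OneHotEncoder(sparse_output=False).fit_transform(cats)
labels = MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))
```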
Data reduction techniques for high dimensional biological data (eSAT Journals)
Abstract
High dimensional biological datasets have been growing rapidly in recent years. Extracting knowledge from and analyzing high-dimensional biological data is one of the key challenges, in which variety and veracity are the two distinct characteristics. The questions that arise are how to perform dimensionality reduction for this heterogeneous data, how to develop a high performance platform to efficiently analyze high dimensional biological data, and how to find useful information in this data. To discuss this issue in depth, this paper begins with a brief introduction to the data analytics available for biological data, followed by a discussion of big data analytics and a survey of various data reduction methods for biological data. We propose a dense clustering algorithm for standard high dimensional biological data.
Keywords: Big Data Analytics, Dimensionality Reduction
A Formal Machine Learning or Multi Objective Decision Making System for Deter... (Editor IJCATR)
Decision-making typically needs mechanisms to compromise among opposing criteria. When multiple objectives are involved in machine learning, a vital step is to relate the weights of the individual objectives to system-level performance. Determining the weights of multiple objectives is an analysis process, and it has typically been treated as a problem of optimization. However, our preliminary investigation has shown that existing methodologies for managing the weights of multiple objectives have obvious limitations: when the determination of weights is treated as a single problem, the results supported by such an optimization are limited, and can even be unreliable when knowledge about the multiple objectives is incomplete, for example due to poor data. The constraints on weights are also discussed. Variable weights are natural in decision-making processes, so we develop a systematic methodology for determining variable weights of multiple objectives. The roles of weights in multi-objective decision-making and machine learning are analyzed, and the weights are determined with the help of a standard neural network.
A study and survey on various progressive duplicate detection mechanisms (eSAT Journals)
Abstract: Duplicate detection is one of the serious problems faced in several applications involving personal data management, customer relationship management, data mining, etc. This survey deals with the various duplicate record detection techniques for both small and large datasets. To detect duplicates with less execution time and without disturbing dataset quality, methods such as Progressive Blocking and the Progressive Sorted Neighborhood Method are used. The Progressive Sorted Neighborhood Method, also called PSNM, is used to detect duplicates in a parallel fashion, while the Progressive Blocking algorithm works on large datasets where finding duplicates requires immense time. These algorithms are used to enhance duplicate detection systems, and their efficiency can be double that of conventional duplicate detection methods.
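A minimal sorted-neighborhood sketch in the spirit of PSNM (without its progressive scheduling): records are sorted by a blocking key, and only records within a sliding window are compared, instead of all O(n²) pairs. The records and the crude key are illustrative.

```python
records = ["john smith", "jon smith", "mary jones", "marry jones", "alan turing"]
key = lambda r: r.replace(" ", "")[:4]            # crude sorting/blocking key
ordered = sorted(records, key=key)

window = 3
candidate_pairs = set()
for i in range(len(ordered)):
    # compare each record only with its near neighbors in sort order
    for j in range(i + 1, min(i + window, len(ordered))):
        candidate_pairs.add((ordered[i], ordered[j]))

print(candidate_pairs)                            # pairs to check for duplicity
```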
Improved correlation analysis and visualization of industrial alarm data (ISA Interchange)
The problem of multivariate alarm analysis and rationalization is complex and important in the area of smart alarm management due to the interrelationships between variables. The technique of capturing and visualizing the correlation information, especially from historical alarm data directly, is beneficial for further analysis. In this paper, the Gaussian kernel method is applied to generate pseudo continuous time series from the original binary alarm data. This can reduce the influence of missed, false, and chattering alarms. By taking into account time lags between alarm variables, a correlation color map of the transformed or pseudo data is used to show clusters of correlated variables with the alarm tags reordered to better group the correlated alarms. Thereafter correlation and redundancy information can be easily found and used to improve the alarm settings; and statistical methods such as singular value decomposition techniques can be applied within each cluster to help design multivariate alarm strategies. Industrial case studies are given to illustrate the practicality and efficacy of the proposed method. This improved method is shown to be better than the alarm similarity color map when applied in the analysis of industrial alarm data.
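A sketch of the transformation described above: binary alarm series are smoothed with a Gaussian kernel to obtain pseudo continuous series, whose correlation matrix is then inspected for clusters of related alarms. The alarm rates, lag, and kernel width are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(11)
t = 2000
a1 = (rng.random(t) < 0.01).astype(float)       # a chattering alarm tag
a2 = np.roll(a1, 5)                             # a second tag lagging by 5 samples
a3 = (rng.random(t) < 0.01).astype(float)       # an unrelated tag

# Gaussian smoothing turns spike trains into pseudo continuous series
smoothed = [gaussian_filter1d(a, sigma=10) for a in (a1, a2, a3)]
print(np.corrcoef(smoothed).round(2))           # a1 and a2 correlate strongly
```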
A Combined Approach for Feature Subset Selection and Size Reduction for High ... (IJERA Editor)
Selection of relevant features from a given feature set is one of the important issues in the fields of data mining and classification. In general a dataset may contain many features, but not all of them are important for a particular analysis or decision, because features may share common information or be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation, or because of improper information about the observed system. In both cases the data will contain features that merely increase the processing burden, which may ultimately lead to improper outcomes when the data is used for analysis. For these reasons, methods are required to detect and remove such features; hence in this paper we present an efficient approach that not only removes unimportant features but also reduces the size of the complete dataset. The proposed algorithm uses information theory to compute the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally the algorithm is tested with an SVM classifier on 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data length but also improves the performance of the classifier.
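A sketch of the two building blocks named above: mutual information (information gain) scores each feature, and a minimum spanning tree over a feature-to-feature distance matrix groups similar features. The fuzzy c-means deduplication step is omitted, and the correlation-based distance is an assumption.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# information-theoretic relevance of each feature to the class label
gain = mutual_info_classif(X, y, random_state=0)
print("information gain per feature:", gain.round(3))

dist = 1 - np.abs(np.corrcoef(X.T))            # correlated features are "close"
mst = minimum_spanning_tree(dist).toarray()
edges = np.argwhere(mst > 0)                   # feature pairs joined by the tree
print("MST edges between features:", edges.tolist())
```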
RAINFALL PREDICTION USING DATA MINING TECHNIQUES - A SURVEY (csandit)
Rainfall is considered one of the major components of the hydrological process and plays a significant part in evaluating drought and flooding events. It is therefore important to have an accurate model for rainfall prediction. Recently, several data-driven modeling approaches have been investigated for such forecasting tasks, such as multilayer perceptron neural networks (MLP-NN); rainfall time series modeling with SARIMA also captures important temporal dimensions. To evaluate the outcomes of both models, statistical parameters were used to compare them: the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Coefficient of Correlation (CC), and BIAS. Two-thirds of the data was used for training the models and one-third for testing.
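The four comparison statistics named above, written out for clarity; a two-thirds/one-third split of the rainfall series would produce the y_true/y_pred arrays, which are made up here.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def cc(y_true, y_pred):                        # coefficient of correlation
    return np.corrcoef(y_true, y_pred)[0, 1]

def bias(y_true, y_pred):                      # mean error (over/under-forecast)
    return np.mean(y_pred - y_true)

y_true = np.array([12.0, 0.0, 3.5, 20.1, 7.2])   # observed rainfall (mm)
y_pred = np.array([10.5, 0.4, 2.9, 18.0, 8.1])   # model forecast (mm)
for f in (rmse, mae, cc, bias):
    print(f.__name__, round(f(y_true, y_pred), 3))
```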
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
International Journal of Advanced Engineering, Management and Science (IJAEMS), Vol-3, Issue-11, Nov-2017
https://dx.doi.org/10.24001/ijaems.3.11.4 | ISSN: 2454-1311 | www.ijaems.com
Influence over the Dimensionality Reduction and Clustering for Air Quality Measurements using PCA and SOM
Navya H.N
Bangalore, 560072, India
Abstract—The current trend in industry is to analyze large data sets and apply data mining and machine learning techniques to identify patterns. A key challenge with huge data sets, however, is their high dimensionality. In data analytics applications, large amounts of data can sometimes produce worse performance. Moreover, most data mining algorithms are implemented column-wise, and too many columns restrict performance and make them slower. Dimensionality reduction is therefore an important step in data analysis. Dimensionality reduction is a technique that converts high dimensional data into a much lower dimension, such that maximum variance is explained within the first few dimensions.
This paper focuses on multivariate statistical and artificial neural network techniques for data reduction. Each method has a different rationale for preserving the relationships between input parameters during analysis. Principal Component Analysis, a multivariate technique, and the Self Organising Map, a neural network technique, are presented in this paper. A hierarchical clustering approach is also applied to the reduced data set. A case study of air quality measurement is considered to evaluate the performance of the proposed techniques.
Keywords — Air Quality, Dimensionality Reduction, Hierarchical Clustering, Principal Component Analysis, Self Organising Maps.
I. INTRODUCTION
Multivariate Statistical Analysis is useful when data is of high dimension. Since human vision is limited to three dimensions, any application with more than two or three dimensions is an ideal case for data to be analyzed through Multivariate Statistical Analysis (MVA). This analysis provides joint analysis and easy visualization of the relationships between the variables involved. As a result, knowledge that was unrevealed and hidden among vast amounts of data can be obtained. PCA is a powerful multivariate technique for reducing the dimension of a data set and revealing the hidden relationships among the different variables without losing much information [1].
An Artificial Neural Network (ANN) is an information processing computational model based on biological neural networks, composed of several interconnected processing elements (neurons) that work in parallel to solve a given problem. In general, ANNs are used to identify complex relations between input and output or to find patterns in a given data set. An ANN is an adaptive system that can change its structure based on the flow of information through the network during the training phase.
The two important learning paradigms in ANNs are supervised learning and unsupervised learning. In supervised learning, there exists training data that helps in the construction of the model by specifying classes and by providing positive and negative objects that belong to those classes. In unsupervised learning, there is no preexisting taxonomy and the algorithm classifies the output into different classes. Kohonen's Self Organising Map is an unsupervised learning technique that reduces high dimensional data to a two dimensional space. This reduction in dimensionality helps in understanding relationships quickly, and the SOM also provides better visualization of the components [2].
The rest of the paper is organized as follows. Section 2 discusses related work and its findings. Section 3 briefly explains the existing Principal Component Analysis and Self Organising Map techniques used in our work. Section 4 consists of a case study on air quality to evaluate the performance of the proposed techniques, along with the expected results. Finally, the concluding points are given in Section 5.
II. RELATED WORK
Some of the related work pertaining to Principal Component Analysis and Self Organising Maps is discussed in this section. In [3] multivariate analysis was used to analyze and interpret data from a large chemical process; PCA was used to identify the correct correlations between the variables to reduce the dimensionality of process data.
In [4] Multivariate Statistical Analysis is applied to dermatological disease diagnosis. A few diseases in dermatology, such as psoriasis, seborrhoeic dermatitis, lichen planus, pityriasis, chronic dermatitis and pityriasis rubra pilaris, share the same clinical features. PCA is applied to identify the relationships between 12 clinical attributes, and the circle of correlations depicts patterns of variable associations. Monitoring abnormal changes in the concentration of ozone in the troposphere is of great interest because of its negative influence on human health, vegetation and materials. Modeling ozone is very difficult because the formation mechanisms in the troposphere are very complex, compounded by uncertainty regarding meteorological conditions in urban areas; PCA is used as a detection method for highly correlated variables in [5]. In [6] PCA was applied to yeast sporulation data to simplify the analysis and visualization of gene expression data. Data was collected for 7 different gene expressions over time, and it was observed that much of the variability was explained within the first two principal components.
In [7] both principal component analysis and self-organizing maps were applied to classify and visualize fire risks in forest regions. On application of PCA the first two principal components explained most of the variance, but SOM was more successful in effective visualization and in clustering the nodes to depict fire risks. In [8] PCA, cluster analysis and SOM were applied to a large environmental data set to assess the quality of river water; the results indicated the classification power of SOM compared with other traditional methods. In [9] SOM is applied to Forest Inventory (FI) data, which contains useful information on forest conditions; the technique enables fast and easy visualization and analysis of such large multidimensional data sets. In [10] PCA and SOM were applied in a cellular manufacturing system for visual clustering of machine-part cell formation. PCA was used to reduce the dimensionality of the data set, which was projected onto a two dimensional space, and the unsupervised SOM technique was applied for data visualization and to solve the problem of cell formation. In [11] SOM was applied to complex geospatial datasets for knowledge discovery and information visualization. The dataset consisted of socio-economic indicators mapped to municipalities in the Netherlands. Through the application of SOM, the structure and patterns of the dataset were uncovered, and the graphical representation helped in knowledge discovery and better understanding.
III. EXISTING TECHNIQUES
As outlined in the introduction, Multivariate Statistical Analysis provides joint analysis and easy visualization of the relationships between variables in high dimensional data. The two dimensionality reduction techniques applied in this work are described below.
A. Principal Component Analysis (PCA)
PCA is a powerful multivariate technique for reducing the dimension of a data set and revealing the hidden relationships among the different variables without losing much information [12]. The steps involved are:
1. Calculate the covariance matrix
Covariance is a measure of how one variable varies with respect to another. If there are more than two variables, a covariance matrix needs to be calculated. The covariance matrix is obtained by calculating the covariance values for the different dimensions:

C_{n×n} = (c_{i,j}), where c_{i,j} = cov(Dim_i, Dim_j)

Here C_{n×n} is a matrix with n rows and n columns, and i, j are the row and column indices. Covariance is calculated using the formula

cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n − 1)

where X_i and Y_i are the values of X and Y at the i-th position, and X̄, Ȳ denote the mean values of X and Y respectively.
2. Find the eigenvectors of the covariance matrix
Eigenvectors can be calculated only for a square matrix, and not every square matrix has eigenvectors. If an n × n matrix has eigenvectors, then the total number of eigenvectors is n. Another important property is that the eigenvectors are always perpendicular to each other. Each eigenvalue is associated with an eigenvector and describes the strength of the transformation.
3. Sort the eigenvectors in decreasing order of eigenvalues
The highest eigenvalue corresponds to the principal component that explains the maximum variance among the data points with minimum error. The first few principal components explain most of the variance, and eigenvectors with smaller eigenvalues can be neglected with very little loss of information.
4. Derive the new data set
The original data is retained, but is expressed in terms of the selected eigenvectors, without loss of data.
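As a concrete illustration, the four steps above can be sketched in a few lines of Python with NumPy. This is a minimal sketch under the assumption of a generic observations-by-variables matrix; the sample data below is a random placeholder, not the air quality data.

```python
import numpy as np

def pca(data, n_components=2):
    # Centre each variable on its mean (required before Step 1)
    centered = data - data.mean(axis=0)
    # Step 1: covariance matrix (rowvar=False treats columns as variables)
    cov = np.cov(centered, rowvar=False)
    # Step 2: eigenvalues/eigenvectors of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: sort eigenvectors in decreasing order of eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 4: derive the new data set by projecting onto the leading PCs
    scores = centered @ eigenvectors[:, :n_components]
    return scores, eigenvalues, eigenvectors

# Placeholder data: 100 observations of 4 variables
sample = np.random.default_rng(0).normal(size=(100, 4))
scores, eigenvalues, eigenvectors = pca(sample)
print(eigenvalues / eigenvalues.sum())  # proportion of variance per PC
```

The printed ratios correspond to the proportion of variance explained by each principal component, which is the quantity used later to decide how many components to retain.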
B. Self Organising Map (SOM)
The Self Organising Map is a popular Artificial Neural Network (ANN) technique in the category of unsupervised learning.
In the conventional ANN approach, the input vector is presented to a multilayer feed-forward network and the generated output is compared with the target vector. When there is a difference, the weights are altered to minimize the error in the output. This phase is repeated many times with different sets until the desired output is achieved. However, SOM does not require a target vector during the training phase; a SOM learns to classify the training data without any external supervision.
The basic principle of SOM is that when the weights of nodes match the input vector, that portion of the lattice is selectively optimized to resemble the input vector. Each node receives every element of the input or training data in vector format, one at a time. A calculation is performed between the element and the weight of the node to determine the fit between them: the distance between the two, usually the Euclidean distance, although any other distance measure can be used. The winning node that best describes the training element is the one with the smallest distance between the input element and the node's weight. The neighbors of the winning node are then identified, and these neighbors and the winning node are updated to represent the new training element. By this procedure, the map learns through individual elements [13].
SOM Algorithm
1. Initialize the weights of each node, either randomly or with pre-computed values.
2. For every input element:
a) Get the input element and convert it to a vector.
b) For every node in the map, compare the input vector with the node.
c) Every node's weight is examined to see which one is closest to the input vector. The node with the smallest distance is declared the winning unit, commonly referred to as the Best Matching Unit (BMU). The distance between the input vector and a node's weight is calculated using the Euclidean distance formula

dist(V, W) = √( Σ_i (V_i − W_i)² )

where V is the current input vector and W is the node's weight vector.
d) The radius of the neighborhood of the BMU is calculated. All nodes within this radius will be updated in the next iterations. Initially, the radius is set to the radius of the lattice, and it decreases at each step.
e) The weights of the neighboring nodes are adjusted according to the equation

x(t + 1) = x(t) + N(x, t) α(t) (∂(t) − x(t))

where x(t + 1) is the next value of the weight vector, x(t) is the current value of the weight vector, N(x, t) is the neighbourhood function, which decreases as a function of time, α(t) is the learning rate, which also decreases as a function of time, and ∂(t) is the vector representing the current input element.
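The following is a minimal Python/NumPy sketch of this algorithm. The grid size, the linear decay schedules, and the Gaussian form of the neighbourhood function N(x, t) are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def train_som(data, grid=(10, 10), n_iter=100, lr0=0.05, lr1=0.01):
    n_features = data.shape[1]
    rng = np.random.default_rng(0)
    # Step 1: random initial weights, one weight vector per node
    weights = rng.random((grid[0], grid[1], n_features))
    # Node coordinates on the lattice, used for the neighbourhood radius
    coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                  indexing="ij"), axis=-1)
    radius0 = max(grid) / 2.0
    for t in range(n_iter):               # one pass over the data per iteration
        frac = t / n_iter
        alpha = lr0 + (lr1 - lr0) * frac  # learning rate alpha(t) decays with time
        radius = radius0 * (1.0 - frac) + 1e-9  # neighbourhood radius shrinks
        for x in data:
            # Step 2c: the Best Matching Unit is the node whose weight vector
            # has the smallest Euclidean distance to the input vector
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Steps 2d-2e: Gaussian neighbourhood N(x, t) around the BMU,
            # then move each node's weights towards the input vector
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += alpha * influence[..., None] * (x - weights)
    return weights
```

The inner update line is exactly the weight adjustment equation above, with the Gaussian influence term playing the role of N(x, t).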
IV. CASE STUDY
A case study is considered to demonstrate dimensionality reduction on a data set using Principal Component Analysis and the Self Organising Map. Hierarchical clustering is then applied to the obtained SOM results. The case study assesses the quality of air in regions of New York City. The data set consists of variables such as Ozone, Solar radiation, Wind and Temperature for different regions. The Ozone variable contains numeric values in parts per billion; Solar radiation gives the radiation in Langleys in the frequency band 4000-7700 Angstroms; the Wind variable has numeric values in miles per hour; Temperature indicates the maximum daily temperature in degrees.
A. PCA Application
PCA reduces the dimension and explores the hidden information. The first few principal components explain most of the variance. Table 1 summarizes the principal component analysis, giving the standard deviation of each principal component, the proportion of variance and the cumulative proportion.
Table 1: Standard deviation, proportion of variance and cumulative proportion of the principal components

                        PC1     PC2     PC3     PC4
Standard Deviation      1.5343  0.9529  0.6890  0.51441
Proportion of Variance  0.5886  0.2266  0.1187  0.06615
Cumulative Proportion   0.5886  0.8152  0.9338  1.00000
An important task in PCA is deciding the number of principal components to be considered for further analysis. Fig. 1 shows a scree plot for determining the number of PCs. The elbow point indicates the number of components to be considered; beyond this point the eigenvalues are small and indicate negligible data loss.
Fig. 1. Scree plot for determining the number of PCs.
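For reference, figures like Table 1 and the scree plot are straightforward to reproduce. The following minimal sketch assumes scikit-learn is available and uses a random placeholder matrix X in place of the scaled air quality data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(111, 4))  # placeholder data matrix
pca = PCA().fit(StandardScaler().fit_transform(X))

std_dev = np.sqrt(pca.explained_variance_)    # standard deviation per PC
prop_var = pca.explained_variance_ratio_      # proportion of variance
cum_prop = np.cumsum(prop_var)                # cumulative proportion
for k in range(len(std_dev)):
    print(f"PC{k+1}: sd={std_dev[k]:.4f}  prop={prop_var[k]:.4f}  cum={cum_prop[k]:.4f}")
# Plotting pca.explained_variance_ against the component index gives a
# scree plot like Fig. 1; the elbow marks the number of PCs to retain.
```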
The relationships between variables can be identified using a biplot. Fig. 2 represents the variables and observations of the multi-dimensional data. It indicates that the variables Temperature and Ozone are closely related to each other, and that Wind is negatively correlated with the other variables. The angle between vectors represents an approximation of covariance: a small angle between vectors indicates that the variables are highly correlated, an angle of 90 degrees indicates that the variables are uncorrelated, and an angle of 180 degrees indicates that the variables are negatively correlated.
Fig. 2. Biplot of the variables and observations of the multi-dimensional data.
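A biplot like Fig. 2 can be drawn from the PCA output. The sketch below assumes matplotlib is available and reuses the scores and eigenvectors returned by the pca() sketch in Section III; the variable names follow the data set, and the arrow scaling factor of 3 is an arbitrary choice for visibility.

```python
import matplotlib.pyplot as plt

names = ["Ozone", "Solar.R", "Wind", "Temp"]
plt.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)  # observations
for i, name in enumerate(names):
    # Loadings drawn as vectors: the angle between two arrows approximates
    # the correlation between the corresponding variables
    plt.arrow(0, 0, 3 * eigenvectors[i, 0], 3 * eigenvectors[i, 1],
              color="red", head_width=0.05)
    plt.annotate(name, (3 * eigenvectors[i, 0], 3 * eigenvectors[i, 1]))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```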
B. SOM Application
Before the application of SOM, the data was prepared for analysis by identifying duplicates, removing outliers, converting data, removing null values, and scaling the variables so that each receives equal importance during the training phase. A SOM grid of size 10 x 10 was created; there is no explicit rule for selecting the number of nodes, except that it should allow easy visualization of the SOM. The SOM is trained with a learning rate of 0.05 that gradually declines to 0.01. The radius of the neighbourhood can be either a vector or a single number; if it is a single number, the radius changes from the given value down to its negative. Once the neighbourhood becomes small and the radius is less than 1, only the winning node is updated.
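As an illustration of this configuration, the sketch below uses the MiniSom Python library. This is an assumption: the paper does not name its SOM implementation, and MiniSom's built-in decay schedule only approximates the 0.05 to 0.01 learning-rate decline described above. X is a random placeholder for the scaled data matrix.

```python
import numpy as np
from minisom import MiniSom

X = np.random.default_rng(0).random((111, 4))  # placeholder for the scaled data

# 10 x 10 grid, 4 input variables; sigma is the initial neighbourhood radius
som = MiniSom(10, 10, 4, sigma=5.0, learning_rate=0.05, random_seed=0)
som.random_weights_init(X)
som.train_random(X, num_iteration=100 * len(X))  # roughly 100 passes over the data

u_matrix = som.distance_map()  # U-matrix: mean distance of each node to its neighbours
```

The distance_map() call produces the unified distance matrix discussed below.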
Fig. 3 shows the training progress of the SOM over 100 iterations; in this experiment, the complete data set is presented to the network 100 times. The distance between the weight of a node and the input sample reduces as the training iterations progress, and should ideally reach a minimum value.
Fig. 3. Training progress of the SOM over 100 iterations.
The Unified Distance Matrix (U-matrix) is a visual representation of the SOM on a two dimensional grid, as shown in Fig. 4. The U-matrix indicates the distance between adjacent nodes, represented by different grey shades. A dark colour indicates a larger distance between neighboring nodes and represents a gap in the input values. A lighter shade indicates that the vectors are close to each other and form clusters, while high values represent the borders between clusters.
Fig. 4. Unified distance matrix visualization.
Component Planes
Component planes clearly visualize the individual input variables. They are represented in grey scale or in a combination of different colours. Fig. 5 shows the component planes for each variable.
By inspecting the component planes in Fig. 5, it is evident that the variables Ozone, Solar radiation and Temperature are positively correlated, while Wind is negatively correlated with the other variables.
Fig. 5. Component planes for each variable.
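Continuing from the MiniSom sketch above, component planes can be drawn directly from the trained codebook, one grey-scale image per input variable (matplotlib assumed available).

```python
import matplotlib.pyplot as plt

names = ["Ozone", "Solar.R", "Wind", "Temp"]
weights = som.get_weights()  # codebook, shape (10, 10, 4)
fig, axes = plt.subplots(1, len(names), figsize=(16, 4))
for i, ax in enumerate(axes):
    ax.imshow(weights[:, :, i], cmap="gray")  # plane for variable i
    ax.set_title(names[i])
    ax.axis("off")
plt.show()
```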
The rates of the chemical reactions that produce ozone are affected by solar radiation and temperature; an increase in these values increases the chemical reactions and leads to an increase in ozone levels. Wind speed is negatively related to all of the variables: the concentration of ozone is high when wind speeds are low. The covariance matrix generated for the input variables reflects the same relationships as the component planes. Table 2 gives the correlation coefficient values between the variables. It can be observed that the maximum correlation exists between ozone and temperature, and that wind is most negatively related to ozone. Component planes help in understanding these relationships pictorially; the Self Organizing Map therefore enables easy visualization of the relationships between variables.
Table 2: Correlation coefficients between the variables

Variables   Ozone    Solar.R   Wind     Temp
Ozone       1.0000
Solar.R     0.3411   1.0000
Wind       -0.6265  -0.1147    1.0000
Temp        0.6938   0.2850   -0.4947   1.0000
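Such a correlation table can be computed in one call with pandas. This minimal sketch assumes the measurements are loaded in a DataFrame with the column names above; random placeholder values are used here in place of the actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(111, 4)),
                  columns=["Ozone", "Solar.R", "Wind", "Temp"])  # placeholder values
print(df.corr().round(4))  # pairwise Pearson correlation coefficients
```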
Clustering of Self Organising Map
A U-matrix may be appropriate for identifying cluster boundaries, but it is not always effective; therefore, a hierarchical clustering technique is applied after the SOM is trained. As shown in Fig. 6,
three clusters are formed: High Ozone, Medium Ozone and Low Ozone.
Fig. 6. Formation of the clusters; blue nodes indicate High Ozone, orange nodes indicate Medium Ozone and green nodes indicate Low Ozone.
In the High Ozone class, the effects of solar radiation and temperature are above average and wind speed has very little effect. In the Low Ozone class, the effects of solar radiation and temperature are below average and wind speed has a significant effect. In the Medium Ozone cluster, solar radiation, temperature and wind all have roughly average effects. The clustering results obtained are satisfactory, and Self Organising Maps can be used to classify the quality of air.
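The clustering step can be sketched with SciPy's agglomerative clustering applied to the trained SOM's codebook vectors, reusing the som object from the earlier MiniSom sketch. Ward linkage is an assumption here, as the paper does not state its linkage criterion.

```python
from scipy.cluster.hierarchy import linkage, fcluster

codebook = som.get_weights().reshape(-1, 4)         # one row per SOM node
tree = linkage(codebook, method="ward")             # hierarchical clustering
labels = fcluster(tree, t=3, criterion="maxclust")  # cut the tree into 3 clusters
cluster_map = labels.reshape(10, 10)                # cluster label per grid node, as in Fig. 6
```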
V. CONCLUSION
In this paper, Principal Component Analysis (PCA) and the Self Organising Map (SOM) have been applied to demonstrate dimensionality reduction of a dataset. A case study was considered to visualize and classify air quality data. PCA explained most of the variance in the data, but it was difficult to interpret the data pattern. SOM appears to be a better fit for representing complex data; in particular, the component planes of the SOM provide effective visualization and uncover the correlations between the input variables.
The U-matrix can be used for data classification, but may not be a good choice in all cases. Therefore, in this paper a hierarchical clustering technique was applied to the SOM results obtained, and the quality of air was partitioned into three clusters: low, medium and high ozone content. It can be concluded that SOM has better classification resolving power than traditional methods.
REFERENCES
[1] Johnson, Richard Arnold, and Dean W. Wichern, et al., "Applied Multivariate Statistical Analysis", Vol. 4, Prentice Hall, Englewood Cliffs, NJ, 1992.
[2] Gurney, Kevin, "An Introduction to Neural Networks", CRC Press, 1997.
[3] Kosanovich, K. A., and M. J. Piovoso, "Process data analysis using multivariate statistical methods", IEEE, 1991, pp. 721-724.
[4] Barreto, Alexandre S., "Multivariate statistical analysis for dermatological disease diagnosis", IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), IEEE, 2014, pp. 500-504.
[5] Harrou, Fouzi, Mohamed Numan Nounou, and Hazem Numan Nounou, "Detecting abnormal ozone levels using PCA-based GLR hypothesis testing", IEEE, 2013, pp. 95-102.
[6] Raychaudhuri, Soumya, Joshua M. Stuart, and Russ B. Altman, "Principal components analysis to summarize microarray experiments: application to sporulation time series", Pacific Symposium on Biocomputing, NIH Public Access, 2000, p. 455.
[7] Annas, Suwardi, Takenori Kanai, and Shuhei Koyama, "Principal component analysis and self-organizing map for visualizing and classifying fire risks in forest regions", Agricultural Information Research Journal, 2007, pp. 44-51.
[8] Astel, A., S. Tsakovski, P. Barbieri, and V. Simeonov, "Comparison of self-organizing maps classification approach with cluster and principal components analysis for large environmental data sets", Water Research, 41(19), 2007, pp. 4566-4578.
[9] Klobucar, Damir, and Marko Subasic, "Using self-organizing maps in the visualization and analysis of forest inventory", iForest - Biogeosciences and Forestry, 5(5), 2012, p. 216.
[10] Chattopadhyay, Manojit, Pranab K. Dan, and Sitanath Mazumdar, "Principal component analysis and self-organizing map for visual clustering of machine-part cell formation in cellular manufacturing system", Systems Research Forum, Vol. 5, 2011, pp. 25-51.
[11] Koua, E. L., "Using self-organizing maps for information visualization and knowledge discovery in complex geospatial datasets", Proceedings of the 21st International Cartographic Conference (ICC), 2003, pp. 1694-1702.
[12] Smith, Lindsay I., "A tutorial on principal components analysis", Cornell University, USA, p. 65, 2002.
[13] Kohonen, Teuvo, et al., "Engineering applications of the self-organizing map", Proceedings of the IEEE, 84(10), 1996, pp. 1358-1384.