Feature selection is an effective technique for protein sequence classification. Researchers apply well-known classification techniques such as neural networks, genetic algorithms, fuzzy ARTMAP, and rough set classifiers to extract features. This paper presents a review of three classification models: the fuzzy ARTMAP model, the neural network model, and the rough set classifier model. This is followed by a new technique for classifying protein sequences. The proposed model is implemented with a purpose-built tool written in Java, and the paper argues that it reduces the computational overhead encountered by earlier approaches while also increasing classification accuracy.
A chi-square-SVM based pedagogical rule extraction method for microarray data... (IJAAS Team)
The Support Vector Machine (SVM) is an efficient classification technique owing to its ability to capture nonlinearities in diagnostic systems, but it does not reveal the knowledge learnt during training. In domains such as bioinformatics, it is important to understand how a decision is reached by the machine learning technology. A decision tree, on the other hand, has good comprehensibility; the process of converting such incomprehensible models into an understandable one is often called rule extraction. In this paper we propose an approach for extracting rules from an SVM for microarray datasets that combines the merits of both the SVM and the decision tree. The approach consists of three steps: SVM with chi-square feature selection is employed to reduce the feature set; the dataset with reduced features is used to obtain an SVM model and generate synthetic data; and a Classification and Regression Tree (CART) is used to generate rules in the last phase. We use the breast masses dataset from the UCI repository, where comprehensibility is a key requirement. The experimental results show that, with the reduced-feature dataset, the proposed approach extracts shorter rules, thereby improving the comprehensibility of the system. We obtained an accuracy of 93.53%, sensitivity of 89.58%, specificity of 96.70%, and a training time of 3.195 seconds. A comparative analysis with other algorithms is also carried out.
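Sketching the first step of the pipeline above, a minimal chi-square feature scorer can be written in plain NumPy (the function name `chi2_scores` and the synthetic data are illustrative assumptions, not the paper's implementation or dataset):

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against the class
    labels, computed from per-class feature sums (the scheme scikit-learn's
    chi2 scorer also uses)."""
    classes = np.unique(y)
    observed = np.array([X[y == c].sum(axis=0) for c in classes])
    class_prob = np.array([(y == c).mean() for c in classes])
    expected = np.outer(class_prob, X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = rng.random((100, 4))
X[:, 2] += y                          # make feature 2 informative about the class
scores = chi2_scores(X, y)
keep = np.argsort(scores)[::-1][:2]   # retain the two best-scoring features
print("scores:", scores.round(2), "kept features:", keep)
```

The retained columns would then feed the SVM training and synthetic-data steps of the pipeline.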
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH (ijdms)
The alignment of two DNA sequences is a basic step in the analysis of biological data, and sequencing a long DNA sequence is one of the most interesting problems in bioinformatics. Several techniques, such as dynamic programming and heuristic algorithms, have been developed to solve the sequence alignment problem. In this paper, we introduce GPCodon alignment, a pairwise DNA-DNA method for global sequence alignment that improves the accuracy of pairwise sequence alignment. We use a new scoring matrix, the empirical codon substitution matrix, to produce the final alignment. Using this matrix in our technique enables the discovery of relationships between sequences that could not be discovered using traditional matrices. In addition, we present experimental results that show the performance of the proposed technique on eleven datasets with an average length of 2967 bp. We compared the efficiency and accuracy of our technique against a comparable tool, "Pairwise Align Codons" [1].
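Global pairwise alignment by dynamic programming, the scheme into which a substitution matrix such as the empirical codon matrix is plugged, can be sketched as follows (a toy nucleotide match/mismatch score stands in for the paper's codon matrix, and the gap penalty is an assumption):

```python
# Needleman-Wunsch global alignment score by dynamic programming.
# A real system would replace the match/mismatch constants with a lookup
# into a substitution matrix (e.g. a codon matrix) over sequence triplets.

def global_align(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,    # substitute
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    return score[n][m]

print(global_align("GATTACA", "GCATGCU"))
```

A traceback over the same table would recover the alignment itself, not just its score.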
Genome structure prediction: a review over soft computing techniques (eSAT Journals)
Abstract — Techniques such as spectrometry and crystallography exist for determining DNA, RNA, and protein structures. These processes provide very accurate structure estimates, but they are very slow and can be applied only in a few special cases. Soft computing techniques promise near-optimal results in much less time, have very wide applicability, and are much easier to apply. Different soft computing approaches, including nature-inspired computing, have been used to estimate genome structures with considerable accuracy. This paper provides a review of the different soft computing techniques that have been applied, along with the application method, for the determination of genome structure. Keywords — DNA, RNA, proteins, structure, soft computing, techniques.
USING ARTIFICIAL NEURAL NETWORK IN DIAGNOSIS OF THYROID DISEASE: A CASE STUDY (ijcsa)
Nowadays, one of the main challenges that developing technology creates in the medical sciences is disease diagnosis with high accuracy. In recent decades, Artificial Neural Networks (ANNs) have been considered among the best solutions to achieve this goal and are the subject of widespread research on disease diagnosis. In this paper, we consider a Multi-Layer Perceptron (MLP) ANN trained with the back-propagation learning algorithm to classify thyroid disease. It consists of an input layer with 5 neurons, a hidden layer with 6 neurons, and an output layer with just 1 neuron. The selection of the activation function, the number of neurons in the hidden layer, and the number of layers is carried out by trial and error. Our simulation results indicate that the resulting optimized MLP ANN reaches an accuracy of 98.6%.
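The 5-6-1 architecture stated above can be sketched as a back-propagation loop in NumPy (the sigmoid activation, learning rate, and toy labeling task are our assumptions; the paper's thyroid data and tuning are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(1)

# 5-6-1 network: 5 inputs, 6 hidden units, 1 output, as in the abstract.
W1 = rng.normal(scale=0.5, size=(5, 6)); b1 = np.zeros(6)
W2 = rng.normal(scale=0.5, size=(6, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy task (stand-in for the thyroid data): label 1 when the mean input > 0.5.
X = rng.random((200, 5))
y = (X.mean(axis=1) > 0.5).astype(float).reshape(-1, 1)

lr = 1.0
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)          # hidden layer, 6 units
    out = sigmoid(h @ W2 + b2)        # output layer, 1 unit
    # Back-propagate the squared-error gradient through both layers.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

acc = ((out > 0.5) == (y > 0.5)).mean()
print("training accuracy:", acc)
```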
Semi-supervised learning approach using modified self-training algorithm to c... (IJECEIAES)
Burst header packet flooding is an attack on optical burst switching (OBS) networks which may cause denial of service. The application of machine learning techniques to detect malicious nodes in an OBS network is relatively new. As finding a sufficient amount of labeled data to perform supervised learning is difficult, semi-supervised methods of learning (SSML) can be leveraged. In this paper, we study the classical self-training algorithm (ST), which follows the SSML paradigm. Generally, in ST, the available true-labeled data (L) is used to train a base classifier, which then predicts the labels of the unlabeled data (U). A portion of the newly labeled data is removed from U based on prediction confidence and combined with L, and the resulting data is used to re-train the classifier. This process is repeated until convergence. This paper proposes a modified self-training method (MST): we train multiple classifiers on L in two stages and leverage agreement among those classifiers to determine labels. The performance of MST was compared with ST on several datasets, and significant improvement was found. We applied MST to a simulated OBS network dataset and obtained very high accuracy with a small amount of labeled data. Finally, we compare this work with some related works.
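The classical self-training loop described above can be sketched as follows (a nearest-centroid classifier stands in for the paper's base learner, and the confidence measure, threshold, and synthetic two-blob data are our assumptions):

```python
import numpy as np

def fit_centroids(X, y):
    """Train the base classifier: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_with_confidence(centroids, X):
    """Predict the nearest-centroid class; confidence is the margin
    between the two closest centroids."""
    labels = np.array(sorted(centroids))
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in labels])
    pred = labels[d.argmin(axis=0)]
    d_sorted = np.sort(d, axis=0)
    return pred, d_sorted[1] - d_sorted[0]

rng = np.random.default_rng(0)
# Two Gaussian blobs; only 5 points per class start out labeled (L).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
true_y = np.array([0] * 100 + [1] * 100)
labeled = np.zeros(200, dtype=bool)
labeled[:5] = True
labeled[100:105] = True

L_X, L_y = X[labeled], true_y[labeled]
U_X = X[~labeled]
while len(U_X) > 0:
    model = fit_centroids(L_X, L_y)
    pred, conf = predict_with_confidence(model, U_X)
    take = conf >= np.quantile(conf, 0.5)        # most confident half of U
    L_X = np.vstack([L_X, U_X[take]])
    L_y = np.concatenate([L_y, pred[take]])
    U_X = U_X[~take]                             # repeat until U is empty

final = fit_centroids(L_X, L_y)
pred_all, _ = predict_with_confidence(final, X)
acc = (pred_all == true_y).mean()
print("accuracy vs true labels:", acc)
```

The MST variant described above would replace the single base classifier with multiple classifiers and use their agreement, rather than a single confidence score, to admit points into L.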
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq (Kiogyf)
Abstract: The prediction of the price of crude oil is important for economic, political, and industrial purposes in both crude oil importing and exporting countries. Fluctuations in oil prices can have a significant influence on many countries. It is therefore necessary to develop a suitable model that can accurately predict the economic and engineering parameters that are directly related to the crude oil price. This paper proposes a soft computing (SC) model, consisting of a multi-layer perceptron neural network (MLP-NN), for accurate future crude oil price prediction. The performance of the proposed SC model was compared to that of other neural network approaches and found to be better in the prediction of both monthly and daily crude oil prices, especially where there is a limited amount of input data for model training and in situations of high parameter variability.
Keywords: Crude Oil Price, Soft Computing, Economics, Artificial Neural Networks.
International Journal of Computer Science, Engineering and Information Techno... (IJCSEIT Journal)
As more data is added in the field of proteomics, the computational methods need to become more efficient. The parts of molecular sequences that are functionally most important to the molecule are the most resistant to change. To ensure the reliability of sequence alignment, comparative approaches are used. A multiple sequence alignment amounts to a proposition about evolutionary history: for each column in the alignment, the explicit homologous correspondence of each individual sequence position is established. The different pairwise sequence alignment methods are elaborated in the present work, but these methods are suitable only for aligning a limited number of sequences of small length. A new method is therefore introduced for aligning sequences based on local alignment with consensus sequences. Triticum (wheat) varieties are loaded from the NCBI databank, and phylogenetic trees are constructed for the parts into which the dataset is divided. A single new tree is constructed from the previously generated trees using an advanced pruning technique. Then the closely related sequences are extracted by applying threshold conditions, and an optimal sequence alignment is obtained by using shift operations in both directions.
Comparative analysis of dynamic programming algorithms to find similarity in ... (eSAT Journals)
Abstract — Many computational methods exist for finding similarity in gene sequences; finding the method that gives optimal similarity is a difficult task. The objective of this project is to find an appropriate method to compute similarity in gene/protein sequences, both within families and across families. Dynamic programming algorithms such as Levenshtein edit distance, Longest Common Subsequence, and Smith-Waterman have been used to find similarities between two sequences, but none of the methods mentioned above has been evaluated on real benchmark datasets; they have only been applied to synthetic data. We propose a new method to compute similarity. The performance of the proposed algorithm is evaluated using a number of datasets from various families, and the similarity value is calculated both within a family and across families. A comparative analysis and the time complexity of the proposed method reveal that the Smith-Waterman approach is appropriate when the gene/protein sequences belong to the same family, while Longest Common Subsequence is best suited when the sequences belong to two different families. Keywords — Bioinformatics, Gene, Gene Sequencing, Edit distance, String Similarity.
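Two of the dynamic programming measures named above, Levenshtein edit distance and Longest Common Subsequence (LCS) length, can be sketched in a few lines each (toy sequences only; this is not the authors' benchmark code):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    turning a into b, using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

print(levenshtein("GATTACA", "GACTATA"))
print(lcs_length("GATTACA", "GACTATA"))
```

Smith-Waterman follows the same tabular pattern but clamps cell scores at zero to find the best local alignment rather than a global one.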
Delineation of techniques to implement on the enhanced proposed model using d... (ijdms)
In the post-genomic era, with the advent of new technologies, a huge amount of complex molecular data is generated at high throughput. The management of this biological data for discovering new knowledge is a challenging task owing to the complexity and heterogeneity of the data, and issues such as noisy and incomplete data also need to be dealt with. Data mining has been used with notable success in the biological domain, where discovering new knowledge from biological data is a major challenge. The novelty of the proposed model is its combined use of intelligent techniques to classify protein sequences faster and more efficiently: using FFT, a fuzzy classifier, a string-weighting algorithm, a gram encoding method, a neural network model, and a rough set classifier in a single model, each in an appropriate place, can enhance the quality of the classification system. Thus the primary challenge is to identify and classify large protein sequences in a fast and easy but intelligent way that decreases both time complexity and space complexity.
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA... (ijsc)
As the size of biomedical databases grows day by day, finding the essential features for disease prediction has become more complex due to high dimensionality and sparsity. Moreover, given the availability of a large number of microarray datasets in biomedical repositories, it is difficult to analyze, predict, and interpret feature information using traditional feature-selection-based classification models. Most traditional feature-selection-based classification algorithms have computational issues on microarray datasets, such as dimension reduction, uncertainty, and class imbalance. The ensemble classifier is one of the scalable models for extreme learning machines owing to its high efficiency and fast processing speed in real-time applications. The main objective of feature-selection-based ensemble learning models is to classify high-dimensional data with high computational efficiency and a high true positive rate. In the proposed model, an optimized Particle Swarm Optimization (PSO) based ensemble classification model is developed for high-dimensional microarray datasets. Experimental results show that the proposed model has high computational efficiency compared with traditional feature-selection-based classification models as far as accuracy, true positive rate, and error rate are concerned.
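A minimal PSO loop of the kind the ensemble model above builds on can be sketched as follows (here minimizing a simple sphere function rather than tuning an ensemble; swarm size, inertia, and acceleration coefficients are conventional defaults, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    """Toy objective to minimize; a feature-selection PSO would instead
    score a candidate feature subset or classifier weighting."""
    return (x ** 2).sum(axis=-1)

n_particles, dim = 20, 5
pos = rng.uniform(-5, 5, size=(n_particles, dim))
vel = np.zeros((n_particles, dim))
pbest = pos.copy()
pbest_val = sphere(pos)
gbest = pbest[pbest_val.argmin()].copy()

w, c1, c2 = 0.7, 1.5, 1.5            # inertia and acceleration coefficients
for _ in range(200):
    r1, r2 = rng.random((2, n_particles, dim))
    # Each particle is pulled toward its personal best and the global best.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    val = sphere(pos)
    improved = val < pbest_val
    pbest[improved] = pos[improved]
    pbest_val[improved] = val[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best value found:", sphere(gbest))
```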
A chi-square-SVM based pedagogical rule extraction method for microarray data...IJAAS Team
Support Vector Machine (SVM) is currently an efficient classification technique due to its ability to capture nonlinearities in diagnostic systems, but it does not reveal the knowledge learnt during training. It is important to understand of how a decision is reached in the machine learning technology, such as bioinformatics. On the other hand, a decision tree has good comprehensibility; the process of converting such incomprehensible models into an understandable model is often regarded as rule extraction. In this paper we proposed an approach for extracting rules from SVM for microarray dataset by combining the merits of both the SVM and decision tree. The proposed approach consists of three steps; the SVM-CHI-SQUARE is employed to reduce the feature set. Dataset with reduced features is used to obtain SVM model and synthetic data is generated. Classification and Regression Tree (CART) is used to generate Rules as the Last phase. We use breast masses dataset from UCI repository where comprehensibility is a key requirement. From the result of the experiment as the reduced feature dataset is used, the proposed approach extracts smaller length rules, thereby improving the comprehensibility of the system. We obtained accuracy of 93.53%, sensitivity of 89.58%, specificity of 96.70%, and training time of 3.195 seconds. A comparative analysis is carried out done with other algorithms.
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACHijdms
The alignment of two DNA sequences is a basic step in the analysis of biological data. Sequencing a long
DNA sequence is one of the most interesting problems in bioinformatics. Several techniques have been
developed to solve this sequence alignment problem like dynamic programming and heuristic algorithms.
In this paper, we introduce (GPCodon alignment) a pairwise DNA-DNA method for global sequence
alignment that improves the accuracy of pairwise sequence alignment. We use a new scoring matrix to
produce the final alignment called the empirical codon substitution matrix. Using this matrix in our
technique enabled the discovery of new relationships between sequences that could not be discovered using
traditional matrices. In addition, we present experimental results that show the performance of the
proposed technique over eleven datasets of average length of 2967 bps. We compared the efficiency and
accuracy of our techniques against a comparable tool called “Pairwise Align Codons” [1].
Genome structure prediction a review over soft computing techniqueseSAT Journals
Abstract There are some techniques like spectrometry or crystallography for the determination of DNA, RNA or protein structures. These processes provide very accurate results for the structure estimation. But these conventional techniques are very slow and could be applied over a few special cases only. Soft computing techniques guarantee a near appropriate results in much smaller time and have very large applicability. These techniques are much easier to apply. Different approaches have been used in soft computing including nature inspired computing for estimation of genome structures with a considerable accuracy of results. This paper provides a review over different soft computing techniques been applied along with application method for the determination of genome structure. Keywords—DNA, RNA, proteins, structure, soft computing, techniques.
USING ARTIFICIAL NEURAL NETWORK IN DIAGNOSIS OF THYROID DISEASE: A CASE STUDYijcsa
Nowadays, one of the main issues to create challenges in medicine sciences by developing technology is the
disease diagnosis with high accuracy. In the recent decades, Artificial Neural Networks (ANNs) are considered as the best solutions to achieve this goal and involve in widespread researches to diagnose the diseases. In this paper, we consider a Multi-layer Perceptron (MLP) ANN using back propagation learning algorithm to classify Thyroid disease. It consists of an input layer with 5 neurons, a hidden layer with 6 neurons and an output layer with just 1 neuron. The suitable selection of activation function and the number of neurons in the hidden layer and also the number of layers are achieved using test and error method. Our simulation results indicate that the performed optimization in MLP ANNs can be reached the accuracy level to 98.6%.
Semi-supervised learning approach using modified self-training algorithm to c...IJECEIAES
Burst header packet flooding is an attack on optical burst switching (OBS) network which may cause denial of service. Application of machine learning technique to detect malicious nodes in OBS network is relatively new. As finding sufficient amount of labeled data to perform supervised learning is difficult, semi-supervised method of learning (SSML) can be leveraged. In this paper, we studied the classical self-training algorithm (ST) which uses SSML paradigm. Generally, in ST, the available true-labeled data (L) is used to train a base classifier. Then it predicts the labels of unlabeled data (U). A portion from the newly labeled data is removed from U based on prediction confidence and combined with L. The resulting data is then used to re-train the classifier. This process is repeated until convergence. This paper proposes a modified self-training method (MST). We trained multiple classifiers on L in two stages and leveraged agreement among those classifiers to determine labels. The performance of MST was compared with ST on several datasets and significant improvement was found. We applied the MST on a simulated OBS network dataset and found very high accuracy with a small number of labeled data. Finally we compared this work with some related works.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of IraqKiogyf
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Abstract: The prediction of the price of crude oil is important for economic, political, and industrial purposes in both crude oil importing and exporting countries. Fluctuations in oil prices can have a significant influence in many countries. Therefore, it is necessary to develop a suitable model that can accurately predict different economic and engineering parameters that are directly related to the crude oil price. This paper proposes the use of a soft computing (SC) model which consists of a multi-layer perceptron neural network (MLP-NN) for accurate future crude oil price prediction. The performance of the SC model proposed in this study was compared to that of other neural network approaches and found to perform better in the prediction of both monthly and daily crude oil prices, especially where there is a limited number of input data for model training and in situations of high parameters variability.
Keywords: Crude Oil Price, Soft Computing, Economics, Artificial Neural Networks.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Computer Science, Engineering and Information Techno...IJCSEIT Journal
In the field of proteomics because of more data is added, the computational methods need to be more
efficient. The part of molecular sequences is functionally more important to the molecule which is more
resistant to change. To ensure the reliability of sequence alignment, comparative approaches are used. The
problem of multiple sequence alignment is a proposition of evolutionary history. For each column in the
alignment, the explicit homologous correspondence of each individual sequence position is established. The
different pair-wise sequence alignment methods are elaborated in the present work. But these methods are
only used for aligning the limited number of sequences having small sequence length. For aligning
sequences based on the local alignment with consensus sequences, a new method is introduced. From NCBI
databank triticum wheat varieties are loaded. Phylogenetic trees are constructed for divided parts of
dataset. A single new tree is constructed from previous generated trees using advanced pruning technique.
Then, the closely related sequences are extracted by applying threshold conditions and by using shift
operations in the both directions optimal sequence alignment is obtained.
Comparative analysis of dynamic programming algorithms to find similarity in ...eSAT Journals
Abstract There exist many computational methods for finding similarity in gene sequence, finding suitable methods that gives optimal similarity is difficult task. Objective of this project is to find an appropriate method to compute similarity in gene/protein sequence, both within the families and across the families. Many dynamic programming algorithms like Levenshtein edit distance; Longest Common Subsequence and Smith-waterman have used dynamic programming approach to find similarities between two sequences. But none of the method mentioned above have used real benchmark data sets. They have only used dynamic programming algorithms for synthetic data. We proposed a new method to compute similarity. The performance of the proposed algorithm is evaluated using number of data sets from various families, and similarity value is calculated both within the family and across the families. A comparative analysis and time complexity of the proposed method reveal that Smith-waterman approach is appropriate method when gene/protein sequence belongs to same family and Longest Common Subsequence is best suited when sequence belong to two different families. Keywords - Bioinformatics, Gene, Gene Sequencing, Edit distance, String Similarity.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Delineation of techniques to implement on the enhanced proposed model using d...ijdms
In the post-genomic era, with the advent of new technologies, a huge amount of complex molecular data is generated with high throughput. Managing this biological data in order to discover new knowledge is a challenging task due to the complexity and heterogeneity of the data, and issues such as noisy and incomplete data need to be dealt with. Data mining has been applied with considerable success in the biological domain, yet discovering new knowledge from biological data remains a major challenge. The novelty of the proposed model is its combined use of intelligent techniques to classify protein sequences faster and more efficiently. Using FFT, a fuzzy classifier, a string-weighting algorithm, a gram-encoding method, a neural network model and a rough set classifier in a single model, each in an appropriate place, can enhance the quality of the classification system. Thus the primary challenge is to identify and classify large protein sequences in a fast and easy but intelligent way that decreases the time complexity and space complexity.
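Of the techniques listed, the gram-encoding step is the easiest to make concrete. A minimal sketch, assuming a simple normalized 2-gram frequency encoding of the amino-acid string (the paper's exact encoding may differ):

```python
from collections import Counter

def gram_encode(sequence, n=2):
    """Encode a protein sequence as normalized n-gram frequencies."""
    grams = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Hypothetical fragment: the five 2-grams MK, KV, VL, LA, AA each occur once.
vec = gram_encode("MKVLAA")
```

Such fixed-length frequency vectors are what downstream stages (fuzzy classifier, neural network, rough set classifier) consume in place of raw variable-length sequences.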
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As biomedical databases grow day by day, finding the essential features for disease prediction has become more complex due to high dimensionality and sparsity problems. Also, given the large number of micro-array datasets available in biomedical repositories, it is difficult to analyze, predict and interpret feature information using traditional feature-selection-based classification models. Most traditional feature-selection-based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. The ensemble classifier is one of the scalable models for extreme learning machines owing to its high efficiency and fast processing speed for real-time applications. The main objective of feature-selection-based ensemble learning models is to classify high-dimensional data with high computational efficiency and a high true positive rate. In this work an optimized Particle Swarm Optimization (PSO) based ensemble classification model was developed for high-dimensional microarray datasets. Experimental results show that the proposed model has high computational efficiency compared to traditional feature-selection-based classification models as far as accuracy, true positive rate and error rate are concerned.
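A binary PSO for feature selection can be sketched as follows. The inertia and attraction coefficients, the sigmoid transfer rule, and the toy fitness function are all assumptions of this sketch, not the paper's configuration:

```python
import math
import random

def binary_pso(n_features, fitness, n_particles=10, iters=30, seed=0):
    """Binary PSO sketch: each particle is a 0/1 mask over features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                # sigmoid transfer turns the real velocity into a bit probability
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Toy fitness: features 0 and 2 are informative; every selected feature adds cost.
useful = {0, 2}
def fitness(mask):
    return sum(1 for d, bit in enumerate(mask) if bit and d in useful) \
           - 0.1 * sum(mask)

best_mask, best_fit = binary_pso(6, fitness)
```

In the paper's setting the fitness would instead be the ensemble classifier's validation accuracy on the microarray features selected by the mask.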
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...IAEME Publication
This paper presents an approach based on an aggregated predictor, formed by multiple versions of a multilayer neural network with a back-propagation optimization algorithm, for helping the engineer obtain a list of the most appropriate well-test interpretation models for a given set of pressure/production data. The proposed method consists of three stages: (1) data decorrelation through principal component analysis, to reduce the covariance between the variables and the dimension of the input layer of the artificial neural network; (2) bootstrap replicates of the learning set, where the data is repeatedly sampled with a random split into training sets and these are used as new learning sets; and (3) automatic reservoir model identification through the aggregated predictor, formed by a plurality vote when predicting a new class. The method is described in detail to ensure successful replication of results. The required training and test datasets were generated using analytical solution models; 600 samples were used: 300 for training, 100 for cross-validation, and 200 for testing. Different network structures were tested during this study to arrive at an optimum network design. We notice that the single-net methodology always brings about confusion in selecting the correct model, even though the training results for the constructed networks are close to 1. We also notice that principal component analysis is an effective strategy for reducing the number of input features, simplifying the network structure, and lowering the training time of the ANN. The results obtained show that the proposed model provides better performance when predicting new data, with a coefficient of correlation of approximately 95% compared to 80% for a previous approach; the combination of PCA and ANN is more stable and determines more accurate results with less computational complexity than was feasible previously.
Clearly, the aggregated predictor is more stable and shows fewer bad classes compared to the previous approach.
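Stages (2) and (3) — bootstrap replicates plus a plurality vote — can be illustrated with a deliberately trivial base learner (a per-class centroid over 1-D inputs); the actual predictor aggregates multilayer neural networks, and the toy data below is invented for illustration:

```python
import random

def train_centroid(sample):
    """Trivial base learner: per-class mean of 1-D inputs."""
    sums, counts = {}, {}
    for x, y in sample:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_centroid(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def bagged_predict(data, x, n_models=15, seed=1):
    """Bootstrap replicates of the learning set + plurality vote."""
    rng = random.Random(seed)
    votes = {}
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # bootstrap replicate
        label = predict_centroid(train_centroid(sample), x)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)                # plurality vote

data = [(0.1, "low"), (0.3, "low"), (0.9, "high"), (1.1, "high")]
```

The vote smooths out the instability of any single trained model, which is the behaviour the abstract reports relative to the single-net methodology.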
Building a Classifier Employing Prism Algorithm with Fuzzy LogicIJDKP
Classification in data mining is receiving immense interest in recent times. As knowledge is based on historical data, classification of data is essential for discovering knowledge. To decrease classification complexity, the quantitative attributes of the data need splitting; but splitting using classical logic is less accurate, and this can be overcome by the use of fuzzy logic. This paper illustrates how to build up classification rules using fuzzy logic. The fuzzy classifier is built using the Prism decision tree algorithm and produces more realistic results than the classical one. The effectiveness of this method is demonstrated on a sample dataset.
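The fuzzy splitting of a quantitative attribute that the abstract contrasts with crisp (classical) splitting can be sketched with triangular membership functions; the attribute and its breakpoints below are invented for illustration:

```python
def triangular(x, a, b, c):
    """Triangular membership: rises from a to a peak at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzify_age(age):
    """Split a quantitative attribute into overlapping fuzzy sets,
    so a value can belong partially to several rule antecedents."""
    return {
        "young":  triangular(age, -1, 0, 35),
        "middle": triangular(age, 25, 45, 65),
        "old":    triangular(age, 55, 80, 121),
    }

memberships = fuzzify_age(30)
```

Unlike a crisp cut at, say, age 35, a 30-year-old here has non-zero degrees in both "young" and "middle", which is what lets Prism-style rules grade smoothly at split boundaries.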
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
This paper presents a hybrid data mining approach, based on supervised and unsupervised learning, to identify the closest data patterns in a database. The technique enables the maximum accuracy rate to be achieved with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithms and is also implemented with multidimensional datasets. The implementation results show better prediction accuracy and reliability.
The International Journal of Engineering and Science (The IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Comparative study of various supervised classification methods for analysing def...eSAT Publishing House
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...ijaia
Feature selection and classification are essential processes when dealing with large data sets that comprise numerous input attributes. Many search methods and classifiers have been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of attributes and improve classification accuracy by adopting an ensemble rule classifiers method. The research process involves two phases: finding the optimal set of attributes, and applying the ensemble classifiers method to the classification task. Results are reported in terms of percentage accuracy, the number of selected attributes, and the number of rules generated. Six datasets were used for the experiment. The final output is an optimal set of attributes combined with the ensemble rule classifiers method. The experimental results, conducted on public real datasets, demonstrate that ensemble rule classifiers methods consistently show improved classification accuracy on the selected datasets. Significant improvement in accuracy and an optimal set of selected attributes are achieved by adopting the ensemble rule classifiers method.
A clonal based algorithm for the reconstruction of genetic network using s sy...eSAT Journals
Abstract Motivation: A gene regulatory network is a network-based approach to representing the interactions between genes. The DNA microarray is the most widely used technology for extracting the relationships between thousands of genes simultaneously. A gene microarray experiment provides gene expression data for a particular condition and varying time periods; the expression of a particular gene depends upon the biological conditions and on other genes. In this paper, we propose a new method for the analysis of microarray data. The proposed method makes use of the S-system, a well-accepted model for gene regulatory network reconstruction. Since the problem has multiple solutions, an optimized solution has to be identified, and evolutionary algorithms have been used to solve such problems. Though a number of attempts have already been carried out by various researchers, the solutions are still not satisfactory with respect to the time taken and the degree of accuracy achieved; therefore a large amount of further work on this topic is needed to achieve solutions with improved performance. Results: In this work, we propose a clonal selection algorithm for identifying the optimal gene regulatory network. The approach is tested on real-life data: the SOS E. coli DNA repair gene expression data. It is observed that the proposed algorithm converges much faster and provides better results than the existing algorithms. Index Terms: Microarray analysis, Evolutionary Algorithm, Artificial Immune System, S-system, Gene Regulatory Network, SOS E. coli DNA repair, Clonal Selection Algorithm.
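A clonal selection loop of the kind described can be sketched for a one-dimensional parameter search. The population sizes, the mutation schedule, and the toy objective are assumptions of this sketch, not the paper's S-system setup:

```python
import random

def clonal_selection(objective, lo, hi, pop=20, clones=5, gens=50, seed=0):
    """CLONALG-style sketch for a 1-D parameter search (minimization)."""
    rng = random.Random(seed)
    cells = [rng.uniform(lo, hi) for _ in range(pop)]
    for _ in range(gens):
        cells.sort(key=objective)                 # highest affinity first
        new = []
        for rank, cell in enumerate(cells[: pop // 2]):
            # better cells mutate less: affinity-proportional maturation
            radius = (hi - lo) * (rank + 1) / pop
            for _ in range(clones):
                new.append(min(hi, max(lo, cell + rng.gauss(0, radius))))
        # keep the fittest cells, replace the rest with fresh random cells
        cells = sorted(new + cells, key=objective)[: pop - 2]
        cells += [rng.uniform(lo, hi) for _ in range(2)]
    return min(cells, key=objective)

# Toy objective standing in for the S-system fitting error.
best = clonal_selection(lambda x: (x - 3.0) ** 2, -10, 10)
```

In the paper the objective would instead be the discrepancy between the S-system's simulated expression profiles and the observed microarray time series, over a vector of kinetic parameters.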
Similar to A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR cscpconf
The progressive development of Synthetic Aperture Radar (SAR) systems has diversified the exploitation of the images generated by these systems in different geoscience applications. The detection and monitoring of surface deformations produced by various phenomena have benefited from this evolution and have been carried out with interferometry (InSAR) and differential interferometry (DInSAR) techniques. Nevertheless, spatial and temporal decorrelation of the interferometric pairs used strongly limits the precision of the results of analysis by these techniques. In this context, we propose a methodological approach for surface deformation detection and analysis using differential interferograms, to show the limits of this technique according to noise quality and level. The detectability model is generated from the deformation signatures by simulating a linear fault merged into image pairs from the ERS1/ERS2 sensors acquired over a region of the Algerian south.
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFICATIONcscpconf
A novel trajectory-guided, concatenative approach for synthesizing high-quality video from real image samples is proposed. The automated lip-reading seeks, within a library of video data, the real image sample sequence closest to the HMM-predicted trajectory. The object trajectory is estimated by projecting the face patterns into a KDA feature space. For speaker face identification, the identity surface of a subject's face is synthesized from a small sample of patterns that sparsely cover the view sphere. A KDA algorithm is used to discriminate the lip-reading images; the dimensionality of the fundamental lip feature vector is then reduced using the 2D-DCT. The dimensionality of the mouth area is further reduced by PCA to obtain the eigen-lips approach proposed in [33]. The subjective performance results of the cost function under the automatic lip-reading model did not illustrate superior performance of the method.
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...cscpconf
Universities offer a software engineering capstone course to simulate a real-world working environment in which students can work in a team for a fixed period to deliver a quality product. The objective of this paper is to report on our experience in moving from a Waterfall process to an Agile process in conducting the software engineering capstone project. We present the capstone course designs for both the Waterfall-driven and Agile-driven methodologies, highlighting the structure, deliverables and assessment plans. To evaluate the improvement, we conducted a survey of two different sections taught by two different instructors to evaluate students' experience in moving from the traditional Waterfall model to an Agile-like process. Twenty-eight students filled in the survey, which consisted of eight multiple-choice questions and an open-ended question to collect feedback from students. The survey results show that students were able to gain hands-on experience that simulates a real-world working environment. The results also show that the Agile approach helped students to reach an overall better design and avoid mistakes they had made in the initial design completed in the first phase of the capstone project. In addition, they were able to assess their team capabilities and training needs, and thus learn the required technologies earlier, which is reflected in the final product quality.
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIEScscpconf
Using social media in education provides learners with an informal way of communicating. Informal communication tends to remove barriers and hence promotes student engagement. This paper presents our experience in using three different social media technologies in teaching a software project management course. We conducted surveys at the end of every semester to evaluate students' satisfaction and engagement. Results show that using social media enhances students' engagement and satisfaction; however, familiarity with the tool is an important factor for student satisfaction.
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICcscpconf
Using a computer to answer questions has been a human dream since the beginning of the digital era. Question-answering systems are referred to as intelligent systems that can provide responses to questions asked by the user, based on certain facts or rules stored in a knowledge base, and can generate answers to questions asked in natural language. One of the first main ideas of fuzzy logic was to work on the problem of computer understanding of natural language. This survey paper therefore provides an overview of what question answering is, its system architecture, and its possible relationship with and differences from fuzzy logic, as well as the previous related research with respect to the approaches that were followed. At the end, the survey provides an analytical discussion of the proposed QA models, alone or combined with fuzzy logic, and their main contributions and limitations.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
Human beings generate different speech waveforms while speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses a dynamic programming technique for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages like English, or to build pronunciation dictionaries for less-known sparse languages. The precision measurement experiments show 88.9% accuracy.
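A DPW-style distance is, at heart, a global alignment over phone symbols. A minimal sketch with illustrative unit costs and length normalization (the paper's cost values may differ):

```python
def dpw_distance(phones_a, phones_b, sub_cost=1.0, gap_cost=1.0):
    """Global alignment (Needleman-Wunsch style) over phone symbols.
    Returns a length-normalized distance between two pronunciations."""
    n, m = len(phones_a), len(phones_b)
    prev = [j * gap_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap_cost]
        for j in range(1, m + 1):
            match = 0.0 if phones_a[i - 1] == phones_b[j - 1] else sub_cost
            cur.append(min(prev[j - 1] + match,     # substitution / match
                           prev[j] + gap_cost,      # gap in phones_b
                           cur[j - 1] + gap_cost))  # gap in phones_a
        prev = cur
    return prev[m] / max(n, m)

# "tomato" in two accents (ARPAbet-like symbols, chosen for illustration)
d = dpw_distance(["T", "AH", "M", "EY", "T", "OW"],
                 ["T", "AH", "M", "AA", "T", "OW"])
```

Aggregating such pairwise distances over many recorded pronunciations is what allows variant entries to be clustered into a pronunciation dictionary.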
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
In education, the use of electronic (E) examination systems is not a novel idea, as E-examination systems have been used to conduct objective assessments for the last few years. This research deals with randomly designed E-examinations and proposes an E-assessment system that can be used for subjective questions. The system assesses answers to subjective questions by finding a matching ratio for the keywords in the instructor's and student's answers; the matching ratio is computed based on semantic and document similarity. The assessment system is composed of four modules: preprocessing, keyword expansion, matching, and grading. A survey and a case study were used in the research design to validate the proposed system. The examination assessment system will help instructors save time, costs, and resources, while increasing efficiency and improving the productivity of exam setting and assessment.
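The matching module's keyword ratio can be sketched as follows; the synonym table stands in for the semantic-similarity stage and is purely illustrative:

```python
def matching_ratio(instructor_answer, student_answer, synonyms=None):
    """Share of instructor keywords found in the student answer,
    after a simple synonym expansion (the keyword-expansion stage)."""
    synonyms = synonyms or {}
    student = set(student_answer.lower().split())
    keywords = instructor_answer.lower().split()
    hits = 0
    for kw in keywords:
        expanded = {kw} | set(synonyms.get(kw, []))   # keyword expansion
        if expanded & student:
            hits += 1
    return hits / len(keywords)

ratio = matching_ratio("stack uses lifo order",
                       "a stack removes the newest element first",
                       synonyms={"lifo": ["last-in-first-out", "newest"]})
```

The grading module would then map this ratio onto the instructor's mark scale; real semantic similarity would come from embeddings or a thesaurus rather than a hand-written synonym table.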
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
African Buffalo Optimization (ABO) is one of the most recent swarm-intelligence-based metaheuristics, inspired by the buffalo's behavior and lifestyle. Unfortunately, the standard ABO algorithm is proposed only for continuous optimization problems. In this paper, the authors propose two discrete binary ABO algorithms to deal with binary optimization problems. The first version (called SBABO) uses the sigmoid function and a probability model to generate binary solutions; the second version (called LBABO) uses logical operators to operate on the binary solutions. Computational results on two knapsack problem (KP and MKP) instances show the effectiveness of the proposed algorithms and their ability to achieve good and promising solutions.
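The SBABO bit-generation step — a continuous buffalo coordinate mapped through the sigmoid and used as a probability — can be sketched as below; the example vector and seed are arbitrary:

```python
import math
import random

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def binarize(continuous_position, rng):
    """SBABO-style step: each continuous coordinate becomes a bit
    with probability sigmoid(value) (the probability model)."""
    return [1 if rng.random() < sigmoid(v) else 0 for v in continuous_position]

rng = random.Random(42)
bits = binarize([4.0, -4.0, 0.0], rng)
```

Large positive coordinates thus become 1 with high probability and large negative ones 0, which is what lets the continuous ABO update rules drive a binary knapsack solution.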
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
In recent years, many malware writers have relied on Dynamic Domain Name Services (DDNS) to maintain their Command and Control (C&C) network infrastructure and ensure a persistent presence on a compromised host. Amongst the various DDNS techniques, the Domain Generation Algorithm (DGA) is often perceived as the most difficult to detect using traditional methods. This paper presents an approach for detecting DGA using frequency analysis of the character distribution and weighted scores of the domain names. The approach's feasibility is demonstrated using a range of legitimate domains and a number of malicious algorithmically generated domain names. Findings from this study show that domain names made up of English characters "a-z" achieving a weighted score of < 45 are often associated with DGA. When a weighted score of < 45 is applied to the Alexa one million list of domain names, only 15% of the domain names were treated as non-human generated.
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
The amount of piracy in streaming digital content in general, and in the music industry in particular, poses a real challenge to digital content owners. This paper presents a DRM solution for monetizing, tracking and controlling online streaming content across platforms for IP-enabled devices. The paper benefits from current advances in blockchain and cryptocurrencies. Specifically, it presents a Global Music Asset Assurance (GoMAA) digital currency and the iMediaStreams blockchain to enable the secure dissemination and tracking of the streamed content. The proposed solution gives the data owner the ability to control the flow of information even after it has been released, by creating a secure, self-installed, cross-platform reader located in the digital content file header. The proposed system gives content owners options to manage their digital information (audio, video, speech, etc.), including tracking of the most consumed segments, once it is released. The system benefits from token distribution between the content owner (music bands), the content distributor (online radio stations) and the content consumer (fans) on the system blockchain.
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
This paper discusses the importance of verb suffix mapping in Discourse translation system. In
discourse translation, the crucial step is Anaphora resolution and generation. In Anaphora
resolution, cohesion links like pronouns are identified between portions of text. These binders
make the text cohesive by referring to nouns appearing in the previous sentences or nouns
appearing in sentences after them. In Machine Translation systems, to convert the source
language sentences into meaningful target language sentences, the verb suffixes should be
changed as per the cohesion links identified. This step of the translation process is emphasized
in the present paper. Specifically, the discussion is on how the verbs change according to the
subjects and anaphors. To explain the concept, English is used as the source language (SL)
and the Indian language Telugu as the target language (TL).
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
In this paper, based on the definition of the conformable fractional derivative, the functional
variable method (FVM) is proposed to seek exact traveling wave solutions of two
higher-dimensional space-time fractional KdV-type equations in mathematical physics, namely
the (3+1)-dimensional space-time fractional Zakharov-Kuznetsov (ZK) equation and the
(2+1)-dimensional space-time fractional Generalized Zakharov-Kuznetsov-Benjamin-Bona-Mahony
(GZK-BBM) equation. Some new solutions are procured and depicted. These solutions, which
contain kink-shaped, singular kink, bell-shaped soliton, singular soliton and periodic wave
solutions, have many potential applications in mathematical physics and engineering. The
simplicity and reliability of the proposed method are verified.
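For reference, the conformable fractional derivative on which the method rests, and the standard traveling-wave substitution that reduces such fractional PDEs to ODEs, can be stated as follows (a textbook summary, not the paper's derivation):

```latex
% Conformable fractional derivative of order \alpha \in (0,1]:
D_t^{\alpha} f(t) \;=\; \lim_{\varepsilon \to 0}
  \frac{f\!\left(t + \varepsilon\, t^{\,1-\alpha}\right) - f(t)}{\varepsilon},
  \qquad t > 0 .
% For differentiable f this reduces to
D_t^{\alpha} f(t) \;=\; t^{\,1-\alpha}\,\frac{df}{dt},
% so the traveling-wave ansatz u(x,t) = U(\xi) with
\xi \;=\; k\,\frac{x^{\alpha}}{\alpha} \;-\; c\,\frac{t^{\alpha}}{\alpha}
% converts a space-time fractional PDE into an ordinary differential
% equation in U(\xi), to which the functional variable method applies.
```

The wave-number $k$ and speed $c$ here are generic symbols of the ansatz; the specific reductions for the ZK and GZK-BBM equations are given in the paper itself.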
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
The use of information technology resources is rapidly increasing in organizations,
businesses, and even governments, which has given rise to various attacks and vulnerabilities
in the field. All of this makes it a must to run a penetration test (PT) on the environment
frequently, to see what an attacker could gain and what the environment's current
vulnerabilities are. This paper reviews some of the automated penetration testing techniques
and presents their enhancement over the traditional manual approaches. To the best of our
knowledge, it is the first study that takes into consideration the concept of penetration testing
together with the standards in the area. The research tackles the comparison between manual
and automated penetration testing and the main tools used in penetration testing, and
additionally compares some methodologies used to build an automated penetration testing platform.
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
Since the mid-1990s, functional connectivity study using fMRI (fcMRI) has drawn increasing
attention from neuroscientists and computer scientists, since it opens a new window to explore
the functional network of the human brain with relatively high resolution. The BOLD technique
provides an almost accurate state of the brain. Past research proves that neurological diseases
damage brain network interactions, protein-protein interactions and gene-gene interactions,
and a number of neurological research papers also analyse the relationships among the damaged
parts. Computational methods, especially machine learning techniques, can produce such
classifications. In this paper we used the OASIS fMRI dataset of patients affected by
Alzheimer's disease together with a normal patients' dataset. After properly processing the
fMRI data, we use the processed data to build classifier models using SVM (Support Vector
Machine), KNN (K-nearest neighbour) and Naïve Bayes. We also compare the accuracy of our
proposed method with existing methods. In future work, we will try other combinations of
methods for better accuracy.
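Of the three classifiers compared, KNN is the simplest to sketch; the two-dimensional toy features and labels below are invented for illustration and stand in for processed fMRI feature vectors:

```python
from collections import Counter

def knn_classify(query, examples, k=3):
    """k-nearest-neighbour vote over feature vectors."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    neighbours = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D features derived from connectivity measures.
examples = [([0.10, 0.20], "normal"), ([0.20, 0.10], "normal"),
            ([0.90, 0.80], "alzheimer"), ([0.80, 0.90], "alzheimer"),
            ([0.85, 0.85], "alzheimer")]
```

In the study, each example would instead be a high-dimensional vector extracted from a subject's preprocessed fMRI scan, with SVM and Naïve Bayes trained on the same vectors for comparison.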
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
In order to treat and analyze real datasets, fuzzy association rules have been proposed, and
several algorithms have been introduced to extract these rules. However, these algorithms
suffer from problems of utility, redundancy and the large number of extracted fuzzy
association rules; the expert is then confronted with this huge amount of rules, and the task
of validation becomes tedious. In order to solve these problems, we propose a new validation
method based on three steps: (i) we extract a generic base of non-redundant fuzzy association
rules by applying the EFAR-PN algorithm, based on fuzzy formal concept analysis; (ii) we
categorize the extracted rules into groups; and (iii) we evaluate the relevance of these rules
using a structural equation model.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples of one class are
overrepresented. Traditional classifiers result in poor accuracy on the minority class due to
the class imbalance. Further, the presence of within-class imbalance, where classes are
composed of multiple sub-concepts with different numbers of examples, also affects the
performance of a classifier. In this paper, we propose an oversampling technique that handles
between-class and within-class imbalance simultaneously and also takes into consideration the
generalization ability in data space. The proposed method is based on two steps: performing
model-based clustering with respect to the classes to identify the sub-concepts, and then
computing the separating hyperplane based on equal posterior probability between the classes.
The proposed method is tested on 10 publicly available data sets and the results show that it
is statistically superior to other existing oversampling methods.
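For one-dimensional features with equal-variance Gaussians and equal priors, the equal-posterior boundary reduces to the midpoint of the class means; the jitter-based resampling below is a simplification of the paper's cluster-based scheme, with toy data invented for illustration:

```python
import random
import statistics

def equal_posterior_boundary(minority, majority):
    """1-D point where the two class posteriors are equal, assuming
    equal-variance Gaussians and equal priors: midpoint of the means."""
    return (statistics.mean(minority) + statistics.mean(majority)) / 2

def oversample_minority(minority, majority, seed=0):
    """Jitter-resample minority points, keeping synthetic examples on
    the minority side of the equal-posterior boundary."""
    rng = random.Random(seed)
    boundary = equal_posterior_boundary(minority, majority)
    minority_side = statistics.mean(minority) < boundary
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        x = rng.choice(minority) + rng.gauss(0, 0.1)
        if (x < boundary) == minority_side:   # stay on the minority side
            synthetic.append(x)
    return synthetic

minority = [0.9, 1.0, 1.1]
majority = [2.8, 3.0, 3.1, 3.2, 2.9, 3.0]
new_points = oversample_minority(minority, majority)
```

Constraining synthetic points to the minority side of the boundary is what preserves generalization: the oversampler never manufactures examples that a Bayes-optimal rule would assign to the majority class.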
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
Data collection is an essential, but manpower-intensive, procedure in ecological research. An
algorithm was developed by the author which incorporates two important computer vision
techniques to automate data cataloging for butterfly measurements: Optical Character
Recognition is used for character recognition and contour detection is used for image
processing. Proper pre-processing is first done on the images to improve accuracy. Although
there are limitations to Tesseract's detection of certain fonts, overall it can successfully
identify words in basic fonts. Contour detection is an advanced technique that can be utilized
to measure an image; shapes and mathematical calculations are crucial in determining the
precise locations of the points between which to draw the body and forewing lines of the
butterfly. Overall, 92% accuracy was achieved by the program for the set of butterflies measured.
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
Smart cities utilize Internet of Things (IoT) devices and sensors to enhance the quality of
city services, including energy, transportation, health, and much more, and they generate
massive volumes of structured and unstructured data on a daily basis. Social networks, such as
Twitter, Facebook, and Google+, are also becoming a new source of real-time information in
smart cities, with social network users acting as social sensors. These datasets are so large
and complex that they are difficult to manage with conventional data management tools and
methods. To become valuable, this massive amount of data, known as 'big data', needs to be
processed and comprehended to hold the promise of supporting a broad range of urban and smart
city functions, including, among others, transportation, water and energy consumption,
pollution surveillance, and smart city governance. In this work, we investigate how social
media analytics help to analyze smart city data collected from various social media sources,
such as Twitter and Facebook, to detect various events taking place in a smart city and to
identify the importance of events and the concerns of citizens regarding some events. A case
scenario analyses the opinions of users concerning traffic in the three largest cities in the UAE.
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
The anonymity of social networks makes them attractive to hate speakers, who mask their
criminal activities online, posing a challenge to the world and in particular to Ethiopia.
With this ever-increasing volume of social media data, hate speech identification becomes a
challenge in aggravating conflict between citizens of nations. Because of the high rate of
production, it has become difficult to collect, store and analyze such big data using
traditional detection methods. This paper proposes the application of Apache Spark in hate
speech detection to reduce these challenges. The authors developed an Apache Spark based model
to classify Amharic Facebook posts and comments into hate and not hate, employing Random
Forest and Naïve Bayes for learning and Word2Vec and TF-IDF for feature selection. Tested by
10-fold cross-validation, the model based on Word2Vec embeddings performed best, with 79.83%
accuracy. The proposed method achieves a promising result with the unique features of Spark
for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
This article presents Part of Speech tagging for Nepali text using General Regression Neural
Network (GRNN). The corpus is divided into two parts viz. training and testing. The network is
trained and validated on both training and testing data. It is observed that 96.13% words are
correctly being tagged on training set whereas 74.38% words are tagged correctly on testing
data set using GRNN. The result is compared with the traditional Viterbi algorithm based on
Hidden Markov Model. Viterbi algorithm yields 97.2% and 40% classification accuracies on
training and testing data sets respectively. GRNN based POS Tagger is more consistent than the
traditional Viterbi decoding technique.
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Synthetic Fiber Construction in lab .pptxPavel ( NSTU)
Synthetic fiber production is a fascinating and complex field that blends chemistry, engineering, and environmental science. By understanding these aspects, students can gain a comprehensive view of synthetic fiber production, its impact on society and the environment, and the potential for future innovations. Synthetic fibers play a crucial role in modern society, impacting various aspects of daily life, industry, and the environment. ynthetic fibers are integral to modern life, offering a range of benefits from cost-effectiveness and versatility to innovative applications and performance characteristics. While they pose environmental challenges, ongoing research and development aim to create more sustainable and eco-friendly alternatives. Understanding the importance of synthetic fibers helps in appreciating their role in the economy, industry, and daily life, while also emphasizing the need for sustainable practices and innovation.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
2. 244 Computer Science & Information Technology (CS & IT)
predefined values. Despite the use of neural networks, genetic algorithms, Fuzzy ARTMAP, Rough Set based classifiers and similar techniques, no researcher has achieved a 100% accuracy level to date. This paper presents a comprehensive study of the ongoing research on protein sequence classification, followed by a comparative analysis.
The rest of the paper is organized as follows: Section 2 presents a review of classification models; Section 3 describes the proposed work and Section 4 its implementation. Finally, Section 5 presents the result discussion and Section 6 the conclusion.
2. CURRENT STATE OF THE ART
The main aim of protein sequence classification is to assign an unknown protein sequence to its particular class, subclass or family. Several techniques are already in use in this area. All of them follow the same policy: extract some features, match the values of these features, and finally classify the protein sequence. This paper focuses mainly on three types of classification techniques: (i) the Fuzzy ARTMAP model, (ii) the neural network model and (iii) the Rough Set classifier.
2.1. Fuzzy ARTMAP Model
The Fuzzy ARTMAP model, a machine learning method, is generally used to classify protein sequences. The basic difference between the neural network model and the fuzzy model is that the neural network model does not analyse the data individually; it only provides knowledge-based information. The fuzzy model, on the other hand, calculates the membership value of every feature using membership functions and propagates it through the whole model.
This model [5] was implemented to classify an unknown protein sequence into predefined protein families or subfamilies with a high accuracy of 93%. A cleaning process was also conducted on the databases. Different features were then extracted from the protein sequences, e.g. their physico-chemical properties. The authors calculated the molecular weight (W) and the isoelectric point (pI) of the protein sequences, followed by the amino acid composition of the sequences. The hydropathy composition (C), the hydropathy distribution (D) and the hydropathy transmission (T) were also calculated. After all forty features had been extracted, an unknown protein sequence was used as the input of the Fuzzy ARTMAP model. Predefined features extracted or generated from known protein sequences served as the basis of the classification rules. The model produced as output the name of the family or subfamily of the unknown protein sequence it had received as input.
In [6] the authors classify protein sequences using a fuzzy model. Calculating the membership value via the membership function is the central step in a fuzzy model. First, features are extracted using the 6-letter exchange group method. Membership values are then assigned and the pattern matrix is constructed. Using a fuzzy rule, the pattern matrix is partitioned into three small groups: (i) small, (ii) medium and (iii) large. A group is then chosen according to the target, and further partitioning is performed until the goal is reached. Finally, the model is tested on the UniProt 11.0 dataset, which contains the globin, kinase, ras and trypsin protein superfamilies. In this paper the number of antecedent variables is huge. It is true that increasing the number of antecedent variables increases the classification accuracy, but it also increases the CPU time.
[8] proposes an advancement of the technique proposed in [6]. It tries to decrease the CPU time without changing the classification accuracy. Features are again extracted using the 2-gram encoding method and the 6-letter exchange group method, and according to the membership values of the features the pattern matrix is partitioned into three small groups. Before the partitioning, however, a new algorithm is applied to the feature values: a feature ranking algorithm assigns a rank to each feature value, and the features are arranged in descending order of rank. The top-ranked features are then collected to construct the pattern matrix. In this way the technique reduces the CPU overhead. It is a simple, easily understandable and alignment-independent method, so any biologist can understand it and feel free to implement it. Finally the method is evaluated and compared to a non-fuzzy technique (C4.5). The computational complexity is reduced, but the accuracy remains the same as in the earlier method [6].
Fuzzy modelling helps in data analysis, although its storage and time requirements are high. The construction of fuzzy sets at every iteration adds to the computational complexity as well. The model also fails to capture the physical relationships that are most important for this purpose.
2.2. Neural Network Model
Several types of approaches are available for classification, such as decision trees and neural networks. The extracted features of protein sequences are distributed in a high-dimensional space with complex characteristics, which is difficult to model with parameterized approaches. Neural-network-based classifiers have therefore been chosen to classify protein sequences. Decision-tree-based techniques fail to classify patterns with continuous features, especially as the number of attributes grows.
A neural network model [2] has been used to classify unknown protein sequences by extracting features from them to serve as the model's input. The 2-gram encoding method and the 6-letter exchange group method were used to capture global similarity. For local similarity, the user-defined variables Len, Mut and Occur were used. The minimum description length (MDL) principle was also used to calculate the significance of motifs. Predefined values of these features were used in the intermediate (hidden) layers of the neural network. This model produces 90% to 92% accuracy.
In [1] the authors classify protein sequences using a neural network model. The n-gram encoding method (n = 2, 3, …, N, where N is the length of the input sequence) is used to extract the features from which the pattern matrix is constructed. The neural network then matches the new pattern against the predefined patterns of the protein superfamilies or families. Since the n-gram encoding method subsumes the 2-gram, 3-gram, etc. encodings, each of them is needed individually to form the pattern matrix, so for large sequences the computational overhead increases accordingly. The accuracy level remains only 90%.
[7] proposes an advancement of the technique in [1]. First the 2-gram encoding method is applied, and the pattern matrix is built from this result alone. If this matrix is unable to classify the input protein sequence, the result of the 3-gram encoding method is added to the pattern matrix. The result is then matched using the neural network. The performance of this technique depends largely on the number of encoding operations performed; if all the sub-patterns have to be checked, performance deteriorates sharply. The average performance is slightly better than [1].
In [9] the authors used a probabilistic neural network model based on a self-organizing map (SOM) network. SOM networks can discover relationships within a set of protein sequences by clustering them into different groups. Different features, such as the amino acid distribution and the 2-gram amino acid distribution, were extracted from the input protein sequence to construct the pattern matrix. Following the unsupervised learning method of neural networks, the input sequences are placed in the first layer of the network; the pattern matrix is presented in the hidden (second) layer for matching against predefined values; the different outcomes are summarized in the third layer; and the fourth (final) layer of the probabilistic neural network produces the classification result. The technique failed to produce impressive results for the unclassifiedp and unclassifiedn parameters. The use of a SOM network also hinders the interpretation of the results.
The main limitations of SOM networks for protein sequence classification are the interpretability of the results and model selection. SOM is a straightforward feed-forward method with no back-propagation, but back-propagation is a key technique for reaching a particular goal and increasing classification accuracy, since a back-propagation-based model can revisit previous steps.
The problems faced by the SOM-based technique in [9] are overcome by the back-propagation neural network (BPNN) technique in [4]. Here the authors use an extreme learning machine to classify protein sequences; it incorporates an advancement of the back-propagation technique of the neural network model. To evaluate its performance, the authors extracted features such as the 2-gram encoding and the 6-letter exchange group encoding from the input protein sequence. A pattern matrix was formed from these features and fed to the extreme learning machine, and the accuracy level was measured.
Neural networks are good at handling non-linear (noisy) data, but protein sequences are linear: the working data in these papers are sequences over 20 different amino acids, so neural network modelling adds little extra benefit. Moreover, despite using a neural network, the papers do not handle noise in the protein sequences, and the models fail to capture the physical relationships that are most important for this purpose. Regarding accuracy, the neural network model provides only 90%-92%, which greatly needs improvement.
2.3. Rough Set Classifier
Machine learning methods such as the neural network model and the Fuzzy ARTMAP model are generally unable to handle the large number of unnecessary features extracted for rule discovery [11]. They therefore try to select features to reduce the computation time, but this also degrades their performance: the accuracy level is insufficient because every feature is equally important for proper classification. The rough set classifier is a newer model designed to overcome this problem. Rough set theory is a machine learning method introduced by Pawlak [3] in the early 1980s. It applies the concepts of set theory to decision making. The central notion of the rough set model is the indiscernibility relation, which induces minimal decision rules from training examples. If-else rules over the decision table are used to identify the minimal set of features.
This new classification model [10] can classify voluminous protein data based on the structural and functional properties of proteins, and it is a faster, more accurate and more efficient classification tool than the others: the Rough Set Protein Classifier provides 97.7% accuracy. It is a hybridized tool comprising Sequence Arithmetic, Rough Set Theory and Concept Lattice, and it reduces the domain search space to 9% without losing classification power. An innovative technique, Sequence Arithmetic (SA), is proposed to identify family information and use it to reduce the domain search space; the generated rules are stored in a Sequence Arithmetic database. A new approach computes the predominant attributes (approximate reducts) and uses them to construct a decision tree called the Reduct-based Decision Tree (RDT).
Decision rules generated from the RDT are stored in the RDT Rules Database (RDTRD) and are used to obtain class information. The infirmity of the RDT is overcome by extracting spatial information by means of Neighbourhood Analysis (NA). The spatial information is converted into binary information using a threshold and utilized to construct a Concept Lattice (CL). The association rules from the CL are stored in the Concept Lattice Association Rule Database (CARD), and these rules further confine the domain search space to a set of sequences within a class. The time complexity of this model is O(n) + O(f) + O(log C) + O(2^r), where n is the size of the unknown sequence y, f is the number of families in the database, C is the number of classes in a given family and r is the number of proteins in a class, so the CL has 2^r nodes.
In [11] the authors use a rough set classifier to extract all the features necessary for classification. The feature set was built from the compositional percentages of the 20 amino acid properties, and the Rosetta system was used for data mining and knowledge discovery. In the first phase, a method was applied to the whole dataset, including all the subfamilies and ignoring the small size of some sequences. The rough set model generally uses standard genetic algorithms. The rough set was then applied to classify the data and evaluate the performance on the seven subfamilies. The paper achieves a satisfactory accuracy level without increasing the computational time.
The Rough Set classifier model provides knowledge-based information only, without any analysis of the data; for properly classifying protein sequences, both play an important role. Instead of classifying a protein sequence into classes or subclasses, this model extracts a small known sequence from a long unknown protein sequence, so extra time and space are required to further classify the output sequence into classes or subclasses. The accuracy level is 97.7%, which leaves scope for improvement.
3. PROPOSED MODEL
Figure 1. Pictorial representation of the different modules of the proposed technique
On the basis of the preceding review, the aim of this model is to classify unknown protein sequences into families, classes or subclasses with a high accuracy level and low computational time. The proposed technique consists of three phases: the first phase reduces the input dataset, the second phase increases the accuracy level of the classification, and the third phase applies association rules to classify the protein sequence. Figure 1 gives a pictorial representation of the different modules of the proposed technique.
3.1. Phase 1
In the first phase, the 2-gram encoding method and the 6-letter exchange group method are both used to extract the global similarity of the protein sequence [2]; these two methods are directly related to the structure of a protein sequence. The resulting pattern matrix acts as the input of the neural network model. If the back-propagation technique of the neural network [4] is then used on the reduced dataset to extract knowledge-based information, a more accurate result is generated, so the accuracy level increases in this phase.
The 2-gram encoding method used here is described in [1]. It extracts and counts the occurrences of patterns of two consecutive amino acids in a protein sequence. Take the small protein sequence MARETFAR as an example: applying the 2-gram encoding method yields 1 for MA (MA occurs once in the whole sequence), 2 for AR (AR occurs twice), and 1 each for RE, ET, TF and FA. The value of every pair is then calculated with the formula x = c / (len(S) − 1), where x is the value, c is the number of occurrences of the pair and len(S) is the total length of the input protein sequence, i.e. 0.2857 for AR and 0.1429 for the others. Finally, the mean value (m) = 0.1667 and the standard deviation (d) = 0.448 are calculated using formulas (1) and (2).
m = ( Σ_{i=1}^{N} x_i ) / N .............................. (1)

d = sqrt( ( Σ_{i=1}^{N} (x_i − m)^2 ) / (N − 1) ) ......... (2)
In this phase the 6-letter exchange group {e1, e2, e3, e4, e5, e6} is also adopted to represent a protein sequence, where e1 ∈ {H, R, K}, e2 ∈ {D, E, N, Q}, e3 ∈ {C}, e4 ∈ {S, T, P, A, G}, e5 ∈ {M, I, L, V} and e6 ∈ {F, Y, W}. For example, the protein sequence MARETFAR above is represented as e5e4e1e2e4e6e4e1. The 6-letter exchange group method then yields 1 for e5e4 (e5e4 occurs once in the whole sequence), 2 for e4e1 (e4e1 occurs twice), and 1 each for e1e2, e2e4, e4e6 and e6e4. The value of every pair is again calculated with the formula x = c / (len(S) − 1), i.e. 0.2857 for e4e1 and 0.1429 for the others, and finally the mean value (m) = 0.1667 and the standard deviation (d) = 0.448 are calculated using the formulas described above.
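The two encodings above can be sketched in code. The following Python is a minimal illustration on the running example, assuming the counting and normalisation exactly as described; the function and variable names are our own, not taken from the paper's tool:

```python
from statistics import mean, stdev

# 6-letter exchange groups as defined above
EXCHANGE = {}
for group, letters in [("e1", "HRK"), ("e2", "DENQ"), ("e3", "C"),
                       ("e4", "STPAG"), ("e5", "MILV"), ("e6", "FYW")]:
    for aa in letters:
        EXCHANGE[aa] = group

def two_gram_values(symbols):
    """Count each distinct 2-gram and normalise with x = c / (len(S) - 1)."""
    counts = {}
    for a, b in zip(symbols, symbols[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    n = len(symbols) - 1
    return {pair: c / n for pair, c in counts.items()}

seq = "MARETFAR"
amino_values = two_gram_values(seq)                         # 2-gram amino acid encoding
group_values = two_gram_values([EXCHANGE[a] for a in seq])  # 2-gram exchange group encoding
m = mean(amino_values.values())    # mean feature value, formula (1): 0.1667 for this example
d = stdev(amino_values.values())   # sample standard deviation, formula (2)
```

For MARETFAR this reproduces the pair values quoted in the text (0.2857 for AR and e4e1, 0.1429 for the rest) and the mean 0.1667.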
For each protein sequence, both the 2-gram amino acid encoding and the 2-gram exchange group encoding are applied, so there are 20 × 20 + 6 × 6 = 436 possible 2-gram patterns in total, all of which are candidate input features for the neural network.
After the first phase has completed, the unknown input protein sequences may already be classified into particular families; otherwise the phase at least reduces the trained dataset. If phase 1 is able to classify the input unknown protein sequence into a family, there is no need to continue and the process stops here; otherwise the process continues with the second phase.
3.2. Phase 2
In the second phase, global features are extracted from the input protein sequence. These features represent the nature of the entire protein sequence, and the global similarity between related sequences allows for comparison. To extract the global features, the physico-chemical properties of the input sequences, i.e. the molecular weight (W) and the isoelectric point (pI), are first calculated. In addition, the hydropathy composition and hydropathy distribution are calculated using Table 1.
Table 1: Chart of Molecular weight, isoelectric point, Hydropathy Property of 20 different Amino Acids
As an example, consider again the small protein sequence MARETFAR. According to Table 1 the molecular weight of M is 149.21, A is 89.10, R is 174.20, E is 147.13, T is 119.12 and F is 165.19, so the total molecular weight of the sequence is (149.21 + 89.10 + 174.20 + 147.13 + 119.12 + 165.19 + 89.10 + 174.20) = 1107.25 and the average molecular weight is 138.41. The isoelectric point is calculated in the same way: the isoelectric point of M is 5.74, A is 6.00, R is 10.76, E is 3.22, T is 5.60 and F is 5.48, so the total isoelectric point of the sequence is (5.74 + 6.00 + 10.76 + 3.22 + 5.60 + 5.48 + 6.00 + 10.76) = 53.56 and the average isoelectric point is 6.69. Now consider the hydropathy property of the input sequence: according to the table, M is hydrophobic, A is hydrophobic, R is hydrophilic, E is hydrophilic, T is neutral and F is hydrophobic. Table 2 and Table 3 show the hydropathy composition and hydropathy distribution of the input protein sequence.
Name            1-letter Symbol   Molecular weight (W)   Isoelectric point (pI)   Hydropathy Property
Alanine         A                 89.10                  6.00                     Hydrophobic
Arginine        R                 174.20                 10.76                    Hydrophilic
Asparagine      N                 132.12                 5.41                     Hydrophilic
Aspartic Acid   D                 133.11                 2.77                     Hydrophilic
Cysteine        C                 121.16                 5.07                     Hydrophobic
Glutamic Acid   E                 147.13                 3.22                     Hydrophilic
Glutamine       Q                 146.15                 5.65                     Neutral
Glycine         G                 75.07                  5.97                     Neutral
Histidine       H                 155.16                 7.59                     Neutral
Isoleucine      I                 131.18                 6.02                     Hydrophobic
Leucine         L                 131.18                 5.98                     Hydrophobic
Lysine          K                 146.19                 9.74                     Hydrophilic
Methionine      M                 149.21                 5.74                     Hydrophobic
Phenylalanine   F                 165.19                 5.48                     Hydrophobic
Proline         P                 115.13                 6.30                     Hydrophilic
Serine          S                 105.09                 5.68                     Neutral
Threonine       T                 119.12                 5.60                     Neutral
Tryptophan      W                 204.23                 5.89                     Hydrophobic
Tyrosine        Y                 181.19                 5.66                     Hydrophobic
Valine          V                 117.15                 5.96                     Hydrophobic
8. 250 Computer Science & Information Technology (CS & IT)
Table 2. Result of Hydropathy Composition of Input Protein Sequence
Hydropathy Composition Occurrences Percentage of Occurrences
Hydrophobic 4 50.00%
Hydrophilic 3 37.50%
Neutral 1 12.50%
Table 3. Result of Hydropathy Distribution of Input Protein Sequence
Hydropathy Distribution     Occurrences   Percentage of Occurrences
Hydrophobic – Hydrophobic   2             28.57 %
Hydrophobic – Hydrophilic   2             28.57 %
Hydrophobic – Neutral       0             00.00 %
Hydrophilic – Hydrophobic   0             00.00 %
Hydrophilic – Hydrophilic   1             14.29 %
Hydrophilic – Neutral       1             14.29 %
Neutral – Hydrophobic       1             14.29 %
Neutral – Hydrophilic       0             00.00 %
Neutral – Neutral           0             00.00 %
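The Phase 2 feature extraction can be sketched as follows. For brevity the lookup table covers only the six residues of the running example (values copied from Table 1); the names are illustrative, not taken from the paper's tool, and a full implementation would include all 20 amino acids:

```python
# (molecular weight, isoelectric point, hydropathy class) per residue, from Table 1.
PROPS = {
    "M": (149.21, 5.74, "phobic"), "A": (89.10, 6.00, "phobic"),
    "R": (174.20, 10.76, "philic"), "E": (147.13, 3.22, "philic"),
    "T": (119.12, 5.60, "neutral"), "F": (165.19, 5.48, "phobic"),
}

def global_features(seq):
    """Average molecular weight and pI, plus hydropathy composition/distribution."""
    weights = [PROPS[a][0] for a in seq]
    points = [PROPS[a][1] for a in seq]
    hydro = [PROPS[a][2] for a in seq]
    # hydropathy composition: how often each class occurs (Table 2)
    composition = {c: hydro.count(c) for c in ("phobic", "philic", "neutral")}
    # hydropathy distribution: how often each ordered pair of classes occurs (Table 3)
    distribution = {}
    for a, b in zip(hydro, hydro[1:]):
        distribution[(a, b)] = distribution.get((a, b), 0) + 1
    return sum(weights) / len(seq), sum(points) / len(seq), composition, distribution

avg_w, avg_pi, comp, dist = global_features("MARETFAR")
```

On MARETFAR this reproduces the averages 138.41 and 6.69 and the counts in Tables 2 and 3.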
After those features have been extracted, a feature ranking algorithm is applied. It assigns a rank to the feature values and arranges them in descending order [8]; a top rank indicates extra ability to discriminate between protein classes. The pattern matrix is then generated, partitioned into three groups [6] and fed to the fuzzy model. The features used here mainly concern the molecular structure of the amino acids, so it is possible to eliminate a large number of protein families to which the input protein sequence does not belong. In this phase the data are analysed individually instead of only extracting knowledge-based information, the main advantage being a higher accuracy level.
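The ranking step can be sketched as below. Note that the paper does not specify the scoring criterion used by the ranking algorithm of [8], so the scores here are purely hypothetical placeholders; only the sort-descending-and-keep-top-k structure is taken from the text:

```python
def top_ranked(features, scores, k):
    """Sort feature names by their (assumed) discriminative score, descending,
    and keep only the top k for building the pattern matrix."""
    ranked = sorted(features, key=lambda f: scores[f], reverse=True)
    return ranked[:k]

# hypothetical scores for the Phase 2 features, for illustration only
scores = {"avg_weight": 0.9, "avg_pI": 0.7, "hydro_comp": 0.8, "hydro_dist": 0.4}
best = top_ranked(list(scores), scores, k=2)
```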
After completion of the second phase, the unknown input protein sequences may be classified into particular families; otherwise the phase again reduces the trained dataset. If phase 2 is able to classify the input unknown protein sequence into a family, there is no need to continue and the process stops here; otherwise the process continues with the third phase.
3.3. Phase 3
In the third phase, Neighbourhood Analysis (NA) is used to classify the input sequence into its particular class or family. To apply neighbourhood analysis, association rules are generally used. Such rules can extract the particular associations between the protein sequence and the classes, subclasses and families, so it is possible to eliminate all other classes, subclasses and families of proteins to which the input protein sequence does not belong.
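Putting the three phases together, the cascade can be sketched as follows. Here phase_one, phase_two and neighbourhood_analysis are hypothetical stand-ins for the modules described above, each assumed to return the set of candidate families that survive that phase:

```python
def classify(sequence, phase_one, phase_two, neighbourhood_analysis):
    """Run the three-phase cascade: stop as soon as a single family remains."""
    candidates = phase_one(sequence)                  # 2-gram + exchange group features
    if len(candidates) == 1:
        return candidates[0]
    candidates = phase_two(sequence, candidates)      # physico-chemical features
    if len(candidates) == 1:
        return candidates[0]
    return neighbourhood_analysis(sequence, candidates)  # association rules

# toy stand-ins to show the control flow only
family = classify("MARETFAR",
                  phase_one=lambda s: ["globin", "kinase"],
                  phase_two=lambda s, c: ["globin"],
                  neighbourhood_analysis=lambda s, c: c[0])
```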
4. IMPLEMENTATION
The three phases of the proposed model have been implemented using an own-designed tool, with the goal of demonstrating a higher accuracy level and lower computational time than other models. The proposed model is tested with various input protein sequences. The execution procedure is described next.
4.1. Execution Procedure
According to the description of the proposed model, execution proceeds in three different phases. First the user supplies an unknown protein sequence and submits it for noise checking. If the input sequence is noise-free, only then do two doors open to continue the process: 1) execute phase 1, or 2) insert data into the data warehouse; otherwise execution stops. During the noise checking period some other back-end modules are also executed. The system creates a database named knowledge with the help of the data warehouse. The data warehouse stores the value of every feature extracted in the two phases against the protein family, while the knowledge database stores the range of the different feature values with respect to each protein family. The overall process uses this knowledge table instead of the data warehouse to continue further processing.

Figure 2. A flow diagram to execute the proposed classifier
Another method is incorporated to reduce the computational overhead. A single-row, single-column table named countrow stores the size of the data warehouse. Each time, the system counts the size of the data warehouse and matches it against the value in countrow; only if the value differs is the knowledge table recreated, otherwise execution proceeds with the previous one.
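The countrow check can be sketched with an in-memory SQLite database standing in for the data warehouse. The table and column names, and the min/max aggregation used to fill the knowledge table, are illustrative assumptions rather than details from the tool:

```python
import sqlite3

def refresh_knowledge_if_needed(con):
    """Rebuild the knowledge table only when the warehouse row count has changed."""
    cur = con.cursor()
    (current,) = cur.execute("SELECT COUNT(*) FROM warehouse").fetchone()
    (stored,) = cur.execute("SELECT n FROM countrow").fetchone()
    if current != stored:
        cur.execute("DELETE FROM knowledge")
        # illustrative aggregation: store the feature-value range per family
        cur.execute("""INSERT INTO knowledge
                       SELECT family, MIN(value), MAX(value)
                       FROM warehouse GROUP BY family""")
        cur.execute("UPDATE countrow SET n = ?", (current,))
        con.commit()
        return True   # knowledge table was rebuilt
    return False      # previous knowledge table reused

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE warehouse (family TEXT, value REAL);
CREATE TABLE knowledge (family TEXT, lo REAL, hi REAL);
CREATE TABLE countrow (n INTEGER);
INSERT INTO countrow VALUES (0);
INSERT INTO warehouse VALUES ('globin', 0.14), ('globin', 0.29);
""")
rebuilt = refresh_knowledge_if_needed(con)
```

A second call with an unchanged warehouse skips the rebuild, which is the saving the countrow table is meant to provide.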
Execution of phase 1 then begins: the 2-gram encoding and the 6-letter exchange group features are calculated. If the system can identify a single family for the input, the process stops and the family name is placed in the final output box. The second phase is executed if and only if the size of the phase 1 result set is greater than one; otherwise phase 3 is executed for prediction.
In the second phase the average molecular weight, average isoelectric point, and the hydropathy composition and distribution are extracted and matched against the data in the knowledge table. If the system can identify a single family for the input, the process stops and the family name is placed in the final output box. In the third phase, neighbourhood analysis is performed for prediction. A detailed flow diagram is given in Figure 2.
Another module is attached that can insert data into the data warehouse. After the noise checking procedure completes, phases 1 and 2 are executed one after another to provide the family name. Finally, on clicking the insert-data button, the system stores all the information in the data warehouse.
5. RESULT DESCRIPTION
A series of experiments was carried out to evaluate the performance of the proposed classifier, named "Protein Sequence Classifier", on a Pentium IV PC running the Windows XP operating system. The data used to construct the data warehouse for testing the tool were obtained from NCBI and SCOP: the protein sequences themselves were taken from NCBI, and the corresponding family name of each sequence was taken from SCOP. The data warehouse was then built with the insert feature of the classifier tool.
Five positive datasets of protein families were considered as the test set: 1) FaeA-LIKE, 2) Marine Metagenome Family WH1, 3) MiaE-LIKE, 4) PRP4-LIKE and 5) SOCS_BOX-LIKE. For each family, 500 to 800 sequences of different lengths were taken to calculate the features stored in the data warehouse; that is, feature values of 2500 to 4000 sequences in total were used to construct the test data warehouse. As a result, 2500 to 4000 rows are available, and 18 features are stored for every protein sequence, so the data warehouse has size 19 x 4000 (18 feature columns plus 1 column for the corresponding family name, i.e. 19 columns and up to 4000 rows).
The classifier's three phases execute, respectively, a neural-network-based model, a fuzzy-logic-based model, and neighborhood analysis. Performance evaluation showed a clear benefit in running the neural-network-based model before the fuzzy-based model: of 500 unknown protein sequences analysed, 405 were classified by the neural-network-based model alone. In those 405 cases the fuzzy-based model never needs to run, so the time complexity is reduced. A further 50 cases required the fuzzy-based model after the neural-network-based model, and the remaining 45 cases required all three phases. From this analysis it is concluded that the neural-network-based model should run as the first phase, the fuzzy-based model next, and neighborhood analysis last.
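On the reported test run, this ordering means 81% of sequences are resolved in the first phase, a further 10% in the second, and the remaining 9% fall through to neighborhood analysis. A one-line helper (illustrative, not from the tool) makes the arithmetic explicit:

```java
// Per-phase resolution rates implied by the reported test run of 500
// unknown sequences: 405 resolved in Phase 1, 50 more in Phase 2, and
// the remaining 45 in Phase 3.
public class PhaseStats {
    static double percent(int resolved, int total) {
        return 100.0 * resolved / total;
    }
}
```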
6. CONCLUSION
Protein analysis plays a major part in biological research, and a protein usually needs to be identified before it can be analysed. As a result, many protein sequence classification techniques have been invented, which has enhanced the research scope in the biological domain; choosing the right classification technique is itself an important part of the research. Recent trend analysis shows that it is very difficult to classify large amounts of biological data, such as protein sequences, using a traditional database system, and data mining techniques are more appropriate for handling such large data sources. A number of different techniques are prevalent for classifying protein sequences. This dissertation includes a detailed review of ongoing research involving three such techniques, together with a comparative study of the basic differences between these models. The accuracy level of each proposed model has been studied. This review and comparative study help to expose the insufficiency of the previous classification techniques with respect to accuracy and computational time.
A new classification model has been proposed in this dissertation for classifying unknown protein sequences into families, producing knowledge-based information alongside data analysis. The classifier has two main phases, (i) neural-network-based classification and (ii) fuzzy-logic-based classification, so both a fuzzy-based algorithm and a neural-network-based algorithm are incorporated. The fuzzy-logic-based algorithm performs data-by-data analysis, which is very useful for reducing the total dataset, while the neural-network-based algorithm produces knowledge by analysing the overall data warehouse, which helps to classify and predict the required output. Neighborhood analysis is performed in the final stage of the proposed technique to predict the required output.
The proposed classifier has been implemented using Java (JDK 1.6_16). The model has been tested on a test set of 500 different sequences, including runs with the order of the phases reversed. It has been observed that the accuracy level increases when the neural-network-based approach is executed before the fuzzy-logic-based approach.
REFERENCES
[1] Cathy Wu, Michael Berry, Sailaja Shivakumar, Jerry McLarty (1995) Neural Networks for Full-Scale
Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition. Machine
Learning, 21, pp. 177-193. Kluwer Academic Publishers, Boston.
[2] Jason T. L. Wang, Qicheng Ma, Dennis Shasha, Cathy H. Wu (2000) Application of Neural
Networks to Biological Data Mining: A Case Study in Protein Sequence Classification. KDD, Boston,
MA, USA, pp. 305-309.
[3] C. Z. Cai, L. Y. Han, Z. L. Ji, X. Chen, and Y. Z. Chen (2003) SVM-Prot: Web-based support vector
machine software for functional classification of a protein from its primary sequence. Nucleic Acids
Research, vol. 31, pp. 3692–3697.
[4] Dianhui Wang, Guang-Bin Huang (2005) Protein Sequence Classification Using Extreme Learning
Machine. Proceedings of International Joint Conference on Neural Networks (IJCNN2005), Montreal,
Canada.
[5] Shakir Mohamed, David Rubin and Tshilidzi Marwala (2006) Multi-class Protein Sequence
Classification Using Fuzzy ARTMAP. IEEE Conference pp: 1676 – 1680.
[6] E. G. Mansoori, M. J. Zolghadri, S. D. Katebi, H. Mohabatkar, R. Boostani and M. H. Sadreddini
(2008) Generating Fuzzy Rules For Protein Classification. Iranian Journal of Fuzzy Systems Vol. 5,
No. 2, pp. 21-33.
[7] Zarita Zainuddin, Maragathan Kumar (2008) Radial Basis Function Neural Networks in Protein
Sequence Classification. Malaysian Journal of Mathematical Sciences, 2(2), pp. 195-204.
[8] Eghbal G. Mansoori, Mansoor J. Zolghadri, and Seraj D. Katebi (March 2009) Protein Superfamily
Classification Using Fuzzy Rule-Based Classifier. IEEE Transactions On Nanobioscience, Vol. 8, No.
1, pp 92-99.
[9] PV Nageswara Rao, T Uma Devi, Dsvgk Kaladhar, Gr Sridhar, Allam Appa Rao (2009) A
Probabilistic Neural Network Approach For Protein Superfamily Classification. Journal of
Theoretical and Applied Information Technology.
[10] Ramadevi Yellasiri, C.R.Rao (2009) Rough Set Protein Classifier. Journal of Theoretical and Applied
Information Technology.
[11] Shuzlina Abdul Rahman, Azuraliza Abu Bakar, Zeti Azura Mohamed Hussein (2009) Feature
Selection and Classification of Protein Subfamilies Using Rough Sets. International Conference on
Electrical Engineering and Informatics, Selangor, Malaysia.
[12] Suprativ Saha, Rituparna Chaki (2012) “A Brief Review of Data Mining Application Involving
Protein Sequence Classification”, Advances in Intelligent Systems and Computing, 1, Volume 177,
Advances in Computing and Information Technology, Springer Publisher, ACITY 2012, Chennai,
India, pp. 469-477
[13] George Tzanis, Christos Berberidis, and Ioannis Vlahavas, “Biological Data Mining”.
AUTHORS
Avik Samanta received his M.C.A. degree from JIS College of Engineering in 2010 and his B.Sc. degree from the University of Calcutta in 2007. His research interests include Database Management Systems, Data Mining, and Distributed Databases.
Suprativ Saha has been an Assistant Professor at Global Institute of Management and Technology, Krishnagar City, West Bengal, India, since 2012. He received his M.E. degree from West Bengal University of Technology in 2012 and his B.Tech. degree from JIS College of Engineering in 2009. His research interests include Database Management Systems, Data Mining, Distributed Databases, and Optical Networks. Mr. Saha has about 2 refereed international publications to his credit.