1. The document discusses a hybrid genetic algorithm-support vector machine (GA-SVM) approach to feature selection for email classification, aimed at improving SVM performance.
2. SVM has been shown to be inefficient and computationally expensive when classifying large email datasets with many features.
3. The hybrid GA-SVM approach uses a genetic algorithm to optimize feature selection for the SVM, improving classification accuracy and reducing computation time for email spam detection.
03 fauzi indonesian 9456 11nov17 edit septianIAESIJEECS
Since the rise of the WWW, the amount of information available online has grown rapidly; one example is Indonesian online news. Automatic text classification has therefore become an important task for information filtering. A major issue in text classification is the high dimensionality of the feature space: most features are irrelevant, noisy, or redundant, which can reduce the accuracy of the system. Hence, feature selection is needed. Maximal Marginal Relevance for Feature Selection (MMR-FS) has been proven to be a good feature selection method for text with many redundant features, but it has high computational complexity. In this paper, we propose a two-phase feature selection method. In the first phase, to lower the complexity of MMR-FS, we use Information Gain to reduce the feature set; the reduced feature set is then selected with MMR-FS in the second phase. Experimental results show that the new method reaches a best accuracy of 86%. The new method lowers the complexity of MMR-FS while retaining its accuracy.
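The first phase of the approach just summarized (Information Gain filtering before the more expensive MMR-FS pass) can be sketched as follows. This is a minimal illustration on an invented toy corpus, not the authors' implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of the binary 'term present?' feature with respect to the class labels."""
    has = [y for d, y in zip(docs, labels) if term in d]
    not_has = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part) for part in (has, not_has) if part)
    return entropy(labels) - cond

# Toy "news" corpus (sets of terms) with two classes -- purely illustrative.
docs = [{"gol", "liga"}, {"gol", "skor"}, {"rupiah", "bank"}, {"bank", "saham"}]
labels = ["sport", "sport", "economy", "economy"]

# Phase one: keep only the top-k terms by IG; MMR-FS would then run on these.
vocab = set().union(*docs)
ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
print(ranked[:2])
```

Terms that perfectly separate the two classes ("gol", "bank") score IG = 1 bit and survive the filter, which is the point of the cheap first phase.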
This document discusses a system for detecting suspicious emails using machine learning algorithms. It presents a model that filters emails containing images and PDFs to improve performance over existing text-based spam filtering systems. The model uses data collection, preprocessing, machine learning classification algorithms like Naive Bayes, Support Vector Machines, and Decision Trees. It aims to achieve acceptable recall and precision values. The document also reviews related work on email spam detection using machine learning and discusses the future scope of applying the algorithm to embedded images/PDFs and larger datasets.
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS) - IJCSEA Journal
Feature selection is an effective method used in text categorization for sorting a set of documents into a certain number of predefined categories. It is an important method for improving the efficiency and accuracy of text categorization algorithms by removing irrelevant and redundant terms from the corpus. The genome contains the total genetic information in the chromosomes of an organism, including its genes and DNA sequences. In this paper a hierarchical clustering technique is used to categorize the features from the genome documents, and a framework is proposed for genomic feature set selection. Filter-based feature selection methods such as the χ² statistic and the CHIR statistic are used to select the feature set. The selected feature set is verified using the F-measure and is validated for biological relevance using the BLAST tool.
International Journal of Engineering Research and Development (IJERD) - IJERD Editor
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION - cscpconf
Feature clustering is a powerful method for reducing the dimensionality of feature vectors for text classification. In this paper, Fast Fuzzy Feature Clustering for text classification is proposed. It is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee in 2011. The words in the feature vector of a document are grouped into clusters in fewer iterations: the number of iterations required to obtain the cluster centers is reduced by transforming the cluster-center dimension from n dimensions to 2 dimensions, using Principal Component Analysis with a slight change for the dimension reduction. Experimental results show that this method improves performance by significantly reducing the number of iterations required to obtain the cluster centers, and this is verified on three benchmark datasets.
Feature selection, optimization and clustering strategies of text documents - IJECEIAES
Clustering is one of the most researched areas of data mining in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, collaborative filtering, document management, and indexing. Research on the clustering task must be performed before adapting it to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers; efforts have also been made to achieve efficient clustering on categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, it details prominent clustering models along with the pros and cons of each, and it covers the latest developments in clustering in social networks and associated environments.
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION - ijistjournal
User-generated content on the web grows rapidly in this emergent information age. Evolutionary changes in technology make use of such information to capture the user's essence, so that only useful information is exposed to information seekers. Most existing research on text information processing focuses on the factual domain rather than the opinion domain. In this paper we detect online hotspot forums by computing sentiment analysis on the text data available in each forum. The approach analyses the forum text data and computes a value for each word. It combines K-means clustering with a Support Vector Machine optimized by Particle Swarm Optimization (SVM-PSO) to group the forums into two clusters, hotspot and non-hotspot, within the current time span. The proposed system's accuracy is compared with other classification algorithms such as Naïve Bayes, decision trees, and SVM. The experiments show that K-means and SVM-PSO together achieve highly consistent results.
A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches - ijtsrd
In this work, we review the issue of spam mail, which is a big problem on the Internet. The growing volume of unsolicited mass e-mail, or spam, has produced the requirement for a dependable anti-spam filter. Nowadays, machine learning (ML) procedures are employed to automatically filter spam e-mail in an effective manner. We review some of the prevalent ML approaches, such as rough sets, Bayesian classification, SVMs, k-NN, ANNs, and artificial immune systems, and their usefulness for the spam e-mail classification problem. We provide descriptions of the procedures and a comparison of their performance on the SpamAssassin corpus. Anu | Ms. Preeti, "A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-4, Issue-6, October 2020. URL: https://www.ijtsrd.com/papers/ijtsrd33261.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-processing/33261/a-deep-analysis-on-prevailing-spam-mail-filteration-machine-learning-approaches/anu
IRJET - Text Document Clustering using K-Means Algorithm - IRJET Journal
This document discusses using the K-Means clustering algorithm to cluster text documents and compares it to using K-Means clustering with dimension reduction techniques. It uses the BBC Sports dataset containing 737 documents in 5 classes. The document outlines preprocessing the text, creating a document term matrix, applying K-Means clustering, and using dimension reduction techniques like InfoGain before clustering. It evaluates the different methods using precision, recall, accuracy, and F-measure, finding that K-Means with InfoGain dimension reduction outperforms standard K-Means clustering.
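The pipeline the summary describes (build a document-term matrix, then apply K-Means) can be sketched with a toy from-scratch k-means. The miniature document-term matrix and its vocabulary below are invented for illustration, not the BBC Sports data:

```python
import random

def kmeans(vectors, k, iters=50, seed=0):
    """Plain k-means on dense term-frequency vectors (squared Euclidean distance)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        # Assignment step: nearest center for every document vector.
        groups = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            groups[i].append(v)
        # Update step: mean of each group (keep the old center if a group empties).
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

# Tiny document-term matrix: counts of ["match", "goal", "stock", "profit"].
dtm = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 3, 2], [0, 1, 2, 3]]
centers, groups = kmeans(dtm, k=2)
print([len(g) for g in groups])  # the two "sport" and two "finance" rows separate
```

A dimension-reduction step such as the InfoGain filtering mentioned above would simply shrink each row of `dtm` before this clustering runs.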
Lit Review Talk - Signal Processing and Machine Learning with Differential Pr...Kato Mivule
Literature Review – Talk, By Kato Mivule, COSC891 Fall 2013, Computer Science Department, Bowie State University
"Signal Processing and Machine Learning with Differential Privacy Algorithms and challenges for continuous data" Sarwate and Chaudhuri (2013)
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo... - ijcsa
Active learning is a supervised learning method based on the idea that a machine learning algorithm can achieve greater accuracy with fewer labelled training images if it is allowed to choose the images from which it learns. Facial age classification is a technique for classifying face images into one of several predefined age groups. The proposed study applies an active learning approach to facial age classification, allowing the classifier to select the data from which it learns. The classifier is initially trained on a small pool of labelled training images using bilateral two-dimensional linear discriminant analysis. The most informative unlabelled image is then selected from the unlabelled pool using the furthest-nearest-neighbour criterion, labelled by the user, and added to the appropriate class in the training set. Incremental learning is performed using an incremental version of bilateral two-dimensional linear discriminant analysis. This active learning paradigm is applied to the k-nearest-neighbour classifier and the support vector machine classifier, and the performance of the two classifiers is compared.
International Journal of Engineering Research and Development (IJERD) - IJERD Editor
This document presents a novel approach for clustering textual information in emails using text data mining techniques. It discusses using k-means clustering and a vector space model to group similar emails based on word patterns and frequencies. The methodology involves preprocessing emails, applying a Porter stemmer, calculating term frequencies, and using k-means to form clusters. Clusters will contain emails with similar content, allowing users to more easily process emails based on priority. This clustering approach could reduce the time users spend filtering through emails one by one.
Profile Analysis of Users in Data Analytics Domain - Drjabez
Data analytics and data science have been in fast-forward mode recently. We see many companies hiring people for data analysis and data science, especially in India, and many recruiting firms use Stack Overflow to fish for potential candidates. The industry has also started to recruit people based on shapes of expertise: the expertise of a person is metaphorically outlined by the shapes of letters like I, T, M, and hyphen, depending on their experience in an area (depth) and the number of areas of interest (width). This proposal builds upon the work of mining shapes of user expertise in a typical online social question-and-answer (Q&A) community, where expert users often answer questions posed by other users. We deal with the temporal analysis of expertise among Q&A community users, in terms of how the user or expert has evolved over time.
Keywords: Shapes of expertise, Graph communities, Expertise evolution, Q&A community
Recommendation based on Clustering and Association Rules - IJARIIE JOURNAL
This document proposes a hybrid recommendation framework that uses both clustering and association rule mining. It first clusters users using the k-means algorithm to group similar users together. It then applies the Eclat algorithm to generate frequent itemsets and association rules from the user-item data. These association rules are used to make recommendations to users. The framework aims to address issues like sparsity and cold start problems. An experimental study compares the performance of the Eclat and Apriori algorithms on recommendation quality and execution time. The proposed approach integrates clustering and association rule mining to provide more accurate recommendations.
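The Eclat half of the framework above can be sketched from its defining trait, intersecting vertical tid-lists to count itemset support. The basket data here is made up for illustration:

```python
def eclat(transactions, min_support):
    """Eclat: frequent-itemset mining via intersection of vertical tid-lists."""
    # Build the vertical layout: item -> set of transaction ids containing it.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def recurse(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_support:
                continue
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Extend the prefix by intersecting tid-lists with later items.
            suffix = [(it2, tids & t2) for it2, t2 in candidates[i + 1:]]
            recurse(itemset, suffix)

    recurse((), sorted(tidlists.items()))
    return frequent

# Hypothetical user-item baskets (e.g. items consumed by users in one cluster).
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = eclat(baskets, min_support=3)
print(freq)
```

In the hybrid framework described above, such rules would be mined per k-means cluster, so recommendations come from users similar to the target user.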
Text classification based on gated recurrent unit combined with support vecto... - IJECEIAES
As the amount of unstructured text data that humanity produces grows rapidly and ever more text accumulates on the Internet, intelligent techniques are required to process it and extract different types of knowledge from it. Gated recurrent units (GRU) and support vector machines (SVM) have been successfully applied to Natural Language Processing (NLP) systems with comparable, remarkable results. GRU networks perform well in sequential learning tasks and overcome the vanishing and exploding gradient issues of standard recurrent neural networks (RNNs) when capturing long-term dependencies. In this paper, we propose a text classification model that improves on this norm by using a linear support vector machine (SVM) as the replacement for Softmax in the final output layer of a GRU model; correspondingly, the cross-entropy function is replaced with a margin-based function. Empirical results show that the proposed GRU-SVM model achieves comparatively better results than the baseline approaches BLSTM-C and DABN.
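The output-layer swap the abstract describes can be seen in miniature by comparing the two losses on a single vector of raw class scores. The score values are illustrative only, not taken from the paper:

```python
import math

def softmax_cross_entropy(scores, y):
    """Standard softmax + cross-entropy on raw class scores (log-sum-exp form)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[y]

def multiclass_hinge(scores, y, margin=1.0):
    """Margin-based (Crammer-Singer style) hinge loss: penalize only when the
    best wrong class comes within `margin` of the true class score."""
    best_wrong = max(s for j, s in enumerate(scores) if j != y)
    return max(0.0, margin + best_wrong - scores[y])

scores = [2.0, 0.5, -1.0]  # raw outputs of the final (linear) layer
print(softmax_cross_entropy(scores, 0), multiclass_hinge(scores, 0))
```

The hinge loss is exactly zero once the true class wins by the margin, whereas cross-entropy keeps pushing probabilities toward 1; that difference is the essence of replacing Softmax with a linear SVM head.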
Spam image email filtering using K-NN and SVM - IJECEIAES
The growing use of the web has promoted a simple and fast means of electronic communication, the outstanding example being e-mail. Nowadays sending and receiving e-mail is a prominent means of correspondence, but this raises a problem: spam mail. Spam mails are messages sent by unknown senders merely to hamper the development of the Internet, e.g. advertisements and many more. Spammers have introduced the technique of embedding spam content in an image attached to the mail. In this paper, we propose a method based on a combination of SVM and k-NN. SVMs tend to take a long time to train on a large data set; if "redundant" examples are identified and removed in preprocessing, the training time can be reduced significantly. We propose a k-nearest-neighbour (k-NN) based example selection strategy that tries to select the examples that are close to the decision boundary and that are correctly labelled. The fundamental idea is to find the near neighbours of a query sample and train a local SVM that preserves the distance function on the collection of neighbours. Our experimental studies on a publicly available dataset (Dredze) show that results improve to approximately 98%.
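The boundary-focused instance selection can be sketched in a deliberately simplified form: keep a training sample only if its k nearest neighbours contain an opposite-class sample, since such samples sit near the decision boundary. The feature vectors and labels below are invented:

```python
def near_boundary(points, labels, k=3):
    """Indices of samples whose k nearest neighbours include an opposite class."""
    keep = []
    for i, p in enumerate(points):
        # Squared Euclidean distance to every other sample, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        neighbours = [labels[j] for _, j in dists[:k]]
        if any(lab != labels[i] for lab in neighbours):
            keep.append(i)  # informative: close to the class boundary
    return keep

# Two small clusters of feature vectors; the middle points straddle the boundary.
pts = [(0.0,), (0.5,), (1.0,), (1.4,), (1.6,), (2.0,), (2.5,), (3.0,)]
labs = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
print(near_boundary(pts, labs, k=2))
```

Only the boundary-adjacent samples survive, so a subsequent SVM trains on a much smaller set without (ideally) moving the decision boundary.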
This document summarizes a research paper on robust unsupervised feature selection on networked data. It introduces the challenges of high dimensionality and noise in networked data. The proposed NetFS framework addresses this by (1) modeling link information with latent representations learned from the network structure, and (2) embedding latent representation learning into the feature selection process to reduce noise. The framework is optimized using an alternating optimization approach. Experiments on blog, Flickr, and Epinions networks demonstrate that NetFS improves clustering performance over other methods by selecting more informative features. Future work could apply the framework to other network types and dynamic networks.
An Improved Leader Election Algorithm for Distributed Systems - ijngnjournal
This document summarizes an article from the International Journal of Next-Generation Networks that proposes an improved leader election algorithm for distributed systems. The algorithm modifies the existing ring election algorithm to minimize the number of messages exchanged during leader election. Simulation results showed that the proposed algorithm reduces message complexity compared to the original ring algorithm. The document provides background on leader election algorithms and discusses related work that has aimed to improve leader election in distributed systems through approaches like failure detectors and modifying the bully algorithm.
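A much-simplified, single-token ring election shows where the message cost being minimized comes from; real ring algorithms (and the paper's improvement) differ precisely in this count. Node ids and the starter are arbitrary:

```python
def ring_election(ids, starter):
    """Simplified ring election: a token carries the largest id seen once
    around the ring, then a second lap announces the winner. Returns the
    elected leader id and the total number of messages sent."""
    n = len(ids)
    messages = 0
    pos = starter
    best = ids[starter]
    # Election lap: each node forwards the token, keeping the largest id.
    for _ in range(n):
        pos = (pos + 1) % n
        messages += 1
        best = max(best, ids[pos])
    # Coordinator lap: announce the elected leader to every node.
    messages += n
    return best, messages

leader, msgs = ring_election([12, 45, 3, 27, 8], starter=2)
print(leader, msgs)  # highest id wins; 2n messages in this simplified variant
```

This variant costs 2n messages per election; the classic ring algorithm can cost O(n²) when several nodes initiate concurrently, which is the overhead the proposed modification targets.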
Query Aware Determinization of Uncertain Objects - 1crore projects
This document summarizes recent approaches in opinion mining and sentiment analysis (OMSA) in online social networks. It discusses 15 different frameworks and methods that have been proposed, including the Twitter Opinion Mining framework, an election prediction model using user influence factors, and a fuzzy deep belief network approach. The document analyzes the key steps and characteristics of each approach, such as data collection, preprocessing, classification techniques, and performance results. Overall, the paper reviews the state-of-the-art in OMSA research and highlights areas for future improvements.
Learning to Rank Image Tags With Limited Training Examples - 1crore projects
A Heuristic Approach for Network Data Clustering - idescitation
In this growing world of technology, every area of computer networks faces many security threats. Most of the time, network-security threat detection produces high false-positive and false-negative ratios, which becomes an obstacle that keeps any security system from working properly, and the overwhelming number of threats makes it challenging to understand and manage the network data. To address this problem we present a novel approach that understands the network data by clustering it, without background knowledge of any threats, according to various parameters such as source IP, destination IP, etc. This approach saves the administrator's time and energy in processing a large number of threats.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... - IJDKP
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone, shopping, and individual records are generated regularly. Sharing these data has proved beneficial for data mining applications. On one hand, such data is an important asset for business decision-making through analysis; on the other hand, privacy concerns may prevent data owners from sharing information for data analysis. To share data while preserving privacy, the data owner must come up with a solution that achieves the dual goals of privacy preservation and accuracy of the data mining tasks, clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information while obtaining data clusterings with minimum information loss.
Framework for opinion as a service on review data of customer using semantics... - IJECEIAES
Opinion mining plays a significant role in representing the original and unbiased perception of products and services. However, various challenges are associated with performing effective opinion mining in the present era of distributed computing with dynamic user behaviour. Existing approaches are laborious in extracting knowledge from user reviews, which is further subjected to various rounds of operations with complex procedures. The proposed system addresses the problem by introducing a novel framework called opinion-as-a-service, which is meant for direct utilization of the extracted knowledge in a user-friendly manner. It introduces a set of three sequential algorithms that aggregate the incoming stream of opinion data, perform indexing, and then apply semantics to extract knowledge. The study outcome shows that the proposed system offers better mining performance than existing systems.
The International Journal of Engineering and Science (The IJES) - theijes
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES - cscpconf
In this paper we focus on an efficient feature selection method for the classification of audio files. The main objective is feature selection and extraction: we select a set of features for further analysis, which form the elements of the feature vector, and by extraction we compute a numerical representation that characterizes the audio using an existing toolbox. In this study, Gain Ratio (GR) is used as the feature selection measure; GR selects the splitting attribute that separates the tuples into different classes. Pulse clarity is considered as a subjective measure and is used to calculate the gain of the audio features. The splitting criterion is employed in the application to identify the class, i.e. the music genre, of a specific audio file from the testing database. Experimental results indicate that using GR the application produces satisfactory music genre classification. After dimensionality reduction, the best three features are selected from the available audio features, and with this technique more than 90% of files are classified successfully.
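The C4.5-style gain ratio used here for picking the splitting attribute can be sketched in a few lines. This is a minimal illustration for discrete feature values, not the paper's implementation; the function names and toy data are ours:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain of a candidate splitting attribute, normalised
    by the split information (the C4.5 gain-ratio measure)."""
    total = len(labels)
    # Partition the class labels by feature value.
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    # Expected entropy remaining after the split.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    gain = entropy(labels) - remainder
    # Split information penalises many-valued attributes.
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0
```

A feature that perfectly separates the classes gets gain ratio 1.0; one that is independent of the class gets 0.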
A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approachesijtsrd
In this work, we review the issue of spam mail, a big problem on the Internet. The growing volume of unsolicited bulk e-mail (spam) has produced the need for a dependable anti-spam filter. Nowadays, machine learning (ML) procedures are employed to automatically filter spam e-mail in an effective manner. We review some of the prevalent ML approaches, such as rough sets, Bayesian classification, SVMs, k-NN, ANNs, and artificial immune systems, and their usefulness for the spam e-mail classification problem. We provide descriptions of the procedures and compare their performance on the SpamAssassin corpus. Anu | Ms. Preeti "A Deep Analysis on Prevailing Spam Mail Filteration Machine Learning Approaches" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-6, October 2020, URL: https://www.ijtsrd.com/papers/ijtsrd33261.pdf Paper Url: https://www.ijtsrd.com/computer-science/data-processing/33261/a-deep-analysis-on-prevailing-spam-mail-filteration-machine-learning-approaches/anu
IRJET- Text Document Clustering using K-Means Algorithm IRJET Journal
This document discusses using the K-Means clustering algorithm to cluster text documents and compares it to using K-Means clustering with dimension reduction techniques. It uses the BBC Sports dataset containing 737 documents in 5 classes. The document outlines preprocessing the text, creating a document term matrix, applying K-Means clustering, and using dimension reduction techniques like InfoGain before clustering. It evaluates the different methods using precision, recall, accuracy, and F-measure, finding that K-Means with InfoGain dimension reduction outperforms standard K-Means clustering.
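The core K-Means step described here (assign each document vector to its nearest centroid, then recompute the centroids as cluster means) can be sketched without libraries. This is a toy Lloyd's-iteration sketch on dense numeric vectors, not the paper's pipeline; real document clustering would first vectorize the text:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on dense vectors (lists of floats)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c))
                     for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to its cluster mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(x) / len(cl) for x in zip(*cl)]
    return centroids, clusters
```

On two well-separated groups of points, the algorithm recovers the two groups regardless of the random initialisation.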
Lit Review Talk - Signal Processing and Machine Learning with Differential Pr...Kato Mivule
Literature Review – Talk, By Kato Mivule, COSC891 Fall 2013, Computer Science Department, Bowie State University
"Signal Processing and Machine Learning with Differential Privacy Algorithms and challenges for continuous data" Sarwate and Chaudhuri (2013)
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo...ijcsa
Active learning is a supervised learning method based on the idea that a machine learning algorithm can achieve greater accuracy with fewer labelled training images if it is allowed to choose the images from which it learns. Facial age classification is a technique to classify face images into one of several predefined age groups. The proposed study applies an active learning approach to facial age classification which allows a classifier to select the data from which it learns. The classifier is initially trained using a small pool of labelled training images, using bilateral two-dimensional linear discriminant analysis. Then the most informative unlabelled image is selected from the unlabelled pool using the furthest nearest neighbour criterion, labelled by the user, and added to the appropriate class in the training set. Incremental learning is performed using an incremental version of bilateral two-dimensional linear discriminant analysis. This active learning paradigm is proposed to be applied to the k-nearest-neighbour classifier and the support vector machine classifier, and the performance of the two classifiers is compared.
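The furthest-nearest-neighbour criterion itself is simple: among the unlabelled samples, pick the one whose distance to its nearest labelled neighbour is largest. A minimal sketch on plain feature vectors (our illustration; the paper works on discriminant-analysis projections of face images):

```python
def furthest_nearest_neighbour(labeled, unlabeled):
    """Return the index of the unlabeled sample whose distance to its
    nearest labeled neighbour is largest (the most informative sample)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best_idx, best_dist = -1, -1.0
    for i, u in enumerate(unlabeled):
        nearest = min(d2(u, l) for l in labeled)
        if nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx
```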
International Journal of Engineering Research and Development (IJERD)IJERD Editor
This document presents a novel approach for clustering textual information in emails using text data mining techniques. It discusses using k-means clustering and a vector space model to group similar emails based on word patterns and frequencies. The methodology involves preprocessing emails, applying a Porter stemmer, calculating term frequencies, and using k-means to form clusters. Clusters will contain emails with similar content, allowing users to more easily process emails based on priority. This clustering approach could reduce the time users spend filtering through emails one by one.
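The vector space model step described here (turn each tokenised email into a term-frequency vector over a shared vocabulary) can be sketched as follows. This is a minimal illustration under the assumption of already-stemmed tokens; the function name is ours:

```python
from collections import Counter

def tf_vectors(docs):
    """Turn tokenised documents into term-frequency vectors over a
    shared vocabulary, the representation fed to k-means clustering
    in a vector space model."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for term, count in Counter(doc).items():
            v[index[term]] = count
        vectors.append(v)
    return vocab, vectors
```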
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Profile Analysis of Users in Data Analytics DomainDrjabez
Data Analytics and Data Science is in the fast forward
mode recently. We see a lot of companies hiring people for data
analysis and data science, especially in India. Also, many
recruiting firms use stackoverflow to fish their potential
candidates. The industry has also started to recruit people based
on the shapes of expertise. Expertise of a personal is
metaphorically outlined by shapes of letters like I, T, M and
hyphen betting on her experiencein a section (depth) and
therefore the variety of areas of interest (width).This proposal
builds upon the work of mining shapes of user expertise in a
typical online social Question and Answer (Q&A) community
where expert users often answer questions posed by other
users.We have dealt with the temporal analysis of the expertise
among the Q&A community users in terms how the user/ expert
have evolved over time.
Keywords— Shapes of expertise, Graph communities, Expertise
evolution, Q&A community
Recommendation based on Clustering and Association RulesIJARIIE JOURNAL
This document proposes a hybrid recommendation framework that uses both clustering and association rule mining. It first clusters users using the k-means algorithm to group similar users together. It then applies the Eclat algorithm to generate frequent itemsets and association rules from the user-item data. These association rules are used to make recommendations to users. The framework aims to address issues like sparsity and cold start problems. An experimental study compares the performance of the Eclat and Apriori algorithms on recommendation quality and execution time. The proposed approach integrates clustering and association rule mining to provide more accurate recommendations.
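The Eclat step used here mines frequent itemsets by intersecting vertical tid-sets (transaction-id sets) instead of rescanning the data as Apriori does. A compact sketch, not the paper's implementation:

```python
def eclat(transactions, min_support):
    """Eclat: mine frequent itemsets by intersecting vertical tid-sets.
    `transactions` is a list of item sets; returns {itemset: support}."""
    # Build the vertical representation: item -> set of transaction ids.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    frequent = {}
    def recurse(prefix, candidates):
        while candidates:
            item, tids = candidates.pop()
            if len(tids) >= min_support:
                itemset = prefix + (item,)
                frequent[itemset] = len(tids)
                # Extend the prefix: intersect with remaining candidates.
                recurse(itemset, [(o, t & tids) for o, t in candidates])
    recurse((), sorted(tidsets.items()))
    return frequent
```

Association rules for recommendation are then read off the frequent itemsets (e.g. from support({a,b})/support({a})).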
Text classification based on gated recurrent unit combines with support vecto...IJECEIAES
As the amount of unstructured text data that humanity produces grows rapidly on the Internet, intelligent techniques are required to process it and extract different types of knowledge from it. Gated recurrent units (GRU) and support vector machines (SVM) have been successfully applied to Natural Language Processing (NLP) systems with comparable, remarkable results. GRU networks perform well in sequential learning tasks and overcome the vanishing and exploding gradient issues of standard recurrent neural networks (RNNs) when capturing long-term dependencies. In this paper, we propose a text classification model that improves on this norm by presenting a linear support vector machine (SVM) as the replacement for Softmax in the final output layer of a GRU model. Furthermore, the cross-entropy function is replaced with a margin-based function. Empirical results show that the proposed GRU-SVM model achieved comparatively better results than the baseline approaches BLSTM-C and DABN.
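One common margin-based objective that can stand in for cross-entropy at a linear-SVM output layer is the multi-class hinge loss: every wrong class whose score comes within the margin of the true class score is penalised. A plain-Python sketch of the loss for one example (our illustration, not necessarily the paper's exact formulation):

```python
def multiclass_hinge_loss(scores, target, margin=1.0):
    """Crammer-Singer style multi-class hinge loss for one example:
    sum of margin violations of all wrong classes against the
    true-class score. Zero when the true class wins by >= margin."""
    correct = scores[target]
    return sum(max(0.0, margin - correct + s)
               for i, s in enumerate(scores) if i != target)
```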
Spam image email filtering using K-NN and SVMIJECEIAES
The growing utilization of the web has promoted a simple and fast means of electronic communication, the best-known example being e-mail. Nowadays, sending and receiving e-mail is a widely used means of communication, but this raises a problem: spam mails. Spam mails are messages sent by unknown senders merely to hamper the development of the Internet, e.g. advertisements and many more. Spammers have introduced the new technique of embedding the spam in an image attached to the mail. In this paper, we propose a method based on a combination of SVM and k-NN. SVMs tend to take a long time to train on a large data set; if "redundant" examples are identified and deleted in pre-processing, the training time can be reduced significantly. We propose a k-nearest neighbor (k-NN) based example selection strategy. The strategy tries to select the examples that are close to the decision boundary and that are correctly labeled. The fundamental idea is to find the near neighbors of a query sample and train a local SVM that preserves the distance function on the collection of neighbors. Our experimental studies on a publicly available dataset (Dredze) show that results improve to approximately 98%.
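The boundary-sample selection idea can be sketched as: keep only the training samples whose k nearest neighbours are not all of the same class, since those are the samples presumed close to the decision boundary. A minimal sketch on numeric feature vectors (our illustration; function name and data are hypothetical):

```python
def boundary_samples(X, y, k=3):
    """Return indices of samples whose k nearest neighbours include at
    least one opposite-class sample, i.e. samples near the boundary."""
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    keep = []
    for i, xi in enumerate(X):
        # Indices of the k nearest other samples.
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: d2(xi, X[j]))[:k]
        if any(y[j] != y[i] for j in order):
            keep.append(i)
    return keep
```

Training an SVM only on the kept samples is what shortens training: well-separated interior samples are discarded before the expensive optimisation.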
This document summarizes a research paper on robust unsupervised feature selection on networked data. It introduces the challenges of high dimensionality and noise in networked data. The proposed NetFS framework addresses this by (1) modeling link information with latent representations learned from the network structure, and (2) embedding latent representation learning into the feature selection process to reduce noise. The framework is optimized using an alternating optimization approach. Experiments on blog, Flickr, and Epinions networks demonstrate that NetFS improves clustering performance over other methods by selecting more informative features. Future work could apply the framework to other network types and dynamic networks.
An Improved Leader Election Algorithm for Distributed Systemsijngnjournal
This document summarizes an article from the International Journal of Next-Generation Networks that proposes an improved leader election algorithm for distributed systems. The algorithm modifies the existing ring election algorithm to minimize the number of messages exchanged during leader election. Simulation results showed that the proposed algorithm reduces message complexity compared to the original ring algorithm. The document provides background on leader election algorithms and discusses related work that has aimed to improve leader election in distributed systems through approaches like failure detectors and modifying the bully algorithm.
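The ring election scheme being improved can be sketched as a failure-free simulation: an election message circulates once around the ring collecting the highest id, then a coordinator message announces the winner, for 2n messages in total. This is our simplified illustration of the classic algorithm, not the article's improved variant:

```python
def ring_election(ids, initiator):
    """Simulate one ring election started by `initiator`: the election
    message visits every node once (collecting the max id) and a
    coordinator message then announces the leader around the ring."""
    n = len(ids)
    start = ids.index(initiator)
    leader = initiator
    messages = 0
    # Election pass: message hops to every node and back to the start.
    for step in range(1, n + 1):
        leader = max(leader, ids[(start + step) % n])
        messages += 1
    # Coordinator pass: the leader id is announced around the ring.
    messages += n
    return leader, messages
```

The proposed algorithm's goal is precisely to bring this message count down below 2n.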
Query Aware Determinization of Uncertain Objects1crore projects
IEEE PROJECTS 2015
1 Crore Projects is a leading guide for IEEE projects and a real-time project works provider.
It has provided a lot of guidance to thousands of students and made them more proficient in all technology training.
Dot Net
DOTNET Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
Java Project Domain list 2015
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
ECE IEEE Projects 2015
1. Matlab project
2. Ns2 project
3. Embedded project
4. Robotics project
Eligibility
Final Year students of
1. BSc (C.S)
2. BCA/B.E(C.S)
3. B.Tech IT
4. BE (C.S)
5. MSc (C.S)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME(ALL)
10. BE(ECE)(EEE)(E&I)
TECHNOLOGY USED AND FOR TRAINING IN
1. DOT NET
2. C sharp
3. ASP
4. VB
5. SQL SERVER
6. JAVA
7. J2EE
8. STRINGS
9. ORACLE
10. VB dotNET
11. EMBEDDED
12. MAT LAB
13. LAB VIEW
14. Multi Sim
CONTACT US
1 CRORE PROJECTS
Door No: 214/215,2nd Floor,
No. 172, Raahat Plaza, (Shopping Mall) ,Arcot Road, Vadapalani, Chennai,
Tamil Nadu, INDIA - 600 026
Email id: 1croreprojects@gmail.com
website:1croreprojects.com
Phone : +91 97518 00789 / +91 72999 51536
This document summarizes recent approaches in opinion mining and sentiment analysis (OMSA) in online social networks. It discusses 15 different frameworks and methods that have been proposed, including the Twitter Opinion Mining framework, an election prediction model using user influence factors, and a fuzzy deep belief network approach. The document analyzes the key steps and characteristics of each approach, such as data collection, preprocessing, classification techniques, and performance results. Overall, the paper reviews the state-of-the-art in OMSA research and highlights areas for future improvements.
Learning to Rank Image Tags With Limited Training Examples1crore projects
A Heuristic Approach for Network Data Clusteringidescitation
In this growing world of technology, every area of computer networks faces many security threats. Most of the time, network security threats produce high false positive and false negative ratios, which prevents any security system from working properly. The overwhelming number of threats makes it challenging to understand and manage the network data. To address this problem, we present a novel approach that understands the network data by clustering it, without background knowledge of any threats, according to various parameters such as source IP and destination IP. This approach saves the administrator's time and energy in processing a large number of threats.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...IJDKP
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone, and shopping records are regularly generated. Sharing these data has proved beneficial for data mining applications. On one hand, such data is an important asset for business decision making through analysis. On the other hand, data privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goals of privacy preservation and accuracy of the data mining task, i.e. clustering and classification. An efficient and effective approach has been proposed that aims to protect the privacy of sensitive information while obtaining data clustering with minimum information loss.
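A multiplicative perturbation of tuple values can be sketched as scaling each value by a random factor drawn near 1: exact values are hidden, while relative structure (and hence clustering) is roughly preserved. A minimal illustration under our own choice of noise model, not the paper's scheme:

```python
import random

def multiplicative_perturb(values, noise=0.05, seed=42):
    """Scale each value by a random factor in [1-noise, 1+noise],
    hiding exact values while roughly preserving relative magnitudes
    for downstream clustering or classification."""
    rnd = random.Random(seed)
    return [v * rnd.uniform(1 - noise, 1 + noise) for v in values]
```

The `noise` parameter is the privacy/utility knob: larger noise gives stronger privacy but more information loss.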
This document discusses online feature selection (OFS) for data mining applications. It addresses two tasks of OFS: 1) learning with full input, where the learner can access all features to select a subset, and 2) learning with partial input, where only a limited number of features can be accessed for each instance. Novel algorithms are presented for each task, and their performance is analyzed theoretically. Experiments on real-world datasets demonstrate the efficacy of the proposed OFS techniques for applications in computer vision, bioinformatics, and other domains involving high-dimensional sequential data.
Multi Label Spatial Semi Supervised Classification using Spatial Associative ...cscpconf
Multi-label spatial classification based on association rules with multi-objective genetic algorithms (MOGA), enriched by semi-supervised learning, is proposed in this paper to deal with the multiple class label problem. We adapt problem transformation for multi-label classification and use a hybrid evolutionary algorithm to optimize the generation of spatial association rules, which addresses single labels. MOGA is used to combine the single labels into multi-labels under the conflicting objectives of predictive accuracy and comprehensibility. Semi-supervised learning is done through the process of rule cover clustering. Finally, an associative classifier is built with a sorting mechanism. The algorithm is simulated and the results are compared with a MOGA-based associative classifier, which it outperforms.
Unsupervised Feature Selection Based on the Distribution of Features Attribut...Waqas Tariq
Since dealing with high-dimensional data is computationally complex and sometimes even intractable, several feature reduction methods have recently been developed to reduce the dimensionality of the data and simplify analysis in applications such as text categorization, signal processing, image retrieval, and gene expression. Among feature reduction techniques, feature selection is one of the most popular methods due to its preservation of the original features. However, most current feature selection methods do not perform well when fed imbalanced data sets, which are pervasive in real-world applications. In this paper, we propose a new unsupervised feature selection method suited to imbalanced data sets, which removes redundant features from the original feature space based on the distribution of features. To show the effectiveness of the proposed method, popular feature selection methods have been implemented and compared. Experimental results on several imbalanced data sets derived from the UCI repository illustrate the effectiveness of our proposed method in comparison with the other methods, in terms of both accuracy and the number of selected features.
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
Feature selection is considered a problem of global combinatorial optimization in machine learning, which reduces the number of features and removes irrelevant, noisy and redundant data. However, identifying useful features from hundreds or even thousands of related features is not an easy task. Selecting relevant genes from microarray data becomes even more challenging owing to the high dimensionality of features, the multiclass categories involved, and the usually small sample size. In order to improve prediction accuracy and avoid incomprehensibility due to the number of features, different feature selection techniques can be implemented. This survey classifies and analyzes different approaches, aiming not only to provide a comprehensive presentation but also to discuss challenges and various performance parameters. The techniques are generally classified into three categories: filter, wrapper and hybrid.
A chi-square-SVM based pedagogical rule extraction method for microarray data...IJAAS Team
Support Vector Machine (SVM) is currently an efficient classification technique due to its ability to capture nonlinearities in diagnostic systems, but it does not reveal the knowledge learnt during training. In machine learning applications such as bioinformatics, it is important to understand how a decision is reached. On the other hand, a decision tree has good comprehensibility; the process of converting incomprehensible models into an understandable one is often called rule extraction. In this paper we propose an approach for extracting rules from an SVM for a microarray dataset by combining the merits of both the SVM and the decision tree. The proposed approach consists of three steps: SVM-CHI-SQUARE is employed to reduce the feature set; the dataset with reduced features is used to obtain the SVM model and synthetic data is generated; and Classification and Regression Trees (CART) are used to generate rules in the last phase. We use the breast masses dataset from the UCI repository, where comprehensibility is a key requirement. The experiment shows that, using the reduced-feature dataset, the proposed approach extracts shorter rules, thereby improving the comprehensibility of the system. We obtained an accuracy of 93.53%, sensitivity of 89.58%, specificity of 96.70%, and a training time of 3.195 seconds. A comparative analysis is carried out with other algorithms.
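The chi-square scoring used for feature reduction can be sketched for the binary case: compute the Pearson chi-square statistic of each feature against the class labels from the 2x2 contingency table, and keep the highest-scoring features. A minimal illustration (our own simplification to binary features and labels):

```python
def chi_square_score(feature, labels):
    """Chi-square statistic of a binary feature against binary class
    labels, computed from the 2x2 contingency table. Higher means
    stronger feature/class dependence."""
    n = len(labels)
    f1, y1 = sum(feature), sum(labels)
    totals_f = {0: n - f1, 1: f1}
    totals_y = {0: n - y1, 1: y1}
    # Observed cell counts of the contingency table.
    counts = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, labels):
        counts[(f, y)] += 1
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            expected = totals_f[f] * totals_y[y] / n
            if expected > 0:
                score += (counts[(f, y)] - expected) ** 2 / expected
    return score
```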
Spam filtering by using Genetic based Feature SelectionEditor IJCATR
Spam is defined as redundant and unwanted electronic mail, and nowadays it creates many problems in business life, such as occupying network bandwidth and space in users' mailboxes. Due to these problems, much research has been carried out in this area using classification techniques. Recent research shows that feature selection can have a positive effect on the efficiency of machine learning algorithms. Most algorithms try to present a data model that depends on the detection of a small set of features; irrelevant features in the modelling process result in weak estimation and more computation. In this research we evaluate spam detection among legitimate e-mails, and its effect on several machine learning algorithms, by presenting a feature selection method based on a genetic algorithm. Bayesian network and k-NN classifiers are used in the classification phase, and the Spambase dataset is used.
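Genetic feature selection can be sketched with bit-mask chromosomes: each bit says whether a feature is kept, and a fitness function (in practice, classifier accuracy on the selected features) drives selection, crossover and mutation. A tiny self-contained sketch with an arbitrary fitness plugged in; all names and parameters here are our illustration:

```python
import random

def ga_feature_select(n_features, fitness, pop_size=20, gens=30, seed=1):
    """Tiny genetic algorithm over bit-mask chromosomes: each bit marks
    a feature as kept; `fitness(mask)` scores a mask (higher = better)."""
    rnd = random.Random(seed)
    pop = [[rnd.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rnd.sample(survivors, 2)
            cut = rnd.randrange(1, n_features)   # one-point crossover
            child = a[:cut] + b[cut:]
            child[rnd.randrange(n_features)] ^= 1  # single-bit mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

In a real spam filter the fitness would train and validate, e.g., a Bayesian network on the masked feature subset.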
This study extensively evaluated 12 feature selection metrics on 229 text classification problems from datasets like Reuters and OHSUMED. It found that a new metric called Bi-Normal Separation (BNS) substantially outperformed other metrics on accuracy, F-measure, and recall. BNS's performance was particularly good on tasks with high class skew, which is common in text classification. BNS was also consistently one of the best-performing metrics when choosing optimal pairs of metrics for each performance goal. The study provides guidance for practitioners on choosing effective feature selection metrics for text classification tasks.
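Bi-Normal Separation scores a feature as the gap between the inverse normal CDF of its true-positive rate and of its false-positive rate. A minimal sketch using the standard library's `statistics.NormalDist`; the rate clamping below is our own guard against infinite values at 0 or 1:

```python
from statistics import NormalDist

def bns(tp, fp, pos, neg):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the
    inverse standard normal CDF; rates are clamped away from 0 and 1."""
    def clamp(rate):
        return min(max(rate, 0.0005), 0.9995)
    inv = NormalDist().inv_cdf
    return abs(inv(clamp(tp / pos)) - inv(clamp(fp / neg)))
```

A feature occurring equally often in both classes scores 0; the score grows as the occurrence rates diverge.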
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)IJCSEA Journal
Feature selection is an effective method used in text categorization for sorting a set of documents into a certain number of predefined categories. It is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant terms from the corpus. A genome contains the total amount of genetic information in the chromosomes of an organism, including its genes and DNA sequences. In this paper, a hierarchical clustering technique is used to categorize the features from the genome documents, and a framework is proposed for genomic feature set selection. Filter-based feature selection methods, such as statistics and CHIR statistics, are used to select the feature set. The selected feature set is verified using the F-measure and biologically validated for relevance using the BLAST tool.
Integrated bio-search approaches with multi-objective algorithms for optimiza...TELKOMNIKA JOURNAL
Optimal selection of features is very difficult and crucial to achieve, particularly for the task of classification. This is due to the traditional method of selecting features independently, which generates collections of irrelevant features and therefore degrades classification accuracy. The goal of this paper is to leverage the potential of bio-inspired search algorithms, together with wrappers, in optimizing the multi-objective algorithms ENORA and NSGA-II to generate an optimal set of features. The main steps are to idealize the combination of ENORA and NSGA-II with suitable bio-search algorithms where multiple subset generation has been implemented, and then to validate the optimal feature set by conducting a subset evaluation. Eight comparison datasets of various sizes have been deliberately selected for evaluation. Results show that the ideal combination of the multi-objective algorithms ENORA and NSGA-II with the selected bio-inspired search algorithm is promising for achieving a better optimal solution (i.e. a best feature set with higher classification accuracy) for the selected datasets. This discovery implies that bio-inspired wrapper/filter algorithms can boost the efficiency of ENORA and NSGA-II for the task of selecting and classifying features.
A Formal Machine Learning or Multi Objective Decision Making System for Deter...Editor IJCATR
Decision-making typically needs mechanisms to compromise among opposing norms. When multiple objectives are involved in machine learning, a vital step is to relate the weights of individual objectives to system-level performance. Determining the weights of multiple objectives is an analysis process, and it has typically been treated as an optimization problem. However, our preliminary investigation has shown that existing methodologies for managing the weights of multiple objectives have some obvious limitations: the determination of weights is treated as a single problem; a result supporting such an optimization is limited; and it can even be unreliable when knowledge about the multiple objectives is incomplete, e.g. compromised by poor data. The limitations of fixed weights are also discussed. Variable weights are natural in decision-making processes, so we need to develop a systematic methodology for determining variable weights of multiple objectives. The roles of weights in multi-objective decision-making and machine learning are analyzed, and the weights are determined with the help of a standard neural network.
Distributed Digital Artifacts on the Semantic WebEditor IJCATR
Distributed digital artifacts incorporate cryptographic hash values to URI called trusty URIs in a distributed environment
building good in quality, verifiable and unchangeable web resources to prevent the rising man in the middle attack. The greatest
challenge of a centralized system is that it gives users no possibility to check whether data have been modified and the communication
is limited to a single server. As a solution for this, is the distributed digital artifact system, where resources are distributed among
different domains to enable inter-domain communication. Due to the emerging developments in web, attacks have increased rapidly,
among which man in the middle attack (MIMA) is a serious issue, where user security is at its threat. This work tries to prevent MIMA
to an extent, by providing self reference and trusty URIs even when presented in a distributed environment. Any manipulation to the
data is efficiently identified and any further access to that data is blocked by informing user that the uniform location has been
changed. System uses self-reference to contain trusty URI for each resource, lineage algorithm for generating seed and SHA-512 hash
generation algorithm to ensure security. It is implemented on the semantic web, which is an extension to the world wide web, using
RDF (Resource Description Framework) to identify the resource. Hence the framework was developed to overcome existing
challenges by making the digital artifacts on the semantic web distributed to enable communication between different domains across
the network securely and thereby preventing MIMA.
Automated Question Paper Generator And Answer Checker Using Information Retri...Sheila Sinclair
This document summarizes an automated system for generating question papers and checking answers. The system allows faculty to generate randomized question papers that cover selected chapters from a database of questions tagged by type, topic, difficulty level, and cognitive skill. It also proposes an automated scoring approach using Jaro-Winkler string similarity to compare a student's answer to a model answer and assign a score. The system aims to reduce the time and effort required for manual question paper generation and marking by faculty.
Feature selection for multiple water quality status: integrated bootstrapping...IJECEIAES
STORET is one method to determine the river water quality, and to classify them into four classes (very good, good, medium and bad) based on the data of water for each attribute or feature. The success of the formation of pattern recognition model much depends on the quality of data. There are two issues as the concern of this research as follows, the data having disproportionate amount among the classes (imbalance class) and the finding of noise on its attribute. Therefore, this research integrates the SMOTE Technique and bootstrapping to handle the problem of imbalance class. While an experiment is conducted to eliminate the noise on the attribute by using some feature selection algorithms with filter approach (information gain, rule, derivation, correlation and chi square). This research has some stages as follows: data understanding, pre-processing, imbalance class, feature selection, classification and performance evaluation. Based on the result of testing using 10-fold cross validation, it shows that the use of the SMOTE-bootstrapping technique is able to increase the accuracy from 83.3% to be 98.8%. While the process of noise elimination onthe data attribute is also able to increase the accuracy to be 99.5% (the use of feature subset produced by the information gain algorithm and the decision tree classification algorithm).
This document discusses techniques for detecting duplicate records from multiple web databases. It begins with an abstract describing an unsupervised approach that uses classifiers like the weighted component similarity summing classifier and support vector machine along with a Gaussian mixture model to iteratively identify duplicate records. The document then provides details on related work, including probabilistic matching models, supervised and unsupervised learning techniques, distance-based techniques, rule-based approaches, and methods for improving efficiency like blocking and the sorted neighborhood approach.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...IJDKP
Many applications of automatic document classification require learning accurately with little training
data. The semi-supervised classification technique uses labeled and unlabeled data for training. This
technique has shown to be effective in some cases; however, the use of unlabeled data is not always
beneficial.
On the other hand, the emergence of web technologies has originated the collaborative development of
ontologies. In this paper, we propose the use of ontologies in order to improve the accuracy and efficiency
of the semi-supervised document classification.
We used support vector machines, which is one of the most effective algorithms that have been studied for
text. Our algorithm enhances the performance of transductive support vector machines through the use of
ontologies. We report experimental results applying our algorithm to three different datasets. Our
experiments show an increment of accuracy of 4% on average and up to 20%, in comparison with the
traditional semi-supervised model.
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...ijaia
Feature selection and classification task are an essential process in dealing with large data sets that
comprise numerous number of input attributes. There are many search methods and classifiers that have
been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of
attributes and improve the classification accuracy by adopting ensemble rule classifiers method. Research
process involves 2 phases; finding the optimal set of attributes and ensemble classifiers method for
classification task. Results are in terms of percentage of accuracy and number of selected attributes and
rules generated. 6 datasets were used for the experiment. The final output is an optimal set of attributes
with ensemble rule classifiers method. The experimental results conducted on public real dataset
demonstrate that the ensemble rule classifiers methods consistently show improve classification accuracy
on the selected dataset. Significant improvement in accuracy and optimal set of attribute selected is
achieved by adopting ensemble rule classifiers method.
Feature selection in high-dimensional datasets is
considered to be a complex and time-consuming problem. To
enhance the accuracy of classification and reduce the execution
time, Parallel Evolutionary Algorithms (PEAs) can be used. In
this paper, we make a review for the most recent works which
handle the use of PEAs for feature selection in large datasets.
We have classified the algorithms in these papers into four main
classes (Genetic Algorithms (GA), Particle Swarm Optimization
(PSO), Scattered Search (SS), and Ant Colony Optimization
(ACO)). The accuracy is adopted as a measure to compare the
efficiency of these PEAs. It is noticeable that the Parallel Genetic
Algorithms (PGAs) are the most suitable algorithms for feature
selection in large datasets; since they achieve the highest accuracy.
On the other hand, we found that the Parallel ACO is timeconsuming
and less accurate comparing with other PEA.
Hybrid ga svm for efficient feature selection in e-mail classification
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol 3, No.3, 2012
Hybrid GA-SVM for Efficient Feature Selection in E-mail
Classification
Fagbola Temitayo* Olabiyisi Stephen Adigun Abimbola
Department of Computer Science & Engineering, Ladoke Akintola University of Technology,
PMB 4000, Ogbomoso, Oyo State, Nigeria
* E-mail of the corresponding author: cometoty@yahoo.com
Abstract
Feature selection is a problem of global combinatorial optimization in machine learning in which subsets of
relevant features are selected to realize robust learning models. The inclusion of irrelevant and redundant
features in the dataset can result in poor predictions and high computational overhead. Thus, selecting
relevant feature subsets can help reduce the computational cost of feature measurement, speed up learning
process and improve model interpretability. The SVM classifier has proven inefficient on large e-mail datasets, where it fails to produce accurate classification results while consuming substantial computational resources. In this study, a Genetic Algorithm-Support Vector Machine (GA-SVM) feature selection technique is developed to optimize the SVM classification parameters, the prediction accuracy and the computation time. The SpamAssassin dataset was used to validate the performance of the proposed system.
The hybrid GA-SVM showed remarkable improvements over SVM in terms of classification accuracy and
computation time.
Keywords: E-mail Classification, Feature-Selection, Genetic algorithm, Support Vector Machine
1. Introduction
E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of
everyday life for millions of people, changing the way we work and collaborate (Whittaker et al 2005). Its
profound impact is being felt on business growth and national development in a positive way and has also
proven success in enhancing peer communications. E-mail is one of the many technological developments
that have influenced our lives. However, the downside of this success is the constantly growing volume of unfiltered e-mail messages received by recipients. Thus, e-mail classification becomes a significant and growing problem for individuals and corporate bodies. E-mail classification tasks are often divided into several sub-tasks: first, data collection and representation of e-mail messages; second, e-mail feature selection and feature dimensionality reduction for the remaining steps of the task (Awad & ELseuofi 2011).
Finally, the e-mail classification phase of the process finds the actual mapping between training set and
testing set of the e-mail dataset.
The spam email problem is well-known, and personally experienced by anyone who uses email. Spam is
defined as junk e-mail message that is unwanted, delivered by the internet mail service (Chan 2008). It
could also be defined as an unsolicited, unwanted email that was sent indiscriminately, directly or indirectly,
by a sender having no current relationship with the recipient. Some of the spam e-mails are unsolicited
commercial and get-rich messages while others could contain offensive material. Spam e-mails can also
clog up the e-mail system by filling-up the server disk space when sent to many users from the same
organization. The goal of Spam Classification is to distinguish between spam and legitimate mail messages.
Technical solutions to detecting spam include filtering the sender’s address or header content but the
problem with filtering is that a valid message may be blocked sometimes; hence e-mail is better classified
as spam or non-spam based on features (Chan 2008).
The selection of features is flexible; examples include the percentage of words in the e-mail that match a specified word, the percentage of characters in the e-mail that match a specified character, and the average length of uninterrupted sequences of capital letters. Thus, the feature selection process can be considered a problem of global
combinatorial optimization in machine learning, which reduces the number of features, removes irrelevant,
noisy and redundant data, and results in acceptable classification accuracy. Feature subset selection can
involve random or systematic selection of inputs or formulation as an optimization problem which involves
searching the solution space of subsets for an optimal or near-optimal subset of features, according to a
specified criterion.
Feature selection methods seeking optimal subsets are usually directed toward one of two goals:
(1) minimize the number of features selected while satisfying some minimal level of classification
capability or
(2) maximize classification ability for a subset of prescribed cardinality. The feature selection process can
be made more efficient by optimizing its subset selection techniques through the use of some well-known
optimizers.
Techniques for feature selection are characterized either as filters, which ignore the classifier to be used, or
wrappers, which base selection directly on the classifier. Computationally more efficient than wrappers, a
filter approach performs subset selection based only on the feature qualities within the training data. Since
the classifier is ignored, there is no interaction between the biases of the feature selector and the classifier,
but the quality of the best filter subset is typically not as effective as a subset selected using a wrapper
model (John et al. 1994). Filters rate features based on general characteristics, such as interclass distance or
statistical independence, without employing any mining algorithms (Guyon & Elisseef 2003). Wrappers
select a feature subset based directly on the classifier. The training data are used to train the classifier using
different feature subsets; each is then evaluated using the testing data to find the best subset.
As a new approach to pattern recognition, SVM is based on Structural Risk Minimization (SRM), a concept
in which decision planes define decision boundaries. A decision plane is one that separates a set of objects
having different class memberships; the SVM finds an optimal hyperplane with the maximal margin to
separate the two classes and then classify the dataset. It is suitable for dealing with high-dimensional feature problems given only a finite amount of training data. Youn & McLeod (2006) used four classifiers
including Neural Network, SVM, Naïve Bayesian, and J48 to filter spams from the dataset of emails. All
the emails were classified as spam (1) or not (0). The authors reported that Neural Network and SVM did
not show good result compared with J48 or Naïve Bayesian classifier on large features. Based on this fact,
the authors concluded that Neural Network and SVM are not appropriate for classifying large email dataset.
The result of the study by Priyanka et al. (2010) on SVM for large dataset classification showed that SVM
is time and memory consuming when size of data is enormous.
Although SVMs (Chapelle et al. 1999; El-Naqa et al. 2002; Kim et al. 2002; Kim et al. 2003; Liyang et al. 2005a; Liyang et al. 2005b; Song et al. 2002; Vapnik 1995) are much more effective than other conventional nonparametric classifiers (e.g., neural networks, nearest-neighbor and k-NN classifiers) in terms of classification accuracy, computational time and stability to parameter settings, they are weak at classifying highly dimensional datasets with a large number of features (Andrew 2010). Liu et al. (2005) showed that the skewed distribution of the Yahoo! Directory and other large taxonomies with many extremely rare categories makes the classification performance of SVMs unacceptable. More substantial investigation is thus needed to improve SVMs and
other statistical methods for very large-scale applications. These drawbacks strongly undermine the performance of SVM. Consequently, an optimizer is required to reduce the number of feature vectors before the resultant feature vectors are introduced to SVM, and this forms the focus of this study.
Computational studies of Darwinian evolution and natural selection have led to numerous models for
solving optimization problems (Holland 1975; Fogel 1998). Genetic Algorithms comprise a subset of these evolution-based optimization techniques, focusing on the application of selection, mutation, and
recombination to a population of competing problem solutions (Holland 1975; Goldberg 1989). GA’s are
parallel, iterative optimizers, and have been successfully applied to a broad spectrum of optimization
problems, including many pattern recognition and classification tasks. De-Jong et al. (1993) produced
GABIL (Genetic Algorithm-Based Inductive Learning), one of the first general-purpose GAs for learning
disjunctive normal form concepts. GABIL was shown to produce rules achieving validation set accuracy
comparable to that of decision trees induced using ID3 and C4.5.
Furthermore, GAs have been applied to find an optimal set of feature weights that improve classification
accuracy (Chandra & Nandhini 2010) and have proven to be an effective computational method, especially in situations where the search space is uncharacterized, not fully understood, and/or highly dimensional (Ishibuchi & Nakashima 2000). Thus, GA is employed as an optimizer for the feature selection process of
SVM classifier in this study. The approach for feature selection can be divided into wrappers, filters and
embedded methods essentially. In this study, the spam problem is treated as a classification problem in a
wrapper manner; a GA-Based feature selection optimization technique for Support Vector Machine
classifier is proposed and developed to classify e-mail dataset as spam or ham.
2. Related Works
E-mail classification has been an active area of research. Cohen (1996) developed a propositional learning
algorithm RIPPER to induce ‘‘keyword-spotting rules’’ for filing e-mails into folders. The multi-class
problem was transformed into several binary problems by considering one folder versus all the others.
Cohen argued that keyword spotting rules are more useful as they are easier to understand, modify and can
be used in conjunction with user-constructed rules. Sahami (1998) applied NB for spam e-mail filtering
using bag of words to represent e-mail corpora and binary encoding. The performance improved by the
incorporation of hand-crafted phrases and domain-specific features such as the domain type of the sender
and the percentage of non-alphabetical characters in the subject. Rennie (2000) used Naïve-Bayes to file
e-mails into folders and suggested the three most suitable folders for each message. The system applies
stemming, removes stop words and uses document frequency threshold as feature selector.
Pantel et al. (1998) developed SpamCop: a spam classification and organization program. SpamCop is a
system for spam e-mail filtering also based on Naïve-Bayes. Both stemming and a dynamically created stop
word list are used. The authors investigated the implications of the training data size, different ratios of
spam and non-spam e-mails, use of trigrams instead of words and also showed that SpamCop outperformed
Ripper. MailCat et al. (1999) uses a nearest-neighbor (k-NN) technique and tf-idf representation to file
e-mails into folders. K-NN supports incremental learning but requires significant time for classification of
new e-mails. Androutsopoulos et al. (2000) found that Naïve-Bayes and a k-NN technique called TiMBL
clearly outperform the keyword-based spam filter of Outlook 2000 on the LingSpam corpora. Ensembles of
classifiers were also used for spam filtering. Sakkis et al. (2001) combined a NB and k-NN by stacking and
found that the ensemble achieved better performance. Carrera & Marquez (2001) showed that boosted trees
outperformed decision trees, NB and k-NN. Rios & Zha (2004) applied RF for spam detection on time
indexed data using a combination of text and metadata features. For low false positive spam rates, RF was
shown to be overall comparable with SVM in classification accuracy.
Koprinska et al. (2007) worked on supervised e-mail classification by:
(1) applying RF for both filing e-mails into folders and filtering spam e-mails, comparing RF with
a number of state-of-the-art classifiers and showing that it is a very good choice in terms of
accuracy, running and classification time, and simplicity to tune,
(2) introducing a new feature selector that is accurate and computationally efficient,
(3) studying the portability of an anti-spam filter across different users, and
(4) comparing the performance of a large number of algorithms on the benchmark corpora for
spam filtering using the same version of the data and the same pre-processing.
Okunade & Longe (2011) developed an improved electronic mail classification using hybridized root word
extractions. They employed word Stemming/Hashing combined with Bayesian probability as an approach
to ensure accurate classification of electronic mails in the electronic mail infrastructure. The authors argued
that Content based spam filters only work well if the suspicious terms are lexically correct; and that
spammers deceive content based filters by rearranging suspicious terms to fool such filters.
In order to effectively track such deceptions, the suspicious terms used by the authors were valid, correctly spelled and grammatically correct terms, which could otherwise result in false positives. They
developed the hybridized technique which could detect the modified suspicious terms of the harmful
messages by examining the base root of the misspelled or modified word and reconverting them to the
correct tokens or near-correct tokens. Results from the implementation of the technique indicated a reduction of false positives, thus improving e-mail classification.
3. Materials and Method
This section presents the statement of problem, feature representation concept, support vector machine,
genetic algorithm, genetic algorithm for support vector machine parameters optimization, the e-mail dataset
features, performance evaluation metrics and the simulation tool.
3.1 Problem Statement
E-mail classification (spam filtering) is a supervised learning problem. It can be formally stated as follows.
Given a training set of labeled e-mail documents $\{(d_1, c_1), \ldots, (d_n, c_n)\}$, where $d_i$ is an e-mail document from a document set $D$ and $c_i$ is a label chosen from a predefined set of categories $C$. It should be noted that effective feature selection is essential to making the learning task more efficient and faster.
Consequently, the goal of this study is to optimize the feature selection technique of the SVM hypothesis (classifier) so that it accurately classifies new, unseen e-mail documents, where $C$ contains two labels: spam and non-spam (legitimate).
3.2 Feature Representation
A feature is a word. In this study, $w_i$ refers to a word and $x$ is a feature vector composed of the various words from an e-mail message, formed by analyzing the contents of the e-mail. There is one feature vector per e-mail message. There are various alternatives and enhancements in constructing the $x$ vectors.
(a) Term Frequency
The ith component of the feature vector is the number of times that word wi appears in that email message.
In this study, a word is a feature only if it occurs three or more times in a message or messages. This
prevents misspelled words and words used rarely from appearing in the dictionary.
(b) Use of a Stop List
A stop list was formed using common words like "of," "and," "the," etc. While forming a feature vector, words on the stop list were not used. This is because common words are not very useful in classification.
However, the learning algorithm itself determined whether a particular word is important or not. The choice of words put on the stop list is a function of the classification task, as is the use of word stemming: words such as "work", "worker" and "working" are shortened to the word stem "work." Doing this, the size
of the feature vector is reduced, as redundant and irrelevant features are removed.
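The construction described in (a) and (b) can be sketched as follows. The paper's implementation was in MATLAB; here is an illustrative Python sketch in which the sample stop list, the crude suffix-stripping stand-in for a real stemmer, and all names are assumptions, not the paper's actual code:

```python
# Illustrative sketch of Section 3.2: build a term-frequency feature
# vector per message, dropping stop words and rare words, with a very
# crude stand-in for stemming. All choices here are assumptions.
from collections import Counter

STOP_LIST = {"of", "and", "the", "a", "to", "in"}  # assumed sample

def crude_stem(word):
    # toy suffix stripping: "working"/"worker" -> "work"
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def feature_vector(message, min_count=3):
    words = [crude_stem(w) for w in message.lower().split()
             if w.isalpha() and w not in STOP_LIST]
    counts = Counter(words)
    # keep a word as a feature only if it occurs >= min_count times
    return {w: c for w, c in counts.items() if c >= min_count}

print(feature_vector("work working worker buy buy buy the the the the"))
# -> {'work': 3, 'buy': 3}
```

A production system would use a real stemmer (e.g., Porter's algorithm) and a corpus-derived stop list, but the shape of the output, one sparse term-frequency vector per message, is the same.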
3.3 Support Vector Machine (SVM)
Support Vector Machine (SVM) is a machine learning algorithm that is useful in solving classification
problems. A classification problem typically involves a number of data samples, each of which is
associated with a class (or label) and some features (or attributes). Given a previously unseen sample,
the problem is to predict its class by looking at its features. A support vector machine solves this
problem by first building a model from a set of data samples with known classes (i.e., the training set), and then using the model to predict the classes of data samples whose classes are unknown.
To evaluate the performance of an SVM, a testing set of data samples is given with only the features.
The accuracy of the SVM can be defined as the percentage of the testing samples with correctly
predicted class labels. In essence, given the training set with class labels and features, a support vector
machine treats all the features of a data sample as a point in a high dimensional space, and tries to
construct hyperplanes to divide the space into partitions. Points with different class labels are
separated while those with the same class label are kept in the same partition.
There are two classes:

$y_i \in \{+1, -1\}$, (1)

and there are $N$ labeled training examples:

$(x_1, y_1), \ldots, (x_N, y_N)$, $x_i \in \mathbb{R}^d$, (2)

where $d$ is the dimensionality of the vector.
If the two classes are linearly separable, then one can find an optimal weight vector $w^*$ such that $\|w^*\|^2$ is minimum; and:

$w^* \cdot x_i - b \ge +1$ if $y_i = +1$, (3)
$w^* \cdot x_i - b \le -1$ if $y_i = -1$, (4)

or equivalently such that:

$y_i (w^* \cdot x_i - b) \ge 1$. (5)

Training examples that satisfy the equality are termed support vectors. The support vectors define two hyperplanes, one that goes through the support vectors of one class and one that goes through the support vectors of the other class. The distance between the two hyperplanes defines a margin, and this margin is maximized when the norm of the weight vector $\|w^*\|$ is minimum.
This minimization can be performed by maximizing the following function with respect to the variables $\alpha_i$:

$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ (6)

subject to the constraints $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$, where it is assumed that there are $N$ training examples, $x_i$ is one of the training vectors, and $\cdot$ represents the dot product. If $\alpha_i > 0$ then $x_i$ is termed a support vector and $\alpha_i$ is a multiplication factor (Lagrange multiplier).
For an unknown vector $x_j$, classification then corresponds to finding:

$F(x_j) = \mathrm{sign}(w^* \cdot x_j - b)$, (7)

where $w^* = \sum_i \alpha_i y_i x_i$, (8)

and the sum is over the nonzero support vectors (whose $\alpha_i$'s are nonzero). There are two reasons for believing that the training time for SVMs with binary features can be reduced. The first is that the vectors are binary and the second is that the feature vectors are sparse (typically only 4% of a vector is nonzero). This allows the dot product in the SVM optimization to be replaced with faster non-multiplicative routines.
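The maximal-margin idea can be illustrated with a tiny, self-contained sketch. Instead of solving the dual quadratic program for the $\alpha_i$, this stand-in minimizes the equivalent regularized hinge loss by stochastic subgradient descent; the paper's own implementation was in MATLAB, and the sparse-dict representation and hyperparameter values here are assumptions for the toy example:

```python
# Illustrative linear SVM in pure Python: minimizes the regularized hinge
# loss by stochastic subgradient descent, which yields the same kind of
# maximal-margin hyperplane as the dual formulation above. Sparse feature
# vectors are dicts {feature_index: value}; labels are +1 / -1.
import random

def train_linear_svm(X, y, dim, lam=0.01, lr=0.1, epochs=200):
    w = [0.0] * dim  # weight vector defining the separating hyperplane
    b = 0.0
    data = list(zip(X, y))
    for _ in range(epochs):
        random.shuffle(data)
        for xi, yi in data:
            margin = yi * (sum(w[j] * v for j, v in xi.items()) + b)
            for j in range(dim):          # shrink: gradient of (lam/2)*||w||^2
                w[j] -= lr * lam * w[j]
            if margin < 1:                # inside the margin: hinge subgradient
                for j, v in xi.items():
                    w[j] += lr * yi * v
                b += lr * yi
    return w, b

def predict(w, b, xi):
    score = sum(w[j] * v for j, v in xi.items()) + b
    return 1 if score >= 0 else -1
```

On a toy two-feature problem (feature 0 indicating spam, feature 1 indicating ham) the learned weights separate the classes; a dual solver would recover the same hyperplane from the nonzero $\alpha_i$ of the support vectors.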
3.4 Genetic Algorithm (GA)
There are three major design decisions to consider when implementing a GA to solve a particular problem.
A representation for candidate solutions must be chosen and encoded on the GA chromosome, fitness
function must be specified to evaluate the quality of each candidate solution, and finally the GA run
parameters must be specified, including which genetic operators to use, such as crossover, mutation and selection, and their probabilities of occurrence. In general, the initial population is generated from the SpamAssassin email dataset of 6,000 email messages.
GA Parameters
A GA approach was used to select a good finite feature subset for the SVM classifier. The parameters used, with their corresponding default values, are presented in Table 1.
3.5 SVM Parameters Optimization using GA
In addition to feature selection, proper parameter settings can improve the SVM classification accuracy. The choice of $C$ and the kernel parameter $\gamma$ is important to obtain a good classification rate. In most cases these parameters are tuned manually; in order to automate this choice, we use genetic algorithms. The SVM parameters $C$ and $\gamma$ are real, so we encode them with binary chains by fixing two search intervals, one for each parameter:

$C \in [C_{\min}, C_{\max}]$ (9)

and $\gamma \in [\gamma_{\min}, \gamma_{\max}]$. (10)

Thus, a 32-bit encoding scheme of $C$ is given by the binary chain $C_{b_1}, \ldots, C_{b_{32}}$ where

$C = C_{\min} + d_C \, \Delta_C$ (11)

and $\gamma$ is given by $\gamma_{b_1}, \ldots, \gamma_{b_{32}}$ where

$\gamma = \gamma_{\min} + d_\gamma \, \Delta_\gamma$ (12)

with $\Delta_C = (C_{\max} - C_{\min}) / (2^{32} - 1)$ (13)

and $\Delta_\gamma = (\gamma_{\max} - \gamma_{\min}) / (2^{32} - 1)$ (14)

where $d_C = \sum_{i=1}^{32} C_{b_i} 2^{i-1}$ and $d_\gamma = \sum_{i=1}^{32} \gamma_{b_i} 2^{i-1}$ are the decimal values of the chains. (15)
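Under this encoding, a binary chain maps linearly onto its search interval. A minimal sketch of the decoding step follows; the interval bounds in the example are placeholders, not the paper's actual search ranges:

```python
# Decode a binary chain into a real parameter value over a fixed search
# interval: value = lo + d(bits) * (hi - lo) / (2^n - 1), with bit b1
# taken as least significant. Interval bounds are illustrative.
def decode_chain(bits, lo, hi):
    """bits: sequence of 0/1 of length n; maps {0..2^n - 1} onto [lo, hi]."""
    n = len(bits)
    d = sum(b * (2 ** i) for i, b in enumerate(bits))
    return lo + d * (hi - lo) / (2 ** n - 1)
```

The all-zeros chain decodes to the lower bound and the all-ones chain to the upper bound, so the GA can reach any value in the interval to 32-bit resolution.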
Fitness Function
Fitness function was developed to evaluate the effectiveness of each individual in a population; it has an
individual as an input and then returns a numerical evaluation that must represent the goodness of the
feature subset. The fitness function that was used to evolve the chromosomes population is the SVM
classification accuracy. The goal was to see whether the GA would maximize this function effectively. Among the reasons why SVM benefits from combined feature selection are its high computational cost and inaccurate classification results on large datasets. In addition, SVMs are formulated as a quadratic programming problem, and it is therefore difficult to use SVMs to perform feature selection directly.
The main steps of the proposed GA-SVM are as follows:
1. E-mail feature extraction.
2. Using Genetic Algorithms to generate and select both the optimal feature subset and SVM
parameters at the same time.
3. Classification of the resulting features using SVM.
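The wrapper loop of step 2 can be sketched as follows. The paper's implementation was in MATLAB; in this illustrative Python version the population size, operator rates, and the elitist selection scheme are assumptions, and the `fitness` callable stands in for training an SVM on the masked features and returning its classification accuracy:

```python
# Sketch of the GA wrapper loop: each chromosome is a bit mask over the
# features (1 = keep feature); fitness would normally be the validation
# accuracy of an SVM trained on the masked features. All parameter
# values and the elitist scheme are illustrative assumptions.
import random

def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        scored = [(fitness(c), c) for c in pop]
        scored.sort(key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best = scored[0][0], scored[0][1][:]
        parents = [c for _, c in scored[:pop_size // 2]]  # elitist selection
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < crossover_rate:          # one-point crossover
                cut = random.randrange(1, n_features)
                a = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in a]                          # bit-flip mutation
            children.append(child)
        pop = children
    return best, best_fit
```

In the actual GA-SVM, the chromosome would also carry the 32-bit chains for $C$ and $\gamma$, so that the feature subset and the SVM parameters are selected at the same time.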
3.6 The E-Mail Dataset
In this study, the SpamAssassin dataset, consisting of 6,000 emails with a spam rate of 67.04%, was used to
test the performance of the proposed system. The dataset has two folders containing spam and ham
messages. The corpus was divided into training and testing sets. Each training set contained 75% of the original set while each testing set contained the remaining 25%. The emails are in a bag-of-words vector space representation known as the feature vector dictionary, in which attributes are the term frequencies of the words. Words occurring fewer than three times, as well as suspicious tokens, are removed from the data set, resulting in a dictionary size of about 120,000 words. The data set files are in the sparse data format and were manually modified to the form suitable for this study. These e-mail datasets are freely downloadable at www.spamassassins.org.
Algorithm for Processing the Dataset
Step 1: The raw mails (both training and testing) are converted to .xlsx format, in which each row corresponds to one email and each cell contains the index number of the feature vector as indicated in the feature vector dictionary.
Step 2: The appropriate label for each row of the entire email dataset is determined as either spam or non-spam (-1 for non-spam, 1 for spam).
Step 3: The 6,000 most frequent words from the spam mails and from the ham mails are taken and mixed together to form around 7,000 most frequent words.
Step 4: A unique integer was assigned to each word to act as our features for classification.
Step 5: 75% of the entire dataset is randomly selected for training and the remaining 25% for testing.
Step 6: The dataset is passed to the appropriate classifier as required.
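The labelling and splitting steps above can be sketched as follows; the 75/25 split and the ±1 labels come from the text, while the function name, field layout, and seed are illustrative assumptions:

```python
# Illustrative sketch of the labelling and 75/25 split steps: spam
# messages get label 1, ham messages label -1, then a seeded random
# shuffle produces the training and testing partitions.
import random

def label_and_split(spam_msgs, ham_msgs, train_frac=0.75, seed=7):
    data = [(msg, 1) for msg in spam_msgs] + [(msg, -1) for msg in ham_msgs]
    rng = random.Random(seed)
    rng.shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]
```

Fixing the seed makes the split reproducible, so the SVM and GA-SVM runs being compared see the same training and testing messages.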
3.7 Performance Evaluation Metrics
The two performance evaluation metrics considered in this study are classification accuracy and the
computation time.
Accuracy (A) is the percentage of all emails that are correctly categorized. Simply put,

$A = \dfrac{n_{L \to L} + n_{S \to S}}{N_L + N_S}$ (16)

where $n_{L \to L}$ and $n_{S \to S}$ are the numbers of messages that have been correctly classified as legitimate email and spam email respectively; $n_{L \to S}$ and $n_{S \to L}$ are the numbers of legitimate and spam messages that have been misclassified; and $N_L = n_{L \to L} + n_{L \to S}$ and $N_S = n_{S \to S} + n_{S \to L}$ are the total numbers of legitimate and spam messages to be classified.
Computational time is total finite time in seconds for a program to compile, run and display results
after being loaded.
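The accuracy metric reduces to a ratio of the four confusion counts; the variable names in this small sketch are illustrative:

```python
# Accuracy as a percentage: correctly classified legitimate plus
# correctly classified spam, over all messages classified.
def accuracy(n_ll, n_ss, n_ls, n_sl):
    """n_ll: legit classified legit, n_ss: spam classified spam,
    n_ls: legit misclassified as spam, n_sl: spam misclassified as legit."""
    total = n_ll + n_ls + n_ss + n_sl
    return 100.0 * (n_ll + n_ss) / total

print(accuracy(450, 900, 50, 100))  # -> 90.0
```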
3.8 Simulation Tool
The programming tool used to implement the algorithms is MATLAB. This is because MATLAB is a very
powerful computing system for handling calculations involved in scientific and engineering problems. The
name MATLAB stands for MATrix LABoratory. With MATLAB, computational and graphical tools to
solve relatively complex science and engineering problems can be designed, developed and implemented.
Specifically, MATLAB 2007b was used for the development.
4. Results
In this study, two classification methods (Support Vector Machine classifier and the optimized Support
Vector Machine classifier) were evaluated. The performance metrics used include accuracy and
computation time. The dataset used contained 6,000 emails which is large enough for the study. 4,500
emails were used to train the SVM classifier and 1,500 emails were used to test it. The results obtained are
presented in Fig. 2, 3, 4 and 5. After the evaluation of SVM and GA-SVM, the result obtained is
summarized and presented in Table 2. GA-SVM yielded a higher classification accuracy of 93.5% within a
lesser computational time of 119.562017 seconds while SVM yielded a classification accuracy of 90%
within a computational time of 149.9844 seconds. This shows that the hybrid GA-SVM algorithm yielded better classification accuracy than SVM within a reasonable computational time, thereby addressing the drawbacks of SVM. These results confirm the findings of Priyanka et al. (2010)
and Andrew (2010) on SVM drawbacks; and also the work of Ishibuchi & Nakashima (2000) and Chandra
& Nandhini (2010) on GA’s ability to find an optimal set of feature weights that improve classification rate,
and as an effective computational search method especially in situations where the search space is
uncharacterized, not fully understood, or highly dimensional.
5. Conclusion and Future Works
In this study, two classifiers, SVM and GA-SVM were tested to filter spams from the spamassassin
dataset of emails. All the emails were classified as spam (1) or legitimate (-1). GA is applied to
optimize the feature subset selection and classification parameters for SVM classifier. It eliminates the
redundant and irrelevant features in the dataset, and thus reduces the feature vector dimensionality
drastically. This helps SVM to select optimal feature subset from the resulting feature subset. The
resultant system is called GA-SVM. GA-SVM achieves a higher recognition rate using only a small feature subset. The hybrid system has shown a significant improvement over SVM in terms of classification
accuracy as well as the computational time in the face of a large dataset. Future research work should
extend GA-SVM to allow for filtering multi-variable classification problems. Also, a number of other
optimization algorithms should be introduced to SVM and other classifiers to investigate the
performance of these optimizers on the classification accuracy and computation time of the resulting
system over large dataset. Performance of such integrated systems can be evaluated when compared
with the existing ones. Evaluation of the performance of such integrated systems on varying sizes of
features could also be carried out. An ensemble of classifiers can also be integrated together and
optimized using any optimizer of choice. Performance evaluation of the ensemble of classifiers can
also be investigated, with or without the optimizer and put to test over small and large dataset, to
evaluate the classification accuracy and as well as the computational time.
References
Andrew, W. (2010), “Statistical Pattern Recognition”, London: Oxford University Press.
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., & Stamatopoulos, P. (2000), "Learning to filter spam e-mail: a comparison of a Naive Bayesian and a memory-based approach", in: Proc. 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 1–13.
Awad, W., A., & ELseuofi, S., M. (2011), “Machine Learning Methods for Spam E-mail Classification”,
International Journal of Computer Science & Information Technology (IJCSIT) 3(1).
Carrera, X. & Marquez, L. (2001), “Boosting trees for anti-spam email filtering”, in: Proc. 4th International
Conference on Recent Advances in Natural Language Processing.
Chan, R. (2008), “A novel evolutionary data mining algorithm with applications to GA prediction”, IEEE
Transactions on Evolutionary Computation 7(6), 572-584.
Chandra, E., & Nandhini, K. (2010), “Learning and Optimizing the Features with Genetic Algorithms”,
International Journal of Computer Applications 9(6).
Chapelle, O., Haffner, P., & Vapnik, V. (1999), “Support Vector Machines for Histogram-based Image
Classification”, IEEE Transactions on Neural Networks 10(5), 1055–1064.
Cohen, W. (1996), “Learning rules that classify e-Mail”, in: Proc. AAAI Symposium on Machine Learning
in Information Access, 18–25.
De-Jong, K., A., Spears, W., M., & Gordon, F., D. (1993), “Using genetic algorithms for concept learning”,
Machine Learning 13, 161-188.
El-Naqa, I., Yongyi, Y., Wernick, M., N., Galatsanos, N., P., & Nishikawa, R., M. (2002), “A support Vector
Machine approach for Detection of Microcalcifications”, Medical Imaging 21(12), 1552–1563.
Fogel, D., B. (1998), “Evolutionary Computation: The Fossil Record”, New York: IEEE Press.
Goldberg, D. (1989), “Genetic Algorithms in Search, Optimization, and Machine Learning Reading”, MA:
Addison-Wesley.
Guyon, I., & Elisseef, A. (2003), “An Introduction to Variable and Feature Selection”, Journal of Machine
Learning Research, vol. 3, 1157-1182.
Holland, J., H. (1975) “Adaptation in Natural and Artificial Systems”, Ann Arbor, MI: University of
Michigan Press.
Ishibuchi, H., & Nakashima, T. (2000), “Multi-objective pattern and feature selection by a genetic
algorithm”, Proc. of Genetic and Evolutionary Computation Conference (Las Vegas, Nevada, U.S.A.),
1069-1076.
John, G., H., Kohavi, R., & Pfleger, K. (1994), "Irrelevant features and the subset selection problem", in: Proc. 11th Int. Conf. on Machine Learning, 121-129.
Kim, K., I., Jung, K., Park, S. H. & Kim, J. H. (2002), “Support Vector Machines for Texture
Classification”, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(11), 1542–1550.
Kim, K., I., Jung, K., & Kim, J., H. (2003), “Texture-Based Approach for Text Detection in Images using
Support Vector Machines and Continuously Adaptive Mean Shift Algorithm”, IEEE Transactions on Pattern
Analysis and Machine Intelligence 25(12), 1631–1639.
Koprinska, I., Poon, J., Clark, J. & Chan, J. (2007), “Learning to classify e-mail”, Information Sciences 177,
2167–2187.
Liu, T., Yang, Y., Wan, H., Zeng, H., Chen, Z., & Ma, W. (2005), “Support Vector Machines Classification
with a Very Large-scale Taxonomy”, SIGKDD Explorations 7(1), 36 -43.
Liyang, W., Y., Yongyi, R., M., Nishikawa, M., N., & Wernick, A., E. (2005a), “Relevance vector machine
for automatic detection of clustered microcalcifications”, IEEE Transactions on Medical Imaging 24(10),
1278–1285.
Liyang, W., Y., Yongyi, R., M., Nishikawa, M., N., & Yulei, J. (2005b), “A study on several
machine-learning methods for classification of malignant and benign clustered microcalcifications”, IEEE
Transactions on Medical Imaging, 24(3), 371–380.
MailCat, M., Kuposde, L., & Sigre, T. (1999), “K-NN technique for email filing”, Communications of the
ACM, 41(8):86-89.
Okunade, O., & Longe, O., B. (2011), “Improved Electronic Mail Classification Using Hybridized Root
Word Extractions,” Proceedings of ICT4A, 23-26 March, Nigeria: Covenant University & Bells University
of Technology, Ota, Ogun State, Nigeria. 3(1), 49-52.
Pantel, P., Met, S., & Lin, D. (1998), “SpamCop: a spam classification and organization program”, in: Proc.
AAAI Workshop on Learning for Text Categorization.
Priyanka, C., Rajesh, W., & Sanyam, S. (2010), “Spam Filtering using Support Vector Machine”, Special
Issue of IJCCT 1(2, 3, 4), 166-171.
Rennie, J. (2000), “An application of machine learning to e-mail filtering”, in: Proc. KDD-2000 Text
Mining Workshop.
Rios, G., & Zha, H. (2004), “Exploring support vector machines and random forests for spam detection”, in:
Proc. First International Conference on Email and Anti Spam (CEAS).
25
10. Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol 3, No.3, 2012
Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998), “A Bayesian approach to filtering junk
e-mail”, in: Proc. AAAI Workshop on Learning for Text Categorization.
Sakkis, G., Androutsopoulos, I., Palioras, G., Karkaletsis, V., Spyropoulos, C., & Stamatopoulos, C. (2001)
“Stacking classifiers for anti-spam filtering of e-mail”, in: Proc. 6th Conference on Empirical Methods in
Natural Language Processing, 44–50.
Song, Q., Hu, W., & Xie, W. (2002), “Robust support vector machine with bullet hole image classification”,
IEEE Transactions on System, Man and Cybernetics, Part C, 32(4), 440–448.
Vapnik, V. (1995), “The nature of statistical learning theory”, New York: Springer Verlag.
Whittaker, Bellotti, V., & Moody, P. (2005), “Introduction to this special issue on revisiting and reinventing
e-mail”, Human-Computer Interaction 20, HCI 1–9.
Youn, & McLeod (2006), “A Comparative Study for Email Classification Group”, doi: 10.1.1.21.1027,
Sir-lab.usc.edu.
Figure 1. Proposed GA-SVM System for Feature Subset Selection and Classification
Figure 1 presents the proposed GA-SVM conceptual model for generating the optimal feature subset and the best-fit SVM parameters.
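The GA-driven feature-subset search that Figure 1 depicts can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a toy dataset, and a nearest-centroid scorer stands in for the SVM fitness evaluation; the GA operators (one-point crossover, bit-flip mutation, elitism) are common defaults rather than the authors' exact choices.

```python
import random

random.seed(0)

# Toy dataset: 8 features; only features 0 and 1 carry the class signal,
# the remaining 6 are noise (standing in for redundant email features).
def make_data(n=200):
    X, y = [], []
    for _ in range(n):
        label = random.randint(0, 1)
        row = [label + random.gauss(0, 0.3),
               label + random.gauss(0, 0.3)]
        row += [random.gauss(0, 1) for _ in range(6)]
        X.append(row)
        y.append(label)
    return X, y

def fitness(mask, X, y):
    """Training accuracy of a nearest-centroid classifier on the selected
    features -- a lightweight stand-in for the SVM fitness in the paper."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    cent = {}
    for c in (0, 1):
        rows = [X[r] for r in range(len(X)) if y[r] == c]
        cent[c] = [sum(row[i] for row in rows) / len(rows) for i in idx]
    correct = 0
    for row, label in zip(X, y):
        d = {c: sum((row[i] - cent[c][k]) ** 2 for k, i in enumerate(idx))
             for c in (0, 1)}
        correct += min(d, key=d.get) == label
    return correct / len(X)

def ga_select(X, y, n_feats=8, pop=30, gens=20, cx_rate=0.8, mut_rate=0.1):
    """Evolve a population of binary feature masks toward high fitness."""
    popn = [[random.randint(0, 1) for _ in range(n_feats)] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(popn, key=lambda m: fitness(m, X, y), reverse=True)
        nxt = [m[:] for m in scored[:2]]            # elitism: keep the 2 best
        while len(nxt) < pop:
            a, b = random.sample(scored[:10], 2)    # select among the fittest
            child = a[:]
            if random.random() < cx_rate:           # one-point crossover
                point = random.randrange(1, n_feats)
                child = a[:point] + b[point:]
            # bit-flip mutation on each gene
            child = [bit ^ (random.random() < mut_rate) for bit in child]
            nxt.append(child)
        popn = nxt
    return max(popn, key=lambda m: fitness(m, X, y))

X, y = make_data()
best = ga_select(X, y)
print("selected mask:", best, "fitness:", round(fitness(best, X, y), 3))
```

On this toy problem the GA reliably keeps at least one of the two informative features and discards most of the noise, which is the behaviour the hybrid model relies on to reduce the SVM's input dimensionality.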
Figure 2. Results obtained using SVM.
Figure 2 shows the results obtained using SVM. The accuracy and computation time obtained are 90% and 149.9844 s, respectively.
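The two metrics reported here, classification accuracy and computation time, can be measured with a simple harness like the one below. The timing approach is an assumption; the paper does not state how its elapsed times were recorded.

```python
import time

def evaluate(classify, X, y):
    """Return (accuracy, elapsed seconds) for a classifier over a dataset."""
    start = time.perf_counter()
    preds = [classify(row) for row in X]       # classify every sample
    elapsed = time.perf_counter() - start
    accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
    return accuracy, elapsed

# Toy stand-in classifier: flag a row as spam (1) when its first value > 0.5.
X = [[0.9], [0.1], [0.7], [0.2]]
y = [1, 0, 1, 0]
acc, secs = evaluate(lambda row: int(row[0] > 0.5), X, y)
print(f"accuracy = {acc:.0%}, time = {secs:.4f} s")
```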
Figure 3. GA-SVM Loading Email Dataset.
Figure 3 shows the GA-SVM algorithm loading the SpamAssassin email dataset for training and testing.
Figure 4. GA-SVM running.
Figure 4 shows GA-SVM generating results at run time.
Figure 5. Results obtained using GA-SVM.
Figure 5 shows the results obtained using GA-SVM. The accuracy and computation time obtained are 93.5% and 119.5620 s, respectively.
Table 1. Parameters Used for the Genetic Process

Parameter               Default Value   Meaning
Population Size         2250            Number of chromosomes created in each generation
Crossover Rate          0.8             Probability of crossover
Mutation Rate           0.1             Probability of mutation
Number of Generations   100             Maximum number of generations

Table 1 shows the GA parameters used for optimal feature selection with the SVM classifier, together with their default values.
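The settings in Table 1 could be captured in a small configuration object, as sketched below; the field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GAParams:
    """GA settings mirroring Table 1 (frozen so a run's config is immutable)."""
    population_size: int = 2250   # chromosomes created in each generation
    crossover_rate: float = 0.8   # probability of crossover
    mutation_rate: float = 0.1    # probability of mutation
    max_generations: int = 100    # maximum number of generations

params = GAParams()
print(params)
```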
Table 2. Summary of Results Obtained

Classifier   Classification Accuracy (%)   Computation Time (s)
SVM          90.0                          149.9844
GA-SVM       93.5                          119.5620

Table 2 summarizes the results obtained from the evaluation of SVM and GA-SVM.
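The improvement in Table 2 can be quantified directly from the reported figures: GA-SVM gains 3.5 accuracy points and cuts computation time by roughly a fifth.

```python
# Figures taken from Table 2.
svm_acc, svm_time = 90.0, 149.9844
ga_acc, ga_time = 93.5, 119.5620

acc_gain = ga_acc - svm_acc                        # percentage points
time_cut = (svm_time - ga_time) / svm_time * 100   # percent reduction

print(f"accuracy gain: {acc_gain:.1f} points, time reduced by {time_cut:.1f}%")
# → accuracy gain: 3.5 points, time reduced by 20.3%
```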