This document summarizes a research paper on using correlation-based feature subset selection to improve the accuracy of Bayesian spam detection. The researchers use feature subset selection to identify the features of spam emails most relevant to classification while removing redundant ones, raising the accuracy of a naïve Bayesian classifier from 65-74% to over 80%. Correlation-based feature subset selection works by choosing features that are highly correlated with the class (spam or not spam) but uncorrelated with one another. Applying this method to a spam email dataset, the researchers achieve over 92% detection accuracy with a Bayesian network classifier after feature subset selection, an improvement over using the classifier alone.
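The selection criterion described above (high feature-class correlation, low feature-feature correlation) can be sketched as a greedy loop. This is a minimal illustration, not the paper's implementation; the feature names and toy data are invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def cfs_greedy(features, labels, k):
    """Greedily pick k features that correlate with the class label
    but not with the features already selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def merit(name):
            relevance = abs(pearson(features[name], labels))
            # penalize redundancy with the worst already-selected feature
            redundancy = max((abs(pearson(features[name], features[s]))
                              for s in selected), default=0.0)
            return relevance - redundancy
        best = max(remaining, key=merit)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: three candidate features over six emails (label 1 = spam).
labels = [1, 1, 1, 0, 0, 0]
features = {
    "spam_words": [5, 4, 6, 0, 1, 0],      # tracks the class, redundant with exclaims
    "exclaims":   [4, 5, 5, 1, 0, 1],      # tracks the class
    "length":     [10, 80, 20, 70, 15, 90],  # weakly correlated but not redundant
}
print(cfs_greedy(features, labels, 2))  # → ['exclaims', 'length']
```

Note how the redundant feature is skipped in favor of a weaker but independent one; that is the essence of the correlation-based criterion.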
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Spam messages are unwanted, unsolicited emails sent in bulk to numerous recipients. The growing penetration of spam into electronic processors and communication equipment such as computers and mobile phones, the lack of control over information shared on the internet and other communication networks, and the inefficiency of existing spam-detection methods for Persian text are among the main challenges facing Persian-speaking users. This paper presents a novel and efficient method for thematic identification of Persian spam. The proposed method identifies both Persian spam and “Penglish” spam; “Penglish”, a blend of the words Persian and English, denotes Persian text written with English alphabetic letters. In an experimental analysis of 10,000 spam messages of different types, the efficiency of the proposed method was evaluated at more than 98%. The method can also update its databases by taking advantage of feedback received from users.
Analysis of an image spam in email based on content analysis (ijnlc)
Researchers initially addressed spam detection as a text classification or categorization problem. However, as spammers continue to develop new techniques and email content becomes more disparate, text-based anti-spam approaches alone are no longer sufficient to prevent spam. In an attempt to defeat anti-spam technologies, spammers have recently adopted the image-spam trick, which makes scrutiny of an email's body text ineffective. The main idea behind this project is to design a spam-detection system able to analyze the content of emails, in particular artificially generated images sent as attachments. The system analyzes the image content, classifies the embedded image as spam or legitimate, and classifies the email accordingly.
Nowadays the Short Message Service (SMS) is the most popular way for mobile users to communicate, because it is the cheapest mode of communication. SMS transmits short messages of around 160 characters to devices such as smartphones, cellular phones, and PDAs using standardized communication protocols. The amount of SMS spam is increasing: the growth in mobile phone users has led to a dramatic rise in spam messages, which should be put into the spam folder rather than the inbox. SMS filtering techniques are used to address this problem. The proposed approach filters SMS spam on an independent mobile phone, on a large dataset, with acceptable processing time. Several approaches can automatically detect and remove most of these messages; the best known are based on Bayesian decision theory and Support Vector Machines. Riya Mehta and Ankita Gandhi, "A Survey: SMS Spam Filtering", published in the International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN 2456-6470, Volume 2, Issue 3, April 2018. URL: http://www.ijtsrd.com/papers/ijtsrd12850.pdf http://www.ijtsrd.com/computer-science/data-miining/12850/a-survey-sms-spam-filtering/riya-mehta
Naïve Bayes is a probabilistic machine learning algorithm that can be used in a wide variety of classification tasks.
Typical applications include spam filtering, document classification, and sentiment prediction.
Naïve Bayes classifiers are a popular statistical technique for email filtering.
Bayes' rule: P(H | E) = P(E | H) × P(H) / P(E).
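Applied to spam filtering, Bayes' rule lets the hypothesis H be "this email is spam" and the evidence E be the words it contains. A minimal word-based Naïve Bayes classifier, with Laplace smoothing and log probabilities for numerical stability, might look as follows; the training messages are invented for illustration.

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Count word frequencies per class from labeled training messages."""
    spam_counts = Counter(w for d in spam_docs for w in d.split())
    ham_counts = Counter(w for d in ham_docs for w in d.split())
    vocab = set(spam_counts) | set(ham_counts)
    return spam_counts, ham_counts, vocab, len(spam_docs), len(ham_docs)

def classify(model, doc):
    """Apply Bayes' rule in log space:
    log P(H | E) ∝ log P(H) + Σ log P(word | H)."""
    spam_counts, ham_counts, vocab, n_spam, n_ham = model
    log_spam = math.log(n_spam / (n_spam + n_ham))  # prior P(spam)
    log_ham = math.log(n_ham / (n_spam + n_ham))    # prior P(ham)
    t_spam = sum(spam_counts.values()) + len(vocab)  # Laplace denominator
    t_ham = sum(ham_counts.values()) + len(vocab)
    for w in doc.split():
        log_spam += math.log((spam_counts[w] + 1) / t_spam)
        log_ham += math.log((ham_counts[w] + 1) / t_ham)
    return "spam" if log_spam > log_ham else "ham"

model = train(
    spam_docs=["win cash now", "free prize win"],
    ham_docs=["meeting at noon", "project report attached"],
)
print(classify(model, "win a free prize"))  # → spam
```

The "naïve" assumption is visible in the loop: each word's likelihood is multiplied in independently of the others.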
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Spam (Pedram Hayati)
In recent years, online spam has become a major problem for
the sustainability of the Internet. Excessive amounts of spam
are not only reducing the quality of information available on
the Internet but also creating concern amongst search engines
and web users. This paper aims to analyse existing works in
two different categories of spam domains - email spam and
image spam to gain a deeper understanding of this problem.
Future research directions are also presented in these spam
domains.
More info: http://debii.curtin.edu.au/~pedram/research/publications/76-evaluation-of-spam-detection-and-prevention-frameworks-for-email-and-image-spam-a-state-of-art.html
A multi layer architecture for spam-detection system (csandit)
As email becomes a prominent mode of communication, so do attempts to misuse it by taking undue advantage of its low cost and high reach. Because email communication is so cheap, spammers exploit it to advertise their products and to commit cybercrimes, and researchers are working hard to combat them. Many spam-detection techniques and systems have been built to fight spammers, but spammers continuously find new ways to defeat existing filters. This paper describes existing spam-filtering techniques and proposes a multi-level architecture for spam email detection. We present an analysis of the architecture to demonstrate its effectiveness.
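The idea of a multi-level architecture, where an email passes through successive filtering layers until one reaches a verdict, can be sketched as a simple chain of handlers. The specific layers, blocked addresses, and trigger words below are invented for illustration; they are not the paper's actual architecture.

```python
def blacklist_layer(email):
    """Layer 1: reject known-bad senders outright."""
    blocked = {"spam@bad.example", "offers@junk.example"}
    return "spam" if email["sender"] in blocked else None

def keyword_layer(email):
    """Layer 2: simple content rules on the message body."""
    triggers = {"free", "winner", "viagra"}
    words = set(email["body"].lower().split())
    return "spam" if words & triggers else None

def default_layer(email):
    """Final layer: anything not flagged earlier is delivered."""
    return "ham"

LAYERS = [blacklist_layer, keyword_layer, default_layer]

def classify(email):
    """Pass the email through each layer until one returns a verdict."""
    for layer in LAYERS:
        verdict = layer(email)
        if verdict is not None:
            return verdict

print(classify({"sender": "alice@work.example", "body": "Lunch tomorrow?"}))   # → ham
print(classify({"sender": "alice@work.example", "body": "You are a WINNER"}))  # → spam
```

Layering cheap checks first means most legitimate mail is handled quickly, while heavier analysis can be confined to later layers.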
Integration of Bayesian Theory and Association Rule Mining in Predicting User... (Editor IJCATR)
Bayesian theory and association rule mining are artificial intelligence techniques used in various computing fields, especially machine learning. The internet is considered easy ground for vices such as radicalization because of its diverse nature and ease of information access. These vices could be managed using recommender-system methods, which deliver users' preference data based on their previous interests and on the community around the user. Recommender systems fall into two broad categories: collaborative systems, which consider users who share the same preferences as the user in question, and content-based recommender systems, which recommend websites similar to those the user already liked. Recent research and information from security organs indicate that online radicalization has been growing at an alarming rate. The paper reviews in depth what has been done in recommender systems and looks at how these methods could be combined to form a strong system to monitor and manage the online menace resulting from radicalization. The relationship between different websites, and the trend from continuous access to these websites, forms the basis for probabilistic reasoning about users' behavior. Association rule mining has been widely used in recommender systems to profile and generate users' preferences; to add probabilistic reasoning at internet scale, and particularly in social media, Bayesian theory is incorporated. Combining the two techniques provides better analysis of the results, adding reliability and knowledge.
In the era of information technology, sharing information has become easy and fast, with many platforms available for sharing anywhere in the world. Among them, email is the simplest, cheapest, and fastest method of information sharing worldwide. But because of their simplicity, emails are vulnerable to many kinds of attack, and the most common and dangerous one is spam. No one wants to receive emails unrelated to their interests; they waste the receiver's time and resources, and they can hide malicious content in attachments or URLs that lead to security breaches on the host system. Spam is any irrelevant, unwanted message or email sent by an attacker to a large number of recipients over email or any other information-sharing medium, so there is an immense demand for securing the email system. Spam emails may carry viruses, remote-access trojans, and other malware, and attackers mostly use this technique to lure users toward online services. They may send spam emails containing attachments with multiple file extensions, or packed URLs that lead the user to malicious and spamming websites and end in data or financial fraud and identity theft. Many email providers allow their users to define keyword-based rules that automatically filter emails, but this approach is of limited use: it is difficult, users do not want to customize their email, and spammers attack their accounts regardless.
Identification of Spam Emails from Valid Emails by Using Voting (Editor IJCATR)
In recent years, increasing use of email has led to the emergence and growth of problems caused by mass unwanted messages, commonly known as spam. This study provides a new method for identifying and classifying spam using decision trees, support vector machines, the Naïve Bayes theorem, and a voting algorithm. To verify the proposed method, a set of emails is chosen for testing: the three base algorithms first try to detect spam, and then spam is identified by voting. The advantage of this method is that it combines three algorithms at the same time: decision tree, support vector machine, and Naïve Bayes. In the evaluation, a data set is analyzed with the Weka software; the resulting charts indicate improved spam-detection accuracy compared with previous methods.
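The voting step described above amounts to taking the majority label across the three base classifiers. A minimal sketch, with hypothetical per-email predictions standing in for the trained decision tree, SVM, and Naïve Bayes models:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most of the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions per email from the three base classifiers
# (decision tree, SVM, Naive Bayes); in practice these come from
# separately trained models.
emails = [
    {"tree": "spam", "svm": "spam", "nb": "ham"},
    {"tree": "ham",  "svm": "ham",  "nb": "ham"},
    {"tree": "ham",  "svm": "spam", "nb": "spam"},
]
verdicts = [majority_vote(list(e.values())) for e in emails]
print(verdicts)  # → ['spam', 'ham', 'spam']
```

With an odd number of classifiers there is never a tie, and a single classifier's mistake is outvoted whenever the other two agree.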
Email spam, also known as junk email or unsolicited bulk email (UBE), is a subset of electronic spam involving nearly identical messages sent to numerous recipients by email. Clicking links in spam email may send users to phishing websites or to sites hosting malware, and spam email may also include malware as scripts or other executable file attachments. Definitions of spam usually include the aspects that the email is unsolicited and sent in bulk.
To overcome the spam problem, much research has been conducted and various anti-spam filtering methods have been implemented. A spam filter is a set of instructions for determining the status of a received email; spam filters are used to prevent spam from reaching the recipient. The main challenge is to design an effective spam filter that lets desired email pass through while blocking the unwanted email.
Detecting Spambot as an Antispam Technique for Web Internet BBS (ijsrd.com)
Spam is one of the most popular and relevant topics that needs to be understood in the current scenario. Everyone, whether a small child or an elderly person, uses email every day all around the world, yet almost no one is aware of what spam actually is and what it can do to their systems. Spam, in general, means unsolicited or unwanted mail. Botnets are considered one of the main sources of spam: a botnet is a group of software agents called bots that run autonomously and automatically on several compromised computers. The main objective of this paper is to detect such bots, or spambots, for the Bulletin Board System (BBS). A BBS is a computer running software that allows users to leave messages and access information of general interest. Originally BBSes were accessed only over a phone line using a modem, but nowadays some allow access via Telnet, a packet-switched network, or a packet radio connection. The methodology we focus on is the Behavioural-based Spam Detection (BSD) method. A behavioural-based spam detector combines several behaviours of spambots at different stages, including spam-preparation behaviour before the spam session, when spammers search for an open-relay SMTP service to send emails through, and the behaviour of spammers while connecting to the mail server. Detecting the abnormal behaviour produced by spam activities gives a high rate of suspicion of the existence of bots.
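Combining evidence from several connection-time behaviours, as the BSD method does, can be sketched as a weighted score against a threshold. The behaviour names, weights, and threshold below are invented for illustration and are not the paper's actual parameters.

```python
# Hypothetical weights for observed sender behaviours (assumptions for
# illustration, not the BSD paper's values).
BEHAVIOUR_WEIGHTS = {
    "scanned_for_open_relay": 5,   # probed for an open-relay SMTP service
    "no_reverse_dns": 2,           # connecting host lacks a PTR record
    "rapid_connection_rate": 3,    # many connections in a short window
    "helo_mismatch": 2,            # HELO name disagrees with the source IP
}

def bot_suspicion(observed, threshold=6):
    """Sum the weights of observed behaviours; flag the sender as a
    likely spambot when the combined score reaches the threshold."""
    score = sum(BEHAVIOUR_WEIGHTS[b] for b in observed if b in BEHAVIOUR_WEIGHTS)
    return score, score >= threshold

score, is_bot = bot_suspicion({"scanned_for_open_relay", "rapid_connection_rate"})
print(score, is_bot)  # → 8 True
```

Because no single behaviour is decisive, the score only crosses the threshold when several suspicious behaviours co-occur, which is the intuition behind behavioural detection.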
An Analysis of Effective Anti Spam Protocol Using Decision Tree Classifiers (ijsrd.com)
As internet usage increases in day-to-day activities, there is a corresponding increase in communication through it, with email at the forefront of modern communication for businesses and individuals alike. This has led to attempts to get customers' attention by bombarding their mail accounts with unwanted and unsolicited advertisements, offers, phishing activities, viruses, worms, trojans, hate messages, and attempts to make the customer part with sensitive information such as passwords. This mass mailing, or flooding of mail servers with unwanted data that sometimes causes damage, is known as spam. Spam filters have been in use ever since such mail flooding began. Most spam filters are manual: after identifying a spam mail, the user blocks the sender, and the system thereafter keeps mail from that address out of the inbox. Spammers, however, are resilient; they send spam from different identities and flood inboxes anyway. This study focuses on algorithms and data-mining techniques used to unearth spam mails. These filters process inbox mail as it arrives at the server according to predefined rules, a supervised-learning approach known as knowledge engineering. Here a decision classifier is trained on mails with varying words so as to identify the words in a mail as spam. The decision tree model is used to analyze mails, identify spam, and block it; the number of mails sent, content, subject, type (reply or forward), language, and so on are identified using a decision classifier such as Naïve Bayes and analyzed accordingly to filter the emails.
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...Editor IJCATR
Most of the existing semantic similarity measures that use ontology structure as their primary source can measure semantic similarity between concepts/classes using single ontology. The ontology-based semantic similarity techniques such as structure-based semantic similarity techniques (Path Length Measure, Wu and Palmer’s Measure, and Leacock and Chodorow’s measure), information content-based similarity techniques (Resnik’s measure, Lin’s measure), and biomedical domain ontology techniques (Al-Mubaid and Nguyen’s measure (SimDist)) were evaluated relative to human experts’ ratings, and compared on sets of concepts using the ICD-10 “V1.0” terminology within the UMLS. The experimental results validate the efficiency of the SemDist technique in single ontology, and demonstrate that SemDist semantic similarity techniques, compared with the existing techniques, gives the best overall results of correlation with experts’ ratings.
Semantic Similarity Measures between Terms in the Biomedical Domain within f...Editor IJCATR
The techniques and tests are tools used to define how measure the goodness of ontology or its resources. The similarity between biomedical classes/concepts is an important task for the biomedical information extraction and knowledge discovery. However, most of the semantic similarity techniques can be adopted to be used in the biomedical domain (UMLS). Many experiments have been conducted to check the applicability of these measures. In this paper, we investigate to measure semantic similarity between two terms within single ontology or multiple ontologies in ICD-10 “V1.0” as primary source, and compare my results to human experts score by correlation coefficient.
A Strategy for Improving the Performance of Small Files in Openstack Swift Editor IJCATR
This is an effective way to improve the storage access performance of small files in Openstack Swift by adding an aggregate storage module. Because Swift will lead to too much disk operation when querying metadata, the transfer performance of plenty of small files is low. In this paper, we propose an aggregated storage strategy (ASS), and implement it in Swift. ASS comprises two parts which include merge storage and index storage. At the first stage, ASS arranges the write request queue in chronological order, and then stores objects in volumes. These volumes are large files that are stored in Swift actually. During the short encounter time, the object-to-volume mapping information is stored in Key-Value store at the second stage. The experimental results show that the ASS can effectively improve Swift's small file transfer performance.
Integrated System for Vehicle Clearance and RegistrationEditor IJCATR
Efficient management and control of government's cash resources rely on government banking arrangements. Nigeria, like many low income countries, employed fragmented systems in handling government receipts and payments. Later in 2016, Nigeria implemented a unified structure as recommended by the IMF, where all government funds are collected in one account would reduce borrowing costs, extend credit and improve government's fiscal policy among other benefits to government. This situation motivated us to embark on this research to design and implement an integrated system for vehicle clearance and registration. This system complies with the new Treasury Single Account policy to enable proper interaction and collaboration among five different level agencies (NCS, FRSC, SBIR, VIO and NPF) saddled with vehicular administration and activities in Nigeria. Since the system is web based, Object Oriented Hypermedia Design Methodology (OOHDM) is used. Tools such as Php, JavaScript, css, html, AJAX and other web development technologies were used. The result is a web based system that gives proper information about a vehicle starting from the exact date of importation to registration and renewal of licensing. Vehicle owner information, custom duty information, plate number registration details, etc. will also be efficiently retrieved from the system by any of the agencies without contacting the other agency at any point in time. Also number plate will no longer be the only means of vehicle identification as it is presently the case in Nigeria, because the unified system will automatically generate and assigned a Unique Vehicle Identification Pin Number (UVIPN) on payment of duty in the system to the vehicle and the UVIPN will be linked to the various agencies in the management information system.
Assessment of the Efficiency of Customer Order Management System: A Case Stu...Editor IJCATR
The Supermarket Management System deals with the automation of buying and selling of good and services. It includes both sales and purchase of items. The project Supermarket Management System is to be developed with the objective of making the system reliable, easier, fast, and more informative.
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*Editor IJCATR
Energy is a key component in the Wireless Sensor Network (WSN)[1]. The system will not be able to run according to its function without the availability of adequate power units. One of the characteristics of wireless sensor network is Limitation energy[2]. A lot of research has been done to develop strategies to overcome this problem. One of them is clustering technique. The popular clustering technique is Low Energy Adaptive Clustering Hierarchy (LEACH)[3]. In LEACH, clustering techniques are used to determine Cluster Head (CH), which will then be assigned to forward packets to Base Station (BS). In this research, we propose other clustering techniques, which utilize the Social Network Analysis approach theory of Betweeness Centrality (BC) which will then be implemented in the Setup phase. While in the Steady-State phase, one of the heuristic searching algorithms, Modified Bi-Directional A* (MBDA *) is implemented. The experiment was performed deploy 100 nodes statically in the 100x100 area, with one Base Station at coordinates (50,50). To find out the reliability of the system, the experiment to do in 5000 rounds. The performance of the designed routing protocol strategy will be tested based on network lifetime, throughput, and residual energy. The results show that BC-MBDA * is better than LEACH. This is influenced by the ways of working LEACH in determining the CH that is dynamic, which is always changing in every data transmission process. This will result in the use of energy, because they always doing any computation to determine CH in every transmission process. In contrast to BC-MBDA *, CH is statically determined, so it can decrease energy usage.
Security in Software Defined Networks (SDN): Challenges and Research Opportun...Editor IJCATR
In networks, the rapidly changing traffic patterns of search engines, Internet of Things (IoT) devices, Big Data and data centers has thrown up new challenges for legacy; existing networks; and prompted the need for a more intelligent and innovative way to dynamically manage traffic and allocate limited network resources. Software Defined Network (SDN) which decouples the control plane from the data plane through network vitalizations aims to address these challenges. This paper has explored the SDN architecture and its implementation with the OpenFlow protocol. It has also assessed some of its benefits over traditional network architectures, security concerns and how it can be addressed in future research and related works in emerging economies such as Nigeria.
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Editor IJCATR
Report handling on "LAPOR!" (Laporan, Aspirasi dan Pengaduan Online Rakyat) system depending on the system administrator who manually reads every incoming report [3]. Read manually can lead to errors in handling complaints [4] if the data flow is huge and grows rapidly, it needs at least three days to prepare a confirmation and it sensitive to inconsistencies [3]. In this study, the authors propose a model that can measure the identities of the Query (Incoming) with Document (Archive). The authors employed Class-Based Indexing term weighting scheme, and Cosine Similarities to analyse document similarities. CoSimTFIDF, CoSimTFICF and CoSimTFIDFICF values used in classification as feature for K-Nearest Neighbour (K-NN) classifier. The optimum result evaluation is pre-processing employ 75% of training data ratio and 25% of test data with CoSimTFIDF feature. It deliver a high accuracy 84%. The k = 5 value obtain high accuracy 84.12%
Hangul Recognition Using Support Vector MachineEditor IJCATR
The recognition of Hangul Image is more difficult compared with that of Latin. It could be recognized from the structural arrangement. Hangul is arranged from two dimensions while Latin is only from the left to the right. The current research creates a system to convert Hangul image into Latin text in order to use it as a learning material on reading Hangul. In general, image recognition system is divided into three steps. The first step is preprocessing, which includes binarization, segmentation through connected component-labeling method, and thinning with Zhang Suen to decrease some pattern information. The second is receiving the feature from every single image, whose identification process is done through chain code method. The third is recognizing the process using Support Vector Machine (SVM) with some kernels. It works through letter image and Hangul word recognition. It consists of 34 letters, each of which has 15 different patterns. The whole patterns are 510, divided into 3 data scenarios. The highest result achieved is 94,7% using SVM kernel polynomial and radial basis function. The level of recognition result is influenced by many trained data. Whilst the recognition process of Hangul word applies to the type 2 Hangul word with 6 different patterns. The difference of these patterns appears from the change of the font type. The chosen fonts for data training are such as Batang, Dotum, Gaeul, Gulim, Malgun Gothic. Arial Unicode MS is used to test the data. The lowest accuracy is achieved through the use of SVM kernel radial basis function, which is 69%. The same result, 72 %, is given by the SVM kernel linear and polynomial.
Application of 3D Printing in EducationEditor IJCATR
This paper provides a review of literature concerning the application of 3D printing in the education system. The review identifies that 3D Printing is being applied across the Educational levels [1] as well as in Libraries, Laboratories, and Distance education systems. The review also finds that 3D Printing is being used to teach both students and trainers about 3D Printing and to develop 3D Printing skills.
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...Editor IJCATR
In underwater environment, for retrieval of information the routing mechanism is used. In routing mechanism there are three to four types of nodes are used, one is sink node which is deployed on the water surface and can collect the information, courier/super/AUV or dolphin powerful nodes are deployed in the middle of the water for forwarding the packets, ordinary nodes are also forwarder nodes which can be deployed from bottom to surface of the water and source nodes are deployed at the seabed which can extract the valuable information from the bottom of the sea. In underwater environment the battery power of the nodes is limited and that power can be enhanced through better selection of the routing algorithm. This paper focuses the energy-efficient routing algorithms for their routing mechanisms to prolong the battery power of the nodes. This paper also focuses the performance analysis of the energy-efficient algorithms under which we can examine the better performance of the route selection mechanism which can prolong the battery power of the node
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...Editor IJCATR
The designing of routing algorithms faces many challenges in underwater environment like: propagation delay, acoustic channel behaviour, limited bandwidth, high bit error rate, limited battery power, underwater pressure, node mobility, localization 3D deployment, and underwater obstacles (voids). This paper focuses the underwater voids which affects the overall performance of the entire network. The majority of the researchers have used the better approaches for removal of voids through alternate path selection mechanism but still research needs improvement. This paper also focuses the architecture and its operation through merits and demerits of the existing algorithms. This research article further focuses the analytical method of the performance analysis of existing algorithms through which we found the better approach for removal of voids
Decay Property for Solutions to Plate Type Equations with Variable CoefficientsEditor IJCATR
In this paper we consider the initial value problem for a plate type equation with variable coefficients and memory in
1 n R n ), which is of regularity-loss property. By using spectrally resolution, we study the pointwise estimates in the spectral
space of the fundamental solution to the corresponding linear problem. Appealing to this pointwise estimates, we obtain the global
existence and the decay estimates of solutions to the semilinear problem by employing the fixed point theorem
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Accelerate your Kubernetes clusters with Varnish Caching
International Journal of Computer Applications Technology and Research
Volume 4 – Issue 8, 629 - 632, 2015, ISSN: 2319–8656
www.ijcat.com
Spam Detection in Social Networks Using Correlation Based Feature Subset Selection
Sanjeev Dhawan
Department of Computer Science & Engineering,
University Institute of Engineering and Technology,
Kurukshetra University, Kurukshetra-136119,
Haryana, India
Meena Devi
Department of Computer Science and Engineering,
University Institute of Engineering and Technology,
Kurukshetra University, Kurukshetra-136119,
Haryana, India
Abstract: A Bayesian classifier works efficiently in some domains and poorly in others; in particular, its performance suffers in domains that involve correlated features. Feature selection is beneficial for reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving the comprehensibility of results. However, the recent growth in the dimensionality of data poses a hard challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this paper, a Bayesian classifier with correlation-based feature selection is introduced that can identify relevant features, as well as redundancy among relevant features, without pairwise correlation analysis. The efficiency and effectiveness of the method are demonstrated through broad experiments.
Keywords: Bayesian Classifier, Feature Subset Selection, Naïve Bayesian Classifier, Correlation Based FSS, Spam, Non-Spam
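The selection criterion named in the title, preferring features that correlate strongly with the class but weakly with one another, is commonly formalized by Hall's CFS merit heuristic with a greedy forward search. The sketch below is an illustrative implementation of that standard heuristic, not the authors' code; the data layout (a NumPy matrix with one column per feature) is an assumption.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: reward high feature-class correlation,
    penalize high feature-feature correlation.

    merit = k * r_cf / sqrt(k + k*(k-1) * r_ff)
    """
    k = len(subset)
    # mean absolute correlation between each candidate feature and the class
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k > 1:
        # mean absolute pairwise correlation among the candidate features
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def forward_select(X, y, max_features=5):
    """Greedy forward search: repeatedly add the feature that most improves
    the CFS merit, and stop when no candidate improves it."""
    selected, remaining, best = [], list(range(X.shape[1])), 0.0
    while remaining and len(selected) < max_features:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected
```

On synthetic data where one feature tracks the class, a second is a redundant copy, and a third is noise, the search picks an informative feature first and never admits the noise column, which is exactly the behavior the abstract describes.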
1. INTRODUCTION
It is impossible to tell exactly who first came upon the simple idea that if you send an advertisement to enough people, at least one of them will react to it, no matter what the proposal is. E-mail provides a very convenient way to send millions of such advertisements at no cost to the sender, and this unfortunate fact is nowadays extensively exploited by several organizations. As a result, the mailboxes of millions of people get cluttered with so-called unsolicited bulk e-mail, also known as "spam" or "junk mail". Being incredibly cheap to send, spam causes many problems for the Internet community: large amounts of spam traffic between servers delay the delivery of solicited e-mail, and people with dial-up Internet access must spend bandwidth downloading junk mail. Sorting out the unwanted messages takes time and introduces the risk of deleting normal mail by mistake. Finally, there is quite an amount of pornographic spam that should not be exposed to children.

A number of ways of fighting spam have been proposed. There are "social" methods, such as legal measures (one example is the anti-spam law introduced in the US) and plain personal discipline (never respond to spam, never publish your e-mail address on web pages, never forward chain letters, and so on). There are also "technological" methods, such as blocking the spammer's IP address (blacklisting), e-mail filtering, and others. Unluckily, no perfect method of getting rid of spam exists so far, so the amount of spam mail keeps increasing; for example, about 50% of the messages arriving in my personal mailbox are unsolicited.

At the moment, automatic e-mail filtering appears to be the most effective method for blocking spam, and a tough competition between spammers and spam-filtering methods is going on: as the anti-spam methods improve, so do the tricks of the spammers. Several years ago most spam could be reliably handled by blocking e-mails coming from certain addresses or by filtering out messages with certain subject lines. To overcome this, spammers began to specify random sender addresses and to append random characters to the end of the message subject. Spam-filtering rules adjusted to consider separate words in messages could deal with that, but then junk mail with specially spelled words (e.g. B-U-Y N-O-W) or simply misspelled words (e.g. BUUY NOOW) was born. To fool the more advanced filters that rely on word frequencies, spammers append a large number of "usual words" to the end of a message. Besides, there are spams that contain no text at all (typically HTML messages with a single image that is downloaded from the Internet when the message is opened), and there are even self-decrypting spams (e.g. an encrypted HTML message containing JavaScript code that decrypts its contents when opened). So, as you see, it is a never-ending battle.

There are two basic approaches to mail filtering: knowledge engineering (KE) and machine learning (ML). In the former case, a set of rules is created according to which messages are categorized as spam or legitimate mail. A typical rule of this kind could look like "if the Subject of a message contains the text BUY NOW, then the message is spam". A set of such rules must be created either by the user of the filter or by some other authority (e.g. the software company that provides a particular rule-based spam-filtering tool). The major drawback of this method is that the set of rules must be constantly updated, and maintaining it is not convenient for most users. The rules could, of course, be updated in a centralized manner by the maintainer of the spam-filtering tool, and there is even a peer-to-peer knowledge-base solution, but when the rules are publicly available, a spammer can adjust the text of his message so that it passes through the filter. Therefore it is better when spam filtering is customized on a per-user basis. The machine learning approach does not require specifying any rules explicitly. Instead, a set of pre-classified documents (training samples) is needed, and a specific algorithm is then used to "learn" the classification rules from this data. The subject of machine learning has been widely studied, and there are many algorithms suitable for this task. This article considers some of the most popular machine learning algorithms and their application to the problem of spam filtering. More-or-less self-contained descriptions of the algorithms are presented, along with a simple comparison of the performance of my implementations, and finally some ideas for improving the algorithms are shown.
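To make the machine-learning approach concrete, the following is a minimal multinomial naive Bayes filter trained on pre-classified, tokenized messages, in the spirit of the Bayesian classifiers discussed in this paper. It is an illustrative sketch only: the class labels ("spam"/"ham"), the word-level tokenization, and the Laplace (add-one) smoothing are standard textbook choices, not details taken from the paper.

```python
import math
from collections import Counter

def train(messages, labels):
    """Count word frequencies per class from pre-classified training samples."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for words, label in zip(messages, labels):
        word_counts[label].update(words)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def classify(words, word_counts, class_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word | class),
    with Laplace smoothing so unseen words get a small nonzero probability."""
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in ("spam", "ham"):
        logp = math.log(class_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for w in words:
            logp += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Even with a handful of training messages, the word-frequency evidence is enough to separate "BUY NOW"-style mail from ordinary correspondence, which is why this family of classifiers became the workhorse of spam filtering.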
2. CHALLENGES IN SPAM
DETECTION
One of the barriers to legislation against spam is the fact that
not everyone uses exactly the same definition. It doesn’t help
that laws may be made at different levels even within the
2. International Journal of Computer Applications Technology and Research
Volume 4– Issue 8, 629 - 632, 2015, ISSN: 2319–8656
www.ijcat.com 630
same country, let alone laws in different countries. With so
many different and sometimes conflicting laws, prosecution
can be very difficult. Another barrier both to legislation and
practical filtering is that email is not designed in such a way
that the sender can always be traced easily. There is no
authentication of the sender built in to the protocol used by
email, leaving it possible for people to forge sender
information. This makes it hard to trace back and prosecute
the sender, or to avoid receiving messages from a known
spammer in the future. There are several proposals to adapt
this protocol, such as Microsoft's "Caller ID for email". Spam
changes with time as new products are introduced and seasons
change. For example, Christmas-themed spam is not usually
sent in June. But beyond that, there are targeted changes
happening in spam. Perhaps the largest problem of spam
filtering is that spammers have intelligent beings working to
ensure that “direct email marketing” (the marketing term for
spam) is seen by as many potential customers as possible.
Many anti-spam tools are freely available online, which
means that spammers have access to them too, and can learn
how to get through them. This makes spam detection a co-
evolutionary process, much like virus detection: both sides
change to gain an advantage, however temporarily. Although
it does change, spam is not completely volatile. Terry Sullivan
found that while spam does undergo periods of rapid changes,
it also has a core set of features which are stable for long
periods of time. Spam changes from person to person. This is
partly due to targeting on the part of the address harvesters,
who try to guess the interests of the recipients so that the
response rate will be higher. But more importantly, legitimate
mail also varies from person to person. In theory it should be
possible to discover spam without much attention to the
legitimate mail. However, the great success of classifiers
which use both, such as Graham’s Bayesian classifier and the
CRM114 discriminator [Yer04], implies that use of data from
both legitimate and spam email is very beneficial. One final
thing to note in the difficulty of spam classification is that all
mistakes in classification are not equal. False negatives,
spam messages that have accidentally been tagged as non-spam, are
usually seen by the user. They may be annoying, but are
usually easy to deal with. However, false positives, legitimate
messages that have been accidentally tagged as spam, tend to be more
problematic. When a single legitimate message is in a pile of
spam, it is much easier to miss seeing it. (A typical user will
not read all spam, but instead scans subject and from lines
quickly to see if anything legitimate stands out.) While there
is relatively little impact if a person receives a single spam,
missing a real message which might be important is much
more dangerous. One research firm suggests that companies
lose $3 billion dealing with false positives.
3. PROPOSED WORK
In previous work, various spam detection algorithms have been
proposed, ranging from text-based to feature-based, using
classifiers such as naïve Bayes, SVM, ANN, kNN and decision
trees. However, the naïve Bayesian method is the one utilized by
99% of companies, owing to its classification efficiency. But
these probabilistic methods take into consideration all the
features of the spam, leaving the overall accuracy in the range
of 65 to 74%. A more efficient method is therefore required to
improve spam detection and reduce false alarms. The feature
subset selection (FSS) algorithm formulates the vector space of
the features by filtering: it selects the most prominent
features of spam and removes unwanted ones. This filtering
reduces both the search space and the noise. After filtering
using FSS, we have applied an attribute-selection-based naïve
Bayesian probabilistic classifier and achieved 17-20% higher
accuracy.
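The filtering step just described can be sketched as a simple univariate ranking: score each feature by its correlation with the class and keep the top k. This is a simplified stand-in for the attribute selection actually used, and the feature matrix X and labels y below are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_features(X, y, k):
    """Keep the k columns of X most correlated (in absolute value)
    with the class labels y; return the reduced matrix and kept indices."""
    scored = [(abs(pearson([row[j] for row in X], y)), j)
              for j in range(len(X[0]))]
    scored.sort(reverse=True)
    keep = sorted(j for _, j in scored[:k])
    return [[row[j] for j in keep] for row in X], keep
```

The reduced matrix can then be handed to any classifier, which now searches a smaller, less noisy feature space.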
4. FEATURE SUBSET SELECTION
Feature subset selection is used to identify and remove as much
irrelevant and redundant information as possible. This reduces
the dimensionality of the data and may allow learning
algorithms to run faster and more effectively. In some cases,
accuracy on future classification can be improved; in others,
the result is a more compact, easily interpreted representation
of the target concept.
5. CORRELATION BASED FSS
The CFS algorithm relies on a heuristic for assessing the cost
or merit of a subset of features. This heuristic takes into
account the usefulness of individual features for predicting
the class label, along with the level of inter-correlation
among them. The hypothesis on which the heuristic is based is:
Sound feature subsets contain features highly correlated with
(predictive of) the class, yet uncorrelated with (not predictive
of) each other.
Features are relevant if their values vary systematically with
category membership. A feature is useful if it is correlated
with, or predictive of, the class; otherwise it is irrelevant.
Empirical evidence from the feature selection literature shows
that, along with irrelevant features, redundant information
[Figure: CFS workflow. Attributes are fed to a correlation-based subset calculator; the subset with maximum correlation is selected, and the selected features are passed to the classifier for accuracy evaluation.]
should be wiped out as well. A feature is said to be redundant
if one or more of the other features are highly correlated with
it. The above definitions for relevance and redundancy lead to
the idea that best features for a given classification are those
that are highly correlated with one of the classes and have an
insignificant correlation with the rest of the features in the set.
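The redundancy criterion just stated can be sketched as a greedy filter over precomputed correlations: visit features from most to least class-correlated and drop any feature too correlated with one already kept. This is an illustrative simplification with hypothetical inputs; the actual CFS algorithm scores whole subsets with equation 5.1 rather than pruning pairwise.

```python
def prune_redundant(class_corr, inter_corr, threshold=0.9):
    """class_corr[j]: correlation of feature j with the class.
    inter_corr[j][i]: correlation between features j and i.
    Keep features in order of class correlation, dropping any whose
    inter-correlation with an already-kept feature exceeds threshold."""
    order = sorted(range(len(class_corr)), key=lambda j: -abs(class_corr[j]))
    kept = []
    for j in order:
        if all(abs(inter_corr[j][i]) < threshold for i in kept):
            kept.append(j)
    return sorted(kept)
```

With three features where the second is nearly a copy of the first, the filter keeps the stronger of the pair plus the independent third feature.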
If the correlation between each of the components in a test
and the outside variable is known, and the inter-correlation
between each pair of components is given, then the correlation
between a composite consisting of the summed components
and the outside variable can be predicted from
rzc = k * rzi / sqrt( k + k(k-1) * rii )        (5.1)
Where
rzc = correlation between the summed components and
the outside variable.
k = number of components (features).
rzi = average of the correlations between the
components and the outside variable.
rii = average inter-correlation between components.
Equation 5.1 is Pearson's correlation coefficient, where
all the variables have been standardized.
The numerator can be thought of as giving an indication of
how predictive of the class a group of features are; the
denominator of how much redundancy there is among them.
Thus, equation 5.1 shows that the correlation between a
composite and an outside variable is a function of the number
of component variables in the composite and the magnitude of
the inter-correlations among them, together with the
magnitude of the correlations between the components and the
outside variable. Some conclusions can be extracted from
(5.1):
- The higher the correlations between the components and the outside variable, the higher the correlation between the composite and the outside variable.
- As the number of components in the composite increases, the correlation between the composite and the outside variable increases.
- The lower the inter-correlation among the components, the higher the correlation between the composite and the outside variable.
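These three observations can be checked numerically with a direct transcription of the merit in equation 5.1; the argument values below are illustrative only.

```python
from math import sqrt

def cfs_merit(k, r_zi, r_ii):
    """Correlation between a k-feature composite and the class,
    given the average feature-class correlation r_zi and the average
    feature-feature inter-correlation r_ii (equation 5.1)."""
    return k * r_zi / sqrt(k + k * (k - 1) * r_ii)
```

Holding two of the three quantities fixed and varying the third reproduces each conclusion: merit rises with r_zi, rises with k, and falls as r_ii grows.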
6. CLASSIFICATION RESULTS
Classifier            TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Correct (%)
Naïve Bayes           0.793    0.152    0.842      0.793   0.794      0.937     79.2871
Naïve Bayes 20 Folds  0.692    0.046    0.959      0.692   0.804      0.937     79.5262
NB Info Gain FSS      0.8      0.196    0.808      0.8     0.802      0.861     80.0478
Bayes Net             0.9      0.123    0.9        0.9     0.899      0.965     89.9587
Bayes Net + CFS       0.924    0.096    0.925      0.924   0.924      0.974     92.4147
7. CONCLUSION AND FUTURE SCOPE
Feature subset selection (FSS) plays a vital role in the fields
of data mining and machine learning. A good FSS algorithm can
efficiently remove irrelevant and redundant features while
taking feature interaction into account. This clarifies the
understanding of the data and also enhances the performance of
a learner by improving its generalization capacity and the
interpretability of the learned model. An alternative approach
employs a classifier on a corpus of e-mail messages from
numerous users together with a collective dataset.
In this work we have improved spam detection through feature
subset selection on a spam dataset. Feature subset selection
methods such as Info Gain attribute selection and
correlation-based attribute selection can be seen as the main
enhancement to naïve Bayesian/probabilistic methods. We have
analyzed the probabilistic spam filters and attained more than
92% success in filtering spam.
However, many issues still remain open. The system deals only
with content as it has been translated to plain text or HTML;
since some spam is sent with most of the message in an image,
it would be worth looking at ways in which images and other
attachments could be examined by the system. These could
include algorithms which extract text from the attachment, or
more complex analysis of the information contained within the
attachment. We can also work on a technique to recognize web
spam by finding the boosting pages rather than the web spam
pages themselves: starting from a small set of spam seed pages
to obtain boosting pages, web spam pages could then be
identified using those boosting pages. Finally, the system
should be tested on a larger dataset and over a longer period
than the one-year dataset available in the public domain.