Email Spam Detection Using Machine Learning
A Report Submitted
in Partial Fulfilment of the Requirements
for the Degree of
Bachelor of Technology
in
Computer Science & Engineering
by
Vishal Verma (20204231)
Vikash Yadav (20204228)
Vishal Vishwakarma (20204232)
Uma Patel (20204219)
Shubham Gautam (20204198)
Under the guidance of Prof. Manoj Madhava Gore
to the
COMPUTER SCIENCE AND ENGINEERING DEPARTMENT
MOTILAL NEHRU NATIONAL INSTITUTE OF TECHNOLOGY
ALLAHABAD, PRAYAGRAJ
November 2023
UNDERTAKING
We declare that the work presented in this report titled “Email
Spam Detection Using Machine Learning”, submitted to the
Computer Science and Engineering Department, Motilal Nehru
National Institute of Technology Allahabad, Prayagraj, for the
award of the Bachelor of Technology degree in Computer Science
& Engineering, is our original work. We have not plagiarized or
submitted the same work for the award of any other degree. In
case this undertaking is found incorrect, we accept that our
degree may be unconditionally withdrawn.
November 2023
Allahabad
Vishal Verma
Vikash Yadav
Vishal Vishwakarma
Uma Patel
Shubham Gautam
CERTIFICATE
Certified that the work contained in the report titled “Email
Spam Detection Using Machine Learning”, by Vishal Verma,
Vikash Yadav, Vishal Vishwakarma, Uma Patel, Shubham
Gautam has been carried out under my supervision and
that this work has not been submitted elsewhere for a
degree.
(Prof. Manoj Madhava Gore)
Computer Science and Engineering Dept.
M.N.N.I.T, Allahabad
November 2023
Acknowledgement
This project is the result of a large amount of work, research, and dedication. Firstly, we
would like to thank our project supervisor, Prof. Manoj Madhava Gore, for
providing us with the necessary guidance and support to complete this project. His
insights, expertise, and valuable feedback were essential to the success of this
project. It has truly been a privilege working under him during this semester. We
would also like to express our sincere gratitude and appreciation to everyone who has
contributed to the completion of this project.
Table of Contents
1. Motivation
2. Problem Statement
3. Workflow
4. Machine Learning Algorithms
4.1 Machine Learning Based Spam Filtering Methods
4.1.1 Supervised Machine Learning
4.1.2 Decision Tree Classifier
4.1.3 Support Vector Machine (SVM)
4.1.4 Naive Bayes Classifier (NB)
4.2 Overall Insights of Machine Learning Algorithms for Spam
Detection
5. Naive Bayes Algorithm
6. Steps Involved
6.1 Data Collection
6.2 Data Cleaning and Exploration
6.2.1 Data Cleaning
6.2.2 Data Exploration
6.3 Data Pre-processing
6.4 Model Building
6.5 Evaluation
7. Result
8. Conclusion
9. References
Chapter 1
Motivation
The driving force behind the implementation of this email spam filtering
system is the aim of enhancing users’ email experiences by reducing the pain
and frustration brought on by spam emails. Users may save time and
effort by successfully filtering out spam instead of having to manually look
through and delete undesired communications. A major driver for improving
email security is the prevalence of dangerous material in spam emails, such as
phishing links and malware attachments, which can compromise users’
personal data or infect their devices. The necessity to make sure that crucial
emails from reliable sources swiftly reach users’ inboxes motivates the
reduction of false positives, when valid emails are wrongly classified as spam.
The ultimate goal is to establish a safe, effective, and dependable email
environment that improves user satisfaction and promotes confidence in the
email service provider. While Google and other email providers are useful for
identifying spam emails, their filters are not perfect and still rely on
continuous user input.
The end-user receives basic services for free from well-known email
providers like Gmail, Yandex, Yahoo Mail, etc., with an EULA, of course. As
private firms operate their own email servers and desire them to be more
secure due to personal data, there is a large potential for developing email
spam classifiers. In such circumstances, email spam classifier solutions may
be supplied to such organisations.
Chapter 2
Problem Statement
This project identifies spam using machine learning approaches. Spam
emails, many of which are fraudulent, need to be detected automatically. We
will apply several machine learning algorithms to our dataset before choosing
the most precise and accurate one for email spam detection. In this project, we
create a complete spam filtering system that can assist both individuals and
businesses in keeping their inboxes free of spam by determining which emails
are regular emails and which are spam.
Chapter 3
Workflow
➢ Gather the raw dataset and import it: Import the necessary data into
your machine learning environment after collecting it from multiple
sources.
➢ Initial data cleaning procedures should be carried out to deal with
missing values, outliers, and inconsistent data in the dataset.
Techniques like imputation, outlier reduction, and standardization may
be used for this.
➢ Analyse the dataset to learn more and spot trends, connections, and
crucial elements. This technique aids in data comprehension and
decision-making during model preparation and training.
➢ Separate the training and test sets from the pre-processed dataset. The
testing set is used to assess the machine learning model’s performance
after it has been trained using the training set.
➢ Based on the characteristics of the problem and the information at
hand, select a suitable machine learning algorithm. Consider factors
such as the kind of learning (supervised or unsupervised), the desired
output (classification or regression), and the required performance.
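The train/test separation described in the workflow above can be sketched in a few lines of plain Python. This is only an illustration: the toy `emails` list, the 20% test fraction, and the fixed seed are assumptions, not values taken from the project.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (text, label) pairs, label 1 = spam, 0 = ham.
emails = [("win a free prize now", 1), ("meeting at noon", 0),
          ("claim your reward", 1), ("lunch tomorrow?", 0),
          ("free entry in a contest", 1), ("see you at home", 0),
          ("call me later", 0), ("project report attached", 0),
          ("urgent: verify your account", 1), ("happy birthday", 0)]

train, test = train_test_split(emails, test_fraction=0.2)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that different algorithms can be compared on the same test set.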
Fig 3.1 Workflow
Chapter 4
Machine Learning Algorithms
The threat of spam communications has reached unparalleled heights in the
modern digital era, where email has evolved into a crucial form of
communication. Both email service providers and users constantly struggle to
sort through an enormous volume of unsolicited and potentially hazardous
communications. But now that machine learning algorithms have advanced,
we have strong tools at our disposal to address this problem. The classification
of email spam has undergone a revolutionary change thanks to machine
learning algorithms. These algorithms make extensive use of data, learning
from trends and traits to distinguish between spam and legitimate emails with
accuracy. By automating the categorization process, they increase
productivity and guarantee that users’ inboxes are free of clutter while
minimizing their vulnerability to phishing or harmful assaults.
4.1 Machine Learning Based Spam Filtering Methods
The way we combat the issue of email spam has been revolutionized using ML-
based spam detection methods. These techniques use complex algorithms to
categorize incoming emails into two categories: spam or authentic, based on
learning from labelled datasets. The ensemble method, which combines
various machine learning models, including decision trees and support vector
machines, to reach a decision, is a well-liked strategy. The spam filtering
system’s overall reliability and precision are enhanced by this ensemble
technique. Keyword frequency, header analysis, and text analysis are just a few
of the significant features that are extracted from emails using feature
engineering techniques. The machine learning algorithms then receive these
features and learn relationships and patterns to produce precise predictions.
4.1.1 Supervised Machine Learning
An area of artificial intelligence known as supervised machine learning is
concerned with developing models to make forecasts or judgments using
labelled training data. It is an effective and frequently used method that
enables algorithms to generalize their knowledge and learn from examples to
make precise predictions on unobserved data. In the case of supervised
learning, the input features (sometimes referred to as predictors) and the
accompanying output labels make up the training data. The objective is to
identify a model or function that maps the input features to the
appropriate output labels. The training phase and the prediction phase are the
two main steps in this approach. During the training phase, the machine
learning system learns from the provided labelled samples. To find patterns and
relationships, it analyses the input features and the labels that go with them.
The algorithm adjusts its internal weights or parameters to reduce the
discrepancy between its predicted outputs and the actual labels. During the
prediction phase, the trained model assigns labels to new, unseen inputs.
4.1.2 Decision Tree Classifier
A well-liked supervised machine learning approach called a decision tree
classifier develops a model that resembles trees for classification tasks by
learning decision rules from labelled training data. The tree is a
hierarchical arrangement of internal nodes and leaf nodes: each internal node
represents a decision based on a feature, and each leaf node represents a
class label. Decision trees are useful for gaining insight into how
decisions are made since they are simple to understand and visualize. They
are efficient with large datasets and can handle both categorical and numerical
features. Due to their versatility and simplicity, decision tree classifiers are
frequently employed in a variety of industries, such as marketing, healthcare,
and finance.
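To illustrate how a tree picks the feature for an internal node, the sketch below scores candidate splits by weighted Gini impurity. The choice of Gini impurity, the boolean features `has_free` and `has_link`, and the four toy rows are assumptions made for the example; the report does not say which split criterion was used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, feature):
    """Weighted Gini impurity after splitting rows on a boolean feature."""
    left = [label for feats, label in rows if feats[feature]]
    right = [label for feats, label in rows if not feats[feature]]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical rows: (features, label)
rows = [({"has_free": True, "has_link": True}, "spam"),
        ({"has_free": True, "has_link": False}, "spam"),
        ({"has_free": False, "has_link": True}, "ham"),
        ({"has_free": False, "has_link": False}, "ham")]

# The feature with the lowest impurity after the split becomes the node's test.
best = min(["has_free", "has_link"], key=lambda f: split_gini(rows, f))
print(best)
```

In this toy data `has_free` separates the two classes perfectly, so it is chosen over `has_link`.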
4.1.3 Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a widely used supervised machine
learning method that excels at classification and regression tasks.
Because it can work with complex data sets and spaces with high dimensions,
it is adaptable for a variety of applications. SVMs search for the best
hyperplane to maximize the margin between classes to improve
generalization and minimize overfitting. The kernel method, which allows for
nonlinear boundaries for decisions by implicitly translating the data into
higher-dimensional feature spaces, is one of the main benefits of SVM.
This adaptability, together with the regularization parameter’s control
over the bias-variance trade-off, aids SVMs in making precise predictions and
adjusting to various datasets. With a soft margin, SVMs can tolerate some
outliers, and with a linear kernel their decision boundaries are comparatively
easy to interpret. SVMs continue to be a potent machine
learning tool for a variety of real-world applications thanks to their broad
range of kernel functions and scaling through optimization strategies.
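The margin idea can be made concrete with the hinge loss that a linear soft-margin SVM minimizes. The following is a minimal sketch, not a training procedure; the weights, bias, and sample points are hypothetical values chosen for illustration.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def hinge_loss(weights, bias, x, y):
    """Hinge loss for a label y in {-1, +1}; zero once the margin is at least 1."""
    return max(0.0, 1.0 - y * decision_value(weights, bias, x))

# Hypothetical 2-feature example (e.g. scaled counts of two words).
weights, bias = [1.0, -1.0], 0.0
inside_margin = hinge_loss(weights, bias, [0.5, 0.0], +1)   # correct side, but too close
outside_margin = hinge_loss(weights, bias, [2.0, 0.0], +1)  # correct side, beyond the margin
wrong_side = hinge_loss(weights, bias, [1.0, 0.0], -1)      # misclassified point
print(inside_margin, outside_margin, wrong_side)
```

Points beyond the margin contribute no loss, which is why only the points near the boundary (the support vectors) shape the final hyperplane.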
4.1.4 Naive Bayes Classifier (NB)
Based on the Bayes theorem of probability, the Naive Bayes algorithm is a well-
known machine learning technique for categorization. Naive Bayes is a
probabilistic classifier algorithm that is easy to use, quick to train, and
effective in a variety of real-world scenarios. First, the Naive Bayes classifier
determines the likelihood that a new observation will belong to each class
based on its attributes. This is done with Bayes’ theorem, which states that
the probability of a hypothesis (in this case, the class) given certain
observed evidence (the features) is proportional to the likelihood of the
evidence given the hypothesis times the prior probability of the hypothesis.
The algorithm is called naive Bayes because it assumes that the features are
conditionally independent of one another given the class. This
presumption makes the probability calculation easier and more
computationally efficient. Naive Bayes continues to perform well in practice,
especially when the number of features is large, even though this assumption
might not always be true.
4.2 Overall Insights of Machine Learning Algorithms for
Spam Detection
After reviewing the literature, we found that most datasets used to develop,
test, and apply various models were made artificially. The labelling of all the
supervised data models is challenging, and there aren’t enough examples for
analysis. As a result of the synthetic data sets used to train the models, the
findings of the classifiers are not completely trustworthy. A large range of
machine learning models are now used for the detection of spam emails; these
are not indicative of real-world spam evaluations. The three frequently used
learning algorithms—logistic regression, naive bayes, and support vector
machines (SVM)—outperform the alternative learning algorithms in most of
the experiments and research. Typically, Naive Bayes as well as logistic
regression outperform SVM in terms of performance. However, as SVM is not
compared to all the other algorithms, it should not be regarded as the best
method alone. Future research should examine different learning models on
different datasets using a variety of feature engineering techniques.
Chapter 5
Naive Bayes Algorithm
A straightforward yet effective technique called Naive Bayes is used in
machine learning to perform classification tasks. It is based on Bayes’
theorem and makes the “naive” assumption that each feature is
independent of the others given the class.
Bayes’ theorem: P(A|B) = (P(B|A) * P(A)) / P(B)
Here’s how the Naive Bayes algorithm works:
Training Phase: Given a labelled training dataset, the algorithm estimates the
prior probability of each class and the likelihood of each feature given each
class. In essence, it learns the data’s statistical characteristics.
Feature Independence Assumption: Naive Bayes assumes that the features are
conditionally independent of one another given the class. This assumption
makes the computations easier and increases the computational efficiency of
the method.
Classification Phase: When a new, unlabelled instance is supplied, the
algorithm computes the posterior probability of each class based on the
observed features. The instance is assigned to the class with the highest
posterior probability.
To calculate the posterior probability, Naive Bayes uses the Bayes theorem:
P(class|features) = (P(class)* P(features|class))/P(features)
Here:
P(class|features) is the posterior probability of the class given the observed features.
P(class) is the class’s prior probability.
P(features|class) is the likelihood of the observed features given the class.
P(features) is the probability of the observed features.
The likelihood term may be computed as the product of the individual feature
probabilities since the method assumes that the characteristics are
conditionally independent.
P(features|class) = P(feature1|class) *P(feature2|class)*...* P(featureN|class)
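Because P(features) is the same for every class, it can be dropped when only the most probable class is needed. A sketch of the resulting decision rule, writing c for a class and f_i for the i-th feature (symbols introduced here only for illustration), is:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{N} P(f_i \mid c)
        = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{N} \log P(f_i \mid c) \right]
```

The logarithmic form is commonly preferred in practice because multiplying many small probabilities can underflow floating-point arithmetic.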
The Naive Bayes technique is frequently used in text classification
applications where the features correspond to words or word frequencies,
such as spam filtering or sentiment analysis. It works well and is
computationally efficient even with small training datasets. The assumption of
feature independence, however, does not always hold, which can
lead to less-than-ideal outcomes.
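The training and classification phases described in this chapter can be sketched as a small multinomial Naive Bayes classifier over word lists. This is an illustrative sketch, not the project's implementation: the class name `NaiveBayes`, the toy documents, and the use of add-one (Laplace) smoothing to avoid zero probabilities are all assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over word lists, with add-one smoothing (assumed)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # used for the priors P(class)
        self.word_counts = defaultdict(Counter)   # per-class word frequencies
        self.vocabulary = set()
        for words, label in zip(documents, labels):
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def log_posterior(self, words, label):
        # log P(class) + sum of log P(word|class); P(features) is omitted
        # because it is identical for every class.
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in words:
            count = self.word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(self.vocabulary)))
        return score

    def predict(self, words):
        return max(self.class_counts, key=lambda c: self.log_posterior(words, c))

# Hypothetical, already tokenized training messages.
docs = [["free", "prize", "win"], ["free", "offer", "now"],
        ["meeting", "at", "noon"], ["lunch", "at", "home"]]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["free", "prize"]))
print(model.predict(["meeting", "at", "home"]))
```

Working with sums of logarithms rather than products of probabilities keeps the scores numerically stable for long messages.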
Chapter 6
Steps Involved
6.1 Data Collection
The first step is to gather a large dataset of emails containing both spam and ham. The dataset contains
$228 records, each of which is a sample email, and a target column, "target," which indicates whether an email
is spam or not. The composition of the records in terms of the two target categories is as follows:
the data contains 87.37 percent ham mail and 12.63 percent spam mail. It is taken from an open resource on
Kaggle.
6.2 Data Cleaning and Exploration
Data cleaning and exploration are important steps in machine learning to
ensure that the data is suitable for analysis and modelling.
6.2.1 Data Cleaning
➢ Identification and treatment of missing values in the dataset.
Techniques like imputation (replacing missing data with an
approximation) or removing rows or columns with missing values may
be used to achieve this.
➢ Detecting and Handling Outliers: Outliers are extreme values that
might bias an analysis. Outliers can be handled in one of three
ways: by removing them, by transforming them, or by using robust statistical
methods.
➢ Eliminating Duplicates: To prevent bias in the analysis, locate and
eliminate duplicate records.
➢ Data formatting: Make sure that the data is formatted correctly (e.g.,
numerically, or categorically) and fix any formatting errors.
➢ Handling Noisy Data: Recognize and address any data that is erratic
or inconsistent that may have an adverse effect on the analysis.
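Two of the cleaning steps above, removing rows with missing values and eliminating duplicates, can be sketched as follows. The function name `clean_records` and the toy `raw` rows are hypothetical; a real project would more likely use a data-frame library for this.

```python
def clean_records(records):
    """Drop rows with a missing text or label, then drop duplicate rows."""
    seen = set()
    cleaned = []
    for text, label in records:
        if text is None or label is None or not text.strip():
            continue  # missing value: remove the row
        key = (text.strip(), label)
        if key in seen:
            continue  # duplicate: keep only the first occurrence
        seen.add(key)
        cleaned.append(key)
    return cleaned

# Hypothetical raw rows: (text, label)
raw = [("Free entry now", "spam"), ("Free entry now", "spam"),
       ("See you at 5", "ham"), (None, "ham"), ("   ", "spam"),
       (" See you at 5 ", "ham")]
cleaned = clean_records(raw)
print(cleaned)
```

Stripping surrounding whitespace before comparing rows also catches duplicates that differ only in formatting.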
6.2.2 Data Exploration
Calculate descriptive statistics like the mean, median, standard deviation, and
quartiles to have a better understanding of the distribution, dispersion, and central
patterns of the data.
➢ Data visualization: To see trends, correlations, and possible outliers, visualize
the data using charts, histograms, scatter plots, etc.
➢ Analyse each feature’s link to the target variable to better understand how it
affects the current issue. This may entail running hypothesis tests, evaluating
correlations, or using domain knowledge.
➢ Principal component analysis (PCA) or feature selection methods may be
used to reduce the dimensionality of a dataset if it contains a lot of
characteristics and to concentrate on the most useful ones.
➢ Handling Class Imbalance: Investigate approaches like oversampling,
undersampling, or creating synthetic samples to balance the classes if the dataset
includes unbalanced classes (one class is much more abundant than the
others).
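As an illustration of the class-imbalance point, random oversampling can be sketched as below. The helper name `oversample`, the seed, and the toy records are assumptions; the report does not state whether any resampling was actually applied.

```python
import random
from collections import Counter

def oversample(records, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in records:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced

# Hypothetical imbalanced data: 7 ham rows and 1 spam row.
records = [("hi", "ham")] * 7 + [("win cash", "spam")] * 1
counts = Counter(label for _, label in oversample(records))
print(counts)
```

Resampling of this kind is normally applied to the training set only, so that the test set keeps the class proportions of the real data.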
Fig 6.3 Number of characters in each type
Fig 6.4 Number of words in each type
Correlation:
From the correlation heatmap, we infer that as num_characters (the number of
characters) in the text increases, the tendency of the target to be spam increases; the
correlation is 0.38 as shown. The same holds for num_words and num_sentences.
Since num_characters is correlated with the target, we will use this column
for building our machine learning model.
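A sketch of how the three length features and their correlation with the target might be computed is given below. The regular-expression tokenization is a simple stand-in (the project used NLTK), and the four toy messages are hypothetical, so the resulting coefficient is not comparable to the 0.38 reported above.

```python
import math
import re

def text_features(text):
    """Return (num_characters, num_words, num_sentences) for one message."""
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text), len(words), len(sentences)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical messages: (text, target), with target 1 = spam and 0 = ham.
messages = [("Ok", 0), ("See you soon.", 0),
            ("WINNER! Claim your free prize now. Call today!", 1),
            ("Urgent! Your account needs verification. Reply now.", 1)]
num_characters = [text_features(text)[0] for text, _ in messages]
target = [label for _, label in messages]
r = pearson(num_characters, target)
print(round(r, 2))
```

A positive coefficient here means that longer messages tend to carry the spam label, which is the pattern the heatmap describes.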
6.3 Data Pre-processing
Preparing raw data for analysis and modelling is known as data preprocessing, and
it is an essential stage in machine learning. By addressing inconsistencies, handling
missing values, and preparing the data for machine learning algorithms, it helps
improve the quality of the data.
It aids in ensuring that the data is in an appropriate format for training models,
lessens noise and inconsistent data, and enhances the effectiveness and
dependability of the resulting models.
Here are the steps we have performed to preprocess the data for our model:
➢ Converting the text to lower case.
➢ Tokenization: This is a fundamental step in data preprocessing for machine
learning, especially when dealing with text data. It involves breaking down a
sequence of text into smaller units called tokens. Tokens can be words,
characters, or even subwords, depending on the specific task and requirements.
Tokenization is generally carried out using specialized tools or libraries; we
have used a Python library called NLTK for this purpose.
➢ After tokenization, the resulting tokens undergo further preprocessing steps
to remove stop words. Stop words are commonly occurring words with little
semantic value. These words are removed so that the remaining important
words, or content words, become more prominent.
➢ Special characters are also removed from the tokenized words.
➢ Stemming: Natural language processing (NLP) uses stemming to reduce
words to their “stem,” or root, form. By eliminating prefixes, suffixes,
and other affixes from words, stemming makes it possible for several word
variants to be represented by a single root form, which further standardizes
the text representation.
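The preprocessing steps listed above can be sketched end to end in plain Python. This is a simplified stand-in for the NLTK-based pipeline: the tiny `STOP_WORDS` set and the `crude_stem` suffix stripper are hypothetical and far rougher than NLTK's stop-word list and stemmers.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "you", "your"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                   # lower-case the text
    tokens = re.findall(r"[a-z0-9]+", text)               # tokenize; special characters are dropped here
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [crude_stem(t) for t in tokens]                # reduce words to a root form

print(preprocess("You are WINNING free prizes!!! Claimed rewards..."))
# ['winn', 'free', 'priz', 'claim', 'reward']
```

The stems need not be dictionary words; what matters is that variants such as "claimed" and "claim" end up as the same token.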
Data is pre-processed to extract useful information and analysis out of it,
which can be used to build and improve models.
We have also pre-processed the data to detect the most frequently used words
in both categories of the target, and we drew word clouds for them.
Word clouds are visual representations of text data where words are displayed
in varying sizes based on their frequency or importance in the text. While
word clouds are not directly used in machine learning algorithms, they serve
as a valuable tool for data exploration and understanding the characteristics
of text data.
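The frequency table behind a word cloud can be sketched with a counter over the preprocessed tokens. The `spam_tokens` lists below are hypothetical; drawing the cloud itself would need a plotting library and is not shown.

```python
from collections import Counter

def top_words(token_lists, n=3):
    """Return the n most frequent tokens across all messages of one category."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(n)

# Hypothetical preprocessed spam messages.
spam_tokens = [["free", "prize", "call"], ["free", "call", "now"], ["free", "offer"]]
print(top_words(spam_tokens, 2))
# [('free', 3), ('call', 2)]
```

Running the same function on the ham messages gives the second table, and comparing the two shows which words are characteristic of each category.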
6.4 Model Building
In machine learning, the process of generating a predictive or descriptive model
using a given dataset is referred to as model construction. It entails choosing the best
method, training the model using the given data, and assessing the model’s
performance. In order to generate predictions, get new perspectives, or solve
particular issues based on the patterns and correlations discovered from the data,
model development is an essential stage in machine learning.
A vectorizer is a component or method used in machine learning to transform
non-numerical raw input, including text or categorical variables, into numerical
representations, frequently in the form of vectors. Since the majority of machine
learning methods require numerical inputs, vectorization is a fundamental stage in
the model-building process. The algorithms can successfully process and
comprehend the data thanks to vectorization.
Here are the types of vectorizers we have used in our project:
1. Count Vectorizer: Count Vectorizer is a vectorizer designed for text data.
It transforms a collection of text documents into a matrix that records the
frequency of words or phrases used in the texts. Each document is converted into
a vector whose entries are the counts of particular words or phrases.
CountVectorizer is an easy and efficient way to represent text data as
numerical features.
2. TF-IDF Vectorizer: TF-IDF (Term Frequency-Inverse Document Frequency) is
another vectorization method used for text data. Each word in a document is
given a weight depending on how frequently it appears in that document and how
uncommon it is throughout the corpus. TF-IDF thus considers a word’s relevance in
the whole corpus of texts in addition to how often it appears in a given document.
The TF-IDF vectorizer creates a matrix in which each document is represented by
a vector of TF-IDF scores, one for each word or phrase.
Vectorization transforms raw data into a suitable numerical representation, enabling
the application of a wide range of machine learning algorithms for analysis,
modelling, and prediction.
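The two vectorizers can be illustrated with a small pure-Python sketch. The TF-IDF weighting used here is the plain textbook form tf * log(N / df), which is an assumption made for clarity; library implementations typically apply smoothing and normalization, so their numbers will differ. The helper names and toy documents are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    words = sorted({w for doc in documents for w in doc})
    return {word: i for i, word in enumerate(words)}

def count_vector(document, vocabulary):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(document)
    return [counts[word] for word in vocabulary]

def tfidf_vector(document, vocabulary, documents):
    """Textbook weighting: term count times log(N / document frequency)."""
    n_docs = len(documents)
    counts = Counter(document)
    vector = []
    for word in vocabulary:
        df = sum(1 for d in documents if word in d)
        vector.append(counts[word] * math.log(n_docs / df))
    return vector

# Hypothetical tokenized corpus.
docs = [["free", "prize", "free"], ["free", "meeting"], ["meeting", "noon"]]
vocab = build_vocabulary(docs)            # free, meeting, noon, prize
counts0 = count_vector(docs[0], vocab)
tfidf0 = tfidf_vector(docs[0], vocab, docs)
print(counts0)
print([round(v, 3) for v in tfidf0])
```

In the first document "free" occurs twice and "prize" once, yet "prize" receives the higher TF-IDF weight because it appears in only one document of the corpus.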
Then, we used the fit_transform method to fit the transformation on the training
data and the transform method to apply the fitted transformation to the test data.
Here’s how it works:
Fit the transformer. First, using the training data, fit_transform fits the
transformation or preprocessing step. The parameters or statistics that will be
used to transform the data are learned from the training set at this stage. In the
case of a vectorizer, for instance, it analyses the training data to learn the
vocabulary, word frequencies, or other pertinent information.
Apply the transformation. After the transformer has been fitted, fit_transform
applies the learned transformation to the training data. Based on the parameters
or statistics learned during fitting, it changes the data. In general, such a
transformation could entail encoding categorical variables, scaling numerical
features, or transforming text into numerical representations.
Test Data Transformation: The same fitted transformation is then applied to the
test data with the transform method, without fitting again. This guarantees
consistency and permits a fair assessment of the model by processing the test data
in the same manner as the training data.
Fitting on the training data only and reusing the same transformation for the test
data keeps the preprocessing steps consistent and avoids data leakage between the
two sets.
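The usual convention for such transformers is to learn the parameters from the training data and only reuse them on the test data. The sketch below shows that convention with a hypothetical minimal class, `CountVectorizerSketch`, which mimics the fit / transform / fit_transform interface; it is not the library class used in the project.

```python
from collections import Counter

class CountVectorizerSketch:
    """Hypothetical minimal vectorizer exposing fit / transform / fit_transform."""

    def fit(self, documents):
        # Learn the vocabulary (the "parameters") from the given documents.
        words = sorted({w for doc in documents for w in doc})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, documents):
        # Apply the already-learned vocabulary; nothing is learned here.
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for word, count in Counter(doc).items():
                if word in self.vocabulary_:   # words unseen in training are ignored
                    row[self.vocabulary_[word]] = count
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

train_docs = [["free", "prize"], ["meeting", "noon"]]
test_docs = [["free", "lunch", "free"]]

vectorizer = CountVectorizerSketch()
X_train = vectorizer.fit_transform(train_docs)  # learn the vocabulary from training data only
X_test = vectorizer.transform(test_docs)        # reuse it; do not refit on the test data
print(X_train, X_test)
```

Because "lunch" never occurred in the training documents, it has no column and is simply dropped from the test vector, which is how information from the test set is kept out of the fitted transformation.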
6.5 Evaluation
Once the model has been trained, it is validated and tested using a separate dataset.
This dataset should be different from the training dataset to ensure that the model
can generalize to new data. The performance of the model is analysed for different
machine learning algorithms to find the most suitable one. The model’s performance
is evaluated based on metrics such as accuracy, precision, recall, and F1 score.
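The four metrics named above can be computed directly from the counts of true and false positives and negatives. The sketch below treats spam (label 1) as the positive class; the `y_true` and `y_pred` lists are hypothetical and unrelated to the results reported in this project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 score for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For a spam filter, precision is often watched most closely, since a false positive means a legitimate email was sent to the spam folder.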
Chapter 7
Result
In this research, we created a model for identifying email spam based on the Naive
Bayes algorithm. A publicly accessible dataset of a large number of emails, including
spam and valid emails, was used to train and test the model. To minimize the
dimensionality of the dataset, stop words were eliminated, and stemming was
applied. The model’s effectiveness was assessed using a number of criteria, including
accuracy and precision. The findings revealed that our model had an accuracy of
97.0986% and precision of 98.8%. These numbers show that our model is highly
accurate and capable of distinguishing between real email and spam.
This is our website, where we have entered an email, and it has been
classified as spam.
Chapter 8
Conclusion
As a result, we have successfully used the Naive Bayes method to construct an email
spam identification system. Our method has proven to be highly accurate at
differentiating between spam and ham emails, and it has the potential to significantly
increase email communication’s effectiveness and security. A variety of data types
can be classified using the straightforward and effective Naive Bayes algorithm. In
the implementation, we classified emails according to their content as well
as other characteristics like the address of the sender and subject line. We used the
algorithm to predict the label of new, unseen emails after training it on a dataset
of labelled emails.
In comparison to conventional rule-based spam filters, our method significantly
improved spam detection accuracy to over 97%. This degree of accuracy was
attained by carefully choosing and preprocessing the features, as well as by
tuning the algorithm’s parameters. The speed and scalability of our technology are
two of its main benefits. Naive Bayes is a rather simple algorithm, and the way we
implemented it is capable of processing lots of emails simultaneously. Due to this,
our technology is appropriate for use in business email networks and other
scenarios where large volumes of email must be handled.
Overall, we think that our Naive Bayes-based email spam detection system
has a lot of promise to increase email communication’s effectiveness and security.
Chapter 9
References
➢ M. Abd-Eldayem and E. El-Sayed, “A Hybrid Model for Email Spam Filtering
Based on Fuzzy Inference and Genetic Algorithm,” in IEEE Access, vol. 7, pp.
8684–8694, 2019. doi: 10.1109/ACCESS.2019.2895518
➢ P. Mohammadi, H. R. Pourshahabi, and N. Karimian, “Email Spam Filtering
Using Deep Learning Techniques,” in IEEE Access, vol. 7, pp. 116726–116738,
2019. doi: 10.1109/ACCESS.2019.2930799
➢ M. Iqbal, R. Khan, and S. A. Khayam, “Email Spam Filtering: A Machine
Learning Approach,” in IEEE Transactions on Emerging Topics in Computing,
vol. 5, no. 1, pp. 118–131, 2017. doi: 10.1109/TETC.2015.2497579
➢ M. Kumar and R. Singh, “Email Spam Detection Using Naive Bayes Algorithm,”
2021 International Conference on Data Science and Business Analytics
(ICDSBA), Noida, India, 2021, pp. 1-6. doi: 10.1109/ICDSBA52091.2021.9459928

How to build machine learning apps.pdfHow to build machine learning apps.pdf
How to build machine learning apps.pdf
 
How to build machine learning apps.pdf
How to build machine learning apps.pdfHow to build machine learning apps.pdf
How to build machine learning apps.pdf
 
How to build machine learning apps.pdf
How to build machine learning apps.pdfHow to build machine learning apps.pdf
How to build machine learning apps.pdf
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learning
 
IRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection SystemIRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection System
 
IRJET- Identify the Human or Bots Twitter Data using Machine Learning Alg...
IRJET-  	  Identify the Human or Bots Twitter Data using Machine Learning Alg...IRJET-  	  Identify the Human or Bots Twitter Data using Machine Learning Alg...
IRJET- Identify the Human or Bots Twitter Data using Machine Learning Alg...
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 
IRJET - Online Product Scoring based on Sentiment based Review Analysis
IRJET - Online Product Scoring based on Sentiment based Review AnalysisIRJET - Online Product Scoring based on Sentiment based Review Analysis
IRJET - Online Product Scoring based on Sentiment based Review Analysis
 
A Survey on Machine Learning Algorithms
A Survey on Machine Learning AlgorithmsA Survey on Machine Learning Algorithms
A Survey on Machine Learning Algorithms
 
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
IRJET -  	  An User Friendly Interface for Data Preprocessing and Visualizati...IRJET -  	  An User Friendly Interface for Data Preprocessing and Visualizati...
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
 

Recently uploaded

Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxcallscotland1987
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 

Recently uploaded (20)

Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 

trialFinal report7th sem.pdf

  • 1. Email Spam Detection Using Machine Learning A Report Submitted in Partial Fulfilment of the Requirements for the Degree of Bachelor of Technology in Computer Science & Engineering by Vishal Verma (20204231) Vikash Yadav (20204228) Vishal Vishwakarma (20204232) Uma Patel (20204219) Shubham Gautam (20204198) Under the guidance of Prof. Manoj Madhava Gore to the COMPUTER SCIENCE AND ENGINEERING DEPARTMENT MOTILAL NEHRU NATIONAL INSTITUTE OF TECHNOLOGY ALLAHABAD, PRAYAGRAJ November 2023
  • 2. UNDERTAKING We declare that the work presented in this report titled “Email Spam Detection Using Machine Learning”, submitted to the Computer Science and Engineering Department, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, for the award of the Bachelor of Technology degree in Computer Science & Engineering, is our original work. We have not plagiarized or submitted the same work for the award of any other degree. In case this undertaking is found incorrect, we accept that our degree may be unconditionally withdrawn. November 2023 Allahabad Vishal Verma Vikash Yadav Vishal Vishwakarma Uma Patel Shubham Gautam
  • 3. CERTIFICATE Certified that the work contained in the report titled “Email Spam Detection Using Machine Learning”, by Vishal Verma, Vikash Yadav, Vishal Vishwakarma, Uma Patel, Shubham Gautam has been carried out under my supervision and that this work has not been submitted elsewhere for a degree. (Prof. Manoj Madhava Gore) Computer Science and Engineering Dept. M.N.N.I.T, Allahabad November 2023
  • 4. Acknowledgement This project is the result of a large amount of work, research, and dedication. Firstly, we would like to thank our project supervisor, Prof. Manoj Madhava Gore, for providing us with the necessary guidance and support to complete this project. His insights, expertise, and valuable feedback were essential to the success of this project. It has truly been a privilege working under him during this semester. We would also like to express our sincere gratitude and appreciation to everyone who has contributed to the completion of this project.
  • 5. Table of Contents
    1. Motivation
    2. Problem Statement
    3. Workflow
    4. Machine Learning Algorithms
       4.1 Machine Learning Based Spam Filtering Methods
           4.1.1 Supervised Machine Learning
           4.1.2 Decision Tree Classifier
           4.1.3 Support Vector Machine (SVM)
           4.1.4 Naive Bayes Classifier (NB)
       4.2 Overall Insights of Machine Learning Algorithms for Spam Detection
    5. Naive Bayes Algorithm
    6. Steps Involved
       6.1 Data Collection
       6.2 Data Cleaning and Exploration
           6.2.1 Data Cleaning
           6.2.2 Data Exploration
       6.3 Data Pre-processing
       6.4 Model Building
       6.5 Evaluation
    7. Result
    8. Conclusion
    9. References
  • 6. Chapter 1 Motivation
    The driving force behind this email spam filtering system is the aim of improving users’ email experience by reducing the pain and frustration caused by spam. By filtering out spam successfully, users save the time and effort of manually looking through and deleting unwanted messages. Improving email security is another major driver: spam emails often carry dangerous material, such as phishing links and malware attachments, which can compromise users’ personal data or infect their devices. A further motivation is reducing false positives, where valid emails are wrongly classified as spam, so that important emails from reliable sources reach users’ inboxes promptly. The ultimate goal is a safe, effective, and dependable email environment that improves user satisfaction and promotes confidence in the email service provider.
    Although Google and other email providers already identify spam emails, their filters are still evolving and rely on continuous user input. Well-known email providers such as Gmail, Yandex, and Yahoo Mail offer basic services to end users for free, subject to an EULA. Private firms, however, operate their own email servers and want them to be more secure because of the personal data they hold, so there is large potential for developing email spam classifiers that can be supplied to such organisations.
  • 7. Chapter 2 Problem Statement
    This project identifies spam using machine learning approaches, with a focus on detecting fraudulent spam emails. We will apply several machine learning algorithms to our dataset and then choose the most precise and accurate one for email spam detection. The result is a complete spam filtering system that determines which emails are regular and which are spam, helping both individuals and businesses keep their inboxes spam-free.
  • 8. Chapter 3 Workflow
    ➢ Gather the raw dataset and import it: collect the necessary data from its sources and import it into the machine learning environment.
    ➢ Clean the data: carry out initial cleaning to deal with missing values, outliers, and inconsistent data in the dataset. Techniques like imputation, outlier removal, and standardization may be used for this.
    ➢ Explore the data: analyse the dataset to spot trends, relationships, and important features. This aids data comprehension and decision-making during model preparation and training.
    ➢ Split the data: separate the pre-processed dataset into training and test sets. The model is trained on the training set, and the test set is then used to assess its performance.
    ➢ Choose an algorithm: based on the characteristics of the problem and the information at hand, select a suitable machine learning algorithm. Consider the kind of learning (supervised or unsupervised), the desired output (classification or regression), and the required performance.
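The train/test split step of the workflow can be sketched in plain Python. This is only an illustration: the function name `train_test_split`, the 20 percent test fraction, the fixed seed, and the toy emails are assumptions made for the example, not details taken from the report.

```python
import random

def train_test_split(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle the indices once, then cut them into a test part and a training part."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, int(round(len(indices) * test_fraction)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Hypothetical toy data: ten short "emails" with alternating labels.
emails = [f"email number {i}" for i in range(10)]
labels = ["spam" if i % 2 == 0 else "ham" for i in range(10)]
x_train, x_test, y_train, y_test = train_test_split(emails, labels)
print(len(x_train), len(x_test))
```

Keeping the test portion untouched until evaluation is what makes the measured performance a fair estimate for unseen emails.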
  • 10. Chapter 4 Machine Learning Algorithms
    The threat of spam has reached unprecedented heights in the modern digital era, where email has become a crucial form of communication. Both email service providers and users constantly struggle to sort through an enormous volume of unsolicited and potentially hazardous messages. Advances in machine learning algorithms now give us strong tools to address this problem. These algorithms learn from trends and traits in large amounts of data to distinguish accurately between spam and legitimate emails. By automating the categorization process, they increase productivity and keep users’ inboxes free of clutter while reducing their exposure to phishing and other harmful attacks.
    4.1 Machine Learning Based Spam Filtering Methods
    ML-based spam detection methods have changed the way we combat email spam. These techniques learn from labelled datasets and use the resulting models to categorize incoming emails as either spam or legitimate. A popular strategy is the ensemble method, which combines several machine learning models, such as decision trees and support vector machines, to reach a decision. The ensemble approach improves the overall reliability and precision of the spam filtering system. Keyword frequency, header analysis, and text analysis are just a few
  • 11. of the significant features that are extracted from emails using feature engineering techniques. The machine learning algorithms then receive these features and learn relationships and patterns to produce precise predictions.
    4.1.1 Supervised Machine Learning
    Supervised machine learning is an area of artificial intelligence concerned with developing models that make predictions or decisions using labelled training data. It is an effective and frequently used approach that enables algorithms to learn from examples and generalize, so that they can make accurate predictions on unseen data. In supervised learning, the training data consists of input features (sometimes referred to as predictors) and the accompanying output labels. The objective is to
  • 12. identify a model or function that maps the input features to the appropriate output labels. The approach has two main steps: the training phase and the prediction phase. During the training phase, the machine learning system learns from the provided labelled samples. It analyses the input features and their labels to find patterns and relationships, and it adjusts its internal weights or parameters to reduce the discrepancy between its predicted outputs and the actual labels.
    4.1.2 Decision Tree Classifier
    A decision tree classifier is a popular supervised machine learning approach that builds a tree-like model for classification tasks by learning decision rules from labelled training data. Each inner node
  • 13. represents a decision based on a feature, and every leaf node represents a class label. The model is thus a hierarchical arrangement of internal nodes and leaf nodes. Decision trees are useful for gaining insight into how decisions are made since they are simple to understand and visualize. They work efficiently with large datasets and can handle both categorical and numerical features. Due to their versatility and simplicity, decision tree classifiers are frequently employed in a variety of industries, such as marketing, healthcare, and finance.
    4.1.3 Support Vector Machine (SVM)
    The Support Vector Machine (SVM) is a widely used supervised machine learning method that excels at classification and regression tasks.
  • 14. Because it can work with complex datasets and high-dimensional spaces, it is adaptable to a variety of applications. SVMs search for the hyperplane that maximizes the margin between classes, which improves generalization and reduces overfitting. One of the main benefits of SVM is the kernel trick, which allows nonlinear decision boundaries by implicitly mapping the data into higher-dimensional feature spaces. This adaptability, together with the regularization parameter’s control over the bias-variance trade-off, helps SVMs make precise predictions and adjust to various datasets. Soft-margin SVMs are reasonably robust to outliers, and linear SVMs offer decision boundaries that are easy to interpret, which is useful in areas where interpretability is essential. With their broad range of kernel functions and optimization strategies that help them scale, SVMs remain a potent machine learning tool for a variety of real-world applications.
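The margin idea can be illustrated with a small linear SVM trained by subgradient descent on the regularized hinge loss. This is a minimal sketch under stated assumptions, not the report's implementation: the function names, the hyperparameters (`lam`, `lr`, `epochs`), and the two-dimensional toy features are all hypothetical, and a real project would more likely rely on a library implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Reduce lam/2 * ||w||^2 + hinge loss with per-sample subgradient steps."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:
                # Inside the margin or misclassified: shrink w and push the point outward.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: only the regularizer shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical, linearly separable toy features; +1 stands for spam, -1 for ham.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([svm_predict(w, b, x) for x in X])
```

The condition `yi * (dot(w, xi) + b) < 1` is the margin check: points that already lie beyond the margin no longer pull on the hyperplane, which is why only the closest points (the support vectors) end up determining it.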
  • 15. 4.1.4 Naive Bayes Classifier (NB)
    The Naive Bayes algorithm is a well-known machine learning technique for classification based on Bayes’ theorem of probability. It is a probabilistic classifier that is easy to use, quick to train, and effective in a variety of real-world scenarios. The Naive Bayes classifier determines the probability that a new observation belongs to each class based on its features. It does so using Bayes’ theorem, which states that the probability of a hypothesis (in this case, the class) given certain observed evidence (the features) is proportional to the likelihood of the evidence given the hypothesis times the prior probability of the hypothesis. The algorithm is called naive because it assumes that the features are conditionally independent given the class. This assumption makes the probability calculation simpler and more computationally efficient. Although it does not always hold, Naive Bayes continues to perform well in practice, especially when the number of features is large.
    4.2 Overall Insights of Machine Learning Algorithms for Spam Detection
    After reviewing the literature, we found that most datasets used to develop, test, and apply the various models were created artificially. Labelling all the data needed for supervised models is challenging, and there are not enough examples for analysis. Because the models are trained on synthetic datasets, the classifiers’ findings are not completely trustworthy. A wide range of machine learning models is now used for spam email detection, but their reported results are not indicative of real-world spam. The three frequently used learning algorithms, logistic regression, naive Bayes, and support vector
  • 16. machines (SVM), outperform the alternative learning algorithms in most of the experiments and research. Typically, Naive Bayes and logistic regression outperform SVM. However, since SVM is not compared against all the other algorithms, no single method should be regarded as the best on this evidence alone. Future research should examine different learning models on different datasets using a variety of feature engineering techniques.
  • 17. Chapter 5 Naive Bayes Algorithm
    Naive Bayes is a straightforward yet effective technique used in machine learning for classification tasks. It is based on Bayes’ theorem and makes the “naive” assumption that each feature is independent of the others.
    Bayes’ theorem: P(A|B) = (P(A) * P(B|A)) / P(B)
    Here’s how the Naive Bayes algorithm works:
    Training phase: Given a labelled training dataset, the algorithm estimates the prior probability of each class and the likelihood of each feature given each class. In essence, it learns the statistical characteristics of the data.
  • 18. Feature independence assumption: Naive Bayes assumes that the features are conditionally independent of one another given the class. This assumption simplifies the computations and makes the method computationally efficient.
    Classification phase: When a new, unlabelled instance is supplied, the algorithm computes the posterior probability of each class based on the observed features. The instance is assigned to the class with the highest posterior probability. To calculate the posterior probability, Naive Bayes uses Bayes’ theorem:
    P(class|features) = (P(class) * P(features|class)) / P(features)
    Here P(class|features) is the posterior probability of the class given the observed features, P(class) is the prior probability of the class, P(features|class) is the likelihood of the observed features given the class, and P(features) is the probability of the observed features. Since the method assumes that the features are conditionally independent, the likelihood term can be computed as the product of the individual feature probabilities:
    P(features|class) = P(feature1|class) * P(feature2|class) * ... * P(featureN|class)
    The Naive Bayes technique is frequently used in text classification applications such as spam filtering or sentiment analysis, where the features correspond to words or word frequencies. It works well and is
  • 19. computationally efficient even with small training datasets. However, the assumption of feature independence does not always hold, which can lead to less-than-ideal results.
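The training and classification phases described in this chapter can be sketched as a small word-count Naive Bayes classifier. This is an illustrative sketch only: the function names and the four toy messages are hypothetical. Two details go beyond the text and are assumptions: add-one (Laplace) smoothing, so that a word unseen in one class does not force the product to zero, and log-probabilities, so that multiplying many small numbers does not underflow. P(features) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Training phase: count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def nb_predict(text, model):
    """Classification phase: pick the class with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        log_prob = math.log(n_docs / total_docs)  # log P(class)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training carry no information
            count = word_counts[label][token]
            # log P(word|class) with add-one smoothing (an assumption of this sketch)
            log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = ["win free prize now", "free cash offer",
        "meeting at noon tomorrow", "lunch with the team tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
print(nb_predict("free prize offer", model))
print(nb_predict("team meeting tomorrow", model))
```

Summing logarithms inside the loop is the code-level counterpart of the product P(feature1|class) * ... * P(featureN|class) given above.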
  • 20. Chapter 6 Steps Involved
    6.1 Data Collection
    The first step is to gather a large dataset of emails containing both spam and ham. The dataset contains $228 records, each of which is a sample email, and a target column, "target," which indicates whether an email is spam or not. In terms of the two target categories, the data contains 87.37 percent ham mail and 12.63 percent spam mail. It is taken from an open resource on Kaggle.
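A class composition like the one reported above can be computed directly from the target column. The sketch below uses a hypothetical eight-label toy list, not the report's dataset, and the function name `class_composition` is an assumption made for the example.

```python
from collections import Counter

def class_composition(labels):
    """Return the percentage share of each class, rounded to two decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

# Hypothetical toy target column: seven ham messages and one spam message.
toy_labels = ["ham"] * 7 + ["spam"]
composition = class_composition(toy_labels)
print(composition)
```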
  • 21. 6.2 Data Cleaning and Exploration
    Data cleaning and exploration are important steps in machine learning to ensure that the data is suitable for analysis and modelling.
    6.2.1 Data Cleaning
    ➢ Handling missing values: Identify and treat missing values in the dataset, using techniques like imputation (replacing missing data with an estimate) or removing rows or columns with missing values.
    ➢ Detecting and handling outliers: Outliers are extreme values that might bias an analysis. They can be handled in one of three ways: by removing them, by transforming them, or by using robust statistical methods.
    ➢ Eliminating duplicates: To prevent bias in the analysis, locate and eliminate duplicate records.
    ➢ Data formatting: Make sure that each field has the correct type (e.g., numerical or categorical) and fix any formatting errors.
    ➢ Handling noisy data: Recognize and address any erratic or inconsistent data that may have an adverse effect on the analysis.
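Two of the cleaning steps above, dropping records with missing values and eliminating duplicates, can be sketched as follows. The record layout (a list of `(text, label)` pairs), the function name `clean_records`, and the toy rows are assumptions for illustration; the report does not show its cleaning code here.

```python
def clean_records(records):
    """Drop rows with a missing text or label, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for text, label in records:
        if text is None or label is None or not text.strip():
            continue  # missing value: skip the row
        key = (text.strip(), label)
        if key in seen:
            continue  # duplicate of an earlier row: skip it
        seen.add(key)
        cleaned.append(key)
    return cleaned

# Hypothetical toy records.
raw = [
    ("Win a free prize", "spam"),
    ("Win a free prize", "spam"),   # duplicate
    ("See you at lunch", "ham"),
    (None, "ham"),                  # missing text
    ("   ", "spam"),                # empty text
]
cleaned = clean_records(raw)
print(cleaned)
```

The first occurrence of each record is kept, so the original order of the data is preserved.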
  • 22. 6.2.2 Data Exploration
    ➢ Descriptive statistics: Calculate descriptive statistics like the mean, median, standard deviation, and quartiles to better understand the distribution, dispersion, and central tendency of the data.
    ➢ Data visualization: To see trends, correlations, and possible outliers, visualize the data using charts, histograms, scatter plots, etc.
    ➢ Feature analysis: Analyse each feature’s relationship to the target variable to better understand how it affects the problem at hand. This may entail running hypothesis tests, evaluating correlations, or using domain knowledge.
    ➢ Dimensionality reduction: If the dataset contains many features, principal component analysis (PCA) or feature selection methods may be used to reduce its dimensionality and concentrate on the most useful features.
    ➢ Handling class imbalance: If the dataset has unbalanced classes (one class is much more abundant than the others), investigate approaches like oversampling, undersampling, or creating synthetic samples to balance the classes.
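As one illustration of handling class imbalance, random oversampling repeats minority-class samples until every class is as large as the biggest one. This is a sketch under assumptions: the function name `random_oversample`, the seed, and the toy data are hypothetical, and the report does not state which balancing technique, if any, was applied.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical toy data: four ham messages and one spam message.
texts = ["hi", "lunch?", "see you", "call me", "free prize"]
targets = ["ham", "ham", "ham", "ham", "spam"]
balanced_texts, balanced_targets = random_oversample(texts, targets)
print(Counter(balanced_targets))
```

Oversampling is normally applied to the training split only, so that duplicated messages cannot leak into the test set and inflate the measured performance.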
  • 23. Fig 6.3 Number of characters in each type
    Fig 6.4 Number of words in each type
  • 24. Correlation: From the correlation heatmap, we infer that as num_characters (the number of characters in the text) increases, the target tends to increase as well, with a correlation of 0.38; the same holds for num_words and num_sentences. Since num_characters is correlated with the target, we will use this column when building our machine learning model.
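The value shown in a correlation heatmap cell is typically the Pearson correlation coefficient, which can be computed as in the sketch below. The toy `num_characters` and `target` values are hypothetical and chosen only to show a positive relationship; they are not the report's data and will not reproduce the 0.38 figure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns: message length and a 0/1 target.
num_characters = [20, 35, 150, 160, 40, 170]
target = [0, 0, 1, 1, 0, 1]
r = pearson(num_characters, target)
print(round(r, 2))
```

A coefficient close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.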
  • 25. 6.3 Data Pre-processing
    Data preprocessing, the preparation of raw data for analysis and modelling, is an essential stage in machine learning. By resolving inconsistencies, handling missing values, and reducing noise, it improves the quality of the data, ensures that the data is in an appropriate format for training, and enhances the effectiveness and dependability of the resulting models. Here are the steps we performed to preprocess the data for our model:
    ➢ Converting the text to lower case.
    ➢ Tokenization: a fundamental step in data preprocessing for machine learning, especially when dealing with text data. It breaks a sequence of text down into smaller units called tokens. Tokens can be words, characters, or even subwords, depending on the specific task and requirements. Tokenization is generally carried out using specialized tools or libraries; we used the Python library NLTK for this purpose.
    ➢ Stop word removal: after tokenization, the resulting tokens are processed further to remove stop words, which are commonly occurring words with little semantic value. They are removed so that the remaining content words become more prominent.
    ➢ Special character removal: special characters are also removed from the tokenized words.
    ➢ Stemming: natural language processing (NLP) uses stemming to reduce words to their “stem,” or root, form. By removing prefixes, suffixes, and other affixes from words, stemming makes it possible for several word
  • 26. 21 variants to be represented by a single root form, which further standardizes the text representation. Pre-processing the data also lets us extract useful information for building and improving models. To find the most frequently used words in each target category, we drew a word cloud for each. Word clouds are visual representations of text data in which words are displayed in varying sizes based on their frequency or importance in the text. While word clouds are not directly used in machine learning algorithms, they are a valuable tool for exploring and understanding the characteristics of text data.
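The preprocessing steps of Section 6.3 can be sketched in a few lines. This is only an illustration using the Python standard library: the report uses NLTK, whereas here a regular expression stands in for the tokenizer, STOP_WORDS is a tiny hypothetical stop-word list, and simple_stem is a crude suffix-stripping rule standing in for a real stemmer. All names in the block are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "the", "to", "you", "your", "for", "this"}

# Crude suffix stripping; a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def simple_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop special characters and stop words, then stem."""
    text = text.lower()
    # Keeping only runs of letters and digits tokenizes the text and
    # removes special characters in one step.
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return [simple_stem(tok) for tok in tokens]


def top_words(texts, n=3):
    """Most frequent processed tokens: the counts a word cloud would visualize."""
    counts = Counter(tok for text in texts for tok in preprocess(text))
    return counts.most_common(n)


tokens = preprocess("Congratulations!!! You are WINNING a free prize, claiming is easy.")
print(tokens)
```

The crude stemmer produces stems such as "winn" for "winning"; a real stemmer applies more careful rules, but the effect of mapping word variants to one form is the same.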
  • 27. 22 6.4 Model Building In machine learning, model building is the process of generating a predictive or descriptive model from a given dataset. It entails choosing the best method, training the model on the given data, and assessing the model’s performance. Model building is an essential stage in machine learning because the resulting model is what generates predictions, yields new insights, or solves particular problems based on the patterns and correlations discovered in the data.
  • 28. 23 A vectorizer is a component or method used in machine learning to transform raw input that algorithms cannot process directly, such as text or categorical variables, into numerical representations, usually vectors. Since the majority of machine learning methods require numerical inputs, vectorization is a fundamental stage in the model-building process. Here are the types of vectorizers we used in our project: 1. Count Vectorizer: CountVectorizer is a vectorizer designed for text data. It transforms a collection of text documents into a matrix of word or phrase counts. Each document becomes a vector whose entries are the counts of particular words or phrases in that document. It is an easy and efficient way to represent text data as numerical features. 2. TF-IDF Vectorizer: TF-IDF (Term Frequency-Inverse Document Frequency) is another vectorization method used for text data. Each word in a document is given a weight that grows with how frequently it appears in the document and with how uncommon it is throughout the corpus. TF-IDF thus considers a word’s relevance across the whole corpus in addition to its frequency in a given document. The TF-IDF vectorizer produces a matrix in which each document is represented by a vector of TF-IDF scores, one for each word or phrase. Vectorization transforms raw data into a suitable numerical representation, enabling the application of a wide range of machine learning algorithms for analysis, modelling, and prediction. We then used the fit_transform method to fit the transformation on the training data and apply it to that data; the fitted transformation is then applied to the test data as well. Here’s how it works: Fit the transformer. First, using the training data, fit_transform fits the transformation or preprocessing step. 
The parameters or statistics that will be utilized to transform the data are learned from the training set at this stage. To learn
  • 29. 24 the vocabulary, word frequencies, or other pertinent information, fit_transform analyses the training data in the case of a vectorizer, for instance. Apply the transformation. Once the transformation has been fitted, fit_transform applies it to the training data, changing the data according to the parameters or statistics learned during fitting. This transformation could entail encoding categorical variables, scaling numerical features, or converting text into numerical representations. Test Data Transformation: The same fitted transformation is then applied to the test data, using the transform method rather than refitting. This guarantees consistency and permits a fair assessment of the model, because the test data is processed in the same manner as the training data. Fitting the transformation on the training data only and applying it unchanged to both sets keeps the preprocessing steps consistent and avoids data leakage between the two sets. 6.5 Evaluation Once the model has been trained, it is validated and tested on a separate dataset. This dataset should be different from the training dataset to ensure that the model can generalize to new data. The performance of the model is analysed for different machine learning algorithms to find the most suitable one, using metrics such as accuracy, precision, recall, and F1 score.
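The fit-then-transform pattern from Section 6.4 can be illustrated without any library. The sketch below is not the vectorizer used in the project: TinyCountVectorizer, idf_weights, and apply_tfidf are hypothetical names, the documents are toy examples, and the TF-IDF weight is the textbook form tf × log(N / df), which may differ from the smoothed variants that libraries implement.

```python
import math


class TinyCountVectorizer:
    """Hypothetical minimal count vectorizer (illustration only, not a library class)."""

    def fit(self, docs):
        # Learn the vocabulary from the training documents only.
        vocab = sorted({tok for doc in docs for tok in doc.split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, docs):
        # Count known words; words unseen during fit are ignored.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                idx = self.vocabulary_.get(tok)
                if idx is not None:
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


def idf_weights(train_counts):
    """Inverse document frequency per term, learned from the training counts."""
    n_docs = len(train_counts)
    n_terms = len(train_counts[0])
    weights = []
    for j in range(n_terms):
        df = sum(1 for row in train_counts if row[j] > 0)
        weights.append(math.log(n_docs / df))
    return weights


def apply_tfidf(count_rows, idf):
    """Scale raw counts by the previously learned IDF weights."""
    return [[count * weight for count, weight in zip(row, idf)] for row in count_rows]


train_docs = ["free prize now", "meeting at noon", "free entry now"]
test_docs = ["free meeting tomorrow"]

vectorizer = TinyCountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # fit and transform the training data
X_test = vectorizer.transform(test_docs)        # transform only: no refitting on test data

idf = idf_weights(X_train)                      # statistics come from the training set
X_train_tfidf = apply_tfidf(X_train, idf)
X_test_tfidf = apply_tfidf(X_test, idf)
print(vectorizer.vocabulary_)
print(X_test)
```

Because the vocabulary and IDF weights are learned from the training documents alone, the word "tomorrow" in the test document is simply ignored, and no information from the test set leaks into the features.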
  • 31. 26 Chapter 7 Result In this project, we created a model for identifying email spam based on the Naive Bayes algorithm. A publicly accessible dataset containing a large number of emails, both spam and legitimate, was used to train and test the model. To reduce the dimensionality of the dataset, stop words were removed and stemming was applied. The model’s effectiveness was assessed using several metrics, including accuracy and precision. Our model achieved an accuracy of 97.0986% and a precision of 98.8%. These numbers show that our model is highly accurate and capable of distinguishing between legitimate email and spam.
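To make the model and the two reported metrics concrete, here is a minimal sketch of a Naive Bayes text classifier together with accuracy and precision calculations. It assumes the multinomial variant with add-one (Laplace) smoothing, which the report does not specify; the function names and the toy messages are hypothetical, and the resulting scores say nothing about the figures reported above.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Collect the class and word counts needed by multinomial Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the class with the highest log-posterior (add-one smoothing assumed)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in doc.split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy_and_precision(y_true, y_pred, positive="spam"):
    """Accuracy = correct / total; precision = TP / (TP + FP) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Hypothetical toy data, not the project's dataset.
train_docs = ["win free prize now", "free cash offer", "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]
test_docs = ["free prize offer", "meeting tomorrow"]
test_labels = ["spam", "ham"]

model = train_naive_bayes(train_docs, train_labels)
predictions = [predict(model, doc) for doc in test_docs]
accuracy, precision = accuracy_and_precision(test_labels, predictions)
print(predictions, accuracy, precision)
```

High precision matters for a spam filter because a false positive means a legitimate email is hidden from the user, which is usually more costly than letting one spam message through.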
  • 36. 31 This is our website, where we have entered an email and it has been classified as spam.
  • 37. 32 Chapter 8 Conclusion In conclusion, we have successfully used the Naive Bayes method to construct an email spam identification system. Our method has proven to be highly accurate at differentiating between spam and ham emails, and it has the potential to significantly increase the effectiveness and security of email communication. Naive Bayes is a straightforward and effective algorithm that can classify a variety of data types. In our implementation, we classified emails according to their content as well as other characteristics such as the sender’s address and the subject line. After training the algorithm on a dataset of labelled emails, we used it to predict the labels of new, unseen emails. In comparison to conventional rule-based spam filters, our method significantly improved spam detection accuracy, to over 97%. This degree of accuracy was attained by carefully choosing and preprocessing the features, as well as by tuning the algorithm’s parameters. Speed and scalability are two of the main benefits of our system. Naive Bayes is a rather simple algorithm, and our implementation is capable of processing many emails simultaneously. This makes our system appropriate for use in business email networks and other scenarios where communicating via email is problematic. Overall, we think that our Naive Bayes-based email spam detection system has a lot of promise to increase the effectiveness and security of email communication.
  • 38. 33 Chapter 9 References ➢ M. Abd-Eldayem and E. El-Sayed, “A Hybrid Model for Email Spam Filtering Based on Fuzzy Inference and Genetic Algorithm,” in IEEE Access, vol. 7, pp. 8684–8694, 2019. doi: 10.1109/ACCESS.2019.2895518 ➢ P. Mohammadi, H. R. Pourshahabi, and N. Karimian, “Email Spam Filtering Using Deep Learning Techniques,” in IEEE Access, vol. 7, pp. 116726–116738, 2019. doi: 10.1109/ACCESS.2019.2930799 ➢ M. Iqbal, R. Khan, and S. A. Khayam, “Email Spam Filtering: A Machine Learning Approach,” in IEEE Transactions on Emerging Topics in Computing, vol. 5, no. 1, pp. 118–131, 2017. doi: 10.1109/TETC.2015.2497579 ➢ M. Kumar and R. Singh, “Email Spam Detection Using Naive Bayes Algorithm,” 2021 International Conference on Data Science and Business Analytics (ICDSBA), Noida, India, 2021, pp. 1–6. doi: 10.1109/ICDSBA52091.2021.9459928