Email Spam Detection Using Machine Learning
A Report Submitted
in Partial Fulfilment of the Requirements
for the Degree of
Bachelor of Technology
in
Computer Science & Engineering
by
Vishal Verma (20204231)
Vikash Yadav (20204228)
Vishal Vishwakarma (20204232)
Uma Patel (20204219)
Shubham Gautam (20204198)
Under the guidance of Prof. Manoj Madhava Gore
to the
COMPUTER SCIENCE AND ENGINEERING DEPARTMENT
MOTILAL NEHRU NATIONAL INSTITUTE OF TECHNOLOGY
ALLAHABAD, PRAYAGRAJ
November 2023
UNDERTAKING
We declare that the work presented in this report titled “Email
Spam Detection Using Machine Learning”, submitted to the
Computer Science and Engineering Department, Motilal Nehru
National Institute of Technology Allahabad, Prayagraj, for the
award of the Bachelor of Technology degree in Computer Science
& Engineering, is our original work. We have not plagiarized or
submitted the same work for the award of any other degree. In
case this undertaking is found incorrect, we accept that our
degree may be unconditionally withdrawn.
November 2023
Allahabad
Vishal Verma
Vikash Yadav
Vishal Vishwakarma
Uma Patel
Shubham Gautam
CERTIFICATE
Certified that the work contained in the report titled “Email
Spam Detection Using Machine Learning”, by Vishal Verma,
Vikash Yadav, Vishal Vishwakarma, Uma Patel, Shubham
Gautam has been carried out under my supervision and
that this work has not been submitted elsewhere for a
degree.
(Prof. Manoj Madhava Gore)
Computer Science and Engineering Dept.
M.N.N.I.T, Allahabad
November 2023
Acknowledgement
This project is the result of a large amount of work, research, and dedication. Firstly, we
would like to thank our project supervisor, Prof. Manoj Madhava Gore, for
providing us with the necessary guidance and support to complete this project. His
insights, expertise, and valuable feedback were essential to the success of this
project, and it has truly been a privilege to work under him this semester. We
would also like to express our sincere gratitude and appreciation to everyone who has
contributed to the completion of this project.
Table of Contents
1. Motivation
2. Problem Statement
3. Workflow
4. Machine Learning Algorithms
4.1 Machine Learning Based Spam Filtering Methods
4.1.1 Supervised Machine Learning
4.1.2 Decision Tree Classifier
4.1.3 Support Vector Machine (SVM)
4.1.4 Naive Bayes Classifier (NB)
4.2 Overall Insights of Machine Learning Algorithms for Spam
Detection
5. Naive Bayes Algorithm
6. Steps Involved
6.1 Data Collection
6.2 Data Cleaning and Exploration
6.2.1 Data Cleaning
6.2.2 Data Exploration
6.3 Data Pre-processing
6.4 Model Building
6.5 Evaluation
7. Result
8. Conclusion
9. References
Chapter 1
Motivation
The driving force behind this email spam filtering system is the aim of
improving users’ email experience by reducing the pain and frustration caused
by spam. Users save time and effort when spam is filtered out automatically
instead of having to look through and delete unwanted messages by hand. A
further driver is email security: spam often carries dangerous material, such
as phishing links and malware attachments, which can compromise users’
personal data or infect their devices. The need to make sure that important
emails from reliable sources reach users’ inboxes promptly also motivates the
reduction of false positives, where valid emails are wrongly classified as spam.
The ultimate goal is a safe, effective, and dependable email environment that
improves user satisfaction and promotes confidence in the email service
provider. While Google and other email providers are useful for identifying
spam emails, their filters are not perfect and still rely on continuous user
input.
Well-known email providers such as Gmail, Yandex, and Yahoo Mail offer end
users basic services for free, subject to an EULA. Private firms, however,
operate their own email servers and want them to be more secure because of
the personal data they hold, so there is a large potential market for email
spam classifiers that can be supplied to such organisations.
Chapter 2
Problem Statement
This project identifies spam using machine learning approaches. The task is
to detect spam emails, many of which are fraudulent. We will apply several
machine learning algorithms to our dataset before choosing the most precise
and accurate one for email spam detection. In this project, we create a
complete spam filtering system that can help both individuals and businesses
keep their inboxes free of spam by determining which emails are regular
emails and which are spam.
Chapter 3
Workflow
➢ Gather the raw dataset and import it: Import the necessary data into
your machine learning environment after collecting it from multiple
sources.
➢ Initial data cleaning procedures should be carried out to deal with
missing values, outliers, and inconsistent data in the dataset.
Techniques like imputation, outlier reduction, and standardization may
be used for this.
➢ Analyse the dataset to learn more and spot trends, connections, and
crucial elements. This technique aids in data comprehension and
decision-making during model preparation and training.
➢ Separate the training and test sets from the pre-processed dataset. The
testing set is used to assess the machine learning model’s performance
after it has been trained using the training set.
➢ Based on the characteristics of the problem and the information at
hand, select a suitable machine learning algorithm. Consider factors
such as the type of learning (supervised or unsupervised), the desired
output (classification or regression), and the required performance.
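The splitting step in the workflow above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper name `train_test_split`, the 80/20 ratio, the fixed seed, and the toy records are all hypothetical and are not taken from the project code.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy records: (email text, label) with 1 = spam and 0 = ham.
emails = [
    ("win a free prize now", 1),
    ("claim your cash reward", 1),
    ("meeting moved to noon", 0),
    ("lunch tomorrow?", 0),
    ("project report attached", 0),
    ("free entry in a weekly draw", 1),
    ("see you at the lab", 0),
    ("call me when you are free", 0),
    ("urgent: verify your account", 1),
    ("notes from today's lecture", 0),
]

train_set, test_set = train_test_split(emails, test_fraction=0.2)
print(len(train_set), "training records,", len(test_set), "test records")
```

Holding the test records back until the very end is what makes the later evaluation a fair estimate of performance on unseen emails.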
Chapter 4
Machine Learning Algorithms
The threat of spam has reached unprecedented heights in the
modern digital era, where email has become a crucial form of
communication. Both email service providers and users constantly struggle to
sort through an enormous volume of unsolicited and potentially hazardous
messages. Advances in machine learning now give us strong tools to address
this problem, and they have substantially changed how email spam is
classified. These algorithms learn from large amounts of data, picking up
trends and traits that distinguish spam from legitimate emails. By
automating the classification process, they increase productivity and help
keep users’ inboxes free of clutter while reducing their exposure to phishing
and other malicious attacks.
4.1 Machine Learning Based Spam Filtering Methods
ML-based spam detection methods have changed the way we combat email spam.
These techniques use algorithms that learn from labelled datasets to
categorize incoming emails as either spam or legitimate. A popular strategy
is the ensemble method, which combines several machine learning models,
such as decision trees and support vector machines, to reach a decision; this
improves the overall reliability and precision of the spam filtering system.
Feature engineering techniques extract significant features from emails, for
example keyword frequencies and the results of header and text analysis.
The machine learning algorithms then receive these
features and learn relationships and patterns to produce precise predictions.
4.1.1 Supervised Machine Learning
Supervised machine learning is an area of artificial intelligence
concerned with building models that make predictions or decisions from
labelled training data. It is an effective and frequently used approach that
enables algorithms to learn from examples and generalize, so that they can
make accurate predictions on unseen data. In supervised
learning, the training data consists of input features (sometimes called
predictors) and their corresponding output labels. The objective is to
identify a model or function that maps the input features to the
appropriate output labels. The approach has two main phases: training and
prediction. During the training phase, the algorithm learns from the
provided labelled samples. It analyses the input features and their labels
to find patterns and relationships, and adjusts its internal weights or
parameters to reduce the discrepancy between its predictions and the
actual labels.
4.1.2 Decision Tree Classifier
A decision tree classifier is a popular supervised machine learning approach
that builds a tree-shaped model for classification tasks by
learning decision rules from labelled training data. The tree is a
hierarchical arrangement of internal nodes and leaf nodes: each internal node
represents a decision based on a feature, and each leaf node
represents a class label. Decision trees are useful for gaining insight into how
decisions are made since they are simple to understand and visualize. They
work efficiently with large datasets and can handle both categorical and numerical
features. Due to their versatility and simplicity, decision tree classifiers are
frequently employed in a variety of industries, such as marketing, healthcare,
and finance.
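To make the idea of an internal node more concrete, the sketch below picks a single decision rule of the form "feature <= threshold" by minimising Gini impurity, which is one common splitting criterion. The report does not say which criterion was used, so this choice, the function names, and the toy data are assumptions for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    if not labels:
        return 0.0
    p_spam = sum(labels) / len(labels)
    return 1.0 - p_spam ** 2 - (1.0 - p_spam) ** 2


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical feature: number of characters per message; label 1 = spam.
num_characters = [20, 35, 40, 150, 160, 170]
labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(num_characters, labels)
print("split at num_characters <=", threshold, "with weighted Gini", impurity)
```

A full decision tree repeats this search recursively on each side of the split until the leaves are pure enough or a depth limit is reached.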
4.1.3 Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a widely used supervised machine
learning method that performs well on classification and regression tasks.
Because it can work with complex datasets and high-dimensional spaces,
it is adaptable to a variety of applications. SVMs search for the
hyperplane that maximizes the margin between classes, which improves
generalization and reduces overfitting. One of the main benefits of SVM is
the kernel trick, which allows nonlinear decision boundaries by implicitly
mapping the data into higher-dimensional feature spaces.
This adaptability, together with the regularization parameter’s control
over the bias-variance trade-off, helps SVMs make accurate predictions and
adjust to various datasets. Soft-margin SVMs are reasonably robust to
outliers, and linear SVMs produce decision boundaries that are easy to
interpret, which is useful where transparency matters. With their broad
range of kernel functions and optimization strategies that help them scale,
SVMs remain a potent machine learning tool for many real-world applications.
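For reference, the margin maximization and regularization described above are commonly written as the soft-margin optimization problem below. The symbols are assumptions introduced here rather than notation from this report: x_i is the feature vector of email i, y_i in {-1, +1} is its label, w and b define the hyperplane, the xi_i are slack variables that allow margin violations, and C is the regularization parameter.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

A large C penalizes margin violations heavily, which lowers bias but raises variance; a small C tolerates more violations and has the opposite effect.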
4.1.4 Naive Bayes Classifier (NB)
Based on Bayes’ theorem of probability, the Naive Bayes algorithm is a
well-known machine learning technique for classification. Naive Bayes is a
probabilistic classifier that is easy to use, quick to train, and
effective in a variety of real-world scenarios. The Naive Bayes classifier
first determines the probability that a new observation belongs to each class
based on its features. To do this it uses Bayes’ theorem, which states that the
probability of a hypothesis (in this case, the class) given the observed evidence (the
features) is proportional to the likelihood of the evidence given the hypothesis
times the prior probability of the hypothesis.
The algorithm is called naive because it assumes that
the features are conditionally independent given the class. This
assumption makes the probability calculation simpler and more
computationally efficient. Even though the assumption
does not always hold, Naive Bayes often performs well in practice,
especially when the number of features is large.
4.2 Overall Insights of Machine Learning Algorithms for
Spam Detection
After reviewing the literature, we found that most datasets used to develop,
test, and apply various models were created artificially. Labelling data for
supervised models is challenging, and there are not enough real examples for
analysis. Because the models were trained on synthetic datasets, the
classifiers’ results are not completely trustworthy. A wide range of
machine learning models is now used to detect spam emails, but results
obtained this way are not necessarily indicative of real-world spam. Three
frequently used learning algorithms, namely logistic regression, Naive Bayes,
and support vector machines (SVM), outperform the alternative learning
algorithms in most of the experiments and studies. Typically, Naive Bayes and
logistic regression outperform SVM. However, because SVM is not
compared against all the other algorithms in these studies, this result alone
does not establish which method is best. Future research should examine different learning models on
different datasets using a variety of feature engineering techniques.
Chapter 5
Naive Bayes Algorithm
Naive Bayes is a straightforward yet effective technique used in
machine learning for classification tasks. It is based on Bayes’
theorem and makes the “naive” assumption that the features are
independent of one another given the class.
Bayes’ theorem:
P(A|B) = (P(A) * P(B|A)) / P(B)
Here’s how the Naive Bayes algorithm works:
Training Phase: Given a labelled training dataset, the algorithm estimates
the prior probability of each class and the likelihood of each feature given
each class. In essence, it learns the statistical characteristics of the data.
Feature Independence Assumption: Naive Bayes assumes that the features are
conditionally independent of one another given the class. This assumption
makes the computations easier and increases the computational efficiency of
the method.
Classification Phase: When a new, unlabelled instance is supplied, the
algorithm computes the posterior probability of each class based on the
observed features. The instance is assigned to the class with the highest
posterior probability.
To calculate the posterior probability, Naive Bayes uses the Bayes theorem:
P(class|features) = (P(class)* P(features|class))/P(features)
P(class|features) is the posterior probability of the class given the observed features.
P(class) is the prior probability of the class.
P(features|class) is the likelihood of the observed features given the class.
P(features) is the probability of the observed features.
Since the method assumes that the features are conditionally independent,
the likelihood term can be computed as the product of the individual feature
probabilities:
P(features|class) = P(feature1|class) * P(feature2|class) * ... * P(featureN|class)
The Naive Bayes technique is frequently used in text classification
applications, such as spam filtering or sentiment analysis, where the features
correspond to words or word frequencies. It works well and is
computationally efficient even with small training datasets. The assumption of
feature independence, however, does not always hold, which can
lead to less-than-ideal results.
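The training and classification phases described in this chapter can be sketched from scratch as follows. This is a minimal illustration, not the project's implementation. It makes three assumptions that the text above does not state: probabilities are handled in log space to avoid numerical underflow, Laplace (add-one) smoothing with a hypothetical parameter `alpha` keeps unseen words from producing zero probabilities, and the denominator P(features) is dropped because it is the same for every class. The toy documents are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {word for counts in word_counts.values() for word in counts}

    # P(class): fraction of training documents in each class.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # P(word|class) with add-alpha smoothing (an assumption, see the lead-in).
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total)
            for word in vocabulary
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest unnormalised log posterior."""
    scores = {}
    for c in log_prior:
        # Words that never appeared in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical, already tokenised training messages.
docs = [
    ["win", "free", "prize"],
    ["free", "cash", "now"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_naive_bayes(docs, labels)
prediction = predict(["free", "prize", "now"], log_prior, log_likelihood)
print(prediction)
```

Summing log probabilities is equivalent to multiplying the per-feature probabilities in the product formula above, so the class ranking is unchanged.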
Chapter 6
Steps Involved
6.1 Data Collection
The first step is to gather a large dataset of emails containing both spam and ham. The dataset contains
$228 records, each of which is a sample email, together with a column, "target", which states whether the email
is spam or not. The composition of the records in terms of the two target categories is:
The data contains 87.37 percent ham mail and 12.63 percent spam mail. It is taken from an open resource on
Kaggle.
6.2 Data Cleaning and Exploration
Data cleaning and exploration are important steps in machine learning to
ensure that the data is suitable for analysis and modelling.
6.2.1 Data Cleaning
➢ Identification and treatment of missing values in the dataset.
Techniques like imputation (replacing missing data with an
approximation) or removing rows or columns with missing values may
be used to achieve this.
➢ Detecting and Handling Outliers: Outliers are extreme values that
can bias an analysis. They can be handled in one of three
ways: by removing them, by transforming them, or by using robust
statistical methods.
➢ Eliminating Duplicates: To prevent bias in the analysis, locate and
eliminate duplicate records.
➢ Data formatting: Make sure that the data is formatted correctly (e.g.,
numerically, or categorically) and fix any formatting errors.
➢ Handling Noisy Data: Recognize and address any data that is erratic
or inconsistent that may have an adverse effect on the analysis.
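Two of the cleaning steps listed above, handling missing values and eliminating duplicates, can be sketched as a single pass over the records. The function name, the choice to drop rows with missing values rather than impute them, and the toy rows are assumptions made for illustration.

```python
def clean_records(records):
    """Drop rows with missing text or label, then drop exact duplicate texts."""
    seen = set()
    cleaned = []
    for text, label in records:
        if text is None or label is None or not text.strip():
            continue  # missing value: remove the row
        key = text.strip()
        if key in seen:
            continue  # duplicate: keep only the first occurrence
        seen.add(key)
        cleaned.append((key, label))
    return cleaned


# Hypothetical raw rows: (text, target).
raw = [
    ("Free entry now", "spam"),
    ("See you at 5", "ham"),
    ("Free entry now", "spam"),  # exact duplicate
    (None, "ham"),               # missing text
    ("   ", "ham"),              # empty after stripping
    ("Call me", None),           # missing label
]

cleaned = clean_records(raw)
print(len(raw), "raw rows ->", len(cleaned), "clean rows")
```

Removing duplicates before the train/test split also prevents the same message from appearing in both sets.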
6.2.2 Data Exploration
➢ Calculate descriptive statistics like the mean, median, standard deviation, and
quartiles to better understand the distribution, dispersion, and central
tendency of the data.
➢ Data visualization: To see trends, correlations, and possible outliers, visualize
the data using charts, histograms, scatter plots, etc.
➢ Analyse each feature’s relationship to the target variable to better understand how it
bears on the problem at hand. This may entail running hypothesis tests, evaluating
correlations, or using domain knowledge.
➢ If a dataset contains a large number of features, principal component analysis
(PCA) or feature selection methods may be used to reduce its dimensionality
and to concentrate on the most useful features.
➢ Handling Class Imbalance: If the dataset has unbalanced classes (one class is
much more abundant than the others), investigate approaches such as
oversampling, undersampling, or creating synthetic samples to balance the
classes.
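The descriptive statistics mentioned in the first item above can be computed per class with Python's standard `statistics` module. The sketch below is illustrative only: the `describe` helper and the character counts are hypothetical and do not come from the project dataset.

```python
import statistics


def describe(values):
    """Summarise a numeric feature with basic descriptive statistics."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical character counts per message, grouped by class.
num_characters = {
    "ham": [10, 20, 30, 40],
    "spam": [120, 140, 160, 180],
}

summary = {label: describe(values) for label, values in num_characters.items()}
for label, stats in summary.items():
    print(label, stats)
```

Comparing such summaries between the two classes is a quick way to see whether a feature like message length is likely to help separate spam from ham.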
Fig 6.3 Number of characters in each type
Fig 6.4 Number of words in each type
Correlation:
From the correlation heatmap, we infer that as num_characters (the number of
characters in the text) increases, the target value tends to increase as well,
with a correlation of 0.38 as shown; the same holds for num_words and num_sentences.
Since num_characters is correlated with the target, we use this column
when building our machine learning model.
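A heatmap cell such as the one discussed above is typically a Pearson correlation coefficient. The sketch below computes it by hand, assuming that Pearson correlation is the measure used and that the target is encoded as 1 for spam and 0 for ham; neither is stated explicitly in the report. The toy values are invented, so the result will not match the 0.38 reported above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical data: message length and target (1 = spam, 0 = ham).
num_characters = [20, 35, 40, 150, 160, 170]
target = [0, 0, 0, 1, 1, 1]

r = pearson(num_characters, target)
print(f"correlation between num_characters and target: {r:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.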
6.3 Data Pre-processing
Data preprocessing is the preparation of raw data for analysis and modelling, and
it is an essential stage in machine learning. By resolving inconsistencies, handling
missing values, and putting the data into a format suitable for machine learning
algorithms, it reduces noise and improves both the quality of the data and the
effectiveness and dependability of the resulting models.
Here are the steps we have performed to preprocess the data for our model:
➢ Converting the text to lower case.
➢ Tokenization: breaking a sequence of text down into smaller units
called tokens. Tokens can be words, characters, or even subwords, depending
on the specific task and requirements. It is a fundamental
step in data preprocessing for machine learning, especially when dealing with
text data, and is generally carried out using specialized tools or libraries. We
have used a Python library called NLTK for this purpose.
➢ Stop word removal: after tokenization, the resulting tokens are filtered
to remove stop words. Stop words are commonly occurring words with little
semantic value. These words are removed so that the remaining important
words, or content words, become more prominent.
➢ Special characters are also removed from the tokenized words.
➢ Stemming: a natural language processing (NLP) technique that reduces
words to their “stem,” or root, form. By eliminating prefixes, suffixes,
and other affixes from words, stemming makes it possible for several
variants of a word to be represented by a single root form, which further
standardizes the text representation.
Data is pre-processed to extract useful information and analysis out of it,
which can be used to build and improve models.
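The preprocessing steps listed above can be sketched end to end in plain Python. This is a simplified stand-in, not the NLTK-based code used in the project: the regular expression performs tokenization and special-character removal in one step, `STOP_WORDS` is a small illustrative subset rather than a full stop word list, and `crude_stem` is a hypothetical suffix stripper that only approximates what a real stemmer does.

```python
import re

# Illustrative subset only; a real stop word list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in",
              "you", "your", "for", "have", "this"}

# Crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenise, drop stop words and special characters, then stem."""
    text = text.lower()
    # Keeping only runs of letters and digits tokenises the text and
    # discards punctuation and other special characters at the same time.
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


tokens = preprocess("Congratulations!!! You have WON 2 free tickets, claiming is easy.")
print(tokens)
```

The output of such a function is the list of normalised tokens that the vectorization step later turns into numerical features.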
We also used the pre-processed data to find the most frequently used words
in each target category and drew a word cloud for each.
Word clouds are visual representations of text data in which words are displayed
in varying sizes based on their frequency or importance in the text. While
word clouds are not directly used in machine learning algorithms, they serve
as a valuable tool for data exploration and understanding the characteristics
of text data.
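The word frequencies that a word cloud visualises can be obtained with a simple counter. The sketch below assumes the messages have already been tokenised; the helper name `top_words` and the toy token lists are hypothetical.

```python
from collections import Counter


def top_words(token_lists, n=3):
    """Return the n most frequent tokens across a collection of token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


# Hypothetical pre-processed messages, grouped by target category.
spam_tokens = [["free", "prize", "free"], ["free", "cash", "prize"]]
ham_tokens = [["meeting", "noon"], ["lunch", "noon", "noon"]]

print("spam:", top_words(spam_tokens, 2))
print("ham:", top_words(ham_tokens, 2))
```

In a word cloud, the counts returned here would determine the display size of each word.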
6.4 Model Building
In machine learning, the process of generating a predictive or descriptive model
using a given dataset is referred to as model construction. It entails choosing the best
method, training the model using the given data, and assessing the model’s
performance. In order to generate predictions, get new perspectives, or solve
particular issues based on the patterns and correlations discovered from the data,
model development is an essential stage in machine learning.
A vectorizer is a component or method used in machine learning to transform
non-numerical raw input, such as text or categorical variables, into numerical
representations, usually in the form of vectors. Since the majority of machine
learning methods require numerical inputs, vectorization is a fundamental stage in
the model-building process; it is what allows the algorithms to process the data.
Here are the types of vectorizers we have used in our project:
1. Count Vectorizer: Count Vectorizer is a vectorizer designed specifically for text
data. It transforms a collection of text documents into a matrix of the
frequencies of the words or phrases used in the texts. Each document is converted into
a vector whose elements are the counts of particular words or phrases.
CountVectorizer is a simple and efficient way to represent text data as
numerical features.
2. TF-IDF Vectorizer: TF-IDF (Term Frequency-Inverse
Document Frequency) is another vectorization method used for text data. Each word in a text is
given a weight depending on how frequently it appears in the document and how
uncommon it is throughout the corpus. TF-IDF therefore considers a word’s relevance in
the whole corpus of texts, not just how often it appears in a given document.
The TF-IDF vectorizer produces a matrix in which each document is represented
by a vector of TF-IDF scores, one for each word or phrase.
Vectorization transforms raw data into a suitable numerical representation, enabling
the application of a wide range of machine learning algorithms for analysis,
modelling, and prediction.
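Both vectorization schemes described above can be illustrated without any library. The sketch below assumes one simple textbook variant of TF-IDF, with tf as the word count divided by the document length and idf as ln(N / df). Library implementations typically add smoothing and normalisation, so their numbers will differ. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def count_vectors(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    vocabulary = sorted({word for doc in documents for word in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def tfidf_vectors(documents):
    """Weight each count by how rare the word is across the corpus."""
    vocabulary, counts = count_vectors(documents)
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    doc_freq = [sum(1 for row in counts if row[j] > 0)
                for j in range(len(vocabulary))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([(c / total) * w if total else 0.0
                       for c, w in zip(row, idf)])
    return vocabulary, matrix


# Hypothetical tokenised documents.
docs = [
    ["free", "prize", "free"],
    ["meeting", "at", "noon"],
    ["free", "lunch", "at", "noon"],
]

vocabulary, counts = count_vectors(docs)
_, tfidf = tfidf_vectors(docs)
print(vocabulary)
print(counts[0])
print([round(value, 3) for value in tfidf[0]])
```

In the first document "free" occurs twice and "prize" once, yet "prize" receives the higher TF-IDF weight because it appears in fewer documents of the corpus.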
We then used the fit_transform method to fit the transformation on the training
data, and the transform method to apply the fitted transformation to the test
data. Here’s how it works:
Fit the transformer. First, fit_transform fits the transformation or
preprocessing step using the training data. The parameters or statistics that will be
used to transform the data are learned from the training set at this stage. In the
case of a vectorizer, for instance, this step analyses the training data to learn
the vocabulary, word frequencies, or other pertinent information.
Apply the transformation. Once fitted, fit_transform applies the learned
transformation to the training data, changing it according to the parameters or
statistics found during fitting. Depending on the transformer, this could entail
encoding categorical variables, scaling numerical features, or converting text
into numerical representations.
Test data transformation. The test data is then passed through the same fitted
transformer using transform, without refitting. This guarantees
consistency and permits a fair assessment of the model, because the test data is
processed in the same manner as the training data.
Fitting on the training data only and reusing that fitted transformation for the
test data keeps the preprocessing steps consistent and avoids data leakage between the
two sets.
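The fit-then-transform pattern can be made concrete with a toy vectorizer. The class below is a hypothetical, simplified imitation of that interface written for illustration; it is not the library class used in the project. The key point it shows is that the vocabulary is learned from the training documents only and then reused unchanged for the test documents.

```python
from collections import Counter


class SimpleCountVectorizer:
    """Toy vectorizer: the vocabulary is learned in fit() and reused in transform()."""

    def __init__(self):
        self.vocabulary_ = None

    def fit(self, documents):
        # Learn the vocabulary from the given (training) documents only.
        self.vocabulary_ = sorted({word for doc in documents for word in doc})
        return self

    def transform(self, documents):
        if self.vocabulary_ is None:
            raise RuntimeError("call fit() before transform()")
        rows = []
        for doc in documents:
            counts = Counter(doc)
            # Words outside the learned vocabulary are ignored.
            rows.append([counts[word] for word in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


# Hypothetical tokenised documents.
train_docs = [["free", "prize"], ["meeting", "noon"]]
test_docs = [["free", "lunch", "noon"]]

vectorizer = SimpleCountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # fit and transform the training data
X_test = vectorizer.transform(test_docs)        # transform only, no refitting

print(vectorizer.vocabulary_)
print(X_train)
print(X_test)
```

The word "lunch" appears only in the test document, so it is dropped rather than added to the vocabulary. Calling fit_transform on the test data instead would relearn the vocabulary and make the train and test features inconsistent.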
6.5 Evaluation
Once the model has been trained, it is validated and tested using a separate dataset.
This dataset should be different from the training dataset to ensure that the model
can generalize to new data. The performance of the model is analysed for different
machine learning algorithms to find the most suitable one. The model’s performance
is evaluated based on metrics such as accuracy, precision, recall, and F1 score.
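The four metrics named above can all be derived from the counts of true and false positives and negatives. The sketch below treats spam as the positive class, encoded as 1; this encoding, the function name, and the toy labels are assumptions for illustration and do not reproduce the project's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: of the emails flagged as spam, how many really were spam?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real spam emails, how many were caught?
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test-set labels (1 = spam, 0 = ham) and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For a spam filter, precision deserves particular attention, because a false positive means a legitimate email is hidden from the user.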
Chapter 7
Result
In this research, we created a model for identifying email spam based on the Naive
Bayes algorithm. A publicly accessible dataset of a large number of emails, including
spam and valid emails, was used to train and test the model. To minimize the
dimensionality of the dataset, stop words were eliminated, and stemming was
applied. The model’s effectiveness was assessed using a number of criteria, including
accuracy and precision. The findings revealed that our model had an accuracy of
97.0986% and precision of 98.8%. These numbers show that our model is highly
accurate and capable of distinguishing between real email and spam.
This is our website, where we have entered an email that has been
classified as spam.
Chapter 8
Conclusion
We have successfully used the Naive Bayes method to build an email
spam detection system. Our method has proven to be highly accurate at
differentiating between spam and ham emails, and it has the potential to significantly
improve the effectiveness and security of email communication. Naive Bayes is a
straightforward and effective algorithm that can classify a variety of data types. In
our implementation, we classified emails according to their content as well
as other characteristics such as the sender’s address and the subject line. After
training the algorithm on a dataset of labelled emails, we used it to predict the
label of new, unseen emails.
In comparison to conventional rule-based spam filters, our method
improved spam detection accuracy to over 97%. This degree of accuracy was
attained by carefully choosing and preprocessing the features, as well as by
tuning the algorithm’s parameters. Speed and scalability are
two of the main benefits of our system. Naive Bayes is a simple algorithm, and
our implementation is capable of processing many emails at once. This makes
our system suitable for business email networks and other
settings that depend heavily on email communication.
Overall, we think that our Naive Bayes-based email spam detection system
has a lot of promise to improve the effectiveness and security of email communication.
Chapter 9
References
➢ M. Abd-Eldayem and E. El-Sayed, “A Hybrid Model for Email Spam Filtering
Based on Fuzzy Inference and Genetic Algorithm,” in IEEE Access, vol. 7, pp.
8684–8694, 2019. doi: 10.1109/ACCESS.2019.2895518
➢ P. Mohammadi, H. R. Pourshahabi, and N. Karimian, “Email Spam Filtering
Using Deep Learning Techniques,” in IEEE Access, vol. 7, pp. 116726–116738,
2019. doi: 10.1109/ACCESS.2019.2930799
➢ M. Iqbal, R. Khan, and S. A. Khayam, “Email Spam Filtering: A Machine
Learning Approach,” in IEEE Transactions on Emerging Topics in Computing,
vol. 5, no. 1, pp. 118–131, 2017. doi: 10.1109/TETC.2015.2497579
➢ M. Kumar and R. Singh, “Email Spam Detection Using Naive Bayes Algorithm,”
in 2021 International Conference on Data Science and Business Analytics
(ICDSBA), Noida, India, 2021, pp. 1-6. doi: 10.1109/ICDSBA52091.2021.9459928