Detecting Phishing Email Using Natural Language Processing
By Ian Harris, Tianrui Peng
University of California, Irvine
Abstract
Phishing emails are one of the most common and harmful threats that people constantly
face in contemporary society. In this paper, we present a natural language
processing algorithm that detects malicious emails by analyzing sentence structures
and the relationships between words. Our algorithm mainly focuses on analyzing the
content of phishing emails. There are many programs and research projects focused on
detecting phishing emails using the title, the header and the links inside the emails.
However, some sophisticated phishing emails do not contain malicious links.
Our algorithm can outperform other algorithms at analyzing this type of
content-oriented phishing email. Moreover, our algorithm can also be used to improve any existing
link-oriented phishing email detection algorithm. By using a natural language
processing (NLP) approach, our algorithm does not need to rely on constantly updated data,
and is able to generalize to newly generated attacks. Therefore, our algorithm is able to
detect the constantly changing and developing phishing email attacks in contemporary
society.
1. Introduction
In contemporary society, the security of private information is a major concern for every
person. Social engineering attacks are dangerous threats that use human interaction to
psychologically manipulate people into exposing their confidential information.
Phishing is a type of social engineering attack that focuses on gaining sensitive information by
posing as a trustworthy entity. Electronic communications, such as email or text messages, are
common platforms for delivering phishing attacks. Phishing is a continual threat that constantly
affects our daily lives. Attackers often gain personal information that affects the victims’ personal
lives, financial wellbeing, and work environment. Between May 2004 and May 2005,
approximately 1.2 million computer users in the United States suffered financial losses because of
phishing attacks, totaling approximately 929 million USD [1]. According to the third Microsoft
Computing Safety Index Report released in February 2014, around 5 billion USD are lost to phishing
attacks annually [2].
Phishing emails are the most common type of phishing attack that people have to deal with.
Attackers usually disguise themselves as popular social websites, banks, administrators from IT
departments or popular shopping websites. These emails often lure users into entering personal
information into a malicious website that looks similar to the legitimate one. There are
different types of phishing emails such as spear phishing, clone phishing, and whaling. As
technology becomes more integrated into our culture, the damage caused by phishing emails
continues to rise. This growing threat calls for us to actively seek effective solutions to
this problem. The most common methods to deal with this problem are reported blacklists,
user training, public awareness, and automatic phishing detection software.
Many previous research projects have attempted to solve this problem by using
software classification approaches. The most common techniques are machine
learning, blacklists, link analysis, and natural language processing (NLP). NLP is suitable for
this problem because it can extract semantic information from the content of the email
without relying on an existing website blacklist or link analysis. A legitimate email typically
attempts to present some information to the user. On the other hand, a malicious email usually
aims at luring targeted users to visit malicious websites or at eliciting a response. Using NLP
techniques, we can analyze the content of the message to determine its motives, and classify it
as legitimate or malicious.
There have been various attempts to use NLP to create an algorithm capable of detecting
phishing emails. The research presented here is based on an NLP algorithm proposed by
Professor Ian G. Harris, Yuki Sawa, Ram Bhakta, and Christopher Hadnagy [3]. This algorithm was
originally designed to identify social engineering attacks by applying NLP techniques to
conversations. It examines all dialog text transmitted from the attacker to the targeted user, and
checks the appropriateness of each sentence. A sentence is considered malicious if it
asks for sensitive information or commands an action that might expose personal
information. [3] uses NLP techniques to detect questions and commands, and to identify their
related subjects. Once a sentence is categorized as a question or a command, its potential topics
are extracted by finding verb/noun pairs which connect verbs with their direct objects. Then
each pair is checked against a blacklist of malicious verb/noun pairs.
1.1 My Contribution
My contributions to this research involved systematically evaluating the performance of [3]
when generalized to the detection of phishing emails, and improving the algorithm to gain better
performance at identifying phishing email attacks. I wrote a program that gathered and parsed
438 malicious emails and 500 legitimate emails to evaluate [3]. Before improvement, the
precision of [3] is 100% and its recall is 12%. After improving and adapting its blacklist, the new
program’s precision is 100% and its recall is 38%. Then, by combining this program with a link analysis
program, Netcraft [14], the new program’s precision is 99% and its recall increases to 73.5%.
2. Related Work
Different research groups have attempted to prevent phishing attacks through various
approaches. There are broadly two types of anti-phishing schemes. The first type
detects phishing webpages. The second type focuses
on detecting phishing emails.
2.1 Work related to detecting phishing websites
The most common approaches to detecting phishing websites are webpage content analysis
and link analysis. The first approach is
blacklist-based anti-phishing, such as the Google Safe Browsing API. The Google Safe Browsing
API allows users to check whether a URL is in blacklists that are gathered and updated by
Google [4]. The second approach is based on analyzing the content of the website. SpoofGuard,
a web browser plug-in created at Stanford University, uses this approach. It weights
certain malicious components found in the HTML content [5].
2.2 Work related to detecting phishing emails
2.21 Using URL analysis
Many research projects focus on analyzing the structure of the links rather than the content of the
emails to identify phishing emails. For example, Garera et al.
present several differences between a malicious URL and a benign URL [6]. By
identifying these distinctions, they created a logistic regression filter to detect phishing emails.
In addition, LinkGuard uses data provided by APWG to identify common features of
malicious links contained in phishing emails [7].
2.22 Using email content
Many phishing detection algorithms are content-oriented. For example, the research of
Allen Stone resulted in a software tool called EBIDS [8]. It uses NLP technologies to analyze
the plain text in emails. It first reads an email in as a text file, and then feeds the input into
OntoSem through the DEKADE API. The OSIM processes the text, writing its semantic and
referential knowledge of the sentences in ontological terms. This step creates equivalence sets
for natural language strings. For example, instead of matching on “send us your account
information,” EBIDS can match on the concept of “request for personal information.” Then,
based on the results of OntoSem, it runs a string match algorithm to determine whether the email
is a phishing email. The decision is based on a rule set. The authors put four rules in the rule
set for testing: account compromise, financial opportunity, account change, and opportunity.
The program matches certain words and phrases within the email content, and then computes
the similarity between the email content and each of the four rules. If the similarity to a rule passes a
threshold, the email is considered to belong to that rule, and the program decides the email is
malicious. If the email does not match any of these rules, it is classified as
legitimate.
Another content-oriented research project also aimed at detecting phishing emails through an NLP
approach. The work of Aggarwal et al. used content analysis to find four parameters that can
decide whether an email is malicious or not [9]. These four parameters are: absence of
names, mention of money, reply-inducing sentences, and sense of urgency. The algorithm
detects mention of money by keeping a set of names and symbols of all currencies in the
world and their common variants, and then checking whether the email contains any of them.
In order to detect the presence of a reply-inducing sentence, it builds a set of words and phrases
which ask the user to reply to the email, such as contact, get back, and reply. It also uses
WordNet to include the hyponyms and synonyms of these reply-inducing words in the set. In
order to find whether any of these words are mentioned in the email content, it first uses the Stanford
CoreNLP API to tag the words in the email with POS tags, and then stems the words of the
tagged file to their base form. In addition, it detects sense of urgency by checking
whether the email contains words such as now, instantly, or immediately. If a sentence
contains both words that induce a reply and words that convey a sense of urgency, then the
email has a high probability of being malicious. After detecting these four parameters, it uses a
formula to combine the information it has gathered into a final score.
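The co-occurrence check described above (a reply-inducing phrase and an urgency word in the same sentence) can be illustrated with a minimal sketch. It is not the implementation of [9]: the word lists contain only the examples quoted in the text, and the WordNet expansion, POS tagging, stemming and the final scoring formula are omitted. The names `contains_any` and `urgent_reply_sentences` are hypothetical.

```python
import re

# Only the example words quoted in the text; [9] expands these with WordNet.
REPLY_PHRASES = ("contact", "get back", "reply")
URGENCY_WORDS = ("now", "instantly", "immediately")


def contains_any(text, phrases):
    """True if any phrase occurs in text as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(p) + r"\b", text, re.IGNORECASE)
        for p in phrases
    )


def urgent_reply_sentences(email_text):
    """Return sentences containing both a reply-inducing phrase and an urgency word."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", email_text.strip())
    return [
        s for s in sentences
        if contains_any(s, REPLY_PHRASES) and contains_any(s, URGENCY_WORDS)
    ]
```

A sentence such as "Please reply immediately with your details." is returned, while a sentence that only mentions "contact" without an urgency word is not.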
2.23 Using Both URL and email content
Some approaches utilize all the information contained in an email, such as the header, the
links, and the text content. Verma et al. implemented a software tool called PhishNet-NLP that utilizes
all of this information to detect phishing emails [10]. PhishNet-NLP
analyzes the text of an email by using NLP techniques to give it a Textscore and a Contextscore.
PhishNet-NLP computes the Textscore by using the following NLP techniques: lexical analysis,
part-of-speech tagging, named entity recognition, normalization of words to lower case,
stemming, and stop word removal. First, it uses lexical analysis to split the email into
sentences, and each sentence into words. Then, it normalizes the words to lower case,
removes the stop words, and stems the words. Then PhishNet-NLP uses named entity
recognition to find out whether the email mentions at least one institution in the body. If
no institution is mentioned, the email receives a Textscore of 0, which stands for
legitimate. PhishNet-NLP also defines a set of special verbs SV, which contains
verbs that are commonly used by malicious emails to instruct people to perform certain actions,
together with hyponyms and synonyms of these verbs. PhishNet-NLP obtains the hyponyms and
synonyms of verbs by using WordNet. Then PhishNet-NLP gives each of these verbs a score
by using a formula that takes four factors into account: x, l, a, and L. The values of parameters
x and a depend on whether it finds a certain combination of words in a sentence. The
parameter l depends on the number of links in the email. The parameter L is the level of
the verb, which is one more than the least number of hyponymy links followed to reach the
verb from a synset. After computing scores for each verb of the set SV, the Textscore of an
email is equal to the maximum of all the verb scores.
Besides the Textscore, PhishNet-NLP also gives a Contextscore to each email. It treats the email
as a vector of TF-IDF values in the semantic space after applying stop word elimination and
stemming. It uses the Contextscore to check the similarity between this email and other
emails that the user has received before. The Contextscore = 1 when it finds a similar email in the
inbox.
After computing both the Textscore and the Contextscore, PhishNet-NLP combines these two
scores to get a Final-text-score. Then it combines the Final-text-score, headerScore, and
linkScore to decide whether an email is malicious or legitimate.
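The idea of comparing emails as TF-IDF vectors can be sketched as follows. This is an illustration under stated assumptions, not PhishNet-NLP's code: tokenization is plain whitespace splitting, stop word elimination and stemming are skipped, the smoothed IDF variant is one common choice, and cosine similarity is assumed as the similarity measure. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a sparse {term: tf-idf weight} dictionary."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed inverse document frequency (an assumed variant).
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two emails with the same wording have similarity close to 1, and emails sharing no terms have similarity 0; how high the similarity must be for an email to count as "similar" is not stated in the text.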
3. Introduction to the Algorithm of SEParser
This research is based on the algorithm created by [3]. [3] is an approach to detecting social
engineering attacks by using NLP techniques to parse conversations. To detect social engineering
attacks, SEParser analyzes the dialogs through semantic analysis of each sentence. Figure 1
presents the structure of the detection process of SEParser.
Figure 1 Outline of the detection process
During most social engineering attacks, the attacker must express one of the following
types of sentence in order to obtain sensitive information from users:
1. a question related to sensitive information
2. a command to perform a dangerous operation
SEParser analyzes the text conversation between the attacker and the victim, and checks the
appropriateness of the topic. If a sentence asks for personal information or demands a dangerous
action, the topic of that sentence is not appropriate. SEParser uses NLP techniques to find patterns
in the sentence’s structure. By using these patterns, SEParser parses all the sentences in the
conversation, and detects the questions and commands among them. As soon as SEParser finds the
questions and commands, it extracts their main topics. The main topic of a sentence
is extracted by analyzing the parse tree of the sentence. For example, Figure 2 presents a security
policy statement.
Networking equipment must not be manipulated.
Figure 2 Security policy statement
By manually analyzing this sentence, we can identify the main topic and its related action
shown in Table 1.
For this statement, the topic of the sentence is “networking equipment”, and its related
action is “manipulate”. In order to identify this topic and action pair, SEParser first gets a parse
tree of the sentence by using the Stanford Parser [11]. The Stanford Parser is a context-free parser that
is well developed and used by various NLP research projects. The parse tree returned by the Stanford
Parser gives every word in the sentence a tag. The Stanford Parser can be set to use various taggers;
however, the tags shown in the figure are from the Penn Treebank Tagset [12]. Figure 3 shows the
parse tree of the sentence.
Figure 3 Parse tree of the first sentence
By analyzing the sentence structure, SEParser identifies the sentence as neither a command
nor a question. Therefore, it does not continue to find the topic of the statement. However, if we use
another example such as “Reset the router”, SEParser first gets the parse tree from the
Stanford Parser shown in Figure 4.
Figure 4 Parse tree of "reset router"
SEParser identifies this sentence as a command by analyzing its parse tree. It
recognizes the pattern of a direct command’s sentence structure. Then SEParser finds the
topic of the sentence as router, and its related verb as reset. Therefore, the verb/noun pair of the
sentence is (reset, router). SEParser detects this verb/noun pair by finding the main verbs of the
sentence and their direct objects. The verb, “reset”, is a typical command for malicious actions, so
it would be found in the blacklist. Moreover, the noun related to this verb is “router”, which is an
important type of networking equipment. Therefore, this verb/noun pair would be found in the
blacklist of verb/noun pairs used by SEParser. SEParser would detect this sentence as malicious,
and would alert victims before actual damage happens.
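The two examples above can be traced with a small sketch. It does not call the Stanford Parser; the parse trees are written by hand as nested tuples with Penn Treebank tags, and the command pattern used here (a sentence whose first constituent is a verb phrase starting with a base-form verb) is a simplifying assumption, since the exact patterns used by SEParser are not listed in the text. All function names are hypothetical.

```python
# A tree node is (label, child, child, ...); a leaf is (pos_tag, word).
COMMAND = ("S", ("VP", ("VB", "Reset"), ("NP", ("DT", "the"), ("NN", "router"))))
STATEMENT = (
    "S",
    ("NP", ("NN", "Networking"), ("NN", "equipment")),
    ("VP", ("MD", "must"), ("RB", "not"),
     ("VP", ("VB", "be"), ("VP", ("VBN", "manipulated")))),
)


def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Yield every (pos_tag, word) leaf below node, left to right."""
    if is_leaf(node):
        yield node
    else:
        for child in node[1:]:
            yield from leaves(child)


def is_direct_command(tree):
    """Assumed pattern: S whose first child is a VP that starts with a VB."""
    if tree[0] != "S" or len(tree) < 2:
        return False
    first = tree[1]
    if first[0] != "VP" or is_leaf(first):
        return False
    head = first[1]
    return is_leaf(head) and head[0] == "VB"


def extract_pair(tree):
    """Return (verb, direct-object head noun) for a direct command, else None."""
    if not is_direct_command(tree):
        return None
    verb_phrase = tree[1]
    verb = verb_phrase[1][1].lower()
    for child in verb_phrase[2:]:
        if child[0] == "NP":
            nouns = [word for tag, word in leaves(child) if tag.startswith("NN")]
            if nouns:
                return (verb, nouns[-1].lower())
    return None
```

With these hand-written trees, `extract_pair(COMMAND)` yields the pair (reset, router) discussed above, while the policy statement is not recognized as a command and yields no pair.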
3.1 SEParser Algorithm
Figure 5 presents the process of question/command detection and topic extraction.
Figure 5 Question/command detection and Topic detection algorithm
As shown in Figure 5, SEParser first gets s best parse trees from the Stanford Parser. Then, it
uses patterns to decide the type of the sentence. If the sentence is a question or a command,
SEParser extracts the verb/noun pairs of the sentence. For each verb/noun pair, SEParser uses the
MatchTopic Algorithm to determine the appropriateness of the pair, as shown in Figure 6.
Figure 6 MatchTopic Algorithm
The MatchTopic Algorithm takes its inputs from the previous algorithm, and then searches the
blacklist to determine whether the input verb/noun pairs are in the blacklist or not. In order to
identify a verb/noun pair as malicious, both the verb and the noun have to exactly match an entry
in the blacklist.
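The exact-match rule can be expressed in a few lines. This is a sketch rather than the MatchTopic Algorithm of Figure 6: the blacklist contents are hypothetical apart from (reset, router), which the text gives as an example, and lower-casing before comparison is an assumption.

```python
# Hypothetical blacklist; only ("reset", "router") comes from the text.
BLACKLIST = {("reset", "router"), ("send", "password")}


def match_topic(pairs, blacklist=BLACKLIST):
    """True if any (verb, noun) pair exactly matches a blacklist entry."""
    return any((verb.lower(), noun.lower()) in blacklist for verb, noun in pairs)
```

Because both elements must match, a pair such as (reset, password) is not flagged even though each word appears somewhere in the blacklist.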
4. Applying and Improving the Algorithm for Detection of Phishing Emails
In this research, we applied algorithm [3] to detect phishing emails, and improved this
algorithm in order to gain better performance. We made three major improvements:
1. Optimizing the performance of the algorithm.
2. Adapting and extending the verb/noun pair blacklist to phishing emails.
3. Combining algorithm [3] with a link analysis tool, Netcraft [14].
In order to apply algorithm [3] to phishing emails, we first systematically tested the
performance of SEParser on detecting phishing emails without any improvement. We used 500
legitimate emails and 438 malicious emails. After testing, the first improvement we made was
optimizing SEParser so that it would run faster on a larger text corpus. By changing a nested for-loop
and deleting some unused functions, we managed to triple the running speed of SEParser.
The second improvement we made was extending and adjusting the verb/noun pair blacklist so that
it would be specialized to the phishing email context, instead of conversation. There are phrases such
as “click on the link” or “send this information back” that appear often in phishing emails, but are
less common in conversations. Therefore, we first experimented with optimizing the blacklist to
better identify common phishing phrases. We used information extraction techniques to find the
most common verb/noun pairs used in malicious emails and in legitimate emails. Then, we
manually went through them and added the main differences between the two sets to the verb/noun
pair blacklist.
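One way to shortlist candidates for this manual review is to count verb/noun pairs in each corpus and keep the pairs that are frequent in phishing emails but absent from legitimate ones. The sketch below assumes the pairs have already been extracted; the selection criterion and the name `candidate_pairs` are illustrative assumptions, not the procedure used in this work.

```python
from collections import Counter


def candidate_pairs(phishing_pairs, legitimate_pairs, top_n=10):
    """Most frequent phishing verb/noun pairs that never occur in legitimate mail."""
    phishing_counts = Counter(phishing_pairs)
    legitimate_counts = Counter(legitimate_pairs)
    candidates = [
        (pair, count)
        for pair, count in phishing_counts.most_common()
        if pair not in legitimate_counts
    ]
    return candidates[:top_n]
```

The returned list is only a starting point; deciding which candidates actually belong in the blacklist remains a manual step.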
The third improvement was combining SEParser with a link analysis tool, Netcraft [14]. After the
second improvement, we realized that the majority of the emails can be detected by link analysis.
Therefore, we decided that for detecting phishing emails, it was practical to combine our algorithm
with a link analysis tool. After carefully examining various link analysis tools, we decided to use
Netcraft [14]. Netcraft is an anti-phishing toolbar that uses link analysis to detect phishing URLs.
At least three research projects have tested Netcraft and reported it as one of the most accurate
anti-phishing link analysis programs. Netcraft uses several ways to detect phishing URLs. The first way
is using blacklists of phishing websites reported by the community. Netcraft constantly updates these
blacklists. Netcraft also analyzes the potential phishing website’s domain, IP address, hosting
company, hosting country, latest performance, hosting history, and site technology. By analyzing
all this information comprehensively, Netcraft gives each link a risk rating on a scale of 0
to 10. A lower risk rating means that the link is less likely to be malicious. We created a tool that
parses the HTML format of phishing emails and extracts the links from them. We used a Python
library called BeautifulSoup [15] to parse the HTML content of the email and extract the links
embedded in the emails. Some phishing emails trick users into clicking on malicious links by showing
non-malicious links in plain text, and embedding the actual malicious link in the hyperlink
connected to the text. By parsing the HTML instead of the plain text, we were able to extract the
actual links from the emails. Once we got all the links, we created a program that takes the links
and sends an HTTP request to Netcraft’s website. The program parses the result it gets from the
website to find the risk rating of the link. For our program, we chose to use 5 as the threshold. If
the risk rating of a link is higher than 5, then the program identifies this link as malicious.
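The link extraction and threshold steps can be sketched as follows. To keep the example free of third-party dependencies it uses Python's standard `html.parser` module in place of BeautifulSoup, which is what this work actually used. The request to Netcraft is left out because its format is not described in the text, so the risk rating is simply passed in as a number; `LinkExtractor`, `extract_links` and `is_malicious_link` are hypothetical names.

```python
from html.parser import HTMLParser

RISK_THRESHOLD = 5  # threshold chosen in the text


class LinkExtractor(HTMLParser):
    """Collect the href target of every <a> tag, not the visible link text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def is_malicious_link(risk_rating, threshold=RISK_THRESHOLD):
    """A link is flagged only when its rating is strictly above the threshold."""
    return risk_rating > threshold
```

For an anchor whose visible text shows one address while its `href` points to another, `extract_links` returns the `href` target, which is the link a click would actually follow.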
After feeding the link into Netcraft and parsing the results, the algorithm then determines
whether it is necessary to run the SEParser. If the link is not identified as malicious by Netcraft,
then the algorithm runs the improved SEParser to analyze the content.
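The two-stage decision can be summarized in a short sketch. The content check is passed in as a callable so that it only runs when no link is flagged, mirroring the order described above; the function name, the return labels and the treatment of emails with several links (any flagged link marks the email) are assumptions.

```python
def classify_email(link_ratings, content_is_malicious, threshold=5):
    """Link analysis first; content analysis only if no link is flagged.

    link_ratings: risk ratings (0 to 10) of the links found in the email.
    content_is_malicious: zero-argument callable standing in for SEParser.
    """
    if any(rating > threshold for rating in link_ratings):
        return "malicious"
    return "malicious" if content_is_malicious() else "legitimate"
```

An email with no links at all falls straight through to the content check.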
5. Experimental Results
5.1 Databases used to Test the Detection of Phishing Emails
In order to systematically test the performance of SEParser on the detection of phishing emails,
we decided to collect two different databases: a legitimate email corpus and a phishing email
corpus. For the legitimate email corpus, we chose to use the Enron email corpus. For the
malicious email corpus, we chose to use the phishing email corpus collected and shared by
Nazario, J [13]. This phishing email corpus is also used by other research projects such as
PhishNet-NLP [10] and PhishCatch. For test purposes, we picked only 500 legitimate emails and
438 phishing emails.
We created a script called read_mbox.py to parse these emails. Most of the emails from the
Enron email corpus are in Mbox format. The emails collected by Jose Nazario are also in the Mbox format,
and their contents are sometimes HTML and sometimes plain text. Therefore, the script first checks the format
of the email and parses the content accordingly.
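The format check can be sketched with Python's standard `email` package. This is not the read_mbox.py script itself: the function name `body_text` is hypothetical, HTML tags are stripped with a crude regular expression rather than a real HTML parser, and only the first text part of each message is used.

```python
import re
from email import message_from_string
from html import unescape


def body_text(msg):
    """Return the first text part of a message as whitespace-normalized plain text."""
    for part in msg.walk():
        content_type = part.get_content_type()
        if content_type in ("text/plain", "text/html"):
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace")
            if content_type == "text/html":
                # Crude tag removal, sufficient for a sketch only.
                text = unescape(re.sub(r"<[^>]+>", " ", text))
            return " ".join(text.split())
    return ""
```

Messages stored in an Mbox file could be read by iterating over `mailbox.mbox(path)` from the standard library and passing each message to `body_text`; for a quick check, `message_from_string` builds a message from raw text.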
5.2 Results Before Combining with Netcraft
5.21 Optimization to Improve Run Time
For our first test, we directly applied SEParser to our databases without major changes. We
found that we needed to optimize the performance of the program in order for it to parse
around 1000 emails in a reasonable amount of time. Before the changes, we ran the program on
a laptop with an Intel Core i7 processor and a 256GB solid state drive, running Windows 10.
This computer was able to process 438 emails in around 3 hours. We realized the running time
was too long for us to effectively test the algorithm. Therefore, we made several changes to the
code so that it ran about 3 times faster than before. After the changes, we ran the program on the same
computer, and this time it took only 55 minutes for the program to parse the 438 emails.
5.22 Adjusting verb/noun pair blacklist
For the first test, SEParser detected 54 malicious emails out of 438 phishing emails and 0 out
of 500 legitimate emails. Therefore, its true positive rate is 12 percent, and its false positive rate is 0 percent.
We were satisfied with the false positive rate, but we decided to extend the verb/noun pair blacklist
in order to increase the true positive rate. Based on the formulas shown in equations (1) and
(2), the precision of the algorithm is 100 percent, and the recall is 12 percent.
Equation (1): $\text{Precision} = \frac{tp}{tp + fp}$
Equation (2): $\text{Recall} = \frac{tp}{tp + fn}$
The equation uses tp as the number of true positives, fp as the number of false positives, and
fn as the number of false negatives.
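As a worked check of the two definitions, the counts reported for the first test (54 of 438 phishing emails detected, 0 of 500 legitimate emails flagged) can be substituted directly. The function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of flagged emails that are actually phishing."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of phishing emails that are flagged."""
    return tp / (tp + fn)


# First test: 54 of 438 phishing emails detected, no legitimate email flagged.
tp, fp = 54, 0
fn = 438 - tp  # phishing emails that were missed

print(f"precision = {precision(tp, fp):.0%}")
print(f"recall    = {recall(tp, fn):.0%}")
```

With these counts the precision is 54/54 and the recall is 54/438, which round to the 100 percent and 12 percent quoted above.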
For our second test, we mainly focused on finding and adding appropriate verb/noun pairs
that suited the phishing email context. After adding the appropriate verb/noun pairs, we
managed to increase the true positive rate from 12 percent to 38 percent, while the false positive rate was
still 0 percent. After the change, the precision of the algorithm is still 100 percent, and the recall
is 38 percent.
5.3 Results After combining with Netcraft
By using Netcraft to analyze the links inside phishing emails, we detected 255 emails out of
438 malicious emails. Then we ran SEParser on the remaining 183 malicious emails, and it
detected 73. By combining with Netcraft, we detected 322 phishing emails out of 438. Therefore, the
true positive rate of our program is 73.5 percent. Then we ran our program on the legitimate email
corpus, and it flagged 2 out of 500 legitimate emails. Therefore, our false positive rate is 0.4 percent.
After combining with Netcraft, the precision of the algorithm is 99 percent, and the recall is 73.5
percent.
Table 3 shown below presents the performance comparisons for the different
improvements. From this table, we can see that the recall of the program increased by around 200%
after improvement. For all of these tests, we used 438 malicious emails and 500 legitimate
emails.
                 SEParser   Netcraft   SEParser+Netcraft
True Positive    166        255        322
False Positive   0          2          2
Precision        100%       99%        99%
Recall           38%        58%        73.5%
Table 3 Comparison between different results
6. Conclusion
In this paper, we have tested and improved SEParser on detecting phishing emails. By testing
and adjusting SEParser, we have shown that SEParser’s algorithm can be generalized from
analyzing dialogs of social engineering attacks to detecting phishing emails by analyzing the parse tree
and grammar structure of sentences. By combining the improved SEParser with
link analysis software, it is able to detect most of the malicious emails with a false positive rate
of less than one percent. To further improve the performance in a future project, we believe that we can
find certain patterns in most phishing emails. By analyzing the parse trees of sentences, the
program would be able to identify these patterns in the content of the email. By semantically
analyzing the content, the future program should be able to detect more phishing emails without
causing more false positives.
7. References
1. Kerstein, Paul (July 19, 2005). "How Can We Stop Phishing and Pharming Scams?". CSO.
Archived from the original on March 24, 2008.
2. "20% Indians are victims of Online phishing attacks: Microsoft". IANS. news.biharprabha.com.
Retrieved February 11, 2014.
3. Y. Sawa, H. R. Bhakta, I. G. Harris, "Detection of Social Engineering Attacks Through Natural
Language Processing of Conversations", IEEE Conference on Semantic Computing, February 2016
4. Google, “Google safe browsing API,” http://code.google.com/apis/safebrowsing/, accessed Oct
2011
5. N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell, “Client-side defense against web-based
identity theft,” in NDSS. The Internet Society, 2004.
6. Garera, S., Provos, N., Chew, M., Rubin, A.: A framework for detection and measurement of
phishing attacks. In: Proc. 2007 ACM Workshop on Recurring Malcode, pp. 1–8 (2007)
7. Chen, J., Guo, C.: Online detection and prevention of phishing attacks. In: First Int’l Conf. on
Communications and Networking in China, ChinaCom 2006, pp. 1–7. IEEE (2006)
8. A. Stone, "Natural-Language Processing for Intrusion Detection," in Computer, vol. 40, no. 12,
pp. 103-105, Dec. 2007. doi: 10.1109/MC.2007.437
9. Shivam Aggarwal, Vishal Kumar, and S. D. Sudarsan. Identification and detection of phishing
emails using natural language processing techniques. In Proceedings of the 7th International
Conference on Security of Information and Networks, SIN ’14, pages 217:217–217:222. ACM,
2014.
10. Verma et al: Detecting Phishing Emails the Natural Language Way. In: ESORICS 2012, LNCS
7459, pp. 824–841, 2012.
11. Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the
41st Annual Meeting on Association for Computational Linguistics – Volume 1, 2003.
12. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated
corpus of English: The Penn Treebank. Comput. Linguist., 19(2), June 1993.
13. Nazario, J.: The online phishing corpus (2016)
14. 3Sharp, 3Sharp Study finds Internet Explorer 7 Edges Out Netcraft As Most Accurate for
Anti-Phishing Protection. 2006. http://www.3sharp.com/projects/antiphishing/
15. BeautifulSoup Python Library: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (2016)

More Related Content

What's hot

Cross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesCross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesIJSRED
 
AN INTELLIGENT CLASSIFICATION MODEL FOR PHISHING EMAIL DETECTION
AN INTELLIGENT CLASSIFICATION MODEL FOR PHISHING EMAIL DETECTIONAN INTELLIGENT CLASSIFICATION MODEL FOR PHISHING EMAIL DETECTION
AN INTELLIGENT CLASSIFICATION MODEL FOR PHISHING EMAIL DETECTIONIJNSA Journal
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
A review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamA review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamAlexander Decker
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filteringijtsrd
 
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...
IRJET-  	  An Effective Analysis of Anti Troll System using Artificial Intell...IRJET-  	  An Effective Analysis of Anti Troll System using Artificial Intell...
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...IRJET Journal
 
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for  Spam FilteringA Model for Fuzzy Logic Based Machine Learning Approach for  Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for Spam FilteringIOSR Journals
 
Malicious-URL Detection using Logistic Regression Technique
Malicious-URL Detection using Logistic Regression TechniqueMalicious-URL Detection using Logistic Regression Technique
Malicious-URL Detection using Logistic Regression TechniqueDr. Amarjeet Singh
 
IRJET- Identification of Clone Attacks in Social Networking Sites
IRJET-  	  Identification of Clone Attacks in Social Networking SitesIRJET-  	  Identification of Clone Attacks in Social Networking Sites
IRJET- Identification of Clone Attacks in Social Networking SitesIRJET Journal
 
Physical and Cyber Crime Detection using Digital Forensic Approach: A Complet...
Physical and Cyber Crime Detection using Digital Forensic Approach: A Complet...Physical and Cyber Crime Detection using Digital Forensic Approach: A Complet...
Physical and Cyber Crime Detection using Digital Forensic Approach: A Complet...IJARIIT
 
Customer Involvement in Phishing Defence
Customer Involvement in Phishing DefenceCustomer Involvement in Phishing Defence
Customer Involvement in Phishing DefenceJordan Schroeder
 
An iac approach for detecting profile cloning
An iac approach for detecting profile cloningAn iac approach for detecting profile cloning
An iac approach for detecting profile cloningIJNSA Journal
 
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...
IRJET - Detection and Prevention of Phishing Websites using Machine Learning ...IRJET Journal
 
IRJET - Unauthorized Terror Attack Tracking System using Web Usage Mining
IRJET - Unauthorized Terror Attack Tracking System using Web Usage MiningIRJET - Unauthorized Terror Attack Tracking System using Web Usage Mining
IRJET - Unauthorized Terror Attack Tracking System using Web Usage MiningIRJET Journal
 

Research Report

  • 1. Detecting Phishing Email Using Natural Language Processing By Ian Harris, Tianrui Peng University of California, Irvine Abstract Phishing emails are one of the most common and harmful threats that people constantly face in contemporary society. In this paper, we present a natural language processing algorithm that detects malicious emails by analyzing sentence structures and the relationships between words. Our algorithm mainly focuses on analyzing the content of phishing emails. Many programs and research projects focus on detecting phishing emails using the title, the header, and the links inside the emails. However, some sophisticated phishing emails do not contain malicious links. Our algorithm can outperform other algorithms at analyzing this type of content-oriented phishing email. Moreover, our algorithm can also be used to improve any existing link-oriented phishing email detection algorithm. By using a natural language processing (NLP) approach, our algorithm does not need to rely on constantly updated data, and will be able to generalize to newly generated attacks. Therefore, our algorithm is able to detect the constantly changing and developing phishing email attacks in contemporary society. 1. Introduction In contemporary society, the security of private information is a major concern of every
  • 2. person. Social engineering attacks are dangerous threats that use human interaction to mentally manipulate people into exposing their confidential information. Phishing is a type of social engineering attack that focuses on gaining sensitive information by posing as a trustworthy entity. Electronic communications, such as email or text messages, are common platforms for delivering phishing attacks. Phishing is a continual threat that constantly affects our daily lives. Attackers often gain personal information that affects the victims’ personal lives, financial wellbeing, and work environment. Between May 2004 and May 2005, approximately 1.2 million computer users in the United States suffered financial losses because of phishing attacks, totaling approximately 929 million USD [1]. According to the third Microsoft Computing Safety Index report, released in February 2014, around 5 billion USD is lost to phishing attacks annually [2]. Phishing emails are the most common type of phishing attack that people have to deal with. Attackers usually disguise themselves as popular social websites, banks, administrators from IT departments, or popular shopping websites. These emails often lure users into entering personal information on a malicious website that looks similar to the legitimate one. There are different types of phishing emails, such as spear phishing, clone phishing, and whaling. As technology becomes more integrated into our culture, the damage caused by phishing emails continues to rise. This growing threat calls for us to actively seek effective solutions to this problem. The most common methods of dealing with this problem are reported blacklists, user training, public awareness, and automatic phishing detection software. Many previous research projects have attempted to solve this problem by using software classification approaches. The most common techniques being used are machine
  • 3. learning, blacklists, link analysis, and natural language processing (NLP). NLP is suitable for this problem because it can extract semantic information from the content of the email without relying on an existing website blacklist or link analysis. A legitimate email typically attempts to present some information to the user. A malicious email, on the other hand, usually aims at luring targeted users to visit malicious websites or at eliciting a response. Using NLP techniques, we can analyze the content of the message to determine its motives, and classify it as legitimate or malicious. There have been various attempts to use NLP to create an algorithm capable of detecting phishing emails. The research presented here is based on an NLP algorithm proposed by Professor Ian G. Harris, Yuki Sawa, Ram Bhakta, and Christopher Hadnagy [3]. This algorithm was originally designed to identify social engineering attacks by applying NLP techniques to conversations. It examines all dialog text transmitted from the attacker to the targeted user, and checks the appropriateness of each sentence. A sentence is considered malicious if it requests sensitive information or commands an action that might expose personal information. [3] uses NLP techniques to detect questions and commands, and to identify their related subjects. Once a sentence is categorized as a question or a command, its potential topics are extracted by finding verb and noun pairs which connect verbs with their direct objects. Each pair is then evaluated by checking whether it is contained in a blacklist of malicious verb and noun pairs. 1.1 My Contribution My contributions to this research involved systematically evaluating the performance of [3] generalized to the detection of phishing emails, and improving the algorithm to gain better
  • 4. performance at identifying phishing email attacks. I wrote a program that gathered and parsed 438 malicious emails and 500 legitimate emails to evaluate [3]. Before improvement, [3]’s precision is 100% and its recall is 12%. After improving and adapting its blacklist, the new program’s precision is 100% and its recall is 38%. Then, by combining this program with another link analysis program, Netcraft [14], the new program’s precision is 99% and its recall increases to 73.5%. 2. Related Work Different research groups have attempted to prevent phishing attacks through various approaches. There are broadly two types of anti-phishing schemes. The first are schemes used to detect phishing webpages. The second are schemes that focus on detecting phishing emails. 2.1 Work related to detecting phishing websites The most common approaches to detecting phishing websites are webpage content analysis and link analysis. There are two major approaches to solving this problem. The first approach is blacklist-based anti-phishing techniques such as the Google Safe Browsing API. The Google Safe Browsing API allows users to check whether a URL is in blacklists that are gathered and updated by Google [4]. The second approach is based on analyzing the content of the website. SpoofGuard, a web browser plug-in created by Stanford University, uses this approach. It weights certain malicious components found in the HTML content [5]. 2.2 Work related to detecting phishing emails
  • 5. 2.21 Using URL analysis Much research focuses on analyzing the structure of the links rather than the content of the emails to identify phishing emails. For example, an approach proposed by Garera presents several differences between a malicious URL and a benign URL [6]. By identifying these distinctions, the authors created a logistic regression filter to detect phishing emails. In addition, LinkGuard uses data provided by APWG to identify common features of malicious links contained in phishing emails [7]. 2.22 Using email content Many phishing detection algorithms are content-oriented. For example, the research of Allen Stone resulted in software called EBIDS [8]. It uses NLP technologies to analyze the plain text in emails. It first reads an email in as a text file, and then feeds the input into OntoSem through the DEKADE API. The OSIM processes the text, writing its semantic and referential knowledge of the sentences in ontological terms. This step creates equivalence sets for natural language strings. For example, instead of matching on “send us your account information,” EBIDS can match on the concept of “request for personal information.” Then, based on the results of OntoSem, it runs a string match algorithm to determine whether the email is a phishing email. The decision is based on a rule set. The authors put four rules in the rule set for testing: account compromise, financial opportunity, account change, and opportunity. The program matches certain words and phrases within the email content, and then computes the similarity between the email content and the four rules. If the similarity of a rule passes a threshold, which means the email belongs to that rule, the program then decides the email is
  • 6. malicious. If the email does not match any of these rules, then the email is judged to be legitimate. Another content-oriented research project also aimed at detecting phishing emails through an NLP approach. The work of Aggarwal used content analysis to find four parameters that can decide whether an email is malicious or not [9]. These four parameters are: absence of names, mention of money, reply-inducing sentences, and sense of urgency. The algorithm detects mention of money by keeping a set of names and symbols of all currencies in the world and their common variants, and then checking whether the email contains any of them. In order to detect the presence of a reply-inducing sentence, it builds a set of words and phrases which ask the user to reply to the email, such as contact, get back, and reply. It also uses WordNet to include the hyponyms and synonyms of these reply-inducing words in the set. In order to find out whether any of these words are mentioned in the email content, it first uses the Stanford CoreNLP API to tag the words in the email with POS tags, and then stems the words of the tagged file to their base form. In addition, it detects a sense of urgency by checking whether the email contains words such as now, instantly, or immediately. If a sentence contains both words that induce a reply and words that convey a sense of urgency, then the email has a high probability of being malicious. After detecting these four parameters, it uses a formula to combine the information it has gathered and give a final score. 2.23 Using Both URL and email content Some approaches utilize all the information contained in an email, such as the header, the links, and the text content. Verma implemented software called PhishNet-NLP that utilizes
  • 7. all the information contained in an email to detect phishing emails [10]. PhishNet-NLP analyzes the text of an email by using NLP techniques to give it a Textscore and a Contextscore. PhishNet-NLP computes the Textscore by using the following NLP techniques: lexical analysis, part-of-speech tagging, named entity recognition, normalization of words to lower case, stemming, and stop word removal. First, it uses lexical analysis to split the email into sentences, and each sentence into words. Then, it normalizes the words to lower case, removes the stop words, and stems the words. Then PhishNet-NLP uses named entity recognition to find out whether the email mentions at least one institution in the body. If no institution is mentioned, then the email receives a Textscore of 0, which stands for legitimate. PhishNet-NLP also defines a set of special verbs SV, which contains verbs that are usually used by malicious emails to instruct people to do certain actions, together with the hyponyms and synonyms of these verbs. PhishNet-NLP gets hyponyms and synonyms of verbs by using WordNet. Then PhishNet-NLP gives each of these verbs a score by using a formula that takes four factors into account: x, l, a, and L. The values of parameters x and a depend on whether it finds a certain combination of words in a sentence. The parameter l depends on the number of links in the email. The parameter L is the level of the verb, which is one more than the least number of hyponymy links followed to reach the verb from a synset. After computing scores for each verb of the set SV, the Textscore of an email is equal to the maximum of all the verb scores. Besides the Textscore, PhishNet-NLP also gives a Contextscore for each email. It treats the email as a vector of TF-IDF values in the semantic space after applying stop word elimination and stemming. It uses the Contextscore to check the similarity between this email and other
  • 8. emails that the user received before. The Contextscore = 1 when it finds a similar email in the inbox. After computing both Textscore and Contextscore, PhishNet-NLP combines these two scores to get a Final-text-score. Then it combines Final-text-score, headerScore, and linkScore to decide whether an email is malicious or legitimate. 3. Introduction to the Algorithm of SEParser This research is based on the algorithm created by [3]. [3] is an approach to detecting social engineering attacks by using NLP techniques to parse conversations. To detect social engineering attacks, SEParser analyzes the dialogs through semantic analysis of each sentence. Figure 1 presents the structure of the detection process of SEParser. Figure 1 Outline of the detection process During most social engineering attacks, the attacker must express one of the following types of sentence in order to lure sensitive information from users: 1. a question related to sensitive information 2. a command to perform a dangerous operation SEParser analyzes the text conversation between the attacker and the victim, and checks the appropriateness of the topic. If a sentence requests personal information or demands a dangerous action, the topic of that sentence is not appropriate. SEParser uses NLP techniques to find a pattern
  • 9. in the sentence’s structure. By using this pattern, SEParser parses all the sentences in the conversation, and detects questions and commands among them. As soon as SEParser finds the questions and commands, it extracts their main topics. The main topic of a sentence is extracted by analyzing the parse tree of the sentence. For example, Figure 2 presents a security policy statement. “Networking equipment must not be manipulated.” Figure 2 Security policy statement By manually analyzing this sentence, we can identify the main topic and its related action shown in Table 1. For this statement, the topic of the sentence is “networking equipment”, and its related action is “manipulate”. In order to identify this topic and action pair, SEParser first gets a parse tree of the sentence by using the Stanford Parser [11]. The Stanford Parser is a context-free parser that is well-developed and used by various NLP research projects. The parse tree returned by the Stanford Parser gives every word in the sentence a tag. The Stanford Parser can be set to use various taggers; however, the tags shown in the figure are from the Penn Treebank Tagset [12]. Figure 3 shows the parse tree of the sentence.
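To make the idea of a tagged parse tree concrete, the following is a minimal sketch in Python. It does not call the Stanford Parser; the bracketed string `TREE` is a hand-written guess at a Penn Treebank style parse of the Figure 2 sentence, and the helper names `parse_brackets` and `tagged_words` are hypothetical. The sketch only shows how such a tree can be read into nested lists and how the tag attached to each word can be listed.

```python
# Hand-written, assumed parse of "Networking equipment must not be manipulated."
# in Penn Treebank bracket notation (not actual Stanford Parser output).
TREE = (
    "(ROOT (S (NP (NN Networking) (NN equipment)) "
    "(VP (MD must) (RB not) (VP (VB be) (VP (VBN manipulated)))) (. .)))"
)


def parse_brackets(text):
    """Read a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip the opening "("
        node = [tokens[pos]]          # first token after "(" is the label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip the closing ")"
        return node

    return read()


def tagged_words(node):
    """Return (tag, word) pairs for the leaves, from left to right."""
    if len(node) == 2 and isinstance(node[1], str):
        return [(node[0], node[1])]
    pairs = []
    for child in node[1:]:
        pairs.extend(tagged_words(child))
    return pairs


tree = parse_brackets(TREE)
for tag, word in tagged_words(tree):
    print(tag, word)
```

In this assumed tree the sentence node S begins with a noun phrase (NP) rather than a verb phrase, which is the kind of structural cue a pattern over the parse tree can use to tell a statement from a direct command.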
  • 10. Figure 3 Parse tree of the first sentence

By analyzing the sentence structure, SEParser would identify the sentence as neither a command nor a question. Therefore, it will not continue to look for the topic of the statement. However, if we use another example such as “Reset the router”, SEParser would first get the parse tree from the Stanford Parser, shown in Figure 4.

Figure 4 Parse tree of "reset router"

SEParser would identify this sentence as a command by analyzing its parse tree, because it recognizes the sentence structure of a direct command. SEParser would then find the topic of the sentence, router, and its related verb, reset. Therefore, the verb/noun pair of the sentence is (reset, router). SEParser detects this verb/noun pair by finding the main verbs of the sentence and their direct objects. The verb, “reset”, is a typical command for malicious actions, so it would be found in the blacklist. Moreover, the noun related to this verb is “router”, which is an important type of networking equipment. Therefore, this verb/noun pair would be found in the blacklist of verb/noun pairs used by SEParser. SEParser would detect this sentence as malicious, and would alert victims before actual damage happens.
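The extraction of the (reset, router) pair can be sketched as follows. This is a simplified illustration, not SEParser's actual implementation: the bracketed parses are hand-written, the direct-command pattern (a clause that opens with a verb phrase headed by a base-form verb) is an assumption, questions are not handled, and all function names are hypothetical.

```python
# Hand-written, illustrative parses; actual Stanford Parser output may differ.
COMMAND = "(ROOT (S (VP (VB Reset) (NP (DT the) (NN router))) (. .)))"
STATEMENT = (
    "(ROOT (S (NP (NN Networking) (NN equipment)) "
    "(VP (MD must) (RB not) (VP (VB be) (VP (VBN manipulated)))) (. .)))"
)


def parse_bracketed(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read_node()


def main_clause(tree):
    """Unwrap the ROOT node, if present."""
    label, children = tree
    return children[0] if label == "ROOT" else tree


def is_direct_command(tree):
    """Assumed pattern: an S clause whose first child is a VP starting with VB."""
    label, children = main_clause(tree)
    if label != "S" or not children:
        return False
    first = children[0]
    return (isinstance(first, tuple) and first[0] == "VP" and bool(first[1])
            and isinstance(first[1][0], tuple) and first[1][0][0] == "VB")


def verb_noun_pair(tree):
    """Return (main verb, head noun of its direct object), or None."""
    if not is_direct_command(tree):
        return None
    vp_children = main_clause(tree)[1][0][1]
    verb = vp_children[0][1][0].lower()
    for child in vp_children[1:]:
        if isinstance(child, tuple) and child[0] == "NP":
            # Take the last noun-tagged word of the object NP as its head.
            nouns = [leaf[1][0] for leaf in child[1]
                     if isinstance(leaf, tuple) and leaf[0].startswith("NN")]
            if nouns:
                return (verb, nouns[-1].lower())
    return None


print(verb_noun_pair(parse_bracketed(COMMAND)))
print(verb_noun_pair(parse_bracketed(STATEMENT)))
```

The command yields the pair ("reset", "router"), while the policy statement yields no pair because its clause opens with a subject noun phrase rather than a verb phrase.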
  • 11. 3.1 SEParser Algorithm

Figure 5 presents the process of question/command detection and topic extraction.

Figure 5 Question/command detection and Topic detection algorithm

As shown in Figure 5, SEParser first gets s best parse trees from the Stanford Parser. Then, it uses patterns to decide the type of the sentence. If the sentence is a question or a command, SEParser extracts the verb/noun pair of the sentence. For each verb/noun pair, SEParser uses the MatchTopic Algorithm to determine the appropriateness of the pair, as shown in Figure 6.

Figure 6 MatchTopic Algorithm

The MatchTopic Algorithm takes its inputs from the previous algorithm, and then searches the blacklist to determine whether the input verb/noun pairs are in the blacklist. For a verb/noun pair to be identified as malicious, both the verb and the noun have to exactly match an entry in the blacklist.

4. Applying and Improving Algorithm to Detection of Phishing Emails
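The improvements described in this section build on the exact-match lookup of the MatchTopic Algorithm from Section 3.1. As a minimal sketch of that lookup: the function names are hypothetical, and of the blacklist entries only (reset, router) comes from the example in the text, while the other two are guesses based on phrases quoted in this paper.

```python
# Hypothetical blacklist. Only ("reset", "router") is taken from the example
# in the text; the other entries are illustrative guesses.
BLACKLIST = {
    ("reset", "router"),
    ("click", "link"),
    ("send", "information"),
}


def match_topic(pairs, blacklist=BLACKLIST):
    """Return the pairs whose verb AND noun both exactly match one entry."""
    return [pair for pair in pairs if pair in blacklist]


def is_malicious(pairs, blacklist=BLACKLIST):
    """A sentence is flagged if at least one of its pairs is blacklisted."""
    return bool(match_topic(pairs, blacklist))


print(is_malicious([("reset", "router")]))    # exact match on verb and noun
print(is_malicious([("reset", "modem")]))     # verb matches, noun does not
print(is_malicious([("restart", "router")]))  # noun matches, verb does not
```

Under exact matching, only the first call reports a match. A pair that shares just the verb or just the noun with a blacklist entry is not flagged, so coverage depends directly on which pairs the blacklist contains.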
  • 12. In this research, we applied algorithm [3] to detect phishing emails, and improved this algorithm in order to gain better performance. We made three major improvements:

1. Optimizing the performance of the algorithm.
2. Adapting and extending the verb/noun pair blacklist to phishing emails.
3. Combining algorithm [3] with a link analysis tool, Netcraft [14].

In order to apply algorithm [3] to detect phishing emails, we first systematically tested the performance of SEParser on detecting phishing emails without any improvement. We used 500 legitimate emails and 438 malicious emails. After testing, the first improvement we made was optimizing SEParser so that it would run faster on a larger text corpus. By changing a nested for-loop and deleting some unused functions, we managed to triple the running speed of SEParser.

The second improvement we made was extending and adjusting the verb/noun pair blacklist so that it would be specialized to the phishing email context instead of conversation. There are phrases such as “click on the link” or “send this information back” that appear often in phishing emails, but are less common in conversations. Therefore, we first experimented with optimizing the blacklist to better identify common phishing phrases. We used information extraction techniques to find the most common verb/noun pairs used in malicious emails and in legitimate emails. Then, we manually went through them and added the pairs that most distinguished malicious emails from legitimate ones to the verb/noun pair blacklist.

The third improvement was combining SEParser with a link analysis tool, Netcraft [14]. After the second improvement, we realized that the majority of the emails can be detected by link analysis. Therefore, we decided that for detecting phishing emails, it was practical to combine our algorithm with a link analysis tool. After carefully examining various link analysis tools, we decided to use
  • 13. Netcraft [14]. Netcraft is an anti-phishing toolbar that uses link analysis to detect phishing URLs. At least three research projects have tested Netcraft and reported it as one of the most accurate anti-phishing link analysis programs. Netcraft uses several ways to detect phishing URLs. The first is using blacklists of phishing websites reported by the community, which Netcraft constantly updates. Netcraft also analyzes the potential phishing website’s domain, IP address, hosting company, hosting country, latest performance, hosting history, and site technology. By analyzing all this information comprehensively, Netcraft gives each link a risk rating on a scale of 0 to 10. A lower risk rating means that the link is less likely to be malicious.

We created a tool that parses the HTML of phishing emails and extracts the links from them. We used a Python library called BeautifulSoup [15] to parse the HTML content of the email and extract the links embedded in it. Some phishing emails trick users into clicking on malicious links by showing non-malicious links in plain text, and embedding the actual malicious link in the hyperlink connected to the text. By parsing the HTML instead of the plain text, we were able to extract the actual links from the emails.

Once we had all the links, we created a program that takes the links and sends an HTTP request to Netcraft’s website. The program parses the result it gets from the website to find the risk rating of the link. For our program, we chose to use 5 as the threshold. If the risk rating of a link is higher than 5, then the program identifies this link as malicious. After feeding the link into Netcraft and parsing the results, the algorithm then determines whether it is necessary to run SEParser. If the link is not identified as malicious by Netcraft, then the algorithm runs the improved SEParser to analyze the content.

5. Experimental Results
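The experiments in this section score the combined detector from Section 4 using true positives, false positives, precision, and recall. As a minimal sketch of that scoring, assume that each email has already been reduced to a list of Netcraft-style risk ratings and a content verdict. The `Email` record, its field names, and the toy data are hypothetical; the threshold of 5 and the order of link analysis first and content analysis second follow the text.

```python
from dataclasses import dataclass, field

RISK_THRESHOLD = 5  # a rating higher than this marks a link as malicious


@dataclass
class Email:
    is_phishing: bool                                 # ground-truth label
    link_ratings: list = field(default_factory=list)  # hypothetical 0-10 ratings
    content_flagged: bool = False                     # stand-in for the SEParser verdict


def classify(email):
    """Link analysis first; content analysis only if no link is rated malicious."""
    if any(rating > RISK_THRESHOLD for rating in email.link_ratings):
        return True
    return email.content_flagged


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def score(emails):
    """Count outcomes over labeled emails and return (precision, recall)."""
    tp = sum(1 for e in emails if e.is_phishing and classify(e))
    fp = sum(1 for e in emails if not e.is_phishing and classify(e))
    fn = sum(1 for e in emails if e.is_phishing and not classify(e))
    return precision_recall(tp, fp, fn)


# Made-up data for illustration only.
TOY_EMAILS = [
    Email(True, [8]),       # phishing, caught by the link rating
    Email(True, [], True),  # phishing without links, caught by content analysis
    Email(True, [2]),       # phishing, missed by both stages
    Email(False, [1]),      # legitimate, not flagged
]

print(score(TOY_EMAILS))
```

Applied to the counts reported for the combined system (322 true positives, 2 false positives, and 438 - 322 = 116 false negatives), `precision_recall(322, 2, 116)` gives a precision of about 0.994 and a recall of about 0.735.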
  • 14. 5.1 Databases Used to Test the Detection of Phishing Emails

In order to systematically test the performance of SEParser on the detection of phishing emails, we decided to collect two different databases: a legitimate email corpus and a phishing email corpus. For the legitimate email corpus, we chose the Enron email corpus. For the malicious email corpus, we chose the phishing email corpus collected and shared by J. Nazario [13]. This phishing email corpus is also used by other research projects such as PhishNet-NLP [10] and PhishCatch. For test purposes, we picked only 500 legitimate emails and 438 phishing emails. We created a script called read_mbox.py to parse these emails. Most of the emails from the Enron email corpus are in Mbox format. The emails collected by Nazario are also in Mbox format, and their contents are sometimes HTML and sometimes plain text. Therefore, the script first checks the format of the email and parses the content accordingly.

5.2 Results Before Combining with Netcraft

5.2.1 Optimization to Improve Run Time

For our first test, we directly applied SEParser to our databases without major changes. We found that we needed to optimize the performance of the program in order for it to parse around 1000 emails in a reasonable amount of time. Before the changes, we ran the program on a laptop with an Intel Core i7 processor and a 256GB solid state drive, running Windows 10. This computer was able to process 438 emails in around 3 hours. We realized the running time was too long for us to effectively test the algorithm. Therefore, we made several changes to the code so that it ran 3 times faster than before. After the changes, we ran the program on the same
  • 15. computer, and this time it took only 55 minutes for the program to parse the 438 emails.

5.2.2 Adjusting the Verb/Noun Pair Blacklist

For the first test, SEParser detected 54 malicious emails out of 438 phishing emails and 0 out of 500 legitimate emails. Therefore, its true positive rate is 12 percent and its false positive rate is 0 percent. Based on the formulas shown in Equations (1) and (2), the precision of the algorithm is 100 percent, and the recall is 12 percent. We were satisfied with the false positive rate, but we decided to extend the verb/noun pair blacklist in order to increase the true positive rate.

Equation (1): $\mathrm{Precision} = \frac{tp}{tp + fp}$

Equation (2): $\mathrm{Recall} = \frac{tp}{tp + fn}$

The equations use tp as the number of true positives, fp as the number of false positives, and fn as the number of false negatives.

For our second test, we mainly focused on finding and adding appropriate verb/noun pairs that suited the phishing email context. After adding the appropriate verb/noun pairs, we managed to increase the true positive rate from 12 percent to 38 percent, while the false positive rate remained 0 percent. After the change, the precision of the algorithm is still 100 percent, and the recall is 38 percent.

5.3 Results After Combining with Netcraft

By using Netcraft to analyze the links inside phishing emails, we detected 255 emails out of 438 malicious emails. Then we ran SEParser on the remaining 183 malicious emails, and it detected 73. By combining with Netcraft, we detected 322 phishing emails out of 438. Therefore, the
  • 16. true positive rate of our program is 73.5 percent. Then we ran our program on the legitimate email corpus, and it flagged 2 out of 500 legitimate emails. Therefore, our false positive rate is 0.4 percent. After combining with Netcraft, the precision of the algorithm is 99 percent, and the recall is 73.5 percent. Table 3 below presents the performance comparison for the different improvements. From this table, we can see that the recall of the program increased around 200% after improvement. For all of these tests, we used 438 malicious emails and 500 legitimate emails.

                 SEParser   Netcraft   SEParser+Netcraft
True Positive    166        255        322
False Positive   0          2          2
Precision        100%       99%        99%
Recall           38%        58%        73.5%

Table 3 Comparison between different results

6. Conclusion

In this paper, we have tested and improved SEParser on detecting phishing emails. By testing and adjusting SEParser, we have shown that SEParser’s algorithm can be generalized from analyzing dialogs of social engineering attacks to detecting phishing emails by analyzing the parse tree and grammar structure of sentences. By combining the improved SEParser with link analysis software, it is able to detect most of the malicious emails with a false positive rate of less than one percent. To further improve the performance in a future project, we believe that we can
  • 17. find a certain pattern in most of the phishing emails. By analyzing the parse trees of sentences, the program would be able to identify this pattern from the content of the email. By semantically analyzing the content, the future program should be able to detect more phishing emails without causing more false positives.

7. References

1. Kerstein, Paul (July 19, 2005). "How Can We Stop Phishing and Pharming Scams?". CSO. Archived from the original on March 24, 2008.
2. "20% Indians are victims of Online phishing attacks: Microsoft". IANS. news.biharprabha.com. Retrieved February 11, 2014.
3. Y. Sawa, H. R. Bhakta, I. G. Harris, "Detection of Social Engineering Attacks Through Natural Language Processing of Conversations", IEEE Conference on Semantic Computing, February 2016.
4. Google, “Google safe browsing API,” http://code.google.com/apis/safebrowsing/, accessed Oct 2011.
5. N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell, “Client-side defense against web-based identity theft,” in NDSS. The Internet Society, 2004.
6. Garera, S., Provos, N., Chew, M., Rubin, A.: A framework for detection and measurement of phishing attacks. In: Proc. 2007 ACM Workshop on Recurring Malcode, pp. 1–8 (2007)
7. Chen, J., Guo, C.: Online detection and prevention of phishing attacks. In: First Int’l Conf. on Communications and Networking in China, ChinaCom 2006, pp. 1–7. IEEE (2006)
8. A. Stone, "Natural-Language Processing for Intrusion Detection," in Computer, vol. 40, no. 12, pp. 103-105, Dec. 2007. doi: 10.1109/MC.2007.437
  • 18. 9. Shivam Aggarwal, Vishal Kumar, and S. D. Sudarsan. Identification and detection of phishing emails using natural language processing techniques. In Proceedings of the 7th International Conference on Security of Information and Networks, SIN ’14, pages 217:217–217:222. ACM, 2014.
10. Verma et al.: Detecting Phishing Emails the Natural Language Way. In: ESORICS 2012, LNCS 7459, pp. 824–841, 2012.
11. Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics – Volume 1, 2003.
12. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2), June 1993.
13. Nazario, J.: The online phishing corpus (2016)
14. 3Sharp, 3Sharp Study finds Internet Explorer 7 Edges Out Netcraft As Most Accurate for Anti-Phishing Protection. 2006. http://www.3sharp.com/projects/antiphishing/
15. BeautifulSoup Python Library: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (2016)