Research Report

Detecting Phishing Email Using Natural Language Processing
By Ian Harris, Tianrui Peng
University of California, Irvine
Abstract
Phishing emails are one of the most common and harmful threats that people constantly face
in contemporary society. In this paper, we present a natural language processing algorithm
that detects malicious emails by analyzing sentence structure and the relationships between
words. Our algorithm focuses mainly on analyzing the content of phishing emails. Many
programs and research projects detect phishing emails using the title, the header, and the
links inside the emails. However, some sophisticated phishing emails do not contain malicious
links. Our algorithm can outperform other algorithms at analyzing this type of
content-oriented phishing email. Moreover, our algorithm can also be used to improve any
existing link-oriented phishing email detection algorithm. By using a natural language
processing (NLP) approach, our algorithm does not need to rely on constantly updated data,
and can generalize to newly generated attacks. Therefore, our algorithm is able to detect the
constantly changing and developing phishing email attacks of contemporary society.
1. Introduction
In contemporary society, the security of private information is a major concern for every
person. Social engineering attacks are dangerous threats that use human interaction to
psychologically manipulate people into exposing their confidential information.
Phishing is a type of social engineering attack that focuses on gaining sensitive information by
posing as a trustworthy entity. Electronic communications, such as email or text messages, are
common platforms for delivering phishing attacks. Phishing is a continual threat that constantly
affects our daily lives. Attackers often gain personal information that affects the victims’ personal
lives, financial wellbeing, and work environment. Between May 2004 and May 2005,
approximately 1.2 million computer users in the United States suffered financial losses because of
phishing attacks, totaling approximately 929 million USD [1]. According to the third Microsoft
Computing Safety Index report, released in February 2014, around 5 billion USD is lost to phishing
attacks annually [2].
Phishing emails are the most common type of phishing attack that people have to deal with.
Attackers usually pose as popular social websites, banks, administrators from IT
departments, or popular shopping websites. These emails often lure users into entering personal
information on a malicious website that looks similar to the legitimate one. There are
different types of phishing emails, such as spear phishing, clone phishing, and whaling. As
technology becomes more integrated into our culture, the damage caused by phishing emails
continues to rise. This growing threat calls for us to actively seek effective solutions to
this problem. The most common methods of dealing with it are reported blacklists,
user training, public awareness, and automatic phishing detection software.
Many previous research projects have attempted to solve this problem by using
software classification approaches. The most common techniques are machine
learning, blacklists, link analysis, and natural language processing (NLP). NLP is suitable for
this problem because it can extract semantic information from the content of the email
without relying on an existing website blacklist or link analysis. A legitimate email typically
attempts to present some information to the user. A malicious email, on the other hand, usually
aims at luring targeted users to visit malicious websites or at eliciting a response. Using NLP
techniques, we can analyze the content of the message to determine its motives, and classify it
as legitimate or malicious.
There have been various attempts to use NLP to create an algorithm capable of detecting
phishing emails. The research presented here is based on an NLP algorithm proposed by
Professor Ian G. Harris, Yuki Sawa, Ram Bhakta, and Christopher Hadnagy [3]. This algorithm was
originally designed to identify social engineering attacks by applying NLP techniques to
conversations. It examines all dialog text transmitted from the attacker to the targeted user, and
checks the appropriateness of each sentence. A sentence is considered malicious if it
requests sensitive information or commands an action that might expose personal
information. [3] uses NLP techniques to detect questions and commands, and to identify their
related subjects. Once a sentence is categorized as a question or a command, its potential topics
are extracted by finding verb/noun pairs that connect verbs with their direct objects. Each pair
is then evaluated by checking whether it appears in a blacklist of malicious verb/noun pairs.
1.1 My Contribution
My contributions to this research involved systematically evaluating the performance of [3]
when generalized to the detection of phishing emails, and improving the algorithm to gain better
performance at identifying phishing email attacks. I wrote a program that gathered and parsed
438 malicious emails and 500 legitimate emails to evaluate [3]. Before improvement, the
precision of [3] was 100% and its recall was 12%. After improving and adapting its blacklist, the new
program’s precision was 100% and its recall was 38%. Then, by combining this program with a link
analysis program, Netcraft [14], the new program’s precision was 99% and its recall increased to 73.5%.
2. Related Work
Different research groups have attempted to prevent phishing attacks through various
approaches. There are broadly two types of anti-phishing schemes. The first type consists of
schemes that are used to detect phishing webpages. The second type consists of schemes that
focus on detecting phishing emails.
2.1 Work related to detecting phishing websites
The most common approaches to detecting phishing websites are webpage content analysis
and link analysis. There are two major approaches to solving this problem. The first approach is
blacklist-based anti-phishing techniques such as the Google Safe Browsing API. The Google Safe
Browsing API allows users to check whether a URL is in blacklists that are gathered and updated by
Google [4]. The second approach is based on analyzing the content of the website. SpoofGuard,
a web browser plug-in created at Stanford University, uses this approach. It weights
certain malicious components found in the HTML content [5].
2.2 Work related to detecting phishing emails
2.21 Using URL analysis
Many research projects focus on analyzing the structure of the links rather than the content of
the emails to identify phishing emails. For example, Garera et al. present several differences
between a malicious URL and a benign URL [6]. By identifying these distinctions, they
created a logistic regression filter to detect phishing emails.
In addition, LinkGuard uses data provided by the Anti-Phishing Working Group (APWG) to
identify common features of malicious links contained in phishing emails [7].
2.22 Using email content
Many phishing detection algorithms are content-oriented. For example, the research of
Allen Stone resulted in a software tool called EBIDS [8]. It uses NLP technologies to analyze
the plain text in emails. It first reads an email in as a text file, and then feeds the input into
OntoSem through the DEKADE API. The OSIM processes the text, writing its semantic and
referential knowledge of the sentences in ontological terms. This step creates equivalence sets
for natural language strings. For example, instead of matching on “send us your account
information,” EBIDS can match on the concept of “request for personal information.” Then,
based on the results of OntoSem, it runs a string match algorithm to determine whether the
email is a phishing email. The decision is based on a rule set. Four rules were put in the rule
set for testing: account compromise, financial opportunity, account change, and opportunity.
The program matches certain words and phrases within the email content, and then computes
the similarity between the email content and each of the four rules. If the similarity to a rule
passes a threshold, the email is considered to belong to that rule, and the program decides that
the email is malicious. If the email does not match any of these rules, it is classified as
legitimate.
Another content-oriented study also aimed at detecting phishing emails through an NLP
approach. The work of Aggarwal et al. used content analysis to find four parameters that can
decide whether an email is malicious or not [9]. These four parameters are: absence of
names, mention of money, reply-inducing sentences, and sense of urgency. The algorithm
detects mention of money by keeping a set of names and symbols of all currencies in the
world and their common variants, and then checking whether the email contains any of them.
To detect the presence of a reply-inducing sentence, it builds a set of words and phrases
that ask the user to reply to the email, such as contact, get back, and reply. It also uses
WordNet to include the hyponyms and synonyms of these reply-inducing words in the set. To
find out whether any of these words are mentioned in the email content, it first uses the Stanford
CoreNLP API to tag the words in the email with POS tags, and then stems the words of the
tagged file to their base form. In addition, it detects a sense of urgency by checking
whether the email contains words such as now, instantly, or immediately. If a sentence
contains both reply-inducing words and words that convey urgency, the email is highly
likely to be malicious. After detecting these four parameters, it uses a
formula to combine the information it has gathered and give a final score.
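The combined reply-inducing and urgency check described above can be illustrated with a minimal Python sketch. The word lists below are small hypothetical samples, not the sets built in [9], and plain lowercasing with a naive sentence splitter stands in for the POS tagging, stemming, and WordNet expansion; the scoring formula of [9] is not reproduced.

```python
import re

# Hypothetical sample word lists; [9] builds much larger sets using WordNet.
REPLY_WORDS = {"reply", "contact", "respond"}
URGENCY_WORDS = {"now", "instantly", "immediately"}


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not CoreNLP)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def flag_sentences(text):
    """Return sentences that contain both a reply word and an urgency word."""
    flagged = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & REPLY_WORDS and words & URGENCY_WORDS:
            flagged.append(sentence)
    return flagged
```

For instance, a sentence such as "Please reply immediately with your details" would be flagged, while "Contact me later" would not, because it lacks an urgency word.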
2.23 Using Both URL and email content
Some approaches utilize all the information contained in an email, such as the header, the
links, and the text content. Verma et al. implemented a software tool called PhishNet-NLP that
utilizes all the information contained in an email to detect phishing emails [10]. PhishNet-NLP
analyzes the text of an email using NLP techniques to give it a Textscore and a Contextscore.
PhishNet-NLP computes the Textscore using the following NLP techniques: lexical analysis,
part-of-speech tagging, named entity recognition, normalization of words to lower case,
stemming, and stop word removal. First, it uses lexical analysis to split the email into
sentences, and each sentence into words. Then, it normalizes the words to lower case,
removes the stop words, and stems the words. Next, PhishNet-NLP uses named entity
recognition to find out whether the email mentions at least one institution in the body. If
no institution is mentioned, the email receives a Textscore of 0, which stands for
legitimate. PhishNet-NLP also defines a set of special verbs, SV, which contains
verbs that are usually used by malicious emails to instruct people to perform certain actions,
together with hyponyms and synonyms of these verbs. PhishNet-NLP gets the hyponyms and
synonyms of verbs from WordNet. PhishNet-NLP then gives each of these verbs a score
using a formula that takes four factors into account: x, l, a, and L. The values of parameters
x and a depend on whether a certain combination of words is found in a sentence. The
parameter l depends on the number of links in the email. The parameter L is the level of
the verb, which is one more than the least number of hyponymy links followed to reach the
verb from a synset. After the scores for each verb of the set SV are computed, the Textscore
of an email is equal to the maximum of all the verb scores.
Besides the Textscore, PhishNet-NLP also gives a Contextscore for each email. It treats the email
as a vector of TF-IDF values in the semantic space after applying stop word elimination and
stemming. It uses the Contextscore to check the similarity between this email and other
emails that the user has received before. The Contextscore is 1 when it finds a similar email in the
inbox.
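The idea of comparing TF-IDF vectors can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cosine similarity measure, the `idf = log(n / df) + 1` weighting, the 0.8 threshold, and the function names are all hypothetical choices, not details of PhishNet-NLP [10], and stop word removal and stemming are omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one {term: weight} TF-IDF dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * (math.log(n / df[t]) + 1.0) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_score(new_email, inbox, threshold=0.8):
    """Return 1 if some inbox email is at least `threshold` similar, else 0."""
    docs = [text.lower().split() for text in inbox + [new_email]]
    vectors = tfidf_vectors(docs)
    target = vectors[-1]
    return int(any(cosine(target, v) >= threshold for v in vectors[:-1]))
```

With this sketch, an email identical to one already in the inbox scores 1, and an email that shares no terms with any inbox email scores 0.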
After computing both the Textscore and the Contextscore, PhishNet-NLP combines these two
scores to get a Final-text-score. It then combines the Final-text-score, headerScore, and
linkScore to decide whether an email is malicious or legitimate.
3. Introduction to the Algorithm of SEParser
This research is based on the algorithm created by [3]. [3] is an approach to detecting social
engineering attacks by using NLP techniques to parse conversations. To detect social engineering
attacks, SEParser analyzes dialogs through semantic analysis of each sentence. Figure 1
presents the structure of the detection process of SEParser.
Figure 1 Outline of the detection process
During most social engineering attacks, the attacker must express one of the following
types of sentence in order to lure sensitive information from users:
1. a question related to sensitive information
2. a command to perform a dangerous operation
SEParser analyzes the text conversation between the attacker and the victim, and checks the
appropriateness of the topic. If a sentence requests personal information or demands a dangerous
action, the topic of that sentence is not appropriate. SEParser uses NLP techniques to find a pattern
in the sentence’s structure. Using this pattern, SEParser parses all the sentences in the
conversation and detects the questions and commands among them. As soon as SEParser finds the
questions and commands, it extracts their main topics. The main topic of a sentence
is extracted by analyzing the parse tree of the sentence. For example, Figure 2 presents a security
policy statement.
Networking equipment must not be manipulated.
Figure 2 Security policy statement
By manually analyzing this sentence, we can identify the main topic and its related action
shown in Table 1.
For this statement, the topic of the sentence is “networking equipment”, and its related
action is “manipulate”. In order to identify this topic and action pair, SEParser first gets a parse
tree of the sentence by using the Stanford Parser [11]. The Stanford Parser is a context-free parser
that is well developed and used by various NLP research projects. The parse tree returned by the
Stanford Parser gives every word in the sentence a tag. The Stanford Parser can be set to use
various taggers; the tags shown in the figure are from the Penn Treebank Tagset [12]. Figure 3
shows the parse tree of the sentence.

Figure 3 Parse tree of the first sentence
By analyzing the sentence structure, SEParser identifies the sentence as neither a command
nor a question. Therefore, it does not continue to look for the topic of the statement. However, if
we use another example such as “Reset the router”, SEParser first gets the parse tree from the
Stanford Parser, shown in Figure 4.
Figure 4 Parse tree of "reset router"
SEParser identifies this sentence as a command by analyzing its parse tree. It
recognizes the pattern of a direct command’s sentence structure. SEParser then finds the
topic of the sentence, router, and its related verb, reset. Therefore, the verb/noun pair of the
sentence is (reset, router). SEParser detects this verb/noun pair by finding the main verbs of the
sentence and their direct objects. The verb “reset” is a typical command for malicious actions, so
it would be found in the blacklist. Moreover, the noun related to this verb is “router”, which is an
important type of networking equipment. Therefore, this verb/noun pair would be found in the
blacklist of verb/noun pairs used by SEParser. SEParser would detect this sentence as malicious,
and would alert victims before actual damage happens.
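The two examples above can be approximated with a short Python sketch that reads a bracketed, Penn Treebank-style parse string and applies one simplified command pattern. The parse strings in the usage note are hand-written assumptions about what a parser might return, the single pattern (a sentence that starts with a verb phrase headed by a base-form verb) is a simplification rather than the actual pattern set of SEParser, and questions are not handled.

```python
def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def command_pair(tree):
    """Return (verb, noun) if the tree looks like a direct command, else None."""
    label, children = tree
    if label == "ROOT":
        label, children = children[0]
    if label != "S" or not children:
        return None
    first = children[0]
    # Simplified pattern: the sentence starts with a VP headed by a base-form
    # verb (VB), with no subject noun phrase in front of it.
    if first[0] != "VP" or not first[1] or first[1][0][0] != "VB":
        return None
    verb = first[1][0][1][0].lower()
    for child in first[1][1:]:
        if child[0] == "NP":          # treat the first NP as the direct object
            nouns = [c[1][0].lower() for c in child[1] if c[0].startswith("NN")]
            if nouns:
                return (verb, nouns[-1])
    return None
```

Applied to the assumed parse string `(ROOT (S (VP (VB Reset) (NP (DT the) (NN router)))))`, `command_pair` yields the pair (reset, router). For a statement whose tree starts with a subject noun phrase, such as the policy statement in Figure 2, it returns `None`.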

3.1 SEParser Algorithm
Figure 5 presents the process of question/command detection and topic extraction.
Figure 5 Question/command detection and Topic detection algorithm
As shown in Figure 5, SEParser first gets the s best parse trees from the Stanford Parser. Then, it
uses patterns to decide the type of the sentence. If the sentence is a question or a command,
SEParser extracts the verb/noun pairs of the sentence. For each verb/noun pair, SEParser uses
the MatchTopic Algorithm to determine the appropriateness of the pair, as shown in Figure 6.
Figure 6 MatchTopic Algorithm
The MatchTopic Algorithm takes its inputs from the previous algorithm, and then searches the
blacklist to determine whether the input verb/noun pairs are in the blacklist or not. For a
verb/noun pair to be identified as malicious, both the verb and the noun must exactly match an
entry in the blacklist.
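The exact-match lookup described above can be sketched in Python as a set membership test. The blacklist entries and function names here are hypothetical illustrations (the pairs are taken from phrases mentioned in this report); the actual blacklist contents and the MatchTopic pseudocode of Figure 6 are not reproduced.

```python
# Hypothetical blacklist entries, stored as (verb, noun) tuples.
BLACKLIST = {
    ("reset", "router"),
    ("click", "link"),
    ("send", "information"),
}


def match_topic(pairs, blacklist=BLACKLIST):
    """Return the verb/noun pairs that exactly match a blacklist entry."""
    return [pair for pair in pairs if pair in blacklist]


def is_malicious(pairs, blacklist=BLACKLIST):
    """A sentence is flagged if at least one of its pairs is blacklisted."""
    return bool(match_topic(pairs, blacklist))
```

Because the whole tuple is compared, a pair such as (reset, password) is not flagged by this sketch even though its verb appears in a blacklisted pair.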
4. Applying and Improving the Algorithm for the Detection of Phishing Emails
In this research, we applied algorithm [3] to detect phishing emails, and improved this
algorithm in order to gain better performance. We made three major improvements:
1. Optimizing the performance of the algorithm.
2. Adapting and extending the verb/noun pair blacklist to phishing emails.
3. Combining algorithm [3] with a link analysis tool, Netcraft [14].
In order to apply algorithm [3] to detect phishing emails, we first systematically tested the
performance of SEParser on detecting phishing emails without any improvement. We used 500
legitimate emails and 438 malicious emails. After testing, the first improvement we made was
optimizing SEParser so that it would run faster on a larger text corpus. By changing a nested for-loop
and deleting some unused functions, we managed to triple the running speed of SEParser.
The second improvement we made was extending and adjusting the verb/noun pair blacklist so
that it would be specialized to the context of phishing emails instead of conversations. Phrases such
as “click on the link” or “send this information back” appear often in phishing emails, but are
less common in conversations. Therefore, we first experimented with optimizing the blacklist to
better identify common phishing phrases. We used information extraction techniques to find the
most common verb/noun pairs used in malicious emails and in legitimate emails. Then, we
manually went through the results and added the pairs that differed most between the two sets
to the verb/noun pair blacklist.
The third improvement was combining SEParser with a link analysis tool, Netcraft [14]. After the
second improvement, we realized that the majority of the emails can be detected by link analysis.
Therefore, we decided that for detecting phishing emails, it was practical to combine our algorithm
with a link analysis tool. After carefully examining various link analysis tools, we decided to use
Netcraft [14]. Netcraft is an anti-phishing toolbar that uses link analysis to detect phishing URLs.
At least three research projects have tested Netcraft and reported it as one of the most accurate
anti-phishing link analysis programs. Netcraft uses several ways to detect phishing URLs. The first
is blacklists of phishing websites reported by the community, which Netcraft constantly
updates. Netcraft also analyzes the potential phishing website’s domain, IP address, hosting
company, hosting country, latest performance, hosting history, and site technology. By
analyzing all this information comprehensively, Netcraft gives each link a risk rating on a scale of 0
to 10. A lower risk rating means that the link is less likely to be malicious. We created a tool that
parses the HTML of phishing emails and extracts the links from them. We used a Python
library called BeautifulSoup [15] to parse the HTML content of the email and extract the links
embedded in it. Some phishing emails trick users into clicking on malicious links by showing
a non-malicious link in the plain text, and embedding the actual malicious link in the hyperlink
connected to the text. By parsing the HTML instead of the plain text, we were able to extract the
actual links from the emails. Once we had all the links, we created a program that takes the links
and sends an HTTP request to Netcraft’s website. The program parses the result it gets from the
website to find the risk rating of the link. For our program, we chose 5 as the threshold. If
the risk rating of a link is higher than 5, the program identifies the link as malicious.
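The difference between the displayed text and the actual hyperlink target can be made concrete with a small sketch. To stay self-contained, it uses Python's standard-library `html.parser` in place of the BeautifulSoup library that our tool used, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, visible text) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None      # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (actual target, displayed text) pairs."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

For an anchor whose displayed text is `http://bank.example` but whose `href` is `http://evil.example/login`, the first element of the returned pair is the real target that would be submitted for a risk rating.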
After feeding the link into Netcraft and parsing the results, the algorithm then determines
whether it is necessary to run the SEParser. If the link is not identified as malicious by Netcraft,
then the algorithm runs the improved SEParser to analyze the content.
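The decision order described above can be summarized in a short Python sketch. The two callables, `risk_rating` and `separser_flags`, are hypothetical stand-ins for the Netcraft lookup and the improved SEParser; only the threshold of 5 and the order of the two checks come from the text.

```python
RISK_THRESHOLD = 5  # a link rated higher than 5 is treated as malicious


def classify_email(links, body, risk_rating, separser_flags):
    """Classify an email as 'malicious' or 'legitimate'.

    risk_rating(link) -> number in 0..10 (hypothetical Netcraft lookup)
    separser_flags(body) -> bool (hypothetical improved-SEParser check)
    """
    # Step 1: link analysis. Any high-risk link settles the decision.
    if any(risk_rating(link) > RISK_THRESHOLD for link in links):
        return "malicious"
    # Step 2: content analysis runs only when no link was flagged.
    if separser_flags(body):
        return "malicious"
    return "legitimate"
```

Passing the two checks in as functions keeps the sketch independent of how the risk rating is actually retrieved, and makes the order of the two stages easy to test with stubs.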
5. Experimental Results

5.1 Databases used to Test the Detection of Phishing Emails
In order to systematically test the performance of SEParser on the detection of phishing emails,
we collected two different databases: a legitimate email corpus and a phishing email
corpus. For the legitimate email corpus, we chose the Enron email corpus. For the
malicious email corpus, we chose the phishing email corpus collected and shared by
J. Nazario [13]. This phishing email corpus is also used by other research projects such as
PhishNet-NLP [10] and PhishCatch. For test purposes, we picked only 500 legitimate emails and
438 phishing emails.
We created a script called read_mbox.py to parse these emails. Most of the emails from the
Enron email corpus are in Mbox format. The emails collected by Nazario are also in Mbox format,
and their contents are sometimes HTML and sometimes plain text. Therefore, the script first checks
the format of the email and parses the content accordingly.
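The format check can be sketched with Python's standard `email` package. This is not the read_mbox.py script itself: the function name is hypothetical, the input is a single raw message string rather than an Mbox file, and preferring the HTML part when both parts exist is an assumption made so that embedded links remain available.

```python
import email
from email import policy


def extract_body(raw_message):
    """Return (kind, text), where kind is 'html' or 'plain'."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    plain, html = None, None
    # walk() visits the message itself and, for multipart mail, every part.
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "text/html" and html is None:
            html = part.get_content()
        elif ctype == "text/plain" and plain is None:
            plain = part.get_content()
    if html is not None:
        return ("html", html)
    return ("plain", plain or "")
```

Messages read from an Mbox file, for example with the standard-library `mailbox` module, could be converted to strings and passed to a function like this one.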
5.2 Results Before Combining with Netcraft
5.21 Optimization to Improve Run Time
For our first test, we directly applied SEParser to our databases without major changes. We
found that we needed to optimize the performance of the program in order for it to parse
around 1000 emails in a reasonable amount of time. Before the changes, we ran the program on
a laptop with an Intel Core i7 processor and a 256GB solid state drive, running Windows 10.
This computer was able to process 438 emails in around 3 hours. We realized the running time
was too long for us to effectively test the algorithm. Therefore, we made several changes to the
code so that it ran about 3 times faster than before. After the changes, we ran the program on the
same computer, and this time it took only 55 minutes for the program to parse the 438 emails.
5.22 Adjusting the verb/noun pair blacklist
For the first test, SEParser detected 54 malicious emails out of 438 phishing emails and 0 out
of 500 legitimate emails. Therefore, its true positive rate is 12 percent and its false positive rate
is 0 percent. Based on the formulas shown in Equations (1) and (2), the precision of the algorithm
is 100 percent, and the recall is 12 percent. We were satisfied with the false positive rate, but we
decided to extend the verb/noun pair blacklist in order to increase the true positive rate.
Equation (1): $\mathrm{Precision} = \frac{tp}{tp + fp}$
Equation (2): $\mathrm{Recall} = \frac{tp}{tp + fn}$
The equations use tp as the number of true positives, fp as the number of false positives, and
fn as the number of false negatives.
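As a worked check of Equations (1) and (2), the counts from the first test can be plugged in directly. The helper names below are illustrative; the counts (54 detected out of 438 phishing emails, no false positives) are the ones reported above.

```python
def precision(tp, fp):
    """Equation (1): fraction of flagged emails that are really phishing."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (2): fraction of phishing emails that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


tp, fp = 54, 0      # first test: 54 phishing emails flagged, no false alarms
fn = 438 - tp       # phishing emails that were missed
print(f"precision = {precision(tp, fp):.0%}, recall = {recall(tp, fn):.0%}")
# prints: precision = 100%, recall = 12%
```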
For our second test, we mainly focused on finding and adding appropriate verb/noun pairs
that suited the phishing email context. After adding the appropriate verb/noun pairs, we
managed to increase the true positive rate from 12 percent to 38 percent, while the false positive
rate remained 0 percent. After the change, the precision of the algorithm is still 100 percent, and
the recall is 38 percent.
5.3 Results After combining with Netcraft
By using Netcraft to analyze the links inside phishing emails, we detected 255 emails out of
438 malicious emails. Then we ran SEParser on the remaining 183 malicious emails, and it
detected 73. By combining with Netcraft, we detected 322 phishing emails out of 438. Therefore, the
true positive rate of our program is 73.5 percent. Then we ran our program on the legitimate email
corpus, and it flagged 2 out of 500 legitimate emails. Therefore, our false positive rate is 0.4 percent.
After combining with Netcraft, the precision of the algorithm is 99 percent, and the recall is 73.5
percent.
Table 3 below presents the performance comparison for the different
improvements. Compared with the 12% recall of the unmodified SEParser, the recall of the
improved SEParser (38%) is around 200% higher. For all of these tests, we used 438 malicious
emails and 500 legitimate emails.
                  SEParser    Netcraft    SEParser+Netcraft
True Positive     166         255         322
False Positive    0           2           2
Precision         100%        99%         99%
Recall            38%         58%         73.5%
Table 3 Comparison between different results
6. Conclusion
In this paper, we have tested and improved SEParser for detecting phishing emails. By testing
and adjusting SEParser, we have shown that SEParser’s algorithm can be generalized from
analyzing dialogs of social engineering attacks to detecting phishing emails by analyzing the parse
tree and grammatical structure of sentences. By combining the improved SEParser with link
analysis software, it is able to detect most of the malicious emails with a false positive rate of less
than one percent. To further improve performance in future work, we believe that a common
pattern can be found in most phishing emails. By analyzing the parse trees of sentences, the
program would be able to identify this pattern in the content of the email. By semantically
analyzing the content, the future program should be able to detect more phishing emails without
causing more false positives.
7. References
1. Kerstein, Paul (July 19, 2005). "How Can We Stop Phishing and Pharming Scams?". CSO.
Archived from the original on March 24, 2008.
2. "20% Indians are victims of Online phishing attacks: Microsoft". IANS. news.biharprabha.com.
Retrieved February 11, 2014.
3. Y. Sawa, H. R. Bhakta, I. G. Harris, "Detection of Social Engineering Attacks Through Natural
Language Processing of Conversations", IEEE Conference on Semantic Computing, February 2016
4. Google, “Google safe browsing API,” http://code.google.com/apis/safebrowsing/, accessed Oct
2011
5. N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell, “Client-side defense against web-based
identity theft,” in NDSS. The Internet Society, 2004.
6. Garera, S., Provos, N., Chew, M., Rubin, A.: A framework for detection and measurement of
phishing attacks. In: Proc. 2007 ACM Workshop on Recurring Malcode, pp. 1–8 (2007)
7. Chen, J., Guo, C.: Online detection and prevention of phishing attacks. In: First Int’l Conf. on
Communications and Networking in China, ChinaCom 2006, pp. 1–7. IEEE (2006)
8. A. Stone, "Natural-Language Processing for Intrusion Detection," in Computer, vol. 40, no. 12,
pp. 103-105, Dec. 2007. doi: 10.1109/MC.2007.437
9. Shivam Aggarwal, Vishal Kumar, and S. D. Sudarsan. Identification and detection of phishing
emails using natural language processing techniques. In Proceedings of the 7th International
Conference on Security of Information and Networks, SIN ’14, pages 217:217–217:222. ACM,
2014.
10. Verma et al: Detecting Phishing Emails the Natural Language Way. In: ESORICS 2012, LNCS
7459, pp. 824–841, 2012.
11. Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the
41st Annual Meeting on Association for Computational Linguistics – Volume 1, 2003.
12. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated
corpus of English: The penn treebank. Comput. Linguist., 19(2), June 1993.
13. Nazario, J.: The online phishing corpus (2016)
14. 3Sharp, 3Sharp Study finds Internet Explorer 7 Edges Out Netcraft As Most Accurate for Anti-
Phishing Protection. 2006. http://www.3sharp.com/projects/antiphishing/
15. BeautifulSoup Python Library: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (2016)
