Learning to Detect Phishing Emails

2. Norman Sadeh

3. Anthony Tomasic

6. Phishing through Emails

7. Phishing Problem – Hard.

8. A machine learning approach to tackle this form of online identity theft. Bhavin Madhani - UC Irvine - 2009 Image courtesy: http://images.google.com - 2005 HowStuffWorks

9. Popular Targets: March 2009. Table courtesy: http://www.phishtank.com/stats/2009/03/

11. SpoofGuard

12. NetCraft

13. Email Filtering

14. SpamAssassin

15. Spamato

17. phishing emails / ham (good) emails

18. Feature Set

19. Features as used in email classification

20. Features as used in webpage classification

24. ‘playpal.com’ or ‘paypal-update.com’

25. These domains often have a limited life

26. WHOIS query

27. If the domain's registration date is within 60 days of the date the email was sent, the domain is considered “fresh”. This is a binary feature.
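The 60-day check above can be sketched as follows. This is a minimal illustration, not the presenter's script: it assumes the registration date has already been obtained from a WHOIS lookup, and the function name `is_fresh_domain` is hypothetical.

```python
from datetime import date, timedelta

FRESH_WINDOW = timedelta(days=60)  # threshold stated on the slide


def is_fresh_domain(registered_on: date, email_sent_on: date) -> bool:
    """Binary feature: True if registration is within 60 days of the send date.

    Assumption: "within 60 days" is read as an absolute difference, so a
    registration date slightly after the send date also counts as fresh.
    """
    return abs(email_sent_on - registered_on) <= FRESH_WINDOW


# A domain registered three weeks before the email was sent is "fresh".
recent = is_fresh_domain(date(2009, 3, 1), date(2009, 3, 20))
# A domain registered years earlier is not.
old = is_fresh_domain(date(2005, 1, 1), date(2009, 3, 20))
```

When the WHOIS lookup fails, a caller would need some default value for the feature; that choice is left open here.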

29. This is a case of a link that says paypal.com but actually links to badsite.com.

30. Such a link looks like <a href="badsite.com"> paypal.com</a>. This is a binary feature.
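One way to test for such a link is sketched below using only the Python standard library. The rule for deciding that the visible text "looks like" a URL (it contains a dot and no spaces) is an assumption made for illustration, and the names `LinkCollector`, `host_of`, and `has_nonmatching_url` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Lower-cased host of a URL; tolerates a missing scheme such as 'badsite.com'."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
    return (urlparse(url).hostname or "").lower()


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def has_nonmatching_url(html: str) -> bool:
    """Binary feature: some link text names one host while the href points to another."""
    parser = LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        # Assumed heuristic: URL-like text has a dot and no whitespace.
        looks_like_url = "." in text and " " not in text
        if looks_like_url and host_of(text) != host_of(href):
            return True
    return False
```

The comparison is on exact hosts, so `www.paypal.com` and `paypal.com` would count as different; a stricter version would compare registered domains instead.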

32. “Click here to restore your account access”

33. A link with the text “link”, “click”, or “here” that points to a domain other than the email's “modal domain” (the domain most frequently linked to in the email).

34. This is a binary feature.
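A rough sketch of this feature follows. It assumes the "modal domain" is the host that the email links to most often, and that links have already been extracted as (href, visible text) pairs; the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

TRIGGER_WORDS = ("link", "click", "here")  # words listed on the slide


def host_of(url: str) -> str:
    """Lower-cased host of a URL; tolerates a missing scheme."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
    return (urlparse(url).hostname or "").lower()


def here_link_to_nonmodal_domain(links) -> bool:
    """Binary feature over a list of (href, visible_text) pairs."""
    hosts = [host_of(href) for href, _ in links]
    if not hosts:
        return False
    # Assumption: the modal domain is the most frequently linked host.
    modal_host, _ = Counter(hosts).most_common(1)[0]
    for (_, text), host in zip(links, hosts):
        lowered = text.lower()
        # Simple substring match; a stricter version would match whole words.
        if host != modal_host and any(word in lowered for word in TRIGGER_WORDS):
            return True
    return False
```

For example, an email whose other links all point to the bank's own host but whose "Click here to restore your account access" link points elsewhere would set the feature.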

36. Emails are sent as plain text, as HTML, or as a combination of the two (the multipart/alternative format).

37. To launch an attack without using HTML is difficult

38. This is a binary feature.
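A minimal sketch of this check, using Python's standard `email` package, is shown below. It assumes an email counts as HTML when any MIME part has the content type text/html, which also covers multipart/alternative messages; `is_html_email` and the sample message are hypothetical.

```python
from email import message_from_string


def is_html_email(raw_message: str) -> bool:
    """Binary feature: True if any MIME part of the message is text/html."""
    msg = message_from_string(raw_message)
    return any(part.get_content_type() == "text/html" for part in msg.walk())


# Illustrative multipart/alternative message with a plain-text and an HTML part.
SAMPLE_MULTIPART = (
    "Subject: Account notice\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "Please review your account.\n"
    "--b\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Please review your account.</p>\n"
    "--b--\n"
)

SAMPLE_PLAIN = "Subject: hi\nContent-Type: text/plain\n\nhello\n"
```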

40. The number of links present in an email.

41. This is a continuous feature.

42. E.g., a Bank of America statement.
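Counting links can be sketched with the standard HTML parser. The sketch assumes a "link" is an <a> element with a non-empty href attribute; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> elements that carry a non-empty href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.count += 1


def number_of_links(html: str) -> int:
    """Continuous feature: how many links the HTML body contains."""
    counter = LinkCounter()
    counter.feed(html)
    return counter.count
```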

44. Take the domain names previously extracted from all of the links and count the number of distinct domains.

45. Look at the “main” part of a domain

46. https://www.cs.university.edu/

47. http://www.company.co.jp/

48. This is a continuous feature.
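The "main part" idea can be sketched as below. The rule used here is a rough heuristic chosen for illustration (keep three labels when a two-letter country code follows a generic second-level label such as "co", otherwise keep two); the label set is incomplete, and a production version would more likely consult the Public Suffix List. All names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical, incomplete set of second-level labels used under country-code TLDs.
SECOND_LEVEL = {"co", "com", "org", "net", "ac", "gov", "edu"}


def main_domain(url: str) -> str:
    """Return the 'main' part of the host, e.g. university.edu or company.co.jp."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 3 and len(labels[-1]) == 2 and labels[-2] in SECOND_LEVEL:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def number_of_domains(urls) -> int:
    """Continuous feature: count of distinct main domains among the links."""
    return len({main_domain(u) for u in urls} - {""})
```

With this heuristic, `https://www.cs.university.edu/` maps to `university.edu` and `http://www.company.co.jp/` maps to `company.co.jp`. IP-address hosts are not handled.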

51. This feature is the maximum number of dots ('.') contained in any of the links present in the email. It is a continuous feature.
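As a small sketch, assuming the link URLs have already been extracted into a list (the function name `max_dots` is hypothetical):

```python
def max_dots(urls) -> int:
    """Continuous feature: the largest number of '.' characters in any link URL."""
    return max((url.count(".") for url in urls), default=0)


links = [
    "http://www.paypal.com/",
    "http://www.my-bank.update.data.example.com/login",
]
most = max_dots(links)  # the second URL has five dots
```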

53. Attackers can use JavaScript to hide information from the user, and potentially launch sophisticated attacks.

54. An email is flagged with the “contains javascript” feature if the string “javascript” appears in the email, regardless of whether it is actually in a <script> or <a> tag

55. This is a binary feature.
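The string test described above is simple enough to sketch in a few lines. Matching case-insensitively is an assumption here, since the slide does not say whether case matters; the function name is hypothetical.

```python
def contains_javascript(email_body: str) -> bool:
    """Binary feature: True if the string "javascript" appears anywhere in the email.

    No HTML parsing is done, so the match does not depend on the string
    being inside a <script> or <a> tag.
    """
    return "javascript" in email_body.lower()
```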

57. This is a binary feature, using the trained version of SpamAssassin with the default rule weights and threshold.

58. “Ham” or “Spam”

59. This is a binary feature.

61. Other Features include:

62. Site in browser history

63. Redirected site

64. tf-idf

66. Run a set of scripts to extract all the features.

67. Train and test a classifier using 10-fold cross validation.

68. Random forest as a classifier.

69. Random forests create a number of decision trees. Each tree is built by randomly choosing an attribute to split on at each level, and is then pruned.
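The 10-fold cross-validation step can be sketched in plain Python as below. This shows only how the data is partitioned into folds; it does not implement the random forest itself, and the function name and the fixed seed are assumptions made for illustration.

```python
import random


def k_fold_splits(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every item appears in exactly one test fold; the other k - 1 folds
    form the training set for that round.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatable splits
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

In each round a classifier would be trained on the `train` indices and scored on the `test` indices, and the per-fold results would then be averaged.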

71. Two publicly available datasets used.

72. The ham corpora from the SpamAssassin project (both the 2002 and 2003 ham collections, easy and hard, for a total of approximately 6950 non-phishing non-spam emails)

73. The publicly available phishingcorpus (approximately 860 email messages).

75. For comparison against PILFER, we classify the exact same dataset using SpamAssassin version 3.1.0, using the default thresholds and rules.

76. “untrained” SpamAssassin

77. “trained” SpamAssassin

79. The age of the dataset

80. Phishing websites are short-lived, often lasting only on the order of 48 hours

81. Domains are no longer live at the time of our testing, resulting in missing information

82. The disappearance of domain names, combined with difficulty in parsing results from a large number of WHOIS servers

84. Misclassifying a phishing email may have a different impact than misclassifying a good email.

85. False positive rate (fp): the proportion of ham emails classified as phishing emails.

86. False negative rate (fn): the proportion of phishing emails classified as ham.
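Both rates can be computed directly from true and predicted labels, as in this sketch. The label convention (True for phishing, False for ham) and the function name are assumptions made for illustration.

```python
def error_rates(y_true, y_pred):
    """Return (fp, fn) where True means phishing and False means ham.

    fp: share of ham emails that were classified as phishing.
    fn: share of phishing emails that were classified as ham.
    """
    ham_preds = [pred for truth, pred in zip(y_true, y_pred) if not truth]
    phish_preds = [pred for truth, pred in zip(y_true, y_pred) if truth]
    fp = sum(1 for pred in ham_preds if pred) / len(ham_preds) if ham_preds else 0.0
    fn = sum(1 for pred in phish_preds if not pred) / len(phish_preds) if phish_preds else 0.0
    return fp, fn


# Four ham emails (one flagged by mistake) and two phishing emails (one missed).
truth = [False, False, False, False, True, True]
preds = [False, True, False, False, True, False]
fp, fn = error_rates(truth, preds)  # fp = 0.25, fn = 0.5
```

Reporting the two rates separately matters because, as noted above, the two kinds of mistake have different costs.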

88. Percentage of emails matching the binary features

89. Mean and standard deviation of the continuous features, per class

91. Gone Phishing? Protect Yourself - Stop · Think · Click THANK YOU

92. Anti-Phishing Phil http://cups.cs.cmu.edu/antiphishing_phil/ http://cups.cs.cmu.edu/antiphishing_phil/new/index.html
