Using Online Activity as Digital Fingerprints to Create a Better Spear Phisher
Joaquim Espinhara and Ulisses Albuquerque
JEspinhara@trustwave.com UAlbuquerque@trustwave.com
• Introduction
• Motivation
• Background
• HowStuffWorks
• µphisher
• Demo
• Future Work
• Conclusion
Agenda
• Joaquim Espinhara
– Security Consultant at Trustwave SpiderLabs
• Ulisses Albuquerque
– Coder for offense & defense… as long as it’s fun!
– Lab Manager at Trustwave SpiderLabs
About us
INTRODUCTION
OUR MOTIVATION
• Why?
• Tools available
Our Motivation
BACKGROUND
• Social Networks
• Social Engineering
• Data Pre-Processing
• Natural Language Processing - NLP
Background
• Social Networks
– Facebook, Twitter, LinkedIn, others
Background
• Social Networks
– Communication channel for keeping in touch with
someone (Facebook, Twitter)
– Media sharing (Instagram)
– Specialized networks (Foursquare, GetGlue, TripIt)
Background
• Social Engineering
– Phishing
Background
http://www.d00med.net/uploads/0d832c77559a2070a766f899e7eg783.png
• Data Pre-Processing
– What is it?
– How do we use it?
Background
• Data Pre-Processing Flow
– Raw data set: "Had lunch with @urma and @jespinhara today #tgif #lunch"
– Data cleaning: "Had lunch with @urma and @jespinhara today"
– Data integration: "Had lunch with @urma and @jespinhara today"
– Data normalization: "Had lunch with @urma and @jespinhara today (2013-06-05)"
Background
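The three stages above can be sketched in Ruby (the language µphisher itself is written in). The Status struct, the IDENTITY table, and the helper names are illustrative assumptions, not µphisher's actual code:

```ruby
# Illustrative pre-processing pipeline: cleaning, integration, normalization.
Status = Struct.new(:network, :author, :text, :posted_at)

# Data cleaning: drop content irrelevant to the analysis (here, hashtags).
def clean(s)
  s.text = s.text.gsub(/\s*#\w+/, '').strip
  s
end

# Data integration: correlate per-network handles into one canonical identity.
IDENTITY = {
  %w[twitter @jespinhara]  => 'jespinhara',
  %w[instagram jespinhara] => 'jespinhara'
}.freeze

def integrate(s)
  s.author = IDENTITY.fetch([s.network, s.author], s.author)
  s
end

# Data normalization: represent all timestamps as ISO-8601 dates.
def normalize(s)
  s.posted_at = s.posted_at.strftime('%Y-%m-%d')
  s
end

raw = Status.new('twitter', '@jespinhara',
                 'Had lunch with @urma and @jespinhara today #tgif #lunch',
                 Time.new(2013, 6, 5))
record = normalize(integrate(clean(raw)))
puts record.text       # "Had lunch with @urma and @jespinhara today"
puts record.posted_at  # "2013-06-05"
```

Each stage takes and returns the same record shape, so stages can be reordered or extended (e.g., a retweet filter) without touching the others.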
• Natural Language Processing – NLP
– What is it?
– How do we use it?
– Text analysis
Background
HOWSTUFFWORKS
• Identifying the subject to profile
• Collecting social network data
• Analyzing and building the profile
Our Approach
• The Unknown Subject (Unsub)
– Joaquim Espinhara
– @jespinhara (Twitter)
– joaquim.espinhara (Facebook)
– uid=12345 (LinkedIn)
Our Approach
• Data Collection
– Social Network IDs
– Official APIs
– Web Scraping
– OAuth
Our Approach
• Data Collection – Twitter
– Application ID (µphisher)
– User ID (@jespinhara)
– Twitter → @urma, @effffn, @SpiderLabs
Our Approach
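Collecting a full timeline from Twitter's REST API means paging through results with max_id until the roughly 3,200-tweet history limit is reached. Below is a minimal Ruby sketch of that paging loop; collect_timeline and the stubbed fetch_page (standing in for the authenticated API call) are hypothetical names, not µphisher's code:

```ruby
# Sketch of max_id-based timeline paging (the scheme Twitter's REST API
# v1.1 uses). fetch_page stands in for the authenticated API call; the
# real endpoint stops returning results around 3,200 tweets back.
def collect_timeline(fetch_page, cap: 3200, page_size: 200)
  tweets, max_id = [], nil
  while tweets.size < cap
    page = fetch_page.call(count: page_size, max_id: max_id)
    break if page.empty?
    tweets.concat(page)
    max_id = page.last[:id] - 1 # resume just below the oldest tweet seen
  end
  tweets.first(cap)
end

# Stub data source: 450 fake tweets, newest (highest id) first.
store = (1..450).map { |i| { id: 1000 - i, text: "tweet #{1000 - i}" } }
fetch = lambda do |count:, max_id:|
  store.select { |t| max_id.nil? || t[:id] <= max_id }.first(count)
end

timeline = collect_timeline(fetch)
puts timeline.size # 450
```

In practice each page costs one rate-limited API call, which is why µphisher runs collection asynchronously rather than inside the web request.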
µPHISHER
• Reference implementation
• Goals
– Validate potential unsub content
– Assisted textual content input
µphisher
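Both goals reduce to comparing candidate text against simple style metrics built from the unsub's corpus. A minimal Ruby sketch under that assumption; the metrics and overlap_score helpers and the scoring formula are illustrative, not µphisher's actual algorithm:

```ruby
# Illustrative sketch: build per-unsub style metrics (word frequencies,
# average sentence length) and score candidate text against them.
def metrics(texts)
  words     = texts.flat_map { |t| t.downcase.scan(/[a-z']+/) }
  sentences = texts.flat_map { |t| t.split(/[.!?]+/).map(&:strip).reject(&:empty?) }
  { freq: words.tally, avg_len: words.size.fdiv(sentences.size) }
end

# Hypothetical scoring: fraction of the candidate's words the unsub uses.
def overlap_score(profile, candidate)
  cand = candidate.downcase.scan(/[a-z']+/)
  cand.count { |w| profile[:freq].key?(w) }.fdiv(cand.size)
end

profile = metrics(['Had lunch with urma today.', 'Great lunch, great company!'])
puts overlap_score(profile, 'Had a great lunch today')          # 0.8
puts overlap_score(profile, 'Synergize quarterly deliverables') # 0.0
```

A high score suggests the candidate text plausibly matches the unsub's style; a low score flags it for rewriting during assisted input.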
• Web Application
• Twitter only (for now)
• Open Source (GPLv3)
µphisher
• Ruby on Rails
• MongoDB + Mongoid
• DelayedJob
• OAuth
• treat (Stanford CoreNLP)
µphisher
Authentication → Unsub Registration → Data Source Registration →
Data Collection → Work Set Definition → Work Set Analysis → Unsub Profile
DEMO
(FINGERS CROSSED)
LIMITATIONS
• Support for additional data sources
• Machine learning
• More metrics and feedback for assisted input
• Filtering presets
• Adequate handling of quoted content
Future Work
CONCLUSION
THANK YOU!
@urma @jespinhara


Editor's Notes

  • #4 Start – 00:00
  • #5 Time: 02:00, max 05:00. What it does and what it does not do; what we can achieve and what we cannot. Use this as a hook into the motivation.
  • #6 Time: 05:00, max 07:00
  • #7 Time: 20:00, max 25:00
  • #8 Time: 07:00, max 17:00
  • #9 Time: 07:00, max 10:00
  • #10 Some networks are more specialized and give the user little control over the content produced (e.g., Foursquare, Untappd, TripIt). This can be particularly hard when content is published across linked social network accounts (i.e., Foursquare publishing check-ins on Twitter). Time: 11:00, max 15:00
  • #11 Time: 15:00, max 20:00
  • #12 Time: 20:00, max 25:00
  • #13 Time: 25:00, max 30:00. A non-conventional data mining concept, close to social data mining: a way of selecting criteria relevant to building the profile. Data cleaning consists of removing records which are not relevant to a given analysis (e.g., removing all retweets, or all Facebook status updates posted from a mobile device). Data integration consists of correlating identities between different data sources, allowing the µphisher user to perform analysis on records from multiple sources (e.g., user @jespinhara on Twitter is user 'jespinhara' on Instagram). Data normalization consists of applying transformations to data attributes so that records are consistent for comparison and correlation (e.g., all date and time attributes are represented using ISO-8601).
  • #14 Association rule learning consists of identifying relationships between variables (e.g., @jespinhara is commonly referenced alongside @urma when "lunch" is present in the status update). Clustering consists of discovering groups and structures between records which do not share a known common attribute (e.g., all status updates which mention meals during the day). Anomaly detection consists of identifying unusual data records which require further investigation and should most likely be excluded from analysis (e.g., "Hacked by @effffn!!!!111!"). Max 50:00
  • #15 Time: 50:00, max 55:00
  • #17 Time: 55:00, max 60:00
  • #18 Time: 70:00, max 75:00
  • #19 Time: 70:00, max 75:00
  • #20 Time: 70:00, max 75:00
  • #22 Validating textual input against the profile in order to estimate the probability of some text having been written by an unsub (e.g., checking whether a given tweet was indeed authored by a profiled individual); assisting the user in writing content similar to what the unsub would write with respect to word frequency, sentence length, and number of sentences and phrases, among other criteria. Time: 90:00, max 95:00
  • #23 Same notes as #22. Time: 90:00, max 95:00
  • #24 Time: 105:00, max 110:00
  • #25 Time: 110:00, max 115:00. Practical limitations: Twitter limits timeline fetches to around 3,200 tweets; all official APIs are rate-limited, and handling those through asynchronous threads while providing meaningful feedback to the µphisher user can be hard; some metrics require NLP processing, which fails HARD on partial sentences/abbreviations.
  • #26 Time: 120:00, max 140:00
  • #28 Time: 140:00, max 145:00. Machine learning => partial content processing => context-aware autocomplete suggestions.
  • #29 Time: 145:00, max 150:00. Noise is bad; linked social networks are bad; handling quoted content is hard.