Email Sherlock identifies and analyzes clusters in large email datasets. The clusters can aid email-based investigations and may help prevent similar cases by flagging email that contains sensitive or classified information.
I presented this as my final project at Metis Career Day in San Francisco. Please see the link above for the slides and video presentation.
3. Data: Hillary’s Emails
● 30,320 emails in dataset
● 60,000 Meaningful Words
● Unique Acronyms
○ Ex. Hillary Clinton = Rodham, HRC, Madam Secretary
○ Ex. Obama = President, Administration, Barack
○ Ex. White House = WH
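One way the aliases listed above could be handled before vectorizing is to map each one to a single canonical token. The snippet below is a minimal sketch of that idea, not the method used in the project: the `ALIASES` table, the canonical token names, and the `build_normalizer` helper are all hypothetical, and the table contains only the examples from this slide.

```python
import re

# Hypothetical alias table built only from the examples on this slide;
# a real table would have to be curated for the dataset.
ALIASES = {
    "hillary_clinton": ["Hillary Clinton", "Rodham", "HRC", "Madam Secretary"],
    "obama": ["Obama", "President", "Administration", "Barack"],
    "white_house": ["White House", "WH"],
}


def build_normalizer(aliases):
    """Return a function that rewrites every known alias to its canonical token."""
    lookup = {
        alias.lower(): canonical
        for canonical, names in aliases.items()
        for alias in names
    }
    # Try longer aliases first so multi-word names win over shorter overlaps.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )

    def normalize(text):
        return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

    return normalize


normalize = build_normalizer(ALIASES)
print(normalize("HRC met the WH staff"))
```

A mapping this blunt will also rewrite generic uses of words such as "President", so it would need review against the actual emails before being relied on.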
4. Email Pros and Cons
Pros
● Emails may contain crucial information to solve an investigation.
● Unique acronyms may help vectorize emails.
● Emails within a particular dataset have a small number of authors.
Cons
● Emails often contain duplicated text.
● A majority of emails do not contain information that is important or relevant to an investigation.
● Unique acronyms may make searches more difficult.
● Clusters of emails tend to overlap.
5. Unsupervised Model
TF-IDF - vectorizer
LSA - reduce dimension
DBSCAN - cluster
[Pipeline diagram: Raw Data, SQLite, Machine Learning, Analyzed Clusters]
Key Info:
- Orphans (emails outside any cluster) tend to be less important and/or were anonymized.
- Dense clusters may contain more information.
- DBSCAN: density-based spatial clustering of applications with noise
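The vectorize-then-cluster idea on this slide can be illustrated with a small standard-library sketch. It covers only the TF-IDF and DBSCAN stages; the LSA step is left out because it needs an SVD routine from a numerical library. The four documents, the `eps` and `min_samples` values, and the helper names are assumptions chosen for illustration, not the project's code or settings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so a term found in every document keeps a nonzero weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Label each vector with a cluster id; -1 marks noise (orphans)."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(vectors)

    def neighbors(i):
        return [
            j for j in range(len(vectors))
            if cosine_distance(vectors[i], vectors[j]) <= eps
        ]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Invented toy documents, not taken from the dataset.
docs = [
    "benghazi consulate attack security",
    "consulate attack in benghazi security update",
    "benghazi attack security briefing consulate",
    "lunch schedule for tuesday",
]
labels = dbscan(tfidf_vectors(docs), eps=0.6, min_samples=2)
print(labels)  # [0, 0, 0, -1]
```

The three related documents fall into cluster 0, and the unrelated one is labeled -1, which corresponds to the orphans mentioned under Key Info.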
6. Semi-Unsupervised Model & Query Expansion
Search Term: Benghazi
Neural Network (word2vec)
Expanded Search Terms: Tripoli, Stevens, Libyans, Consulate
Results (cluster)
Flask WebApp & SQLite
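The query-expansion flow on this slide can be sketched with the standard library alone. In the sketch below, the three-dimensional `VECTORS` are invented stand-ins for embeddings that would come from a word2vec model trained on the emails, and the `emails` table schema, the sample rows, and the helper names are hypothetical. The Flask layer is omitted.

```python
import math
import sqlite3

# Toy vectors invented for illustration; real ones would come from a
# word2vec model trained on the email corpus.
VECTORS = {
    "benghazi": [0.9, 0.1, 0.0],
    "tripoli": [0.8, 0.2, 0.1],
    "consulate": [0.7, 0.3, 0.0],
    "lunch": [0.0, 0.1, 0.9],
    "schedule": [0.1, 0.0, 0.8],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def expand_query(term, vectors, topn=2):
    """Return the search term followed by its topn most similar words."""
    target = vectors[term]
    ranked = sorted(
        (word for word in vectors if word != term),
        key=lambda word: cosine(target, vectors[word]),
        reverse=True,
    )
    return [term] + ranked[:topn]


def search(conn, terms):
    """Return ids of emails whose body contains any of the terms."""
    where = " OR ".join("body LIKE ?" for _ in terms)
    rows = conn.execute(
        f"SELECT id FROM emails WHERE {where} ORDER BY id",
        [f"%{t}%" for t in terms],
    )
    return [row[0] for row in rows]


# Hypothetical schema and sample rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO emails (id, body) VALUES (?, ?)",
    [
        (1, "Update on the Benghazi situation"),
        (2, "Flight to Tripoli confirmed"),
        (3, "Lunch schedule for Tuesday"),
        (4, "Consulate staffing request"),
    ],
)

terms = expand_query("benghazi", VECTORS)
hits = search(conn, terms)
print(terms)  # ['benghazi', 'tripoli', 'consulate']
print(hits)   # [1, 2, 4]
```

Searching for "benghazi" alone would return only email 1. The expanded terms also surface emails 2 and 4, which is the effect the slide describes.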
11. Future developments
● Generalize to other datasets
● Adapt the algorithm to prevent fraud
● Develop graphical visualization
● Record user activity to improve the software
12. Jay Gondin
Masters in Mathematics
Experienced Economic Analyst
gondin@gmail.com
github.com/jgondin
linkedin.com/in/gondin