Email Sherlock: Using Machine Learning to Extract Information from Large Email Datasets

Email Sherlock identifies and analyzes clusters in large email datasets. These clusters can aid email-based investigations and may help prevent similar cases by identifying emails that contain sensitive or classified information. I presented this as my final project at Metis Career Day San Francisco. Please see the link above for the slides and the video presentation.

Email Sherlock:
Using Machine Learning to Extract Information from Large Email Datasets
Jay Gondin

Investigations and Emails
● Bear Stearns v. Lehman Brothers
● Enron
● Hillary Clinton

Data: Hillary’s Emails
● 30,320 emails in dataset
● 60,000 meaningful words
● Unique acronyms
○ Ex. Hillary Clinton = Rodham, HRC, Madam Secretary
○ Ex. Obama = President, Administration, Barack
○ Ex. White House = WH

Email Pros and Cons
● Emails may contain information that is crucial to solving an investigation.
● Unique acronyms may help vectorize emails.
● Emails within a particular dataset have fewer authors.
● Duplicated text is common.
● A majority of emails do not contain information that is important or relevant to an investigation.
● Unique acronyms may make searches more difficult.
● Clusters of emails tend to overlap.
Pros Cons

Unsupervised Model
TF-IDF - vectorizer
LSA - dimensionality reduction
DBSCAN - clustering
Diagram labels: Machine Learning, SQLite, Raw Data, Analyzed Clusters
Key Info:
- Orphan emails tend to be less important and/or were anonymized.
- Dense clusters may contain more information.
- DBSCAN: density-based spatial clustering of applications with noise.
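
The three stages named on this slide can be sketched with scikit-learn. This is a minimal illustration, not the project's actual code: the function name cluster_emails, the SAMPLE_EMAILS list, and the parameter values (n_components, eps, min_samples) are assumptions chosen for a toy corpus.

```python
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus; the second email duplicates the first,
# as forwarded or quoted emails often do.
SAMPLE_EMAILS = [
    "Update on the Benghazi consulate security situation",
    "Update on the Benghazi consulate security situation",
    "Libyan officials request a meeting in Tripoli",
    "Lunch schedule for Thursday moved to noon",
]


def cluster_emails(texts, n_components=2, eps=0.3, min_samples=2):
    """Return one DBSCAN label per email; -1 marks an orphan (noise) email."""
    # Stage 1: TF-IDF turns each email into a sparse term-weight vector.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # Stage 2: LSA is a truncated SVD applied to the TF-IDF matrix.
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    reduced = svd.fit_transform(tfidf)
    # Scale rows to unit length so Euclidean distance tracks cosine distance.
    reduced = Normalizer(copy=False).fit_transform(reduced)
    # Stage 3: DBSCAN groups dense regions and labels everything else -1.
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reduced)


labels = cluster_emails(SAMPLE_EMAILS)
```

With min_samples=2, two identical emails always land in the same cluster, because each one is within eps of the other. On a real dataset, eps and n_components would need tuning.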

Semi-Unsupervised Model & Query Expansion
● Search term: Benghazi
● Neural network (word2vec)
● Expanded search terms: Tripoli, Stevens, Libyans, Consulate
● Results (cluster)
● Flask web app & SQLite
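
Query expansion of this kind can be sketched as a nearest-neighbour lookup in an embedding space. The sketch below is an assumption-laden stand-in: instead of a trained word2vec model it uses a small table of hand-picked vectors (TOY_VECTORS), and the helper names cosine, expand_query, and search are hypothetical.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# A trained word2vec model would supply learned vectors instead.
TOY_VECTORS = {
    "benghazi": (0.9, 0.1, 0.0),
    "tripoli": (0.8, 0.2, 0.1),
    "consulate": (0.7, 0.3, 0.0),
    "budget": (0.0, 0.1, 0.9),
    "schedule": (0.1, 0.9, 0.2),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def expand_query(term, vectors, top_n=2):
    """Return the search term followed by its top_n most similar words."""
    key = term.lower()
    query = vectors[key]
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word != key]
    scored.sort(reverse=True)
    return [key] + [word for _, word in scored[:top_n]]


def search(emails, terms):
    """Return the emails that mention at least one of the terms."""
    return [e for e in emails if any(t in e.lower() for t in terms)]


terms = expand_query("Benghazi", TOY_VECTORS)
```

A search for the expanded terms then also returns emails that never mention the original term, for example an email about Tripoli when the query was "Benghazi".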

Finding Connections:
Figure labels: Benghazi, Libyans
● Clusters are based on meaning.

Sentiment Analysis
● High Polarity may indicate sensitive information.
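
One way to act on this idea is to score each email's polarity and flag the ones with a high absolute score. The slide does not say how polarity was computed, so the sketch below is an assumption: TOY_LEXICON, polarity, flag_high_polarity, and the 0.75 threshold are all hypothetical.

```python
# Hypothetical word scores, invented for illustration; a real sentiment
# lexicon or a trained model would replace this table.
TOY_LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}


def polarity(text):
    """Average score of the lexicon words in text, in [-1, 1]; 0.0 if none."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def flag_high_polarity(emails, threshold=0.75):
    """Return emails whose absolute polarity meets the threshold."""
    return [e for e in emails if abs(polarity(e)) >= threshold]
```

Strongly positive and strongly negative emails are both flagged, since the check uses the absolute value of the score.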

Future developments
● Generalize to other datasets
● Adapt the algorithm to prevent fraud
● Develop graphical visualizations
● Record user activity to improve the software

Jay Gondin
Masters in Mathematics
Experienced Economic Analyst
gondin@gmail.com
github.com/jgondin
linkedin.com/in/gondin
