Dec. 12, 2014 - Frontiers of Forensic Science 2
Some background…
• PhD candidate at ILPS
• Information Extraction & Retrieval
• Project in NWO’s Forensic Science program
• Semantic Search in E-Discovery
Information Retrieval?
• Finding material of an unstructured nature in large collections
Information Extraction?
• Text mining
• Discovering patterns in text data
Semantic Search in E-Discovery?
E-Discovery?
• Retrieving and securing digital forensic
evidence
Semantic Search in E-Discovery
• Supporting search for digital forensic evidence
• from emails, hard drives, mobile phones, etc…
• not the open web
• (Google won’t help us here)
Search in E-Discovery
• Finding out who knew what, from whom, and when
• We don’t know what we’re looking for
• What we’re looking for might be deliberately hidden
• Communication might be very domain-specific, contextualized or incomplete
Approach
• Generic search is not the answer
  • Google: high precision search
  • E-Discovery: high recall & exploratory search
Tasks
• Support iterative search
• Support (re)formulating questions and hypotheses
• Retrieve all relevant traces
Recipient recommendation
• Given a sender, an email, and all possible recipients (in an enterprise),
• predict which recipient(s) are most likely to receive the email
Why?
• Understanding the communication in, and the structure of, an enterprise
• Finding “unexpected” communication
• Applications in:
  • enterprise search
  • expert finding
  • community detection
  • spam classification
  • anomaly detection
How?
• Gmail
  • Who do you frequently “co-address”?
  • Ego network
• Related work
  • Social Network Analysis (SNA)
  • Email content
• Us
  • SNA + email content
Part 1: Social Network Analysis?
d.p.graus@uva.nl z.ren@uva.nl
derijke@uva.nl
image by Calvinius - Creative Commons Attribution-Share Alike 3.0
SNA for predicting recipients?
1. Importance of a node in the network
   Prior probability: more important people are more likely to be recipients of an(y) email
2. Connection strength between two nodes
   Conditional probability: given the sender, candidates who are strongly associated with that sender are more likely to be recipients
Part 2: Email content
• Statistical Language Models (LMs)
  • Assign a probability to [a sequence of] words
  • By counting words
• Used in lots of places:
  • Web Search
  • Machine Translation
  • Speech Recognition
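To make "assigning a probability by counting words" concrete, here is a minimal sketch of a unigram language model. It assumes whitespace tokenization, lowercasing, and no smoothing; the function names and toy sentences are illustrative and not taken from the talk.

```python
from collections import Counter

def unigram_lm(texts):
    """Maximum-likelihood unigram model: P(w) = count(w) / total word count."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def sequence_probability(lm, text):
    """Probability of a word sequence, assuming words are independent."""
    p = 1.0
    for word in text.lower().split():
        p *= lm.get(word, 0.0)  # unseen words get probability 0 (no smoothing)
    return p

# Hypothetical toy training data.
lm = unigram_lm(["the meeting is on friday", "the report is due friday"])
```

Without smoothing, a single unseen word drives the whole sequence probability to zero, which is one reason practical systems mix in a background model.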
Language Models
• Language models as communication “profiles”
1. Incoming LM (how people talk to user)
2. Outgoing LM (how user talks to people)
3. Interpersonal LM (how node1 talks with node2)
4. Corpus LM (how everyone talks)
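The four profiles above could be collected in a single pass over an email collection. The sketch below is an illustration under assumptions: emails are hypothetical (sender, recipients, text) tuples, tokenization is a whitespace split, and the interpersonal profile is treated as undirected (one profile per pair of nodes), which the slides do not specify.

```python
from collections import Counter, defaultdict

# Hypothetical toy emails: (sender, recipients, text).
emails = [
    ("ann", ["bob"], "budget report attached"),
    ("ann", ["cat"], "lunch today"),
    ("bob", ["ann"], "budget approved"),
]

incoming = defaultdict(Counter)       # words a user receives
outgoing = defaultdict(Counter)       # words a user sends
interpersonal = defaultdict(Counter)  # words exchanged between a pair of users
corpus = Counter()                    # words across all emails

for sender, recipients, text in emails:
    words = text.lower().split()
    corpus.update(words)
    outgoing[sender].update(words)
    for recipient in recipients:
        incoming[recipient].update(words)
        # Assumption: the pair is unordered, so both directions share a profile.
        interpersonal[frozenset((sender, recipient))].update(words)

def to_lm(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}
```

For example, `to_lm(interpersonal[frozenset(("ann", "bob"))])` gives the word distribution for everything exchanged between those two nodes.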
Why language models?
• Comparisons between communication profiles:
  • Find nodes with most similar communication
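One way to compare two communication profiles is a divergence between their word distributions. The slides do not name a similarity measure, so the Jensen-Shannon divergence used below is an assumption chosen for illustration: it is symmetric and stays finite when the vocabularies differ. Function names and data shapes are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions (dicts)."""
    vocab = set(p) | set(q)
    # Midpoint distribution between p and q.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_midpoint(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_midpoint(p) + 0.5 * kl_to_midpoint(q)

def most_similar(target, profiles):
    """Name of the profile whose distribution is closest to the target."""
    return min(profiles, key=lambda name: js_divergence(target, profiles[name]))
```

A divergence of 0 means identical profiles; with base-2 logarithms the maximum is 1, reached when the two profiles share no words.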
Model
• Given sender and email, predict recipients
• Ranking function combines:
  • Email likelihood, P(E|R,S): estimated using language modeling
  • Sender likelihood, P(S|R): using SNA to estimate closeness of R and S
  • Recipient likelihood, P(R): using SNA to estimate importance of R
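The three components named above fit together through Bayes' rule. The following is a reading of the ranking function under that assumption, with R a candidate recipient, S the sender and E the email; the formula on the original slide did not survive extraction, so this is a reconstruction from the component names rather than a quotation.

```latex
P(R \mid S, E)
  = \frac{P(E \mid R, S)\, P(S \mid R)\, P(R)}{P(S, E)}
  \propto P(E \mid R, S)\, P(S \mid R)\, P(R)
```

The denominator P(S, E) is the same for every candidate R, so it can be dropped when the goal is only to rank candidates.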
Recipient Likelihood: P(R)
Importance of node
1. Number of emails received
2. PageRank score
Sender Likelihood: P(S|R)
Strength of connection between two nodes
1. Number of emails sent between nodes
2. Number of times two nodes are addressed together
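As an illustration of the count-based estimates, the sketch below derives P(R) from the number of emails received and P(S|R) from the number of emails a sender sent to a recipient. The normalization, the directed counting, and the toy log are assumptions; PageRank and co-addressing counts are left out.

```python
from collections import Counter

# Hypothetical toy log of (sender, recipients) pairs.
log = [
    ("ann", ["bob", "cat"]),
    ("ann", ["bob"]),
    ("cat", ["bob"]),
    ("bob", ["ann"]),
]

received = Counter()  # emails received per node
sent_to = Counter()   # emails sent from a given sender to a given recipient
for sender, recipients in log:
    for recipient in recipients:
        received[recipient] += 1
        sent_to[(sender, recipient)] += 1

total_received = sum(received.values())

def recipient_likelihood(r):
    """P(R): share of all received emails that went to r."""
    return received[r] / total_received

def sender_likelihood(s, r):
    """P(S|R): share of r's received emails that came from s."""
    return sent_to[(s, r)] / received[r] if received[r] else 0.0
```

In this toy log, bob receives three of the five delivered emails, two of them from ann.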
SNA
1. Importance of a node in the network
2. Strength of connection between nodes
Email Content
1. Interpersonal LM
2. Recipient LM
3. Corpus LM
Approach: time-based
Timeline:
• Training period: build models (SNA + LM)
• Testing period: predict recipients
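A time-based setup of this kind can be sketched as a chronological split: everything before a cutoff is used to build the models, everything after is held out for prediction. The `sent` field, the cutoff date, and the function name are hypothetical.

```python
from datetime import datetime

def split_by_time(emails, cutoff):
    """Emails sent before the cutoff go to training, the rest to testing."""
    train = [e for e in emails if e["sent"] < cutoff]
    test = [e for e in emails if e["sent"] >= cutoff]
    return train, test

# Hypothetical toy emails with a timestamp each.
emails = [
    {"id": 1, "sent": datetime(2001, 3, 1)},
    {"id": 2, "sent": datetime(2001, 6, 1)},
    {"id": 3, "sent": datetime(2001, 9, 1)},
]
train, test = split_by_time(emails, cutoff=datetime(2001, 7, 1))
```

Splitting by time rather than at random keeps the evaluation realistic: the models only ever see communication that happened before the emails they are asked to predict.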
Testing
• Remove recipients from email
• Rank all nodes in the network, by computing:
  1. P(E|R,S): Similarity between sender and candidate LMs
  2. P(S|R): Strength of connection between sender and candidate
  3. P(R): Importance of candidate
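The ranking step can be sketched as follows, assuming the three components are multiplied (summed in log space to avoid underflow) and that candidates with a zero component rank last. The component values are made-up toy numbers, not results from the talk.

```python
import math

def score(email_likelihood, sender_likelihood, recipient_likelihood):
    """Log of P(E|R,S) * P(S|R) * P(R); -inf if any component is zero."""
    parts = (email_likelihood, sender_likelihood, recipient_likelihood)
    if min(parts) <= 0:
        return float("-inf")
    return sum(math.log(p) for p in parts)

def rank(components):
    """components maps candidate -> (P(E|R,S), P(S|R), P(R)); best first."""
    return sorted(components, key=lambda r: score(*components[r]), reverse=True)

# Hypothetical component scores for three candidate recipients.
toy = {
    "bob": (0.02, 0.5, 0.3),
    "cat": (0.05, 0.1, 0.2),
    "dan": (0.01, 0.0, 0.5),
}
ranking = rank(toy)
```

The predicted ranking is then compared against the recipients that were removed from the email.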
Findings: What works?
• Importance of node:
  • Number of emails received by node
  • PageRank
• Strength of connection:
  • Number of emails between nodes
  • Number of times co-addressed
• LM similarity:
  • Interpersonal LM is most important (60%-20%-20%)
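One possible reading of the 60%-20%-20% figure is as mixture weights over the three language models used for email content. The sketch below assumes linear interpolation with the interpersonal LM at 0.6 and the recipient and corpus LMs at 0.2 each; the slide does not spell out this interpretation, so treat it as an illustration only.

```python
def interpolated_probability(word, interpersonal, recipient, corpus,
                             weights=(0.6, 0.2, 0.2)):
    """Weighted mix of three word distributions (dicts of word -> probability)."""
    w_interpersonal, w_recipient, w_corpus = weights
    return (w_interpersonal * interpersonal.get(word, 0.0)
            + w_recipient * recipient.get(word, 0.0)
            + w_corpus * corpus.get(word, 0.0))

# Hypothetical toy distributions.
interpersonal_lm = {"budget": 0.5}
recipient_lm = {"budget": 0.1}
corpus_lm = {"budget": 0.05, "lunch": 0.01}
```

Mixing in the corpus LM keeps the probability above zero for words the pair has never exchanged, such as "lunch" in this toy data.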
Analysis: SNA vs email content
• SNA:
  • SNA signals deteriorate over time
  • SNA signals are most informative for highly active users
• Email content:
  • LM signal improves over time
  • LM signal does worse with highly active users
Finally
• Combining Social Network Analysis with Language Modeling works better than either alone.
Future work
• Consider structure of network in more detail
  • Departments?
  • Friends/family?
• Include ‘time decay’
• Dynamically weight LM/SNA?
Applications in E-Discovery/Digital Forensics
• Anomaly detection
  • Given a working prediction model, identify “unexpected” communication
• Language models for communication
  • For a node, find the most different interpersonal communication
  • Friends/family vs colleagues?
  • Find communication that differs from the corpus-based communication
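The first idea, using a working prediction model to surface "unexpected" communication, could look like the sketch below: an email's actual recipients are compared with the model's ranking, and any recipient outside the top of the ranking is flagged. The `top_k` threshold and the function name are hypothetical choices, not part of the talk.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=10):
    """Actual recipients that the model did not place in its top_k predictions."""
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]

# Hypothetical model ranking for one email, best candidate first.
ranking = ["bob", "cat", "dan", "eve"]
flagged = unexpected_recipients(ranking, ["bob", "eve"], top_k=2)
```

Flagged recipients are candidates for closer inspection rather than proof of anything unusual, since a low rank may simply reflect sparse training data.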
Fin
• Questions?