Understanding email traffic 
David Graus, University of Amsterdam 
d.p.graus@uva.nl 
@dvdgrs
Talk at Frontiers of Forensic Science Lecture Series

Published in: Data & Analytics

Understanding Email Traffic

  1. Understanding email traffic
     David Graus, University of Amsterdam
     d.p.graus@uva.nl
     @dvdgrs
  2. Dec. 12, 2014 - Frontiers of Forensic Science
     Some background…
     • PhD candidate at ILPS
     • Information Extraction & Retrieval
     • Project in NWO’s Forensic Science program
     • Semantic Search in E-Discovery
  3. Some background…
     • PhD candidate at ILPS
     • Information Extraction & Retrieval
     • Project in NWO’s Forensic Science program
     • Semantic Search in E-Discovery
  4. Information Retrieval?
  5. Information Retrieval?
     • Finding material of unstructured nature from large collections
  6. Information Extraction?
     • Text mining
     • Discovering patterns in text data
  7. Semantic Search in E-Discovery?
  8. Semantic Search?
  9. E-Discovery?
     • Retrieving and securing digital forensic evidence
  10. E-Discovery ⬜ Semantic Search in E-Discovery
  11. Semantic Search in E-Discovery
      • Supporting search for digital forensic evidence
      • from emails, hard drives, mobile phones, etc…
      • not the open web
      • (Google won’t help us here)
  12. Search in E-Discovery
      • Finding out who knew what, from whom, and when
      • We don’t know what we’re looking for
      • What we’re looking for might be deliberately hidden
      • Communication might be very domain-specific, contextualized or incomplete
  13. Approach
      • Generic search is not the answer
      • Google: high precision search
      • E-Discovery: high recall & exploratory search
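The contrast on slide 13 between precision-oriented and recall-oriented search can be made concrete with the standard definitions of the two measures. The sketch below is only an illustration; the function name and the document IDs are hypothetical and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection with four relevant documents.
relevant = {"d1", "d2", "d3", "d4"}

# A short, precise result list (web-search style).
print(precision_recall(["d1", "d2"], relevant))  # (1.0, 0.5)

# A long, exhaustive result list (E-Discovery style).
print(precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant))  # (0.5, 1.0)
```

The second call finds every relevant document at the cost of more noise, which is the trade-off the slide attributes to E-Discovery.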
  14. Tasks
      • Support iterative search
      • Support (re)formulating questions and hypotheses
      • Retrieve all relevant traces
  15. (no text on this slide)
  16. (no text on this slide)
  17. Recipient recommendation
      • Given a sender, an email, all possible recipients (in an enterprise);
      • Predict which recipient(s) are most likely to receive the email
  18. Why?
      • Understanding communication in/structure of an enterprise
      • Finding “unexpected” communication
      • Applications in:
        • enterprise search
        • expert finding
        • community detection
        • spam classification
        • anomaly detection
  19. How?
      • Gmail
        • Who do you frequently “co-address”
        • egonetwork
      • Related work
        • Social Network Analysis (SNA)
        • Email content
      • Us
        • SNA + email content
  20. Part 1: Social Network Analysis?
      d.p.graus@uva.nl z.ren@uva.nl derijke@uva.nl
  21. Image by Calvinius - Creative Commons Attribution-Share Alike 3.0
  22. SNA for predicting recipients?
      1. Importance of a node in the network
         • Prior probability
         • More important people are more likely to be recipients of an(y) email
      2. Connection strength between two nodes
         • Conditional probability
         • Given the sender, recipients who are strongly associated with that sender are more likely to receive the email
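The two SNA signals on slide 22 can be sketched as relative frequencies over an email log. This is a minimal illustration under stated assumptions: the toy log, the function names, and the choice of plain relative-frequency estimates are all hypothetical, and the conditional is written as P(S|R) to match the notation used later in the talk.

```python
from collections import Counter

# Hypothetical toy log: (sender, [recipients]) pairs.
emails = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("bob", ["carol"]),
    ("dave", ["bob"]),
]

received = Counter()  # number of emails received per node
sent_to = Counter()   # number of emails per (recipient, sender) pair

for sender, recipients in emails:
    for r in recipients:
        received[r] += 1
        sent_to[(r, sender)] += 1

total_deliveries = sum(received.values())


def prior(r):
    """P(R): share of all deliveries that went to r (importance of the node)."""
    return received[r] / total_deliveries


def sender_given_recipient(s, r):
    """P(S|R): share of r's incoming email that came from s (connection strength)."""
    return sent_to[(r, s)] / received[r] if received[r] else 0.0


print(prior("bob"))                            # 0.6
print(sender_given_recipient("alice", "bob"))  # 2 of bob's 3 incoming emails
```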
  23. Part 2: Email content
      • Statistical Language Models (LMs)
        • Assign a probability to [a sequence of] words;
        • By counting words
      • Used in lots of places;
        • Web Search
        • Machine Translation
        • Speech Recognition
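"Assigning a probability to a sequence of words by counting words" can be illustrated with a unigram model. The sketch below assumes whitespace tokenization and a small fixed probability for unseen words; both are simplifications chosen for the example, not details given in the talk.

```python
from collections import Counter
import math


def train_unigram(texts):
    """Estimate P(word) as the word's relative frequency in the training texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sequence_log_prob(lm, text, floor=1e-6):
    """Log-probability of a word sequence; unseen words get a small floor value."""
    return sum(math.log(lm.get(word, floor)) for word in text.lower().split())


lm = train_unigram(["the meeting is at noon", "the report is late"])
# A sequence of seen words scores higher than one containing an unseen word.
print(sequence_log_prob(lm, "the report") > sequence_log_prob(lm, "the budget"))  # True
```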
  24. Language Models
      • Language models as communication “profiles”
  25. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
  26. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
  27. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
        3. Interpersonal LM (how node1 talks with node2)
  28. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
        3. Interpersonal LM (how node1 talks with node2)
  29. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
        3. Interpersonal LM (how node1 talks with node2)
        4. Corpus LM (how everyone talks)
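The four profiles listed on slide 29 can be sketched as word-count tables built in one pass over the emails. The toy emails and variable names below are hypothetical, and treating the interpersonal profile as an unordered pair of nodes is an assumption about what "talks with" means.

```python
from collections import Counter, defaultdict

# Hypothetical toy log: (sender, [recipients], body).
emails = [
    ("alice", ["bob"], "budget review tomorrow"),
    ("bob", ["alice"], "budget approved"),
    ("carol", ["bob"], "lunch tomorrow"),
]

incoming = defaultdict(Counter)       # 1. how people talk to a user
outgoing = defaultdict(Counter)       # 2. how a user talks to people
interpersonal = defaultdict(Counter)  # 3. how two nodes talk with each other
corpus = Counter()                    # 4. how everyone talks

for sender, recipients, body in emails:
    words = body.lower().split()
    outgoing[sender].update(words)
    corpus.update(words)
    for r in recipients:
        incoming[r].update(words)
        interpersonal[frozenset((sender, r))].update(words)


def to_lm(counts):
    """Turn raw word counts into a unigram distribution."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


print(to_lm(interpersonal[frozenset(("alice", "bob"))]))
```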
  30. Why language models?
      • Comparisons between communication profiles:
        • Find nodes with most similar communication
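The slides do not say which measure is used to compare communication profiles. As one possible illustration, the sketch below uses Jensen-Shannon divergence between unigram distributions; the choice of measure, the function names, and the toy profiles are assumptions made for this example.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    0.0 means identical profiles, 1.0 means no shared vocabulary.
    """
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # Kullback-Leibler divergence from the mixture, skipping zero-probability words.
        return sum(a[w] * math.log2(a[w] / mix[w]) for w in vocab if a.get(w, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def most_similar(target, profiles):
    """Name of the profile whose distribution is closest to the target."""
    return min(profiles, key=lambda name: js_divergence(target, profiles[name]))


profiles = {
    "bob": {"budget": 0.6, "review": 0.4},
    "carol": {"lunch": 1.0},
}
print(most_similar({"budget": 0.5, "review": 0.5}, profiles))  # bob
```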
  31. Model
      • Given sender and email, predict recipients
      • Ranking function:
  32. • Email likelihood: estimate using language modeling
      • Sender likelihood: using SNA to estimate closeness of R and S
      • Recipient likelihood: using SNA to estimate importance of R
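The ranking function itself did not survive in the transcript. One form that is consistent with the three components named on slide 32 and with the quantities listed on slide 38 is the Bayes-rule factorization below, where R is a candidate recipient, S the sender and E the email. It is offered as an assumption, not as the authors' exact formula.

```latex
P(R \mid S, E) \;\propto\; \underbrace{P(E \mid R, S)}_{\text{email likelihood}} \cdot \underbrace{P(S \mid R)}_{\text{sender likelihood}} \cdot \underbrace{P(R)}_{\text{recipient likelihood}}
```

The proportionality drops the normalizer P(S, E), which is the same for every candidate and therefore does not affect the ranking.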
  33. Email likelihood
  34. Email likelihood
      P(word|R,S) P(word|R) P(word)
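Slide 34 lists three word probabilities without showing how they are combined. A common way to combine such models is linear interpolation, sketched below. The interpolation scheme, the function name, the floor value, and the mapping of the 60%-20%-20% split (reported on slide 40) to interpersonal, recipient and corpus models in that order are all assumptions.

```python
import math


def email_log_likelihood(words, interpersonal_lm, recipient_lm, corpus_lm,
                         weights=(0.6, 0.2, 0.2), floor=1e-9):
    """Log P(E|R,S) as a product over words of an interpolated word probability.

    Each word's probability mixes P(word|R,S), P(word|R) and P(word).
    """
    w_rs, w_r, w_c = weights
    total = 0.0
    for word in words:
        p = (w_rs * interpersonal_lm.get(word, 0.0)
             + w_r * recipient_lm.get(word, 0.0)
             + w_c * corpus_lm.get(word, 0.0))
        total += math.log(max(p, floor))  # floor avoids log(0) for unseen words
    return total


score = email_log_likelihood(
    ["budget"],
    interpersonal_lm={"budget": 0.5},
    recipient_lm={"budget": 0.25},
    corpus_lm={"budget": 0.1},
)
print(math.exp(score))  # approximately 0.37
```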
  35. Recipient likelihood P(R): importance of node
        1. Number of emails received
        2. PageRank score
      Sender likelihood P(S|R): strength of connection between two nodes
        1. Number of emails sent between nodes
        2. Number of times two nodes are addressed together
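PageRank is named on slide 35 as one estimate of node importance. The sketch below is a plain power-iteration version on a directed graph whose edges point from sender to recipient. The edge direction, the damping factor of 0.85, the iteration count and the toy graph are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) edges; scores sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # Nodes without outgoing edges spread their score over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three people email bob, and bob emails alice.
scores = pagerank([("alice", "bob"), ("carol", "bob"), ("dave", "bob"), ("bob", "alice")])
print(max(scores, key=scores.get))  # bob
```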
  36. SNA
      1. Importance of a node in the network
      2. Strength of connection between nodes
      Email Content
      1. Interpersonal LM
      2. Recipient LM
      3. Corpus LM
  37. Approach: time-based (timeline)
      • Training period: build models (SNA + LM)
      • Testing period: predict recipients
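The time-based setup on slide 37 amounts to splitting the email log at a cutoff date, so that models are built only from earlier emails. A minimal sketch follows; the record layout, the `sent` field name and the cutoff date are hypothetical.

```python
from datetime import datetime


def time_split(emails, cutoff):
    """Split emails into (training, testing) by send time; the cutoff starts the test period."""
    ordered = sorted(emails, key=lambda e: e["sent"])
    train = [e for e in ordered if e["sent"] < cutoff]
    test = [e for e in ordered if e["sent"] >= cutoff]
    return train, test


emails = [
    {"id": 1, "sent": datetime(2001, 3, 1)},
    {"id": 2, "sent": datetime(2001, 7, 15)},
    {"id": 3, "sent": datetime(2001, 5, 20)},
]
train, test = time_split(emails, cutoff=datetime(2001, 6, 1))
print([e["id"] for e in train], [e["id"] for e in test])  # [1, 3] [2]
```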
  38. Testing (testing period: predict recipients)
      • Remove recipients from email
      • Rank all nodes in the network, by computing:
        1. P(E|R,S): Similarity between sender and candidate LMs
        2. P(S|R): Strength of connection between sender and candidate
        3. P(R): Importance of candidate
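The testing procedure on slide 38 can be sketched as scoring every candidate with the three components and sorting. Combining the components as a sum of logs, skipping the sender as a candidate, and all names and numbers below are assumptions made for illustration.

```python
import math


def rank_recipients(sender, words, candidates, p_email, p_sender, p_recipient, floor=1e-12):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R), highest first."""
    scored = []
    for r in candidates:
        if r == sender:
            continue  # assumption: a sender is not recommended to themselves
        score = (math.log(max(p_email(words, r, sender), floor))
                 + math.log(max(p_sender(sender, r), floor))
                 + math.log(max(p_recipient(r), floor)))
        scored.append((r, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy component estimates (hypothetical numbers, for illustration only).
email_lik = {"bob": 0.02, "carol": 0.01}
sender_lik = {"bob": 0.5, "carol": 0.1}
prior = {"bob": 0.3, "carol": 0.4}

ranking = rank_recipients(
    "alice", ["budget"], ["alice", "bob", "carol"],
    p_email=lambda words, r, s: email_lik[r],
    p_sender=lambda s, r: sender_lik[r],
    p_recipient=lambda r: prior[r],
)
print([name for name, _ in ranking])  # ['bob', 'carol']
```

Passing the three estimators in as functions keeps the ranking step independent of how each component is estimated.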
  39. (no text on this slide)
  40. Findings: What works?
      • Importance of node:
        • Number of received emails of node
        • PageRank
      • Strength of connection:
        • Number of emails between nodes
        • Number of times co-addressed
      • LM Similarity:
        • Interpersonal LM is most important (60%-20%-20%)
  41. Analysis: SNA vs email content
      • SNA:
        • SNA signals deteriorate over time
        • SNA signals are most informative on highly active users
      • Email content:
        • LM signal improves over time
        • LM signal does worse with highly active users
  42. Finally
      • Combining Social Network Analysis with Language Modeling works better than using either one alone.
  43. Future work
      • Consider structure of network in more detail
        • Departments?
        • Friends/family?
      • Include ‘time decay’
      • Dynamically weight LM/SNA?
  44. Applications in E-Discovery/Digital Forensics
      • Anomaly detection
        • Given a working prediction model; identify “unexpected” communication
      • Language models for communication
        • For a node, find the most different interpersonal communication
          • Friends/family vs colleagues?
        • Find communication that differs from the corpus-based communication
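The anomaly-detection idea on slide 44 can be sketched as follows: given the model's ranked list of candidates for an email, flag any actual recipient that the model did not rank near the top. The top-k rule, the function name and the example names are assumptions; the slides do not specify how "unexpected" would be decided.

```python
def unexpected_recipients(ranked_candidates, actual_recipients, top_k=3):
    """Return the actual recipients that fall outside the model's top_k predictions."""
    expected = set(ranked_candidates[:top_k])
    return [r for r in actual_recipients if r not in expected]


# Hypothetical ranking for one email, best candidate first.
ranked = ["bob", "carol", "dave", "erin"]
print(unexpected_recipients(ranked, ["bob", "erin"], top_k=2))  # ['erin']
```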
  45. Fin
      • Questions?
