Understanding Email Traffic

  1. Understanding email traffic David Graus, University of Amsterdam d.p.graus@uva.nl @dvdgrs
  2. Dec. 12, 2014 - Frontiers of Forensic Science 2 Some background… • PhD candidate at ILPS • Information Extraction & Retrieval • Project in NWO’s Forensic Science program • Semantic Search in E-Discovery
  4. Information Retrieval?
  5. Information Retrieval? • Finding material of unstructured nature from large collections
  6. Information Extraction? • Text mining • Discovering patterns in text data
  7. Semantic Search in E-Discovery?
  8. Semantic Search?
  9. E-Discovery? • Retrieving and securing digital forensic evidence
  10. E-Discovery • Semantic Search in E-Discovery
  11. Semantic Search in E-Discovery • Supporting search for digital forensic evidence • from emails, hard drives, mobile phones, etc. • not the open web • (Google won’t help us here)
  12. Search in E-Discovery • Finding out who knew what, from whom, and when • We don’t know what we’re looking for • What we’re looking for might be deliberately hidden • Communication might be very domain-specific, contextualized, or incomplete
  13. Approach • Generic search is not the answer • Google: high-precision search • E-Discovery: high-recall & exploratory search
  14. Tasks • Support iterative search • Support (re)formulating questions and hypotheses • Retrieve all relevant traces
  17. Recipient recommendation • Given a sender, an email, and all possible recipients (in an enterprise), • predict which recipient(s) are most likely to receive the email
  18. Why? • Understanding communication in/structure of an enterprise • Finding “unexpected” communication • Applications in: enterprise search, expert finding, community detection, spam classification, anomaly detection
  19. How? • Gmail: who do you frequently “co-address” (ego network) • Related work: Social Network Analysis (SNA); email content • Us: SNA + email content
  20. Part 1: Social Network Analysis? d.p.graus@uva.nl z.ren@uva.nl derijke@uva.nl
  21. Image by Calvinius - Creative Commons Attribution-Share Alike 3.0
  22. SNA for predicting recipients? 1. Importance of a node in the network (prior probability): more important people are more likely to be recipients of an(y) email 2. Connection strength between two nodes (conditional probability): given the sender, people who are strongly associated with that sender are more likely to be the recipient
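The two SNA signals on this slide can be sketched with plain counts. This is a minimal illustration, assuming the simplest count-based estimates (received-email counts for the prior, sender-to-recipient counts for the conditional); the function names and the toy log are hypothetical and not taken from the talk.

```python
from collections import Counter

def recipient_prior(emails):
    """P(R): share of all recipient slots in the log that go to R."""
    received = Counter(r for _, recipients in emails for r in recipients)
    total = sum(received.values())
    return {r: n / total for r, n in received.items()}

def sender_given_recipient(emails):
    """P(S|R): share of the emails R received that were sent by S."""
    pair = Counter((s, r) for s, recipients in emails for r in recipients)
    received = Counter(r for _, recipients in emails for r in recipients)
    return {(s, r): n / received[r] for (s, r), n in pair.items()}

# Toy log of (sender, recipients) pairs; the names are made up.
emails = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("dave", ["bob"]),
    ("bob", ["alice"]),
]
prior = recipient_prior(emails)
cond = sender_given_recipient(emails)
```

In this toy log, bob receives three of the five addressed copies, so he has the highest prior, and alice is his most strongly connected sender.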
  23. Part 2: Email content • Statistical Language Models (LMs) • Assign a probability to [a sequence of] words • By counting words • Used in lots of places: Web Search, Machine Translation, Speech Recognition
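"Assigning a probability by counting words" can be illustrated with a unigram model. The sketch below assumes whitespace tokenization and a small fixed probability for unseen words; both are simplifications chosen for the example, and the function names are hypothetical.

```python
from collections import Counter
import math

def train_unigram(texts):
    """Maximum-likelihood unigram model: P(word) = count(word) / total tokens."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_prob(model, text, floor=1e-6):
    """Log-probability of a word sequence; unseen words get a small floor value."""
    return sum(math.log(model.get(w, floor)) for w in text.lower().split())

# Toy training texts (made up).
model = train_unigram(["the meeting is at noon", "the report is late"])
seen = log_prob(model, "the report")
unseen = log_prob(model, "the budget")
```

A sequence made of words the model has counted ("the report") scores higher than one containing an unseen word ("the budget").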
  24. Language Models • Language models as communication “profiles”
  25. Language Models • Language models as communication “profiles” 1. Incoming LM (how people talk to user)
  26. Language Models • Language models as communication “profiles” 1. Incoming LM (how people talk to user) 2. Outgoing LM (how user talks to people)
  27. Language Models • Language models as communication “profiles” 1. Incoming LM (how people talk to user) 2. Outgoing LM (how user talks to people) 3. Interpersonal LM (how node1 talks with node2)
  29. Language Models • Language models as communication “profiles” 1. Incoming LM (how people talk to user) 2. Outgoing LM (how user talks to people) 3. Interpersonal LM (how node1 talks with node2) 4. Corpus LM (how everyone talks)
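The four profiles can be collected in a single pass over the email log. This is a sketch under stated assumptions: emails are (sender, recipients, text) tuples, tokenization is a whitespace split, and the interpersonal profile is keyed by an unordered pair of nodes (one reading of "how node1 talks with node2"). The helper name and the toy emails are hypothetical.

```python
from collections import Counter, defaultdict

def build_profiles(emails):
    """Return word counts for the incoming, outgoing, interpersonal and corpus LMs."""
    incoming = defaultdict(Counter)       # words sent to a user
    outgoing = defaultdict(Counter)       # words sent by a user
    interpersonal = defaultdict(Counter)  # words exchanged between two users
    corpus = Counter()                    # words in all emails
    for sender, recipients, text in emails:
        words = text.lower().split()
        corpus.update(words)
        outgoing[sender].update(words)
        for r in recipients:
            incoming[r].update(words)
            interpersonal[frozenset((sender, r))].update(words)
    return incoming, outgoing, interpersonal, corpus

# Toy emails (made up).
emails = [
    ("alice", ["bob"], "budget draft attached"),
    ("bob", ["alice", "carol"], "thanks for the draft"),
]
incoming, outgoing, interpersonal, corpus = build_profiles(emails)
```

Each counter can then be normalized into a probability distribution, as in any count-based language model.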
  30. Why language models? • Comparisons between communication profiles: find nodes with most similar communication
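The slide does not say which similarity measure is used to compare profiles. As one possible illustration, the sketch below compares word-count profiles with cosine similarity; the measure, the function name, and the toy profiles are assumptions made for this example.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two word-count profiles (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy profiles (made up).
profiles = {
    "bob": Counter("budget report budget".split()),
    "carol": Counter("lunch friday".split()),
    "dave": Counter("budget report draft".split()),
}
others = [name for name in profiles if name != "bob"]
most_similar = max(others, key=lambda n: cosine_similarity(profiles["bob"], profiles[n]))
```

Here dave shares vocabulary with bob while carol shares none, so dave is returned as the node with the most similar communication.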
  31. Model • Given sender and email, predict recipients • Ranking function:
  32. • Email likelihood: estimated using language modeling • Sender likelihood: estimated using SNA, as the closeness of R and S • Recipient likelihood: estimated using SNA, as the importance of R
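The ranking function itself did not survive in this transcript. One reading that is consistent with the three components named here, and with the quantities P(E|R,S), P(S|R) and P(R) listed on the testing slide, is a Bayes' rule decomposition. It is given as a plausible reconstruction, not as the exact formula from the slide, with R a candidate recipient, S the sender and E the email:

```latex
P(R \mid S, E)
  = \frac{P(E \mid R, S)\, P(S \mid R)\, P(R)}{P(S, E)}
  \propto
  \underbrace{P(E \mid R, S)}_{\text{email likelihood}}
  \cdot
  \underbrace{P(S \mid R)}_{\text{sender likelihood}}
  \cdot
  \underbrace{P(R)}_{\text{recipient likelihood}}
```

The denominator P(S, E) is the same for every candidate R, so it can be dropped when ranking.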
  33. Email likelihood
  34. Email likelihood • P(word|R,S) • P(word|R) • P(word)
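One common way to combine three word distributions like these is linear interpolation. The sketch below assumes that scheme and uses the 60%-20%-20% split reported in the findings slide as default weights, with the largest weight on the interpersonal model; whether the talk combines the models in exactly this way is not stated in the transcript. All names and toy counts are hypothetical.

```python
import math
from collections import Counter

def mle(counts, word):
    """Maximum-likelihood estimate of P(word) from a word-count table."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0

def email_log_likelihood(words, pair_counts, recipient_counts, corpus_counts,
                         weights=(0.6, 0.2, 0.2), floor=1e-9):
    """Sum of log interpolated word probabilities (assumed mixture, not confirmed)."""
    w_pair, w_rec, w_corpus = weights
    score = 0.0
    for word in words:
        p = (w_pair * mle(pair_counts, word)            # P(word|R,S)
             + w_rec * mle(recipient_counts, word)      # P(word|R)
             + w_corpus * mle(corpus_counts, word))     # P(word)
        score += math.log(max(p, floor))  # floor avoids log(0) for unseen words
    return score

# Toy counts (made up).
corpus = Counter("budget draft lunch friday budget report".split())
pair = Counter("budget draft".split())
recipient = Counter("budget report lunch".split())
score_budget = email_log_likelihood(["budget"], pair, recipient, corpus)
score_friday = email_log_likelihood(["friday"], pair, recipient, corpus)
```

A word that the sender and candidate have exchanged before ("budget") scores well, while a word seen only elsewhere in the corpus ("friday") still gets a small non-zero probability from the corpus model.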
  35. Recipient likelihood P(R): importance of node 1. Number of emails received 2. PageRank score • Sender likelihood P(S|R): strength of connection between two nodes 1. Number of emails sent between nodes 2. Number of times two nodes are addressed together
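Two of the signals listed here, the PageRank score and the co-addressing count, are sketched below. The PageRank version is a plain power iteration over sender-to-recipient edges; the edge direction, damping factor, iteration count, and function names are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (sender, recipient) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # dangling node: spread evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

def co_address_counts(recipient_lists):
    """Count how often each unordered pair of nodes is addressed in the same email."""
    counts = Counter()
    for recipients in recipient_lists:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[(a, b)] += 1
    return counts

# Toy data (made up).
rank = pagerank([("alice", "bob"), ("carol", "bob"), ("dave", "bob"), ("bob", "alice")])
co = co_address_counts([["bob", "carol"], ["carol", "bob", "dave"]])
```

In the toy graph bob receives mail from everyone and ends up with the highest score; repeated edges act as weights, so frequent correspondents pass on more rank.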
  36. SNA: 1. Importance of a node in the network 2. Strength of connection between nodes • Email content: 1. Interpersonal LM 2. Recipient LM 3. Corpus LM
  37. Approach: time-based • Training period: build models (SNA + LM) • Testing period: predict recipients
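A time-based split can be sketched as a cutoff on the sent date: everything before the cutoff is used to build the models, everything from the cutoff onward is held out for prediction. The record layout, field names, and dates below are hypothetical.

```python
from datetime import datetime

def split_by_time(emails, cutoff):
    """Split emails into a training period (before cutoff) and a testing period."""
    train = [e for e in emails if e["sent"] < cutoff]
    test = [e for e in emails if e["sent"] >= cutoff]
    return train, test

# Toy emails with made-up dates.
emails = [
    {"sender": "alice", "recipients": ["bob"], "sent": datetime(2001, 3, 1)},
    {"sender": "bob", "recipients": ["alice"], "sent": datetime(2001, 6, 15)},
    {"sender": "carol", "recipients": ["bob"], "sent": datetime(2001, 9, 30)},
]
train, test = split_by_time(emails, cutoff=datetime(2001, 7, 1))
```

Splitting by time rather than at random keeps the evaluation honest: the models never see emails sent after the ones they are asked to predict.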
  38. Testing • Remove recipients from email • Rank all nodes in the network by computing: 1. P(E|R,S): Similarity between sender and candidate LMs 2. P(S|R): Strength of connection between sender and candidate 3. P(R): Importance of candidate
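The ranking step can be sketched as scoring every candidate with the three quantities above and sorting. The example assumes the three scores are multiplied (added in log space) and that they have already been computed per candidate; the function name, the floor value, and the toy numbers are hypothetical.

```python
import math

def rank_recipients(candidates, email_loglik, sender_lik, recipient_prior, floor=1e-9):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R), best first."""
    scores = {}
    for r in candidates:
        scores[r] = (email_loglik[r]
                     + math.log(max(sender_lik.get(r, 0.0), floor))
                     + math.log(max(recipient_prior.get(r, 0.0), floor)))
    return sorted(candidates, key=lambda r: scores[r], reverse=True)

# Toy per-candidate scores (made up).
email_loglik = {"bob": -2.0, "carol": -2.5, "dave": -6.0}    # log P(E|R,S)
sender_lik = {"bob": 0.5, "carol": 0.4}                      # P(S|R); dave has never heard from this sender
recipient_prior = {"bob": 0.5, "carol": 0.3, "dave": 0.2}    # P(R)
ranking = rank_recipients(["carol", "dave", "bob"], email_loglik, sender_lik, recipient_prior)
```

The predicted ranking can then be compared against the recipients that were removed from the test email.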
  40. Findings: What works? • Importance of node: number of received emails of node; PageRank • Strength of connection: number of emails between nodes; number of times co-addressed • LM similarity: interpersonal LM is most important (60%-20%-20%)
  41. Analysis: SNA vs email content • SNA: SNA signals deteriorate over time; SNA signals are most informative on highly active users • Email content: LM signal improves over time; LM signal does worse with highly active users
  42. Finally • Combining Social Network Analysis with Language Modeling works better than using either one alone.
  43. Future work • Consider structure of network in more detail: Departments? Friends/family? • Include ‘time decay’ • Dynamically weight LM/SNA?
  44. Applications in E-Discovery/Digital Forensics • Anomaly detection: given a working prediction model, identify “unexpected” communication • Language models for communication: for a node, find the most different interpersonal communication (friends/family vs colleagues?); find communication that differs from the corpus-based communication
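One way to turn a prediction model into an anomaly detector is to flag actual recipients that the model ranks low or does not rank at all. The sketch below assumes a ranked candidate list is already available; the top-k threshold, the function name, and the toy data are hypothetical choices for this example.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=10):
    """Return actual recipients that fall outside the model's top_k predictions."""
    position = {r: i + 1 for i, r in enumerate(ranking)}
    unranked = len(ranking) + 1  # recipients the model never saw rank last
    return [r for r in actual_recipients if position.get(r, unranked) > top_k]

# Toy ranking and actual recipients (made up).
ranking = ["bob", "carol", "dave", "erin"]
flagged = unexpected_recipients(ranking, ["bob", "erin", "zed"], top_k=2)
```

Flagged recipients are candidates for closer inspection rather than proof of anything unusual, since a low rank may also reflect a new but legitimate contact.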
  45. Fin • Questions?