Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Published in: Data & Analytics, Technology

  1. Understanding email traffic. David Graus, University of Amsterdam, d.p.graus@uva.nl, @dvdgrs
  2. (slide has no transcribed text)
  3. Recipient recommendation
     • Given: a sender, an email, and all possible recipients (in an enterprise)
     • Predict: which recipient(s) are most likely to receive the email
  4. Why?
     • Understanding communication in, and the structure of, an enterprise
     • Applications in:
       • enterprise search
       • expert finding
       • community detection
       • spam classification
       • anomaly detection
  5. How?
     • Gmail
       • Who do you frequently “co-address”?
       • Ego network
     • Related work
       • Social Network Analysis (SNA)
       • Email content
     • Us
       • SNA + email content
  6. Part 1: Social Network Analysis? d.p.graus@uva.nl, z.ren@uva.nl, derijke@uva.nl
  7. Image by Calvinius (Creative Commons Attribution-Share Alike 3.0)
  8. SNA for predicting recipients?
     1. Importance of a node in the network: more important people are more likely to be the recipient of an email.
     2. Strength of connection between two nodes: given the sender of an email, the people that sender addresses frequently are more likely to be the recipient.
  9. SNA for predicting recipients?
     1. Importance of a node in the network
        1. Number of received emails
        2. PageRank score of the node
     2. Strength of connection between two nodes
        1. Number of emails sent between the nodes
        2. Number of times the two nodes are addressed together
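The four signals on this slide can all be derived from a plain list of (sender, recipients) records. The sketch below is only an illustration of that idea: the toy emails, the function names, the damping factor of 0.85, and the fixed number of iterations are assumptions, not details given in the talk.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy data (hypothetical): each email is (sender, [recipients]).
emails = [
    ("ann", ["bob", "cat"]),
    ("ann", ["bob"]),
    ("bob", ["ann"]),
    ("cat", ["bob", "ann"]),
]

def sna_counts(emails):
    """Count the three frequency-based signals named on the slide."""
    received = Counter()      # node importance: number of received emails
    sent_between = Counter()  # connection strength: emails sender -> recipient
    co_addressed = Counter()  # connection strength: addressed together
    for sender, recipients in emails:
        for r in recipients:
            received[r] += 1
            sent_between[(sender, r)] += 1
        # Sort so that each unordered pair always gets the same key.
        for a, b in combinations(sorted(set(recipients)), 2):
            co_addressed[(a, b)] += 1
    return received, sent_between, co_addressed

def pagerank(emails, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on the sender -> recipient graph."""
    out_links = defaultdict(set)
    nodes = set()
    for sender, recipients in emails:
        nodes.add(sender)
        for r in recipients:
            nodes.add(r)
            out_links[sender].add(r)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A node that never sent anything spreads its rank evenly.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

received, sent_between, co_addressed = sna_counts(emails)
ranks = pagerank(emails)
```

On a real mailbox the same counters would be updated incrementally as emails arrive, rather than recomputed from scratch.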
  10. Part 2: Email content
      • Statistical Language Models (LMs)
        • Assign a probability to a sequence of words
        • Compute models for different corpora
      • Used in lots of places:
        • Information Retrieval
        • Machine Translation
        • Speech Recognition
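To make "assign a probability to a sequence of words" concrete, here is a minimal unigram model with add-one smoothing, fitted to two different toy corpora. The talk does not say which model order or smoothing was used, so both are assumptions, as are the class name `UnigramLM` and the example texts.

```python
import math
from collections import Counter

class UnigramLM:
    """Unigram language model with add-one (Laplace) smoothing."""

    def __init__(self, texts, vocabulary):
        self.vocabulary = set(vocabulary)
        self.counts = Counter(
            w for t in texts for w in t.lower().split() if w in self.vocabulary
        )
        self.total = sum(self.counts.values())

    def prob(self, word):
        # Add-one smoothing keeps unseen vocabulary words above zero.
        return (self.counts[word] + 1) / (self.total + len(self.vocabulary))

    def log_prob(self, text):
        # Sum of per-word log-probabilities; out-of-vocabulary words are skipped.
        return sum(
            math.log(self.prob(w))
            for w in text.lower().split()
            if w in self.vocabulary
        )

# Two toy corpora (hypothetical) sharing one vocabulary.
legal = ["please review the contract", "the contract is signed"]
social = ["lunch at noon", "see you at lunch"]
vocab = {w for t in legal + social for w in t.split()}

legal_lm = UnigramLM(legal, vocab)
social_lm = UnigramLM(social, vocab)

query = "review the contract"
best = "legal" if legal_lm.log_prob(query) > social_lm.log_prob(query) else "social"
```

The same sequence gets a different probability under each corpus model, which is what makes the models usable as profiles.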
  11. Language Models
      • Language models as communication “profiles”
  12. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
  13. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
  14. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
        3. Interpersonal LM (how node1 talks with node2)
  15. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
        3. Interpersonal LM (how node1 talks with node2)
  16. Language Models
      • Language models as communication “profiles”
        1. Incoming LM (how people talk to user)
        2. Outgoing LM (how user talks to people)
        3. Interpersonal LM (how node1 talks with node2)
        4. Corpus LM (how everyone talks)
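One way to picture the four profiles is as four sets of word counts filled from the same email stream. The sketch below assumes whitespace tokenization and treats the interpersonal profile as undirected (one shared profile per pair of nodes); neither choice is stated in the talk, and the toy emails and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy emails (hypothetical): (sender, recipients, body text).
emails = [
    ("ann", ["bob"], "contract draft attached"),
    ("bob", ["ann", "cat"], "thanks for the draft"),
    ("cat", ["ann"], "lunch tomorrow"),
]

def build_profiles(emails):
    """Collect word counts for the four profile types on the slide."""
    incoming = defaultdict(Counter)       # how people talk to a user
    outgoing = defaultdict(Counter)       # how a user talks to people
    interpersonal = defaultdict(Counter)  # how node1 talks with node2
    corpus = Counter()                    # how everyone talks
    for sender, recipients, text in emails:
        words = text.lower().split()
        outgoing[sender].update(words)
        corpus.update(words)
        for r in recipients:
            incoming[r].update(words)
            # Assumption: both directions feed one shared pair profile.
            interpersonal[frozenset((sender, r))].update(words)
    return incoming, outgoing, interpersonal, corpus

def to_distribution(counts):
    """Turn raw counts into maximum-likelihood word probabilities."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

incoming, outgoing, interpersonal, corpus = build_profiles(emails)
ann_bob = to_distribution(interpersonal[frozenset(("ann", "bob"))])
```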
  17. Why language models?
      • Comparisons between communication profiles:
        • Find nodes with the most similar communication
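Comparing profiles needs some distance between word distributions. The talk does not name the measure it used; the sketch below swaps in Jensen-Shannon divergence purely as an illustrative choice, and the function names and toy distributions are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution halfway between p and q.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL divergence from a to the mixture; zero-probability terms add 0.
        return sum(
            a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)

def most_similar(target, profiles):
    """Return the name of the profile closest to the target distribution."""
    return min(profiles, key=lambda name: js_divergence(target, profiles[name]))

# Toy word distributions (hypothetical).
profiles = {
    "bob": {"contract": 0.5, "draft": 0.5},
    "cat": {"lunch": 1.0},
}
target = {"contract": 0.6, "draft": 0.4}
closest = most_similar(target, profiles)
```

With base-2 logarithms the divergence is 0 for identical distributions and 1 for distributions that share no words, which makes scores easy to compare across node pairs.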
  18. • SNA
        1. Importance of a node in the network
        2. Strength of connection between nodes
      • Email content
        1. Incoming LM
        2. Outgoing LM
        3. Interpersonal LM
        4. Corpus-based LM
  19. Approach: time-based
      • t=0: 1 email, 2 addresses
      • t=1: 2 emails, 2 addresses
      • t=2: 3 emails, 4 addresses
      • t=3: 4 emails, 5 addresses
      • etc.
      • t=n: 607,011 emails, 2,068 addresses
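The time-based setup amounts to replaying the emails in chronological order and letting the network grow one message at a time. A minimal sketch, assuming the emails are already sorted by timestamp; the toy stream and the function name `replay` are hypothetical, chosen so that the first four steps match the counts on the slide.

```python
# Toy stream (hypothetical), already sorted by timestamp:
# (t, sender, [recipients]).
emails = [
    (0, "ann", ["bob"]),
    (1, "bob", ["ann"]),
    (2, "ann", ["cat", "dan"]),
    (3, "eve", ["ann"]),
]

def replay(emails):
    """Yield (t, emails_seen, addresses_seen) as the network grows."""
    addresses = set()
    for seen, (t, sender, recipients) in enumerate(emails, start=1):
        addresses.add(sender)
        addresses.update(recipients)
        yield t, seen, len(addresses)

growth = list(replay(emails))
for t, n_emails, n_addresses in growth:
    print(f"t={t}: {n_emails} emails, {n_addresses} addresses")
```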
  20. At some time interval t
      • Given the email, sender, and network
      • Remove the recipients from the email
      • Rank all nodes in the network
      • By computing, for each candidate (recipient) node:
        1. Importance of the candidate
        2. Strength of connection between sender and candidate
        3. Similarity between sender and candidate LMs
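The ranking step can be sketched as scoring every candidate on the three signals and sorting. How the talk combines the signals is not stated here, so the weighted sum, the max-scaling of each signal, the default weights, and all names and numbers below are assumptions made for illustration.

```python
def rank_candidates(sender, candidates, importance, strength, similarity,
                    weights=(1.0, 1.0, 1.0)):
    """Rank candidate recipients by a weighted sum of three signals.

    importance: dict candidate -> non-negative score
    strength:   dict (sender, candidate) -> non-negative score
    similarity: dict (sender, candidate) -> non-negative score
    """
    signals = [
        {c: importance.get(c, 0.0) for c in candidates},
        {c: strength.get((sender, c), 0.0) for c in candidates},
        {c: similarity.get((sender, c), 0.0) for c in candidates},
    ]
    totals = {c: 0.0 for c in candidates}
    for weight, signal in zip(weights, signals):
        top = max(signal.values(), default=0.0)
        for c, value in signal.items():
            # Scale each signal to [0, 1] so its unit does not dominate.
            totals[c] += weight * (value / top if top > 0 else 0.0)
    return sorted(candidates, key=lambda c: totals[c], reverse=True)

# Toy scores (hypothetical) for an email sent by "ann".
nodes = ["ann", "bob", "cat", "dan"]
candidates = [n for n in nodes if n != "ann"]  # the sender is not a candidate
importance = {"bob": 3, "cat": 1, "dan": 2}
strength = {("ann", "bob"): 2, ("ann", "cat"): 0, ("ann", "dan"): 1}
similarity = {("ann", "bob"): 0.2, ("ann", "cat"): 0.9, ("ann", "dan"): 0.1}

combined = rank_candidates("ann", candidates, importance, strength, similarity)
content_only = rank_candidates("ann", candidates, importance, strength,
                               similarity, weights=(0.0, 0.0, 1.0))
```

Setting a weight to zero switches a signal off, which gives a simple way to compare SNA-only, content-only, and combined rankings on the same email.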
  21. (slide has no transcribed text)
  22. Findings: what works for predicting recipients?
      • Importance of node: number of emails the node has received
      • Strength of connection: number of emails between the nodes
      • LM similarity: the interpersonal LM is most important
  23. Findings: SNA vs. email content
      • SNA:
        • SNA signals deteriorate over time
        • SNA signals are most informative for highly active users
      • Email content:
        • LM signal improves over time
        • LM signal does worse with highly active users
  24. Finally
      • Combining Social Network Analysis with Language Modeling is better than doing either alone.
  25. Why for E-Discovery
      • Anomaly detection
        • Given a working prediction model, identify “unexpected” communication
      • Language models for communication
        • For a node, find the most different interpersonal communication
          • Friends/family vs. colleagues?
        • Find communication that differs from the corpus-based communication
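The anomaly-detection idea can be sketched as: run the recipient predictor on each email and flag the ones whose actual recipients were not among the top predictions. The cut-off `max_rank`, the function names, and the stand-in ranking function below are all hypothetical; any real predictor that returns a ranked list of nodes could be plugged in instead.

```python
def unexpected_emails(emails, rank_fn, max_rank=3):
    """Return (sender, surprising_recipients) for emails the model did not expect.

    rank_fn(sender, text) must return candidate recipients, best first.
    """
    flagged = []
    for sender, recipients, text in emails:
        top = set(rank_fn(sender, text)[:max_rank])
        surprising = [r for r in recipients if r not in top]
        if surprising:
            flagged.append((sender, surprising))
    return flagged

# Stand-in predictor (hypothetical): a fixed ranking per sender.
usual = {"ann": ["bob", "cat", "dan", "eve"]}

def toy_rank(sender, text):
    return usual.get(sender, [])

emails = [
    ("ann", ["bob"], "status update"),
    ("ann", ["eve"], "call me tonight"),
]
flagged = unexpected_emails(emails, toy_rank, max_rank=2)
```

A reviewer would then look at the flagged emails first, since they are the ones that deviate from the sender's usual communication pattern.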
