Understanding email traffic 
David Graus, University of Amsterdam 
d.p.graus@uva.nl 
@dvdgrs
Dec. 12, 2014 - Frontiers of Forensic Science 2 
Some background… 
• PhD candidate at ILPS 
• Information Extraction & Retrieval 
• Project in NWO’s Forensic Science program 
• Semantic Search in E-Discovery
Information Retrieval? 
Ò Finding material of unstructured nature from 
large collections
Information Extraction? 
Ò Text mining 
Ò Discovering patterns in text data
Semantic Search in E-Discovery? 
Semantic Search?
E-Discovery? 
• Retrieving and securing digital forensic evidence
Semantic Search in E-Discovery 
• Supporting search for digital forensic evidence
• from emails, hard drives, mobile phones, etc.
• not the open web
• (Google won’t help us here)
Search in E-Discovery 
• Finding out who knew what, from whom, and when
• We don’t know what we’re looking for
• What we’re looking for might be deliberately hidden
• Communication might be very domain-specific, contextualized or incomplete
Approach 
• Generic search is not the answer
• Google: high precision search
• E-Discovery: high recall & exploratory search
Tasks 
• Support iterative search
• Support (re)formulating questions and hypotheses
• Retrieve all relevant traces
Recipient recommendation 
Ò Given a sender, an email, all possible 
recipients (in an enterprise); 
Ò Predict which recipient(s) are most likely to 
receive the email 
Why? 
Ò Understanding communication in/structure of an 
enterprise 
Ò Finding “unexpected” communication 
Ò Applications in: 
Ò enterprise search 
Ò expert finding 
Ò community detection 
Ò spam classification 
Ò anomaly detection
How? 
Ò Gmail 
Ò Who do you frequently “co-address” 
Ò egonetwork 
Ò Related work 
Ò Social Network Analysis (SNA) 
Ò Email content 
Ò Us 
Ò SNA + email content
Part 1: Social Network Analysis? 
d.p.graus@uva.nl, z.ren@uva.nl, derijke@uva.nl
image by Calvinius - Creative Commons Attribution-Share Alike 3.0
SNA for predicting recipients? 
1. Importance of a node in the network
   Prior probability: more important people are more likely to be recipients of an(y) email
2. Connection strength between two nodes
   Conditional probability: given the sender, candidates strongly associated with that sender are more likely to be recipients
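The two SNA signals above can be sketched from raw email counts. This is a minimal illustration, not the authors' implementation: the function name `sna_estimates`, the `(sender, recipients)` input format, and the choice to estimate P(S|R) as the fraction of a recipient's received emails that came from the sender are all assumptions made for the example.

```python
from collections import Counter

def sna_estimates(emails):
    """emails: iterable of (sender, recipients) pairs (hypothetical format)."""
    received = Counter()  # number of emails received per node
    sent_to = Counter()   # number of emails per (sender, recipient) pair
    for sender, recipients in emails:
        for r in recipients:
            received[r] += 1
            sent_to[(sender, r)] += 1
    total = sum(received.values())
    # Prior P(R): share of all deliveries that went to R.
    prior = {r: n / total for r, n in received.items()}

    def sender_given_recipient(s, r):
        # Conditional P(S|R): share of R's received emails that came from S.
        return sent_to[(s, r)] / received[r] if received[r] else 0.0

    return prior, sender_given_recipient

emails = [("ann", ["bob", "cat"]), ("ann", ["bob"]), ("cat", ["bob"])]
prior, p_s_given_r = sna_estimates(emails)
```

In this toy network bob receives three of the four deliveries, so he gets the highest prior, and two of his three emails come from ann.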
Part 2: Email content 
Ò Statistical Language Models (LMs) 
Ò Assign a probability to [a sequence of] words; 
Ò By counting words 
Ò Used in lots of places; 
Ò Web Search 
Ò Machine Translation 
Ò Speech Recognition
Language Models 
Ò Language models as communication “profiles” 
1. Incoming LM (how people talk to user) 
2. Outgoing LM (how user talks to people) 
3. Interpersonal LM (how node1 
talks with node2) 
4. Corpus LM (how everyone 
talks)
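The four profiles listed above could be collected in a single pass over the emails. This is a sketch under assumptions: the `(sender, recipients, text)` tuple format and the function names are hypothetical, and the interpersonal profile is treated as undirected (one shared profile per pair), which the slides do not specify.

```python
from collections import Counter, defaultdict

def build_profiles(emails):
    """emails: iterable of (sender, recipients, text) tuples (hypothetical format)."""
    incoming = defaultdict(Counter)       # words in emails a user receives
    outgoing = defaultdict(Counter)       # words in emails a user sends
    interpersonal = defaultdict(Counter)  # words exchanged within a pair of users
    corpus = Counter()                    # words in all emails
    for sender, recipients, text in emails:
        words = text.lower().split()
        outgoing[sender].update(words)
        corpus.update(words)
        for r in recipients:
            incoming[r].update(words)
            # Assumption: one undirected profile per pair of users.
            interpersonal[frozenset((sender, r))].update(words)
    return incoming, outgoing, interpersonal, corpus

def normalize(counts):
    """Turn word counts into a probability distribution."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

emails = [
    ("ann", ["bob"], "budget meeting today"),
    ("bob", ["ann", "cat"], "budget approved"),
    ("cat", ["ann"], "lunch today"),
]
incoming, outgoing, interpersonal, corpus = build_profiles(emails)
```

Each count table becomes a language model once passed through `normalize`.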
Why language models? 
Ò Comparisons between communication profiles: 
Ò Find nodes with most similar communication
Model 
Ò Given sender and email, predict recipients 
Ò Ranking function:
Email likelihood 
Estimate using language modeling 
Sender likelihood 
using SNA to estimate closeness of R and S 
Recipient likelihood 
using SNA to estimate importance of R 
Email likelihood 
P(word|R,S), P(word|R), P(word)
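A common way to combine the three word probabilities is linear interpolation, sketched below. The slides do not state the exact combination, so the mixture form is an assumption. The default weights mirror the 60%-20%-20% split reported in the findings, assuming it refers to the interpersonal, recipient, and corpus models in that order. The `floor` parameter is a hypothetical guard against words unseen in all three models.

```python
import math

def email_log_likelihood(words, interpersonal_lm, recipient_lm, corpus_lm,
                         weights=(0.6, 0.2, 0.2), floor=1e-9):
    """Log P(E|R,S) under a linear mixture of three unigram models.

    Each *_lm is a dict mapping word -> probability:
      interpersonal_lm ~ P(word|R,S), recipient_lm ~ P(word|R), corpus_lm ~ P(word).
    """
    w_pair, w_recipient, w_corpus = weights
    total = 0.0
    for word in words:
        p = (w_pair * interpersonal_lm.get(word, 0.0)
             + w_recipient * recipient_lm.get(word, 0.0)
             + w_corpus * corpus_lm.get(word, 0.0))
        total += math.log(max(p, floor))
    return total

score = email_log_likelihood(
    ["budget"],
    interpersonal_lm={"budget": 0.5},
    recipient_lm={"budget": 0.25},
    corpus_lm={"budget": 0.1},
)
```

Because the corpus model covers the whole vocabulary, a word the sender never used with this recipient still gets a non-zero probability.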
Recipient likelihood: P(R)
Importance of node
1. Number of emails received
2. PageRank score
Sender likelihood: P(S|R)
Strength of connection between two nodes
1. Number of emails sent between nodes
2. Number of times two nodes are addressed together
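Two of the features above, the PageRank score and the co-addressing count, can be sketched as follows. This is an illustrative sketch only: the weighted power-iteration PageRank, the damping factor of 0.85, the iteration count, and the input formats are assumptions, not details taken from the slides.

```python
from collections import Counter
from itertools import combinations

def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    edges: dict mapping (sender, recipient) -> number of emails sent.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (sender, _), weight in edges.items():
        out_weight[sender] += weight
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes that never send mail is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for (sender, recipient), weight in edges.items():
            new_rank[recipient] += damping * rank[sender] * weight / out_weight[sender]
        rank = new_rank
    return rank

def co_address_counts(recipient_lists):
    """Count how often two nodes appear together as recipients of one email."""
    counts = Counter()
    for recipients in recipient_lists:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[(a, b)] += 1
    return counts

ranks = pagerank({("ann", "bob"): 3, ("cat", "bob"): 2, ("bob", "ann"): 1})
together = co_address_counts([["bob", "cat"], ["cat", "bob", "dan"], ["bob"]])
```

In the toy graph bob receives mail from both other nodes and ends up with the highest score, while cat, who receives nothing, ends up with the lowest.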
SNA
1. Importance of a node in the network
2. Strength of connection between nodes
Email Content
1. Interpersonal LM
2. Recipient LM
3. Corpus LM
Approach: time-based 
Timeline:
• Training period: build models (SNA + LM)
• Testing period: predict recipients
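A time-based split of this kind can be sketched as below. The dict-based email format with a `date` key and the single `cutoff` parameter are hypothetical; the slides do not say how the periods were chosen.

```python
from datetime import datetime

def time_split(emails, cutoff):
    """Split emails into a training period (before cutoff) and a testing period."""
    ordered = sorted(emails, key=lambda e: e["date"])
    train = [e for e in ordered if e["date"] < cutoff]
    test = [e for e in ordered if e["date"] >= cutoff]
    return train, test

emails = [
    {"id": 2, "date": datetime(2001, 3, 1)},
    {"id": 1, "date": datetime(2001, 1, 15)},
    {"id": 3, "date": datetime(2001, 6, 30)},
]
train, test = time_split(emails, cutoff=datetime(2001, 4, 1))
```

Splitting on time rather than at random keeps the models from seeing communication that happens after the emails they are asked to predict.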
Testing 
Ò Remove recipients from email 
Ò Rank all nodes in the network, by computing: 
1. P(E|R,S): Similarity between sender and 
candidate LMs 
2. P(S|R): Strength of connection between 
sender and candidate 
3. P(R): Importance of candidate
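The ranking step can be sketched by summing the three components in log space. The function below assumes the component values have already been computed for one sender and one email and are passed in as dicts keyed by candidate; that interface, the function name, and the `floor` guard are assumptions for illustration.

```python
import math

def rank_recipients(candidates, email_loglik, sender_lik, recipient_prior,
                    floor=1e-9):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R).

    email_loglik[r]    ~ log P(E|R,S), already in log space
    sender_lik[r]      ~ P(S|R)
    recipient_prior[r] ~ P(R)
    """
    scores = {}
    for r in candidates:
        scores[r] = (email_loglik[r]
                     + math.log(max(sender_lik[r], floor))
                     + math.log(max(recipient_prior[r], floor)))
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_recipients(
    candidates=["bob", "cat", "dan"],
    email_loglik={"bob": -5.0, "cat": -4.0, "dan": -9.0},
    sender_lik={"bob": 0.5, "cat": 0.1, "dan": 0.4},
    recipient_prior={"bob": 0.4, "cat": 0.3, "dan": 0.3},
)
```

Here cat matches the email text slightly better than bob, but bob's stronger connection to the sender puts him first. Working in log space avoids numerical underflow when long emails produce very small likelihoods.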
Findings: What works? 
Ò Importance of node: 
Number of received emails of node 
Pagerank 
Ò Strength of connection: 
Number of emails between nodes 
Number of times co-addressed 
Ò LM Similarity: 
Interpersonal LM is most important (60%-20%-20%)
Analysis: SNA vs email content 
Ò SNA: 
Ò SNA signals deteriorate over time 
Ò SNA signals are most informative on highly 
active users 
Ò Email content: 
Ò LM signal improves over time 
Ò LM signal does worse with highly active users
Finally 
Ò Combining Social Network Analysis with 
Language Modeling is better than doing either.
Future work 
Ò Consider structure of network in more detail 
Ò Departments? 
Ò Friends/family? 
Ò Include ‘time decay’ 
Ò Dynamically weight LM/SNA?
Applications in E-Discovery/Digital Forensics 
Ò Anomaly detection 
Ò Given a working prediction model; identify 
“unexpected” communication 
Ò Language models for communication 
Ò For a node, find the most different 
interpersonal communication 
Ò Friends/family vs colleagues? 
Ò Find communication that differs from the 
corpus-based communication
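One possible reading of "unexpected" communication is an email whose actual recipients the prediction model ranked low. The sketch below flags such recipients; the function name and the `top_k` threshold are hypothetical, and the slides do not commit to this particular criterion.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=10):
    """Return actual recipients that the model did not place in its top_k.

    ranking: candidate recipients ordered from most to least likely.
    """
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]

predicted = ["bob", "cat", "dan", "eve"]
flagged = unexpected_recipients(predicted, ["bob", "eve"], top_k=2)
```

In the toy call, bob was predicted and is not flagged, while eve falls outside the top two and would be surfaced for review.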
Fin 
Ò Questions?

More Related Content

Similar to Understanding Email Traffic

Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Julien PLU
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & Provenance
Bertram Ludäscher
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
DataTactics
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
Rich Heimann
 
Understanding Email Traffic (talk @ E-Discovery NL Symposium)
Understanding Email Traffic (talk @ E-Discovery NL Symposium)Understanding Email Traffic (talk @ E-Discovery NL Symposium)
Understanding Email Traffic (talk @ E-Discovery NL Symposium)
David Graus
 
04 pisa final_event_111214_wp1_dg
04 pisa final_event_111214_wp1_dg04 pisa final_event_111214_wp1_dg
04 pisa final_event_111214_wp1_dg
Digitised Manuscripts to Europeana
 
Genericity versus expressivity – reflections about the semantics of interoper...
Genericity versus expressivity – reflections about the semantics of interoper...Genericity versus expressivity – reflections about the semantics of interoper...
Genericity versus expressivity – reflections about the semantics of interoper...
Andrea Scharnhorst
 
Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...
Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...
Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...
CASRAI
 
Linked Data for Knowledge Discovery: Introduction
Linked Data for Knowledge Discovery: IntroductionLinked Data for Knowledge Discovery: Introduction
Linked Data for Knowledge Discovery: Introduction
Mathieu d'Aquin
 
4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...
4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...
4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...
DuraSpace
 
Dm2 e okfn-infoday_scholarly_activities_18_nov
Dm2 e okfn-infoday_scholarly_activities_18_novDm2 e okfn-infoday_scholarly_activities_18_nov
Dm2 e okfn-infoday_scholarly_activities_18_nov
Digitised Manuscripts to Europeana
 
Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)
Krishnaram Kenthapadi
 
Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...
Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...
Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...
Digitised Manuscripts to Europeana
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
Marina Santini
 
Quantifying the bias in data links
Quantifying the bias in data linksQuantifying the bias in data links
Quantifying the bias in data links
Vrije Universiteit Amsterdam
 
Natural Language Processing with Graphs
Natural Language Processing with GraphsNatural Language Processing with Graphs
Natural Language Processing with Graphs
Neo4j
 
08b final event_experimente
08b final event_experimente08b final event_experimente
08b final event_experimente
Digitised Manuscripts to Europeana
 
DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1
Pier Luca Lanzi
 
Graph Query Languages: update from LDBC
Graph Query Languages: update from LDBCGraph Query Languages: update from LDBC
Graph Query Languages: update from LDBC
Juan Sequeda
 
Visual Resources Librarianship and Information Literacy: using the Metalitera...
Visual Resources Librarianship and Information Literacy: using the Metalitera...Visual Resources Librarianship and Information Literacy: using the Metalitera...
Visual Resources Librarianship and Information Literacy: using the Metalitera...
UCD Library
 

Similar to Understanding Email Traffic (20)

Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & Provenance
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Understanding Email Traffic (talk @ E-Discovery NL Symposium)
Understanding Email Traffic (talk @ E-Discovery NL Symposium)Understanding Email Traffic (talk @ E-Discovery NL Symposium)
Understanding Email Traffic (talk @ E-Discovery NL Symposium)
 
04 pisa final_event_111214_wp1_dg
04 pisa final_event_111214_wp1_dg04 pisa final_event_111214_wp1_dg
04 pisa final_event_111214_wp1_dg
 
Genericity versus expressivity – reflections about the semantics of interoper...
Genericity versus expressivity – reflections about the semantics of interoper...Genericity versus expressivity – reflections about the semantics of interoper...
Genericity versus expressivity – reflections about the semantics of interoper...
 
Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...
Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...
Project Credit: Clifford Lynch - Developing a contributor role taxonomy for s...
 
Linked Data for Knowledge Discovery: Introduction
Linked Data for Knowledge Discovery: IntroductionLinked Data for Knowledge Discovery: Introduction
Linked Data for Knowledge Discovery: Introduction
 
4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...
4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...
4.2.15 Slides, “Hydra: many heads, many connections. Enriching Fedora Reposit...
 
Dm2 e okfn-infoday_scholarly_activities_18_nov
Dm2 e okfn-infoday_scholarly_activities_18_novDm2 e okfn-infoday_scholarly_activities_18_nov
Dm2 e okfn-infoday_scholarly_activities_18_nov
 
Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)
Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)
 
Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...
Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...
Beyond Infrastructure - Stefan Gradmann (Leipzig Digital Humanities Seminar, ...
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
 
Quantifying the bias in data links
Quantifying the bias in data linksQuantifying the bias in data links
Quantifying the bias in data links
 
Natural Language Processing with Graphs
Natural Language Processing with GraphsNatural Language Processing with Graphs
Natural Language Processing with Graphs
 
08b final event_experimente
08b final event_experimente08b final event_experimente
08b final event_experimente
 
DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1
 
Graph Query Languages: update from LDBC
Graph Query Languages: update from LDBCGraph Query Languages: update from LDBC
Graph Query Languages: update from LDBC
 
Visual Resources Librarianship and Information Literacy: using the Metalitera...
Visual Resources Librarianship and Information Literacy: using the Metalitera...Visual Resources Librarianship and Information Literacy: using the Metalitera...
Visual Resources Librarianship and Information Literacy: using the Metalitera...
 

More from David Graus

Pragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsPragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientists
David Graus
 
Bias in Recommendations
Bias in RecommendationsBias in Recommendations
Bias in Recommendations
David Graus
 
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
David Graus
 
CAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactCAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for Impact
David Graus
 
Opening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsOpening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender Systems
David Graus
 
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyZoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
David Graus
 
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesLayman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
David Graus
 
Financial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamFinancial News Mining @ PyData Amsterdam
Financial News Mining @ PyData Amsterdam
David Graus
 
De Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDe Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgeven
David Graus
 
Financial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoFinancial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.info
David Graus
 
Big Data & Machine Learning - Mogelijkheden & Valkuilen
Big Data & Machine Learning - Mogelijkheden & ValkuilenBig Data & Machine Learning - Mogelijkheden & Valkuilen
Big Data & Machine Learning - Mogelijkheden & Valkuilen
David Graus
 
Analyzing and Predicting Task Reminders
Analyzing and Predicting Task RemindersAnalyzing and Predicting Task Reminders
Analyzing and Predicting Task Reminders
David Graus
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
David Graus
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
David Graus
 
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus
 
Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams
Generating Pseudo-ground Truth for Detecting New Concepts in Social StreamsGenerating Pseudo-ground Truth for Detecting New Concepts in Social Streams
Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams
David Graus
 
yourHistory - entity linking for a personalized timeline of historic events
yourHistory - entity linking for a personalized timeline of historic eventsyourHistory - entity linking for a personalized timeline of historic events
yourHistory - entity linking for a personalized timeline of historic events
David Graus
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-Discovery
David Graus
 
Semantic Annotation of the Cyttron Database
Semantic Annotation of the Cyttron DatabaseSemantic Annotation of the Cyttron Database
Semantic Annotation of the Cyttron Database
David Graus
 
Semantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationSemantic annotation, clustering and visualization
Semantic annotation, clustering and visualization
David Graus
 

More from David Graus (20)

Pragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsPragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientists
 
Bias in Recommendations
Bias in RecommendationsBias in Recommendations
Bias in Recommendations
 
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
 
CAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactCAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for Impact
 
Opening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsOpening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender Systems
 
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyZoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
 
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesLayman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
 
Financial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamFinancial News Mining @ PyData Amsterdam
Financial News Mining @ PyData Amsterdam
 
De Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDe Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgeven
 
Financial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoFinancial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.info
 
Big Data & Machine Learning - Mogelijkheden & Valkuilen
Big Data & Machine Learning - Mogelijkheden & ValkuilenBig Data & Machine Learning - Mogelijkheden & Valkuilen
Big Data & Machine Learning - Mogelijkheden & Valkuilen
 
Analyzing and Predicting Task Reminders
Analyzing and Predicting Task RemindersAnalyzing and Predicting Task Reminders
Analyzing and Predicting Task Reminders
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
 
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
 
Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams
Generating Pseudo-ground Truth for Detecting New Concepts in Social StreamsGenerating Pseudo-ground Truth for Detecting New Concepts in Social Streams
Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams
 
yourHistory - entity linking for a personalized timeline of historic events
yourHistory - entity linking for a personalized timeline of historic eventsyourHistory - entity linking for a personalized timeline of historic events
yourHistory - entity linking for a personalized timeline of historic events
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-Discovery
 
Semantic Annotation of the Cyttron Database
Semantic Annotation of the Cyttron DatabaseSemantic Annotation of the Cyttron Database
Semantic Annotation of the Cyttron Database
 
Semantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationSemantic annotation, clustering and visualization
Semantic annotation, clustering and visualization
 

Recently uploaded

06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 

Understanding Email Traffic

  • 1. Understanding email traffic. David Graus, University of Amsterdam, d.p.graus@uva.nl, @dvdgrs
  • 2. Dec. 12, 2014 - Frontiers of Forensic Science. Some background…
      • PhD candidate at ILPS
      • Information Extraction & Retrieval
      • Project in NWO’s Forensic Science program
      • Semantic Search in E-Discovery
  • 3. Some background…
      • PhD candidate at ILPS
      • Information Extraction & Retrieval
      • Project in NWO’s Forensic Science program
      • Semantic Search in E-Discovery
  • 4. Information Retrieval?
  • 5. Information Retrieval?
      • Finding material of unstructured nature from large collections
  • 6. Information Extraction?
      • Text mining
      • Discovering patterns in text data
  • 7. Semantic Search in E-Discovery?
  • 8. Semantic Search?
  • 9. E-Discovery?
      • Retrieving and securing digital forensic evidence
  • 10. E-Discovery
      • Semantic Search in E-Discovery
  • 11. Semantic Search in E-Discovery
      • Supporting search for digital forensic evidence
          • from emails, hard drives, mobile phones, etc.
          • not the open web
          • (Google won’t help us here)
  • 12. Search in E-Discovery
      • Finding out who knew what, from whom, and when
      • We don’t know what we’re looking for
      • What we’re looking for might be deliberately hidden
      • Communication might be very domain-specific, contextualized or incomplete
  • 13. Approach
      • Generic search is not the answer
      • Google: high precision search
      • E-Discovery: high recall & exploratory search
  • 14. Tasks
      • Support iterative search
      • Support (re)formulating questions and hypotheses
      • Retrieve all relevant traces
  • 17. Recipient recommendation
      • Given a sender, an email, all possible recipients (in an enterprise);
      • Predict which recipient(s) are most likely to receive the email
  • 18. Why?
      • Understanding communication in/structure of an enterprise
      • Finding “unexpected” communication
      • Applications in:
          • enterprise search
          • expert finding
          • community detection
          • spam classification
          • anomaly detection
  • 19. How?
      • Gmail
          • Who do you frequently “co-address”
          • egonetwork
      • Related work
          • Social Network Analysis (SNA)
          • Email content
      • Us
          • SNA + email content
  • 20. Part 1: Social Network Analysis? d.p.graus@uva.nl, z.ren@uva.nl, derijke@uva.nl
  • 21. image by Calvinius - Creative Commons Attribution-Share Alike 3.0
  • 22. SNA for predicting recipients?
      1. Importance of a node in the network
          • Prior probability
          • More important people are more likely to be recipients of an(y) email
      2. Connection strength between two nodes
          • Conditional probability
          • Given the sender, the recipients who are strongly associated are more likely to be the recipient
  • 23. Part 2: Email content
      • Statistical Language Models (LMs)
          • Assign a probability to [a sequence of] words;
          • By counting words
      • Used in lots of places;
          • Web Search
          • Machine Translation
          • Speech Recognition
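The "assign a probability to words by counting" idea can be illustrated with a minimal unigram model. This is a sketch under stated assumptions, not the model used in the talk: the function names `unigram_lm` and `sequence_probability` are hypothetical, tokenization is plain whitespace splitting, and no smoothing is applied.

```python
from collections import Counter


def unigram_lm(texts):
    """Maximum-likelihood unigram model: P(word) = count(word) / total words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def sequence_probability(lm, text):
    """Probability of a word sequence, treating words as independent draws."""
    p = 1.0
    for word in text.lower().split():
        p *= lm.get(word, 0.0)  # unseen words get probability 0 (no smoothing)
    return p


lm = unigram_lm(["the meeting is at noon", "the report is late"])
```

Because unseen words receive probability zero here, a practical system would typically smooth the estimates, for instance by mixing in a corpus-wide model.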
  • 24. Language Models
      • Language models as communication “profiles”
  • 25. Language Models
      • Language models as communication “profiles”
          1. Incoming LM (how people talk to user)
  • 26. Language Models
      • Language models as communication “profiles”
          1. Incoming LM (how people talk to user)
          2. Outgoing LM (how user talks to people)
  • 27. Language Models
      • Language models as communication “profiles”
          1. Incoming LM (how people talk to user)
          2. Outgoing LM (how user talks to people)
          3. Interpersonal LM (how node1 talks with node2)
  • 28. Language Models
      • Language models as communication “profiles”
          1. Incoming LM (how people talk to user)
          2. Outgoing LM (how user talks to people)
          3. Interpersonal LM (how node1 talks with node2)
  • 29. Language Models
      • Language models as communication “profiles”
          1. Incoming LM (how people talk to user)
          2. Outgoing LM (how user talks to people)
          3. Interpersonal LM (how node1 talks with node2)
          4. Corpus LM (how everyone talks)
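The four profiles listed above could be collected in a single pass over the mailbox. The sketch below is illustrative only: the `(sender, recipients, text)` tuple format and the name `build_profiles` are assumptions, and the interpersonal profile is keyed on the unordered pair of nodes, which is one possible reading of "how node1 talks with node2".

```python
from collections import Counter, defaultdict


def build_profiles(emails):
    """emails: iterable of (sender, recipients, text) tuples (hypothetical format)."""
    incoming = defaultdict(Counter)       # how people talk to a user
    outgoing = defaultdict(Counter)       # how a user talks to people
    interpersonal = defaultdict(Counter)  # how two nodes talk with each other
    corpus = Counter()                    # how everyone talks
    for sender, recipients, text in emails:
        words = text.lower().split()
        corpus.update(words)
        outgoing[sender].update(words)
        for recipient in recipients:
            incoming[recipient].update(words)
            # Assumption: the pair is unordered, so both directions share a profile.
            interpersonal[frozenset((sender, recipient))].update(words)
    return incoming, outgoing, interpersonal, corpus
```

Each profile is kept as raw word counts, so any of them can be normalized into a probability distribution when needed.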
  • 30. Why language models?
      • Comparisons between communication profiles:
          • Find nodes with most similar communication
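One way to compare two communication profiles is a divergence between their word distributions. The slides do not say which measure was used; the sketch below swaps in Jensen-Shannon divergence as an assumption, and the names `distribution`, `js_divergence`, and `most_similar` are hypothetical.

```python
import math
from collections import Counter


def distribution(counts):
    """Normalize raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint profiles."""
    def kl(a, m):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)

    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def most_similar(target, profiles):
    """Return the name of the profile closest to profiles[target]."""
    p = distribution(profiles[target])
    others = (name for name in profiles if name != target)
    return min(others, key=lambda name: js_divergence(p, distribution(profiles[name])))
```

Jensen-Shannon divergence is symmetric and stays finite when one profile contains words the other lacks, which makes it convenient for sparse email vocabularies.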
  • 31. Model
      • Given sender and email, predict recipients
      • Ranking function:
  • 32. Email likelihood: estimate using language modeling. Sender likelihood: using SNA to estimate closeness of R and S. Recipient likelihood: using SNA to estimate importance of R.
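The ranking function itself did not survive in this transcript. A form consistent with the three components named above (email likelihood, sender likelihood, recipient likelihood) follows from Bayes' rule; this is a reconstruction under that assumption, not necessarily the exact formula shown on the slide. Here R is a candidate recipient, S the sender, and E the email.

```latex
\[
P(R \mid S, E)
  = \frac{P(E \mid R, S)\, P(S \mid R)\, P(R)}{P(S, E)}
  \;\propto\;
  \underbrace{P(E \mid R, S)}_{\text{email likelihood}}
  \cdot
  \underbrace{P(S \mid R)}_{\text{sender likelihood}}
  \cdot
  \underbrace{P(R)}_{\text{recipient likelihood}}
\]
```

The denominator P(S, E) is the same for every candidate R, so it can be dropped when ranking.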
  • 33. Email likelihood
  • 34. Email likelihood: P(word|R,S), P(word|R), P(word)
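The three word probabilities suggest that the email likelihood mixes an interpersonal, a recipient, and a corpus model. A minimal sketch, assuming a linear interpolation of unigram estimates: the function name `email_log_likelihood` is hypothetical, and the default weights follow the 60%-20%-20% split reported on the findings slide, where assigning 20% each to the recipient and corpus models is an assumption.

```python
import math
from collections import Counter


def email_log_likelihood(words, interpersonal, recipient, corpus,
                         weights=(0.6, 0.2, 0.2)):
    """log P(E|R,S) under a linear mixture of three unigram count tables."""
    def prob(counts, word):
        total = sum(counts.values())
        return counts.get(word, 0) / total if total else 0.0

    w_pair, w_recipient, w_corpus = weights
    score = 0.0
    for word in words:
        p = (w_pair * prob(interpersonal, word)      # P(word|R,S)
             + w_recipient * prob(recipient, word)   # P(word|R)
             + w_corpus * prob(corpus, word))        # P(word)
        if p == 0.0:
            return float("-inf")  # word unseen in all three models
        score += math.log(p)
    return score
```

Mixing in the corpus model acts as smoothing: a word the sender and candidate never exchanged still gets a non-zero probability as long as it occurs somewhere in the corpus.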
  • 35. Recipient Likelihood P(R) and Sender Likelihood P(S|R)
      • Recipient Likelihood P(R): importance of node
          1. Number of emails received
          2. PageRank score
      • Sender Likelihood P(S|R): strength of connection between two nodes
          1. Number of emails sent between nodes
          2. Number of times two nodes are addressed together
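The simplest of these signals, the count-based ones, can be turned into probability estimates directly. The sketch below covers only "number of emails received" for P(R) and "number of emails sent between nodes" for P(S|R); the `(sender, recipients)` input format and the name `sna_estimates` are assumptions, and PageRank and co-addressing counts are left out.

```python
from collections import Counter


def sna_estimates(emails):
    """emails: iterable of (sender, recipients) pairs (hypothetical format)."""
    received = Counter()  # number of emails received per node
    sent_to = Counter()   # number of emails sent from a sender to a recipient
    for sender, recipients in emails:
        for r in recipients:
            received[r] += 1
            sent_to[(sender, r)] += 1
    total = sum(received.values())

    def p_recipient(r):
        """P(R): share of all deliveries that went to r."""
        return received[r] / total if total else 0.0

    def p_sender_given_recipient(s, r):
        """P(S|R): share of r's incoming emails that came from s."""
        return sent_to[(s, r)] / received[r] if received[r] else 0.0

    return p_recipient, p_sender_given_recipient
```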
  • 36. SNA and Email Content
      • SNA
          1. Importance of a node in the network
          2. Strength of connection between nodes
      • Email Content
          1. Interpersonal LM
          2. Recipient LM
          3. Corpus LM
  • 37. Approach: time-based
      • Training period: build models (SNA + LM)
      • Testing period: predict recipients
  • 38. Testing period: predict recipients
      • Remove recipients from email
      • Rank all nodes in the network, by computing:
          1. P(E|R,S): Similarity between sender and candidate LMs
          2. P(S|R): Strength of connection between sender and candidate
          3. P(R): Importance of candidate
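The time-based setup and the ranking step can be sketched as follows. This is an illustration under stated assumptions: emails are dicts with a `time` key, the three probabilities per candidate are supplied as precomputed dictionaries, and they are combined as a product (a sum of logs), which matches the three listed components but is not confirmed as the exact scoring used in the talk. The names `split_by_time` and `rank_candidates` are hypothetical.

```python
import math


def split_by_time(emails, cutoff):
    """Emails before the cutoff build the models; the rest are used for testing."""
    train = [e for e in emails if e["time"] < cutoff]
    test = [e for e in emails if e["time"] >= cutoff]
    return train, test


def rank_candidates(candidates, email_lik, sender_lik, recipient_lik):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R).

    Each of the last three arguments maps a candidate to a probability.
    """
    def score(r):
        probs = (email_lik[r], sender_lik[r], recipient_lik[r])
        if min(probs) <= 0.0:
            return float("-inf")  # any zero component rules the candidate out
        return sum(math.log(p) for p in probs)

    return sorted(candidates, key=score, reverse=True)
```

At test time the true recipients are withheld, every node in the network is scored, and the position of the true recipients in the resulting ranking measures how well the model predicts them.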
  • 40. Findings: What works?
      • Importance of node:
          • Number of received emails of node
          • PageRank
      • Strength of connection:
          • Number of emails between nodes
          • Number of times co-addressed
      • LM Similarity:
          • Interpersonal LM is most important (60%-20%-20%)
  • 41. Analysis: SNA vs email content
      • SNA:
          • SNA signals deteriorate over time
          • SNA signals are most informative on highly active users
      • Email content:
          • LM signal improves over time
          • LM signal does worse with highly active users
  • 42. Finally
      • Combining Social Network Analysis with Language Modeling is better than using either alone.
  • 43. Future work
      • Consider structure of network in more detail
          • Departments?
          • Friends/family?
      • Include ‘time decay’
      • Dynamically weight LM/SNA?
  • 44. Applications in E-Discovery/Digital Forensics
      • Anomaly detection
          • Given a working prediction model; identify “unexpected” communication
      • Language models for communication
          • For a node, find the most different interpersonal communication
              • Friends/family vs colleagues?
          • Find communication that differs from the corpus-based communication
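The anomaly-detection idea, using a working prediction model to surface "unexpected" communication, can be sketched as a check of actual recipients against the model's ranking. This is a hypothetical illustration: the function name `unexpected_recipients` and the `top_k` threshold are assumptions, not something specified in the talk.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=3):
    """Return the actual recipients that the model did not place in its top_k."""
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]
```

An email whose real recipients all fall outside the predicted top of the ranking is a candidate for closer inspection, since the model considered that communication unlikely.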
  • 45. Fin
      • Questions?