Automated Content Labeling using Context in Email


Through a recent survey, we observe that a significant percentage of people still share photos through email. When a user composes an email with an attachment, he or she most likely talks about the attachment in the body of the email. In this paper, we develop a supervised machine learning framework to extract keywords relevant to the attachments from the body of the email. The extracted keywords can then be stored as an extended attribute of the file on the local filesystem. Both desktop indexing software and web-based image search portals can leverage the context-enriched keywords to enable a richer search experience. Our results on the public Enron email dataset show that the proposed extraction framework provides both high precision and recall. As a proof of concept, we have also implemented a version of the proposed algorithms as a Mozilla Thunderbird Add-on.

Slide 1: Automated Content Labeling using Context in Email
Aravindan Raghuveer, Yahoo! Inc, Bangalore.
Slide 2: At least one email with an attachment in the last 2 weeks?
Image source: http://www.crcna.org/site_uploads/uploads/osjha/mdgs/mdghands.jpg
Slide 3: Introduction: The "What?"
- Email attachments are a very popular mechanism to exchange content, both in our personal and office worlds.
- The one-liner: the email usually contains a crisp description of the attachment. Can we auto-generate tags for the attachment from the email?
Slide 4: Introduction: The "Why?"
- Tags can be stored as extended attributes of files.
- Applications like desktop search can use these tags for building indexes.
- Tags are generated even without parsing the attachment content:
  - useful for images;
  - in some cases, the tags carry more context than the attachment itself.
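As a minimal sketch of how mined tags could be persisted, assuming a Linux filesystem with user extended attributes enabled (os.setxattr/os.getxattr are Linux-only; the attribute name and the example tags are illustrative, not from the talk):

```python
import os

def store_tags(path: str, tags: list[str]) -> None:
    # Persist mined keywords as a user extended attribute (Linux-only).
    os.setxattr(path, "user.email.tags", ",".join(tags).encode("utf-8"))

def read_tags(path: str) -> list[str]:
    # Desktop indexers could read the same attribute when building indexes.
    return os.getxattr(path, "user.email.tags").decode("utf-8").split(",")

store_tags("vacation.jpg", ["beach", "trip", "photos"])
print(read_tags("vacation.jpg"))  # ['beach', 'trip', 'photos']
```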
Slide 5: Outline
- Problem Statement
- Challenges
- Overview of Solution
- The Dataset: Quirks and Observations
- Feature Design
- Experiments and Overview of Results
- Conclusion
Slide 6: Problem Statement
"Given an email E that has a set of attachments A, find the set of words K_EA that appear in E and are relevant to the attachments A."
Subproblems:
- Sentence selection
- Keyword selection
Slide 7: Overview of Solution
- Solve a binary classification problem: is a given sentence relevant to the attachments or not?
- Two-part solution:
  - What features to use? (The most insightful and interesting aspect of this work, and the focus of this talk.)
  - What classification algorithm to use?
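A minimal sketch of the pipeline this implies: split the body into sentences, featurize each one, and keep the sentences the classifier accepts. The splitter and feature function below are placeholders of mine, not the paper's implementation:

```python
import re

def split_sentences(body: str) -> list[str]:
    # Naive sentence splitter; the paper does not prescribe one.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]

def featurize(sentence: str) -> dict:
    # Placeholder; the real features are described on the following slides.
    return {"strong_phrase_anchor": "attach" in sentence.lower()}

def relevant_sentences(body: str, classify) -> list[str]:
    # classify: any trained model mapping a feature dict to True/False.
    return [s for s in split_sentences(body) if classify(featurize(s))]

print(relevant_sentences("Hi Kim. Attached is the Q3 report. Thanks!",
                         lambda f: f["strong_phrase_anchor"]))
# -> ['Attached is the Q3 report.']
```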
Slide 8: Challenges: Why is the classification hard?
- Only a small part of the email is relevant to the attachment. Which part?
- While writing emails, users tend to rely on context that is easily understood by humans:
  - usage of pronouns,
  - nicknames,
  - cross-referencing information from a different conversation in the same email thread.
Slide 9: The Enron Email Dataset
- Made public by the Federal Energy Regulatory Commission during its investigation [1].
- The curated version [2] consists of 157,510 emails belonging to 150 users.
- A total of 30,968 emails have at least one attachment.
[1] http://www.cs.cmu.edu/~enron/
[2] Thanks to Mark Dredze for providing the curated dataset.
Slide 10: Observation 1: User behavior for attachments
- Barring a few outliers, almost all users sent or received emails with attachments.
- A rich, well-suited corpus for studying attachment behavior.
Slide 11: Observation 2: Email length
- Roughly 80% of the emails have fewer than 8 sentences.
- In another analysis: even in emails that have fewer than 3 sentences, not every sentence is related to the attachment!
Slide 12: Feature Design: Levels of Granularity
[Diagram: an email's sentences S1 through S6 viewed at three levels of granularity: email level, conversation level, and sentence level.]
Slide 13: Feature Taxonomy
- Email level: Length (short email: fewer than 4 sentences?)
- Conversation level: Level (the older a message is in the email thread, the higher its conversation level)
- Sentence level:
  - Anchor: Strong Phrase, Weak Phrase, Extension, Attachment Name
  - Noisy: Noun, Verb
  - Anaphora
  - Lexicon (a high-dimension, sparse bag-of-words feature)
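One illustrative way to realize the email- and conversation-level part of this taxonomy as a feature dict; the names and binary encoding are my assumptions (the conversation-level split at 2 comes from the correlation slide later), and the sentence-level entries are produced by the functions sketched under the following slides:

```python
def context_features(num_sentences: int, conversation_level: int) -> dict:
    # Email-level and conversation-level features; sentence-level
    # anchor/noisy/anaphora features are merged in separately.
    return {
        "short_email": int(num_sentences < 4),
        "conversation_level_le2": int(conversation_level <= 2),
        "conversation_level_gt2": int(conversation_level > 2),
    }
```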
Slide 14: Feature Design: Sentence Level
- Anchor sentences: most likely positive matches.
- Noisy sentences: most likely negative matches.
- Anaphora sentences: have linguistic relationships to anchor sentences.
Slide 15: Feature Design: Sentence Level (Anchor)
- Strong Phrase Anchor: feature value set to 1 if the sentence has any of the words or phrases "attach", "here is", "enclosed".
- Of the 30,968 emails that have an attachment, 52% had a strong anchor phrase.
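A minimal sketch of this feature, assuming case-insensitive substring matching (so "attach" also fires on "attached" and "attachment"; the slide does not specify the exact matching rule):

```python
STRONG_PHRASES = ("attach", "here is", "enclosed")

def strong_phrase_anchor(sentence: str) -> int:
    s = sentence.lower()
    return int(any(phrase in s for phrase in STRONG_PHRASES))

assert strong_phrase_anchor("Enclosed is the signed agreement.") == 1
assert strong_phrase_anchor("Let's meet on Tuesday.") == 0
```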
Slide 16: Feature Design: Sentence Level (Anchor)
- Behavioral observation: users tend to refer to an attachment by its file type.
- Extension Anchor: feature value set to 1 if the sentence has any of the extension keywords:
  - xls -> spreadsheet, report, excel file
  - jpg -> image, photo, picture
- Example: "Please refer to the attached spreadsheet for a list of Associates and Analysts who will be ranked in these meetings and their PRC Reps."
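A sketch of the extension anchor, assuming the attachments' file extensions are available to the extractor; the keyword table holds only the two mappings shown on the slide:

```python
EXTENSION_KEYWORDS = {
    "xls": ("spreadsheet", "report", "excel file"),
    "jpg": ("image", "photo", "picture"),
}

def extension_anchor(sentence: str, extensions: list[str]) -> int:
    s = sentence.lower()
    return int(any(keyword in s
                   for ext in extensions
                   for keyword in EXTENSION_KEYWORDS.get(ext.lower(), ())))

assert extension_anchor("Please refer to the attached spreadsheet.", ["xls"]) == 1
```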
Slide 17: Feature Design: Sentence Level (Anchor)
- Behavioral observation: users tend to use file-name tokens of the attachment to refer to the attachment.
- Attachment Name Anchor: feature value set to 1 if the sentence has any of the file-name tokens.
  - Tokenization is done on case and type transitions.
- Example:
  - attachment name: "Book Request Form East.xls"
  - sentence: "These are book requests for the Netco books for all regions."
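A sketch of this anchor; the tokenizer splits the file name on case transitions and letter/digit transitions as the slide describes, but the regexes and the lowercased exact-match comparison are my guesses at the details:

```python
import re

def filename_tokens(name: str) -> set[str]:
    stem = name.rsplit(".", 1)[0]  # drop the extension
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", stem):
        # Split on lower->upper case transitions and letter<->digit transitions.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part)
    return {t.lower() for t in tokens}

def attachment_name_anchor(sentence: str, attachment_name: str) -> int:
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return int(bool(filename_tokens(attachment_name) & words))

assert attachment_name_anchor(
    "These are book requests for the Netco books for all regions.",
    "Book Request Form East.xls") == 1  # matches the token "book"
```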
Slide 18: Feature Design: Sentence Level
(The anchor / noisy / anaphora diagram from slide 14, repeated as a transition to noisy sentences.)
Slide 19: Feature Design: Sentence Level (Noisy)
- Noisy sentences are usually salutations, signature sections, and email headers of conversations.
- Two features to capture noisy sentences:
  - Noisy Noun: marked true if more than 85% of the words in the sentence are nouns.
  - Noisy Verb: marked true if there are no verbs in the sentence.
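A sketch of both features using NLTK's part-of-speech tagger; the talk does not say which tagger was used, and the nltk package plus its punkt and averaged_perceptron_tagger data are assumed to be installed:

```python
import nltk  # assumes punkt and averaged_perceptron_tagger data are downloaded

def noisy_features(sentence: str) -> dict:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    nouns = sum(tag.startswith("NN") for tag in tags)
    verbs = sum(tag.startswith("VB") for tag in tags)
    return {
        "noisy_noun": int(bool(tags) and nouns / len(tags) > 0.85),  # >85% nouns
        "noisy_verb": int(verbs == 0),                               # no verbs
    }

# A signature block is typically all nouns with no verb:
print(noisy_features("John Smith Enron Corp Houston"))  # both flags likely 1
```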
Slide 20: Feature Design: Sentence Level
(The same diagram, repeated as a transition to anaphora sentences.)
Slide 21: Feature Design: Sentence Level (Anaphora)
- Once anchors have been identified:
  - an NLP technique called anaphora detection can be employed;
  - it detects other sentences that are linguistically dependent on an anchor sentence;
  - it tracks the hidden context in the email.
- Example: "Thought you might be interested in the report. It gives a nice snapshot of our activity with our major counterparties."
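Full anaphora detection needs an NLP toolkit; as a crude stand-in for illustration (my heuristic, not the paper's detector), a sentence that immediately follows an anchor and opens with a pronoun can be flagged:

```python
PRONOUNS = {"it", "this", "that", "these", "those", "they"}

def anaphora_flags(sentences: list[str], anchor_flags: list[int]) -> list[int]:
    flags = [0] * len(sentences)
    for i in range(1, len(sentences)):
        words = sentences[i].split()
        first = words[0].strip(",.").lower() if words else ""
        if anchor_flags[i - 1] and first in PRONOUNS:
            flags[i] = 1
    return flags

print(anaphora_flags(
    ["Thought you might be interested in the report.",
     "It gives a nice snapshot of our activity with our major counterparties."],
    [1, 0]))  # -> [0, 1]
```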
Slide 22: Correlation Analysis
- Best positive correlation: the strong phrase anchor and the anaphora feature.
- The short-email feature has a low correlation coefficient.
- The noisy verb feature has a good negative correlation.
- The conversation level <= 2 feature has a lower negative correlation than the conversation level > 2 feature.
Slide 23: Experiments
- Ground truth data:
  - randomly sampled 1,800 sent emails;
  - two independent editors produced class labels for every sentence in the sample;
  - reconciliation: discarded emails that had at least one sentence with conflicting labels, leaving 1,150 emails whose sentences were marked as relevant to the attachment or not (112 users from the Enron corpus, 6,472 sentences).
- ML algorithms studied: Naïve Bayes, SVM, CRF.
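A sketch of how the CRF variant could be trained, assuming the third-party sklearn-crfsuite package (not named in the talk) and using the strong-phrase feature from slide 15 as a stand-in for the full feature set; the tiny inline dataset is purely illustrative:

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def sent_features(sentence: str) -> dict:
    # Stand-in for the full anchor/noisy/anaphora/lexicon feature set.
    return {"strong_phrase_anchor": "attach" in sentence.lower()}

# Each instance is one email: a sequence of per-sentence feature dicts,
# with a parallel sequence of relevant/irrelevant labels.
emails = [["Hi Kim.", "Please see the attached report.", "Thanks!"]]
labels = [["irrelevant", "relevant", "irrelevant"]]
X = [[sent_features(s) for s in email] for email in emails]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
print(metrics.flat_f1_score(labels, crf.predict(X), average="weighted"))
```

Treating each email as a sequence (rather than classifying sentences independently) is what lets the CRF exploit the sequential nature of the data noted on the next slide.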
Slide 24: Summary of Results: F1 Measure
- With all features used, the F1 scores read:
  - CRF: 0.87
  - SVM: 0.79
  - Naïve Bayes: 0.74
- CRF consistently beats the other two methods across all feature subsets; the sequential nature of the data works in CRF's favor.
- The phrase anchor provides the best increase in precision; the anaphora feature provides the best increase in recall.
Slide 25: Summary of Results: User-Specific Performance
- In the majority of cases, CRF outperforms the other methods.
- With the same set of features, the CRF can learn a more generic model applicable to a variety of users.
Slide 26: In the paper . . .
- A specific use case for attachment-based tagging: images sent over email, with a user study based on a survey.
- Heuristics for going from selected sentences to keywords.
- More detailed evaluation studies and comparisons for Naïve Bayes, SVM, and CRF.
- TagSeeker: a prototype of the proposed algorithms implemented as a Thunderbird plugin.
Slide 27: Closing Remarks
- Measuring the improvement in retrieval effectiveness due to the mined keywords could not be performed because the attachment content is not available; we are working on an experiment to do this on a different dataset.
- Thanks to the reviewers for the great feedback, and to the organizers for the effort of putting this conference together!
Slide 28: Conclusion
- Presented a technique to extract information from noisy data.
- The F1 measure of the proposed methodology is in the high eighties (good!) and generalizes well across different users.
- For more information on this work or Information Extraction @ Yahoo!: aravindr@yahoo-inc.com