Automated Content Labeling
using Context in Email
  Aravindan Raghuveer
  Yahoo! Inc, Bangalore.
At least one email with an attachment in the last 2 weeks?
Image Source: http://www.crcna.org/site_uploads/uploads/osjha/mdgs/mdghands.jpg
Yahoo! Confidential
Introduction: The “What ?”

                 Email attachments are a very popular mechanism
                  to exchange content.
                      - Both in our personal / office worlds.


                 The one-liner:
                      - The email usually contains a crisp description of the
                        attachment.
                      - “Can we auto-generate tags for the attachment from
                        the email?”

Introduction: The “Why?”

                 Tags can be stored as extended attributes of files.
                 Applications like desktop search can use these
                  tags for building indexes.
                 Tags generated even without parsing content:
                      - Useful for Images
                      - In some cases, the tags have more context than the
                        attachment itself.



Outline

                 Problem Statement
                 Challenges
                 Overview of solution
                 The Dataset : Quirks and observations
                 Feature Design
                 Experiments and overview of results
                 Conclusion

Problem Statement

                  “Given an email E that has a set of attachments A,
                  find the set of words KEA that appear in E and are
                  relevant to the attachments A.”

                       Subproblems:
                       • Sentence Selection
                       • Keyword Selection




Overview of Solution

                 Solve a binary classification problem:
                      - Is a given sentence relevant to the attachments or not?


                 Two part solution:
                      - What are the features to use?
                         • Most insightful and interesting aspect of this work.
                         • Focus of this talk
                      - What classification algorithm to use?


Challenges : Why is the classification hard?

                 Only a small part of the email is relevant to the
                  attachment. Which part?
                 While writing emails, users tend to rely on
                  context that is easily understood by humans.
                      - usage of pronouns
                      - nicknames
                      - cross-referencing information from a different
                        conversation in the same email thread.


The Enron Email Dataset

                 Made public by Federal Energy Regulatory
                  Commission during its investigation [1].
                 Curated version [2] consists of 157,510 emails
                  belonging to 150 users.
                 A total of 30,968 emails have at least one
                  attachment.



               [1] http://www.cs.cmu.edu/~enron/
               [2] Thanks to Mark Dredze for providing the curated dataset.
Observation-1: User behavior for attachments




   • Barring a few outliers, almost all users sent/received emails with
   attachments.
   • A rich / well-suited corpus for studying attachment behavior
Observation-2: Email length




     • Roughly 80% of the emails have fewer than 8 sentences.
     • In another analysis: even in emails with fewer than 3 sentences,
     not every sentence is related to the attachment!
Feature Design : Levels of Granularity

                 [Figure: the sentences S1–S6 of an email, viewed at three
                  levels of granularity]

              Email Level       Conversation Level   Sentence Level
Feature Taxonomy:
             Features
                  - Email
                       • Length (short email: fewer than 4 sentences?)
                       • Lexicon (high-dimensional, sparse, bag-of-words feature)
                  - Conversation
                       • Level (the older a message is in the email thread, the
                         higher its conversation level)
                  - Sentence
                       • Anchor: Strong Phrase, Extension, Attachment Name, Weak Phrase
                       • Noisy: Noun, Verb
                       • Anaphora
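As an illustrative sketch only (not the paper's code), the two email-level features in the taxonomy might be computed as follows; the 4-sentence threshold is the one quoted on the slide:

```python
from collections import Counter

def short_email_feature(sentences):
    """Email-level Length feature: is the email shorter than 4 sentences?"""
    return int(len(sentences) < 4)

def lexicon_feature(sentences):
    """Email-level Lexicon feature: a sparse bag-of-words over the whole email."""
    words = (w.lower().strip(".,!?\"") for s in sentences for w in s.split())
    return Counter(words)  # sparse: only words that actually occur are stored

email = ["Here is the report.", "Let me know what you think."]
print(short_email_feature(email))  # -> 1 (2 sentences, fewer than 4)
```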
Feature Design : Sentence level
                      Anchor Sentences: most likely positive matches.
                      Noisy Sentences: most likely negative matches.
                      Anaphora Sentences: have linguistic relationships
                      to anchor sentences.
Feature Design: Sentence Level → Anchor

                 Strong Phrase Anchor: Feature value set to 1 if the
                  sentence has any of the words:
                   - attach
                   - here is
                   - enclosed


                 Of the 30,968 emails that have an attachment,
                  52% had a strong anchor phrase.

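A minimal sketch of this feature, assuming plain case-insensitive substring matching (the paper may normalize text differently):

```python
STRONG_PHRASES = ("attach", "here is", "enclosed")  # phrase list from the slide

def strong_phrase_anchor(sentence):
    """Strong Phrase Anchor: 1 if the sentence contains any strong phrase.
    "attach" is matched as a substring, so "attached"/"attachment" also hit."""
    s = sentence.lower()
    return int(any(phrase in s for phrase in STRONG_PHRASES))

print(strong_phrase_anchor("Please find the report attached."))  # -> 1
print(strong_phrase_anchor("How was your weekend?"))             # -> 0
```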
Feature Design: Sentence Level → Anchor

                 Behavioral Observation: Users tend to refer to an
                  attachment by its file type

                 Extension Anchor: Feature value set to 1 if sentence has
                  any of the extension keywords:
                   - xls → spreadsheet, report, excel file
                   - jpg → image, photo, picture

             Example:
            “Please refer to the attached spreadsheet for a list of Associates and
               Analysts who will be ranked in these meetings and their PRC Reps.”
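A sketch of the Extension Anchor under the same substring-matching assumption; the `EXTENSION_KEYWORDS` table below contains only the two rows shown on the slide, while the full table presumably covers more file types:

```python
# Hypothetical extension-to-keyword map; only the two slide examples included.
EXTENSION_KEYWORDS = {
    "xls": ("spreadsheet", "report", "excel file"),
    "jpg": ("image", "photo", "picture"),
}

def extension_anchor(sentence, attachment_ext):
    """Extension Anchor: 1 if the sentence mentions a keyword associated
    with the attachment's file extension."""
    s = sentence.lower()
    keywords = EXTENSION_KEYWORDS.get(attachment_ext.lower(), ())
    return int(any(kw in s for kw in keywords))

print(extension_anchor("Please refer to the attached spreadsheet.", "xls"))  # -> 1
print(extension_anchor("Please refer to the attached spreadsheet.", "jpg"))  # -> 0
```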
Feature Design: Sentence Level → Anchor

                 Behavioral Observation: Users tend to use file name
                  tokens of the attachment to refer to the attachment

                 Attachment Name Anchor: Feature value set to 1 if
                  sentence has any of the file name tokens.
                      - Tokenization is done on case and type transitions.
                 Example:
                      - attachment name “Book Request Form East.xls”
                      - “These are book requests for the Netco books for all
                        regions.”
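The tokenization on case and type transitions can be sketched with a regular expression; this is an illustrative reconstruction, not the paper's tokenizer:

```python
import re

def filename_tokens(name):
    """Tokenize an attachment name on case and letter/digit (type) transitions,
    dropping the extension, e.g. "Book Request Form East.xls" ->
    ["book", "request", "form", "east"]."""
    base = name.rsplit(".", 1)[0]
    parts = re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", base)
    return [p.lower() for p in parts]

def attachment_name_anchor(sentence, attachment_name):
    """Attachment Name Anchor: 1 if the sentence contains any file-name token."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return int(any(tok in words for tok in filename_tokens(attachment_name)))

# "book" from the file name matches "book" in the sentence:
print(attachment_name_anchor(
    "These are book requests for the Netco books for all regions.",
    "Book Request Form East.xls"))  # -> 1
```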
Feature Design : Sentence level
                      Anchor Sentences: most likely positive matches.
                      Noisy Sentences: most likely negative matches.
                      Anaphora Sentences: have linguistic relationships
                      to anchor sentences.
Feature Design : Sentence level → Noisy

                 Noisy sentences are usually salutations, signature
                  sections and email headers of conversations.
                 Two features to capture noisy sentences
                      - Noisy Noun
                      - Noisy Verb

                 Noisy Noun: marked true if more than 85% of the
                  words in the sentence are nouns.
                 Noisy Verb: marked true if there are no verbs in the
                  sentence.
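Both noisy-sentence features reduce to counting part-of-speech tags. The sketch below assumes the sentence has already been tagged by some Penn-Treebank-style POS tagger (the slides do not say which tagger the paper used):

```python
def noisy_noun(tagged):
    """Noisy Noun: true if more than 85% of the words are nouns.
    `tagged` is a list of (word, POS) pairs from any Penn-Treebank tagger."""
    if not tagged:
        return 0
    nouns = sum(1 for _, tag in tagged if tag.startswith("NN"))
    return int(nouns / len(tagged) > 0.85)

def noisy_verb(tagged):
    """Noisy Verb: true if the sentence contains no verbs at all."""
    return int(not any(tag.startswith("VB") for _, tag in tagged))

# A signature line is all proper nouns and verb-free, so both features fire:
sig = [("Jane", "NNP"), ("Doe", "NNP"), ("Enron", "NNP"), ("Houston", "NNP")]
print(noisy_noun(sig), noisy_verb(sig))  # -> 1 1
```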
Feature Design : Sentence level
                      Anchor Sentences: most likely positive matches.
                      Noisy Sentences: most likely negative matches.
                      Anaphora Sentences: have linguistic relationships
                      to anchor sentences.
Feature Design : Sentence level → Anaphora

                 Once anchors have been identified:
                      - An NLP technique called anaphora detection can be
                        employed.
                      - It detects other sentences that are linguistically
                        dependent on the anchor sentence.
                      - It tracks the hidden context in the email.
                 Example:
                      - “Thought you might me interested in the report. It gives a nice
                        snapshot of our activity with our major counterparties.”


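Real anaphora detection needs an NLP toolkit; the toy heuristic below only illustrates the shape of the feature, flagging sentences that directly follow an anchor (as identified by the anchor features) and contain a referring pronoun:

```python
import re

PRONOUNS = {"it", "this", "that", "these", "those", "they"}

def anaphora_feature(sentences, anchor_flags):
    """Crude stand-in for anaphora detection: flag a sentence that follows
    an anchor sentence and contains a referring pronoun."""
    feats = []
    for i, sent in enumerate(sentences):
        words = set(re.findall(r"[a-z]+", sent.lower()))
        follows_anchor = i > 0 and anchor_flags[i - 1] == 1
        feats.append(int(follows_anchor and bool(words & PRONOUNS)))
    return feats

email = ["Thought you might be interested in the report.",
         "It gives a nice snapshot of our activity."]
print(anaphora_feature(email, [1, 0]))  # -> [0, 1]
```

Here the second sentence is picked up because "It" refers back to the report mentioned in the anchor sentence, exactly the hidden context the slide describes.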
Correlation Analysis

                 Best positive correlation:
                      • Strong phrase anchor
                      • Anaphora feature
                 Short email → low correlation coefficient.
                 Noisy verb feature → good negative correlation.
                 The conversation level <= 2 feature has a lower negative
                  correlation than the conversation level > 2 feature.
Experiments

                 Ground Truth Data
                      - Randomly sampled 1,800 sent emails.
                      - Two independent editors produced class labels for
                        every sentence in the sample.
                      - Reconciliation:
                         • Discarded emails that had at least one sentence with
                           conflicting labels.
                 ML Algorithms studied : Naïve Bayes, SVM, CRF.


Summary of Results: F1 measure
                 With all features used, F1 scores read:
                      - CRF : 0.87
                      - SVM: 0.79
                      - Naïve Bayes: 0.74
                 CRF consistently beats the other two methods across all
                  feature sub-sets
                      - Sequential nature of the data works in CRF’s favor.
                 The Phrase Anchor feature provides the best increase in precision.
                 The Anaphora feature provides the best increase in recall.


Summary of Results: User Specific Performance

              In the majority of cases, CRF
               outperforms the other two methods.
              With the same set of
               features:
                 - the CRF can learn a more
                   generic model applicable
                   to a variety of users




In the paper . . .

                 A specific use-case for attachment-based tagging:
                      - Images sent over email
                      - A user study based on a survey

                 Heuristics for sentences → keywords
                 More detailed evaluation studies / comparison
                  for Naïve Bayes, SVM, CRF
                 TagSeeker: a prototype of the proposed algorithms,
                  implemented as a Thunderbird plugin.
Closing Remarks

                 Evaluating the improvement in retrieval effectiveness
                  due to the mined keywords:
                      - Could not be performed because the attachment content
                        is not available.
                 Working on an experiment to do this on a different
                  dataset.

                 Thanks to the:
                      - Reviewers for the great feedback!
                      - Organizers for the effort of putting together this conference!
Conclusion

                 Presented a technique to extract information
                  from noisy data.
                 The F1 measure of the proposed methodology
                      - In the high eighties. Good!
                      - Generalized well across different users.


                 For more information on this work / Information
                  Extraction @ Yahoo!
                      - aravindr@yahoo-inc.com



Editor's Notes

  1. Left with 1,150 emails whose sentences were marked either relevant to the attachment or not. This data corresponds to 112 users from the Enron corpus and 6,472 sentences.