A Corpus for Entity Profiling in
      Microblog Posts
   Damiano Spina
   UNED NLP & IR Group, Madrid, Spain

   Edgar Meij, Andrei Oghina, Minh T. Bui, Mathias Breuss, Maarten de Rijke
   ISLA, University of Amsterdam, Amsterdam, The Netherlands

                     LREC Workshop on
   Language Engineering for Online Reputation Management
              May 26th, 2012 - Istanbul, Turkey
Introduction
• Online Reputation Management
   – Public image of an entity in Online Media
   – Entity = { brand, organization, company, person, product }
• Microblogging services (e.g. Twitter)
   – People sharing thoughts about an entity
   – Dynamic, Real-Time
• Human Language Technologies
   – Aid to reputation managers
   – Retrieval and Analysis of entity mentions
Sentiment vs. Profiling
• Sentiment analysis
• Entity Profiling
   – “hot” topics that people talk about in the context of an entity
Our task: Aspect identification
• @xbox_news here we go again,
  microsoft being jealous of sony again.

• I lov big Sony headphones .. I lov my #music 2 b
   more beautiful

• not surprising that @graypowell was out and about -
  he used to be a ’Field Verification & Operator
  Acceptance Engineer’ at Sony
Goal

• Build manually annotated corpora

  – Evaluate the task of entity profiling in microblog
    streams
A Corpus for Entity Profiling in Microblog Posts
• WePS-3 ORM Corpus
   – Collection of tweets
   – Disambiguated company names (e.g. apple fruit vs. Apple Inc.)
A Corpus for Entity Profiling in Microblog Posts
• Built on the WePS-3 ORM Corpus
   – Pooling: Aspects
   – Tweet annotation: Opinion targets
Approach I: Pooling aspects
• Pooling methodology
  – 4 Ranking Methods:
    •   TF.IDF [Salton and Buckley, 1988]
    •   Log-Likelihood Ratio [Dunning, 1993]
    •   Parsimonious Language Model [Hiemstra et al. 2004]
    •   Opinion target extraction using topic-specific subjective
        lexicons [Jijkoun et al. 2010]
  – Top 10 terms
• Manual annotation
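
A minimal sketch of the pooling step, assuming whitespace-tokenized tweets and a small background collection. Only two of the four ranking methods are illustrated (TF.IDF and a log-likelihood ratio); the Parsimonious Language Model and the opinion target extraction method are omitted. The function names (tf_idf_scores, llr_scores, pool_candidates) are hypothetical, and the log-likelihood formula is the common two-cell simplification, which may differ from the exact variant used for the corpus.

```python
import math
from collections import Counter


def tf_idf_scores(entity_tokens, background_docs):
    """TF.IDF of each term in the entity's tweets.

    background_docs is a list of token sets; the entity's own tweets
    count as one extra document (an assumption of this sketch).
    """
    tf = Counter(entity_tokens)
    n_docs = len(background_docs) + 1
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(1 for doc in background_docs if term in doc)
        scores[term] = freq * math.log(n_docs / df)
    return scores


def llr_scores(entity_tokens, background_tokens):
    """Two-cell log-likelihood ratio: entity tweets vs. background tokens."""
    fg, bg = Counter(entity_tokens), Counter(background_tokens)
    c, d = len(entity_tokens), len(background_tokens)
    scores = {}
    for term, a in fg.items():
        b = bg.get(term, 0)
        e1 = c * (a + b) / (c + d)  # expected count in entity tweets
        e2 = d * (a + b) / (c + d)  # expected count in background
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        # Keep only terms over-represented in the entity's tweets.
        scores[term] = g2 if a / c > b / d else 0.0
    return scores


def pool_candidates(score_dicts, k=10):
    """Union of the top-k terms of every ranking method (the pool)."""
    pool = set()
    for scores in score_dicts:
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        pool.update(ranked[:k])
    return pool
```

The resulting pool is what annotators would then judge term by term; with k=10 and four methods, each entity yields at most 40 candidate terms.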
Aspects dataset: annotation example
Aspects dataset: outcome
• Three annotators, substantial agreement
  (> 0.6 Cohen/Fleiss’ kappa)


• 94 entities, 17775 tweets, ≈177 tweets/entity

• 2455 terms, 1304 aspects (54.11%)
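
For reference, a sketch of how pairwise agreement could be computed with Cohen's kappa, assuming each annotator's judgments are given as a list of labels in the same term order. The function name cohen_kappa is hypothetical; Fleiss' kappa, which handles all three annotators at once, is not shown.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)
```

Values above 0.6 are conventionally read as substantial agreement, which is the threshold quoted on this slide.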
Approach II: Tweet annotation
• Opinion targets dataset
• Tweet-level annotation
  – Is the tweet subjective?
• Phrase-level annotation
  – Subjective phrase
  – Opinion target phrase p:
     • p is an aspect of the entity
     • p is included in a sentence that contains a direct subjective
       phrase
     • p is the target of the expressed opinion
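
One way to picture the two annotation levels is as a record per tweet. The sketch below is an assumption for illustration only: the class and field names (TweetAnnotation, OpinionTarget, share_with) are hypothetical and do not describe the released file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpinionTarget:
    phrase: str             # the target phrase p
    subjective_phrase: str  # the subjective phrase expressing the opinion on p


@dataclass
class TweetAnnotation:
    tweet_id: str
    entity: str
    is_subjective: bool  # tweet-level judgment
    subjective_phrases: List[str] = field(default_factory=list)
    opinion_targets: List[OpinionTarget] = field(default_factory=list)


def share_with(annotations, attribute):
    """Percentage of tweets whose given list attribute is non-empty."""
    if not annotations:
        return 0.0
    hits = sum(1 for a in annotations if getattr(a, attribute))
    return 100.0 * hits / len(annotations)


# Hypothetical annotations, not taken from the corpus.
example = [
    TweetAnnotation("1", "Sony", True, ["lov"], [OpinionTarget("headphones", "lov")]),
    TweetAnnotation("2", "Sony", False),
]
print(share_with(example, "opinion_targets"))
```

Corpus-level figures such as the share of tweets with subjective phrases or with opinion targets would be computed this way over all annotated tweets.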
Opinion targets dataset: annotation example
Opinion targets dataset: outcome
• 59 entities, 9396 tweets, ≈159 tweets/entity

• 15.16% of tweets with subjective phrases

• 13.82% of tweets with opinion targets
Aspects vs. Opinion targets
• Venn diagram: Aspects (left) and Terms in Opinion Targets (right)
   – Values shown, left to right: 783, 270, 1650
   – Percentages shown: 12.67%, 26.69%
A Corpus for Entity Profiling in Microblog Posts
• Available at http://bitly.com/profilingTwitter
• Aspects dataset (pooling)
   – 94 entities, 17,775 tweets, ≈177 tweets/entity
   – 2455 terms, 1304 aspects (54.11%)
• Opinion targets dataset (tweet annotation)
   – 59 entities, 9,396 tweets, ≈159 tweets/entity
   – 15.16% of tweets with subjective phrases
   – 13.82% of tweets with opinion targets
• Both built on the WePS-3 ORM Corpus
