A Corpus for Entity Profiling in Microblog Posts

Damiano Spina
UNED NLP & IR Group, Madrid, Spain

Edgar Meij, Andrei Oghina, Minh T. Bui, Mathias Breuss, Maarten de Rijke
ISLA, University of Amsterdam, Amsterdam, The Netherlands

LREC Workshop on
Language Engineering for Online Reputation Management
May 26th, 2012 - Istanbul, Turkey

Introduction
• Online Reputation Management
– Public image of an entity in Online Media
– Entity = { brand, organization, company, person, product }
• Microblogging services (e.g. Twitter)
– People sharing thoughts about an entity
– Dynamic, Real-Time
• Human Language Technologies
– Aid to reputation managers
– Retrieval and Analysis of entity mentions

Sentiment vs. Profiling
• Sentiment analysis

• Entity Profiling
– “hot” topics that people talk about in the context of an entity

Our task: Aspect identification
• @xbox_news here we go again, microsoft being jealous of sony again.

• I lov big Sony headphones .. I lov my #music 2 b more beautiful

• not surprising that @graypowell was out and about - he used to be a ’Field Verification & Operator Acceptance Engineer’ at Sony

Goal

• Build manually annotated corpora
– Evaluate the task of entity profiling in microblog streams

Microblog Posts

WePS-3 ORM Corpus
• Collection of tweets
• Disambiguated company names (e.g. apple fruit vs. Apple Inc.)

Microblog Posts

[Diagram: starting from the WePS-3 ORM Corpus, pooling yields the Aspects dataset and tweet annotation yields the Opinion targets dataset]

Approach I: Pooling aspects
• Pooling methodology
– 4 Ranking Methods:
• TF.IDF [Salton and Buckley, 1988]
• Log-Likelihood Ratio [Dunning, 1993]
• Parsimonious Language Model [Hiemstra et al. 2004]
• Opinion target extraction using topic-specific subjective
lexicons [Jijkoun et al. 2010]
– Top 10 terms
• Manual annotation
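The pooling step can be illustrated with a minimal Python sketch. It covers only two of the four ranking methods (TF.IDF and the log-likelihood ratio); the parsimonious language model and the opinion target extractor are omitted. The function names, the pre-tokenized input format, and the use of other entities' tweets as a background corpus are assumptions made for illustration, not details taken from the slides.

```python
import math
from collections import Counter


def tfidf_rank(entity_tweets, background_tweets, k=10):
    """Rank an entity's terms by TF.IDF (IDF computed over all tweets)."""
    tf = Counter(t for tweet in entity_tweets for t in tweet)
    all_tweets = entity_tweets + background_tweets
    df = Counter(t for tweet in all_tweets for t in set(tweet))
    n = len(all_tweets)
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def _g2(a, b, c, d):
    """G2 statistic for a term seen a times in c foreground tokens
    and b times in d background tokens (2x2 contingency table)."""
    n = c + d
    total = 0.0
    cells = ((a, a + b, c), (b, a + b, d),
             (c - a, n - a - b, c), (d - b, n - a - b, d))
    for obs, row, col in cells:
        if obs > 0:  # 0 * log(0) is treated as 0
            total += obs * math.log(obs * n / (row * col))
    return 2.0 * total


def llr_rank(entity_tweets, background_tweets, k=10):
    """Rank terms by log-likelihood ratio against the background corpus."""
    fg = Counter(t for tweet in entity_tweets for t in tweet)
    bg = Counter(t for tweet in background_tweets for t in tweet)
    n_fg, n_bg = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, a in fg.items():
        b = bg[term]
        # Keep only terms that are relatively more frequent for the entity.
        if n_bg == 0 or a / n_fg > b / n_bg:
            scores[term] = _g2(a, b, n_fg, n_bg)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def pool(entity_tweets, background_tweets, rankers, k=10):
    """Union of the top-k terms of every ranker: the candidates to annotate."""
    pooled = set()
    for rank in rankers:
        pooled.update(rank(entity_tweets, background_tweets, k))
    return sorted(pooled)
```

Calling `pool(entity_tweets, background_tweets, [tfidf_rank, llr_rank])` returns the deduplicated candidate terms that annotators would then label as aspect or non-aspect.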

Aspects dataset: annotation example

Aspects dataset: outcome
• Three annotators, substantial agreement
(> 0.6 Cohen/Fleiss’ kappa)

• 94 entities, 17775 tweets, ≈177 tweets/entity

• 2455 terms, 1304 aspects (54.11%)
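As a reminder of what the agreement figure measures, here is a minimal sketch of Cohen's kappa for two annotators. The function name and the binary aspect / non-aspect labels are assumptions for illustration; the slides also mention Fleiss' kappa for the three-annotator case, which is not shown here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:  # both annotators always use the same single label
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

Values above 0.6 are conventionally read as substantial agreement, which is the threshold the slide reports.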

Approach II: Tweet annotation
• Opinion targets dataset
• Tweet-level annotation
– Is the tweet subjective?
• Phrase-level annotation
– Subjective phrase
– Opinion target phrase p:
• p is an aspect of the entity
• p is included in a sentence that contains a direct subjective
phrase
• p is the target of the expressed opinion
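The two annotation levels can be pictured as a small data structure. This is a sketch under assumptions: the class and field names are hypothetical, and the three phrase-level criteria are assumed to apply jointly (a phrase counts as an opinion target only if all three hold), which the slide implies but does not state outright.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhraseAnnotation:
    """Phrase-level judgement for a candidate phrase p (hypothetical schema)."""
    text: str
    is_aspect: bool               # p is an aspect of the entity
    in_subjective_sentence: bool  # p's sentence has a direct subjective phrase
    is_target_of_opinion: bool    # p is the target of the expressed opinion

    def is_opinion_target(self) -> bool:
        # Assumption: all three criteria must hold.
        return (self.is_aspect
                and self.in_subjective_sentence
                and self.is_target_of_opinion)


@dataclass
class TweetAnnotation:
    """Tweet-level judgement plus its phrase-level annotations."""
    tweet: str
    subjective: bool
    subjective_phrases: List[str] = field(default_factory=list)
    candidates: List[PhraseAnnotation] = field(default_factory=list)

    def opinion_targets(self) -> List[str]:
        return [p.text for p in self.candidates if p.is_opinion_target()]
```

For instance, a tweet such as "I lov big Sony headphones" could be recorded as subjective, with "lov" as the subjective phrase and "headphones" as a candidate that meets all three criteria; this labelling is illustrative and not taken from the dataset.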

Opinion Targets dataset: annotation example

Opinion targets dataset: outcome
• 59 entities, 9396 tweets, ≈159 tweets/entity

• 15.16% of tweets with subjective phrases

• 13.82% of tweets with opinion targets

Aspects vs. Opinion targets

[Venn diagram: terms in Opinion Targets vs. Aspects. Region counts as shown, left to right: 783, 270, 1650. Percentages shown: 12.67% and 26.69%.]
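A comparison of this kind amounts to set operations over the two term vocabularies. The sketch below uses hypothetical names and makes no claim about how the counts or percentages on the slide were derived.

```python
def overlap(aspect_terms, target_terms):
    """Compare the Aspects vocabulary with the Opinion Targets vocabulary."""
    aspects, targets = set(aspect_terms), set(target_terms)
    shared = aspects & targets
    return {
        "only_opinion_targets": len(targets - aspects),
        "shared": len(shared),
        "only_aspects": len(aspects - targets),
        # Share of each vocabulary that also appears in the other one.
        "shared_share_of_opinion_targets":
            len(shared) / len(targets) if targets else 0.0,
        "shared_share_of_aspects":
            len(shared) / len(aspects) if aspects else 0.0,
    }
```

Terms would need the same normalization (casing, tokenization) in both datasets for the intersection to be meaningful.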

Microblog Posts
• Available at
http://bitly.com/profilingTwitter

Aspects dataset (pooling)
• 94 entities, 17,775 tweets, ≈177 tweets/entity
• 2455 terms, 1304 aspects (54.11%)

Opinion targets dataset (tweet annotation)
• 59 entities, 9,396 tweets, ≈159 tweets/entity
• 15.16% of tweets with subj. phrases
• 13.82% of tweets with opinion targets

Both datasets are built on the WePS-3 ORM Corpus
