Online reputation management is about monitoring and handling
the public image of entities (such as companies) on the Web. An
important task in this area is identifying aspects of the entity of
interest (such as products, services, competitors, key people, etc.)
given a stream of microblog posts referring to the entity. In this pa-
per we compare different IR techniques and opinion target identi-
fication methods for automatically identifying aspects and find that
(i) simple statistical methods such as TF.IDF are a strong baseline
for the task, significantly outperforming opinion-oriented methods,
and (ii) only considering terms tagged as nouns improves the re-
sults for all the methods analyzed.
Caracterización de una entidad basada en opiniones: un estudio de caso
Identifying Entity Aspects in Microblog Posts
1. Identifying Entity Aspects in Microblog Posts
damiano@lsi.uned.es, {edgar.meij, derijke}@uva.nl, {oghina,mbui,mbreuss}@science.uva.nl
UNED NLP & IR Group ISLA, University of Amsterdam
Madrid, Spain 35th International ACM SIGIR Conference on Research and Development in Information Retrieval
Amsterdam, The Netherlands
Portland, Oregon, USA. August 12–16, 2012.
Introduction Dataset
94 entities, 17,775 tweets, ≈177 tweets/entity
Scenario: Online reputation management
Built upon WePS-3 ORM Task Dataset.
Disambiguated company names (e.g. apple fruit vs. Apple Inc.)
What do people say about an entity?
entity = { company, organization, individual, product} Pooling methodology.
Three annotators with substantial agreement http://bit.ly/profilingTwitter
(Cohen/Fleiss’ kappa > 0.6 )
Identification of aspects discussed on microblog posts
Annotated 2455 terms, 1304 aspects (54.11%)
Aspects ≈ products, services, competitors, key people Most of the true aspects are nouns (89.72%).
related to the entity
Entity Aspects
Apple Inc. ipad, iphone, prototype, store, gizmodo, employee
Sony advertising, headphones, digital, pro, music, xperia, dsc,
x10, bravia, camera, vegas, battery, ericsson, playstation
Starbucks coffee, latte, tea, frappuccino, barista, drink, mocha
Identifying Entity Aspects Experiments and Results
All Words:
TF.IDF significantly outperforms PLM and OO in precision.
Input Output The results for OO are much lower than for the other methods.
Noun filter:
1. Applying a part-of-speech filter and only consider terms tagged as
“inter” nouns
(crosstown derby)
For all methods, MAP and precision values are slightly higher than in the
all words condition: considering only nouns helps to identify aspects.
PLM best method using Noun filter.
2.
“Berlusconi” Observations (after manually inspecting the results):
(club president)
Results for TF.IDF, LLR and PLM are very similar.
OO tends to return more subjective terms as aspects (syntactic parsing
3. errors)
Examples: haha, pls, xd, safety, win
“shirt” OO has more difficulty to filter out generic terms.
… Examples: new, use, today, come
Main idea
Comparing a pseudo-document D built from entity-specific tweets with a
background corpus C.
Score of a term t = s(t, D, C)
Four models tested:
TF.IDF
LLR: Log-Likelihood Ratio
PLM: Parsimonious Language Models
OO: Opinion-oriented. Opinion target extraction using
topic-specific subjective lexicons
Student’s t-test statistic significance w.r.t to TF.IDF All words baseline with α=0.05 (*) and α=0.01 (**).
Conclusions
Simple statistical methods such as TF.IDF are a strong
baseline for the task of identifying entity aspects,
significantly outperforming opinion-oriented methods.
Only considering terms tagged as nouns improves
the results for all the methods analyzed.
Acknowledgments: This research was supported by the European Union's CIP ICT-PSP under grant agreement nr 250430, the European Community's FP7 Programme under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe),
the Netherlands Organisation for Scientific Research (NOW) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, 727.011.005, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the
Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, under COMMIT project Infiniti, the Spanish Ministry of Education (FPU grant nr AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010-
21128-C02), the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), the ESF Research Network Program ELIAS and the ACM Special Interest Group on Information Retrieval (SIGIR Student Travel Grant).