Identifying Entity Aspects in Microblog Posts

•

1 like•340 views

Online reputation management is about monitoring and handling the public image of entities (such as companies) on the Web. An important task in this area is identifying aspects of the entity of interest (such as products, services, competitors, key people, etc.) given a stream of microblog posts referring to the entity. In this paper we compare different IR techniques and opinion target identification methods for automatically identifying aspects and find that (i) simple statistical methods such as TF.IDF are a strong baseline for the task, significantly outperforming opinion-oriented methods, and (ii) only considering terms tagged as nouns improves the results for all the methods analyzed.

Identifying Entity Aspects in Microblog Posts

damiano@lsi.uned.es, {edgar.meij, derijke}@uva.nl, {oghina,mbui,mbreuss}@science.uva.nl
UNED NLP & IR Group ISLA, University of Amsterdam
Madrid, Spain 35th International ACM SIGIR Conference on Research and Development in Information Retrieval
Amsterdam, The Netherlands
Portland, Oregon, USA. August 12–16, 2012.

Introduction Dataset
 94 entities, 17,775 tweets, ≈177 tweets/entity
 Scenario: Online reputation management
 Built upon WePS-3 ORM Task Dataset.
Disambiguated company names (e.g. apple fruit vs. Apple Inc.)
What do people say about an entity?
entity = { company, organization, individual, product}  Pooling methodology.
Three annotators with substantial agreement http://bit.ly/profilingTwitter
(Cohen/Fleiss’ kappa > 0.6 )
Identification of aspects discussed on microblog posts
 Annotated 2455 terms, 1304 aspects (54.11%)
 Aspects ≈ products, services, competitors, key people Most of the true aspects are nouns (89.72%).
related to the entity
Entity Aspects
Apple Inc. ipad, iphone, prototype, store, gizmodo, employee
Sony advertising, headphones, digital, pro, music, xperia, dsc,
x10, bravia, camera, vegas, battery, ericsson, playstation
Starbucks coffee, latte, tea, frappuccino, barista, drink, mocha

Identifying Entity Aspects Experiments and Results
 All Words:
 TF.IDF signiﬁcantly outperforms PLM and OO in precision.
Input Output  The results for OO are much lower than for the other methods.

 Noun filter:
1. Applying a part-of-speech ﬁlter and only consider terms tagged as
“inter” nouns
(crosstown derby)
 For all methods, MAP and precision values are slightly higher than in the
all words condition: considering only nouns helps to identify aspects.
 PLM best method using Noun filter.
2.
“Berlusconi”  Observations (after manually inspecting the results):
(club president)
 Results for TF.IDF, LLR and PLM are very similar.
 OO tends to return more subjective terms as aspects (syntactic parsing
3. errors)
Examples: haha, pls, xd, safety, win
“shirt”  OO has more difﬁculty to ﬁlter out generic terms.
… Examples: new, use, today, come

Main idea
 Comparing a pseudo-document D built from entity-speciﬁc tweets with a
background corpus C.

 Score of a term t = s(t, D, C)

Four models tested:

 TF.IDF
 LLR: Log-Likelihood Ratio
 PLM: Parsimonious Language Models
 OO: Opinion-oriented. Opinion target extraction using
topic-specific subjective lexicons
Student’s t-test statistic significance w.r.t to TF.IDF All words baseline with α=0.05 (*) and α=0.01 (**).

Conclusions
 Simple statistical methods such as TF.IDF are a strong
baseline for the task of identifying entity aspects,
signiﬁcantly outperforming opinion-oriented methods.

 Only considering terms tagged as nouns improves
the results for all the methods analyzed.

Acknowledgments: This research was supported by the European Union's CIP ICT-PSP under grant agreement nr 250430, the European Community's FP7 Programme under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe),
the Netherlands Organisation for Scientific Research (NOW) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, 727.011.005, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the
Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, under COMMIT project Infiniti, the Spanish Ministry of Education (FPU grant nr AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010-
21128-C02), the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), the ESF Research Network Program ELIAS and the ACM Special Interest Group on Information Retrieval (SIGIR Student Travel Grant).

Viewers also liked

9.12

xinyu1607

The creation of new knowledge in the Semantic Web is more and more depending on a automatic knowledge enrichment processes, such semi-structural Information Extraction (IE) in the example of the creation of DBPedia from Wikipedia. To further improve knowledge coverage IE must also consider non-structural plain natural language text resources. Here SOFIE offers a novel approach to IE which can consistently enrich semantic models from text sources by combining pattern matching, entity disambiguation and reasoning in a propositional logic approach using MAX SAT in the IE process.

SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...

Tobias Wunner

Aspects of NLP Practice

Vsevolod Dyomkin

Aspect-Oriented Technologies

Esteban Abait

Textmining Information Extraction

guest0edcaf

Aspect extraction (A survey)

Mido Razaz

Aspect Mining Techniques

Esteban Abait

This presentation offers a first, in-breadth survey and comparison of current aspect mining tools and techniques. It focuses mainly on automated techniques that mine a program’s static or dynamic structure for candidate aspects. We present an initial comparative framework for distinguishing aspect mining techniques, and assess known techniques against this framework. The results of this assessment may serve as a roadmap to potential users of aspect mining techniques, to help them in selecting an appropriate technique. It also helps aspect mining researchers to identify remaining open research questions, possible avenues for future research, and interesting combinations of existing techniques.

A Survey Of Aspect Mining Approaches

kim.mens

SystemT: Declarative Information Extraction

Yunyao Li

Natural Language Processing and Machine Learning

Karthik Sankar

Object detection and recognition are important problems in computer vision and pattern recognition domain. Human beings are able to detect and classify objects effortlessly but replication of this ability on computer based systems has proved to be a non-trivial task. In particular, despite significant research efforts focused on meta-heuristic object detection and recognition, robust and reliable object recognition systems in real time remain elusive. Here we present a survey of one particular approach that has proved very promising for invariant feature recognition and which is a key initial stage of multi-stage network architecture methods for the high level task of object recognition.

UNSUPERVISED LEARNING MODELS OF INVARIANT FEATURES IN IMAGES: RECENT DEVELOPM...

ijscai

All types of machine automated systems are generating large amount of data in different forms like statistical, text, audio, video, sensor, and bio-metric data that emerges the term Big Data. In this paper we are discussing issues, challenges, and application of these types of Big Data with the consideration of big data dimensions. Here we are discussing social media data analytics, content based analytics, text data analytics, audio, and video data analytics their issues and expected application areas. It will motivate researchers to address these issues of storage, management, and retrieval of data known as Big Data. As well as the usages of Big Data analytics in India is also highlighted.

BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...

ijscai

Ontologies are being used to organize information in many domains like artificial intelligence, information science, semantic web, library science. Ontologies of an entity having different information can be merged to create more knowledge of that particular entity. Ontologies today are powering more accurate search and retrieval in websites like Wikipedia etc. As we move towards the future to Web 3.0, also termed as the semantic web, ontologies will play a more important role. Ontologies are represented in various forms like RDF, RDFS, XML, OWL etc. Querying ontologies can yield basic information about an entity. This paper proposes an automated method for ontology creation, using concepts from NLP (Natural Language Processing), Information Retrieval and Machine Learning. Concepts drawn from these domains help in designing more accurate ontologies represented using the XML format. This paper uses document classification using classification algorithms for assigning labels to documents, document similarity to cluster similar documents to the input document, together, and summarization to shorten the text and keep important terms essential in making the ontology. The module is constructed using the Python programming language and NLTK (Natural Language Toolkit). The ontologies created in XML will convey to a lay person the definition of the important term's and their lexical relationships.

A NAIVE METHOD FOR ONTOLOGY CONSTRUCTION

ijscai

Opinion Mining

Shital Kat

Abstract Generally, computer system is handled by the English language only. But the person who is unaware of the English language and structure of query language cannot handle the system. This paper proposed a new approach for accessing the database easily without knowing English. So, the database is accessed with the help of natural languages such as Hindi, Marathi etc. Natural language processing (NLP) is the study of mathematical and computational modeling of various aspects of language and the development of a wide range of systems. Natural Language Processing holds great promise for making computer interfaces that are easier to use for people, since people will be able to talk to the computer in their own language, rather than learn a specialized language of computer commands. Keywords: NLP, Mathematical modeling, Computational modeling

Accessing database using nlp

eSAT Journals

Recent natural language processing advancements have propelled search engine and information retrieval innovations into the public spotlight. People want to be able to interact with their devices in a natural way. In this talk I will be introducing you to natural language search using a Neo4j graph database. I will show you how to interact with an abstract graph data structure using natural language and how this approach is key to future innovations in the way we interact with our devices.

Natural Language Processing with Neo4j

Kenny Bastani

Tutorial given at RANLP 2015 in Hissar, Bulgaria Recent years have seen lots of changes in the field of computational linguistics, most of them due to the widespread use of the Internet and the benefits and problems it brings. The first part of this tutorial will discuss these changes and will focus on crowdsourcing and how it influenced the creation of annotated data. Annotation of data employed to train and test NLP methods used to be the task of language experts who had a good understanding of the linguistic phenomena to be tackled. Given that a large number of people now have access to the Internet, crowdsourcing has become an alternative way of obtaining annotated data. The core idea of crowdsourcing is that it is possible to design tasks that can be completed by non-experts and that the outputs of these tasks can be combined to obtain high-quality linguistic annotation, which would normally be produced by experts. Examples of how crowdsourcing was employed in computational linguistics will be given. Big data is another trend in computational linguistics as researchers rely on more and more data for improving the results of a method. The second part of the tutorial will introduce the MapReduce programming model and show how it was used in processing language. Combined with processing larger quantities of data, the field of computational linguistics has applied deep learning to various tasks successfully, improving their accuracy. An introduction to deep learning will be provided, followed by examples of how it was applied to tasks such as learning semantic representations, sentiment analysis and machine translation evaluation.

New trends in NLP applications

Constantin Orasan

KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final

Deborah Woodcock

Weiskopf AMIA 2016_abstract

Deborah Woodcock

Natural language processing

Hansi Thenuwara

Viewers also liked (20)

9.12

SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...

Aspects of NLP Practice

Aspect-Oriented Technologies

Textmining Information Extraction

Aspect extraction (A survey)

Aspect Mining Techniques

A Survey Of Aspect Mining Approaches

SystemT: Declarative Information Extraction

Natural Language Processing and Machine Learning

UNSUPERVISED LEARNING MODELS OF INVARIANT FEATURES IN IMAGES: RECENT DEVELOPM...

BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...

A NAIVE METHOD FOR ONTOLOGY CONSTRUCTION

Opinion Mining

Accessing database using nlp

Natural Language Processing with Neo4j

New trends in NLP applications

KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final

Weiskopf AMIA 2016_abstract

Natural language processing

Similar to Identifying Entity Aspects in Microblog Posts

The Next-Generation SharePoint: Powered by Text Analytics

Peter Wren-Hilton

The Next Generation SharePoint: Powered by Text Analytics

Alyona Medelyan

Vision of the future: Organization 2.0

Teemu Arina

DBAs spend too time with routine tasks leaving little time for innovation. Autonomous Databases free data professionals to extract more value from data. Oracle Machine Learning, in Autonomous Database, “moves the algorithms; not the data” for 100% in-database processing. Data professionals perform many supporting tasks for “data scientists”, typically 80% of the work. Come learn an evolutionary path for Oracle data professionals to leverage domain knowledge and data skills and add machine learning. See how to build and deploy predictive models inside the Database. Using examples, demos and sharing experiences, Charlie will show you how to discover new insights, make predictions and become an “Oracle Data Scientist” in just 6 weeks!

Oracle Machine Learning Overview and From Oracle Data Professional to Oracle ...

Charlie Berger

Once developers have a knowledge management model - covered in our August webinar - they still have to deal with real world implementation constraints. Big data is a fact of life for most modern AI/cognitive computing apps, which usually means ingesting, sampling, or analyzing large data sets from disparate sources, ranging from IOT sensors to social media streams to news feeds and weather forecasts. Frequently, historical data in legacy systems will also be required to generate new insights. This webinar will present a framework to help participants evaluate streaming data management tools, IOT technology stacks, and graph databases as support tools for their modern AI/cognitive computing projects. They will also learn about emerging open source projects and ecosystems that can help kick start their projects today.

Smart Data Webinar: Choosing the Right Data Management Architecture for Cogni...

DATAVERSITY

Connecting Talent to Opportunity.. at scale @ LinkedIn

Anmol Bhasin

The Briefing Room with Robin Bloor and Teradata Live Webcast on Jan. 29, 2013 Despite its name, effective Data Science requires a certain amount of artistic flair. Analysts must be creative about how and where they find the insights that will drive business value. One classic roadblock to that kind of frictionless process? Programming. Not everyone can code Java, which makes the unstructured domain of Hadoop quite challenging for the average business analyst. Check out the slides from this episode of the Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how a new generation of analytical platforms will solve the complexity of unifying structured and unstructured data. He'll be briefed by Steve Wooledge of Teradata Aster who will tout his company's Big Data Appliance, which leverages the SQL-H bridge, an innovation designed to connect Hadoop with SQL. Visit: http://www.insideanalysis.com

Left Brain, Right Brain: How to Unify Enterprise Analytics

Inside Analysis

Semantics empowered Physical-Cyber-Social Systems for EarthCube

Amit Sheth

ISWC 2012 - Industry Track - Linked Enterprise Data: leveraging the Semantic ...

Antidot

Introduction to Big Data An analogy between Sugar Cane & Big Data

Jean-Marc Desvaux

GPSBUS201-GPS Demystifying Artificial Intelligence

Amazon Web Services

Ttg Ca Border Sec Mjd22jun07

martindudziak

Text and Beyond

Seth Grimes

TCS Innovation Forum 2012 - Big Data Whodini

Tata Consultancy Services

Using Big Data to create a data drive organization

Edward Chenard

Demian DIAL Seminar, Cambridge 19/3/2013

pdemian

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Amy W. Tang

IEEE International Conference on Semantic Computing (ICSC 2011). A Multidimensional Semantic Space for Data Model Independent Queries over RDF Data André Freitas, João Gabriel Oliveira, Edward Curry Seán O’Riain http://andrefreitas.org/papers/preprint_multidimensional_ieee_icsc_2011.pdf Abstract: The vision of creating a Linked Data Web brings together the challenge of allowing queries across highly heterogeneous and distributed datasets. In order to query Linked Data on the Web today, end-users need to be aware of which datasets potentially contain the data and also which data model describes these datasets. The process of allowing users to expressively query relationships in RDF while abstracting them from the underlying data model represents a fundamental problem for Web-scale Linked Data consumption. This article introduces a multidimensional semantic space model which enables data model independent natural language queries over RDF data. The center of the approach relies on the use of a distributional semantic model to address the level of semantic interpretation demanded to build the data model independent approach. The ﬁnal multidimensional semantic space proved to be ﬂexible and precise under real-world query conditions achieving mean reciprocal rank = 0.516, avg. precision = 0.482 and avg. recall =0.491.

A Multidimensional Semantic Space for Data Model Independent Queries over RDF...

Andre Freitas

In the era of big data, data processing has taken center stage. The focus often is the speeds and feeds of the organizational data supply chain. Much less thought and expertise have focused on what matters most – using analytics in the proper context. Why is context so important? Why isn’t it enough to slay the v-named data dragons – namely, volume, variety and velocity? In this webinar, you’ll learn how successful organizations apply the right analytical capabilities in the proper context. Because without context, analytics can become noise that disturbs the decision-making process instead of helping it. There is no one-size-fits-all context generator. We’ll also discuss the importance of putting data into the proper context for analytical decision making, why data is your most valuable organizational asset, and how you can apply analytics in a way that converts your data into tangible benefits.

Analyze Your Data, Transform Your Business

DATAVERSITY

Oracle's BigData solutions consist of a number of new products and solutions to support customers looking to gain maximum business value from data sets such as weblogs, social media feeds, smart meters, sensors and other devices that generate massive volumes of data (commonly defined as ‘Big Data’) that isn’t readily accessible in enterprise data warehouses and business intelligence applications today.

Oracle's BigData solutions

Swiss Big Data User Group

Similar to Identifying Entity Aspects in Microblog Posts (20)

The Next-Generation SharePoint: Powered by Text Analytics

The Next Generation SharePoint: Powered by Text Analytics

Vision of the future: Organization 2.0

Oracle Machine Learning Overview and From Oracle Data Professional to Oracle ...

Smart Data Webinar: Choosing the Right Data Management Architecture for Cogni...

Connecting Talent to Opportunity.. at scale @ LinkedIn

Left Brain, Right Brain: How to Unify Enterprise Analytics

Semantics empowered Physical-Cyber-Social Systems for EarthCube

ISWC 2012 - Industry Track - Linked Enterprise Data: leveraging the Semantic ...

Introduction to Big Data An analogy between Sugar Cane & Big Data

GPSBUS201-GPS Demystifying Artificial Intelligence

Ttg Ca Border Sec Mjd22jun07

Text and Beyond

TCS Innovation Forum 2012 - Big Data Whodini

Using Big Data to create a data drive organization

Demian DIAL Seminar, Cambridge 19/3/2013

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Multidimensional Semantic Space for Data Model Independent Queries over RDF...

Analyze Your Data, Transform Your Business

Oracle's BigData solutions

More from Damiano Spina

A Formal Account of Effectiveness Evaluation and Ranking Fusion

Damiano Spina

Topic models such as Latent Dirichlet Allocation (LDA) have been extensively used for characterizing text collections according to the topics discussed in documents. Organizing documents according to topic can be applied to different information access tasks such as document clustering, content-based recommendation or summarization. Spoken documents such as podcasts typically involve more than one speaker (e.g., meetings, interviews, chat shows or news with reporters). This paper presents a work-in-progress based on a variation of LDA that includes in the model the different speakers participating in conversational audio transcripts. Intuitively, each speaker has her own background knowledge which generates different topic and word distributions. We believe that informing a topic model with speaker segmentation (e.g., using existing speaker diarization techniques) may enhance discovery of topics in multi-speaker audio content.

SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...

Damiano Spina

Learning Similarity Functions for Topic Detection in Online Reputation Monito...

Damiano Spina

We present a semi-automatic tool that assists experts in their daily work of monitoring the reputation of entities—companies, organizations or public figures—in Twitter. The tool automatically annotates tweets for relevance (Is the tweet about the entity?), reputational polarity (Does the tweet convey positive or negative implications for the reputation of the entity?), groups tweets in topics and display topics in decreasing order of relevance from a reputational perspective. The interface helps the user to understand the content being analyzed and also to produce a manually annotated version of the data starting from the output of the automatic annotation processes. A demo is available at http://nlp.uned.es/orma

ORMA: A Semi-Automatic Tool for Online Reputation Monitoring in Twitter

Damiano Spina

Online Reputation Monitoring in Twitter from an Information Access Perspective

Damiano Spina

In this paper we describe the collaborative participation of UvA \& UNED at RepLab 2013. We propose an active learning approach for the filtering subtask, using features based on the detected semantics in the tweet (using Entity Linking with Wikipedia), as well as tweet-inherent features such as hashtags and usernames. The tweets manually inspected during the active learning process is at most 1\% of the test data. While our baseline does not perform well, we can see that active learning does improve the results.

Towards an Active Learning System for Company Name Disambiguation in Microblo...

Damiano Spina

This paper describes the UNED's Online Reputation Monitoring Team participation at RepLab 2013. Several approaches were tested: first, an instance-based learning approach that uses Heterogeneity Based Ranking to combine seven different similarity measures was applied for all the subtasks. The filtering subtask was also tackled by automatically discovering filter keywords: those whose presence in a tweet reliably confirm (positive keywords) or discard (negative keywords) that the tweet refers to the company. Different approaches have been submitted for the topic detection subtask: agglomerative clustering over wikified tweets, co-occurrence term clusteringand an LDA-based model that uses temporal information. Finally, the polarity subtask was tackled by generating domain specific semantic graphs in order to automatically expand the general purpose lexicon SentiSense. We next use the domain specific sub-lexicons to classify tweets according to their reputational polarity, following an emotional concept-based system for sentiment analysis. We corroborated that using entity-level training data improves the filtering step. Additionally, the proposed approaches to detect topics obtained the highest scores in the official evaluation, showing that they are promising directions to address the problem. In the reputational polarity task, our results suggest that a deeper analysis should be done in order to correctly identify the main differences between the Reputational Polarity task and traditional Sentiment Analysis tasks. A final remark is that the overall performance of a monitoring system in RepLab 2013 highly depends on the performance of the initial filtering step.

UNED Online Reputation Monitoring Team at RepLab 2013

Damiano Spina

This work explores the real-time summarization of scheduled events such as soccer games from torrential ows of Twitter streams. We propose and evaluate an approach that substantially shrinks the stream of tweets in real-time, and consists of two steps: (i) sub-event detection, which determines if something new has occurred, and (ii) tweet se- lection, which picks a representative tweet to describe each sub-event. We compare the summaries generated in three languages for all the soccer games in Copa America 2011 to reference live reports oered by Yahoo! Sports journal- ists. We show that simple text analysis methods which do not involve external knowledge lead to summaries that cover 84% of the sub-events on average, and 100% of key types of sub-events (such as goals in soccer). Our approach should be straightforwardly applicable to other kinds of scheduled events such as other sports, award ceremonies, keynote talks, TV shows, etc.

Towards Real-Time Summarization of Scheduled Events from Twitter Streams

Damiano Spina

Microblogs have become an invaluable source of information for the purpose of online reputation management. Streams of microblogs are of great value because of their direct and real-time nature. An emerging problem is to identify not only microblog posts (such as tweets) that are relevant for a given entity, but also the specific aspects that people discuss. Determining such aspects can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication. In this paper we present two manually annotated corpora to evaluate the task of identifying aspects on Twitter, both of them based upon the WePS-3 ORM task dataset and made available online. The first is created using a pooling methodology, for which we have implemented various methods for automatically extracting aspects from tweets that are relevant for an entity. Human assessors have labeled each of the candidates as being relevant. The second corpus is more fine-grained and contains opinion targets. Here, annotators consider individual tweets related to an entity and manually identify whether the tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.

A Corpus for Entity Profiling in Microblog Posts

Damiano Spina

Filter keywords and majority class strategies for company name disambiguation...

Damiano Spina

Evaluación de sistemas de monitorización de contenidos generados por usuarios

Damiano Spina

Caracterización de una entidad basada en opiniones: un estudio de caso

Damiano Spina

More from Damiano Spina (12)

A Formal Account of Effectiveness Evaluation and Ranking Fusion

SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...

Learning Similarity Functions for Topic Detection in Online Reputation Monito...

ORMA: A Semi-Automatic Tool for Online Reputation Monitoring in Twitter

Online Reputation Monitoring in Twitter from an Information Access Perspective

Towards an Active Learning System for Company Name Disambiguation in Microblo...

UNED Online Reputation Monitoring Team at RepLab 2013

Towards Real-Time Summarization of Scheduled Events from Twitter Streams

A Corpus for Entity Profiling in Microblog Posts

Filter keywords and majority class strategies for company name disambiguation...

Evaluación de sistemas de monitorización de contenidos generados por usuarios

Caracterización de una entidad basada en opiniones: un estudio de caso

Identifying Entity Aspects in Microblog Posts

1. Identifying Entity Aspects in Microblog Posts damiano@lsi.uned.es, {edgar.meij, derijke}@uva.nl, {oghina,mbui,mbreuss}@science.uva.nl UNED NLP & IR Group ISLA, University of Amsterdam Madrid, Spain 35th International ACM SIGIR Conference on Research and Development in Information Retrieval Amsterdam, The Netherlands Portland, Oregon, USA. August 12–16, 2012. Introduction Dataset  94 entities, 17,775 tweets, ≈177 tweets/entity  Scenario: Online reputation management  Built upon WePS-3 ORM Task Dataset. Disambiguated company names (e.g. apple fruit vs. Apple Inc.) What do people say about an entity? entity = { company, organization, individual, product}  Pooling methodology. Three annotators with substantial agreement http://bit.ly/profilingTwitter (Cohen/Fleiss’ kappa > 0.6 ) Identification of aspects discussed on microblog posts  Annotated 2455 terms, 1304 aspects (54.11%)  Aspects ≈ products, services, competitors, key people Most of the true aspects are nouns (89.72%). related to the entity Entity Aspects Apple Inc. ipad, iphone, prototype, store, gizmodo, employee Sony advertising, headphones, digital, pro, music, xperia, dsc, x10, bravia, camera, vegas, battery, ericsson, playstation Starbucks coffee, latte, tea, frappuccino, barista, drink, mocha Identifying Entity Aspects Experiments and Results  All Words:  TF.IDF significantly outperforms PLM and OO in precision. Input Output  The results for OO are much lower than for the other methods.  Noun filter: 1. Applying a part-of-speech filter and only consider terms tagged as “inter” nouns (crosstown derby)  For all methods, MAP and precision values are slightly higher than in the all words condition: considering only nouns helps to identify aspects.  PLM best method using Noun filter. 2. “Berlusconi”  Observations (after manually inspecting the results): (club president)  Results for TF.IDF, LLR and PLM are very similar.  OO tends to return more subjective terms as aspects (syntactic parsing 3. errors) Examples: haha, pls, xd, safety, win “shirt”  OO has more difficulty to filter out generic terms. … Examples: new, use, today, come Main idea  Comparing a pseudo-document D built from entity-specific tweets with a background corpus C.  Score of a term t = s(t, D, C) Four models tested:  TF.IDF  LLR: Log-Likelihood Ratio  PLM: Parsimonious Language Models  OO: Opinion-oriented. Opinion target extraction using topic-specific subjective lexicons Student’s t-test statistic significance w.r.t to TF.IDF All words baseline with α=0.05 (*) and α=0.01 (**). Conclusions  Simple statistical methods such as TF.IDF are a strong baseline for the task of identifying entity aspects, significantly outperforming opinion-oriented methods.  Only considering terms tagged as nouns improves the results for all the methods analyzed. Acknowledgments: This research was supported by the European Union's CIP ICT-PSP under grant agreement nr 250430, the European Community's FP7 Programme under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe), the Netherlands Organisation for Scientific Research (NOW) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, 727.011.005, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, under COMMIT project Infiniti, the Spanish Ministry of Education (FPU grant nr AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010- 21128-C02), the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), the ESF Research Network Program ELIAS and the ACM Special Interest Group on Information Retrieval (SIGIR Student Travel Grant).

Identifying Entity Aspects in Microblog Posts

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Identifying Entity Aspects in Microblog Posts

Similar to Identifying Entity Aspects in Microblog Posts (20)

More from Damiano Spina

More from Damiano Spina (12)

Identifying Entity Aspects in Microblog Posts