Online reputation management is about monitoring and handling the public image of entities (such as companies) on the Web. An important task in this area is identifying aspects of the entity of interest (such as products, services, competitors, key people, etc.) given a stream of microblog posts referring to the entity. In this paper we compare different IR techniques and opinion target identification methods for automatically identifying aspects and find that (i) simple statistical methods such as TF.IDF are a strong baseline for the task, significantly outperforming opinion-oriented methods, and (ii) only considering terms tagged as nouns improves the results for all the methods analyzed.
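As a rough illustration of the TF.IDF baseline (a sketch with toy data, not the paper's implementation): each term seen in the entity's posts is scored by its frequency there times an inverse document frequency over a background collection of posts, and the top-scoring terms become candidate aspects.

```python
import math
from collections import Counter

def tfidf_aspects(entity_posts, background_posts, top_k=3):
    """Rank candidate aspect terms for an entity by TF.IDF:
    term frequency in the entity's posts times inverse document
    frequency over a background collection of posts."""
    tf = Counter(t for post in entity_posts for t in post.split())
    n_docs = len(background_posts)
    df = Counter()
    for post in background_posts:
        df.update(set(post.split()))
    scores = {t: f * math.log(n_docs / (1 + df[t])) for t, f in tf.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]

posts = ["iphone battery drains fast", "iphone screen cracked", "love the iphone camera"]
background = posts + ["great weather today", "battery recall news", "screen time report"]
print(tfidf_aspects(posts, background))
```

In the paper's best-performing setting, only terms tagged as nouns would be kept as candidates before scoring.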
Demystifying Disruption: A New Model for Understanding and Predicting Disrupt...Vivek Pundir
The failure of firms in the face of technological change has been a topic of intense research and debate,
spawning the theory (among others) of disruptive technologies. However, the theory suffers from circular
definitions, inadequate empirical evidence, and lack of a predictive model. Sood & Tellis develop a new schema to address these limitations. The schema generates seven hypotheses and a testable model relating to platform technologies.
The authors test this model and hypotheses with data on 36 technologies from seven markets. Contrary to extant theory,
technologies that adopt a lower attack (“potentially disruptive technologies”) (1) are introduced as frequently
by incumbents as by entrants, (2) are not cheaper than older technologies, and (3) rarely disrupt firms; and (4)
both entrants and lower attacks significantly reduce the hazard of disruption. Moreover, technology disruption is not permanent because of multiple crossings in technology performance and numerous rival technologies coexisting without one disrupting the other. The proposed predictive model of disruption shows good out-of-sample predictive accuracy. The authors discuss the implications of these findings.
Research assistants: Esra Kent, Vivek Pundir, and K. L. Dang.
Getting answers without asking questions (by Saartje Van den Branden)InSites on Stage
Getting answers without asking questions: Welcome to the world of unstructured data (by Saartje Van den Branden, Research Manager at InSites Consulting). Presented at the IBM Performance 2012 on November 13, 2012 in Brussels (BE).
SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Rea...Tobias Wunner
The creation of new knowledge in the Semantic Web increasingly depends on automatic knowledge enrichment processes, such as the semi-structured Information Extraction (IE) used to create DBpedia from Wikipedia. To further improve knowledge coverage, IE must also consider unstructured plain natural-language text resources. Here SOFIE offers a novel approach to IE that consistently enriches semantic models from text sources by combining pattern matching, entity disambiguation, and reasoning, formulated as a propositional-logic MAX-SAT problem.
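The MAX-SAT formulation behind this style of IE can be illustrated with a toy brute-force solver. Everything below is invented for illustration (the variables, weights, and the "one capital per country" rule are not SOFIE's actual rules); the point is only that the accepted facts are the truth assignment maximizing the total weight of satisfied clauses.

```python
from itertools import product

# Variables: h0 = "Paris capitalOf France", h1 = "Lyon capitalOf France"
# Clauses as (weight, list of (var_index, wanted_truth_value))
clauses = [
    (2.0, [(0, True)]),               # pattern evidence for h0
    (1.0, [(1, True)]),               # weaker pattern evidence for h1
    (5.0, [(0, False), (1, False)]),  # rule: a country has one capital
]

best_score, best_assignment = -1.0, None
for assignment in product([False, True], repeat=2):
    # A clause is satisfied if any of its literals matches the assignment
    score = sum(w for w, lits in clauses
                if any(assignment[i] == v for i, v in lits))
    if score > best_score:
        best_score, best_assignment = score, assignment
print(best_assignment)  # h0 accepted, h1 rejected
```

A real system would of course use an approximate MAX-SAT solver rather than enumerating all assignments.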
This presentation offers a first, in-breadth survey and comparison of current aspect mining tools and techniques. It focuses mainly on automated techniques that mine a program’s static or dynamic structure for candidate aspects. We present an initial comparative framework for distinguishing aspect mining techniques, and assess known techniques against this framework. The results of this assessment may serve as a roadmap to potential users of aspect mining techniques, to help them in selecting an appropriate technique. It also helps aspect mining researchers to identify remaining open research questions, possible avenues for future research, and interesting combinations of existing techniques.
SystemT: Declarative Information ExtractionYunyao Li
Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).
UNSUPERVISED LEARNING MODELS OF INVARIANT FEATURES IN IMAGES: RECENT DEVELOPM...ijscai
Object detection and recognition are important problems in the computer vision and pattern recognition domain. Human beings are able to detect and classify objects effortlessly, but replicating this ability in computer-based systems has proved to be a non-trivial task. In particular, despite significant research efforts focused on meta-heuristic object detection and recognition, robust and reliable real-time object recognition systems remain elusive. Here we present a survey of one particular approach that has proved very promising for invariant feature recognition and which is a key initial stage of multi-stage network architecture methods for the high-level task of object recognition.
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...ijscai
All types of automated systems generate large amounts of data in different forms, such as statistical, text, audio, video, sensor, and biometric data, giving rise to the term Big Data. In this paper we discuss the issues, challenges, and applications of these types of Big Data, taking the dimensions of big data into consideration. We cover social media analytics, content-based analytics, text analytics, and audio and video analytics, along with their issues and expected application areas. This should motivate researchers to address the issues of storage, management, and retrieval of such data. The use of Big Data analytics in India is also highlighted.
Ontologies are used to organize information in many domains, such as artificial intelligence, information science, the Semantic Web, and library science. Ontologies of an entity that hold different information can be merged to build more knowledge of that particular entity. Ontologies today power more accurate search and retrieval on websites such as Wikipedia. As we move towards Web 3.0, also termed the Semantic Web, ontologies will play an even more important role.
Ontologies are represented in various forms, such as RDF, RDFS, XML, and OWL. Querying an ontology can yield basic information about an entity. This paper proposes an automated method for ontology creation, using concepts from Natural Language Processing (NLP), Information Retrieval, and Machine Learning. Concepts drawn from these domains help in designing more accurate ontologies represented in the XML format. The paper uses document classification to assign labels to documents, document similarity to cluster documents similar to the input document, and summarization to shorten the text while keeping the terms essential to building the ontology. The module is implemented in Python with NLTK (the Natural Language Toolkit). The ontologies created in XML convey to a lay person the definitions of the important terms and their lexical relationships.
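The classification, similarity-clustering, and summarization steps described above can be sketched roughly as follows. The documents, the cosine-similarity clustering, the length-based term filter, and the XML layout are all illustrative assumptions, not the paper's actual module (which uses NLTK and trained classifiers).

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": "ontology merging enriches knowledge of an entity",
    "d2": "ontologies power search and retrieval on the web",
    "d3": "football results from the weekend league",
}
query = "merging ontologies creates richer entity knowledge"
qv = Counter(query.split())
# Cluster step: rank documents by similarity to the input document
ranked = sorted(docs, key=lambda d: -cosine(qv, Counter(docs[d].split())))
# Summarization step: keep the most frequent longer content terms
terms = Counter(t for d in ranked[:2] for t in docs[d].split() if len(t) > 4)
# Emit the retained terms as a flat XML ontology fragment
root = ET.Element("ontology")
for term, freq in terms.most_common(3):
    ET.SubElement(root, "term", name=term, freq=str(freq))
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

A real pipeline would replace the length filter with proper summarization and attach lexical relations between terms rather than a flat list.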
Abstract: Computer systems are generally operated in English, so a person who does not know English or the structure of a query language cannot use them. This paper proposes a new approach for accessing a database without knowing English: the database is accessed with the help of natural languages such as Hindi and Marathi. Natural language processing (NLP) is the study of mathematical and computational modeling of various aspects of language and the development of a wide range of systems. NLP holds great promise for making computer interfaces easier to use, since people will be able to talk to the computer in their own language rather than learn a specialized language of computer commands. Keywords: NLP, mathematical modeling, computational modeling
Recent natural language processing advancements have propelled search engine and information retrieval innovations into the public spotlight. People want to be able to interact with their devices in a natural way. In this talk I will be introducing you to natural language search using a Neo4j graph database. I will show you how to interact with an abstract graph data structure using natural language and how this approach is key to future innovations in the way we interact with our devices.
Tutorial given at RANLP 2015 in Hissar, Bulgaria
Recent years have seen lots of changes in the field of computational linguistics, most of them due to the widespread use of the Internet and the benefits and problems it brings. The first part of this tutorial will discuss these changes and will focus on crowdsourcing and how it influenced the creation of annotated data.
Annotation of data employed to train and test NLP methods used to be the task of language experts who had a good understanding of the linguistic phenomena to be tackled. Given that a large number of people now have access to the Internet, crowdsourcing has become an alternative way of obtaining annotated data. The core idea of crowdsourcing is that it is possible to design tasks that can be completed by non-experts and that the outputs of these tasks can be combined to obtain high-quality linguistic annotation, which would normally be produced by experts. Examples of how crowdsourcing was employed in computational linguistics will be given.
Big data is another trend in computational linguistics as researchers rely on more and more data for improving the results of a method. The second part of the tutorial will introduce the MapReduce programming model and show how it was used in processing language. Combined with processing larger quantities of data, the field of computational linguistics has applied deep learning to various tasks successfully, improving their accuracy. An introduction to deep learning will be provided, followed by examples of how it was applied to tasks such as learning semantic representations, sentiment analysis and machine translation evaluation.
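The MapReduce programming model the tutorial introduces can be illustrated with an in-memory word count: map emits (key, value) pairs, the framework shuffles them into per-key groups, and reduce aggregates each group. This is only a single-process sketch of the model, not a distributed implementation.

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every token in the document
    return [(word, 1) for word in doc.split()]

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key
    return key, sum(values)

corpus = ["big data big models", "data beats models"]

# Shuffle: group the intermediate pairs by key
groups = defaultdict(list)
for doc in corpus:
    for key, value in map_phase(doc):
        groups[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts)
```

In an actual MapReduce framework (e.g. Hadoop), the map and reduce functions look the same but run in parallel across machines, with the shuffle handled by the framework.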
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language. It is also called Computational Linguistics, which also concerns how computational methods can aid the understanding of human language.
Oracle Machine Learning Overview and From Oracle Data Professional to Oracle ...Charlie Berger
DBAs spend too much time on routine tasks, leaving little time for innovation. Autonomous Databases free data professionals to extract more value from data. Oracle Machine Learning, in Autonomous Database, "moves the algorithms, not the data" for 100% in-database processing. Data professionals perform many supporting tasks for data scientists, typically 80% of the work. Come learn an evolutionary path for Oracle data professionals to leverage domain knowledge and data skills and add machine learning. See how to build and deploy predictive models inside the Database. Using examples, demos, and shared experiences, Charlie will show you how to discover new insights, make predictions, and become an "Oracle Data Scientist" in just 6 weeks!
Smart Data Webinar: Choosing the Right Data Management Architecture for Cogni...DATAVERSITY
Once developers have a knowledge management model - covered in our August webinar - they still have to deal with real world implementation constraints. Big data is a fact of life for most modern AI/cognitive computing apps, which usually means ingesting, sampling, or analyzing large data sets from disparate sources, ranging from IOT sensors to social media streams to news feeds and weather forecasts. Frequently, historical data in legacy systems will also be required to generate new insights.
This webinar will present a framework to help participants evaluate streaming data management tools, IOT technology stacks, and graph databases as support tools for their modern AI/cognitive computing projects. They will also learn about emerging open source projects and ecosystems that can help kick start their projects today.
Left Brain, Right Brain: How to Unify Enterprise AnalyticsInside Analysis
The Briefing Room with Robin Bloor and Teradata
Live Webcast on Jan. 29, 2013
Despite its name, effective Data Science requires a certain amount of artistic flair. Analysts must be creative about how and where they find the insights that will drive business value. One classic roadblock to that kind of frictionless process? Programming. Not everyone can code Java, which makes the unstructured domain of Hadoop quite challenging for the average business analyst.
Check out the slides from this episode of the Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how a new generation of analytical platforms will solve the complexity of unifying structured and unstructured data. He'll be briefed by Steve Wooledge of Teradata Aster who will tout his company's Big Data Appliance, which leverages the SQL-H bridge, an innovation designed to connect Hadoop with SQL.
Visit: http://www.insideanalysis.com
Semantics empowered Physical-Cyber-Social Systems for EarthCubeAmit Sheth
Presentation at the EarthCube Face Face-to-Face Workshop of Semantics & Ontologies Workgroup: April 30-May 1, 2012, Ballston, VA.
Workshop site: http://earthcube.ning.com/group/semantics-and-ontologies/page/workshops
For more recent material on this topic, see: http://wiki.knoesis.org/index.php/PCS
In this session, we provide an overview of the artificial intelligence/machine learning landscape, discuss the current state of the industry, and identify new market opportunities. Partners will come away with a better understanding of the investment that AWS is making in this space, as well as our unique value proposition.
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInAmy W. Tang
This talk was given by Bhaskar Ghosh (Senior Director of Engineering, LinkedIn Data Infrastructure), at the Yale Oct 2012 Symposium on Big Data, in honor of Martin Schultz.
A Multidimensional Semantic Space for Data Model Independent Queries over RDF...Andre Freitas
IEEE International Conference on Semantic Computing (ICSC 2011).
A Multidimensional Semantic Space for Data Model Independent Queries over RDF Data
André Freitas, João Gabriel Oliveira, Edward Curry, Seán O'Riain
http://andrefreitas.org/papers/preprint_multidimensional_ieee_icsc_2011.pdf
Abstract: The vision of creating a Linked Data Web brings together the challenge of allowing queries across highly heterogeneous and distributed datasets. In order to query Linked Data
on the Web today, end-users need to be aware of which datasets potentially contain the data and also which data model describes these datasets. The process of allowing users to expressively
query relationships in RDF while abstracting away from the underlying data model represents a fundamental problem for Web-scale Linked Data consumption. This article introduces a multidimensional semantic space model which enables data-model-independent natural language queries over RDF data. The approach centers on the use of a distributional semantic model to provide the level of semantic interpretation demanded by the data-model-independent approach. The final multidimensional semantic space proved to be flexible and precise under real-world query conditions, achieving mean reciprocal rank = 0.516, avg. precision = 0.482, and avg. recall = 0.491.
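For readers unfamiliar with the reported metrics, mean reciprocal rank (MRR) averages, over all queries, the reciprocal of the rank at which the first correct answer appears. A minimal computation with toy data (not the paper's results):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: for each query, 1/rank of the first relevant item
    in its ranking (0 if none), averaged over queries."""
    total = 0.0
    for qid, ranking in ranked_results.items():
        for rank, item in enumerate(ranking, start=1):
            if item in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

runs = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
gold = {"q1": {"a"}, "q2": {"z"}}
print(mean_reciprocal_rank(runs, gold))  # (1/1 + 1/3) / 2
```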
Analyze Your Data, Transform Your BusinessDATAVERSITY
In the era of big data, data processing has taken center stage. The focus often is the speeds and feeds of the organizational data supply chain. Much less thought and expertise have focused on what matters most – using analytics in the proper context.
Why is context so important? Why isn’t it enough to slay the v-named data dragons – namely, volume, variety and velocity?
In this webinar, you’ll learn how successful organizations apply the right analytical capabilities in the proper context. Because without context, analytics can become noise that disturbs the decision-making process instead of helping it. There is no one-size-fits-all context generator.
We’ll also discuss the importance of putting data into the proper context for analytical decision making, why data is your most valuable organizational asset, and how you can apply analytics in a way that converts your data into tangible benefits.
Oracle's Big Data solutions comprise a number of new products designed to help customers gain maximum business value from data sets such as weblogs, social media feeds, smart meters, sensors, and other devices that generate massive volumes of data (commonly defined as 'Big Data') that isn't readily accessible in enterprise data warehouses and business intelligence applications today.
Similar to Identifying Entity Aspects in Microblog Posts
A Formal Account of Effectiveness Evaluation and Ranking FusionDamiano Spina
Slides of my presentation at ICTIR'18
September 17, 2018
Tianjin, China
Paper: https://dl.acm.org/citation.cfm?id=3234958
Version with formal proofs: https://arxiv.org/abs/1807.04317
Topic models such as Latent Dirichlet Allocation (LDA) have been extensively used for characterizing text collections according to the topics discussed in documents. Organizing documents according to topic can be applied to different information access tasks such as document clustering, content-based recommendation or summarization. Spoken documents such as podcasts typically involve more than one speaker (e.g., meetings, interviews, chat shows or news with reporters). This paper presents a work-in-progress based on a variation of LDA that includes in the model the different speakers participating in conversational audio transcripts. Intuitively, each speaker has her own background knowledge which generates different topic and word distributions. We believe that informing a topic model with speaker segmentation (e.g., using existing speaker diarization techniques) may enhance discovery of topics in multi-speaker audio content.
ORMA: A Semi-Automatic Tool for Online Reputation Monitoring in TwitterDamiano Spina
We present a semi-automatic tool that assists experts in their daily work of monitoring the reputation of entities—companies, organizations or public figures—in Twitter. The tool automatically annotates tweets for relevance (Is the tweet about the entity?) and reputational polarity (Does the tweet convey positive or negative implications for the reputation of the entity?), groups tweets into topics, and displays topics in decreasing order of relevance from a reputational perspective. The interface helps the user to understand the content being analyzed and also to produce a manually annotated version of the data, starting from the output of the automatic annotation processes. A demo is available at http://nlp.uned.es/orma
Towards an Active Learning System for Company Name Disambiguation in Microblo...Damiano Spina
In this paper we describe the collaborative participation of UvA & UNED at RepLab 2013. We propose an active learning approach for the filtering subtask, using features based on the semantics detected in the tweet (using entity linking with Wikipedia), as well as tweet-inherent features such as hashtags and usernames. The tweets manually inspected during the active learning process amount to at most 1% of the test data. While our baseline does not perform well, we can see that active learning does improve the results.
UNED Online Reputation Monitoring Team at RepLab 2013Damiano Spina
This paper describes the UNED Online Reputation Monitoring Team's participation at RepLab 2013. Several approaches were tested. First, an instance-based learning approach that uses Heterogeneity-Based Ranking to combine seven different similarity measures was applied to all the subtasks. The filtering subtask was also tackled by automatically discovering filter keywords: those whose presence in a tweet reliably confirms (positive keywords) or discards (negative keywords) that the tweet refers to the company.
Different approaches were submitted for the topic detection subtask: agglomerative clustering over wikified tweets, co-occurrence term clustering, and an LDA-based model that uses temporal information.
Finally, the polarity subtask was tackled by generating domain-specific semantic graphs in order to automatically expand the general-purpose lexicon SentiSense. We then use the domain-specific sub-lexicons to classify tweets according to their reputational polarity, following an emotional concept-based system for sentiment analysis.
We corroborated that using entity-level training data improves the filtering step. Additionally, the proposed approaches to detect topics obtained the highest scores in the official evaluation, showing that they are promising directions for addressing the problem. In the reputational polarity task, our results suggest that a deeper analysis is needed to correctly identify the main differences between the reputational polarity task and traditional sentiment analysis tasks. A final remark is that the overall performance of a monitoring system in RepLab 2013 highly depends on the performance of the initial filtering step.
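The filter-keyword idea can be sketched as follows: on labeled training tweets, a term that almost always co-occurs with the entity becomes a positive keyword, and one that almost never does becomes a negative keyword. The threshold and the toy tweets below are assumptions for illustration, not the paper's actual settings.

```python
from collections import Counter

def filter_keywords(tweets, labels, threshold=0.9):
    """Split vocabulary into positive keywords (presence reliably
    confirms the entity) and negative keywords (presence reliably
    discards it), based on labeled tweets."""
    in_count, total = Counter(), Counter()
    for tweet, is_entity in zip(tweets, labels):
        for term in set(tweet.split()):
            total[term] += 1
            if is_entity:
                in_count[term] += 1
    positive = {t for t in total if in_count[t] / total[t] >= threshold}
    negative = {t for t in total if in_count[t] / total[t] <= 1 - threshold}
    return positive, negative

tweets = ["apple releases new ipad", "apple pie recipe", "apple ipad review"]
labels = [True, False, True]  # does the tweet refer to the company?
pos, neg = filter_keywords(tweets, labels)
```

Ambiguous terms such as the company name itself fall into neither set, which is exactly why keywords beyond the name are useful for filtering.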
Towards Real-Time Summarization of Scheduled Events from Twitter StreamsDamiano Spina
This work explores the real-time summarization of scheduled events such as soccer games from torrential flows of Twitter streams. We propose and evaluate an approach that substantially shrinks the stream of tweets in real-time, and consists of two steps: (i) sub-event detection, which determines if something new has occurred, and (ii) tweet selection, which picks a representative tweet to describe each sub-event. We compare the summaries generated in three languages for all the soccer games in Copa America 2011 to reference live reports offered by Yahoo! Sports journalists. We show that simple text analysis methods which do not involve external knowledge lead to summaries that cover 84% of the sub-events on average, and 100% of key types of sub-events (such as goals in soccer). Our approach should be straightforwardly applicable to other kinds of scheduled events such as other sports, award ceremonies, keynote talks, TV shows, etc.
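The two-step pipeline can be sketched as follows: flag a sub-event when the tweet volume in a time window spikes above the recent average, then pick the tweet most typical of that window as its summary. The window volumes, the spike factor, and the term-overlap selection criterion are simplified assumptions, not the authors' exact method.

```python
from collections import Counter

def detect_subevents(window_sizes, factor=2.0):
    """Step (i): return indices of windows whose tweet volume exceeds
    `factor` times the mean volume of all preceding windows."""
    flagged = []
    for i in range(1, len(window_sizes)):
        avg = sum(window_sizes[:i]) / i
        if window_sizes[i] > factor * avg:
            flagged.append(i)
    return flagged

def select_tweet(tweets):
    """Step (ii): pick the tweet sharing the most terms with the
    window's overall vocabulary (a crude centrality criterion)."""
    vocab = Counter(t for tw in tweets for t in tw.split())
    return max(tweets, key=lambda tw: sum(vocab[t] for t in set(tw.split())))

volumes = [10, 12, 11, 55, 13]  # tweets per time window
windows = {3: ["goool what a strike", "GOAL amazing strike", "what a goal"]}
for i in detect_subevents(volumes):
    print(select_tweet(windows[i]))
```

Because both steps use only the stream itself, the sketch mirrors the paper's point that no external knowledge is required.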
A Corpus for Entity Profiling in Microblog PostsDamiano Spina
Microblogs have become an invaluable source of information for the purpose of online reputation management. Streams of microblogs
are of great value because of their direct and real-time nature. An emerging problem is to identify not only microblog posts (such as tweets) that are relevant for a given entity, but also the specific aspects that people discuss. Determining such aspects can be non-trivial
because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form
of communication. In this paper we present two manually annotated corpora to evaluate the task of identifying aspects on Twitter, both
of them based upon the WePS-3 ORM task dataset and made available online. The first is created using a pooling methodology, for
which we have implemented various methods for automatically extracting aspects from tweets that are relevant for an entity. Human
assessors have then labeled each of the candidates as relevant or not. The second corpus is more fine-grained and contains opinion targets.
Here, annotators consider individual tweets related to an entity and manually identify whether the tweet is opinionated and, if so, which
part of the tweet is subjective and what the target of the sentiment is, if any.
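The pooling step behind the first corpus can be illustrated with a minimal sketch (the function name and pool depth are hypothetical, not from the paper): the top-ranked candidate aspects produced by each automatic extraction method are merged into a single deduplicated pool, which is then handed to the human assessors for judgment.

```python
def build_pool(ranked_runs, depth=20):
    """Pool the top-`depth` candidates from each method's ranked list,
    deduplicating while preserving first-seen order."""
    pool, seen = [], set()
    for run in ranked_runs:
        for term in run[:depth]:
            if term not in seen:
                seen.add(term)
                pool.append(term)
    return pool
```

Pooling keeps the annotation effort bounded: assessors only judge candidates that at least one method ranked highly, rather than every term in the collection.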
Caracterización de una entidad basada en opiniones: un estudio de caso (English: Opinion-Based Characterization of an Entity: A Case Study)
Identifying Entity Aspects in Microblog Posts
damiano@lsi.uned.es, {edgar.meij, derijke}@uva.nl, {oghina,mbui,mbreuss}@science.uva.nl
UNED NLP & IR Group, Madrid, Spain
ISLA, University of Amsterdam, Amsterdam, The Netherlands
35th International ACM SIGIR Conference on Research and Development in Information Retrieval. Portland, Oregon, USA. August 12–16, 2012.
Introduction
Scenario: online reputation management. What do people say about an entity, where an entity may be a company, organization, individual, or product? The task is the identification of aspects discussed in microblog posts; aspects are, roughly, the products, services, competitors, key people, and other items related to the entity.
Dataset
94 entities, 17,775 tweets (≈177 tweets/entity). Built upon the WePS-3 ORM Task dataset, with disambiguated company names (e.g. apple the fruit vs. Apple Inc.). Candidates were gathered with a pooling methodology and judged by three annotators with substantial agreement (Cohen/Fleiss' kappa > 0.6). In total, 2,455 terms were annotated, of which 1,304 are aspects (54.11%); most of the true aspects are nouns (89.72%). Available at http://bit.ly/profilingTwitter
Entity: Aspects
Apple Inc.: ipad, iphone, prototype, store, gizmodo, employee
Sony: advertising, headphones, digital, pro, music, xperia, dsc, x10, bravia, camera, vegas, battery, ericsson, playstation
Starbucks: coffee, latte, tea, frappuccino, barista, drink, mocha
Identifying Entity Aspects
Main idea: compare a pseudo-document D, built from the entity-specific tweets, with a background corpus C; each candidate term t receives a score s(t, D, C). Example outputs: 1. "inter" (crosstown derby); 2. "Berlusconi" (club president); 3. "shirt"; ...
Four models tested:
TF.IDF
LLR: Log-Likelihood Ratio
PLM: Parsimonious Language Models
OO: Opinion-oriented, i.e. opinion target extraction using topic-specific subjective lexicons
Experiments and Results
All words: TF.IDF significantly outperforms PLM and OO in precision, and the results for OO are much lower than for the other methods.
Noun filter: applying a part-of-speech filter and considering only terms tagged as nouns. For all methods, MAP and precision values are slightly higher than in the all-words condition, so considering only nouns helps to identify aspects; PLM is the best method under the noun filter.
Observations (after manually inspecting the results): the results for TF.IDF, LLR and PLM are very similar. OO tends to return more subjective terms as aspects (syntactic parsing errors), e.g. haha, pls, xd, safety, win, and has more difficulty filtering out generic terms, e.g. new, use, today, come.
Significance: Student's t-test w.r.t. the TF.IDF all-words baseline, with α=0.05 (*) and α=0.01 (**).
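The s(t, D, C) idea can be sketched with the TF.IDF variant, the strongest baseline in the study. This is a minimal illustration, not the paper's exact weighting: term frequency is counted over the concatenated entity tweets (the pseudo-document D), document frequency over a background corpus C, with add-one smoothing; the function name and smoothing are assumptions.

```python
import math
from collections import Counter

def score_terms(entity_tweets, background_docs):
    """Rank terms of the pseudo-document D (all entity tweets)
    by TF.IDF against a background corpus C of documents."""
    # TF: term counts over the concatenated entity tweets.
    tf = Counter(t for tweet in entity_tweets for t in tweet.lower().split())
    # DF: number of background documents containing each term.
    df = Counter()
    for doc in background_docs:
        df.update(set(doc.lower().split()))
    n = len(background_docs)
    # Smoothed IDF so unseen background terms do not divide by zero.
    scores = {t: tf[t] * math.log((n + 1) / (df[t] + 1)) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)
```

Terms that are frequent in the entity's tweets but rare in the background rank highest; in the paper's setting, restricting the candidates to terms tagged as nouns before scoring further improved all four models.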
Conclusions
Simple statistical methods such as TF.IDF are a strong
baseline for the task of identifying entity aspects,
significantly outperforming opinion-oriented methods.
Only considering terms tagged as nouns improves
the results for all the methods analyzed.
Acknowledgments: This research was supported by the European Union's CIP ICT-PSP under grant agreement nr 250430, the European Community's FP7 Programme under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe),
the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, 727.011.005, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the
Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, under COMMIT project Infiniti, the Spanish Ministry of Education (FPU grant nr AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010-
21128-C02), the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), the ESF Research Network Program ELIAS and the ACM Special Interest Group on Information Retrieval (SIGIR Student Travel Grant).