Identifying Entity Aspects in Microblog Posts

damiano@lsi.uned.es
UNED NLP & IR Group, Madrid, Spain

{edgar.meij, derijke}@uva.nl, {oghina,mbui,mbreuss}@science.uva.nl
ISLA, University of Amsterdam, Amsterdam, The Netherlands

35th International ACM SIGIR Conference on Research and Development in Information Retrieval
Portland, Oregon, USA. August 12–16, 2012.




Introduction

  • Scenario: Online reputation management
      What do people say about an entity?
        entity = {company, organization, individual, product}
      Identification of aspects discussed in microblog posts
  • Aspects ≈ products, services, competitors, key people related to the entity

Dataset

  • 94 entities, 17,775 tweets, ≈177 tweets/entity
  • Built upon the WePS-3 ORM Task Dataset.
      Disambiguated company names (e.g. apple fruit vs. Apple Inc.)
  • Pooling methodology.
      Three annotators with substantial agreement (Cohen/Fleiss’ kappa > 0.6)
  • Annotated 2455 terms, 1304 aspects (54.11%)
      Most of the true aspects are nouns (89.72%).

  http://bit.ly/profilingTwitter

  Entity        Aspects
  Apple Inc.    ipad, iphone, prototype, store, gizmodo, employee
  Sony          advertising, headphones, digital, pro, music, xperia, dsc, x10, bravia, camera, vegas, battery, ericsson, playstation
  Starbucks     coffee, latte, tea, frappuccino, barista, drink, mocha
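The agreement figure quoted in the Dataset section (Cohen/Fleiss’ kappa > 0.6) can be illustrated for the two-annotator case. The following is a minimal sketch of Cohen's kappa only; the function name `cohen_kappa` and the toy labels (1 = "term is an aspect", 0 = "not an aspect") are hypothetical and are not taken from the annotated corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical judgements for ten candidate terms.
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

With three annotators, as on the poster, one would either average the pairwise Cohen values or use Fleiss' kappa; which of the two the reported threshold refers to per annotator pair is not stated in the source.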




Identifying Entity Aspects

  Main idea
    • Comparing a pseudo-document D built from entity-specific tweets with a background corpus C.
    • Score of a term t = s(t, D, C)

  Four models tested:
    • TF.IDF
    • LLR: Log-Likelihood Ratio
    • PLM: Parsimonious Language Models
    • OO: Opinion-oriented. Opinion target extraction using topic-specific subjective lexicons

  [Figure: Input → Output. Ranked output aspects: 1. “inter” (crosstown derby); 2. “Berlusconi” (club president); 3. “shirt”; …]

Experiments and Results

  • All Words:
      TF.IDF significantly outperforms PLM and OO in precision.
      The results for OO are much lower than for the other methods.
  • Noun filter:
      Apply a part-of-speech filter and consider only terms tagged as nouns.
      For all methods, MAP and precision values are slightly higher than in the All Words condition: considering only nouns helps to identify aspects.
      PLM is the best method when using the noun filter.
  • Observations (after manually inspecting the results):
      Results for TF.IDF, LLR and PLM are very similar.
      OO tends to return more subjective terms as aspects (syntactic parsing errors).
        Examples: haha, pls, xd, safety, win
      OO has more difficulty filtering out generic terms.
        Examples: new, use, today, come

  Statistical significance: Student’s t-test w.r.t. the TF.IDF All Words baseline, with α=0.05 (*) and α=0.01 (**).
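The scoring function s(t, D, C) from the Main idea can be sketched for two of the four models, TF.IDF and LLR. The poster does not give the exact formulas, so everything below is an assumption made for illustration: the whitespace tokenizer, the TF.IDF variant (term frequency in D times log inverse document frequency over a set of background pseudo-documents), the use of Dunning's 2x2 log-likelihood ratio, and all function names.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumed; the poster does not specify one)."""
    return text.lower().split()


def term_counts(texts):
    """Term frequencies of the pseudo-document built by concatenating the texts."""
    return Counter(tok for text in texts for tok in tokenize(text))


def tfidf_scores(entity_tweets, background_docs):
    """TF.IDF sketch: tf(t, D) * log(N / df(t)).

    D is the concatenation of the entity's tweets. C is modelled as a list of
    background pseudo-documents (for example, one per other entity); N counts
    those documents plus D itself. This modelling choice is an assumption.
    """
    d_counts = term_counts(entity_tweets)
    doc_vocabs = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_vocabs) + 1
    scores = {}
    for term, tf in d_counts.items():
        df = 1 + sum(term in vocab for vocab in doc_vocabs)  # D contains the term
        scores[term] = tf * math.log(n_docs / df)
    return scores


def llr(k_d, k_c, n_d, n_c):
    """Log-likelihood ratio (G^2) for a term seen k_d times in D (n_d tokens)
    and k_c times in C (n_c tokens), over the 2x2 contingency table."""
    observed = [k_d, k_c, n_d - k_d, n_c - k_c]
    total = n_d + n_c
    row_term = k_d + k_c
    row_other = total - row_term
    expected = [
        row_term * n_d / total,
        row_term * n_c / total,
        row_other * n_d / total,
        row_other * n_c / total,
    ]
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def llr_scores(entity_tweets, background_tweets):
    """LLR sketch: keep the statistic only for terms over-represented in D."""
    d_counts = term_counts(entity_tweets)
    c_counts = term_counts(background_tweets)
    n_d, n_c = sum(d_counts.values()), sum(c_counts.values())
    scores = {}
    for term, k_d in d_counts.items():
        k_c = c_counts[term]
        # G^2 is symmetric, so terms that are rarer in D than in C are zeroed.
        over_represented = k_d * n_c > k_c * n_d
        scores[term] = llr(k_d, k_c, n_d, n_c) if over_represented else 0.0
    return scores


# Hypothetical toy data, not taken from the dataset.
entity_tweets = ["new ipad today", "ipad store"]
background = ["new coffee today", "new camera"]
tfidf = tfidf_scores(entity_tweets, background)
loglik = llr_scores(entity_tweets, background)
```

In both sketches a generic term such as "new", which is at least as frequent in the background as in D, receives a score of zero, while an entity-specific term such as "ipad" ranks first. PLM and OO are not covered here.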




Conclusions

  • Simple statistical methods such as TF.IDF are a strong baseline for the task of identifying entity aspects, significantly outperforming opinion-oriented methods.
  • Only considering terms tagged as nouns improves the results for all the methods analyzed.
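The effect of the noun filter on a ranking metric can be illustrated with a small sketch. It assumes terms arrive already tagged with Penn Treebank part-of-speech tags (nouns start with "NN"); the tagset, the tags on the toy ranking, and the function names are hypothetical, and average precision is used here as the per-entity component of MAP.

```python
def noun_filter(tagged_terms):
    """Keep terms whose tag starts with 'NN' (assumed Penn Treebank noun tags)."""
    return [term for term, tag in tagged_terms if tag.startswith("NN")]


def average_precision(ranked_terms, true_aspects):
    """Average precision of a ranked term list against the annotated aspects."""
    hits = 0
    precision_sum = 0.0
    for rank, term in enumerate(ranked_terms, start=1):
        if term in true_aspects:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(true_aspects) if true_aspects else 0.0


def mean_average_precision(rankings, gold):
    """MAP over entities; both arguments are dicts keyed by entity name."""
    return sum(average_precision(rankings[e], gold[e]) for e in rankings) / len(rankings)


# Hypothetical ranked output for one entity, with assumed tags.
ranked_tagged = [("coffee", "NN"), ("new", "JJ"), ("latte", "NN"),
                 ("haha", "UH"), ("barista", "NN")]
gold_aspects = {"coffee", "latte", "barista"}

ap_all_words = average_precision([term for term, _ in ranked_tagged], gold_aspects)
ap_nouns_only = average_precision(noun_filter(ranked_tagged), gold_aspects)
```

In this toy case the filter removes the generic and subjective terms ("new", "haha"), so the noun-only ranking scores higher than the all-words ranking. The example only shows the mechanism and does not reproduce the poster's results.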



Acknowledgments: This research was supported by the European Union's CIP ICT-PSP under grant agreement nr 250430, the European Community's FP7 Programme under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe),
the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, 727.011.005, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the
Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, under COMMIT project Infiniti, the Spanish Ministry of Education (FPU grant nr AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010-21128-C02),
the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), the ESF Research Network Program ELIAS and the ACM Special Interest Group on Information Retrieval (SIGIR Student Travel Grant).
