NEM Summit – 28 Sept. 2011
Identifying Topics in Social Media Posts using DBpedia
Óscar Muñoz-García, Manuel de la Higuera Hernández, Carlos Navarro (Havas Media)
Andrés García-Silva, Óscar Corcho (Ontology Engineering Group - UPM)
Contents
 Introduction
 Related Work
 Description of the Method
 Evaluation
 Conclusions
Introduction
 Topic Identification
 “The task of identifying the central ideas in a text” [Chin-Yew Lin, 1995]
 Applications of Topic Identification for Social Media
 Automatically summarising the content published in a channel.
 Mining the interest of a given user.
 etc…
 Benefits for Advertising Companies
 To focus advertising actions on the appropriate channels.
 To serve ads to users based on their interests.
 Difficulties of Topic Identification in Social Media
 Different channels with heterogeneous texts
   • Different lengths
      From short sentences on Twitter to medium-size articles in blogs
   • Misspellings
      Posts written entirely in uppercase (or lowercase) letters, which makes proper nouns difficult to detect
      In Spanish, the presence or absence of an accent can change the meaning of a word
        “té” = “tea” (common noun)
        “te” = “you” (personal pronoun)
   • Use of set phrases
      E.g., “too many cooks spoil the broth” (if too many people try to take charge of a task, the end product might be ruined)
      E.g., “rain cats and dogs” (to rain heavily)
 It is important to take into account the context of the post
 Why DBpedia?
 DBpedia is a structured Semantic Web representation of Wikipedia
   • Wikipedia is maintained by thousands of editors
   • Wikipedia evolves and adapts as knowledge changes [Syed et al, 2008]
 Each topic identified is mapped to a DBpedia resource
   • E.g., the URI http://dbpedia.org/resource/Turin
      Represents the city of Torino
      Has about 45 attributes defined (population, area, latitude, longitude, etc.)
      Has labels and definitions in 14 different languages
      Is linked with many semantic entities
        E.g., as the birthplace of Amedeo Avogadro: http://dbpedia.org/resource/Amedeo_Avogadro
      Is linked with its Wikipedia article: http://en.wikipedia.org/wiki/Torino
 DBpedia is a nucleus for the Web of Data [Bizer et al, 2009]
   • Data published on the Web according to Tim Berners-Lee’s Linked Data principles
   • Several billion RDF triples (i.e., facts)
   • Multi-domain datasets (geographic information, people, companies, online communities, etc.)
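As a rough illustration of how a topic name corresponds to a DBpedia resource, the sketch below builds resource URIs from Wikipedia article titles. It assumes the usual convention that the resource name is the article title with spaces replaced by underscores. The helper name `dbpedia_resource_uri` is hypothetical, and no lookup is performed, so a returned URI is not guaranteed to exist in DBpedia.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def dbpedia_resource_uri(article_title: str) -> str:
    """Build a DBpedia resource URI from a Wikipedia article title.

    Assumption: the resource name is the title with spaces replaced by
    underscores; existence of the resource is not checked.
    """
    name = article_title.strip().replace(" ", "_")
    # Leave underscores, parentheses, commas and apostrophes unescaped.
    return DBPEDIA_RESOURCE_PREFIX + quote(name, safe="_(),'")


print(dbpedia_resource_uri("Turin"))
print(dbpedia_resource_uri("Amedeo Avogadro"))
print(dbpedia_resource_uri("User (computing)"))
```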
Related Work
 Wikipedia has been exploited for the following tasks:
 Topic identification and text categorization
   • [Bodo et al, 2007], [Coursey et al, 2009], [Gabrilovich et al, 2006], [Syed et al, 2008], [Schonhofen, 2009]
 Semantic relatedness between fragments of text
   • [Gabrilovich et al, 2007]
 Keyword extraction
   • [Mihalcea et al, 2007]
 Word sense disambiguation
   • [Mihalcea, 2007]
 Uses of Wikipedia data-structure:
 Relating words in text with articles using article title information
   • [Schonhofen, 2009]
 Exploiting anchor text in links
   • [Coursey et al, 2009] [Mihalcea et al, 2007] [Mihalcea, 2007]
 Exploiting whole articles
   • [Syed et al, 2008] [Gabrilovich, 2007]
 Exploiting categories to measure relatedness between articles
   • [Coursey et al, 2009] [Syed et al, 2008]
 Exploiting disambiguation pages and redirection links to select candidate senses and alternative labels
   • [Medelyan et al, 2008]
 Supervised learning methods
   • [Bodo et al, 2007] [Gabrilovich et al, 2006] [Medelyan et al, 2008]
 Unsupervised techniques
   Based on a vector space model
     • [Schonhofen, 2009]
   Based on a graph
     • [Coursey et al, 2009] [Syed et al, 2008]
 Combined methods (supervised and unsupervised)
   Based on a vector space model
     • [Mihalcea et al, 2007]
 Our approach
 Exploits titles, disambiguation pages, redirection links and article text to select candidate senses and alternative labels
 Uses an unsupervised method
 Uses a vector space model
 Main benefit in comparison with previous approaches:
 The interlinking of social media posts with the Web of Data through DBpedia resources
Description of the Method
Input
Part-of-speech tagging
• “torino”, “art”, “media”, “user”, “cloud”
Topic Recognition
• http://dbpedia.org/resource/Turin
• http://dbpedia.org/resource/Art
• http://dbpedia.org/resource/User_(computing)
Language Filtering
• “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
 Part-of-speech tagging
 W_p = w_1, w_2, ..., w_n: list of lexical units contained in the post p
 lexcat(w): lexical category of the lexical unit w
 lemma(w): lemma of w
 L = {common noun, proper noun, acronym, …}: meaningful lexical categories that we consider
 Stop words (lemmas excluded): {“RT”, “/cc”, “;)”, …}
 K_p = k_1, k_2, …, k_n: list of keywords with meaning
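The definitions above can be read as a filter over the tagged post. The following is a minimal sketch, assuming the part-of-speech tagger has already produced (form, category, lemma) triples. The `LexicalUnit` type, the category labels and the choice to drop repeated lemmas are illustrative assumptions, not details taken from the slides.

```python
from typing import List, NamedTuple, Set


class LexicalUnit(NamedTuple):
    form: str    # surface form as written in the post
    lexcat: str  # lexical category assigned by a POS tagger
    lemma: str   # lemma of the lexical unit


# Meaningful lexical categories (L on the slide); labels are illustrative.
MEANINGFUL_CATEGORIES: Set[str] = {"common noun", "proper noun", "acronym"}
# Stop words: lemmas excluded even when their category is meaningful.
STOP_WORDS: Set[str] = {"RT", "/cc", ";)"}


def extract_keywords(units: List[LexicalUnit]) -> List[str]:
    """Return K_p: lemmas whose category is in L and that are not stop words."""
    keywords: List[str] = []
    for unit in units:
        if unit.lexcat in MEANINGFUL_CATEGORIES and unit.lemma not in STOP_WORDS:
            if unit.lemma not in keywords:  # assumption: keep first occurrence only
                keywords.append(unit.lemma)
    return keywords


post = [
    LexicalUnit("RT", "acronym", "RT"),
    LexicalUnit("Blackberry", "proper noun", "Blackberry"),
    LexicalUnit("phones", "common noun", "phone"),
    LexicalUnit("are", "verb", "be"),
    LexicalUnit("phones", "common noun", "phone"),
]
print(extract_keywords(post))  # ['Blackberry', 'phone']
```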
 Part-of-speech tagging example
Input
• But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.
Part-of-speech tagging
• Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
 Topic Recognition (Sem4Tags [García-Silva et al, 2010])
POS tagging
• Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
Context Selection
• Blackberry, {phone, hardware, trackball, mouse}
• Computer, {hardware, mouse, problem, desk}
• …
Disambiguation
• http://dbpedia.org/resource/BlackBerry
• http://dbpedia.org/resource/Computer
 Context Selection
 For each keyword, a set of up to 4 related keywords is selected to help disambiguate its meaning
 4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]
 We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia et al, 2009]
Keyword     Relatedness   Keyword       Relatedness
phone       0.347         hardware      0.347
trackball   0.311         mouse         0.311
computer    0.288         desk          0.287
problem     0.246         rubber ball   0.246
grime       0.190         gunk          0.168
Active context selection for the “blackberry” keyword
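The selection step can be sketched as a top-k ranking over relatedness scores. The snippet below assumes the scores have already been computed by some relatedness measure and simply reuses the values from the table. The function name `select_active_context` is hypothetical, and ties are broken by the order in which keywords are listed, which is an assumption.

```python
from typing import Dict, List


def select_active_context(relatedness: Dict[str, float], max_size: int = 4) -> List[str]:
    """Return up to max_size keywords with the highest relatedness score."""
    # sorted() is stable, so tied keywords keep their listing order.
    ranked = sorted(relatedness.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _ in ranked[:max_size]]


# Relatedness of each keyword to "blackberry", taken from the table above.
blackberry_relatedness = {
    "phone": 0.347, "hardware": 0.347, "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287, "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}
print(select_active_context(blackberry_relatedness))
# ['phone', 'hardware', 'trackball', 'mouse']
```

The result matches the context {phone, hardware, trackball, mouse} shown for Blackberry in the topic recognition example.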
 Disambiguation Criteria
 OPTION 1: Most frequent sense for the ambiguous word
   • Determined by Wikipedia editors (the first link in a disambiguation page)
 OPTION 2: Vector space model
   1. A vector is created containing the keyword and its context
   2. A vector containing the top N terms is created for each candidate sense using TF-IDF (Term Frequency and Inverse Document Frequency)
   3. Cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword
DBpedia resource                               Definition                                   Similarity
BlackBerry                                     Is a line of mobile e-mail and smartphone    0.224
Blackberry                                     Is an edible fruit                           0.15
BlackBerry_(song)                              Is a song by the Black Crowes                0.0
BlackBerry_Township,_Itasca_County,_Minnesota  Is a township in … Itasca County             0.0
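The vector space option can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the token lists standing in for the candidate senses are made up, the smoothed IDF formula is one common variant rather than the one used by the authors, and the names `tf_idf_vectors`, `cosine` and `disambiguate` are hypothetical. The scores it produces are therefore not the ones in the table.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def tf_idf_vectors(senses: Dict[str, List[str]], top_n: int = 50) -> Dict[str, Vector]:
    """Build one TF-IDF vector per candidate sense, keeping its top_n terms."""
    n_docs = len(senses)
    doc_freq = Counter(term for tokens in senses.values() for term in set(tokens))
    vectors: Dict[str, Vector] = {}
    for name, tokens in senses.items():
        counts = Counter(tokens)
        weights = {
            # Smoothed IDF keeps terms shared by all senses at a non-zero weight.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        vectors[name] = dict(top)
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(keyword: str, context: List[str],
                 senses: Dict[str, List[str]]) -> Tuple[str, Dict[str, float]]:
    """Return the sense most similar to the keyword-plus-context vector."""
    query = {term: 1.0 for term in [keyword] + context}
    scores = {name: cosine(query, vec) for name, vec in tf_idf_vectors(senses).items()}
    return max(scores, key=scores.get), scores


# Toy token lists for two candidate senses (illustrative, not real article text).
candidate_senses = {
    "BlackBerry": "blackberry line mobile email smartphone phone hardware trackball".split(),
    "Blackberry": "blackberry edible fruit plant".split(),
}
best, scores = disambiguate("blackberry", ["phone", "hardware", "trackball", "mouse"],
                            candidate_senses)
print(best, scores)
```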
 Language Filtering
 T_p = t_1, t_2, ..., t_n: set of topics identified
 l: language to filter
 Labels(t): set of labels associated with a given topic t (values of the rdfs:label property)
 lang(b): language of a given label b
 T_p^l: set of topics with labels in language l
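Language filtering keeps only the topics that have at least one label in the target language. The sketch below assumes labels are available in memory as (text, language tag) pairs. In practice they would come from the `rdfs:label` values of each resource, and the sample labels here are illustrative rather than retrieved from DBpedia.

```python
from typing import Dict, List, Tuple

Label = Tuple[str, str]  # (label text, language tag)


def filter_topics_by_language(topics: List[str],
                              labels: Dict[str, List[Label]],
                              language: str) -> Dict[str, List[str]]:
    """Keep topics with a label in `language`, mapped to those labels."""
    filtered: Dict[str, List[str]] = {}
    for topic in topics:
        matching = [text for text, lang in labels.get(topic, []) if lang == language]
        if matching:
            filtered[topic] = matching
    return filtered


# Illustrative label data; a real system would read rdfs:label values.
sample_labels = {
    "http://dbpedia.org/resource/Turin": [("Turin", "en"), ("Torino", "it")],
    "http://dbpedia.org/resource/Art": [("Art", "en"), ("Arte", "it")],
    "http://dbpedia.org/resource/User_(computing)": [("User (computing)", "en")],
}
italian_topics = filter_topics_by_language(list(sample_labels), sample_labels, "it")
print(italian_topics)
```

With this sample data the third topic is dropped because it has no Italian label, which is the effect that reduces coverage after filtering.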
Evaluation
 Evaluated with a corpus of 10,000 posts in Spanish extracted from
    Blogs
    Forums
    Microblogs (e.g., Twitter)
    Social networks (e.g., Facebook, MySpace, LinkedIn and Xing)
    Review sites (e.g., Ciao and Dooyoo)
    Audiovisual sites (e.g., YouTube and Flickr)
    News publishing sites (e.g., elpais.com, elmundo.es)
    Others (web pages not classified in the categories above)
 Variants evaluated
   1. Without considering any context
      • Default Wikipedia sense assigned for a given keyword
   2. Considering as context all the other keywords found in the same post
   3. Active context selection technique
      • Selecting the 4 most relevant keywords among those in the same post
 Coverage
 Part-of-speech tagging: nearly 100%
 Topic recognition: over 90% in almost all cases
 After language filtering, coverage is reduced by about 10% because not all DBpedia resources have a label defined in Spanish
                   Blogs    Forums   Microblogs  Social Networks  Others   Reviews  Audiovisual  News     Overall
POS Tagging        99.63%   96.64%   99.01%      98.14%           98.77%   98.20%   97.20%       99.62%   98.32%
Topic identification
  Without context  96.7%    87.68%   94.22%      93.54%           92.71%   88.81%   90.29%       96.67%   92.35%
  With context     96.64%   93.07%   95.54%      94.99%           95.13%   92.67%   97.41%       98.54%   95.02%
  Active context   99.24%   89.71%   94.43%      96.40%           94.75%   93.81%   92.23%       97.4%    94.72%
Topic identification after language filtering
  Without context  91.21%   79.04%   87.54%      82.64%           86.93%   70.15%   82.52%       90.71%   82.74%
  With context     88.43%   80.84%   86.31%      85.24%           88.72%   76.19%   89.66%       92.46%   84.85%
  Active context   89.69%   80.51%   86.51%      86.78%           89.78%   75.59%   80.58%       90.54%   84.73%
 Precision
 Evaluated a random sample of 1,816 posts (18.16%)
 47 human evaluators
 Each post and its identified topics were shown to 3 different evaluators
 Evaluation options:
   1. The topic is not related to the post
   2. The topic is somehow related to the post
   3. The topic is closely related to the post
   4. The evaluator does not have enough information to make a decision
 Fleiss’ kappa test
   • Strength of agreement for 2 evaluators = 0.826 (very good)
   • Strength of agreement for 3 evaluators = 0.493 (moderate)
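For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' kappa from a table of rating counts using the standard textbook formula. The input layout (one row per evaluated topic, one column per evaluation option) and the sample ratings are assumptions for illustration. The snippet does not reproduce the figures reported above.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = raters who put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Observed agreement per item, averaged over all items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance, from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        return 1.0  # every rating falls in a single category
    return (p_bar - p_e) / (1 - p_e)


# Made-up ratings: 5 topics, 3 evaluators, 4 options
# (not related, somehow related, closely related, not enough information).
sample_ratings = [
    [0, 0, 3, 0],
    [0, 1, 2, 0],
    [3, 0, 0, 0],
    [0, 2, 1, 0],
    [1, 0, 1, 1],
]
print(round(fleiss_kappa(sample_ratings), 3))
```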
 Precision Results
 Precision depends on the channel
   • From 59.19% for social networks
      More misspellings
      More common nouns
   • To 88.89% for review sites
      Concrete products and brands
      Proper nouns tend to have a Wikipedia entry
 The best context selection criterion also depends on the channel
   • Active context selection is better for microblogs and review sites
   • Considering all the post keywords as context is better for blogs
   • No context selection is better for the rest of the cases (almost all the channels)
 Naïve default sense selection is effective
Conclusions
 We have achieved good coverage results
 The precision depends on the channel (best for review sites, worst for social networks)
 With respect to considering context or not, no single variant provides the best results for all the channels
 Future lines of work:
 Improve Natural Language Processing
   • Deal with slang
   • Detect set phrases
   • Improve n-gram detection
   • Deal with microblogs’ specifics (e.g., hashtag expansion)
 Combine broad-domain topic identification with knowledge about specific domains
   • Use domain ontologies in combination with the DBpedia ontology
Thank you!
oscar.munoz@havasmedia.com

More Related Content

What's hot

Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
Bernhard Haslhofer
 
Web and text
Web and textWeb and text
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
Houw Liong The
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)Uma Se
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data era
Roku
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and Challenges
Jens Lehmann
 
The Dublin Core 1:1 Principle in the Age of Linked Data
The Dublin Core 1:1 Principle in the Age of Linked DataThe Dublin Core 1:1 Principle in the Age of Linked Data
The Dublin Core 1:1 Principle in the Age of Linked Data
Richard Urban
 
Tutorial semantic wikis and applications
Tutorial   semantic wikis and applicationsTutorial   semantic wikis and applications
Tutorial semantic wikis and applications
Mark Greaves
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
Kapil Kumar
 

What's hot (9)

Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
 
Web and text
Web and textWeb and text
Web and text
 
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data era
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and Challenges
 
The Dublin Core 1:1 Principle in the Age of Linked Data
The Dublin Core 1:1 Principle in the Age of Linked DataThe Dublin Core 1:1 Principle in the Age of Linked Data
The Dublin Core 1:1 Principle in the Age of Linked Data
 
Tutorial semantic wikis and applications
Tutorial   semantic wikis and applicationsTutorial   semantic wikis and applications
Tutorial semantic wikis and applications
 
Social recommender system
Social recommender systemSocial recommender system
Social recommender system
 

Viewers also liked

Horario 19 de diciembre de 2014
Horario 19 de  diciembre de 2014Horario 19 de  diciembre de 2014
Horario 19 de diciembre de 2014
Cole Navalazarza
 
Bum Gennaio 2013
Bum Gennaio 2013Bum Gennaio 2013
Study Twitter
Study TwitterStudy Twitter
Study Twitter
Nurun
 
Conceptos Basicos de Red
Conceptos Basicos de RedConceptos Basicos de Red
Conceptos Basicos de Red
Yoly Gamero Bartolo
 
Foucault - De la amistad como modo de vida
Foucault - De la amistad como modo de vidaFoucault - De la amistad como modo de vida
Foucault - De la amistad como modo de vida
María Apellido
 
SafeNet DataSecure vs. Native SQL Server Encryption
SafeNet DataSecure vs. Native SQL Server EncryptionSafeNet DataSecure vs. Native SQL Server Encryption
SafeNet DataSecure vs. Native SQL Server Encryption
SafeNet
 
Deria pendengaran
Deria pendengaranDeria pendengaran
Deria pendengaran
Azuan Dy
 
Presentación proyecto implementaciòn Sistema de gestiòn del riesgo
Presentación proyecto implementaciòn Sistema de gestiòn del riesgoPresentación proyecto implementaciòn Sistema de gestiòn del riesgo
Presentación proyecto implementaciòn Sistema de gestiòn del riesgo
Ericka Vanessa pejendino perea
 
Limpieza del registro de windows
Limpieza del registro de windowsLimpieza del registro de windows
Limpieza del registro de windows
Juan Fco Alcantar Rmz
 
Curso de Formación Conversia - Marketing en la Red
Curso de Formación Conversia - Marketing en la RedCurso de Formación Conversia - Marketing en la Red
Curso de Formación Conversia - Marketing en la Red
Conversia
 
Rocksport Camp Anubhav - Rajaji National Park
Rocksport Camp Anubhav - Rajaji National ParkRocksport Camp Anubhav - Rajaji National Park
Rocksport Camp Anubhav - Rajaji National Park
Ankit Khandelwal
 
CONSEILS STRATEGIQUES POUR UN BUSINESS
CONSEILS STRATEGIQUES POUR UN BUSINESSCONSEILS STRATEGIQUES POUR UN BUSINESS
CONSEILS STRATEGIQUES POUR UN BUSINESS
Le Développement de son marketing
 
Componentes de los ecosistemas
Componentes de los ecosistemasComponentes de los ecosistemas
Componentes de los ecosistemas
María Eugenia Zapata Avendaño
 
Cateter venoso-central
Cateter venoso-centralCateter venoso-central
Cateter venoso-central
salomegg
 
Atividade: Quantas silabas quantas letras
Atividade: Quantas silabas quantas letrasAtividade: Quantas silabas quantas letras
Atividade: Quantas silabas quantas letras
oficinadeaprendizagemace
 

Viewers also liked (17)

Horario 19 de diciembre de 2014
Horario 19 de  diciembre de 2014Horario 19 de  diciembre de 2014
Horario 19 de diciembre de 2014
 
Bum Gennaio 2013
Bum Gennaio 2013Bum Gennaio 2013
Bum Gennaio 2013
 
Study Twitter
Study TwitterStudy Twitter
Study Twitter
 
Conceptos Basicos de Red
Conceptos Basicos de RedConceptos Basicos de Red
Conceptos Basicos de Red
 
Foucault - De la amistad como modo de vida
Foucault - De la amistad como modo de vidaFoucault - De la amistad como modo de vida
Foucault - De la amistad como modo de vida
 
Osteoporosis en ap
Osteoporosis en apOsteoporosis en ap
Osteoporosis en ap
 
SafeNet DataSecure vs. Native SQL Server Encryption
SafeNet DataSecure vs. Native SQL Server EncryptionSafeNet DataSecure vs. Native SQL Server Encryption
SafeNet DataSecure vs. Native SQL Server Encryption
 
Deria pendengaran
Deria pendengaranDeria pendengaran
Deria pendengaran
 
Presentación proyecto implementaciòn Sistema de gestiòn del riesgo
Presentación proyecto implementaciòn Sistema de gestiòn del riesgoPresentación proyecto implementaciòn Sistema de gestiòn del riesgo
Presentación proyecto implementaciòn Sistema de gestiòn del riesgo
 
El autismo
El autismoEl autismo
El autismo
 
Limpieza del registro de windows
Limpieza del registro de windowsLimpieza del registro de windows
Limpieza del registro de windows
 
Curso de Formación Conversia - Marketing en la Red
Curso de Formación Conversia - Marketing en la RedCurso de Formación Conversia - Marketing en la Red
Curso de Formación Conversia - Marketing en la Red
 
Rocksport Camp Anubhav - Rajaji National Park
Rocksport Camp Anubhav - Rajaji National ParkRocksport Camp Anubhav - Rajaji National Park
Rocksport Camp Anubhav - Rajaji National Park
 
CONSEILS STRATEGIQUES POUR UN BUSINESS
CONSEILS STRATEGIQUES POUR UN BUSINESSCONSEILS STRATEGIQUES POUR UN BUSINESS
CONSEILS STRATEGIQUES POUR UN BUSINESS
 
Componentes de los ecosistemas
Componentes de los ecosistemasComponentes de los ecosistemas
Componentes de los ecosistemas
 
Cateter venoso-central
Cateter venoso-centralCateter venoso-central
Cateter venoso-central
 
Atividade: Quantas silabas quantas letras
Atividade: Quantas silabas quantas letrasAtividade: Quantas silabas quantas letras
Atividade: Quantas silabas quantas letras
 

Similar to Identifying Topics in Social Media Posts using DBpedia

Identifying Topics in Social Media Posts
Identifying Topics in Social Media PostsIdentifying Topics in Social Media Posts
Identifying Topics in Social Media Postshavasmedialabs
 
Web 2.0
Web 2.0Web 2.0
Web 2.0bjornh
 
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
PhiloWeb
 
Web 2.0
Web 2.0Web 2.0
Web 2.0
bjornh
 
Extracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme DocumentsExtracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme Documentsmaria.grineva
 
Wikipedia as source of collaboratively created Knowledge Organization Systems
Wikipedia as source of collaboratively created Knowledge Organization SystemsWikipedia as source of collaboratively created Knowledge Organization Systems
Wikipedia as source of collaboratively created Knowledge Organization Systems
Jakob .
 
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintSw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintokeee
 
Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...
John Breslin
 
Dissecting Wikipedia
Dissecting WikipediaDissecting Wikipedia
Dissecting Wikipedia
Andrew Gray
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
Jonathan Stray
 
Feedable, Portable, Mashable, DITAble
Feedable, Portable, Mashable, DITAbleFeedable, Portable, Mashable, DITAble
Feedable, Portable, Mashable, DITAble
Michael Priestley
 
Intro semanticweb
Intro semanticwebIntro semanticweb
Intro semanticweb
ultimate007
 
Web 20
Web 20Web 20
Web 20
Rob Kovi
 
Web 20-1217591424848412-9
Web 20-1217591424848412-9Web 20-1217591424848412-9
Web 20-1217591424848412-9
Radost Sviridon
 
Exploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsExploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining Applications
IRJET Journal
 
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
shakimov
 
Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries
mdabrowski
 
Web 2.0 and the LMS
Web 2.0 and the LMSWeb 2.0 and the LMS
Web 2.0 and the LMS
Bryan Alexander
 
DataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
DataPortability and Me: Introducing SIOC, FOAF and the Semantic WebDataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
DataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
John Breslin
 
Social media as a tool for terminological research
Social media as a tool for terminological researchSocial media as a tool for terminological research
Social media as a tool for terminological research
TERMCAT
 

Similar to Identifying Topics in Social Media Posts using DBpedia (20)

Identifying Topics in Social Media Posts
Identifying Topics in Social Media PostsIdentifying Topics in Social Media Posts
Identifying Topics in Social Media Posts
 
Web 2.0
Web 2.0Web 2.0
Web 2.0
 
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
Freddy Limpens: From folksonomies to ontologies: a socio-technical solution.
 
Web 2.0
Web 2.0Web 2.0
Web 2.0
 
Extracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme DocumentsExtracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme Documents
 
Wikipedia as source of collaboratively created Knowledge Organization Systems
Wikipedia as source of collaboratively created Knowledge Organization SystemsWikipedia as source of collaboratively created Knowledge Organization Systems
Wikipedia as source of collaboratively created Knowledge Organization Systems
 
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintSw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
 
Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...
 
Dissecting Wikipedia
Dissecting WikipediaDissecting Wikipedia
Dissecting Wikipedia
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
Feedable, Portable, Mashable, DITAble
Feedable, Portable, Mashable, DITAbleFeedable, Portable, Mashable, DITAble
Feedable, Portable, Mashable, DITAble
 
Intro semanticweb
Intro semanticwebIntro semanticweb
Intro semanticweb
 
Web 20
Web 20Web 20
Web 20
 
Web 20-1217591424848412-9
Web 20-1217591424848412-9Web 20-1217591424848412-9
Web 20-1217591424848412-9
 
Exploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsExploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining Applications
 
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
 
Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries
 
Web 2.0 and the LMS
Web 2.0 and the LMSWeb 2.0 and the LMS
Web 2.0 and the LMS
 
DataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
DataPortability and Me: Introducing SIOC, FOAF and the Semantic WebDataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
DataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
 
Social media as a tool for terminological research
Social media as a tool for terminological researchSocial media as a tool for terminological research
Social media as a tool for terminological research
 

More from Óscar Muñoz García

Methods and Techniques for Segmentation of Consumers in Social Media
Methods and Techniques for Segmentation of Consumers in Social MediaMethods and Techniques for Segmentation of Consumers in Social Media
Methods and Techniques for Segmentation of Consumers in Social Media
Óscar Muñoz García
 
Content Analytics for Media Agencies
Content Analytics for Media AgenciesContent Analytics for Media Agencies
Content Analytics for Media Agencies
Óscar Muñoz García
 
¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?
¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?
¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?
Óscar Muñoz García
 
Caracterización de los usuarios de medios sociales mediante lugar de residenc...
Caracterización de los usuarios de medios sociales mediante lugar de residenc...Caracterización de los usuarios de medios sociales mediante lugar de residenc...
Caracterización de los usuarios de medios sociales mediante lugar de residenc...
Óscar Muñoz García
 
Big Data and Marketing Technology
Big Data and Marketing TechnologyBig Data and Marketing Technology
Big Data and Marketing Technology
Óscar Muñoz García
 
Análisis de Sentimientos en un Corpus de Redes Sociales
Análisis de Sentimientos en un Corpus de Redes SocialesAnálisis de Sentimientos en un Corpus de Redes Sociales
Análisis de Sentimientos en un Corpus de Redes Sociales
Óscar Muñoz García
 
Comparing user generated content published in different social media sources
Comparing user generated content published in different social media sourcesComparing user generated content published in different social media sources
Comparing user generated content published in different social media sources
Óscar Muñoz García
 
Social TV, más allá de la audiencia. Participación y relaciones
Social TV, más allá de la audiencia. Participación y relacionesSocial TV, más allá de la audiencia. Participación y relaciones
Social TV, más allá de la audiencia. Participación y relaciones
Óscar Muñoz García
 

More from Óscar Muñoz García (8)

Methods and Techniques for Segmentation of Consumers in Social Media
Methods and Techniques for Segmentation of Consumers in Social MediaMethods and Techniques for Segmentation of Consumers in Social Media
Methods and Techniques for Segmentation of Consumers in Social Media
 
Content Analytics for Media Agencies
Content Analytics for Media AgenciesContent Analytics for Media Agencies
Content Analytics for Media Agencies
 
¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?
¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?
¿Cómo puede ayudar el Big Data a dirigir las campañas de comunicación?
 
Caracterización de los usuarios de medios sociales mediante lugar de residenc...
Caracterización de los usuarios de medios sociales mediante lugar de residenc...Caracterización de los usuarios de medios sociales mediante lugar de residenc...
Caracterización de los usuarios de medios sociales mediante lugar de residenc...
 
Big Data and Marketing Technology
Big Data and Marketing TechnologyBig Data and Marketing Technology
Big Data and Marketing Technology
 
Análisis de Sentimientos en un Corpus de Redes Sociales
Análisis de Sentimientos en un Corpus de Redes SocialesAnálisis de Sentimientos en un Corpus de Redes Sociales
Análisis de Sentimientos en un Corpus de Redes Sociales
 
Comparing user generated content published in different social media sources
Comparing user generated content published in different social media sourcesComparing user generated content published in different social media sources
Comparing user generated content published in different social media sources
 
Social TV, más allá de la audiencia. Participación y relaciones
Social TV, más allá de la audiencia. Participación y relacionesSocial TV, más allá de la audiencia. Participación y relaciones
Social TV, más allá de la audiencia. Participación y relaciones
 



Identifying Topics in Social Media Posts using DBpedia

  • 1. NEM Summit – 28 Sept. 2011
      Identifying Topics in Social Media Posts using DBpedia
      Óscar Muñoz-García, Manuel de la Higuera Hernández, Carlos Navarro (Havas Media)
      Andrés García-Silva, Óscar Corcho (Ontology Engineering Group - UPM)
  • 2. Contents
      - Introduction
      - Related Work
      - Description of the Method
      - Evaluation
      - Conclusions
  • 3. Introduction
  • 4. Introduction
      - Topic Identification
          - “The task of identifying the central ideas in a text” [Chin-Yew Lin, 1995]
      - Applications of Topic Identification for Social Media
          - Automatically summarising the content published in a channel.
          - Mining the interests of a given user.
          - etc.
      - Benefits for Advertising Companies
          - Focusing advertising actions on the appropriate channels.
          - Serving ads to users based on their interests.
  • 5. Introduction
      - Difficulties of Topic Identification in Social Media
          - Different channels with heterogeneous texts
              - Different lengths
                  - From short sentences on Twitter to medium-size articles in blogs
              - Misspellings
                  - Posts written completely in uppercase (or lowercase) letters, which makes proper nouns difficult to detect.
                  - In Spanish, the absence or presence of an accent gives a word different meanings
                      - “té” = “tea” (common noun)
                      - “te” = “you” (personal pronoun)
              - Use of set phrases
                  - E.g., “too many cooks spoil the broth” (if too many people try to take charge of a task, the end product might be ruined)
                  - E.g., “rain cats and dogs” (rains heavily)
          - It is important to take into account the context of the post
  • 6. Introduction
      - Why DBpedia?
          - DBpedia is a structured Semantic Web representation of Wikipedia
              - Wikipedia is maintained by thousands of editors
              - Wikipedia evolves and adapts as knowledge changes [Syed et al, 2008]
          - Each identified topic is mapped to a DBpedia resource
              - E.g., the URI http://dbpedia.org/resource/Turin
                  - Represents the city of Torino
                  - Has about 45 attributes defined (population, area, latitude, longitude, etc.)
                  - Has labels and definitions in 14 different languages.
                  - Is linked with many semantic entities
                      - E.g., birth place of Amedeo Avogadro: http://dbpedia.org/resource/Amedeo_Avogadro
                  - Is linked with its Wikipedia article: http://en.wikipedia.org/wiki/Torino
          - It is a nucleus for the Web of Data [Bizer et al, 2009]
              - Data published on the Web according to Tim Berners-Lee’s Linked Data principles.
              - Several billion RDF triples (i.e., facts)
              - Multi-domain datasets (geographic information, people, companies, online communities, etc.)
  • 8. Related Work
  • 9. Related Work
      - Wikipedia has been exploited for the following tasks:
          - Topic identification and text categorization
              - [Bodo et al, 2007], [Coursey et al, 2009], [Gabrilovich et al, 2006], [Syed et al, 2008], [Schonhofen, 2009]
          - Semantic relatedness between fragments of text
              - [Gabrilovich et al, 2007]
          - Keyword extraction
              - [Mihalcea et al, 2007]
          - Word sense disambiguation
              - [Mihalcea, 2007]
  • 10. Related Work
      - Uses of the Wikipedia data structure:
          - Relating words in text with articles using article title information
              - [Schonhofen, 2009]
          - Exploiting anchor text in links
              - [Coursey et al, 2009] [Mihalcea et al, 2007] [Mihalcea, 2007]
          - Exploiting whole articles
              - [Syed et al, 2008] [Gabrilovich, 2007]
          - Exploiting categories to measure relatedness between articles
              - [Coursey et al, 2009] [Syed et al, 2008]
          - Exploiting disambiguation pages and redirection links to select candidate senses and alternative labels
              - [Medelyan et al, 2008]
  • 11. Related Work
      - Supervised learning methods
          - [Bodo et al, 2007] [Gabrilovich et al, 2006] [Medelyan et al, 2008]
      - Unsupervised techniques
          - Based on a vector space model
              - [Schonhofen, 2009]
          - Based on a graph
              - [Coursey et al, 2009] [Syed et al, 2008]
      - Combined methods (supervised and unsupervised)
          - Based on a vector space model
              - [Mihalcea et al, 2007]
  • 12. Related Work
      - Our approach
          - Exploits titles, disambiguation pages, redirection links and article text to select candidate senses and alternative labels
          - Uses an unsupervised method
          - Uses a vector space model
      - Main benefit in comparison with previous approaches:
          - The interlinking of social media posts with the Web of Data through DBpedia resources
  • 13. Description of the Method
  • 14. Description of the Method
      - Input
      - Part-of-speech tagging
          - “torino”, “art”, “media”, “user”, “cloud”
      - Topic Recognition
          - http://dbpedia.org/resource/Turin
          - http://dbpedia.org/resource/Art
          - http://dbpedia.org/resource/User_(computing)
      - Language Filtering
          - “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
  • 15. Description of the Method
      - Part-of-speech tagging
          - Wp = w1, w2, ..., wn: list of lexical units contained in the post
          - lexcat(w): lexical category of the lexical unit w
          - lemma(w): lemma of w
          - L = {common noun, proper noun, acronym, …}: meaningful lexical categories that we consider
          - Stop words (lemmas excluded): {“RT”, “/cc”, “;)”, …}
          - Kp = k1, k2, …, kn: list of keywords with meaning
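The keyword selection on slide 15 can be illustrated with a small Python sketch. It assumes, since the defining formula did not survive in the transcript, that a lexical unit becomes a keyword when its lexical category is in L and its lemma is not a stop word. The tagged input is hand-written and hypothetical; a real pipeline would obtain it from a part-of-speech tagger, and the names `extract_keywords`, `MEANINGFUL_CATEGORIES` and `STOP_WORDS` are illustrative only.

```python
# Assumed reading of slide 15: keep lemma(w) when lexcat(w) is in L
# and lemma(w) is not a stop word. All names here are hypothetical.
MEANINGFUL_CATEGORIES = {"common noun", "proper noun", "acronym"}  # L
STOP_WORDS = {"rt", "/cc", ";)"}  # lemmas excluded, compared in lowercase


def extract_keywords(tagged_units):
    """Return the lemmas of the lexical units that qualify as keywords.

    tagged_units is a list of (surface form, lemma, lexical category)
    triples, as a part-of-speech tagger might produce them.
    """
    keywords = []
    for _form, lemma, category in tagged_units:
        if category in MEANINGFUL_CATEGORIES and lemma.lower() not in STOP_WORDS:
            keywords.append(lemma)
    return keywords


# Hand-tagged toy post (not taken from the evaluation corpus).
tagged_post = [
    ("RT", "RT", "proper noun"),
    ("The", "the", "determiner"),
    ("Blackberry", "Blackberry", "proper noun"),
    ("phones", "phone", "common noun"),
    ("are", "be", "verb"),
    ("suffering", "suffer", "verb"),
    ("problems", "problem", "common noun"),
]

print(extract_keywords(tagged_post))
```

Here "RT" is tagged as a proper noun but is still dropped because its lemma is a stop word, which is why the stop-word list is needed in addition to the category filter.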
  • 16. Description of the Method
      - Part-of-speech tagging example
          - Input: But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberry's tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber ball on the underside to roll around the desk.
          - Part-of-speech tagging: Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
  • 17. Description of the Method
      - Topic Recognition (Sem4Tags [García-Silva et al, 2010])
          - POS tagging
              - Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
          - Context Selection
              - Blackberry, {phone, hardware, trackball, mouse}
              - Computer, {hardware, mouse, problem, desk}
              - …
          - Disambiguation
              - http://dbpedia.org/resource/BlackBerry
              - http://dbpedia.org/resource/Computer
  • 18. Description of the Method
      - Context Selection
          - For each keyword, a set of up to 4 related keywords that help to disambiguate its meaning
          - 4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]
          - We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia et al, 2009]
      - Active context selection for the keyword “blackberry”:

          Keyword      Relatedness    Keyword       Relatedness
          phone        0.347          hardware      0.347
          trackball    0.311          mouse         0.311
          computer     0.288          desk          0.287
          problem      0.246          rubber ball   0.246
          grime        0.190          gunk          0.168
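As a minimal sketch of the selection step on slide 18, the snippet below ranks the other keywords of the post by their relatedness to the target keyword and keeps the top four. The scores are the ones listed for “blackberry”; how they are computed (web co-occurrence, per [Gracia et al, 2009]) is outside this sketch, and the function name `select_active_context` is an assumption made for illustration.

```python
def select_active_context(relatedness, max_size=4):
    """Return up to max_size keywords with the highest relatedness score.

    relatedness maps each candidate context keyword to its relatedness
    with the keyword being disambiguated. sorted() is stable, so ties
    keep their original order.
    """
    ranked = sorted(relatedness.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _score in ranked[:max_size]]


# Relatedness scores for the keyword "blackberry", copied from slide 18.
blackberry_relatedness = {
    "phone": 0.347, "hardware": 0.347,
    "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287,
    "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}

print(select_active_context(blackberry_relatedness))
```

The result matches the active context shown for Blackberry on slide 17: {phone, hardware, trackball, mouse}.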
  • 19. Description of the Method
      - Disambiguation Criteria
          - OPTION 1: Most frequent sense for the ambiguous word
              - Determined by Wikipedia editors (the first link in a disambiguation page)
          - OPTION 2: Vector space model
              1. A vector is built containing the keyword and its context
              2. A vector containing the top N terms of each candidate sense is created using TF-IDF (Term Frequency and Inverse Document Frequency)
              3. Cosine similarity is used to determine which vectorised sense is most similar to the vector associated with the keyword

          DBpedia resource                                 Definition                                   Similarity
          BlackBerry                                       is a line of mobile e-mail and smartphone    0.224
          Blackberry                                       is an edible fruit                           0.15
          BlackBerry_(song)                                is a song by the Black Crowes                0.0
          BlackBerry_Township,_Itasca_County,_Minnesota    is a township in … Itasca County             0.0
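Option 2 on slide 19 can be sketched with plain Python as follows. This is a simplified illustration, not the Sem4Tags implementation: the sense descriptions are short made-up texts rather than Wikipedia articles, the top-N term truncation is omitted, the TF-IDF uses one common smoothed variant, and the keyword-plus-context vector is left unweighted. For these reasons the scores will not reproduce the similarity values in the table.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Map each document name to a sparse {term: tf-idf weight} vector."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def disambiguate(keyword, context, candidate_senses):
    """Return the best-matching sense name and the score of every sense."""
    sense_vectors = tf_idf_vectors(candidate_senses)
    query = Counter([keyword.lower()] + [word.lower() for word in context])
    scores = {name: cosine(query, vec) for name, vec in sense_vectors.items()}
    return max(scores, key=scores.get), scores


# Made-up sense descriptions, for illustration only.
senses = {
    "BlackBerry": "blackberry is a line of mobile email and smartphone "
                  "devices a phone with trackball hardware",
    "Blackberry": "the blackberry is an edible fruit",
    "BlackBerry_(song)": "a song by the black crowes",
}

best, scores = disambiguate(
    "blackberry", ["phone", "hardware", "trackball", "mouse"], senses
)
print(best)
```

With this toy data the device sense wins because it shares several context terms with the query, while the fruit sense only shares the keyword itself and the song sense shares nothing.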
  • 20. Description of the Method
      - Language Filtering
          - Tp = t1, t2, ..., tn: set of topics identified
          - l: language to filter
          - Labels(t): set of labels associated with a given topic (value of the rdfs:label property)
          - lang(b): language of a given label b
          - T^l_p: set of topics with labels in language l
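A minimal sketch of the language filter on slide 20, assuming a topic is kept when at least one of its labels is in the requested language. The labels below are typed in by hand for illustration (they are not fetched from DBpedia), the third URI is a made-up placeholder, and `filter_topics_by_language` is a hypothetical name.

```python
def filter_topics_by_language(topic_labels, language):
    """Keep the topics that have at least one label in the given language.

    topic_labels maps a topic URI to a list of (label text, language code)
    pairs, standing in for the values of rdfs:label. Returns a dict from
    each kept topic to its labels in that language.
    """
    filtered = {}
    for topic, labels in topic_labels.items():
        in_language = [text for text, lang in labels if lang == language]
        if in_language:
            filtered[topic] = in_language
    return filtered


# Hand-written labels for illustration only.
topic_labels = {
    "http://dbpedia.org/resource/Turin": [("Turin", "en"), ("Torino", "it")],
    "http://dbpedia.org/resource/Art": [("Art", "en"), ("Arte", "it")],
    # Placeholder topic that has no Italian label and is therefore dropped.
    "http://example.org/resource/Hypothetical_Topic": [("Hypothetical topic", "en")],
}

print(filter_topics_by_language(topic_labels, "it"))
```

Topics without a label in the target language are discarded, which is the effect that lowers coverage after language filtering in the evaluation.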
  • 21. Evaluation
  • 22. Evaluation
      - Evaluated on a corpus of 10,000 posts in Spanish extracted from:
          - Blogs
          - Forums
          - Microblogs (e.g., Twitter)
          - Social networks (e.g., Facebook, MySpace, LinkedIn and Xing)
          - Review sites (e.g., Ciao and Dooyoo)
          - Audiovisual sites (e.g., YouTube and Flickr)
          - News publishing sites (e.g., elpais.com, elmundo.es)
          - Others (web pages not classified in the categories above)
      - Variants evaluated
          1. Without considering any context
              - Default Wikipedia sense assigned for a given keyword
          2. Considering as context all the other keywords found in the same post
          3. Active context selection technique
              - Selecting the 4 most relevant topics from the keywords in the same post
  • 23. Evaluation
      - Coverage
          - Part-of-speech tagging: nearly 100%
          - Topic recognition: over 90% in almost all cases
          - Language filtering reduces coverage by about 10 percentage points, because not all DBpedia resources have a label defined in Spanish

                                 Blogs    Forums   Microblogs  Social networks  Others   Reviews  Audiovisual  News     Overall
          POS tagging            99.63%   96.64%   99.01%      98.14%           98.77%   98.20%   97.20%       99.62%   98.32%
          Topic identification
            Without context      96.7%    87.68%   94.22%      93.54%           92.71%   88.81%   90.29%       96.67%   92.35%
            With context         96.64%   93.07%   95.54%      94.99%           95.13%   92.67%   97.41%       98.54%   95.02%
            Active context       99.24%   89.71%   94.43%      96.40%           94.75%   93.81%   92.23%       97.4%    94.72%
          Topic identification after language filtering
            Without context      91.21%   79.04%   87.54%      82.64%           86.93%   70.15%   82.52%       90.71%   82.74%
            With context         88.43%   80.84%   86.31%      85.24%           88.72%   76.19%   89.66%       92.46%   84.85%
            Active context       89.69%   80.51%   86.51%      86.78%           89.78%   75.59%   80.58%       90.54%   84.73%
  • 24. Evaluation
      - Precision
          - Evaluated a random sample of 1,816 posts (18.16% of the corpus)
          - 47 human evaluators
          - Each post and its identified topics were shown to 3 different evaluators
          - Evaluation options:
              1. The topic is not related to the post
              2. The topic is somehow related to the post
              3. The topic is closely related to the post
              4. The evaluator does not have enough information to make a decision
          - Fleiss’ kappa test
              - Strength of agreement for 2 evaluators = 0.826 (very good)
              - Strength of agreement for 3 evaluators = 0.493 (moderate)
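For readers unfamiliar with the agreement statistic on slide 24, here is a short sketch of the standard Fleiss' kappa computation. The rating counts are invented toy data with the four evaluation options as categories and three evaluators per item. They are not the study's data and will not reproduce the 0.826 or 0.493 figures.

```python
def fleiss_kappa(ratings):
    """Compute Fleiss' kappa for a table of rating counts.

    ratings[i][j] is the number of raters who assigned item i to
    category j. Every item must be rated by the same number of raters.
    Undefined (division by zero) when all ratings fall in one category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Toy data: 5 topic assignments, 3 evaluators each, 4 options
# (not related, somehow related, closely related, not enough information).
toy_ratings = [
    [0, 0, 3, 0],
    [0, 1, 2, 0],
    [3, 0, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 2, 1],
]

print(round(fleiss_kappa(toy_ratings), 3))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, which is the scale behind the "very good" and "moderate" labels on the slide.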
  • 25. Evaluation
  • 26. Evaluation
      - Precision Results
          - Precision depends on the channel
              - From 59.19% for social networks
                  - More misspellings
                  - More common nouns
              - To 88.89% for review sites
                  - Concrete products and brands
                  - Proper nouns tend to have a Wikipedia entry
          - The best context selection criterion also depends on the channel
              - Active context selection is better for microblogs and review sites
              - Considering all the post keywords as context is better for blogs
              - Using no context is better in the remaining cases (almost all the channels)
                  - Naïve default sense selection is effective
  • 27. Conclusions
  • 28. Conclusions
      - We have achieved good coverage results
      - Precision depends on the channel (best for review sites, worst for social networks)
      - With respect to considering context or not, no single variant provides the best results for all the channels.
      - Future lines of work:
          - Improve Natural Language Processing
              - Dealing with slang
              - Detecting set phrases
              - Improving n-gram detection
              - Dealing with microblogs’ specifics (e.g., hashtag expansion)
          - Combine broad-domain topic identification with knowledge about specific domains
              - Use of domain ontologies in combination with the DBpedia ontology