This document discusses a method for identifying topics in social media posts using DBpedia. It begins with an introduction that outlines the task of topic identification, applications for social media, and challenges with short, misspelled texts. It then reviews related work exploiting Wikipedia and DBpedia for tasks like text categorization. The method section describes the process of part-of-speech tagging, context selection, disambiguation against DBpedia, and language filtering. An evaluation on 10,000 Spanish posts finds high coverage rates and precision varying by channel from 59-89%. The conclusions discuss achieving good coverage while noting precision depends on the channel and no single context approach works best across all channels.
Identifying Topics in Social Media Posts using DBpedia
1. NEM Summit – 28 Sept. 2011
Identifying Topics in Social Media Posts using DBpedia
Óscar Muñoz-García, Manuel de la Higuera Hernández, Carlos Navarro (Havas Media)
Andrés García-Silva, Óscar Corcho (Ontology Engineering Group - UPM)
2. Contents
Identifying Topics in Social Media Posts using DBpedia ⎢2
Introduction
Related Work
Description of the Method
Evaluation
Conclusions
4. Introduction
Topic Identification
“The task of identifying the central ideas in a text” [Chin-Yew Lin, 1995]
Applications of Topic Identification for Social Media
Automatically summarising the content published in a channel.
Mining the interests of a given user.
etc…
Benefits for Advertising Companies
To focus advertising actions on the appropriate channels.
To serve ads to users based on their interests.
5. Introduction
Difficulties of Topic Identification in Social Media
Different channels with heterogeneous texts
• Different lengths
From short sentences on Twitter to medium-size articles in blogs
• Misspellings
Posts completely written in uppercase (or lowercase) letters
This makes the detection of proper nouns difficult.
In Spanish, the absence or presence of an accent gives a word different meanings
“té” = “tea” (common noun)
“te” = “you” (personal pronoun)
• Use of set phrases
E.g., “too many cooks spoil the broth” (if too many people try to take charge of a task, the end product might be ruined)
E.g., “rain cats and dogs” (to rain heavily)
It is important to take into account the context of the post
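The accent problem above can be made concrete with a toy example. This is a minimal sketch (the two-entry lexicon is illustrative, not a resource from the paper) showing how naive accent stripping during text normalization collapses two distinct Spanish words into one surface form, losing the noun reading entirely:

```python
import unicodedata

# Toy accent-sensitive lexicon (illustrative entries, not the paper's resources).
LEXICON = {
    "té": "common noun (tea)",
    "te": "personal pronoun (you)",
}

def strip_accents(word: str) -> str:
    """Remove diacritics, as naive text normalization often does."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (Unicode category "Mn"), keeping base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# With the accent intact, the two words have different analyses...
assert LEXICON["té"] != LEXICON["te"]

# ...but after accent stripping, "té" becomes "te", so the
# "tea" (common noun) reading can no longer be recovered.
assert strip_accents("té") == "te"
```

This is why a pipeline for social media text cannot simply normalize away case and accents before analysis: the surrounding context of the post is needed to recover the intended sense.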
6. Introduction
Why DBpedia?
DBpedia is a structured Semantic Web representation of Wikipedia
• Wikipedia is maintained by thousands of editors
• Wikipedia evolves and adapts as knowledge changes [Syed et al, 2008]
Each identified topic is mapped to a DBpedia resource
• E.g., the URI http://dbpedia.org/resource/Turin
Represents the city of Torino
Has about 45 attributes defined (population, area, latitude, longitude, etc.)
Has labels and definitions in 14 different languages.
It is linked with many semantic entities
E.g., the birth place of Amedeo Avogadro: http://dbpedia.org/resource/Amedeo_Avogadro
It is linked with its Wikipedia article: http://en.wikipedia.org/wiki/Torino
It is a nucleus for the Web of Data [Bizer et al, 2009]
• Data published on the Web according to Tim Berners-Lee’s Linked Data principles.
• Several billion RDF triples (i.e., facts)
• Multi-domain datasets (geographic information, people, companies, online communities, etc…)
9. Related Work
Wikipedia has been exploited for the following tasks:
Topic identification and text categorization
• [Bodo et al, 2007], [Coursey et al, 2009], [Gabrilovich et al, 2006], [Syed et al, 2008], [Schonhofen, 2009]
Semantic relatedness between fragments of text
• [Gabrilovich et al, 2007]
Keyword extraction
• [Mihalcea et al, 2007]
Word sense disambiguation
• [Mihalcea, 2007]
10. Related Work
Uses of the Wikipedia data structure:
Relating words in text with articles using article title information
• [Schonhofen, 2009]
Exploiting anchor text in links
• [Coursey et al, 2009], [Mihalcea et al, 2007], [Mihalcea, 2007]
Exploiting whole articles
• [Syed et al, 2008], [Gabrilovich, 2007]
Exploiting categories to measure relatedness between articles
• [Coursey et al, 2009], [Syed et al, 2008]
Exploiting disambiguation pages and redirection links to select candidate senses and alternative labels
• [Mendelyan et al, 2008]
11. Related Work
Supervised learning methods
• [Bodo et al, 2007], [Gabrilovich et al, 2006], [Mendelyan et al, 2008]
Unsupervised techniques
Based on a vector space model
• [Schonhofen, 2009]
Based on a graph
• [Coursey et al, 2009], [Syed et al, 2008]
Combined methods (supervised and unsupervised)
Based on a vector space model
• [Mihalcea et al, 2007]
12. Related Work
Our approach
Exploits titles, disambiguation pages, redirection links and article text to select candidate senses and alternative labels
Uses an unsupervised method
Uses a vector space model
Main benefit in comparison with previous approaches:
The interlinking of social media posts with the Web of data through
DBpedia resources
14. Description of the Method
Input
Part-of-speech tagging
• “torino”, “art”, “media”, “user”, “cloud”
Topic Recognition
• http://dbpedia.org/resource/Turin
• http://dbpedia.org/resource/Art
• http://dbpedia.org/resource/User_(computing)
Language Filtering
• “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ...
15. Description of the Method
Part-of-speech tagging
Wp = w1, w2, ..., wn: list of lexical units contained in the post
lexcat(w): lexical category of the lexical unit w
lemma(w): lemma of w
L = {common noun, proper noun, acronym, …}: meaningful lexical categories that we consider
Stop words (lemmas excluded): {“RT”, “/cc”, “;)”, …}
Kp = k1, k2, …, kn: list of keywords with meaning
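The definitions above amount to a filter over the POS-tagged lexical units. A minimal sketch (not the authors' implementation; the category names and the stop-word set shown are illustrative assumptions based on this slide):

```python
# Meaningful lexical categories L and stop-word lemmas, as illustrated above.
# Both sets are assumptions for this sketch, not the exact sets used in the paper.
MEANINGFUL_CATEGORIES = {"common noun", "proper noun", "acronym"}
STOP_LEMMAS = {"RT", "/cc", ";)"}

def extract_keywords(tagged_post):
    """tagged_post: list of (lemma, lexical category) pairs, i.e. Wp after tagging.
    Returns Kp: lemmas in a meaningful category that are not stop words."""
    return [lemma for lemma, category in tagged_post
            if category in MEANINGFUL_CATEGORIES and lemma not in STOP_LEMMAS]
```

In practice the (lemma, category) pairs would come from a POS tagger; any tagger whose tagset can be mapped to these categories would fit here.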
16. Description of the Method
Part-of-speech tagging example
Input
• But a hardware problem is more likely, especially if
you use the phone a lot while eating. The
Blackberry's tiny trackball could be suffering the
same accumulation of gunk and grime that can
plague a computer mouse that still uses a rubber
ball on the underside to roll around the desk.
Part-of-speech
tagging
• Blackberry, phone, trackball, computer,
problem, grime, hardware, mouse, desk,
rubber ball, gunk
17. Description of the Method
Topic Recognition (Sem4Tags [García-Silva et al, 2010])
POS tagging
• Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, rubber ball, gunk
Context Selection
• Blackberry, {phone, hardware, trackball, mouse}
• Computer, {hardware, mouse, problem, desk}
• …
Disambiguation
• http://dbpedia.org/resource/BlackBerry
• http://dbpedia.org/resource/Computer
18. Description of the Method
Context Selection
For each keyword, a set of up to 4 related keywords is selected to help disambiguate its meaning
4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]
We compute semantic relatedness (active context) taking into account the co-occurrence of words in web pages [Gracia et al, 2009]
Keyword | Relatedness
phone | 0.347
hardware | 0.347
trackball | 0.311
mouse | 0.311
computer | 0.288
desk | 0.287
problem | 0.246
rubber ball | 0.246
grime | 0.190
gunk | 0.168
Active context selection for the keyword “blackberry”
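Assuming the relatedness scores are available as a keyword-to-score map, the active context step reduces to taking the top-ranked keywords. A minimal sketch:

```python
def active_context(keyword, relatedness, k=4):
    """Select the (up to) k keywords most related to `keyword` as its
    disambiguation context; k=4 follows the limit cited from [Kaplan, 1955].
    relatedness: map from candidate keyword to its relatedness score."""
    ranked = sorted((w for w in relatedness if w != keyword),
                    key=lambda w: relatedness[w], reverse=True)
    return ranked[:k]
```

With the scores from the table above, this returns phone, hardware, trackball and mouse as the context for “blackberry” (ties are broken arbitrarily).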
19. Description of the Method
Disambiguation Criteria
OPTION 1: Most frequent sense for the ambiguous word
• Determined by Wikipedia editors (the first link in a disambiguation page)
OPTION 2: Vector space model
1. A vector containing the keyword and its context is created
2. A vector containing the top N terms of each candidate sense is created using TF-IDF (Term Frequency and Inverse Document Frequency)
3. Cosine similarity is used to determine which sense vector is most similar to the vector associated to the keyword
DBpedia resource | Definition | Similarity
BlackBerry | is a line of mobile e-mail and smartphone | 0.224
Blackberry | is an edible fruit | 0.15
BlackBerry_(song) | is a song by the Black Crowes | 0.0
BlackBerry_Township,_Itasca_County,_Minnesota | is a township in … Itasca County | 0.0
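OPTION 2 can be sketched as follows, assuming each candidate sense comes with a bag of definition terms. This is an illustrative re-implementation, not the authors' code; the toy sense data mirrors the table above, but the resulting similarity values will not match the slide's numbers.

```python
import math
from collections import Counter

def tfidf_vectors(sense_docs):
    """Build one TF-IDF vector per candidate-sense term list."""
    n = len(sense_docs)
    df = Counter()                      # document frequency of each term
    for doc in sense_docs:
        df.update(set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in sense_docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def disambiguate(keyword, context, senses):
    """senses: map from DBpedia URI to a list of definition terms.
    Returns the URI whose TF-IDF vector is most similar to keyword + context."""
    uris = list(senses)
    vecs = tfidf_vectors([senses[u] for u in uris])
    query = Counter([keyword] + context)    # raw-count query vector
    return max(zip(uris, vecs), key=lambda uv: cosine(uv[1], query))[0]
```

Because the context words phone, hardware and trackball only occur in the device sense's definition terms, that sense wins in this toy setting.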
20. Description of the Method
Language Filtering
Tp = t1, t2, ..., tn: set of topics identified
l: the language to filter by
Labels(t): set of labels associated to a given topic t (value of the rdfs:label property)
lang(b): language of a given label b
Tp^l: subset of Tp containing the topics with labels in language l
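The filtering step keeps only topics that have at least one rdfs:label in the target language. In this sketch the labels are represented as a simple topic-to-{language: label} map, an assumption for illustration (in practice they would be fetched from DBpedia):

```python
def filter_by_language(topics, labels, lang):
    """Return Tp^l: the topics in `topics` with an rdfs:label in language `lang`.
    labels: map from topic URI to {language code: label string}."""
    return [t for t in topics if lang in labels.get(t, {})]
```

A topic with no label in the target language is silently dropped, which is exactly why coverage decreases after this step (see the evaluation below).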
22. Evaluation
Evaluated with a corpus of 10,000 posts in Spanish extracted from
Blogs
Forums
Microblogs (e.g., Twitter)
Social networks (e.g., Facebook, MySpace, LinkedIn and Xing)
Review sites (e.g., Ciao and Dooyoo)
Audiovisual sites (e.g., YouTube and Flickr)
News publishing sites (e.g., elpais.com, elmundo.es)
Others (web pages not classified in the categories above)
Variants evaluated
1. Without considering any context
• Default Wikipedia sense assigned for a given keyword
2. Considering as context all the other keywords found in the same post
3. Active context selection technique
• Selecting the 4 most related keywords from the same post
23. Evaluation
Coverage
Part-of-speech tagging: nearly 100%
Topic recognition: over 90% in almost all cases
After language filtering, coverage drops by about 10% because not all DBpedia resources have a label defined for the Spanish language
| | Blogs | Forums | Microblogs | Social Networks | Others | Reviews | Audiovisual | News | Overall |
| POS tagging | 99.63% | 96.64% | 99.01% | 98.14% | 98.77% | 98.20% | 97.20% | 99.62% | 98.32% |
Topic identification:
| Without context | 96.7% | 87.68% | 94.22% | 93.54% | 92.71% | 88.81% | 90.29% | 96.67% | 92.35% |
| With context | 96.64% | 93.07% | 95.54% | 94.99% | 95.13% | 92.67% | 97.41% | 98.54% | 95.02% |
| Active context | 99.24% | 89.71% | 94.43% | 96.40% | 94.75% | 93.81% | 92.23% | 97.4% | 94.72% |
Topic identification after language filtering:
| Without context | 91.21% | 79.04% | 87.54% | 82.64% | 86.93% | 70.15% | 82.52% | 90.71% | 82.74% |
| With context | 88.43% | 80.84% | 86.31% | 85.24% | 88.72% | 76.19% | 89.66% | 92.46% | 84.85% |
| Active context | 89.69% | 80.51% | 86.51% | 86.78% | 89.78% | 75.59% | 80.58% | 90.54% | 84.73% |
24. Evaluation
Precision
Evaluated on a random sample of 1,816 posts (18.16%)
47 human evaluators
Each post and its identified topics were shown to 3 different evaluators
Evaluation options:
1. The topic is not related to the post
2. The topic is somehow related to the post
3. The topic is closely related to the post
4. The evaluator does not have enough information to make a decision
Fleiss’ kappa test
• Strength of agreement for 2 evaluators = 0.826 (very good)
• Strength of agreement for 3 evaluators = 0.493 (moderate)
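For reference, Fleiss' kappa can be computed from per-post category counts as follows. This is the standard formula, not the authors' evaluation code:

```python
def fleiss_kappa(ratings):
    """ratings: one row per rated item; each row lists how many raters chose
    each category (every row sums to the number of raters n)."""
    N = len(ratings)                      # number of items
    n = sum(ratings[0])                   # raters per item
    k = len(ratings[0])                   # number of categories
    # Overall proportion of all assignments that went to each category.
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Mean per-item agreement P_bar.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    P_e = sum(x * x for x in p)           # expected agreement by chance
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement across items yields kappa = 1; values around 0.8 (2 evaluators) and 0.5 (3 evaluators) correspond to the "very good" and "moderate" bands reported above.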
26. Evaluation
Precision Results
Precision depends on the channel
• From 59.19% for social networks
More misspellings
More common nouns
• To 88.89% for review sites
Concrete products and brands
Proper nouns tend to have a Wikipedia entry
The context selection criterion also depends on the channel
• Active context selection is better for microblogs and review sites
• Considering all the post keywords as context is better for blogs
• Not using context is better for the rest of the cases (almost all the channels)
Naïve default sense selection is effective
28. Conclusions
We have achieved good coverage results
Precision depends on the channel (better for review sites, worse for social networks)
With respect to considering context or not, no single variant provides the best results for all channels.
Future lines of work:
Improve natural language processing
• Dealing with slang
• Detecting set phrases
• Improving n-gram detection
• Dealing with microblogs’ specifics (e.g., hashtag expansion)
Combine broad-domain topic identification with knowledge about specific domains
• Use of domain ontologies in combination with the DBpedia ontology