This document summarizes a presentation given at SSSW 2015 on making sense of semantic data. It discusses challenges in understanding semantic web data, including a "language gap" between semantic web languages like SPARQL and natural language. It presents an approach to bridging this gap through automatically verbalizing SPARQL queries in English. Evaluation results show this helps non-experts understand queries better and faster than the SPARQL format. It also discusses the "semantic gap" caused by mismatches between a question's semantics and a knowledge graph, and presents an approach using templates to generate SPARQL queries from natural language questions.
2. On Making Sense
• ½ of Computer Science is about making sense of some input data
– KDD (cf. Claudia & Laura tutorial)
– NLP (cf. Roberto’s talk)
– Multimedia Analysis
– Social Media / Big Data Analytics
– Visualization
– etc.
3. On the Menu Today
• Making Sense of Semantic Data
– Making sense of SPARQL & Semantic Web predicates
– Trust on Semantic Web data
– Emergent Semantics
• Leveraging Semantic Data for Sense Making
– Making sense of textual entities
– Making sense of relational data
– Making sense of webtables
5. Introduction
At some point in the early twenty-first century, all of mankind was united in celebration. We marveled at our own magnificence as we gave birth to AI.
– Morpheus, The Matrix
Axel-Cyrille Ngonga Ngomo (AKSW) Sense Making July 10th, 2015 2 / 52
8. Linked Data Web
Sense Making
Helping end users to make sense of the Semantic Web.
9. Gaps
Language Gap
The Semantic Web speaks languages that ordinary users do not understand
12. Language Gap
Problem
What does it mean?
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX res: <http://dbpedia.org/resource/>
SELECT DISTINCT ?person WHERE {
  ?person dbo:team ?sportsTeam .
  ?sportsTeam dbo:league res:Premier_League .
  ?person dbo:birthDate ?date .
  ?person dbo:birthPlace ?place .
  { ?place dbo:locatedIn res:Africa . }
  UNION
  { ?place dbo:locatedIn res:Asia . }
}
ORDER BY DESC(?date)
OFFSET 0 LIMIT 1
13. Language Gap
Give me the youngest person who plays in a Premier League team and was born in Africa or Asia.
14. Language Gap
Solution
Verbalization frameworks for the Semantic Web
Document planner → Microplanner → Realizer
http://github.com/AKSW/SemWeb2NL
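The three pipeline stages above can be illustrated with toy stand-ins; the function names and bodies below are hypothetical simplifications, not SemWeb2NL code:

```python
# Sketch of the classical NLG pipeline named on the slide
# (document planner -> microplanner -> realizer). Toy stand-ins only.

def document_planner(triples):
    """Decide what to say: here, simply keep all triples in order."""
    return list(triples)

def microplanner(plan):
    """Decide how to say it: map each triple to a sentence specification."""
    return [f"{s}'s {p} is {o}" for s, p, o in plan]

def realizer(specs):
    """Produce the surface text from the sentence specifications."""
    return " ".join(spec + "." for spec in specs)

print(realizer(microplanner(document_planner([("Momo", "author", "Ende")]))))
# -> Momo's author is Ende.
```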
15. Language Gap: Triple2NL/BGP2NL
Approach
1 ρ(s p o) ⇒ poss(ρ(p),ρ(s)) ∧ subj(BE,ρ(p)) ∧ dobj(BE,ρ(o))
2 ρ(s p o) ⇒ subj(ρ(p),ρ(s)) ∧ dobj(ρ(p),ρ(o))
1 :Momo :author :Ende
⇒ Momo’s author is Michael Ende.
2 ?x :author :Ende
⇒ ?x’s author is Michael Ende.
3 :Momo :writtenBy :Ende
⇒ Momo was written by Michael Ende.
4 ?x :writtenBy :Ende
⇒ ?x was written by Michael Ende.
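As a rough illustration, the two realization rules above can be sketched in Python; the function and its flag are hypothetical simplifications (a real realizer would resolve entity labels, tense, and agreement):

```python
# Sketch of the two Triple2NL realization rules (illustrative only).
# Rule 1 (noun predicate): s p o  =>  "<s>'s <p> is <o>."
# Rule 2 (verb predicate): s p o  =>  "<s> <p> <o>." (e.g. "was written by")

def verbalize(s, p, o, predicate_is_noun=True):
    """Render a triple as an English sentence; variables like '?x'
    are kept verbatim, as in the slide's examples."""
    if predicate_is_noun:
        # poss(p, s) + subj(BE, p) + dobj(BE, o)
        return f"{s}'s {p} is {o}."
    # subj(p, s) + dobj(p, o)
    return f"{s} {p} {o}."

print(verbalize("Momo", "author", "Michael Ende"))
# -> Momo's author is Michael Ende.
print(verbalize("Momo", "was written by", "Michael Ende", predicate_is_noun=False))
# -> Momo was written by Michael Ende.
```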
18. Language Gap: SPARQL2NL/RDF2NL
Approach
Combination rules
1 ρ((s, p, o1).(s, p, o2))
⇒ poss(ρ(p),ρ(s)) ∧ subj(BE,ρ(p)) ∧ dobj(BE, cc(ρ(o1), ρ(o2)))
?x’s author is Paul Erdős and ?x’s author is Kevin Bacon.
⇒ ?x’s authors are Paul Erdős and Kevin Bacon.
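The combination rule above can be sketched as grouping triples that share subject and predicate and coordinating their objects; the helper below is an illustrative simplification (naive pluralization by appending "s"), not the SPARQL2NL implementation:

```python
# Sketch of the combination rule: triples sharing (subject, predicate)
# are merged, coordinating their objects with "and". Illustrative only.

def combine(triples):
    """Group (s, p, o) triples by (s, p) and verbalize each group."""
    groups = {}
    for s, p, o in triples:
        groups.setdefault((s, p), []).append(o)
    sentences = []
    for (s, p), objs in groups.items():
        if len(objs) == 1:
            sentences.append(f"{s}'s {p} is {objs[0]}.")
        else:
            # cc(o1, ..., on): coordinate objects, pluralize naively
            conj = ", ".join(objs[:-1]) + " and " + objs[-1]
            sentences.append(f"{s}'s {p}s are {conj}.")
    return sentences

print(combine([("?x", "author", "Paul Erdős"), ("?x", "author", "Kevin Bacon")]))
# -> ["?x's authors are Paul Erdős and Kevin Bacon."]
```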
19. Language Gap: SPARQL2NL/RDF2NL
?place is Shakespeare’s birth place or ?place is Shakespeare’s death place.
⇒ ?place is Shakespeare’s birth or death place.
This query retrieves values ?height such that ?height is Claudia Schiffer’s height.
⇒ This query retrieves Claudia Schiffer’s height.
?person’s team is ?sportsTeam. ?person’s birth date is ?date. ?sportsTeam’s league is Premier League.
⇒ ?person’s team is ?sportsTeam, ?person’s birth date is ?date, and ?sportsTeam’s league is Premier League.
20. Language Gap: Evaluation
125 participants, 49 SPARQL experts, 3 tasks
94% of verbalizations were understandable
5.31 ± 1.08 average adequacy score
Figure: Adequacy and fluency results in survey (ratings 1–6 by number of survey answers)
21. Language Gap: Evaluation
125 participants, 49 SPARQL experts, 3 tasks
Slightly larger error with NL for experts
NL enabled non-experts to understand the meaning of queries
Figure: Error rate over the three tasks for SPARQL, NL, and NL (SPARQL experts)
22. Language Gap: Evaluation
125 participants, 49 SPARQL experts, 3 tasks
Non-experts faster with NL than experts with SPARQL
Experts faster with NL than with SPARQL
Figure: Average time needed in minutes for SPARQL, NL, and NL (SPARQL experts), each unfiltered and filtered (purple = standard deviation)
25. Language Gap: Challenges
Complex queries
Sacrifice adequacy for fluency
Other languages
Hybrid approach
Personalization
27. Semantic Gap
Problem
How do I communicate with it?
28. Semantic Gap
Solution
Question Answering Systems
Example:
Where did Abraham Lincoln die?
SELECT ?x WHERE {
res:Abraham_Lincoln dbo:deathPlace ?x .
}
PowerAqua:
Triple representation:
state/place, die, Abraham Lincoln
Ontology mappings:
Place, deathPlace, Abraham Lincoln
29. Semantic Gap: Mismatch
Triples do not always provide a faithful representation of the semantic structure of the question.
Thus, more expressive queries cannot be answered.
Example 1:
Which cities have more than three universities?
SELECT ?y WHERE {
  ?x rdf:type dbo:University .
  ?x dbo:city ?y .
}
GROUP BY ?y
HAVING (COUNT(?x) > 3)
Triple representation:
cities, more than, universities three
30. Semantic Gap: Mismatch
Example 2:
Who produced the most films?
SELECT ?y WHERE {
  ?x rdf:type dbo:Film .
  ?x dbo:producer ?y .
}
GROUP BY ?y
ORDER BY DESC(COUNT(?x)) LIMIT 1
Triple representation:
person/organization, produced, most films
31. Semantic Gap: Approach
To understand a user question, we need to understand:
The words
Abraham Lincoln → res:Abraham Lincoln
died in → dbo:deathPlace
The semantic structure
the most N → ORDER BY DESC(COUNT(?n)) LIMIT 1
more than three N → HAVING (COUNT(?n) > 3)
Template-Based Question Answering
1 Template generation: understanding the semantic structure
2 Template instantiation: understanding the words
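The structure mappings above can be sketched as a small lookup from quantifier phrases to SPARQL solution modifiers; the function and number table below are illustrative, not part of the actual system:

```python
# Sketch: mapping quantifier phrases to SPARQL solution modifiers, as in
# "the most N" and "more than three N" above. Illustrative only.

WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def modifier_for(phrase, var="?n"):
    """Return the SPARQL modifier corresponding to a quantifier phrase."""
    if phrase == "the most":
        return f"ORDER BY DESC(COUNT({var})) LIMIT 1"
    if phrase.startswith("more than "):
        word = phrase.split()[-1]
        k = WORD_TO_INT.get(word, word)  # "three" -> 3
        return f"HAVING (COUNT({var}) > {k})"
    return ""

print(modifier_for("the most"))         # ORDER BY DESC(COUNT(?n)) LIMIT 1
print(modifier_for("more than three"))  # HAVING (COUNT(?n) > 3)
```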
33. Semantic Gap: Example
Query: Who produced the most films?
1 SPARQL template:
SELECT ?x WHERE {
  ?y rdf:type ?c .
  ?y ?p ?x .
}
GROUP BY ?x
ORDER BY DESC(COUNT(?y)) LIMIT 1
?c CLASS [films]
?p PROPERTY [produced]
2 Instantiations:
?c = <http://dbpedia.org/ontology/Film>
?p = <http://dbpedia.org/ontology/producer>
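The instantiation step amounts to substituting the chosen URIs into the template's slots. A minimal sketch, with a hypothetical `instantiate` helper (the naive string replacement works here because no other variable name starts with `?c` or `?p`):

```python
# Sketch: filling a SPARQL template's slots (?c, ?p) with concrete URIs.
# Hypothetical helper, not the actual system's code.

def instantiate(template, bindings):
    """Substitute each slot variable with its bound URI in angle brackets."""
    for slot, uri in bindings.items():
        template = template.replace(slot, f"<{uri}>")
    return template

template = (
    "SELECT ?x WHERE { ?y rdf:type ?c . ?y ?p ?x . } "
    "GROUP BY ?x ORDER BY DESC(COUNT(?y)) LIMIT 1"
)
print(instantiate(template, {
    "?c": "http://dbpedia.org/ontology/Film",
    "?p": "http://dbpedia.org/ontology/producer",
}))
```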
35. Semantic Gap: Architecture
[Architecture diagram: a natural language question is tagged (using a domain-independent and a domain-dependent lexicon) and parsed into a semantic representation, which is translated into SPARQL templates. Entity identification fills the URI slots with resources, classes, and properties (the latter also matched against the BOA pattern library); candidate queries are ranked via type checking and prominence, and the selected SPARQL query is run against an endpoint over the LOD cloud to produce the answer.]
36. Semantic Gap: Template Generation
1 The natural language question is tagged with part-of-speech information.
2 Based on POS tags, lexical entries are built on the fly.
3 These lexical entries, together with domain-independent lexical entries, are used for parsing the question.
4 The resulting semantic representation is translated into a SPARQL template.
41. Semantic Gap: Who produced the most films?
domain-independent: who, the most
domain-dependent: produced/VBD, films/NNS
SPARQL template 1:
SELECT ?x WHERE {
?x ?p ?y .
?y rdf:type ?c .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
?c CLASS [films]
?p PROPERTY [produced]
44. Semantic Gap: Who produced the most films?
domain-independent: who, the most
domain-dependent: produced/VBD, films/NNS
SPARQL template 2:
SELECT ?x WHERE {
?x ?p ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
?p PROPERTY [films]
45-48. Semantic Gap: Template instantiation
1 For resources and classes: identify synonyms of the slot label using WordNet, and retrieve entities with a label similar to the slot label based on string similarities (trigram, Levenshtein, substring).
2 For property labels, the label is additionally compared to natural language expressions stored in the BOA pattern library.
3 The highest-ranking entities are returned as candidates for filling the query slots.
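As a rough illustration of the string-similarity step, here is trigram (Jaccard) similarity only; the slides also mention Levenshtein and substring measures, and the candidate labels below are taken from the running example:

```python
# Illustrative sketch: rank candidate entity labels against a slot label
# by trigram (Jaccard) similarity. Padding ensures short strings still
# yield trigrams.
def trigrams(s):
    s = "  %s " % s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_sim(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

candidates = ["Film", "FilmFestival", "Wine"]
ranked = sorted(candidates, key=lambda c: trigram_sim("films", c), reverse=True)
# "Film" ranks above "FilmFestival", and "Wine" last
```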
49. BOA
The BOA pattern library is a repository of natural language
representations of Semantic Web predicates.
Idea:
For each predicate P in a data repository (e.g. DBpedia), collect the
set of entities S and O connected through P.
Search a text corpus (e.g. Wikipedia) for all sentences containing the
labels of S and O.
For all retrieved sentences, the natural language predicate is a
potential pattern for P. The potential patterns are then scored by a
neural network (e.g. according to frequency) and filtered.
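The extraction idea above can be sketched as follows; the corpus, labels, and function name are invented for illustration, and the real BOA framework does considerably more (tokenization, pattern generalization, feature-based scoring):

```python
# Hedged sketch of the BOA idea: for (subject label, object label) pairs of a
# predicate, treat the text between the two labels in a corpus sentence as a
# candidate pattern and score it by frequency.
from collections import Counter

def extract_patterns(pairs, sentences):
    counts = Counter()
    for s_label, o_label in pairs:
        for sent in sentences:
            i, j = sent.find(s_label), sent.find(o_label)
            if i != -1 and j != -1 and i < j:
                pattern = sent[i + len(s_label):j].strip()
                if pattern:
                    counts[pattern] += 1
    return counts

pairs = [("Albert Einstein", "Ulm"), ("Marie Curie", "Warsaw")]
sentences = [
    "Albert Einstein was born in Ulm in 1879.",
    "Marie Curie was born in Warsaw.",
    "Albert Einstein moved from Ulm to Munich.",
]
patterns = extract_patterns(pairs, sentences)
# "was born in" occurs twice, "moved from" once
```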
51. BOA
The use of BOA patterns allows us to match natural language expressions
and ontology concepts even if they are not string similar and not covered
by WordNet.
Examples:
married to → http://dbpedia.org/ontology/spouse
was born in → http://dbpedia.org/ontology/birthPlace
graduated from → http://dbpedia.org/ontology/almaMater
write → http://dbpedia.org/ontology/author
52. Example: Who produced the most films?
Candidates for filling query slots:
?c CLASS [films]
<http://dbpedia.org/ontology/Film>
<http://dbpedia.org/ontology/FilmFestival>
. . .
?p PROPERTY [produced]
<http://dbpedia.org/ontology/producer>
<http://dbpedia.org/property/producer>
<http://dbpedia.org/ontology/wineProduced>
. . .
53-56. Semantic Gap: Query ranking and selection
1 Every entity receives a score considering string similarity and prominence.
2 The score of a query is then computed as the average of the scores of the entities used to fill its slots.
3 In addition, type checks are performed.
4 Of the remaining queries, the one with the highest score that returns a result is chosen to retrieve the answer.
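The ranking rule amounts to a simple average over slot fillers; the entity scores below are made up for illustration:

```python
# Sketch of the ranking rule: a query's score is the average of the
# scores of the entities filling its slots (scores here are invented).
def query_score(entity_scores):
    return sum(entity_scores) / len(entity_scores)

q1 = query_score([0.85, 0.67])   # e.g. dbo:Film + dbo:producer
q2 = query_score([0.85, 0.35])   # e.g. dbo:Film + dbo:wineProduced
best = max((q1, "q1"), (q2, "q2"))
```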
57. Example: Who produced the most films?
SELECT ?x WHERE {
?x <http://dbpedia.org/ontology/producer> ?y .
?y rdf:type <http://dbpedia.org/ontology/Film> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.7592425075864263
SELECT ?x WHERE {
?x <http://dbpedia.org/ontology/film> ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.6264001353183296
SELECT ?x WHERE {
?x <http://dbpedia.org/ontology/producer> ?y .
?y rdf:type <http://dbpedia.org/ontology/FilmFestival>.
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.6012584940627768
58. Evaluation Setup
Question set: 39 DBpedia training questions from QALD-1
5 could not be parsed due to unknown syntactic constructions or uncovered domain-independent expressions
19 were answered exactly as required by the benchmark (with precision and recall 1.0)
Another 2 were answered almost correctly (with precision and recall greater than 0.8)
Mean precision: 0.61
Mean recall: 0.63
F-measure: 0.62
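As a sanity check, the reported F-measure is consistent with the harmonic mean of the mean precision and mean recall:

```python
# Quick check of the reported numbers: F-measure as the harmonic mean
# of mean precision and mean recall.
def f_measure(p, r):
    return 2 * p * r / (p + r)

f = f_measure(0.61, 0.63)   # rounds to 0.62, matching the slide
```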
59. Main Sources of Error
Incorrect templates
Template structure does not coincide with structure of the data:
When did Germany join the EU?
res:Germany dbp:accessioneudate ?x .
Predicate detection fails
inhabitants dbp:population, dbp:populationTotal
owns dbo:keyPerson
higher dbp:elevationM
Wrong query is selected
Who wrote The pillars of the Earth?
res:The Pillars of the Earth (TV Miniseries) dbo:writer ?x .
res:The Pillars of the Earth dbo:author ?x .
60. Semantic Gap: Challenges
Schema-agnostic QA
Query Ranking
Relation Extraction
Ontology Lexicalization
Extraction of surface forms
65. Justification Gap: Proof Scoring
Combination of features including
1 Score of BOA pattern
2 Token distance
3 Total occurrence of resource labels
4 Similarity to title
66. Justification Gap: Trustworthiness
Combination of features including
1 Topic majority on the Web
2 Topic majority in results
3 Topic terms
4 PageRank
67. Justification Gap: Fact Confirmation
Combination of features including
1 Combined trustworthiness and proof score
2 Number of proofs
3 Total hit count
4 Domain/Range
68. Justification Gap: Evaluation
10 triples per property
Top-60 most used properties
473 of 600 manually verified to be true
69. Justification Gap: Evaluation
J48 is the overall best classifier (78.8%-87.6%)
Easiest data set: random
Hardest data set: mixed
77. The End
Thank you! Questions?
Axel Ngonga
http://aksw.org/AxelNgonga
ngonga@informatik.uni-leipzig.de
AKSW Research Group
University of Leipzig, Germany
@akswgroup
@NgongaAxel
79. The Semantics of the Semantic Web
• A priori: top-down semantics
– Logical assertions
– Crisp reuse of conceptualization
• In practice: hybrid bottom-up/top-down approach
– (Human/software) agents are sloppy/ignorant
– Agents do not agree (for various reasons)
=> Centralized view on a decentralized construct?
80. Semantic Grounding
The meaning of a symbol can be explained by its
semantic correspondences to other symbols alone
[“Understanding understanding”, Rapaport 93]
• Type 1 semantics: understanding in terms of something else
• Problem: how to ground semantics?
• Type 2 semantics: understanding something in terms of itself
• “syntactic semantics”: grounding through recursive
understanding
81. Emergent Semantics
Emergent Semantics:
• Semantics as a posteriori agreement on conceptualizations
=> Don’t believe / enforce the schema!
• Semantics of symbols as recursive correspondences to other
symbols
• Analyzing transitive closures of mappings
• Self-organizing, bottom-up approach
• Global semantics (stable states) emerging from multiple
local interactions
• Syntactic semantics
• Studying semantics from a syntactic perspective
82. 3 Concrete Examples
1. Emergence of Semantic Interoperability
2. Entity disambiguation using same-as networks
3. A posteriori schema for LOD properties
83. Semantic Connectivity
• How many links do you need to make a semantic network interoperable?
• Semantic interoperability as an emergent property!
⇒ Connectivity indicator: ci = ∑j,k (jk-j(bc+cc)-k) pjk
• Necessary condition for semantic interoperability in the large: ci ≥ 0
Philippe Cudré-Mauroux, Karl Aberer: A Necessary Condition for Semantic Interoperability in the Large. CoopIS/DOA/ODBASE 2004: 859-872.
84. Graph-Based Disambiguation
• The great thing about unique identifiers is that there are
so many to choose from
– URI jungle
– Disambiguation based on transitive closures on equality links
Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer:
idMesh: graph-based disambiguation of linked data. WWW 2009: 591-600.
85. A Posteriori Schema
• Instance data use schema constructs in creative
ways!
⇒ Retro-engineering of schema constructs based on the
deployment of instance data
⇒ Context-dependent, retro-compatible
Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux: Fixing the
Domain and Range of Properties in Linked Data by Context Disambiguation. LDOW 2015.
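The retro-engineering idea above can be sketched as majority voting over the types of the subjects and objects observed with a property; the triples and type assignments below are invented for illustration:

```python
# Hedged sketch: infer a property's domain and range a posteriori by
# taking the majority type of its observed subjects and objects.
from collections import Counter

def infer_domain_range(triples, types):
    domains, ranges = Counter(), Counter()
    for s, p, o in triples:
        domains.update(types.get(s, []))
        ranges.update(types.get(o, []))
    return domains.most_common(1)[0][0], ranges.most_common(1)[0][0]

types = {
    "Spielberg": ["Person", "Agent"],
    "Nolan": ["Person", "Agent"],
    "Jaws": ["Film", "Work"],
    "Inception": ["Film", "Work"],
}
triples = [("Spielberg", "directed", "Jaws"),
           ("Nolan", "directed", "Inception")]
dom, rng = infer_domain_range(triples, types)
```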
86. Research Directions
• Tons of research opportunities in this field
• Understanding the emergent properties of LOD networks (and how to exploit them)
• Analyzing the deployment / use of semantic data (a priori VS a posteriori views)
• Capturing user disagreement (e.g., multi-views ontologies, fuzzy ontologies, results diversification)
88. Opportunity: The 3-Vs of Big Data
Volume
■ amount of data
Velocity
■ speed of data in and out
Variety
■ range of data types and sources
[Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"
90. Information Management
• The story so far:
– Strict separation between unstructured and structured data management infrastructures
[Diagram: a structured stack (SQL over JDBC over a DBMS) next to an unstructured stack (keywords over HTTP over an inverted index)]
91. Information Integration
• Information integration is still one of the biggest CS problems out there (according to many, e.g., Gartner)
• Information integration typically requires some sort of mediation
1. Unstructured Data: keywords, synsets
2. Structured Data: global schema, transitive closure of schemas (mostly syntactic)
⇒ nightmarish if 1 and 2 are taken separately, a horror marathon if considered together
92. Entities as Mediation
• Rising paradigm
– Store information at the entity granularity
– Integrate information by inter-linking entities
• Advantages?
– Coarser granularity compared to keywords
• More natural, e.g., brain functions similarly (or is it the
other way around?)
– Denormalized information compared to RDBMSs
• Schema-later, heterogeneity, sparsity
• Pre-computed joins, “Semantic” linking
• Drawbacks?
94. Exposing Textual Data
• The XI Pipeline
• Runs on massive amounts of data (Spark)
[Pipeline: Mention Extraction (NER) → Entity Linking → Entity Typing]
95. Named Entity Recognition (NER)
[Pipeline diagram: text extraction (Apache Tika) produces a list of extracted n-grams, which are indexed; for each n-gram, candidate selection (using POS tagging, lemmatization, and n+1-gram merging) yields a list of selected n-grams; feature extraction and frequency reweighting feed a supervised classifier that outputs a ranked list of n-grams]
Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux:
Effective named entity recognition for idiosyncratic web collections. WWW 2014: 397-408
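A loose, stdlib-only sketch of the candidate-selection and frequency-reweighting stages described above; the real pipeline additionally uses Apache Tika, POS tagging, lemmatization, and a supervised classifier, and all text and thresholds here are invented:

```python
# Illustrative sketch: extract word n-grams from text and keep those whose
# corpus frequency passes a threshold, as crude entity-mention candidates.
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidates(text, n=2, min_freq=2):
    tokens = text.split()
    counts = Counter(ngrams(tokens, n))
    return [(g, c) for g, c in counts.most_common() if c >= min_freq]

text = "semantic web data on the semantic web uses semantic web standards"
cands = candidates(text)
```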
96. Entity Linking
• Linking entities to text is an old problem…
– … and is extremely hard, esp. for machines
• Dozens of approaches have been suggested
• What if
– We want to combine approaches / frameworks?
– We want to leverage both human computations &
algorithms?
97. ZenCrowd
• Integrate textual data w/ the Web of Data
• Uses sets of algorithmic matchers to match
entities to online concepts
• Uses dynamic templating to create micro-matching-tasks and publish them on MTurk
• Combines both algorithmic and human matchers
using probabilistic networks
Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux:
ZenCrowd: leveraging probabilistic reasoning and crowdsourcing
techniques for large-scale entity linking. WWW 2012: 469-478
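The probabilistic combination can be caricatured as log-odds fusion under an independence assumption; ZenCrowd itself uses a probabilistic network over matcher and worker variables, and all probabilities below are invented:

```python
# Hedged sketch: fuse per-matcher probabilities that a mention links to a
# candidate URI, assuming independence (log-odds space).
import math

def fuse(probs):
    """Combine matcher probabilities in log-odds space."""
    logit = sum(math.log(p / (1 - p)) for p in probs)
    return 1 / (1 + math.exp(-logit))

algorithmic = [0.7, 0.6]   # two algorithmic matchers
human = [0.9]              # one crowd vote
p = fuse(algorithmic + human)
```

Note how agreeing evidence pushes the fused probability above any single matcher's estimate.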
101. Entity Typing
• Entities can have many types (facets)
• Which fine-grained types are most relevant given the context?
[Example type cloud: Thing, Agent, Person, People, Living People, American Billionaires, American Philanthropists, American Computer Programmers, American People of Scottish Descent, Harvard University People, People from King County, People from Seattle, Windows People]
102. TRank
• Fine-grained Typing
• Tree of 447’260 types
• Rooted on <owl:Thing>
• Depth of 19
• Ranks relevant types by analyzing the context
• Textual context
• Graph context
• Decision trees
• Linear regression
Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, Karl Aberer:
TRank: Ranking Entity Types Using the Web of Data. ISWC 2013: 640-656.
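A much-simplified illustration of context-based type ranking: TRank itself uses decision trees and linear regression over textual and graph features, whereas the scoring function here is only token overlap, and the context sentence is invented:

```python
# Illustrative sketch: score each candidate type by token overlap with the
# surrounding textual context, then rank. Types are from the slide's example.
def type_score(type_label, context):
    t = set(type_label.lower().split())
    c = set(context.lower().split())
    return len(t & c) / len(t)

types = ["American Computer Programmers", "American Philanthropists",
         "People from Seattle"]
context = "the programmer donated to several American charities"
ranked = sorted(types, key=lambda t: type_score(t, context), reverse=True)
```

The toy scorer also shows the limits of pure token overlap: "programmer" does not match "Programmers" without lemmatization, which is exactly the kind of feature the real system adds.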
103. Exposing Relational Data
• Mapping language file describes the relation between ontology and RDB
• Server provides HTML and linked data views and a SPARQL 1.1 endpoint
• Rewriting engine uses mappings to rewrite Jena & Sesame API calls to SQL queries and generates RDF dumps in various formats
http://d2rq.org/ , http://aksw.org/Projects/Sparqlify.html , etc.
104. Exposing Webtables
• Wealth of data in (HTML) tables
• Yet another type of content to expose
Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin
Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, Cong Yu: Applying WebTables in
Practice. CIDR 2015
Tao, Cui, and David W. Embley. "Automatic hidden-web table interpretation, conceptualization,
and semantic annotation." Data & Knowledge Engineering 68.7 (2009): 683-703.
105. Application 1: Enterprise Search
• How can end-users reach entities?
⇒ Structured search
⇒ Keyword search
• On their names or attributes
– Obviously not ideal
• BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20
• Query extension, query completion or pseudo-relevance
feedback yield comparable (or worse) results
106. Hybrid Entity Search
[Example entity graph: TheDescendants (title "The Descendants") linked via playsIn to GeorgeClooney (name "George Clooney", dateOfBirth May 6, 1961) and ShaileneW (name "Shailene Woodley", dateOfBirth Nov. 15, 1991)]
• Main idea: combine unstructured and structured search
– Inverted index (keywords over HTTP) to locate first candidates
– Graph queries (SPARQL over a DBMS) to refine the results
• Graph traversals (queries on object properties)
• Graph neighborhoods (queries on datatype properties)
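The two-stage hybrid idea can be sketched with a toy inverted index and graph; all data structures and names below are invented for illustration:

```python
# Illustrative sketch: an inverted index yields candidate entities,
# then a graph constraint (an object-property edge) refines them.
def search(keyword, index, graph, required_edge=None):
    cands = index.get(keyword, set())
    if required_edge:
        pred, obj = required_edge
        cands = {e for e in cands if (e, pred, obj) in graph}
    return cands

index = {"clooney": {"GeorgeClooney"}, "descendants": {"TheDescendants"}}
graph = {("GeorgeClooney", "playsIn", "TheDescendants")}
hits = search("clooney", index, graph, ("playsIn", "TheDescendants"))
```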
107. Architecture
[Architecture diagram: the LOD Cloud is indexed into an inverted index and an RDF store; an entity-search keyword query is annotated and expanded (using WordNet, 3rd-party search engines, and pseudo-relevance feedback), run against the inverted index to obtain intermediate top-k results, then enriched through the RDF store via graph traversals (queries on object properties) and neighborhoods (queries on datatype properties); ranking functions and a final ranking function produce the graph-enriched results]
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux: Combining inverted indices and
structured search for ad-hoc object retrieval. SIGIR 2012: 125-134
109. Application 3: Co-Reference Resolution
• Better co-reference resolution through the knowledge base
Example: "Barack Obama called Angela Merkel last week; the president asked the chancellor whether…"
Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Eddine Difallah, and Philippe Cudre-Mauroux: SANAPHOR: Ontology-Based Coreference Resolution. ISWC 2015.
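A toy illustration of ontology-based coreference in the spirit of the example above; SANAPHOR itself is considerably more involved, and the tiny KB and role mapping here are invented:

```python
# Hedged sketch: attach a role noun ("the president") to the entity in the
# sentence whose knowledge-base type matches the role.
kb_types = {"Barack Obama": "President", "Angela Merkel": "Chancellor"}
role_of = {"the president": "President", "the chancellor": "Chancellor"}

def resolve(mention, entities):
    """Return the entity whose KB type matches the mention's role, if any."""
    want = role_of.get(mention.lower())
    for e in entities:
        if kb_types.get(e) == want:
            return e
    return None

entities = ["Barack Obama", "Angela Merkel"]
who = resolve("the president", entities)
```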
110. Research Opportunities
• NER in vertical domains
• Crowdsourcing parts of the processing
• Predicate extraction
• Summarization
• Exposing further types of content
• Updates / transactions
• Parallelization
• Higher-level applications