Schema-Agnostic Queries (SAQ-2015): Semantic Web Challenge - Andre Freitas
The Challenge in a Nutshell
To create a query mechanism that semantically matches schema-agnostic user queries to knowledge base elements
The Goal
To support easy querying over complex databases with large schemata, relieving users of the need to understand the formal representation of the data
Relevance
The increase in the size and semantic heterogeneity of database schemas is bringing new requirements for users querying and searching structured data. At this scale it can become infeasible for data consumers to be familiar with the representation of the data in order to query it. At the center of this discussion is the semantic gap between users and databases, which becomes more central as the scale and complexity of the data grows. Addressing this gap is a fundamental part of the Semantic Web vision.
Schema-agnostic query mechanisms aim at allowing users to be abstracted from the representation of the data, supporting the automatic matching between queries and databases. This challenge aims at emphasizing the role of schema-agnosticism as a key requirement for contemporary database management, by providing a test collection for evaluating flexible query and search systems over structured data in terms of their level of schema-agnosticism (i.e. their ability to map a query issued in the user's terminology and structure to the dataset vocabulary). The challenge is instantiated in the context of Semantic Web datasets.
Different Semantic Perspectives for Question Answering Systems - Andre Freitas
Question Answering systems define one of the most complex tasks in computational semantics. The intrinsic complexity of the QA task allows researchers of QA systems to investigate and explore different perspectives of semantics. However, this complexity also induces a bias towards a systems perspective, where researchers are distanced from deeper reasoning on the semantic principles at work within the different components of the system. In this talk we will explore the semantic challenges, principles and perspectives behind the components of QA systems, aiming at providing a principled map and overview of the contribution of each component to the QA semantic interpretation goal.
Open domain Question Answering System - Research project in NLP - GVS Chaitanya
Using a computer to answer questions has been a human dream since the beginning of the digital era. A first step towards achieving such an ambitious goal is to deal with natural language, enabling the computer to understand what its user asks. The discipline that studies the connection between natural language and the representation of its meaning via computational models is computational linguistics. According to this discipline, Question Answering can be defined as the task that, given a question formulated in natural language, aims at finding one or more concise answers. Improvements in technology and the explosive demand for better information access have reignited interest in QA systems. The wealth of information on the web makes it an attractive resource for seeking quick answers to factual questions such as "Who was the first American to land in space?" or "What is the second tallest mountain in the world?", yet today's most advanced web search systems (Bing, Google, Yahoo) make it surprisingly tedious to locate the answers. QA systems aim to develop techniques that go beyond retrieval of relevant documents and return exact answers to natural-language factoid questions.
I will try to explain what QA is, how we can obtain answers to questions posed in natural language, and how successful we have been in this domain.
My knowledge comes from three selected papers and the reading I did around them.
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ... - Jeff Z. Pan
Tutorial on "Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge Graphs" presented at the 4th Joint International Conference on Semantic Technologies (JIST2014)
Semantic Web in Action: Ontology-driven information search, integration and a... - Amit Sheth
Amit Sheth's Keynote talk given at: “Semantic Web in Action: Ontology-driven information search, integration and analysis,” Net Object Days 2003 and MATES03, Erfurt, Germany, September 23, 2003. http://knoesis.org
Note: slides 51-55 have audio.
A lecture/conversation focusing on the first 12 years of Semantic Web - delivered on February 21, 2012.
See http://j.mp/SWIntro for more details. More detailed course material is at http://knoesis.org/courses/web3/
The ultimate goal of a recommender system is to suggest interesting and non-obvious items (e.g., products to buy, people to connect with, movies to watch, etc.) to users, based on their preferences.
The advent of the Linked Open Data (LOD) initiative in the Semantic Web gave birth to a variety of open knowledge bases freely accessible on the Web. They provide a valuable source of information that can improve conventional recommender systems, if properly exploited.
Here I present several approaches to recommender systems that leverage Linked Data knowledge bases such as DBpedia. In particular, content-based and hybrid recommendation algorithms are discussed.
For full details about the presented approaches please refer to the full papers mentioned in this presentation.
This invited keynote at the Social Computing Track at WI-IAT21 gives an introduction to Knowledge Graphs and how they are built collaboratively by us. It also presents a brief analysis of the links in Wikidata.
Word Tagging with Foundational Ontology Classes - Andre Freitas
Semantic annotation is fundamental to deal with large-scale lexical information, mapping the information to an enumerable set of categories over which rules and algorithms can be applied, and foundational ontology classes can be used as a formal set of categories for such tasks. A previous alignment between WordNet noun synsets and DOLCE provided a starting point for ontology-based annotation, but in NLP tasks verbs are also of substantial importance. This work presents an extension to the WordNet-DOLCE noun mapping, aligning verbs according to their links to nouns denoting perdurants, transferring to the verb the DOLCE class assigned to the noun that best represents that verb's occurrence. To evaluate the usefulness of this resource, we implemented a foundational ontology-based semantic annotation framework that assigns a high-level foundational category to each word or phrase in a text, and compared it to a similar annotation tool, obtaining an increase of 9.05% in accuracy.
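The core idea of the verb extension can be sketched as a lookup that routes verbs through a linked perdurant-denoting noun. This is only an illustrative sketch: the lexicon entries, the verb-noun links and all names below are invented, not taken from the actual WordNet-DOLCE resource.

```python
# Hypothetical sketch of foundational-ontology word tagging. A tiny
# WordNet-DOLCE-style lexicon maps lemmas to DOLCE-like classes; verbs
# inherit the class of a related perdurant-denoting noun (e.g. "run"
# via the noun "running"). All entries here are illustrative only.
DOLCE_LEXICON = {
    "dog": "physical-object",
    "idea": "non-physical-object",
    "running": "process",
}

# Verb -> perdurant noun links (standing in for WordNet morphosemantic
# relations; these entries are invented for the example).
VERB_TO_NOUN = {"run": "running", "runs": "running"}

def tag(word):
    """Return the foundational class for a word, routing verbs via
    their linked perdurant noun when the verb itself is unmapped."""
    lemma = word.lower()
    if lemma in DOLCE_LEXICON:
        return DOLCE_LEXICON[lemma]
    noun = VERB_TO_NOUN.get(lemma)
    if noun is not None:
        return DOLCE_LEXICON.get(noun, "unknown")
    return "unknown"

def tag_sentence(sentence):
    """Assign a high-level category to each whitespace token."""
    return [(w, tag(w)) for w in sentence.split()]
```

With this toy lexicon, `tag_sentence("dog runs")` tags the noun directly and the verb through its noun link.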
These slides were presented as part of a W3C tutorial at the CSHALS 2010 conference (http://www.iscb.org/cshals2010). The slides are adapted from a longer introduction to the Semantic Web available at http://www.slideshare.net/LeeFeigenbaum/semantic-web-landscape-2009 .
A PDF version of the slides is available at http://thefigtrees.net/lee/sw/cshals/cshals-w3c-semantic-web-tutorial.pdf .
Deep neural networks for matching online social networking profiles - Traian Rebedea
- Proposed a large dataset for matching online social networking profiles
- This allowed us to train a deep neural network for profile matching using both domain-specific features and word embeddings generated from textual descriptions in social profiles
- Experiments showed that the NN surpassed both unsupervised and supervised models, achieving high precision (P = 0.95) with a good recall rate (R = 0.85)
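The reported precision and recall combine into a single F1 score via the harmonic mean; a quick check (the function name is ours, not from the paper):

```python
def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# With P = 0.95 and R = 0.85 this gives roughly 0.897.
```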
Explanations in Dialogue Systems through Uncertain RDF Knowledge Bases - Daniel Sonntag
We implemented a generic dialogue shell that can be configured for and applied to domain-specific dialogue applications. The dialogue system works robustly for a new domain when the application backend can automatically infer previously unknown knowledge (facts) and provide explanations for the inference steps involved. For this purpose, we employ URDF, a query engine for uncertain and potentially inconsistent RDF knowledge bases. URDF supports rule-based, first-order predicate logic as used in OWL-Lite and OWL-DL, with simple and effective top-down reasoning capabilities. This mechanism also generates explanation graphs. These graphs can then be displayed in the GUI of the dialogue shell and help the user understand the underlying reasoning processes. We believe that proper explanations are a main factor for increasing the level of user trust in end-to-end human-computer interaction systems.
Semantic Relation Classification: Task Formalisation and Refinement - Andre Freitas
The identification of semantic relations between terms within texts is a fundamental task in Natural Language Processing which can support applications requiring a lightweight semantic interpretation model. Currently, semantic relation classification concentrates on relations which are evaluated over open-domain data. This work provides a critique of the set of abstract relations used for semantic relation classification with regard to their ability to express relationships between terms found in domain-specific corpora. Based on this analysis, this work proposes an alternative semantic relation model based on reusing and extending the set of abstract relations present in the DOLCE ontology. The resulting set of relations is well grounded, captures a wide range of relations, and could thus be used as a foundation for the automatic classification of semantic relations.
- What are clustering, honeypots and density-based clustering?
- What is OPTICS clustering, how does it differ from density-based clustering, and how can it be used for outlier detection?
- What is so-called soft clustering, how does it differ from hard clustering, and how can it be used for outlier detection?
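The density-based route to outlier detection mentioned above can be sketched in a few lines: a point with too few neighbours within a radius eps fails the core-point test used by DBSCAN and is labelled noise. The following toy implementation over 1-D points (function and parameter names are ours) illustrates the idea, not a production algorithm:

```python
def dbscan_noise(points, eps=1.0, min_pts=3):
    """Return points flagged as noise: fewer than min_pts neighbours
    (including the point itself) lie within distance eps. Density-based
    clusterings treat such low-density points as outliers."""
    noise = []
    for p in points:
        neighbours = sum(1 for q in points if abs(p - q) <= eps)
        if neighbours < min_pts:
            noise.append(p)
    return noise
```

On `[0.0, 0.5, 1.0, 10.0]` the three close points pass the density test while the isolated point is flagged as noise.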
Deep Learning Models for Question Answering - Sujit Pal
Talk about a hobby project to apply Deep Learning models to predict answers to 8th grade science multiple choice questions for the Allen AI challenge on Kaggle.
Extracting Multilingual Natural-Language Patterns for RDF Predicates - Daniel Gerber
Most knowledge sources on the Data Web were extracted from structured or semi-structured data. Thus, they encompass solely a small fraction of the information available on the document-oriented Web. In this paper, we present BOA, a bootstrapping strategy for extracting RDF from text. The idea behind BOA is to extract natural-language patterns that represent predicates found on the Data Web from unstructured data by using background knowledge from the Data Web. These patterns are then used to extract instance knowledge from natural-language text. This knowledge is finally fed back into the Data Web, therewith closing the loop. The approach followed by BOA is quasi-independent of the language in which the corpus is written. We demonstrate our approach by applying it to four different corpora and two different languages. We evaluate BOA on these data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with very high accuracy.
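The core bootstrapping step behind this kind of system can be sketched as follows: given a known (subject, object) pair for a predicate, harvest the text spanning the two mentions as a candidate natural-language pattern. This is a simplified sketch of the general technique, not BOA's actual implementation; the function name is ours.

```python
import re

def extract_patterns(corpus, pairs):
    """For each known (subject, object) fact, collect the text between
    the two entity mentions as a candidate natural-language pattern
    for the predicate (the bootstrapping step behind pattern-based
    RDF extraction from text)."""
    patterns = []
    for sentence in corpus:
        for subj, obj in pairs:
            m = re.search(
                re.escape(subj) + r"\s+(.+?)\s+" + re.escape(obj), sentence
            )
            if m:
                patterns.append(m.group(1))
    return patterns
```

For example, the fact (Berlin, Germany) for the predicate `dbo:capital` would yield the pattern "is the capital of" from a sentence stating that relation.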
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders - Isabelle Augenstein
This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as "favor", "against", or "none". In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic.
To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.
Paper: http://isabelleaugenstein.github.io/papers/SemEval2016-Stance.pdf
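The feature pipeline described above can be sketched in miniature: a bag-of-words vector (the input an autoencoder would compress into a dense representation) concatenated with a target-mention indicator, ready for a logistic-regression classifier. The vocabulary and function names below are illustrative, not from the paper's code.

```python
def bow_vector(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(w) for w in vocab]

def stance_features(tweet, target, vocab):
    """Feature vector for the classifier stage: bag-of-words counts
    plus an indicator for whether the target is mentioned in the
    tweet (one of the additional features described in the paper)."""
    indicator = int(target.lower() in tweet.lower())
    return bow_vector(tweet, vocab) + [indicator]
```

In the full system the raw counts are first passed through the trained autoencoder's encoder to obtain a dense representation; here they are left as counts for clarity.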
Presentation of work that will be published at EMNLP 2016.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. emoji2vec: Learning Emoji Representations from their Description. SocialNLP at EMNLP 2016. https://arxiv.org/abs/1609.08359
Georgios Spithourakis, Isabelle Augenstein, Sebastian Riedel. Numerically Grounded Language Models for Semantic Error Correction. EMNLP 2016. https://arxiv.org/abs/1608.04147
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, Kalina Bontcheva. Stance Detection with Bidirectional Conditional Encoding. EMNLP 2016. https://arxiv.org/abs/1606.05464
Relation Extraction from the Web using Distant Supervision - Isabelle Augenstein
Slides of my presentation on "Relation Extraction from the Web using Distant Supervision" at EKAW 2014. Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/EKAW2014-Relation.pdf
Seed Selection for Distantly Supervised Web-Based Relation Extraction - Isabelle Augenstein
Slides of my presentation on "Seed Selection for Distantly Supervised Web-Based Relation Extraction" at the Semantic Web and Information Extraction workshop (SWAIE) and COLING 2014
Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/SWAIE2014-Seed.pdf
Invited talk in Heriot-Watt University Computer Science seminar series about our EMNLP paper on extracting non-standard relations from the Web with Distant Supervision and Imitation Learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
View the EMNLP poster here: http://www.slideshare.net/isabelleaugenstein/extracting-relations-between-nonstandard-entities-using-distant-supervision-and-imitation-learning
Schema-agnostic queries over large-schema databases: a distributional semanti... - Andre Freitas
The evolution of data environments towards growth in the size, complexity, dynamicity and decentralisation (SCoDD) of schemas drastically impacts contemporary data management. The SCoDD trend emerges as a central data management concern in Big Data scenarios, where users and applications demand more complete data, produced by independent data sources, under different semantic assumptions and contexts of use. Most Database Management Systems (DBMSs) today target a closed communication scenario, where the symbolic schema of the database is known a priori by the database user, who is able to interpret it unambiguously. The context in which the data is consumed and produced is well defined and is typically the same context in which the data was created. In contrast, data management under SCoDD conditions targets an open communication scenario, where the symbolic system of the database is unknown to the user and multiple interpretation contexts are possible. In this case the database can be created under a different context from that of the database user. The emergence of this new data environment demands revisiting the semantic assumptions behind databases and designing data access mechanisms which can support semantically heterogeneous (open communication) data environments.
This work aims at filling this gap by proposing a complementary semantic model for databases, based on distributional semantic models. Distributional semantics provides a complementary perspective to the formal perspective of database semantics, supporting semantic approximation as a first-class database operation. Differently from models which describe uncertain and incomplete data or probabilistic databases, distributional-relational models focus on the construction of conceptual approximation approaches for databases, supported by a comprehensive semantic model automatically built from large-scale unstructured data external to the database, which serves as a semantic/commonsense knowledge base. The semantic model can be used to support schema-agnostic queries, i.e. abstracting the data consumer from a specific conceptualization behind the data.
The proposed distributional-relational semantic model is supported by a distributional structured vector space model, named τ-Space, which represents structured data under a distributional semantic model representation and, in coordination with a query planning approach, supports a schema-agnostic query mechanism for large-schema databases. The query mechanism is materialized in the Treo query engine and is evaluated using schema-agnostic natural language queries.
The evaluation of the query mechanism confirms that distributional semantics provides a high-recall, medium-high-precision and low-maintainability solution to cope with the abstraction and conceptual-level differences in schema-agnostic queries over large-schema and schema-less open-domain datasets.
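The semantic approximation operation at the heart of this approach can be sketched as a nearest-neighbour search in a distributional vector space: the query term's vector is compared against the vectors of candidate schema elements, and the closest one wins. The toy vectors below are invented; in a real τ-Space-style system they would come from a large corpus-derived semantic space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy distributional vectors; real ones would be built automatically
# from large-scale unstructured corpora external to the database.
VECTORS = {
    "spouse": [0.9, 0.1, 0.0],
    "wife": [0.85, 0.2, 0.05],
    "birthPlace": [0.0, 0.1, 0.9],
}

def best_schema_match(query_term, schema_terms):
    """Semantic approximation as a ranking operation: return the
    schema element whose distributional vector is closest to the
    vector of the (possibly differently worded) query term."""
    q = VECTORS[query_term]
    return max(schema_terms, key=lambda t: cosine(q, VECTORS[t]))
```

A query phrased with "wife" can thus be matched to a dataset predicate named "spouse" without the user knowing the schema vocabulary.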
folksonomy, social tagging, tag clouds, automatic folksonomy construction, word clouds, wordle, context-preserving word cloud visualisation, CPEWCV, seam carving, inflate and push, star forest, cycle cover, quantitative metrics, realized adjacencies, distortion, area utilization, compactness, aspect ratio, running time, semantics in language technology
Introduction to natural language generation with artificial neural networks (ANNs) and a group poetry writing exercise where humans pretend to be neurons in an ANN.
Slides for my tutorial at the ESWC Summer School 2015, giving an introduction to information extraction with Linked Data and an introduction to one of the applications of information extraction, opinion mining.
Representing Texts as contextualized Entity Centric Linked Data Graphs - Andre Freitas
The integration of even a small fraction of the information present in the Web of Documents into the Linked Data Web can provide a significant shift in the amount of information available to data consumers. However, information extracted from text does not easily fit into the usually highly normalized structure of ontology-based datasets. While the representation of structured data assumes a high level of regularity and relatively simple and consistent conceptual models, the representation of information extracted from texts needs to take into account large terminological variation, complex contextual/dependency patterns, and fuzzy or conflicting semantics. This work focuses on bridging the gap between structured and unstructured data, proposing the representation of text as structured discourse graphs (SDGs), targeting an RDF representation of unstructured data. The representation focuses on a semantic best-effort information extraction scenario, where information from text is extracted under a pay-as-you-go data quality perspective, trading terminological normalization for domain independence, context capture, wider representation scope and maximization of textual information capture.
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study - Andre Freitas
The growing size, heterogeneity and complexity of databases demand the creation of strategies to help users and systems consume data. Ideally, query mechanisms should be schema-agnostic or vocabulary-independent, i.e. they should be able to match user queries expressed in the users' own vocabulary and syntax to the data, abstracting data consumers from the representation of the data. Despite being a central requirement across natural language interfaces and entity search, there is a lack of conceptual analysis of schema-agnosticism and of the associated semantic differences between queries and databases. This work provides an initial conceptualization of schema-agnostic queries, aiming at a fine-grained classification which can support the scoping, evaluation and development of semantic matching approaches for schema-agnostic queries.
How can text-mining leverage developments in Deep Learning? Presentation at ... - jcscholtes
How can text-mining leverage developments in Deep Learning?
Text-mining focusses primary on extracting complex patterns from unstructured electronic data sets and applying machine learning for document classification. During the last decade, a generation of efficient and successful algorithms has been developed using bag-of-words models to represent document content and statistical and geometrical machine learning algorithms such as Conditional Random Fields and Support Vector Machines. These algorithms require relatively little training data and are fast on modern hardware. However, performance seems to be stuck around 90% F1 values.
In computer vision, deep learning has shown great success where the 90% barrier has been broken in many application. In addition, deep learning also shows new successes for transfer learning and self-learning such as reinforcement leaning. Dedicated hardware helped us to overcome computational challenges and methods such as training data augmentation solved the need for unrealistically large data sets.
So it would make sense to apply deep learning to textual data as well. But how do we represent textual data? There are many different methods for word embeddings, and just as many deep learning architectures. Training-data augmentation, transfer learning and reinforcement learning are not yet fully defined for textual data.
Spark Summit Europe: Share and analyse genomic data at scale - Andy Petrella
Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook
Sharp intro to Genomics data
What are the Challenges
Distributed Machine Learning to the rescue
Projects: Distributed teams
Research: Long process
Towards Maximum Share for efficiency
Data Communities - reusable data in and outside your organization. - Paul Groth
Description
Data is critical both as an enabler for an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There is a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data-reuse proxies at scale, understanding data sensemaking, and looking at how researchers search for data), I talk about which practices are a good place to start when helping others to reuse your data. I put this in the context of the notion of data communities, which organizations can use to foster the use of data both internally and externally.
From Web Data to Knowledge: on the Complementarity of Human and Artificial In... - Stefan Dietze
Inaugural lecture at Heinrich-Heine-University Düsseldorf on 28 May 2019.
Abstract:
When searching the Web for information, human knowledge and artificial intelligence are in constant interplay. On the one hand, human online interactions such as click streams, crowd-sourced knowledge graphs, semi-structured web markup or distributional semantic models built from billions of Web documents are informing machine learning and information retrieval models, for instance, as part of the Google search engine. On the other hand, the very same search engines help users in finding relevant documents, facts, or data for particular information needs, thereby helping users to gain knowledge. This talk will give an overview of recent work in both of the aforementioned areas. This includes 1) research on mining structured knowledge graphs of factual knowledge, claims and opinions from heterogeneous Web documents as well as 2) recent work in the field of interactive information retrieval, where supervised models are trained to predict the knowledge (gain) of users during Web search sessions in order to personalise rankings. Both streams of research are converging as part of online platforms and applications to facilitate access to data(sets), information and knowledge.
Where are all the Semantic Web agents? There are billions of "machine readable" open facts on the Semantic Web, i.e. Linked Open Data (LOD), isn't that enough? It looks like it's not. We're still far from seeing Lucy's and Pete's agents brilliantly solving their tasks with the help of other Semantic Web agents they can trust (Tim Berners Lee et al., The Semantic Web, Scientific American (2001) ). Despite its technological impact on many applications and areas, the Semantic Web promised to cause a breakthrough that we didn't yet experience. One issue is that LOD ontologies are not as linked as they should be. Another issue is that formalising only semi-structured Web pages or databases is not enough for making them able to operate. They also need to reason with commonsense knowledge, the encoding of which is a long-standing challenge in Artificial Intelligence. A third consideration is that most existing commonsense knowledge bases lack formal semantics and situational constraints. In this talk I will advocate the role of the Semantic Web as a provider of a knowledge graph of commonsense to Artificial Intelligence, and discuss ways and obstacles towards the achievement of this goal.
Applying Noisy Knowledge Graphs to Real Problems - DataWorks Summit
Knowledge graphs (KGs) have recently emerged as a powerful way to represent knowledge in multiple communities, including data mining, natural language processing and machine learning. Large-scale KGs like Wikidata and DBpedia are openly available, while in industry, the Google Knowledge Graph is a good example of proprietary knowledge that continues to fuel impressive advances in Google's semantic search capabilities. Yet, both crowdsourced and automatically constructed KGs suffer from noise, both during KG construction and during search and inference. In this talk, I will discuss how to build and use such knowledge graphs effectively, despite the noise and sparsity of labeled data, to solve real-world social problems such as providing insights in disaster situations, and helping law enforcement fight human trafficking. I will conclude by providing insight on the lessons learned, and the applicability of research techniques to industrial problems. The talk will be designed to appeal both to business and technical leaders.
Talking to your Data: Natural Language Interfaces for a schema-less world (Ke... - Andre Freitas
The increase in the size, heterogeneity and complexity of contemporary Big Data environments brings major challenges for the consumption of structured and semi-structured data. Addressing these challenges requires a convergence of approaches from different communities including databases, natural language processing, and information retrieval. Research on Natural Language Interfaces (NLI) and Question Answering systems has played a prominent role in stimulating a multidisciplinary approach to the problem that has moved the field from a futuristic vision to a concrete industry-level technological trend.
In this talk we distill the key principles of state-of-the-art approaches for data consumption using NLI. Particular attention is paid to the maturity and effectiveness of each approach together with discussion on future trends and active research questions.
Bringing Machine Learning and Knowledge Graphs Together
Six Core Aspects of Semantic AI:
- Hybrid Approach
- Data Quality
- Data as a Service
- Structured Data Meets Text
- No Black-box
- Towards Self-optimizing Machines
This workshop was presented in Riyadh, Saudi Arabia, on 21-22 Jan 2019, in collaboration with the Riyadh Data Geeks group.
To learn more about the workshop please see this website:
http://bit.ly/2Ucjmm5
ODSC East 2017: Data Science Models For Good - Karry Lu
Abstract: The rise of data science has been largely fueled by the promise of changing the business landscape - enhancing one's competitive advantage, increasing business optimization and efficiency, and ultimately delivering a better bottom line. This promise reaches across sectors as machine learning methods are getting better, data access continues to grow, and computation power is easily accessible. However, because the practice of doing data science can be expensive, there is a danger that this so-called promise of data science may only be available to the most well-resourced organizations with sophisticated data capabilities and staff. For the past five years, DataKind has been working to ensure social change organizations too have access to data science, teaming them up with data scientists to build machine learning and artificial intelligence solutions that aim to reduce human suffering. In doing so, DataKind has learned what it takes to apply data science in the social sector and the many applications it has for creating positive change in the world. This session presents DataKind projects showcasing the wide range of applications of ML/AI for social good: using satellite imagery and remote-sensing techniques to detect wheat farm boundaries and protect livelihoods in Ethiopia; leveraging NLP to automate the time-consuming process of synthesizing findings from academic studies to inform conservation efforts; classifying text records to better understand human rights conditions across the world; and using machine learning to reduce traffic fatalities in U.S. cities. Learn about some of the latest breakthroughs and findings in the data-science-for-social-good space, and how you can get involved.
In this talk we will summarise some of the detectable trends on AI beyond deep learning. We will focus on the current transition from deep learning to deep semantics, describing the enabling infrastructures, challenges and opportunities in the construction of the next generation AI systems. The talk will focus on Natural Language Processing (NLP) as an AI sub-domain and will link to the research at the AI Systems Lab at the University of Manchester.
Building AI Applications using Knowledge Graphs - Andre Freitas
Goals of this Tutorial:
Provide a broad view of the multiple perspectives underlying knowledge graphs.
Show knowledge graphs as a foundation for building AI systems.
Method:
Focus on the contemporary and emerging perspectives.
Sampling exemplar approaches and infrastructures on each of these emerging perspectives (not an exhaustive survey).
Effective Semantics for Engineering NLP Systems - Andre Freitas
Provide a synthesis of the emerging representation trends behind NLP systems.
Shift in perspective:
Effective engineering (task driven, scalable) instead of sound formalism.
Best-effort representation.
Knowledge Graphs (Frege revisited)
Information Extraction & Text Classification
Distributional Semantic Models
Knowledge Graphs & Distributional Semantics
(Distributional-Relational Models)
Applications of DRMs
KG Completion
Semantic Parsing
Natural Language Inference
This paper discusses the “Fine-Grained Sentiment Analysis on Financial Microblogs and News” task as part of SemEval-2017, specifically under the “Detecting sentiment, humour, and truth” theme. This task contains two tracks, where the first one concerns Microblog messages and the second one covers News Statements and Headlines. The main goal behind both tracks was to predict the sentiment score for each of the mentioned companies/stocks. The sentiment scores for each text instance adopted floating point values in the range of -1 (very negative/bearish) to 1 (very positive/bullish), with 0 designating neutral sentiment. This task attracted a total of 32 participants, with 25 participating in Track 1 and 29 in Track 2.
Categorization of Semantic Roles for Dictionary Definitions - Andre Freitas
Understanding the semantic relationships between terms is a fundamental task in natural language processing applications. While structured resources that can express those relationships in a formal way, such as ontologies, are still scarce, a large number of linguistic resources gathering dictionary definitions is becoming available; understanding the semantic structure of natural language definitions, however, is fundamental to make them useful in semantic interpretation tasks. Based on an analysis of a subset of WordNet’s glosses, we propose a set of semantic roles that compose the semantic structure of a dictionary definition, and show how they are related to the definition’s syntactic configuration, identifying patterns that can be used in the development of information extraction frameworks and semantic models.
How hard is this Query? Measuring the Semantic Complexity of Schema-agnostic ... - Andre Freitas
The growing size, heterogeneity and complexity of databases demand the creation of strategies that make it easier for users and systems to consume data. Ideally, query mechanisms should be schema-agnostic, i.e. they should be able to match user queries expressed in the user's own vocabulary and syntax to the data, abstracting data consumers from the representation of the data. This work provides an information-theoretical framework to evaluate the semantic complexity involved in the query-database communication, under a schema-agnostic query scenario. Different entropy measures are introduced to quantify the semantic phenomena involved in the user-database communication, including structural complexity, ambiguity, synonymy and vagueness. The entropy measures are validated using natural language queries over Semantic Web databases. The analysis of the semantic complexity is used to improve the understanding of the core semantic dimensions present in the query-data matching process, allowing the improvement of the design of schema-agnostic query mechanisms and defining measures which can be used to assess the semantic uncertainty, or difficulty, behind a schema-agnostic querying task.
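To make the entropy idea concrete, here is a minimal sketch of how ambiguity could be quantified. The candidate predicates and their match probabilities are purely hypothetical; the abstract does not specify the exact measures, only that Shannon-style entropies are used:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical example: a query term ("wrote") maps to candidate KB
# predicates with these assumed match probabilities.
# Higher entropy means a more ambiguous, harder-to-resolve mapping.
candidates = {"dbo:author": 0.5, "dbo:writer": 0.3, "dbo:creator": 0.2}
h = entropy(candidates.values())
print(round(h, 3))  # -> 1.485 bits
```

A term with a single unambiguous mapping would have entropy 0; a uniform spread over many candidates maximizes it.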
A Semantic Web Platform for Automating the Interpretation of Finite Element ... - Andre Freitas
Finite Element (FE) models provide a rich framework to simulate dynamic biological systems, with applications ranging from hearing to cardiovascular research. With the growing complexity and sophistication of FE bio-simulation models (e.g. multi-scale and multi-domain models), the effort associated with the creation, analysis and reuse of a FE model can grow unmanageable. This work investigates the role of semantic technologies to improve the automation, interpretation and reproducibility of FE simulations. In particular, the paper focuses on the definition of a reference semantic architecture for FE bio-simulations and on the discussion of strategies to bridge the gap between numerical-level and conceptual-level representations. The discussion is grounded on the SIFEM platform, a semantic infrastructure for FE simulations for cochlear mechanics.
Towards a Distributional Semantic Web Stack - Andre Freitas
The ability of distributional semantic models (DSMs) to discover similarities over large-scale, heterogeneous and poorly structured data makes them a promising universal and low-effort framework to support semantic approximation and knowledge discovery. This position paper explores the role of distributional semantics in the Semantic Web vision, based on the state-of-the-art distributional-relational models, categorizing and generalizing existing approaches into a Distributional Semantic Web stack.
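The core DSM idea (words occurring in similar contexts get similar vector representations) can be sketched in a few lines. This toy example uses raw co-occurrence counts over an invented three-sentence corpus; real DSMs are built from billions of tokens and apply weighting and dimensionality reduction:

```python
from collections import defaultdict, Counter
import math

def cooccurrence_vectors(corpus, window=2):
    """Represent each word by counts of the words co-occurring within a window."""
    vecs = defaultdict(Counter)
    for sentence in corpus:
        toks = sentence.lower().split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    vecs[w][toks[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v*v for v in a.values())) * math.sqrt(sum(v*v for v in b.values()))
    return num / den if den else 0.0

corpus = ["the cat chased the mouse",
          "the dog chased the ball",
          "the cat ate the mouse"]
v = cooccurrence_vectors(corpus)
# "cat" and "dog" share contexts ("the", "chased"), so they end up
# more similar to each other than "cat" is to "ball":
print(cosine(v["cat"], v["dog"]) > cosine(v["cat"], v["ball"]))  # -> True
```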
On the Semantic Representation and Extraction of Complex Category Descriptors - Andre Freitas
Natural language descriptors used for categorizations are present from folksonomies to ontologies. While some descriptors are composed of simple expressions, other descriptors have complex compositional patterns (e.g. ‘French Senators Of The Second Empire’, ‘Churches Destroyed In The Great Fire Of London And Not Rebuilt’). As conceptual models get more complex and decentralized, more content is transferred to unstructured natural language descriptors, increasing the terminological variation and reducing the conceptual integration and the structure level of the model. This work describes a formal representation for complex natural language category descriptors (NLCDs). In the representation, complex categories are decomposed into a graph of primitive concepts, supporting their interlinking and semantic interpretation. A category extractor is built and the quality of its extraction under the proposed representation model is evaluated.
A Distributional Semantics Approach for Selective Reasoning on Commonsense Gr... - Andre Freitas
Tasks such as question answering and semantic search are dependent on the ability of querying & reasoning over large-scale commonsense knowledge bases (KBs). However, dealing with commonsense data demands coping with problems such as the increase in schema complexity, semantic inconsistency, incompleteness and scalability. This paper proposes a selective graph navigation mechanism based on a distributional relational semantic model which can be applied to querying & reasoning over heterogeneous knowledge bases (KBs). The approach can be used for approximative reasoning, querying and associational knowledge discovery. In this paper we focus on commonsense reasoning as the main motivational scenario for the approach. The approach focuses on addressing the following problems: (i) providing a semantic selection mechanism for facts which are relevant and meaningful in a specific reasoning & querying context and (ii) allowing coping with information incompleteness in large KBs. The approach is evaluated using ConceptNet as a commonsense KB, and achieved high selectivity, high scalability and high accuracy in the selection of meaningful navigational paths. Distributional semantics is also used as a principled mechanism to cope with information incompleteness.
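The selective-navigation idea can be sketched as: expand only those graph neighbours that are distributionally related to the query context. The toy graph, context sets and Jaccard "relatedness" below are hypothetical stand-ins for ConceptNet and a real distributional model:

```python
# Toy relatedness: Jaccard overlap of context-word sets, standing in
# for similarity under a distributional semantic model.
def relatedness(a, b, contexts):
    ca, cb = contexts[a], contexts[b]
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

def selective_expand(node, graph, context_word, contexts, threshold=0.3):
    """Follow only edges whose target is related enough to the query context."""
    return [t for t in graph.get(node, [])
            if relatedness(t, context_word, contexts) >= threshold]

# Hypothetical KB fragment and context sets:
graph = {"car": ["wheel", "banana", "engine"]}
contexts = {"wheel":  {"drive", "road", "vehicle"},
            "engine": {"drive", "road", "vehicle"},
            "banana": {"fruit", "yellow"},
            "drive":  {"road", "vehicle", "car"}}

# In a query context about "drive", the irrelevant "banana" edge is pruned:
print(selective_expand("car", graph, "drive", contexts))  # -> ['wheel', 'engine']
```

The pruning is what makes navigation over a large, noisy commonsense graph tractable: paths that are semantically unrelated to the reasoning context are never explored.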
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach - Andre Freitas
Big Data is based on the vision of providing users and applications with a more complete picture of the reality supported and mediated by data. This vision comes with the inherent price of data variety, i.e. data which is semantically heterogeneous, poorly structured, complex and with data quality issues. Despite the hype on technologies targeting data volume and velocity, solutions for coping with data variety remain fragmented and with limited adoption. In this talk we will focus on emerging data management approaches, supported by semantic technologies, to cope with data variety. We will provide a broad overview of semantic computing approaches and how they can be applied to data management challenges within organizations today. This talk will allow the audience to have a glimpse into the next-generation, Big Data-driven information systems.
Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributio... - Andre Freitas
The demand to access large amounts of heterogeneous structured data is emerging as a trend for many users and applications. However, the effort involved in querying heterogeneous and distributed third-party databases can create major barriers for data consumers. At the core of this problem is the semantic gap between the way users express their information needs and the representation of the data. This work aims to provide a natural language interface and an associated semantic index to support an increased level of vocabulary independency for queries over Linked Data/Semantic Web datasets, using a distributional-compositional semantics approach. Distributional semantics focuses on the automatic construction of a semantic model based on the statistical distribution of co-occurring words in large-scale texts. The proposed query model targets the following features: (i) a principled semantic approximation approach with low adaptation effort (independent from manually created resources such as ontologies, thesauri or dictionaries), (ii) comprehensive semantic matching supported by the inclusion of large volumes of distributional (unstructured) commonsense knowledge into the semantic approximation process and (iii) expressive natural language queries. The approach is evaluated using natural language queries on an open domain dataset and achieved avg. recall=0.81, mean avg. precision=0.62 and mean reciprocal rank=0.49.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, machine learning over just any symbolic structure is not sufficient to really harvest the gains of NeSy. These gains will only materialize when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation is no easy task. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
2. Understand how Question Answering (QA) can address Linked Data consumption challenges.
Provide a quick overview of the state-of-the-art.
Provide the fundamental pointers to develop your own QA system.
3. Motivation & Context
Challenges for QA over Linked Data
The Anatomy of a QA System
QA over Linked Data (Case Studies)
Evaluation of QA over Linked Data
Do-it-yourself (DIY): Core Resources
Trends
Take-away Message
5. Humans come with built-in natural language communication capabilities.
A very natural way for humans to communicate information needs.
The archetypal AI system.
7. A research field on its own.
Empirical bias: Focus on the development and evaluation of approaches and systems to answer questions over a knowledge base.
Multidisciplinary:
◦Natural Language Processing
◦Information Retrieval
◦Knowledge Representation
◦Databases
◦Linguistics
◦Artificial Intelligence
◦Software Engineering
◦...
8. From the QA expert perspective
◦QA depends on mastering different semantic computing techniques.
10. Keyword Search:
◦The user still carries most of the effort of interpreting the data.
◦Satisfying information needs may depend on multiple search operations.
◦Answer-driven information access.
◦Input: Keyword search
Typically specification of simpler information needs.
◦Output: documents, structured data.
QA:
◦Delegates more ‘interpretation effort’ to the machines.
◦Query-driven information access.
◦Input: natural language query
Specification of complex information needs.
◦Output: direct answer.
11. Structured Queries:
◦A priori user effort in understanding the schemas behind databases.
◦Effort in mastering the syntax of a query language.
◦Satisfying information needs may depend on multiple querying operations.
◦Input: Structured query
◦Output: data records, aggregations, etc.
QA:
◦Delegates more ‘semantic interpretation effort’ to the machine.
◦Input: natural language query
◦Output: direct natural language answer
12. Keyword search:
◦Simple information needs.
◦Vocabulary redundancy (large document collections, Web).
Structured queries:
◦Demand for absolute precision/recall guarantees.
◦Small & centralized schemas.
◦More data volume/smaller schema size.
QA:
◦Heterogeneous and schema-less data.
◦Specification of complex information needs.
◦More automated semantic interpretation.
17. QA is usually associated with the delegation of more of the ‘interpretation effort’ to the machines.
QA, keyword search and structured queries are complementary data access perspectives.
QA is making its way into industry.
20. Example: What is the currency of the Czech Republic?
SELECT DISTINCT ?uri WHERE {
res:Czech_Republic dbo:currency ?uri .
}
Main challenges:
Mapping natural language expressions to vocabulary elements (accounting for lexical and structural differences).
Handling meaning variations (e.g. ambiguous or vague expressions, anaphoric expressions).
21. URIs are language independent identifiers.
Their only actual connection to natural language is through the labels attached to them.
dbo:spouse rdfs:label “spouse”@en , “echtgenoot”@nl .
Labels, however, do not capture lexical variation:
wife of
husband of
married to
...
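One way to bridge this lexical gap is an explicit property lexicon that maps surface variants to vocabulary URIs. A minimal sketch in Python — the table itself is an illustrative assumption; real systems derive such entries from labels, WordNet expansions, or pattern libraries:

```python
# Hand-built property lexicon (an assumption for illustration; real
# systems populate such tables automatically from labels and corpora).
PROPERTY_LEXICON = {
    "spouse": "dbo:spouse",
    "wife of": "dbo:spouse",
    "husband of": "dbo:spouse",
    "married to": "dbo:spouse",
}

def match_property(phrase):
    # Normalize the surface phrase and look it up in the lexicon.
    return PROPERTY_LEXICON.get(phrase.lower().strip())
```

All four lexical variants now resolve to the same URI, while unknown phrases fall through to other matching strategies.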
22. Which Greek cities have more than 1 million inhabitants?
SELECT DISTINCT ?uri
WHERE {
?uri rdf:type dbo:City .
?uri dbo:country res:Greece .
?uri dbo:populationTotal ?p .
FILTER (?p > 1000000)
}
23. Often the conceptual granularity of language does not coincide with that of the data schema.
When did Germany join the EU?
SELECT DISTINCT ?date
WHERE {
res:Germany dbp:accessioneudate ?date .
}
Who are the grandchildren of Bruce Lee?
SELECT DISTINCT ?uri
WHERE {
res:Bruce_Lee dbo:child ?c .
?c dbo:child ?uri .
}
24. In addition, there are expressions with a fixed, dataset-independent meaning.
Who produced the most films?
SELECT DISTINCT ?uri
WHERE {
?x rdf:type dbo:Film .
?x dbo:producer ?uri .
}
ORDER BY DESC(COUNT(?x))
OFFSET 0 LIMIT 1
25. Different datasets usually follow different schemas, thus provide different ways of answering an information need.
Example:
26. The meaning of expressions like the verbs to be, to have, and prepositions of, with, etc. strongly depends on the linguistic context.
Which museum has the most paintings?
?museum dbo:exhibits ?painting .
Which country has the most caves?
?cave dbo:location ?country .
27. The number of non-English-speaking users on the web is growing substantially.
◦Accessing data.
◦Creating and publishing data.
Semantic Web:
In principle very well suited for multilinguality, as URIs are language-independent.
But adding multilingual labels is not common practice (less than a quarter of the RDF literals have language tags, and most of those tags are in English).
28. Requirement: Completeness and accuracy
(Wrong answers are worse than no answers)
In the context of the Semantic Web:
QA systems need to deal with heterogeneous and imperfect data.
◦Datasets are often incomplete.
◦Different datasets sometimes contain duplicate information, often using different vocabularies even when talking about the same things.
◦Datasets can also contain conflicting information and inconsistencies.
29. Data is distributed among a large collection of interconnected datasets.
Example: What are side effects of drugs used for the
treatment of Tuberculosis?
SELECT DISTINCT ?x
WHERE {
disease:1154 diseasome:possibleDrug ?d1.
?d1 a drugbank:drugs .
?d1 owl:sameAs ?d2.
?d2 sider:sideEffect ?x.
}
30. Requirement: Real-time answers, i.e. low processing time.
In the context of the Semantic Web:
Datasets are huge.
◦There are a lot of distributed datasets that might be relevant for answering the question.
◦Reported performance of current QA systems amounts to ~20-30 seconds per question (on one dataset).
31. Bridge the gap between natural languages and data.
Deal with incomplete, noisy and heterogeneous datasets.
Scale to a large number of huge datasets.
Use distributed and interlinked datasets.
Integrate structured and unstructured data.
Low maintainability costs (easily adaptable to new datasets and domains).
33. Categorization of question, answer and data types.
Important for:
◦What information in the question can be used?
◦Scoping the QA system.
◦Understanding the challenges before attacking the problem.
Based on:
◦Chin-Yew Lin: Question Answering.
◦Farah Benamara: Question Answering Systems: State of the Art and Future Directions.
34. Natural Language Interfaces (NLI)
◦Input: Natural language queries
◦Output:
QA: Direct answers.
NLI: Database records, text snippets, documents, data visualizations.
36. The part of the question that says what is being asked:
◦Wh-words:
who, what, which, when, where, why, and how
◦Wh-words + nouns, adjectives or adverbs:
“which party …”, “which actress …”, “how long …”, “how tall …”.
37. Question focus: the property or entity being sought by the question.
◦“In which city was Barack Obama born?”
◦“What is the population of Galway?”
Question topic: what the question is generally about.
◦“What is the height of Mount Everest?”
(geography, mountains)
◦“Which organ is affected by Ménière’s disease?”
(medicine)
38. Useful for distinguishing different processing strategies
◦FACTOID:
PREDICATIVE QUESTIONS:
“Who was the first man in space?”
“What is the highest mountain in Korea?”
“How far is Earth from Mars?”
“When did the Jurassic Period end?”
“Where is the Taj Mahal?”
LIST:
“Give me all cities in Germany.”
SUPERLATIVE:
“What is the highest mountain?”
YES-NO:
“Was Margaret Thatcher a chemist?”
39. Useful for distinguishing different processing strategies
◦OPINION:
“What do most Americans think of gun control?”
◦CAUSE & EFFECT:
“What is the most frequent cause for lung cancer?”
◦PROCESS:
“How do I make a cheese cake?”
◦EXPLANATION & JUSTIFICATION:
“Why did the revenue of IBM drop?”
◦ASSOCIATION QUESTION:
“What is the connection between Barack Obama and Indonesia?”
◦EVALUATIVE OR COMPARATIVE QUESTIONS:
“What is the difference between impressionism and expressionism?”
40. Usually:
◦Rules + Part-of-Speech Tags + Regular Expressions
... goes a long way!
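Such a rule-based classifier can be sketched in a few lines of Python. The rules and class names below are illustrative assumptions, not an exhaustive taxonomy:

```python
import re

# Minimal rule-based question-type classifier; first matching rule wins,
# so rule order matters (the rule set is an assumption for illustration).
RULES = [
    (re.compile(r"^(was|is|are|did|does|do)\b", re.I), "YES-NO"),
    (re.compile(r"^give me\b|^list\b", re.I), "LIST"),
    (re.compile(r"\b(most|highest|largest|first)\b", re.I), "SUPERLATIVE"),
    (re.compile(r"^(who|what|which|when|where|how)\b", re.I), "FACTOID"),
]

def classify(question: str) -> str:
    for pattern, qtype in RULES:
        if pattern.search(question):
            return qtype
    return "OTHER"
```

A real system would combine such patterns with part-of-speech tags, but even this bare version separates the example question types from the previous slides.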
47. Relevance: the degree to which the answer addresses the user’s information need.
Correctness: the degree to which the answer is factually correct.
Conciseness: the answer should not contain irrelevant information.
Completeness: the answer should be complete.
Simplicity: the answer should be easy to interpret.
Justification: sufficient context should be provided to support the data consumer in judging the correctness of the answer.
48. Right: The answer is correct and complete.
Inexact: The answer is incomplete or incorrect.
Unsupported: The answer does not have an appropriate evidence/justification.
Wrong: The answer is not appropriate for the question.
50. Simple Extraction: Direct extraction of snippets from the original document(s) / data records.
Combination: Combines excerpts from multiple sentences, documents / multiple data records, databases.
Summarization: Synthesis from large texts / data collections.
Operational/functional: Depends on the application of functional operators.
Reasoning: Depends on the application of an inference process over the original data.
51. Semantic Tractability (Popescu et al., 2003): Lexical and syntactic conditions for soundness and completeness.
Semantic Resolvability (Freitas et al., 2014): Vocabulary mapping types between the query and the answer.
Answer Locality (Webber et al., 2002): Whether answer fragments are distributed across different document fragments / documents or datasets/dataset records.
Derivability (Webber et al., 2002): whether the answer is explicit or implicit, i.e. the level of reasoning dependency.
Semantic Complexity: Level of ambiguity and discourse/data heterogeneity.
53. Data pre-processing: Pre-processes the database data (includes indexing, data cleaning, feature extraction).
Question Analysis: Performs syntactic analysis and detects/extracts the core features of the question (NER, answer type, etc.).
Data Matching: Matches terms in the question to entities in the data.
Query Construction: Generates structured query candidates considering the question-data mappings and the syntactic constraints in the query and in the database.
Scoring: Data matching and the query construction components output several candidates that need to be scored and ranked according to certain criteria.
Answer Retrieval & Extraction: Executes the query and extracts the natural language answer from the result set.
60. Two words are strongly similar if any of the following holds:
◦1. They have a synset in common (e.g. “human” and “person”).
◦2. One word is a hypernym/hyponym of the other in the taxonomy.
◦3. There exists an allowable “is-a” path connecting a synset associated with each word.
◦4. Any of the previous cases holds and the definition (gloss) of one of the word’s synsets (or of its direct hypernyms/hyponyms) includes the other word as one of its synonyms; in this case the words are said to be highly similar.
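The synset-based rules above can be sketched in Python. A toy hand-built synset table stands in for WordNet (an assumption to keep the example self-contained; a real system would query WordNet itself):

```python
# Toy synset table and "is-a" edges (assumptions standing in for WordNet).
SYNSETS = {
    "human":  {"human.n.01"},
    "person": {"human.n.01", "person.n.01"},
    "city":   {"city.n.01"},
    "town":   {"city.n.01"},
}
HYPERNYMS = {  # synset -> its direct hypernyms
    "city.n.01": {"municipality.n.01"},
    "person.n.01": {"organism.n.01"},
}

def share_synset(w1, w2):
    # Rule 1: the two words have a synset in common.
    return bool(SYNSETS.get(w1, set()) & SYNSETS.get(w2, set()))

def isa_connected(w1, w2):
    # Rules 2/3 (simplified): a one-step "is-a" link between their synsets.
    for s1 in SYNSETS.get(w1, set()):
        for s2 in SYNSETS.get(w2, set()):
            if s2 in HYPERNYMS.get(s1, set()) or s1 in HYPERNYMS.get(s2, set()):
                return True
    return False

def strongly_similar(w1, w2):
    return share_synset(w1, w2) or isa_connected(w1, w2)
```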
Lopez et al. 2006
63. Key contributions:
◦Using ontologies to interpret user questions
◦Relies on a deep linguistic analysis that returns semantic representations aligned to the ontology vocabulary and structure
Evaluation: Geobase
64.
Ontology-based QA:
◦Ontologies play a central role in interpreting user questions
◦Output is a meaning representation that is aligned to the ontology underlying the dataset that is queried
◦ontological knowledge is used for drawing inferences, e.g. for resolving ambiguities
Grammar-based QA:
◦Rely on linguistic grammars that assign a syntactic and semantic representation to lexical units
◦Advantage: can deal with questions of arbitrary complexity
◦Drawback: brittleness (fail if question cannot be parsed because expressions or constructs are not covered by the grammar)
65.
Ontology-independent entries
◦mostly function words
quantifiers (some, every, two)
wh-words (who, when, where, which, how many)
negation (not)
◦manually specified and re-usable for all domains
Ontology-specific entries
◦content words and phrases corresponding to concepts and properties in the ontology
◦automatically generated from an ontology lexicon
66. Aim: capture rich and structured linguistic information about how ontology elements are lexicalized in a particular language
lemon (Lexicon Model for Ontologies)
http://lemon-model.net
◦meta-model for describing ontology lexica with RDF
◦declarative (abstracting from specific syntactic and semantic theories)
◦separation of lexicon and ontology
67. Semantics by reference:
◦The meaning of lexical entries is specified by pointing to elements in the ontology.
Example:
69. Which cities have more than three universities?
SELECT DISTINCT ?x WHERE {
?x rdf:type dbo:City .
?y rdf:type dbo:University .
?y dbo:city ?x .
}
GROUP BY ?x
HAVING (COUNT(?y) > 3)
71. Key contributions:
◦Constructs a query template that directly mirrors the linguistic structure of the question
◦Instantiates the template by matching natural language expressions with ontology concepts
Evaluation: QALD 2012
72. In order to understand a user question, we need to understand:
The words (dataset-specific)
Abraham Lincoln → res:Abraham_Lincoln
died in → dbo:deathPlace
The semantic structure (dataset-independent)
who → SELECT ?x WHERE { … }
the most N → ORDER BY DESC(COUNT(?N)) LIMIT 1
more than i N → HAVING COUNT(?N) > i
73. Goal: An approach that combines both an analysis of the semantic structure and a mapping of words to URIs.
Two-step approach:
◦1. Template generation
Parse question to produce a SPARQL template that directly mirrors the structure of the question, including filters and aggregation operations.
◦2. Template instantiation
Instantiate SPARQL template by matching natural language expressions with ontology concepts using statistical entity identification and predicate detection.
74. SPARQL template:
SELECT DISTINCT ?x WHERE {
?y rdf:type ?c .
?y ?p ?x .
}
ORDER BY DESC(COUNT(?y))
OFFSET 0 LIMIT 1
?c CLASS [films]
?p PROPERTY [produced]
Instantiations:
?c = <http://dbpedia.org/ontology/Film>
?p = <http://dbpedia.org/ontology/producer>
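The instantiation step can be sketched as slot filling over the template string. The URIs follow the slide; the candidate scores and the ranking logic are illustrative assumptions:

```python
# SPARQL template with open slots ?c (class) and ?p (property),
# following the structure on the slide. Braces are escaped for str.format.
TEMPLATE = """SELECT DISTINCT ?x WHERE {{
  ?y rdf:type {c} .
  ?y {p} ?x .
}}
ORDER BY DESC(COUNT(?y)) LIMIT 1"""

def instantiate(template, slots):
    # slots: slot name -> list of (uri, score) candidates.
    # Fill each slot with its highest-scoring candidate.
    best = {name: max(cands, key=lambda c: c[1])[0]
            for name, cands in slots.items()}
    return template.format(**best)

query = instantiate(TEMPLATE, {
    "c": [("<http://dbpedia.org/ontology/Film>", 0.76),
          ("<http://dbpedia.org/ontology/FilmFestival>", 0.60)],
    "p": [("<http://dbpedia.org/ontology/producer>", 0.81)],
})
```

With these candidates, the winning instantiation uses dbo:Film and dbo:producer, matching the example on the slide.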
77. 1. Natural language question is tagged with part-of-speech information.
2. Based on POS tags, grammar entries are built on the fly.
◦Grammar entries are pairs of:
tree structures (Lexicalized Tree Adjoining Grammar)
semantic representations (ext. Discourse Representation Structures)
3. These lexical entries, together with domain-independent lexical entries, are used for parsing the question (cf. Pythia).
4. The resulting semantic representation is translated into a SPARQL template.
78. Domain-independent: who, the most
Domain-dependent: produced/VBD, films/NNS
SPARQL template 1:
SELECT DISTINCT ?x WHERE {
?x ?p ?y .
?y rdf:type ?c .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
?c CLASS [films]
?p PROPERTY [produced]
SPARQL template 2:
SELECT DISTINCT ?x WHERE {
?x ?p ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
?p PROPERTY [films]
80. 1. For resources and classes, a generic approach to entity detection is applied:
◦Identify synonyms of the label using WordNet.
◦Retrieve entities with a label similar to the slot label based on string similarities (trigram, Levenshtein and substring similarity).
2. For property labels, the label is additionally compared to natural language expressions stored in the BOA pattern library.
3. The highest ranking entities are returned as candidates for filling the query slots.
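The string similarities mentioned above can be sketched directly in Python — trigram similarity as Jaccard overlap of character trigrams, and Levenshtein distance via the standard dynamic program. The exact formulas used by the system may differ:

```python
def trigrams(s):
    # Pad so that short strings still yield boundary trigrams.
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_sim(a, b):
    # Jaccard similarity over character-trigram sets.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def levenshtein(a, b):
    # Classic edit-distance dynamic program, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,      # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Candidate entity labels are then ranked by combining such scores with, e.g., WordNet synonym expansion.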
83. 1. Every entity receives a score considering string similarity and prominence.
2. The score of a query is then computed as the average of the scores of the entities used to fill its slots.
3. In addition, type checks are performed:
◦For all triples ?x rdf:type <class>, all query triples ?x p e and e p ?x are checked w.r.t. whether domain/range of p is consistent with <class>.
4. Of the remaining queries, the one with highest score that returns a result is chosen.
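A minimal sketch of the scoring and type-checking step. The entity scores are given (upstream they combine string similarity and prominence), and the tiny domain/range table is an assumption for illustration:

```python
# Toy domain/range table (an assumption; a real system reads the ontology).
DOMAIN = {"dbo:producer": "dbo:Person"}  # expected subject type
RANGE  = {"dbo:producer": "dbo:Film"}    # expected object type

def query_score(entity_scores):
    # The query score is the average of its slot-entity scores.
    return sum(entity_scores) / len(entity_scores)

def consistent(triples, types):
    # triples: (s, p, o) with variables; types: variable -> class taken
    # from the query's rdf:type triples. Reject domain/range clashes.
    for s, p, o in triples:
        if p in DOMAIN and types.get(s) and types[s] != DOMAIN[p]:
            return False
        if p in RANGE and types.get(o) and types[o] != RANGE[p]:
            return False
    return True
```

Under this toy table, a query typing ?y as dbo:FilmFestival would be filtered out, mirroring the two candidate queries on the next slide.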
84. SELECT DISTINCT ?x WHERE {
?x <http://dbpedia.org/ontology/producer> ?y .
?y rdf:type <http://dbpedia.org/ontology/Film> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.76
SELECT DISTINCT ?x WHERE {
?x <http://dbpedia.org/ontology/producer> ?y .
?y rdf:type <http://dbpedia.org/ontology/FilmFestival>.
}
ORDER BY DESC(COUNT(?y)) LIMIT 1
Score: 0.60
85. The created template structure does not always coincide with how the data is actually modelled.
Considering all possibilities of how the data could be modelled leads to a large number of templates (and even more queries) for a single question.
86. Kwiatkowski et al., 2013:
Scaling Semantic Parsers with On-the-fly Ontology Matching
87. Recent approaches view interpretation as a machine translation problem (translating natural language questions into meaning representations or SPARQL queries).
Example:
Construct all possible interpretations and learn a model to score and rank them (from either question-query pairs or question-answer pairs).
QA over Freebase.
88. 1. Dataset-independent probabilistic CCG parsing:
◦mapping sentences to underspecified meaning representations (containing generic logical constants not yet aligned to any ontology/dataset schema)
◦one grammar for all domains (with domain-independent entries as well as generic entries built on the basis of POS tags)
E.g.
◦city: N, λx.city(x)
◦visit: (S\NP)/NP, λx.λy.∃e.visit(x,y,e)
89. 2. Ontology matching:
◦structural matching (transformations of the meaning representations)
◦collapsing, e.g. public(x) ∧ library(x) ∧ of(x,NewYork,e) → PublicLibraryOfNewYork
◦expansion, e.g. discover(x,y,e) → discover(x,e) ∧ discover'(y,e)
◦constant matching (replacing all generic constants with constants from the ontology)
This leads to a lot of possible interpretations.
Learn a function that ranks derivations, then prune and pick the highest-ranked one.
90. Estimate a linear model for scoring derivations (including all parsing and matching decisions) from question-answer pairs.
Weighted features include:
◦parse features (e.g. pairings of words with categories)
◦structural features (e.g. types of constants, number of domain-independent constants) → allows adaptation to the knowledge base
◦lexical features (e.g. similarity of NL string and ontology constant based on stem and synonyms)
◦knowledge base features (e.g. violation of domain/range restrictions)
Weights are learned so that they separate derivations that yield correct answers from those that don't.
99. •Most semantic models have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions.
•If these idealizations are removed it is not clear at all that modern semantics can give a full account of all but the simplest models/statements.
Sahlgren, 2013
Baroni et al. 2013
100. “Words occurring in similar (linguistic) contexts are semantically related.”
If we can equate meaning with context, we can simply record the contexts in which a word occurs in a collection of texts (a corpus).
This can then be used as a surrogate of its semantic representation.
101. [Figure: a word–context co-occurrence matrix; rows are words (child, husband, spouse), columns are contexts c1 … cn, and each cell (e.g. 0.7, 0.5) records the number of times the word occurs in that context.]
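A toy version of such a distributional model in Python — the context counts are invented for illustration; real models are built from large corpora:

```python
import math

# Each word is represented by the contexts it occurs in, with counts
# (the vectors below are invented for illustration).
VECTORS = {
    "daughter": {"family": 8, "child": 6, "school": 1},
    "child":    {"family": 7, "child": 5, "school": 2},
    "museum":   {"exhibit": 9, "painting": 4},
}

def cosine(u, v):
    # Cosine similarity between two sparse context vectors.
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sem_rel(w1, w2):
    return cosine(VECTORS[w1], VECTORS[w2])
```

Here sem_rel("daughter", "child") is high because the two words share contexts, while sem_rel("daughter", "museum") is zero; this is the behavior exploited as commonsense matching knowledge later in the talk.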
109. Instance search
◦Proper nouns
◦String similarity + node cardinality
Class (unary predicate) search
◦Nouns, adjectives and adverbs
◦String similarity + Distributional semantic relatedness
Property (binary predicate) search
◦Nouns, adjectives, verbs and adverbs
◦Distributional semantic relatedness
Navigation
Extensional expansion
◦Expands the instances associated with a class.
Operator application
◦Aggregations, conditionals, ordering, position
Disjunction & Conjunction
Disambiguation dialog (instance, predicate)
110. Minimize the impact of ambiguity, vagueness and synonymy.
Address the simplest matchings first (heuristics).
Semantic relatedness as a primitive operation.
Distributional semantics as commonsense knowledge.
Lightweight syntactic constraints.
111. Transform natural language queries into triple patterns.
“Who is the daughter of Bill Clinton married to?”
116. Step 5: Determine Partial Ordered Dependency Structure (PODS)
◦Rule-based:
Remove stop words.
Merge words into entities.
Reorder the structure around the core entity.
Resulting PODS: Bill Clinton (INSTANCE) → daughter → married to → Person (ANSWER TYPE, QUESTION FOCUS)
117. The same PODS annotated with query features: Bill Clinton (INSTANCE) → daughter (PREDICATE) → married to (PREDICATE) → Person (ANSWER TYPE)
118. Map query features into a query plan.
A query plan contains a sequence of:
◦Search operations.
◦Navigation operations.
The query features (INSTANCE, PREDICATE, PREDICATE) map into the query plan:
(1) INSTANCE SEARCH (Bill Clinton)
(2) DISAMBIGUATE ENTITY TYPE
(3) GENERATE ENTITY FACETS
(4) p1 <- SEARCH RELATED PREDICATE (Bill Clinton, daughter)
(5) e1 <- GET ASSOCIATED ENTITIES (Bill Clinton, p1)
(6) p2 <- SEARCH RELATED PREDICATE (e1, married to)
(7) e2 <- GET ASSOCIATED ENTITIES (e1, p2)
(8) POST PROCESS (Bill Clinton, e1, p1, e2, p2)
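Steps (1) and (4)–(7) of the plan can be sketched over a toy triple store. The triples and relatedness scores mirror the following slides; the matching logic is a simplified assumption:

```python
# Toy triple store mirroring the DBpedia fragment on the slides.
TRIPLES = [
    (":Bill_Clinton", ":child", ":Chelsea_Clinton"),
    (":Bill_Clinton", ":religion", ":Baptists"),
    (":Chelsea_Clinton", ":spouse", ":Mark_Mezvinsky"),
]

# Stubbed distributional relatedness scores (a real system computes
# these from a corpus; the "married to" entry is an assumed value).
SEM_REL = {
    ("daughter", "child"): 0.054,
    ("daughter", "religion"): 0.004,
    ("married to", "spouse"): 0.062,
}

def search_related_predicate(entity, term):
    # Steps (4)/(6): pick the pivot entity's predicate most related
    # to the query term.
    preds = {p for s, p, _ in TRIPLES if s == entity}
    return max(preds, key=lambda p: SEM_REL.get((term, p.strip(":")), 0.0))

def get_associated_entities(entity, pred):
    # Steps (5)/(7): navigate from the pivot through the matched predicate.
    return [o for s, p, o in TRIPLES if s == entity and p == pred]
```

Running the plan pivots on :Bill_Clinton, matches "daughter" to :child, navigates to :Chelsea_Clinton, matches "married to" to :spouse, and returns the answer entity.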
120. [Figure: the pivot entity :Bill_Clinton and its associated triples — :child :Chelsea_Clinton, :religion :Baptists, :almaMater :Yale_Law_School, … — matched against the PODS Bill Clinton → daughter → married to → Person.]
121. [Figure: the associated triples ranked by distributional relatedness to the query term “daughter”: sem_rel(daughter, child) = 0.054, sem_rel(daughter, religion) = 0.004, sem_rel(daughter, alma mater) = 0.001.]
123. Computation of a measure of “semantic proximity” between two terms.
Allows a semantic approximate matching between query terms and dataset terms.
It supports a commonsense reasoning-like behavior based on the knowledge embedded in the corpus.
124. [Figure: the predicate “daughter” is matched to :child on the pivot entity :Bill_Clinton, yielding :Chelsea_Clinton.]
125. [Figure: navigation continues from :Chelsea_Clinton, where “married to” matches :spouse, yielding :Mark_Mezvinsky.]
137. Semantic approximation in databases (as in any IR system): semantic best-effort.
Some level of user disambiguation, refinement and feedback is needed.
As we move in the direction of semantic systems, we should expect the need for principled dialog mechanisms (as in human communication).
Pull the user interaction back into the system.
141. Key contributions:
◦Evidence-based QA system.
◦Complex and high performance QA Pipeline.
Uses more than 50 scoring components that produce scores which range from probabilities and counts to categorical features.
◦Major cultural impact:
Before: QA as AI vision, academic exercise.
After: QA as an attainable software architecture in the short term.
Evaluation: Jeopardy! Challenge
142. Example Jeopardy! clue — “Rap” Sheet: This archaic term for a mischievous or annoying child can also mean a rogue or scamp. (Answer: Rapscallion)
Such clues can be more challenging from a question-analysis perspective, but offer higher specificity from an Information Retrieval perspective.
143. Question analysis: includes shallow and deep parsing, extraction of logical forms, semantic role labelling, coreference resolution, relations extraction, named entity recognition, among others.
Question decomposition: decomposition of the question into separate phrases, which will generate constraints that need to be satisfied by evidence from the data.
145. Ferrucci et al. 2010
“Rap” Sheet: the clue “This archaic term for a mischievous or annoying child can also mean a rogue or scamp.” is decomposed into “This archaic term for a mischievous or annoying child.” and “This term can also mean a rogue or scamp.” (Answer: Rapscallion)
146. Hypothesis generation:
◦Primary search
Document and passage retrieval
SPARQL queries are used over triple stores.
◦Candidate answer generation (maximizing recall).
Information extraction techniques are applied to the search results to generate candidate answers.
Soft filtering: Application of lightweight (less resource intensive) scoring algorithms to a larger set of initial candidates to prune the list of candidates before the more intensive scoring components.
148. Hypothesis and evidence scoring:
◦Supporting evidence retrieval: seeks additional evidence for each candidate answer from the data sources.
◦Deep evidence scoring: determines the degree of certainty that the retrieved evidence supports the candidate answers; scores are then combined into an overall evidence profile, which groups individual features into aggregate evidence dimensions.
150. Answer merging: merges answer candidates (hypotheses) with different surface forms but related content, combining their scores.
Ranking and confidence estimation: ranks the hypotheses and estimates their confidence based on the scores, using machine learning approaches over a training set. Multiple trained models cover different question types.
151. [Figure (Ferrucci et al. 2010; Farrell, 2011): the DeepQA architecture. The question passes through Question & Topic Analysis and Question Decomposition into Hypothesis Generation (Primary Search and Candidate Answer Generation over the answer sources), then Hypothesis and Evidence Scoring (Evidence Retrieval and Deep Evidence Scoring over the evidence sources), Synthesis, and Final Confidence Merging & Ranking, which outputs the answer and a confidence estimate. Learned models help combine and weigh the evidence.]
152. [Figure (Ferrucci et al. 2010; Farrell, 2011): the scale of the DeepQA pipeline — one question yields multiple interpretations, 100s of possible answers drawn from 100s of sources, 1000s of pieces of evidence, and 100,000s of scores from many simultaneous text analysis algorithms, all funnelled through hypothesis generation, evidence scoring, synthesis, and final confidence merging & ranking into a single answer with confidence.]
UIMA for interoperability
UIMA-AS for scale-out and speed
156. Recall: measures how complete the answer set is.
The fraction of relevant instances that are retrieved.
Which are the Jovian planets in the Solar System?
◦Returned answers: Mercury, Jupiter, Saturn
◦Gold standard: Jupiter, Saturn, Neptune, Uranus
157. Precision: measures how accurate the answer set is.
The fraction of retrieved instances that are relevant.
Which are the Jovian planets in the Solar System?
◦Returned answers: Mercury, Jupiter, Saturn
◦Gold standard: Jupiter, Saturn, Neptune, Uranus
158. Mean Reciprocal Rank: measures the ranking quality.
The reciprocal rank of a query is 1/r, where r is the rank at which the system returns the first relevant result.
Which are the Jovian planets in the Solar System?
◦Returned answers: Mercury, Jupiter, Saturn
◦Gold standard: Jupiter, Saturn, Neptune, Uranus
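The three metrics can be computed directly on the Jovian-planets example from the slides:

```python
def precision(returned, gold):
    # Fraction of retrieved instances that are relevant.
    return len([a for a in returned if a in gold]) / len(returned)

def recall(returned, gold):
    # Fraction of relevant instances that are retrieved.
    return len([a for a in returned if a in gold]) / len(gold)

def reciprocal_rank(returned, gold):
    # 1/r for the rank r of the first relevant result, 0 if none.
    for r, a in enumerate(returned, 1):
        if a in gold:
            return 1 / r
    return 0.0

returned = ["Mercury", "Jupiter", "Saturn"]
gold = {"Jupiter", "Saturn", "Neptune", "Uranus"}
```

For this answer set: precision 2/3 (two of three returned answers are correct), recall 0.5 (two of four gold answers found), and reciprocal rank 0.5 (first relevant answer at rank 2).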
159. Query execution time
Indexing time
Index size
Dataset adaptation effort (Indexing time)
Semantic enrichment/disambiguation
◦# of operations/time
160. Question Answering over Linked Data (QALD-CLEF)
INEX Linked Data Track
BioASQ
SemSearch
162. QALD is a series of evaluation campaigns on question answering over linked data.
◦QALD-1 (ESWC 2011)
◦QALD-2 as part of the workshop
(Interacting with Linked Data (ESWC 2012))
◦QALD-3 (CLEF 2013)
◦QALD-4 (CLEF 2014)
It is aimed at all kinds of systems that mediate between a user, expressing his or her information need in natural language, and semantic data.
163. QALD-4 is part of the Question Answering track at CLEF 2014:
http://nlp.uned.es/clef-qa/
Tasks:
◦1. Multilingual question answering over DBpedia
◦2. Biomedical question answering on interlinked data
◦3. Hybrid question answering
164. Task:
Given a natural language question or keywords, either retrieve the correct answer(s) from a given RDF repository, or provide a SPARQL query that retrieves these answer(s).
◦Dataset: DBpedia 3.9 (with multilingual labels)
◦Questions: 200 training + 50 test
◦Seven languages:
English, Spanish, German, Italian, French, Dutch, Romanian
165. <question id = "36" answertype = "resource"
aggregation = "false"
onlydbo = "true" >
Through which countries does the Yenisei river flow?
Durch welche Länder fließt der Yenisei?
¿Por qué países fluye el río Yenisei?
...
PREFIX res: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?uri WHERE {
res:Yenisei_River dbo:country ?uri .
}
166. Datasets: SIDER, Diseasome, Drugbank
Questions: 25 training + 25 test
require integration of information from different datasets
Example: What are the side effects of drugs used for Tuberculosis?
SELECT DISTINCT ?x WHERE {
disease:1154 diseasome:possibleDrug ?v2 .
?v2 a drugbank:drugs .
?v3 owl:sameAs ?v2 .
?v3 sider:sideEffect ?x .
}
167. Dataset: DBpedia 3.9 (with English abstracts)
Questions: 25 training + 10 test
require both structured data and free text from the abstract to be answered
Example: Give me the currencies of all G8 countries.
SELECT DISTINCT ?uri WHERE {
?x text:"member of" text:"G8" .
?x dbo:currency ?uri .
}
168. Focuses on the combination of textual and structured data.
Datasets:
◦English Wikipedia (MediaWiki XML Format)
◦DBpedia 3.8 & YAGO2 (RDF)
◦Links among the Wikipedia, DBpedia 3.8, and YAGO2 URIs.
Tasks:
◦Ad-hoc Task: return a ranked list of results in response to a search topic that is formulated as a keyword query (144 search topics).
◦Jeopardy Task: Investigate retrieval techniques over a set of natural- language Jeopardy clues (105 search topics – 74 (2012) + 31 (2013)).
https://inex.mmci.uni-saarland.de/tracks/lod/
170. Focuses on entity search over Linked Datasets.
Datasets:
◦Sample of Linked Data crawled from publicly available sources (based on the Billion Triple Challenge 2009).
Tasks:
◦Entity Search: queries that refer to one particular entity, drawn from a tiny sample of the Yahoo! search query log.
◦List Search: the goal of this track is to select objects that match particular criteria. These queries have been hand-written by the organizing committee.
http://semsearch.yahoo.com/datasets.php#
171. List Search queries:
◦republics of the former Yugoslavia
◦ten ancient Greek city kingdoms of Cyprus
◦the four of the companions of the prophet
◦Japanese-born players who have played in MLB where the British monarch is also head of state
◦nations where Portuguese is an official language
◦bishops who sat in the House of Lords
◦Apollo astronauts who walked on the Moon
172. Entity Search queries:
◦1978 cj5 jeep
◦employment agencies w. 14th street
◦nyc zip code
◦waterville Maine
◦LOS ANGELES CALIFORNIA
◦ibm
◦KARL BENZ
◦MIT
173. Balog & Neumayer, A Test Collection for Entity Search in DBpedia (2013).
174. Datasets:
◦PubMed documents
Tasks:
◦1a: Large-Scale Online Biomedical Semantic Indexing
Automatic annotation of PubMed documents.
Training data is provided.
◦1b: Introductory Biomedical Semantic QA
300 questions and related material (concepts, triples and golden answers).
175. Metrics, Statistics, Tests - Tetsuya Sakai (IR)
◦http://www.promise-noe.eu/documents/10156/26e7f254-1feb-4169-9204-1c53cc1fd2d7
Building Test Collections (IR Evaluation - Ian Soboroff)
◦http://www.promise-noe.eu/documents/10156/951b6dfb-a404-46ce-b3bd-4bbe6b290bfd
177. DBpedia
◦http://dbpedia.org/
YAGO
◦http://www.mpi-inf.mpg.de/yago-naga/yago/
Freebase
◦http://www.freebase.com/
Wikipedia dumps
◦http://dumps.wikimedia.org/
ConceptNet
◦http://conceptnet5.media.mit.edu/
Common Crawl
◦http://commoncrawl.org/
Where to use:
◦As a commonsense KB or as a data source
178. High domain coverage:
◦~95% of Jeopardy! Answers.
◦~98% of TREC answers.
Wikipedia is entity-centric.
Curated link structure.
Complementary tools:
◦Wikipedia Miner.
Where to use:
◦Construction of distributional semantic models.
◦As a commonsense KB
179. WordNet
◦http://wordnet.princeton.edu/
Wiktionary
◦http://www.wiktionary.org/
◦API: https://www.mediawiki.org/wiki/API:Main_page
FrameNet
◦https://framenet.icsi.berkeley.edu/fndrupal/
VerbNet
◦http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
English lexicon for DBpedia 3.8 (in the lemon format)
◦http://lemon-model.net/lexica/dbpedia_en/
PATTY (collection of semantically-typed relational patterns)
◦http://www.mpi-inf.mpg.de/yago-naga/patty/
BabelNet
◦http://babelnet.org/
Where to use:
◦Query expansion
◦Semantic similarity
◦Semantic relatedness
◦Word sense disambiguation
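The first use listed, query expansion, can be sketched with a lexicon-based expander. The `SYNONYMS` map is a hypothetical miniature stand-in for a resource such as WordNet or Wiktionary; a real system would also disambiguate word senses before expanding:

```python
# Toy lexicon-based query expansion: add lexical variants of each query
# term to improve recall. SYNONYMS is a hypothetical miniature lexicon,
# not the WordNet API.

SYNONYMS = {
    "movie": ["film", "picture"],
    "car": ["automobile", "auto"],
}

def expand_query(query, lexicon=SYNONYMS):
    """Return the query terms plus their lexical variants, in order."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(lexicon.get(term, []))
    return expanded

print(expand_query("car movie"))
# → ['car', 'automobile', 'auto', 'movie', 'film', 'picture']
```

The expanded term list would then be matched against the dataset vocabulary, which is why these lexical resources help bridge the vocabulary gap between queries and data.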
188. Querying distributed linked data
Integration of structured and unstructured data
User interaction and context mechanisms
Integration of reasoning (deductive, inductive, counterfactual, abductive ...) on QA approaches and test collections
Measuring confidence and answer uncertainty
Multilinguality
Machine Learning
Reproducibility and resource integration in QA research
189. Linked/Big Data demand new principled semantic approaches to cope with the scale and heterogeneity of data.
Part of the Semantic Web/AI vision can be addressed today with a multi-disciplinary perspective:
◦Linked Data, IR and NLP
The multidisciplinarity of the QA problem showcases what semantic computing has achieved and what can be transferred to other types of information systems.
Challenges are moving from the construction of basic QA systems to more sophisticated semantic functionalities.
Very active research area.
190. [1] Kaufmann & Bernstein, How Useful are Natural Language Interfaces to the Semantic Web for Casual End-users?, 2007.
[2] Chin-Yew Lin, Question Answering.
[3] Farah Benamara, Question Answering Systems: State of the Art and Future Directions.
[4] Yahya et al., Robust Question Answering over the Web of Linked Data, CIKM, 2013.
[5] Freitas et al., Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches and Trends, 2012.
[6] Freitas et al., Answering Natural Language Queries over Linked Data Graphs: A Distributional Semantics Approach, 2014.
[7] Freitas et al., Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches and Trends, 2012.
[8] Freitas & Curry, Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach, IUI, 2014.
[9] Cimiano et al., Towards Portable Natural Language Interfaces to Knowledge Bases, 2008.
[10] Lopez et al., PowerAqua: Fishing the Semantic Web, 2006.
[11] Damljanovic et al., Natural Language Interfaces to Ontologies: Combining Syntactic Analysis and Ontology-based Lookup through the User Interaction, 2010.
[12] Unger et al., Template-based Question Answering over RDF Data, 2012.
[13] Cabrio et al., QAKiS: an Open Domain QA System based on Relational Patterns, 2012.
[14] Kaufmann & Bernstein, How Useful Are Natural Language Interfaces to the Semantic Web for Casual End-Users?, 2007.
[15] Popescu et al., Towards a Theory of Natural Language Interfaces to Databases, 2003.
[16] Farrel, IBM Watson: A Brief Overview and Thoughts for Healthcare Education and Performance Improvement.
[17] Freitas et al., On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study, NLIWoD, 2014.