Ontology-based data access: why it is so cool! (Josef Hardi)
A brief introduction to ontology-based data access (OBDA) and its core implementation. I also present a recent simple benchmark between -ontop- and Semantika, the two most readily available OBDA frameworks, in terms of query performance (details are included in the appendix section). The slides were presented at the Friday Research Meeting of the Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons Attribution 3.0
Slides for the following paper: NLP Data Cleansing Based on Linguistic Ontology Constraints
Abstract: Linked Data comprises an unprecedented volume of structured data on the Web and is adopted by an increasing number of domains. However, the varying quality of published data forms a barrier to further adoption, especially for Linked Data consumers. In this paper, we extend a previously developed methodology of Linked Data quality assessment, which is inspired by test-driven software development. Specifically, we enrich it with ontological support and different levels of result reporting, and describe how the method is applied in the Natural Language Processing (NLP) area. NLP is – compared to other domains, such as biology – a late Linked Data adopter. However, it has seen a steep rise of activity in the creation of data and ontologies. NLP data quality assessment has become an important need for NLP datasets. In our study, we analysed 11 datasets using the lemon and NIF vocabularies in 277 test cases and point out common quality issues.
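To make the test-case idea concrete, here is a minimal sketch of what such a quality test could look like as a SPARQL query executed with Python's rdflib; the file name and the specific lemon constraint are invented for illustration and are not one of the paper's 277 test cases.

from rdflib import Graph

# Hypothetical test case: every lemon lexical entry must have a canonical form.
g = Graph()
g.parse("nlp_dataset.ttl", format="turtle")

TEST_CASE = """
PREFIX lemon: <http://lemon-model.net/lemon#>
SELECT ?entry WHERE {
    ?entry a lemon:LexicalEntry .
    FILTER NOT EXISTS { ?entry lemon:canonicalForm ?form }
}
"""

violations = list(g.query(TEST_CASE))
print(f"test {'FAILED' if violations else 'passed'}: "
      f"{len(violations)} entries without a canonical form")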
Distributed Query Processing for Federated RDF Data Management (OlafGoerlitz)
PhD defense talk about SPLENDID, a state-of-the-art implementation for efficient distributed SPARQL query processing on Linked Data using SPARQL endpoints and voiD descriptions.
This project aimed to create a series of models for the extraction of Named Entities (People, Locations, Organizations, Dates) from news headlines obtained online. We created two models: a traditional Natural Language Processing model using Maximum Entropy, and a Deep Neural Network model using pre-trained word embeddings. Accuracy results of both models show similar performance, but their requirements and limitations differ and can help determine which type of model is best suited for each specific use case.
What is the fuss about triple stores? Will triple stores eventually replace relational databases? This talk looks at the big picture, explains the technology and tries to look at the road ahead.
Doctoral Examination at the Karlsruhe Institute of Technology (08.07.2016) (Dr.-Ing. Thomas Hartmann)
In this thesis, a validation framework is introduced that enables the consistent execution of RDF-based constraint languages on RDF data and the formulation of constraints of any type. The framework reduces the representation of constraints to the absolute minimum, is based on formal logics, consists of a small, lightweight vocabulary, ensures consistent validation results, and enables constraint transformations for each constraint type across RDF-based constraint languages.
These slides were presented as part of a W3C tutorial at the CSHALS 2010 conference (http://www.iscb.org/cshals2010). The slides are adapted from a longer introduction to the Semantic Web available at http://www.slideshare.net/LeeFeigenbaum/semantic-web-landscape-2009 .
A PDF version of the slides is available at http://thefigtrees.net/lee/sw/cshals/cshals-w3c-semantic-web-tutorial.pdf .
Comparison of features between ShEx (Shape Expressions) and SHACL (Shapes Constraint Language)
Changelog:
11/06/17
- Removed slides about compositionality
31/May/2017
- Added slide 30 about validation report
- Added slide 32 about stems
- Changed slides 7 and 8 adapting compact syntax to new operator .
23/05/2017:
Slide 14: Repaired typos in sh:entailment, rdfs:range
21/05/2017:
- Slide 8. Changed the example to be an IRI and a datatype
- Added "typically" in slide 9
- Slide 10: Removed the phrase "Target declarations can problematic when reusing/importing shapes" and created slide 27 to talk about reusability
- Added slide 11 to talk about the differences in triggering validation
- Created slide 14 to talk about inference
- Renamed slide 15 as "Inference and triggering mechanism"
- Added slides 27 and 28 to talk about reusability
- Added slide 29 to talk about annotations
18/05/2017
- Slide 9 now includes an example using the ShEx RDF vocabulary
- Slide 10 now says that target declarations are optional
- Slide 13 now says that some RDF Schema terms have special treatment in SHACL
- Example in slide 18 now uses sh:or instead of sh:and
- Added slides 22, 23 and 24 which show some features supported by SHACL but not supported by ShEx (property pair constraints, uniqueLang and owl:imports)
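To illustrate one of the SHACL-only features mentioned in the last entry, here is a minimal sh:uniqueLang validation sketch using the pySHACL Python library; the shape and data graphs are invented for the example.

from rdflib import Graph
from pyshacl import validate

# Shape: values of ex:label must not repeat a language tag.
shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:LabelShape a sh:NodeShape ;
    sh:targetClass ex:Concept ;
    sh:property [ sh:path ex:label ; sh:uniqueLang true ] .
""", format="turtle")

# Data: two English labels on the same node, which violates the shape.
data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:c1 a ex:Concept ; ex:label "colour"@en, "color"@en .
""", format="turtle")

conforms, _, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False
print(report_text)   # human-readable validation report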
TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from th... (JohannWanja)
Deciding which RDF vocabulary terms to use when modeling data as Linked Open Data (LOD) is far from trivial. We propose TermPicker as a novel approach enabling vocabulary reuse by recommending vocabulary terms based on various features of a term. These features include the term’s popularity, whether it is from an already used vocabulary, and the so-called schema-level pattern (SLP) feature, which exploits which terms other data providers on the LOD cloud use to describe their data. We apply Learning To Rank to establish a ranking model for vocabulary terms based on the utilized features. The results show that using the SLP feature improves the recommendation quality by 29% to 36%, considering the Mean Average Precision and the Mean Reciprocal Rank at the first five positions, compared to recommendations based solely on the term’s popularity and whether it is from an already used vocabulary.
RDF Constraint Checking using RDF Data Descriptions (RDD) (Alexander Schätzle)
Linked Open Data (LOD) sources on the Web are increasingly becoming a mainstream method to publish and consume data. For real-life applications, mechanisms to describe the structure of the data and to provide guarantees are needed, as recently emphasized by the W3C in its Data Shapes Working Group. Using such mechanisms, data providers will be able to validate their data, assuring that it is structured in a way expected by data consumers. In turn, data consumers can design and optimize their applications to match the data format to be processed.
In this paper, we present several crucial aspects of RDD, our language for expressing RDF constraints. We introduce the formal semantics and describe how RDD constraints can be translated into SPARQL for constraint checking. Based on our fully working validator, we evaluate the feasibility and efficiency of this checking process using two popular, state-of-the-art RDF triple stores. The results indicate that even a naive implementation of RDD based on SPARQL 1.0 will incur only a moderate overhead on the RDF loading process, yet some constraint types contribute an outsize share and scale poorly. Incorporating several preliminary optimizations, some of them based on SPARQL 1.1, we provide insights on how to overcome these limitations.
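To illustrate the translation idea only (the actual RDD syntax and translation rules are defined in the paper), here is a hedged Python sketch that turns a simple minimum-cardinality constraint into a SPARQL 1.1 violation query with rdflib.

from rdflib import Graph

def min_cardinality_violations(graph, cls, prop, n):
    # Hypothetical constraint "every instance of cls has at least n values
    # for prop", compiled into a SPARQL 1.1 aggregate query that returns
    # the violating subjects.
    q = f"""
    SELECT ?s (COUNT(?v) AS ?cnt) WHERE {{
        ?s a <{cls}> .
        OPTIONAL {{ ?s <{prop}> ?v }}
    }}
    GROUP BY ?s
    HAVING (COUNT(?v) < {n})
    """
    return [row.s for row in graph.query(q)]

g = Graph().parse("data.ttl", format="turtle")
for s in min_cardinality_violations(
        g, "http://example.org/Person", "http://example.org/name", 1):
    print("violation:", s)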
This talk was given by FORTH, Greece, at the European Data Forum (EDF) 2012, which took place on June 6-7, 2012 in Copenhagen (Denmark) at the Copenhagen Business School (CBS).
Abstract:
Given the increasing amount of sensitive RDF data available on the Web, it becomes increasingly critical to guarantee secure access to this content. Access control is complicated when RDFS inference rules and other dependencies between access permissions of triples need to be considered; this is necessary, e.g., when we want to associate the access permissions of inferred triples with those of the triples that implied them. In this paper we advocate the use of abstract provenance models, defined by means of abstract tokens and operators, to support fine-grained access control for RDF graphs. The access label of a triple is a complex expression that encodes how said label was produced (i.e., the triples that contributed to its computation). This feature allows us to know exactly the effects of any possible change, thereby avoiding a complete recomputation of the labels when a change occurs. In addition, the same application can choose to enforce different access control policies, or different applications can enforce different policies on the same data, avoiding the recomputation of the label of a triple. Preliminary experiments have shown the applicability and benefits of our approach.
The Semantic Web is built on the Resource Description Framework (RDF). RDF is a graph model, so it would be expected that a wide range of network-analytical tools could be directly applied to an RDF data set. However, most network algorithms assume that a graph does not have parallel edges, which the RDF graph model allows. Two approaches will be examined: direct measures of RDF graph structure using ratios, and extraction of graphs from an RDF data set. Py-Triple-Simple (http://code.google.com/p/py-triple-simple/), an experimental pure Python library, can extract “well behaved” graphs from an N-triples file and can quantify RDF graph structure using ratios.
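A rough sketch of the ratio idea in plain Python, independent of the Py-Triple-Simple implementation; the whitespace split is deliberately naive and would mis-handle literals containing spaces.

# Count distinct subjects, predicates and objects in an N-Triples file
# and report two simple structural ratios.
subjects, predicates, objects = set(), set(), set()
triples = 0
with open("data.nt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        s, p, rest = line.split(None, 2)
        o = rest.rstrip(" .")
        subjects.add(s); predicates.add(p); objects.add(o)
        triples += 1

print("triples per distinct predicate:", triples / len(predicates))
print("share of subjects that also appear as objects:",
      len(subjects & objects) / len(subjects))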
How we use functional programming to find the bad guys @ Build Stuff LT and U... (Richard Minerich)
Traditional approaches in anti-money laundering involve simple matching algorithms and a lot of human review. However, in recent years this approach has proven not to scale well with the increasingly strict regulatory environment. We at Bayard Rock have had much success in applying fancier approaches, including some machine learning, to this problem. In this talk I walk you through the general problem domain and talk about some of the algorithms we use. I’ll also dip into why and how we leverage typed functional programming for rapid iteration with a small team in order to out-innovate our competitors.
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders (Isabelle Augenstein)
This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as "favor", "against", or "none". In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic.
To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.
Paper: http://isabelleaugenstein.github.io/papers/SemEval2016-Stance.pdf
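The following toy-scale sketch shows the shape of such a pipeline: a bag-of-words autoencoder trained on unlabelled tweets whose bottleneck activations feed a logistic regression classifier. The architecture, dimensions and data are illustrative and not the submitted system's actual configuration.

import torch
import torch.nn as nn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

unlabelled = ["tweet mentioning some target ...", "another unlabelled tweet ..."]
labelled = ["tweet with annotated stance ...", "another labelled tweet ..."]
labels = [0, 1]  # e.g. 0 = against, 1 = favor

vec = CountVectorizer(binary=True)
X = torch.tensor(vec.fit_transform(unlabelled).toarray(), dtype=torch.float32)

dim = X.shape[1]
model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(50):  # train the autoencoder to reconstruct its input
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()

encoder = model[:2]  # keep Linear + ReLU as the feature extractor
with torch.no_grad():
    feats = encoder(torch.tensor(vec.transform(labelled).toarray(),
                                 dtype=torch.float32)).numpy()

# Train the stance classifier on the learned representations.
clf = LogisticRegression().fit(feats, labels)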
Presentation of work that will be published at EMNLP 2016.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. emoji2vec: Learning Emoji Representations from their Description. SocialNLP at EMNLP 2016. https://arxiv.org/abs/1609.08359
Georgios Spithourakis, Isabelle Augenstein, Sebastian Riedel. Numerically Grounded Language Models for Semantic Error Correction. EMNLP 2016. https://arxiv.org/abs/1608.04147
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, Kalina Bontcheva. Stance Detection with Bidirectional Conditional Encoding. EMNLP 2016. https://arxiv.org/abs/1606.05464
Seed Selection for Distantly Supervised Web-Based Relation Extraction (Isabelle Augenstein)
Slides of my presentation on "Seed Selection for Distantly Supervised Web-Based Relation Extraction" at the Semantic Web and Information Extraction workshop (SWAIE) at COLING 2014
Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/SWAIE2014-Seed.pdf
Relation Extraction from the Web using Distant Supervision (Isabelle Augenstein)
Slides of my presentation on "Relation Extraction from the Web using Distant Supervision" at EKAW 2014. Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/EKAW2014-Relation.pdf
Invited talk in Heriot-Watt University Computer Science seminar series about our EMNLP paper on extracting non-standard relations from the Web with Distant Supervision and Imitation Learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
View the EMNLP poster here: http://www.slideshare.net/isabelleaugenstein/extracting-relations-between-nonstandard-entities-using-distant-supervision-and-imitation-learning
Introduction to natural language generation with artificial neural networks (ANNs) and a group poetry writing exercise where humans pretend to be neurons in an ANN.
Slides for my tutorial at the ESWC Summer School 2015, giving an introduction to information extraction with Linked Data and an introduction to one of the applications of information extraction, opinion mining.
Deep Learning Models for Question Answering (Sujit Pal)
Talk about a hobby project to apply Deep Learning models to predict answers to 8th grade science multiple choice questions for the Allen AI challenge on Kaggle.
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic... (Glen Cathey)
A deep dive into resume and LinkedIn sourcing and matching solutions that claim to use artificial intelligence, semantic search, and NLP: how they work; their pros, cons, and limitations; and examples of what sourcers and recruiters can do that even the most advanced automated search-and-match algorithms can't. Topics covered include human capital data information retrieval and analysis (HCDIR & A), Boolean and extended Boolean, semantic search, dynamic inference, dark matter resumes and social network profiles, and what I believe to be the ideal resume search and matching solution.
Facilitating Data Curation: a Solution Developed in the Toxicology Domain (Christophe Debruyne)
Christophe Debruyne, Jonathan Riggio, Emma Gustafson, Declan O'Sullivan, Mathieu Vinken, Tamara Vanhaecke, Olga De Troyer.
Presented at the 2020 IEEE 14th International Conference on Semantic Computing, San Diego, California, 3-5 February 2020
Toxicology aims to understand the adverse effects of chemical compounds or physical agents on living organisms. For chemicals, much information regarding safety testing of cosmetic ingredients is now scattered in a plethora of safety evaluation reports. Toxicologists in our university intend to collect this information into a single repository. Their current approach uses spreadsheets, does not scale well, and makes data curation and querying cumbersome. Semantic technologies (e.g., RDF, OWL, and Linked Data principles) would be more appropriate for this purpose. However, this technology is not very accessible to toxicologists without extensive training. In this paper, we report on a tool that supports subject matter experts in the construction of an RDF-based knowledge base for the toxicology domain. The tool uses the jigsaw metaphor for guiding the subject matter experts. We demonstrate that the jigsaw metaphor is a viable option for generating RDF. Future work includes investigating appropriate methods and tools for knowledge evolution and data analysis.
LOD2 plenary meeting in Paris: presentation of WP6: State of Play: LOD2 Stack Architecture, by Bert Van Nuffelen, Kurt De Muelenaere, Bastiaan Deblieck - TenForce.
Machine learning on graphs is an important and ubiquitous task with applications ranging from drug design to friendship recommendation in social networks. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. Traditionally, however, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about a graph. In this talk I will discuss methods that automatically learn to encode graph structure into low-dimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction. I will provide a conceptual review of key advancements in this area of representation learning on graphs, including random-walk based algorithms and graph convolutional networks.
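As a concrete reference for the random-walk family mentioned above, here is a DeepWalk-style sketch using networkx and gensim; the graph and all hyperparameters are illustrative.

import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

def random_walk(g, start, length=10):
    # Truncated uniform random walk; each walk becomes a "sentence".
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(n) for n in walk]

walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]

# Skip-gram over the walks: nodes that co-occur on walks get nearby embeddings.
emb = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1)
print(emb.wv.most_similar("0", topn=5))  # nodes structurally close to node 0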
Information-Rich Programming in F# with Semantic Data (Steffen Staab)
Programming with rich data frequently implies that one needs to search for, understand, integrate and program with new data, with each of these steps constituting a major obstacle to successful data use.
In this talk we will explain and demonstrate how our approach, LITEQ - Language Integrated Types, Extensions and Queries for RDF Graphs, which is realized as part of the F# / Visual Studio environment, supports the software developer. Using the extended IDE the developer may now
a. explore new, previously unseen data sources, which are either natively in RDF or mapped into RDF;
b. use the exploration of schemata and data in order to construct types and objects in the F# environment;
c. automatically map between data and programming language objects in order to make them persistent in the data source;
d. have extended typing functionality added to the F# environment, resulting from the exploration of the data source and its mapping into F#.
Core to this approach is the novel node path query language, NPQL, that allows for interactive, intuitive exploration of data schemata and data proper, as well as for the mapping and definition of types, object collections and individual objects. Beyond the existing type provider mechanism for F#, our approach also allows for property-based navigation and runtime querying for data objects.
A tutorial on how to create mappings using ontop, how inference (OWL 2 QL and RDFS) plays a role in answering SPARQL queries in ontop, and how ontop's support for on-the-fly SQL query translation enables scenarios of semantic data access and data integration.
Talk at the 3rd Keystone Training School - Keyword Search in Big Linked Data - Institute for Software Technology and Interactive Systems, TU Wien, Austria, 2017
Often information is spread among several data sources, such as hospital databases, lab databases, spreadsheets, etc. Moreover, the complexity of each of these data sources might make it difficult for end-users to access them, and even more so to query all of them at the same time.
A solution that has been proposed to this problem is ontology-based data access (OBDA). OBDA is a popular paradigm, developed since the mid-2000s, to query various types of data sources using a common vocabulary familiar to the end-users. In a nutshell, OBDA separates the user from the data sources (relational databases, CSV files, etc.) by means of an ontology, which is a common terminology that provides the user with a convenient query vocabulary, hides the structure of the data sources, and can enrich incomplete data with background knowledge. About a dozen OBDA systems have been implemented in both academia and industry.
In this tutorial we will give an overview of OBDA and our system -ontop-, which is currently being used in the context of the European project Optique. We will discuss how to use -ontop- for data integration, in particular concentrating on:
– How to create an ontology (common vocabulary) for a life science domain.
– How to map available data sources to this ontology.
– How to query the database using the terms in the ontology (a minimal sketch follows this list).
– How to check consistency of the data sources w.r.t. the ontology
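A minimal sketch of the querying step, assuming a hypothetical -ontop- SPARQL endpoint and an invented life-science vocabulary; in a real setup the ontology terms come from step 1 and the answers are produced on the fly from the mappings of step 2.

from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL, prefix and class/property names are placeholders.
sparql = SPARQLWrapper("http://localhost:8080/sparql")
sparql.setQuery("""
PREFIX : <http://example.org/lifescience#>
SELECT ?patient ?result WHERE {
    ?patient a :Patient ;
             :hasLabResult ?result .
} LIMIT 10
""")
sparql.setReturnFormat(JSON)

# ontop rewrites this ontology-level query into SQL over the sources.
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["patient"]["value"], b["result"]["value"])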
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes (Ontotext)
This presentation will provide a brief introduction to logical reasoning and overview of the most popular semantic schema and ontology languages: RDFS and the profiles of OWL 2.
While automatic reasoning has always inspired the imagination, numerous projects have failed to deliver on their promises. The typical pitfalls related to ontologies and symbolic reasoning fall into three categories:
- Over-engineered ontologies. The selected ontology language and modeling patterns can be too expressive. This can make the results of inference hard to understand and verify, which in turn makes the KG hard to evolve and maintain. It can also impose performance penalties far greater than the benefits.
- Inappropriate reasoning support. There are many inference algorithms and implementation approaches that work well with taxonomies and conceptual models of a few thousand concepts, but cannot cope with KGs of millions of entities.
- Inappropriate data layer architecture. One such example is reasoning with virtual KGs, which is often infeasible.
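As a small, concrete reference point for the RDFS part of the overview, here is a sketch of materialising an RDFS entailment with the rdflib and owlrl Python libraries (my choice of tooling for illustration, not Ontotext's):

from rdflib import Graph, Namespace, RDF, RDFS
from owlrl import DeductiveClosure, RDFS_Semantics

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Person, RDFS.subClassOf, EX.Agent))
g.add((EX.alice, RDF.type, EX.Person))

# Materialise the RDFS closure; forward chaining adds the inferred triples.
DeductiveClosure(RDFS_Semantics).expand(g)
print((EX.alice, RDF.type, EX.Agent) in g)  # True after expansion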
Beyond Fact Checking — Modelling Information Change in Scientific Communication (Isabelle Augenstein)
Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges, which should be addressed: 1) ensuring that scientific publications are credible — e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. In this talk, I will present some first steps towards addressing these problems, discussing our research on exaggeration detection, scientific fact checking, and on modelling information change in scientific communication more broadly.
Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges, which should be addressed: 1) ensuring that scientific publications are credible -- e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. In this talk, I will present some first steps towards addressing these problems, discussing our research on exaggeration detection of scientific claims and on scientific fact checking.
The past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. This development has spurred research in the area of automatic fact checking, a knowledge-intensive and complex reasoning task. Most existing fact checking models predict a claim’s veracity with black-box models, which often lack explanations of the reasons behind their predictions and contain hidden vulnerabilities. The lack of transparency in fact checking systems and ML models, in general, has been exacerbated by increased model size and by “the right…to obtain an explanation of the decision reached” enshrined in European law. This talk presents some first solutions to generating explanations for fact checking models. It then examines how to assess the generated explanations using diagnostic properties, and how further optimising for these diagnostic properties can improve the quality of the generating explanations. Finally, the talk examines how to systemically reveal vulnerabilities of black-box fact checking models.
Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges, which should be addressed: 1) ensuring that scientific publications are credible -- e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. I will present some first steps towards addressing these problems and outline remaining challenges.
Towards Explainable Fact Checking (DIKU Business Club presentation) (Isabelle Augenstein)
Outline:
- Fact checking – what is it and why do we need it?
- False information online
- Content-based automatic fact checking
- Explainability – what is it and why do we need it?
- Making the right predictions for the right reasons
- Model training pipeline
- Explainable fact checking – some first solutions
- Rationale selection
- Generating free-text explanations
- Wrap-up
Tutorial on 'Explainability for NLP' given at the first ALPS (Advanced Language Processing) winter school: http://lig-alps.imag.fr/index.php/schedule/
The talk introduces the concepts of 'model understanding' as well as 'decision understanding' and provides examples of approaches from the areas of fact checking and text classification.
Exercises to go with the tutorial are available here: https://github.com/copenlu/ALPS_2021
Automatic fact checking is one of the more involved NLP tasks currently researched: not only does it require sentence understanding, but also an understanding of how claims relate to evidence documents and world knowledge. Moreover, there is still no common understanding in the automatic fact checking community of how the subtasks of fact checking — claim check-worthiness detection, evidence retrieval, veracity prediction — should be framed. This is partly owing to the complexity of the task, despite efforts to formalise the task of fact checking through the development of benchmark datasets.
The first part of the talk will be on automatically generating textual explanations for fact checking, thereby exposing some of the reasoning processes these models follow. The second part of the talk will be on re-examining how claim check-worthiness is defined, and how check-worthy claims can be detected; followed by how to automatically generate claims which are hard to fact-check automatically.
Talk on 'Tracking False Information Online' at the W-NUT workshop at EMNLP 2019.
=========
Digital media enables fast sharing of information and discussions among users. While this comes with many benefits to today’s society, such as broadening information access, the manner in which information is disseminated also has obvious downsides. Since fast access to information is expected by many users and news outlets are often under financial pressure, speedy access often comes at the expense of accuracy, which leads to misinformation. Moreover, digital media can be misused by campaigns to intentionally spread false information, i.e. disinformation, about events, individuals or governments. In this talk, I will present on different ways false information is spread online, including misinformation and disinformation. I will then report findings from our recent and ongoing work on automatic fact checking, stance detection and framing attitudes.
What can typological knowledge bases and language representations tell us abo... (Isabelle Augenstein)
One of the core challenges in typology is to record properties of languages in a structured way. As a result of manual efforts, typological knowledge bases have emerged, which contain information about languages’ phonological, morphological and syntactic properties, as well as information about language families. Ideally, such typological knowledge bases would provide useful information for multilingual NLP models to learn how to selectively share parameters.
A related area of research suggests a different way of encoding properties of languages, namely to learn language representation vectors directly from text documents.
In this talk, I will analyse and contrast these two ways of encoding linguistic properties, as well as present research on how the two can benefit one another.
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ... (Isabelle Augenstein)
Paper presented at NAACL 2018. Link: https://arxiv.org/abs/1802.09913
Abstract:
============
We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.
Learning with limited labelled data in NLP: multi-task learning and beyond (Isabelle Augenstein)
When labelled training data for certain NLP tasks or languages is not readily available, different approaches exist to leverage other resources for the training of machine learning models. Those are commonly either instances from a related task or unlabelled data.
An approach that has been found to work particularly well when only limited training data is available is multi-task learning.
There, a model learns from examples of multiple related tasks at the same time by sharing hidden layers between tasks, and can therefore benefit from a larger overall number of training instances and improve its generalisation performance. In the related paradigm of semi-supervised learning, unlabelled data as well as labelled data for related tasks can be utilised by transferring labels from labelled instances to unlabelled ones, in order to essentially extend the training dataset.
In this talk, I will present my recent and ongoing work in the space of learning with limited labelled data in NLP, including our NAACL 2018 papers 'Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces' [1] and 'From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings' [2].
[1] https://t.co/A5jHhFWrdw
[2] https://arxiv.org/abs/1802.09375
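For readers unfamiliar with the hidden-layer sharing described above, this schematic PyTorch sketch shows hard parameter sharing: one shared encoder and one output head per task. Dimensions and task count are illustrative, not taken from the papers.

import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, in_dim=300, hidden=128, n_labels=(3, 2)):
        super().__init__()
        # Shared layers receive gradients from every task.
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # One task-specific classification head per label space.
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in n_labels)

    def forward(self, x, task_id):
        return self.heads[task_id](self.shared(x))

model = MultiTaskModel()
x = torch.randn(4, 300)       # a batch of sentence representations
logits = model(x, task_id=0)  # training on task 0 also updates the shared layers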
==========
Bio from my website http://isabelleaugenstein.github.io/index.html:
I have been a tenure-track assistant professor at the University of Copenhagen, Department of Computer Science, since July 2017. I am affiliated with the CoAStAL NLP group and work in the general areas of Statistical Natural Language Processing and Machine Learning. My main research interests are weakly supervised and low-resource learning, with applications including information extraction, machine reading and fact checking.
Before starting a faculty position, I was a postdoctoral research associate in Sebastian Riedel's UCL Machine Reading group, mainly investigating machine reading from scientific articles. Prior to that, I was a Research Associate in the Sheffield NLP group, a PhD Student in the University of Sheffield Computer Science department, a Research Assistant at AIFB, Karlsruhe Institute of Technology and a Computational Linguistics undergraduate student at the Department of Computational Linguistics, Heidelberg University.
The spreading of mis- and disinformation is growing and is having a big impact on interpersonal communications, politics and even science.
Traditional methods, e.g. manual fact-checking by reporters, cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc... (Isabelle Augenstein)
Shared task summary for SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications
Paper: https://arxiv.org/abs/1704.02853
Abstract:
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.
Extracting Relations between Non-Standard Entities using Distant Supervision ... (Isabelle Augenstein)
Poster for our EMNLP paper on extracting non-standard relations from the Web with distant supervision and imitation learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at a smaller scale and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Building RAG with self-deployed Milvus vector database and Snowpark Container... (Zilliz)
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps is. We closed with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
LODifier: Generating Linked Data from Unstructured Text
1. LODifier: Generating Linked Data from Unstructured Text
Isabelle Augenstein¹, Sebastian Padó¹, Sebastian Rudolph²
¹Department of Computational Linguistics, Universität Heidelberg, Germany
²Institute AIFB, Karlsruhe Institute of Technology, Germany
{augenste,pado}@cl.uni-heidelberg.de, rudolph@kit.edu
Extended Semantic Web Conference, 2012
2. Outline
1 Motivation
2 The LODifier Architecture and Workflow
3 Evaluation in a Document Similarity Task
4 Conclusion
3. Introduction
Introduction
What?
Represent information from NL text as Linked Data
Creation of a graph-based representation for unstructured plain text
Why?
The task is easy for structured text, but quite challenging for unstructured text
Most other approaches are very selective, e.g., they extract only pre-defined relations
Our approach translates the text in its entirety
Possible use cases: domain-independent scenarios in which no pre-defined schema is available
4. Introduction
Introduction
How?
Use a noise-tolerant NLP pipeline to analyze the text
Translate the result into an RDF representation
Embed the output in the LOD cloud by using vocabulary from DBpedia and WordNet
5. System Overview Pipeline
[Pipeline figure: unstructured text is tokenized; Named Entity Recognition (Wikifier) produces NE-marked text and DBpedia URIs; Parsing (C&C) produces parsed text, on which Deep Semantic Analysis (Boxer) builds a DRS; Lemmatization followed by Word Sense Disambiguation (UKB) produces lemmatized text and word sense URIs; RDF Generation and Integration combines all outputs into LOD-linked RDF.]
6. System NLP analysis steps
NLP analysis steps (NER)
Named Entity Recognition (Wikifier 1):
Named entities are recognized using Wikifier
Wikifier finds Wikipedia links for named entities
Wikipedia URLs are converted to DBpedia URIs
The New York Times reported that John McCarthy died. He invented the programming language LISP.
Figure: Test sentences
[[The New York Times]] reported that [[John McCarthy (computer scientist)|John McCarthy]] died. He invented the [[Programming language|programming language]] [[Lisp (programming language)|Lisp]].
Figure: Wikifier output for the test sentences.
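The conversion from Wikifier's wiki markup to DBpedia URIs can be sketched as follows; this is a plausible reconstruction of the step (DBpedia resource IRIs mirror Wikipedia page titles), not the tool's actual code.

from urllib.parse import quote

def wiki_link_to_dbpedia(link: str) -> str:
    # "[[John McCarthy (computer scientist)|John McCarthy]]": the part
    # before "|" is the Wikipedia page title; spaces become underscores.
    title = link.strip("[]").split("|")[0].replace(" ", "_")
    return "http://dbpedia.org/resource/" + quote(title, safe="_()")

print(wiki_link_to_dbpedia("[[John McCarthy (computer scientist)|John McCarthy]]"))
# http://dbpedia.org/resource/John_McCarthy_(computer_scientist)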
7. System NLP analysis steps
NLP analysis steps (WSD)
Word Sense Disambiguation (UKB 2):
Words are disambiguated using UKB, an unsupervised graph-based WSD tool
UKB finds WordNet links for word senses
UKB output is converted to RDF WordNet URIs
ctx_0 w1 00965687-v !! report
ctx_0 w4 00358431-v !! die
ctx_1 w7 01634424-v !! invent
ctx_1 w9 06898352-n !! programming_language
ctx_1 w10 06901936-n !! lisp
Figure: UKB output for the test sentences.
2 http://ixa2.si.ehu.es/ukb/
8. System NLP analysis steps
NLP analysis steps (Parsing and Semantic Analysis)
Parsing (C&C 3):
Sentences are parsed with the C&C parser
The C&C parser is a statistical parser that uses Combinatory Categorial Grammar (CCG)
CCG is a semantically motivated grammar, making it suitable for further semantic analysis
Deep Semantic Analysis (Boxer 4):
Boxer builds on the output of C&C and produces discourse representation structures (DRSs)
DRSs model the meaning of texts in terms of the relevant entities (discourse referents) and relations between them (conditions)
3 http://svn.ask.it.usyd.edu.au/trac/candc/
4 http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer
19. Evaluation Evaluation overview
Evaluation overview
Task:
Assess document similarity
Story link detection task, part of the topic detection and tracking (TDT) family of tasks
Data:
TDT-2 benchmark dataset 5
Consists of newspaper, TV and radio news in English, Arabic and Mandarin
Subset: English language, only newspaper articles
183 positive and negative document pairs
5 http://projects.ldc.upenn.edu/TDT2/
20. Evaluation Evaluation overview
Evaluation overview
Method:
Define various document similarity measures
Measures without structural information
Measures with structural information
Documents are predicted to have the same topic if they have a similarity of θ or more
θ is determined with supervised learning
21. Evaluation Evaluation method
Evaluation method
Similarity measures without structure:
Random baseline
Bag-of-words baseline: word overlap between the two documents in a document pair
Bag-of-URIs baseline: URI overlap between two RDF documents
22. Evaluation Evaluation method
Evaluation method
URI class variants:
Three variants of the URI baseline take the relative importance of various URI classes into account:
Variant 1: All NEs identified by Wikifier, all words successfully disambiguated by UKB (namespaces dbpedia:, wn30:)
Variant 2: Adds all NEs recognized by Boxer (namespace ne:)
Variant 3: Further adds the URIs for all words that were not recognized by Wikifier, UKB or Boxer (namespace class:)
23. Evaluation Evaluation method
Evaluation method
Extended setting:
Aims at drawing more information from the LOD cloud into the similarity computation
The generated RDF graph is enriched by:
DBpedia categories (namespace dbpedia:)
WordNet synsets and WordNet senses related to the URIs in the generated graph (namespace wn30:)
24. Evaluation Evaluation method
Evaluation method
Similarity measures with structure:
Full-fledged graph similarity measures, e.g. based on isomorphic subgraphs, are infeasible due to the size of the RDF graphs
Our similarity measures are based on the shortest paths between relevant nodes
Our intuition is that a short path between a URI pair indicates relevant semantic information, and we expect such pairs to be related by a short path in documents on the same topic as well
25. Evaluation Evaluation method
Evaluation method
Similarity measures with structure:
proSim_{k,Rel,f}(G_1, G_2) = \frac{\sum_{(a,b) \in Rel(G_1),\ a,b \in C_k(G_1) \cap C_k(G_2)} f(\ell(a,b))}{\sum_{(a,b) \in Rel(G_1),\ a,b \in C_k(G_1)} f(\ell(a,b))}
k: path length, Rel: semantic relations, f: similarity function, \ell(a,b): shortest-path length between a and b
proSimcnt: uses f(\ell) = 1, i.e., just counts the number of paths irrespective of their length
proSimlen: uses f(\ell) = 1/\ell, giving less weight to longer paths
proSimsqlen: uses f(\ell) = 1/\sqrt{\ell}, discounting long paths less aggressively than proSimlen
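One plausible reading of the formula as runnable Python, using networkx; treating C_k(G) as the pairs connected in G by a path of length at most k is my assumption, and path lengths \ell are taken from G1.

import networkx as nx

def prosim(g1, g2, rel, k=6, f=lambda l: 1.0):
    # rel: the relevant URI pairs drawn from G1.
    def plen(g, a, b):
        try:
            l = nx.shortest_path_length(g, a, b)
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            return None
        return l if 0 < l <= k else None

    num = den = 0.0
    for a, b in rel:
        l1 = plen(g1, a, b)
        if l1 is None:                    # pair not in C_k(G1)
            continue
        den += f(l1)
        if plen(g2, a, b) is not None:    # pair also in C_k(G2)
            num += f(l1)
    return num / den if den else 0.0

prosim_cnt = lambda *a: prosim(*a, f=lambda l: 1.0)          # count paths
prosim_len = lambda *a: prosim(*a, f=lambda l: 1.0 / l)      # discount long paths
prosim_sqlen = lambda *a: prosim(*a, f=lambda l: l ** -0.5)  # gentler discount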
26. Evaluation Results
Results
Model                         normal  extended
-- Similarity measures without structure --
Random Baseline               50.0    –
Bag of Words                  63.0    –
Bag of URIs (Variant 3)       73.4    76.4
-- Similarity measures with structure --
proSimcnt (k=6, Variant 1)    77.7    77.6
proSimcnt (k=6, Variant 2)    79.2    79.0
proSimcnt (k=8, Variant 3)    82.1    81.9
proSimlen (k=6, Variant 3)    81.5    81.4
proSimsqlen (k=8, Variant 3)  81.1    80.9
27. Summary
Summary
Main contributions:
A service for the LOD community that can be used in different scenarios that profit from a structural representation of text
Combination of well-established concepts and systems in a deep semantic pipeline
Linking existing NLP technologies to the LOD cloud
Evaluation of LODifier in a topic link detection task gives clear evidence of its potential
Possible future work directions:
Instead of a pipeline approach, simultaneously consider information from the NLP components to allow interaction between them
Test LODifier in different scenarios