The document summarizes research from the AKSW group on question answering over interlinked data. It presents their approach called SINA, which takes a natural language question as input, segments it, disambiguates terms to resources, and outputs a SPARQL query to retrieve the answer. SINA uses a hidden Markov model for segmentation and disambiguation, constructs a query graph to link relevant resources, and was evaluated on life science and DBpedia datasets with over 80% accuracy on average. The goal is to enable question answering directly over interconnected data without requiring SPARQL proficiency.
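SINA's use of a hidden Markov model for segmentation and disambiguation can be illustrated with a minimal Viterbi decoder. This is a toy sketch: the states, tokens, and probabilities below are invented for illustration, while SINA's actual model maps question segments to dataset resources learned from the interlinked data.

```python
# Toy Viterbi decoding over a hand-built HMM: given an observation
# sequence (question tokens), recover the most probable sequence of
# hidden states (here, whether a token denotes a resource or a relation).

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that maximizes the path probability.
            prob, path = max(
                (best[p][0] * trans_p[p][s] * emit_p[s][obs], best[p][1])
                for p in states
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values())[1]

# Invented example: label question tokens as resource vs. relation mentions.
states = ("resource", "relation")
start_p = {"resource": 0.6, "relation": 0.4}
trans_p = {"resource": {"resource": 0.3, "relation": 0.7},
           "relation": {"resource": 0.7, "relation": 0.3}}
emit_p = {"resource": {"drug": 0.7, "interacts": 0.1},
          "relation": {"drug": 0.2, "interacts": 0.8}}

print(viterbi(["drug", "interacts"], states, start_p, trans_p, emit_p))
# → ['resource', 'relation']
```

The decoded state path plays the role of SINA's joint segmentation/disambiguation step; the resources on the path would then be linked into a query graph and serialized as SPARQL.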
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technology stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data has been published in freely accessible datasets connected with each other to form the so-called LOD cloud. Today we have an enormous amount of RDF data available in the Web of Data, but only a few applications really exploit its potential. The availability of such data is certainly an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data into a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
Actively Learning to Rank Semantic Associations for Personalized Contextual E... (Federico Bianchi)
The Semantic Web. ESWC 2017. Lecture Notes in Computer Science, vol 10249. Springer, Cham
Knowledge Graphs (KGs) represent a large number of Semantic Associations (SAs), i.e., chains of relations that may reveal interesting and unknown connections between different types of entities. Applications for the contextual exploration of KGs help users explore information extracted from a KG, including SAs, while they are reading an input text. Because of the large number of SAs that can be extracted from a text, a first challenge in these applications is to effectively determine which SAs are most interesting to the users, defining a suitable ranking function over SAs. However, since different users may have different interests, an additional challenge is to personalize this ranking function to match individual users' preferences. In this paper we introduce a novel active-learning-to-rank model that lets a user rate small samples of SAs, which are used to iteratively learn a personalized ranking function. Experiments conducted with two data sets show that the approach is able to improve the quality of the ranking function with a limited number of user interactions.
- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering, how does it differ from density-based (e.g., DBSCAN) clustering, and how can it be used for outlier detection?
- What is so-called soft clustering, how does it differ from hard clustering, and how can it be used for outlier detection?
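The connection between density-based clustering and outlier detection can be made concrete with a compact DBSCAN sketch: points that cannot be density-reached from any core point come out labeled as noise. This is a toy pure-Python illustration; the parameter values and data are chosen for the example, not taken from the talk.

```python
# A compact DBSCAN sketch illustrating how density-based clustering
# labels low-density points as noise (-1), i.e., as outliers.

def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # All points within Euclidean distance eps of points[i] (incl. itself).
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not a core point: tentatively noise
            continue
        cluster += 1                # start a new cluster from this core point
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # expand only through core points
    return labels

# Two dense groups plus one far-away point that ends up as noise (-1).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))
# → [0, 0, 0, 1, 1, 1, -1]
```

OPTICS generalizes this idea by ordering points by reachability distance instead of fixing a single eps; soft clustering instead assigns each point a degree of membership in every cluster.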
Basic introduction to recommender systems + Implementing a content-based recommender system by leveraging knowledge encoded into Linked Open Data datasets
Our daily life is strongly influenced by decision-making processes based on large amounts of data; both the data values and the meaningful (semantic) relationships between them can be captured in knowledge graphs.
Since knowledge graphs are processed automatically, they must be of high quality on both fronts.
This thesis focuses both on improving the data quality of knowledge graphs and on assessing their semantic quality.
On the one hand, it describes a framework to generate knowledge graphs with extensible data transformations that can clean data ("RML + FnO"), later extended so that data transformations can be performed automatically and independently of any implementation ("FnO.io").
On the other hand, it describes a validation approach built on a rule-based reasoning solution ("Validatrr"). This approach takes the semantics of the data into account and enables targeted improvements to a knowledge graph through detailed root-cause explanations of quality problems.
Thanks to these contributions, data values are cleaned while a knowledge graph is generated, and existing knowledge graphs can be completed using automatic data transformations. Our validation approach makes it possible to accurately assess the quality of the semantic relationships in knowledge graphs.
Together, these contributions make it easier to improve data quality and assess semantic quality for knowledge graphs, ensuring that knowledge graphs can be used correctly in decision-making processes.
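To give a flavor of rule-based validation with root-cause explanations, here is a deliberately tiny sketch. The rule, property name, and triples are invented for illustration; Validatrr itself operates on RDF with a full reasoner rather than ad-hoc Python checks.

```python
# A minimal flavor of rule-based validation: each rule inspects the
# data and reports a quality problem together with an explanation of
# its root cause, so the graph can be improved in a targeted way.
import re

def check_date_literals(triples):
    """Flag objects of :birthDate that are not ISO 8601 dates, explaining why."""
    problems = []
    for s, p, o in triples:
        if p == ":birthDate" and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", o):
            problems.append((s, f"value {o!r} is not an ISO 8601 date"))
    return problems

data = [(":alice", ":birthDate", "1984-03-01"),
        (":bob", ":birthDate", "March 1984")]
print(check_date_literals(data))
# → [(':bob', "value 'March 1984' is not an ISO 8601 date")]
```

The explanation attached to each violation is the key point: it names the offending subject and value, which is what makes targeted repair possible.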
Natural Language Processing on Non-Textual Data (gpano)
Talk by Casey Stella, presented at the SF Data Mining Hadoop Summit Meetup, on June 8, 2015. Notebook available at https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/ipython/clinical2vec.ipynb
Dynamic Search Using Semantics & Statistics (Paul Hofmann)
This presentation shows 3 applications of successfully combining semantics and statistics for text mining and interactive search.
1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies).
2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse.
3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the reformulated query sequence and the retrieved documents, together with the positive or negative feedback provided by the user. A demo shows how the system recognizes a user's intent in health-care search.
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ... (Marko Rodriguez)
The large-scale analysis of scholarly artifact usage is constrained primarily by current practices in usage data archiving, privacy issues concerned with the dissemination of usage data, and the lack of a practical ontology for modeling the usage domain. As a remedy to the third constraint, this article presents a scholarly ontology that was engineered to represent those classes for which large-scale bibliographic and usage data exists, supports usage research, and whose instantiation is scalable to the order of 50 million articles along with their associated artifacts (e.g. authors and journals) and an accompanying 1 billion usage events. The real world instantiation of the presented abstract ontology is a semantic network model of the scholarly community which lends the scholarly process to statistical analysis and computational support. We present the ontology, discuss its instantiation, and provide some example inference rules for calculating various scholarly artifact metrics.
With the continuously increasing number of datasets that are published in the Web of Data and form part of the Linked Open Data Cloud, it becomes more and more essential to identify resources that correspond to the same real-world object, in order to interlink web resources and set the basis for large-scale data integration. This requirement becomes apparent in a multitude of domains ranging from science (marine research, biology, astronomy, pharmacology) to semantic publishing and cultural domains. In this context, instance matching is of crucial importance.
It is therefore essential to develop, along with instance and entity matching systems, benchmarks that determine the weak and strong points of those systems, as well as their overall quality, in order to support users in deciding which system to use for their needs. Hence, well-defined, good-quality benchmarks are important for comparing the performance of the developed instance matching systems.
In this tutorial we aim at:
- Discussing the state-of-the-art instance matching benchmarks
- Presenting the benchmark design principles
- Providing an analysis of the performance results of instance matching systems for the presented benchmarks
- Presenting the research directions that should be pursued to create novel benchmarks that answer the needs of the Linked Data paradigm.
Tutorial web page: http://www.ics.forth.gr/isl/BenchmarksTutorial/
Structural syntactic metrics for RDF Datasets that correlate with high level quality deficiencies.
The vision of the Linked Open Data (LOD) initiative is to provide a model for publishing data and meaningfully interlinking such dispersed but related data. Despite the importance of data quality for the successful growth of the LOD, only limited attention has been focused on quality of data prior to their publication on the LOD. This paper focuses on the systematic assessment of the quality of datasets prior to publication on the LOD cloud. To this end, we identify important quality deficiencies that need to be avoided and/or resolved prior to the publication of a dataset. We then propose a set of metrics to measure and identify these quality deficiencies in a dataset. This way, we enable the assessment and identification of undesirable quality characteristics of a dataset through our proposed metrics.
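One illustrative metric of the structural-syntactic kind the paper proposes is the fraction of literal values that carry no datatype or language tag, a deficiency worth resolving before publication. The triple representation and the metric definition below are simplified sketches for illustration, not the paper's exact formulations.

```python
# A toy structural metric over a dataset's triples: the share of
# literal objects that are plain (no datatype or language tag).
# Triples here are (subject, predicate, object, object_kind) tuples,
# where object_kind is 'iri', 'typed_literal', or 'plain_literal'.

def untyped_literal_ratio(triples):
    """Return the fraction of literal objects lacking a datatype/language tag."""
    literals = [t for t in triples if t[3] != "iri"]
    if not literals:
        return 0.0
    untyped = [t for t in literals if t[3] == "plain_literal"]
    return len(untyped) / len(literals)

data = [
    ("ex:s1", "ex:name", "Alice", "plain_literal"),
    ("ex:s1", "ex:age", '"42"^^xsd:integer', "typed_literal"),
    ("ex:s1", "ex:knows", "ex:s2", "iri"),
]
print(untyped_literal_ratio(data))  # → 0.5 (1 of 2 literals is untyped)
```

A publisher would compute a battery of such metrics over the dataset and resolve the flagged deficiencies before pushing the data to the LOD cloud.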
Slides for paper presentation at DEXA 2015:
Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri:
Quality Metrics for Linked Open Data. DEXA (1) 2015: 144-152
Directed versus undirected network analysis of student essays (Roy Clariana)
IWALS 2018
6th International Workshop on Advanced Learning Sciences
Perspectives on the Learner: Cognition, Brain, and Education
University of Pittsburgh, USA JUNE 6-8, 2018
Hoaxy is a tool to visualize the spread of URLs pointing to low-credibility web documents. We use features related to propagation dynamics to classify duplicates of low-credibility claims.
Crowdsourced query augmentation through the semantic discovery of domain spec... (Trey Grainger)
Talk Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or interact with directly. We believe that the links between similar users' queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.
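The core log-mining idea can be sketched very simply: treat queries issued within the same user session as related, and keep phrase pairs that co-occur across enough sessions. The session data and threshold below are invented for illustration; the actual system described in the talk is considerably more sophisticated.

```python
# A hedged sketch of mining semantic relationships from search logs:
# count how often pairs of query phrases co-occur in user sessions and
# keep pairs that recur, which filters out one-off noise.
from collections import Counter
from itertools import combinations

def related_phrases(sessions, min_count=2):
    """Return phrase pairs that co-occur in at least min_count sessions."""
    pair_counts = Counter()
    for queries in sessions:
        # Deduplicate and sort so each unordered pair is counted once.
        for a, b in combinations(sorted(set(queries)), 2):
            pair_counts[(a, b)] += 1
    return {pair for pair, count in pair_counts.items() if count >= min_count}

sessions = [
    ["java developer", "software engineer"],
    ["java developer", "software engineer", "jvm"],
    ["registered nurse", "rn"],
]
print(related_phrases(sessions))
# → {('java developer', 'software engineer')}
```

Because the signal comes from users' own reformulations rather than from the document text, the discovered relationships are language agnostic and directly interpretable.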
Using Graph and Transformer Embeddings for Vector Based Retrieval (Sujit Pal)
For the longest time, term-based vector representations based on whole-document statistics, such as TF-IDF, have been the staple of efficient and effective information retrieval. The popularity of Deep Learning over the past decade has resulted in the development of many interesting embedding schemes. Like term-based vector representations, these embeddings depend on structure implicit in language and user behavior. Unlike them, they leverage the distributional hypothesis, which states that the meaning of a word is determined by the context in which it appears. These embeddings have been found to better encode the semantics of the word, compared to term-based representations. Despite this, it has only recently become practical to use embeddings in Information Retrieval at scale.
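For contrast with the embedding schemes discussed next, here is a minimal sketch of the term-based baseline: TF-IDF weights derived from whole-document statistics, compared with cosine similarity. The toy documents are invented for illustration.

```python
# A minimal TF-IDF + cosine similarity sketch: weights come from
# whole-document statistics (term and document frequencies), not from
# the context windows that distributional embeddings exploit.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

docs = ["graph embeddings encode citations",
        "graph embeddings encode papers",
        "transformers are language models"]
vecs = tfidf_vectors(docs)
# Documents sharing terms score higher; no shared terms means zero score,
# which is exactly the limitation embeddings address.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```

The zero-overlap failure mode in the last line is the motivation for dense embeddings: node2vec and Transformer vectors can still place such documents close together when their contexts are similar.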
In this presentation, we will describe how we applied two new embedding schemes to Scopus, Elsevier's broad-coverage database of scientific, technical, and medical literature. Both schemes are based on the distributional hypothesis but come from very different backgrounds. The first embedding is a graph embedding called node2vec, which encodes papers using the citation relationships between them as specified by their authors. The second embedding leverages Transformers, a recent innovation in the area of Deep Learning, which are essentially language models trained on large bodies of text. These two embeddings exploit the signal implicit in these data sources and produce semantically rich user- and content-based vector representations respectively. We will evaluate these embedding schemes and describe how we used the Vespa search engine to search these embeddings for similar documents within the Scopus dataset. Finally, we will describe how RELX staff can access these embeddings for their own data science needs, independent of the search application.
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at... (BigMine)
Networks (i.e., graphs) appear in many high-impact applications. Often these networks are collected from different sources, at different times, and at different granularities. In this talk, I will present our recent work on mining such multiple networks. First, we will present two models: one for modeling a set of inter-connected networks (NoN), and the other for modeling a set of inter-connected co-evolving time series (NoT). For both models, we will show that by treating networks as context, we are able to model more complicated real-world applications. Second, we will present some algorithmic examples of how to do mining with such new models, including ranking, imputation, and prediction. Finally, we will demonstrate the effectiveness of our new models and algorithms in applications including bioinformatics and sensor networks.
Filtering Inaccurate Entity Co-references on the Linked Open Data (ebrahim_bagheri)
A method for identifying incorrect sameAs links on the Linked Open Data cloud
Details published in:
John Cuzzola, Ebrahim Bagheri, Jelena Jovanovic:
Filtering Inaccurate Entity Co-references on the Linked Open Data. DEXA (1) 2015: 128-143
Schema-Agnostic Queries (SAQ-2015): Semantic Web Challenge (Andre Freitas)
The Challenge in a Nutshell
To create a query mechanism that semantically matches schema-agnostic user queries to knowledge base elements
The Goal
To support easy querying over complex databases with large schemata, relieving users from the need to understand the formal representation of the data
Relevance
The increase in the size and in the semantic heterogeneity of database schemas is bringing new requirements for users querying and searching structured data. At this scale it can become unfeasible for data consumers to be familiar with the representation of the data in order to query it. At the center of this discussion is the semantic gap between users and databases, which becomes more central as the scale and complexity of the data grow. Addressing this gap is a fundamental part of the Semantic Web vision.
Schema-agnostic query mechanisms aim at allowing users to be abstracted from the representation of the data, supporting the automatic matching between queries and databases. This challenge aims at emphasizing the role of schema-agnosticism as a key requirement for contemporary database management, by providing a test collection for evaluating flexible query and search systems over structured data in terms of their level of schema-agnosticism (i.e. their ability to map a query issued with the user terminology and structure, mapping it to the dataset vocabulary). The challenge is instantiated in the context of Semantic Web datasets.
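The matching problem at the heart of the challenge can be illustrated with a toy vocabulary matcher: map a user's term to the closest element of a dataset vocabulary, here via character-trigram overlap. Real schema-agnostic systems use far richer semantic models; the vocabulary and the similarity measure below are illustrative assumptions.

```python
# A toy schema-agnostic matcher: pick the dataset vocabulary element
# whose character trigrams overlap most (Jaccard) with the user's term.

def trigrams(s):
    """Character trigrams of s, lowercased and padded at the boundaries."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def best_match(term, vocabulary):
    """Return the vocabulary element most similar to the user's term."""
    def overlap(v):
        a, b = trigrams(term), trigrams(v)
        return len(a & b) / len(a | b)   # Jaccard similarity of trigram sets
    return max(vocabulary, key=overlap)

# Invented mini-vocabulary in the style of DBpedia ontology properties.
vocab = ["dbo:spouse", "dbo:birthPlace", "dbo:populationTotal"]
print(best_match("population", vocab))  # → dbo:populationTotal
```

A full system would also map the query's structure, not just its terms, onto the dataset, which is exactly the ability the challenge's test collection is designed to measure.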
Data Tactics Data Science Brown Bag, April 2014 (Rich Heimann)
This is a presentation we give internally every quarter as part of our Data Science Brown Bag Series. This presentation covered different types of soft clustering techniques, all of which the team currently applies depending on the complexity of the data and of customer problems. If you are interested in learning more about working with L-3 Data Tactics or in joining the L-3 Data Tactics Data Science team, please contact us soon! Thank you.
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study (Maribel Acosta Deibe)
Summary of crowdsourcing studies to assess the quality of knowledge graphs and complete missing values. Results focus on findings over the DBpedia knowledge graph ( https://wiki.dbpedia.org/).
Related publications:
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S., & Lehmann, J. Crowdsourcing Linked Data Quality Assessment. In International Semantic Web Conference (pp. 260-276), 2013.
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., & Lehmann, J. Detecting Linked Data Quality issues via Crowdsourcing: A DBpedia Study. Semantic Web Journal, 9(3), 303-335, 2018.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: A hybrid SPARQL engine to enhance query answers via crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture (p. 11). 2015. Best Student Paper Award.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. Enhancing answer completeness of SPARQL queries via crowdsourcing. Journal of Web Semantics, 45, 41-62, 2017.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: An engine for enhancing answer completeness of SPARQL queries via crowdsourcing. Companion Volume of the Web Conference (pp. 501-505). 2018.
Deep Learning for Information Retrieval: Models, Progress, & Opportunities (Matthew Lease)
Talk given at the 8th Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/fire/2016/), December 10, 2016, and at the Qatar Computing Research Institute (QCRI), December 15, 2016.
Improving Semantic Search Using Query Log Analysis (Stuart Wrigley)
Despite the attention Semantic Search is continuously gaining, several challenges affecting tool performance and user experience remain unsolved. Among these are: matching user terms with the search space, adopting view-based interfaces in the Open Web, and supporting users while building their queries. This paper proposes an approach that moves a step towards tackling these challenges by creating models of usage of Linked Data concepts and properties extracted from semantic query logs as a source of collaborative knowledge. We use two sets of query logs from the USEWOD workshops to create our models and show the potential of using them in the mentioned areas.
GraphRAG Is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t... (Databricks)
It is widely known that the discovery, development, and commercialization of new classes of drugs can take 10-15 years and greater than $5 billion in R&D investment only to see less than 5% of the drugs make it to market.
AstraZeneca is a global, innovation-driven biopharmaceutical business that focuses on the discovery, development, and commercialization of prescription medicines for some of the world’s most serious diseases. Our scientists have been able to improve our success rate over the past 5 years by moving to a data-driven approach (the “5R”) to help develop better drugs faster, choose the right treatment for a patient and run safer clinical trials.
However, our scientists are still unable to make these decisions with all of the available scientific information at their fingertips. Data is sparse across our company as well as external public databases, every new technology requires a different data processing pipeline and new data comes at an increasing pace. It is often repeated that a new scientific paper appears every 30 seconds, which makes it impossible for any individual expert to keep up-to-date with the pace of scientific discovery.
To help our scientists integrate all of this information and make targeted decisions, we have used Spark on Azure Databricks to build a knowledge graph of biological insights and facts. The graph powers a recommendation system which enables any AZ scientist to generate novel target hypotheses, for any disease, leveraging all of our data.
In this talk, I will describe the applications of our knowledge graph and focus on the Spark pipelines we built to quickly assemble and create projections of the graph from 100s of sources. I will also describe the NLP pipelines we have built – leveraging spacy, bioBERT or snorkel – to reliably extract meaningful relations between entities and add them to our knowledge graph.
VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Applications to Spatiotemporally Dependent Data
Explains how deep learning creates howlers using commonly used annotation tools for images. We have identified several such howlers. Essentially, this presentation outlines the deficiencies of deep learning networks. We also explain the theoretical reasoning for these, building on Bengio's recent paper. The presentation also contains solutions which address these gaps, such as capsule networks, transfer learning, meta-learning and federated learning.
Metrics for Evaluating Quality of Embeddings for Ontological Concepts (Saeedeh Shekarpour)
Although there is an emerging trend towards generating embeddings for primarily unstructured data and, recently, for structured data, no systematic suite for measuring the quality of embeddings has been proposed yet.
This deficiency is further sensed with respect to embeddings generated for structured data because there are no concrete evaluation metrics measuring the quality of the encoded structure as well as semantic patterns in the embedding space.
In this paper, we introduce a framework containing three distinct tasks concerned with the individual aspects of ontological concepts: (i) the categorization aspect, (ii) the hierarchical aspect, and (iii) the relational aspect.
Then, in the scope of each task, a number of intrinsic metrics are proposed for evaluating the quality of the embeddings.
Furthermore, w.r.t. this framework, multiple experimental studies were run to compare the quality of the available embedding models.
Employing this framework in future research can reduce misjudgment and provide greater insight about quality comparisons of embeddings for ontological concepts.
The sampled data and code are available at https://github.com/alshargi/Concept2vec under the GNU General Public License v3.0.
CEVO: Comprehensive EVent Ontology Enhancing Cognitive Annotation on Relations (Saeedeh Shekarpour)
While the general analysis of named entities has received substantial research attention on unstructured as well as structured data, the analysis of relations among named entities has received limited focus. In fact, a review of the literature revealed a deficiency in research on the abstract conceptualization required to organize relations. We believe that such an abstract conceptualization can benefit various communities and applications such as natural language processing, information extraction, machine learning, and ontology engineering. In this paper, we present the Comprehensive EVent Ontology (CEVO), built on Levin's conceptual hierarchy of English verbs, which categorizes verbs by shared meaning and syntactic behavior. We present the fundamental concepts and requirements for this ontology. Furthermore, we present three use cases employing the CEVO ontology for annotation tasks: (i) annotating relations in plain text, (ii) annotating ontological properties, and (iii) linking textual relations to ontological properties. These use cases demonstrate the benefits of using CEVO for annotation: (i) annotating English verbs from an abstract conceptualization, (ii) playing the role of an upper ontology for organizing ontological properties, and (iii) facilitating the annotation of text relations using any underlying vocabulary. This resource is available at https://shekarpour.github.io/cevo.io/ under the https://w3id.org/cevo namespace.
1. Question Answering on Interlinked Data
Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer
AKSW Research Group, Leipzig University
December 5, 2013, IBM Research Center
3. Motivation
Text queries (either keyword or natural language) are:
• a simple retrieval approach
• popular
• implicit and ambiguous in their semantics.
SPARQL queries require:
• knowledge about the ontology
• proficiency in formulating formal queries
• explicit and unambiguous semantics.
AKSW group - Question Answering on Interlinked Data (published in www2013)
4. Comparison of Search Approaches
[Figure: search approaches positioned along two axes, data-semantic aware vs. data-semantic unaware and keyword-based query vs. natural language query. Information retrieval is data-semantic unaware and keyword-based; question answering systems are data-semantic aware and accept natural language queries. Our approach, SINA, is data-semantic aware.]
5. Example
Which television shows were created by Walt Disney?

select * where {
  ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney .
}
6. Aim and Challenges
Aim: question answering over a set of interlinked data sources.
• Query segmentation.
• Resource disambiguation.
• Construction of a formal query (expressed in SPARQL).
7. Further Challenges over Interlinked Data
1. Information for answering a certain question can be spread among different datasets employing heterogeneous schemas.
2. Constructing a federated formal query across different datasets requires exploiting links between the datasets on both the schema and instance levels.
9. Test Bed Datasets
* One single dataset: DBpedia.
* Three interlinked datasets from the life sciences:
  - Drugbank: a comprehensive knowledge base containing information about drugs, drug targets (i.e., proteins), interactions, and enzymes.
  - Diseasome: contains information about diseases and the genes associated with these diseases.
  - Sider: contains information about drugs and their side effects.
10. Main Characteristics of Federated Queries
1. Queries requiring fused information, e.g., side effects of drugs used for Tuberculosis.
2. Queries targeting combined information, e.g., side effects and enzymes of drugs used for asthma.
3. Queries requiring keyword expansion, e.g., side effects of Valdecoxib.
[Figure: an example query graph spanning DrugBank, Sider, and Diseasome - a DrugBank drug ?v0 with enzyme ?v1 is linked via sameAs to a Sider drug ?v2 with side effect ?v3, and is a possible drug for the disease Asthma in Diseasome.]
11. Challenge 1: Query Segmentation and Resource Disambiguation
• Sample question: What are the side effects of drugs used for Tuberculosis?
• Transformed to the 4-tuple (side # effect # drug # Tuberculosis).
• Different segmentations are possible:
  1. (side effect # drug # Tuberculosis)
  2. (side effect drug # Tuberculosis)
• Each valid segment is mapped to resources in the underlying knowledge bases.
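The naive enumeration of candidate segmentations described above can be sketched as follows; this is an illustrative reconstruction rather than the authors' implementation, and the function name is my own:

```python
from itertools import combinations

def segmentations(keywords):
    """Enumerate every way to split an ordered keyword tuple into
    contiguous segments (2^(n-1) possibilities for n keywords)."""
    n = len(keywords)
    results = []
    for k in range(n):
        # choose which of the n-1 gaps between keywords become breaks
        for breaks in combinations(range(1, n), k):
            cuts = [0, *breaks, n]
            results.append(tuple(" ".join(keywords[a:b])
                                 for a, b in zip(cuts, cuts[1:])))
    return results

segs = segmentations(["side", "effect", "drug", "Tuberculosis"])
# 8 segmentations in total, among them ('side effect', 'drug', 'Tuberculosis')
# and ('side effect drug', 'Tuberculosis')
```

Each candidate segment would then be checked against resource labels in the knowledge bases, and only segments with at least one matching resource survive.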
12. Segment Validation
• Original tuple: (side # effect # drug # Tuberculosis).
• A naive approach is used for finding all valid segments.

Valid segments | Samples of candidate resources
side effect    | 1. sider:class:sideeffect  2. sider:property:side_effects
drug           | 1. drugbank:drugs  2. class:offer  3. sider:drugs  4. diseases:possibledrug
tuberculosis   | 1. diseases:1154  2. side_effects:C0041296
14. Hidden Markov Model
• A statistical model containing a set of states.
• Moving from one state to another generates a sequence of observations.
• The probability of entering a state depends only on the previous state.
• The output is the most likely sequence of states generating the observed sequence.
15. State Space
• A state represents a knowledge base resource.
• The state space contains all resources in the knowledge base.
• In practice, we prune the state space by excluding irrelevant states.
• An unknown entity state is added, comprising all resources that are no longer available in the pruned state space.
• Extension of the state space with reasoning: the state space is extended by including resources inferred through lightweight owl:sameAs reasoning.
16. Bootstrapping the Model Parameters: Emission Probability
• The set-similarity level measures the difference between the label and the segment in terms of the number of words, using the Jaccard similarity.
• The string-similarity level measures the string similarity of each word in the segment with the most similar word in the label, using the Levenshtein distance.
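A minimal sketch of how these two similarity levels could be combined into an emission score; the multiplicative combination is an assumption for illustration (the paper defines the exact formula), and the function names are hypothetical:

```python
def jaccard(a, b):
    """Set-similarity level: Jaccard similarity over word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein(s, t):
    """Classic edit distance via dynamic programming (row by row)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def emission_probability(segment, label):
    """Combine set- and string-similarity into one emission score.
    The product used here is an illustrative assumption."""
    seg_words, lab_words = segment.lower().split(), label.lower().split()
    set_sim = jaccard(seg_words, lab_words)
    # for each segment word, normalized similarity to the closest label word
    str_sim = sum(
        max(1 - levenshtein(w, l) / max(len(w), len(l)) for l in lab_words)
        for w in seg_words) / len(seg_words)
    return set_sim * str_sim

p = emission_probability("side effect", "side effects")
```

An exact label match scores 1.0, while near-matches such as "side effect" against the label "side effects" score strictly between 0 and 1.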
17. Bootstrapping the Model Parameters: Transition and Initial Probability
• The transition probability and the initial probability are computed based on the semantic relatedness of two resources.
• Semantic relatedness is based on two values: distance and connectivity degree.
• These two values are transformed into hub and authority values using the HITS algorithm.
• The initial probability and the transition probability are defined as a uniform distribution over the hub and authority values.
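The HITS step can be illustrated with a plain iterative implementation; the toy edge list stands in for the distance/connectivity-derived graph and is an invented example:

```python
def hits(edges, iterations=50):
    """Plain iterative HITS: a node's authority is the sum of the hub
    scores of nodes pointing at it, and its hub score is the sum of the
    authority scores it points at, with L2 normalization each round."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Invented toy graph: two resources both point at sider:sideEffect,
# which therefore becomes the dominant authority.
hub, auth = hits([("drugbank:drug", "sider:sideEffect"),
                  ("diseasome:1154", "sider:sideEffect")])
```

The resulting hub and authority values would then feed the uniform distributions used for the initial and transition probabilities.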
18. Evaluation of Bootstrapping
• The accuracy of different distribution functions for the transition probability, i.e., normal, Zipfian, and uniform distributions.
• We ran the distribution functions with two different inputs: distance and connectivity degree values, as well as hub and authority values.
19. Viterbi Algorithm
Aim: find the most likely path generating the sequence of input keywords.
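A textbook Viterbi decoder over dictionary-based parameters illustrates this step; the states and probabilities in the example are invented toy values, not the model's bootstrapped parameters:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (standard
    Viterbi dynamic programming over dict-based parameters)."""
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        # for each state, keep the highest-probability path ending there
        best = {s: max(((p * trans_p[prev][s] * emit_p[s][obs], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

# Toy parameters: two candidate resources for the keyword
# observations "side effect" and "drug" (all numbers invented).
states = ["sider:sideEffect", "drugbank:drug"]
start = {"sider:sideEffect": 0.5, "drugbank:drug": 0.5}
trans = {"sider:sideEffect": {"sider:sideEffect": 0.2, "drugbank:drug": 0.8},
         "drugbank:drug": {"sider:sideEffect": 0.8, "drugbank:drug": 0.2}}
emit = {"sider:sideEffect": {"side effect": 0.9, "drug": 0.1},
        "drugbank:drug": {"side effect": 0.1, "drug": 0.9}}
prob, path = viterbi(["side effect", "drug"], states, start, trans, emit)
# path == ['sider:sideEffect', 'drugbank:drug']
```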
20. Output of the HMM for the following query:
Which television shows were created by Walt Disney?

Probability | Path of states
0.0023      | dbo:TelevisionShow, dbo:creator, dbr:Walt_Disney
0.0014      | dbo:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
5.89E-4     | dbr:TelevisionShow, dbo:creator, dbr:Walt_Disney
3.53E-4     | dbr:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
3.76E-5     | dbp:television, dbp:show, dbo:creator, dbr:Category:Walt_Disney
21. Query Construction
22. Query Construction Method
Input: a set of resources R = {r1, r2, ..., rn}.
Output: a query graph QG = (V, E), a directed, connected multi-graph.
Forward chaining:
1. CT: comprehensive type.
2. CD: comprehensive domain.
3. CR: comprehensive range.
23. Query Construction Method
Input: a set of resources R = {r1, r2, ..., rn}.
Output: a query graph QG = (V, E), a directed, connected multi-graph.
Generating the Incomplete Query Graph (IQG) by initializing vertices and primary edges:
• A vertex is added to the IQG (1) if r is an instance, (2) if r is a class.
• Properties are added along with zero, one, or two vertices.
24. Query Construction Method
Example: What are the side effects of drugs used for Tuberculosis?
• diseasome:1154 (type: instance)
• diseasome:possibleDrug (type: property)
• sider:sideEffect (type: property)
[Figure: two disjoint sub-graphs - Graph 1 connects 1154 to ?v0 via possibleDrug; Graph 2 connects ?v1 to ?v2 via sideEffect.]
25. Query Construction Method
Connecting the sub-graphs of an IQG:
1. Minimum spanning tree: a minimum set of edges (i.e., properties) to span a set of disjoint graphs.
2. Prim's algorithm: incrementally includes edges to connect disjoint sub-graphs.
Candidate connecting edges:
• Direct properties: ?v0 ?p ?v1 .
• Properties via an owl:sameAs link:
  (1) ?v0 owl:sameAs ?x . ?x ?p ?v1 .
  (2) ?v0 ?p ?x . ?x owl:sameAs ?v1 .
  (3) ?v0 owl:sameAs ?x . ?x ?p ?y . ?y owl:sameAs ?v1 .
[Figure: two candidate templates for joining the sub-graphs of the running example, combining the possibleDrug edge from 1154 and the sideEffect edge over the variables ?v0, ?v1, and ?v2.]
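The sub-graph connection step can be sketched greedily; note this sketch uses a Kruskal-style sort-and-union approach rather than Prim's incremental variant, with invented costs reflecting the number of owl:sameAs hops a pattern needs:

```python
def connect_subgraphs(subgraphs, candidate_edges):
    """Pick a minimum-cost set of connecting edges that joins all
    disjoint sub-graphs (Kruskal-style: sort by cost, union-find to
    skip edges inside an already-connected component).
    candidate_edges: (cost, subgraph_a, subgraph_b, triple_pattern)."""
    parent = {g: g for g in subgraphs}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    chosen = []
    for cost, a, b, pattern in sorted(candidate_edges):
        ra, rb = find(a), find(b)
        if ra != rb:                # edge joins two disjoint components
            parent[ra] = rb
            chosen.append(pattern)
    return chosen

# Invented costs: a direct property is cheaper than one via owl:sameAs.
edges = [
    (1, "G1", "G2", "?v0 ?p ?v1 ."),
    (2, "G1", "G2", "?v0 owl:sameAs ?x . ?x ?p ?v1 ."),
]
patterns = connect_subgraphs(["G1", "G2"], edges)
# picks only the cheapest pattern: ['?v0 ?p ?v1 .']
```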
26. Evaluation
Goal of the experiment: how well the (1) resource disambiguation and (2) query construction approaches perform.
Measurement of performance:
1. Disambiguation: Mean Reciprocal Rank (MRR).
2. Query construction: precision and recall.
Benchmark:
1. A natural-language query and the equivalent conjunctive SPARQL query.
2. 25 queries on the 3 interlinked datasets Drugbank, Sider, and Diseasome.
3. The QALD1 and QALD3 benchmarks for DBpedia.
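MRR itself is straightforward to compute; a minimal sketch with invented ranks:

```python
def mean_reciprocal_rank(ranks):
    """MRR over queries: each entry is the 1-based rank at which the
    correct resource appears in that query's candidate list, or None
    if it never appears (contributing 0)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Hypothetical example: the correct resource is ranked 1st, 2nd, and
# 1st in three queries' candidate lists.
mrr = mean_reciprocal_rank([1, 2, 1])  # (1 + 0.5 + 1) / 3
```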
27. Evaluation Using Life-Science Datasets
Without reasoning: precision = 0.91, recall = 0.88
With reasoning: precision = 0.95, recall = 0.90
28. Evaluation Using DBpedia
• QALD3 benchmark:
  - contains 100 questions.
  - 32 original questions can be answered correctly.
• QALD1 benchmark:
  - contains 50 questions.
  - 7 complex questions.
  - 13 questions require information beyond DBpedia, i.e., from YAGO and FOAF.
  - 14 questions were slightly modified to remove expansion and cleaning problems.
  - MRR of disambiguation = 96%
  - Query construction accuracy = 83%
29. Runtime
Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction
30. Related Work