This document describes a mid-ontology learning approach for integrating ontology schemas from different linked data sources. It collects data from linked instances using owl:sameAs links. Predicates are grouped by exact matching of objects and pruning using string and knowledge-based similarity measures. The approach aims to automatically learn a simple ontology that can represent data from diverse domains in linked open data.
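The grouping step described above can be sketched in plain Python; the predicate names and data below are invented for illustration, and real systems would follow this with the string- and knowledge-based pruning the approach describes:

```python
from collections import defaultdict

def group_predicates_by_object(triples_a, triples_b):
    """Align predicates from two owl:sameAs-linked instances when
    their objects match exactly (the first grouping step; pruning
    with similarity measures would follow)."""
    by_object_a = defaultdict(set)
    for pred, obj in triples_a:
        by_object_a[obj].add(pred)
    aligned = defaultdict(set)
    for pred_b, obj in triples_b:
        for pred_a in by_object_a.get(obj, ()):
            aligned[pred_a].add(pred_b)
    return dict(aligned)

# Two descriptions of the same city from different sources,
# connected by an owl:sameAs link (values are illustrative).
dbpedia = [("dbo:populationTotal", "3769495"), ("rdfs:label", "Berlin")]
geonames = [("gn:population", "3769495"), ("gn:name", "Berlin")]
print(group_predicates_by_object(dbpedia, geonames))
# {'dbo:populationTotal': {'gn:population'}, 'rdfs:label': {'gn:name'}}
```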
The Linked Open Data (LOD) cloud contains tremendous amounts of interlinked instances from which we can retrieve abundant knowledge. However, because the ontologies involved are large and heterogeneous, learning them all manually is time consuming, and it is difficult to see which properties are important for describing instances of a specific class. To construct an ontology that helps users easily access various data sets, we propose a semi-automatic ontology integration framework that reduces the heterogeneity of ontologies and retrieves the frequently used core properties of each class. The framework consists of three main components: graph-based ontology integration, machine-learning-based ontology schema extraction, and an ontology merger. By analyzing the instances of the linked data sets, the framework acquires ontological knowledge and constructs a high-quality integrated ontology that is easy to understand and effective for knowledge acquisition from various data sets using simple SPARQL queries.
Interlinking educational data to Web of Data (Thesis presentation) – Enayat Rajabi
This is a thesis presentation about interlinking educational data to the Web of Data. I explain how I used the Linked Data approach to expose and interlink educational data to the Linked Open Data cloud.
The document analyzes ontology reuse in 196 Linked Data vocabularies. It finds that 59.47% of elements are locally defined, while 40.53% are reused - mostly by importing other ontologies (67.05%). The ontologies reference a small set of common vocabularies like FOAF, DC and Geo. Future work includes completing the dataset and analyzing outliers to better understand ontology reuse on the Linked Data cloud.
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title... – Jennifer D'Souza
We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) noting salient aspects of an article’s contribution in its title; and (ii) pattern regularities capturing the salient terms that could be expressed in a set of rules. Only those lexico-syntactic patterns were selected that were easily recognizable, occurred frequently, and positionally indicated a scientific entity type. The rules were developed on a collection of 50,237 CL titles covering all articles in the ACL Anthology. In total, 19,799 research problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and 21,687 methods were extracted at an average precision of 75%.
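The flavour of such lexico-syntactic rules can be sketched as follows; the patterns here are invented, greatly simplified stand-ins for the actual rule set, which is far larger and more precise:

```python
import re

# Hypothetical, simplified patterns: cue phrases in CL titles that
# positionally signal a scientific-entity type.
PATTERNS = [
    # "X for Y" -> X is often the method/solution
    ("method", re.compile(r"^(.+?)\s+for\s+", re.I)),
    # "... for Y [using/with/via ...]" -> Y is often the problem
    ("problem", re.compile(r"\bfor\s+(.+?)(?:\s+(?:using|with|via)\b|$)", re.I)),
    # ": A ... Corpus/Dataset/Treebank" -> a resource
    ("resource", re.compile(r":\s*A\s+(?:New\s+)?(.+?\b(?:Corpus|Dataset|Treebank))", re.I)),
]

def extract_entities(title):
    """Apply each pattern and collect (entity_type, matched_span) pairs."""
    found = []
    for etype, pat in PATTERNS:
        m = pat.search(title)
        if m:
            found.append((etype, m.group(1).strip()))
    return found

print(extract_entities("Neural Machine Translation for Low-Resource Languages"))
# [('method', 'Neural Machine Translation'), ('problem', 'Low-Resource Languages')]
```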
What makes a linked data pattern interesting? – Szymon Klarman
A short talk on the problem of mining linked data (RDF) patterns, introducing a few preliminary notions towards the definition of generic linked data mining algorithms.
Perspectives on mining knowledge graphs from text – Jennifer D'Souza
A survey presented at the International Winter School on Knowledge Graphs and Semantic Web 2021 http://www.kgswc.org/winter-school/; November 2021; DOI: 10.13140/RG.2.2.24482.56005
Connecting life sciences data at the European Bioinformatics Institute – Connected Data World
Tony Burdett's slides from his talk at Connected Data London. Tony is a Senior Software Engineer at the European Bioinformatics Institute. He presented the complexity of data at EMBL-EBI and the institute's solution for making sense of all this data.
The document discusses using ontologies and Schema.org properties to connect biomedical data to ontology terms and concepts. Over 200 biomedical ontologies are in active use by life science databases at EMBL-EBI. Schema.org properties like MedicalCode and CreativeWork can be used to mark up ontology terms, data resources, and their relationships. This would make it possible to ask which ontologies and terms are used in specific data, and would enable richer search and discovery across data and ontologies.
Elsevier aims to construct knowledge graphs to help address challenges in research and medicine. Knowledge graphs link entities like people, concepts, and events to provide answers. Elsevier analyzes text and data to build knowledge graphs using techniques like information extraction, machine learning, and predictive modeling. Their knowledge graph integrates data from publications, clinical records, and other sources to power applications that help researchers, medical professionals, and patients. Knowledge graphs are a critical component for delivering value, especially as data volumes and needs accelerate.
Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration a large number of biomedical ontologies have been developed to provide standard semantics for describing metadata. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases. This amounts to millions of terms that are interconnected via relationships that naturally form a graph of biomedical terminology.
The Ontology Lookup Service (OLS) (http://www.ebi.ac.uk/ols) integrates over 160 ontologies and provides a central point for the biomedical community to query and visualise ontologies. OLS also provides a RESTful API over the ontologies that is used in high-throughput data annotation pipelines. OLS is built on top of a Neo4j database that provides efficient indexes for extracting ontological relationships. We have developed generic tools for loading RDF/OWL ontologies into Neo4j, where the indexes are optimised for serving common ontology queries. We are now moving to adopt graph databases more widely in applications relating to ontology mapping prediction and recommendation systems for data annotation.
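The kind of traversal such a graph index serves can be sketched in plain Python; the term IRIs and edges below are invented for illustration, standing in for rdfs:subClassOf relationships stored in the graph database:

```python
from collections import deque

# Hypothetical subclass edges (child -> list of parents).
SUBCLASS_OF = {
    "EFO:asthma": ["EFO:lung_disease"],
    "EFO:lung_disease": ["EFO:disease"],
    "EFO:disease": [],
}

def ancestors(term, edges):
    """Collect all transitive ancestors of a term via breadth-first
    traversal of the subclass graph."""
    seen, queue = set(), deque(edges.get(term, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(edges.get(parent, []))
    return seen

print(ancestors("EFO:asthma", SUBCLASS_OF))
# {'EFO:lung_disease', 'EFO:disease'} (set order may vary)
```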
1) The document discusses EBI's efforts to facilitate semantic alignment of its resources through building ontologies and annotating data with ontologies.
2) It describes EBI's work developing ontologies like the Experiment Factor Ontology and using ontologies to enhance search, data visualization, and data integration.
3) The challenges of representing EBI data in RDF are discussed, and future directions are outlined that could make RDF deployment simpler and enable more interesting queries over EBI data.
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties... – Agnieszka Ławrynowicz
The document proposes a substitutive itemset mining framework to find synonymous properties in linked data. It applies frequent itemset mining to transactions of subject-property-object triples extracted from DBpedia to find property pairs that frequently co-occur. These pairs are then analyzed to identify substitutive properties that can replace each other based on their common coverage of itemsets. An implementation of this approach found several synonymous property mappings when tested on organization data from DBpedia.
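A toy approximation of the idea (not the actual substitutive itemset mining algorithm, which is more general than this overlap count): treat two properties as synonym candidates when they frequently connect the same subject to the same object. Property names and triples are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

def synonym_candidates(triples, min_overlap=2):
    """For each property, record its (subject, object) pairs; report
    property pairs whose pair sets overlap at least min_overlap times."""
    pairs_by_prop = defaultdict(set)
    for s, p, o in triples:
        pairs_by_prop[p].add((s, o))
    candidates = {}
    for p1, p2 in combinations(sorted(pairs_by_prop), 2):
        overlap = len(pairs_by_prop[p1] & pairs_by_prop[p2])
        if overlap >= min_overlap:
            candidates[(p1, p2)] = overlap
    return candidates

triples = [
    ("ACME", "dbo:headquarter", "Berlin"),
    ("ACME", "dbp:hqCity", "Berlin"),
    ("Initech", "dbo:headquarter", "Austin"),
    ("Initech", "dbp:hqCity", "Austin"),
]
print(synonym_candidates(triples))
# {('dbo:headquarter', 'dbp:hqCity'): 2}
```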
Ontology based clustering algorithms aim to standardize clustering by incorporating domain knowledge through ontologies. They calculate similarity matrices between objects using ontology-based methods, then merge the closest clusters and recalculate the matrix in an iterative process. Several ontology based clustering algorithms are discussed, including Apriori, which generates frequent item sets to cluster data, and algorithms that use ontologies to weight features or perform recursive mining on an FP-tree. These algorithms integrate distributed semantic web data through ontologies to improve search, classification and reuse of knowledge resources.
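The iterative merge-and-recalculate scheme described above can be sketched as follows; the similarity function is a pluggable stand-in for an actual ontology-based measure, and the token-overlap example is purely illustrative:

```python
def agglomerative_cluster(items, similarity, min_sim=0.3):
    """Repeatedly merge the most similar pair of clusters until no
    pair reaches the min_sim threshold (single-linkage)."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best, best_pair = min_sim, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage: best similarity between any two members
                sim = max(similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                if sim >= best:
                    best, best_pair = sim, (i, j)
        if best_pair is None:
            break  # no pair is similar enough to merge
        i, j = best_pair
        clusters[i] += clusters.pop(j)
    return clusters

def token_sim(a, b):
    # Jaccard overlap of tokens, standing in for an ontology-based measure.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

terms = ["heart disease", "heart failure", "lung cancer"]
print(agglomerative_cluster(terms, token_sim))
# [['heart disease', 'heart failure'], ['lung cancer']]
```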
RuleML2015: FOWLA, a federated architecture for ontologies – RuleML
The progress of information and communication technologies has greatly increased the quantity of data to process, so managing data heterogeneity is now a pressing problem. In the 1980s, the Federated Database Architecture (FDBA) was introduced as a collection of components united in a loosely coupled federation. Semantic web technologies mitigate the data heterogeneity problem; however, because of structural heterogeneity, integrating several ontologies is still a complex task. To tackle this problem, we propose a loosely coupled federated ontology architecture (FOWLA). Our approach allows various ontologies to coexist and share common data dynamically at query execution through logical rules. We illustrate the advantages of adopting our approach through several examples and benchmarks, and we compare it with other existing initiatives.
This document discusses using ontologies to make biological and biomedical data more interoperable and FAIR (Findable, Accessible, Interoperable, Reusable). It describes several ontology services and tools provided by EMBL-EBI to help with tasks like annotating data, mapping data to ontologies, searching and accessing ontologies, and publishing structured data. It also uses the example of the BioSamples database to illustrate challenges in working with large, heterogeneous datasets and how ontologies can help address issues like normalizing descriptions and attributes to enable better searching and data integration.
The document discusses the relationship between database systems and information retrieval (IR) systems. It notes that while they have traditionally operated in "parallel universes", addressing them together is now important for applications. It outlines some key differences and similarities between the two areas and discusses efforts to build more integrated database and IR platforms.
This document summarizes an OKFN Korea hackathon event focused on open data. It discusses modeling Seoul open government data using ontologies, linking it to external datasets like cultural heritage data, and publishing the enriched data in RDF format. It covers topics like linked data, modeling with RDF/RDFS/OWL, reusing existing vocabularies, ontology development best practices, and triple store storage solutions.
Content + Signals: The value of the entire data estate for machine learning – Paul Groth
Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you, the difficulty is not in the provision of the content itself but in the production of the annotations necessary to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive, particularly when the nature of the content requires subject matter experts to be involved.
In this talk, I highlight emerging approaches to tackling this challenge using what's known as weak supervision - using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.
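A minimal sketch of the weak-supervision idea: several noisy labelling functions (the "signals" from the existing data estate, e.g. in-house taxonomies or editorial metadata) vote on each document, and a simple majority vote produces the training label. The functions and labels here are invented; production systems in the Snorkel style learn to weight the signals rather than taking a flat vote:

```python
ABSTAIN = None  # a labelling function may decline to vote

def lf_keyword(doc):
    # Signal 1: a hypothetical editorial cue word implies a positive label.
    return "positive" if "excellent" in doc else ABSTAIN

def lf_length(doc):
    # Signal 2: hypothetically, very short items tend to be low quality.
    return "negative" if len(doc) < 15 else ABSTAIN

def majority_label(doc, lfs):
    """Apply all labelling functions and return the majority vote,
    ignoring abstentions."""
    votes = [v for lf in lfs if (v := lf(doc)) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_keyword, lf_length]
print(majority_label("an excellent and thorough survey", lfs))  # positive
print(majority_label("too short", lfs))                         # negative
```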
These slides were presented at the "graph databases in life sciences workshop". There is an accompanying Neo4j guide that will walk you through importing data into Neo4j using web services from a number of databases at EMBL-EBI.
https://github.com/simonjupp/importing-lifesci-data-into-neo4j
The document discusses the convergence of database and information retrieval systems. It notes that both fields have traditionally focused on either structured or unstructured data but are now combining aspects of both. This is driven by new application needs that require flexible querying of both text and structured data. The document outlines the history and developments in this area, including early XML IR systems and more recent graph-based approaches that integrate ranking and probabilistic models from IR into structured querying.
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect match – Christoph Lange
The Distributed Ontology Language (DOL) is a meta-language for integrating ontologies written in different languages. Our notion of “distributed” comprises logical heterogeneity within ontologies, modularity and reuse, and links across ontologies in different places of the Web. Not only can ontologies be distributed across the Web, but DOL's supply of supported ontology languages can also be extended in a decentralised way. For this functionality, DOL builds on the Linked Open Data (LOD) principles. But DOL also contributes to LOD use cases. Many current LOD applications are limited by the weak expressivity of the RDF and RDFS languages commonly used to express data and vocabularies, yet completely switching to a more expressive language would impair scalability to big datasets. DOL addresses the scalability and expressivity requirements by allowing each aspect of a dataset to be represented in the most suitable language while keeping these different representations connected. This is particularly useful in geographic information systems, where big datasets (e.g. LinkedGeoData, the LOD version of OpenStreetMap) need to be integrated with formalisations of complex spatial notions (e.g. in the first-order language Common Logic).
A Mathematical Approach to Ontology Authoring and Documentation – Christoph Lange
This document proposes using OMDoc, a framework for representing formal knowledge, to improve ontology authoring and documentation. It describes how OMDoc can:
1) Provide better support for modularity, documentation at different granularities, and linking documentation to formal representations compared to languages like OWL.
2) Model existing ontologies and translate between OMDoc and OWL/RDF formats to leverage existing tools.
3) Allow comprehensive, integrated documentation of ontologies through features like literate programming. The approach is evaluated by reimplementing the FOAF ontology in OMDoc.
Konstantin Vorontsov - BigARTM: Open Source Library for Regularized Multimoda... – AIST
The document describes BigARTM, an open source library for regularized multimodal topic modeling of large collections. It discusses probabilistic topic modeling and how additive regularization of topic models (ARTM) handles ill-posed inverse problems in topic modeling. ARTM allows various regularizers to be combined. BigARTM provides a parallel implementation for improved time and memory performance. Experiments show how ARTM can combine regularizers and be used for classification and multi-language topic modeling. Multimodal topic modeling binds topics to terms, authors, images, links and other modalities.
The document describes the e-LICO intelligent discovery assistant system. It consists of several components including a planner and meta-learner. The planner interacts with scientists to achieve their knowledge discovery goals through an iterative process. Other components include a data mining optimization ontology and services/components that deliver the data mining platform to scientists.
Bioinformatics databases: Current Trends and Future Perspectives – University of Malaya
Data is the most powerful resource in any field of study. In biology, data comes from scientists and their experiments, and any institution that makes sense of the data it collects will be at the forefront of its research field. At the start of any data collection endeavour, it is critical to find proper management techniques to store the data and to maximise its utilisation. This presentation reflects on current trends and techniques in data modeling and architecture, with a highlight on database use, focusing on bioinformatics examples and case studies. Finally, the future of bioinformatics databases is outlined to give an overview of the modeling techniques needed to accommodate the escalation of biological data in coming years.
This document discusses dataset profiling and the LinkedUp data catalog. It describes how LinkedUp profiles 34 educational datasets, including information on their schemas, accessibility, and topic coverage. It also explains the benefits of dataset profiling, such as enabling federated querying and exploratory search over multiple datasets. Finally, it outlines techniques for profiling linked data and applications of the profiles through tools like Cite4Me and the LinkedUp data catalog.
Open Web Data for Education - Linked Data technologies for connecting open ed... – Mathieu d'Aquin
This document discusses using linked data technologies to connect open educational data. It begins with an overview of the current state of open data in education, including open educational resources from universities, repositories, and publishers. It then discusses the need for common vocabularies to facilitate linking this data. The document presents several examples of representing educational data as linked open data, including the AIISO, BIBO, and LRMI ontologies as well as a case study on the Bologna Ontology. It concludes by discussing potential applications of open linked educational data like social resource discovery and research exploration.
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
Profile-based Dataset Recommendation for RDF Data Linking Mohamed BEN ELLEFI
This document summarizes Mohamed Ben Ellefi's PhD thesis defense on profile-based dataset recommendation for RDF data linking. The thesis proposes two approaches: a topic profile-based approach and an intensional profile-based approach. The topic profile-based approach models datasets as topics and recommends target datasets based on similarity between source and target topic profiles, achieving an average recall of 81% and reducing the search space by 86%. The approach shows better performance than baselines but needs improvement on precision.
Keynote speech - Carole Goble - Jisc Digital Festival 2015Jisc
Carole Goble is a professor in the school of computer science at the University of Manchester.
In this keynote, Carole offered her insights into research data management and data centres.
RARE and FAIR Science: Reproducibility and Research ObjectsCarole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
Written and presented by Carole Goble (University of Manchester) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
This curriculum vitae summarizes the qualifications and experience of Dr. Jie Bao. He is currently a research associate at Rensselaer Polytechnic Institute, a research affiliate at MIT, and a visiting scientist at Raytheon BBN Technologies. He received his Ph.D. in computer science from Iowa State University in 2007. His research focuses on areas including semantic web, linked data, description logics, and ontology engineering. He has over 50 publications and has served on numerous conference committees.
Data Provenance and Scientific Workflow ManagementNeuroMat
Introductory class on techniques and tools to manage scientific data, focusing on sources of information and data analysis. Lecturer: Prof. Kelly Rosa Braghetto, a NeuroMat associate investigator and a professor at the University of São Paulo's Department of Computer Science.
Integration of research literature and data (InFoLiS)Philipp Zumstein
Talk at CNI 2015 Spring Membership Meeting in Seattle on April 14th, 2015, see http://www.cni.org/events/membership-meetings/upcoming-meeting/spring-2015/
Abstract: The goal of the InFoLiS project is to connect research data and publications. Links between data and literature are created automatically by means of text mining and made available as Linked Open Data (LOD) for seamless integration into different retrieval systems. This enables scientists to directly access information about corresponding research data in a literature information system, and, vice versa, it is possible to directly find different interpretations and analyses in the literature of the same research data. In our talk, we will describe our methods for generating the links and give insight into the Linked Data infrastructure including the services we are currently building. Most importantly, we will detail how our solutions can be used by other institutions and invite all interested participants to discuss with us their ideas and thoughts on the requirements for these services to ensure broad interoperability with existing systems and infrastructures. InFoLiS is a joint project by the GESIS – Leibniz Institute for the Social Sciences, Cologne, Mannheim University Library, and Mannheim University supported by a grant from the DFG – German Research Foundation.
Research Objects: more than the sum of the partsCarole Goble
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
Linked Data in a University Context: Publication, Applications and Beyond
The Open University (OU) is exposing its data as linked open data to make it more transparent, reusable and discoverable both internally and externally. This includes data about courses, research outputs, library resources and more. By linking its data to other university and external datasets, the OU aims to create new applications and make existing processes more efficient. Other universities in the UK and worldwide are now following the OU's example in publishing institutional data as linked open data.
(1) The document is an annotated bibliography on information extraction and natural language processing written by Jun-ichi Tsujii from the University of Tokyo.
(2) It provides references to key papers that have influenced the development of the field of information extraction over the last 5 years as of 2000, organized by topics such as general introduction, IE systems used in Message Understanding Conferences, and IE systems for biology and biomedical texts.
(3) The references cover techniques such as finite-state processing, pattern matching, and use of full parsers as well as domain-specific resources for biological IE systems.
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksCarole Goble
Keynote presentation at the iConference 2015, Newport Beach, Los Angeles, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
This document discusses using natural language processing techniques to analyze scientific papers and extract structured knowledge. It describes analyzing papers to recognize named entities, parse syntactic dependencies and semantic arguments, resolve coreferences, and extract relations. This extracted information can be used to generate structured abstracts, find related papers, perform content-based search, and discover new facts. As an example, it outlines a project that aims to read research papers to assemble and reason over causal models in cancer biology.
The document describes a summer institute on discovering big data held in San Diego from August 5-9, 2013. It discusses several topics related to big data in neuroscience including available resources, how to find and connect relevant information, challenges around data integration from disparate sources, and using ontologies and machine learning for tasks like data tagging.
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ...Richard Zijdeman
A glimpse of how we are used to connecting datasets on our laptops and how, imho, need to move to the Web of Data, including a demo connecting various sources all from your(!) machine.
1. Research Organization of Information and Systems
National Institute of Informatics
Mid-Ontology Learning from Linked Data
Lihua Zhao and Ryutaro Ichise
JIST2011, 12.05.2011, Hangzhou
2. Outline
Introduction
Mid-Ontology Learning Approach
Experimental Evaluation
Related Work
Conclusion and Future Work
Research Organization of Information and Systems | Lihua Zhao and Ryutaro Ichise | Mid-Ontology Learning from Linked Data | National Institute of Informatics
3. Introduction
Linked Open Data
295 data sets, 31 billion RDF triples (as of Sep. 2011)
7 domains (cross-domain, geographic, media, life sciences,
government, user-generated content, and publications)
Interlinked Instances (owl:sameAs)
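These interlinked instances are the entry point of the approach: every pair of instances connected by owl:sameAs can be harvested directly from the triples. A minimal sketch (the in-memory triple list is illustrative; in practice the triples would come from a SPARQL endpoint or an RDF dump):

```python
# Harvest owl:sameAs links from a triple collection.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Toy triples for illustration only.
triples = [
    ("http://dbpedia.org/resource/Berlin", SAME_AS,
     "http://sws.geonames.org/2950159/"),
    ("http://data.nytimes.com/N50987186835223032381", SAME_AS,
     "http://dbpedia.org/resource/Berlin"),
    ("http://dbpedia.org/resource/Berlin",
     "http://dbpedia.org/property/population", "3439100"),
]

def same_as_links(triples):
    """Return (subject, object) pairs connected by owl:sameAs."""
    return [(s, o) for s, p, o in triples if p == SAME_AS]
```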
4. Introduction
Challenging Problem
Each data set has its own ontology schema
DBpedia: http://dbpedia.org/property/population
Geonames: http://www.geonames.org/ontology#population
Time-consuming to learn all the ontology schemas
DBpedia: 320 classes and thousands of properties.
Heterogeneity of ontology schemas
http://dbpedia.org/property/populationTotal
http://dbpedia.org/property/population
5. Introduction
Objective
Collected data based on “http://dbpedia.org/resource/Berlin”.

Predicate                                        Object
http://dbpedia.org/property/name                 Berlin
http://dbpedia.org/property/population           3439100
http://dbpedia.org/property/plz                  10001-14199
http://dbpedia.org/ontology/postalCode           10001-14199
http://dbpedia.org/ontology/populationTotal      3439100
......                                           ......
http://www.geonames.org/ontology#alternateName   Berlin
http://www.geonames.org/ontology#alternateName   Berlyn@af
http://www.geonames.org/ontology#population      3426354
......                                           ......
http://www.w3.org/2004/02/skos/core#prefLabel    Berlin (Germany)
http://data.nytimes.com/elements/first_use       2004-09-12
http://data.nytimes.com/elements/latest_use      2010-06-13
6. Introduction
Simple ontology for various data sets: Mid-Ontology
Investigation on linked instances
owl:sameAs links identical or related instances
Scale down the data set
Automatic ontology learning
Integrate ontologies from diverse domain data sets
Automate the ontology construction process
Adapt to linked open data sets
7. Mid-Ontology Learning Approach
8. Data Collection
We scale down the data sets by collecting only linked instances, from which we can extract related information.
Extract data linked with owl:sameAs
Select a core data set (inward & outward links)
Collect all instances that have owl:sameAs
Remove noisy instances of the core data set
Noisy instances: those without any meaningful triple
Collect predicates and objects
Collect <predicate, object> (PO) pairs from collected instances
Collect PO pairs from linked instances (other data sets)
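The collection steps above can be sketched as follows; the core-instance choice and triple representation are illustrative assumptions, not code from the paper:

```python
# Collect <predicate, object> (PO) pairs for a core instance and for all
# instances linked to it by owl:sameAs, in either direction.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def collect_po_pairs(core_uri, triples):
    # Gather instances linked to the core instance (inward & outward links).
    linked = {core_uri}
    for s, p, o in triples:
        if p == SAME_AS:
            if s == core_uri:
                linked.add(o)          # outward owl:sameAs link
            elif o == core_uri:
                linked.add(s)          # inward owl:sameAs link
    # Collect PO pairs from the core instance and its linked instances.
    return [(p, o) for s, p, o in triples
            if s in linked and p != SAME_AS]
```

An instance whose triples yield no PO pairs at all would count as noisy and be dropped.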
9. An Example of Collected Data
dbpedia:Berlin owl:sameAs http://sws.geonames.org/2950159/
http://data.nytimes.com/N50987186835223032381 owl:sameAs dbpedia:Berlin
Collected data based on “http://dbpedia.org/resource/Berlin”.

Predicate                                        Object
http://dbpedia.org/property/name                 Berlin
http://dbpedia.org/property/population           3439100
http://dbpedia.org/property/plz                  10001-14199
http://dbpedia.org/ontology/postalCode           10001-14199
http://dbpedia.org/ontology/populationTotal      3439100
......                                           ......
http://www.geonames.org/ontology#alternateName   Berlin
http://www.geonames.org/ontology#alternateName   Berlyn@af
http://www.geonames.org/ontology#population      3426354
......                                           ......
http://www.w3.org/2004/02/skos/core#prefLabel    Berlin (Germany)
http://data.nytimes.com/elements/first_use       2004-09-12
http://data.nytimes.com/elements/latest_use      2010-06-13
10. Mid-Ontology Learning Approach
11. Predicate Grouping
Group related predicates from different ontology schemas, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
Prune groups by similarity matching
Refine groups using extracted relations
12. Predicate Grouping
Group related predicates from different ontology schemas, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
One predicate may have various objects
Different predicates may have the same object value
Prune groups by similarity matching
Refine groups using extracted relations
13. Group Predicates by Exact Matching
Create initial groups (Gi) of PO pairs
e.g. Gi.predicates = { db-prop:name, geo-onto:alternateName }
Gi.objects = { Berlin, Berlyn@af }
Collected data based on “http://dbpedia.org/resource/Berlin”.

Predicate                                        Object
http://dbpedia.org/property/name                 Berlin
http://dbpedia.org/property/population           3439100
http://dbpedia.org/property/plz                  10001-14199
http://dbpedia.org/ontology/postalCode           10001-14199
http://dbpedia.org/ontology/populationTotal      3439100
......                                           ......
http://www.geonames.org/ontology#alternateName   Berlin
http://www.geonames.org/ontology#alternateName   Berlyn@af
http://www.geonames.org/ontology#population      3426354
......                                           ......
http://www.w3.org/2004/02/skos/core#prefLabel    Berlin (Germany)
http://data.nytimes.com/elements/first_use       2004-09-12
http://data.nytimes.com/elements/latest_use      2010-06-13
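The exact-matching step can be sketched as merging predicates that share an identical object value. This is a hypothetical implementation (the union-find structure and variable names are mine, not the authors'):

```python
from collections import defaultdict

def exact_match_groups(po_pairs):
    """Build initial groups: predicates that share at least one identical
    object value end up in the same group (transitively, via union-find)."""
    pred_objs = defaultdict(set)
    for pred, obj in po_pairs:
        pred_objs[pred].add(obj)

    # Union-find over predicates, merged whenever two predicates share an object.
    parent = {p: p for p in pred_objs}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_pred_for_obj = {}
    for pred, objs in pred_objs.items():
        for obj in objs:
            if obj in first_pred_for_obj:
                parent[find(pred)] = find(first_pred_for_obj[obj])
            else:
                first_pred_for_obj[obj] = pred

    # Materialise each group's predicate set and object set.
    groups = defaultdict(lambda: {"predicates": set(), "objects": set()})
    for pred, objs in pred_objs.items():
        g = groups[find(pred)]
        g["predicates"].add(pred)
        g["objects"].update(objs)
    return list(groups.values())
```

Applied to the Berlin PO pairs above, db-prop:name and geo-onto:alternateName fall into one group whose objects are { Berlin, Berlyn@af }, matching the slide's Gi.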
14. Predicate Grouping
Group related predicates from different ontology schemas, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
Prune groups by similarity matching
Exact matching may ignore
Terms of predicates or objects written in different languages
Semantically identical or related predicates
Refine groups using extracted relations
15. Prune Groups by Similarity Matching
Ontology similarity matching at the concept level
String-based similarity measure: StrSim(O(Gi), O(Gj))
O(Gi): objects in Gi
Prefix, Suffix, Levenshtein distance, and n-gram.
Knowledge-based similarity measure: WNSim(T(Gi), T(Gj))
T(Gi): pre-processed terms of predicates in Gi
Natural Language Processing: tokenizing terms, removing stop words, and stemming.
WordNet-based similarity measures: LCH, RES, HSO, JCN, LESK, PATH, WUP, LIN, and VECTOR
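As a rough illustration, the four string measures named above might be implemented like this. Each is one common normalisation; the slides do not fix the exact variants, and averaging them into a single StrSim score is an assumption of this sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def prefix_sim(a: str, b: str) -> float:
    """Length of the common prefix, normalised by the longer string."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))

def suffix_sim(a: str, b: str) -> float:
    return prefix_sim(a[::-1], b[::-1])

def lev_sim(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def ngram_sim(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of character n-grams."""
    A = {a[i:i + n] for i in range(len(a) - n + 1)}
    B = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(A & B) / max(len(A | B), 1)

def str_sim(a: str, b: str) -> float:
    # Averaging the four measures is an assumed aggregation.
    return (prefix_sim(a, b) + suffix_sim(a, b)
            + lev_sim(a, b) + ngram_sim(a, b)) / 4
```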
16. Prune Groups by Similarity Matching
Similarity between initial groups {G1, G2, ..., Gk}:

  Sim(Gi, Gj) = (StrSim(O(Gi), O(Gj)) + WNSim(T(Gi), T(Gj))) / 2

Prune the initial groups:
If Sim(Gi, Gj) is higher than the predefined similarity threshold, we merge Gi and Gj.
If an initial group Gi has not been merged and has only one PO (predicate-object) pair, we remove Gi.
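The two pruning rules can be sketched as follows (a minimal illustration with a toy stand-in similarity function and an illustrative threshold, not the actual implementation):

```python
def prune_groups(groups, sim, threshold=0.3):
    """Prune initial groups following the two rules above.

    groups: list of sets of (predicate, object) pairs.
    sim: function returning the combined similarity of two groups,
    standing in for (StrSim + WNSim) / 2.
    """
    groups = [set(g) for g in groups]
    merged = [False] * len(groups)
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            if groups[i] and groups[j] and sim(groups[i], groups[j]) >= threshold:
                groups[i] |= groups[j]          # merge Gj into Gi
                groups[j] = set()
                merged[i] = merged[j] = True
    # An unmerged group holding a single predicate-object pair is removed.
    return [g for k, g in enumerate(groups) if g and (merged[k] or len(g) > 1)]

def obj_overlap(gi, gj):
    """Toy stand-in similarity: Jaccard overlap of the groups' object values."""
    oi = {o for _, o in gi}
    oj = {o for _, o in gj}
    return len(oi & oj) / max(len(oi | oj), 1)

gs = [{("db-prop:population", "3439100"), ("db-onto:populationTotal", "3439100")},
      {("geo-onto:population", "3439100")},
      {("db-prop:motto", "Si monumentum requiris, circumspice")}]
pruned = prune_groups(gs, obj_overlap, threshold=0.5)
# The two population groups merge; the singleton motto group is removed.
```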
17. An Example of Similarity Calculation
Group   Predicate                                       Object
Gi      http://dbpedia.org/property/population          3439100
        http://dbpedia.org/ontology/populationTotal     3439100
Gj      http://www.geonames.org/ontology#population     3426354

Example of string-based similarity measures on pairwise objects:

Pairwise Objects        prefix   suffix   Levenshtein distance   n-gram
"3439100", "3426354"    0.29     0        0                      0.29

Example of WordNet-based similarity measures on pairwise terms:

Pairwise Terms           LCH   RES   HSO   JCN    LESK   PATH   WUP    LIN   VECTOR
population, population   1     1     1     1      1      1      1      1     1
population, total        0.4   0     0     0.06   0.03   0.11   0.33   0     0.06

  Sim(Gi, Gj) = (0.145 + 0.5825) / 2 = 0.36375
18. Predicate Grouping
Group related predicates from different ontology schemas, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
Prune groups by similarity matching
Refine groups using extracted relations
  Divide the pruned groups according to rdfs:domain and rdfs:range.
  Keep the groups with high frequency.
19. Mid-Ontology Learning Approach
20. Mid-Ontology Construction
Select terms for the Mid-Ontology
  Collect all the terms of the predicates in each refined group Gi.
  Collect all the pre-processed terms of P(Gi) (the predicates in Gi).
  Choose one term: the one with the highest frequency, preferring the longer term
  (e.g. "area" and "areaCode" are totally different).
Construct relations
  mo-prop:hasMembers links Mid-Ontology classes to their integrated predicates.
Construct the Mid-Ontology
  Automatically construct the Mid-Ontology using the selected terms and
  mo-prop:hasMembers.
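The term-selection rule above can be sketched as follows (a hypothetical helper, not the authors' code; frequency decides first, term length breaks ties):

```python
from collections import Counter

def select_class_term(term_lists):
    """Pick the name for a Mid-Ontology class from the pre-processed
    terms of its member predicates: highest frequency wins, and the
    longer term breaks ties (so a more specific term is preferred).
    """
    counts = Counter(t for terms in term_lists for t in terms)
    return max(counts, key=lambda t: (counts[t], len(t)))

# Hypothetical pre-processed terms of three grouped predicates:
print(select_class_term([["population"], ["population", "total"], ["population"]]))
# → population
```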
21. Experimental Evaluation
Evaluate the Mid-Ontology approach from four different aspects:
Evaluation of Data Reduction
Evaluation of Ontology Quality
Evaluation with a SPARQL Example
Analysis of Mid-Ontology Approach
22. Implementation
Environment
  Linux Ubuntu 10.10, 16 GB memory, 1 TB disk
  Core i7 CPU 880, 3.07 GHz
  Java, NetBeans 6.9
Virtuoso
  High-performance server for RDF storage
  SPARQL query endpoint
WordNet::Similarity
  Implemented in Perl
  Knowledge-based similarity measures
23. Experimental Data
DBpedia: cross-domain, 3.5 million things, 8.9 million URIs
Geonames: geographical domain, 7 million URIs
NYTimes: media domain, 10,467 news subject headings
We choose DBpedia as the core data set because of its wealth of inward
and outward links to other data sets.
24. Evaluation of Data Reduction
Evaluate the effectiveness of data reduction during the data
collection phase by comparing the number of instances.
Number of distinct instances during the data collection phase:

Data set   Before reduction   After owl:sameAs retrieval   After noisy-data removal
DBpedia    8,955,728          135,749 (1.52%)              88,506 (0.99%)
Geonames   7,479,714          128,961 (1.72%)              82,054 (1.10%)
NYTimes    10,467             9,226 (88.14%)               8,535 (81.54%)
Evaluation Analysis
The data sets are dramatically scaled down by keeping only the linked
instances that share related information.
Noisy instances, which could degrade the quality of the Mid-Ontology,
were successfully removed,
e.g. instances containing only a db-prop:hasPhotosCollection (broken
link) and an owl:sameAs link.
25. Evaluation of Ontology Quality
Evaluate the quality of the Mid-Ontology by validating whether the
predicates in each class share related information.

Accuracy of the Mid-Ontology:

  ACC(MO) = (1/n) * Σ_{i=1}^{n} (|Correct Predicates in Ci| / |Ci|)

  n: the number of classes
  |Ci|: the number of predicates in class Ci

Cardinality:

  Cardinality = |Number of Predicates| / |Number of Classes|
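Both metrics are straightforward to compute; a small sketch (hypothetical helper names), using the cardinality figures reported on the next slide as a check:

```python
def accuracy(classes):
    """ACC(MO): mean over classes of (correct predicates / total predicates).

    classes: list of (num_correct, num_total) per Mid-Ontology class.
    """
    return sum(c / t for c, t in classes) / len(classes)

def cardinality(num_predicates, num_classes):
    """Average number of predicates per Mid-Ontology class."""
    return num_predicates / num_classes

# The pruned Mid-Ontology has 180 predicates across 29 classes:
print(round(cardinality(180, 29), 2))  # → 6.21
```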
26. Evaluation of Ontology Quality
Improvement achieved by our approach
  MO_no_p_r: exact matching only (without the pruning and refining processes)
  MO: with both the pruning and refining processes

MO          Number of Classes   Number of Predicates   Cardinality   Accuracy
MO_no_p_r   11                  300                    27.27         68.78%
MO          29                  180                    6.21          90.10%

Evaluation Analysis
  Significantly improved the accuracy
  Decreased the cardinality (fewer predicates and more classes)
  Successfully removed unrelated predicates
27. Evaluation with a SPARQL Example
Evaluate the effectiveness of information retrieval with the
Mid-Ontology constructed by our approach.

Predicates grouped in mid-onto:population:

<rdf:Description rdf:about="mid-onto:population">
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/population"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/popLatest"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/populationTotal"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/ontology/populationTotal"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/einwohner"/>
  <mo-prop:hasMembers rdf:resource="http://www.geonames.org/ontology#population"/>
</rdf:Description>
28. Evaluation with a SPARQL Example
SPARQL: find places with a population of more than 10 million.

SELECT DISTINCT ?places
WHERE {
  mid-onto:population mo-prop:hasMembers ?prop .
  ?places ?prop ?population .
  FILTER (xsd:integer(?population) > 10000000) .
}

Single property for population                 Number of Results
http://dbpedia.org/property/population         177
http://dbpedia.org/property/popLatest          1
http://dbpedia.org/property/populationTotal    107
http://dbpedia.org/ontology/populationTotal    129
http://dbpedia.org/property/einwohner          1
http://www.geonames.org/ontology#population    244

Evaluation Analysis
  The query through mid-onto:population finds 517 places.
  Each single predicate returns fewer results under the same condition.
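The gain can be understood in terms of sets: the per-predicate result counts above sum to 659, while the integrated query returns 517 distinct places, so the result sets overlap and the Mid-Ontology query effectively returns their union. A toy sketch with made-up result sets:

```python
# Toy illustration (hypothetical place names): each single predicate has
# its own result set; the Mid-Ontology query returns their union.
results = {
    "db-prop:population":      {"Tokyo", "Delhi", "Shanghai"},
    "db-onto:populationTotal": {"Tokyo", "Lagos"},
    "geo-onto:population":     {"Delhi", "Karachi"},
}
union = set().union(*results.values())
print(len(union))                              # 5 distinct places
print(sum(len(r) for r in results.values()))   # 7 rows if queried one by one
```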
29. Analysis of Mid-Ontology Approach
Analyze whether we can successfully identify how the data sets are connected.

Sample classes in the Mid-Ontology:

DBpedia             DBpedia & Geonames   DBpedia & Geonames & NYTimes
mo-onto:birthdate   mo-onto:population   mo-onto:name
mo-onto:deathdate   mo-onto:prominence   mo-onto:long
mo-onto:motto       mo-onto:postal

Evaluation Analysis
  The predicates in DBpedia are heterogeneous.
  Linked instances between DBpedia and Geonames are about places.
  Linked instances among DBpedia, Geonames, and NYTimes are about
  events, persons, or places.
31. Possible Application
Find missing owl:sameAs links
e.g. find a missing owl:sameAs link with mo-onto:population:

http://dbpedia.org/resource/Cyclades   db-prop:population       "119549"
http://dbpedia.org/resource/Cyclades   db-prop:name             "Cyclades"
http://sws.geonames.org/259819/        geo-onto:population      "119549"
http://sws.geonames.org/259819/        geo-onto:alternateName   "Cyclades"

Add the owl:sameAs links:

http://dbpedia.org/resource/Cyclades   owl:sameAs   http://sws.geonames.org/259819/
http://sws.geonames.org/259819/        owl:sameAs   http://dbpedia.org/resource/Cyclades
32. Related Work
Construct an intermediate-layer ontology from geospatial, zoology,
and genetics data resources [Parundekar et al., 2010].
  Limited to a specific domain.
Construct an intermediate-level ontology by enriching an upper
ontology with new classes and properties [Damova et al., 2010].
  Still too large.
Analysis of basic properties of the SameAs network, the
Pay-Level-Domain network, and the Class-Level Similarity network
[Ding et al., 2010].
  Only frequent types are considered to analyze how data are connected.
33. Conclusion and Future Work
Conclusion
  Manually learning the heterogeneous ontology schemas of the linked
  open data sets is not feasible.
  An automatic Mid-Ontology learning approach can solve the
  heterogeneity problem by integrating related predicates.
  The Mid-Ontology achieves high accuracy and is effective for
  searching across various data sets.
  A simple Mid-Ontology can be constructed without learning the
  entire ontology schemas.
Future Work
  Apply the approach to the Billion Triple Challenge (BTC) data set.
  Crawl links to a depth of two or three without a core data set.
34. Questions?
Lihua Zhao, lihua@nii.ac.jp
Ryutaro Ichise, ichise@nii.ac.jp