This document outlines a metadata quality assurance framework. It discusses why data quality is important, what the framework can be used for, and its key principles. It then describes how metadata quality will be measured, including examining schema-independent structural features, use case scenarios, and cataloging known metadata problems. Specific discovery scenarios and their metadata requirements are provided as examples. The document concludes by outlining further steps to develop and implement the framework.
Dataset Catalogs as a Foundation for FAIR* Data – Tom Plasterer
BioPharma and the broader research community is faced with the challenge of simply finding the appropriate internal and external datasets for downstream analytics, knowledge-generation and collaboration. With datasets as the core asset, we wanted to promote both human and machine exploitability, using web-centric data cataloguing principles as described in the W3C Data on the Web Best Practices. To do so, we adopted DCAT (Data CATalog Vocabulary) and VoID (Vocabulary of Interlinked Datasets) for both RDF and non-RDF datasets at summary, version and distribution levels. Further, we’ve described datasets using a limited set of well-vetted public vocabularies, focused on cross-omics analytes and clinical features of the catalogued datasets.
BioPharma and FAIR Data, a Collaborative Advantage – Tom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
As BioPharma adapts to incorporate nimble networks of suppliers, collaborators, and regulators the ability to link data is critical for dynamic interoperability. Adoption of linked data paradigm allows BioPharma to focus on core business: delivering valuable therapeutics in a timely manner.
FAIR Data Knowledge Graphs–from Theory to Practice – Tom Plasterer
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen. Our processes enable simple creation of dataset records and linking to source data, providing a seamless federated knowledge graph for novice and advanced users alike.
Presented May 7th, 2019 at the Knowledge Graph Conference, Columbia University.
Making Data FAIR (Findable, Accessible, Interoperable, Reusable) – Tom Plasterer
What to do About FAIR…
In the experience of most pharma professionals, FAIR remains fairly abstract, bordering on inconclusive. This session will outline specific case studies – real problems with real data, and address opportunities and real concerns.
· Why making data Findable, Accessible, Interoperable and Reusable is important.
Talk presented at the Data Driven Drug Development (D4) conference on March 20th, 2019.
OpenTox - an open community and framework supporting predictive toxicology an... – Barry Hardy
Presented at ACS Boston 2015 at a Session on the growing impact of Open Science chaired by Andy Lang and Tony Williams dedicated to the work, memory and legacy of JC Bradley and the work we carry forward!
One important goal of OpenTox is to support the development of an Open Standards-based predictive toxicology framework that provides a unified access to toxicological data and models. OpenTox supports the development of tools for the integration of data, for the generation and validation of in silico models for toxic effects, libraries for the development and integration of modelling algorithms, and scientifically sound validation and reporting routines.
The OpenTox Application Programming Interface (API) is an important open standards development for software development purposes. It provides a specification against which development of global interoperable toxicology resources by the broader community can be carried out. The use of OpenTox API-compliant web services to communicate instructions between linked resources with URI addresses supports the use of a wide variety of commands to carry out operations such as data integration, algorithm use, model building and validation. The OpenTox Framework currently includes, with its APIs, services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, reporting, investigations, studies, assays, and authentication and authorisation, which may be combined into multiple applications satisfying a variety of different user needs. As OpenTox creates a semantic web for toxicology, it should be an ideal framework for incorporating toxicology data, ontology and modelling developments, thus supporting both a mechanistic framework for toxicology and best practices in statistical analysis and computational modelling.
In this presentation I will review the recent OpenTox-based development of applications including the ToxBank data infrastructure supporting integrated analysis across biochemical, functional and omics datasets supporting the safety assessment goals of the SEURAT-1 program which aims to develop alternatives to animal testing.
Finally, I will provide an overview of the working group activities of the newly formed OpenTox Association which aim to progress the development of open source, data, standards and tools in this area.
http://wiki.knoesis.org/index.php/MaterialWays
http://www.knoesis.org/?q=research/semMat
Abstract
The sharing, discovery, and application of materials science and engineering data and documents are possible only if domain scientists are able and willing to do so. We need to overcome technological challenges such as the development of convenient computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data, and cultural challenges such as proper protection, control, and credit for sharing data. Our thesis and value proposition is that associating machine-processable semantics with materials science and engineering data and documents can provide a solid foundation for overcoming challenges associated with data discovery, integration, and interoperability caused by data heterogeneity. Specifically, easy to use and low upfront cost lightweight semantics in the form of file-level annotation can enable document discovery and sharing, while deeper data-level annotation using standardized ontologies can benefit semantic search and summarization. Machine processability achieved through fine-grained semantic annotation, extraction, and translation can enable data integration, interoperability and reasoning, ultimately leading to Linked Open Materials Science Data. Thus, a different granularity of semantics provides a continuum of cost/ease of use and expressiveness trade-off. In this presentation, we also show the application of semantic techniques for content extraction from materials and process specifications which are semi-structured and table-rich, and the application of semantic web techniques and technologies for materials vocabulary integration and curation (via semantic media wiki), semantic web visualization, efficient representation of provenance metadata and access control (via singleton property), and biomaterials information extraction
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ... – Tom Plasterer
Edge Informatics is an approach to accelerate collaboration in the BioPharma pipeline. By combining technical and social solutions knowledge can be shared and leveraged across the multiple internal and external silos participating in the drug development process. This is accomplished by making data assets findable, accessible, interoperable and reusable (FAIR). Public consortia and internal efforts embracing FAIR data and Edge Informatics are highlighted, in both preclinical and clinical domains.
This talk was presented at the Molecular Medicine Tri-Conference in San Francisco, CA on February 20, 2017
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ... – Tom Plasterer
As scientists in the life sciences we are trained to pursue singular goals around a publication or a validated target or a drug submission. Our failure rates are exceedingly high especially as we move closer to patients in the attempt to collect sufficient clinical evidence to demonstrate the value of novel therapeutics. This wastes resources as well as time for patients depending upon us for the next breakthrough.
Edge Informatics is an approach to ameliorate these failures. Using both technical and social solutions together knowledge can be shared and leveraged across the drug development process. This is accomplished by making data assets discoverable, accessible, self-described, reusable and annotatable. The Open PHACTS project pioneered this approach and has provided a number of the technical and social solutions to enable Edge Informatics. A number of pre-competitive consortia and some content providers have also embraced this approach, facilitating networks of collaborators within and outside a given organization. When taken together more accurate, timely and inclusive decision-making is fostered.
OSFair2017 Training | FAIR metrics - Starring your data sets – Open Science Fair
Peter Doorn, Marjan Grootveld & Elly Dijk talk about FAIR data principles and present the assessment tool that DANS is developing for data repositories | OSFair2017 Workshop
Workshop title: FAIR metrics - Starring your data sets
Workshop overview:
Do you want to join our effort to put the FAIR data principles into practice? Come and explore the assessment tool that DANS, Data Archiving and Networked Services in the Netherlands, is developing for data repositories.
The aim of our work is to implement the FAIR principles into a data assessment tool so that every dataset which is deposited or reused from any digital repository can be assessed in terms of a score on the principles Findable, Accessible, Interoperable, and Reusable, using a ‘FAIRness’ scale from 1 to 5 stars. In this interactive session participants can explore the pilot version of FAIRdat: the FAIR data assessment tool. The organisers would like to inform you about the project, and look forward to all feedback to improve the tool, or to improve the metrics that are used.
DAY 3 - PARALLEL SESSION 7
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind... – Nandana Mihindukulasooriya
Thesis PDF version: https://oa.upm.es/62935/
In the era of digital transformation, where most decision-making and artificial intelligence (AI) applications are becoming data-driven, data is becoming an essential asset. Linked Data, published in structured, machine-readable formats, with explicit semantics using Semantic Web standards, and with links to other data, is even more useful. The Linked (Open) Data cloud is growing with millions of new triples each year. Nevertheless, as we discuss in this thesis, such vast amounts of data bring several new challenges in ensuring the quality of Linked Data. The main goal of this thesis is to propose novel and scalable methods for automatic quality assessment and repair of Linked Data. The motivation for it is to significantly reduce the manual effort required by current quality assessment and repair, and to propose novel methods suitable for large-scale Linked Data sources such as DBpedia or Wikidata. The main hypothesis of this work is that data profiling metrics and automatic RDF Shape induction can be used to develop scalable and automatic quality assessment and repair methods. In this context, the following main contributions are delivered in this thesis: • LDQM, a Linked Data Quality Model for representing Linked Data quality in a standard manner and LD Sniffer, a tool based on LDQM for validating accessibility of Linked Data. LDQM contains 15 quality characteristics, 89 base measures, 23 derived measures, and 124 quality indicators. • Loupe, a framework for Linked Data profiling that includes the Loupe Extended Dataset Description Model and a suite of Linked Data profiling tools. The model consists of 84 Linked Data profiling metrics useful for quality assessment and repair tasks. Loupe tools have been used to evaluate 26 thousand datasets containing 34 billions of triples and Loupe contributed to the winning system of ISWC Semantic Web Challenge 2017. The Loupe Web portal has been visited more than 40,000 times by ~3000 unique visitors from 87 countries. • An automatic RDF Shape induction method that follows a data-driven approach to induce integrity constraints using data profiling metrics as features. The proposed method achieved an F1 of 98.81% in deriving maximum cardinality constraints, an F1 of 97.30% in deriving minimum cardinality constraints, and an F1 of 95.94% in deriving range constraints. • Four methods for automatic quality assessment and repair using RDF Shapes and data profiling metrics. They are motivated by several practical use cases that cover both Linked Data generation process and output and also cover both public and enterprise data. The four methods include (a) a method for detecting inconsistent mappings, (b) a method for detecting and eliminating noisy triples produced by open information extraction tools, (c) a method to repair links in RDF data, and (d) a method to complete type information in Linked Data ...
Engaging Information Professionals in the Process of Authoritative Interlinki... – Lucy McKenna
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
A Linked Data Prototype for the Union Catalog of Digital Archives Taiwan – andrea huang
The linked data paradigm provides the potential for any data to link, or to be linked, with structural information, internally and externally. To improve the current cultural service of the Union Catalog of Digital Archives Taiwan (catalog.digitalarchives.tw), a linked data prototype has been developed that benefits from extending the Art & Architecture Thesaurus (AAT) for a machine-understandable catalog service.
However, knowledge engineering is time- and labour-consuming, especially for an archive that is non-western in culture and multidisciplinary in nature. This makes mapping the data semantics of the UCdaT to international standards and vocabularies extremely challenging.
At this stage, the triple store is an experimental addition to the existing Union Catalog of Digital Archives Taiwan architecture, and provides semantic links to target collections for related suggestions. This will guide us in creating a future technical architecture that is scalable to the whole archive level, compliant with learning-by-doing guidelines, and preserves data even when it cannot be fully understood at present, so that it can at least be linked by others who may contribute their own third-party understandings for reuse.
From local to global: Romanian cultural values in Europeana through Locloud – locloud
Presentation given by Sorina Stanca
Cluj County Library, Romania
LoCloud Conference
Sharing local cultural heritage online with LoCloud services
Amersfoort, Netherlands
5 February 2016
KB domain aggregator for publications to DigitaleCollectie.nl – Elco van Staveren
The KB is the domain aggregator for metadata of digitised publications (books, newspapers and journals) in the Netherlands. I gave this presentation at the study day 'De grote gemene deler' of DigitaleCollectie.nl, on 5 June 2013 at the RCE in Amersfoort. It is a call to run digitisation through Metamorfoze and thereby to choose standardisation.
BEST PRACTICE: Value is the key to opening more doors – Dramatically enhance ... – B2B Marketing
BEST PRACTICE: Value is the key to opening more doors – Dramatically enhance your value prop through data, love, and relevance
Cory Polonetsky, senior director, value proposition initiative, Elsevier
Small, smaller and smallest: working with small archaeological content provid... – locloud
Presentation given by Holly Wright
Archaeology Data Service University of York, UK
LoCloud Conference
Sharing local cultural heritage online with LoCloud services
Amersfoort, Netherlands
5 February 2016
In this presentation I will give a brief overview of The Lord of the Rings, so please see all the slides of this presentation and leave comments for my improvement. Thanks
Part 4 of tutorials at DC2008, Berlin. (International Conference on Dublin Core and Metadata Applications). See also part 1-3 by Jane Greenberg, Pete Johnston, and Mikael Nilsson on DC history, concepts, and other schemas. This part focuses on practical issues.
Knowledge graphs are an emerging paradigm to represent information, yet their discovery and reuse is hampered by insufficient or inadequate metadata. Here, the COST Action Distributed Knowledge Graphs held a first workshop to develop a KG metadata schema. In this presentation, the progress and plans are discussed with the W3C Community Group on Knowledge Graph Construction.
Meemoo manages a large quantity of mainly audiovisual material from more than 170 partners in cultural heritage and media. More than 6 million objects are currently stored, ranging from digitised newspapers, photos, videos, and audio. In addition, a number of access platforms make the digitised content available to specific target groups, including teachers, students, professional re-users, or the public.
Metadata is a key element in all of meemoo’s processes. An important part of our activities is to collect, integrate, manage, and search a large variety of heterogeneous metadata across the archived content. The scale of this has increased enormously, so a good and integrated approach is needed to deal with the amount of metadata, its need for flexibility, and how easy it is to find. One of the specific challenges is modelling and storing data from machine learning algorithms (speech recognition, face detection and entity recognition) for reuse.
In this talk, we will discuss the key points and lessons learned from implementing the new metadata roadmap at Meemoo, which is focused on a Knowledge Graph-based infrastructure. The goal of the roadmap is to establish a better data practice within the organization and offer application-independent, uniform access to (meta)data that is spread across various systems and formats.
RO-Crate: A framework for packaging research products into FAIR Research Objects – Carole Goble
RO-Crate: A framework for packaging research products into FAIR Research Objects presented to Research Data Alliance RDA Data Fabric/GEDE FAIR Digital Object meeting. 2021-02-25
This presentation was provided by Vinod Chachra of VTLS Inc. during the NISO event "Next Generation Discovery Tools: New Tools, Aging Standards," held March 27 - March 28, 2008.
DITA, Semantics, Content Management, Dynamic Documents, and Linked Data – A M... – Paul Wlodarczyk
DITA was conceived as a model for improving reuse through topic-oriented modularization of content. Instead of creating new content or copying and pasting information which may or may not be current and authoritative, organizations manage a repository of content assets – or DITA topics – that can be centrally managed, maintained and reused across the enterprise. This helps to accelerate the creation and maintenance of documents and other deliverables and to ensure the quality and consistency of the content organizations publish. But the next frontier of DITA adoption is leveraging semantic technologies—taxonomies, ontologies and text analytics—to automate the delivery of targeted content. For example, a service incident from a customer is automatically matched with the appropriate response, which is authored and managed as a DITA topic. Learn how organizations can leverage DITA, semantics, content management, dynamic documents, and linked data to fully utilize the value of their information.
This presentation was provided by David Kuliman of Elsevier, during the NISO event "Content Presentation: Diversity of Formats." The webinar was held on February 10, 2021.
FAIRy stories: the FAIR Data principles in theory and in practice – Carole Goble
https://ucsb.zoom.us/meeting/register/tZYod-ippz4pHtaJ0d3ERPIFy2QIvKqjwpXR
FAIRy stories: the FAIR Data principles in theory and in practice
The ‘FAIR Guiding Principles for scientific data management and stewardship’ [1] launched a global dialogue within research and policy communities and started a journey to wider accessibility and reusability of data and preparedness for automation-readiness (I am one of the army of authors). Over the past 5 years FAIR has become a movement, a mantra and a methodology for scientific research and increasingly in the commercial and public sector. FAIR is now part of NIH, European Commission and OECD policy. But just figuring out what the FAIR principles really mean and how we implement them has proved more challenging than one might have guessed. To quote the novelist Rick Riordan “Fairness does not mean everyone gets the same. Fairness means everyone gets what they need”.
As a data infrastructure wrangler I lead and participate in projects implementing forms of FAIR in pan-national European biomedical Research Infrastructures. We apply web-based, industry-led approaches like Schema.org; work with big pharma on specialised FAIRification pipelines for legacy data; promote FAIR by Design methodologies and platforms into the researcher lab; and expand the principles of FAIR beyond data to computational workflows and digital objects. Many use Linked Data approaches.
In this talk I’ll use some of these projects to shine some light on the FAIR movement. Spoiler alert: although there are technical issues, the greatest challenges are social. FAIR is a team sport. Knowledge Graphs play a role – not just as consumers of FAIR data but as active contributors. To paraphrase another novelist, “It is a truth universally acknowledged that a Knowledge Graph must be in want of FAIR data.”
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
Enabling Secure Data Discoverability (SC21 Tutorial) – Globus
Major research instruments are generating orders of magnitude more data in relatively short timeframes. As a result, the research enterprise is increasingly challenged by what should be mundane tasks: describing data for discovery and making data securely accessible to the broader research community. The ad hoc methods currently employed place undue burden on scientists and system administrators alike, and it is clear that a more robust, scalable approach is required.
Bespoke data portals (and science gateways/data commons) are becoming more prominent as a means of enabling access to large datasets. In this tutorial we demonstrate how services for authentication, authorization, metadata management, and search may be integrated with popular web frameworks, and used in combination with fast, well-architected networks to make data discoverable and accessible. Outcomes: build a simple, but functional, data portal that facilitates flexible data description, faceted data search and secure data access.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Opendatabay - Open Data Marketplace.pptx – Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... – pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Metadata Quality Assurance Part II. The implementation begins
1. Metadata Quality Assurance Framework
Part II. – The implementation begins
Péter Király
peter.kiraly@gwdg.de
Göttingen, Geiststraße 10, GWDG meeting room 20/05/2016
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
2. Metadata Quality Assurance Framework
Why is data quality important?
„Fitness for purpose”
no metadata → no access to data → no data usage
More explanation:
Data on the Web Best Practices
W3C Working Draft 17 December 2015
http://www.w3.org/TR/2015/WD-dwbp-20151217/
3. Metadata Quality Assurance Framework
What is it good for?
Improve the metadata
Improve the metadata schema and its documentation
Propagate „good practice”
Improve services: „good” data is ranked higher in the search result list
Specifically for GWDG:
Could be built into current and planned data management / data archiving tools
4. Metadata Quality Assurance Framework
Project principles
Full transparency
Open source, open data (CC0)
Minimal viable product
„Release early. Release often. And listen to your customers” (Eric S. Raymond)
„Eat your own dog food”
Getting real https://gettingreal.37signals.com/
5. Metadata Quality Assurance Framework
Measurements
Schema-independent structural features
Existence, cardinality, uniqueness
Use case scenarios („fit for purpose”)
Requirements of the most important functions
Problem catalog
Known metadata problems
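As an illustration of the schema-independent structural measurements above, here is a minimal sketch in Java; the record shape (a simple map from field names to value lists) is an assumption made for this example, not the framework's actual API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of schema-independent structural measurements:
// existence, cardinality and uniqueness of field values.
public class StructuralMeasurements {

    // Occurrence counts of field values across the whole collection,
    // needed to decide uniqueness after all records have been seen.
    private final Map<String, Map<String, Integer>> valueCounts = new HashMap<>();

    /** Existence: does the record contain at least one instance of the field? */
    public boolean exists(Map<String, List<String>> record, String field) {
        List<String> values = record.get(field);
        return values != null && !values.isEmpty();
    }

    /** Cardinality: how many instances of the field does the record contain? */
    public int cardinality(Map<String, List<String>> record, String field) {
        List<String> values = record.get(field);
        return values == null ? 0 : values.size();
    }

    /** Register the field's values, so uniqueness can be measured over the collection. */
    public void collect(Map<String, List<String>> record, String field) {
        for (String value : record.getOrDefault(field, Collections.emptyList())) {
            valueCounts
                .computeIfAbsent(field, f -> new HashMap<>())
                .merge(value, 1, Integer::sum);
        }
    }

    /** Uniqueness: a value is unique if it occurs exactly once in the collection. */
    public boolean isUnique(String field, String value) {
        return valueCounts
            .getOrDefault(field, Collections.emptyMap())
            .getOrDefault(value, 0) == 1;
    }
}
```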
6. Metadata Quality Assurance Framework
Europeana Data Quality Committee
Online collaboration
Use case documents
Problem catalog
Tickets
Discussion forum
#EuropeanaDataQuality
Bi-weekly teleconf
Bi-yearly face-to-face meeting
Topics
Usage scenarios
Metadata profiles
Schema modification
Measuring
Event model
7. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements
1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets
Credit: the document was initialized by Tim Hill, Europeana’s search engineer
8. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements – 3. Entity-based facets
Scenario
As a user, ... I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.
Metadata analysis
In each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages.
Measurement rules
The relevant field values should be resolvable URIs (a minimal check is sketched below)
Each URI should have labels in multiple languages
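A hedged sketch of the first rule, checking that a field value is a syntactically valid and resolvable URI; the second rule (labels in multiple languages) would additionally require dereferencing the URI and inspecting the returned data, which is omitted here.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Sketch of the "resolvable URI" measurement rule (illustrative only).
public class UriValueCheck {

    /** Syntactic check: is the value an absolute http(s) URI rather than free text? */
    public static boolean looksLikeUri(String value) {
        try {
            URI uri = new URI(value.trim());
            return uri.isAbsolute()
                && ("http".equals(uri.getScheme()) || "https".equals(uri.getScheme()));
        } catch (Exception e) {
            return false;
        }
    }

    /** Resolvability check: does an HTTP HEAD request return a 2xx or 3xx status? */
    public static boolean resolves(String value) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) new URL(value.trim()).openConnection();
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            int status = connection.getResponseCode();
            return status >= 200 && status < 400;
        } catch (Exception e) {
            return false;
        }
    }
}
```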
9. Metadata Quality Assurance Framework
Discovery scenarios and their metadata requirements – 4. Date-based facets
Scenario
I want to be able to filter my results by a variety of timespans, e.g.:
Date of creation
Date of publication
Date as subject
Metadata analysis
Dates should be fully and consistently normalised to follow the XSD date-time data types. Dates expressed in styles like “490 avant J.C” that are inherently language dependent should be avoided as they’re very difficult to normalise (e.g. this should be represented as “-0490”^^xsd:gYear).
Measurement rules
Field values should use the XSD date-time data types (a minimal check is sketched below)
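A minimal, regex-based sketch of this check for a few common XSD lexical forms; a complete implementation would also handle time zones, xsd:gYearMonth and the remaining date/time types.

```java
import java.util.regex.Pattern;

// Sketch: does a field value follow an XSD date/time lexical form?
public class XsdDateCheck {

    private static final Pattern G_YEAR   = Pattern.compile("-?\\d{4,}");                 // xsd:gYear, e.g. "-0490"
    private static final Pattern DATE     = Pattern.compile("-?\\d{4,}-\\d{2}-\\d{2}");   // xsd:date
    private static final Pattern DATETIME =
        Pattern.compile("-?\\d{4,}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?");        // xsd:dateTime

    public static boolean isXsdDate(String value) {
        String v = value.trim();
        return G_YEAR.matcher(v).matches()
            || DATE.matcher(v).matches()
            || DATETIME.matcher(v).matches();
    }

    public static void main(String[] args) {
        System.out.println(isXsdDate("-0490"));         // true  (xsd:gYear)
        System.out.println(isXsdDate("1789-07-14"));    // true  (xsd:date)
        System.out.println(isXsdDate("490 avant J.C")); // false (language-dependent style)
    }
}
```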
10. Metadata Quality Assurance Framework
Problem catalog
Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field
Credit: the document was initialized by Tim Hill, Europeana’s search engineer
11. Metadata Quality Assurance Framework
Problem catalog
Description: Title contents same as description contents
Example: /2023702/35D943DF60D779EC9EF31F5DF...
Motivation: Distorts search weightings
Checking method: Field comparison (sketched below)
Notes: Record display: creator concatenated onto title
Metadata scenario: Basic Retrieval
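A minimal sketch of the field-comparison checking method for this problem; the field names (dc:title, dc:description) and the map-based record shape are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

// Sketch: flag records whose title is repeated verbatim (after normalisation)
// in the description, which distorts search weightings.
public class TitleEqualsDescriptionCheck {

    // Lower-case and collapse whitespace so trivial differences do not hide the problem.
    private static String normalise(String s) {
        return s == null ? "" : s.toLowerCase().replaceAll("\\s+", " ").trim();
    }

    public static boolean titleEqualsDescription(Map<String, List<String>> record) {
        for (String title : record.getOrDefault("dc:title", List.of())) {
            for (String description : record.getOrDefault("dc:description", List.of())) {
                if (!title.isEmpty() && normalise(title).equals(normalise(description))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```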
12. Metadata Quality Assurance Framework
Problem catalog – proposed basis of implementation
Shapes Constraint Language (SHACL)
https://www.w3.org/TR/shacl/
SHACL (Shapes Constraint Language) is a language for describing and constraining the contents of RDF graphs. SHACL groups these descriptions and constraints into "shapes", which specify conditions that apply at a given RDF node. Shapes provide a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints.
sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern
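To illustrate the constraint types listed above, here is a plain-Java sketch that mirrors sh:minCount/sh:maxCount, sh:minLength/sh:maxLength and sh:pattern for a single field; an actual implementation would express the shapes in Turtle and evaluate them with a SHACL engine rather than hand-coding them.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch: SHACL-style constraints on one metadata field, expressed in plain Java.
public class FieldShape {

    private final int minCount;
    private final int maxCount;
    private final int minLength;
    private final int maxLength;
    private final Pattern pattern; // null if no sh:pattern-like constraint is defined

    public FieldShape(int minCount, int maxCount, int minLength, int maxLength, String regex) {
        this.minCount = minCount;
        this.maxCount = maxCount;
        this.minLength = minLength;
        this.maxLength = maxLength;
        this.pattern = regex == null ? null : Pattern.compile(regex);
    }

    /** Validates all instances of one field against the shape. */
    public boolean conforms(List<String> values) {
        if (values.size() < minCount || values.size() > maxCount) {
            return false;                                  // sh:minCount / sh:maxCount
        }
        for (String value : values) {
            if (value.length() < minLength || value.length() > maxLength) {
                return false;                              // sh:minLength / sh:maxLength
            }
            if (pattern != null && !pattern.matcher(value).matches()) {
                return false;                              // sh:pattern
            }
        }
        return true;
    }
}
```

For example, a shape requiring exactly one title of at most 200 characters could be written as new FieldShape(1, 1, 1, 200, null).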
31. Metadata Quality Assurance Framework
Problem catalog – Long subject – example (not so long...)
Conclusion: we have to refine the definition of „long”
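One possible, purely illustrative way to refine the definition of „long” is to derive the threshold from the length distribution of the field across the whole collection (for example the 95th percentile) instead of fixing an absolute number.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: a data-driven threshold for "long" field values.
public class LongValueThreshold {

    /** Returns the character length at the given percentile of the value-length distribution. */
    public static int percentileLength(List<String> values, double percentile) {
        if (values.isEmpty()) {
            return 0;
        }
        List<Integer> lengths = new ArrayList<>();
        for (String value : values) {
            lengths.add(value.length());
        }
        Collections.sort(lengths);
        int index = (int) Math.ceil(percentile / 100.0 * lengths.size()) - 1;
        return lengths.get(Math.max(index, 0));
    }

    /** A value counts as "long" if it exceeds the derived threshold. */
    public static boolean isLong(String value, int threshold) {
        return value.length() > threshold;
    }
}
```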
36. Metadata Quality Assurance Framework
Further steps
Building completeness measurements into Europeana’s ingestion tool
Including usage statistics (log files, Google Analytics API)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning:
Classification/Clustering of records
Statistical relevancy of measurements
Göttingen use case: proposed SUB project „Shared Print Study”
Göttingen use case: incorporating into research data management tool
Cooperation with other projects
37. Metadata Quality Assurance Framework
Architectural overview (diagram)
Components: OAI-PMH client (PHP), Hadoop File System, Apache Spark (Java), analysis with Spark (Scala), analysis with R, Apache Solr, Apache Cassandra, web interface (PHP, d3.js); intermediate artefacts: JSON, CSV and image files.
The diagram distinguishes the recent workflow from the planned workflow.
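A minimal sketch of what the Spark-based analysis step in this workflow might look like; all paths, field names and the chosen measurement are placeholder assumptions. It reads harvested JSON records from the Hadoop File System, computes a simple completeness figure, and writes a CSV report for the downstream R analysis and the web interface.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of a completeness measurement job in the Spark part of the workflow.
public class CompletenessJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("metadata-completeness")
            .getOrCreate();

        // Harvested records, stored as JSON on the Hadoop File System (placeholder path).
        Dataset<Row> records = spark.read().json("hdfs:///metadata/records/*.json");

        // Completeness of one field: share of records where the field is populated.
        long total = records.count();
        long withTitle = records.filter(records.col("title").isNotNull()).count();
        System.out.printf("title completeness: %.4f%n", (double) withTitle / total);

        // Per-provider record counts, written as CSV for the R analysis / web interface.
        records.groupBy("provider").count()
            .write().option("header", "true").csv("hdfs:///metadata/reports/provider-counts");

        spark.stop();
    }
}
```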