OpenMinTeD hosted a series of webinars on interoperability. These slides are from the webinar on interoperability at the level of metadata. The full webinar recording is accessible at: https://www.fosteropenscience.eu/content/achieving-interoperability-between-resources-involved-tdm-level-metadata
WP3 Further specification of Functionality and Interoperability - Gradmann (Europeana)
The document discusses issues and recommendations for Work Group 3.2 on semantic and multilingual aspects of the Europeana digital library. Key points include:
- Europeana surrogates need rich semantic context in areas like place, time, people and concepts.
- The types of links between surrogates and semantic nodes, as well as the semantic technologies used, need to be determined.
- Support for multiple European languages in areas like search queries, results and functionality is important but requires further scope definition and identification of language resources.
Tutorial at OAI5 (cern.ch/oai5). Abstract: This tutorial will provide a practical overview of current practices in modelling complex or compound digital objects. It will examine some of the key scenarios around creating complex objects and will explore a number of approaches to packaging and transport. Taking research papers, or scholarly works, as an example, the tutorial will explore the different ways in which these, and their descriptive metadata, can be treated as complex objects. Relevant application profiles and metadata formats will be introduced and compared, such as Dublin Core, in particular the DCMI Abstract Model, and MODS, alongside content packaging standards, such as METS, MPEG-21 DIDL and IMS CP. Finally, we will consider some future issues and activities that are seeking to address these. The tutorial will be of interest to librarians and technical staff with an interest in metadata or complex objects, their creation, management and re-use.
This document discusses annotation services provided by Brown University Library for annotating digital texts. It describes several digital humanities projects at Brown that involve annotation. It then explains how the library uses AtomPub and RDF to publish annotations on the web as Linked Open Data with metadata and links back to the annotated sources. Users can annotate portions of documents and their annotations will be ingested into the repository and syndicated as Atom feeds that others can subscribe to.
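To make the RDF side of this concrete, here is a minimal sketch, assuming the rdflib Python library and the W3C Open Annotation (oa:) vocabulary, of how an annotation with a textual body and a link back to the annotated source could be expressed. All URIs are hypothetical placeholders, not Brown's actual identifiers, and this is not Brown's pipeline.

```python
# Minimal sketch: one annotation expressed as RDF with the oa: vocabulary.
# All URIs below are hypothetical placeholders.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DCTERMS

OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("oa", OA)
g.bind("dcterms", DCTERMS)

anno = URIRef("http://example.org/annotations/1")            # hypothetical
body = URIRef("http://example.org/annotations/1/body")       # hypothetical
target = URIRef("http://example.org/texts/pico/section-42")  # hypothetical

g.add((anno, RDF.type, OA.Annotation))
g.add((anno, OA.hasBody, body))
g.add((anno, OA.hasTarget, target))          # link back to the annotated source
g.add((anno, DCTERMS.creator, Literal("A. Scholar")))
g.add((body, RDF.type, OA.TextualBody))
g.add((body, RDF.value, Literal("This passage echoes an earlier thesis.")))

print(g.serialize(format="turtle"))
```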
The document discusses three case studies related to making organizational taxonomies and resources more interoperable:
1) Integrating metadata across three Victorian government departments by aggregating, rationalizing, and harmonizing their schemas.
2) Repatriating cultural resources from the Quinkan people by using Dublin Core metadata with local extensions to provide a single access point for internal and external users.
3) The AccessForAll project, which uses metadata to match educational resources to individual learner needs and preferences to ensure equal accessibility. Standards are discussed as a way to balance local specificity and global interoperability.
JeromeDL is a social semantic digital library that allows users to easily publish and access resources online through metadata tagging and community sharing features. It integrates information from different metadata sources, provides interoperability between systems, and delivers more robust search interfaces powered by semantics. Resources are accessible by machines through rich metadata and the system involves the community in sharing knowledge through social features like comments, bookmarks, and user profiles.
Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts, identifying frequent concept sets and associations between concepts. Text mining systems aim to analyze document collections over time to identify trends, ephemeral relationships and anomalous patterns.
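As a toy illustration of the core operations mentioned above (distributions of concepts and co-occurring concept sets), the sketch below counts term frequencies and pairwise co-occurrences over a few invented documents; a real system would use proper tokenization and concept extraction rather than plain word splitting.

```python
# Toy bookkeeping for two core text-mining operations: term distributions
# and pairwise co-occurrence counts. Documents are invented examples.
from collections import Counter
from itertools import combinations

docs = [
    "metadata improves discovery of text resources",
    "text mining extracts patterns from text resources",
    "metadata and text mining support interoperability",
]

term_freq = Counter()
cooccurrence = Counter()
for doc in docs:
    terms = set(doc.lower().split())          # crude stand-in for concept extraction
    term_freq.update(terms)
    cooccurrence.update(combinations(sorted(terms), 2))

print(term_freq.most_common(3))
print(cooccurrence.most_common(3))
```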
Information extraction involves extracting structured information from unstructured text. The goal is to identify named entities, relations between entities, and populate a database. This may also include event extraction, resolving temporal expressions, and wrapper induction. Common tasks include named entity recognition, co-reference resolution, relation extraction, and event extraction. Statistical methods like conditional random fields are often used. Evaluation involves measuring precision and recall.
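The evaluation step mentioned above can be sketched very simply: compare the set of extracted entity mentions against a gold standard and compute precision, recall and F1. The example spans below are invented for illustration.

```python
# Minimal precision/recall/F1 over sets of (mention, type) pairs.
def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                     # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Marie Curie", "PERSON"), ("Warsaw", "LOCATION"), ("1867", "DATE")}
predicted = {("Marie Curie", "PERSON"), ("Warsaw", "PERSON")}
print(precision_recall_f1(predicted, gold))        # (0.5, 0.333..., 0.4)
```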
DBpedia Spotlight is a system that automatically annotates text documents with DBpedia URIs. It identifies mentions of entities in text and links them to the appropriate DBpedia resources, addressing the challenge of ambiguity. The system is highly configurable, allowing users to specify which types of entities to annotate and the desired balance of coverage and accuracy. An evaluation found DBpedia Spotlight performed competitively compared to other annotation systems.
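As an illustration of how such an annotation service is typically called, here is a hedged sketch of an HTTP request to a DBpedia Spotlight endpoint using Python's requests library. The endpoint URL, parameter names and response fields follow the commonly documented public API and may differ for a local deployment.

```python
# Hedged sketch: annotate a text via a DBpedia Spotlight HTTP endpoint.
import requests

def spotlight_annotate(text, confidence=0.5,
                       endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    """Return (surface form, DBpedia URI) pairs found in `text`."""
    resp = requests.get(
        endpoint,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])   # absent if nothing was found
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

if __name__ == "__main__":
    for surface, uri in spotlight_annotate("Berlin is the capital of Germany."):
        print(surface, "->", uri)
```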
The document discusses research into enabling customized accessibility for users by combining personal needs and preferences profiles (PNPs) with digital resource description metadata. PNPs describe a user's display, control, and content needs, while digital resource descriptions indicate how resources can be adapted across different modalities. Together, PNPs and resource descriptions allow for automatic, personalized accessibility adaptation of online content to individual user needs.
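A toy sketch of the matching idea described above: a personal needs and preferences profile is compared with resource descriptions to decide whether a resource already satisfies the profile or needs further adaptation. The field names and values are invented for illustration and are not taken from the AccessForAll specification.

```python
# Toy PNP-to-resource matching; all field names and values are invented.
pnp = {"needs": {"captions", "high-contrast"}}

resources = [
    {"id": "video-1", "adaptations": {"captions", "transcript"}},
    {"id": "video-2", "adaptations": set()},
]

for res in resources:
    missing = pnp["needs"] - res["adaptations"]
    if missing:
        print(res["id"], "needs additional adaptation for:", sorted(missing))
    else:
        print(res["id"], "already satisfies the profile")
```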
The document discusses technologies and infrastructure for publishing biodiversity data from environmental impact assessments (EIA). It covers the types and formats of EIA biodiversity data, tools for data capture and digitization, platforms for data discovery and publishing, ensuring data quality, and hosting data centers to facilitate long-term archiving and publishing of EIA biodiversity data.
The document discusses the concepts of semantic technology and the semantic web. It defines key concepts like tabula rasa, the network effect, and intelligence embedded in data through relationships. It also outlines technologies used in the semantic web like RDF, OWL, SPARQL, FOAF, and DBpedia and how search engines and companies are using these technologies for applications like sentiment analysis, natural language processing, and information extraction.
This document discusses Timbuctoo, an application designed for academic research that allows for complex and heterogeneous data. It explores archiving RDF datasets from Timbuctoo instances, including handling RDF graphs and triples, versioning datasets, and verifying dataset integrity and resolving links. A potential pipeline is proposed to ingest datasets from Timbuctoo into the EASY archive, but current Timbuctoo instances and datasets have obscure URIs and insufficient metadata, and the prototype pipeline lacks specifications. Archiving linked data from Timbuctoo could change the nature of preservation for archives.
This document provides an outline and overview of a seminar on text mining. It discusses basics of text mining including definitions, similarities to data mining, preprocessing operations, document features, and representational models of documents. It also describes general architectures of text mining systems and provides examples of system architectures for generic, domain-oriented, and advanced text mining systems with background knowledge bases.
Linkator: enriching web pages by automatically adding dereferenceable semanti... - Samur Araujo
In this paper, we introduce Linkator, an application architecture that exploits semantic annotations to automatically add links to previously generated web pages. Linkator provides a mechanism for dereferencing these semantic annotations with what it calls semantic links. Automatically adding links to web pages improves users' navigation: it connects the visited page with external sources of information that the user may be interested in, but that were not identified as such during the web page design phase. The process of auto-linking encompasses finding the terms to be linked and finding the destination of the link. Linkator delegates the first stage to external semantic annotation tools and concentrates on the process of finding a relevant resource to link to. In this paper, a use case is presented that shows how this mechanism can support knowledge workers in finding publications during their navigation on the web.
This document summarizes the Knowledge Engineering efforts for the TELDAP digital library project. It discusses (1) developing metadata models for different types of digital objects, including a union catalog model and models for websites and documents; (2) establishing hyperlinks between objects and keywords to connect related resources; and (3) constructing ontologies and thesauri like Getty AAT and Chinese WordNet to link keywords and establish implicit relationships between objects. The goal is to optimize access, retrieval and understanding of the large and growing collection of digital content.
The document discusses the Semantic Web, which refers to extending the current web by giving information well-defined meaning that computers can understand. It describes the evolution of the web from Web 1.0 to 3.0 and outlines key components that enable the Semantic Web like URIs, RDF, RDFS, OWL, and SPARQL. The technology brings benefits like improved search, interoperability, and opportunities for applications in areas like healthcare, e-learning, and more. Realizing its full potential will take generating vocabularies and developing applications that make use of shared semantic data.
Longwell is a tool that provides a graphical interface for exploring RDF data in a web browser. It displays types of resources as filters along the top and facets like properties on the right. Users can browse data by selecting types to view associated resources and properties. Queries powering Longwell return type and property frequencies to display, list properties for a selected type, and populate property panels with object values to enable interactive faceted browsing of RDF datasets.
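The kind of facet-count query described above can be sketched as a SPARQL GROUP BY over an RDF graph. The example below uses rdflib with a tiny in-memory dataset; it is not Longwell's actual query code, and the data is invented.

```python
# Sketch of a "type facet" count with rdflib and SPARQL; data is invented.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:a a ex:Book ;    ex:subject "metadata" .
ex:b a ex:Book ;    ex:subject "text mining" .
ex:c a ex:Article ; ex:subject "metadata" .
""", format="turtle")

# How many resources of each rdf:type? (the values behind a type filter)
q = """
SELECT ?cls (COUNT(?s) AS ?n)
WHERE { ?s a ?cls }
GROUP BY ?cls
"""
for cls, n in g.query(q):
    print(cls, n)
```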
Annotating Digital Texts in the Brown University Library - Timothy Cole
The document discusses annotating digital texts at Brown University Library. It describes several projects at Brown that involve textual scholarship and digital humanities. It then explains the Pico Project, which aims to annotate Giovanni Pico della Mirandola's 900 Theses. It outlines how annotations of digital objects are ingested and stored in the Brown Digital Repository using AtomPub, XML, RDF, and Linked Data standards to allow for aggregation, syndication, and addressing of annotations.
The document discusses text mining and provides examples. It defines text mining as the extraction of implicit knowledge from large amounts of textual data. It discusses applications such as marketing, industry research, and job seeking. Key text mining methods covered include information retrieval, information extraction, web mining, and clustering. The document outlines the text mining process and discusses text characteristics, learning methods such as classification and clustering, and evaluation metrics. Examples are provided to illustrate classification using decision trees and k-nearest neighbors on structured and unstructured text data.
Open Annotation Collaboration Briefing - Timothy Cole
The document summarizes a meeting of the Open Annotation Collaboration (OAC) project team. The OAC aims to develop an interoperable annotation model and specification to facilitate sharing annotations across systems. In phase 1, the OAC will analyze existing annotation practices, develop a data model and specification, integrate annotation tools into Zotero, and create a proof-of-concept implementation.
This document discusses modelling and representing social network data ontologically. It covers representing social individuals and relationships ontologically, as well as aggregating and reasoning with social network data. It discusses ontology languages like RDF, OWL, and FOAF that can be used to represent social network data and individuals semantically. It also talks about state-of-the-art approaches for representing network structure and attribute data, and the need for representations that can integrate different data sources and maintain identity.
Folksonomies: a bottom-up social categorization system - domenico79
Folksonomies are a bottom-up social system for categorizing and sharing content created by users tagging resources with freely chosen keywords or tags. They have no controlled vocabulary or defined relationships between terms, instead allowing collaboration and adaptation through shared user-generated metadata. While lacking formal structure, folksonomies lower barriers to cooperation and foster serendipitous discovery, but can suffer from ambiguity due to synonyms, polysemy and other issues with vocabulary.
Knowledge Organization System (KOS) for biodiversity information resources, G... - Dag Endresen
Presentation of the Global Biodiversity Information Facility (GBIF) knowledge organization system (KOS) work program for the National Center for Biomedical Ontology (NCBO) Web seminar series in October 2012. Available at http://www.bioontology.org/GBIF-vocabulary-management-for-biodiversity-informatics
Wikipedia as source of collaboratively created Knowledge Organization Systems - Jakob .
The document discusses Wikipedia as a source of collaboratively created knowledge organization systems. It describes the structure of Wikipedia articles, categories, infoboxes, and how this structured data can be extracted and represented in semantic formats like RDF to create knowledge bases like DBpedia that link open data on the web. It also discusses some open issues around data quality, concepts and mapping when extracting and querying structured knowledge from Wikipedia.
Networked Digital Library Of Theses And Dissertations - singlish
The Networked Digital Library of Theses and Dissertations (NDLTD) is an international organization that promotes the electronic publishing and preservation of graduate theses and dissertations. NDLTD allows students to create electronic documents, increases access to student research, and supports long-term preservation of electronic theses and dissertations (ETDs). It currently holds over 767,000 ETDs from 90 institutions in 18 countries in 17 formats, with metadata described by the ETD-MS standard to facilitate searching and discovery.
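Since ETD metadata of this kind is typically exposed for harvesting, the sketch below shows a hedged example of fetching Dublin Core records over OAI-PMH with Python. The repository base URL is a placeholder; the verb and metadataPrefix parameters come from the OAI-PMH specification, and resumptionToken-based continuation is omitted for brevity.

```python
# Hedged sketch: harvest Dublin Core records from an OAI-PMH endpoint.
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"   # hypothetical endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

resp = requests.get(BASE_URL,
                    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
                    timeout=30)
resp.raise_for_status()
root = ET.fromstring(resp.content)

for record in root.findall(".//oai:record", NS):
    titles = [t.text for t in record.findall(".//dc:title", NS)]
    identifiers = [i.text for i in record.findall(".//dc:identifier", NS)]
    print(titles, identifiers)
```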
This document provides an overview of digital libraries, including definitions, benefits, limitations, components, standards, and challenges. It defines a digital library as a collection of information stored and accessed electronically, extending the functions of a traditional library digitally. Benefits include improved access and searchability, easier information sharing and preservation. Emerging technologies discussed include metadata standards, XML, and protocols like OAI-PMH for metadata harvesting. Common digital library software includes DSpace, Greenstone, and EPrints. Challenges involve digitization, description, legal issues, presentation of heterogeneous resources, and economic sustainability.
This document provides an overview of digital libraries, including definitions, benefits, limitations, components, standards, and challenges. It defines a digital library as a collection of information stored and accessed electronically, extending the functions of a traditional library digitally. Benefits include improved access, information sharing, and preservation, while limitations include technological obsolescence and rights management. Key components discussed include digital objects, metadata, and tools like DSpace and Greenstone for developing digital libraries. Emerging standards around identifiers, encoding, and metadata are also summarized.
The document discusses various technologies for metasearching or cross-searching multiple databases at once, including Z39.50 for real-time searching, SRU/SRW web services, and OAI-PMH for metadata harvesting. It explains concepts like XML, web services, SOAP, and WSDL, and provides examples of how technologies like Z39.50, SRU, and OAI-PMH enable searching across different data sources.
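For instance, an SRU searchRetrieve request is just an HTTP GET carrying a CQL query. The sketch below uses the SRU 1.2 parameter names from the specification, with a placeholder base URL and deliberately simplified response handling.

```python
# Hedged sketch of an SRU 1.2 searchRetrieve request; the base URL is a placeholder.
import requests
import xml.etree.ElementTree as ET

SRU_BASE = "https://catalogue.example.org/sru"   # hypothetical SRU endpoint

resp = requests.get(SRU_BASE, params={
    "operation": "searchRetrieve",
    "version": "1.2",
    "query": 'dc.title = "text mining"',
    "maximumRecords": 5,
}, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)
# Record payloads are wrapped in srw:recordData elements in SRU responses.
for rec in root.iter("{http://www.loc.gov/zing/srw/}recordData"):
    print(ET.tostring(rec, encoding="unicode")[:200])
```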
Tools and Techniques for Creating, Maintaining, and Distributing Shareable Me... - Jenn Riley
This document discusses tools and techniques for creating, maintaining, and distributing shareable metadata. It emphasizes that metadata should be structured to be understandable outside of local contexts and useful for other institutions. Key aspects of shareable metadata include using appropriate content and vocabularies, ensuring records are coherent, providing useful context, and conforming to standards. The document also outlines example workflows and considerations for making metadata shareable.
Examines how new technologies can be applied to overcome problems in controlled vocabularies, focusing on Resource Description Framework (RDF), Simple Knowledge Organisation System (SKOS), metadata registries and web services. Part of the Cataloguing and Indexing Group in Scotland (CIGS) seminar "Toto, I've got a feeling we're not in Kansas anymore": metadata issues and Web2.0 services.
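A small sketch of what expressing a controlled-vocabulary term in SKOS can look like, using rdflib; the concept URIs and labels are invented for illustration.

```python
# Sketch: a SKOS concept with preferred/alternative labels and a broader link.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")   # hypothetical vocabulary namespace

g = Graph()
g.bind("skos", SKOS)

g.add((EX.TextMining, RDF.type, SKOS.Concept))
g.add((EX.TextMining, SKOS.prefLabel, Literal("Text mining", lang="en")))
g.add((EX.TextMining, SKOS.altLabel, Literal("Text data mining", lang="en")))
g.add((EX.TextMining, SKOS.broader, EX.DataMining))
g.add((EX.DataMining, RDF.type, SKOS.Concept))
g.add((EX.DataMining, SKOS.prefLabel, Literal("Data mining", lang="en")))

print(g.serialize(format="turtle"))
```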
The Web of Linked Open Data, or LOD, is the most relevant achievement of the Semantic Web. Initially proposed by Tim Berners-Lee in a seminal paper published in Scientific American in 2001, the Semantic Web envisions a web where software agents can interact with large volumes of structured, easy-to-process data. Users now have at their disposal the first mature results of this vision. Among them, and probably the most significant, are the various LOD initiatives and projects that publish open data in standard formats like RDF.
This presentation provides an overview and comparison of different LOD initiatives in the area of patent information, and analyses potential opportunities for building new information services based on widely available datasets of patent information. The information is based on interviews conducted with innovation agents and on an analysis of the professional bibliography and current implementations.
LOD opportunities are not restricted to information aggregators; they also extend to end users and innovation agents who have to cope with the difficulties of dealing with large amounts of data. In both cases, the opportunities offered by LOD need to be assessed, as LOD has become a standard, universal method to distribute, share and access data.
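To show what consuming LOD looks like in practice, here is a hedged sketch of querying a public SPARQL endpoint with the SPARQLWrapper library. DBpedia is used only as a well-known example endpoint, not as a source of patent information, and its availability may vary.

```python
# Hedged sketch: query a public SPARQL endpoint with SPARQLWrapper.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")   # public endpoint; availability may vary
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Text_mining> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```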
URM concept for sharing information inside of communities - Karel Charvat
The document describes the Uniform Resource Management (URM) concept for sharing information within communities. URM provides a framework for standardized description of information using metadata schemes and controlled vocabularies to improve discovery. It is implemented through various portals and tools that allow users to manage and discover knowledge according to context. Initial implementations included portals for nature, sustainability and rural information in the Czech Republic and Latvia. URM supports collaborative knowledge sharing through interoperable systems based on open standards.
This document discusses metadata, which is structured data that describes and helps manage information resources. There are different types of metadata including descriptive, structural, and administrative. Metadata serves important functions like allowing resources to be discovered and organized. Several metadata standards are discussed, including Dublin Core, METS, MODS, EAD, and LOM. The document also covers metadata creation, quality issues, and ways metadata can be improved.
Dataset description: DCAT and other vocabularies - Valeria Pesce
This document discusses metadata needed to describe datasets for applications to find and understand them when stored in data catalogs or repositories. It examines existing dataset description vocabularies like DCAT and their limitations in fully capturing necessary metadata.
Key points made:
- Machine-readable metadata is important for datasets to be discoverable and usable by applications when stored across repositories.
- Metadata should describe the dataset, distributions, dimensions, semantics, protocols/APIs, subsets etc.
- Vocabularies like DCAT provide some metadata but don't fully cover dimensions, semantics, protocols/APIs or subsets (see the DCAT sketch after this list).
- No single vocabulary or data catalog solution currently provides all necessary metadata for full semantic interoperability.
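A minimal sketch, assuming rdflib and the W3C DCAT vocabulary, of the dataset/distribution split the points above refer to; the URIs and values are invented examples, not a complete description.

```python
# Minimal DCAT sketch with rdflib; all URIs and values are invented.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DCAT, DCTERMS

EX = Namespace("http://example.org/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = EX.cropYields          # hypothetical dataset URI
distribution = EX.cropYieldsCSV  # hypothetical distribution URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Crop yields 2010-2015", lang="en")))
g.add((dataset, DCTERMS.license, URIRef("http://example.org/licenses/open")))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.mediaType, Literal("text/csv")))
g.add((distribution, DCAT.downloadURL, URIRef("http://example.org/data/crop-yields.csv")))

print(g.serialize(format="turtle"))
```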
The document discusses semantic mapping in CLARIN Component Metadata Infrastructure (CMDI). CMDI allows flexible yet semantically interoperable metadata descriptions through the use of explicit schemas and semantic registries like ISOcat and RelationRegistry. These registries define concepts and relationships that can be shared across metadata profiles and elements. Semantic mapping helps achieve recall and disambiguation in metadata searches across the diverse set of CMDI profiles and components.
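An illustrative (non-CMDI) sketch of the semantic-mapping idea: two profiles use different element names, but both elements are linked to the same concept URI in a registry, so a search can match on the concept rather than the element name. All URIs are hypothetical placeholders standing in for registry entries.

```python
# Toy concept-based matching across two metadata profiles; URIs are placeholders.
CONCEPT_LINKS = {
    ("profileA", "resourceTitle"): "http://example.org/concepts/title",
    ("profileB", "name"):          "http://example.org/concepts/title",
    ("profileA", "contactPerson"): "http://example.org/concepts/creator",
}

records = [
    {"profile": "profileA", "resourceTitle": "Europarl corpus"},
    {"profile": "profileB", "name": "Europarl parallel corpus"},
]

def values_for_concept(record, concept_uri):
    """Collect values of all elements linked to the given concept."""
    return [v for k, v in record.items()
            if CONCEPT_LINKS.get((record["profile"], k)) == concept_uri]

for rec in records:
    print(values_for_concept(rec, "http://example.org/concepts/title"))
```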
This document provides an overview of metadata, including:
1) Definitions of metadata from various sources, describing it as data that describes other data or information resources.
2) The main types of metadata - descriptive, processing, administrative, and semantic. Descriptive metadata supports finding and retrieving information, processing metadata supports processing it, and administrative metadata supports managing it.
3) How metadata can be created automatically by tools or manually by people. Metadata schemes provide a formal structure to identify a discipline's knowledge and link it to information resources.
The document discusses metadata schemas and standards for digital library projects in China. It describes several existing metadata schemes including the General Format for Digitalized Chinese Full-text (GFDCF), the CPDLP Metadata Profiles, and the Chinese Metadata Specifications (CMS). It also discusses applying ontologies to build a unified metadata framework, including ontologies of Chinese information resources and bibliographic relations. This would help address issues of lack of unified semantics, mappings between schemas, and diversification in the Chinese metadata landscape.
The document discusses metadata schemes and their components. It defines a metadata scheme as a set of defined metadata elements and rules for a specific purpose. It provides examples of common metadata schemes and discusses their semantics (meanings), content rules, and syntax. The document also outlines some key purposes and benefits of metadata such as documentation, organization, search and retrieval, and preservation of information resources.
The document discusses perspectives on metadata from web resources and database systems. It describes how metadata comes in many forms and serves various purposes, such as supporting discovery and identification of information resources on the web (resource metadata), and ensuring consistency and analysis of structured data in databases (metadata in database systems). Resource metadata commonly follows standards and is stored separately from the resources it describes, while database metadata includes both structural metadata describing data organization and content metadata in the form of data dictionaries.
The document discusses semantic web technology, which aims to make information on the web better understood by machines by giving data well-defined meaning. It outlines the evolution of web technologies from the initial web to the semantic web. Key aspects of semantic web technology include ontologies to define common vocabularies, semantic annotations to associate meaning with data, and reasoning capabilities to enable complex queries and analyses. Languages, tools, and applications are needed to implement these semantic web standards and make the web of linked data usable.
Semantic Web: Technologies and Applications for the Real-World - Amit Sheth
Amit Sheth and Susie Stephens, "Semantic Web: Technologies and Applications for the Real-World," tutorial at the 2007 World Wide Web Conference, Banff, Canada.
Tutorial discusses technologies and deployed real-world applications through 2007.
Tutorial description at: http://www2007.org/tutorial-T11.php
Similar to Webinar slides: Interoperability between resources involved in TDM at the level of metadata (20)
This document summarizes a presentation about text and data mining of scientific literature. It discusses the large and growing amounts of digital content and data being produced, and challenges around making sense of it all. It introduces text mining as an emerging solution to analyze and extract insights from unstructured text sources. The presentation describes the OpenMinTeD framework, which aims to create an open infrastructure for text and data mining services, tools, and annotated corpora. It discusses registering and discovering services, running jobs, and sharing results. Finally, it covers challenges around interoperability, legal issues, policies, and sustainability.
Resource sync overview and real-world use cases for discovery, harvesting, an... - openminted_eu
This document summarizes an overview presentation about ResourceSync and its implementations at Hyku and the Digital Public Library of America (DPLA). Some key points:
- ResourceSync was developed as an update to OAI-PMH for synchronizing web resources between systems in a more flexible way. It supports resource lists, change lists, and dumps (a minimal resource-list parsing sketch follows after this summary).
- Hyku has implemented ResourceSync publishing capabilities, and the DPLA has developed a harvester for the Hyku endpoint. This allows for incremental metadata updates rather than full resynchronization of data sets.
- Next steps include potentially supporting resource dumps in Hyku and harvesting from 3 DPLA providers using ResourceSync by the end of the year.
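As referenced above, a ResourceSync resource list is a sitemap-format XML document. The hedged sketch below fetches one and prints resource URLs with their last-modified timestamps; the list URL is a placeholder, while the namespace URIs follow the sitemap and ResourceSync specifications.

```python
# Hedged sketch: read a ResourceSync resource list (sitemap-format XML).
import requests
import xml.etree.ElementTree as ET

RESOURCE_LIST = "https://repository.example.org/resourcelist.xml"  # hypothetical
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

resp = requests.get(RESOURCE_LIST, timeout=30)
resp.raise_for_status()
root = ET.fromstring(resp.content)

for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)        # resource URL
    lastmod = url.findtext("sm:lastmod", namespaces=NS) # optional timestamp
    print(loc, lastmod)
```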
Seamless access to the world's open access research papers via ResourceSync - openminted_eu
Describes a set of scholarly communications use cases for ResourceSync and presents the development and integration of the PublisherConnector in CORE. By Petr Knoth.
Text Mining: the next data frontier. Beyond Open Access - openminted_eu
1) The presentation discusses the need for text and data mining (TDM) tools to make sense of the vast amount of digital data and literature being produced, noting that there are over 1.8 billion websites and 3.46 billion internet users producing large amounts of data daily. 2) Similarly, the global research community produces around 2.5 million new scholarly articles per year, but much of this work is never read or cited. 3) The presentation proposes establishing an open TDM platform called "OpenMinTeD" that would allow researchers to discover, share, and reuse knowledge extracted from text-based sources through the use of shared TDM services and tools.
This document discusses the work of the WG3 Legal Interoperability working group for the OpenMinTeD project. The goal of the working group is to study copyright and related rights restrictions on text and data mining (TDM) activities and identify contractual and licensing tools to support TDM. It outlines legal barriers like copyright and database rights, as well as exceptions and limitations. It also discusses the use of licenses to enable access and how policy choices could address limitations of licenses. The working group's deliverables will include a compatibility matrix of licenses and ongoing analysis presented in academic papers.
How can repositories support the text mining of their content and why? - openminted_eu
This document discusses how repositories can support text and data mining (TDM) of their content. It provides three principles for repositories to follow: (1) establish direct links from metadata to the full text content, (2) provide universal access to harvesting systems at the same level as humans, and (3) ensure metadata is correctly referenced and content is accessible. The role of repositories is to aggregate research papers at full text to enable large-scale TDM by external services. However, many repositories currently do not fully support this due to issues like incomplete metadata records and non-dereferenceable identifiers.
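A toy sketch of checking principle (3) in practice: take identifiers found in metadata records and verify that they actually dereference over HTTP, i.e. that a request on them succeeds and leads to retrievable content. The identifiers below are invented examples.

```python
# Toy dereferenceability check for identifiers found in metadata records.
import requests

identifiers = [
    "https://repository.example.org/record/123/fulltext.pdf",  # hypothetical
    "https://doi.org/10.9999/example-doi",                     # hypothetical
]

for ident in identifiers:
    try:
        resp = requests.head(ident, allow_redirects=True, timeout=15)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    print(ident, "dereferences" if ok else "does NOT dereference")
```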
The document discusses the potential value of text and data mining UK theses. It notes that UK theses represent unique cutting-edge research not published elsewhere. The EThOS database contains metadata on over 430,000 UK theses totaling around 6 million pages of research annually. Several examples are provided of text and data mining projects that have extracted useful information from UK theses, such as identifying trends in dementia research and discovering new chemical compounds. While thesis metadata is openly available, accessing the full texts requires permission due to copyright. Overall, the document argues that UK theses represent a valuable untapped resource for text and data mining research.
OpenMinTeD - Repositories in the centre of new scientific knowledge - openminted_eu
OpenMinTeD aims to establish an open text and data mining platform for researchers to discover, create, share and reuse knowledge from scholarly sources. It will provide interoperable services for machine reading, information extraction and predictive analysis of structured data from unstructured text. Key challenges include making content and services discoverable and interoperable, and addressing intellectual property rights. OpenMinTeD will build on existing repositories and language resources and technologies, and involve stakeholders from its inception to evaluate outcomes.
Jisc has invested in text mining capabilities and established the National Centre for Text Mining (NaCTeM) to fund various text aggregation projects. Jisc provides open access, bibliographic, and subscription management services that include text mining of over 25 million records and 600 journal titles in CORE and journal archives. There is potential to develop user-facing text mining applications using these combined data sets to unlock hidden information and develop new knowledge.
OpenMinTeD: Its Uses and Benefits for the Social Sciences - openminted_eu
Presentation given at the ITOC workshop in Philadelphia, 20 February 2016.
Uses and Benefits for the Social Sciences research community.
By GESIS - Leibniz Institute for the Social Sciences
The document discusses text and data mining (TDM) projects in Europe. It describes how TDM can be used to understand the past by mining historical books, predict the future by mining newspapers, and save lives by mining scientific publications about diseases. It also outlines some current barriers to TDM in Europe like a lack of awareness, skills and tools, licensing and copyright issues. Two EU projects are highlighted: FutureTDM which aims to identify TDM barriers and policy solutions, and OpenMinTeD which builds a collaborative TDM infrastructure.
Infrastructure crossroads... and the way we walked them in DKPro - openminted_eu
The document discusses natural language processing (NLP) infrastructure and challenges in text and data mining. It describes DKPro, an open-source collection of NLP tools that provides interoperability between projects. DKPro Core allows running NLP pipelines with no installation through dependency fetching. Challenges discussed include balancing data protection with interoperability and moving data and analytics as needs change. The talk proposes addressing these through open APIs and repositories to discover, access, deploy and retrieve analytics and their results.
OpenMinTeD: Making Sense of Large Volumes of Data - openminted_eu
The document discusses making scientific content more accessible and useful through text and data mining. It notes that the global research community generates over 1.5 million new articles per year but many are never read or cited. Emerging solutions like machine reading, understanding and prediction can help structure and mine textual data to extract meaningful insights. The OpenMinTeD project aims to establish an open text and data mining platform and infrastructure for researchers to work collaboratively with scientific sources. It outlines challenges around content, services and processing, as well as the main routes to making content more accessible through metadata, transfer protocols and licensing. The project involves various partners and use cases across domains like scholarly communication, life sciences, agriculture and social sciences.
Experiences of Text Mining; the National Library of Austria perspective - openminted_eu
Max Kaiser discusses text mining challenges for cultural heritage institutions using the Austrian National Library as a case study. The library has digitized over 600,000 volumes and made them available online through partnerships. While technology exists for tasks like named entity recognition and topic modeling, challenges remain in integrating unstable OCR text data into production systems due to evolving source materials and algorithms. User needs must also be understood to ensure text mining benefits cultural heritage.
Text and Data Mining at the Royal Library in the Netherlands - openminted_eu
The Koninklijke Bibliotheek has a large collection of machine readable structured and semi-structured data that is the result of over 200 years of collecting, 30 years of digitization, and 10 years of collecting born-digital content. Examples of datasets include newspapers from 1840-1995 made available through an ngram viewer, political speeches from 1814 to present enriched and visualized, and radio bulletins developed through collaborations. Lessons learned are that researchers use the data in unexpected ways, collaborations provide insights, opening data creates new opportunities, and connections are built with the research community.
OpenMinTeD is an EU infrastructure project that aims to establish an open and sustainable text mining infrastructure. It will bring together accessible content, discoverable text mining services, and efficient processing capabilities. This will allow researchers to collaboratively create, discover, share and reuse knowledge extracted from a wide range of scientific text sources. The project involves 16 partners from 6 countries and will run for 3 years, starting in June 2015.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help address climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and an overview of the platform. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to part 5 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
What do a Lego brick and the XZ backdoor have in common? - Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more than that in common.
Join the presentation to dive into a story of interoperability, standards and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations and training activities. She previously worked on LibreOffice migrations and training courses for various public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko she cultivates her curiosity about astronomy (hence her nickname, deneb_alpha).
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
Van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Infrastructure Challenges in Scaling RAG with Custom AI Models - Zilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
3. OpenMinTeD sets out to create an open, service-oriented e-Infrastructure for TDM of scientific and scholarly content. Researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based, science-related sources in a seamless way.
"Achieving interoperability between resources involved in TDM
at the level of metadata"
4. • Text and Data Mining: “the discovery by computer of new,
previously unknown information, by automatically extracting
and relating information from different (…) resources to reveal
otherwise hidden meanings” [Hearst 1999]
• Interoperability: Relating to systems, especially of computers
or telecommunications, that are capable of working together
without being specially configured to do so. [American Heritage®
Dictionary of the English Language, Fifth Edition. (2011)]
5. • Language Resource: It encompasses both data sets (textual,
multimodal/multimedia and lexical data, grammars, language
models etc.) and tools/technologies/services used for their
processing [WG1 - Wiki glossary]
• Metadata: contains descriptive, contextual and provenance
assertions about the properties of a Digital Object [RDA - DFT Core
terms]
6. • content to be mined, i.e. in OpenMinTeD, scientific & scholarly publications (built as "corpora")
• ancillary/reference resources, e.g. typesystems, linguistic tagsets,
terminological lexica, ontologies, machine learning models, reference
corpora, training corpora etc.
• "components" in the form of
• downloadable and locally executable tools
• web services
• workflows composed of the above
7. • registry service: to register and, later on, search and find content and
s/w components that can process this content - targeting end users
including TDM experts
• workflow service: to search and find s/w components & ancillary
resources that are (or can be made) compatible (hence,
interoperable) in order to compose workflows - targeting TDM service
developers
• document properties that users use in their queries to discover the resources, but also
• document properties that will support the automatic discovery of compatibility (a) between s/w components and (b) between content & s/w components (i.e. find interoperable resources; a minimal matching sketch follows)
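A minimal sketch (in Python, not the actual OpenMinTeD implementation) of how such documented properties could drive the automatic discovery of compatibility between content and a s/w component; the property names (language, mimetype, inputLanguages, inputMimetypes) are illustrative assumptions:

# Illustrative compatibility check between a corpus and a component,
# based purely on their documented metadata properties.
corpus = {"language": "en", "mimetype": "text/plain"}

component = {
    "inputLanguages": ["en", "de"],     # languages the component can process
    "inputMimetypes": ["text/plain"],   # formats accepted as input
}

def compatible(corpus, component):
    """Return True if the corpus can be fed to the component as-is."""
    return (corpus["language"] in component["inputLanguages"]
            and corpus["mimetype"] in component["inputMimetypes"])

print(compatible(corpus, component))  # True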
9. Goal: achieve interoperability
• per language resource type
• across language resource types
Problems
• various metadata schemas
• various communities
semantics!!
crosswalks/mappings/semantic links
10. • Need to define a common core vocabulary for the description of
the resource properties, e.g.
• language of a publication/corpus & language of the input of a s/w
component
• domain/subject of a publication/corpus & domain/subject of an
ontology that can be used to annotate it
but
• how can we select the "common denominator" from all the
schemas?
• gaps in original metadata records deemed important for TDM
• wealth of original records & loss of information
• mismatches between metadata elements/values
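For illustration only, a possible shape of such a crosswalk in Python, assuming simplified Dublin-Core- and DataCite-style element names; the actual mappings belong to the OMTD-SHARE documentation, and the "leftovers" make the potential loss of information from the original records explicit:

# Hypothetical crosswalk from source schema elements to a common core vocabulary.
CROSSWALK = {
    "dc:language":     "resourceLanguage",
    "dc:subject":      "subjectClassification",
    "datacite:rights": "licence",
    "datacite:title":  "resourceName",
}

def to_core(record):
    """Map recognised source elements onto the core; keep the rest aside."""
    core, leftovers = {}, {}
    for key, value in record.items():
        if key in CROSSWALK:
            core[CROSSWALK[key]] = value
        else:
            leftovers[key] = value   # wealth of the original record, not discarded
    return core, leftovers

core, rest = to_core({"dc:language": "en", "dc:creator": "Doe, Jane"})
print(core)   # {'resourceLanguage': 'en'}
print(rest)   # {'dc:creator': 'Doe, Jane'}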
11. • organize the schema elements and accommodate common vs. particular features of resources
• be flexible enough to support varying degrees of documentation completeness
• cover documentation needs of all resource types involved in TDM
• cover needs of resource discoverability and TDM processing
• reuse what is available vs. create and recommend new elements and values
• document processing procedure and outputs
• standardize/normalize user input vs. allow for free user input
12. • OMTD Deliverable D5.2 - Interoperability Requirements Specification
[soon to be made publicly available]
• scenarios & use cases targeted by OMTD in the Areas of: scholarly
communication, life sciences, agriculture & biodiversity, social
sciences
• overview of relevant metadata schemas (e.g. OpenAIRE, CORE,
RIOXX guidelines, CrossRef, MetaShare, DataCite, DCAT, CMDI
relevant metadata profiles etc.) – cf. OMTD Deliverable D5.1 -
Interoperability Landscaping Report
19. • obligatory: record what is necessary for the intended purposes vs. what is easy to document,
• e.g. language for scholarly publications but title and author??, format and subject
of a document??
• recommended: features that can help the user or future uses or that
users find useful but providers have not yet standardized,
• e.g. documentation / help files, attribution, citation papers
• optional: all remaining information related to the lifecycle of a
resource
• e.g. funding information (still: funding agencies are becoming more and more interested in it!), projects where the resources have been used, and outputs created from them
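As a sketch of how these three levels could be enforced at ingestion time (the element names and their grouping into levels are assumptions for illustration, not the normative lists):

# Missing obligatory elements block ingestion; missing recommended ones only warn.
OBLIGATORY  = {"resourceType", "language", "licence"}
RECOMMENDED = {"documentation", "citationPaper", "attributionText"}

def check(record):
    errors   = sorted(e for e in OBLIGATORY  if e not in record)
    warnings = sorted(e for e in RECOMMENDED if e not in record)
    return errors, warnings

errors, warnings = check({"resourceType": "corpus", "language": "en"})
print("missing obligatory:", errors)     # ['licence'] -> reject the record
print("missing recommended:", warnings)  # warn, but accept the record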
20. • organize the schema into semantically coherent elements
• common to all types of resources (e.g. identification, licensing etc.)
• per resource type
• re-usable for more than one resource type but not globally applicable (e.g. subject
classification) and
• strictly applied to specific resource types (e.g. evaluation for s/w components)
22. • identification & provenance of the metadata record
• metadata record identifier
• metadata creation date
• identification of the resource
• identifiers with identificationScheme (name/URI)
• title & description (multilingual; English should be there but ?)
• distribution & licensing/access
• distribution medium/format (e.g. executable code, downloadable text etc.)
• licence and/or rightsStatement (ongoing work)
• licence text or URL (provided by system for standard licences)
• contact information
• either email or landing page
• resource type (& subtype)
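A hypothetical record with roughly this minimal element set, built with the Python standard library; the element names echo the bullets above but are not the normative OMTD-SHARE XSD names:

import xml.etree.ElementTree as ET

record = ET.Element("metadataRecord")
ET.SubElement(record, "metadataRecordIdentifier").text = "omtd-0001"
ET.SubElement(record, "metadataCreationDate").text = "2017-03-01"

resource = ET.SubElement(record, "resource")
ET.SubElement(resource, "resourceType").text = "corpus"
ET.SubElement(resource, "title", {"lang": "en"}).text = "Sample scholarly corpus"
ET.SubElement(resource, "identifier", {"identificationScheme": "DOI"}).text = "10.1234/example"

distribution = ET.SubElement(resource, "distribution")
ET.SubElement(distribution, "distributionMedium").text = "downloadable"
ET.SubElement(distribution, "licence").text = "CC-BY-4.0"

ET.SubElement(resource, "contactEmail").text = "contact@example.org"

print(ET.tostring(record, encoding="unicode"))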
23. [Diagram: properties documented per resource type, including title, abstract / full text, authors, publisher / journal, language, classification, character encoding, format, size, annotation level, tagset, typesystem, annotation resource, algorithm, dependencies, and input / output content resource.]
24. • relations between resources can be encoded
• inside each metadata record (e.g. between publication & authors)
• separately from both metadata records (e.g. between component & model)
• implementation issues for optionality and restrictions: uniformity of metadata records across sources vs. better treatment of restrictions via the registry service; which restrictions should be in the schema and which should be in a system built on top of the schema?
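The two encoding options could look roughly like this (illustrative Python structures, not the normative serialisation):

# (a) relation encoded inside a single metadata record
publication = {
    "identifier": "10.1234/example",
    "authors": [{"fullName": "Doe, Jane", "orcid": "0000-0002-1825-0097"}],
}

# (b) relation kept separately from both metadata records, by identifier
relation = {
    "relationType": "isCreatedBy",
    "sourceResource": "omtd-model-7",        # e.g. a model
    "targetResource": "omtd-component-42",   # e.g. the s/w component that produced it
}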
25. • recommend and link to authority lists for properties
• format: IANA list of media types BUT need for extensions!
• language: ISO 639-3 vs. IETF BCP47
• subject classification: DDC, LCSH, EUROVOC, discipline-specific lists… we cannot enforce one scheme, so we recommend their use and ask for a reference to the scheme used; this is currently encoded as enumerations, but a link to an external source would be a better solution
• create elements & values in attested gaps & where considered best
for OMTD purposes
• classification of components, lexical/conceptual resources etc.
• annotation set of elements and values [to be included in the output resources automatically via the platform]
• links to be provided to elements in other metadata schemas (DataCite,
CrossRef, DCAT, etc.) (ongoing work)
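A sketch of value normalisation against such authority lists; the tiny lookup tables below stand in for the full IANA media type registry and the ISO 639-3 / BCP 47 code lists:

# Illustrative authority lists (heavily truncated).
IANA_MEDIA_TYPES = {"text/plain", "text/xml", "application/pdf"}
ISO_639_3        = {"eng", "deu", "ell"}
BCP47_TO_ISO     = {"en": "eng", "de": "deu", "el": "ell"}   # partial mapping

def normalise_language(value):
    """Accept either a BCP 47 tag or an ISO 639-3 code, return ISO 639-3."""
    code = BCP47_TO_ISO.get(value.lower(), value.lower())
    if code not in ISO_639_3:
        raise ValueError("unknown language code: " + value)
    return code

def known_mediatype(value):
    # Values beyond the IANA list would need a project-specific extension.
    return value in IANA_MEDIA_TYPES

print(normalise_language("en"))        # 'eng'
print(known_mediatype("text/plain"))   # True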
26. • adopt entire metadata schemas and registries for satellite entities
• repositories & registries: openDOAR & re3data
• journals: DOAJ
BUT
• persons: ORCID & SCOPUS id
• organisations: ISNI & FundRef
& covered with own metadata elements
27. • link to other resources or satellite entities via identifier (PID) or
descriptive elements: recommend but allow for backup solutions when
the identifier is not there
• identifier preferably from an authority source, with reference to it: DOI for publications, DataCite for datasets & services, ORCID for persons, ISNI or FundRef for organizations etc.
• but allow for other identifiers too: "identifierSchemeName" &
"identifierSchemeURL"
• descriptive elements: title, full name, etc.
• value system for elements
• e.g. free text vs. controlled vocabularies
• represented as enumeration
• semantics of closed & open vocabularies
• open vocabularies with the additional value "other" but … how can one add values
and yet curate the vocabularies??
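A sketch of the "PID first, descriptive fallback" policy for satellite entities; the identifierSchemeName / identifierSchemeURL elements follow the slide, the rest is an illustrative assumption:

def describe_person(orcid=None, full_name=None):
    """Prefer a persistent identifier; fall back to descriptive elements."""
    if orcid:
        return {"identifier": orcid,
                "identifierSchemeName": "ORCID",
                "identifierSchemeURL": "https://orcid.org"}
    if full_name:
        # backup solution when no persistent identifier exists
        return {"fullName": full_name}
    raise ValueError("either an identifier or descriptive elements are required")

print(describe_person(orcid="0000-0002-1825-0097"))
print(describe_person(full_name="Doe, Jane"))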
28. • annotation set of elements:
• set of elements and values that can be added independently as a block to each
resource following the processing
• information on s/w component(s), type of annotation, tagsets, annotation
resources, annotators, format etc.
• covering provenance requirements but also to be used as input for further
processing
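A sketch of what such a block might look like when attached to an output resource after processing; all field names here are illustrative rather than normative:

def annotation_block(component, annotation_level, tagset, out_format):
    """Provenance-style block added by the platform after a processing step."""
    return {
        "isAnnotatedBy": component,           # s/w component that produced the output
        "annotationLevel": annotation_level,  # e.g. part-of-speech, named entities
        "tagset": tagset,
        "format": out_format,
    }

output_record = {"resourceType": "corpus", "title": "Sample corpus (annotated)"}
output_record["annotationInfo"] = annotation_block(
    "omtd-component-42", "part-of-speech", "Universal POS tags", "XMI")
print(output_record)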
29. • XSD schemas v1.0.0 & documentation:
https://openminted.github.io/openminted-site/releases/omtd-share/1.0.0/html/index.html
• Guidelines: on the way!
• Conversions from existing descriptors (ongoing work)
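If the published XSD is downloaded locally (assumed here as omtd-share.xsd) and the third-party lxml package is installed, a record can be validated along these lines:

from lxml import etree

schema = etree.XMLSchema(etree.parse("omtd-share.xsd"))
record = etree.parse("my-record.xml")

if schema.validate(record):
    print("record is valid against the OMTD-SHARE schema")
else:
    for error in schema.error_log:
        print(error.message)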
30. • not all information is available (e.g. licence, direct link to publication
contents, language of metadata fields, subject etc.)
• different approach between schemas (element vs. attribute)
• lack of a common API approach (as OAI-PMH across repositories)
• different mechanisms for flagging OA content
• inconsistent provision of full text links (incl. in CrossRef TDM)
• legal and technical issues around systematic full text aggregation
from publishers (including via CrossRef TDM)
• full text harvesting/crawling limits in place on publisher endpoints
• lack of support for discovery of new content
• lack of documentation on publisher systems
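For context, a minimal OAI-PMH ListRecords request, the kind of common API the slide wishes more providers exposed; the endpoint URL is a placeholder and the third-party requests package is assumed:

import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://repository.example.org/oai"   # hypothetical endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

response = requests.get(OAI_ENDPOINT, params=params, timeout=30)
tree = ET.fromstring(response.content)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in tree.findall(".//oai:record", ns):
    identifier = record.find(".//oai:identifier", ns)
    print(identifier.text if identifier is not None else "(no identifier)")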
31. • largely technical information
• some non-technical information possible but seldom used (e.g.
developer information in Maven) - why?
• technical elements present but in many cases possible values not
restricted (e.g. media-type or language)
• "persistent identifier" e.g. in Maven is self-assigned and global
uniqueness is not enforced but governed by best-practice in contrast
to e.g. centrally assigned DOI- good or bad or tolerable?
32. • the closest schema to OMTD-SHARE (for obvious reasons)
• resource types converted: corpora, components, lexical/conceptual
resources & models
• main problems were the lack of persistent identifiers and the
decisions taken for further standardization/normalisation