This document summarizes the key developments that led to the release of Version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It describes how OAI-PMH evolved from earlier specifications like the Santa Fe Convention and Version 1.x of the protocol. The document outlines some of the major highlights of OAI-PMH 2.0, including changes made to verbs and responses, as well as new features introduced. It also reviews the process that was followed to develop, test, and release Version 2.0 between 2001 and 2002 with input from the OAI technical committee and other contributors.
1. Advanced Overview of Version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting
Michael L. Nelson
Old Dominion University
Norfolk VA
mln@cs.odu.edu
Herbert Van de Sompel
Los Alamos National Laboratory
Los Alamos NM
herbertv@lanl.gov
Simeon Warner
Cornell University
Ithaca NY
simeon@cs.cornell.edu
ACM/IEEE Joint Conference on Digital Libraries
Houston, Texas
13:30 - 17:00
May 27 2003
latest version at: http://www.cs.odu.edu/~mln/jcdl03/
2. Scope and Focus
• This Tutorial is not…
  – an introduction to OAI-PMH
  – a listing of all the wonderful projects that use OAI-PMH
  – a discussion of the merits of metadata harvesting vs. distributed searching
• A passing familiarity is assumed for:
  – web / http interaction
  – Dublin Core / metadata
3. Outline
• How 2.0 evolved from SFC and 1.x
  – people, processes, events
• 2.0 highlights
  – comparison with 1.x
• Guidelines, recommendations, best practices for 2.0 implementations
  – harvesters, repositories, aggregators, optional containers
• Novel applications of OAI-PMH
5. Comparison: Santa Fe convention, OAI-PMH v.1.0/1.1 and OAI-PMH v.2.0

               Santa Fe convention    OAI-PMH v.1.0/1.1         OAI-PMH v.2.0
   about       eprints                document-like objects     resources
   metadata    OAMS                   unqualified Dublin Core   unqualified Dublin Core
   transport   HTTP                   HTTP                      HTTP
   responses   XML                    XML                       XML
   requests    HTTP GET/POST          HTTP GET/POST             HTTP GET/POST
   verbs       Dienst                 OAI-PMH                   OAI-PMH
   nature      experimental           experimental              stable
   model       metadata harvesting    metadata harvesting       metadata harvesting
6. Santa Fe Convention [02/2000]
• goal: optimize discovery of e-prints
• input:
  • the UPS prototype
  • RePEc/SODA “data provider / service provider model”
  • Dienst protocol
  • deliberations at Santa Fe meeting [10/99]
7. OAI-PMH v.1.0 [01/2001]
• goal: optimize discovery of document-like objects
• input:
  • SFC
  • DLF meetings on metadata harvesting
  • deliberations at Cornell meeting [09/00]
  • alpha test group of OAI-PMH v.1.0
8. OAI-PMH v.1.0 [01/2001]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months
9. Selected Pre-2.0 OAI Highlights
• October 21-22, 1999 - initial UPS meeting
• February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
  – precursor to the OAI metadata harvesting protocol
• June 3, 2000 - workshop at ACM DL 2000 (Texas)
• August 25, 2000 - OAI steering committee formed, DLF/CNI support
• September 7-8, 2000 - technical meeting at Cornell University
  – defined the core of the current OAI metadata harvesting protocol
• September 21, 2000 - workshop at ECDL 2000 (Portugal)
• November 1, 2000 - Alpha test group announced (~15 organizations)
• January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
  – purpose: freeze protocol for 12-16 months, generate critical mass
• February 26, 2001 - OAI Open Day in Europe (Berlin)
• July 3, 2001 - OAI protocol 1.1 announced
  – to reflect changes in the W3C’s latest XML Schema recommendation
• September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
10. OAI-PMH v.2.0 [06/2002]
• goal: recurrent exchange of metadata about resources between systems
• input:
  • OAI-PMH v.1.0
  • feedback on OAI-implementers
  • deliberations by OAI-tech [09/01 - 06/02]
  • alpha test group of OAI-PMH v.2.0 [03/02 - 06/02]
• officially released June 14, 2002
11. OAI-PMH v.2.0 [06/2002]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• metadata about resources
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• stable
12. releasing OAI-PMH v.2.0 (illustrating the OAI process)
See also:
Lagoze, Carl and Van de Sompel, Herbert. “The Making of the Open Archives Initiative Protocol for Metadata Harvesting.” Library Hi Tech 21(2), 2003. Draft.
14. creation of OAI-tech [06/01]
• created for a 1-year period
• charge:
  • review functionality and nature of OAI-PMH v.1.0
  • investigate extensions
  • release stable version of OAI-PMH by 05/02
  • determine need for infrastructure to support broad adoption of the protocol
• communication: listserv, SourceForge, conference calls
15. OAI-tech
US representatives:
Thomas Krichel (Long Island U) - Jeff Young (OCLC) - Tim Cole (U of Illinois at Urbana-Champaign) - Hussein Suleman (Virginia Tech) - Simeon Warner (Cornell U) - Michael Nelson (NASA) - Caroline Arms (LoC) - Mohammad Zubair (Old Dominion U) - Steven Bird (U Penn.)
European representatives:
Andy Powell (Bath U. & UKOLN) - Mogens Sandfaer (DTV) - Thomas Baron (CERN) - Les Carr (U of Southampton)
16. pre-alpha phase [09/01 – 02/02]
• review process by OAI-tech:
  • identification of issues
  • conference call to filter/combine issues
  • white paper per issue
  • on-line discussion per white paper
  • proposal for resolution of issue by OAI-exec
  • discussion of proposal & closure of issue
  • conference call to resolve open issues
17. pre-alpha phase [02/02]
• creation of revised protocol document
• in-person meeting Lagoze - Van de Sompel - Nelson - Warner
• autonomous decisions
• internal vetting of protocol document
18. alpha phase [02/02 – 05/02]
• alpha-1 release to OAI-tech March 1st 2002
• OAI-tech extended with alpha testers
• discussions/implementations by OAI-tech
• ongoing revision of protocol document
19. OAI-PMH 2.0 alpha testers (1/2)
• The British Library
• Cornell U. -- NSDL project & e-print arXiv
• Ex Libris
• FS Consulting Inc -- harvester for my.OAI
• Humboldt-Universität zu Berlin
• InQuirion Pty Ltd, RMIT University
• Library of Congress
• NASA
• OCLC
20. OAI-PMH 2.0 alpha testers (2/2)
• Old Dominion U. -- ARC, DP9
• U. of Illinois at Urbana-Champaign
• U. of Southampton -- OAIA (now Celestial), CiteBase, eprints.org
• UCLA, Johns Hopkins U., Indiana U., NYU -- sheet music collection
• UKOLN, U. of Bath -- RDN
• Virginia Tech -- repository explorer
21. beta phase [05/02-06/02]
• beta release on May 1st 2002 to:
  • registered data providers and service providers
  • interested parties
• fine tuning of protocol document
• preparation for the release of 2.0 conformant tools by alpha testers
24. Overview of OAI-PMH Verbs

   Verb                  Function                                   Category
   Identify              description of repository                  metadata about the repository
   ListMetadataFormats   metadata formats supported by repository   metadata about the repository
   ListSets              sets defined by repository                 metadata about the repository
   ListIdentifiers       OAI unique ids contained in repository     harvesting verbs
   ListRecords           listing of N records                       harvesting verbs
   GetRecord             listing of a single record                 harvesting verbs

most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
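
To make the request pattern concrete, here is a minimal harvester-side sketch (not from the slides) of issuing these verbs as ordinary HTTP GET requests; the base URL and helper name are invented for the example.

# Minimal sketch of issuing OAI-PMH requests as HTTP GETs with a "verb"
# argument; BASE_URL and oai_request are illustrative names only.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://www.example.org/oai"  # hypothetical repository

def oai_request(verb, **args):
    """Issue one OAI-PMH request and return the raw XML response body."""
    query = urlencode(dict(verb=verb, **args))
    with urlopen(f"{BASE_URL}?{query}") as response:
        return response.read().decode("utf-8")

# metadata about the repository
identify_xml = oai_request("Identify")
formats_xml = oai_request("ListMetadataFormats")

# harvesting verbs (arguments: dates, sets, ids, metadata formats, ...)
records_xml = oai_request("ListRecords", metadataPrefix="oai_dc",
                          **{"from": "2002-01-01", "until": "2002-06-14"})
record_xml = oai_request("GetRecord", identifier="oai:arXiv:cs/0112017",
                         metadataPrefix="oai_dc")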
26. resource – item – record

   resource
     item (“all available metadata about David”)
       records: Dublin Core metadata, MARC metadata, SPECTRUM metadata

   item = identifier
   record = identifier + metadata format + datestamp
   set-membership is an item-level property
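
A small sketch of this data model in code may help; the class and field names below are illustrative only, the structure follows the slide.

# Illustrative sketch of the item/record model; only the structure is
# taken from the slide, the names are made up for the example.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Record:
    identifier: str       # same OAI identifier as the item it belongs to
    metadata_prefix: str  # metadata format, e.g. "oai_dc" or "marc21"
    datestamp: str        # date of last change to this record
    metadata_xml: str     # the serialized metadata itself

@dataclass
class Item:
    identifier: str                                           # e.g. "oai:arXiv:cs/0112017"
    set_specs: List[str] = field(default_factory=list)        # item-level property
    records: Dict[str, Record] = field(default_factory=dict)  # keyed by metadata prefix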
27. OAI-PMH vs HTTP
• clear separation of OAI-PMH and HTTP
• OAI-PMH error handling
  • all OK at HTTP level? => 200 OK
  • something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
• http codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events
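
A sketch of the two levels described above: transport problems surface as HTTP status codes, protocol problems as <error> elements inside a 200 OK response. Standard library only; the URL is hypothetical.

# HTTP-level vs OAI-PMH-level error handling (illustrative sketch).
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
url = "http://www.example.org/oai?verb=NoSuchVerb"

try:
    with urllib.request.urlopen(url) as response:
        body = response.read()            # HTTP level was fine (200 OK)
except urllib.error.HTTPError as e:
    print("HTTP-level problem, e.g. 503 or 403:", e.code)
else:
    root = ET.fromstring(body)
    for error in root.findall(f"{OAI_NS}error"):
        # OAI-PMH-level problem, e.g. badVerb or badArgument
        print(error.get("code"), (error.text or "").strip())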
28. other improvements
• all protocol responses can be validated with a single XML Schema
  • easier for data providers
  • no redundancy in type definitions
• SOAP-ready
• clean for error handling
29. <?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request verb=“GetRecord”… …>http://arXiv.org/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:arXiv:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
…..
</metadata>
</record>
</GetRecord>
</OAI-PMH>
(response with no errors; note there is no http encoding of the OAI-PMH request)
31. Identify
• Identify more expressive
<Identify>
<repositoryName>Library of Congress 1</repositoryName>
<baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>r.e.gillian@larc.nasa.gov</adminEmail>
<adminEmail>rgillian@visi.net</adminEmail>
<earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
<deletedRecord>transient</deletedRecord>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<compression>deflate</compression>
32. protocol vs periphery
• clear distinction between protocol and
periphery
• fixed protocol document
• extensible implementation guidelines:
• e.g. sample metadata formats, description
containers, about containers
• allows for OAI guidelines and community
guidelines
33. • introduction of provenance container to facilitate tracing of harvesting history
provenance
<about>
<provenance>
<originDescription>
<baseURL>http://an.oa.org</baseURL>
<identifier>oai:r1:plog/9801001</identifier>
<datestamp>2001-08-13T13:00:02Z</datestamp>
<metadataPrefix>oai_dc</metadataPrefix>
<harvestDate>2001-08-15T12:01:30Z</harvestDate>
<originDescription>
… … …
</originDescription>
</originDescription>
</provenance>
</about>
34. • introduction of friends container to facilitate dynamic discovery of repositories
friends
<description>
<friends>
<baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
<baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
<baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
<baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
</friends>
</description>
35. • introduction of branding container for DPs to suggest rendering & association hints
<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
http://www.openarchives.org/OAI/2.0/branding.xsd">
<collectionIcon>
<url>http://my.site/icon.png</url>
<link>http://my.site/homepage.html</link>
<title>MySite(tm)</title>
<width>88</width>
<height>31</height>
</collectionIcon>
<metadataRendering
metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
<metadataRendering
metadataNamespace="http://another.place/MARC"
mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
</branding>
branding
40. • SOAP implementation
• Result set filtering
• “all” / “best” metadata
• GetRecord -> GetRecords
• Machine readable rights management
• XML format for “mini-archives”
49. Error Summary

   Verb                  Possible errors
   Identify              BA
   ListMetadataFormats   BA NMF IDDNE
   ListSets              BA BRT NSH
   ListIdentifiers       BA BRT CDF NRM NSH
   ListRecords           BA BRT CDF NRM NSH
   GetRecord             BA CDF IDDNE

   (BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat,
    IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch,
    NSH = noSetHierarchy)

Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification
51. Minimal Repository
• 2.0 provides many expressive, but optional features
  – but still low barrier!
• if you are writing your own repository software, the quickest path to implementation can involve initially:
  – only supporting DC
  – skipping: <about>, sets, compression
  – skip flow control (resumptionTokens) if < 1000 items
• add optional features as requirements and familiarity allows (a minimal sketch of such a starting point follows)
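
The following is a minimal sketch along the lines suggested above: unqualified DC only, no sets, no <about>, no resumptionTokens. The record store, URLs and names are illustrative, and a real repository must implement all six verbs; this only shows the shape of a response and of a protocol-level error.

# Minimal OAI-PMH-style repository sketch (illustrative, not a reference
# implementation). Uses only the Python standard library.
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from xml.sax.saxutils import escape

VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}
RECORDS = {"oai:example.org:1": "2002-06-14"}  # identifier -> datestamp

def envelope(payload, request_url):
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f'<responseDate>{now}</responseDate>'
            f'<request>{escape(request_url)}</request>{payload}</OAI-PMH>')

class OAIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        args = {k: v[0] for k, v in parse_qs(urlparse(self.path).query).items()}
        verb = args.get("verb", "")
        if verb == "Identify":
            payload = ("<Identify><repositoryName>Example</repositoryName>"
                       "<baseURL>http://localhost:8000/oai</baseURL>"
                       "<protocolVersion>2.0</protocolVersion>"
                       "<adminEmail>admin@example.org</adminEmail>"
                       "<earliestDatestamp>2002-01-01</earliestDatestamp>"
                       "<deletedRecord>no</deletedRecord>"
                       "<granularity>YYYY-MM-DD</granularity></Identify>")
        elif verb == "ListIdentifiers" and args.get("metadataPrefix") == "oai_dc":
            headers = "".join(f"<header><identifier>{i}</identifier>"
                              f"<datestamp>{d}</datestamp></header>"
                              for i, d in RECORDS.items())
            payload = f"<ListIdentifiers>{headers}</ListIdentifiers>"
        else:
            # protocol-level problems are reported in the XML, not via HTTP codes
            code = "badArgument" if verb in VERBS else "badVerb"
            payload = f'<error code="{code}">not supported by this sketch</error>'
        body = envelope(payload, "http://localhost:8000/oai").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), OAIHandler).serve_forever()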
52. Be Honest with datestamp!
• a change in the process of dynamic generation of a metadata format really does mean all records have been updated!
  – harvester caveat: an incremental harvest could yield an entire repository dump if all the date stamps change (for example, if the metadata mapping rules change)
if (internalItemDatestamp > disseminationInterfaceDatestamp) {
datestamp = internalItemDatestamp
} else {
datestamp = disseminationInterfaceDatestamp
}
53. Not Hiding Updates
• OAI-PMH is designed to allow incremental harvesting
• Updates must be available by the end of the period of the datestamp assigned, i.e.
  – Day granularity => during same day
  – Seconds granularity => during same second
• Reason: harvesters need to overlap requests by just one datestamp interval (one day or one second)
  – in 1.x, 2 intervals were required (in many circumstances)
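
A small harvester-side sketch of that overlap rule at day granularity; the helper name and dates are illustrative.

# Sketch of the one-interval overlap: the next incremental harvest starts
# *from* the last datestamp already seen, not the day after, so updates made
# late within that day are not missed.
from datetime import date

def next_from_argument(last_datestamp_seen: date) -> str:
    """Value of the 'from' argument for the next incremental harvest."""
    # under 2.0 semantics one interval of overlap is enough; under 1.x two
    # intervals were needed in many circumstances
    return last_datestamp_seen.isoformat()

print(next_from_argument(date(2002, 6, 14)))   # -> "2002-06-14"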
54. State in resumptionTokens
• HTTP is stateless
• resumptionTokens allow state information to be passed back to the repository to create a complete list from a sequence of incomplete lists
• EITHER – all state in resumptionToken
• OR – cache result set in repository
55. Caching the Result Set
• Repository caches results of initial request, returns only incomplete list
• resumptionToken does not contain all state information; it includes:
  – a session id
  – offset information, necessary for idempotency
• resumptionToken allows repository to return next incomplete list
• increased complexity due to cache management
  – but a potential performance win
56. All State in the resumptionToken
• Arrange that remaining items/headers in complete list response can be specified with a new query and encode that in resumptionToken
• One simple approach is to return items/headers in id order and make the new query specify the same parameters and the last id returned (or by date)
  – simple to implement, but possibly longer execution times
• Can encode parameters very simply:
<resumptionToken>metadataPrefix=oai_dc&from=1999-02-03&until=2002-04-01&lastid=fghy:45:123</resumptionToken>
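
A sketch of a fully stateless resumptionToken along these lines: the original request arguments plus the resume point are URL-encoded into the token itself. The key names mirror the example token above; the layout is one possible choice, not mandated by the protocol.

# Encode/decode a stateless resumptionToken (illustrative sketch).
from urllib.parse import parse_qs, urlencode

def make_token(metadata_prefix, from_, until, last_id):
    """Encode everything needed to resume the list into the token."""
    return urlencode({"metadataPrefix": metadata_prefix, "from": from_,
                      "until": until, "lastid": last_id})

def parse_token(token):
    """Recover the original arguments plus the resume point from the token."""
    return {k: v[0] for k, v in parse_qs(token).items()}

token = make_token("oai_dc", "1999-02-03", "2002-04-01", "fghy:45:123")
state = parse_token(token)
# the repository re-runs the same query in identifier order and returns only
# items with identifier > state["lastid"]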
57. resumptionToken attributes (1)
• expirationDate – likely to be useful when cache clean-up schedule is known
  – Do not specify expirationDate if all state is in the resumptionToken
• badResumptionToken error to be used if resumptionToken expired
  – May also be used if request cannot be completed for some other reason
    • e.g.: if repository changes cause the incomplete list to have no records
  – issue badRT’s judiciously; it can invalidate a lot of effort by a lot of harvesters
58. resumptionToken attributes (2)
• completeListSize and cursor optionally provide information about size of complete list and number of records so far disseminated
  – not (currently) widely used
  – use consistently if used
  – designed for status monitoring
  – caveat harvester: completeListSize may be approximate and may be revised
59. resumptionToken
The only defined use of resumptionToken is as follows:
• a repository must include a resumptionToken element as part of each response that includes an incomplete list;
• in order to retrieve the next portion of the complete list, the next request must use the value of that resumptionToken element as the value of the resumptionToken argument of the request;
• the response containing the incomplete list that completes the list must include an empty resumptionToken element;
(a harvesting-loop sketch built on these rules follows)
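
A sketch of the loop these three rules imply: keep re-issuing the verb with only the resumptionToken argument until a response carries an empty (or absent) resumptionToken. The base URL is hypothetical.

# Complete-list harvesting loop driven by resumptionTokens (illustrative).
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://www.example.org/oai"

def harvest_records(metadata_prefix="oai_dc"):
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        with urlopen(f"{BASE_URL}?{urlencode(args)}") as resp:
            root = ET.fromstring(resp.read())
        yield from root.iter(f"{OAI_NS}record")
        token = root.find(f"./{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # empty resumptionToken element closes the complete list
        args = {"verb": "ListRecords", "resumptionToken": token.text.strip()}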
60. Flow Control & Load Balancing
How to respond to a “bad” harvester:
1. HTTP status code 200; response to OAI-PMH request with a resumptionToken.
2. HTTP status code 503; with the Retry-After header set to an appropriate value if a subsequent request follows too quickly or if the server is heavily loaded.
3. HTTP status code 403; with an appropriate reason specified if subsequent requests do not adhere to Retry-After delays.
(a harvester-side sketch of honouring these responses follows)
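
A harvester-side sketch of respecting these responses: sleep for Retry-After on 503 and stop on other errors such as 403. The retry policy and names are illustrative; Retry-After is assumed to be given in seconds.

# Polite retrieval honouring 503 Retry-After (illustrative sketch).
import time
import urllib.error
from urllib.request import urlopen

def polite_get(url, max_retries=5):
    for _ in range(max_retries):
        try:
            with urlopen(url) as resp:   # urllib follows 302 redirects for us
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 503:
                # obey the repository's requested delay before retrying
                time.sleep(int(e.headers.get("Retry-After", "60")))
            else:
                raise  # e.g. 403: stop and fix the harvester's behaviour
    raise RuntimeError("gave up after repeated 503 responses")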
61. 302 Load Balancing
• Interactive users on main DL machine should not be impacted by metadata harvesting
  – don’t take deliveries through the front door
  – not part of the protocol; defined outside the protocol

[diagram] a harvester sends
  http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc
to the main OAI server (naca.larc.nasa.gov/oai/); if load > 0.50 that server answers with HTTP status code 302, redirecting the request to a second OAI server (buckets.dsi.internet2.edu/naca/oai/), which returns the <ListIdentifiers> response.
62. DNS Load Balancing
• using a DNS rotor, establish
  – a.foo.org, b.foo.org, c.foo.org
  – each with a synchronized copy of the repository
  – let DNS & chance distribute the load
  – implication: if resumptionTokens could be issued to loosely synchronized servers, it is likely that the rTs will be stateful
63. Load Balancing Caveats
• Copies of the repository must be synchronized
  – (cf. Pande, et al. JCDL 02)
• Complex hierarchies are possible
  – programmer must ensure no cycles in redirection graphs!
• The baseURL in the reply must always point to the original repository, not the repository that eventually answered the request
64. Error Handling: Verbosity
More is better…
<error code="badArgument">Illegal argument ‘foo’</error>
<error code="badArgument">Illegal argument ‘bar’</error>
is preferred over:
<error code="badArgument">Illegal arguments ‘foo’, ‘bar’</error>
which is preferred over:
<error code="badArgument">Illegal arguments</error>
65. Error Handling: Levels
• the OAI-PMH error / exception conditions are for
OAI-PMH semantic events
• they are not for situations when:
– the database is down
– a record is malformed
• remember: record = id + datestamp + metadataPrefix
• if you’re missing one of those, you don’t have an OAI record!
– and other conditions that occur outside the OAI scope
• use HTTP status codes 500, 503, or other appropriate values to
indicate non-OAI problems
66. Error Handling: Extensions
• Arguments that are not 'required', 'optional' or
'exclusive’ are 'illegal' and should generate
badArgument errors.
• If you want to extend the OAI-PMH:
– stop and consider: do you really need to?
• maybe you should have different OAI-PMH interfaces, or creative
metadata formats
– if you really, really want to, tunnel your extensions through the
“set” feature
• see http://www.dlib.org/dlib/december01/suleman/12suleman.html for
examples
67. Idempotency of “List”
Requests (1)
• Purpose is to allow harvesters to recover from lost
responses or crashes without starting a large
harvest from scratch
• Recover by re-issuing request using
resumptionToken from previous request
• IMPLICATION: the repository must accept both the
most recently issued resumptionToken and the
previous one
68. Idempotency of “List”
Requests (2)
• response to a re-issued request must contain all unchanged
records
• any changed records will get new datestamps after time of
initial request
• changes will be picked up by subsequent harvest if not
included
[no experience yet with incomplete responses to ListSets or
ListMetadataFormats requests]
69. Case Study: “bucket” based repositories
• Buckets: see Nelson & Maly, CACM 44(5)
• 2.0
– NTRS - ntrs.nasa.gov/ (MySQL, DC)
– LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/ (file system, refer)
– NACA - naca.larc.nasa.gov/oai2.0/ (file system, refer)
• 1.1
– LTRS - techreports.larc.nasa.gov/ltrs/oai/
– NACA - naca.larc.nasa.gov/ltrs/oai/
– Open Video - www.open-video.org/oai/ (MySQL, local)
– JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS Access dump, local)
– GLTRS (filesystem, HTML scraping)
• Characteristics:
– resumptionToken support initially skipped; added later (all)
• highly encoded rT’s: [2001-01-01!!!!301!600]
– sets initially skipped, added later (LTRS)
– initially had load balancing with 2 NACA repositories…
70. Case Study: “bucket” based repositories
• in bucket terminology:
– 6 OAI verbs (methods) added to the existing list
of methods
• http://ntrs.nasa.gov/?method=list_methods
• http://ntrs.nasa.gov/?method=list_source&target=ListIdentifiers
– a data element is added to the bucket that contains
the specifics of the particular repository and its
metadata format
• http://ntrs.nasa.gov/?method=display&pkg_name=oai&element_name=oai.pl
72. Be a Polite OAI Neighbor
• Re-use existing free harvester software/libraries:
http://www.openarchives.org/tools/index.html
• If you insist on writing your own harvester, read
http://www.robotstxt.org/wc/robots.html
• Provide meaningful User-Agent & From headers
– Should be present in HTTP headers of all robot requests
– Should be configured even if using someone else’s harvester
73. Harvesting Sequence
• Issue Identify request
– Check OAI-PMH version
– Check baseURL, granularity, compression
• Issue ListMetadataFormats request
– Get information regarding selected metadataPrefix
• Issue ListSets request if using sets
– Check set structure matches expectation
• Issue ListIdentifiers or ListRecords
request
– Continue until end of complete list
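A compact sketch of that sequence as a standard-library Python harvester; the
baseURL is a placeholder, store() is hypothetical, and the regex-based token
extraction is deliberately naive (a real harvester should use an XML parser):

import re
import urllib.request
from urllib.parse import urlencode

BASE = "http://an.example.org/oai"   # placeholder baseURL

def get(**args):
    req = urllib.request.Request(
        BASE + "?" + urlencode(args),
        headers={"User-Agent": "demo-harvester/0.1", "From": "admin@example.org"},
    )
    return urllib.request.urlopen(req).read().decode("utf-8")

identify = get(verb="Identify")               # check version, granularity, compression
formats = get(verb="ListMetadataFormats")     # pick a metadataPrefix
sets = get(verb="ListSets")                   # only needed if harvesting by set

response = get(verb="ListRecords", metadataPrefix="oai_dc")
while True:
    store(response)                           # hypothetical: save this chunk
    m = re.search(r"<resumptionToken[^>]*>([^<]+)</resumptionToken>", response)
    if not m:
        break                                 # absent or empty token: end of list
    response = get(verb="ListRecords", resumptionToken=m.group(1))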
74. Listen to the Repository
• Check Identify’s <granularity> element if you wish to use finer
than YYYY-MM-DD
• If you harvest with sets, remember that “:” indicates hierarchy
– harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”
• Check for and handle non-200 HTTP status codes, 503, 302 and
4xx in particular
• Empty resumptionToken => end of complete list
• Ask for compressed responses if the repository supports them
75. Harvesting Everything
• Issue an Identify request to find protocol version, finest datestamp
granularity supported, if compression is supported…
• Issue a ListMetadataFormats request to obtain a list of all
metadataPrefixes supported.
• Harvest using a ListRecords request for each metadataPrefix
supported. Knowledge of the datestamp granularity allows for less
overlap in incremental harvesting if granularities finer than a day are
supported.
• Set structure can be inferred from the setSpec elements in the header
blocks of each record returned (consistency checks are possible).
• Items may be reconstructed from the constituent records.
• Provenance and other information in <about> blocks may be re-assembled
at the item level if it is the same for all metadata formats harvested.
However, this information may be supplied differently for different
metadata formats and may thus need to be stored separately for each
metadata format.
76. Harvesting v1.1 and v2.0
• Not difficult to handle both cases; test the Identify response:
– v1.1: <Identify> <protocolVersion>
– v2.0: <OAI-PMH> <Identify> <protocolVersion>
• Different error and exception handling
• Many similarities, harvesters can share lots of code
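A sketch of that version test, assuming the Identify response has already
been fetched; the wildcard-namespace lookup needs Python 3.8 or later:

import xml.etree.ElementTree as ET

def protocol_version(identify_xml):
    """Guess the protocol version from an Identify response."""
    root = ET.fromstring(identify_xml)
    local = root.tag.split("}")[-1]                    # drop any namespace prefix
    version = root.findtext(".//{*}protocolVersion")
    if local == "OAI-PMH":
        return "2.x", version    # v2.0 wraps everything in <OAI-PMH>
    if local == "Identify":
        return "1.x", version    # v1.0/v1.1: <Identify> is the root element
    return None, version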
77. Harvesting Demo
• Harvester written in Perl (Uses LWP, Expat and
XML::Parser, no schema validation)
• Handles v1.0, v1.1 and v2.0
• Sequence of requests: Identify, ListMetadataFormats,
ListSets then ListRecords/ListIdentifiers
• Support for incremental harvesting, uses responseDate
from last harvest to get new start datestamp
• Supports response compression (gzip, compress)
• UTF-8 conditioning to deal with some “imperfect”
repositories
78. Harvesting logs
• Alan Kent’s v2.0 harvester logs:
http://www.inquirion.com:8123/public/collList;collListCmd=list
• Alan Kent’s summary of v1.1 harvesting results
http://www.mds.rmit.edu.au/~ajk/oai/interop/summary.htm
• Celestial v1.1 harvesting logs
http://celestial.eprints.org/cgi-bin/status
• DP9 gateway using arc harvested information
http://arc.cs.odu.edu:8080/dp9/index.jsp
79. <friends> example (1)
A light-weight, data-provider driven way to communicate
the existence of “others”, e.g.
http://ntrs.nasa.gov/?verb=Identify
…
<description>
<friends …namespace stuff… >
<baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
<baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
<baseURL>http://eprints.riacs.edu/perl/oai/</baseURL>
<baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
</friends>
</description>
…
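A sketch of how a harvester might collect the friends' baseURLs from an
Identify response; the namespace handling is simplified to a suffix match:

import xml.etree.ElementTree as ET

def friend_base_urls(identify_xml):
    """Collect <baseURL> values from any <friends> description blocks."""
    root = ET.fromstring(identify_xml)
    return [
        el.text.strip()
        for friends in root.iter()
        if friends.tag.endswith("friends")
        for el in friends
        if el.tag.endswith("baseURL") and el.text
    ]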
82. Aggregator / Cache / Proxy
Implementation
(see also Aggregator Implementation Guidelines:
http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm)
83. <provenance> & datestamps
• Reminder: datestamps are local to the
repository, a re-exporting service
must use new local datestamps
• Such services should use the
<provenance> container to preserve
the original datestamps and other
information
84. Identifiers are Local
• Identifiers are local to the repository
• Unless you absolutely did not change the
metadata and the identifier corresponds to a
recognized URI scheme, use a new identifier upon
re-exporting
– use the <provenance> container to preserve the
harvesting history
85. oai-identifier
• Just one option for identifiers in OAI-PMH
• The v2.0 oai-identifier scheme is not
compatible with v1.1:
– repositoryName now domain name based
– not reliant upon OAI centralized registration
• One-to-one mapping for escaping
characters: %3F allowed, %3f not
– allows simple comparison
86. Derived from the same item?
3 ways to determine if records share provenance
from the same item:
1. both records have the same identifier and the
baseURL in the request elements of the OAI-PMH
responses which include the record are the same;
2. both records have the same identifier and that
identifier belongs to some recognized URI scheme;
3. the provenance containers of both records have the
same entries for both the identifier and baseURL;
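A sketch of those three checks as a single predicate; the record shape (a
dict with identifier, request_base_url and provenance entries) and the list
of recognized URI schemes are assumptions for illustration:

RECOGNIZED_SCHEMES = ("doi:", "hdl:", "urn:")   # illustrative list only

def same_item(a, b):
    """True if two harvested records appear to derive from the same item."""
    # 1. same identifier, issued by the same repository (same request baseURL)
    if a["identifier"] == b["identifier"] and \
       a["request_base_url"] == b["request_base_url"]:
        return True
    # 2. same identifier belonging to a recognized, globally unique URI scheme
    if a["identifier"] == b["identifier"] and \
       a["identifier"].startswith(RECOGNIZED_SCHEMES):
        return True
    # 3. provenance containers record the same (identifier, baseURL) origin
    return bool(a["provenance"] & b["provenance"])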
87. <provenance> example (1)
Consider a request from crosswalker.oa.org:
http://odd.oa.org?verb=GetRecord
&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt
and the following response from odd.oa.org:
<responseDate>2002-02-08T08:55:46.1</responseDate>
<request verb="GetRecord" metadataPrefix="odd_fmt"
identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>
<GetRecord ...namespace stuff…
<record>
<header>
<identifier>oai:odd.oa.org:z1x2y3</identifier>
<datestamp>1999-08-07T06:05:04Z</datestamp>
</header>
<metadata> …metadata record in odd_fmt… </metadata>
</record>
</GetRecord>
88. <provenance> example (2)
Imagine that crosswalker.oa.org cross-walks
harvested metadata from odd_fmt into oai_marc and
then re-exposes the metadata with new identifiers.
A request from getmarc.oa.org:
http://crosswalker.oa.org?verb=GetRecord
&identifier=oai:cw.oa.org:z1x2y3
&metadataPrefix=oai_marc
might then yield the following response from
crosswalker.oa.org:
90. <provenance> example (4)
This oai_marc record is then re-exposed by
getmarc.oa.org with the same identifier
oai:cw.oa.org:z1x2y3 (because the record
has not been altered).
The associated <provenance> container might
be:
93. arXiv (1)
http://arXiv.org/oai2
• Existing system, running >11 years, written mostly in
Perl
• Flat file system for ‘database’
• 230k papers with metadata in homebrew format
• ~200 updates/day. OAI repository just one view of
system, must integrate with daily update schedule
94. arXiv (2)
• Write in Perl
– Easy integration with rest of system, reuse
code from v1.0/v1.1 interface
– Use libwww; XML::DOM
• Daily rebuild of datestamp database
– No existing date in system appropriate
– Base on Unix cdate of metadata files
• On-the-fly metadata translation
– Straightforward, avoids data duplication
95. arXiv (3)
• Flow control to avoid loading server and to
avoid harvesters tripping robot alarms
– resumptionTokens to limit response size
(1500 records or 15k identifiers / response)
– 503 Retry-After replies based on client ip
• Implement resumptionTokens that
include all state
– Avoid need to cache result sets / clean cache
96. NSDL Metadata Repository
http://services.nsdl.org:8080/nsdloai/OAI
• Implemented as an integral part of a new system
• Expect heavy load; db target size >10M items; stateless
resumptionTokens
• Java servlets; Xerces; Oracle (JDBC interface); strict validation
throughout
• Based on rewrite of Cocoa (NCSA UIUC)
• Integral to NSDL services model: provides data for user
interface and search services
98. • “Using OAI-PMH…Differently” Young, Van de
Sompel, Hickey, D-Lib Magazine, 9(7/8),
2003
– DL Usage logs ~ LANL
– Registry of metadata formats for OpenURL ~
OCLC & LANL
• http://www.openurl.info/registry/
• http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf
– GSAFD Thesaurus ~ OCLC
• Other uses?
100. OAI-PMH access to DL usage logs
• usage logs filtered and stored in MySQL db
• accessible as 2 OAI-PMH repositories:
• document oriented
• agent oriented (user-proxy)
• interlinked
• recommender system:
• harvests logs
• interprets logs
• exposes relationships (OpenURL access)
101. Phase 1: creating recommender system
[Diagram: local document logs and agent logs feed a log-processing step that
builds a log-based recommender system from local and remote data.]
106. OAI-PMH-conformant OpenURL Registry
• NISO OpenURL Framework builds on Registry
• Registry entry:
• unique identifier
• always DC record
• sometimes XHTML or XML Schema
definition
107. OAI-PMH-conformant OpenURL Registry
• Collaboration with OCLC Office of Research:
• Registry is OAI-PMH harvestable
• Registry is browseable through overlaying
of PURL and XSLT
113. OAI-PMH-conformant GSAFD Thesaurus
• OCLC Office of Research:
• GSAFD Thesaurus is OAI-PMH harvestable
• Thesaurus is user-browseable through
overlaying of PURL and XSLT
• Thesaurus is accessible by machines via
OAI-PMH-based web services
115. Other Uses For the OAI-PMH
• Assumptions:
– Traditional DLs / SPs will continue on their
present path of increasing sophistication
• citation indexing, search results viz, personalization,
recommendations, subject-based filtering, etc.
– growth rates remain the same (~5x DPs as SPs)
• Premise: OAI-PMH is applicable to any
scenario that needs to update / synchronize
distributed state
– Future opportunities are possible by creatively
interpreting the OAI-PMH data model
116. Typical Values
• repository
– collection of publications
• resource
– scholarly publication
• item
– all metadata (DC + MARC)
• record
– a single metadata format
• datestamp
– last update / addition of a record
• metadata format
– bibliographic metadata format
• set
– originating institution or subject categories
117. Repositories…
• Stretching the idea of a repository a bit:
– contextually sensitive repositories
• “personalization for harvesters”
• communication between strangers, or communication
between friends?
– OAI-PMH for individual complex objects?
• OAI-PMH without MySQL?!
– Fedora, Multi-valent documents, buckets
– tar, jar, zip, etc. files
118. Resource
• What if resource were:
– computer system status
• uptime, who, w, df, ps, etc.
– or generalized “system” status
• e.g., sports league standings
– people
• personnel databases
• authority files for authors
119. Item
• What if item were:
– software
• union of versions + formats
– all forms of metadata
• administrative + structural
• citations, annotations, reviews, etc.
– data
• e.g., newsfeeds and other XML expressible content
– metadataPrefixes or sets could be defined to be different
versions
120. Record
• What if record were:
– specific software instantiations / updates
– access / retrieval logs for DLs (or computer systems)
– push / pull model inversion
• put a harvester on the client behind a firewall, the client
contacts a DP and receives “instructions” on how to submit
the desired document (e.g., send email to a specified
address)
121. Datestamp
• semantics of datestamp are strongly influenced by
the choice of resource / item / record /
metadataPrefix, but it could be used to:
– signify change of set membership (e.g., workflow: item
moves from “submitted” to “approved”)
– change datestamp to reflect access to the DP
• e.g., in conjunction with metadataPrefixes of “accessed” or
“mirrored”
122. metadataPrefix
• what if metadataPrefix were:
– instructions for extracting / archiving / scraping the
resource
• verb=ListRecords&metadataPrefix=extract_TIFFs
– code fragments to run locally
• (harvested from a trusted source!)
– XSLT for other metadataPrefixes
• the branding container is repository-level; this could be
record- or item-level
123. Set
• sets are already used for tunneling OAI-PMH
extensions (see Suleman & Fox, D-Lib 7(12))
• other uses:
– in aggregators, automatically create 1 set per baseURL
– have “hidden” sets (or metadataPrefix) that have
administrative or community-specific values (or triggers)
• set=accessed>1000&from=2001-01-01
• set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc