Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
Surinder Kumar, Technical Director, NIC, New Delhi
[email_address], 011-24305503
OAI-PMH
The mission of the Open Archives Initiative (OAI) (www.openarchives.org) is to develop and promote “interoperability standards that aim to facilitate efficient dissemination of content.”
OAI-PMH
The OAI-PMH is based on a simple and powerful model “whereby repositories (data providers) make metadata . . . available via a well-defined protocol. The exposure of the metadata allows other organizations (service providers) to harvest it and then aggregate it, post-process it, and refine it with the goal of developing services that add value.”
Background
  • Has origins in ePrints (arXiv, CogPrints), dating back to 1999
  • Actively seeking wider applicability
  • Nothing to do with OAIS
  • Aims to “facilitate the efficient dissemination of content”
  • Free access to the archives (at least: metadata)
  • Consistent interfaces for archives and service providers
  • Low-barrier protocol / effortless implementation (e.g., because based on HTTP, XML, DC)
  • Now on version 2.0 (June 2002)
OAI-PMH: what’s it all about
Service providers harvest metadata from data providers.
[Diagram, adapted from http://www.oaforum.org/tutorial/english/page3.htm: the service provider’s harvester sends OAI-PMH requests (HTTP) to the data provider, which holds metadata (+ resources) and returns metadata (XML); the service provider builds a “service” on the harvested metadata.]
What can be requested (verb)
  • Description of the archive (Identify)
  • A list of metadata formats supported by the data provider (ListMetadataFormats)
  • A list of sets provided (ListSets)
  • A list of resource identifiers (ListIdentifiers)
  • Many records (ListRecords)
  • An individual record (GetRecord)
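As a rough illustration of the six verbs, the sketch below records which arguments each verb takes and checks a request against that table. The argument lists follow my reading of the OAI-PMH 2.0 specification; the names `VERB_ARGUMENTS` and `check_request` are hypothetical and not part of the protocol or of any library.

```python
# Arguments per verb, as read from the OAI-PMH 2.0 specification (assumption:
# this table is a summary for illustration, not a normative reference).
LIST_ARGS = {"required": {"metadataPrefix"}, "optional": {"from", "until", "set"}}
VERB_ARGUMENTS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "ListIdentifiers": LIST_ARGS,
    "ListRecords": LIST_ARGS,
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set()},
}

# Verbs whose responses may be split into several parts.
RESUMABLE_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}


def check_request(verb, args):
    """Return a list of problems with a request; an empty list means it looks fine."""
    if verb not in VERB_ARGUMENTS:
        return [f"unknown verb: {verb}"]
    if "resumptionToken" in args:
        # A resumption token is an exclusive argument: nothing else may accompany it.
        if verb not in RESUMABLE_VERBS:
            return [f"{verb} does not accept a resumptionToken"]
        if len(args) > 1:
            return ["resumptionToken must be the only argument"]
        return []
    spec = VERB_ARGUMENTS[verb]
    missing = spec["required"] - set(args)
    unexpected = set(args) - spec["required"] - spec["optional"]
    problems = [f"missing argument: {name}" for name in sorted(missing)]
    problems += [f"unexpected argument: {name}" for name in sorted(unexpected)]
    return problems
```

For example, `check_request("ListRecords", {})` reports the missing `metadataPrefix`, which is the error a data provider would also be expected to signal.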
Example Requests
http://archive.example.org/oaipmh?verb=Identify
http://archive.example.org/oaipmh?verb=ListRecords&metadataPrefix=oai_dc
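The example requests are ordinary HTTP GET URLs: a base URL plus a `verb` and any further arguments as query parameters. A minimal sketch of building them follows; `build_request_url` is a hypothetical helper name, and the base URL is the placeholder from the slide, not a real repository.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL: base URL, verb first, then other arguments."""
    query = {"verb": verb}
    query.update(arguments)
    # urlencode percent-escapes values, which matters for identifiers and tokens.
    return base_url + "?" + urlencode(query)


BASE_URL = "http://archive.example.org/oaipmh"  # placeholder taken from the slide

identify_url = build_request_url(BASE_URL, "Identify")
list_records_url = build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

With these inputs the two URLs match the ones shown on the slide.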
Metadata Formats
  • Metadata may be returned in any XML format
  • Dublin Core is mandatory
    • OAI-PMH specifies the XML schema to use
    • No single DC element is mandatory
  • Other element sets / bindings are optional
    • Qualified DC (e.g. RDN, NSDL)
    • MODS (LoC)
    • LOM (RDN-LTSN)
    • ODRL (JORUM, I think)
    • ...
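Because no single DC element is mandatory, code that reads `oai_dc` metadata should not assume any particular field is present. The sketch below parses one record's metadata with the Python standard library; the sample XML is invented for illustration, and `dc_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample of the oai_dc payload of a single record (illustration only).
SAMPLE_OAI_DC = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example thesis</dc:title>
  <dc:identifier>http://archive.example.org/item/1</dc:identifier>
  <dc:subject>Engineering</dc:subject>
  <dc:subject>Metadata</dc:subject>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Collect Dublin Core elements into a dict of lists (elements may repeat)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


fields = dc_fields(SAMPLE_OAI_DC)
# Any element may be absent, so read with a default rather than indexing directly.
creators = fields.get("creator", [])
```

Returning lists keeps repeated elements such as `dc:subject` intact, and `fields.get(..., [])` copes with elements the data provider chose not to supply.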
Sets
  • A grouping of items made to allow selective harvesting
    • E.g. all theses
    • E.g. the Engineering section
    • E.g. all resources from a given source
  • Optional
List Records
  • Harvester can ask for a specific metadata format for
    • All available items
    • All items in a set
    • All records modified in a given date range
    • (A single item: GetRecord)
  • Data provider can return
    • All relevant records
    • Some relevant records + resumption token
    • An error code (no such set / metadata format)
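The three possible responses listed above (everything, a partial list plus a resumption token, or an error) suggest a simple harvesting loop. The sketch below is one way to write it, under stated assumptions: `fetch` is a caller-supplied function that takes the request arguments and returns the response XML as text, so no real HTTP client is involved, and the two fake pages and the token `page-2` are invented so the loop can be exercised offline.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


class OAIError(Exception):
    """Raised when the data provider answers with an <error> element."""


def harvest_identifiers(fetch, metadata_prefix, set_spec=None,
                        from_date=None, until_date=None):
    """Yield the identifier of every record, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    while True:
        root = ET.fromstring(fetch(params))
        error = root.find(OAI + "error")
        if error is not None:
            raise OAIError(f"{error.get('code')}: {error.text}")
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:
            return  # no token, or an empty one: the list is complete
        # A follow-up request carries the verb and the token, nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token}


# Invented two-page response used only to exercise the loop offline.
_PAGE = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
         '<record><header><identifier>oai:archive.example.org:{n}</identifier>'
         '</header></record>{token}</ListRecords></OAI-PMH>')
PAGE_1 = _PAGE.format(n=1, token="<resumptionToken>page-2</resumptionToken>")
PAGE_2 = _PAGE.format(n=2, token="<resumptionToken/>")


def fake_fetch(params):
    """Stand-in for an HTTP call: serve page 2 only when its token is presented."""
    return PAGE_2 if params.get("resumptionToken") == "page-2" else PAGE_1


harvested = list(harvest_identifiers(fake_fetch, "oai_dc"))
```

Passing `fetch` in as a parameter keeps the protocol logic separate from the transport, so the same loop can be pointed at a real repository or at canned responses.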
Static Repositories
  • Even lighter-weight specification for data providers with small and relatively static collections
    • E.g. the output from a conference
  • Essentially an XML file available at a URL
  • Accessed through a “static repository gateway” intermediary
Issues: complexity
  • Providing data is easy
  • Harvesting data is easy
  • However, doing so may lead to complex workflow / policy issues
    • What do you do with the harvested metadata?
    • Do you modify the metadata you harvest?
    • If so, do you feed this back to the provider?
    • What if the provider changes a modified record?
    • Does a service provider disseminate via OAI?
Issues: uptake
  • Lots of implementers, who have produced lots of useful support
  • However
    • Relatively little commercial uptake
    • Relatively little support for harvesting rich metadata
    • Relatively little support/consensus on sets
Issues: Harvesting resource (e.g. full text)
  • Nothing in OAI-PMH requires that full text should be available for harvesting.
    • Resource may be physical or access controlled
  • Nothing in OAI-PMH requires that the information required for harvesting should be available.
  • However, in many cases OAI-PMH will provide the information required to harvest the resource.
http://www.myoai.co
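One common case of OAI-PMH providing the information needed to reach the resource is a `dc:identifier` that happens to be an HTTP URL. The sketch below is only a heuristic under that assumption: `candidate_resource_urls` is a hypothetical helper, the sample values are invented, and a URL found this way may still point to a landing page or an access-controlled resource rather than to the full text.

```python
from urllib.parse import urlparse


def candidate_resource_urls(identifiers):
    """Keep only identifier values that look like HTTP(S) URLs."""
    urls = []
    for value in identifiers:
        value = value.strip()
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(value)
    return urls


# Invented dc:identifier values for one record: an OAI identifier, a URL,
# and a physical location with nothing to fetch.
sample_identifiers = [
    "oai:archive.example.org:1",
    "http://archive.example.org/item/1.pdf",
    "Shelf 12, Box 3",
]
urls = candidate_resource_urls(sample_identifiers)
```

An empty result is a normal outcome, since no metadata element is mandatory and the resource may not be online at all.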
Service Provider
Arc: A Cross Archive Search Service (http://arc.cs.odu.edu/), from October 2000
Arc is an experimental research service that serves as a platform for demonstrating the scalability of the OAI-PMH and as a vehicle for providing access to OAI-compliant repositories through a unified search interface. Arc is the oldest federated search service based on the OAI-PMH.
OAIster
OAIster is a union catalog of digital resources. We provide access to these digital resources by "harvesting" their descriptive metadata (records) using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting).
Service Providers
Citebase (http://citebase.eprints.org/), May 2001
Citebase “allows researchers to search across free, full-text research literature eprint archives, with results ranked according to many criteria (e.g., citation impact), and then to navigate that literature using citation links and analysis.”
Contd…
SAIL-eprints (Search, Alert, Impact and Link) (http://eprints.bo.cnr.it/), April 2003
SAIL-eprints is “an electronic open access service provider for finding scientific or technical documents, published or unpublished, in Chemistry, Physics, Engineering, Materials Sciences, Nanotechnologies, Microelectronics, Computer Sciences, Astronomy, Astrophysics, Earth Sciences, Meteorology, Oceanography, . . . [Agriculture], and related . . . [subjects].”
Resources
  • Open Archives Initiative (http://www.openarchives.org/): spec, best practice guide and useful resources, mailing lists
  • OAI for beginners (http://www.oaforum.org/tutorial/): online tutorial
  • OAI Repository Explorer (http://www.purl.org/NET/oai_explorer): web interface for issuing OAI-PMH requests
 
 
 
Thanks

Digitisation and institutional repositories 3


Editor's Notes

  • #5 1st meeting of the OAI: October 21-22, 1999, Santa Fe, New Mexico. D-Lib article in Feb 2000: http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
  • #6 The service provider will use a “harvester” to issue OAI requests in order to collect metadata from a data provider’s repository. Example services: anything you can do with metadata. Cross search: aggregate resource discovery metadata from many repositories (BUT OAI-PMH is not a search protocol per se). Metadata enhancement/transformation: merge records, augment metadata (e.g. add implicit information about encoding schemes), change binding. (Diane Hillmann et al., “Improving metadata quality: recombination and augmentation” (NSDL), http://metamanagement.comm.nsdl.org/Metadata_Augmentation--DC2004.html)
  • #7 Will focus on those which are highlighted.
  • #8 These are examples using HTTP GET; it is also possible to use POST.
  • #9 By insisting that simple DC is used, the OAI aims to provide a base level of interoperability.
  • #10 Selective harvesting is not the same as searching: sets have to be pre-defined by the data provider. You can’t assume that an OAI data provider will provide sets.
  • #11 OAI is not a query protocol.
  • #12 STARGATE (CDLR at Strathclyde University), http://cdlr.strath.ac.uk/stargate/, is investigating the use of OAI static repositories.
  • #13 None of the issues are insurmountable if you’re not over-ambitious to start with and think through potential problems. NSDL metadata recombination work is an example.
  • #14 Commercial uptake: in work with a scientific publisher (Inderscience) through the JISC PALS project, we have found that CSA, CrossRef, Elsevier etc. have refused to use OAI-PMH, preferring their own harvesting techniques. This is in part perhaps due to not understanding how flexible OAI can be (e.g. it can support any metadata); in part because they can do this simply without OAI-PMH (e.g. web crawling); and in part a reflection of the balance of power between service provider and (small) data provider (the DP has to play by SP rules; the DP has to support many harvesting specs; the SP gets to control what it uses). The PERX project has produced “Marketing with metadata”, aimed at encouraging publishers to use metadata interoperability standards including OAI-PMH. The limited support for rich metadata reflects the variety of rich metadata formats and the lack of consensus on them.
  • #15 Location: no metadata element is mandatory, therefore can’t rely on identifier/location being available. However if you’re dealing with resource discovery metadata for resources available online, it normally is.