The Open Archives Initiative Protocol for Metadata Harvesting


CRIS + Open Access = The Route to Research Knowledge on the GRID, Brussels – 21 September 2004

  1. The Open Archives Initiative Protocol for Metadata Harvesting
     CRIS + Open Access = The Route to Research Knowledge on the GRID
     Brussels – 21 September 2004
     Andy Powell, UKOLN, University of Bath
     UKOLN: a centre of expertise in digital information management
  2. Contents
     • a brief history of OAI
     • 10 technical things you should know about the OAI-PMH
     • potential impact…
         • institutional context
         • the role of the library?
         • the researcher
     • current activities/issues
     • OAI and the semantic Web
     • note: primary focus is on the technology
  3. OAI roots
     • the roots of OAI lie in the development of eprint archives…
         • arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL
     • each offered Web interface for deposit of articles and for end-user searches
     • difficult for end-users to work across archives without having to learn multiple different interfaces
     • recognised need for single search interface to all archives
         • Universal Pre-print Service (UPS)
  4. Searching vs. harvesting
     • two possible approaches to building a single search interface to multiple eprint archives…
         • cross-searching multiple archives based on protocol like Z39.50
         • harvesting metadata into one or more ‘central’ services – bulk move data to the user-interface
     • US digital library experience in this area indicated that cross-searching not preferred approach
         • distributed searching of N nodes viable, but only for small values of N
  5. Harvesting requirements
     • in order that harvesting approach can work there need to be agreements about…
         • transport protocols – HTTP vs. FTP vs. …
         • metadata formats – DC vs. MARC vs. …
         • quality assurance – mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice
         • intellectual property and usage rights – who can do what with the records
     • work in this area resulted in the “Santa Fe Convention”
  6. Development of OAI-PMH
     • 2-year metamorphosis through various names
         • Santa Fe Convention, OAI-PMH versions 1.0, 1.1…
         • OAI Protocol for Metadata Harvesting 2.0
     • development steered by international technical committee
     • inter-version stability helped developer confidence
     • move from focus on eprints to more generic protocol
         • move from OAI-specific metadata schema to mandatory support for DC
  7. Bluffer’s guide to OAI
     • OAI-PMH short for Open Archives Initiative Protocol for Metadata Harvesting
     • a low-cost mechanism for harvesting metadata records
         • from ‘data providers’ to ‘service providers’
     • allows ‘service provider’ to say ‘give me some or all of your metadata records’
         • where ‘some’ is based on date-stamps, sets, metadata formats
     • eprint heritage but widely deployed
         • images, museum artefacts, learning objects, …
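The ‘give me some or all of your metadata records’ request on the slide above can be sketched as an ordinary HTTP GET URL. This is a minimal illustration rather than a full harvester: the `ListRecords` verb and the `metadataPrefix`, `from`, `until` and `set` arguments come from OAI-PMH 2.0, while the base URL, the set name `physics` and the helper name `list_records_url` are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL for a full or selective harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # Each optional argument narrows 'all records' down to 'some records'.
    if from_date:
        params["from"] = from_date      # date-stamp lower bound
    if until_date:
        params["until"] = until_date    # date-stamp upper bound
    if set_spec:
        params["set"] = set_spec        # restrict to one set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical data provider and set name, for illustration only.
url = list_records_url("http://repository.example.org/oai",
                       from_date="2004-01-01", set_spec="physics")
print(url)
# http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2004-01-01&set=physics
```

Leaving out all three optional arguments asks for every record in the chosen metadata format.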
  8. Bluffer’s guide to OAI
     • based on HTTP and XML
         • simple, Web-friendly, fast deployment
     • OAI-PMH is not a search protocol
         • but its use can underpin search-based services based on Z39.50 or SRW or SOAP or…
     • OAI-PMH typically carries metadata
         • content (e.g. full-text or image) made available separately – typically at URL in metadata
     • mandates simple DC as record format
         • but extensible to any XML format – IEEE LOM, ONIX, MARC, METS, MPEG-21, etc.
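Because responses are plain XML over HTTP, a service provider can read them with a stock XML parser. The sketch below parses a hand-written sample response; the three namespace URIs are the ones used by OAI-PMH 2.0 and simple DC, but the sample record, its identifiers and the helper name `parse_records` are invented for illustration. It also shows the point that the content itself usually sits at a URL carried in the metadata.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real one would come from a data provider.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-09-21T10:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://repository.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:42</identifier>
        <datestamp>2004-09-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example eprint</dc:title>
          <dc:creator>Example, A.</dc:creator>
          <dc:identifier>http://repository.example.org/42/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per record: OAI identifier, titles and content links."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        # Deleted records carry a header but no metadata, so dc may be None.
        titles = [e.text for e in dc.findall("dc:title", NS)] if dc is not None else []
        links = [e.text for e in dc.findall("dc:identifier", NS)] if dc is not None else []
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": titles,
            "links": links,  # the content is fetched separately from these URLs
        })
    return records


records = parse_records(SAMPLE_RESPONSE)
print(records[0]["identifier"], records[0]["links"])
```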
  9. Bluffer’s guide to OAI
     • metadata and ‘content’ often made freely available – but not a requirement
         • OAI-PMH can be used between closed groups
         • or, can make metadata available but restrict access to content in some way
     • underlying HTTP protocol provides
         • access control – e.g. HTTP BASIC
         • compression mechanisms (for improving performance of harvesters)
         • could, in theory, also provide encryption if required
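The HTTP-level features listed above need nothing OAI-specific. As a rough sketch, a harvester could ask for gzip compression and send HTTP BASIC credentials as ordinary request headers. The URL, user name and password are placeholders, the helper names are hypothetical, and the request is only constructed here, not sent.

```python
import base64
import gzip
from urllib.request import Request


def harvest_request(url, username=None, password=None):
    """Build (but do not send) a request that accepts gzip and may use HTTP BASIC."""
    headers = {"Accept-Encoding": "gzip"}  # ask the server to compress the response
    if username is not None:
        token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    return Request(url, headers=headers)


def decode_body(body, content_encoding):
    """Undo gzip compression if the server applied it."""
    return gzip.decompress(body) if content_encoding == "gzip" else body


# Placeholder endpoint and credentials, for illustration only.
req = harvest_request("http://repository.example.org/oai?verb=Identify",
                      username="harvester", password="secret")
print(req.get_header("Authorization"))
```

BASIC credentials are only base64-encoded, not encrypted, which is why the slide treats encryption as a separate capability of the transport.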
  10. Dublin Core
      • OAI-PMH mandates use of simple DC as lowest common denominator
      • agreed XML schema – ‘oai_dc’
          • simple DC – 15 metadata properties
          • all DC properties optional and repeatable
      • the 15 properties: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
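‘Optional and repeatable’ means an oai_dc record may carry any of the 15 properties zero or more times. One way to picture this is a mapping from property name to a list of values, serialised as below. The namespace URIs are those of oai_dc and simple DC; the helper name `build_oai_dc` and the sample values are hypothetical, and the sketch omits the schema-location attributes a real data provider would normally add.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 simple DC properties, as lower-case element names.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_oai_dc(properties):
    """Build an oai_dc:dc element from {property name: [values, ...]}."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, values in properties.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a simple DC property: {name}")
        for value in values:  # repeatable: one child element per value
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


# Two creators, one title, and thirteen properties simply left out.
record = build_oai_dc({
    "title": ["An example eprint"],
    "creator": ["Example, A.", "Sample, B."],
})
print(ET.tostring(record, encoding="unicode"))
```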
  11. OAI and Google
      • [diagram: eprint archive(s), an OAI gateway and Google, with links labelled OAI-PMH and HTTP]
      • OAI gateway makes harvested metadata available to Google…
      • Examples…
          • DSpace and Google
          • OAIster and Yahoo
  12. Impact on institutions…
      • OAI-PMH technology provides an open, relatively stable technical framework
          • allows institution to re-consider management of intellectual output
          • greater confidence in availability of external services (e.g. discovery, access, analysis)
      • the technical bit is easy
          • software (Southampton), DSpace (MIT/HP), Fedora
      • but, technical solutions are always easy!
          • real problem is cultural change required to get academics to deposit
  13. Impact on libraries…
      • library is natural choice as ‘managing agent’ for the institutional repository
          • quality control
          • metadata enhancement
          • preservation
      • but libraries often weak technically (not always!) therefore technical collaboration within institution may be required
      • beginning to see some evidence of externally ‘hosted’ repository services being offered
  14. Impact on researchers…
      • OAI-PMH technology provides a ‘disruptive’ technical framework that supports
          • new ways for individual researcher to disclose his/her research output
          • development of new kinds of ‘research’ discovery services
      • can use ‘personal’ OAI repository
      • but, need to
          • clarify roles of institutional, discipline and personal repositories
          • overcome FUD – IPR, peer-review, ability to ‘publish’, quality control, inertia
  15. Current activities/issues
      • protocol now stable and few changes being discussed
      • some lightweight noises about re-implementing OAI-PMH using SOAP (Web services) but little enthusiasm for pushing these kinds of changes forward
      • some work on OAI-rights issues – formalising mechanisms for attaching IPR statements and/or licences to the records being exchanged using the protocol, e.g. Creative Commons
  16. Creative Commons
      • CC is “devoted to expanding the range of creative work available for others to build upon and share”
      • provides ‘standard’ licences for content
          • attribution
          • noncommercial
          • no derivative works
          • share alike
      • mechanisms for indicating licence on Web pages
  17. Works vs. manifestations
      • implementers have tended to see ‘eprints’ as single-entity objects
      • some evidence that this is too simplistic
          • some repositories expose metadata about the ‘work’, others expose metadata about the ‘expressions’
      • need more consistency in our use of the OAI-PMH to expose metadata about both ‘works’ and ‘manifestations’
      • complex objects encoded using METS or MPEG-21 DIDL (may include ‘objects’ as well as ‘metadata about objects’)
  18. Works vs. manifestations
      • [diagram: a work and its manifestations, with metadata about the work, metadata about manifestation 1 and metadata about manifestation 2, labelled oai_dc]
  19. OAI and the SW
      • most metadata carried by the protocol currently is not RDF
      • not suitable for processing directly by semantic Web applications
      • need to build ‘knowledge’ about the structure of the metadata formats in use into the harvesting application
      • but could use the protocol to carry RDF/XML
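The ‘knowledge’ a harvester has to build in can be as small as knowing that each oai_dc child element corresponds to a Dublin Core property URI. The sketch below turns one oai_dc record into subject/predicate/object triples held as plain Python tuples. The choice of subject URI, the sample record and the helper name `oai_dc_to_triples` are assumptions made for illustration; no RDF library is used and this is not an agreed mapping.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented oai_dc record, for illustration only.
SAMPLE_OAI_DC = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>An example eprint</dc:title>"
    "<dc:creator>Example, A.</dc:creator>"
    "</oai_dc:dc>"
)


def oai_dc_to_triples(subject_uri, oai_dc_xml):
    """Map each DC child element to a (subject, predicate, literal) triple."""
    root = ET.fromstring(oai_dc_xml)
    triples = []
    for child in root:
        # Format-specific knowledge: DC elements map onto DC property URIs.
        if child.tag.startswith(f"{{{DC_NS}}}") and child.text:
            local_name = child.tag.split("}", 1)[1]
            triples.append((subject_uri, DC_NS + local_name, child.text.strip()))
    return triples


# Assumption: the resource URL is used as the subject of every triple.
triples = oai_dc_to_triples("http://repository.example.org/42", SAMPLE_OAI_DC)
for triple in triples:
    print(triple)
```

A different metadata format, such as MARC or IEEE LOM, would need its own mapping of this kind, which is the burden the slide describes.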
  20. Questions…