The Open Archives Initiative Protocol for Metadata Harvesting and ePrints UK


A presentation at the AULIC Institutional Repositories Meeting, University of Bristol – 23 May 2005

Published in: Education, Technology

  1. 1. The Open Archives Initiative Protocol for Metadata Harvesting and ePrints UK. AULIC Institutional Repositories Meeting, University of Bristol – 23 May 2005. Andy Powell, UKOLN, University of Bath [email_address]. UKOLN: a centre of expertise in digital information management. UKOLN is supported by:
  2. 2. Contents <ul><li>a brief history of OAI </li></ul><ul><li>10 technical things you should know about the OAI-PMH </li></ul><ul><li>potential impact… </li></ul><ul><ul><li>institutional context </li></ul></ul><ul><ul><li>the role of the library? </li></ul></ul><ul><ul><li>the researcher </li></ul></ul><ul><li>ePrints UK project </li></ul>
  3. 3. OAI roots <ul><li>the roots of OAI lie in the development of eprint archives… </li></ul><ul><ul><li>arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL </li></ul></ul><ul><li>each offered Web interface for deposit of articles and for end-user searches </li></ul><ul><li>difficult for end-users to work across archives without having to learn multiple different interfaces </li></ul><ul><li>recognised need for single search interface to all archives </li></ul><ul><ul><li>Universal Pre-print Service (UPS) </li></ul></ul>
  4. 4. Searching vs. harvesting <ul><li>two possible approaches to building a single search interface to multiple eprint archives… </li></ul><ul><ul><li>cross-searching multiple archives based on protocol like Z39.50 </li></ul></ul><ul><ul><li>harvesting metadata into one or more ‘central’ services – bulk move data to the user-interface </li></ul></ul><ul><li>digital library experience in this area indicated that cross-searching not preferred approach </li></ul><ul><ul><li>distributed searching of N nodes viable, but only for small values of N </li></ul></ul>
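The trade-off on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Archive`, `cross_search` and `HarvestedIndex` names are hypothetical, the "archives" are in-memory lists, and `remote_calls` simply counts how often each archive would have been contacted over the network.

```python
class Archive:
    """Hypothetical eprint archive holding a few record titles."""

    def __init__(self, records):
        self.records = list(records)
        self.remote_calls = 0  # stands in for network round trips

    def search(self, term):
        self.remote_calls += 1
        return [r for r in self.records if term.lower() in r.lower()]

    def export_all(self):
        self.remote_calls += 1
        return list(self.records)


def cross_search(archives, term):
    """Cross-searching: every query fans out to every archive."""
    hits = []
    for archive in archives:
        hits.extend(archive.search(term))
    return hits


class HarvestedIndex:
    """Harvesting: copy the metadata once, then answer queries locally."""

    def __init__(self):
        self.records = []

    def harvest(self, archives):
        for archive in archives:
            self.records.extend(archive.export_all())

    def search(self, term):
        return [r for r in self.records if term.lower() in r.lower()]
```

With N archives, each cross-search costs N remote calls, whereas a harvested index pays N calls per harvest and nothing per query, which is one way to read "viable, but only for small values of N".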
  5. 5. Harvesting requirements <ul><li>in order that harvesting approach can work there need to be agreements about… </li></ul><ul><ul><li>transport protocols – HTTP vs. FTP vs. … </li></ul></ul><ul><ul><li>metadata formats – DC vs. MARC vs. … </li></ul></ul><ul><ul><li>quality assurance – mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice </li></ul></ul><ul><ul><li>intellectual property and usage rights – who can do what with the records </li></ul></ul><ul><li>work in this area resulted in the “Santa Fe Convention” </li></ul>
  6. 6. Development of OAI-PMH <ul><li>two-year metamorphosis through various names </li></ul><ul><ul><li>Santa Fe Convention, OAI-PMH versions 1.0, 1.1… </li></ul></ul><ul><ul><li>OAI Protocol for Metadata Harvesting 2.0 </li></ul></ul><ul><li>development steered by international technical committee </li></ul><ul><li>simplicity and inter-version stability helped developer confidence </li></ul><ul><li>move from focus on eprints to more generic protocol </li></ul><ul><ul><li>move from OAI-specific metadata schema to mandatory support for Dublin Core </li></ul></ul>
  7. 7. Bluffer’s guide to OAI <ul><li>OAI-PMH short for Open Archives Initiative Protocol for Metadata Harvesting </li></ul><ul><li>a low-cost mechanism for harvesting metadata records </li></ul><ul><ul><li>from ‘data providers’ to ‘service providers’ </li></ul></ul><ul><li>allows ‘service provider’ to say ‘give me some or all of your metadata records’ </li></ul><ul><ul><li>where ‘some’ is based on date-stamps, sets, metadata formats </li></ul></ul><ul><li>eprint heritage but widely deployed </li></ul><ul><ul><li>images, museum artefacts, learning objects, … </li></ul></ul>
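A 'give me some or all of your metadata records' request is just an HTTP GET with a few query parameters. The sketch below builds such a URL using the protocol's `ListRecords` verb and its `metadataPrefix`, `from`, `until` and `set` arguments. The helper name `list_records_url` and the base URL `http://repository.example.org/oai` are assumptions for illustration, not part of the presentation.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # 'some' of the records: narrow by date-stamp and/or set
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical data provider; any real repository has its own base URL.
print(list_records_url("http://repository.example.org/oai",
                       from_date="2005-01-01", set_spec="physics"))
```

Leaving out `from_date`, `until_date` and `set_spec` asks for all records in the chosen metadata format.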
  8. 8. Bluffer’s guide to OAI <ul><li>based on HTTP and XML </li></ul><ul><ul><li>simple, Web-friendly, fast deployment </li></ul></ul><ul><li>OAI-PMH is not a search protocol </li></ul><ul><ul><li>but its use can underpin search-based services based on Z39.50 or SRW or SOAP or… </li></ul></ul><ul><li>OAI-PMH typically carries metadata </li></ul><ul><ul><li>content (e.g. full-text or image) made available separately – typically at URL in metadata </li></ul></ul><ul><li>mandates simple DC as record format </li></ul><ul><ul><li>but extensible to any XML format – IEEE LOM, ONIX, MARC, METS, MPEG-21, etc. </li></ul></ul>
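Because responses are plain XML over HTTP, a harvester can be written with a standard XML parser. The sketch below pulls out, for each record, the OAI identifier and any URL carried in `dc:identifier`, which is where the link to the content typically sits. The `SAMPLE` response is invented for illustration (it is not real harvested data) and the function name `content_links` is an assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented, minimal ListRecords response for a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:42</identifier>
        <datestamp>2005-05-23</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example eprint</dc:title>
          <dc:identifier>http://repository.example.org/42/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def content_links(xml_text):
    """Return (OAI identifier, [content URLs]) for each harvested record."""
    root = ET.fromstring(xml_text)
    links = []
    for record in root.iterfind(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
        urls = [e.text for e in record.iterfind(".//dc:identifier", NS)
                if e.text and e.text.startswith("http")]
        links.append((oai_id, urls))
    return links


print(content_links(SAMPLE))
```

The metadata travels through the protocol; the full text itself is fetched separately from the URL found in the record.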
  9. 9. Bluffer’s guide to OAI <ul><li>metadata and ‘content’ often made freely available – but not a requirement </li></ul><ul><ul><li>OAI-PMH can be used between closed groups </li></ul></ul><ul><ul><li>or, can make metadata available but restrict access to content in some way </li></ul></ul><ul><li>underlying HTTP protocol provides </li></ul><ul><ul><li>access control – e.g. HTTP BASIC </li></ul></ul><ul><ul><li>compression mechanisms (for improving performance of harvesters) </li></ul></ul><ul><ul><li>could, in theory, also provide encryption if required </li></ul></ul>
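Both access control and compression come from ordinary HTTP headers rather than from OAI-PMH itself. A minimal sketch, assuming a hypothetical data provider that accepts HTTP BASIC credentials and may gzip its responses; `build_request` and `decode_body` are illustrative helper names, and no network call is made here.

```python
import base64
import gzip
from urllib.request import Request


def build_request(url, username=None, password=None):
    """Prepare a harvesting request that asks for gzip and optionally
    carries HTTP BASIC credentials."""
    headers = {"Accept-Encoding": "gzip"}
    if username is not None:
        raw = f"{username}:{password}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode("ascii")
    return Request(url, headers=headers)


def decode_body(body, content_encoding):
    """Undo gzip compression if the server applied it."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


# Hypothetical closed-group repository and credentials.
req = build_request("http://repository.example.org/oai?verb=Identify",
                    username="harvester", password="secret")
print(req.get_header("Authorization"))
```

BASIC credentials are only base64-encoded, not encrypted, so a closed-group deployment would normally pair them with an encrypted transport such as HTTPS.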
  10. 10. Dublin Core <ul><li>OAI-PMH mandates use of simple DC as lowest common denominator </li></ul><ul><li>agreed XML schema – ‘oai_dc’ </li></ul><ul><ul><li>simple DC – 15 metadata properties </li></ul></ul><ul><ul><li>all DC properties optional and repeatable </li></ul></ul><ul><li>the 15 properties: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights </li></ul>
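'Optional and repeatable' has a practical consequence for anyone consuming `oai_dc`: every property is best modelled as a list that may be empty. The sketch below parses one `oai_dc` record into a dictionary of lists. The sample record is invented and `parse_oai_dc` is an illustrative name, not something defined by the schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Invented record: two creators (repeatable), no date (optional).
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:creator>Author, A.</dc:creator>
  <dc:creator>Author, B.</dc:creator>
</oai_dc:dc>"""


def parse_oai_dc(xml_text):
    """Map each simple DC property present in the record to a list of values."""
    root = ET.fromstring(xml_text)
    record = defaultdict(list)
    prefix = "{" + DC_NS + "}"
    for child in root:
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            if name in DC_ELEMENTS and child.text:
                record[name].append(child.text.strip())
    return dict(record)


record = parse_oai_dc(SAMPLE_DC)
print(record.get("creator", []))  # both creators
print(record.get("date", []))     # empty list: the property is absent
```

Using `record.get(name, [])` avoids treating a missing property as an error, which matters because a service provider cannot rely on any single element being present.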
  11. 11. Impact on institutions… <ul><li>OAI-PMH technology provides an open, relatively stable technical framework </li></ul><ul><ul><li>allows institution to re-consider management of intellectual output </li></ul></ul><ul><ul><li>greater confidence in availability of external services (e.g. discovery, access, analysis) </li></ul></ul><ul><li>the technical bit is easy </li></ul><ul><ul><li>EPrints software (Southampton), DSpace (MIT/HP), Fedora </li></ul></ul><ul><li>but, technical solutions are always easy! </li></ul><ul><ul><li>real problem is cultural change required to get academics to deposit </li></ul></ul>
  12. 12. Impact on libraries… <ul><li>library is natural choice as ‘managing agent’ for the institutional repository </li></ul><ul><ul><li>quality control </li></ul></ul><ul><ul><li>metadata enhancement </li></ul></ul><ul><ul><li>preservation </li></ul></ul><ul><li>but technical strengths of libraries quite variable, therefore technical collaboration within institution may be required </li></ul><ul><li>beginning to see some evidence of externally ‘hosted’ repository services being offered </li></ul>
  13. 13. Impact on researchers… <ul><li>OAI-PMH technology provides a ‘disruptive’ technical framework that supports </li></ul><ul><ul><li>new ways for individual researcher to disclose his/her research output </li></ul></ul><ul><ul><li>development of new kinds of ‘research’ discovery services </li></ul></ul><ul><li>can use ‘personal’ OAI repository </li></ul><ul><li>but, need to </li></ul><ul><ul><li>clarify roles of institutional, discipline and personal repositories </li></ul></ul><ul><ul><li>overcome FUD – IPR, peer-review, ability to ‘publish’, quality control, inertia </li></ul></ul>
  14. 14. ePrints UK <ul><li>RDN project funded by JISC under FAIR programme </li></ul><ul><li>now finished but ‘service’ still running </li></ul><ul><li>UK ‘service provider’ </li></ul><ul><li>harvesting metadata from all UK eprint archives </li></ul><ul><li>single point of discovery to UK eprints </li></ul><ul><li>working with OCLC and University of Southampton to automatically enhance harvested metadata </li></ul>
  15. 15. ePrints UK [Diagram labels: eprint archive(s), OAI-PMH, ePrints UK, name authority, subject classification, citation analysis, End-user]
  16. 16. What did we learn? <ul><li>impact of eprint archives still quite low </li></ul><ul><li>national coverage is potentially interesting to funders but not to end-users </li></ul><ul><li>automatically enhancing metadata is difficult, particularly w.r.t. </li></ul><ul><ul><li>subject classification </li></ul></ul><ul><ul><li>name authority </li></ul></ul><ul><li>approaches to metadata creation varied – no clear cataloguing guidelines </li></ul><ul><ul><li>linkage to full-text from metadata record inconsistent </li></ul></ul>
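  17. 21. OAI and Google <ul><li>OAI gateway makes harvested metadata available to Google… </li></ul><ul><li>Examples… </li></ul><ul><ul><li>DSpace and Google </li></ul></ul><ul><ul><li>OAIster and Yahoo </li></ul></ul> [Diagram labels: eprint archive(s), OAI-PMH, OAI gateway, HTTP]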
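The slide does not say how such a gateway is built. One plausible approach, sketched here purely as an assumption, is to render each harvested record as a plain HTML page that a Web crawler can fetch over HTTP. The `record_to_html` function and the sample record are hypothetical; the `DC.` prefix on the `<meta>` names follows the common convention for embedding Dublin Core in HTML.

```python
from html import escape


def record_to_html(record):
    """Render a harvested simple-DC record (dict of lists) as a crawlable page."""
    title = "; ".join(record.get("title", ["Untitled"]))
    meta = []
    for name, values in sorted(record.items()):
        for value in values:
            meta.append(f'<meta name="DC.{escape(name)}" content="{escape(value)}">')
    # Link through to the content so the crawler (and the reader) can reach it.
    links = [f'<a href="{escape(url)}">{escape(url)}</a>'
             for url in record.get("identifier", []) if url.startswith("http")]
    return "\n".join([
        "<html><head>",
        f"<title>{escape(title)}</title>",
        *meta,
        "</head><body>",
        f"<h1>{escape(title)}</h1>",
        *links,
        "</body></html>",
    ])


# Hypothetical harvested record.
print(record_to_html({
    "title": ["An example eprint"],
    "identifier": ["http://repository.example.org/42"],
}))
```

Escaping every value matters because harvested metadata comes from third parties and may contain markup characters.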
  18. 22. Current activities/issues <ul><li>protocol now stable and few changes being discussed </li></ul><ul><li>some lightweight noises about re-implementing OAI-PMH using SOAP (Web services) but little enthusiasm for pushing these kinds of changes forward </li></ul><ul><li>some work on OAI-rights issues – formalising mechanisms for attaching IPR statements and/or licences to the records being exchanged using the protocol, e.g. Creative Commons </li></ul>
  19. 23. Creative Commons <ul><li>CC is “ devoted to expanding the range of creative work available for others to build upon and share ” </li></ul><ul><li>provides ‘standard’ licences for content </li></ul><ul><ul><li>attribution </li></ul></ul><ul><ul><li>noncommercial </li></ul></ul><ul><ul><li>no derivative works </li></ul></ul><ul><ul><li>share alike </li></ul></ul><ul><li>mechanisms for indicating licence on Web pages </li></ul>
  20. 24. Works vs. manifestations <ul><li>implementers have tended to see ‘eprints’ as single-entity objects </li></ul><ul><li>some evidence that this is too simplistic </li></ul><ul><ul><li>some repositories expose metadata about the ‘work’, others expose metadata about the ‘expressions’ </li></ul></ul><ul><li>need more consistency in our use of the OAI-PMH to expose metadata about both ‘works’ and ‘manifestations’ </li></ul><ul><li>complex objects encoded using METS or MPEG-21 DIDL (may include ‘objects’ as well as ‘metadata about objects’) </li></ul>
  21. 25. Questions…