Open Archives Initiatives For Metadata Harvesting


Published in: Technology, Education

  1. 1. Open Archives Initiatives for Metadata Harvesting: A Framework for Building Open Digital Libraries. Term Paper-1, submitted by NIKESH.N, International School of Information Management, University of Mysore, 2010
  2. 2. 1.0 Introduction

     A digital library may be defined as a system that supports the collection, organization, storage, retrieval and dissemination of digital documents. It may be viewed as the intersection of library science, computer science and networked information systems. Open-access movements are gaining acceptance in the scholarly information arena, and many universities and research centers have started to provide public access to their repositories. With the growing number of digital repositories on the Web, it has become difficult for users to visit each one individually in search of information, and many organizational repositories are not indexed by search engines. A mechanism is therefore required by which repositories can share resources and work in coordination, giving users a broader view. The ability of information systems to work in coordination is termed interoperability.

     The Open Archives Initiative is one of the landmark efforts to make the metadata of digital resources held in many repositories available at the user's end. The essence of the open archives approach is to enable access to Web-accessible material through interoperable repositories for metadata sharing, publishing and archiving. These interoperability requirements led to the development of standards such as the Dublin Core Metadata Element Set and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). These standards have achieved a degree of success in the digital library (DL) community, largely because of their generality and simplicity.

     2.0 Need for a Harvester Protocol

     There is a growing need to make resources themselves, not only descriptive metadata, harvestable in an interoperable manner.
     There are two major use cases that motivate this need:

     • Preservation: The need to periodically transfer digital content from a data repository to one or more trusted digital repositories charged with storing and preserving safety copies of the content. The trusted digital repositories need a mechanism to synchronize automatically with the originating data repository.
  3. 3. • Discovery: The need to use the content itself in the creation of services. Examples include search engines that make the full text from multiple data repositories searchable, and citation indexing systems that extract references from the full-text content. Another scenario is the provision of thumbnail versions of high-quality images from cultural heritage collections to external services that build browsing interfaces around the thumbnails.

     3.0 OAI Protocol for Metadata Harvesting (OAI-PMH)

     The Open Archives Initiative (OAI) was launched in October 1999 in an attempt to address interoperability issues among the many existing, independent DLs. The focus was on high-level communication among systems and on simplicity of protocols. The OAI has since received much media attention in the DL community and, primarily because of the simplicity of its standards, has attracted many early adopters. It defines a mechanism for harvesting records containing metadata from repositories.

     3.1 Definitions of Key Terms

     • Open Archives Initiative (OAI): The OAI is an initiative to develop and promote interoperability standards that aim to facilitate the efficient dissemination of content.
     • Archive: The term "archive" in the name Open Archives Initiative reflects the origins of the OAI in the e-prints community where the term archive is generally accepted as a synonym for repository of scholarly papers. Members of the archiving profession have justifiably noted the strict definition of an "archive" within their domain; with connotations of preservation of long-term value, statutory authorization and institutional policy. The OAI uses the term "archive" in a broader sense: as a repository for stored information. Language and terms are never unambiguous and uncontroversial and the OAI respectfully requests the indulgence of the professional archiving community with this broader use of "archive".
  4. 4. (OAI definition quoted from the FAQ on the OAI Web site.)

     • OAI Protocol for Metadata Harvesting (OAI-PMH): OAI-PMH is a lightweight harvesting protocol for sharing metadata between services.
     • Protocol: A protocol is a set of rules defining communication between systems. FTP (File Transfer Protocol) and HTTP (Hypertext Transfer Protocol) are examples of other protocols used for communication between systems across the Internet.
     • Harvesting: In the OAI context, harvesting refers specifically to the gathering together of metadata from a number of distributed repositories into a combined data store.

     3.2 Prerequisites for Developing a Metadata Harvesting Protocol

     To facilitate metadata harvesting, there needs to be agreement on:

     o Transport protocol: HTTP, FTP or another such protocol
     o Metadata format: Dublin Core, MARC or another such format
     o Metadata quality assurance: mandatory element set, naming and subject conventions, etc.
     o Intellectual property and usage rights: who can do what with what?

     3.3 OAI: Key Players

     There are two groups of participants: Data Providers and Service Providers.
  5. 5. Data Providers (open archives, repositories) provide free access to metadata and may, but need not, offer free access to full texts or other resources. OAI-PMH provides an easy-to-implement, low-barrier solution for Data Providers.

     Service Providers use the OAI interfaces of the Data Providers to harvest and store metadata. This means that no live search requests are sent to the Data Providers; rather, services are based on the data harvested via OAI-PMH. Service Providers may select certain subsets from Data Providers (e.g., by set hierarchy or datestamp). They offer value-added services on the basis of the harvested metadata, and they may enrich that metadata in order to do so.

     3.4 How It Works
  6. 6. OAI-PMH gives data providers a simple technical option for making their metadata available to services, based on the open standards HTTP (Hypertext Transfer Protocol) and XML (Extensible Markup Language). The metadata that is harvested may be in any format agreed by a community (or by any discrete set of data and service providers), although unqualified Dublin Core is specified to provide a basic level of interoperability. Thus, metadata from many sources can be gathered together in one database, and services can be provided based on this centrally harvested, or "aggregated", data. The link between this metadata and the related content is not defined by the OAI protocol. It is important to realize that OAI-PMH does not provide a search across this data; it simply makes it possible to bring the data together in one place. In order to provide services, the harvesting approach must be combined with other mechanisms.

     3.5 Protocol Details

     Records

     A record is the metadata of a resource in a specific format. A record has three parts: a header and metadata, both of which are mandatory, and an optional about statement. Each of these is made up of various components, as set out below.

     header (mandatory)
        identifier (mandatory: 1 only)
        datestamp (mandatory: 1 only)
        setSpec elements (optional: 0, 1 or more)
        status attribute for deleted item
     metadata (mandatory)
        XML-encoded metadata with root tag and namespace
        repositories must support Dublin Core, may support other formats
     about (optional)
        rights statements
        provenance statements
  7. 7. Datestamps

     A datestamp is the date on which a metadata record was created, last modified or deleted. It is mandatory for every record and has two possible levels of granularity: YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ. The datestamp enables selective harvesting with the from and until arguments, which is the basis of incremental update mechanisms. Support for deleted records is declared at one of three levels: no, persistent or transient.

     Metadata schema

     OAI-PMH supports the dissemination of multiple metadata formats from a repository. Each metadata format has the following properties:

     – an id string that specifies the format (metadataPrefix)
     – a metadata schema URL (the XML schema against which validity is tested)
     – an XML namespace URI (a global identifier for the metadata format)

     Repositories must be able to disseminate unqualified Dublin Core. Further arbitrary metadata formats can be defined and transported via OAI-PMH. Any returned metadata must comply with an XML namespace specification.
  8. 8. The Dublin Core Metadata Element Set contains 15 elements. All elements are optional, and all elements may be repeated.

     3.6 The Dublin Core Metadata Element Set

     Title          Contributor    Source
     Creator        Date           Language
     Subject        Type           Relation
     Description    Format         Coverage
     Publisher      Identifier     Rights

     Sets

     Sets enable a logical partitioning of repositories. They are optional: archives do not have to define sets, and there are no recommendations for implementing them. Sets are not necessarily exhaustive of the content of a repository, nor are they necessarily strictly hierarchical. Communities need to negotiate agreements that define the sets useful to them.

     • Function: selective harvesting (set parameter)
     • Applications: subject gateways, dissertation search engines and others
     • Examples:
        o publication types (thesis, article, ...)
        o document types (text, audio, image, ...)
        o content sets, according to DNB (medicine, biology, ...)

     3.7 Request Format

     Requests must be submitted using the GET or POST methods of HTTP, and repositories must support both methods. At least one key=value pair, verb=RequestType (where RequestType is a type of request such as ListRecords), must be provided. Additional key=value pairs depend on the request type.
  9. 9. Example of a GET request: verb=ListRecords&metadataPrefix=oai_dc

     The encoding of special characters must be supported; for example, ":" (host port separator) becomes "%3A".

     3.8 Response

     Responses are formatted as HTTP responses. The content type must be text/xml. HTTP status codes, as distinguished from OAI-PMH errors, such as 302 (redirect) and 503 (service unavailable), may be returned. Compression encodings are optional in OAI-PMH; only identity encoding is mandatory. The response must be well-formed XML with markup as follows:

     1. XML declaration (<?xml version="1.0" encoding="UTF-8" ?>)
     2. root element named OAI-PMH with three attributes (xmlns, xmlns:xsi, xsi:schemaLocation)
     3. three child elements:
        1. responseDate (UTC datetime)
        2. request (the request that generated this response)
        3. either a) error (in case of an error or exception condition) or b) an element with the name of the OAI-PMH request
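The request and response formats described in sections 3.5 to 3.8 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the base URL `http://repository.example.org/oai`, the helper names `build_request` and `parse_headers`, and the sample response are invented for illustration and do not come from any real repository. No network call is made; the code builds a GET URL from key=value pairs and reads the record headers out of a hand-written response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical base URL (an assumption); a real repository publishes its own.
BASE_URL = "http://repository.example.org/oai"

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(verb, **arguments):
    """Return a GET URL carrying verb=... plus any further key=value pairs."""
    params = {"verb": verb}
    for key, value in arguments.items():
        # 'from' is a Python keyword, so callers pass 'from_' instead.
        params[key.rstrip("_")] = value
    # urlencode percent-encodes special characters, e.g. ":" becomes "%3A".
    return BASE_URL + "?" + urlencode(params)


# Invented sample response, shaped as described in section 3.8.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-03-01T12:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://repository.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:123</identifier>
        <datestamp>2010-02-15</datestamp>
        <setSpec>thesis</setSpec>
      </header>
      <metadata/>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:repository.example.org:124</identifier>
        <datestamp>2010-02-20</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_headers(xml_text):
    """Return one dict per record header found in an OAI-PMH response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # An <error> child replaces the verb element when the request failed.
    error = root.find("oai:error", OAI_NS)
    if error is not None:
        raise ValueError(f"OAI-PMH error {error.get('code')}: {error.text}")
    headers = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        headers.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],
            "deleted": header.get("status") == "deleted",
        })
    return headers


if __name__ == "__main__":
    print(build_request("ListRecords", metadataPrefix="oai_dc",
                        from_="2010-01-01T00:00:00Z",
                        until="2010-02-01T00:00:00Z"))
    for item in parse_headers(SAMPLE_RESPONSE):
        print(item)
```

A harvester built along these lines would remember the datestamp of its last run and pass it as the from argument on the next run, so that only new, changed or deleted records are transferred.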
  10. 10. 3.9 OAI-PMH Verbs

     Here "verb" means the request type that the service provider (harvester) sends to obtain a response from a data provider. There is a standard set of six verbs:

     o Identify
     o ListMetadataFormats
     o ListSets
     o GetRecord
     o ListIdentifiers
     o ListRecords

     Verb                   Function
     Identify               Description of the repository
     ListMetadataFormats    Metadata formats supported by the repository
     ListSets               Sets defined by the repository
     ListIdentifiers        Retrieves the unique identifiers (record headers) of the items in the repository
     ListRecords            Used to harvest records from the repository
     GetRecord              Retrieves an individual metadata record from the repository
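Each of the six verbs takes its own arguments. The following Python sketch shows one way a harvester might check a request before sending it. The argument table reflects OAI-PMH 2.0, but the function name `check_request` and the message wording are assumptions made for illustration, and the resumptionToken argument used for paging through long lists is deliberately left out to keep the sketch short.

```python
# Required and optional arguments per verb in OAI-PMH 2.0.
# resumptionToken (an exclusive argument used for paging) is omitted here.
VERB_ARGUMENTS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set()},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"}},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"}},
}


def check_request(params):
    """Return a list of problems with a request; an empty list means it looks valid."""
    verb = params.get("verb")
    if verb not in VERB_ARGUMENTS:
        return [f"badVerb: {verb!r}"]
    spec = VERB_ARGUMENTS[verb]
    given = set(params) - {"verb"}
    problems = []
    # Every required argument must be present.
    for missing in sorted(spec["required"] - given):
        problems.append(f"badArgument: missing {missing}")
    # Anything that is neither required nor optional is rejected.
    for extra in sorted(given - spec["required"] - spec["optional"]):
        problems.append(f"badArgument: unexpected {extra}")
    return problems


if __name__ == "__main__":
    print(check_request({"verb": "ListRecords", "metadataPrefix": "oai_dc"}))
    print(check_request({"verb": "GetRecord", "identifier": "oai:example:1"}))
    print(check_request({"verb": "Harvest"}))
```

The labels badVerb and badArgument mirror the error codes a repository returns in the error element of its response; a repository performs the same kind of check on its side.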
  11. 11. A harvester is not required to use all request types; a repository, however, must implement all of them. Each request type has required and optional arguments.

     4.0 DSpace: OAI-Compatible Digital Library Software

     DSpace is open source software for building and managing digital repositories. Developed jointly by MIT Libraries and Hewlett-Packard (HP), it is freely available to research institutions as an open source system that can be customized and extended. DSpace is a digital institutional repository that captures, stores, indexes, preserves and redistributes content in digital formats. An institutional repository is a set of services that a research institution, organization or university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members.

     Typically, DSpace has been deployed for institutional repositories of publications, theses and dissertations. Several groups are working on extending its capabilities, such as the implementation of ontologies in the search interface and the submission module, customization for the management of electronic theses and dissertations, and localization and internationalization of the package for the world's languages. DSpace is compliant with OAI-PMH version 2.0, so metadata in DSpace digital libraries can be harvested.

     4.1 DSpace Search System

     The end user can browse, search and access the collections using the hierarchies and the alphabetic bar menu. For searching the collection, DSpace uses the Lucene search engine, which is part of the Apache Jakarta Project (1). Additionally, research projects such as the …(Portugal)… provide ontologies that enable context-based querying. These work like subject-based directory structures. The Lucene search engine has powerful search features that cover many of the end user's search approaches. It provides the basic "exact term" or keyword search.
     In addition, it allows fielded search, akin to the field-level search of library databases. In DSpace, Dublin Core elements are used for the field names. Lucene also supports Boolean searches, range searches, term boosting and proximity searches. An interesting facility is Lucene's fuzzy search, which is based on Levenshtein's algorithm (5) and matches terms by similarity. This feature is especially useful when we have heard a term and must guess its spelling, particularly in the case of personal names.
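The idea behind fuzzy matching can be shown with a small Python sketch of the Levenshtein edit distance, which counts the insertions, deletions and substitutions needed to turn one string into another. This is not Lucene's implementation, which is more elaborate and also scores its matches; the function names and the threshold of two edits are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, terms, max_edits=2):
    """Return the terms within max_edits of the query, ignoring case."""
    query = query.lower()
    return [term for term in terms
            if levenshtein(query, term.lower()) <= max_edits]


if __name__ == "__main__":
    names = ["Ranganathan", "Raman", "Rangan"]
    # A misspelt personal name still finds the intended index term.
    print(fuzzy_match("Ranganathen", names))
```

A search for a half-remembered personal name can therefore still retrieve the correct entry, at the cost of comparing the query with many candidate terms.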
  12. 12. 4.2 Metadata in DSpace

     DSpace users come across metadata in the following modules:

     • Administration modules: Dublin Core registry, administrative metadata - default values, mail alerts to subscribers
     • Submission modules: descriptive metadata
     • Harvesting: OAI-PMH using the unqualified DC elements
     • Search result display: brief and full metadata

     4.3 Metadata Harvesting in DSpace

     DSpace is compliant with OAI-PMH for exposing metadata. OAI-PMH allows repositories to expose a hierarchy of sets in which records may be placed. DSpace exposes collections as sets: each collection has a corresponding OAI set, and harvesters use the ListSets verb (OAI command) to discover the sets. Only the 15 basic Dublin Core elements are exposed at present.

     5.0 OAI Harvester Software

     o Arc
     o Citebase
     o CYCLADES
     o DP9
     o MeIND
     o METALIS
     o my.OAI
     o NCSTRL
     o Perseus
     o Public Knowledge Project – Open Archives Harvester
     o OAICAT
     o OAI Repository Explorer
     o OAIster
     o OASIC (Open Archives en SIC)
     o OAIHarvester
     o DLESE OAI Software

     6.0 Future Prospects
  13. 13. Some more work has to be done to make OAI-PMH a complete, globally accepted metadata harvesting protocol:

     o Tools and software have to be developed with which repositories that are not OAI-PMH compliant can be made compliant, so that they can act as data providers.
     o Later versions of the protocol should be made backward compatible with earlier ones.

     At the metadata creation level some standardization is required, since the same resource is described inconsistently at different repositories. Vocabulary control measures should also be put in place. Only when these further improvements to OAI-PMH have been made can we ensure that our end users get a comprehensive view of the resources available on a particular subject.

     7.0 Conclusion

     Much promise is seen for the use of the protocol within an open archives approach. Support for a new pattern of scholarly communication is the most publicized potential benefit. Perhaps most readily achievable are the goals of surfacing "hidden resources" and low-cost interoperability. Although OAI-PMH is technically very simple, building coherent services that meet user requirements remains complex. OAI-PMH could become part of the infrastructure of the Web, as taken for granted as HTTP now is, if its relative simplicity and its proven success with early implementers in a service context lead to widespread uptake by research organizations, publishers and archives.

     REFERENCES

     1.
     2. Breeding, M. (2002, April). The Emergence of the Open Archives Initiative: This Protocol Could Become a Key Part of the Digital Library Infrastructure. Information Today.
     3. Breeding, M. (2002). Understanding the Protocol for Metadata Harvesting of the Open Archives Initiative. Computers in Libraries, 22(8).
     4. Lagoze, C., & Sompel, H. V. d. (2001, January). The Open Archives Initiative Protocol for Metadata Harvesting.
  14. 14. 5. Lynch, C. A. (2001, August). Metadata Harvesting and the Open Archives Initiative. ARL Bimonthly Report 217.
     6. Shearer, K. (2002, March). The Open Archives Initiative: Developing an Interoperability Framework for Scholarly Publishing. CARL/ABRC Background Series, No. 5.
     7. Suleman, H., & Fox, E. A. (2001, December). A Framework for Building Open Digital Libraries. D-Lib Magazine, 7(12).
     8. Sompel, H. V. d., & Lagoze, C. (2000, February). The Santa Fe Convention of the Open Archives Initiative. D-Lib Magazine, 6(2). from oai/02vandesompel-oai.html
     9. Warner, S. (2001, June). Exposing and Harvesting Metadata Using the OAI Metadata Harvesting Protocol: A Tutorial. HEP Libraries Webzine, Issue 4.
     11.
     12. Shepherd, M. (2003). Interoperability for Digital Libraries. DRTC Workshop on Semantic Web, 8th – 10th December, 2003, DRTC, Bangalore.
     13.
     14.