  • No coverage of technical details, which are beyond me. I do want to cover concepts and definitions so that if someone talks to you about these things, you will understand.
    1. Open Archives Initiative - Protocol for Metadata Harvesting. April 8, 2013. Richard Sapon-White
    2. Overview
       • Definitions
       • History
       • The OAI Model
       • Protocol for Metadata Harvesting
    3. Definitions
       • Harvester: client application issuing OAI-PMH requests
       • Harvesting: the gathering together of metadata from a number of distributed repositories into a combined data store
       • Archives: synonym for a repository of scholarly papers
       • Protocol: a set of rules defining communication between systems (such as FTP or HTTP)
    4. History of the OAI
       • E-print servers = archives or repositories
       • E-print servers provide access to scientific and technical papers and scholarly journal articles
       • Authors deposit pre-prints or published articles in these repositories
       • Concept: public, free access to scholarly information without a paid subscription to journals
    5. History of the OAI (cont.)
       • Why?
          • Scholarly research belongs to people
          • Speeds the sharing of research
          • Better for authors and readers
       • Known as the “open archives movement”
       • Has nothing to do with physical archives (repositories of institutional history or collections of unpublished materials)
    6. History of the OAI (cont.)
       • Many e-print servers grew
          • Overlapping disciplinary coverage
          • Overlapping geographic coverage
       • Developing need to
          • search multiple repositories simultaneously (= federated searching)
          • automatically identify and copy papers from other repositories (= repository synchronization)
    7. History of the OAI (cont.)
       • Meeting of experts, 1999, Santa Fe, New Mexico, USA
       • Defined an interface so that repositories could expose metadata for the papers they held
       • Metadata could then be discovered and copied by federated search services and other repositories
       • Known as the Santa Fe Convention (later developed into PMH, the Protocol for Metadata Harvesting)
    8. The Open Archives Model
       • Similar in concept to a union catalog
       • Metadata is “harvested” and stored in a central repository
       • “Pull” rather than “push” model
       • Collecting is similar to an Internet spider collecting HTML content
    9. PMH and Z39.50
       • Differs from Z39.50 (specifically rejected at Santa Fe)
       • Z39.50:
          • allows a client to search a remote information server across a network
          • makes high-quality federated searches across many servers difficult, because each server must be dealt with individually
          • is a complex protocol
    10. PMH and Z39.50 (cont.)
       • PMH is a simple protocol
       • User interacts with a database of harvested metadata, not with individual repositories
       • Database is constructed by the federated search service using PMH
       • Therefore, search performance depends only on the federated search service, not on the individual repositories
    11. Metadata Harvesting Protocol
       • Queries and responses are carried over HTTP
       • Harvester application can request a single metadata record or a group of records to be exported
          • Application can restrict records by date to gather only new records (since the previous harvest)
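
The two kinds of request on this slide can be sketched as plain HTTP query URLs. This is a minimal illustration, not a full harvester: the endpoint `repository.example.org`, the record identifier, and the helper `build_request` are made up for the example, while the verbs `GetRecord` and `ListRecords` and the arguments `identifier`, `metadataPrefix`, and `from` are the ones defined by OAI-PMH 2.0.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def build_request(verb, **arguments):
    """Return the URL for one OAI-PMH request against BASE_URL."""
    params = {"verb": verb}
    for name, value in arguments.items():
        # "from" is a Python keyword, so callers pass it as from_.
        params[name.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(params)


# A single record, named by its (made-up) OAI identifier.
single = build_request(
    "GetRecord",
    identifier="oai:repository.example.org:1234",
    metadataPrefix="oai_dc",
)

# A group of records, restricted to those added or changed since the last harvest.
incremental = build_request("ListRecords", metadataPrefix="oai_dc", from_="2013-04-01")

print(single)
print(incremental)
```

A harvester would remember the date of its last successful run and pass it as `from` on the next one, which is what keeps repeat harvests small.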
    12. Metadata Harvesting Protocol (cont.)
       • OAI-compliant data providers are capable of responding to such requests
          • Data provider must be able to export metadata in at least unqualified Dublin Core (DC), using XML as the communication syntax
          • Data provider includes a URI with the metadata
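
To make "unqualified DC in XML" concrete, here is a hedged sketch of what a harvester might do with one exported record. The sample record is invented and abbreviated (a real response wraps it in an `OAI-PMH` envelope), and `dc_fields` is a hypothetical helper; the namespace URIs are the standard OAI-PMH 2.0 and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Invented, abbreviated record for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1234</identifier>
    <datestamp>2013-04-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Preprint</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:identifier>https://repository.example.org/papers/1234</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""

# Namespace of the fifteen unqualified Dublin Core elements.
DC = "{http://purl.org/dc/elements/1.1/}"


def dc_fields(record_xml):
    """Collect every Dublin Core element in the record as name -> list of values."""
    fields = {}
    for element in ET.fromstring(record_xml).iter():
        if element.tag.startswith(DC):
            fields.setdefault(element.tag[len(DC):], []).append(element.text)
    return fields


fields = dc_fields(SAMPLE_RECORD)
print(fields["title"], fields["identifier"])
```

The `dc:identifier` value plays the role of the URI mentioned on the slide: it points back to the paper in its home repository.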
    13. Metadata Harvesting Protocol (cont.)
       • Servers can also provide metadata in schemes besides DC
       • Harvester applications can request metadata in schemes besides DC
       • Harvester applications can also query a metadata repository for:
          • List of metadata formats supported by the repository
          • List of record sets supported by the repository
          • List of the identifiers of all records within the repository
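
The three list queries above correspond to three OAI-PMH 2.0 verbs. The sketch below only builds the request URLs; the endpoint is hypothetical, and the labels in `DISCOVERY_REQUESTS` are just names chosen for this example.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

# One entry per query on the slide, with the arguments each verb needs.
DISCOVERY_REQUESTS = {
    "metadata formats": {"verb": "ListMetadataFormats"},
    "record sets": {"verb": "ListSets"},
    # ListIdentifiers must name a metadata format; oai_dc is always available.
    "record identifiers": {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"},
}

urls = {
    label: f"{BASE_URL}?{urlencode(params)}"
    for label, params in DISCOVERY_REQUESTS.items()
}

for label, url in urls.items():
    print(f"{label}: {url}")
```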
    14. Why the OAI-PMH is important
       • Provides for a minimal level of interoperability
       • Drives development of community-specific metadata schemes
       • Potential for new modes of scholarly communication
       • Dependent on widespread implementation by research organizations, publishers, and “memory organizations” (i.e., libraries, museums, archives)
    15. QUIZ!!!
    16. Problems with Metadata Harvesting
       • Loss of data when mapping to unqualified DC
       • Incorrect data from improper mapping
       • Inconsistent punctuation and formatting because of diverse sources of metadata
          • High variance in data between institutions
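
The first problem on this slide, data loss when mapping to unqualified DC, can be seen in a small sketch. The dotted `element.qualifier` naming, the sample values, and the helper `to_unqualified` are assumptions made for the example; real repositories store qualified metadata in various ways.

```python
# Hypothetical qualified record: each field is "element.qualifier" or just "element".
qualified = [
    ("title", "Main Title"),
    ("title.alternative", "Other Title"),
    ("date.created", "2010-05-01"),
    ("date.modified", "2013-04-08"),
]


def to_unqualified(pairs):
    """Drop the qualifier from each field name, keeping only the base element."""
    simple = {}
    for name, value in pairs:
        element = name.split(".")[0]
        simple.setdefault(element, []).append(value)
    return simple


unqualified = to_unqualified(qualified)
print(unqualified)
```

After the mapping, both dates sit under a plain `date` element, so a harvester can no longer tell which one is the creation date and which the modification date.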
    17. Metasearching
       • Many systems = many metadata standards
       • Convert to a single system (harvesting)?
       • Maintain individual element sets BUT create an interface to search simultaneously across heterogeneous databases
       • Voila: Metasearching!
          • Not a single method
    18. Definition
       • From the NISO MetaSearch Initiative: “search and retrieval to span multiple databases, sources, platforms, protocols, and vendors at one time.”
       • Best known: the Z39.50 protocol, used to search remote library catalogs
    19. Z39.50
       • Allows computers to communicate, as client and server, to retrieve information
       • Searches and results are restricted to Z39.50 databases
    20. Z39.50 results
       • Server may interpret the query incorrectly
          • Some automatically add Boolean “and” while others add Boolean “or”
          • Vocabulary issues: different vocabulary in different databases
          • Results may display in order retrieved, by database, by date, or by relevance
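
The Boolean “and” versus “or” issue is easy to demonstrate with a toy example. This is not Z39.50 code: the three records and the `search` function are invented purely to show how the same two-word query returns different results depending on which operator a server inserts.

```python
# Invented records standing in for one server's database.
RECORDS = {
    1: "metadata harvesting protocol",
    2: "library metadata standards",
    3: "harvesting scholarly papers",
}


def search(query, operator):
    """Match records containing all query terms ("and") or any of them ("or")."""
    terms = query.lower().split()
    combine = all if operator == "and" else any
    return [
        record_id
        for record_id, title in RECORDS.items()
        if combine(term in title.split() for term in terms)
    ]


and_hits = search("metadata harvesting", "and")  # [1]
or_hits = search("metadata harvesting", "or")    # [1, 2, 3]
print(and_hits, or_hits)
```

A searcher who cannot tell which interpretation each server applied has no way to know why one catalog returned one hit and another returned three.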
    21. Problems with Z39.50
       • High recall, little precision
       • Also present in Google Search: few studies on user satisfaction
       • Results may display in an order that is irrelevant to the searcher
    22. Metasearching: pros and cons
       • Single database searching allows users to use specialized indexing or controlled vocabulary
       • Single portal:
          • No need for the searcher to select a particular database from a list of databases
    23. Case Studies
       • Divide into 3-4 groups
       • Read the case study
       • Discuss and report:
          • Describe the case briefly (2 min.)
          • What can we learn from this case study? (3 min.)