Open Archives Initiative - Protocol
    for Metadata Harvesting

            April 8, 2013
        Richard Sapon-White




Overview

 Definitions
 History
 The OAI Model
 Protocol for Metadata Harvesting




Definitions

 Harvester - client application issuing OAI-PMH
  requests
 Harvesting - the gathering together of metadata
  from a number of distributed repositories into a
  combined data store
 Archives – synonym for a repository of scholarly
  papers
 Protocol - a set of rules defining communication
  between systems (such as ftp or http)

History of the OAI

 E-print servers = archives or repositories
 E-print servers provide access to scientific and
  technical papers, scholarly journal articles
 Authors deposit pre-prints or published articles in
  these repositories
 Concept: public, free access to scholarly
  information without paid subscription to journals


History of the OAI (cont.)

 Why?
      Scholarly research belongs to people
      Speeds the sharing of research
      Better for authors and readers
 Known as the “open archives movement”
 Has nothing to do with physical archives
  (repositories of institutional history or collections
  of unpublished materials)

History of the OAI (cont.)

 Many e-print servers emerged
     Overlapping disciplinary coverage
     Overlapping geographic coverage
 Growing need to:
     search multiple repositories simultaneously
      (=federated searching)
     automatically identify and copy papers from
      other repositories (=repository synchronization)

History of the OAI (cont.)

 Meeting of experts, 1999, Santa Fe, New Mexico,
  USA
 Defined an interface so that repositories could
  expose metadata for papers they held
 Metadata could then be discovered by federated
  search services and other repositories and copied
 Known as the Santa Fe Convention (later developed
  into PMH – Protocol for Metadata Harvesting)

The Open Archives Model

 Similar concept to union catalog
 Metadata “harvested” and stored in central
  repository
 “Pull” rather than “push” model
 Collecting is similar to Internet spider
  collecting HTML content


PMH and Z39.50

 PMH differs from Z39.50 (which was specifically
  rejected at Santa Fe)
 Z39.50:
      Allows a client to search a remote information
       server across a network
      Difficult to perform high-quality federated searches
       across many servers – would need to deal with each
       server individually
      Complex protocol

PMH and Z39.50 (cont.)

 PMH is a simple protocol
 User interacts with database of harvested metadata,
  not with individual repositories
 Database is constructed by the federated search
  service using PMH
 Therefore, performance depends only on the
  federated search service, not the individual
  repositories
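
A minimal sketch of the point above, assuming a hypothetical in-memory SQLite table stands in for the federated search service's database of harvested metadata. The table layout, repository names, and sample titles are illustrative assumptions and are not part of OAI-PMH. The user's query touches only the local table; no repository is contacted at search time.

```python
import sqlite3

# Hypothetical records already harvested from two repositories.
HARVESTED = [
    ("repo-a", "oai:a:1", "Metadata quality in e-print servers"),
    ("repo-b", "oai:b:7", "Harvesting metadata at scale"),
    ("repo-b", "oai:b:9", "Union catalogs revisited"),
]


def build_index(harvested):
    """Load harvested (repository, identifier, title) rows into a local table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE records (repository TEXT, identifier TEXT, title TEXT)"
    )
    db.executemany("INSERT INTO records VALUES (?, ?, ?)", harvested)
    return db


def search_titles(db, term):
    """Search the local copy only; the source repositories are not queried."""
    rows = db.execute(
        "SELECT repository, title FROM records "
        "WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return rows.fetchall()


db = build_index(HARVESTED)
for repository, title in search_titles(db, "metadata"):
    print(repository, "|", title)
```

Because the search runs against one local store, a slow or offline repository affects only the next harvest, not the user's query.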

Metadata Harvesting Protocol

 Queries and responses carried over http
 Harvester application can request a single
  metadata record or group of records to be
  exported
     Application can restrict records by date to only
      gather new records (since previous harvesting)
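
The requests described above can be sketched as plain HTTP query strings. The verbs (GetRecord, ListRecords) and the parameters (identifier, metadataPrefix, from) come from the OAI-PMH specification; the base URL and the record identifier are hypothetical, and the sketch only builds the URLs without sending them.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"


def get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build a request for a single metadata record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return BASE_URL + "?" + urlencode(params)


def list_records_url(from_date=None, metadata_prefix="oai_dc"):
    """Build a request for a group of records, optionally restricted by date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Only records added or changed on or after this date are returned.
        params["from"] = from_date
    return BASE_URL + "?" + urlencode(params)


print(get_record_url("oai:repository.example.org:1234"))
print(list_records_url(from_date="2013-04-01"))
```

A harvester would typically remember the date of its last run and pass it as the from value on the next one, so that only new records are gathered.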



Metadata Harvesting Protocol
                (cont.)
 OAI-compliant data providers are capable of
  responding to such requests
     Data provider must be able to export metadata in
      at least DC (unqualified) using XML
      communication syntax
     Data provider includes URI with metadata
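
As an illustration of what such an export can look like, the sketch below parses one record in unqualified Dublin Core. The three XML namespaces are the ones defined for OAI-PMH 2.0 and Dublin Core; the sample record, its identifiers, and the assumption that the item's URI is carried in dc:identifier are made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, shaped like the <record> element of a response.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1234</identifier>
    <datestamp>2013-04-08</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Paper</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:identifier>https://repository.example.org/papers/1234</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
""".strip()


def parse_record(xml_text):
    """Pull a few unqualified DC fields out of one harvested record."""
    record = ET.fromstring(xml_text)
    dc = record.find("oai:metadata/oai_dc:dc", NS)
    return {
        "oai_identifier": record.findtext(
            "oai:header/oai:identifier", namespaces=NS
        ),
        "title": dc.findtext("dc:title", namespaces=NS),
        "creators": [e.text for e in dc.findall("dc:creator", NS)],
        "uri": dc.findtext("dc:identifier", namespaces=NS),
    }


print(parse_record(SAMPLE_RECORD))
```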



Metadata Harvesting Protocol
                 (cont.)
 Servers can also provide metadata in other schemes
  besides DC
 Harvester applications can request metadata in
  other schemes besides DC
 Harvester applications can also query a metadata
  repository for:
      List of metadata formats supported by repository
      List of record sets supported by the repository
      List of the identifiers of all records within the repository
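
The three list queries above correspond to the OAI-PMH verbs ListMetadataFormats, ListSets, and ListIdentifiers. The sketch below only builds the request URLs; the base URL is hypothetical. Note that ListIdentifiers, unlike the other two, requires a metadataPrefix argument.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"


def discovery_urls(metadata_prefix="oai_dc"):
    """Return one request URL per list query a harvester can issue."""
    queries = {
        "metadata formats": {"verb": "ListMetadataFormats"},
        "sets": {"verb": "ListSets"},
        "identifiers": {
            "verb": "ListIdentifiers",
            "metadataPrefix": metadata_prefix,
        },
    }
    return {
        name: f"{BASE_URL}?{urlencode(params)}"
        for name, params in queries.items()
    }


for name, url in discovery_urls().items():
    print(f"{name}: {url}")
```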

Why the OAI-PMH is
                 important
 Provides for a minimal level of interoperability
 Drives development of community-specific
  metadata schemes
 Potential for new modes of scholarly
  communication
 Dependent on widespread implementation by
  research organizations, publishers, and “memory
  organizations” (i.e., libraries, museums, archives)

QUIZ!!!

 http://www.oaforum.org/tutorial/english/page1.h




Problems with Metadata
                Harvesting
 Loss of data when mapping to unqualified DC
 Incorrect data from improper mapping
 Inconsistent punctuation and formatting
  because of diverse sources of metadata
     High variance in data between institutions
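
A small sketch of the first problem, assuming a source record described with a few qualified Dublin Core refinements (created, issued, abstract, alternative). The mapping table is a partial, illustrative one and the sample record is made up. Once the refinements are folded into their parent elements, the two dates can no longer be told apart.

```python
# Partial, illustrative mapping from DC refinements to unqualified elements.
REFINEMENT_TO_DC = {
    "created": "date",
    "issued": "date",
    "abstract": "description",
    "alternative": "title",
}


def to_unqualified_dc(qualified):
    """Fold (element, value) pairs into unqualified DC, dropping qualifiers."""
    simple = {}
    for element, value in qualified:
        target = REFINEMENT_TO_DC.get(element, element)
        simple.setdefault(target, []).append(value)
    return simple


record = [
    ("title", "An Example Paper"),
    ("created", "2012-11-02"),
    ("issued", "2013-04-08"),
]
print(to_unqualified_dc(record))
# Both dates end up under "date"; which one was the creation date is lost.
```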




Metasearching

 Many systems = many metadata standards
 Convert to single system (harvesting)?
 Maintain individual element sets BUT create
  interface to search simultaneously across
  heterogeneous databases
 Voila: Metasearching!
     Not a single method

Definition

 From NISO MetaSearch Initiative:
  “search and retrieval to span multiple databases,
  sources, platforms, protocols, and vendors at one
  time.”
 Best known: Z39.50 protocol. Used to
  search remote library catalogs.



Z39.50

 Allows computers to communicate to
  retrieve information – between client and
  server
 Searches and results are restricted to Z39.50
  databases




Z39.50 results

 Server may interpret the query incorrectly
     Some automatically add Boolean “and” while
      others add Boolean “or”
     Vocabulary issues – different vocabulary in
      different databases
     Display results in order retrieved, by database
      found, by date, or by relevance
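
To illustrate the Boolean issue, the toy sketch below runs the same two-word query against the same titles under each default operator. It is not Z39.50 code; the titles, the whitespace tokenization, and the function name are assumptions made for the example.

```python
# Hypothetical titles held by a remote database.
TITLES = [
    "open archives",
    "open access journals",
    "museum archives",
]


def run_query(titles, query, default_operator):
    """Match titles against a query, joining its terms with AND or OR."""
    terms = query.lower().split()
    combine = all if default_operator == "and" else any
    return [
        title
        for title in titles
        if combine(term in title.lower().split() for term in terms)
    ]


print("AND:", run_query(TITLES, "open archives", "and"))
print("OR: ", run_query(TITLES, "open archives", "or"))
```

With an implicit "and" only the first title matches; with an implicit "or" all three do, so the same search can return very different result sets from different servers.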


Problems with Z39.50

 High recall, little precision
 Same problem is present in Google Search; few
  studies exist on user satisfaction
 Results may display in an irrelevant order for
  the searcher
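
"High recall, little precision" can be made concrete with the standard definitions: precision is the share of retrieved records that are relevant, and recall is the share of relevant records that were retrieved. The numbers below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved record ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 100 records come back, only 5 are what the user wanted.
retrieved = range(1, 101)
relevant = {1, 2, 3, 4, 5}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here every relevant record was found (recall 1.0), but the searcher has to sift through 95 irrelevant ones to see them (precision 0.05).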



Metasearching: pros and cons

 Single database searching allows users to use
  specialized indexing or controlled
  vocabulary
 Single portal:
     No need for searcher to select a particular
      database from list of databases



Case Studies

 Divide into 3-4 groups
 Read the case study
 Discuss and report:
     Describe the case briefly (2 min.)
     What can we learn from this case study? (3 min.)





Editor's Notes

  • #3 No coverage of technical details – beyond me. Do want to cover concepts, definitions so that if someone talks to you about these things, you will understand