Metadata april 8 2013

Open Archives Initiative - Protocol
for Metadata Harvesting

April 8, 2013
Richard Sapon-White

1

Overview

• Definitions
• History
• The OAI Model
• Protocol for Metadata Harvesting

2

Definitions

• Harvester - client application issuing OAI-PMH
requests
• Harvesting - the gathering of metadata
from a number of distributed repositories into a
combined data store
• Archive - synonym for a repository of scholarly
papers
• Protocol - a set of rules defining communication
between systems (such as FTP or HTTP)

3

History of the OAI

• E-print servers = archives or repositories
• E-print servers provide access to scientific and
technical papers and scholarly journal articles
• Authors deposit pre-prints or published articles in
these repositories
• Concept: free public access to scholarly
information without paid journal subscriptions

4

History of the OAI (cont.)

• Why?
  - Scholarly research belongs to the people
  - Speeds the sharing of research
  - Better for authors and readers
• Known as the “open archives movement”
• Has nothing to do with physical archives
(repositories of institutional history or collections
of unpublished materials)

5


• Many e-print servers grew
  - Overlapping disciplinary coverage
  - Overlapping geographic coverage
• Developing need to
  - search multiple repositories simultaneously
(= federated searching)
  - automatically identify and copy papers from
other repositories (= repository synchronization)

6


• Meeting of experts, 1999, Santa Fe, New Mexico,
USA
• Defined an interface so that repositories could
expose metadata for the papers they held
• Metadata could then be discovered and copied by
federated search services and other repositories
• Known as the Santa Fe Convention (later developed
into PMH - Protocol for Metadata Harvesting)

7

The Open Archives Model

• Similar in concept to a union catalog
• Metadata “harvested” and stored in a central
repository
• “Pull” rather than “push” model
• Collecting is similar to an Internet spider
collecting HTML content
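
The pull model can be illustrated with a minimal sketch. The two provider functions and their records are invented for illustration; in a real harvester each one would be an OAI-PMH request to a remote repository.

```python
# Hypothetical data providers: each returns the metadata records it exposes.
def physics_archive():
    return [
        {"identifier": "oai:physics.example.org:1", "title": "Paper A"},
        {"identifier": "oai:physics.example.org:2", "title": "Paper B"},
    ]


def economics_archive():
    return [
        {"identifier": "oai:econ.example.org:7", "title": "Paper C"},
    ]


def harvest(providers, store):
    """Pull records from every provider into one combined store.

    The harvester initiates each transfer ("pull"); the providers
    never send anything on their own ("push").
    """
    for name, fetch_records in providers.items():
        for record in fetch_records():
            # Key by identifier so a repeated harvest overwrites
            # an existing record instead of duplicating it.
            store[record["identifier"]] = {**record, "source": name}
    return store


central_store = harvest(
    {"physics": physics_archive, "economics": economics_archive}, {}
)
print(len(central_store), "records in the central store")
```

Searching then happens against `central_store` alone, much as a union catalog is searched without contacting each member library.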

8

PMH and Z39.50

• Differs from Z39.50 (specifically rejected at Santa
Fe)
• Z39.50:
  - Allows a client to search a remote information
server across a network
• Difficult to perform high-quality federated searches
across many servers: would need to deal with each
server individually
• Complex protocol

9

PMH and Z39.50 (cont.)

• PMH is a simple protocol
• User interacts with a database of harvested metadata,
not with individual repositories
• Database is constructed by the federated search
service using PMH
• Therefore, performance depends only on the
federated search service, not the individual
repositories

10

Metadata Harvesting Protocol

• Queries and responses carried over HTTP
• Harvester application can request a single
metadata record or a group of records to be
exported
• Application can restrict records by date to
gather only new records (since the previous harvest)
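
A minimal sketch of how such requests could be built. The slides do not name the request parameters; the verbs `GetRecord` and `ListRecords` and the arguments `identifier`, `metadataPrefix` and `from` are taken from the OAI-PMH 2.0 specification. The base URL and the record identifier are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical repository base URL, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build a request for a single metadata record."""
    query = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return BASE_URL + "?" + urlencode(query)


def list_records_url(metadata_prefix="oai_dc", from_date=None):
    """Build a request for a group of records.

    If from_date (YYYY-MM-DD) is given, only records created or
    changed on or after that date are requested.
    """
    query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        query["from"] = from_date
    return BASE_URL + "?" + urlencode(query)


print(get_record_url("oai:repository.example.org:1234"))
print(list_records_url(from_date="2013-04-01"))
```

A harvester would typically remember the date of its last run and pass it as `from_date` on the next one.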

11

Metadata Harvesting Protocol (cont.)
• OAI-compliant data providers are capable of
responding to such requests
• Data provider must be able to export metadata in
at least unqualified Dublin Core (DC), encoded
in XML
• Data provider includes a URI with the metadata
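
A sketch of what a harvester might do with one exported record. The sample record is invented; the three namespace URIs are the ones defined for OAI-PMH 2.0, its `oai_dc` format, and the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample record in unqualified Dublin Core.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1234</identifier>
    <datestamp>2013-04-08</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Invented Sample Paper</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:date>2013</dc:date>
      <dc:identifier>https://repository.example.org/papers/1234</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Return the record's URI identifier, datestamp and DC fields."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    dc = record.find("oai:metadata/oai_dc:dc", NS)
    fields = {}
    for element in dc:
        # Tags look like "{namespace}title"; keep only the local name.
        name = element.tag.split("}", 1)[1]
        # Every DC element is repeatable, so collect values in lists.
        fields.setdefault(name, []).append(element.text)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "dc": fields,
    }


print(parse_record(SAMPLE_RECORD))
```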

12

Metadata Harvesting Protocol (cont.)
• Servers can also provide metadata in other schemes
besides DC
• Harvester applications can request metadata in
other schemes besides DC
• Harvester applications can also query a metadata
repository for:
  - List of metadata formats supported by the repository
  - List of record sets supported by the repository
  - List of the identifiers of all records within the repository
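
The three list queries correspond to the OAI-PMH 2.0 verbs `ListMetadataFormats`, `ListSets` and `ListIdentifiers` (names from the specification, not from the slides). A minimal sketch, assuming a hypothetical base URL:

```python
from urllib.parse import urlencode

# Hypothetical repository base URL, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def discovery_urls(metadata_prefix="oai_dc"):
    """Build one request URL for each of the three list queries."""
    requests = {
        "formats": {"verb": "ListMetadataFormats"},
        "sets": {"verb": "ListSets"},
        # ListIdentifiers requires a metadataPrefix argument.
        "identifiers": {
            "verb": "ListIdentifiers",
            "metadataPrefix": metadata_prefix,
        },
    }
    return {
        name: BASE_URL + "?" + urlencode(args)
        for name, args in requests.items()
    }


for name, url in discovery_urls().items():
    print(name, url)
```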

13

Why the OAI-PMH is
important
• Provides for a minimal level of interoperability
• Drives development of community-specific
metadata schemes
• Potential for new modes of scholarly
communication
• Dependent on widespread implementation by
research organizations, publishers, and “memory
organizations” (i.e., libraries, museums, archives)

14

QUIZ!!!

• http://www.oaforum.org/tutorial/english/page1.h

15

Problems with Metadata
Harvesting
• Loss of data when mapping to unqualified DC
• Incorrect data from improper mapping
• Inconsistent punctuation and formatting
because of diverse sources of metadata
• High variance in data between institutions
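
The first problem can be shown with a small sketch. The mapping table lists a few Dublin Core refinements and the elements they refine; the sample record is invented.

```python
# A few qualified DC refinements and the unqualified element each refines.
REFINEMENT_TO_DC = {
    "alternative": "title",
    "abstract": "description",
    "created": "date",
    "issued": "date",
    "spatial": "coverage",
    "temporal": "coverage",
}


def to_unqualified_dc(qualified):
    """Collapse refinements onto their parent elements."""
    simple = {}
    for term, values in qualified.items():
        element = REFINEMENT_TO_DC.get(term, term)
        simple.setdefault(element, []).extend(values)
    return simple


# Invented record that distinguishes two kinds of date.
record = {
    "title": ["An Invented Sample Paper"],
    "created": ["2012-11-02"],
    "issued": ["2013-04-08"],
}

print(to_unqualified_dc(record))
```

After mapping, both dates sit under `date`, and nothing in the harvested record says which one is the creation date and which the publication date.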

16

Metasearching

• Many systems = many metadata standards
• Convert to single system (harvesting)?
• Maintain individual element sets BUT create
interface to search simultaneously across
heterogeneous databases
• Voilà: Metasearching!
• Not a single method
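
One possible approach, sketched under invented assumptions: each database keeps its own element set, and the search interface translates a common field name into each database's local field. The two databases, their field names and their records are all hypothetical.

```python
# Two hypothetical databases with different element sets.
CATALOG = [{"245a": "Metadata Basics", "100a": "Doe, Jane"}]
REPOSITORY = [
    {"title": "Metadata Harvesting in Practice", "creator": "Roe, Richard"}
]

DATABASES = {"catalog": CATALOG, "repository": REPOSITORY}

# Common field name -> local field name, per database.
FIELD_MAPS = {
    "catalog": {"title": "245a", "author": "100a"},
    "repository": {"title": "title", "author": "creator"},
}


def metasearch(field, term):
    """Search every database at once, each in its own element set."""
    hits = []
    for name, records in DATABASES.items():
        local_field = FIELD_MAPS[name][field]
        for record in records:
            if term.lower() in record.get(local_field, "").lower():
                hits.append((name, record[local_field]))
    return hits


print(metasearch("title", "metadata"))
```

Unlike harvesting, nothing is copied or converted ahead of time; the translation happens at query time.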

17

Definition

• From NISO MetaSearch Initiative:
“search and retrieval to span multiple databases,
sources, platforms, protocols, and vendors at one
time.”
• Best known: Z39.50 protocol. Used to
search remote library catalogs.

18

Z39.50

• Allows computers to communicate to
retrieve information, between client and
server
• Searches and results are restricted to Z39.50
databases

19

Z39.50 results

• Server may interpret the query incorrectly
• Some automatically add Boolean “and” while
others add Boolean “or”
• Vocabulary issues: different vocabulary in
different databases
• Display results in order retrieved, by database
found, by date, by relevance
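
The effect of the implicit Boolean operator can be seen in a toy example. The three records are invented, and the matching is simple word lookup rather than anything a real Z39.50 server does.

```python
# Invented records: id -> text.
RECORDS = {
    1: "metadata harvesting protocol",
    2: "metadata standards for museums",
    3: "harvesting rainwater",
}


def search(query, operator):
    """Match records using an implicit "and" or "or" between terms."""
    terms = query.lower().split()
    combine = all if operator == "and" else any
    return sorted(
        record_id
        for record_id, text in RECORDS.items()
        if combine(term in text.split() for term in terms)
    )


print(search("metadata harvesting", "and"))  # [1]
print(search("metadata harvesting", "or"))   # [1, 2, 3]
```

The same two-word query returns one record from a server that assumes "and" and all three from a server that assumes "or".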

20

Problems with Z39.50

• High recall, low precision
• Also present in Google Search: few studies
on user satisfaction
• Results may display in an irrelevant order for
the searcher
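
For reference, recall is the share of relevant records that were retrieved, and precision is the share of retrieved records that are relevant. A short sketch with invented numbers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of record ids."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 10 relevant records exist; the search returns
# 50 records, 9 of which are relevant.
relevant = set(range(1, 11))
retrieved = set(range(2, 52))

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here recall is high (9 of 10 relevant records found) while precision is low (9 of 50 results are relevant), which is the pattern the slide describes.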

21

Metasearching: pros and cons

• Single database searching allows users to use
specialized indexing or controlled
vocabulary
• Single portal:
  - No need for searcher to select a particular
database from list of databases

22

Case Studies

• Divide into 3-4 groups
• Read the case study
• Discuss and report:
  - Describe the case briefly (2 min.)
  - What can we learn from this case study? (3 min.)

23
