the UPS protoproto project



  1. 1. herbert van de sompel, michael nelson, thomas krichel: the UPS protoproto project. UPS 1 Meeting, Santa Fe - October 21st 1999
  2. 2. project description demo the UPS protoproto dex the data exchange framework
  3. 3. project why a protoproto? •  UPS: enable cross-archive end-user services •  protoproto: –  facilitate discussions –  identify issues involved in creating cross-archive services –  experiment with digital object concepts for archive material –  does not claim to be a solution •  protoproto is multi-disciplinary –  a special instance of cross-archive –  there is a market –  promotional value
  4. 4. project who? •  coordination: herbert van de sompel, michael nelson, thomas krichel •  involvement of: – Old Dominion U & NASA Langley – U of Surrey – U of Ghent – Los Alamos National Laboratory - Library – Russian Academy of Science - Siberian branch
  5. 5. project sponsors •  Los Alamos National Laboratory - Research Library •  JISC eLib WoPEc project
  6. 6. project datasets –  metadata only –  full text remains at archives –  static dumps obtained ca. July 99
     archive      objects   full-text   !organization
     the arXiv     85,223      85,223          17,983
     CogPrints        742         659              14
     NACA           3,036       3,036             100
     NCSTRL        29,184       9,084              93
     NDLTD          1,590         951               1
     RePEc         73,367      13,582           2,453
     Total        193,142     112,535
  7. 7. project metadata formats (archive: format) •  the arXiv: internal •  CogPrints: internal •  NACA: Refer •  NCSTRL: RFC1807 •  NDLTD: MARC •  RePEc: ReDIF
  8. 8. project metadata extraction •  Getting metadata out of archives –  not all archives support metadata extraction •  some archives have undocumented metadata extraction procedures –  not all archives support rich criteria for extraction •  single dump concept only •  Intellectual property and use rights not always clear
  9. 9. project metadata quality •  Metadata has problems with: –  record duplication –  crucial missing fields –  internal errors –  ambiguous references to people and places, publications
  10. 10. project metadata conversion •  all datasets converted to ReDIF: •  essential to have a single format for the creation of services •  supply by archives in a single format was not realistic •  no downgrading of data •  data enhancements: •  creation of unique identifier •  addition of raw subject-classification •  normalization of publication types
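The conversion steps on this slide can be illustrated with a minimal sketch. It does not reproduce the ReDIF format or the project's actual converters: the `CommonRecord` class, the field names (`title`, `type`, `subjects`), the `archive:local_id` identifier pattern and the type vocabulary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from archive-specific publication types
# to one shared vocabulary (not taken from the slides).
PUBLICATION_TYPES = {
    "tech report": "report",
    "technical report": "report",
    "preprint": "preprint",
    "e-print": "preprint",
    "thesis": "thesis",
    "dissertation": "thesis",
}


@dataclass
class CommonRecord:
    identifier: str                     # unique identifier created during conversion
    title: str
    publication_type: str               # normalized publication type
    raw_subjects: list[str] = field(default_factory=list)
    extra: dict[str, str] = field(default_factory=dict)


def convert(archive: str, local_id: str, source: dict[str, str]) -> CommonRecord:
    """Convert one source record (a flat dict) into the common record."""
    fields = dict(source)               # work on a copy, leave the input untouched
    title = fields.pop("title", "")
    raw_type = fields.pop("type", "").strip().lower()
    subjects = fields.pop("subjects", "")
    return CommonRecord(
        identifier=f"{archive}:{local_id}",
        title=title.strip(),
        publication_type=PUBLICATION_TYPES.get(raw_type, "other"),
        # Subjects are kept as supplied by the archive ("raw"), only split and trimmed.
        raw_subjects=[s.strip() for s in subjects.split(";") if s.strip()],
        # "No downgrading": fields without a mapping are carried along, not dropped.
        extra=fields,
    )
```

Keeping unmapped fields in `extra` is one way to honour the "no downgrading of data" bullet while still giving services a single record shape to work with.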
  11. 11. project re-creation of archives •  creation of archives for ReDIF-ed metadata •  using intelligent digital objects : “buckets” RePEc arXiv NCSTRL
  12. 12. project buckets •  Buckets were chosen to study the implications of using rich, intelligent objects in UPS •  Buckets are: –  DL protocol / system independent –  self-contained and mobile –  handle their own display, enforcement of terms and conditions, and dissemination of their contents –  designed for bundling multiple data representations and data instance types •  The aggregative nature of buckets is well suited for adding value-added services at the object level
  13. 13. project creation of end-user service •  NCSTRL+ digital library service •  indexing buckets in archives by requesting their metadata •  enhanced user-interface •  NCSTRL+ search results point at buckets •  buckets auto-display •  buckets provide link to full-text in native archive
  14. 14. project scaling problems •  UPS contains 193K objects –  using buckets consumed inodes (~60 inodes per bucket) •  filesystem reformatted with more generous amount of inodes –  Solaris and Dienst conflict •  Dienst wants each object in a publishing authority to be in a single directory •  Solaris has a hard limit of 32K objects in a directory •  resolution: use many (100+) authorities for UPS
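The arithmetic behind these scaling problems can be worked through in a few lines. This is a back-of-the-envelope sketch: the object count is the total from the datasets slide, the inode figure is the approximate one quoted above, and reading "32K" as 32,000 is an assumption.

```python
import math

OBJECTS = 193_142         # total objects in the UPS prototype
INODES_PER_BUCKET = 60    # approximate figure quoted on the slide
DIR_LIMIT = 32_000        # "32K" entries per directory, read as 32,000 (assumption)


def inodes_needed(objects: int, per_bucket: int = INODES_PER_BUCKET) -> int:
    """Inodes consumed when every object is stored as one bucket."""
    return objects * per_bucket


def min_authorities(objects: int, dir_limit: int = DIR_LIMIT) -> int:
    """Fewest one-directory authorities that stay under the directory limit."""
    return math.ceil(objects / dir_limit)


def largest_authority(objects: int, authorities: int) -> int:
    """Objects in the fullest authority if objects are spread evenly."""
    return math.ceil(objects / authorities)
```

Under these assumptions the buckets need roughly 11.6 million inodes, and seven authorities would already satisfy the directory limit, so the 100+ authorities used for UPS leave each directory at about two thousand objects.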
  15. 15. project addition of linking service •  integrate the archives with the traditional communication mechanism •  context-sensitive linking to deliver extended services via SFX technology
  16. 16. project SFX linking service [Figure: SFX linking service. Labels: system A, system B, metadata, evaluate metadata, extended services.]
  17. 17. project SFX linking database
  18. 18. project addition of linking service •  buckets for arXiv, NCSTRL and RePEc are SFX- aware •  Cogprints, NACA, NDLTD not SFX-aware •  SLAC/SPIRES is SFX-aware •  linking services for preprint metadata + for published version
  19. 19. demo the UPS protoproto •  will be available from the beginning of November •  UPS list will be notified •  disclaimer “not a production system”
  20. 20. dex some issues (I) • data exchange framework • data provision vs. data implementation • central searching, distributed archives •  need for a framework by which archives can describe themselves: •  content •  terms and conditions •  protocols, criteria supported to extract (meta)data •  metadata scheme, subject classification scheme, material-type scheme, ...
  21. 21. dex some issues (II) •  need for an identifier scheme for archives and archive objects • (cf. ISSN, ISBN, DOI) •  metadata quality obstructs the creation of services •  desirable to extend metadata with citation information •  smart objects •  archived objects that are active, not passive
  22. 22. dex providing vs. implementing data •  Providing data: –  publishing into an archive –  providing methods for metadata “harvesting” •  provide non-technical context for sharing information also •  Implementing Data: –  harvest metadata from providers –  implement user interface to data •  Even if provided by the same DL, these are distinct functions
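The split between the two roles can be sketched as two small classes. The class and method names (`Provider`, `Implementor`, `deposit`, `harvest`, `harvest_from`, `search`) are illustrative choices, not interfaces defined by the project.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Provider:
    """Data provider: accepts deposits and exposes metadata for harvesting."""
    name: str
    records: dict[str, dict[str, str]] = field(default_factory=dict)

    def deposit(self, local_id: str, metadata: dict[str, str]) -> None:
        # Input interface: publishing into the archive.
        self.records[local_id] = dict(metadata)

    def harvest(self) -> Iterator[tuple[str, dict[str, str]]]:
        # Harvesting interface: machine-readable access to the metadata.
        for local_id, metadata in sorted(self.records.items()):
            yield f"{self.name}:{local_id}", dict(metadata)


class Implementor:
    """Data implementor: harvests from providers and offers an end-user service."""

    def __init__(self) -> None:
        self.index: dict[str, dict[str, str]] = {}

    def harvest_from(self, provider: Provider) -> int:
        count = 0
        for identifier, metadata in provider.harvest():
            self.index[identifier] = metadata
            count += 1
        return count

    def search(self, term: str) -> list[str]:
        # End-user interface: a naive case-insensitive title search.
        term = term.lower()
        return sorted(
            identifier
            for identifier, metadata in self.index.items()
            if term in metadata.get("title", "").lower()
        )
```

The implementor only ever touches `harvest()`, so a provider does not need an end-user interface of its own for a cross-archive service to be built on top of it.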
  23. 23. dex providing vs. implementing data [Figure: two Providers side by side. Left: input interface and native end-user interface only; "No machine based way to extract metadata…". Right: input interface, native end-user interface and native harvesting interface; "Machine and user interfaces for extracting metadata…".]
  24. 24. dex providing vs. implementing data [Figure: an Implementor and two Providers. Each Provider has an input interface, a native harvesting interface and a native end-user interface. Labels: "Native harvesting interface", "Input and end-user interfaces optional", "optional (e.g., RePEc)".]
  25. 25. dex self-describing archives •  Much of the learning about the constituent UPS archives occurred out of band… •  Given an unknown archive, we should be able to algorithmically determine the archive’s metadata... •  Where possible, the harvesting interface should provide the same criteria as the end-user interface [Figure: a Provider with an input interface, a native harvesting interface and a native end-user interface.]
  26. 26. dex self-describing archives •  Recommended criteria for metadata extraction: –  subject classification –  accession date –  publication date •  Criteria for archive description –  metadata formats employed –  contact information for archive –  publication type scheme –  identifier scheme –  subject classification scheme
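A minimal sketch of what these two lists could look like in code follows. The `ArchiveDescription` and `Record` classes and the `extract` function are hypothetical; the slides recommend the criteria but do not define a concrete schema or call signature.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class ArchiveDescription:
    """What an archive states about itself (the second list on the slide)."""
    metadata_formats: tuple[str, ...]
    contact: str
    publication_type_scheme: str
    identifier_scheme: str
    subject_classification_scheme: str


@dataclass(frozen=True)
class Record:
    identifier: str
    subject: str
    accession_date: date
    publication_date: date


def extract(
    records: Iterable[Record],
    subject: Optional[str] = None,
    accessioned_since: Optional[date] = None,
    published_since: Optional[date] = None,
) -> list[Record]:
    """Select records by the three recommended criteria; None means 'no filter'."""
    selected = []
    for record in records:
        if subject is not None and record.subject != subject:
            continue
        if accessioned_since is not None and record.accession_date < accessioned_since:
            continue
        if published_since is not None and record.publication_date < published_since:
            continue
        selected.append(record)
    return selected
```

Filtering on accession date is what would let a harvester ask only for records added since its last visit, instead of relying on a single full dump.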
  27. 27. dex identifiers •  Useful in: –  reference linking –  can be used in citations –  resolving duplications •  UPS duplications were removed by hand –  tracking publication lifecycle •  Need the ability for an object to have multiple unique identifiers –  organization, discipline, etc.
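If each object carries several identifiers, duplicate detection can be automated rather than done by hand. The sketch below groups records that share at least one identifier, directly or through a chain of other records, using a small union-find. The function name and the identifier strings in the usage example are made up for illustration.

```python
from collections import defaultdict


def group_duplicates(objects: dict[str, set[str]]) -> list[list[str]]:
    """Group record keys that share any identifier, directly or transitively.

    `objects` maps a record key to the set of identifiers attached to it.
    Only groups with more than one record are returned.
    """
    parent = {key: key for key in objects}

    def find(key: str) -> str:
        # Follow parent links to the group representative, shortening the path.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    first_owner: dict[str, str] = {}
    for key, identifiers in objects.items():
        for identifier in identifiers:
            if identifier in first_owner:
                # Same identifier seen before: merge the two groups.
                parent[find(key)] = find(first_owner[identifier])
            else:
                first_owner[identifier] = key

    groups: dict[str, list[str]] = defaultdict(list)
    for key in objects:
        groups[find(key)].append(key)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

For example, a record known by an archive identifier and a publisher identifier, and a second record known by the same publisher identifier and a third scheme, end up in one group:

```python
records = {
    "rec-a": {"archive1:0001", "pub:xyz"},
    "rec-b": {"pub:xyz", "archive2:77"},
    "rec-c": {"archive2:77"},
    "rec-d": {"archive3:5"},
}
duplicates = group_duplicates(records)
```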
  28. 28. dex smart objects •  Premise: Objects are more important than the archives that hold them •  SODA: Smart Objects, Dumb Archives •  Objects should be the canonical authority for •  metadata •  contents •  use •  Objects should be able to grow and change •  correct metadata •  add new formats •  add new services •  reflect the lifecycle of the object
  29. 29. dex smart objects •  It would be beneficial if the archived objects could be heterogeneous: •  with their own “look-and-feel” •  unique functionality / services –  e.g., the data archiving needs of an atmospheric scientist can be different from those of a computer scientist, engineer or medical researcher •  yet maintain a standard API for: •  extracting metadata •  content retrieval •  resource discovery on the object •  terms and conditions
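One way to picture heterogeneous objects behind a standard API is an abstract base class that fixes the four operations listed above and leaves presentation to each object type. This is a sketch only: the class and method names are assumptions and do not describe the bucket API.

```python
from abc import ABC, abstractmethod


class SmartObject(ABC):
    """Standard API shared by all archived objects."""

    def __init__(self, metadata: dict[str, str], contents: dict[str, bytes], terms: str):
        self._metadata = dict(metadata)
        self._contents = dict(contents)
        self._terms = terms

    def get_metadata(self) -> dict[str, str]:
        # Extracting metadata.
        return dict(self._metadata)

    def list_contents(self) -> list[str]:
        # Resource discovery on the object.
        return sorted(self._contents)

    def get_content(self, name: str) -> bytes:
        # Content retrieval.
        return self._contents[name]

    def get_terms(self) -> str:
        # Terms and conditions.
        return self._terms

    def add_format(self, name: str, data: bytes) -> None:
        # Objects can grow: a new representation is added after deposit.
        self._contents[name] = data

    @abstractmethod
    def display(self) -> str:
        """Each object type supplies its own look-and-feel."""


class PreprintObject(SmartObject):
    def display(self) -> str:
        return f"Preprint: {self._metadata['title']}"


class DatasetObject(SmartObject):
    def display(self) -> str:
        return f"Dataset: {self._metadata['title']} ({len(self._contents)} files)"
```

A service that only calls the shared methods can index a preprint and an atmospheric dataset in the same way, while each object still renders itself differently.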
  30. 30. dex lessons learned •  A strong distinction between the provision of data, and the implementation of data –  also, a socio-legal context for sharing metadata •  Open, “self-describing” archives •  A universal, unique identifier name space •  Archived objects with more intelligence and flexibility