Scalable Identifiers for Natural
      History Collections
                12 August 2012

    University of California Curation Center
           California Digital Library
California Digital Library
Serving the University of California
• 10 campuses
• 360K students, faculty, and staff
• 100’s of museums, art galleries, observatories, marine
  centers, botanical gardens
• 5 medical centers
• 5 law schools
• 3 National Labs

CDL supports the research lifecycle
• Collections
• Digital Special Collections
• Discovery & Delivery
• Publishing Group
• UC Curation Center (UC3)
The research data problem
an article about data, but no data
What EZID data citation offers
•   Precise identification of a dataset (DOI, ARK)
•   Credit to data producers and data publishers
•   A link from traditional literature to the data
•   Exposure and research metrics for datasets
    (Web of Knowledge, Google)
EZID: Long term identifiers made easy




                        Take control of the
                        management and distribution
                        of your research, share and get
                        credit for it, and build your
                        reputation through its collection
                        and documentation
DataCite
German National Library of Economics (ZBW)
German National Library of Science and Technology (TIB)
German National Library of Medicine (ZB MED)
GESIS - Leibniz Institute for the Social Sciences, Germany
Australian National Data Service (ANDS)
ETH Zurich, Switzerland
Canada Institute for Scientific and Technical Info. (CISTI)
Technical Information Center of Denmark
Institute for Scientific & Technical Information (INIST-CNRS), France
TU Delft Library, The Netherlands
The Swedish National Data Service (SNDS)
The British Library, UK
California Digital Library (CDL), USA
Office of Scientific & Technical Information (OSTI), USA
Purdue University Library
EZID Clients
A current, partial list

UC Berkeley Library (on behalf of the UC Berkeley campus)
Sponsored accounts:
      Open Context
      CRCNS.org
UC San Diego Library (on behalf of the UC San Diego campus)
American Astronomical Society (AAS)
Centre national de documentation pédagogique (CNDP)
Cornell Institute for Social & Economic Research
The Digital Archaeological Record (tDAR)
Dryad Digital Repository
Fred Hutchinson Cancer Research Center
LabArchives
National Center for Atmospheric Research (NCAR)
USGS/Earth Sciences Data Clearinghouse (formerly National Biological Info. Infrastructure)
New features in development

• Suffix pass-thru: do NT and get N/ST/S for free
• Service replicas: manager and resolver
• Content negotiation and inflections: ? ?? / .
• URN (Uniform Resource Name) support (urn:uuid:)
• ARK community and governance, eg, registries
Some identifier dimensions
• registration (storing and updating ids for
  resolution)
• non-registration (id awareness via rules)
• persistence flavors
• resolution
• clusters (closely coupled ids)
• other relations (part, whole, related)
Identifier generation
• inspiration ("I think I'll call it MyKitty/Photos")
• systematic inspiration (title/author/vol/issue)
• counter (421, 422, 423, ...)
• timestamp
• hash computed over content (MD5, SHA256)
• hash of randomized timestamp plus registry
  (uuidgen, noid)
• randomized counter plus registry (EZID/noid)
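
The mechanical strategies in this list can be sketched in a few lines of Python. This is an illustration only: the function names, the `MintingRegistry` class, and the vowel-free alphabet are assumptions made for the example, and the registry is a simplified stand-in, not the actual algorithm used by noid or EZID.

```python
import hashlib
import itertools
import random
import time
import uuid

# Assumed alphabet: digits plus consonants, to avoid minting accidental words.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"

def counter_ids(start=421):
    """Counter: yield sequential ids 421, 422, 423, ..."""
    for n in itertools.count(start):
        yield str(n)

def timestamp_id(now=None):
    """Timestamp: whole microseconds since the epoch, as a string."""
    now = time.time() if now is None else now
    return str(int(now * 1_000_000))

def content_hash_id(data: bytes, algorithm="sha256"):
    """Hash computed over content (for example MD5 or SHA256)."""
    return hashlib.new(algorithm, data).hexdigest()

def random_uuid_id():
    """Random 128-bit id in the RFC 4122 text form."""
    return str(uuid.uuid4())

class MintingRegistry:
    """Hypothetical random minter that remembers every id it has issued."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._issued = set()

    def mint(self, length=6):
        # Draw candidates until one is found that was never issued before.
        while True:
            candidate = "".join(self._rng.choice(ALPHABET) for _ in range(length))
            if candidate not in self._issued:
                self._issued.add(candidate)
                return candidate
```

Only the last approach needs a registry, because random strings carry no built-in guarantee against reissue; a content hash, by contrast, changes whenever the content does.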
Identifier registration
• use filesystem tree as resolver (any old
  website)
• use web server config file
• use web server backing database
• use a service (bit.ly, EZID, DataCite, local
  Handle service)
Identifier non-registration
Identifiers “exposed” but not registered, eg,
  awareness via rules
• extension (abc/def is "part of" abc)
• parameter (abc_N_M works for N or M less
  than 100,000)
• general query (arbitrary data cells)
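
A minimal sketch of the extension and parameter rules, in Python. The function names are invented for this example, and the parameter rule is read here as requiring both N and M to be below 100,000, which is an assumption about the slide's wording.

```python
import re

# Assumed shape of the parameterized id from the slide: abc_N_M
PARAM_RULE = re.compile(r"^abc_(\d+)_(\d+)$")
LIMIT = 100_000

def is_part_of(candidate: str, base: str) -> bool:
    """Extension rule: 'abc/def' is recognized as part of 'abc'."""
    return candidate.startswith(base + "/")

def parameter_cell(identifier: str):
    """Parameter rule: return (N, M) when the id is valid by rule, else None."""
    match = PARAM_RULE.match(identifier)
    if match is None:
        return None
    n, m = int(match.group(1)), int(match.group(2))
    # Assumption: both parameters must fall under the limit.
    if n < LIMIT and m < LIMIT:
        return (n, m)
    return None
```

Neither function consults a registry: the ids are known only because they match a rule.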
Identifier persistence flavors
• persistent id to very dynamic content
  (eg, home page)
• persistent id to stable but correctable content
  (eg, landing page)
• persistent id to never-changing content
  (eg, spreadsheet)
  – persistent ids to non-recommended content
• persistent id to stable but growing content
  (serial pub)
Identifier resolution
• DNS (domain names)
• DNS + HTTP (any website)
• DNS + HTTP + redirects (eg, URL
  shorteners, N2T/EZID system)
• DNS + HTTP + redirects + Handle resolver
  (DOIs and Handles)
Identifier clusters
Related, but very closely coupled identifiers
• object files
• alternate object files
• object metadata
GUID Definitions
• GUID -- Definition 1 (Wikipedia)
  – A 128-bit id generated per RFC 4122, eg,
  – uuidgen -> EEF45689-BBE5-4FB6-9E80-41B78F6578E2
• GUID -- Definition 2 (earth sciences?)
  – any globally unique identifier
Service replicas
• EZID is an id manager that populates N2T
   – It tolerates down time
   – Other id manager services might one day populate N2T
• N2T (Name-to-Thing) is an id resolver that ...
   – It is very intolerant of down time, since it services all
     access requests for locations and metadata
   – N2T replicas underway
URN support
• N2T and EZID are agnostic about kinds of
  things, names, and metadata
   – Digital, physical, abstract, living, fictional, groups, etc.
   – Any metadata & known profiles (DataCite, Dublin Kernel)
   – ARK, DOI, URN, Handle, IVOA, LSID, PMID, etc., requiring
     namespace “write” permission, eg, via DataCite
• In test: Uniform Resource Names (URNs)
   – urn:uuid namespace
Under the hood keysmithing terms:
bows, shoulders, blades, tips, covers
Suffix pass-thru: NT gets N/ST/S for free

Idea: if name N points to target T, then requests for N
  extended by any suffix N/S can take you to T/S
• For dataset doi:10.5072/Big4 with 10,000
  nameable components,
   – Register and manage 10,001 names or 1 name?
   – Eg, http://x.y.z/foo/Big4/db/table/cell/45-8.txt could be
     reached with doi:10.5072/Big4/table/cell/45-8.txt
• In test with ARKs. Conflict with other resolvers?
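
The idea can be sketched as a longest-registered-prefix lookup. This is a Python illustration under stated assumptions: the `REGISTRY` dict and `resolve` function are hypothetical, the single registered name is the slide's dataset doi:10.5072/Big4, and its target is taken to be http://x.y.z/foo/Big4/db. It does not describe how N2T actually implements the feature.

```python
# One registered name N and its target T (values taken from the slide's example).
REGISTRY = {"doi:10.5072/Big4": "http://x.y.z/foo/Big4/db"}

def resolve(identifier, registry=REGISTRY):
    """Return the target for identifier, passing any unregistered suffix through."""
    if identifier in registry:
        return registry[identifier]
    # Walk back one "/"-delimited segment at a time, looking for a registered N.
    prefix = identifier
    while "/" in prefix:
        prefix = prefix.rsplit("/", 1)[0]
        if prefix in registry:
            suffix = identifier[len(prefix):]  # the S in N/S, leading "/" included
            return registry[prefix] + suffix   # T/S
    return None  # no registered prefix found

print(resolve("doi:10.5072/Big4/table/cell/45-8.txt"))
# prints http://x.y.z/foo/Big4/db/table/cell/45-8.txt
```

With this rule, one registered name covers all 10,000 components, at the cost of the resolver no longer knowing which suffixes actually exist.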
Tombstone and other surrogate pages

Tombstone, incubation, and other surrogate pages
  (probation?) auto-generated from metadata, eg,
  http://n2t.net/ezid/tombstone/id/ark:/20775/bb3243444z
Reserved identifiers and multiple targets

• Some ids must be created and managed (reserved)
  before going public, eg, for manuscript preparation
• In test: infrastructure for multiple targets and
  multiple instances of any metadata element
• What should user experience be for multiple targets?
   – Present a menu of targets (burden of choice)?
   – One target chosen for them (burden of inflexibility)?
Identifier (ARK) inflections: ? ?? / .

• Inflect: change endings without creating new words
  – Terminal ? means “I want metadata”, which is similar to
    linked data content negotiation (also in EZID test)
  – Terminal ?? means “I also want support metadata”
  – Drawing board: / could mean “I want a landing page”
    and . could mean “I want the usual computable thing”
• Allow inflections beyond ARKs to DOIs/URNs?
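
A minimal Python sketch of how a resolver might separate a terminal ? or ?? from the identifier itself. The function name and the returned labels are assumptions for illustration; the / and . inflections are left out because the slide describes them as still on the drawing board.

```python
def split_inflection(request: str):
    """Split a requested id into (identifier, what the requester wants)."""
    # Test the longer ending first so "??" is not mistaken for "?".
    if request.endswith("??"):
        return request[:-2], "metadata plus support metadata"
    if request.endswith("?"):
        return request[:-1], "metadata"
    return request, "object"

print(split_inflection("ark:/13030/qt0349g1rh?"))
# prints ('ark:/13030/qt0349g1rh', 'metadata')
```

The identifier stays the same in every case; only the ending changes what is returned.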
Example: http://n2t.net/ark:/13030/qt0349g1rh?
        Renninger, Heidi; Phillips, Nathan; Hodel, Donald. “Comparative hydraulic and
            anatomic properties in palm trees (Washingtonia robusta) of varying
            heights”. 2009-04-29. ark:/13030/qt0349g1rh



     HTML content with
     embedded comments in
     ANVL/ERC and RDF



erc:
who: Renninger, Heidi,; Phillips,
   Nathan,; Hodel, Donald,
what: Comparative hydraulic and
   anatomic properties in palm
   trees (Washingtonia robusta)
   of varying heights
when: 2009-04-29
where: ark:/13030/qt0349g1rh
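
A record like the one above can be read with a small parser. The Python sketch below assumes the common ANVL conventions of "label: value" lines with indented continuation lines; `parse_anvl` is a name made up for this example, and the simplification of keeping one value per label would drop repeated elements.

```python
SAMPLE = """erc:
who: Renninger, Heidi,; Phillips,
   Nathan,; Hodel, Donald,
when: 2009-04-29
where: ark:/13030/qt0349g1rh
"""

def parse_anvl(text: str) -> dict:
    """Parse 'label: value' lines; indented lines continue the previous value."""
    record = {}
    label = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace() and label is not None:
            # Continuation line: append to the value of the current label.
            record[label] = (record[label] + " " + line.strip()).strip()
        else:
            # Split on the first colon only, so values such as "ark:/..." survive.
            label, _, value = line.partition(":")
            label = label.strip()
            record[label] = value.strip()
    return record

record = parse_anvl(SAMPLE)
print(record["where"])
# prints ark:/13030/qt0349g1rh
```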
ARK community and governance

•   ARK mailing list: arks-forum@googlegroups.com
•   Topics: governance, community, standardization
•   Registry maintenance: shoulders and NAANs
•   N2T consortium with alternative EZID-like services


Editor's Notes

  • #9 Academic, non-profit, government, and commercial