Persistently identifying Web site content Future-proofing Institutional Web sites DCC and Wellcome Library workshop
Persistently identifying website content

A presentation given at the Digital Curation Centre Joint Workshop on Future-Proofing Institutional Websites, held in London in January 2006.

See http://www.dcc.ac.uk/events/fpw-2006/

Published in: Education, Technology, Design
  1. 1. Persistently identifying Web site content Future-proofing Institutional Web sites DCC and Wellcome Library workshop
  2. 2. Contents <ul><li>context </li></ul><ul><li>functional requirements </li></ul><ul><li>issues raised </li></ul><ul><li>practical suggestions </li></ul><ul><li>note: not going to look at any particular solutions in any detail – PURLs, DOIs, Handles, ARKs, … </li></ul>
  3. 3. Context – institutional Web sites <ul><li>institutional Web sites are: </li></ul><ul><ul><li>heterogeneous – i.e. wide variety of content, managed/unmanaged, formal/informal </li></ul></ul><ul><ul><li>primarily accessed via mainstream Web browsers – but that may change over time </li></ul></ul><ul><ul><li>dynamic – i.e. content is regularly added (and changed and removed!) </li></ul></ul><ul><ul><li>closely tied to the institution – and institutions are liable to change! </li></ul></ul>
  4. 4. Context – man vs. machine <ul><li>identifiers serve a human and machine/software purpose </li></ul><ul><ul><li>person: “here’s one I found earlier” – e.g. using del.icio.us or connotea </li></ul></ul><ul><ul><li>machine: “is this the same as that?” </li></ul></ul><ul><li>worth remembering that machines tend to be fairly stupid… </li></ul><ul><ul><li>e.g. if some people use the PURL and some use the corresponding URL, then del.icio.us won’t spot that their entries are about the same thing </li></ul></ul><ul><li>in most cases, being able to resolve the identifier is helpful to both people and machines </li></ul><ul><li>in most cases, the longer an identifier lasts, the better – even after the resolution service breaks! </li></ul>
  5. 5. Context – what is being identified <ul><li>the most important question in any discussion about identifiers is “what is being identified?” </li></ul><ul><li>in the case of institutional Web sites… </li></ul><ul><ul><li>the site </li></ul></ul><ul><ul><li>significant parts of the site </li></ul></ul><ul><ul><li>static documents, individual images, etc. </li></ul></ul><ul><ul><li>dynamic services </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>some possibility for confusion here </li></ul><ul><ul><li>e.g. what does http://www.bris.ac.uk/ identify? </li></ul></ul><ul><li>but in the case of institutional Web sites, people usually do the ‘right thing’ and what is being identified is obvious from the context… </li></ul>
  6. 6. Context - works vs. manifestations <ul><li>one key aspect is whether the identifier is for an abstract ‘work’ or a particular ‘manifestation’ of that work </li></ul><ul><li>there are some scenarios in which it is necessary to identify the ‘work’… </li></ul><ul><li>in other cases, it is necessary to identify a particular ‘manifestation’ of the work </li></ul><ul><li>beginning to see this problem in the development of eprint archives and institutional repositories </li></ul>“Crystal Studio is a recommended resource for the teaching of crystallography at undergraduate level.” “To perform this exercise you will need a copy of Crystal Studio version 5.0 (versions 4.0 Lite and 4.0 Professional do not support the required options).”
  7. 7. Functional requirements… <ul><li>the JISC IE technical standards document says… </li></ul>“Every significant item that is made available through a JISC IE network service should be assigned a URI that is reasonably persistent. This means that item URIs should not be expected to break for a period of 10-15 years after they have first been used. For this reason, JISC IE service components should not hardcode file format, server technology, service organisational structure or other information that is likely to change over a 10-15 year period into item URIs. If items become unavailable during that period, then the URI should resolve to a Web page that explains why the item is no longer available and what actions the end-user can take to obtain a copy of the item or similar resources. Furthermore, item URIs should not contain end-user-specific information, i.e. all item URIs should work for all end-users (albeit allowing for appropriate authentication challenges to be inserted into the process by which the URI is resolved).” http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/standards/
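The requirement that the URI of an unavailable item should still resolve to an explanatory page can be sketched as follows. This is only an illustration: the paths, the explanation text and the `respond` function are hypothetical, and answering with HTTP status 410 (Gone) is an assumption made here, not something the JISC IE document prescribes.

```python
# Hypothetical content store and withdrawal record for one Web site.
CURRENT_ITEMS = {
    "/reports/2005/annual": "<h1>Annual report 2005</h1>",
}
WITHDRAWN_ITEMS = {
    # path -> explanation and suggested action for the end-user
    "/reports/2001/annual": "Withdrawn by the author; contact the library for a copy.",
}

def respond(path):
    """Return (status, body) for a request path.

    Withdrawn items still resolve: they yield a page explaining why the
    item is unavailable, rather than an unexplained 'not found' error.
    """
    if path in CURRENT_ITEMS:
        return 200, CURRENT_ITEMS[path]
    if path in WITHDRAWN_ITEMS:
        explanation = WITHDRAWN_ITEMS[path]
        return 410, "<p>This item is no longer available. " + explanation + "</p>"
    return 404, "<p>Not found.</p>"
```

The point of the design is that the withdrawal record has to be kept for as long as the URI is expected to persist, which is itself part of the institutional commitment.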
  8. 8. What should be identified? <ul><li>“every significant item” </li></ul><ul><li>what does that mean? </li></ul><ul><li>every resource that people are likely to want to cite persistently? </li></ul><ul><li>there might be stuff on institutional Web sites that we don’t need to cite persistently </li></ul><ul><ul><li>but often difficult to pre-judge what is significant and what isn’t </li></ul></ul><ul><ul><li>and judgements about significance and required level of persistence may come from outside the institution </li></ul></ul>
  9. 9. What does ‘reasonably persistent’ mean? <ul><li>notion of ‘persistence’ is application dependent </li></ul><ul><li>perhaps helpful to think about 15–20 year timeframe? </li></ul><ul><ul><li>longer than the Web has been around to date </li></ul></ul><ul><ul><li>solutions for 20 year period may well last longer </li></ul></ul><ul><ul><li>‘forever’ is too long </li></ul></ul><ul><li>what will have changed in 20 years’ time? </li></ul><ul><ul><li>technology - HTML replaced? HTTP replaced? DNS replaced? URI system replaced? </li></ul></ul><ul><ul><li>organisations – mergers, closures, new institutions, new government departments, etc. </li></ul></ul><ul><ul><li>people – deaths, retirements, etc. </li></ul></ul><ul><ul><li>countries! </li></ul></ul>
  10. 10. What does ‘break’ mean? <ul><li>what does it mean for an identifier to break? </li></ul><ul><li>need to differentiate between the breakage of services on the identifier and breakage of the identifier itself </li></ul><ul><li>most obvious services on identifiers are ‘resolution services’ </li></ul><ul><ul><li>“give me a representation of the identified thing” </li></ul></ul><ul><ul><li>known as ‘dereferencing’ in W3C documentation </li></ul></ul><ul><li>resolution services can break (by design or by accident) but the identifier may live on and remain useful </li></ul><ul><li>the identifier itself only breaks when all parties (including software systems) have forgotten what it identified, or when parties no longer agree about what it identifies (e.g. if it gets re-assigned) </li></ul>
  11. 11. Usability issues <ul><li>“the only good long-term identifier is a good short-term identifier” </li></ul><ul><li>unless identifiers work well now, they won’t turn into persistent identifiers because they won’t be used at all </li></ul><ul><li>what does “work well” mean (particularly in the context of institutional Web sites)? </li></ul><ul><ul><li>conformant with current Internet standards </li></ul></ul><ul><ul><li>usable in Web browsers (without additional plug-ins - i.e. usable by everyone) </li></ul></ul><ul><ul><li>meaningful to people </li></ul></ul><ul><ul><li>resolvable </li></ul></ul><ul><ul><li>simple to assign and maintain </li></ul></ul><ul><ul><li>low cost (in terms of money and time) </li></ul></ul>
  12. 12. Interim conclusions… <ul><li>identifiers for content on institutional Web sites should be URIs </li></ul><ul><ul><li>why? because the URI is the global and unambiguous standard for identifiers on the Internet </li></ul></ul><ul><li>‘http’ URIs are better than any other form of URI </li></ul><ul><ul><li>why? because they work in current Internet tools, particularly Web browsers </li></ul></ul><ul><ul><li>built-in resolution mechanism </li></ul></ul><ul><ul><li>easy to assign and low-cost (typically!) </li></ul></ul>
  13. 13. ‘http’ URI problems? <ul><li>but ‘http’ URIs tend to break, don’t they? </li></ul><ul><ul><li>note: usually it is the resolution service that breaks (i.e. they stop working as locators) - this doesn’t necessarily imply that they stop functioning as identifiers, though the two may be closely related </li></ul></ul><ul><li>reasons for fragility of ‘http’ URI resolution examined later </li></ul><ul><li>but ‘poor design’ and lack of commitment are often to blame </li></ul><ul><li>not necessarily the case that one can apply generic Internet-wide findings about ‘http’ URI breakage to ‘institutional’ Web sites </li></ul><ul><li>attempts at more persistent forms of identifier often based on moving away from direct ties to HTTP and/or introducing a level of indirection </li></ul>
  14. 14. How indirection works (or not?) <ul><li>populate resolution service tables with identifier -> locator mappings (and possibly other metadata) </li></ul><ul><ul><li>DOI: 10.1000/182 -> http://www.doi.org/hb.html </li></ul></ul><ul><ul><li>Handle: 4263537/4002 -> http://www.handle.net/documentation.html </li></ul></ul><ul><ul><li>ARK: http://ark.nlm.nih.gov/ark:/12025/pm10611131 -> http://brain.oxfordjournals.org/cgi/content/full/123/1/171 </li></ul></ul><ul><ul><li>PURL: http://purl.org/net/ukoln -> http://www.ukoln.ac.uk/ </li></ul></ul><ul><li>typically used as the basis for HTTP redirects, e.g. </li></ul><ul><ul><li>http://dx.doi.org/10.1000/182 -> http://www.doi.org/hb.html </li></ul></ul><ul><ul><li>http://hdl.handle.net/4263537/4002 -> http://www.handle.net/documentation.html </li></ul></ul><ul><ul><li>etc. </li></ul></ul><ul><li>helps to ensure persistence… but </li></ul><ul><ul><li>HTTP redirects not handled very well by browsers - end-user is typically left using the non-persistent URI </li></ul></ul><ul><ul><li>need commitment to maintain resolver services and tables </li></ul></ul><ul><ul><li>introduces a second (at least) identifier for each resource </li></ul></ul>
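The table-plus-redirect mechanism on this slide can be sketched in a few lines. This is a minimal illustration, not how any of the named resolvers is actually implemented: the two mappings are the ones shown on the slide, while the `resolve` and `relocate` functions and the choice of a 302 status are assumptions made for the example.

```python
# Resolver table: identifier -> current locator (mappings taken from the slide).
RESOLVER_TABLE = {
    "10.1000/182": "http://www.doi.org/hb.html",
    "4263537/4002": "http://www.handle.net/documentation.html",
}

def resolve(identifier, table=RESOLVER_TABLE):
    """Return (status, headers) for an HTTP-style redirect.

    Known identifiers redirect to their current locator; unknown ones
    get a 404 with no headers.
    """
    locator = table.get(identifier)
    if locator is None:
        return 404, {}
    return 302, {"Location": locator}

def relocate(identifier, new_locator, table=RESOLVER_TABLE):
    """Record that a resource has moved. The identifier itself never changes."""
    table[identifier] = new_locator
```

Persistence here depends entirely on someone calling something like `relocate` whenever a resource moves, which is the commitment to maintaining resolver tables that the slide mentions.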
  15. 15. What about uniqueness? <ul><li>the same identifier should not be assigned to more than one resource </li></ul><ul><li>a resource may have more than one identifier assigned to it… but this should be avoided as far as possible </li></ul><ul><ul><li>e.g. the DOI “10.1000/182” can be encoded as a URI in several ways: </li></ul></ul><ul><ul><li>http://dx.doi.org/10.1000/182, doi:10.1000/182, urn:doi:10.1000/182 and info:doi/10.1000/182 </li></ul></ul><ul><ul><li>therefore, DOI-aware applications need to have knowledge of these encodings hard-coded into them (partly because the DOI itself is just a string, but also because nothing in the URI specification indicates that the URI encodings are equivalent) </li></ul></ul><ul><ul><li>though within a domain this may become the norm (e.g. Google Scholar, Crossref, Connotea, etc.) </li></ul></ul>
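What "hard-coded knowledge of these encodings" looks like in practice can be sketched as below. The four prefixes are the ones listed on the slide; the function names are hypothetical, and the comparison is deliberately simplistic (it ignores details such as case and percent-encoding).

```python
# The four URI encodings of a DOI listed on the slide, as prefixes.
DOI_URI_PREFIXES = (
    "http://dx.doi.org/",
    "urn:doi:",
    "info:doi/",
    "doi:",
)

def extract_doi(uri):
    """Return the bare DOI string if the URI uses a known encoding, else None."""
    for prefix in DOI_URI_PREFIXES:
        if uri.startswith(prefix):
            return uri[len(prefix):]
    return None

def same_doi(uri_a, uri_b):
    """True only if both URIs are recognised encodings of the same DOI."""
    doi_a = extract_doi(uri_a)
    doi_b = extract_doi(uri_b)
    return doi_a is not None and doi_a == doi_b
```

An application without this table has no way of knowing that the four URIs identify the same thing, which is the "is this the same as that?" problem from the earlier slide.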
  16. 16. ARK system <ul><li>ARKs are worthy of note since they are ‘http’ URIs </li></ul><ul><ul><li>and therefore meet many of the usability requirements outlined earlier </li></ul></ul><ul><li>ARKs clearly flag an institutional commitment to persistence </li></ul><ul><ul><li>the identifier owner (often the resource owner) commits to maintaining ARK services and associated metadata </li></ul></ul><ul><ul><li>no reliance on third-party resolver </li></ul></ul><ul><li>but they suffer from the HTTP redirect problem </li></ul><ul><li>and ultimately may lead to multiple URIs being assigned to a single resource </li></ul>
  17. 17. Anatomy of ‘http’ URIs <ul><li>examples: http://www.somewhere.ac.uk/physics/index.cfm?name=about and http://www.somewhere.ac.uk/chemistry/report.rtf </li></ul><ul><li>‘http’ URI scheme – URI persistence not reliant on HTTP protocol, but is reliant on continued registration and management of the scheme (and of the URI spec. itself!) </li></ul><ul><li>DNS domain name – persistence reliant on continued ownership and management of the DNS domain name (and the DNS!) </li></ul><ul><li>Component hierarchy, often organisationally based – persistence reliant on continued management of component structure, i.e. not re-using old components </li></ul><ul><li>Server technology – change of technology may enforce change of URI, leading to multiple URIs for same resource (with no simple mechanism for determining equivalence) </li></ul><ul><li>File format – inappropriate if identifier is for the ‘work’ rather than the ‘manifestation’ - because changing the format will result in a new URI </li></ul>
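The components named on this slide can be pulled out of the two example URIs with Python's standard `urllib.parse` module. This is only a sketch: the `anatomy` function and its dictionary keys are labels invented here, and whether a file extension signals server technology (`.cfm`) or file format (`.rtf`) still has to be judged by a person.

```python
import posixpath
from urllib.parse import urlsplit

def anatomy(uri):
    """Split an 'http' URI into the components discussed on the slide."""
    parts = urlsplit(uri)
    # Separate the component hierarchy from the final path segment.
    directory, filename = posixpath.split(parts.path)
    _, extension = posixpath.splitext(filename)
    return {
        "scheme": parts.scheme,      # the 'http' URI scheme
        "domain": parts.netloc,      # the DNS domain name
        "hierarchy": directory,      # often organisationally based
        "extension": extension,      # server technology or file format
        "query": parts.query,        # parameters passed to the server
    }
```

For the first example URI this gives a hierarchy of `/physics` and an extension of `.cfm`; for the second, `/chemistry` and `.rtf`.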
  18. 18. Improving persistence of ‘http’ URIs <ul><li>choose long-lived DNS domain names – e.g. try to avoid details of internal organisational structure </li></ul><ul><li>partition URI components by ‘function’ rather than by organisational structure - because structure is likely to change </li></ul><ul><li>avoid exposing Web server technology in URIs (ColdFusion, PHP, etc.) - to allow changes to technology without URI proliferation and resolver breakage </li></ul><ul><li>avoid embedding details of document format into URIs, unless particular manifestation is being identified </li></ul><ul><li>avoid embedding end-user or session information into URIs – so that they can be shared between people </li></ul>
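Some of these recommendations can be checked mechanically. The sketch below flags URIs that expose server technology, embed a document format, or carry session information. The extension and parameter lists are assumptions chosen for illustration, not an authoritative catalogue, and the checks for domain names and functional partitioning are left out because they need human judgement.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

# Illustrative lists only; a real site would maintain its own.
TECH_EXTENSIONS = {".cfm", ".php", ".asp", ".jsp"}
FORMAT_EXTENSIONS = {".rtf", ".doc", ".pdf", ".html"}
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def persistence_warnings(uri):
    """Return a list of warnings about features likely to make a URI break."""
    parts = urlsplit(uri)
    extension = posixpath.splitext(parts.path)[1].lower()
    warnings = []
    if extension in TECH_EXTENSIONS:
        warnings.append("exposes server technology")
    elif extension in FORMAT_EXTENSIONS:
        warnings.append("embeds document format")
    # Compare query parameter names case-insensitively.
    param_names = {name.lower() for name in parse_qs(parts.query)}
    if param_names & SESSION_PARAMS:
        warnings.append("embeds session information")
    return warnings
```

A format warning is not always a fault: as the slide notes, embedding the format is appropriate when a particular manifestation is what is being identified.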
  19. 19. Conclusions and recommendations <ul><li>persistent identifiers require persistent commitment from the institution (and third-parties) </li></ul><ul><li>need to determine what ‘persistent’ means in practice (on the basis that ‘forever’ is unrealistic) </li></ul><ul><li>‘http’ URIs can be made more persistent if they are constructed and managed sensibly </li></ul><ul><li>use of DOIs/Handles/ARKs/PURLs may be appropriate (particularly where domain practice is clear) </li></ul><ul><ul><li>but need to be clear about cost/benefits and institutional and third-party commitment to maintaining resolver tables and associated services </li></ul></ul><ul><ul><li>where these are used, always and only use the ‘http’ form of URI (e.g. http://dx.doi.org/10.1000/182) </li></ul></ul>
  20. 20. Questions…