Published in: Technology, Education

  1. Collecting Government Web Content at the National Library of Australia
     AGLIN Forum, 2 May 2012
     Paul Koerbin, Manager Web Archiving, National Library of Australia
  2. Web Archiving at the NLA
     • Background
     • Scale of collections
     • Archival collections (selective, bulk, govt)
     • Objectives, selection and scope
     • Retention and preservation
     • Finding government content in PANDORA
  3. Web Archiving at the NLA
     • Began web archiving activity in 1996
     • Government content is included in all NLA web collections
       – ‘PANDORA Archive’ collection, 1996 to now
         • Selective
       – The ‘auscrawl’ whole .au domain harvest collections
         • Annual since 2005
       – The ‘whole-of-government’ collections
         • Seed list
         • 2011, 2012
  4. Web Archiving at the NLA
     • Scale of collecting
       – PANDORA (as at April 2012, i.e. 15 years of collecting)
         • 31,000 titles
           – All govt ~55% of titles
           – Commonwealth Govt ~12% of titles
         • 75,000 instances
         • 145 million files
         • 6.5 TB
       – Australian .au domain harvests 2005-2011
         • 3.5 billion files
         • 140 TB
       – ‘Whole-of-government’ seed list crawl 2011
         • 7.4 million files
         • 538 GB
  5. Web Archiving at the NLA
     • PANDORA Archive
       – Strong representation of govt content including Commonwealth, State and Territory, and local govt (> 50% of titles)
       – Generally does not include whole departmental websites
       – Prominent ministerial micro-sites (speeches, press releases)
       – Government initiatives websites (e.g. Firearms buyback, 2000)
       – Major reports, enquiries, documents (e.g. Gershon Review, 2008)
       – Discrete ‘titles’ and ‘instances’ (no links between instances)
       – Quality checked
       – Catalogued and full text indexed
       – Accessible through the Trove and PANDORA discovery services
  6. Web Archiving at the NLA
     • Whole .au domain harvests (‘auscrawl’)
       – Crawls of the entire .au domain (plus some)
       – Averages over 1 million hosts crawled each year (av. 650m files)
       – Includes second level domain
       – Relies on crawler capabilities and subject to crawler limitations and constraints
       – Obeys robots.txt (except for inline image and style elements)
       – No quality checking for completeness of harvest or functionality (e.g. look and style)
       – Retains linkages between content that is in scope for the crawl
       – Full-text and URL indexes
       – But, not accessible to public
  7. Web Archiving at the NLA
     • Collecting Commonwealth Govt websites
       – Whole-of-government arrangements
         • Whole-of-government ICT policy
         • Secretaries’ ICT Governance Board, 7 May 2010
         • AGIMO circular 2010/01
         • governance/Whole-of-Government-ICT-Policies.html
         • Covers FMA Act agencies
           – CAC Act agencies still require individual permissions
         • Subject to opt-out arrangements
         • Replaced the need for individual copyright licence arrangements coordinated through the CCA
         • NLA now permitted to collect, preserve and make accessible freely available govt web content
  8. Web Archiving at the NLA
     • Whole-of-government collection
       – Based on list of specified URLs (most at domain level)
       – Around 800 seed URLs
       – Only includes FMA Act agency sites
       – No QA and fixing
       – Obeys robots.txt (except for inline images and style elements)
       – Full-text and URL indexes
       – No public access yet (but perhaps soon)
  9. Web Archiving at the NLA
     • Collecting mandate and objective
       – The National Library Act 1960 mandate to build and maintain a national comprehensive collection of material relating to Australia and Australians
       – ... and to make the collection available in the national interest
       – Objective is about ensuring future and ongoing access to materials of interest to Australia’s social, cultural and publishing heritage
       – Not the function of NLA web collecting (archiving) program to satisfy requirements for agencies under the Archives Act 1983
  10. Web Archiving at the NLA
      • Government ‘Web Guide’ recordkeeping advice:
        – “Archiving websites”
          • Mandatory requirement (Archives Act 1983 and Evidence Act 1995)
          • Seek advice from NAA
        – “Retaining access to outdated content”
          • Not a mandatory requirement
          • Recommends nominating content for inclusion in PANDORA
          • Does not ensure safeguarding of content
          • Selective
        – Create own publicly accessible archive
        – Publish advice on how people can access out-of-date content
      • New ‘whole-of-government’ web collection
        – More inclusive and larger scale than PANDORA
        – FMA Act agencies requirement (with ‘opt-out’ provisions)
        – CAC Act agencies: opt-in!
  11. Web Archiving at the NLA
      • PANDORA selection
        – Commonwealth Government publications a priority collecting area
        – Methodical approaches have been attempted but ...
        – Curator expertise and current awareness
        – Stakeholders as nominators (e.g. indexing agencies, other collecting areas in NLA, Parl Library, depts)
        – Selecting and scoping
          • Whole site, part site, specific documents
          • Substance and research value
          • Scheduling (when to harvest and how frequently)
          • Resources to undertake work
          • Technical constraints
  12. Web Archiving at the NLA
      • PANDORA collecting
        – Websites and web ‘documents’
          • documents (discrete files), whole sites, parts of sites
          • text, images, video, style elements, client side scripts
        – Content is harvested using a crawl robot
          • efficient (no work for publisher), automated process
          • deposit of complex objects is harder to deal with
        – Dynamic content becomes static HTML
          • an artefact of the original
          • the published version as you would view it from a web browser, not from the content management system
          • loses dynamic functionality
          • ‘normalising’ process
        – Persistent URIs
  13. Web Archiving at the NLA
      • Retention of collected web content
        – Archiving means preservation
        – Long term access
        – Collections developed and maintained in perpetuity for future generations
        – What is the preservation reality?
          • Is access in perpetuity achievable?
        – Investing in systems to manage for preservation
          • More than preserving the bit stream
          • Establishing preservation intent
          • Collecting and managing preservation metadata
          • Understanding formats and their risks (... and actions?)
  14. Web Archiving at the NLA
      • ‘DIY’ archive of your published web content
        – Use a subscription service
          • Archive-It (Internet Archive)
          • CDL Web Archiving Service
        – Build your own with open-source tools
          • Heritrix archival crawler
          • WARC packages
          • Wayback interface
        – Lightweight approach
          • HTTrack (free) offline browser for website snapshots
        – Citation service
          • on demand archiving of web resources
  15. Web Archiving at the NLA
      • Current and future developments at NLA
        – Digital Library Infrastructure Replacement (DLIR) project
          • Replacing infrastructure that manages our digital assets
          • Will require new web collecting infrastructure and processes
          • Already taking steps such as the seed list crawl
        – Some testing of new tools underway (Heritrix, Wayback)
        – Opening access to domain harvest content (
  16. Web Archiving at the NLA
      • Extension of ‘legal deposit’ to digital content
        – Attorney-General’s consultation paper
          • Submissions closed 14 April
        – Proposed model covers:
          • physical format digital (mandatory delivery)
          • online electronic publications (mandatory delivery on demand)
        – May put pressure on NLA resources & priorities
        – Already have ‘whole-of-government’ arrangements
          • Bulk harvesting of FMA Act agencies’ domains
          • Seek ‘opt-in’ from CAC Act agencies
  17. Web Archiving at the NLA
      • Finding government content in PANDORA
        – Full text search through Trove
          • Trove ‘Archived websites 1996 - now’ silo
          • All Trove (results in ‘Books’ and ‘Archived websites’)
          • PANDORA portal
        – Browse lists on PANDORA portal site
          • ‘Commonwealth Government’ (263 titles)
        – Catalogue (MARC record search)
          • NLA online catalogue
          • Libraries Australia
          • Trove (books silo)
          • Search e.g.: innovation industry pandora
        – Advanced search options for best results
        – ‘Pandora electronic collection’ (MARC 830 series field)
  19. Web Archiving at the NLA
      • Government Web Guide and NAA links
        – Archiving websites
        – Retaining access to outdated content
        – NAA Archiving Websites advice
          • Websites:-Advice-and-Policy-Statement
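
The domain harvest and whole-of-government slides both note that the crawls obey robots.txt except for inline image and style elements. A minimal Python sketch of that kind of policy, using the standard library's urllib.robotparser, might look like the following. The function name may_fetch, the "kind" labels, the user agent string and the example URLs are all hypothetical illustrations, not the NLA's actual crawler configuration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: resource kinds treated as inline page requisites, which this
# sketch fetches even when robots.txt would otherwise disallow them.
EMBEDDED_KINDS = {"image", "stylesheet"}


def may_fetch(robots_lines, user_agent, url, kind="page"):
    """Return True if this illustrative policy would allow fetching `url`.

    robots_lines: the lines of a site's robots.txt file
    kind: a hypothetical label the crawler assigns to the resource
    """
    # Inline images and style elements bypass the robots.txt check.
    if kind in EMBEDDED_KINDS:
        return True
    # Everything else is checked against the robots.txt rules.
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Example robots.txt content (hypothetical).
ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    for url, kind in [
        ("http://example.gov.au/index.html", "page"),
        ("http://example.gov.au/private/report.html", "page"),
        ("http://example.gov.au/private/logo.png", "image"),
    ]:
        print(url, kind, may_fetch(ROBOTS, "examplebot", url, kind))
```

The point of the exception is that an archived page can still be rendered with its look and style even where a site's robots.txt blocks the directories holding its images or stylesheets.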