    Aglin Presentation Transcript

    • Collecting Government Web Content at the National Library of Australia
        AGLIN Forum, 2 May 2012
        Paul Koerbin, Manager Web Archiving, National Library of Australia
    • Web Archiving at the NLA
        • Background
        • Scale of collections
        • Archival collections (selective, bulk, govt)
        • Objectives, selection and scope
        • Retention and preservation
        • Finding government content in PANDORA
    • Web Archiving at the NLA
        • Began web archiving activity in 1996
            – http://pandora.nla.gov.au/
        • Government content is included in all NLA web collections
            – 'PANDORA Archive' collection, 1996 to now
                • Selective
            – The 'auscrawl' whole .au domain harvest collections
                • Annual since 2005
            – The 'whole-of-government' collections
                • Seed list
                • 2011, 2012
    • Web Archiving at the NLA
        • Scale of collecting
            – PANDORA (as at April 2012, i.e. 15 years of collecting)
                • 31,000 titles
                    – All govt ~55% of titles
                    – Commonwealth Govt ~12% of titles
                • 75,000 instances
                • 145 million files
                • 6.5 TB
            – Australian .au domain harvests 2005-2011
                • 3.5 billion files
                • 140 TB
            – 'Whole-of-government' seed list crawl 2011
                • 7.4 million files
                • 538 GB
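The file counts and storage totals on this slide imply an average size per archived file for each collection. The short sketch below works that out. It assumes the slide's "Tb" and "Gb" mean terabytes and gigabytes in decimal units (1 TB = 10**12 bytes), which the slide does not state; the collection labels are shorthand for this example only.

```python
# Figures as quoted on the slide: (number of files, total bytes).
# Decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes) are an assumption.
COLLECTIONS = {
    "PANDORA (to April 2012)": (145e6, 6.5e12),
    ".au domain harvests 2005-2011": (3.5e9, 140e12),
    "Whole-of-government crawl 2011": (7.4e6, 538e9),
}


def average_file_size_kb(files: float, total_bytes: float) -> float:
    """Return the mean size per file in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / files / 1000


for name, (files, total_bytes) in COLLECTIONS.items():
    size_kb = average_file_size_kb(files, total_bytes)
    print(f"{name}: ~{size_kb:.0f} KB per file")
```

Under these assumptions the averages come out at roughly 45 KB, 40 KB and 73 KB per file respectively, so the three collections are of a similar order per file even though their totals differ by orders of magnitude.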
    • Web Archiving at the NLA
        • PANDORA Archive
            – Strong representation of govt content including Commonwealth, State and Territory, and local govt (> 50% of titles)
            – Generally does not include whole departmental websites
            – Prominent ministerial micro-sites (speeches, press releases)
            – Government initiatives websites (e.g. Firearms buyback, 2000)
            – Major reports, enquiries, documents (e.g. Gershon Review, 2008)
            – Discrete 'titles' and 'instances'; no links between instances
            – Quality checked
            – Catalogued and full text indexed
            – Accessible through the Trove and PANDORA discovery services
    • Web Archiving at the NLA
        • Whole .au domain harvests ('auscrawl')
            – Crawls of the entire .au domain (plus some)
            – Averages over 1 million hosts crawled each year (av. 650m files)
            – Includes gov.au second level domain
            – Relies on crawler capabilities and subject to crawler limitations and constraints
            – Obeys robots.txt (except for inline image and style elements)
            – No quality checking for completeness of harvest or functionality (e.g. look and style)
            – Retains linkages between content that is in scope for the crawl
            – Full-text and URL indexes
            – But, not accessible to public
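The robots.txt behaviour described on this slide (obey the file, except for inline image and style elements) can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch of the general idea, not the crawler configuration the NLA uses. The robots.txt content, the user-agent name, the URLs and the `inline_resource` flag are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "example-archive-crawler"  # hypothetical crawler name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, inline_resource: bool = False) -> bool:
    """Apply robots.txt rules, but always allow inline images and style elements."""
    if inline_resource:
        # Exception described on the slide: resources needed to render a page
        # are fetched even where robots.txt would exclude them.
        return True
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "http://www.agency-a.example/index.html"))
print(may_fetch(parser, "http://www.agency-a.example/private/report.html"))
print(may_fetch(parser, "http://www.agency-a.example/private/logo.png", inline_resource=True))
```

In this sketch the ordinary page under `/private/` is refused, while the image under the same path is allowed because the caller has marked it as an inline resource. How a real crawler decides that a URL is an inline resource is outside what the slide describes.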
    • Web Archiving at the NLA
        • Collecting Commonwealth Govt websites
            – Whole-of-government arrangements
                • Whole-of-government ICT policy
                • Secretaries' ICT Governance Board, 7 May 2010
                • AGIMO circular 2010/01
                • http://www.finance.gov.au/e-government/strategy-and-governance/Whole-of-Government-ICT-Policies.html
                • Covers FMA Act agencies
                    – CAC Act agencies: still require individual permissions
                • Subject to opt-out arrangements
                • Replaced the need for individual copyright licence arrangements coordinated through the CCA
                • NLA now permitted to collect, preserve and make accessible freely available govt web content
    • Web Archiving at the NLA
        • Whole-of-government collection
            – Based on list of specified URLs (most at domain level)
            – Around 800 seed URLs
            – Only includes FMA Act agency sites
            – No QA and fixing
            – Obeys robots.txt (except for inline images and style elements)
            – Full-text and URL indexes
            – No public access yet (but perhaps soon)
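A seed-list crawl of this kind needs some rule for deciding which discovered URLs belong to the collection. The slide says only that most seeds are at domain level, so the rule sketched below is an assumption: a URL is treated as in scope when its host matches a seed's domain or is a subdomain of it. The seed URLs and function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical seed URLs standing in for the roughly 800 on the real list.
SEEDS = ["http://www.agency-a.example/", "http://agency-b.example/"]


def seed_hosts(seeds):
    """Reduce each seed URL to its host, dropping a leading 'www.' label."""
    hosts = set()
    for seed in seeds:
        host = urlsplit(seed).hostname or ""
        if host.startswith("www."):
            host = host[len("www."):]
        hosts.add(host)
    return hosts


def in_scope(url, hosts):
    """Return True if the URL's host is a seed domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in hosts)


hosts = seed_hosts(SEEDS)
print(in_scope("http://media.agency-a.example/releases/1.html", hosts))
print(in_scope("http://www.elsewhere.example/", hosts))
```

Matching on the dotted suffix rather than a plain string suffix avoids treating a host such as `notagency-a.example` as part of `agency-a.example`.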
    • Web Archiving at the NLA
        • Collecting mandate and objective
            – The National Library Act 1960 mandate to build and maintain a national comprehensive collection of material relating to Australia and Australians
            – ... and to make the collection available in the national interest
            – Objective is about ensuring future and ongoing access to materials of interest to Australia's social, cultural and publishing heritage
            – Not the function of NLA web collecting (archiving) program to satisfy requirements for agencies under the Archives Act 1983
    • Web Archiving at the NLA
        • Government 'Web Guide' recordkeeping advice:
            – "Archiving websites"
                • Mandatory requirement (Archives Act 1983 and Evidence Act 1995)
                • seek advice from NAA
            – "Retaining access to outdated content"
                • Not a mandatory requirement
                • Recommends nominating content for inclusion in PANDORA
                • Does not ensure safeguarding of content
                • Selective
            – Create own publicly accessible archive
            – Publish advice how people can access out of date content
        • New 'whole-of-government' web collection
            • More inclusive and larger scale than PANDORA
            • FMA Act agencies requirement (with 'opt-out' provisions)
            • CAC Act agencies: opt-in!
    • Web Archiving at the NLA
        • PANDORA selection
            – Commonwealth Government publications a priority collecting area
            – Methodical approaches have been attempted but ...
            – Curator expertise and current awareness
            – Stakeholders as nominators (e.g. indexing agencies, other collecting areas in NLA, Parl Library, depts)
            – Selecting and scoping
                • Whole site, part site, specific documents
                • Substance and research value
                • Scheduling (when to harvest and how frequently)
                • Resources to undertake work
                • Technical constraints
    • Web Archiving at the NLA
        • PANDORA collecting
            – Websites and web 'documents'
                • documents (discrete files), whole sites, parts of sites
                • text, images, video, style elements, client side scripts
            – Content is harvested using a crawl robot
                • efficient (no work for publisher), automated process
                • deposit of complex objects is harder to deal with
            – Dynamic content becomes static HTML
                • an artefact of the original
                • the published version as you would view it from a web browser, not from the content management system
                • loses dynamic functionality
                • 'normalising' process
            – Persistent URIs
    • Web Archiving at the NLA
        • Retention of collected web content
            – Archiving means preservation
            – Long term access
            – Collections developed and maintained in perpetuity for future generations
            – What is the preservation reality?
                • Is access in perpetuity achievable?
            – Investing in systems to manage for preservation
                • More than preserving the bit stream
                • Establishing preservation intent
                • Collecting and managing preservation metadata
                • Understanding formats and their risks (... and actions?)
    • Web Archiving at the NLA
        • 'DIY' archive of your published web content
            – Use a subscription service
                • ArchiveIT (Internet Archive) www.archive-it.org
                • CDL Web Archiving Service webarchives.cdlib.org
            – Build your own with open-source tools
                • Heritrix archival crawler crawler.archive.org
                • WARC packages
                • Wayback interface
            – Lightweight approach
                • HTTrack (free) offline browser for website snapshots www.httrack.com
            – Citation service
                • on demand archiving of web resources webcitation.org
    • Web Archiving at the NLA
        • Current and future developments at NLA
            – Digital Library Infrastructure Replacement (DLIR) project
                • Replacing infrastructure that manages our digital assets
                • Will require new web collecting infrastructure and processes
                • Already taking steps such as the gov.au seed list crawl
            – Some testing of new tools underway (Heritrix, Wayback)
            – Opening access to domain harvest content (gov.au)
    • Web Archiving at the NLA
        • Extension of 'legal deposit' to digital content
            – Attorney-General's consultation paper
                • Submissions closed 14 April
            – Proposed model covers:
                • physical format digital (mandatory delivery)
                • online electronic publications (mandatory delivery on demand)
            – May put pressure on NLA resources & priorities
            – Already have 'whole-of-government' arrangements
                • Bulk harvesting of FMA Act agencies' domains
                • Seek 'opt-in' from CAC Act agencies
    • Web Archiving at the NLA
        • Finding government content in PANDORA
            – Full text search through Trove
                • Trove 'Archived websites 1996 - now' silo
                • All Trove (results in 'Books' and 'Archived websites')
                • PANDORA portal
            – Browse lists on PANDORA portal site
                • 'Commonwealth Government' (263 titles)
            – Catalogue (MARC record search)
                • NLA online catalogue
                • Libraries Australia
                • Trove (books silo)
                • Search e.g.: innovation industry pandora
            – Advanced search options for best results
            – 'Pandora electronic collection' (MARC 830 series field)
    • http://www.flickr.com/photos/ricksmit/15671245/
    • Web Archiving at the NLA
        • Government Web Guide and NAA links
            – Archiving websites
                • http://webguide.gov.au/recordkeeping/archiving-a-website/
            – Retaining access to outdated content
                • http://webguide.gov.au/recordkeeping/retaining-access-to-outdated-content/
            – NAA Archiving Websites advice
                • http://www.naa.gov.au/records-management/publications/index.aspx#Archiving-Websites:-Advice-and-Policy-Statement