
  1. Trends in Use of Pandora Archive
     Presentation at IIPC Open Day, "The Broad Value of Web Archives", 30th April, 2012, Library of Congress
     Monica Omodei, Director, Web Archiving and Digital Preservation, National Library of Australia
     momodei @
  2. About the Pandora Archive
     • Selective, collaborative approach
       – High-value, discrete, timely collecting
       – A number of partners contribute to Pandora
     • Targeted Australian content
       – Selection policy; nominations are reviewed
     • Historical: started in 1996
     • Bibliocentric approach
       – Archived sites/publications are fully catalogued
     • Publicly accessible
       – Full-content keyword search through the national resource discovery service
       – Browsing shows a reconstituted version of the original site
       – Metadata is indexed in Google
  3. Pandora Archive Stats
     • Size: 6.32 TB
     • Number of files: > 140 million
     • Number of titles: > 30.5K
     • Number of title instances: > 73.5K
  4. Whole domain archive
     • We have also commissioned the Internet Archive (IA) to crawl the .au domain for us annually since 2005
     • Legislation prevents us from making this accessible yet
     • Hopefully we will soon be able to allow access to researchers
  5. Australian web domain crawls

     Year             2005         2006         2007         2008         2009         2011
     Files            185 million  596 million  516 million  1 billion    765 million  660 million
     Hosts crawled    811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
     Size (TBs)       6.69         19.04        18.47        34.55        24.29        30.71
  6. The Bad News
     • We have no legal deposit legislation for electronic publications, so permission to archive must be obtained
       – Significant content is missed because permission to copy is refused
     • The QA and fixing process can be labour intensive
       – Technical infrastructure is ten years old
     • Selection guidelines are outdated and don't align
     • Significant content is missed because of resourcing constraints and high labour cost
     • Search and browse functionality is very limited
       – No URL search, no time-based searching
     • Current infrastructure doesn't scale for broader themed collections with multiple sites or for domain-scale archiving
  7. Glass half full
     • The situation will improve markedly if Legal Deposit provisions are extended to digital publications
       – The Australian Attorney-General has released a consultation paper with a model for this extension
     • Broader coverage will be achieved when infrastructure is upgraded, improving scalability and reducing labour costs for QA/fixing
       – We have commenced a multi-year Digital Library Infrastructure Replacement Project, which includes upgrading our web archiving tools
       – We are currently trialling Heritrix for collaborative thematic collecting, and Wayback for access to our commissioned sub-domain archive
  8. DLIR Project
     • Digital Library Infrastructure Replacement
     • The RFP was followed by an RFT for components where reasonable solutions had been proposed (including the core repository)
     • The RFT evaluation recommended proceeding to contract negotiations with the selected tenderer for each component
     • Currently preparing a submission for ministerial approval prior to contract negotiations with vendors
  9. Patterns of Use
     • Which archived sites are popular, and why?
     • Is use of our archive growing?
     • What is the relative interest in older vs more recent captures?
     • Who is using our archives?
     • And what for?
  10. Which archived sites are popular?
      • Data source: filtered, aggregated web access log data that counts accesses to titles
      • Examined the top 30 archived titles (by number of accesses) for each year from 2009 to 2012
      • Selected some to examine and speculate as to why they might be popular
      • Included titles that ranked consistently high and titles whose rank varied widely between years
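
The ranking step on this slide could be sketched roughly as follows. This is an illustration only, not the tooling the library used: the input shape (one `(year, title_id)` pair per filtered access) and the function names `top_titles_per_year` and `split_by_consistency` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def top_titles_per_year(accesses, n=30):
    """Return the n most-accessed titles for each year.

    `accesses` is an iterable of (year, title_id) pairs, one per access
    (a hypothetical shape for the filtered log data).
    """
    counts = defaultdict(Counter)
    for year, title_id in accesses:
        counts[year][title_id] += 1
    return {year: c.most_common(n) for year, c in counts.items()}


def split_by_consistency(top_by_year):
    """Separate titles ranked in every year from those ranked only in some."""
    appearances = Counter(
        title for ranked in top_by_year.values() for title, _ in ranked
    )
    consistent = {t for t, k in appearances.items() if k == len(top_by_year)}
    variable = set(appearances) - consistent
    return consistent, variable
```

Titles in the `consistent` set correspond to the "consistently high ranking" group, and those in `variable` to the titles that move in and out of the top list between years.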
  11. Reasons for popularity of archived version
      • The sites were once popular and are now decommissioned, particularly if the domain name continues to exist and redirects to the archive
      • The sites may not be that popular as live sites, but the live site links prominently to Pandora as an archive for its content
      • Popular referencing sources cite the archive as well as the live site (if it still exists)
  12. Conclusions
      • Be more proactive in identifying unresponsive domains
      • Market automatic redirect services to web site owners/managers
      • Allow Google to index archive content for sites which are no longer live
  13. Is use of Pandora growing?
      [Chart: annual access figures for the Pandora web site and archive]
      • NB: robots.txt was not introduced on the site until 2005
      • A web site design change in 2008 pushed the measured figures down
  14. Interest in older vs recent content
      • Filtered access logs to requests referred from the entry page to the archived instance
      • Aggregated accesses by age (year) of the archived instance
      • Added the number of instances of that age in the archive as a reference
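
The three steps on this slide could be sketched as below. This is a minimal sketch under stated assumptions: each access is taken to be a `(referrer, capture_year)` pair, the entry pages are identified by a URL prefix, and the names `accesses_by_capture_year` and `with_holdings` are hypothetical.

```python
from collections import Counter


def accesses_by_capture_year(accesses, entry_page_prefix):
    """Count accesses per capture year, keeping only those referred
    from an entry page (assumed to share a common URL prefix)."""
    return Counter(
        year for referrer, year in accesses
        if referrer.startswith(entry_page_prefix)
    )


def with_holdings(access_counts, instance_years):
    """Pair each year's access count with the number of archived
    instances captured in that year, as a reference figure."""
    holdings = Counter(instance_years)
    years = sorted(set(access_counts) | set(holdings))
    return [(y, access_counts.get(y, 0), holdings.get(y, 0)) for y in years]
```

Showing holdings next to accesses makes it possible to tell whether a given year is accessed often because it is of interest or simply because the archive holds more instances from that year.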
  15. Age of instances accessed
  16. Who is using the archive?
      • Online survey linked to from the search service: approx. 450 respondents
      • Age, gender, location, education
      • How did they arrive?
      • What type of information were they seeking, and for what purpose?
      • Is it still available on the live web?
  17. But first an anecdote
      Article in a major newspaper, quote:
      "WE at Spring Loaded are no conspiracy theorists, but the disappearance of Liberal Party policies is curious. First went the policy documents. A recent revamp of the website saw the pre-election press releases go. But thanks to the National Library of Australia's Internet archive, many of the policies can be seen at […] When Spring Loaded asked about the missing policies, the Liberal Party said there was nothing untoward."
  18. Examples of lost web sites
      • Qantas's own special web site presenting its case during the major dispute with pilots, engineers and cabin crew unions that grounded the airline in 2011
      • Jeff Kennett's campaign web site in the 1999 Victorian State election, the first use of the web by a politician during a campaign in Australia
  19. About the respondents
  20. How did they arrive?
  21. What information was sought?
  22. What for?
  23. Other questions
      • Did you realise that you were going to enter an archived version of a web site, not the live one? (60% yes, 40% no)
      • Was the resource you were looking for no longer available on the live web? (50-50)
      • Have you visited other web archives? (60% yes, 40% no)
  24. Conclusions
      • We need to market our archive better
      • Promote redirects for closing, unsupported web sites
      • Convert archives to ARC/WARC so the Memento API will find content
      • Allow Google indexing of content for archived web sites where the live version is extinct or substantially altered
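
For context on the Memento point: a Memento client asks a TimeGate for the capture of a URL closest to a requested date by sending an `Accept-Datetime` header. The sketch below only builds such a request and sends nothing over the network. The TimeGate base URL is hypothetical, and `timegate_request` is a name invented for this example, not part of any Pandora or Memento library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate_base, original_url, when):
    """Build the URL and headers for a Memento TimeGate lookup.

    `timegate_base` is a hypothetical TimeGate endpoint; `when` should be
    a timezone-aware datetime (a naive one is treated as local time).
    """
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_base + original_url, {"Accept-Datetime": stamp}


url, headers = timegate_request(
    "https://timegate.example/timegate/",  # assumed endpoint, for illustration
    "http://example.com.au/",
    datetime(2012, 4, 30, tzinfo=timezone.utc),
)
```

An archive that exposes its captures this way can be discovered by Memento aggregators alongside other web archives, which is the motivation given on the slide.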