1) Preserving access to archived web content is challenging due to changing software and formats.
2) The PANDORA Web Archive case study highlights issues around mixed accessibility over time as browsers, plugins and standards change.
3) The presentation considers documenting the technical environments of web archives to better support alternative access strategies like emulation and format migration.
1. Preserving access:
Making more informed
“guesses” about what works
Prepared by: Maxine Davis, Collaboration Research Officer
Presented by: David Pearson, Acting Director
Web Archiving & Digital Preservation,
National Library of Australia
IIPC Open Day, San Francisco, 7 October 2009
2. Presentation Outline
• The problem
• Case study: PANDORA Web Archive
• Some approaches & options
– Approach 1: Unified Digital Format
Registry (UDFR)
– Approach 2: Wikipedia
– Approach 3: Another way: documenting
what web archives actually use/d
3. The problem
• The World Wide Web is constantly
evolving
– Requires combinations of software/hardware
to render web content
– But what is used for creation and access
changes
• Web archives
– Contain snapshots of websites taken at
different times (different sites or same sites
multiple times)
– Lots of files, many file formats, various
versions
– Aim for ongoing access
4. Process of version “creep”
in the archive
• Mixed accessibility resulting from:
– Different browsers, plug-ins, operating
systems in use (then and now)
– Backwards compatibility not guaranteed
– Changes in standards and coding practices
(deprecated, dead & non-standard tags)
– Obsolescence of file formats & renderers
• Changes to access paths
– Incremental loss of access not directly
obvious
– Alternative access paths not specified
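One way to make the incremental loss less hidden is to scan archived pages for markup that later browsers may stop supporting. The following is a minimal sketch, not something described in the presentation: it assumes the archived HTML is available as text, and the names `RiskyTagCounter` and `risky_tags` are hypothetical. The deprecated set is the list of elements deprecated in HTML 4.01; `blink` and `marquee` stand in for non-standard tags.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements deprecated in HTML 4.01.
DEPRECATED_TAGS = {"applet", "basefont", "center", "dir", "font",
                   "isindex", "menu", "s", "strike", "u"}
# Two well-known non-standard tags, used here only as examples.
NON_STANDARD_TAGS = {"blink", "marquee"}


class RiskyTagCounter(HTMLParser):
    """Counts start tags that are deprecated or non-standard (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in DEPRECATED_TAGS or tag in NON_STANDARD_TAGS:
            self.counts[tag] += 1


def risky_tags(html_text):
    """Return a dict mapping each risky tag to its number of occurrences."""
    parser = RiskyTagCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)


sample = ("<html><body><CENTER><font size=2>Hi</font></CENTER>"
          "<blink>New!</blink><p>ok</p></body></html>")
print(risky_tags(sample))
```

A count like this says nothing about whether a page still renders; it only flags candidates for closer checking.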
5. Case study:
PANDORA Australia’s Web Archive (1)
• Selective archive began collecting in 1996
– Sites individually selected by NLA &
partners
– As at July 2009 over 70.6 million files
– Accessible over the web using standard
web browser
• .au whole domain harvests
– 4 annual harvests 2005-2008 completed,
2009 underway with Internet Archive
– Combined harvests 05-08 ~ 2.3 billion files
– Not currently publicly available
7. IIPC Preservation Working Group
discussions
• Need for documenting the
technical environment
• Support required for alternative
preservation action strategies
– Emulation of past environments
– Migration to standard formats
– Risk notification
– Recording conversion and alternate
access paths
• Exploring different approaches
• Sharing information is sensible
8. Technical information of interest
• Browsers + plug-ins/helper
applications versions &
dependencies
• Used approximately when?
• Appropriate for which individual/
type of file format or whole
archive?
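The three questions above (what software, used when, appropriate for what) map naturally onto a small record structure. The sketch below is one possible shape, not a schema proposed in the presentation; the class name `SoftwareUse`, its field names, and the example values for dates and formats are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SoftwareUse:
    """One piece of rendering software and when it was in use (hypothetical schema)."""
    name: str
    version: str
    role: str                       # e.g. "browser", "plug-in", "helper application"
    used_from: int                  # approximate first year of use
    used_to: Optional[int] = None   # None means still in use, or end unknown
    depends_on: List[str] = field(default_factory=list)
    applies_to: List[str] = field(default_factory=list)  # formats, or "whole archive"

    def in_use(self, year: int) -> bool:
        """True if the software is recorded as in use during the given year."""
        if year < self.used_from:
            return False
        return self.used_to is None or year <= self.used_to


# Illustrative entry only: the year and the format label are assumptions.
flash = SoftwareUse(
    name="Adobe Flash Player",
    version="10",
    role="plug-in",
    used_from=2009,
    depends_on=["Internet Explorer 7"],
    applies_to=["swf"],
)
print(flash.in_use(2009), flash.in_use(2005))
```

Keeping `used_to` optional reflects the "approximately when?" question: the end of use is often unknown at the time of recording.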
9. Already documented?
• Manufacturers’/vendors’ websites
• Developers’ networks, forums, blogs,
etc.
• File format registries
• File extension resources
• Software archives/download sites
• Internet history websites
• Internet statistics websites
• Wikipedia
10. Possible Approach 1: UDFR
• Digital format registry will result from
proposed merger of PRONOM and
GDFR
• Pros
– Considerable intellectual investment already
– Could be used for general digital preservation and
potential interaction with other tools
• Cons
– Under development
– Web archive requirements need to be specified, use
cases developed, changes to data model, population
with relevant data and regular updating
– Temporal aspect not currently catered for
– Entry point: individual file format or software type [could
be a pro?]
11. Possible Approach 2: Wikipedia (1)
• Pros
– Existing free, web-based
collaborative multilingual
project
– Draws together a rich set of
information
• browsers, layout engines,
plug-ins & software, statistics,
creators, standards, etc.
• lists, history, comparisons,
timelines, links to internal &
external references
– Updated by many voluntary
contributors
12. Possible Approach 2: Wikipedia (2)
• Cons
– General audience, not specific to web archive
requirements or specific web archive
– Amount of detail varies (between different
language versions, articles)
– Can be edited by multiple users (+ & -)
– Not designed to interact with other digital
preservation tools as UDFR has potential to do
14. Possible Approach 3:
Documenting what web archives
are using/used
• Pros
– Time-based software suite approach
– Starting point for
• Potential UDFR seed list
• Identifying commonly used software
• Inferring additional software requirements
• Identifying alternate access paths
• Cons
– Easier to document current versions
– Obscure/obsolete material in our collections
may be unknown
15. Individual web archives as
sources of information
• Analysis of archive contents & harvesting
statistics
• Web archivists’ observations & records
– UK Web Archive Technology Watch blog
• Website usage statistics
– Browser versions & operating systems
– Indicative of popularity
• Archived sites
– Plug-in requirements, file type information
– May include useful information websites
– Internet Archive complementary collection
16. Example: NLA Web archiving
software environment July 2009
• Operating system: Windows XP
• Computer: Windows PC, Intel Pentium 4
• Browser: Internet Explorer 7 (main browser),
IE8, Firefox 3.0
• Additional software:
– Adobe Reader 8
– Adobe Shockwave Player
– Adobe Flash Player 10
– Real Player 10
– Apple QuickTime 7
– Windows Media Player 11
– Java 6 Update 11
– JavaScript enabled
– Word, Excel, PowerPoint 2003
– WinZip
17. Example: Earlier NLA Software
Environment
2005                        2000                        1996
Windows 2000                Windows 95                  Windows 3.1/Windows for Workgroups
Windows PC                  Windows PC                  Windows PC
IE6 (since June 2002)       Netscape Navigator 4.08     Netscape Navigator 1, 2 or 3?
Adobe Acrobat Reader        Acrobat Reader              Acrobat Reader
Macromedia Shockwave        Macromedia Shockwave        Macromedia Shockwave
Macromedia Flash player     Macromedia Flash player     ?
Real Player                 Real Player                 Real Audio player
Apple QuickTime             Apple QuickTime             QuickTime
Windows Media Player 9?     Windows Media Player 6.4?   Netscape Media Player?
Java ?                      Java ?                      Java?
JavaScript enabled          JavaScript enabled          JavaScript enabled
Word, Excel, PowerPoint     Word, Excel, PowerPoint     Word, Excel, PowerPoint
WinZip                      WinZip                      PKUnzip ?
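A time-based record like this could be used to suggest which documented environment to try for a given archived snapshot. The sketch below is an illustration under a strong assumption that the slides do not make: each documented environment is treated as applying from its snapshot year until the next documented one. The operating system and browser values are taken from the NLA examples; the names `NLA_ENVIRONMENTS` and `environment_for` are hypothetical.

```python
# (year documented, operating system, main browser), from the NLA examples.
NLA_ENVIRONMENTS = [
    (1996, "Windows 3.1/Windows for Workgroups", "Netscape Navigator 1, 2 or 3?"),
    (2000, "Windows 95", "Netscape Navigator 4.08"),
    (2005, "Windows 2000", "IE6 (since June 2002)"),
    (2009, "Windows XP", "Internet Explorer 7"),
]


def environment_for(capture_year):
    """Return the latest documented environment not newer than the capture year.

    Assumption: an environment applies from its documented year until the
    next documented one. Returns None if nothing that early is documented.
    """
    candidates = [env for env in NLA_ENVIRONMENTS if env[0] <= capture_year]
    if not candidates:
        return None
    return max(candidates, key=lambda env: env[0])


print(environment_for(2007))
print(environment_for(1998))
print(environment_for(1990))
```

The assumption is coarse. The note "IE6 (since June 2002)" shows that software changed between the documented years, so finer-grained dates per item would give better guesses than whole-environment snapshots.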
18. Example: Comparison NLA and
BnF software environments
• NLA web archivist’s software 2009
– Internet Explorer 7 and 8
– Firefox 3.0
– Adobe Reader 8
– Adobe Shockwave Player
– Adobe Flash Player 10
– Real Player 10
– Apple QuickTime 7
– Windows Media Player 11
– Java 6 Update 11
– JavaScript enabled
– Word, Excel, PowerPoint 2003
– WinZip
• BnF Librarian’s software since 2005
– Internet Explorer
– Acrobat Reader*
– Macromedia Flash player*
– Windows Media Player*
– QuickTime*
– Java Virtual Machine (Microsoft)*
– Later additions: Firefox, RealOne Player 10
• BnF public in-house access software 2008
– Internet Explorer
– Adobe Reader
– Adobe Flash player
– Adobe Shockwave player
– VLC Media player
– Real player
– Word, Excel & PowerPoint Viewers
– Java Virtual Machine
*Software versions progressively updated to latest
compatible with Windows XP
19. Going forward
• Is it worth pursuing approach 3?
• If so, where would we record it
(IIPC PWG wiki? other
suggestions)?
• Interested in contributing?
20. Questions?
Contact
• David Pearson
dapearson@nla.gov.au
• Maxine Davis
madavis@nla.gov.au
Report to IIPC PWG by
end October 2009
Everything, for Everyone
Forever