I say emulate


Published in: Technology, Art & Photos

  1. I Say Emulate; He Says Migrate
     Are emulation or migration feasible preservation strategies?
     National Library of Australia
     Prepared by: Andrew Stawowczyk Long
     Presented by: David Pearson
  2. Archiving the Web
     • Many institutions actively harvest the web
     • Collecting scale varies
     • Preservation practices are not well understood or implemented
     • Collecting intent may differ depending on the institution
  3. Web Archives
     • Type
        • Text oriented
        • Multimedia (video/audio) oriented
        • Picture oriented
        • Databases
        • Combination of all types
     • Storage
        • Uncompressed
        • Compressed (WARC)
        • Combination
  4. Web Objects and Elements
     • Challenge: Web archives may contain any type of digital object
     • Common objects
        • HTML/XML and related (htm, html, xml, css, etc.)
        • Images (raster images – JPEG, GIF, PNG)
        • Media
           • Audio files (au, wav, aiff, midi, mp3)
           • Video files (mov, mpg, wmv, rm)
     • Other objects
        • File archives (usually compressed – zip, tar, gz, arc, sit)
        • Images (raster images – bmp, tiff)
        • Images (vector images – SVG)
        • Text files (txt, csv, rtf)
        • Document files
           • PDF
           • Microsoft Word, Excel, PowerPoint
  5. Comparative statistics of NLA web collections
     PANDORA (selective): 73 million files, 3.26 TB
     .au Domain Harvests: 2.3 billion files, 78.75 TB
     Domain Harvest    2005          2006          2007          2008
     Unique files      185 million   596 million   516 million   1 billion
     Hosts crawled     811,523       1,046,038     1,247,614     3,038,658
     Size              6.69 TB       19.04 TB      18.47 TB      34.55 TB
  6. What are we preserving? Preservation Intent
     • Preservation of:
        • Physical media?
        • Bit-stream (logical form of data)?
        • Action (rendering data into something useful to the user)?
        • User experience?
     • Important considerations
        • Creator’s perceived intent
        • Institution’s preservation intent
     Based on Heslop and Davis (2002)
  7. What are we preserving? Properties
     • Object properties (the properties regarded as important vary depending on the intention of the collecting institution)
        • Derived from file format
        • High-level – e.g. layout, formatting or WEB
        • Measured – identified directly by computer
        • Intended – set by the collecting body
  8. Possible Preservation Actions 1
     • Emulation
        The original environment is recreated on contemporary hardware using specialised software (an emulator) and the original software.
     • Renderers
        Specialised software that operates in the contemporary environment and is used to access (render) original files. It is similar to emulation.
  9. Possible Preservation Actions 2
     • Migration
        Original file formats are migrated (converted) to another format that is supported by current hardware/software, e.g. MS Word 3.0 to MS Word 2008.
  10. Possible Preservation Actions 3
      Not long-term sustainable:
      • Technological Museum
         Collect and maintain the original hardware and software
      • Take No Action
         Do nothing
  11. Digital Preservation Preliminaries
      • Collection objects need to be correctly recognised and identified
      • Preservation intent(s) need to be defined
      • High-level preservation actions need to be defined (e.g. shall we use emulation or migration?)
      • Practical-level preservation actions need to be defined
      Object Format + Preservation Intent = Appropriate Action
      Dilemma: how can data be migrated properly if the preservation intent(s) are unknown or undefined?
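The rule "Object Format + Preservation Intent = Appropriate Action" can be read as a lookup keyed on both inputs. The sketch below is only an illustration of that reading: the format names, intent labels and chosen actions in `POLICY` are hypothetical placeholders, not NLA policy. It also makes the dilemma concrete, because no action can be selected while the intent is missing.

```python
from typing import Optional

# Hypothetical policy table. The (format, intent) pairs and the actions
# assigned to them are illustrative assumptions, not taken from the slides.
POLICY = {
    ("jpeg", "content"): "migrate",
    ("doc", "content"): "migrate",
    ("html", "user experience"): "emulate",
}


def choose_action(object_format: str, intent: Optional[str]) -> str:
    """Return a preservation action, or 'undecided' when none can be chosen."""
    if intent is None:
        # The dilemma from the slide: without a defined intent,
        # the format alone does not determine an action.
        return "undecided"
    key = (object_format.lower(), intent.lower())
    return POLICY.get(key, "undecided")


if __name__ == "__main__":
    print(choose_action("JPEG", "content"))
    print(choose_action("html", None))
```

Keeping the table as data rather than code means an institution could revise its intent per format without touching the selection logic.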
  12. Tools Required for Emulation
      • Emulators
         • Fast, stable, flexible, extendable
      • Licensed operating systems
      • Various drivers
      • Web browsers
      • Browser plug-ins
      • Other programs as required (e.g. Java, Adobe Acrobat Reader)
  13. Tools Required in Migration
      • Format identifiers
      • Format converters
      • Link updaters
      • QA automatons
      CAMiLEON project – Migration on Request Tool
      XENA
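As a rough idea of what a format identifier does, the sketch below matches the leading bytes of a file against a handful of well-known signatures. It is a deliberately minimal assumption-laden example: the `SIGNATURES` table covers only five formats, the function name `identify` is made up for illustration, and production identifiers rely on far larger signature registries and more elaborate matching.

```python
from typing import Optional

# A few widely documented file signatures ("magic numbers").
# This table is a tiny illustrative subset, not a complete registry.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format label based on the leading bytes, or None if unknown."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def identify_file(path: str) -> Optional[str]:
    """Read only the first 16 bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Identification by content rather than by file extension matters for web archives, where extensions are often missing or misleading.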
  14. Project Tests: General Testing Environment
      • Large slice of the uncompressed PANDORA archive (random selection)
      • The whole Domain Harvest archive (WARC files) was not included in the tests
      • Multiple hardware combinations
      • Multiple OS combinations
      • Multiple web browsers
  15. Project Tests: Material Sample
      Testing the industrial-scale tools
      • PANDORA slice
         • 861 GB
         • 18,019,172 files
         • 2,379,326 folders
      Testing object properties
      • Smaller slice of the PANDORA slice
         • 20 objects of each selected type
         • Audio, HTML, images, PDF, video, zip, MS documents
  16. Project Tests: Methodology
      • Large sample testing (861 GB, 18,019,172 files)
         • Attempt to identify objects in the sample using DROID
         • Attempt to migrate JPEG images to PNG and update links
      • Small sample testing
         • Select a smaller sub-sample, with objects mostly created before the year 2000
         • Identify objects in the sample
         • View and experience selected objects in contemporary environments using various platforms, OS and browsers
         • View and experience selected objects in old environments using emulation on various platforms, with different OS and browsers
         • Migrate selected objects and review them in various environments
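The "update links" half of the JPEG to PNG test can be sketched as follows. This is not the tool used in the project; it is a minimal regex-based illustration that assumes links appear as quoted `src` or `href` attributes and that the caller supplies the set of URLs whose targets were actually converted (the `migrated` parameter is a hypothetical name). A real link updater would more likely use an HTML parser and would also have to handle CSS and script references.

```python
import re
from typing import Set

# Matches src="...jpg" or href='...jpeg' (any case), capturing the quote style.
LINK = re.compile(
    r'''(?P<pre>\b(?:src|href)\s*=\s*)(?P<q>["'])(?P<url>[^"']+?\.jpe?g)(?P=q)''',
    re.IGNORECASE,
)


def update_links(html: str, migrated: Set[str]) -> str:
    """Point links at .png only for URLs listed in `migrated`."""

    def repl(match: "re.Match[str]") -> str:
        url = match.group("url")
        if url not in migrated:
            # Leave links alone when the target was not converted.
            return match.group(0)
        new_url = re.sub(r"\.jpe?g$", ".png", url, flags=re.IGNORECASE)
        quote = match.group("q")
        return f"{match.group('pre')}{quote}{new_url}{quote}"

    return LINK.sub(repl, html)


if __name__ == "__main__":
    page = '<img src="a.jpg"><img src="c.jpg">'
    print(update_links(page, {"a.jpg"}))
```

Restricting rewrites to known-converted files avoids creating broken links when a conversion fails partway through a large batch.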
  17. Project Tests: Tools tested
      • Common
         • DROID
         • JHOVE
         • TRiID
         • File Identifier
         • Lister (dev. in-house)
      • Migration
         • ImageMagick
         • MediaCoder
         • Swf>>avi
         • OpenOffice Tools
         • XENA
      • Emulation
         • QEMU
         • Bochs
         • MS Virtual PC (not exactly an emulator)
         • Dioscuri
      • OS
         – MS Win XP Pro
         – MS Win 3.1
         – MS Win 98SE
         – Ubuntu 9.04
      • Web Browsers
         – MS IE 7
         – Firefox 3
         – Arachne 1.2
         – Mosaic 2
         – Netscape 4
  18. Project Tests: Control – Current Environment
      • Properties observed in selected files
      Object basic characteristics (based on the Emulation Project by KB)
         1. Content: the text, images, etc. from the object
         2. Structure: the cohesion between different parts of the object
         3. Context: the meaning of the object
         4. Appearance: the way an object is presented to the user
         5. Behaviour: the interaction of the object with the user or system
      E.g. for HTML pages:
      • Rendering of text, images, media files
         • Font, layout, colours, contrast, brightness, animation smoothness, sound quality, etc.
      • Object dependencies
      • Mouse and keyboard behaviour
      • Data extraction
  19. Project Tests: Emulated Environments
      • Hardware
         • Dell Optiplex GX620, P4, 4.4GHz x 3.39GHz, 3.5 GB RAM
         • Power Mac G4
      Emulators:
      • Bochs
         • Host: WinXP Pro v2002 SP3; Ubuntu 9.04
         • Client: Win 3.1, MS DOS 6.2; WinXP Pro SP2
      • Dioscuri 0.4.0
         • Host: WinXP Pro v2002 SP3
         • Client: Win 3.1, MS DOS 6.2
  20. Project Tests: Emulated Environments
      • Qemu
         • Host: MS WinXP Pro v2002 SP3
         • Clients: MS Win98SE; MS Win 3.1; MS DOS 6.2; Ubuntu 9.04
         • Host: Ubuntu 9.04
         • Clients: MS WinXP Pro SP2 (P4, 12.92GHz, 256 MB RAM); MS Win98SE; MS Win 3.1
      • Microsoft Virtual PC
         • Host: MS WinXP Pro v2002 SP3
         • Clients: MS Win 3.1; MS Win98SE
  21. Tests - Summary: Emulation
      • Setting up emulators was relatively simple.
      • Additional software (especially for working with disk images) proved to be extremely useful.
      • Licensing was at times a big obstacle (e.g. it is impossible to emulate a Macintosh environment legally).
      • Many dependencies exist. It is a complex task to make programs work correctly.
         • e.g. Windows XP requires internet or over-the-phone activation after 30 days
  22. Tests – Summary: Emulation
      • All
         • Some of the DLL libraries in Win 3.1 did not work with the 16-bit Netscape and Mosaic programs
      • Bochs 2.3.7 for Windows
         • Extremely slow in GUI environments
         • No full-screen mode; limited end-user experience
      • Dioscuri
         • Sluggish at times
         • Could not read some of the disk images created in WinImage
      • Qemu 0.9.0 for Windows and Linux
         • Much faster but still sluggish at times
         • Win98SE couldn't run in hi-res, hi-colour mode
      • Microsoft Virtual PC
         • Relatively fast (it is virtualisation software on a PC) but still sluggish at times
  23. Tests - Summary: Migration Environment
      • Dell Optiplex GX620
      • MS Windows XP Pro v2002 SP3
      • Networked drive with PANDORA sample
  24. Tests - Summary: Migration
      • Available tools are imperfect and slow.
         • e.g. DROID took more than two weeks to examine slightly over 18 million files, and many of them were not recognised
      • It is very difficult to examine the contents of container formats (e.g. avi or rm)
      • Network connections need to be as fast as possible
      • It is difficult to make an informed decision about migration without a clearly defined preservation intent
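To put the DROID timing in perspective, the figures from the slides can be turned into a rough throughput estimate. The sketch below takes "more than two weeks" as exactly 14 days, so the rate it computes is an upper bound and the projected duration a lower bound. The projection onto the 2.3 billion files of the .au Domain Harvests is a back-of-the-envelope extrapolation only: it assumes the same per-file rate would hold, which was not tested (those harvests are stored as WARC files and were excluded from the tests).

```python
# Figures quoted in the slides.
FILES_EXAMINED = 18_019_172        # PANDORA slice
DOMAIN_HARVEST_FILES = 2.3e9       # .au Domain Harvests, all years

# Assumption: "more than two weeks" is treated as exactly 14 days.
DAYS_TAKEN = 14
SECONDS_PER_DAY = 86_400

# Upper bound on the identification rate, in files per second.
rate = FILES_EXAMINED / (DAYS_TAKEN * SECONDS_PER_DAY)

# Lower bound on the time to process the domain harvests at that rate.
projected_days = DOMAIN_HARVEST_FILES / rate / SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"at most about {rate:.1f} files per second")
    print(f"at least about {projected_days:.0f} days "
          f"({projected_days / 365:.1f} years) for the domain harvests")
```

Under these assumptions the rate is just under 15 files per second, which would put a single-machine pass over the domain harvests at several years.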
  25. Tests - General Comments
      • No proven methods exist
         • Real-world testing is needed
         • Most documented approaches are ad hoc; there are no commodity solutions
      • Tools are few and inadequate
  26. Tests - General Comments
      • Preservation policies, especially about preservation intent, are needed
      • Significant resources are needed to tackle the problem in practice
  27. Andrew Stawowczyk Long
      Strategist
      Digital Preservation Standards
      NLA
      anlong@nla.gov.au
      David Pearson
      Director (Acting)
      Web Archiving and Digital Preservation Branch
      NLA
      dapearso@nla.gov.au
      Project Report is due end of October 2009