Formats Over Time: Exploring UK Web History
Transcript

  • 1. Formats over Time: Exploring UK Web History. Andrew Jackson, UK Web Archive, The British Library. iPres 2012 | 04-10-2012 | Toronto
  • 2. Formats over Time: DEBATING OBSOLESCENCE
  • 3. Rothenberg & Rosenthal On Format Obsolescence
      • Jeff Rothenberg:
          • “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
          • “…still apt…” (2012)
      • David Rosenthal:
          • “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
          • That network effects inhibit obsolescence
      • Where is the evidence?
  • 4. Formats over Time: AN EXPERIMENT
  • 5. UK Web Domain Dataset (1994-2010)
      • UK Web Domain Dataset (1994-2010)
          • From the Internet Archive
          • Millions of websites
          • > 2.5 billion resources
          • > 400,000 ARC/WARC files
          • > 35 TB
      • Execution at Scale
          • Stored on HDFS
          • Map-Reduce
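
The Map-Reduce execution named on slide 5 can be pictured with a small in-memory sketch. This is not the authors' Hadoop job: the record layout `(year, mime_type)`, the function names, and the sample records are all assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def map_record(record):
    """Map step: emit one ((year, format), 1) pair per archived resource."""
    year, mime_type = record  # hypothetical record layout
    yield (year, mime_type), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (year, format) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(records):
    """Chain the map and reduce steps over an iterable of records."""
    return reduce_counts(chain.from_iterable(map_record(r) for r in records))


# Made-up sample records, not drawn from the dataset.
sample = [(2004, "image/png"), (2004, "image/png"), (2004, "application/pdf")]
print(run(sample))
```

On a real cluster the map step would run next to each ARC/WARC file on HDFS and the reduce step would merge the partial counts; the sketch only shows the shape of the computation.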
  • 6. Identification Tools
      • DROID
          • Well-known in digital preservation community
          • Format version level identification
          • Minor problem concerning file handles
          • Only binary signature part (DROID-B) could be embedded
      • Apache Tika
          • Widely used identification and data extraction tool
          • Identifies many formats at the MIME type level
          • Easy to embed and extend
          • Added ability to extract e.g. software identifiers
          • Minor bug concerning identification buffer size
  • 7. A Common Language For Format Identifiers
      • Comparison and combination requires a common model
          • Map PRONOM IDs to extended MIME Types
          • fmt/18 becomes application/pdf; version=1.4
          • Allows easy comparison at sub-type level
      • Can easily extend to cover other properties:
          • text/plain; charset=UTF-8
          • application/pdf; software="Adobe Acrobat 6.0"
          • Also extended Tika to output details from PDFs
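
The extended MIME type model on slide 7 can be sketched as follows. The only PRONOM mapping used is the one given on the slide (fmt/18); the function names are hypothetical, and the parser is a simplification that assumes parameter values contain no semicolons.

```python
# Single mapping taken from the slide; a real table would cover many more IDs.
PRONOM_TO_MIME = {"fmt/18": "application/pdf; version=1.4"}


def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base_type, params).

    Simplification: splits on ';', so quoted values must not contain ';'.
    """
    parts = [p.strip() for p in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        key, _, val = part.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    return base, params


def same_base_type(a, b):
    """Compare two identifiers at the type/subtype level, ignoring parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


droid_result = PRONOM_TO_MIME["fmt/18"]
tika_result = "application/pdf"
print(parse_extended_mime(droid_result))          # ('application/pdf', {'version': '1.4'})
print(same_base_type(droid_result, tika_result))  # True
```

Keeping version, charset, and software as parameters means a version-level DROID result and a type-level Tika result can still be compared on the part they share.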
  • 8. Format Profile Dataset
      • Server, Tika & DROID-B format profiles, over time. Example rows (server | Tika | DROID-B | year | count):
          • image/png | image/png | image/png; version=1.0 | 2004 | 102
          • application/pdf | application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0" | application/pdf; version=1.2 | 2004 | 1
      • CC0 – free to download and reuse
          • http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
          • Please cite us and/or let us know if you use it
      • Source code of all tools and modifications also available
          • https://github.com/openplanets/nanite
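
A minimal sketch of how such a format profile might be aggregated per year. It assumes tab-separated rows with the columns server type, Tika type, DROID-B type, year, and count, in the order suggested by the slide; the actual layout of the published files should be checked before relying on this. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def tika_counts_by_year(tsv_text):
    """Sum resource counts per (year, Tika base type).

    Assumed columns: server type, Tika type, DROID-B type, year, count.
    """
    totals = defaultdict(int)
    # QUOTE_NONE keeps the double quotes inside MIME parameters untouched.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        _server, tika, _droid, year, count = row
        base = tika.split(";", 1)[0].strip()
        totals[(int(year), base)] += int(count)
    return dict(totals)


# The two example rows shown on the slide, joined with tabs.
rows = (
    "image/png\timage/png\timage/png; version=1.0\t2004\t102\n"
    'application/pdf\tapplication/pdf; version=1.2; '
    'software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\tapplication/pdf; version=1.2\t2004\t1\n'
)
print(tika_counts_by_year(rows))
```

Dividing each total by the sum for its year would give per-year percentages of the kind plotted on the results slides.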
  • 9. Results: COMPARING TOOLS
  • 10. Coverage & Depth [Figure: percentage of resources unidentified (axis marked at 0%, 1%, 10% and 100%) by year, 1996 to 2010, for DROID-B v.59 and Apache Tika 1.1.] No format-version-level information from Apache Tika.
  • 11. Inconsistencies
      • Gaps
          • 37 formats spotted by DROID-B but not Tika
          • Notably includes earlier Office formats
          • 129 formats spotted by Tika but not DROID-B
          • But at least 20 are due to not using the full DROID
      • Conflicts
          • Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
          • ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
          • DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…
      • Both tools bad at non-HTML/XML text formats
          • CSS, scripting languages like JS, CSV, TSV, etc.
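
The gaps and conflicts on slide 11 amount to set differences and per-resource disagreements between the two tools' results. A minimal sketch, assuming both tools' outputs are already expressed as extended MIME types; the function names and sample values are hypothetical and not taken from the dataset.

```python
def base_type(mime):
    """Return the type/subtype part of an extended MIME type."""
    return mime.split(";", 1)[0].strip().lower()


def coverage_gaps(droid_types, tika_types):
    """Return (formats only DROID-B reported, formats only Tika reported)."""
    droid_set = {base_type(t) for t in droid_types}
    tika_set = {base_type(t) for t in tika_types}
    return sorted(droid_set - tika_set), sorted(tika_set - droid_set)


def conflicts(pairs):
    """Return (droid, tika) pairs where both tools answered but disagree."""
    return [(d, t) for d, t in pairs
            if d and t and base_type(d) != base_type(t)]


# Made-up placeholder identifiers.
droid_seen = ["application/pdf; version=1.4", "application/x-example-a"]
tika_seen = ["application/pdf", "application/x-example-b"]
print(coverage_gaps(droid_seen, tika_seen))

per_resource = [
    ("application/pdf; version=1.4", "application/pdf"),  # agree at base type
    ("application/x-example-a", "application/x-example-b"),  # conflict
    (None, "application/pdf"),  # DROID-B gave no answer: not a conflict
]
print(conflicts(per_resource))
```

Comparing at the base-type level avoids counting a version-level DROID-B result and a type-level Tika result as a disagreement.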
  • 12. Results: FORMATS OVER TIME
  • 13. Image Formats Over Time [Figure: percentage of crawl (logarithmic axis, 0.00001% to 100%) by year, 1996 to 2010; series: JPEG, GIF, PNG, ICON, XBM, TIFF.]
  • 14. HTML Versions Over Time [Figure: percentage of HTML resources (0% to 100%) by year, 1996 to 2010; series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0.]
  • 15. PDF Versions Over Time [Figure: percentage of PDF resources (0% to 100%) by year, 1996 to 2010; series: PDF 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6.]
  • 16. Format Usage Versus Time [Figure: number of resources in archive (logarithmic axis, 1 to 10,000,000,000) against timespan in years (0 to 18).]
  • 17. Results: IMPLEMENTATIONS
  • 18. PDF Software Over Time [Figure: percentage of PDF resources (0% to 100%) by year, 1996 to 2010; labelled series: Acrobat, Acrobat PDFWriter, Acrobat Distiller.] Over 2100 Distinct PDF Software IDs
  • 19. JPEG Hardware Over Time [Figure: percentage of hardware IDs (0% to 100%) by year, 1994 to 2010; labelled series: NIKON D40, MX1700, E990, DS5, CYBERSHOT.] Over 2100 Distinct JPEG Hardware IDs
  • 20. Formats over Time: CONCLUSIONS
  • 21. Summary
      • Format obsolescence is complex
          • Network effects do appear to stabilize formats
          • But once-popular formats are fading nevertheless
          • More sophisticated approach required
      • Please re-use our data, or ask for more
      • Firmer conclusions need:
          • Richer, more detailed results
          • From a wider range of corpora
      • This approach only gives creator information
          • A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)
  • 22. Questions? webarchive.org.uk
