Formats Over Time: Exploring UK Web History

  1. Formats over Time: Exploring UK Web History
     Andrew Jackson, UK Web Archive, The British Library
     iPres 2012 | 04-10-2012 | Toronto
  2. Formats over Time: DEBATING OBSOLESCENCE
  3. Rothenberg & Rosenthal On Format Obsolescence
     • Jeff Rothenberg:
       • “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
       • “…still apt…” (2012)
     • David Rosenthal:
       • “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
       • That network effects inhibit obsolescence
     • Where is the evidence?
  4. Formats over Time: AN EXPERIMENT
  5. UK Web Domain Dataset (1994-2010)
     • UK Web Domain Dataset (1994-2010)
       • From the Internet Archive
       • Millions of websites
       • > 2.5 billion resources
       • > 400,000 ARC/WARC files
       • > 35 TB
     • Execution at Scale
       • Stored on HDFS
       • Map-Reduce
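A minimal sketch of how a Map-Reduce job of this kind could build a format profile: the map step emits one count per identified resource, and the reduce step sums the counts per (year, format) key. This is plain Python for illustration, not the Hadoop code used for the experiment; the record shape and the names `map_record`, `reduce_counts`, and `run_job` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape: (crawl year, identified format).
Record = Tuple[int, str]


def map_record(record: Record) -> Iterator[Tuple[Record, int]]:
    """Map step: emit a count of 1 keyed by (year, format)."""
    year, fmt = record
    yield (year, fmt), 1


def reduce_counts(pairs: Iterable[Tuple[Record, int]]) -> Dict[Record, int]:
    """Reduce step: sum the counts for each (year, format) key."""
    totals: Dict[Record, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(records: Iterable[Record]) -> Dict[Record, int]:
    """Run map then reduce in-process (a real job would be distributed)."""
    pairs = (pair for record in records for pair in map_record(record))
    return reduce_counts(pairs)


# Made-up sample records, not taken from the dataset.
sample = [
    (2004, "image/png"),
    (2004, "image/png"),
    (2004, "application/pdf"),
    (2005, "image/png"),
]
profile = run_job(sample)
```

Because the reduce step only adds integers, partial results from separate workers can be merged in any order, which is what makes this kind of counting straightforward to run at scale.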
  6. Identification Tools
     • DROID
       • Well-known in digital preservation community
       • Format version level identification
       • Minor problem concerning file handles
       • Only binary signature part (DROID-B) could be embedded
     • Apache Tika
       • Widely used identification and data extraction tool
       • Identifies many formats at the MIME type level
       • Easy to embed and extend
       • Added ability to extract e.g. software identifiers
       • Minor bug concerning identification buffer size
  7. A Common Language For Format Identifiers
     • Comparison and combination requires a common model
       • Map PRONOM IDs to extended MIME Types
       • fmt/18 becomes application/pdf; version=1.4
       • Allows easy comparison at sub-type level
     • Can easily extend to cover other properties:
       • text/plain; charset=UTF-8
       • application/pdf; software="Adobe Acrobat 6.0"
     • Also extended Tika to output details from PDFs
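The mapping idea above can be sketched as a lookup table plus a small parser that splits an extended MIME type into its base type and parameters. This is an illustration under stated assumptions, not the project's own code: the table holds only the single fmt/18 entry the slide gives, and the function names are hypothetical.

```python
from typing import Dict, Tuple

# Only the one mapping shown on the slide; a real table would be much larger.
PRONOM_TO_MIME: Dict[str, str] = {
    "fmt/18": "application/pdf; version=1.4",
}


def parse_extended_mime(value: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; key=value; ...' into (base type, parameters).

    Simplification: a quoted value that itself contains ';' is not handled.
    """
    base, *params = [part.strip() for part in value.split(";")]
    parsed: Dict[str, str] = {}
    for param in params:
        if not param:
            continue
        key, _, raw = param.partition("=")
        parsed[key.strip()] = raw.strip().strip('"')
    return base, parsed


def same_base_type(a: str, b: str) -> bool:
    """Compare two identifiers at the MIME type level, ignoring parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


droid_result = PRONOM_TO_MIME["fmt/18"]
tika_result = 'application/pdf; software="Adobe Acrobat 6.0"'
```

With both tools expressed in the same form, `same_base_type(droid_result, tika_result)` compares them at the type level, while the parsed parameters allow comparison at version or software level where both tools supply them.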
  8. Format Profile Dataset
     • Server, Tika & DROID-B format profiles, over time. Example records (server type, Tika type, DROID-B type, year, count):
       • image/png | image/png | image/png; version=1.0 | 2004 | 102
       • application/pdf | application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0" | application/pdf; version=1.2 | 2004 | 1
     • CC0 – free to download and reuse
       • http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
       • Please cite us and/or let us know if you use it
     • Source code of all tools and modifications also available
       • https://github.com/openplanets/nanite
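A short sketch of reading records like the two examples on this slide. The file layout is an assumption: five tab-separated fields per line (server type, Tika type, DROID-B type, year, count), inferred from the examples rather than from the dataset's documentation, so check the published files before relying on it. `ProfileRow` and `read_profile` are hypothetical names.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class ProfileRow(NamedTuple):
    server_type: str
    tika_type: str
    droid_type: str
    year: int
    count: int


def read_profile(handle: TextIO) -> Iterator[ProfileRow]:
    """Yield one ProfileRow per well-formed line (assumed tab-separated)."""
    # QUOTE_NONE keeps the double quotes inside software="..." untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 5:
            continue  # skip lines that do not match the assumed layout
        server, tika, droid, year, count = fields
        yield ProfileRow(server, tika, droid, int(year), int(count))


# The two example records from the slide, in the assumed layout.
sample = (
    "image/png\timage/png\timage/png; version=1.0\t2004\t102\n"
    "application/pdf\t"
    'application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\t'
    "application/pdf; version=1.2\t2004\t1\n"
)
rows = list(read_profile(io.StringIO(sample)))
```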
  9. Results: COMPARING TOOLS
  10. Coverage & Depth
      [Figure: percentage of resources unidentified (axis ticks at 0%, 1%, 10%, 100%) by year, 1996-2010, for DROID-B v.59 and Apache Tika 1.1]
      • No format-version-level information from Apache Tika.
  11. Inconsistencies
      • Gaps
        • 37 formats spotted by DROID-B but not Tika
        • Notably includes earlier Office formats
        • 129 formats spotted by Tika but not DROID-B
        • But at least 20 are due to not using the full DROID
      • Conflicts
        • Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
        • ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
        • DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…
      • Both tools bad at non-HTML/XML text formats
        • CSS, scripting languages like JS, CSV, TSV, etc.
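Gap counts like those above amount to set differences between the formats each tool reported. A minimal sketch, assuming both outputs have already been normalised to the same identifier scheme; `find_gaps` is a hypothetical helper and the sample values are placeholders, not results from the study.

```python
from typing import Dict, Iterable, Set


def find_gaps(droid_formats: Iterable[str], tika_formats: Iterable[str]) -> Dict[str, Set[str]]:
    """Return the formats seen by only one tool, and those seen by both."""
    droid = set(droid_formats)
    tika = set(tika_formats)
    return {
        "droid_only": droid - tika,
        "tika_only": tika - droid,
        "both": droid & tika,
    }


# Placeholder identifiers for illustration only.
droid_seen = ["image/png", "application/pdf", "application/x-example-a"]
tika_seen = ["image/png", "application/pdf", "application/x-example-b"]
gaps = find_gaps(droid_seen, tika_seen)
```

Conflicts, where both tools identify the same resource differently, need a per-resource comparison instead and are not covered by this set-level view.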
  12. Results: FORMATS OVER TIME
  13. Image Formats Over Time
      [Figure: percentage of crawl (log scale, 0.00001% to 100%) by year, 1996-2010, for JPEG, GIF, PNG, ICON, XBM and TIFF]
  14. HTML Versions Over Time
      [Figure: percentage of HTML resources (0% to 100%) by year, 1996-2010, for HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01 and XHTML 1.0]
  15. PDF Versions Over Time
      [Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010, for PDF versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5 and 1.6]
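Charts of this kind plot, for each year, each version's share of that year's resources. A minimal sketch of the arithmetic, assuming counts keyed by (year, version); the function name and the sample counts are made up for illustration and are not the archive's figures.

```python
from collections import defaultdict
from typing import Dict, Tuple

Key = Tuple[int, str]  # (year, format version)


def yearly_shares(counts: Dict[Key, int]) -> Dict[Key, float]:
    """Convert raw counts into each version's percentage of its year's total."""
    totals: Dict[int, int] = defaultdict(int)
    for (year, _version), n in counts.items():
        totals[year] += n
    return {
        (year, version): 100.0 * n / totals[year]
        for (year, version), n in counts.items()
        if totals[year] > 0
    }


# Made-up counts for illustration.
sample_counts = {(2004, "1.3"): 3, (2004, "1.4"): 1, (2005, "1.4"): 2}
shares = yearly_shares(sample_counts)
```

Normalising per year removes the effect of crawl size, which varies considerably from year to year, so the lines show relative popularity rather than archive growth.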
  16. Format Usage Versus Time
      [Figure: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]
  17. Results: IMPLEMENTATIONS
  18. PDF Software Over Time
      [Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010; labelled series include Acrobat PDFWriter and Acrobat Distiller]
      • Over 2100 Distinct PDF Software IDs
  19. JPEG Hardware Over Time
      [Figure: percentage of hardware IDs (0% to 100%) by year, 1994-2010; labelled series include NIKON D40, MX1700, E990, DS5 and CYBERSHOT]
      • Over 2100 Distinct JPEG Hardware IDs
  20. Formats over Time: CONCLUSIONS
  21. Summary
      • Format obsolescence is complex
        • Network effects do appear to stabilize formats
        • But once popular formats are fading nevertheless
        • More sophisticated approach required
      • Please re-use our data, or ask for more
      • Firmer conclusions need:
        • Richer, more detailed results
        • From a wider range of corpora
      • This approach only gives creator information
        • A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)
  22. Questions? webarchive.org.uk