Formats Over Time: Exploring UK Web History

Published in: Technology

  1. Formats over Time: Exploring UK Web History
     Andrew Jackson, UK Web Archive, The British Library
     iPres 2012 | 04-10-2012 | Toronto
  2. Formats over Time: DEBATING OBSOLESCENCE
  3. Rothenberg & Rosenthal On Format Obsolescence
     • Jeff Rothenberg:
       • “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
       • “…still apt…” (2012)
     • David Rosenthal:
       • “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
       • That network effects inhibit obsolescence
     • Where is the evidence?
  4. Formats over Time: AN EXPERIMENT
  5. UK Web Domain Dataset (1994-2010)
     • UK Web Domain Dataset (1994-2010)
       • From the Internet Archive
       • Millions of websites
       • > 2.5 billion resources
       • > 400,000 ARC/WARC files
       • > 35 TB
     • Execution at Scale
       • Stored on HDFS
       • Map-Reduce
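
The Map-Reduce step on slide 5 can be illustrated with a small in-memory sketch. This is not the project's Hadoop job: the record layout (`timestamp`, `mime_type`) and the function names are assumptions made for illustration, and everything runs in a single process.

```python
from collections import defaultdict

def map_resource(record):
    """Map step: emit ((year, mime_type), 1) for one archived resource.

    Assumes a hypothetical record with a 14-digit crawl timestamp
    (YYYYMMDDhhmmss) and an already-identified MIME type.
    """
    year = record["timestamp"][:4]
    yield (year, record["mime_type"]), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (year, mime_type) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(records):
    """Chain map and reduce over an iterable of records."""
    emitted = (pair for record in records for pair in map_resource(record))
    return reduce_counts(emitted)

# Invented sample records, for illustration only.
SAMPLE = [
    {"timestamp": "20040115120000", "mime_type": "image/png"},
    {"timestamp": "20040220093000", "mime_type": "image/png"},
    {"timestamp": "19990301000000", "mime_type": "text/html"},
]

if __name__ == "__main__":
    for key, count in sorted(run_job(SAMPLE).items()):
        print(key, count)
```

On a real cluster the map and reduce functions would run on separate nodes over the ARC/WARC files. The sketch only shows the shape of the computation.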
  6. Identification Tools
     • DROID
       • Well-known in digital preservation community
       • Format version level identification
       • Minor problem concerning file handles
       • Only binary signature part (DROID-B) could be embedded
     • Apache Tika
       • Widely used identification and data extraction tool
       • Identifies many formats at the MIME type level
       • Easy to embed and extend
       • Added ability to extract e.g. software identifiers
       • Minor bug concerning identification buffer size
  7. A Common Language For Format Identifiers
     • Comparison and combination require a common model
       • Map PRONOM IDs to extended MIME Types
         • fmt/18 becomes application/pdf; version=1.4
       • Allows easy comparison at sub-type level
       • Can easily extend to cover other properties:
         • text/plain; charset=UTF-8
         • application/pdf; software="Adobe Acrobat 6.0"
       • Also extended Tika to output details from PDFs
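
The extended MIME type idea on slide 7 can be sketched as follows. Only the `fmt/18` entry is taken from the slide. The function names are hypothetical, and the parser is a simplification that assumes quoted parameter values never contain a semicolon.

```python
# Lookup table from PRONOM IDs to extended MIME types.
# Only fmt/18 comes from the slide; a real table would cover the whole registry.
PRONOM_TO_MIME = {
    "fmt/18": "application/pdf; version=1.4",
}

def pronom_to_mime(puid):
    """Return the extended MIME type for a PRONOM ID, or None if unmapped."""
    return PRONOM_TO_MIME.get(puid)

def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base_type, params).

    Simplified: assumes parameter values do not themselves contain ';'.
    """
    parts = [part.strip() for part in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        key, _, raw = part.partition("=")
        params[key.strip().lower()] = raw.strip().strip('"')
    return base, params

def same_subtype(a, b):
    """Compare two extended MIME types at the type/subtype level only."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]

if __name__ == "__main__":
    print(parse_extended_mime(pronom_to_mime("fmt/18")))
    print(same_subtype("application/pdf; version=1.4", "application/pdf"))
```

Keeping version, charset and software as parameters means a version-level DROID result and a type-level Tika result can still be compared after dropping the parameters.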
  8. Format Profile Dataset
     • Server, Tika & DROID-B format profiles, over time:
         image/png    image/png    image/png; version=1.0    2004    102
         application/pdf    application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0"    application/pdf; version=1.2    2004    1
     • CC0 – free to download and reuse
       • http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
       • Please cite us and/or let us know if you use it
     • Source code of all tools and modifications also available
       • https://github.com/openplanets/nanite
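
A minimal sketch of reading records like the two shown on slide 8. It assumes each record is one tab-separated line with five columns: server type, Tika type, DROID-B type, year, and count. That layout is inferred from the slide and has not been checked against the published files, so treat the column order and the function names as assumptions.

```python
from collections import Counter

# The two example records from the slide, with tabs assumed as separators.
SAMPLE = (
    "image/png\timage/png\timage/png; version=1.0\t2004\t102\n"
    "application/pdf\t"
    'application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\t'
    "application/pdf; version=1.2\t2004\t1\n"
)

def parse_profile(text):
    """Parse lines into (server, tika, droid, year, count) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        server, tika, droid, year, count = line.split("\t")
        rows.append((server, tika, droid, int(year), int(count)))
    return rows

def droid_counts_by_year(rows):
    """Total the count column per (year, DROID-B type)."""
    totals = Counter()
    for _server, _tika, droid, year, count in rows:
        totals[(year, droid)] += count
    return totals

if __name__ == "__main__":
    for key, total in sorted(droid_counts_by_year(parse_profile(SAMPLE)).items()):
        print(key, total)
```

Grouping on the Tika or server column instead only requires changing which field is used as the key.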
  9. Results: COMPARING TOOLS
  10. Coverage & Depth
      [Chart: percentage of resources unidentified (axis ticks 0%, 1%, 10%, 100%) by year, 1996 to 2010, for DROID-B v.59 and Apache Tika 1.1]
      • No format-version-level information from Apache Tika.
  11. Inconsistencies
      • Gaps
        • 37 formats spotted by DROID-B but not Tika
        • Notably includes earlier Office formats
        • 129 formats spotted by Tika but not DROID-B
        • But at least 20 are due to not using the full DROID
      • Conflicts
        • Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
        • ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
        • DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…
      • Both tools bad at non-HTML/XML text formats
        • CSS, scripting languages like JS, CSV, TSV, etc.
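
The "gaps" on slide 11 amount to set differences between the formats each tool reported. The sketch below compares at the type/subtype level, which is an assumption since the slides do not say at which level the counts of 37 and 129 were made. The sample type lists and function names are invented for illustration and are not results from the study.

```python
def base_type(extended_mime):
    """Drop parameters such as version, keeping only 'type/subtype'."""
    return extended_mime.split(";", 1)[0].strip().lower()

def gaps(tika_types, droid_types):
    """Return (only_droid, only_tika) as sorted lists of base types."""
    tika = {base_type(t) for t in tika_types}
    droid = {base_type(t) for t in droid_types}
    return sorted(droid - tika), sorted(tika - droid)

# Invented example inputs, not taken from the dataset.
TIKA_SEEN = ["text/css", "image/png", "application/pdf"]
DROID_SEEN = ["image/png; version=1.0", "application/pdf; version=1.4", "image/x-pict"]

if __name__ == "__main__":
    only_droid, only_tika = gaps(TIKA_SEEN, DROID_SEEN)
    print("DROID-B only:", only_droid)
    print("Tika only:", only_tika)
```

Conflicts, where both tools identify the same resource differently, need per-resource pairs and cannot be recovered from these per-tool sets alone.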
  12. Results: FORMATS OVER TIME
  13. Image Formats Over Time
      [Chart: percentage of crawl (axis ticks from 0.00001% to 100%) by year, 1996 to 2010, for JPEG, GIF, PNG, ICON, XBM and TIFF]
  14. HTML Versions Over Time
      [Chart: percentage of HTML resources (0% to 100%) by year, 1996 to 2010, for HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01 and XHTML 1.0]
  15. PDF Versions Over Time
      [Chart: percentage of PDF resources (0% to 100%) by year, 1996 to 2010, for PDF versions 1.0 to 1.6]
  16. Format Usage Versus Time
      [Chart: number of resources in archive (axis ticks from 1 to 10,000,000,000) against timespan in years (0 to 18)]
  17. Results: IMPLEMENTATIONS
  18. PDF Software Over Time
      [Chart: percentage of PDF resources (0% to 100%) by year, 1996 to 2010; labelled series: Acrobat, Acrobat PDFWriter, Acrobat Distiller]
      • Over 2100 Distinct PDF Software IDs
  19. JPEG Hardware Over Time
      [Chart: percentage of hardware IDs (0% to 100%) by year, 1994 to 2010; labelled series: NIKON D40, MX1700, E990, DS5, CYBERSHOT]
      • Over 2100 Distinct JPEG Hardware IDs
  20. Formats over Time: CONCLUSIONS
  21. Summary
      • Format obsolescence is complex
        • Network effects do appear to stabilize formats
        • But once-popular formats are fading nevertheless
        • More sophisticated approach required
      • Please re-use our data, or ask for more
      • Firmer conclusions need:
        • Richer, more detailed results
        • From a wider range of corpora
      • This approach only gives creator information
        • A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)
  22. Questions? webarchive.org.uk
