Revisiting File Formats for Digitization. Steven T. Puglia, Digital Conversion Services Manager, Office of Strategic Initiatives, Library of Congress, 101 Independence Ave, SE, Washington, DC 20540, USA. Phone: 202-707-5726. Email: spug@loc.gov
In general, within the digital library community, format and compression recommendations for master and derivative image files remain based on older perspectives regarding digitization, digital preservation, and IT/network/web technologies.
Recommended Data Formats for Preservation Purposes in the Florida Digital Archive http://fclaweb.fcla.edu/uploads/Lydia%20Motyka/FDA_documentation/recFormats.pdf
2. Further resolved, that such images are of sufficient quality to serve as preservation images for books which are:
   Found in the stacks, not in the rare book room.
   Likely to remain available somewhere in physical form.
3. Further resolved, that such images are of comparable or superior quality to accepted preservation approaches such as microfilm.
4. Further resolved, that cost matters in digital library image conversion projects, even though it is other people's money.
With a final nod to the improvements proposed for JPEG 2000, Sharpe argued that at a minimum, the library and archival community should not close the door on the use of visually lossless compression.
Rise of Information in the Digital Age http://www.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.html?sid=ST2011021100514
Really big data: The challenges of managing mountains of information, by John Brandon, October 18, 2011 http://www.computerworld.com/s/article/9220504/Really_big_data_The_challenges_of_managing_mountains_of_information The Library of Congress processes 2.5 petabytes of data each year, which amounts to 40TB per week. Thomas Youkel, group chief of enterprise systems engineering at the Library, estimates the data load will quadruple in the next few years as the Library continues to carry out its dual mandates to serve up data for historians and preserve information in all its forms.
Andy Jackson, The British Library http://www.openplanetsfoundation.org/blogs/2011-01-12-format-obsolescence-and-sustainable-access This means that the long-term cost of preserving our collection scales not only with the size of the files, but also rises as the number of formats we are required to support is increased.
David Rosenthal, Stanford University http://blog.dshr.org/2011/03/how-few-copies.html Compression reduces the redundancy within a single copy and increases the risk of damage.  There are also techniques that increase the redundancy within a single copy and reduce the risk.
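The effect Rosenthal describes can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: it uses Python's general-purpose `zlib` compressor rather than any image format, the gradient payload is a hypothetical stand-in for image data, and a stream that fails to decode is counted as a total loss.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def bytes_damaged(original: bytes, recovered: bytes) -> int:
    """Count positions that differ, plus any difference in length."""
    differing = sum(a != b for a, b in zip(original, recovered))
    return differing + abs(len(original) - len(recovered))


def mean_damage_compressed(payload: bytes) -> float:
    """Average bytes lost per single-bit flip in a zlib-compressed copy."""
    stored = zlib.compress(payload)
    total = 0
    for bit in range(len(stored) * 8):
        try:
            recovered = zlib.decompress(flip_bit(stored, bit))
            total += bytes_damaged(payload, recovered)
        except zlib.error:
            # The stream no longer decodes or fails its checksum;
            # count the whole payload as unreadable (an assumption).
            total += len(payload)
    return total / (len(stored) * 8)


# Hypothetical stand-in for uncompressed image data: a smooth gradient.
payload = bytes((x // 16) % 256 for x in range(4096))

# In an uncompressed copy, one flipped bit damages exactly one byte.
raw_damage = bytes_damaged(payload, flip_bit(payload, 12345))
compressed_damage = mean_damage_compressed(payload)
```

Comparing `raw_damage` with `compressed_damage` shows how the same one-bit fault spreads further once redundancy has been removed. Formats with built-in resiliency features may behave differently from this generic compressor.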
Erik Hetzner, California Digital Library http://groups.google.com/group/digital-curation/msg/b487a1b0188f9c0c I see no reason to store, as a matter of policy, uncompressed files on our disks. In fact, I think we should be more aggressive about compressing files. (Hetzner focuses on lossless compression.)
Erik Hetzner, California Digital Library http://groups.google.com/group/digital-curation/msg/b487a1b0188f9c0c Even without error correcting codes, I don’t think the arguments for storing uncompressed data only as a matter of policy are strong at all.  When we take error correcting codes into account, not compressing your data as a policy in order to keep a higher level of redundancy seems like the worst way to increase the redundancy of the data.  Smart people have figured out how to make codes which can reliably correct limited errors in bytestreams. Why not use them?
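To make the idea of an error correcting code concrete, here is a minimal sketch of the classic Hamming(7,4) code, which stores 4 data bits in 7 bits and can correct any single flipped bit. It is a teaching example only; the function names are illustrative, and production storage systems would more likely rely on stronger codes such as Reed-Solomon.

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode a 4-bit value as a 7-bit Hamming codeword.

    Bit layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = ((nibble >> shift) & 1 for shift in (3, 2, 1, 0))
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: list[int]) -> int:
    """Correct up to one flipped bit and return the original 4-bit value."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome spells out the 1-based position of the bad bit (0 = none).
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1
    d1, d2, d3, d4 = c[2], c[4], c[5], c[6]
    return (d1 << 3) | (d2 << 2) | (d3 << 1) | d4


# Usage: damage one bit of the stored codeword, then recover the value.
stored = hamming74_encode(0b1011)
stored[4] ^= 1  # simulate a single-bit storage error
recovered = hamming74_decode(stored)
```

The code adds three check bits per four data bits, which is less overhead than keeping a second full copy, and unlike a plain duplicate it also identifies which bit is wrong.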
Data corruption is and will remain a problem, and overcoming it will be an active part of digital preservation. The LOCKSS concept includes one approach for dealing with the problem: “…the bits and bytes are continually audited and repaired…to protect fragile digital content for the very long time.” http://www.eecs.harvard.edu/~mema/publications/SOSP2003.pdf LOCKSS now has a 12-year track record.
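The audit-and-repair idea can be sketched in a few lines. This is not the LOCKSS protocol, which uses a more elaborate polling scheme among peers; it is a simplified majority vote over SHA-256 digests, and the site names and file contents are hypothetical.

```python
import hashlib
from collections import Counter


def audit_and_repair(replicas: dict[str, bytes]) -> dict[str, bytes]:
    """Compare replicas by SHA-256 digest and overwrite minority copies
    with the content held by a strict majority."""
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in replicas.items()}
    winning_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority agreement; cannot repair automatically")
    good_copy = next(data for name, data in replicas.items()
                     if digests[name] == winning_digest)
    return {name: good_copy for name in replicas}


# Hypothetical example: three sites hold the same file; one copy has decayed.
replicas = {
    "site_a": b"page-0001 image data",
    "site_b": b"page-0001 image dat\x00",
    "site_c": b"page-0001 image data",
}
repaired = audit_and_repair(replicas)
```

Running such an audit on a schedule is what turns storage into a managed environment: damage is found and repaired while good copies still exist, whether or not the files are compressed.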
If image files are being brought into a managed environment, compression, particularly lossless compression, is much less of a concern. Conversely, if images are being stored on DVDs on a shelf, then compression raises the risks significantly.
One option for file format and compression (lossless and lossy) - JPEG 2000
Many organizations still face barriers to adopting JPEG 2000 (limited open source tools), along with concerns about potential risks (corruption and possible legal issues). These issues have been acknowledged within the broader cultural heritage digitization community.
A number of research studies have examined the robustness of JPEG 2000, and they report similar results regarding susceptibility to corruption. Nevertheless, organizations have concluded that JPEG 2000 is an appropriate file format choice from a robustness perspective; Buonora and Liberati, for example, “conclude that JPEG 2000 is a good current solution for our digital repositories.” A Format for Digital Preservation of Images, by Buonora and Liberati, http://www.dlib.org/dlib/july08/buonora/07buonora.html
It is worth noting that the format includes some “resiliency” elements that add robustness and thereby counteract some effects of data loss. These resiliency elements are described in the notes at the bottom of the Sustainability of Digital Formats – Planning for Library of Congress web page (http://www.digitalpreservation.gov/formats/fdd/fdd000138.shtml).
Wellcome Library http://jpeg2000wellcomelibrary.blogspot.com/2010/06/we-need-how-much-storage.html In 2009, the Wellcome Library set out an ambitious vision to digitise a large proportion of its historic collections. This would take the annual digitisation activities of the Library from hundreds, or at most, thousands of images per year to several million images per year. … we realised this could see the generation of up to 30m images over 5 years. Exciting, but perhaps slightly daunting, considering we didn't yet have an infrastructure to fully support such a large collection of digital assets.
Wellcome Library: Anyone reading this blog will understand why the scale of the programme is key to the blog topic. When we asked our IT department to tell us how much it would cost to store 30m TIFF files - our de facto standard for the couple hundred thousand images in our existing picture library - we were stunned. Two petabytes of online, spinning disk storage with a top-of-the-line enterprise management system and remote backup would cost how much? We learned that the cost would be something like a fifth of our total budget for the entire digitisation programme.
Wellcome Library: Should we consider a lower-cost storage solution? Even tape back-up was quite expensive for that scale, and you can't serve images up online from tape anyway. We revised our image sizes, factoring in smaller and smaller resolutions and/or bit depths for material like the printed books, which didn't need full colour, high resolution images. We still couldn't afford the storage costs. Finally, we saw the light and started looking into a relatively new image format called JPEG 2000.
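The scale of the Wellcome figures is easier to grasp with a little arithmetic. The sketch below assumes decimal units and treats the two petabytes as raw file size, which may overstate the per-image average if that figure included overhead. The compression ratio is purely hypothetical and does not come from the Wellcome blog.

```python
def average_image_mb(total_bytes: float, image_count: int) -> float:
    """Average size per image in megabytes (decimal, 1 MB = 10**6 bytes)."""
    return total_bytes / image_count / 1e6


def storage_tb(image_count: int, avg_image_mb: float,
               compression_ratio: float = 1.0) -> float:
    """Total storage in terabytes after applying a compression ratio."""
    return image_count * avg_image_mb * 1e6 / compression_ratio / 1e12


images = 30_000_000        # "up to 30m images over 5 years"
tiff_total_bytes = 2e15    # "two petabytes", read as decimal petabytes

avg_mb = average_image_mb(tiff_total_bytes, images)  # roughly 66.7 MB each

# Hypothetical ratio chosen only to show how the total scales.
hypothetical_ratio = 4.0
compressed_tb = storage_tb(images, avg_mb, hypothetical_ratio)
```

Because the total scales inversely with the ratio, even a modest ratio changes the storage budget by a large absolute amount at this volume.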
It is quite possible that mass digitization efforts produce more digital images saved as JPEG 2000 files than in any other file format. Despite concerns and a clear need for organizational support in implementing JPEG 2000, far more cultural heritage organizations are using JPEG 2000 for digitization than most people realize.
Conclusions: There is not a single answer to the question of file format for raster image files produced by digitization projects. There are a number of file formats worthy of consideration – suitable from technical, sustainability, fiscal, and other perspectives. Compression can represent a reasonable risk for appropriate efforts, and is likely a practical reality as digitization and digital preservation efforts scale.  Not using compression likely represents a real risk, particularly given the dramatic and continued growth in digital data.

When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

Puglia marac-file formats-20111020

  • 1. Revisiting File Formats for Digitization Steven T. Puglia Digital Conversion Services Manager Office of Strategic Initiatives Library of Congress 101 Independence Ave, SE Washington, DC 20540, USA Phone: 202-707-5726 Email: spug@loc.gov
  • 2.
  • 3. In general, within the digital library community, format and compression recommendations for master and derivative image files remain based on older perspectives regarding digitization, digital preservation, and IT/network/web technologies.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8. Recommended Data Formats for Preservation Purposes in the Florida Digital Archive http://fclaweb.fcla.edu/uploads/Lydia%20Motyka/FDA_documentation/recFormats.pdf
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15. 2. Further resolved, that such images are of sufficient quality to serve as preservation images for books which are: Found in the stacks, not in the rare book room. Likely to remain available somewhere in physical form. 3. Further resolved, that such images are of comparable or superior quality to accepted preservation approaches such as microfilm. 4. Further resolved, that cost matters in digital library image conversion projects, even though it is other people's money. With a final nod to the improvements proposed for JPEG 2000, Sharpe argued that at a minimum, the library and archival community should not close the door on the use of visually lossless compression.
  • 16.
  • 17.
  • 18. Rise of Information in the Digital Age http://www.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.html?sid=ST2011021100514
  • 19. Really big data: The challenges of managing mountains of information, by John Brandon, October 18, 2011 http://www.computerworld.com/s/article/9220504/Really_big_data_The_challenges_of_managing_mountains_of_information The Library of Congress processes 2.5 petabytes of data each year, which amounts to 40TB per week. Thomas Youkel, group chief of enterprise systems engineering at the Library, estimates the data load will quadruple in the next few years as the Library continues to carry out its dual mandates to serve up data for historians and preserve information in all its forms.
  • 20.
  • 21. Andy Jackson, The British Library http://www.openplanetsfoundation.org/blogs/2011-01-12-format-obsolescence-and-sustainable-access This means that the long-term cost of preserving our collection scales not only with the size of the files, but also rises as the number of formats we are required to support is increased.
  • 22.
  • 23. David Rosenthal, Stanford University http://blog.dshr.org/2011/03/how-few-copies.html Compression reduces the redundancy within a single copy and increases the risk of damage. There are also techniques that increase the redundancy within a single copy and reduce the risk.
  • 24.
  • 25. Erik Hetzner, California Digital Library http://groups.google.com/group/digital-curation/msg/b487a1b0188f9c0c I see no reason to store, as a matter of policy, uncompressed files on our disks. In fact, I think we should be more aggressive about compressing files. (Hetzner focuses on lossless compression.)
  • 26. Erik Hetzner, California Digital Library http://groups.google.com/group/digital-curation/msg/b487a1b0188f9c0c Even without error correcting codes, I don’t think the arguments for storing uncompressed data only as a matter of policy are strong at all. When we take error correcting codes into account, not compressing your data as a policy in order to keep a higher level of redundancy seems like the worst way to increase the redundancy of the data. Smart people have figured out how to make codes which can reliably correct limited errors in bytestreams. Why not use them?
  • 27. Data corruption is and will remain a problem, and overcoming it will be an active part of digital preservation. The LOCKSS concept includes one approach for dealing with the problem – “…the bits and bytes are continually audited and repaired…to protect fragile digital content for the very long time.” http://www.eecs.harvard.edu/~mema/publications/SOSP2003.pdf LOCKSS now has a 12-year track record.
  • 28.
  • 29. If image files are being brought into a managed environment, compression, particularly lossless compression, is much less of a concern. Conversely, if images are being stored on DVDs on a shelf, then compression raises the risks significantly.
  • 30. One option for file format and compression (lossless and lossy): JPEG 2000
  • 31. Barriers to adopting JPEG 2000 remain for many organizations (limited open source tools), as do concerns about potential risks (corruption and potential legal issues). These issues have been acknowledged within the broader cultural heritage digitization community.
  • 32. A number of research studies have been conducted on the robustness of JPEG 2000. Studies have seen similar results in terms of susceptibility to corruption. Nevertheless, organizations have concluded that JPEG 2000 is an appropriate file format choice from a robustness perspective – “conclude that JPEG 2000 is a good current solution for our digital repositories.” A Format for Digital Preservation of Images by Buonora and Liberati http://www.dlib.org/dlib/july08/buonora/07buonora.html
  • 33. It is worth noting that the format includes some “resiliency” elements that add robustness and thereby counteract some effects of data loss. These resiliency elements are described in the notes at the bottom of the Sustainability of Digital Formats – Planning for Library of Congress web page (http://www.digitalpreservation.gov/formats/fdd/fdd000138.shtml).
  • 34. Wellcome Library http://jpeg2000wellcomelibrary.blogspot.com/2010/06/we-need-how-much-storage.html In 2009, the Wellcome Library set out an ambitious vision to digitise a large proportion of its historic collections. This would take the annual digitisation activities of the Library from hundreds, or at most, thousands of images per year to several million images per year. … we realised this could see the generation of up to 30m images over 5 years. Exciting, but perhaps slightly daunting, considering we didn't yet have an infrastructure to fully support such a large collection of digital assets.
  • 35. Wellcome Library: Anyone reading this blog will understand why the scale of the programme is key to the blog topic. When we asked our IT department to tell us how much it would cost to store 30m TIFF files - our de facto standard for the couple hundred thousand images in our existing picture library - we were stunned. Two petabytes of online, spinning disk storage with a top-of-the-line enterprise management system and remote backup would cost how much? We learned that the cost would be something like a fifth of our total budget for the entire digitisation programme.
  • 36. Wellcome Library: Should we consider a lower-cost storage solution? Even tape back-up was quite expensive for that scale, and you can't serve images up online from tape anyway. We revised our image sizes, factoring in smaller and smaller resolutions and/or bit depths for material like the printed books, which didn't need full colour, high resolution images. We still couldn't afford the storage costs. Finally, we saw the light and started looking into a relatively new image format called JPEG 2000.
  • 37.
  • 38.
  • 39. It is quite possible that more digital images produced by mass digitization efforts are saved as JPEG 2000 files than in any other file format. Despite concerns and a clear need for organizational support in implementing JPEG 2000, far more cultural heritage organizations are using JPEG 2000 for digitization than most people realize.
  • 40.
  • 41. Conclusions: There is not a single answer to the question of file format for raster image files produced by digitization projects. There are a number of file formats worthy of consideration – suitable from technical, sustainability, fiscal, and other perspectives. Compression can represent a reasonable risk for appropriate efforts, and is likely a practical reality as digitization and digital preservation efforts scale. Not using compression likely represents a real risk, particularly given the dramatic and continued growth in digital data.
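
The "audited and repaired" idea quoted on slide 27, applied to losslessly compressed files in a managed environment (slide 29), can be illustrated with a small sketch. This is not the LOCKSS protocol. It is a simplified majority vote over SHA-256 digests of three replicas, and the names `digest`, `flip_bit`, and `audit_and_repair`, along with the stand-in "page" data, are hypothetical choices made for illustration only.

```python
import hashlib
import zlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as the fixity value."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Simulate storage damage by flipping one bit (illustrative helper)."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def audit_and_repair(replicas: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Compare replica digests and overwrite minority copies with the majority copy.

    Returns the repaired replicas and the indexes that had to be repaired.
    Assumption: a strict majority of replicas is still intact.
    """
    digests = [digest(r) for r in replicas]
    majority, votes = Counter(digests).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no strict majority; manual intervention needed")
    good_copy = replicas[digests.index(majority)]
    bad = [i for i, d in enumerate(digests) if d != majority]
    repaired = [good_copy if i in bad else r for i, r in enumerate(replicas)]
    return repaired, bad


# Hypothetical stand-in for a losslessly compressed master image.
page = b"".join(b"scan line %d\n" % i for i in range(2000))
stored = zlib.compress(page)

# Three replicas, one of which suffers a single flipped bit.
replicas = [stored, flip_bit(stored, len(stored) // 2), stored]
repaired, bad = audit_and_repair(replicas)

print("damaged replicas:", bad)
print("all replicas restored:", all(zlib.decompress(r) == page for r in repaired))
```

The design point is that redundancy comes from the extra copies and the routine audit, not from leaving each copy uncompressed. A file sitting on a single DVD on a shelf has neither, which is where the slides place the higher risk.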