20140410 ifla digitization workshop [idlc kuala lumpur]


Newspapers digitization workshop at the 2014 International Digital Libraries Conference in Kuala Lumpur April 10, 2014.

  1. 1. Newspaper digitization Frederick Zarndt IFLA Newspapers Section @cowboyMontana hashtag #IFLAnewspaper
  2. 2. the agenda 10.30 Morning tea break 1. Introductions 2.Review of the OAIS reference model 3.Newspaper digitization programs 4. Selection of materials 5. Importance of standards 6.Project management 7. Digitization workflow 7.1. Images 7.2. Metadata 7.3. File formats 8.Digitization workflow demonstration with docWorks 9. Quality assurance and acceptance criteria 10. Tools for digitization, workflow, digital preservation, and project management 11. Digital preservation considerations 12.Wrap-up 13.00 Lunch 15.30 Afternoon tea break
  3. 3. An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. Wikipedia contributors, “Open Archival Information System," Wikipedia, The Free Encyclopedia, https:// (accessed March 2014).
  4. 4. Open Archival Information System (OAIS) reference model • Negotiate for and accept appropriate information from information Producers. • Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation. • Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided. • Ensure that the information to be preserved is Independently Understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information. • Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the original, or as traceable to the original. • Make the preserved information available to the Designated Community. Wikipedia contributors, “Open Archival Information System," Wikipedia, The Free Encyclopedia, https:// (accessed March 2014).
  5. 5. Open Archival Information System (OAIS) reference model
  6. 6. programs
  7. 7. National Collaborative Individual programs
  8. 8. national programs national: centrally funded and managed programs with several participants. strict standards. • National Digital Newspaper Program (Library of Congress) • Australian Newspaper Digitisation Program programs
  9. 9. cooperative programs cooperative: organizations collaborate to achieve a common goal but digitization programs are managed separately. flexible standards. • Europeana newspapers • Digital Public Library of America programs
  10. 10. individual programs individual: organization digitizes on its own. may or, more usually, does not follow open standards. all commercial organizations. • ProQuest Historical Newspapers • • Newsbank • many others… programs
  11. 11. programs • digitization program requires careful thought • must be adapted to local circumstances • ask those who have gone before • join the IFLA Newspapers Section! (ask me how) Image courtesy of Donald Zolan.
  12. 12. ? programs ? Discussion questions 1. Has your organization already begun to digitize newspapers? How is the digitization program organized and funded? 2. If your organization hasn’t yet begun to digitize newspapers, what type of digitization program would best suit your organization / state / country? Why?
  13. 13. Experience is that marvelous thing that enables you to recognize a mistake when you make it again. F. P. Jones
  14. 14. selection
  15. 15. reasons for digitization newspapers are deteriorating microfilm is dissolving no storage space selection
  16. 16. access • Who are your users? Do you know? • Can you ask them what they expect from a digital newspaper collection? Can you trust their answers? • Trove, Papers Past, Cambridge Public Library, CDNC: These digital newspaper collections are used mostly by people 50+ years old and with an interest in family history. ? selection
  17. 17. Library of Congress selection criteria for the National Digital Newspaper Program (NDNP) selection ! • Image quality • Intellectual content • Refinements
  18. 18. selection for NDNP Image quality All NDNP newspaper images are scanned from microfilm. 1. Microfilm should be produced from properly prepared unbound originals. 2. Microfilm reduction ratio should be less than 20x. This allows 400dpi images to be scanned from the film. 3. Variations in microfilm density within and between images should be no more than 0.2. 4. Negative microfilm duplicated for scanning should have resolution test patterns readable at 5.0 or higher. For camera master microfilm without resolution test charts, resolution can be estimated by comparison to film with resolution test charts and original material. selection
  19. 19. selection for NDNP Intellectual content ! 1. Newspaper title reflects the political, economic and cultural history of the State. 2. Selected newspaper titles should ensure broad geographical coverage. 3. Newspaper titles that provide coverage of a geographic area or a group over long time periods are preferred over short lived titles or titles with significant gaps. selection
  20. 20. selection for NDNP Selection criteria refinements 1. Orphan titles: Special consideration should be given to high research value titles that have ceased publication and lack active ownership. 2. Newspaper titles that document a significant (minority) community at the state or regional level may be given special consideration. 3. Newspapers that have already been digitized by other organizations (for example, ProQuest) should not be digitized again. selection
  21. 21. selection for ANDP National Library of Australia collection managers in consultation with staff from Preservation Services nominate materials for digitization. The Library works closely with state and territory libraries to systematically digitise newspapers held in these libraries. Selected newspapers include those with • Cultural and/or historical significance • Uniqueness and/or rarity of the material • Copyright status or permission to digitise obtained • Material in high demand • Material at risk because of its physical condition selection
  22. 22. copyright Most newspaper titles selected for digitization are out of copyright and in the public domain. Negotiating use rights is quite simply too much trouble and fraught with legal pitfalls. Copyright laws and policies vary considerably between countries. selection
  23. 23. 23 …however… Digitization and public access to in-copyright newspapers is not impossible. selection
  29. 29. ? selection ? Discussion questions 1. Has your organization already selected newspapers to digitize? Why did it choose the titles that were selected? Please answer (hypothetically) if your organization hasn’t begun a newspapers digitization program. 2. Why would or why wouldn’t your organization select in-copyright newspapers to digitize?
  30. 30. 30 importance of standards
  31. 31. open standards • Availability : Open standards are available for all to read and implement. • Maximize end-user choice : Open standards create a fair, competitive market for implementation of the standards. They do not lock the customer into a particular vendor or group. • No royalty : Open standards are free for all to implement, with no royalty or fee. • No discrimination : Open standards and the organizations that administer them do not favor one implementor over another for any reason other than the technical standards compliance of a vendor's implementation. • Extension or subset : Implementations of open standards may be extended, or offered in subset form. However, certification organizations may decline to certify subset implementations, and may place requirements upon extensions. • Predatory practices : Open standards may employ license terms that protect against subversion of the standard by embrace-and-extend tactics. The licenses attached to the standard may require the publication of reference information for extensions, and a license for all others to create, distribute and sell software that is compatible with the extensions. An open standard may not otherwise prohibit extensions. importance of standards Adapted from FOSS Open Standards.
  32. 32. open standards • Not restrictive : Less chance of being locked in by a specific technology and/or vendor. • Interoperable : Easier for systems from different parties or using different technologies to interoperate and communicate with one another. • Protection against obsolescence : Better protection of the data files created by an application against obsolescence. • Portable : Applications / data are easier to port from one platform to another since they follow known guidelines and rules, and the interfaces are known. importance of standards Adapted from FOSS Open Standards.
  33. 33. newspapers and standards What standards are important for newspaper digitization? ! • METS XML is an open standard administered by the METS editorial board. See • ALTO XML is an open standard administered by the ALTO editorial board. See • Various image file formats including TIFF, JPEG, JPEG2000. • PDF/A is a portable document format developed by Adobe. It is a subset of the complete PDF specification and has been adopted by ISO as a standard. See • Various library metadata standards including, but not limited to • MODS XML • Dublin Core • PREMIS importance of standards
  34. 34. importance of standards with few exceptions libraries use METS XML + ALTO XML + image files (TIFF, JPEG2000) for newspaper digitization programs importance of standards
  35. 35. proprietary standards Olive ActivePaper Archive stores historical newspaper data in an XML format that is as capable as METS/ALTO XML but is not an open standard. Early versions of WordPerfect (MS Word too) stored data in a proprietary format, not in an open standard like Open Document Format (ODF). WordPerfect or special software is needed to view the files. Adobe’s Flash is a de facto but not an open standard. Flash now appears to be on a path to obsolescence, destined to be replaced by HTML5. importance of standards
  36. 36. ?importance of standards ? Discussion questions 1. Name a few standards that you use every time you connect to the Internet. 2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use?
  37. 37. In theory, there's no difference between theory and practice, but in practice, there is. ! Anonymous
  38. 38. project management
  39. 39. From the Standish Group’s 2012 Chaos Report on IT Project Failure. project management
  40. 40. high cost of IT failure Roger Sessions estimates that the worldwide cost of IT failure is USD $500 billion per month Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple Architectures for Complex Enterprises and many articles. He is a founding member of the Board of Directors of the International Association of Software Architects. 40 project management
  41. 41. in a recent survey of 1230 IT professionals conducted by Embarcadero Technologies, 2 of the 3 biggest project challenges cited by the IT pros are “poor planning” and “poor or no requirements” 41 plan! project management
  42. 42. in a March 2007 web poll conducted by the Computing Technology Industry Association "nearly 28 percent of the more than 1,000 respondents singled out poor communications as the number one cause of project failure" 42 communicate! project management
  43. 43. A recent survey of 752 IEEE members conducted by IEEE Spectrum and The New York Times discovered that "just 9 percent of 133 respondents whose organizations currently offshore R&D reported 'No problem'. The biggest headache was 'Language, communication, or culture' barriers, as reported by 54.1 percent of respondents." ( 43 communicate! project management
  44. 44. In their 2009 book Cultural Intelligence: Living and Working Globally, Thomas and Inkson say “Although we increasingly cross boundaries and surmount barriers to trade, migration, travel, and the exchange of information, cultural boundaries are not so easily bridged. Unlike legal, political, or economic aspects of the global environment, which are observable, culture is largely invisible. Therefore, culture is the aspect of the global context that is most often overlooked.” 44 communicate! project management
  45. 45. plan! In a white paper written for Project Perfect, Taimour al Neimat lists the primary causes for the failure of complex IT projects: • poor planning • unclear goals and objectives • objectives changing during the project • unrealistic time or resource estimates • lack of executive support and user involvement • failure to communicate and act as a team • inappropriate skills. Source: Taimour al Neimat. Why IT Projects Fail. The PROJECT PERFECT White Paper Collection. Oct 2005. info_it_projects_fail.pdf accessed Mar 2014. project management
  46. 46. typical tender evaluation criteria in priority order ! 1. understanding of requirements 2. reputation of service bureau 3. price 46 requirements? project management
  47. 47. incomplete requirements requirements in recent tender from an (anonymous) government agency somewhere in the world ! • project to convert ~ 170,000 text images to xml • value of project ~ USD $180,000 • 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc • technical requirements description? < 1 page • data acceptance criteria? “a high level of accuracy” 47 project management
  48. 48. complete requirements Library of Congress JPEG2000 profile 48 project management
  49. 49. poor planning: a recent newspapers digitization program established by a prominent national library • digitize more than 20 million text pages • high level image and xml requirements • value of work awarded? > USD $5,000,000 • after award of work, technical requirements expand to 43+ pages from ~3 pages • acceptance criteria? added as an afterthought and not well defined project management
  50. 50. the value of simplicity “There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.” ! C.A.R. Hoare Professor Sir Charles Anthony Richard Hoare Emeritus Professor at Oxford University, Senior Researcher at Microsoft Research, recipient of the ACM Turing Award, author of many books on computers and software. project management
  51. 51. • unitary: the requirement addresses one and only one thing • complete: the requirement is fully stated in one place with no missing information • consistent: the requirement does not contradict any other requirement and is fully consistent with all authoritative external documentation • atomic: it does not contain conjunctions, for example, "the code field must validate American and Canadian postal codes" should be written as two separate requirements project management good requirements
  52. 52. ! • traceable: the requirement meets all or part of a business need as stated by stakeholders and authoritatively documented • current: the requirement has not been made obsolete by the passage of time • feasible: the requirement can be implemented within the constraints of the project • unambiguous: the requirement is concisely stated without recourse to technical jargon, acronyms • verifiable: the implementation of the requirement can be determined through one of four possible methods: inspection, demonstration, test, or analysis project management good requirements
  53. 53. 53 project management
  54. 54. simple principles for (good) communication • be impeccable with your word • don’t take anything personally • don’t make assumptions • always do your best • be mindful
  55. 55. why (better) communication is necessary no communication ... little communication ... poor communication ... reduced communication ... ... all result in more assumptions about intent!
  56. 56. The single biggest problem with communication is the illusion that it has taken place. George Bernard Shaw, 1925 Nobel Prize in Literature.
  57. 57. project management “projects are about communication, communication, and communication” Elenbaas, B. Staging a Project: Are You Setting Your Project Up for Success? Proceedings of the Project Management Institute Annual Seminars & Symposiums. 2000.
  58. 58. the value of prototypes / pilots “Plan to throw one away; you will anyhow. If there is anything new about the function of a system, the first implementation will have to be redone completely to achieve a satisfactory (i.e., acceptably small, fast, and maintainable) result. It costs a lot less if you plan to have a prototype.” ! Butler Lampson Butler Lampson was a founding member of Xerox PARC, worked for DEC, and now works at Microsoft Research. He is an adjunct professor at MIT and an ACM Fellow. project management
  59. 59. implement: pilot create requirements and acceptance criteria repeat { digitize (small) pilot batch test data against acceptance criteria adjust requirements and acceptance criteria } until (no more adjustments are necessary) digitize more data pilot batches are VERY VERY important!! 59 project management
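The pilot loop on slide 59 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_pilot`, `PilotResult`, and the three `demo_*` callbacks are hypothetical names invented here, and the single "400 dpi" check stands in for whatever acceptance criteria a real project defines.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PilotResult:
    rounds: int                      # number of pilot batches that were run
    accepted: bool                   # True if the last batch needed no adjustments
    history: List[List[str]] = field(default_factory=list)  # problems found per round


def run_pilot(
    digitize_batch: Callable[[dict], list],
    find_problems: Callable[[list, dict], List[str]],
    adjust: Callable[[dict, List[str]], dict],
    requirements: dict,
    max_rounds: int = 5,
) -> PilotResult:
    """Repeat small pilot batches until one passes with no adjustments needed."""
    history: List[List[str]] = []
    for round_no in range(1, max_rounds + 1):
        batch = digitize_batch(requirements)           # digitize (small) pilot batch
        problems = find_problems(batch, requirements)  # test against acceptance criteria
        history.append(problems)
        if not problems:                               # no more adjustments necessary
            return PilotResult(round_no, True, history)
        requirements = adjust(requirements, problems)  # adjust requirements / criteria
    return PilotResult(max_rounds, False, history)


# Hypothetical callbacks: a three-page batch and one acceptance rule (>= 400 dpi).
def demo_digitize(req: dict) -> list:
    return [{"page": n, "dpi": req["dpi"]} for n in range(1, 4)]


def demo_find_problems(batch: list, req: dict) -> List[str]:
    return [f"page {p['page']}: below 400 dpi" for p in batch if p["dpi"] < 400]


def demo_adjust(req: dict, problems: List[str]) -> dict:
    return {**req, "dpi": 400}


result = run_pilot(demo_digitize, demo_find_problems, demo_adjust, {"dpi": 300})
print(result.rounds, result.accepted)  # prints "2 True"
```

The `max_rounds` cap is an addition of this sketch, not of the slide: it keeps a pilot that never converges from looping forever and makes the failure visible in `accepted`.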
  60. 60. reasons for in-house production ! • collection cannot be moved • collection is badly organized • digitization must be done slowly over a long period • digitization is very simple 60 project management implement: in-house
  61. 61. reasons for outsourced production ! • originals can’t be scanned in-house because… • equipment is too expensive • output data is beyond staff experience • labor is too expensive • large volume of work in a short time • insufficient space, infrastructure, or staff 61 project management implement: outsource
  62. 62. project management tools The project management tool one chooses should be intuitive, easy to use, and accessible to all. If it isn’t, many will avoid / refuse / dislike / resent using it. ! • Discussion of project management tools at http:// software • List of project management tools at http:// software project management
  63. 63. ? project management ? Discussion questions 1. What project management practices does your organization follow? Why? 2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use? 3. What reasons, in addition to those already cited, would your organization have to digitize newspapers in-house or to outsource digitization?
  64. 64. “Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away.” Antoine de Saint-Exupéry
  65. 65. digitization workflow
  66. 66. digitization workflow ! • digital library: one or more digital collections
  67. 67. 67 digital library digitization workflow
  68. 68. digitization workflow ! • digital library: one or more digital collections • digital collection: organized group(s) of digital objects
  69. 69. 69 digital collection
  70. 70. digitization workflow ! • digital library: one or more digital collections • digital collection: organized group(s) of digital objects • digital object: a surrogate or digital copy of the original source document, for example, a newspaper issue
  71. 71. digital object
  72. 72. An example of what ALTO makes possible The Day book. (Chicago, Ill.), 29 Feb. 1912. Chronicling America: Historic American Newspapers. Lib. of Congress. <>
  73. 73. digitization workflow ! • digital library: one or more digital collections • digital collection: organized group(s) of digital objects • digital object: a surrogate or digital copy of the original source document, for example, a newspaper issue • metadata: data about data. information about a digital object(s) or a digital collection(s) or the original source document(s)
  74. 74. 74 metadata digitization workflow
  75. 75. • to enhance accessibility • to increase collaboration and cooperation between libraries and archives around the world • to promote research • to provide opportunities for entrepreneurs • other reasons? 75 why digitize newspapers? digitization workflow
  76. 76. Open Archival Information System (OAIS) reference model
  77. 77. digitization workflow
  78. 78. the digitization process produce digital objects ingest preserve access produce images access source images objects
  79. 79. the digitization process produce images source images
  80. 80. standard file formats • image file formats • TIFF • JPEG2000 • JPEG • GIF • text file formats • PDF, PDF/A, PDF/A-1b, PDF/A-1a • TEI XML • HTML • plain text • NITF / NewsML • metadata • METS • MODS / PREMIS / ALTO / MIX ... digitization workflow
  81. 81. ?image decisions ¿ • image production source materials • original documents: better quality, more expensive • microfiche: poorer quality, less expensive, microfiche quality varies • bit depth • black-and-white (bitonal) • greyscale • color • resolution • compression • no compression • lossless (reversible) • lossy (irreversible) • image metadata digitization workflow
  82. 82. image format comparison

      | Format (extensions) | Compression | Bit depth | Metadata | Color management | MIME type / UTI | Patent | 1st public release |
      |---|---|---|---|---|---|---|---|
      | JBIG (.jbig, .jbg) | lossless | 1-bit | no | no | | | 2000? |
      | JPEG (.jpg, .jpeg) | lossy, DCT, RLE, Huffman | 8-bit, 12-bit, 24-bit | yes | yes | image/jpeg, public.jpeg | no | 1992 |
      | JPEG2000 (.jp2) | many lossless and lossy compression algorithms | 8-bit, 16-bit, color to 48 bits | yes | yes | image/jp2, public.jpeg-2000 | yes but part 1 is patent free | 2000 |
      | TIFF (.tiff, .tif) | none, LZW, RLE, ZIP, other | 1, 2, 4, 8, 16, 24, 32 bits | yes | yes | image/tiff, public.tiff | no | 1986 |

      Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free Encyclopedia, (accessed August 1, 2012)
  83. 83. image compression comparison

      | Format and compression | The Sacred Heart Review (300dpi) | Los Angeles Star (300dpi) | Die Susquehanna Zeitung (600dpi) |
      |---|---|---|---|
      | TIFF (uncompressed) | 17.2 MB | 87 MB | 415.5 MB |
      | TIFF (lossless LZW compression) | 10.2 MB | 75.8 MB | 232.9 MB |
      | JPEG (maximum quality [lossless]) | 7.0 MB | 37.2 MB | 101.1 MB |
      | JPEG (medium quality) | 1.5 MB | 4.6 MB | 10.2 MB |
      | JPEG2000 (lossless compression) | 7.1 MB | 52.7 MB | 166.2 MB |
      | JPEG2000 (lossy [70] compression) | 5.1 MB | 37.1 MB | 116.7 MB |
      | JPEG2000 (lossy [30] compression) | 2.2 MB | 16.1 MB | 50.3 MB |
  84. 84. image bit depth comparison

      | Format and compression | USA case law image 1 (300dpi) | USA case law image 2 (300dpi) |
      |---|---|---|
      | TIFF 1-bit, CCITT G4 compression | 40 KB | 87 KB |
      | JPEG2000, W5x3 reversible compression | 2.6 MB | 3.6 MB |
      | JPEG2000, W9x7 irreversible compression | 647 KB | 1 MB |

  85. 85. GIGO GARBAGE IN, GARBAGE OUT Image courtesy of (accessed March 2014).
  86. 86. raw OCR text Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f Havodiv y d, Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last , Mr . Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. AsbtCnvHall, mar Lancaster, Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol " t h r o u g h I n s b e a d , 1 w h i c h instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week, newspaper image Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
  87. 87. ? digitization workflow ? Discussion topics 1. Assume your organization decides to digitize 1000 newspaper issues averaging 12 pages per issue. The images are scanned 2-up and average 80MB each. How much disk storage is needed for the images? 2. Now assume instead that your organization uses TIFF images with LZW (lossless) compression, which saves on average 40%. How much disk storage is needed for the images?
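One way to work through the storage questions on slide 87 is a small calculator. The sketch below rests on two readings of the slide that are assumptions: "scanned 2-up" is taken to mean two newspaper pages per image, and the 80 MB average is taken to be the size of each uncompressed image. The function name `image_storage_gb` and the convention 1 GB = 1000 MB are also choices made here for illustration.

```python
def image_storage_gb(issues: int, pages_per_issue: int, pages_per_image: int,
                     mb_per_image: float, savings: float = 0.0) -> float:
    """Estimate image storage in GB, using 1 GB = 1000 MB.

    savings is the fraction of space saved by compression (0.40 for 40%).
    """
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in the range [0, 1)")
    images = issues * pages_per_issue / pages_per_image  # 2-up: two pages per image
    return images * mb_per_image * (1.0 - savings) / 1000.0


uncompressed = image_storage_gb(1000, 12, 2, 80)
with_lzw = image_storage_gb(1000, 12, 2, 80, savings=0.40)
print(f"uncompressed: {uncompressed:.0f} GB")   # prints "uncompressed: 480 GB"
print(f"TIFF with LZW: {with_lzw:.0f} GB")      # prints "TIFF with LZW: 288 GB"
```

These figures cover one copy of the images only. Working copies, derivatives, backups, and the METS / ALTO files would add to the total.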
  88. 88. why (better) communication is necessary
  89. 89. the digitization process produce digital objects images objects
  90. 90. the digitization process images image objects processing layout analysis OCR metadata build digital objects
  91. 91. the digitization process images image objects processing layout analysis OCR metadata build digital objects • crop, de-skew, split images • apply image improvement algorithms as needed • sharpening filters • local adaptive thresholding • remove text bleed-thru • etc • create master images • create working images
  98. 98. the digitization process images image objects processing layout analysis OCR metadata build digital objects • analyze layout of text image • estimate font types and sizes • calculate coordinates of text blocks • determine layout object types (text, illustration, headline, etc)
  99. 99. newspaper text layout analysis
  100. 100. the digitization process images image objects processing layout analysis OCR metadata build digital objects • perform optical character recognition (OCR) • calculate word and character coordinates • calculate word and character confidences • apply language dictionaries • correct OCR text (optional)
  101. 101. the digitization process images image objects processing layout analysis OCR metadata build digital objects • populate metadata fields • verify / correct page numbers • verify / correct document structure
  102. 102. the digitization process images image objects processing layout analysis OCR metadata build digital objects • create METS / ALTO XML files • create image files and image metadata • create PDF files (if required) • verify digital object • calculate file fixity checks (checksums) • perform file validation and verification • perform quality assurance
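The "calculate file fixity checks (checksums)" step can be illustrated with Python's standard `hashlib` module. This is a minimal sketch, not the method used by docWorks or any particular program: the function names `file_checksum` and `verify_fixity` are hypothetical, and SHA-256 is an assumed choice of algorithm (the slides do not name one).

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks so large images fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: str, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare a file's current checksum with the value recorded when it was created."""
    return file_checksum(path, algorithm) == expected_hex.strip().lower()
```

In practice the checksum would be recorded at the time the digital object is built, for example in the METS or PREMIS metadata, and re-verified after each transfer and at intervals during preservation.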
  103. 103. real world digitization production workflow • automatic production steps performed by software ! • manual production steps performed by operators
  104. 104. digital library standards • METS XML for descriptive, structural, technical, and administrative metadata • descriptive metadata • Metadata Object Description Schema (MODS) selected metadata from MARC • Dublin Core fundamental group of text elements for describing and cataloging • technical metadata • ALTO for OCR text • PREMIS for digital preservation • MIX and ANSI/NISO Z39.87 for images
  105. 105. Metadata Encoding and Transmission Standard • METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library • METS files consist of 7 (optional) sections: header, descriptive, administrative, file map, structural map, structural link, and behavior • METS profiles describe a class of METS documents in sufficient detail to provide both document authors and programmers the guidance to create and process METS documents conforming with a particular profile • current version 1.9.1 • administered by METS editorial board (international group of volunteers) • standards hosted by Library of Congress at standards/mets/
  106. 106. METS file structure Graphic from Karin Bredenberg, Communicating Archival Metadata conference and workshops. Riksarkivet, 2011.
  107. 107. Metadata Object Description Schema • MODS is an XML schema for a bibliographic element set that may be used for library applications. Derivative of MARC 21 bibliographic format. Includes a subset of MARC fields, using language-based tags rather than numeric ones • Subset of MARC 21 • Mappings exist between MODS and MARC, Dublin Core, and RDA (conversion tools exist) • May be used in conjunction with METS XML • current version 3.4 • administered by Library of Congress Network Development and MARC Standards Office with help from interested users • standards hosted by Library of Congress at standards/mods/
  108. 108. MODS metadata in METS XML

       <mets:dmdSec ID="">
         <mets:mdWrap MDTYPE="MODS">
           <mets:xmlData>
             <mods:mods xmlns="">
               <mods:language>
                 <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
               </mods:language>
               <mods:genre>newspaper issue</mods:genre>
               <mods:originInfo>
                 <mods:dateIssued>18740425</mods:dateIssued>
               </mods:originInfo>
               <mods:relatedItem type="host">
                 <mods:titleInfo>
                   <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
                 </mods:titleInfo>
                 <mods:genre>newspaper</mods:genre>
                 <mods:identifier>ISSN18368190</mods:identifier>
                 <mods:part>
                   <mods:detail type="volume">
                     <mods:number>IX</mods:number>
                   </mods:detail>
                 </mods:part>
                 <mods:part>
                   <mods:detail type="issue">
                     <mods:number>12</mods:number>
                   </mods:detail>
                 </mods:part>
               </mods:relatedItem>
             </mods:mods>
           </mets:xmlData>
         </mets:mdWrap>
       </mets:dmdSec>
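To show how MODS metadata wrapped in METS might be read by software, here is a small sketch using Python's standard `xml.etree.ElementTree`. Several things in it are assumptions: the embedded sample is a shortened copy of the slide's record, the `ID="DMD1"` value is made up, and the namespace URIs are the usual METS and MODS ones supplied here because the slide transcript shows them as empty. The function name `issue_summary` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the slide transcript shows empty xmlns attributes.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

SAMPLE = """\
<mets:dmdSec xmlns:mets="http://www.loc.gov/METS/"
             xmlns:mods="http://www.loc.gov/mods/v3" ID="DMD1">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods>
        <mods:genre>newspaper issue</mods:genre>
        <mods:originInfo><mods:dateIssued>18740425</mods:dateIssued></mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:part><mods:detail type="volume"><mods:number>IX</mods:number></mods:detail></mods:part>
          <mods:part><mods:detail type="issue"><mods:number>12</mods:number></mods:detail></mods:part>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
"""


def issue_summary(xml_text: str) -> dict:
    """Pull title, date, volume, and issue number out of a METS-wrapped MODS record."""
    root = ET.fromstring(xml_text)
    mods = root.find(".//mods:mods", NS)
    if mods is None:
        raise ValueError("no mods:mods element found")

    def text(path: str):
        node = mods.find(path, NS)
        return node.text if node is not None else None

    part = "mods:relatedItem/mods:part/mods:detail[@type='{}']/mods:number"
    return {
        "title": text("mods:relatedItem/mods:titleInfo/mods:title"),
        "date": text("mods:originInfo/mods:dateIssued"),
        "volume": text(part.format("volume")),
        "issue": text(part.format("issue")),
    }


summary = issue_summary(SAMPLE)
print(summary["title"], summary["date"], summary["volume"], summary["issue"])
```

Missing elements come back as `None` rather than raising, which is convenient when records from different batches are not equally complete.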
  109. 109. Dublin Core metadata • Dublin Core is a set of vocabulary terms used to describe resources for the purposes of discovery. • Dublin Core metadata element set is endorsed in IETF RFC 5013, ISO 15836:2009, and NISO Z39.85 • Metadata terms last updated 14-Jun-2012 • May be used in conjunction with METS XML • Dublin Core Metadata Initiative (DCMI) is an open organization, incorporated as a public, not-for-profit company in Singapore • Dublin Core Metadata Initiative is hosted at http://
  110. 110. Analyzed Layout and Text Object ! • ALTO XML provides technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper • commonly used in conjunction with METS XML but may be used standalone • current version 2.1 • administered by ALTO editorial board (international group of volunteers) • standards hosted by Library of Congress at standards/alto/
  111. 111. <?xml version="1.0" encoding="UTF-8"?> <alto xmlns:xsi="" xsi:noNamespaceSchemaLocation="" xmlns:xlink=""> <Description> <MeasurementUnit>pixel</MeasurementUnit> <sourceImageInformation> <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName> </sourceImageInformation> </Description> <Styles> <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/> <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/> </Styles> <Layout> <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967"> <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/> <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/> <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/> <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/> <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194"> <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831"> <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831"> <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT"> <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75"> <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/> <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/> <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/> </TextLine> </TextBlock> </ComposedBlock> <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/> <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
<TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46"> <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/> <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/> <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/> </TextLine> </TextBlock> </ComposedBlock> </ComposedBlock> </PrintSpace> </Page> </Layout> </alto> Analyzed Layout and Text Object
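To make the ALTO structure above more concrete, here is a minimal sketch of reading words and their word-confidence (WC) values from an ALTO file with Python's standard library. The embedded sample is a shortened, well-formed fragment based on the slide (the slide's own XML is abridged), it assumes no XML namespace as in that example, and the function names `text_lines`, `line_text`, and `low_confidence_words` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed fragment modeled on the slide's example (an assumption,
# since the XML shown on the slide is abridged in the middle).
ALTO_SAMPLE = """<alto><Layout><Page ID="P1"><PrintSpace>
<TextBlock ID="P1_TB00004"><TextLine ID="P1_TL00065">
<String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98"/>
<SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
<String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""


def text_lines(alto_xml):
    """Return one list per TextLine, each holding (word, word_confidence) pairs."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iter("TextLine"):
        words = [(s.get("CONTENT"), float(s.get("WC", "nan")))
                 for s in line.iter("String")]
        lines.append(words)
    return lines


def line_text(words):
    """Join the words of one line into plain text."""
    return " ".join(word for word, _ in words)


def low_confidence_words(alto_xml, threshold):
    """List words whose WC value falls below the given threshold."""
    return [word for line in text_lines(alto_xml)
            for word, wc in line if wc < threshold]


print(line_text(text_lines(ALTO_SAMPLE)[0]))
```

A production file would normally declare an ALTO namespace, in which case the element names passed to `iter` would need the namespace prefix.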
  112. 112. Analyzed Layout and Text Object book
  113. 113. Analyzed Layout and Text Object newspaper
  114. 114. Preservation Metadata Implementation Strategies • PREMIS is a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use • In 2003 OCLC and RLG jointly sponsored the formation of the PREMIS working group, composed of international experts in the use of metadata to support digital preservation activities • PREMIS data dictionary current version 2.2 • May be used in conjunction with METS XML • PREMIS tools are freely available • PREMIS Maintenance Activity and Editorial Committee has international members from libraries and industry • PREMIS data dictionary is hosted at standards/premis/
  115. 115. PREMIS data in METS file <mets:amdSec> <mets:techMD ID="PREMISOBJECT1"> <mets:mdWrap MDTYPE="PREMIS"> <mets:xmlData> <premis:object xmlns:premis=""> <premis:objectIdentifier> <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType> <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue> </premis:objectIdentifier> <premis:objectCategory>file</premis:objectCategory> <premis:objectCharacteristics> <premis:format> <premis:formatDesignation> <premis:formatName>TIFF</premis:formatName> <premis:formatVersion>TIFF 6.0</premis:formatVersion> </premis:formatDesignation> </premis:format> </premis:objectCharacteristics> <premis:relationship> <premis:relationshipType>derivation</premis:relationshipType> <premis:relationshipSubType>is derivative of</premis:relationshipSubType> <premis:relatedObjectIdentification> <premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType> <premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue> <premis:relatedObjectSequence>0</premis:relatedObjectSequence> </premis:relatedObjectIdentification> <premis:relatedEventIdentification> <premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType> <premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue> <premis:relatedEventSequence>0</premis:relatedEventSequence> </premis:relatedEventIdentification> </premis:relationship> </premis:object> </mets:xmlData> </mets:mdWrap> </mets:techMD> </mets:amdSec>
  116. 116. digitization workflow
  117. 117. implement: software • commercial off-the-shelf (COTS) • open source • customized COTS • customized open source • custom in-house
  118. 118. ? digitization workflow ? Discussion topics 1. Assuming your organization will digitize historic newspapers, will it digitize the newspapers in-house or outsource digitization? Why? (If you don’t know, guesses and speculations are fine.) 2. Describe your organization’s current digitization workflow.
  119. 119. quality assurance and acceptance criteria
  120. 120. quality assurance and acceptance criteria Wikipedia on data quality: The processes and technologies involved in ensuring the conformance of data values to requirements and acceptance criteria quality assurance
  121. 121. • is the digital object complete? are all its components present? • is the digital object verifiable? • is the digital object uncorrupted? • do the components of the digital object conform to standards? • do the file names conform to project requirements? • does the directory structure conform to project requirements? • does the digital object metadata conform to project specifications? quality assurance automatic quality checks
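Two of the automatic checks listed above (file naming and completeness) can be sketched in a few lines. This is only an illustration: the naming rule, the required components per page, and the function name `check_issue` are hypothetical and would come from the project specification in practice.

```python
import re

# Hypothetical project rule: <title>-<yyyymmdd>-p<4-digit page>.<tif|xml>
NAME_RULE = re.compile(r"^[a-z]+-\d{8}-p\d{4}\.(tif|xml)$")

# Hypothetical completeness rule: every page needs an image and an ALTO file.
REQUIRED_EXTENSIONS = {"tif", "xml"}


def check_issue(file_names):
    """Return a list of problems found in the file names of one issue."""
    problems = []
    pages = {}
    for name in file_names:
        if not NAME_RULE.match(name):
            problems.append(f"bad file name: {name}")
            continue
        stem, ext = name.rsplit(".", 1)
        pages.setdefault(stem, set()).add(ext)
    # Report any page that lacks one of the required components.
    for stem in sorted(pages):
        for ext in sorted(REQUIRED_EXTENSIONS - pages[stem]):
            problems.append(f"missing component: {stem}.{ext}")
    return problems


print(check_issue(["qld-18660203-p0001.tif", "qld-18660203-p0001.xml"]))
print(check_issue(["qld-18660203-p0002.tif", "Scan 3.TIF"]))
```

An empty list means the issue passed these two checks; the remaining automatic checks (corruption, standards conformance, metadata validation) need their own tools.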
  122. 122. • does the digital object metadata meet accuracy specifications? • does the text meet accuracy specifications? • is the image quality satisfactory? • are article continuations correct? • is the text in reading order? quality assurance manual quality checks
  123. 123. what’s wrong with this? acceptance criteria for an English language digitization project at a large, well-known, and internationally recognized national library: character accuracy > 80%; word accuracy > 75%; significant word accuracy > 65% quality assurance
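One possible answer, not stated on the slide, is that the thresholds do not fit together. The sketch below assumes character errors are independent and evenly spread and that a word averages 5 characters (both simplifications); the function names are hypothetical.

```python
def expected_word_accuracy(char_accuracy, word_length=5):
    """Chance that every character of a word is correct, assuming independent errors."""
    return char_accuracy ** word_length


def required_char_accuracy(word_accuracy, word_length=5):
    """Character accuracy needed to reach a given word accuracy under the same assumption."""
    return word_accuracy ** (1 / word_length)


# With only 80% of characters correct, roughly a third of 5-letter words survive intact.
print(round(expected_word_accuracy(0.80), 3))

# Reaching 75% word accuracy would need character accuracy well above 80%.
print(round(required_char_accuracy(0.75), 3))
```

Under these assumptions a text that barely meets the character criterion would fall far short of the word criterion, so the character threshold is effectively meaningless.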
  124. 124. what’s wrong with this? project quality requirement: “a high level of accuracy”
  125. 125. what’s wrong with this? project quality requirement: “article titles must be 99.5% accurate”
  126. 126. what’s wrong with this? project quality requirement: “article title characters in each issue must be 99.5% accurate, that is, each issue may have no more than 5 errors in 1000 article title characters”
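A requirement phrased this precisely can be checked mechanically. The sketch below measures article title character accuracy as one minus the edit distance divided by the number of ground-truth characters; using edit distance as the error count is an assumption, since the slide does not say how errors are counted, and the function names are hypothetical.

```python
def edit_distance(truth, ocr):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(ocr) + 1))
    for i, ct in enumerate(truth, 1):
        curr = [i]
        for j, co in enumerate(ocr, 1):
            cost = 0 if ct == co else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def title_accuracy(pairs):
    """Character accuracy over all (ground_truth, ocr) title pairs of one issue."""
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    total = sum(len(truth) for truth, _ in pairs)
    return 1 - errors / total


def meets_requirement(pairs, threshold=0.995):
    """True if the issue has no more than 5 errors per 1000 title characters."""
    return title_accuracy(pairs) >= threshold


issue = [("The Queenslander.", "The Queenslander."), ("Proprietors.", "Propriet0rs.")]
print(title_accuracy(issue), meets_requirement(issue))
```

With only 29 title characters in the sample, a single wrong character already fails the 99.5% threshold, which shows why the requirement is stated per issue rather than per title.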
  127. 127. image quality • sharpness: the amount of detail an image can convey • noise: random variation of image density • dynamic range • contrast (gamma): the slope of the tone reproduction curve in a log-log space. high contrast usually involves loss of dynamic range (loss of detail, or clipping, in highlights or shadows). • vignetting: darkens images near the corners • artifacts: “leftovers” from sharpening or compression Wikipedia contributors, “Image quality," Wikipedia, The Free Encyclopedia, http:// (accessed March 2014). quality assurance
  128. 128. image quality “…images which are ultimately to be viewed by human beings, the only “correct” method of quantifying visual image quality is through subjective evaluation. in practice, however, subjective evaluation is usually too inconvenient, time-consuming and expensive…” “…best way to assess the quality of an image is to look at it because human eyes are the ultimate viewers of most images…” Zhou Wang, Alan Bovik, and Ligang Lu. Why is image quality assessment so difficult? IEEE Transactions on Image Processing. April 2004. Zhou Wang and Hamid R. Sheikh. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing. April 2004. quality assurance
  129. 129. acceptance criteria for the National Library of Australia NDP
  130. 130. ? quality assurance ? Discussion topics 1. How does your organization currently do quality assurance for digital data? 2. How much time / effort is given to writing quality assurance procedures and acceptance criteria for digitized data?
  131. 131. digitization tools
  132. 132. open source vs. commercial software: pros digitization tools Adapted from Open Gov Summit 2013. • acquisition : cost, development and implementation contract costs are likely to be lower than for proprietary software. less likely that there will be contractually-bound upgrade costs. total cost of ownership over the lifetime of usage must be taken into account • data transferability : with open source code and open data formats, there are greater opportunities to share data across interoperable platforms • re-use : open source is free from per user or per instance costs and there is a guaranteed freedom to use it in any way. re-use is enabled. open-source-solutions/
  133. 133. open source vs. commercial software: pros digitization tools Adapted from Open Gov Summit 2013. • cost effective : pay once, or not at all, for development and reuse where appropriate. • non-restrictive : open source licenses do not limit or restrict who can use the software, the type of user, or the areas of business in which the software can be used. provides a licensing model that enables rapid provisioning for both known and unanticipated users and for new use cases. • scalable : open source solutions are scalable upwards and downwards with a reduction in the risk of longer term financial implications. no license fees on a “per user” or “per box” basis. no redundant licenses. open-source-solutions/
  134. 134. open source vs. commercial software: pros digitization tools Adapted from Open Gov Summit 2013. • easy to prototype and adapt : open source software is particularly suitable for rapid prototyping and experimentation, where the ability to “test drive” the software with minimal costs and administrative delays can be important. (proprietary software suppliers may also provide the same through a ‘proof of concept’ phase at minimal or no cost.) open-source-solutions/
  135. 135. • support and maintenance costs : may outweigh those of the proprietary package and include ‘hidden’ commitments. • intellectual property rights : as code is modified and adapted, there may be legal risks concerning the code’s open source status and who owns the intellectual property rights of the modified code. • expertise : requires software installation and maintenance expertise. modification of open source code requires software development expertise. the organization must ensure that it has the right level of expertise to manage it effectively. digitization tools open source vs. commercial software: cons Adapted from Open Gov Summit 2013. open-source-solutions/
  136. 136. digitization tools a variety of open source and commercial off-the-shelf (COTS) software is available for digitization projects • easier for systems from different parties or using different technologies to interoperate and communicate with one another • better protection of the data files created by an application against obsolescence of the application • applications / data are easier to port from one platform to another since they follow known guidelines and rules, and the interfaces
  137. 137. digitization tools ocr software open source • ABBYY FineReader ( • Tesseract ( • Nuance OmniPage ( • IRIS Readiris ( • LEADTOOLS OCR ( • OCRopus ( Wikipedia contributors, “Optical character recognition," Wikipedia, The Free Encyclopedia, http:// (accessed March 2014). Wikipedia contributors, “Comparison of optical character recognition software," Wikipedia, The Free Encyclopedia, (accessed March 2014).
  138. 138. digitization tools imaging software open source • LEADTOOLS image SDK ( • ImageGear image SDK ( • FreeImage image SDK ( • BlackIce image toolkits ( • Adobe Photoshop ( • GIMP ( • GraphicsMagick ( • ImageMagick (
  139. 139. digitization tools digital workflow software • Content Conversion Specialists docWorks (http://content-conversion.com) • ScanFlow ( • Goobi ( • Zissor ( open source
  140. 140. digitization tools other software • BagIt : hierarchical file packaging format for the exchange of digital content. A "bag" has just enough structure to safely enclose descriptive "tags" and a "payload" but does not require any knowledge of the payload's internal semantics. See http:// and http:// open source
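As a rough illustration of the "just enough structure" described above, the sketch below writes a minimal bag by hand: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. The function name `write_bag`, the choice of SHA-256, and the version string are assumptions; a real project would more likely use an existing BagIt tool and validate the result.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir, payload):
    """Write payload (relative name -> bytes) as a minimal bag; return manifest lines."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest = []
    for name in sorted(payload):
        content = payload[name]
        target = data / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # Each manifest line pairs a checksum with the payload file's path.
        manifest.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}")
    # Bag declaration (the version string here is an assumption).
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8")
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest) + "\n", encoding="utf-8")
    return manifest
```

The receiver of the bag can recompute the checksums of everything under `data/` and compare them with the manifest without knowing anything about what the payload files contain.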
  141. 141. ? digitization tools ? Discussion questions 1. What software tools does your organization use for digital projects or digital libraries? 2. Does your organization host a digital library? If so, does it use Google Analytics or a similar tool? Why or why not? 3. What software tools does your organization use for project management? Are the tools web-based?
  142. 142. digital preservation Preservation of software and preservation of data are two sides of the same coin. From February 2011 Workshop for Digital Curators.
  143. 143. preservation Open Archival Information System (OAIS) reference model
  144. 144. digitization≠digital preservation!
  145. 145. Vint Cerf on “bit rot”
  146. 146. digital preservation long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required
  147. 147. digital data risks • standards / format obsolescence • migration to new format, media, or hardware • media obsolescence / decay • bit rot
  148. 148. format obsolescence remember … WordPerfect ? MARC records ? Adobe Flash ?
  149. 149. strategies for format obsolescence • migrate data to new formats • create a computer software museum with virtual machines • format registries • format validators • don’t worry about it!
  150. 150. Jeff Rothenberg on format obsolescence “... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats.” Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American. January 1995. Expanded version published February, 1999. (accessed 1 August 2012 at
  151. 151. standard model for format obsolescence • digital format registry collects information about target format • this information is used to build format identification and verification tools • holders of content use these tools to extract metadata from content in target format; metadata is stored with the content • format registry scans computing environment to determine which formats are obsolescent; notifications sent for obsolete formats • on receiving such a notification, someone builds a tool to convert obsolete format to non-obsolete format using the format specification in the registry • on receiving such a notification, holder of content in obsolete format uses conversion tool and content metadata to convert the file in an obsolete format to a file in a non-obsolete format
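The "format identification tools" step of this model can be sketched as a lookup of leading file bytes against a small signature table. The table below is a tiny, hand-written stand-in for a real format registry, and the names `SIGNATURES` and `identify` are hypothetical.

```python
# Leading "magic" bytes for a few formats common in digitization projects.
# A real registry holds far more formats and far richer rules than this.
SIGNATURES = [
    (b"II*\x00", "TIFF"),                               # little-endian TIFF
    (b"MM\x00*", "TIFF"),                               # big-endian TIFF
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),
]


def identify(header):
    """Return a format name for the first bytes of a file, or None if unknown."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(identify(b"%PDF-1.4\n"))
print(identify(b"II*\x00\x08\x00\x00\x00"))
```

The identified format name is the kind of metadata that would be stored with the content so that a later obsolescence notification can be matched to the affected files.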
  152. 152. David Rosenthal on format obsolescence “... format obsolescence is a rare problem that happens infrequently to a minority of unpopular formats ...” David Rosenthal. Format obsolescence: Assessing the threat and the defenses. (accessed 1 August 2012 at LibraryHighTech2010.pdf
  153. 153. alternate model for format obsolescence • store only essential data • perform only essential tasks • delay performing tasks as long as possible David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, vol. 28, no. 2, 2010, pp.195-210. doi: 10.1108/07378831011047613 (accessed 1 August 2012 at files/LibraryHighTech2010.pdf).
  154. 154. importance of standards vis-a-vis format obsolescence well-defined standards … • guide developers in creation of tools • facilitate development of a broad range of tools for any format • allow developers to maintain existing tools
  155. 155. data migration risks • file format changes, for example, PDF 1.4 to PDF 1.7 • file name differences, for example, case-sensitive / case-insensitive names, new operating system • extended file attributes • file permissions, for example, BSD Unix drwxr-xr-x@ to Windows file permissions • soft links / hard links
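The file name risk can be checked before a migration. The sketch below looks for names that differ only by letter case and would therefore collide on a case-insensitive file system; the function name `case_collisions` is hypothetical, and the other risks in the list need separate checks.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(case_collisions(["Page1.tif", "page1.TIF", "page2.tif"]))
```

Any group returned would lose all but one file if copied naively to a case-insensitive target.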
  156. 156. media obsolescence • 5 ¼” floppy disks • 8-track tapes • 3 ½” floppy disks • ZIP drives • CD-R, CD-RW, Blu-ray • DAT tapes • microfilm • etc
  157. 157. strategies for media obsolescence • migrate data to new media, for example, floppy disks to DVD • create and maintain a computer hardware museum
  158. 158. media decay a report by NIST and the Library of Congress says ... • virtually all CD-Rs tested indicated an estimated life expectancy beyond 15 years • only 47 percent of recordable DVDs indicated an estimated life expectancy beyond 15 years, some had a life expectancy as short as 1.9 years • in practice actual lifetimes may be considerably shorter
  159. 159. prevention / detection of media decay • proper storage • data file checksums (MD5, SHA-1, ...) • monitor media integrity • migrate data from old media to new media
  160. 160. bit rot gradual decay of data due to … • storage media failure because of media quality • storage media failure because of improper storage • random events (bit-flip, environmental influences) • software / hardware errors
  161. 161. prevention / detection of bit rot • data file fixity check (checksums) such as MD5, SHA-1, ... • monitor file integrity with frequent, corrective audits • duplicate copies, geographically distributed
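A fixity check of the kind listed above can be sketched as follows: digests recorded at ingest are compared with digests recomputed later. SHA-256 is used here as one option alongside the algorithms named on the slide, and the function names `file_digest` and `audit` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large image files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(root, expected):
    """Compare stored digests (relative path -> hex digest) with the files under root."""
    report = {}
    for rel, digest in expected.items():
        path = Path(root) / rel
        if not path.is_file():
            report[rel] = "missing"
        elif file_digest(path) != digest:
            report[rel] = "corrupt"
        else:
            report[rel] = "ok"
    return report
```

An audit only detects damage; making it corrective requires a second, independent copy from which a "corrupt" or "missing" file can be restored.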
  162. 162. distributed decentralized digital preservation • the more copies, the safer the data • the more independent copies, the safer the data • the more frequently copies are audited, the safer the data Paraphrased from David Rosenthal. Keeping bits safe: How hard can it be?
  163. 163. distributed decentralized digital preservation • n+1 copies are safer than n copies • n independent copies on different storage devices / media are safer than n copies on similar or identical storage devices / media • data audited every week is safer than data audited every month
  164. 164. LOCKSS Lots Of Copies Keep Stuff Safe LOCKSS box: Open source LOCKSS software installed on a dedicated computer or virtual machine. • It ingests content from target websites using a web crawler similar to those used by search engines. • It preserves content by continually comparing the content it has collected with the same content collected by other LOCKSS Boxes, and repairing any differences. • It delivers authoritative content to readers by acting as a web proxy, cache or via Metadata resolvers when the publisher’s website is not available. • It provides management through a web interface that allows librarians to select new content for preservation, monitor the content being preserved and control access to the preserved content. • It dynamically migrates content to new formats as needed for display. From LOCKSS webpages
  165. 165. how LOCKSS works: data copied to another LOCKSS box [diagram: library X LOCKSS box, library Y LOCKSS box, my library LOCKSS box; label: data]
  166. 166. how LOCKSS works: data audited [diagram: library X LOCKSS box, library Y LOCKSS box, my library LOCKSS box; labels: audit, data]
  167. 167. how LOCKSS works: data audited [diagram: library X LOCKSS box, library Y LOCKSS box, my library LOCKSS box; labels: audit, data, audit fails, audit ok]
  168. 168. how LOCKSS works: data copied to another LOCKSS box [diagram: library X LOCKSS box, library Y LOCKSS box, my library LOCKSS box; label: data]
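The copy, audit, and repair cycle in these slides can be illustrated with a toy majority vote among boxes. This is a deliberate simplification and not the actual LOCKSS polling protocol; the function name `audit_and_repair` and the in-memory dictionary of boxes are hypothetical.

```python
import hashlib
from collections import Counter


def audit_and_repair(boxes):
    """Compare copies held by several boxes; overwrite minority copies with the majority copy.

    boxes maps a box name to the bytes it holds. Returns the names of repaired boxes.
    """
    digests = {name: hashlib.sha256(content).hexdigest()
               for name, content in boxes.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(boxes) / 2:
        raise ValueError("no majority; cannot decide which copy is good")
    good = next(content for name, content in boxes.items()
                if digests[name] == winner)
    repaired = []
    for name in boxes:
        if digests[name] != winner:
            boxes[name] = good  # the failed box receives a fresh copy
            repaired.append(name)
    return repaired


boxes = {"library X": b"page", "library Y": b"page", "my library": b"pagf"}
print(audit_and_repair(boxes))
```

With only two boxes that disagree there is no majority, which is one reason more independent copies make the data safer.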
  169. 169. private LOCKSS networks Alabama Digital Preservation Network ( CLOCKSS (Controlled LOCKSS), a non-profit collaboration of North American, European, and Asian cultural heritage institutions whose purpose is to preserve digital content with LOCKSS ( MetaArchive Cooperative is a digital preservation cooperative created by cultural heritage institutions (
  170. 170. digital preservation references • Nancy McGovern and Katherine Skinner editors. Aligning National Approaches to Digital Preservation. Educopia Institute Publications. Atlanta Georgia. 2012. Proceedings of a conference on digital preservation held at the National Library of Estonia in May 2011. (accessed 15 August 2012 at default/files/ANADP_Educopia_2012.pdf). • David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, v. 28, n. 2, 2010, pp.195-210. doi: 10.1108/07378831011047613 (accessed 1 August 2012 at files/LibraryHighTech2010.pdf). • David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at • Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American January 1995. Expanded version published February 1999. (accessed 1 August 2012 at • Joint Information Systems Committee (JISC) Programme on Digital Preservation at • Library of Congress on Digital Preservation at • Stanford University’s website for LOCKSS at
  171. 171. newspaper digitization programs around the world National Library of Finland ( British Newspaper Archive, British Library ( newspapers) National Digital Newspaper Program, Library of Congress ( National Library of New Zealand ( National Library of Australia, Australian Digital Newspapers Program ( Koninklijke Bibliotheek, the Netherlands ( Singapore National Library Board ( Bibliothèque nationale de France ( Europeana Newspapers Project, a collaboration of 17 organizations ( National Library of Latvia (
  172. 172. • Library of Congress National Digital Newspaper Program • Australian Newspaper Digitisation Program • IFLA Newspapers Section Digitisation projects and best practices • ICON: International Coalition on Newspapers
  173. 173. • METS, MODS, ALTO, PRISM, and other library standards : • OAIS : • NISO standards and guidelines : publications/rp • Good practice guides : • And many, many more
  174. 174. Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, https:// (accessed March 17, 2013).
  175. 175. ?! Frederick Zarndt Secretary, IFLA Newspapers Section Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.