2. the agenda
1. Introductions
2. Review of the OAIS reference model
3. Newspaper digitization programs
4. Selection of materials
5. Importance of standards
6. Project management
7. Digitization workflow
7.1. Images
7.2. Metadata
7.3. File formats
8. Digitization workflow demonstration with docWorks
9. Quality assurance and acceptance criteria
10. Tools for digitization, workflow, digital preservation, and project management
11. Digital preservation considerations
12. Wrap-up
10.30 Morning tea break
13.00 Lunch
15.30 Afternoon tea break
3. An Open Archival Information System (or
OAIS) is an archive, consisting of an
organization of people and systems, that has
accepted the responsibility to preserve
information and make it available for a
Designated Community.
Wikipedia contributors, “Open Archival Information System,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).
4. Open Archival Information System (OAIS)
reference model
• Negotiate for and accept appropriate information from information Producers.
• Obtain sufficient control of the information provided to the level needed to ensure
Long-Term Preservation.
• Determine, either by itself or in conjunction with other parties, which communities
should become the Designated Community and, therefore, should be able to
understand the information provided.
• Ensure that the information to be preserved is Independently Understandable to the
Designated Community. In other words, the community should be able to
understand the information without needing the assistance of the experts who
produced the information.
• Follow documented policies and procedures which ensure that the information is
preserved against all reasonable contingencies, and which enable the information to
be disseminated as authenticated copies of the original, or as traceable to the
original.
• Make the preserved information available to the Designated Community.
Wikipedia contributors, “Open Archival Information System,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).
8. national programs
national: centrally funded and managed
programs with several participants. strict
standards.
• National Digital Newspaper Program
(Library of Congress)
• Australian Newspaper Digitisation
Program
programs
9. cooperative programs
cooperative: organizations collaborate
to achieve a common goal but
digitization programs are managed
separately. flexible standards.
• Europeana newspapers
• Digital Public Library of America
programs
10. individual programs
individual: organization digitizes on its own.
may or, more usually, does not follow open
standards. all commercial organizations.
• ProQuest Historical Newspapers
• Newspapers.com
• Newsbank
• many others…
programs
11. programs
• digitization program requires
careful thought
• must be adapted to local
circumstances
• ask those who have gone before
• join the IFLA Newspapers
Section! (ask me how)
Image courtesy of Donald Zolan.
12. programs
Discussion questions
1. Has your organization already begun to digitize
newspapers? How is the digitization program
organized and funded?
2. If your organization hasn’t yet begun to digitize
newspapers, what type of digitization program
would best suit your organization / state /
country? Why?
13. Experience is that marvelous thing
that enables you to recognize a
mistake when you make it again.
F. P. Jones
15. reasons for digitization
newspapers are deteriorating
microfilm is dissolving
no storage space
selection
16. access
• Who are your users? Do you know?
• Can you ask them what they expect
from a digital newspaper collection?
Can you trust their answers?
• Trove, Papers Past, Cambridge
Public Library, CDNC: These digital
newspaper collections are used
mostly by people 50+ years old and
with an interest in family history.
selection
17. Library of Congress selection criteria for the
National Digital Newspaper Program (NDNP)
• Image quality
• Intellectual content
• Refinements
http://www.loc.gov/ndnp/guidelines/selection.html
18. selection for NDNP
Image quality
All NDNP newspaper images are scanned from microfilm.
1. Microfilm should be produced from properly prepared
unbound originals.
2. Microfilm reduction ratio should be less than 20x. This allows
400dpi images to be scanned from the film.
3. Variations in microfilm density within and between images
should be no more than 0.2.
4. Negative microfilm duplicated for scanning should have
resolution test patterns readable at 5.0 or higher. For camera
master microfilm without resolution test charts, resolution
can be estimated by comparison to film with resolution test
charts and original material.
selection
19. selection for NDNP
Intellectual content
1. Newspaper title reflects the political, economic and cultural
history of the State.
2. Selected newspaper titles should ensure broad geographical
coverage.
3. Newspaper titles that provide coverage of a geographic area or a
group over long time periods are preferred over short lived titles
or titles with significant gaps.
selection
20. selection for NDNP
Selection criteria refinements
1. Orphan titles: Special consideration should be given to high
research value titles that have ceased publication and lack active
ownership.
2. Newspaper titles that document a significant (minority)
community at the state or regional level may be given special
consideration.
3. Newspapers that have already been digitized by other
organizations (for example, ProQuest) should not be digitized
again.
selection
21. selection for ANDP
National Library of Australia collection managers in
consultation with staff from Preservation Services nominate
materials for digitization. The Library works closely with state
and territory libraries to systematically digitise newspapers
held in these libraries. Selected newspapers include those with
• Cultural and/or historical significance
• Uniqueness and/or rarity of the material
• Copyright status or permission to digitise obtained
• Material in high demand
• Material at risk because of its physical condition
selection
https://www.nla.gov.au/policy-and-planning/collection-digitisation-policy
22. copyright
Most newspaper titles selected
for digitization are out of
copyright and in the public
domain. Negotiating use rights is
quite simply too much trouble and
fraught with legal pitfalls.
Copyright laws and policies vary considerably between countries.
selection
29. selection
Discussion questions
1. Has your organization already selected
newspapers to digitize? Why did it choose the
titles that were selected? Please answer
(hypothetically) if your organization hasn’t
begun a newspapers digitization program.
2. Why would or why wouldn’t your organization
select in-copyright newspapers to digitize?
31. open standards
• Availability : Open standards are available for all to read and implement.
• Maximize end-user choice : Open standards create a fair, competitive market
for implementation of the standards. They do not lock the customer into a
particular vendor or group.
• No royalty : Open standards are free for all to implement, with no royalty or
fee.
• No discrimination : Open standards and the organizations that administer
them do not favor one implementor over another for any reason other than
the technical standards compliance of a vendor's implementation.
• Extension or subset : Implementations of open standards may be extended, or
offered in subset form. However, certification organizations may decline to
certify subset implementations, and may place requirements upon extensions.
• Predatory practices : Open standards may employ license terms that protect
against subversion of the standard by embrace-and-extend tactics. The
licenses attached to the standard may require the publication of reference
information for extensions, and a license for all others to create, distribute
and sell software that is compatible with the extensions. An open standard
may not otherwise prohibit extensions.
importance of standards
Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards
32. open standards
• Not restrictive : Less chance of being locked in by a specific
technology and/or vendor.
• Interoperable : Easier for systems from different parties or
using different technologies to interoperate and communicate
with one another.
• Protection against obsolescence : Better protection of the data
files created by an application against obsolescence.
• Portable : Applications / data are easier to port from one
platform to another since they follow known guidelines,
rules, and interfaces.
importance of standards
Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards
33. newspapers and standards
What standards are important for newspaper digitization?
• METS XML is an open standard administered by the METS editorial
board. See http://www.loc.gov/standards/mets/.
• ALTO XML is an open standard administered by the ALTO editorial
board. See http://www.loc.gov/standards/alto/.
• Various image file formats including TIFF, JPEG, JPEG2000.
• PDF/A is a portable document format developed by Adobe. It is a
subset of the complete PDF specification and has been adopted by
ISO as a standard. See http://www.pdfa.org/.
• Various library metadata standards including, but not limited to
• MODS XML http://www.loc.gov/standards/mods/
• Dublin Core http://dublincore.org/
• PREMIS http://www.loc.gov/standards/premis/
importance of standards
34. importance of standards
with few exceptions
libraries use METS XML +
ALTO XML + image files (TIFF,
JPEG2000) for newspaper
digitization programs
importance of standards
35. proprietary standards
Olive ActivePaper Archive stores historical
newspaper data in an XML format that is as
capable as METS/ALTO XML but is not an
open standard.
Early versions of WordPerfect (MS Word
too) stored data in a proprietary format, not
in an open standard like Open Document
Format (ODF). WordPerfect or special
software is needed to view the files.
Adobe’s Flash is a de facto but not an open
standard. Flash now appears to be on a path
to obsolescence, destined to be replaced by
HTML5.
importance of standards
36. importance of standards
Discussion questions
1. Name a few standards that you use every time
you connect to the Internet.
2. What library standards does your organization
currently use? What other, non-library
standards, if any, does your organization use?
37. In theory, there's no difference
between theory and practice, but in
practice, there is.
Anonymous
39. From the Standish Group’s 2012 Chaos Report on IT Project Failure.
project management
40. high cost of IT failure
Roger Sessions estimates that the worldwide cost of
IT failure is USD $500 billion per month
Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple
Architectures for Complex Enterprises and many articles. He is a founding member
of the Board of Directors of the International Association of Software Architects.
project management
41. in a recent survey of 1230 IT professionals
conducted by Embarcadero Technologies, 2 of the
3 biggest project challenges cited by the IT pros are
“poor planning” and “poor or no requirements”
plan!
project management
42. in a March 2007 web poll conducted by the
Computing Technology Industry Association "nearly
28 percent of the more than 1,000 respondents
singled out poor communications as the number one
cause of project failure"
communicate!
project management
43. A recent survey of 752 IEEE members conducted by IEEE
Spectrum and The New York Times discovered that "just 9
percent of 133 respondents whose organizations currently
offshore R&D reported 'No problem'. The biggest
headache was 'Language, communication, or culture'
barriers, as reported by 54.1 percent of respondents."
(http://www.spectrum.ieee.org/feb07/4881)
communicate!
project management
44. In their 2009 book Cultural Intelligence: Living and Working
Globally, Thomas and Inkson say “Although we increasingly
cross boundaries and surmount barriers to trade, migration,
travel, and the exchange of information, cultural boundaries
are not so easily bridged. Unlike legal, political, or economic
aspects of the global environment, which are observable,
culture is largely invisible. Therefore, culture is the aspect of
the global context that is most often overlooked.”
communicate!
project management
45. plan!
in a white paper written for Project Perfect, Taimour
al Neimat lists
• poor planning
• unclear goals and objectives
• objectives changing during the project
• unrealistic time or resource estimates
• lack of executive support and user involvement
• failure to communicate and act as a team
• inappropriate skills
as primary causes for the failure of complex IT projects
Taimour al Neimat. Why IT Projects Fail. The PROJECT PERFECT White Paper
Collection. Oct 2005. http://www.projectperfect.com.au/downloads/Info/info_it_projects_fail.pdf (accessed Mar 2014).
project management
46. typical tender evaluation criteria in priority order
1. understanding of requirements
2. reputation of service bureau
3. price
requirements?
project management
47. incomplete requirements
requirements in recent tender from an
(anonymous) government agency somewhere in the
world
• project to convert ~ 170,000 text images to xml
• value of project ~ USD $180,000
• 19 pages of definitions, governing law, proposal
evaluation criteria, contractual conditions,
instructions about tender response format, etc
• technical requirements description? < 1 page
• data acceptance criteria? “a high level of
accuracy”
project management
49. a recent newspapers digitization program
established by a prominent national library
• digitize more than 20 million text pages
• high level image and xml requirements
• value of work awarded? > USD $5,000,000
• after award of work, technical requirements
expand to 43+ pages from ~3 pages
• acceptance criteria? added as an afterthought and
not well defined
project management
poor planning
50. the value of simplicity
“There are two ways of constructing a software
design: one way is to make it so simple that there
are obviously no deficiencies and the other way is
to make it so complicated that there are no obvious
deficiencies.”
C.A.R. Hoare
Professor Sir Charles Anthony Richard Hoare Emeritus Professor at Oxford
University, Senior Researcher at Microsoft Research, recipient of the ACM
Turing Award, author of many books on computers and software.
project management
51. • unitary: the requirement addresses one and only one
thing
• complete: the requirement is fully stated in one place
with no missing information
• consistent: the requirement does not contradict any
other requirement and is fully consistent with all
authoritative external documentation
• atomic: it does not contain conjunctions, for example,
"the code field must validate American and Canadian
postal codes" should be written as two separate
requirements
project management
good requirements
52. • traceable: the requirement meets all or part of a
business need as stated by stakeholders and
authoritatively documented
• current: the requirement has not been made obsolete
by the passage of time
• feasible: the requirement can be implemented within
the constraints of the project
• unambiguous: the requirement is concisely stated
without recourse to technical jargon, acronyms
• verifiable: the implementation of the requirement
can be determined through one of four possible
methods: inspection, demonstration, test, or analysis
project management
good requirements
54. simple principles for (good)
communication
• be impeccable with your word
• don’t take anything personally
• don’t make assumptions
• always do your best
• be mindful
55. why (better) communication is
necessary
no communication ...
little communication ...
poor communication ...
reduced communication ...
... all result in more assumptions about intent!
56. The single biggest problem
with communication is the
illusion that it has taken place.
George Bernard Shaw, 1925 Nobel Prize in Literature.
57. project management
“projects are about communication,
communication, and communication”
Elenbass, B. Staging a Project: Are You Setting Your Project
Up for Success? Proceedings of the Project Management
Institute Annual Seminars & Symposiums. 2000.
58. the value of prototypes / pilots
“Plan to throw one away; you will anyhow. If there is
anything new about the function of a system, the first
implementation will have to be redone completely to achieve
a satisfactory (i.e., acceptably small, fast, and maintainable)
result. It costs a lot less if you plan to have a prototype.”
Butler Lampson
Butler Lampson was a founding member of Xerox PARC, worked for DEC,
and now works at Microsoft Research. He is an adjunct professor at MIT
and an ACM Fellow.
project management
59. implement: pilot
create requirements and acceptance criteria
repeat
{
digitize (small) pilot batch
test data against acceptance criteria
adjust requirements and acceptance criteria
}
until (no more adjustments are necessary)
digitize more data
pilot batches are VERY VERY important!!
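The pilot loop above can be sketched in Python. This is only an illustration under stated assumptions: `run_pilot`, the three callbacks, and the toy "dpi" requirement are hypothetical names invented for the example, not part of any real digitization tool.

```python
def run_pilot(requirements, digitize_batch, find_problems, adjust, max_rounds=10):
    """Repeat small pilot batches until a batch raises no further problems."""
    for round_number in range(1, max_rounds + 1):
        batch = digitize_batch(requirements)           # digitize (small) pilot batch
        problems = find_problems(batch, requirements)  # test data against acceptance criteria
        if not problems:                               # no more adjustments are necessary
            return requirements, round_number
        requirements = adjust(requirements, problems)  # adjust requirements and criteria
    raise RuntimeError("pilot did not settle; revisit the requirements")


# Toy stand-ins (hypothetical): the only requirement is a scan resolution.
def digitize_batch(requirements):
    return {"dpi": requirements["dpi"]}


def find_problems(batch, requirements):
    return ["resolution too low for OCR"] if batch["dpi"] < 400 else []


def adjust(requirements, problems):
    return {"dpi": requirements["dpi"] + 100}


final_requirements, rounds = run_pilot({"dpi": 200}, digitize_batch, find_problems, adjust)
print(final_requirements, rounds)
```

The `max_rounds` limit is a design choice of this sketch: a pilot that never settles usually signals unclear requirements, so the loop stops and reports it before full production begins.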
project management
60. reasons for in-house production
• collection cannot be moved
• collection is badly organized
• digitization must be done slowly over a long
period
• digitization is very simple
project management
implement: in-house
61. reasons for outsourced production
• originals can’t be scanned in-house because…
• equipment is too expensive
• output data is beyond staff experience
• labor is too expensive
• large volume of work in a short time
• insufficient space, infrastructure, or staff
project management
implement: outsource
62. project management tools
The project management tool one chooses should be
intuitive, easy to use, and accessible to all. If it isn’t,
many will avoid / refuse / dislike / resent using it.
• Discussion of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software
• List of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software
project management
63. project management
Discussion questions
1. What project management practices does your
organization follow? Why?
2. What library standards does your organization
currently use? What other, non-library
standards, if any, does your organization use?
3. What reasons, in addition to those already cited,
would your organization have to digitize
newspapers in-house or to outsource digitization?
64. “Perfection is attained, not when there
is nothing left to add, but when there
is nothing left to take away.”
Antoine de Saint-Exupéry
70. digitization workflow
• digital library: one or more digital collections
• digital collection: organized group(s) of digital
objects
• digital object: a surrogate or digital copy of
the original source document, for example, a
newspaper issue
72. An example of what
ALTO makes possible
The Day book. (Chicago, Ill.), 29 Feb. 1912. Chronicling America: Historic American Newspapers. Lib. of Congress.
<http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-02-29/ed-1/seq-26/>
73. digitization workflow
• digital library: one or more digital collections
• digital collection: organized group(s) of digital
objects
• digital object: a surrogate or digital copy of
the original source document, for example, a
newspaper issue
• metadata: data about data. information about
a digital object(s) or a digital collection(s) or
the original source document(s)
75. • to enhance accessibility
• to increase collaboration and cooperation
between libraries and archives around the
world
• to promote research
• to provide opportunities for entrepreneurs
• other reasons?
why digitize newspapers?
digitization workflow
81. image decisions
• image production source materials
• original documents: better quality, more
expensive
• microfiche: poorer quality, less
expensive, microfiche quality varies
• bit depth
• black-and-white (bitonal)
• greyscale
• color
• resolution
• compression
• no compression
• lossless (reversible)
• lossy (irreversible)
• image metadata
digitization workflow
82. image format comparison
format (extensions) | compression | bit depth | metadata | color management | mime type | patent | 1st public release
JBIG (.jbig, .jbg) | lossless | 1-bit | no | no | | | 2000?
JPEG (.jpg, .jpeg) | lossy, DCT, RLE, Huffman | 8-bit, 12-bit, 24-bit | yes | yes | image/jpeg, public.jpeg | no | 1992
JPEG2000 (.jp2) | many lossless and lossy compression algorithms | 8-bit, 16-bit, color to 48 bits | yes | yes | image/jp2, public.jpeg200 | yes but part 1 is patent free | 2000
TIFF (.tiff, .tif) | none, LZW, RLE, ZIP, other | 1, 2, 4, 8, 16, 24, 32 bits | yes | yes | image/tiff, public.tiff | no | 1986
Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free
Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats
(accessed August 1, 2012)
84. image bit depth comparison
format and compression | USA case law image 1 (300dpi) | USA case law image 2 (300dpi)
TIFF 1-bit CCITT G4 compression | 40 KB | 87 KB
JPEG2000 W5x3 reversible compression | 2.6 MB | 3.6 MB
JPEG2000 W9x7 irreversible compression | 647 KB | 1 MB
85. GIGO
GARBAGE IN, GARBAGE OUT
Image courtesy of http://epsos.de (accessed at http://commons.wikimedia.org March
2014).
86. raw OCR text
Deaths. lln»rieff, Esq. of <c .. Qn.
Sunday, the till. greatly Drandrellt, of
Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn
l j j j i l F i i j ' 1 1 f Havodiv y d,
Carnarvonshire, S ; **" *- ' « ' March
Oxford, F. Tfovmeud, Uerald. » • V .
•On Tncsdav last , Mr . Charles.
IWilinson, this 8 ; had vf thesis#,, a week
ago, which tcrminate<i'iu his death. . / ' ■
O'i Sunday, dJst nit. at. AsbtCnvHall,
mar Lancaster, Mr.,Geo. Worn ick,
many years house'steward hit late Once
The Hamilton and Brandon. He locked
himself h»oWn'r«wte<: soon. twelve
o'clock" that dny, and fii»-d a loaded pistol
" t h r o u g h I n s b e a d , 1 w h i c h
instantaneously killed him. Coronet's
Verdict, shot himself in a temporary fit of
Friday week,
newspaper image
Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
87. digitization workflow
Discussion topics
1. Assume your organization decides to digitize 1000
newspaper issues averaging 12 pages per issue. The
images are scanned 2-up and average 80MB each.
How much disk storage is needed for the images?
2. Now assume instead that your organization uses
TIFF images with LZW (lossless) compression,
which saves on average 40%. How much disk
storage is needed for the images?
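One way to work through both storage questions is sketched below. It assumes that "scanned 2-up" means each 80MB image holds two newspaper pages and that sizes use decimal units (1 GB = 1000 MB). The function name `storage_gb` and its parameters are made up for this example.

```python
def storage_gb(issues, pages_per_issue, pages_per_image, mb_per_image,
               compression_saving=0.0):
    """Estimate image storage in decimal gigabytes (1 GB = 1000 MB)."""
    images = issues * pages_per_issue / pages_per_image
    total_mb = images * mb_per_image * (1 - compression_saving)
    return total_mb / 1000


uncompressed = storage_gb(1000, 12, 2, 80)                        # question 1
with_lzw = storage_gb(1000, 12, 2, 80, compression_saving=0.40)   # question 2
print(f"uncompressed: {uncompressed:.0f} GB, with LZW: {with_lzw:.0f} GB")
```

Under these assumptions, 12,000 pages become 6,000 images, which need about 480 GB uncompressed and about 288 GB with a 40% saving. If the images were instead one page each, both figures would double.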
98. the digitization process
[process diagram: images -> image processing -> layout analysis -> OCR -> metadata -> build digital objects]
• analyze layout of text image
• estimate font types and sizes
• calculate coordinates of text blocks
• determine layout object types (text,
illustration, headline, etc)
100. the digitization process
[process diagram: images -> image processing -> layout analysis -> OCR -> metadata -> build digital objects]
• perform optical character recognition (OCR)
• calculate word and character coordinates
• calculate word and character confidences
• apply language dictionaries
• correct OCR text (optional)
102. the digitization process
[process diagram: images -> image processing -> layout analysis -> OCR -> metadata -> build digital objects]
• create METS / ALTO XML files
• create image files and image metadata
• create PDF files (if required)
• verify digital object
• calculate file fixity checks (checksums)
• perform file validation and verification
• perform quality assurance
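The fixity check step can be illustrated with Python's standard `hashlib` module. This is a minimal sketch, assuming SHA-256 and a simple manifest that maps relative file paths to expected digests. The helper names `file_checksum` and `verify_manifest` are hypothetical, and a real project would follow whatever algorithm and manifest format its requirements specify.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, base_dir):
    """Return the relative paths whose file is missing or whose checksum differs.

    manifest: dict mapping relative path -> expected hex digest (assumed format).
    """
    failures = []
    for relative_path, expected in manifest.items():
        path = Path(base_dir) / relative_path
        if not path.is_file() or file_checksum(path) != expected:
            failures.append(relative_path)
    return failures
```

Recording the checksum when the file is created and re-running `verify_manifest` after every transfer is what lets the later question "is the digital object uncorrupted?" be answered automatically.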
103. real world
digitization
production
workflow
• automatic production
steps performed by
software
• manual production steps
performed by operators
104. digital library standards
• METS XML for descriptive, structural, technical, and
administrative metadata
• descriptive metadata
• Metadata Object Description Schema (MODS): selected
metadata from MARC
• Dublin Core: fundamental group of text elements for
describing and cataloging
• technical metadata
• ALTO for OCR text
• PREMIS for digital preservation
• MIX and ANSI/NISO Z39.87 for images
105. Metadata Encoding and
Transmission Standard
• METS is an XML standard for encoding descriptive, administrative,
and structural metadata about objects within a digital library
• METS files consist of 7 sections (all optional except the structural
map): header, descriptive metadata, administrative metadata, file
section, structural map, structural links, and behavior
• METS profiles describe a class of METS documents in sufficient
detail to provide both document authors and programmers the
guidance to create and process METS documents conforming with a
particular profile
• current version 1.9.1
• administered by METS editorial board (international group of
volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/mets/
106. METS file structure
Graphic from Karin Bredenberg, Communicating Archival Metadata conference and
workshops. Riksarkivet, 2011.
107. Metadata Object Description Schema
• MODS is an XML schema for a bibliographic element set that may
be used for library applications. Derivative of MARC 21
bibliographic format. Includes a subset of MARC fields, using
language-based tags rather than numeric ones
• Subset of MARC 21
• Mappings exist between MODS and MARC, Dublin Core, and RDA
(conversion tools exist)
• May be used in conjunction with METS XML
• current version 3.4
• administered by Library of Congress Network Development and
MARC Standards Office with help from interested users
• standards hosted by Library of Congress at http://www.loc.gov/standards/mods/
109. Dublin Core metadata
• Dublin Core is a set of vocabulary terms used to describe
resources for the purposes of discovery.
• Dublin Core metadata element set is endorsed in IETF RFC
5013, ISO 15836:2009, and NISO Z39.85
• Metadata terms last updated 14-Jun-2012
• May be used in conjunction with METS XML
• Dublin Core Metadata Initiative (DCMI) is an open
organization, incorporated as a public, not-for-profit company
in Singapore
• Dublin Core Metadata Initiative is hosted at http://dublincore.org/
110. Analyzed Layout and Text Object
• ALTO XML provides technical metadata for describing the layout
and content of physical text resources, such as pages of a book or a
newspaper
• commonly used in conjunction with METS XML but may be used
standalone
• current version 2.1
• administered by ALTO editorial board (international group of
volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/alto/
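To show the kind of information ALTO carries, the sketch below reads word text, coordinates, and word confidence from a tiny hand-written ALTO fragment using Python's standard library. The fragment is illustrative only and has not been validated against the ALTO schema. It assumes the version 2 namespace, and the helper names `words` and `low_confidence` are invented for the example.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}  # assumes ALTO v2

# Hand-written, simplified fragment (not schema-validated).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Deaths." HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40" WC="0.93"/>
            <String CONTENT="Oxford" HPOS="300" VPOS="200" WIDTH="160" HEIGHT="40" WC="0.41"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words(alto_xml):
    """Return one dict per ALTO String element: text, bounding box, confidence."""
    root = ET.fromstring(alto_xml)
    result = []
    for string in root.iterfind(".//alto:String", ALTO_NS):
        result.append({
            "text": string.get("CONTENT"),
            "box": tuple(float(string.get(name))
                         for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")),
            "confidence": float(string.get("WC")),
        })
    return result


def low_confidence(alto_xml, threshold=0.5):
    """List words whose OCR word confidence falls below the threshold."""
    return [w["text"] for w in words(alto_xml) if w["confidence"] < threshold]


print(low_confidence(SAMPLE))
```

The word coordinates are what allow a viewer to highlight search hits on the page image, and the confidence values can point operators to words that may need correction.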
114. Preservation Metadata
Implementation Strategies
• PREMIS is a core set of implementable preservation metadata,
broadly applicable across a wide range of digital preservation
contexts and supported by guidelines and recommendations for
creation, management, and use
• In 2003 OCLC and RLG jointly sponsored the formation of the
PREMIS working group comprised of international experts in the
use of metadata to support digital preservation activities
• PREMIS data dictionary current version 2.2
• May be used in conjunction with METS XML
• PREMIS tools are freely available
• PREMIS Maintenance Activity and Editorial Committee has
international members from libraries and industry
• PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/
118. digitization workflow
Discussion topics
1. Assuming your organization will digitize
historic newspapers, will it digitize the
newspapers in-house or out-source
digitization? Why? (If you don’t know, guesses
and speculations are fine.)
2. Describe your organization’s current
digitization workflow.
120. quality assurance
and acceptance criteria
Wikipedia on data quality:
The processes and technologies involved in
ensuring the conformance of data values to
requirements and acceptance criteria
quality assurance
121. • is the digital object complete? are all its
components present?
• is the digital object verifiable?
• is the digital object uncorrupted?
• do the components of the digital object
conform to standards?
• do the file names conform to project
requirements?
• does the directory structure conform to
project requirements?
• does the digital object metadata conform to
project specifications?
quality assurance
automatic quality checks
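A few of the automatic checks listed above (completeness, file names, empty files) might be scripted along these lines. The required components and the naming rule `title_YYYYMMDD_page` are hypothetical project requirements chosen for the example, and `check_issue_directory` is an invented helper name.

```python
import re
from pathlib import Path

# Hypothetical project requirements: each issue needs XML and TIFF files
# named like courant_18190406_0003.xml (title_YYYYMMDD_pagenumber).
REQUIRED_SUFFIXES = {".xml", ".tif"}
NAME_PATTERN = re.compile(r"^[a-z0-9]+_\d{8}_\d{4}\.(xml|tif)$")


def check_issue_directory(directory):
    """Return a list of human-readable problems found in one issue directory."""
    problems = []
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    present = {p.suffix.lower() for p in files}
    for suffix in sorted(REQUIRED_SUFFIXES - present):
        problems.append(f"missing component: no {suffix} file")
    for path in files:
        if not NAME_PATTERN.match(path.name):
            problems.append(f"bad file name: {path.name}")
        if path.stat().st_size == 0:
            problems.append(f"empty file: {path.name}")
    return problems
```

An empty result means the directory passed these particular checks only. Standards conformance and metadata validation would need their own tools.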
122. • does the digital object metadata meet
accuracy specifications?
• does the text meet accuracy
specifications?
• is the image quality satisfactory?
• are article continuations correct?
• is the text in reading order?
quality assurance
manual quality checks
123. what’s wrong with this?
acceptance criteria for an English language
digitization project at a large, well-known, and
internationally recognized national library
character accuracy > 80%
word accuracy > 75%
significant word accuracy > 65%
quality assurance
124. what’s wrong with this?
project quality requirement:
“a high level of accuracy”
125. what’s wrong with this?
project quality requirement:
“article titles must be 99.5% accurate”
126. what’s wrong with this?
project quality requirement:
“article title characters in each issue must be 99.5%
accurate, that is, each issue may have no more than
5 errors in 1000 article title characters”
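A requirement stated this precisely can be tested by a program. The sketch below is one possible reading, assuming an "error" is a single character insertion, deletion, or substitution (edit distance) measured against hand-checked ground truth titles. The function names are hypothetical.

```python
def edit_distance(reference, candidate):
    """Levenshtein distance: minimum single-character edits between two strings."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if ref_char == cand_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_character_accuracy(pairs):
    """pairs: list of (ground truth title, OCR title) for one issue."""
    total_chars = sum(len(truth) for truth, _ in pairs)
    if total_chars == 0:
        raise ValueError("no ground truth characters to measure against")
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    return 1 - errors / total_chars


def issue_passes(pairs, threshold=0.995):
    """True when the issue meets the 99.5% article title character accuracy rule."""
    return title_character_accuracy(pairs) >= threshold


sample_issue = [("Deaths", "Deaths"), ("Coroner's Verdict", "Coronet's Verdict")]
print(issue_passes(sample_issue))
```

In this toy issue one wrong character out of 23 already fails the threshold, which shows why the requirement must also say how many titles are sampled and who produces the ground truth.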
127. image quality
•sharpness: the amount of detail an image can
convey
•noise: random variation of image density
•dynamic range
•contrast (gamma): the slope of the tone
reproduction curve in a log-log space. high
contrast usually involves loss of dynamic range —
loss of detail, or clipping, in highlights or shadows.
•vignetting: darkens images near the corners
•artifacts: “leftovers” from sharpening or
compression
Wikipedia contributors, “Image quality,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Image_quality (accessed March 2014).
quality assurance
128. image quality
“…images which are ultimately to be viewed by
human beings, the only “correct” method of
quantifying visual image quality is through subjective
evaluation. in practice, however, subjective
evaluation is usually too inconvenient, time-consuming
and expensive…”
“…best way to assess the quality of an image is to
look at it because human eyes are the ultimate
viewers of most images…”
Zhou Wang, Alan Bovik, and Ligang Lu. Why is image quality assessment
so difficult? IEEE Transactions on Image Processing. April 2004.
Zhou Wang and Hamid R. Sheikh. Image Quality Assessment: From Error
Visibility to Structural Similarity. IEEE Transactions on Image Processing.
April 2004.
quality assurance
130. quality assurance
Discussion topics
1. How does your organization currently do
quality assurance for digital data?
2. How much time / effort is given to writing
quality assurance procedures and acceptance
criteria for digitized data?
132. open source vs. commercial software: pros
• acquisition : cost, development and implementation
contract costs are likely to be lower than for proprietary
software. less likely that there will be contractually-bound
upgrade costs. total cost of ownership over the lifetime of
usage must be taken into account
• data transferability : with open source code and open data
formats, there are greater opportunities to share data across
interoperable platforms
• re-use : open source is free from per user or per instance
costs and there is a guaranteed freedom to use it in any way.
re-use is enabled.
digitization tools
Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
133. open source vs. commercial software: pros
• cost effective : pay once, or not at all, for development
and reuse where appropriate.
• non-restrictive : open source licenses do not limit or restrict
who can use the software, the type of user, or the areas of
business in which the software can be used. provides a
licensing model that enables rapid provisioning of both known
and unanticipated users and in new use cases.
• scalable : open source solutions are scalable upwards and
downwards with a reduction in the risk of longer term
financial implications. no license fees on a “per user” or “per
box” basis. no redundant licenses
digitization tools
Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
134. digitization tools
open source vs. commercial software: pros
• easy to prototype and adapt : open source software is
particularly suitable for rapid prototyping and
experimentation, where the ability to “test drive” the software
with minimal costs and administrative delays can be
important. (proprietary software suppliers may also provide
the same through a ‘proof of concept’ phase at minimal or no
cost.)
Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
135. digitization tools
open source vs. commercial software: cons
• support and maintenance costs : may outweigh those of
a proprietary package and include ‘hidden’
commitments.
• intellectual property rights : as code is modified and
adapted, there may be legal risks concerning the code’s open source
status and who owns the intellectual property rights of
the modified code.
• expertise : requires software installation and
maintenance expertise. modification of open source
code requires software development expertise. organizations must
ensure that they have the right level of expertise to
manage it effectively.
Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
136. digitization tools
a variety of open source and commercial off-the-shelf (COTS)
software is available for digitization projects
• easier for systems from different parties or using different
technologies to interoperate and communicate with one
another
• better protection of the data files created by an application
against obsolescence of the application
• applications / data are easier to port from one platform to
another since they follow known guidelines and rules, and the
interfaces
137. digitization tools
ocr software
• ABBYY FineReader (http://www.abbyy.com)
• Tesseract (https://code.google.com/p/tesseract-ocr) : open source
• Nuance OmniPage (http://www.nuance.com)
• IRIS Readiris (http://www.irislink.com)
• LEADTOOLS OCR (http://www.leadtools.com)
• OCRopus (https://code.google.com/p/ocropus) : open source
Wikipedia contributors, “Optical character recognition," Wikipedia, The Free Encyclopedia,
http://en.wikipedia.org/wiki/Optical_character_recognition (accessed March 2014).
Wikipedia contributors, “Comparison of optical character recognition software," Wikipedia, The Free
Encyclopedia, http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software
(accessed March 2014).
140. digitization tools
other software
• BagIt (open source) : hierarchical file packaging format for the
exchange of digital content. A "bag" has just enough
structure to safely enclose descriptive "tags" and a
"payload" but does not require any knowledge of the
payload's internal semantics. See
http://sourceforge.net/projects/loc-xferutils and
http://tools.ietf.org/html/draft-kunze-bagit-06.
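The bag layout described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the Library of Congress tooling: the function name `make_bag` and the sample payload are hypothetical, and the `BagIt-Version: 0.97` declaration is an assumption about the version that matches the cited draft. The sketch writes the payload under `data/`, the `bagit.txt` declaration, and an MD5 payload manifest.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal bag: bagit.txt, data/ payload, manifest-md5.txt.

    `payload` maps relative file names to bytes (hypothetical sample input).
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    # Bag declaration; the version number is an assumption, not from the slides.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        # Manifest line: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag
```

A receiver can recompute the checksums of the files under `data/` and compare them with the manifest without knowing anything about what the payload files contain, which is the point of the format.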
141. digitization tools
Discussion questions
1. What software tools does your organization use for
digital projects or digital libraries?
2. Does your organization host a digital library? If so,
does it use Google Analytics or a similar tool? Why
or why not?
3. What software tools does your organization use for
project management? Are the tools web-based?
142. digital preservation
Preservation of software and preservation of data are two sides of
the same coin. From February 2011 Workshop for Digital Curators.
146. digital preservation
long-term, error-free storage of digital
information, with means for retrieval
and interpretation, for the entire time
span the information is required
147. digital data risks
• standards / format obsolescence
• migration to new format, media,
or hardware
• media obsolescence / decay
• bit rot
149. strategies for
format obsolescence
• migrate data to new formats
• create a computer software museum
with virtual machines
• format registries
• format validators
• don’t worry about it!
150. Jeff Rothenberg on
format obsolescence
“... digital documents are evolving so
rapidly that shifts in the forms of documents
must inevitably arise. New forms do not
necessarily subsume their predecessors or
provide compatibility with previous formats.”
Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published
in Scientific American. January 1995. Expanded version published February, 1999.
(accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)
151. standard model
for format obsolescence
• digital format registry collects information about target format
• this information is used to build format identification and
verification tools
• holders of content use these tools to extract metadata from
content in target format; metadata is stored with the content
• format registry scans computing environment to determine
which formats are obsolescent; notifications sent for obsolete
formats
• on receiving such a notification, someone builds a tool to convert
obsolete format to non-obsolete format using the format
specification in the registry
• on receiving such a notification, holder of content in obsolete
format uses conversion tool and content metadata to convert the
file in an obsolete format to a file in a non-obsolete format
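The format identification step in this model can be illustrated with a small sketch. It assumes a hypothetical, hard-coded registry named `SIGNATURES` that maps leading "magic" bytes to format names; a real registry holds far more information, and the names `identify_format` and `identify_file` are invented for this example.

```python
# Hypothetical miniature "format registry": signature bytes -> format name.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
]


def identify_format(header):
    """Return the registry name whose signature starts `header`, or None."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return None


def identify_file(path):
    """Read the first bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

The result would be stored as metadata alongside the content, so that a later obsolescence notification for a given format can be matched to the files it affects.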
152. David Rosenthal on
format obsolescence
“... format obsolescence is a rare problem that
happens infrequently to a minority of
unpopular formats ...”
David Rosenthal. Format obsolescence: Assessing the threat and the defenses.
(accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf)
153. alternate model
for format obsolescence
• store only essential data
• perform only essential tasks
• delay performing tasks as long as possible
David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library
High Tech, Special Issue, vol. 28, no. 2, 2010, pp. 195-210. doi:10.1108/07378831011047613
(accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).
154. importance of standards
vis-a-vis format obsolescence
well-defined standards …
• guide developers in the creation of tools
• facilitate development of a broad range of
tools for any format
• allow developers to maintain existing tools
155. data migration risks
• file format changes, for example, PDF 1.4 to
PDF 1.7
• file name differences, for example, case-sensitive /
case-insensitive names on a new operating
system
• extended file attributes
• file permissions, for example, BSD Unix
drwxr-xr-x@ to Windows file permissions
• soft links / hard links
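The file name risk can be checked before a migration. The sketch below, with the hypothetical function name `case_collisions`, groups names that differ only by letter case; such files coexist on a case-sensitive file system but would overwrite one another on a case-insensitive one.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only by letter case.

    Returns a list of groups; each group is a sorted list of distinct names
    that would map to the same name on a case-insensitive file system.
    """
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Running this over a directory listing before copying gives a list of files to rename first. It covers only the case-sensitivity risk and not attributes, permissions, or links.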
156. media obsolescence
• 5 ¼” floppy disks
• 8 track tapes
• 3 ½” floppy disks
• ZIP drives
• CD-R, CD-RW, Blu-Ray
• DAT tapes
• microfilm
• etc
157. strategies for
media obsolescence
• migrate data to new media, for example,
floppy disks to DVD
• create and maintain a computer hardware
museum
158. media decay
a report by NIST and the Library of Congress says ...
• virtually all CD-Rs tested indicated an estimated life
expectancy beyond 15 years
• only 47 percent of recordable DVDs indicated an
estimated life expectancy beyond 15 years, some
had a life expectancy as short as 1.9 years
• in practice actual lifetimes may be considerably
shorter
159. prevention / detection
of media decay
• proper storage
• data file checksums (MD5, SHA-1, ...)
• monitor media integrity
• migrate data from old media to new media
160. bit rot
gradual decay of data due to …
• storage media failure because of media quality
• storage media failure because of improper storage
• random events (bit-flip, environmental influences)
• software / hardware errors
161. prevention / detection
of bit rot
• data file fixity check (checksums) such as MD5,
SHA-1, ...
• monitor file integrity with frequent, corrective
audits
• duplicate copies, geographically distributed
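A fixity check of this kind can be sketched with Python's standard `hashlib` module. The function names `file_digest`, `record_fixity`, and `audit_fixity` are hypothetical, and SHA-1 is used only because the slide names it; this is one possible arrangement and not a prescribed procedure.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Checksum a file in chunks so large images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(root):
    """Return {relative path: checksum} for every file under `root`."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit_fixity(root, recorded):
    """Compare current checksums with recorded ones; return problem files."""
    current = record_fixity(root)
    missing = sorted(set(recorded) - set(current))
    changed = sorted(
        name for name in recorded
        if name in current and current[name] != recorded[name]
    )
    return {"missing": missing, "changed": changed}
```

The recorded checksums would be stored separately from the data, and any file reported as missing or changed would be restored from one of the duplicate copies.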
162. distributed decentralized
digital preservation
• the more copies, the safer the data
• the more independent copies, the safer the
data
• the more frequently copies are audited, the
safer the data
Paraphrased from David Rosenthal. Keeping bits safe: How hard can it be?
163. distributed decentralized
digital preservation
• n+1 copies are safer than n copies
• n independent copies on different storage
devices / media are safer than n copies on similar
or identical storage devices / media
• data audited every week is safer than data audited
every month
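The first two points can be made concrete with a little arithmetic. The sketch below assumes that each copy is lost independently with the same probability, which is a simplification, and the function name `loss_probability` is hypothetical. Under that assumption the chance of losing every copy is the per-copy probability raised to the number of copies.

```python
def loss_probability(p_single, copies):
    """Probability that every copy is lost, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies
```

With an illustrative 10 percent chance of losing any one copy (a figure chosen for the example, not taken from the source), three independent copies give roughly a 0.1 percent chance of losing all three. Copies on similar or identical media tend to fail together, so the independence assumption no longer holds and the real risk is higher than this formula suggests.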
164. LOCKSS
Lots Of Copies Keep Stuff Safe
LOCKSS box: Open source LOCKSS software installed on a
dedicated computer or virtual machine.
• It ingests content from target websites using a web crawler similar to those used by
search engines.
• It preserves content by continually comparing the content it has collected with the
same content collected by other LOCKSS Boxes, and repairing any differences.
• It delivers authoritative content to readers by acting as a web proxy, cache or via
Metadata resolvers when the publisher’s website is not available.
• It provides management through a web interface that allows librarians to select new
content for preservation, monitor the content being preserved and control access to the
preserved content.
• It dynamically migrates content to new formats as needed for display.
From LOCKSS webpages http://www.lockss.org.
165. how LOCKSS works
data copied to another LOCKSS box
(diagram: LOCKSS boxes at my library, library X, and library Y; data)
166. how LOCKSS works
data audited
(diagram: LOCKSS boxes at my library, library X, and library Y; audit; data)
167. how LOCKSS works
data audited
(diagram: LOCKSS boxes at my library, library X, and library Y; audit fails at one box, audit ok at another)
168. how LOCKSS works
data copied to another LOCKSS box
(diagram: LOCKSS boxes at my library, library X, and library Y; data)
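The copy, audit, and repair cycle shown in these diagrams can be illustrated with a toy majority vote. This is only a sketch: the real LOCKSS polling protocol is considerably more involved, and the function name `audit_and_repair`, the box names, and the use of SHA-1 digests are assumptions made for the example.

```python
import hashlib
from collections import Counter


def audit_and_repair(boxes):
    """Repair copies that disagree with the majority.

    `boxes` maps a box name to the bytes it holds for one item. Returns the
    names of the boxes that were repaired. Raises if no strict majority exists.
    """
    digests = {name: hashlib.sha1(data).hexdigest() for name, data in boxes.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(boxes):
        raise ValueError("no majority; cannot decide which copy is good")

    good_copy = next(data for name, data in boxes.items() if digests[name] == winner)
    repaired = []
    for name in boxes:
        if digests[name] != winner:
            boxes[name] = good_copy  # copy data from an agreeing box
            repaired.append(name)
    return repaired
```

With three boxes, one damaged copy is outvoted by the two that agree and is replaced from one of them. With only two boxes a disagreement cannot be resolved, which is one reason more copies are safer.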
169. private LOCKSS networks
Alabama Digital Preservation
Network (http://www.adpn.org/).
CLOCKSS (Controlled LOCKSS), a non-profit collaboration
of North American, European, and Asian cultural heritage
institutions whose purpose is to preserve digital content with
LOCKSS (http://www.clockss.org).
MetaArchive Cooperative is a digital preservation
cooperative created by cultural heritage institutions
(http://www.metaarchive.org).
170. digital preservation references
• Nancy McGovern and Katherine Skinner, editors. Aligning National Approaches to
Digital Preservation. Educopia Institute Publications. Atlanta, Georgia. 2012.
Proceedings of a conference on digital preservation held at the National Library of
Estonia in May 2011. (accessed 15 August 2012 at
http://www.educopia.org/sites/default/files/ANADP_Educopia_2012.pdf).
• David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library
High Tech, Special Issue, v. 28, n. 2, 2010, pp. 195-210. doi:10.1108/07378831011047613
(accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).
• David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM,
v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at
http://lockss.org/locksswiki/files/ACM2010.pdf).
• Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published
in Scientific American, January 1995. Expanded version published February 1999.
(accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf).
• Joint Information Systems Committee (JISC) Programme on Digital Preservation at
http://www.jisc.ac.uk/preservation.
• Library of Congress on Digital Preservation at http://www.digitalpreservation.gov.
• Stanford University’s website for LOCKSS at http://www.lockss.org.
171. newspaper digitization programs around
the world
National Library of Finland (http://digi.kansalliskirjasto.fi/)
British Newspaper Archives, British Library (http://www.bl.uk/welcome/newspapers)
National Digital Newspaper Program, Library of Congress
(http://chroniclingamerica.loc.gov/)
National Library of New Zealand (http://paperspast.natlib.govt.nz/)
National Library of Australia, Australian Digital Newspapers Program
(http://trove.nla.gov.au/newspaper)
Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/)
Singapore National Library Board (http://newspapers.nl.sg/)
Bibliothèque nationale de France (http://gallica.bnf.fr/)
Europeana Newspapers Project, a collaboration of 17 organizations
(http://www.europeana-newspapers.eu/)
National Library of Latvia (https://periodika.lndb.lv/)
172. • Library of Congress National Digital Newspaper
Program http://www.loc.gov/ndnp/
• Australian Newspaper Digitisation Program
http://www.nla.gov.au/content/newspaper-digitisation-program
• IFLA Newspapers Section Digitisation projects
and best practices http://www.ifla.org/node/6777
• ICON: International Coalition on Newspapers
http://icon.crl.edu/digitization.htm
173. • METS, MODS, ALTO, PRISM, and other library standards :
http://www.loc.gov/standards
• OAIS : http://public.ccsds.org/publications/RefModel.aspx
• NISO standards and guidelines : http://www.niso.org/publications/rp
• Good practice guides : http://www.ukoln.ac.uk
• And many, many more
174. Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia,
https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).
175. ?!
Frederick Zarndt
Secretary, IFLA Newspapers Section
frederick@frederickzarndt.com
Photo held by John Oxley Library, State Library of Queensland. Original from
Courier-mail, Brisbane, Queensland, Australia.