Newspaper 
digitization 
Frederick Zarndt 
IFLA Newspapers Section 
frederick@frederickzarndt.com 
@cowboyMontana 
hashtag #IFLAnewspaper
the agenda
1. Introductions
2. Review of the OAIS reference model
3. Newspaper digitization programs
4. Selection of materials
5. Importance of standards
6. Project management
7. Digitization workflow
7.1. Images
7.2. Metadata
7.3. File formats
8. Digitization workflow demonstration with docWorks
9. Quality assurance and acceptance criteria
10. Tools for digitization, workflow, digital preservation, and project management
11. Digital preservation considerations
12. Wrap-up
Breaks: 10.30 Morning tea break, 13.00 Lunch, 15.30 Afternoon tea break
An Open Archival Information System (or 
OAIS) is an archive, consisting of an 
organization of people and systems, that has 
accepted the responsibility to preserve 
information and make it available for a 
Designated Community. 
Wikipedia contributors, "Open Archival Information System," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).
Open Archival Information System (OAIS) 
reference model 
• Negotiate for and accept appropriate information from information Producers. 
• Obtain sufficient control of the information provided to the level needed to ensure 
Long-Term Preservation. 
• Determine, either by itself or in conjunction with other parties, which communities 
should become the Designated Community and, therefore, should be able to 
understand the information provided. 
• Ensure that the information to be preserved is Independently Understandable to the 
Designated Community. In other words, the community should be able to 
understand the information without needing the assistance of the experts who 
produced the information. 
• Follow documented policies and procedures which ensure that the information is 
preserved against all reasonable contingencies, and which enable the information to 
be disseminated as authenticated copies of the original, or as traceable to the 
original. 
• Make the preserved information available to the Designated Community. 
Wikipedia contributors, "Open Archival Information System," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).
Open Archival Information System (OAIS) 
reference model
programs
National 
Collaborative 
Individual 
programs
national programs 
national: centrally funded and managed 
programs with several participants. strict 
standards. 
• National Digital Newspaper Program 
(Library of Congress) 
• Australian Newspaper Digitisation 
Program 
programs
cooperative programs 
cooperative: organizations collaborate 
to achieve a common goal but 
digitization programs are managed 
separately. flexible standards. 
• Europeana newspapers 
• Digital Public Library of America 
programs
individual programs 
individual: organization digitizes on its own. 
may or, more usually, does not follow open 
standards. all commercial organizations. 
• ProQuest Historical Newspapers 
• Newspapers.com 
• Newsbank 
• many others… 
programs
programs 
• digitization program requires 
careful thought 
• must be adapted to local 
circumstances 
• ask those who have gone before 
• join the IFLA Newspapers 
Section! (ask me how) 
Image courtesy of Donald Zolan.
? programs ? 
Discussion questions 
1. Has your organization already begun to digitize 
newspapers? How is the digitization program 
organized and funded? 
2. If your organization hasn’t yet begun to digitize 
newspapers, what type of digitization program 
would best suit your organization / state / 
country? Why?
Experience is that marvelous thing 
that enables you to recognize a 
mistake when you make it again. 
F. P. Jones
selection
reasons for digitization 
newspapers are deteriorating 
microfilm is dissolving 
no storage space 
selection
access 
• Who are your users? Do you know? 
• Can you ask them what they expect 
from a digital newspaper collection? 
Can you trust their answers? 
• Trove, Papers Past, Cambridge 
Public Library, CDNC: These digital 
newspaper collections are used 
mostly by people 50+ years old and 
with an interest in family history. 
? 
selection
Library of Congress selection criteria for the 
National Digital Newspaper Program (NDNP) 
selection 
• Image quality 
• Intellectual content 
• Refinements 
http://www.loc.gov/ndnp/guidelines/selection.html
selection for NDNP 
Image quality 
All NDNP newspaper images are scanned from microfilm. 
1. Microfilm should be produced from properly prepared 
unbound originals. 
2. Microfilm reduction ratio should be less than 20x. This allows 
400dpi images to be scanned from the film. 
3. Variations in microfilm density within and between images 
should be no more than 0.2. 
4. Negative microfilm duplicated for scanning should have 
resolution test patterns readable at 5.0 or higher. For camera 
master microfilm without resolution test charts, resolution 
can be estimated by comparison to film with resolution test 
charts and original material. 
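The link between reduction ratio and achievable image resolution can be sketched with a little arithmetic. This is a minimal illustration, not part of the NDNP guidelines: it assumes the 400dpi target is measured relative to the original page, so the scanner must resolve target dpi × reduction ratio on the film itself. The function name and the 12x and 16x ratios are hypothetical.

```python
def film_scan_ppi(target_dpi: float, reduction_ratio: float) -> float:
    """Pixels per inch the scanner must resolve on the film itself.

    Assumption: target_dpi is measured against the original newspaper page,
    and the film image is reduction_ratio times smaller than the original.
    """
    if target_dpi <= 0 or reduction_ratio <= 0:
        raise ValueError("target_dpi and reduction_ratio must be positive")
    return target_dpi * reduction_ratio


# Hypothetical reduction ratios; 20x is the limit named in the criteria above.
for ratio in (12, 16, 20):
    print(f"{ratio}x reduction: scan the film at {film_scan_ppi(400, ratio):,.0f} ppi")
```

The higher the reduction ratio, the finer the detail the film scanner has to resolve, which is why a ceiling on the ratio appears among the image quality criteria.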
selection
selection for NDNP 
Intellectual content 
1. Newspaper title reflects the political, economic and cultural 
history of the State. 
2. Selected newspaper titles should ensure broad geographical 
coverage. 
3. Newspaper titles that provide coverage of a geographic area or a 
group over long time periods are preferred over short lived titles 
or titles with significant gaps. 
selection
selection for NDNP 
Selection criteria refinements 
1. Orphan titles: Special consideration should be given to high 
research value titles that have ceased publication and lack active 
ownership. 
2. Newspaper titles that document a significant (minority) 
community at the state or regional level may be given special 
consideration. 
3. Newspapers which have already been digitized by other 
organizations (for example, ProQuest) should not be digitized 
again. 
selection
selection for ANDP 
National Library of Australia collection managers in 
consultation with staff from Preservation Services nominate 
materials for digitization. The Library works closely with state 
and territory libraries to systematically digitise newspapers 
held in these libraries. Selected newspapers include those with 
• Cultural and/or historical significance 
• Uniqueness and/or rarity of the material 
• Copyright status or permission to digitise obtained 
• Material in high demand 
• Material at risk because of its physical condition 
selection 
https://www.nla.gov.au/policy-and-planning/collection-digitisation-policy
copyright 
Most newspaper titles selected 
for digitization are out of 
copyright and in the public 
domain. Negotiating use rights is 
quite simply too much trouble and 
fraught with legal pitfalls. 
Copyright laws and policies vary considerably between countries. 
…however… 
Digitization and public access to 
in-copyright newspapers is not 
impossible. 
? selection ? 
Discussion questions 
1. Has your organization already selected 
newspapers to digitize? Why did it choose the 
titles that were selected? Please answer 
(hypothetically) if your organization hasn’t 
begun a newspapers digitization program. 
2. Why would or why wouldn’t your organization 
select in-copyright newspapers to digitize?
importance of standards
open standards 
• Availability : Open standards are available for all to read and implement. 
• Maximize end-user choice : Open standards create a fair, competitive market 
for implementation of the standards. They do not lock the customer into a 
particular vendor or group. 
• No royalty : Open standards are free for all to implement, with no royalty or 
fee. 
• No discrimination : Open standards and the organizations that administer 
them do not favor one implementor over another for any reason other than 
the technical standards compliance of a vendor's implementation. 
• Extension or subset : Implementations of open standards may be extended, or 
offered in subset form. However, certification organizations may decline to 
certify subset implementations, and may place requirements upon extensions. 
• Predatory practices : Open standards may employ license terms that protect 
against subversion of the standard by embrace-and-extend tactics. The 
licenses attached to the standard may require the publication of reference 
information for extensions, and a license for all others to create, distribute 
and sell software that is compatible with the extensions. An open standard 
may not otherwise prohibit extensions. 
importance of standards 
Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards
open standards 
• Not restrictive : Less chance of being locked in by a specific 
technology and/or vendor. 
• Interoperable : Easier for systems from different parties or 
using different technologies to interoperate and communicate 
with one another. 
• Protection against obsolescence : Better protection of the data 
files created by an application against obsolescence. 
• Portable : Applications / data are easier to port from one 
platform to another since they follow known guidelines, 
rules, and interfaces. 
Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards 
newspapers and standards 
What standards are important for newspaper digitization? 
• METS XML is an open standard administered by the METS editorial 
board. See http://www.loc.gov/standards/mets/. 
• ALTO XML is an open standard administered by the ALTO editorial 
board. See http://www.loc.gov/standards/alto/. 
• Various image file formats including TIFF, JPEG, JPEG2000. 
• PDF/A is a portable document format developed by Adobe. It is a 
subset of the complete PDF specification and has been adopted by 
ISO as a standard. See http://www.pdfa.org/. 
• Various library metadata standards including, but not limited to 
• MODS XML http://www.loc.gov/standards/mods/ 
• Dublin Core http://dublincore.org/ 
• PREMIS http://www.loc.gov/standards/premis/ 
importance of standards
importance of standards 
with few exceptions 
libraries use METS XML + 
ALTO XML + image files (TIFF, 
JPEG2000) for newspaper 
digitization programs 
importance of standards
proprietary standards 
Olive ActivePaper Archive stores historical 
newspaper data in an XML format that is as 
capable as METS/ALTO XML but is not an 
open standard. 
Early versions of WordPerfect (MS Word 
too) stored data in a proprietary format, not 
in an open standard like Open Document 
Format (ODF). WordPerfect or special 
software is needed to view the files. 
Adobe’s Flash is a de facto but not an open 
standard. Flash now appears to be on a path 
to obsolescence, destined to be replaced by 
HTML5. 
importance of standards
? importance of standards ? 
Discussion questions 
1. Name a few standards that you use every time 
you connect to the Internet. 
2. What library standards does your organization 
currently use? What other, non-library 
standards, if any, does your organization use?
In theory, there's no difference 
between theory and practice, but in 
practice, there is. 
Anonymous
project management
From the Standish Group’s 2012 Chaos Report on IT Project Failure. 
project management
high cost of IT failure 
Roger Sessions estimates that the worldwide cost of 
IT failure is USD $500 billion per month 
Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple 
Architectures for Complex Enterprises and many articles. He is a founding member 
of the Board of Directors of the International Association of Software Architects. 
project management
in a recent survey of 1230 IT professionals 
conducted by Embarcadero Technologies, 2 of the 
3 biggest project challenges cited by the IT pros are 
“poor planning” and “poor or no requirements” 
plan! 
project management
in a March 2007 web poll conducted by the 
Computing Technology Industry Association "nearly 
28 percent of the more than 1,000 respondents 
singled out poor communications as the number one 
cause of project failure" 
communicate! 
project management
A recent survey of 752 IEEE members conducted by IEEE 
Spectrum and The New York Times discovered that "just 9 
percent of 133 respondents whose organizations currently 
offshore R&D reported 'No problem'. The biggest 
headache was 'Language, communication, or culture' 
barriers, as reported by 54.1 percent of respondents." 
(http://www.spectrum.ieee.org/feb07/4881) 
communicate! 
project management
In their 2009 book Cultural Intelligence: Living and Working 
Globally, Thomas and Inkson say “Although we increasingly 
cross boundaries and surmount barriers to trade, migration, 
travel, and the exchange of information, cultural boundaries 
are not so easily bridged. Unlike legal, political, or economic 
aspects of the global environment, which are observable, 
culture is largely invisible. Therefore, culture is the aspect of 
the global context that is most often overlooked.” 
communicate! 
project management
plan! 
Taimour al Neimat. Why IT Projects Fail. The PROJECT PERFECT White Paper 
Collection. Oct 2005. http://www.projectperfect.com.au/downloads/Info/info_it_projects_fail.pdf 
(accessed Mar 2014). 
in a white paper written for Project Perfect by Taimour 
al Neimat, he lists 
• poor planning 
• unclear goals and objectives 
• objectives changing during the project 
• unrealistic time or resource estimates 
• lack of executive support and user involvement 
• failure to communicate and act as a team 
• inappropriate skills 
as primary causes for the failure of complex IT projects
typical tender evaluation criteria in priority order 
1. understanding of requirements 
2. reputation of service bureau 
3. price 
requirements? 
project management
incomplete requirements 
requirements in recent tender from an 
(anonymous) government agency somewhere in the 
world 
• project to convert ~ 170,000 text images to xml 
• value of project ~ USD $180,000 
• 19 pages of definitions, governing law, proposal 
evaluation criteria, contractual conditions, 
instructions about tender response format, etc 
• technical requirements description? < 1 page 
• data acceptance criteria? “a high level of 
accuracy” 
project management
complete requirements 
Library of Congress JPEG2000 profile 
project management
a recent newspapers digitization program 
established by a prominent national library 
• digitize more than 20 million text pages 
• high level image and xml requirements 
• value of work awarded? > USD $5,000,000 
• after award of work, technical requirements 
expand to 43+ pages from ~3 pages 
• acceptance criteria? added as an afterthought and 
not well defined 
project management 
poor planning
the value of simplicity 
“There are two ways of constructing a software 
design: one way is to make it so simple that there 
are obviously no deficiencies and the other way is 
to make it so complicated that there are no obvious 
deficiencies.” 
C.A.R. Hoare 
Professor Sir Charles Anthony Richard Hoare Emeritus Professor at Oxford 
University, Senior Researcher at Microsoft Research, recipient of the ACM 
Turing Award, author of many books on computers and software. 
project management
• unitary: the requirement addresses one and only one 
thing 
• complete: the requirement is fully stated in one place 
with no missing information 
• consistent: the requirement does not contradict any 
other requirement and is fully consistent with all 
authoritative external documentation 
• atomic: it does not contain conjunctions, for example, 
"the code field must validate American and Canadian 
postal codes" should be written as two separate 
requirements 
project management 
good requirements
• traceable: the requirement meets all or part of a 
business need as stated by stakeholders and 
authoritatively documented 
• current: the requirement has not been made obsolete 
by the passage of time 
• feasible: the requirement can be implemented within 
the constraints of the project 
• unambiguous: the requirement is concisely stated 
without recourse to technical jargon, acronyms 
• verifiable: the implementation of the requirement 
can be determined through one of four possible 
methods: inspection, demonstration, test, or analysis 
project management 
good requirements
project management
simple principles for (good) 
communication 
• be impeccable with your word 
• don’t take anything personally 
• don’t make assumptions 
• always do your best 
• be mindful
why (better) communication is 
necessary 
no communication ... 
little communication ... 
poor communication ... 
reduced communication ... 
... all result in more assumptions about intent!
The single biggest problem 
with communication is the 
illusion that it has taken place. 
George Bernard Shaw, 1925 Nobel Prize in Literature.
project management 
“projects are about communication, 
communication, and communication” 
Elenbaas, B. Staging a Project: Are You Setting Your Project 
Up for Success? Proceedings of the Project Management 
Institute Annual Seminars & Symposiums. 2000.
the value of prototypes / pilots 
“Plan to throw one away; you will anyhow. If there is 
anything new about the function of a system, the first 
implementation will have to be redone completely to achieve 
a satisfactory (i.e., acceptably small, fast, and maintainable) 
result. It costs a lot less if you plan to have a prototype.” 
Butler Lampson 
Butler Lampson was a founding member of Xerox PARC, worked for DEC, 
and now works at Microsoft Research. He is an adjunct professor at MIT 
and an ACM Fellow. 
project management
implement: pilot 
create requirements and acceptance criteria 
repeat 
{ 
digitize (small) pilot batch 
test data against acceptance criteria 
adjust requirements and acceptance criteria 
} 
until (no more adjustments are necessary) 
digitize more data 
pilot batches are VERY VERY important!! 
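The pilot loop above can be sketched in Python. This is only an illustration of the control flow: the function names, the `dpi` requirement, and the 300 dpi acceptance threshold are all hypothetical stand-ins for real production, testing, and review steps.

```python
from typing import Callable, Dict, List, Tuple


def run_pilot(requirements: Dict,
              digitize_batch: Callable[[Dict], List[Dict]],
              find_problems: Callable[[List[Dict], Dict], List[str]],
              adjust: Callable[[Dict, List[str]], Dict],
              max_rounds: int = 10) -> Tuple[Dict, int]:
    """Repeat small pilot batches until one passes the acceptance test."""
    for round_no in range(1, max_rounds + 1):
        batch = digitize_batch(requirements)          # digitize (small) pilot batch
        problems = find_problems(batch, requirements)  # test against acceptance criteria
        if not problems:                               # no more adjustments are necessary
            return requirements, round_no
        requirements = adjust(requirements, problems)  # adjust requirements and criteria
    raise RuntimeError("pilot did not converge; revisit the requirements")


# Toy stand-ins (hypothetical): the only requirement is scan resolution.
def toy_digitize(req):
    return [{"dpi": req["dpi"]} for _ in range(3)]

def toy_find_problems(batch, req):
    return ["resolution below 300 dpi"] if any(p["dpi"] < 300 for p in batch) else []

def toy_adjust(req, problems):
    return {**req, "dpi": req["dpi"] + 100}

final_requirements, rounds = run_pilot({"dpi": 200}, toy_digitize,
                                       toy_find_problems, toy_adjust)
print(final_requirements, "after", rounds, "pilot rounds")
```

The `max_rounds` guard is a design choice of this sketch: a pilot that never converges usually signals a problem with the requirements rather than with production.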
project management
reasons for in-house production 
• collection cannot be moved 
• collection is badly organized 
• digitization must be done slowly over a long 
period 
• digitization is very simple 
project management 
implement: in-house
reasons for outsourced production 
• originals can’t be scanned in-house because… 
• equipment is too expensive 
• output data is beyond staff experience 
• labor is too expensive 
• large volume of work in a short time 
• insufficient space, infrastructure, or staff 
project management 
implement: outsource
project management tools 
The project management tool one chooses should be 
intuitive, easy to use, and accessible to all. If it isn’t, 
many will avoid / refuse / dislike / resent using it. 
• Discussion of project management tools at 
http://en.wikipedia.org/wiki/Comparison_of_project-management_software 
• List of project management tools at 
http://en.wikipedia.org/wiki/Comparison_of_project-management_software 
project management
? project management ? 
Discussion questions 
1. What project management practices does your 
organization follow? Why? 
2. What library standards does your organization 
currently use? What other, non-library 
standards, if any, does your organization use? 
3. What reasons, in addition to those already cited, 
would your organization have to digitize 
newspapers in-house or to outsource digitization?
“Perfection is attained, not when there 
is nothing left to add, but when there 
is nothing left to take away.” 
Antoine de Saint-Exupéry
digitization workflow
digitization workflow 
• digital library: one or more digital collections
digital library 
digitization workflow
digitization workflow 
• digital library: one or more digital collections 
• digital collection: organized group(s) of digital 
objects
digital collection
digitization workflow 
• digital library: one or more digital collections 
• digital collection: organized group(s) of digital 
objects 
• digital object: a surrogate or digital copy of 
the original source document, for example, a 
newspaper issue
digital object
An example of what 
ALTO makes possible 
The Day book. (Chicago, Ill.), 29 Feb. 1912. Chronicling America: Historic American Newspapers. Lib. of Congress. 
<http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-02-29/ed-1/seq-26/>
digitization workflow 
• digital library: one or more digital collections 
• digital collection: organized group(s) of digital 
objects 
• digital object: a surrogate or digital copy of 
the original source document, for example, a 
newspaper issue 
• metadata: data about data. information about 
a digital object(s) or a digital collection(s) or 
the original source document(s)
metadata 
digitization workflow
• to enhance accessibility 
• to increase collaboration and cooperation 
between libraries and archives around the 
world 
• to promote research 
• to provide opportunities for entrepreneurs 
• other reasons? 
why digitize newspapers? 
digitization workflow
Open Archival Information System (OAIS) 
reference model
digitization workflow
the digitization process 
source → produce images → images → produce digital objects → objects → ingest → preserve → access
the digitization process 
source → produce images → images
standard file formats 
• image file formats 
• TIFF 
• JPEG2000 
• JPEG 
• GIF 
• text file formats 
• PDF, PDF/A, PDF/A-1b, PDF/A-1a 
• TEI XML 
• HTML 
• plain text 
• NITF / NewsML 
• metadata 
• METS 
• MODS / PREMIS / ALTO / MIX ... 
digitization workflow
? image decisions ? 
• image production source materials 
• original documents: better quality, more 
expensive 
• microfiche: poorer quality, less 
expensive, microfiche quality varies 
• bit depth 
• black-and-white (bitonal) 
• greyscale 
• color 
• resolution 
• compression 
• no compression 
• lossless (reversible) 
• lossy (irreversible) 
• image metadata 
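How bit depth and resolution drive file size can be sketched as follows. The page size (15 × 22 inches) and the 400 dpi resolution are hypothetical values chosen for illustration, and the figures are for uncompressed pixel data only, ignoring file headers and metadata.

```python
# Bits stored per pixel for each bit-depth choice listed above.
BITS_PER_PIXEL = {"bitonal": 1, "greyscale": 8, "color": 24}


def uncompressed_megabytes(width_in: float, height_in: float,
                           dpi: int, mode: str) -> float:
    """Size of the raw pixel data in megabytes (1 MB = 1,000,000 bytes)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] / 8 / 1_000_000


# Hypothetical broadsheet page: 15 x 22 inches scanned at 400 dpi.
for mode in BITS_PER_PIXEL:
    size = uncompressed_megabytes(15, 22, 400, mode)
    print(f"{mode:10s} {size:8.1f} MB")
```

Doubling the resolution quadruples the pixel count, so resolution and bit depth together dominate storage planning before any compression is applied.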
digitization workflow
image format comparison 
JBIG (.jbig, .jbg) 
• compression: lossless 
• bit depth: 1-bit 
• metadata: no 
• color management: no 
• 1st public release: 2000? 
JPEG (.jpg, .jpeg) 
• compression: lossy, DCT, RLE, Huffman 
• bit depth: 8-bit, 12-bit, 24-bit 
• metadata: yes 
• color management: yes 
• mime type: image/jpeg, public.jpeg 
• patent: no 
• 1st public release: 1992 
JPEG2000 (.jp2) 
• compression: many lossless and lossy compression algorithms 
• bit depth: 8-bit, 16-bit, color to 48 bits 
• metadata: yes 
• color management: yes 
• mime type: image/jp2, public.jpeg-2000 
• patent: yes but part 1 is patent free 
• 1st public release: 2000 
TIFF (.tiff, .tif) 
• compression: none, LZW, RLE, ZIP, other 
• bit depth: 1, 2, 4, 8, 16, 24, 32 bits 
• metadata: yes 
• color management: yes 
• mime type: image/tiff, public.tiff 
• patent: no 
• 1st public release: 1986 
Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free 
Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats 
(accessed August 1, 2012)
image compression comparison 
                                    The Sacred Heart   Los Angeles Star   Die Susquehanna
                                    Review (300dpi)    (300dpi)           Zeitung (600dpi)
TIFF (uncompressed)                 17.2 MB            87 MB              415.5 MB
TIFF (lossless LZW compression)     10.2 MB            75.8 MB            232.9 MB
JPEG (maximum quality [lossless])   7.0 MB             37.2 MB            101.1 MB
JPEG (medium quality)               1.5 MB             4.6 MB             10.2 MB
JPEG2000 (lossless compression)     7.1 MB             52.7 MB            166.2 MB
JPEG2000 (lossy [70] compression)   5.1 MB             37.1 MB            116.7 MB
JPEG2000 (lossy [30] compression)   2.2 MB             16.1 MB            50.3 MB
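The savings implied by the comparison can be worked out directly. A minimal sketch using only the Sacred Heart Review column from the slide; the function and variable names are illustrative.

```python
# File sizes in MB for The Sacred Heart Review (300dpi), taken from the comparison above.
SIZES_MB = {
    "TIFF (lossless LZW compression)": 10.2,
    "JPEG2000 (lossless compression)": 7.1,
    "JPEG2000 (lossy [30] compression)": 2.2,
}
UNCOMPRESSED_TIFF_MB = 17.2


def saving_percent(baseline_mb: float, size_mb: float) -> float:
    """Percentage of storage saved relative to the uncompressed baseline."""
    return (1 - size_mb / baseline_mb) * 100


for name, size in SIZES_MB.items():
    print(f"{name:36s} saves {saving_percent(UNCOMPRESSED_TIFF_MB, size):.1f}%")
```

The same calculation on the other two titles gives different percentages, a reminder that compression ratios depend heavily on the source material.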
image bit depth comparison 
                                          USA case law image 1   USA case law image 2
                                          (300dpi)               (300dpi)
TIFF 1-bit CCITT G4 compression           40 KB                  87 KB
JPEG2000 W5x3 reversible compression      2.6 MB                 3.6 MB
JPEG2000 W9x7 irreversible compression    647 KB                 1 MB
GIGO 
GARBAGE IN, GARBAGE OUT 
Image courtesy of http://epsos.de (accessed at http://commons.wikimedia.org March 
2014).
raw OCR text 
Deaths. lln»rieff, Esq. of <c .. Qn. 
Sunday, the till. greatly Drandrellt, of 
Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn 
l j j j i l F i i j ' 1 1 f Havodiv y d, 
Carnarvonshire, S ; **" *- ' « ' March 
Oxford, F. Tfovmeud, Uerald. » • V . 
•On Tncsdav last , Mr . Charles. 
IWilinson, this 8 ; had vf thesis#,, a week 
ago, which tcrminate<i'iu his death. . / ' ■ 
O'i Sunday, dJst nit. at. AsbtCnvHall, 
mar Lancaster, Mr.,Geo. Worn ick, 
many years house'steward hit late Once 
The Hamilton and Brandon. He locked 
himself h»oWn'r«wte<: soon. twelve 
o'clock" that dny, and fii»-d a loaded pistol 
" t h r o u g h I n s b e a d , 1 w h i c h 
instantaneously killed him. Coronet's 
Verdict, shot himself in a temporary fit of 
Friday week, 
newspaper image 
Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
? digitization workflow ? 
Discussion topics 
1. Assume your organization decides to digitize 1000 
newspaper issues averaging 12 pages per issue. The 
images are scanned 2-up and average 80MB each. 
How much disk storage is needed for the images? 
2. Now assume instead that your organization uses 
TIFF images with LZW (lossless) compression, 
which saves on average 40%. How much disk 
storage is needed for the images?
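One way to work through the storage questions is sketched below. It rests on two assumptions that are not stated on the slide: "2-up" is read as two newspaper pages per image, and 1 GB is taken as 1000 MB. The function name is illustrative.

```python
import math


def storage_gb(issues: int, pages_per_issue: int, pages_per_image: int,
               mb_per_image: float, saving: float = 0.0) -> float:
    """Estimated disk storage in GB (1 GB = 1000 MB) for the master images."""
    images = math.ceil(issues * pages_per_issue / pages_per_image)
    return images * mb_per_image * (1 - saving) / 1000


uncompressed = storage_gb(1000, 12, 2, 80)            # question 1
with_lzw = storage_gb(1000, 12, 2, 80, saving=0.40)   # question 2
print(f"uncompressed: {uncompressed:.0f} GB, with LZW: {with_lzw:.0f} GB")
```

Under these assumptions the answers come to 480 GB and 288 GB. Derivative images, backups, and preservation copies would add to both figures.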
why (better) communication 
is necessary
the digitization process 
images → produce digital objects → objects
the digitization process 
images → image processing → layout analysis → OCR → metadata → build digital objects → objects
the digitization process: image processing 
images → image processing → layout analysis → OCR → metadata → build digital objects → objects 
• crop, de-skew, split images 
• apply image improvement algorithms as 
needed 
• sharpening filters 
• local adaptive thresholding 
• remove text bleed-thru 
• etc 
• create master images 
• create working images
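Local adaptive thresholding, one of the image improvement steps listed above, can be sketched in pure Python. This is a simple mean-based variant chosen for illustration, not necessarily the algorithm any production tool uses; the `radius` and `offset` parameters and the sample pixel values are hypothetical.

```python
def adaptive_threshold(gray, radius=1, offset=10):
    """Binarize a greyscale image (list of rows, values 0-255).

    A pixel becomes ink (0) when it is darker than the mean of its local
    neighbourhood minus `offset`; otherwise it becomes paper (255).
    """
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# Hypothetical strip of a page: bright paper on the left (200), stained paper on
# the right (110), with one dark stroke in each half (120 and 40).
page = [[200, 120, 200, 110, 40, 110] for _ in range(3)]
for row in adaptive_threshold(page):
    print(row)
```

A single global threshold of, say, 115 would turn the stained paper on the right into ink; comparing each pixel with its own neighbourhood keeps both strokes and drops the stain.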
what’s wrong 
with this 
image?
text is skewed 
about 1° from 
vertical
text is de-skewed 
text is skewed
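A 1° skew sounds negligible, but a short calculation shows why de-skewing matters for layout analysis. The page dimensions are borrowed from the ALTO sample elsewhere in this deck (7136 × 9224 pixels) purely for illustration; the function name is hypothetical.

```python
import math


def skew_drift_px(length_px: float, skew_degrees: float) -> float:
    """Sideways drift, in pixels, accumulated over `length_px` at the given skew."""
    return length_px * math.tan(math.radians(skew_degrees))


page_width, page_height = 7136, 9224
print(f"column edge drift over page height: {skew_drift_px(page_height, 1.0):.0f} px")
print(f"text line drift over page width:    {skew_drift_px(page_width, 1.0):.0f} px")
```

On a page of this size the drift is well over 100 pixels in each direction, which is more than the height of a typical text line, so column and line boundaries found on the skewed image would cut through text.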
the digitization process: layout analysis 
images → image processing → layout analysis → OCR → metadata → build digital objects → objects 
• analyze layout of text image 
• estimate font types and sizes 
• calculate coordinates of text blocks 
• determine layout object types (text, 
illustration, headline, etc)
newspaper text layout analysis
the digitization process: OCR 
images → image processing → layout analysis → OCR → metadata → build digital objects → objects 
• perform optical character recognition (OCR) 
• calculate word and character coordinates 
• calculate word and character confidences 
• apply language dictionaries 
• correct OCR text (optional)
the digitization process: metadata 
images → image processing → layout analysis → OCR → metadata → build digital objects → objects 
• populate metadata fields 
• verify / correct page numbers 
• verify / correct document structure
the digitization process: build digital objects 
images → image processing → layout analysis → OCR → metadata → build digital objects → objects 
• create METS / ALTO XML files 
• create image files and image metadata 
• create PDF files (if required) 
• verify digital object 
• calculate file fixity checks (checksums) 
• perform file validation and verification 
• perform quality assurance
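The fixity check in the list above can be sketched with Python's standard `hashlib` module. SHA-256 is an assumption here, since the slide does not name an algorithm, and the function names are illustrative.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks so large images fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex, algorithm="sha256"):
    """True when the file's current checksum matches the recorded one."""
    return file_checksum(path, algorithm) == expected_hex.lower()
```

Typically the checksum is calculated when the digital object is built and recorded alongside it; calling `verify` after every transfer or at intervals during storage then detects silent corruption.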
real world 
digitization 
production 
workflow 
• automatic production 
steps performed by 
software 
• manual production steps 
performed by operators
digital library standards 
• METS XML for descriptive, structural, technical, and 
administrative metadata 
• descriptive metadata 
• Metadata Object Description Schema (MODS): selected metadata from MARC 
• Dublin Core: fundamental group of text elements for 
describing and cataloging 
• technical metadata 
• technical metadata 
• ALTO for OCR text 
• PREMIS for digital preservation 
• MIX and ANSI/NISO Z39.87 for images
Metadata Encoding and 
Transmission Standard 
• METS is an XML standard for encoding descriptive, administrative, 
and structural metadata about objects within a digital library 
• METS files consist of 7 sections: header, descriptive metadata, 
administrative metadata, file section, structural map, structural 
links, and behavior. Only the structural map is required. 
• METS profiles describe a class of METS documents in sufficient 
detail to provide both document authors and programmers the 
guidance to create and process METS documents conforming with a 
particular profile 
• current version 1.9.1 
• administered by METS editorial board (international group of 
volunteers) 
• standards hosted by Library of Congress at http://www.loc.gov/ 
standards/mets/
METS file structure 
Graphic from Karin Bredenberg, Communicating Archival Metadata conference and 
workshops. Riksarkivet, 2011.
Metadata Object Description Schema 
• MODS is an XML schema for a bibliographic element set that may 
be used for library applications. Derivative of MARC 21 
bibliographic format. Includes a subset of MARC fields, using 
language-based tags rather than numeric ones 
• Subset of MARC 21 
• Mappings exist between MODS and MARC, Dublin Core, and RDA 
(conversion tools exist) 
• May be used in conjunction with METS XML 
• current version 3.4 
• administered by Library of Congress Network Development and 
MARC Standards Office with help from interested users 
• standards hosted by Library of Congress at http://www.loc.gov/ 
standards/mods/
MODS metadata in METS XML 
<mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
        <mods:language>
          <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
        </mods:language>
        <mods:genre>newspaper issue</mods:genre>
        <mods:originInfo>
          <mods:dateIssued>18740425</mods:dateIssued>
        </mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:genre>newspaper</mods:genre>
          <mods:identifier>ISSN18368190</mods:identifier>
          <mods:part>
            <mods:detail type="volume">
              <mods:number>IX</mods:number>
            </mods:detail>
          </mods:part>
          <mods:part>
            <mods:detail type="issue">
              <mods:number>12</mods:number>
            </mods:detail>
          </mods:part>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
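Reading MODS metadata out of a METS descriptive section can be sketched with Python's standard `xml.etree.ElementTree`. The embedded sample is a shortened copy of the record above with the `mets` and `mods` namespaces declared on the root so that it parses on its own; the function name and the returned dictionary keys are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Shortened, self-contained copy of the dmdSec shown above.
SAMPLE = """\
<mets:dmdSec xmlns:mets="http://www.loc.gov/METS/"
             xmlns:mods="http://www.loc.gov/mods/v3"
             ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods>
        <mods:originInfo>
          <mods:dateIssued>18740425</mods:dateIssued>
        </mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
"""


def issue_summary(xml_text):
    """Pull the host title and issue date out of a MODS-in-METS section."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext(
            ".//mods:relatedItem[@type='host']/mods:titleInfo/mods:title",
            namespaces=NS),
        "date_issued": root.findtext(
            ".//mods:originInfo/mods:dateIssued", namespaces=NS),
    }


print(issue_summary(SAMPLE))
```

Because MODS uses language-based element names, the lookup paths read almost like the field names a cataloguer would expect.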
Dublin Core metadata 
• Dublin Core is a set of vocabulary terms used to describe 
resources for the purposes of discovery. 
• Dublin Core metadata element set is endorsed in IETF RFC 
5013, ISO 15836-2009, and NISO Z39.85 
• Metadata terms last updated 14-Jun-2012 
• May be used in conjunction with METS XML 
• Dublin Core Metadata Initiative (DCMI) is an open 
organization, incorporated as a public, not-for-profit company 
in Singapore 
• Dublin Core Metadata Initiative is hosted at http:// 
dublincore.org/
Analyzed Layout and Text Object 
• ALTO XML provides technical metadata for describing the layout 
and content of physical text resources, such as pages of a book or a 
newspaper 
• commonly used in conjunction with METS XML but may be used 
standalone 
• current version 2.1 
• administered by ALTO editorial board (international group of 
volunteers) 
• standards hosted by Library of Congress at http://www.loc.gov/ 
standards/alto/
<?xml version="1.0" encoding="UTF-8"?> 
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-4.xsd" xmlns:xlink="http://www.w3.org/1999/xlink"> 
<Description> 
  <MeasurementUnit>pixel</MeasurementUnit>
  <sourceImageInformation>
    <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName>
  </sourceImageInformation>
</Description>
<Styles>
  <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
  <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
</Styles>
<Layout>
  <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967">
    <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/>
    <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/>
    <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/>
    <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/>
    <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194">
      <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831">
        <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831">
          <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT">
            <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
              <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/>
              <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
              <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/>
            </TextLine>
          </TextBlock>
        </ComposedBlock>
        <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/>
        <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
            <TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46">
              <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/>
              <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/>
              <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/>
            </TextLine>
          </TextBlock>
        </ComposedBlock>
      </ComposedBlock>
    </PrintSpace>
  </Page>
</alto> 
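To show how the layout and confidence data in an ALTO file can be consumed, here is a minimal sketch that reads each TextLine and reports its text together with the mean word confidence (the WC attribute). The embedded sample is a cut-down version of the file above; the helper names `local_name` and `read_lines` are assumptions for this example, and real files may carry an XML namespace, which the sketch strips.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="P1_TB00004">
          <TextLine ID="P1_TL00065">
            <String CONTENT="The" WC="0.98"/>
            <SP/>
            <String CONTENT="Queenslander." WC="0.96"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop any '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def read_lines(alto_xml):
    """Return (line_text, mean_word_confidence) for each TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        strings = [e for e in element if local_name(e.tag) == "String"]
        words = [s.get("CONTENT", "") for s in strings]
        scores = [float(s.get("WC")) for s in strings if s.get("WC")]
        mean_wc = sum(scores) / len(scores) if scores else None
        lines.append((" ".join(words), mean_wc))
    return lines


alto_lines = read_lines(SAMPLE)
for text, confidence in alto_lines:
    print(text, confidence)
```

Lines with a low mean confidence are natural candidates for manual review during quality assurance.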
Analyzed Layout and Text Object 
book
Analyzed Layout and Text Object 
newspaper
Preservation Metadata 
Implementation Strategies 
• PREMIS is a core set of implementable preservation metadata, 
broadly applicable across a wide range of digital preservation 
contexts and supported by guidelines and recommendations for 
creation, management, and use 
• In 2003 OCLC and RLG jointly sponsored the formation of the 
PREMIS working group comprised of international experts in the 
use of metadata to support digital preservation activities 
• PREMIS data dictionary current version 2.2 
• May be used in conjunction with METS XML 
• PREMIS tools are freely available 
• PREMIS Maintenance Activity and Editorial Committee has 
international members from libraries and industry 
• PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/
PREMIS data in METS file 
<mets:amdSec> 
<mets:techMD ID="PREMISOBJECT1"> 
<mets:mdWrap MDTYPE="PREMIS"> 
<mets:xmlData> 
<premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1"> 
<premis:objectIdentifier> 
<premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType> 
<premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue> 
</premis:objectIdentifier> 
<premis:objectCategory>file</premis:objectCategory> 
<premis:objectCharacteristics> 
<premis:format> 
<premis:formatDesignation> 
<premis:formatName>TIFF</premis:formatName> 
<premis:formatVersion>TIFF 6.0</premis:formatVersion> 
</premis:formatDesignation> 
</premis:format> 
</premis:objectCharacteristics> 
<premis:relationship> 
<premis:relationshipType>derivation</premis:relationshipType> 
<premis:relationshipSubType>is derivative of</premis:relationshipSubType> 
<premis:relatedObjectIdentification> 
<premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType> 
<premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue> 
<premis:relatedObjectSequence>0</premis:relatedObjectSequence> 
</premis:relatedObjectIdentification> 
<premis:relatedEventIdentification> 
<premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType> 
<premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue> 
<premis:relatedEventSequence>0</premis:relatedEventSequence> 
</premis:relatedEventIdentification> 
</premis:relationship> 
</premis:object> 
</mets:xmlData> 
</mets:mdWrap> 
</mets:techMD> 
</mets:amdSec>
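A short sketch of how PREMIS object data embedded in METS might be read back, for example to list the file format recorded for each object. The METS namespace is the standard one and the PREMIS namespace is copied from the sample above; the function name `premis_formats` and the trimmed-down sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "premis": "http://www.loc.gov/standards/premis/v1",
}

SAMPLE = """<mets:amdSec xmlns:mets="http://www.loc.gov/METS/">
  <mets:techMD ID="PREMISOBJECT1">
    <mets:mdWrap MDTYPE="PREMIS">
      <mets:xmlData>
        <premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1">
          <premis:objectIdentifier>
            <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:objectCharacteristics>
            <premis:format>
              <premis:formatDesignation>
                <premis:formatName>TIFF</premis:formatName>
                <premis:formatVersion>TIFF 6.0</premis:formatVersion>
              </premis:formatDesignation>
            </premis:format>
          </premis:objectCharacteristics>
        </premis:object>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:techMD>
</mets:amdSec>"""


def premis_formats(mets_xml):
    """Map each PREMIS object identifier to its (format name, format version)."""
    root = ET.fromstring(mets_xml)
    found = {}
    for obj in root.iterfind(".//premis:object", NS):
        identifier = obj.findtext(".//premis:objectIdentifierValue", namespaces=NS)
        name = obj.findtext(".//premis:formatName", namespaces=NS)
        version = obj.findtext(".//premis:formatVersion", namespaces=NS)
        found[identifier] = (name, version)
    return found


formats = premis_formats(SAMPLE)
print(formats)
```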
digitization workflow
implement: software 
• commercial off-the-shelf (COTS) 
• open source 
• customized COTS 
• customized open source 
• custom in-house 
digitization workflow 
Discussion topics 
1. Assuming your organization will digitize 
historic newspapers, will it digitize the 
newspapers in-house or outsource 
digitization? Why? (If you don’t know, guesses 
and speculations are fine.) 
2. Describe your organization’s current 
digitization workflow.
quality assurance and 
acceptance criteria
quality assurance 
and acceptance criteria 
Wikipedia on data quality: 
The processes and technologies involved in 
ensuring the conformance of data values to 
requirements and acceptance criteria 
automatic quality checks
• is the digital object complete? are all its 
components present? 
• is the digital object verifiable? 
• is the digital object uncorrupted? 
• do the components of the digital object 
conform to standards? 
• do the file names conform to project 
requirements? 
• does the directory structure conform to 
project requirements? 
• does the digital object metadata conform to 
project specifications? 
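Two of the automatic checks, file-name conformance and completeness of components, can be sketched in a few lines. The naming rule, the required extensions, and the sample file names below are hypothetical project requirements invented for this example; a real project would substitute its own.

```python
import re

# Hypothetical project rule: <title code>-<YYYYMMDD>-p<4-digit page>.<ext>
NAME_RULE = re.compile(r"^[a-z]{3,8}-\d{8}-p\d{4}\.(tif|xml)$")
# Assumed requirement: one image and one ALTO file per page.
REQUIRED_EXTENSIONS = {"tif", "xml"}


def check_issue(file_names):
    """Return a list of human-readable problems found in one issue's files."""
    problems = []
    pages = {}
    for name in file_names:
        if not NAME_RULE.match(name):
            problems.append(f"bad file name: {name}")
            continue
        stem, extension = name.rsplit(".", 1)
        pages.setdefault(stem, set()).add(extension)
    for stem, extensions in sorted(pages.items()):
        for missing in sorted(REQUIRED_EXTENSIONS - extensions):
            problems.append(f"missing component: {stem}.{missing}")
    return problems


report = check_issue([
    "qldr-19000106-p0001.tif",
    "qldr-19000106-p0001.xml",
    "qldr-19000106-p0002.tif",
    "Scan 3.TIF",
])
print(report)
```

An empty report means the issue passes these two checks; corruption and standards conformance need separate tools such as checksums and format validators.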
manual quality checks
• does the digital object metadata meet 
accuracy specifications? 
• does the text meet accuracy 
specifications? 
• is the image quality satisfactory? 
• are article continuations correct? 
• is the text in reading order? 
what’s wrong with this? 
acceptance criteria for an English-language 
digitization project at a large, well-known, and 
internationally recognized national library 
character accuracy > 80% 
word accuracy > 75% 
significant word accuracy > 65% 
what’s wrong with this? 
project quality requirement: 
“a high level of accuracy”
what’s wrong with this? 
project quality requirement: 
“article titles must be 99.5% accurate”
what’s wrong with this? 
project quality requirement: 
“article title characters in each issue must be 99.5% 
accurate, that is, each issue may have no more than 
5 errors in 1000 article title characters”
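A requirement stated this precisely can be checked mechanically. The sketch below measures character accuracy of OCR'd article titles against ground truth using edit distance, then compares the result with the 99.5% threshold. Counting errors as Levenshtein edits is an assumption of this example (the requirement itself does not say how an error is counted), and the function names are hypothetical.

```python
def edit_distance(reference, ocr):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ocr_char in enumerate(ocr, start=1):
            cost = 0 if ref_char == ocr_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def title_accuracy(pairs):
    """Character accuracy over all (ground_truth, ocr) title pairs of one issue."""
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    characters = sum(len(truth) for truth, _ in pairs)
    return 1.0 - errors / characters


def meets_requirement(pairs, threshold=0.995):
    """True when the issue has no more than 5 errors per 1000 title characters."""
    return title_accuracy(pairs) >= threshold


# Illustrative pair: one wrong character in a 17-character title.
sample = [("The Queenslander.", "The Queenslandcr.")]
print(title_accuracy(sample), meets_requirement(sample))
```

Because the requirement names the unit (title characters), the scope (each issue), and the limit, both the vendor and the library can run the same calculation and get the same answer.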
image quality 
• sharpness: the amount of detail an image can 
convey 
• noise: random variation of image density 
• dynamic range 
• contrast (gamma): the slope of the tone 
reproduction curve in a log-log space. high 
contrast usually involves loss of dynamic range, 
that is, loss of detail, or clipping, in highlights or shadows. 
• vignetting: darkens images near the corners 
• artifacts: “leftovers” from sharpening or 
compression 
Wikipedia contributors, “Image quality,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Image_quality (accessed March 2014).
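Some of these properties can be screened automatically before a person looks at the image. As one small example, the sketch below reports the tonal extremes of an 8-bit grayscale image and the share of pixels clipped to pure black or pure white. The function name, the clipping limits, and the sample pixel values are assumptions for illustration; the other properties in the list need more elaborate measures.

```python
def clipping_report(pixels, low=0, high=255):
    """Summarise tonal range for a flat sequence of 8-bit grayscale values."""
    total = len(pixels)
    return {
        "min": min(pixels),
        "max": max(pixels),
        "shadow_clipped": sum(1 for p in pixels if p <= low) / total,
        "highlight_clipped": sum(1 for p in pixels if p >= high) / total,
    }


# Made-up pixel values: two pure black and four pure white out of ten.
sample_pixels = [0, 0, 12, 80, 128, 200, 255, 255, 255, 255]
tonal = clipping_report(sample_pixels)
print(tonal)
```

A large clipped fraction suggests that highlight or shadow detail was lost at scan time, which cannot be recovered afterwards.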
image quality 
“…images which are ultimately to be viewed by 
human beings, the only “correct” method of 
quantifying visual image quality is through subjective 
evaluation. in practice, however, subjective 
evaluation is usually too inconvenient, time-consuming 
and expensive…” 
“…best way to assess the quality of an image is to 
look at it because human eyes are the ultimate 
viewers of most images…” 
Zhou Wang, Alan Bovik, and Ligang Lu. Why is image quality assessment 
so difficult? IEEE Transactions on Image Processing. April 2004. 
Zhou Wang and Hamid R. Sheikh. Image Quality Assessment: From Error 
Visibility to Structural Similarity. IEEE Transactions on Image Processing. 
April 2004.
acceptance criteria for the 
National Library of Australia NDP 
quality assurance 
Discussion topics 
1. How does your organization currently do 
quality assurance for digital data? 
2. How much time / effort is given to writing 
quality assurance procedures and acceptance 
criteria for digitized data?
digitization tools
open source vs. commercial software: pros 
• acquisition cost : development and implementation 
contract costs are likely to be lower than for proprietary 
software. contractually bound upgrade costs are less likely. 
total cost of ownership over the lifetime of 
usage must still be taken into account 
• data transferability : with open source code and open data 
formats, there are greater opportunities to share data across 
interoperable platforms 
• re-use : open source is free from per user or per instance 
costs and there is a guaranteed freedom to use it in any way. 
re-use is enabled. 
Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
open source vs. commercial software: pros 
• cost effective : pay once, if at all, for development 
and reuse where appropriate. 
• non-restrictive : open source licenses do not limit or restrict 
who can use the software, the type of user, or the areas of 
business in which the software can be used. provides a 
licensing model that enables rapid provisioning of both known 
and unanticipated users and in new use cases. 
• scalable : open source solutions are scalable upwards and 
downwards with a reduction in the risk of longer term 
financial implications. no license fees on a “per user” or “per 
box” basis. no redundant licenses 
open source vs. commercial software: pros 
• easy to prototype and adapt : open source software is 
particularly suitable for rapid prototyping and 
experimentation, where the ability to “test drive” the software 
with minimal costs and administrative delays can be 
important. (proprietary software suppliers may also provide 
the same through a ‘proof of concept’ phase at minimal or no 
cost.) 
open source vs. commercial software: cons 
• support and maintenance costs : may outweigh those of 
a proprietary package and may include ‘hidden’ 
commitments. 
• intellectual property rights : as code is modified and 
adapted, there may be legal risks concerning the code’s open source 
status and who owns the intellectual property rights of 
the modified code. 
• expertise : requires software installation and 
maintenance expertise. modification of open source 
code requires software development expertise. the organization must 
ensure that it has the right level of expertise to 
manage the software effectively. 
digitization tools 
a variety of open source and commercial off-the-shelf (COTS) 
software is available for digitization projects 
• easier for systems from different parties or using different 
technologies to interoperate and communicate with one 
another 
• better protection of the data files created by an application 
against obsolescence of the application 
• applications / data are easier to port from one platform to 
another since they follow known guidelines and rules, and the 
interfaces
ocr software 
• ABBYY FineReader (http://www.abbyy.com) 
• Tesseract (https://code.google.com/p/tesseract-ocr) [open source] 
• Nuance OmniPage (http://www.nuance.com) 
• IRIS Readiris (http://www.irislink.com) 
• LEADTOOLS OCR (http://www.leadtools.com) 
• OCRopus (https://code.google.com/p/ocropus) [open source] 
Wikipedia contributors, “Optical character recognition,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Optical_character_recognition (accessed March 2014). 
Wikipedia contributors, “Comparison of optical character recognition software,” Wikipedia, The Free 
Encyclopedia, http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software 
(accessed March 2014).
imaging software 
• LEADTOOLS image SDK (http://www.leadtools.com) 
• ImageGear image SDK (http://www.accusoft.com) 
• FreeImage image SDK (http://freeimage.sourceforge.net) [open source] 
• BlackIce image toolkits (http://www.blackice.com) 
• Adobe Photoshop (http://www.adobe.com/Photoshop) 
• GIMP (http://www.gimp.org) [open source] 
• GraphicsMagick (http://www.graphicsmagick.org) [open source] 
• ImageMagick (http://www.imagemagick.org) [open source]
digital workflow software 
• Content Conversion Specialists docWorks (http://content-conversion.com) 
• ScanFlow (http://www.treventus.com) 
• Goobi (http://www.goobi.org) [open source] 
• Zissor (http://zissor.com) 
other software 
• BagIt [open source] : hierarchical file packaging format for the 
exchange of digital content. A "bag" has just enough 
structure to safely enclose descriptive "tags" and a 
"payload" but does not require any knowledge of the 
payload's internal semantics. See http://sourceforge.net/projects/loc-xferutils 
and http://tools.ietf.org/html/draft-kunze-bagit-06. 
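To make the "bag" structure concrete, here is a minimal sketch that writes a payload under `data/` plus the two tag files a bag needs: the `bagit.txt` declaration and a checksum manifest. It uses only Python's standard library rather than any BagIt tool; the function name `make_bag`, the sample payload, and the version string 0.97 are assumptions for this example, so use the version of the specification that the project follows.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write payload files under data/ and the two required tag files."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8")
    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8")
    return bag


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "issue-bag", {"page-0001.txt": b"The Queenslander."})
    bag_files = sorted(p.relative_to(bag).as_posix()
                       for p in bag.rglob("*") if p.is_file())
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
print(bag_files)
print(manifest)
```

The receiver of a bag recomputes each checksum in the manifest to confirm that the payload arrived complete and unaltered.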
digitization tools 
Discussion questions 
1. What software tools does your organization use for 
digital projects or digital libraries? 
2. Does your organization host a digital library? If so, 
does it use Google Analytics or a similar tool? Why 
or why not? 
3. What software tools does your organization use for 
project management? Are the tools web-based?
digital preservation 
Preservation of software and preservation of data are two sides of 
the same coin. From February 2011 Workshop for Digital Curators.
Open Archival Information System (OAIS) 
reference model
digitization ≠ digital preservation!
Vint Cerf on “bit rot”
digital preservation 
long-term, error-free storage of digital 
information, with means for retrieval 
and interpretation, for the entire time 
span the information is required
digital data risks 
• standards / format obsolescence 
• migration to new format, media, 
or hardware 
• media obsolescence / decay 
• bit rot
format obsolescence 
remember … 
WordPerfect ? 
MARC records ? 
Adobe Flash ?
strategies for 
format obsolescence 
• migrate data to new formats 
• create a computer software museum 
with virtual machines 
• format registries 
• format validators 
• don’t worry about it!
Jeff Rothenberg on 
format obsolescence 
“... digital documents are evolving so 
rapidly that shifts in the forms of documents 
must inevitably arise. New forms do not 
necessarily subsume their predecessors or 
provide compatibility with previous formats.” 
Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published 
in Scientific American. January 1995. Expanded version published February, 1999. 
(accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)
standard model 
for format obsolescence 
• digital format registry collects information about target format 
• this information is used to build format identification and 
verification tools 
• holders of content use these tools to extract metadata from 
content in target format; metadata is stored with the content 
• format registry scans computing environment to determine 
which formats are obsolescent; notifications sent for obsolete 
formats 
• on receiving such a notification, someone builds a tool to convert 
obsolete format to non-obsolete format using the format 
specification in the registry 
• on receiving such a notification, holder of content in obsolete 
format uses conversion tool and content metadata to convert the 
file in an obsolete format to a file in a non-obsolete format
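The format identification step in this model usually starts from a file's leading bytes. The sketch below is a toy identifier, not a substitute for a real format registry or validator: the signature table covers only a handful of formats common in digitization projects, and the function name `identify` is an assumption for this example.

```python
# Leading "magic" bytes for a few formats common in digitization projects.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
]


def identify(header):
    """Return a format name for the first bytes of a file, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path):
    """Read just enough of the file to compare against every signature."""
    longest = max(len(magic) for magic, _ in SIGNATURES)
    with open(path, "rb") as handle:
        return identify(handle.read(longest))


print(identify(b"II*\x00\x08\x00\x00\x00"))
print(identify(b"%PDF-1.4\n"))
```

Identification only says what a file claims to be; validation against the format specification is a separate, stricter step.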
David Rosenthal on 
format obsolescence 
“... format obsolescence is a rare problem that 
happens infrequently to a minority of 
unpopular formats ...” 
David Rosenthal. Format obsolescence: Assessing the threat and the defenses. 
(accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf)
alternate model 
for format obsolescence 
• store only essential data 
• perform only essential tasks 
• delay performing tasks as long as possible 
David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library 
High Tech, Special Issue, vol. 28, no. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 
(accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).
importance of standards 
vis-a-vis format obsolescence 
well-defined standards … 
! 
• guide developers in creation of tools 
• facilitate development of a broad range of 
tools for any format 
• allow developers to maintain existing tools
data migration risks 
• file format changes, for example, PDF 1.4 to 
PDF 1.7 
• file name differences, for example, case-sensitive 
/ case-insensitive names, new operating 
system 
• extended file attributes 
• file permissions, for example, BSD Unix 
drwxr-xr-x@ to Windows file permissions 
• soft links / hard links
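The file-name risk is easy to test for before a migration. This minimal sketch finds names that are distinct on a case-sensitive file system but would collide on a case-insensitive one; the function name and sample paths are illustrative assumptions.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


collisions = case_collisions([
    "issue/Page-0001.tif",
    "issue/page-0001.tif",
    "issue/page-0002.tif",
])
print(collisions)
```

Any group reported here means one file would silently overwrite another when copied to the new system.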
media obsolescence 
• 5 ¼” floppy disks 
• 8-track tapes 
• 3 ½” floppy disks 
• ZIP drives 
• CD-R, CD-RW, Blu-ray 
• DAT tapes 
• microfilm 
• etc
strategies for 
media obsolescence 
• migrate data to new media, for example, 
floppy disks to DVD 
• create and maintain a computer hardware 
museum
media decay 
a report by NIST and the Library of Congress says ... 
• virtually all CD-Rs tested indicated an estimated life 
expectancy beyond 15 years 
• only 47 percent of recordable DVDs indicated an 
estimated life expectancy beyond 15 years, some 
had a life expectancy as short as 1.9 years 
• in practice actual lifetimes may be considerably 
shorter
prevention / detection 
of media decay 
• proper storage 
• data file checksums (MD5, SHA-1, ...) 
• monitor media integrity 
• migrate data from old media to new media
bit rot 
gradual decay of data due to … 
• storage media failure because of media quality 
• storage media failure because of improper storage 
• random events (bit-flip, environmental influences) 
• software / hardware errors
prevention / detection 
of bit rot 
• data file fixity check (checksums) such as MD5, 
SHA-1, ... 
• monitor file integrity with frequent, corrective 
audits 
• duplicate copies, geographically distributed
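A fixity check amounts to recording a checksum for every file once and recomputing it at each audit. The sketch below uses SHA-256 from Python's standard library in place of the MD5 and SHA-1 named on the slide; the function names and the simulated corruption are assumptions for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(root):
    """Map each file's relative path to its current checksum."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def audit(root, recorded):
    """Compare current checksums with the recorded ones."""
    current = record_fixity(root)
    return {
        "missing": sorted(set(recorded) - set(current)),
        "changed": sorted(name for name in recorded
                          if name in current and current[name] != recorded[name]),
    }


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page-0001.tif"
    page.write_bytes(b"original bits")
    baseline = record_fixity(tmp)
    page.write_bytes(b"original bitz")  # simulate silent corruption
    result = audit(tmp, baseline)
print(result)
```

Detection alone does not restore anything; a changed or missing file has to be repaired from one of the duplicate copies.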
distributed decentralized 
digital preservation 
• the more copies, the safer the data 
• the more independent copies, the safer the 
data 
• the more frequently copies are audited, the 
safer the data 
Paraphrased from David Rosenthal. Keeping bits safe: How hard can it be?
distributed decentralized 
digital preservation 
• n+1 copies are safer than n copies 
• n independent copies on different storage 
devices / media are safer than n copies on similar 
or identical storage devices / media 
• data audited every week is safer than data audited 
every month
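The first point can be put in numbers under a strong simplifying assumption: if each copy is lost independently with the same probability p between audits, all n copies are lost together with probability p to the power n. The figure of 0.01 below is arbitrary and chosen only for illustration.

```python
def loss_probability(p_single, copies):
    """Chance that every copy is lost in one audit period, assuming
    copies fail independently with the same probability p_single."""
    if not 0.0 <= p_single <= 1.0 or copies < 1:
        raise ValueError("need 0 <= p_single <= 1 and copies >= 1")
    return p_single ** copies


# p_single = 0.01 is an arbitrary illustrative figure, not a measured rate.
for n in (1, 2, 3, 4):
    print(n, loss_probability(0.01, n))
```

The independence assumption is the reason for the second point: copies on identical devices, in one building, or under one administrator tend to fail together, so the real risk is higher than this calculation suggests.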
LOCKSS 
Lots Of Copies Keep Stuff Safe 
LOCKSS box: Open source LOCKSS software installed on a 
dedicated computer or virtual machine. 
• It ingests content from target websites using a web crawler similar to those used by 
search engines. 
• It preserves content by continually comparing the content it has collected with the 
same content collected by other LOCKSS Boxes, and repairing any differences. 
• It delivers authoritative content to readers by acting as a web proxy, cache or via 
Metadata resolvers when the publisher’s website is not available. 
• It provides management through a web interface that allows librarians to select new 
content for preservation, monitor the content being preserved and control access to the 
preserved content. 
• It dynamically migrates content to new formats as needed for display. 
From LOCKSS webpages http://www.lockss.org.
how LOCKSS works 
[diagrams: three LOCKSS boxes, one each at library X, library Y, and my library] 
1. data copied to another LOCKSS box 
2. data audited 
3. data audited: one audit fails, the other is ok 
4. data copied to another LOCKSS box
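The copy, audit, and repair cycle can be sketched as a simple majority vote over checksums. This is a deliberately simplified illustration and not the actual LOCKSS polling protocol; the function name, the box names, and the sample content are assumptions for this example.

```python
import hashlib
from collections import Counter


def audit_and_repair(copies):
    """copies maps a box name to the bytes it holds for one item.

    Boxes that disagree with the majority digest are overwritten with
    the majority content. Returns the names of the repaired boxes.
    """
    digests = {box: hashlib.sha256(data).hexdigest()
               for box, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority; manual intervention needed")
    good = next(data for box, data in copies.items() if digests[box] == winner)
    repaired = [box for box in copies if digests[box] != winner]
    for box in repaired:
        copies[box] = good
    return sorted(repaired)


boxes = {
    "library X": b"page image",
    "library Y": b"page imagf",   # simulated damage
    "my library": b"page image",
}
repaired = audit_and_repair(boxes)
print(repaired, boxes["library Y"])
```

With only two copies a disagreement cannot be resolved by voting, which is one practical reason to keep three or more independent copies.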
private LOCKSS networks 
Alabama Digital Preservation 
Network (http://www.adpn.org/). 
CLOCKSS (Controlled LOCKSS), a non-profit collaboration 
of North American, European, and Asian cultural heritage 
institutions whose purpose is to preserve digital content with 
LOCKSS (http://www.clockss.org). 
MetaArchive Cooperative is a digital preservation 
cooperative created by cultural heritage institutions 
(http://www.metaarchive.org).
digital preservation references 
• Nancy McGovern and Katherine Skinner, editors. Aligning National Approaches to 
Digital Preservation. Educopia Institute Publications. Atlanta, Georgia. 2012. 
Proceedings of a conference on digital preservation held at the National Library of 
Estonia in May 2011. (accessed 15 August 2012 at 
http://www.educopia.org/sites/default/files/ANADP_Educopia_2012.pdf). 
• David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library 
High Tech, Special Issue, v. 28, n. 2, 2010, pp. 195-210. doi:10.1108/07378831011047613 
(accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf). 
• David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM 
v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at 
http://lockss.org/locksswiki/files/ACM2010.pdf). 
• Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published 
in Scientific American, January 1995. Expanded version published February 1999. 
(accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf). 
• Joint Information Systems Committee (JISC) Programme on Digital Preservation at 
http://www.jisc.ac.uk/preservation. 
• Library of Congress on Digital Preservation at http://www.digitalpreservation.gov. 
• Stanford University’s website for LOCKSS at http://www.lockss.org.
newspaper digitization programs around 
the world 
National Library of Finland (http://digi.kansalliskirjasto.fi/) 
British Newspaper Archive, British Library (http://www.bl.uk/welcome/newspapers) 
National Digital Newspaper Program, Library of Congress 
(http://chroniclingamerica.loc.gov/) 
National Library of New Zealand (http://paperspast.natlib.govt.nz/) 
National Library of Australia, Australian Digital Newspapers Program 
(http://trove.nla.gov.au/newspaper) 
Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/) 
Singapore National Library Board (http://newspapers.nl.sg/) 
Bibliothèque nationale de France (http://gallica.bnf.fr/) 
Europeana Newspapers Project, a collaboration of 17 organizations 
(http://www.europeana-newspapers.eu/) 
National Library of Latvia (https://periodika.lndb.lv/)
• Library of Congress National Digital Newspaper 
Program http://www.loc.gov/ndnp/ 
• Australian Newspaper Digitisation Program 
http://www.nla.gov.au/content/newspaper-digitisation-program 
• IFLA Newspapers Section Digitisation projects 
and best practices http://www.ifla.org/node/6777 
• ICON: International Coalition on Newspapers 
http://icon.crl.edu/digitization.htm
• METS, MODS, ALTO, PREMIS, and other library standards : 
http://www.loc.gov/standards 
• OAIS : http://public.ccsds.org/publications/RefModel.aspx 
• NISO standards and guidelines : http://www.niso.org/publications/rp 
• Good practice guides : http://www.ukoln.ac.uk 
• And many, many more
Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, 
https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).
?! 
Frederick Zarndt 
Secretary, IFLA Newspapers Section 
frederick@frederickzarndt.com 
Photo held by John Oxley Library, State Library of Queensland. Original from 
Courier-mail, Brisbane, Queensland, Australia.

More Related Content

What's hot

DWF WP2 BIREME WHO Lowcostlaptop 20080604
DWF WP2 BIREME WHO Lowcostlaptop 20080604DWF WP2 BIREME WHO Lowcostlaptop 20080604
DWF WP2 BIREME WHO Lowcostlaptop 20080604Ron Burger
 
Keep Calm and Make It Real
Keep Calm and Make It RealKeep Calm and Make It Real
Keep Calm and Make It RealWiLS
 
OAA12 - What difference to the sustainability of open access can (a donor lik...
OAA12 - What difference to the sustainability of open access can (a donor lik...OAA12 - What difference to the sustainability of open access can (a donor lik...
OAA12 - What difference to the sustainability of open access can (a donor lik...BioMedCentral
 
Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...
Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...
Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...Open Data Research Network
 
Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...
Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...
Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...Genetic Engineering & Society Center
 
Ict uses in libraries
Ict uses in librariesIct uses in libraries
Ict uses in librariesLiaquat Rahoo
 
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...Sirris
 
Computer Networking meets Social Psychology
Computer Networking meets Social PsychologyComputer Networking meets Social Psychology
Computer Networking meets Social PsychologyWaldir Moreira
 
ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...
ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...
ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...CTA
 
Lessons from the UK: Data access, patient trust & real-world impact with heal...
Lessons from the UK: Data access, patient trust & real-world impact with heal...Lessons from the UK: Data access, patient trust & real-world impact with heal...
Lessons from the UK: Data access, patient trust & real-world impact with heal...Varsha Khodiyar
 
Challenges and emerging practices for knowledge organization in the electron...
Challenges and emerging practices for knowledge  organization in the electron...Challenges and emerging practices for knowledge  organization in the electron...
Challenges and emerging practices for knowledge organization in the electron...Anil Mishra
 
Planning and Implementing a Digital Library Project
Planning and Implementing a Digital Library ProjectPlanning and Implementing a Digital Library Project
Planning and Implementing a Digital Library ProjectJenn Riley
 
Innovative Approaches in Library Service Delivery
Innovative Approaches in Library Service DeliveryInnovative Approaches in Library Service Delivery
Innovative Approaches in Library Service DeliveryRebecca Jones
 

What's hot (19)

DWF WP2 BIREME WHO Lowcostlaptop 20080604
DWF WP2 BIREME WHO Lowcostlaptop 20080604DWF WP2 BIREME WHO Lowcostlaptop 20080604
DWF WP2 BIREME WHO Lowcostlaptop 20080604
 
Keep Calm and Make It Real
Keep Calm and Make It RealKeep Calm and Make It Real
Keep Calm and Make It Real
 
OAA12 - What difference to the sustainability of open access can (a donor lik...
OAA12 - What difference to the sustainability of open access can (a donor lik...OAA12 - What difference to the sustainability of open access can (a donor lik...
OAA12 - What difference to the sustainability of open access can (a donor lik...
 
Evolution of e-Content Distribution: Ad Hoc to Standardization
Evolution of e-Content Distribution: Ad Hoc to StandardizationEvolution of e-Content Distribution: Ad Hoc to Standardization
Evolution of e-Content Distribution: Ad Hoc to Standardization
 
Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...
Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...
Opening the Gates: Will Open Data Initiatives Make Local Governments in the P...
 
Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...
Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...
Todd Kuiken - PR vs Engagement: Balancing Facts and Values in International G...
 
An ontology-based context aware system for Selective Dissemination of Informa...
An ontology-based context aware system for Selective Dissemination of Informa...An ontology-based context aware system for Selective Dissemination of Informa...
An ontology-based context aware system for Selective Dissemination of Informa...
 
Ict uses in libraries
Ict uses in librariesIct uses in libraries
Ict uses in libraries
 
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
 
Computer Networking meets Social Psychology
Computer Networking meets Social PsychologyComputer Networking meets Social Psychology
Computer Networking meets Social Psychology
 
ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...
ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...
ICM and KSS in IWM as evidence-based DSS to inform Policy and Investment deci...
 
Lessons from the UK: Data access, patient trust & real-world impact with heal...
Lessons from the UK: Data access, patient trust & real-world impact with heal...Lessons from the UK: Data access, patient trust & real-world impact with heal...
Lessons from the UK: Data access, patient trust & real-world impact with heal...
 
CAS & SDI service
CAS & SDI serviceCAS & SDI service
CAS & SDI service
 
Challenges and emerging practices for knowledge organization in the electron...
Challenges and emerging practices for knowledge  organization in the electron...Challenges and emerging practices for knowledge  organization in the electron...
Challenges and emerging practices for knowledge organization in the electron...
 
Open Discovery Initiative Update - CNI, April 4, 2013
Open Discovery Initiative Update - CNI, April 4, 2013Open Discovery Initiative Update - CNI, April 4, 2013
Open Discovery Initiative Update - CNI, April 4, 2013
 
ICT Literacy in Libraries
ICT Literacy in LibrariesICT Literacy in Libraries
ICT Literacy in Libraries
 
Planning and Implementing a Digital Library Project
Planning and Implementing a Digital Library ProjectPlanning and Implementing a Digital Library Project
Planning and Implementing a Digital Library Project
 
Ices wgdim-may-2010
Ices wgdim-may-2010Ices wgdim-may-2010
Ices wgdim-may-2010
 
Innovative Approaches in Library Service Delivery
Innovative Approaches in Library Service DeliveryInnovative Approaches in Library Service Delivery
Innovative Approaches in Library Service Delivery
 

20140410 ifla digitization workshop [idlc kuala lumpur]

  • 1. Newspaper digitization Frederick Zarndt IFLA Newspapers Section frederick@frederickzarndt.com @cowboyMontana hashtag #IFLAnewspaper
  • 2. the agenda 10.30 Morning tea break 1. Introductions 2. Review of the OAIS reference model 3. Newspaper digitization programs 4. Selection of materials 5. Importance of standards 6. Project management 7. Digitization workflow 7.1. Images 7.2. Metadata 7.3. File formats 8. Digitization workflow demonstration with docWorks 9. Quality assurance and acceptance criteria 10. Tools for digitization, workflow, digital preservation, and project management 11. Digital preservation considerations 12. Wrap-up 13.00 Lunch 15.30 Afternoon tea break
  • 3. An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. Wikipedia contributors, “Open Archival Information System,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).
  • 4. Open Archival Information System (OAIS) reference model • Negotiate for and accept appropriate information from information Producers. • Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation. • Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided. • Ensure that the information to be preserved is Independently Understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information. • Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the original, or as traceable to the original. • Make the preserved information available to the Designated Community. Wikipedia contributors, “Open Archival Information System," Wikipedia, The Free Encyclopedia, https:// en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).
  • 5. Open Archival Information System (OAIS) reference model
  • 8. national programs national: centrally funded and managed programs with several participants. strict standards. • National Digital Newspaper Program (Library of Congress) • Australian Newspaper Digitisation Program programs
  • 9. cooperative programs cooperative: organizations collaborate to achieve a common goal but digitization programs are managed separately. flexible standards. • Europeana newspapers • Digital Public Library of America programs
  • 10. individual programs individual: organization digitizes on its own. may, but more usually does not, follow open standards. all commercial organizations. • ProQuest Historical Newspapers • Newspapers.com • Newsbank • many others… programs
  • 11. programs • digitization program requires careful thought • must be adapted to local circumstances • ask those who have gone before • join the IFLA Newspapers Section! (ask me how) Image courtesy of Donald Zolan.
  • 12. programs Discussion questions 1. Has your organization already begun to digitize newspapers? How is the digitization program organized and funded? 2. If your organization hasn’t yet begun to digitize newspapers, what type of digitization program would best suit your organization / state / country? Why?
  • 13. Experience is that marvelous thing that enables you to recognize a mistake when you make it again. F. P. Jones
  • 15. reasons for digitization newspapers are deteriorating microfilm is dissolving no storage space selection
  • 16. access • Who are your users? Do you know? • Can you ask them what they expect from a digital newspaper collection? Can you trust their answers? • Trove, Papers Past, Cambridge Public Library, CDNC: These digital newspaper collections are used mostly by people 50+ years old and with an interest in family history. ? selection
  • 17. Library of Congress selection criteria for the National Digital Newspaper Program (NDNP) selection ! • Image quality • Intellectual content • Refinements http://www.loc.gov/ndnp/guidelines/selection.html
  • 18. selection for NDNP Image quality All NDNP newspaper images are scanned from microfilm. 1. Microfilm should be produced from properly prepared unbound originals. 2. Microfilm reduction ratio should be less than 20x. This allows 400dpi images to be scanned from the film. 3. Variations in microfilm density within and between images should be no more than 0.2. 4. Negative microfilm duplicated for scanning should have resolution test patterns readable at 5.0 or higher. For camera master microfilm without resolution test charts, resolution can be estimated by comparison to film with resolution test charts and original material. selection
  • 19. selection for NDNP Intellectual content ! 1. Newspaper title reflects the political, economic and cultural history of the State. 2. Selected newspaper titles should ensure broad geographical coverage. 3. Newspaper titles that provide coverage of a geographic area or a group over long time periods are preferred over short lived titles or titles with significant gaps. selection
  • 20. selection for NDNP Selection criteria refinements 1. Orphan titles: Special consideration should be given to high research value titles that have ceased publication and lack active ownership. 2. Newspaper titles that document a significant (minority) community at the state or regional level may be given special consideration. 3. Newspapers that have already been digitized by other organizations (for example, ProQuest) should not be digitized again. selection
  • 21. selection for ANDP National Library of Australia collection managers in consultation with staff from Preservation Services nominate materials for digitization. The Library works closely with state and territory libraries to systematically digitise newspapers held in these libraries. Selected newspapers include those with • Cultural and/or historical significance • Uniqueness and/or rarity of the material • Copyright status or permission to digitise obtained • Material in high demand • Material at risk because of its physical condition selection https://www.nla.gov.au/policy-and-planning/collection-digitisation-policy
  • 22. copyright Most newspaper titles selected for digitization are out of copyright and in the public domain. Negotiating use rights is quite simply too much trouble and fraught with legal pitfalls. Copyright laws and policies vary considerably between countries. selection
  • 23. 23 …however… Digitization and public access to in-copyright newspapers is not impossible. selection
  • 29. ? selection ? Discussion questions 1. Has your organization already selected newspapers to digitize? Why did it choose the titles that were selected? Please answer (hypothetically) if your organization hasn’t begun a newspapers digitization program. 2. Why would or why wouldn’t your organization select in-copyright newspapers to digitize?
  • 30. 30 importance of standards
  • 31. open standards • Availability : Open standards are available for all to read and implement. • Maximize end-user choice : Open standards create a fair, competitive market for implementation of the standards. They do not lock the customer into a particular vendor or group. • No royalty : Open standards are free for all to implement, with no royalty or fee. • No discrimination : Open standards and the organizations that administer them do not favor one implementor over another for any reason other than the technical standards compliance of a vendor's implementation. • Extension or subset : Implementations of open standards may be extended, or offered in subset form. However, certification organizations may decline to certify subset implementations, and may place requirements upon extensions. • Predatory practices : Open standards may employ license terms that protect against subversion of the standard by embrace-and-extend tactics. The licenses attached to the standard may require the publication of reference information for extensions, and a license for all others to create, distribute and sell software that is compatible with the extensions. An open standard may not otherwise prohibit extensions. importance of standards Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards
  • 32. open standards • Not restrictive : Less chance of being locked in by a specific technology and/or vendor. • Interoperable : Easier for systems from different parties or using different technologies to interoperate and communicate with one another. • Protection against obsolescence : Better protection of the data files created by an application against obsolescence. • Portable : Applications / data are easier to port from one platform to another since they follow known guidelines, rules, and interfaces. importance of standards Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards
  • 33. newspapers and standards What standards are important for newspaper digitization? • METS XML is an open standard administered by the METS editorial board. See http://www.loc.gov/standards/mets/. • ALTO XML is an open standard administered by the ALTO editorial board. See http://www.loc.gov/standards/alto/. • Various image file formats including TIFF, JPEG, JPEG2000. • PDF/A is a portable document format developed by Adobe. It is a subset of the complete PDF specification and has been adopted by ISO as a standard. See http://www.pdfa.org/. • Various library metadata standards including, but not limited to • MODS XML http://www.loc.gov/standards/mods/ • Dublin Core http://dublincore.org/ • PREMIS http://www.loc.gov/standards/premis/ importance of standards
  • 34. importance of standards with few exceptions libraries use METS XML + ALTO XML + image files (TIFF, JPEG2000) for newspaper digitization programs importance of standards
  • 35. proprietary standards Olive ActivePaper Archive stores historical newspaper data in an XML format that is as capable as METS/ALTO XML but is not an open standard. Early versions of WordPerfect (MS Word too) stored data in a proprietary format, not in an open standard like Open Document Format (ODF). WordPerfect or special software is needed to view the files. Adobe’s Flash is a de facto but not an open standard. Flash now appears to be on a path to obsolescence, destined to be replaced by HTML5. importance of standards
  • 36. ?importance of standards ? Discussion questions 1. Name a few standards that you use every time you connect to the Internet. 2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use?
  • 37. In theory, there's no difference between theory and practice, but in practice, there is. ! Anonymous
  • 39. From the Standish Group’s 2012 Chaos Report on IT Project Failure. project management
  • 40. high cost of IT failure Roger Sessions estimates that the worldwide cost of IT failure is USD $500 billion per month Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple Architectures for Complex Enterprises and many articles. He is a founding member of the Board of Directors of the International Association of Software Architects. 40 project management
  • 41. in a recent survey of 1230 IT professionals conducted by Embarcadero Technologies, 2 of the 3 biggest project challenges cited by the IT pros are “poor planning” and “poor or no requirements” 41 plan! project management
  • 42. in a March 2007 web poll conducted by the Computing Technology Industry Association "nearly 28 percent of the more than 1,000 respondents singled out poor communications as the number one cause of project failure" 42 communicate! project management
  • 43. A recent survey of 752 IEEE members conducted by IEEE Spectrum and The New York Times discovered that "just 9 percent of 133 respondents whose organizations currently offshore R&D reported 'No problem'. The biggest headache was 'Language, communication, or culture' barriers, as reported by 54.1 percent of respondents." (http://www.spectrum.ieee.org/feb07/4881) communicate! project management
  • 44. In their 2009 book Cultural Intelligence: Living and Working Globally, Thomas and Inkson say “Although we increasingly cross boundaries and surmount barriers to trade, migration, travel, and the exchange of information, cultural boundaries are not so easily bridged. Unlike legal, political, or economic aspects of the global environment, which are observable, culture is largely invisible. Therefore, culture is the aspect of the global context that is most often overlooked.” 44 communicate! project management
  • 45. plan! In a white paper written for Project Perfect, Taimour al Neimat lists • poor planning • unclear goals and objectives • objectives changing during the project • unrealistic time or resource estimates • lack of executive support and user involvement • failure to communicate and act as a team • inappropriate skills as primary causes for the failure of complex IT projects. Taimour al Neimat. Why IT Projects Fail. The PROJECT PERFECT White Paper Collection. Oct 2005. http://www.projectperfect.com.au/downloads/Info/info_it_projects_fail.pdf accessed Mar 2014. project management
  • 46. typical tender evaluation criteria in priority order ! 1. understanding of requirements 2. reputation of service bureau 3. price 46 requirements? project management
  • 47. incomplete requirements requirements in recent tender from an (anonymous) government agency somewhere in the world ! • project to convert ~ 170,000 text images to xml • value of project ~ USD $180,000 • 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc • technical requirements description? < 1 page • data acceptance criteria? “a high level of accuracy” 47 project management
  • 48. complete requirements Library of Congress JPEG2000 profile 48 project management
  • 49. a recent newspaper digitization program established by a prominent national library • digitize more than 20 million text pages • high level image and xml requirements • value of work awarded? > USD $5,000,000 • after award of work, technical requirements expand to 43+ pages from ~3 pages • acceptance criteria? added as an afterthought and not well defined project management poor planning
  • 50. the value of simplicity “There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.” ! C.A.R. Hoare Professor Sir Charles Anthony Richard Hoare Emeritus Professor at Oxford University, Senior Researcher at Microsoft Research, recipient of the ACM Turing Award, author of many books on computers and software. project management
  • 51. • unitary: the requirement addresses one and only one thing • complete: the requirement is fully stated in one place with no missing information • consistent: the requirement does not contradict any other requirement and is fully consistent with all authoritative external documentation • atomic: it does not contain conjunctions, for example, "the code field must validate American and Canadian postal codes" should be written as two separate requirements project management good requirements
  • 52. ! • traceable: the requirement meets all or part of a business need as stated by stakeholders and authoritatively documented • current: the requirement has not been made obsolete by the passage of time • feasible: the requirement can be implemented within the constraints of the project • unambiguous: the requirement is concisely stated without recourse to technical jargon, acronyms • verifiable: the implementation of the requirement can be determined through one of four possible methods: inspection, demonstration, test, or analysis project management good requirements
  • 54. simple principles for (good) communication • be impeccable with your word • don’t take anything personally • don’t make assumptions • always do your best • be mindful
  • 55. why (better) communication is necessary no communication ... little communication ... poor communication ... reduced communication ... ... all result in more assumptions about intent!
  • 56. The single biggest problem with communication is the illusion that it has taken place. George Bernard Shaw, 1925 Nobel Prize in Literature.
  • 57. project management “projects are about communication, communication, and communication” Elenbaas, B. Staging a Project: Are You Setting Your Project Up for Success? Proceedings of the Project Management Institute Annual Seminars & Symposium. 2000.
  • 58. the value of prototypes / pilots “Plan to throw one away; you will anyhow. If there is anything new about the function of a system, the first implementation will have to be redone completely to achieve a satisfactory (i.e., acceptably small, fast, and maintainable) result. It costs a lot less if you plan to have a prototype.” ! Butler Lampson Butler Lampson was a founding member of Xerox PARC, worked for DEC, and now works at Microsoft Research. He is an adjunct professor at MIT and an ACM Fellow. project management
  • 59. implement: pilot
      create requirements and acceptance criteria
      repeat {
        digitize (small) pilot batch
        test data against acceptance criteria
        adjust requirements and acceptance criteria
      } until (no more adjustments are necessary)
      digitize more data
    pilot batches are VERY VERY important!! project management
  • 60. reasons for in-house production ! • collection cannot be moved • collection is badly organized • digitization must be done slowly over a long period • digitization is very simple 60 project management implement: in-house
  • 61. reasons for outsourced production ! • originals can’t be scanned in-house because… • equipment is too expensive • output data is beyond staff experience • labor is too expensive • large volume of work in a short time • insufficient space, infrastructure, or staff 61 project management implement: outsource
  • 62. project management tools The project management tool one chooses should be intuitive, easy to use, and accessible to all. If it isn’t, many will avoid / refuse / dislike / resent using it. • Discussion of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software • List of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software project management
  • 63. ? project management ? Discussion questions 1. What project management practices does your organization follow? Why? 2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use? 3. What reasons, in addition to those already cited, would your organization have to digitize newspapers in-house or to outsource digitization?
  • 64. “Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away.” Antoine de Saint-Exupéry
  • 66. digitization workflow ! • digital library: one or more digital collections
  • 67. 67 digital library digitization workflow
  • 68. digitization workflow ! • digital library: one or more digital collections • digital collection: organized group(s) of digital objects
  • 70. digitization workflow ! • digital library: one or more digital collections • digital collection: organized group(s) of digital objects • digital object: a surrogate or digital copy of the original source document, for example, a newspaper issue
  • 72. An example of what ALTO makes possible The Day book. (Chicago, Ill.), 29 Feb. 1912. Chronicling America: Historic American Newspapers. Lib. of Congress. <http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-02-29/ed-1/seq-26/>
  • 73. digitization workflow ! • digital library: one or more digital collections • digital collection: organized group(s) of digital objects • digital object: a surrogate or digital copy of the original source document, for example, a newspaper issue • metadata: data about data. information about a digital object(s) or a digital collection(s) or the original source document(s)
  • 75. • to enhance accessibility • to increase collaboration and cooperation between libraries and archives around the world • to promote research • to provide opportunities for entrepreneurs • other reasons? 75 why digitize newspapers? digitization workflow
  • 76. Open Archival Information System (OAIS) reference model
  • 78. the digitization process produce digital objects ingest preserve access produce images access source images objects
  • 79. the digitization process produce images source images
  • 80. standard file formats • image file formats • TIFF • JPEG2000 • JPEG • GIF • text file formats • PDF, PDF/A, PDF/A-1b, PDF/A-1a • TEI XML • HTML • plain text • NITF / NewsML • metadata • METS • MODS / PREMIS / ALTO / MIX ... digitization workflow
  • 81. ?image decisions ¿ • image production source materials • original documents: better quality, more expensive • microfiche: poorer quality, less expensive, microfiche quality varies • bit depth • black-and-white (bitonal) • greyscale • color • resolution • compression • no compression • lossless (reversible) • lossy (irreversible) • image metadata digitization workflow
  • 82. image format comparison
      JBIG (.jbig, .jbg): compression: lossless; bit depth: 1-bit; metadata: no; color management: no; 1st public release: 2000?
      JPEG (.jpg, .jpeg): compression: lossy (DCT, RLE, Huffman); bit depth: 8-bit, 12-bit, 24-bit; metadata: yes; color management: yes; mime type: image/jpeg, public.jpeg; patent: no; 1st public release: 1992
      JPEG2000 (.jp2): compression: many lossless and lossy compression algorithms; bit depth: 8-bit, 16-bit, color to 48 bits; metadata: yes; color management: yes; mime type: image/jp2, public.jpeg-2000; patent: yes but part 1 is patent free; 1st public release: 2000
      TIFF (.tiff, .tif): compression: none, LZW, RLE, ZIP, other; bit depth: 1, 2, 4, 8, 16, 24, 32 bits; metadata: yes; color management: yes; mime type: image/tiff, public.tiff; patent: no; 1st public release: 1986
    Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats (accessed August 1, 2012)
  • 83. image compression comparison
      format | The Sacred Heart Review 300dpi | Los Angeles Star 300dpi | Die Susquehanna Zeitung 600dpi
      TIFF (uncompressed) | 17.2 MB | 87 MB | 415.5 MB
      TIFF (lossless LZW compression) | 10.2 MB | 75.8 MB | 232.9 MB
      JPEG (maximum quality [lossless]) | 7.0 MB | 37.2 MB | 101.1 MB
      JPEG (medium quality) | 1.5 MB | 4.6 MB | 10.2 MB
      JPEG2000 (lossless compression) | 7.1 MB | 52.7 MB | 166.2 MB
      JPEG2000 (lossy [70] compression) | 5.1 MB | 37.1 MB | 116.7 MB
      JPEG2000 (lossy [30] compression) | 2.2 MB | 16.1 MB | 50.3 MB
  • 84. image bit depth comparison
      format | USA case law image 1 300dpi | USA case law image 2 300dpi
      TIFF 1-bit CCITT G4 compression | 40 KB | 87 KB
      JPEG2000 W5x3 reversible compression | 2.6 MB | 3.6 MB
      JPEG2000 W9x7 irreversible compression | 647 KB | 1 MB
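One way to read the slide 83 figures is as savings relative to the uncompressed TIFF. The sketch below does that arithmetic for The Sacred Heart Review column only; the sizes are copied from the slide, while the `savings` helper and the dictionary layout are illustrative assumptions, not part of the workshop material.

```python
# File sizes in MB for The Sacred Heart Review (300dpi), copied from slide 83.
SIZES_MB = {
    "TIFF (uncompressed)": 17.2,
    "TIFF (lossless LZW compression)": 10.2,
    "JPEG (medium quality)": 1.5,
    "JPEG2000 (lossless compression)": 7.1,
    "JPEG2000 (lossy [30] compression)": 2.2,
}


def savings(baseline_mb, compressed_mb):
    """Fraction of storage saved relative to the baseline size."""
    return 1.0 - compressed_mb / baseline_mb


baseline = SIZES_MB["TIFF (uncompressed)"]
for name, size in SIZES_MB.items():
    # Print each format's size and its saving against uncompressed TIFF.
    print(f"{name}: {size} MB, {savings(baseline, size):.0%} smaller")
```

The same helper can be applied to the other two columns; savings vary noticeably between titles, so a pilot batch of the actual source material is a better guide than any single sample.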
  • 85. GIGO GARBAGE IN, GARBAGE OUT Image courtesy of http://epsos.de (accessed at http://commons.wikimedia.org March 2014).
  • 86. raw OCR text Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f Havodiv y d, Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last , Mr . Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. AsbtCnvHall, mar Lancaster, Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol " t h r o u g h I n s b e a d , 1 w h i c h instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week, newspaper image Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
  • 87. ? digitization workflow ? Discussion topics 1. Assume your organization decides to digitize 1000 newspaper issues averaging 12 pages per issue. The images are scanned 2-up and average 80MB each. How much disk storage is needed for the images? 2. Now assume instead that your organization uses TIFF images with LZW (lossless) compression, which saves on average 40%. How much disk storage is needed for the images?
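A minimal sketch of the storage arithmetic behind the two discussion topics on slide 87. It assumes that "2-up" means each image holds two pages and that 1 GB = 1000 MB; neither is stated on the slide, and the function name and parameters are hypothetical.

```python
import math


def storage_mb(issues, pages_per_issue, pages_per_image, mb_per_image, savings=0.0):
    """Estimate storage in MB for a batch of scanned images."""
    # Assumption: each image holds `pages_per_image` pages; round up for odd page counts.
    images_per_issue = math.ceil(pages_per_issue / pages_per_image)
    total_images = issues * images_per_issue
    return total_images * mb_per_image * (1.0 - savings)


# Topic 1: 1000 issues, 12 pages each, scanned 2-up, 80 MB per image.
uncompressed = storage_mb(1000, 12, 2, 80)
# Topic 2: the same batch with LZW compression saving 40% on average.
lzw = storage_mb(1000, 12, 2, 80, savings=0.40)

print(f"{uncompressed / 1000:.0f} GB uncompressed, {lzw / 1000:.0f} GB with LZW")
```

Under these assumptions the batch is 6000 images. A real estimate would also budget for derivative images, XML files, and backup copies.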
  • 89. the digitization process produce digital objects images objects
  • 90. the digitization process images image objects processing layout analysis OCR metadata build digital objects
  • 91. the digitization process images image objects processing layout analysis OCR metadata build digital objects • crop, de-skew, split images • apply image improvement algorithms as needed • sharpening filters • local adaptive thresholding • remove text bleed-thru • etc • create master images • create working images
  • 92. 92
  • 93. 93
  • 94. 94
  • 95. what’s wrong with this image?
  • 96. text is skewed about 1° from vertical
  • 97. text is de-skewed text is skewed
  • 98. the digitization process images image objects processing layout analysis OCR metadata build digital objects • analyze layout of text image • estimate font types and sizes • calculate coordinates of text blocks • determine layout object types (text, illustration, headline, etc)
  • 100. the digitization process images image objects processing layout analysis OCR metadata build digital objects • perform optical character recognition (OCR) • calculate word and character coordinates • calculate word and character confidences • apply language dictionaries • correct OCR text (optional)
  • 101. the digitization process images image objects processing layout analysis OCR metadata build digital objects • populate metadata fields • verify / correct page numbers • verify / correct document structure
  • 102. the digitization process images image objects processing layout analysis OCR metadata build digital objects • create METS / ALTO XML files • create image files and image metadata • create PDF files (if required) • verify digital object • calculate file fixity checks (checksums) • perform file validation and verification • perform quality assurance
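The "calculate file fixity checks (checksums)" step on slide 102 can be sketched with the Python standard library. This is an illustration under assumptions, not the workflow tool's own method: SHA-256 is an arbitrary choice of algorithm, and the manifest format (a dictionary of relative path to hex digest) and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, algorithm="sha256"):
    """Map each file under `directory` (by relative path) to its checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): file_checksum(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(directory, manifest, algorithm="sha256"):
    """Return relative paths that are missing or whose checksum has changed."""
    root = Path(directory)
    failures = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or file_checksum(target, algorithm) != expected:
            failures.append(rel)
    return failures
```

A manifest built when the digital object is created can be re-verified after each transfer or at intervals in storage; an empty list from `verify_manifest` means every recorded file is present and unchanged.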
  • 103. real world digitization production workflow • automatic production steps performed by software ! • manual production steps performed by operators
  • 104. digital library standards • METS XML for descriptive, structural, technical, and administrative metadata • descriptive metadata • Metadata Object Description Schema (MODS) selected metadata from MARC • Dublin Core fundamental group of text elements for describing and cataloging • technical metadata • ALTO for OCR text • PREMIS for digital preservation • MIX and ANSI/NISO Z39.87 for images
  • 105. Metadata Encoding and Transmission Standard • METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library • METS files consist of 7 (optional) sections: header, descriptive, administrative, file map, structural map, structural link, and behavior • METS profiles describe a class of METS documents in sufficient detail to provide both document authors and programmers the guidance to create and process METS documents conforming with a particular profile • current version 1.9.1 • administered by METS editorial board (international group of volunteers) • standards hosted by Library of Congress at http://www.loc.gov/standards/mets/
  • 106. METS file structure Graphic from Karin Bredenberg, Communicating Archival Metadata conference and workshops. Riksarkivet, 2011.
  • 107. Metadata Object Description Schema • MODS is an XML schema for a bibliographic element set that may be used for library applications. Derivative of MARC 21 bibliographic format. Includes a subset of MARC fields, using language-based tags rather than numeric ones • Mappings exist between MODS and MARC, Dublin Core, and RDA (conversion tools exist) • May be used in conjunction with METS XML • current version 3.4 • administered by Library of Congress Network Development and MARC Standards Office with help from interested users • standards hosted by Library of Congress at http://www.loc.gov/standards/mods/
  • 108. MODS metadata in METS XML
      <mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
        <mets:mdWrap MDTYPE="MODS">
          <mets:xmlData>
            <mods:mods xmlns="http://www.loc.gov/mods/v3">
              <mods:language>
                <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
              </mods:language>
              <mods:genre>newspaper issue</mods:genre>
              <mods:originInfo>
                <mods:dateIssued>18740425</mods:dateIssued>
              </mods:originInfo>
              <mods:relatedItem type="host">
                <mods:titleInfo>
                  <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
                </mods:titleInfo>
                <mods:genre>newspaper</mods:genre>
                <mods:identifier>ISSN18368190</mods:identifier>
                <mods:part>
                  <mods:detail type="volume">
                    <mods:number>IX</mods:number>
                  </mods:detail>
                </mods:part>
                <mods:part>
                  <mods:detail type="issue">
                    <mods:number>12</mods:number>
                  </mods:detail>
                </mods:part>
              </mods:relatedItem>
            </mods:mods>
          </mets:xmlData>
        </mets:mdWrap>
      </mets:dmdSec>
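To show how MODS metadata wrapped in METS can be read back out, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The slide shows only a `dmdSec` fragment, so the enclosing `mets:mets` root and the namespace declarations in `SAMPLE` are assumptions added to make the example parse; the function name `issue_summary` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for METS and MODS (the prefixes themselves are arbitrary).
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Abbreviated version of the slide's fragment, wrapped in an assumed root element.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:genre>newspaper issue</mods:genre>
          <mods:originInfo><mods:dateIssued>18740425</mods:dateIssued></mods:originInfo>
          <mods:relatedItem type="host">
            <mods:titleInfo>
              <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
            </mods:titleInfo>
          </mods:relatedItem>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""


def issue_summary(mets_xml):
    """Extract genre, issue date, and host title from the first MODS record."""
    root = ET.fromstring(mets_xml)
    mods = root.find(
        ".//mets:dmdSec/mets:mdWrap[@MDTYPE='MODS']/mets:xmlData/mods:mods", NS
    )
    if mods is None:
        return None
    return {
        "genre": mods.findtext("mods:genre", namespaces=NS),
        "date_issued": mods.findtext("mods:originInfo/mods:dateIssued", namespaces=NS),
        "title": mods.findtext(
            "mods:relatedItem[@type='host']/mods:titleInfo/mods:title", namespaces=NS
        ),
    }


print(issue_summary(SAMPLE))
```

Matching on namespace URIs rather than on the literal `mods:` prefix keeps the lookup working when a producer chooses different prefixes.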
  • 109. Dublin Core metadata • Dublin Core is a set of vocabulary terms used to describe resources for the purposes of discovery. • Dublin Core metadata element set is endorsed in IETF RFC 5013, ISO 15836-2009, and NISO Z39.85 • Metadata terms last updated 14-Jun-2012 • May be used in conjunction with METS XML • Dublin Core Metadata Initiative (DCMI) is an open organization, incorporated as a public, not-for-profit company in Singapore • Dublin Core Metadata Initiative is hosted at http://dublincore.org/
  • 110. Analyzed Layout and Text Object • ALTO XML provides technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper • commonly used in conjunction with METS XML but may be used standalone • current version 2.1 • administered by ALTO editorial board (international group of volunteers) • standards hosted by Library of Congress at http://www.loc.gov/standards/alto/
  • 111. Analyzed Layout and Text Object
      <?xml version="1.0" encoding="UTF-8"?>
      <alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-4.xsd" xmlns:xlink="http://www.w3.org/1999/xlink">
        <Description>
          <MeasurementUnit>pixel</MeasurementUnit>
          <sourceImageInformation>
            <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName>
          </sourceImageInformation>
        </Description>
        <Styles>
          <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
          <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
        </Styles>
        <Layout>
          <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967">
            <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/>
            <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/>
            <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/>
            <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/>
            <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194">
              <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831">
                <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831">
                  <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT">
                    <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
                      <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/>
                      <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
                      <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/>
                    </TextLine>
                  </TextBlock>
                </ComposedBlock>
                <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/>
                <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
                <!-- ... -->
                    <TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46">
                      <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/>
                      <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/>
                      <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/>
                    </TextLine>
                  </TextBlock>
                </ComposedBlock>
              </ComposedBlock>
            </PrintSpace>
          </Page>
        </Layout>
      </alto>
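The word coordinates and word confidences (`WC`) in ALTO are what make search-term highlighting and OCR quality checks possible. The sketch below reads them with Python's standard `xml.etree.ElementTree`. It assumes an un-namespaced ALTO file like the slide's example (which uses `xsi:noNamespaceSchemaLocation`); later ALTO versions declare a namespace, and the tag names would then need to be namespace-qualified. `SAMPLE` is a cut-down, well-formed stand-in for the slide's excerpt, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the slide's ALTO excerpt (un-namespaced elements).
SAMPLE = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="P1_TB00004">
          <TextLine ID="P1_TL00065">
            <String CONTENT="The" WC="0.98" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74"/>
            <SP/>
            <String CONTENT="Queenslander." WC="0.96" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml):
    """Return one list per TextLine of (word, word confidence) pairs."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iter("TextLine"):
        words = [
            (s.get("CONTENT", ""), float(s.get("WC", "0")))
            for s in text_line.iter("String")
        ]
        lines.append(words)
    return lines


def mean_word_confidence(lines):
    """Average WC over all words, or None if the page has no words."""
    scores = [wc for line in lines for _, wc in line]
    return sum(scores) / len(scores) if scores else None


lines = read_lines(SAMPLE)
text = "\n".join(" ".join(word for word, _ in line) for line in lines)
print(text)
print(mean_word_confidence(lines))
```

A mean word confidence is only a rough proxy for OCR accuracy; it is best paired with manual sampling when defining acceptance criteria.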
  • 112. Analyzed Layout and Text Object book
  • 113. Analyzed Layout and Text Object newspaper
  • 114. Preservation Metadata Implementation Strategies • PREMIS is a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use • In 2003 OCLC and RLG jointly sponsored the formation of the PREMIS working group, composed of international experts in the use of metadata to support digital preservation activities • PREMIS data dictionary, current version 2.2 • May be used in conjunction with METS XML • PREMIS tools are freely available • PREMIS Maintenance Activity and Editorial Committee has international members from libraries and industry • PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/
  • 115. PREMIS data in METS file <mets:amdSec> <mets:techMD ID="PREMISOBJECT1"> <mets:mdWrap MDTYPE="PREMIS"> <mets:xmlData> <premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1"> <premis:objectIdentifier> <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType> <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue> </premis:objectIdentifier> <premis:objectCategory>file</premis:objectCategory> <premis:objectCharacteristics> <premis:format> <premis:formatDesignation> <premis:formatName>TIFF</premis:formatName> <premis:formatVersion>TIFF 6.0</premis:formatVersion> </premis:formatDesignation> </premis:format> </premis:objectCharacteristics> <premis:relationship> <premis:relationshipType>derivation</premis:relationshipType> <premis:relationshipSubType>is derivative of</premis:relationshipSubType> <premis:relatedObjectIdentification> <premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType> <premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue> <premis:relatedObjectSequence>0</premis:relatedObjectSequence> </premis:relatedObjectIdentification> <premis:relatedEventIdentification> <premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType> <premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue> <premis:relatedEventSequence>0</premis:relatedEventSequence> </premis:relatedEventIdentification> </premis:relationship> </premis:object> </mets:xmlData> </mets:mdWrap> </mets:techMD> </mets:amdSec>
  • 117. implement: software • commercial off-the-shelf (COTS) • open source • customized COTS • customized open source • custom in-house
  • 118. ? digitization workflow ? Discussion topics 1. Assuming your organization will digitize historic newspapers, will it digitize the newspapers in-house or out-source digitization? Why? (If you don’t know, guesses and speculations are fine.) 2. Describe your organization’s current digitization workflow.
  • 119. quality assurance and acceptance criteria
  • 120. quality assurance and acceptance criteria Wikipedia on data quality: The processes and technologies involved in ensuring the conformance of data values to requirements and acceptance criteria
  • 121. quality assurance: automatic quality checks • is the digital object complete? are all its components present? • is the digital object verifiable? • is the digital object uncorrupted? • do the components of the digital object conform to standards? • do the file names conform to project requirements? • does the directory structure conform to project requirements? • does the digital object metadata conform to project specifications?
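Two of these automatic checks (file naming and completeness) can be sketched in a few lines of Python. The naming rule below, `<title>-<YYYYMMDD>-p<page>.<tif|xml>`, is purely hypothetical and stands in for whatever a project's specification actually requires, as does the assumption that each page has one image and one XML file.

```python
import re

# Hypothetical naming rule: <title>-<YYYYMMDD>-p<4-digit page>.<tif|xml>
NAME_PATTERN = re.compile(r"^[a-z]+-\d{8}-p\d{4}\.(tif|xml)$")


def check_issue(file_names, title, date, page_count):
    """Return a list of problems found in one newspaper issue's file list."""
    problems = []
    # File names must conform to the project's naming rule.
    for name in sorted(file_names):
        if NAME_PATTERN.match(name) is None:
            problems.append(f"bad file name: {name}")
    # Every page must have both an image and an XML file.
    present = set(file_names)
    for page in range(1, page_count + 1):
        for ext in ("tif", "xml"):
            expected = f"{title}-{date}-p{page:04d}.{ext}"
            if expected not in present:
                problems.append(f"missing file: {expected}")
    return problems


# Made-up file list for a two-page issue.
files = [
    "queenslander-18900104-p0001.tif",
    "queenslander-18900104-p0001.xml",
    "queenslander-18900104-p0002.tif",
    "Queenslander_p2.XML",
]
for problem in check_issue(files, "queenslander", "18900104", 2):
    print(problem)
```

Checks for corruption and standards conformance would need checksums and format validators on top of this.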
  • 122. quality assurance: manual quality checks • does the digital object metadata meet accuracy specifications? • does the text meet accuracy specifications? • is the image quality satisfactory? • are article continuations correct? • is the text in reading order?
  • 123. what’s wrong with this? acceptance criteria for an English-language digitization project at a large, well-known, and internationally recognized national library: character accuracy > 80%, word accuracy > 75%, significant word accuracy > 65%
  • 124. what’s wrong with this? project quality requirement: “a high level of accuracy”
  • 125. what’s wrong with this? project quality requirement: “article titles must be 99.5% accurate”
  • 126. what’s wrong with this? project quality requirement: “article title characters in each issue must be 99.5% accurate, that is, each issue may have no more than 5 errors in 1000 article title characters”
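A requirement stated per character and per issue, like the last one above, can be checked mechanically. The sketch below is one possible reading of it, under the assumption that an "error" is one character insertion, deletion, or substitution (Levenshtein edit distance) against a manually keyed ground truth; the slides do not define how errors are counted, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def issue_passes(title_pairs, threshold=0.995):
    """title_pairs: (ground_truth, ocr_text) for every article title in an issue.

    Passes when total errors do not exceed (1 - threshold) of all
    ground-truth title characters in the issue.
    """
    total_chars = sum(len(gt) for gt, _ in title_pairs)
    if total_chars == 0:
        raise ValueError("issue has no article title characters")
    total_errors = sum(edit_distance(gt, ocr) for gt, ocr in title_pairs)
    return total_errors <= (1.0 - threshold) * total_chars


good = [("The Queenslander.", "The Queenslander.")]
bad = [("The Queenslander.", "The Queenslandcr.")]  # 1 error in 17 characters
print(issue_passes(good), issue_passes(bad))
```

Pooling the counts over the whole issue, rather than averaging per title, matches the "no more than 5 errors in 1000 characters" wording.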
  • 127. image quality • sharpness: the amount of detail an image can convey • noise: random variation of image density • dynamic range • contrast (gamma): the slope of the tone reproduction curve in a log-log space. high contrast usually involves loss of dynamic range, that is, loss of detail, or clipping, in highlights or shadows. • vignetting: darkens images near the corners • artifacts: “leftovers” from sharpening or compression Wikipedia contributors, “Image quality," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Image_quality (accessed March 2014).
  • 128. image quality “…images which are ultimately to be viewed by human beings, the only “correct” method of quantifying visual image quality is through subjective evaluation. in practice, however, subjective evaluation is usually too inconvenient, time-consuming and expensive…” “…best way to assess the quality of an image is to look at it because human eyes are the ultimate viewers of most images…” Zhou Wang, Alan Bovik, and Ligang Lu. Why is image quality assessment so difficult? IEEE Transactions on Image Processing. April 2004. Zhou Wang and Hamid R. Sheikh. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing. April 2004.
  • 129. acceptance criteria for the National Library of Australia NDP
  • 130. ? quality assurance ? Discussion topics 1. How does your organization currently do quality assurance for digital data? 2. How much time / effort is given to writing quality assurance procedures and acceptance criteria for digitized data?
  • 132. digitization tools open source vs. commercial software: pros • acquisition : cost, development, and implementation contract costs are likely to be lower than for proprietary software. less likely that there will be contractually-bound upgrade costs. total cost of ownership over the lifetime of usage must be taken into account • data transferability : with open source code and open data formats, there are greater opportunities to share data across interoperable platforms • re-use : open source is free from per user or per instance costs and there is a guaranteed freedom to use it in any way. re-use is enabled. Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
  • 133. digitization tools open source vs. commercial software: pros • cost effective : pay once for development (if at all) and reuse where appropriate. • non-restrictive : open source licenses do not limit or restrict who can use the software, the type of user, or the areas of business in which the software can be used. provides a licensing model that enables rapid provisioning for both known and unanticipated users and for new use cases. • scalable : open source solutions are scalable upwards and downwards with a reduction in the risk of longer term financial implications. no license fees on a “per user” or “per box” basis. no redundant licenses. Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
  • 134. digitization tools open source vs. commercial software: pros • easy to prototype and adapt : open source software is particularly suitable for rapid prototyping and experimentation, where the ability to “test drive” the software with minimal costs and administrative delays can be important. (proprietary software suppliers may also provide the same through a ‘proof of concept’ phase at minimal or no cost.) Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
  • 135. digitization tools open source vs. commercial software: cons • support and maintenance costs : may outweigh those of the proprietary package and include ‘hidden’ commitments. • intellectual property rights : as code is modified and adapted, there may be legal risks concerning the code’s open source status and who owns the intellectual property rights of the modified code. • expertise : requires software installation and maintenance expertise. modification of open source code requires software development expertise. organizations must ensure that they have the right level of expertise to manage it effectively. Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/
  • 136. digitization tools a variety of open source and commercial off-the-shelf (COTS) software is available for digitization projects • easier for systems from different parties or using different technologies to interoperate and communicate with one another • better protection of the data files created by an application against obsolescence of the application • applications / data are easier to port from one platform to another since they follow known guidelines and rules, and the interfaces
  • 137. digitization tools OCR software open source • ABBYY FineReader (http://www.abbyy.com) • Tesseract (https://code.google.com/p/tesseract-ocr) • Nuance OmniPage (http://www.nuance.com) • IRIS Readiris (http://www.irislink.com) • LEADTOOLS OCR (http://www.leadtools.com) • OCRopus (https://code.google.com/p/ocropus) Wikipedia contributors, “Optical character recognition," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Optical_character_recognition (accessed March 2014). Wikipedia contributors, “Comparison of optical character recognition software," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software (accessed March 2014).
  • 138. digitization tools imaging software open source • LEADTOOLS image SDK (http://www.leadtools.com) • ImageGear image SDK (http://www.accusoft.com) • FreeImage image SDK (http://freeimage.sourceforge.net) • BlackIce image toolkits (http://www.blackice.com) • Adobe Photoshop (http://www.adobe.com/Photoshop) • GIMP (http://www.gimp.org) • GraphicsMagick (http://www.graphicsmagick.org) • ImageMagick (http://www.imagemagick.org)
  • 139. digitization tools digital workflow software • Content Conversion Specialists docWorks (http://content-conversion.com) • ScanFlow (http://www.treventus.com) • Goobi (http://www.goobi.org) • Zissor (http://zissor.com) open source
  • 140. digitization tools other software • BagIt : hierarchical file packaging format for the exchange of digital content. A "bag" has just enough structure to safely enclose descriptive "tags" and a "payload" but does not require any knowledge of the payload's internal semantics. See http://sourceforge.net/projects/loc-xferutils and http://tools.ietf.org/html/draft-kunze-bagit-06. open source
  • 141. ? digitization tools ? Discussion questions 1. What software tools does your organization use for digital projects or digital libraries? 2. Does your organization host a digital library? If so, does it use Google Analytics or a similar tool? Why or why not? 3. What software tools does your organization use for project management? Are the tools web-based?
  • 142. digital preservation Preservation of software and preservation of data are two sides of the same coin. From February 2011 Workshop for Digital Curators.
  • 143. preservation Open Archival Information System (OAIS) reference model
  • 145. Vint Cerf on “bit rot”
  • 146. digital preservation long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required
  • 147. digital data risks • standards / format obsolescence • migration to new format, media, or hardware • media obsolescence / decay • bit rot
  • 148. format obsolescence remember … WordPerfect ? MARC records ? Adobe Flash ?
  • 149. strategies for format obsolescence • migrate data to new formats • create a computer software museum with virtual machines • format registries • format validators • don’t worry about it!
  • 150. Jeff Rothenberg on format obsolescence “... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats.” Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American. January 1995. Expanded version published February, 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)
  • 151. standard model for format obsolescence • digital format registry collects information about target format • this information is used to build format identification and verification tools • holders of content use these tools to extract metadata from content in target format; metadata is stored with the content • format registry scans computing environment to determine which formats are obsolescent; notifications sent for obsolete formats • on receiving such a notification, someone builds a tool to convert obsolete format to non-obsolete format using the format specification in the registry • on receiving such a notification, holder of content in obsolete format uses conversion tool and content metadata to convert the file in an obsolete format to a file in a non-obsolete format
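The "format identification tools" step of this model can be illustrated with a toy identifier that matches a file's leading bytes against a small signature table. The table below is hand-written for this sketch and covers only a few formats common in newspaper digitization; real identification tools consult a full format registry, and the name `identify_format` is an assumption of this example.

```python
# A tiny, hand-written signature table (assumption: a real format registry
# holds far more entries and richer rules than leading-byte matches).
SIGNATURES = [
    ("TIFF", b"II*\x00"),                            # little-endian TIFF
    ("TIFF", b"MM\x00*"),                            # big-endian TIFF
    ("PDF", b"%PDF-"),
    ("PNG", b"\x89PNG\r\n\x1a\n"),
    ("JPEG", b"\xff\xd8\xff"),
    ("JPEG 2000", b"\x00\x00\x00\x0cjP  \r\n\x87\n"),
]


def identify_format(header):
    """Return a format name for the leading bytes of a file, or None."""
    for name, magic in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(identify_format(b"II*\x00\x08\x00\x00\x00"))
print(identify_format(b"%PDF-1.4\n"))
print(identify_format(b"plain text"))
```

The identified format name is the kind of metadata that, in the model above, is stored with the content and later matched against obsolescence notifications.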
  • 152. David Rosenthal on format obsolescence “... format obsolescence is a rare problem that happens infrequently to a minority of unpopular formats ...” David Rosenthal. Format obsolescence: Assessing the threat and the defenses. (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf)
  • 153. alternate model for format obsolescence • store only essential data • perform only essential tasks • delay performing tasks as long as possible David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, vol. 28, no. 2, 2010, pp.195-210. doi: 10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).
  • 154. importance of standards vis-a-vis format obsolescence well-defined standards … • guide developers in creation of tools • facilitate development of a broad range of tools for any format • allow developers to maintain existing tools
  • 155. data migration risks • file format changes, for example, PDF 1.4 to PDF 1.7 • file name differences, for example, case-sensitive / case-insensitive names, new operating system • extended file attributes • file permissions, for example, BSD Unix drwxr-xr-x@ to Windows file permissions • soft links / hard links
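The file name risk can be checked before a migration. The following sketch, with the illustrative function name `case_collisions`, lists paths that are distinct on a case-sensitive file system but would collide on a case-insensitive one; it assumes simple case folding is a good enough approximation of the target system's comparison rules, which is not guaranteed.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that differ only by letter case.

    Returns a sorted list of groups; each group holds two or more paths
    that would map to the same name on a case-insensitive file system.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Made-up listing from a case-sensitive source system.
listing = ["issue1/Page1.tif", "issue1/page1.tif", "issue1/page2.tif"]
print(case_collisions(listing))
```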
  • 156. media obsolescence • 5 ¼” floppy disks • 8 track tapes • 3 ½” floppy disks • ZIP drives • CD-R, CD-RW, Blu-Ray • DAT tapes • microfilm • etc
  • 157. strategies for media obsolescence • migrate data to new media, for example, floppy disks to DVD • create and maintain a computer hardware museum
  • 158. media decay a report by NIST and the Library of Congress says ... • virtually all CD-Rs tested indicated an estimated life expectancy beyond 15 years • only 47 percent of recordable DVDs indicated an estimated life expectancy beyond 15 years, some had a life expectancy as short as 1.9 years • in practice actual lifetimes may be considerably shorter
  • 159. prevention / detection of media decay • proper storage • data file checksums (MD5, SHA-1, ...) • monitor media integrity • migrate data from old media to new media
  • 160. bit rot gradual decay of data due to … • storage media failure because of media quality • storage media failure because of improper storage • random events (bit-flip, environmental influences) • software / hardware errors
  • 161. prevention / detection of bit rot • data file fixity check (checksums) such as MD5, SHA-1, ... • monitor file integrity with frequent, corrective audits • duplicate copies, geographically distributed
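A fixity check of the kind listed above can be sketched as follows: record a checksum per file once, then audit later by recomputing and comparing. The sketch defaults to SHA-256 rather than the MD5 or SHA-1 named on the slide (any algorithm name accepted by `hashlib.new` can be passed instead), and the function names and the demo file are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Checksum a file in chunks so large images need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def record_fixity(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, recorded, algorithm="sha256"):
    """Return the recorded files that are now missing or have changed."""
    current = record_fixity(root, algorithm)
    return sorted(name for name, digest in recorded.items()
                  if current.get(name) != digest)


def demo():
    """Record fixity, flip one bit to simulate bit rot, and audit again."""
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "page1.tif"
        page.write_bytes(b"II*\x00 pretend image data")
        recorded = record_fixity(tmp)
        clean = audit(tmp, recorded)
        data = bytearray(page.read_bytes())
        data[5] ^= 0x01  # flip a single bit
        page.write_bytes(bytes(data))
        damaged = audit(tmp, recorded)
    return clean, damaged


print(demo())
```

An audit like this only detects damage; repair still depends on having another good copy, which is where duplicate, geographically distributed copies come in.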
  • 162. distributed decentralized digital preservation • the more copies, the safer the data • the more independent copies, the safer the data • the more frequently copies are audited, the safer the data Paraphrased from David Rosenthal. Keeping bits safe: How hard can it be?
  • 163. distributed decentralized digital preservation • n+1 copies are safer than n copies • n independent copies on different storage devices / media are safer than n copies on similar or identical storage devices / media • data audited every week is safer than data audited every month
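The first two claims can be illustrated with a deliberately simple toy model; it is not taken from Rosenthal's paper. Assume each copy is lost with probability `p_copy` over some period, and that with probability `p_common` a shared cause (same device model, same site, same software bug) destroys every copy at once. Audit frequency is not modeled here.

```python
def loss_probability(p_copy, copies, p_common=0.0):
    """Probability that all copies are lost under the toy model.

    p_copy   -- chance that one copy is lost independently of the others
    copies   -- number of copies kept
    p_common -- chance of a common-mode failure that destroys every copy
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return p_common + (1.0 - p_common) * p_copy ** copies


# n+1 copies versus n copies, fully independent
print(loss_probability(0.1, 3), loss_probability(0.1, 4))
# the same 3 copies when they share a 5% common-mode risk
print(loss_probability(0.1, 3, p_common=0.05))
```

In this model, adding copies shrinks only the independent term; the common-mode term stays, which is why copies on different devices and media are safer than the same number on identical ones.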
  • 164. LOCKSS Lots Of Copies Keep Stuff Safe LOCKSS box: Open source LOCKSS software installed on a dedicated computer or virtual machine. • It ingests content from target websites using a web crawler similar to those used by search engines. • It preserves content by continually comparing the content it has collected with the same content collected by other LOCKSS Boxes, and repairing any differences. • It delivers authoritative content to readers by acting as a web proxy, cache or via Metadata resolvers when the publisher’s website is not available. • It provides management through a web interface that allows librarians to select new content for preservation, monitor the content being preserved and control access to the preserved content. • It dynamically migrates content to new formats as needed for display. From LOCKSS webpages http://www.lockss.org.
  • 165. how LOCKSS works [diagram: LOCKSS boxes at my library, library X, and library Y; data is copied to another LOCKSS box]
  • 166. how LOCKSS works [diagram: the same three LOCKSS boxes; the data is audited]
  • 167. how LOCKSS works [diagram: the same three LOCKSS boxes; the data is audited, one audit is ok and one audit fails]
  • 168. how LOCKSS works [diagram: the same three LOCKSS boxes; data is copied to another LOCKSS box]
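The copy, audit, and repair cycle in these slides can be sketched as a simple majority vote among boxes. This is only an illustration of the idea: the actual LOCKSS polling protocol is considerably more elaborate, and the function names and in-memory "boxes" here are assumptions of the sketch.

```python
import hashlib
from collections import Counter


def digest(content):
    """Checksum used to compare copies without exchanging full content."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(boxes):
    """Repair copies that disagree with the majority.

    boxes -- dict mapping a box name to the bytes it holds for one item.
    Returns the names of the boxes that were repaired.
    """
    if not boxes:
        return []
    votes = Counter(digest(content) for content in boxes.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(boxes) / 2:
        raise RuntimeError("no majority; cannot tell which copy is good")
    good = next(c for c in boxes.values() if digest(c) == winner)
    repaired = []
    for name, content in list(boxes.items()):
        if digest(content) != winner:
            boxes[name] = good  # copy the agreed content over the bad copy
            repaired.append(name)
    return repaired


boxes = {
    "my library": b"page image bytes",
    "library X": b"page image bytes",
    "library Y": b"page image bytez",  # damaged copy
}
print(audit_and_repair(boxes))
```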
  • 169. private LOCKSS networks Alabama Digital Preservation Network (http://www.adpn.org/). CLOCKSS (Controlled LOCKSS), a non-profit collaboration of North American, European, and Asian cultural heritage institutions whose purpose is to preserve digital content with LOCKSS (http://www.clockss.org). MetaArchive Cooperative is a digital preservation cooperative created by cultural heritage institutions (http://www.metaarchive.org).
  • 170. digital preservation references • Nancy McGovern and Katherine Skinner editors. Aligning National Approaches to Digital Preservation. Educopia Institute Publications. Atlanta Georgia. 2012. Proceedings of a conference on digital preservation held at the National Library of Estonia in May 2011. (accessed 15 August 2012 at http://www.educopia.org/sites/default/files/ANADP_Educopia_2012.pdf). • David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, v. 28, n. 2, 2010, pp.195-210. doi: 10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf). • David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/ACM2010.pdf). • Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American January 1995. Expanded version published February 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf) • Joint Information Systems Committee (JISC) Programme on Digital Preservation at http://www.jisc.ac.uk/preservation. • Library of Congress on Digital Preservation at http://www.digitalpreservation.gov. • Stanford University’s website for LOCKSS at http://www.lockss.org.
  • 171. newspaper digitization programs around the world National Library of Finland (http://digi.kansalliskirjasto.fi/) British Newspaper Archive, British Library (http://www.bl.uk/welcome/newspapers) National Digital Newspaper Program, Library of Congress (http://chroniclingamerica.loc.gov/) National Library of New Zealand (http://paperspast.natlib.govt.nz/) National Library of Australia, Australian Digital Newspapers Program (http://trove.nla.gov.au/newspaper) Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/) Singapore National Library Board (http://newspapers.nl.sg/) Bibliothèque nationale de France (http://gallica.bnf.fr/) Europeana Newspapers Project, a collaboration of 17 organizations (http://www.europeana-newspapers.eu/) National Library of Latvia (https://periodika.lndb.lv/)
  • 172. • Library of Congress National Digital Newspaper Program http://www.loc.gov/ndnp/ • Australian Newspaper Digitisation Program http://www.nla.gov.au/content/newspaper-digitisation-program • IFLA Newspapers Section Digitisation projects and best practices http://www.ifla.org/node/6777 • ICON: International Coalition on Newspapers http://icon.crl.edu/digitization.htm
  • 173. • METS, MODS, ALTO, PRISM, and other library standards : http://www.loc.gov/standards • OAIS : http://public.ccsds.org/publications/RefModel.aspx • NISO standards and guidelines : http://www.niso.org/publications/rp • Good practice guides : http://www.ukoln.ac.uk • And many, many more
  • 174. Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).
  • 175. ?! Frederick Zarndt Secretary, IFLA Newspapers Section frederick@frederickzarndt.com Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.