Large-scale digitization of archival records. Marc Holtman
Analog original documents in the repository. Scanning. Scans. Metadata: providing (digital) access. Searching and consulting: usability, findability.
Analog original documents in the repository. Scanning. Scans. Concepts, principles, technical aspects, economic principles and workflow in large-scale digitization of archival records.
Round of introductions. http:// /atlantis/?application=nadere%20toegangen&database=nt&entrypoint=personen&query=&vanaf=0&sessienummer=1.ef00&relatienr=1&extra1=&extra2=&aantal_per_pagina=9&service=object&templatename=item.htm&recordnumber=0&detailobject=%3cd%7cw2kzed02%3aD%3a%2fatlantisdatabase%2fatlantis.nt.db%7c86%7c0%7caca04%7c10000%3e# A selection of examples from the homework: Books (Internet Archive); Google Books (Google Books); Munich Digitisation Centre (Digital collections).
Visitors
Year | Reading rooms | Website
1982 | 24,027 |
1988 | 29,788 |
1992 | 27,738 |
1998 | 26,598 | 40,048
2002 | 25,014 | 224,050
2006 | 17,958 | 512,592
2007 | 92,678 | 520,483
2008 | 118,312 | 538,483
2009 | 106,000 | 531,143
The time when digitization was optional is over. The number of website users is rising sharply. But so is the expectation that the records can be consulted online.
Q. How many scans does digitizing 32 kilometres of archives yield? 1 metre = 7,000 scans. A. 224,000,000 scans. Q. How long does it take to digitize everything? Production = 10,000 scans per week. A. 431 years.
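The arithmetic behind these two answers can be reproduced in a few lines. This is a minimal sketch, assuming 52 production weeks per year (the slide does not state this, but it is the value that reproduces the 431-year answer); the function and constant names are illustrative only.

```python
# Figures taken from the slide.
SCANS_PER_METRE = 7_000
SCANS_PER_WEEK = 10_000
# Assumption: 52 production weeks per year (not stated on the slide).
WEEKS_PER_YEAR = 52


def total_scans(archive_km: float) -> int:
    """Number of scans needed for an archive of the given shelf length."""
    return round(archive_km * 1_000 * SCANS_PER_METRE)


def years_to_digitize(scans: int) -> float:
    """Years of production needed at the weekly rate above."""
    return scans / SCANS_PER_WEEK / WEEKS_PER_YEAR


scans = total_scans(32)
print(f"{scans:,} scans, about {years_to_digitize(scans):.0f} years")
```

Changing `SCANS_PER_WEEK` shows how strongly the time needed depends on production capacity.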
The number of documents to be digitized from an archive quickly runs into the hundreds of thousands to millions within a single project. Incidental and structural costs must remain manageable even at these enormous volumes. Incidental and structural costs depend on: A. Technical aspects: quality standard of the scans / file size. B. Work processes: organization of the reproduction process.
99% of all documents in an archive are text documents. But what if digitization becomes large-scale, and the starting point is above all "consultation" (in other words, "reading the text")?
Principles, concepts, technical aspects and workflow. Large-scale digitization of archival records
We Scan: principles, image quality and workflow principles. We Store: compression and file size. We Do: workflow, tools and practical issues.
We scan: Digitization at the Amsterdam City Archives in general. We always work on a project basis. Goals of digitization projects vary from access to substitution of the originals. In every project the quality standard and method are set depending on the purpose and the type of material. For all projects we have one workflow.
We scan: 1. At large scale. The more scans are made, the lower the price per scan. Large-scale production is a prerequisite for keeping production costs as low as possible.
We scan: 3. A broad spectrum of document types. Documents digitized in this reproduction process can take the following forms: small and large sizes; bound and loose-leaf items; card indexes; old and modern material; low- and high-contrast documents; text alone, or text and image together; hybrid forms.
We scan: 4. For archival research from screen or print. Purpose of the scans: archival research using the web, straight from screen or print. Costs for producing and storing scans are determined to a large extent by the quality standard set for the scans. The higher the quality standard, the higher the costs. To keep costs low it is prudent to let the quality standard follow from the requirements the end user places on the scan. Textual information that is legible in the originals must be legible in the scans.
We scan: 4. For archival research from screen or print. Specified (basic) quality standard: reproduction of all significant information. Reproduction of details that are not part of the textual information is not required. A higher quality inevitably pushes up both incidental and structural costs, but has no added value for the customer at all.
We scan: Scan quality and legibility. Example: a very "light" original. High-quality scan: optimal tonal range, excellent flexibility, poor legibility. Modified scan (contrast): poor tonal range, little flexibility, excellent legibility. Which one would you buy? Practice shows that what is experienced as "good legibility" is very personal. We decided to solve this problem with a smart filter in the document viewer.
We scan: 4. For archival research from screen or print. It does make sense to let the quality standard follow from the purpose the end user has for the scans. Skimping on the quality of scans (it can be better) is purely an economic decision, not one taken on principle. Price comparison of scanning costs (price rates for scanning, external partner): High-end: $ 3 – 10. Legibility: $ 0.30 – 0.75. Legibility, auto-feed: $ 0.05.
We scan: 5. For conservation and security. Conservation of the originals remains the major concern. The scans in the scanning-on-request service are made for the purpose of access / archival research, not as a substitute for the originals. Nevertheless, digitization does have a real conservation function. After digitization the originals can no longer be requested in the reading room. This way damage to or loss of the originals is ruled out.
We scan: 6. Always complete files. A file can contain from one to hundreds of documents. By definition the entire file is scanned, never just a selection of pages. There are a few reasons for this: The costs of scanning are not so much a factor of quantity, but rather of the manual processing involved. With a selection, it would have to be indicated in the originals or in the metadata which documents have been digitized. When a file is shown in the Archiefbank, the user expects completeness. When non-scanned pages have to be digitized later, the entire preparation process has to be gone through once again.
We scan: 7. Contracting out the scanning to external partners. The in-house scan facilities are not designed for large-scale digitizing. The complexity of the workflow and of the material to be scanned calls for: specialized hardware and software; specialized set-ups; knowledge; a very complex technical infrastructure. Investing only makes sense at very high production volumes, organized on a large scale. Contracting out the scanning was a logical choice.
We scan: 7. Contracting out the scanning to external partners. Contracting out scanning is more than awarding a contract to a supplier. There are many scanning companies. Most have experience in bulk processing, but not with this degree of complexity and diversity. The workflows of archive and digitizer also have to dovetail. This calls for intensive collaboration.
We scan: Low costs. Customers think a low price is important. Archival research easily runs into the use of dozens to hundreds of documents. 100 scans should not cost $ 100. The price of an ordinary copy in our reading room should be the benchmark. The costs of purchasing scans online should be competitive with the travel costs of visiting our reading room. This means that the costs of producing and storing scans have to be as low as possible.
We store: Scans with a file size as small as possible. Storage costs are still considerable when producing large quantities of scans. To bring structural costs down, the file size of the scans has to be as small as possible. This can be achieved in three ways: 1. Skimping on resolution. 2. Skimping on bit depth / number of colors (only possible in formats like TIFF and PNG). 3. Using (lossless or lossy) compression on the files. We use a combination of 1 and 3.
Resolution. The finer the grid used in scanning, the more information and the higher the level of detail. But it remains, in any case, a strong simplification of reality. At a certain level of detail the individual "grid cells" will always become visible. For text documents the grid must be fine enough to match the details of the textual information. The dot on an i must still be distinguishable as such. But details in the structure of the paper, for example, need not be visible in the scan.
Resolution. Resolution is usually expressed in DPI (Dots Per Inch) or, more properly, PPI (Pixels Per Inch). DPI therefore says something about the information density per unit of length, and with that something about the theoretically attainable quality. But it says nothing at all about the objective quality of a scan. Both a € 50 scanner from Aldi and a high-end scanner costing € 50,000 can scan at 300 dpi, but the quality of the scans they produce will clearly differ. The resolving power of a scanner can be measured with test targets on which so-called line pairs are measured.
Resolution. The benchmark resolution is usually 300 dpi. This is based on the smallest letter e (1 mm) in printed matter. Not all documents contain details that small. The required resolution can be calculated with, among other methods, the so-called Quality Index: http://
Resolution. Resolution largely determines file size:
Resolution (A4) | File size
300 dpi | 24 MB
400 dpi | 44 MB
800 dpi | 177 MB
1600 dpi | 708 MB
3200 dpi | 2.8 GB
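The relation between resolution and uncompressed file size can be checked with a short calculation. This is a rough sketch under stated assumptions: an A4 page of 210 × 297 mm, 24-bit color, no compression, and 1 MB counted as 2^20 bytes; none of these are spelled out on the slide. Under these assumptions the results match the figures above to within rounding (300 dpi comes out just under 25 MB).

```python
MM_PER_INCH = 25.4


def uncompressed_bytes(dpi: int, width_mm: float = 210, height_mm: float = 297,
                       bits_per_pixel: int = 24) -> int:
    """Raw image size in bytes for a page scanned at the given resolution."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8


for dpi in (300, 400, 800, 1600, 3200):
    mib = uncompressed_bytes(dpi) / 2**20
    print(f"{dpi:>4} dpi: {mib:,.0f} MiB")
```

Because both the width and the height in pixels scale with the resolution, doubling the dpi roughly quadruples the file size.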
Resolution. Examples: 300 dpi, 200 dpi, 150 dpi
Resolution. Conclusion: at 150 dpi files are small and most text is still perfectly legible. But is it wise to take this as the starting point when digitizing? A low resolution also means lower structural management costs. In a few years it may be possible to rescan with better technology. But 150 dpi is not sufficient if in the future we want to deliver a higher quality on the basis of these images, apply OCR and/or convert to better compression and file formats. The choice depends on objectives, resources and volumes.
Color. Color depth: bits and bytes. A pixel is a cell with a single color. The smallest unit of a digital file is a bit: it has the value 0 or 1. When a pixel consists of 1 bit, that pixel can have the value black (0) or white (1). If we want to be able to define more colors for a pixel, we have to increase the number of bits per pixel. With 8 bits (each of which can take the value 1 or 0), 256 combinations, and therefore colors, are possible (for example 0 0 0 1 0 0 1 1). Most cameras use 8 bits per color channel (24 bits in total). This makes 16.7 million colors possible.
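The numbers of colors quoted here follow directly from the number of bits per pixel: n bits give 2^n combinations. A minimal illustration (the function name is only for this example):

```python
def colors(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel with this many bits can take."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 24):
    print(f"{bits:>2} bits per pixel: {colors(bits):,} possible values")

# The example bit pattern from the slide, read as a binary number.
print(int("00010011", 2))
```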
24 bits (8 bits per color channel); 8 bits, grayscale; 1 bit, black and white
Compression. A method by which the information can be described more efficiently, so that the file size decreases. Example: Peer Spel Spel Spel Spel Peer Peer Spel Spel Spel Peer Peer. Storing these words: 48 letters. Encoding the words: P = Peer, S = Spel.
Compression. Result: P S S S S P P S S S P P. Storing: 12 letters (plus the coding table: P = Peer, S = Spel).
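The word-coding example can be written out as a tiny dictionary coder. This is only an illustration of the principle on the slide, not how an image compression format works internally; the variable names are illustrative.

```python
words = "Peer Spel Spel Spel Spel Peer Peer Spel Spel Spel Peer Peer".split()

# Coding table from the slide: one letter per distinct word.
code_table = {"Peer": "P", "Spel": "S"}
decode_table = {code: word for word, code in code_table.items()}

encoded = [code_table[word] for word in words]
decoded = [decode_table[code] for code in encoded]

print(sum(len(word) for word in words), "letters before encoding")
print(sum(len(code) for code in encoded), "letters after encoding")
print("round trip is exact:", decoded == words)
```

Decoding with the table restores the original sequence exactly, which is what makes this kind of coding lossless.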
Compression. Two kinds of compression: A. Lossless (exactly reversible). No information is lost. Compare it to a pillow from which you press all the air before packing it. Take the pillow out of the packaging and it is again exactly the pillow it was before packing. B. Lossy (not exactly reversible). Certain information is thrown away. Again we press the air out of the pillow, but because we want an even smaller package we also take out a few feathers. This need not be a problem, because missing a few feathers may not make the pillow any less comfortable in use. But the discarded feathers will not be put back when the pillow is unpacked again.
Compression and information loss. An often-heard claim: do not use lossy compression when storing images, because lossy compression causes loss of information. Lossy compression does indeed cause loss of information, but that does not by definition mean loss of meaningful information. In any case it is better to say: loss of information relative to the uncompressed file. Scanning is, after all, inextricably linked to loss of information relative to the original, even when lossless compression is applied.
Lossy compression. Examples: JPEG quality 10 (300 dpi); JPEG quality 12 (300 dpi); JPEG quality 4 (300 dpi); JPEG quality 4 (200 dpi); JPEG 2000, part 6
Compression and sustainability. An often-heard claim: compressed files are more likely to become corrupt than uncompressed files, so data compression must not be applied. Research has shown that this claim is not correct. Another approach to preservation: redundancy in storage. Compressed files in particular lend themselves well to this.
We store: Scans with a file size as small as possible. Resolution, compression and legibility: an example. 300 dpi, high-quality JPEG; 200 dpi, low-quality JPEG.
We store: Scans with a file size as small as possible. Comparison between file format, compression, resolution and file size.
Format | Color | Compression | Type | Resolution | Avg file size | Size of 500,000 scans | %
TIFF | 24 bits | None | --- | 300 dpi | 22.1 MB | 11 TB | 100%
TIFF | 24 bits | Qua (ps) 10 | Lossy | 400 dpi | 3.3 MB | 1.6 TB | 15%
JPEG | 24 bits | Qua (ps) 4 | Lossy | 200 dpi | 255 KB | 124 GB | 1.1%
JPEG | 24 bits | Qua (ps) 10 | Lossy | 300 dpi | 2.1 MB | 1.1 TB | 10%
JPEG | 24 bits | Qua (ps) 12 | Lossy | 300 dpi | 7.5 MB | 3.7 TB | 34%
JPEG2000 | 24 bits | Part 6 | Lossy | 300 dpi | 120 KB | 59 GB | 0.5%
JPEG2000 | 24 bits | Part 1 | Lossless | 300 dpi | 12 MB | 6 TB | 55%
We store: Scans with a file size as small as possible. Comparison of storage costs. Storage of 500,000 images; average size per uncompressed scan = 22.1 MB. Price rate for 1 TB of storage in a controlled e-repository environment at two separate locations, including IT costs: $ 7,000 (NLD, Nov 2009).
File format | Storage | Costs 1 year | Costs 10 years
TIFF uncompressed | 11 TB | $ 77,000 | $ 770,000
JPEG 10 | 1.1 TB | $ 7,700 | $ 77,000
JPEG 4 (200 dpi) | 124 GB | $ 868 | $ 8,680
JPEG 2000 (part 1, ll) | 6 TB | $ 42,000 | $ 420,000
(File) size still does matter!
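The cost figures on this slide can be recomputed from the storage volumes and the price rate. This is a sketch under two assumptions that the slide implies but does not state outright: the $ 7,000 rate applies per TB per year, and 124 GB is counted as 0.124 TB. The names are illustrative.

```python
# Assumption: the slide's rate of $ 7,000 per TB applies per year.
PRICE_PER_TB_PER_YEAR = 7_000

# Storage needed for 500,000 scans, in TB, as given on the slide.
storage_tb = {
    "TIFF uncompressed": 11,
    "JPEG 10": 1.1,
    "JPEG 4 (200 dpi)": 0.124,
    "JPEG 2000 (part 1, lossless)": 6,
}


def storage_cost(terabytes: float, years: int = 1) -> float:
    """Storage cost in dollars for the given volume and period."""
    return terabytes * PRICE_PER_TB_PER_YEAR * years


for name, tb in storage_tb.items():
    print(f"{name:<30} 1 year: $ {storage_cost(tb):>9,.0f}   "
          f"10 years: $ {storage_cost(tb, 10):>9,.0f}")
```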
We Do: Developing the reproduction process. A streamlined, standardized process is indispensable when digitizing on a large scale. Projects with different goals, document types and partners take place at the same time. Guidelines and best practices often take no account of these complex factors or of the number of scans to be produced. We developed a process in which large scale and flexibility are the starting points. All digitization projects follow this process.
We Do: Developing the reproduction process. For all projects, at any moment, it has to be clear: what the current status of each unit to be digitized is; where each unit is located; which current and subsequent tasks are to be performed on each unit. This calls for workflow management with a user-friendly application. We developed a simple but effective workflow application in-house. In the following slides we focus on the weekly production of 10,000 scans in the digitizing-on-request service.
We Do: Production of 10,000 scans on a weekly basis. 1. Requesting digitization. All public files can be requested for digitization via the finding aids in the Archiefbank, just by clicking on the "digitize" button.
We Do: 2. Providing order numbers. A unit to be digitized must be identifiable at each step of the handling process. The units therefore get a unique, meaningless order number. An order number is provided by the metadata management system and is the basis for: communication with the digitizer; scanning; assigning filenames; registration of filenames; billing by the digitizer. In practice: all units to be digitized get an order ticket.
We Do: 3. Assessing the originals. The workflow system generates a list of all originals to be assessed, for retrieval from the repositories. The list is sorted by repository / shelf to make retrieval efficient.
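Sorting a retrieval list by repository and shelf is a plain two-key sort. The sketch below is hypothetical: the record layout, the repository labels, the shelf numbers and all order numbers except A20758 are made up for illustration and do not come from the archive's workflow system.

```python
from typing import NamedTuple


class Request(NamedTuple):
    order_number: str
    repository: str  # hypothetical repository label
    shelf: int       # hypothetical shelf number


requests = [
    Request("A20760", "B", 12),
    Request("A20758", "A", 40),
    Request("A20759", "B", 3),
    Request("A20761", "A", 7),
]


def retrieval_list(items: list[Request]) -> list[Request]:
    """Order requests so that each repository is walked once, shelf by shelf."""
    return sorted(items, key=lambda r: (r.repository, r.shelf))


for request in retrieval_list(requests):
    print(request.repository, request.shelf, request.order_number)
```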
We Do: 4. Checking the originals. All originals to be assessed are stored in a special room. In this room all checks are carried out.
We Do: 4. Checking the originals. Information about the originals in our management systems is not always complete. A rough check of the originals therefore takes place. A. Content: copyright; restrictions on public access; privacy. If an item falls into one of these categories the request is rejected. B. Condition of the material: items that are in such a condition that digitizing or transport could cause damage, or that are packaged in a way that makes scanning in conventional set-ups impossible, do not qualify for the standard way of digitization.
We Do: 4. Checking the originals. Material preparation is limited to the bare minimum. We do: staples are removed as a rule; small repairs are carried out by our restoration staff. We don't: the sequence of the originals as found in the repository is not checked or altered; the originals are not numbered.
We Do: Not numbering the originals. Numbering the originals has one advantage: the completeness of the scans (compared to the originals) can be guaranteed. But this is only true when the numbering tallies exactly, because: a missing number in a sequence of scans leads to the conclusion that one original has not been scanned; numbers that are assigned twice lead to illogical end numbers (100 scans: scan 100 has been numbered 99). Experiments showed that faultless numbering cannot be achieved in practice.
We Do: Not numbering the originals. Completeness can be secured by other means: comparing scans to originals 1:1 after digitization; scanning the originals twice (low quality: # scans = 365; high-quality master files: # scans = 365).
We Do: 5. Transport. For secure transport, special flight cases are used.
We Do: 6. / 7. Scanning and assigning filenames. After scanning, the scan operator or data manager has to assign filenames to the scans. It has to be perfectly clear which filenames these should be. Filenames are the key between scans and metadata. As a rule filenames contain no meaningful information, because when the meaning changes, the filenames would have to change too.
Assigning filenames at City Archives Amsterdam. Filename: 12 characters; first 6: order number; last 6: serial number. Customer request, via the management systems, leads to an order ticket (Archive 195, File 836, Order: A20758; range A20758000001 – A20758999999). Scanning the order: A20758000001, A20758000002, A20758000003, A20758000004, A20758000005. Scan report: A20758000001, A20758000002, A20758000003, A20758000004, A20758000005. Registration of filenames, then import.
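The filename scheme on this slide (a 6-character order number followed by a 6-digit serial number) is simple enough to sketch. The helper names below are hypothetical, and the assumption that order numbers consist only of capital letters and digits is inferred from the single example A20758.

```python
import re

# Assumption: order numbers use capital letters and digits only.
FILENAME_PATTERN = re.compile(r"(?P<order>[A-Z0-9]{6})(?P<serial>\d{6})")


def filename(order_number: str, serial: int) -> str:
    """Build a 12-character filename from an order number and a serial number."""
    if len(order_number) != 6:
        raise ValueError("order number must be 6 characters")
    if not 1 <= serial <= 999_999:
        raise ValueError("serial number must be between 1 and 999999")
    return f"{order_number}{serial:06d}"


def split_filename(name: str) -> tuple[str, int]:
    """Recover the order number and serial number from a filename."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid filename: {name!r}")
    return match["order"], int(match["serial"])


# A scan report for the first five scans of order A20758.
scan_report = [filename("A20758", n) for n in range(1, 6)]
print(scan_report)
```

Because the order number can be read back from every filename, each scan can be linked to its order without putting any descriptive information in the name.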
We Do: 10. 11. Checking scans and metadata. Scans and metadata are checked efficiently. Where possible checks are automated. An application from which all checks can be executed is in development. Basic checks:
Check | Method
Viruses | Virus checker
Data integrity | MD5 checksum comparison
File format validity | JHOVE
Quality of scans | Visual check of reference scans; visual check of production scans
Filenames | Script
Completeness | Depends on project
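An MD5 checksum comparison of the kind listed under data integrity can be done with the Python standard library. This is a sketch, not the archive's own script: it assumes the expected checksums are available as a mapping from filename to hexadecimal digest, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 digest of a file, read in chunks so large scans fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_checks(directory: Path, expected: dict[str, str]) -> list[str]:
    """Return the filenames that are missing or whose checksum differs."""
    failures = []
    for name, checksum in expected.items():
        path = directory / name
        if not path.is_file() or md5_of_file(path) != checksum.lower():
            failures.append(name)
    return failures
```

An empty result from `failed_checks` means every listed file is present and unchanged since its checksum was recorded.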
We Do: 13. 14. Import metadata and scans into management systems. After all checks have been approved, scans and metadata are imported into the management systems. The imports are executed automatically, on the basis of scripts and standard protocols for file transfer. After import the "order for digitization" of each unit is completed.
We Do: 18. Import metadata into the website. The website is hosted at an external location. For exchange of finding aids we use EAD. Metadata are uploaded to the web server by simple HTTP transfer, from any workstation at the archive, directly via the CMS of the website. After import the metadata are optimized for the search system.
We Do: 17. Import scans into the website. Transport medium. The bandwidth of the internet connections at the archive is still too small for direct SFTP (or similar) upload of large quantities of scans to the web server. It seems likely that this will change in the near future. Until then scans are transported on portable USB hard disks.
We Do: 17. Import scans into the website. Import. After the hard disk has been connected to the server the import process starts. Some basic checks are executed on the scans. Derivatives for thumbnails and for the zoom / contrast functionality are made.
We Do: Request complete! When both scans and metadata have been imported, an e-mail is automatically sent to the requester. This e-mail contains a link to the finding aid and thumbnails on the website. The requester can decide whether to buy scans or not. The happy customer. MARAC Conference, October 30, 2009.
Costs and income
Costs Archiefbank (2008): Digitization on request € 140,000; Web services € 52,000; Digitization projects € 200,000.
Income Archiefbank (2008): Digitization on request € 100,000; Project funding € 330,350; Government € 40,000.

UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU

Recently uploaded (20)

How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU

Grootschalige digitalisering van archivalia

  • 1. Grootschalige digitalisering van archivalia (Large-scale digitization of archival materials), Marc Holtman
  • 2. Diagram labels: analogue original documents in the repository; scanning; scans; metadata; providing (digital) access; searching and consulting; usability; findability
  • 4. Analogue original documents in the repository; scanning; scans. Concepts, principles, technical aspects, economic principles and workflow in the large-scale digitization of archival materials
  • 6. http:// / atlantis /? application =nadere%20toegangen&database= nt & entrypoint =personen&query=&vanaf=0&sessienummer=1.ef00& relatienr =1&extra1=&extra2=&aantal_per_pagina=9&service=object& templatename = item.htm & recordnumber =0&detailobject=%3cd%7cw2kzed02%3aD%3a%2fatlantisdatabase%2fatlantis.nt.db%7c86%7c0%7caca04%7c10000%3e# A selection of examples from the homework: books (Internet Archive), Google books (Google Books), Munich Digitisation Centre (Digital collections)
  • 10. Visitors per year:
      Year   Reading rooms   Website
      1982   24.027
      1988   29.788
      1992   27.738
      1998   26.598          40.048
      2002   25.014          224.050
      2006   17.958          512.592
      2007   92.678          520.483
      2008   118.312         538.483
      2009   106.000         531.143
  • 11.
  • 12. Q. How many scans does digitizing 32 kilometres of archives yield? 1 metre = 7.000 scans. A. 224.000.000 scans. Q. How long does it take to digitize everything? Production = 10.000 scans per week. A. 431 years
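The arithmetic behind slide 12 can be checked with a short sketch. The scans-per-metre and scans-per-week figures come from the slide; the function names and the assumption of 52 production weeks per year are illustrative only.

```python
# Figures from slide 12; names and the 52-week year are assumptions.
SCANS_PER_METRE = 7_000
SCANS_PER_WEEK = 10_000
WEEKS_PER_YEAR = 52


def total_scans(archive_km: float) -> int:
    """Number of scans for an archive of the given length in kilometres."""
    return round(archive_km * 1_000 * SCANS_PER_METRE)


def years_needed(scans: int) -> float:
    """Years of production at the weekly rate."""
    return scans / SCANS_PER_WEEK / WEEKS_PER_YEAR


scans = total_scans(32)
print(scans)                       # 224000000
print(round(years_needed(scans)))  # 431
```

At this production rate the backlog is a matter of centuries, which is why the later slides concentrate on cost per scan and on a workflow that scales.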
  • 13. The number of documents to be digitized in an archive project quickly runs into the hundreds of thousands or millions. Incidental and structural costs must remain manageable even at these enormous volumes. Incidental and structural costs depend on: A. Technical aspects: quality standard of the scans / file size. B. Work processes: organization of the reproduction process
  • 14.
  • 15. Principles, concepts, technical aspects and workflow. Large-scale digitization of archival materials
  • 16. We Scan: principles, image quality and workflow principles. We Store: compression and file size. We Do: workflow, tools and practical issues
  • 17. We scan. Digitization at the Amsterdam City Archives in general. Goals of digitization projects vary from access to substitution of the originals. In every project the quality standard and method are set depending on purpose and type of material. We always work on a project basis. For all projects we have one workflow
  • 18. We scan. 1. At large scale. The more scans are made, the lower the price per scan. Large-scale production is a prerequisite for keeping production costs as low as possible
  • 19. We scan. 3. A broad spectrum of document types. Documents digitized in this reproduction process can take the following forms: small and large sizes; bound and loose-leaf entities; card indexes; old and modern material; low- and high-contrast documents; text alone, or text and image together; hybrid forms
  • 20. We scan. 4. For archival research from screen or print. Purpose of the scans: archival research using the web, straight from screen or print. Costs for producing and storing scans are determined to a large extent by the quality standard set for the scans. The higher the quality standard, the higher the costs. To keep costs low it is prudent to let the quality standard follow from the requirements the end user places on the scan. Textual information that is legible in the originals must be legible in the scans
  • 21. We scan. 4. For archival research from screen or print. Specified (basic) quality standard: reproduction of all significant information. Reproduction of details that are not part of the textual information is not required. A higher quality than that inevitably pushes up both incidental and structural costs, but has no added value for the customer at all
  • 22. We scan. Scan quality and legibility. Example: a very “light” original. High-quality scan: optimal tonal range, excellent flexibility, poor legibility. Modified scan (contrast): poor tonal range, little flexibility, excellent legibility. Which one would you buy? Practical experience shows that what counts as “good legibility” is very personal. We decided to solve this problem with a smart filter in the document viewer
  • 23. We scan. 4. For archival research from screen or print. It makes sense to let the quality standard follow from the purpose the end user has for the scans. Skimping on the quality of scans (it could be better) is purely an economic decision, not one taken on principle. Price comparison of scanning costs (price rates for scanning, external partner): high-end $ 3 – 10; legibility $ 0,30 – 0,75; legibility, auto-feed $ 0,05
  • 24. We scan. 5. For conservation and security. The scans in the scanning-on-request service are made for the purpose of access / archival research, not as a substitute for the originals. Conservation of the originals remains the major concern. Nevertheless, digitization does have a real conservation function: after digitization the originals can no longer be requested in the reading room. This way damage to or loss of the originals is ruled out
  • 25. We scan. 6. Always complete files. A file can contain from one to hundreds of documents. By definition the entire file is scanned, never just a selection of pages. There are a few reasons for this: the costs of scanning are not so much a factor of quantity as of the manual processing involved; with a selection, it has to be indicated in the originals or the metadata which documents are digitized; when shown in the Archiefbank, the user expects completeness; when non-scanned pages have to be digitized later, the entire preparation process has to be gone through once again
  • 26. We scan. 7. Contracting out the scanning to external partners. The in-house scan facilities are not designed for large-scale digitizing. The complexity of the workflow and of the material to be scanned calls for: specialized hardware and software; specialized set-ups; knowledge; a very complex technical infrastructure. Investing only makes sense at very high production, organized on a large scale. Contracting out the scanning was a logical choice
  • 27. We scan. 7. Contracting out the scanning to external partners. Contracting out scanning is more than awarding a contract to a supplier. There are many scanning companies. Most have experience in bulk processing, but not with this degree of complexity and diversity. The workflows of archive and digitizer also have to dovetail. This calls for intensive collaboration
  • 28. We scan. Low costs. Customers think a low price is important. Archival research easily involves dozens to hundreds of documents. 100 scans should not cost $ 100. The price of an ordinary copy in our reading room should be the benchmark. The costs of purchasing scans online should be competitive with the travel costs of visiting our reading room. This means that costs for producing and storing scans have to be as low as possible
  • 29. We store. Scans with a file size as small as possible. Storage costs are still considerable when producing large quantities of scans. To bring structural costs down, the file size of the scans has to be as small as possible. This can be achieved in three ways: 1. Skimping on resolution. 2. Skimping on bit depth / number of colors (only possible in formats like TIFF and PNG). 3. Using (lossless or lossy) compression on the files. We use a combination of 1 and 3
  • 30. Resolution. The finer the grid used in scanning, the more information and the higher the level of detail. Either way, though, it remains a strong simplification of reality. At a certain level of detail the individual “grid cells” will always become visible. For text documents the grid must be fine enough to match the details of the textual information. The dot on an i must still be distinguishable as such. But details in the structure of the paper, for example, need not be visible in the scan
  • 31. Resolution. Resolution is usually expressed in DPI (dots per inch) or, more accurately, PPI (pixels per inch). DPI therefore says something about the information density per unit of length, and with that something about the theoretically achievable quality, but nothing at all about the objective quality of a scan. Both a € 50 scanner from Aldi and a high-end scanner of € 50.000 can scan at 300 dpi, but the quality of the scans produced will clearly differ. The resolving power of a scanner can be measured with test targets on which so-called line pairs are measured
  • 32. Resolution. The benchmark resolution is usually 300 dpi. This is based on the smallest letter e (1 mm) in printed matter. Not all documents contain details that small. The required resolution can be calculated with, among other things, the so-called Quality Index: http://
  • 33. Resolution. Resolution is a major determinant of file size:
      Resolution (A4)   File size
      300 dpi           24 MB
      400 dpi           44 MB
      800 dpi           177 MB
      1600 dpi          708 MB
      3200 dpi          2,8 GB
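The figures on slide 33 can be approximated from first principles. The sketch below assumes that the slide refers to uncompressed 24-bit color scans of an A4 page (210 × 297 mm) and that sizes are given in binary megabytes; neither assumption is stated on the slide, and the function name is illustrative.

```python
A4_MM = (210, 297)    # ISO A4 page size in millimetres
MM_PER_INCH = 25.4


def uncompressed_size_mib(dpi: int, bits_per_pixel: int = 24, page_mm=A4_MM) -> float:
    """Uncompressed size of one scanned page in MiB (assumed 24-bit color)."""
    width_px = round(page_mm[0] / MM_PER_INCH * dpi)
    height_px = round(page_mm[1] / MM_PER_INCH * dpi)
    size_bytes = width_px * height_px * bits_per_pixel / 8
    return size_bytes / 2**20


for dpi in (300, 400, 800, 1600, 3200):
    print(f"{dpi:>5} dpi  {uncompressed_size_mib(dpi):8.1f} MiB")
```

Under these assumptions the results land close to the slide's values (just under 25 MiB at 300 dpi, about 44 MiB at 400 dpi). Because the pixel count grows with the square of the resolution, doubling the dpi roughly quadruples the file size.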
  • 34. Resolution. Examples: 300 dpi, 200 dpi, 150 dpi
  • 35. Resolution. Conclusion: at 150 dpi, small files and most text still perfectly legible. But is it wise to take this as the starting point when digitizing? Lower resolution also means lower structural management costs. In a few years it may be possible to rescan with better technology. But it is not sufficient if in the future we want to deliver higher quality on the basis of these images, apply OCR and/or convert to better compression and file formats. The choice depends on objectives, resources and volumes
  • 36. Color. Color depth: bits and bytes. A pixel is a cell with a single color. The smallest unit of a digital file is a bit: it has the value 0 or 1. When a pixel consists of 1 bit, that pixel can have the value black (0) or white (1). If we want to be able to define more colors for a pixel, we have to increase the number of bits per pixel. With 8 bits (each of which can take the value 1 or 0), 256 combinations, and thus colors, are possible (for example 0 0 0 1 0 0 1 1). Most cameras use 8 bits per color channel (24 bits in total). This allows 16,7 million colors
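The relation between bit depth and the number of representable colors on slide 36 is simply a power of two. A minimal sketch (the function name is illustrative):

```python
def colors(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel of the given bit depth can take."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 24):
    print(f"{bits:>2} bits: {colors(bits):,} colors")

# The example bit pattern from the slide, read as one 8-bit value.
print(int("00010011", 2))  # 19
```

Each extra bit doubles the number of possible values, which is also why bit depth feeds directly into file size.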
  • 37. 24 bits (8 bits per color channel); 8 bits, grayscale; 1 bit, black and white
  • 38. Compression. A method by which the information can be described more efficiently; the file size decreases. Encoding words: Peer Spel Spel Spel Spel Peer Peer Spel Spel Spel Peer Peer. Storage: 48 letters. P = Peer, S = Spel
  • 39. Compression. Result: P S S S S P P S S S P P. Storage: 12 letters (plus coding table). P = Peer, S = Spel
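The word-coding example of slides 38 and 39 can be reproduced in a few lines. This is only a sketch of the idea on the slides, not of any real image codec; using the first letter as the code works here because the two words start with different letters.

```python
words = "Peer Spel Spel Spel Spel Peer Peer Spel Spel Spel Peer Peer".split()

# Coding table: every distinct word is replaced by its first letter.
table = {word: word[0] for word in dict.fromkeys(words)}   # {'Peer': 'P', 'Spel': 'S'}
encoded = [table[word] for word in words]

# Decoding needs the table as well, which is why it has to be stored too.
decode_table = {code: word for word, code in table.items()}
decoded = [decode_table[code] for code in encoded]

print(sum(len(word) for word in words))  # 48
print(len(encoded))                      # 12
print(" ".join(encoded))                 # P S S S S P P S S S P P
print(decoded == words)                  # True
```

The round trip returns exactly the original sequence, so this scheme is an instance of lossless compression.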
  • 40. Compression. Two kinds of compression: A. Lossless (exactly reversible). No information is lost. Compare it with a pillow from which you press out all the air before packing it. Take the pillow out of the packaging and it becomes exactly the pillow it was before packing. B. Lossy (not exactly reversible). Certain information is thrown away. Again we press the air out of the pillow, but because we want an even smaller package we also remove a few feathers. This need not be a problem, because missing a few feathers may well not make the pillow any less comfortable in use. However, feathers that have been thrown away will not be added back when the pillow is unpacked again
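The difference between the two kinds of compression can be illustrated with Python's standard `zlib` module. The pixel data is made up, and the "lossy" step (dropping the four least significant bits of every value) is a deliberately crude stand-in chosen for this sketch; it is not how JPEG or JPEG 2000 work.

```python
import zlib

# Made-up row of 8-bit grey values standing in for scan data.
pixels = bytes((x * 7 + (x % 5)) % 256 for x in range(4096))

# Lossless: after decompression the data is bit-for-bit identical.
packed = zlib.compress(pixels)
print(zlib.decompress(packed) == pixels)   # True

# Lossy (crude illustration): discard the 4 least significant bits of each
# value before compressing. The discarded bits cannot be recovered.
quantized = bytes(value & 0xF0 for value in pixels)
packed_lossy = zlib.compress(quantized)
restored = zlib.decompress(packed_lossy)
print(restored == pixels)                  # False
print(restored == quantized)               # True

print(len(pixels), len(packed), len(packed_lossy))
```

In the pillow analogy, `quantized` is the pillow with a few feathers removed: unpacking gives back exactly what was packed, but not what existed before the feathers were taken out.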
  • 41. Compression and loss of information. An often-heard claim: do not use lossy compression when storing images, because lossy compression causes loss of information. Lossy compression does indeed cause loss of information, but that does not by definition mean loss of meaningful information. In any case it is better to say: loss of information relative to the uncompressed file. Scanning is, after all, inextricably bound up with loss of information relative to the original, even when lossless compression is applied
  • 42. Lossy compression. Examples: JPEG quality 10 (300 dpi); JPEG quality 12 (300 dpi); JPEG quality 4 (300 dpi); JPEG quality 4 (200 dpi); JPEG 2000, part 6
  • 43. Compression and sustainability. An often-heard claim: compressed files have a greater chance of becoming corrupt than uncompressed files, so no data compression may be applied. Research has shown that this claim is not correct. Another approach to preservation: redundancy in storage. Compressed files in particular lend themselves well to this
  • 44. We store. Scans with a file size as small as possible. Resolution, compression and legibility: an example. 300 dpi, high-quality JPEG; 200 dpi, low-quality JPEG
  • 45. We store. Scans with a file size as small as possible. Comparison between file format, compression, resolution and file size:
      Format     Compression   Type       Color     Resolution   Avg file size   500.000 scans   %
      TIFF       No            ---        24 bits   300 dpi      22,1 MB         11 TB           100%
                 Qua (ps) 10   Lossy      24 bits   400 dpi      3,3 MB          1,6 TB          15%
      JPEG       Qua (ps) 12   Lossy      24 bits   300 dpi      7,5 MB          3,7 TB          34%
                 Qua (ps) 10   Lossy      24 bits   300 dpi      2,1 MB          1,1 TB          10%
                 Qua (ps) 4    Lossy      24 bits   200 dpi      255 KB          124 GB          1,1%
      JPEG2000   Part 1        Lossless   24 bits   300 dpi      12 MB           6 TB            55%
                 Part 6        Lossy      24 bits   300 dpi      120 KB          59 GB           0,5%
  • 46. TIFF uncompressed (comparison table of slide 45 repeated)
  • 47. JPEG (psd) 10 (comparison table of slide 45 repeated)
  • 48. JPEG (psd) 4 (comparison table of slide 45 repeated)
  • 49. JPEG2000 lossless (comparison table of slide 45 repeated)
  • 50. We store. Scans with a file size as small as possible. Comparison of storage costs. Storage of 500.000 images; average size per scan uncompressed = 22,1 MB. Price rate: 1 TB of storage in a controlled e-repository environment at two separate locations, including IT costs: $ 7.000 (NLD, Nov 2009).
      File format              Storage   Costs 1 year   Costs 10 years
      TIFF uncompressed        11 TB     $ 77.000       $ 770.000
      JPEG 10                  1,1 TB    $ 7.700        $ 77.000
      JPEG 4 (200 dpi)         124 GB    $ 868          $ 8.680
      JPEG 2000 (part 1, ll)   6 TB      $ 42.000       $ 420.000
      (File) size still does matter!
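The cost figures on slide 50 follow from multiplying the storage volume by the quoted rate. The sketch below assumes that the $ 7.000 rate is per terabyte per year, which is what the one-year column implies; the names are illustrative.

```python
PRICE_PER_TB_PER_YEAR = 7_000   # USD; assumed to be a yearly rate

# Storage volumes in TB for 500.000 scans, as listed on the slide.
storage_tb = {
    "TIFF uncompressed": 11,
    "JPEG 10": 1.1,
    "JPEG 4 (200 dpi)": 0.124,
    "JPEG 2000 (part 1, lossless)": 6,
}


def storage_cost(terabytes: float, years: int = 1) -> float:
    """Storage cost in USD for the given volume and number of years."""
    return terabytes * PRICE_PER_TB_PER_YEAR * years


for file_format, terabytes in storage_tb.items():
    one_year = storage_cost(terabytes)
    ten_years = storage_cost(terabytes, years=10)
    print(f"{file_format:<30} {one_year:>10,.0f} {ten_years:>10,.0f}")
```

The uncompressed volume itself comes from 500.000 scans × 22,1 MB, which is roughly 11 TB. Over ten years the gap between uncompressed TIFF and low-quality JPEG is two orders of magnitude.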
  • 51. We Do. Developing the reproduction process. A streamlined, standardized process is indispensable when digitizing on a large scale. Projects with different goals, document types and partners take place at the same time. Guidelines and best practices often take no account of these complex factors and the number of scans to be produced. We developed a process in which large scale and flexibility are starting points. All digitization projects follow this process
  • 52. We Do. Developing the reproduction process. For all projects, at any moment, it has to be clear: what the current status is of each unit to be digitized; where each unit can be located; what current and succeeding tasks are to be performed on each unit. This calls for workflow management with a user-friendly application. We developed a simple but effective workflow application in-house
  • 53. We Do. Developing the reproduction process. In the following slides we focus on the weekly production of 10.000 scans in the digitizing-on-request service
  • 54. We Do. Production of 10.000 scans on a weekly basis. 1. Requesting digitization. All public files can be requested for digitization via the finding aids in the Archiefbank, just by clicking the “digitize” button
  • 55. We Do. 2. Providing order numbers. A unit to be digitized must be identifiable at each step of the handling process. The units therefore get a unique, meaningless order number. The order number is provided by the metadata management system and is the basis for: communication with the digitizer; scanning; assigning filenames; registration of filenames; billing by the digitizer. In practice: all units to be digitized get an order ticket
  • 57. We Do. 3. Assessing the originals. The workflow system generates a list of all originals to assess from the repositories. The list is sorted by repository / shelf to make retrieval efficient
  • 58. We Do. 4. Checking the originals. All assessed originals are stored in a special room. In this room all checks are executed
  • 59. We Do. 4. Checking the originals. Information about the originals in our management systems is not always complete. A rough check of the originals takes place. A. Content: copyrights; publicity; privacy. If an item falls into one of these categories the request is rejected. B. Condition of the material: items that are in such a condition that digitizing or transport could cause damage, or that are packaged in a way that makes scanning in conventional set-ups impossible, do not qualify for the standard way of digitization
  • 61. We Do. 4. Checking the originals. Material preparation is kept to a minimum. We do: staples are removed as a rule; small repairs are carried out by our restoration employees. We don’t: the sequence of the originals as found in the repository is not checked or altered; the originals are not numbered
  • 62. We Do. Not numbering the originals. Numbering the originals has one advantage: the completeness of the scans (compared to the originals) can be guaranteed. But this is only true when the numbering tallies exactly, because: a missing number in a sequence of scans leads to the conclusion that there is one original that has not been scanned; numbers that are assigned twice lead to illogical end numbers (100 scans: scan 100 has been numbered as 99). Experiments with numbering showed that in practice faultless numbering cannot be realized
  • 63. We Do. Not numbering the originals. Securing completeness can be realized by other means: comparing scans to originals 1:1 after digitization; scanning the originals twice (low quality: # scans = 365; high-quality master files: # scans = 365)
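One way to read the "scanning twice" approach of slide 63 is as a count comparison between the two passes. The sketch below assumes that scan counts per order number are available for both passes; the function name, the data layout and the second order number are hypothetical and not taken from the presentation.

```python
def completeness_report(reference_counts: dict[str, int],
                        master_counts: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return the orders whose two scan passes differ in number of scans."""
    orders = reference_counts.keys() | master_counts.keys()
    return {
        order: (reference_counts.get(order, 0), master_counts.get(order, 0))
        for order in sorted(orders)
        if reference_counts.get(order, 0) != master_counts.get(order, 0)
    }


# Hypothetical counts: the low-quality pass and the master files per order.
reference = {"A20758": 365, "A20759": 120}
masters = {"A20758": 365, "A20759": 119}
print(completeness_report(reference, masters))   # {'A20759': (120, 119)}
```

Matching counts do not prove that both passes captured the same pages, so a check like this would complement rather than replace the visual comparison.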
  • 64. We Do. 5. Transport. For secure transport, special flight cases are used
  • 65. We Do. 6. / 7. Scanning and assigning filenames. Filenames are the key between scans and metadata. As a rule filenames contain no meaningful information, because when the meaning changes, filenames should change too. After scanning, the scan operator or data manager has to assign filenames to the scans. It has to be perfectly clear which filenames these should be
  • 66. Assigning filenames at City Archives Amsterdam. Customer request, via the management systems: Archive 195, File 836, Order: A20758. Order ticket: range A20758000001 – A20758999999. Filename: 12 characters; the first 6 are the order number, the last 6 the serial number. Scanning the order: A20758000001, A20758000002, A20758000003, A20758000004, A20758000005. Scan report: A20758000001, A20758000002, A20758000003, A20758000004, A20758000005. Registration of filenames; import
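The naming scheme of slide 66 (a 6-character order number followed by a 6-digit serial number) can be sketched as follows. The function names and the validation rules are illustrative assumptions, not the archive's actual tooling.

```python
def filename_range(order_number: str) -> tuple[str, str]:
    """First and last filename reserved for an order."""
    return f"{order_number}{1:06d}", f"{order_number}{999_999:06d}"


def assign_filenames(order_number: str, scan_count: int) -> list[str]:
    """Filenames for the scans of one order, numbered from 000001."""
    if len(order_number) != 6:
        raise ValueError("order number must be 6 characters long")
    if not 0 <= scan_count <= 999_999:
        raise ValueError("scan count does not fit in a 6-digit serial number")
    return [f"{order_number}{serial:06d}" for serial in range(1, scan_count + 1)]


def split_filename(name: str) -> tuple[str, int]:
    """Split a 12-character filename into order number and serial number."""
    return name[:6], int(name[6:])


print(filename_range("A20758"))       # ('A20758000001', 'A20758999999')
print(assign_filenames("A20758", 5))
print(split_filename("A20758000004")) # ('A20758', 4)
```

Because the name carries only the order number and a counter, nothing in it has to change when the description of the file changes in the metadata.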
  • 67. We Do. 10. / 11. Checking scans and metadata. Scans and metadata are checked efficiently. Where possible checks are automated. An application from which all checks can be executed is in development. Basic checks:
      Check                  Method
      Viruses                Virus checker
      Data integrity         MD-5 checksum comparison
      File format validity   JHOVE
      Quality of scans       Visual check of reference scans; visual check of production scans
      Filenames              Script
      Completeness           Depends on project
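The data-integrity check in the table above (MD-5 checksum comparison) could look roughly like this with Python's standard library. The sketch assumes that the digitizer delivers a mapping of filename to expected checksum; that delivery format and the function names are assumptions, not something the slides specify.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(directory: Path, expected: dict[str, str]) -> list[str]:
    """Return the filenames that are missing or whose MD5 differs."""
    failures = []
    for name, checksum in expected.items():
        path = directory / name
        if not path.is_file() or md5_of_file(path) != checksum.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_checksums` means every listed file arrived unchanged; any name in the list is a file to request again from the digitizer before import.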
  • 68. After import the “order for digitization” of each unit is completed After approving of all checks, scans and metadata are imported into the management systems The imports are executed automatically, on basis of scripts and standard protocols for file transfer 13. 14. Import metadata and scans into management systems We Do
  • 69. We Do: 18. Import metadata into the website. The website is hosted at an external location. Metadata are uploaded to the web server by simple HTTP transfer, from any workstation at the archive, directly via the CMS of the website. For exchange of finding aids we use EAD. After import, the metadata are optimized for the search system.
  • 70. We Do: 17. Import scans into the website. Transport medium. The bandwidth of the internet connections at the archive is still too small for direct SFTP (or similar) upload of large quantities of scans to the web server. It seems likely that this will change in the near future. Until then, scans are transported on portable USB hard disks.
  • 71. We Do: 17. Import scans into the website. Import. After the hard disk has been connected to the server, the import process starts. Some basic checks are executed on the scans. Derivatives are made for the thumbnails and for the zoom / contrast functionality.
  • 72. We Do: Request completed. When both scans and metadata have been imported, an email is automatically sent to the person who requested the digitization. This email contains a link to the finding aid and thumbnails on the website. The requester can then decide whether or not to buy the scans. The happy customer:
  • 73. MARAC Conference, October 30, 2009. We Do: Request complete! When both scans and metadata have been imported, an email is automatically sent to the person who requested the digitization. This email contains a link to the finding aid and thumbnails on the website. The requester can then decide whether or not to buy the scans. The happy customer:
  • 74. Costs and income. Costs Archiefbank (2008): digitization on request € 140,000; web services € 52,000; digitization projects € 200,000. Income Archiefbank (2008): digitization on request € 100,000; project funding € 330,350; government € 40,000.
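
The filename scheme on slide 66 (a 6-character order number followed by a 6-digit serial number) and the scripted filename check on slide 67 can be illustrated with a short Python sketch. It is not the archive's actual script: the function names `make_filename` and `check_sequence` are hypothetical, and the assumption that an order number is always one capital letter plus five digits is inferred only from the single example `A20758`.

```python
import re

# Assumed layout, based on the example A20758000001:
# order number = one capital letter + five digits, serial = six digits.
ORDER_RE = re.compile(r"[A-Z]\d{5}")
FILENAME_RE = re.compile(r"([A-Z]\d{5})(\d{6})")


def make_filename(order_number: str, serial: int) -> str:
    """Build a 12-character filename from an order number and a serial number."""
    if not ORDER_RE.fullmatch(order_number):
        raise ValueError(f"unexpected order number: {order_number!r}")
    if not 1 <= serial <= 999_999:
        raise ValueError(f"serial out of range: {serial}")
    return f"{order_number}{serial:06d}"


def check_sequence(filenames, order_number):
    """Report gaps, duplicates and foreign names for one order.

    Returns (missing_serials, duplicate_names, foreign_names).
    """
    seen = set()
    duplicates = []
    foreign = []
    for name in filenames:
        match = FILENAME_RE.fullmatch(name)
        if match is None or match.group(1) != order_number:
            foreign.append(name)  # malformed, or belongs to another order
            continue
        serial = int(match.group(2))
        if serial in seen:
            duplicates.append(name)
        seen.add(serial)
    highest = max(seen, default=0)
    missing = [s for s in range(1, highest + 1) if s not in seen]
    return missing, duplicates, foreign
```

A gap in the serials only shows that a filename is missing from the sequence. Whether every original has actually been scanned still has to be established by the completeness checks described on slide 63.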

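Slide 67 lists MD5 checksum comparison as the method for checking data integrity. A minimal Python sketch of such a comparison is given here, assuming the expected checksums are available as a mapping from filename to MD5 hex digest. The function names, the manifest format and the `.tif` extension in the usage note are assumptions for illustration, not details from the slides.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(expected, directory):
    """Compare files in `directory` against expected MD5 digests.

    `expected` maps filename -> MD5 hex digest (hypothetical manifest format).
    Returns a list of (filename, problem) tuples; an empty list means all match.
    """
    problems = []
    for name, expected_md5 in sorted(expected.items()):
        path = Path(directory) / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif md5_of_file(path) != expected_md5.lower():
            problems.append((name, "checksum mismatch"))
    return problems
```

For example, `verify_transfer({"A20758000001.tif": "<digest recorded before transport>"}, "/path/to/import")` would return an empty list if the file arrived unchanged. MD5 is adequate here for detecting accidental corruption during transport; it is not a safeguard against deliberate tampering.
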
Editor's Notes

  1. I will take you a step deeper into the work process of creating large numbers of scans. I’ll tell you about our starting points and the choices we have made, and I’ll show you the results of some research we have done, particularly into image quality and file size. I’ll also show you some back-office and front-office tools from our website.