Digitization: Image Capture and Workflow
Martin Kalfatovic, Keri Thompson & Connie Rinaldo
Collection Management Cycle
Selection → Refinement → Digitization → Curation → Use → Selection
• Communication
• Workflow has become more complicated
• Difficulty finding books that are easy to scan
• Reviewing titles in copyright takes time
• Fragile books need repair
• The same amount of work, but a different kind
Upload a spreadsheet of the titles you plan to scan. Include OCLC number, title, volume number, author, publisher, and date.
The tool tries to find matches in other submitted spreadsheets.
Lesson: your metadata is always worse than you think it is.
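The matching step can be sketched roughly as follows. This is only an illustration, not the actual tool: the row keys (`oclc`, `title`), the 0.9 threshold, and the sample rows (including the OCLC numbers) are all made up for the example. It tries an exact OCLC number match first and falls back to fuzzy title comparison, since matching works best against numbers.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title and keep only letters, digits, and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def find_matches(row, others, threshold=0.9):
    """Return (other_row, reason) pairs that look like the same title.

    Rows are dicts with the hypothetical keys "oclc" and "title".
    """
    matches = []
    for other in others:
        # An exact OCLC number match is the most reliable signal.
        if row.get("oclc") and row.get("oclc") == other.get("oclc"):
            matches.append((other, "oclc"))
            continue
        # Otherwise fall back to fuzzy comparison of normalized titles.
        score = SequenceMatcher(
            None, normalize(row["title"]), normalize(other["title"])
        ).ratio()
        if score >= threshold:
            matches.append((other, "title"))
    return matches


# Made-up sample data; the OCLC numbers are placeholders.
planned = {"oclc": "1234567", "title": "Annales de l'Institut Pasteur"}
submitted = [
    {"oclc": "1234567", "title": "Ann. Inst. Pasteur"},
    {"oclc": "", "title": "Annales de l'Institut Pasteur."},
    {"oclc": "7654321", "title": "Journal of Botany"},
]
results = find_matches(planned, submitted)
for other, reason in results:
    print(reason, "->", other["title"])
```

The abbreviated title in the first submitted row would be missed by title comparison alone, which is one reason a shared identifier is worth including in every spreadsheet.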
GEMINI: A Critical Tool
• Title, volumes needed
• Which library has which volumes, additional information
• Conversation about which volumes need to be scanned
Capture: Scanning
• Purpose: to provide an accurate digital representation of the original object
• One page per image
• (except field notebooks: 2 pages per image)
• No image editing
• Reuse existing metadata
• in the library catalog
• from other sources (BioStor etc.)
Capture: Scanning
• Most BHL US/UK libraries use the Internet Archive (IA) to scan books
• Some shared funds; one contract for all of BHL
• Open Access, nonprofit
• Inexpensive services
• Each member library has its own workflow
• Members provide basic metadata from their library catalogs
• In-house digitization, or contracting with another vendor
• MACAW
• Scan books* from cover to cover, one image per page
• * A book (also called a "volume" or "item") is a physical unit, not an intellectual unit: one book may contain multiple articles or a single monograph
[Figure: a book, labeled cover, "good stuff", cover]
Partial replication in Alexandria, Egypt
Secondary backup at the Smithsonian, including TIFFs of volumes scanned in-house (SIL)
~90 TB
Primary file storage and "staging area" at the Internet Archive in San Francisco, USA
In-house scanning
• Images scanned by the library or another vendor
• Metadata harvested via Z39.50
• Additional metadata for the item and pages entered by library staff using Macaw software (mimics IA's biblio software)
Smithsonian Libraries uses 2 Phase One systems:
• P65+ 60 MP camera on a copy stand
• BC100 with dual 40 MP cameras
• CaptureOne software
For folios (> 36 cm) and fragile books
Exception: Field Notebooks Project (Smithsonian Archives): 2 pages per image for notebooks; flatbed scanner for letters
Capture: Harvest
• Automated, scheduled tasks
• Books already in the Internet Archive
• subject terms
• library "call numbers"
• BioStor/articles
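As a rough illustration of harvesting subject terms and call numbers, the sketch below reads MARC fields 050 and 090 (call number) and 650 (subject headings) from a MARCXML record, the fields named in the notes for this slide. It is not the BHL harvester: the sample record and the helper name `field_values` are made up for the example, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="050" ind1=" " ind2="4">
    <subfield code="a">QH1</subfield>
    <subfield code="b">.A5</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Natural history</subfield>
    <subfield code="v">Periodicals</subfield>
  </datafield>
</record>"""


def field_values(record, tags):
    """Join the subfields of each datafield whose tag is in `tags`."""
    values = []
    for datafield in record.findall("marc:datafield", MARC_NS):
        if datafield.get("tag") in tags:
            parts = [
                sf.text.strip()
                for sf in datafield.findall("marc:subfield", MARC_NS)
                if sf.text
            ]
            values.append(" ".join(parts))
    return values


record = ET.fromstring(SAMPLE)
call_numbers = field_values(record, {"050", "090"})
subjects = field_values(record, {"650"})
print(call_numbers)
print(subjects)
```

Because not all books in the Internet Archive have MARC records, a real harvest would also need to handle items where both lists come back empty.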
CURATION
Interface for staff to edit records and put serial volumes in order.
Curation includes adding and editing metadata for books, merging records and authors, removing volumes that are outside the scope of the collection, and rescanning books with errors.
MACAW: MetadatA Collection And Workflow
A Critical Tool
• Lets staff enter page-level metadata such as page number and page type (picture, text, etc.)
• Creates XML files to upload to IA
• Replicates functionality of the Internet Archive's software
• Installed on a shared SI server for partners to use
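To make the idea of page-level metadata stored as XML concrete, here is a minimal sketch. The element and attribute names (`item`, `page`, `pageType`, `pageNumber`, `sequence`) are invented for illustration; they are not Macaw's or the Internet Archive's actual schema, and the page records are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical page records, as staff might enter them.
pages = [
    {"sequence": 1, "type": "Cover", "number": ""},
    {"sequence": 2, "type": "Text", "number": "1"},
    {"sequence": 3, "type": "Illustration", "number": "2"},
]


def build_page_xml(item_id, pages):
    """Build an XML tree with one <page> element per scanned image."""
    root = ET.Element("item", {"id": item_id})
    for page in pages:
        el = ET.SubElement(root, "page", {"sequence": str(page["sequence"])})
        ET.SubElement(el, "pageType").text = page["type"]
        # Unnumbered pages (covers, plates) simply omit the page number.
        if page["number"]:
            ET.SubElement(el, "pageNumber").text = page["number"]
    return root


root = build_page_xml("example-item", pages)
print(ET.tostring(root, encoding="unicode"))
```

Keeping the image sequence separate from the printed page number matters because the two rarely line up: covers and plates take up images but usually carry no printed number.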
Metadata
• "Title": MARC record from the library catalog, transformed into MARCXML and MODS
• "Volume": information from the catalog or entered by humans, stored in an XML file
• "Segment" (article): information entered by humans or taken from BioStor etc. (after scanning)
• "Page": metadata entered by humans, stored in the XML file that provides structure to the digital object
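The four levels above nest naturally: a title has volumes, and a volume has pages plus segments (articles) that span a range of pages. The sketch below models that nesting with dataclasses. The class and field names are assumptions chosen for the example, not BHL's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    sequence: int            # position of the image in the scan
    page_number: str = ""    # printed page number, if any
    page_type: str = "Text"  # e.g. "Text", "Illustration", "Cover"


@dataclass
class Segment:
    title: str               # article title
    first_sequence: int
    last_sequence: int


@dataclass
class Volume:
    label: str               # volume designation, e.g. "v.1"
    pages: List[Page] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)

    def pages_in(self, segment: Segment) -> List[Page]:
        """Return the pages that fall inside a segment's range."""
        return [
            p for p in self.pages
            if segment.first_sequence <= p.sequence <= segment.last_sequence
        ]


@dataclass
class Title:
    name: str                # taken from the catalog's MARC record
    volumes: List[Volume] = field(default_factory=list)


# Made-up example: a four-image volume containing one article.
volume = Volume(
    label="v.1",
    pages=[Page(1, page_type="Cover"), Page(2, "1"), Page(3, "2"), Page(4, "3")],
    segments=[Segment("Example article", first_sequence=2, last_sequence=3)],
)
title = Title(name="Example serial", volumes=[volume])
article_pages = volume.pages_in(volume.segments[0])
print([p.page_number for p in article_pages])
```

Modeling a segment as a page range inside a physical volume reflects the point made earlier: the scanned unit is physical, and intellectual units such as articles are layered on afterwards.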
Add page-level metadata, such as page numbers or article titles
• Other files derived from Internet Archive processes
– PDF
– DjVu (OCR text: .txt and .xml)
– ePub/DAISY/Kindle
• Other files created by BHL processes
– Taxonomic names
– OCR text
– BHL METS
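Once taxonomic names are stored per page, they can be inverted into a list of pages per name, which is the basis of a "species bibliography". A minimal sketch under assumed inputs: the `(item, page)` keys and the sample names-per-page data are made up, and name finding itself is assumed to have already happened.

```python
from collections import defaultdict

# Made-up sample: names found on each (item id, page sequence).
page_names = {
    ("item-1", 12): ["Apis mellifera", "Bombus terrestris"],
    ("item-1", 13): ["Apis mellifera"],
    ("item-2", 5): ["Bombus terrestris"],
}


def species_bibliography(page_names):
    """Invert a page -> names mapping into name -> sorted list of pages."""
    index = defaultdict(list)
    for page, names in page_names.items():
        # set() guards against a name being listed twice for one page.
        for name in set(names):
            index[name].append(page)
    return {name: sorted(pages) for name, pages in index.items()}


bibliography = species_bibliography(page_names)
for name in sorted(bibliography):
    print(name, bibliography[name])
```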
Discovering and storing species names associated with pages allows the creation of
"species bibliographies," EOL.org connections, GBIF connections
Users can (and do!)
• Report technical problems
• Request new functionality
• Report data errors
• Request scanning of specific titles
Gemini
[Screenshot: a scanning request in Gemini. Labels: requestor; title, volumes needed; which library has which volumes, additional information; assigned to Cornell University]
As far as we know, responding to user requests is rare in the digital library world.
Smithsonian Libraries Workflow
[Diagram. Systems: database, library catalog, Macaw, Internet Archive, BHL. Steps: tracking & shipping; scanning & metadata harvesting; create page metadata; create derivatives; transform & package; move & de-duplicate; MARC → MARCXML; species names; URL to BHL into MARC record; quality control (% sample)]
• Thank you!
Serial Gemini workflow

Digitization: Image Capture and Workflow - Constance Rinaldo


Editor's Notes

  • #3 [1 min] To me, collection management is a continuous cycle of pre-digitization and post-digitization workflows. Getting the content scanned is one thing; managing the content after it has been scanned is just as important. You’ll notice that our users play a key role in the cycle.
  • #4 At the start of the project, we tried to scan as much as possible as fast as possible (“feed the beast”), going for the low-hanging fruit. As the project matures, it becomes more difficult to find material to scan that is in good condition, in the public domain, or even on the shelf! We hired a full-time in-house scanner to do folios and rare, fragile material. Most staffing, like scanning, is funded by grants. It is not permanent, which means it is not truly programmatic infrastructure.
  • #5 18 plus institutions, 30 plus people, 4 plus time zones
  • #7 Workflow has become more complicated. Difficulty finding books that are easy to scan. Copyright review takes time. Fragile books need repair. Same quantity of work but of a different type; slower collection growth.
  • #8 Upload a spreadsheet of titles you plan to scan. Include OCLC number, title, volume number, author, publisher, date. The tool tries to find matches in other submitted spreadsheets. Problem: it does not match against BHL in real time, so you still must check BHL to be sure, and that doesn’t always happen. Problem: the fuzzy matching algorithm is not that great. It works best against numbers such as the OCLC number (OCLC? WorldCat? A union catalog for libraries). Lesson: your metadata is always worse than you think it is.
  • #9 We repurposed a generic “issue tracking” system to do many things: track requests for scanning, track titles libraries plan to scan (serial volumes), track metadata error reports, and track website bugs. The comment trail can be very long. It is a conversation rather than a database, which is confusing to database people (me) but shows the history of selection. The selection refinement process can take a long time!
  • #12 Some background: most BHL US/UK libraries use the Internet Archive as our scanning “vendor” (partner). This was part of the original BHL formation and grant agreement with MacArthur. IA was chosen because it is committed to Open Access, is a nonprofit, and offers low-cost services that cover more than just digitization. Members can also do their own scanning or contract with another vendor, but all scans must be “staged” at the Internet Archive. Members provide basic metadata from their library catalogs.
  • #13 The decision to scan physical units of books is based on the limitations of available library data. Libraries typically assign data at the “title” level, with maybe some data about individual volumes of a serial. The workflow is designed around scanning physical books. We are working on incorporating born-digital publications. The focus is on the information content of the book rather than the book as historical object.
  • #14 To reiterate, for BHL: IA uses Petaboxes; SI uses Isilon. Total BHL storage is currently ~90 TB. It is so low because IA supplies compressed JP2s, and we store them in a .zip file.
  • #15 Images scanned by the library or another vendor. Metadata harvested via Z39.50. Additional metadata for items and pages entered by library staff using Macaw software (mimics IA biblio software).
  • #16 Scanned by the library or another vendor. Smithsonian Libraries uses 2 systems: a P65+ 60 MP camera on a copy stand and a BC100 with dual-camera 40 MP scanning backs. CaptureOne image editing software. Macaw for extra metadata.
  • #17 Capture: Harvest. Automated, scheduled tasks. Analysis of MARCXML records in IA (not all books have MARC records) for 050 and 090 (call number) and 650 (subject headings). Books from the Internet Archive: subject terms, library “call numbers”, manually entered identifiers. Article citations: BioStor.
  • #19 Curation includes adding and correcting metadata for books, merging records and authors, removing volumes that are outside the scope of the collection, and rescanning books with errors. Example: title ID 3971. Edit the record for the item (from MARC). Edit the volumes attached to the title record: correct volume information, re-order volumes.
  • #21 4 levels of descriptive metadata (administrative and structural data are produced while scanning). “Title”: MARC record from the library catalog, transformed into MARCXML and MODS. “Volume”: information from the catalog or entered by a human, stored in an XML file. “Segment” (article): information entered by a human or taken from BioStor etc. “Page”: metadata entered by a human, stored in the XML file that provides structure to the digital object.
  • #22 Item 22379. Add article “segment” title and other information. “Paginate” = add page data.
  • #24 Run taxonomic intelligence to find names (this shows manual editing, but it is an automated process). Discovering and storing species names associated with pages enables the creation of “species bibliographies”, connections to EOL.org, spLink, and other useful tools.
  • #26 Won’t show portal functionality; save that for William tomorrow. Users are a big part of the data management process (collection management).
  • #27 Here is a request from Dr. Karl Siegert that BHL scan Annales de l’Institut Pasteur. As far as we know, responding to user requests is rare in the Digital Library world. MBLWHOI has this title, as does Cornell University