Central Registry for Digitized Objects Links Production and Bibliographic Control
1. Central Registry for Digitized Objects: Linking Production and Bibliographic Control
Ralf Stockmann
Göttinger Digitization Center
2. As things are now
• Huge ventures in
– Digitization
• Google
• Microsoft
• National programs
• Local centers
– Accessibility
• World Digital Library
• European Digital Library
• National portals
• Google Book Search
3. As things are now
• We are just at the dawn of mass digitization
– Leaving the manufacturing stage behind
– Entering industrialization
– Scanning robots
– Accessible full text (OCR)
4. Lack of …
• Coordination of digitization activities
– Who scans what, where, when, in which quality, and how will it be accessible?
• How is “quality” defined?
• Do we agree on “what”?
5. Facing the Consequences
[Chart: costs/value plotted against the number of digitized items per volume; technical improvements reduce costs, the additional benefit of each further copy diminishes, and duplicated digitization is a waste of resources]
6. The Solution
• Central registry for digitized objects
• Focused on the production context (no user
frontend)
• API driven
– Application Programming Interface
– Query / Ingest
– Simple implementation into existing workflow-tools
• Batch mode (lists)
• Open Source / free service
• Matching on volume level
– Score / probability
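A workflow tool talking to such a registry might bundle its volumes into one batch query, as the slide suggests. The payload shape, field names, and `action` flag below are illustrative assumptions, not a published API:

```python
# Sketch of a batch query a workflow tool could send to the registry API.
# All field names and the payload layout are assumptions for illustration.
import json

def build_batch_query(volumes):
    """Wrap a list of volume descriptions into a single batch-mode payload."""
    return json.dumps({
        "action": "query",  # the API distinguishes query and ingest calls
        "items": [
            {
                "title": v["title"],
                "author": v.get("author"),
                "date": v.get("date"),
                "volume": v.get("volume"),  # matching happens on volume level
            }
            for v in volumes
        ],
    })

payload = build_batch_query([
    {"title": "Allgemeine Naturgeschichte", "author": "Kant", "date": "1755"},
])
```

The same payload shape could carry an `"action": "ingest"` call, keeping the integration into existing workflow tools simple.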
7. Implementation
[Architecture diagram: clients (present collections, running projects, notices of intent) send queries and ingests through the API; an aggregator/normalizer/mapping layer feeds the registry and metadata store, which also connects to backend services such as EROMM, EDL, and OCLC]
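The aggregator/normalizer/mapping layer could be sketched as a field mapping from a source-specific record onto the registry's common schema; the Dublin-Core-style source field names and the mapping table are assumptions for illustration:

```python
# Minimal sketch of the normalizer step: rename known source fields onto
# the registry schema and drop anything unmapped. The source field names
# (Dublin-Core style) and the mapping itself are invented examples.
FIELD_MAP = {
    "dc:title": "title",
    "dc:creator": "author",
    "dc:date": "date",
}

def normalize(source_record):
    """Map a heterogeneous source record onto the common registry schema."""
    return {target: source_record[src]
            for src, target in FIELD_MAP.items()
            if src in source_record}

rec = normalize({"dc:title": "Faust", "dc:creator": "Goethe", "dc:format": "pdf"})
```

In practice each backend source (EROMM, EDL, OCLC, …) would bring its own mapping table into this layer.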
8. Metadata Store
• Bibliographic → matching / score (“what”)
– Title
– Author
– Date
– Place of publication
– Number of pages (?)
– Language
– Print / format
– Edition
• Technical
– Resolution
– Color depth
– File type / compression
• Accessibility → additional judging (“who, where, which quality, how accessible”)
– Institution
– Persistent identifier
– Rights
– URL
• Status → decisive factor (“when”)
– Digitized
– In progress
– Intended (timeline?)
– Requested?
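The field groups above could translate into a record type along these lines; the class name, types, and status vocabulary are assumptions, not the project's actual model:

```python
# Hypothetical record for the metadata store, following the four field
# groups on this slide. Names, types, and defaults are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RegistryRecord:
    # Bibliographic: drives matching / scoring ("what")
    title: str
    author: Optional[str] = None
    date: Optional[str] = None
    place: Optional[str] = None
    pages: Optional[int] = None
    language: Optional[str] = None
    edition: Optional[str] = None
    # Technical: image quality
    resolution_dpi: Optional[int] = None
    color_depth: Optional[int] = None
    file_type: Optional[str] = None
    # Accessibility: "who, where, which quality, how accessible"
    institution: Optional[str] = None
    persistent_id: Optional[str] = None
    rights: Optional[str] = None
    url: Optional[str] = None
    # Status: the decisive factor ("when")
    status: str = "intended"  # digitized | in_progress | intended | requested
```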
9. Obstacles
• (open source) Tools for automated matching /
scoring?
• Interface for manual comparison / decision making
• Multivolume works: low rate of uniformity (near
50% of physical SUB stock before 1900)
• Unicode
• Transliteration tables
• Random bound books
• Reliable identifier
– ISBN for old books?
• Anticipated rate of accuracy: 50 – 70 %
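As a baseline for the automated matching/scoring tools this slide asks about, even the Python standard library's `SequenceMatcher` yields a usable similarity score; the field weights and threshold implied here are illustrative assumptions:

```python
# A naive matching/scoring baseline over title, author, and date.
# Weights (0.6 / 0.3 / 0.1) are illustrative assumptions, not a spec.
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted score over title/author/date; values near 1.0 suggest a duplicate."""
    score = 0.6 * similarity(rec_a["title"], rec_b["title"])
    score += 0.3 * similarity(rec_a.get("author", ""), rec_b.get("author", ""))
    date_a, date_b = rec_a.get("date"), rec_b.get("date")
    score += 0.1 * (1.0 if date_a and date_a == date_b else 0.0)
    return score

s = match_score(
    {"title": "Kritik der reinen Vernunft", "author": "Kant, Immanuel", "date": "1781"},
    {"title": "Critik der reinen Vernunft", "author": "Immanuel Kant", "date": "1781"},
)
```

Such a crude score would of course run into exactly the obstacles listed above (transliteration, multivolume works, random bound books), which is why a manual comparison interface is still needed.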
10. Appreciation of Values
• The goal is NOT to build a reliable database in terms of library standards
• But to prevent further waste of resources
• If we manage to achieve just 50% precision,
• we still save at least 50% of the funding!
11. Work Packages
• Define metadata model
• Set up database
• Implement mapping tools
• Define API calls
• Implement API
• Build some connectors to popular mass digitization workflow
tools (e.g. “Goobi”)
• Establish ISBN workflow
• Harvest existing sources
• Start with a community of actual projects
• Get some (!) funding
• Estimated schedule: 6 months