On the two sides of the pond
1. On the Two Sides of the Pond
By Hans-Jörg Lieder,
Head of the Department of Bibliographic Services – Union Catalogue of Serials
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz;
Dr. Katalin Radics,
Distinguished Librarian; Librarian of the West European Collections and Classics
Young Research Library, University of California, Los Angeles
4. Newspapers on the way to discoloring and disintegration
Storage facility of the University of California Libraries on
the UCLA campus
5. - Leaflets 13” x 18.5” (33 cm x 47 cm)
- Imprint indicating the title, date, the number of the issue; warning
- Published four or five times a day
6. UCLA stamps including receiving dates
Packed in wrapping paper, probably after 1940; packages of 700-800 sheets
No documentation (ordering or receiving records) in the library archives; no correspondence
Normal serial subscription scheme (?)
Very minimal cataloging record – very low use
7. Towards a Weeding Decision
Brittle condition
Check for other holdings in California, US and World libraries
OCLC – no other holdings at the time of checking
Nine 1938 issues at BNF
No holding at the German National Library (Deutsche Nationalbibliothek)
Contact with head of Zeitungsabteilung, Staatsbibliothek – no holding in Germany
UNIQUE!!!
Decision: keep and preserve the UCLA holdings.
8. Keep and Preserve
9600 pages
1936-1940 with gaps
Acid-free boxes
The most fragile pages in mylar
10. Title: Deutsches Nachrichtenbüro. 5 Jahrg., Nr. 1581, 1938 October 1, Erste Morgen-Ausgabe
Alt ID: 3813183_1938-10-01_1581 [Local]
AltTitle: Erste Morgen-Ausgabe [Descriptive]; Deutsches Nachrichtenbüro [Descriptive]
Date: October 1, 1938 [Publication]; 1938-10-01 [Normalized]
Format: 1 p. [Extent]
Language: ger
Name: University of California, Los Angeles. Library. Dept. of Special Collections [Repository]
Type: newspapers [Genre]; text [Type Of Resource]
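The Alt ID in the record above appears to combine a record number, the normalized issue date and the issue number. A minimal Python sketch of splitting such an identifier follows. The pattern is an assumption inferred from this single example, and the names `IssueId`, `record_id` and `parse_alt_id` are hypothetical, not part of the UCLA system.

```python
import re
from datetime import date
from typing import NamedTuple


class IssueId(NamedTuple):
    record_id: str      # leading number (hypothetical field name)
    issue_date: date    # normalized publication date
    issue_number: int   # running issue number ("Nr.")


# Assumed layout: <record>_<YYYY-MM-DD>_<issue number>,
# inferred only from the example "3813183_1938-10-01_1581".
_ALT_ID = re.compile(r"^(\d+)_(\d{4})-(\d{2})-(\d{2})_(\d+)$")


def parse_alt_id(alt_id: str) -> IssueId:
    """Split an Alt ID into its parts, or raise ValueError if it does not fit."""
    match = _ALT_ID.match(alt_id)
    if match is None:
        raise ValueError(f"unexpected Alt ID format: {alt_id!r}")
    record, year, month, day, number = match.groups()
    return IssueId(record, date(int(year), int(month), int(day)), int(number))


issue = parse_alt_id("3813183_1938-10-01_1581")
print(issue.issue_date.isoformat(), issue.issue_number)
```

Parsing the date into a real `date` value rather than keeping it as a string means malformed identifiers fail early.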
11. Digitized copies: part of UCLA Digital Library at
http://digital2.library.ucla.edu/ -- freely accessible
Searchable only by date
More sophisticated searching capability needed – day by day chronicle of the
Third Reich for a short period of time
-events
-names
-institutions etc.
Deutsches Nachrichtenbüro – December 5, 1933
Network of 36 local services (Landesdienste)
12. Indexing needed
Fraktur – major problem
Transliteration into Latin characters
OCR (Optical Character Recognition) – has to be done in Germany
Looking for a German partner
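One small part of the transliteration problem can be sketched in Python. OCR output from Fraktur print can retain historical glyphs such as the long s (ſ), which ordinary keyword searches will not match. The sketch below assumes the OCR text is already Unicode and uses the standard library's NFKC normalization to map such characters to their modern equivalents. The function name `normalize_fraktur_text` is hypothetical, and this is not the method used in the project.

```python
import unicodedata


def normalize_fraktur_text(text: str) -> str:
    """Map compatibility characters (e.g. long s) to their modern forms.

    NFKC normalization replaces U+017F (long s) with a plain "s" and
    keeps precomposed umlauts such as "ü" intact.
    """
    return unicodedata.normalize("NFKC", text)


print(normalize_fraktur_text("Erſte Morgen-Ausgabe"))
```

This only handles character-level normalization. Recognizing the Fraktur typeface itself remains the job of the OCR engine.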
14. … but who are “we”?
• Project: Europeana Newspapers: http://www.europeana-newspapers.eu/
• 18 partners from 12 countries
• Tasks:
• Provide OCR for 18 million pages
• Provide OLR for 2 million pages
• Provide NER experimentally in assorted languages
• Provide best practice recommendations for newspaper metadata
• Provide quality prediction tools
• Aggregate content and make it available to TEL and Europeana
OCR = Optical Character Recognition
OLR = Optical Layout Recognition
NER = Named Entity Recognition
15. A Dance of Acronyms:
UCLA, SBB and CCS
UCLA sent data on hard drive
SBB
• Checked data for correctness and moved images into directory
structure
• Sent data to CCS in Hamburg for OCR and OLR
CCS (Content Conversion Specialists)
• Created full texts per article
• Stuck data in NZ web service for preliminary presentation purposes
SBB
• Will perform QA of OCR and OLR results
• Will provide all data to UCLA for further use
• Will present data in ZEFYS, its own newspaper portal; to the
Deutsche Digitale Bibliothek; to TEL (The European Library) and to
Europeana
16. Layout and structure analysis
recognition of words, text lines, text blocks,
columns and classification of text blocks,
illustrations, advertisements, tables and the
following page types:
- title page (the title page of an issue)
- content page (a page that consists of content/text only)
- illustration page (a page that has at least one illustration)
- advertisement page (a page that contains adverts only)
Structure analysis through classification of
headlines and grouping of zones into articles
(incl. article continuation)
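The page types listed above can be expressed as a small rule set. The following Python sketch is an illustration only: the block labels, the `is_issue_title_page` flag, the order of the rules and the fallback for mixed pages are all assumptions, not the classification logic of the actual OLR software.

```python
from typing import Iterable


def classify_page(block_types: Iterable[str],
                  is_issue_title_page: bool = False) -> str:
    """Assign one of the four page types from the kinds of blocks on a page.

    block_types holds labels such as "text", "illustration" or
    "advertisement" (hypothetical label names).
    """
    kinds = set(block_types)
    if is_issue_title_page:
        # Assumed to be known from the issue structure, not from the blocks.
        return "title page"
    if "illustration" in kinds:
        # At least one illustration on the page.
        return "illustration page"
    if kinds and kinds <= {"advertisement"}:
        # Adverts only.
        return "advertisement page"
    # Assumption: everything else, including mixed text and adverts,
    # falls back to a content page.
    return "content page"


print(classify_page(["text", "illustration"]))
```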
17. ENP OLR workflow | Conversion without scanning
[Workflow diagram. Locations: Material location, Conversion facility. Steps shown: Digital Image, Metadata, Delivery, Inspection, Automatic QA, Reject, Conversion, MD Recording, Doc Delivery, Digital Object, Return]
18. Quality assurance
@ CCS | Automated markup and basic manual correction:
- headlines, illustrations, tables, captions, advertisements, etc.
- article segmentation and grouping of zones into articles (incl. continuation)
@ Content Provider (Library)
Recommended:
- Zoning: correct classification of blocks as „text“ or „illustration“
- Article segmentation: correct identification of headlines/text blocks/captions
- Grouping: correct grouping of blocks (text, illustration) to articles
- Metadata: correct title, issue date and issue number
Optional:
- Page types: correct page types
- Page numbers: correct page sequence
- OCR: perform text correction of specific zones (e.g. headlines, captions)
19. Output | METS/ALTO package
METS/ALTO metadata schemas to describe the structured digital output object
A newspaper issue processed in docWorks is converted into one METS XML
file. It reflects the whole physical and logical structure, manages all links to the
image files and the related ALTO XML files. ALTO is based on a standardized
page description schema and contains all information of a page (print space,
margins, coordinates, OCR results).
Benefits of structural markup:
- better browsing and more precise text search
- better access and display on tablet and mobile devices
- automated article classification and clustering through data/text mining and
linguistic technologies
- user engagement for manual online text correction, article classification,
annotation, building personal collections, etc.
- sharing articles via social media platforms like Facebook, Twitter, etc.
_______________
METS = Metadata Encoding and Transmission Standard
ALTO = Analyzed Layout and Text Object
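To illustrate how the OCR results stored in an ALTO file might be read back, here is a minimal Python sketch using only the standard library. The sample XML is a made-up fragment for illustration, not output from docWorks or from this project. It uses the standard ALTO element names (`TextLine`, `String` with a `CONTENT` attribute), and tags are matched by local name so the sketch does not depend on a particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Erste" HPOS="10" VPOS="10" WIDTH="40" HEIGHT="12"/>
            <SP/>
            <String CONTENT="Morgen-Ausgabe" HPOS="55" VPOS="10" WIDTH="110" HEIGHT="12"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return the recognized text of every TextLine, words joined by spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

The same `String` elements also carry the word coordinates (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`), which is what makes it possible to highlight search hits on the page image.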